Custom Resource Definitions (CRDs)
This page provides detailed specifications for each ARK custom resource.
Resource Reference
| Resource | API Version | Description |
|---|---|---|
| Evaluator | ark.mckinsey.com/v1alpha1 | Deterministic or AI-powered assessment services for queries, agents, teams, and tools |
| Evaluation | ark.mckinsey.com/v1alpha1 | Assessments of AI outputs across multiple evaluation types |
| ExecutionEngine | ark.mckinsey.com/v1prealpha1 | External execution engines providing custom runtime environments |
Evaluators
Evaluators provide deterministic or AI-powered assessment of agents, teams, queries, and tools to support quality control and testing. For non-deterministic assessments, they use the “LLM-as-a-Judge” pattern to automatically evaluate agent responses.
How Evaluators Work
- Service Integration: Evaluators define assessment services that can be referenced by Evaluation CRDs
- LLM-as-a-Judge: Evaluators use AI models to assess response quality across multiple criteria
- Automatic Evaluation: Evaluators can use selectors to automatically create evaluations for matching queries
- Post-hoc Assessment: Evaluations analyze completed queries without affecting query execution
Specification
apiVersion: ark.mckinsey.com/v1alpha1
kind: Evaluator
metadata:
name: ark-evaluator
spec:
description: "evaluator for query assessment"
address:
valueFrom:
serviceRef:
name: ark-evaluator
port: "http"
path: "/evaluate"
selector: # optional - for automatic query evaluation
resourceType: Query
matchLabels:
evaluate: "true" # will automatically evaluate any query with label evaluate=true
parameters: # optional - including model configuration
- name: model.name # specify model to use (default: "default")
value: "gpt-4-model"
- name: model.namespace # specify model namespace (default: evaluator's namespace)
value: "models"
- name: min-score # custom parameter passed to the evaluation service
value: "0.8"
Auto-triggered Evaluation:
┌─────────────────────┐ ┌──────────────┐
│ Evaluator with │ Auto-triggers │ New Query │
│ selector: │ ◄───────────────── │ labels: │
│ matchLabels: │ when created │ evaluate: │
│ evaluate: true │ or modified │ "true" │
└─────────────────────┘ └──────────────┘
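For example, a Query carrying the matching label is picked up automatically and an Evaluation is created for it. The sketch below is illustrative: the label is the only part the selector above requires, while the spec fields (input, targets) are assumptions about the Query schema, which is documented separately:
apiVersion: ark.mckinsey.com/v1alpha1
kind: Query
metadata:
  name: weather-query
  labels:
    evaluate: "true" # matches the Evaluator selector above, so an Evaluation is created automatically
spec:
  input: "What's the weather in NYC?" # assumed field name; see the Query resource documentation
  targets: # assumed field name
    - type: agent
      name: weather-agent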
Key Fields
- address (service reference): Targets the evaluation service that performs the assessment.
- selector: Automatically matches queries to evaluate based on their labels.
- parameters: Passes default evaluation parameters to the target evaluation service.
Evaluations
Evaluations assess different metrics through several evaluation types, including direct assessment, baseline dataset comparison, and query result evaluation.
The current design follows the principle: one Evaluation = one Evaluator = one specific assessment.
Overview
Evaluations work with Evaluators to assess AI outputs, both deterministic and non-deterministic. Multiple evaluation modes can target different evaluators, and evaluators can automatically process evaluations based on label selectors.
Evaluation Flow:
┌─────────────┐ ┌─────────────┐ ┌─────────────────┐
│ Evaluation │◄─────│ Evaluator(s)│ │ Ark Evaluation │
│ Mode │─────►│ │─────►│ Service(s) │
└─────────────┘ └─────────────┘ └─────────────────┘
│
└─────► (*) Query ────► (Agent/Tool/Team)
Key Fields
- type: Evaluation type (direct, baseline, query, batch, event)
- evaluator: Reference to the Evaluator resource
- config: Type-specific configuration with embedded fields
- status.score: Evaluation score (0-1)
- status.passed: Whether evaluation passed
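Once an evaluation completes, the result is recorded in the resource status. The sketch below shows what that might look like; only score and passed are documented here, and the exact serialization (for example, whether the score is stored as a string) is an assumption:
status:
  score: "0.85" # assumption: score serialized as a string in the 0-1 range
  passed: true
The values can then be read with standard tooling, for example kubectl get evaluation direct-eval -o jsonpath='{.status.passed}' (assuming the CRD's singular resource name is evaluation).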
Evaluation Types
Direct Type
Evaluate a single input/output pair:
apiVersion: ark.mckinsey.com/v1alpha1
kind: Evaluation
metadata:
name: direct-eval
spec:
type: direct
evaluator:
name: quality-evaluator
config:
input: "What's the weather in NYC?"
output: "It's 72°F and sunny in New York City"
Baseline Type
Evaluate against a baseline dataset of golden examples to verify that the evaluator meets the expected minimum scores:
apiVersion: ark.mckinsey.com/v1alpha1
kind: Evaluation
metadata:
name: baseline-eval
spec:
type: baseline
timeout: "10m" # Extended timeout to keep the controller connection open until all samples get processed
evaluator:
name: llm-judge
parameters:
- name: golden-examples # reference dataset with test cases to baseline the evaluator performance
valueFrom:
configMapKeyRef:
name: golden-examples
key: examples
config: {}
---
apiVersion: v1
kind: ConfigMap
metadata:
name: golden-examples
data:
examples: |
[
{
"input": "What is 7 + 3?",
"expectedOutput": "10",
"expectedMinScore": "0.9",
"difficulty": "easy",
"category": "arithmetic",
"metadata": {
"type": "basic-addition",
"concept": "addition"
}
},
{
"input": "What is 6 × 4?",
"expectedOutput": "24",
"expectedMinScore": "0.85",
"difficulty": "easy",
"category": "arithmetic",
"metadata": {
"type": "multiplication",
"concept": "multiplication"
}
},
{
"input": "If I buy 8 items at $3 each, how much do I spend?",
"expectedOutput": "$24",
"expectedMinScore": "0.8",
"difficulty": "medium",
"category": "word-problem",
"metadata": {
"type": "word-problem",
"concept": "multiplication",
"context": "shopping"
}
},
{
"input": "What is 1/2 + 1/4?",
"expectedOutput": "3/4",
"expectedMinScore": "0.75",
"difficulty": "hard",
"category": "fractions",
"metadata": {
"type": "fraction-addition",
"concept": "fractions"
}
}
]
Query Type
Evaluate existing query results:
apiVersion: ark.mckinsey.com/v1alpha1
kind: Evaluation
metadata:
name: query-eval
spec:
type: query
evaluator:
name: accuracy-evaluator
config:
queryRef:
name: weather-query-123
responseTarget: "weather-agent"
Batch Type
┌─────────────┐
│ Evaluation │ Aggregates multiple child evaluations
│ type=Batch │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─┐
└─────────────┘ │
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Eval 1 │ │ Eval 2 │ │ Eval n │
│ type=Query │ │ type=Query │ │ type=Direct │
└─────────────┘ └─────────────┘ └─────────────┘
Example combining explicit items with template-based dynamic creation:
apiVersion: ark.mckinsey.com/v1alpha1
kind: Evaluation
metadata:
name: hybrid-batch-eval
namespace: default
spec:
type: batch
config:
# Explicit evaluations (high priority, specific configs)
items:
- name: critical-accuracy-test
type: direct
evaluator:
name: strict-evaluator
parameters:
- name: threshold
value: "0.95"
config:
input: "What is the capital of France?"
output: "Paris"
- name: performance-baseline
type: query
evaluator:
name: performance-evaluator
config:
queryRef:
name: baseline-query
# Template for dynamic creation from query selector
template:
namePrefix: auto-eval
evaluator:
name: standard-evaluator
type: query
config:
queryRef:
name: "" # Will be filled dynamically
# Select additional queries to evaluate using the template
querySelector:
matchLabels:
category: "regression-test"
priority: "medium"
matchExpressions:
- key: status
operator: In
values: ["completed", "ready"]
concurrency: 3
continueOnFailure: true
Event Type
Rule-based evaluations using CEL (Common Expression Language):
apiVersion: ark.mckinsey.com/v1alpha1
kind: Evaluation
metadata:
  name: event-eval
spec:
  type: event
  evaluator:
    name: tool-usage-evaluator
  config:
    rules:
      - name: "weather-tool-called"
        expression: "tools.was_called('get-weather')"
        description: "Validates that the get-weather tool was called"
        weight: 1
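Multiple rules can be combined in one event evaluation. The sketch below reuses the tools.was_called helper from the example above with two rules of different weights; the second tool name is hypothetical, and how per-rule results and weights are aggregated into the final score is determined by the evaluation service:
apiVersion: ark.mckinsey.com/v1alpha1
kind: Evaluation
metadata:
  name: multi-rule-event-eval
spec:
  type: event
  evaluator:
    name: tool-usage-evaluator
  config:
    rules:
      - name: "weather-tool-called"
        expression: "tools.was_called('get-weather')"
        description: "The get-weather tool was called"
        weight: 2 # illustrative: weighted more heavily than the rule below
      - name: "geocode-tool-called"
        expression: "tools.was_called('get-coordinates')" # hypothetical tool name
        description: "The get-coordinates tool was called"
        weight: 1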
Execution Engines
Execution Engines provide custom runtime environments for specialized agent execution.
Specification
apiVersion: ark.mckinsey.com/v1prealpha1
kind: ExecutionEngine
metadata:
name: custom-engine
spec:
type: external
endpoint: "http://custom-engine-service:8080"
Resource Relationships
ARK resources work together in common patterns:
- Agent + Model + Tools: Basic agent with capabilities
- Team + Multiple Agents: Multi-agent collaboration
- Query + Targets: Requests to agents or teams
- MCP Server + Tools: Standardized tool integration
- Memory + Sessions: Persistent conversations