Custom Resource Definitions (CRDs)
This page provides detailed specifications for each ARK custom resource.
Resource Reference
| Resource | API Version | Description |
|---|---|---|
| Evaluator | ark.mckinsey.com/v1alpha1 | Deterministic or AI-powered assessment services for queries, agents, teams, and tools |
| Evaluation | ark.mckinsey.com/v1alpha1 | Assessments of AI outputs across multiple evaluation types |
| ExecutionEngine | ark.mckinsey.com/v1prealpha1 | External execution engines providing custom runtime environments |
Evaluators
Evaluators provide deterministic or AI-powered assessment of agents, teams, queries, and tools to support quality control and testing. For non-deterministic assessments, they use the “LLM-as-a-Judge” pattern to automatically evaluate agent responses.
How Evaluators Work
- Service Integration: Evaluators define assessment services that can be referenced by Evaluation CRDs
- LLM-as-a-Judge: Evaluators use AI models to assess response quality across multiple criteria
- Automatic Evaluation: Evaluators can use selectors to automatically create evaluations for matching queries
- Post-hoc Assessment: Evaluations analyze completed queries without affecting query execution
Specification
apiVersion: ark.mckinsey.com/v1alpha1
kind: Evaluator
metadata:
name: ark-evaluator
spec:
description: "evaluator for query assessment"
address:
valueFrom:
serviceRef:
name: ark-evaluator
port: "http"
path: "/evaluate"
selector: # optional - for automatic query evaluation
resourceType: Query
matchLabels:
evaluate: "true" # will automatically evaluate any query with label evaluate=true
parameters: # optional - including model configuration
- name: model.name # specify model to use (default: "default")
value: "gpt-4-model"
- name: model.namespace # specify model namespace (default: evaluator's namespace)
value: "models"
- name: min-score # custom parameter passed to the evaluation service
value: "0.8"
Auto-triggered Evaluation:
┌─────────────────────┐ ┌──────────────┐
│ Evaluator with │ Auto-triggers │ New Query │
│ selector: │ ◄───────────────── │ labels: │
│ matchLabels: │ when created │ evaluate: │
│ evaluate: true │ or modified │ "true" │
└─────────────────────┘ └──────────────┘
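For example, a Query carrying the matching label is picked up automatically and an Evaluation is created for it. The sketch below is illustrative: the label is the only part the selector above requires, while the spec fields (input, targets) are assumptions about the Query schema, which is documented separately:
apiVersion: ark.mckinsey.com/v1alpha1
kind: Query
metadata:
  name: weather-query
  labels:
    evaluate: "true" # matches the Evaluator selector above, so an Evaluation is created automatically
spec:
  input: "What's the weather in NYC?" # assumed field name; see the Query resource documentation
  targets: # assumed field name
    - type: agent
      name: weather-agent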
Key Fields
- address (service reference): Targets the evaluation service that performs the assessment.
- selector: Automatically matches queries to evaluate based on their labels.
- parameters: Passes default evaluation parameters to the target evaluation service.
Evaluations
Evaluations assess different metrics through several evaluation types, including direct assessment, baseline dataset comparison, and query result evaluation.
The current design follows the principle: one Evaluation = one Evaluator = one specific assessment.
Overview
Evaluations work with Evaluators to assess AI outputs, both deterministic and non-deterministic. Multiple evaluation modes can target different evaluators, and evaluators can automatically process evaluations based on label selectors.
Evaluation Flow:
┌─────────────┐ ┌─────────────┐ ┌─────────────────┐
│ Evaluation │◄─────│ Evaluator(s)│ │ Ark Evaluation │
│ Mode │─────►│ │─────►│ Service(s) │
└─────────────┘ └─────────────┘ └─────────────────┘
│
└─────► (*) Query ────► (Agent/Tool/Team)
Key Fields
- type: Evaluation type (direct, baseline, query, batch, event)
- evaluator: Reference to the Evaluator resource
- config: Type-specific configuration with embedded fields
- status.score: Evaluation score (0-1)
- status.passed: Whether evaluation passed
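Once an evaluation completes, the result is recorded in the resource status. The sketch below shows what that might look like; only score and passed are documented here, and the exact serialization (for example, whether the score is stored as a string) is an assumption:
status:
  score: "0.85" # assumption: score serialized as a string in the 0-1 range
  passed: true
The values can then be read with standard tooling, for example kubectl get evaluation direct-eval -o jsonpath='{.status.passed}' (assuming the CRD's singular resource name is evaluation).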
Evaluation Types
Direct Type
Evaluate a single input/output pair:
apiVersion: ark.mckinsey.com/v1alpha1
kind: Evaluation
metadata:
name: direct-eval
spec:
type: direct
evaluator:
name: quality-evaluator
config:
input: "What's the weather in NYC?"
output: "It's 72°F and sunny in New York City"
Baseline Type
Evaluate against a baseline dataset of golden examples to verify that the evaluator meets the expected minimum scores:
apiVersion: ark.mckinsey.com/v1alpha1
kind: Evaluation
metadata:
name: baseline-eval
spec:
type: baseline
timeout: "10m" # Extended timeout to keep the controller connection open until all samples get processed
evaluator:
name: llm-judge
parameters:
- name: golden-examples # reference dataset with test cases to baseline the evaluator performance
valueFrom:
configMapKeyRef:
name: golden-examples
key: examples
config: {}
---
apiVersion: v1
kind: ConfigMap
metadata:
name: golden-examples
data:
examples: |
[
{
"input": "What is 7 + 3?",
"expectedOutput": "10",
"expectedMinScore": "0.9",
"difficulty": "easy",
"category": "arithmetic",
"metadata": {
"type": "basic-addition",
"concept": "addition"
}
},
{
"input": "What is 6 × 4?",
"expectedOutput": "24",
"expectedMinScore": "0.85",
"difficulty": "easy",
"category": "arithmetic",
"metadata": {
"type": "multiplication",
"concept": "multiplication"
}
},
{
"input": "If I buy 8 items at $3 each, how much do I spend?",
"expectedOutput": "$24",
"expectedMinScore": "0.8",
"difficulty": "medium",
"category": "word-problem",
"metadata": {
"type": "word-problem",
"concept": "multiplication",
"context": "shopping"
}
},
{
"input": "What is 1/2 + 1/4?",
"expectedOutput": "3/4",
"expectedMinScore": "0.75",
"difficulty": "hard",
"category": "fractions",
"metadata": {
"type": "fraction-addition",
"concept": "fractions"
}
}
]
Query Type
Evaluate existing query results:
apiVersion: ark.mckinsey.com/v1alpha1
kind: Evaluation
metadata:
name: query-eval
spec:
type: query
evaluator:
name: accuracy-evaluator
config:
queryRef:
name: weather-query-123
responseTarget: "weather-agent"
Batch Type
┌─────────────┐
│ Evaluation │ Aggregates multiple child evaluations
│ type=Batch │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─┐
└─────────────┘ │
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Eval 1 │ │ Eval 2 │ │ Eval n │
│ type=Query │ │ type=Query │ │ type=Direct │
└─────────────┘ └─────────────┘ └─────────────┘
Example combining explicit items with template-based dynamic creation:
apiVersion: ark.mckinsey.com/v1alpha1
kind: Evaluation
metadata:
name: hybrid-batch-eval
namespace: default
spec:
type: batch
config:
# Explicit evaluations (high priority, specific configs)
items:
- name: critical-accuracy-test
type: direct
evaluator:
name: strict-evaluator
parameters:
- name: threshold
value: "0.95"
config:
input: "What is the capital of France?"
output: "Paris"
- name: performance-baseline
type: query
evaluator:
name: performance-evaluator
config:
queryRef:
name: baseline-query
# Template for dynamic creation from query selector
template:
namePrefix: auto-eval
evaluator:
name: standard-evaluator
type: query
config:
queryRef:
name: "" # Will be filled dynamically
# Select additional queries to evaluate using the template
querySelector:
matchLabels:
category: "regression-test"
priority: "medium"
matchExpressions:
- key: status
operator: In
values: ["completed", "ready"]
concurrency: 3
continueOnFailure: true
Event Type
Rule-based evaluations using CEL (Common Expression Language):
apiVersion: ark.mckinsey.com/v1alpha1
kind: Evaluation
metadata:
  name: event-eval
spec:
  type: event
  evaluator:
    name: tool-usage-evaluator
  config:
    rules:
      - name: "weather-tool-called"
        expression: "tools.was_called('get-weather')"
        description: "Validates that the get-weather tool was called"
        weight: 1
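Multiple rules can be combined in one event evaluation. The sketch below reuses the tools.was_called helper from the example above with two rules of different weights; the second tool name is hypothetical, and how per-rule results and weights are aggregated into the final score is determined by the evaluation service:
apiVersion: ark.mckinsey.com/v1alpha1
kind: Evaluation
metadata:
  name: multi-rule-event-eval
spec:
  type: event
  evaluator:
    name: tool-usage-evaluator
  config:
    rules:
      - name: "weather-tool-called"
        expression: "tools.was_called('get-weather')"
        description: "The get-weather tool was called"
        weight: 2 # illustrative: weighted more heavily than the rule below
      - name: "geocode-tool-called"
        expression: "tools.was_called('get-coordinates')" # hypothetical tool name
        description: "The get-coordinates tool was called"
        weight: 1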
Execution Engines
Execution Engines provide custom runtime environments for specialized agent execution.
Specification
apiVersion: ark.mckinsey.com/v1prealpha1
kind: ExecutionEngine
metadata:
name: custom-engine
spec:
type: external
endpoint: "http://custom-engine-service:8080"
Resource Relationships
ARK resources work together in common patterns:
- Agent + Model + Tools: Basic agent with capabilities
- Team + Multiple Agents: Multi-agent collaboration
- Query + Targets: Requests to agents or teams
- MCP Server + Tools: Standardized tool integration
- Memory + Sessions: Persistent conversations