
Custom Resource Definitions (CRDs)

This page provides detailed specifications for each ARK custom resource.

Resource Reference

Resource           API Version                     Description
Evaluator          ark.mckinsey.com/v1alpha1       AI-powered query assessment services
Evaluation         ark.mckinsey.com/v1alpha1       Multi-type AI output assessments
ExecutionEngine    ark.mckinsey.com/v1prealpha1    External execution engines

Evaluators

Evaluators provide deterministic or AI-powered assessment of teams, agents, queries, and tools to support quality control and testing. For non-deterministic cases, they use the “LLM-as-a-Judge” pattern to automatically evaluate agent responses.

How Evaluators Work

  • Service Integration: Evaluators define assessment services that can be referenced by Evaluation CRDs
  • LLM-as-a-Judge: Evaluators use AI models to assess response quality across multiple criteria
  • Automatic Evaluation: Evaluators can use selectors to automatically create evaluations for matching queries
  • Post-hoc Assessment: Evaluations analyze completed queries without affecting query execution

Specification

apiVersion: ark.mckinsey.com/v1alpha1
kind: Evaluator
metadata:
  name: ark-evaluator
spec:
  description: "evaluator for query assessment"
  address:
    valueFrom:
      serviceRef:
        name: ark-evaluator
        port: "http"
        path: "/evaluate"
  selector:                      # optional - for automatic query evaluation
    resourceType: Query
    matchLabels:
      evaluate: "true"           # will automatically evaluate any query with label evaluate=true
  parameters:                    # optional - including model configuration
    - name: model.name           # specify model to use (default: "default")
      value: "gpt-4-model"
    - name: model.namespace      # specify model namespace (default: evaluator's namespace)
      value: "models"
    - name: min-score            # custom parameter passed to the evaluation service
      value: "0.8"

Auto-triggered Evaluation:

┌─────────────────────┐                      ┌──────────────┐
│ Evaluator with      │    Auto-triggers     │ New Query    │
│   selector:         │ ◄─────────────────── │   labels:    │
│     matchLabels:    │    when created      │   evaluate:  │
│       evaluate: true│    or modified       │     "true"   │
└─────────────────────┘                      └──────────────┘
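For reference, the kind of query that triggers this automatic evaluation is sketched below. Only the evaluate: "true" label under metadata.labels is what the Evaluator selector matches; the input and targets fields are illustrative and should follow the Query resource's documented schema.

apiVersion: ark.mckinsey.com/v1alpha1
kind: Query
metadata:
  name: weather-query
  labels:
    evaluate: "true"                      # matched by the Evaluator selector above, so an evaluation is created automatically
spec:
  input: "What's the weather in NYC?"     # illustrative - check the Query reference for the exact schema
  targets:                                # illustrative - the query targets a single agent
    - type: agent
      name: weather-agent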

Key Fields

  • Address (service reference): Points the evaluator at the specific service that executes the evaluation.
  • Query selector: Automatically matches queries to be evaluated based on their labels.
  • Parameter map: Passes default evaluation parameters to the target evaluation service.

Evaluations

Evaluations assess different metrics using various modes, including direct assessment, dataset comparison, and query result evaluation.
The current design follows the principle: One Evaluation = One Evaluator = One Specific Assessment.

Overview

Evaluations work with Evaluators to assess AI outputs, both deterministic and non-deterministic. Multiple evaluation modes can target different evaluators, and evaluators can automatically process evaluations based on label selectors.

Evaluation Flow:

┌─────────────┐       ┌─────────────┐       ┌─────────────────┐
│ Evaluation  │◄──────│ Evaluator(s)│       │ Ark Evaluation  │
│    Mode     │──────►│             │──────►│   Service(s)    │
└─────────────┘       └─────────────┘       └─────────────────┘
       └─────► (*) Query ────► (Agent/Tool/Team)

Key Fields

  • type: Evaluation type (direct, baseline, query, batch, event)
  • evaluator: Reference to the Evaluator resource
  • config: Type-specific configuration with embedded fields
  • status.score: Evaluation score (0-1)
  • status.passed: Whether the evaluation passed (see the example status after this list)
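
Once an evaluation completes, the controller writes the result to the resource status. The snippet below is a sketch of the two status fields listed above; the exact formatting (for example, whether the score is serialized as a string or a number) may vary between ARK versions.

status:
  score: "0.85"        # evaluation score in the 0-1 range
  passed: true         # whether the score met the evaluator's pass criteria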

Evaluation Types

Direct Type

Evaluate a single input/output pair:

apiVersion: ark.mckinsey.com/v1alpha1
kind: Evaluation
metadata:
  name: direct-eval
spec:
  type: direct
  evaluator:
    name: quality-evaluator
  config:
    input: "What's the weather in NYC?"
    output: "It's 72°F and sunny in New York City"

Baseline Type

Evaluate against a baseline dataset to measure performance and verify that the evaluator meets the expected metrics:

apiVersion: ark.mckinsey.com/v1alpha1
kind: Evaluation
metadata:
  name: baseline-eval
spec:
  type: baseline
  timeout: "10m"                  # Extended timeout to keep the controller connection open until all samples get processed
  evaluator:
    name: llm-judge
    parameters:
      - name: golden-examples     # reference dataset with test cases to baseline the evaluator performance
        valueFrom:
          configMapKeyRef:
            name: golden-examples
            key: examples
  config: {}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: golden-examples
data:
  examples: |
    [
      {
        "input": "What is 7 + 3?",
        "expectedOutput": "10",
        "expectedMinScore": "0.9",
        "difficulty": "easy",
        "category": "arithmetic",
        "metadata": { "type": "basic-addition", "concept": "addition" }
      },
      {
        "input": "What is 6 × 4?",
        "expectedOutput": "24",
        "expectedMinScore": "0.85",
        "difficulty": "easy",
        "category": "arithmetic",
        "metadata": { "type": "multiplication", "concept": "multiplication" }
      },
      {
        "input": "If I buy 8 items at $3 each, how much do I spend?",
        "expectedOutput": "$24",
        "expectedMinScore": "0.8",
        "difficulty": "medium",
        "category": "word-problem",
        "metadata": { "type": "word-problem", "concept": "multiplication", "context": "shopping" }
      },
      {
        "input": "What is 1/2 + 1/4?",
        "expectedOutput": "3/4",
        "expectedMinScore": "0.75",
        "difficulty": "hard",
        "category": "fractions",
        "metadata": { "type": "fraction-addition", "concept": "fractions" }
      }
    ]

Query Type

Evaluate existing query results:

apiVersion: ark.mckinsey.com/v1alpha1
kind: Evaluation
metadata:
  name: query-eval
spec:
  type: query
  evaluator:
    name: accuracy-evaluator
  config:
    queryRef:
      name: weather-query-123
    responseTarget: "weather-agent"

Batch Type

┌─────────────┐
│ Evaluation  │   Aggregates multiple child evaluations
│ type=Batch  │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─┐
└─────────────┘                                           │
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│   Eval 1    │   │   Eval 2    │   │   Eval n    │
│ type=Query  │   │ type=Query  │   │ type=Direct │
└─────────────┘   └─────────────┘   └─────────────┘

Example combining explicit items with template-based dynamic creation:

apiVersion: ark.mckinsey.com/v1alpha1
kind: Evaluation
metadata:
  name: hybrid-batch-eval
  namespace: default
spec:
  type: batch
  config:
    # Explicit evaluations (high priority, specific configs)
    items:
      - name: critical-accuracy-test
        type: direct
        evaluator:
          name: strict-evaluator
          parameters:
            - name: threshold
              value: "0.95"
        config:
          input: "What is the capital of France?"
          output: "Paris"
      - name: performance-baseline
        type: query
        evaluator:
          name: performance-evaluator
        config:
          queryRef:
            name: baseline-query
    # Template for dynamic creation from query selector
    template:
      namePrefix: auto-eval
      evaluator:
        name: standard-evaluator
      type: query
      config:
        queryRef:
          name: ""              # Will be filled dynamically
    # Select additional queries to evaluate using the template
    querySelector:
      matchLabels:
        category: "regression-test"
        priority: "medium"
      matchExpressions:
        - key: status
          operator: In
          values: ["completed", "ready"]
    concurrency: 3
    continueOnFailure: true

Event Type (Rule-Based)

Rule-based evaluations using CEL (Common Expression Language):

apiVersion: ark.mckinsey.com/v1alpha1
kind: Evaluation
metadata:
  name: event-eval
spec:
  type: event
  evaluator:
    name: tool-usage-evaluator
  config:
    rules:
      - name: "weather-tool-called"
        expression: "tools.was_called('get-weather')"
        description: "Validates get-weather tool was called"
        weight: 1
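
Multiple rules can be combined in a single event evaluation. The sketch below reuses the tools.was_called helper from the example above with two weighted rules; the second tool name (get-forecast) is hypothetical, and how the weights are aggregated into the final score is determined by the evaluation service.

apiVersion: ark.mckinsey.com/v1alpha1
kind: Evaluation
metadata:
  name: multi-rule-eval
spec:
  type: event
  evaluator:
    name: tool-usage-evaluator
  config:
    rules:
      - name: "weather-tool-called"
        expression: "tools.was_called('get-weather')"
        description: "Validates get-weather tool was called"
        weight: 2                                        # weigh this rule more heavily
      - name: "forecast-tool-called"
        expression: "tools.was_called('get-forecast')"   # hypothetical tool name, for illustration only
        description: "Validates get-forecast tool was called"
        weight: 1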

Execution Engines

Execution Engines provide custom runtime environments for specialized agent execution.

Specification

apiVersion: ark.mckinsey.com/v1prealpha1
kind: ExecutionEngine
metadata:
  name: custom-engine
spec:
  type: external
  endpoint: "http://custom-engine-service:8080"

Resource Relationships

ARK resources work together in common patterns:

  • Agent + Model + Tools: Basic agent with capabilities
  • Team + Multiple Agents: Multi-agent collaboration
  • Query + Targets: Requests to agents or teams
  • MCP Server + Tools: Standardized tool integration
  • Memory + Sessions: Persistent conversations
