Evaluator LLM Service
AI-powered query evaluation service that uses large language models as judges to assess response quality automatically.
Overview
The Evaluator LLM service implements the LLM-as-a-Judge pattern, providing automated evaluation of query responses across multiple quality dimensions. It integrates seamlessly with the ARK platform to provide quality gating for agent interactions.
Features
- LLM-as-a-Judge Pattern: Uses advanced language models to evaluate response quality objectively
- Multi-Criteria Assessment: Scores responses on five dimensions (relevance, accuracy, completeness, clarity, usefulness)
- Model Flexibility: Supports OpenAI and Azure OpenAI configurations
- Kubernetes Native: Deploys as Evaluator custom resource
- REST API: Simple HTTP interface for evaluation requests
Installation
Deploy the evaluator service using Helm:
```bash
# Install the evaluator-llm service
helm install evaluator-llm ./services/evaluator-llm/chart

# Verify deployment
kubectl get pods -l app.kubernetes.io/name=evaluator-llm
kubectl get evaluator evaluator-llm
```
Usage
1. Create Model Configuration
First, ensure you have a model configured for evaluation:
```yaml
apiVersion: ark.mckinsey.com/v1alpha1
kind: Model
metadata:
  name: evaluation-model
spec:
  type: openai
  url: https://api.openai.com/v1/chat/completions
  model: gpt-4
  apiKey: your-api-key
```
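Apply the manifest and confirm the resource exists (the filename and the `kubectl get model` resource name below are assumptions based on the usual CRD conventions):

```bash
# Apply the model manifest (filename is illustrative)
kubectl apply -f evaluation-model.yaml

# Confirm the Model resource was created
kubectl get model evaluation-model
```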
2. Configure Evaluator
The evaluator is automatically created by the Helm chart, but you can customize it:
```yaml
apiVersion: ark.mckinsey.com/v1alpha1
kind: Evaluator
metadata:
  name: llm-evaluator
spec:
  type: llm-judge
  description: "LLM-as-a-Judge evaluator for query assessment"
  address:
    valueFrom:
      serviceRef:
        name: evaluator-llm
        port: "http"
        path: "/evaluate"
  modelRef:
    name: evaluation-model
```
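If you maintain a customized Evaluator alongside the Helm release, you can apply and inspect it the same way (the filename is illustrative):

```bash
# Apply a customized Evaluator definition
kubectl apply -f llm-evaluator.yaml

# Inspect the resolved configuration
kubectl get evaluator llm-evaluator -o yaml
```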
3. Use in Queries
Reference the evaluator in your queries:
```yaml
apiVersion: ark.mckinsey.com/v1alpha1
kind: Query
metadata:
  name: research-query
spec:
  input: "Explain the benefits of renewable energy"
  targets:
    - type: agent
      name: research-agent
  evaluator:
    name: llm-evaluator
```
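Once submitted, the query moves through the evaluation phases described in the next section; for example (the filename is illustrative):

```bash
# Submit the query and watch it progress through its phases (e.g. evaluating -> done)
kubectl apply -f research-query.yaml
kubectl get query research-query -w
```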
Evaluation Process
When a query that references an evaluator runs:
1. Query Execution: The agent generates its response as normal.
2. Evaluation Trigger: The query status changes to "evaluating".
3. AI Assessment: The evaluator analyzes the response using the configured model.
4. Quality Scoring: The response is scored across multiple criteria.
5. Completion: The query is marked "done" once evaluation finishes.
Evaluation Criteria
The service evaluates responses across five dimensions (0-100 scale):
- Relevance: How well the response addresses the query
- Accuracy: Factual correctness and reliability
- Completeness: Comprehensiveness of the information
- Clarity: Readability and ease of understanding
- Usefulness: Practical value to the user
A response with an overall score ≥70 is considered “passed”.
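To make the threshold concrete, here is a minimal sketch of the pass/fail decision, assuming the overall score is the unweighted mean of the five criterion scores (the service's actual aggregation may differ):

```python
# Sketch only: assumes the overall score is the unweighted mean of the five criteria.
CRITERIA = ["relevance", "accuracy", "completeness", "clarity", "usefulness"]
PASS_THRESHOLD = 70  # minimum overall score for the evaluation to count as "passed"

def aggregate(scores: dict[str, int]) -> tuple[int, bool]:
    """Return the overall score (0-100) and whether it meets the pass threshold."""
    overall = round(sum(scores[c] for c in CRITERIA) / len(CRITERIA))
    return overall, overall >= PASS_THRESHOLD

# Example: (90 + 85 + 80 + 88 + 82) / 5 = 85 -> passed
print(aggregate({"relevance": 90, "accuracy": 85, "completeness": 80, "clarity": 88, "usefulness": 82}))
```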
API Reference
Health Endpoints
GET /health - Service health status
GET /ready - Service readiness check
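For a quick check from your workstation, port-forward the service and curl the endpoints (the local and service port numbers below are assumptions; check the chart's service definition for the actual values):

```bash
# Forward a local port to the evaluator-llm service (port numbers are assumptions)
kubectl port-forward svc/evaluator-llm 8080:80 &

curl http://localhost:8080/health
curl http://localhost:8080/ready
```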
Evaluation Endpoint
POST /evaluate - Evaluate query responses
Request Format:
```json
{
  "queryId": "query-uuid",
  "input": "user query text",
  "responses": [
    {
      "target": {"type": "agent", "name": "agent-name"},
      "content": "agent response content"
    }
  ],
  "query": {...},
  "model": {
    "spec": {...},
    "metadata": {...}
  }
}
```
Response Format:
```json
{
  "score": "85",
  "passed": true,
  "metadata": {
    "reasoning": "Response demonstrates good accuracy...",
    "criteria_scores": "relevance=90, accuracy=85, ..."
  }
}
```
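Using the same port-forward as in the health check above, an evaluation can be exercised directly. The body below is a trimmed-down version of the request format with illustrative values; in normal operation the ARK controller also supplies the `query` and `model` objects, so a direct call may need at least the `model` block as well:

```bash
# Minimal evaluation request with illustrative values (the query/model blocks are omitted for brevity)
curl -X POST http://localhost:8080/evaluate \
  -H "Content-Type: application/json" \
  -d '{
        "queryId": "query-uuid",
        "input": "Explain the benefits of renewable energy",
        "responses": [
          {
            "target": {"type": "agent", "name": "research-agent"},
            "content": "Renewable energy reduces emissions and long-term operating costs..."
          }
        ]
      }'
```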
Configuration
Model Support
The evaluator supports these model types:
- OpenAI: Standard OpenAI API endpoints
- Azure OpenAI: Azure-hosted OpenAI services
Model configuration is passed automatically from the Evaluator custom resource.
Evaluation Parameters
The service uses optimized parameters for consistent evaluation:
- Temperature: 0.1 (low for consistent scoring)
- Max Tokens: 1000 (sufficient for detailed evaluation)
- Timeout: 30 seconds per evaluation
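To illustrate how these parameters map onto an OpenAI-style chat completions call, here is a sketch (not the service's actual client code); the URL, key, and model name come from the Model resource shown earlier:

```python
import httpx

# Evaluation parameters from above; the call itself is a sketch, not the service's client code.
EVAL_PARAMS = {"temperature": 0.1, "max_tokens": 1000}
TIMEOUT_SECONDS = 30.0

def judge(url: str, api_key: str, model: str, prompt: str) -> str:
    """Send one evaluation prompt to an OpenAI-compatible chat completions endpoint."""
    response = httpx.post(
        url,  # e.g. https://api.openai.com/v1/chat/completions from the Model spec
        headers={"Authorization": f"Bearer {api_key}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}], **EVAL_PARAMS},
        timeout=TIMEOUT_SECONDS,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```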
Monitoring
Monitor evaluator performance:
```bash
# Check service logs
kubectl logs -l app.kubernetes.io/name=evaluator-llm

# View evaluator status
kubectl get evaluator evaluator-llm -o yaml

# Monitor query evaluation phases
kubectl get query -w
```
Development
For local development:
```bash
cd services/evaluator-llm

# Install dependencies
make init

# Run locally
make dev

# Run tests
make test

# Check code quality
make lint
```
Architecture
The evaluator service consists of:
- FastAPI Application: REST API server with async endpoints
- LLM Evaluator: Core evaluation logic with structured prompting
- LLM Client: HTTP client supporting OpenAI and Azure APIs
- Type System: Pydantic models for request/response validation
The service integrates with ARK through:
- Evaluator CRD: Kubernetes custom resource for configuration
- ValueSource Resolution: Dynamic address and model resolution
- Operation Tracking: Telemetry and monitoring integration
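A minimal sketch of how the FastAPI and Pydantic pieces fit together is shown below; the field names follow the API reference above, but this is not the service's actual source:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Request/response shapes follow the API reference above (sketch, not the actual source).
class TargetRef(BaseModel):
    type: str
    name: str

class ResponseItem(BaseModel):
    target: TargetRef
    content: str

class EvaluateRequest(BaseModel):
    queryId: str
    input: str
    responses: list[ResponseItem]

class EvaluateResponse(BaseModel):
    score: str
    passed: bool
    metadata: dict[str, str]

@app.get("/health")
async def health() -> dict[str, str]:
    return {"status": "ok"}

@app.post("/evaluate", response_model=EvaluateResponse)
async def evaluate(request: EvaluateRequest) -> EvaluateResponse:
    # Placeholder: the real service prompts the configured model and aggregates criterion scores.
    return EvaluateResponse(score="85", passed=True, metadata={"reasoning": "stub"})
```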
Next Steps
- Testing - Learn about testing evaluators
- Observability - Monitor evaluation performance