# ARK Evaluator

Unified AI evaluation service supporting both deterministic metrics assessment and LLM-as-a-Judge evaluation, with integrations for multiple LLM providers and observability tooling.
## Overview

ARK Evaluator provides two complementary evaluation approaches:

- Deterministic Evaluation (`/evaluate-metrics`): Objective, metrics-based assessment for measurable performance criteria
- LLM-as-a-Judge Evaluation (`/evaluate`): Intelligent, model-based assessment for subjective quality criteria
## Features

- Dual Evaluation Methods: Both objective metrics and subjective AI assessment
- Multiple LLM Providers: Azure OpenAI, ARK Native, with more in development
- Advanced Integrations: Langfuse + RAGAS support with tracing
- Kubernetes Native: Deploys as an `Evaluator` custom resource
- REST API: Simple HTTP interface with two evaluation endpoints
## Installation

Build and deploy the evaluator service:

```bash
# From project root
make ark-evaluator-deps      # Install dependencies (including ark-sdk)
make ark-evaluator-build     # Build the Docker image
make ark-evaluator-install   # Deploy to the cluster
```
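Once deployed, confirm the service is running. The namespace and pod-name filter below are assumptions; adjust them to match your cluster:

```bash
# Assumes the default namespace and a pod name containing "ark-evaluator"
kubectl get pods -n default | grep ark-evaluator
```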
## Usage

### Deterministic Metrics Evaluation

Objective performance assessment across token efficiency, cost analysis, performance metrics, and quality thresholds:

```bash
curl -X POST http://ark-evaluator:8000/evaluate-metrics \
  -H "Content-Type: application/json" \
  -d '{
    "type": "direct",
    "config": {
      "input": "What is machine learning?",
      "output": "Machine learning is a subset of AI..."
    },
    "parameters": {
      "maxTokens": "1000",
      "maxCostPerQuery": "0.05",
      "tokenWeight": "0.3"
    }
  }'
```
### LLM-as-a-Judge Evaluation

Intelligent quality assessment using language models for relevance, accuracy, completeness, clarity, and usefulness:

```bash
curl -X POST http://ark-evaluator:8000/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "type": "direct",
    "config": {
      "input": "Explain renewable energy benefits",
      "output": "Renewable energy offers cost savings..."
    },
    "parameters": {
      "provider": "ark",
      "scope": "relevance,accuracy,clarity",
      "threshold": "0.8"
    }
  }'
```
## Evaluation Capabilities

### Deterministic Metrics

Objective performance assessment across four key dimensions (see the aggregation sketch after this list):

- Token Score: Efficiency, limits, throughput for cost optimization
- Cost Score: Per-query cost, efficiency ratios for budget management
- Performance Score: Latency, response time, throughput for SLA compliance
- Quality Score: Completeness, length, error rates for content quality
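How the four dimension scores roll up into a single result is governed by the configurable weights shown under Configuration Examples. As a minimal sketch, assuming the service combines them as a weighted sum, with hypothetical component scores:

```bash
# Hypothetical component scores: token=0.85, cost=0.90, performance=0.70, quality=0.95
# Weights match the deterministic configuration example below (0.3/0.3/0.2/0.2)
echo "scale=3; 0.85*0.3 + 0.90*0.3 + 0.70*0.2 + 0.95*0.2" | bc
# => .855
```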
### LLM-as-a-Judge

Intelligent quality assessment using advanced language models:

- Relevance: How well the response addresses the query
- Accuracy: Factual correctness and reliability
- Completeness: Comprehensiveness of information
- Clarity: Readability and communication effectiveness
- Usefulness: Practical value and actionability
## Supported Providers

### Currently Available

- Azure OpenAI: GPT-4o, GPT-4-turbo, GPT-3.5-turbo with enterprise features
- ARK Native: Configurable model endpoints with a unified interface

### Advanced Integrations

- Langfuse + RAGAS: RAGAS metrics with Azure OpenAI, automatic tracing, and comprehensive evaluation lineage
## API Endpoints

### Health & Status

- `GET /health`: Service health status
- `GET /ready`: Service readiness check

### Evaluation Endpoints

- `POST /evaluate-metrics`: Deterministic metrics evaluation
- `POST /evaluate`: LLM-as-a-Judge evaluation
## Configuration Examples

### Deterministic Evaluation

```yaml
parameters:
  maxTokens: "2000"
  maxDuration: "30s"
  maxCostPerQuery: "0.08"
  tokenWeight: "0.3"
  costWeight: "0.3"
  performanceWeight: "0.2"
  qualityWeight: "0.2"
```
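The four weights above sum to 1.0 and let you bias the overall score toward whichever dimension matters most. Assuming the weighted-sum model sketched earlier, shifting weight between dimensions changes the result for the same hypothetical component scores:

```bash
# Same hypothetical component scores as above: 0.85/0.90/0.70/0.95
echo "scale=3; 0.85*0.2 + 0.90*0.5 + 0.70*0.1 + 0.95*0.2" | bc   # cost-focused => .880
echo "scale=3; 0.85*0.2 + 0.90*0.2 + 0.70*0.5 + 0.95*0.1" | bc   # performance-focused => .795
```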
### LLM Evaluation

```yaml
parameters:
  provider: "ark"
  scope: "relevance,accuracy,completeness"
  threshold: "0.8"
  temperature: "0.1"
```
### Langfuse + Azure OpenAI

```yaml
parameters:
  provider: "langfuse"
  langfuse.host: "https://cloud.langfuse.com"
  langfuse.azure_deployment: "gpt-4o"
  metrics: "relevance,correctness,faithfulness"
```
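These flat parameter maps are attached to a deployed `Evaluator` resource (see the Testing `/evaluate` walkthrough below). Assuming each key becomes a name/value entry in the resource's `parameters` list, a Langfuse-backed evaluator might be declared as follows; this is a sketch, and the resource name is hypothetical:

```bash
# Sketch only: the parameter mapping and resource name are assumptions
kubectl apply -f - <<EOF
apiVersion: ark/v1alpha1
kind: Evaluator
metadata:
  name: langfuse-evaluator
  namespace: default
spec:
  address:
    value: http://ark-evaluator.default.svc.cluster.local:8000/evaluate
  parameters:
    - name: provider
      value: langfuse
    - name: langfuse.host
      value: https://cloud.langfuse.com
    - name: langfuse.azure_deployment
      value: gpt-4o
    - name: metrics
      value: relevance,correctness,faithfulness
EOF
```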
## Development

For local development:

```bash
make ark-evaluator-dev    # Run service locally
make ark-evaluator-test   # Run tests
```

Running `make ark-evaluator-dev` starts the service at http://localhost:8000 with both evaluation endpoints exposed. Visit http://localhost:8000/docs for the interactive API documentation.
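A quick way to confirm the local service is up is to hit the health endpoints listed above:

```bash
curl http://localhost:8000/health
curl http://localhost:8000/ready
```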
### Testing `/evaluate-metrics`

The metrics endpoint performs deterministic analysis without requiring any LLM setup. See `/services/ark-evaluator/src/evaluator/metrics/` for the underlying logic.
```bash
curl -X POST http://localhost:8000/evaluate-metrics \
  -H "Content-Type: application/json" \
  -d '{
    "type": "direct",
    "config": {
      "input": "What is 2+2?",
      "output": "2+2 equals 4"
    },
    "parameters": {"temperature": "0.7"},
    "evaluatorName": "test-evaluator"
  }'
```
### Testing `/evaluate`

The evaluate endpoint requires deployed ARK resources for LLM-as-a-Judge evaluation.

1. Define a model:
```bash
kubectl apply -f - <<EOF
apiVersion: ark/v1alpha1
kind: Model
metadata:
  name: default
  namespace: default
spec:
  type: azure
  config:
    model: gpt-4o-mini
    credentials:
      secretRef:
        name: azure-openai-secret
        key: api-key
EOF
```
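The Model above reads its credentials from an `azure-openai-secret`. If that secret does not already exist, create it first; the key name must match the `secretRef` key (`api-key` here):

```bash
# Placeholder value: substitute your real Azure OpenAI API key
kubectl create secret generic azure-openai-secret \
  --from-literal=api-key="<your-azure-openai-api-key>" \
  -n default
```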
2. Define an evaluator:

```bash
kubectl apply -f - <<EOF
apiVersion: ark/v1alpha1
kind: Evaluator
metadata:
  name: test-ark-evaluator
  namespace: default
spec:
  address:
    value: http://ark-evaluator.default.svc.cluster.local:8000/evaluate
  parameters:
    - name: model.name
      value: default
EOF
```
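Before sending a request, verify that both resources exist, and port-forward the service if you are targeting the in-cluster deployment rather than a local `make ark-evaluator-dev` process. The plural resource names below assume standard CRD registration:

```bash
kubectl get models,evaluators -n default

# Only needed when testing the in-cluster service from your machine
kubectl port-forward -n default svc/ark-evaluator 8000:8000
```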
3. Send the evaluation request:

```bash
curl -X POST http://localhost:8000/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "type": "direct",
    "config": {
      "input": "What is the capital of France?",
      "output": "Paris"
    },
    "parameters": {"temperature": "0.7"},
    "evaluatorName": "test-ark-evaluator",
    "model": {
      "name": "default",
      "type": "azure",
      "namespace": "default"
    }
  }'
```
## Troubleshooting

Token expiration issues: If your secret has expired, update it with a new token:

```bash
NEW_TOKEN="your-new-token-here"
kubectl patch secret azure-openai-secret -n default --type='json' \
  -p="[{\"op\": \"replace\", \"path\": \"/data/token\", \"value\": \"$(echo -n $NEW_TOKEN | base64)\"}]"
```
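Note that this patch targets a `token` key, while the Model example earlier references `api-key`; patch whichever key your secret actually carries. You can list the secret's keys to confirm before patching:

```bash
# Shows the key names (not values) stored in the secret
kubectl describe secret azure-openai-secret -n default
```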
## Use Cases

- Production Monitoring: Real-time quality assessment, cost tracking, SLA compliance
- Model Comparison: A/B testing, cost-effectiveness analysis, performance benchmarking
- Content Quality: Automated content evaluation, support response assessment
- Development: Prompt engineering validation, model tuning, response optimization