# Observability
Comprehensive observability is crucial for production agentic workloads. ARK provides integrated monitoring, tracing, and logging capabilities to help you understand and optimize your AI agent performance.
## OpenTelemetry Integration
ARK provides observability through OpenTelemetry integration, allowing you to monitor and trace all operations across the controller, execution engines, and any other services. You can connect to any OpenTelemetry-compatible provider using standard environment variables.

Telemetry is enabled by setting the OpenTelemetry environment variables:
| Variable | Description | Example |
|---|---|---|
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP endpoint URL | `http://localhost:4318/v1/traces` |
| `OTEL_EXPORTER_OTLP_HEADERS` | Authentication headers | `Authorization=Basic <token>` |
| `OTEL_SERVICE_NAME` | Service name for telemetry | `ark-controller` |
| `OTEL_RESOURCE_ATTRIBUTES` | Additional resource attributes | `environment=production` |
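For a local, non-Kubernetes run, the same variables can simply be exported in the shell before starting a service. The values below are placeholders, taken from the examples in the table above:

```shell
# Placeholder values - substitute your collector endpoint and credentials.
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318/v1/traces"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <token>"
export OTEL_SERVICE_NAME="ark-controller"
export OTEL_RESOURCE_ATTRIBUTES="environment=production"
```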
## Architecture
The controller creates a root trace span (query.dispatch) for every query execution. Trace context propagates via W3C traceparent headers across A2A boundaries to the completions engine and execution engines, producing a connected trace tree.
```text
┌─────────────────────┐
│   ARK Controller    │
│   query.dispatch    │──── traceparent ────►  Completions Engine
│   (root span)       │                          query.<name>
└─────────┬───────────┘                          └─ agent/team/model
          │                                      └─ LLM calls, tools
          │
          └──── traceparent ────►  Named Execution Engine
                                   (creates own child spans)

                 OTEL Env Vars
     ┌─────────────────────────┐
     │ Services, Engines, etc  ├────► OTEL Endpoint
     └─────────────────────────┘
```

ARK's session ID propagates automatically via W3C baggage headers as `ark.session.id`, making it available across all services without manual header injection. The `ark.` prefix avoids collisions with executor-native `session.id` attributes.
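To make the propagation concrete, here is a sketch of the two W3C header formats involved (illustrative values, not ARK's implementation): a `traceparent` header of the form `version-traceid-spanid-flags`, and a `baggage` header carrying `ark.session.id`:

```shell
# Sketch only - illustrates the W3C header formats, not ARK code.
# traceparent: 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>
traceparent="00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"

# baggage: comma-separated key=value pairs; ARK adds ark.session.id
baggage="ark.session.id=sess-1234,other.key=value"

# Extract the session ID the way a downstream service might
session_id=$(printf '%s' "$baggage" | tr ',' '\n' | grep '^ark\.session\.id=' | cut -d= -f2)
echo "$session_id"   # sess-1234
```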
## Per-Tenant OTEL Routing
For multi-tenant deployments, Ark supports routing traces to tenant-specific OTEL endpoints. This enables:
- **Tenant isolation** - each tenant's traces go to their own observability backend
- **Backend flexibility** - different tenants can use different OTEL backends (Langfuse, Phoenix, Opik, Jaeger, Honeycomb, etc.)
- **Cost attribution** - observability costs can be attributed per tenant
- **Compliance** - meet data residency or access control requirements
### Enabling Per-Tenant Routing
Enable per-tenant OTEL discovery in your Helm values:
```yaml
telemetry:
  tenantRouting:
    otelDiscovery: true
```

### Tenant Configuration
Each tenant configures their OTEL endpoint by creating a Secret named `otel-environment-variables` in their namespace:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: otel-environment-variables
  namespace: <tenant-namespace>
type: Opaque
stringData:
  OTEL_EXPORTER_OTLP_ENDPOINT: "https://otel-backend.example.com/v1/traces"
  OTEL_EXPORTER_OTLP_HEADERS: "Authorization=Bearer <token>"
```

The controller discovers these Secrets at startup and routes traces based on the `query.namespace` attribute.
### Example Backend Configurations
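Some backends expect a pre-encoded token rather than raw credentials. For example, Langfuse's Basic auth value is the base64 encoding of `<public key>:<secret key>`; the keys below are placeholders:

```shell
# Placeholder keys - substitute your real Langfuse key pair.
pk="pk-lf-example"
sk="sk-lf-example"
token=$(printf '%s:%s' "$pk" "$sk" | base64)
echo "OTEL_EXPORTER_OTLP_HEADERS=Authorization=Basic $token"
```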
Langfuse:

```yaml
stringData:
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://langfuse.svc:3000/api/public/otel"
  OTEL_EXPORTER_OTLP_HEADERS: "Authorization=Basic <base64(pk:sk)>"
```

Phoenix (Arize):

```yaml
stringData:
  OTEL_EXPORTER_OTLP_ENDPOINT: "https://app.phoenix.arize.com/v1/traces"
  OTEL_EXPORTER_OTLP_HEADERS: "api_key=<phoenix_api_key>"
```

Opik:

```yaml
stringData:
  OTEL_EXPORTER_OTLP_ENDPOINT: "https://www.comet.com/opik/api/v1/private/otel"
  OTEL_EXPORTER_OTLP_HEADERS: "Authorization=<opik_api_key>,Comet-Workspace=<workspace_name>,projectName=<project_name>"
```

Honeycomb:

```yaml
stringData:
  OTEL_EXPORTER_OTLP_ENDPOINT: "https://api.honeycomb.io/v1/traces"
  OTEL_EXPORTER_OTLP_HEADERS: "x-honeycomb-team=<api_key>"
```

Jaeger:

```yaml
stringData:
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://jaeger-collector.svc:4318/v1/traces"
```

### Architecture with Per-Tenant Routing
When per-tenant OTEL routing is enabled, traces are routed based on the query’s namespace:
```text
              ARK Controller
                    │
      ┌─────────────┼─────────────┐
      │             │             │
      ▼             ▼             ▼
Primary OTEL   Tenant-A OTEL   Tenant-B OTEL
 (platform)     (Langfuse)       (Opik)
```

### Applying Changes
After creating or updating tenant OTEL Secrets, restart the controller to pick up new configurations:
```bash
kubectl rollout restart deployment/ark-controller -n ark-system
kubectl rollout status deployment/ark-controller -n ark-system --timeout=120s
```

## Automatic Injection of OTEL Configuration
One way to set up automatic OpenTelemetry configuration is through standardized ConfigMap and Secret references. This pattern allows any Kubernetes resource to automatically pick up OTEL environment variables when available:
```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: your-app
          envFrom:
            # Standard OTEL configuration - will be injected if available
            - configMapRef:
                name: otel-environment-variables
                optional: true
            - secretRef:
                name: otel-environment-variables
                optional: true
```

When you create or update the standardized `otel-environment-variables` ConfigMap and Secret, all deployments and pods that reference them must be restarted to pick up the new environment variables:
```bash
# Restart components to pick up changes
kubectl rollout restart deployment/ark-controller -n ark-system
```

## Service Name Configuration
You can optionally set the service name used for telemetry in your containers, using the `OTEL_SERVICE_NAME` variable:
```yaml
spec:
  template:
    spec:
      containers:
        - name: your-app
          env:
            - name: OTEL_SERVICE_NAME
              value: "my-custom-service"
```

## Additional OTEL Variables
These OpenTelemetry environment variables are also supported:
| Variable | Description | Example |
|---|---|---|
| `OTEL_RESOURCE_ATTRIBUTES` | Additional resource attributes | `environment=production,version=1.0` |
| `OTEL_EXPORTER_OTLP_TIMEOUT` | Request timeout in milliseconds | `30000` |
| `OTEL_PROPAGATORS` | Trace context propagation format | `tracecontext,baggage` |
| `OTEL_TRACES_SAMPLER` | Sampling strategy | `always_on`, `always_off`, `traceidratio` |
| `OTEL_TRACES_SAMPLER_ARG` | Sampler configuration | `0.1` (for 10% sampling) |
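As a rough illustration of how a `traceidratio` sampler behaves (real OTEL SDKs differ in exactly which bits of the trace ID they use), the decision can be modeled as comparing a value derived from the trace ID against the configured ratio:

```shell
# Rough model of a ratio-based sampler - not an OTEL SDK implementation.
# Interprets the low 32 bits of the trace ID as a bucket and keeps the
# trace when the bucket falls below ratio * 2^32.
is_sampled() {
  trace_id=$1
  ratio=$2
  bucket=$(( 0x$(printf '%s' "$trace_id" | tail -c 8) ))
  threshold=$(awk -v r="$ratio" 'BEGIN { printf "%.0f", r * 4294967296 }')
  [ "$bucket" -lt "$threshold" ]
}

is_sampled "4bf92f3577b34da6a3ce929d0e0e4736" 1.0 && echo "sampled"
```

Because the decision is a pure function of the trace ID, every service that sees the same ID makes the same choice, which keeps traces intact across service boundaries.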
Next: Learn about observability options:
- Phoenix Service - AI/ML model observability
- Langfuse Service - open-source LLM application/agent observability, evaluation, and prompt management
- Opik - open-source platform for LLM observability, evaluation, and prompt optimization