SLA & Monitoring

SalmAlm tracks performance metrics for every request — latency, token usage, cost, and error rates.

Metrics Tracked

Per-Request

  • Time to First Token (TTFT) — milliseconds until first streaming chunk
  • Total latency — end-to-end request time
  • Input/output tokens — per model
  • Estimated cost — based on model pricing
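
The two timing metrics above can be illustrated with a small sketch. This is not SalmAlm's internal code: `timed_stream` and `fake_stream` are hypothetical names, and the stream is simulated with `time.sleep` rather than a real provider call.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def timed_stream(chunks: Iterable[str]) -> Tuple[str, Optional[float], float]:
    """Consume a chunk stream and return (text, ttft_ms, total_ms).

    ttft_ms is None if the stream yields no chunks at all.
    """
    start = time.perf_counter()
    ttft_ms: Optional[float] = None
    parts = []
    for chunk in chunks:
        if ttft_ms is None:
            # First chunk arrived: this is the Time to First Token.
            ttft_ms = (time.perf_counter() - start) * 1000.0
        parts.append(chunk)
    # Total latency covers the whole request, including the last chunk.
    total_ms = (time.perf_counter() - start) * 1000.0
    return "".join(parts), ttft_ms, total_ms


def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming LLM response (hypothetical)."""
    time.sleep(0.05)
    yield "Hello"
    time.sleep(0.02)
    yield ", world"


text, ttft_ms, total_ms = timed_stream(fake_stream())
print(f"TTFT {ttft_ms:.0f} ms, total {total_ms:.0f} ms")
```

`time.perf_counter` is used because it is monotonic, so the measurements are not affected by system clock adjustments.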

Aggregate

  • Uptime — server start time, total runtime
  • Request count — total requests processed
  • Error rate — failed requests / total
  • Model distribution — which models handle which tiers

Cost Tracking

SalmAlm estimates costs using built-in pricing tables:

Model              Input ($/1M tokens)   Output ($/1M tokens)
Claude Haiku 3.5   $0.80                 $4.00
Claude Sonnet 4    $3.00                 $15.00
GPT-4o             $2.50                 $10.00
Gemini 2.5 Flash   $0.15                 $0.60
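
As a rough illustration of how a per-request estimate follows from these rates, the sketch below multiplies token counts by the per-million prices. The `PRICING` dictionary and `estimate_cost` function are hypothetical, and the keys are the display names from the table, not necessarily the identifiers SalmAlm uses internally.

```python
# Prices in USD per 1M tokens, copied from the table above:
# model name -> (input rate, output rate).
PRICING = {
    "Claude Haiku 3.5": (0.80, 4.00),
    "Claude Sonnet 4": (3.00, 15.00),
    "GPT-4o": (2.50, 10.00),
    "Gemini 2.5 Flash": (0.15, 0.60),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated request cost in USD."""
    input_rate, output_rate = PRICING[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# 12,000 input tokens and 800 output tokens on Claude Sonnet 4:
# (12,000 * 3.00 + 800 * 15.00) / 1,000,000 = 0.048 USD
cost = estimate_cost("Claude Sonnet 4", 12_000, 800)
print(f"${cost:.4f}")
```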

Cost Cap

Set a daily spending limit:

SALMALM_COST_CAP=5.00  # $5/day limit

When the cap is reached, all LLM calls are blocked with a CostCapExceeded error until the next day.
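
The behavior described above could be sketched as follows. This is an assumption-laden illustration, not SalmAlm's implementation: the `DailyCostGuard` class is hypothetical, the `CostCapExceeded` class here is a local stand-in for the error named above, the reset is assumed to happen when the local calendar date changes, and the `5.00` fallback is only a placeholder for when the variable is unset.

```python
import os
from datetime import date
from typing import Optional


class CostCapExceeded(Exception):
    """Local stand-in for the error raised once the daily cap is reached."""


class DailyCostGuard:
    """Tracks spend per calendar day and blocks calls at the cap (sketch)."""

    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        self._day: Optional[date] = None
        self._spent = 0.0

    def _roll_over(self, today: date) -> None:
        # Assumption: the counter resets when the calendar date changes.
        if today != self._day:
            self._day = today
            self._spent = 0.0

    def check(self, today: Optional[date] = None) -> None:
        """Call before each LLM request; raises if the cap is reached."""
        self._roll_over(today or date.today())
        if self._spent >= self.cap_usd:
            raise CostCapExceeded(
                f"daily cap ${self.cap_usd:.2f} reached (spent ${self._spent:.2f})"
            )

    def record(self, cost_usd: float, today: Optional[date] = None) -> None:
        """Call after each request with its estimated cost."""
        self._roll_over(today or date.today())
        self._spent += cost_usd


# The fallback value is a placeholder, not a documented default.
guard = DailyCostGuard(float(os.environ.get("SALMALM_COST_CAP", "5.00")))
guard.check()        # passes while under the cap
guard.record(0.048)  # add the cost of a finished request
```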

Commands

Command         Description
/usage          Token usage and cost summary
/usage tokens   Detailed token breakdown
/usage cost     Cost breakdown by model
/latency        Request latency statistics
/uptime         Server uptime

Web Dashboard

Settings → Usage & Monitoring shows:

  • Daily/monthly usage charts
  • Cost breakdown by model
  • Latency percentiles (p50, p95, p99)
  • Real-time request feed
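
For readers unfamiliar with latency percentiles, the sketch below computes p50, p95, and p99 from a list of samples using the nearest-rank method. Which method the dashboard uses is not stated, so nearest-rank is an assumption here, and `percentile` is a hypothetical helper.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Illustrative latencies in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

A p95 of 95 ms here means that 95% of the sampled requests finished in 95 ms or less.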

API Endpoints

GET /api/usage/daily    — Daily usage report
GET /api/usage/monthly  — Monthly aggregate
GET /api/metrics        — Prometheus-compatible metrics
GET /api/latency        — Latency statistics
GET /api/status         — Server health check
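
A minimal client sketch for these endpoints, using only the Python standard library. The base URL is a placeholder, the response format of each endpoint is not documented here (so the body is returned as raw text), and any authentication the server may require is not shown. `endpoint_url` and `fetch` are hypothetical helpers.

```python
import urllib.request

# Placeholder: replace with the address of your SalmAlm server.
BASE_URL = "http://localhost:8000"

# Paths taken from the endpoint list above.
ENDPOINTS = {
    "daily": "/api/usage/daily",
    "monthly": "/api/usage/monthly",
    "metrics": "/api/metrics",
    "latency": "/api/latency",
    "status": "/api/status",
}


def endpoint_url(name: str, base_url: str = BASE_URL) -> str:
    """Build the full URL for a named endpoint."""
    return base_url.rstrip("/") + ENDPOINTS[name]


def fetch(name: str, base_url: str = BASE_URL, timeout: float = 5.0) -> str:
    """GET an endpoint and return the response body as text."""
    with urllib.request.urlopen(endpoint_url(name, base_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


print(endpoint_url("status"))
# With a running server: print(fetch("status"))
```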

Alerting

SalmAlm logs warnings when:

  • P95 latency exceeds 10 s
  • Daily cost exceeds 80% of cap
  • Error rate exceeds 5%
  • Context window approaches model limit
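
The first three thresholds can be expressed as simple comparisons, sketched below. The `Snapshot` dataclass and `warnings_for` function are hypothetical, and the context-window check is omitted because no numeric threshold is given for it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    """Hypothetical point-in-time view of the aggregate metrics."""
    p95_latency_s: float
    daily_cost_usd: float
    cost_cap_usd: float
    failed_requests: int
    total_requests: int


def warnings_for(s: Snapshot) -> List[str]:
    """Return one warning message per threshold that is exceeded."""
    out = []
    if s.p95_latency_s > 10.0:
        out.append(f"p95 latency {s.p95_latency_s:.1f}s exceeds 10s")
    if s.cost_cap_usd > 0 and s.daily_cost_usd > 0.8 * s.cost_cap_usd:
        out.append(
            f"daily cost ${s.daily_cost_usd:.2f} exceeds 80% of "
            f"${s.cost_cap_usd:.2f} cap"
        )
    # Error rate = failed requests / total; skipped when there is no traffic.
    if s.total_requests > 0 and s.failed_requests / s.total_requests > 0.05:
        rate = s.failed_requests / s.total_requests
        out.append(f"error rate {rate:.1%} exceeds 5%")
    return out


for message in warnings_for(Snapshot(12.0, 4.50, 5.00, 3, 100)):
    print("WARNING:", message)
```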

Circuit Breaker

The built-in circuit breaker detects:

  • Infinite loops — 3+ identical (tool, args) in last 6 iterations
  • Provider failures — consecutive 5xx errors trigger backoff
  • Cost overruns — immediate halt on cap exceeded
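
The infinite-loop rule above (3 or more identical (tool, args) pairs within the last 6 iterations) can be sketched with a sliding window. This is an illustration under assumptions rather than the built-in breaker: `LoopDetector` is a hypothetical class, and arguments are compared by their JSON serialization with sorted keys.

```python
import json
from collections import Counter, deque
from typing import Any, Deque, Dict, Tuple


class LoopDetector:
    """Flags repeated identical tool calls inside a sliding window (sketch)."""

    def __init__(self, window: int = 6, threshold: int = 3) -> None:
        self.threshold = threshold
        # deque with maxlen drops the oldest call automatically.
        self._recent: Deque[Tuple[str, str]] = deque(maxlen=window)

    def record(self, tool: str, args: Dict[str, Any]) -> bool:
        """Record one tool call; return True if a loop is detected."""
        # Serialize args so that dicts with the same content compare equal.
        key = (tool, json.dumps(args, sort_keys=True, default=str))
        self._recent.append(key)
        return max(Counter(self._recent).values()) >= self.threshold


detector = LoopDetector()
calls = [
    ("search", {"q": "a"}),
    ("read", {"path": "x"}),
    ("search", {"q": "a"}),
    ("search", {"q": "a"}),  # third identical call within the window
]
for tool, args in calls:
    if detector.record(tool, args):
        print(f"loop detected on {tool}")
```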