
Monitoring

Verified against monitoring/* and src/app/observability/* on 2026-05-17.

Stack

The monitoring profile (docker compose --profile monitoring) starts three services:

  • Prometheus
  • Grafana
  • Jaeger
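
Prometheus scrapes metrics from the API and the worker and serves as the Grafana datasource. Both services also export traces to Jaeger over OTLP.

flowchart LR
    API["FastAPI /metrics"] --> Prom["Prometheus"]
    W["Worker metrics"] --> Prom
    Prom --> Graf["Grafana"]
    API -. "OTLP" .-> J["Jaeger"]
    W -. "OTLP" .-> J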

Startup

Bash
docker compose --profile monitoring up -d

Access points:

  • Grafana: http://localhost:3000
  • Prometheus: http://localhost:9090
  • Jaeger: http://localhost:16686

Key metrics (implemented)

| Metric | Type | Labels | Meaning |
|---|---|---|---|
| webhook_events_received_total | Counter | event_type | incoming webhooks |
| webhook_events_processed_total | Counter | event_type, status | processing outcome |
| webhook_processing_duration_seconds | Histogram | event_type | processing latency |
| gitlab_api_requests_total | Counter | endpoint, status | GitLab API calls |
| gitlab_api_duration_seconds | Histogram | endpoint | GitLab API latency |
| http_requests_total | Counter | method, path, status | HTTP traffic |
| http_request_duration_seconds | Histogram | method, path | HTTP latency |
| rq_queue_depth | Gauge | queue_name | queue depth |
| rq_failed_jobs_total | Gauge | - | failed jobs |
| rq_workers_active | Gauge | - | active workers |
| metrics_compute_total | Counter | status | metric computations |
| metrics_compute_duration_seconds | Histogram | - | metric compute latency |
| report_generation_duration_seconds | Histogram | format | report generation latency |
| db_pool_size | Gauge | - | DB pool size |
| db_pool_checked_out | Gauge | - | used DB connections |
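
One way to confirm that the metrics listed above are actually exposed is to read the "# TYPE" lines in the /metrics output. The sketch below is an illustration under assumptions, not part of the project: the function names, the EXPECTED_TYPES subset, and the SAMPLE text are all hypothetical. Depending on the client library and exposition format, the TYPE line for a counter may be declared without the _total suffix, so treat a counter mismatch as a prompt to look at the raw output.

```python
# Hypothetical helper: compare documented metric types against
# the "# TYPE <name> <type>" lines of a Prometheus text exposition.

# Subset of the documented metrics, chosen for illustration only.
EXPECTED_TYPES = {
    "webhook_events_received_total": "counter",
    "webhook_processing_duration_seconds": "histogram",
    "rq_queue_depth": "gauge",
    "db_pool_size": "gauge",
}


def parse_metric_types(exposition_text: str) -> dict[str, str]:
    """Return {metric name: type} collected from '# TYPE' lines."""
    types: dict[str, str] = {}
    for line in exposition_text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            types[parts[2]] = parts[3]
    return types


def find_mismatches(exposition_text: str, expected=EXPECTED_TYPES) -> dict:
    """Return {name: (expected type, actual type or None)} for every mismatch."""
    actual = parse_metric_types(exposition_text)
    return {
        name: (want, actual.get(name))
        for name, want in expected.items()
        if actual.get(name) != want
    }


# Made-up sample; real output comes from GET /metrics.
SAMPLE = """\
# HELP rq_queue_depth Jobs waiting in the queue
# TYPE rq_queue_depth gauge
rq_queue_depth{queue_name="default"} 3
# TYPE webhook_events_received_total counter
webhook_events_received_total{event_type="push"} 12
# TYPE webhook_processing_duration_seconds histogram
"""

if __name__ == "__main__":
    # db_pool_size is absent from SAMPLE, so it is reported as missing.
    print(find_mismatches(SAMPLE))
```

An empty result means every expected metric was declared with the documented type.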

Grafana dashboard

Provisioned dashboard:

  • monitoring/grafana/dashboards/gitpulse.json

Main panels:

  1. Webhook rate (received/processed)
  2. Processing latency (p50/p95/p99)
  3. Queue depth
  4. Success/failure rate
  5. DB pool utilization
  6. HTTP request rate
  7. GitLab API latency
  8. Report generation time
  9. Active workers
  10. Metrics compute duration
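
The percentile panels (p50/p95/p99) are estimates derived from histogram buckets, not exact values. The sketch below mirrors the linear interpolation that Prometheus's histogram_quantile applies to classic histograms, to show why the result depends on bucket boundaries. The bucket bounds and counts are invented for illustration; the real buckets of webhook_processing_duration_seconds are not stated in this document.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper_bound,
    ending with the +Inf bucket, as in Prometheus "le" buckets.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into +Inf: fall back to the
                # upper bound of the highest finite bucket.
                return prev_bound
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented latency buckets in seconds: (le, cumulative count).
EXAMPLE_BUCKETS = [(0.5, 60), (1.0, 80), (5.0, 95), (10.0, 99), (math.inf, 100)]

if __name__ == "__main__":
    for q in (0.5, 0.95, 0.99):
        print(q, histogram_quantile(q, EXAMPLE_BUCKETS))
```

With these example buckets the p95 estimate lands on the 5 s bucket boundary. Coarse buckets around an alert threshold therefore make the estimate jumpy, which is worth remembering when reading the latency panel.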

Alerts

Alert rules: monitoring/prometheus/alerts.yml

| Alert | Condition |
|---|---|
| QueueStuck | rq_queue_depth > 50 for 5 minutes |
| HighFailureRate | rate(webhook_events_processed_total{status="error"}[5m]) > 0.1 |
| DatabaseDown | up{job="gitpulse-api"} == 0 |
| HighProcessingLatency | webhook p95 latency > 10 s |
| WorkerDown | rq_workers_active == 0 |
| DBPoolExhausted | db_pool_checked_out / db_pool_size > 0.9 |
| HighHTTPErrorRate | 5xx share > 5% |
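
Two of the conditions above are ratios, and the arithmetic can be sketched in a few lines. This is only an illustration of the thresholds: the function names and the per-status request counts are hypothetical, and the actual rules in monitoring/prometheus/alerts.yml are PromQL expressions that most likely work on rates over a time window, not raw counts.

```python
def db_pool_exhausted(checked_out: float, pool_size: float,
                      threshold: float = 0.9) -> bool:
    """DBPoolExhausted: db_pool_checked_out / db_pool_size > 0.9."""
    if pool_size <= 0:
        return False  # no pool reported; nothing to compare against
    return checked_out / pool_size > threshold


def http_5xx_share(requests_by_status: dict[str, float]) -> float:
    """Fraction of requests whose status label starts with '5'."""
    total = sum(requests_by_status.values())
    if total == 0:
        return 0.0
    errors = sum(n for status, n in requests_by_status.items()
                 if status.startswith("5"))
    return errors / total


def high_http_error_rate(requests_by_status: dict[str, float],
                         threshold: float = 0.05) -> bool:
    """HighHTTPErrorRate: 5xx share > 5%."""
    return http_5xx_share(requests_by_status) > threshold


if __name__ == "__main__":
    # Hypothetical request counts per status label over some window.
    window = {"200": 90, "404": 4, "500": 4, "503": 2}
    print(db_pool_exhausted(19, 20))     # 19 of 20 connections in use
    print(high_http_error_rate(window))  # 6 of 100 requests are 5xx
```

Both comparisons are strict, so a pool at exactly 90% utilization or a 5xx share of exactly 5% does not fire.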

Operational checklist

  1. /health and /metrics return 200.
  2. http://localhost:9090/targets shows gitpulse-api as UP.
  3. Grafana datasource points to http://prometheus:9090.
  4. During incidents, use correlation_id to tie related logs, metrics, and traces together.
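
Steps 1 and 2 of the checklist can be scripted. The sketch below uses only the Python standard library and reads the Prometheus HTTP API endpoint /api/v1/targets. The function names are hypothetical, and the API base URL is left as a parameter because this document does not state the API port.

```python
import json
from urllib.request import urlopen


def returns_200(url: str, timeout: float = 5.0) -> bool:
    """True if a GET on url answers with HTTP 200."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError/HTTPError and connection failures
        return False


def job_is_up(targets_payload: dict, job: str) -> bool:
    """True if the job has active targets and all of them report health 'up'."""
    active = targets_payload.get("data", {}).get("activeTargets", [])
    matching = [t for t in active if t.get("labels", {}).get("job") == job]
    return bool(matching) and all(t.get("health") == "up" for t in matching)


def run_checklist(api_base: str,
                  prometheus_base: str = "http://localhost:9090") -> dict[str, bool]:
    """Run checklist steps 1 and 2; api_base is an assumption supplied by the caller."""
    with urlopen(f"{prometheus_base}/api/v1/targets", timeout=5.0) as resp:
        targets = json.load(resp)
    return {
        "/health": returns_200(f"{api_base}/health"),
        "/metrics": returns_200(f"{api_base}/metrics"),
        "gitpulse-api target": job_is_up(targets, "gitpulse-api"),
    }


# Trimmed example of the /api/v1/targets response shape, for offline testing.
SAMPLE_TARGETS = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "gitpulse-api"}, "health": "up"},
            {"labels": {"job": "other-job"}, "health": "down"},
        ]
    },
}
```

Calling run_checklist with the API's base URL returns one boolean per check. job_is_up can be exercised offline against SAMPLE_TARGETS.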