Monitoring
Verified against `monitoring/*` and `src/app/observability/*` on 2026-05-17.
Stack
Current monitoring profile (docker compose --profile monitoring) starts:
- Prometheus
- Grafana
- Jaeger
flowchart LR
API["FastAPI /metrics"] --> Prom["Prometheus"]
W["Worker metrics"] --> Prom
Prom --> Graf["Grafana"]
API -. "OTLP" .-> J["Jaeger"]
W -. "OTLP" .-> J
Startup
Start the services with the monitoring profile enabled (`docker compose --profile monitoring`).
Access points:
- Grafana: http://localhost:3000
- Prometheus: http://localhost:9090
- Jaeger: http://localhost:16686
Key metrics (implemented)
| Metric | Type | Labels | Meaning |
|---|---|---|---|
| `webhook_events_received_total` | Counter | event_type | incoming webhooks |
| `webhook_events_processed_total` | Counter | event_type, status | processing outcome |
| `webhook_processing_duration_seconds` | Histogram | event_type | processing latency |
| `gitlab_api_requests_total` | Counter | endpoint, status | GitLab API calls |
| `gitlab_api_duration_seconds` | Histogram | endpoint | GitLab API latency |
| `http_requests_total` | Counter | method, path, status | HTTP traffic |
| `http_request_duration_seconds` | Histogram | method, path | HTTP latency |
| `rq_queue_depth` | Gauge | queue_name | queue depth |
| `rq_failed_jobs_total` | Gauge | - | failed jobs |
| `rq_workers_active` | Gauge | - | active workers |
| `metrics_compute_total` | Counter | status | metric computations |
| `metrics_compute_duration_seconds` | Histogram | - | metric compute latency |
| `report_generation_duration_seconds` | Histogram | format | report generation latency |
| `db_pool_size` | Gauge | - | DB pool size |
| `db_pool_checked_out` | Gauge | - | used DB connections |
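
To make the table concrete, here is a minimal sketch of reading a few of these metrics out of the plain-text body that a `/metrics` endpoint returns. It uses only the standard library and a deliberately simplified parser for the Prometheus text exposition format. The sample text, the `default` queue name, and the helper names `parse_metrics` and `pool_utilization` are illustrative assumptions, not taken from the project code.

```python
import re

# One sample line of the text exposition format looks like:
#   name{label="value",...} 12.5
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, one per sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def pool_utilization(samples):
    """Fraction of DB pool connections in use, or None if a gauge is missing."""
    values = {name: value for name, _labels, value in samples}
    size = values.get("db_pool_size")
    used = values.get("db_pool_checked_out")
    if not size or used is None:
        return None
    return used / size


# Hypothetical scrape output, trimmed to three of the gauges listed above.
SAMPLE_TEXT = """\
# TYPE rq_queue_depth gauge
rq_queue_depth{queue_name="default"} 7
# TYPE db_pool_size gauge
db_pool_size 10
# TYPE db_pool_checked_out gauge
db_pool_checked_out 4
"""

samples = parse_metrics(SAMPLE_TEXT)
print(samples[0])                 # name, labels and value of the first sample
print(pool_utilization(samples))  # share of the pool currently checked out
```

A real client library would handle escaping, timestamps, and histogram series more carefully; the point here is only how metric names and labels from the table appear on the wire.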
Grafana dashboard
Provisioned dashboard: `monitoring/grafana/dashboards/gitpulse.json`
Main panels:
- Webhook rate (received/processed)
- Processing latency (p50/p95/p99)
- Queue depth
- Success/failure rate
- DB pool utilization
- HTTP request rate
- GitLab API latency
- Report generation time
- Active workers
- Metrics compute duration
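
The p50/p95/p99 latency panel is presumably computed from histogram buckets such as those of `webhook_processing_duration_seconds`. As a rough illustration of how a quantile can be estimated from cumulative buckets, the sketch below applies linear interpolation within the matching bucket, a simplified version of the idea behind PromQL's `histogram_quantile`. The bucket bounds and counts are hypothetical, and the function is not the dashboard's actual query.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not buckets or not 0 <= q <= 1:
        raise ValueError("need at least one bucket and 0 <= q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the
                # highest finite bound instead of extrapolating.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative bucket counts (upper bound in seconds, count).
BUCKETS = [(0.5, 60), (1.0, 85), (2.5, 95), (5.0, 99), (10.0, 100), (math.inf, 100)]

for q in (0.5, 0.95, 0.99):
    print(f"p{int(q * 100)} ~ {histogram_quantile(q, BUCKETS):.3f} s")
```

Because the estimate interpolates inside a bucket, its accuracy depends on how finely the bucket bounds are spaced around the latencies of interest.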
Alerts
Alert rules: `monitoring/prometheus/alerts.yml`
| Alert | Condition |
|---|---|
| QueueStuck | `rq_queue_depth > 50` for 5 minutes |
| HighFailureRate | `rate(webhook_events_processed_total{status="error"}[5m]) > 0.1` |
| DatabaseDown | `up{job="gitpulse-api"} == 0` |
| HighProcessingLatency | webhook p95 latency > 10 s |
| WorkerDown | `rq_workers_active == 0` |
| DBPoolExhausted | `db_pool_checked_out / db_pool_size > 0.9` |
| HighHTTPErrorRate | 5xx share > 5% |
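
To show how a few of these conditions behave, here is a small sketch that evaluates them against in-memory snapshots of the gauges. The thresholds come from the table; the `Snapshot` type, the function names, and the sample data are hypothetical, and Prometheus itself evaluates the real rules from `alerts.yml`. The `queue_stuck` function mimics the "for 5 minutes" behaviour: the condition must hold continuously for the whole window before the alert fires.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One scrape's worth of the gauges used below (hypothetical container)."""
    timestamp: float          # seconds
    queue_depth: float        # rq_queue_depth
    workers_active: float     # rq_workers_active
    pool_size: float          # db_pool_size
    pool_checked_out: float   # db_pool_checked_out


def instant_alerts(s):
    """Alerts whose condition depends only on the latest snapshot."""
    firing = []
    if s.workers_active == 0:
        firing.append("WorkerDown")
    if s.pool_size > 0 and s.pool_checked_out / s.pool_size > 0.9:
        firing.append("DBPoolExhausted")
    return firing


def queue_stuck(history, threshold=50, hold_seconds=300):
    """True if queue depth stayed above threshold for at least hold_seconds.

    history is assumed to be ordered oldest first.
    """
    breach_start = None
    for s in history:
        if s.queue_depth > threshold:
            if breach_start is None:
                breach_start = s.timestamp
        else:
            breach_start = None  # any dip resets the pending window
    return breach_start is not None and history[-1].timestamp - breach_start >= hold_seconds


# Hypothetical data: depth 60 at every scrape, one scrape per minute.
steady = [Snapshot(t, 60, 2, 10, 3) for t in range(0, 301, 60)]
# Same series, but the queue drains briefly at t=120.
dipped = [Snapshot(t, 10 if t == 120 else 60, 2, 10, 3) for t in range(0, 301, 60)]

print(queue_stuck(steady))   # breach lasted the full 300 s
print(queue_stuck(dipped))   # the dip restarted the window
print(instant_alerts(Snapshot(0, 0, 0, 10, 10)))
```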
Operational checklist
- `/health` and `/metrics` return 200.
- `http://localhost:9090/targets` shows `gitpulse-api` as UP.
- Grafana datasource points to `http://prometheus:9090`.
- During incidents, correlate `correlation_id` across logs, metrics, and traces.
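
The first two checklist items can be automated. The sketch below is one possible way to do it with the standard library: it requests `/health` and `/metrics`, then asks Prometheus for its active targets through the `/api/v1/targets` HTTP API and looks for a healthy `gitpulse-api` job. The function names are hypothetical, and the API base URL is an assumption because the page does not state the API port.

```python
import json
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code of a GET request, including 4xx/5xx codes."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def http_json(url, timeout=5.0):
    """Fetch a URL and decode its body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def target_is_up(targets_payload, job):
    """True if the targets response lists a healthy target for the given job."""
    active = targets_payload.get("data", {}).get("activeTargets", [])
    return any(
        t.get("labels", {}).get("job") == job and t.get("health") == "up"
        for t in active
    )


def run_checklist(api_base, prom_base, get_status=http_status, get_json=http_json):
    """Return {check name: passed} for the automatable checklist items.

    get_status and get_json are injectable so the logic can be tested offline.
    """
    results = {}
    for path in ("/health", "/metrics"):
        results[path] = get_status(api_base + path) == 200
    payload = get_json(prom_base + "/api/v1/targets")
    results["gitpulse-api target UP"] = target_is_up(payload, "gitpulse-api")
    return results
```

A call might look like `run_checklist("http://localhost:8000", "http://localhost:9090")`, where port 8000 is only a placeholder for wherever the API actually listens. The Grafana datasource and `correlation_id` items still need a manual check.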