Monitoring

Verified against monitoring/* and src/app/observability/* on 2026-05-07.

Stack

The current monitoring profile (docker compose --profile monitoring) starts:

  • Prometheus
  • Grafana
  • Jaeger

flowchart LR
    API["FastAPI /metrics"] --> Prom["Prometheus"]
    W["Worker metrics"] --> Prom
    Prom --> Graf["Grafana"]
    API -. "OTLP" .-> J["Jaeger"]
    W -. "OTLP" .-> J

Startup

Bash
docker compose --profile monitoring up -d

Access:

  • Grafana: http://localhost:3000
  • Prometheus: http://localhost:9090
  • Jaeger: http://localhost:16686

Key metrics (implemented)

| Metric | Type | Labels | Meaning |
|---|---|---|---|
| webhook_events_received_total | Counter | event_type | webhooks received |
| webhook_events_processed_total | Counter | event_type, status | processing outcome |
| webhook_processing_duration_seconds | Histogram | event_type | processing latency |
| gitlab_api_requests_total | Counter | endpoint, status | calls to the GitLab API |
| gitlab_api_duration_seconds | Histogram | endpoint | GitLab API latency |
| http_requests_total | Counter | method, path, status | HTTP request volume |
| http_request_duration_seconds | Histogram | method, path | HTTP latency |
| rq_queue_depth | Gauge | queue_name | queue depth |
| rq_failed_jobs_total | Gauge | - | number of failed jobs |
| rq_workers_active | Gauge | - | active workers |
| metrics_compute_total | Counter | status | metric computations |
| metrics_compute_duration_seconds | Histogram | - | metric computation time |
| report_generation_duration_seconds | Histogram | format | report generation time |
| db_pool_size | Gauge | - | DB pool size |
| db_pool_checked_out | Gauge | - | DB connections in use |
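
To spot-check these metrics without Grafana, a scrape of /metrics can be read directly. The sketch below is a simplified, standard-library parser for the Prometheus text exposition format; it is not the project's code. The `parse_samples` helper and the `EXAMPLE` payload (including its values) are hypothetical, and the parser deliberately ignores timestamps and escaped quotes in label values.

```python
import re

# Matches one sample line such as: rq_queue_depth{queue_name="default"} 3.0
# Simplified on purpose: lines with a trailing timestamp are skipped, and
# escaped quotes inside label values are not handled.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# Hypothetical scrape output; the values are made up for illustration.
EXAMPLE = """\
# HELP rq_queue_depth Depth of the queue
# TYPE rq_queue_depth gauge
rq_queue_depth{queue_name="default"} 3.0
# TYPE db_pool_size gauge
db_pool_size 10.0
# TYPE db_pool_checked_out gauge
db_pool_checked_out 4.0
"""

for name, labels, value in parse_samples(EXAMPLE):
    print(name, labels, value)
```

For anything beyond a quick check, a maintained parser from a Prometheus client library is the safer choice.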

Grafana dashboard

Provisioned dashboard:

  • monitoring/grafana/dashboards/gitpulse.json

Main panels:

  1. Webhook rate (received/processed)
  2. Processing latency (p50/p95/p99)
  3. Queue depth
  4. Success/failure rate
  5. DB pool utilization
  6. HTTP request rate
  7. GitLab API latency
  8. Report generation time
  9. Active workers
  10. Metrics compute duration
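
The panels above can also be reproduced ad hoc against the Prometheus HTTP API. The sketch below lists PromQL expressions that would plausibly back a few of them and builds an instant-query URL for each. The expressions, the 5-minute rate window, and the `PANEL_QUERIES` and `instant_query_url` names are assumptions; the queries actually stored in gitpulse.json may differ.

```python
from urllib.parse import urlencode

# Assumed PromQL per panel, built only from metric names in the table above.
PANEL_QUERIES = {
    "webhook_received_rate": "sum(rate(webhook_events_received_total[5m]))",
    "processing_latency_p95": (
        "histogram_quantile(0.95, sum by (le) "
        "(rate(webhook_processing_duration_seconds_bucket[5m])))"
    ),
    "queue_depth": "sum by (queue_name) (rq_queue_depth)",
    "db_pool_utilization": "db_pool_checked_out / db_pool_size",
}


def instant_query_url(expr, base="http://localhost:9090"):
    """Build a URL for the Prometheus instant-query endpoint /api/v1/query."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


for panel, expr in PANEL_QUERIES.items():
    print(panel, instant_query_url(expr))
```

Opening one of the printed URLs in a browser returns the raw JSON result, which is handy when a panel shows no data and the question is whether the query or the dashboard is at fault.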

Alerts

Alert rules: monitoring/prometheus/alerts.yml

| Alert | Condition |
|---|---|
| QueueStuck | rq_queue_depth > 50 for 5 min |
| HighFailureRate | rate(webhook_events_processed_total{status="error"}[5m]) > 0.1 |
| DatabaseDown | up{job="gitpulse-api"} == 0 |
| HighProcessingLatency | p95 webhook latency > 10 s |
| WorkerDown | rq_workers_active == 0 |
| DBPoolExhausted | db_pool_checked_out / db_pool_size > 0.9 |
| HighHTTPErrorRate | share of 5xx responses > 5% |
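
To make two of these conditions concrete, the sketch below evaluates them offline in plain Python. `histogram_quantile` follows the linear interpolation that PromQL's function of the same name applies to cumulative buckets, and `pool_exhausted` mirrors the DBPoolExhausted ratio. The bucket boundaries and counts are made-up example data, and the exact expressions in alerts.yml are not shown here, so treat this as an illustration rather than the rule definitions.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative (upper_bound, count) buckets.

    The bucket with the largest bound is expected to be +Inf, as in a
    Prometheus histogram.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls into the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


def pool_exhausted(checked_out, pool_size, threshold=0.9):
    """DBPoolExhausted condition: checked-out share strictly above the threshold."""
    return pool_size > 0 and checked_out / pool_size > threshold


# Made-up cumulative buckets for webhook_processing_duration_seconds.
example_buckets = [(1.0, 60), (5.0, 90), (10.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, example_buckets)
print(f"p95 = {p95:.3f} s, HighProcessingLatency firing: {p95 > 10}")
print("DBPoolExhausted firing:", pool_exhausted(19, 20))
```

With these example buckets the 95th-percentile rank lands in the 5 s to 10 s bucket, so the estimate stays below the 10 s threshold even though some requests took longer.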

Operational checklist

  1. /health and /metrics return 200.
  2. http://localhost:9090/targets shows gitpulse-api in the UP state.
  3. The Grafana datasource points to http://prometheus:9090.
  4. During incidents, correlate correlation_id across logs, metrics, and traces.
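
The first two checklist items can be scripted. The sketch below uses only the standard library: `http_status` probes an endpoint, and `job_is_up` inspects the JSON returned by Prometheus at /api/v1/targets. The function names are hypothetical, and the API base URL is left as a required argument because this page does not state the port the API listens on.

```python
import json
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code of a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered, but with an error status
    except OSError:
        return None  # connection refused, DNS failure, timeout, ...


def job_is_up(targets_payload, job):
    """True if the job has at least one active target and all report health "up"."""
    targets = [
        t for t in targets_payload["data"]["activeTargets"]
        if t["labels"].get("job") == job
    ]
    return bool(targets) and all(t["health"] == "up" for t in targets)


def run_checklist(api_base, prom_base="http://localhost:9090", job="gitpulse-api"):
    """Run checklist items 1 and 2 and return a dict of booleans."""
    results = {
        "health_200": http_status(f"{api_base}/health") == 200,
        "metrics_200": http_status(f"{api_base}/metrics") == 200,
    }
    try:
        with urllib.request.urlopen(f"{prom_base}/api/v1/targets", timeout=5.0) as resp:
            results["target_up"] = job_is_up(json.load(resp), job)
    except (OSError, ValueError, KeyError):
        results["target_up"] = False
    return results
```

Calling `run_checklist("http://<api-host>:<port>")` with the stack running returns a dictionary with one boolean per check. Items 3 and 4 still need a manual look in Grafana and in the logs.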