Monitoring
pyproc exposes Prometheus-compatible metrics via MetricsHandler and pool health checks via the Health() method. No external dependencies are required.
Metrics Endpoint Setup
Use MetricsHandler to serve metrics in Prometheus text exposition format:
package main

import (
	"context"
	"log"
	"net/http"

	"github.com/YuminosukeSato/pyproc/pkg/pyproc"
)

func main() {
	// Create a pool of 4 Python workers with metrics collection enabled.
	pool, err := pyproc.NewPoolWithMetrics(pyproc.PoolOptions{
		Config: pyproc.PoolConfig{Workers: 4, MaxInFlight: 10, MaxInFlightPerWorker: 1},
		WorkerConfig: pyproc.WorkerConfig{
			PythonExec:   "python3",
			WorkerScript: "worker.py",
			SocketPath:   "/tmp/pyproc",
		},
	}, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Start the workers before exposing the endpoint.
	ctx := context.Background()
	if err := pool.Start(ctx); err != nil {
		log.Fatal(err)
	}

	// Serve metrics at /metrics on port 9090.
	http.Handle("/metrics", pyproc.MetricsHandler(pool))
	log.Fatal(http.ListenAndServe(":9090", nil))
}
Available Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| pyproc_requests_total | counter | status (success, failed, timeout) | Total number of requests by outcome |
| pyproc_request_duration_seconds | gauge | quantile (0.5, 0.95, 0.99) | Request latency percentiles in seconds |
| pyproc_workers_total | gauge | | Total number of workers in the pool |
| pyproc_workers_healthy | gauge | | Number of healthy workers |
| pyproc_inflight_requests | gauge | | Number of in-flight requests |
| pyproc_worker_restarts_total | counter | | Total worker restarts |
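For a quick smoke test of the endpoint, the text exposition format can be parsed with a few lines of Go. The sketch below assumes the standard layout of one sample per line, with the value as the last field. The helper parseSamples is hypothetical, and the sample payload and its numbers are invented for illustration; they are not captured pyproc output.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseSamples reads Prometheus text exposition lines and returns a map from
// the series (metric name plus label set, verbatim) to its value.
// Comment lines (# HELP, # TYPE) and blank lines are skipped.
// Simplification: optional timestamps and label values containing spaces
// are not handled.
func parseSamples(text string) (map[string]float64, error) {
	samples := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The value is the last space-separated field on the line.
		idx := strings.LastIndexByte(line, ' ')
		if idx < 0 {
			return nil, fmt.Errorf("malformed sample line: %q", line)
		}
		v, err := strconv.ParseFloat(line[idx+1:], 64)
		if err != nil {
			return nil, fmt.Errorf("bad value in %q: %w", line, err)
		}
		samples[strings.TrimSpace(line[:idx])] = v
	}
	return samples, sc.Err()
}

// sample is made-up data using metric names from the table above.
const sample = `# TYPE pyproc_requests_total counter
pyproc_requests_total{status="success"} 120
pyproc_requests_total{status="failed"} 3
# TYPE pyproc_workers_healthy gauge
pyproc_workers_healthy 4
`

func main() {
	samples, err := parseSamples(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(samples[`pyproc_requests_total{status="failed"}`])
	fmt.Println(samples["pyproc_workers_healthy"])
}
```

For anything beyond a smoke test, a full exposition-format parser is the safer choice, since this sketch ignores timestamps and escaping.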
Health Checks
Use pool.Health() to get a snapshot of pool health:
health := pool.Health()
fmt.Printf("Total: %d, Healthy: %d, LastCheck: %s\n",
	health.TotalWorkers, health.HealthyWorkers, health.LastCheck)
Expose as an HTTP health endpoint:
http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
	health := pool.Health()
	if health.HealthyWorkers == 0 {
		w.WriteHeader(http.StatusServiceUnavailable)
		fmt.Fprintf(w, "unhealthy: 0/%d workers healthy\n", health.TotalWorkers)
		return
	}
	fmt.Fprintf(w, "ok: %d/%d workers healthy\n",
		health.HealthyWorkers, health.TotalWorkers)
})
Prometheus Configuration
Add a scrape target in prometheus.yml:
scrape_configs:
  - job_name: "pyproc"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:9090"]
Grafana Dashboard
Import the following JSON into Grafana as a dashboard to visualize request rate, latency percentiles, worker health, and in-flight requests:
{
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "rate(pyproc_requests_total[1m])",
          "legendFormat": "{{status}}"
        }
      ]
    },
    {
      "title": "Latency Percentiles",
      "type": "timeseries",
      "targets": [
        {
          "expr": "pyproc_request_duration_seconds{quantile=\"0.5\"}",
          "legendFormat": "p50"
        },
        {
          "expr": "pyproc_request_duration_seconds{quantile=\"0.95\"}",
          "legendFormat": "p95"
        },
        {
          "expr": "pyproc_request_duration_seconds{quantile=\"0.99\"}",
          "legendFormat": "p99"
        }
      ]
    },
    {
      "title": "Worker Health",
      "type": "gauge",
      "targets": [
        {
          "expr": "pyproc_workers_healthy / pyproc_workers_total",
          "legendFormat": "health ratio"
        }
      ]
    },
    {
      "title": "In-Flight Requests",
      "type": "timeseries",
      "targets": [
        {
          "expr": "pyproc_inflight_requests",
          "legendFormat": "inflight"
        }
      ]
    }
  ]
}
Alerting Rules
Example Prometheus alerting rules for pyproc:
groups:
  - name: pyproc
    rules:
      - alert: PyProcNoHealthyWorkers
        expr: pyproc_workers_healthy == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "No healthy pyproc workers"
          description: "All pyproc workers are unhealthy for more than 1 minute."
      - alert: PyProcHighErrorRate
        expr: sum(rate(pyproc_requests_total{status="failed"}[5m])) / sum(rate(pyproc_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "pyproc error rate above 5%"
          description: "Request failure rate is {{ $value | humanizePercentage }}."
      - alert: PyProcHighLatency
        expr: pyproc_request_duration_seconds{quantile="0.99"} > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "pyproc p99 latency above 500ms"
          description: "p99 latency is {{ $value }}s."
      - alert: PyProcWorkerRestarts
        expr: rate(pyproc_worker_restarts_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "pyproc workers restarting frequently"
          description: "Worker restart rate is {{ $value }}/s."
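To make the PyProcHighErrorRate threshold concrete, the following sketch reproduces its arithmetic on two hypothetical counter readings taken five minutes apart. It is a simplification: PromQL's rate() also handles counter resets and extrapolates to the window boundaries, which this code ignores. The type counters, the function errorRatio, and all numbers are illustrative.

```go
package main

import "fmt"

// counters holds cumulative pyproc_requests_total readings keyed by the
// status label.
type counters map[string]float64

// errorRatio approximates
//
//	sum(rate(failed)) / sum(rate(all))
//
// from two cumulative readings. Both rates share the same window, so the
// window length cancels and the ratio of deltas is enough.
// ok is false when no requests were observed (PromQL would yield NaN,
// which never satisfies the alert condition).
func errorRatio(before, after counters) (ratio float64, ok bool) {
	var total float64
	for status, v := range after {
		total += v - before[status]
	}
	if total <= 0 {
		return 0, false
	}
	failed := after["failed"] - before["failed"]
	return failed / total, true
}

func main() {
	before := counters{"success": 1000, "failed": 10, "timeout": 5}
	after := counters{"success": 1180, "failed": 25, "timeout": 10}

	// Deltas: 180 success, 15 failed, 5 timeout, so 15 of 200 requests failed.
	ratio, ok := errorRatio(before, after)
	fmt.Printf("ratio=%.3f ok=%v alert=%v\n", ratio, ok, ok && ratio > 0.05)
}
```

Note that requests with status timeout count toward the denominator but not the numerator, matching the rule as written; whether timeouts should also count as errors is a choice for each deployment.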
See the Operations Guide for deployment and runtime configuration details.