Failure Behavior

This document describes the failure handling mechanisms of pyproc, including the timeout hierarchy, retry strategy, backpressure, and SLO definition templates.

Timeout Hierarchy

pyproc has a timeout system based on the effectiveDeadline function, which selects the earliest applicable deadline.

Currently two layers are active:

Layer 1: Context deadline (set by the caller via context.WithTimeout)
Layer 2: Transport default (ProtocolConfig.RequestTimeout)

The effectiveDeadline function also accepts a per-call timeout parameter, but Call does not currently expose an option to set it. Per-call timeouts are achieved by wrapping the context with context.WithTimeout.

Priority

The earliest (most restrictive) deadline is applied.

                 Context deadline
                     |
                     v
  effectiveDeadline --> select earliest deadline --> TimeoutError{Kind: winner}
                     ^
Transport default ---+

Example

// Layer 1: Context deadline (5 seconds from now)
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()

// Layer 2: Transport default = ProtocolConfig.RequestTimeout (default 60s)

// Result: Context's 5s is the earliest, so timeout occurs at 5 seconds
// The returned TimeoutError.Kind is TimeoutKindContext
err := pool.Call(ctx, "predict", req, &resp)

Configuration for Each Layer

Layer      Configuration Method                          Default Value
Context    context.WithTimeout / context.WithDeadline    None
Transport  ProtocolConfig.RequestTimeout                 60 seconds

Worker Crash Behavior

When a worker process terminates abnormally, the pool marks that worker as unhealthy and routes subsequent requests to other healthy workers.

Current Behavior

Worker crash detected
        |
        v
  Worker marked unhealthy
        |
        v
  Traffic routed to remaining healthy workers

Automatic restart is not yet implemented. The RestartConfig struct exists in the configuration but is not wired into the pool health loop. Callers should monitor Pool.Health() and recreate the pool if too many workers become unhealthy.
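Until automatic restart lands, the recreate decision can be made from the health snapshot. The sketch below assumes only the HealthyWorkers and TotalWorkers fields used elsewhere in this document; the struct name and threshold policy are illustrative:

```go
package main

import "fmt"

// healthStatus mirrors the fields this document uses from Pool.Health()
// (HealthyWorkers / TotalWorkers); the real HealthStatus may carry more.
type healthStatus struct {
	HealthyWorkers int
	TotalWorkers   int
}

// shouldRecreate reports whether availability has fallen below a
// caller-chosen floor, meaning the pool should be torn down and rebuilt.
func shouldRecreate(s healthStatus, minAvailability float64) bool {
	if s.TotalWorkers == 0 {
		return true
	}
	availability := float64(s.HealthyWorkers) / float64(s.TotalWorkers)
	return availability < minAvailability
}

func main() {
	// 3 of 4 workers healthy = 75% availability: fine with a 50% floor,
	// but below an 80% floor the pool would be recreated.
	s := healthStatus{HealthyWorkers: 3, TotalWorkers: 4}
	fmt.Println(shouldRecreate(s, 0.5), shouldRecreate(s, 0.8)) // prints "false true"
}
```

A watchdog goroutine can poll Pool.Health() periodically and apply this check.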

Planned: Automatic Restart (not yet implemented)

RestartConfig defines parameters for automatic worker restart with exponential backoff. This functionality is planned but not yet active.

Parameter       Type           Default  Description
MaxAttempts     int            5        Maximum number of restart attempts
InitialBackoff  time.Duration  1000ms   Initial wait time
MaxBackoff      time.Duration  30000ms  Maximum wait time
Multiplier      float64        2.0      Backoff multiplier

Backpressure

The Pool uses a global semaphore and a per-worker gate to limit concurrency.

Semaphore Mechanism

semaphore := make(chan struct{}, MaxInFlight)
// per-worker gate
inflightGate := make(chan struct{}, MaxInFlightPerWorker)

  • Workers: Number of worker processes (default: 4)
  • MaxInFlight: Maximum concurrent requests across the pool (default: 10)
  • MaxInFlightPerWorker: Maximum in-flight requests per worker (default: 1)
  • Effective max concurrency = min(MaxInFlight, Workers * MaxInFlightPerWorker)

Behavior When Semaphore Is Full

When the semaphore is full, Pool.Call blocks until a slot becomes free or the context is cancelled, whichever happens first.

select {
case p.semaphore <- struct{}{}:
    // Semaphore acquired, execute request
    defer func() { <-p.semaphore }()
case <-ctx.Done():
    // Context cancelled (including timeout)
    return ctx.Err()
}

Callers can control the maximum wait time for backpressure by setting a timeout on the context.

Capacity Planning

Workers  MaxInFlight  MaxInFlightPerWorker  Max Concurrent Requests
4        10           1                     4
8        10           1                     8
4        40           1                     4
4        40           2                     8

Requests exceeding min(MaxInFlight, Workers * MaxInFlightPerWorker) are queued until capacity becomes available.
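Each table row follows directly from the min formula, which can be written as a small helper (the function name is illustrative, not part of the pyproc API):

```go
package main

import "fmt"

// effectiveConcurrency computes the pool-wide concurrency ceiling:
// min(MaxInFlight, Workers * MaxInFlightPerWorker).
func effectiveConcurrency(workers, maxInFlight, maxInFlightPerWorker int) int {
	perWorkerTotal := workers * maxInFlightPerWorker
	if maxInFlight < perWorkerTotal {
		return maxInFlight
	}
	return perWorkerTotal
}

func main() {
	// The capacity-planning rows above:
	fmt.Println(effectiveConcurrency(4, 10, 1)) // prints "4"
	fmt.Println(effectiveConcurrency(8, 10, 1)) // prints "8"
	fmt.Println(effectiveConcurrency(4, 40, 1)) // prints "4"
	fmt.Println(effectiveConcurrency(4, 40, 2)) // prints "8"
}
```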

SLO Definition Template

A template for defining SLOs in pyproc-based services.

Availability

healthy_workers / total_workers >= X%

Measured using the HealthStatus returned by Pool.Health().

status := pool.Health()
availability := float64(status.HealthyWorkers) / float64(status.TotalWorkers)
// Example: availability >= 0.75 (75%)

Latency

p99 < Y ms

Measured as the execution time of Pool.Call. Benchmark targets:

  • p50: < 100us
  • p99: < 500us
  • Payload: < 100KB JSON, 8 worker configuration

Error Rate

failed_calls / total_calls < Z%

Error categories counted toward SLO (see Error Categories):

  • Timeout errors: Counted
  • Connection errors: Counted
  • Worker errors: Counted
  • Protocol errors: Not counted (bug)
  • Pool lifecycle errors: Not counted (operational error)
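Given per-category counters, the SLO numerator sums only the counted categories while the excluded ones are tracked separately. A sketch of that bookkeeping (the struct and field names mirror the list above and are illustrative; the classification hooks are up to the caller):

```go
package main

import "fmt"

// errorCounts tracks failures per category. Protocol and pool
// lifecycle errors are excluded from the SLO numerator per the
// categories above.
type errorCounts struct {
	Timeout       int
	Connection    int
	Worker        int
	Protocol      int // bug, not counted toward SLO
	PoolLifecycle int // operational error, not counted toward SLO
}

// sloErrorRate returns failed_calls / total_calls using only the
// counted categories as failures.
func sloErrorRate(totalCalls int, c errorCounts) float64 {
	if totalCalls == 0 {
		return 0
	}
	counted := c.Timeout + c.Connection + c.Worker
	return float64(counted) / float64(totalCalls)
}

func main() {
	// 5 counted failures out of 1000 calls; the 2 protocol errors
	// do not move the SLO error rate.
	c := errorCounts{Timeout: 3, Connection: 1, Worker: 1, Protocol: 2}
	fmt.Println(sloErrorRate(1000, c)) // prints "0.005"
}
```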

Example Definition

slo:
  availability:
    target: 99.9%
    window: 30d
    metric: healthy_workers / total_workers
  latency:
    target_p99: 500us
    target_p50: 100us
    window: 30d
  error_rate:
    target: 0.1%
    window: 30d
    excluded:
      - protocol_errors
      - pool_lifecycle_errors
