Error Categories

Errors returned by pyproc fall into five categories. Each category below includes guidance on retry eligibility and SLO accounting.

1. Timeout Errors

Occur when a call does not complete within the time limit. Returned as *pyproc.TimeoutError, with the Kind field identifying the source of the timeout.

| Kind | Meaning |
| --- | --- |
| TimeoutKindContext | The deadline set on the caller's context.Context was exceeded |
| TimeoutKindPerCall | The per-call timeout specified at Call time was exceeded |
| TimeoutKindTransport | The default transport-layer timeout set via ProtocolConfig.RequestTimeout was exceeded |

Detection:

var te *pyproc.TimeoutError
if errors.As(err, &te) {
    switch te.Kind {
    case pyproc.TimeoutKindContext:
        // Caller context deadline exceeded
    case pyproc.TimeoutKindPerCall:
        // Per-call timeout exceeded
    case pyproc.TimeoutKindTransport:
        // Transport layer default timeout exceeded
    }
}
  • Retry: Yes (idempotent operations only)
  • SLO: Counted

2. Connection Errors

Occur when the Unix domain socket (UDS) connection to a worker fails, for example when the socket file is missing or the worker process has stopped.

Typical error messages:

  • "failed to connect to worker at <path> after <duration>" (timeout in ConnectToWorker)
  • "failed to connect to <path>: <cause>" (connection failure in Pool.connect)
  • "failed to connect: <cause>" (connection failure within Pool.Call)

Detection:

// Connection errors are detected via string matching
if strings.Contains(err.Error(), "failed to connect") {
    // Connection error
}

Note: String-based error detection is fragile. Wrapping detection in a helper function is recommended so that call sites can be updated in one place when typed errors become available.

// Recommended: wrap detection in a helper
func IsConnectionError(err error) bool {
    return err != nil && strings.Contains(err.Error(), "failed to connect")
}

Typed/sentinel errors for connection and protocol categories are planned for v1.0.

  • Retry: Yes (after worker restart)
  • SLO: Counted

3. Protocol Errors

Errors occurring at the wire protocol layer. Include framing failures (reading/writing message lengths) and JSON decode failures.

Typical error messages:

  • "failed to unmarshal response: <cause>" (JSON decode failure for response)
  • "response body is nil" (empty response body)
  • Framing errors (invalid message length, EOF during read)

Detection:

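// Matches the JSON decode and nil-body messages listed above.
// Framing errors have no fixed message here, so this check does not cover them.
if strings.Contains(err.Error(), "unmarshal") ||
    strings.Contains(err.Error(), "response body is nil") {
    // Protocol error
}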
  • Retry: No (possible data corruption)
  • SLO: Not counted (treat as a bug)

4. Worker Errors

Errors originating from the Python worker side. Correspond to the error returned by Response.Error() when the OK field of protocol.Response is false.

Typical cases:

  • Python-side exceptions (Response.OK == false, message in Response.ErrorMsg)
  • Worker process crash (connection is severed)

Detection:

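// Python-side exceptions are returned via resp.Error()
// Pool.Call returns this error as-is
err := pool.Call(ctx, "predict", req, &resp)
if err != nil {
    // Worker errors are identified by elimination:
    // rule out every other category first
    var te *pyproc.TimeoutError
    msg := err.Error()
    isOther := errors.As(err, &te) ||
        strings.Contains(msg, "failed to connect") ||
        strings.Contains(msg, "unmarshal") ||
        strings.Contains(msg, "response body is nil") ||
        strings.Contains(msg, "pool is shut down") ||
        strings.Contains(msg, "no healthy workers available")
    if !isOther {
        // Likely a worker error
    }
}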
  • Retry: Case-by-case (business logic errors: no, transient errors: yes)
  • SLO: Counted

5. Pool Lifecycle Errors

Errors related to Pool lifecycle. Occur when the Pool has been shut down or when no healthy workers are available.

Typical error messages:

  • "pool is shut down" (Call after Pool.Shutdown has been invoked)
  • "no healthy workers available" (all workers are unhealthy)

Detection:

// Use strings.Contains to handle wrapped errors
if strings.Contains(err.Error(), "pool is shut down") {
    // Pool needs to be recreated
}
if strings.Contains(err.Error(), "no healthy workers available") {
    // Wait for recovery via health check, or recreate Pool
}
  • Retry: No (recreate the Pool, or wait for health checks to restore workers)
  • SLO: Not counted (treat as an operational error)

Summary

| Category | Type / Detection | Retry | SLO |
| --- | --- | --- | --- |
| Timeout | errors.As(*TimeoutError) | Yes (idempotent only) | Counted |
| Connection | "failed to connect" string | Yes (after restart) | Counted |
| Protocol | "unmarshal" etc. string | No | Not counted |
| Worker | Errors not matching any other category | Case-by-case | Counted |
| Pool lifecycle | "pool is shut down" etc. string | No | Not counted |

Related documentation: