Error Categories
Errors returned by pyproc fall into five categories. Each category below comes with guidance on retry eligibility and SLO accounting.
1. Timeout Errors
Occur when a call does not complete within the time limit. Returned as *pyproc.TimeoutError, with the Kind field identifying the source of the timeout.
| Kind | Meaning |
|---|---|
| TimeoutKindContext | The deadline set on the caller's context.Context was exceeded |
| TimeoutKindPerCall | The per-call timeout specified at Call time was exceeded |
| TimeoutKindTransport | The default transport-layer timeout set via ProtocolConfig.RequestTimeout was exceeded |
Detection:
var te *pyproc.TimeoutError
if errors.As(err, &te) {
	switch te.Kind {
	case pyproc.TimeoutKindContext:
		// Caller context deadline exceeded
	case pyproc.TimeoutKindPerCall:
		// Per-call timeout exceeded
	case pyproc.TimeoutKindTransport:
		// Transport layer default timeout exceeded
	}
}
- Retry: Yes (idempotent operations only)
- SLO: Counted
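The retry guidance above can be sketched as a small wrapper. This is a minimal sketch under stated assumptions, not part of pyproc: `TimeoutError` below is a local stand-in for `*pyproc.TimeoutError` so that the example compiles on its own, and `retryOnTimeout`, its `idempotent` flag, `maxAttempts`, `flakyCall`, and `runDemo` are hypothetical names introduced for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// TimeoutError is a local stand-in for *pyproc.TimeoutError so the sketch
// compiles without the library; real code would use the pyproc type.
type TimeoutError struct{ Msg string }

func (e *TimeoutError) Error() string { return e.Msg }

// retryOnTimeout (hypothetical helper) runs call up to maxAttempts times,
// retrying only when the error is a timeout and the operation is idempotent.
func retryOnTimeout(ctx context.Context, idempotent bool, maxAttempts int, call func(context.Context) error) (attempts int, err error) {
	for attempts < maxAttempts {
		// An expired or canceled caller context cannot succeed on retry.
		if ctxErr := ctx.Err(); ctxErr != nil {
			if err == nil {
				err = ctxErr
			}
			return attempts, err
		}
		attempts++
		err = call(ctx)
		var te *TimeoutError
		if err == nil || !idempotent || !errors.As(err, &te) {
			return attempts, err
		}
	}
	return attempts, err
}

// flakyCall returns a call that fails with a wrapped timeout n times,
// then succeeds.
func flakyCall(n int) func(context.Context) error {
	return func(context.Context) error {
		if n > 0 {
			n--
			return fmt.Errorf("call failed: %w", &TimeoutError{Msg: "per-call timeout exceeded"})
		}
		return nil
	}
}

// runDemo wires the pieces together with a background context and a
// limit of three attempts.
func runDemo(idempotent bool, failures int) (int, error) {
	return retryOnTimeout(context.Background(), idempotent, 3, flakyCall(failures))
}

func main() {
	attempts, err := runDemo(true, 2)
	fmt.Println(attempts, err)
}
```

Checking `ctx.Err()` before each attempt reflects that a context whose deadline has passed stays expired, so retrying under the same context after a caller-deadline timeout cannot succeed. A production version would likely also add a backoff between attempts.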
2. Connection Errors
Occur when the Unix domain socket (UDS) connection to a worker fails, for example when the socket file is missing or the worker process has stopped.
Typical error messages:
- "failed to connect to worker at <path> after <duration>" (timeout in ConnectToWorker)
- "failed to connect to <path>: <cause>" (connection failure in Pool.connect)
- "failed to connect: <cause>" (connection failure within Pool.Call)
Detection:
// Connection errors are detected via string matching
if strings.Contains(err.Error(), "failed to connect") {
	// Connection error
}
Note: String-based error detection is fragile. Wrapping detection in a helper function is recommended so that call sites can be updated in one place when typed errors become available.
// Recommended: wrap detection in a helper
func IsConnectionError(err error) bool {
	return err != nil && strings.Contains(err.Error(), "failed to connect")
}
Typed/sentinel errors for connection and protocol categories are planned for v1.0.
- Retry: Yes (after worker restart)
- SLO: Counted
3. Protocol Errors
Errors occurring at the wire protocol layer. Include framing failures (reading/writing message lengths) and JSON decode failures.
Typical error messages:
- "failed to unmarshal response: <cause>" (JSON decode failure for response)
- "response body is nil" (empty response body)
- Framing errors (invalid message length, EOF during read)
Detection:
if strings.Contains(err.Error(), "unmarshal") ||
	strings.Contains(err.Error(), "response body is nil") {
	// Protocol error
}
- Retry: No (possible data corruption)
- SLO: Not counted (treat as a bug)
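As with connection errors, the string checks for protocol errors can be kept behind one helper so that call sites change in a single place once typed errors exist. The following is a minimal sketch: `IsProtocolError` and `isProtocolMessage` are hypothetical names, not pyproc functions, and the sketch matches only the two messages listed above because the text of framing errors is not specified here.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// isProtocolMessage (hypothetical) reports whether msg contains one of the
// protocol error messages documented above. Framing errors are not matched
// because their exact message text is not given in this section.
func isProtocolMessage(msg string) bool {
	return strings.Contains(msg, "unmarshal") ||
		strings.Contains(msg, "response body is nil")
}

// IsProtocolError (hypothetical) is the nil-safe entry point for call sites.
func IsProtocolError(err error) bool {
	return err != nil && isProtocolMessage(err.Error())
}

func main() {
	// Wrapping keeps the inner message in err.Error(), so the check still matches.
	wrapped := fmt.Errorf("call predict: %w",
		errors.New("failed to unmarshal response: unexpected end of JSON input"))
	fmt.Println(IsProtocolError(wrapped))
	fmt.Println(IsProtocolError(errors.New("pool is shut down")))
}
```

Because the match is on message text, a worker error whose message happens to contain "unmarshal" would also be reported as a protocol error. That is one more reason to confine the check to a single helper.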
4. Worker Errors
Errors originating from the Python worker side. Correspond to the error returned by Response.Error() when the OK field of protocol.Response is false.
Typical cases:
- Python-side exceptions (Response.OK == false, message in Response.ErrorMsg)
- Worker process crash (connection is severed)
Detection:
// Python-side exceptions are returned via resp.Error()
// Pool.Call returns this error as-is
err := pool.Call(ctx, "predict", req, &resp)
if err != nil {
	// If it is neither a TimeoutError nor a connection error,
	// it is likely a worker error. Protocol and Pool lifecycle errors
	// also reach this branch, so rule them out by their messages first.
	var te *pyproc.TimeoutError
	if !errors.As(err, &te) && !strings.Contains(err.Error(), "failed to connect") {
		// Worker error
	}
}
- Retry: Case-by-case (business logic errors: no, transient errors: yes)
- SLO: Counted
5. Pool Lifecycle Errors
Errors related to Pool lifecycle. Occur when the Pool has been shut down or when no healthy workers are available.
Typical error messages:
- "pool is shut down" (Call after Pool.Shutdown has been invoked)
- "no healthy workers available" (all workers are unhealthy)
Detection:
// Use strings.Contains to handle wrapped errors
if strings.Contains(err.Error(), "pool is shut down") {
	// Pool needs to be recreated
}
if strings.Contains(err.Error(), "no healthy workers available") {
	// Wait for recovery via health check, or recreate Pool
}
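- Retry: No immediate retry (recreate the Pool; for "no healthy workers available", waiting for health checks to restore workers is the alternative)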
- SLO: Not counted (treat as an operational error)
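The two lifecycle messages call for different follow-ups, which a caller might make explicit by mapping each message to an action. This is a hedged sketch only: `LifecycleAction`, its constants, `actionForMessage`, and `ActionFor` are hypothetical names introduced here, and the mapping simply restates the comments in the detection snippet above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// LifecycleAction is a hypothetical enum naming the follow-up that each
// Pool lifecycle error calls for.
type LifecycleAction int

const (
	ActionNone            LifecycleAction = iota // not a Pool lifecycle error
	ActionRecreatePool                           // "pool is shut down"
	ActionWaitForRecovery                        // "no healthy workers available"
)

// actionForMessage maps a message to an action via substring checks, so
// messages from wrapped errors still match.
func actionForMessage(msg string) LifecycleAction {
	switch {
	case strings.Contains(msg, "pool is shut down"):
		return ActionRecreatePool
	case strings.Contains(msg, "no healthy workers available"):
		return ActionWaitForRecovery
	default:
		return ActionNone
	}
}

// ActionFor is the nil-safe entry point for call sites.
func ActionFor(err error) LifecycleAction {
	if err == nil {
		return ActionNone
	}
	return actionForMessage(err.Error())
}

func main() {
	err := fmt.Errorf("call predict: %w", errors.New("pool is shut down"))
	fmt.Println(ActionFor(err) == ActionRecreatePool)
}
```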
Summary
| Category | Type / Detection | Retry | SLO |
|---|---|---|---|
| Timeout | errors.As(*TimeoutError) | Yes (idempotent only) | Counted |
| Connection | "failed to connect" string | Yes (after restart) | Counted |
| Protocol | "unmarshal" etc. string | No | Not counted |
| Worker | Errors not matching any other category | Case-by-case | Counted |
| Pool lifecycle | "pool is shut down" etc. | No | Not counted |
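The table can be turned into a single classification step that checks the typed error first, then the message-based categories, and treats worker errors as the fallback. The following is a sketch under stated assumptions rather than a pyproc API: `TimeoutError` is a local stand-in for `*pyproc.TimeoutError` so the example compiles on its own, and `Category`, `Policy`, `policies`, `classifyMessage`, and `Classify` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// TimeoutError is a local stand-in for *pyproc.TimeoutError so the sketch
// compiles without the library.
type TimeoutError struct{ Msg string }

func (e *TimeoutError) Error() string { return e.Msg }

// Category and Policy are hypothetical types mirroring the summary table.
type Category int

const (
	CategoryTimeout Category = iota
	CategoryConnection
	CategoryProtocol
	CategoryWorker
	CategoryPoolLifecycle
)

type Policy struct {
	Retry      string // retry guidance from the table
	SLOCounted bool   // whether the error counts against the SLO
}

var policies = map[Category]Policy{
	CategoryTimeout:       {Retry: "yes (idempotent only)", SLOCounted: true},
	CategoryConnection:    {Retry: "yes (after restart)", SLOCounted: true},
	CategoryProtocol:      {Retry: "no", SLOCounted: false},
	CategoryWorker:        {Retry: "case-by-case", SLOCounted: true},
	CategoryPoolLifecycle: {Retry: "no", SLOCounted: false},
}

// classifyMessage applies the string checks; Worker is the fallback for
// anything that matches no other category.
func classifyMessage(msg string) Category {
	switch {
	case strings.Contains(msg, "pool is shut down"),
		strings.Contains(msg, "no healthy workers available"):
		return CategoryPoolLifecycle
	case strings.Contains(msg, "failed to connect"):
		return CategoryConnection
	case strings.Contains(msg, "unmarshal"),
		strings.Contains(msg, "response body is nil"):
		return CategoryProtocol
	default:
		return CategoryWorker
	}
}

// Classify checks the typed timeout first, then falls back to messages.
// The second result is false when err is nil.
func Classify(err error) (Category, bool) {
	if err == nil {
		return 0, false
	}
	var te *TimeoutError
	if errors.As(err, &te) {
		return CategoryTimeout, true
	}
	return classifyMessage(err.Error()), true
}

func main() {
	err := fmt.Errorf("call predict: %w", errors.New("failed to connect: connection refused"))
	if cat, ok := Classify(err); ok {
		fmt.Printf("%+v\n", policies[cat])
	}
}
```

Keeping the policy in a map separate from detection means only `Classify` has to change when typed or sentinel errors replace the string checks.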
Related documentation:
- Failure Behavior - Details on timeout hierarchy, retry, and backpressure
- Error Handling Guide - Best practices for error handling