
Kubernetes Troubleshooting

This guide covers common issues when running pyproc on Kubernetes and how to diagnose them.

CrashLoopBackOff

Symptoms

NAME                          READY   STATUS             RESTARTS   AGE
pyproc-app-5d8f9b7c4-x2j9k   0/1     CrashLoopBackOff   5          3m

Diagnosis

Check container logs:

kubectl logs <pod-name> -c app --previous
kubectl logs <pod-name> -c worker --previous

Check events:

kubectl describe pod <pod-name>

Common Causes

  1. Missing socket directory: The container cannot create the UDS socket.
# Check if the volume mount exists
kubectl exec <pod-name> -c app -- ls -la /var/run/pyproc

Fix: Verify the emptyDir volume is mounted at the socket path.
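
If the mount is missing, compare the Pod spec against a known-good layout. A minimal sketch, assuming the default socket directory /var/run/pyproc and the container names app and worker used throughout this guide:

spec:
  volumes:
    - name: pyproc-socket
      emptyDir: {}                     # shared scratch space for the UDS
  containers:
    - name: app
      volumeMounts:
        - name: pyproc-socket
          mountPath: /var/run/pyproc
    - name: worker
      volumeMounts:
        - name: pyproc-socket
          mountPath: /var/run/pyproc   # must match the app container's path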

  2. Permission denied on socket: The container user cannot write to the socket directory.
kubectl exec <pod-name> -c app -- id
kubectl exec <pod-name> -c app -- ls -la /var/run/

Fix: Set fsGroup in the Pod securityContext to match the container user's group (see the securityContext examples under UDS Permission Issues below).

  3. OOMKilled: The container exceeds its memory limit.
kubectl describe pod <pod-name> | grep -A 5 "Last State"

Fix: Increase resources.limits.memory for the affected container.
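
As a reference, the limit lives under the container's resources block. The values below are placeholders; size them from the container's observed working set:

resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"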

  4. Python worker binary not found: pyproc-worker is not installed or not in PATH.
kubectl exec <pod-name> -c worker -- which pyproc-worker

Fix: Verify the Docker image installs pyproc-worker correctly.
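
You can also check the image outside the cluster before redeploying. A quick sketch, assuming a hypothetical image tag my-registry/pyproc-worker:latest (substitute your own):

# Override the entrypoint to confirm the binary is on PATH
docker run --rm --entrypoint which my-registry/pyproc-worker:latest pyproc-worker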

Probe Failures

Liveness Probe Failure

Symptoms: Pod keeps restarting. Events show Liveness probe failed.

kubectl describe pod <pod-name> | grep -A 3 "Liveness"

Diagnosis:

# Test the health endpoint from inside the container
kubectl exec <pod-name> -c app -- wget -qO- http://localhost:8080/healthz

Common fixes:

  • Increase initialDelaySeconds if the app needs more startup time
  • Increase timeoutSeconds if the health check is slow under load
  • Increase failureThreshold to tolerate transient failures
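
All three knobs live on the probe definition. A sketch using the /healthz endpoint above, with illustrative values (tune them to your app's measured startup and response times):

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10   # startup grace before the first check
  timeoutSeconds: 3         # headroom for slow responses under load
  failureThreshold: 3       # consecutive failures tolerated before restart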

Readiness Probe Failure

Symptoms: Pod is Running but not Ready. Traffic is not routed to the Pod.

kubectl get pods -o wide
kubectl describe pod <pod-name> | grep -A 3 "Readiness"

Diagnosis:

kubectl exec <pod-name> -c app -- wget -qO- http://localhost:8080/readyz

Common causes:

  • Python worker pool is still initializing
  • Worker processes failed to connect to UDS
  • Insufficient resources causing slow startup
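
Because the worker pool can take time to come up, the readiness probe usually needs a more generous startup allowance than the liveness probe. A sketch against the /readyz endpoint above, with illustrative values:

readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 6       # roughly 30s of grace while the pool initializes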

UDS Permission Issues

Socket Not Found

Symptoms: Go application logs show connection refused or socket not found errors.

kubectl exec <pod-name> -c app -- ls -la /var/run/pyproc/

Diagnosis checklist:

  • Both containers mount the same volume at the same path
  • PYPROC_SOCKET_DIR environment variable matches the mount path
  • In sidecar mode, the worker container creates the socket before the Go app connects (see the wait sketch below)
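
One way to enforce that ordering is to make the app container wait for the socket file before starting. A minimal sketch, assuming the app image has a POSIX shell and that /app/server is your Go binary (both are placeholders):

containers:
  - name: app
    command: ["/bin/sh", "-c"]
    args:
      - |
        until [ -S /var/run/pyproc/worker.sock ]; do
          echo "waiting for worker socket..."; sleep 1
        done
        exec /app/server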

Permission Denied on Socket

Symptoms: permission denied errors when connecting to the UDS.

# Check socket permissions
kubectl exec <pod-name> -c app -- ls -la /var/run/pyproc/worker.sock

# Check user identity in each container
kubectl exec <pod-name> -c app -- id
kubectl exec <pod-name> -c worker -- id

Fix: Run both containers as the same user and group, or set fsGroup in the Pod securityContext:

spec:
  securityContext:
    fsGroup: 1000
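
Alternatively, align the identities explicitly. A sketch using an illustrative UID/GID of 1000 (match it to the users baked into your images):

spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000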

Debug Commands

Pod Status and Events

# Overview
kubectl get pods -l app=pyproc-app -o wide

# Detailed status
kubectl describe pod <pod-name>

# Events for the namespace
kubectl get events --sort-by=.metadata.creationTimestamp
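
# Events for a single Pod (substitute the Pod name)
kubectl get events --field-selector involvedObject.name=<pod-name>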

Container Logs

# Current logs
kubectl logs <pod-name> -c app
kubectl logs <pod-name> -c worker

# Previous container logs (after restart)
kubectl logs <pod-name> -c app --previous

# Follow logs
kubectl logs <pod-name> -c app -f

# Last 100 lines
kubectl logs <pod-name> -c app --tail=100

Interactive Debugging

# Shell into the container
kubectl exec -it <pod-name> -c app -- /bin/sh

# Check processes
kubectl exec <pod-name> -c app -- ps aux

# Check network (UDS)
kubectl exec <pod-name> -c app -- ls -la /var/run/pyproc/

# Check resource usage
kubectl top pod <pod-name> --containers

Ephemeral Debug Container

When the container image ships without a shell or debugging tools (for example, a distroless or read-only image), attach an ephemeral debug container:

kubectl debug -it <pod-name> --image=busybox --target=app
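
The --target=app flag shares the app container's process namespace, so you can reach its filesystem through /proc (assuming PID 1 in that namespace is the app process and your debug user has access):

# From inside the debug shell: inspect the app container's socket directory
ls -la /proc/1/root/var/run/pyproc/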

Resource Issues

Throttling (CPU)

Symptoms: High latency, slow responses, but no crashes.

kubectl top pod <pod-name> --containers

Check if CPU usage is near the limit. If throttled, increase resources.limits.cpu.
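
kubectl top reports usage, not throttling. To confirm throttling directly, read the CPU cgroup stats from inside the container; the path below assumes cgroup v2 (on cgroup v1 it is /sys/fs/cgroup/cpu/cpu.stat):

# A growing nr_throttled counter confirms CPU throttling
kubectl exec <pod-name> -c app -- cat /sys/fs/cgroup/cpu.stat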

OOMKilled (Memory)

Symptoms: Container restarts with reason OOMKilled.

kubectl describe pod <pod-name> | grep -A 5 "Last State"

Fix:

  • Increase resources.limits.memory
  • Investigate if the Python worker has a memory leak
  • Check payload sizes (large JSON payloads consume memory)