Graceful Shutdown in Go Is Harder Than You Think

"Just call server.Shutdown()" is advice that works great until you have a Kafka consumer mid-batch, a database transaction in progress, and Kubernetes sending SIGTERM while your readiness probe still returns healthy.

Graceful shutdown seems simple. It isn't. Here's everything that can go wrong and how to handle it.

The Naive Approach

Most tutorials show something like this:

func main() {
    server := &http.Server{Addr: ":8080", Handler: handler}

    go func() {
        if err := server.ListenAndServe(); err != http.ErrServerClosed {
            log.Fatal(err)
        }
    }()

    // Wait for interrupt
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
    <-quit

    // Shutdown with timeout
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    if err := server.Shutdown(ctx); err != nil {
        log.Printf("graceful shutdown failed: %v", err)
    }
}

This handles the HTTP server (note that Shutdown does not wait for hijacked connections such as WebSockets). But real services have more:

  • Background workers
  • Kafka/RabbitMQ consumers
  • Database connection pools
  • Distributed locks
  • Cache connections
  • Metrics reporters

The Kubernetes Timing Problem

When Kubernetes sends SIGTERM, several things happen in parallel, none of them waiting for the others:

  1. Your pod receives SIGTERM
  2. Kubernetes removes the pod from Service endpoints
  3. Ingress controllers update their backends
  4. Other pods' DNS caches might still point to you

The problem: steps 2-4 take time to propagate. If you shut down the instant SIGTERM arrives, requests still being routed to you will fail. And whatever drain delay plus shutdown timeout you choose has to fit inside terminationGracePeriodSeconds (30 seconds by default), after which Kubernetes sends SIGKILL.

func main() {
    // ... setup ...

    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
    <-quit

    // WRONG: Immediate shutdown
    // Requests still in flight from other pods will 502

    // RIGHT: Wait for traffic to drain
    log.Println("Received shutdown signal, waiting for traffic to drain...")

    // Give Kubernetes time to update endpoints
    time.Sleep(5 * time.Second)

    // Now start graceful shutdown
    ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
    defer cancel()
    server.Shutdown(ctx)
}

The Readiness Probe Dance

Your readiness probe should fail BEFORE you stop accepting requests:

type Server struct {
    httpServer *http.Server
    isReady    atomic.Bool
}

func (s *Server) readinessHandler(w http.ResponseWriter, r *http.Request) {
    if !s.isReady.Load() {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}

func (s *Server) Shutdown(ctx context.Context) error {
    // Step 1: Mark as not ready (fails readiness probe)
    s.isReady.Store(false)
    log.Println("Marked as not ready")

    // Step 2: Wait for Kubernetes to notice and stop sending traffic
    time.Sleep(5 * time.Second)
    log.Println("Drain period complete")

    // Step 3: Now shutdown the HTTP server
    return s.httpServer.Shutdown(ctx)
}
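
Wiring the probe in is the easy part. Here's a minimal sketch, assuming a /readyz path (an assumption on my part; point it at whatever path your readinessProbe is actually configured to hit):

func NewServer(appHandler http.Handler) *Server {
    s := &Server{}

    mux := http.NewServeMux()
    mux.HandleFunc("/readyz", s.readinessHandler)
    mux.Handle("/", appHandler)
    s.httpServer = &http.Server{Addr: ":8080", Handler: mux}

    // Mark ready up front; a stricter version would flip this only after
    // dependencies (database, Kafka, caches) have been confirmed healthy.
    s.isReady.Store(true)
    return s
}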

Coordinating Multiple Components

Real services have multiple things to shut down, and order matters.

The Shutdown Orchestrator Pattern

type ShutdownManager struct {
    components []ShutdownComponent
    timeout    time.Duration
}

type ShutdownComponent interface {
    Name() string
    Shutdown(ctx context.Context) error
    Priority() int // Lower = shutdown first
}

func (m *ShutdownManager) Shutdown(ctx context.Context) error {
    // Sort by priority (using slices package from Go 1.21+)
    slices.SortFunc(m.components, func(a, b ShutdownComponent) int {
        return cmp.Compare(a.Priority(), b.Priority())
    })

    ctx, cancel := context.WithTimeout(ctx, m.timeout)
    defer cancel()

    var errs []error
    for _, c := range m.components {
        log.Printf("Shutting down %s...", c.Name())
        start := time.Now()

        if err := c.Shutdown(ctx); err != nil {
            log.Printf("Error shutting down %s: %v", c.Name(), err)
            errs = append(errs, fmt.Errorf("%s: %w", c.Name(), err))
        } else {
            log.Printf("Shutdown %s complete (%v)", c.Name(), time.Since(start))
        }
    }

    return errors.Join(errs...)
}

Shutdown Order Matters

func main() {
    manager := &ShutdownManager{
        timeout: 30 * time.Second,
        components: []ShutdownComponent{
            // Priority 1: Stop accepting new work
            &ReadinessComponent{ready: &isReady},

            // Priority 2: Drain period
            &DrainComponent{duration: 5 * time.Second},

            // Priority 3: Stop HTTP server (waits for in-flight)
            &HTTPServerComponent{server: httpServer},

            // Priority 4: Stop background workers
            &WorkerPoolComponent{pool: workers},

            // Priority 5: Stop message consumers
            &KafkaConsumerComponent{consumer: kafkaConsumer},

            // Priority 6: Flush async operations
            &MetricsFlushComponent{reporter: metrics},

            // Priority 7: Close connections (last!)
            &DatabaseComponent{pool: dbPool},
            &RedisComponent{client: redis},
        },
    }

    // ... signal handling ...
    manager.Shutdown(context.Background())
}
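
The concrete component types in that list aren't from a library; they're small wrappers you write yourself. A sketch of two of them, with field names and priority values following the example above:

// DrainComponent simply sleeps to give Kubernetes time to update endpoints.
type DrainComponent struct {
    duration time.Duration
}

func (d *DrainComponent) Name() string  { return "drain" }
func (d *DrainComponent) Priority() int { return 2 }

func (d *DrainComponent) Shutdown(ctx context.Context) error {
    select {
    case <-time.After(d.duration):
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}

// HTTPServerComponent delegates to net/http's own graceful shutdown, which
// stops accepting new connections and waits for in-flight requests.
type HTTPServerComponent struct {
    server *http.Server
}

func (h *HTTPServerComponent) Name() string  { return "http-server" }
func (h *HTTPServerComponent) Priority() int { return 3 }

func (h *HTTPServerComponent) Shutdown(ctx context.Context) error {
    return h.server.Shutdown(ctx)
}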

Handling Kafka Consumers

Kafka consumers are tricky because you might be mid-batch when shutdown starts. (The consumer type below is schematic; FetchBatch and Commit stand in for whatever batch API your client library actually exposes.)

type KafkaConsumerComponent struct {
    consumer  *kafka.Consumer
    handler   MessageHandler
    wg        sync.WaitGroup
    shutdown  chan struct{}
    batchSize int
}

func (c *KafkaConsumerComponent) Run(ctx context.Context) {
    c.shutdown = make(chan struct{})

    for {
        select {
        case <-c.shutdown:
            return
        case <-ctx.Done():
            return
        default:
            // Fetch batch
            messages, err := c.consumer.FetchBatch(ctx, c.batchSize)
            if err != nil {
                // Back off briefly so a persistent fetch error doesn't spin this loop hot.
                time.Sleep(time.Second)
                continue
            }

            // Process batch with tracking
            c.wg.Add(1)
            go func() {
                defer c.wg.Done()

                for _, msg := range messages {
                    select {
                    case <-c.shutdown:
                        // Shutdown requested mid-batch
                        // Don't commit, let rebalance handle it
                        return
                    default:
                        if err := c.handler.Handle(msg); err != nil {
                            // Handle error...
                        }
                    }
                }

                // Only commit if we processed the whole batch
                c.consumer.Commit(messages)
            }()
        }
    }
}

func (c *KafkaConsumerComponent) Shutdown(ctx context.Context) error {
    // Signal consumer to stop
    close(c.shutdown)

    // Wait for in-flight batches with timeout
    done := make(chan struct{})
    go func() {
        c.wg.Wait()
        close(done)
    }()

    select {
    case <-done:
        log.Println("All Kafka batches completed")
    case <-ctx.Done():
        log.Println("Timeout waiting for Kafka batches")
    }

    return c.consumer.Close()
}
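
To slot this into the ShutdownManager from earlier, the component just needs the other two interface methods; the priority value here matches the ordering in the example main:

func (c *KafkaConsumerComponent) Name() string  { return "kafka-consumer" }
func (c *KafkaConsumerComponent) Priority() int { return 5 }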

Database Transactions in Progress

Long-running transactions need special handling:

type TransactionManager struct {
    db         *sql.DB
    activeTxns sync.Map // map[string]*ManagedTx
    shutdown   atomic.Bool
}

type ManagedTx struct {
    tx       *sql.Tx
    id       string
    started  time.Time
    doneChan chan struct{}
}

func (m *TransactionManager) Begin(ctx context.Context) (*ManagedTx, error) {
    if m.shutdown.Load() {
        return nil, errors.New("shutdown in progress, rejecting new transactions")
    }

    tx, err := m.db.BeginTx(ctx, nil)
    if err != nil {
        return nil, err
    }

    mtx := &ManagedTx{
        tx:       tx,
        id:       uuid.New().String(),
        started:  time.Now(),
        doneChan: make(chan struct{}),
    }

    m.activeTxns.Store(mtx.id, mtx)
    return mtx, nil
}

func (m *TransactionManager) Shutdown(ctx context.Context) error {
    m.shutdown.Store(true)
    log.Println("Rejecting new transactions")

    // Wait for active transactions
    ticker := time.NewTicker(100 * time.Millisecond)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            // Timeout - log remaining transactions
            m.activeTxns.Range(func(key, value any) bool {
                mtx := value.(*ManagedTx)
                log.Printf("Transaction %s still active after %v", mtx.id, time.Since(mtx.started))
                return true
            })
            return ctx.Err()

        case <-ticker.C:
            count := 0
            m.activeTxns.Range(func(_, _ any) bool {
                count++
                return true
            })

            if count == 0 {
                log.Println("All transactions completed")
                return m.db.Close()
            }

            log.Printf("Waiting for %d active transactions", count)
        }
    }
}
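
One thing the listing above leaves implicit: finished transactions have to deregister themselves, or Shutdown will sit at the deadline waiting for a count that never drops. A sketch of that bookkeeping (the Commit/Rollback wrappers are my naming, not part of database/sql):

func (m *TransactionManager) Commit(mtx *ManagedTx) error {
    defer m.finish(mtx)
    return mtx.tx.Commit()
}

func (m *TransactionManager) Rollback(mtx *ManagedTx) error {
    defer m.finish(mtx)
    return mtx.tx.Rollback()
}

// finish removes the transaction from the active set so Shutdown can
// observe the count dropping to zero.
func (m *TransactionManager) finish(mtx *ManagedTx) {
    m.activeTxns.Delete(mtx.id)
    close(mtx.doneChan)
}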

The Complete Picture

func main() {
    // Setup components...

    // Shutdown orchestration
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)

    // Context for long-running components, canceled when main returns
    runCtx, stop := context.WithCancel(context.Background())
    defer stop()

    // Start services
    go httpServer.ListenAndServe()
    go kafkaConsumer.Run(runCtx)
    go workers.Start()

    // Wait for signal
    sig := <-quit
    log.Printf("Received %v, starting graceful shutdown", sig)

    // Create shutdown context with total budget
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    // Execute shutdown sequence
    if err := shutdownManager.Shutdown(ctx); err != nil {
        log.Printf("Shutdown completed with errors: %v", err)
        os.Exit(1)
    }

    log.Println("Shutdown complete")
}
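
As a variation, Go 1.16+ offers signal.NotifyContext, which folds the signal channel into a context. A minimal sketch of the same wait-then-shutdown flow:

ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
defer stop()

<-ctx.Done() // blocks until SIGINT or SIGTERM arrives
stop()       // unregister, so a second signal falls back to the default behavior

shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
shutdownManager.Shutdown(shutdownCtx)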

Key Takeaways

  1. HTTP shutdown alone isn't enough. You need to coordinate all components.

  2. Kubernetes needs time. Add a drain period before stopping the HTTP server.

  3. Order matters. Stop accepting work → drain in-flight → close connections.

  4. Track in-flight work. Use WaitGroups or similar to know when it's safe to close resources.

  5. Set deadlines. Use context timeouts to avoid hanging forever.

  6. Log the shutdown. When things go wrong, you'll want to know what was happening.

  7. Test it. Send SIGTERM to your service and verify the behavior; a minimal test sketch follows below.
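
A Unix-only sketch of such a test, assuming nothing beyond the signal handling shown earlier; in a real test you would run your shutdown sequence where the comment indicates:

func TestShutdownOnSIGTERM(t *testing.T) {
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGTERM)
    defer signal.Stop(quit)

    // Simulate Kubernetes by sending SIGTERM to our own process.
    if err := syscall.Kill(syscall.Getpid(), syscall.SIGTERM); err != nil {
        t.Fatal(err)
    }

    select {
    case <-quit:
        // Signal received: run the shutdown sequence here and assert on its behavior.
    case <-time.After(2 * time.Second):
        t.Fatal("SIGTERM was never delivered")
    }
}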

Graceful shutdown is one of those things that seems simple until you actually need it to work reliably. Get it right, and your deployments become invisible to users.