Webhook Delivery System

The webhook delivery system consists of a background worker that processes pending events, delivers them to subscriber URLs, and manages retries.

Architecture

// internal/core/webhook/worker.go

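// Worker polls for pending webhook events and delivers them to subscribers.
type Worker struct {
    db         *pgxpool.Pool // event and subscription storage
    httpClient *http.Client  // shared client, see HTTP Client Configuration
}

// Run processes one batch per tick until ctx is cancelled.
func (w *Worker) Run(ctx context.Context) {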
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            w.processBatch(ctx)
        }
    }
}

Delivery Process

The worker runs every 5 seconds and:

Fetches up to 50 pending events where status = 'pending' AND next_retry_at <= now()
For each event:
- Load subscription (get URL + signing_secret)
- Marshal payload to JSON
- Sign payload with HMAC-SHA256
- POST to URL with signature headers
- Update event status based on response

func (w *Worker) processBatch(ctx context.Context) {
    events := w.fetchPendingEvents(ctx, 50)

    for _, evt := range events {
        sub := w.loadSubscription(ctx, evt.SubscriptionID)
        if sub == nil || !sub.IsActive {
            w.markExhausted(ctx, evt.ID, "subscription inactive")
            continue
        }

        body, err := json.Marshal(evt.Payload)
        if err != nil {
            w.recordFailure(ctx, evt, err.Error(), 0)
            continue
        }
        signature := Sign(body, sub.SigningSecret)

        req, err := http.NewRequestWithContext(ctx, http.MethodPost, sub.URL, bytes.NewReader(body))
        if err != nil {
            w.recordFailure(ctx, evt, err.Error(), 0)
            continue
        }
        req.Header.Set("Content-Type", "application/json")
        req.Header.Set("X-Webhook-Signature", signature)
        req.Header.Set("X-Webhook-Event", evt.EventType)

        resp, err := w.httpClient.Do(req)
        if err != nil {
            // Network error or timeout: no status code to record.
            w.recordFailure(ctx, evt, err.Error(), 0)
            continue
        }
        // Close inside the loop: a deferred Close would keep every
        // response body open until the whole batch returns.
        resp.Body.Close()

        if resp.StatusCode >= 200 && resp.StatusCode < 300 {
            w.markDelivered(ctx, evt.ID, resp.StatusCode)
        } else {
            w.recordFailure(ctx, evt, "", resp.StatusCode)
        }
    }
}
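
The Sign helper used in processBatch is not shown on this page. The following is a minimal sketch of what HMAC-SHA256 signing and the matching receiver-side check could look like. Hex encoding of the digest, the Verify helper, and the example secret are assumptions made for illustration; the real header format may differ.

```go
package main

import (
    "crypto/hmac"
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

// Sign returns the hex-encoded HMAC-SHA256 of body keyed by secret.
// Hex encoding is an assumption; the production helper may use another format.
func Sign(body []byte, secret string) string {
    mac := hmac.New(sha256.New, []byte(secret))
    mac.Write(body) // hash.Hash.Write never returns an error
    return hex.EncodeToString(mac.Sum(nil))
}

// Verify (hypothetical) is what a receiver could run against the
// X-Webhook-Signature header. hmac.Equal compares in constant time.
func Verify(body []byte, secret, signature string) bool {
    got, err := hex.DecodeString(signature)
    if err != nil {
        return false
    }
    mac := hmac.New(sha256.New, []byte(secret))
    mac.Write(body)
    return hmac.Equal(got, mac.Sum(nil))
}

func main() {
    body := []byte(`{"id":"evt_01HQ3K5M7N8P9R0S"}`)
    sig := Sign(body, "whsec_example") // made-up secret

    fmt.Println(len(sig))                            // 64 hex characters
    fmt.Println(Verify(body, "whsec_example", sig))  // true
    fmt.Println(Verify(body, "wrong_secret", sig))   // false
}
```

The receiver must verify against the raw request bytes, before any JSON re-encoding, because the signature covers the exact body that was sent.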

Retry Strategy

Escalating backoff on a fixed delay schedule, with a maximum of 5 attempts:

Attempt	Delay	Cumulative
1	Immediate	0
2	1 minute	1 min
3	15 minutes	16 min
4	1 hour	1h 16m
5	6 hours	~7h 16m

After 5 failed attempts, the event status moves to exhausted.

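// nextRetryDelay returns the wait before the next delivery attempt, where
// attempt is the number of attempts already made (0 for a new event).
// It also returns 0 once all attempts are used, which is indistinguishable
// from "immediate", so callers must check for exhaustion separately.
func nextRetryDelay(attempt int) time.Duration {
    delays := []time.Duration{
        0,
        1 * time.Minute,
        15 * time.Minute,
        1 * time.Hour,
        6 * time.Hour,
    }
    if attempt >= len(delays) {
        return 0 // exhausted
    }
    return delays[attempt]
}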

Event States

Events move through the following states:

pending → delivered (2xx response)
        ↓
        failed → pending (retry) → delivered
                              ↓
                              exhausted (after 5 attempts)

pending: Waiting for delivery or retry
delivered: Successfully delivered (2xx response)
failed: Last attempt failed (non-2xx response or network error)
exhausted: All retry attempts failed
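
One way to make these states explicit in code is a small transition table. The sketch below encodes one reading of the diagram above (delivered and exhausted are terminal); the Status type, the allowed map, and canTransition are hypothetical names, not part of the worker.

```go
package main

import "fmt"

// Status is a webhook event state as stored in the status column.
type Status string

const (
    StatusPending   Status = "pending"
    StatusDelivered Status = "delivered"
    StatusFailed    Status = "failed"
    StatusExhausted Status = "exhausted"
)

// allowed lists the transitions shown in the state diagram.
// States without an entry are terminal.
var allowed = map[Status][]Status{
    StatusPending: {StatusDelivered, StatusFailed},
    StatusFailed:  {StatusPending, StatusExhausted},
}

// canTransition reports whether moving from one state to another is valid.
func canTransition(from, to Status) bool {
    for _, next := range allowed[from] {
        if next == to {
            return true
        }
    }
    return false
}

func main() {
    fmt.Println(canTransition(StatusPending, StatusDelivered)) // true
    fmt.Println(canTransition(StatusFailed, StatusPending))    // true
    fmt.Println(canTransition(StatusExhausted, StatusPending)) // false
}
```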

HTTP Client Configuration

httpClient := &http.Client{
    Timeout: 10 * time.Second,
    Transport: &http.Transport{
        MaxIdleConns:        100,
        MaxIdleConnsPerHost: 10,
        IdleConnTimeout:     90 * time.Second,
    },
}

10-second timeout — Slow receivers don't block the worker
Connection pooling — Reuse connections for performance
Follows redirects — Up to 10 redirects (Go default)

Success Criteria

A delivery is considered successful if:

HTTP status code is 2xx (200-299)
Response received within 10 seconds

Any other outcome triggers a retry: a non-2xx final response (4xx, 5xx, or a 3xx the client does not follow), a network error, or a timeout.

Failure Handling

When a delivery fails:

Increment attempts counter
Set last_attempt_at to current time
Record last_status_code and last_error
Calculate next_retry_at based on attempt number
Update status:
- status = 'pending' if more retries available
- status = 'exhausted' if all retries exhausted

Dead Letter Queue

Events that reach exhausted status are effectively moved to a "dead letter queue":

They're no longer processed by the worker
Admins can view them via the API
They're retained for 90 days for debugging
Admins can manually retry them if needed (future feature)

Idempotency

Receivers should implement idempotency:

Each event has a unique id field (e.g., evt_01HQ3K5M7N8P9R0S)
Receivers can use this ID to deduplicate deliveries
If a delivery succeeds but the Core API doesn't receive the 2xx response (network issue), the event may be retried

Example receiver implementation:

func handleWebhook(w http.ResponseWriter, r *http.Request) {
    var event struct {
        ID string `json:"id"`
        // ... other fields
    }
    // Reject bodies that cannot be decoded or that carry no event ID.
    if err := json.NewDecoder(r.Body).Decode(&event); err != nil || event.ID == "" {
        http.Error(w, "invalid payload", http.StatusBadRequest)
        return
    }

    // Check if we've already processed this event
    if alreadyProcessed(event.ID) {
        w.WriteHeader(http.StatusOK)
        return
    }

    // Process the event
    processEvent(event)

    // Mark as processed
    markProcessed(event.ID)

    w.WriteHeader(http.StatusOK)
}
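
A separate check followed by a later mark leaves a window in which two concurrent deliveries of the same event can both pass the check. One possible way to close it is to claim the ID atomically before processing. The sketch below uses an in-memory set guarded by a mutex; seenStore, claim, and release are hypothetical names, and a real receiver would more likely rely on a durable store such as a unique constraint on the event ID.

```go
package main

import (
    "fmt"
    "sync"
)

// seenStore is an in-memory stand-in for a durable deduplication store.
type seenStore struct {
    mu   sync.Mutex
    seen map[string]struct{}
}

func newSeenStore() *seenStore {
    return &seenStore{seen: make(map[string]struct{})}
}

// claim returns true only for the first caller that presents a given ID.
func (s *seenStore) claim(id string) bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    if _, ok := s.seen[id]; ok {
        return false
    }
    s.seen[id] = struct{}{}
    return true
}

// release forgets an ID so that a later redelivery can be processed,
// for example after processing failed.
func (s *seenStore) release(id string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    delete(s.seen, id)
}

func main() {
    store := newSeenStore()
    fmt.Println(store.claim("evt_01HQ3K5M7N8P9R0S")) // true: first delivery
    fmt.Println(store.claim("evt_01HQ3K5M7N8P9R0S")) // false: duplicate
    store.release("evt_01HQ3K5M7N8P9R0S")
    fmt.Println(store.claim("evt_01HQ3K5M7N8P9R0S")) // true: claimable again
}
```

Claiming first means a failed processing step must release the ID and return a non-2xx status, otherwise the retried delivery would be treated as a duplicate and the event lost.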

Rate Limiting

The worker processes a maximum of 50 events per tick (5 seconds), resulting in:

Max throughput: 600 events/minute
Prevents thundering herd on bulk operations
Protects receiver endpoints from overload

If you need higher throughput, contact support to adjust the batch size.

Monitoring & Debugging

Delivery Logs

All delivery attempts are recorded in webhook_events:

SELECT
    event_type,
    status,
    attempts,
    last_status_code,
    last_error,
    next_retry_at,
    created_at
FROM webhook_events
WHERE subscription_id = $1
ORDER BY created_at DESC
LIMIT 100;

Common Issues

Deliveries stuck in pending:

Check if next_retry_at is in the future (waiting for retry delay)
Verify the subscription has is_active = true
Check worker logs for errors

All deliveries failing with timeout:

Receiver endpoint is too slow (> 10 seconds)
Receiver should return 2xx immediately and process async

Deliveries failing with 401/403:

Receiver is rejecting requests
Check the signature verification logic on the receiver side

High failure rate:

Check receiver logs for errors
Verify receiver endpoint is correct
Test with the /test endpoint

Performance Considerations

Database queries:

Index on (next_retry_at) WHERE status = 'pending' for fast batch fetch
Index on (subscription_id, created_at DESC) for delivery history queries
Index on (organization_id, created_at DESC) for org-wide views

Worker scaling:

A single worker instance handles up to 600 events/minute
For higher throughput, run multiple workers
Workers coordinate via database locks (no duplicate deliveries)

Future Enhancements

Manual retry — Admin UI to retry exhausted events
Webhook logs export — Download CSV of delivery history
Real-time delivery status — WebSocket updates for admins
Webhook analytics — Success rate, latency, error trends
