Deduplication

Deduplication prevents processing the same webhook multiple times when providers send duplicate requests. This feature protects your systems from duplicate charges, repeated notifications, and inconsistent state.

Overview

Webhook providers often retry failed deliveries, sometimes sending the same webhook multiple times. Without deduplication, your system might process the same order twice, send duplicate notifications, or create conflicting records.

Key features:

  • Automatic detection: Identifies duplicate webhooks based on your configuration
  • Three strategies: Choose how to identify duplicates (single field, include list, or exclude list)
  • Configurable window: Set how long to remember deduplication hashes (default: 5 minutes)
  • No quota impact: Rejected duplicates don't count toward your quota
  • Source-level: Each Source can have independent deduplication rules

How Deduplication Works

Detection Process

When a webhook arrives at a Source with deduplication enabled:

  1. Extract identifier: Based on your strategy, extract the deduplication key from the webhook
  2. Hash the key: Create a SHA-256 hash of the identifier
  3. Check cache: Look up the hash in the deduplication cache
  4. Accept or reject:
    • If not found: Store hash and process webhook normally
    • If found: Reject as duplicate (return 200 OK but skip processing)
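
The hash, lookup, and accept-or-reject steps above can be sketched in a few lines of Python. This is a minimal illustration, not Hooklistener's implementation: the function names and the in-memory set standing in for the deduplication cache are assumptions, and expiry is left out for brevity.

```python
import hashlib

def dedup_key_hash(identifier: str) -> str:
    """Step 2: SHA-256 hash of the extracted identifier, as a hex string."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()

def accept_or_reject(seen_hashes: set, identifier: str) -> str:
    """Steps 3-4: look up the hash, then store it or report a duplicate."""
    digest = dedup_key_hash(identifier)
    if digest in seen_hashes:
        return "duplicate"  # respond 200 OK but skip processing
    seen_hashes.add(digest)
    return "accepted"  # process the webhook normally

# Hypothetical cache: a plain set with no expiry.
seen = set()
first = accept_or_reject(seen, "evt_1234567890")
second = accept_or_reject(seen, "evt_1234567890")
print(first, second)  # accepted duplicate
```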

Time Window (TTL)

Deduplication uses a time-to-live (TTL) window:

  • Default: 5 minutes (300 seconds)
  • Configurable: Set any positive integer (in seconds)
  • After TTL expires: Same webhook is treated as new

Example: With 300-second TTL, a webhook received at 10:00:00 is stored until 10:05:00. If the same webhook arrives at 10:04:00, it's rejected as a duplicate. If it arrives at 10:06:00, it's processed as new.
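
The timing in this example can be checked with a short Python sketch. The function name is hypothetical, and treating an arrival exactly at the expiry instant as new is an assumption; the text above does not specify the boundary.

```python
from datetime import datetime, timedelta

def is_duplicate_at(first_seen: datetime, arrival: datetime,
                    ttl_seconds: int = 300) -> bool:
    """True if `arrival` falls inside the TTL window opened at `first_seen`."""
    expires_at = first_seen + timedelta(seconds=ttl_seconds)
    # Assumption: the window is half-open, so arrival == expires_at is new.
    return arrival < expires_at

first_seen = datetime(2024, 1, 15, 10, 0, 0)
at_1004 = is_duplicate_at(first_seen, datetime(2024, 1, 15, 10, 4, 0))
at_1006 = is_duplicate_at(first_seen, datetime(2024, 1, 15, 10, 6, 0))
print(at_1004, at_1006)  # True False
```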

Duplicate Response

When a duplicate is detected:

  • HTTP 200 OK is returned (success)
  • Webhook is NOT processed
  • Event is NOT created
  • Destinations are NOT notified
  • Duplicate count is tracked for monitoring

Returning a success status tells the provider that delivery succeeded, so it stops retrying the webhook.

Deduplication Strategies

Hooklistener offers three strategies for identifying duplicate webhooks. Choose based on your webhook provider's behavior.

Strategy 1: Single Field Path (body_path)

Extract a single field from the webhook body to identify duplicates.

Best for:

  • Webhooks with unique IDs (most common)
  • Simple deduplication needs
  • Consistent webhook structure

Configuration:

{
  "deduplication_config": {
    "enabled": true,
    "body_path": "body.id",
    "ttl_seconds": 300
  }
}

Example - Stripe webhook:

{
  "id": "evt_1234567890",
  "type": "payment_intent.succeeded",
  "data": {
    "object": {
      "id": "pi_1234567890",
      "amount": 1000
    }
  }
}

Use body_path: "body.id" to deduplicate on the event ID.

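Example - GitHub-style webhook with the delivery ID in the body:

{
  "delivery_id": "12345678-1234-1234-1234-123456789012",
  "action": "opened",
  "pull_request": {
    "id": 987654321
  }
}

Use body_path: "body.delivery_id" to deduplicate on the delivery ID. GitHub itself sends its delivery ID in the X-GitHub-Delivery header rather than in the body, so real GitHub webhooks are better keyed on headers.x-github-delivery.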

Strategy 2: Include Fields (include_fields)

Hash multiple specific fields together to identify duplicates.

Best for:

  • Webhooks without single unique ID
  • Composite keys
  • Partial payload matching

Configuration:

{
  "deduplication_config": {
    "enabled": true,
    "include_fields": [
      "body.order_id",
      "body.customer_id",
      "body.timestamp"
    ],
    "ttl_seconds": 300
  }
}

Example - E-commerce webhook:

{
  "order_id": "ORD-12345",
  "customer_id": "CUST-67890",
  "timestamp": "2024-01-15T10:30:00Z",
  "items": [...],
  "shipping_address": {...}
}

All three fields (order_id, customer_id, timestamp) are combined and hashed together. A webhook is only considered a duplicate if ALL three match.
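
A rough Python sketch of a composite key illustrates the "ALL three match" rule. How Hooklistener actually combines the values is not documented here, so the JSON serialization, the `composite_key` helper, and the sample payloads are assumptions made for illustration.

```python
import hashlib
import json

def composite_key(body: dict, include_fields: list) -> str:
    """Hash the listed top-level body fields together as one key."""
    # Strip the "body." prefix; this sketch handles top-level body fields only.
    names = [path.split(".", 1)[1] for path in include_fields]
    values = [body.get(name) for name in names]
    # Serialize the values in configured order so the hash is deterministic.
    canonical = json.dumps(values, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

FIELDS = ["body.order_id", "body.customer_id", "body.timestamp"]

original = {"order_id": "ORD-12345", "customer_id": "CUST-67890",
            "timestamp": "2024-01-15T10:30:00Z", "items": ["widget"]}
# A retry may differ in fields that are not part of the key.
retry = dict(original, items=["widget", "gift-wrap"])
# A different timestamp changes one of the three fields, so the key differs.
later = dict(original, timestamp="2024-01-15T10:31:00Z")

same_key = composite_key(original, FIELDS) == composite_key(retry, FIELDS)
different_key = composite_key(original, FIELDS) != composite_key(later, FIELDS)
print(same_key, different_key)  # True True
```

Fields outside the include list (items here) never influence the key, which is why this strategy suits partial payload matching.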

Strategy 3: Exclude Fields (exclude_fields)

Hash the entire payload except specified fields.

Best for:

  • Webhooks where most fields should be considered
  • Excluding timestamps, metadata, or dynamic fields
  • Complex payloads

Configuration:

{
  "deduplication_config": {
    "enabled": true,
    "exclude_fields": [
      "body.metadata.timestamp",
      "body.metadata.server_id",
      "headers.x-request-id"
    ],
    "ttl_seconds": 300
  }
}

Example - Monitoring webhook:

{
  "alert_id": "alert-123",
  "severity": "critical",
  "message": "High CPU usage",
  "metadata": {
    "timestamp": "2024-01-15T10:30:00Z",
    "server_id": "srv-456"
  }
}

The webhook is deduplicated on everything EXCEPT the excluded fields (metadata.timestamp, metadata.server_id, and the x-request-id header). If the alert content is identical but comes from a different server or at a different time, it's still considered a duplicate.
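
The exclude strategy can be sketched in Python as "remove the excluded paths, then hash what remains". This is an illustration under assumptions: the request is modeled as a dict with body and headers keys, header names are already lowercase, and the remainder is hashed as canonical JSON. The helper names are hypothetical.

```python
import copy
import hashlib
import json

def drop_path(doc: dict, path: str) -> None:
    """Delete the value at a dotted path in place; ignore missing paths."""
    keys = path.split(".")
    node = doc
    for key in keys[:-1]:
        node = node.get(key)
        if not isinstance(node, dict):
            return
    node.pop(keys[-1], None)

def exclude_hash(request: dict, exclude_fields: list) -> str:
    """Hash the request after removing every excluded field."""
    remaining = copy.deepcopy(request)  # leave the caller's data untouched
    for path in exclude_fields:
        drop_path(remaining, path)
    canonical = json.dumps(remaining, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

EXCLUDE = ["body.metadata.timestamp", "body.metadata.server_id",
           "headers.x-request-id"]

def alert(server_id: str, timestamp: str, request_id: str, severity: str) -> dict:
    """Build a sample request shaped like the monitoring webhook above."""
    return {
        "body": {"alert_id": "alert-123", "severity": severity,
                 "message": "High CPU usage",
                 "metadata": {"timestamp": timestamp, "server_id": server_id}},
        "headers": {"x-request-id": request_id},
    }

a = exclude_hash(alert("srv-456", "2024-01-15T10:30:00Z", "req-1", "critical"), EXCLUDE)
b = exclude_hash(alert("srv-789", "2024-01-15T10:31:00Z", "req-2", "critical"), EXCLUDE)
c = exclude_hash(alert("srv-456", "2024-01-15T10:30:00Z", "req-1", "warning"), EXCLUDE)
print(a == b, a == c)  # True False
```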

Field Path Syntax

Field paths specify which fields to extract from webhooks.

Basic Paths

Root prefixes:

  • body. - Access request body (JSON)
  • headers. - Access HTTP headers
  • query. - Access query parameters
  • path. - Access URL path parameters

Examples:

"body.id"                    // Top-level field
"body.data.object.id"        // Nested field
"headers.x-github-delivery"  // Header (case-insensitive)
"query.event_type"           // Query parameter

Array Access

Access specific array elements by index:

"body.items[0].id"          // First item
"body.tags[2]"              // Third tag
"body.data.users[0].email"  // Nested array access

Wildcard Selection

Use [*] to select all array elements:

"body.items[*].id"           // All item IDs
"body.tags[*]"               // All tags
"body.data.orders[*].total"  // All order totals

When using wildcards with body_path, all matching values are combined and hashed together.
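
To make the path syntax concrete, here is a small Python resolver for dotted paths, numeric indexes, and [*] wildcards. It is a sketch under assumptions, not Hooklistener's parser: the `resolve` function, the request dict shape, and the sample data are hypothetical, and header lookup here is case-sensitive for simplicity.

```python
import re

# One path segment: a name followed by zero or more [n] or [*] selectors.
_SEGMENT = re.compile(r"^([^\[\]]+)((?:\[(?:\d+|\*)\])*)$")
_SELECTOR = re.compile(r"\[(\d+|\*)\]")

def resolve(request: dict, path: str) -> list:
    """Return every value matched by `path`; an empty list if none match."""
    nodes = [request]
    for segment in path.split("."):
        match = _SEGMENT.match(segment)
        if match is None:
            raise ValueError(f"unsupported path segment: {segment!r}")
        name, selectors = match.groups()
        # Descend into the named key wherever it exists.
        nodes = [n[name] for n in nodes if isinstance(n, dict) and name in n]
        for selector in _SELECTOR.findall(selectors):
            if selector == "*":
                # Wildcard: fan out to every element of each list.
                nodes = [item for n in nodes if isinstance(n, list) for item in n]
            else:
                i = int(selector)
                nodes = [n[i] for n in nodes if isinstance(n, list) and i < len(n)]
    return nodes

request = {
    "body": {"id": "evt_1", "tags": ["a", "b", "c"],
             "items": [{"id": "item-1"}, {"id": "item-2"}]},
    "headers": {"x-github-delivery": "12345678"},
}
print(resolve(request, "body.items[0].id"))  # ['item-1']
print(resolve(request, "body.items[*].id"))  # ['item-1', 'item-2']
```

A wildcard path yields several values; per the note above, those would then be combined and hashed as one key.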

Examples by Provider

Stripe:

{
  "body_path": "body.id"
}

GitHub:

{
  "include_fields": [
    "headers.x-github-delivery",
    "body.action"
  ]
}

Shopify:

{
  "body_path": "body.id"
}

Custom webhooks:

{
  "include_fields": [
    "body.transaction_id",
    "body.event_type"
  ]
}

Configuring Deduplication

Via Dashboard

Step 1: Edit Source

  1. Navigate to Sources
  2. Select your Source
  3. Click "Edit"

Step 2: Enable Deduplication

  1. Find "Deduplication" section
  2. Toggle "Enable Deduplication" on
  3. Choose strategy:
    • Single Field: Enter body_path
    • Include Fields: Add field paths to include
    • Exclude Fields: Add field paths to exclude
  4. Set TTL (default: 300 seconds)
  5. Click "Save"

Via API

Create Source with deduplication:

curl -X POST https://api.hooklistener.com/api/v1/sources \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Stripe Webhooks",
    "type": "stripe",
    "deduplication_config": {
      "enabled": true,
      "body_path": "body.id",
      "ttl_seconds": 300
    }
  }'

Update existing Source:

curl -X PATCH https://api.hooklistener.com/api/v1/sources/{source_id} \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "deduplication_config": {
      "enabled": true,
      "include_fields": ["body.order_id", "body.customer_id"],
      "ttl_seconds": 600
    }
  }'

Use Cases

Stripe Payment Webhooks

Problem: Stripe retries webhook deliveries that fail, so your handler may process the same event twice (for example, fulfilling an order or recording a payment twice).

Solution:

{
  "deduplication_config": {
    "enabled": true,
    "body_path": "body.id",
    "ttl_seconds": 3600
  }
}

Stripe's id field is unique per event. Use a longer TTL (1 hour) since Stripe may retry over longer periods.

GitHub Push Events

Problem: GitHub may send duplicate push events during network issues.

Solution:

{
  "deduplication_config": {
    "enabled": true,
    "include_fields": [
      "headers.x-github-delivery",
      "body.after"
    ],
    "ttl_seconds": 300
  }
}

Combine delivery ID and commit SHA to ensure uniqueness.

Shopify Order Webhooks

Problem: Shopify webhooks may arrive multiple times during processing.

Solution:

{
  "deduplication_config": {
    "enabled": true,
    "body_path": "body.id",
    "ttl_seconds": 600
  }
}

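Use the order ID for deduplication with a 10-minute window. Because body.id identifies the order rather than the individual event, two different events for the same order that arrive within the window share a key; if you subscribe to several order topics, consider an include_fields key that also contains a field identifying the event type.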

Custom Application Webhooks

Problem: Your application sends webhooks that may duplicate during retries.

Solution - Without unique ID:

{
  "deduplication_config": {
    "enabled": true,
    "include_fields": [
      "body.user_id",
      "body.action",
      "body.resource_id"
    ],
    "ttl_seconds": 300
  }
}

Solution - With timestamp to exclude:

{
  "deduplication_config": {
    "enabled": true,
    "exclude_fields": [
      "body.timestamp",
      "body.metadata.server_id"
    ],
    "ttl_seconds": 300
  }
}

High-Frequency Event Streams

Problem: IoT devices or monitoring systems send rapid events that may duplicate.

Solution:

{
  "deduplication_config": {
    "enabled": true,
    "include_fields": [
      "body.device_id",
      "body.event_type",
      "body.value"
    ],
    "ttl_seconds": 60
  }
}

Use a shorter TTL (1 minute) for high-frequency streams where duplicates arrive quickly and a repeated reading after that window is likely a genuinely new event.

Best Practices

Choosing a Strategy

  1. Use body_path when:
    • Webhook has a unique ID field
    • Structure is consistent
    • Single field is sufficient
  2. Use include_fields when:
    • No single unique field exists
    • Need composite key
    • Want explicit control over what's checked
  3. Use exclude_fields when:
    • Most fields should be considered
    • Easier to list exclusions than inclusions
    • Payload structure varies slightly

Setting TTL

Short TTL (60-120 seconds):

  • High-frequency events
  • Quick retry cycles
  • Keeping cache memory usage low is a priority

Medium TTL (300-600 seconds):

  • Standard webhooks
  • Most providers
  • Balanced approach

Long TTL (3600+ seconds):

  • Infrequent webhooks
  • Providers with long retry windows
  • Critical duplicate prevention

Rule of thumb: Set TTL to 2-3x your provider's retry interval.
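
As a quick worked example of this rule of thumb, the Python sketch below turns a retry interval into a suggested TTL range. The helper name and the 120-second retry interval are hypothetical; check your provider's documented retry schedule for real values.

```python
def suggested_ttl_range(retry_interval_seconds: int) -> tuple:
    """Return (low, high) TTL bounds at 2x and 3x the retry interval."""
    if retry_interval_seconds <= 0:
        raise ValueError("retry interval must be a positive number of seconds")
    return (2 * retry_interval_seconds, 3 * retry_interval_seconds)

# Hypothetical provider that retries every 120 seconds.
low, high = suggested_ttl_range(120)
print(low, high)  # 240 360
```

With that interval, the default 300-second TTL falls inside the suggested range.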

Field Selection

  1. Always include unique identifiers
    • Event IDs
    • Transaction IDs
    • Delivery IDs
  2. Consider temporal fields
    • Include if part of uniqueness
    • Exclude if generated per request
  3. Test with real webhooks
    • Use sample payloads
    • Verify deduplication works
    • Check for false positives

Performance

  • Keep include_fields lists short (< 10 fields)
  • Use simple paths (avoid deep nesting when possible)
  • Avoid wildcards unless necessary
  • Monitor deduplication metrics

Monitoring Deduplication

Metrics to Track

Duplicate rate:

duplicate_webhooks / total_webhooks * 100

Typical rates:

  • 0-5%: Normal (occasional retries)
  • 5-15%: Common during provider issues
  • 15%+: Investigate provider or configuration
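
The formula and the bands above can be combined in a small Python helper. The function names are hypothetical, and because the listed bands share their endpoints (5% and 15%), assigning each boundary value to the higher band is an assumption.

```python
def duplicate_rate(duplicate_webhooks: int, total_webhooks: int) -> float:
    """Duplicate rate as a percentage; 0.0 when nothing has been received."""
    if total_webhooks == 0:
        return 0.0
    return 100 * duplicate_webhooks / total_webhooks

def classify(rate: float) -> str:
    """Map a duplicate rate to the bands listed above."""
    if rate < 5:
        return "normal"
    if rate < 15:
        return "common during provider issues"
    return "investigate"

rate = duplicate_rate(250, 10000)
print(rate, classify(rate))  # 2.5 normal
```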

Dashboard Metrics

View in Sources → [Your Source] → Metrics:

  • Total webhooks received
  • Duplicate webhooks rejected
  • Duplicate rate over time
  • Deduplication hit rate

API Metrics

curl -X GET https://api.hooklistener.com/api/v1/sources/{source_id}/stats \
  -H "Authorization: Bearer YOUR_API_KEY"

Response:

{
  "total_requests": 10000,
  "duplicates_rejected": 250,
  "duplicate_rate": 2.5,
  "period": "24h"
}

Troubleshooting

No Duplicates Detected

Symptoms:

  • Deduplication enabled
  • Expecting duplicates
  • All webhooks processed

Causes:

1. Wrong field path:

# Check actual webhook payload
curl -X GET https://api.hooklistener.com/api/v1/events/{event_id} \
  -H "Authorization: Bearer YOUR_API_KEY"

Verify the field path exists in the payload.

2. Field value changes:

  • Timestamps in deduplication path
  • Random IDs generated per request
  • Dynamic content

Solution: Remove the dynamic fields from the key, either by listing them in exclude_fields or by switching to include_fields and listing only stable fields.

3. TTL too short:

  • Duplicates arrive after TTL expires
  • Increase TTL to cover retry window

4. Deduplication not saved:

  • Check Source configuration
  • Verify enabled: true
  • Confirm strategy is set

Too Many False Positives

Symptoms:

  • Legitimate webhooks rejected as duplicates
  • Different events marked as duplicates

Causes:

1. Too broad exclusion:

Problem: Excluding the entire data object removes too much from the hash

{
  "exclude_fields": [
    "body.data"
  ]
}

Solution: Be more specific

{
  "exclude_fields": [
    "body.data.timestamp",
    "body.data.metadata"
  ]
}

2. Missing unique identifier:

Problem: Only non-unique fields are included, so different events with the same type and status collide

{
  "include_fields": [
    "body.type",
    "body.status"
  ]
}

Solution: Add a unique field (body.id here)

{
  "include_fields": [
    "body.id",
    "body.type",
    "body.status"
  ]
}

3. TTL too long:

  • Deduplication hashes are kept longer than the provider's retry window
  • Different events that produce the same key are treated as duplicates
  • Reduce the TTL to match the retry window

Deduplication Not Working

Symptoms:

  • Configuration looks correct
  • Still processing duplicates

Debug steps:

1. Verify configuration:

curl -X GET https://api.hooklistener.com/api/v1/sources/{source_id} \
  -H "Authorization: Bearer YOUR_API_KEY"

Check deduplication_config is set correctly.

2. Test field path:

# Send test webhook
curl -X POST https://api.hooklistener.com/api/v1/sources/{source_id}/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "id": "test-123",
    "type": "test"
  }'

# Send duplicate immediately
curl -X POST https://api.hooklistener.com/api/v1/sources/{source_id}/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "id": "test-123",
    "type": "test"
  }'

Both requests return 200 OK. The second should be counted as a duplicate and should not create an Event.

3. Check logs: Look for deduplication messages:

  • "Duplicate request payload detected, skipping processing"
  • "Deduplication check failed"

4. Verify field exists: Ensure body_path or include_fields point to existing fields in payload.

Advanced Configuration

Multiple Sources, Different Rules

Configure each Source independently:

Production Source:

{
  "name": "Production Stripe",
  "deduplication_config": {
    "enabled": true,
    "body_path": "body.id",
    "ttl_seconds": 3600
  }
}

Development Source:

{
  "name": "Development Stripe",
  "deduplication_config": {
    "enabled": false
  }
}

Combining with Filters

Deduplication occurs BEFORE filters:

  1. Webhook received
  2. Deduplication check (if enabled)
  3. If not duplicate: Apply filters
  4. If passes filters: Apply transformations
  5. Forward to destinations

This means duplicates are rejected regardless of filter configuration.
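
The ordering can be illustrated with a short Python sketch in which the duplicate check runs before any filter is consulted. The handler, the filter function, and the sample events are hypothetical, and a plain set stands in for the deduplication cache. Storing the key even when the webhook is later filtered out follows from the order above but is an inference, not documented behavior.

```python
def handle(webhook: dict, seen_ids: set, passes_filters) -> str:
    """Apply the documented order: deduplication first, then filters."""
    key = webhook["id"]
    if key in seen_ids:
        return "duplicate"  # rejected before filters are evaluated
    seen_ids.add(key)
    if not passes_filters(webhook):
        return "filtered"
    # Transformations and forwarding to destinations would happen here.
    return "forwarded"

def only_payments(webhook: dict) -> bool:
    """Hypothetical filter: keep only payment events."""
    return webhook["type"].startswith("payment")

seen = set()
results = [
    handle({"id": "evt_1", "type": "payment.succeeded"}, seen, only_payments),
    handle({"id": "evt_1", "type": "payment.succeeded"}, seen, only_payments),
    handle({"id": "evt_2", "type": "customer.created"}, seen, only_payments),
    handle({"id": "evt_2", "type": "customer.created"}, seen, only_payments),
]
print(results)  # ['forwarded', 'duplicate', 'filtered', 'duplicate']
```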

Temporary Disabling

Temporarily disable without losing configuration:

curl -X PATCH https://api.hooklistener.com/api/v1/sources/{source_id} \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "deduplication_config": {
      "enabled": false
    }
  }'

Configuration is preserved, just disabled.

Next Steps

Now that you understand deduplication:

  1. Configure Sources with deduplication enabled
  2. Monitor Events to verify deduplication is working
  3. Use Filters for additional webhook routing
  4. Track Issues if deduplication problems occur

Deduplication is essential for production webhook workflows. Configure it properly to prevent duplicate processing and ensure data consistency in your systems.