Async Batch Processing Pipelines

Async Batch Processing Pipelines decouple the arrival of maritime trade documents from their processing, so a port terminal, shipping line, or freight forwarder can absorb bursty, unpredictable submission traffic and turn it into an auditable, back-pressure-safe processing stream. Synchronous ingestion collapses under peak load (pre-deadline customs filings, simultaneous berthings, a VAN flushing a day’s backlog), and every stall propagates straight into terminal gate and vessel-planning delays. This pattern sits inside the Document Ingestion & EDI Parsing Workflows framework: acknowledge fast at the wire, enqueue immutable payloads, and let a pool of typed Python workers drain the queue at a concurrency the downstream systems can actually sustain.

The edge acknowledges fast and enqueues immutable payloads; a prefetch-1 worker pool drains the broker through one validation gate that fans out to accepted systems, a quarantine topic, and a dead-letter queue — with replay and reconcile paths feeding recoverable work back in.

Ingestion Boundary & Protocol Handling

The ingestion edge exists to accept work quickly and durably, never to process it. Documents arrive over four transport families with very different failure modes: SFTP drops from carrier and VAN partners, HTTPS webhooks from carrier APIs, EDI VAN interchanges (AS2/OFTP2), and legacy email gateways. Each endpoint performs only cheap, non-blocking checks — MIME type, declared size against the terminal SLA, character encoding, and a duplicate-interchange guard — then returns a receipt token to the sender and enqueues the payload. Returning the receipt synchronously while deferring the actual parse is what keeps back-pressure off upstream carrier systems during the pre-cutoff surge; the same syntax-tolerant, semantics-strict posture is used across every branch of the parent workflow.
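The cheap edge checks described above can be sketched as a single function that never parses the document body. This is a minimal illustration, not the terminal's actual gateway: the size ceiling, the MIME allow-list, the `Receipt` type, and the in-memory `seen_refs` set are all assumptions introduced here.

```python
import hashlib
from dataclasses import dataclass

# Assumed limits for illustration only; real values come from the terminal SLA.
MAX_BYTES = 5 * 1024 * 1024
ALLOWED_MIME = {"application/edifact", "application/pdf", "text/plain"}


@dataclass(frozen=True)
class Receipt:
    token: str      # SHA-256 of the payload, handed back to the sender
    accepted: bool
    reason: str


def edge_check(raw: bytes, mime: str, seen_refs: set[str], interchange_ref: str) -> Receipt:
    """Run only cheap, non-blocking checks; the parse itself is deferred to a worker."""
    token = hashlib.sha256(raw).hexdigest()
    if mime not in ALLOWED_MIME:
        return Receipt(token, False, "MIME_UNSUPPORTED")
    if len(raw) > MAX_BYTES:
        return Receipt(token, False, "TOO_LARGE")
    if mime != "application/pdf":
        try:
            raw.decode("utf-8")          # encoding check for text payloads only
        except UnicodeDecodeError:
            return Receipt(token, False, "ENCODING")
    if interchange_ref in seen_refs:     # duplicate-interchange guard
        return Receipt(token, False, "DUPLICATE_INTERCHANGE")
    seen_refs.add(interchange_ref)
    return Receipt(token, True, "ENQUEUED")
```

In a real deployment the duplicate guard would sit in a shared store rather than a process-local set, so that every ingestion endpoint sees the same history.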

Payloads are serialized onto the broker as immutable byte arrays with a metadata header envelope rather than being mutated in place. The minimum header contract every worker can rely on is small but load-bearing:

Header	Example	Purpose
`correlation_id`	`01HZX…` (ULID)	End-to-end trace key across broker, workers, and logs
`source_system`	`MAEU_AS2`	Carrier/VAN identity for schema selection and rate accounting
`document_type`	`IFCSUM` / `VERMAS` / `BL_PDF`	Priority routing and worker-pool selection
`interchange_ref`	UNB interchange control reference (data element 0020)	Idempotency key + duplicate-interchange detection
`ingestion_timestamp`	RFC 3339 UTC	Ordering, SLA measurement, and audit lineage
`raw_payload_sha256`	`9f2c…`	Content-addressed idempotency and audit receipt

Header-driven routing is what lets a time-critical SOLAS VERMAS (Verified Gross Mass) declaration skip the general OCR queue and land directly on a weight-validation worker, while a bulk scanned-B/L batch waits its turn. The broker itself is the shock absorber: it flattens arrival spikes into a queue depth the worker pool consumes at a steady rate, so a thousand manifests arriving in one minute never translate into a thousand simultaneous parses. Holding that consumption rate below what the downstream systems can sustain is the job of Backpressure control for EDI ingestion workers, which governs prefetch and queue-depth limits under a pre-cutoff surge.
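Header-driven routing can be sketched with a typed header dictionary and a lookup table. The queue names and priority numbers below are hypothetical placeholders, and `EnvelopeHeaders` simply mirrors the header contract in the table above.

```python
from typing import TypedDict


class EnvelopeHeaders(TypedDict):
    correlation_id: str
    source_system: str
    document_type: str
    interchange_ref: str
    ingestion_timestamp: str
    raw_payload_sha256: str


# Hypothetical queue names and priorities (a lower number drains first).
ROUTES: dict[str, tuple[str, int]] = {
    "VERMAS": ("documents.vgm", 0),   # time-critical VGM declarations
    "IFCSUM": ("documents.edi", 5),
    "BL_PDF": ("documents.ocr", 9),   # bulk scanned B/L batches wait their turn
}
FALLBACK: tuple[str, int] = ("documents.unrouted", 9)


def select_queue(headers: EnvelopeHeaders) -> tuple[str, int]:
    """Pick a queue and priority from the header alone, without touching the payload."""
    return ROUTES.get(headers["document_type"], FALLBACK)
```

Routing on the header alone keeps the decision cheap enough to make at the ingestion edge.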

Python Data Structure Mapping

Python owns this layer because of its async ecosystem and distributed task frameworks: Celery over Redis or RabbitMQ, or RQ over Redis, for distribution, and asyncio for the I/O-bound legs. The critical design rule is that nothing crosses a queue boundary as an implicit dict. Every payload is deserialized into an explicitly typed model so a malformed message fails at construction, inside a worker, where it can be routed, not three hops downstream inside a customs submission.

The concurrency model is split by workload. I/O-bound stages — OCR pre-processing, external registry lookups, and PDF Bill of Lading Extraction — run on asyncio workers. CPU-bound stages — ISO 6346 check-digit math, cryptographic hashing, and regex-heavy EDI segment scanning — are pushed to a concurrent.futures.ProcessPoolExecutor so they never block the event loop. Maritime types map deliberately: UN/EDIFACT interchange headers become TypedDict structures on the hot path, while regulated declarations become pydantic models carrying decimal.Decimal precision and unit validation.
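The split between the event loop and a process pool might look like the sketch below, which offloads SHA-256 hashing through `loop.run_in_executor`. The function names are illustrative, and hashing stands in for any of the CPU-bound stages named above.

```python
import asyncio
import hashlib
from concurrent.futures import Executor


def sha256_hex(raw: bytes) -> str:
    """CPU-bound work; a module-level function so a process pool can pickle it."""
    return hashlib.sha256(raw).hexdigest()


async def hash_batch(payloads: list[bytes], pool: Executor) -> list[str]:
    """Hash every payload in the pool while the event loop stays free for I/O."""
    loop = asyncio.get_running_loop()
    futures = [loop.run_in_executor(pool, sha256_hex, p) for p in payloads]
    return list(await asyncio.gather(*futures))
```

A caller would typically create one `concurrent.futures.ProcessPoolExecutor` at worker start-up and pass it in as `pool`, rather than building a pool per batch.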

from __future__ import annotations

from decimal import Decimal
from enum import StrEnum

import structlog
from pydantic import BaseModel, ConfigDict, Field, field_validator

log = structlog.get_logger()


class DocType(StrEnum):
    IFCSUM = "IFCSUM"
    VERMAS = "VERMAS"
    BL_PDF = "BL_PDF"


class BatchEnvelope(BaseModel):
    """Immutable unit of work as it crosses the broker boundary."""

    model_config = ConfigDict(frozen=True)  # reject attribute assignment after construction

    correlation_id: str = Field(..., min_length=26, max_length=26)  # ULID
    source_system: str
    document_type: DocType
    interchange_ref: str                       # UNB interchange control reference (0020), the idempotency seed
    raw_payload_sha256: str = Field(..., pattern=r"^[0-9a-f]{64}$")
    gross_weight_kg: Decimal | None = Field(default=None, ge=0)

    @field_validator("interchange_ref")
    @classmethod
    def strip_ref(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("empty interchange control reference")
        return v.strip()

    def idempotency_key(self) -> str:
        # Content + control-number keyed: replays collapse to a no-op.
        return f"{self.source_system}:{self.interchange_ref}:{self.raw_payload_sha256}"

Coercion rules are deterministic: MEA+WT+G+KGM:15000 becomes Decimal("15000") kilograms and stays in kilograms until the presentation boundary; an EDIFACT DTM value is parsed to a timezone-aware UTC datetime; a LOC element resolves to a five-character UN/LOCODE. Weights use Decimal, never float, because a rounding error on a VGM figure is a safety and compliance defect, not a display quirk.
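Two of these coercions can be sketched with the standard library alone. The sketch assumes the EDIFACT composite order of unit code followed by value in the MEA measurement composite, handles only DTM format qualifier 203 (CCYYMMDDHHMM), treats the timestamp as already being in UTC, and ignores release-character escaping; the function names are illustrative.

```python
from datetime import datetime, timezone
from decimal import Decimal


def parse_mea_kg(segment: str) -> Decimal:
    """Return the kilogram value of an MEA segment as an exact Decimal."""
    parts = segment.rstrip("'").split("+")
    if parts[0] != "MEA" or len(parts) < 4:
        raise ValueError(f"not a complete MEA segment: {segment!r}")
    unit, _, value = parts[3].partition(":")
    if unit != "KGM" or not value:
        raise ValueError(f"expected a KGM measurement, got {parts[3]!r}")
    return Decimal(value)  # never float: VGM figures must not pick up rounding error


def parse_dtm_utc(segment: str) -> datetime:
    """Parse a DTM segment with format qualifier 203 into an aware UTC datetime."""
    parts = segment.rstrip("'").split("+")
    if parts[0] != "DTM" or len(parts) < 2:
        raise ValueError(f"not a DTM segment: {segment!r}")
    _qualifier, value, fmt = parts[1].split(":")
    if fmt != "203":
        raise ValueError(f"unsupported DTM format qualifier {fmt!r}")
    # Assumption: the sender's timestamp is UTC; EDIFACT itself does not say so.
    return datetime.strptime(value, "%Y%m%d%H%M").replace(tzinfo=timezone.utc)
```

A production parser would also honour the release character and the other DTM format qualifiers a partner is known to send.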

Validation, Quarantine & Compliance Auditing

Maritime pipelines cannot tolerate silent failure — a missing container weight, a misparsed commodity code, or a truncated segment turns into a customs hold, a demurrage charge, or a segregation violation on the vessel. Validation therefore runs in three ordered tiers inside the worker, and the tier a payload fails in determines where it is routed.

Structural: format, encoding, mandatory-field presence. A payload that will not deserialize into BatchEnvelope or its message model raises a pydantic.ValidationError. This is unrecoverable corruption; it fails fast to the dead-letter queue.
Semantic: UN/EDIFACT segment rules, ISO 6346 container check digits (position-weighted sum, modulo 11), and cross-field consistency such as declared gross weight versus summed line items. A record that parsed cleanly but references an unknown UN/LOCODE or carries a mismatched check digit is a recoverable business exception; it routes to the quarantine topic, not the DLQ.
Regulatory: SOLAS VGM thresholds, dangerous-goods segregation, and tariff-code (HS/HTS) resolution against a cached registry so no worker blocks on a customs API during a peak window. A failure here, such as a missing VGM, is likewise reconcilable by an ops team and routes to the quarantine topic.

The distinction between the DLQ and the quarantine topic is the single most important operational decision on this page. The DLQ holds messages a human cannot fix without engineering; the quarantine topic holds messages an ops team can reconcile — a stale registry, a missing VGM — without halting the line. Every validation event writes an immutable audit record (correlation_id, rule_applied, original_value, transformed_value, compliance_status, rule_version) so a port-state-control or customs audit can replay exactly why a document was accepted, quarantined, or rejected. Records that need to resolve into the canonical trade schema are reconciled against the Bill of Lading Schema Mapping layer, and aggregate manifests are cross-checked against IFCSUM EDI Message Parsing output. The full multi-tier rule engine and its versioning strategy live in the Schema Validation Frameworks reference.
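The audit record named above could be modelled as a frozen dataclass serialized to one JSON line per validation event. The field names come from the text; the status vocabulary, the example values, and the `to_json_line` helper are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditRecord:
    correlation_id: str
    rule_applied: str
    original_value: str
    transformed_value: str
    compliance_status: str   # assumed vocabulary: ACCEPTED / QUARANTINED / REJECTED
    rule_version: str


def to_json_line(record: AuditRecord) -> str:
    """Serialize deterministically so identical events produce identical lines."""
    return json.dumps(asdict(record), sort_keys=True, separators=(",", ":"))


record = AuditRecord(
    correlation_id="01HZX0000000000000000000AB",   # illustrative ULID
    rule_applied="iso6346_check_digit",
    original_value="CSQU3054384",
    transformed_value="CSQU3054384",
    compliance_status="QUARANTINED",
    rule_version="2024.05",
)
line = to_json_line(record)
```

Freezing the dataclass only prevents in-process mutation; durable immutability still depends on writing these lines to append-only storage.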

Downstream Integration

Once a batch clears validation, its output is not a file — it is a set of idempotent events that operational systems consume. A cleared manifest updates the terminal operating system (TOS); a cleared VGM releases a container for stowage; a status transition drives a milestone. Because the broker delivers at-least-once, every downstream publish is keyed on the envelope’s idempotency key so an AS2 retransmission or a worker retry collapses to a no-op rather than double-booking a berth or re-filing a customs entry. The dedupe-store mechanics behind that guarantee — the key derivation, the TTL window, and the set-if-not-exists check — are detailed in Idempotent deduplication for maritime document queues.

Container-level output resolves into the Container Hierarchy Data Models — the Vessel → Bay → Tier → Row → Container topology — so equipment tracking, reefer monitoring, and dangerous-goods segregation execute off a shared structure. Lifecycle transitions (GATE_IN, LOADED_ONBOARD) propagate to the Port Call Workflow Design state machine, which relies on the idempotency contract above to stay correct under duplicate delivery. Where physical movement must be correlated with position, cleared events join the real-time feed from AIS Data Stream Integration, and equipment status is reconciled through the Container Status Mapping Rules. Every cross-boundary publish authenticates under the Maritime Security Boundary Setup zero-trust controls, and TOS reads that the pipeline depends on follow the Terminal API Polling Strategies.

Fallback Chains & Uptime Guarantees

Uptime here is a property of graceful degradation, not of never failing. External dependencies — customs gateways, port-authority APIs, the message broker itself — will rate-limit, time out, and go down for maintenance. Production pipelines therefore make degraded operation an explicit, logged mode rather than an accident.

The lifecycle machine keeps each document on a single committed state, so a transient failure resumes rather than re-submits; only a permanent defect or an exhausted fallback drops to the dead-letter queue, and a replay returns it to validation.
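A lifecycle machine of this kind can be sketched as an explicit transition table. The five forward states are the ones this page names; the `DEAD_LETTER` state name, the exact set of allowed edges, and the `advance` helper are assumptions for illustration.

```python
from enum import Enum


class State(str, Enum):
    INGESTED = "INGESTED"
    VALIDATING = "VALIDATING"
    ENRICHED = "ENRICHED"
    ROUTED = "ROUTED"
    ACKNOWLEDGED = "ACKNOWLEDGED"
    DEAD_LETTER = "DEAD_LETTER"   # assumed name for the dead-letter state


ALLOWED: dict[State, frozenset[State]] = {
    State.INGESTED: frozenset({State.VALIDATING}),
    State.VALIDATING: frozenset({State.ENRICHED, State.DEAD_LETTER}),
    State.ENRICHED: frozenset({State.ROUTED, State.DEAD_LETTER}),
    State.ROUTED: frozenset({State.ACKNOWLEDGED, State.DEAD_LETTER}),
    State.ACKNOWLEDGED: frozenset(),
    State.DEAD_LETTER: frozenset({State.VALIDATING}),   # replay returns to validation
}


class IllegalTransition(Exception):
    pass


def advance(current: State, target: State) -> State:
    """Move to `target`; repeating the committed state is a no-op, not an error."""
    if target is current:
        return current
    if target not in ALLOWED[current]:
        raise IllegalTransition(f"{current.value} -> {target.value}")
    return target
```

Treating a repeated transition as a no-op is what lets a redelivered message resume from the committed state instead of re-submitting.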

Exponential backoff with jitter: retry transient 429/503 responses with base_delay=2s, max_delay=300s, and per-attempt jitter; cap retries before routing to quarantine so a degraded API cannot pin the worker pool.
Circuit breakers: open per downstream endpoint after five consecutive failures (60 s recovery timeout) and serve cached tariff tables and local HS/LOCODE lookups instead of hammering a failing service.
Deterministic idempotency: keys derived from the document SHA-256 or the UNB interchange control reference (data element 0020) make replay safe; at-least-once broker delivery plus idempotent processing yields effectively-once handling for regulatory documents.
Local spooling on broker loss: if the broker or a critical TOS endpoint (HTTP 503) is unreachable, ingestion writes validated payloads to encrypted disk volumes and replays automatically on health restoration, so no interchange is dropped.
Lifecycle state tracking: the INGESTED → VALIDATING → ENRICHED → ROUTED → ACKNOWLEDGED machine above ensures a transient network failure never triggers a duplicate customs submission; failed transitions carry fallback_reason and degraded_mode flags for the audit trail.
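The first two items in the list above can be sketched with the standard library. The defaults reuse the figures quoted there (2 s base, 300 s cap, five failures, 60 s recovery); the "full jitter" strategy, the class and function names, and the injectable clock are assumptions chosen to keep the sketch testable.

```python
from __future__ import annotations

import random
import time
from typing import Callable


def backoff_delay(attempt: int, base_delay: float = 2.0, max_delay: float = 300.0) -> float:
    """Full jitter: a uniform delay between 0 and the capped exponential ceiling."""
    ceiling = min(max_delay, base_delay * (2 ** attempt))
    return random.uniform(0.0, ceiling)


class CircuitBreaker:
    """Opens after N consecutive failures; allows a probe once the timeout elapses."""

    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: float | None = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True                       # closed: calls pass through
        return self._clock() - self._opened_at >= self._recovery_timeout

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None                # close the breaker again

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()   # open, or re-open after a failed probe
```

While `allow()` returns False, the caller serves the cached tables instead of calling the endpoint; one breaker instance per downstream endpoint keeps a single failing service from tripping the rest.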

Failed tasks land in the DLQ with a structured payload (error_code, failed_segment, retry_count) so triage and manual override happen without a pipeline restart. Observability underpins all of it: structured JSON logs via structlog, Prometheus gauges for queue depth, worker utilisation, retry rate, and DLQ depth, and alerts on DLQ accumulation or a validation error rate above 2% of throughput. High-availability deployments let workers finish in-flight tasks on SIGTERM before they exit, run Redis Sentinel or RabbitMQ quorum queues to survive a broker node loss, and autoscale worker pods on queue length under Kubernetes, targeting 99.95% availability without letting pipeline degradation escalate into a gate stoppage or a vessel-departure delay.

Step-by-step Implementation Guide

The reference pipeline below accepts a raw interchange, enqueues it durably, and drains it through a validating worker. Each step is shown in isolation and uses type annotations with structlog for structured JSON logging; broker, seen_store, and tos stand for the deployment's broker client, dedupe store, and TOS publisher and are not defined here. The queue-configuration detail beneath these steps (prefetch, acknowledgement, and DLQ wiring) is covered in Building Celery queues for maritime doc ingestion, and a complete worked pipeline is walked through in Async bulk B/L ingestion with Celery and a Redis DLQ.

Step 1 — Acknowledge at the edge and enqueue an immutable payload

Return a receipt token before any parsing, and content-address the payload so replays are detectable.

import hashlib

import structlog

log = structlog.get_logger()


def accept(raw: bytes, *, source_system: str) -> str:
    digest = hashlib.sha256(raw).hexdigest()
    log.info("ingested", source_system=source_system, sha256=digest, size=len(raw))
    broker.enqueue("documents.inbound", raw, headers={"sha256": digest})
    return digest  # receipt token handed straight back to the sender

Step 2 — Configure the worker for at-least-once, back-pressure-safe draining

# Celery worker settings: one task in flight per worker, ack after success.
worker_prefetch_multiplier = 1
task_acks_late = True
task_reject_on_worker_lost = True
task_serializer = "msgpack"        # compact, fast to deserialize under burst
accept_content = ["msgpack"]       # Celery accepts only JSON by default; allow msgpack explicitly

Step 3 — Deserialize into a typed envelope at the worker boundary

from pydantic import ValidationError


def load_envelope(raw: bytes, headers: dict[str, str]) -> BatchEnvelope | None:
    try:
        return BatchEnvelope.model_validate_json(raw)
    except ValidationError as exc:
        log.error("structural_reject", sha256=headers.get("sha256"), errors=exc.errors())
        broker.publish("documents.dlq", raw, reason="STRUCTURAL")
        return None  # unrecoverable corruption -> dead-letter, never downstream

Step 4 — Run the tiered validation and choose a route

def route(env: BatchEnvelope, *, locode_ok: bool, vgm_ok: bool) -> str:
    if not locode_ok:
        log.warning("quarantine", key=env.idempotency_key(), reason="LOCODE_UNKNOWN")
        return "QUARANTINE"
    if env.document_type is DocType.VERMAS and not vgm_ok:
        log.warning("quarantine", key=env.idempotency_key(), reason="VGM_IMPLAUSIBLE")
        return "QUARANTINE"
    return "ACCEPTED"

Step 5 — Publish downstream idempotently, keyed on the interchange

def publish(env: BatchEnvelope, event: dict[str, object]) -> None:
    key = env.idempotency_key()
    if seen_store.setnx(key, ttl=86_400):      # first delivery only
        tos.emit(event, dedupe_key=key)
        log.info("published", key=key, document_type=env.document_type)
    else:
        log.info("duplicate_suppressed", key=key)  # replay collapses to a no-op
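The set-if-not-exists check behind Step 5 can be sketched with an in-memory stand-in for the dedupe store. `InMemorySeenStore`, its `delete` method, and `publish_once` are hypothetical names introduced here; a production store would be a shared service, and releasing the claim on a failed emit is a design choice this sketch adds, not something the step above does.

```python
from __future__ import annotations

import time
from typing import Callable


class InMemorySeenStore:
    """Process-local stand-in for a shared dedupe store with TTL expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._expiry: dict[str, float] = {}
        self._clock = clock

    def setnx(self, key: str, ttl: float) -> bool:
        """Claim `key` for `ttl` seconds; return False if it is already claimed."""
        now = self._clock()
        expires = self._expiry.get(key)
        if expires is not None and expires > now:
            return False
        self._expiry[key] = now + ttl
        return True

    def delete(self, key: str) -> None:
        self._expiry.pop(key, None)


def publish_once(
    store: InMemorySeenStore, key: str, emit: Callable[[], None], ttl: float = 86_400
) -> bool:
    """Emit at most once per key; release the claim if the emit itself fails."""
    if not store.setnx(key, ttl):
        return False              # replay collapses to a no-op
    try:
        emit()
    except Exception:
        store.delete(key)         # otherwise a retry would be wrongly suppressed
        raise
    return True
```

Without the release, a claim recorded just before a failed emit would silence the retry and lose the event. Passing the same key downstream as a dedupe key covers the opposite case, where the emit succeeded but the caller never learned of it.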

A ValidationError in Step 3 is a structural defect and belongs in the DLQ; a QUARANTINE verdict in Step 4 is a recoverable business exception and belongs on the quarantine topic. Keeping those two paths distinct is what lets the quarantine stream reconcile automatically once an authoritative source recovers.

Troubleshooting Common Failures

Symptom	Root cause	Fix
Worker OOM-killed on a large manifest	Whole payload materialised in memory before parsing	Stream segments; enforce `cgroups` memory limits; offload CPU parse to a process pool
Same customs entry filed twice	Non-idempotent publish under at-least-once delivery	Key downstream emits on `(source_system, interchange_ref, sha256)` so replays no-op
Queue depth climbs but throughput flat	`worker_prefetch_multiplier` hoarding tasks on a stuck worker	Set prefetch to 1 with `task_acks_late`; reject on worker loss
VERMAS stuck behind bulk OCR batch	No priority routing at the ingestion edge	Route on `document_type` header to a dedicated VGM worker pool
DLQ filling during a customs-API outage	Retries exhausted against a degraded endpoint	Trip the circuit breaker; serve cached tariff/LOCODE tables; route to quarantine, not DLQ
`LOCODE_UNKNOWN` on a valid port	Stale UN/LOCODE registry, newly gazetted code	Quarantine, refresh registry, reconcile — never hard-reject a valid interchange
Payloads lost when broker restarts	In-flight work not persisted	Enable quorum/durable queues + local disk spooling with automatic replay
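The "stream segments" fix in the first row can be sketched as a generator that yields one EDIFACT segment at a time from decoded chunks, so a large manifest is never held in memory whole. It assumes the default service characters (`'` as segment terminator, `?` as release character) and does not read a UNA header; the function name is illustrative.

```python
from typing import Iterable, Iterator


def iter_segments(
    chunks: Iterable[str], terminator: str = "'", release: str = "?"
) -> Iterator[str]:
    """Yield segments lazily; an escaped terminator stays inside its segment."""
    buf: list[str] = []
    escaped = False
    for chunk in chunks:
        for ch in chunk:
            if escaped:
                buf.append(ch)        # character after the release char is literal
                escaped = False
            elif ch == release:
                buf.append(ch)
                escaped = True
            elif ch == terminator:
                segment = "".join(buf).strip()
                if segment:
                    yield segment
                buf.clear()
            else:
                buf.append(ch)
    if "".join(buf).strip():
        # Data after the last terminator means the interchange was cut short.
        raise ValueError("truncated segment at end of interchange")
```

Because segments may straddle chunk boundaries, the buffer carries over between chunks and only one segment is held at a time.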

Frequently Asked Questions

When should a message go to the dead-letter queue versus the quarantine topic?

Route to the DLQ only for corruption a human cannot fix without engineering: a broken envelope, an undecodable payload, a pydantic.ValidationError on a mandatory field. Route to the quarantine topic for records that parsed cleanly but failed a business or regulatory check an ops team can reconcile: an unknown UN/LOCODE, a missing VGM, a mismatched ISO 6346 check digit. Mixing the two either drops legally valid documents or floods engineering with recoverable exceptions.

How do we get exactly-once processing when the broker is at-least-once?

You do not get exactly-once delivery; you get effectively-once processing by making the work idempotent. Derive a deterministic key from the document SHA-256 or the UNB interchange control reference (data element 0020), record it in a dedupe store before publishing, and make every downstream emit a no-op on a repeat key. With late acknowledgement, the broker (Redis or RabbitMQ) redelivers on worker loss; idempotent publishing is what stops that redelivery from double-booking a berth or re-filing a manifest.

Why set worker_prefetch_multiplier to 1 for maritime document workers?

Maritime tasks are heavy and uneven: a 40-page OCR batch next to a single VGM segment. Celery's default worker_prefetch_multiplier of 4 lets each worker process reserve several messages up front; if it stalls mid-batch, everything it prefetched is stranded until it recovers. Setting prefetch to 1 with task_acks_late=True means a worker process holds one task at a time, acknowledges only on success, and any node loss redelivers a single message rather than a hoarded block.

What happens to in-flight documents when the message broker goes down?

Ingestion endpoints switch to local disk buffering, writing validated payloads to encrypted volumes and replaying them automatically on reconnection, so no interchange is dropped. Durable or quorum queues preserve already-enqueued work across a node restart, and workers finish in-flight tasks on SIGTERM before exiting. The lifecycle state machine ensures a document mid-flight resumes from its last committed state rather than reprocessing from scratch.

Related

Async bulk B/L ingestion with Celery and a Redis DLQ — a complete worked pipeline draining bulk bills of lading through Celery workers into a Redis dead-letter queue
Backpressure control for EDI ingestion workers — holding consumption below sustainable throughput with prefetch and queue-depth limits
Idempotent deduplication for maritime document queues — key derivation, TTL windows, and the dedupe store that makes at-least-once delivery safe
Building Celery queues for maritime doc ingestion — the queue, prefetch, and DLQ wiring beneath this pipeline
Schema Validation Frameworks — the multi-tier rule engine the validation stage calls
PDF Bill of Lading Extraction — the I/O-bound extraction stage feeding these workers
IFCSUM EDI Message Parsing — aggregate manifest parsing reconciled against batch output
Port Call Workflow Design — the idempotent state machine that consumes cleared events

