Building Celery queues for maritime doc ingestion

Maritime port operations generate thousands of structured and unstructured documents daily: Bills of Lading, customs manifests, stowage plans, and UN/EDIFACT or ANSI X12 interchange messages. Processing these synchronously adds latency at quay gates, customs checkpoints, and terminal operating system (TOS) integrations. A Celery queue turns that bottleneck into a resilient, horizontally scalable workflow: it decouples document receipt from validation, parsing, and compliance routing, so a port community system (PCS) can hold a sub-500 ms acknowledgment SLA even during peak vessel turnaround.

Architecture Alignment

This task sits inside the Async Batch Processing Pipelines topic area of the broader Document Ingestion & EDI Parsing Workflows domain. A message broker (RabbitMQ or Redis) absorbs backpressure while Celery orchestrates execution boundaries across isolated worker pools. The queue is deliberately thin: it accepts a raw payload, computes an audit hash, and fans the work out to version-specific handlers. The hard engineering problem is not enqueueing bytes — it is designing task boundaries that survive format drift, memory exhaustion, and strict regulatory gating without dropping a manifest mid-transit. Validation logic reuses the Schema Validation Frameworks contracts, aggregated-manifest parsing defers to IFCSUM EDI Message Parsing, and scanned documents route on to PDF Bill of Lading Extraction.

The ingestion call path: validate_and_route_edi hashes and version-checks each payload before the edi.parse workers extract segments; categorize_and_retry then forwards a normalised record to the TOS or diverts to quarantine, retry, or the regulatory-hold audit queue.

Prerequisites & Environment Setup

The pipeline targets Python 3.11+ (for tomllib, exception groups, and faster asyncio). Install the worker, broker client, structured-logging, and typed-validation stack:

pip install "celery[redis]==5.4.*" redis structlog pydantic

Structured JSON logging is mandatory in this niche — bare print() and unstructured logging strings are unacceptable because every task event must be queryable in the observability stack and reproducible for a customs audit. Configure structlog once, at worker boot, to emit JSON:

import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,  # merge per-task bound context
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.format_exc_info,  # render tracebacks for log.exception()
        structlog.processors.JSONRenderer(),
    ],
)
log = structlog.get_logger("maritime.ingestion")

Environment variables the workers expect:

Variable	Purpose	Example
`CELERY_BROKER_URL`	Redis/RabbitMQ connection for the ingestion queue	`redis://broker:6379/0`
`CELERY_RESULT_BACKEND`	Result/state store for audit correlation	`redis://broker:6379/1`
`INGEST_QUARANTINE_TOPIC`	Topic for recoverable business exceptions	`edi.quarantine`
`INGEST_DLX`	Dead-letter exchange for unrecoverable corruption	`edi.dlx`
`VGM_MIN_KG`	SOLAS VGM plausibility floor for MEA parsing	`1000`
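
One way to read these variables is a small frozen dataclass resolved once at worker boot. This is a sketch, not the pipeline's actual loader: the `IngestSettings` name and the fail-fast behaviour for a missing broker or result backend are assumptions, and the defaults simply mirror the example column above.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestSettings:
    """Hypothetical typed view over the worker environment variables."""

    broker_url: str
    result_backend: str
    quarantine_topic: str
    dead_letter_exchange: str
    vgm_min_kg: float

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "IngestSettings":
        # The broker and result backend have no safe default, so fail fast.
        required = ("CELERY_BROKER_URL", "CELERY_RESULT_BACKEND")
        missing = [name for name in required if name not in env]
        if missing:
            raise RuntimeError(f"missing required settings: {', '.join(missing)}")
        return cls(
            broker_url=env["CELERY_BROKER_URL"],
            result_backend=env["CELERY_RESULT_BACKEND"],
            quarantine_topic=env.get("INGEST_QUARANTINE_TOPIC", "edi.quarantine"),
            dead_letter_exchange=env.get("INGEST_DLX", "edi.dlx"),
            vgm_min_kg=float(env.get("VGM_MIN_KG", "1000")),
        )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to exercise in tests.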

Run a dedicated worker pool per document class so a slow OCR job never starves a fast EDI acknowledgment. For example, start the validation pool with celery -A maritime_ingestion worker -Q edi.validate -c 8 --max-tasks-per-child=200 and a second worker with -Q edi.parse for extraction. The --max-tasks-per-child limit recycles each worker process after 200 tasks, which caps memory growth from large stowage plans.
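
The queue split can also be declared in configuration so that callers never pick a queue by hand. The module below is a sketch under stated assumptions: the file name `celeryconfig.py`, the dotted task names, the `extract_bill_of_lading_pdf` task, and the `ocr.extract` queue are all hypothetical. The setting names themselves (`task_routes`, `task_default_queue`, `task_acks_late`, `worker_prefetch_multiplier`, `worker_max_tasks_per_child`) are standard Celery settings.

```python
# celeryconfig.py (hypothetical module), loaded with
# app.config_from_object("celeryconfig")

# Tasks without an explicit route land on the fast queue.
task_default_queue = "edi.validate"

# Map each task name to the queue its worker pool consumes.
task_routes = {
    "maritime_ingestion.tasks.validate_and_route_edi": {"queue": "edi.validate"},
    "maritime_ingestion.tasks.parse_ifcsum_segments": {"queue": "edi.parse"},
    # Assumed name for the OCR branch; it is not defined in this article.
    "maritime_ingestion.tasks.extract_bill_of_lading_pdf": {"queue": "ocr.extract"},
}

# Acknowledge a message after the task returns rather than on receipt.
task_acks_late = True

# Reserve one message per worker process so slow jobs do not hoard work.
worker_prefetch_multiplier = 1

# Configuration equivalent of --max-tasks-per-child=200.
worker_max_tasks_per_child = 200
```

Late acknowledgment means a task can run more than once, so it only suits tasks that are safe to repeat.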

Step-by-step Implementation

Step 1 — Validate and route on the interchange version

Shipping documents rarely hold a static schema across carrier networks: a line may migrate from UN/EDIFACT D96A to D22A, or a forwarder may inject proprietary XML into a standard COARRI message. Hardcoded parsers fail catastrophically under this drift. The entry task computes an immutable SHA-256 hash for the IMO FAL audit trail, tolerates BOM markers and mixed encodings, then routes to a version-specific handler or quarantines the schema violation. Version tokens are defined by UN/EDIFACT directories; X12 envelopes must open with an ISA segment.

import hashlib
import os
import re

import structlog
from celery import Celery, shared_task

app = Celery(
    "maritime_ingestion",
    broker=os.environ.get("CELERY_BROKER_URL", "redis://broker:6379/0"),
)
log = structlog.get_logger("maritime.validation")

# The directory version lives in the UNH message identifier, e.g.
# UNH+1+IFCSUM:D:96A:UN carries version "D" and release "96A".
_UNH_VERSION_RE = re.compile(r"UNH\+[^+']*\+[A-Z]{6}:([A-Z]):(\d{2}[A-Z])")
SUPPORTED_DIRECTORIES: frozenset[str] = frozenset({"D96A", "D22A"})


class SchemaDriftError(Exception):
    """Raised when an interchange version does not match a deployed handler."""


@shared_task(bind=True, max_retries=3, default_retry_delay=30)
def validate_and_route_edi(self, raw_payload: bytes, edi_standard: str) -> str:
    payload_hash: str = hashlib.sha256(raw_payload).hexdigest()
    # Reset first so context from an earlier task on this worker cannot leak.
    structlog.contextvars.clear_contextvars()
    structlog.contextvars.bind_contextvars(payload_hash=payload_hash)
    try:
        try:
            decoded: str = raw_payload.decode("utf-8-sig")  # also strips a BOM
        except UnicodeDecodeError:
            decoded = raw_payload.decode("iso-8859-1")  # accepts every byte value
            log.warning("encoding_fallback_applied", fallback="iso-8859-1")

        if edi_standard == "EDIFACT":
            match = _UNH_VERSION_RE.search(decoded)
            directory = "".join(match.groups()) if match else None
            if directory not in SUPPORTED_DIRECTORIES:
                raise SchemaDriftError(f"UNH directory version unsupported: {directory}")
            return "edifact_parser"
        if edi_standard == "X12":
            if not decoded.startswith("ISA"):
                raise SchemaDriftError("Missing ISA interchange header")
            return "x12_parser"
        raise ValueError(f"Unsupported standard: {edi_standard}")
    except (SchemaDriftError, ValueError) as exc:
        # Permanent faults: retrying the same bytes would fail the same way.
        log.error("schema_drift_quarantined", error=str(exc))
        return "quarantine"
    except Exception as exc:  # noqa: BLE001 (anything else is treated as transient)
        log.exception("validation_failure")
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)
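
The BOM and mixed-encoding tolerance described above can be looked at in isolation. The helper below is an illustrative sketch (`decode_payload` is a hypothetical name, not part of the task) that applies the same `utf-8-sig` then `iso-8859-1` order and reports which codec succeeded.

```python
def decode_payload(raw: bytes) -> tuple[str, str]:
    """Return (text, codec) using the utf-8-sig then iso-8859-1 order."""
    try:
        # utf-8-sig strips a leading byte-order mark when one is present.
        return raw.decode("utf-8-sig"), "utf-8-sig"
    except UnicodeDecodeError:
        # iso-8859-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("iso-8859-1"), "iso-8859-1"


with_bom = decode_payload(b"\xef\xbb\xbfUNB+UNOC:3'")
latin1 = decode_payload(b"NAD+CN+++Caf\xe9 Mar\xedtimo'")
```

Because `iso-8859-1` accepts any byte sequence, a successful fallback does not prove the sender actually used that codec. Recording the winning codec next to the payload hash lets a reviewer spot a mis-decoded name later.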

Step 2 — Parse segments inside isolated extraction boundaries

Once validated, documents enter throughput-optimized extraction. IFCSUM aggregated manifests require segment-by-segment state tracking; scanned B/Ls demand OCR post-processing. Celery’s group and chain primitives parallelize extraction without blocking gate operations. The parser below masks payment PII (data minimization for GDPR and customs), and validates the SOLAS Verified Gross Mass carried in a MEA segment — for example MEA+AAE+VGM+KGM:21000 (qualifier VGM, unit KGM, value 21000).

import os
import re
from typing import Any

import structlog
from celery import shared_task

log = structlog.get_logger("maritime.parsing")
VGM_MIN_KG: float = float(os.environ.get("VGM_MIN_KG", "1000"))
# Matches 16-digit card numbers, with optional space or hyphen grouping.
_PAN_RE = re.compile(r"\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b")


@shared_task(bind=True, max_retries=2)
def parse_ifcsum_segments(self, decoded_text: str) -> dict[str, Any]:
    # Legacy AS2 gateways inject stray single line-breaks; collapse them.
    normalized: str = re.sub(r"(?<!\n)\n(?!\n)", " ", decoded_text)
    # Assumes the default ' terminator. Strip padding and drop the empty
    # tail that split() leaves after the final terminator.
    segments: list[str] = [s.strip() for s in normalized.split("'") if s.strip()]

    extracted: dict[str, Any] = {}
    for seg in segments:
        parts = seg.split("+")
        if parts[0] == "IFT":
            # Free text is the last element, e.g. IFT+3+Ship immediately.
            extracted["instructions"] = _PAN_RE.sub("[REDACTED]", parts[-1])
        elif parts[0] == "MEA":
            if len(parts) >= 4 and parts[2] == "VGM":
                try:
                    weight_kg = float(parts[3].split(":")[-1])
                    if weight_kg < VGM_MIN_KG:
                        raise ValueError("VGM below SOLAS plausibility floor")
                    extracted["verified_gross_mass"] = weight_kg
                except (IndexError, ValueError) as exc:
                    log.warning("vgm_parse_error", segment_prefix="MEA", detail=str(exc))

    log.info("ifcsum_parsed", segment_count=len(segments), status="success")
    return extracted

Step 3 — Categorize errors and drive retry policy

Maritime ops cannot afford silent failures or infinite retry loops. The retry layer distinguishes transient broker timeouts from permanent schema violations from compliance holds. A dead-letter exchange (DLX) keeps a malformed manifest from blocking downstream customs clearance: transient errors back off exponentially, permanent errors quarantine, and compliance violations flag a regulatory hold. Crucially, MemoryError is treated as permanent — retrying an out-of-memory task almost always re-triggers it.

Error categorization: transient faults back off and retry, permanent failures (including MemoryError) quarantine, compliance violations route to the audit queue, and a retry-exhausted transient fault falls through to quarantine.

import structlog

# Assumes Step 1 lives in maritime_ingestion.tasks, as the tests import it.
from maritime_ingestion.tasks import SchemaDriftError

log = structlog.get_logger("maritime.retry")

ERROR_CATEGORIES: dict[str, tuple[type[Exception], ...]] = {
    "TRANSIENT": (ConnectionError, TimeoutError),
    "PERMANENT": (ValueError, SchemaDriftError, MemoryError),
    "COMPLIANCE": (PermissionError, KeyError),
}


def categorize_and_retry(task, exc: Exception, attempt: int) -> str:
    # bind() returns a new logger, so keep the result instead of discarding it.
    bound = log.bind(retry_attempt=attempt, error_type=type(exc).__name__)
    for category, exceptions in ERROR_CATEGORIES.items():
        if isinstance(exc, exceptions):
            if category == "TRANSIENT":
                # A retry-exhausted transient fault falls through to quarantine.
                if task.max_retries is not None and attempt >= task.max_retries:
                    bound.error("retries_exhausted_quarantined", detail=str(exc))
                    return "quarantine"
                delay = min(2 ** attempt * 15, 300)
                bound.warning("retrying_transient", delay=delay)
                raise task.retry(exc=exc, countdown=delay)
            if category == "COMPLIANCE":
                bound.error("compliance_hold_triggered", detail=str(exc))
                return "audit_queue"
            bound.error("permanent_failure_quarantined", detail=str(exc))
            return "quarantine"
    # Pass the exception explicitly: this may run outside an except block.
    bound.error("unknown_error_fallback", exc_info=exc)
    raise task.retry(exc=exc, countdown=60)
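
To make the transient backoff concrete, the snippet below evaluates the same formula, `min(2 ** attempt * 15, 300)`, for the first six attempts. It assumes the attempt counter starts at 0, as Celery's `self.request.retries` does; `transient_delay` is a hypothetical helper name.

```python
def transient_delay(attempt: int, base: int = 15, cap: int = 300) -> int:
    """Exponential backoff in seconds, capped at five minutes."""
    return min(2 ** attempt * base, cap)


schedule = [transient_delay(n) for n in range(6)]
# schedule == [15, 30, 60, 120, 240, 300]

# With max_retries=3 only attempts 0, 1 and 2 schedule a retry.
worst_case_wait = sum(schedule[:3])
# worst_case_wait == 105
```

The cap takes effect from the sixth attempt, where the uncapped value would be 480 seconds.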

Step 4 — Enforce observability metadata before completion

Every task must emit task_id, vessel_voyage_ref, document_type, and processing_latency_ms. The IMO FAL Convention and local port-authority mandates require immutable audit trails, so attach that metadata through Celery’s task_prerun and task_postrun signals rather than trusting each task to remember it. Schema-violation rates are tracked per carrier and raise an alert when one carrier’s message structure drifts past a threshold.

import time

import structlog
from celery.signals import task_prerun, task_postrun

log = structlog.get_logger("maritime.audit")
# Start times keyed by task id; entries are removed again in _emit_audit.
_STARTED: dict[str, float] = {}


@task_prerun.connect
def _mark_start(task_id: str, **_: object) -> None:
    _STARTED[task_id] = time.monotonic()


@task_postrun.connect
def _emit_audit(task_id: str, task, state: str, kwargs: dict | None = None, **_: object) -> None:
    latency_ms = round((time.monotonic() - _STARTED.pop(task_id, time.monotonic())) * 1000, 2)
    log.info(
        "task_completed",
        task_id=task_id,
        # Assumes callers pass the voyage reference as a task keyword argument.
        vessel_voyage_ref=(kwargs or {}).get("vessel_voyage_ref", "unknown"),
        document_type=getattr(task, "document_type", "unknown"),
        processing_latency_ms=latency_ms,
        state=state,
    )
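
The per-carrier violation tracking described above could be approximated with a sliding window per carrier. The sketch below is illustrative only: `CarrierDriftMonitor`, its window size, its threshold, and its minimum sample count are all assumptions, and the state lives in a single process, so a real deployment would need a store shared across workers.

```python
from collections import defaultdict, deque


class CarrierDriftMonitor:
    """Hypothetical sliding-window tracker of schema-violation rates."""

    def __init__(self, window: int = 200, threshold: float = 0.05, min_samples: int = 20) -> None:
        self._threshold = threshold
        self._min_samples = min_samples
        # One bounded deque per carrier; old outcomes drop off automatically.
        self._outcomes: defaultdict[str, deque[bool]] = defaultdict(
            lambda: deque(maxlen=window)
        )

    def record(self, carrier: str, violated: bool) -> bool:
        """Record one outcome and return True when the carrier should alert."""
        outcomes = self._outcomes[carrier]
        outcomes.append(violated)
        if len(outcomes) < self._min_samples:
            return False  # too few messages to judge a rate
        return sum(outcomes) / len(outcomes) > self._threshold


monitor = CarrierDriftMonitor(window=10, threshold=0.2, min_samples=5)
clean = [monitor.record("CARRIER_A", False) for _ in range(4)]
at_threshold = monitor.record("CARRIER_A", True)     # 1 of 5, not above 0.2
above_threshold = monitor.record("CARRIER_A", True)  # 2 of 6
```

The minimum sample count stops a single bad message from a rarely seen carrier from raising an alert on its own.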

Edge Cases & Carrier Deviations

Symptom	Root cause	Fix
Whole interchange in one segment	Missing `UNA` service string; parser assumed `'` terminator	Read delimiters from `UNA`/`ISA`; fall back to defaults only when absent
`SchemaDriftError` on a valid file	Carrier sent a directory version (`D19B`) no handler is deployed for	Route to quarantine, deploy the version handler, replay from the topic
Garbled non-ASCII consignee names	Wrong codec assumed at decode	Keep the `utf-8-sig → iso-8859-1` fallback and log which codec won
VGM stored as `0` or below tare	`MEA+VGM` corrupted or absent from shipper	Reject below `VGM_MIN_KG`; never forward to the stowage planner
Retry storm hammering the broker	`MemoryError` mis-classified as transient	Keep `MemoryError` in `PERMANENT`; quarantine instead of retrying an OOM
Duplicate manifest events downstream	Non-idempotent publish on AS2 retransmission	Key publishes on the SHA-256 payload hash so replays collapse to a no-op

A frequent trap is the unexpected NAD qualifier: a handler written for NAD+CN (consignee) receives the party under a qualifier it has no mapping for, such as NAD+CZ, which UN/EDIFACT code list 3035 defines as the consignor. Treat an unmapped party qualifier as a quarantine reason, not a hard drop: the document is legally valid and a human can reconcile it.
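
A handler can apply that rule with a small lookup. The sketch below is illustrative: `HANDLED_QUALIFIERS` and `classify_party` are hypothetical names, and the set of qualifiers a real handler maps is an assumption.

```python
# Party qualifiers this hypothetical handler knows how to map.
HANDLED_QUALIFIERS: dict[str, str] = {"CN": "consignee", "NI": "notify_party"}


def classify_party(segment: str) -> tuple[str, str]:
    """Return (route, detail) for a NAD segment split on the + separator."""
    parts = segment.split("+")
    if parts[0] != "NAD" or len(parts) < 2 or not parts[1]:
        return "quarantine", "malformed_nad"
    qualifier = parts[1]
    role = HANDLED_QUALIFIERS.get(qualifier)
    if role is None:
        # Unmapped but well-formed: hold for a human rather than dropping it.
        return "quarantine", f"unknown_party_qualifier:{qualifier}"
    return "forward", role


known = classify_party("NAD+CN+++ACME LOGISTICS")
unmapped = classify_party("NAD+CZ+++ACME LOGISTICS")
```

Carrying the offending qualifier in the quarantine detail gives the reviewer what they need to reconcile the record or extend the mapping.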

Verification & Testing

Assert routing and VGM extraction with fixture payloads and pytest. Use bytes fixtures so encoding fallbacks are exercised, and structlog’s capture helper to assert the audit event fires.

import pytest
import structlog

from maritime_ingestion.tasks import parse_ifcsum_segments, validate_and_route_edi


@pytest.fixture
def ifcsum_payload() -> str:
    return "UNH+1+IFCSUM:D:96A:UN'IFT+3+Ship immediately'MEA+AAE+VGM+KGM:21000'"


def test_edifact_routes_to_parser() -> None:
    raw = b"UNB+UNOA:2'UNH+1+IFCSUM:D:96A:UN'"
    assert validate_and_route_edi.run(raw, "EDIFACT") == "edifact_parser"


def test_unknown_version_quarantines() -> None:
    raw = b"UNB+UNOA:2'UNH+1+IFCSUM:D:19B:UN'"
    assert validate_and_route_edi.run(raw, "EDIFACT") == "quarantine"


def test_vgm_extracted_from_mea(ifcsum_payload: str) -> None:
    result = parse_ifcsum_segments.run(ifcsum_payload)
    assert result["verified_gross_mass"] == 21000.0
    assert result["instructions"] == "Ship immediately"


def test_audit_log_shape(ifcsum_payload: str) -> None:
    # capture_logs() swaps in a capturing processor and restores the
    # previous structlog configuration on exit, so other tests are unaffected.
    with structlog.testing.capture_logs() as entries:
        parse_ifcsum_segments.run(ifcsum_payload)
    assert entries[-1]["event"] == "ifcsum_parsed"
    assert entries[-1]["status"] == "success"

A passing run emits one JSON line per task; the parse success looks like {"event": "ifcsum_parsed", "segment_count": 3, "status": "success", "log_level": "info", "timestamp": "..."}. Assert on the structured keys, never on a formatted string.

Frequently Asked Questions

Should a schema-drift failure retry or quarantine?

Quarantine. A SchemaDriftError means the interchange version has no deployed handler — retrying the identical bytes will fail identically and waste broker capacity. Route it to the quarantine topic with the offending version recorded, deploy the handler, then replay from the topic. Reserve retries for genuinely transient faults such as a broker timeout or a ConnectionError.

Why is MemoryError classified as permanent rather than transient?

Because retrying an out-of-memory task almost always re-triggers the same allocation and can cascade into a retry storm that starves the whole worker pool. Treat MemoryError as permanent, quarantine the payload, and address it structurally — stream large stowage plans instead of materializing them, and bound worker lifetime with --max-tasks-per-child.
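
Streaming can be as simple as a generator that yields one segment at a time from a file-like object, so memory use tracks the largest segment and not the whole plan. The sketch below assumes the default EDIFACT terminator `'` and release character `?`; `iter_segments` is a hypothetical helper, not part of the pipeline above.

```python
import io
from collections.abc import Iterator
from typing import TextIO


def iter_segments(
    stream: TextIO, terminator: str = "'", release: str = "?", chunk_size: int = 65536
) -> Iterator[str]:
    """Yield segments one at a time without loading the whole document."""
    buf: list[str] = []
    escaped = False  # True when the previous character was the release char
    while chunk := stream.read(chunk_size):
        for ch in chunk:
            if escaped:
                buf.append(ch)  # keep the escaped character verbatim
                escaped = False
            elif ch == release:
                buf.append(ch)
                escaped = True
            elif ch == terminator:
                segment = "".join(buf).strip()
                buf.clear()
                if segment:
                    yield segment
            else:
                buf.append(ch)
    tail = "".join(buf).strip()
    if tail:
        yield tail  # unterminated trailing data, surfaced for the caller


sample = io.StringIO("UNH+1+IFCSUM:D:96A:UN'\nFTX+AAI+++O?'BRIEN SHIPPING'UNT+3+1'")
segments = list(iter_segments(sample, chunk_size=8))
```

The escape state is carried across chunk boundaries, so a release character at the end of one read still protects the terminator that opens the next.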

Should validation and parsing run in the same Celery queue?

No. Keep a fast validation/acknowledgment queue separate from slow extraction (-Q edi.validate versus -Q edi.parse), each with its own worker pool. Mixing them lets a slow OCR job block the sub-500 ms acknowledgment the PCS depends on. The validation task simply returns a routing key; the extraction task does the heavy lifting on its own pool.

Related

Async Batch Processing Pipelines: the parent topic area defining throughput and backpressure contracts
Schema Validation Frameworks: the structural, semantic, and regulatory checks the validation task enforces
IFCSUM EDI Message Parsing: segment-level aggregated-manifest extraction invoked by the parse queue
PDF Bill of Lading Extraction: the OCR branch for scanned documents fanned out by the same queue
Bill of Lading Schema Mapping: the normalized record every parsed payload resolves into
