Document Ingestion & EDI Parsing Workflows

In modern port operations and global shipping logistics, the velocity of cargo movement is directly constrained by the velocity of paperwork. Terminal operators, vessel planners, and customs authorities depend on a continuous stream of structured and unstructured documents to clear vessels, allocate berths, and reconcile manifests. When these flows break, demurrage accrues, yard utilization drops, and compliance audits fail. Document ingestion and EDI parsing workflows address exactly this problem: they turn heterogeneous carrier submissions (scanned PDFs, native PDFs, CSV exports, and UN/EDIFACT transmissions) into validated, idempotent events that terminal operating systems (TOS), customs gateways, and ERP platforms can trust.

The production stakes are unambiguous. A single malformed IFCSUM message that slips past the validation boundary can misroute a reefer stow, trigger a customs hold, or corrupt a berth plan hours before a vessel arrives. This section defines the reference architecture, the schema contracts, the compliance controls, and the resilience patterns that Python automation engineers use to keep those flows deterministic under peak load.

Heterogeneous carrier inputs normalize, converge on a single validation gate, then fan out to terminal, customs, and ERP systems — with quarantine and dead-letter routing for anything non-compliant.

Each section below drills into one stage of this flow and links to the deeper task-level guides that implement it.

Data Governance & Schema Standards

The ingestion layer is the first point of failure in any maritime data pipeline. Shipping lines, freight forwarders, and agents submit documents in formats that share almost nothing structurally: image-only bills of lading, native PDFs with embedded text layers, delimited flat files, UN/EDIFACT interchanges, ANSI X12 transactions, and — increasingly — JSON payloads from carrier REST APIs. A production system must normalize all of these into a single canonical representation before any business logic runs. Governance begins by declaring that canonical shape explicitly, versioning it alongside pipeline code, and enforcing it synchronously at the normalization boundary rather than treating validation as a downstream afterthought.

For unstructured and semi-structured cargo documents, optical character recognition (OCR) and layout-aware extraction models isolate the critical fields — consignee details, container numbers, seal IDs, and gross weights. The operational nuances of extracting data from carrier-specific templates are covered in PDF Bill of Lading Extraction, where layout drift and image quality directly degrade downstream reconciliation accuracy. Because carrier templates change without notice, the ingestion tier must run continuous drift detection: when a template’s field geometry no longer matches its fingerprint, anomalous files are quarantined, a manual review queue is triggered, and extraction rules are updated through version-controlled configuration rather than a hot patch.

Once a document is normalized, the pipeline shifts to structured message processing. UN/EDIFACT remains the lingua franca of maritime data exchange, with IFCSUM (International Forwarding and Consolidation Summary) serving as the primary message type for stowage planning and terminal manifest reconciliation. A North American partner may send the ANSI X12 300-series instead; those trade-offs are weighed in EDIFACT vs ANSI X12 for B/L exchange. Parsing IFCSUM correctly means honoring segment sequencing, composite element mapping, and conditional qualifiers, and the full contract is documented in IFCSUM EDI Message Parsing. A robust parser tolerates truncated interchanges, duplicate control numbers, and non-standard character encodings without corrupting transaction state. The segments a maritime ingestion tier must recognize most often are few in number but unforgiving in interpretation:

Segment	Name	Governance role in ingestion
`UNB`	Interchange Header	Sender/recipient identification, interchange control reference for deduplication
`UNH`	Message Header	Message type + version (e.g. `IFCSUM:D:00B:UN`) — drives version-specific parsing
`BGM`	Beginning of Message	Document/message function code; primary business reference
`NAD`	Name and Address	Party role via qualifier (`CA` carrier, `CN` consignee, `CZ` consignor)
`EQD`	Equipment Details	Container identifier + ISO 6346 size-type code
`MEA`	Measurements	Gross mass, VGM, tare — unit normalization required
`LOC`	Location	UN/LOCODE for load/discharge — validated against the registry
`DTM`	Date/Time/Period	ETA/ETD windows coerced to timezone-aware ISO 8601

Governance does not end at the segment boundary. Every extracted EQD container identifier must satisfy ISO 6346 structure and check-digit rules; every LOC must resolve against the official UN/LOCODE registry; every MEA weight must reconcile against SOLAS Verified Gross Mass thresholds. Where a document carries a commercial instrument such as a bill of lading, the normalized output feeds the Bill of Lading Schema Mapping layer, which preserves legal provenance, amendment history, and endorsement chains. Equipment references, meanwhile, resolve into the Container Hierarchy Data Models that track parent-child relationships across transshipment and depot moves. The authoritative reference for segment structures and directory versions is the official UN/EDIFACT documentation; enforcing those definitions in code — via Schema Validation Frameworks — is what keeps malformed payloads out of the terminal operating system.
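As one illustration of the registry rule above, the sketch below classifies a LOC code as well-formed and listed, well-formed but unlisted, or malformed. It assumes the UN/LOCODE shape of a two-letter country code followed by a three-character location code; `check_locode`, `LocodeResult`, and `SAMPLE_REGISTRY` are hypothetical names, and the three-entry registry is a stand-in for a snapshot loaded from the published list.

```python
import re
from enum import Enum

# UN/LOCODE shape: 2-letter country code + 3-character location code
# (letters A-Z or digits 2-9). The separating space is removed before matching.
_LOCODE_SHAPE = re.compile(r"[A-Z]{2}[A-Z2-9]{3}")


class LocodeResult(str, Enum):
    OK = "OK"
    MALFORMED = "MALFORMED"  # wrong shape: cannot be a UN/LOCODE at all
    UNLISTED = "UNLISTED"    # right shape, but absent from the loaded snapshot


def check_locode(code: str, registry: frozenset[str]) -> LocodeResult:
    """Classify a LOC code against a preloaded registry snapshot."""
    normalized = code.strip().upper().replace(" ", "")
    if not _LOCODE_SHAPE.fullmatch(normalized):
        return LocodeResult.MALFORMED
    return LocodeResult.OK if normalized in registry else LocodeResult.UNLISTED


# Illustrative snapshot only; a real deployment would load the published registry.
SAMPLE_REGISTRY = frozenset({"NLRTM", "SGSIN", "USLAX"})
```

Separating UNLISTED from MALFORMED lets the caller treat an unlisted code as a possible registry lag rather than a data defect.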

Workflow Orchestration & State Machines

A document is not “processed” the moment it is parsed; it moves through a finite set of states, and every transition must be idempotent and replayable. Model each ingested payload as a state machine: RECEIVED → NORMALIZED → VALIDATED → DISPATCHED, with QUARANTINED and FAILED as terminal branches. Because carriers routinely resubmit the same interchange — sometimes with a fresh timestamp, sometimes byte-identical — the orchestrator must key state on a stable business identity (interchange control reference plus document reference) rather than on arrival order. Reprocessing the same message must converge to the same state without double-dispatching a berth assignment or duplicating a customs declaration.
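A minimal sketch of that state machine follows, using an in-memory dictionary as a stand-in for durable state. The state names come from the text; the exact transition table (which states may branch to QUARANTINED or FAILED) and the names `advance`, `TRANSITIONS`, and `IllegalTransition` are assumptions made for illustration.

```python
from enum import Enum


class State(str, Enum):
    RECEIVED = "RECEIVED"
    NORMALIZED = "NORMALIZED"
    VALIDATED = "VALIDATED"
    DISPATCHED = "DISPATCHED"
    QUARANTINED = "QUARANTINED"
    FAILED = "FAILED"


# Assumed legal transitions; terminal states map to an empty set.
TRANSITIONS: dict[State, set[State]] = {
    State.RECEIVED: {State.NORMALIZED, State.FAILED},
    State.NORMALIZED: {State.VALIDATED, State.QUARANTINED, State.FAILED},
    State.VALIDATED: {State.DISPATCHED, State.FAILED},
    State.DISPATCHED: set(),
    State.QUARANTINED: set(),
    State.FAILED: set(),
}

BusinessKey = tuple[str, str]  # (interchange control reference, document reference)


class IllegalTransition(Exception):
    pass


def advance(store: dict[BusinessKey, State], key: BusinessKey, target: State) -> bool:
    """Move `key` to `target`. Returns True if state changed, False on a replay."""
    if key not in store:
        store[key] = State.RECEIVED  # first sighting of this business identity
        if target == State.RECEIVED:
            return True
    current = store[key]
    if current == target:
        return False  # replayed transition: converge without side effects
    if target not in TRANSITIONS[current]:
        raise IllegalTransition(f"{current.value} -> {target.value}")
    store[key] = target
    return True
```

A caller would emit side effects such as a dispatch only when `advance` returns True, so a resubmitted interchange cannot trigger them twice.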

Every payload is keyed on a stable business identity and advances idempotently through the pipeline states; the two terminal branches — dead-letter for structural defects, quarantine for business rejections — decide who is paged.

High-volume port environments generate thousands of documents per hour, and synchronous processing creates bottlenecks that delay vessel turnaround. Modern architectures parallelize ingestion, OCR, EDI decoding, and validation across worker pools while preserving strict ordering for sequential EDI transactions that share a control reference. The horizontal-scaling patterns (worker fan-out, ordered partitions, and dead-letter routing) are detailed in Async Batch Processing Pipelines, and the concrete broker-level implementation is walked through in Building Celery queues for maritime doc ingestion. Ordering guarantees matter most where a later message amends an earlier one: if an IFCSUM update arrives before its baseline, the late baseline must not overwrite the newer state.
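One way to keep a late baseline from overwriting a newer amendment is a revision guard at the point of write. The sketch below assumes each message carries a comparable revision counter; how that counter is derived from the interchange is not specified in this guide, so `Revision.sequence` and `apply_revision` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    key: str       # stable business key shared by a baseline and its amendments
    sequence: int  # hypothetical revision counter: 0 = baseline, 1+ = amendments
    body: str


def apply_revision(store: dict[str, Revision], incoming: Revision) -> bool:
    """Store `incoming` only if it is newer than what is held. True if applied."""
    held = store.get(incoming.key)
    if held is not None and held.sequence >= incoming.sequence:
        return False  # stale or duplicate: a late baseline must not win
    store[incoming.key] = incoming
    return True
```

The guard complements ordered partitions rather than replacing them: partitions reduce reordering, and the guard makes whatever reordering remains harmless.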

Orchestration also spans system boundaries. A validated stowage summary does not live in isolation — it emits events that downstream state machines consume. The berth-and-gate sequencing those events feed into is standardized by the Port Call Workflow Design framework, where each transition is gated by an external signal (AIS ping, customs clearance code, TOS slot confirmation) and an internal health check. Equipment-status events, in turn, are consumed by the Container Tracking & AIS Event Synchronization domain, which correlates document-derived state with live vessel positions from the AIS Data Stream Integration feed. Decoupling event ingestion from execution logic is what lets operators scale notification pipelines without introducing race conditions or duplicate assignments.

Security Boundaries & Compliance Controls

Maritime infrastructure operates under stringent international mandates, and the ingestion tier sits directly on the trust boundary between external carriers and internal cargo systems. Compliance is not a report generated after the fact — it is enforced at the parsing boundary, on every payload, before dispatch. Three regimes dominate the design:

ISPS / SOLAS security declarations. Under the IMO International Ship and Port Facility Security (ISPS) Code, vessel and port-facility security clearance flags must be present and consistent before a manifest is released to the TOS. Missing or contradictory security qualifiers are a quarantine condition, not a warning.
SOLAS Verified Gross Mass (VGM). Every container’s gross mass must be verified and within declared tolerance. A MEA weight that lacks a VGM method code, or that exceeds the container’s maximum rated gross, is intercepted at the boundary.
IMDG hazardous-material classification. Dangerous-goods consignments must carry valid IMDG class and UN number codes; these gate both stowage eligibility and customs routing.
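The VGM rule in the list above can be sketched as a pure function that returns every violation it finds instead of stopping at the first. The method codes "1" and "2" mirror the two SOLAS verification methods, but how a given carrier encodes them in the interchange is an assumption, as are the names `check_vgm` and `VALID_VGM_METHODS`.

```python
from typing import Optional

# SOLAS recognizes two verification methods: weigh the packed container
# (method 1) or sum the weighed contents and the tare (method 2).
# Representing them as the strings "1" and "2" is an assumption of this sketch.
VALID_VGM_METHODS = {"1", "2"}


def check_vgm(
    gross_kg: Optional[float], method: Optional[str], max_gross_kg: float
) -> list[str]:
    """Return the list of violations; an empty list means the weight may pass."""
    violations: list[str] = []
    if gross_kg is None or gross_kg <= 0:
        violations.append("missing_or_nonpositive_gross_mass")
    elif gross_kg > max_gross_kg:
        violations.append("exceeds_max_rated_gross")
    if method not in VALID_VGM_METHODS:
        violations.append("missing_or_unknown_vgm_method")
    return violations
```

Any non-empty result would hold the payload at the boundary; the function never adjusts the declared weight.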

At the integration edge, treat every inbound channel as untrusted. AS2, SFTP, and REST ingress each require mutual authentication (mutual TLS for AS2 and REST over HTTPS, SSH host and client keys for SFTP), payload signing, and role-based access aligned to port authority security directives. The segmentation model, which isolates public-facing vessel-tracking endpoints from internal cargo management systems and enforces cryptographic non-repudiation for customs declarations, is specified in the Maritime Security Boundary Setup protocol. Two audit requirements are non-negotiable: every ingested payload is persisted immutably with its raw bytes, and every validation decision (accept, quarantine, reject) is logged with the rule that fired, the operator or service identity, and a monotonic timestamp. Port state control inspections and customs disputes are won or lost on the completeness of that trail.
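The shape of one audit entry might look like the sketch below. It records a SHA-256 digest of the raw bytes next to the decision fields named above; the field names and the helpers `make_audit_record` and `to_log_line` are hypothetical, and persisting the raw bytes themselves to immutable storage is outside the sketch.

```python
import hashlib
import json
import time
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class AuditRecord:
    payload_sha256: str   # digest of the raw inbound bytes
    decision: str         # "accept" | "quarantine" | "reject"
    rule: str             # identifier of the rule that fired
    actor: str            # operator or service identity
    recorded_at_utc: str  # wall-clock time, ISO 8601, UTC
    monotonic_ns: int     # orders records emitted by one process


def make_audit_record(raw: bytes, decision: str, rule: str, actor: str) -> AuditRecord:
    if decision not in {"accept", "quarantine", "reject"}:
        raise ValueError(f"unknown decision: {decision!r}")
    return AuditRecord(
        payload_sha256=hashlib.sha256(raw).hexdigest(),
        decision=decision,
        rule=rule,
        actor=actor,
        recorded_at_utc=datetime.now(timezone.utc).isoformat(),
        monotonic_ns=time.monotonic_ns(),
    )


def to_log_line(record: AuditRecord) -> str:
    """Serialize with sorted keys so identical records produce identical lines."""
    return json.dumps(asdict(record), sort_keys=True)
```

The monotonic counter only orders records within a single process; ordering across hosts would need a separate mechanism.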

Resilience Engineering

Even with rigorous validation, network interruptions, malformed payloads, and upstream outages are operational realities rather than exceptions. Production-grade ingestion must distinguish, deterministically, between a transient failure worth retrying and a permanent data defect that must be routed to a human. Conflating the two is how pipelines either livelock on poison messages or silently drop recoverable ones.

The resilience contract rests on four patterns. Exponential backoff with jitter governs retries against transient errors — timeouts, connection resets, and rate-limit responses from a customs API. Circuit breakers trip after a threshold of consecutive failures against a downstream endpoint, shedding load and giving the dependency time to recover rather than amplifying an outage. Idempotent processing guarantees that a message replayed after a partial failure converges to the same state, so a mid-flight crash never double-books a berth. And a strict distinction between the quarantine topic and the dead-letter queue keeps operational triage sane: quarantine holds structurally valid but business-rejected payloads (a failed check digit, an unknown LOCODE) that a human can correct and resubmit, while the DLQ holds unprocessable messages (corrupt encoding, protocol violation) that require engineering intervention.
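The first of those patterns can be sketched as follows. This version uses full jitter, drawing each delay uniformly between zero and the exponential ceiling, which is one common variant and not necessarily the one a given deployment uses; `retry_transient` and `backoff_delay` are hypothetical names.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**(attempt - 1))]."""
    ceiling = min(cap, base * (2 ** (attempt - 1)))
    return random.uniform(0.0, ceiling)


async def retry_transient(
    call: Callable[[], Awaitable[T]],
    max_attempts: int = 3,
    base: float = 1.0,
    cap: float = 30.0,
) -> T:
    """Retry `call` on transient errors only; anything else propagates at once."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise  # retries exhausted: let the caller classify the failure
            await asyncio.sleep(backoff_delay(attempt, base, cap))
    raise AssertionError("unreachable")  # the loop always returns or raises
```

Jitter spreads retries from many workers over time, so a recovering customs endpoint is not hit by a synchronized burst.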

Degraded-mode operation deserves explicit design rather than emergent behavior. When a primary EDI channel fails, the system should fall back through secondary protocols — SFTP polling, a webhook relay, or supervised CSV ingestion — while preserving message ordering and deduplication. During degradation, observability must intensify, not go dark: structured telemetry emitted from every coroutine lets operators triage in minutes without halting yard operations or missing a customs submission deadline. The fallback chain and its uptime guarantees are the difference between a five-minute blip and a shift-long manifest backlog.

Production Python Implementation

The module below demonstrates this domain’s core contract end to end: Pydantic models for the canonical payload, an ISO 6346 check-digit validator, UN/EDIFACT segment parsing with fallback normalization, structured JSON logging via structlog, explicit error classification, and an async processor with exponential backoff. The validator-syntax and performance differences that decide how those models are written are compared in pydantic v1 vs v2 for maritime schema validation. It is runnable as written and mirrors the numbered steps enforced in production.

1. Configure structured JSON logging so every stage emits machine-parseable telemetry.
2. Model the canonical payload and processing states with typed Pydantic classes.
3. Validate each container identifier against ISO 6346 structure and its modulo-11 check digit.
4. Parse UN/EDIFACT segments, applying deterministic fallback for malformed segments.
5. Process each payload asynchronously with error classification and exponential backoff, routing to quarantine, dispatch, or dead-letter accordingly.

import asyncio
import re
from enum import Enum
from typing import Any, Optional

import structlog
from pydantic import BaseModel, Field, ValidationError

# Step 1: structured JSON logging. Bare print() calls are not acceptable in a production pipeline.
structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.JSONRenderer(),
    ]
)
logger = structlog.get_logger("maritime.ingestion")


# Step 2 — Canonical payload + explicit processing states.
class ProcessingStatus(str, Enum):
    PENDING = "PENDING"
    VALIDATED = "VALIDATED"
    QUARANTINED = "QUARANTINED"
    FAILED = "FAILED"


class MaritimePayload(BaseModel):
    raw_content: str
    container_ids: list[str] = Field(default_factory=list)
    status: ProcessingStatus = ProcessingStatus.PENDING
    error_context: Optional[dict[str, Any]] = None


class PermanentDefect(Exception):
    """Structurally unprocessable payload — routes to the dead-letter queue."""


# Step 3 — ISO 6346 structural + modulo-11 check-digit validation.
def validate_iso_6346(container_id: str) -> bool:
    """Validate ISO 6346 container format and modulo-11 check digit."""
    # fullmatch rejects a trailing newline, which "^...$" with re.match would accept.
    if not re.fullmatch(r"[A-Z]{4}\d{7}", container_id):
        return False
    # Letter values run A=10 .. Z=38, skipping every multiple of 11
    # (11, 22, 33), which are reserved.
    letter_values: dict[str, int] = {}
    value = 10
    for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
        if value % 11 == 0:
            value += 1
        letter_values[letter] = value
        value += 1

    def char_value(c: str) -> int:
        return int(c) if c.isdigit() else letter_values[c]

    # Weight each of the first 10 characters by 2**position, sum, take modulo 11.
    total = sum(char_value(c) * (2 ** i) for i, c in enumerate(container_id[:10]))
    check_digit = total % 11
    if check_digit == 10:  # A computed remainder of 10 is encoded as 0.
        check_digit = 0
    return check_digit == int(container_id[-1])


# Step 4 — UN/EDIFACT segment parsing with deterministic fallback.
async def parse_edi_segments(raw_edi: str) -> list[dict[str, Any]]:
    """Parse UN/EDIFACT segments; flag malformed segments instead of aborting."""
    parsed: list[dict[str, Any]] = []
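    # Naive split on the default delimiters: release-character (?) escapes and
    # UNA-declared delimiter overrides are not handled in this demonstration.
    for seg in raw_edi.split("'"):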
        if not seg.strip():
            continue
        parts = seg.split("+")
        if parts[0].isalpha() and parts[0].isupper():
            parsed.append({"id": parts[0], "elements": parts[1:], "valid": True})
        else:
            logger.warning("segment_parse_fallback", segment=seg[:30])
            parsed.append({"id": "UNKNOWN", "elements": [seg], "fallback": True})
    return parsed


# Step 5 — Async processing with error classification + exponential backoff.
async def process_with_retry(
    payload: MaritimePayload, max_retries: int = 3
) -> MaritimePayload:
    backoff = 1.0
    for attempt in range(1, max_retries + 1):
        try:
            payload.container_ids = re.findall(r"[A-Z]{4}\d{7}", payload.raw_content)
            if not payload.container_ids:
                raise PermanentDefect("No ISO 6346 container identifiers found")

            invalid = [c for c in payload.container_ids if not validate_iso_6346(c)]
            if invalid:
                payload.status = ProcessingStatus.QUARANTINED
                payload.error_context = {"invalid_containers": invalid}
                logger.error("validation_quarantine", invalid_ids=invalid)
                return payload  # business-rejected -> quarantine, not DLQ

            segments = await parse_edi_segments(payload.raw_content)
            fallback = sum(1 for s in segments if s.get("fallback"))
            if fallback > len(segments) * 0.3:
                raise PermanentDefect("Excessive fallback parsing: structural corruption")

            payload.status = ProcessingStatus.VALIDATED
            logger.info(
                "payload_validated",
                container_count=len(payload.container_ids),
                segments_parsed=len(segments),
                fallback_ratio=round(fallback / max(len(segments), 1), 3),
            )
            return payload

        except PermanentDefect as exc:
            payload.status = ProcessingStatus.FAILED
            payload.error_context = {"error": str(exc), "route": "dead_letter"}
            logger.error("permanent_defect", error=str(exc))
            return payload
        except (TimeoutError, ConnectionError) as exc:
            if attempt < max_retries:
                logger.warning("transient_retry", attempt=attempt, backoff_s=backoff)
                await asyncio.sleep(backoff)
                backoff *= 2
                continue
            payload.status = ProcessingStatus.FAILED
            payload.error_context = {"error": str(exc), "attempt": attempt}
            logger.error("transient_exhausted", error=str(exc))
            return payload
    return payload


async def run_ingestion_pipeline(raw_documents: list[str]) -> list[MaritimePayload]:
    """Orchestrate async batch processing for port document ingestion."""
    tasks = [process_with_retry(MaritimePayload(raw_content=d)) for d in raw_documents]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    out: list[MaritimePayload] = []
    for doc, r in zip(raw_documents, results):
        if isinstance(r, MaritimePayload):
            out.append(r)
        else:
            # Keep the original content so the failure stays traceable to its input.
            out.append(
                MaritimePayload(
                    raw_content=doc,
                    status=ProcessingStatus.FAILED,
                    error_context={"exception": repr(r)},
                )
            )
    return out


if __name__ == "__main__":
    sample_payloads = [
        "BGM+351+IFCSUM001+9'NAD+MS+987654321::166'EQD+CN+MSCU1234566+22G1+2+5'",
        "CORRUPTED_SEGMENT_DATA_NO_VALID_IDS",
        "BGM+351+IFCSUM002+9'NAD+MS+123456789::166'EQD+CN+TCLU9876543+45G1+2+5'",
    ]
    asyncio.run(run_ingestion_pipeline(sample_payloads))

The contract this module encodes is what every downstream integration relies on: a payload is never dispatched unless its containers pass ISO 6346, its EDIFACT structure parses within tolerance, and its status is explicitly VALIDATED. Business rejections land in quarantine for human correction; structural defects land in the dead-letter path for engineering; transient faults retry with backoff. Detailed field-coercion rules and the multi-tier validation ladder that sits behind this gate are covered in Schema Validation Frameworks.

Operational Edge Cases & Known Carrier Deviations

The specification is clean; production traffic is not. The failures that actually page an on-call engineer concentrate in a handful of field-tested quirks, and encoding them as explicit rules — rather than discovering them per incident — is what separates a resilient pipeline from a brittle one.

Non-standard NAD qualifiers. Several carriers emit party qualifiers outside the standard directory, or overload MS (message sender) where a role-specific qualifier is expected. Maintain a per-carrier qualifier map and treat an unmapped qualifier as a quarantine condition, never a silent default.
ISO 6346 check-digit drift. Legacy fleets and manually keyed manifests routinely carry containers whose check digit does not validate — sometimes because the digit was computed under an older convention, sometimes from transcription error. A remainder of 10 is encoded as 0; a validator that forgets this rejects a legitimate identifier ending in 0. Quarantine, surface the computed-vs-declared digit, and let an operator adjudicate.
UN/LOCODE gaps. New or reclassified port locations lag the published registry. A LOC segment referencing an unlisted code should route to quarantine with the raw code preserved, and trigger a registry-refresh check rather than dropping the document.
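Delimiter and encoding drift. Interchanges that use non-default delimiters (declared in a UNA service-string advice that the parser ignores, or used with the UNA omitted altogether), or that arrive in a non-UTF-8 codepage, will shred a naive splitter. Read the UNA when it is present, fall back to the default service characters when it is absent, and normalize encoding before segmentation.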
VGM threshold breaches. A MEA gross mass exceeding the container’s maximum rated gross, or lacking a verification method code, is a SOLAS violation — hold it, do not round it into range.

Symptom	Root cause	Resolution
Container rejected ending in `0`	Check digit 10 encoded as 0 not handled	Apply the modulo-11 `10 → 0` rule before comparison
Party silently defaulted to sender	Unmapped `NAD` qualifier	Quarantine on unknown qualifier; extend carrier map
Discharge port not found	UN/LOCODE registry lag	Quarantine, preserve raw code, schedule registry refresh
Whole interchange unparseable	Missing `UNA`, non-UTF-8 codepage	Detect service string, normalize encoding pre-split
Manifest overweight rejected downstream	VGM exceeds max rated gross	Hold at boundary as SOLAS violation, notify operator
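The delimiter row above can be sketched as two small functions: one reads the six service characters that follow a UNA tag, or falls back to the EDIFACT defaults, and one splits segments while honoring the release character. `ServiceChars`, `detect_service_chars`, and `split_segments` are hypothetical names, and encoding normalization is assumed to have happened before this step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceChars:
    component: str = ":"
    element: str = "+"
    decimal: str = "."
    release: str = "?"
    terminator: str = "'"


def detect_service_chars(interchange: str) -> tuple[ServiceChars, str]:
    """Return the delimiters in force and the interchange body after any UNA."""
    if interchange.startswith("UNA") and len(interchange) >= 9:
        s = interchange[3:9]  # six service characters follow the UNA tag
        # s[4] is the reserved position (normally a space) and is not used here.
        return ServiceChars(s[0], s[1], s[2], s[3], s[5]), interchange[9:]
    return ServiceChars(), interchange  # no UNA: default service characters apply


def split_segments(body: str, chars: ServiceChars) -> list[str]:
    """Split on the segment terminator, honoring the release character."""
    segments: list[str] = []
    current: list[str] = []
    escaped = False
    for ch in body:
        if escaped:
            current.append(ch)  # a released character is literal data
            escaped = False
        elif ch == chars.release:
            current.append(ch)  # keep the marker for later element splitting
            escaped = True
        elif ch == chars.terminator:
            segment = "".join(current).strip()
            if segment:
                segments.append(segment)
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        segments.append(tail)  # unterminated trailing segment
    return segments
```

The release marker is left in place so that a later element-level split can apply the same escaping rule.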

Frequently Asked Questions

How do we handle EDIFACT version mismatches in production?

Read the message version from the UNH segment (for example IFCSUM:D:00B:UN) and dispatch to a version-specific parser rather than assuming a single directory. Keep each version’s segment map version-controlled alongside the pipeline. When a carrier sends a version you do not yet support, route the interchange to quarantine with the declared version in the error context — never attempt a best-effort parse against the wrong directory, because silently mismapped composites are worse than an explicit rejection.
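A dispatch table keyed on the UNH message identifier is one way to sketch this. The example assumes default delimiters and a UNH segment already isolated as a string; `select_parser`, `PARSERS`, `UnsupportedVersion`, and the placeholder `parse_ifcsum_d00b` are hypothetical.

```python
from typing import Callable

Parser = Callable[[list[str]], dict]


class UnsupportedVersion(Exception):
    """Raised so the caller can quarantine with the declared version attached."""

    def __init__(self, declared: str) -> None:
        super().__init__(f"unsupported message version: {declared}")
        self.declared = declared


def message_identifier(unh_segment: str) -> tuple[str, str, str, str]:
    """Extract (type, version, release, agency) from a UNH segment string."""
    elements = unh_segment.split("+")
    if elements[0] != "UNH" or len(elements) < 3:
        raise ValueError("not a UNH segment")
    parts = elements[2].split(":")
    if len(parts) < 4:
        raise ValueError("incomplete message identifier")
    return parts[0], parts[1], parts[2], parts[3]


def parse_ifcsum_d00b(segments: list[str]) -> dict:
    # Placeholder: a real parser would walk the D.00B segment map.
    return {"directory": "D.00B", "segment_count": len(segments)}


# Supported (type, version, release) triples; extend one entry per directory.
PARSERS: dict[tuple[str, str, str], Parser] = {
    ("IFCSUM", "D", "00B"): parse_ifcsum_d00b,
}


def select_parser(unh_segment: str) -> Parser:
    msg_type, version, release, _agency = message_identifier(unh_segment)
    try:
        return PARSERS[(msg_type, version, release)]
    except KeyError:
        raise UnsupportedVersion(f"{msg_type}:{version}:{release}") from None
```

Because an unknown version raises instead of falling through to a default parser, a best-effort parse against the wrong directory cannot happen by accident.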

When does a message go to quarantine versus the dead-letter queue?

Quarantine holds payloads that are structurally valid but business-rejected — a failed ISO 6346 check digit, an unknown UN/LOCODE, an unmapped NAD qualifier — which a human can correct and resubmit. The dead-letter queue holds payloads that are unprocessable at the protocol level: corrupt encoding, a missing service string, or excessive fallback parsing that indicates structural corruption. The distinction drives who is paged: quarantine is an operations task, the DLQ is an engineering task.

Why enforce ISO 6346 check digits at the ingestion boundary rather than downstream?

Validating at the boundary keeps corrupt equipment identifiers out of the Container Hierarchy Data Models entirely, so no downstream reconciliation, stow plan, or customs declaration can inherit a bad identity. Deferring the check means the invalid ID propagates through joins and events, and the eventual failure surfaces far from its cause — often as a mystery mismatch in the TOS hours later.

How do we preserve ordering when messages are processed asynchronously?

Partition work by a stable business key — the interchange control reference plus document reference — and keep all messages that share a key on the same ordered worker or partition, while parallelizing across independent keys. This preserves the ordering guarantee where it matters (an IFCSUM amendment must not overtake its baseline) without serializing the entire stream. The broker-level implementation is walked through in Building Celery queues for maritime doc ingestion.
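Independent of any particular broker, the partitioning idea can be sketched with plain asyncio queues. The stable hash, the partition count, and the names `partition_for` and `run_partitioned` are assumptions; a production system would rely on the broker's own ordered partitions.

```python
import asyncio
import hashlib
from typing import Awaitable, Callable

Message = tuple[str, str]  # (business key, body)


def partition_for(key: str, partitions: int) -> int:
    """Stable partition index; hashlib avoids Python's per-process hash salt."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


async def run_partitioned(
    messages: list[Message],
    handler: Callable[[str, str], Awaitable[None]],
    partitions: int = 4,
) -> None:
    """Process messages so that each key is handled strictly in arrival order."""
    queues: list[asyncio.Queue] = [asyncio.Queue() for _ in range(partitions)]
    for key, body in messages:
        queues[partition_for(key, partitions)].put_nowait((key, body))

    async def worker(queue: asyncio.Queue) -> None:
        # One worker per partition: sequential within, concurrent across.
        while not queue.empty():
            key, body = queue.get_nowait()
            await handler(key, body)

    await asyncio.gather(*(worker(q) for q in queues))
```

Two keys may land on the same partition and serialize behind each other; that costs throughput but never breaks per-key order.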

What belongs in the immutable audit trail for a customs dispute?

Persist the raw inbound bytes, the parsed canonical payload, and every validation decision with the specific rule that fired, the acting service or operator identity, and a monotonic UTC timestamp. Non-repudiation for customs declarations additionally requires the payload signature and the mutual-TLS peer identity captured at ingress, as specified in Maritime Security Boundary Setup. Port state control inspections and demurrage disputes are settled on the completeness of that record.

How should the pipeline degrade when the primary EDI channel is down?

Fall back through pre-defined secondary protocols — SFTP polling, a webhook relay, or supervised CSV ingestion — while preserving deduplication and ordering keys so no message is processed twice or out of sequence. Trip a circuit breaker against the failed channel so retries stop amplifying the outage, and intensify structured telemetry during degraded mode so operators can see throughput and backlog in real time. The full fallback chain is detailed in Async Batch Processing Pipelines.
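The circuit-breaker part of that answer could be sketched as below: a breaker that opens after a run of consecutive failures and permits a single trial call once a cool-down has elapsed. The thresholds, the injectable clock, and the class and method names are hypothetical.

```python
import time
from typing import Callable, Optional


class CircuitOpen(Exception):
    """Raised instead of calling a dependency whose breaker is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def before_call(self) -> None:
        """Raise CircuitOpen while the cool-down is running."""
        if self._opened_at is None:
            return
        if self._clock() - self._opened_at < self.reset_after_s:
            raise CircuitOpen("dependency is cooling down")
        # Cool-down elapsed: allow one trial call (half-open). A single
        # further failure reopens the breaker immediately.
        self._opened_at = None
        self._failures = self.failure_threshold - 1

    def record_success(self) -> None:
        self._failures = 0

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A caller would invoke `before_call()` ahead of each attempt on the primary channel and switch to the next fallback protocol when `CircuitOpen` is raised.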

Related

EDIFACT vs ANSI X12 for B/L exchange: choosing between the two dominant EDI syntaxes for bill-of-lading interchange.
pydantic v1 vs v2 for maritime schema validation: validator syntax and performance trade-offs across the two pydantic generations.
PDF Bill of Lading Extraction: OCR and layout-aware extraction from carrier B/L templates.
IFCSUM EDI Message Parsing: segment-level parsing of the maritime manifest handshake.
Schema Validation Frameworks: the multi-tier validation gate behind every dispatch.
Async Batch Processing Pipelines: horizontal scaling, ordering guarantees, and DLQ routing.
Maritime Security Boundary Setup: zero-trust controls at the ingestion edge.
