Parsing IFCSUM 311 messages with Python

This guide solves one precise task: streaming a UN/EDIFACT IFCSUM 311 consolidated manifest through a single-pass Python parser that yields typed segments, enforces mandatory-segment compliance, and keeps memory bounded for 10,000+ TEU vessel calls, without ever loading the whole interchange into RAM.

Architecture Alignment

This task is the throughput-critical core inside the IFCSUM EDI Message Parsing discipline, which itself sits under the Document Ingestion & EDI Parsing Workflows domain. That parent layer owns envelope validation, trading-partner authorisation, and the CONTRL/APERAK acknowledgment handshake; this page owns only the leaf function that turns a validated .edi payload into an ordered stream of (tag, elements) tuples and a compliance verdict. Everything upstream — AS2/SFTP receipt, UNB/UNZ envelope checks, message isolation by directory version — has already run by the time a file reaches this parser. Because a single manifest can carry 15,000 CNI (consignment information) loops and nested GID (goods item detail) segments, a naive read-then-tokenise approach triggers garbage-collection thrashing and OOM kills during concurrent vessel processing. The design here is deliberately generator-driven so memory scales with segment size, not file size, and the resulting typed record flows into the Async Batch Processing Pipelines layer for concurrent fan-out and into the Schema Validation Frameworks layer for semantic and regulatory checks.

One pass, bounded memory: a 64 KB read loop resolves the UNA delimiters, splits at each unescaped terminator, and tokenises escape-aware — re-reading when a segment straddles a chunk — before mapping UNH · TDT · CNI into an IfcsumManifest. A per-segment SLA poll diverts a runaway file to TIMEOUT_FAILED; otherwise the mandatory-segment gate returns COMPLIANT or fails closed as NON_COMPLIANT.

Prerequisites & Environment Setup

The parser is pure-Python plus structlog; it needs no network credentials, but production deployments pin the SLA timeout and the mandatory-segment contract so a manifest’s meaning is diffable across releases.

Python 3.10+ — for X | None union syntax and dataclasses used below.
structlog>=24.1 — structured JSON logging for the audit trail; a bare print() is unacceptable in a customs-audited pipeline.
pytest>=8.0 — for the fixture-driven verification in the final section.
IFCSUM_SLA_SECONDS (env var) — hard processing budget, defaulting to the 900 s (15-minute) Terminal Operating System (TOS) ingestion window; the parser aborts and flags TIMEOUT_FAILED rather than blowing the SLA silently.
Reference standards — ISO 9735 for EDIFACT syntax rules (the UNA service string advice and release character), and the UN/EDIFACT IFCSUM message directory (D.xxB) for the 311 message-type structure and segment sequencing.

python -m venv .venv && source .venv/bin/activate
pip install "structlog>=24.1" "pytest>=8.0"
export IFCSUM_SLA_SECONDS=900
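
The listings in this guide never read IFCSUM_SLA_SECONDS themselves; the parser constructor only accepts sla_timeout_sec. A minimal sketch of how a caller might bridge the two follows. The helper name sla_seconds_from_env and its fail-fast policy are assumptions for illustration, not part of the parser.

```python
import os


def sla_seconds_from_env(default: float = 900.0) -> float:
    """Read IFCSUM_SLA_SECONDS, falling back to the 15-minute default.

    Hypothetical helper: the name and the validation policy are assumptions.
    """
    raw = os.environ.get("IFCSUM_SLA_SECONDS")
    if raw is None or not raw.strip():
        return default
    value = float(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        raise ValueError("IFCSUM_SLA_SECONDS must be positive")
    return value
```

The result would then be passed in as Ifcsum311StreamParser(sla_timeout_sec=sla_seconds_from_env()), so a misconfigured budget fails at start-up rather than mid-vessel.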

The compliance gate is only as good as its mandatory-segment contract. The minimum structural set for a valid IFCSUM 311 interchange is fixed before any value-level check runs:

EDIFACT segment	Role in IFCSUM 311	Why it is mandatory
`UNB`	Interchange header	Sender/receiver IDs, syntax version, control reference
`UNH`	Message header	Carries `IFCSUM:311` type identifier and message reference (DE 0062)
`BGM`	Beginning of message	Document/message function code (consolidated manifest)
`TDT`	Transport details	Vessel name, IMO number, and voyage reference
`CNI`	Consignment information	One loop per consolidated consignment — the payload core
`UNT`	Message trailer	Segment count and reference; proves the message is complete
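
One way to keep this contract diffable across releases is to pin it as an immutable constant and derive the verdict by set difference. The sketch below is standalone; missing_mandatory is a hypothetical helper name, and the tag set is copied from the table above.

```python
MANDATORY_SEGMENTS = frozenset({"UNB", "UNH", "BGM", "TDT", "CNI", "UNT"})


def missing_mandatory(seen_tags: set[str]) -> list[str]:
    """Return the mandatory tags that never appeared, sorted for stable logs."""
    return sorted(MANDATORY_SEGMENTS - seen_tags)
```

Extra tags such as UNZ or GID do not affect the result; only absent mandatory tags are reported.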

Step-by-step Implementation

Each step builds on the one before it; together they compose the streaming parser summarised above.

Step 1 — Configure structlog for JSON audit logging. Emit one JSON object per event with an ISO-8601 UTC timestamp so the audit pipeline can index parse events directly, without a custom logging.Formatter.

from __future__ import annotations

import structlog

structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.JSONRenderer(),
    ],
)
log = structlog.get_logger("ifcsum_parser")

Step 2 — Declare the typed record and error taxonomy. Every failure is categorised so downstream retry logic is deterministic: STRUCTURAL halts the pipeline, COMPLIANCE routes to manual review, TIMEOUT degrades to partial ingestion. REGULATORY is declared for the downstream validation layer and is never raised by this parser. Implicit dictionaries are unacceptable: the manifest is a declared type.

from dataclasses import dataclass, field
from enum import Enum


class ErrorCategory(Enum):
    STRUCTURAL = "STRUCTURAL"
    COMPLIANCE = "COMPLIANCE"
    REGULATORY = "REGULATORY"
    TIMEOUT = "TIMEOUT"


@dataclass
class IfcsumManifest:
    message_ref: str
    vessel_name: str
    voyage_number: str
    container_count: int = 0
    compliance_status: str = "PENDING"
    processing_time_sec: float = 0.0
    error_log: list[dict[str, str]] = field(default_factory=list)
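
To illustrate why a closed taxonomy makes routing deterministic, the sketch below maps each category to a downstream action. The action names, the actions_for helper, and the manual-review fallback for REGULATORY are assumptions; only the three mappings stated in the text come from this guide. The enum is repeated so the snippet runs on its own.

```python
from enum import Enum


class ErrorCategory(Enum):  # standalone copy so the sketch runs on its own
    STRUCTURAL = "STRUCTURAL"
    COMPLIANCE = "COMPLIANCE"
    REGULATORY = "REGULATORY"
    TIMEOUT = "TIMEOUT"


# Hypothetical action names for the three routes described in the text.
ACTIONS = {
    ErrorCategory.STRUCTURAL: "halt_pipeline",
    ErrorCategory.COMPLIANCE: "manual_review",
    ErrorCategory.TIMEOUT: "partial_ingestion",
}


def actions_for(error_log: list[dict[str, str]]) -> list[str]:
    """Return the distinct actions implied by an error_log, in first-seen order."""
    actions: list[str] = []
    for entry in error_log:
        category = ErrorCategory(entry["category"])  # ValueError on unknown text
        action = ACTIONS.get(category, "manual_review")  # assumed fallback
        if action not in actions:
            actions.append(action)
    return actions
```

Because ErrorCategory(...) rejects any string outside the enum, a typo in a category name fails loudly instead of silently skipping a route.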

Step 3 — Resolve delimiters from the UNA service string. Per ISO 9735, an interchange may open with a UNA segment declaring six service characters in order: the component separator (:), the data element separator (+), the decimal mark, the release character (?), a reserved position (normally a space), and the segment terminator ('). Never hard-code the defaults when a UNA is present: a carrier that ships a non-standard release character will corrupt every escaped FTX free-text segment.

class Ifcsum311StreamParser:
    SEG_TERM = "'"   # segment terminator
    ELEM_SEP = "+"   # data element separator
    COMP_SEP = ":"   # component data element separator
    ESC_CHAR = "?"   # release (escape) character

    MANDATORY_SEGMENTS = {"UNB", "UNH", "BGM", "TDT", "CNI", "UNT"}

    def __init__(self, sla_timeout_sec: float = 900.0) -> None:
        self.sla_timeout_sec = sla_timeout_sec
        self._errors: list[dict[str, str]] = []
        self._start_time = 0.0

    def _parse_una(self, service_string: str) -> None:
        """Apply UNA advice: 'UNA' followed by six service characters."""
        if service_string.startswith("UNA") and len(service_string) >= 9:
            self.COMP_SEP = service_string[3]
            self.ELEM_SEP = service_string[4]
            # service_string[5] is the decimal notation character (not stored)
            self.ESC_CHAR = service_string[6]
            self.SEG_TERM = service_string[8]

    def _log_error(self, category: ErrorCategory, detail: str,
                   segment: str | None = None) -> None:
        entry = {"category": category.value, "detail": detail}
        if segment:
            entry["segment_preview"] = segment[:60]
        self._errors.append(entry)
        log.warning("parse_error", **entry)
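
The positional mapping is easier to see in isolation. The following standalone sketch decodes a UNA prefix into a named tuple; Delimiters, DEFAULTS, and decode_una are illustrative names, and the default decimal mark "." is taken from the conventional UNA:+.? ' string.

```python
from typing import NamedTuple


class Delimiters(NamedTuple):
    component: str
    element: str
    decimal: str
    release: str
    terminator: str


DEFAULTS = Delimiters(":", "+", ".", "?", "'")


def decode_una(head: str) -> Delimiters:
    """Map the UNA service characters by position; fall back to defaults."""
    if not head.startswith("UNA") or len(head) < 9:
        return DEFAULTS
    # head[7] is the reserved position (normally a space) and is skipped.
    return Delimiters(head[3], head[4], head[5], head[6], head[8])
```

Calling decode_una on an interchange that starts directly with UNB returns DEFAULTS unchanged, which matches the parser's behaviour of only consuming a UNA that is literally present.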

Step 4 — Stream segments with an escape-aware tokenizer. Read fixed 64 KB chunks and split on the first unescaped segment terminator, buffering the remainder until more data arrives. This keeps memory bounded to one segment plus one chunk regardless of file size, and correctly handles apostrophes released with ? inside free text.

import time


class Ifcsum311StreamParser(Ifcsum311StreamParser):  # continued for readability

    def _check_timeout(self) -> bool:
        if time.monotonic() - self._start_time > self.sla_timeout_sec:
            self._log_error(ErrorCategory.TIMEOUT, "SLA timeout exceeded")
            return True
        return False

    def _split_buffer(self, buffer: str) -> tuple[str | None, str]:
        """Return (segment, remainder) at the first unescaped terminator."""
        i = 0
        while i < len(buffer):
            if buffer[i] == self.ESC_CHAR:
                i += 2  # skip the released character
                continue
            if buffer[i] == self.SEG_TERM:
                return buffer[:i], buffer[i + 1:]
            i += 1
        return None, buffer

    def _split_segment(self, raw: str) -> list[str]:
        """Split a segment into elements, honouring the release character."""
        elements, current, i = [], [], 0
        while i < len(raw):
            if raw[i] == self.ESC_CHAR and i + 1 < len(raw):
                current.append(raw[i + 1])
                i += 2
                continue
            if raw[i] == self.ELEM_SEP:
                elements.append("".join(current))
                current = []
            else:
                current.append(raw[i])
            i += 1
        elements.append("".join(current))
        return elements

    def parse_stream(self, file_path: str):
        """Generator yielding (segment_tag, elements) in one pass."""
        self._start_time = time.monotonic()
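        self._errors.clear()
        # Restore the class-level ISO 9735 defaults so a reused parser does not
        # inherit delimiters from a previous file's UNA segment.
        cls = type(self)
        self.SEG_TERM, self.ELEM_SEP = cls.SEG_TERM, cls.ELEM_SEP
        self.COMP_SEP, self.ESC_CHAR = cls.COMP_SEP, cls.ESC_CHAR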
        log.info("stream_open", file=file_path)
        with open(file_path, "r", encoding="utf-8-sig") as fh:  # utf-8-sig strips BOM
            buffer, una_checked = "", False
            for chunk in iter(lambda: fh.read(65536), ""):
                buffer += chunk
                if not una_checked and len(buffer) >= 9:
                    if buffer.startswith("UNA"):
                        self._parse_una(buffer[:9])
                        buffer = buffer[9:]
                    una_checked = True
                while True:
                    segment, rest = self._split_buffer(buffer)
                    if segment is None:
                        break  # need more data
                    buffer = rest
                    segment = segment.strip()
                    if not segment:
                        continue
                    if self._check_timeout():
                        return
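                    # The tag is the first element; the rest are its data elements.
                    tag, *elements = self._split_segment(segment)
                    yield tag, elements
            if buffer.strip():
                # Anything left after the last terminator is a truncated segment.
                self._log_error(ErrorCategory.STRUCTURAL,
                                "Unterminated trailing segment", buffer.strip())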
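
The chunk-straddling behaviour can be exercised without a file. This is a standalone variant, not the class method above: iter_segments is a hypothetical name, it reads from any text stream, and it rescans the unfinished tail whenever a new chunk arrives.

```python
import io
from typing import Iterator


def iter_segments(stream: io.TextIOBase, chunk_size: int = 65536,
                  term: str = "'", esc: str = "?") -> Iterator[str]:
    """Yield raw segments, splitting only at terminators not released by esc."""
    buffer = ""
    for chunk in iter(lambda: stream.read(chunk_size), ""):
        buffer += chunk
        start = i = 0
        while i < len(buffer):
            if buffer[i] == esc:
                i += 2          # skip the released character, even a terminator
            elif buffer[i] == term:
                yield buffer[start:i]
                i += 1
                start = i
            else:
                i += 1
        buffer = buffer[start:]  # keep only the unfinished tail
```

Driving it with a deliberately tiny chunk size, for example iter_segments(io.StringIO("FTX+AAI+++DON?'T STACK'CNI+1'"), chunk_size=4), forces both the segment and the release sequence to straddle chunk boundaries; the released apostrophe stays inside the FTX segment.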

Step 5 — Extract the manifest and enforce compliance gating. Consume the generator once, mapping the segments that matter (UNH message reference, TDT vessel/voyage composites, CNI count) and recording which mandatory tags were seen. After the stream drains, subtract the encountered set from MANDATORY_SEGMENTS: any remainder fails the compliance gate without a second pass over the file.

class Ifcsum311StreamParser(Ifcsum311StreamParser):

    def extract_manifest(self, file_path: str) -> IfcsumManifest:
        manifest = IfcsumManifest("UNKNOWN", "UNKNOWN", "UNKNOWN")
        seen: set[str] = set()

        for tag, elements in self.parse_stream(file_path):
            seen.add(tag)
            if tag == "UNH" and elements:
                manifest.message_ref = elements[0]          # DE 0062
            elif tag == "TDT":
                # TDT+20++VSL:VesselName:...+Voyage
                if len(elements) > 2:
                    comp = elements[2].split(self.COMP_SEP)
                    manifest.vessel_name = comp[1] if len(comp) > 1 else "UNKNOWN"
                if len(elements) > 3:
                    manifest.voyage_number = elements[3].split(self.COMP_SEP)[0]
            elif tag == "CNI":
                manifest.container_count += 1

        # parse_stream logs a TIMEOUT entry and stops when the SLA trips, so read
        # that entry instead of polling the clock a second time per segment.
        timed_out = any(e["category"] == ErrorCategory.TIMEOUT.value
                        for e in self._errors)
        missing = self.MANDATORY_SEGMENTS - seen
        if timed_out:
            manifest.compliance_status = "TIMEOUT_FAILED"
        elif missing:
            self._log_error(ErrorCategory.COMPLIANCE,
                            f"Missing mandatory segments: {sorted(missing)}")
            manifest.compliance_status = "NON_COMPLIANT"
        else:
            manifest.compliance_status = "COMPLIANT"

        manifest.error_log = list(self._errors)  # copy: the parser clears _errors on its next run
        manifest.processing_time_sec = time.monotonic() - self._start_time
        log.info("extract_complete", status=manifest.compliance_status,
                 teus=manifest.container_count,
                 seconds=round(manifest.processing_time_sec, 3))
        return manifest
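
One caveat on composites: _split_segment drops release characters while it splits elements, so a released component separator (?:) inside a vessel name can no longer be told apart from a real one by the time the TDT composite is split. A possible remedy, sketched here with hypothetical helper names, is to split at both levels first and unescape last.

```python
def split_released(raw: str, sep: str, esc: str = "?") -> list[str]:
    """Split raw on sep, leaving release sequences intact for a later pass."""
    parts, current, i = [], [], 0
    while i < len(raw):
        if raw[i] == esc and i + 1 < len(raw):
            current.append(raw[i:i + 2])   # keep the escape pair untouched
            i += 2
        elif raw[i] == sep:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(raw[i])
            i += 1
    parts.append("".join(current))
    return parts


def unescape(value: str, esc: str = "?") -> str:
    """Drop each release character and keep the character it protects."""
    out, i = [], 0
    while i < len(value):
        if value[i] == esc and i + 1 < len(value):
            i += 1
        out.append(value[i])
        i += 1
    return "".join(out)


def parse_segment(raw: str, elem_sep: str = "+", comp_sep: str = ":",
                  esc: str = "?") -> tuple[str, list[list[str]]]:
    """Return (tag, elements) where each element is a list of components."""
    tag, *elements = split_released(raw, elem_sep, esc)
    return tag, [[unescape(c, esc) for c in split_released(e, comp_sep, esc)]
                 for e in elements]
```

With this shape, a segment such as TDT+20++VSL:MSC ?: ONE+0X1 keeps "MSC : ONE" as a single component. Adopting it would change the element type from str to list[str], so the extraction code in this step would need to index components instead of calling split.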

Edge Cases & Carrier Deviations

Missing UNA service string. Many forwarders omit the UNA header entirely and rely on default delimiters. The parser only consumes UNA when the buffer literally starts with it, so an absent header falls through to the ISO 9735 defaults rather than swallowing the first nine bytes of a UNB.
Unescaped apostrophes in FTX segments. Gate systems frequently inject a literal ' inside free text without releasing it with ?. This is unrecoverable from syntax alone — it produces a spuriously short segment. Flag it as STRUCTURAL and quarantine the interchange; never guess where the real terminator was.
Windows line endings mid-segment. Legacy mainframe gateways append \r\n after every segment terminator. Because .strip() runs on each split segment, stray carriage returns are removed before tag extraction, so a leading \r\n never ends up inside the extracted tag.
Byte Order Marks. UTF-8 files exported from Windows tooling carry a leading BOM (U+FEFF) that would otherwise be glued onto the first tag, so UNB no longer reads as UNB and a leading UNA goes undetected. Decoding with utf-8-sig strips it; decoding with plain utf-8 leaves it in place, and the UNB envelope check then fails downstream.
Regulatory holds inside FTX/RFF. Customs seals and quarantine flags ride in free-text or reference segments. The structured error_log lets the Schema Validation Frameworks layer flag one consignment for targeted quarantine rather than rejecting the whole vessel manifest.

Verification & Testing

Assert the two properties that matter: a complete interchange resolves to COMPLIANT with the right CNI count, and a manifest missing a mandatory tag fails closed as NON_COMPLIANT. Write a minimal fixture to a temp file and drive the parser end-to-end.

import pytest

SAMPLE = (
    "UNB+UNOA:2+FWDR:ZZ+TERMINAL:ZZ+240703:1200+1'"
    "UNH+1+IFCSUM:D:03B:UN'"
    "BGM+785+MREF001+9'"
    "TDT+20++VSL:EVER GIVEN:166:IMO:9811000+0X1'"
    "CNI+1'CNI+2'CNI+3'"
    "UNT+7+1'UNZ+1+1'"  # UNT counts UNH through UNT inclusive: 7 segments
)


@pytest.fixture
def sample_edi(tmp_path):
    path = tmp_path / "ifcsum.edi"
    path.write_text(SAMPLE, encoding="utf-8")
    return str(path)


def test_compliant_manifest(sample_edi: str) -> None:
    m = Ifcsum311StreamParser().extract_manifest(sample_edi)
    assert m.compliance_status == "COMPLIANT"
    assert m.container_count == 3
    assert m.vessel_name == "EVER GIVEN"
    assert m.message_ref == "1"


def test_missing_mandatory_segment(tmp_path) -> None:
    # Drop the TDT segment -> the mandatory-segment gate must fail closed.
    broken = SAMPLE.replace("TDT+20++VSL:EVER GIVEN:166:IMO:9811000+0X1'", "")
    path = tmp_path / "broken.edi"
    path.write_text(broken, encoding="utf-8")
    m = Ifcsum311StreamParser().extract_manifest(str(path))
    assert m.compliance_status == "NON_COMPLIANT"
    assert any(e["category"] == "COMPLIANCE" for e in m.error_log)

A successful extraction emits a structured line of this shape, which the audit pipeline can index directly (key order and timestamp precision may differ in practice):

{"event": "extract_complete", "status": "COMPLIANT", "teus": 3, "seconds": 0.004, "level": "info", "timestamp": "2026-07-03T12:00:00Z"}

Frequently Asked Questions

Why stream 64 KB chunks instead of reading the whole file and splitting on the terminator?

Because IFCSUM 311 manifests for a single deep-sea call routinely exceed 800 MB of raw text once you count 15,000 CNI loops and their nested GID segments. Reading the whole payload and calling .split("'") allocates the entire file plus a full list of every segment at once, which triggers garbage-collection thrashing and OOM kills when several vessels are processed concurrently. The generator holds only one chunk plus one partial segment in memory, so peak usage is flat regardless of whether the file is 8 MB or 800 MB.

Should a malformed segment raise an exception or be logged and skipped?

It depends on the category. A STRUCTURAL fault — an unescaped terminator that desynchronises the whole stream — must halt the interchange, because every segment after it is unreliable. A COMPLIANCE gap (a missing mandatory tag) is recorded in error_log and returned as NON_COMPLIANT so the interchange routes to manual review without stalling other vessels’ manifests. This mirrors the fail-closed posture the parent IFCSUM EDI Message Parsing discipline applies at the envelope boundary.

How does the SLA timeout avoid corrupting a partial manifest?

_check_timeout() is polled once per segment against time.monotonic(), so a runaway file stops at a segment boundary, never mid-tokenise. When it trips, extract_manifest sets compliance_status to TIMEOUT_FAILED and returns the partial record with its error_log intact, letting the Async Batch Processing Pipelines layer degrade to partial ingestion rather than silently blowing the 15-minute TOS window.
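
The same boundary-only polling can be written as a reusable wrapper around any iterator. The sketch below is an alternative design, not the parser's own: it raises instead of logging and returning, so a caller cannot mistake a truncated stream for a complete one. DeadlineExceeded and with_deadline are hypothetical names, and the clock is injectable so the behaviour can be tested without sleeping.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


class DeadlineExceeded(Exception):
    """Hypothetical signal that the processing budget ran out."""


def with_deadline(items: Iterable[T], budget_sec: float,
                  clock: Callable[[], float] = time.monotonic) -> Iterator[T]:
    """Yield items until the budget is spent, checking only between items."""
    started = clock()
    for item in items:
        if clock() - started > budget_sec:
            raise DeadlineExceeded(f"budget of {budget_sec}s exceeded")
        yield item
```

Because the check runs once per item, work on the current item is never interrupted; the caller catches DeadlineExceeded and marks the partial record accordingly.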

Related topics

IFCSUM EDI Message Parsing — the parent discipline: envelope validation, partner authorisation, and typed model mapping that wraps this parser.
Schema Validation Frameworks — the semantic and regulatory layer that consumes the typed manifest and cross-checks HS codes, tariff codes, and VGM.
Async Batch Processing Pipelines — concurrent fan-out where multiple vessel manifests stream through this parser at once.
PDF Bill of Lading Extraction — the probabilistic OCR path for unstructured documents, contrasted with this deterministic EDIFACT path.
