PDF Bill of Lading Extraction

PDF Bill of Lading extraction is the process boundary that turns carrier-issued Bill of Lading (B/L) PDFs — the legal contract of carriage, cargo receipt, and negotiable document of title — into strictly typed, validated records that terminal and customs systems can act on without a human re-keying a field. Despite digitization mandates, port authorities, terminal operators, and freight forwarders still ingest high volumes of proprietary carrier PDFs, and every extracted field feeds customs clearance, automated stowage planning, and gate release. This work sits inside the Document Ingestion & EDI Parsing Workflows discipline and shares its posture: tolerate messy layouts at the wire, then apply uncompromising validation before anything reaches a terminal operating system (TOS). A production-grade extraction pipeline must be deterministic, mapped to recognized maritime data standards, and resilient enough to hold its SLA through carrier template drift and OCR fallbacks.

Each extraction stage is gated by an extraction_confidence score: a low-confidence result cascades to the next, more expensive stage, while any stage that clears the threshold commits to one shared typed record — OCR and human review are fallbacks, never the default path.

Ingestion Boundary & Protocol Handling

Incoming B/Ls arrive over hardened HTTPS upload endpoints, SFTP drops from carrier portals, and email-to-pipeline gateways. Before a single byte of text is parsed, the ingestion boundary enforces pre-parsing validation: it detects password protection and encrypted content, verifies embedded cryptographic signatures where the carrier supplies them, checks file size against per-terminal SLAs, and confirms page-count completeness so a truncated multi-page B/L never proceeds as if whole. Documents that fail these checks are quarantined with a structured reason code rather than silently dropped — a missing page is a recoverable operator problem, not a parse failure.

Every accepted file receives a SHA-256 tracking hash at receipt. That hash is the correlation key threaded through every downstream log line and commit, giving end-to-end auditability for customs and port state control. Validated payloads are then serialized onto a message broker and routed to worker pools via Async Batch Processing Pipelines. Decoupling ingestion from extraction is what keeps the TOS from starving during a peak vessel call: a berth window that drops two thousand B/Ls in ten minutes must not block on the slowest OCR page, and each file must be independently retryable when a transient worker fault occurs.

import hashlib
from pathlib import Path

import structlog
from pydantic import BaseModel, Field

log = structlog.get_logger()


class IngestVerdict(BaseModel):
    document_hash: str = Field(..., pattern=r"^[0-9a-f]{64}$")
    accepted: bool
    reason: str = "OK"
    page_count: int = Field(..., ge=0)


def admit(pdf_path: Path, *, max_bytes: int, encrypted: bool, pages: int) -> IngestVerdict:
    raw = pdf_path.read_bytes()
    document_hash = hashlib.sha256(raw).hexdigest()

    if encrypted:
        verdict = IngestVerdict(document_hash=document_hash, accepted=False,
                                reason="PASSWORD_PROTECTED", page_count=pages)
    elif len(raw) > max_bytes:
        verdict = IngestVerdict(document_hash=document_hash, accepted=False,
                                reason="OVERSIZE", page_count=pages)
    elif pages == 0:
        verdict = IngestVerdict(document_hash=document_hash, accepted=False,
                                reason="EMPTY_DOCUMENT", page_count=pages)
    else:
        verdict = IngestVerdict(document_hash=document_hash, accepted=True, page_count=pages)

    log.info("ingest_verdict", **verdict.model_dump())
    return verdict
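
The `admit` function above takes `pages` as a given. One way to back the page-count completeness check is to compare the page numbers printed on the document against the total it declares. The sketch below assumes the carrier prints a "Page X of Y" footer on every page, which not every template does; `missing_pages` and `page_texts` (one text string per page) are hypothetical names, not part of the pipeline described here.

```python
import re

# Assumption: the carrier prints a footer such as "Page 2 of 3" on every page.
PAGE_FOOTER = re.compile(r"\bPage\s+(\d+)\s+of\s+(\d+)\b", re.IGNORECASE)


def missing_pages(page_texts: list[str]) -> set[int]:
    """Return the printed page numbers that are absent from the document.

    An empty set means every declared page is present. Pages without a
    recognizable footer are ignored, so an empty result on a footer-less
    document proves nothing either way.
    """
    seen: set[int] = set()
    declared_total = 0
    for text in page_texts:
        match = PAGE_FOOTER.search(text)
        if match is None:
            continue
        seen.add(int(match.group(1)))
        declared_total = max(declared_total, int(match.group(2)))
    return set(range(1, declared_total + 1)) - seen
```

A non-empty result would map naturally onto a quarantine reason code, since a missing page is something an operator can recover by requesting a re-send.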

Python Data Structure Mapping

Carrier PDF templates lack uniformity. Shipping lines deploy proprietary layouts with varying table structures, embedded fonts, and coordinate drift across document revisions, so rigid DOM or template-matching approaches fail under production drift. The extraction engine instead relies on coordinate-aware text positioning combined with maritime-specific regular expressions: by normalizing page geometry and reading bounding boxes, the pipeline isolates header metadata (B/L number, vessel/voyage, port of loading and discharge) before it walks cargo line items. The coordinate normalization, multi-page table stitching, and OCR fallback triggers behind this are documented in full in Extracting B/L tables with pdfplumber and regex.

Raw extracted strings are not the deliverable — a typed record is. Extracted values are coerced onto Pydantic models aligned with UN/EDIFACT and IMO data dictionaries, so malformed data is rejected at construction rather than discovered three systems downstream. The same target contract is shared with the Bill of Lading Schema Mapping layer that normalizes EDI-sourced B/Ls, which means a PDF-sourced record and an EDIFACT-sourced record are indistinguishable to every consumer once typed. Line items mirror EDI message segments (NAD for parties, CNI for consignment, GID for goods description) so the record hands off cleanly to IFCSUM EDI Message Parsing for aggregated manifest reconciliation.

Field coercion follows fixed rules so two engineers reading the same PDF produce the same record:

PDF field	Raw example	Canonical field	Coercion rule
B/L number	`MAEU 5 8891 2347`	`bl_number`	Strip whitespace, upper-case, verify carrier prefix
Vessel IMO	`IMO 9074729`	`vessel_imo`	7-digit integer; weighted-sum check digit (sum modulo 10)
Container ID	`MSKU 123456-5`	`container_id`	ISO 6346 owner+serial+check; validated by checksum
Port of loading	`Rotterdam, NL`	`pol`	Resolve to UN/LOCODE `NLRTM` against registry
Gross weight	`15,000.00 KG`	`gross_weight_kg`	Parse to `Decimal`; normalize unit to kilograms
Number of packages	`240 CTNS`	`package_count`	Parse leading integer; drop packaging qualifier

from decimal import Decimal

from pydantic import BaseModel, Field, field_validator


class BLLineItem(BaseModel):
    container_id: str = Field(..., pattern=r"^[A-Z]{4}\d{7}$")   # ISO 6346
    package_count: int = Field(..., ge=0)
    goods_description: str
    gross_weight_kg: Decimal = Field(..., ge=0)


class ExtractedBL(BaseModel):
    document_hash: str = Field(..., pattern=r"^[0-9a-f]{64}$")
    bl_number: str = Field(..., min_length=1)
    vessel_imo: int = Field(..., ge=1_000_000, le=9_999_999)
    pol: str = Field(..., pattern=r"^[A-Z]{5}$")   # UN/LOCODE
    pod: str = Field(..., pattern=r"^[A-Z]{5}$")
    line_items: list[BLLineItem]
    extraction_confidence: float = Field(..., ge=0.0, le=1.0)

    @field_validator("bl_number")
    @classmethod
    def normalize_bl(cls, v: str) -> str:
        return v.replace(" ", "").upper()
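
The coercion rules in the table above could be sketched as small pure functions, so each rule can be tested apart from PDF parsing. This is a minimal illustration, not the pipeline's actual code: the function names and the unit conversion table are assumptions, and the carrier-prefix and UN/LOCODE lookups are left out because they need registries that are not shown here.

```python
import re
from decimal import Decimal

# Conversion factors to kilograms. The unit list is an assumption for this
# sketch; extend it to whatever units the carriers in scope actually print.
_TO_KG = {
    "KG": Decimal("1"),
    "KGS": Decimal("1"),
    "MT": Decimal("1000"),
    "LB": Decimal("0.45359237"),
    "LBS": Decimal("0.45359237"),
}

_WEIGHT = re.compile(r"^\s*(\d[\d,]*(?:\.\d+)?)\s*([A-Za-z]+)\s*$")
_LEADING_INT = re.compile(r"^\s*(\d+)")


def parse_gross_weight_kg(raw: str) -> Decimal:
    """Parse a weight such as '15,000.00 KG' into kilograms as a Decimal."""
    match = _WEIGHT.match(raw)
    if match is None:
        raise ValueError(f"unparseable weight: {raw!r}")
    amount = match.group(1).replace(",", "")  # drop thousands separators
    unit = match.group(2).upper()
    if unit not in _TO_KG:
        raise ValueError(f"unknown weight unit: {unit!r}")
    return Decimal(amount) * _TO_KG[unit]


def parse_package_count(raw: str) -> int:
    """Parse the leading integer of '240 CTNS'; the qualifier is dropped."""
    match = _LEADING_INT.match(raw)
    if match is None:
        raise ValueError(f"no leading package count: {raw!r}")
    return int(match.group(1))


def normalize_bl_number(raw: str) -> str:
    """Strip all whitespace and upper-case the B/L number."""
    return "".join(raw.split()).upper()
```

Raising `ValueError` rather than returning a default keeps an unparseable cell from slipping through as a plausible-looking zero.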

Validation, Quarantine & Compliance Auditing

Extraction errors do not stay abstract — they inflate cargo dwell time, distort demurrage calculations, and surface as compliance findings. The pipeline therefore applies a three-tier validation boundary, from cheapest structural check to most expensive regulatory rule, and routes failures by whether a human can plausibly fix them.

Structural validation. Enforce mandatory fields, types, and length constraints against the maritime schema registry using Pydantic strict mode. A missing bl_number, an invalid voyage format, or a malformed container ID is a format defect — it fails fast to the dead-letter queue (DLQ), because no partial record should proceed.
Cross-field / semantic validation. Verify logical relationships: the declared gross weight matches the sum of line-item weights, the container count equals line-item quantity, and the vessel IMO number passes its check-digit algorithm (the first six digits are weighted by 7, 6, 5, 4, 3, 2 and summed, and the seventh digit must equal the rightmost digit of that sum, that is, the sum modulo 10). UN/LOCODEs resolve against the official registry; container IDs are validated against the ISO 6346 check digit.
Regulatory / business-rule enforcement. Flag restricted cargo codes, mismatched seal numbers, expired validity dates, and SOLAS Verified Gross Mass (VGM) values below tare per port authority regulations.
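
The cross-field tier can be illustrated with a small function that returns reason codes instead of raising, so a failing record can be quarantined with context. This is a sketch under stated assumptions: the declared totals are passed in as parameters because the record model on this page does not carry them, the reason codes and the 0.5 kg tolerance are hypothetical, and the container count is compared against distinct container IDs on the assumption that one container may carry several line items.

```python
from decimal import Decimal


def cross_field_errors(
    declared_total_kg: Decimal,
    line_weights_kg: list[Decimal],
    declared_container_count: int,
    container_ids: list[str],
    tolerance_kg: Decimal = Decimal("0.5"),  # assumed rounding allowance
) -> list[str]:
    """Return reason codes for failed cross-field checks (empty list = pass)."""
    errors: list[str] = []

    # Declared gross weight must match the sum of the line items.
    total = sum(line_weights_kg, Decimal("0"))
    if abs(total - declared_total_kg) > tolerance_kg:
        errors.append("WEIGHT_SUM_MISMATCH")

    # Count distinct containers: one container may carry several line items.
    if len(set(container_ids)) != declared_container_count:
        errors.append("CONTAINER_COUNT_MISMATCH")

    return errors
```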

The routing distinction is load-bearing. An unrecoverable structural defect goes to the DLQ; a record that parsed cleanly but failed a semantic or regulatory check goes to a separate QUARANTINE topic where a shipping-ops operator can correct an unknown LOCODE or supply a missing VGM without halting the line. The one exception in this pipeline is a failed IMO check digit: it signals a corrupted identifier rather than a missing value, so it is routed to the DLQ alongside the structural defects. Every validation event emits a structured JSON log carrying the document hash, worker ID, timestamp, rule applied, and outcome, which forms an immutable audit trail that satisfies customs inspection and supports rapid root-cause analysis during port state control.

def imo_check_digit_ok(imo: int) -> bool:
    """Weight the first six digits by 7..2; the sum's last digit must equal digit seven."""
    digits = [int(d) for d in f"{imo:07d}"]
    weighted = sum(w * d for w, d in zip((7, 6, 5, 4, 3, 2), digits[:6]))
    return weighted % 10 == digits[6]


def route_extracted(bl: ExtractedBL, *, locode_ok: bool, vgm_ok: bool) -> str:
    """Return "DLQ", "QUARANTINE", or "ACCEPTED" for a structurally valid record."""
    # Total of the line-item weights; zero (or no line items) is implausible.
    total_kg = sum(li.gross_weight_kg for li in bl.line_items)
    # A failed IMO check digit means a corrupted identifier, so it goes to the DLQ.
    if not imo_check_digit_ok(bl.vessel_imo):
        log.warning("dlq", document_hash=bl.document_hash, reason="IMO_CHECKSUM")
        return "DLQ"
    # Operator-fixable problems go to quarantine rather than the DLQ.
    if not locode_ok:
        log.warning("quarantine", document_hash=bl.document_hash, reason="LOCODE_UNKNOWN")
        return "QUARANTINE"
    if not vgm_ok or total_kg <= 0:
        log.warning("quarantine", document_hash=bl.document_hash, reason="VGM_IMPLAUSIBLE")
        return "QUARANTINE"
    log.info("bl_accepted", document_hash=bl.document_hash, bl_number=bl.bl_number)
    return "ACCEPTED"

Downstream Integration

An accepted ExtractedBL does not sit in a table waiting to be queried — it drives operational state. A single B/L routinely references multiple containers with distinct seals, cargo descriptions, and dangerous-goods codes, so the extracted line items resolve into the Container Hierarchy Data Models — the Vessel → Bay → Tier → Row → Container topology — rather than a flattened copy, letting reefer monitoring and hazmat segregation execute off one shared structure.

Status transitions derived from the B/L propagate as milestone events. When a record reaches GATE_IN or LOADED_ONBOARD, it publishes to the Port Call Workflow Design state machine, keyed on a deterministic (bl_number, milestone, event_hash) tuple so that a re-uploaded PDF or a retried commit collapses to a no-op instead of corrupting a stowage plan. The extracted cargo record is reconciled against IFCSUM EDI Message Parsing output where the same shipment also appears on an aggregated manifest, and the same identifiers let equipment movements correlate with the container event stream governed by Container Status Mapping Rules. Every cross-boundary publish is authenticated under the Maritime Security Boundary Setup zero-trust controls, so a PDF that entered over a public upload endpoint never gains implicit trust downstream.
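
The idempotent publish described above can be sketched with a deterministic key and a dedup store. This is an illustration only: how `event_hash` is actually derived is not specified on this page, so the sketch assumes it is a SHA-256 over the document hash and milestone, and `MilestonePublisher` is a hypothetical in-memory stand-in for a publisher backed by a durable store.

```python
import hashlib

Key = tuple[str, str, str]


def milestone_key(bl_number: str, milestone: str, document_hash: str) -> Key:
    """Build the deterministic (bl_number, milestone, event_hash) key.

    Assumption: event_hash is derived from the document hash and the
    milestone, so the same PDF always yields the same key.
    """
    event_hash = hashlib.sha256(f"{document_hash}:{milestone}".encode()).hexdigest()
    return (bl_number, milestone, event_hash)


class MilestonePublisher:
    """In-memory stand-in for a publisher with a durable dedup store."""

    def __init__(self) -> None:
        self._seen: set[Key] = set()
        self.published: list[Key] = []

    def publish(self, bl_number: str, milestone: str, document_hash: str) -> bool:
        """Publish once per key; return False when the call was a replay."""
        key = milestone_key(bl_number, milestone, document_hash)
        if key in self._seen:
            return False  # re-upload or retried commit: no-op
        self._seen.add(key)
        self.published.append(key)
        return True
```

In a real deployment the seen-set would need to survive restarts, for example as a unique constraint in the store that records milestones.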

Fallback Chains & Uptime Guarantees

Uptime in PDF extraction depends on graceful degradation, not brittle failure. The engine prioritizes data integrity over speed through an explicit fallback chain: primary coordinate-aware extraction; then a shifted bounding-box parse that tolerates the few-point coordinate drift carriers introduce between revisions; then Tesseract OCR (see OCR fallback with Tesseract for scanned bills of lading) when a page is scanned or the embedded text layer is unusable; and finally a human review queue. Each hop is gated by an extraction_confidence score, and every transition is logged so drift in a specific carrier template is visible before it becomes an SLA breach.
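
The confidence-gated chain can be expressed as an ordered list of stages tried cheapest-first. The sketch below is a simplified illustration: `run_fallback_chain`, the stage names, and the `HUMAN_REVIEW` marker are hypothetical, and each stage is assumed to be a callable returning `(fields, confidence)`.

```python
from typing import Callable, Optional

Fields = dict[str, str]
# Assumption: every stage takes the raw page and returns (fields, confidence).
Stage = Callable[[bytes], tuple[Fields, float]]


def run_fallback_chain(
    page: bytes,
    stages: list[tuple[str, Stage]],
    threshold: float = 0.6,
) -> tuple[Optional[Fields], float, str]:
    """Try stages in order; return the first result that clears the threshold.

    Returns (fields, confidence, stage_name). If no stage clears the
    threshold, fields is None and stage_name is "HUMAN_REVIEW".
    """
    best_confidence = 0.0
    for name, stage in stages:
        fields, confidence = stage(page)
        best_confidence = max(best_confidence, confidence)
        if confidence >= threshold:
            return fields, confidence, name
    return None, best_confidence, "HUMAN_REVIEW"
```

Returning the stage name alongside the fields is what lets the logs show when one carrier's traffic starts shifting from the first stage toward OCR.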

Circuit breakers per carrier template. Track failure rates keyed by carrier template fingerprint. After a burst of low-confidence parses on one template — the signature of a silent layout change — open the circuit for that template and route its traffic straight to OCR or review, so one carrier’s redesign cannot cascade into a pipeline-wide backlog.
Exponential backoff with jitter. Retry transient worker faults and OCR-service timeouts using base_delay * 2 ** attempt + jitter, capped at three attempts before the record moves to quarantine.
DLQ versus quarantine. Corrupted, encrypted, or structurally invalid PDFs land in the DLQ and require engineering; documents that extracted cleanly but failed a business rule land in the quarantine topic and reconcile automatically once an operator or registry supplies the missing value.
Idempotent commits. Because the document hash keys every write, at-least-once broker delivery is safe: a redelivered PDF produces the identical record and the identical no-op downstream.
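
Two of the mechanisms above are compact enough to sketch directly: the backoff delay formula and a per-template circuit breaker. Both are illustrations under assumptions. The default delays, the sliding-window size, and the failure limit are made-up values, `TemplateBreaker` is a hypothetical name, and the breaker omits the half-open probing a production breaker would normally add.

```python
import random
from collections import defaultdict, deque


def backoff_delay(attempt: int, base_delay: float = 0.5, max_jitter: float = 0.25) -> float:
    """base_delay * 2 ** attempt + jitter, with jitter uniform in [0, max_jitter]."""
    return base_delay * 2 ** attempt + random.uniform(0.0, max_jitter)


class TemplateBreaker:
    """Opens for a carrier template after too many low-confidence parses
    within a sliding window of its most recent documents."""

    def __init__(self, window: int = 20, max_failures: int = 10) -> None:
        self._max_failures = max_failures
        # One bounded history per template fingerprint; old results fall off.
        self._recent: dict[str, deque] = defaultdict(lambda: deque(maxlen=window))

    def record(self, template_fingerprint: str, low_confidence: bool) -> None:
        """Record the outcome of one parse for this template."""
        self._recent[template_fingerprint].append(low_confidence)

    def is_open(self, template_fingerprint: str) -> bool:
        """True when this template's traffic should bypass primary extraction."""
        return sum(self._recent[template_fingerprint]) >= self._max_failures
```

Because the history is keyed by fingerprint, a burst of failures on one carrier's redesigned layout leaves every other template on the primary path.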

Stateless extraction containers run under Kubernetes with persistent queues guaranteeing at-least-once delivery, and every transformation step is logged to a centralized observability stack for real-time drift detection and SLA monitoring.

Step-by-step Implementation Guide

The reference flow below takes a carrier PDF from bytes to a routed record. The steps reuse the models and helpers defined in the sections above, carry type annotations, and log through structlog as structured JSON.

Step 1 — Admit the document at the ingestion boundary

Hash the file, check for encryption, size, and page completeness, and emit a structured verdict before any text is parsed. Rejected documents are quarantined by reason code, never dropped.

verdict = admit(Path("MAEU58891.pdf"), max_bytes=25_000_000, encrypted=False, pages=3)
if not verdict.accepted:
    raise SystemExit(f"quarantined: {verdict.reason}")

Step 2 — Extract header metadata by coordinate, with OCR fallback

Read header bounding boxes first; if confidence is low, retry with a shifted box, then OCR. Record which stage produced the value.

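def extract_header(page, *, score_header, ocr) -> tuple[list[dict], float]:
    # score_header(words) -> float and ocr(page_image) -> (words, confidence)
    # are caller-supplied callables; the shifted-box retry is omitted for brevity.
    stage = "coordinate"
    words = page.extract_words(x_tolerance=1.5)  # pdfplumber, coordinate-aware
    confidence = score_header(words)
    if confidence < 0.6:
        stage = "ocr"
        words, confidence = ocr(page.to_image(resolution=300))
    log.info("header_extracted", stage=stage, confidence=round(confidence, 3))
    return words, confidence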
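
Between extracting word boxes and building the typed record, the positioned words have to become named header fields. A minimal sketch of that step follows, assuming word dictionaries shaped like pdfplumber's `extract_words` output (`text`, `x0`, `top`). The label patterns, the `y_tolerance` default, and the function names are assumptions; real carrier templates label these fields in many different ways.

```python
import re

# Label patterns are assumptions for this sketch; carrier templates vary.
_BL = re.compile(r"B/L\s+No\.?\s*:?\s*([A-Z0-9]+)", re.IGNORECASE)
_IMO = re.compile(r"\bIMO\s*:?\s*(\d{7})\b", re.IGNORECASE)


def words_to_lines(words: list[dict], y_tolerance: float = 2.0) -> list[str]:
    """Group word boxes into text lines by 'top', then order each line by 'x0'."""
    lines: list[list[dict]] = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(lines[-1][0]["top"] - word["top"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]


def parse_header(words: list[dict]) -> dict[str, str]:
    """Pull the B/L number and IMO number out of positioned words.

    Fields that are not found are simply absent from the result, so the
    caller can score completeness instead of receiving placeholder values.
    """
    text = "\n".join(words_to_lines(words))
    header: dict[str, str] = {}
    bl_match = _BL.search(text)
    if bl_match:
        header["bl_number"] = bl_match.group(1).upper()
    imo_match = _IMO.search(text)
    if imo_match:
        header["imo"] = imo_match.group(1)
    return header
```

Grouping by `top` with a tolerance matters because words on one visual line rarely share an identical baseline coordinate.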

Step 3 — Stitch multi-page cargo tables into line items

Cargo tables frequently span pages; concatenate rows on a stable column geometry before typing them so a page break never splits a container across two records.

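def stitch_line_items(pages) -> list[BLLineItem]:
    rows: list[BLLineItem] = []
    for page in pages:
        for r in page.extract_table() or []:
            # pdfplumber returns None for empty cells; normalize to stripped strings.
            cells = [(c or "").strip() for c in r[:4]]
            # Skip short rows and repeated column headers (no leading package integer).
            if len(cells) < 4 or not cells[1][:1].isdigit():
                continue
            rows.append(BLLineItem(
                container_id=cells[0].replace(" ", "").replace("-", "").upper(),
                package_count=int(cells[1].split()[0]),
                goods_description=cells[2],
                gross_weight_kg=Decimal(cells[3].replace(",", "").split()[0]),
            ))
    return rows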
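
Stitching across a page break can be done on raw table rows before any typing happens. The sketch below rests on two assumptions that will not hold for every template: each page repeats the same header row, and a row whose first cell (the container ID) is empty continues the previous row. `stitch_rows` is a hypothetical helper operating on the list-of-rows shape that `extract_table` returns.

```python
from typing import Optional

Row = list[Optional[str]]


def stitch_rows(tables: list[list[Row]], header: list[str]) -> list[list[str]]:
    """Merge per-page tables into one list of complete rows.

    Assumptions: every page repeats the same header row, all rows share one
    column count, and a row with an empty first cell continues the previous row.
    """
    stitched: list[list[str]] = []
    for table in tables:
        for raw in table:
            cells = [(c or "").strip() for c in raw]
            if cells == header:
                continue  # repeated column header
            if not any(cells):
                continue  # blank spacer row
            if cells[0] == "" and stitched:
                # Continuation across a line or page break: append the
                # non-empty cells to the matching cells of the previous row.
                previous = stitched[-1]
                for i, cell in enumerate(cells):
                    if cell:
                        previous[i] = f"{previous[i]} {cell}".strip()
                continue
            stitched.append(cells)
    return stitched
```

Only after this merge would each row be coerced into a typed line item, so a description that wraps onto the next page stays attached to its container.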

Step 4 — Construct and structurally validate the record

Assemble the ExtractedBL. A pydantic.ValidationError here is a structural defect and belongs in the DLQ.

bl = ExtractedBL(
    document_hash=verdict.document_hash,
    bl_number=header["bl_number"],
    vessel_imo=int(header["imo"]),
    pol=header["pol_locode"],
    pod=header["pod_locode"],
    line_items=line_items,
    extraction_confidence=confidence,
)
log.info("bl_structurally_valid", bl_number=bl.bl_number)

Step 5 — Apply semantic checks and route

Cross-reference the IMO check digit, UN/LOCODEs, and VGM, then route to ACCEPTED, QUARANTINE, or the DLQ.

outcome = route_extracted(bl, locode_ok=True, vgm_ok=True)
if outcome == "ACCEPTED":
    publish_milestone(bl, milestone="GATE_IN")   # idempotent on (bl_number, milestone, event_hash)

Troubleshooting Common Failures

Symptom	Root cause	Fix
Header fields blank on a native PDF	Carrier moved the title block; hard-coded bounding box missed it	Widen `x_tolerance`, then fall back to shifted box and OCR (Step 2)
Whole page returns as one text blob	Scanned image with no embedded text layer	Route the page to Tesseract OCR at 300 dpi; never regex an image
One container split across two records	Cargo table spans a page break	Stitch rows on stable column geometry before typing (Step 3)
`ValidationError: container_id`	OCR confused `0`/`O` or `1`/`I` in the ISO 6346 serial	Reconcile against the equipment registry; flag as `check_digit_drift`, do not drop
`IMO_CHECKSUM` on a real vessel	OCR misread a digit, or an MMSI landed in the IMO slot	DLQ with context; re-OCR the header region at higher resolution
`VGM_IMPLAUSIBLE` (weight below tare)	Shipper omitted VGM or OCR dropped a digit	Quarantine as `VGM_IMPLAUSIBLE`; never forward to the stowage planner
Duplicate gate/manifest events	Non-idempotent publish on PDF re-upload	Key milestones on `(bl_number, milestone, event_hash)` so replays are no-ops
`LOCODE_UNKNOWN` on a valid port	Free-text port name resolved against a stale registry	Quarantine, refresh the UN/LOCODE registry, reconcile — do not hard-reject

Frequently Asked Questions

When should the pipeline fall back from coordinate extraction to OCR?

Gate the fallback on an explicit confidence score, not on an exception. Coordinate-aware extraction handles native PDFs with an intact text layer cheaply and deterministically, so try it first. When the header or table score drops below threshold — a signature of a scanned page or a silently redesigned template — retry with a shifted bounding box, then Tesseract OCR at 300 dpi. OCR is the most expensive and error-prone stage, so it is a fallback, never the default path.

Should a PDF-sourced B/L and an EDIFACT-sourced B/L share one schema?

Yes. Both resolve into the same typed record so that TOS, customs, and stowage consumers never branch on how the document arrived. This page’s ExtractedBL aligns with the Bill of Lading Schema Mapping target used for EDI, and the source is carried as a tagged attribute rather than a structural fork — the only format-specific logic lives in extraction itself.

Why quarantine an implausible VGM instead of rejecting the whole document?

A VGM below the container tare almost always means the shipper omitted the figure or OCR dropped a digit — the rest of the B/L is legally valid and operationally needed. Rejecting the document would strand real cargo. Quarantine preserves the record with a VGM_IMPLAUSIBLE reason so operations can supply the correct mass and reconcile, while the stowage planner is protected from acting on a bad weight. The DLQ stays reserved for corruption a human cannot fix without engineering.

How do we stop a re-uploaded PDF from double-triggering downstream workflows?

Make every commit idempotent. The SHA-256 document hash assigned at ingestion keys the record, and each milestone publish is keyed on a deterministic (bl_number, milestone, event_hash) tuple. A redelivered file therefore produces the identical record and a no-op event, which is the contract the Port Call Workflow Design state machine relies on to stay correct under at-least-once broker delivery.

Related

OCR fallback with Tesseract for scanned bills of lading — the confidence-gated Tesseract stage for scanned pages with no usable text layer
Extracting B/L tables with pdfplumber and regex — the coordinate-parsing and table-stitching detail beneath this pipeline
Async Batch Processing Pipelines — the broker and worker pool that decouples ingestion from extraction
IFCSUM EDI Message Parsing — aggregated manifest data reconciled against extracted B/L records
Bill of Lading Schema Mapping — the shared normalized B/L record this extraction targets
Container Hierarchy Data Models — resolving multi-container B/L references into the equipment topology

