Extracting B/L tables with pdfplumber and regex

Extracting B/L tables with pdfplumber and regex solves one precise task: recovering cargo line items and header fields from a carrier Bill of Lading PDF when the tables are borderless, merged, or whitespace-aligned and a rigid grid parser returns nothing usable. This page shows a deterministic, auditable two-layer strategy — pdfplumber grid extraction first, a maritime-lexicon regex fallback second — that holds its SLA through carrier template drift and OCR noise without materialising whole documents in memory.

Architecture Alignment

This task is the coordinate-and-regex core beneath the PDF Bill of Lading Extraction pipeline, which itself sits inside the Document Ingestion & EDI Parsing Workflows discipline. The parent pipeline admits a document at the ingestion boundary and routes validated records downstream; this page owns the middle step — turning the raw page geometry into typed rows. It inherits the same posture: tolerate messy layouts at the wire, then apply uncompromising validation. Rows produced here are coerced against the shared record defined in Bill of Lading Schema Mapping, so a PDF-sourced line item is indistinguishable from an EDIFACT-sourced one once typed, and the worker pool that drives this extraction is supplied by the Async Batch Processing Pipelines broker.

Grid extraction runs first; a found table streams straight to typed rows. A degraded, borderless, or scanned page falls through to the maritime-lexicon regex layer — a field match still yields rows, and a page that matches nothing is skipped and logged as an extraction gap, never silently dropped.

Prerequisites & Environment Setup

The reference code targets Python 3.11+. The list[...] / dict[...] builtin generics need 3.9 and the X | None unions need 3.10, so 3.11 covers both. Install the extraction and validation stack:

Package	Version	Role
`pdfplumber`	`>=0.11`	Coordinate-aware text + grid table extraction
`pydantic`	`>=2.6`	Typed row models, strict-mode structural validation
`structlog`	`>=24.1`	Structured JSON audit logging (never bare `print`)
`pytest`	`>=8.0`	Fixture-driven verification of extracted rows

python -m venv .venv && source .venv/bin/activate
pip install "pdfplumber>=0.11" "pydantic>=2.6" "structlog>=24.1" "pytest>=8.0"

Configure structlog once at process start so every extraction event emits machine-parseable JSON with an ISO-8601 timestamp — this is the audit trail customs and port state control inspections expect, retained for the 7-year regulatory minimum:

import logging

import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
)
log = structlog.get_logger("bl_table_extractor")

No credentials or external registries are required for extraction itself; UN/LOCODE and ISO 6346 resolution happen in the downstream validation tier described in Schema Validation Frameworks.

Step-by-step Implementation

The steps below build on one another: later blocks reuse log, LEXICON, and normalize_ocr_text from earlier ones. Together they form a generator pipeline that streams one row at a time, so memory stays bounded regardless of manifest length.

Step 1 — Define the typed row and the maritime lexicon

Model the row explicitly; implicit dictionaries are unacceptable in production. The lexicon patterns anchor to Bill of Lading terminology with tolerance for OCR whitespace. The fields correspond to UN/EDIFACT segments (NAD for the consignee party, EQD for equipment), so rows hand off cleanly to IFCSUM EDI Message Parsing.

import re
from decimal import Decimal
from typing import Any

from pydantic import BaseModel, Field, field_validator


class BLTableRow(BaseModel):
    bl_number: str = Field(..., min_length=4, max_length=16)
    consignee: str = ""
    container_no: str = Field("", pattern=r"^([A-Z]{4}\d{7})?$")   # ISO 6346
    gross_weight_kg: Decimal | None = None
    page_index: int = Field(..., ge=0)
    extraction_method: str   # "table" | "regex_fallback"

    @field_validator("bl_number")
    @classmethod
    def normalize_bl(cls, v: str) -> str:
        return v.replace(" ", "").upper()


# Maritime lexicon: B/L header labels plus the ISO 6346 container id shape,
# tolerant of the whitespace and punctuation OCR injects into scanned B/Ls.
LEXICON = {
    "bl_number": re.compile(r"(?:B/L|BILL\s+OF\s+LADING|BL\s*NO\.?)\s*[:.]?\s*([A-Z0-9]{4,16})", re.I),
    "consignee": re.compile(r"(?:CONSIGNEE|TO\s+ORDER\s+OF)\s*[:.]?\s*([A-Z0-9\s&.,'\-]{5,80})", re.I),
    "container_no": re.compile(r"(?:CONTAINER|CNTR|CNO)\s*[:.]?\s*([A-Z]{4}\s?\d{7})", re.I),
    "gross_weight": re.compile(r"(?:GROSS\s+WEIGHT|G\.?W\.?)\s*[:.]?\s*([\d,]+\.?\d*)\s*(?:KG|KGM)?", re.I),
}


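def normalize_ocr_text(raw: str) -> str:
    """Strip control characters and collapse the whitespace OCR scatters into fields."""
    without_controls = re.sub(r"[\x00-\x1f\x7f-\x9f]+", " ", raw or "")
    # Collapse the remaining runs of spaces so the docstring's promise holds.
    return re.sub(r"\s+", " ", without_controls).strip()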
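One thing worth checking before relying on the lexicon is how the bl_number pattern behaves when a label word sits between the anchor and the value. The sketch below copies the source pattern verbatim and compares it with a hypothetical stricter variant (STRICT_BL is not part of the original lexicon). The variant assumes every B/L number contains at least one digit, which will not hold for every carrier.

```python
import re

# Source pattern from the lexicon above, copied verbatim for comparison.
SOURCE_BL = re.compile(
    r"(?:B/L|BILL\s+OF\s+LADING|BL\s*NO\.?)\s*[:.]?\s*([A-Z0-9]{4,16})", re.I
)

# Hypothetical variant (an assumption, not the original lexicon): allows an
# optional NUMBER / NO / NR token after the label and requires the captured
# value to contain a digit, so a label word is never taken for the number.
STRICT_BL = re.compile(
    r"\b(?:B/L|BILL\s+OF\s+LADING|BL)\s*(?:NUMBER|NO|NR)?\.?\s*[:.#]?\s*"
    r"((?=[A-Z]*\d)[A-Z0-9]{4,16})",
    re.I,
)


def first_group(pattern: re.Pattern[str], text: str) -> str:
    """Return the first captured value, or an empty string when nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else ""


for sample in (
    "B/L NO: MAEU588912347",
    "Bill of Lading Number: MAEU588912347",
    "BL NO. MAEU588912347",
):
    print(repr(first_group(SOURCE_BL, sample)), repr(first_group(STRICT_BL, sample)))
```

On these three samples the source pattern misses the first, captures the label word in the second, and only recovers the third, so it is worth running the lexicon against real carrier headers before trusting it.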

Step 2 — Attempt grid extraction with tuned table settings

pdfplumber reads real ruling lines well but misaligns on borderless layouts. Read the table with explicit strategies and intersection tolerances widened slightly beyond pdfplumber's default of 3; when the carrier draws no lines, text strategies let the parser infer columns from word geometry rather than returning an empty grid.

import pdfplumber

TABLE_SETTINGS = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "intersection_y_tolerance": 5,
    "intersection_x_tolerance": 5,
    "snap_tolerance": 3,
}
# Fallback for borderless carrier templates: infer columns from word positions.
TEXT_SETTINGS = {"vertical_strategy": "text", "horizontal_strategy": "text"}


def extract_grid_rows(page: "pdfplumber.page.Page", page_idx: int) -> list[dict[str, Any]]:
    for settings in (TABLE_SETTINGS, TEXT_SETTINGS):
        tables = page.extract_tables(table_settings=settings) or []
        rows = [
            # Empty cells arrive as None; map them to "" rather than the string "None".
            {"raw": [normalize_ocr_text(c or "") for c in row], "method": "table", "page": page_idx}
            for table in tables
            for row in table
            if row and any(row)
        ]
        if rows:
            log.info("grid_extracted", page=page_idx, rows=len(rows), strategy=settings["vertical_strategy"])
            return rows
    return []

Step 3 — Fall back to the regex lexicon when the grid degrades

When both grid strategies yield nothing — merged cells, a rotated freight manifest, or a scanned image page — pull the full page text and match the lexicon. A row that recovers any field is worth keeping; a page that matches nothing is logged as a gap, never silently dropped.

def extract_regex_row(page: "pdfplumber.page.Page", page_idx: int) -> list[dict[str, Any]]:
    text = normalize_ocr_text(page.extract_text() or "")
    matches = {field: (m.group(1).strip() if (m := pat.search(text)) else "")
               for field, pat in LEXICON.items()}
    if any(matches.values()):
        log.info("regex_fallback_hit", page=page_idx, fields=[k for k, v in matches.items() if v])
        return [{"raw": matches, "method": "regex_fallback", "page": page_idx}]
    log.warning("extraction_gap", page=page_idx, reason="no_grid_no_lexicon")
    return []
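Because pat.search stops at the first hit, the fallback above returns at most one row per page, which loses containers when a page lists several. A minimal sketch of collecting every container id with finditer follows. CONTAINER_TOKEN and find_all_containers are hypothetical names, and the sketch assumes that any unlabeled token with the ISO 6346 shape is a container id.

```python
import re

# Assumption: three owner letters, a category letter (U, J or Z) and seven
# digits form a container id wherever they appear, with or without a label.
CONTAINER_TOKEN = re.compile(r"\b([A-Z]{3}[UJZ])\s?(\d{7})\b")


def find_all_containers(page_text: str) -> list[str]:
    """Return every distinct container id on the page, in first-seen order."""
    seen: dict[str, None] = {}   # dict keys preserve insertion order
    for m in CONTAINER_TOKEN.finditer(page_text.upper()):
        seen.setdefault(m.group(1) + m.group(2), None)
    return list(seen)


text = "CONTAINER: MSKU 1234565 / TGHU7654321 seal 99 MSKU1234565"
print(find_all_containers(text))
```

Each id found this way could be emitted as its own regex_fallback row that shares the page's bl_number, although how to pair weights with containers is left open here.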

Step 4 — Stream pages under a bounded memory footprint

Multi-page manifests exhaust worker heaps if loaded whole. Open the PDF as a context manager and yield rows page by page so the footprint stays constant; call page.flush_cache() to release the per-page object cache pdfplumber accumulates across a large document.

from pathlib import Path
from typing import Iterator

from pdfminer.pdfparser import PDFSyntaxError


def stream_rows(pdf_path: Path) -> Iterator[dict[str, Any]]:
    if not pdf_path.exists():
        log.error("file_not_found", path=str(pdf_path))
        return
    try:
        with pdfplumber.open(str(pdf_path)) as pdf:
            for page_idx, page in enumerate(pdf.pages):
                rows = extract_grid_rows(page, page_idx) or extract_regex_row(page, page_idx)
                yield from rows
                page.flush_cache()   # bound heap growth across large manifests
    except PDFSyntaxError as exc:
        log.error("pdf_corrupt", path=str(pdf_path), error=str(exc))

Step 5 — Gate each row and emit the typed record

Compliance gating decides what proceeds. Reject rows whose bl_number fails the alphanumeric contract, coerce the weight to Decimal, and construct the BLTableRow. A pydantic.ValidationError here is a structural defect bound for the dead-letter queue; a clean row is emitted with a PASS audit line.

def gate_row(row: dict[str, Any]) -> BLTableRow | None:
    raw, method, page = row.get("raw", {}), row["method"], row["page"]
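    # Regex rows are dicts keyed by field name. Grid rows are positional lists and
    # must be mapped to field names before this gate, or they fail the check below.
    cells = raw if isinstance(raw, dict) else {}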
    bl_num = cells.get("bl_number", "").upper().replace(" ", "")
    if not re.fullmatch(r"[A-Z0-9]{4,16}", bl_num):
        log.warning("compliance_gate_failed", page=page, method=method, reason="invalid_bl_format")
        return None
    weight = cells.get("gross_weight", "").replace(",", "")
    try:
        record = BLTableRow(
            bl_number=bl_num,
            consignee=cells.get("consignee", "").strip(),
            container_no=cells.get("container_no", "").replace(" ", "").upper(),
            gross_weight_kg=Decimal(weight) if weight else None,
            page_index=page,
            extraction_method=method,
        )
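    # pydantic.ValidationError subclasses ValueError; Decimal() raises InvalidOperation (an ArithmeticError).
    except (ValueError, ArithmeticError) as exc: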
        log.warning("compliance_gate_failed", page=page, method=method, error=str(exc))
        return None
    log.info("record_validated", bl_number=bl_num, method=method, compliance_status="PASS")
    return record
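The gate reads named fields, while Step 2 produces positional lists of cells, so a grid row needs a column-to-field mapping before it can pass. One possible bridge is sketched below, assuming the first row of each table is a header row. HEADER_ALIASES, map_header, and grid_to_field_dicts are hypothetical names, and the alias list is illustrative and not a survey of carrier templates.

```python
from typing import Optional

# Hypothetical header aliases; real carrier templates will need their own list.
HEADER_ALIASES: dict[str, tuple[str, ...]] = {
    "bl_number": ("B/L NO", "BL NO", "BILL OF LADING"),
    "consignee": ("CONSIGNEE",),
    "container_no": ("CONTAINER", "CNTR"),
    "gross_weight": ("GROSS WEIGHT", "G.W"),
}


def map_header(header_cells: list[Optional[str]]) -> dict[int, str]:
    """Return {column index: field name} for every recognised header cell."""
    columns: dict[int, str] = {}
    for idx, cell in enumerate(header_cells):
        label = " ".join((cell or "").upper().split())
        for field, aliases in HEADER_ALIASES.items():
            if label.startswith(aliases):   # str.startswith accepts a tuple
                columns[idx] = field
                break
    return columns


def grid_to_field_dicts(grid: list[list[Optional[str]]]) -> list[dict[str, str]]:
    """Turn one table (header row first) into dicts keyed like the regex rows."""
    if not grid:
        return []
    columns = map_header(grid[0])
    if not columns:
        return []   # no recognisable header: leave the page to the fallback
    return [
        {field: (row[idx] or "") if idx < len(row) else "" for idx, field in columns.items()}
        for row in grid[1:]
    ]


table = [
    ["B/L No.", "Container No", "Gross Weight (KG)"],
    ["MAEU588912347", "MSKU1234567", "15,000.00"],
]
print(grid_to_field_dicts(table))
```

The resulting dicts use the same keys as the regex rows, so they can be placed under "raw" and gated by the same code path.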

Edge Cases & Carrier Deviations

Borderless / whitespace columns. Carriers that print no ruling lines make vertical_strategy="lines" return an empty grid. The TEXT_SETTINGS retry in Step 2 recovers most of these; the regex fallback catches the rest.
Merged header cells. A consignee block spanning two logical columns collapses into one cell. The lexicon in Step 3 anchors on the CONSIGNEE / TO ORDER OF label rather than column position, so a merged cell still yields the party.
OCR 0/O and 1/I drift in ISO 6346. A scanned container id such as MSKO123456I fails the ^[A-Z]{4}\d{7}$ pattern, so the Step 5 gate as written drops the row at model construction. Do not discard it: flag it as check_digit_drift and reconcile against the equipment registry defined in the Container Hierarchy Data Models, because the physical box is real.
Rotated freight manifests. Landscape or 90°-rotated pages defeat both grid strategies. Pass page.rotation through and, when nonzero, re-extract after page.dedupe_chars() and rotation normalisation before invoking the lexicon.
Container split across a page break. A cargo table that wraps to the next page yields two partial rows. Stitch on the stable bl_number key before the record reaches Container Status Mapping Rules, rather than emitting two half-containers.
Weight qualifier noise. 15,000.00 KGM versus 15.000,00 KG (European decimal comma) both appear in real traffic. Normalise the thousands and decimal separators before Decimal(): stripping commas alone turns 15.000,00 into 15.00000, so the gate accepts a mass a thousand times too small instead of rejecting it.
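The separator handling in the last item can be sketched as follows. parse_weight is a hypothetical helper, and its rules are heuristics: when both separators appear, the later one is treated as the decimal mark, and a lone separator followed by exactly three digits is assumed to be a thousands group. That second rule is a guess that will misread a value such as 1.500 when it means one and a half.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_weight(raw: str) -> Optional[Decimal]:
    """Parse '15,000.00 KGM' or '15.000,00 KG' into a Decimal, else None."""
    s = re.sub(r"[^\d.,]", "", raw)          # drop units, spaces, stray text
    if not re.search(r"\d", s):
        return None
    last_comma, last_dot = s.rfind(","), s.rfind(".")
    if last_comma != -1 and last_dot != -1:
        dec = "," if last_comma > last_dot else "."   # later separator wins
    elif last_comma == -1 and last_dot == -1:
        dec = ""                                       # plain integer
    else:
        sep = "," if last_comma != -1 else "."
        tail = s.rsplit(sep, 1)[1]
        # Assumption: repeated separators or a three-digit tail mean thousands.
        dec = "" if (s.count(sep) > 1 or len(tail) == 3) else sep
    if dec:
        other = "." if dec == "," else ","
        s = s.replace(other, "").replace(dec, ".")
    else:
        s = s.replace(",", "").replace(".", "")
    try:
        return Decimal(s)
    except InvalidOperation:
        return None


print(parse_weight("15,000.00 KGM"), parse_weight("15.000,00 KG"))
```

Where the carrier's locale is known from the template, passing it in explicitly would be safer than guessing from the digits.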

Verification & Testing

Assert correctness against a fixture shaped like the output of the regex path on a borderless carrier layout, so the fallback branch of the gate is exercised and not just the grid path. The tests below call gate_row directly and check the typed output; the structured-log verdict is asserted separately.

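from decimal import Decimal

import pytest

# gate_row comes from the Step 5 block; import it from wherever that code lives.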


@pytest.fixture
def borderless_row() -> dict:
    return {
        "raw": {
            "bl_number": "maeu 588912347",
            "consignee": "ACME FORWARDING GMBH",
            "container_no": "MSKU 1234567",
            "gross_weight": "15,000.00",
        },
        "method": "regex_fallback",
        "page": 0,
    }


def test_gate_normalizes_and_types(borderless_row: dict) -> None:
    rec = gate_row(borderless_row)
    assert rec is not None
    assert rec.bl_number == "MAEU588912347"        # spaces stripped, upper-cased
    assert rec.container_no == "MSKU1234567"        # ISO 6346 pattern satisfied
    assert rec.gross_weight_kg == Decimal("15000.00")
    assert rec.extraction_method == "regex_fallback"


def test_gate_rejects_malformed_bl(borderless_row: dict) -> None:
    borderless_row["raw"]["bl_number"] = "??"       # below 4-char contract
    assert gate_row(borderless_row) is None

A passing gate_row emits one JSON line per accepted row; assert the shape in integration tests:

{"event": "record_validated", "bl_number": "MAEU588912347", "method": "regex_fallback", "compliance_status": "PASS", "level": "info", "timestamp": "2026-07-03T09:12:44Z"}
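A shape assertion for that line might look like the sketch below. It assumes the JSON line has already been captured as a string, for example from stdout in a pytest run, and check_audit_line and REQUIRED_KEYS are hypothetical names. Key order is deliberately not checked, since JSON objects are unordered.

```python
import json

# Keys taken from the sample audit line above.
REQUIRED_KEYS = {"event", "bl_number", "method", "compliance_status", "level", "timestamp"}


def check_audit_line(line: str) -> dict:
    """Parse one JSON audit line and verify the fields an auditor would query."""
    record = json.loads(line)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise AssertionError(f"audit line missing keys: {sorted(missing)}")
    if record["event"] == "record_validated" and record["compliance_status"] != "PASS":
        raise AssertionError("record_validated must carry compliance_status PASS")
    return record


sample = (
    '{"event": "record_validated", "bl_number": "MAEU588912347", '
    '"method": "regex_fallback", "compliance_status": "PASS", '
    '"level": "info", "timestamp": "2026-07-03T09:12:44Z"}'
)
print(check_audit_line(sample)["bl_number"])
```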

Frequently Asked Questions

When should the grid strategy give way to the regex fallback?

Gate the handoff on emptiness, not on an exception. Try vertical_strategy="lines" first because it is the most precise on ruled tables, then retry with text strategies for borderless layouts. Only when both return zero rows — merged cells, a rotated page, or a scanned image — invoke the lexicon. The regex path is the recovery layer, never the default, because label-anchored patterns are looser than true column geometry and should not override a clean grid.

Why call flush_cache instead of gc.collect in the streaming loop?

pdfplumber caches parsed objects per page, and page.flush_cache() releases exactly that cache the moment a page is done, keeping the footprint constant across a thousand-page manifest. A blanket gc.collect() is far more expensive, runs a full-heap sweep on every iteration, and does not target the library’s own cache — so it costs throughput without reliably bounding the specific growth you care about.

Should a regex-recovered row be trusted as much as a grid-extracted one?

Treat it as recoverable, not authoritative. The extraction_method field tags every row’s provenance so downstream consumers and the Schema Validation Frameworks tier can apply stricter semantic checks — container check-digit, UN/LOCODE resolution, VGM plausibility — to regex_fallback rows before they reach a terminal operating system. The tag is what lets you keep the row moving while still auditing exactly how it was won.
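The container check-digit test mentioned above can be sketched with the standard library alone. The letter values and the weighting by powers of two follow ISO 6346. The confusable repair (0/O and 1/I swapped according to position) is a heuristic added here as an assumption, and a repaired id that passes the check digit is still only a candidate to reconcile, not a confirmed box. All function names are hypothetical.

```python
import re
import string


def _letter_values() -> dict[str, int]:
    """ISO 6346 letter values: A=10 upward, skipping multiples of 11."""
    values, n = {}, 10
    for ch in string.ascii_uppercase:
        if n % 11 == 0:
            n += 1
        values[ch] = n
        n += 1
    return values


LETTER_VALUES = _letter_values()


def iso6346_check_digit(first_ten: str) -> int:
    """Compute the check digit from the owner prefix and six serial digits."""
    total = 0
    for position, ch in enumerate(first_ten):
        value = int(ch) if ch.isdigit() else LETTER_VALUES[ch]
        total += value * (2 ** position)
    return total % 11 % 10   # a remainder of 10 maps to check digit 0


def is_valid_container_id(container_no: str) -> bool:
    """True when the id has the ISO 6346 shape and a matching check digit."""
    if not re.fullmatch(r"[A-Z]{4}\d{7}", container_no):
        return False
    return iso6346_check_digit(container_no[:10]) == int(container_no[10])


def repair_ocr_confusables(raw: str) -> str:
    """Heuristic: letters belong in the first four positions, digits after."""
    raw = raw.replace(" ", "").upper()
    prefix = raw[:4].translate(str.maketrans("01", "OI"))
    serial = raw[4:].translate(str.maketrans("OI", "01"))
    return prefix + serial


candidate = repair_ocr_confusables("MSKO123456I")
print(candidate, is_valid_container_id(candidate))
```

Running the check on regex_fallback rows first keeps the cheap arithmetic test ahead of any registry lookup.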

Related

PDF Bill of Lading Extraction — the full ingestion-to-routing pipeline this task sits inside
Bill of Lading Schema Mapping — the shared typed record the extracted rows are coerced onto
IFCSUM EDI Message Parsing — aggregated manifest reconciliation for the parsed line items
Async Batch Processing Pipelines — the broker and worker pool that streams PDFs into this extractor
Schema Validation Frameworks — the semantic and regulatory tier that gates extracted rows

