Extracting B/L tables with pdfplumber and regex
In high-throughput terminal environments, the deterministic parsing of shipping documentation remains a critical operational dependency. Carrier-specific layout variations, legacy OCR artifacts, and inconsistent column alignment routinely break rigid extraction pipelines. When engineering resilient Document Ingestion & EDI Parsing Workflows, automation teams must balance structural analysis with pattern-driven fallbacks. Extracting B/L tables with pdfplumber and regex delivers a low-overhead, highly auditable approach that satisfies strict maritime compliance mandates while maintaining real-world port SLAs.
Adaptive Boundary Detection & Structural Fallbacks
flowchart LR
    A["PDF page"] --> B["pdfplumber<br/>grid table extraction"]
    B -->|tables found| R[("Rows")]
    B -->|degraded / borderless| C["Regex fallback<br/>maritime lexicon"]
    C -->|fields matched| R
    C -->|no match| S["Skip · log gap"]
Bill of Lading documents exhibit severe format drift across issuing lines, terminal operators, and sequential print runs. While pdfplumber excels at grid-based table detection, it frequently misaligns when carriers use whitespace-separated columns, merged cells, or rotated freight manifests. The operational fix requires dynamic threshold tuning rather than static configuration.
Adjusting vertical_strategy, horizontal_strategy, and the tolerance parameters based on document metadata allows the parser to adapt to borderless layouts, for example by switching both strategies from "lines" to "text" when a page has no ruling lines. When structural extraction degrades below a confidence threshold, regex fallbacks anchored to a maritime lexicon recover orphaned fields. This dual-layer strategy prevents queue blockage during peak vessel turnaround windows and aligns with modern PDF Bill of Lading Extraction architectures.
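The metadata-driven tuning described above can be sketched as a small settings selector. This is a minimal illustration, not the article's implementation: the function name choose_table_settings, the scanned flag, the MIN_RULING_LINES cutoff, and all tolerance values are assumptions to be tuned per carrier template. The dictionary keys are pdfplumber table-settings names.

```python
from typing import Any, Dict

# Baseline for ruled (bordered) tables, using the lines-based strategies.
RULED_SETTINGS: Dict[str, Any] = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "snap_tolerance": 3,
    "intersection_x_tolerance": 5,
    "intersection_y_tolerance": 5,
}

# Borderless layouts have no ruling lines, so cell edges are inferred
# from word alignment instead. Tolerance values here are illustrative.
BORDERLESS_SETTINGS: Dict[str, Any] = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
    "snap_tolerance": 3,
    "text_x_tolerance": 2,
    "text_y_tolerance": 2,
    "min_words_vertical": 3,
}

# Assumed cutoff: pages with fewer ruling lines are treated as borderless.
MIN_RULING_LINES = 4


def choose_table_settings(line_count: int, scanned: bool = False) -> Dict[str, Any]:
    """Pick table settings from simple page metadata.

    line_count: number of ruling lines found on the page, for example
                len(page.lines) + len(page.rects) on a pdfplumber page.
    scanned:    hypothetical flag set upstream for OCR-derived pages.
    """
    base = RULED_SETTINGS if line_count >= MIN_RULING_LINES else BORDERLESS_SETTINGS
    settings = dict(base)  # copy so callers cannot mutate the shared defaults
    if scanned:
        # OCR output jitters; loosen snapping so near-aligned edges merge.
        settings["snap_tolerance"] = 6
    return settings
```

The returned dictionary can be passed as table_settings to page.extract_tables(), so the choice of strategy stays in one auditable place instead of being scattered across carrier-specific branches.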
import re
import json
import logging
import traceback
from typing import Iterator, Dict, Any, Optional, List
from dataclasses import dataclass, asdict
from pathlib import Path

import pdfplumber

# Structured logging configuration for audit compliance
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s | %(levelname)s | %(message)s',
    datefmt='%Y-%m-%dT%H:%M:%S%z'
)
logger = logging.getLogger("bl_table_extractor")


@dataclass
class BLTableRow:
    bl_number: str
    consignee: str
    container_no: str
    gross_weight_kg: Optional[float]
    page_index: int
    extraction_method: str  # "table" | "regex_fallback"


# Maritime lexicon patterns with OCR noise tolerance.
# The B/L pattern accepts an optional "NO." / "NUMBER" token after the label,
# so that "B/L NO. X" and "BILL OF LADING NO: X" capture X rather than failing.
LEXICON_PATTERNS = {
    "bl_number": re.compile(r"(?:(?:B/L|BILL\sOF\sLADING)(?:\s*(?:NO\b\.?|NUMBER\b))?|BL\sNO\.?)\s*[:.#]?\s*([A-Z0-9]{4,12})", re.IGNORECASE),
    "consignee": re.compile(r"(?:CONSIGNEE|TO\sORDER\sOF)\s*[:.]?\s*([A-Z\s&.,'-]{5,80})", re.IGNORECASE),
    "container_no": re.compile(r"(?:CONTAINER|CNTR|CNO)\s*[:.]?\s*([A-Z]{4}\d{7})", re.IGNORECASE),
    "gross_weight": re.compile(r"(?:GROSS\sWEIGHT|G\.?W\.?|KG)\s*[:.]?\s*([\d,]+\.?\d*)", re.IGNORECASE)
}


def normalize_ocr_text(raw_text: str) -> str:
    """Collapse runs of whitespace and control characters (line breaks included) into single spaces."""
    return re.sub(r'[\s\x00-\x1f\x7f-\x9f]+', ' ', raw_text).strip()


def extract_table_with_fallback(page: pdfplumber.page.Page, page_idx: int) -> List[Dict[str, Any]]:
    """Attempt grid extraction; fall back to regex when no usable table rows are found."""
    rows = []
    # Adaptive table extraction parameters
    table_settings = {
        "vertical_strategy": "lines",
        "horizontal_strategy": "lines",
        "intersection_y_tolerance": 5,
        "intersection_x_tolerance": 5,
        "snap_tolerance": 3
    }
    try:
        tables = page.extract_tables(table_settings=table_settings)
        for table in tables or []:
            for row in table:
                # Skip rows that are missing or contain only empty cells
                if row and any(row):
                    # Basic structural mapping (carrier-dependent)
                    cleaned = [normalize_ocr_text(str(c)) if c else "" for c in row]
                    rows.append({
                        "raw": cleaned,
                        "method": "table",
                        "page": page_idx
                    })
        # Return only when the grid pass produced rows; otherwise fall through
        if rows:
            return rows
    except Exception as e:
        logger.warning(json.dumps({
            "event": "table_extraction_failed",
            "error": str(e),
            "page": page_idx,
            "fallback": "regex"
        }))
    # Regex fallback for borderless/merged layouts
    raw_text = normalize_ocr_text(page.extract_text() or "")
    matches = {}
    for field, pattern in LEXICON_PATTERNS.items():
        match = pattern.search(raw_text)
        matches[field] = match.group(1).strip() if match else ""
    if any(matches.values()):
        rows.append({
            "raw": matches,
            "method": "regex_fallback",
            "page": page_idx
        })
    return rows
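The text refers to a confidence threshold for structural extraction, and one simple way to approximate it is a fill-ratio heuristic over the extracted cells. The sketch below is an assumption-laden illustration: table_confidence, needs_regex_fallback, the min_columns guard, and the 0.6 threshold are hypothetical names and values, not part of pdfplumber or of the pipeline above.

```python
from typing import List, Optional, Sequence

# A table as returned by grid extraction: rows of optional cell strings.
Table = Sequence[Sequence[Optional[str]]]


def table_confidence(table: Table, min_columns: int = 3) -> float:
    """Heuristic score in [0, 1]: the share of cells that contain text.

    Tables narrower than min_columns score 0.0, on the assumption that a
    B/L cargo table needs at least a few columns to be usable.
    """
    rows = [row for row in table if row]
    if not rows:
        return 0.0
    if max(len(row) for row in rows) < min_columns:
        return 0.0
    total_cells = sum(len(row) for row in rows)
    filled = sum(1 for row in rows for cell in row if cell and cell.strip())
    return filled / total_cells


def needs_regex_fallback(tables: List[Table], threshold: float = 0.6) -> bool:
    """Return True when no extracted table reaches the assumed threshold."""
    if not tables:
        return True
    return max(table_confidence(t) for t in tables) < threshold
```

A caller could evaluate needs_regex_fallback(tables) right after the grid pass and run the lexicon patterns only when it returns True. The threshold should be calibrated per carrier against known-good documents rather than taken from this sketch.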
Stream-Based Memory Management for High-Volume Ingestion
Processing multi-page freight manifests at scale introduces heap exhaustion risks that crash worker containers and violate terminal processing SLAs. Loading entire PDF binaries into memory, especially when paired with OCR preprocessing, creates unsustainable bottlenecks. The solution lies in stream-based page iteration combined with Python generator patterns.
By opening the file with pdfplumber.open() as a context manager and yielding parsed rows incrementally, engineers keep the memory footprint roughly flat as page counts grow, because only the current page's rows are held at once. Explicit garbage collection between pages, tracemalloc profiling in staging environments, and page-level chunking help keep heap usage under a budget such as 200 MB during concurrent ingestion bursts, although the achievable figure depends on page complexity and worker concurrency. This approach integrates cleanly with Async Batch Processing Pipelines, allowing terminal operators to process thousands of manifests without blocking vessel turnaround operations.
from pdfminer.pdfparser import PDFSyntaxError
import gc


def stream_bl_extraction(pdf_path: Path) -> Iterator[Dict[str, Any]]:
    """Generator-based page iteration to enforce constant memory footprint.

    pdfplumber pages are owned by the PDF context manager; they are released when
    the context exits. Within the loop, explicit gc.collect() bounds per-page
    object accumulation between iterations.
    """
    if not pdf_path.exists():
        logger.error(json.dumps({"event": "file_not_found", "path": str(pdf_path)}))
        return
    try:
        with pdfplumber.open(str(pdf_path)) as pdf:
            for page_idx, page in enumerate(pdf.pages):
                # Page-level isolation: extract and immediately yield
                extracted = extract_table_with_fallback(page, page_idx)
                for row in extracted:
                    yield row
                # Release accumulated per-page objects between iterations.
                # pdfplumber caches page objects lazily; collect here to bound
                # heap growth across large manifests.
                gc.collect()
    except PDFSyntaxError as e:
        logger.error(json.dumps({"event": "pdf_corrupt", "path": str(pdf_path), "error": str(e)}))
    except Exception:
        logger.critical(json.dumps({"event": "stream_interrupted", "path": str(pdf_path), "trace": traceback.format_exc()}))
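The tracemalloc profiling mentioned earlier can be sketched with the standard library alone. This is a minimal harness under stated assumptions: profile_peak_memory and fake_rows are hypothetical helpers, and fake_rows only stands in for a real row generator so the comparison runs without a PDF.

```python
import tracemalloc
from typing import Callable, Iterable, Iterator, Tuple


def profile_peak_memory(make_rows: Callable[[], Iterable[object]]) -> Tuple[int, int]:
    """Consume an iterable of rows and return (row_count, peak_bytes).

    peak_bytes is the peak size of Python allocations traced between
    tracemalloc.start() and the end of iteration.
    """
    tracemalloc.start()
    try:
        count = 0
        for _ in make_rows():
            count += 1  # each row is dropped as soon as it has been counted
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return count, peak


def fake_rows(n: int) -> Iterator[dict]:
    """Stand-in generator; in staging, swap in the real extraction generator."""
    for i in range(n):
        yield {"raw": [str(i) * 200], "method": "table", "page": i}


if __name__ == "__main__":
    # Streaming keeps one row alive at a time; list() materializes all of them.
    streamed = profile_peak_memory(lambda: fake_rows(5_000))
    materialized = profile_peak_memory(lambda: list(fake_rows(5_000)))
    print("streamed rows/peak bytes:", streamed)
    print("materialized rows/peak bytes:", materialized)
```

Note that tracemalloc only sees allocations made through Python's memory allocator, so memory held by native libraries is not included. Container-level RSS monitoring remains the final arbiter for a heap budget.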
Regex Calibration & Compliance Gating
Regex precision directly dictates compliance gating success. Port authorities and customs brokers require field-level accuracy exceeding 99.5% to avoid EDI rejection penalties and demurrage disputes. Threshold tuning involves calibrating regex boundaries to ignore scanning noise while strictly capturing alphanumeric freight identifiers.
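Calibration of this kind is easier to reason about when each pattern is scored against hand-labeled samples. The sketch below is illustrative only: CONTAINER_RE is a hypothetical stricter variant with word boundaries, the LABELED samples are invented for demonstration, and the 0.995 gate simply reuses the 99.5% figure quoted above.

```python
import re
from typing import Iterable, Optional, Pattern, Tuple

# Hypothetical container-number pattern. The trailing \b rejects identifiers
# that have an extra character glued on by OCR instead of truncating them.
CONTAINER_RE = re.compile(
    r"\b(?:CONTAINER|CNTR)\s*(?:NO\.?)?\s*[:.]?\s*([A-Z]{4}\d{7})\b",
    re.IGNORECASE,
)


def extract(pattern: Pattern[str], text: str) -> Optional[str]:
    """Return the upper-cased first capture group, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1).upper() if match else None


def field_accuracy(pattern: Pattern[str],
                   labeled: Iterable[Tuple[str, Optional[str]]]) -> float:
    """Share of samples where the extracted value equals the expected label."""
    samples = list(labeled)
    if not samples:
        raise ValueError("at least one labeled sample is required")
    correct = sum(1 for text, expected in samples
                  if extract(pattern, text) == expected)
    return correct / len(samples)


# Invented samples: (raw text, expected value or None if nothing should match).
LABELED = [
    ("CONTAINER NO: CSQU3054383", "CSQU3054383"),
    ("cntr csqu3054383 20GP", "CSQU3054383"),
    ("CNTR: CSQU30543831", None),             # extra digit must not match
    ("SEAL NO 0012345", None),                # unrelated label
    ("CNTR: CSQU 305438 3", "CSQU3054383"),   # split by OCR; pattern misses it
]

ACCURACY_GATE = 0.995  # assumed gate, mirroring the 99.5% target in the text


def passes_gate(pattern: Pattern[str]) -> bool:
    return field_accuracy(pattern, LABELED) >= ACCURACY_GATE
```

On these samples the pattern fails the gate because of the OCR-split identifier, which is the kind of miss a calibration run is meant to surface before a pattern reaches production. A realistic sample set would be drawn from each carrier's actual documents and be far larger than five rows.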
Compliance programs built around the IMO FAL Convention and the SOLAS VGM requirements typically call for immutable audit trails, PII minimization, and strict schema adherence. Implementing Schema Validation Frameworks alongside explicit error categorization ensures malformed records are quarantined rather than propagated to downstream IFCSUM EDI Message Parsing systems.
def validate_and_gate(row: Dict[str, Any]) -> Optional[BLTableRow]:
    """Apply schema validation, compliance gating, and structured audit logging."""
    raw = row.get("raw", {})
    method = row.get("method", "unknown")
    page = row.get("page", 0)
    if not isinstance(raw, dict):
        # Table rows carry positional cell lists. They need a carrier-specific
        # column mapping before gating, so reject them here instead of crashing.
        logger.warning(json.dumps({
            "event": "compliance_gate_failed",
            "reason": "unmapped_table_row",
            "page": page,
            "method": method
        }))
        return None
    # Compliance gating: mandatory fields per customs/terminal SLA
    bl_num = raw.get("bl_number", "").upper().replace(" ", "")
    container = raw.get("container_no", "").upper()
    weight_str = raw.get("gross_weight", "").replace(",", "")
    if not bl_num or not re.match(r"^[A-Z0-9]{4,12}$", bl_num):
        logger.warning(json.dumps({
            "event": "compliance_gate_failed",
            "reason": "invalid_bl_format",
            "page": page,
            "method": method
        }))
        return None
    try:
        weight_kg = float(weight_str) if weight_str else None
    except ValueError:
        # An unparseable weight is recorded as missing rather than failing the row
        weight_kg = None
    record = BLTableRow(
        bl_number=bl_num,
        consignee=raw.get("consignee", "").strip(),
        container_no=container,
        gross_weight_kg=weight_kg,
        page_index=page,
        extraction_method=method
    )
    # Structured audit emission for regulatory retention (7-year minimum)
    logger.info(json.dumps({
        "event": "record_validated",
        "bl_number": bl_num,
        "method": method,
        "compliance_status": "PASS"
    }))
    return record
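The explicit error categorization and quarantine step described above can be sketched as follows. This is one possible shape under stated assumptions: GateError, categorize, and gate are hypothetical names. The ISO 6346 check-digit test is an additional gate that the code above does not apply, shown here because it catches single-character OCR errors in container numbers.

```python
import re
import string
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class GateError(str, Enum):
    INVALID_BL_FORMAT = "invalid_bl_format"
    INVALID_CONTAINER = "invalid_container_check_digit"
    UNPARSEABLE_WEIGHT = "unparseable_weight"


def _letter_values() -> Dict[str, int]:
    """ISO 6346 letter values: start at A=10 and skip multiples of 11."""
    values, n = {}, 10
    for letter in string.ascii_uppercase:
        if n % 11 == 0:
            n += 1
        values[letter] = n
        n += 1
    return values


LETTER_VALUES = _letter_values()


def iso6346_check_digit_ok(container_no: str) -> bool:
    """Verify the check digit of an 11-character container number."""
    if not re.fullmatch(r"[A-Z]{4}\d{7}", container_no):
        return False
    total = 0
    for position, char in enumerate(container_no[:10]):
        value = int(char) if char.isdigit() else LETTER_VALUES[char]
        total += value * (2 ** position)
    # A remainder of 10 maps to check digit 0.
    return total % 11 % 10 == int(container_no[10])


def categorize(raw: Dict[str, str]) -> List[GateError]:
    """Return every gate violation found in one regex-fallback record."""
    errors: List[GateError] = []
    if not re.fullmatch(r"[A-Z0-9]{4,12}", raw.get("bl_number", "").upper()):
        errors.append(GateError.INVALID_BL_FORMAT)
    container = raw.get("container_no", "").upper()
    if container and not iso6346_check_digit_ok(container):
        errors.append(GateError.INVALID_CONTAINER)
    weight = raw.get("gross_weight", "").replace(",", "")
    if weight:
        try:
            float(weight)
        except ValueError:
            errors.append(GateError.UNPARSEABLE_WEIGHT)
    return errors


@dataclass
class GateResult:
    accepted: List[dict] = field(default_factory=list)
    quarantined: List[dict] = field(default_factory=list)


def gate(rows: List[Dict[str, str]]) -> GateResult:
    """Split records into accepted and quarantined, keeping the reasons."""
    result = GateResult()
    for raw in rows:
        errors = categorize(raw)
        if errors:
            result.quarantined.append({"raw": raw, "errors": [e.value for e in errors]})
        else:
            result.accepted.append(raw)
    return result
```

Collecting every violation per record, rather than stopping at the first, gives operators a complete reason list when they review the quarantine queue. Whether a failed check digit should quarantine a record or only flag it is a policy decision that this sketch does not settle.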
Operational Integration & Auditability
Production deployment requires deterministic routing, idempotent retries, and strict separation of extraction logic from downstream EDI translation. When integrating with terminal operating systems (TOS), extraction outputs should be serialized to JSON or Avro, hashed for integrity verification, and routed through a message broker with dead-letter queue (DLQ) isolation.
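The serialization, hashing, and dead-letter routing described above can be sketched with the standard library. Everything here is illustrative: build_envelope, require_bl_number, and InMemoryBroker are hypothetical stand-ins, and a real deployment would use its broker's own client and DLQ configuration. The sketch uses JSON; Avro would need a schema and a third-party library.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List, Set


def build_envelope(record: Dict[str, Any]) -> Dict[str, str]:
    """Serialize to canonical JSON and attach a SHA-256 integrity hash.

    Sorted keys and fixed separators make the bytes, and therefore the
    hash, independent of dictionary insertion order.
    """
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"),
                         ensure_ascii=False)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"sha256": digest, "payload": payload}


def require_bl_number(record: Dict[str, Any]) -> None:
    """Example validator: raise ValueError when the B/L number is missing."""
    if not record.get("bl_number"):
        raise ValueError("missing bl_number")


class InMemoryBroker:
    """Hypothetical stand-in for a message broker with a dead-letter queue."""

    def __init__(self) -> None:
        self.main: List[Dict[str, str]] = []
        self.dead_letter: List[Dict[str, str]] = []
        self._seen: Set[str] = set()

    def publish(self, envelope: Dict[str, str],
                validate: Callable[[Dict[str, Any]], None]) -> str:
        key = envelope["sha256"]
        # The content hash doubles as an idempotency key, so a retried
        # publish of the same record is dropped instead of duplicated.
        if key in self._seen:
            return "duplicate"
        self._seen.add(key)
        try:
            validate(json.loads(envelope["payload"]))
        except ValueError as exc:
            # Malformed records are isolated with the reason attached.
            self.dead_letter.append({**envelope, "error": str(exc)})
            return "dead_letter"
        self.main.append(envelope)
        return "published"
```

Using the content hash as the idempotency key means a retry after a worker crash cannot enqueue the same record twice. It also means two genuinely identical records collapse into one, so a pipeline that must preserve duplicates would need to include a source identifier such as file path and page index in the hashed payload.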
The combination of adaptive table boundaries, stream-based memory management, and regex-calibrated compliance gating creates a resilient foundation for maritime document automation. Engineers should monitor extraction drift metrics, rotate regex dictionaries quarterly against carrier manifest updates, and enforce strict data minimization policies to align with GDPR and port authority data sovereignty requirements. For implementation details on logging configuration and library versioning, consult the official Python logging documentation and pdfplumber API reference.