Deep dive · 04 / 04

Architecture.

Three rules, applied consistently across 12 operators, plus a design note for each operator.

Rule 1 — additive overlays, not rewrites

When we edit, stamp, watermark, or sign a PDF, we do not rewrite the source content stream. We add a new layer on top. That means:

Source pixels are preserved — if a file renders correctly before the operator, it renders correctly after.
Signatures applied by upstream tools stay valid (we are not touching what they signed).
Operators compose: you can Bates, then watermark, then sign, and each layer is recoverable.

Redaction is the one place we deliberately break this rule: we purge the glyphs under each redacted region from the content stream, because covering with a rectangle is not a real redaction. Everything else is additive.
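To make the difference between covering and purging concrete, here is a minimal sketch. It assumes a hypothetical in-memory model in which each glyph carries a page number and a bounding box; the names Box, Glyph, and purge are illustrative and are not the product's API. Redaction drops every glyph that intersects a region instead of drawing over it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

    def intersects(self, other: "Box") -> bool:
        # Strict inequalities: boxes that only share an edge do not intersect.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


@dataclass(frozen=True)
class Glyph:  # hypothetical record, for illustration only
    page: int
    char: str
    box: Box


def purge(glyphs: List[Glyph], regions: List[Tuple[int, Box]]) -> List[Glyph]:
    """Return only the glyphs that fall outside every (page, box) region."""
    return [
        g for g in glyphs
        if not any(g.page == page and g.box.intersects(box)
                   for page, box in regions)
    ]
```

Because the purged glyphs are no longer in the returned list, nothing downstream (text extraction, copy and paste, search) can recover them, which is the property a painted rectangle does not give you.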

Rule 2 — regex plus checksum for PII

PII detection is rules + validation. Regex finds candidate strings; a checksum or format validator confirms. Credit-card numbers run through Luhn; routing numbers run through ABA's checksum; SSNs are bounded by their historical block allocations.

No LLM is in this path. The rules are auditable and the false-positive modes are enumerable. On the 60-test seeded corpus the detector matched 28 / 28 planted items. Adding a new type is a rule + validator pair in a config file.
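The regex-then-validator pattern can be sketched as follows. This is an illustration under stated assumptions, not the shipped rule file: the patterns, the RULES table, and the find_pii name are hypothetical, and only the card and routing-number validators are shown (SSN block checks are omitted for brevity).

```python
import re
from typing import Callable, List, Pattern, Tuple


def luhn_ok(digits: str) -> bool:
    """Luhn check: double every second digit from the right, sum, mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def aba_ok(digits: str) -> bool:
    """ABA routing checksum: weights 3, 7, 1 repeating; sum divisible by 10."""
    if len(digits) != 9:
        return False
    weights = (3, 7, 1) * 3
    return sum(int(d) * w for d, w in zip(digits, weights)) % 10 == 0


# Hypothetical rule table: (type name, candidate regex, validator).
RULES: List[Tuple[str, Pattern[str], Callable[[str], bool]]] = [
    ("credit_card", re.compile(r"\b\d(?:[ -]?\d){12,18}\b"), luhn_ok),
    ("aba_routing", re.compile(r"\b\d{9}\b"), aba_ok),
]


def find_pii(text: str) -> List[Tuple[str, str]]:
    """Regex proposes candidates; the validator confirms or rejects each one."""
    hits = []
    for name, pattern, validate in RULES:
        for m in pattern.finditer(text):
            digits = re.sub(r"\D", "", m.group())
            if validate(digits):
                hits.append((name, m.group()))
    return hits
```

In this sketch a new type is one more tuple in RULES, which mirrors the rule + validator pairing described above. A string that looks like a card number but fails Luhn is never reported.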

Rule 3 — thin operator chains

Every capability is a small operator with a typed input, a typed output, a pre-registered quality config, and a test battery. Capabilities compose by chaining operators — PII detection feeds redact, bulk-fill feeds Bates, classifier feeds batch-rename. No hidden orchestration layer, no inference-time prompt composition.
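As a rough illustration of chaining, the sketch below models an operator as a named, typed function and composes two of them. The Operator class, its then method, and the toy detect and redact functions over plain strings are all assumptions made for the example; the real operators work on PDFs and also carry a quality config and a test battery, which are left out here.

```python
import re
from dataclasses import dataclass
from typing import Callable, Generic, List, Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")


@dataclass(frozen=True)
class Operator(Generic[A, B]):  # hypothetical wrapper, for illustration
    name: str
    run: Callable[[A], B]

    def then(self, nxt: "Operator[B, C]") -> "Operator[A, C]":
        # The chain is itself an operator; there is no orchestration layer.
        return Operator(f"{self.name}>{nxt.name}",
                        lambda x: nxt.run(self.run(x)))


Span = Tuple[int, int]
Detected = Tuple[str, List[Span]]


def _detect(text: str) -> Detected:
    # Toy detector: SSN-shaped strings only.
    pattern = r"\b\d{3}-\d{2}-\d{4}\b"
    return text, [m.span() for m in re.finditer(pattern, text)]


def _redact(found: Detected) -> str:
    text, spans = found
    # Work right to left so earlier offsets stay valid.
    for start, end in reversed(spans):
        text = text[:start] + "X" * (end - start) + text[end:]
    return text


detect_pii = Operator("detect_pii", _detect)
redact = Operator("redact", _redact)
pipeline = detect_pii.then(redact)
```

The output type of the first operator is the input type of the second, so a type checker can reject a chain that does not line up before anything runs.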

Operator-level design notes

PDF parse, edit, redact

Parse returns structured JSON: blocks, reading order, equations, tables, figures, citations, domain tag. Edit re-renders from a modified parse result using the overlay path; the source content stream is preserved. Redact takes a list of page + bounding-box pairs and purges matching glyphs. Fidelity score: 0.96 on 25+ tests with a 0.95 bar.

OCR

Scanned vs born-digital is detected automatically. For scanned pages we write a text layer over the rendered page and leave the visuals alone. Backends are swappable (OcrMac, EasyOCR, Tesseract, handwriting models). Fidelity is measured on three dimensions: text (0.99), layout (0.81), handwriting (1.00 on the synthetic set). 18+ tests.

Format conversion

PDF ↔ Markdown ↔ HTML ↔ DOCX ↔ CSV. Round-trip tests compare structural content (headings, tables, lists) after a forward-then-reverse conversion. A config flag turns on a round-trip self-test so you can validate a single document before you trust the pipeline.
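One way to picture the round-trip self-test is a structural fingerprint compared before and after a forward-then-reverse conversion. The sketch below assumes Markdown as the starting format and takes the two converters as plain callables; structure, round_trip_ok, and the regexes are illustrative choices, not the product's actual comparison.

```python
import re
from typing import Callable, Dict


def structure(md: str) -> Dict[str, object]:
    """Fingerprint headings, list items, and pipe-table lines in Markdown."""
    headings = [
        (len(m.group(1)), m.group(2).strip())
        for m in re.finditer(r"^(#{1,6})[ \t]+(.*)$", md, re.M)
    ]
    list_items = len(re.findall(r"^[ \t]*(?:[-*+]|\d+\.)[ \t]+", md, re.M))
    table_lines = len(re.findall(r"^[ \t]*\|.*\|[ \t]*$", md, re.M))
    return {
        "headings": headings,
        "list_items": list_items,
        "table_lines": table_lines,
    }


def round_trip_ok(
    md: str,
    forward: Callable[[str], str],   # e.g. Markdown -> HTML (hypothetical)
    reverse: Callable[[str], str],   # e.g. HTML -> Markdown (hypothetical)
) -> bool:
    """True when the structure survives forward-then-reverse conversion."""
    return structure(reverse(forward(md))) == structure(md)
```

Comparing structure rather than bytes tolerates harmless differences such as whitespace while still catching a dropped heading or a lost table row.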

Tables and cross-page joining

The table extractor returns each table as a CSV with table_id and row_id columns so you can join on row. The cross-page joiner merges tables that span page breaks using a heuristic on header repetition, column-type stability, and continuation markers. Scored 1.00 on a 27-test synthetic corpus at bar 0.90; real-world drift is expected and noted as a known limitation.
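A simplified sketch of the joining heuristic follows. It covers two of the three signals named above (header repetition and column-type stability) and leaves continuation markers out. The table representation (a list of rows, header first) and the function names are assumptions for illustration; a production joiner would need more evidence than this before merging.

```python
from typing import List, Optional, Tuple

Table = List[List[str]]


def cell_type(value: str) -> str:
    """Classify a cell as 'num' or 'text' (commas allowed in numbers)."""
    try:
        float(value.replace(",", ""))
        return "num"
    except ValueError:
        return "text"


def row_types(row: List[str]) -> Tuple[str, ...]:
    return tuple(cell_type(c) for c in row)


def join_if_continuation(prev: Table, nxt: Table) -> Optional[Table]:
    """Merge nxt into prev if it looks like a continuation, else None."""
    if not prev or not nxt or len(prev[0]) != len(nxt[0]):
        return None
    header, body = prev[0], prev[1:]
    # Signal 1: the header row is repeated at the top of the next page.
    if nxt[0] == header:
        return prev + nxt[1:]
    # Signal 2: no repeated header, but the first row on the next page has
    # the same column types as the last data row on the previous page.
    if body and row_types(body[-1]) == row_types(nxt[0]):
        return prev + nxt
    return None
```

The second signal is deliberately weak on its own: two unrelated tables with the same column shapes would pass it, which is one source of the real-world drift noted above.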

Bates numbering

Sequential stamping with configurable prefix, start number, and padding width. Output includes the numbering metadata so you can keep a production log. Scored 1.000 across 37 tests at bar 1.00 on every dimension.
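The label sequence and its metadata can be sketched in a few lines. BatesConfig, bates_labels, and the metadata keys below are hypothetical names chosen for the example; the actual field names in the operator's output are not specified here.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class BatesConfig:  # hypothetical config shape
    prefix: str = ""
    start: int = 1
    width: int = 6  # zero-padding width of the numeric part


def bates_labels(page_count: int, cfg: BatesConfig) -> List[str]:
    """One label per page, sequential from cfg.start, zero-padded."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    return [
        f"{cfg.prefix}{n:0{cfg.width}d}"
        for n in range(cfg.start, cfg.start + page_count)
    ]


def numbering_metadata(
    page_count: int, cfg: BatesConfig
) -> Dict[str, Union[str, int]]:
    """Summary suitable for a production log."""
    labels = bates_labels(page_count, cfg)
    return {
        "first": labels[0],
        "last": labels[-1],
        "pages": page_count,
        "next_start": cfg.start + page_count,
    }
```

Carrying next_start in the metadata lets the next document in a production pick up where this one ended without gaps or overlaps.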

Bulk form fill

Two modes. AcroForm mode fills named fields from CSV columns. Coordinate mode overlays text at configured x,y positions on non-form templates. One row in → one PDF out. Scored 1.000 across 25 tests.

PII detection

Regex rules + Luhn / ABA / SSN-block validators. Returns page + bounding-box + confidence for each match so output feeds directly into the redact endpoint. 28 / 28 seeded matches on the 60-test suite. Zero LLM in this path.

Digital signatures

Two modes: native tamper-detection signing (Ed25519 + SHA-256, 5 credits) and PAdES (B-B and B-T profiles, 15 credits, bring-your-own-cert). See signatures for the trade-off.

Password operations

AES-256 encrypt and decrypt, owner and user password setting, permissions flags (print, copy, modify, annotate). Scored 1.000 across 23 tests and 8 dimensions at bar 1.00.

Overlay annotations

Headers, footers, and watermarks as a separate layer. Positioned by page, rotated, tiled, or single-stamped. Source stream untouched. Scored 1.000 across 51 tests and 9 dimensions.

Compression + PDF/A

Lossless image recompression with optional Ghostscript-backed conversion to PDF/A-2b for archival compliance. Returns before / after byte size. Scored 1.00 across 24 tests on 6 dimensions; the two PDF/A dimensions that require a Ghostscript binary are skipped on machines without it.

Page classifier + batch rename

Drop in a mixed folder of invoices, receipts, contracts, and statements. The classifier labels each page, and the operator splits and renames the output using a template you supply (e.g. {vendor}-{date}-{type}.pdf). Clean confusion matrix on a 14-document holdout; scored 1.000 across 53 tests at bar 0.85.
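Rendering a rename template like the one above is mostly string formatting plus filename hygiene. The sketch below is an assumption about how such a template could be filled; render_name and the sanitising rule are illustrative, and the extracted field values are taken as given.

```python
import re
from typing import Mapping


def render_name(template: str, fields: Mapping[str, object]) -> str:
    """Fill a {vendor}-{date}-{type}.pdf style template with safe values."""
    safe = {
        # Replace anything outside a conservative filename alphabet.
        key: re.sub(r"[^A-Za-z0-9._-]+", "_", str(value)).strip("_")
        for key, value in fields.items()
    }
    # format_map raises KeyError if the template names a missing field,
    # which is preferable to silently writing a half-filled filename.
    return template.format_map(safe)
```

For example, a page classified as an invoice from "Acme Corp" dated 2024-03-01 would render as Acme_Corp-2024-03-01-invoice.pdf under this rule.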

Why operators, not a monolith

Each operator is tested independently, ships independently, and is held back independently when it misses its quality bar. A regression in one operator does not block the others. A new operator joins the catalog by dropping in its config, its battery, and its rule file.

Stack summary

Python core for operators; pypdf, pdfminer, pikepdf, pyhanko, cryptography for the PDF and crypto lifts.
Optional Ghostscript for PDF/A-2b conversion.
Postgres + Redis for job queue and usage accounting.
FastAPI for the HTTP surface; the same core runs inside the CLI and the Python client.