§0 Quick start

Pin a PDF by its flat URL. Bytes are immutable - the same URL returns the same PDF forever.

curl

# fetch a known-corrupt PDF (truncated at xref)
curl -fSL -o sample.pdf \
  https://pdfbin.net/xref-truncated.pdf
# ~700 bytes; your parser should fall back to xref-recovery

Python

import httpx, pytest
from myparser import parse_pdf

def test_xref_recovery():
    pdf = httpx.get("https://pdfbin.net/"
                    "xref-truncated.pdf").content
    with pytest.raises(EOFError):
        parse_pdf(pdf)

JavaScript

const url = "https://pdfbin.net/" +
            "irs-1040-blank.pdf";
const res = await fetch(url);
const buf = await res.arrayBuffer();
// blank IRS 1040, public-domain via irs.gov
renderToCanvas(buf);

Go

resp, err := http.Get("https://pdfbin.net/" +
    "xref-truncated.pdf")
if err != nil { t.Fatal(err) }
defer resp.Body.Close()
// expected: io.ErrUnexpectedEOF from your parser
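
Because the bytes behind a URL never change, a test suite can download each fixture once, cache it, and check a recorded digest on every later run. The following is a minimal sketch of that idea, not something the site provides: `fetch_pinned`, the cache layout, and the injected `download` callable are all hypothetical names, and the expected digest is one you record yourself after the first fetch.

```python
import hashlib
from pathlib import Path
from typing import Callable


def fetch_pinned(name: str, expected_sha256: str, cache_dir: str,
                 download: Callable[[str], bytes]) -> bytes:
    """Return fixture bytes, downloading at most once and verifying the digest.

    `download` is any callable that maps a fixture name to its bytes
    (hypothetical hook; it keeps this sketch free of network code).
    """
    path = Path(cache_dir) / name
    cached = path.exists()
    data = path.read_bytes() if cached else download(name)

    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256:
        raise ValueError(f"{name}: expected sha256 {expected_sha256}, got {digest}")

    if not cached:
        # Write to the cache only after the digest matched,
        # so a bad download is never reused.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return data
```

In a real suite, `download` could be something like `lambda name: httpx.get("https://pdfbin.net/" + name).content`. Injecting it keeps the pinning logic testable offline.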

Catalog

Five views of the same fixture set. A PDF may appear in more than one section - that's the multi-axis catalog at work.
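
One way to picture the multi-axis idea is a facet index: each fixture carries a list of facets, and any facet can be used to select it. The sketch below is illustrative only. The facet strings are copied from the catalog tables, while `FIXTURES`, `index_by_facet`, and `select` are hypothetical names and say nothing about how the site stores its records.

```python
from collections import defaultdict

# A small excerpt of the catalog: fixture ID -> facets, as listed in the tables.
FIXTURES = {
    "fax-cover-letter-scanned-noisy.pdf": ["fax-cover-sheet", "scanned-noisy"],
    "receipt-scanned-noisy-300dpi.pdf": ["receipt", "scanned-noisy"],
    "invoice-letter-clean.pdf": ["invoice"],
    "scanned-clean-200dpi.pdf": ["scanned-clean"],
    "xref-truncated.pdf": ["corrupt-xref-truncated"],
}


def index_by_facet(fixtures):
    """Invert the mapping: facet -> set of fixture IDs carrying that facet."""
    index = defaultdict(set)
    for fixture_id, facets in fixtures.items():
        for facet in facets:
            index[facet].add(fixture_id)
    return index


def select(index, *facets):
    """Return the sorted IDs that carry every one of the requested facets."""
    if not facets:
        return []
    matches = [index.get(facet, set()) for facet in facets]
    return sorted(set.intersection(*matches))


INDEX = index_by_facet(FIXTURES)
```

A fixture with two facets, such as the noisy fax cover sheet, is reachable from either one, which is why it is listed under more than one section.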

§1 By Form Factor

Canonical clean PDFs. Paper sizes (DIN A4, US Letter, JIS B5), page counts (1 to 500), orientations, and byte sizes (approximately 1 MB to 50 MB).

| ID | Description | Facets |
| --- | --- | --- |
| clean-100-pages.pdf | Clean US Letter, 100 portrait pages. | |
| clean-10mb.pdf | Clean US Letter PDF padded to approximately 10 MB via an attached random-bytes file (embedded-file feature). | embedded-file |
| clean-1mb.pdf | Clean US Letter PDF padded to approximately 1 MB via an attached random-bytes file (embedded-file feature). | embedded-file |
| clean-25mb.pdf | Clean US Letter PDF padded to approximately 25 MB via an attached random-bytes file (embedded-file feature). | embedded-file |
| clean-500-pages.pdf | Clean US Letter, 500 portrait pages. | |
| clean-50mb.pdf | Clean US Letter PDF padded to approximately 50 MB via an attached random-bytes file (embedded-file feature). | embedded-file |
| clean-a4-1page.pdf | Clean DIN A4 (210x297mm), single portrait page, body text. | |
| clean-a4-3page.pdf | Clean DIN A4, three portrait pages. | |
| clean-jis-b5-1page.pdf | Clean JIS B5 (182x257mm) - not ISO B5. Single portrait page. | |
| clean-letter-1page.pdf | Clean US Letter (8.5x11in), single portrait page, body text. | |
| clean-letter-3page.pdf | Clean US Letter, three portrait pages. | |
| clean-mixed-orientation.pdf | US Letter, alternating portrait and landscape across 5 pages. | |
| clean-paper-sizes-mixed.pdf | Mixed paper sizes: A4, US Letter, and JIS B5 in one document. | |
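
When asserting on these fixtures, it helps to translate the paper sizes into PDF points (1/72 inch, the default user-space unit). The sketch below converts the dimensions quoted in the table and classifies a page's width and height. `classify_page` and the 1 pt tolerance are assumptions made for illustration, and the ISO B5 dimensions (176x250mm) are added only to show why it does not match JIS B5.

```python
MM_PER_INCH = 25.4
PT_PER_INCH = 72.0


def mm_to_pt(mm: float) -> float:
    """Convert millimetres to PDF points (1 pt = 1/72 inch)."""
    return mm / MM_PER_INCH * PT_PER_INCH


# Portrait dimensions (width, height) in points.
PAPER_SIZES_PT = {
    "A4": (mm_to_pt(210), mm_to_pt(297)),
    "US Letter": (8.5 * PT_PER_INCH, 11 * PT_PER_INCH),
    "JIS B5": (mm_to_pt(182), mm_to_pt(257)),
    "ISO B5": (mm_to_pt(176), mm_to_pt(250)),
}


def classify_page(width_pt: float, height_pt: float, tolerance_pt: float = 1.0):
    """Return (paper name or None, orientation) for a page's width and height."""
    orientation = "portrait" if width_pt <= height_pt else "landscape"
    short_side, long_side = sorted((width_pt, height_pt))
    for name, (w, h) in PAPER_SIZES_PT.items():
        if abs(short_side - w) <= tolerance_pt and abs(long_side - h) <= tolerance_pt:
            return name, orientation
    return None, orientation
```

For a mixed-size or mixed-orientation document, calling `classify_page` per page with each page's MediaBox width and height gives one label per page to compare against the fixture description.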

§2 By Document Shape

PDFs shaped like real-world documents - fax covers, invoices, IRS forms, lab reports, contracts.

| ID | Description | Facets |
| --- | --- | --- |
| bank-statement-letter-clean.pdf | Bank account statement on US Letter. Synthesized. | bank-statement |
| contract-nda-letter-clean.pdf | Mutual non-disclosure agreement on US Letter. Synthesized. | contract-nda |
| fax-cover-letter-clean.pdf | Fax cover sheet on US Letter (To/From/Company/Fax/Pages/Date/Re). Synthesized. | fax-cover-sheet |
| fax-cover-letter-scanned-noisy.pdf | Fax cover sheet rendered as a noisy 300 DPI scan - realistic faxed-document case. | fax-cover-sheet, scanned-noisy |
| invoice-letter-clean.pdf | Invoice on US Letter with line items and totals. Synthesized. | invoice |
| irs-1040-blank.pdf | Blank IRS Form 1040, imported verbatim from irs.gov. US federal work, public domain. | irs-1040 |
| lab-report-letter-clean.pdf | Diagnostics lab report on US Letter. Synthesized. | lab-report |
| receipt-letter-clean.pdf | Coffee-shop receipt on US Letter. Synthesized. | receipt |
| receipt-scanned-noisy-300dpi.pdf | Receipt rendered as a noisy 300 DPI scan - classic crumpled-receipt OCR target. | receipt, scanned-noisy |

§3 By Spec Compliance

PDF spec versions (1.4 / 1.7 / 2.0) and PDF/A conformance levels (1B, 1A, 2B, 3B).

| ID | Description | Facets |
| --- | --- | --- |
| pdf-1.4-clean.pdf | Clean PDF saved targeting spec version 1.4. | |
| pdf-1.7-clean.pdf | Clean PDF saved targeting spec version 1.7. | |
| pdf-2.0-clean.pdf | Clean PDF saved targeting spec version 2.0. | |
| pdfa-1a-compliant.pdf | PDF/A-1A compliant document (accessible / tagged variant of -1B). | PDF/A-1A |
| pdfa-1b-compliant.pdf | PDF/A-1B compliant document (visual appearance preserved). | PDF/A-1B |
| pdfa-2b-compliant.pdf | PDF/A-2B compliant document (PDF 1.7 features allowed). | PDF/A-2B |
| pdfa-3b-with-attachment.pdf | PDF/A-3B with an embedded plain-text attachment - the headline PDF/A-3 feature. | PDF/A-3B, embedded-file |
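
A quick sanity check for the version fixtures is to read the `%PDF-x.y` header comment. The sketch below does only that. `header_version` is a hypothetical helper, and the 1024-byte search window is an assumption based on how lenient readers commonly behave. It ignores the document catalog's `/Version` entry, which can override the header, and it does not validate PDF/A conformance.

```python
import re
from typing import Optional, Tuple

# Matches the header comment, e.g. b"%PDF-1.7" -> major 1, minor 7.
_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def header_version(data: bytes, search_window: int = 1024) -> Optional[Tuple[int, int]]:
    """Return (major, minor) from the %PDF-x.y header, or None if absent.

    Only the first `search_window` bytes are inspected, so leading junk
    before the header is tolerated up to that limit.
    """
    match = _HEADER.search(data[:search_window])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Returning a tuple rather than a float keeps comparisons exact, for example `header_version(data) >= (1, 5)`.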

§4 By Provenance

Digital-native vs scanned variants. Scans vary by DPI, noise, and skew - useful for OCR test suites.

| ID | Description | Facets |
| --- | --- | --- |
| fax-cover-letter-scanned-noisy.pdf | Fax cover sheet rendered as a noisy 300 DPI scan - realistic faxed-document case. | fax-cover-sheet, scanned-noisy |
| receipt-scanned-noisy-300dpi.pdf | Receipt rendered as a noisy 300 DPI scan - classic crumpled-receipt OCR target. | receipt, scanned-noisy |
| scanned-clean-200dpi.pdf | Clean 200 DPI scan of a generic Letter PDF. | scanned-clean |
| scanned-clean-300dpi.pdf | Clean 300 DPI scan of a generic Letter PDF. | scanned-clean |
| scanned-noisy-300dpi.pdf | Noisy 300 DPI scan with high-density speckle noise. | scanned-noisy |
| scanned-skewed-3deg.pdf | Clean 300 DPI scan rotated 3 degrees. | scanned-skewed |
| scanned-skewed-noisy.pdf | Noisy 3-degree skewed 300 DPI scan - hardest realistic case. | scanned-noisy-skewed |
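
To get a feel for what DPI and skew mean for an OCR pipeline, the arithmetic below estimates the raster size of a page and how far a text baseline drifts vertically across the page at a given skew angle. It assumes each scan is a full-page US Letter raster, which the catalog does not state explicitly. Both helper names are hypothetical.

```python
import math


def raster_size_px(width_in: float, height_in: float, dpi: int):
    """Pixel dimensions of a page of the given physical size at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


def baseline_drift_px(line_length_px: float, skew_degrees: float) -> float:
    """Vertical drift of a straight text line over `line_length_px` of width."""
    return line_length_px * math.tan(math.radians(skew_degrees))


# US Letter is 8.5 x 11 inches.
letter_300 = raster_size_px(8.5, 11, 300)
letter_200 = raster_size_px(8.5, 11, 200)

# Drift across the full width of a 300 DPI Letter page at 3 degrees of skew.
drift = baseline_drift_px(letter_300[0], 3)
```

Under these assumptions a 3-degree skew moves a full-width baseline by well over a hundred pixels at 300 DPI, which is why line segmentation without a deskew step tends to struggle on the skewed fixtures.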

§5 By Failure Mode

PDFs damaged in known, named ways. Each PDF's facet record names the exact corruption.

| ID | Description | Facets |
| --- | --- | --- |
| byte-flipped-mid-stream.pdf | One byte XOR-flipped mid-content-stream. Likely renders pages with garbage. | corrupt-byte-flipped |
| eof-missing.pdf | %%EOF marker stripped. Parsers that key on it cannot find the end. | corrupt-eof-missing |
| header-truncated.pdf | First line (%PDF-1.7) removed. Parsers that key on the header fail to detect a PDF. | corrupt-header-truncated |
| object-generation-mismatch.pdf | First object's header says generation 1; xref says generation 0. | corrupt-object-generation-mismatch |
| stream-length-mismatch.pdf | One stream object's /Length is overstated by 99 bytes. | corrupt-stream-length-mismatch |
| trailer-missing.pdf | Trailer dictionary removed; startxref present but points to nothing useful. | corrupt-trailer-missing |
| xref-truncated.pdf | PDF byte-truncated at the start of the xref table. Parsers without xref-recovery will fail. | corrupt-xref-truncated |
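
Several of these failure modes can be spotted with plain byte checks, and a basic form of xref-recovery amounts to rescanning the file for `N G obj` markers. The sketch below shows both under simplifying assumptions: it expects a classic xref table (not a cross-reference stream), it can be fooled by marker-like bytes inside binary streams, and the symptom labels and function names are hypothetical rather than the catalog's facet names.

```python
import re

# "12 0 obj" -> object number 12, generation 0. "endobj" does not match.
_OBJ = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")


def scan_objects(data: bytes) -> dict:
    """Rebuild (object number, generation) -> byte offset by scanning the file.

    Later occurrences overwrite earlier ones, mirroring how an
    incremental update supersedes an older object definition.
    """
    return {(int(m.group(1)), int(m.group(2))): m.start()
            for m in _OBJ.finditer(data)}


def startxref_points_at_xref(data: bytes) -> bool:
    """True if the last startxref offset lands on an 'xref' keyword."""
    pos = data.rfind(b"startxref")
    if pos < 0:
        return False
    match = re.match(rb"startxref\s+(\d+)", data[pos:])
    if match is None:
        return False
    offset = int(match.group(1))
    return data[offset:offset + 4] == b"xref"


def symptoms(data: bytes) -> list:
    """List structural problems detectable without parsing any objects."""
    found = []
    if not data.startswith(b"%PDF-"):
        found.append("header-missing")
    if b"%%EOF" not in data[-1024:]:
        found.append("eof-marker-missing")
    if b"trailer" not in data:
        found.append("trailer-missing")
    if not startxref_points_at_xref(data):
        found.append("startxref-unusable")
    return found
```

A parser test could call `symptoms` first to confirm that the fixture is damaged in the expected way, then assert that the parser either recovers the same objects as `scan_objects` or raises its documented error.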