§0 Quick start
Pin a PDF by its flat URL. Bytes are immutable - the same URL returns the same PDF forever.
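Because the bytes behind a URL never change, a test suite can record a digest once and check it on every download. The sketch below assumes SHA-256 as the digest; the helper name `check_pinned` and the stand-in bytes are hypothetical and not part of the service.

```python
import hashlib

def check_pinned(data: bytes, expected_sha256: str) -> bytes:
    """Return data unchanged if its SHA-256 matches the pinned digest."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256.lower():
        raise ValueError(f"fixture changed: expected {expected_sha256}, got {actual}")
    return data

# Stand-in bytes; in a real suite `sample` would be the downloaded fixture body.
sample = b"%PDF-1.7\n%%EOF\n"
# Record the digest once and commit it next to the test.
pinned = hashlib.sha256(sample).hexdigest()
check_pinned(sample, pinned)  # passes silently while the bytes are unchanged
```

A mismatch then points at a proxy, cache, or transfer problem rather than at the parser under test.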
Shell:

    # fetch a known-corrupt PDF (truncated at xref)
    curl -fSL -o sample.pdf \
      https://pdfbin.net/xref-truncated.pdf
    # ~700 bytes; your parser should fall back to xref-recovery

Python (pytest):

    import httpx, pytest
    from myparser import parse_pdf

    def test_xref_recovery():
        pdf = httpx.get("https://pdfbin.net/"
                        "xref-truncated.pdf").content
        # a parser without xref-recovery is expected to raise here;
        # adapt the assertion if yours recovers instead
        with pytest.raises(EOFError):
            parse_pdf(pdf)

JavaScript:

    const url = "https://pdfbin.net/" +
      "irs-1040-blank.pdf";
    const res = await fetch(url);
    const buf = await res.arrayBuffer();
    // blank IRS 1040, public-domain via irs.gov
    renderToCanvas(buf);

Go:

    resp, err := http.Get("https://pdfbin.net/" +
        "xref-truncated.pdf")
    if err != nil {
        t.Fatal(err)
    }
    defer resp.Body.Close()
    // expected: io.ErrUnexpectedEOF from your parser
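For readers unsure what "xref-recovery" means in practice: when the cross-reference table is missing or unusable, a parser can scan the raw bytes for indirect-object headers (`N G obj`) and rebuild the offsets itself. The following is a minimal sketch of that idea, not the method of any particular parser; `rebuild_xref` is a hypothetical name, and the sketch ignores object streams, encrypted files, and `obj` tokens that happen to appear inside stream data.

```python
import re

# Matches an indirect-object header such as "12 0 obj" at the start of a line.
OBJ_HEADER = re.compile(rb"(?:^|[\r\n])[ \t]*(\d+)[ \t]+(\d+)[ \t]+obj\b")

def rebuild_xref(data: bytes) -> dict[tuple[int, int], int]:
    """Map (object number, generation) -> byte offset by scanning the file."""
    offsets = {}
    for m in OBJ_HEADER.finditer(data):
        key = (int(m.group(1)), int(m.group(2)))
        offsets[key] = m.start(1)  # a later definition overrides an earlier one
    return offsets

# Tiny synthetic body with two objects and no xref table at all.
data = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< >>\nendobj\n"
)
table = rebuild_xref(data)
assert table[(1, 0)] == data.index(b"1 0 obj")
assert table[(2, 0)] == data.index(b"2 0 obj")
```

Letting later definitions win mirrors how incremental updates append newer versions of an object to the end of a file.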
Catalog
Five views of the same fixture set. A PDF may appear in more than one section - that's the multi-axis catalog at work.
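One way to picture the multi-axis catalog is as a flat list of fixtures, each tagged with a set of facets, where every section is simply a filter over those tags. The sketch below is illustrative only: the `Fixture` type and `with_facets` helper are hypothetical, the four entries are a small excerpt, and it assumes that a fixture listed with two facets carries them as separate tags.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fixture:
    id: str
    facets: frozenset[str] = frozenset()

# Small excerpt; IDs and facet names are taken from the catalog tables.
CATALOG = [
    Fixture("irs-1040-blank.pdf", frozenset({"irs-1040"})),
    Fixture("receipt-scanned-noisy-300dpi.pdf", frozenset({"receipt", "scanned-noisy"})),
    Fixture("scanned-noisy-300dpi.pdf", frozenset({"scanned-noisy"})),
    Fixture("xref-truncated.pdf", frozenset({"corrupt-xref-truncated"})),
]

def with_facets(catalog, *wanted):
    """Return the IDs of fixtures that carry every facet in `wanted`."""
    need = set(wanted)
    return [f.id for f in catalog if need <= f.facets]

noisy = with_facets(CATALOG, "scanned-noisy")
noisy_receipts = with_facets(CATALOG, "receipt", "scanned-noisy")
```

Here `noisy` contains both scanned fixtures, while `noisy_receipts` narrows to the receipt alone, which is why the same PDF can show up under more than one heading.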
§1 By Form Factor
Canonical clean PDFs, varied by paper size (DIN A4, US Letter, JIS B5), page count (1 to 500), orientation, and file size (approximately 1 MB to 50 MB).
| ID | Description | Facets | URL |
|---|---|---|---|
| clean-100-pages.pdf | Clean US Letter, 100 portrait pages. | | |
| clean-10mb.pdf | Clean US Letter PDF padded to approximately 10 MB via an attached random-bytes file (embedded-file feature). | embedded-file | |
| clean-1mb.pdf | Clean US Letter PDF padded to approximately 1 MB via an attached random-bytes file (embedded-file feature). | embedded-file | |
| clean-25mb.pdf | Clean US Letter PDF padded to approximately 25 MB via an attached random-bytes file (embedded-file feature). | embedded-file | |
| clean-500-pages.pdf | Clean US Letter, 500 portrait pages. | | |
| clean-50mb.pdf | Clean US Letter PDF padded to approximately 50 MB via an attached random-bytes file (embedded-file feature). | embedded-file | |
| clean-a4-1page.pdf | Clean DIN A4 (210x297mm), single portrait page, body text. | | |
| clean-a4-3page.pdf | Clean DIN A4, three portrait pages. | | |
| clean-jis-b5-1page.pdf | Clean JIS B5 (182x257mm) - not ISO B5. Single portrait page. | | |
| clean-letter-1page.pdf | Clean US Letter (8.5x11in), single portrait page, body text. | | |
| clean-letter-3page.pdf | Clean US Letter, three portrait pages. | | |
| clean-mixed-orientation.pdf | US Letter, alternating portrait and landscape across 5 pages. | | |
| clean-paper-sizes-mixed.pdf | Mixed paper sizes: A4, US Letter, and JIS B5 in one document. | | |
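
When asserting on these fixtures, it helps to compare page boxes in PDF points (1/72 inch) rather than millimetres. The sketch below converts the dimensions listed in this section and classifies a page's width and height; `classify_page` and the 1-point tolerance are assumptions made for illustration, and a page's `/Rotate` entry is deliberately ignored.

```python
MM_PER_INCH = 25.4
PT_PER_INCH = 72.0

def mm_to_pt(mm: float) -> float:
    """Convert millimetres to PDF points (1/72 inch)."""
    return mm * PT_PER_INCH / MM_PER_INCH

# Portrait (short side, long side) in points, from the sizes listed above.
PAPER_SIZES_PT = {
    "DIN A4": (mm_to_pt(210), mm_to_pt(297)),
    "US Letter": (8.5 * PT_PER_INCH, 11 * PT_PER_INCH),
    "JIS B5": (mm_to_pt(182), mm_to_pt(257)),
}

def classify_page(width_pt: float, height_pt: float, tol: float = 1.0):
    """Return (paper name or None, orientation) for a page box size."""
    orientation = "landscape" if width_pt > height_pt else "portrait"
    short, long_ = sorted((width_pt, height_pt))
    for name, (w, h) in PAPER_SIZES_PT.items():
        if abs(short - w) <= tol and abs(long_ - h) <= tol:
            return name, orientation
    return None, orientation

letter = classify_page(612, 792)
a4_landscape = classify_page(841.89, 595.28)
iso_b5 = classify_page(mm_to_pt(176), mm_to_pt(250))  # ISO B5, not in the table
```

The last call returns no paper name, which is the point of the JIS B5 fixture: a parser that assumes ISO B5 (176x250mm) will report the wrong size.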
§2 By Document Shape
PDFs shaped like real-world documents - fax covers, invoices, IRS forms, lab reports, contracts.
| ID | Description | Facets | URL |
|---|---|---|---|
| bank-statement-letter-clean.pdf | Bank account statement on US Letter. Synthesized. | bank-statement | |
| contract-nda-letter-clean.pdf | Mutual non-disclosure agreement on US Letter. Synthesized. | contract-nda | |
| fax-cover-letter-clean.pdf | Fax cover sheet on US Letter (To/From/Company/Fax/Pages/Date/Re). Synthesized. | fax-cover-sheet | |
| fax-cover-letter-scanned-noisy.pdf | Fax cover sheet rendered as a noisy 300 DPI scan - realistic faxed-document case. | fax-cover-sheet, scanned-noisy | |
| invoice-letter-clean.pdf | Invoice on US Letter with line items and totals. Synthesized. | invoice | |
| irs-1040-blank.pdf | Blank IRS Form 1040, imported verbatim from irs.gov. US federal work, public domain. | irs-1040 | |
| lab-report-letter-clean.pdf | Diagnostics lab report on US Letter. Synthesized. | lab-report | |
| receipt-letter-clean.pdf | Coffee-shop receipt on US Letter. Synthesized. | receipt | |
| receipt-scanned-noisy-300dpi.pdf | Receipt rendered as a noisy 300 DPI scan - classic crumpled-receipt OCR target. | receipt, scanned-noisy | |
§3 By Spec Compliance
PDF spec versions (1.4 / 1.7 / 2.0) and PDF/A conformance levels (1B, 1A, 2B, 3B).
| ID | Description | Facets | URL |
|---|---|---|---|
| pdf-1.4-clean.pdf | Clean PDF saved targeting spec version 1.4. | | |
| pdf-1.7-clean.pdf | Clean PDF saved targeting spec version 1.7. | | |
| pdf-2.0-clean.pdf | Clean PDF saved targeting spec version 2.0. | | |
| pdfa-1a-compliant.pdf | PDF/A-1A compliant document (accessible / tagged variant of -1B). | PDF/A-1A | |
| pdfa-1b-compliant.pdf | PDF/A-1B compliant document (visual appearance preserved). | PDF/A-1B | |
| pdfa-2b-compliant.pdf | PDF/A-2B compliant document (PDF 1.7 features allowed). | PDF/A-2B | |
| pdfa-3b-with-attachment.pdf | PDF/A-3B with an embedded plain-text attachment - the headline PDF/A-3 feature. | PDF/A-3B, embedded-file | |
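
A quick sanity check for the version fixtures is to read the `%PDF-x.y` header. The sketch below does only that; `header_version` is a hypothetical helper, and searching the first 1024 bytes rather than byte 0 is an assumption that mirrors lenient readers. Note that a document catalog's `/Version` entry can declare a later version than the header, and PDF/A conformance is declared in XMP metadata, neither of which this sketch inspects.

```python
import re

HEADER = re.compile(rb"%PDF-(\d)\.(\d)")

def header_version(data: bytes):
    """Return (major, minor) from the %PDF-x.y header, or None if absent."""
    m = HEADER.search(data[:1024])
    return (int(m.group(1)), int(m.group(2))) if m else None

v14 = header_version(b"%PDF-1.4\n1 0 obj\n")
v20 = header_version(b"%PDF-2.0\n1 0 obj\n")
missing = header_version(b"1 0 obj\n<< >>\nendobj\n")
```

Comparing the result as a tuple, for example `header_version(data) >= (1, 7)`, avoids the string-comparison pitfalls of treating versions as text.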
§4 By Provenance
Digital-native vs scanned variants. Scans vary by DPI, noise, and skew - useful for OCR test suites.
| ID | Description | Facets | URL |
|---|---|---|---|
| fax-cover-letter-scanned-noisy.pdf | Fax cover sheet rendered as a noisy 300 DPI scan - realistic faxed-document case. | fax-cover-sheet, scanned-noisy | |
| receipt-scanned-noisy-300dpi.pdf | Receipt rendered as a noisy 300 DPI scan - classic crumpled-receipt OCR target. | receipt, scanned-noisy | |
| scanned-clean-200dpi.pdf | Clean 200 DPI scan of a generic Letter PDF. | scanned-clean | |
| scanned-clean-300dpi.pdf | Clean 300 DPI scan of a generic Letter PDF. | scanned-clean | |
| scanned-noisy-300dpi.pdf | Noisy 300 DPI scan with high-density speckle noise. | scanned-noisy | |
| scanned-skewed-3deg.pdf | Clean 300 DPI scan rotated 3 degrees. | scanned-skewed | |
| scanned-skewed-noisy.pdf | Noisy 3-degree skewed 300 DPI scan - hardest realistic case. | scanned-noisy-skewed | |
§5 By Failure Mode
PDFs damaged in known, named ways. Each PDF's facet record names the exact corruption.
| ID | Description | Facets | URL |
|---|---|---|---|
| byte-flipped-mid-stream.pdf | One byte XOR-flipped mid-content-stream. Likely renders pages with garbage. | corrupt-byte-flipped | |
| eof-missing.pdf | %%EOF marker stripped. Parsers that key on it cannot find the end. | corrupt-eof-missing | |
| header-truncated.pdf | First line (%PDF-1.7) removed. Parsers that key on the header fail to detect a PDF. | corrupt-header-truncated | |
| object-generation-mismatch.pdf | First object's header says generation 1; xref says generation 0. | corrupt-object-generation-mismatch | |
| stream-length-mismatch.pdf | One stream object's /Length is overstated by 99 bytes. | corrupt-stream-length-mismatch | |
| trailer-missing.pdf | Trailer dictionary removed; startxref present but points to nothing useful. | corrupt-trailer-missing | |
| xref-truncated.pdf | PDF byte-truncated at the start of the xref table. Parsers without xref-recovery will fail. | corrupt-xref-truncated | |
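
Several of these failure modes can be spotted with cheap byte-level checks before a full parse. The sketch below is a rough triage pass under stated assumptions: `triage` and its problem labels are hypothetical, the 1024-byte windows are a leniency choice, and the synthetic bytes only imitate the fixtures. It does not detect flipped bytes, `/Length` mismatches, or generation mismatches, which require actually parsing objects.

```python
def triage(data: bytes) -> list[str]:
    """Return coarse structural problems found in a PDF byte string."""
    problems = []
    if b"%PDF-" not in data[:1024]:
        problems.append("header-missing")
    tail = data[-1024:]
    if b"%%EOF" not in tail:
        problems.append("eof-missing")
    if b"startxref" not in tail:
        problems.append("startxref-missing")
    # Files that use cross-reference streams carry /XRef instead of "trailer".
    if b"trailer" not in data and b"/XRef" not in data:
        problems.append("trailer-missing")
    return problems

# Synthetic stand-in for a well-formed file (the offsets are not meaningful).
good = (
    b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\n"
    b"xref\n0 1\n0000000000 65535 f \n"
    b"trailer\n<< /Size 1 >>\nstartxref\n30\n%%EOF\n"
)
no_header = good[9:]                      # first line removed
no_eof = good.replace(b"%%EOF\n", b"")    # end marker stripped
truncated = good[: good.index(b"xref")]   # cut at the start of the xref table
```

On these inputs `triage(good)` is empty, `triage(no_header)` and `triage(no_eof)` each report one problem, and `triage(truncated)` reports three at once, since everything after the cut is gone.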