A tool that blacks out sensitive information in PDF files. It works on scanned documents, not just digital ones.
1. Install Tesseract, the text recognition engine:
# Ubuntu/Debian
sudo apt install tesseract-ocr
# macOS
brew install tesseract
For non-English documents, install additional language packs:
# Ubuntu/Debian (Korean, Japanese, Chinese Simplified)
sudo apt install tesseract-ocr-kor tesseract-ocr-jpn tesseract-ocr-chi-sim
# macOS
brew install tesseract-lang
2. Install bleachpdf:
pip install bleachpdf
3. Run it:
bleachpdf document.pdf -m "123456789" -m "JohnDoe"
This creates output/document.pdf with black boxes covering any text that matches "123456789" or "JohnDoe".
For more complex patterns, you can put them in a config file instead — see Writing Patterns below.
It works on scanned documents. Most redaction tools only read the text layer inside a PDF file. That works fine for documents created on a computer, but fails completely on scanned papers, faxes, or PDFs that are really just pictures of pages.
This tool takes a different approach: it converts each page to an image, uses optical character recognition to read the text, finds your sensitive information, draws black boxes over it, and saves a new PDF. The original text layer is ignored entirely, so nothing slips through.
No hidden text can leak. The output is a clean PDF containing only images. There's no hidden text layer that could accidentally expose your information if someone copies and pastes from the document.
Free and private. No subscriptions, no accounts, no uploading your documents anywhere. Everything runs on your own computer.
Actually tested. Most free redaction tools ship with minimal or no automated testing. bleachpdf runs against olmOCR-bench, a standardized benchmark from the Allen Institute for AI containing thousands of challenging documents -- old scans, dense text, complex layouts, and more. Every test verifies that redacted text is actually hidden by re-scanning the output. See Testing Strategy for details.
- Each page gets converted to an image
- Text recognition finds words and their positions on the page
- Your patterns are matched against the text
- Black boxes are drawn over the matches
- The redacted images become a new PDF
- The output is scanned again to make sure nothing was missed
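The matching steps in the list above can be sketched in plain Python. This is an illustration only, not bleachpdf's actual code: the Word class, the normalize and boxes_to_redact helpers, and the pixel coordinates are all hypothetical, and a regular expression stands in for the tool's pattern grammar. The sketch assumes the text recognition step has already produced words with bounding boxes, and that "stripping spaces, dashes, and punctuation" means dropping every character that is not a letter or digit.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word and its bounding box (hypothetical structure)."""
    text: str
    box: tuple  # (left, top, right, bottom) in pixels


def normalize(text):
    # Assumption: keep only letters and digits, drop everything else.
    return "".join(ch for ch in text if ch.isalnum())


def boxes_to_redact(words, pattern):
    """Return the boxes of every word that contributes to a match."""
    chars = []
    owners = []  # owners[i] is the index of the word that produced chars[i]
    for index, word in enumerate(words):
        for ch in normalize(word.text):
            chars.append(ch)
            owners.append(index)
    joined = "".join(chars)

    hit = set()
    for m in re.finditer(pattern, joined):
        hit.update(owners[m.start():m.end()])
    return [words[i].box for i in sorted(hit)]


words = [
    Word("SSN:", (10, 10, 60, 30)),
    Word("123-45-", (70, 10, 150, 30)),
    Word("6789", (160, 10, 210, 30)),
]
print(boxes_to_redact(words, r"[0-9]{9}"))
```

Because the words are joined after normalization, the nine digits are found even though they are split across two recognized words, and both word boxes are returned for blacking out. Covering whole words rather than individual characters is a simplification chosen for this sketch.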
For simple cases, use the -m flag on the command line:
bleachpdf document.pdf -m "123456789" -m "JohnDoe"
For repeated use or complex patterns, put them in a config file. The tool looks for a file called pii.yaml in the current directory (the name comes from "personally identifiable information").
The simplest pattern matches exact text:
patterns:
- 'match = "123456789"'
- 'match = "JohnDoe"'
About spaces and punctuation: The tool strips out spaces, dashes, and punctuation before matching. So if your document shows 123-45-6789, the tool sees 123456789. Write your patterns the same way:
| Document shows | Write as |
|---|---|
| 123-45-6789 | "123456789" |
| John Doe | "JohnDoe" |
| ACCT #12345 | "ACCT12345" |
To match text regardless of capitalization, use ~"..."i:
patterns:
- 'match = ~"johndoe"i' # matches JohnDoe, JOHNDOE, johndoe, etc.
Sometimes you want to match patterns like "any 9-digit number" rather than a specific number. You can define rules for this:
patterns:
- |
  match = d d d d d d d d d
  d = ~"[0-9]"
This matches any 9 digits in a row. The ~"[0-9]" means "any single digit from 0 to 9". Each d in the pattern represents one digit, so d d d d d d d d d means "nine digits".
Similarly, ~"[A-Za-z]" matches any letter.
You can mix literal text with patterns:
patterns:
- |
  match = "ACCT" d d d d
  d = ~"[0-9]"
This matches "ACCT" followed by exactly 4 digits: ACCT1234, ACCT0001, etc.
Use + to mean "one or more":
patterns:
- |
  match = "ACCT" d+
  d = ~"[0-9]"
This matches "ACCT" followed by one or more digits, however many there are.
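For readers who already know regular expressions, the rules above have close regex counterparts. The comparison below is a rough guide using Python's re module, not a description of how bleachpdf evaluates patterns (it uses a parsing expression grammar). The REGEX_EQUIVALENTS table, the matches_whole helper, and the sample strings are illustrative names made up for this sketch.

```python
import re

# Pattern rule as written in the config file -> roughly equivalent regex.
REGEX_EQUIVALENTS = {
    'match = d d d d d d d d d': r"[0-9]{9}",
    'match = "ACCT" d d d d': r"ACCT[0-9]{4}",
    'match = "ACCT" d+': r"ACCT[0-9]+",
}


def matches_whole(regex, text, ignore_case=False):
    """True if the regex matches the entire (already normalized) text."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.fullmatch(regex, text, flags) is not None


print(matches_whole(r"ACCT[0-9]{4}", "ACCT1234"))            # exactly four digits
print(matches_whole(r"ACCT[0-9]+", "ACCT"))                  # "+" needs at least one digit
print(matches_whole(r"johndoe", "JohnDoe", ignore_case=True))  # like ~"johndoe"i
```

The second call shows why + is read as "one or more": ACCT with no digits after it does not match.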
Social Security Number (any 9 digits):
patterns:
- |
  match = d d d d d d d d d
  d = ~"[0-9]"
Phone number (any 10 digits):
patterns:
- |
  match = d d d d d d d d d d
  d = ~"[0-9]"
A specific name (ignoring case):
patterns:
- 'match = ~"johndoe"i'
- 'match = ~"janedoe"i'
The tool looks for your config file in these locations, in order:
- The path you give with -c or --config
- The BLEACHPDF_CONFIG environment variable
- pii.yaml in the current directory
- ~/.config/bleachpdf/pii.yaml (your personal config)
- /etc/xdg/bleachpdf/pii.yaml (system-wide config)
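The lookup order above can be pictured as a simple "first file that exists wins" search. The following is a minimal sketch of that idea, not the tool's real code: the candidate_configs and find_config functions and their parameters are hypothetical, and the sketch makes the assumption that a listed location is simply skipped when no file is there (how bleachpdf itself treats a missing -c path is not stated here).

```python
import os
from pathlib import Path


def candidate_configs(cli_path=None, environ=None, cwd=None, home=None):
    """List possible config locations, highest priority first."""
    environ = os.environ if environ is None else environ
    cwd = Path.cwd() if cwd is None else Path(cwd)
    home = Path.home() if home is None else Path(home)

    candidates = []
    if cli_path:                                   # -c / --config
        candidates.append(Path(cli_path))
    env_path = environ.get("BLEACHPDF_CONFIG")     # environment variable
    if env_path:
        candidates.append(Path(env_path))
    candidates.append(cwd / "pii.yaml")            # current directory
    candidates.append(home / ".config" / "bleachpdf" / "pii.yaml")
    candidates.append(Path("/etc/xdg/bleachpdf/pii.yaml"))
    return candidates


def find_config(**kwargs):
    """Return the first candidate that exists as a file, or None."""
    for path in candidate_configs(**kwargs):
        if path.is_file():
            return path
    return None
```

Calling find_config() with no arguments uses the real environment, working directory, and home directory. The keyword arguments exist only so the search can be tried out against a scratch directory.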
The pattern language is called a "parsing expression grammar." If you want to learn more advanced features like optional elements, grouping, and lookahead, see the parsimonious documentation.
bleachpdf document.pdf # Redact one file
bleachpdf document.pdf -o redacted.pdf # Choose the output filename
bleachpdf documents/ -o output/ # Redact all PDFs in a folder
bleachpdf "reports/*.pdf" -o output/ # Redact files matching a pattern
bleachpdf document.pdf -v # Show progress while running
| Option | What it does |
|---|---|
| -m, --match | Text to redact (case-insensitive). Use multiple times for multiple patterns. |
| -o, --output | Where to save the result (default: output/) |
| -c, --config | Path to a config file |
| -d, --dpi | Image quality; higher means sharper but slower (default: 300) |
| --lang | Tesseract language(s) for OCR, e.g. eng, eng+kor (default: eng) |
| -j, --jobs | How many files to process at once (default: half your CPU cores) |
| --relaxed | Don't fail when no matches are found |
| --no-verify | Skip the safety check that re-scans the output |
| -v, --verbose | Show detailed progress |
| -q, --quiet | Don't print anything |
When the tool finishes, it returns a number indicating what happened:
| Code | Meaning |
|---|---|
| 0 | Success — redactions were made |
| 1 | Configuration problem — missing config file, invalid patterns, etc. |
| 2 | File problem — couldn't find input or write output |
| 3 | No matches — the patterns didn't match anything in the document |
| 4 | Verification failed — text is still visible after redaction |
By default, the tool treats "no matches found" as an error. This is intentional — if you're redacting a document, you probably expect it to contain the sensitive text. A missing match could mean:
- You're redacting the wrong document
- The text recognition couldn't read the document
- Your pattern has a typo
If you're processing a batch of documents where some legitimately won't contain matches, use --relaxed:
bleachpdf documents/ --relaxed
In relaxed mode, documents with no matches just get a warning instead of causing the tool to fail.
Note that verification failures (text still visible after redaction) are always fatal — that's a serious problem that can't be ignored.
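When bleachpdf is called from a script, the exit codes in the table above can be turned into readable messages. The wrapper below is a small sketch under stated assumptions: EXIT_MEANINGS, describe_exit, and run_bleachpdf are names invented for this example, and the only things taken from the tool are the bleachpdf command itself and the code meanings listed in the table.

```python
import subprocess

# Meanings copied from the exit code table.
EXIT_MEANINGS = {
    0: "success: redactions were made",
    1: "configuration problem",
    2: "file problem",
    3: "no matches",
    4: "verification failed",
}


def describe_exit(code):
    """Translate an exit code into a short message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_bleachpdf(args):
    """Run bleachpdf with a list of arguments and explain how it ended."""
    result = subprocess.run(["bleachpdf", *args])
    return result.returncode, describe_exit(result.returncode)


# Example (requires bleachpdf to be installed):
# code, meaning = run_bleachpdf(["document.pdf", "-m", "123456789"])
# if code == 4:
#     raise SystemExit("Do not share the output: " + meaning)
```

Treating code 4 as a hard stop mirrors the point above that a verification failure should never be ignored.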
Text recognition isn't perfect. Handwriting, unusual fonts, low-quality scans, and very small or dense text can cause recognition errors. The tool automatically retries at higher resolution if the first attempt finds nothing, but some documents may still fail.
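The retry behavior can be pictured as a loop over increasing resolutions. This is only a sketch of the idea: first_dpi_with_matches and the find_matches callable are hypothetical, 300 is the documented default for --dpi, and the higher values 450 and 600 are made-up numbers, since the resolutions the tool actually retries at are not stated here.

```python
def first_dpi_with_matches(find_matches, dpis=(300, 450, 600)):
    """Try each resolution in turn and stop at the first one with matches.

    find_matches is a hypothetical callable that renders the page at the
    given DPI, runs text recognition, and returns a list of matches
    (an empty list if nothing was found).
    """
    for dpi in dpis:
        matches = find_matches(dpi)
        if matches:
            return dpi, matches
    return None, []  # nothing found at any resolution


# Stand-in for real text recognition: this fake page is only readable at 450 DPI or more.
def fake_find_matches(dpi):
    return ["123456789"] if dpi >= 450 else []


print(first_dpi_with_matches(fake_find_matches))
```

If every attempt comes back empty, the caller is left in the "no matches" situation described earlier, which is why a manual check of the output remains necessary.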
Always check the output yourself. No automated redaction tool is 100% reliable. Before sharing a redacted document, open it and verify:
- Is all the sensitive information actually covered?
- Did anything slip through?
- Was anything accidentally over-redacted?
Think of this tool as a helpful first pass, not a replacement for careful human review. Also, note carefully the relevant details in the accompanying LICENSE file.
No license, right, or permission is granted -- expressly or by implication -- to use this software for censorship. This prohibition applies to all parties without exception, including but not limited to: individuals, companies, corporations, partnerships, nonprofit organizations, religious institutions, schools, universities, municipalities, counties, states, provinces, territories, national governments, intergovernmental bodies, and any agents or contractors acting on their behalf.
For the purposes of this restriction, "censorship" means using this software to suppress, obscure, or redact content in books, films, plays, newspapers, periodicals, websites, broadcasts, academic publications, or any other material created for public distribution or consumption, where the purpose is to prevent an audience from seeing the original content rather than to protect specific private information.
This software is designed to protect personal privacy. It is not designed to silence speech, and its author, John Byrd, does not grant permission for it to be used that way.
Each test redacts a document, then re-scans the output to verify the text is actually hidden.
pip install -e ".[dev]"
pytest tests/
Tests run in parallel by default, using half your CPU cores. Override with --jobs:
pytest tests/ --jobs=4 # Use 4 workers
pytest tests/ -n 1 # Run serially (disable parallelism)
pytest tests/ --limit=10 # Only run first 10 test cases
For the full testing documentation, including filtering by category, setting pass thresholds, and CI configuration, see Testing Strategy.