bleachpdf

A tool that blacks out sensitive information in PDF files. It works on scanned documents, not just digital ones.

Quick Start

1. Install Tesseract, the text recognition engine:

# Ubuntu/Debian
sudo apt install tesseract-ocr

# macOS
brew install tesseract

For non-English documents, install additional language packs:

# Ubuntu/Debian (Korean, Japanese, Chinese Simplified)
sudo apt install tesseract-ocr-kor tesseract-ocr-jpn tesseract-ocr-chi-sim

# macOS
brew install tesseract-lang

2. Install bleachpdf:

pip install bleachpdf

3. Run it:

bleachpdf document.pdf -m "123456789" -m "JohnDoe"

This creates output/document.pdf with black boxes covering any text that matches "123456789" or "JohnDoe".

For more complex patterns, you can put them in a config file instead — see Writing Patterns below.

Why Use This?

It works on scanned documents. Most redaction tools only read the text layer inside a PDF file. That works fine for documents created on a computer, but fails completely on scanned papers, faxes, or PDFs that are really just pictures of pages.

This tool takes a different approach: it converts each page to an image, uses optical character recognition to read the text, finds your sensitive information, draws black boxes over it, and saves a new PDF. The original text layer is ignored entirely, so nothing slips through.

No hidden text can leak. The output is a clean PDF containing only images. There's no hidden text layer that could accidentally expose your information if someone copies and pastes from the document.

Free and private. No subscriptions, no accounts, no uploading your documents anywhere. Everything runs on your own computer.

Actually tested. Most free redaction tools ship with minimal or no automated testing. bleachpdf runs against olmOCR-bench, a standardized benchmark from the Allen Institute for AI containing thousands of challenging documents -- old scans, dense text, complex layouts, and more. Every test verifies that redacted text is actually hidden by re-scanning the output. See Testing Strategy for details.

How It Works

1. Each page gets converted to an image
2. Text recognition finds words and their positions on the page
3. Your patterns are matched against the text
4. Black boxes are drawn over the matches
5. The redacted images become a new PDF
6. The output is scanned again to make sure nothing was missed
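
The matching steps above (recognized words with positions, pattern matching, choosing which boxes to cover) can be sketched in Python. This is a minimal illustration and not bleachpdf's actual code: the `Word` tuple, the `boxes_to_redact` function, and the use of a regular expression in place of the tool's grammar-based patterns are all assumptions made for the example.

```python
import re
from typing import NamedTuple

class Word(NamedTuple):
    """One recognized word (hypothetical structure for this sketch)."""
    text: str
    box: tuple  # (left, top, right, bottom) in pixels

def boxes_to_redact(words, pattern):
    """Return the boxes of every word that overlaps a match of `pattern`."""
    joined = ""
    spans = []  # (start, end, box) of each word inside `joined`
    for word in words:
        # Drop spaces, dashes, and punctuation before matching.
        cleaned = "".join(ch for ch in word.text if ch.isalnum())
        if not cleaned:
            continue
        spans.append((len(joined), len(joined) + len(cleaned), word.box))
        joined += cleaned
    boxes = []
    for m in re.finditer(pattern, joined):
        for start, end, box in spans:
            # A word is covered if any part of it lies inside the match.
            if start < m.end() and end > m.start() and box not in boxes:
                boxes.append(box)
    return boxes

words = [
    Word("SSN:", (10, 10, 50, 30)),
    Word("123-45-", (60, 10, 130, 30)),
    Word("6789", (135, 10, 180, 30)),
]
hits = boxes_to_redact(words, r"[0-9]{9}")
```

Here the nine digits are split across two recognized words, so both of their boxes end up in `hits`, while the "SSN:" label is left alone.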

Writing Patterns

For simple cases, use the -m flag on the command line:

bleachpdf document.pdf -m "123456789" -m "JohnDoe"

For repeated use or complex patterns, put them in a config file. The tool looks for a file called pii.yaml in the current directory (the name comes from "personally identifiable information").

Exact Text

The simplest pattern matches exact text:

patterns:
  - 'match = "123456789"'
  - 'match = "JohnDoe"'

About spaces and punctuation: The tool strips out spaces, dashes, and punctuation before matching. So if your document shows 123-45-6789, the tool sees 123456789. Write your patterns the same way:

Document shows	Write as
`123-45-6789`	`"123456789"`
`John Doe`	`"JohnDoe"`
`ACCT #12345`	`"ACCT12345"`
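
To check how a piece of text will look to the matcher, the normalization in the table can be approximated in Python. This is a sketch under the assumption that "spaces, dashes, and punctuation" means every character that is not a letter or digit; the `normalize` function is hypothetical and the tool's exact rules may differ.

```python
def normalize(text: str) -> str:
    """Keep only letters and digits, dropping spaces, dashes, and punctuation."""
    return "".join(ch for ch in text if ch.isalnum())

# The three rows of the table above.
for shown in ["123-45-6789", "John Doe", "ACCT #12345"]:
    print(f"{shown!r} -> {normalize(shown)!r}")
```

Running the normalized form through this helper before writing a pattern is a quick way to avoid leaving a stray dash or space in it.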

Ignoring Upper/Lowercase

To match text regardless of capitalization, use ~"..."i:

patterns:
  - 'match = ~"johndoe"i'    # matches JohnDoe, JOHNDOE, johndoe, etc.

Matching Any Digit or Letter

Sometimes you want to match patterns like "any 9-digit number" rather than a specific number. You can define rules for this:

patterns:
  - |
    match = d d d d d d d d d
    d = ~"[0-9]"

This matches any 9 digits in a row. The ~"[0-9]" means "any single digit from 0 to 9". Each d in the pattern represents one digit, so d d d d d d d d d means "nine digits".

Similarly, ~"[A-Za-z]" matches any letter.

Combining Text and Patterns

You can mix literal text with patterns:

patterns:
  - |
    match = "ACCT" d d d d
    d = ~"[0-9]"

This matches "ACCT" followed by exactly 4 digits: ACCT1234, ACCT0001, etc.

Repeating Patterns

Use + to mean "one or more":

patterns:
  - |
    match = "ACCT" d+
    d = ~"[0-9]"

This matches "ACCT" followed by one or more digits: ACCT1, ACCT1234, ACCT000123, etc.
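
For readers who know regular expressions, the three grammar rules shown so far behave roughly like the regexes below. This is an analogy only, written with Python's standard `re` module rather than the grammar engine bleachpdf uses, and it assumes the text has already had spaces and punctuation stripped.

```python
import re

# Rough regex analogues of the grammar rules (assumed, for illustration).
NINE_DIGITS = re.compile(r"[0-9]{9}")    # match = d d d d d d d d d
ACCT_FOUR = re.compile(r"ACCT[0-9]{4}")  # match = "ACCT" d d d d
ACCT_MANY = re.compile(r"ACCT[0-9]+")    # match = "ACCT" d+

text = "ACCT12345678"  # already normalized: no spaces or punctuation

four = ACCT_FOUR.search(text).group()  # stops after exactly four digits
many = ACCT_MANY.search(text).group()  # takes every digit that follows
nine = NINE_DIGITS.search(text)        # None: only eight digits are present
```

The difference between the fixed-length and the `+` form matters for redaction: the four-digit rule would leave the trailing digits of a longer account number visible.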

Common Examples

Social Security Number (any 9 digits):

patterns:
  - |
    match = d d d d d d d d d
    d = ~"[0-9]"

Phone number (any 10 digits):

patterns:
  - |
    match = d d d d d d d d d d
    d = ~"[0-9]"

A specific name (ignoring case):

patterns:
  - 'match = ~"johndoe"i'
  - 'match = ~"janedoe"i'

Where the Config File Can Live

The tool looks for your config file in these locations, in order:

1. The path you give with -c or --config
2. The BLEACHPDF_CONFIG environment variable
3. pii.yaml in the current directory
4. ~/.config/bleachpdf/pii.yaml (your personal config)
5. /etc/xdg/bleachpdf/pii.yaml (system-wide config)
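
The lookup order can be sketched as a first-match search. The `find_config` function and its parameters are hypothetical, and the sketch assumes that a location is simply skipped when no file exists there; how the real tool reacts to an explicit -c path that is missing is not covered here.

```python
import os
from pathlib import Path

SYSTEM_CONFIG = Path("/etc/xdg/bleachpdf/pii.yaml")

def find_config(cli_path=None, env=None, cwd=None, home=None, system=SYSTEM_CONFIG):
    """Return the first existing config file in the documented order, or None."""
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else Path(cwd)
    home = Path.home() if home is None else Path(home)
    candidates = [
        cli_path,                                     # -c / --config
        env.get("BLEACHPDF_CONFIG"),                  # environment variable
        cwd / "pii.yaml",                             # current directory
        home / ".config" / "bleachpdf" / "pii.yaml",  # personal config
        system,                                       # system-wide config
    ]
    for candidate in candidates:
        if candidate and Path(candidate).is_file():
            return Path(candidate)
    return None
```

Because earlier entries win, a pii.yaml in the project directory takes priority over a personal or system-wide file.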

Learning More About Patterns

The pattern language is called a "parsing expression grammar." If you want to learn more advanced features like optional elements, grouping, and lookahead, see the parsimonious documentation.

Usage Examples

bleachpdf document.pdf                    # Redact one file
bleachpdf document.pdf -o redacted.pdf    # Choose the output filename
bleachpdf documents/ -o output/           # Redact all PDFs in a folder
bleachpdf "reports/*.pdf" -o output/      # Redact files matching a pattern
bleachpdf document.pdf -v                 # Show progress while running

Options

Option	What it does
`-m, --match`	Text to redact (case-insensitive). Use multiple times for multiple patterns.
`-o, --output`	Where to save the result (default: `output/`)
`-c, --config`	Path to a config file
`-d, --dpi`	Image quality — higher means sharper but slower (default: 300)
`--lang`	Tesseract language(s) for OCR, e.g. `eng`, `eng+kor` (default: `eng`)
`-j, --jobs`	How many files to process at once (default: half your CPU cores)
`--relaxed`	Don't fail when no matches are found
`--no-verify`	Skip the safety check that re-scans the output
`-v, --verbose`	Show detailed progress
`-q, --quiet`	Don't print anything

Exit Codes

When the tool finishes, it returns a number indicating what happened:

Code	Meaning
0	Success — redactions were made
1	Configuration problem — missing config file, invalid patterns, etc.
2	File problem — couldn't find input or write output
3	No matches — the patterns didn't match anything in the document
4	Verification failed — text is still visible after redaction
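
When calling bleachpdf from a script, the exit code is what tells you whether the output is usable. The wrapper below is a minimal sketch: `redact`, `describe_exit`, and `EXIT_MEANINGS` are hypothetical names, and only the `-m` flag and the exit codes from the table above are taken from this document.

```python
import subprocess

# Meanings copied from the exit code table.
EXIT_MEANINGS = {
    0: "success: redactions were made",
    1: "configuration problem",
    2: "file problem",
    3: "no matches",
    4: "verification failed",
}

def describe_exit(code: int) -> str:
    """Translate an exit code into a short human-readable message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")

def redact(pdf_path: str, *matches: str) -> int:
    """Run the bleachpdf command line tool and return its exit code."""
    cmd = ["bleachpdf", pdf_path]
    for text in matches:
        cmd += ["-m", text]
    return subprocess.run(cmd).returncode
```

A caller might run `code = redact("document.pdf", "123456789")` and refuse to share the output unless `code == 0`.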

Strict vs Relaxed Mode

By default, the tool treats "no matches found" as an error. This is intentional — if you're redacting a document, you probably expect it to contain the sensitive text. A missing match could mean:

You're redacting the wrong document
The text recognition couldn't read the document
Your pattern has a typo

If you're processing a batch of documents where some legitimately won't contain matches, use --relaxed:

bleachpdf documents/ --relaxed

In relaxed mode, documents with no matches just get a warning instead of causing the tool to fail.

Note that verification failures (text still visible after redaction) are always fatal — that's a serious problem that can't be ignored.
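
The interaction between strict mode, relaxed mode, and verification can be summarized as a small decision function. This is a sketch of the rules described in this section, not the tool's source: the `exit_code_for` name, its parameters, and the order of the checks are assumptions.

```python
def exit_code_for(match_count: int, verified: bool, relaxed: bool = False) -> int:
    """Pick an exit code for one document from the rules described above."""
    if not verified:
        return 4  # verification failures are always fatal
    if match_count == 0 and not relaxed:
        return 3  # strict mode: no matches is an error
    return 0      # success, or relaxed mode with a warning only

strict_result = exit_code_for(match_count=0, verified=True)
relaxed_result = exit_code_for(match_count=0, verified=True, relaxed=True)
failed_check = exit_code_for(match_count=2, verified=False, relaxed=True)
```

Note that `relaxed` never masks a verification failure; it only changes how an empty result is treated.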

Limitations

Text recognition isn't perfect. Handwriting, unusual fonts, low-quality scans, and very small or dense text can cause recognition errors. The tool automatically retries at higher resolution if the first attempt finds nothing, but some documents may still fail.
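
The retry behavior can be pictured as a loop over increasing resolutions. This is only a sketch: `find_with_retry` and the `find_matches` callback are hypothetical, and the DPI steps after the documented default of 300 are assumed values, not the ones the tool uses.

```python
def find_with_retry(find_matches, dpis=(300, 450, 600)):
    """Call find_matches(dpi) at increasing resolutions until something is found.

    Returns (dpi_used, matches). If every attempt comes back empty, the
    last resolution tried is returned with an empty list.
    """
    for dpi in dpis:
        matches = find_matches(dpi)
        if matches:
            return dpi, matches
    return dpis[-1], []
```

If every resolution comes back empty, the caller still has to decide whether that is an error, which is where strict and relaxed mode come in.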

Always check the output yourself. No automated redaction tool is 100% reliable. Before sharing a redacted document, open it and verify:

Is all the sensitive information actually covered?
Did anything slip through?
Was anything accidentally over-redacted?

Think of this tool as a helpful first pass, not a replacement for careful human review. Also read the terms in the accompanying LICENSE file carefully.

No license granted for censorship

No license, right, or permission is granted -- expressly or by implication -- to use this software for censorship. This prohibition applies to all parties without exception, including but not limited to: individuals, companies, corporations, partnerships, nonprofit organizations, religious institutions, schools, universities, municipalities, counties, states, provinces, territories, national governments, intergovernmental bodies, and any agents or contractors acting on their behalf.

For the purposes of this restriction, "censorship" means using this software to suppress, obscure, or redact content in books, films, plays, newspapers, periodicals, websites, broadcasts, academic publications, or any other material created for public distribution or consumption, where the purpose is to prevent an audience from seeing the original content rather than to protect specific private information.

This software is designed to protect personal privacy. It is not designed to silence speech, and its author, John Byrd, does not grant permission for it to be used that way.

Development

Running Tests

Each test redacts a document, then re-scans the output to verify the text is actually hidden.

pip install -e ".[dev]"
pytest tests/

Tests run in parallel by default, using half your CPU cores. Use --jobs to change the worker count, -n 1 to run serially, or --limit to run only the first few test cases:

pytest tests/ --jobs=4        # Use 4 workers
pytest tests/ -n 1            # Run serially (disable parallelism)
pytest tests/ --limit=10      # Only run first 10 test cases

For the full testing documentation—including filtering by category, setting pass thresholds, and CI configuration—see Testing Strategy.
