RAG vs Markdown: Local Evidence Benchmark

Does compiling PDFs into structured Markdown before indexing improve RAG quality?

This project is not a RAG demo. It is a diagnostic framework to separate the effects of document preprocessing, retrieval, generation, and epistemic behavior (knowing when not to answer) in local RAG pipelines.

Research Questions

ID	Question	Status
RQ1	Does Markdown compilation improve overall quality vs. raw PDF text?	Completed (No: all 4 models score higher on Raw; Markdown loses 0.18 to 0.30 points, reported as −3.6% to −6.0%)
RQ2	Does the Markdown advantage persist with stronger generative models?	Completed (No: there is no advantage to persist; Markdown scores lower at every model size tested)
RQ3	Are errors primarily caused by retrieval failures or generation failures?	Completed (~60% retrieval, ~25% generation, ~15% scoring/citation)

Pipelines

Pipeline A (Raw)   PDF → PyMuPDF → raw text → chunks → ChromaDB → retrieve → LLM
Pipeline B (MD)    PDF → PyMuPDF → raw text → Markdown → chunks → ChromaDB → retrieve → LLM

Pipeline	Description	Embeddings	Vector DB	LLM
A	Raw extracted text	all-MiniLM-L6-v2	ChromaDB (cosine)	Configurable
B	Compiled Markdown	all-MiniLM-L6-v2	ChromaDB (cosine)	Configurable
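
Both pipelines rank chunks by cosine similarity between the query embedding and each chunk embedding. The snippet below is a minimal sketch of that ranking step in plain Python. It assumes hand-written 3-dimensional vectors in place of all-MiniLM-L6-v2 embeddings and a list in place of ChromaDB; the chunk ids and the `retrieve` helper are hypothetical and are not taken from scripts/query.py.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def retrieve(query_vec, index, k=2):
    """Return the ids of the k chunks most similar to the query."""
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:k]]

# Toy index of (chunk id, embedding) pairs; all values are made up.
toy_index = [
    ("raw-001", [0.9, 0.1, 0.0]),
    ("raw-002", [0.0, 1.0, 0.0]),
    ("raw-003", [0.7, 0.7, 0.0]),
]

top = retrieve([1.0, 0.0, 0.0], toy_index, k=2)
print(top)
```

Because cosine similarity ignores vector length, a very short chunk can rank as high as a long one if its direction matches the query.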

Results at a Glance

Model	Size	Pipeline A (Raw)	Pipeline B (MD)	Pipeline C (MD-filtered)	Δ (B−A)	Δ (C−B)
Gemma 3 4B	4B	2.29	1.99	2.16	−0.30 (−6.0%)	+0.17 (+3.4%)
Nemotron 3	~3B	2.46	2.28	2.37	−0.18 (−3.6%)	+0.09 (+1.8%)
DeepSeek V4 Flash	—	2.98	2.76	2.83	−0.22 (−4.4%)	+0.07 (+1.4%)
Gemma 4 26B	26B	3.12	2.86	—*	−0.26 (−5.2%)	—

*Pipeline C results for Gemma 4 26B are omitted as unreliable because of Gemini API errors.
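
A note on reading the Δ columns: the percentages are not relative to the baseline score (−0.30 on 2.29 would be about −13%). They are consistent with the point difference divided by 5, which suggests a 5-point scoring scale. That scale is an inference from the numbers, not something stated in this README. The sketch below recomputes the columns under that assumption; `ASSUMED_SCALE_MAX` and `delta_pct` are names introduced here for illustration.

```python
# Scores copied from the table above: (Raw, MD, MD-filtered or None).
scores = {
    "Gemma 3 4B": (2.29, 1.99, 2.16),
    "Nemotron 3": (2.46, 2.28, 2.37),
    "DeepSeek V4 Flash": (2.98, 2.76, 2.83),
    "Gemma 4 26B": (3.12, 2.86, None),
}

ASSUMED_SCALE_MAX = 5.0  # assumption: scores are on a 5-point scale

def delta_pct(new, old, scale=ASSUMED_SCALE_MAX):
    """Return (point difference, difference as a percentage of the scale)."""
    delta = round(new - old, 2)
    return delta, round(100 * delta / scale, 1)

rows = {}
for model, (raw, md, md_filtered) in scores.items():
    rows[model] = {
        "B-A": delta_pct(md, raw),
        "C-B": delta_pct(md_filtered, md) if md_filtered is not None else None,
    }

for model, row in rows.items():
    print(model, row)
```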

Key finding: all four models score higher on Raw than on Markdown. Pipeline C, which filters out shallow chunks (threshold = 200 chars), recovers part of the Markdown loss. The mechanism is confirmed: shallow header chunks (25% of retrieved Markdown chunks) create embedding false positives, that is, they match the query without carrying enough text to answer it.
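
The shallow chunk filter can be pictured as a simple length check. This is a minimal sketch, assuming "shallow" means fewer than 200 characters of stripped chunk text. The 200-character threshold comes from the text above; the exact criterion used by the project (strict vs. inclusive comparison, whether the header line counts, whether filtering happens before or after indexing) is not stated here, and `is_shallow` and `filter_shallow` are hypothetical names.

```python
SHALLOW_THRESHOLD_CHARS = 200  # threshold quoted in the key finding

def is_shallow(chunk_text, threshold=SHALLOW_THRESHOLD_CHARS):
    """Assumed criterion: stripped text shorter than the threshold."""
    return len(chunk_text.strip()) < threshold

def filter_shallow(chunks, threshold=SHALLOW_THRESHOLD_CHARS):
    """Return (kept chunks, number of chunks removed)."""
    kept = [c for c in chunks if not is_shallow(c, threshold)]
    return kept, len(chunks) - len(kept)

# Placeholder chunks: one header-only chunk and one header with body text.
chunks = [
    "## Section heading",
    "## Section heading\n\n" + "Body text. " * 30,
]

kept, removed = filter_shallow(chunks)
print(f"kept={len(kept)} removed={removed}")
```

A header-only chunk is short enough to sit close to many queries in embedding space, which is why removing it can help even though no answer text is lost.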

See docs/results.md, the comparative report (reports/comparative_benchmark.md), and the Phase 3 oracle report (reports/fase3_retrieval_vs_generation.md).

Interpreting These Results

This benchmark does not measure "RAG in general". It measures how well the pipeline configurations described above answer 50 questions we designed. Key findings across all three phases:

Retrieval is the dominant constraint on this benchmark: the oracle test shows that giving the LLM full document text improves scores by +67–85%. However, the oracle bypasses retrieval entirely and uses raw text (not per-pipeline context), so the improvement is an upper-bound estimate. ~25% of errors persist even with oracle context.
Markdown hurts every model tested: all 4 models score higher on Raw than on Markdown (−3.6% to −6.0%). Pipeline C (shallow chunk filtering, 9 chunks removed) partially recovers the loss for the 3 models with reliable Pipeline C results (+0.07 to +0.17).
Famous papers: all documents are well-known arXiv papers, so a model may answer some questions from parametric knowledge rather than from the retrieved context. The error taxonomy helps separate RAG quality from parametric knowledge.
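
One way to read the oracle comparison is question by question: if an answer fails with retrieved context but succeeds with oracle context, retrieval is the likely cause; if it fails in both settings, the problem lies elsewhere (generation, scoring, or citation). The sketch below shows that coarse split. It is an illustration only: the pass mark of 3.0, the label names, and the sample scores are all hypothetical, and the project's own E01–E07 taxonomy in docs/error_taxonomy.md is finer-grained and not reproduced here.

```python
from collections import Counter

def attribute_error(retrieval_score, oracle_score, pass_mark=3.0):
    """Coarse label for one question; pass_mark is a hypothetical cutoff."""
    if retrieval_score >= pass_mark:
        return "ok"
    if oracle_score >= pass_mark:
        # Fixed by handing the model the full document text.
        return "retrieval"
    # Still wrong with oracle context.
    return "generation_or_other"

def summarize(pairs, pass_mark=3.0):
    """Count labels over (retrieval score, oracle score) pairs."""
    return Counter(attribute_error(r, o, pass_mark) for r, o in pairs)

# Made-up scores for three questions, not benchmark data.
sample_pairs = [(4.0, 4.5), (1.0, 4.0), (1.5, 2.0)]
summary = summarize(sample_pairs)
print(dict(summary))
```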

Experimental Philosophy

This project follows three principles:

Separate concerns. Each pipeline, model, and metric is a variable that can be changed independently.
Document failure modes. Errors are classified by origin (retrieval, generation, hallucination), not by final score.
Declare boundaries. Results are reported within their experimental context, not as general claims.

Setup

# Clone
git clone https://github.com/cioffiAI/rag-vs-markdown.git
cd rag-vs-markdown

# Virtual environment (run the activate line for your OS)
python -m venv .venv
.venv\Scripts\activate      # Windows
source .venv/bin/activate   # Linux/Mac

# Install
pip install -r requirements.txt

Place PDFs in data/raw/, then:

# 1. Extract text from PDFs
python scripts/extract.py

# 2. Compile to Markdown (needed only for Pipeline B)
python scripts/compile_markdown.py

# 3. Build index
python scripts/build_index.py --pipeline a   # raw text index
python scripts/build_index.py --pipeline b   # markdown index

# 4. Run queries
python scripts/query.py --pipeline b "your question?"

# 5. Batch run benchmark
python scripts/query.py --pipeline b --file data/benchmark_questions.json

# 6. Evaluate
python scripts/evaluate.py batch_b_*.json benchmark_results.md

# 7. Compare pipelines
python scripts/compare_pipelines.py

# 8. Classify errors (Phase 3)
python scripts/error_analysis.py batch_*.json --pipeline-label "B" --model-label "Gemma4"

# 9. Run oracle test (Phase 3)
python scripts/oracle_test.py --model gemma-4-26b-a4b-it

# 10. Compare retrieval vs oracle
python scripts/oracle_compare.py batch_results.json oracle_results.json
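
The steps above show the query commands for Pipeline B only, while a comparison needs a batch run for each pipeline. The sketch below just builds and prints the index and batch-query commands for both pipelines so they can be reviewed before running. It uses only the flags shown above (`--pipeline`, `--file`) and assumes that scripts/query.py accepts `--pipeline a` the same way scripts/build_index.py does; `benchmark_commands` is a hypothetical helper, not part of the repository.

```python
import shlex

PIPELINES = ("a", "b")
QUESTIONS_FILE = "data/benchmark_questions.json"

def benchmark_commands(pipelines=PIPELINES, questions_file=QUESTIONS_FILE):
    """Build the index and batch-query commands for each pipeline."""
    commands = []
    for pipeline in pipelines:
        commands.append(
            ["python", "scripts/build_index.py", "--pipeline", pipeline]
        )
        commands.append(
            ["python", "scripts/query.py", "--pipeline", pipeline,
             "--file", questions_file]
        )
    return commands

commands = benchmark_commands()
for cmd in commands:
    # Print only; pass each list to subprocess.run(cmd, check=True) to execute.
    print(shlex.join(cmd))
```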

Benchmark

50 gold-standard questions (Italian questions, English LLM):

Type	Count	Description
Simple factual	15	Single-document, direct extraction
Local reasoning	10	Multi-paragraph within one document
Multi-document	10	Compare/contrast across 2+ papers
Table extraction	5	Numeric values from tables
Negative (trap)	10	Questions not answerable from corpus

Questions file: data/benchmark_questions.json
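
When editing the questions file, it can help to check that the type distribution still matches the table above. The following is a minimal sketch that assumes each question is a dict with a `type` field holding one of the labels from the table. The actual schema of data/benchmark_questions.json is not shown in this README, so the field name and label strings are assumptions, and the check runs on synthetic entries rather than the real file.

```python
from collections import Counter

# Counts copied from the table above.
EXPECTED_COUNTS = {
    "Simple factual": 15,
    "Local reasoning": 10,
    "Multi-document": 10,
    "Table extraction": 5,
    "Negative (trap)": 10,
}

def check_distribution(questions, type_key="type"):
    """Return {type: (expected, found)} for every type whose count differs."""
    found = Counter(q[type_key] for q in questions)
    labels = set(EXPECTED_COUNTS) | set(found)
    return {
        label: (EXPECTED_COUNTS.get(label, 0), found.get(label, 0))
        for label in labels
        if EXPECTED_COUNTS.get(label, 0) != found.get(label, 0)
    }

# Synthetic stand-in for the real file: one stub entry per expected question.
questions = [
    {"type": label}
    for label, count in EXPECTED_COUNTS.items()
    for _ in range(count)
]

mismatches = check_distribution(questions)
print(len(questions), mismatches)
```

To run it against the real file, load the list with `json.load` and adjust `type_key` to whatever field the file actually uses.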

Project Structure

├── scripts/
│   ├── extract.py                 PDF → JSON (PyMuPDF)
│   ├── compile_markdown.py        JSON → Markdown with sections
│   ├── build_index.py             Chunk → embed → ChromaDB (--pipeline a|b)
│   ├── query.py                   Retrieve + LLM generation (multi-provider)
│   ├── evaluate.py                Score answers against ground truth
│   ├── compare_pipelines.py       A vs B comparison report
│   ├── error_analysis.py          E01–E07 error classification
│   ├── oracle_test.py             Oracle context (full document text)
│   ├── oracle_compare.py          Retrieval vs oracle score comparison
│   └── convert_benchmark_to_csv.py
├── data/
│   ├── raw/                       Input PDFs (not tracked)
│   ├── extracted/                 Per-page JSON (not tracked)
│   ├── processed/                 Markdown versions (not tracked)
│   ├── benchmark_questions.json   50 gold-standard questions
│   └── gold_questions.csv         CSV export
├── docs/
│   ├── results.md                 Results with interpretation
│   ├── limitations.md             Known constraints and threats to validity
│   └── error_taxonomy.md          Error classification system
├── reports/
│   ├── comparative_benchmark.md          Gemma 4 26B A vs B comparison
│   ├── fase3_retrieval_vs_generation.md  Oracle test + error classification
│   ├── error_profile_*.md                E01–E07 per-pipeline profiles
│   ├── oracle_comparison_*.md            Retrieval vs oracle comparisons
│   └── benchmark_results_*.md            Per-pipeline score tables
└── requirements.txt

Requirements

Python 3.10+
4GB+ RAM (for sentence-transformers on CPU)
For local inference: LM Studio or Ollama on localhost with an OpenAI-compatible endpoint
For cloud inference: Google Gemini API key (free tier: Gemma 4 26B, 15 requests per minute, 1,500 requests per day)
No GPU required

License

MIT
