Does compiling PDFs into structured Markdown before indexing improve RAG quality?
This project is not a RAG demo. It is a diagnostic framework to separate the effects of document preprocessing, retrieval, generation, and epistemic behavior (knowing when not to answer) in local RAG pipelines.
| ID | Question | Status |
|---|---|---|
| RQ1 | Does Markdown compilation improve overall quality vs. raw PDF text? | Completed (No — all 4 models prefer Raw: −3.6% to −6.0%) |
| RQ2 | Does the Markdown advantage persist with stronger generative models? | Completed (No — Markdown consistently hurts across all model sizes) |
| RQ3 | Are errors primarily caused by retrieval failures or generation failures? | Completed (~60% retrieval, ~25% generation, ~15% scoring/citation) |
Pipeline A (Raw) PDF → PyMuPDF → raw text → chunks → ChromaDB → retrieve → LLM
Pipeline B (MD) PDF → PyMuPDF → raw text → Markdown → chunks → ChromaDB → retrieve → LLM
| Pipeline | Description | Embeddings | Vector DB | LLM |
|---|---|---|---|---|
| A | Raw extracted text | all-MiniLM-L6-v2 | ChromaDB (cosine) | Configurable |
| B | Compiled Markdown | all-MiniLM-L6-v2 | ChromaDB (cosine) | Configurable |
| Model | Size | Pipeline A (Raw) | Pipeline B (MD) | Pipeline C (MD-filtered) | Δ (B−A) | Δ (C−B) |
|---|---|---|---|---|---|---|
| Gemma 3 4B | 4B | 2.29 | 1.99 | 2.16 | −0.30 (−6.0%) | +0.17 (+3.4%) |
| Nemotron 3 | ~3B | 2.46 | 2.28 | 2.37 | −0.18 (−3.6%) | +0.09 (+1.8%) |
| DeepSeek V4 Flash | — | 2.98 | 2.76 | 2.83 | −0.22 (−4.4%) | +0.07 (+1.4%) |
| Gemma 4 26B | 26B | 3.12 | 2.86 | —* | −0.26 (−5.2%) | — |
*Gemma 4 26B Pipeline C unreliable due to Gemini API errors.
Key finding: All models prefer Raw over Markdown. Pipeline C (shallow chunk filter, threshold=200 chars) partially recovers the Markdown loss. The mechanism is confirmed: shallow header chunks (25% of retrieved Markdown chunks) create embedding false positives.
See docs/results.md, the comparative report, and the Fase 3 oracle report.
This benchmark does not measure "RAG in general". It measures the ability of two pipeline configurations to answer 50 questions we designed. Key findings across all three phases:
- Retrieval is the dominant constraint on this benchmark: the oracle test shows that giving the LLM full document text improves scores by +67–85%. However, the oracle bypasses retrieval entirely and uses raw text (not per-pipeline context), so the improvement is an upper-bound estimate. ~25% of errors persist even with oracle context.
- Markdown universally hurts: All 4 models prefer Raw over Markdown (−3.6% to −6.0%). Pipeline C (shallow chunk filtering, 9 chunks removed) partially recovers the loss for 3/4 models (+0.07 to +0.17).
- Famous papers: All documents are well-known arXiv papers. The error taxonomy helps separate RAG quality from parametric knowledge.
This project follows three principles:
- Separate concerns. Each pipeline, model, and metric is a variable that can be changed independently.
- Document failure modes. Errors are classified by origin (retrieval, generation, hallucination) not by final score.
- Declare boundaries. Results are reported within their experimental context, not as general claims.
# Clone
git clone https://github.com/cioffiAI/rag-vs-markdown.git
cd rag-vs-markdown
# Virtual environment
python -m venv .venv
.venv\Scripts\activate # Windows
source .venv/bin/activate # Linux/Mac
# Install
pip install -r requirements.txtPlace PDFs in data/raw/, then:
# 1. Extract text from PDFs
python scripts/extract.py
# 2. (Optional) Compile to Markdown for Pipeline B
python scripts/compile_markdown.py
# 3. Build index
python scripts/build_index.py --pipeline a # raw text index
python scripts/build_index.py --pipeline b # markdown index
# 4. Run queries
python scripts/query.py --pipeline b "your question?"
# 5. Batch run benchmark
python scripts/query.py --pipeline b --file data/benchmark_questions.json
# 6. Evaluate
python scripts/evaluate.py batch_b_*.json benchmark_results.md
# 7. Compare pipelines
python scripts/compare_pipelines.py
# 8. Classify errors (Fase 3)
python scripts/error_analysis.py batch_*.json --pipeline-label "B" --model-label "Gemma4"
# 9. Run oracle test (Fase 3)
python scripts/oracle_test.py --model gemma-4-26b-a4b-it
# 10. Compare retrieval vs oracle
python scripts/oracle_compare.py batch_results.json oracle_results.json50 gold-standard questions (Italian questions, English LLM):
| Type | Count | Description |
|---|---|---|
| Simple factual | 15 | Single-document, direct extraction |
| Local reasoning | 10 | Multi-paragraph within one document |
| Multi-document | 10 | Compare/contrast across 2+ papers |
| Table extraction | 5 | Numeric values from tables |
| Negative (trap) | 10 | Questions not answerable from corpus |
Questions file: data/benchmark_questions.json
├── scripts/
│ ├── extract.py PDF → JSON (PyMuPDF)
│ ├── compile_markdown.py JSON → Markdown with sections
│ ├── build_index.py Chunk → embed → ChromaDB (--pipeline a|b)
│ ├── query.py Retrieve + LLM generation (multi-provider)
│ ├── evaluate.py Score answers against ground truth
│ ├── compare_pipelines.py A vs B comparison report
│ ├── error_analysis.py E01–E07 error classification
│ ├── oracle_test.py Oracle context (full document text)
│ ├── oracle_compare.py Retrieval vs oracle score comparison
│ └── convert_benchmark_to_csv.py
├── data/
│ ├── raw/ Input PDFs (not tracked)
│ ├── extracted/ Per-page JSON (not tracked)
│ ├── processed/ Markdown versions (not tracked)
│ ├── benchmark_questions.json 50 gold-standard questions
│ └── gold_questions.csv CSV export
├── docs/
│ ├── results.md Results with interpretation
│ ├── limitations.md Known constraints and threats to validity
│ └── error_taxonomy.md Error classification system
├── reports/
│ ├── comparative_benchmark.md Gemma 4 26B A vs B comparison
│ ├── fase3_retrieval_vs_generation.md Oracle test + error classification
│ ├── error_profile_*.md E01–E07 per-pipeline profiles
│ ├── oracle_comparison_*.md Retrieval vs oracle comparisons
│ └── benchmark_results_*.md Per-pipeline score tables
└── requirements.txt
- Python 3.10+
- 4GB+ RAM (for sentence-transformers on CPU)
- For local inference: LM Studio or Ollama on localhost with an OpenAI-compatible endpoint
- For cloud inference: Google Gemini API key (free tier: Gemma 4 26B, 15 RPM, 1500 RPD)
- No GPU required
MIT