Skip to content

cioffiAI/rag-vs-markdown

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

RAG vs Markdown: Local Evidence Benchmark

Does compiling PDFs into structured Markdown before indexing improve RAG quality?

This project is not a RAG demo. It is a diagnostic framework to separate the effects of document preprocessing, retrieval, generation, and epistemic behavior (knowing when not to answer) in local RAG pipelines.

Research Questions

ID Question Status
RQ1 Does Markdown compilation improve overall quality vs. raw PDF text? Completed (No — all 4 models prefer Raw: −3.6% to −6.0%)
RQ2 Does the Markdown advantage persist with stronger generative models? Completed (No — Markdown consistently hurts across all model sizes)
RQ3 Are errors primarily caused by retrieval failures or generation failures? Completed (~60% retrieval, ~25% generation, ~15% scoring/citation)

Pipelines

Pipeline A (Raw)   PDF → PyMuPDF → raw text → chunks → ChromaDB → retrieve → LLM
Pipeline B (MD)    PDF → PyMuPDF → raw text → Markdown → chunks → ChromaDB → retrieve → LLM
Pipeline Description Embeddings Vector DB LLM
A Raw extracted text all-MiniLM-L6-v2 ChromaDB (cosine) Configurable
B Compiled Markdown all-MiniLM-L6-v2 ChromaDB (cosine) Configurable

Results at a Glance

Model Size Pipeline A (Raw) Pipeline B (MD) Pipeline C (MD-filtered) Δ (B−A) Δ (C−B)
Gemma 3 4B 4B 2.29 1.99 2.16 −0.30 (−6.0%) +0.17 (+3.4%)
Nemotron 3 ~3B 2.46 2.28 2.37 −0.18 (−3.6%) +0.09 (+1.8%)
DeepSeek V4 Flash 2.98 2.76 2.83 −0.22 (−4.4%) +0.07 (+1.4%)
Gemma 4 26B 26B 3.12 2.86 —* −0.26 (−5.2%)

*Gemma 4 26B Pipeline C unreliable due to Gemini API errors.

Key finding: All models prefer Raw over Markdown. Pipeline C (shallow chunk filter, threshold=200 chars) partially recovers the Markdown loss. The mechanism is confirmed: shallow header chunks (25% of retrieved Markdown chunks) create embedding false positives.

See docs/results.md, the comparative report, and the Fase 3 oracle report.

Interpreting These Results

This benchmark does not measure "RAG in general". It measures the ability of two pipeline configurations to answer 50 questions we designed. Key findings across all three phases:

  • Retrieval is the dominant constraint on this benchmark: the oracle test shows that giving the LLM full document text improves scores by +67–85%. However, the oracle bypasses retrieval entirely and uses raw text (not per-pipeline context), so the improvement is an upper-bound estimate. ~25% of errors persist even with oracle context.
  • Markdown universally hurts: All 4 models prefer Raw over Markdown (−3.6% to −6.0%). Pipeline C (shallow chunk filtering, 9 chunks removed) partially recovers the loss for 3/4 models (+0.07 to +0.17).
  • Famous papers: All documents are well-known arXiv papers. The error taxonomy helps separate RAG quality from parametric knowledge.

Experimental Philosophy

This project follows three principles:

  1. Separate concerns. Each pipeline, model, and metric is a variable that can be changed independently.
  2. Document failure modes. Errors are classified by origin (retrieval, generation, hallucination) not by final score.
  3. Declare boundaries. Results are reported within their experimental context, not as general claims.

Setup

# Clone
git clone https://github.com/cioffiAI/rag-vs-markdown.git
cd rag-vs-markdown

# Virtual environment
python -m venv .venv
.venv\Scripts\activate    # Windows
source .venv/bin/activate # Linux/Mac

# Install
pip install -r requirements.txt

Place PDFs in data/raw/, then:

# 1. Extract text from PDFs
python scripts/extract.py

# 2. (Optional) Compile to Markdown for Pipeline B
python scripts/compile_markdown.py

# 3. Build index
python scripts/build_index.py --pipeline a   # raw text index
python scripts/build_index.py --pipeline b   # markdown index

# 4. Run queries
python scripts/query.py --pipeline b "your question?"

# 5. Batch run benchmark
python scripts/query.py --pipeline b --file data/benchmark_questions.json

# 6. Evaluate
python scripts/evaluate.py batch_b_*.json benchmark_results.md

# 7. Compare pipelines
python scripts/compare_pipelines.py

# 8. Classify errors (Fase 3)
python scripts/error_analysis.py batch_*.json --pipeline-label "B" --model-label "Gemma4"

# 9. Run oracle test (Fase 3)
python scripts/oracle_test.py --model gemma-4-26b-a4b-it

# 10. Compare retrieval vs oracle
python scripts/oracle_compare.py batch_results.json oracle_results.json

Benchmark

50 gold-standard questions (Italian questions, English LLM):

Type Count Description
Simple factual 15 Single-document, direct extraction
Local reasoning 10 Multi-paragraph within one document
Multi-document 10 Compare/contrast across 2+ papers
Table extraction 5 Numeric values from tables
Negative (trap) 10 Questions not answerable from corpus

Questions file: data/benchmark_questions.json

Project Structure

├── scripts/
│   ├── extract.py                 PDF → JSON (PyMuPDF)
│   ├── compile_markdown.py        JSON → Markdown with sections
│   ├── build_index.py             Chunk → embed → ChromaDB (--pipeline a|b)
│   ├── query.py                   Retrieve + LLM generation (multi-provider)
│   ├── evaluate.py                Score answers against ground truth
│   ├── compare_pipelines.py       A vs B comparison report
│   ├── error_analysis.py          E01–E07 error classification
│   ├── oracle_test.py             Oracle context (full document text)
│   ├── oracle_compare.py          Retrieval vs oracle score comparison
│   └── convert_benchmark_to_csv.py
├── data/
│   ├── raw/                       Input PDFs (not tracked)
│   ├── extracted/                 Per-page JSON (not tracked)
│   ├── processed/                 Markdown versions (not tracked)
│   ├── benchmark_questions.json   50 gold-standard questions
│   └── gold_questions.csv         CSV export
├── docs/
│   ├── results.md                 Results with interpretation
│   ├── limitations.md             Known constraints and threats to validity
│   └── error_taxonomy.md          Error classification system
├── reports/
│   ├── comparative_benchmark.md          Gemma 4 26B A vs B comparison
│   ├── fase3_retrieval_vs_generation.md  Oracle test + error classification
│   ├── error_profile_*.md                E01–E07 per-pipeline profiles
│   ├── oracle_comparison_*.md            Retrieval vs oracle comparisons
│   └── benchmark_results_*.md            Per-pipeline score tables
└── requirements.txt

Requirements

  • Python 3.10+
  • 4GB+ RAM (for sentence-transformers on CPU)
  • For local inference: LM Studio or Ollama on localhost with an OpenAI-compatible endpoint
  • For cloud inference: Google Gemini API key (free tier: Gemma 4 26B, 15 RPM, 1500 RPD)
  • No GPU required

License

MIT

About

Benchmark per confrontare RAG su PDF grezzo vs RAG su Markdown compilato. 75 domande, ChromaDB, evaluation framework.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors