GAPMAP: mapping explicit and implicit knowledge gaps in scientific articles with large language models. Paper: https://www.arxiv.org/abs/2510.25055
GAPMAP shows that large language models (LLMs) can surface explicit (author‑signaled) and implicit (unstated, inferred) knowledge gaps in scientific articles. The work introduces TABI—an interpretable template (Claim · Grounds → Warrant + confidence bucket)—for implicit gaps. Larger models generally perform best; sentence‑aligned chunking of long text (~1K words) is safe and often helpful.
- Baselines for explicit and implicit gap detection
- TABI prompting template (Claim, Grounds, Warrant, Bucket)
- Evaluation scripts: ROUGE‑L F1 (explicit) and entailment‑based accuracy (implicit)
- Small, reproducible result tables and comparisons

How it works

- Explicit gaps: detect uncertainty/negation cues in paragraphs/sections; score predictions against gold spans with ROUGE‑L F1 using one‑to‑one matching.
- Implicit gaps (TABI): generate a Claim, cite supporting Grounds, add a Warrant, and assign a confidence bucket; judge correctness with bi‑directional entailment between predictions and gold premises/claims (see the prompt sketch after this list).
- Long context: optional sentence‑aligned ~1K‑word chunking; compare “no chunking” vs “chunked” (see the chunking sketch after this list).
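
As a concrete illustration, here is a minimal sketch of what a TABI-style prompt builder could look like. The instruction wording, the bucket labels, and the `build_tabi_prompt` helper are illustrative assumptions, not the paper's exact prompt.

```python
# Illustrative TABI-style prompt builder (not the paper's exact wording).
# Field names follow the template: Claim, Grounds, Warrant, confidence Bucket.

TABI_INSTRUCTIONS = """\
You will read a paragraph from a scientific article and identify an IMPLICIT
knowledge gap (one the authors do not state directly).

Answer in exactly this format:
Claim: <a concise statement of the unstated knowledge gap>
Grounds: <verbatim or near-verbatim evidence from the paragraph>
Warrant: <why the grounds support the claim of a gap>
Bucket: <confidence bucket, e.g. high / medium / low>
"""

def build_tabi_prompt(paragraph: str, few_shot_examples: str = "") -> str:
    """Assemble the full prompt; few-shot examples are recommended."""
    parts = [TABI_INSTRUCTIONS]
    if few_shot_examples:
        parts.append("Examples:\n" + few_shot_examples)
    parts.append("Paragraph:\n" + paragraph)
    return "\n\n".join(parts)
```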
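
And a minimal sketch of sentence-aligned ~1K-word chunking. The regex sentence splitter and the `chunk_text` helper are simplifying assumptions; the actual preprocessing may use a different sentence tokenizer or word budget.

```python
import re

def chunk_text(text: str, max_words: int = 1000) -> list[str]:
    """Split text into chunks of roughly max_words words,
    breaking only on sentence boundaries."""
    # Naive sentence split on ., !, ? followed by whitespace; a real pipeline
    # might use a proper sentence tokenizer instead.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n_words = len(sent.split())
        if current and current_len + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks
```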

Datasets

| Task | Dataset (unit) | Domain & scale (high level) |
|---|---|---|
| Explicit | IPBES (paragraphs) | Biodiversity; paragraph‑level gap spans |
| Explicit | Scientific Challenges & Directions (sections) | COVID‑19; sentences labeled within sections |
| Implicit | Manual implicit‑gap corpus (paragraphs) | Biomedical; ~hundreds of paragraphs |
| Implicit | Full‑text pilot (full papers + author survey) | Mixed STEM; ~dozens of articles |

Evaluation

| Task/Setting | Metric | Notes |
|---|---|---|
| Explicit (IPBES) | ROUGE‑L F1 | Stemming + one‑to‑one matching with a similarity threshold (matching sketch below) |
| Explicit (COVID‑19 sections) | Accuracy | Validate predicted statements with an ignorance‑cue dictionary |
| Implicit (paragraph level) | Accuracy (entailment) | Bi‑directional entailment between predicted claim/warrant and gold (entailment sketch below) |
| Long‑context robustness | Comparison (no‑chunk vs chunk) | Sentence‑aligned ~1K‑word chunks; recall often improves |
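
For the explicit task, scoring could look roughly like the sketch below, which uses the `rouge-score` package (stemming enabled) with greedy one-to-one matching. The 0.5 similarity threshold and the greedy strategy are illustrative assumptions; the exact matching protocol in the paper may differ.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def match_predictions(predictions, golds, threshold=0.5):
    """Greedily match predicted gap statements to gold spans (one-to-one).
    Returns (matched_pairs, precision, recall, f1)."""
    unused_gold = set(range(len(golds)))
    matched = []
    for p_idx, pred in enumerate(predictions):
        # Best remaining gold span by ROUGE-L F1.
        best = max(
            ((g_idx, scorer.score(golds[g_idx], pred)["rougeL"].fmeasure)
             for g_idx in unused_gold),
            key=lambda x: x[1],
            default=None,
        )
        if best and best[1] >= threshold:
            matched.append((p_idx, best[0], best[1]))
            unused_gold.remove(best[0])
    tp = len(matched)
    precision = tp / len(predictions) if predictions else 0.0
    recall = tp / len(golds) if golds else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return matched, precision, recall, f1
```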
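
For the implicit task, a bi-directional entailment check could be approximated with an off-the-shelf NLI model. The choice of `roberta-large-mnli` and the 0.5 probability threshold are illustrative assumptions; the paper's entailment judge may be a different model or decision rule.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # illustrative off-the-shelf NLI model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
ENTAILMENT_ID = {v.lower(): k for k, v in model.config.id2label.items()}["entailment"]

def entails(premise: str, hypothesis: str, threshold: float = 0.5) -> bool:
    """True if the NLI model judges premise -> hypothesis as entailment."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    return probs[ENTAILMENT_ID].item() >= threshold

def bidirectional_match(predicted: str, gold: str) -> bool:
    """Count a predicted implicit gap as correct only if prediction and gold
    entail each other (bi-directional entailment)."""
    return entails(predicted, gold) and entails(gold, predicted)
```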

Results

A) Explicit — IPBES (ROUGE‑L F1)

Large open‑weight and strong closed‑weight models are both competitive; the best results come from the largest models. Chunking preserves performance.

B) Explicit — COVID‑19 Sections (Accuracy)

Long sections are harder (a single gold statement per section). The strongest large closed‑weight model leads, and chunking sometimes helps.

C) Implicit — Paragraph Level (Accuracy)

Large closed‑weight models perform best; large open‑weight models are close behind. Smaller models struggle without few‑shot guidance.

Key takeaways
- Scale helps (bigger models win), but strong open‑weight models can be competitive.
- TABI makes implicit gaps interpretable and easier to score.
- Chunking is a reliable preprocessing step and often boosts recall.

How to use

- Prepare text (paragraphs/sections/full text). Optionally chunk to ~1K words on sentence boundaries.
- Explicit task: run model → extract candidate gap statements → score with ROUGE‑L F1 (IPBES) or accuracy (COVID‑19).
- Implicit task (TABI): prompt using Claim / Grounds / Warrant + Bucket (few‑shot recommended) → score with entailment‑based accuracy.
- Report: aggregate P/R/F1 (explicit) and accuracy (implicit); compare no‑chunk vs chunked (a minimal aggregation helper is sketched below).
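
For the reporting step, a small micro-averaging helper might look like this. Whether results are aggregated per document or over the whole corpus is not specified here, so treat this as one reasonable choice.

```python
def corpus_prf1(per_doc_counts):
    """per_doc_counts: iterable of (num_matched, num_predicted, num_gold) per document.
    Returns micro-averaged (precision, recall, f1)."""
    counts = list(per_doc_counts)
    tp = sum(m for m, _, _ in counts)
    n_pred = sum(p for _, p, _ in counts)
    n_gold = sum(g for _, _, g in counts)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```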

Notes

- Few‑shot examples materially improve TABI outputs; zero‑shot tends to be vague.
- COVID‑19 sections may contain multiple gaps though only one is labeled; numeric/contrastive cues matter, not just lexical hedges.
- For deployment: keep a human‑in‑the‑loop and consider domain adaptation.

License

Specify your license here (e.g., MIT).