
[Feat] #127 dual-system QA evaluation infrastructure #53

Merged
TaskerJang merged 11 commits into dev from feat/qa-eval-dual-system
May 25, 2026

@TaskerJang

Owner

Purpose

Build a quantitative QA evaluation infrastructure for doc-graph-agent, following the same pattern as doc-summary-agent. Measuring both systems on the same QA set makes a RAG vs GraphRAG comparison possible.

QA design: 80 QA dual evaluation

| Category | Source | Count | Intent |
|---|---|---|---|
| VectorRAG QA | Copied from the doc-summary 40 QA | 40 | Facts / figures: RAG strength |
| GraphRAG QA | Newly written (1-hop / 2-hop / intersection / aggregation / filter / metadata) | 40 | Relations / integration: graph strength |

Ask both systems the same 80 QA → a matrix of strength and weakness areas.

Commit summary

| C# | Commit | Contents |
|---|---|---|
| C1 | 9ae50597 | agent/llm_client.py LLM toggle (configure_llm() + from_args()) |
| C2 | 724def9c | eval/metrics/: rouge / numerical / faithfulness_judge + prompts |
| C3+C4 | 82f420c8 | eval/dataset/vectorrag_qa.json (40) + graphrag_qa.json (empty array) + README |
| C5+C6+C7 | 430b32ac | scripts/run_qa_eval.py + .env.example + scripts/explore_graph.py |

Compatibility

| Environment | Impact |
|---|---|
| prod / W4 qualitative validation (run_w4_eval.py) | ❌ No impact: KIMI_* is used when configure_llm() is not called |
| New entry point (scripts/run_qa_eval.py) | ✅ Toggle across 4 OpenRouter LLMs × Claude judge |
| New (scripts/explore_graph.py) | ✅ Runs 6 Neo4j queries automatically + JSON output |

Example command for the main measurement run

# DeepSeek V3.2 × 80 QA (VectorRAG 40 + GraphRAG 40)
uv run python scripts/run_qa_eval.py \
  --llm-model "deepseek/deepseek-v3.2" \
  --llm-base-url "https://openrouter.ai/api/v1" \
  --llm-api-key-env "OPENROUTER_API_KEY" \
  --judge-model "anthropic/claude-haiku-4.5" \
  --judge-base-url "https://openrouter.ai/api/v1" \
  --judge-api-key-env "OPENROUTER_API_KEY" \
  --qa-set both \
  --tag deepseek_both

Next steps (PENDING)

  1. Run scripts/explore_graph.py --json eval/dataset/graph_meta.json
  2. Write the 40 GraphRAG QA based on graph_meta.json → fill in graphrag_qa.json
  3. Validate with a dry run (--limit 3)
  4. Main measurement: 4 LLMs × 80 QA

Refs #127

TaskerJang added 11 commits May 25, 2026 12:02
Lets the scripts/run_qa_eval.py entry point toggle the LLM through the --llm-model / --llm-base-url / --llm-api-key-env
arguments. Same pattern as doc-summary-agent.

## Changes

### New: LLMConfig.from_args() class method
- Builds an LLMConfig from explicitly supplied model / base_url / api_key_env
- Kept separate from from_env() (KIMI_*), so prod / W4 qualitative validation behavior is unaffected

### New: configure_llm() function + _active_config module variable
- If it is not called, LLMClient() uses from_env() (existing behavior)
- Once called, every LLMClient() instance created afterwards uses the explicit config
- retrieval.text2cypher / local_retriever / router are toggled automatically
  (they all call LLMClient())

### Changed: LLMClient.__init__()
- When the config argument is None, _active_config takes priority, then from_env()
- Explicit LLMClient(config=...) calls are still supported as before
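
The resolution order described above can be sketched as follows. This is a minimal illustration, not the code in agent/llm_client.py: the field names on `LLMConfig` and the concrete `KIMI_MODEL` / `KIMI_BASE_URL` / `KIMI_API_KEY` variable names are assumptions (the commit only says `KIMI_*`).

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LLMConfig:
    model: str
    base_url: str
    api_key: str

    @classmethod
    def from_env(cls) -> "LLMConfig":
        # Assumed variable names under the KIMI_ prefix.
        return cls(
            os.environ.get("KIMI_MODEL", ""),
            os.environ.get("KIMI_BASE_URL", ""),
            os.environ.get("KIMI_API_KEY", ""),
        )

    @classmethod
    def from_args(cls, model: str, base_url: str, api_key_env: str) -> "LLMConfig":
        # api_key_env is the NAME of the environment variable holding the key.
        return cls(model, base_url, os.environ.get(api_key_env, ""))


# Module-level override; stays None unless configure_llm() is called.
_active_config: Optional[LLMConfig] = None


def configure_llm(config: LLMConfig) -> None:
    global _active_config
    _active_config = config


class LLMClient:
    def __init__(self, config: Optional[LLMConfig] = None) -> None:
        # Priority: explicit argument, then configure_llm() override, then env.
        if config is None:
            config = _active_config if _active_config is not None else LLMConfig.from_env()
        self.config = config
```

Because callers such as the router construct `LLMClient()` with no arguments, a single `configure_llm(...)` call at the top of the evaluation script is enough to redirect all of them, while code paths that never call it keep reading the environment.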

## Compatibility

| Environment | Impact |
|---|---|
| prod / W4 qualitative validation (`run_w4_eval.py`) | ❌ No impact: configure_llm() is not called |
| Evaluation entry point (`scripts/run_qa_eval.py`) | ✅ Toggled by calling configure_llm() |
| Direct LLMClient use (tests, etc.) | ❌ No impact |

Refs #127
Ports the four metrics (ROUGE-L / numerical accuracy / Faithfulness / Numerical Faithfulness) from doc-summary
using the same code unchanged. This is the base infrastructure for the doc-graph quantitative
evaluation entry point (scripts/run_qa_eval.py).

## Changes

### eval/metrics/__init__.py (new)
### eval/metrics/rouge_score.py (new)
- ROUGE-1/2/L based on the Korean MeCab tokenizer
- Same as doc-summary (dependencies: rouge-score, python-mecab-ko)

### eval/metrics/numerical_accuracy.py (new)
- Finance-specific number extraction (억원, 조원, %p, bp, ~ ranges, dates excluded)
- Same as doc-summary
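
As an illustration of unit-aware number extraction, here is a minimal sketch. It is not the code in numerical_accuracy.py: the regular expression, the function names, and the scoring rule (share of reference numbers that reappear in the prediction) are all assumptions, and `~` ranges are not handled.

```python
import re
from typing import Optional, Set, Tuple

# Units named in the commit message; "%p" must come before "%" in the alternation.
_UNITS = ("조원", "억원", "%p", "bp", "%")
_NUM_RE = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)\s*(" + "|".join(re.escape(u) for u in _UNITS) + ")"
)


def extract_numbers(text: str) -> Set[Tuple[float, str]]:
    """Return (value, unit) pairs; numbers without a listed unit are ignored,
    which also drops dates such as '2025년 3월'."""
    return {(float(v.replace(",", "")), u) for v, u in _NUM_RE.findall(text)}


def numerical_accuracy(reference: str, prediction: str) -> Optional[float]:
    """Share of reference numbers found in the prediction, or None if the
    reference contains no numbers."""
    ref = extract_numbers(reference)
    if not ref:
        return None
    return len(ref & extract_numbers(prediction)) / len(ref)
```

Matching on the (value, unit) pair rather than the raw string means "1,250억원" and "1250억원" count as the same figure.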

### eval/metrics/faithfulness_judge.py (new)
- LLM-as-Judge: faithfulness / completeness (1-5) / conciseness (1-5)
- Numerical faithfulness: numerical_faithfulness
- configure_judge_llm(): toggled independently of the LLM under measurement
- Claude compatibility (same as doc-summary #127 C7):
  * reasoning_effort branch (OpenAI native only)
  * automatic fallback from strict json_schema → json_object
  * diagnostic logging for empty responses + fallback extraction from reasoning_content
  * automatic removal of markdown code blocks
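
The last bullet (removing markdown code blocks before parsing) could look like the sketch below. The function name `parse_judge_json` and the exact regular expression are illustrative assumptions, not the helper used in faithfulness_judge.py.

```python
import json
import re
from typing import Any

# Built from single backticks so the fence marker never appears literally.
FENCE = "`" * 3
_FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"(?:json)?\s*\n(.*?)\n?\s*" + FENCE + r"\s*$",
    re.DOTALL,
)


def parse_judge_json(raw: str) -> Any:
    """Parse a judge reply as JSON, unwrapping one surrounding code fence
    if the model added it. Raises json.JSONDecodeError on invalid JSON."""
    match = _FENCE_RE.match(raw)
    body = match.group(1) if match else raw
    return json.loads(body)
```

Replies that are already bare JSON pass through untouched, so the same parser works for both the strict-schema and the json_object paths.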

### eval/metrics/prompts/ (new)
- faithfulness_v1.md
- numerical_faithfulness_v1.md
- Same as doc-summary

Refs #127
…skeleton

## VectorRAG QA 40 (eval/dataset/vectorrag_qa.json)

The 40 QA from doc-summary-agent's qa_pairs.json, copied as-is.
- Includes all of the factual / numerical / summary / negative types
- 5 documents (Hanwha / DS / Mirae Asset ×4 / NongHyup / FSS)
- Mentor slide intent: "VectorRAG strength area"

## GraphRAG QA skeleton (eval/dataset/graphrag_qa.json)

Starts as an empty array. The actual 40 QA are written after the graph structure
has been inspected with scripts/explore_graph.py.

## eval/dataset/README.md

Records 6 QA authoring patterns:
- 1-hop relation (7)
- 2-hop multi (7)
- Intersection (7)
- Aggregation / statistics (7)
- Filter + aggregation (6)
- Metadata (6)

The answer to each QA must be verifiable with a Cypher query.

Refs #127
…graph exploration + .env example

VectorRAG QA and GraphRAG QA can now both be evaluated explicitly,
and the graph exploration needed to write the 40 GraphRAG QA is automated as well.

## C5: scripts/run_qa_eval.py (new)

Same metrics as doc-summary-agent's eval/run_eval.py (ROUGE-L /
numerical accuracy / Faithfulness Judge / Numerical Faithfulness Judge), but with
a different answer-generation path (retrieval.route_and_answer).

- --qa-set vectorrag | graphrag | both (default: both)
- --limit N (for dry runs)
- --tag <query-tag>: distinguishes result files
- --llm-* / --judge-*: same pattern as doc-summary
- Records actual_route + retrieval_error + elapsed_seconds for every QA
- The summary shows a per-qa_set breakdown + the routing distribution
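
A minimal sketch of how the flags listed above could be wired with `argparse`. The flag names come from the list and from the example command in the PR description; the helper names `build_parser` and `select_items`, the defaults other than `--qa-set`, and the assumption that each QA item carries a `qa_set` field are illustrative only.

```python
import argparse
from typing import Dict, List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="QA evaluation (sketch)")
    parser.add_argument("--qa-set", choices=["vectorrag", "graphrag", "both"], default="both")
    parser.add_argument("--limit", type=int, default=None, help="evaluate only the first N items")
    parser.add_argument("--tag", default="", help="label used to tell result files apart")
    # The measured LLM and the judge LLM take the same three settings.
    for prefix in ("llm", "judge"):
        parser.add_argument(f"--{prefix}-model")
        parser.add_argument(f"--{prefix}-base-url")
        parser.add_argument(f"--{prefix}-api-key-env")
    return parser


def select_items(items: List[Dict], qa_set: str, limit: Optional[int]) -> List[Dict]:
    """Filter by qa_set ('both' keeps everything), then apply --limit."""
    chosen = [q for q in items if qa_set == "both" or q.get("qa_set") == qa_set]
    return chosen if limit is None else chosen[:limit]
```

With this shape, a dry run is just `--limit 3` on top of the normal invocation.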

## C6: .env.example (new)

Split into three groups: NEO4J / KIMI / OPENROUTER. Includes a reference list of OpenRouter IDs
(gpt-5-mini / deepseek-v3.2 / kimi-k2.5 / grok-4.20 + Claude judge).

## C7: scripts/explore_graph.py (new): the key piece

Embeds the 6 Cypher queries discussed in the previous session as-is:
1. documents: Document + metadata
2. entities: Entity label distribution
3. relations: relationship type distribution
4. top_entities: TOP 20 per label
5. shared_entities: Entities shared across Documents
6. risk_company: Risk-Company co-mentions

Usage: uv run python scripts/explore_graph.py --json eval/dataset/graph_meta.json

→ Write the 40 GraphRAG QA from the output JSON (ground-truth answers).

Refs #127
…added + 5 GraphRAG QA seeds

## Reference survey results

Adopted the standards of the following recent papers / frameworks:
- GraphRAG-Bench (arxiv:2506.02404, ICLR 2026): Accuracy + AR Metric
- RAGAS (https://docs.ragas.io): Answer Correctness + Semantic Similarity + Context Entity Recall
- RAG vs GraphRAG (Han et al. 2025, arxiv:2502.11371): Precision/Recall/F1

Adds the 4 Tier 2 metrics decided in the earlier plan. R Score / AR Metric are
split off as follow-up work after exploration, because writing *gold rationale* is costly.

## C8: eval/metrics/answer_correctness.py + prompts/answer_correctness_v1.md

- Combines RAGAS Answer Correctness + GraphRAG-Bench Accuracy
- LLM-as-Judge semantic-match score on a 1-5 scale
- Addresses ROUGE-L's under-scoring of descriptive answers
- Reuses faithfulness_judge's _active_judge_client / strict_schema fallback

## C9: eval/metrics/semantic_similarity.py

- RAGAS Answer Semantic Similarity pattern
- BAAI/bge-m3 embedding cosine
- Same dependency as doc-summary (sentence-transformers)
- Complements ROUGE-L with a meaning-based score

## C10: eval/metrics/entity_coverage.py

- Variant of RAGAS Context Entities Recall
- Share of answer entities (numbers + Korean nouns) that appear in the prediction
- If the QA item specifies a key_entities field, that takes priority
- Reuses the numerical_accuracy pattern

## C11: eval/metrics/routing_accuracy.py

- Variant of the GraphRAG-Bench AR Metric
- Match between expected_route (specified in the QA) and actual_route (decided by router.py)
- 3 values: t2c / local / community
- Returns None when expected_route is absent (e.g. VectorRAG QA)

## C13: eval/dataset/graphrag_qa.json, 5 seeds

GraphRAG QA schema update, added fields:
- pattern: 1hop / 2hop / intersection / aggregate / filter / metadata
- expected_route: t2c / local / community (for the routing_accuracy metric)
- key_entities: list[str] (for the entity_coverage metric)

5 seeds entered. The remaining 35 are written after reviewing the explore_graph.py results.

Refs #127
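
The two rule-based Tier 2 metrics are simple enough to sketch. The code below is an illustration under stated assumptions, not the contents of entity_coverage.py or routing_accuracy.py: matching is plain substring containment, and only the key_entities path is covered (the fallback that extracts numbers and Korean nouns from the answer is omitted).

```python
from typing import Optional, Sequence


def routing_accuracy(expected_route: Optional[str], actual_route: Optional[str]) -> Optional[float]:
    """1.0 if the router picked the expected route, 0.0 otherwise.
    Returns None when the QA item has no expected_route (e.g. VectorRAG QA)."""
    if not expected_route:
        return None
    return 1.0 if expected_route == actual_route else 0.0


def entity_coverage(key_entities: Sequence[str], prediction: str) -> Optional[float]:
    """Share of key entities that appear verbatim in the prediction.
    Returns None when no key entities are given."""
    if not key_entities:
        return None
    hits = sum(1 for entity in key_entities if entity in prediction)
    return hits / len(key_entities)
```

Returning None rather than 0.0 for inapplicable items keeps them out of the average when the summary aggregates per qa_set.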
## Changes

### evaluate_one(): 4 metrics added
- Answer Correctness (LLM Judge 1-5): RAGAS + GraphRAG-Bench
- Semantic Similarity (bge-m3 cosine): RAGAS
- Entity Coverage (share of answer entities that appear): variant of RAGAS Context Entities Recall
- Routing Accuracy (expected vs actual route): variant of the GraphRAG-Bench AR Metric

### print_summary(): separate Tier 1 / Tier 2 output

For each qa_set:
- Tier 1 (traditional + FineSurE): ROUGE-L / numerical accuracy / Faithfulness / Completeness / Conciseness
- Tier 2 (RAGAS + GraphRAG-Bench): Answer Correctness / Semantic Similarity / Entity Coverage / Routing Accuracy

### --no-semantic argument added

bge-m3 is heavy on CPU, so it can be switched off for dry runs.

### Use of QA fields
- expected_route → routing_accuracy
- key_entities → entity_coverage
- pattern → recorded in the results (1hop / 2hop / intersection, etc.)

## Compatibility

Existing arguments (--llm-* / --judge-* / --qa-set / --limit / --tag) are kept unchanged.
Only the new --no-semantic argument is added.

Refs #127
## QA authoring principles

Reflects my own key insight:
> "If the questions are written purely from graph structure, RAG cannot answer them at all, so the comparison is not fair"

→ GraphRAG-Bench / RAG vs GraphRAG (Han et al.) standard:
- The answer must be present in the source text (so RAG can attempt it too)
- RAG should find it hard to solve, or solve it only partially
- GraphRAG should have a structural advantage

## Pattern distribution (mix of 30 strength + 10 limitation)

| Pattern | Count | Intent |
|---|---|---|
| multi_doc_trend | 10 | Mirae Asset 1Q→4Q trend (needs 4 documents at once) |
| 1hop | 10 | Hanwha Doosan Bobcat, DS per-stock, FSS, NongHyup |
| intersection | 8 | DS + Hanwha macro variables, Mirae Asset + FSS IPO |
| filter_agg | 5 | doc_type filter + aggregation |
| causal | 5 | Cause / condition (Mexico plant → profitability, oil price → nuclear power) |
| limitation | 2 | No time comparison possible, Company label quality |

## Use of graph exploration results

Findings from the explore_graph.py output:
- 4 doc_type values (report 2 / disclosure 1 / filing 1 / ir 4): filter_agg
- HAS_METRIC is plentiful at 270: 1hop
- No doc_year / page_count: limitation
- Company label quality issue: limitation

## Use of the source-text analysis of the 8 documents

- Mirae Asset 1Q~4Q: quarterly time-series metrics (strong for multi_doc_trend)
- Hanwha Doosan Bobcat: 1-hop relations, causal reasoning
- DS market commentary: 1-hop stocks, macro variables
- FSS press release: statistical entities, classification limits
- NongHyup business report: 1-hop business items

## Fields recorded per item

- id, qa_set: "graphrag"
- pattern (1hop/intersection/filter_agg/multi_doc_trend/causal/limitation)
- doc (comma-separated, multiple documents allowed)
- question, answer
- expected_route (local/t2c)
- key_entities (for the Tier 2 entity_coverage metric)
- note (authoring intent)
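
To make the item schema concrete, here is a minimal validation sketch based on the field list above. The validator itself, the `EXAMPLE` item, and all of its values (id, document ids, placeholder texts) are hypothetical and not taken from graphrag_qa.json.

```python
from typing import Dict, List

PATTERNS = {"1hop", "intersection", "filter_agg", "multi_doc_trend", "causal", "limitation"}
ROUTES = {"local", "t2c"}
REQUIRED_FIELDS = (
    "id", "qa_set", "pattern", "doc", "question",
    "answer", "expected_route", "key_entities", "note",
)


def validate_item(item: Dict[str, object]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in item]
    if item.get("qa_set") != "graphrag":
        problems.append("qa_set must be 'graphrag'")
    if item.get("pattern") not in PATTERNS:
        problems.append(f"unknown pattern: {item.get('pattern')!r}")
    if item.get("expected_route") not in ROUTES:
        problems.append(f"unknown expected_route: {item.get('expected_route')!r}")
    if not isinstance(item.get("key_entities"), list):
        problems.append("key_entities must be a list")
    return problems


def doc_ids(item: Dict[str, object]) -> List[str]:
    """Split the comma-separated doc field into individual document ids."""
    return [part.strip() for part in str(item["doc"]).split(",") if part.strip()]


# Hypothetical item: every value here is a placeholder.
EXAMPLE = {
    "id": "graphrag_000",
    "qa_set": "graphrag",
    "pattern": "multi_doc_trend",
    "doc": "mirae_1q, mirae_4q",
    "question": "<question text>",
    "answer": "<reference answer>",
    "expected_route": "local",
    "key_entities": ["<entity>"],
    "note": "<authoring intent>",
}
```

Running such a check over the file before a measurement run catches typos in pattern or expected_route that would otherwise silently skew the routing-accuracy numbers.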

Refs #127
Unified file with the same structure as doc-summary-agent.
scripts/run_qa_eval.py can process all 80 QA by loading a single file.

## QA distribution
- vectorrag: 40 (hanwha 6, ds 5, mirae_1q~4q 19, nonghyup 5, fss 6)
- graphrag: 40 (multi_doc_trend 10, 1hop 10, intersection 8, filter_agg 5, causal 5, limitation 2)

## Measurement flow
Measure the 80 QA on the doc-graph side (--qa-set both):
- VectorRAG 40: single facts, RAG strength area (where the graph is weak)
- GraphRAG 40: multi-document reasoning, graph strength area

→ Compare directly with the 80 QA measurement results from the doc-summary side

Refs #127
## Problem (dryrun_80qa results)

All 5/5 VectorRAG QA were wrongly routed to t2c:
- hanwha_001 (target price) → t2c (rows=0)
- hanwha_002 (operating profit) → t2c (rows=0)
- hanwha_003 (dealer inventory) → t2c (rows=0)
- hanwha_004 (basis for the investment opinion) → t2c (Cypher rejected)
- hanwha_005 (Mexico plant) → t2c (rows=0)

Result: ROUGE-L 0.05, Faithfulness 0/5, Correctness 1/5

## Cause

The existing router_v1.md said "the default is t2c".
The LLM router applied the "figures / facts = Cypher" pattern too aggressively.
However, because of Metric label quality issues in my graph, t2c returns nothing.

## Update (v2)

- **Default changed to local**: single-entity facts / relations / attributes all go to local
- **t2c is restricted to *obvious* top-N / aggregation / filter / metadata comparison questions**
- Strengthened with 10 examples (5 in v1 → 10 in v2)

## Expected impact

- VectorRAG QA 40 → mostly routed to local → answerable from chunk text
- GraphRAG QA multi_doc_trend → local (quarter-by-quarter tracking)
- GraphRAG QA filter_agg → t2c (metadata aggregation)
- GraphRAG QA limitation → t2c (limitation area)

## Compatibility

router.py hard-codes ROUTER_PROMPT_PATH = PROMPTS_DIR / "router_v1.md", so
updating router_v1.md in place would be the safest option (no change to router.py needed).

→ This PR adds router_v2.md and updates ROUTER_PROMPT_PATH in router.py (separate commit).

Refs #127
## Changes

1. **DEFAULT_ROUTE: "t2c" → "local"**: the fallback default of the LLM router
2. **ROUTER_PROMPT_PATH: router_v1.md → router_v2.md**: prompt that emphasizes the local default
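
A fallback default of this kind can be sketched as below. The constant names follow the commit message and the three route values come from the routing_accuracy description; the helper `normalize_route` is hypothetical and does not claim to mirror how decide_route in router.py parses the LLM output.

```python
from typing import Optional

VALID_ROUTES = ("t2c", "local", "community")
DEFAULT_ROUTE = "local"  # was "t2c" before this change


def normalize_route(raw: Optional[str]) -> str:
    """Map a raw router reply to a valid route, falling back to DEFAULT_ROUTE
    when the reply is empty or not one of the known values."""
    candidate = (raw or "").strip().lower()
    return candidate if candidate in VALID_ROUTES else DEFAULT_ROUTE
```

With the default set to local, an unparseable router reply degrades to chunk-based retrieval instead of a Cypher query that may return zero rows.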

## Background (dryrun_80qa results)

All 5/5 VectorRAG QA were wrongly routed to t2c:
- hanwha_001~005 all went to t2c (rows=0) → Faithfulness 0/5, Correctness 1/5

Cause: the existing router_v1.md said "the default is t2c", and the LLM applied the "figures = Cypher" pattern strongly.
However, because of Metric label quality issues in my graph, t2c returns nothing.

## Impact

- VectorRAG QA 40 → mostly routed to local → answerable from chunk text
- GraphRAG QA multi_doc_trend → local (quarter-by-quarter tracking of a single entity)
- GraphRAG QA filter_agg → t2c (metadata aggregation, an obvious t2c case)

## Docstring updates

- Added a "5/25 v2" section
- Updated the DEFAULT_ROUTE / ROUTER_PROMPT_PATH / decide_route docstrings

Refs #127
@TaskerJang TaskerJang merged commit a121f81 into dev May 25, 2026
TaskerJang added a commit that referenced this pull request May 25, 2026
PR #54 was merged with base=feat/qa-eval-dual-system, but by then PR #53
(qa-eval-dual-system → dev) had already been merged, so the changes in PR #54 never
reached dev.

Because the OpenAI + Kimi measurement runs are to be started before going to bed, the patch was applied directly to dev:

1. agent/llm_client.py: LLMConfig.is_openrouter property + _REASONING_OFF_BODY
   triple safety (max_tokens:1 / enabled:false / exclude:true) + reasoning_content
   fallback + empty-response diagnostics

2. eval/metrics/faithfulness_judge.py: _use_openrouter_extras flag +
   triple _REASONING_OFF_BODY + TIMEOUT 60s + fallback on both the reasoning and
   reasoning_content fields

3. eval/metrics/answer_correctness.py: reuses faithfulness_judge's _REASONING_OFF_BODY
   + same pattern

Compatibility:
- Direct OpenAI connection (configure_llm not called) → no impact
- Direct KIMI_* connection (prod / W4) → no impact (is_openrouter is False)
- Via OpenRouter → reasoning OFF is applied automatically
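
The fallback behavior described in this patch can be sketched on plain dictionaries. Both helpers are illustrative assumptions: the detection rule in `is_openrouter` (a substring check on the base URL) and the field order in `extract_text` are guesses at the intent, not the code in agent/llm_client.py.

```python
from typing import Mapping, Optional


def is_openrouter(base_url: Optional[str]) -> bool:
    # Assumed detection rule: the base URL points at openrouter.ai.
    return "openrouter.ai" in (base_url or "")


def extract_text(message: Mapping[str, object]) -> str:
    """Return the first non-empty text among content, reasoning_content and
    reasoning; an empty string signals an empty response to the caller."""
    for field in ("content", "reasoning_content", "reasoning"):
        value = message.get(field)
        if isinstance(value, str) and value.strip():
            return value.strip()
    return ""
```

A caller can log a diagnostic whenever `extract_text` returns an empty string, which matches the empty-response diagnostics mentioned above.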