[Feat] #127 dual-system QA evaluation infrastructure #53
Merged
scripts/run_qa_eval.py can now switch LLMs from the entry point via the --llm-model / --llm-base-url / --llm-api-key-env arguments. Same pattern as doc-summary-agent.

## Changes

### New: LLMConfig.from_args() class method
- Builds an LLMConfig from explicit model / base_url / api_key_env values
- Kept separate from from_env() (KIMI_*), so prod and the W4 qualitative validation behave as before

### New: configure_llm() function + _active_config module variable
- If it is never called, LLMClient() uses from_env() (existing behavior)
- Once called, every subsequent LLMClient() instance uses the explicit config
- retrieval.text2cypher / local_retriever / router switch automatically (they all call LLMClient())

### Changed: LLMClient.__init__()
- When the config argument is None, _active_config takes priority, then from_env()
- Explicit LLMClient(config=...) calls are still supported

## Compatibility

| Environment | Impact |
|---|---|
| prod / W4 qualitative validation (`run_w4_eval.py`) | None: configure_llm() is not called |
| Evaluation entry point (`scripts/run_qa_eval.py`) | Switched by calling configure_llm() |
| Direct LLMClient use (tests, etc.) | None |

Refs #127
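The config precedence described in this commit can be sketched as follows. This is a minimal illustration, not the contents of agent/llm_client.py: the field layout of LLMConfig, the KIMI_* variable names and their defaults, and the exact signature of configure_llm() are assumptions made for the example.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LLMConfig:
    model: str
    base_url: str
    api_key_env: str  # name of the env var holding the key, not the key itself

    @classmethod
    def from_env(cls) -> "LLMConfig":
        # Hypothetical KIMI_* variable names and defaults, for illustration only.
        return cls(
            model=os.environ.get("KIMI_MODEL", "kimi-default"),
            base_url=os.environ.get("KIMI_BASE_URL", "https://example.invalid/v1"),
            api_key_env="KIMI_API_KEY",
        )

    @classmethod
    def from_args(cls, model: str, base_url: str, api_key_env: str) -> "LLMConfig":
        # Explicit values from the CLI; never touches the KIMI_* variables.
        return cls(model=model, base_url=base_url, api_key_env=api_key_env)


# Module-wide override; None means "fall back to from_env()".
_active_config: Optional[LLMConfig] = None


def configure_llm(config: Optional[LLMConfig]) -> None:
    """Set (or clear, with None) the config used by later LLMClient() calls."""
    global _active_config
    _active_config = config


class LLMClient:
    def __init__(self, config: Optional[LLMConfig] = None) -> None:
        # Precedence: explicit argument, then configure_llm(), then from_env().
        self.config = config or _active_config or LLMConfig.from_env()
```

Because the override lives at module level, callers such as the retrievers and the router pick it up without any change to their own code, which is the behavior the commit relies on.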
Brings in the four metrics (ROUGE-L / numerical accuracy / Faithfulness / Numerical Faithfulness) with the same code as doc-summary. This is the base infrastructure for the doc-graph quantitative evaluation entry point (scripts/run_qa_eval.py).

## Changes

### eval/metrics/__init__.py (new)

### eval/metrics/rouge_score.py (new)
- ROUGE-1/2/L based on the Korean MeCab tokenizer
- Same as doc-summary (dependencies: rouge-score, python-mecab-ko)

### eval/metrics/numerical_accuracy.py (new)
- Finance-specific number extraction (억원, 조원, %p, bp, ~ ranges; dates excluded)
- Same as doc-summary

### eval/metrics/faithfulness_judge.py (new)
- LLM-as-Judge: faithfulness / completeness (1-5) / conciseness (1-5)
- Numerical fidelity: numerical_faithfulness
- configure_judge_llm(): toggled independently of the LLM under measurement
- Claude compatibility (taken as-is from doc-summary #127 C7):
  * reasoning_effort branch (OpenAI native only)
  * automatic fallback from strict json_schema to json_object
  * diagnostic logging for empty responses + fallback extraction from reasoning_content
  * automatic removal of markdown code blocks

### eval/metrics/prompts/ (new)
- faithfulness_v1.md
- numerical_faithfulness_v1.md
- Same as doc-summary

Refs #127
… skeleton

## VectorRAG QA 40 (eval/dataset/vectorrag_qa.json)

The 40 QA pairs from doc-summary-agent's qa_pairs.json, copied as-is.
- Covers all of the factual / numerical / summary / negative types
- 5 document sources (Hanwha / DS / Mirae Asset ×4 / Nonghyup / FSS)
- Mentor slide intent: "VectorRAG strength area"

## GraphRAG QA skeleton (eval/dataset/graphrag_qa.json)

Starts as an empty array. The actual 40 QA pairs are written after the graph structure has been surveyed with scripts/explore_graph.py.

## eval/dataset/README.md

Records the 6 QA authoring patterns:
- 1-hop relation (7)
- 2-hop multi-relation (7)
- Intersection (7)
- Aggregation / statistics (7)
- Filter + aggregation (6)
- Metadata (6)

The answer to every QA must be verifiable with a Cypher query.

Refs #127
… graph exploration + .env example

Both the VectorRAG QA set and the GraphRAG QA set can now be evaluated explicitly, and the graph exploration needed to write the 40 GraphRAG QA pairs is automated.

## C5: scripts/run_qa_eval.py (new)

Same metrics as doc-summary-agent's eval/run_eval.py (ROUGE-L / numerical accuracy / Faithfulness Judge / Numerical Faithfulness Judge), with a different answer-generation path (retrieval.route_and_answer).
- --qa-set vectorrag | graphrag | both (default: both)
- --limit N (for dry runs)
- --tag <query-tag>: distinguishes result files
- --llm-* / --judge-*: same pattern as doc-summary
- Records actual_route + retrieval_error + elapsed_seconds for every QA
- The summary shows a per-qa_set breakdown and the routing distribution

## C6: .env.example (new)

Split into three groups: NEO4J / KIMI / OPENROUTER. Includes a reference list of OpenRouter model IDs (gpt-5-mini / deepseek-v3.2 / kimi-k2.5 / grok-4.20 + a Claude judge).

## C7: scripts/explore_graph.py (new, the key piece)

Embeds the 6 Cypher queries discussed in the previous session as-is:
1. documents: Document + metadata
2. entities: Entity label distribution
3. relations: relation type distribution
4. top_entities: top 20 per label
5. shared_entities: Entities shared across Documents
6. risk_company: Risk-Company co-mentions

Usage: uv run python scripts/explore_graph.py --json eval/dataset/graph_meta.json
→ write the 40 GraphRAG QA pairs from the output JSON (answers as ground truth).

Refs #127
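As a rough illustration of the C5 command-line surface, the three selection flags could be wired up with argparse as below. Only the flag names and the default of both come from the commit message; the help strings, the select_items helper, and the assumption that each QA item is a dict with a qa_set key are hypothetical.

```python
import argparse
from typing import Any, Dict, List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="run_qa_eval.py")
    parser.add_argument(
        "--qa-set", choices=["vectorrag", "graphrag", "both"], default="both"
    )
    parser.add_argument(
        "--limit", type=int, default=None, help="evaluate only the first N items"
    )
    parser.add_argument(
        "--tag", default="", help="label used to tell result files apart"
    )
    return parser


def select_items(
    items: List[Dict[str, Any]], qa_set: str, limit: Optional[int] = None
) -> List[Dict[str, Any]]:
    """Keep the items of the requested set, then apply the dry-run limit."""
    chosen = [q for q in items if qa_set == "both" or q.get("qa_set") == qa_set]
    return chosen if limit is None else chosen[:limit]
```

A dry run over two GraphRAG items would then correspond to `--qa-set graphrag --limit 2`.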
… added + 5 GraphRAG QA seeds

## Reference survey results

Adopted the standards from these recent papers and frameworks:
- GraphRAG-Bench (arxiv:2506.02404, ICLR 2026): Accuracy + AR Metric
- RAGAS (https://docs.ragas.io): Answer Correctness + Semantic Similarity + Context Entity Recall
- RAG vs GraphRAG (Han et al. 2025, arxiv:2502.11371): Precision/Recall/F1

Adds the 4 Tier 2 metrics decided in the earlier plan. R Score / AR Metric require writing *gold rationale* annotations, so they are split off into follow-up work after exploration.

## C8: eval/metrics/answer_correctness.py + prompts/answer_correctness_v1.md

- Combines RAGAS Answer Correctness with GraphRAG-Bench Accuracy
- LLM-as-Judge semantic agreement score on a 1-5 scale
- Addresses ROUGE-L's weakness on descriptive answers
- Reuses _active_judge_client and the strict_schema fallback from faithfulness_judge

## C9: eval/metrics/semantic_similarity.py

- RAGAS Answer Semantic Similarity pattern
- Cosine similarity over BAAI/bge-m3 embeddings
- Same dependency as doc-summary (sentence-transformers)
- Complements ROUGE-L with a meaning-based score

## C10: eval/metrics/entity_coverage.py

- Variant of RAGAS Context Entities Recall
- Fraction of the reference answer's entities (numbers + Korean nouns) that appear in the prediction
- If a QA item carries a key_entities field, that list takes priority
- Reuses the numerical_accuracy patterns

## C11: eval/metrics/routing_accuracy.py

- Variant of the GraphRAG-Bench AR Metric
- Checks whether expected_route (stated in the QA item) matches actual_route (decided by router.py)
- 3 values: t2c / local / community
- Returns None when expected_route is missing, as in the VectorRAG QA items

## C13: eval/dataset/graphrag_qa.json, 5 seeds

GraphRAG QA schema update, with these added fields:
- pattern: 1hop / 2hop / intersection / aggregate / filter / metadata
- expected_route: t2c / local / community (for the routing_accuracy metric)
- key_entities: list[str] (for the entity_coverage metric)

5 seeds entered. The remaining 35 are written after reviewing the explore_graph.py output.

Refs #127
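The routing-accuracy rule in C11 (a match scores 1, a mismatch scores 0, and items without expected_route are excluded) can be sketched as below. The function names and the aggregation helper are illustrative assumptions, not the contents of routing_accuracy.py.

```python
from typing import Iterable, Optional

VALID_ROUTES = {"t2c", "local", "community"}


def routing_accuracy(
    expected_route: Optional[str], actual_route: Optional[str]
) -> Optional[float]:
    """1.0 on a match, 0.0 on a mismatch, None when no route is expected."""
    if not expected_route:
        # e.g. VectorRAG QA items, which carry no expected_route.
        return None
    if expected_route not in VALID_ROUTES:
        raise ValueError(f"unknown expected_route: {expected_route!r}")
    return 1.0 if actual_route == expected_route else 0.0


def mean_routing_accuracy(scores: Iterable[Optional[float]]) -> Optional[float]:
    """Average over scored items only; None if nothing was scored."""
    scored = [s for s in scores if s is not None]
    return sum(scored) / len(scored) if scored else None
```

Returning None rather than 0.0 for unlabeled items keeps them from dragging the average down when both QA sets are evaluated in one run.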
## Changes

### evaluate_one(): 4 metrics added
- Answer Correctness (LLM Judge 1-5): RAGAS + GraphRAG-Bench
- Semantic Similarity (bge-m3 cosine): RAGAS
- Entity Coverage (rate at which reference entities appear): variant of RAGAS Context Entities Recall
- Routing Accuracy (expected vs actual route): variant of the GraphRAG-Bench AR Metric

### print_summary(): Tier 1 / Tier 2 printed separately

For each qa_set:
- Tier 1 (traditional + FineSurE): ROUGE-L / numerical accuracy / Faithfulness / Completeness / Conciseness
- Tier 2 (RAGAS + GraphRAG-Bench): Answer Correctness / Semantic Similarity / Entity Coverage / Routing Accuracy

### --no-semantic argument added

bge-m3 is heavy on CPU, so it can be switched off for dry runs.

### QA fields used
- expected_route → routing_accuracy
- key_entities → entity_coverage
- pattern → carried into the results (1hop / 2hop / intersection, etc.)

## Compatibility

Existing arguments (--llm-* / --judge-* / --qa-set / --limit / --tag) are unchanged. Only the new --no-semantic argument is added.

Refs #127
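For the key_entities path of Entity Coverage, a minimal sketch follows. It assumes a simple normalized substring match, which is an illustrative choice; the real entity_coverage.py also extracts numbers and Korean nouns from the reference answer when key_entities is absent, and that path is not reproduced here.

```python
from typing import List, Optional


def _normalize(text: str) -> str:
    # Lowercase and drop all whitespace so spacing differences do not matter.
    return "".join(text.lower().split())


def entity_coverage(prediction: str, key_entities: List[str]) -> Optional[float]:
    """Fraction of key entities that occur in the prediction (None if no entities)."""
    if not key_entities:
        return None
    haystack = _normalize(prediction)
    hits = sum(1 for entity in key_entities if _normalize(entity) in haystack)
    return hits / len(key_entities)
```

For example, a prediction that mentions two of three listed entities scores 2/3, regardless of how long or fluent the rest of the answer is.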
## QA authoring principles

Reflects the author's own key insight:
> "If QA pairs are written purely from the graph structure, RAG cannot answer them → not a fair comparison"

Standard from GraphRAG-Bench / RAG vs GraphRAG (Han et al.):
- The answer must be present in the source text (so RAG can attempt it too)
- RAG should find it hard or be able to solve only part of it
- GraphRAG should have a structural advantage

## Pattern distribution (mixed: 30 strength + 10 limitation)

| Pattern | Count | Intent |
|---|---|---|
| multi_doc_trend | 10 | Mirae Asset 1Q→4Q trend (needs 4 documents at once) |
| 1hop | 10 | Hanwha Doosan Bobcat, DS per-stock, FSS, Nonghyup |
| intersection | 8 | DS + Hanwha macro variables, Mirae Asset + FSS IPO |
| filter_agg | 5 | doc_type filter + aggregation |
| causal | 5 | Cause / condition (Mexico plant → profitability, oil-price effects) |
| limitation | 2 | Time comparison not possible, Company label quality |

## Use of graph exploration results

Findings from the explore_graph.py output:
- 4 doc_type values (report 2 / disclosure 1 / filing 1 / ir 4) → filter_agg
- 270 HAS_METRIC relations, plentiful → 1hop
- No doc_year / page_count → limitation
- Company label quality issue → limitation

## Use of source-text analysis of the 8 documents

- Mirae Asset 1Q~4Q: quarterly time-series metrics (strong fit for multi_doc_trend)
- Hanwha Doosan Bobcat: 1-hop relations, causal reasoning
- DS market report: 1-hop stocks, macro variables
- FSS press release: statistical entities, classification limits
- Nonghyup business report: 1-hop business items

## Fields in each item

- id, qa_set: "graphrag"
- pattern (1hop / intersection / filter_agg / multi_doc_trend / causal / limitation)
- doc (comma-separated, multiple documents allowed)
- question, answer
- expected_route (local / t2c)
- key_entities (for the Tier 2 entity_coverage metric)
- note (authoring intent)

Refs #127
Unified file with the same structure as doc-summary-agent. scripts/run_qa_eval.py can process all 80 QA pairs by loading a single file.

## QA distribution

- vectorrag: 40 (hanwha 6, ds 5, mirae_1q~4q 19, nonghyup 5, fss 6)
- graphrag: 40 (multi_doc_trend 10, 1hop 10, intersection 8, filter_agg 5, causal 5, limitation 2)

## Measurement flow

80 QA pairs are measured on the doc-graph side (--qa-set both):
- VectorRAG 40: single facts, the area where RAG is strong (and the graph is weak)
- GraphRAG 40: multi-document reasoning, the area where the graph is strong

→ Compared directly with the 80-QA measurement results on the doc-summary side

Refs #127
## Problem (dryrun_80qa results)

All 5 of 5 VectorRAG QA items were wrongly routed to t2c:
- hanwha_001 (target price) → t2c (rows=0)
- hanwha_002 (operating profit) → t2c (rows=0)
- hanwha_003 (dealer inventory) → t2c (rows=0)
- hanwha_004 (rationale for the investment opinion) → t2c (Cypher rejected)
- hanwha_005 (Mexico plant) → t2c (rows=0)

Result: ROUGE-L 0.05, Faithfulness 0/5, Correctness 1/5

## Cause

The existing router_v1.md said "the default is t2c". The LLM router applied the "number / fact = Cypher" pattern too aggressively, but because of the Metric label quality issue in this graph, t2c finds nothing.

## Update (v2)

- **Default changed to local**: single-entity facts / relations / attributes all go to local
- **t2c restricted to *clear-cut* top-N / aggregation / filter / metadata comparison**
- Examples expanded to 10 (5 in v1 → 10 in v2)

## Expected impact

- VectorRAG QA 40 → mostly routed to local → answerable from chunk text
- GraphRAG QA multi_doc_trend → local (quarter-by-quarter tracking)
- GraphRAG QA filter_agg → t2c (metadata aggregation)
- GraphRAG QA limitation → t2c (limitation area)

## Compatibility

router.py hardcodes ROUTER_PROMPT_PATH = PROMPTS_DIR / "router_v1.md", so updating router_v1.md in place would be the safest option (no change to router.py needed).
→ This PR adds router_v2.md and updates ROUTER_PROMPT_PATH in router.py (separate commit).

Refs #127
## Changes

1. **DEFAULT_ROUTE: "t2c" → "local"**: fallback default for the LLM router
2. **ROUTER_PROMPT_PATH: router_v1.md → router_v2.md**: prompt that emphasizes the local default

## Background (dryrun_80qa results)

All 5 of 5 VectorRAG QA items were wrongly routed to t2c:
- hanwha_001~005 all t2c (rows=0) → Faithfulness 0/5, Correctness 1/5

Cause: the existing router_v1.md said "the default is t2c", and the LLM applied the "number = Cypher" pattern strongly. Because of the Metric label quality issue in this graph, t2c finds nothing.

## Impact

- VectorRAG QA 40 → mostly routed to local → answerable from chunk text
- GraphRAG QA multi_doc_trend → local (quarter-by-quarter tracking of a single entity)
- GraphRAG QA filter_agg → t2c (metadata aggregation, a clear-cut t2c case)

## Docstring updates

- "5/25 v2 update" section added
- DEFAULT_ROUTE / ROUTER_PROMPT_PATH / decide_route docstrings all updated

Refs #127
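The effect of switching the fallback default can be illustrated with a small sketch. The helper name route_from_reply and the idea that the router LLM answers with a bare route token are assumptions for the example; the real decide_route in router.py is not shown in this PR text.

```python
from typing import Optional

DEFAULT_ROUTE = "local"  # was "t2c" before this commit
VALID_ROUTES = ("t2c", "local", "community")


def route_from_reply(llm_reply: Optional[str]) -> str:
    """Map a raw router-LLM reply to a route, falling back to DEFAULT_ROUTE."""
    if not llm_reply:
        return DEFAULT_ROUTE
    token = llm_reply.strip().lower()
    # Anything that is not exactly one of the known routes uses the fallback.
    return token if token in VALID_ROUTES else DEFAULT_ROUTE
```

With the fallback set to local, an empty or malformed router reply now lands on chunk-text retrieval instead of a Cypher query that may return zero rows.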
TaskerJang added a commit that referenced this pull request May 25, 2026
PR #54 was merged with base=feat/qa-eval-dual-system, but by then PR #53 (qa-eval-dual-system → dev) had already been merged, so the PR #54 changes never reached dev. Because the OpenAI + Kimi measurements are still to be run, the patch is applied directly to dev:

1. agent/llm_client.py: LLMConfig.is_openrouter property + _REASONING_OFF_BODY triple safeguard (max_tokens:1 / enabled:false / exclude:true) + reasoning_content fallback + empty-response diagnostics
2. eval/metrics/faithfulness_judge.py: _use_openrouter_extras flag + _REASONING_OFF_BODY triple safeguard + TIMEOUT 60s + fallback to both the reasoning and reasoning_content fields
3. eval/metrics/answer_correctness.py: reuses _REASONING_OFF_BODY from faithfulness_judge, same pattern

Compatibility:
- Direct OpenAI connection (configure_llm not called) → no impact
- Direct KIMI_* connection (prod / W4) → no impact (is_openrouter is False)
- Via OpenRouter → reasoning OFF applied automatically
Purpose
Build a quantitative QA evaluation infrastructure in doc-graph-agent that follows the same pattern as doc-summary-agent. Measuring both systems on the same QA set makes a RAG vs GraphRAG comparison possible.
QA design: 80 QA dual evaluation
Ask both systems the same 80 QA pairs → a matrix of strong and weak areas.
Commit summary
- 9ae50597: agent/llm_client.py, LLM toggle (configure_llm() + from_args())
- 724def9c: eval/metrics/, rouge / numerical / faithfulness_judge + prompts
- 82f420c8: eval/dataset/vectorrag_qa.json (40) + graphrag_qa.json (empty array) + README
- 430b32ac: scripts/run_qa_eval.py + .env.example + scripts/explore_graph.py
Compatibility
- … (run_w4_eval.py)
- … (scripts/run_qa_eval.py)
- … (scripts/explore_graph.py)
Example measurement command
Next steps (PENDING)
- Run scripts/explore_graph.py --json eval/dataset/graph_meta.json

Refs #127