[Feat] #127 dual-system QA evaluation infrastructure by TaskerJang · Pull Request #53 · TaskerJang/doc-graph-agent

TaskerJang · 2026-05-25T03:16:28Z

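Purpose

Build quantitative QA evaluation infrastructure for doc-graph-agent, following the same pattern as doc-summary-agent. Measuring both systems on the same QA set makes a RAG vs GraphRAG comparison possible.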

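QA design: 80-QA dual evaluation

Category	Source	Count	Intent
VectorRAG QA	Copied from the doc-summary 40 QA	40	Facts/figures (RAG strength)
GraphRAG QA	Newly written (1-hop/2-hop/intersection/aggregation/filter/metadata)	40	Relations/integration (Graph strength)

Both systems are asked the same 80 QA, producing a matrix of strong and weak areas.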

Commit summary

C#	Commit	Contents
C1	`9ae50597`	`agent/llm_client.py` LLM toggle (configure_llm() + from_args())
C2	`724def9c`	`eval/metrics/`: rouge/numerical/faithfulness_judge + prompts
C3+C4	`82f420c8`	`eval/dataset/vectorrag_qa.json` (40) + `graphrag_qa.json` (empty array) + README
C5+C6+C7	`430b32ac`	`scripts/run_qa_eval.py` + `.env.example` + `scripts/explore_graph.py`

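Compatibility

Environment	Impact
prod / W4 qualitative verification (`run_w4_eval.py`)	❌ No impact: KIMI_* is used when configure_llm() is not called
New entry point (`scripts/run_qa_eval.py`)	✅ Toggle across 4 OpenRouter LLMs × Claude judge
New (`scripts/explore_graph.py`)	✅ Runs 6 Neo4j queries automatically + JSON output

Example command for the full measurement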

# DeepSeek V3.2 × 80 QA (VectorRAG 40 + GraphRAG 40)
uv run python scripts/run_qa_eval.py \
  --llm-model "deepseek/deepseek-v3.2" \
  --llm-base-url "https://openrouter.ai/api/v1" \
  --llm-api-key-env "OPENROUTER_API_KEY" \
  --judge-model "anthropic/claude-haiku-4.5" \
  --judge-base-url "https://openrouter.ai/api/v1" \
  --judge-api-key-env "OPENROUTER_API_KEY" \
  --qa-set both \
  --tag deepseek_both

Next steps (PENDING)

Run scripts/explore_graph.py --json eval/dataset/graph_meta.json
Write the 40 GraphRAG QA from graph_meta.json and fill in graphrag_qa.json
Verify with a dry run (--limit 3)
Full measurement: 4 LLMs × 80 QA

Refs #127

The scripts/run_qa_eval.py entry point can now toggle the LLM via the --llm-model / --llm-base-url / --llm-api-key-env arguments. Same pattern as doc-summary-agent.

## Changes

### New: LLMConfig.from_args() class method
- Builds an LLMConfig from an explicitly passed model / base_url / api_key_env
- Kept separate from from_env() (KIMI_*), so prod / W4 qualitative verification behavior is unaffected

### New: configure_llm() function + _active_config module variable
- If it is never called, LLMClient() uses from_env() (existing behavior)
- Once called, every subsequent LLMClient() instance uses the explicit config
- retrieval.text2cypher / local_retriever / router toggle automatically (they all call LLMClient())

### Changed: LLMClient.__init__()
- When the config argument is None, _active_config takes priority, then from_env()
- Explicit LLMClient(config=...) calls are still supported

## Compatibility

| Environment | Impact |
|---|---|
| prod / W4 qualitative verification (`run_w4_eval.py`) | ❌ No impact: configure_llm() is not called |
| Evaluation entry point (`scripts/run_qa_eval.py`) | ✅ Toggled by calling configure_llm() |
| Direct LLMClient use (tests, etc.) | ❌ No impact |

Refs #127
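The precedence described above (explicit argument, then the module-level config, then the environment) can be sketched as follows. This is a minimal illustration, not the repository's code: the field names, the `configure_llm()` signature, and the `KIMI_MODEL` / `KIMI_BASE_URL` / `KIMI_API_KEY` variable names are assumptions, since the commit message only says "KIMI_*".

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LLMConfig:
    model: str
    base_url: str
    api_key: str

    @classmethod
    def from_env(cls) -> "LLMConfig":
        # Assumed variable names; the source only mentions "KIMI_*".
        return cls(
            model=os.environ.get("KIMI_MODEL", ""),
            base_url=os.environ.get("KIMI_BASE_URL", ""),
            api_key=os.environ.get("KIMI_API_KEY", ""),
        )

    @classmethod
    def from_args(cls, model: str, base_url: str, api_key_env: str) -> "LLMConfig":
        # api_key_env is the NAME of the environment variable that holds the key.
        return cls(model=model, base_url=base_url,
                   api_key=os.environ.get(api_key_env, ""))


# Module-level override; None means "fall back to the environment".
_active_config: Optional[LLMConfig] = None


def configure_llm(config: Optional[LLMConfig]) -> None:
    """Set (or clear, with None) the config used by later LLMClient() calls."""
    global _active_config
    _active_config = config


class LLMClient:
    def __init__(self, config: Optional[LLMConfig] = None) -> None:
        # Precedence: explicit argument > configure_llm() > from_env().
        if config is not None:
            self.config = config
        elif _active_config is not None:
            self.config = _active_config
        else:
            self.config = LLMConfig.from_env()
```

Because the override lives at module level, callers that construct `LLMClient()` with no arguments pick up the toggle without any code change, which is why an entry point that never calls `configure_llm()` keeps its existing behavior.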

Ports the four metrics (ROUGE-L / numerical accuracy / Faithfulness / Numerical Faithfulness) from doc-summary with the code unchanged. This is the base infrastructure for the doc-graph quantitative evaluation entry point (scripts/run_qa_eval.py).

## Changes

### eval/metrics/__init__.py (new)

### eval/metrics/rouge_score.py (new)
- ROUGE-1/2/L based on the Korean MeCab tokenizer
- Identical to doc-summary (dependencies: rouge-score, python-mecab-ko)

### eval/metrics/numerical_accuracy.py (new)
- Finance-specific number extraction (억원 (hundred million KRW), 조원 (trillion KRW), %p, bp, ~ ranges, dates excluded)
- Identical to doc-summary

### eval/metrics/faithfulness_judge.py (new)
- LLM-as-Judge: faithfulness / completeness (1-5) / conciseness (1-5)
- Numerical faithfulness: numerical_faithfulness
- configure_judge_llm(): toggled independently of the LLM under measurement
- Claude compatibility (carried over as-is from doc-summary #127 C7):
  * reasoning_effort branch (OpenAI native only)
  * automatic fallback from strict json_schema to json_object
  * diagnostic logging for empty responses + fallback extraction from reasoning_content
  * automatic removal of markdown code blocks

### eval/metrics/prompts/ (new)
- faithfulness_v1.md
- numerical_faithfulness_v1.md
- Identical to doc-summary

Refs #127
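One of the compatibility items above, removing a markdown code block that a judge model wraps around its JSON, can be sketched like this. The function names are hypothetical and the regular expression is only one possible approach; the actual judge code may handle more cases.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep this example readable

# Matches a response that consists of exactly one fenced block,
# with an optional language tag after the opening fence.
_FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[\w-]*[ \t]*\n(.*?)\n?[ \t]*" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_code_fence(text: str) -> str:
    """Return the inner content of a fenced block, or the trimmed text as-is."""
    match = _FENCE_RE.match(text)
    return match.group(1).strip() if match else text.strip()


def parse_judge_json(raw: str) -> dict:
    """Parse judge output that may or may not be wrapped in a code fence."""
    return json.loads(strip_code_fence(raw))
```

Stripping before parsing keeps the parser itself simple: the same `json.loads` call serves models that return bare JSON and models that add a fence.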

…skeleton

## VectorRAG QA 40 (eval/dataset/vectorrag_qa.json)

The 40 QA from doc-summary-agent's qa_pairs.json, copied as-is.
- Covers all of the factual / numerical / summary / negative types
- 5 documents (Hanwha/DS/Mirae Asset×4/NongHyup/FSS)
- Intent for the mentor slide: "VectorRAG strength area"

## GraphRAG QA skeleton (eval/dataset/graphrag_qa.json)

Starts as an empty array. The actual 40 QA will be written after mapping the graph structure with scripts/explore_graph.py.

## eval/dataset/README.md

Records the six QA authoring patterns:
- 1-hop relation (7)
- 2-hop multi (7)
- Intersection (7)
- Aggregation/statistics (7)
- Filter + aggregation (6)
- Metadata (6)

The answer to each QA must be verifiable with a Cypher query.

Refs #127

…graph exploration + .env example

Both VectorRAG QA and GraphRAG QA can now be evaluated explicitly, and the graph exploration needed to write the 40 GraphRAG QA is automated as well.

## C5: scripts/run_qa_eval.py (new)

Same metrics as doc-summary-agent's eval/run_eval.py (ROUGE-L / numerical accuracy / Faithfulness Judge / Numerical Faithfulness Judge), but a different answer generation path (retrieval.route_and_answer).
- --qa-set vectorrag | graphrag | both (default both)
- --limit N (for dry runs)
- --tag <query tag>: distinguishes result files
- --llm-* / --judge-*: same pattern as doc-summary
- Records actual_route + retrieval_error + elapsed_seconds for every QA
- The summary shows a per-qa_set breakdown + the routing distribution

## C6: .env.example (new)

Split into three groups: NEO4J / KIMI / OPENROUTER. Includes a reference list of OpenRouter IDs (gpt-5-mini / deepseek-v3.2 / kimi-k2.5 / grok-4.20 + Claude judge).

## C7: scripts/explore_graph.py (new, the key piece)

Embeds the six Cypher queries discussed in the previous session as-is:
1. documents: Document + metadata
2. entities: Entity label distribution
3. relations: relationship type distribution
4. top_entities: top 20 per label
5. shared_entities: Entities shared across Documents
6. risk_company: Risk-Company co-mentions

Usage: uv run python scripts/explore_graph.py --json eval/dataset/graph_meta.json
→ Write the 40 GraphRAG QA from the output JSON (ground-truth answers).

Refs #127
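The command-line surface listed under C5 could be declared with `argparse` roughly as below. The flag names come from the commit message; the defaults for `--limit` and `--tag`, the help texts, and the `build_parser()` helper are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Dual-system QA evaluation (sketch)")
    parser.add_argument("--qa-set", choices=["vectorrag", "graphrag", "both"],
                        default="both", help="which QA subset to evaluate")
    parser.add_argument("--limit", type=int, default=None,
                        help="evaluate only the first N items (dry run)")
    parser.add_argument("--tag", default="",
                        help="label used to tell result files apart")
    # The measured LLM and the judge LLM take the same three options.
    for prefix in ("llm", "judge"):
        parser.add_argument(f"--{prefix}-model")
        parser.add_argument(f"--{prefix}-base-url")
        parser.add_argument(f"--{prefix}-api-key-env",
                            help="name of the env var that holds the API key")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Passing the name of the environment variable rather than the key itself keeps secrets out of shell history and process listings.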

… added + 5 GraphRAG QA seeds

## Reference survey results

Adopts standards from the following recent papers/frameworks:
- GraphRAG-Bench (arxiv:2506.02404, ICLR 2026): Accuracy + AR Metric
- RAGAS (https://docs.ragas.io): Answer Correctness + Semantic Similarity + Context Entity Recall
- RAG vs GraphRAG (Han et al. 2025, arxiv:2502.11371): Precision/Recall/F1

Adds the four Tier 2 metrics decided in the earlier plan. R Score / AR Metric are split off as follow-up work after exploration, because of the burden of writing *gold rationale*.

## C8: eval/metrics/answer_correctness.py + prompts/answer_correctness_v1.md

- Combines RAGAS Answer Correctness + GraphRAG-Bench Accuracy
- LLM-as-Judge semantic agreement score, 1-5 points
- Addresses ROUGE-L's tendency to under-score descriptive answers
- Reuses faithfulness_judge's _active_judge_client / strict_schema fallback

## C9: eval/metrics/semantic_similarity.py

- RAGAS Answer Semantic Similarity pattern
- BAAI/bge-m3 embedding cosine
- Same dependency as doc-summary (sentence-transformers)
- Complements ROUGE-L with a meaning-based score

## C10: eval/metrics/entity_coverage.py

- Variant of RAGAS Context Entities Recall
- Fraction of gold-answer entities (numbers + Korean nouns) that appear in the prediction
- If the QA item provides a key_entities field, that takes priority
- Reuses the numerical_accuracy patterns

## C11: eval/metrics/routing_accuracy.py

- Variant of the GraphRAG-Bench AR Metric
- Match between expected_route (specified in the QA) and actual_route (decided by router.py)
- Three values: t2c / local / community
- Returns None when there is no expected_route (e.g. VectorRAG QA)

## C13: eval/dataset/graphrag_qa.json, 5 seeds

GraphRAG QA schema update. Added fields:
- pattern: 1hop / 2hop / intersection / aggregate / filter / metadata
- expected_route: t2c / local / community (for the routing_accuracy metric)
- key_entities: list[str] (for the entity_coverage metric)

Five seeds entered. The remaining 35 will be written after reviewing the explore_graph.py results.

Refs #127
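The two lightweight Tier 2 metrics can be sketched in a few lines. This is a simplified illustration under stated assumptions: the signatures are hypothetical, matching is plain substring containment, and only the key_entities path is shown (the fallback that extracts numbers and Korean nouns from the gold answer is omitted).

```python
from typing import Optional, Sequence


def routing_accuracy(expected_route: Optional[str],
                     actual_route: Optional[str]) -> Optional[float]:
    """1.0 on a match, 0.0 on a mismatch, None when the QA has no expected route."""
    if not expected_route:
        return None  # e.g. VectorRAG QA items carry no expected_route
    return 1.0 if expected_route == actual_route else 0.0


def entity_coverage(key_entities: Sequence[str], prediction: str) -> Optional[float]:
    """Fraction of key entities that occur verbatim in the predicted answer."""
    if not key_entities:
        return None  # nothing to check against
    hits = sum(1 for entity in key_entities if entity in prediction)
    return hits / len(key_entities)
```

Returning None rather than 0.0 for unscored items matters downstream: an average that treated missing routes as failures would understate routing accuracy for the GraphRAG subset.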

## Changes

### evaluate_one(): four metrics added
- Answer Correctness (LLM Judge 1-5): RAGAS + GraphRAG-Bench
- Semantic Similarity (bge-m3 cosine): RAGAS
- Entity Coverage (rate at which gold-answer entities appear): variant of RAGAS Context Entities Recall
- Routing Accuracy (expected vs actual route): variant of the GraphRAG-Bench AR Metric

### print_summary(): Tier 1 / Tier 2 printed separately

For each qa_set:
- Tier 1 (traditional + FineSurE): ROUGE-L / numerical accuracy / Faithfulness / Completeness / Conciseness
- Tier 2 (RAGAS + GraphRAG-Bench): Answer Correctness / Semantic Similarity / Entity Coverage / Routing Accuracy

### --no-semantic argument added

Because bge-m3 is heavy on CPU, it can be switched off for dry runs.

### QA fields used
- expected_route → routing_accuracy
- key_entities → entity_coverage
- pattern → recorded in the results (1hop/2hop/intersection, etc.)

## Compatibility

Existing arguments (--llm-* / --judge-* / --qa-set / --limit / --tag) are kept as-is. Only the new --no-semantic argument is added.

Refs #127
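A per-qa_set summary has to cope with metrics that are missing for some items, such as routing accuracy on VectorRAG QA or semantic similarity when --no-semantic is set. A minimal sketch of that aggregation, assuming each result is a flat dict with a `qa_set` key (the `summarize()` helper and the row layout are hypothetical):

```python
from collections import defaultdict
from statistics import mean
from typing import Optional


def summarize(results: list[dict], metrics: list[str]) -> dict[str, dict[str, Optional[float]]]:
    """Average each metric per qa_set, skipping items where the metric is None or absent."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for row in results:
        grouped[row["qa_set"]].append(row)

    summary: dict[str, dict[str, Optional[float]]] = {}
    for qa_set, rows in grouped.items():
        summary[qa_set] = {}
        for metric in metrics:
            values = [row[metric] for row in rows if row.get(metric) is not None]
            # None signals "not measured" rather than a score of zero.
            summary[qa_set][metric] = mean(values) if values else None
    return summary
```

The same function can be called once with the Tier 1 metric names and once with the Tier 2 names to produce the two blocks of the printed summary.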

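## QA authoring principles

Reflects my own key insight:

> "If the questions are written purely around graph structure, RAG cannot answer them at all, so the comparison is not fair."

→ Following the GraphRAG-Bench / RAG vs GraphRAG (Han et al.) standard:
- The answer must be present in the source text (so RAG can attempt it too)
- RAG should find it hard, or be able to solve only part of it
- GraphRAG should have a structural advantage

## Pattern distribution (mixed: 30 strengths + 10 limitations)

| Pattern | Count | Intent |
|---|---|---|
| multi_doc_trend | 10 | Mirae Asset 1Q→4Q trends (needs 4 documents at once) |
| 1hop | 10 | Hanwha Doosan Bobcat, DS per-stock, FSS, NongHyup |
| intersection | 8 | DS + Hanwha macro variables, Mirae Asset + FSS IPO |
| filter_agg | 5 | doc_type filter + aggregation |
| causal | 5 | Cause/condition (Mexico plant → profitability, oil price → nuclear power) |
| limitation | 2 | Time comparison not possible, Company label quality |

## Using the graph exploration results

Findings from the explore_graph.py output:
- 4 doc_type values (report 2 / disclosure 1 / filing 1 / ir 4): filter_agg
- HAS_METRIC is plentiful (270): 1hop
- No doc_year/page_count: limitation
- Company label quality issue: limitation

## Using the analysis of the 8 source documents
- Mirae Asset 1Q~4Q: quarterly time-series metrics (strong fit for multi_doc_trend)
- Hanwha Doosan Bobcat: 1-hop relations, causal reasoning
- DS market commentary: 1-hop stocks, macro variables
- FSS press release: statistical entities, classification limits
- NongHyup business report: 1-hop business items

## Fields recorded per item
- id, qa_set: "graphrag"
- pattern (1hop/intersection/filter_agg/multi_doc_trend/causal/limitation)
- doc (comma-separated when multiple documents apply)
- question, answer
- expected_route (local/t2c)
- key_entities (for the Tier 2 entity_coverage metric)
- note (authoring intent)

Refs #127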
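The item schema listed above lends itself to a small validation pass before a measurement run. The sketch below is an assumption-laden illustration: `validate_item()` and `split_docs()` are hypothetical helpers, and the sample values in the usage note are placeholders, not entries from graphrag_qa.json.

```python
REQUIRED_FIELDS = ("id", "qa_set", "pattern", "doc", "question",
                   "answer", "expected_route", "key_entities", "note")
PATTERNS = {"1hop", "intersection", "filter_agg",
            "multi_doc_trend", "causal", "limitation"}
ROUTES = {"local", "t2c"}


def split_docs(item: dict) -> list[str]:
    """Split the comma-separated doc field into individual document ids."""
    return [part.strip() for part in item["doc"].split(",") if part.strip()]


def validate_item(item: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the item is valid."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in item]
    if problems:
        return problems
    if item["qa_set"] != "graphrag":
        problems.append(f"unexpected qa_set: {item['qa_set']}")
    if item["pattern"] not in PATTERNS:
        problems.append(f"unknown pattern: {item['pattern']}")
    if item["expected_route"] not in ROUTES:
        problems.append(f"unknown expected_route: {item['expected_route']}")
    if not isinstance(item["key_entities"], list):
        problems.append("key_entities must be a list")
    if not split_docs(item):
        problems.append("doc must name at least one document")
    return problems
```

For example, an item with `pattern="multi_doc_trend"`, `doc="mirae_1q, mirae_2q"` and `expected_route="local"` passes, while one whose `expected_route` is misspelled is reported instead of silently scoring 0 on routing accuracy.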

A unified file with the same structure as doc-summary-agent. scripts/run_qa_eval.py can process all 80 QA by loading a single file.

## QA distribution
- vectorrag: 40 (hanwha 6, ds 5, mirae_1q~4q 19, nonghyup 5, fss 6)
- graphrag: 40 (multi_doc_trend 10, 1hop 10, intersection 8, filter_agg 5, causal 5, limitation 2)

## Measurement flow

The doc-graph side measures the 80 QA (--qa-set both):
- VectorRAG 40: single facts, RAG strength area (where the graph is weak)
- GraphRAG 40: multi-document reasoning, graph strength area

→ Compared directly with the 80-QA results measured on the doc-summary side

Refs #127
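With a single unified file, selecting the subset for a run reduces to filtering on `qa_set`. A minimal sketch follows, assuming the file decodes to a list of dicts; `select_qa()` is a hypothetical helper, and applying `--limit` to the combined list rather than per subset is an assumption.

```python
from typing import Optional

VALID_QA_SETS = ("vectorrag", "graphrag", "both")


def select_qa(items: list[dict], qa_set: str = "both",
              limit: Optional[int] = None) -> list[dict]:
    """Filter the unified QA list by qa_set and optionally truncate it for a dry run."""
    if qa_set not in VALID_QA_SETS:
        raise ValueError(f"qa_set must be one of {VALID_QA_SETS}, got {qa_set!r}")
    selected = [item for item in items
                if qa_set == "both" or item.get("qa_set") == qa_set]
    # Assumption: the limit applies to the combined list, in file order.
    return selected if limit is None else selected[:limit]
```

Note that with this ordering a small limit on `both` would only exercise whichever subset comes first in the file, so a dry run that should touch both subsets needs either interleaved items or a per-subset limit.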

## Problem (dryrun_80qa results)

All 5/5 VectorRAG QA were misrouted to t2c:
- hanwha_001 (target price) → t2c (rows=0)
- hanwha_002 (operating profit) → t2c (rows=0)
- hanwha_003 (dealer inventory) → t2c (rows=0)
- hanwha_004 (basis for the investment rating) → t2c (Cypher rejected)
- hanwha_005 (Mexico plant) → t2c (rows=0)

Result: ROUGE-L 0.05, Faithfulness 0/5, Correctness 1/5

## Cause

The existing router_v1.md said "the default is t2c". The LLM router applied the "numbers / facts = Cypher" pattern too aggressively, but because of Metric label quality issues in my graph, t2c finds no matches.

## Update (v2)
- **Default changed to local**: single-entity facts / relations / attributes all go to local
- **t2c is restricted to *clear-cut* top-N / aggregation / filter / metadata comparison cases**
- More examples: 5 in v1 → 10 in v2

## Expected impact
- VectorRAG QA 40 → mostly routed to local → answerable from chunk text
- GraphRAG QA multi_doc_trend → local (quarter-by-quarter tracking)
- GraphRAG QA filter_agg → t2c (metadata aggregation)
- GraphRAG QA limitation → t2c (limitation area)

## Compatibility

router.py hard-codes ROUTER_PROMPT_PATH = PROMPTS_DIR / "router_v1.md", so updating router_v1.md in place would be the safest option (no router.py code change needed).
→ This PR adds router_v2.md and updates ROUTER_PROMPT_PATH in router.py (separate commit).

Refs #127

## Changes
1. **DEFAULT_ROUTE: "t2c" → "local"**: fallback default for the LLM router
2. **ROUTER_PROMPT_PATH: router_v1.md → router_v2.md**: prompt that emphasizes the local default

## Background (dryrun_80qa results)

All 5/5 VectorRAG QA were misrouted to t2c:
- hanwha_001~005 all went to t2c (rows=0) → Faithfulness 0/5, Correctness 1/5

Cause: the existing router_v1.md said "the default is t2c", and the LLM applied the "numbers = Cypher" pattern strongly. Because of Metric label quality issues in my graph, however, t2c finds no matches.

## Impact
- VectorRAG QA 40 → mostly routed to local → answerable from chunk text
- GraphRAG QA multi_doc_trend → local (quarter-by-quarter tracking of a single entity)
- GraphRAG QA filter_agg → t2c (metadata aggregation, a clear-cut t2c case)

## Docstrings
- Added a "5/25 v2 박힘" (v2 applied on 5/25) section
- DEFAULT_ROUTE / ROUTER_PROMPT_PATH / decide_route docstrings all updated

Refs #127
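The effect of changing the fallback default can be shown with a small normalization step applied to whatever the router LLM returns. This is a sketch, not the body of decide_route: `normalize_route()` is a hypothetical helper, and treating any unrecognized or empty output as the default is an assumption about how the fallback is used.

```python
from typing import Optional

VALID_ROUTES = ("t2c", "local", "community")
DEFAULT_ROUTE = "local"  # was "t2c" before this change


def normalize_route(raw: Optional[str]) -> str:
    """Map raw router output to a valid route, falling back to DEFAULT_ROUTE."""
    if raw is None:
        return DEFAULT_ROUTE
    candidate = raw.strip().lower()
    return candidate if candidate in VALID_ROUTES else DEFAULT_ROUTE
```

With "local" as the fallback, an ambiguous or malformed router response lands on chunk retrieval, which can still answer from text, instead of on a Cypher query that may return zero rows.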

PR #54 was merged with base=feat/qa-eval-dual-system, but by that point PR #53 (qa-eval-dual-system → dev) had already been merged, so the changes in PR #54 never reached dev. Because the OpenAI + Kimi measurements need to be kicked off before I go to bed, the patch is applied directly to dev:

1. agent/llm_client.py: LLMConfig.is_openrouter property + triple-safety _REASONING_OFF_BODY (max_tokens:1 / enabled:false / exclude:true) + reasoning_content fallback + empty-response diagnostics
2. eval/metrics/faithfulness_judge.py: _use_openrouter_extras flag + triple _REASONING_OFF_BODY + TIMEOUT 60s + fallback to both the reasoning and reasoning_content fields
3. eval/metrics/answer_correctness.py: reuses faithfulness_judge's _REASONING_OFF_BODY + same pattern

Compatibility:
- Direct OpenAI (configure_llm not called) → no impact
- Direct KIMI_* (prod / W4) → no impact (is_openrouter is False)
- Via OpenRouter → reasoning OFF applied automatically
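The two ideas in this patch, sending a reasoning-off body only when the endpoint is OpenRouter and falling back to the reasoning fields when the main content is empty, can be sketched as below. Several details are assumptions: nesting the three keys under a `"reasoning"` object, detecting OpenRouter by its hostname in the base URL, and representing the response message as a plain mapping.

```python
from typing import Any, Mapping, Optional

# The three settings named in the patch; nesting under "reasoning" is an assumption.
REASONING_OFF_BODY = {"reasoning": {"max_tokens": 1, "enabled": False, "exclude": True}}


def is_openrouter(base_url: str) -> bool:
    # Assumed heuristic: identify OpenRouter by hostname.
    return "openrouter.ai" in base_url


def extra_body_for(base_url: str) -> dict:
    """Extra request fields: reasoning-off for OpenRouter, nothing otherwise."""
    if not is_openrouter(base_url):
        return {}
    return {"reasoning": dict(REASONING_OFF_BODY["reasoning"])}


def extract_text(message: Mapping[str, Any]) -> Optional[str]:
    """Return the first non-empty text among content, reasoning, reasoning_content."""
    for field in ("content", "reasoning", "reasoning_content"):
        value = message.get(field)
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None  # caller should log an empty-response diagnostic
```

Gating the extra body on the endpoint is what keeps the direct OpenAI and KIMI_* paths untouched, since providers may reject fields they do not recognize.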

TaskerJang added 11 commits May 25, 2026 12:02

chore(deps): #127 add rouge-score + python-mecab-ko for the ROUGE-L metric

273cb01

TaskerJang mentioned this pull request May 25, 2026

fix: reasoning OFF for OpenRouter reasoning models (Kimi K2.5 / Claude judge) #54

Merged

TaskerJang merged commit a121f81 into dev May 25, 2026

TaskerJang mentioned this pull request May 26, 2026

[Eval] Verification: re-measure the same 80 QA after applying P1~P5 (Before/After comparison) #61

Closed

5 tasks

TaskerJang merged 11 commits into dev from feat/qa-eval-dual-system
