fix: reasoning OFF for OpenRouter reasoning models (Kimi K2.5 / Claude judge) #54

Merged
TaskerJang merged 3 commits into feat/qa-eval-dual-system from feat/disable-reasoning
May 25, 2026

@TaskerJang

Background

Analysis of the first 80-QA measurement from PR #53 confirmed that doc-graph-agent is affected by the same reasoning-model empty-response problem as doc-summary-agent (PR #133).

Evidence

| System | LLM | Faithfulness | Notes |
|---|---|---|---|
| doc-summary | deepseek-v3.2 | 87.5% (n=57) | Normal |
| doc-summary | Kimi K2.5 | 13.8% (n=80) | Crippled by reasoning → empty responses accumulate |
| doc-graph | deepseek-v3.2 | 80%+ (estimated) | Normal |
| doc-graph | Kimi K2.5 | 42.5~45% (n=80) | Dropped to about half |

doc-graph did not collapse as completely as doc-summary because the two systems make very different numbers of LLM calls:

  • doc-summary: Map-Reduce pattern → 80 QA × ~20 chunks × 3 stages ≈ 5,000 calls (large cumulative impact)
  • doc-graph: single retrieval → 80 QA × 1 stage ≈ 80 calls (small impact)

The underlying effect is the same in both, however → Faithfulness is about half of the deepseek measurement.
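The call-count comparison above can be checked with a few lines of arithmetic. This is only a sketch: the QA, chunk, and stage counts come from the list above, while the per-call empty-response rate `P_EMPTY` and the independence assumption are hypothetical, chosen to illustrate why many calls per QA item amplify the damage.

```python
def estimated_calls(qa_count: int, chunks_per_qa: int, stages: int) -> int:
    """LLM calls needed to answer every QA item once."""
    return qa_count * chunks_per_qa * stages


def p_any_empty(p_empty: float, calls_per_qa: int) -> float:
    """Chance that at least one call for a QA item returns empty content,
    assuming each call fails independently with probability p_empty."""
    return 1.0 - (1.0 - p_empty) ** calls_per_qa


# Figures quoted in the PR description.
doc_summary_calls = estimated_calls(80, chunks_per_qa=20, stages=3)  # Map-Reduce
doc_graph_calls = estimated_calls(80, chunks_per_qa=1, stages=1)     # single retrieval
print(doc_summary_calls, doc_graph_calls)  # 4800 80

# Hypothetical per-call empty-response rate, for illustration only.
P_EMPTY = 0.05
summary_risk = p_any_empty(P_EMPTY, calls_per_qa=20 * 3)
graph_risk = p_any_empty(P_EMPTY, calls_per_qa=1)
print(f"doc-summary: {summary_risk:.2f}, doc-graph: {graph_risk:.2f}")
```

Under these assumed numbers, almost every doc-summary QA item hits at least one empty call, while a doc-graph item is only exposed once. This matches the direction of the measured gap, not its exact size.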

Cause

From the official OpenRouter docs (https://openrouter.ai/docs/guides/best-practices/reasoning-tokens):

Reasoning tokens are considered output tokens and charged accordingly.

With reasoning models such as Kimi K2.5 / DeepSeek V3.2, thinking tokens can exhaust max_tokens, after which the actual output (content) is returned as an empty string.

The reasoning: {"enabled": False} setting in the previous code is not enough on its own, because some models (Kimi K2.5 in particular) ignore it.
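A minimal sketch of how this failure can be recognized in a response. It assumes an OpenAI-compatible chat-completion payload in which a choice carries `finish_reason` and `message.content`, and that hitting the token limit is reported as `finish_reason == "length"`. The helper name `is_budget_exhausted` is hypothetical and not taken from the PR.

```python
from typing import Any, Mapping


def is_budget_exhausted(choice: Mapping[str, Any]) -> bool:
    """True when a choice has no visible content and stopped at the token limit."""
    message = choice.get("message") or {}
    content = (message.get("content") or "").strip()
    return not content and choice.get("finish_reason") == "length"


# Illustrative payloads, not captured from a real run.
exhausted_choice = {
    "finish_reason": "length",
    "message": {"content": "", "reasoning": "Let me think step by step..."},
}
normal_choice = {
    "finish_reason": "stop",
    "message": {"content": "The answer is 42."},
}

print(is_budget_exhausted(exhausted_choice))  # True
print(is_budget_exhausted(normal_choice))     # False
```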

Patch

The correct pattern according to the official docs:

"reasoning": {
    "max_tokens": 1,    # ← officially recommended value (compatible with all models)
    "enabled": False,   # ← some Anthropic models
    "exclude": True,    # ← do not return reasoning in the response
}

This triple safeguard is applied at every LLM call site:

Changed files

  1. agent/llm_client.py: LLM that generates retrieval answers

    • Add an LLMConfig.is_openrouter property for automatic detection
    • Set extra_body only when routed through OpenRouter (no impact on direct Kimi / direct OpenAI connections)
    • Empty-response diagnostics plus fallback to the reasoning_content / reasoning fields
  2. eval/metrics/faithfulness_judge.py: Faithfulness / Numerical Faithfulness judge

    • Add a _use_openrouter_extras flag
    • _REASONING_OFF_BODY triple safeguard
    • TIMEOUT 30 → 60s
    • Fallback to both the reasoning and reasoning_content fields
  3. eval/metrics/answer_correctness.py: Answer Correctness judge

    • Reuse faithfulness_judge._REASONING_OFF_BODY
    • Same reasoning OFF pattern
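The empty-response diagnostics and field fallback listed above might look something like the sketch below. The field names `reasoning_content` and `reasoning` come from the PR; the helper name `extract_text`, the lookup order, and the log wording are assumptions rather than the merged implementation.

```python
import logging
from typing import Any, Mapping

logger = logging.getLogger(__name__)

# Fields to try, in order, when the visible content is empty (order assumed).
_FALLBACK_FIELDS = ("reasoning_content", "reasoning")


def extract_text(message: Mapping[str, Any]) -> str:
    """Return message content, falling back to reasoning fields when empty."""
    content = (message.get("content") or "").strip()
    if content:
        return content
    for field in _FALLBACK_FIELDS:
        fallback = (message.get(field) or "").strip()
        if fallback:
            logger.warning("Empty content; falling back to %r field", field)
            return fallback
    logger.warning("Empty content and no reasoning fallback available")
    return ""


print(extract_text({"content": "final answer"}))                      # final answer
print(extract_text({"content": "", "reasoning": "thinking text"}))    # thinking text
```

The warning makes empty responses visible in the logs, so an unexpectedly low score can be traced back to this failure mode instead of being read as a genuine quality drop.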

Compatibility

| Environment | Impact |
|---|---|
| prod / W4 qualitative verification (direct KIMI_* connection) | ❌ No impact: is_openrouter is False, so extra_body is not applied |
| Direct OpenAI connection (configure_llm not called) | ❌ No impact |
| Measurements routed through OpenRouter | ✅ reasoning OFF applied automatically |
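One way the automatic detection behind this table could work is a property that inspects the configured base URL. The sketch below is an assumption about the mechanism, not the merged code: the `LLMConfig` fields, the substring check, and the direct-connection URL are all illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LLMConfig:
    """Hypothetical minimal config; the real class likely has more fields."""
    model: str
    base_url: Optional[str] = None  # None means the client's default endpoint

    @property
    def is_openrouter(self) -> bool:
        """True only when requests are routed through OpenRouter."""
        if not self.base_url:
            return False
        return "openrouter.ai" in self.base_url.lower()


via_openrouter = LLMConfig("moonshotai/kimi-k2.5", "https://openrouter.ai/api/v1")
direct_kimi = LLMConfig("kimi-k2.5", "https://api.moonshot.ai/v1")  # illustrative URL
direct_openai = LLMConfig("gpt-4o-mini")  # illustrative model name, default endpoint

for cfg in (via_openrouter, direct_kimi, direct_openai):
    print(cfg.model, cfg.is_openrouter)
```

Gating on the endpoint rather than on the model name keeps direct connections untouched even when they use the same model.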

Expected effect

  • doc-graph Kimi re-measurement: Faithfulness expected to rise from 42-45% to 70-85%
  • doc-summary Kimi measurement: expected to yield a figure comparable to deepseek's 87.5%
  • Cost reduction: roughly 1/3, since reasoning tokens are no longer generated

Re-measurement command

cd C:\Users\taske\doc-graph-agent && git fetch origin && git checkout feat/disable-reasoning && uv run python scripts/run_qa_eval.py --llm-model "moonshotai/kimi-k2.5" --llm-base-url "https://openrouter.ai/api/v1" --llm-api-key-env "OPENROUTER_API_KEY" --judge-model "anthropic/claude-haiku-4.5" --judge-base-url "https://openrouter.ai/api/v1" --judge-api-key-env "OPENROUTER_API_KEY" --qa-set both --tag kimi_no_reasoning_80qa

🤖 Generated with Claude (results are expected to be in by the time I wake up)

Refs #127, PR #53, doc-summary PR #133

…t.py

Same pattern as doc-summary-agent PR #133.

Problem:
- Kimi K2.5 (moonshotai/kimi-k2.5) is a reasoning model
- Thinking tokens consume the entire max_tokens budget, and content comes back as an empty string
- First doc-graph measurement: Faithfulness 42-45%, about half of deepseek's 87.5%
- So doc-graph was also affected by reasoning (it just did not fail completely, because it makes fewer LLM calls)

Official docs (https://openrouter.ai/docs/guides/best-practices/reasoning-tokens):
- reasoning: {"max_tokens": 1}  ← officially recommended value (compatible with all models)
- reasoning: {"enabled": false}  ← some Anthropic models

Patch applied:
1. _REASONING_OFF_BODY triple safeguard:
   - enabled: false
   - max_tokens: 1  ← officially recommended (some models may reject 0)
   - exclude: true
2. Add an LLMConfig.is_openrouter property for automatic detection
3. Set extra_body only when routed through OpenRouter (no impact on direct Kimi / direct OpenAI connections)
4. Add fallback to the reasoning_content / reasoning fields
5. Add a logger.warning for empty-response diagnostics

Expected effect:
- On re-measuring the 80 QA, Faithfulness expected to rise from 42-45% to 70-85%
- Cost reduction: roughly 1/3, since reasoning tokens are no longer generated

Previous code: only reasoning: {"enabled": False} was set, but some models
such as Kimi were found to ignore it (evidence: 60+ empty Judge responses in the previous measurement).

Official docs (https://openrouter.ai/docs/guides/best-practices/reasoning-tokens):
- max_tokens: 1 is the officially recommended value (compatible with all models)
- enabled: false is supported only by some Anthropic models

Patch:
1. reasoning triple safeguard:
   - enabled: false
   - max_tokens: 1  ← officially recommended
   - exclude: true
2. Add a _use_openrouter_extras flag so the body is applied only when routed through OpenRouter
3. TIMEOUT 30 -> 60s (margin for reasoning models)
4. Add fallback to both the reasoning and reasoning_content fields
5. Same pattern as summarizer/llm.py (consistent with doc-summary PR #133)

Apply the same pattern as faithfulness_judge.py.

Changes:
1. Reuse _REASONING_OFF_BODY from faithfulness_judge
   (triple: max_tokens: 1 + enabled: false + exclude: true)
2. Add a _use_openrouter_extras flag so the body is applied only when routed through OpenRouter
3. TIMEOUT 30 -> 60s
4. Fallback to both the reasoning and reasoning_content fields
5. Add a logger.warning for empty-response diagnostics
@TaskerJang TaskerJang merged commit 711c34b into feat/qa-eval-dual-system May 25, 2026
TaskerJang added a commit that referenced this pull request May 25, 2026
PR #54 was merged into base=feat/qa-eval-dual-system, but by that point PR #53
(qa-eval-dual-system → dev) had already been merged, so the changes from PR #54
never reached dev.

The OpenAI + Kimi measurements are to be started before going to bed, so the patch was applied directly to dev:

1. agent/llm_client.py: LLMConfig.is_openrouter property + _REASONING_OFF_BODY
   triple safeguard (max_tokens:1 / enabled:false / exclude:true) + reasoning_content
   fallback + empty-response diagnostics

2. eval/metrics/faithfulness_judge.py: _use_openrouter_extras flag +
   _REASONING_OFF_BODY triple safeguard + TIMEOUT 60s + fallback to both the
   reasoning and reasoning_content fields

3. eval/metrics/answer_correctness.py: reuse _REASONING_OFF_BODY from
   faithfulness_judge + same pattern

Compatibility:
- Direct OpenAI connection (configure_llm not called) → no impact
- Direct KIMI_* connection (prod / W4) → no impact (is_openrouter False)
- Routed through OpenRouter → reasoning OFF applied automatically