[bug][examples] Fix lcb dataset issue in examples dir by s-chundi · Pull Request #1349 · NovaSky-AI/SkyRL

s-chundi · 2026-03-19T14:38:09Z

Fixes #542

The original deepcoder_train.json is a 5.2 GB JSON array. When datasets.load_dataset("json", ...) reads it, PyArrow's JSON reader sets a block_size based on the file size. Because block_size is an int32_t, any file larger than about 2 GB (2^31 - 1 bytes) overflows it, and loading crashes with OverflowError: value too large to convert to int32_t.
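To make the size limit concrete, here is a minimal sketch of the arithmetic only. It does not call PyArrow; the helper name fits_in_int32 is hypothetical, and the 5.2 GB figure is assumed to mean decimal gigabytes.

```python
# Largest value a signed 32-bit integer (int32_t) can hold.
INT32_MAX = 2**31 - 1


def fits_in_int32(n_bytes: int) -> bool:
    """Return True if n_bytes is representable as a signed 32-bit integer."""
    return -(2**31) <= n_bytes <= INT32_MAX


# Assumption: "5.2 GB" is read as 5.2 * 10**9 bytes.
file_size_bytes = 5_200_000_000

print(INT32_MAX)                       # 2147483647
print(fits_in_int32(file_size_bytes))  # False
```

Any block size derived from a file this large is more than twice what an int32_t can represent, which is consistent with the OverflowError reported above.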

Changed lcb_dataset.py to output parquet files directly via Dataset.from_list() instead of writing large JSON arrays. Updated data paths to reference the new .parquet files.
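For readers who have not opened the diff, the approach can be sketched roughly as follows. This is not the code from lcb_dataset.py: the function name write_records_to_parquet and the record fields in the usage note are hypothetical, and the sketch assumes the records fit in memory as a list of dicts. Dataset.from_list() and Dataset.to_parquet() are the only library calls used.

```python
def write_records_to_parquet(records, out_path):
    """Write a list of dict records to one parquet file and return the row count.

    Hypothetical helper for illustration; the real script may be organized differently.
    """
    # Imported inside the function so the dependency is only required
    # when the helper is actually called.
    from datasets import Dataset

    # Build an Arrow-backed dataset from the in-memory records.
    ds = Dataset.from_list(records)
    # Write the dataset as parquet instead of dumping one large JSON array.
    ds.to_parquet(out_path)
    return ds.num_rows
```

A call such as write_records_to_parquet([{"prompt": "p1"}, {"prompt": "p2"}], "sample_train.parquet") would write a two-row file, which can then be loaded with datasets.load_dataset("parquet", data_files="sample_train.parquet") without going through the JSON reader.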

Confirmed by training on the resulting parquet files.

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 2 additional findings.

gemini-code-assist · 2026-03-19T14:42:26Z

Warning

Gemini is experiencing higher than usual traffic and was unable to create the review. Please try again in a few hours by commenting /gemini review.

s-chundi · 2026-03-21T14:13:59Z

/gemini review

gemini-code-assist

Code Review

This pull request resolves a crash when loading large JSON datasets by switching to the more efficient Parquet format. The changes in lcb_dataset.py correctly replace pandas with datasets for this purpose, and run_lcb.sh is updated accordingly. I've included a suggestion to remove a redundant file creation, which will make the data processing script more efficient.

examples/train/livecodebench/lcb_dataset.py

Fix lcb dataset issue in examples dir

97458b3

devin-ai-integration bot reviewed Mar 19, 2026

s-chundi changed the title from "Fix lcb dataset issue in examples dir" to "[bug] Fix lcb dataset issue in examples dir" on Mar 19, 2026

s-chundi changed the title from "[bug] Fix lcb dataset issue in examples dir" to "[bug][examples] Fix lcb dataset issue in examples dir" on Mar 19, 2026

gemini-code-assist bot reviewed Mar 21, 2026

s-chundi wants to merge 1 commit into NovaSky-AI:main from s-chundi:fix/lcb-example-bug
