[bug][examples] Fix lcb dataset issue in examples dir #1349

Open
s-chundi wants to merge 1 commit into NovaSky-AI:main from s-chundi:fix/lcb-example-bug

Conversation


@s-chundi s-chundi commented Mar 19, 2026

Fixes #542

The original deepcoder_train.json is a 5.2 GB JSON array. When datasets.load_dataset("json", ...) reads it, PyArrow's JSON reader sets its block_size from the file size, but block_size is an int32_t, so any file over ~2 GB overflows it and the load crashes with OverflowError: value too large to convert to int32_t.

Changed lcb_dataset.py to output parquet files directly via Dataset.from_list() instead of writing large JSON arrays. Updated data paths to reference the new .parquet files.

Confirmed the fix by training on the resulting parquet files.


Contributor

@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 2 additional findings.


@gemini-code-assist
Contributor

Warning

Gemini is experiencing higher than usual traffic and was unable to create the review. Please try again in a few hours by commenting /gemini review.

@s-chundi s-chundi changed the title Fix lcb dataset issue in examples dir [bug] Fix lcb dataset issue in examples dir Mar 19, 2026
@s-chundi s-chundi changed the title [bug] Fix lcb dataset issue in examples dir [bug][examples] Fix lcb dataset issue in examples dir Mar 19, 2026
@s-chundi
Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request resolves a crash when loading large JSON datasets by switching to the more efficient Parquet format. The changes in lcb_dataset.py correctly replace pandas with datasets for this purpose, and run_lcb.sh is updated accordingly. I've included a suggestion to remove a redundant file creation, which will make the data processing script more efficient.



Development

Successfully merging this pull request may close these issues.

Unable to load lcb dataset
