PR2: add concurrent_files for bounded concurrent reads#2
Draft
sumedhsakdeo wants to merge 4 commits into fix/arrow-scan-streaming-3036 from
Conversation
This was referenced Feb 14, 2026
sumedhsakdeo commented Feb 15, 2026
pyiceberg/io/pyarrow.py (outdated), comment on lines +1723 to +1725:

    if cancel_event.is_set():
        return
    acquired = True

Owner, Author

Suggested change: set `acquired` before the cancellation check; otherwise a cancel observed right after a successful semaphore acquire returns without marking the permit as held, and the cleanup path never releases it.

    acquired = True
    if cancel_event.is_set():
        return
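The reasoning behind the suggestion can be shown with a small stand-alone sketch (the helper name and return value are hypothetical, not pyiceberg's actual code): recording a successful acquire before checking the cancel event guarantees the `finally` block releases exactly the permits actually held.

```python
import threading


def read_file_bounded(sem: threading.Semaphore, cancel_event: threading.Event):
    """Hypothetical sketch: acquire a concurrency slot, honoring cancellation
    without leaking semaphore permits."""
    acquired = False
    try:
        # Timed acquire so a cancelled scan never blocks forever on the semaphore.
        while not sem.acquire(timeout=0.1):
            if cancel_event.is_set():
                return None
        # Mark the acquire FIRST: if we returned on the cancel check before
        # setting this flag, the finally block would skip the release and
        # permanently leak a permit.
        acquired = True
        if cancel_event.is_set():
            return None
        return "batch"  # stand-in for the actual file read
    finally:
        if acquired:
            sem.release()
```

With the original ordering, a cancellation that fires between the successful acquire and the flag assignment would return with `acquired` still `False`, so the semaphore would lose a slot for the rest of the scan.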
Add _bounded_concurrent_batches() with proper lock discipline:
- Queue backpressure caps memory (scan.max-buffered-batches, default 16)
- Semaphore limits concurrent file reads (concurrent_files param)
- Cancel event with timeouts on all blocking ops (no lock over IO)
- Error propagation and early termination support

When streaming=True and concurrent_files > 1, batches are yielded as they arrive from parallel file reads. File ordering is not guaranteed (documented).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
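The producer/consumer pattern that commit describes can be sketched in a few dozen lines of stdlib Python. This is an illustration under stated assumptions, not pyiceberg's implementation: a bounded queue provides backpressure, the executor's worker count bounds concurrent reads, and a cancel event lets producers stop on error or early termination. All names are made up for the example.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, Sequence


def bounded_concurrent_batches(
    read_file: Callable[[str], Iterable[object]],
    files: Sequence[str],
    concurrent_files: int = 4,
    max_buffered_batches: int = 16,
) -> Iterator[object]:
    """Yield batches from parallel file reads in arrival order (sketch)."""
    buf: queue.Queue = queue.Queue(maxsize=max_buffered_batches)  # backpressure
    cancel = threading.Event()

    def _put(item: tuple) -> bool:
        # Timed put so producers notice cancellation instead of blocking forever.
        while not cancel.is_set():
            try:
                buf.put(item, timeout=0.1)
                return True
            except queue.Full:
                pass
        return False

    def producer(path: str) -> None:
        try:
            for batch in read_file(path):
                if not _put(("batch", batch)):
                    return  # consumer went away; stop reading
            _put(("done", None))
        except Exception as exc:  # propagate reader errors to the consumer
            _put(("error", exc))

    # max_workers bounds how many files are read concurrently; extra
    # submissions simply wait in the executor's task queue.
    with ThreadPoolExecutor(max_workers=concurrent_files) as pool:
        for path in files:
            pool.submit(producer, path)
        remaining = len(files)
        try:
            while remaining:
                kind, payload = buf.get()
                if kind == "done":
                    remaining -= 1
                elif kind == "error":
                    raise payload
                else:
                    yield payload
        finally:
            cancel.set()  # early termination: tell all producers to stop
```

The key properties match the commit message: memory is capped by the queue size, no lock is held across IO, and both errors and early consumer exit unwind the producers promptly.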
Replace shared ExecutorFactory + Semaphore with per-scan ThreadPoolExecutor(max_workers=concurrent_files) for deterministic shutdown and simpler concurrency control.

Refactor to_record_batches into helpers:
- _prepare_tasks_and_deletes: resolve delete files
- _iter_batches_streaming: bounded concurrent streaming path
- _iter_batches_materialized: executor.map materialization path
- _apply_limit: unified row limit logic (was duplicated)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
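Of these helpers, the unified row-limit logic is easy to illustrate. The following is a sketch of what a `_apply_limit`-style helper could look like; plain sequences stand in for Arrow record batches (real code would use `RecordBatch.slice` rather than Python slicing), and the signature is an assumption, not the PR's actual one.

```python
from typing import Iterable, Iterator, Optional, Sequence


def apply_limit(batches: Iterable[Sequence], limit: Optional[int]) -> Iterator[Sequence]:
    """Stop after `limit` rows, truncating the final batch if it overshoots."""
    if limit is None:
        yield from batches  # no limit: pass batches through unchanged
        return
    remaining = limit
    for batch in batches:
        if remaining <= 0:
            return  # early termination: stop pulling from upstream
        if len(batch) > remaining:
            # Final batch straddles the limit: emit only the rows we need.
            yield batch[:remaining]
            return
        remaining -= len(batch)
        yield batch
```

Because it returns as soon as the limit is reached, a streaming upstream (such as the bounded concurrent reader) sees the generator close early and can cancel outstanding file reads.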
…tests and docs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Part of apache#3036
Summary
- concurrent_files parameter for bounded concurrent reads across multiple files in arrival order
- ThreadPoolExecutor(max_workers=concurrent_files) with bounded queue.Queue(maxsize=16) for backpressure
- threading.Event for cancellation on early termination
- to_record_batches refactored into helpers: _prepare_tasks_and_deletes, _iter_batches_arrival, _iter_batches_materialized, _apply_limit
- concurrent_files=1 preserves sequential behavior

Ordering semantics
- ScanOrder.TASK (default)
- ScanOrder.ARRIVAL, concurrent_files=1
- ScanOrder.ARRIVAL, concurrent_files>1

PR Stack
This is PR 2 of 3 for apache#3036:
1. batch_size forwarding
2. ScanOrder enum — stop materializing entire files
3. concurrent_files — bounded concurrent reads in arrival order

Are these changes tested?
Yes — 9 tests in test_bounded_concurrent_batches.py: correctness, backpressure, error propagation, early termination, concurrency limits, ArrowScan integration with limit. Plus 3 concurrent-specific tests in test_pyarrow.py.

Are there any user-facing changes?
Yes — new concurrent_files parameter on to_arrow_batch_reader() (used with order=ScanOrder.ARRIVAL)
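The two ordering modes described above (deterministic task order vs. arrival order) can be modeled with plain `concurrent.futures`, without pyiceberg. The sleep times are artificial stand-ins for file-read latency, and the function name is invented for the illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_tasks(task_order: bool) -> list:
    """Toy model: iterate futures in submission order (TASK-like) or in
    completion order (ARRIVAL-like)."""
    delays = {"a": 0.25, "b": 0.0}  # "a" is submitted first but finishes last

    def read(name: str) -> str:
        time.sleep(delays[name])
        return name

    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(read, n) for n in ("a", "b")]
        if task_order:
            # TASK-like: results come back in submission order, so the
            # consumer waits for the slow file before seeing the fast one.
            return [f.result() for f in futures]
        # ARRIVAL-like: whichever file finishes first is yielded first.
        return [f.result() for f in as_completed(futures)]
```

This is why arrival order can improve time-to-first-batch: the fast file's batches are not held up behind a slow file that happens to sort first.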