
Time partition #37

Merged
JamesMcClung merged 11 commits into main from time-partition
May 14, 2026

@JamesMcClung
Owner

A somewhat brute-force method for lazily indexing list data by time. It emerged from the following constraints:

  • Indexing a List by time must not scan every file to "find" the right time.
  • Allow more than one partition per file. This precludes using time as an index, since some partitions would then have non-distinct upper and lower boundaries.
  • Keep data flow one-way, i.e., don't back-propagate a list of file indices from the pipeline to the loader.

JamesMcClung and others added 9 commits May 14, 2026 14:41
Field test passes on first run — xr.open_mfdataset + isel(t=-1) +
compute already produces a dask graph that reads bulk from exactly the
indexed file (verified: 1 read of 'jeh' from pfd.000000010.bp). The
investigation methodology in the spec turned out to be belt-and-
suspenders; nothing to fix in the field stack.

Particle test is marked xfail and confirms the diagnosed shape:
77 bulk reads (7 columns x 11 files) instead of 7 (1 file). Fix
deferred per the spec.

Co-Authored-By: Claude <noreply@anthropic.com>
New optional ListMetadata fields describing the partition layout of the
underlying dask DataFrame. Used by Idx (next commit) to do partition
pruning instead of a predicate filter when iseling along the partition
dim. None defaults preserve existing behavior.
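
A minimal sketch of how such optional fields could look. The class name `ListMetadataSketch`, the helper `can_prune`, and the assumed shape of `partition_ranges` (one half-open range of partition numbers per step) are illustrative assumptions, not the actual definitions in this PR:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ListMetadataSketch:
    """Hypothetical stand-in for ListMetadata; only the two new fields are shown."""

    # Name of the dim the dask partitions are laid out along, e.g. "t".
    partition_dim: Optional[str] = None
    # Assumed shape: one half-open (start, stop) range of partition numbers per step.
    partition_ranges: Optional[list[tuple[int, int]]] = None


def can_prune(meta: ListMetadataSketch, dim: str) -> bool:
    """True when an isel along `dim` could use partition pruning."""
    return meta.partition_dim == dim and meta.partition_ranges is not None


# A loader that sets nothing keeps the old behavior (predicate filter).
legacy = ListMetadataSketch()
# A loader that tracks its layout opts in to pruning along "t".
tracked = ListMetadataSketch(partition_dim="t", partition_ranges=[(0, 2), (2, 3)])
```

With the `None` defaults, `can_prune(legacy, "t")` is false, so code written before these fields existed is unaffected.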

Co-Authored-By: Claude <noreply@anthropic.com>
Track per-step partition layout in metadata so Idx (next commit) can do
dask-native partition pruning. Subfile chunking is preserved — each step
still has CONFIG.dask_chunk_size-bounded partitions; we just record the
ranges.
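
One way the per-step ranges could be recorded while keeping subfile chunking, as a sketch only. `build_partition_ranges` is a hypothetical helper, `chunk_size` stands in for `CONFIG.dask_chunk_size`, and the half-open `(start, stop)` convention is an assumption:

```python
import math


def build_partition_ranges(rows_per_step, chunk_size):
    """Return one half-open (start, stop) range of partition numbers per step."""
    ranges = []
    next_partition = 0
    for n_rows in rows_per_step:
        # Each step is chunked on its own, so no partition spans two steps.
        # Assumption: an empty step still occupies one (empty) partition.
        n_parts = max(1, math.ceil(n_rows / chunk_size))
        ranges.append((next_partition, next_partition + n_parts))
        next_partition += n_parts
    return ranges


# Three steps with 250, 100, and 10 rows, chunked at 100 rows per partition.
ranges = build_partition_ranges([250, 100, 10], chunk_size=100)
```

Here the first step occupies partitions 0 to 2, and the other two steps occupy one partition each.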

Co-Authored-By: Claude <noreply@anthropic.com>
Mirror of the change in particle_bp. Same shape, same intent.

Co-Authored-By: Claude <noreply@anthropic.com>
When iseling along the partition_dim of a list, use df.partitions[...]
to let dask prune the graph instead of df[df[dim] == pos], which forces
every partition to be read to evaluate the predicate.

For the prt-bin-time idx case on test-2d: bulk reads drop from 77 (7
columns x 11 files) to 7 (7 columns x 1 file). The test_idx_efficient
particle case is no longer xfail.
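
The read counts above can be reproduced with a plain-Python stand-in. This is a simulation, not dask: `CountingReader` and both `isel_*` functions are hypothetical, the column names are placeholders, and one partition per file is assumed for simplicity:

```python
N_FILES = 11
COLUMNS = ["x", "y", "z", "px", "py", "pz", "w"]  # placeholder names, 7 columns


class CountingReader:
    """Counts bulk reads; loading one partition costs one read per column."""

    def __init__(self):
        self.bulk_reads = 0

    def read_partition(self, part):
        self.bulk_reads += len(COLUMNS)
        return [{"t": part}]  # one dummy row tagged with its step


def isel_by_predicate(reader, pos):
    """Analogue of df[df[dim] == pos]: every partition is read to test the predicate."""
    rows = []
    for part in range(N_FILES):
        rows += [row for row in reader.read_partition(part) if row["t"] == pos]
    return rows


def isel_by_partition(reader, partition_ranges, idx):
    """Analogue of df.partitions[...]: only the partitions of the chosen step are read."""
    start, stop = partition_ranges[idx]
    rows = []
    for part in range(start, stop):
        rows += reader.read_partition(part)
    return rows


partition_ranges = [(i, i + 1) for i in range(N_FILES)]

predicate_reader = CountingReader()
predicate_rows = isel_by_predicate(predicate_reader, pos=N_FILES - 1)

pruned_reader = CountingReader()
pruned_rows = isel_by_partition(pruned_reader, partition_ranges, idx=-1)
```

Both paths return the same rows, but the predicate path performs 7 x 11 reads while the pruned path performs 7.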

LazyList.compute() now clears partition_dim/partition_ranges since they
describe the dask layout and are meaningless after materialization to a
pandas frame.
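
A sketch of the clearing step, assuming the metadata is a dataclass. `MetaSketch`, its `name` field, and `metadata_after_compute` are hypothetical names used only for illustration:

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class MetaSketch:
    """Hypothetical metadata with one unrelated field and the two layout fields."""

    name: str
    partition_dim: Optional[str] = None
    partition_ranges: Optional[list[tuple[int, int]]] = None


def metadata_after_compute(meta: MetaSketch) -> MetaSketch:
    """Drop the layout fields; a materialized pandas frame has no partitions."""
    return replace(meta, partition_dim=None, partition_ranges=None)


lazy_meta = MetaSketch(name="prt", partition_dim="t", partition_ranges=[(0, 1), (1, 2)])
computed_meta = metadata_after_compute(lazy_meta)
```

Fields that do not describe the dask layout are carried over unchanged.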

Co-Authored-By: Claude <noreply@anthropic.com>
Document the new ListMetadata fields and the loader invariant that
keeps them in sync with the dd.DataFrame layout. Without this, a future
loader implementer could silently lose Idx's partition-pruning
optimization by forgetting to set them.

Co-Authored-By: Claude <noreply@anthropic.com>
JamesMcClung added the optimization (Improves performance) label May 14, 2026
JamesMcClung and others added 2 commits May 14, 2026 15:18
Mirror the existing --idx t=-1 tests with --pos t=999 (nearest resolves
to the last file). Field passes; particle xfails for the same structural
reason Idx did. The fix is in the next commit.

Co-Authored-By: Claude <noreply@anthropic.com>
Pos translates each coord-valued sel into an integer-index isel against
the dim's coords and hands the dict to Idx. That picks up Idx's new
partition-pruning behavior for free: --pos t=<value> on particles now
reads bulk from exactly the nearest file's partitions, not all of them.

Non-coord dims (e.g. filtering particle columns like px by value range)
keep the predicate-filter path; Idx can't handle those since it needs
coords for the isel translation.
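
The translation described above can be sketched as follows. `nearest_index` and `pos_to_isel` are hypothetical helpers, and nearest-neighbour lookup is assumed from the `--pos t=999` test resolving to the last file:

```python
def nearest_index(coords, value):
    """Integer index of the coord closest to `value`."""
    return min(range(len(coords)), key=lambda i: abs(coords[i] - value))


def pos_to_isel(sel, coords_by_dim):
    """Split a sel into an isel dict (coord dims) and leftover predicates (other dims)."""
    isel, predicates = {}, {}
    for dim, value in sel.items():
        if dim in coords_by_dim:
            # Coord dim: translate the value into an integer index for Idx.
            isel[dim] = nearest_index(coords_by_dim[dim], value)
        else:
            # Non-coord dim (e.g. a particle column): keep the predicate-filter path.
            predicates[dim] = value
    return isel, predicates


coords = {"t": [0.0, 10.0, 20.0, 30.0]}
isel, predicates = pos_to_isel({"t": 999, "px": (-1.0, 1.0)}, coords)
```

In this example `t=999` lands on the last index, and the `px` range is left for the predicate filter.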

Idx is lazy-imported inside apply_list to dodge a circular import via
lib.plotting.animated_plot -> idx.

Co-Authored-By: Claude <noreply@anthropic.com>
JamesMcClung merged commit 3908d80 into main May 14, 2026
2 checks passed
JamesMcClung deleted the time-partition branch May 14, 2026 19:34