Time partition by JamesMcClung · Pull Request #37 · JamesMcClung/psc-plot

JamesMcClung · 2026-05-14T19:15:37Z

A somewhat brute-force method for lazily indexing list data by time. The design follows from these constraints:

Indexing a List by time must not scan every file to "find" the right time.
Allow for more than one partition per file (precludes using time as an index, since then some partitions could have non-distinct upper and lower boundaries).
Keep data flow one-way, i.e., don't back-propagate a list of file indices from the pipeline to the loader.

Field test passes on first run: xr.open_mfdataset + isel(t=-1) + compute already produces a dask graph that reads bulk from exactly the indexed file (verified: 1 read of 'jeh' from pfd.000000010.bp). The investigation methodology in the spec turned out to be belt-and-suspenders; nothing to fix in the field stack. The particle test is marked xfail and confirms the diagnosed shape: 77 bulk reads (7 columns x 11 files) instead of 7 (7 columns x 1 file). Fix deferred per the spec. Co-Authored-By: Claude <noreply@anthropic.com>

New optional ListMetadata fields describing the partition layout of the underlying dask DataFrame. Used by Idx (next commit) to do partition pruning instead of a predicate filter when iseling along the partition dim. None defaults preserve existing behavior. Co-Authored-By: Claude <noreply@anthropic.com>
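As a rough illustration of what such optional fields could look like, here is a minimal sketch. The class name ListMetadataSketch, the half-open (start, stop) encoding of partition_ranges, and the has_partition_layout helper are assumptions made for this example; the real ListMetadata carries other fields and may encode the ranges differently.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ListMetadataSketch:
    """Hypothetical stand-in for ListMetadata; only the new fields are shown."""

    # Name of the dim the dask partitions are laid out along (e.g. "t").
    partition_dim: Optional[str] = None
    # Assumed encoding: partition_ranges[i] is the half-open range (start, stop)
    # of dask partition numbers holding position i along partition_dim.
    partition_ranges: Optional[Tuple[Tuple[int, int], ...]] = None

    def has_partition_layout(self) -> bool:
        # Both fields must be set for partition pruning to be possible.
        return self.partition_dim is not None and self.partition_ranges is not None


# None defaults: metadata built without the new fields behaves as before.
legacy = ListMetadataSketch()
# A loader that tracks its layout: step 0 spans partitions 0-1, step 1 spans partition 2.
tracked = ListMetadataSketch(partition_dim="t", partition_ranges=((0, 2), (2, 3)))
```

With the None defaults, code that checks has_partition_layout() can fall back to the existing predicate-filter path whenever a loader has not populated the fields.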

Track per-step partition layout in metadata so Idx (next commit) can do dask-native partition pruning. Subfile chunking is preserved — each step still has CONFIG.dask_chunk_size-bounded partitions; we just record the ranges. Co-Authored-By: Claude <noreply@anthropic.com>
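The bookkeeping described here can be sketched without dask at all. The function below is a hypothetical helper, not the loader's actual code: it assumes each step is split into chunks of at most chunk_size rows, that an empty step still occupies one partition, and that ranges are recorded as half-open (start, stop) pairs of partition numbers.

```python
import math


def partition_ranges_for_steps(rows_per_step, chunk_size):
    """Return one half-open (start, stop) partition range per step.

    Each step is split into ceil(n_rows / chunk_size) partitions, so subfile
    chunking is kept; only the partition numbers per step are recorded.
    """
    ranges = []
    next_partition = 0
    for n_rows in rows_per_step:
        # Assumption: an empty step still yields one (empty) partition.
        n_parts = max(1, math.ceil(n_rows / chunk_size))
        ranges.append((next_partition, next_partition + n_parts))
        next_partition += n_parts
    return ranges


# Three steps with 250, 100, and 90 rows, chunked at 100 rows per partition.
ranges = partition_ranges_for_steps([250, 100, 90], chunk_size=100)
print(ranges)  # [(0, 3), (3, 4), (4, 5)]
```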

Mirror of the change in particle_bp. Same shape, same intent. Co-Authored-By: Claude <noreply@anthropic.com>

When iseling along the partition_dim of a list, use df.partitions[...] to let dask prune the graph instead of df[df[dim] == pos], which forces every partition to be read to evaluate the predicate. For the prt-bin-time idx case on test-2d: bulk reads drop from 77 (7 columns x 11 files) to 7 (7 columns x 1 file). The test_idx_efficient particle case is no longer xfail. LazyList.compute() now clears partition_dim/partition_ranges since they describe the dask layout and are meaningless after materialization to a pandas frame. Co-Authored-By: Claude <noreply@anthropic.com>
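The difference in read counts can be reproduced with a small pure-Python model. This is a sketch of the idea only: CountingPartition, select_by_predicate, and select_by_pruning are hypothetical stand-ins for lazy dask partitions, the df[df[dim] == pos] filter, and df.partitions[...] respectively, and the model uses one partition per step and ignores columns.

```python
class CountingPartition:
    """A lazy partition that counts how many times its data is read."""

    def __init__(self, rows):
        self._rows = rows
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._rows)


def make_partitions(n_steps=11, rows_per_step=3):
    # One partition per step, mirroring 11 files with one time step each.
    return [
        CountingPartition([{"t": step, "x": i} for i in range(rows_per_step)])
        for step in range(n_steps)
    ]


def select_by_predicate(partitions, t_value):
    # The predicate can only be evaluated on loaded rows, so every partition is read.
    out = []
    for part in partitions:
        out.extend(row for row in part.load() if row["t"] == t_value)
    return out


def select_by_pruning(partitions, partition_ranges, pos):
    # Only the partitions recorded for this position are read.
    start, stop = partition_ranges[pos]
    out = []
    for part in partitions[start:stop]:
        out.extend(part.load())
    return out


ranges = [(i, i + 1) for i in range(11)]

parts_a = make_partitions()
rows_a = select_by_predicate(parts_a, t_value=10)
predicate_reads = sum(p.reads for p in parts_a)

parts_b = make_partitions()
rows_b = select_by_pruning(parts_b, ranges, pos=-1)
pruned_reads = sum(p.reads for p in parts_b)

print(predicate_reads, pruned_reads)  # 11 1
```

Both paths return the same rows; only the number of partitions touched differs, which is the effect the 77-to-7 drop reflects once the 7 columns are accounted for.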

Document the new ListMetadata fields and the loader invariant that keeps them in sync with the dd.DataFrame layout. Without this, a future loader implementer could silently lose Idx's partition-pruning optimization by forgetting to set them. Co-Authored-By: Claude <noreply@anthropic.com>
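One way a loader author could guard this invariant is a small consistency check. The function below is a hypothetical sketch, not part of the project: it assumes partition_ranges holds contiguous half-open (start, stop) ranges, one per position along partition_dim, that together cover every partition of the DataFrame.

```python
def check_partition_layout(npartitions, partition_dim, partition_ranges, n_positions):
    """Raise ValueError if the recorded layout cannot match the DataFrame.

    npartitions is the partition count of the dask DataFrame; n_positions is
    the number of coordinate values along partition_dim.
    """
    if partition_dim is None and partition_ranges is None:
        return  # Layout not tracked: the predicate-filter path is used instead.
    if partition_dim is None or partition_ranges is None:
        raise ValueError("partition_dim and partition_ranges must be set together")
    if len(partition_ranges) != n_positions:
        raise ValueError("expected one partition range per position")
    expected_start = 0
    for start, stop in partition_ranges:
        # Assumption: ranges are contiguous, ordered, and non-empty.
        if start != expected_start or stop <= start:
            raise ValueError(f"bad partition range ({start}, {stop})")
        expected_start = stop
    if expected_start != npartitions:
        raise ValueError("partition ranges do not cover every partition")


# A consistent layout: 3 positions spread over 5 partitions.
check_partition_layout(5, "t", [(0, 3), (3, 4), (4, 5)], n_positions=3)
```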

Mirror the existing --idx t=-1 tests with --pos t=999 (nearest resolves to the last file). Field passes; particle xfails for the same structural reason Idx did, fix in next commit. Co-Authored-By: Claude <noreply@anthropic.com>

Pos translates each coord-valued sel into an integer-index isel against the dim's coords and hands the dict to Idx. That picks up Idx's new partition-pruning behavior for free: --pos t=<value> on particles now reads bulk from exactly the nearest file's partitions, not all of them. Non-coord dims (e.g. filtering particle columns like px by value range) keep the predicate-filter path; Idx can't handle those since it needs coords for the isel translation. Idx is lazy-imported inside apply_list to dodge a circular import via lib.plotting.animated_plot -> idx. Co-Authored-By: Claude <noreply@anthropic.com>
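The translation step can be sketched in a few lines. The helper names nearest_index and pos_to_idx, the coordinate values, and the px range are all made up for this example; the sketch only shows the shape of the idea: coord dims become integer indices for Idx, everything else stays on the predicate-filter path.

```python
def nearest_index(coords, value):
    """Index of the coordinate closest to value (first one wins on ties)."""
    return min(range(len(coords)), key=lambda i: abs(coords[i] - value))


def pos_to_idx(sel, coords_by_dim):
    """Split a coord-valued selection into an isel dict and the leftovers.

    Dims that have coords are translated to integer indices; dims without
    coords (e.g. particle columns) are returned unchanged for filtering.
    """
    idx = {}
    remaining = {}
    for dim, value in sel.items():
        if dim in coords_by_dim:
            idx[dim] = nearest_index(coords_by_dim[dim], value)
        else:
            remaining[dim] = value
    return idx, remaining


# Hypothetical coords: 11 time steps at t = 0, 10, ..., 100.
coords_by_dim = {"t": [10 * i for i in range(11)]}
idx, remaining = pos_to_idx({"t": 999, "px": (0.0, 1.0)}, coords_by_dim)
print(idx)        # {'t': 10}
print(remaining)  # {'px': (0.0, 1.0)}
```

Here t=999 lies past the last coordinate, so the nearest match is the final step, which is why the --pos t=999 tests mirror the --idx t=-1 tests.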

JamesMcClung and others added 9 commits May 14, 2026 14:41

particle_h5: populate partition_dim,partition_ranges

74a06e7


gitignore: +docs

c3e6872

test_idx_efficient: renames

fc1476e

idx: rename

272b40d

JamesMcClung added the optimization Improves performance label May 14, 2026

JamesMcClung and others added 2 commits May 14, 2026 15:18

test_idx_efficient: +pos tests

e37842d


JamesMcClung merged commit 3908d80 into main May 14, 2026
2 checks passed

JamesMcClung deleted the time-partition branch May 14, 2026 19:34

JamesMcClung merged 11 commits into main from time-partition
