Compatibility for zarr-python 3.x (#9552)
Force-pushed from 1ed4ef1 to bb2bb6c.
TomAugspurger left a comment:
This set of changes should be backwards compatible and work with zarr-python 2.x (so reading and writing zarr v2 data).
I'll work through zarr-python 3.x now. I think we might want to parametrize most of these tests by zarr_version=[2, 3] to confirm that we can read / write zarr v2 data with zarr-python 3.x
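A sketch of what that parametrization might look like (pytest assumed; the test name and body are hypothetical placeholders, not the PR's actual tests):

```python
import pytest


# Hypothetical sketch: run each zarr roundtrip test against both
# on-disk formats once zarr-python 3.x (which can read and write
# both v2 and v3 data) is installed.
@pytest.mark.parametrize("zarr_format", [2, 3])
def test_zarr_roundtrip(zarr_format):
    # e.g. write with ds.to_zarr(...) using the given format,
    # read it back, and assert equality with the original dataset
    ...
```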
xarray/backends/zarr.py (outdated):

```python
if _zarr_v3() and zarr_array.metadata.zarr_format == 3:
    encoding["codec_pipeline"] = [
        x.to_dict() for x in zarr_array.metadata.codecs
    ]
```
Maybe this instead?

```diff
-            x.to_dict() for x in zarr_array.metadata.codecs
+            zarr_array.metadata.to_dict()["codecs"]
```
A bit wasteful since everything has to be serialized, but presumably zarr knows better how to serialize the codec pipeline than we do here?
Force-pushed from 9f2cb2f to d11d593.
* removed open_consolidated workarounds
* removed _store_version check
* pass through zarr_version
Force-pushed from a324329 to 6087e5e.
- skip write_empty_chunks on 3.x
- update patch targets
jhamman left a comment:
Great progress here @TomAugspurger. I'm impressed by how little you've changed in the backend itself and I'm noting the pain around testing (I felt some of that w/ dask as well).
I just pushed a commit reverting the changes to avoid values equal to the
I think this is ready to go once CI finishes. I expect upstream-ci to fail on the
There's one typing failure we might want to address: I'll do some reading about how best to handle type annotations when the proper type depends on the version of a dependency. Edit: a complication here is that this is in

I don't see why the typing of
Good catch, this affects both. I was hoping something like this would work:

```python
from pathlib import Path

try:
    from zarr.storage import StoreLike as _StoreLike
except ImportError:
    _StoreLike = str | Path

StoreLike = type[_StoreLike]

def f(x: StoreLike) -> StoreLike:
    return x
```

but mypy doesn't like that.
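For what it's worth, one pattern that type checkers generally accept is resolving the alias inside a `TYPE_CHECKING` branch, so the two definitions never collide at analysis time. A rough sketch, not the PR's actual solution, and the `str | Path` fallback is an assumption:

```python
from __future__ import annotations

from pathlib import Path
from typing import TYPE_CHECKING, Union

if TYPE_CHECKING:
    # Static type checkers take this branch; it resolves only when
    # zarr (with its type information) is importable in the
    # type-checking environment.
    from zarr.storage import StoreLike
else:
    # Runtime fallback when zarr-python 3.x is not installed.
    StoreLike = Union[str, Path]


def f(x: StoreLike) -> StoreLike:
    return x
```

At runtime `TYPE_CHECKING` is always `False`, so the fallback alias is used and the code runs without zarr installed.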
my 2 cents... we should not get hung up on this right now. (a) There are plenty of other failures in the upstream-dev-mypy check unrelated to this PR, and (b) it's probably not worth hacking something in here when there are bigger issues with the upstream zarr implementation to sort out.
dcherian left a comment:
Thanks @TomAugspurger et al. This looks good. I have some minor comments, which I can address later today.
```python
zarr.consolidate_metadata(self.zarr_group.store)
kwargs = {}
if _zarr_v3():
    # https://github.com/zarr-developers/zarr-python/pull/2113#issuecomment-2386718323
```
Can this be removed at some point in the future? If so, it would be good to add a TODO
I'll look more closely later, but for now I think this will be required, following a deliberate change in zarr v3 consolidated metadata.

With v2 metadata, I think that consolidation happened at the store level and was all-or-nothing. If you have two groups with arrays, the consolidated metadata will be placed at the store root and will contain everything:
```python
# zarr v2
In [1]: import json, xarray as xr

In [2]: store = {}

In [3]: a = xr.tutorial.load_dataset("air_temperature")

In [4]: b = xr.tutorial.load_dataset("rasm")

In [5]: a.to_zarr(store=store, group="A")
/Users/tom/gh/zarr-developers/zarr-v2/.direnv/python-3.10/lib/python3.10/site-packages/xarray/core/dataset.py:2562: SerializationWarning: saving variable None with floating point data as an integer dtype without any _FillValue to use for NaNs
  return to_zarr(  # type: ignore[call-overload,misc]
Out[5]: <xarray.backends.zarr.ZarrStore at 0x11113edc0>

In [6]: b.to_zarr(store=store, group="B")
Out[6]: <xarray.backends.zarr.ZarrStore at 0x10cab2440>

In [7]: list(json.loads(store['.zmetadata'])['metadata'])
Out[7]:  # contains nodes from both A and B
['.zgroup',
 'A/.zattrs',
 'A/.zgroup',
 'A/air/.zarray',
 'A/air/.zattrs',
 'A/lat/.zarray',
 'A/lat/.zattrs',
 'A/lon/.zarray',
 'A/lon/.zattrs',
 'A/time/.zarray',
 'A/time/.zattrs',
 'B/.zattrs',
 'B/.zgroup',
 'B/Tair/.zarray',
 'B/Tair/.zattrs',
 'B/time/.zarray',
 'B/time/.zattrs',
 'B/xc/.zarray',
 'B/xc/.zattrs',
 'B/yc/.zarray',
 'B/yc/.zattrs']
```

With v3, consolidated metadata is scoped to a group, so we can specify the group we want to consolidate. (The zarr-python API does support "consolidate everything in the store at the root", but I don't think we want that, because you'd need to open the store at the root when reading, and I think it's kind of weird for `ds.to_zarr(group="A")` to be reading / writing stuff outside of the `A` prefix.)
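As a plain-dict illustration of that difference (no zarr involved; the v3 key names assume zarr-python 3's layout, where consolidated metadata is embedded per group rather than in one store-level blob):

```python
# Hand-written sketch, not real zarr output.
# v2: a single store-level ".zmetadata" document covers every group.
v2_keys = [".zmetadata", "A/.zgroup", "A/air/.zarray", "B/.zgroup"]

# v3: each group's zarr.json can carry its own consolidated metadata,
# so consolidating group "A" touches nothing under "B/".
v3_keys = ["A/zarr.json", "A/air/zarr.json", "B/zarr.json"]

# Consolidating at "A" in v3 only concerns keys under the "A/" prefix.
scoped = [k for k in v3_keys if k.startswith("A/")]
```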
Potentially it would make sense to have two versions of consolidated metadata:
- Everything at a specific group/node level
- Everything in a group and all of its subgroups (i.e., for DataTree)
Agreed. zarr-developers/zarr-specs#309 has some discussion on adding a depth field to the spec for consolidated metadata. That's currently implicitly depth=None, which is everything below a group. depth=0 or 1 would be just the immediate children. That's not standardized or implemented anywhere yet, but the current implementation is forwards compatible and it shouldn't be a ton of effort.
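A rough sketch of those proposed semantics (a hypothetical helper, not part of any spec or implementation):

```python
# Hypothetical illustration of a `depth` field for consolidated
# metadata: keep only entries fewer than `depth` levels below the
# consolidation root; depth=None means everything below the group,
# matching the current implicit behavior.
def filter_by_depth(metadata_keys, depth=None):
    if depth is None:
        return list(metadata_keys)
    # each "/" in a key corresponds to one extra level of nesting
    return [k for k in metadata_keys if k.count("/") < depth]


keys = ["air", "sub/air", "sub/deeper/x"]
```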
Merged `main`:
- Fix multiple grouping with missing groups (pydata#9650)
- flox: Properly propagate multiindex (pydata#9649)
- Update Datatree html repr to indicate inheritance (pydata#9633)
- Re-implement map_over_datasets using group_subtrees (pydata#9636)
- fix zarr intersphinx (pydata#9652)
- Replace black and blackdoc with ruff-format (pydata#9506)
- Fix error and missing code cell in io.rst (pydata#9641)
- Support alternative names for the root node in DataTree.from_dict (pydata#9638)
- Updates to DataTree.equals and DataTree.identical (pydata#9627)
- DOC: Clarify error message in open_dataarray (pydata#9637)
- Add zip_subtrees for paired iteration over DataTrees (pydata#9623)
- Type check datatree tests (pydata#9632)
- Add missing `memo` argument to DataTree.__deepcopy__ (pydata#9631)
- Bug fixes for DataTree indexing and aggregation (pydata#9626)
- Add inherit=False option to DataTree.copy() (pydata#9628)
- docs(groupby): mention deprecation of `squeeze` kwarg (pydata#9625)
- Migration guide for users of old datatree repo (pydata#9598)
- Reimplement Datatree typed ops (pydata#9619)
Let's get this in by the end of the week.
Merged `main`:
- Add close() method to DataTree and use it to clean-up open files in tests (pydata#9651)
- Change URL for pydap test (pydata#9655)
👏 Thanks all! Especially @TomAugspurger for doing the lion's share of the work here.
Can directly rely on upstream xarray's ZarrStore.open_store_variable method since Zarr v3 compatibility was added in pydata/xarray#9552.
This PR begins the process of adding compatibility with zarr-python 3.x. It's intended to be run against zarr-python v3 + the open PRs referenced in #9515.
All of the zarr test cases should be parameterized by `zarr_format=[2, 3]` with zarr-python 3.x to exercise reading and writing both formats.

This is currently passing with zarr-python==2.18.3. zarr-python 3.x has about 61 failures, all of which are related to data types that aren't yet implemented in zarr-python 3.x.

I'll also note that #5475 is going to become a larger issue once people start writing Zarr-V3 datasets.
`_FillValue` really the same as zarr's `fill_value`? #5475
whats-new.rst