paigem commented May 9, 2023
| As an example, if a file has dimensions `(time=744, lat=180, lon=360)`, the default approach would result in chunks of e.g. `(1, 180, 360)`, so that each disk read extracts an area at a single time step. If this file is expected to be used for timeseries analysis (i.e. to do computations for a single location in space across all timesteps), we would need to read in the entire dataset to access that single location across all time. For this timeseries analysis, a better chunking strategy would be `(744, 1, 1)`, so that each disk read extracts all time steps at a single point location. See the [rechunking section](https://acdguide.github.io/BigData/computations/computations.html#rechunking) for some example tools to help with rechunking. See {ref}`cdo` and {ref}`nco` for overviews of what those tools offer, and also note that NCO has a tool for [timeseries reshaping](http://nco.sourceforge.net/nco.html#Timeseries-Reshaping-mode_002c-aka-Splitting). |
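As a hedged illustration of the rechunking described above (not taken from the linked docs), the following sketch uses xarray and dask to rewrite a file with a timeseries-friendly layout; the file name `data.nc` and variable name `tas` are hypothetical stand-ins.

```python
import xarray as xr

# Open lazily with dask, then rechunk toward a time-contiguous layout.
# chunks={"time": 100} keeps the initial read in manageable pieces.
ds = xr.open_dataset("data.nc", chunks={"time": 100})

# -1 means "all 744 timesteps in one chunk"; (744, 1, 1) is the extreme
# case described above, (744, 10, 10) is a more practical compromise.
ds = ds.chunk({"time": -1, "lat": 10, "lon": 10})

# "chunksizes" is per-variable netCDF4 encoding that makes the on-disk
# chunking of the output file match the in-memory layout.
ds.to_netcdf(
    "data_timeseries.nc",
    encoding={"tas": {"chunksizes": (744, 10, 10)}},
)
```

For very large files, a streaming tool such as [rechunker](https://rechunker.readthedocs.io/) or the NCO/CDO utilities linked above may be more memory-efficient than rechunking through xarray.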
| For data where mixed mode analysis is required, it is best to find a chunking scheme that balances these two approaches, and results in chunk sizes that are broadly commensurate with typical on-board memory. In other words, we might pick a chunking approach like `(100, 180, 360)`. General advice is to aim for chunks between 100-500MB, to minimise file reads while balancing with typical available memory sizes (say 8GB). |
paigem (Contributor, Author) commented:
On disk, we should use chunk sizes of 4MB? @paolap thinks this is too small, so maybe we can say ~20-100MB.
@dougiesquire notes that this is dependent on the system used.
- We should add: experimentation is likely required to figure out the optimal chunk size.
- Chunking also affects compression capacity.
- If writing netCDF through xarray (`to_netcdf()`), you can specify chunks, but xarray will also use a default if the user does not specify one (the sketch after this list shows how to inspect the chunking a file was written with).
- Link to https://docs.unidata.ucar.edu/nug/current/netcdf_perf_chunking.html
Final decision:
- Write a general sentence about importance of chunking and link to blog
- Add sentence about difference between chunking on disk vs in dask
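As a minimal sketch of the inspection mentioned above (the file name `data.nc` and variable name `tas` are hypothetical), the netCDF4-python library can report the chunking a file was actually written with:

```python
import netCDF4

with netCDF4.Dataset("data.nc") as nc:
    var = nc.variables["tas"]
    # Returns a list of per-dimension chunk sizes, or "contiguous"
    # if the variable was written without chunking.
    print(var.chunking())
```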
hot007 requested changes May 10, 2023
hot007 (Contributor) left a comment:
Looks good, but others please review my suggestions before accepting/merging!
| For data where mixed mode analysis is required, it is best to find a chunking scheme that balances these two approaches, and results in chunk sizes that are broadly commensurate with typical on-board memory. In other words, we might pick a chunking approach like `(100, 180, 360)`, which would result in chunks that are approximately `25MB`. This is reasonably computationally efficient, though the chunks could be bigger. General advice is to aim for chunks between 100-500MB, to minimise file reads while balancing with typical available memory sizes (say 8GB). |
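A quick back-of-envelope check of the `25MB` figure above, assuming 4-byte (float32) values:

```python
# One (100, 180, 360) chunk of float32 values:
nelems = 100 * 180 * 360         # 6,480,000 values
nbytes = nelems * 4              # float32 -> 25,920,000 bytes
print(f"{nbytes / 1e6:.1f} MB")  # ~25.9 MB
```

The same arithmetic can be reused to size chunks toward the 100-500MB guideline.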
| Large data arrays are composed of smaller units called *chunks*. This is why some software, like xarray, can load data lazily, i.e. load into memory only the data chunks it needs to perform a specific operation (see some examples in the [Analysis section](https://acdguide.github.io/BigData/computations/computations-intro.html)). |
| All data stored in netcdf files have been written in chunks, following some chunking strategy. [NCO](https://acdguide.github.io/BigData/software/software-other.html#nco) has a [list of different chunking policies](https://nco.sourceforge.net/nco.html#Chunking) that you can apply to write a netcdf file. The most common and default approach is to prioritise accessing the data as a grid, so that retrieving all grid points at one timestep will require loading only 1 or few chunks at one time. This chunking strategy means that each timestep is on a different chunk. While this is ideal for some types of computations (e.g. to plot a single timestep of the data), this chunking scheme is very slow (and sometimes prohibitively so) in other cases (e.g. to analyse a timeseries). |
Contributor
Suggested change
| All data stored in netcdf files have been written in chunks, following some chunking strategy. [NCO](https://acdguide.github.io/BigData/software/software-other.html#nco) has a [list of different chunking policies](https://nco.sourceforge.net/nco.html#Chunking) that you can apply to write a netcdf file. The most common and default approach is to prioritise accessing the data as a grid, so that retrieving all grid points at one timestep will require loading only 1 or few chunks at one time. This chunking strategy means that each timestep is on a different chunk. While this is ideal for some types of computations (e.g. to plot a single timestep of the data), this chunking scheme is very slow (and sometimes prohibitively so) in other cases (e.g. to analyse a timeseries). | |
| All data stored in netCDF files have been written to storage in chunks, following some chunking strategy. [NCO](https://acdguide.github.io/BigData/software/software-other.html#nco) has a [list of different chunking policies](https://nco.sourceforge.net/nco.html#Chunking) that you can apply to write a netCDF file. Ideally, chunk sizes should align with/be a multiple of "block" sizes (write quanta) on the underlying storage infrastructure, e.g. a multiple of 4MB. Note that the size of chunks chosen can affect how compressible the resulting file is. | |
| The most common and default approach is to prioritise accessing the data as a grid, so that retrieving all grid points at one timestep will require loading only one or a few chunks at a time. This chunking strategy means that each timestep is on a different chunk. While this is ideal for some types of computations (e.g. to plot a single timestep of the data), this chunking scheme is very slow to read (and sometimes prohibitively so) in other cases (e.g. to analyse a timeseries). xarray's `to_netcdf()` will apply a default chunking unless a chunking scheme is specified, which may not be appropriate for the data. Unidata offers some [advice on chunking and netCDF performance](https://docs.unidata.ucar.edu/nug/current/netcdf_perf_chunking.html). | |
| The above regards the data chunks written to storage. When working with `dask`, the user can also specify a chunk size. This does not change how the data is stored on disk, only how it is held in memory. Dask-specified chunks should therefore be multiples of the file chunks, otherwise read performance can be severely compromised. |
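A minimal sketch of the alignment described in the suggestion above, assuming a file written with on-disk chunks of `(1, 180, 360)` (the file name is hypothetical):

```python
import xarray as xr

# Each dask chunk below spans 100 whole on-disk chunks along time,
# so every stored chunk is read exactly once.
ds = xr.open_dataset("data.nc", chunks={"time": 100, "lat": 180, "lon": 360})

# By contrast, chunks={"time": 100, "lat": 90, "lon": 90} would split
# each on-disk chunk across several dask chunks, forcing repeated reads.
```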
Contributor
I've tried to address the items arising from our discussion in the meeting but my changes would benefit from others' review!
| ## NetCDF metadata |
| When using self-describing data formats (such as netCDF), it is important to understand the various attributes contained in the metadata, how to interact with them, and potential issues of which to remain aware. |
| When using self-describing data formats (such as netCDF), it is important to understand the various attributes contained in the metadata, how to interact with them, and potential issues to remain aware of. |
Contributor
Reject change; the original wording removed a dangling preposition (which may or may not really be a rule in English!)
Suggested change
| When using self-describing data formats (such as netCDF), it is important to understand the various attributes contained in the metadata, how to interact with them, and potential issues to remain aware of. | |
| When using self-describing data formats (such as netCDF), it is important to understand the various attributes contained in the metadata, how to interact with them, and potential issues of which to remain aware. |
Closes #92