paigem commented May 9, 2023
| As an example, if a file has dimensions `(time=744, lat=180, lon=360)`, the default approach would result in chunks of e.g. `(1, 180, 360)`, so that each disk read extracts an area at a single time step. If this file is expected to be used for timeseries analysis (i.e. to do computations for a single location in space across all timesteps), we would need to read in the entire dataset to access that single location across all time. For this timeseries analysis, a better chunking strategy would be `(744, 1, 1)`, so that each disk read extracts all time steps at a single point location. See the [rechunking section](https://acdguide.github.io/BigData/computations/computations.html#rechunking) for some example tools to help with rechunking. See {ref}`cdo` and {ref}`nco` for overviews of what those tools offer, and also note that NCO has a tool for [timeseries reshaping](http://nco.sourceforge.net/nco.html#Timeseries-Reshaping-mode_002c-aka-Splitting). |
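As a hedged illustration of the rechunking described above (not taken from the linked docs), the following sketch uses xarray and dask to rewrite a file with a timeseries-friendly layout; the file name `data.nc` and variable name `tas` are hypothetical stand-ins.

```python
import xarray as xr

# Open lazily with dask, then rechunk toward a time-contiguous layout.
# chunks={"time": 100} keeps the initial read in manageable pieces.
ds = xr.open_dataset("data.nc", chunks={"time": 100})

# -1 means "all 744 timesteps in one chunk"; (744, 1, 1) is the extreme
# case described above, (744, 10, 10) is a more practical compromise.
ds = ds.chunk({"time": -1, "lat": 10, "lon": 10})

# "chunksizes" is per-variable netCDF4 encoding that makes the on-disk
# chunking of the output file match the in-memory layout.
ds.to_netcdf(
    "data_timeseries.nc",
    encoding={"tas": {"chunksizes": (744, 10, 10)}},
)
```

For very large files, a streaming tool such as [rechunker](https://rechunker.readthedocs.io/) or the NCO/CDO utilities linked above may be more memory-efficient than rechunking through xarray.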
| For data where mixed mode analysis is required, it is best to find a chunking scheme that balances these two approaches, and results in chunk sizes that are broadly commensurate with typical on-board memory. In other words, we might pick a chunking approach like `(100, 180, 360)`. General advice is to aim for chunks between 100-500MB, to minimise file reads while balancing with typical available memory sizes (say 8GB). |
paigem (Contributor, Author) commented:
On disk, we should use chunk sizes of 4MB? @paolap thinks this is too small, so maybe we can say ~20-100MB.
@dougiesquire notes that this is dependent on the system used.
- We should add: experimentation is likely required to figure out the optimal chunk size.
- Chunking also affects compression capacity.
- If writing netCDF through xarray (`to_netcdf()`), you can specify chunks, but xarray will also use a default if the user does not specify one (the sketch after this list shows how to inspect the chunking a file was written with).
- Link to https://docs.unidata.ucar.edu/nug/current/netcdf_perf_chunking.html
Final decision:
- Write a general sentence about importance of chunking and link to blog
- Add sentence about difference between chunking on disk vs in dask
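As a minimal sketch of the inspection mentioned above (the file name `data.nc` and variable name `tas` are hypothetical), the netCDF4-python library can report the chunking a file was actually written with:

```python
import netCDF4

with netCDF4.Dataset("data.nc") as nc:
    var = nc.variables["tas"]
    # Returns a list of per-dimension chunk sizes, or "contiguous"
    # if the variable was written without chunking.
    print(var.chunking())
```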
hot007 requested changes May 10, 2023
hot007 (Contributor) left a comment:
Looks good, but others please review my suggestions before accepting/merging!
| For data where mixed mode analysis is required, it is best to find a chunking scheme that balances these two approaches, and results in chunk sizes that are broadly commensurate with typical on-board memory. In other words, we might pick a chunking approach like `(100, 180, 360)`, which would result in chunks that are approximately `25MB`. This is reasonably computationally efficient, though the chunks could be bigger. General advice is to aim for chunks between 100-500MB, to minimise file reads while balancing with typical available memory sizes (say 8GB). |
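A quick back-of-envelope check of the `25MB` figure above, assuming 4-byte (float32) values:

```python
# One (100, 180, 360) chunk of float32 values:
nelems = 100 * 180 * 360         # 6,480,000 values
nbytes = nelems * 4              # float32 -> 25,920,000 bytes
print(f"{nbytes / 1e6:.1f} MB")  # ~25.9 MB
```

The same arithmetic can be reused to size chunks toward the 100-500MB guideline.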
| Large data arrays are composed of smaller units called *chunks*. This is why some software, like xarray, can load data lazily, i.e. load into memory only the data chunks it needs to perform a specific operation (see some examples in the [Analysis section](https://acdguide.github.io/BigData/computations/computations-intro.html)). |
| All data stored in netcdf files have been written in chunks, following some chunking strategy. [NCO](https://acdguide.github.io/BigData/software/software-other.html#nco) has a [list of different chunking policies](https://nco.sourceforge.net/nco.html#Chunking) that you can apply to write a netcdf file. The most common and default approach is to prioritise accessing the data as a grid, so that retrieving all grid points at one timestep will require loading only 1 or few chunks at one time. This chunking strategy means that each timestep is on a different chunk. While this is ideal for some types of computations (e.g. to plot a single timestep of the data), this chunking scheme is very slow (and sometimes prohibitively so) in other cases (e.g. to analyse a timeseries). |
Contributor
Suggested change
| All data stored in netcdf files have been written in chunks, following some chunking strategy. [NCO](https://acdguide.github.io/BigData/software/software-other.html#nco) has a [list of different chunking policies](https://nco.sourceforge.net/nco.html#Chunking) that you can apply to write a netcdf file. The most common and default approach is to prioritise accessing the data as a grid, so that retrieving all grid points at one timestep will require loading only 1 or few chunks at one time. This chunking strategy means that each timestep is on a different chunk. While this is ideal for some types of computations (e.g. to plot a single timestep of the data), this chunking scheme is very slow (and sometimes prohibitively so) in other cases (e.g. to analyse a timeseries). | |
| All data stored in netCDF files have been written to storage in chunks, following some chunking strategy. [NCO](https://acdguide.github.io/BigData/software/software-other.html#nco) has a [list of different chunking policies](https://nco.sourceforge.net/nco.html#Chunking) that you can apply to write a netCDF file. Ideally, chunk sizes should align with/be a multiple of "block" sizes (write quanta) on the underlying storage infrastructure, e.g. a multiple of 4MB. Note that the size of chunks chosen can affect how compressible the resulting file is. | |
| The most common and default approach is to prioritise accessing the data as a grid, so that retrieving all grid points at one timestep will require loading only one or a few chunks at a time. This chunking strategy means that each timestep is on a different chunk. While this is ideal for some types of computations (e.g. to plot a single timestep of the data), this chunking scheme is very slow to read (and sometimes prohibitively so) in other cases (e.g. to analyse a timeseries). xarray's `to_netcdf()` will apply a default chunking unless a chunking scheme is specified, which may not be appropriate for the data. Unidata offers some [advice on chunking and netCDF performance](https://docs.unidata.ucar.edu/nug/current/netcdf_perf_chunking.html). | |
| The above regards the data chunks written to storage. When working with `dask`, the user can also specify a chunk size. This does not change how the data is stored on disk, only how it is held in memory. Dask-specified chunks should therefore be multiples of the file chunks, otherwise read performance can be severely compromised. |
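A minimal sketch of the alignment described in the suggestion above, assuming a file written with on-disk chunks of `(1, 180, 360)` (the file name is hypothetical):

```python
import xarray as xr

# Each dask chunk below spans 100 whole on-disk chunks along time,
# so every stored chunk is read exactly once.
ds = xr.open_dataset("data.nc", chunks={"time": 100, "lat": 180, "lon": 360})

# By contrast, chunks={"time": 100, "lat": 90, "lon": 90} would split
# each on-disk chunk across several dask chunks, forcing repeated reads.
```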
Contributor
I've tried to address the items arising from our discussion in the meeting but my changes would benefit from others' review!
| ## NetCDF metadata |
| When using self-describing data formats (such as netCDF), it is important to understand the various attributes contained in the metadata, how to interact with them, and potential issues of which to remain aware. |
| When using self-describing data formats (such as netCDF), it is important to understand the various attributes contained in the metadata, how to interact with them, and potential issues to remain aware of. |
Contributor
Reject change; the original wording removed a dangling preposition (which may or may not really be a rule in English!)
Suggested change
| When using self-describing data formats (such as netCDF), it is important to understand the various attributes contained in the metadata, how to interact with them, and potential issues to remain aware of. | |
| When using self-describing data formats (such as netCDF), it is important to understand the various attributes contained in the metadata, how to interact with them, and potential issues of which to remain aware. |
Closes #92