Description
✨ Feature Request
The `iris.save` function only supports saving a single file at a time and is not lazy. However, the `dask.array.store` function that backs the NetCDF saver supports delayed saving. For our use case, it would be computationally more efficient to have this supported by `iris.save`. Would it be possible to allow a `compute=False` option to the current `iris.save` function, so that instead of saving directly it returns a `dask.delayed.Delayed` object that can be computed at a later time?
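The delayed behaviour being requested already exists at the dask level. A minimal, self-contained sketch of the `da.store(..., compute=False)` pattern (a plain numpy array stands in here for a netCDF variable):

```python
import numpy as np
import dask
import dask.array as da

# The dask array plays the role of a cube's lazy data; the numpy array
# stands in for the on-disk netCDF variable the saver would write to.
lazy = da.arange(6, chunks=3)
target = np.zeros(6, dtype=lazy.dtype)

# With compute=False nothing is written yet; a Delayed object is returned.
delayed = da.store(lazy, target, compute=False)
assert not target.any()

# Triggering the computation performs the actual store.
dask.compute(delayed)
assert (target == np.arange(6)).all()
```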
Alternatively, a save function in iris that saves a list of cubes to a matching list of files, one cube per file (similar to `da.store`, but working on cubes), would also work for us.
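For the list-based alternative, `da.store` already accepts matching lists of sources and targets, which is the shape such a cube-level API could mirror. A small sketch, again with numpy arrays standing in for the per-file netCDF variables:

```python
import numpy as np
import dask
import dask.array as da

# One lazy source per output "file"; da.store pairs them up elementwise.
sources = [da.arange(4, chunks=2), da.ones(4, chunks=2)]
targets = [np.empty(4, dtype=s.dtype) for s in sources]

# A single Delayed covering all source/target pairs.
delayed = da.store(sources, targets, compute=False)
dask.compute(delayed)

assert (targets[0] == np.arange(4)).all()
assert (targets[1] == 1).all()
```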
Motivation
In our case, multi-model statistics in ESMValTool, we are interested in computing statistics (e.g. mean, median) over a number of climate models (cubes). Before we can compute those statistics, we need to load the data from disk and regrid the cubes to the same horizontal grid (and optionally to the same vertical levels). Then we merge all cubes into a single cube with a 'model' dimension and collapse along that dimension using e.g. `iris.analysis.MEAN` to compute the mean.
We want to store both the regridded input cubes and the cube(s) containing the statistics, each cube in its own netCDF file according to the CMIP/CMOR conventions. Because `iris.save` only allows saving a single cube to a single file and is executed immediately, the load and regrid need to be executed (1 + the number of statistics) times. Support for delayed saving (or saving a list of cubes to a matching list of files) would save computational time, because the regridded chunks could be re-used for computing each statistic (as well as for storing the regridded cube), so each chunk would only need to be loaded and regridded once.
Additional context
Example script that shows the use case
This is an example script that demonstrates our workflow and how we could use the requested save function to speed up the multi-model statistics computation. Note that the script uses lazy multi-model statistics, which are still in development in ESMValGroup/ESMValCore#968.
```python
import os
import sys

import dask
import dask.array as da
import iris
from netCDF4 import Dataset

from esmvalcore.preprocessor import multi_model_statistics, regrid


def save(cube, target, compute):
    """Save the data from a 3D cube to file using da.store."""
    dataset = Dataset(target, "w")
    dataset.createDimension("time", cube.shape[0])
    dataset.createDimension("lat", cube.shape[1])
    dataset.createDimension("lon", cube.shape[2])
    dataset.createVariable(
        "var",
        "f4",
        (
            "time",
            "lat",
            "lon",
        ),
    )
    return da.store(cube.core_data(), dataset["var"], compute=compute)


def main(in_filenames):
    """Compute multi-model statistics over the input files."""
    target_grid = "1x1"
    cubes = {}
    for in_filename in in_filenames:
        cube = iris.load_cube(in_filename)
        cube = regrid(cube, target_grid, scheme="linear")
        out_filename = os.path.basename(in_filename)
        cubes[out_filename] = cube

    statistics = multi_model_statistics(cubes.values(), "overlap", ["mean", "std_dev"])
    for statistic, cube in statistics.items():
        out_filename = statistic + ".nc"
        cubes[out_filename] = cube

    results = []
    for out_filename, cube in cubes.items():
        result = save(cube, out_filename, compute=False)
        results.append(result)
    dask.compute(results)

    # for out_filename, cube in cubes.items():
    #     iris.save(cube, out_filename)


if __name__ == "__main__":
    # This script takes a list of netCDF files containing 3D variables as arguments
    main(sys.argv[1:])
```