Support lazy saving #4190

@bouweandela

Description

✨ Feature Request

The iris.save function only supports saving a single file at a time and is not lazy. However, the dask.array.store function that backs the NetCDF saver supports delayed saving. For our use case it would be computationally more efficient if iris.save supported this too. Would it be possible to add a compute=False option to iris.save, so that instead of saving immediately it returns a dask.delayed.Delayed object that can be computed at a later time?
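To illustrate the mechanism the request builds on (this is plain dask, not iris API): dask.array.store with compute=False returns delayed objects that can all be computed in a single graph, so shared upstream chunks are evaluated only once.

```python
import dask
import dask.array as da
import numpy as np

# A lazy source array; in our use case this would be an expensively
# regridded cube's core data.
source = da.ones((4, 4), chunks=(2, 2))

# In-memory stand-ins for netCDF variables.
targets = [np.zeros((4, 4)), np.zeros((4, 4))]

# compute=False makes da.store return a Delayed instead of writing now.
delayed = [
    da.store(source + i, target, compute=False, lock=False)
    for i, target in enumerate(targets)
]

# One compute call executes both stores in a single graph, so the chunks
# of `source` are computed only once and re-used for both targets.
dask.compute(*delayed)
```

The same pattern is what a compute=False option on iris.save would expose for cubes.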

Alternatively, a save function in iris that saves a list of cubes to a matching list of files, one cube per file (similar to da.store, but working on cubes), would also work for us.
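This alternative also maps directly onto existing dask behaviour: da.store accepts matching lists of sources and targets and writes them all within one graph. A minimal sketch with in-memory targets standing in for netCDF variables:

```python
import dask.array as da
import numpy as np

# A shared lazy input, analogous to data that several output cubes
# derive from.
shared = da.arange(8, chunks=4)

# Matching lists: one source per target, like "one cube per file".
sources = [shared * 2, shared + 1]
targets = [np.empty(8), np.empty(8)]

# A single da.store call builds one graph for all stores, so the chunks
# of `shared` are computed only once.
da.store(sources, targets, lock=False)
```

A cube-aware version of this would save each cube in the list to the corresponding file while re-using any shared intermediate chunks.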

Motivation

In our case, multi model statistics in ESMValTool, we are interested in computing statistics (e.g. mean, median) over a number of climate models (cubes). Before we can compute those statistics, we need to load the data from disk and regrid the cubes to the same horizontal grid (and optionally to the same vertical levels). Then we merge all cubes into a single cube with a 'model' dimension and collapse along that dimension using e.g. iris.analysis.MEAN to compute the mean.

We want to store both the regridded input cubes and the cube(s) containing the statistics, each cube in its own netCDF file according to the CMIP/CMOR conventions. Because iris.save only saves a single cube to a single file and executes immediately, the load and regrid steps need to be executed (1 + the number of statistics) times. Support for delayed saving (or for saving a list of cubes to a matching list of files) would save computational time, because each chunk would be loaded and regridded only once and then re-used both for storing the regridded cube and for computing each statistic.

Additional context

Example script that shows the use case

This is an example script that demonstrates our workflow and how we could use the requested save function to speed up the multi-model statistics computation. Note that the script uses lazy multi-model statistics, which are still in development in ESMValGroup/ESMValCore#968.

import os
import sys

import dask
import dask.array as da
import iris
from netCDF4 import Dataset

from esmvalcore.preprocessor import multi_model_statistics, regrid


def save(cube, target, compute):
    """Save the data from a 3D cube to file using da.store."""
    # Note: the file handle must stay open until the delayed store is computed.
    dataset = Dataset(target, "w")
    dataset.createDimension("time", cube.shape[0])
    dataset.createDimension("lat", cube.shape[1])
    dataset.createDimension("lon", cube.shape[2])
    dataset.createVariable(
        "var",
        "f4",
        (
            "time",
            "lat",
            "lon",
        ),
    )

    return da.store(cube.core_data(), dataset["var"], compute=compute)


def main(in_filenames):
    """Compute multi-model statistics over the input files."""
    target_grid = "1x1"
    cubes = {}
    for in_filename in in_filenames:
        cube = iris.load_cube(in_filename)
        cube = regrid(cube, target_grid, scheme="linear")
        out_filename = os.path.basename(in_filename)
        cubes[out_filename] = cube

    statistics = multi_model_statistics(cubes.values(), "overlap", ["mean", "std_dev"])
    for statistic, cube in statistics.items():
        out_filename = statistic + ".nc"
        cubes[out_filename] = cube

    results = []
    for out_filename, cube in cubes.items():
        result = save(cube, out_filename, compute=False)
        results.append(result)

    dask.compute(results)

    # for out_filename, cube in cubes.items():
    #     iris.save(cube, out_filename)


if __name__ == "__main__":
    # This script takes a list of netCDF files containing 3D variables as arguments
    main(sys.argv[1:])
