This repository contains the code for creating the BigEarthNet v2 dataset in a reproducible manner.
If you use this work, please cite:
K. N. Clasen, L. Hackel, T. Burgert, G. Sumbul, B. Demir and V. Markl, "reBEN: Refined BigEarthNet Dataset for Remote Sensing Image Analysis", IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2025.
@inproceedings{clasen2025refinedbigearthnet,
  title={{reBEN}: Refined BigEarthNet Dataset for Remote Sensing Image Analysis},
  author={Clasen, Kai Norman and Hackel, Leonard and Burgert, Tom and Sumbul, Gencer and Demir, Beg{\"u}m and Markl, Volker},
  year={2025},
  booktitle={IEEE International Geoscience and Remote Sensing Symposium (IGARSS)},
}
At a high level, the pipeline:
- Derives the original Sentinel-2 L1C tile names that were used to construct the BigEarthNet v1.0 dataset.
- Searches for the current L1C tiles in the new Copernicus Data Space Ecosystem (CDSE).
- Downloads the L1C tiles from the CDSE service.
- Processes the L1C tiles into L2A tiles with sen2cor v2.11.
- Divides the tiles into 1200 m x 1200 m patches and associates each patch with a pixel-level reference map and scene-level multi-labels.
- Aligns the constructed patches and metadata with the BigEarthNet v1.0 Sentinel-1 patches to construct the complete dataset.
- Compresses the resulting dataset to prepare it for distribution.
The following sections describe how to execute the individual steps and provide some additional information in the Details subsections.
- Download & install Nix with
flake
support - Manually download the CORINE Land Cover (CLC) 2018 vector database (
gpkg
) via the copernicus.eu website, as it cannot be automatically downloaded in a script.- Add it to the store via
nix-store --add-fixed sha256 U2018_CLC2018_V2020_20u1.gpkg
- Add it to the store via
- Enter the prepared environment via
nix develop --impure .#
.
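As a quick sanity check (a sketch): `nix-store --add-fixed` is content-addressed and idempotent, so re-running it simply prints the resulting store path, which can be compared with the path the flake expects.

# re-running the add is idempotent and prints the resulting /nix/store path
nix-store --add-fixed sha256 U2018_CLC2018_V2020_20u1.gpkg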
To generate the mapping from the original Sentinel-2 L1C names from BigEarthNet v1.0 to the new Copernicus Data Space Ecosystem (CDSE) IDs, run:
nix run .#ben-s2-tile-names-to-ids-runner
This program will generate the required `bigearthnet_s2_tile_ids.csv`.
It is important to note that if the resulting CSV file differs from the reference CSV file stored in `tracked-artifacts`, the source tiles from CDSE have changed, and the resulting dataset will most likely differ from the current one. A minimal check is shown below.
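A minimal comparison sketch (the exact name of the reference file under `tracked-artifacts` is an assumption here):

# compare the freshly generated CSV against the tracked reference
diff bigearthnet_s2_tile_ids.csv tracked-artifacts/bigearthnet_s2_tile_ids.csv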
To download the tiles with the given IDs, run:
nix run .#ben-s2-l1c-tile-downloader -- --csv-path <PATH_TO_TILE_IDS.CSV> --output-dir <PATH_TO_DIR>
Important
You might need to re-run the above line multiple times until it succeeds. After a few seconds, all tile downloads will have been added to the job queue and will continue to be downloaded even if the main process is interrupted/stopped.
Previous experience has shown that the CDSE service is quite unreliable
and often fails in the middle of a download, causing the downloader script to fail for those tiles.
The only reliable option is to retry every failed invocation, possibly multiple times
(many different exit codes have been observed in testing).
However, even then, some data was unavailable and required contacting support so that they could fix the underlying issue first.
To avoid overloading the servers due to internal service issues on their side, the recommendation is to retry the script
manually 3 times (it will skip over already successfully downloaded tiles); see the sketch below.
If it continues to fail, please get in touch with the CDSE support by providing the error message in the associated `.err` files.
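A minimal retry sketch following this recommendation (the placeholders are the same as above; the sleep interval is an arbitrary choice):

# retry up to 3 times; already downloaded tiles are skipped on each run
for attempt in 1 2 3; do
  nix run .#ben-s2-l1c-tile-downloader -- --csv-path <PATH_TO_TILE_IDS.CSV> --output-dir <PATH_TO_DIR> && break
  echo "attempt ${attempt} failed, retrying..." >&2
  sleep 60  # give the service some time before the next attempt
done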
You should then verify the downloaded tiles by comparing the output file from:
nix run .#ben-s2-l1c-hasher <download-dir>
with the file `tracked-artifacts/ben_s2_l1c_hashes.csv` via `diff`.
We have seen cases where the resulting zip files changed without the ID changing:
35c35
< 4754499f-f990-556c-954e-713d49128c34.out,770c9bfbda7e310a55ca07e8dc429b40c9e5e30dfb8e8fe1a304d46d9b012e08
---
> 4754499f-f990-556c-954e-713d49128c34.out,d6644add4d88e2741083c2ef1a605d140d63d037da42566572481142c271833f
50c50
< 5f680c98-a4a9-588f-963f-d30f8d2138b1.out,b5e20b02f4ae1e83e69d73269c761ed7de54ae2bd9ea3e28b111a9ab2ce33209
---
> 5f680c98-a4a9-588f-963f-d30f8d2138b1.out,51f16658059b133d89fe590eac5cd7a8ab762cc9d9272137a7a0c1f64db85694
99c99
< c07fc9d1-cb2b-5cf8-b2f1-09b80df3ca8d.out,6229d8b3dfa256ac9c7cfe23bfb5f5a3abf977feb3cd306e8ebbe62811880874
---
> c07fc9d1-cb2b-5cf8-b2f1-09b80df3ca8d.out,cb65aa1a04961107d12f21ca58aaa2911ef1c1cf3612a3d336111771ec637b3d
In this instance, preview images and additional HTML metadata files were added to the zip files. However, the image data itself was unchanged.
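To check whether only auxiliary files changed, one option is to hash only the image payload (a sketch; it assumes the image bands are stored as `.jp2` members and that the member order is stable between the two archives):

# hash only the JP2 image members; added previews/HTML are ignored
for f in old_tile.zip new_tile.zip; do
  unzip -p "$f" '*.jp2' | sha256sum
done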
To convert the Sentinel-2 L1C tiles to the L2A data product, run the following command:
nix run .#ben-s2-l1c-to-l2a-converter-runner -- <download-dir> --export-dir <l2a-dir>
Similar to the download program mentioned earlier, this program
will keep working through the queue even if the main process is interrupted.
The conversion is quite CPU-intensive and may take a few hours to complete.
You can track the progress via `pueue status --group sentinel-l1c-to-l2a`.
Generating the BigEarthNet v2 data is the most resource-intensive part of the pipeline
and might require adjusting the pre-selected parameters in `postgres_conf.nix` on smaller servers.
The required services for the data generation need to be started via:
devenv up --tui=false
# or devenv up --keep-tui if the shell is configured correctly,
# which might not be the case for some SSH connections.
This command starts a dedicated PostgreSQL server and initializes the database with the required schemas.
If the `flyway-runner` reports that the schema has been applied successfully, stop the processes via `Ctrl+C`.
If the command fails, please see the Debugging section of the document.
After stopping the process, restart the environment in the background via `devenv up --tui=false & disown` or `pueue add 'devenv up --tui=false'` (might require running `pueued --daemonize` first).
Next, start the data generation also in the background by running
pueue add "\
nix run .#ben-data-generator -- \
--L2As-root-dir=<L2As_DOWNLOAD_PATH> \
--export-patch-dir=<+TIFF_DIR> \
--export-reference-maps-dir=<+REFERENCE_MAPS_DIR> \
--export-metadata-dir=<+METADATA_DIR> \
--v1-metadata-dir=$BEN_V1_METADATA_DIR \
--clc2018-gpkg-path=$CLC2018_PATH \
--country-geojson-path=$BEN_COUNTRY_GEOJSON_PATH"
Note that the export directories will be created if they do not exist, and that the input artifacts are already provided as environment variables linked to the hashed versions.
Important
The various metadata files are generated independently from one another and should not be used directly.
For example, `patch_id_label_mapping.csv` only contains patches with a minimum area covered by label information,
whereas `patch_id_split_mapping.csv` contains the split mapping for all generated patches and will therefore
contain more unique `patch_id` values than the label mapping file; the quick check below illustrates this.
For more details, see the code comments.
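A quick consistency check (a sketch; it assumes `patch_id` is the first CSV column; the header line is counted once in each file, which does not affect the comparison):

# the split mapping should list at least as many unique patch_ids
# as the label mapping
cut -d, -f1 <+METADATA_DIR>/patch_id_label_mapping.csv | sort -u | wc -l
cut -d, -f1 <+METADATA_DIR>/patch_id_split_mapping.csv | sort -u | wc -l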
Important
The command might take a considerable amount of time to complete, so make sure to run it in a way that does not require an active SSH connection, for example via pueue as shown below.
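Since the job is queued via pueue, it keeps running after the SSH session ends; a sketch for re-attaching later (`<TASK_ID>` is the ID printed by `pueue add`):

# inspect the queue and live-tail or review the job output
pueue status
pueue follow <TASK_ID>   # live output of the running task
pueue log <TASK_ID>      # captured output after the fact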
After the data has been generated and the relevant metadata inserted into the PostgreSQL database, the remaining steps merge the original Sentinel-1 data with the newly generated Sentinel-2 data.
nix run .#ben-data-finalizer -- \
--target-dir <ALIGNED_DIR> \
--s2-root-dir <+TIFF_DIR> \
--reference-maps-root-dir <+REFERENCE_MAPS_DIR> \
--s1-root-dir <EXTRACTED_S1_DIR> \
--patch-id-label-mapping-file <PREV>/metadata/patch_id_label_mapping.csv \
--patch-id-s2v1-mapping-file <PREV>/metadata/patch_id_s2v1_mapping.csv \
--patch-id-split-mapping-file <PREV>/metadata/patch_id_split_mapping.csv \
--patch-id-country-mapping-file <PREV>/metadata/patch_id_country_mapping.csv \
--old-s1s2-mapping-file /nix/store/69ag0abnyaymgzhm2n9ap0y0kq7mqd58-s1s2_mapping.csv \
--old-patches-with-cloud-and-shadow-file /nix/store/zi5rw18q0g7chy40nzs20zmblmvvkf9h-patches_with_cloud_and_shadow.csv \
--old-patches-with-seasonal-snow-file /nix/store/1crxdk4ncacbdar05nlxcg0wvay05khv-patches_with_seasonal_snow.csv
Tip
This step requires the extracted BigEarthNet v1.0 Sentinel-1 directory.
It is important to note that this step not only aligns/adds the Sentinel-1 data but also aligns the different metadata files!
Finally, before publishing the dataset, compress the directories and metadata files into
individual ZSTD-compressed archives to minimize the number of individual files
and to greatly reduce the required download size.
nix run .#zstd-compressor -- <ALIGNED_DIR> --output-path <TO_BE_UPLOADED_DIR>
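For reference, a produced archive can later be extracted as follows (a sketch; the archive names are examples, and it assumes directories are packed as `tar` archives):

# extract a zstd-compressed tar archive (requires GNU tar >= 1.31)
tar --zstd -xf <TO_BE_UPLOADED_DIR>/<ARCHIVE>.tar.zst
# plain .zst files can be decompressed directly
zstd -d <TO_BE_UPLOADED_DIR>/<METADATA_FILE>.csv.zst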
In the following, a few common issues and solutions are provided:
The original source of the Sentinel-2 tile names is the RSiM GitLab repository.
However, to ensure that this file remains accessible and is not changed, a compressed copy
of the CSV file is stored under `tracked-artifacts/tile_names_and_links.csv.bz2`.
The file was compressed with
bzip2 /nix/store/ams5fsnv0adsgdsg49wmlq7rnmdq8j4d-tile_names_and_links.csv --best --stdout > tracked-artifacts/tile_names_and_links.csv.bz2
If the GitLab repository is down, the `flake.nix` dependency can be updated to use the local version instead.
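To restore the CSV from the tracked copy (`-k` keeps the compressed original):

bzip2 -dk tracked-artifacts/tile_names_and_links.csv.bz2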
Some intermediate results are also stored under `tracked-artifacts` in a compressed format (`zstd -19`)
to allow easier detection of possible output changes.
However, the contents should ideally be compared after extracting and joining the data, to avoid spurious
differences due to the compression options and/or the CSV escape format.
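A sketch of such a content-wise comparison (the artifact file name is a placeholder):

# compare content instead of bytes: decompress, sort, then diff
zstd -dc tracked-artifacts/<ARTIFACT>.csv.zst | sort > /tmp/reference.csv
sort <NEW_OUTPUT>.csv > /tmp/current.csv
diff /tmp/reference.csv /tmp/current.csv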
To track whether or not the image data has remained the same, the `tiff-hasher` script
can be executed and its output compared to that of the previous run.
The script calculates the `sha256` hash of each file, writes the hashes to a CSV file, and
then stores the hash of the CSV file itself to allow for a quick comparison.
Concretely, run:
# generate the csv and sha256 file
nix run .#tiff-hasher <path-to-patches> /tmp/patch_hashes.csv
# compare the outputs with the previous run
# if there is no output, then the files do not differ and the resulting tiff
# files are identical to the previous run!
diff /tmp/patch_hashes.csv.sha256 <repository>/tracked-artifacts/patch_hashes.csv.sha256
# same for the reference-maps patches
nix run .#tiff-hasher <path-to-reference-maps-dir> /tmp/reference_maps_hashes.csv
diff /tmp/reference_maps_hashes.csv.sha256 <repository>/tracked-artifacts/reference_maps_hashes.csv.sha256
The specific CSV files are not tracked in Git, but the associated checksum files are.
The main reason why not all CSV files are tracked is that they are several hundred MBs in size.
If you see errors such as `FATAL: could not map anonymous shared memory`, it means
that the selected `shared_buffers` and `effective_cache_size` values are too large and should be
adjusted according to the available RAM of the server hosting the database.
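For example (a hypothetical excerpt; the actual attribute names and structure of `postgres_conf.nix` may differ):

# hypothetical values; adapt to the host's available RAM
shared_buffers = "2GB";         # rule of thumb: ~25% of RAM
effective_cache_size = "6GB";   # rule of thumb: ~50-75% of RAM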
If you see errors such as `start http server on :9999 failed error="listen tcp :9999: bind: address already in use"`,
then there is probably already a `devenv up` process running.
Stop it by killing the process named `process-compose`, for example as sketched below.
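A minimal sketch:

# check what occupies the port, then stop the stale supervisor
ss -tlnp | grep 9999
pkill -f process-compose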
If you see a TUI starting and then immediately stopping, you can investigate the logs
via `cat /tmp/process-compose-<USERNAME>`.
By opening it via `devenv up --keep-tui`, you should be able to inspect which
job is failing and investigate the error.
If no additional information is printed, then Nix is probably running via `nix-portable`,
and adding an explicit shell might be required, such as: `nix develop .#tile-downloader-env --command sh`
Important
If you run `devenv up`, it must be run from the root directory!
Otherwise, the state directory will be created inside the sub-directory and cause issues!