Overview of the EOPF Zarr format

Introduction

In our journey to understand cloud-optimised Earth Observation (EO) data, we have frequently mentioned the Zarr format. Now, we will take a closer look and truly understand what Zarr is and why it is such a game-changer for large datasets like those from ESA’s Sentinel missions. This chapter will break down the essential building blocks of Zarr, explaining how it organises data to make it incredibly efficient for cloud-based analysis.

What we will learn

  • 🔗 What is Zarr?
  • 🏛️ What are the main components of Zarr?
  • 🧱 How is Zarr organised?

What Is Zarr?

Zarr is an open-source, cloud-native protocol for storing multi-dimensional arrays. It is specifically desgined to work well with cloud storage and larger-scale computing systems and can be seen as a cloud-native alternative to older formats like HDF5 or NetCDF.

Key advantage to traditional formats is that the Zarr specification stores large multi-dimensional arrays in chunks, which are smaller pieces of the larger array. Chunks can be accessed individually or multiple chunks can be read and written in parallel, making data access highly efficient.

Zarr works across different storage systems, including local file systems, cloud object storage as well as distributed file systems; offering a greater flexibility compared to traditional file formats.

In addition, Zarr embeds metadata directly alongside the data. This makes Zarr self-descriptive, as each data array contains descriptive information about itself, such as data type, dimensions or additional attributes.

Note

Pro tip: Learn more about Zarr in the official Zarr Documentation and the Zarr V3 storage specification

Components of Zarr

Zarr is organised in a human-readable, hierarchical structure using simple JSON metadata files and is composed of groups and stores, chunks and metadata:

Groups and Stores

Groups and stores are concepts that allow Zarr to differentiate between (i) where the data is stored (stores) and (ii) how it is organised (groups). A group is a container for logically organising the data, similar to folders in a file system. A store defines where the data is stored; it can be e.g. a bucket in the cloud or a directory on a disk.

Chunks

Zarr divides arrays into smaller, independent pieces (chunks). Through chunking it is possible to retrieve and process specific areas without loading the complete dataset. Its organisation into chunks is the main reason Zarr’s high performance. Chunks are saved as binary files inside a /c directory and are further organised through nested folder paths based on their index, e.g. c/0/0/0 for the chunk position [0,0.0].

Metadata

Zarr uses descriptive metadata to describe the individual arrays but also the full hierarchy of the dataset. Metadata are stored in zarr.json files and are available on the array, group and store level. This structured metadata approach makes Zarr datasets self-descriptive and easy to navigate.

The graphic below shows an overview of all relevant Zarr components.

Zarr conceptual structure and overview of Zarr components

EOPF Zarr Structure

The ESA EOPF defines Zarr as the encoding format for the EOPF Sentinel Zarr Samples Service. The Zarr encoding is well aligned with ESA’s objective of enhancing the accessibility of Sentinel data by modernising the previous .SAFE encoding into a flexible, cloud-native structure. The cloud-native nature of zarr is expected to broaden the applications of the Sentinel data within the geospatial community while maintaining data quality and established algorithms.

EOPF Zarr products contain of four main groups:

Group Description
Attributes STAC format metadata for the Zarr element
Measurements Main retrieved variables
Conditions Measurement context (geometric angles, meteorological/instrumental data)
Quality Flags and quality information for measurement filtering

EOPF Zarr prodcut example - Sentinel-2 L2A

Let us imagine a Sentinel-2 L2A tile. The tile has dimensions of approximately 10,980 by 10,980 pixels, and include 12 spectral bands (B01 to B12, excluding B10) at different resolutions, plus additional data arrays such as a Scene Classification Map (SCL) and Atmospheric Optical Thickness (AOT).

For efficient handling, the data is divided into 1,024 by 1,024-pixel chunks. This chunking strategy allows for optimal performance when reading specific spatial regions of interest.

The figure below gives a graphical overview of how a EOPF Zarr Sentinel-2 L2A product file is organised. Overview of the EOPF Zarr product structure on the example of Sentinel-2 L2A

The table below provides a more detailed outline what content is available in the different groups.

Group Content
Attributes Processing history metadata
Chunking configuration
Global metadata (acquisition time, sensing time, etc.)
Product-specific metadata≈
Measurements 10m resolution (r10) B02 (Blue, 490nm)
B03 (Green, 560nm)
B04 (Red, 665nm)
B08 (NIR, 842nm)
20m resolution (r20) B05 (Red Edge 1, 705nm)
B06 (Red Edge 2, 740nm)
B07 (Red Edge 3, 783nm)
B8A (Narrow NIR, 865nm)
B11 (SWIR 1, 1610nm)
B12 (SWIR 2, 2190nm)
60m resolution (r60) B01 (Coastal aerosol, 443nm) <br?B09 (Water vapour, 945nm)
Conditions Sun angles (zenith, azimuth)
Viewing angles
Mean solar irradiance
Atmospheric parameters such as (i) Aerosol Optical Thickness (AOT), (ii) Water Vapor (WV) and (iii) Cloud and snow probability
Quality Scene Classification Layer (SCL)
Quality flags for each band
Detector footprint
Defective pixels masks


Note

Zarr Deep Dive: Dive deeper into the benefits of Zarr in a blogpost by Lindsey Nield from the Earthmover team: Fundamentals: What is Zarr? A Cloud-Native Format for Tensor Data.

Conclusion

The EOPF Zarr structure allows for efficient access to individual bands or specific spatial regions without loading the entire dataset, making it ideal for large-scale geospatial analysis. It further ensures all relevant metadata is co-located with the data, enhancing data discoverability and usability.

What’s next?

Now that you have a theoretical grasp of the Zarr format, the next section Discover EOPF Zarr - Sentinel-2 L2A will provide a first hands-on experience opening an EOPF Zarr product. We will transition to our first Jupyter Notebook where you will directly interact with a Zarr store.