Creating Zarr Overviews
Introduction
Large satellite products, such as Sentinel-2 Level-2A scenes, contain tens of millions of pixels per band. Accessing or visualising them at full resolution is often unnecessary, especially for exploratory analysis, map rendering, or quality checks. Multiscale pyramids address this by providing progressively coarser versions of the original data, allowing client applications to request only the level of detail required. This approach reduces computational load, improves user experience, and aligns with modern cloud-native geospatial workflows.
Our approach consists of two notebooks:
- Part 1: Creating Zarr Overviews
- Part 2: Visualising Multiscale Pyramids
In this notebook, we will demonstrate how to create overviews (also called multiscale pyramids) for large Earth Observation datasets stored in Zarr format. Overviews are downscaled representations of gridded data that support efficient visualisation and scalable access to high-resolution datasets. They enable fast inspection at multiple zoom levels, reduce data transfer volumes, and improve performance when working with cloud-optimised storage.
What we will learn
- How to compute multiscale overview levels from high-resolution satellite data
- How to attach GeoZarr-compliant metadata to datasets
- How to write overview pyramids to Zarr storage
Prerequisites
This notebook uses:
- Dataset: Sentinel-2 L2A reflectance data from EODC object storage (hosted by the STAC catalogue)
- Resolution: 10 m spatial resolution (10980 × 10980 pixels)
- Bands: Blue (b02), Green (b03), Red (b04), NIR (b08)
The workflow follows a compute-then-write pattern that separates in-memory computation from disk persistence, allowing validation before committing changes.
Import libraries
import os, warnings, json, s3fs, zarr, dask
import xarray as xr
from pathlib import Path

warnings.filterwarnings("ignore")

try:
    bucket = os.environ["BUCKET_NAME"]
    access = os.environ["ACCESS_KEY"]
    secret = os.environ["SECRET_KEY"]
    bucket_endpoint = os.environ["BUCKET_ENDPOINT"]
except KeyError as e:
    raise RuntimeError(
        f"Missing required environment variable: {e}\n"
        "Please set BUCKET_NAME, ACCESS_KEY, SECRET_KEY, and BUCKET_ENDPOINT"
    )

# S3 filesystem for the writable copy of the dataset
fs = s3fs.S3FileSystem(
    key=access,
    secret=secret,
    client_kwargs={"endpoint_url": bucket_endpoint},
)
Copy Remote Dataset to Local Storage
The starting point for our overviews is "local" access to the STAC Zarr item we are interested in. We do this for two main reasons:
- The remote dataset is read-only (object storage), and we need a writable copy to which we can add the new overview groups (L1, L2, ...)
- This preserves the complete original structure
To make sure that we use a convenient scene, we select a source URL from the STAC catalogue.
remote_url = ("https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:202508-s02msil2a/31/products/cpm_v256/S2A_MSIL2A_20250831T135741_N0511_R010_T26WPD_20250831T185012.zarr")
store_path = ("S2A_MSIL2A_20250831T135741_N0511_R010_T26WPD_20250831T185012.zarr")
base_store = fs.get_mapper(f"{bucket}/{store_path}")
As a first step, we download the remote Zarr dataset and save it as a Zarr copy on an S3 bucket, ready for exploration. Chunking helps large datasets load faster, especially in visualisation tools that read data in small spatial tiles.
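The copy below keeps the chunking of the source product. If you wanted tile-friendly chunks, you could rechunk before writing; the following is only a sketch with an arbitrary 512 × 512 chunk size and a hypothetical target path, not a step of this workflow.
# Optional sketch (not part of this workflow): rechunk the 10 m group to square
# spatial tiles before writing it to a hypothetical target location.
r10m_preview = xr.open_dataset(remote_url, engine="zarr", group="measurements/reflectance/r10m")
rechunked_store = fs.get_mapper(f"{bucket}/rechunked_{store_path}")  # hypothetical path
r10m_preview.chunk({"x": 512, "y": 512}).to_zarr(store=rechunked_store, mode="w", zarr_version=2)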
print(f"Copying remote Zarr to local storage... (may take several minutes)")
try:
s2l2a_remote = xr.open_datatree(remote_url, engine="zarr")
except Exception as e:
raise RuntimeError(f"Failed to open remote dataset: {e}\nCheck network connectivity and URL")
print(f"Copying to s3://{base_store} ...")
s2l2a_remote.to_zarr(
store=base_store,
mode="w",
consolidated=False,
zarr_version=2,
compute=True,
)Copying remote Zarr to local storage... (may take several minutes)
Copying to s3://<fsspec.mapping.FSMap object at 0x7f7fccbf4980> ...
<xarray.backends.zarr.ZarrStore at 0x7f7f58697060>
Then, we can access the dataset from the S3 bucket and look inside the group that contains the 10-metre reflectance data to understand which variables, dimensions, and coordinates it contains.
# --- Load Local Dataset and Inspect Structure ---
variable_group_path = "measurements/reflectance/r10m"
r10m_store = fs.get_mapper(
    f"{bucket}/{store_path}/{variable_group_path}"
)
dataset = xr.open_dataset(r10m_store, engine="zarr")
dataset
<xarray.Dataset> Size: 4GB
Dimensions: (y: 10980, x: 10980)
Coordinates:
* y (y) int64 88kB 7900015 7900005 7899995 ... 7790245 7790235 7790225
* x (x) int64 88kB 600005 600015 600025 600035 ... 709775 709785 709795
Data variables:
b03 (y, x) float64 964MB ...
b02 (y, x) float64 964MB ...
b04 (y, x) float64 964MB ...
b08 (y, x) float64 964MB ...
Compute Overviews (In-Memory)
Now we compute the overview levels in memory only - no data is written to disk at this stage.
We extract the reflectance group at 10m resolution and automatically discover variables and dimensions.
Key operations:
- Open the r10m reflectance group from the Zarr copy as an xarray Dataset
- Automatically discover all variables and dimensions
- Identify spatial coordinate names (x, y)
In this step, we identify the spatial dimensions and variables in the dataset and define the scale levels that will be used to generate lower-resolution overviews.
scales = [2, 4, 8, 16, 32, 64, 128] # Scale factors for each level
variables = [var for var in dataset.data_vars] # Discover variables
spatial_dims = [dim for dim in dataset.dims] # Discover dimensions
x_dim = next((d for d in spatial_dims if d in ['x', 'X', 'lon', 'longitude']), 'x') # Identify x dimension
y_dim = next((d for d in spatial_dims if d in ['y', 'Y', 'lat', 'latitude']), 'y') # Identify y dimension
print(f"Variables: {variables} | Dims: {spatial_dims} | Shape: {dataset[variables[0]].shape} | Using: x_dim='{x_dim}', y_dim='{y_dim}'\n")Variables: ['b03', 'b02', 'b04', 'b08'] | Dims: ['y', 'x'] | Shape: (10980, 10980) | Using: x_dim='x', y_dim='y'
Accessing such information allows us to generate a series of lower-resolution overview datasets directly in memory.
For each scale factor, we use xarray's coarsen() function to average groups of pixels along the spatial dimensions (x, y). Each coarsened version is stored under a level name like L1, L2, etc., representing progressively coarser spatial resolutions.
overviews = {}  # Generate in-memory overview datasets
for i, factor in enumerate(scales):
    level_id = f"L{i+1}"
    coarsened = dataset.coarsen({x_dim: factor, y_dim: factor}, boundary="trim").mean()
    overviews[level_id] = coarsened[variables]
print(f"Created {len(overviews)} overview levels:")
for level_id, level_ds in overviews.items():
    print(f"  {level_id}: shape {level_ds[variables[0]].shape}, dims {dict(level_ds.dims)}")
print("\nOverview datasets created successfully (in memory only, not written to disk)")
Created 7 overview levels:
L1: shape (5490, 5490), dims {'y': 5490, 'x': 5490}
L2: shape (2745, 2745), dims {'y': 2745, 'x': 2745}
L3: shape (1372, 1372), dims {'y': 1372, 'x': 1372}
L4: shape (686, 686), dims {'y': 686, 'x': 686}
L5: shape (343, 343), dims {'y': 343, 'x': 343}
L6: shape (171, 171), dims {'y': 171, 'x': 171}
L7: shape (85, 85), dims {'y': 85, 'x': 85}
Overview datasets created successfully (in memory only, not written to disk)
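Because nothing has been written yet, this is a natural point for the validation step mentioned in the compute-then-write pattern. As a quick, purely illustrative sanity check (not part of the original workflow), the mean of an averaged overview should stay close to the mean of the base band:
# Illustrative sanity check before writing: "average" resampling should roughly
# preserve the global mean of each band (small differences come from boundary="trim").
base_mean = float(dataset["b04"].mean())
l1_mean = float(overviews["L1"]["b04"].mean())
print(f"b04 mean - base: {base_mean:.4f} | L1 overview: {l1_mean:.4f}")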
Attach Multiscales Metadata
Once the data has been processed, attaching GeoZarr-compliant metadata makes the dataset self-describing and easier for applications to consume:
- Version: Schema version ("1.0")
- Resampling method: How data was aggregated ("average")
- Variables: Which bands have overviews
- Layout: The complete hierarchy including L0 (base) and all derived levels
The metadata is stored in dataset.attrs["multiscales"] following the GeoZarr Overviews specification. This ensures interoperability with GeoZarr-aware tools and libraries.
Here we prepare the information needed to describe the overview hierarchy in the GeoZarr metadata. We set overview_path to indicate where the overview groups will be stored, record the resampling_method ("average") used to create them, and compute the base spatial resolutions (x_res and y_res) from the coordinate spacing.
overview_path = "overviews" # Where overviews are written ("." for direct children)
resampling_method = "average"
x_res = abs(float(dataset['x'].values[1] - dataset['x'].values[0]))
y_res = abs(float(dataset['y'].values[1] - dataset['y'].values[0]))
Now, we build the multiscales layout metadata that describes how all overview levels relate to the base dataset.
The first entry (L0) represents the original data, including its spatial resolution (cell_size). Each subsequent level (L1, L2, ...) is added to the layout with information about its path, the level it was derived from, the scale factors applied, the resampling method, and its corresponding cell size.
Finally, this complete structure is stored in dataset.attrs["multiscales"] following the GeoZarr Overviews specification (draft). The printed JSON summary shows the final metadata layout that GeoZarr-aware tools can use to identify and navigate between resolution levels.
layout = [{"id": "L0", "path": ".", "cell_size": [x_res, y_res]}]  # Base level (native data at current group)
for i, factor in enumerate(scales):
    level_id = f"L{i+1}"
    level_path = level_id if overview_path == "." else f"{overview_path}/{level_id}"
    level_cell_size = [x_res * factor, y_res * factor]
    layout.append({"id": level_id, "path": level_path, "derived_from": "L0" if i == 0 else f"L{i}", "factors": [factor, factor], "resampling_method": resampling_method, "cell_size": level_cell_size})
dataset.attrs["multiscales"] = {"version": "1.0", "resampling_method": resampling_method, "variables": variables, "layout": layout}
print("Metadata structure:")
print(json.dumps(dataset.attrs["multiscales"], indent=2))
Metadata structure:
{
"version": "1.0",
"resampling_method": "average",
"variables": [
"b03",
"b02",
"b04",
"b08"
],
"layout": [
{
"id": "L0",
"path": ".",
"cell_size": [
10.0,
10.0
]
},
{
"id": "L1",
"path": "overviews/L1",
"derived_from": "L0",
"factors": [
2,
2
],
"resampling_method": "average",
"cell_size": [
20.0,
20.0
]
},
{
"id": "L2",
"path": "overviews/L2",
"derived_from": "L1",
"factors": [
4,
4
],
"resampling_method": "average",
"cell_size": [
40.0,
40.0
]
},
{
"id": "L3",
"path": "overviews/L3",
"derived_from": "L2",
"factors": [
8,
8
],
"resampling_method": "average",
"cell_size": [
80.0,
80.0
]
},
{
"id": "L4",
"path": "overviews/L4",
"derived_from": "L3",
"factors": [
16,
16
],
"resampling_method": "average",
"cell_size": [
160.0,
160.0
]
},
{
"id": "L5",
"path": "overviews/L5",
"derived_from": "L4",
"factors": [
32,
32
],
"resampling_method": "average",
"cell_size": [
320.0,
320.0
]
},
{
"id": "L6",
"path": "overviews/L6",
"derived_from": "L5",
"factors": [
64,
64
],
"resampling_method": "average",
"cell_size": [
640.0,
640.0
]
},
{
"id": "L7",
"path": "overviews/L7",
"derived_from": "L6",
"factors": [
128,
128
],
"resampling_method": "average",
"cell_size": [
1280.0,
1280.0
]
}
]
}
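As an aside, this layout is what a GeoZarr-aware reader can navigate, for example to pick the coarsest level whose cell size still meets a target resolution. The following is only a sketch with an arbitrary target value, not part of the workflow:
# Sketch: pick the coarsest layout entry whose cell size still meets a target resolution.
target_cell_size = 100.0  # metres, arbitrary example value
candidates = [entry for entry in layout if entry["cell_size"][0] <= target_cell_size]
best = max(candidates, key=lambda entry: entry["cell_size"][0])
print(f"Best level for {target_cell_size} m: {best['id']} at path '{best['path']}'")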
new_attrs = dataset.attrs.copy()  # includes the multiscales metadata defined above
json_bytes = json.dumps(new_attrs).encode("utf-8")
# FSMap stores values as raw bytes, so we update the group's .zattrs directly
r10m_store['.zattrs'] = json_bytes
Write Overviews to Local Zarr Store
The final step consists of adding the computed overviews to the Zarr store copy.
Overview path options:
- overview_path="." - Write overviews as direct children (L1, L2, L3, ...)
- overview_path="overviews" - Write overviews in a subfolder (overviews/L1, overviews/L2, ...)
Write operations:
1. Write L1-L7 - Add overview levels as subgroups
2. Add metadata - Update group attributes with multiscales metadata
Key point: Native data stays at the group level. The multiscales metadata uses path: "." for L0 to reference the existing native data without duplication.
Result with overview_path=".":
measurements/reflectance/r10m/
├── b02, b03, b04, b08   # Native data (L0 via path=".")
├── x, y                 # Coordinates
├── L1/                  # Overview levels (direct children)
├── L2/
├── L3/
├── L4/
├── L5/
└── .zattrs              # multiscales metadata
Alternative with overview_path="overviews":
measurements/reflectance/r10m/
├── b02, b03, b04, b08   # Native data (L0 via path=".")
├── x, y                 # Coordinates
├── overviews/           # Overview levels in subfolder
│   ├── L1/
│   ├── L2/
│   ├── L3/
│   ├── L4/
│   └── L5/
└── .zattrs              # multiscales metadata
print(f"Adding overviews to {variable_group_path} | Variables: {variables} | Path: '{overview_path}'\n")
# Create the overview group folder on S3
zarr.open_group(store=base_store, mode="a", zarr_version=2)
print(f"Writing {len(overviews)} overview levels...")
for level_id, level_dataset in overviews.items():
level_store = fs.get_mapper(
f"{bucket}/{store_path}/"
f"{variable_group_path}/{overview_path}/{level_id}"
)
level_dataset.to_zarr(
store=level_store,
mode="a",
zarr_version=2,
)
# Write coordinates + attrs into r10m group
coords_only = xr.Dataset(coords=dataset.coords, attrs=dataset.attrs)
coords_only.to_zarr(
store=r10m_store,
mode="a",
zarr_version=2,
)
print(f"Generating consolidated metadata for {overview_path}/")
zarr.consolidate_metadata(store=base_store)
print(f"\nSuccessfully added overviews to {variable_group_path}\n")
print(f"Final structure:\n {variable_group_path}/")
print(f" βββ {', '.join(variables)}")
print(f" βββ {x_dim}, {y_dim}")
print(f" βββ {overview_path}/")
for level_id in overviews.keys():
print(f" β βββ {level_id}/")
print(f" βββ .zattrs")Adding overviews to measurements/reflectance/r10m | Variables: ['b03', 'b02', 'b04', 'b08'] | Path: 'overviews'
Writing 7 overview levels...
Generating consolidated metadata for overviews/
Successfully added overviews to measurements/reflectance/r10m
Final structure:
measurements/reflectance/r10m/
├── b03, b02, b04, b08
├── x, y
├── overviews/
│   ├── L1/
│   ├── L2/
│   ├── L3/
│   ├── L4/
│   ├── L5/
│   ├── L6/
│   ├── L7/
└── .zattrs
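As a final check (not shown in the original notebook), the written store can be reopened to confirm that the multiscales metadata and the overview groups are readable:
# Illustrative verification: reopen the r10m group and one overview level from S3.
r10m_check = xr.open_dataset(r10m_store, engine="zarr")
print(len(r10m_check.attrs["multiscales"]["layout"]))  # expected: 8 entries (L0 plus L1-L7)
l1_store = fs.get_mapper(f"{bucket}/{store_path}/{variable_group_path}/{overview_path}/L1")
l1_check = xr.open_dataset(l1_store, engine="zarr")
print(l1_check["b02"].shape)  # expected: (5490, 5490)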
Now it is your turn
With everything we have learnt so far, you are now able to create multiscale overviews for your own datasets.
Task 1: Experiment With Different Scale Factors
Try modifying the scales list to create different pyramid structures (see the sketch after this list). For example:
- Fewer levels: scales = [2, 4, 8] for a smaller pyramid
- More aggressive downsampling: scales = [4, 16, 64] for rapid zoom levels
- Fine-grained levels: scales = [2, 3, 4, 6, 8] for smoother transitions
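A minimal sketch of Task 1, reusing the variables discovered above with one of the suggested scale lists (the metadata and write steps then run unchanged):
# Task 1 sketch: rebuild the in-memory pyramid with more aggressive downsampling.
# The scale factors below are one of the suggested alternatives, not the notebook's defaults.
scales = [4, 16, 64]
overviews = {}
for i, factor in enumerate(scales):
    coarsened = dataset.coarsen({x_dim: factor, y_dim: factor}, boundary="trim").mean()
    overviews[f"L{i+1}"] = coarsened[variables]
print({level_id: ds[variables[0]].shape for level_id, ds in overviews.items()})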
Task 2: Apply To Your Own Dataset
Use this notebook as a template for your own Earth Observation data (a starting sketch follows this list):
1. Replace the URL with your own Zarr dataset
2. Let the code discover variables and dimensions automatically
3. Adjust scale factors based on your data resolution
4. Validate and write the results
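A hedged starting point for Task 2; the URL, group path, and scale factors below are placeholders to replace with your own values:
# Task 2 sketch: point the same workflow at your own Zarr dataset (placeholder values).
my_url = "https://example.com/path/to/your-dataset.zarr"  # your Zarr dataset
my_group = "path/to/your/group"                           # group holding the gridded variables
my_scales = [2, 4, 8]                                     # adjust to your resolution
my_dataset = xr.open_dataset(my_url, engine="zarr", group=my_group)
my_variables = list(my_dataset.data_vars)
print(f"Variables: {my_variables} | Dims: {dict(my_dataset.sizes)}")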
Conclusion
This tutorial demonstrated the complete workflow for creating GeoZarr-compliant multiscale overviews:
- Load and discover dataset structure automatically
- Compute overview levels in memory (no disk I/O)
- Attach specification-compliant metadata
- Write to Zarr storage
Key takeaways:
- The compute-then-write pattern separates computation from I/O
- Dynamic discovery makes code adaptable to different datasets
What's next?
In the next notebook of this chapter we will visualise the generated overviews with map libraries for progressive rendering.