Access and analyse EOPF STAC Zarr data with R

By: @sharlagelfand

Introduction

In this tutorial, we will demonstrate how to access EOPF Zarr products directly from the EOPF Sentinel Zarr Sample Service STAC Catalog using R. We will introduce R packages that enable us to effectively get an overview of and read Zarr arrays.

What we will learn

  • ā˜ļø How to overview cloud-optimised datasets from the EOPF Zarr STAC Catalog with the Rarr package.
  • šŸ”Ž Read, examine, and scale loaded datasets
  • šŸ“Š Perform simple data analyses with the loaded data

Prerequisites

An R environment is required to follow this tutorial, with R version >= 4.5.0. For local development, we recommend using either RStudio or Positron and making use of RStudio projects for a self-contained coding environment.

We will use the following packages in this tutorial:

  • rstac (for accessing the STAC catalog)
  • tidyverse (for data manipulation)
  • terra (for working with spatial data in raster format)
  • stars (for reading, manipulating, and plotting spatiotemporal data)

You can install them directly from CRAN:

# install.packages("rstac")
# install.packages("tidyverse")
# install.packages("stars")
# install.packages("terra")

We will also use the Rarr package (version >= 1.10.0) to read Zarr data. It must be installed from Bioconductor, so first install the BiocManager package:

# install.packages("BiocManager")

Then, use this package to install Rarr:

# BiocManager::install("Rarr")

Finally, load the packages into your environment:

library(rstac)
library(tidyverse)
library(Rarr)
library(stars)
library(terra)

Access Zarr data from the STAC Catalog

The first step of accessing Zarr data is to understand the assets within the EOPF Sample Service STAC catalog. The first tutorial goes into detail on this, so we recommend reviewing it if you have not already.

For the first part of this tutorial, we will be using data from the Sentinel-2 Level-2A Collection. We fetch the "product" asset under a given item and can look at its URL:

first_item <- stac("https://stac.core.eopf.eodc.eu/") |>
  collections(collection_id = "sentinel-2-l2a") |>
  items(limit = 1) |>
  get_request()

first_item_id <- first_item[["features"]][[1]][["id"]]

s2_l2a_item <- stac("https://stac.core.eopf.eodc.eu/") |>
  collections(collection_id = "sentinel-2-l2a") |>
  items(feature_id = first_item_id) |>
  get_request()

s2_l2a_product <- s2_l2a_item |>
  assets_select(asset_names = "product")

s2_l2a_product_url <- s2_l2a_product |>
  assets_url()

s2_l2a_product_url
[1] "https://objects.eodc.eu:443/e05ab01a9d56408d82ac32d69a5aae2a:202603-s02msil2a-eu/10/products/cpm_v262/S2C_MSIL2A_20260310T141011_N0512_R053_T26WNE_20260310T172808.zarr"

The product is the "top level" Zarr asset, which contains the full Zarr product hierarchy. We can use zarr_overview() to get an overview of it, setting as_data_frame to TRUE so that the entries are returned as a data frame instead of being printed directly to the console. Each entry is a Zarr array; we remove product_url from the path to get a better idea of what each array is. Since this is something we will want to do multiple times throughout the tutorial, we create a helper function for it.

derive_store_array <- function(store, product_url) {
  store |>
    mutate(array = str_remove(path, product_url)) |>
    relocate(array, .before = path)
}

zarr_store <- s2_l2a_product_url |>
  zarr_overview(as_data_frame = TRUE) |>
  derive_store_array(s2_l2a_product_url)

zarr_store
# A tibble: 124 × 8
   array           path  data_type endianness compressor dim   chunk_dim nchunks
   <chr>           <chr> <chr>     <chr>      <chr>      <lis> <list>    <list> 
 1 /conditions/ge… http… unicode2… little     blosc      <int> <int [1]> <dbl>  
 2 /conditions/ge… http… unicode96 little     blosc      <int> <int [1]> <dbl>  
 3 /conditions/ge… http… unicode96 little     blosc      <int> <int [1]> <dbl>  
 4 /conditions/ge… http… float64   little     blosc      <int> <int [1]> <dbl>  
 5 /conditions/ge… http… float64   little     blosc      <int> <int [2]> <dbl>  
 6 /conditions/ge… http… float64   little     blosc      <int> <int [3]> <dbl>  
 7 /conditions/ge… http… float64   little     blosc      <int> <int [5]> <dbl>  
 8 /conditions/ge… http… float32   little     blosc      <int> <int [1]> <dbl>  
 9 /conditions/ge… http… float32   little     blosc      <int> <int [1]> <dbl>  
10 /conditions/ma… http… uint8     <NA>       blosc      <int> <int [2]> <dbl>  
# ℹ 114 more rows

This shows us the path to access the Zarr array, the number of chunks it contains, the type of data, as well as its dimensions and chunking structure.
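
The nchunks values in this overview follow directly from the array and chunk dimensions. As a quick sanity check, using the 5490 x 5490 shape and 915 x 915 chunks reported for the 20m reflectance bands in this product:

```r
# Chunk count per dimension = ceiling(array dimension / chunk dimension).
dims <- c(5490, 5490)
chunk_dims <- c(915, 915)
nchunks <- ceiling(dims / chunk_dims)
nchunks       # 6 chunks along each dimension
prod(nchunks) # 36 chunks in total
```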

We can also look at overviews of individual arrays. First, let's narrow down to measurements taken at 20-metre resolution:

r20m <- zarr_store |>
  filter(str_starts(array, "/measurements/reflectance/r20m/"))

r20m
# A tibble: 12 × 8
   array           path  data_type endianness compressor dim   chunk_dim nchunks
   <chr>           <chr> <chr>     <chr>      <chr>      <lis> <list>    <list> 
 1 /measurements/… http… uint16    little     blosc      <int> <int [2]> <dbl>  
 2 /measurements/… http… uint16    little     blosc      <int> <int [2]> <dbl>  
 3 /measurements/… http… uint16    little     blosc      <int> <int [2]> <dbl>  
 4 /measurements/… http… uint16    little     blosc      <int> <int [2]> <dbl>  
 5 /measurements/… http… uint16    little     blosc      <int> <int [2]> <dbl>  
 6 /measurements/… http… uint16    little     blosc      <int> <int [2]> <dbl>  
 7 /measurements/… http… uint16    little     blosc      <int> <int [2]> <dbl>  
 8 /measurements/… http… uint16    little     blosc      <int> <int [2]> <dbl>  
 9 /measurements/… http… uint16    little     blosc      <int> <int [2]> <dbl>  
10 /measurements/… http… uint16    little     blosc      <int> <int [2]> <dbl>  
11 /measurements/… http… float32   little     blosc      <int> <int [1]> <dbl>  
12 /measurements/… http… float32   little     blosc      <int> <int [1]> <dbl>  

Then, we select the B02 array and examine its dimensions and chunking:

r20m |>
  filter(str_ends(array, "b02")) |>
  select(path, nchunks, dim, chunk_dim) |>
  as.list()
$path
[1] "https://objects.eodc.eu:443/e05ab01a9d56408d82ac32d69a5aae2a:202603-s02msil2a-eu/10/products/cpm_v262/S2C_MSIL2A_20260310T141011_N0512_R053_T26WNE_20260310T172808.zarr/measurements/reflectance/r20m/b02"

$nchunks
$nchunks[[1]]
[1] 6 6


$dim
$dim[[1]]
[1] 5490 5490


$chunk_dim
$chunk_dim[[1]]
[1] 915 915

We can also see an overview of individual arrays using zarr_overview(). With the default setting (where as_data_frame is FALSE), this prints information on the array directly to the console, in a more digestible way:

r20m_b02 <- r20m |>
  filter(str_ends(array, "b02")) |>
  pull(path)

r20m_b02 |>
  zarr_overview()
Type: Array
Path: https://objects.eodc.eu:443/e05ab01a9d56408d82ac32d69a5aae2a:202603-s02msil2a-eu/10/products/cpm_v262/S2C_MSIL2A_20260310T141011_N0512_R053_T26WNE_20260310T172808.zarr/measurements/reflectance/r20m/b02
Shape: 5490 x 5490
Chunk Shape: 915 x 915
No. of Chunks: 36 (6 x 6)
Data Type: uint16
Endianness: little
Compressor: blosc

The above overview tells us that the data is two-dimensional, with dimensions 5490 x 5490. Zarr data is split up into chunks, which are smaller, independent pieces of the larger array. Chunks can be accessed individually, without loading the entire array. In this case, there are 36 chunks in total, 6 along each dimension, each of size 915 x 915.
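
Chunking also determines which indices map to which chunk. A minimal sketch (the helper below is hypothetical, not part of Rarr) of the 1-based index range covered by a single chunk along one dimension:

```r
# Hypothetical helper: the index range covered by the i-th chunk
# along one dimension, given the chunk size.
chunk_index_range <- function(i, chunk_size) {
  ((i - 1) * chunk_size + 1):(i * chunk_size)
}

# The second chunk along a dimension with 915-sized chunks covers
# indices 916 to 1830:
range(chunk_index_range(2, 915))
```

Requesting exactly that index range from a Zarr array touches only that single chunk, which is what makes cloud access efficient.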

Read Zarr data

To read in Zarr data, we use read_zarr_array(), and can pass a list to the index argument, describing which elements we want to extract along each dimension. Since this array is two-dimensional, we can think of the dimensions as rows and columns of the data. For example, to select the first 10 rows and the first 5 columns:

r20m_b02 |>
  read_zarr_array(index = list(1:10, 1:5))
      [,1] [,2] [,3] [,4] [,5]
 [1,] 8061 7403 5516 7279 8178
 [2,] 7686 6522 7570 7776 8233
 [3,] 7538 7903 8248 7923 8396
 [4,] 7635 7874 8136 7670 8042
 [5,] 8010 7654 8235 7207 7256
 [6,] 7702 7918 8543 7331 7169
 [7,] 7421 8105 8534 7644 7641
 [8,] 6946 7983 8291 7017 7641
 [9,] 7089 7972 8089 6688 7021
[10,] 6755 7627 7979 7131 6719

Coordinates

Similarly, we can read in the x and y coordinates corresponding to the data at 20m resolution. These x and y coordinates do not correspond to latitude and longitude; to understand the coordinate reference system used in each data set, we access the proj:code property of the STAC item. In this case, the coordinate reference system is EPSG:32626, which represents metres from the UTM zone's origin.

s2_l2a_item[["properties"]][["proj:code"]]
[1] "EPSG:32626"
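
To relate these projected coordinates to longitude and latitude, we can reproject a point with terra (loaded earlier). As a sketch, the corner below uses the first x and y values of this product's 20m grid; treat the exact numbers as illustrative:

```r
# Reproject the grid's upper-left corner from EPSG:32626 (UTM metres)
# to EPSG:4326 (longitude/latitude).
library(terra)

corner <- vect(
  cbind(x = 499990, y = 8000030),
  type = "points",
  crs = "EPSG:32626"
)

crds(project(corner, "EPSG:4326"))
```

An easting near 500,000 m sits close to the zone's central meridian (27°W for zone 26), and the northing of roughly 8,000 km places the tile at high northern latitude.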

We can see that x and y are one dimensional:

r20m_x <- r20m |>
  filter(str_ends(array, "x")) |>
  pull(path)

r20m_x |>
  zarr_overview()
Type: Array
Path: https://objects.eodc.eu:443/e05ab01a9d56408d82ac32d69a5aae2a:202603-s02msil2a-eu/10/products/cpm_v262/S2C_MSIL2A_20260310T141011_N0512_R053_T26WNE_20260310T172808.zarr/measurements/reflectance/r20m/x
Shape: 5490
Chunk Shape: 5490
No. of Chunks: 1 (1)
Data Type: float32
Endianness: little
Compressor: blosc
r20m_y <- r20m |>
  filter(str_ends(array, "y")) |>
  pull(path)

r20m_y |>
  zarr_overview()
Type: Array
Path: https://objects.eodc.eu:443/e05ab01a9d56408d82ac32d69a5aae2a:202603-s02msil2a-eu/10/products/cpm_v262/S2C_MSIL2A_20260310T141011_N0512_R053_T26WNE_20260310T172808.zarr/measurements/reflectance/r20m/y
Shape: 5490
Chunk Shape: 5490
No. of Chunks: 1 (1)
Data Type: float32
Endianness: little
Compressor: blosc

When combined, they form a grid that describes the location of each point in the two-dimensional measurements, such as B02. We will go into this more in the examples below.

The x and y dimensions can be read in using the same logic: by describing which elements we want to extract. Since there is only one dimension, we only need to supply one entry in the indexing list:

r20m_x |>
  read_zarr_array(list(1:5))
[1] 499990 500010 500030 500050 500070

Or, we can read in the whole array (by not providing any elements to index) and view its first few values with head(). Of course, reading in the whole array, rather than a small section of it, will take longer.

r20m_x <- r20m_x |>
  read_zarr_array()

r20m_x |>
  head(5)
[1] 499990 500010 500030 500050 500070
r20m_y <- r20m_y |>
  read_zarr_array()

r20m_y |>
  head(5)
[1] 8000030 8000010 7999990 7999970 7999950
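
Combined, the two 1-D coordinate arrays locate every pixel: the value at B02[i, j] sits at easting x[i] and northing y[j]. A small sketch using the coordinate values printed above:

```r
# First few 20m-grid coordinates, as read above.
x <- c(499990, 500010, 500030, 500050, 500070)
y <- c(8000030, 8000010, 7999990, 7999970, 7999950)

# The pixel at position i = 2, j = 3 of B02 is centred at:
i <- 2
j <- 3
c(easting = x[i], northing = y[j])
```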

Different resolutions

With EOPF data, some measurements are available at multiple resolutions. For example, we can see that the B02 spectral band is available at 10m, 20m, and 60m resolution:

b02 <- zarr_store |>
  filter(str_starts(array, "/measurements/reflectance"), str_ends(array, "b02"))

b02 |>
  select(array)
# A tibble: 3 × 1
  array                             
  <chr>                             
1 /measurements/reflectance/r10m/b02
2 /measurements/reflectance/r20m/b02
3 /measurements/reflectance/r60m/b02

The resolution affects the dimensions of the data: when measurements are taken at a higher resolution, there will be more data. We can see here that there is more data for the 10m resolution than the 20m resolution (recall, its dimensions are 5490 x 5490), and less for the 60m resolution:

b02 |>
  filter(array == "/measurements/reflectance/r10m/b02") |>
  pull(dim)
[[1]]
[1] 10980 10980
b02 |>
  filter(array == "/measurements/reflectance/r60m/b02") |>
  pull(dim)
[[1]]
[1] 1830 1830
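
The relationship between pixel size and array dimension is direct: the tile covers a fixed extent, so a smaller pixel size means more pixels per side. A quick check against the dimensions reported above:

```r
# A Sentinel-2 tile spans 5490 pixels * 20 m = 109,800 m per side.
tile_extent_m <- 5490 * 20

# Dividing the tile extent by each pixel size reproduces the array
# dimensions of the 10m, 20m, and 60m products.
tile_extent_m / c(10, 20, 60) # 10980 5490 1830
```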

Scaling data

Zarr data arrays are often packed or compressed to save space. Because of this, when we read in the data, we also need to check whether the values must be scaled or offset to recover their actual physical units or meaningful values. This information is contained in the metadata associated with the Zarr store. We create a helper function to obtain these values, defaulting the offset to 0 and the scale to 1 when no adjustment is needed.

get_scale_and_offset <- function(zarr_url, array) {
  metadata <- Rarr:::.read_zmetadata(
    zarr_url,
    s3_client = Rarr:::.create_s3_client(zarr_url)
  )

  metadata <- metadata[["metadata"]]

  array_metadata <- metadata[[paste0(array, "/.zattrs")]]

  scale <- array_metadata[["scale_factor"]]
  scale <- ifelse(is.null(scale), 1, scale)

  offset <- array_metadata[["add_offset"]]
  offset <- ifelse(is.null(offset), 0, offset)


  list(
    scale = scale,
    offset = offset
  )
}

Then, we can check the scale and offset of the B02 spectral band, which tells us that we need to multiply the values by 0.0001 and add an offset of -0.1.

get_scale_and_offset(
  s2_l2a_product_url,
  "measurements/reflectance/r20m/b02"
)
$scale
[1] 0.0001

$offset
[1] -0.1
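
Applying these factors to a raw value (for example, 8061, the top-left value from the earlier B02 read) converts the stored integer into a reflectance:

```r
# Convert a raw uint16 digital number to reflectance using the
# scale and offset reported in the Zarr metadata.
raw_value <- 8061
scale <- 0.0001
offset <- -0.1
raw_value * scale + offset # 0.7061
```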

We can also check for the x dimensions. This tells us that no scaling or offsetting needs to be done.

get_scale_and_offset(
  s2_l2a_product_url,
  "measurements/reflectance/r20m/x"
)
$scale
[1] 1

$offset
[1] 0

Simple Data Analysis: Calculating NDVI

Let us now perform a simple analysis with the data from the EOPF Zarr STAC Catalog: calculating the Normalized Difference Vegetation Index (NDVI).

First, we access the Red (B04) and Near-Infrared (B8A) bands, which are needed to calculate NDVI, at 20m resolution:

r20m_b04 <- r20m |>
  filter(str_ends(array, "b04")) |>
  pull(path) |>
  read_zarr_array()

r20m_b04[1:5, 1:5]
     [,1] [,2] [,3] [,4] [,5]
[1,] 9348 8522 6287 8166 9455
[2,] 8786 7601 8588 8861 9485
[3,] 8574 9110 9453 9140 9663
[4,] 8842 9066 9407 8797 9193
[5,] 9333 8868 9542 8266 8287
r20m_b8a <- r20m |>
  filter(str_ends(array, "b8a")) |>
  pull(path) |>
  read_zarr_array()

r20m_b8a[1:5, 1:5]
     [,1] [,2] [,3] [,4] [,5]
[1,] 9447 8046 7295 7946 9487
[2,] 8597 8050 8505 8772 9567
[3,] 8757 9151 9353 9106 9623
[4,] 8980 9323 9342 8815 9202
[5,] 9242 9041 9321 8460 8448

Then, we access the scale and offset for the bands and for the coordinates:

r20m_b04_scale_offset <- get_scale_and_offset(
  s2_l2a_product_url,
  "measurements/reflectance/r20m/b04"
)

r20m_b04_scale_offset
$scale
[1] 0.0001

$offset
[1] -0.1
r20m_b8a_scale_offset <- get_scale_and_offset(
  s2_l2a_product_url,
  "measurements/reflectance/r20m/b8a"
)

r20m_b8a_scale_offset
$scale
[1] 0.0001

$offset
[1] -0.1
r20m_x_scale_offset <- get_scale_and_offset(
  s2_l2a_product_url,
  "measurements/reflectance/r20m/x"
)

r20m_x_scale_offset
$scale
[1] 1

$offset
[1] 0
r20m_y_scale_offset <- get_scale_and_offset(
  s2_l2a_product_url,
  "measurements/reflectance/r20m/y"
)

r20m_y_scale_offset
$scale
[1] 1

$offset
[1] 0

The function st_as_stars() is used to get our data into a format that makes data manipulation and visualisation easy. In particular, it is easy to use mutate() to apply the necessary scale and offset factors to the data.

This format is also beneficial because it allows for a quick summary of the data and its attributes, providing information such as the median and mean values of the bands and information on the grid:

ndvi_data <- st_as_stars(B04 = r20m_b04, B08A = r20m_b8a) |>
  st_set_dimensions(1, names = "X", values = r20m_x) |>
  st_set_dimensions(2, names = "Y", values = r20m_y) |>
  mutate(
    B04 = B04 * r20m_b04_scale_offset[["scale"]] + r20m_b04_scale_offset[["offset"]],
    B08A = B08A * r20m_b8a_scale_offset[["scale"]] + r20m_b8a_scale_offset[["offset"]]
  )

ndvi_data
stars object with 2 dimensions and 2 attributes
attribute(s), summary of first 100000 cells:
        Min.  1st Qu. Median      Mean 3rd Qu.   Max.
B04   0.0643 0.337275 0.7026 0.6619895  0.8438 1.4626
B08A  0.0525 0.301375 0.6498 0.6237404  0.8014 1.4822
dimension(s):
  from   to  offset delta point x/y
X    1 5490  499990    20 FALSE [x]
Y    1 5490 8000030   -20 FALSE [y]

Now, we perform the initial steps for NDVI calculation:

  • sum_bands: Calculates the sum of the Near-Infrared and Red bands.
  • diff_bands: Calculates the difference between the Near-Infrared and Red bands.
ndvi_data <- ndvi_data |>
  mutate(
    sum_bands = B08A + B04,
    diff_bands = B08A - B04
  )

Then, we calculate the NDVI, which is diff_bands / sum_bands. To prevent division by zero errors in areas where both red and NIR bands might be zero (e.g., water bodies or clouds), we also replace any NaN values resulting from division by zero with 0. This ensures a clean and robust NDVI product.

ndvi_data <- ndvi_data |>
  mutate(
    ndvi = diff_bands / sum_bands,
    ndvi = ifelse(sum_bands == 0, 0, ndvi)
  )
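
A toy check of the formula and the NaN handling, on two hypothetical pixel values:

```r
# NDVI = (NIR - Red) / (NIR + Red); when both bands are zero,
# 0/0 yields NaN, which we replace with 0.
nir <- c(0.65, 0)
red <- c(0.30, 0)
ndvi <- (nir - red) / (nir + red)
ndvi <- ifelse(nir + red == 0, 0, ndvi)
ndvi
```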

In a final step, we can visualise the calculated NDVI. Since ndvi_data now contains several attributes, we select the ndvi attribute for plotting:

plot(ndvi_data["ndvi"], as_points = FALSE, axes = TRUE, breaks = "equal", col = hcl.colors)

💪 Now it is your turn

Task 1: Explore five additional Sentinel-2 Items

Replicate the data access and band exploration for five additional items from your area of interest and review the spatial changes.

Task 2: Calculate NDVI

Replicate the NDVI calculation for the additional items.

Task 3: Applying more advanced analysis techniques

The EOPF STAC Catalog offers a wealth of data beyond Sentinel-2. Replicate the search and data access for data from other collections.

Conclusion

In this tutorial we established a connection to the EOPF Sentinel Zarr Sample Service STAC Catalog and directly accessed EOPF Zarr arrays with Rarr. We explored how to review the full Zarr store, read individual arrays, and understand Zarr array chunking and dimensions. We also performed a simple calculation and visualisation of NDVI using the stars package.

What's next?

In the following notebook we will present a workflow that uses rstac and gdalcubes in R to create a raster data cube and visualise an aggregated RGB plot!