Cloud Probability Analysis with EOPF Sentinel-2 Data

By: Erik Haß (@gallah-gallah) and Dr. Zhan Li (@zhanlilz) | BASF Digital Farming GmbH

Introduction

This notebook demonstrates time-series analysis of cloud probability on Sentinel-2 data. The cloud probability layer (cld) is a valuable resource for assessing intra-tile variability of clouds. In standard workflows, users typically rely on tile metadata regarding overall cloud coverage, which provides a straightforward way to exclude data from an analysis.

However, relying solely on metadata can be limiting. By utilizing thecld layer, we can perform a “give-me-all” query to assess seasonality and generate an average cloudiness image for a specific tile. This offers a detailed representation of the meteorological conditions present during the acquisition of the imagery.

The cld layer is essential for refining the acquisition process, especially for tasks requiring high temporal resolution or consistent data quality. Unlike binary metadata—which might flag an entire tile as “cloudy”—cloud probability layers provide pixel-level insights. This allows you to retain tiles that have high overall cloud coverage but still contain large, cloud-free patches of interest.

By integrating cloud probability into your ingestion pipeline, you can shift from a “discard if cloudy” approach to a more nuanced “mask and keep” strategy. This drastically increases the usable data volume for time-series analysis in regions with frequent cloud cover.

Combining this data with agricultural field boundaries allows for intelligent ranking. You can prioritize observations where specific fields of interest are clear, even if the surrounding tile is partially obscured, ensuring that your analysis remains robust despite adverse weather conditions.Then we will add agricultural fields into the mix. Ranking their cloudiness reveals that even with higher overall cloud cover, specific areas will remain useful for analysis.

What we will learn

🚀 Querying EOPF STAC API: Learn to search and retrieve Sentinel-2 L2A products using pystac-client, filtering by tile ID, date range, and excluding deprecated items
⚡ Parallel Processing with Dask: Use dask.delayed and LocalCluster to efficiently load and process large remote sensing datasets in parallel batches
📊 Temporal & Seasonal Analysis: Stack multi-temporal scenes, compute mean cloudiness, cloud frequency, and seasonal delta images to reveal cloud dynamics
🌾 Field-Level Spatial Statistics: Integrate the vector boundaries of agricultural fields with raster data using rioxarray to rank agricultural fields by observation quality
🖥️ Advanced: Interactive Visualization with Panel & hvplot: Build dynamic dashboards with variable/time selectors and transparency controls to explore multi-temporal raster data overlaid on OpenStreetMap tiles

Prerequisites

Key packages

Package	Purpose
`pystac-client`	Query the EOPF STAC API
`xarray` + `zarr`	Open and manipulate EOPF Zarr products lazily
`rioxarray`	CRS handling and spatial clipping of DataArrays
`dask` / `distributed`	Parallel, out-of-core processing across scenes
`geopandas`	Load and reproject field boundary vectors
`hvplot` / `panel`	Interactive visualisation

Before you start

💡 This notebook assumes familiarity with the EOPF STAC API and Zarr-backed DataTree access. If you are new to these, work through the EOPF 101 introductory notebooks first.

Sentinel-2 L2A product overview → Sentinel Wiki: S2 Products
MGRS tile grid reference → Sentinel-2 Tiling Grid KMZ
EOPF STAC endpoint: https://stac.core.eopf.eodc.eu/

# Standard library imports
import time
from io import StringIO

# Third-party imports
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr
import rioxarray
from dask.base import compute
from dask.delayed import delayed
from matplotlib.colors import BoundaryNorm, ListedColormap
from pystac import Item
from pystac_client import Client as StacClient

# Distributed/Local imports
from distributed import Client, LocalCluster

#Visualisation

import panel as pn
import hvplot.xarray
import hvplot.pandas
import holoviews as hv
# https://panel.holoviz.org/api/config.html#comms-panel-comms
pn.extension(comms='default')
hv.extension("bokeh")

Helper functions

`filter_items_deprecated`

def filter_items_deprecated(items: list[Item]) -> list[Item]:
    """Filter out deprecated STAC items, which are deprecated.

    Args:
        items: A list of STAC items to filter.

    Returns:
        A list of STAC items that are not marked as deprecated.
    """
    sel_items = []
    for item in items:
        # Check if the item is marked as deprecated in its properties
        deprecated = item.properties.get("deprecated", False)
        if not deprecated:
            # If the item is not deprecated, add it to the list of selected items
            sel_items.append(item)
    return sel_items

try:
    from distributed import get_client
    existing_client = get_client()
    print("⚠️  Closing existing Dask cluster...")
    existing_client.close()
except ValueError:
    # No existing cluster, which is fine
    pass

# Initialize Dask LocalCluster for parallel computation
cluster = LocalCluster(
    n_workers=4,              
    threads_per_worker=2,     
    memory_limit='4GB',       
    silence_logs=30           
)
client = Client(cluster)

print(f"✅ Dask cluster initialized with {len(cluster.workers)} workers")
print(f"📊 Dashboard available at: {client.dashboard_link}")
print(f"💾 Total memory available: {4 * len(cluster.workers)} GB")

# Display cluster info
client

✅ Dask cluster initialized with 4 workers
📊 Dashboard available at: http://127.0.0.1:8787/status
💾 Total memory available: 16 GB

Client

Client-d238cf5c-68aa-11f1-ae6c-6e9a916ac8c0

Connection method: Cluster object	Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

Cluster Info

LocalCluster

3c467f2d

Dashboard: http://127.0.0.1:8787/status	Workers: 4
Total threads: 8	Total memory: 14.90 GiB
Status: running	Using processes: True

Scheduler Info

Scheduler

Scheduler-2b842dcc-a69a-4b83-87d6-a904a5ded63c

Comm: tcp://127.0.0.1:32941	Workers: 0
Dashboard: http://127.0.0.1:8787/status	Total threads: 0
Started: Just now	Total memory: 0 B

Workers

Worker: 0

Comm: tcp://127.0.0.1:32961	Total threads: 2
Dashboard: http://127.0.0.1:39143/status	Memory: 3.73 GiB
Nanny: tcp://127.0.0.1:33481
Local directory: /tmp/dask-scratch-space/worker-qzc4100l

Worker: 1

Comm: tcp://127.0.0.1:45741	Total threads: 2
Dashboard: http://127.0.0.1:45649/status	Memory: 3.73 GiB
Nanny: tcp://127.0.0.1:37261
Local directory: /tmp/dask-scratch-space/worker-end9htr3

Worker: 2

Comm: tcp://127.0.0.1:44097	Total threads: 2
Dashboard: http://127.0.0.1:46431/status	Memory: 3.73 GiB
Nanny: tcp://127.0.0.1:37991
Local directory: /tmp/dask-scratch-space/worker-01jub8b2

Worker: 3

Comm: tcp://127.0.0.1:41727	Total threads: 2
Dashboard: http://127.0.0.1:42403/status	Memory: 3.73 GiB
Nanny: tcp://127.0.0.1:46843
Local directory: /tmp/dask-scratch-space/worker-f6udbwxa

Connect to EOPF STAC

eopf_stac_api_root_endpoint = "https://stac.core.eopf.eodc.eu/"
eopf_catalog = StacClient.open(url=eopf_stac_api_root_endpoint)
eopf_catalog

type "Catalog"
id "eopf-sample-service-stac-api"
stac_version "1.1.0"
description "STAC catalog of the EOPF Sentinel Zarr Samples Service"
links[] 21 items
- 0
  - rel "self"
  - href "https://stac.core.eopf.eodc.eu/"
  - type "application/json"
- 1
  - rel "root"
  - href "https://stac.core.eopf.eodc.eu/"
  - type "application/json"
  - title "EOPF Sentinel Zarr Samples Service STAC API"
- 2
  - rel "data"
  - href "https://stac.core.eopf.eodc.eu/collections"
  - type "application/json"
- 3
  - rel "conformance"
  - href "https://stac.core.eopf.eodc.eu/conformance"
  - type "application/json"
  - title "STAC/OGC conformance classes implemented by this server"
- 4
  - rel "search"
  - href "https://stac.core.eopf.eodc.eu/search"
  - type "application/geo+json"
  - title "STAC search"
  - method "GET"
- 5
  - rel "search"
  - href "https://stac.core.eopf.eodc.eu/search"
  - type "application/geo+json"
  - title "STAC search"
  - method "POST"
- 6
  - rel "http://www.opengis.net/def/rel/ogc/1.0/queryables"
  - href "https://stac.core.eopf.eodc.eu/queryables"
  - type "application/schema+json"
  - title "Queryables"
  - method "GET"
- 7
  - rel "child"
  - href "https://stac.core.eopf.eodc.eu/collections/sentinel-2-l2a"
  - type "application/json"
  - title "Sentinel-2 Level-2A"
- 8
  - rel "child"
  - href "https://stac.core.eopf.eodc.eu/collections/sentinel-1-l1-grd"
  - type "application/json"
  - title "Sentinel-1 Level-1 GRD"
- 9
  - rel "child"
  - href "https://stac.core.eopf.eodc.eu/collections/sentinel-2-l1c"
  - type "application/json"
  - title "Sentinel-2 Level-1C"
- 10
  - rel "child"
  - href "https://stac.core.eopf.eodc.eu/collections/sentinel-1-l1-slc"
  - type "application/json"
  - title "Sentinel-1 Level-1 SLC"
- 11
  - rel "child"
  - href "https://stac.core.eopf.eodc.eu/collections/sentinel-3-olci-l1-efr"
  - type "application/json"
  - title "Sentinel-3 OLCI Level-1 EFR"
- 12
  - rel "child"
  - href "https://stac.core.eopf.eodc.eu/collections/sentinel-3-slstr-l2-lst"
  - type "application/json"
  - title "Sentinel-3 SLSTR Level-2 LST"
- 13
  - rel "child"
  - href "https://stac.core.eopf.eodc.eu/collections/sentinel-3-slstr-l1-rbt"
  - type "application/json"
  - title "Sentinel-3 SLSTR Level-1 RBT"
- 14
  - rel "child"
  - href "https://stac.core.eopf.eodc.eu/collections/sentinel-1-l2-ocn"
  - type "application/json"
  - title "Sentinel-1 Level-2 OCN"
- 15
  - rel "child"
  - href "https://stac.core.eopf.eodc.eu/collections/sentinel-3-slstr-l2-frp"
  - type "application/json"
  - title "Sentinel-3 SLSTR Level-2 FRP"
- 16
  - rel "child"
  - href "https://stac.core.eopf.eodc.eu/collections/sentinel-3-olci-l1-err"
  - type "application/json"
  - title "Sentinel-3 OLCI Level-1 ERR"
- 17
  - rel "child"
  - href "https://stac.core.eopf.eodc.eu/collections/sentinel-3-olci-l2-lrr"
  - type "application/json"
  - title "Sentinel-3 OLCI Level-2 LRR"
- 18
  - rel "child"
  - href "https://stac.core.eopf.eodc.eu/collections/sentinel-3-olci-l2-lfr"
  - type "application/json"
  - title "Sentinel-3 OLCI Level-2 LFR"
- 19
  - rel "service-desc"
  - href "https://stac.core.eopf.eodc.eu/api"
  - type "application/vnd.oai.openapi+json;version=3.0"
  - title "OpenAPI service description"
- 20
  - rel "service-doc"
  - href "https://stac.core.eopf.eodc.eu/api.html"
  - type "text/html"
  - title "OpenAPI service documentation"
conformsTo[] 21 items
- 0 "http://www.opengis.net/spec/cql2/1.0/conf/basic-cql2"
- 1 "http://www.opengis.net/spec/cql2/1.0/conf/cql2-json"
- 2 "http://www.opengis.net/spec/cql2/1.0/conf/cql2-text"
- 3 "http://www.opengis.net/spec/ogcapi-features-1/1.0/conf/core"
- 4 "http://www.opengis.net/spec/ogcapi-features-1/1.0/conf/geojson"
- 5 "http://www.opengis.net/spec/ogcapi-features-1/1.0/conf/oas30"
- 6 "http://www.opengis.net/spec/ogcapi-features-3/1.0/conf/features-filter"
- 7 "http://www.opengis.net/spec/ogcapi-features-3/1.0/conf/filter"
- 8 "https://api.stacspec.org/v1.0.0-rc.2/item-search#filter"
- 9 "https://api.stacspec.org/v1.0.0/collections"
- 10 "https://api.stacspec.org/v1.0.0/collections/extensions/transaction"
- 11 "https://api.stacspec.org/v1.0.0/core"
- 12 "https://api.stacspec.org/v1.0.0/item-search"
- 13 "https://api.stacspec.org/v1.0.0/item-search#fields"
- 14 "https://api.stacspec.org/v1.0.0/item-search#query"
- 15 "https://api.stacspec.org/v1.0.0/item-search#sort"
- 16 "https://api.stacspec.org/v1.0.0/ogcapi-features"
- 17 "https://api.stacspec.org/v1.0.0/ogcapi-features#fields"
- 18 "https://api.stacspec.org/v1.0.0/ogcapi-features#query"
- 19 "https://api.stacspec.org/v1.0.0/ogcapi-features#sort"
- 20 "https://api.stacspec.org/v1.0.0/ogcapi-features/extensions/transaction"
title "EOPF Sentinel Zarr Samples Service STAC API"

Inspect collection

S2l2a_coll = eopf_catalog.get_collection('sentinel-2-l2a')
print('Keywords:        ',S2l2a_coll.keywords)
print('Catalog ID:      ',S2l2a_coll.id)

Keywords:         ['Copernicus', 'Sentinel', 'EU', 'ESA', 'Satellite', 'Global', 'Imagery', 'Reflectance']
Catalog ID:       sentinel-2-l2a

### Query for tile of interest (here we choose Northern France)

tile_id = "31UEQ"

This is the give-me-all query

fr = eopf_catalog.search( 
    collections= 'sentinel-2-l2a',
    query={"grid:code": {"eq": f"MGRS-{tile_id}"}},
    datetime='2017-01-01T00:00:00Z/2026-03-31T23:59:59.999999Z'
)

items_list = list(fr.items())

items = filter_items_deprecated(items_list)
################################# DEBUG #########################################
############# only include one image per month to speed up debugging ############
_, item_ind = np.unique([val.datetime.month for val in items], return_index=True)
items = [items[val] for val in item_ind]
################################# DEBUG #########################################

Inspect one example tile

items[0]

type "Feature"
stac_version "1.1.0"
stac_extensions[] 13 items
- 0 "https://stac-extensions.github.io/timestamps/v1.1.0/schema.json"
- 1 "https://stac-extensions.github.io/eo/v2.0.0/schema.json"
- 2 "https://stac-extensions.github.io/sat/v1.0.0/schema.json"
- 3 "https://stac-extensions.github.io/projection/v2.0.0/schema.json"
- 4 "https://stac-extensions.github.io/mgrs/v1.0.0/schema.json"
- 5 "https://stac-extensions.github.io/grid/v1.1.0/schema.json"
- 6 "https://stac-extensions.github.io/view/v1.0.0/schema.json"
- 7 "https://stac-extensions.github.io/processing/v1.2.0/schema.json"
- 8 "https://stac-extensions.github.io/product/v0.1.0/schema.json"
- 9 "https://stac-extensions.github.io/scientific/v1.0.0/schema.json"
- 10 "https://cs-si.github.io/eopf-stac-extension/v1.2.0/schema.json"
- 11 "https://stac-extensions.github.io/version/v1.2.0/schema.json"
- 12 "https://stac-extensions.github.io/raster/v2.0.0/schema.json"
id "S2B_MSIL2A_20260131T104159_N0511_R008_T31UEQ_20260131T143154"
geometry
- type "Polygon"
- coordinates[] 1 items
  - 0[] 12 items
    
    0[] 2 items
    
    0 3.383173476047945
    
    1 48.66255060592481
    
    1[] 2 items
    
    0 3.385611113576781
    
    1 48.6690358072244
    
    2[] 2 items
    
    0 3.440782603161841
    
    1 48.81514477088555
    
    3[] 2 items
    
    0 3.496005446473834
    
    1 48.96124192652695
    
    4[] 2 items
    
    0 3.551268297456118
    
    1 49.10740353085201
    
    5[] 2 items
    
    0 3.606901863652825
    
    1 49.253571659907536
    
    6[] 2 items
    
    0 3.662755007190602
    
    1 49.3997401995281
    
    7[] 2 items
    
    0 3.719174435807381
    
    1 49.54584690995426
    
    8[] 2 items
    
    0 3.758846424616649
    
    1 49.64773855983708
    
    9[] 2 items
    
    0 4.520595653725129
    
    1 49.64273706516441
    
    10[] 2 items
    
    0 4.49066964052278
    
    1 48.65538534877128
    
    11[] 2 items
    
    0 3.383173476047945
    
    1 48.66255060592481
bbox[] 4 items
- 0 3.383173476047945
- 1 48.65538534877128
- 2 4.520595653725129
- 3 49.64773855983708
properties
- gsd 10.0
- created "2026-05-05T19:59:52.250825Z"
- mission "Sentinel-2"
- sci:doi "10.5270/S2_-znk9xsj"
- updated "2026-05-05T19:59:52.457639Z"
- datetime "2026-01-31T10:41:59.024000Z"
- platform "sentinel-2b"
- grid:code "MGRS-31UEQ"
- proj:bbox[] 4 items
  - 0 499980.0
  - 1 5390220.0
  - 2 609780.0
  - 3 5500020.0
- proj:code "EPSG:32631"
- providers[] 3 items
  - 0
    
    url "https://commission.europa.eu/"
    
    name "European Commission"
    
    roles[] 1 items
    
    0 "licensor"
  - 1
    
    url "https://sentinel.esa.int/web/sentinel/missions/sentinel-2"
    
    name "ESA"
    
    roles[] 2 items
    
    0 "producer"
    
    1 "processor"
  - 2
    
    url "https://zarr.eopf.copernicus.eu/"
    
    name "EOPF Sentinel Zarr Samples Service"
    
    roles[] 2 items
    
    0 "host"
    
    1 "processor"
- published "2026-05-05T19:59:52.250825Z"
- deprecated False
- instruments[] 1 items
  - 0 "msi"
- end_datetime "2026-01-31T10:41:59.024000Z"
- product:type "S02MSIL2A"
- constellation "sentinel-2"
- eo:snow_cover 0.0
- mgrs:utm_zone 31
- eo:cloud_cover 92.335999
- start_datetime "2026-01-31T10:41:59.024000Z"
- sat:orbit_state "descending"
- eopf:datatake_id "GS2B_20260131T104159_046509_N05.11"
- mgrs:grid_square "EQ"
- processing:level "L2A"
- view:sun_azimuth 161.742357182465
- eopf:datastrip_id "S2B_OPER_MSI_L2A_DS_2BPS_20260131T143154_S20260131T104153_N05.11"
- mgrs:latitude_band "U"
- processing:lineage "systematic"
- processing:version "05.11"
- product:timeliness "PT3H"
- sat:absolute_orbit 46509
- sat:relative_orbit 8
- view:sun_elevation 68.3155648028842
- processing:facility "ESA"
- processing:software
  - EOPF-CPM "2.7.0"
- eopf:instrument_mode "INS-NOBS"
- product:timeliness_category "NRT"
- sat:platform_international_designator "2015-028A"
links[] 6 items
- 0
  - rel "collection"
  - href "https://stac.core.eopf.eodc.eu/collections/sentinel-2-l2a"
  - type "application/json"
- 1
  - rel "parent"
  - href "https://stac.core.eopf.eodc.eu/collections/sentinel-2-l2a"
  - type "application/json"
- 2
  - rel "root"
  - href "https://stac.core.eopf.eodc.eu/"
  - type "application/json"
  - title "EOPF Sentinel Zarr Samples Service STAC API"
- 3
  - rel "self"
  - href "https://stac.core.eopf.eodc.eu/collections/sentinel-2-l2a/items/S2B_MSIL2A_20260131T104159_N0511_R008_T31UEQ_20260131T143154"
  - type "application/geo+json"
- 4
  - rel "cite-as"
  - href "https://doi.org/10.5270/S2_-znk9xsj"
- 5
  - rel "license"
  - href "https://sentinel.esa.int/documents/247904/690755/Sentinel_Data_Legal_Notice"
  - type "application/pdf"
  - title "Legal notice on the use of Copernicus Sentinel Data and Service Information"
assets
- SR_10m
  - href "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:notebook-data/tutorial_data/cpm_v270/S2B_MSIL2A_20260131T104159_N0511_R008_T31UEQ_20260131T143154.zarr/measurements/reflectance/r10m"
  - type "application/vnd+zarr"
  - title "Surface Reflectance - 10m"
  - gsd 10
  - bands[] 4 items
    
    0
    
    name "B02"
    
    description "Blue (band 2)"
    
    eo:common_name "blue"
    
    eo:center_wavelength 0.49
    
    eo:full_width_half_max 0.098
    
    1
    
    name "B03"
    
    description "Green (band 3)"
    
    eo:common_name "green"
    
    eo:center_wavelength 0.56
    
    eo:full_width_half_max 0.045
    
    2
    
    name "B04"
    
    description "Red (band 4)"
    
    eo:common_name "red"
    
    eo:center_wavelength 0.665
    
    eo:full_width_half_max 0.038
    
    3
    
    name "B08"
    
    description "NIR 1 (band 8)"
    
    eo:common_name "nir"
    
    eo:center_wavelength 0.842
    
    eo:full_width_half_max 0.145
  - xarray:open_dataset_kwargs
    
    chunks
    
    engine "eopf-zarr"
    
    op_mode "native"
  - roles[] 3 items
    
    0 "data"
    
    1 "reflectance"
    
    2 "dataset"
- SR_20m
  - href "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:notebook-data/tutorial_data/cpm_v270/S2B_MSIL2A_20260131T104159_N0511_R008_T31UEQ_20260131T143154.zarr/measurements/reflectance/r20m"
  - type "application/vnd+zarr"
  - title "Surface Reflectance - 20m"
  - gsd 20
  - bands[] 10 items
    
    0
    
    name "B01"
    
    description "Coastal aerosol (band 1)"
    
    eo:common_name "coastal"
    
    eo:center_wavelength 0.443
    
    eo:full_width_half_max 0.027
    
    1
    
    name "B02"
    
    description "Blue (band 2)"
    
    eo:common_name "blue"
    
    eo:center_wavelength 0.49
    
    eo:full_width_half_max 0.098
    
    2
    
    name "B03"
    
    description "Green (band 3)"
    
    eo:common_name "green"
    
    eo:center_wavelength 0.56
    
    eo:full_width_half_max 0.045
    
    3
    
    name "B04"
    
    description "Red (band 4)"
    
    eo:common_name "red"
    
    eo:center_wavelength 0.665
    
    eo:full_width_half_max 0.038
    
    4
    
    name "B05"
    
    description "Red edge 1 (band 5)"
    
    eo:common_name "rededge071"
    
    eo:center_wavelength 0.704
    
    eo:full_width_half_max 0.019
    
    5
    
    name "B06"
    
    description "Red edge 2 (band 6)"
    
    eo:common_name "rededge075"
    
    eo:center_wavelength 0.74
    
    eo:full_width_half_max 0.018
    
    6
    
    name "B07"
    
    description "Red edge 3 (band 7)"
    
    eo:common_name "rededge078"
    
    eo:center_wavelength 0.783
    
    eo:full_width_half_max 0.028
    
    7
    
    name "B8A"
    
    description "NIR 2 (band 8A)"
    
    eo:common_name "nir08"
    
    eo:center_wavelength 0.865
    
    eo:full_width_half_max 0.033
    
    8
    
    name "B11"
    
    description "SWIR 1 (band 11)"
    
    eo:common_name "swir16"
    
    eo:center_wavelength 1.61
    
    eo:full_width_half_max 0.143
    
    9
    
    name "B12"
    
    description "SWIR 2 (band 12)"
    
    eo:common_name "swir22"
    
    eo:center_wavelength 2.19
    
    eo:full_width_half_max 0.242
  - xarray:open_dataset_kwargs
    
    chunks
    
    engine "eopf-zarr"
    
    op_mode "native"
  - roles[] 3 items
    
    0 "data"
    
    1 "reflectance"
    
    2 "dataset"
- SR_60m
  - href "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:notebook-data/tutorial_data/cpm_v270/S2B_MSIL2A_20260131T104159_N0511_R008_T31UEQ_20260131T143154.zarr/measurements/reflectance/r60m"
  - type "application/vnd+zarr"
  - title "Surface Reflectance - 60m"
  - gsd 60
  - bands[] 11 items
    
    0
    
    name "B01"
    
    description "Coastal aerosol (band 1)"
    
    eo:common_name "coastal"
    
    eo:center_wavelength 0.443
    
    eo:full_width_half_max 0.027
    
    1
    
    name "B02"
    
    description "Blue (band 2)"
    
    eo:common_name "blue"
    
    eo:center_wavelength 0.49
    
    eo:full_width_half_max 0.098
    
    2
    
    name "B03"
    
    description "Green (band 3)"
    
    eo:common_name "green"
    
    eo:center_wavelength 0.56
    
    eo:full_width_half_max 0.045
    
    3
    
    name "B04"
    
    description "Red (band 4)"
    
    eo:common_name "red"
    
    eo:center_wavelength 0.665
    
    eo:full_width_half_max 0.038
    
    4
    
    name "B05"
    
    description "Red edge 1 (band 5)"
    
    eo:common_name "rededge071"
    
    eo:center_wavelength 0.704
    
    eo:full_width_half_max 0.019
    
    5
    
    name "B06"
    
    description "Red edge 2 (band 6)"
    
    eo:common_name "rededge075"
    
    eo:center_wavelength 0.74
    
    eo:full_width_half_max 0.018
    
    6
    
    name "B07"
    
    description "Red edge 3 (band 7)"
    
    eo:common_name "rededge078"
    
    eo:center_wavelength 0.783
    
    eo:full_width_half_max 0.028
    
    7
    
    name "B8A"
    
    description "NIR 2 (band 8A)"
    
    eo:common_name "nir08"
    
    eo:center_wavelength 0.865
    
    eo:full_width_half_max 0.033
    
    8
    
    name "B09"
    
    description "NIR 3 (band 9)"
    
    eo:common_name "nir09"
    
    eo:center_wavelength 0.945
    
    eo:full_width_half_max 0.026
    
    9
    
    name "B11"
    
    description "SWIR 1 (band 11)"
    
    eo:common_name "swir16"
    
    eo:center_wavelength 1.61
    
    eo:full_width_half_max 0.143
    
    10
    
    name "B12"
    
    description "SWIR 2 (band 12)"
    
    eo:common_name "swir22"
    
    eo:center_wavelength 2.19
    
    eo:full_width_half_max 0.242
  - xarray:open_dataset_kwargs
    
    chunks
    
    engine "eopf-zarr"
    
    op_mode "native"
  - roles[] 3 items
    
    0 "data"
    
    1 "reflectance"
    
    2 "dataset"
- AOT_10m
  - href "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:notebook-data/tutorial_data/cpm_v270/S2B_MSIL2A_20260131T104159_N0511_R008_T31UEQ_20260131T143154.zarr/quality/atmosphere/r10m/aot"
  - type "application/vnd+zarr"
  - title "Aerosol optical thickness (AOT)"
  - description "Aerosol Optical Thickness map at 10m (for 550nm) resampled from 20m AOT map"
  - gsd 10
  - nodata 0
  - data_type "uint16"
  - raster:scale 0.001
  - raster:offset 0
  - roles[] 1 items
    
    0 "data"
- B01_20m
  - href "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:notebook-data/tutorial_data/cpm_v270/S2B_MSIL2A_20260131T104159_N0511_R008_T31UEQ_20260131T143154.zarr/measurements/reflectance/r20m/b01"
  - type "application/vnd+zarr"
  - title "Coastal aerosol (band 1) - 20m"
  - description "BOA reflectance from MSI acquisition at spectral band 01 443 nm"
  - gsd 20
  - bands[] 1 items
    
    0
    
    name "B01"
    
    description "Coastal aerosol (band 1)"
    
    eo:common_name "coastal"
    
    eo:center_wavelength 0.443
    
    eo:full_width_half_max 0.027
  - nodata 0
  - data_type "uint16"
  - raster:scale 0.0001
  - raster:offset -0.1
  - roles[] 2 items
    
    0 "data"
    
    1 "reflectance"
- B02_10m
  - href "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:notebook-data/tutorial_data/cpm_v270/S2B_MSIL2A_20260131T104159_N0511_R008_T31UEQ_20260131T143154.zarr/measurements/reflectance/r10m/b02"
  - type "application/vnd+zarr"
  - title "Blue (band 2) - 10m"
  - description "BOA reflectance from MSI acquisition at spectral band 02 490 nm"
  - gsd 10
  - bands[] 1 items
    
    0
    
    name "B02"
    
    description "Blue (band 2)"
    
    eo:common_name "blue"
    
    eo:center_wavelength 0.49
    
    eo:full_width_half_max 0.098
  - nodata 0
  - data_type "uint16"
  - raster:scale 0.0001
  - raster:offset -0.1
  - roles[] 2 items
    
    0 "data"
    
    1 "reflectance"
- B03_10m
  - href "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:notebook-data/tutorial_data/cpm_v270/S2B_MSIL2A_20260131T104159_N0511_R008_T31UEQ_20260131T143154.zarr/measurements/reflectance/r10m/b03"
  - type "application/vnd+zarr"
  - title "Green (band 3) - 10m"
  - description "BOA reflectance from MSI acquisition at spectral band 03 560 nm"
  - gsd 10
  - bands[] 1 items
    
    0
    
    name "B03"
    
    description "Green (band 3)"
    
    eo:common_name "green"
    
    eo:center_wavelength 0.56
    
    eo:full_width_half_max 0.045
  - nodata 0
  - data_type "uint16"
  - raster:scale 0.0001
  - raster:offset -0.1
  - roles[] 2 items
    
    0 "data"
    
    1 "reflectance"
- B04_10m
  - href "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:notebook-data/tutorial_data/cpm_v270/S2B_MSIL2A_20260131T104159_N0511_R008_T31UEQ_20260131T143154.zarr/measurements/reflectance/r10m/b04"
  - type "application/vnd+zarr"
  - title "Red (band 4) - 10m"
  - description "BOA reflectance from MSI acquisition at spectral band 04 665 nm"
  - gsd 10
  - bands[] 1 items
    
    0
    
    name "B04"
    
    description "Red (band 4)"
    
    eo:common_name "red"
    
    eo:center_wavelength 0.665
    
    eo:full_width_half_max 0.038
  - nodata 0
  - data_type "uint16"
  - raster:scale 0.0001
  - raster:offset -0.1
  - roles[] 2 items
    
    0 "data"
    
    1 "reflectance"
- B05_20m
  - href "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:notebook-data/tutorial_data/cpm_v270/S2B_MSIL2A_20260131T104159_N0511_R008_T31UEQ_20260131T143154.zarr/measurements/reflectance/r20m/b05"
  - type "application/vnd+zarr"
  - title "Red edge 1 (band 5) - 20m"
  - description "BOA reflectance from MSI acquisition at spectral band 05 705 nm"
  - gsd 20
  - bands[] 1 items
    
    0
    
    name "B05"
    
    description "Red edge 1 (band 5)"
    
    eo:common_name "rededge071"
    
    eo:center_wavelength 0.704
    
    eo:full_width_half_max 0.019
  - nodata 0
  - data_type "uint16"
  - raster:scale 0.0001
  - raster:offset -0.1
  - roles[] 2 items
    
    0 "data"
    
    1 "reflectance"
- B06_20m
  - href "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:notebook-data/tutorial_data/cpm_v270/S2B_MSIL2A_20260131T104159_N0511_R008_T31UEQ_20260131T143154.zarr/measurements/reflectance/r20m/b06"
  - type "application/vnd+zarr"
  - title "Red edge 2 (band 6) - 20m"
  - description "BOA reflectance from MSI acquisition at spectral band 06 740 nm"
  - gsd 20
  - bands[] 1 items
    
    0
    
    name "B06"
    
    description "Red edge 2 (band 6)"
    
    eo:common_name "rededge075"
    
    eo:center_wavelength 0.74
    
    eo:full_width_half_max 0.018
  - nodata 0
  - data_type "uint16"
  - raster:scale 0.0001
  - raster:offset -0.1
  - roles[] 2 items
    
    0 "data"
    
    1 "reflectance"
- B07_20m
  - href "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:notebook-data/tutorial_data/cpm_v270/S2B_MSIL2A_20260131T104159_N0511_R008_T31UEQ_20260131T143154.zarr/measurements/reflectance/r20m/b07"
  - type "application/vnd+zarr"
  - title "Red edge 3 (band 7) - 20m"
  - description "BOA reflectance from MSI acquisition at spectral band 07 783 nm"
  - gsd 20
  - bands[] 1 items
    
    0
    
    name "B07"
    
    description "Red edge 3 (band 7)"
    
    eo:common_name "rededge078"
    
    eo:center_wavelength 0.783
    
    eo:full_width_half_max 0.028
  - nodata 0
  - data_type "uint16"
  - raster:scale 0.0001
  - raster:offset -0.1
  - roles[] 2 items
    
    0 "data"
    
    1 "reflectance"
- B08_10m
  - href "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:notebook-data/tutorial_data/cpm_v270/S2B_MSIL2A_20260131T104159_N0511_R008_T31UEQ_20260131T143154.zarr/measurements/reflectance/r10m/b08"
  - type "application/vnd+zarr"
  - title "NIR 1 (band 8) - 10m"
  - description "BOA reflectance from MSI acquisition at spectral band 08 842 nm"
  - gsd 10
  - bands[] 1 items
    
    0
    
    name "B08"
    
    description "NIR 1 (band 8)"
    
    eo:common_name "nir"
    
    eo:center_wavelength 0.842
    
    eo:full_width_half_max 0.145
  - nodata 0
  - data_type "uint16"
  - raster:scale 0.0001
  - raster:offset -0.1
  - roles[] 2 items
    
    0 "data"
    
    1 "reflectance"
- B09_60m
  - href "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:notebook-data/tutorial_data/cpm_v270/S2B_MSIL2A_20260131T104159_N0511_R008_T31UEQ_20260131T143154.zarr/measurements/reflectance/r60m/b09"
  - type "application/vnd+zarr"
  - title "NIR 3 (band 9) - 60m"
  - description "BOA reflectance from MSI acquisition at spectral band 09 940 nm"
  - gsd 60
  - bands[] 1 items
    
    0
    
    name "B09"
    
    description "NIR 3 (band 9)"
    
    eo:common_name "nir09"
    
    eo:center_wavelength 0.945
    
    eo:full_width_half_max 0.026
  - nodata 0
  - data_type "uint16"
  - raster:scale 0.0001
  - raster:offset -0.1
  - roles[] 2 items
    
    0 "data"
    
    1 "reflectance"
- B11_20m
  - href "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:notebook-data/tutorial_data/cpm_v270/S2B_MSIL2A_20260131T104159_N0511_R008_T31UEQ_20260131T143154.zarr/measurements/reflectance/r20m/b11"
  - type "application/vnd+zarr"
  - title "SWIR 1 (band 11) - 20m"
  - description "BOA reflectance from MSI acquisition at spectral band 11 1610 nm"
  - gsd 20
  - bands[] 1 items
    
    0
    
    name "B11"
    
    description "SWIR 1 (band 11)"
    
    eo:common_name "swir16"
    
    eo:center_wavelength 1.61
    
    eo:full_width_half_max 0.143
  - nodata 0
  - data_type "uint16"
  - raster:scale 0.0001
  - raster:offset -0.1
  - roles[] 2 items
    
    0 "data"
    
    1 "reflectance"
- B12_20m
  - href "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:notebook-data/tutorial_data/cpm_v270/S2B_MSIL2A_20260131T104159_N0511_R008_T31UEQ_20260131T143154.zarr/measurements/reflectance/r20m/b12"
  - type "application/vnd+zarr"
  - title "SWIR 2 (band 12) - 20m"
  - description "BOA reflectance from MSI acquisition at spectral band 12 2190 nm"
  - gsd 20
  - bands[] 1 items
    
    0
    
    name "B12"
    
    description "SWIR 2 (band 12)"
    
    eo:common_name "swir22"
    
    eo:center_wavelength 2.19
    
    eo:full_width_half_max 0.242
  - nodata 0
  - data_type "uint16"
  - raster:scale 0.0001
  - raster:offset -0.1
  - roles[] 2 items
    
    0 "data"
    
    1 "reflectance"
- B8A_20m
  - href "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:notebook-data/tutorial_data/cpm_v270/S2B_MSIL2A_20260131T104159_N0511_R008_T31UEQ_20260131T143154.zarr/measurements/reflectance/r20m/b8a"
  - type "application/vnd+zarr"
  - title "NIR 2 (band 8A) - 20m"
  - description "BOA reflectance from MSI acquisition at spectral band 8A 865 nm"
  - gsd 20
  - bands[] 1 items
    
    0
    
    name "B8A"
    
    description "NIR 2 (band 8A)"
    
    eo:common_name "nir08"
    
    eo:center_wavelength 0.865
    
    eo:full_width_half_max 0.033
  - nodata 0
  - data_type "uint16"
  - raster:scale 0.0001
  - raster:offset -0.1
  - roles[] 2 items
    
    0 "data"
    
    1 "reflectance"
- SCL_20m
  - href "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:notebook-data/tutorial_data/cpm_v270/S2B_MSIL2A_20260131T104159_N0511_R008_T31UEQ_20260131T143154.zarr/conditions/mask/l2a_classification/r20m/scl"
  - type "application/vnd+zarr"
  - title "Scene classification map (SCL)"
  - description "scene classification map at 20m"
  - gsd 20
  - nodata 0
  - data_type "uint8"
  - raster:scale 1
  - raster:offset 0
  - roles[] 1 items
    
    0 "data"
- TCI_10m
  - href "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:notebook-data/tutorial_data/cpm_v270/S2B_MSIL2A_20260131T104159_N0511_R008_T31UEQ_20260131T143154.zarr/quality/l2a_quicklook/r10m/tci"
  - type "application/vnd+zarr"
  - title "True color image"
  - gsd 10
  - bands[] 3 items
    
    0
    
    name "B04"
    
    description "Red (band 4)"
    
    eo:common_name "red"
    
    eo:center_wavelength 0.665
    
    eo:full_width_half_max 0.038
    
    1
    
    name "B03"
    
    description "Green (band 3)"
    
    eo:common_name "green"
    
    eo:center_wavelength 0.56
    
    eo:full_width_half_max 0.045
    
    2
    
    name "B02"
    
    description "Blue (band 2)"
    
    eo:common_name "blue"
    
    eo:center_wavelength 0.49
    
    eo:full_width_half_max 0.098
  - roles[] 1 items
    
    0 "data"
- WVP_10m
  - href "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:notebook-data/tutorial_data/cpm_v270/S2B_MSIL2A_20260131T104159_N0511_R008_T31UEQ_20260131T143154.zarr/quality/atmosphere/r10m/wvp"
  - type "application/vnd+zarr"
  - title "Water vapour (WVP)"
  - description "Water Vapour Content map at 10m"
  - gsd 10
  - nodata 0
  - data_type "uint16"
  - raster:scale 0.001
  - raster:offset 0
  - roles[] 1 items
    
    0 "data"
- product
  - href "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:notebook-data/tutorial_data/cpm_v270/S2B_MSIL2A_20260131T104159_N0511_R008_T31UEQ_20260131T143154.zarr"
  - type "application/vnd+zarr"
  - title "EOPF Product"
  - description "The full Zarr store of the EOPF product"
  - xarray:open_datatree_kwargs
    
    chunks
    
    engine "eopf-zarr"
    
    op_mode "native"
  - roles[] 2 items
    
    0 "data"
    
    1 "metadata"
- zipped_product
  - href "https://download.user.eopf.eodc.eu/zip/collections/sentinel-2-l2a/items/S2B_MSIL2A_20260131T104159_N0511_R008_T31UEQ_20260131T143154.zip"
  - type "application/zip"
  - title "Zipped EOPF Product"
  - description "The full EOPF Zarr store as zip archive"
  - roles[] 3 items
    
    0 "data"
    
    1 "metadata"
    
    2 "archive"
- product_metadata
  - href "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:notebook-data/tutorial_data/cpm_v270/S2B_MSIL2A_20260131T104159_N0511_R008_T31UEQ_20260131T143154.zarr/.zmetadata"
  - type "application/json"
  - title "Consolidated Metadata"
  - description "Consolidated metadata of the EOPF product"
  - roles[] 1 items
    
    0 "metadata"
collection "sentinel-2-l2a"

Load Cloud Probability and SCL data with dask

Here we use Dask delayed to load multiple Sentinel-2 scenes in parallel. This approach: - Distributes the loading across multiple workers - Handles failures gracefully without stopping the entire process - Tracks timestamps for proper time-series analysis - Uses chunked reads for memory efficiency

The cloud probability layer (cld) and scene classification (scl) is loaded at 20m resolution from each scene.

def load_eopf_cloud(item, bbox=None):
    """Load cloud probability from an EOPF Sentinel-2 STAC item.
    
    Pixels where the SCL (Scene Classification Layer) equals 0 ("No Data")
    are masked as NaN so they are excluded from downstream analysis.
    
    Args:
        item: A PySTAC Item with a 'product' Zarr asset.
        bbox: Optional (minx, miny, maxx, maxy) in the scene's CRS (e.g. UTM)
              to spatially subset. If None, loads the full scene.
    
    Returns:
        dict with 'data' (xr.DataArray), 'datetime' (str), 'id' (str)
        or None if loading fails.
    """
    try:
        url = item.assets["product"].href
        dt = xr.open_datatree(url, engine="zarr", chunks={}, decode_timedelta=True)
        cld = dt["quality/probability/r20m/cld"]
        scl = dt["conditions/mask/l2a_classification/r20m/scl"]
        
        # Spatial subset if bbox provided
        if bbox is not None:
            minx, miny, maxx, maxy = bbox
            cld = cld.sel(x=slice(minx, maxx), y=slice(maxy, miny))
            scl = scl.sel(x=slice(minx, maxx), y=slice(maxy, miny))
            if any(s == 0 for s in cld.sizes.values()):
                print(f"⚠️  {item.id}: empty after subsetting, skipped")
                return None

        # Prepare the cld DataArray
        # cld.rio.write_crs(item.properties.get("proj:code"))
        # Mask pixels where SCL == 0 (No Data) as NaN
        cld = cld.where(scl != 0)
        
        datetime_str = item.properties.get("datetime")
        cloud_cover = item.properties.get("eo:cloud_cover", "N/A")
        print(f"✓ Loaded {item.id} | Cloud cover: {cloud_cover}%")
        return {"data": cld, "datetime": datetime_str, "id": item.id}
    except Exception as e:
        print(f"✗ Failed to load {item.id}: {e}")
        return None

Work on the full tile or use a bounding box to subset the data (e.g. for debugging)

# Option A: Full scene (no bbox)
scene_bbox = None

# Option B: Subset to a geometry's bounding box — uncomment and set geom:
# geom = fields_gdf.iloc[0].geometry   # or any Shapely geometry
# scene_bbox = geom.bounds             # (minx, miny, maxx, maxy)

batch_size = 10
scenes = []
times = []

print(f"🚀 Loading {len(items)} scenes in batches of {batch_size}...")
print("   This prevents overwhelming the remote STAC server\n")

for batch_idx in range(0, len(items), batch_size):
    batch = items[batch_idx:batch_idx+batch_size]
    batch_num = batch_idx // batch_size + 1
    total_batches = (len(items) + batch_size - 1) // batch_size
    
    print(f"📦 Batch {batch_num}/{total_batches}: Loading {len(batch)} scenes...")
    
    tasks = [delayed(load_eopf_cloud)(item, bbox=scene_bbox) for item in batch]
    results = compute(*tasks)
    
    for result in results:
        if result is not None:
            scenes.append(result["data"])
            # Convert to timezone-naive timestamp
            dt = pd.to_datetime(result["datetime"])
            if dt.tz is not None:
                dt = dt.tz_localize(None)
            times.append(dt)
    
    print(f"Batch {batch_num} complete: {len([r for r in results if r is not None])}/{len(batch)} succeeded")
    
    if batch_idx + batch_size < len(items):
        time.sleep(1)
    print()

print(f"✅ Successfully loaded {len(scenes)} out of {len(items)} scenes total")

🚀 Loading 11 scenes in batches of 10...
   This prevents overwhelming the remote STAC server

📦 Batch 1/2: Loading 10 scenes...
Batch 1 complete: 10/10 succeeded

📦 Batch 2/2: Loading 1 scenes...
Batch 2 complete: 1/1 succeeded

✅ Successfully loaded 11 out of 11 scenes total

Open a SCL layer to highlight pot. no-data areas in the cloud probability maps

href = items[0].assets["SCL_20m"].href.split("/conditions")[0]

ds = xr.open_datatree(href, engine="zarr", chunks={}, decode_timedelta=True)

ds["/conditions/mask/l2a_classification/r20m"]

<xarray.DataTree 'r20m'>
Group: /conditions/mask/l2a_classification/r20m
    Dimensions:  (y: 5490, x: 5490)
    Coordinates:
      * y        (y) float32 22kB 5.5e+06 5.5e+06 5.5e+06 ... 5.39e+06 5.39e+06
      * x        (x) float32 22kB 5e+05 5e+05 5e+05 ... 6.098e+05 6.098e+05
    Data variables:
        scl      (y, x) float64 241MB dask.array<chunksize=(5490, 5490), meta=np.ndarray>

# Map integers to readable labels
scl_labels = [
    "No Data", "Saturated", "Dark Area", "Cloud Shadows", 
    "Vegetation", "Bare Soils", "Water", "Low Cloud Prob.", 
    "Medium Cloud Prob.", "High Cloud Prob.", "Cirrus", "Snow/Ice"
]

scl_colors = [
    "#000000", "#FF0000", "#646464", "#5F3214", "#008000", 
    "#A52A2A", "#0000FF", "#C0C0C0", "#808080", "#FFFFFF", 
    "#ADD8E6", "#FF00FF"
]

# Setup colormap and normalizer
cmap = ListedColormap(scl_colors)
norm = BoundaryNorm(np.arange(-0.5, 12.5, 1), cmap.N)

# Create the plot
fig, ax = plt.subplots(figsize=(12, 8))
im = ds["/conditions/mask/l2a_classification/r20m/scl"].plot.imshow(
    ax=ax, cmap=cmap, norm=norm, add_colorbar=False, interpolation="none"
)

# Add the labeled colorbar
cbar = fig.colorbar(im, ax=ax, ticks=range(12))
cbar.ax.set_yticklabels(scl_labels)  # Mapping indices to strings
cbar.set_label('Surface Classification')

plt.tight_layout()

Tile can show partial coverage due to acquistion handling, please refer e.g. here for details https://sentiwiki.copernicus.eu/web/s2-products#S2Products-MSIProductsOverview

# Optionally, uncomment below to save the SCL raster for use in GIS or other tools
# ds["/conditions/mask/l2a_classification/r20m/scl"].rio.to_raster("scl.tif")

Time-Series Analysis: Stack and Aggregate

Now we’ll combine all loaded scenes into a single xarray DataArray with a time dimension. This enables: - Temporal aggregation: Calculate mean cloudiness over time - Seasonal analysis: Group by month or season - Cloud frequency: Count how often each pixel is cloudy - Lazy computation: Dask delays execution until .compute() is called

# Some scenes may have identical timestamps - keep only unique ones
print("🔍 Checking for duplicate timestamps...")
print(f"   Original: {len(times)} scenes")

# Create DataFrame to track duplicates
scene_info = pd.DataFrame({
    'datetime': times,
    'scene_idx': range(len(times))
})

scene_info['datetime_normalized'] = pd.to_datetime(scene_info['datetime'])

# Find duplicates
duplicates = scene_info[scene_info.duplicated(subset='datetime_normalized', keep=False)]
if len(duplicates) > 0:
    print(f"   ⚠️  Found {len(duplicates)} duplicate timestamps:")
    for dt in duplicates['datetime_normalized'].unique():
        dup_count = len(duplicates[duplicates['datetime_normalized'] == dt])
        print(f"      - {dt}: {dup_count} copies")

# Keep only first occurrence of each timestamp
scene_info_unique = scene_info.drop_duplicates(subset='datetime_normalized', keep='first')
unique_indices = scene_info_unique['scene_idx'].tolist()

# Filter scenes and times
scenes_unique = [scenes[i] for i in unique_indices]
times_unique = [times[i] for i in unique_indices]

print(f"   ✅ After deduplication: {len(scenes_unique)} unique scenes")
print()

# Scenes with too much NoData can create artifacts at tile edges
MAX_NODATA_PERCENT = 50  # Adjust this threshold as needed (set to 100 to disable filtering)

print(f"🔍 Filtering scenes with >{MAX_NODATA_PERCENT}% NoData...")
scenes_filtered = []
times_filtered = []
nodata_stats = []

for scene, time_ in zip(scenes_unique, times_unique):
    total_pixels = scene.size
    valid_pixels = scene.count().values
    nodata_percent = (1 - valid_pixels / total_pixels) * 100
    
    if nodata_percent <= MAX_NODATA_PERCENT:
        scenes_filtered.append(scene)
        times_filtered.append(time_)
        nodata_stats.append({'datetime': time_, 'nodata_percent': nodata_percent})
    else:
        print(f"   ⚠️  Excluding {time_.strftime('%Y-%m-%d')}: {nodata_percent:.1f}% NoData")

print(f"   ✅ Retained {len(scenes_filtered)} scenes with <{MAX_NODATA_PERCENT}% NoData")
print()

time_coord = pd.DatetimeIndex(times_filtered, name='time')
da_stack = xr.concat(scenes_filtered, dim=time_coord)

print("🔄 Computing observation count per pixel...")
print("   (This will access remote data once and cache results)")

valid_count = da_stack.count(dim='time').compute()

print("   ✅ Observation count computed and cached in memory")

# Minimum-observation threshold: mask pixels with too few valid scenes
MIN_OBS = 3  # Increase this to be more conservative
obs_mask = valid_count >= MIN_OBS

print("🔄 Computing mean cloud probability and cloud frequency...")

# The mean automatically handles weighting - each pixel's mean is computed
# only from its valid observations (NaN values are excluded)
# Compute immediately to get concrete values
mean_cloudiness = da_stack.mean(dim='time').where(obs_mask).compute()

# Cloud frequency — divide by valid count per pixel, not total scene count
is_cloudy = da_stack > 50
cloud_frequency = (is_cloudy.sum(dim='time').compute() / valid_count * 100).where(obs_mask)

print("   ✅ Statistics computed")

print(f"\n📦 Stacked shape: {da_stack.shape}")
print(f"📅 Time range: {da_stack.time.values[0]} → {da_stack.time.values[-1]}")
print(f"🔢 Min valid observations threshold: {MIN_OBS}")
print("📊 Observation count statistics:")
print(f"   - Min:    {int(valid_count.min().values)}")
print(f"   - Max:    {int(valid_count.max().values)}")
print(f"   - Mean:   {float(valid_count.mean().values):.1f}")

total_pixels = valid_count.size
masked_pixels = obs_mask.sum().values
coverage_percent = (masked_pixels / total_pixels) * 100
print(f"   - Pixels with ≥{MIN_OBS} obs: {coverage_percent:.1f}%")
print()
print("💡 Key statistics are now cached in memory")

🔍 Checking for duplicate timestamps...
   Original: 11 scenes
   ✅ After deduplication: 11 unique scenes

🔍 Filtering scenes with >50% NoData...
   ✅ Retained 11 scenes with <50% NoData

🔄 Computing observation count per pixel...
   (This will access remote data once and cache results)
   ✅ Observation count computed and cached in memory
🔄 Computing mean cloud probability and cloud frequency...

   ✅ Statistics computed

📦 Stacked shape: (11, 5490, 5490)
📅 Time range: 2026-01-31T10:41:59.024000000 → 2025-12-29T10:45:01.024000000
🔢 Min valid observations threshold: 3
📊 Observation count statistics:
   - Min:    11
   - Max:    11
   - Mean:   11.0
   - Pixels with ≥3 obs: 100.0%

💡 Key statistics are now cached in memory

scene_info

	datetime	scene_idx	datetime_normalized
0	2026-01-31 10:41:59.024	0	2026-01-31 10:41:59.024
1	2026-02-28 10:50:01.025	1	2026-02-28 10:50:01.025
2	2026-03-30 10:50:31.025	2	2026-03-30 10:50:31.025
3	2025-04-26 10:36:29.024	3	2025-04-26 10:36:29.024
4	2025-05-31 10:40:41.025	4	2025-05-31 10:40:41.025
5	2025-06-30 10:40:41.025	5	2025-06-30 10:40:41.025
6	2025-07-25 10:50:51.024	6	2025-07-25 10:50:51.024
7	2025-09-28 10:38:31.025	7	2025-09-28 10:38:31.025
8	2025-10-31 10:52:21.025	8	2025-10-31 10:52:21.025
9	2025-11-30 10:54:21.025	9	2025-11-30 10:54:21.025
10	2025-12-29 10:45:01.024	10	2025-12-29 10:45:01.024

# Optionally, uncomment below to save the mean cloudiness raster for use in GIS or other tools
# mean_cloudiness.rio.to_raster("cld.tif")

time_series = da_stack.mean(dim=['x', 'y'])
df = time_series.to_dataframe()

plt.figure(figsize=(12, 6))

plt.bar(df.index, df['cld'], color='steelblue', alpha=0.7, label='Cloud Probability', width=0.8)

plt.title(f"Cloud Probability Over Time FR-tile {tile_id}")
plt.ylabel("Average Cloud Probability (%)")
plt.xlabel("Date")
plt.grid(True, linestyle='--', alpha=0.3)
plt.legend()
plt.tight_layout()

_ = plt.show()

_ = mean_cloudiness.plot.imshow()

This is our first desired outcome: each pixel in this image represents the percentage of time that specific location was identified as cloudy across the selected time period. High values could potentially correspond to more persistent, localized cloud cover or shadows (e.g. orographic cloud formation; in other areas, it may indicate persistent micro-climates).

Of course, the final result will vary depending on the selected time period. With at least for 1 year coverage we can get an idea about seasonality

Seasonal and Monthly Analysis

Analyze cloud patterns by grouping data temporally. We can identify which seasons or months have the clearest skies.

Note: We removed timezone information earlier (.tz_localize(None)) because xarray’s groupby() datetime accessors like .season and .month don’t support timezone-aware timestamps.

# Group by season (DJF, MAM, JJA, SON)
seasonal_means = da_stack.groupby('time.season').mean(dim='time')

# or group by Month for finer granularity
monthly_means = da_stack.groupby('time.month').mean(dim='time')

# select a specific season to analyze
spring_cloudiness = seasonal_means.sel(season='MAM') # March, April, May

seasonal_means

<xarray.DataArray 'cld' (season: 4, y: 5490, x: 5490)> Size: 482MB
dask.array<transpose, shape=(4, 5490, 5490), dtype=float32, chunksize=(1, 5490, 5490), chunktype=numpy.ndarray>
Coordinates:
  * season   (season) object 32B 'DJF' 'JJA' 'MAM' 'SON'
  * y        (y) float32 22kB 5.5e+06 5.5e+06 5.5e+06 ... 5.39e+06 5.39e+06
  * x        (x) float32 22kB 5e+05 5e+05 5e+05 ... 6.098e+05 6.098e+05
Attributes:
    _eopf_attrs:  {'coordinates': ['y', 'x'], 'dimensions': ['y', 'x'], 'dtyp...
    dtype:        <f4
    long_name:    cloud probability mask at 20m
    units:        percentage
    valid_max:    100
    valid_min:    0

Visualize Seasonal Differences (Delta Images)

To highlight how cloud probability varies between seasons, we can create delta images that show the difference between season pairs. This helps identify: - Which areas are cloudier in winter vs. summer - Seasonal shifts in cloud patterns - Areas with consistent vs. variable cloud cover across seasons

# Create delta images comparing seasons
# Common comparisons: Winter-Summer, Spring-Fall

fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Individual season plots
seasons = ['DJF', 'MAM', 'JJA', 'SON']
season_labels = ['Winter (DJF)', 'Spring (MAM)', 'Summer (JJA)', 'Fall (SON)']

for idx, (season, label) in enumerate(zip(seasons, season_labels)):
    ax = axes[idx // 2, idx % 2]
    seasonal_means.sel(season=season).plot.imshow(
        ax=ax, cmap='Blues', vmin=0, vmax=100,
        cbar_kwargs={'label': 'Cloud Probability (%)'}
    )
    ax.set_title(f'{label} - Mean Cloud Probability')
    ax.set_xlabel('X')
    ax.set_ylabel('Y')

plt.tight_layout()
plt.show()

# Create delta (difference) images between season pairs
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Define season comparisons with descriptive labels
comparisons = [
    ('DJF', 'JJA', 'Winter - Summer'),
    ('MAM', 'SON', 'Spring - Fall'),
    ('JJA', 'DJF', 'Summer - Winter')
]

for idx, (season1, season2, title) in enumerate(comparisons):
    delta = seasonal_means.sel(season=season1) - seasonal_means.sel(season=season2)
    
    # Use diverging colormap: negative (blue) = less cloudy, positive (red) = more cloudy
    im = delta.plot.imshow(
        ax=axes[idx], 
        cmap='RdBu_r',  # Red for more clouds, Blue for less clouds
        vmin=-50, vmax=50,  # Center at 0
        cbar_kwargs={'label': 'Cloud Probability Difference (%)'}
    )
    axes[idx].set_title(f'{title}\n(Red = {season1} Cloudier, Blue = {season2} Cloudier)')
    axes[idx].set_xlabel('X')
    axes[idx].set_ylabel('Y')

plt.tight_layout()
plt.show()

print("📊 Interpretation:")
print("  🔴 Red areas: First season is cloudier")
print("  🔵 Blue areas: Second season is cloudier")
print("  ⚪ White areas: Similar cloud probability in both seasons")

📊 Interpretation:
  🔴 Red areas: First season is cloudier
  🔵 Blue areas: Second season is cloudier
  ⚪ White areas: Similar cloud probability in both seasons

Looking at the delta-images (the difference between two seasonal means, e.g., \(\Delta \text{cld} = \text{mean}_{\text{season2}} - \text{mean}_{\text{season1}}\)), we can identify distinct shifts in cloud probability across the year. These maps provide a critical understanding of how atmospheric dynamics evolve over time.

Positive values in the delta-image indicate an increase in cloud probability in the second period compared to the first, while negative values indicate a “clearing” effect. This is particularly useful for identifying the onset of wet seasons or the influence of seasonal fog in low-lying agricultural valleys.

These differences often reflect large-scale synoptic changes—such as shifting wind patterns, localized convective activity, or changes in regional humidity—that dictate the “data-acquisition window” for specific months.

By analyzing these fluctuations, you can move away from static cloud-coverage assumptions. Instead, you gain a predictive understanding of when your specific tile is most likely to be obscured, allowing for more strategic scheduling of multi-temporal analysis based on the “clearest” seasonal window.

By subtracting the mean cloud probability of the clear-sky period from the high-variability season, we isolate the areas most susceptible to seasonal obstruction. This highlights that while certain regions remain stable year-round, others show high sensitivity to seasonal weather patterns, directly impacting the effective temporal resolution available for agricultural monitoring.

# Statistical comparison: Mean cloud probability by season
print("📈 Average Cloud Probability by Season:\n")
for season in ['DJF', 'MAM', 'JJA', 'SON']:
    season_name = {'DJF': 'Winter', 'MAM': 'Spring', 'JJA': 'Summer', 'SON': 'Fall'}[season]
    mean_val = float(seasonal_means.sel(season=season).mean().compute())
    std_val = float(seasonal_means.sel(season=season).std().compute())
    print(f"  {season_name:8s} ({season}): {mean_val:5.2f}% ± {std_val:5.2f}%")

# Create a bar chart
season_names = ['Winter\n(DJF)', 'Spring\n(MAM)', 'Summer\n(JJA)', 'Fall\n(SON)']
season_keys = ['DJF', 'MAM', 'JJA', 'SON']
means = [float(seasonal_means.sel(season=s).mean().compute()) for s in season_keys]

plt.figure(figsize=(10, 6))
bars = plt.bar(season_names, means, color=['#3498db', '#2ecc71', '#e74c3c', '#f39c12'], 
               alpha=0.8, edgecolor='black', linewidth=1.5)

# Add value labels on bars
for bar, mean in zip(bars, means):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{mean:.1f}%', ha='center', va='bottom', fontsize=12, fontweight='bold')

plt.ylabel('Average Cloud Probability (%)', fontsize=12)
plt.title('Seasonal Cloud Probability Comparison', fontsize=14, fontweight='bold')
plt.ylim(0, max(means) * 1.15)
plt.grid(axis='y', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()

📈 Average Cloud Probability by Season:

  Winter   (DJF): 40.95% ± 17.76%
  Spring   (MAM): 34.43% ± 22.59%
  Summer   (JJA): 10.59% ± 19.68%
  Fall     (SON): 39.19% ± 19.76%

Cleanup Dask Cluster

When you’re finished with all computations, it’s good practice to close the Dask cluster to free up resources.

# Clean up Dask cluster when finished
# Uncomment the lines below when you're done with computations
client.close()
cluster.close()
print("✅ Dask cluster closed")

✓ Loaded S2C_MSIL2A_20251031T105221_N0511_R051_T31UEQ_20251031T144221 | Cloud cover: 99.634349%
✓ Loaded S2C_MSIL2A_20250630T104041_N0511_R008_T31UEQ_20250630T161206 | Cloud cover: 17.821619%
✓ Loaded S2B_MSIL2A_20260131T104159_N0511_R008_T31UEQ_20260131T143154 | Cloud cover: 92.335999%
✓ Loaded S2C_MSIL2A_20260330T105031_N0512_R051_T31UEQ_20260330T162516 | Cloud cover: 62.743378%
✓ Loaded S2A_MSIL2A_20251229T104501_N0511_R008_T31UEQ_20251229T144308 | Cloud cover: 0.027862%
✓ Loaded S2A_MSIL2A_20250725T105051_N0511_R051_T31UEQ_20250725T143858 | Cloud cover: 34.8084%
✓ Loaded S2B_MSIL2A_20250426T103629_N0511_R008_T31UEQ_20250426T114837 | Cloud cover: 25.744334%
✓ Loaded S2C_MSIL2A_20260228T105001_N0512_R051_T31UEQ_20260228T145912 | Cloud cover: 90.135467%
✓ Loaded S2C_MSIL2A_20250531T104041_N0511_R008_T31UEQ_20250531T162214 | Cloud cover: 99.921113%
✓ Loaded S2C_MSIL2A_20251130T105421_N0511_R051_T31UEQ_20251130T143216 | Cloud cover: 59.23211%
✓ Loaded S2C_MSIL2A_20250928T103831_N0511_R008_T31UEQ_20250928T161815 | Cloud cover: 0.103102%
✅ Dask cluster closed

Field-Based Cloud Cover Analysis

Now we’ll analyze cloud cover for specific agricultural fields using field boundaries from a geojson string. This approach: - Loads vector boundaries: Read field geometries from a geojson string - Spatial subsetting: Extract only the pixels within each field boundary - Per-field statistics: Calculate mean cloud cover for each field across all time steps - Field ranking: Identify which fields have the best/worst observation conditions

This is particularly useful for agricultural monitoring where you want to know which fields are most frequently obscured by clouds.

### A few sample agricultural field boundaries * The boundaries are given in the format of geojson strings in the CRS (coordinate reference system) of EPSG:4326, that is, WGS84 lon/lat degrees.

fields_geojson_str = '{"type": "FeatureCollection", "features": [{"id": "29492", "type": "Feature", "properties": {}, "geometry": {"type": "MultiPolygon", "coordinates": [[[[3.3512915, 49.5071137], [3.3525724, 49.5071098], [3.3525784, 49.5079439], [3.3512975, 49.5079478], [3.3512915, 49.5071137]]]]}}, {"id": "29424", "type": "Feature", "properties": {}, "geometry": {"type": "MultiPolygon", "coordinates": [[[[3.5118823, 49.1951703], [3.5197165, 49.1951353], [3.5197703, 49.2002695], [3.5119353, 49.2003045], [3.5118823, 49.1951703]]]]}}, {"id": "26890", "type": "Feature", "properties": {}, "geometry": {"type": "MultiPolygon", "coordinates": [[[[3.6448076, 49.5955906], [3.6447806, 49.5956178], [3.6448255, 49.5957521], [3.6448706, 49.5958953], [3.6449468, 49.5961729], [3.6451269, 49.5967279], [3.6452484, 49.5971578], [3.6454872, 49.5978468], [3.6455454, 49.597963], [3.6455616, 49.5980526], [3.6456048, 49.598124], [3.6457851, 49.598149], [3.6459231, 49.5981385], [3.6460872, 49.5980739], [3.6462233, 49.5979916], [3.6463993, 49.5978459], [3.646846, 49.5974818], [3.646886, 49.5974275], [3.646536, 49.5972516], [3.6464801, 49.5972253], [3.6464239, 49.59719], [3.646396, 49.5971813], [3.6463956, 49.5971633], [3.6463668, 49.5971187], [3.6461552, 49.5969503], [3.646099, 49.596915], [3.6460428, 49.5968797], [3.6460143, 49.5968441], [3.6458319, 49.5967383], [3.6457889, 49.5966758], [3.6457467, 49.5966494], [3.6455782, 49.5965434], [3.6454099, 49.5964464], [3.6453535, 49.5964021], [3.6453254, 49.5963844], [3.6452964, 49.5963309], [3.6452676, 49.5962863], [3.6452528, 49.5962505], [3.6452243, 49.5962149], [3.6452098, 49.5961881], [3.6451948, 49.5961433], [3.6451651, 49.5960628], [3.6450604, 49.5957495], [3.6450021, 49.5956334], [3.644974, 49.5956157], [3.6449321, 49.5955982], [3.6448904, 49.5955897], [3.6448352, 49.5955903], [3.6448076, 49.5955906]]]]}}, {"id": "16041", "type": "Feature", "properties": {}, "geometry": {"type": "MultiPolygon", "coordinates": [[[[3.3498827, 49.2388241], [3.3499649, 49.2388144], [3.3499923, 49.2388142], [3.3499457, 49.238536], [3.3519167, 49.2382858], [3.3537508, 49.2380456], [3.3540656, 49.238007], [3.354885, 49.239887], [3.3557183, 49.2417758], [3.3517535, 49.243229], [3.3512493, 49.243413], [3.3492257, 49.2389105], [3.3495949, 49.2388445], [3.3495951, 49.2388535], [3.3498827, 49.2388241]]]]}}, {"id": "3211", "type": "Feature", "properties": {}, "geometry": {"type": "MultiPolygon", "coordinates": [[[[3.8013103, 49.0771152], [3.8013541, 49.0772135], [3.8013009, 49.0772681], [3.8011581, 49.0775394], [3.801028, 49.0777747], [3.8008961, 49.0779471], [3.800362, 49.078412], [3.800242, 49.0785213], [3.7993567, 49.0791432], [3.7989816, 49.0794264], [3.7985792, 49.0797099], [3.7981484, 49.0799579], [3.7977039, 49.0802059], [3.797595, 49.0802253], [3.7975403, 49.0802259], [3.7973884, 49.0801738], [3.7973336, 49.0801745], [3.7972787, 49.0801662], [3.7972108, 49.080185], [3.7969664, 49.0802509], [3.796858, 49.0802882], [3.7967362, 49.0803346], [3.7965873, 49.0803903], [3.7965191, 49.0804001], [3.79645, 49.080374], [3.7964356, 49.0803472], [3.7964044, 49.0802128], [3.7964269, 49.0800418], [3.7964077, 49.0798443], [3.7963301, 49.0795217], [3.7962833, 49.0793156], [3.7961946, 49.079083], [3.7966424, 49.0789517], [3.7968592, 49.0788771], [3.7967036, 49.0786993], [3.7965623, 49.0785393], [3.7962918, 49.0786504], [3.7959948, 49.0787888], [3.7959365, 49.0786637], [3.795764, 49.0783692], [3.7955082, 49.0780308], [3.7953099, 49.0777906], [3.7950411, 49.0774793], [3.7949695, 49.0773634], [3.7956221, 49.0772206], [3.7964654, 49.0770485], [3.7967514, 49.0770001], [3.7969694, 49.0769705], [3.7971054, 49.0769419], [3.7972956, 49.0768946], [3.7975671, 49.0768194], [3.7981929, 49.0766949], [3.7989552, 49.0765598], [3.7991871, 49.076539], [3.799328, 49.0766811], [3.7993703, 49.0767255], [3.8009072, 49.0764101], [3.8013103, 49.0771152]]]]}}, {"id": "5547", "type": "Feature", "properties": {}, "geometry": {"type": "MultiPolygon", "coordinates": [[[[3.3630865, 48.9858561], [3.3665062, 48.9858453], [3.3665227, 48.988096], [3.3631029, 48.9881068], [3.3630865, 48.9858561]]]]}}, {"id": "9919", "type": "Feature", "properties": {}, "geometry": {"type": "MultiPolygon", "coordinates": [[[[3.3839195, 49.6247472], [3.3879561, 49.6247337], [3.3879769, 49.627356], [3.3839401, 49.6273695], [3.3839195, 49.6247472]]]]}}, {"id": "32171", "type": "Feature", "properties": {}, "geometry": {"type": "MultiPolygon", "coordinates": [[[[3.2549845, 49.6456402], [3.2558002, 49.6463205], [3.2564055, 49.6469031], [3.2593834, 49.6446423], [3.2585573, 49.6441816], [3.2577511, 49.6437401], [3.2565353, 49.6430737], [3.2564359, 49.64315], [3.2563418, 49.6432232], [3.2552575, 49.6440664], [3.2541636, 49.6449112], [3.2549845, 49.6456402]]]]}}, {"id": "25340", "type": "Feature", "properties": {}, "geometry": {"type": "MultiPolygon", "coordinates": [[[[4.0261277, 49.2628002], [4.0288276, 49.2627762], [4.0288643, 49.2645432], [4.0261643, 49.2645672], [4.0261277, 49.2628002]]]]}}, {"id": "24502", "type": "Feature", "properties": {}, "geometry": {"type": "MultiPolygon", "coordinates": [[[[3.0741046, 49.4107643], [3.0774247, 49.4107621], [3.0774281, 49.4129284], [3.0741079, 49.4129305], [3.0741046, 49.4107643]]]]}}, {"id": "10200", "type": "Feature", "properties": {}, "geometry": {"type": "MultiPolygon", "coordinates": [[[[3.0760301, 48.6862008], [3.0815491, 48.6861971], [3.081555, 48.6898513], [3.0760356, 48.6898551], [3.0760301, 48.6862008]]]]}}, {"id": "36225", "type": "Feature", "properties": {}, "geometry": {"type": "MultiPolygon", "coordinates": [[[[3.3502132, 49.1842432], [3.3520024, 49.1842377], [3.3520107, 49.1854106], [3.3502215, 49.185416], [3.3502132, 49.1842432]]]]}}, {"id": "29495", "type": "Feature", "properties": {}, "geometry": {"type": "MultiPolygon", "coordinates": [[[[3.5458069, 49.3197752], [3.5487899, 49.319761], [3.5488115, 49.3217111], [3.5458284, 49.3217252], [3.5458069, 49.3197752]]]]}}, {"id": "19730", "type": "Feature", "properties": {}, "geometry": {"type": "MultiPolygon", "coordinates": [[[[3.5283004, 49.3487766], [3.5301105, 49.3502127], [3.5311426, 49.3495548], [3.5300733, 49.3486124], [3.5309598, 49.3481271], [3.5302914, 49.347535], [3.5283004, 49.3487766]]]]}}, {"id": "39060", "type": "Feature", "properties": {}, "geometry": {"type": "MultiPolygon", "coordinates": [[[[3.9390276, 48.8836952], [3.942775, 48.8836646], [3.9428215, 48.8861361], [3.9390739, 48.8861667], [3.9390276, 48.8836952]]]]}}, {"id": "25299", "type": "Feature", "properties": {}, "geometry": {"type": "MultiPolygon", "coordinates": [[[[3.1957332, 49.4270314], [3.1983678, 49.4270269], [3.1983747, 49.4287454], [3.19574, 49.4287499], [3.1957332, 49.4270314]]]]}}, {"id": "9895", "type": "Feature", "properties": {}, "geometry": {"type": "MultiPolygon", "coordinates": [[[[4.1383536, 48.994904], [4.1386364, 48.9950643], [4.1387288, 48.9951309], [4.1382556, 49.0015165], [4.137872, 49.0013931], [4.1383536, 48.994904]]]]}}, {"id": "12887", "type": "Feature", "properties": {}, "geometry": {"type": "MultiPolygon", "coordinates": [[[[3.2712551, 48.7923239], [3.2732282, 48.7923192], [3.2732353, 48.7936228], [3.2712622, 48.7936275], [3.2712551, 48.7923239]]]]}}, {"id": "2181", "type": "Feature", "properties": {}, "geometry": {"type": "MultiPolygon", "coordinates": [[[[3.4994793, 49.2672776], [3.4994429, 49.2673882], [3.4994316, 49.2674336], [3.4994319, 49.2676374], [3.499397, 49.2677905], [3.499347, 49.2679001], [3.4992731, 49.2680398], [3.4992298, 49.2681201], [3.4991828, 49.2682076], [3.4990106, 49.2686074], [3.4988125, 49.269006], [3.4986983, 49.2692277], [3.4986893, 49.2692435], [3.4919384, 49.2697473], [3.4921422, 49.2671282], [3.4925477, 49.2670423], [3.4927179, 49.2670063], [3.4927331, 49.2670031], [3.4927557, 49.2669979], [3.493012, 49.2669385], [3.4931549, 49.2669054], [3.4935751, 49.2668463], [3.4936922, 49.2668361], [3.4938518, 49.2668155], [3.4940877, 49.2668016], [3.4943345, 49.26679], [3.4945949, 49.2667607], [3.4947726, 49.2667353], [3.4950261, 49.2666799], [3.4954326, 49.266595], [3.4960862, 49.266534], [3.4967454, 49.2664888], [3.4971739, 49.2664656], [3.4974847, 49.2664391], [3.4981077, 49.2664115], [3.4981265, 49.2666588], [3.4979321, 49.2666656], [3.497855, 49.2666701], [3.497381, 49.2666978], [3.4970061, 49.2667779], [3.4961605, 49.2669308], [3.4953803, 49.2670733], [3.4953603, 49.2670845], [3.4953268, 49.2671021], [3.495329, 49.2671377], [3.4953741, 49.2671665], [3.4955126, 49.2672023], [3.4955657, 49.2672129], [3.4958969, 49.2672543], [3.4964736, 49.2672634], [3.4972194, 49.2671852], [3.497779, 49.267171], [3.498059, 49.2671764], [3.4983952, 49.2671964], [3.4984092, 49.2671975], [3.499224, 49.2672605], [3.4994793, 49.2672776]]]]}}, {"id": "10029", "type": "Feature", "properties": {}, "geometry": {"type": "MultiPolygon", "coordinates": [[[[4.1711743, 49.2050195], [4.1693415, 49.2054114], [4.1720111, 49.2096151], [4.1738647, 49.2092459], [4.1711743, 49.2050195]]]]}}]}'
with StringIO(fields_geojson_str) as fobj:
    fields_gdf = gpd.read_file(fobj)
fields_gdf.head()

	id	geometry
0	29492	MULTIPOLYGON (((3.35129 49.50711, 3.35257 49.5...
1	29424	MULTIPOLYGON (((3.51188 49.19517, 3.51972 49.1...
2	26890	MULTIPOLYGON (((3.64481 49.59559, 3.64478 49.5...
3	16041	MULTIPOLYGON (((3.34988 49.23882, 3.34996 49.2...
4	3211	MULTIPOLYGON (((3.80131 49.07712, 3.80135 49.0...

fields_gdf = fields_gdf.to_crs("32631") # use the UTM zone that covers the area of interest

# Visualize field boundaries on the first scene
fig, ax = plt.subplots(1, 1, figsize=(12, 12))

# Plot the first scene as background
scenes[0].plot.imshow(ax=ax, cmap='Blues', vmin=0, vmax=100, alpha=0.6)
   
fields_gdf.boundary.plot(ax=ax, color='red', linewidth=2, label='Field Boundaries')
    
# Add field IDs as labels
for idx, row in fields_gdf.iterrows():
    centroid = row.geometry.centroid

ax.set_title(f'Field Boundaries on Cloud Probability Map\n{times[0].strftime("%Y-%m-%d")}', 
             fontsize=14, fontweight='bold')
ax.legend(loc='upper right')
plt.tight_layout()
plt.show()

print(f"💡 Red boundaries show the {len(fields_gdf)} agricultural fields to be analyzed")

💡 Red boundaries show the 20 agricultural fields to be analyzed

# prepare metadata once
for scene in scenes:
    if scene.rio.crs is None:
        scene.rio.write_crs("EPSG:32631", inplace=True)

# Build a list of ALL lazy tasks across ALL fields and ALL scenes
all_lazy_tasks = []
task_mapping = [] # To keep track of which result belongs to which field

print(f"🏗️ Building task graph for {len(fields_gdf)} fields...")

for field_idx, field_row in fields_gdf.iterrows():
    geom = field_row.geometry
    minx, miny, maxx, maxy = geom.bounds
    for scene_idx, scene in enumerate(scenes):
        # Windowed slicing (LAZY)
        window = scene.sel(x=slice(minx, maxx), y=slice(maxy, miny))
        
        # Masking (LAZY)
        masked = window.rio.clip([geom], fields_gdf.crs, all_touched=True)
        
        # Mean calculation (LAZY)
        lazy_mean = masked.mean(skipna=True)
        
        all_lazy_tasks.append(lazy_mean)
        task_mapping.append(field_idx)

🏗️ Building task graph for 20 fields...

# 3. Compute everything in ONE go
print(f"🚀 Sending {len(all_lazy_tasks)} tasks to Dask cluster. Check your Dashboard!")
all_results = compute(*all_lazy_tasks)

# 4. Process the results back into your dataframe
# (The order is preserved in all_results)
results_df = pd.DataFrame({
    'field_id': task_mapping,
    'value': [float(r) for r in all_results]
})

# Filter out NaNs and aggregate by field
field_stats_df = results_df.dropna().groupby('field_id')['value'].agg(['mean', 'std', 'count']).reset_index()

🚀 Sending 220 tasks to Dask cluster. Check your Dashboard!

field_stats_df

	field_id	mean	std	count
0	0	20.380682	35.810214	11
1	1	16.984926	33.114923	11
2	2	33.890572	42.008622	11
3	3	32.640754	45.870265	11
4	4	42.074084	48.503024	11
5	5	26.176977	41.599340	11
6	6	26.199192	40.545276	11
7	7	26.558496	45.017012	11
8	8	51.431818	50.127833	11
9	9	29.294192	45.520428	11
10	10	10.733636	28.349403	11
11	11	36.363636	50.452498	11
12	12	28.292262	36.440260	11
13	13	24.347107	42.006202	11
14	14	50.662338	49.455747	11
15	15	18.210101	40.438115	11
16	16	40.649948	48.484341	11
17	17	23.952922	41.508642	11
18	18	19.179648	32.532642	11
19	19	23.612702	40.475394	11

# Rank fields by mean cloud cover (best to worst observation conditions)
field_stats_df_sorted = field_stats_df.sort_values('mean')

for rank, (idx, row) in enumerate(field_stats_df_sorted.iterrows(), 1):
    emoji = "🌟" if rank <= 3 else "⚠️" if rank > len(field_stats_df_sorted) - 3 else "  "

print(f"🌟 Best Field (Least Cloudy):  Field {field_stats_df_sorted.iloc[0]['field_id']} "
      f"with {field_stats_df_sorted.iloc[0]['mean']:.2f}% mean cloud cover")
print(f"⚠️  Worst Field (Most Cloudy): Field {field_stats_df_sorted.iloc[-1]['field_id']} "
      f"with {field_stats_df_sorted.iloc[-1]['mean']:.2f}% mean cloud cover")
print(f"📈 Average across all fields: {field_stats_df['mean'].mean():.2f}% "
      f"± {field_stats_df['mean'].std():.2f}%")

🌟 Best Field (Least Cloudy):  Field 10.0 with 10.73% mean cloud cover
⚠️  Worst Field (Most Cloudy): Field 8.0 with 51.43% mean cloud cover
📈 Average across all fields: 29.08% ± 10.84%

# Create a figure with two subplots: one for the bar chart and one for the box plot
fig, axes = plt.subplots(1, 2, figsize=(18, 6), gridspec_kw={'width_ratios': [3, 1]})

# 1. Bar chart of mean cloud cover per field (sorted)
ax1 = axes[0]
colors = ['green' if i < 3 else 'red' if i >= len(field_stats_df_sorted) - 3 else 'steelblue' 
          for i in range(len(field_stats_df_sorted))]
bars = ax1.bar(range(len(field_stats_df_sorted)), 
               field_stats_df_sorted['mean'],
               color=colors, alpha=0.7, edgecolor='black', linewidth=1.5)

ax1.set_xlabel('Field Rank', fontsize=12, fontweight='bold')
ax1.set_ylabel('Mean Cloud Cover (%)', fontsize=12, fontweight='bold')
ax1.set_title('Field Ranking by Cloud Cover\n(Green=Best, Red=Worst)', 
              fontsize=14, fontweight='bold')
ax1.grid(axis='y', linestyle='--', alpha=0.3)
ax1.axhline(y=field_stats_df_sorted['mean'].mean(), 
            color='orange', linestyle='--', linewidth=2, label='Average')
ax1.legend()

# Add field IDs as x-tick labels
ax1.set_xticks(range(len(field_stats_df_sorted)))
ax1.set_xticklabels([f"{int(fid)}" for fid in field_stats_df_sorted['field_id']], 
                     rotation=45, ha='right', fontsize=9)

# 2. Box plot to show distribution
ax2 = axes[1]
ax2.boxplot(field_stats_df_sorted['mean'], vert=True, patch_artist=True,
            boxprops=dict(facecolor='lightgray', color='black'),
            medianprops=dict(color='orange', linewidth=2))
ax2.set_title('Cloud Cover Distribution', fontsize=14, fontweight='bold')
ax2.set_ylabel('Cloud Cover (%)')
ax2.set_xticks([]) # Hide x-axis ticks for the box plot
ax2.grid(axis='y', linestyle='--', alpha=0.3)

plt.tight_layout()
# plt.savefig('field_cloud_rankings_and_distribution.png')

Optional: Quick interactive visualization with panel and hvplot

Beyond our core analysis, we have implemented an integrated dashboard to bridge the gap between complex data processing and intuitive visual interpretation. This interface allows for the seamless visualization of processed layers, providing an accessible gateway to the underlying results.

Install the additional libraries for quick interactive visualization

# %pip install panel hvplot holoviews geoviews jupyter_bokeh ipywidgets_bokeh

Install the jupyterlab extension jupyter-bokeh

Then refresh the webpage of the JupyterHub to make the jupyterlab extension to take effect

A Quick Single Band Viewer

Create an interactive single-band viewer with: - Variable selection dropdown - Time selection dropdown - Transparency slider - OpenStreetMap background tiles for spatial context

# Provide an xarray Dataset and some metadata for the interactive visualization
ds4vis = seasonal_means.to_dataset(name='mean_cloud_probability')
# ###### Subset the ds4vis for smaller data and faster demo ######
# Uncomment the two lines below to visualize the entire tile
ds4vis = ds4vis.rio.write_crs(scenes_filtered[0]['spatial_ref'].attrs['crs_wkt'])
ds4vis = ds4vis.rio.clip_box(minx=3.3492257, miny=49.1842377, maxx=3.5488115, maxy=49.3502127, crs="EPSG:4326")
# ###### End of subset the ds4vis for smaller data and faster demo ######
time_dim_name = 'season'
ds_crs = 'EPSG:32631'

bg_tiles = True # display the openstreetmap background tiles in the plot for better spatial context. This can be set to False if you prefer a plain background.
# Interactive plot with dropdowns to select data variable and time
# available time values
available_times = ds4vis.coords[time_dim_name].values
time_strings_quick = [str(t) for t in available_times]
# available variable names
available_vars = sorted(ds4vis.data_vars)

# Create widgets for variable and time selection
# Create a dropdown for variable selection (e.g., cloud probability, SCL, or any band)
var_select = pn.widgets.Select(
    name="Variable", 
    options=available_vars, # available variables to select in the dropdown
    value=available_vars[0] # default to the first available var
)

# Create a dropdown for time selection
time_select_quick = pn.widgets.Select(
    name="Time",
    options=dict(zip(sorted(time_strings_quick), range(len(time_strings_quick)))),
    value=0, # default to the first available time in the dataset
)

# Create a slider for transparency control of the raster layers
alpha_slider = pn.widgets.FloatSlider(name="Raster Transparency", start=0.0, end=1.0, value=0.8, step=0.05)



def plot_single_band(var_name, time_val, alpha_val):
    """Generate an interactive hvplot image for a selected variable and time step of the
    xarray Dataset that we just created `ds4vis`, with the overlay of the field
    boundaries `fields_gdf`.
    
    Creates a georeferenced raster visualization with optional OpenStreetMap 
    background tiles for spatial context. The plot is interactive and supports
    panning, zooming, and hover inspection.
    
    Args:
        var_name: Name of the data variable to plot (e.g., 'mean_cloud_probability').
        time_val: Integer index into the time/season dimension to select which
            temporal slice to display.
        alpha_val: Transparency of the raster layer (0.0 = fully transparent, 
            1.0 = fully opaque).
    
    Returns:
        holoviews.core.overlay.Overlay: A HoloViews Overlay combining the raster
            image with field boundary polygons, displayable in a Jupyter notebook
            or Panel dashboard.
    
    Note:
        This function relies on module-level variables: `ds4vis` (the xarray 
        Dataset), `time_dim_name` (name of the time dimension), `ds_crs` 
        (coordinate reference system), and `bg_tiles` (whether to show map tiles).
    """
    # Find the nearest time in the dataset
    nearest_time = ds4vis.coords[time_dim_name].values[time_val]
    image_pobj = ds4vis[var_name].sel({time_dim_name:nearest_time}).hvplot.image(
        x="x", y="y",
        frame_width=500, frame_height=500,
        data_aspect=1,
        title=f"Variable {var_name} @ {str(nearest_time)[:19]}",
        cmap="viridis",
        alpha=alpha_val,
        crs=ds_crs,
        tiles=bg_tiles
    )
    polygons_pobj = fields_gdf.hvplot.polygons(
        line_color='red', line_width=2, fill_alpha=0, tools=['hover'], title="Field Boundaries", crs=fields_gdf.crs
    )
    return image_pobj * polygons_pobj


# Use value_throttled for sliders to only update on mouse release (prevents sync errors)
single_band_plot = pn.bind(plot_single_band, var_select, time_select_quick, alpha_slider.param.value_throttled)

dashboard = pn.Column(
    pn.Row(time_select_quick, var_select, alpha_slider),
    # pn.Row(single_band_plot),
    pn.Row(pn.panel(single_band_plot, loading_indicator=True))
)

# Note that when you change the widget value, it may take a while to load the data and update the plots
dashboard

💪 Now it is your turn

Task 1: Explore Results for a different Sentinel-2 tile

Access the Sentinel-2 tile grid and select a different tile-id.
Change the tile-id here
Provide a way to compare the results between different tiles, e.g. use a way to describe their spatial variance, consistency or identify extremes.

Task 2: Time-series field level vs. tile level

Use one tile of your choice above and define the boundaries of some agricultural fields (~10 - 20) across the tile by changing the geojson string here.
Derive a vegetation index per acquisition date.
Compare the time-series between the different fields.
What is the impact of availability of scenes across the season for agricultural applications? Can you identify patterns?

Task 3: Putting Cloud Cover in Perspective — Sensor Limitations and Alternatives

Sentinel-2 is a passive optical sensor operating in a sun-synchronous orbit with a fixed local overpass time (~10:30 AM solar time). This means every scene captures the same time of day, and any cloud present at that moment will always obscure the observation — there is no second chance within that day.

Look up the exact overpass time for your tile from the Sentinel-2 acquisition schedule. How does a fixed morning overpass affect the representativeness of the cloud statistics you computed?
Could diurnal cloud cycles (e.g. afternoon convective clouds in summer) systematically bias the cloud probability map? Discuss for your chosen region.
Active sensors such as Sentinel-1 SAR operate independently of cloud cover and solar illumination. When would you prefer SAR over optical data for agricultural monitoring, and what are the trade-offs (e.g. interpretation complexity, lack of spectral bands)?

Conclusion

What You’ve Accomplished

In this notebook, you’ve mastered a workflow for time-series cloud probability analysis using EOPF Sentinel-2 data. Here’s what you’ve achieved:

🎯 Key Results

Pixel-Level Cloud Intelligence: Rather than relying on metadata-level cloud coverage, you now have granular, pixel-based insights into cloud probability patterns across your study area. This enables you to distinguish between partially cloudy scenes that still contain usable data and truly unusable observations.
Temporal Patterns Revealed: By stacking and analyzing multiple scenes across years, you discovered seasonal cloud dynamics—understanding when and where persistent cloud cover occurs. This knowledge is invaluable for scheduling operational monitoring campaigns in regions with predictable cloud seasonality.
Field-Level Decision Making: You transformed satellite imagery into actionable agricultural intelligence by ranking individual fields by their observation quality. Fields with consistently clear skies can be prioritized for analysis, while fields in persistent cloud shadows can be flagged for special attention or alternative data sources.
Scalable Data Processing: You leveraged Dask for parallel computation, demonstrating how to work efficiently with large remote-sensing datasets without overwhelming local resources.

💡 The Shift in Perspective

This notebook exemplifies a critical paradigm shift in Earth observation:

Before: “This tile has 60% cloud cover — discard it”
After: “This tile has 60% cloud cover, but fields A and B remain clear — proceed with analysis”

By integrating cloud probability analysis into your pipeline, you’ve transformed cloud cover from a binary rejection criterion into a nuanced quality metric that preserves usable data.

🚀 Real-World Applications

Your new skillset applies directly to:

Agricultural Monitoring: Identify the optimal observation windows for crop health assessments
Change Detection: Mask out permanent clouds before deriving temporal statistics
Data Ranking: Prioritize scenes for manual inspection or downstream processing

🎓 Learning Checkpoint

You’ve demonstrated competency in:

✅ STAC API queries for large-scale remote sensing data
✅ Parallel processing with Dask for geospatial data
✅ Temporal aggregation and seasonal analysis
✅ Spatial subsetting and masking with vector boundaries
✅ Statistical ranking and quality assessment
✅ Interactive visualization for stakeholder communication

What’s next?

In the following notebook we will explore the application of neural networks in Leaf Area Index (LAI) computing derived from Sentinel-2 Level-2A satellite data.

This notebook was developed as a contribution to the EOPF Zarr Community Notebook Competition.