Skip to content

Ingest Datasets#

This guide will walk you through the process of ingesting local datasets into the REF. In REF, to ingest means that we record local datasets in the REF database, letting REF know where they exist and to what format they conform. Ingesting datasets is the first step in the REF workflow.

The REF supports the following dataset formats:

  • CMIP6
  • Obs4MIPs

Downloading the input data is out of scope for this guide, but we recommend using the intake-esgf to download CMIP6 data. If you have access to a high-performance computing (HPC) system, you may have a local archive of CMIP6 data already available.

What is Ingestion?#

When processing diagnostics, the REF needs to know the location of the datasets and various metadata. Ingestion is the process of extracting metadata from datasets and storing it in a local database. This makes it easier to query and filter datasets for further processing.

The REF extracts metadata for each dataset (and file if a dataset contains multiple files). The collection of metadata, also known as a data catalog, is stored in a local SQLite database. This database is used to query and filter datasets for further processing.

Ingesting Datasets#

To ingest datasets, use the ref datasets ingest command. This command takes a path to a directory containing datasets as an argument and the type of the dataset being ingested (only cmip6 is currently supported).

This will walk through the provided directory looking for *.nc files to ingest. Each file will be loaded and its metadata extracted.

>>> ref --log-level INFO datasets ingest --source-type cmip6 /path/to/cmip6
2024-12-05 12:00:05.979 | INFO     | climate_ref.database:__init__:77 - Connecting to database at sqlite:///.climate_ref/db/climate_ref.db
2024-12-05 12:00:05.987 | INFO     | alembic.runtime.migration:__init__:215 - Context impl SQLiteImpl.
2024-12-05 12:00:05.987 | INFO     | alembic.runtime.migration:__init__:218 - Will assume non-transactional DDL.
2024-12-05 12:00:05.989 | INFO     | alembic.runtime.migration:run_migrations:623 - Running upgrade  -> ea2aa1134cb3, dataset-rework
2024-12-05 12:00:05.995 | INFO     | climate_ref.cli.datasets:ingest:115 - ingesting /path/to/cmip6
2024-12-05 12:00:06.401 | INFO     | climate_ref.cli.datasets:ingest:127 - Found 9 files for 5 datasets

  activity_id   institution_id   source_id       experiment_id   member_id   table_id   variable_id   grid_label   version
 ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  ScenarioMIP   CSIRO            ACCESS-ESM1-5   ssp126          r1i1p1f1    Amon       rlut          gn           v20210318
  ScenarioMIP   CSIRO            ACCESS-ESM1-5   ssp126          r1i1p1f1    Amon       rlut          gn           v20210318
  ScenarioMIP   CSIRO            ACCESS-ESM1-5   ssp126          r1i1p1f1    Amon       rsdt          gn           v20210318
  ScenarioMIP   CSIRO            ACCESS-ESM1-5   ssp126          r1i1p1f1    Amon       rsdt          gn           v20210318
  ScenarioMIP   CSIRO            ACCESS-ESM1-5   ssp126          r1i1p1f1    Amon       rsut          gn           v20210318
  ScenarioMIP   CSIRO            ACCESS-ESM1-5   ssp126          r1i1p1f1    Amon       rsut          gn           v20210318
  ScenarioMIP   CSIRO            ACCESS-ESM1-5   ssp126          r1i1p1f1    Amon       tas           gn           v20210318
  ScenarioMIP   CSIRO            ACCESS-ESM1-5   ssp126          r1i1p1f1    Amon       tas           gn           v20210318
  ScenarioMIP   CSIRO            ACCESS-ESM1-5   ssp126          r1i1p1f1    fx         areacella     gn           v20210318

2024-12-05 12:00:06.409 | INFO     | climate_ref.cli.datasets:ingest:131 - Processing dataset CMIP6.ScenarioMIP.CSIRO.ACCESS-ESM1-5.ssp126.r1i1p1f1.Amon.rlut.gn
2024-12-05 12:00:06.431 | INFO     | climate_ref.cli.datasets:ingest:131 - Processing dataset CMIP6.ScenarioMIP.CSIRO.ACCESS-ESM1-5.ssp126.r1i1p1f1.Amon.rsdt.gn
2024-12-05 12:00:06.441 | INFO     | climate_ref.cli.datasets:ingest:131 - Processing dataset CMIP6.ScenarioMIP.CSIRO.ACCESS-ESM1-5.ssp126.r1i1p1f1.Amon.rsut.gn
2024-12-05 12:00:06.449 | INFO     | climate_ref.cli.datasets:ingest:131 - Processing dataset CMIP6.ScenarioMIP.CSIRO.ACCESS-ESM1-5.ssp126.r1i1p1f1.Amon.tas.gn
2024-12-05 12:00:06.459 | INFO     | climate_ref.cli.datasets:ingest:131 - Processing dataset CMIP6.ScenarioMIP.CSIRO.ACCESS-ESM1-5.ssp126.r1i1p1f1.fx.areacella.gn

Querying ingested datasets#

You can query the ingested datasets using the ref datasets list command. This will display a list of datasets and their associated metadata. The --column flag allows you to specify which columns to display (defaults to all columns). See ref datasets list-columns for a list of available columns.

>>> ref datasets list --column instance_id --column variable_id

  instance_id                                                             variable_id
 ─────────────────────────────────────────────────────────────────────────────────────
  CMIP6.ScenarioMIP.CSIRO.ACCESS-ESM1-5.ssp126.r1i1p1f1.Amon.rlut.gn      rlut
  CMIP6.ScenarioMIP.CSIRO.ACCESS-ESM1-5.ssp126.r1i1p1f1.Amon.rsdt.gn      rsdt
  CMIP6.ScenarioMIP.CSIRO.ACCESS-ESM1-5.ssp126.r1i1p1f1.Amon.rsut.gn      rsut
  CMIP6.ScenarioMIP.CSIRO.ACCESS-ESM1-5.ssp126.r1i1p1f1.Amon.tas.gn       tas
  CMIP6.ScenarioMIP.CSIRO.ACCESS-ESM1-5.ssp126.r1i1p1f1.fx.areacella.gn   areacella