Ingest Datasets#
This guide will walk you through the process of ingesting local datasets into the REF. In REF, to ingest means that we record local datasets in the REF database, letting REF know where they exist and to what format they conform. Ingesting datasets is the first step in the REF workflow.
The REF supports the following dataset formats:
- CMIP6
- Obs4MIPs
Downloading the input data is out of scope for this guide, but we recommend using the intake-esgf to download CMIP6 data. If you have access to a high-performance computing (HPC) system, you may have a local archive of CMIP6 data already available.
What is Ingestion?#
When processing diagnostics, the REF needs to know the location of the datasets and various metadata. Ingestion is the process of extracting metadata from datasets and storing it in a local database. This makes it easier to query and filter datasets for further processing.
The REF extracts metadata for each dataset (and file if a dataset contains multiple files). The collection of metadata, also known as a data catalog, is stored in a local SQLite database. This database is used to query and filter datasets for further processing.
Ingesting Datasets#
To ingest datasets, use the ref datasets ingest command.
This command takes a path to a directory containing datasets as an argument
and the type of the dataset being ingested (only cmip6 is currently supported).
This will walk through the provided directory looking for *.nc files to ingest.
Each file will be loaded and its metadata extracted.
>>> ref --log-level INFO datasets ingest --source-type cmip6 /path/to/cmip6
2024-12-05 12:00:05.979 | INFO | climate_ref.database:__init__:77 - Connecting to database at sqlite:///.climate_ref/db/climate_ref.db
2024-12-05 12:00:05.987 | INFO | alembic.runtime.migration:__init__:215 - Context impl SQLiteImpl.
2024-12-05 12:00:05.987 | INFO | alembic.runtime.migration:__init__:218 - Will assume non-transactional DDL.
2024-12-05 12:00:05.989 | INFO | alembic.runtime.migration:run_migrations:623 - Running upgrade -> ea2aa1134cb3, dataset-rework
2024-12-05 12:00:05.995 | INFO | climate_ref.cli.datasets:ingest:115 - ingesting /path/to/cmip6
2024-12-05 12:00:06.401 | INFO | climate_ref.cli.datasets:ingest:127 - Found 9 files for 5 datasets
activity_id institution_id source_id experiment_id member_id table_id variable_id grid_label version
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
ScenarioMIP CSIRO ACCESS-ESM1-5 ssp126 r1i1p1f1 Amon rlut gn v20210318
ScenarioMIP CSIRO ACCESS-ESM1-5 ssp126 r1i1p1f1 Amon rlut gn v20210318
ScenarioMIP CSIRO ACCESS-ESM1-5 ssp126 r1i1p1f1 Amon rsdt gn v20210318
ScenarioMIP CSIRO ACCESS-ESM1-5 ssp126 r1i1p1f1 Amon rsdt gn v20210318
ScenarioMIP CSIRO ACCESS-ESM1-5 ssp126 r1i1p1f1 Amon rsut gn v20210318
ScenarioMIP CSIRO ACCESS-ESM1-5 ssp126 r1i1p1f1 Amon rsut gn v20210318
ScenarioMIP CSIRO ACCESS-ESM1-5 ssp126 r1i1p1f1 Amon tas gn v20210318
ScenarioMIP CSIRO ACCESS-ESM1-5 ssp126 r1i1p1f1 Amon tas gn v20210318
ScenarioMIP CSIRO ACCESS-ESM1-5 ssp126 r1i1p1f1 fx areacella gn v20210318
2024-12-05 12:00:06.409 | INFO | climate_ref.cli.datasets:ingest:131 - Processing dataset CMIP6.ScenarioMIP.CSIRO.ACCESS-ESM1-5.ssp126.r1i1p1f1.Amon.rlut.gn
2024-12-05 12:00:06.431 | INFO | climate_ref.cli.datasets:ingest:131 - Processing dataset CMIP6.ScenarioMIP.CSIRO.ACCESS-ESM1-5.ssp126.r1i1p1f1.Amon.rsdt.gn
2024-12-05 12:00:06.441 | INFO | climate_ref.cli.datasets:ingest:131 - Processing dataset CMIP6.ScenarioMIP.CSIRO.ACCESS-ESM1-5.ssp126.r1i1p1f1.Amon.rsut.gn
2024-12-05 12:00:06.449 | INFO | climate_ref.cli.datasets:ingest:131 - Processing dataset CMIP6.ScenarioMIP.CSIRO.ACCESS-ESM1-5.ssp126.r1i1p1f1.Amon.tas.gn
2024-12-05 12:00:06.459 | INFO | climate_ref.cli.datasets:ingest:131 - Processing dataset CMIP6.ScenarioMIP.CSIRO.ACCESS-ESM1-5.ssp126.r1i1p1f1.fx.areacella.gn
Querying ingested datasets#
You can query the ingested datasets using the ref datasets list command.
This will display a list of datasets and their associated metadata.
The --column flag allows you to specify which columns to display (defaults to all columns).
See ref datasets list-columns for a list of available columns.
>>> ref datasets list --column instance_id --column variable_id
instance_id variable_id
─────────────────────────────────────────────────────────────────────────────────────
CMIP6.ScenarioMIP.CSIRO.ACCESS-ESM1-5.ssp126.r1i1p1f1.Amon.rlut.gn rlut
CMIP6.ScenarioMIP.CSIRO.ACCESS-ESM1-5.ssp126.r1i1p1f1.Amon.rsdt.gn rsdt
CMIP6.ScenarioMIP.CSIRO.ACCESS-ESM1-5.ssp126.r1i1p1f1.Amon.rsut.gn rsut
CMIP6.ScenarioMIP.CSIRO.ACCESS-ESM1-5.ssp126.r1i1p1f1.Amon.tas.gn tas
CMIP6.ScenarioMIP.CSIRO.ACCESS-ESM1-5.ssp126.r1i1p1f1.fx.areacella.gn areacella