Welcome to roocs-utils’s documentation!

Quick Guide

roocs-utils


A package containing common components for the roocs project.

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

Installation

Stable release

To install roocs-utils, run this command in your terminal:

$ pip install roocs-utils

This is the preferred method to install roocs-utils, as it will always install the most recent stable release.

If you don’t have pip installed, this Python installation guide can guide you through the process.

Install from GitHub

roocs-utils can be downloaded from the GitHub repo.

$ git clone https://github.com/roocs/roocs-utils.git
$ cd roocs-utils

Create Conda environment named roocs_utils:

$ conda env create -f environment.yml
$ conda activate roocs_utils

Install roocs-utils in development mode:

$ pip install -r requirements.txt
$ pip install -r requirements_dev.txt
$ pip install -e .

Run tests:

$ python -m pytest tests/

Usage

To use roocs-utils in a project:

import roocs_utils

For information on the configuration options available in roocs-utils, see: https://roocs-utils.readthedocs.io/en/latest/configuration.html#roocs-utils

API

Parameters

class roocs_utils.parameter.area_parameter.AreaParameter(input)[source]

Bases: roocs_utils.parameter.base_parameter._BaseParameter

Class for area parameter used in subsetting operation.

Area can be input as:
A string of comma separated values: “0.,49.,10.,65”
A sequence of strings: (“0”, “-10”, “120”, “40”)
A sequence of numbers: [0, 49.5, 10, 65]

An area must have 4 values.

Validates the area input and parses the values into numbers.

allowed_input_types = [<class 'collections.abc.Sequence'>, <class 'str'>, <class 'roocs_utils.parameter.param_utils.Series'>, <class 'NoneType'>]
asdict()[source]

Returns a dictionary of the area values
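The accepted input forms can be illustrated with a small standalone sketch. Note that parse_area is a hypothetical helper written for this example and is not the AreaParameter implementation; it only assumes the behaviour described above (four values, parsed into numbers).

```python
from collections.abc import Sequence


def parse_area(value):
    """Hypothetical helper: normalise an area input to a tuple of 4 floats."""
    if isinstance(value, str):
        # "0.,49.,10.,65" -> ["0.", "49.", "10.", "65"]
        parts = value.split(",")
    elif isinstance(value, Sequence):
        parts = list(value)
    else:
        raise TypeError(f"Unsupported area input: {type(value).__name__}")

    if len(parts) != 4:
        raise ValueError("An area must have 4 values.")

    return tuple(float(p) for p in parts)


area_from_string = parse_area("0.,49.,10.,65")
area_from_strings = parse_area(("0", "-10", "120", "40"))
area_from_numbers = parse_area([0, 49.5, 10, 65])
```

All three calls yield a tuple of four floats, which is the kind of normalised value the class is described as producing.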

class roocs_utils.parameter.collection_parameter.CollectionParameter(input)[source]

Bases: roocs_utils.parameter.base_parameter._BaseParameter

Class for collection parameter used in operations.

A collection can be input as:
A string of comma separated values: “cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga,cmip5.output1.MPI-M.MPI-ESM-LR.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga”
A sequence of strings: e.g. (“cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga”, “cmip5.output1.MPI-M.MPI-ESM-LR.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga”)
A sequence of roocs_utils.utils.file_utils.FileMapper objects

Validates the input and parses the items.

allowed_input_types = [<class 'collections.abc.Sequence'>, <class 'str'>, <class 'roocs_utils.parameter.param_utils.Series'>, <class 'roocs_utils.utils.file_utils.FileMapper'>]
class roocs_utils.parameter.level_parameter.LevelParameter(input)[source]

Bases: roocs_utils.parameter.base_parameter._BaseIntervalOrSeriesParameter

Class for level parameter used in subsetting operation.

Level can be input as:
A string of slash separated values: “1000/2000”
A sequence of strings: e.g. (“1000.50”, “2000.60”)
A sequence of numbers: e.g. (1000.50, 2000.60)

A level input must be 2 values.

If using a string input, a leading or trailing slash indicates that you want to use the lowest/highest level of the dataset, e.g. “/2000” will subset from the lowest level in the dataset to 2000.

Validates the level input and parses the values into numbers.

asdict()[source]

Returns a dictionary of the level values
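The open-ended string form can be sketched as follows. parse_level_range is a hypothetical helper for illustration only, and representing an open end as None is an assumption made here, not a statement about how LevelParameter stores it.

```python
def parse_level_range(value):
    """Hypothetical helper: return (first, last), with None for an open end."""
    if isinstance(value, str):
        parts = value.split("/")
    else:
        parts = list(value)

    if len(parts) != 2:
        raise ValueError("A level input must be 2 values.")

    # An empty string on either side of the slash means "use the lowest or
    # highest level of the dataset", represented here as None.
    return tuple(float(p) if str(p).strip() else None for p in parts)


closed_range = parse_level_range("1000/2000")
open_start = parse_level_range("/2000")
from_strings = parse_level_range(("1000.50", "2000.60"))
```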

class roocs_utils.parameter.time_parameter.TimeParameter(input)[source]

Bases: roocs_utils.parameter.base_parameter._BaseIntervalOrSeriesParameter

Class for time parameter used in subsetting operation.

Time can be input as:
A string of slash separated values: “2085-01-01T12:00:00Z/2120-12-30T12:00:00Z”
A sequence of strings: e.g. (“2085-01-01T12:00:00Z”, “2120-12-30T12:00:00Z”)

A time input must be 2 values.

If using a string input, a leading or trailing slash indicates that you want to use the earliest/latest time of the dataset, e.g. “2085-01-01T12:00:00Z/” will subset from 01/01/2085 to the final time in the dataset.

Validates the times input and parses the values into isoformat.

asdict()[source]

Returns a dictionary of the time values

get_bounds()[source]

Returns a tuple of the (start, end) times, calculated from the value of the parameter. Either will default to None.
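As a rough illustration of the time input forms, the sketch below splits a time range and normalises each end to an ISO 8601 string. parse_time_range is a hypothetical helper; it uses the standard library datetime, which assumes a standard calendar and so would not cover every date found in climate datasets (for example 360-day calendars).

```python
from datetime import datetime


def parse_time_range(value):
    """Hypothetical helper: return (start, end) ISO strings, None if open."""
    start, end = value.split("/") if isinstance(value, str) else value

    def to_iso(text):
        if not text:
            return None  # open end: earliest or latest time of the dataset
        # fromisoformat() on older Python versions does not accept "Z".
        return datetime.fromisoformat(text.replace("Z", "+00:00")).isoformat()

    return to_iso(start), to_iso(end)


full_range = parse_time_range("2085-01-01T12:00:00Z/2120-12-30T12:00:00Z")
open_end = parse_time_range("2085-01-01T12:00:00Z/")
```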

class roocs_utils.parameter.time_components_parameter.TimeComponentsParameter(input)[source]

Bases: roocs_utils.parameter.base_parameter._BaseParameter

Class for time components parameter used in subsetting operation.

The Time Components are any, or none of:
  • year: [list of years]

  • month: [list of months]

  • day: [list of days]

  • hour: [list of hours]

  • minute: [list of minutes]

  • second: [list of seconds]

month is special: you can use either strings or values:

“feb”, “mar” == 2, 3 == “02,03”

Validates the times input and parses them into a dictionary.

allowed_input_types = [<class 'dict'>, <class 'str'>, <class 'roocs_utils.parameter.param_utils.TimeComponents'>, <class 'NoneType'>]
asdict()[source]
get_bounds()[source]

Returns a tuple of the (start, end) times, calculated from the value of the parameter. Either will default to None.
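The equivalence between month names and numbers can be sketched as below. month_to_int is a hypothetical helper and matching on the first three letters is an assumption chosen for this example; the library may resolve month names differently.

```python
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def month_to_int(month):
    """Hypothetical helper: map "feb", "Feb", "February", "02" or 2 to 2."""
    if isinstance(month, int):
        number = month
    elif month.strip().isdigit():
        number = int(month)
    else:
        # Compare on the first three letters so that full names also match.
        number = MONTHS.index(month.strip().lower()[:3]) + 1

    if not 1 <= number <= 12:
        raise ValueError(f"Invalid month: {month!r}")
    return number


months = [month_to_int(m) for m in ("feb", "mar", "02", "03", 2, 3)]
```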

class roocs_utils.parameter.param_utils.Interval(*data)[source]

Bases: object

A simple class for handling an interval of any type. It holds a start and end but does not try to resolve the range, it is just a container to be used by other tools. The contents can be of any type, such as datetimes, strings etc.

class roocs_utils.parameter.param_utils.Series(*data)[source]

Bases: object

A simple class for handling a series selection, created by any sequence as input. It has a value that holds the sequence as a list.

class roocs_utils.parameter.param_utils.TimeComponents(year=None, month=None, day=None, hour=None, minute=None, second=None)[source]

Bases: object

A simple class for parsing and representing a set of time components. The components are stored in a dictionary of {time_comp: values}, such as:

{“year”: [2000, 2001], “month”: [1, 2, 3]}

Note that you can provide months as strings or numbers, e.g.:

“feb”, “Feb”, “February”, 2

roocs_utils.parameter.param_utils.area

alias of roocs_utils.parameter.param_utils.Series

roocs_utils.parameter.param_utils.collection

alias of roocs_utils.parameter.param_utils.Series

roocs_utils.parameter.param_utils.dimensions

alias of roocs_utils.parameter.param_utils.Series

roocs_utils.parameter.param_utils.interval

alias of roocs_utils.parameter.param_utils.Interval

roocs_utils.parameter.param_utils.level_interval

alias of roocs_utils.parameter.param_utils.Interval

roocs_utils.parameter.param_utils.level_series

alias of roocs_utils.parameter.param_utils.Series

roocs_utils.parameter.param_utils.parse_datetime(dt, defaults=None)[source]

Parses string to datetime and returns isoformat string for it. If defaults is set, use that in case dt is None.

roocs_utils.parameter.param_utils.parse_range(x, caller)[source]
roocs_utils.parameter.param_utils.parse_sequence(x, caller)[source]
roocs_utils.parameter.param_utils.series

alias of roocs_utils.parameter.param_utils.Series

roocs_utils.parameter.param_utils.string_to_dict(s, splitters=('|', ':', ','))[source]

Convert a string to a dictionary of dictionaries, based on splitting rules: splitters.
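One plausible reading of the default splitters is sketched below: "|" separates entries, ":" separates a key from its values and "," separates the values. split_to_dict is an illustrative re-implementation under that assumption, and the input string is made up for the example; check the library source for the exact behaviour and return structure.

```python
def split_to_dict(s, splitters=("|", ":", ",")):
    """Illustrative sketch; the splitting order is an assumption."""
    result = {}
    for item in s.strip().split(splitters[0]):
        key, values = item.split(splitters[1])
        result[key] = values.split(splitters[2])
    return result


components = split_to_dict("year:2000,2001|month:1,2,3")
```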

roocs_utils.parameter.param_utils.time_components

alias of roocs_utils.parameter.param_utils.TimeComponents

roocs_utils.parameter.param_utils.time_interval

alias of roocs_utils.parameter.param_utils.Interval

roocs_utils.parameter.param_utils.time_series

alias of roocs_utils.parameter.param_utils.Series

roocs_utils.parameter.param_utils.to_float(i, allow_none=True)[source]
roocs_utils.parameter.parameterise.parameterise(collection=None, area=None, level=None, time=None, time_components=None)[source]

Parameterises inputs to instances of parameter classes which allows them to be used throughout roocs. For supported formats for each input please see their individual classes.

Parameters
  • collection – Collection input in any supported format.

  • area – Area input in any supported format.

  • level – Level input in any supported format.

  • time – Time input in any supported format.

  • time_components – Time Components input in any supported format.

Returns

Parameters as instances of their respective classes.

Project Utils

class roocs_utils.project_utils.DatasetMapper(dset, project=None, force=False)[source]

Bases: object

Class to map to data path, dataset ID and files from any dataset input.

dset must be a string and can be input as:
A dataset ID: e.g. “cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga”
A file path: e.g. “/badc/cmip5/data/cmip5/output1/MOHC/HadGEM2-ES/rcp85/mon/atmos/Amon/r1i1p1/latest/tas/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_200512-203011.nc”
A path to a group of files: e.g. “/badc/cmip5/data/cmip5/output1/MOHC/HadGEM2-ES/rcp85/mon/atmos/Amon/r1i1p1/latest/tas/*.nc”
A directory e.g. “/badc/cmip5/data/cmip5/output1/MOHC/HadGEM2-ES/rcp85/mon/atmos/Amon/r1i1p1/latest/tas”
An instance of the FileMapper class (that represents a set of files within a single directory)

When force=True, if the project cannot be identified, any attempt to use the base_dir of a project to resolve the data path will be ignored. Any of data_path, ds_id and files that can be set will be set.

SUPPORTED_EXTENSIONS = ('.nc', '.gz')
property base_dir

The base directory of the input dataset.

property data_path

Dataset input converted to a data path.

property ds_id

Dataset input converted to a ds id.

property files

The files found from the input dataset.

property project

The project of the dataset input.

property raw

Raw dataset input.

roocs_utils.project_utils.datapath_to_dsid(datapath)[source]

Switches from dataset path to ds id.

Parameters

datapath – dataset path.

Returns

dataset id of input dataset path.

roocs_utils.project_utils.derive_ds_id(dset)[source]

Derives the dataset id of the provided dset.

Parameters

dset – dset input of type described by DatasetMapper.

Returns

ds id of input dataset.

roocs_utils.project_utils.derive_dset(dset)[source]

Derives the dataset path of the provided dset.

Parameters

dset – dset input of type described by DatasetMapper.

Returns

dataset path of input dataset.

roocs_utils.project_utils.dset_to_filepaths(dset, force=False)[source]

Gets filepaths deduced from input dset.

Parameters
  • dset – dset input of type described by DatasetMapper.

  • force – When True and if the project of the input dset cannot be identified, DatasetMapper will attempt to find the files anyway. Default is False.

Returns

File paths deduced from input dataset.

roocs_utils.project_utils.dsid_to_datapath(dsid)[source]

Switches from ds id to dataset path.

Parameters

dsid – dataset id.

Returns

dataset path of input dataset id.
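The relationship between a dataset ID and a data path can be sketched as below, using the cmip5 base directory from the configuration examples. BASE_DIRS, dsid_to_path and path_to_dsid are hypothetical names for this illustration; the real functions read the base directory from the project configuration and may handle more cases.

```python
import posixpath

# Assumed mapping of project name to base directory (cmip5 example only).
BASE_DIRS = {"cmip5": "/badc/cmip5/data/cmip5"}


def dsid_to_path(ds_id):
    """Replace the project prefix with its base_dir and dots with slashes."""
    project, *facets = ds_id.split(".")
    return posixpath.join(BASE_DIRS[project], *facets)


def path_to_dsid(path):
    """Reverse mapping: strip the base_dir and join the parts with dots."""
    for project, base_dir in BASE_DIRS.items():
        if path.startswith(base_dir + "/"):
            relative = path[len(base_dir) + 1:]
            return ".".join([project] + relative.strip("/").split("/"))
    raise ValueError(f"No known project for path: {path}")


ds_id = "cmip5.output1.MOHC.HadGEM2-ES.rcp85.mon.atmos.Amon.r1i1p1.latest.tas"
data_path = dsid_to_path(ds_id)
round_trip = path_to_dsid(data_path)
```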

roocs_utils.project_utils.get_data_node_dirs_dict()[source]

Get a dictionary of the data node roots used for retrieving original files.

roocs_utils.project_utils.get_facet(facet_name, facets, project)[source]

Get facet from project config

roocs_utils.project_utils.get_project_base_dir(project)[source]

Get the base directory of a project from the config.

roocs_utils.project_utils.get_project_from_data_node_root(url)[source]

Identify the project from data node root by identifying the data node root in the input url.

roocs_utils.project_utils.get_project_from_ds(ds)[source]

Gets the project from an xarray Dataset/DataArray.

Parameters

ds – xarray Dataset/DataArray.

Returns

The project derived from the input dataset.

roocs_utils.project_utils.get_project_name(dset)[source]

Gets the project from an input dset.

Parameters

dset – dset input of type described by DatasetMapper.

Returns

The project derived from the input dataset.

roocs_utils.project_utils.get_projects()[source]

Gets all the projects available in the config.

roocs_utils.project_utils.map_facet(facet, project)[source]

Return mapped facet value from config or facet name if not found.

roocs_utils.project_utils.switch_dset(dset)[source]

Switches between dataset path and ds id.

Parameters

dset – either dataset path or dataset ID.

Returns

either dataset path or dataset ID - switched from the input.

roocs_utils.project_utils.url_to_file_path(url)[source]

Convert input url of an original file to a file path

Xarray Utils

Other utilities

roocs_utils.utils.common.parse_size(size)[source]

Parse size string into number of bytes.

Parameters

size – (str) size to parse in any unit

Returns

(int) number of bytes
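A minimal sketch of this kind of conversion is shown below. size_to_bytes and the UNITS table are assumptions made for the example (binary units as powers of 1024, decimal units as powers of 1000); the units accepted by parse_size itself may differ.

```python
import re

# Assumed unit table: decimal (KB, MB, ...) and binary (KiB, MiB, ...) units.
UNITS = {
    "B": 1,
    "KB": 1000, "MB": 1000**2, "GB": 1000**3, "TB": 1000**4,
    "KIB": 1024, "MIB": 1024**2, "GIB": 1024**3, "TIB": 1024**4,
}


def size_to_bytes(size):
    """Hypothetical helper: convert a string such as "50MiB" to bytes."""
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([A-Za-z]+)\s*", size)
    if match is None:
        raise ValueError(f"Cannot parse size: {size!r}")
    number, unit = match.groups()
    return float(number) * UNITS[unit.upper()]


fifty_mib = size_to_bytes("50MiB")
```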

class roocs_utils.utils.time_utils.AnyCalendarDateTime(year, month, day, hour, minute, second)[source]

Bases: object

A class to represent a datetime that could be of any calendar.

Has the ability to add and subtract a day from the input based on MAX_DAY, MIN_DAY, MAX_MONTH and MIN_MONTH

DAY_RANGE = range(1, 32)
HOUR_RANGE = range(0, 24)
MINUTE_RANGE = range(0, 60)
MONTH_RANGE = range(1, 13)
SECOND_RANGE = range(0, 60)
add_day()[source]

Add a day to the input datetime.

sub_day(n=1)[source]

Subtract a day from the input datetime.

validate_input(input, name, range)[source]
property value
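The calendar-agnostic day arithmetic described for this class can be sketched with plain functions. This is an illustration based only on the ranges listed above, not the class implementation: every month is treated as running from day 1 to day 31, so a result such as day 31 of February is possible and must be interpreted by the caller's calendar.

```python
MIN_DAY, MAX_DAY = 1, 31      # from DAY_RANGE = range(1, 32)
MIN_MONTH, MAX_MONTH = 1, 12  # from MONTH_RANGE = range(1, 13)


def add_day(year, month, day):
    """Move forward one day, rolling over the month and year if needed."""
    day += 1
    if day > MAX_DAY:
        day = MIN_DAY
        month += 1
    if month > MAX_MONTH:
        month = MIN_MONTH
        year += 1
    return year, month, day


def sub_day(year, month, day):
    """Move back one day, rolling back the month and year if needed."""
    day -= 1
    if day < MIN_DAY:
        day = MAX_DAY
        month -= 1
    if month < MIN_MONTH:
        month = MAX_MONTH
        year -= 1
    return year, month, day


next_day = add_day(2000, 12, 31)
previous_day = sub_day(2001, 1, 1)
```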
roocs_utils.utils.time_utils.str_to_AnyCalendarDateTime(dt, defaults=None)[source]

Takes a string representing a date/time and returns an AnyCalendarDateTime object. String formats should start with the year and go through to the second, but you can omit anything from the month onwards.

Parameters
  • dt – (str) string representing a date/time.

  • defaults – (list) The default values to use for year, month, day, hour, minute and second if they cannot be parsed from the string. A default value must be provided for each component. If defaults=None, [-1, 1, 1, 0, 0, 0] is used.

Returns

AnyCalendarDateTime object

roocs_utils.utils.time_utils.to_isoformat(tm)[source]

Returns an ISO 8601 string from a time object (of different types).

Parameters

tm – Time object

Returns

(str) ISO 8601 time string

class roocs_utils.utils.file_utils.FileMapper(file_list, dirpath=None)[source]

Bases: object

Class to represent a set of files that exist in the same directory as one object.

Parameters
  • file_list – the list of files to represent. If dirpath is not provided, these should be full file paths.

  • dirpath – The directory path where the files exist. Default is None.

If dirpath is not provided it will be deduced from the file paths provided in file_list.

file_list

list of file names of the files represented.

file_paths

list of full file paths of the files represented.

dirpath

The directory path where the files exist. Either deduced or provided.
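How a directory path might be deduced from a list of full file paths can be sketched as follows. deduce_dirpath and the example paths are hypothetical; the sketch only assumes what is stated above, namely that all files live in one directory.

```python
import posixpath


def deduce_dirpath(file_paths):
    """Hypothetical helper: return the single directory shared by all files."""
    dirpaths = {posixpath.dirname(path) for path in file_paths}
    if len(dirpaths) != 1:
        raise ValueError("All files must be in the same directory.")
    return dirpaths.pop()


# Made-up paths for illustration only.
paths = [
    "/data/example/tas_200512-203011.nc",
    "/data/example/tas_203012-205511.nc",
]
dirpath = deduce_dirpath(paths)
file_list = [posixpath.basename(path) for path in paths]
```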

roocs_utils.utils.file_utils.is_file_list(coll)[source]

Checks whether a collection is a list of files.

Parameters

coll – (list) collection to check.

Returns

True if collection is a list of files, else returns False.

Configuration options

There are many configuration options that can be adjusted to change the behaviour of the roocs stack. The configuration file used can always be found under <package>/etc/roocs.ini, where package is a package in roocs, e.g. roocs-utils.

Any section of the configuration files can be overwritten by creating a new INI file with the desired sections and values and then setting the environment variable ROOCS_CONFIG as the file path to the new INI file. e.g. ROOCS_CONFIG="path/to/config.ini"
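The override mechanism can be pictured with the standard library configparser, as in the sketch below. This is not how roocs-utils loads its configuration; it only illustrates the idea of reading defaults first and then a user file named by ROOCS_CONFIG. Whether the real loader merges individual options or replaces whole sections is not covered here, and the INI contents are shortened examples.

```python
import configparser
import os
import tempfile

DEFAULT_INI = """
[clisops:read]
chunk_memory_limit = 250MiB

[clisops:write]
file_size_limit = 1GB
"""

OVERRIDE_INI = """
[clisops:read]
chunk_memory_limit = 100MiB
"""

with tempfile.TemporaryDirectory() as tmp:
    # Write a user config file and point ROOCS_CONFIG at it.
    override_path = os.path.join(tmp, "config.ini")
    with open(override_path, "w") as handle:
        handle.write(OVERRIDE_INI)
    os.environ["ROOCS_CONFIG"] = override_path

    config = configparser.ConfigParser()
    config.read_string(DEFAULT_INI)          # packaged defaults first
    config.read(os.environ["ROOCS_CONFIG"])  # values read later win

chunk_limit = config["clisops:read"]["chunk_memory_limit"]
file_limit = config["clisops:write"]["file_size_limit"]
```

Here chunk_limit takes the overridden value while file_limit keeps its default.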

The configuration settings used are listed and explained below. Explanations are provided as comments in the code blocks where needed. The settings shown are examples, so they will not necessarily match what is used in each of the packages.

Specifying types

It is possible to specify the type of the entries in the configuration file, for example if you want a value to be a list when the file is parsed.

This is managed through a [config_data_types] section at the top of the INI file which has the following options:

[config_data_types]
# use only in roocs-utils
lists =
dicts =
ints =
floats =
boolean =
# use the below in all other packages
extra_lists =
extra_dicts =
extra_ints =
extra_floats =
extra_booleans =

Simply adding the name of the value you want to format after = will render the correct format, e.g. boolean = use_inventory is_default_for_path will set both use_inventory and is_default_for_path as booleans.
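The effect of the [config_data_types] section can be sketched with configparser. The conversion loop below is an assumption about how such a section could be applied, not the roocs-utils implementation, and listing facet_rule under lists is chosen here purely for illustration.

```python
import configparser

INI_TEXT = """
[config_data_types]
lists = facet_rule
boolean = use_catalog is_default_for_path

[project:cmip5]
facet_rule = activity product institute
use_catalog = False
is_default_for_path = True
"""

parser = configparser.ConfigParser()
parser.read_string(INI_TEXT)

# Option names are whitespace separated, as in the example above.
types = parser["config_data_types"]
list_keys = types.get("lists", "").split()
bool_keys = types.get("boolean", "").split()

section = parser["project:cmip5"]
project = {}
for key, raw in section.items():
    if key in list_keys:
        project[key] = raw.split()
    elif key in bool_keys:
        project[key] = section.getboolean(key)
    else:
        project[key] = raw
```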

roocs-utils

In roocs-utils there are project level settings. The settings under each project heading are the same. e.g. for cmip5 the heading is [project:cmip5]:

[project:cmip5]
project_name = cmip5
# base directory for data file paths
base_dir = /badc/cmip5/data/cmip5
# if a dataset id is identified as coming from this project, should these be the default settings used (as opposed to using the c3s-cmip5 settings by default)
is_default_for_path = True
# template for the output file name - used in ``clisops.utils.file_namers``
file_name_template = {__derive__var_id}_{frequency}_{model_id}_{experiment_id}_r{realization}i{initialization_method}p{physics_version}{__derive__time_range}{extra}.{__derive__extension}
# defaults used in file name template above if the dataset doesn't contain the attribute
attr_defaults =
    model_id:no-model
    frequency:no-freq
    experiment:no-expt
    realization:X
    initialization_method:X
    physics_version:X
# the order of facets in the file paths of datasets for this project
facet_rule = activity product institute model experiment frequency realm mip_table ensemble_member version variable
# what particular facets will be identified as in this project - not currently used
mappings =
    project:project_id
# whether to use an intake catalog or not for this project
use_catalog = False
# where original files can be downloaded
data_node_root = https://data.mips.copernicus-climate.eu/thredds/fileServer/esg_c3s-cmip6/
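To show what facet_rule expresses, the sketch below pairs each facet name with the matching component of a dataset ID. This is an illustration of the setting only; how roocs-utils itself consumes facet_rule is not shown here.

```python
# The facet_rule value from the [project:cmip5] example above.
FACET_RULE = (
    "activity product institute model experiment frequency "
    "realm mip_table ensemble_member version variable"
).split()

ds_id = "cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga"

# Pair each facet name with the component at the same position in the ID.
facets = dict(zip(FACET_RULE, ds_id.split(".")))
```

With this mapping, facets["model"] is "inmcm4" and facets["variable"] is "zostoga".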

There are settings for the environment:

[environment]
# relating to the number of threads to use for processing
OMP_NUM_THREADS=1
MKL_NUM_THREADS=1
OPENBLAS_NUM_THREADS=1
VECLIB_MAXIMUM_THREADS = 1
NUMEXPR_NUM_THREADS = 1

The Elasticsearch settings are specified here:

[elasticsearch]
endpoint = elasticsearch.ceda.ac.uk
port = 443
# names of the elasticsearch indexes used for the various stores
character_store = roocs-char
fix_store = roocs-fix
analysis_store = roocs-analysis
fix_proposal_store = roocs-fix-prop

clisops

These are settings that are specific to clisops:

[clisops:read]
# memory limit for chunks - dask breaks up its underlying array into chunks
chunk_memory_limit = 250MiB

[clisops:write]
# maximum file size of output files. Files are split if this is exceeded
file_size_limit = 1GB
# staging directory to output files to before they are moved to the requested output directory
# if unset, the files are output straight to the requested output directory
output_staging_dir = /gws/smf/j04/cp4cds1/c3s_34e/rook_prod_cache

daops

daops provides settings for using the intake catalog:

[catalog]
# provides the url for the intake catalog with details of datasets
intake_catalog_url = https://raw.githubusercontent.com/cp4cds/c3s_34g_manifests/master/intake/catalogs/c3s.yaml

rook

There are currently no settings in rook but these would be set in the same way as the clisops and daops settings. e.g. with [rook:section] headings.

dachar

These are settings that are specific to dachar:

[dachar:processing]
# LOTUS settings for scanning datasets
queue = short-serial
# large settings for scanning large datasets
wallclock_large = 23:59
memory_large = 32000
# settings for scanning smaller datasets
wallclock_small = 04:00
memory_small = 4000

[dachar:output_paths]
# output paths for scanning datasets and generating fixes
_base_path = ./outputs
base_log_dir = %(_base_path)s/logs
batch_output_path = %(base_log_dir)s/batch-outputs/{grouped_ds_id}
json_output_path = %(_base_path)s/register/{grouped_ds_id}.json
success_path = %(base_log_dir)s/success/{grouped_ds_id}.log
no_files_path = %(base_log_dir)s/failure/no_files/{grouped_ds_id}.log
pre_extract_error_path = %(base_log_dir)s/failure/pre_extract_error/{grouped_ds_id}.log
extract_error_path = %(base_log_dir)s/failure/extract_error/{grouped_ds_id}.log
write_error_path = %(base_log_dir)s/failure/write_error/{grouped_ds_id}.log
fix_path = %(_base_path)s/fixes/{grouped_ds_id}.json


[dachar:checks]
# checks to run when analysing a sample of datasets
# common checks are run on all samples
common = coord_checks.RankCheck coord_checks.MissingCoordCheck
# it is possible to specify checks that will be run on datasets from specific projects
cmip5 =
cmip6 =
cordex = coord_checks.ExampleCheck


[dachar:settings]
# elasticsearch api token that allows write access to indexes
elastic_api_token =
# how many directory levels to join by to create the name of a new directory when outputting results of scans
# see ``dachar.utils.switch_ds.get_grouped_ds_id``
dir_grouping_level = 4
# threshold at which an anomaly in a sample of datasets will be identified for a fix - not currently used
# the lower the threshold (between 0 and 1), the more likely the anomaly is to be fixed
concern_threshold = 0.2
# possible locations for scans and analysis of datasets
locations = ceda dkrz other
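The [dachar:output_paths] values combine two kinds of placeholder: %(name)s references to other options in the same section, and {grouped_ds_id}, which is left for later string formatting. The sketch below shows how the standard library configparser resolves the first kind; it assumes INI interpolation equivalent to configparser's default, and the grouped_ds_id value is made up for the example.

```python
import configparser

INI_TEXT = """
[dachar:output_paths]
_base_path = ./outputs
base_log_dir = %(_base_path)s/logs
success_path = %(base_log_dir)s/success/{grouped_ds_id}.log
"""

parser = configparser.ConfigParser()  # BasicInterpolation is the default
parser.read_string(INI_TEXT)

# %(...)s references are resolved when the value is read.
template = parser["dachar:output_paths"]["success_path"]

# The {grouped_ds_id} placeholder is filled in separately with str.format.
path = template.format(grouped_ds_id="cmip5.output1.INM.inmcm4")
```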

catalog maker

In the catalog maker there are project level settings as well. The settings under each project heading are the same. Settings for the catalog maker are:

[project:c3s-cmip6]
# directory to store catalog and dataset list used in generation of catalog
# if catalog_dir is the same for different projects, the yaml file in this directory will be updated for each project, rather than a new one made
catalog_dir = ./catalog_data
# Where the csv file will be generated
csv_dir = %(catalog_dir)s/%(project_name)s/
# Where the user will provide a dataset list which will be used to generate the catalog
datasets_file = %(catalog_dir)s/%(project_name)s-datasets.txt

Further settings for the intake catalog workflow are:

[log]
# directory for logging outputs from LOTUS when generating catalog entries
log_base_dir = /gws/smf/j04/cp4cds1/c3s_34e/inventory/log

[workflow]
split_level = 4
# max duration for LOTUS jobs, as "hh:mm:ss"
max_duration = 04:00:00
# job queue on LOTUS
job_queue = short-serial
# number of datasets to process in one batch - fewer batches is better as it prevents "Exception: Could not obtain file lock" error
n_per_batch = 750

Examples

import roocs_utils
dir(roocs_utils)
['AreaParameter',
 'CONFIG',
 'CollectionParameter',
 'LevelParameter',
 'TimeParameter',
 '__author__',
 '__builtins__',
 '__cached__',
 '__contact__',
 '__copyright__',
 '__doc__',
 '__file__',
 '__license__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 'area_parameter',
 'base_parameter',
 'collection_parameter',
 'config',
 'exceptions',
 'get_config',
 'level_parameter',
 'parameter',
 'parameterise',
 'roocs_utils',
 'time_parameter',
 'utils',
 'xarray_utils']

Parameters

Parameter classes are used to parse the collection, area, time and level inputs used as arguments in the subsetting operation.

The area values can be input as:
A string of comma separated values: “0.,49.,10.,65”
A sequence of strings: (“0”, “-10”, “120”, “40”)
A sequence of numbers: [0, 49.5, 10, 65]

area = roocs_utils.AreaParameter("0.,49.,10.,65")

# the lat/lon bounds can be returned in a dictionary
print(area.asdict())

# the values can be returned as a tuple
print(area.tuple)
{'lon_bnds': (0.0, 10.0), 'lat_bnds': (49.0, 65.0)}
(0.0, 49.0, 10.0, 65.0)

A collection can be input as:
A string of comma separated values: “cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga,cmip5.output1.MPI-M.MPI-ESM-LR.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga”
A sequence of strings: e.g. (“cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga”, “cmip5.output1.MPI-M.MPI-ESM-LR.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga”)

collection = roocs_utils.CollectionParameter("cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga,cmip5.output1.MPI-M.MPI-ESM-LR.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga")

# the collection ids can be returned as a tuple
print(collection.tuple)
('cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga', 'cmip5.output1.MPI-M.MPI-ESM-LR.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga')

Level can be input as:
A string of slash separated values: “1000/2000”
A sequence of strings: e.g. (“1000.50”, “2000.60”)
A sequence of numbers: e.g. (1000.50, 2000.60)

Level inputs should be a range of the levels you want to subset over.

level = roocs_utils.LevelParameter((1000.50, 2000.60))

# the first and last level in the range provided can be returned in a dictionary
print(level.asdict())

# the values can be returned as a tuple
print(level.tuple)
{'first_level': 1000.5, 'last_level': 2000.6}
(1000.5, 2000.6)

Time can be input as:
A string of slash separated values: “2085-01-01T12:00:00Z/2120-12-30T12:00:00Z”
A sequence of strings: e.g. (“2085-01-01T12:00:00Z”, “2120-12-30T12:00:00Z”)

Time inputs should be the start and end of the time range you want to subset over.

time = roocs_utils.TimeParameter("2085-01-01T12:00:00Z/2120-12-30T12:00:00Z")

# the first and last time in the range provided can be returned in a dictionary
print(time.asdict())

# the values can be returned as a tuple
print(time.tuple)
{'start_time': '2085-01-01T12:00:00+00:00', 'end_time': '2120-12-30T12:00:00+00:00'}
('2085-01-01T12:00:00+00:00', '2120-12-30T12:00:00+00:00')

Parameterise parameterises inputs to instances of parameter classes which allows them to be used throughout roocs.

roocs_utils.parameter.parameterise("cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga", "0.,49.,10.,65", (1000.50, 2000.60), "2085-01-01T12:00:00Z/2120-12-30T12:00:00Z")
{'collection': Datasets to analyse:
 cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga,
 'area': Area to subset over:
  (0.0, 49.0, 10.0, 65.0),
 'level': Level range to subset over
  first_level: 1000.5
  last_level: 2000.6,
 'time': Time period to subset over
  start time: 2085-01-01T12:00:00+00:00
  end time: 2120-12-30T12:00:00+00:00}

Xarray utils

Xarray utils can be used to identify the main variable in a dataset, as well as identifying the type of a coordinate or returning a coordinate based on an attribute or a type.

from roocs_utils.xarray_utils import xarray_utils as xu
import xarray as xr
ds = xr.open_mfdataset("../tests/mini-esgf-data/test_data/badc/cmip5/data/cmip5/output1/MOHC/HadGEM2-ES/rcp85/mon/atmos/Amon/r1i1p1/latest/tas/*.nc", use_cftime=True, combine="by_coords")
# find the main variable of the dataset
main_var = xu.get_main_variable(ds)

print("main var =", main_var)

ds[main_var]
main var = tas
<xarray.DataArray 'tas' (time: 3530, lat: 2, lon: 2)>
dask.array<concatenate, shape=(3530, 2, 2), dtype=float32, chunksize=(300, 2, 2), chunktype=numpy.ndarray>
Coordinates:
    height   float64 1.5
  * lat      (lat) float64 -90.0 35.0
  * lon      (lon) float64 0.0 187.5
  * time     (time) object 2005-12-16 00:00:00 ... 2299-12-16 00:00:00
Attributes:
    standard_name:     air_temperature
    long_name:         Near-Surface Air Temperature
    comment:           near-surface (usually, 2 meter) air temperature.
    units:             K
    original_name:     mo: m01s03i236
    cell_methods:      time: mean
    cell_measures:     area: areacella
    history:           2010-12-04T13:50:30Z altered by CMOR: Treated scalar d...
    associated_files:  baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation...
# to get the coord types

for coord in ds.coords:
    print("\ncoord name =", coord, "\ncoord type =", xu.get_coord_type(ds[coord]))

print("\n There is a level, time, latitude and longitude coordinate in this dataset")

coord name = height
coord type = level

coord name = lat
coord type = latitude

coord name = lon
coord type = longitude

coord name = time
coord type = time

 There is a level, time, latitude and longitude coordinate in this dataset
# to check the type of a coord

print(xu.is_level(ds["height"]))
print(xu.is_latitude(ds["lon"]))
True
None
# to find a coordinate of a specific type

print("time =", xu.get_coord_by_type(ds, "time"))

# to find the level coordinate, set ignore_aux_coords to False

print("\nlevel =", xu.get_coord_by_type(ds, "level", ignore_aux_coords=False))
time = <xarray.DataArray 'time' (time: 3530)>
array([cftime.Datetime360Day(2005, 12, 16, 0, 0, 0, 0),
       cftime.Datetime360Day(2006, 1, 16, 0, 0, 0, 0),
       cftime.Datetime360Day(2006, 2, 16, 0, 0, 0, 0), ...,
       cftime.Datetime360Day(2299, 10, 16, 0, 0, 0, 0),
       cftime.Datetime360Day(2299, 11, 16, 0, 0, 0, 0),
       cftime.Datetime360Day(2299, 12, 16, 0, 0, 0, 0)], dtype=object)
Coordinates:
    height   float64 1.5
  * time     (time) object 2005-12-16 00:00:00 ... 2299-12-16 00:00:00
Attributes:
    bounds:         time_bnds
    axis:           T
    long_name:      time
    standard_name:  time

level = <xarray.DataArray 'height' ()>
array(1.5)
Coordinates:
    height   float64 1.5
Attributes:
    units:          m
    axis:           Z
    positive:       up
    long_name:      height
    standard_name:  height
# to find a coordinate based on an attribute you expect it to have

xu.get_coord_by_attr(ds, "standard_name", "latitude")
<xarray.DataArray 'lat' (lat: 2)>
array([-90.,  35.])
Coordinates:
    height   float64 1.5
  * lat      (lat) float64 -90.0 35.0
Attributes:
    bounds:         lat_bnds
    units:          degrees_north
    axis:           Y
    long_name:      latitude
    standard_name:  latitude

Other utilities

Other utilities allow parsing a memory size of any unit into bytes and converting a time object into an ISO 8601 string.

from roocs_utils.utils.common import parse_size
from roocs_utils.utils.time_utils import to_isoformat
from datetime import datetime
# to parse a size into bytes
size = '50MiB'
size_in_b = parse_size(size)
size_in_b
52428800.0
# to convert a time object into a time string
time = datetime(2005, 7, 14, 12, 30)
time_str = to_isoformat(time)
time_str
'2005-07-14T12:30:00'

Contributing

Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.

You can contribute in many ways:

Types of Contributions

Report Bugs

Report bugs at https://github.com/roocs/roocs-utils/issues.

If you are reporting a bug, please include:

  • Your operating system name and version.

  • Any details about your local setup that might be helpful in troubleshooting.

  • Detailed steps to reproduce the bug.

Fix Bugs

Look through the GitHub issues for bugs. Anything tagged with “bug” and “help wanted” is open to whoever wants to implement it.

Implement Features

Look through the GitHub issues for features. Anything tagged with “enhancement” and “help wanted” is open to whoever wants to implement it.

Write Documentation

roocs-utils could always use more documentation, whether as part of the official roocs-utils docs, in docstrings, or even on the web in blog posts, articles, and such.

Submit Feedback

The best way to send feedback is to file an issue at https://github.com/roocs/roocs-utils/issues.

If you are proposing a feature:

  • Explain in detail how it would work.

  • Keep the scope as narrow as possible, to make it easier to implement.

  • Remember that this is a volunteer-driven project, and that contributions are welcome :)

Get Started!

Ready to contribute? Here’s how to set up roocs-utils for local development.

  1. Fork the roocs-utils repo on GitHub.

  2. Clone your fork locally:

    $ git clone git@github.com:your_name_here/roocs-utils.git

  3. Install your local copy into a virtualenv. Assuming you have virtualenvwrapper installed, this is how you set up your fork for local development:

    $ mkvirtualenv roocs-utils
    $ cd roocs-utils/
    $ python setup.py develop

  4. Create a branch for local development:

    $ git checkout -b name-of-your-bugfix-or-feature

    Now you can make your changes locally.

  5. When you are done making changes, check that your changes pass flake8 and the tests, including testing other Python versions with tox:

    $ flake8 roocs_utils tests
    $ python setup.py test  # or: py.test
    $ tox

    To get flake8 and tox, just pip install them into your virtualenv.

  6. Commit your changes and push your branch to GitHub:

    $ git add .
    $ git commit -m "Your detailed description of your changes."
    $ git push origin name-of-your-bugfix-or-feature

  7. Submit a pull request through the GitHub website.

Pull Request Guidelines

Before you submit a pull request, check that it meets these guidelines:

  1. The pull request should include tests.

  2. If the pull request adds functionality, the docs should be updated. Put your new functionality into a function with a docstring, and add the feature to the list in README.md.

  3. The pull request should work for Python 2.7, 3.4, 3.5 and 3.6, and for PyPy. Check https://travis-ci.com/github/roocs/roocs-utils/pull_requests and make sure that the tests pass for all supported Python versions.

Tips

To run a subset of tests:

$ py.test tests.test_roocs_utils

Deploying

A reminder for the maintainers on how to deploy. Make sure all your changes are committed (including an entry in HISTORY.md). Then run:

$ bumpversion patch # possible: major / minor / patch
$ git push
$ git push --tags

Travis will then deploy to PyPI if tests pass.

Credits

Development Lead

Co-Developers

Version History

v0.6.2 (2022-05-03)

Bug Fixes

  • Fixed get_coords_by_type in xarray_utils to handle non existing coords (#99).

v0.6.1 (2022-04-19)

Bug Fixes

  • Added data_node_root in roocs.ini for C3S-CORDEX and C3S-CMIP5 (#97).

v0.6.0 (2022-04-14)

Bug Fixes

  • Updated default roocs.ini for C3S-CORDEX (#93, #95).

  • Fix added for get_bbox on C3S-CORDEX (#94).

v0.5.0 (2021-10-26)

Bug Fixes

  • When a project was provided to roocs_utils.project_utils.DatasetMapper, getting the base directory would be skipped, causing an error. This has been resolved.

  • roocs_utils.project_utils.DatasetMapper can now accept fixed_path_mappings that include “.gz” (gzip) files. This is allowed because Xarray can read gzipped netCDF files.

Breaking Changes

  • Intake catalog maker removed; it is now in its own package: roocs/catalog-maker

  • Change to input parameter classes:

    • Added: roocs_utils.parameter.time_components_parameter.TimeComponentsParameter

    • Modified input types required for classes:

      • roocs_utils.parameter.time_parameter.TimeParameter

      • roocs_utils.parameter.level_parameter.LevelParameter

    • They both now require their inputs to be one of:

      • roocs_utils.parameter.param_utils.Interval - to specify a range/interval

      • roocs_utils.parameter.param_utils.Series - to specify a series of values

New Features

  • roocs_utils.xarray_utils.xarray_utils now accepts keyword arguments to pass through to xarray’s open_dataset or open_mfdataset. If the argument provided is not an option for open_dataset, then open_mfdataset will be used, even for one file.

  • The roocs.ini config file can now accept fixed_path_modifiers to work together with the fixed_path_mappings section. For example, you can specify parameters in the modifiers that will be expanded into the mappings:

    fixed_path_modifiers =
        variable:cld dtr frs pet pre tmn tmp tmx vap wet
    fixed_path_mappings =
        cru_ts.4.04.{variable}:cru_ts_4.04/data/{variable}/*.nc
        cru_ts.4.05.{variable}:cru_ts_4.05/data/{variable}/cru_ts4.05.1901.2*.{variable}.dat.nc.gz
    

    In this example, the variable parameter will be expanded out to each of the options provided in the list.

  • The roocs_utils.xarray_utils.xarray_utils.open_xr_dataset() function was improved so that the time units of the first data file are preserved in ds.time.encoding["units"]. A multi-file dataset now keeps the time “units” of the first file (if present). This is useful for converting to other formats (e.g. CSV).
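The fixed_path_modifiers expansion described in the list above can be pictured with a short standalone sketch. The helper `expand_mappings` is hypothetical and written only for illustration; it is not the roocs_utils implementation, and it assumes that each `{name}` placeholder is replaced by every value listed for that modifier.

```python
from itertools import product


def expand_mappings(modifiers, mappings):
    """Expand {name} placeholders in mapping keys and paths.

    Hypothetical helper for illustration only; not part of roocs_utils.
    """
    expanded = {}
    for key, path in mappings.items():
        # Only substitute the modifiers that actually appear in this mapping.
        used = [n for n in modifiers if "{" + n + "}" in key or "{" + n + "}" in path]
        # One output entry per combination of modifier values.
        for combo in product(*(modifiers[n] for n in used)):
            values = dict(zip(used, combo))
            expanded[key.format(**values)] = path.format(**values)
    return expanded


# Values taken from the config example above.
modifiers = {"variable": "cld dtr frs pet pre tmn tmp tmx vap wet".split()}
mappings = {"cru_ts.4.04.{variable}": "cru_ts_4.04/data/{variable}/*.nc"}

expanded = expand_mappings(modifiers, mappings)
print(len(expanded))                 # 10
print(expanded["cru_ts.4.04.tmp"])   # cru_ts_4.04/data/tmp/*.nc
```

Under these assumptions, one mapping line with a `{variable}` placeholder stands in for ten explicit mapping lines, one per variable.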

Other Changes

  • Python 3.6 is no longer tested in GitHub Actions.

v0.4.2 (2021-05-18)

Breaking Changes

  • Removed the abcunit-backend and psycopg2 dependencies from requirements.txt. These must now be installed manually in order to use the catalog maker.

v0.4.0 (2021-05-18)

Breaking Changes

  • Inventory maker now removed and replaced by intake catalog maker which writes a csv file with the dataset entries and a yaml description file.

  • In etc/roocs.ini the option use_inventory has been replaced by use_catalog and the inventory maker options have been replaced with equivalent catalog options. However, the option to include file paths or not no longer exists.

  • The catalog maker now uses a database backend and creates a csv file, so there are three new dependencies for the catalog maker: pandas, abcunit-backend and psycopg2.

This means a database backend must be specified and the paths for the pickle files in etc/roocs.ini are no longer necessary. For more information see the README.

Other Changes

  • oyaml removed as a dependency

v0.3.0 (2021-03-30)

New Features

  • Added AnyCalendarDateTime and str_to_AnyCalendarDateTime to utils.time_utils to aid in handling date strings that may not exist in all calendar types.

  • Inventory maker will check latitude and longitude of the dataset it is scanning are within acceptable bounds and raise an exception if they are not.

v0.2.1 (2021-02-19)

Bug Fixes

  • clean up imports … remove pandas dependency.

v0.2.0 (2021-02-18)

Breaking Changes

  • cf_xarray>=0.3.1 now required due to differing level identification of coordinates between versions.

  • oyaml>=0.9 - new dependency for inventory

  • Interface to inventory maker changed. Detailed instructions for use added in README.

  • Adjusted file name template. Underscore removed before __derive__time_range

  • New dev dependency: GitPython==3.1.12

New Features

  • Added use_inventory option to roocs.ini config and allow data to be used without checking an inventory.

  • DatasetMapper class and wrapper functions added to roocs_utils.project_utils and roocs_utils.xarray_utils.xarray_utils to resolve all paths and dataset ids in the same way.

  • FileMapper added in roocs_utils.utils.file_utils to resolve multiple files within the same directory to their directory path.

  • Fixed path mapping support added in DatasetMapper

  • Added DimensionParameter to be used with the average operation.

Other Changes

  • Removed submodule for test data. Test data is now cloned from git using GitPython and cached

  • CollectionParameter accepts an instance of FileMapper or a sequence of FileMapper objects

  • Adjusted file name template to include an extra option before the file extension.

  • Swapped from Travis CI to GitHub Actions

v0.1.5 (2020-11-23)

Breaking Changes

  • Replaced use of cfunits by cf_xarray and cftime (new dependency) in roocs_utils.xarray_utils.

v0.1.4 (2020-10-22)

Fixing pip install

Bug Fixes

  • Importing and using roocs-utils when pip installing now works

v0.1.3 (2020-10-21)

Fixing formatting of doc strings and imports

Breaking Changes

  • Use of roocs_utils.parameter.parameterise.parameterise: the import should now be from roocs_utils.parameter import parameterise, and usage should be, for example:

    parameters = parameterise(collection=ds, time=time, area=area, level=level)

New Features

  • Added a notebook to show examples

Other Changes

  • Updated formatting of doc strings

v0.1.2 (2020-10-15)

Updating the documentation and improving the changelog.

Other Changes

  • Updated doc strings to improve documentation.

  • Updated documentation.

v0.1.1 (2020-10-12)

Fixing mostly existing functionality to work more efficiently with the other packages in roocs.

Breaking Changes

  • environment.yml has been updated to bring it in line with requirements.txt.

  • level coordinates would previously have been identified as None. They are now identified as level.

New Features

  • parameterise function added in roocs_utils.parameter to use in all roocs packages.

  • The ROOCS_CONFIG environment variable can be used to override the default config in etc/roocs.ini. To use a local config file, set ROOCS_CONFIG to the path of that file. Several file paths can be provided, separated by a colon (:).

  • Inventory functionality added - this can be used to create an inventory of datasets. See README for more info.

  • project_utils added, with functions to get the project name of a dataset and the base directory for that project.

  • utils.common and utils.time_utils added.

  • is_level implemented in xarray_utils to identify whether a coordinate is a level or not.
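The ROOCS_CONFIG override listed above can be illustrated with a small standalone sketch. It only shows the general pattern of colon-separated config paths read in order with the standard library `configparser`. The section name, option names and the rule that later files override earlier ones are assumptions made for this example, not a description of how roocs-utils itself loads its config.

```python
import configparser
import os
import tempfile

# Two throwaway config files; section and option names are made up.
tmpdir = tempfile.mkdtemp()
base_path = os.path.join(tmpdir, "base.ini")
local_path = os.path.join(tmpdir, "local.ini")

with open(base_path, "w") as f:
    f.write("[demo]\nbase_dir = /data/demo\nuse_inventory = True\n")
with open(local_path, "w") as f:
    f.write("[demo]\nbase_dir = /home/me/demo\n")

# Several paths are joined with ":" (assumes POSIX-style paths without colons).
os.environ["ROOCS_CONFIG"] = f"{base_path}:{local_path}"

# Assumption: files are read in order, so later files override earlier ones.
paths = os.environ["ROOCS_CONFIG"].split(":")
config = configparser.ConfigParser()
config.read(paths)

print(config["demo"]["base_dir"])       # /home/me/demo
print(config["demo"]["use_inventory"])  # True
```

In this sketch the local file only needs to list the options it changes; anything it omits falls back to the value from the earlier file.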

Bug Fixes

  • xarray_utils.xarray_utils.get_main_variable updated to exclude common coordinates from the search for the main variable. This fixes a bug where coordinates such as lon_bounds would be returned as the main variable.

Other Changes

  • README update to explain inventory functionality.

  • Black and flake8 formatting applied.

  • Fixed import warning with collections.abc.

v0.1.0 (2020-07-30)

  • First release.

Indices and tables