Welcome to roocs-utils’s documentation!
Quick Guide
roocs-utils
A package containing common components for the roocs project.
Free software: BSD - see LICENSE file in top-level package directory
Documentation: https://roocs-utils.readthedocs.io.
Credits
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
Cookiecutter: https://github.com/audreyr/cookiecutter
cookiecutter-pypackage: https://github.com/audreyr/cookiecutter-pypackage
Installation
Stable release
To install roocs-utils, run this command in your terminal:
$ pip install roocs-utils
This is the preferred method to install roocs-utils, as it will always install the most recent stable release.
If you don’t have pip installed, the Python installation guide can walk you through the process.
Install from GitHub
roocs-utils can be downloaded from the GitHub repo.
$ git clone https://github.com/roocs/roocs-utils
$ cd roocs-utils
Create Conda environment named roocs_utils:
$ conda env create -f environment.yml
$ conda activate roocs_utils
Install roocs-utils in development mode:
$ pip install -r requirements.txt
$ pip install -r requirements_dev.txt
$ pip install -e .
Run tests:
$ python -m pytest tests/
Usage
To use roocs-utils in a project:
import roocs_utils
For information on the configuration options available in roocs-utils, see: https://roocs-utils.readthedocs.io/en/latest/configuration.html#roocs-utils
API
Parameters
- class roocs_utils.parameter.area_parameter.AreaParameter(input)[source]
Bases:
roocs_utils.parameter.base_parameter._BaseParameter
Class for area parameter used in subsetting operation.
Area can be input as:
* A string of comma separated values: “0.,49.,10.,65”
* A sequence of strings: (“0”, “-10”, “120”, “40”)
* A sequence of numbers: [0, 49.5, 10, 65]
An area must have 4 values.
Validates the area input and parses the values into numbers.
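As a rough illustration of the validation described above, the sketch below accepts the three input forms and returns four floats. parse_area is a hypothetical stand-alone helper written for this example; it is not the code AreaParameter uses internally.

```python
from collections.abc import Sequence


def parse_area(area):
    """Hypothetical helper: normalise an area input to a tuple of 4 floats."""
    if isinstance(area, str):
        # "0.,49.,10.,65" -> ["0.", "49.", "10.", "65"]
        values = area.split(",")
    elif isinstance(area, Sequence):
        values = list(area)
    else:
        raise TypeError(f"Unsupported area input type: {type(area).__name__}")

    if len(values) != 4:
        raise ValueError(f"An area must have 4 values, got {len(values)}")

    # float() accepts both numbers and numeric strings such as "-10"
    return tuple(float(v) for v in values)


print(parse_area("0.,49.,10.,65"))            # (0.0, 49.0, 10.0, 65.0)
print(parse_area(("0", "-10", "120", "40")))  # (0.0, -10.0, 120.0, 40.0)
print(parse_area([0, 49.5, 10, 65]))          # (0.0, 49.5, 10.0, 65.0)
```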
- allowed_input_types = [<class 'collections.abc.Sequence'>, <class 'str'>, <class 'roocs_utils.parameter.param_utils.Series'>, <class 'NoneType'>]
- asdict()[source]
Returns a dictionary of the area values
- class roocs_utils.parameter.collection_parameter.CollectionParameter(input)[source]
Bases:
roocs_utils.parameter.base_parameter._BaseParameter
Class for collection parameter used in operations.
A collection can be input as:
* A string of comma separated values: “cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga,cmip5.output1.MPI-M.MPI-ESM-LR.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga”
* A sequence of strings: e.g. (“cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga”, “cmip5.output1.MPI-M.MPI-ESM-LR.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga”)
* A sequence of roocs_utils.utils.file_utils.FileMapper objects
Validates the input and parses the items.
- allowed_input_types = [<class 'collections.abc.Sequence'>, <class 'str'>, <class 'roocs_utils.parameter.param_utils.Series'>, <class 'roocs_utils.utils.file_utils.FileMapper'>]
- class roocs_utils.parameter.level_parameter.LevelParameter(input)[source]
Bases:
roocs_utils.parameter.base_parameter._BaseIntervalOrSeriesParameter
Class for level parameter used in subsetting operation.
Level can be input as:
* A string of slash separated values: “1000/2000”
* A sequence of strings: e.g. (“1000.50”, “2000.60”)
* A sequence of numbers: e.g. (1000.50, 2000.60)
A level input must be 2 values.
If using a string input, a leading or trailing slash indicates that you want to use the lowest or highest level of the dataset, e.g. “/2000” will subset from the lowest level in the dataset to 2000.
Validates the level input and parses the values into numbers.
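The open-ended form can be pictured with the short sketch below. parse_level_range is a hypothetical helper, and representing an open end as None is an assumption made for this example rather than a statement about the internals of LevelParameter.

```python
def parse_level_range(level):
    """Hypothetical helper: return (first_level, last_level); None marks an open end."""
    if isinstance(level, str):
        # "1000/2000" -> ["1000", "2000"], "/2000" -> ["", "2000"]
        parts = level.split("/")
    else:
        parts = list(level)

    if len(parts) != 2:
        raise ValueError("A level input must be 2 values")

    def to_float(value):
        # An empty side of the slash means "use the limit found in the dataset"
        return None if value in ("", None) else float(value)

    first, last = (to_float(p) for p in parts)
    return first, last


print(parse_level_range("1000/2000"))           # (1000.0, 2000.0)
print(parse_level_range("/2000"))               # (None, 2000.0)
print(parse_level_range(("1000.50", "2000.60")))  # (1000.5, 2000.6)
```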
- asdict()[source]
Returns a dictionary of the level values
- class roocs_utils.parameter.time_parameter.TimeParameter(input)[source]
Bases:
roocs_utils.parameter.base_parameter._BaseIntervalOrSeriesParameter
Class for time parameter used in subsetting operation.
Time can be input as:
* A string of slash separated values: “2085-01-01T12:00:00Z/2120-12-30T12:00:00Z”
* A sequence of strings: e.g. (“2085-01-01T12:00:00Z”, “2120-12-30T12:00:00Z”)
A time input must be 2 values.
If using a string input, a leading or trailing slash indicates that you want to use the earliest or latest time of the dataset, e.g. “2085-01-01T12:00:00Z/” will subset from 01/01/2085 to the final time in the dataset.
Validates the times input and parses the values into isoformat.
- asdict()[source]
Returns a dictionary of the time values
- get_bounds()[source]
Returns a tuple of the (start, end) times, calculated from the value of the parameter. Either will default to None.
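A minimal sketch of the same idea for times, using only the standard library: each side of the slash is normalised to an ISO 8601 string and an empty side becomes None. The helper names to_iso and parse_time_range are hypothetical, and the real TimeParameter may parse a wider range of formats and calendars.

```python
from datetime import datetime


def to_iso(value):
    """Hypothetical helper: normalise an ISO 8601 string, or return None if empty."""
    if not value:
        return None
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11,
    # so replace it with an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00")).isoformat()


def parse_time_range(time_range):
    """Split "start/end" and normalise both sides; an empty side becomes None."""
    start, _, end = time_range.partition("/")
    return to_iso(start), to_iso(end)


print(parse_time_range("2085-01-01T12:00:00Z/2120-12-30T12:00:00Z"))
# ('2085-01-01T12:00:00+00:00', '2120-12-30T12:00:00+00:00')
print(parse_time_range("2085-01-01T12:00:00Z/"))
# ('2085-01-01T12:00:00+00:00', None)
```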
- class roocs_utils.parameter.time_components_parameter.TimeComponentsParameter(input)[source]
Bases:
roocs_utils.parameter.base_parameter._BaseParameter
Class for time components parameter used in subsetting operation.
- The Time Components are any, or none of:
year: [list of years]
month: [list of months]
day: [list of days]
hour: [list of hours]
minute: [list of minutes]
second: [list of seconds]
- month is special: you can use either strings or values:
“feb”, “mar” == 2, 3 == “02,03”
Validates the times input and parses them into a dictionary.
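The month handling described above could look roughly like the following. month_to_int is a hypothetical helper for illustration; matching on the first three letters of the name is an assumption, not necessarily how the package resolves month names.

```python
MONTH_ABBREVIATIONS = (
    "jan", "feb", "mar", "apr", "may", "jun",
    "jul", "aug", "sep", "oct", "nov", "dec",
)
# Map "jan".."dec" to 1..12
MONTHS = {name: number for number, name in enumerate(MONTH_ABBREVIATIONS, start=1)}


def month_to_int(month):
    """Hypothetical helper: accept 2, "02", "feb", "Feb" or "February" and return 2."""
    if isinstance(month, int):
        value = month
    else:
        text = str(month).strip().lower()
        # Numeric strings are converted directly; names are matched on 3 letters
        value = int(text) if text.isdigit() else MONTHS.get(text[:3])
    if value not in range(1, 13):
        raise ValueError(f"Cannot interpret {month!r} as a month")
    return value


print([month_to_int(m) for m in ("feb", "mar")])  # [2, 3]
print([month_to_int(m) for m in "02,03".split(",")])  # [2, 3]
```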
- allowed_input_types = [<class 'dict'>, <class 'str'>, <class 'roocs_utils.parameter.param_utils.TimeComponents'>, <class 'NoneType'>]
- asdict()[source]
- get_bounds()[source]
Returns a tuple of the (start, end) times, calculated from the value of the parameter. Either will default to None.
- class roocs_utils.parameter.dimension_parameter.DimensionParameter(input)[source]
Bases:
roocs_utils.parameter.base_parameter._BaseParameter
Class for dimensions parameter used in averaging operation.
Dimensions can be input as:
* A string of comma separated values: “time,latitude,longitude”
* A sequence of strings: (“time”, “longitude”)
Dimensions can be None or any number of options from time, latitude, longitude and level, provided these exist in the dataset being operated on.
Validates the dims input and parses the values into a sequence of strings.
- allowed_input_types = [<class 'collections.abc.Sequence'>, <class 'str'>, <class 'roocs_utils.parameter.param_utils.Series'>, <class 'NoneType'>]
- asdict()[source]
Returns a dictionary of the dimensions
- class roocs_utils.parameter.param_utils.Interval(*data)[source]
Bases:
object
A simple class for handling an interval of any type. It holds a start and end but does not try to resolve the range, it is just a container to be used by other tools. The contents can be of any type, such as datetimes, strings etc.
- class roocs_utils.parameter.param_utils.Series(*data)[source]
Bases:
object
A simple class for handling a series selection, created by any sequence as input. It has a value that holds the sequence as a list.
- class roocs_utils.parameter.param_utils.TimeComponents(year=None, month=None, day=None, hour=None, minute=None, second=None)[source]
Bases:
object
A simple class for parsing and representing a set of time components. The components are stored in a dictionary of {time_comp: values}, such as:
{“year”: [2000, 2001], “month”: [1, 2, 3]}
- Note that you can provide months as strings or numbers, e.g.:
“feb”, “Feb”, “February”, 2
- roocs_utils.parameter.param_utils.area
alias of
roocs_utils.parameter.param_utils.Series
- roocs_utils.parameter.param_utils.collection
alias of
roocs_utils.parameter.param_utils.Series
- roocs_utils.parameter.param_utils.dimensions
alias of
roocs_utils.parameter.param_utils.Series
- roocs_utils.parameter.param_utils.interval
alias of
roocs_utils.parameter.param_utils.Interval
- roocs_utils.parameter.param_utils.level_interval
alias of
roocs_utils.parameter.param_utils.Interval
- roocs_utils.parameter.param_utils.level_series
alias of
roocs_utils.parameter.param_utils.Series
- roocs_utils.parameter.param_utils.parse_datetime(dt, defaults=None)[source]
Parses string to datetime and returns isoformat string for it. If defaults is set, use that in case dt is None.
- roocs_utils.parameter.param_utils.parse_range(x, caller)[source]
- roocs_utils.parameter.param_utils.parse_sequence(x, caller)[source]
- roocs_utils.parameter.param_utils.series
alias of
roocs_utils.parameter.param_utils.Series
- roocs_utils.parameter.param_utils.string_to_dict(s, splitters=('|', ':', ','))[source]
Convert a string to a dictionary of dictionaries, based on splitting rules: splitters.
- roocs_utils.parameter.param_utils.time_components
alias of
roocs_utils.parameter.param_utils.TimeComponents
- roocs_utils.parameter.param_utils.time_interval
alias of
roocs_utils.parameter.param_utils.Interval
- roocs_utils.parameter.param_utils.time_series
alias of
roocs_utils.parameter.param_utils.Series
- roocs_utils.parameter.param_utils.to_float(i, allow_none=True)[source]
- roocs_utils.parameter.parameterise.parameterise(collection=None, area=None, level=None, time=None, time_components=None)[source]
Parameterises inputs to instances of parameter classes which allows them to be used throughout roocs. For supported formats for each input please see their individual classes.
- Parameters
collection – Collection input in any supported format.
area – Area input in any supported format.
level – Level input in any supported format.
time – Time input in any supported format.
time_components – Time Components input in any supported format.
- Returns
Parameters as instances of their respective classes.
Project Utils
- class roocs_utils.project_utils.DatasetMapper(dset, project=None, force=False)[source]
Bases:
object
Class to map to data path, dataset ID and files from any dataset input.
dset must be a string and can be input as:
* A dataset ID: e.g. “cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga”
* A file path: e.g. “/badc/cmip5/data/cmip5/output1/MOHC/HadGEM2-ES/rcp85/mon/atmos/Amon/r1i1p1/latest/tas/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_200512-203011.nc”
* A path to a group of files: e.g. “/badc/cmip5/data/cmip5/output1/MOHC/HadGEM2-ES/rcp85/mon/atmos/Amon/r1i1p1/latest/tas/*.nc”
* A directory: e.g. “/badc/cmip5/data/cmip5/output1/MOHC/HadGEM2-ES/rcp85/mon/atmos/Amon/r1i1p1/latest/tas”
* An instance of the FileMapper class (that represents a set of files within a single directory)
When force=True, if the project cannot be identified, any attempt to use the base_dir of a project to resolve the data path will be ignored. Any of data_path, ds_id and files that can be set, will be set.
- SUPPORTED_EXTENSIONS = ('.nc', '.gz')
- property base_dir
The base directory of the input dataset.
- property data_path
Dataset input converted to a data path.
- property ds_id
Dataset input converted to a ds id.
- property files
The files found from the input dataset.
- property project
The project of the dataset input.
- property raw
Raw dataset input.
- roocs_utils.project_utils.datapath_to_dsid(datapath)[source]
Switches from dataset path to ds id.
- Parameters
datapath – dataset path.
- Returns
dataset id of input dataset path.
- roocs_utils.project_utils.derive_ds_id(dset)[source]
Derives the dataset id of the provided dset.
- Parameters
dset – dset input of type described by DatasetMapper.
- Returns
ds id of input dataset.
- roocs_utils.project_utils.derive_dset(dset)[source]
Derives the dataset path of the provided dset.
- Parameters
dset – dset input of type described by DatasetMapper.
- Returns
dataset path of input dataset.
- roocs_utils.project_utils.dset_to_filepaths(dset, force=False)[source]
Gets filepaths deduced from input dset.
- Parameters
dset – dset input of type described by DatasetMapper.
force – When True and if the project of the input dset cannot be identified, DatasetMapper will attempt to find the files anyway. Default is False.
- Returns
File paths deduced from input dataset.
- roocs_utils.project_utils.dsid_to_datapath(dsid)[source]
Switches from ds id to dataset path.
- Parameters
dsid – dataset id.
- Returns
dataset path of input dataset id.
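To make the relationship between a ds id and a dataset path concrete, here is a simplified sketch. It assumes the first facet of the ds id is the project name and the remaining facets map one-to-one onto directories below the project base directory. The function names are hypothetical and base_dir is passed in explicitly, whereas the functions above look it up in the configuration.

```python
def dsid_to_datapath_sketch(dsid, base_dir):
    """Hypothetical: turn 'project.a.b.c' into '<base_dir>/a/b/c'."""
    facets = dsid.split(".")
    # Drop the project facet; base_dir already points at the project root
    return "/".join([base_dir.rstrip("/")] + facets[1:])


def datapath_to_dsid_sketch(datapath, base_dir, project):
    """Hypothetical: turn '<base_dir>/a/b/c' into 'project.a.b.c'."""
    if not datapath.startswith(base_dir):
        raise ValueError(f"{datapath} is not under {base_dir}")
    relative = datapath[len(base_dir):].strip("/")
    return ".".join([project] + relative.split("/"))


base_dir = "/badc/cmip5/data/cmip5"
dsid = "cmip5.output1.MOHC.HadGEM2-ES.rcp85.mon.atmos.Amon.r1i1p1.latest.tas"

path = dsid_to_datapath_sketch(dsid, base_dir)
print(path)
# /badc/cmip5/data/cmip5/output1/MOHC/HadGEM2-ES/rcp85/mon/atmos/Amon/r1i1p1/latest/tas
print(datapath_to_dsid_sketch(path, base_dir, "cmip5") == dsid)  # True
```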
- roocs_utils.project_utils.get_data_node_dirs_dict()[source]
Get a dictionary of the data node roots used for retrieving original files.
- roocs_utils.project_utils.get_facet(facet_name, facets, project)[source]
Get facet from project config
- roocs_utils.project_utils.get_project_base_dir(project)[source]
Get the base directory of a project from the config.
- roocs_utils.project_utils.get_project_from_data_node_root(url)[source]
Identify the project from the data node root by identifying the data node root in the input url.
- roocs_utils.project_utils.get_project_from_ds(ds)[source]
Gets the project from an xarray Dataset/DataArray.
- Parameters
ds – xarray Dataset/DataArray.
- Returns
The project derived from the input dataset.
- roocs_utils.project_utils.get_project_name(dset)[source]
Gets the project from an input dset.
- Parameters
dset – dset input of type described by DatasetMapper.
- Returns
The project derived from the input dataset.
- roocs_utils.project_utils.get_projects()[source]
Gets all the projects available in the config.
- roocs_utils.project_utils.map_facet(facet, project)[source]
Return mapped facet value from config or facet name if not found.
- roocs_utils.project_utils.switch_dset(dset)[source]
Switches between dataset path and ds id.
- Parameters
dset – either dataset path or dataset ID.
- Returns
either dataset path or dataset ID - switched from the input.
- roocs_utils.project_utils.url_to_file_path(url)[source]
Convert input url of an original file to a file path
Xarray Utils
- roocs_utils.xarray_utils.xarray_utils.convert_coord_to_axis(coord)[source]
Converts coordinate type to its single character axis identifier (tzyx).
- Parameters
coord – (str) The coordinate to convert.
- Returns
(str) The single character axis identifier of the coordinate (tzyx).
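In essence this is a lookup from coordinate type to axis letter. The sketch below shows one way to express it; the dictionary name, the function name and the choice to return None for an unknown type are assumptions made for this example.

```python
# Coordinate types as returned by get_coord_type, mapped to axis identifiers
AXIS_BY_COORD_TYPE = {
    "time": "t",
    "level": "z",
    "latitude": "y",
    "longitude": "x",
}


def coord_type_to_axis(coord_type):
    """Hypothetical helper: return the axis letter, or None if the type is unknown."""
    return AXIS_BY_COORD_TYPE.get(coord_type)


print(coord_type_to_axis("latitude"))  # y
print(coord_type_to_axis("time"))      # t
```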
- roocs_utils.xarray_utils.xarray_utils.get_coord_by_attr(ds, attr, value)[source]
Returns a coordinate based on a known attribute of a coordinate.
- Parameters
ds – Xarray Dataset or DataArray
attr – (str) Name of attribute to look for.
value – Expected value of attribute you are looking for.
- Returns
Coordinate of xarray dataset if found.
- roocs_utils.xarray_utils.xarray_utils.get_coord_by_type(ds, coord_type, ignore_aux_coords=True)[source]
Returns the xarray Dataset or DataArray coordinate of the specified type.
- Parameters
ds – Xarray Dataset or DataArray
coord_type – (str) Coordinate type to find.
ignore_aux_coords – (bool) If True then coordinates that are not dimensions are ignored. Default is True.
- Returns
Xarray Dataset coordinate (ds.coords[coord_id])
- roocs_utils.xarray_utils.xarray_utils.get_coord_type(coord)[source]
Gets the coordinate type.
- Parameters
coord – coordinate of xarray dataset e.g. coord = ds.coords[coord_id]
- Returns
The type of coordinate as a string. Either longitude, latitude, time, level or None
- roocs_utils.xarray_utils.xarray_utils.get_main_variable(ds, exclude_common_coords=True)[source]
Finds the main variable of an xarray Dataset
- Parameters
ds – xarray Dataset
exclude_common_coords – (bool) If True then common coordinates are excluded from the search for the main variable. common coordinates are time, level, latitude, longitude and bounds. Default is True.
- Returns
(str) The main variable of the dataset e.g. ‘tas’
- roocs_utils.xarray_utils.xarray_utils.is_latitude(coord)[source]
Determines if a coordinate is latitude.
- Parameters
coord – coordinate of xarray dataset e.g. coord = ds.coords[coord_id]
- Returns
(bool) True if the coordinate is latitude.
- roocs_utils.xarray_utils.xarray_utils.is_level(coord)[source]
Determines if a coordinate is level.
- Parameters
coord – coordinate of xarray dataset e.g. coord = ds.coords[coord_id]
- Returns
(bool) True if the coordinate is level.
- roocs_utils.xarray_utils.xarray_utils.is_longitude(coord)[source]
Determines if a coordinate is longitude.
- Parameters
coord – coordinate of xarray dataset e.g. coord = ds.coords[coord_id]
- Returns
(bool) True if the coordinate is longitude.
- roocs_utils.xarray_utils.xarray_utils.is_time(coord)[source]
Determines if a coordinate is time.
- Parameters
coord – coordinate of xarray dataset e.g. coord = ds.coords[coord_id]
- Returns
(bool) True if the coordinate is time.
- roocs_utils.xarray_utils.xarray_utils.open_xr_dataset(dset, **kwargs)[source]
Opens an xarray dataset from a dataset input.
- Parameters
dset – (Str or Path) ds_id, directory path or file path ending in *.nc.
kwargs – Any further keyword arguments to include when opening the dataset. use_cftime=True and decode_timedelta=False are used by default, along with combine=”by_coords” for open_mfdataset only.
Any list will be interpreted as list of files
Other utilities
- roocs_utils.utils.common.parse_size(size)[source]
Parse size string into number of bytes.
- Parameters
size – (str) size to parse in any unit
- Returns
(int) number of bytes
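For orientation, a size parser of this kind might be sketched as follows. This is not the implementation of parse_size: the function name, the supported units and the choice to treat KB/MB/GB as powers of 1000 and KiB/MiB/GiB as powers of 1024 are all assumptions.

```python
import re

# Assumed unit table: decimal and binary prefixes
UNITS = {
    "B": 1,
    "KB": 1000, "MB": 1000**2, "GB": 1000**3, "TB": 1000**4,
    "KIB": 1024, "MIB": 1024**2, "GIB": 1024**3, "TIB": 1024**4,
}


def parse_size_sketch(size):
    """Hypothetical helper: convert strings such as "250MiB" to a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", size)
    if match is None:
        raise ValueError(f"Cannot parse size: {size!r}")
    number, unit = match.groups()
    try:
        factor = UNITS[unit.upper()]
    except KeyError:
        raise ValueError(f"Unknown unit in size: {size!r}") from None
    return int(float(number) * factor)


print(parse_size_sketch("250MiB"))  # 262144000
print(parse_size_sketch("1GB"))     # 1000000000
```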
- class roocs_utils.utils.time_utils.AnyCalendarDateTime(year, month, day, hour, minute, second)[source]
Bases:
object
A class to represent a datetime that could be of any calendar.
Has the ability to add and subtract a day from the input based on MAX_DAY, MIN_DAY, MAX_MONTH and MIN_MONTH
- DAY_RANGE = range(1, 32)
- HOUR_RANGE = range(0, 24)
- MINUTE_RANGE = range(0, 60)
- MONTH_RANGE = range(1, 13)
- SECOND_RANGE = range(0, 60)
- add_day()[source]
Add a day to the input datetime.
- sub_day(n=1)[source]
Subtract a day from the input datetime.
- validate_input(input, name, range)[source]
- property value
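Because the calendar is unknown, adding a day can only rely on the fixed ranges listed above. The sketch below illustrates that idea with a plain function; it is an assumption about the approach, not the class's actual code, and it deliberately allows dates such as 30 February that exist only in some calendars.

```python
DAY_RANGE = range(1, 32)
MONTH_RANGE = range(1, 13)


def add_day_sketch(year, month, day):
    """Hypothetical: add one day without knowing the length of each month."""
    day += 1
    if day not in DAY_RANGE:
        # Past day 31: roll over to the first day of the next month
        day = DAY_RANGE[0]
        month += 1
        if month not in MONTH_RANGE:
            # Past month 12: roll over to the first month of the next year
            month = MONTH_RANGE[0]
            year += 1
    return year, month, day


print(add_day_sketch(2000, 12, 31))  # (2001, 1, 1)
print(add_day_sketch(2000, 2, 29))   # (2000, 2, 30)
```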
- roocs_utils.utils.time_utils.str_to_AnyCalendarDateTime(dt, defaults=None)[source]
Takes a string representing a date/time and returns an AnyCalendarDateTime object. String formats should start with the year and go through to the second, but you can omit anything from the month onwards.
- Parameters
dt – (str) string representing a date/time.
defaults – (list) The default values to use for year, month, day, hour, minute and second if they cannot be parsed from the string. A default value must be provided for each component. If defaults=None, [-1, 1, 1, 0, 0, 0] is used.
- Returns
AnyCalendarDateTime object
- roocs_utils.utils.time_utils.to_isoformat(tm)[source]
Returns an ISO 8601 string from a time object (of different types).
- Parameters
tm – Time object
- Returns
(str) ISO 8601 time string
- class roocs_utils.utils.file_utils.FileMapper(file_list, dirpath=None)[source]
Bases:
object
Class to represent a set of files that exist in the same directory as one object.
- Parameters
file_list – the list of files to represent. If dirpath is not provided, these should be full file paths.
dirpath – The directory path where the files exist. Default is None.
If dirpath is not provided it will be deduced from the file paths provided in file_list.
- file_list
list of file names of the files represented.
- file_paths
list of full file paths of the files represented.
- dirpath
The directory path where the files exist. Either deduced or provided.
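The relationship between the three attributes can be sketched with the standard library as below. split_file_list is a hypothetical function, not part of roocs-utils, and raising an error when the files span several directories is an assumption made for this example.

```python
import posixpath  # POSIX-style paths, so the example behaves the same on any OS


def split_file_list(file_list, dirpath=None):
    """Hypothetical: return (dirpath, file names, full file paths)."""
    if dirpath is None:
        # Deduce the directory from the full paths; all files must share it
        dirs = {posixpath.dirname(f) for f in file_list}
        if len(dirs) != 1:
            raise ValueError("All files must be in the same directory")
        dirpath = dirs.pop()
    names = [posixpath.basename(f) for f in file_list]
    paths = [posixpath.join(dirpath, name) for name in names]
    return dirpath, names, paths


dirpath, names, paths = split_file_list(["/data/tas/a.nc", "/data/tas/b.nc"])
print(dirpath)  # /data/tas
print(names)    # ['a.nc', 'b.nc']
print(paths)    # ['/data/tas/a.nc', '/data/tas/b.nc']
```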
- roocs_utils.utils.file_utils.is_file_list(coll)[source]
Checks whether a collection is a list of files.
- Parameters
coll – (list) collection to check.
- Returns
True if collection is a list of files, else returns False.
Configuration options
There are many configuration options that can be adjusted to change the behaviour of the roocs stack.
The configuration file used can always be found under <package>/etc/roocs.ini, where package is a package in roocs, e.g. roocs-utils.
Any section of the configuration files can be overwritten by creating a new INI file with the desired sections and values and then setting the environment variable ROOCS_CONFIG to the file path of the new INI file, e.g. ROOCS_CONFIG="path/to/config.ini"
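The same override can be prepared from Python. The sketch below writes a small INI file that overrides one setting and points ROOCS_CONFIG at it. The file name and the 2GB value are invented for the example, and it is an assumption that the variable has to be set before any roocs package is imported for the override to be picked up.

```python
import configparser
import os
import tempfile

# Build an override file containing only the sections to change
override = configparser.ConfigParser()
override["clisops:write"] = {"file_size_limit": "2GB"}  # example value

config_path = os.path.join(tempfile.mkdtemp(), "roocs_override.ini")
with open(config_path, "w") as config_file:
    override.write(config_file)

# Assumption: set this before importing roocs_utils, clisops, daops, etc.
os.environ["ROOCS_CONFIG"] = config_path

# Read the file back to confirm what was written
check = configparser.ConfigParser()
check.read(config_path)
print(check["clisops:write"]["file_size_limit"])  # 2GB
```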
The configuration settings used are listed and explained below. Explanations are provided as comments in the code blocks where needed. The values shown are examples, so they will not necessarily match what is used in each of the packages.
Specifying types
It is possible to specify the type of the entries in the configuration file, for example if you want a value to be a list when the file is parsed.
This is managed through a [config_data_types] section at the top of the INI file, which has the following options:
[config_data_types]
# use only in roocs-utils
lists =
dicts =
ints =
floats =
boolean =
# use the below in all other packages
extra_lists =
extra_dicts =
extra_ints =
extra_floats =
extra_booleans =
Simply adding the name of the value you want to format after = will render the correct format, e.g. boolean = use_inventory is_default_for_path will set both use_inventory and is_default_for_path as booleans.
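To show what such type declarations amount to, the following sketch applies them with the standard configparser module. It is only an illustration of the idea, not the parser used by roocs-utils; the INI text is a cut-down example and the variable names are hypothetical.

```python
import configparser

INI_TEXT = """
[config_data_types]
boolean = use_inventory is_default_for_path
ints = port

[project:cmip5]
is_default_for_path = True
use_inventory = False

[elasticsearch]
port = 443
"""

parser = configparser.ConfigParser()
parser.read_string(INI_TEXT)

# Names listed after "=" are separated by whitespace
bool_keys = parser["config_data_types"]["boolean"].split()
int_keys = parser["config_data_types"]["ints"].split()

typed = {}
for section in parser.sections():
    if section == "config_data_types":
        continue
    typed[section] = {}
    for key in parser[section]:
        if key in bool_keys:
            typed[section][key] = parser[section].getboolean(key)
        elif key in int_keys:
            typed[section][key] = parser[section].getint(key)
        else:
            typed[section][key] = parser[section][key]

print(typed["project:cmip5"])  # {'is_default_for_path': True, 'use_inventory': False}
print(typed["elasticsearch"])  # {'port': 443}
```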
roocs-utils
In roocs-utils there are project level settings. The settings under each project heading are the same.
e.g. for cmip5 the heading is [project:cmip5]:
[project:cmip5]
project_name = cmip5
# base directory for data file paths
base_dir = /badc/cmip5/data/cmip5
# if a dataset id is identified as coming from this project, should these be the default settings used (as opposed to using the c3s-cmip5 settings by default)
is_default_for_path = True
# template for the output file name - used in ``clisops.utils.file_namers``
file_name_template = {__derive__var_id}_{frequency}_{model_id}_{experiment_id}_r{realization}i{initialization_method}p{physics_version}{__derive__time_range}{extra}.{__derive__extension}
# defaults used in file name template above if the dataset doesn't contain the attribute
attr_defaults =
model_id:no-model
frequency:no-freq
experiment:no-expt
realization:X
initialization_method:X
physics_version:X
# the order of facets in the file paths of datasets for this project
facet_rule = activity product institute model experiment frequency realm mip_table ensemble_member version variable
# what particular facets will be identified as in this project - not currently used
mappings =
project:project_id
# whether to use an intake catalog or not for this project
use_catalog = False
# where original files can be downloaded
data_node_root = https://data.mips.copernicus-climate.eu/thredds/fileServer/esg_c3s-cmip6/
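The facet_rule setting above gives the order of the facets for a project. One way to picture its use is to pair each facet name with the matching component of a dataset id, as in this sketch. split_facets is a hypothetical helper, and applying the rule to a dot-separated ds id (rather than a file path) is an assumption.

```python
FACET_RULE = (
    "activity product institute model experiment frequency "
    "realm mip_table ensemble_member version variable"
).split()


def split_facets(ds_id, facet_rule=FACET_RULE):
    """Hypothetical: map facet names to the components of a dataset id."""
    values = ds_id.split(".")
    if len(values) != len(facet_rule):
        raise ValueError(
            f"Expected {len(facet_rule)} facets, found {len(values)} in {ds_id!r}"
        )
    return dict(zip(facet_rule, values))


facets = split_facets(
    "cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga"
)
print(facets["model"])     # inmcm4
print(facets["variable"])  # zostoga
```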
There are settings for the environment:
[environment]
# relating to the number of threads to use for processing
OMP_NUM_THREADS=1
MKL_NUM_THREADS=1
OPENBLAS_NUM_THREADS=1
VECLIB_MAXIMUM_THREADS = 1
NUMEXPR_NUM_THREADS = 1
The Elasticsearch settings are specified here:
[elasticsearch]
endpoint = elasticsearch.ceda.ac.uk
port = 443
# names of the elasticsearch indexes used for the various stores
character_store = roocs-char
fix_store = roocs-fix
analysis_store = roocs-analysis
fix_proposal_store = roocs-fix-prop
clisops
These are settings that are specific to clisops:
[clisops:read]
# memory limit for chunks - dask breaks up its underlying array into chunks
chunk_memory_limit = 250MiB
[clisops:write]
# maximum file size of output files. Files are split if this is exceeded
file_size_limit = 1GB
# staging directory to output files to before they are moved to the requested output directory
# if unset, the files are output straight to the requested output directory
output_staging_dir = /gws/smf/j04/cp4cds1/c3s_34e/rook_prod_cache
daops
daops provides settings for using the intake catalog:
[catalog]
# provides the url for the intake catalog with details of datasets
intake_catalog_url = https://raw.githubusercontent.com/cp4cds/c3s_34g_manifests/master/intake/catalogs/c3s.yaml
rook
There are currently no settings in rook, but these would be set in the same way as the clisops and daops settings, e.g. with [rook:section] headings.
dachar
These are settings that are specific to dachar:
[dachar:processing]
# LOTUS settings for scanning datasets
queue = short-serial
# large settings for scanning large datasets
wallclock_large = 23:59
memory_large = 32000
# settings for scanning smaller datasets
wallclock_small = 04:00
memory_small = 4000
[dachar:output_paths]
# output paths for scanning datasets and generating fixes
_base_path = ./outputs
base_log_dir = %(_base_path)s/logs
batch_output_path = %(base_log_dir)s/batch-outputs/{grouped_ds_id}
json_output_path = %(_base_path)s/register/{grouped_ds_id}.json
success_path = %(base_log_dir)s/success/{grouped_ds_id}.log
no_files_path = %(base_log_dir)s/failure/no_files/{grouped_ds_id}.log
pre_extract_error_path = %(base_log_dir)s/failure/pre_extract_error/{grouped_ds_id}.log
extract_error_path = %(base_log_dir)s/failure/extract_error/{grouped_ds_id}.log
write_error_path = %(base_log_dir)s/failure/write_error/{grouped_ds_id}.log
fix_path = %(_base_path)s/fixes/{grouped_ds_id}.json
[dachar:checks]
# checks to run when analysing a sample of datasets
# common checks are run on all samples
common = coord_checks.RankCheck coord_checks.MissingCoordCheck
# it is possible to specify checks that will be run on datasets from specific projects
cmip5 =
cmip6 =
cordex = coord_checks.ExampleCheck
[dachar:settings]
# elasticsearch api token that allows write access to indexes
elastic_api_token =
# how many directory levels to join by to create the name of a new directory when outputting results of scans
# see ``dachar.utils.switch_ds.get_grouped_ds_id``
dir_grouping_level = 4
# threshold at which an anomaly in a sample of datasets will be identified for a fix - not currently used
# the lower the threshold (between 0 and 1), the more likely the anomaly is to get fixed
concern_threshold = 0.2
# possible locations for scans and analysis of datasets
locations = ceda dkrz other
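The %(name)s references in the [dachar:output_paths] settings above follow the standard INI interpolation syntax, which Python's configparser resolves when a value is read. The sketch below demonstrates this on a few of those settings. Filling {grouped_ds_id} with str.format is an assumption about how the templates are completed, and the id used is made up.

```python
import configparser

INI_TEXT = """
[dachar:output_paths]
_base_path = ./outputs
base_log_dir = %(_base_path)s/logs
success_path = %(base_log_dir)s/success/{grouped_ds_id}.log
"""

# ConfigParser uses BasicInterpolation by default, which expands %(name)s
parser = configparser.ConfigParser()
parser.read_string(INI_TEXT)

template = parser["dachar:output_paths"]["success_path"]
print(template)  # ./outputs/logs/success/{grouped_ds_id}.log

# Hypothetical grouped dataset id, used only to fill the placeholder
print(template.format(grouped_ds_id="cmip5.output1.INM.inmcm4"))
# ./outputs/logs/success/cmip5.output1.INM.inmcm4.log
```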
catalog maker
In the catalog maker there are project level settings as well. The settings under each project heading are the same. Settings for the catalog maker are:
[project:c3s-cmip6]
# directory to store catalog and dataset list used in generation of catalog
# if catalog_dir is the same for different projects, the yaml file in this directory will be updated for each project, rather than a new one made
catalog_dir = ./catalog_data
# Where the csv file will be generated
csv_dir = %(catalog_dir)s/%(project_name)s/
# Where the user will provide a dataset list which will be used to generate the catalog
datasets_file = %(catalog_dir)s/%(project_name)s-datasets.txt
Further settings for the intake catalog workflow are:
[log]
# directory for logging outputs from LOTUS when generating catalog entries
log_base_dir = /gws/smf/j04/cp4cds1/c3s_34e/inventory/log
[workflow]
split_level = 4
# max duration for LOTUS jobs, as "hh:mm:ss"
max_duration = 04:00:00
# job queue on LOTUS
job_queue = short-serial
# number of datasets to process in one batch - fewer batches is better as it prevents "Exception: Could not obtain file lock" error
n_per_batch = 750
Examples
import roocs_utils
dir(roocs_utils)
['AreaParameter',
'CONFIG',
'CollectionParameter',
'LevelParameter',
'TimeParameter',
'__author__',
'__builtins__',
'__cached__',
'__contact__',
'__copyright__',
'__doc__',
'__file__',
'__license__',
'__loader__',
'__name__',
'__package__',
'__path__',
'__spec__',
'__version__',
'area_parameter',
'base_parameter',
'collection_parameter',
'config',
'exceptions',
'get_config',
'level_parameter',
'parameter',
'parameterise',
'roocs_utils',
'time_parameter',
'utils',
'xarray_utils']
Parameters
Parameter classes are used to parse the collection, area, time and level inputs passed as arguments to the subsetting operation.
The area values can be input as:
* A string of comma separated values: “0.,49.,10.,65”
* A sequence of strings: (“0”, “-10”, “120”, “40”)
* A sequence of numbers: [0, 49.5, 10, 65]
area = roocs_utils.AreaParameter("0.,49.,10.,65")
# the lat/lon bounds can be returned in a dictionary
print(area.asdict())
# the values can be returned as a tuple
print(area.tuple)
{'lon_bnds': (0.0, 10.0), 'lat_bnds': (49.0, 65.0)}
(0.0, 49.0, 10.0, 65.0)
A collection can be input as:
* A string of comma separated values: “cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga,cmip5.output1.MPI-M.MPI-ESM-LR.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga”
* A sequence of strings: e.g. (“cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga”, “cmip5.output1.MPI-M.MPI-ESM-LR.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga”)
collection = roocs_utils.CollectionParameter("cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga,cmip5.output1.MPI-M.MPI-ESM-LR.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga")
# the collection ids can be returned as a tuple
print(collection.tuple)
('cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga', 'cmip5.output1.MPI-M.MPI-ESM-LR.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga')
Level can be input as:
* A string of slash separated values: “1000/2000”
* A sequence of strings: e.g. (“1000.50”, “2000.60”)
* A sequence of numbers: e.g. (1000.50, 2000.60)
Level inputs should be a range of the levels you want to subset over.
level = roocs_utils.LevelParameter((1000.50, 2000.60))
# the first and last level in the range provided can be returned in a dictionary
print(level.asdict())
# the values can be returned as a tuple
print(level.tuple)
{'first_level': 1000.5, 'last_level': 2000.6}
(1000.5, 2000.6)
Time can be input as:
* A string of slash separated values: “2085-01-01T12:00:00Z/2120-12-30T12:00:00Z”
* A sequence of strings: e.g. (“2085-01-01T12:00:00Z”, “2120-12-30T12:00:00Z”)
Time inputs should be the start and end of the time range you want to subset over.
time = roocs_utils.TimeParameter("2085-01-01T12:00:00Z/2120-12-30T12:00:00Z")
# the first and last time in the range provided can be returned in a dictionary
print(time.asdict())
# the values can be returned as a tuple
print(time.tuple)
{'start_time': '2085-01-01T12:00:00+00:00', 'end_time': '2120-12-30T12:00:00+00:00'}
('2085-01-01T12:00:00+00:00', '2120-12-30T12:00:00+00:00')
Parameterise parameterises inputs to instances of parameter classes which allows them to be used throughout roocs.
roocs_utils.parameter.parameterise("cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga", "0.,49.,10.,65", (1000.50, 2000.60), "2085-01-01T12:00:00Z/2120-12-30T12:00:00Z")
{'collection': Datasets to analyse:
cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga,
'area': Area to subset over:
(0.0, 49.0, 10.0, 65.0),
'level': Level range to subset over
first_level: 1000.5
last_level: 2000.6,
'time': Time period to subset over
start time: 2085-01-01T12:00:00+00:00
end time: 2120-12-30T12:00:00+00:00}
Xarray utils
Xarray utils can be used to identify the main variable in a dataset, to identify the type of a coordinate, or to return a coordinate based on an attribute or a type.
from roocs_utils.xarray_utils import xarray_utils as xu
import xarray as xr
ds = xr.open_mfdataset("../tests/mini-esgf-data/test_data/badc/cmip5/data/cmip5/output1/MOHC/HadGEM2-ES/rcp85/mon/atmos/Amon/r1i1p1/latest/tas/*.nc", use_cftime=True, combine="by_coords")
# find the main variable of the dataset
main_var = xu.get_main_variable(ds)
print("main var =", main_var)
ds[main_var]
main var = tas
<xarray.DataArray 'tas' (time: 3530, lat: 2, lon: 2)>
dask.array<concatenate, shape=(3530, 2, 2), dtype=float32, chunksize=(300, 2, 2), chunktype=numpy.ndarray>
Coordinates:
    height   float64 1.5
  * lat      (lat) float64 -90.0 35.0
  * lon      (lon) float64 0.0 187.5
  * time     (time) object 2005-12-16 00:00:00 ... 2299-12-16 00:00:00
Attributes:
    standard_name:     air_temperature
    long_name:         Near-Surface Air Temperature
    comment:           near-surface (usually, 2 meter) air temperature.
    units:             K
    original_name:     mo: m01s03i236
    cell_methods:      time: mean
    cell_measures:     area: areacella
    history:           2010-12-04T13:50:30Z altered by CMOR: Treated scalar d...
    associated_files:  baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation...
[37]:
# to get the coord types
for coord in ds.coords:
    print("\ncoord name =", coord, "\ncoord type =", xu.get_coord_type(ds[coord]))
print("\n There is a level, time, latitude and longitude coordinate in this dataset")
coord name = height
coord type = level
coord name = lat
coord type = latitude
coord name = lon
coord type = longitude
coord name = time
coord type = time
There is a level, time, latitude and longitude coordinate in this dataset
[38]:
# to check the type of a coord
print(xu.is_level(ds["height"]))
print(xu.is_latitude(ds["lon"]))
True
None
[39]:
# to find a coordinate of a specific type
print("time =", xu.get_coord_by_type(ds, "time"))
# to find the level coordinate, set ignore_aux_coords to False
print("\nlevel =", xu.get_coord_by_type(ds, "level", ignore_aux_coords=False))
time = <xarray.DataArray 'time' (time: 3530)>
array([cftime.Datetime360Day(2005, 12, 16, 0, 0, 0, 0),
cftime.Datetime360Day(2006, 1, 16, 0, 0, 0, 0),
cftime.Datetime360Day(2006, 2, 16, 0, 0, 0, 0), ...,
cftime.Datetime360Day(2299, 10, 16, 0, 0, 0, 0),
cftime.Datetime360Day(2299, 11, 16, 0, 0, 0, 0),
cftime.Datetime360Day(2299, 12, 16, 0, 0, 0, 0)], dtype=object)
Coordinates:
height float64 1.5
* time (time) object 2005-12-16 00:00:00 ... 2299-12-16 00:00:00
Attributes:
bounds: time_bnds
axis: T
long_name: time
standard_name: time
level = <xarray.DataArray 'height' ()>
array(1.5)
Coordinates:
height float64 1.5
Attributes:
units: m
axis: Z
positive: up
long_name: height
standard_name: height
[40]:
# to find a coordinate based on an attribute you expect it to have
xu.get_coord_by_attr(ds, "standard_name", "latitude")
[40]:
<xarray.DataArray 'lat' (lat: 2)>
array([-90., 35.])
Coordinates:
    height   float64 1.5
  * lat      (lat) float64 -90.0 35.0
Attributes:
    bounds:         lat_bnds
    units:          degrees_north
    axis:           Y
    long_name:      latitude
    standard_name:  latitude
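The attribute-based lookup shown above can be pictured with plain dictionaries standing in for the dataset's coordinates. This is only a sketch of the idea, not the roocs-utils implementation: the function name find_coord_by_attr is hypothetical, and the attribute values are copied from the outputs above.

```python
# Plain dictionaries stand in for the dataset's coordinates and their attributes.
coords = {
    "height": {"axis": "Z", "standard_name": "height", "units": "m"},
    "lat": {"axis": "Y", "standard_name": "latitude", "units": "degrees_north"},
    "lon": {"axis": "X", "standard_name": "longitude", "units": "degrees_east"},
    "time": {"axis": "T", "standard_name": "time"},
}


def find_coord_by_attr(coords, attr, value):
    """Hypothetical helper: name of the first coordinate whose attribute matches, else None."""
    for name, attrs in coords.items():
        if attrs.get(attr) == value:
            return name
    return None


print(find_coord_by_attr(coords, "standard_name", "latitude"))  # lat
print(find_coord_by_attr(coords, "axis", "Z"))  # height
```

A lookup that finds nothing returns None here, so callers can test the result before using it.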
Other utilities
Other utilities parse a memory size given in any unit into bytes and convert a time object into an ISO 8601 string.
[41]:
from roocs_utils.utils.common import parse_size
from roocs_utils.utils.time_utils import to_isoformat
from datetime import datetime
[42]:
# to parse a size into bytes
size = '50MiB'
size_in_b = parse_size(size)
size_in_b
[42]:
52428800.0
[43]:
# to convert a time object into a time string
time = datetime(2005, 7, 14, 12, 30)
time_str = to_isoformat(time)
time_str
[43]:
'2005-07-14T12:30:00'
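The arithmetic behind the two results above can be reproduced with the standard library alone. The sketch below is an assumption about how such a conversion might work, not the roocs-utils code: parse_size_sketch and its unit table are hypothetical, and only binary units are covered.

```python
import re
from datetime import datetime

# Hypothetical unit table for this sketch; roocs-utils may accept other units.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}


def parse_size_sketch(size):
    """Convert a string such as '50MiB' into a number of bytes (float)."""
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([A-Za-z]+)\s*", size)
    if match is None:
        raise ValueError(f"Cannot parse size: {size!r}")
    number, unit = match.groups()
    if unit not in UNITS:
        raise ValueError(f"Unknown unit: {unit!r}")
    return float(number) * UNITS[unit]


# 50 MiB is 50 * 1024 * 1024 bytes.
print(parse_size_sketch("50MiB"))  # 52428800.0

# datetime.isoformat() gives the same ISO 8601 string as the cell above.
print(datetime(2005, 7, 14, 12, 30).isoformat())  # 2005-07-14T12:30:00
```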
Contributing
Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.
You can contribute in many ways:
Types of Contributions
Report Bugs
Report bugs at https://github.com/roocs/roocs-utils/issues.
If you are reporting a bug, please include:
Your operating system name and version.
Any details about your local setup that might be helpful in troubleshooting.
Detailed steps to reproduce the bug.
Fix Bugs
Look through the GitHub issues for bugs. Anything tagged with “bug” and “help wanted” is open to whoever wants to implement it.
Implement Features
Look through the GitHub issues for features. Anything tagged with “enhancement” and “help wanted” is open to whoever wants to implement it.
Write Documentation
roocs-utils could always use more documentation, whether as part of the official roocs-utils docs, in docstrings, or even on the web in blog posts, articles, and such.
Submit Feedback
The best way to send feedback is to file an issue at https://github.com/roocs/roocs-utils/issues.
If you are proposing a feature:
Explain in detail how it would work.
Keep the scope as narrow as possible, to make it easier to implement.
Remember that this is a volunteer-driven project, and that contributions are welcome :)
Get Started!
Ready to contribute? Here’s how to set up roocs-utils for local development.
1. Fork the roocs-utils repo on GitHub.
2. Clone your fork locally:
$ git clone git@github.com:your_name_here/roocs-utils.git
3. Install your local copy into a virtualenv. Assuming you have virtualenvwrapper installed, this is how you set up your fork for local development:
$ mkvirtualenv roocs-utils
$ cd roocs-utils/
$ python setup.py develop
4. Create a branch for local development:
$ git checkout -b name-of-your-bugfix-or-feature
Now you can make your changes locally.
5. When you are done making changes, check that your changes pass flake8 and the tests, including testing other Python versions with tox:
$ flake8 roocs-utils tests
$ python setup.py test or py.test
$ tox
To get flake8 and tox, just pip install them into your virtualenv.
6. Commit your changes and push your branch to GitHub:
$ git add .
$ git commit -m "Your detailed description of your changes."
$ git push origin name-of-your-bugfix-or-feature
7. Submit a pull request through the GitHub website.
Pull Request Guidelines
Before you submit a pull request, check that it meets these guidelines:
The pull request should include tests.
If the pull request adds functionality, the docs should be updated. Put your new functionality into a function with a docstring, and add the feature to the list in README.md.
The pull request should work for Python 2.7, 3.4, 3.5 and 3.6, and for PyPy. Check https://travis-ci.com/github/roocs/roocs-utils/pull_requests and make sure that the tests pass for all supported Python versions.
Tips
To run a subset of tests:
$ py.test tests.test_roocs_utils
Deploying
A reminder for the maintainers on how to deploy. Make sure all your changes are committed (including an entry in HISTORY.md). Then run:
$ bumpversion patch # possible: major / minor / patch
$ git push
$ git push --tags
Travis will then deploy to PyPI if tests pass.
Credits
Development Lead
Ag Stephens ag.stephens@stfc.ac.uk
Co-Developers
Eleanor Smith eleanor.smith@stfc.ac.uk @ellesmith88
Carsten Ehbrecht ehbrecht@dkrz.de
Pascal Bourgault pascal.bourgault@gmail.com
Version History
v0.5.0 (2021-10-26)
Bug Fixes
When a project was provided to roocs_utils.project_utils.DatasetMapper, getting the base directory would be skipped, causing an error. This has been resolved.
roocs_utils.project_utils.DatasetMapper can now accept fixed_path_mappings that include “.gz” (gzip) files. This is allowed because Xarray can read gzipped netCDF files.
Breaking Changes
Intake catalog maker removed; it is now in its own package: roocs/catalog-maker
Change to input parameter classes:
Added: roocs_utils.parameter.time_components_parameter.TimeComponentsParameter
Modified input types required for classes: roocs_utils.parameter.time_parameter.TimeParameter and roocs_utils.parameter.level_parameter.LevelParameter
They both now require their inputs to be one of:
roocs_utils.parameter.param_utils.Interval - to specify a range/interval
roocs_utils.parameter.param_utils.Series - to specify a series of values
New Features
roocs_utils.xarray_utils.xarray_utils now accepts keyword arguments to pass through to xarray’s open_dataset or open_mfdataset. If the argument provided is not an option for open_dataset, then open_mfdataset will be used, even for one file.
The roocs.ini config file can now accept fixed_path_modifiers to work together with the fixed_path_mappings section. For example, you can specify parameters in the modifiers that will be expanded into the mappings:
fixed_path_modifiers =
    variable:cld dtr frs pet pre tmn tmp tmx vap wet
fixed_path_mappings =
    cru_ts.4.04.{variable}:cru_ts_4.04/data/{variable}/*.nc
    cru_ts.4.05.{variable}:cru_ts_4.05/data/{variable}/cru_ts4.05.1901.2*.{variable}.dat.nc.gz
In this example, the variable parameter will be expanded out to each of the options provided in the list.
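To make the expansion concrete, here is a minimal sketch of how the modifiers could be substituted into the mappings. The values are copied from the example above; the expand_mappings function and its logic are illustrative assumptions, not the roocs-utils implementation.

```python
from itertools import product

# Values copied from the example above; the expansion logic is an illustrative assumption.
modifiers = {"variable": "cld dtr frs pet pre tmn tmp tmx vap wet".split()}
mappings = {
    "cru_ts.4.04.{variable}": "cru_ts_4.04/data/{variable}/*.nc",
    "cru_ts.4.05.{variable}": "cru_ts_4.05/data/{variable}/cru_ts4.05.1901.2*.{variable}.dat.nc.gz",
}


def expand_mappings(mappings, modifiers):
    """Fill every {name} placeholder with each combination of modifier values."""
    names = list(modifiers)
    expanded = {}
    for values in product(*(modifiers[n] for n in names)):
        substitutions = dict(zip(names, values))
        for dataset_id, path in mappings.items():
            expanded[dataset_id.format(**substitutions)] = path.format(**substitutions)
    return expanded


expanded = expand_mappings(mappings, modifiers)
print(len(expanded))  # 20 (10 variables x 2 mappings)
print(expanded["cru_ts.4.04.tmp"])  # cru_ts_4.04/data/tmp/*.nc
```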
The roocs_utils.xarray_utils.xarray_utils.open_xr_dataset() function was improved so that the time units of the first data file are preserved in ds.time.encoding["units"]. A multi-file dataset now keeps the time “units” of the first file (if present). This is useful for converting to other formats (e.g. CSV).
Other Changes
Python 3.6 no longer tested in GitHub actions.
v0.4.2 (2021-05-18)
Breaking Changes
Removed abcunit-backend and psycopg2 dependencies from requirements.txt; these must now be installed manually in order to use the catalog maker.
v0.4.0 (2021-05-18)
Breaking Changes
Inventory maker now removed and replaced by intake catalog maker which writes a csv file with the dataset entries and a yaml description file.
In etc/roocs.ini the option use_inventory has been replaced by use_catalog and the inventory maker options have been replaced with equivalent catalog options. However, the option to include file paths or not no longer exists.
The catalog maker now uses a database backend and creates a csv file, so there are 3 new dependencies for the catalog maker: pandas, abcunit-backend and psycopg2. This means a database backend must be specified and the paths for the pickle files in etc/roocs.ini are no longer necessary. For more information see the README.
Other Changes
oyaml removed as a dependency
v0.3.0 (2021-03-30)
New Features
Added AnyCalendarDateTime and str_to_AnyCalendarDateTime to utils.time_utils to aid in handling date strings that may not exist in all calendar types.
Inventory maker will check that the latitude and longitude of the dataset it is scanning are within acceptable bounds and raise an exception if they are not.
v0.2.1 (2021-02-19)
Bug Fixes
clean up imports … remove pandas dependency.
v0.2.0 (2021-02-18)
Breaking Changes
cf_xarray>=0.3.1 now required due to differing level identification of coordinates between versions.
oyaml>=0.9 - new dependency for inventory
Interface to inventory maker changed. Detailed instructions for use added in README.
Adjusted file name template. Underscore removed before __derive__time_range
New dev dependency: GitPython==3.1.12
New Features
Added use_inventory option to roocs.ini config and allow data to be used without checking an inventory.
DatasetMapper class and wrapper functions added to roocs_utils.project_utils and roocs_utils.xarray_utils.xarray_utils to resolve all paths and dataset ids in the same way.
FileMapper added in roocs_utils.utils.file_utils to resolve multiple files with the same directory to their directory path.
Fixed path mapping support added in DatasetMapper
Added DimensionParameter to be used with the average operation.
Other Changes
Removed submodule for test data. Test data is now cloned from git using GitPython and cached
CollectionParameter accepts an instance of FileMapper or a sequence of FileMapper objects
Adjusted file name template to include an extra option before the file extension.
Swapped from Travis CI to GitHub actions
v0.1.5 (2020-11-23)
Breaking Changes
Replaced use of cfunits by cf_xarray and cftime (new dependency) in roocs_utils.xarray_utils.
v0.1.4 (2020-10-22)
Fixing pip install
Bug Fixes
Importing and using roocs-utils when pip installing now works
v0.1.3 (2020-10-21)
Fixing formatting of doc strings and imports
Breaking Changes
Use of roocs_utils.parameter.parameterise.parameterise:
import should now be from roocs_utils.parameter import parameterise
and usage should be, for example parameters = parameterise(collection=ds, time=time, area=area, level=level)
New Features
Added a notebook to show examples
Other Changes
Updated formatting of doc strings
v0.1.2 (2020-10-15)
Updating the documentation and improving the changelog.
Other Changes
Updated doc strings to improve documentation.
Updated documentation.
v0.1.1 (2020-10-12)
Fixing mostly existing functionality to work more efficiently with the other packages in roocs.
Breaking Changes
environment.yml has been updated to bring it in line with requirements.txt.
level coordinates would previously have been identified as None. They are now identified as level.
New Features
parameterise function added in roocs_utils.parameter to use in all roocs packages.
ROOCS_CONFIG environment variable can be used to override default config in etc/roocs.ini. To use a local config file set ROOCS_CONFIG as the file path to this file. Several file paths can be provided separated by a :
Inventory functionality added - this can be used to create an inventory of datasets. See README for more info.
project_utils added with the following functions to get the project name of a dataset and the base directory for that project.
utils.common and utils.time_utils added.
is_level implemented in xarray_utils to identify whether a coordinate is a level or not.
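As an illustration of the ROOCS_CONFIG entry above, the sketch below sets the variable to two colon-separated file paths and reads them with the standard library configparser. The file names, the [example] section and its options are all hypothetical. Letting later files override earlier ones is configparser behaviour; whether roocs-utils applies the same precedence is an assumption.

```python
import os
import tempfile
from configparser import ConfigParser

# Two throwaway config files; the section and option names are hypothetical.
tmp_dir = tempfile.mkdtemp()
first = os.path.join(tmp_dir, "first.ini")
second = os.path.join(tmp_dir, "second.ini")
with open(first, "w") as f:
    f.write("[example]\nchunk_limit = 100MiB\nname = first\n")
with open(second, "w") as f:
    f.write("[example]\nname = second\n")

# Several file paths are joined with ":" as described above.
os.environ["ROOCS_CONFIG"] = f"{first}:{second}"

# Split the variable back into paths and read them in order.
paths = os.environ["ROOCS_CONFIG"].split(":")
config = ConfigParser()
config.read(paths)

print(config["example"]["name"])  # second
print(config["example"]["chunk_limit"])  # 100MiB
```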
Bug Fixes
xarray_utils.xarray_utils.get_main_variable updated to exclude common coordinates from the search for the main variable. This fixes a bug where coordinates such as lon_bounds would be returned as the main variable.
Other Changes
README update to explain inventory functionality.
Black and flake8 formatting applied.
Fixed import warning with collections.abc.
v0.1.0 (2020-07-30)
First release.