*********************
Configuration options
*********************

There are many configuration options that can be adjusted to change the behaviour of the roocs stack.
The configuration file used can always be found under ``<package>/etc/roocs.ini``, where ``<package>`` is a package in `roocs`, e.g. `roocs-utils`.

Any section of the configuration files can be overridden by creating a new INI file with the desired sections and values and then setting the environment variable ``ROOCS_CONFIG`` to the path of the new INI file,
e.g. ``ROOCS_CONFIG="path/to/config.ini"``.

Below are the configuration settings, with explanations provided as comments within the code blocks. The examples given may not directly correspond to the settings used in each package.

Specifying types
################

It is possible to specify the type of the entries in the configuration file, for example if you want a value to be a list when the file is parsed.

This is managed through a ``[config_data_types]`` section at the top of the ``ini`` file, which has the following options:

.. code-block::

    [config_data_types]
    # use only in roocs-utils
    lists =
    dicts =
    ints =
    floats =
    boolean =
    # use the below in all other packages
    extra_lists =
    extra_dicts =
    extra_ints =
    extra_floats =
    extra_booleans =

Adding the name of an option after ``=`` causes its value to be parsed as that type,
e.g. ``boolean = use_inventory is_default_for_path`` will set both ``use_inventory`` and ``is_default_for_path`` as booleans.

roocs-utils
###########

In roocs-utils there are project level settings. The settings under each project heading are the same,
e.g. for cmip5 the heading is ``[project:cmip5]``:

.. code-block::

    [project:cmip5]
    project_name = cmip5
    # base directory for data file paths
    base_dir = /badc/cmip5/data/cmip5
    # if a dataset id is identified as coming from this project, should these be the default settings used (as opposed to using the c3s-cmip5 settings by default)
    is_default_for_path = True
    # template for the output file name - used in ``clisops.utils.file_namers``
    file_name_template = {__derive__var_id}_{frequency}_{model_id}_{experiment_id}_r{realization}i{initialization_method}p{physics_version}{__derive__time_range}{extra}.{__derive__extension}
    # defaults used in file name template above if the dataset doesn't contain the attribute
    attr_defaults = model_id:no-model frequency:no-freq experiment:no-expt realization:X initialization_method:X physics_version:X
    # the order of facets in the file paths of datasets for this project
    facet_rule = activity product institute model experiment frequency realm mip_table ensemble_member version variable
    # what particular facets will be identified as in this project - not currently used
    mappings = project:project_id
    # whether to use an intake catalog or not for this project
    use_catalog = False
    # where original files can be downloaded
    data_node_root = https://data.mips.climate.copernicus.eu/thredds/fileServer/esg_c3s-cmip6/

There are settings for the environment:

.. code-block::

    [environment]
    # relating to the number of threads to use for processing
    OMP_NUM_THREADS=1
    MKL_NUM_THREADS=1
    OPENBLAS_NUM_THREADS=1
    VECLIB_MAXIMUM_THREADS = 1
    NUMEXPR_NUM_THREADS = 1

The Elasticsearch settings are specified here:

.. code-block::

    [elasticsearch]
    endpoint = elasticsearch.ceda.ac.uk
    port = 443
    # names of the elasticsearch indexes used for the various stores
    character_store = roocs-char
    fix_store = roocs-fix
    analysis_store = roocs-analysis
    fix_proposal_store = roocs-fix-prop

clisops
#######

These are settings that are specific to `clisops`:

.. code-block::

    [clisops:read]
    # memory limit for chunks - dask breaks up its underlying array into chunks
    chunk_memory_limit = 250MiB

    [clisops:write]
    # maximum file size of output files. Files are split if this is exceeded
    file_size_limit = 1GB
    # staging directory to output files to before they are moved to the requested output directory
    # if unset, the files are output straight to the requested output directory
    output_staging_dir = /gws/smf/j04/cp4cds1/c3s_34e/rook_prod_cache

daops
#####

`daops` provides settings for using the `intake` catalog:

.. code-block::

    [catalog]
    # provides the url for the intake catalog with details of datasets
    intake_catalog_url = https://raw.githubusercontent.com/cp4cds/c3s_34g_manifests/master/intake/catalogs/c3s.yaml

rook
####

There are currently no settings in `rook`, but these would be set in the same way as the `clisops` and `daops` settings, e.g. with ``[rook:section]`` headings.

dachar
######

These are settings that are specific to `dachar`:

.. code-block::

    [dachar:processing]
    # LOTUS settings for scanning datasets
    queue = short-serial
    # large settings for scanning large datasets
    wallclock_large = 23:59
    memory_large = 32000
    # settings for scanning smaller datasets
    wallclock_small = 04:00
    memory_small = 4000

    [dachar:output_paths]
    # output paths for scanning datasets and generating fixes
    _base_path = ./outputs
    base_log_dir = %(_base_path)s/logs
    batch_output_path = %(base_log_dir)s/batch-outputs/{grouped_ds_id}
    json_output_path = %(_base_path)s/register/{grouped_ds_id}.json
    success_path = %(base_log_dir)s/success/{grouped_ds_id}.log
    no_files_path = %(base_log_dir)s/failure/no_files/{grouped_ds_id}.log
    pre_extract_error_path = %(base_log_dir)s/failure/pre_extract_error/{grouped_ds_id}.log
    extract_error_path = %(base_log_dir)s/failure/extract_error/{grouped_ds_id}.log
    write_error_path = %(base_log_dir)s/failure/write_error/{grouped_ds_id}.log
    fix_path = %(_base_path)s/fixes/{grouped_ds_id}.json

    [dachar:checks]
    # checks to run when analysing a sample of datasets
    # common checks are run on all samples
    common = coord_checks.RankCheck coord_checks.MissingCoordCheck
    # it is possible to specify checks that will be run on datasets from specific projects
    cmip5 =
    cmip6 =
    cordex = coord_checks.ExampleCheck

    [dachar:settings]
    # elasticsearch api token that allows write access to indexes
    elastic_api_token =
    # how many directory levels to join by to create the name of a new directory when outputting results of scans
    # see ``dachar.utils.switch_ds.get_grouped_ds_id``
    dir_grouping_level = 4
    # threshold at which an anomaly in a sample of datasets will be identified for a fix - not currently used
    # the lower the threshold (between 0 and 1), the more likely the anomaly is to get fixed
    concern_threshold = 0.2
    # possible locations for scans and analysis of datasets
    locations = ceda dkrz other

catalog maker
#############

In the catalog maker there are project level settings as well. The settings under each project heading are the same.

Settings for the catalog maker are:

.. code-block::

    [project:c3s-cmip6]
    # directory to store catalog and dataset list used in generation of catalog
    # if catalog_dir is the same for different projects, the yaml file in this directory will be updated for each project, rather than a new one made
    catalog_dir = ./catalog_data
    # where the csv file will be generated
    csv_dir = %(catalog_dir)s/%(project_name)s/
    # where the user will provide a dataset list which will be used to generate the catalog
    datasets_file = %(catalog_dir)s/%(project_name)s-datasets.txt

Further settings for the intake catalog workflow are::

    [log]
    # directory for logging outputs from LOTUS when generating catalog entries
    log_base_dir = /gws/smf/j04/cp4cds1/c3s_34e/inventory/log

    [workflow]
    split_level = 4
    # max duration for LOTUS jobs, as "hh:mm:ss"
    max_duration = 04:00:00
    # job queue on LOTUS
    job_queue = short-serial
    # number of datasets to process in one batch - fewer batches is better as it prevents "Exception: Could not obtain file lock" error
    n_per_batch = 750
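As a rough illustration of how an override file and the ``[config_data_types]`` declarations fit together, the sketch below writes a small INI file, points ``ROOCS_CONFIG`` at it, and parses it with Python's standard ``configparser``. It is not the loader used by the roocs packages: the function ``load_typed_config``, the ``CONVERTERS`` table and the way values are split and cast are all assumptions made for this example. Only the section and option names come from the settings shown above.

```python
import configparser
import os
import tempfile

# Hypothetical override file. Section and option names are copied from the
# examples in this page; the particular selection is only for illustration.
OVERRIDE_INI = """
[config_data_types]
extra_lists = locations
extra_ints = dir_grouping_level
extra_floats = concern_threshold
extra_booleans = use_catalog

[project:cmip5]
use_catalog = False

[dachar:settings]
dir_grouping_level = 4
concern_threshold = 0.2
locations = ceda dkrz other

[dachar:output_paths]
_base_path = ./outputs
base_log_dir = %(_base_path)s/logs
"""

# Assumed conversions for each type declaration (not taken from roocs-utils).
CONVERTERS = {
    "extra_lists": lambda raw: raw.split(),
    "extra_ints": int,
    "extra_floats": float,
    "extra_booleans": lambda raw: raw.strip().lower() in ("true", "1", "yes", "on"),
}


def load_typed_config(path):
    """Read an INI file and cast the options named in [config_data_types]."""
    parser = configparser.ConfigParser()
    with open(path) as handle:
        parser.read_file(handle)

    # Build a lookup from option name to the converter declared for it.
    casts = {}
    for type_key, convert in CONVERTERS.items():
        declared = parser.get("config_data_types", type_key, fallback="")
        for option in declared.split():
            casts[option] = convert

    config = {}
    for section in parser.sections():
        if section == "config_data_types":
            continue
        # parser.items() resolves %(name)s references within the section.
        config[section] = {
            option: casts.get(option, str)(value)
            for option, value in parser.items(section)
        }
    return config


with tempfile.TemporaryDirectory() as tmp_dir:
    override_path = os.path.join(tmp_dir, "config.ini")
    with open(override_path, "w") as handle:
        handle.write(OVERRIDE_INI)
    # Same mechanism as ROOCS_CONFIG="path/to/config.ini" in a shell.
    os.environ["ROOCS_CONFIG"] = override_path
    config = load_typed_config(os.environ["ROOCS_CONFIG"])

print(config["dachar:settings"]["locations"])
print(config["dachar:output_paths"]["base_log_dir"])
```

Options that are not listed under ``[config_data_types]`` stay as strings in this sketch, and the ``%(_base_path)s`` reference is expanded by ``configparser``'s default interpolation before any casting happens.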