Xarray-regex documentation

Welcome to the Xarray-regex package documentation!

Xarray-regex finds files based on regular expressions in order to feed them to Xarray. It makes it easy to build regular expressions using ‘matchers’, to fix some elements of the expressions so that only certain files are selected, and to retrieve information from filenames.

Finding files

The main entry point of this package is the FileFinder class. This is the object that will find files according to a regular expression. An instance is created using the root directory containing the files, and a pre-regular expression (abbreviated pre-regex) that will be transformed into a proper regex later.

When asked to find files, the finder first creates a regular expression from the pre-regex. It then recursively scans the root directory and its subfolders, descending no deeper than FileFinder.max_depth_scan folders (3 by default). The finder keeps only the files that match the regex. The files can be retrieved using FileFinder.get_files().
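The scan described above can be pictured with a short standard-library sketch. This is not the FileFinder implementation: `scan_files` is a hypothetical helper, and it assumes that the root counts as depth 0 and that paths are matched relative to the root with forward slashes.

```python
import os
import re


def scan_files(root, regex, max_depth=3):
    """Return sorted paths (relative to root) that fully match regex.

    Hypothetical stand-in for the scan described above.
    """
    pattern = re.compile(regex)
    root = os.path.abspath(root)
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root)
        # Assumption: the root is depth 0, its direct subfolders depth 1, etc.
        depth = 0 if rel_dir == os.curdir else rel_dir.count(os.sep) + 1
        if depth >= max_depth:
            # Emptying dirnames in place stops os.walk from descending further.
            dirnames[:] = []
        for name in filenames:
            rel_path = os.path.normpath(os.path.join(rel_dir, name))
            rel_path = rel_path.replace(os.sep, "/")
            if pattern.fullmatch(rel_path):
                found.append(rel_path)
    return sorted(found)
```

Calling `scan_files('Data', r'SST/sst_\d{4}-\d\d-\d\d\.nc')` would keep only the daily SST files located directly in the ‘SST’ subfolder.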

Pre-regex

The pre-regex specifies the structure of the filenames relative to the root directory. It is a regular expression with the added feature of matchers.

A matcher is a part of the filename that varies from file to file. In the pre-regex, it is enclosed in parentheses and preceded by ‘%’. It is represented by the xarray_regex.matcher.Matcher class.

Warning

Anything outside matchers in the pre-regex is considered constant across files. For example, if we have daily files ‘sst_2003-01-01.nc’ with the date changing for each file, we could use the regex ‘sst_.*.nc’, which would correctly match all files, but the finder would in fact consider that all files are ‘sst_2003-01-01.nc’ (the first file found).

Several elements can be specified inside the matcher parentheses, separated by colons:

  • a group name (optional)

  • a name that will dictate the matcher regex using a correspondence table

  • a custom regex if the correspondences are not enough (optional)

  • a keyword that will discard that matcher when retrieving information from a filename (optional)

The full syntax is as follows: ‘%([group:]name[:custom=custom regex:][:discard])’.
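To make the syntax concrete, here is a minimal sketch of how the text inside ‘%(...)’ could be split into its elements. `parse_matcher` is a hypothetical helper written for illustration only; it is not the parser used by the Matcher class, and it assumes a matcher is never named ‘discard’.

```python
def parse_matcher(spec):
    """Split the text inside %(...) into group, name, custom regex, discard.

    Hypothetical helper following the syntax
    [group:]name[:custom=custom regex:][:discard].
    """
    custom = None
    discard = False
    if ":custom=" in spec:
        head, _, tail = spec.partition(":custom=")
        # The custom regex runs up to the last colon, so it may contain colons.
        custom, sep, rest = tail.rpartition(":")
        if not sep:
            raise ValueError("custom regex must be terminated with a colon")
        if rest not in ("", "discard"):
            raise ValueError(f"unexpected keyword {rest!r}")
        discard = rest == "discard"
    else:
        head = spec
        if head.endswith(":discard"):
            head, discard = head[: -len(":discard")], True
    # Whatever precedes the last colon of the head is the optional group.
    group, _, name = head.rpartition(":")
    return {"group": group or None, "name": name,
            "custom": custom, "discard": discard}
```

For instance, `parse_matcher('time:m')` yields the group ‘time’ and the name ‘m’, while `parse_matcher('x:discard')` yields the name ‘x’ with the discard flag set.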

Note

The matchers are uniquely identified by their index in the pre-regex (starting at 0).

Name

The name of the matcher dictates the regex used for that matcher (unless overridden by a custom regex), and how it is used by functions that retrieve information from the filename. The Matcher.NAME_RGX class attribute gives the correspondence between name and regex:

Name   Regex       Description
idx    \d*         Index
text   [a-zA-Z]*   Letters
char   \S*         Character
F      %Y-%m-%d    Date (YYYY-MM-DD)
x      %Y%m%d      Date (YYYYMMDD)
X      %H%M%S      Time (HHMMSS)
Y      \d\d\d\d    Year (YYYY)
m      \d\d        Month (MM)
d      \d\d        Day of month (DD)
j      \d\d\d      Day of year (DDD)
B      [a-zA-Z]*   Month name
H      \d\d        Hour 24 (HH)
M      \d\d        Minute (MM)
S      \d\d        Seconds (SS)

This table mostly follows the strftime format specifications.

So for example, ‘%(Y)’ will be replaced by a regex searching for 4 digits, and library.get_date will use it to find the year of the date.

A letter preceded by a percent sign ‘%’ in the regex will be recursively replaced by the corresponding regex in the table. This can be used in the custom regex. This still counts as a single matcher and its name will not be changed, only the regex. So ‘%x’ will be replaced by ‘%Y%m%d’, in turn replaced by ‘\d\d\d\d\d\d\d\d’. A percent character in the regex is escaped by another percent sign (‘%%’).
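The recursive replacement can be sketched as follows. This is an illustration under assumptions, not the package code: `NAME_RGX` here is a hand-copied subset of the table above, and `expand` is a hypothetical function name.

```python
import re

# Subset of the name/regex table above, copied by hand for this sketch.
NAME_RGX = {
    "Y": r"\d\d\d\d",
    "m": r"\d\d",
    "d": r"\d\d",
    "j": r"\d\d\d",
    "H": r"\d\d",
    "M": r"\d\d",
    "S": r"\d\d",
    "F": "%Y-%m-%d",
    "x": "%Y%m%d",
    "X": "%H%M%S",
}


def expand(rgx):
    """Recursively replace '%<letter>' by its regex; '%%' becomes '%'."""
    def substitute(match):
        letter = match.group(1)
        if letter == "%":
            return "%"
        # The table entry may itself contain '%<letter>', hence the recursion.
        return expand(NAME_RGX[letter])

    return re.sub(r"%([a-zA-Z%])", substitute, rgx)
```

With this sketch, `expand('%x')` goes through ‘%Y%m%d’ and ends up as eight ‘\d’, which matches a string such as ‘20070101’.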

Custom regex

Not all use cases are covered by the NAME_RGX table, and one might want to use a specific regex:

sst_%(Y:custom=\d\d:)-%(doy:custom=\d\d\d:discard)

Warning

The custom regex must be terminated with a colon.

Discard keyword

Information can be retrieved from the matches in the filename, but a matcher can be discarded so that it is not used. For example, for a file of weekly averages whose filename indicates the start and end dates of the average, we might want to recover only the starting date:

sst_%(x)-%(x:discard)
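The effect of the discard keyword can be pictured with plain `re`. In this sketch the regex is written by hand as an assumed equivalent of the pre-regex above, and `MATCHERS` and `kept_matches` are hypothetical names, not part of the package.

```python
import re

# Hand-written equivalent of the pre-regex 'sst_%(x)-%(x:discard)':
# each matcher becomes one capturing group.
REGEX = re.compile(r"sst_(\d{8})-(\d{8})")
MATCHERS = [
    {"name": "x", "discard": False},
    {"name": "x", "discard": True},
]


def kept_matches(filename):
    """Return the matches of filename, leaving out discarded matchers."""
    match = REGEX.fullmatch(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not match the regex")
    return [
        {"index": index, "name": matcher["name"], "text": text}
        for index, (matcher, text) in enumerate(zip(MATCHERS, match.groups()))
        if not matcher["discard"]
    ]
```

For ‘sst_20070101-20070108’, only the first matcher (index 0, text ‘20070101’) is kept, so a date lookup would see the starting date alone.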

Nesting files

Found files can be retrieved using FileFinder.get_files(). This outputs a list of all files (relative to the finder root, or as absolute paths), sorted alphabetically. They can also be returned as a nested list of filenames. This is meant to work with xarray.open_mfdataset(), which merges files in a specific order when supplied a nested list of files.

To this end, one must pass group names to the nested argument of that same function. The rightmost group will correspond to the innermost level.

An example is available in the examples.
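As a rough picture of what nesting by a single group means, the following sketch groups filenames with the standard library only. It is not how FileFinder does it: `nest_by` is a hypothetical helper, and the pattern uses named groups that stand in for matcher groups.

```python
import re
from collections import defaultdict


def nest_by(files, pattern, group):
    """Group filenames into sublists sharing the same value of a regex group."""
    buckets = defaultdict(list)
    for name in sorted(files):
        match = re.fullmatch(pattern, name)
        if match:
            buckets[match.group(group)].append(name)
    # One sublist per distinct value, ordered by that value.
    return [buckets[key] for key in sorted(buckets)]


# Named groups play the role of the 'variable' and 'time' matcher groups.
PATTERN = r"(?P<variable>[A-Z]+)_(?P<time>\d{8})\.nc"
files = ["SST_20070109.nc", "SSH_20070101.nc",
         "SST_20070101.nc", "SSH_20070109.nc"]

by_variable = nest_by(files, PATTERN, "variable")
by_time = nest_by(files, PATTERN, "time")
```

Here `by_variable` holds one sublist per variable (all SSH files, then all SST files), and `by_time` holds one sublist per date, each containing the SSH and SST file for that date.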

Retrieve information

As some metadata might only be found in the filenames, FileFinder offers the possibility to retrieve it easily using the FileFinder.get_matches() method. A filename is matched against the regex of the finder, and the method returns a list of the matches found.

The package supplies the function library.get_date to retrieve a datetime object from those matches:

from xarray_regex.library import get_date
matches = finder.get_matches(filename)
date = get_date(matches)
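To give an idea of the kind of work such a function does, here is a simplified sketch using only `datetime`. It is not library.get_date: `date_from_matches` is a hypothetical function, and it assumes the matches have already been reduced to a dictionary mapping matcher names to the matched text.

```python
from datetime import datetime


def date_from_matches(matches):
    """Build a datetime from a {matcher name: matched text} dictionary.

    Hypothetical simplification: only the names x, F, Y, j, m and d
    from the table are handled.
    """
    if "x" in matches:
        return datetime.strptime(matches["x"], "%Y%m%d")
    if "F" in matches:
        return datetime.strptime(matches["F"], "%Y-%m-%d")
    if "j" in matches:
        # Year plus day of year, e.g. '2007' and '008'.
        return datetime.strptime(matches["Y"] + matches["j"], "%Y%j")
    # Missing month or day default to the first one.
    return datetime(int(matches["Y"]),
                    int(matches.get("m", 1)),
                    int(matches.get("d", 1)))
```

A year and day of year such as ‘2007’ and ‘008’ give 8 January 2007, and a lone year and month give the first day of that month.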

Combine with Xarray

Retrieving information can be used when opening multiple files with xarray.open_mfdataset().

FileFinder.get_func_process_filename() turns a function into a suitable callable for the preprocess argument of xarray.open_mfdataset. The function should take an xarray.Dataset, a filename, a FileFinder, and any additional arguments as input, and return an xarray.Dataset. This makes it possible to use the finder and the dataset filename in the pre-processing. The following example shows how to add a time dimension, using the filename to find the timestamp:

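import pandas as pd
import xarray as xr
from xarray_regex import library

def preprocess(ds, filename, finder):
  # Recover the date encoded in the filename.
  matches = finder.get_matches(filename)
  date = library.get_date(matches)

  # Attach it as a length-one 'time' coordinate.
  ds = ds.assign_coords(time=pd.to_datetime([date]))
  return ds

ds = xr.open_mfdataset(finder.get_files(),
                       preprocess=finder.get_func_process_filename(preprocess))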

Note

The filename path sent to the function is automatically made relative to the finder root directory, so that it can be used directly with FileFinder.get_matches().
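One plausible way to picture such a wrapper is sketched below. This is an assumption about the mechanism, not the actual code of get_func_process_filename: `make_preprocess` is a hypothetical name, and the sketch relies on the dataset carrying the path it was opened from in `ds.encoding['source']`.

```python
import os


def make_preprocess(func, root, finder=None):
    """Wrap func(ds, filename, finder) into a one-argument callable.

    Hypothetical sketch: the file path is read from ds.encoding['source']
    and made relative to root before being handed to func.
    """
    def wrapped(ds):
        filename = os.path.relpath(ds.encoding["source"], root)
        return func(ds, filename, finder)

    return wrapped
```

The returned callable takes only the dataset, which is the signature expected by the preprocess argument.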

Fix matchers

The package makes it easy to change the regular expression dynamically. This is done by replacing matchers in the regular expression with a given string, using the FileFinder.fix_matcher() method.

Matchers to replace can be selected either by their index in the pre-regex (starting from 0), or by their name, or their group and name following the syntax ‘group:name’. If using a matcher name or group+name, multiple matchers can be fixed to the same value at once.

For instance, when using the following pre-regex:

'%(time:m)/SST_%(time:Y)%(time:m)%(time:d)\.nc'

we can keep only the files corresponding to January using any of:

finder.fix_matcher(0, '01')
finder.fix_matcher('m', '01')
finder.fix_matcher('time:m', '01')

We could also select specific days using a regular expression:

finder.fix_matcher('d', '01|03|05|07')

This would create the following regular expression:

'(\d\d)/SST_(\d\d\d\d)(\d\d)(01|03|05|07)\.nc'
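The substitution can be mimicked with a short sketch. It is only an illustration: the pre-regex is split by hand into `PARTS`, and `build_regex` is a hypothetical function that accepts the three kinds of keys described above (index, name, or ‘group:name’).

```python
import re

# The pre-regex '%(time:m)/SST_%(time:Y)%(time:m)%(time:d)\.nc' split by hand
# into literal text and (group, name, regex) matchers.
PARTS = [
    ("time", "m", r"\d\d"),
    "/SST_",
    ("time", "Y", r"\d\d\d\d"),
    ("time", "m", r"\d\d"),
    ("time", "d", r"\d\d"),
    r"\.nc",
]


def build_regex(parts, fixed):
    """Assemble the regex, replacing fixed matchers by the given string.

    fixed maps a matcher index, a name, or 'group:name' to a string.
    """
    out = []
    index = 0  # Matchers are numbered in order of appearance, from 0.
    for part in parts:
        if isinstance(part, str):
            out.append(part)
            continue
        group, name, rgx = part
        for key in (index, name, f"{group}:{name}"):
            if key in fixed:
                rgx = fixed[key]
                break
        out.append(f"({rgx})")
        index += 1
    return "".join(out)


regex = build_regex(PARTS, {"d": "01|03|05|07"})
```

In this sketch, fixing by name (‘m’) replaces both month matchers, whereas fixing by index 0 replaces only the first one, the folder name.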

Examples

Plain time series

Here the files are all in the same folder. Only the timestamp differs from one file to the other:

Data
├── SSH
│   ├── SSH_20070101.nc
│   ├── SSH_20070109.nc
│   └── ...
└── SST
    ├── A_2007001_2007008.L3m_8D_sst.nc
    ├── A_2007008_2007016.L3m_8D_sst.nc
    └── ...

We will scan for SST files:

from xarray_regex import FileFinder, library

root = 'Data/SST'
pregex = 'A_%(Y)%(j)_%(Y)%(j:discard)%(suffix)'
finder = FileFinder(root, pregex, suffix=r'\.L3m_8D_sst\.nc')

files = finder.get_files()

We would like to open all these files using Xarray, but the files lack a defined ‘time’ dimension along which to concatenate them. To make it work, we can use the ‘preprocess’ argument of xarray.open_mfdataset:

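import pandas as pd
import xarray as xr

def preprocess(ds, filename, finder):
  # Recover the date encoded in the filename.
  matches = finder.get_matches(filename)
  date = library.get_date(matches)

  # Attach it as a length-one 'time' coordinate.
  ds = ds.assign_coords(time=pd.to_datetime([date]))
  return ds

ds = xr.open_mfdataset(files,
                       preprocess=finder.get_func_process_filename(preprocess))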

Nested files

We can scan both variables at the same time but retrieve the files as a nested list. We assume the filenames for both variables are structured in the same way. Groups in the pre-regex define which matchers are grouped together:

pregex = r'%(variable:char)/%(variable:char)_%(time:Y)%(time:m)%(time:d)\.nc'

We can now group the files by variable or time:

>>> finder.get_files(relative=True, nested=['variable'])
[['SSH_20070101.nc',
  'SSH_20070109.nc',
  ...],
 ['SST_20070101.nc',
  'SST_20070109.nc',
  ...]]

>>> finder.get_files(relative=True, nested=['time'])
[['SSH_20070101.nc', 'SST_20070101.nc'],
 ['SSH_20070109.nc', 'SST_20070109.nc'],
 ...]

This works for any number of groups in any order.

API

Content

file_finder.FileFinder(root, pregex, …)

Find files using a regular expression.

Submodules

file_finder

Find files using a pre-regex.

library

Functions to retrieve values from filename.

matcher

Matcher object.

Source code: https://github.com/Descanonge/xarray-regex
