Xarray-regex documentation¶
Welcome to the Xarray-regex package documentation!
Xarray-regex finds files based on regular expressions so that they can be fed to Xarray. It lets you build regular expressions easily using ‘Matchers’, fix some elements of an expression to select only certain files, and retrieve information from filenames.
Finding files¶
The main entry point of this package is the FileFinder class.
This is the object that finds files according to a regular expression.
An instance is created from the root directory containing the files and a pre-regular expression (abbreviated pre-regex) that is later transformed into a proper regex.
When asked to find files, the finder first creates a regular expression from the pre-regex.
It then recursively scans the root directory and its subfolders, descending no deeper than FileFinder.max_depth_scan folders (default is 3).
The finder only keeps files that match the regex.
The files can be retrieved using FileFinder.get_files().
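The scanning step described above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: the helper name `scan_files` and its depth convention (the root counts as depth 0) are assumptions made for the example.

```python
import os
import re


def scan_files(root, regex, max_depth=3):
    """Return sorted paths, relative to `root`, that fully match `regex`.

    Hypothetical helper written for illustration only.
    """
    pattern = re.compile(regex)
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root)
        depth = 0 if rel_dir == os.curdir else rel_dir.count(os.sep) + 1
        if depth >= max_depth:
            # Emptying dirnames in place stops os.walk from descending further.
            dirnames[:] = []
        for name in filenames:
            rel_path = os.path.relpath(os.path.join(dirpath, name), root)
            rel_path = rel_path.replace(os.sep, "/")
            # Only keep files whose whole relative path matches the regex.
            if pattern.fullmatch(rel_path):
                found.append(rel_path)
    return sorted(found)
```

Matching against the path relative to the root mirrors the way the pre-regex describes filenames relative to the root directory.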
Pre-regex¶
The pre-regex specifies the structure of the filenames relative to the root directory. It is a regular expression with the added feature of matchers.
A matcher is a part of the filename that varies from file to file.
In the pre-regex, it is enclosed in parentheses and preceded by ‘%’.
It is represented by the xarray_regex.matcher.Matcher class.
Warning
Anything outside matchers in the pre-regex will be considered constant across files. For example, if we have daily files ‘sst_2003-01-01.nc’ with the date changing for each file, we could use the regex ‘sst_.*.nc’, which would correctly match all files, but the finder would in fact consider that all files are ‘sst_2003-01-01.nc’ (the first file found). Using a matcher for the varying part, such as ‘sst_%(F)\.nc’, avoids this.
Multiple elements, separated by colons, can be given inside the matcher's parentheses:
a group name (optional)
a name that dictates the matcher regex through a correspondence table
a custom regex if the correspondences are not enough (optional)
a keyword that will discard that matcher when retrieving information from a filename (optional)
The full syntax is as follows: ‘%([group:]name[:custom=custom regex:][:discard])’.
Note
The matchers are uniquely identified by their index in the pre-regex (starting at 0).
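To make the matcher syntax concrete, here is a small sketch that extracts the elements of each matcher from a pre-regex. The pattern `MATCHER_RGX` and the function `parse_matchers` are hypothetical and are not taken from the package. The sketch also assumes that custom regexes contain no colon, and it does not handle escaped percent signs.

```python
import re

# Hypothetical pattern for '%([group:]name[:custom=custom regex:][:discard])'.
# The group is tried last (lazy '??') so that '%(x:discard)' is read as
# name='x' plus the discard keyword rather than group='x', name='discard'.
MATCHER_RGX = re.compile(
    r"%\("
    r"(?:(?P<group>\w+):)??"
    r"(?P<name>\w+)"
    r"(?::custom=(?P<custom>[^:]+):)?"
    r"(?::?(?P<discard>discard))?"
    r"\)"
)


def parse_matchers(pregex):
    """Return one dict per matcher, indexed in order of appearance."""
    matchers = []
    for idx, m in enumerate(MATCHER_RGX.finditer(pregex)):
        matchers.append({
            "idx": idx,  # index in the pre-regex, starting at 0
            "group": m.group("group"),
            "name": m.group("name"),
            "custom": m.group("custom"),
            "discard": m.group("discard") is not None,
        })
    return matchers


for matcher in parse_matchers(r"%(time:m)/sst_%(x)-%(x:discard)"):
    print(matcher)
```

Each matcher keeps its own index even when two matchers share a name, which is why the index is the only unique identifier.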
Name¶
The name of the matcher dictates the regex used for that matcher (unless overridden by a custom regex), and how it is used by functions that retrieve information from the filename.
The Matcher.NAME_RGX class attribute maps each name to its regex:
Name | Regex | Description
idx | \d* | Index
text | [a-zA-Z]* | Letters
char | \S* | Character
F | %Y-%m-%d | Date (YYYY-MM-DD)
x | %Y%m%d | Date (YYYYMMDD)
X | %H%M%S | Time (HHMMSS)
Y | \d\d\d\d | Year (YYYY)
m | \d\d | Month (MM)
d | \d\d | Day of month (DD)
j | \d\d\d | Day of year (DDD)
B | [a-zA-Z]* | Month name
H | \d\d | Hour 24 (HH)
M | \d\d | Minute (MM)
S | \d\d | Seconds (SS)
This table mostly follows the strftime format specifications.
For example, ‘%(Y)’ will be replaced by a regex matching 4 digits, and library.get_date will use it to find the year of the date.
A letter preceded by a percent sign ‘%’ in the regex is recursively replaced by the regex of the corresponding name in the table. This can be used in the custom regex. This still counts as a single matcher and its name will not be changed, only the regex. So ‘%x’ will be replaced by ‘%Y%m%d’, in turn replaced by ‘\d\d\d\d\d\d\d\d’. A percent character in the regex is escaped by another percent (‘%%’).
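The recursive replacement can be illustrated with a few lines of standard Python. The dictionary below is a hand-copied subset of the table above and the function `expand` is a hypothetical name. This is a sketch of the idea rather than the package's code.

```python
import re

# Hand-copied subset of the name/regex table (not the package's attribute).
NAME_RGX = {
    "F": "%Y-%m-%d",
    "x": "%Y%m%d",
    "X": "%H%M%S",
    "Y": r"\d\d\d\d",
    "m": r"\d\d",
    "d": r"\d\d",
    "H": r"\d\d",
    "M": r"\d\d",
    "S": r"\d\d",
}


def expand(rgx):
    """Recursively replace '%<letter>' by its regex; '%%' gives a literal '%'."""
    def substitute(match):
        letter = match.group(1)
        if letter == "%":
            return "%"
        # The replacement may itself contain '%<letter>', hence the recursion.
        return expand(NAME_RGX[letter])

    return re.sub(r"%([a-zA-Z%])", substitute, rgx)


print(expand("%x"))  # eight '\d' in a row: four for %Y, two for %m, two for %d
```

Because the substitution is done by a function, the backslashes in the returned regex are inserted literally and are not interpreted as replacement escapes.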
Custom regex¶
Not all use cases are covered by the NAME_RGX table, so one might want to use a specific regex:
sst_%(Y:custom=\d\d:)-%(doy:custom=\d\d\d:discard)
Warning
The custom regex must be terminated with a colon.
Discard keyword¶
Information can be retrieved from the matches in the filename, but a matcher can be discarded so that it is not used. For example, for a file of weekly averages whose filename indicates the start and end dates of the average, we might want to recover only the starting date:
sst_%(x)-%(x:discard)
Nesting files¶
Found files can be retrieved using FileFinder.get_files(). This outputs a list of all files (relative to the finder root, or as absolute paths), sorted alphabetically.
They can also be returned as a nested list of filenames.
This is meant to work with xarray.open_mfdataset(), which merges files in a specific order when supplied with a nested list of files.
To this end, group names must be passed to the nested argument of FileFinder.get_files(). The rightmost group corresponds to the innermost level.
An example is available in the examples.
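The idea of nesting by group can be sketched with plain regular expressions. The function `nest_by` below is hypothetical and handles a single group only. It uses named regex groups in place of the package's matcher groups, so it should be read as an illustration of the resulting structure and not as the package's algorithm.

```python
import re
from itertools import groupby


def nest_by(files, regex, group):
    """Split `files` into sub-lists sharing the same value for `group`."""
    pattern = re.compile(regex)

    def key(path):
        return pattern.fullmatch(path).group(group)

    # groupby only merges consecutive items, so sort on the key first.
    ordered = sorted(files, key=lambda path: (key(path), path))
    return [list(chunk) for _, chunk in groupby(ordered, key=key)]


files = ["SSH_20070101.nc", "SST_20070101.nc",
         "SSH_20070109.nc", "SST_20070109.nc"]
regex = r"(?P<variable>[A-Z]+)_(?P<time>\d{8})\.nc"

print(nest_by(files, regex, "variable"))
print(nest_by(files, regex, "time"))
```

Nesting by 'variable' yields one inner list per variable, each ordered in time, whereas nesting by 'time' yields one inner list per date.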
Retrieve information¶
As some metadata might only be found in the filenames, FileFinder offers the possibility to retrieve it easily using the FileFinder.get_matches() method.
A filename is matched against the regex of the finder, and a list of the matches found is returned.
The package supplies the function library.get_date to retrieve a datetime object from those matches:
from xarray_regex.library import get_date
matches = finder.get_matches(filename)
date = get_date(matches)
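To give an idea of what building a date from matches involves, here is a standalone sketch. It does not reproduce library.get_date: the function name `date_from_matches`, the tuple layout of the matches, and the rule that the first non-discarded occurrence of a name wins are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def date_from_matches(matches):
    """Build a datetime from (name, text, discard) tuples.

    Hypothetical layout: `name` follows the table of matcher names,
    `text` is the matched string and `discard` flags matchers to ignore.
    """
    parts = {}
    for name, text, discard in matches:
        # Skip discarded matchers; keep the first occurrence of each name.
        if not discard and name not in parts:
            parts[name] = text

    year = int(parts["Y"])
    if "j" in parts:
        # Day of year: the first day of the year is day 1.
        return datetime(year, 1, 1) + timedelta(days=int(parts["j"]) - 1)
    return datetime(year, int(parts.get("m", 1)), int(parts.get("d", 1)))


# Matches for a weekly file with a start date and a discarded end day.
matches = [("Y", "2007", False), ("j", "032", False),
           ("Y", "2007", False), ("j", "039", True)]
print(date_from_matches(matches))
```

Discarding the end-date matcher is what keeps the start date unambiguous here.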
Combine with Xarray¶
Retrieving information can be useful when opening multiple files with xarray.open_mfdataset().
FileFinder.get_func_process_filename() turns a function into a callable suitable for the preprocess argument of xarray.open_mfdataset.
The function should take an xarray.Dataset, a filename, a FileFinder, and any additional arguments as input, and return an xarray.Dataset.
This makes the finder and the dataset filename available during pre-processing.
The following example shows how to add a time dimension, using the filename to find the timestamp:
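import pandas as pd
import xarray as xr
from xarray_regex import library

def preprocess(ds, filename, finder):
    # Recover the date from the filename and use it as the time coordinate
    matches = finder.get_matches(filename)
    date = library.get_date(matches)
    ds = ds.assign_coords(time=pd.to_datetime([date]))
    return ds

ds = xr.open_mfdataset(finder.get_files(),
                       preprocess=finder.get_func_process_filename(preprocess))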
Note
The filename path sent to the function is automatically made relative to the finder root directory, so that it can be used directly with FileFinder.get_matches().
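The wrapping performed for the preprocess argument can be pictured as a closure around the user function. The sketch below is an assumption about how such a wrapper could work, not the package's code: `make_preprocess` is a hypothetical name, and it relies on xarray storing the path of an opened file in `ds.encoding["source"]`. A stand-in object replaces the dataset so that the example runs without xarray.

```python
import os
from types import SimpleNamespace


def make_preprocess(func, finder, root, **kwargs):
    """Wrap `func(ds, filename, finder, **kwargs)` into a one-argument callable."""
    def wrapped(ds):
        # Make the file path relative to the finder root before passing it on.
        filename = os.path.relpath(ds.encoding["source"], root)
        return func(ds, filename, finder, **kwargs)
    return wrapped


def tag(ds, filename, finder, prefix=""):
    """Toy pre-processing function: record the filename on the dataset."""
    ds.name = prefix + filename
    return ds


# Stand-in for a dataset opened from 'Data/SST/A_2007001.nc'.
root = os.path.join("Data", "SST")
fake_ds = SimpleNamespace(encoding={"source": os.path.join(root, "A_2007001.nc")})

preprocess = make_preprocess(tag, finder=None, root=root, prefix="file:")
print(preprocess(fake_ds).name)
```

xarray.open_mfdataset calls the preprocess callable with a single dataset argument, which is why the filename and finder have to be captured by the closure.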
Fix matchers¶
The package makes it easy to change the regular expression dynamically. This is done by replacing matchers in the regular expression with a given string, using the FileFinder.fix_matcher() method.
Matchers to replace can be selected by their index in the pre-regex (starting from 0), by their name, or by their group and name following the syntax ‘group:name’. When using a matcher name or group and name, multiple matchers can be fixed to the same value at once.
For instance, when using the following pre-regex:
'%(time:m)/SST_%(time:Y)%(time:m)%(time:d)\.nc'
we can keep only the files corresponding to January using any of:
finder.fix_matcher(0, '01')
finder.fix_matcher('m', '01')
finder.fix_matcher('time:m', '01')
We could also select specific days using a regular expression:
finder.fix_matcher('d', '01|03|05|07')
This would create the following regular expression:
'(\d\d)/SST_(\d\d\d\d)(\d\d)(01|03|05|07)\.nc'
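The effect of fixing matchers on the final regex can be reproduced with a short sketch. The pre-regex is split by hand into literal text and matchers, and `build_regex` is a hypothetical helper written for this illustration. The package's internals may be organised differently.

```python
import re

# '%(time:m)/SST_%(time:Y)%(time:m)%(time:d)\.nc' split by hand into
# literal text (str) and matchers given as (group, name, regex).
SEGMENTS = [
    ("time", "m", r"\d\d"),
    "/SST_",
    ("time", "Y", r"\d\d\d\d"),
    ("time", "m", r"\d\d"),
    ("time", "d", r"\d\d"),
    r"\.nc",
]


def build_regex(segments, fixes):
    """`fixes` maps an index, a name or 'group:name' to a replacement string."""
    parts = []
    idx = 0  # index of the current matcher in the pre-regex
    for segment in segments:
        if isinstance(segment, str):
            parts.append(segment)
            continue
        group, name, rgx = segment
        for key in (idx, name, f"{group}:{name}"):
            if key in fixes:
                rgx = fixes[key]
                break
        parts.append(f"({rgx})")
        idx += 1
    return "".join(parts)


regex = build_regex(SEGMENTS, {"d": "01|03|05|07"})
print(regex)
print(bool(re.fullmatch(regex, "01/SST_20070105.nc")))
```

In this sketch, fixing by index touches a single matcher, while fixing by name or by group and name touches every matcher that shares it.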
Examples¶
Plain time series¶
Here the files are all in the same folder. Only the timestamp differs from one file to the next:
Data
├── SSH
│ ├── SSH_20070101.nc
│ ├── SSH_20070109.nc
│ └── ...
└── SST
├── A_2007001_2007008.L3m_8D_sst.nc
├── A_2007008_2007016.L3m_8D_sst.nc
└── ...
We will scan for SST files:
from xarray_regex import FileFinder, library
root = 'Data/SST'
pregex = 'A_%(Y)%(j)_%(Y)%(j:discard)%(suffix)'
finder = FileFinder(root, pregex, suffix=r'\.L3m_8D_sst\.nc')
files = finder.get_files()
We would like to open all these files using Xarray, but the files lack a defined ‘time’ dimension along which to concatenate them. To make it work, we can use the ‘preprocess’ argument of xarray.open_mfdataset:
import pandas as pd
import xarray as xr

def preprocess(ds, filename, finder):
    # Recover the date from the filename and use it as the time coordinate
    matches = finder.get_matches(filename)
    date = library.get_date(matches)
    ds = ds.assign_coords(time=pd.to_datetime([date]))
    return ds

ds = xr.open_mfdataset(files,
                       preprocess=finder.get_func_process_filename(preprocess))
Nested files¶
We can scan both variables at the same time but retrieve the files as a nested list. We assume the filenames for both variables are structured in the same way. Groups in the pre-regex define which matchers are grouped together:
pregex = '%(variable:char)/%(variable:char)_%(time:Y)%(time:m)%(time:d)\.nc'
We can now group the files by variable or time:
>>> finder.get_files(relative=True, nested=['variable'])
[['SSH_20070101.nc',
'SSH_20070109.nc',
...],
['SST_20070101.nc',
'SST_20070109.nc',
...]]
>>> finder.get_files(relative=True, nested=['time'])
[['SSH_20070101.nc', 'SST_20070101.nc'],
['SSH_20070109.nc', 'SST_20070109.nc'],
...]
This works for any number of groups in any order.
API¶
Content
Find files using a regular expression.
Submodules
Find files using a pre-regex.
Functions to retrieve values from filename.
Matcher object.
Source code: https://github.com/Descanonge/xarray-regex