xarray_regex.file_finder

Find files using a pre-regex.

Classes

FileFinder(root, pregex, **replacements)

Find files using a regular expression.

class xarray_regex.file_finder.FileFinder(root: str, pregex: str, **replacements: str)

Bases: object

Find files using a regular expression.

Provides abilities to ‘fix’ some part of the regular expression, to retrieve values from matches in the expression, and to create an advanced pre-processing function for xarray.open_mfdataset.

Parameters
  • root (str) – The root directory of a filetree where all files can be found.

  • pregex (str) – The pre-regex. A regular expression with added ‘Matchers’. Only the matchers vary from file to file. See documentation for details.

  • replacements (str, optional) – Matchers to replace by a string: ‘matcher name’ = ‘replacement string’.

max_depth_scan

Maximum authorized depth when descending into filetree to scan files.

Type

int

root

The root directory of the finder.

Type

str

pregex

Pre-regex.

Type

str

regex

Regex obtained from the pre-regex.

Type

str

pattern

Compiled pattern obtained from the regex.

Type

re.Pattern

matchers

List of matchers for this finder, in order.

Type

list of Matchers

segments

Segments of the pre-regex. Used to replace specific matchers. [‘text before matcher 1’, ‘matcher 1’, ‘text before matcher 2’, ‘matcher 2’, …]

Type

list of str
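To make the role of the alternating segments list concrete, here is a minimal sketch of how a regex could be rebuilt from it when some matchers are fixed. It is not the library's implementation: the function `join_segments`, the placeholder matcher strings, and the choice to escape fixed values are all assumptions made for illustration.

```python
import re


def join_segments(segments, matcher_regexes, fixed):
    """Rebuild a regex from [text, matcher, text, matcher, ...] segments.

    segments: alternating pre-regex text and matcher placeholders
    matcher_regexes: regex of each matcher, by matcher index
    fixed: {matcher index: replacement string}
    """
    parts = []
    for i, segment in enumerate(segments):
        if i % 2 == 0:
            # Even positions hold plain pre-regex text, kept unchanged.
            parts.append(segment)
            continue
        index = i // 2  # odd positions hold matcher number `index`
        if index in fixed:
            # Assumption: a fixed value is matched literally.
            parts.append(re.escape(fixed[index]))
        else:
            parts.append("(" + matcher_regexes[index] + ")")
    return "".join(parts)


# Hypothetical placeholders stand in for the real matcher syntax.
segments = ["sst_", "<year>", "-", "<month>", r"\.nc"]
regex = join_segments(segments, [r"\d{4}", r"\d{2}"], {0: "2007"})
```

Here the first matcher is fixed to `2007`, so only the month remains a capturing group in the resulting regex.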

fixed_matchers

Dictionary of matchers with a set value. ‘matcher index’: ‘replacement string’

Type

dict

files

List of scanned files.

Type

list of str

scanned

Whether the finder has scanned files.

Type

bool

create_regex()

Create regex from pre-regex.

find_files()

Find files to scan.

Uses os.walk. Limits the search to max_depth_scan levels of directories. Sorts files alphabetically.

Raises
  • AttributeError – If no regex is set.

  • IndexError – If no files are found in the filetree.
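The depth-limited walk described for find_files() can be sketched with the standard library alone. This is an illustrative stand-in, not the library's code: the name `scan_files`, the use of `fullmatch` on the relative path, and the exact meaning of the depth limit are assumptions.

```python
import os
import re
import tempfile


def scan_files(root, pattern, max_depth=3):
    """Return sorted paths (relative to root) that fully match pattern.

    Directories more than max_depth levels below root are not visited.
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root)
        depth = 0 if rel_dir == os.curdir else rel_dir.count(os.sep) + 1
        if depth >= max_depth:
            # Emptying dirnames in place stops os.walk from descending further.
            dirnames[:] = []
        for name in filenames:
            rel_path = os.path.normpath(os.path.join(rel_dir, name))
            if pattern.fullmatch(rel_path):
                found.append(rel_path)
    return sorted(found)


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    for rel in ["2007/sst_02.nc", "2007/sst_01.nc",
                "2007/notes.txt", "a/b/c/sst_01.nc"]:
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    pattern = re.compile(r".*sst_\d{2}\.nc")
    shallow = scan_files(root, pattern, max_depth=1)
    deep = scan_files(root, pattern, max_depth=3)
```

With `max_depth=1` the file three directories down is never reached, while `max_depth=3` picks it up.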

fix_matcher(key: Union[int, str], value: str)

Fix a matcher to a string.

Parameters
  • key (int, or str, or tuple of str of length 2) – If int, it is the matcher index, starting at 0. If str, it can be a matcher name, or a group and name combination with the syntax ‘group:name’. When using strings, if multiple matchers are found with the same name or group/name combination, all are fixed to the same value.

  • value (str) – Will replace the match for all files.

Raises
  • TypeError – Value must be a string.

  • TypeError – key is neither int nor str.
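The key rules above (integer index, name, or ‘group:name’) can be illustrated with a small resolver. This is a sketch under assumptions: `Matcher` here is a hypothetical stand-in tuple, not the library's Matcher class, and `resolve_key` is an invented helper name.

```python
from typing import List, NamedTuple, Optional, Union


class Matcher(NamedTuple):
    """Hypothetical stand-in holding only what the lookup needs."""
    idx: int
    group: Optional[str]
    name: str


def resolve_key(matchers: List[Matcher], key: Union[int, str]) -> List[int]:
    """Return the indices of all matchers selected by key."""
    if isinstance(key, int):
        return [key]
    if not isinstance(key, str):
        raise TypeError("key must be int or str")
    # 'group:name' splits on the last colon; a bare name has no separator.
    group, sep, name = key.rpartition(":")
    selected = [
        m.idx for m in matchers
        if m.name == name and (not sep or m.group == group)
    ]
    if not selected:
        raise KeyError(f"No matcher found for key {key!r}")
    return selected


matchers = [
    Matcher(0, "time", "Y"),
    Matcher(1, "time", "m"),
    Matcher(2, None, "Y"),
]
```

A bare name such as `"Y"` selects every matcher with that name, while `"time:Y"` narrows the selection to one group.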

fix_matchers(fixes: Optional[Dict[Union[int, str], str]] = None)

Fix multiple values at once.

Parameters

fixes (dict, optional) – Dictionary of matcher key: value pairs. See fix_matcher() for details. If None, no matcher will be fixed.

get_files(relative: bool = False, nested: Optional[List[str]] = None) → List[str]

Return files that match the regex.

Lazily scan files: if files were already scanned, just return the stored list of files.

Parameters
  • relative (bool) – If True, filenames are returned relative to the finder root directory. If not, filenames are absolute. Defaults to False.

  • nested (list of str) – If not None, return a nested list of filenames with each level corresponding to a group in this argument. The last group in the list is at the innermost level. A level specified as None refers to matchers without a group.

Raises

KeyError – A level in nested is not in the pre-regex groups.
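One way to picture the nested output is as repeated grouping by the matched value of each group, outermost group first. The sketch below is an assumption about the resulting shape, not the library's algorithm: `nest_files` and the `(filename, {group: value})` entry format are invented for illustration.

```python
from itertools import groupby


def nest_files(entries, groups):
    """Nest filenames by group value, last group innermost.

    entries: list of (filename, {group: matched value}) pairs
    groups: group names, outermost level first
    """
    if len(groups) <= 1:
        # Innermost level: a flat list of filenames.
        return [filename for filename, _ in entries]
    head, rest = groups[0], groups[1:]

    def key(entry):
        return entry[1][head]

    # groupby only merges adjacent items, so sort by the same key first.
    ordered = sorted(entries, key=key)
    return [nest_files(list(chunk), rest) for _, chunk in groupby(ordered, key=key)]


entries = [
    ("sst_2007_01.nc", {"param": "sst", "time": "2007_01"}),
    ("sst_2007_02.nc", {"param": "sst", "time": "2007_02"}),
    ("chl_2007_01.nc", {"param": "chl", "time": "2007_01"}),
    ("chl_2007_02.nc", {"param": "chl", "time": "2007_02"}),
]
nested = nest_files(entries, ["param", "time"])
```

A list nested this way has the shape expected by xarray.open_mfdataset when called with combine='nested' and one concat_dim entry per level.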

get_func_process_filename(func: Callable, relative: bool = True, *args, **kwargs) → Callable

Get a function that can preprocess a dataset.

Written to be used as the ‘preprocess’ argument of xarray.open_mfdataset. Allows using a function with additional arguments that can retrieve information from the filename.

Parameters
  • func (Callable) – Called with the arguments (xarray.Dataset, filename: str, FileFinder, *args, **kwargs). Should return a Dataset. The filename is retrieved from the dataset encoding attribute.

  • relative (bool) – If True, the filename is made relative to the finder root. This is necessary to match the filename against the finder regex. Defaults to True.

  • args (optional) – Passed to func when called.

  • kwargs (optional) – Passed to func when called.

Returns

Function with the signature of the ‘preprocess’ argument of xarray.open_mfdataset.

Return type

Callable

Examples

This retrieves the date from the filename and adds a time dimension to the dataset with the corresponding value.

>>> import xarray as xr
>>> from xarray_regex import library
>>> def process(ds, filename, finder, default_date=None):
...     matches = finder.get_matches(filename)
...     date = library.get_date(matches, default_date=default_date)
...     ds = ds.assign_coords(time=[date])
...     return ds
...
>>> ds = xr.open_mfdataset(finder.get_files(),
...                        preprocess=finder.get_func_process_filename(
...                            process, default_date={'hour': 12}))
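The wrapping itself can be pictured as a closure that reads the filename from the dataset encoding and forwards everything to the user function. The following is a sketch, not the library's code: `make_preprocess` is a hypothetical name, the `"source"` encoding key is where xarray stores the path of an opened file, and plain namespace objects stand in for the dataset and the finder so the snippet runs without xarray.

```python
import os
from types import SimpleNamespace


def make_preprocess(func, finder, relative=True, *args, **kwargs):
    """Return a one-argument function suitable as a preprocess callback."""
    def preprocess(ds):
        filename = ds.encoding["source"]
        if relative:
            # The finder regex is written relative to the root directory.
            filename = os.path.relpath(filename, finder.root)
        return func(ds, filename, finder, *args, **kwargs)
    return preprocess


def process(ds, filename, finder, default_date=None):
    """Toy user function: record what it received instead of editing ds."""
    ds.seen = (filename, default_date)
    return ds


# Stand-ins for a FileFinder and an opened dataset (assumptions).
finder = SimpleNamespace(root=os.path.join(os.sep, "data"))
ds = SimpleNamespace(
    encoding={"source": os.path.join(os.sep, "data", "2007", "sst_01.nc")}
)
preprocess = make_preprocess(process, finder, default_date={"hour": 12})
result = preprocess(ds)
```

The returned callable takes only the dataset, which is the call signature that xarray.open_mfdataset uses for its preprocess argument.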

get_matchers(key: str) → List[xarray_regex.matcher.Matcher]

Return list of matchers corresponding to key.

Parameters

key (str) – Can be matcher name, or group+name combination with the syntax: ‘group:name’.

Raises

KeyError – No matcher found.

get_matches(filename: str, relative: bool = True) → Dict[str, Dict]

Get matches for a given filename.

Apply regex to filename and return a dictionary of the results.

Parameters
  • filename (str) – Filename to retrieve matches from.

  • relative (bool) – True if the filename is relative to the finder root directory. If False, the filename is made relative before being matched. Defaults to True.

Returns

[{‘match’: string matched, ‘start’: start index in filename, ‘end’: end index in filename, ‘matcher’: Matcher object}, …]

Return type

list of dict

Raises
  • AttributeError – The regex is empty.

  • ValueError – The filename did not match the pattern.

  • IndexError – Not as many matches as matchers.
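The documented return structure maps naturally onto the capturing groups of a compiled pattern. The sketch below shows that mapping with the standard re module only. It is an assumption about the mechanics, not the library's code: `collect_matches` is an invented name, and plain strings stand in for Matcher objects.

```python
import re


def collect_matches(pattern, filename, matchers):
    """Return one dict per matcher with the matched text and its span."""
    found = pattern.fullmatch(filename)
    if found is None:
        raise ValueError("Filename did not match the pattern.")
    if len(found.groups()) != len(matchers):
        raise IndexError("Not as many matches as matchers.")
    return [
        {
            "match": found.group(i + 1),   # regex groups are 1-indexed
            "start": found.start(i + 1),
            "end": found.end(i + 1),
            "matcher": matcher,
        }
        for i, matcher in enumerate(matchers)
    ]


pattern = re.compile(r"sst_(\d{4})-(\d{2})\.nc")
matches = collect_matches(pattern, "sst_2007-03.nc", ["year", "month"])
```

The start and end values follow Python slicing conventions, so `filename[start:end]` gives back the matched string.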

property n_matchers

Number of matchers in pre-regex.

scan_pregex()

Scan pregex for matchers.

Add matcher objects to self. Set the segments attribute.

set_pregex(pregex: str, **replacements: str)

Set pre-regex.

Apply replacements.

update_regex()

Update regex.

Set fixed matchers. Re-compile pattern. Discard previous scan results.