Overview of the inverseRegex Package

Jasper Watson

2022-10-23

inverseRegex

The inverseRegex package allows users to reverse engineer regular expression patterns for R objects. Individual characters that make up an object are categorised into common groups and encoded into run-lengths. For example, the phrase “Hello World!” can be translated to "[[:upper:]][[:lower:]]{4} [[:upper:]][[:lower:]]{4}!".
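The idea of classifying characters and then run-length encoding them can be sketched in a few lines of base R. This is only an illustration of the concept, not the package's implementation: sketchPattern is a hypothetical helper, and it handles just the three default groups, leaving every other character as is.

```r
# Hypothetical helper, not part of the inverseRegex package.
# Step 1: map each character to a POSIX class (or keep it unchanged).
# Step 2: collapse consecutive repeats with rle() and add a {n} quantifier.
sketchPattern <- function(s) {
  chars <- strsplit(s, "", fixed = TRUE)[[1]]
  classes <- vapply(chars, function(ch) {
    if (grepl("^[[:digit:]]$", ch)) {
      "[[:digit:]]"
    } else if (grepl("^[[:upper:]]$", ch)) {
      "[[:upper:]]"
    } else if (grepl("^[[:lower:]]$", ch)) {
      "[[:lower:]]"
    } else {
      ch  # any other character is left as is
    }
  }, character(1), USE.NAMES = FALSE)
  runs <- rle(classes)
  # Runs of length 1 get no quantifier; longer runs get {n}
  quantifiers <- ifelse(runs$lengths > 1, paste0("{", runs$lengths, "}"), "")
  paste0(runs$values, quantifiers, collapse = "")
}

helloPattern <- sketchPattern("Hello World!")
datePattern <- sketchPattern("2022-10-23")
print(helloPattern)
# "[[:upper:]][[:lower:]]{4} [[:upper:]][[:lower:]]{4}!"
print(datePattern)
# "[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}"
```

Unlike the package, this sketch always writes the exact run length and does not escape punctuation.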

This could be useful to summarise a dataset without viewing all individual entries, or to aid in data cleaning. One could check that every entry in a column of dates follows an “nnnn-nn-nn” format, or that a column of strings consists entirely of alphabetic characters (no zeros entered instead of the letter O, for example).

Usage

The main function to use is inverseRegex(x) which will identify the different characters that make up the input object x. The different groups that can be identified are

- '[[:digit:]]'
- '[[:lower:]]'
- '[[:upper:]]'
- '[[:alpha:]]'
- '[[:alnum:]]'
- '[[:space:]]'
- '[[:punct:]]'

See ?regex for an explanation of their meanings.

By default, the only groups that will be identified are [[:digit:]], [[:upper:]], and [[:lower:]], with any other characters being left as is. This can be altered with the following arguments:

Some examples of these arguments are below:

Users can also specify the different run lengths that will be identified. The inverseRegex function has an argument called numbersToKeep which allows the user to specify what lengths of repeated sequences should be identified explicitly. The default value is c(2, 3, 4, 5, 10). Run lengths not requested will be identified with a +.
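The rule described above can be sketched as a small base R function. sketchQuantifier is a hypothetical helper, not a function from the package, and it assumes that a run of length 1 gets no quantifier at all, which is how the “Hello World!” example in the overview reads.

```r
# Hypothetical helper illustrating the numbersToKeep rule; not the package's code.
# Assumption: a single character gets no quantifier.
sketchQuantifier <- function(runLength, numbersToKeep = c(2, 3, 4, 5, 10)) {
  if (runLength == 1) {
    ""                              # single character: no quantifier
  } else if (runLength %in% numbersToKeep) {
    paste0("{", runLength, "}")     # requested length: state it explicitly
  } else {
    "+"                             # any other length: one or more
  }
}

quantifiers <- vapply(c(1, 4, 7, 10), sketchQuantifier, character(1))
print(quantifiers)
# "" "{4}" "+" "{10}"
```

With the default numbersToKeep, a run of seven digits would therefore read as [[:digit:]]+ while a run of four would read as [[:digit:]]{4}.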

Non-character Inputs

Many objects with a class other than character are supported, including logical, integer, numeric, Date, POSIXct, factor, matrix, data.frame, and tibble. All of them (except logical) are first converted to character, and the resulting regex patterns are then returned either as a character vector or, if the input was a matrix, data frame, or tibble, as an object of the same class as the input. See ?inverseRegex for a full description of how they are treated. Users who need a different character conversion method can apply it themselves before calling inverseRegex.

Special mention of numerics and data frames will be given here:

Inputs of Class numeric

An attempt has been made to convert numeric values into characters as directly as possible without losing or adding any information. When passed a numeric vector, inverseRegex will convert it to character using: vapply(x, format, character(1), nsmall = 1). This forces at least one decimal place for every entry but does not add extra decimal places beyond that unless they were present in the individual input element; it will, however, remove trailing decimal zeros. For example:
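A short base R illustration of that conversion, using only the expression quoted above; the input values are made up for the example.

```r
# Element-wise conversion: each value is formatted on its own,
# so the number of decimal places can differ between entries.
x <- c(1, 2.5, 3.14159, 1.50)
converted <- vapply(x, format, character(1), nsmall = 1)
print(converted)
# "1.0" "2.5" "3.14159" "1.5"
```

The integer-valued 1 gains one decimal place, 3.14159 keeps all of its digits, and the trailing zero of 1.50 is dropped.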

Numerics are treated differently if they are present in a matrix, data frame, or tibble. If a matrix has a mode of numeric, the entire object will be converted to character using trimws(format(x)). For data frames and tibbles, each column of type numeric will be converted using trimws(format(x)). This means that, unlike the numeric vectors described above, all numeric entries in a matrix, or in a given column of a data frame or tibble, will have the same number of decimal places.
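The difference from the element-wise conversion can be seen by applying trimws(format(x)) to a whole column at once. The data frame below is a made-up example.

```r
# format() on a whole vector pads every entry to a common number of
# decimal places; trimws() then strips the leading padding spaces.
df <- data.frame(value = c(1, 2.5, 3.14159), size = c(10, 2.5, 0.5))
valueChar <- trimws(format(df$value))
sizeChar <- trimws(format(df$size))
print(valueChar)
# "1.00000" "2.50000" "3.14159"
print(sizeChar)
# "10.0" "2.5" "0.5"
```

Each column is formatted separately, so the number of decimal places is shared within a column but can differ between columns.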

Identifying Rare Patterns

One of the main use cases of the package is to identify irregular entries in a dataset. To this end, there is a function occurrencesLessThan, which calls inverseRegex and returns logical values, with TRUE marking the location of any regex pattern that occurs less than a certain number of times.

What constitutes a “rare” pattern can be specified with the fraction or n arguments. See ?occurrencesLessThan for a full description.
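The counting step behind this idea can be sketched in base R. sketchRare is a hypothetical helper that works on a vector of already-computed patterns; it is not occurrencesLessThan itself, and its signature is an assumption made for illustration only.

```r
# Hypothetical helper: flag patterns that occur fewer than n times.
# Not the package's implementation.
sketchRare <- function(patterns, n) {
  counts <- table(patterns)
  # Look up each entry's own pattern count, then compare with the threshold
  as.vector(counts[patterns]) < n
}

patterns <- c("[[:digit:]]{4}", "[[:digit:]]{4}",
              "[[:digit:]]{3}[[:upper:]]", "[[:digit:]]{4}")
isRare <- sketchRare(patterns, n = 2)
print(isRare)
# FALSE FALSE TRUE FALSE
```

A fraction-based threshold could be sketched the same way by comparing the counts divided by length(patterns) against the chosen fraction.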