CWB

regopt.c File Reference

The CL_Regex object, and the CL Regular Expression Optimiser.

#include "globals.h"
#include "regopt.h"


Detailed Description

The CL_Regex object, and the CL Regular Expression Optimiser.

This is the CL front-end to POSIX regular expressions with CL semantics (most notably: CL regexes always match the entire string and NOT substrings.)

Note that the optimiser is handled automatically by the CL_Regex object.

All variables / functions containing "regopt" are internal to this module and are not exported in the CL API.

Optimisation is done by means of "grains". The grain array in a CL_Regex object is a list of short strings. Any string which will match the regex must contain at least one of these. Thus, the grains provide a quick way of filtering out strings that definitely WON'T match, and avoiding a time-wasting call to the POSIX regex matching function.

While a regex is being optimised, the grains are stored in non-exported global variables in this module. Subsequently they are transferred to members of the CL_Regex object with which they are associated. The use of global variables and a fixed-size buffer for grains is partly due to historical reasons, but it also serves to reduce memory allocation overhead.
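The filtering idea behind grains can be illustrated with a minimal sketch. It is not the module's implementation: the helper name sketch_contains_grain is hypothetical, and a plain strstr() scan stands in for the Boyer-Moore search the module uses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of the CL API): returns 1 if `str`
 * contains at least one of the `n` grains, 0 otherwise. */
static int sketch_contains_grain(const char *str, const char **grains, size_t n)
{
  size_t i;

  assert(str != NULL && grains != NULL);
  for (i = 0; i < n; i++) {
    if (strstr(str, grains[i]) != NULL)
      return 1;   /* possible match: the regex engine still has to decide */
  }
  return 0;       /* no grain present: the regex cannot match this string */
}
```

A caller would run the full regex match only when this check returns 1; a return value of 0 is the case in which the expensive call is skipped.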


Function Documentation

void cl_delete_regex ( CL_Regex  rx)

Deletes a CL_Regex object.

Note that we use cl_free to deallocate the internal PCRE buffers, not pcre_free, for the simple reason that pcre_free is just a function pointer that will normally contain free, and thus we miss out on the checking that cl_free provides.

Parameters:
rx    The CL_Regex to delete.

References cl_free, _CL_Regex::extra, _CL_Regex::grain, _CL_Regex::grains, _CL_Regex::haystack_buf, and _CL_Regex::needle.

Referenced by cl_regex2id(), free_booltree(), and free_environment().

CL_Regex cl_new_regex ( char *  regex,
int  flags,
CorpusCharset  charset 
)

Creates a new CL_Regex object (i.e. a regular expression buffer).

The regular expression is preprocessed according to the flags, and anchored to the start and end of the string. (That is, ^ is added to the start, $ to the end.)

Then the resulting regex is compiled (using PCRE) and optimised.

Parameters:
regex    String containing the regular expression.
flags    IGNORE_CASE, or IGNORE_DIAC, or both, or 0.
charset    The character set of the regex.
Returns:
The new CL_Regex object, or NULL in case of error.

References CDA_EBADREGEX, CDA_OK, charset, _CL_Regex::charset, cl_debug, cl_errno, cl_free, cl_malloc(), CL_MAX_LINE_LENGTH, cl_regex_error, cl_regopt_analyse(), cl_string_canonical(), cl_string_latex2iso(), _CL_Regex::extra, _CL_Regex::flags, _CL_Regex::grains, _CL_Regex::haystack_buf, IGNORE_CASE, IGNORE_DIAC, _CL_Regex::needle, regopt_data_copy_to_regex_object(), and utf8.

Referenced by cl_regex2id(), do_flagged_string(), do_XMLTag(), main(), and scancorpus_add_key().
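The effect of anchoring (whole-string matching rather than substring matching) can be sketched as follows. This is an illustration under stated assumptions, not the code of cl_new_regex(): it uses the POSIX regex.h interface as a stand-in for PCRE, the helper name sketch_match_whole is hypothetical, and the grouping parentheses around the pattern are an assumption made here so that a top-level alternation is anchored as a whole.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the whole of `str` matches `pattern`,
 * 0 if it does not, and -1 if allocation or compilation fails. */
static int sketch_match_whole(const char *pattern, const char *str)
{
  regex_t re;
  size_t len;
  char *anchored;
  int rc;

  assert(pattern != NULL && str != NULL);
  len = strlen(pattern) + 5;               /* "^(" + ")$" + terminating NUL */
  anchored = malloc(len);
  if (anchored == NULL)
    return -1;
  snprintf(anchored, len, "^(%s)$", pattern);

  rc = regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB);
  free(anchored);
  if (rc != 0)
    return -1;                             /* pattern did not compile */

  rc = regexec(&re, str, 0, NULL, 0);
  regfree(&re);
  return rc == 0;
}
```

With this sketch the pattern "b" does not match the string "abc", because the anchors require the pattern to cover the entire string.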

int cl_regex_match ( CL_Regex  rx,
char *  str 
)

Matches a regular expression against a string.

The regular expression contained in the CL_Regex is compared to the string. No settings or flags are passed to this function; rather, the settings that rx was created with are used.

Parameters:
rx    The regular expression to match.
str    The string to compare the regex to.
Returns:
Boolean: true if the regex matched, otherwise false.

References _CL_Regex::anchor_end, _CL_Regex::anchor_start, _CL_Regex::charset, cl_debug, cl_regopt_successes, cl_string_canonical(), _CL_Regex::extra, _CL_Regex::flags, _CL_Regex::grain, _CL_Regex::grain_len, _CL_Regex::grains, _CL_Regex::haystack_buf, _CL_Regex::jumptable, and _CL_Regex::needle.

Referenced by cl_regex2id(), eval_bool(), eval_constraint(), is_regular(), main(), and matchfirstpattern().

int cl_regex_optimised ( CL_Regex  rx)

Finds the level of optimisation of a CL_Regex.

This function returns the approximate level of optimisation, computed from the ratio of grain length to number of grains (0 = no grains, ergo not optimised at all).

Parameters:
rx    The CL_Regex to check.
Returns:
0 if rx is not optimised; otherwise an integer indicating optimisation level.

References _CL_Regex::grain_len, and _CL_Regex::grains.

Referenced by cl_regex2id().

int cl_regopt_analyse ( char *  regex)

Analyses a regular expression and tries to find the best set of grains.

Part of the regex optimiser. For a given regular expression, this function will try to extract a set of grains from regular expression {regex_string}. These grains are then used by the CL regex matcher and cl_regex2id() for faster regular expression search.

If successful, this function returns true and stores the grains in the optimiser's global variables (from which they should be copied to a CL_Regex object's corresponding members).

Usage: optimised = cl_regopt_analyse(regex_string);

This is a non-exported function.

Parameters:
regex    String containing the regex to optimise.
Returns:
Boolean: true = ok, false = couldn't optimise regex.

References buf, cl_debug, cl_regopt_anchor_end, cl_regopt_anchor_start, cl_regopt_grain, cl_regopt_grain_len, cl_regopt_grains, grain_buffer, grain_buffer_grains, local_grain_data, make_jump_table(), read_disjunction(), read_grain(), read_kleene(), read_wildcard(), and update_grain_buffer().

Referenced by cl_new_regex().

int cl_regopt_count_get ( void  )

Get a reading from the "success counter" for optimised regexes.

The counter is incremented by 1 every time the "grain" system is used successfully to avoid calling PCRE. That is, it is incremented every time a string is scrutinised and found to contain none of the grains.

Usage:

cl_regopt_count_reset();

for (i = 0, hits = 0; i < n; i++) if (cl_regex_match(rx, haystacks[i])) hits++;

fprintf(stderr, "Found %d matches; avoided regex matching %d times out of %d trials", hits, cl_regopt_count_get(), n );

See also:
cl_regopt_count_reset
Returns:
an integer indicating the number of times a regular expression has been matched using the regopt system of "grains", rather than by calling an external regex library.

References cl_regopt_successes.

Referenced by cl_regex2id().

void cl_regopt_count_reset ( void  )

Reset the "success counter" for optimised regexes.

References cl_regopt_successes.

Referenced by cl_regex2id().

int is_safe_char ( unsigned char  c)

Is the given character a 'safe' character which will only match itself in a regex?

What counts as safe: A to Z, a to z, 0 to 9, minus, quote marks, percent, ampersand, slashes, excl mark, colon, semi colon, character, underscore, any value over 0x7f.

What counts as not safe therefore includes: brackets, braces, square brackets; question mark, plus, and star; circumflex and dollar sign; dot; hash; etc.

(But, in UTF8, Unicode PUNC area equivalents of these characters will be safe.)

Parameters:
c    The character (cast to unsigned for the comparison).
Returns:
True for non-special characters; false for special characters.

Referenced by read_grain(), and read_matchall().
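A rough sketch of such a test, written from the list above rather than from the module's source, might look like this. The names sketch_is_safe_char and sketch_all_safe are hypothetical. Two readings are assumptions made here: "slashes" is taken to mean the forward slash only (a backslash is a regex metacharacter), and only the characters that are unambiguously named in the list are included.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical re-implementation based on the documented list. */
static int sketch_is_safe_char(unsigned char c)
{
  if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
    return 1;
  if (c > 0x7f)
    return 1;                 /* non-ASCII byte (Latin-1 or part of UTF-8) */
  /* c != '\0' is needed because strchr() also finds the terminating NUL */
  return c != '\0' && strchr("-\"'%&/!:;_", c) != NULL;
}

/* Hypothetical convenience wrapper: is every byte of `s` safe? */
static int sketch_all_safe(const char *s)
{
  assert(s != NULL);
  for (; *s != '\0'; s++) {
    if (!sketch_is_safe_char((unsigned char) *s))
      return 0;
  }
  return 1;
}
```

A string for which sketch_all_safe() returns 1 contains no metacharacters under these assumptions, so it can only match itself.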

void make_jump_table ( void  )

Computes a jump table for Boyer-Moore searches.

Unlike the textbook version, this jumptable includes the last character of each grain (in order to avoid running the string comparing loops every time).

A non-exported function.

References cl_debug, cl_regopt_grain, cl_regopt_grain_len, cl_regopt_grains, and cl_regopt_jumptable.

Referenced by cl_regopt_analyse().
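One way to build and use a jump table of this kind is sketched below. It is a simplified illustration, not the module's code: the function names and SKETCH_TABLE_SIZE are hypothetical, and the search loop is a Horspool-style variant written for this example. Because the last character of each grain is included, a table entry of 0 means "a grain may end here, so compare", while any positive entry allows the search to skip ahead without comparing.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_TABLE_SIZE 256

/* Fill jump[] for n grains that all have the same length len. */
static void sketch_make_jump_table(const char **grains, int n, int len,
                                   int jump[SKETCH_TABLE_SIZE])
{
  int c, g, k;

  assert(len >= 1);
  for (c = 0; c < SKETCH_TABLE_SIZE; c++)
    jump[c] = len;                      /* character occurs in no grain */
  for (g = 0; g < n; g++) {
    for (k = 0; k < len; k++) {         /* k == len - 1 is included on purpose */
      int dist = len - 1 - k;           /* distance to the end of the grain */
      unsigned char ch = (unsigned char) grains[g][k];
      if (dist < jump[ch])
        jump[ch] = dist;
    }
  }
}

/* Returns 1 if str contains any grain, 0 otherwise. */
static int sketch_find_grain(const char *str, const char **grains, int n,
                             int len, const int jump[SKETCH_TABLE_SIZE])
{
  int slen = (int) strlen(str);
  int pos = len - 1;                    /* index of the window's last character */
  int g;

  while (pos < slen) {
    int j = jump[(unsigned char) str[pos]];
    if (j > 0) {
      pos += j;                         /* no grain can end at this position */
      continue;
    }
    for (g = 0; g < n; g++) {
      if (memcmp(str + pos - len + 1, grains[g], (size_t) len) == 0)
        return 1;
    }
    pos++;                              /* last character fitted, grain did not */
  }
  return 0;
}
```

For the grains "ing" and "ion", the table holds 0 for 'g' and 'n', 1 for 'o', 2 for 'i', and 3 for every other character.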

char* read_disjunction ( char *  mark,
int *  align_start,
int *  align_end 
)

Finds grains in a disjunction group - part of the CL Regex Optimiser.

This function finds grains in a disjunction group within a regular expression; the grains are then stored in the grain_buffer.

The first argument, mark, must point to the '(' at beginning of the disjunction group.

The booleans align_start and align_end are set to true if the grains from *all* alternatives are anchored at the start or end of the disjunction group, respectively.

This is a non-exported function.

Parameters:
mark    Pointer to the disjunction group (see also function description).
align_start    See function description.
align_end    See function description.
Returns:
A pointer to the first character after the disjunction group if (and only if) the parse succeeded; otherwise the original pointer passed in the mark argument.

References buf, grain_buffer, grain_buffer_grains, local_grain_data, MAX_GRAINS, read_grain(), and read_wildcard().

Referenced by cl_regopt_analyse().

char* read_grain ( char *  mark)

Reads in a grain from a regex - part of the CL Regex Optimiser.

A grain is a string of safe symbols not followed by ?, *, or {..}. This function finds the longest grain it can starting at the point in the regex indicated by mark; backslash-escaped characters are allowed but the backslashes must be stripped by the caller.

Parameters:
mark    Pointer to location in the regex string from which to read.
Returns:
Pointer to the first character after the grain it has read in (or the original "mark" pointer if no grain is found).

References is_safe_char().

Referenced by cl_regopt_analyse(), and read_disjunction().
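The scanning rule described for read_grain() can be sketched as follows. This is an approximation for illustration only: the names are hypothetical, a const-qualified pointer is used instead of char *, the safe-character test is a reduced stand-in, and accepting a backslash only before ASCII punctuation is an assumption made here (so that sequences such as \d are not treated as literals).

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Reduced stand-in for a safe-character test (assumption, see lead-in). */
static int sketch_is_safe_char(unsigned char c)
{
  return isalnum(c) || c > 0x7f
      || (c != '\0' && strchr("-\"'%&/!:;_", c) != NULL);
}

/* Returns a pointer just past the longest grain starting at mark,
 * or mark itself if no grain starts there. */
static const char *sketch_read_grain(const char *mark)
{
  const char *point = mark;
  int last_escaped = 0;   /* was the latest grain character backslash-escaped? */

  assert(mark != NULL);
  while (*point != '\0') {
    if (*point == '\\' && ispunct((unsigned char) point[1])) {
      point += 2;                       /* backslash plus escaped character */
      last_escaped = 1;
    } else if (sketch_is_safe_char((unsigned char) *point)) {
      point += 1;
      last_escaped = 0;
    } else {
      break;
    }
  }
  /* ?, * and {..} make the preceding character optional, so give it back */
  if (point > mark && (*point == '?' || *point == '*' || *point == '{'))
    point -= last_escaped ? 2 : 1;
  return point;
}
```

Note that a following + does not shorten the grain, since the preceding character must still occur at least once.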

char* read_kleene ( char *  mark)

Reads in a repetition marker - part of the CL Regex Optimiser.

This function reads in a Kleene star (asterisk), ?, +, or the general repetition modifier {n,n}; it returns a pointer to the first character after the repetition modifier it has found.

Parameters:
mark    Pointer to location in the regex string from which to read.
Returns:
Pointer to the first character after the star or other modifier it has read in (or the original "mark" pointer if a repetition modifier was not read).

Referenced by cl_regopt_analyse(), and read_wildcard().

char* read_matchall ( char *  mark)

Reads in a matchall (dot wildcard) or safe character - part of the CL Regex Optimiser.

This function reads in matchall, any safe character, or a reasonably safe-looking character class.

Parameters:
mark    Pointer to location in the regex string from which to read.
Returns:
Pointer to the first character after the character (class) it has read in (or the original "mark" pointer if something suitable was not read).

References is_safe_char().

Referenced by read_wildcard().

char* read_wildcard ( char *  mark)

Reads in a wildcard - part of the CL Regex Optimiser.

This function reads in a wildcard segment matching an arbitrary substring (but without a '|' symbol); it returns a pointer to the first character after the wildcard segment.

Note that effectively, wildcard equals matchall plus kleene.

Parameters:
mark    Pointer to location in the regex string from which to read.
Returns:
Pointer to the first character after the wildcard segment (or the original "mark" pointer if a wildcard was not read).

References read_kleene(), and read_matchall().

Referenced by cl_regopt_analyse(), and read_disjunction().

void regopt_data_copy_to_regex_object ( CL_Regex  rx)

Internal regopt function: copies optimiser data from internal global variables to the member variables of argument CL_Regex object.

References _CL_Regex::anchor_end, _CL_Regex::anchor_start, cl_debug, cl_regopt_anchor_end, cl_regopt_anchor_start, cl_regopt_grain, cl_regopt_grain_len, cl_regopt_grains, cl_regopt_jumptable, cl_strdup(), _CL_Regex::grain, _CL_Regex::grain_len, _CL_Regex::grains, and _CL_Regex::jumptable.

Referenced by cl_new_regex().

void update_grain_buffer ( int  front_aligned,
int  anchored 
)

Updates the public grain buffer -- part of the CL Regex Optimiser.

This function copies the local grains to the public buffer, if they are better than the set of grains currently there.

A non-exported function.

Parameters:
front_aligned    Boolean: if true, grain strings are aligned on the left when they are reduced to equal lengths.
anchored    Boolean: if true, the grains are anchored at beginning or end of string, depending on front_aligned.

References buf, CL_MAX_LINE_LENGTH, cl_regopt_anchor_end, cl_regopt_anchor_start, cl_regopt_grain, cl_regopt_grain_len, cl_regopt_grains, grain_buffer, grain_buffer_grains, and public_grain_data.

Referenced by cl_regopt_analyse().


Variable Documentation

char cl_regex_error[CL_MAX_LINE_LENGTH]

The error message from (PCRE) regex compilation is placed in this buffer if cl_new_regex() fails.

This global variable is part of the CL_Regex object's API.

Referenced by cl_new_regex(), and cl_regex2id().

cl_regopt_anchor_end

Boolean: whether grains are anchored at end of string.

Referenced by cl_regopt_analyse(), regopt_data_copy_to_regex_object(), and update_grain_buffer().

cl_regopt_anchor_start

Boolean: whether grains are anchored at beginning of string.

Referenced by cl_regopt_analyse(), regopt_data_copy_to_regex_object(), and update_grain_buffer().

char* cl_regopt_grain[MAX_GRAINS]

List of 'grains' (any matching string must contain at least one of these).

Referenced by cl_regopt_analyse(), make_jump_table(), regopt_data_copy_to_regex_object(), and update_grain_buffer().

cl_regopt_grain_len

The length of the grains (all the grains have the same length).

Referenced by cl_regopt_analyse(), make_jump_table(), regopt_data_copy_to_regex_object(), and update_grain_buffer().

cl_regopt_jumptable

A jump table for the Boyer-Moore search algorithm; use an _unsigned_ char as index.

See also:
make_jump_table

Referenced by make_jump_table(), and regopt_data_copy_to_regex_object().

cl_regopt_successes

A counter of how many times the "grain" system has allowed us to avoid calling the regex engine.

See also:
cl_regopt_count_get

Referenced by cl_regex_match(), cl_regopt_count_get(), and cl_regopt_count_reset().

char* grain_buffer[MAX_GRAINS]

Intermediate buffer for grains.

When a regex is parsed, grains for each segment are written to this intermediate buffer; if the new set of grains is better than the current one, it is copied to the cl_regopt_ variables.

Referenced by cl_regopt_analyse(), read_disjunction(), and update_grain_buffer().

grain_buffer_grains

The number of grains currently in the intermediate buffer.

See also:
grain_buffer

Referenced by cl_regopt_analyse(), read_disjunction(), and update_grain_buffer().

char local_grain_data[CL_MAX_LINE_LENGTH]

A buffer for grain strings.

See also:
public_grain_data

Referenced by cl_regopt_analyse(), and read_disjunction().

char public_grain_data[CL_MAX_LINE_LENGTH]

A buffer for grain strings.

See also:
local_grain_data

Referenced by update_grain_buffer().