CWB
special-chars.c File Reference

#include <ctype.h>
#include <glib.h>
#include "globals.h"
#include "special-chars.h"


Define Documentation

#define popc(s, p)    s[p++]

Referenced by cl_string_latex2iso().

#define pushc(s, c, p, m)    s[p++] = c; if (p>=m) goto endloop;

Referenced by cl_string_latex2iso().
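
How the two macros cooperate can be sketched with a bounded copy loop. This is only an illustration, not the body of cl_string_latex2iso(): the function name sketch_bounded_copy is hypothetical, and the macro parameter lists (s, p) and (s, c, p, m) are inferred from the macro bodies. Note that pushc expands to two statements and jumps to a label named endloop, so the caller has to use it inside braces and supply that label.

```c
#include <assert.h>
#include <stddef.h>

#define popc(s, p)        s[p++]
#define pushc(s, c, p, m) s[p++] = c; if (p >= m) goto endloop;

/* Hypothetical helper: copy src to dst, writing at most max_len characters
 * plus a terminating null. dst must have room for max_len + 1 bytes.
 * Returns the number of characters written (excluding the terminator). */
size_t sketch_bounded_copy(const char *src, char *dst, size_t max_len)
{
  size_t in = 0, out = 0;
  char c;

  assert(src != NULL && dst != NULL);
  if (max_len == 0)
    goto endloop;

  while (src[in] != '\0') {
    c = popc(src, in);            /* read one character, advance input */
    pushc(dst, c, out, max_len);  /* write it; leave the loop when full */
  }

endloop:
  dst[out] = '\0';
  return out;
}
```

The goto inside pushc is what stops the loop as soon as the output position reaches the limit, so the caller never has to test the bound explicitly.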


Function Documentation

void cl_id_tolower ( char *  s)

Converts an uppercase corpus name to an equivalent lowercase form.

String is modified in situ. Only the ASCII characters are changed.

Note, this function doesn't check for what is and is not an allowed CWB-corpus-name character.

Referenced by cl_new_corpus(), encode_generate_registry_file(), and main().

void cl_id_toupper ( char *  s)

Converts a lowercase corpus name to an equivalent uppercase form.

String is modified in situ. Only the ASCII characters are changed.

Note, this function doesn't check for what is and is not an allowed CWB-corpus-name character.

The old version of this code was a line in cwb-encode that used the library toupper to cope with Latin1 characters. But these are no longer allowed in identifiers, which must be ASCII only.

Referenced by encode_generate_registry_file().
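
The ASCII-only, in-place folding described for cl_id_tolower() and cl_id_toupper() might look roughly like the sketch below. It is not the CL source; the names sketch_id_tolower and sketch_id_toupper are hypothetical. The point it shows is that bytes outside A-Z / a-z, including any byte of 0x80 or above, are left untouched, which is why the locale-dependent library toupper is not needed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for cl_id_tolower(): fold A-Z to a-z in place. */
void sketch_id_tolower(char *s)
{
  assert(s != NULL);
  for (; *s; s++)
    if (*s >= 'A' && *s <= 'Z')
      *s += 'a' - 'A';
}

/* Hypothetical stand-in for cl_id_toupper(): fold a-z to A-Z in place. */
void sketch_id_toupper(char *s)
{
  assert(s != NULL);
  for (; *s; s++)
    if (*s >= 'a' && *s <= 'z')
      *s -= 'a' - 'A';
}
```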

int cl_id_validate ( char *  s)

Checks a string to see if it is a valid CWB identifier.

The rules for these are as follows (see also the CQP lexer):

  • all characters must be ASCII, i.e. less than 0x80;
  • must be at least 1 character long (of course);
  • first character must be an uppercase or lowercase letter or underscore;
  • second and subsequent characters may also be digits, hyphen or full stop;
  • mixed case is allowed (just-upper and just-lower is imposed elsewhere, where necessary).

TODO: should the CL registry lexer be amended to reflect these restrictions? (ID there is rather laxer than this)

Parameters:
s: The string to check.
Returns:
A boolean. True if the string is a valid ID. Otherwise false.

Referenced by cl_new_corpus(), and encode_generate_registry_file().
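
The identifier rules listed above translate fairly directly into code. The following is a minimal sketch written from those rules alone, not a copy of cl_id_validate(); the name sketch_id_validate is hypothetical, and treating a NULL argument as a programming error is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical re-implementation of the documented identifier rules.
 * Returns 1 if s is a valid identifier under those rules, 0 otherwise. */
int sketch_id_validate(const char *s)
{
  const unsigned char *p = (const unsigned char *)s;

  assert(s != NULL);   /* assumption: NULL is not a legal argument */

  /* at least one character; the first is an ASCII letter or underscore */
  if (!((*p >= 'A' && *p <= 'Z') || (*p >= 'a' && *p <= 'z') || *p == '_'))
    return 0;

  /* later characters may also be digits, hyphen or full stop */
  for (p++; *p; p++) {
    int ok = (*p >= 'A' && *p <= 'Z') || (*p >= 'a' && *p <= 'z') ||
             (*p >= '0' && *p <= '9') ||
             *p == '_' || *p == '-' || *p == '.';
    if (!ok)
      return 0;   /* this also rejects every byte >= 0x80 */
  }
  return 1;
}
```

Because only explicit ASCII ranges are accepted, the "less than 0x80" rule needs no separate test.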

void cl_path_adjust_independent ( char *  path)

Standardises subdirectory-dividers in a string that represents a path into Unix-like form (i.e. with forward slashes), regardless of what OS we are in.

Or, to put it another way, changes backslashes into forward slashes under Windows.

This may be useful because of the need to move corpora between systems, in which case the paths need to be in '/' format: Windows tolerates forward slashes in paths far better than *nix tolerates unescaped backslashes.

Note that the path is modified in place.

Parameters:
path: The path to modify (must be ASCII-compatible).

References SUBDIR_SEPARATOR.

void cl_path_adjust_os ( char *  path)

Standardises subdirectory-dividers in a string that represents a path, in an OS-sensitive way.

If the CL was compiled for Unix, backslashes are changed to forward slashes. If the CL was compiled for Windows, forward slashes are changed to backslashes.

Note that the path is modified in place.

Parameters:
path: The path to modify (must be ASCII-compatible).

References SUBDIR_SEPARATOR.
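
Both path-adjusting functions amount to an in-place separator rewrite. A minimal sketch follows; the name sketch_path_adjust is hypothetical, and the target separator is passed as a parameter here, whereas the real functions presumably take it from the compile-time configuration (SUBDIR_SEPARATOR).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: rewrite every '/' or '\\' in path to sep, in place.
 * sep == '/' mimics the OS-independent variant; passing the platform's own
 * separator mimics the OS-sensitive variant. */
void sketch_path_adjust(char *path, char sep)
{
  assert(path != NULL);
  assert(sep == '/' || sep == '\\');
  for (; *path; path++)
    if (*path == '/' || *path == '\\')
      *path = sep;
}
```

Since separators are single ASCII bytes and the string never changes length, no reallocation is needed, which is what makes the in-place contract possible.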

char* cl_path_get_component ( char *  s)

Tokenises a string into components split by ':' (or ';' under Win32).

Parameters:
s: The string to tokenise; or, NULL if tokenisation has already been initialised.
Returns:
The next token from the string.
See also:
PATH_SEPARATOR

References last, and PATH_SEPARATOR.
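
The calling convention is the same as for strtok(): pass the string on the first call and NULL on later calls. The sketch below illustrates that convention only. The names are hypothetical, the separator is fixed to ':', and the treatment of empty components and of the end of input (returning NULL) is an assumption rather than documented behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PATH_SEPARATOR ':'   /* assumed; ';' would be used under Win32 */

/* Hypothetical tokeniser with the same calling convention as described:
 * s on the first call, NULL afterwards. Modifies the input string. */
char *sketch_path_get_component(char *s)
{
  static char *last = NULL;   /* where the next call resumes */
  char *token, *sep;

  if (s == NULL)
    s = last;
  if (s == NULL || *s == '\0') {
    last = NULL;
    return NULL;              /* assumption: NULL signals "no more tokens" */
  }

  token = s;
  sep = strchr(s, SKETCH_PATH_SEPARATOR);
  if (sep != NULL) {
    *sep = '\0';              /* terminate this component in place */
    last = sep + 1;
  } else {
    last = NULL;
  }
  return token;
}

/* Typical loop: count the components of a path list (modifies list). */
int sketch_count_components(char *list)
{
  int n = 0;
  char *tok;

  assert(list != NULL);
  for (tok = sketch_path_get_component(list); tok != NULL;
       tok = sketch_path_get_component(NULL))
    n++;
  return n;
}
```

As with strtok(), the static state means the sketch is not re-entrant and the input buffer must be writable.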

char* cl_path_registry_quote ( char *  path)

Add quotes and escape slashes to a file path if necessary.

This is for the HOME and INFO fields of the registry file.

If either field contains any characters that can't be treated as an "ID" token by the registry parser, then we make sure it is treated as a string (quoted) instead, and make all appropriate substitutions.

For consistency, this function always returns a newly allocated string, regardless of whether changes have been made.

Note that the way the registry parser works, it is quite happy with either "C:\dir\subdir" or "C:\\dir\\subdir" as a path for HOME or INFO.

Parameters:
path: String containing the path to quotify.
Returns:
The quotified string (newly allocated).

References cl_malloc(), and cl_strdup().

Referenced by encode_generate_registry_file().

char* cl_strcpy ( char *  buf,
const char *  src 
)

Replacement for strcpy that won't copy more than CL_MAX_LINE_LENGTH characters.

This is intended to make it easier to avoid buffer overflows. But it doesn't protect against the opposite danger of losing important data from the end of a truncated string.

Note, buffer overflow is still possible if buf is a pointer to the middle of a buffer.

So this function is not a panacea, it's just a bit of a help.

It's also implemented in a way that is safe for down-strcpying, that is, when we are erasing a section from the start/middle of the string (cl_strcpy(string, string+3); for instance). The POSIX standard states that the normal strcpy has undefined behaviour if the objects overlap. That's not the case here.

Parameters:
buf: A string buffer to copy to.
src: The string pointer to copy from.
Returns:
In classic strcpy-stylie, this function uselessly returns buf.

References buf, and CL_MAX_LINE_LENGTH.

Referenced by encode_get_input_line(), ParsePrintOptions(), and range_declare().
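
The two properties claimed above (a length cap, and safety when copying "downwards" within one buffer) can be sketched as follows. This is not the CL implementation. The name sketch_strcpy and the small limit SKETCH_MAX_LINE_LENGTH are hypothetical, and whether the real limit counts the terminating null is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_LINE_LENGTH 16   /* hypothetical stand-in for CL_MAX_LINE_LENGTH */

/* Copy at most SKETCH_MAX_LINE_LENGTH - 1 characters of src to buf and
 * always null-terminate. Overlap is tolerated because memmove() is used. */
char *sketch_strcpy(char *buf, const char *src)
{
  size_t n;

  assert(buf != NULL && src != NULL);

  /* length of src, capped at the limit */
  for (n = 0; n < SKETCH_MAX_LINE_LENGTH - 1 && src[n] != '\0'; n++)
    ;

  memmove(buf, src, n);   /* defined even when buf and src overlap */
  buf[n] = '\0';
  return buf;
}
```

Usage mirrors the example in the text: sketch_strcpy(string, string + 3) deletes the first three characters of string. As the documentation warns, truncation is silent, and the cap does not help if buf itself is smaller than the limit.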

void cl_string_canonical ( char *  s,
CorpusCharset  charset,
int  flags 
)

Converts a string to canonical form.

The "canonical form" of a string is for use in comparisons where case-insensitivity and/or diacritic insensitivity is desired.

Note that the string s is modified in place. This means it must have enough memory to cope with any expansions made in Unicode case folding. Ideally, allocate double the length of the string (since case-folding doesn't include any one -> more-than-two mappings so far as I know).

Note also that the arguments of this function were changed in v3.2.1. Now, a CorpusCharset is needed. This is because string canonicalising works differently in UTF8. In UTF8, the "composed" status of ALL strings is standardised (this is not dependent on flags; so this function should always be called on all strings that are going to be inserted into, or searched for within, an indexed corpus; then we know we are always dealing with maximally-precomposed strings). Then case folding / accent folding is done by calling Unicode-aware functions. This is in contrast to the process for Latin1, which just uses a straightforward mapping table for both sorts of folding.

Parameters:
s: The string (currently: must be ASCII, Latin-1, or UTF8, but this is not checked for you!)
charset: The character set to use in standardising. If this is utf8, complex accent and/or case folding will be done, as per the Unicode standard. If it is anything else, the Latin1 mapping tables will be used (currently no other ISO mapping tables are built in and activated in the CL).
flags: The flags that specify which conversions are required. Can be IGNORE_CASE and/or IGNORE_DIAC.

References cl_free, cl_string_maptable(), IGNORE_CASE, IGNORE_DIAC, and utf8.

Referenced by cl_new_regex(), cl_regex_match(), cl_string_qsort_compare(), encode_get_input_line(), print_tabulation(), SortExternally(), and SortSubcorpus().

char* cl_string_latex2iso ( char *  str,
char *  result,
int  target_len 
)

Converts ASCII strings with latex-style backslash escapes for accented characters to ISO-8859-1 (Latin-1).

Syntax:

\"[AaOoUus..] --> corresponding ISO 8859-1 character

\{octal} --> ISO 8859-1 character

Note that if cl_allow_latex2iso is FALSE, this function will simply copy the input to the output. So it is always safe to call this function.

See also:
cl_allow_latex2iso
Parameters:
str: The string to convert.
result: The location to put the altered string (which should be shorter, or at least no longer than, the input string). If this parameter is NULL, space is automatically allocated for the output. result is allowed to be the same as str.
target_len: The maximum length of the target string. If result is NULL, then this is deduced automatically.
Returns:
Pointer to the altered string (if result was NULL you need to catch this and free it when no longer needed).

References cl_allow_latex2iso, cl_malloc(), cl_strdup(), popc, and pushc.

Referenced by cl_new_regex(), do_flagged_string(), do_SetVariableValue(), and do_XMLTag().

unsigned char* cl_string_maptable ( CorpusCharset  charset,
int  flags 
)

Gets a specified character mapping table for use in regular expressions.

Returns pointer to static mapping table for given flags (IGNORE_CASE and IGNORE_DIAC) and character set.

Removed from the public API for 3.2.0 because there's no way for it to work if the CorpusCharset is UTF8. Prototype moved to special-chars.h.

Tables exist for all character sets, but for all except Latin1 and ASCII, they are currently identical to the ASCII tables (i.e. the awareness of case/accent relationships in the upper half of each character set has not yet been inserted).

Parameters:
charset: The character set of this corpus. Currently ignored.
flags: The flags that specify which table is required. Can be IGNORE_CASE and/or IGNORE_DIAC.
Returns:
Pointer to the appropriate mapping table. DO NOT FREE this, or modify it, it is a CL-internal data blob.

References ascii, charset, identity_tab, identity_tab_init, IGNORE_CASE, IGNORE_DIAC, maptable_init_both(), maptable_init_identity(), nocase_nodiac_tab, nocase_nodiac_tab_init, nocase_tab, nodiac_tab, and utf8.

Referenced by cl_string_canonical().

int cl_string_qsort_compare ( const char *  s1,
const char *  s2,
CorpusCharset  charset,
int  flags,
int  reverse 
)

Compares two strings in a qsort-stylie!

This function is designed to be suitable for use as a callback with qsort(). As such, its return values are negative if s1 is "less than" s2; zero if the two strings are the same; and positive if s1 is "greater than" s2. But of course you can also use it on its own.

You cannot use it directly with qsort as its parameters are wrong. It needs to be wrapped in another function that (at least) provides the charset, flags and reverse arguments (e.g. from global variables or by calling other functions).

The two strings must be in the same character set. Both will be made canonical in accordance with the flags argument if it is set. Also, the comparison can be done on reverse-order strings.

Note that if either flags or reverse is non-zero, then memory allocation will be necessary. If you are calling this function in a loop, that could quickly get costly. To avoid this, a pair of one-time-allocated buffers is used, but this doesn't dispense with all need for allocation. [Another option would be to allow a buffer to be optionally supplied....]

Parameters:
s1: First string to compare.
s2: Second string to compare.
charset: Character set of the two strings.
flags: IGNORE_CASE, IGNORE_DIAC, both, or neither.
reverse: Boolean: if true, strings are compared from end to beginning, rather than beginning to end.
Returns:
0 if the strings are the same. 1 if s1 is greater. -1 if s2 is greater.

References cl_free, cl_malloc(), CL_MAX_LINE_LENGTH, cl_string_canonical(), cl_string_reverse(), MIN, s1, s2, and utf8.

Referenced by i2compare().
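
The wrapping requirement described above (qsort() passes only two pointers, so the extra settings must come from elsewhere) can be sketched like this. To stay self-contained, the sketch uses a simplified stand-in comparator that handles ASCII case folding only; it is not cl_string_qsort_compare(), and all names beginning with sketch_ are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* ASCII-only case fold used by the stand-in comparator. */
static int sketch_fold(int c)
{
  return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Stand-in comparator: returns -1, 0 or 1, like the documented function.
 * The charset and reverse arguments are omitted for brevity. */
static int sketch_compare(const char *s1, const char *s2, int ignore_case)
{
  for (;; s1++, s2++) {
    int c1 = (unsigned char)*s1, c2 = (unsigned char)*s2;
    if (ignore_case) {
      c1 = sketch_fold(c1);
      c2 = sketch_fold(c2);
    }
    if (c1 != c2)
      return c1 < c2 ? -1 : 1;
    if (c1 == 0)
      return 0;
  }
}

/* Setting read by the wrapper; set it before calling qsort(). */
static int sketch_sort_ignore_case = 0;

/* qsort() callback: each argument points at an array element (a char *). */
static int sketch_qsort_wrapper(const void *a, const void *b)
{
  const char *s1 = *(const char *const *)a;
  const char *s2 = *(const char *const *)b;
  return sketch_compare(s1, s2, sketch_sort_ignore_case);
}

/* Sort an array of n strings, optionally ignoring ASCII case. */
void sketch_sort_strings(const char **strings, size_t n, int ignore_case)
{
  assert(strings != NULL || n == 0);
  sketch_sort_ignore_case = ignore_case;
  qsort(strings, n, sizeof *strings, sketch_qsort_wrapper);
}
```

Passing the settings through a file-scope variable keeps the callback signature compatible with qsort(), at the cost of making the sort non-re-entrant.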

char* cl_string_reverse ( const char *  s,
CorpusCharset  charset 
)

Creates a "backwards" version of the specified string.

The memory for the reversed string is newly allocated. (This is potentially wasteful, but it occurs in the depths of GLib, so short of reinventing the wheel we have to live with it.)

Parameters:
s: String to reverse.
charset: The character set of the string.
Returns:
Pointer to the new string.

References cl_strdup(), and utf8.

Referenced by cl_string_qsort_compare(), SortExternally(), and SortSubcorpus().
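
For single-byte character sets, "reverse into newly allocated memory" reduces to a byte-wise copy, as in the sketch below. The name sketch_string_reverse is hypothetical and the sketch deliberately ignores UTF8, where whole multi-byte characters rather than bytes have to be reversed (the remark about GLib above suggests that case is delegated to a library routine).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical byte-wise reversal for single-byte encodings.
 * Returns a newly allocated string (caller frees), or NULL if
 * allocation fails. */
char *sketch_string_reverse(const char *s)
{
  size_t len, i;
  char *r;

  assert(s != NULL);
  len = strlen(s);
  r = malloc(len + 1);
  if (r == NULL)
    return NULL;

  for (i = 0; i < len; i++)
    r[i] = s[len - 1 - i];
  r[len] = '\0';
  return r;
}
```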

int cl_string_validate_encoding ( char *  s,
CorpusCharset  charset,
int  repair 
)

Checks the encoding of a string.

This function looks for bad bytes (or byte sequences in the case of UTF8); if any are present, it judges the string invalid. For ISO8859-* encodings, the string can optionally be "repaired" in-place by replacing bad bytes with '?' characters. If the "repair" is successful, the function returns True.

What counts as "bad" is of course relative to the character set that the string is encoded in - so this must be specified.

Parameters:
s: Null-terminated string to check.
charset: CorpusCharset of the string's encoding.
repair: If true, replace invalid 8-bit characters by '?'.
Returns:
Boolean: true for valid, false for invalid.

References arabic, ascii, cyrillic, greek, hebrew, latin1, latin2, latin3, latin4, latin5, latin6, latin7, latin8, latin9, and utf8.

Referenced by encode_get_input_line(), and prepare_Query().

int cl_string_zap_controls ( char *  s,
CorpusCharset  charset,
char  replace,
int  zap_tabs,
int  zap_newlines 
)

Replaces any invalid control characters in a string.

"Invalid" control characters are any below 0x20.

The string is modified in situ. A typical "replace" to use would be '?' to match the action of cl_string_validate_encoding.

Parameters:
s: The string to modify.
charset: The character set of the string.
replace: The replacement character to use. If this is 0, the character is deleted rather than replaced.
zap_tabs: Whether or not tabs should be zapped (boolean).
zap_newlines: Whether or not \n and \r should be zapped (boolean).
Returns:
The number of characters replaced/deleted in the string.

Referenced by encode_get_input_line().
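
The replace-or-delete behaviour can be sketched with a two-pointer in-place pass. This is an illustration under assumptions, not the CL code: the name sketch_zap_controls is hypothetical, the charset parameter is omitted, and treating both line feed and carriage return as "newlines" is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: replace (or, if replace == 0, delete) control
 * characters below 0x20 in place. Tabs and newlines are only touched
 * when the corresponding flag is set. Returns the number of characters
 * replaced or deleted. */
int sketch_zap_controls(char *s, char replace, int zap_tabs, int zap_newlines)
{
  char *in, *out;
  int count = 0;

  assert(s != NULL);
  for (in = out = s; *in; in++) {
    unsigned char c = (unsigned char)*in;
    int zap = (c < 0x20);

    if (c == '\t' && !zap_tabs)
      zap = 0;
    if ((c == '\n' || c == '\r') && !zap_newlines)
      zap = 0;

    if (!zap) {
      *out++ = *in;          /* keep the character */
    } else {
      count++;
      if (replace != 0)
        *out++ = replace;    /* substitute; otherwise drop it */
    }
  }
  *out = '\0';
  return count;
}
```

The output pointer never passes the input pointer, so deleting characters shortens the string without any extra buffer.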

char* cl_xml_entity_decode ( char *  s)

Decode XML entities in a string.

This function decodes pre-defined XML entities in string s. It overwrites the input string s and also returns s for convenience.

(The entities are &lt; &gt; &amp; &quot; &apos;).

TODO -- numeric entities?

If passed NULL, it will not fall over - it will just pass NULL back!

This function is safe for strings in any encoding. The returned string will be at the same memory location and will always be the same length or shorter after the decoding of entities.

Parameters:
s: A string to decode.
Returns:
The string (rewritten in situ).

Referenced by encode_add_wattr_line(), and range_open().
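
Because every predefined entity is longer than the character it stands for, decoding can be done in place with a read pointer and a write pointer. The sketch below illustrates that idea. It is not the CL source, the name sketch_xml_entity_decode is hypothetical, and leaving unrecognised "&...;" sequences untouched is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-place decoder for the five predefined XML entities.
 * Returns s (or NULL if s is NULL). */
char *sketch_xml_entity_decode(char *s)
{
  static const struct { const char *name; char ch; } ents[] = {
    { "&lt;", '<' }, { "&gt;", '>' }, { "&amp;", '&' },
    { "&quot;", '"' }, { "&apos;", '\'' }
  };
  char *in, *out;
  size_t i;

  if (s == NULL)
    return NULL;

  for (in = out = s; *in; ) {
    size_t matched = 0;

    assert(out <= in);   /* the write position never overtakes the read position */
    if (*in == '&') {
      for (i = 0; i < sizeof ents / sizeof ents[0]; i++) {
        size_t len = strlen(ents[i].name);
        if (strncmp(in, ents[i].name, len) == 0) {
          *out++ = ents[i].ch;
          matched = len;
          break;
        }
      }
    }
    if (matched)
      in += matched;     /* skip the whole entity */
    else
      *out++ = *in++;    /* copy an ordinary byte unchanged */
  }
  *out = '\0';
  return s;
}
```

Only the ASCII bytes of the entity names are inspected and all other bytes are copied verbatim, which is consistent with the claim that decoding is safe in any ASCII-compatible encoding. Decoded output is not rescanned, so "&amp;lt;" becomes "&lt;" rather than "<".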

void maptable_init_both ( unsigned char *  maptable,
const unsigned char *  nocasetable,
const unsigned char *  nodiactable 
)

Initialise a "fold both case and diacritics" mapping table.

Referenced by cl_string_maptable().

void maptable_init_identity ( unsigned char *  maptable)

Initialise an "identity" mapping table.

Referenced by cl_string_maptable().
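
The two initialisers can be pictured as simple operations on 256-entry byte tables. The sketch below is an illustration only: the function names are hypothetical, and the order of composition in the "both" table (diacritic fold first, then case fold) is an assumption, not something this page states.

```c
#include <assert.h>
#include <stddef.h>

/* Identity table: every byte maps to itself. */
void sketch_maptable_init_identity(unsigned char *maptable)
{
  int i;

  assert(maptable != NULL);
  for (i = 0; i < 256; i++)
    maptable[i] = (unsigned char)i;
}

/* Composite table: fold diacritics, then fold case (assumed order). */
void sketch_maptable_init_both(unsigned char *maptable,
                               const unsigned char *nocasetable,
                               const unsigned char *nodiactable)
{
  int i;

  assert(maptable != NULL && nocasetable != NULL && nodiactable != NULL);
  for (i = 0; i < 256; i++)
    maptable[i] = nocasetable[nodiactable[i]];
}

/* Apply a mapping table to a single-byte-encoded string, in place. */
void sketch_maptable_apply(unsigned char *s, const unsigned char *maptable)
{
  assert(s != NULL && maptable != NULL);
  for (; *s; s++)
    *s = maptable[*s];
}
```

Building the composite once and caching it is what turns a combined case and accent fold into a single table lookup per byte.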


Variable Documentation

cl_allow_latex2iso

Boolean switch enabling/disabling latex-style escapes.

By default, it is false; if programs wish to allow these escapes they need to offer some means of changing this variable.

Note that enabling this variable may cause scrambling of the string for LatinX strings where X is not 1; and may cause undefined errors for UTF8 strings. In short, you should only activate it when you are working with a corpus whose charset is Latin1.

See also:
CorpusCharset

Referenced by cl_string_latex2iso().

const unsigned char identity_tab[unknown_charset][256]

Array of mapping tables used when NEITHER case NOR diacritics are to be stripped.

These are composite tables: they are only generated when needed (the corresponding identity_tab_init value is a boolean indicating whether this has been done yet).

Use a CorpusCharset value as the index into this array.

Referenced by cl_string_maptable().

int identity_tab_init[unknown_charset] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}

Referenced by cl_string_maptable().

unsigned char nocase_nodiac_tab[unknown_charset][256]

Array of mapping tables used when BOTH case AND diacritics are to be stripped.

These are composite tables: they are only generated when needed (the corresponding nocase_nodiac_tab_init value is a boolean indicating whether this has been done yet).

Use a CorpusCharset value as the index into this array.

Referenced by cl_string_maptable().

int nocase_nodiac_tab_init[unknown_charset] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}

Referenced by cl_string_maptable().

unsigned char nocase_tab[unknown_charset][256]

Array of tables mapping a character (the index) to the equivalent character in lowercase (the value).

There are as many tables as there are possible values of CorpusCharset. Moreover, tables must always be in the same order as the values of CorpusCharset are declared in.

This means starting at ascii == 0 and working up through the canonical order that is observable in cl.h

Use a CorpusCharset value as the index into this array.

See also:
CorpusCharset

Referenced by cl_string_maptable().

unsigned char nodiac_tab[unknown_charset][256]

Array of tables mapping a character (the index) to the equivalent character without any accents (the value).

There are as many tables as there are possible values of CorpusCharset. Moreover, tables must always be in the same order as the values of CorpusCharset are declared in.

This means starting at ascii == 0 and working up through the canonical order that is observable in cl.h

Use a CorpusCharset value as the index into this array.

See also:
CorpusCharset

Referenced by cl_string_maptable().