CWB

cwb-s-encode.c File Reference

cwb-s-encode adds an s-attribute to an existing corpus.

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <assert.h>
#include "../cl/globals.h"
#include "../cl/endian.h"
#include "../cl/macros.h"
#include "../cl/storage.h"
#include "../cl/lexhash.h"


Detailed Description

cwb-s-encode adds an s-attribute to an existing corpus.

Input: a list of regions (read from stdin, or from the file named by the first argument to the program) with lines in the following format:

start TAB end [ TAB annotation ]

start = corpus position of first token in region (integer as text)
end = corpus position of last token in region (integer as text)
annotation = annotation text (only if s-attribute was specified with -V)

Output: file att.rng (plus att.avs, att.avx for -V attributes) where att is the specified attribute name.
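
To illustrate the input format described above, here is a minimal sketch that produces one input line per region. The helper name format_region_line is hypothetical and is not part of cwb-s-encode; it only shows the "start TAB end [ TAB annotation ]" layout.

```c
#include <stdio.h>

/* Hypothetical helper (not part of cwb-s-encode): formats one input line
 * "start TAB end [ TAB annotation ]" into buf.
 * Pass annot == NULL for an attribute without annotations.
 * Returns the number of characters written (excluding the terminating NUL),
 * or -1 if buf is too small. */
int format_region_line(char *buf, size_t size, int start, int end, const char *annot)
{
  int n;

  if (annot != NULL)
    n = snprintf(buf, size, "%d\t%d\t%s\n", start, end, annot);
  else
    n = snprintf(buf, size, "%d\t%d\n", start, end);

  /* snprintf reports the length it wanted to write; detect truncation */
  return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

Lines built this way can be written to a file or piped to the program on stdin; the annotation field is only meaningful for attributes declared with -V.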


Define Documentation

#define RNG_AVS   "%s" SUBDIR_SEP_STRING "%s.avs"

printf format string for path of attribute values of a given structural attribute

Referenced by sencode_open_files().

#define RNG_AVX   "%s" SUBDIR_SEP_STRING "%s.avx"

printf format string for path of attribute value index of a given structural attribute

Referenced by sencode_open_files().

#define RNG_RNG   "%s" SUBDIR_SEP_STRING "%s.rng"

printf format string for path of file storing ranges of given structural attribute

Referenced by sencode_open_files().
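
The three format strings above each take a directory and an attribute name. The following sketch shows how such a format might be expanded into a file path. It assumes that SUBDIR_SEP_STRING is "/" (the real value comes from the CL headers and may differ by platform), and build_rng_path is a hypothetical helper, not a function of this file.

```c
#include <stdio.h>

/* Assumption for this sketch: the path separator is "/".
 * In CWB the macro is defined in the CL headers. */
#define SUBDIR_SEP_STRING "/"
#define RNG_RNG "%s" SUBDIR_SEP_STRING "%s.rng"

/* Hypothetical helper: expands RNG_RNG with a data directory and an
 * attribute name. Returns 0 on success, -1 if buf is too small. */
int build_rng_path(char *buf, size_t size, const char *dir, const char *name)
{
  int n = snprintf(buf, size, RNG_RNG, dir, name);
  return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

The same pattern applies to RNG_AVS and RNG_AVX; only the file extension changes.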

#define UMASK   0644

Typedef Documentation

typedef struct _SL * SL

The "structure list" data type is used for 'adding' regions (-a).

SL is a really bad name; should be "RegionList".

In this case, all existing regions are read into an ordered, bidirectional list; new regions are inserted into that list (overlaps are automatically resolved in favour of the 'earlier' region; if start point is identical, the longer region is retained). Only once the entire input has been read is the data actually encoded and stored on disk.
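
The overlap policy described above can be sketched as follows. This is an illustration under stated assumptions, not the code of this file: the names Region, RegionList and region_list_insert are hypothetical, annotations are left out for brevity, and an existing region is assumed to be kept when a new one has the same start and is not longer.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the SL type: one node of an ordered,
 * bidirectional list of non-overlapping regions. */
typedef struct Region {
  int start, end;
  struct Region *prev, *next;
} Region;

typedef struct { Region *head; } RegionList;

/* Unlinks r from the list and frees it. */
static void unlink_region(RegionList *list, Region *r)
{
  if (r->prev) r->prev->next = r->next; else list->head = r->next;
  if (r->next) r->next->prev = r->prev;
  free(r);
}

/* Inserts [start, end], resolving overlaps in favour of the earlier region;
 * for identical start points the longer region is retained.
 * Returns 1 if inserted, 0 if discarded, -1 on allocation failure. */
int region_list_insert(RegionList *list, int start, int end)
{
  Region *p = NULL, *q, *item;
  int replace = 0;

  /* seek: p = last region whose start is <= the new start (or NULL) */
  for (q = list->head; q != NULL && q->start <= start; q = q->next)
    p = q;

  if (p != NULL) {
    if (p->start == start) {
      if (end <= p->end) return 0;   /* existing region is at least as long */
      replace = 1;                   /* new region is longer: it wins */
    } else if (p->end >= start) {
      return 0;                      /* earlier region overlaps: it wins */
    }
  }

  item = malloc(sizeof *item);
  if (item == NULL) return -1;

  if (replace) {                     /* drop the shorter region at this start */
    q = p->prev;
    unlink_region(list, p);
    p = q;
  }

  /* link the new region after p (or at the head if p is NULL) */
  item->start = start;
  item->end = end;
  item->prev = p;
  item->next = p ? p->next : list->head;
  if (item->next) item->next->prev = item;
  if (p) p->next = item; else list->head = item;

  /* later regions that overlap the new (earlier) region are removed */
  while (item->next != NULL && item->next->start <= end)
    unlink_region(list, item->next);
  return 1;
}

/* Frees all regions in the list. */
void region_list_free(RegionList *list)
{
  while (list->head != NULL)
    unlink_region(list, list->head);
}
```

Because the list is kept sorted and free of overlaps at all times, it can be written to disk in a single forward pass once the input is exhausted.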


Enumeration Type Documentation

anonymous enum
Enumerator:
set_none 
set_any 
set_regular 
set_whitespace 

Function Documentation

int main ( int  argc,
char **  argv 
)

char* sencode_check_set ( char *  annot)

Changes an annotation string to standard set attribute syntax.

On first call, the function checks whether annotations are already given in standard '|'-delimited form; otherwise we assume we are using whitespace to split.

The returned string may be newly allocated, so the caller must use the returned value (not the original argument) and free it when done.

If there are syntax errors, returns NULL.

Parameters:
annot    The annotation string to check.
Returns:
The standardised string, or NULL if there was an error in the call to cl_make_set().

References _SL::annot, cl_free, cl_make_set(), set_any, set_att, set_none, set_regular, set_syntax_strict, and set_whitespace.

Referenced by main().
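
The format detection described for sencode_check_set() can be sketched as below. This is a simplified stand-in and makes several assumptions: the names standardise_set and set_format are hypothetical, the detection state is passed in by the caller rather than kept in a global, a leading '|' is taken as the sign of standard form, and the standard form is taken to be "|a|b|c|". The sketch does not sort or de-duplicate the elements, which the real cl_make_set() may do.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical counterpart of the set_att states used after detection. */
enum set_format { FORMAT_UNKNOWN, FORMAT_REGULAR, FORMAT_WHITESPACE };

/* Returns a newly allocated string in '|'-delimited form, or NULL on a
 * syntax error or allocation failure. On the first call (*format ==
 * FORMAT_UNKNOWN) the input format is detected and remembered in *format. */
char *standardise_set(const char *annot, enum set_format *format)
{
  size_t len = strlen(annot);
  const char *p = annot;
  char *out, *w;

  if (*format == FORMAT_UNKNOWN)
    *format = (annot[0] == '|') ? FORMAT_REGULAR : FORMAT_WHITESPACE;

  /* worst case: leading '|', every input character, one final '|', NUL */
  out = malloc(len + 3);
  if (out == NULL) return NULL;

  if (*format == FORMAT_REGULAR) {
    /* must start and end with '|' to count as standard form */
    if (len == 0 || annot[0] != '|' || annot[len - 1] != '|') {
      free(out);
      return NULL;
    }
    memcpy(out, annot, len + 1);
    return out;
  }

  /* whitespace-separated input: emit "|tok1|tok2|...|" */
  w = out;
  *w++ = '|';
  while (*p != '\0') {
    while (*p == ' ' || *p == '\t') p++;        /* skip separators */
    if (*p == '\0') break;
    while (*p != '\0' && *p != ' ' && *p != '\t') {
      if (*p == '|') {                          /* assumed to be an error */
        free(out);
        return NULL;
      }
      *w++ = *p++;
    }
    *w++ = '|';
  }
  *w = '\0';
  return out;
}
```

Passing the detection state explicitly keeps the sketch easy to test; once a format has been detected, input in the other format is rejected, which mirrors the idea of checking that all annotations use the same syntax.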

void sencode_close_files ( void  )

Close the disk files for the s-attribute being encoded.

References SencodeRange::avs, SencodeRange::avx, SencodeRange::fd, and SencodeRange::ready.

Referenced by main().

void sencode_declare_new_satt ( char *  name,
char *  directory,
int  store_values 
)

Initialises the "new_satt" variable for the s-attribute to be encoded, and sets name/directory.

References SencodeRange::avs, SencodeRange::avx, cl_strdup(), SencodeRange::dir, SencodeRange::fd, SencodeRange::last_cpos, SencodeRange::name, SencodeRange::num, SencodeRange::offset, SencodeRange::ready, and SencodeRange::store_values.

Referenced by sencode_parse_options().

void sencode_open_files ( void  )

Open disk files for the s-attribute being encoded (must have been declared first).

References SencodeRange::avs, SencodeRange::avx, buf, CL_MAX_LINE_LENGTH, SencodeRange::dir, SencodeRange::fd, SencodeRange::name, SencodeRange::ready, RNG_AVS, RNG_AVX, RNG_RNG, and SencodeRange::store_values.

Referenced by sencode_write_region().

int sencode_parse_line ( char *  line,
int *  start,
int *  end,
char **  annot 
)

Parses one line of cwb-s-encode input.

Usage:

ok = sencode_parse_line(char *line, int *start, int *end, char **annot);

Expects standard TAB-separated format; first two fields must be numbers, optional third field is returned in annot - if not present, annot is set to NULL.

Parameters:
line     The line to be parsed.
start    Location for the start cpos.
end      Location for the end cpos.
annot    Location for the annotation string.
Returns:
Boolean; true for all OK, false for error.

References cl_free, and cl_strdup().

Referenced by main().
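
A parser with the contract described for sencode_parse_line() might look like the sketch below. It is a hypothetical re-implementation for illustration only: the names parse_region_line and parse_int_field are not from this file, plain malloc() is used instead of the CL helpers, and rejecting negative or out-of-range numbers is an assumption.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Parses a non-negative decimal integer at s; *endp is set to the first
 * character after the number. Returns 1 on success, 0 on error. */
static int parse_int_field(const char *s, char **endp, int *out)
{
  long v;

  if (!isdigit((unsigned char)*s)) return 0;   /* no sign, no leading blanks */
  errno = 0;
  v = strtol(s, endp, 10);
  if (errno == ERANGE || v > INT_MAX) return 0;
  *out = (int)v;
  return 1;
}

/* Hypothetical sketch: parses "start TAB end [ TAB annotation ]".
 * *annot receives a malloc'ed copy of the third field (caller frees),
 * or NULL if the field is absent. Returns 1 if OK, 0 on error. */
int parse_region_line(const char *line, int *start, int *end, char **annot)
{
  char *p;
  size_t len;

  *annot = NULL;
  if (!parse_int_field(line, &p, start) || *p != '\t') return 0;
  if (!parse_int_field(p + 1, &p, end)) return 0;

  if (*p == '\t') {                       /* optional annotation field */
    p++;
    len = strcspn(p, "\r\n");             /* strip the line terminator */
    *annot = malloc(len + 1);
    if (*annot == NULL) return 0;
    memcpy(*annot, p, len);
    (*annot)[len] = '\0';
    return 1;
  }
  /* without an annotation, only a line terminator may follow */
  return (*p == '\0' || *p == '\n' || *p == '\r');
}
```

Setting *annot to NULL before any parsing keeps the out-parameter well defined on every error path, so the caller can always free it unconditionally.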

void sencode_parse_options ( int  argc,
char **  argv 
)

void sencode_usage ( void  )

print usage message and exit

References progname, and VERSION.

Referenced by sencode_parse_options().

void sencode_write_region ( int  start,
int  end,
char *  annot 
)

void SL_delete ( SL  item)

delete region from list; updates SL_Point if it happened to point at item

References _SL::annot, cl_free, _SL::next, and _SL::prev.

Referenced by SL_insert().

void SL_insert ( int  start,
int  end,
char *  annot 
)

Inserts an item into the global structure list.

It adds a new region to the list: its start point, its end point, its annotation.

Combines SL_seek(), SL_insert_after_point() and ambiguity resolution.

References _SL::end, _SL::next, SL_delete(), SL_insert_after_point(), SL_seek(), and _SL::start.

Referenced by main().

SL SL_insert_after_point ( int  start,
int  end,
char *  annot 
)

insert region [start, end, annot] after SL_Point; no overlap/position checking

References _SL::annot, cl_malloc(), cl_strdup(), _SL::end, _SL::next, _SL::prev, SL_Point, _SL::start, and StructureList.

Referenced by SL_insert().

SL SL_next ( void  )

Gets a pointer to the next available structure on the global structure list.

Returns NULL if we're at the end of the list.

References _SL::next, and SL_Point.

Referenced by main().

void SL_rewind ( void  )

Rewind the index-pointer to the start of the global structure list.

References StructureList.

Referenced by main().

SL SL_seek ( int  cpos)

Find region containing (or preceding) cpos; NULL = start of list; sets SL_Point to returned value.

References _SL::end, _SL::next, _SL::prev, SL_Point, _SL::start, and StructureList.

Referenced by SL_insert().
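
A seek that starts its linear search from a remembered position, as described for SL_seek() and SL_Point, can be sketched as follows. The names Node and seek_from_cursor are hypothetical, and the cursor is passed explicitly instead of being a global; a NULL cursor stands for "before the first region".

```c
#include <stddef.h>

/* Hypothetical list node with the same linkage as the structure list. */
typedef struct Node {
  int start, end;
  struct Node *prev, *next;
} Node;

/* Finds the last region whose start is <= cpos, i.e. the region that
 * contains cpos or most closely precedes it. The search begins at *cursor
 * and moves backwards or forwards as needed; *cursor is updated to the
 * result. Returns NULL if cpos lies before the first region. */
Node *seek_from_cursor(Node *head, Node **cursor, int cpos)
{
  Node *p = *cursor;

  if (p == NULL) {                  /* cursor is at the start of the list */
    if (head == NULL || head->start > cpos)
      return NULL;                  /* nothing at or before cpos */
    p = head;
  }

  /* walk backwards while the current region starts after cpos */
  while (p != NULL && p->start > cpos)
    p = p->prev;

  /* walk forwards while the next region still starts at or before cpos */
  if (p != NULL)
    while (p->next != NULL && p->next->start <= cpos)
      p = p->next;

  *cursor = p;
  return p;
}
```

When input regions arrive in roughly ascending order, consecutive seeks land close to the previous result, so the search usually takes only a few steps.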


Variable Documentation

int add_to_existing = 0

add to existing attribute: implies in_memory; existing regions are automatically inserted at startup

Referenced by main(), and sencode_parse_options().

Corpus* corpus = NULL

corpus we're working on; at the moment, this is only required for add_to_existing

int debug = 0

debug mode on/off

int in_memory = 0

create list of regions in memory (allowing non-linear input), then write to disk

Referenced by main(), and sencode_parse_options().

cl_lexhash LH = NULL

Lexhash used when writing regions, to avoid multiple copies of annotations (-m mode)

SencodeRange new_satt

Global (and only) instance of the cwb-s-encode SencodeRange object.

Contains information on the new s-attribute being encoded.

char* progname = NULL
enum { ... } set_att

Type of feature-set attribute.

Initial value: set_none (not a feature set). Changes to set_any once we know we are dealing with a feature set, then to set_regular or set_whitespace once we know which feature-set format is in use.

Referenced by main(), sencode_check_set(), and sencode_parse_options().

set_syntax_strict

check that set attributes are always given in the same syntax

Referenced by sencode_check_set(), and sencode_parse_options().

int silent = 0

avoid messages in -M / -a modes

SL SL_Point = NULL

pointer into global list; NULL = start of list; linear search starts from SL_Point

Referenced by SL_insert_after_point(), SL_next(), and SL_seek().

Referenced by sencode_parse_options().

SL StructureList = NULL

(single) global list

Referenced by SL_insert_after_point(), SL_rewind(), and SL_seek().

FILE* text_fd = NULL

stream handle for file to read from.

Referenced by main(), and sencode_parse_options().