An annotation object is simply a named list in which each item contains a data frame. These frames should be thought of as tables living inside a single database, with keys linking each table to the others. All tables are in Edgar Codd’s second normal form. For the most part they also satisfy the third normal form or, equivalently, the formal tidy data model of Hadley Wickham. The limited departures from this more stringent requirement are justified below wherever they occur; in every case the cause is a transitive dependency that would require a complex range join to reconstruct.
We primarily considered the Java/CoreNLP backend when constructing the schema because it is the most feature-rich. With the sole exception of word embeddings in spaCy, which require a matrix-based data structure, all of the fields from the Python and R backends map seamlessly into a subset of those provided by CoreNLP. Fields where the output may differ slightly amongst backends are described in the commentary below.
Several standards have previously been proposed for representing textual annotations, including the Linguistic Annotation Framework, the NLP Interchange Format, and CoNLL-X. The function from_CoNLL is included as a helper in cleanNLP to convert from CoNLL formats into the cleanNLP data model. All of these standards, however, are concerned with representing annotations for interoperability between systems. Our goal is instead to create a data model well-suited to direct analysis, which therefore requires a new approach.
In this section each table is presented and justifications for its existence and form are given. Individual tables may be pulled out with access functions of the form get_*. Example tables are drawn from the State of the Union corpus, which will be discussed at length in the next section.
The documents table contains one row per document in the annotation object. What exactly constitutes a document is up to the user. It might be something as granular as a paragraph or as coarse as an entire novel. For many applications, particularly stylometry, it may be useful to work simultaneously with several hierarchical levels: sections, chapters, and an entire body of work. The solution in these cases is to define a document as the smallest unit of measurement, denoting the higher-level structures as metadata. The primary key for this table is a document id, stored as an integer index. The schema is given by:
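(schema reconstructed from the example output below; descriptions are inferred)

id        integer    id of the document; primary key for the table
time      date time  time at which the annotation was run
version   character  version of the NLP backend used
language  character  language of the text
uri       character  location of the raw source text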
By design, there should be no extrinsic meaning placed on this key. Other tables use it to map to one another and to the document table, but any metadata about the document is contained only in the document table rather than being forced into the document key. In other words, the temptation to use keys such as “Obama2016” is avoided because, while these look nice, they ultimately make programming hard.
These are all filled in automatically by the annotation function. Any number of additional corpus-specific metadata fields, such as the aforementioned section and chapter designations, may be attached as well. The document table for the example corpus is:
get_document(obama)
## # A tibble: 8 × 5
## id time version language
## <int> <dttm> <chr> <chr>
## 1 0 2017-04-02 22:56:00 3.7.0 en
## 2 1 2017-04-02 22:57:00 3.7.0 en
## 3 2 2017-04-02 22:58:00 3.7.0 en
## 4 3 2017-04-02 22:58:00 3.7.0 en
## 5 4 2017-04-02 22:59:00 3.7.0 en
## 6 5 2017-04-02 22:59:00 3.7.0 en
## 7 6 2017-04-02 23:00:00 3.7.0 en
## 8 7 2017-04-02 23:01:00 3.7.0 en
## # ... with 1 more variables: uri <chr>
Notice that metadata such as the president, year, and party has been included in the document table. It may seem that common fields such as year and author should be added to the formal specification, but the perceived advantage is minimal: users would still need to manually supply the content of these fields at some point, as this metadata, like any other, is not unambiguously extractable from the raw text.
The token table contains one row for each unique token, usually a word or punctuation mark, in each document of the corpus. Any annotator that produces an output for each token has its results displayed here. These include the lemmatizer, the part of speech tagger, and the speaker indicators. The schema is given by:
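(schema reconstructed from the example output below; descriptions are inferred)

id     integer    id of the source document
sid    integer    sentence id
tid    integer    token id; the phantom ROOT token has tid 0
word   character  raw token, as it appears in the text
lemma  character  lemmatized form of the token
upos   character  universal part of speech code
pos    character  language-specific part of speech code
cid    integer    character offset into the source document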
Given the annotators selected during the pipeline initialization, some of these columns may contain only missing data. A composite key is formed by taking the document id, sentence id, and token id together. There is also a foreign key, cid, giving character offsets back into the original source document. An example of the table looks like this:
get_token(obama)
## # A tibble: 61,881 × 8
## id sid tid word lemma upos pos cid
## <int> <int> <int> <chr> <chr> <chr> <chr> <int>
## 1 0 0 1 Madam Madam NOUN NNP 0
## 2 0 0 2 Speaker Speaker NOUN NNP 6
## 3 0 0 3 , , . , 13
## 4 0 0 4 Mr. Mr. NOUN NNP 15
## 5 0 0 5 Vice Vice NOUN NNP 19
## 6 0 0 6 President President NOUN NNP 24
## 7 0 0 7 , , . , 33
## 8 0 0 8 Members Members NOUN NNP 35
## 9 0 0 9 of of ADP IN 43
## 10 0 0 10 Congress Congress NOUN NNP 46
## # ... with 61,871 more rows
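Because every table carries the document id, metadata from the documents table joins directly onto the tokens. A minimal sketch, assuming the dplyr package is loaded:

library(dplyr)

# attach document-level metadata to each token via the shared id key
left_join(get_token(obama), get_document(obama), by = "id")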
A phantom token “ROOT” is included at the start of each sentence (it always has tid equal to 0). This was added so that joins from the dependency table, which contains references to the sentence root, into the token table have no missing values.
The field “upos” contains the universal part of speech code, a language-agnostic classification, for the token. It could be argued that, in order to maintain database normalization, one should simply look up the universal part of speech code by finding the language code in the document table and joining a table mapping the Penn Treebank codes to the universal codes. This has not been done for several reasons. First, universal parts of speech are very useful for exploratory data analysis, as they contain tags much more familiar to non-specialists such as “NOUN” (noun) and “CONJ” (conjunction); asking users to apply a three-table join just to access them seems overly cumbersome. Second, users may employ other parsers or annotation engines, which may not include granular part of speech codes, and it would be difficult to represent their output without a dedicated universal part of speech field.
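For instance, tabulating the universal part of speech codes across the corpus is a one-liner; this sketch assumes the dplyr package is available:

library(dplyr)

# frequency table of universal part of speech codes over the whole corpus
count(get_token(obama), upos, sort = TRUE)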
Dependencies give the grammatical relationship between pairs of tokens within a sentence. As they are at the level of token pairs, they must be represented as a new table. Only one dependency should exist for any pair of tokens; the document id, sentence id, and source and target token ids together serve as a composite key. As dependencies exist only within a sentence, the sentence id does not need to be defined separately for the source and target. The schema is given by:
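(schema reconstructed from the example output below; descriptions are inferred)

id             integer    id of the source document
sid            integer    sentence id of the token pair
tid            integer    token id of the source token
tid_target     integer    token id of the target token
relation       character  universal dependency relation
relation_full  character  language-specific dependency relation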
Dependencies take significantly longer to calculate than the lemmatization and part of speech tagging tasks. By default, the set_language function selects a fast neural network parser that requires more memory but runs nearly twice as fast as the other default parsers in the CoreNLP pipeline.
The get_dependency function has an option (set to FALSE by default) to automatically join the dependencies to the target and source tokens and words from the token table. This is a common task, and it involves non-trivial calls to the left_join function, making it worthwhile to include as an option. The output, with the option turned on, is given by:
get_dependency(obama, get_token = TRUE)
## # A tibble: 16,592,392 × 11
## id sid tid tid_target relation relation_full word lemma sid_target
## <int> <int> <int> <int> <chr> <chr> <chr> <chr> <int>
## 1 0 0 0 2 root root <NA> <NA> 0
## 2 0 0 0 2 root root <NA> <NA> 1
## 3 0 0 0 2 root root <NA> <NA> 2
## 4 0 0 0 2 root root <NA> <NA> 3
## 5 0 0 0 2 root root <NA> <NA> 4
## 6 0 0 0 2 root root <NA> <NA> 5
## 7 0 0 0 2 root root <NA> <NA> 6
## 8 0 0 0 2 root root <NA> <NA> 7
## 9 0 0 0 2 root root <NA> <NA> 8
## 10 0 0 0 2 root root <NA> <NA> 9
## # ... with 16,592,382 more rows, and 2 more variables: word_target <chr>,
## # lemma_target <chr>
The word “ROOT” shows up in the first row, which would have been NA had sentence roots not been explicitly included in the token table.
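The kind of join that this option performs can be sketched as follows; this is illustrative rather than the package’s exact implementation, and assumes the dplyr package:

library(dplyr)

# attach the source token, then the target token, to each dependency;
# get_dependency(obama, get_token = TRUE) automates a similar sequence
dep <- get_dependency(obama)
tok <- select(get_token(obama), id, sid, tid, word, lemma)

dep %>%
  left_join(tok, by = c("id", "sid", "tid")) %>%
  left_join(tok, by = c("id", "sid", "tid_target" = "tid"),
            suffix = c("", "_target"))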
Our parser produces universal dependencies, which consist of a language-agnostic set of relationship types together with language-specific subtypes pertaining to grammatical relationships particular to a given language. For the same reasons that both the universal and language-specific part of speech codes are included, both of these relationship types have been added to the dependency table.
Named entity recognition is the task of finding entities that can be defined by proper names, categorizing them, and standardizing their formats. The XML output of the Stanford CoreNLP pipeline places named entity information directly into their version of the token table. Doing this repeats information over every token in an entity and gives no canonical way of extracting the entirety of a single entity mention. We instead have a separate entity table, as is demanded by the normalized database structure, and record each entity mention in its own row. The schema is given by:
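(schema reconstructed from the example output below; descriptions are inferred)

id                 integer    id of the source document
sid                integer    sentence id of the entity mention
tid                integer    token id of the start of the mention
tid_end            integer    token id of the end of the mention
entity_type        character  category of the entity
entity             character  raw text of the entity mention
entity_normalized  character  normalized form of the entity, where available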
An example of the named entity table is given by:
get_entity(obama)
## # A tibble: 3,166 × 7
## id sid tid tid_end entity_type entity
## <int> <int> <int> <int> <chr> <chr>
## 1 0 1 2 2 PERSON Speaker
## 2 0 1 10 10 ORGANIZATION Congress
## 3 0 1 13 13 MISC Americans
## 4 0 1 15 17 DATE Fifty-one years ago
## 5 0 1 19 21 PERSON John F. Kennedy
## 6 0 3 1 1 DATE Tonight
## 7 0 3 11 11 MISC American
## 8 0 4 2 3 DURATION a decade
## 9 0 5 2 2 DURATION years
## 10 0 5 12 13 NUMBER 6 million
## # ... with 3,156 more rows, and 1 more variables: entity_normalized <chr>
The categories available in the field entity_type depend on the models used in the annotation pipeline. The default English model selected by speed codes 2 and above includes the categories “LOCATION”, “PERSON”, “ORGANIZATION”, “MISC”, “MONEY”, “PERCENT”, “DATE”, and “TIME”. The last four of these also have a normalized form, given in the final field of the table. As with the coreference table, a complete representation of the entity is given as a character string due to the difficulty of reconstructing it after the fact from the token table.
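For example, restricting the table to person mentions requires only a filter on this field; a minimal sketch, assuming dplyr is loaded:

library(dplyr)

# extract every mention tagged as a person
filter(get_entity(obama), entity_type == "PERSON")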
Coreferences link sets of tokens that refer to the same underlying person, object, or idea. One common example is the linking of a noun in one sentence to a pronoun in the next sentence. The coreference table describes these relationships but is not strictly a table of coreferences. Instead, each row represents a single mention of an expression and gives a reference id indicating all of the other mentions that it also coreferences. In theory, a given set of tokens might have two separate mentions if it refers to two different classes of references (though this is quite rare). The document, reference, and mention ids serve as a composite key for the table. The schema is given by:
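(schema reconstructed from the example output below; descriptions are inferred)

id            integer    id of the source document
rid           integer    reference id of the coreference class
mid           integer    mention id
mention       character  raw text of the mention
mention_type  character  type of the mention
number        character  grammatical number of the mention
gender        character  grammatical gender of the mention
animacy       character  animacy of the mention
sid           integer    sentence id of the mention
tid           integer    token id of the start of the mention
tid_end       integer    token id of the end of the mention
tid_head      integer    token id of the head of the mention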
Links back into the token table for the start, end and head of the mention are given as well; these are pushed to the right of the table as they should be considered foreign keys within this table.
An example helps to explain exactly what the coreference table represents (the first row is removed as its mention is quite long and makes the table hard to read):
get_coreference(obama)[-1,]
## # A tibble: 6,984 × 12
## id rid mid mention mention_type number gender animacy
## <int> <int> <int> <chr> <chr> <chr> <chr> <chr>
## 1 0 1537 1531 Afghanistan PROPER SINGULAR NEUTRAL INANIMATE
## 2 0 1537 1537 Afghanistan PROPER SINGULAR NEUTRAL INANIMATE
## 3 0 2050 2007 a democracy NOMINAL SINGULAR NEUTRAL INANIMATE
## 4 0 2050 2050 our democracy NOMINAL SINGULAR NEUTRAL INANIMATE
## 5 0 1796 1746 I PRONOMINAL SINGULAR UNKNOWN ANIMATE
## 6 0 1796 1796 I PRONOMINAL SINGULAR UNKNOWN ANIMATE
## 7 0 2053 2022 we PRONOMINAL PLURAL UNKNOWN ANIMATE
## 8 0 2053 2023 our PRONOMINAL PLURAL UNKNOWN ANIMATE
## 9 0 2053 2044 We PRONOMINAL PLURAL UNKNOWN ANIMATE
## 10 0 2053 2046 we PRONOMINAL PLURAL UNKNOWN ANIMATE
## # ... with 6,974 more rows, and 4 more variables: sid <int>, tid <int>,
## # tid_end <int>, tid_head <int>
There is a special relationship between the reference id rid and the mention id mid. The coreference annotator selects a specific mention for each reference class to be treated as the canonical mention for the entire class. The mention id for this mention becomes the reference id for the class, as can be seen in the above table, where rows 2, 4, and 6 each give the canonical mention of their respective classes. This relationship provides a way of identifying the canonical mention within a reference class and of treating the coreference table as pairs of mentions rather than individual mentions joined by a given key.
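In practice this means the canonical mentions can be extracted with a single filter; a minimal sketch, again assuming the dplyr package:

library(dplyr)

# canonical mentions are exactly the rows whose mention id
# equals their reference id
filter(get_coreference(obama), mid == rid)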
The text of the mention itself is included within the table. This was done because a mention may span several tokens, making it otherwise very difficult to extract this information from the token table. It is also possible, though not supported in the current CoreNLP pipeline, for a mention to consist of a set of non-contiguous tokens, in which case this field would be impossible to reconstruct from the token table at all.
The schema for sentence-level information is given by:
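(schema reconstructed from the example output below; descriptions are inferred)

id         integer  id of the source document
sid        integer  sentence id
sentiment  integer  predicted sentiment class of the sentence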
An example of the output can be seen in:
get_sentence(obama)
## # A tibble: 2,988 × 3
## id sid sentiment
## <int> <int> <int>
## 1 0 0 1
## 2 0 1 3
## 3 0 2 1
## 4 0 3 1
## 5 0 4 2
## 6 0 5 1
## 7 0 6 3
## 8 0 7 1
## 9 0 8 1
## 10 0 9 1
## # ... with 2,978 more rows
The underlying sentiment model is a neural network; it assigns each sentence an integer score ranging from 0 (very negative) to 4 (very positive).
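Because each row keys back to its document, sentence-level scores aggregate naturally; for example, a per-document average (a sketch assuming the dplyr package):

library(dplyr)

# average predicted sentiment score within each document
get_sentence(obama) %>%
  group_by(id) %>%
  summarize(sentiment = mean(sentiment))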
Our final table in the data model stores a relatively new concept: word vectors. Also known as word embeddings, these vectors are deterministic maps from the set of all available words into a high-dimensional, real-valued vector space. Words with similar meanings or themes tend to cluster together in this high-dimensional space. For example, we would expect apple and pear to be very close to one another, with vegetables such as carrots, broccoli, and asparagus only slightly farther away. The embeddings can often be used as input features for fitting models on top of textual data. For a more detailed description of these embeddings, see the papers on either of the most well-known examples: GloVe and word2vec. Only the spaCy backend currently supports word vectors; these are turned off by default because they take a significant amount of space to store. The embedding model used is a modification of the GloVe embeddings, mapping words into a 300-dimensional space. To compute the embeddings, the corresponding parameter must be set prior to running the annotation.
Word vectors are stored in a separate table from the tokens table out of convenience rather than as a necessity of preserving the data model’s normalized schema. Due to its size, and because the individual components of a word embedding have no intrinsic meaning, this table is stored as a matrix.

dim(get_vector(obama))

The first three columns hold the keys id, sid, and tid, respectively; the remaining columns store the components of the embedding itself. If no embedding is computed, the function returns an empty matrix.