‘soundcorrs’ is a small package whose purpose in life is to help linguists analyse sound correspondences between languages. It does not attempt to draw any conclusions on its own; this responsibility is placed entirely on the user. ‘soundcorrs’ merely automates and facilitates certain tasks, such as preparing the material part of a paper or looking for examples of specific correspondences, and, by making various functions available, suggests possible paths of analysis which may not be immediately obvious to the more traditional linguist.
This vignette assumes that the reader not only is a linguist and has at least a general idea of what kind of outputs he or she might want from ‘soundcorrs’, but also has at least a passing familiarity with R and a basic understanding of statistics. Most problems can probably be read up on as they appear in the text, but it is nevertheless recommended to first acquaint oneself very briefly with R, for example by reading the first page of Quick-R, R Tutorial, or another R primer. In particular, it is assumed that the reader knows how to access and understand the built-in documentation, as not all arguments are discussed here.
A less technical introduction to ‘soundcorrs’ is also available in Stachowski K. 2020. Tools for Semi-Automatic Analysis of Sound Correspondences: The soundcorrs Package for R. Glottometrics 49. 66–86. If you use ‘soundcorrs’ in your research, please cite this paper.
The first section of this vignette briefly discusses how to prepare data for ‘soundcorrs’. The second section is an overview of all the analytic functions exported by ‘soundcorrs’, organized by their output, and of the helper functions, in alphabetical order.
‘soundcorrs’ functions operate on pairs/triples/… of words which come from different languages. The discussion below will use ‘L1’ to refer to the first language in the dataset, ‘L2’ to the second, etc.
Naturally, all the examples given below assume that ‘soundcorrs’ is installed and loaded:
# install.packages ("soundcorrs")
library (soundcorrs)
#> NOTE. Version 0.2.0 introduces some important changes.
#> Please consult https://cran.r-project.org/web/packages/soundcorrs/NEWS and run vignette("soundcorrs").
‘soundcorrs’ requires two kinds of data: transcription and word pairs/triples/…. Both are stored in tsv files, i.e. as tab-separated tables in text files.
Under BSD, Linux, and macOS, the recommended encoding is UTF-8. Unfortunately, it has been found to cause problems under Windows, so Windows users are advised to not use characters outside of the ASCII standard. Some issues can be fixed by converting from UTF-8 to UTF-8 (sic!) with ‘iconv()’, but others resist this and other treatments. Future versions of ‘soundcorrs’ hope to include a solution for this problem.
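The conversion mentioned above can be sketched as follows; ‘problematic.column’ is a hypothetical character vector standing in for whichever part of the data misbehaves.
# re-encode a character vector from UTF-8 to UTF-8 with ‘iconv()’;
# ‘problematic.column’ is a made-up name used only for this sketch
# fixed <- iconv (problematic.column, from="UTF-8", to="UTF-8")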
Transcription is not strictly necessary for the functioning of ‘soundcorrs’, but without it linguistic regular expressions (“wildcards”) could not be defined, and involving phonetics in the analysis would be more difficult. Transcription is stored in tsv files with two or three columns:
GRAPHEME, which contains the graphemes. Characters used by R as metacharacters in regular expressions, i.e. . + * ^ $ ? | ( ) [ ] { }, are not allowed. Multigraphs should also not be used, as they can lead to unexpected and incorrect results, especially in the case of metacharacters (“wildcards”).
VALUE, which contains a comma-separated list of features of the given grapheme. These are intended to be phonetic but do not necessarily have to be so. If the column META is missing, it is generated based on the column VALUE.
META, which contains a regular expression covering all the graphemes the given grapheme is meant to represent. For regular graphemes, this is simply the grapheme itself. For a metacharacter, such as ‘C’ for ‘any consonant’, this needs to be a listing of all consonantal graphemes in the transcription, formatted as a regular expression. It is recommended to leave this column empty, in which case ‘soundcorrs’ will generate it automatically.
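To make the format more tangible, here is a small sketch that builds a toy transcription with two regular graphemes and one metacharacter and writes it as a tab-separated file; the graphemes, feature labels, and file name are invented for this example, and the bundled sample files discussed below remain the authoritative reference.
# a toy transcription: the features in VALUE are made up for the example,
# and the META column is omitted so that ‘soundcorrs’ generates it itself
toy <- data.frame (GRAPHEME = c("a", "p", "C"),
                   VALUE    = c("vow,open", "cons,stop", "cons"))
write.table (toy, "toy-trans.tsv", sep="\t", row.names=FALSE, quote=FALSE)
# toy.trans <- read.transcription ("toy-trans.tsv")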
‘soundcorrs’ contains two sample transcription files: ‘trans-common.tsv’ and ‘trans-ipa.tsv’. Both only cover the basics and are intended more as an illustration than anything else. To load one of them:
# establish the paths of the samples included in ‘soundcorrs’
path.trans.com <- system.file ("extdata", "trans-common.tsv", package="soundcorrs")
path.trans.ipa <- system.file ("extdata", "trans-ipa.tsv", package="soundcorrs")
# and load them
trans.com <- read.transcription (path.trans.com)
#> Warning in transcription(data, col.grapheme, col.meta, col.value): Multiple
#> graphemes for values: [cons], [vow].
trans.ipa <- read.transcription (path.trans.ipa)
#> Warning in transcription(data, col.grapheme, col.meta, col.value): Missing the
#> metacharacters column. The "META" column was generated.
# transcription needs to be an object of class ‘transcription’
class (trans.com)
#> [1] "transcription"
# a basic summary
trans.com#> A "transcription" object.
#> File: /tmp/RtmpRzwbSc/Rinst5558a4db0ff/soundcorrs/extdata/trans-common.tsv.
#> Graphemes: 78.
# ‘data’ is the original data frame
# ‘cols’ is a guide to column names in ‘data’
# ‘zero’ are the characters denoting the linguistic zero
str (trans.com, max.level=1)
#> List of 3
#> $ data:'data.frame': 78 obs. of 3 variables:
#> $ cols:List of 3
#> $ zero: chr "-"
#> - attr(*, "class")= chr "transcription"
#> - attr(*, "file")= chr "/tmp/RtmpRzwbSc/Rinst5558a4db0ff/soundcorrs/extdata/trans-common.tsv"
Like the transcription, the data are also stored in tsv files. Two formats are theoretically possible: the “long format” in which every word is given its own row, and the “wide format” in which one row holds a pair/triple/… of words (see below).
For most tasks, words need to be segmented, and all words in a pair/triple/… must have the same number of segments. The default segment separator is ‘|’. If the words are not segmented, the function ‘addSeparators()’ can be used to facilitate the process of manual segmentation and alignment (see below). Tools for automatic alignment also exist (e.g. alineR, LingPy, PyAline), but it is recommended that their results be thoroughly checked by a human. Apart from the segmented and aligned form, each word must be assigned a language.
Hence, the two obligatory columns in the “long format” are
ALIGNED, which holds the segmented and aligned word, and
LANGUAGE, which holds the name of the language.
In the “wide format”, similarly, a minimum of two columns is necessary, each holding words from a different language. The information about which column holds which language can then be encoded simply as column names (e.g. ‘LATIN’), or in the form of a suffix attached to the names (e.g. ‘ALIGNED.Latin’).
Regarding the two formats, see also ‘long2wide()’ and ‘wide2long()’ below.
It is possible, though not necessarily recommended, to store the data from each language in a separate file; it is also possible to use a different transcription for each language. This flexibility could easily lead to a rather cumbersome string of arguments for the reader function, so ‘read.soundcorrs()’ is limited to reading the data for just one language. Individual ‘soundcorrs’ objects can then be combined into one using the ‘merge()’ function. The reader function only accepts data in the “wide format”.
‘soundcorrs’ has three sample datasets: 1. the entirely made-up ‘data-abc.tsv’; 2. ‘data-capitals.tsv’, which contains the names of EU capitals in German, Polish, and Spanish – from the linguistic point of view this of course makes no sense; it is merely an example that will hopefully not be seen as too exotic regardless of which language or languages the user specializes in (my gratitude is due to José Andrés Alonso de la Fuente, PhD (Cracow, Poland) for help with the Spanish data); and 3. ‘data-ie.tsv’, with a dozen examples of Grimm’s and Verner’s laws (adapted from Campbell L. 2013. Historical Linguistics. An Introduction. Edinburgh University Press. Pp. 136f). The ‘abc’ dataset is in the “long format”; the ‘capitals’ and ‘ie’ datasets are in the “wide format”. All three are also available as the preloaded datasets ‘sampleSoundCorrsData.abc’, ‘sampleSoundCorrsData.capitals’, and ‘sampleSoundCorrsData.ie’.
# establish the paths of the three datasets
path.abc <- system.file ("extdata", "data-abc.tsv", package="soundcorrs")
path.cap <- system.file ("extdata", "data-capitals.tsv", package="soundcorrs")
path.ie <- system.file ("extdata", "data-ie.tsv", package="soundcorrs")
# read “capitals”
d.cap.ger <- read.soundcorrs (path.cap, "German", "ALIGNED.German", path.trans.com)
#> Warning in transcription(data, col.grapheme, col.meta, col.value): Multiple
#> graphemes for values: [cons], [vow].
#> Warning in soundcorrs(data, name, col.aligned,
#> read.transcription(transcription), : The following segments are not covered by
#> the transcription: jus, ŋk.
d.cap.pol <- read.soundcorrs (path.cap, "Polish", "ALIGNED.Polish", path.trans.com)
#> Warning in transcription(data, col.grapheme, col.meta, col.value): Multiple
#> graphemes for values: [cons], [vow].
#> Warning in soundcorrs(data, name, col.aligned,
#> read.transcription(transcription), : The following segments are not covered by
#> the transcription: ń, ẃ.
d.cap.spa <- read.soundcorrs (path.cap, "Spanish", "ALIGNED.Spanish", path.trans.com)
#> Warning in transcription(data, col.grapheme, col.meta, col.value): Multiple
#> graphemes for values: [cons], [vow].
#> Warning in soundcorrs(data, name, col.aligned,
#> read.transcription(transcription), : The following segments are not covered by
#> the transcription: ð, ja, ŋk.
d.cap <- merge (d.cap.ger, d.cap.pol, d.cap.spa)
# read “ie”
d.ie.lat <- read.soundcorrs (path.ie, "Lat", "LATIN", path.trans.com)
#> Warning in transcription(data, col.grapheme, col.meta, col.value): Multiple
#> graphemes for values: [cons], [vow].
d.ie.eng <- read.soundcorrs (path.ie, "Eng", "ENGLISH", path.trans.ipa)
#> Warning in transcription(data, col.grapheme, col.meta, col.value): Missing the
#> metacharacters column. The "META" column was generated.
#> Warning in soundcorrs(data, name, col.aligned,
#> read.transcription(transcription), : The following segments are not covered by
#> the transcription: eɪ, ɪ, aʊ, uː, ɑː, ʊ, iː.
d.ie <- merge (d.ie.lat, d.ie.eng)
# read “abc”
tmp <- long2wide (read.table(path.abc,header=T), skip=c("ID"))
d.abc.l1 <- soundcorrs (tmp, "L1", "ALIGNED.L1", trans.com)
d.abc.l2 <- soundcorrs (tmp, "L2", "ALIGNED.L2", trans.com)
d.abc <- merge (d.abc.l1, d.abc.l2)
# a basic summary
d.abc.l1
#> A "soundcorrs" object.
#> Languages (1): L1.
#> Entries: 6.
#> Columns (7): ID, DIALECT.L1, ALIGNED.L1, ORTHOGRAPHY.L1, DIALECT.L2, ALIGNED.L2, ORTHOGRAPHY.L2.
d.abc#> A "soundcorrs" object.
#> Languages (2): L1, L2.
#> Entries: 6.
#> Columns (7): ID, DIALECT.L1, ALIGNED.L1, ORTHOGRAPHY.L1, DIALECT.L2, ALIGNED.L2, ORTHOGRAPHY.L2.
# ‘cols’ are the names of important columns
# ‘data’ is the original data frame
# ‘names’ are the names of the languages
# ‘segms’ are words exploded into segments; ‘$z’ is a variant with linguistic zeros; ‘$nz’ without them
# ‘segpos’ is a lookup list to check which character belongs to which segment; ‘$z’ is a variant with linguistic zeros; ‘$nz’ without them
# ‘separators’ are the strings used as segment separators
# ‘trans’ are ‘transcription’ objects
# ‘words’ are words obtained by removing separators from the ‘col.aligned’ column; ‘$z’ is a variant with linguistic zeros; ‘$nz’ without them
str (d.abc, max.level=1)
#> List of 8
#> $ cols :List of 2
#> $ data :'data.frame': 6 obs. of 7 variables:
#> $ names : chr [1:2] "L1" "L2"
#> $ segms :List of 2
#> $ segpos :List of 2
#> $ separators: chr [1:2] "\\|" "\\|"
#> $ trans :List of 2
#> $ words :List of 2
#> - attr(*, "class")= chr "soundcorrs"
‘soundcorrs’ exports several functions intended for linguistic analysis. For easier orientation, they are organized below by the kind of output they produce rather than by their names. ‘soundcorrs’ also exports several functions whose usefulness for linguistic analysis, in and of themselves, is rather limited. Those are grouped in one subsection at the end, and discussed in alphabetical order.
There are three different functions in ‘soundcorrs’ that produce contingency tables, and there is some logic behind this division: ‘summary()’ is only meant to give a general overview of the dataset; ‘coocc()’ is the essential contingency table function; and ‘allCooccs()’ produces an output that is meant to be printed rather than read from the screen.
‘summary()’ produces a segment-to-segment contingency table. The values may represent either how many times the given segments correspond to each other (‘unit=“o”’), or in how many words they correspond to each other (‘unit=“w”’). This distinction exists because it is quite possible for a segment to appear more than once in a single word. The argument ‘unit’ accepts nine different values: ‘“o(cc(ur(ence(s))))”’ and ‘“w(or(d(s)))”’. One more argument can be given to ‘summary()’: ‘count’, which determines whether the values are given as absolute or as relative counts. It accepts six values: ‘“a(bs(olute))”’ and ‘“r(el(ative))”’.
Note that ‘summary()’ reports how many times the given segments correspond to each other – not how often they co-occur in the same word. For example, in a pair L1 “a|b” : L2 “c|d”, the “a”/“d” cell will show 0 because L1 “a” never corresponds directly to L2 “d”. This is a different perspective than in ‘coocc()’ below.
# a general overview of the dataset as a whole
summary (d.abc)
#> L2
#> L1 a b c o u w ə
#> - 0 0 0 0 0 0 2
#> a 4 0 0 1 1 0 0
#> b 0 5 0 0 0 1 0
#> c 0 0 6 0 0 0 0
# words are the default ‘unit’
summary (d.abc, unit="o")
#> L2
#> L1 a b c o u w ə
#> - 0 0 0 0 0 0 2
#> a 6 0 0 1 2 0 0
#> b 0 5 0 0 0 1 0
#> c 0 0 6 0 0 0 0
# in relative values …
rels <- summary (d.abc, count="r")
round (rels, 2)
#> L2
#> L1 a b c o u w ə
#> - 0.00 0.00 0.00 0.00 0.00 0.00 1.00
#> a 0.67 0.00 0.00 0.17 0.17 0.00 0.00
#> b 0.00 0.83 0.00 0.00 0.00 0.17 0.00
#> c 0.00 0.00 1.00 0.00 0.00 0.00 0.00
# … relative to entire rows
apply (rels, 1, sum)
#> - a b c
#> 1 1 1 1
‘coocc()’ has two modes: internal and external comparison. The former, invoked when ‘column=NULL’ (the default), cross-tabulates correspondences with themselves. The latter cross-tabulates correspondences with metadata taken from a column in the dataset whose name is given as the argument ‘column’. Like ‘summary()’ above, ‘coocc()’ has the argument ‘unit’, with the same meaning, and the argument ‘count’, which may appear to work a little differently; in actuality, its use with ‘summary()’ was a special case. The general idea is that the entire table is divided into blocks such that all rows represent correspondences of the same segment and, in the internal mode, so do all the columns.
Note that ‘coocc()’ reports how many times the given correspondences co-occur in the same word – not how often they appear in the entire dataset. For example, in a pair L1 “a|b” : L2 “c|d”, the “a:c”/“b:d” cell will show 1 because the correspondence L1 “a” : L2 “c” co-occurs with L1 “b” : L2 “d” in one word. This is a different perspective than in ‘summary()’ above.
# a general look in the internal mode
coocc (d.abc)
#> L1_L2
#> L1_L2 -_ə a_a a_o a_u b_b b_w c_c
#> -_ə 0 2 0 0 2 0 2
#> a_a 2 2 0 0 4 0 4
#> a_o 0 0 0 0 1 0 1
#> a_u 0 0 0 1 0 1 1
#> b_b 2 4 1 0 0 0 5
#> b_w 0 0 0 1 0 0 1
#> c_c 2 4 1 1 5 1 0
# now with metadata
coocc (d.abc, "DIALECT.L2")
#> DIALECT.L2
#> L1_L2 north south std
#> -_ə 0 2 0
#> a_a 0 2 2
#> a_o 1 0 0
#> a_u 1 0 0
#> b_b 1 2 2
#> b_w 1 0 0
#> c_c 2 2 2
# in the internal mode,
# the relative values are with regard to segment-to-segment blocks
tab <- coocc (d.abc, count="r")
rows.a <- which (rownames(tab) %hasPrefix% "a")
cols.b <- which (colnames(tab) %hasPrefix% "b")
sum (tab [rows.a, cols.b])
#> [1] 1
# there are four different segments in L1
sum (tab)
#> [1] NaN
# if two correspondences never co-occur, the relative value is 0/0
# which R represents as ‘NaN’, and prints as empty space
coocc (d.abc, count="r")
#> L1_L2
#> L1_L2 -_ə a_a a_o a_u b_b b_w c_c
#> -_ə 1.0000000 0.0000000 0.0000000 1.0000000 0.0000000 1.0000000
#> a_a 1.0000000 0.6666667 0.0000000 0.0000000 0.6666667 0.0000000 0.6666667
#> a_o 0.0000000 0.0000000 0.0000000 0.0000000 0.1666667 0.0000000 0.1666667
#> a_u 0.0000000 0.0000000 0.0000000 0.3333333 0.0000000 0.1666667 0.1666667
#> b_b 1.0000000 0.6666667 0.1666667 0.0000000 0.8333333
#> b_w 0.0000000 0.0000000 0.0000000 0.1666667 0.1666667
#> c_c 1.0000000 0.6666667 0.1666667 0.1666667 0.8333333 0.1666667
# in the external mode,
# the relative values are with regard to blocks of rows, and all columns
coocc (d.abc, "DIALECT.L2", count="r")
tab <- which (rownames(tab) %hasPrefix% "a")
rows.a <-sum (tab [rows.a, ])
#> [1] 1
‘allCooccs()’ splits a table produced by ‘coocc()’ into blocks, each containing the correspondences of one segment. Its primary purpose is to facilitate the application of tests of independence, for which see ‘lapplyTest()’ below.
‘allCooccs()’ takes all the same arguments as ‘coocc()’: ‘column’, ‘count’, and ‘unit’. In addition, it takes the argument ‘bin’ which determines whether the table should be just cut up, or whether all the resulting slices should also be binned.
The return value of ‘allCooccs()’ is a list which holds all the resulting tables, under names composed from the correspondences joined with underscores. If ‘column = NULL’ and ‘bin = F’, the names are simply ‘a’, ‘b’, &c.; if ‘bin = T’, they take the form ‘a_b_c_d’, meaning L1 ‘a’ : L2 ‘b’ cross-tabulated with L1 ‘c’ : L2 ‘d’, and so on. If ‘column’ is not ‘NULL’, the names take the form ‘a_b_northern’, meaning L1 ‘a’ : L2 ‘b’ tabulated against the ‘northern’ dialect, and so forth.
# for a small dataset, the result is going to be small
str (allCooccs(d.abc), max.level=0)
#> List of 34
# but it can grow quite quickly with a larger dataset
str (allCooccs(d.cap), max.level=0)
#> List of 5614
# the naming scheme
names (allCooccs(d.abc))
#> [1] "-_ə_a_a" "-_ə_a_o" "-_ə_a_u" "-_ə_b_b" "-_ə_b_w" "-_ə_c_c" "a_a_-_ə"
#> [8] "a_a_b_b" "a_a_b_w" "a_a_c_c" "a_o_-_ə" "a_o_b_b" "a_o_b_w" "a_o_c_c"
#> [15] "a_u_-_ə" "a_u_b_b" "a_u_b_w" "a_u_c_c" "b_b_-_ə" "b_b_a_a" "b_b_a_o"
#> [22] "b_b_a_u" "b_b_c_c" "b_w_-_ə" "b_w_a_a" "b_w_a_o" "b_w_a_u" "b_w_c_c"
#> [29] "c_c_-_ə" "c_c_a_a" "c_c_a_o" "c_c_a_u" "c_c_b_b" "c_c_b_w"
# and with ‘column’ not ‘NULL’
names (allCooccs(d.abc,column="DIALECT.L2"))
#> [1] "-_ə_north" "-_ə_south" "-_ə_std" "a_a_north" "a_a_south" "a_a_std"
#> [7] "a_o_north" "a_o_south" "a_o_std" "a_u_north" "a_u_south" "a_u_std"
#> [13] "b_b_north" "b_b_south" "b_b_std" "b_w_north" "b_w_south" "b_w_std"
#> [19] "c_c_north" "c_c_south" "c_c_std"
‘soundcorrs’ has three functions to look for specific examples. ‘findExamples()’ searches for words which exhibit a given correspondence; ‘findPairs()’ is a convenience wrapper for ‘findExamples()’ for when there are only two languages in the dataset; and ‘allPairs()’ produces an almost print-ready summary of the dataset, complete with tables and all the examples.
‘findExamples()’ searches a dataset for pairs/triples/… which exhibit a specific sound correspondence. It can take several arguments, which can be divided into three groups.
The first group is just one obligatory argument, ‘data’. This is the dataset, and it must be a ‘soundcorrs’ object.
The second group are the queries. There must be as many of them as there are languages in ‘data’, and they must be given in the same order as the languages in ‘data’. For example, if the dataset contains data from English and Polish, there need to be two queries: the first to be looked for in the English data, and the second in the Polish data. All queries can be regular expressions, either as defined in R or using the custom metacharacters defined in the transcription. They can also be empty strings, which ‘findExamples()’ interprets as permission to accept anything.
The third group are optional arguments which define how the data are sifted and displayed. These arguments can only be used with a name (f.ex. ‘findExamples(data,“a”,“a”,cols=“all”)’) because R does not know a priori how many queries there are going to be.
The ‘cols’ argument defines which columns of the original data are displayed. It can be a vector of strings, “all”, or “aligned”. The last option is the default one.
The next two arguments are ‘distance.start’ and ‘distance.end’. These define the maximum permitted distance, in segments, between the matches. Let us use as example French “f|r|ã|-|s” and English “f|r|ā|n|s”, and imagine that we want to find cases where a French vowel corresponds to a vowel-n sequence in English. The part of the French word that interests us is segment number 3; its counterpart in the English word starts on segment 3 and ends on segment 4. The distance between the starts is 0, and the distance between the ends is 1. ‘findExamples()’ will find our pair of words if ‘distance.start’ is set to 0 or more, and ‘distance.end’ to 1 or more.
Both arguments can also take negative values, which means that the distance is not checked at all. These are in fact the defaults (-1). Effectively, the default behaviour of ‘findExamples()’ is to find any pair/triple/… in which the first word contains the first query, the second word the second query, and so on, regardless of whether the matches appear in corresponding segments. For the example above, ‘findExamples(dataset,“f”,“r”)’ also returns a match. It may therefore seem irresponsible to set the default values of both arguments to -1, but in my experience this very rarely produces false positives. On the other hand, the opposite behaviour (both arguments set to 0) may easily result in false negatives, which are not only much less intuitive, but also never give the user a chance to spot the problem, as the affected pairs are simply not displayed.
The next argument is ‘na.value’, which determines how missing values (‘NA’) are treated. This argument can only have one of two values: -1 and 0. The former means that missing values are considered non-matches, the latter that they are considered matches; the latter is the default. Note that an empty-string query takes precedence over ‘na.value’, that is, even when ‘na.value’ is set to -1, ‘NA’s will show up in the results when the query is an empty string.
The last optional argument is ‘zeros’ which can be set to ‘TRUE’ or ‘FALSE’ (the default). The former means that search is performed on words with linguistic zeros in them. In the example above, the query “Vs” would find “f|r|ã|-|s” only if ‘zeros’ were set to ‘FALSE’.
The output of ‘findExamples()’ is a list with two fields: ‘data’ which is a data frame with matching examples, and ‘which’, a logical vector showing which examples in the original dataset were a match. The class of the return value is ‘df.findExamples’; this is purely for technical reasons, to allow for a more legible printed output.
See also ‘findPairs()’, a convenience wrapper around ‘findExamples()’.
# “ab” spans segments 1–2, while “a” only occupies segment 1
findExamples (d.abc, "ab", "a", distance.end=0)
#> No matches found.
findExamples (d.abc, "ab", "a", distance.end=1)
#> ALIGNED.L1 ALIGNED.L2
#> 1 a|b|c a|b|c
#> 2 a|b|a|c a|b|a|c
#> 5 a|b|c|- a|b|c|ə
#> 6 a|b|a|c|- a|b|a|c|ə
# linguistic zeros cannot be found if ‘zeros’ is set to ‘FALSE’
findExamples (d.abc, "-", "", zeros=T)
#> ALIGNED.L1 ALIGNED.L2
#> 5 a|b|c|- a|b|c|ə
#> 6 a|b|a|c|- a|b|a|c|ə
findExamples (d.abc, "-", "", zeros=F)
#> No matches found.
# both the usual and custom regular expressions are permissible
findExamples (d.abc, "a", "[ou]")
#> ALIGNED.L1 ALIGNED.L2
#> 3 a|b|c o|b|c
#> 4 a|b|a|c u|w|u|c
findExamples (d.abc, "a", "O")
#> ALIGNED.L1 ALIGNED.L2
#> 3 a|b|c o|b|c
#> 4 a|b|a|c u|w|u|c
# the output is actually a list
str (findExamples(d.abc,"a","a"), max.level=1)
#> List of 2
#> $ data :'data.frame': 4 obs. of 2 variables:
#> $ which: logi [1:6] TRUE TRUE FALSE FALSE TRUE TRUE
#> - attr(*, "class")= chr "df.findExamples"
# ‘data’ is what is displayed on the screen
# ‘which’ is useful for subsetting
subset (d.abc, findExamples(d.abc,"a","O")$which)
#> A "soundcorrs" object.
#> Languages (2): L1, L2.
#> Entries: 2.
#> Columns (7): ID, DIALECT.L1, ALIGNED.L1, ORTHOGRAPHY.L1, DIALECT.L2, ALIGNED.L2, ORTHOGRAPHY.L2.
# ‘which’ can also be used to find examples
# that exhibit more than one correspondence.
findExamples (d.cap, "a", "a", "a", distance.start=0, distance.end=0)$which
aaa <- findExamples (d.cap, "b", "b", "b", distance.start=0, distance.end=0)$which
bbb <-$data [aaa & bbb,]
d.cap#> ALIGNED.German ORTHOGRAPHY.German ALIGNED.Polish
#> 22 b|r|a|t|ī|s|l|a|v|a Bratislava b|r|a|t|y|s|w|a|v|a
#> 24 b|ū|d|a|p|ä|s|t Budapest b|u|d|a|p|e|š|t
#> ORTHOGRAPHY.Polish ALIGNED.Spanish ORTHOGRAPHY.Spanish OFFICIAL.LANGUAGE
#> 22 Bratysława b|r|a|t|i|z|l|a|β|a Bratislava Slovak
#> 24 Budapeszt b|u|ð|a|p|e|s|t Budapest Hungarian
# the ‘cols’ argument can be used to alter the printed output
findExamples (d.abc, "a", "O", cols=c("ORTHOGRAPHY.L1","ORTHOGRAPHY.L2"))
#> ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> 3 abc åbc
#> 4 abac uwuc
This is a convenience wrapper around ‘findExamples()’ which can only be used with datasets that contain data from exactly two languages. Instead of the several optional arguments of ‘findExamples()’, ‘findPairs()’ only has one. It is called ‘exact’, and it can take several different values.
The default is the inexact mode (‘exact’ set to 0 or ‘FALSE’). It corresponds to ‘distance.start’ and ‘distance.end’ both set to -1, ‘na.value’ set to 0, and ‘zeros’ set to ‘FALSE’, which are also the default settings in ‘findExamples()’. The risk here is false positives. In my experience, however, they are rare, and because they are displayed, the user has a chance to spot them.
The opposite is the exact mode (‘exact’ set to 1 or ‘TRUE’), which corresponds to ‘distance.start’ and ‘distance.end’ both set to 0, ‘na.value’ set to -1, and ‘zeros’ to ‘TRUE’. The risk here is false negatives, which in my experience are both much more common than false positives in the inexact mode, and effectively impossible to spot, as the affected pairs are simply not displayed.
A middle ground is the semi-exact mode (‘exact’ set to 0.5), where ‘distance.start’ and ‘distance.end’ are both set to 1, ‘na.value’ is set to 0, and ‘zeros’ to ‘FALSE’. It decreases the risk of false positives while increasing only a little the risk of false negatives.
Apart from the above, ‘findPairs()’ also has the parameter ‘cols’, whose value is passed directly to ‘findExamples()’.
The output of ‘findPairs()’ is the same as the output of ‘findExamples()’.
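Since ‘findPairs()’ itself is not demonstrated elsewhere in this vignette, here is a minimal sketch on the two-language ‘abc’ dataset; the queries mirror the ‘findExamples()’ calls shown earlier, and the output is omitted.
# the same search as findExamples (d.abc, "a", "O") above, via the wrapper
# findPairs (d.abc, "a", "O")
# the semi-exact mode, with ‘cols’ passed through to findExamples()
# findPairs (d.abc, "a", "O", exact=0.5, cols=c("ORTHOGRAPHY.L1","ORTHOGRAPHY.L2"))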
‘allPairs()’ does not have great analytic value in itself, but it can be useful when writing a paper e.g. on the phonetic adaptation of loanwords, to prepare its material part.
The output of ‘allPairs()’ consists of sections devoted to each segment, filled with a general contingency table of its various renderings, and followed by subsections which list all pairs exhibiting the given correspondence. ‘soundcorrs’ provides functions to format such output in HTML or in LaTeX, or not at all. Custom formatters are also not very difficult to write.
Tables can show the number of occurrences or the number of words in which the given correspondence manifests itself (‘unit’), in absolute or in relative terms (‘count’; all three with values as with ‘summary()’). Which columns are printed can be modified with ‘cols’, and whether to write to a file or to the screen, with ‘file’ (‘NULL’ meaning the screen). Lastly, the formatting is controlled by a special function, of which ‘soundcorrs’ provides three: ‘formatter.none()’, ‘formatter.html()’, and ‘formatter.latex()’. A custom formatter can also take additional arguments, which will be passed to it from the call to ‘allPairs()’.
# and see what result this gives
allPairs (d.abc, cols=c("ORTHOGRAPHY.L1","ORTHOGRAPHY.L2"))
#> section [1] "-"
#> table ə
#> table 2
#> subsection [1] "-" "ə"
#> data.frame ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> data.frame 5 abc abca
#> data.frame 6 abac abaca
#> section [1] "a"
#> table a o u
#> table 4 1 1
#> subsection [1] "a" "a"
#> data.frame ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> data.frame 1 abc abc
#> data.frame 2 abac abac
#> data.frame 5 abc abca
#> data.frame 6 abac abaca
#> subsection [1] "a" "o"
#> data.frame ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> data.frame 3 abc åbc
#> subsection [1] "a" "u"
#> data.frame ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> data.frame 4 abac uwuc
#> section [1] "b"
#> table b w
#> table 5 1
#> subsection [1] "b" "b"
#> data.frame ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> data.frame 1 abc abc
#> data.frame 2 abac abac
#> data.frame 3 abc åbc
#> data.frame 5 abc abca
#> data.frame 6 abac abaca
#> subsection [1] "b" "w"
#> data.frame ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> data.frame 4 abac uwuc
#> section [1] "c"
#> table c
#> table 6
#> subsection [1] "c" "c"
#> data.frame ORTHOGRAPHY.L1 ORTHOGRAPHY.L2
#> data.frame 1 abc abc
#> data.frame 2 abac abac
#> data.frame 3 abc åbc
#> data.frame 4 abac uwuc
#> data.frame 5 abc abca
#> data.frame 6 abac abaca
# a clearer result could be obtained by running
# allPairs (d.cap, cols=c("ORTHOGRAPHY.German","ORTHOGRAPHY.Polish"),
# file="~/Desktop/d.cap.html", formatter=formatter.html)
As was mentioned, the “capitals” dataset is linguistically absurd, and so it should not matter that all the Polish names of European capitals are listed as borrowed from German. If, however, one wished to fix this problem, and to do it not by copying the output to a word processor and replacing “>” with “:” there, but rather inside ‘soundcorrs’, this wish can be fulfilled easily enough. First, the existing ‘formatter.html()’ function needs to be written to a file to serve as a base for the new formatter: ‘dput(formatter.html, “~/Desktop/myFormatter.R”)’. Then, the beginning of the first line of this file needs to be changed to something like ‘myFormatter <- function’ …, and finally, the “>” and “<” signs (written in HTML as ‘&gt;’ and ‘&lt;’, respectively) need to be replaced with a colon. All that is then left is to load the new function into R and use it to format the output of ‘allPairs()’:
# load the new formatter function …
# source ("~/Desktop/myFormatter.R")
# … and use it instead of ‘formatter.html()’
# allPairs (d.cap, cols=c("ORTHOGRAPHY.German","ORTHOGRAPHY.Polish"),
# file="~/Desktop/d.cap.html", formatter=myFormatter)
# note that this time the output will not open in the web browser automatically
Two ‘soundcorrs’ functions help automate fitting models to data: the simpler ‘multiFit()’ and the slightly more complex ‘fitTable()’.
‘multiFit()’ fits multiple models to a single dataset. It takes as arguments the dataset, as well as a list of models, in which each element is a list that contains two named fields: ‘formula’ and ‘start’. The latter is a list of lists of starting estimates for the parameters of the model, to be tried in turn in case the previous ones fail to produce a fit. The user can specify the fitting function, as well as pass additional arguments to it.
The return value of ‘multiFit()’ is a list containing the outputs of the fitting function. Warnings and errors, which are suppressed by ‘multiFit()’, are attached to the individual elements of the output as attributes. Technically, the result is of class ‘list.multiFit’ so that it can be passed to ‘summary()’ to produce a table for easier comparison of the fits. The available metrics are ‘aic’, ‘bic’, ‘rss’ (the default), and ‘sigma’. In addition, the output has an attribute ‘depth’; it is intended for ‘summary()’, and should not be changed by the user.
# prepare some random data
set.seed (27)
dataset <- data.frame (X=1:10, Y=1:10 + runif(10,-1,1))
# prepare models to be tested
models <- list (
	"model A" = list( formula="Y~a+X", start=list(list(a=1)) ),
	"model B" = list( formula="Y~a^X", start=list(list(a=-1),list(a=1)) ))
# normally, (-1)^X would produce an error with ‘nls()’
# fit the models to the dataset
fit <- multiFit (models, dataset)
# inspect the results
summary (fit)
#> model A model B
#> rss 4.059485 11.51618
‘fitTable()’ applies ‘multiFit()’ over a table, such as the ones produced by ‘coocc()’ or ‘summary()’. The arguments are: the models, the dataset, margin (as in ‘apply()’: 1 for rows, 2 for columns), the converter function, and additional arguments passed to ‘multiFit()’ (including the fitting function). The converter is a function that turns individual rows or columns of the table into data frames to which models can be fitted. ‘soundcorrs’ provides three simple functions: ‘vec2df.id()’ (the default one), ‘vec2df.hist()’, and ‘vec2df.rank()’. The first one only attaches a list of ‘X’ values, the second one extracts from a histogram the midpoints and counts, and the third one ranks the data. Any function can be used, so long as it takes a numeric vector as the only argument, and returns a data frame. The names of columns in the data frames returned by these three functions are ‘X’ and ‘Y’, something to be borne in mind when defining the formulae of the models.
As with ‘multiFit()’, the return value of ‘fitTable()’ is a list of the outputs of the fitting function, only in the case of ‘fitTable()’ it is nested. It, too, can be passed to ‘summary()’ to produce a convenient table.
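As an illustration of the converter requirement described above, here is a sketch of a custom converter; the name ‘vec2df.log’ and the log transformation are invented for this example.
# a hypothetical converter: takes a numeric vector, returns a data frame
# with the same ‘X’ and ‘Y’ column names as the built-in converters
vec2df.log <- function (vec)
	data.frame (X=seq_along(vec), Y=log1p(vec))
# it could then be used in place of e.g. vec2df.hist:
# fit <- fitTable (models, dataset, 1, vec2df.log)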
# prepare the data
dataset <- coocc (sampleSoundCorrsData.abc)
# prepare the models to be tested
models <- list (
	"model A" = list( formula="Y~a*(X+b)^2", start=list(list(a=1,b=1)) ),
	"model B" = list( formula="Y~a*(X-b)^2", start=list(list(a=1,b=1)) ))
# vanilla nls() often requires fairly accurate starting estimates
# fit the models to the dataset
fit <- fitTable (models, dataset, 1, vec2df.hist)
# inspect the results
summary (fit, metric="sigma")
#> -_ə a_a a_o a_u b_b b_w c_c
#> model A NA 1.272453 NA NA NA NA NA
#> model B 0.4291194 1.03122 0.9342932 0.72328 0.5919122 0.9342932 0.5919122
In addition to the analytic functions, ‘soundcorrs’ also exports several helpers. Let us now briefly discuss those, this time simply in alphabetical order.
As was mentioned above, automatic segmentation and alignment requires careful supervision, and it may prove in the end to be easier to do by hand. ‘addSeparators()’ can facilitate the first half of this task by interspersing a vector of character strings with a separator.
# using the default ‘|’ …
addSeparators (d.abc$data$ORTHOGRAPHY.L1)
#> [1] "a|b|c" "a|b|a|c" "a|b|c" "a|b|a|c" "a|b|c" "a|b|a|c"
# … or a full stop
addSeparators (d.abc$data$ORTHOGRAPHY.L1, ".")
#> [1] "a.b.c" "a.b.a.c" "a.b.c" "a.b.a.c" "a.b.c" "a.b.a.c"
It may sometimes be the case that the data are insufficient for a test of independence, or that the contingency table is too diversified to draw concrete conclusions from. ‘binTable()’ takes one or more rows and one or more columns as arguments, leaves those rows and columns unchanged, and sums up all the others.
# build a table for a slightly larger dataset
tab <- coocc (d.cap)
# let us focus on L1 a and o
rows <- which (rownames(tab) %hasPrefix% "a")
cols <- which (colnames(tab) %hasPrefix% "o")
binTable (tab, rows, cols)
#> o_o_o non-o_o_o
#> a_a_a 0 57
#> a_a_o 0 6
#> a_a_u 0 5
#> other 16 1041
# or on all a-like and o-like vowels
which (rownames(tab) %hasPrefix% "[aāäǟ]")
rows <- which (colnames(tab) %hasPrefix% "[oōöȫ]")
cols <-binTable (tab, rows, cols)
#> o_o_o ō_o_o ō_y_o other
#> a_a_a 0 1 0 56
#> a_a_o 0 0 0 6
#> a_a_u 0 0 0 5
#> ä_e_e 0 0 0 36
#> ā_-_- 1 0 0 6
#> ā_a_a 0 2 0 47
#> other 15 16 3 931
Metacharacters defined in the transcription (“wildcards”) can be used inside a ‘findPairs()’ query, but they can also be used with ‘grep()’ or any other function. ‘expandMeta()’ is a little function that translates them into regular expressions that vanilla R can understand.
# let us search a column other than the one specified as ‘aligned’
d.abc$data [, "ORTHOGRAPHY.L2"]
orth <-
# look for all VCC sequences
expandMeta(d.cap$trans[[1]],"VCC")
query <-grep(query,orth)]
orth [#> [1] "abc" "abca"
# look for all VCC words
expandMeta(d.cap$trans[[1]],"^VCC$")
query <-grep(query,orth)]
orth [#> [1] "abc"
‘%hasPrefix%’ checks if a string begins with another string. In ‘soundcorrs’, this can be useful for extracting specific rows and columns from a contingency table.
# build a table for a slightly larger dataset
tab <- coocc (d.cap)
# it is quite difficult to read as a whole, so let us focus
# on a-like vowels in L1 and s-like consonants in L2
rows <- which (rownames(tab) %hasPrefix% "[aāäǟ]")
cols <- which (colnames(tab) %hasPrefix% "[sśš]")
tab [rows, cols]
#> German_Polish_Spanish
#> German_Polish_Spanish s_s_s s_s_z s_z_z s_š_s š_š_s
#> a_a_a 1 1 0 1 1
#> a_a_o 0 0 0 0 1
#> a_a_u 0 0 0 0 0
#> ä_e_e 0 0 0 2 0
#> ā_-_- 0 0 1 0 0
#> ā_a_a 0 0 0 2 0
‘%hasSuffix%’ works nearly the same as ‘%hasPrefix%’, only instead of the beginning of a word, it looks at its end.
# build a table for a slightly larger dataset
tab <- coocc (d.cap)
# it is quite difficult to read as a whole, so let us focus
# on what corresponds to a-like vowels in L1 and s-like consonants in L2
rows <- which (rownames(tab) %hasSuffix% "[aāäǟ]")
cols <- which (colnames(tab) %hasSuffix% "[sśš]")
tab [rows, cols]
#> German_Polish_Spanish
#> German_Polish_Spanish -_-_s s_s_s s_š_s z_s_s z_z_s š_š_s
#> -_-_a 0 0 0 0 0 0
#> -_a_a 1 1 0 0 0 0
#> -_a_ja 0 0 0 0 0 1
#> -_y_a 1 0 0 0 0 0
#> a_a_a 1 1 1 1 0 1
#> jus_o_a 0 0 0 0 0 0
#> ā_a_a 0 0 2 0 1 0
‘lapplyTest()’ is a variant of ‘base::lapply()’ specifically adjusted for the application of tests of independence. The main difference lies in the handling of warnings and errors.
This function takes a list of contingency tables, such as the one generated by ‘allCooccs()’ above, and applies to each of its elements the function given in ‘fun’. By default, this is ‘chisq.test()’, but any other test can be used, so long as its output contains an element named ‘p.value’. The result is a list of the outputs of ‘fun’, with any warning or error that was produced attached to the corresponding element as an attribute. Additional arguments to ‘fun’ can also be passed in a call to ‘lapplyTest()’.
Technically, the output is of class ‘list.lapplyTest’. It can be passed to ‘summary()’ to sift through the results and only print the ones with the p-value below the specified threshold (the default is 0.05). Those tests which produced a warning are prefixed with an exclamation mark.
# let us prepare the tables
tabs <- allCooccs (d.abc, bin=F)
# and apply the chi-squared test to them
chisq <- lapplyTest (tabs)
chisq
#> $`-`
#>
#> Chi-squared test for given probabilities
#>
#> data: tab
#> X-squared = 6, df = 5, p-value = 0.3062
#>
#>
#> $a
#>
#> Pearson's Chi-squared test
#>
#> data: tab
#> X-squared = 7.7467, df = 6, p-value = 0.2573
#>
#>
#> $b
#>
#> Pearson's Chi-squared test
#>
#> data: tab
#> X-squared = 7.1944, df = 4, p-value = 0.126
#>
#>
#> $c
#>
#> Chi-squared test for given probabilities
#>
#> data: tab
#> X-squared = 6.5714, df = 5, p-value = 0.2545
#>
#>
#> attr(,"class")
#> [1] "list.lapplyTest"
# this is only an example on a tiny dataset, so let us be more forgiving
summary (chisq, p.value=0.3)
#> Total results: 4; with p-value ≤ 0.3: 3.
#> ! a: p-value = 0.257
#> ! b: p-value = 0.126
#> ! c: p-value = 0.255
# let us see the problems with ‘a’
attr (chisq$a, "error")
#> NULL
attr (chisq$a, "warning")
#> <simpleWarning in fun(tab, ...): Chi-squared approximation may be incorrect>
# this warning often means that the data were insufficient
tabs$a
#> L1_L2
#> L1_L2 -_ə b_b b_w c_c
#> a_a 2 4 0 4
#> a_o 0 1 0 1
#> a_u 0 0 1 1
‘long2wide()’ and ‘wide2long()’ are used to convert data frames between the “long format” and the “wide format” (see above). Of the two, ‘long2wide()’ is particularly useful because the “long format” tends to be easier for humans to segment and align words in, and is therefore preferable for storing data, while the “wide format” is used internally and required by ‘soundcorrs’.
During the conversion, the number of columns is almost doubled (while the number of rows is halved), and because it is unwise to have duplicate column names, the new columns are given suffixes, which are taken from the values in the column ‘LANGUAGE’. The name of the column used for this purpose can be changed with the ‘col.lang’ argument.
Some of the attributes pertain to only one word in a pair or to the pair as a whole. In the “long format” those have to be repeated, but in the “wide format” this is not necessary. ‘long2wide()’ allows for certain columns to be excluded from the conversion, using the ‘skip’ argument.
# the “abc” dataset is in the long format
abc.long <- read.table (path.abc, header=T)
# the simplest conversion unnecessarily doubles the ID column
long2wide (abc.long)
#> ID.L1 DIALECT.L1 ALIGNED.L1 ORTHOGRAPHY.L1 ID.L2 DIALECT.L2 ALIGNED.L2
#> 1 1 std a|b|c abc 1 std a|b|c
#> 2 2 std a|b|a|c abac 2 std a|b|a|c
#> 3 3 std a|b|c abc 3 north o|b|c
#> 4 4 std a|b|a|c abac 4 north u|w|u|c
#> 5 5 std a|b|c|- abc 5 south a|b|c|ə
#> 6 6 std a|b|a|c|- abac 6 south a|b|a|c|ə
#> ORTHOGRAPHY.L2
#> 1 abc
#> 2 abac
#> 3 åbc
#> 4 uwuc
#> 5 abca
#> 6 abaca
# but this can be avoided with the ‘skip’ argument
abc.wide <- long2wide (abc.long, skip="ID")
‘ngrams()’ turns a vector of words into a list of n-grams, or a table of their frequencies. The first argument is the vector of words; the second is ‘n’, the length of the n-grams to extract (defaults to ‘1’); and the last is ‘as.table’, which determines whether the output is a list of n-grams or a table of their frequencies (defaults to ‘TRUE’).
Two more arguments are available. ‘borders’ is a vector of two character strings: the first to be prepended to all the words, and the second to be appended to them. This way it is clear which n-grams were in the initial, and which in the final position inside the word. ‘borders’ defaults to a vector of two empty strings. Lastly, ‘rm’ is a string of characters that are to be removed from the words before they are cut into n-grams. For instance, to remove all linguistic zeros use ‘rm=“-”’, and to remove zeros and segment separators, use ‘rm=“[-\|]”’.
# with n==1, ngrams() returns simply the frequencies of segments
ngrams (d.cap$data[,"ORTHOGRAPHY.Spanish"])
#>
#> A B C D E H L M N P R S T V Z _ a b c d e f g h i k
#> 1 5 2 1 1 1 4 1 1 2 2 1 1 4 1 3 30 5 3 7 14 1 5 1 15 1
#> l m n o p r s t u v x Á í
#> 11 5 9 10 2 11 13 7 9 2 1 1 4
# counts can easily be turned into a data frame with ranks
ngrams (d.cap$data[,"ORTHOGRAPHY.Spanish"], n=2)
tab <- as.matrix (sort(tab,decreasing=T))
mtx <-head (data.frame (RANK=1:length(mtx), COUNT=mtx, FREQ=mtx/sum(mtx)))
#> RANK COUNT FREQ
#> na 1 4 0.02339181
#> st 2 4 0.02339181
#> ag 3 3 0.01754386
#> ar 4 3 0.01754386
#> da 5 3 0.01754386
#> en 6 3 0.01754386
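The ‘borders’ and ‘rm’ arguments are not used above; the following sketch marks word-initial and word-final position with ‘^’ and ‘$’ (arbitrary markers chosen for this example) and strips linguistic zeros and segment separators first. The output is omitted.
# bigrams of the aligned L1 words, with word boundaries marked
# and zeros and separators removed beforehand
# ngrams (d.abc$data[,"ALIGNED.L1"], n=2, borders=c("^","$"), rm="[-\\|]")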
‘subset()’ does what its name suggests, i.e. it subsets a dataset using the provided condition. It returns a new ‘soundcorrs’ object.
# select only examples from L2’s northern dialect
subset (d.abc, DIALECT.L2=="north") $data
#> ID DIALECT.L1 ALIGNED.L1 ORTHOGRAPHY.L1 DIALECT.L2 ALIGNED.L2 ORTHOGRAPHY.L2
#> 3 3 std a|b|c abc north o|b|c åbc
#> 4 4 std a|b|a|c abac north u|w|u|c uwuc
# select only capitals of countries where German is an official language
subset (d.cap, grepl("German",d.cap$data$OFFICIAL.LANGUAGE)) $data
#> ALIGNED.German ORTHOGRAPHY.German ALIGNED.Polish
#> 5 l|u|k|s|ə|m|b|u|r|k|- Luxemburg l|u|k|s|e|m|b|u|r|k|-
#> 19 v|ī|-|-|-|n|- Wien ẃ|-|e|d|e|ń|-
#> 21 b|ä|r|l|ī|n Berlin b|e|r|l|i|n
#> 23 b|r|ü|-|s|ə|l|-|- Brüssel b|r|u|k|s|e|l|a|-
#> ORTHOGRAPHY.Polish ALIGNED.Spanish ORTHOGRAPHY.Spanish
#> 5 Luksemburg l|u|k|s|e|m|b|u|r|γ|o Ciudad_de_Luxemburgo
#> 19 Wiedeń b|j|e|-|-|n|a Viena
#> 21 Berlin b|e|r|l|i|n Berlín
#> 23 Bruksela b|r|u|-|s|e|l|a|s Bruselas
#> OFFICIAL.LANGUAGE
#> 5 Luxembourgish,French,German
#> 19 German
#> 21 German
#> 23 Dutch,French,German
# select only pairs in which L1 a : L2 a
subset (d.abc, findPairs(d.abc,"a","a")$which) $data
#> ID DIALECT.L1 ALIGNED.L1 ORTHOGRAPHY.L1 DIALECT.L2 ALIGNED.L2 ORTHOGRAPHY.L2
#> 1 1 std a|b|c abc std a|b|c abc
#> 2 2 std a|b|a|c abac std a|b|a|c abac
#> 5 5 std a|b|c|- abc south a|b|c|ə abca
#> 6 6 std a|b|a|c|- abac south a|b|a|c|ə abaca
‘wide2long()’ is simply the inverse of ‘long2wide()’. The conversion may not be perfect, as the order of the columns may change.
In ‘long2wide()’, suffixes were taken from the values in the ‘LANGUAGE’ column; this time they must be specified explicitly. They will be stored in a column defined by the argument ‘col.lang’, which defaults to ‘LANGUAGE’. However, the string that separated column names from suffixes will not be removed by default. To strip it, the argument ‘strip’ needs to be set to the length of the separator.
# let us use the converted “abc” dataset
abc.wide
#> ID DIALECT.L1 ALIGNED.L1 ORTHOGRAPHY.L1 DIALECT.L2 ALIGNED.L2 ORTHOGRAPHY.L2
#> 1 1 std a|b|c abc std a|b|c abc
#> 2 2 std a|b|a|c abac std a|b|a|c abac
#> 3 3 std a|b|c abc north o|b|c åbc
#> 4 4 std a|b|a|c abac north u|w|u|c uwuc
#> 5 5 std a|b|c|- abc south a|b|c|ə abca
#> 6 6 std a|b|a|c|- abac south a|b|a|c|ə abaca
# with the separator preserved
wide2long (abc.wide, c(".L1",".L2"))
#> ALIGNED DIALECT ORTHOGRAPHY ID LANGUAGE
#> 1 a|b|c std abc 1 .L1
#> 2 a|b|a|c std abac 2 .L1
#> 3 a|b|c std abc 3 .L1
#> 4 a|b|a|c std abac 4 .L1
#> 5 a|b|c|- std abc 5 .L1
#> 6 a|b|a|c|- std abac 6 .L1
#> 7 a|b|c std abc 1 .L2
#> 8 a|b|a|c std abac 2 .L2
#> 9 o|b|c north åbc 3 .L2
#> 10 u|w|u|c north uwuc 4 .L2
#> 11 a|b|c|ə south abca 5 .L2
#> 12 a|b|a|c|ə south abaca 6 .L2
# and with the separator removed
wide2long (abc.wide, c(".L1",".L2"), strip=1)
#> ALIGNED DIALECT ORTHOGRAPHY ID LANGUAGE
#> 1 a|b|c std abc 1 L1
#> 2 a|b|a|c std abac 2 L1
#> 3 a|b|c std abc 3 L1
#> 4 a|b|a|c std abac 4 L1
#> 5 a|b|c|- std abc 5 L1
#> 6 a|b|a|c|- std abac 6 L1
#> 7 a|b|c std abc 1 L2
#> 8 a|b|a|c std abac 2 L2
#> 9 o|b|c north åbc 3 L2
#> 10 u|w|u|c north uwuc 4 L2
#> 11 a|b|c|ə south abca 5 L2
#> 12 a|b|a|c|ə south abaca 6 L2
If you have found a bug, have a remark to make about ‘soundcorrs’, or a wish for its future releases, please write to kamil.stachowski@gmail.com.
If you use ‘soundcorrs’ in your research, please cite it as Stachowski K. 2020. Tools for Semi-Automatic Analysis of Sound Correspondences: The soundcorrs Package for R. Glottometrics 49. 66–86.