The tsvio package provides a simple, fast interface for read-only access to (subsets of) potentially very large (many gigabytes) data matrices stored in plain text files.
Data files must be plain text files in which each line is divided into logical columns (fields) by tab characters.
The first line must contain a unique label for each data column. The first line may contain one fewer field than the remaining lines; such files are often produced by R. Alternatively, the first line may contain the same number of fields as the remaining lines, in which case the first field on that line is ignored; such files are typically produced by tools other than R.
Every line (row) after the first must contain the same number of fields. The first field of each line must be a unique row label. (Row and column labels are treated separately and can have labels in common.)
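As an illustration of the two accepted header layouts, the following sketch writes a small matrix with base R's write.table. The matrix contents, labels, and temporary file names are made up for the example. With row names enabled, the default col.names = TRUE yields the R-style header (one fewer field), while col.names = NA writes an empty leading header field.

```r
# A small example matrix; labels and values are illustrative only.
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("gene1", "gene2"), c("s1", "s2", "s3")))

r_style    <- tempfile(fileext = ".tsv")
full_style <- tempfile(fileext = ".tsv")

# R-style layout: the header has one fewer field than the data lines.
write.table(m, r_style, sep = "\t", quote = FALSE)

# Alternative layout: col.names = NA writes an empty leading header field,
# so every line has the same number of fields.
write.table(m, full_style, sep = "\t", quote = FALSE, col.names = NA)

# Count the tab-separated fields on each line of a file.
count_fields <- function(path) {
  lengths(strsplit(readLines(path), "\t", fixed = TRUE))
}

r_fields    <- count_fields(r_style)
full_fields <- count_fields(full_style)
```

Here r_fields should show a shorter first line than the data lines, while full_fields should be the same on every line.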
Before data can be read from a data file, an index file containing the starting position of the data line for each row label must be generated.
The index file can be generated explicitly by calling tsvGenIndex:
tsvGenIndex (filename, indexfile)
tsvio assumes that the data file is static and does not change during an R session. Hence, an index file, once created, does not change during an R session either.
The index file must be regenerated by the user whenever the data file changes. The tsvio package cannot detect that the data file has changed. Using an outdated index file can result in erroneous results or a run-time error.
The data access functions described below can generate the index file automatically on first access. Depending on file permissions, this may allow the user to simply remove the index file whenever the data file is modified. A new index file will be generated on the next access (which will thus be slower than normal).
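One way to automate the "remove the index when the data changes" step is to compare file modification times before a session starts. The helper below is a hypothetical sketch in base R and is not part of tsvio. Modification times are only a heuristic (they are an assumption of this example, not something the package checks), so a manual regeneration is still the safe choice when in doubt.

```r
# Hypothetical helper (not part of tsvio): delete the index file when the
# data file has a newer modification time, so that the next access (or an
# explicit tsvGenIndex call) rebuilds it.
# Returns TRUE if the index file was removed, FALSE otherwise.
remove_stale_index <- function(datafile, indexfile) {
  if (!file.exists(indexfile)) return(FALSE)
  stale <- file.mtime(indexfile) < file.mtime(datafile)
  if (isTRUE(stale)) file.remove(indexfile) else FALSE
}

# Demonstration with placeholder files.
datafile  <- tempfile(fileext = ".tsv")
indexfile <- tempfile(fileext = ".idx")
writeLines("placeholder", datafile)
writeLines("placeholder", indexfile)

# Pretend the index was built an hour before the data file last changed.
Sys.setFileTime(indexfile, Sys.time() - 3600)

removed       <- remove_stale_index(datafile, indexfile)  # index is older
removed_again <- remove_stale_index(datafile, indexfile)  # nothing left to remove
```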
The function tsvGetData is used to read data as a matrix:
tsvGetData (filename, indexfile, rowpatterns, colpatterns, dtype="", findany=TRUE)
rowpatterns is either NULL or a vector of row labels. If NULL, data from all lines in the file is returned. Otherwise, only data from rows matching an entry in rowpatterns is returned. Only exact matches are supported.
Similarly, colpatterns specifies which columns to return data for.
Thus, the entire data matrix can be returned by specifying NULL for both rowpatterns and colpatterns.
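A minimal usage sketch of the calls described above, assuming the tsvio package is installed (the code is skipped otherwise). The data file, its labels, and its values are made up for illustration.

```r
if (requireNamespace("tsvio", quietly = TRUE)) {
  # Illustrative data file; labels and values are made up.
  m <- matrix(1:6, nrow = 2,
              dimnames = list(c("gene1", "gene2"), c("s1", "s2", "s3")))
  datafile  <- tempfile(fileext = ".tsv")
  indexfile <- tempfile(fileext = ".idx")
  write.table(m, datafile, sep = "\t", quote = FALSE)

  # Build the index explicitly before reading.
  tsvio::tsvGenIndex(datafile, indexfile)

  # A subset: one row and two columns, selected by exact label.
  sub <- tsvio::tsvGetData(datafile, indexfile, c("gene2"), c("s3", "s1"))

  # The entire matrix: NULL for both rowpatterns and colpatterns.
  all_data <- tsvio::tsvGetData(datafile, indexfile, NULL, NULL)
}
```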
The return value is always a data matrix with two dimensions. If rowpatterns or colpatterns is a single element, the corresponding axis of the returned matrix is not ‘dropped’. The standard R function drop can be used to delete any dimensions of length one if desired.
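The effect of drop can be seen on any one-row matrix. The sketch below uses a hand-built character matrix as a stand-in for a single-row result; its labels and values are illustrative.

```r
# Stand-in for a single-row result; labels and values are illustrative.
res <- matrix(c("1", "3", "5"), nrow = 1,
              dimnames = list("gene1", c("s1", "s2", "s3")))

res_dim <- dim(res)   # still two dimensions: one row, three columns

# drop() removes the length-one dimension and keeps the column labels
# as the names of the resulting vector.
v <- drop(res)
```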
By default, if rowpatterns or colpatterns is not NULL, any specified labels not in the data file will be silently ignored and not included in the result. However, if there are no matching rows or no matching columns, tsvGetData will throw an error.
Setting the optional parameter findany to FALSE will cause tsvGetData to throw an error if any specified label is not in the data file.
Rows and columns in the returned matrix will occur in the same order as they appear in rowpatterns and colpatterns respectively. Duplicate entries in rowpatterns or colpatterns will never match any label (and always result in an error if findany is FALSE).
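The difference between the two findany settings can be sketched as follows, assuming tsvio is installed (the code is skipped otherwise). The data file and the label "nosuch" are made up; the error is captured with tryCatch so that the example runs to completion.

```r
if (requireNamespace("tsvio", quietly = TRUE)) {
  # Illustrative data file; labels and values are made up.
  m <- matrix(1:6, nrow = 2,
              dimnames = list(c("gene1", "gene2"), c("s1", "s2", "s3")))
  datafile  <- tempfile(fileext = ".tsv")
  indexfile <- tempfile(fileext = ".idx")
  write.table(m, datafile, sep = "\t", quote = FALSE)
  tsvio::tsvGenIndex(datafile, indexfile)

  # Default (findany = TRUE): the unknown label "nosuch" is ignored.
  lenient <- tsvio::tsvGetData(datafile, indexfile,
                               c("gene1", "nosuch"), NULL)

  # findany = FALSE: the same request is expected to raise an error,
  # which is captured here as its message text.
  strict <- tryCatch(
    tsvio::tsvGetData(datafile, indexfile,
                      c("gene1", "nosuch"), NULL, findany = FALSE),
    error = function(e) conditionMessage(e)
  )

  # Rows are requested in reverse file order.
  reordered <- tsvio::tsvGetData(datafile, indexfile,
                                 c("gene2", "gene1"), NULL)
}
```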
The returned matrix will have the same mode as the dtype parameter, which can be a string, a numeric, or an integer. Only the type of dtype matters; its value is ignored. Returning a numeric or integer matrix can be much faster than returning a character matrix and then converting it. However, it requires every data element in the data file to conform to that type; otherwise tsvGetData will throw an error.
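As a sketch of how dtype might be used, the example below requests the same made-up data as a numeric and as an integer matrix. It assumes tsvio is installed and is skipped otherwise. The values 0 and 0L are arbitrary, since only their type is relevant.

```r
if (requireNamespace("tsvio", quietly = TRUE)) {
  # Illustrative data file containing only integer-valued fields.
  m <- matrix(1:6, nrow = 2,
              dimnames = list(c("gene1", "gene2"), c("s1", "s2", "s3")))
  datafile  <- tempfile(fileext = ".tsv")
  indexfile <- tempfile(fileext = ".idx")
  write.table(m, datafile, sep = "\t", quote = FALSE)
  tsvio::tsvGenIndex(datafile, indexfile)

  # dtype = 0 requests a numeric matrix; dtype = 0L an integer matrix.
  num <- tsvio::tsvGetData(datafile, indexfile, NULL, NULL, dtype = 0)
  int <- tsvio::tsvGetData(datafile, indexfile, NULL, NULL, dtype = 0L)
}
```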
The function tsvGetLines returns a subset of the lines in the data file as a string vector:
tsvGetLines (filename, indexfile, patterns, findany=TRUE)
The string vector returned by tsvGetLines consists of the entire first line in the data file, followed by the entirety of every line whose row label occurs in patterns. Unlike with tsvGetData, patterns cannot be NULL, and matching lines are returned in the order in which they appear in the data file, not the order of their labels in patterns. If findany is TRUE, labels in patterns that do not occur in the data file are ignored, but an error is thrown if no labels match. If findany is FALSE, an error is thrown if any label in patterns has no matching row.
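Because the returned vector starts with the header line, it can be handed to base R's read.delim for parsing. The sketch below assumes tsvio is installed (it is skipped otherwise) and uses a made-up data file. Parsing with read.delim is one possible follow-up step, not something the package requires.

```r
if (requireNamespace("tsvio", quietly = TRUE)) {
  # Illustrative data file; labels and values are made up.
  m <- matrix(1:6, nrow = 2,
              dimnames = list(c("gene1", "gene2"), c("s1", "s2", "s3")))
  datafile  <- tempfile(fileext = ".tsv")
  indexfile <- tempfile(fileext = ".idx")
  write.table(m, datafile, sep = "\t", quote = FALSE)
  tsvio::tsvGenIndex(datafile, indexfile)

  # Header line first, then the matching lines in file order,
  # regardless of the order of the labels given here.
  lines <- tsvio::tsvGetLines(datafile, indexfile, c("gene2", "gene1"))

  # Parse the raw lines; with an R-style header (one fewer field),
  # read.delim uses the first column as row names.
  df <- read.delim(text = lines, check.names = FALSE)
}
```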