Overview

Introduction

Column Text Format (CTF) is a new tabular data format designed for simplicity and performance. CTF is the simplest column store you can imagine: it represents each column in a table as a plain text file. Because the files are plain text, the data is human-readable and familiar to programmers, unlike specialized binary formats. When loading only a subset of the columns in a table, CTF is faster than row-oriented formats like CSV, because only the files for the requested columns need to be read. This package provides functions to read and write CTF data from R.
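To make the one-file-per-column idea concrete, here is a minimal base R sketch. It is not the ctf package's implementation: the helpers write_columns and read_column are hypothetical names made up for this illustration, and the sketch stores no metadata file, so column types have to be restored by hand.

```r
# Hypothetical helpers for illustration only; not part of the ctf package.

# Write each column of a data frame to its own text file, one value per line.
write_columns <- function(df, dir) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  for (nm in names(df)) {
    writeLines(as.character(df[[nm]]), file.path(dir, nm))
  }
  invisible(dir)
}

# Read a single column back. Only that column's file is opened;
# the files for the other columns are never touched.
read_column <- function(dir, name) {
  readLines(file.path(dir, name))
}

sketch_dir <- file.path(tempdir(), "column_sketch")
write_columns(iris, sketch_dir)

# Values come back as character strings. This sketch keeps no metadata,
# so the caller converts the type explicitly.
petal_length <- as.numeric(read_column(sketch_dir, "Petal.Length"))

# Check that the round trip preserved the column.
stopifnot(isTRUE(all.equal(petal_length, iris$Petal.Length)))

unlink(sketch_dir, recursive = TRUE)
```

Reading one column here costs one file read regardless of how many other columns the table has, which is the property that a row-oriented file such as CSV lacks.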

What is CTF good for?

Teaching the concept of column stores
Processing large data sets
Integrating with UNIX-style text processing pipelines

What are the alternatives to CTF?

If CTF isn’t exactly what you need, then you will be better off with a more established and stable data format. CSV works fine in many cases. If you need better performance, then consider existing columnar storage technologies such as HDF5 or Apache Parquet.

Anything else?

We created CTF in 2021, and we expect the metadata file associated with it to evolve significantly. Until version 1.0 is ready, anything could change at any time, and we make no promises about compatibility.

Quick Start

library(ctf)

The following examples use R’s built-in iris dataset.

First, let’s save iris in CTF format inside iris_ctf_data, a subdirectory of our temporary directory.

d <- file.path(tempdir(), "iris_ctf_data")
write.ctf(iris, d)

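The code above created the directory iris_ctf_data inside the temporary directory and wrote one file for each column in iris, plus one file for the metadata. Each column file is named after its column. Note that list.files() returns names in alphabetical order, so the order below differs from that of colnames(iris).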

list.files(d)

## [1] "Petal.Length"       "Petal.Width"        "Sepal.Length"      
## [4] "Sepal.Width"        "Species"            "iris-metadata.json"

colnames(iris)

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

The column files are just plain text. Let’s verify by reading the first five lines of the Petal.Length file and comparing them with the first five values of that column in iris.

pl_file <- file.path(d, "Petal.Length")
readLines(pl_file, n = 5L)

## [1] "1.4" "1.4" "1.3" "1.5" "1.4"

iris[1:5, "Petal.Length"]

## [1] 1.4 1.4 1.3 1.5 1.4

We can read the data saved in CTF format back into R as iris2, and make sure the data matches our original iris data.

iris2 <- read.ctf(d)
head(iris2)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

# Same thing:
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
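Comparing the output of head() by eye only covers the first six rows. A programmatic check over every column could look like the sketch below. It uses only write.ctf and read.ctf as called above, plus base R. The directory name iris_ctf_check and the variable names are assumptions chosen for this example, and no expected output is shown because the result depends on how the package restores column types.

```r
library(ctf)

# Use a separate directory (hypothetical name) so this check is self-contained.
d_check <- file.path(tempdir(), "iris_ctf_check")
write.ctf(iris, d_check)
iris_check <- read.ctf(d_check)

# Are the column names the same, and in the same order?
same_names <- identical(colnames(iris), colnames(iris_check))

# Compare each column with all.equal(), which tolerates tiny numeric
# differences. A missing or mismatched column yields FALSE.
col_equal <- vapply(
  colnames(iris),
  function(nm) isTRUE(all.equal(iris[[nm]], iris_check[[nm]])),
  logical(1)
)

print(same_names)
print(col_equal)

unlink(d_check, recursive = TRUE)
```

If any entry of col_equal is FALSE, calling all.equal() directly on that column pair prints a description of the difference.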

Clean up:

unlink(d, recursive = TRUE)