---
title: "A. Data handling"
output: 
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{A. Data handling}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width=6, 
  fig.height=4
)
# Put this in the YAML header at the top to write output to tex
#output: 
#  pdf_document: 
#    keep_tex: true
# Original:
#  rmarkdown::html_vignette:
#    toc: true
```

```{r setup}
# Start the multiblock R package
library(multiblock)
```

# Read from file

Data are stored in many different file formats. The following three examples
cover two types of CSV files and a generic flat file.

```{r}
# Locate the extdata directory of the multiblock package
# (no trailing slash, so the path also resolves on Windows)
mbdir <- system.file("extdata", package = "multiblock")

# Comma separated values, row names in first column
meta_data <- read.csv(paste0(mbdir, "/meta_data.csv"), row.names = 1)
# If working directory matches file location:
# meta_data <- read.csv('meta_data.csv', row.names = 1)
meta_data

# Semicolon-separated values (used in locales where the decimal separator
# is a comma), no row names
proteins <- read.csv2(paste0(mbdir, "/proteins.csv"))
proteins

# Whitespace-separated data without labels
genes <- read.table(paste0(mbdir, "/genes.dat"))
genes
```
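
The three readers differ mainly in their defaults for separator, decimal mark
and labels. As a minimal, self-contained sketch, the same values can be written
in each format and read back. The toy table and the temporary files are
invented for illustration and are not part of the package.

```{r}
# Toy data, invented for illustration only
toy <- data.frame(a = c(1.5, 2.5, 3.5), b = c(10, 20, 30),
                  row.names = c("s1", "s2", "s3"))

# Temporary files, one per format
f_csv  <- tempfile(fileext = ".csv")
f_csv2 <- tempfile(fileext = ".csv")
f_dat  <- tempfile(fileext = ".dat")

# Comma separator, decimal point, row names in first column
write.csv(toy, f_csv)
# Semicolon separator, decimal comma, no row names
write.csv2(toy, f_csv2, row.names = FALSE)
# Space separated, no row or column labels
write.table(toy, f_dat, row.names = FALSE, col.names = FALSE)

toy_csv  <- read.csv(f_csv, row.names = 1)
toy_csv2 <- read.csv2(f_csv2)
toy_dat  <- read.table(f_dat)

# The numbers are the same in all three; only the labels differ
all.equal(as.matrix(toy_csv), as.matrix(toy))
all.equal(unname(as.matrix(toy_csv2)), unname(as.matrix(toy)))
names(toy_dat) # read.table supplies default column names
```

When a file has no labels, as in the last case, names can be assigned after
reading with `names()` or `rownames()`.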

# Data pre-processing

Before analysis, various types of pre-processing may be needed. Centring
and standardising/scaling may be considered the most basic. In R, the
__scale()__ function performs both operations column-wise by default, which
gives autoscaling (each variable gets mean 0 and standard deviation 1). If
the same operations are performed on the rows instead, the result is the
standard normal variate (SNV).

```{r}
# Column-centring
genes_centred <- scale(genes, scale=FALSE)
colMeans(genes_centred) # Check mean values

# Autoscaling
genes_scaled <- scale(genes)
apply(genes_scaled, 2, sd) # Check standard deviations

# SNV (transpose, autoscale, re-transpose)
genes_snv <- t(scale(t(genes)))
apply(genes_snv, 1, sd) # Check standard deviations
```
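
To make explicit what `scale()` computes, the following sketch reproduces
centring, autoscaling and SNV step by step with `sweep()`. The matrix `toy_x`
is invented for illustration and is not package data.

```{r}
# Small invented matrix: 4 samples (rows) by 3 variables (columns)
toy_x <- matrix(c(1, 2, 3, 4,
                  2, 4, 6, 9,
                  5, 3, 1, 0), nrow = 4,
                dimnames = list(paste0("s", 1:4), c("v1", "v2", "v3")))

# Column-centring: subtract each column's mean
centred_manual <- sweep(toy_x, 2, colMeans(toy_x), "-")
all.equal(centred_manual, scale(toy_x, scale = FALSE),
          check.attributes = FALSE)

# Autoscaling: additionally divide each column by its standard deviation
scaled_manual <- sweep(centred_manual, 2, apply(toy_x, 2, sd), "/")
all.equal(scaled_manual, scale(toy_x), check.attributes = FALSE)

# SNV: the same two steps applied to each row instead of each column
snv_manual <- sweep(sweep(toy_x, 1, rowMeans(toy_x), "-"),
                    1, apply(toy_x, 1, sd), "/")
all.equal(snv_manual, t(scale(t(toy_x))), check.attributes = FALSE)
```

The `check.attributes = FALSE` argument is needed because `scale()` stores the
centring and scaling values as attributes on its result.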

## Re-coding categorical data

Most analysis methods require numeric input data. The __meta_data__
__data.frame__ contains a character vector (a factor in R versions before
4.0.0) of categories. This package has a function, __dummycode__, for
converting categorical data to various dummy formats.

```{r}
# Default is sum coding
dummycode(meta_data$colour)

# Treatment coding
dummycode(meta_data$colour, "contr.treatment")

# Full dummy-coding (rank deficient)
dummycode(meta_data$colour, drop = FALSE)

# Replace the categorical variable with its dummy-coded matrix; I() keeps
# the matrix as a single variable that can be indexed by a common name
meta_data2 <- meta_data
meta_data2$colour <- I(dummycode(meta_data$colour, drop = FALSE))
meta_data2
meta_data2$colour
```
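
To see what the three coding schemes look like independently of the package,
base R's `model.matrix()` can produce comparable matrices. This is only an
illustrative sketch on an invented factor (`toy_colour`); it does not describe
how __dummycode__ is implemented, and its column names and level ordering may
differ from the package output.

```{r}
# Invented factor; levels are sorted alphabetically: blue, green, red
toy_colour <- factor(c("red", "green", "blue", "green", "red"))

# Sum coding: k - 1 columns, the last level is coded -1 in every column
sum_coded <- model.matrix(~ toy_colour,
  contrasts.arg = list(toy_colour = "contr.sum"))[, -1]
sum_coded

# Treatment coding: k - 1 indicator columns, the first level is the reference
trt_coded <- model.matrix(~ toy_colour,
  contrasts.arg = list(toy_colour = "contr.treatment"))[, -1]
trt_coded

# Full dummy coding: one indicator column per level. The columns sum to
# one in every row, hence the rank deficiency alongside an intercept
full_coded <- model.matrix(~ toy_colour - 1)
rowSums(full_coded)
```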


# Data structures for multiblock analysis

## Create list of blocks

A simple list of blocks can be created using the __list()__
function. The blocks can be named directly in the call or
after creation.

```{r}
# Direct approach
blocks1 <- list(meta = meta_data2, proteins = proteins, genes = genes)

# Two-step approach
blocks2 <- list(meta_data2, proteins, genes)
names(blocks2) <- c('meta', 'proteins', 'genes')

# Same result
identical(blocks1, blocks2)

# Access by name or number
blocks1[['meta']]
blocks2[[1]]
```
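
Since the blocks in such a list are meant to describe the same samples, it can
be worth checking that they all have the same number of rows before analysis.
A minimal sketch on invented blocks (`toy_blocks` is hypothetical and mixes
matrices and a data.frame on purpose):

```{r}
# Invented blocks with four samples each
toy_blocks <- list(A = matrix(rnorm(12), nrow = 4),
                   B = matrix(rnorm(20), nrow = 4),
                   C = data.frame(x = 1:4, y = letters[1:4]))

# Number of rows (samples) and columns (variables) per block
n_rows <- sapply(toy_blocks, nrow)
n_cols <- sapply(toy_blocks, ncol)
n_rows
n_cols

# TRUE when all blocks share the same number of samples
same_samples <- length(unique(n_rows)) == 1
same_samples
```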

## Create data.frame of blocks

A __data.frame__ is a convenient storage format for data
in R and can handle many types of variables, e.g. numeric,
logical, character, factor or matrices. Storing whole matrices
as variables is useful when several blocks share the sample
mode, i.e. the rows of all blocks refer to the same samples.

```{r}
# Construct block data.frame from list
df1 <- block.data.frame(blocks1)

# Construct block data.frame from data.frame:
# First merge blocks into data.frame
my_data <- cbind(meta_data2, proteins, genes)
# Then construct block data.frame using named 
# list of indexes
df2 <- block.data.frame(my_data, block_inds = 
        list(meta = 1:2, proteins = 3:5, genes = 6:8))

# Same result
identical(df1, df2)

# Access by name or number
df1[[2]]
df2[['proteins']]
df1[c(1,3)]
df1[-2]
df2[c('proteins','genes')]

# Use with formula interface (see other vignettes)
# sopls(meta ~ proteins + genes, data = df1)

# Use with single list interface (see other vignettes)
# mfa(df1[c(1,3)], ncomp = 3)
```
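
The idea of storing whole blocks as variables can be illustrated with base R
alone. The sketch below uses `I()` to place two invented matrices in an
ordinary `data.frame`. It is not the implementation of __block.data.frame__,
only an illustration of the underlying storage; `toy_X`, `toy_Z` and `toy_df`
are hypothetical names.

```{r}
# Two invented blocks sharing four samples
toy_X <- matrix(1:8, nrow = 4, dimnames = list(NULL, c("x1", "x2")))
toy_Z <- matrix(8:1 / 2, nrow = 4, dimnames = list(NULL, c("z1", "z2")))

# I() protects each matrix so that it is kept as a single variable
toy_df <- data.frame(X = I(toy_X), Z = I(toy_Z))

# Two variables, each holding a 4 x 2 block
names(toy_df)
dim(toy_df$X)

# Row subsetting keeps the blocks aligned sample by sample
toy_sub <- toy_df[2:3, ]
dim(toy_sub$Z)
```

Because every block is one variable, a formula can refer to a whole block by
its name, which is what the formula interface shown above relies on.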