staRdom is a package for R [@R_Core_Team_2018] to analyse fluorescence and absorbance data of dissolved organic matter (DOM). It is possible to do the following steps:
staRdom was developed and is maintained at WasserCluster Lunz and the University of Natural Resources and Life Sciences, Vienna.
The analysis process was already discussed in other papers and tutorials. The aim of this package was to bring a familiar way of using PARAFAC analysis for DOM to the R platform. The offered functions follow the concept of @Murphy_2013. Reading it is recommended and can help your understanding!
For data correction, peak and index calculation, and slope parameters, please see the vignette for basic analysis. For information on eemR and its functions, please see the eemR vignette. Details on the actual PARAFAC calculation can be found in the multiway documentation.
Some of the functions can run in parallel, and you can set the number of parallel processes to be used. Here we use half of the available threads, which should be similar to the number of physical cores.
cores <- parallel::detectCores()/2
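On some platforms parallel::detectCores() can return NA, and halving an odd thread count gives a fraction. A slightly more defensive variant is sketched below; it is a suggestion only, not something the package requires.

```r
# Defensive variant: detectCores() may return NA, and halving an odd
# count gives a fraction. Fall back to a single process in those cases.
n_threads <- parallel::detectCores()
cores <- if (is.na(n_threads)) 1L else max(1L, as.integer(floor(n_threads / 2)))
cores
```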
You can run a complete data correction and analysis as shown in this example with the data provided by the package.
The example EEM data can be loaded with data(eem_list). Due to package size constraints, only a small number of samples is included, and not all examples from this tutorial will deliver the same results.
The absorbance data is saved in a folder accessible by system.file("extdata/absorbance_eemR",package = "staRdom").
This is a table with an example of how to deal with diluted samples.
The table is saved in a file accessible by system.file("extdata/metatable_eemR.csv",package = "staRdom").
A set of EEM samples was corrected and can be loaded into your R environment by data(eem_list).
A PARAFAC model was generated with the samples above. It can be loaded into the R environment by data(pfres_comps1), where outliers are still included, and by data(pfres_comps2), without the outliers.
EEM data import is done with eem_read (package eemR). Currently, you can use it to import data from Cary Eclipse, Aqualog, Fluoromax-4 and Shimadzu instruments. Files can be read recursively, but file names must be unique.
folder <- system.file("extdata/cary/scans_day_1", package = "eemR")
eem_list <- eem_read(folder)
To have a look at your data, you can plot the samples.
ggeem(eem_list)
Absorbance data is imported with absorbance_read. It is read from CSV or TXT files. The header of the column containing the wavelength must be either “wavelength” or “Wavelength”. A multi-sample file must have sample names as column names. A single-sample file can have either the sample name as column name, or the sample name as file name and “Abs.” as column name. All tables are combined into one with one wavelength column and one column for each sample containing the absorbance data.
data(absorbance)
Dilution factors were saved in a table to demonstrate cases where dilution factors vary between samples.
meta <- read.table(system.file("extdata/metatable_eemR.csv",package = "staRdom"), header = TRUE, sep = " ", dec = ".", row.names = 1)
The data can be checked for possible incorrect entries. The results are meant to help you reorganise your data and revise the steps above in case of any problems. No correction is done automatically! The checks cover NAs in the data, duplicate and invalid sample names, wavelength range mismatches, and inconsistencies in sample names between EEM, absorbance and metadata (such as samples missing in one of the three sets).
problem <- eem_checkdata(eem_list,absorbance,meta,metacolumns = c("dilution"),error=FALSE)
##
## samples missing " is_blank_corrected "
## nano
## sample1
## sample2
## sample3
##
## samples missing " is_scatter_corrected "
## nano
## sample1
## sample2
## sample3
##
## samples missing " is_ife_corrected "
## nano
## sample1
## sample2
## sample3
##
## samples missing " is_raman_normalized "
## nano
## sample1
## sample2
## sample3
##
## EEM samples missing absorbance data:
## nano in
## /home/maetz/R/x86_64-pc-linux-gnu-library/3.5/eemR/extdata/cary/scans_day_1
##
## Metadata column dilution misses data for samples:
If you used the template for the peak picking (vignette for basic analysis), the correction is already done and you can start a PARAFAC analysis with the eem_list resulting from that.
If you want to change your sample names, you can use eem_name_replace. In the example, “(FD3)” is removed as it is not part of the sample names but could be in the file names. You can use this function for any replacement in file names. Regular expressions can be used.
eem_list <- eem_name_replace(eem_list,c("\\(FD3\\)"),c(""))
Blanks are samples of ultrapure water (Milli-Q) that must contain one of nano, miliq, milliq, mq or blank in their file names. They are linked to the samples in the same subfolders. If multiple blanks were measured, they are averaged. Blanks are subtracted from each sample to reduce the effects of scatter bands [@Murphy_2013].
eem_list <- eem_remove_blank(eem_list)
## A total of 1 blank EEMs will be averaged.
ggeem(eem_list)
Inner-filter effects (IFE) occur when excitation light is absorbed by chromophores. A simple method to correct the IFE is to use the sample's absorbance. The EEM is multiplied by a correction matrix with one factor corresponding to each wavelength pair. The example uses a cuvette length of 5 cm for the absorbance measurement.
eem_list <- eem_ife_correction(eem_list,absorbance, cuvl = 5)
## Warning in FUN(X[[i]], ...): No absorbance data was found for sample nano!
## sample1
## Range of IFE correction factors: 1.0022 1.0923
## Range of total absorbance (Atotal) : 4e-04 0.0153
##
## sample2
## Range of IFE correction factors: 1.0012 1.0559
## Range of total absorbance (Atotal) : 2e-04 0.0094
##
## sample3
## Range of IFE correction factors: 1.0032 1.1885
## Range of total absorbance (Atotal) : 6e-04 0.03
ggeem(eem_list)
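To make the idea of a correction matrix concrete, here is a minimal base R sketch of a commonly used absorbance-based correction, where each factor is 10^(0.5 * (A_ex + A_em)). All values and variable names are invented for illustration, the absorbance is assumed to be expressed per 1 cm and the fluorescence cuvette is assumed to be 1 cm wide. The exact computation inside eem_ife_correction may differ.

```r
# Hypothetical absorbance spectrum (per cm) on a 5 nm grid; values are invented
wl <- seq(250, 450, by = 5)
abs_cm <- 0.05 * exp(-0.015 * (wl - 250))
names(abs_cm) <- wl

ex <- seq(250, 350, by = 50)   # excitation wavelengths (nm)
em <- seq(300, 450, by = 50)   # emission wavelengths (nm)

# Total absorbance for every emission/excitation pair: A_em + A_ex
a_total <- outer(abs_cm[as.character(em)], abs_cm[as.character(ex)], "+")

# Correction factor 10^(0.5 * A_total), assuming a 1 cm fluorescence cuvette
ife_factor <- 10^(0.5 * a_total)

# Toy EEM (rows = emission, columns = excitation) and its corrected version
eem_raw <- matrix(1, nrow = length(em), ncol = length(ex))
eem_corrected <- eem_raw * ife_factor
range(ife_factor)
```

Because absorbance is never negative, every factor is at least 1, so the correction can only increase the measured intensities.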
Fluorescence is normalised to a standard scale of Raman Units by dividing all intensities by the area of the Raman peak. Depending on where you get the data from, you can use blanks, numeric values or data frames as source for the values.
eem_list <- eem_raman_normalisation2(eem_list, blank = "blank")
## A total of 1 blank EEMs will be averaged.
## Raman area: 9.540904
## Raman area: 9.540904
## Raman area: 9.540904
ggeem(eem_list)
Raman areas can be calculated separately with eem_raman_area.
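As an illustration of what a Raman peak area is and how it turns intensities into Raman Units, the following base R sketch integrates a synthetic water Raman peak with the trapezoidal rule. The spectrum, the 350 nm excitation and the 371 to 428 nm integration window are assumptions made for this example and are not read from the package.

```r
# Hypothetical emission scan of a water blank at 350 nm excitation (invented values)
em <- seq(360, 440, by = 1)
intensity <- 0.2 + 5 * exp(-((em - 397)^2) / (2 * 6^2))

# Integration window; 371-428 nm is an assumption used for illustration
win <- em >= 371 & em <= 428
x <- em[win]
y <- intensity[win]

# Trapezoidal rule: interval widths times the mean of neighbouring heights
raman_area <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

# Dividing a sample's intensities by this area expresses them in Raman Units
sample_intensity <- c(12, 30, 45)
sample_ru <- sample_intensity / raman_area
```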
From this step on, blanks are not needed anymore. You can remove them from your sample set.
eem_list <- eem_extract(eem_list, c("nano", "miliq", "milliq", "mq", "blank"),ignore_case = TRUE)
## Removed sample(s): nano
absorbance <- select(absorbance, -matches("nano|miliq|milliq|mq|blank", ignore.case = TRUE))
eem_rem_scat removes scattering from the samples. remove_scatter is a named logical vector where names are “raman1”, “raman2”, “rayleigh1” and “rayleigh2”. remove_scatter_width is either a number or a vector containing 4 different values, one for each scatter type [@Murphy_2013; @2006].
remove_scatter <- c("raman1" = TRUE, "raman2" = TRUE, "rayleigh1" = TRUE, "rayleigh2" = TRUE)
remove_scatter_width <- c(15,15,15,15)
eem_list <- eem_rem_scat(eem_list, remove_scatter = remove_scatter, remove_scatter_width = remove_scatter_width)
ggeem(eem_list)
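The principle behind scatter removal can be sketched in base R by masking the cells close to the scatter lines. The example below only handles first- and second-order Rayleigh scatter (emission equal to once or twice the excitation wavelength) on a toy matrix. Treating the width as plus or minus 15 nm around the line is an assumption for illustration, and Raman scatter is left out for brevity.

```r
ex <- seq(250, 450, by = 10)   # excitation wavelengths (nm)
em <- seq(260, 600, by = 2)    # emission wavelengths (nm)

# Toy EEM filled with ones (rows = emission, columns = excitation)
eem <- matrix(1, nrow = length(em), ncol = length(ex))

# Distance of every cell to the scatter lines
delta1 <- outer(em, ex, "-")       # first-order Rayleigh line (em = ex)
delta2 <- outer(em, 2 * ex, "-")   # second-order Rayleigh line (em = 2 * ex)

width <- 15  # band width in nm, interpreted here as +/- width (an assumption)

# Mark cells inside either band as missing
eem[abs(delta1) <= width | abs(delta2) <= width] <- NA
```

The missing cells are what a subsequent interpolation step would fill in.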
Removed scatter areas can be interpolated along excitation wavelengths [@Elcoroaristizabal_2015].
eem_list <- eem_interp(eem_list, cores = cores)
ggeem(eem_list)
dil_data <- meta["dilution"]
eem_list <- eem_dilution(eem_list,dil_data)
ggeem(eem_list)
Depending on your instrument, smoothing the data could be beneficial for peak picking. For PARAFAC analysis, please use the unsmoothed data (eem_list).
eem4peaks <- eem_smooth(eem_list, n = 4)
ggeem(eem4peaks)
summary(eem_list)
## sample ex_min ex_max em_min em_max is_blank_corrected
## 1 sample1 220 450 230 600 TRUE
## 2 sample2 220 450 230 600 TRUE
## 3 sample3 220 450 230 600 TRUE
## is_scatter_corrected is_ife_corrected is_raman_normalized manufacturer
## 1 TRUE TRUE TRUE Cary Eclipse
## 2 TRUE TRUE TRUE Cary Eclipse
## 3 TRUE TRUE TRUE Cary Eclipse
Warnings about wavelengths that are not present in the data may appear, but interpolation usually works fine.
bix <- eem_biological_index(eem4peaks)
coble_peaks <- eem_coble_peaks(eem4peaks)
fi <- eem_fluorescence_index(eem4peaks)
hix <- eem_humification_index(eem4peaks, scale = TRUE)
indices_peaks <- bix %>%
  full_join(coble_peaks, by = "sample") %>%
  full_join(fi, by = "sample") %>%
  full_join(hix, by = "sample")
indices_peaks
## sample bix b t a m c
## 1 sample1 0.6805683 0.07955867 0.09864216 0.3951777 0.2586302 0.2052121
## 2 sample2 0.7910467 0.05615903 0.05560088 0.1615910 0.1027934 0.0885342
## 3 sample3 0.4773068 0.08953814 0.13054726 0.9236350 0.6476370 0.6645666
## fi hix
## 1 1.126063 0.8705727
## 2 1.098964 0.8230013
## 3 1.283492 0.9311949
slope_parms <- abs_parms(absorbance, cuvl = 1, cores = cores)
slope_parms
## sample a254 a300 E2_E3 E4_E6 S275_295 S350_400
## 1 sample1 0.09323 0.04625 6.186462 1.672727 0.01793612 0.01440805
## 2 sample2 0.04530 0.02211 6.742489 1.520000 0.01964245 0.01495510
## 3 sample3 0.23328 0.12196 5.962771 3.581028 0.01709584 0.01701852
## S300_700 SR
## 1 0.01161254 1.244868
## 2 0.01007903 1.313428
## 3 0.01673141 1.004543
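In the table above, SR equals S275_295 divided by S350_400. The base R sketch below shows one simple way to estimate such slopes, a linear fit of log-transformed absorbance against wavelength, on a synthetic spectrum. The helper fit_slope and all values are hypothetical, and abs_parms may use a different fitting method.

```r
# Synthetic absorbance spectrum with a known slope of 0.018 nm^-1 (invented values)
wl <- 250:450
S_true <- 0.018
a <- 0.1 * exp(-S_true * (wl - 250))

# Slope of ln(a) against wavelength within a given range (hypothetical helper)
fit_slope <- function(wl, a, from, to) {
  sel <- wl >= from & wl <= to
  fit <- lm(log(a[sel]) ~ wl[sel])
  -unname(coef(fit)[2])   # sign flipped so that a decaying spectrum gives S > 0
}

S275_295 <- fit_slope(wl, a, 275, 295)
S350_400 <- fit_slope(wl, a, 350, 400)

# Slope ratio
SR <- S275_295 / S350_400
```

For this purely exponential spectrum both slopes recover the value used to generate it, so the ratio is 1. Real spectra deviate from a single exponential, which is what makes SR informative.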
Finding an appropriate PARAFAC model is an iterative process: you repeatedly start the analysis over with new parameters.
The package comes with some example data but due to the size, not all samples are included:
data(eem_list)
If you used the basic analysis template, you can use the resulting data right away. In case you did several analyses and want to combine the samples, you can use eem_import_dir to combine EEM samples from several RData files. Put all of them in one folder and run the following:
eem_list <- eem_import_dir(dir)
Due to package size issues, no example data is included for this function.
You might want to run eem_checkdata again before continuing with the analysis!
It is crucial to find an appropriate number of components in the analysis. To help you compare different numbers, a series of PARAFAC models can be calculated and compared. In this case, 4 models ranging from 5 to 8 components are calculated.
The analysis can end up in a local minimum, so a certain number of similar models with different random starting values is calculated and the best one is kept for further steps. nstart is the number of these models; 4 might be a good start, although for a thorough analysis higher values (e.g. 10) are suggested.
You can speed up the calculations by using multiple cores. Beware that calculating a PARAFAC model can take some time!
Higher maxit and lower ctol increase the accuracy of the model but need more computation time.
Normalising the samples is highly recommended. The normalisation factors are saved with the model and results are corrected later on.
# minimum and maximum of numbers of components
dim_min <- 5
dim_max <- 8
nstart <- 10 # number of similar models from which best is chosen
cores <- parallel::detectCores()/2 # use all cores but do not use all threads
maxit <- 500 # maximum number of iterations in PARAFAC analysis
ctol <- 10^-5 # tolerance in PARAFAC analysis
pfres_comps <- eem_parafac(eem_list, comps = seq(dim_min,dim_max), normalise = TRUE, maxit = maxit, nstart = nstart, ctol = ctol, cores = cores)
To save time, you can use the generated PARAFAC model included in the package.
data(pfres_comps1)
Plot the components of the created models. You can see the model fits and the components (rows) for models with different numbers of components (columns) in 2 different views. The single plots can be created using eempf_fits and eempf_plot_comps.
eempf_compare(pfres_comps)
In case of uneven peak heights, eempf_rescaleBC can help you improve your graphs. The parameter newscale specifies the root mean square of each column in matrices B and C. This is compensated in the A-mode (sample loadings). Alternatively, newscale can be set to "Fmax"; each peak then has a height of 1.
pfres_comps <- lapply(pfres_comps, eempf_rescaleBC, newscale = "Fmax")
Here, we choose 6 components. Do not pick the 6th element of the list of models: the model with 6 components was the 2nd model created!
comps <- 6
cp_out <- pfres_comps[[which(comps == seq(dim_min, dim_max))]]
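The index lookup used above can be checked in isolation. This small sketch repeats the values of dim_min, dim_max and comps from this tutorial; n_comps and model_index are names introduced here for illustration.

```r
dim_min <- 5
dim_max <- 8
comps <- 6

# Number of components of the models at list positions 1 to 4
n_comps <- seq(dim_min, dim_max)

# Position of the model with the requested number of components
model_index <- which(n_comps == comps)
model_index
# [1] 2
```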
The leverage is calculated by eempf_leverage and can be plotted with eempf_leverage_plot. Using eempf_leverage_ident to plot the leverage shows an interactive plot where you can click on certain values to save them in a variable. qlabel defines the size of the upper percentile that is labelled in the plots. eempf_mleverage can be used to compare the leverages of samples in different models.
# calculate leverage
cpl <- eempf_leverage(cp_out)
# plot leverage (nice plot)
eempf_leverage_plot(cpl,qlabel=0.1)
# plot leverage, not so nice plot but interactive to select what to exclude
# saved in exclude, can be used to start over again with eem_list_ex <- eem_list %>% eem_exclude(exclude) above
exclude <- eempf_leverage_ident(cpl,qlabel=0.1)
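For readers unfamiliar with the term, leverage is commonly defined as the diagonal of the projection matrix A (A'A)^-1 A' built from a loading matrix A. The base R sketch below computes it for invented sample loadings. It illustrates the general concept under that assumption and is not taken from the eempf_leverage source code.

```r
set.seed(1)
# Hypothetical sample loadings: 10 samples, 3 components (random values)
A <- matrix(runif(30), nrow = 10, ncol = 3,
            dimnames = list(paste0("sample", 1:10), paste0("Comp.", 1:3)))

# Leverage = diagonal of the projection ("hat") matrix A (A'A)^-1 A'
lev <- diag(A %*% solve(crossprod(A)) %*% t(A))
names(lev) <- rownames(A)

# Names of the samples in the upper 10 % of leverage values
high_leverage <- names(lev)[lev >= quantile(lev, 0.9)]
high_leverage
```

Each value lies between 0 and 1, and the values sum to the number of components, so a sample with a leverage close to 1 dominates the model.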
Here, we specify the exclude list manually to keep track of what we did.
# samples, excitation and emission wavelengths to exclude, makes sense after calculation of leverage
exclude <- list("ex" = c(200,205,210,215,220,225,230,235,240,245,250,255,260,265,270,275,280,285,290,295,300),
"em" = c(534,536,538,540,542,544,546,548,550,552,554,556,558,560,562,564,566,568,570,572,574,576,578,580,582,584,586,588,590,592,594,596,598,600),
"sample" = c("sample87","sample78","sample95","sample12","sample17","sample51")
)
# exclude outliers if necessary. if so, restart analysis
eem_list_ex <- eem_exclude(eem_list, exclude)
A new PARAFAC model is then generated without samples and wavelengths identified as outliers:
pfres_comps2 <- eem_parafac(eem_list_ex, comps = seq(dim_min,dim_max), normalise = TRUE, maxit = maxit, nstart = nstart, ctol = ctol, cores = cores)
Results are already added to the package and can be loaded:
data(pfres_comps2)
And again, you need to decide on the number of components to use for further analysis. In advance you can rescale the components.
pfres_comps2 <- lapply(pfres_comps2, eempf_rescaleBC, newscale = "Fmax")
eempf_compare(pfres_comps2)
Choose 6 components for the further analysis.
comps <- 6
cp_out <- pfres_comps2[[which(comps==seq(dim_min,dim_max))]]
Please repeat these steps until you are satisfied with the results!
eempf_comp_load_plot(cp_out)
## [[1]]
##
## [[2]]
Separate plots can be generated by using ggeem for components and eempf_load_plot for the loadings.
It is possible to view the components in 3D.
eempf_comps3D(cp_out)
The PARAFAC algorithm assumes no correlation between the components. In case you did not normalise your samples, doing it could decrease the correlation.
# check for correlation between components table
# high correlations should be avoided
# try to normalise data or remove outliers as first step
eempf_cortable(cp_out)
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Comp.1 1.00000000 0.98597747 0.9329600 -0.49359333 0.52302206
## Comp.2 0.98597747 1.00000000 0.9512168 -0.44304555 0.55869255
## Comp.3 0.93296003 0.95121678 1.0000000 -0.31248947 0.75076600
## Comp.4 -0.49359333 -0.44304555 -0.3124895 1.00000000 0.26827711
## Comp.5 0.52302206 0.55869255 0.7507660 0.26827711 1.00000000
## Comp.6 -0.07349928 -0.04155632 -0.0684308 -0.02270332 -0.04189756
## Comp.6
## Comp.1 -0.07349928
## Comp.2 -0.04155632
## Comp.3 -0.06843080
## Comp.4 -0.02270332
## Comp.5 -0.04189756
## Comp.6 1.00000000
# plot correlations
eempf_corplot(cp_out)
The plot shows samples in columns, while the rows show the components (6 in this case), the residuals and the whole sample.
Plotting the residuals, especially of outliers, can help you to analyse your data and find reasons why a certain sample is considered an outlier. In this example, sample12 was removed as an outlier.
# plot components in each sample, residual and whole sample
eempf_residuals_plot(cp_out, eem_list, select = eem_names(eem_list)[10:14], cores = cores)
## [[1]]
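The rows of such a plot follow from the PARAFAC model structure: each component contributes the outer product of its emission and excitation loadings, scaled by the sample loading. The modelled EEM is the sum of these contributions and the residual is the measured EEM minus the modelled one. A minimal base R sketch with invented loadings (all names and values hypothetical):

```r
set.seed(42)
n_em <- 5; n_ex <- 4; n_comp <- 2

# Hypothetical loadings for one sample (a), emission (B) and excitation (C) modes
a <- c(2, 0.5)
B <- matrix(runif(n_em * n_comp), n_em, n_comp)
C <- matrix(runif(n_ex * n_comp), n_ex, n_comp)

# Contribution of each component f: a_f * b_f c_f'
parts <- lapply(seq_len(n_comp), function(f) a[f] * outer(B[, f], C[, f]))

# Modelled EEM = sum of all component contributions
modelled <- Reduce(`+`, parts)

# "Measured" EEM = model plus a little noise; residual = measured - modelled
measured <- modelled + matrix(rnorm(n_em * n_ex, sd = 0.01), n_em, n_ex)
residual <- measured - modelled
```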
To see the residuals only, set residuals_only = TRUE. Please consider the difference in scales compared to the plot above!
# plot residuals only
eempf_residuals_plot(cp_out, eem_list, select = eem_names(eem_list)[c(10,11,13:16)], residuals_only = TRUE, cores = cores, spp = 6)
## [[1]]
You can plot the residuals of outliers as well. The components calculated without those samples are fitted to the outlier samples. By plotting the residuals you might find a reason for their outlier nature. Plotting some outliers might fail altogether, as the fitting algorithm might fail if samples are very different from the original training set.
# plot residuals of the outlier samples only
eempf_residuals_plot(cp_out, eem_list, select = c("sample12","sample17"), residuals_only = TRUE, cores = cores, spp = 6)
## [[1]]
The split-half analysis is intended to show the stability of your model. The data is recombined in 6 different ways and results from each subsample should be similar.
#calculate split_half analysis
sh <- splithalf(eem_list_ex, comps, normalise = TRUE, rand = FALSE, cores = cores)
Split-half analysis takes some time, so the results are included in the package.
data(sh)
The results from the split-half analysis can be plotted. Your model is stable if the graphs of all components look quite similar.
splithalf_plot(sh)
Tucker's congruence coefficient (TCC) is a measure of their similarity, and splithalf_tcc returns a table showing the values. A value of 1 indicates perfect similarity.
tcc_sh_table <- splithalf_tcc(sh)
tcc_sh_table
## # A tibble: 18 x 4
## component comb tcc_ex tcc_em
## <chr> <chr> <dbl> <dbl>
## 1 Comp.1 ABvsCD 0.998 0.999
## 2 Comp.1 ACvsBD 0.801 0.992
## 3 Comp.1 ADvsBC 0.980 0.964
## 4 Comp.2 ABvsCD 0.952 0.977
## 5 Comp.2 ACvsBD 0.680 0.997
## 6 Comp.2 ADvsBC 0.784 0.978
## 7 Comp.3 ABvsCD 0.967 0.916
## 8 Comp.3 ACvsBD 1.000 0.958
## 9 Comp.3 ADvsBC 0.904 0.843
## 10 Comp.4 ABvsCD 0.960 0.996
## 11 Comp.4 ACvsBD 0.799 0.995
## 12 Comp.4 ADvsBC 0.948 0.998
## 13 Comp.5 ABvsCD 0.987 0.948
## 14 Comp.5 ACvsBD 0.990 0.973
## 15 Comp.5 ADvsBC 0.717 0.853
## 16 Comp.6 ABvsCD 0.998 0.998
## 17 Comp.6 ACvsBD 1.000 0.985
## 18 Comp.6 ADvsBC 0.999 0.998
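Tucker's congruence coefficient between two loading vectors x and y is sum(x * y) / sqrt(sum(x^2) * sum(y^2)). The following base R sketch applies it to invented spectra; the function name tcc and the example spectra are hypothetical and only serve to show how the values in the table can be read.

```r
# Tucker's congruence coefficient between two loading vectors
tcc <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Hypothetical emission spectra of one component from two split halves
em <- seq(300, 600, by = 10)
half1 <- exp(-((em - 450)^2) / (2 * 40^2))
half2 <- exp(-((em - 460)^2) / (2 * 40^2))   # peak shifted by 10 nm

tcc(half1, half1)      # identical shapes
tcc(half1, 3 * half1)  # scaling does not change the coefficient
tcc(half1, half2)      # shifted peak gives a value below 1
```

Because the coefficient ignores scaling, it compares only the shapes of the spectra, which is why it is suited to comparing components from models fitted to different subsets of samples.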
As a way of model validation, the core consistency can be calculated.
corcondia <- eempf_corcondia(cp_out, eem_list_ex)
EEMqual according to @Bro_2011 can be calculated.
eemqual <- eempf_eemqual(cp_out, eem_list_ex, sh)
You can use eempf_openfluor to export a file that can be uploaded to openfluor.org [@Murphy_2014]. Please check the file header manually after export, as some values cannot be determined automatically.
The report contains important settings and results from your analysis and is exported as an HTML file. You can specify the data you want to include.
Using eempf_export you can export your model matrices to a CSV file.
eem_metatemplate is intended as a list of samples and a template for a table containing metadata. Writing this table to a file eases the step of gathering the needed values for all samples.
eem_dilcorr creates a table containing information on how to handle diluted samples. Absorbance spectra need to be replaced by undiluted measurements (preferably) or multiplied by the dilution factor. Names of EEM samples can be adjusted to match their undiluted absorbance sample. You can choose for each sample how you want to proceed on the command line. The table contains information about these two steps.
eem_absdil takes information from the table generated by eem_dilcorr and multiplies or deletes undiluted absorbance sample data.
eem_eemdil takes information from the table generated by eem_dilcorr and renames EEM samples to match undiluted absorbance samples.