This tutorial is quite fast and on a very simple data set (2
conditions only), for a more complicated tutorial on the setup please
see vignette('baldur_ups_tutorial')
. First we load
baldur
and setup the model dependent variables we need,
then normalize the data and add the mean-variance trends.
library(baldur)
# Setup design matrix
<- model.matrix(~0+factor(rep(1:2, each = 3)))
yeast_design colnames(yeast_design) <- paste0('ng', c(50, 100))
# Compare the first and second column of the design matrix
# with the following contrast matrix
<- matrix(c(-1, 1), nrow = 2)
yeast_contrast
# Set id column
<- colnames(yeast)[1] # "identifier"
id_col
# Since baldur itself does not deal with missing data we remove the
# rows that have missing values for the purpose of the tutorial.
# Else, one would replace the filtering step with imputation but that is outside
# the scope of baldur
<- yeast %>%
yeast_norm # Remove missing data
::drop_na() %>%
tidyr# Normalize data (this might already have been done if imputation was performed)
psrn(id_col) %>%
# Add mean-variance trends
calculate_mean_sd_trends(yeast_design)
Importantly, note that the column names of the design matrix are unique subsets of the names of the columns within the conditions:
colnames(yeast)
#> [1] "identifier" "ng50_1" "ng50_2" "ng50_3" "ng100_1" "ng100_2" "ng100_3"
colnames(yeast_design)
#> [1] "ng50" "ng100"
This is essential for baldur
to know which columns to
use in calculations and to perform transformations.
Next is to infer the mixture of the data and to estimate the
parameters needed for baldur
. First we will setup the
needed variables for using baldur
without partitioning the
data. Then, partitioning and setting up baldur
after
trend-partitioning
# Fit the gamma regression
<- fit_gamma_regression(yeast_norm, sd ~ mean)
gr_model # Estimate the uncertainty
<- estimate_uncertainty(gr_model, yeast_norm, id_col, yeast_design) unc_gr
Finally we sample the posterior of each row in the data. First we sample assuming a single trend, then using the partitioning.
# Single trend
<- gr_model %>%
gr_results # Add hyper-priors for sigma
estimate_gamma_hyperparameters(yeast_norm) %>%
infer_data_and_decision_model(
id_col,
yeast_design,
yeast_contrast,
unc_gr,clusters = 10 # I highly recommend using parallel workers/clusters
# this will greatly reduce the speed of running baldur
) # The top hits then looks as follows:
%>%
gr_results ::arrange(err)
dplyr#> # A tibble: 1,802 × 22
#> identifier comparison err lfc lfc_025 lfc_50 lfc_975 lfc_eff lfc_rhat sigma sigma_025 sigma_50 sigma_975 sigma_eff sigma_rhat lp lp_025 lp_50 lp_975 lp_eff lp_rhat warnings
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list>
#> 1 Cre09.g406050.t1.2|PACid:30781126|--219 ng100 vs ng50 4.55e-210 6.18 5.78 6.18 6.59 2818. 1.00 0.124 0.0672 0.114 0.248 1278. 1.00 14.5 8.18 14.9 18.4 999. 1.00 <chr [0]>
#> 2 sp|P37302|APE3_YEAST--187 ng100 vs ng50 2.98e-173 1.51 1.40 1.51 1.62 3015. 0.999 0.0461 0.0248 0.0422 0.0915 1265. 1.00 29.7 23.1 30.1 33.7 914. 1.00 <chr [0]>
#> 3 Cre12.g554100.t1.1|PACid:30791753|--87 ng100 vs ng50 2.11e-168 1.61 1.49 1.61 1.72 3085. 1.00 0.0493 0.0271 0.0449 0.0972 1363. 1.00 28.8 22.0 29.3 32.8 913. 1.00 <chr [0]>
#> 4 sp|P38788|SSZ1_YEAST--86 ng100 vs ng50 1.99e-138 1.07 0.986 1.07 1.16 2582. 1.00 0.0363 0.0190 0.0331 0.0732 1191. 1.00 32.3 25.3 32.8 36.4 918. 1.00 <chr [0]>
#> 5 Cre14.g616700.t1.1|PACid:30776509|--709 ng100 vs ng50 1.74e-127 -4.54 -4.90 -4.54 -4.16 2902. 1.00 0.147 0.0782 0.134 0.295 982. 1.00 15.4 8.45 15.9 19.5 764. 1.00 <chr [0]>
#> 6 Cre10.g420750.t1.2|PACid:30790763|--50-53 ng100 vs ng50 2.27e-122 4.16 3.80 4.16 4.51 3748. 0.999 0.152 0.0829 0.140 0.301 1462. 1.00 17.4 10.6 17.8 21.4 1133. 1.00 <chr [0]>
#> 7 sp|P09624|DLDH_YEAST--65-70 ng100 vs ng50 2.60e-102 1.48 1.34 1.48 1.62 2764. 1.00 0.0578 0.0319 0.0532 0.114 1278. 1.00 26.2 19.4 26.7 30.1 968. 1.00 <chr [0]>
#> 8 Cre12.g533100.t1.1|PACid:30792932|--188-196 ng100 vs ng50 1.18e-100 1.41 1.28 1.41 1.54 2730. 1.00 0.0579 0.0314 0.0529 0.114 1116. 1.00 26.9 20.2 27.4 30.8 806. 1.00 <chr [0]>
#> 9 sp|P19882|HSP60_YEAST--132 ng100 vs ng50 1.30e- 89 0.885 0.798 0.884 0.974 3370. 1.00 0.0410 0.0229 0.0375 0.0792 1437. 1.00 33.2 26.9 33.7 37.0 925. 1.01 <chr [0]>
#> 10 Cre06.g306550.t1.1|PACid:30779423|--99 ng100 vs ng50 1.80e- 88 4.20 3.80 4.21 4.62 2685. 1.00 0.151 0.0805 0.139 0.298 1210. 1.00 13.8 7.29 14.2 17.9 920. 1.01 <chr [0]>
#> # ℹ 1,792 more rows
Here err
is the probability of error, i.e., the two
tail-density supporting the null-hypothesis, lfc
is the
estimated log\(_2\)-fold change,
sigma
is the common variance, and lp
is the
log-posterior. Columns without suffix shows the mean estimate from the
posterior, while the suffixes _025
, _50
, and
_975
, are the 2.5, 50.0, and 97.5, percentiles,
respectively. The suffixes _eff
and _rhat
are
the diagnostic variables returned by rstan
(please see the
Stan manual for details). In general, a larger _eff
indicates a better sampling efficiency, and _rhat
compares
the mixing within chains against between the chains and should be
smaller than 1.05.
First we fit the LGMR model:
<- fit_lgmr(yeast_norm, id_col, lgmr_model, cores = 5) yeast_lgmr
We can print the model with print
and extract parameters
of interest with coef
:
print(yeast_lgmr, pars = c("coef", "aux"))
#>
#> LGMR Model
#> mu=exp(-1.798012 - 0.3228386 f(bar_y)) + kappa exp(7.062949 - 0.3843305 f(bar_y))
#>
#> Auxiliary:
#> mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
#> alpha 4.091 0.004432 0.234 3.652 3.929 4.086 4.246 4.566 2788 1
#> nrmse 0.561 0.000242 0.014 0.534 0.552 0.561 0.571 0.589 3360 1
#>
#>
#> Coefficients:
#> mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
#> I -1.798 0.000436 0.0261 -1.849 -1.816 -1.798 -1.781 -1.746 3586 1
#> I_L 7.063 0.000491 0.0416 6.981 7.035 7.063 7.091 7.146 7161 1
#> S 0.323 0.000277 0.0232 0.278 0.307 0.323 0.338 0.369 7035 1
#> S_L 0.384 0.000285 0.0348 0.315 0.361 0.385 0.408 0.452 14936 1
# Extract the regression, alpha, and theta parameters and the NRMSE.
<- coef(yeast_lgmr, pars = "all") yeast_lgmr_coef
Baldur allows for two ways to plot the LGMR model,
plot_lgmr_regression
, and
plot_regression_field
. The first plots lines of three cases
of \(\theta\), 0
,
0.5
, and 1
, and colors each peptide according
to their infered \(\theta\). They can
be plotted accordingly:
plot_lgmr_regression(yeast_lgmr)
plot_regression_field(yeast_lgmr, rng = 25)
We can then estimate the uncertainty similar to the GR case:
<- estimate_uncertainty(yeast_lgmr, yeast_norm, id_col, yeast_design) unc_lgmr
Then running the data and decision model:
# Single trend
<- yeast_lgmr %>%
lgmr_results # Add hyper-priors for sigma
estimate_gamma_hyperparameters(yeast_norm, id_col) %>%
infer_data_and_decision_model(
id_col,
yeast_design,
yeast_contrast,
unc_lgmr,clusters = 10
)
baldur
have two ways of visualizing the results 1)
plotting sigma vs LFC and 2) Volcano plots. To plot sigma against LFC we
use plot_sa
:
%>%
gr_results plot_sa(
alpha = .05, # Level of significance
lfc = 1 # Add LFC lines
)
%>%
lgmr_results plot_sa(
alpha = .05, # Level of significance
lfc = 1 # Add LFC lines
)
While it is hard to see with this few examples, in general a good
decision is indicated by a lack of a trend between \(\sigma\) and LFC. To make a volcano plot
one uses plot_volcano
in a similar fashion to
plot_sa
:
%>%
gr_results plot_volcano(
alpha = .05, # Level of significance
lfc = 1 # Add LFC lines
)
%>%
lgmr_results plot_volcano(
alpha = .05, # Level of significance
lfc = 1 # Add LFC lines
)