where \(\log L(\theta;x,z)\) is the observed log-likelihood based on covariates and labels \((x,z)\), and \(P_\lambda(\theta)\) is an \(\ell_1\) or \(\ell_1/\ell_2\) penalty. For a more detailed discussion, please see our paper: https://arxiv.org/abs/1711.08129
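Concretely, in their standard forms (the package may use slightly different weights), these penalties are
\[ P_\lambda(\theta) = \lambda \lVert\theta\rVert_1 \quad\text{(lasso)}, \qquad P_\lambda(\theta) = \lambda \sum_{g} \sqrt{p_g}\, \lVert\theta_g\rVert_2 \quad\text{(group lasso)}, \]
where \(\theta_g\) is the sub-vector of coefficients in group \(g\) and \(p_g\) is the group size; the group penalty reduces to the lasso when every group is a singleton.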
We demonstrate the basic usage of the PUlasso package using simulated PU data.
data("simulPU")
This loads the simulPU object, which contains the input matrix \(X\), labels \(z\), true (latent) responses \(y\), and the true positive prevalence truePY1. We first visualize the data by plotting the first two columns \(X1\) and \(X2\), colored by \(z\) or \(y\), since the first two variables are set to be active in the simulation. From these plots, we see that many positive samples are marked as unlabelled. For more details about the simulation setting, see help(simulPU).
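As a minimal sketch of these plots in base R (the side-by-side layout and color choices here are ours, not part of the package):
# scatter plots of X1 vs X2, colored by z (left) and by the latent y (right)
oldpar <- par(mfrow = c(1, 2))
plot(simulPU$X[, 1], simulPU$X[, 2], col = simulPU$z + 1,
     xlab = "X1", ylab = "X2", main = "Colored by z")
plot(simulPU$X[, 1], simulPU$X[, 2], col = simulPU$y + 1,
     xlab = "X1", ylab = "X2", main = "Colored by y")
par(oldpar)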
We fit the model using the most basic call. By default, it fits models for 100 values of \(\lambda\), starting from the null model, with the lasso penalty.
fit=grpPUlasso(X=simulPU$X,z=simulPU$z,pi=simulPU$truePY1)
Coefficients can be extracted using the coef function. Here we extract the estimated coefficients at the 30th \(\lambda\). By default, coefficients are returned on the original scale. If desired, we can set std.scale=TRUE to obtain coefficients on the standardized scale.
coef(fit, lambda=fit$lambda[30])
## [,1]
## (Intercept) 0.05551533
## X1 0.30819669
## X2 0.36584501
## X3 0.00000000
## X4 0.00000000
## X5 0.00000000
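For instance, using the std.scale argument described above, the standardized-scale coefficients at the same \(\lambda\) would be obtained with:
coef(fit, lambda = fit$lambda[30], std.scale = TRUE)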
If we want to predict responses at a new \(x\), we use the predict function. By default, it returns estimated probabilities.
xnew = simulPU$X[400, , drop = FALSE]
predict(fit,newdata = xnew,lambda = fit$lambda[30])
## [,1]
## [1,] 0.253048
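The predict function also accepts a type argument (used with type = "response" later in this section); assuming it follows the usual glmnet-style convention, type = "link" would return the linear predictors instead of probabilities:
# type = "link" is an assumption by analogy with glmnet-style interfaces
predict(fit, newdata = xnew, lambda = fit$lambda[30], type = "link")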
It is common practice to choose \(\lambda\) based on cross-validation. The main function for cross-validation is cv.grpPUlasso.
cv.fit = cv.grpPUlasso(X=simulPU$X,z=simulPU$z,pi=simulPU$truePY1)
We use deviance as a measure of model fit. The average deviance and the standard error of the deviance over the \(k\) folds are saved for each \(\lambda\) value in cv.fit$cvm and cv.fit$cvsd, respectively. We are particularly interested in two \(\lambda\) values: lambda.min, which gives the minimum mean cross-validated deviance, and lambda.1se, which corresponds to the largest \(\lambda\) such that cvm is within one standard error of the minimum.
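These two values can be inspected directly (cv.fit$lambda.1se is used below; we assume lambda.min is stored in cv.fit in the same way):
cv.fit$lambda.min
cv.fit$lambda.1se
We can also extract the coefficients corresponding to these \(\lambda\) levels.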
coef(cv.fit,lambda=cv.fit$lambda.1se)
## [,1]
## (Intercept) -0.0280929
## X1 0.2444103
## X2 0.2796224
## X3 0.0000000
## X4 0.0000000
## X5 0.0000000
We conclude this section by demonstrating how to perform classification based on the fitted models. A natural threshold of 0.5 is applied for classification. We plot \(X1\) against \(X2\), colored by the predicted label yhat, to check the classification performance.
phat<-predict(cv.fit,newdata = simulPU$X,lambda = cv.fit$lambda.1se,type = "response")
yhat<-1*(phat>0.5)
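A minimal sketch of the plot described above (colors and labels are our choice):
# scatter plot of X1 vs X2, colored by the predicted class yhat
plot(simulPU$X[, 1], simulPU$X[, 2], col = yhat + 1,
     xlab = "X1", ylab = "X2", main = "Colored by yhat")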
We can also use a group lasso penalty. Suppose \(X1\) is in group 1, \(X2\) and \(X3\) are in group 2, and \(X4\) and \(X5\) are in group 3. We only need to provide the membership information using an additional vector, here named grpvec.
grpvec = c(1,2,2,3,3)
fit.grp = grpPUlasso(X=simulPU$X,z=simulPU$z,pi=simulPU$truePY1,group = grpvec)
All members of a group are either included in or excluded from the model together. For example, we see that from the 13th to the 14th \(\lambda\), the members of group 2 enter the model together.
coef(fit.grp,fit.grp$lambda[12:15])
## [,1] [,2] [,3] [,4]
## (Intercept) -0.02693134 -0.02679939 -0.031668063 -0.036112305
## X1 0.32136211 0.33761227 0.341612439 0.342443679
## X2 0.00000000 0.00000000 0.018538324 0.039900657
## X3 0.00000000 0.00000000 -0.001038037 -0.002210257
## X4 0.00000000 0.00000000 0.000000000 0.000000000
## X5 0.00000000 0.00000000 0.000000000 0.000000000
PUlasso can exploit sparsity in the input matrix for more efficient computation. If the input matrix is provided as a dgCMatrix, PUlasso automatically performs sparse calculations.
The package also allows users to use memory-mapped files for data larger than the available memory. For this functionality, a big.matrix object from the bigmemory package needs to be provided as the input matrix \(X\). There are in fact two big.matrix types, which are handled in different ways: a standard big.matrix only uses available RAM, while a file-backed big.matrix uses additional hard-drive space and thus may exceed available RAM.
For a simple demonstration, here we create dgCMatrix and standard big.matrix objects based on \(X\). First, we create a sparse matrix sparseX by setting 95% of the entries of \(X\) to 0.
library(Matrix)  # provides the Matrix() constructor and the dgCMatrix class
sparseX <- simulPU$X
sparseX[sample(1:length(simulPU$X), size = length(simulPU$X) * 0.95)] <- 0
sparseX <- Matrix(sparseX)
class(sparseX)
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
A standard big.matrix object can be created using the as.big.matrix function, as shown below. A more meaningful example would involve data whose size exceeds the available RAM: such data can be read into a file-backed big.matrix object through the read.big.matrix function in the bigmemory package, and the resulting object can then be used as the input to the grpPUlasso function. Please see https://cran.r-project.org/web/packages/bigmemory/index.html for a more detailed discussion.
library(bigmemory)  # provides as.big.matrix() and the big.matrix class
bigX <- as.big.matrix(simulPU$X)
class(bigX)
## [1] "big.matrix"
## attr(,"package")
## [1] "bigmemory"
These input matrices can be used in exactly the same way. For example,
spfit<-grpPUlasso(sparseX,simulPU$z,simulPU$truePY1)
bigfit<-grpPUlasso(bigX,simulPU$z,simulPU$truePY1)
newx = matrix(rnorm(10),2,5)
predict(spfit,newdata = newx,lambda = spfit$lambda[10])
## [,1]
## [1,] 0.5334296
## [2,] 0.5409738
predict(bigfit,newdata = newx,lambda = spfit$lambda[10])
## [,1]
## [1,] 0.6005043
## [2,] 0.5690398