H2O4GPU: Machine Learning with GPUs in R

Navdeep Gill, Erin LeDell, Yuan Tang

2021-05-17

H2O4GPU is a collection of GPU solvers by H2O.ai with APIs in Python and R. The Python API builds upon the easy-to-use scikit-learn API. The h2o4gpu R package is a wrapper around the h2o4gpu Python package.

The R package makes use of RStudio's reticulate package for facilitating access to Python libraries through R. Reticulate embeds a Python session within your R session, enabling seamless, high-performance interoperability.

H2O4GPU is a new project under active development and we are looking for contributors! If you find a bug, please check that we have not already fixed the issue in the bleeding edge version and then check that we do not already have an issue opened for this topic. If not, then please file a new issue with a reproducible example.

Installation

There are a few system requirements, including Linux with glibc 2.17+, Python >=3.6, R >=3.1, CUDA 8 or 9, and a machine with Nvidia GPUs. The code should still run if you have CPUs, but it will fall back to scikit-learn CPU based versions of the algorithms.

The h2o4gpu Python module is a prerequisite for the R package. So first, follow the instructions here to install the h2o4gpu Python package (either at the system level or in a Python virtual envivonment). The easiest thing to do is to pip install the stable release whl file. To ensure compatibility, the Python package version number should match the R package version number.

The recomended way of installing the R package can is from CRAN using install.packages("h2o4gpu"). To install the development version of the h2o4gpu R package, you can install directly from GitHub as follows:

library(devtools)
devtools::install_github("h2oai/h2o4gpu", subdir = "src/interface_r")

Virtual environments

Using a Python virtual environment is a good solution if you don't want to upgrade your main Python installation to 3.6. If you installed the h2o4gpu Python module into a virtual environment, you will have to add a line of code to tell R which Python envivonment you want to use:

library(reticulate)
use_virtualenv("/home/username/venv/h2o4gpu")  # set this to the path of your venv

If you have installed h2o4gpu Python module using Anaconda, then you can use the use_condaenv() function instead. More information about Python environment configuration is available in the reticulate user guide.

Quickstart

Here's a quick demo of how to train and evaluate a GPU-based Random Forest classifier model. We will use the classic Iris dataset, which is a three-class classification problem and evaluate the performance of the model using classification error.

library(h2o4gpu)
library(reticulate)  # only needed if using a virtual Python environment
use_virtualenv("/home/username/venv/h2o4gpu")  # set this to the path of your venv

# Prepare data
x <- iris[1:4]
y <- as.integer(iris$Species) # all columns, including the response, must be numeric

# Initialize and train the classifier
model <- h2o4gpu.random_forest_classifier() %>% fit(x, y)

# Make predictions
pred <- model %>% predict(x)

# Compute classification error using the Metrics package (note this is training error)
library(Metrics)
ce(actual = y, predicted = pred)

Supervised Learning

H2O4GPU contains a collection of popular algorithms for supervised learning: Random Forest, Gradient Boosting Machine (GBM) and Generalized Linear Models (GLMs) with Elastic Net regularization. There are methods for regression and classification for each of these algorithms. Both Random Forest and GBM support multiclass clasification, however the GLM currently only supports binomial classification (a ticket for multinomial support is open here).

The tree based models (Random Forest and GBM) are built on top of the very powerful XGBoost library, and the Elastic Net GLM has been built upon the POGS solver. Proximal Graph Solver (POGS) is a solver for convex optimization problems in graph form using Alternating Direction Method of Multipliers (ADMM). We have found that this method is not as fast as we'd like it to be, so we are working on implementing an entirely new GLM from scratch (follow progress here).

The h2o4gpu R package does not include a suite of internal model metrics functions, therefore we encourage users to use a third-party model metrics package of their choice. For all the examples below, we will use the Metrics R package. This package has a large number of model metrics functions, all with a very simple, unified API.

Binary Classification

In this example, we will train and test three different models on a subset of the HIGGS dataset. The goal in this dataset is to distinguish between signal "1" and background "0", so this is a binary classification problem. The features are all numeric.

H2O4GPU requires all feature and response columns to be numeric, so in this case, we don't have to do any pre-processing of the data. If your response column is a factor, then you can simply convert the levels to integer values using as.integer(). If you have categorical/factor columns among your features, you must apply an encoding method to convert the columns into numeric data. Some options are label encoding (simply convert the levels to integers) or one hot encoding (binary indicator columns, one for each categorical level). For simplicity, in this tutorial, we will always use label encoding, however you can read more about different types of encodings here.

# Load a sample dataset for binary classification
# Source: https://archive.ics.uci.edu/ml/datasets/HIGGS
train <- read.csv("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test <- read.csv("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")

# Create train & test sets (column 1 is the response)
x_train <- train[, -1]
y_train <- train[, 1]
x_test <- test[, -1]
y_test <- test[, 1]

Below we see that the h2o4gpu modeling functions follow a two-phased functional apporach. The two phased approach to modeling (first initialize model, then train) is more common in Python, and we borrow that paradigm here. We blend this with the the functional pipe syntax in R.

First you define the model with it's hyperparameters, for example, h2o4gpu.gradient_boosting_classifier(n_estimators = 500, subsample = 0.8). Then we pipe the initialized model object to the fit(x, y) function to train the model, and save the resulting object.

# Train three different binary classification models
model_gbc <- h2o4gpu.gradient_boosting_classifier() %>% fit(x_train, y_train)
model_rfc <- h2o4gpu.random_forest_classifier() %>% fit(x_train, y_train)
model_enc <- h2o4gpu.elastic_net_classifier() %>% fit(x_train, y_train) 

We pipe our trained models to the familiar predict() method. In binary classification, we are often more interested in the numeric predicted values, rather than the predicted class labels. We follow the same design as the predict() function in the popular caret package, which allows the user to specify which type of predictions they want to return using the type argument. This defaults to "raw" which in classification, yields predicted class labels. When we set it to "prob", it returns the (uncalibrated) class probabilities. This is not mentioned often in modeling software documentation, but you should note that despite using the term "probabilities", these predicted values do not represent actual probabilities unless some method like Platt scaling is used for calibration. This is true for all machine learning packages, including caret, h2o, and h2o4gpu (though we do offer the option to perform Platt scaling inside the h2o R package).

# Generate predictions (type "prob" gives predicted values instead of predicted label)
pred_gbc <- model_gbc %>% predict(x_test, type = "prob")
pred_rfc <- model_rfc %>% predict(x_test, type = "prob")
pred_enc <- model_enc %>% predict(x_test, type = "prob")

Let's take a look at what the output of the predict() function looks like in binary classification. It will be a two-column matrix with the column names set to the names of the classes.

head(pred_rfc)

To compute AUC of a binary classification model, we use the predicted values of the second column (the "positive" class) and pass that to the Metrics::auc() function.

# Compare test set performance using AUC
auc(actual = y_test, predicted = pred_gbc[, 2])
auc(actual = y_test, predicted = pred_rfc[, 2])
auc(actual = y_test, predicted = pred_enc[, 2])

Multiclass Classification

Now that we are familiar with binary classification, there is not much more to say about multiclass classification. The predict output will have the same format as binary classification, except that if you use type = "prob" number of columns will match the number of classes. Often in multiclass classification, you may be interested in the predicted class label and misclassification error, which we've demonstrated already in the Quickstart section.

Regression

In this next exercise, we will compare a GBM and GLM regression model. Until this issue is respolved, we don't recommend that you use the Random Forest regressor, as there are some bugs that are severely affecting model performance.

We will predicting the age of abalone from physical measurements, using the Abalone dataset.

# Load a sample dataset for regression
# Source: https://archive.ics.uci.edu/ml/datasets/Abalone
df <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data", header = FALSE)
str(df)

There is one categorical/factor column in this dataset, so we will first convert those values to integers (label encoding). Recall that label encoding is just one way of encoding the categorical column and that there may be other ways that produce better results in terms of model performance.

df[, 1] <- as.integer(df[, 1])  #label encode the one factor column

In this case, we started with a single data frame, so we should break the data into train and test splits at random. We can do that easily in R by sampling 80% of the row indices and subsetting the data frame by row.

# Randomly sample 80% of the rows for the training set
set.seed(1)
train_idx <- sample(1:nrow(df), 0.8*nrow(df))

# Create train & test sets (column 9 is the response)
x_train <- df[train_idx, -9]
y_train <- df[train_idx, 9]
x_test <- df[-train_idx, -9]
y_test <- df[-train_idx, 9]
# Train two different regression models
model_gbr <- h2o4gpu.gradient_boosting_regressor() %>% fit(x_train, y_train)
model_enr <- h2o4gpu.elastic_net_regressor() %>% fit(x_train, y_train)

# Generate predictions 
pred_gbr <- model_gbr %>% predict(x_test)
pred_enr <- model_enr %>% predict(x_test)

In regression, the predict() function always returns a vector of predictions (not a data frame).

head(pred_gbr)

In regression problems, Mean Squared Error (MSE), is a common metric for model evaluation. We will use test set MSE to evaluate and compare our two models.

# Compare test set performance using MSE
mse(actual = y_test, predicted = pred_gbr)
mse(actual = y_test, predicted = pred_enr)

In this case, which is not usual, the GBM drastically outperforms the GLM.

Unsupervised Learning

The unsupervised learning algorithms in h2o4gpu include K-Means, Principal Component Analysis (PCA), and Truncated Singular Value Decompostion (SVD).

K-Means Clustering

First we will train a K-Means model. Let's create a train and test set from the iris dataset.

# Prepare data
iris$Species <- as.integer(iris$Species) # convert to numeric data

# Randomly sample 80% of the rows for the training set
set.seed(1)
train_idx <- sample(1:nrow(iris), 0.8*nrow(iris)) 
train <- iris[train_idx, ]
test <- iris[-train_idx, ]

Train a K-Means model with three clusters.

model_km <- h2o4gpu.kmeans(n_clusters = 3L) %>% fit(train)

Once you have trained a K-Means model, applying the transform() function to a dataset transforms your points into distances from each centroid. So your nxp matrix becomes nxk (n is the number of observations,p the number of features and k the number of clusters).

test_dist <- model_km %>% transform(test)
head(test_dist)

Principal Compoment Analysis (PCA)

Let's use the HIGGS train and test datasets again for demonstration.

# Load a sample dataset for binary classification
# Source: https://archive.ics.uci.edu/ml/datasets/HIGGS
train <- read.csv("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test <- read.csv("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")

Train a PCA model with 4 components and apply the transformation onto a dataset. Once you have created a projection model from a dataset, you can apply that transformation to a new dataset (such as a test set) using the transform() function.

model_pca <- h2o4gpu.pca(n_components = 4) %>% fit(train)
test_transformed <- model_pca %>% transform(test)

Truncated Singular Value Decomposition (SVD)

Train a truncated SVD model with 4 components and apply the transformation on a test set.

model_tsvd <- h2o4gpu.truncated_svd(n_components = 4) %>% fit(train)
test_transformed <- model_tsvd %>% transform(test)