“Quick and Easy” ML

**MUCH SIMPLER USER INTERFACE** than caret, mlr3, tidymodels, etc.

Easy for learners, powerful/convenient for experts

Ideal for teaching!

Numerous built-in real datasets

Includes **tutorials** on major ML methods

Special features for those experienced in ML

advanced functions for feature selection and model development

advanced ML algorithms, including some novel/unusual ones

advanced plotting utilities

The letters ‘qe’ in the package name stand for “quick and easy,”
alluding to the convenience goal of the package. We bring together a
variety of machine learning (ML) tools from standard R packages,
providing wrappers with a uniform, **extremely simple**
interface.

For instance, consider the **mlb1** data included in the
package, consisting of data on professional baseball players. As usual
in R, we load the data with **data(mlb1)**.

Here is what the data looks like:

```
> head(mlb1)
Position Height Weight Age
1 Catcher 74 180 22.99
2 Catcher 74 215 34.69
3 Catcher 72 210 30.78
4 First_Baseman 72 210 35.43
5 First_Baseman 73 188 35.71
6 Second_Baseman 69 176 29.39
```

The qe-series function calls all take the same very simple form: the function name, then the data frame, then the name of the column to be predicted.

For instance, say we wish to predict player weights. For the random forests ML algorithm, we would make the simple call **qeRF(mlb1,'Weight')**.

For gradient boosting, the call would be similar, **qeGBoost(mlb1,'Weight')**,
and so on. **IT COULDN’T BE EASIER**! No setup,
predefinitions etc.; just make a simple call.

Default values are used in the above calls, but nondefault values can be specified through additional arguments.

Each qe-series function is paired with a **predict**
method. For example, after fitting a random forest for player weight, a catcher of height 73 and age 28 would be predicted to have weight about 204.
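The workflow described so far (load the data, fit with a nondefault setting, predict a new case) could look roughly like the sketch below. It assumes qeML is installed and is skipped otherwise. The argument names **nTree** and **minNodeSize** are assumptions recalled from the package documentation and should be confirmed with **?qeRF**. The object names **rf_fit**, **new_player** and **pred_weight** are made up for illustration.

```r
# Sketch only: needs the qeML package; skipped quietly if it is not installed.
if (requireNamespace("qeML", quietly = TRUE)) {
  library(qeML)
  data(mlb1)

  # Fit on all rows (holdout = NULL), overriding two defaults.
  # nTree and minNodeSize are assumed argument names; check ?qeRF.
  rf_fit <- qeRF(mlb1, 'Weight', nTree = 200, minNodeSize = 10,
                 holdout = NULL)

  # The new case supplies every predictor column except the outcome.
  new_player <- data.frame(Position = 'Catcher', Height = 73, Age = 28)
  pred_weight <- predict(rf_fit, new_player)
  print(pred_weight)
}
```

Random forests are randomized, so the predicted value will vary somewhat from run to run.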

Categorical variables can be predicted too. Where possible, class probabilities are computed in addition to class. Let’s predict player position from the physical characteristics:

```
> w <- qeGBoost(mlb1,'Position')
> predict(w,data.frame(Height=73,Weight=185,Age=28))
$predClasses
[1] "Relief_Pitcher"
$probs
Catcher First_Baseman Outfielder Relief_Pitcher Second_Baseman
[1,] 0.02396515 0.03167778 0.2369061 0.2830575 0.1421796
Shortstop Starting_Pitcher Third_Baseman
[1,] 0.0592867 0.1824601 0.04046717
```

A player of height 73, weight 185 and age 28 would be predicted to be a relief pitcher, with probability 0.28. The second most-likely position would be outfielder, and so on.
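To read off the ranking of positions, the class probabilities can be sorted. The following base R sketch copies the probabilities from the output shown above into a named vector by hand, so it runs without the package. The variable names are illustrative only.

```r
# Class probabilities copied from the predict() output shown above.
probs <- c(Catcher = 0.02396515, First_Baseman = 0.03167778,
           Outfielder = 0.2369061, Relief_Pitcher = 0.2830575,
           Second_Baseman = 0.1421796, Shortstop = 0.0592867,
           Starting_Pitcher = 0.1824601, Third_Baseman = 0.04046717)

# Sort from most to least likely position.
ranked <- sort(probs, decreasing = TRUE)
print(round(ranked, 2))

# In this example the reported class is the one with the largest probability.
top_class <- names(ranked)[1]      # "Relief_Pitcher"
second_class <- names(ranked)[2]   # "Outfielder"
```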

By default, the qe functions reserve a *holdout set* on which
to assess model accuracy. The remaining data form the *training
set*. After a model is fit to the training set, we use it to predict
the holdout data, so as to assess the predictive power of our model. (To
specify no holdout, set holdout=NULL in the call.)

```
> z <- qeRF(mlb1,'Weight')
holdout set has 101 rows
> z$testAcc
[1] 14.45285
> z$baseAcc
[1] 17.22356
```

The mean absolute prediction error (MAPE) on the holdout data was about 14.5 pounds. On the other hand, if we had simply predicted every player using the overall mean weight, the MAPE would be about 17.2. So, using height, age and player position for our prediction did improve things.
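For readers who want to see what such a holdout computation involves, here is a minimal base R sketch of the idea. It is not the qeML implementation: it uses the built-in **mtcars** data and a linear model as stand-ins, and the holdout size, seed and variable names are arbitrary choices. The baseline here uses the training-set mean, which is one reasonable reading of "overall mean."

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
dat <- mtcars[, c("mpg", "wt", "hp")]

# Reserve a random holdout set; the remaining rows form the training set.
n_holdout <- 8
holdout_idx <- sample(nrow(dat), n_holdout)
train <- dat[-holdout_idx, ]
test <- dat[holdout_idx, ]

# Fit on the training set only, then predict the holdout rows.
fit <- lm(mpg ~ wt + hp, data = train)
pred <- predict(fit, newdata = test)

# Mean absolute prediction error on the holdout set.
test_mape <- mean(abs(test$mpg - pred))

# Baseline: predict every holdout case with the training-set mean.
base_mape <- mean(abs(test$mpg - mean(train$mpg)))

print(c(test = test_mape, base = base_mape))
```

Comparing the two numbers shows how much the predictors help relative to guessing the mean.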

Of course, since the holdout set is random, so are the
above accuracy numbers. To gauge the predictive power of a model over
many holdout sets, one can use **replicMeans()**, which is
available in qeML via automatic loading of the **regtools**
package. Say we do this for 100 holdout sets.

The true MAPE for this model on new data is then estimated to be 13.6. The standard error is also output, to help judge whether 100 replicates are enough.
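The idea of averaging over many random holdout sets can be sketched in base R as follows. This is a stand-in for illustration, not the **replicMeans()** code: it again uses **mtcars** and a linear model, and the helper **one_holdout_mape()** is hypothetical.

```r
set.seed(2)                          # arbitrary seed, for reproducibility only
dat <- mtcars[, c("mpg", "wt", "hp")]

# One replicate: draw a fresh holdout set, fit, return the holdout MAPE.
one_holdout_mape <- function(dat, n_holdout = 8) {
  idx <- sample(nrow(dat), n_holdout)
  fit <- lm(mpg ~ wt + hp, data = dat[-idx, ])
  mean(abs(dat$mpg[idx] - predict(fit, newdata = dat[idx, ])))
}

n_rep <- 100
mapes <- replicate(n_rep, one_holdout_mape(dat))

# Average over replicates, with a standard error for that average.
est_mape <- mean(mapes)
std_err <- sd(mapes) / sqrt(n_rep)
print(c(estimate = est_mape, stderr = std_err))
```

A standard error that is small relative to the estimate suggests the number of replicates is adequate.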

The package includes tutorials for those with no background in machine learning, as well as tutorials on advanced topics. A few examples (showing how they are invoked):

**vignette('ML_Overview')**; for those with no prior ML background

**vignette('Overfitting')**; plugging “overfitting” into Google yielded 49,400,000 results! But what is overfitting REALLY about? Read here!

**vignette('Feature_Selection')**; we often need to pare down our set of predictor variables, both to save computation and to prevent overfitting; how can this be done, especially in **qeML**?

**vignette('PCA_and_UMAP')**; this vignette first takes a closer, more practical look at Principal Components Analysis, then gives an overview of UMAP, a relatively new nonlinear alternative to PCA

For a list of the package's functions, type **vignette('Function_list')**.