README

An R package for univariate kernel density estimation with parametric starts and asymmetric kernels.

News

kdensity is now linked to univariateML, meaning it supports the 30+ parametric starts from that package!

Overview

kdensity is an implementation of univariate kernel density estimation with support for parametric starts and asymmetric kernels. Its main function is kdensity, which has approximately the same syntax as stats::density. Its new functionality is:

A reason to use kdensity is to avoid boundary bias when estimating densities on the unit interval or the positive half-line. Asymmetric kernels such as gamma and gcopula are designed for this purpose. The support for parametric starts allows you to easily use a method that is often superior to ordinary kernel density estimation.
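To make the boundary bias concrete, here is a minimal base-R sketch that does not use kdensity. The simulated exponential data and the bw.nrd0 bandwidth rule (the default in stats::density) are assumptions chosen for illustration. An ordinary Gaussian kernel estimate places some of its probability mass below zero even though the data cannot be negative, and that mass has a closed form.

```r
set.seed(1)
x <- rexp(500)   # simulated data supported on the positive half-line
h <- bw.nrd0(x)  # default bandwidth rule of stats::density

# A Gaussian kernel estimate is an equal-weight mixture of N(x_i, h^2),
# so its probability mass below zero is the average of pnorm(-x_i / h).
leaked_mass <- mean(pnorm(-x / h))
leaked_mass
```

Any mass reported here is assigned to values the data can never take, which is the problem asymmetric kernels are meant to avoid.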

Several R packages deal with kernel estimation. For an overview see Deng & Wickham (2011). While no other R package handles density estimation with parametric starts, several packages support methods that handle boundary bias. evmix provides a variety of boundary bias correction methods in the bckden function. kde1d corrects for boundary bias using transformed univariate local polynomial kernel density estimation. logKDE corrects for boundary bias on the half-line using a logarithmic transform. ks supports boundary correction through the kde.boundary function, while Ake corrects for boundary bias using tailored kernel functions.

Installation

# For the CRAN release
install.packages("kdensity")
# For the development version from GitHub:
# install.packages("devtools")
devtools::install_github("JonasMoss/kdensity")

Usage Example

Load the package with library and call kdensity just like stats::density, with optional additional arguments.

library("kdensity")
plot(kdensity(mtcars$mpg, start = "normal"))

Description

Kernel density estimation with a parametric start was introduced by Hjort and Glad in "Nonparametric Density Estimation with a Parametric Start" (1995). The idea is to start out with a parametric density before you do your kernel density estimation, so that the kernel density estimation acts as a correction to the original parametric estimate. The resulting estimator will outperform the ordinary kernel density estimator in terms of asymptotic integrated mean squared error whenever the true density is close to your suggestion, and it can be superior to the ordinary kernel density estimator even when the suggestion is pretty far off.
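The idea can be sketched in a few lines of base R. This is an illustrative sketch, not the code used inside kdensity: it assumes a normal start, a Gaussian kernel and the bw.nrd0 bandwidth rule, and the names hjort_glad, mu_hat and sd_hat are hypothetical. The estimate at a point is the fitted parametric density multiplied by a kernel-smoothed correction factor.

```r
set.seed(1)
x <- rnorm(200, mean = 2, sd = 1.5)  # simulated data, for illustration only

# Step 1: fit the parametric start (here a normal) by maximum likelihood.
mu_hat <- mean(x)
sd_hat <- sqrt(mean((x - mu_hat)^2))  # the ML estimate divides by n, not n - 1

# Step 2: multiply the start by a kernel estimate of the correction factor,
# which weights each observation by 1 / start(x_i).
h <- bw.nrd0(x)  # rule-of-thumb bandwidth, an assumption of this sketch
hjort_glad <- function(t) {
  sapply(t, function(t0) {
    start <- dnorm(t0, mu_hat, sd_hat)
    correction <- mean(dnorm((t0 - x) / h) / h / dnorm(x, mu_hat, sd_hat))
    start * correction
  })
}

hjort_glad(c(0, 2, 4))
```

When the start fits well, the correction factor stays close to 1 and the estimate stays close to the parametric fit. This sketch skips any normalizing step, so the result need not integrate exactly to one.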

In addition to parametric starts, the package implements some asymmetric kernels. These kernels are useful when modelling data with sharp boundaries, such as data supported on the positive half-line or the unit interval. Currently we support the following asymmetric kernels:

These features can be combined to make asymmetric kernel density estimators with parametric starts; see the airquality example under Usage. The package contains only one function, kdensity, in addition to the generics plot, points, lines, summary, and print.
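As a rough illustration of what an asymmetric kernel does, the following base-R sketch implements the gamma kernel estimator of Chen (2000). It may differ in detail from the gamma kernel implemented in kdensity. The simulated data, the smoothing parameter b and the name gamma_kde are assumptions made for this example.

```r
set.seed(1)
x <- rexp(500)  # simulated data on the positive half-line
b <- 0.1        # smoothing parameter, picked by hand for illustration

# At each evaluation point t0 >= 0 the kernel is a gamma density with
# shape t0 / b + 1 and scale b, evaluated at the observations.
gamma_kde <- function(t) {
  sapply(t, function(t0) mean(dgamma(x, shape = t0 / b + 1, scale = b)))
}

gamma_kde(c(0, 0.5, 1, 2))
```

Because every kernel is itself a density on the positive half-line, the estimate puts no mass below zero. The function is only meant to be evaluated at non-negative points.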

Usage

The function kdensity takes some data, a kernel (the kernel argument) and a parametric start (the start argument). You can optionally specify the support parameter, which is used to find the normalizing constant.

The following example uses the airquality data set. The black curve is a gamma-kernel density estimate with a gamma start, the red curve is a fully parametric gamma density, and the blue curve is an ordinary density estimate. Notice the boundary bias of the ordinary density estimator. The underlying parameter estimates are always maximum likelihood.

library("kdensity")
kde = kdensity(airquality$Wind, start = "gamma", kernel = "gamma")
plot(kde, main = "Wind speed (mph)")
lines(kde, plot_start = TRUE, col = "red")
lines(density(airquality$Wind, adjust = 2), col = "blue")
rug(airquality$Wind)

Since the return value of kdensity is a function, kde is callable and can be used like any density function in R (such as stats::dnorm). For example, you can do:

kde(10)
#> [1] 0.09980471
integrate(kde, lower = 0, upper = 1) # The cumulative distribution up to 1.
#> 1.27532e-05 with absolute error < 2.2e-19

You can access the parameter estimates by using coef. You can also access the log likelihood (logLik), AIC and BIC of the parametric start distribution.

coef(kde)
#> Maximum likelihood estimates for the Gamma model 
#>  shape    rate  
#> 7.1873  0.7218
logLik(kde)
#> 'log Lik.' 12.33787 (df=2)
AIC(kde)
#> [1] -20.67574

How to Contribute or Get Help

If you encounter a bug, have a feature request or need some help, open a GitHub issue. Create a pull request to contribute. This project follows a Contributor Code of Conduct.
