This vignette describes the workflow of the {ino} package for the likelihood optimization of a hidden Markov model (HMM). For more technical details on HMMs, see, e.g., Zucchini et al. (2016).
The example data set considered throughout this vignette is a time series of share returns of the Deutsche Bank stock. The data is freely accessible via Yahoo Finance.
library("fHMM")
file <- tempfile()
fHMM::download_data(symbol = "DBK.DE", from = "2000-01-01", file = file)
#> Download successful.
#> * symbol: DBK.DE
#> * from: 2000-01-03
#> * to: 2022-09-28
#> * path: C:\Users\loelschlaeger\AppData\Local\Temp\RtmpcpEkAy\filee6c56577e4f
library("dplyr")
db_data <- read.csv(file) %>%
  as_tibble() %>%
  summarize(date = as.Date(Date, format = "%Y-%m-%d"),
            obs = c(NA, diff(log(Close), lag = 1) * 100)) %>%
  filter(!is.na(obs)) %>%
  print()
#> # A tibble: 5,815 × 2
#> date obs
#> <date> <dbl>
#> 1 2000-01-04 -5.16
#> 2 2000-01-05 -1.30
#> 3 2000-01-06 4.97
#> 4 2000-01-07 2.99
#> 5 2000-01-10 -2.61
#> 6 2000-01-11 0.350
#> 7 2000-01-12 -3.05
#> 8 2000-01-13 3.59
#> 9 2000-01-14 0.656
#> 10 2000-01-17 2.62
#> # … with 5,805 more rows
library("ggplot2")
ggplot(db_data, aes(x = date, y = obs)) +
geom_point() +
geom_line() +
ylab("log-returns [%]")
As the share returns are continuous and can take both negative and positive values, we consider an HMM with Gaussian state-dependent distributions. The (log-)likelihood of a Gaussian-HMM is implemented in the function f_ll_hmm().
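For intuition, the following sketch shows how such a negative log-likelihood can be evaluated with the forward algorithm. It is an illustration only, not the actual implementation of f_ll_hmm(), and it assumes the parameterization used later in this vignette: unconstrained tpm parameters mapped to probabilities via the logistic function and log-transformed standard deviations.

# Sketch of a 2-state Gaussian-HMM negative log-likelihood (illustration only,
# not the implementation of f_ll_hmm()).
# theta = (eta_1, eta_2, mu_1, mu_2, log(sigma_1), log(sigma_2))
neg_ll_sketch <- function(theta, obs) {
  eta <- theta[1:2]; mu <- theta[3:4]; sigma <- exp(theta[5:6])
  # build the tpm from the unconstrained parameters
  Gamma <- matrix(0, 2, 2)
  Gamma[1, 2] <- stats::plogis(eta[1]); Gamma[1, 1] <- 1 - Gamma[1, 2]
  Gamma[2, 1] <- stats::plogis(eta[2]); Gamma[2, 2] <- 1 - Gamma[2, 1]
  # stationary distribution as initial distribution
  delta <- solve(t(diag(2) - Gamma + 1), rep(1, 2))
  # scaled forward recursion (see Zucchini et al. 2016)
  foo <- delta * stats::dnorm(obs[1], mu, sigma)
  ll <- log(sum(foo)); phi <- foo / sum(foo)
  for (t in 2:length(obs)) {
    foo <- (phi %*% Gamma) * stats::dnorm(obs[t], mu, sigma)
    ll <- ll + log(sum(foo)); phi <- foo / sum(foo)
  }
  -ll
}

In the workflow below, f_ll_hmm() plays this role; it receives the data and the number of states N through setup_ino().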
We consider a 2-state (N = 2) Gaussian-HMM with six parameters (npar = 6) to be estimated: two parameters for the transition probability matrix (tpm), two state-dependent means, and two state-dependent standard deviations. The argument neg = TRUE indicates that we minimize the negative log-likelihood, and opt = set_optimizer_nlm() selects the optimizer stats::nlm().
hmm_ino <- setup_ino(
  f = f_ll_hmm,
  npar = 6,
  data = db_data,
  N = 2,
  neg = TRUE,
  opt = set_optimizer_nlm()
)
We first use randomly chosen starting values. As the first two starting values belong to the tpm, we sample from a normal distribution with mean -1.5 and a standard deviation of 0.5: since we use the multinomial logit link to ensure that the probabilities lie between 0 and 1, a mean of -1.5 corresponds to probabilities of staying in state 1 and state 2 of about 0.81. For the two means, we draw two random numbers from the standard normal distribution, as the time series above indicates that most of the returns are fairly close to zero. The starting values for the standard deviations are drawn from a uniform distribution between 0.5 and 2 (note that we exponentiate the standard deviations in the likelihood, as they are constrained to be positive, and hence we log-transform the starting values).
sampler <- function() c(log(stats::runif(2, 0.1, 0.9)),
                        stats::rnorm(2),
                        log(stats::runif(2, 0.5, 2)))
hmm_ino <- random_initialization(hmm_ino, runs = 50, sampler = sampler)
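As a quick check of the claim above about the logit link, the staying probability implied by a tpm parameter of -1.5 can be computed with the logistic function (this small computation is added here for illustration):

# staying probability implied by an off-diagonal tpm parameter of -1.5
# on the logit scale; roughly 0.82, as stated above
1 - stats::plogis(-1.5)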
For the fixed initialization strategy, we choose starting values that lie within the ranges considered above:
starting_values <- list(c(-2, -2, 0, 0, log(2), log(3)),
                        c(-1.5, -1.5, -2, 2, log(1), log(2)),
                        c(-1, -1, -3, 3, log(2), log(2)))
for (val in starting_values)
  hmm_ino <- fixed_initialization(hmm_ino, at = val)
To illustrate the subset initialization strategy, we fit our HMM to the first 50% of the observations. The starting values for these subsets are chosen randomly from the same distributions as above. The function subset_initialization() then fits the HMM to the entire sample again, using the estimates obtained from the subsets as initial values.
hmm_ino <- subset_initialization(
  hmm_ino, arg = "data", how = "first", prop = 0.5,
  initialization = random_initialization(runs = 50, sampler = sampler)
)
As the time series of share returns consists of 5815 observations, we can even try to use fewer observations to fit our model. In the next step, we consider only the first 5% of the observations.
hmm_ino <- subset_initialization(
  hmm_ino, arg = "data", how = "first", prop = 0.05,
  initialization = random_initialization(runs = 50, sampler = sampler)
)
Selecting starting values for HMMs is a well-known issue, as poor starting values can easily result in local optima. Other R packages designed to fit HMMs discuss this topic in more detail (see, e.g., https://cran.r-project.org/package=moveHMM). We therefore first evaluate the optimizations by comparing the likelihood values at convergence, which can be displayed using overview_optima():
overview_optima(hmm_ino)
#> optimum frequency
#> 1 12877.85 142
#> 2 13774.87 10
#> 3 13454.33 1
The frequency table indicates that 142 out of 153 runs converged to the same likelihood value of 12877.85, which appears to be the global optimum (note that these are negative log-likelihood values).
Using summary(), we can investigate the computation time and the resulting optimum for all runs:
summary(hmm_ino)
#> # A tibble: 153 × 4
#> .strategy .time .optimum .optimizer
#> <chr> <drtn> <dbl> <chr>
#> 1 random 11.772112 secs 12878. stats::nlm
#> 2 random 9.919681 secs 12878. stats::nlm
#> 3 random 10.021004 secs 12878. stats::nlm
#> 4 random 9.735524 secs 12878. stats::nlm
#> 5 random 19.883488 secs 12878. stats::nlm
#> 6 random 11.398910 secs 12878. stats::nlm
#> 7 random 25.223363 secs 12878. stats::nlm
#> 8 random 20.918431 secs 12878. stats::nlm
#> 9 random 34.802511 secs 12878. stats::nlm
#> 10 random 8.948022 secs 13775. stats::nlm
#> # … with 143 more rows
To obtain the starting values that lead to the global optimum, we first check which of the optimization runs actually reached the global optimum:
which(summary(hmm_ino)$.optimum < 12878)
#> [1] 1 2 3 4 5 6 7 8 9 11 12 13 14 16 17 19 20 21
#> [19] 22 23 24 25 26 27 28 29 30 31 32 34 35 36 37 38 39 40
#> [37] 41 42 43 44 45 46 47 48 49 51 52 53 54 55 56 57 59 60
#> [55] 61 62 63 64 65 66 67 68 70 71 72 73 74 75 76 77 78 79
#> [73] 80 81 82 83 84 85 86 87 88 89 91 92 94 95 96 97 98 100
#> [91] 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118
#> [109] 119 120 121 122 123 124 125 126 127 129 130 131 132 133 134 135 136 137
#> [127] 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153
We can access the corresponding initial values of, for example, the first of these runs via:
get_vars(hmm_ino, runs = which(summary(hmm_ino)$.optimum < 12878)[1])[[1]]$.init
#> [1] -1.2160764 -1.5387768 -0.7074952 0.3645820 0.5118483 0.5282509
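These values live on the unconstrained scale used inside the likelihood. Assuming the parameterization described above (logit link for the tpm parameters, log link for the standard deviations), they can be mapped back to interpretable quantities along the following lines (a sketch for illustration):

# map the unconstrained starting values back to interpretable parameters
# (assuming the logit/log parameterization described above)
init <- get_vars(hmm_ino, runs = which(summary(hmm_ino)$.optimum < 12878)[1])[[1]]$.init
stats::plogis(init[1:2])  # off-diagonal transition probabilities
init[3:4]                 # state-dependent means
exp(init[5:6])            # state-dependent standard deviations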
We use the plot() function to investigate the computation time. Intuitively, optimization runs with fixed starting values should be faster than those with random starting values, since we carefully chose the fixed starting values by inspecting the data. The boxplots confirm this intuition:
plot(hmm_ino, by = ".strategy", nrow = 1)
Using the output provided by summary() and some data manipulation functions from the {dplyr} package, we can also compute the average time per strategy:
summary(hmm_ino) %>%
group_by(.strategy) %>%
summarise(avg_time = mean(.time))
#> # A tibble: 4 × 2
#> .strategy avg_time
#> <chr> <drtn>
#> 1 fixed 8.693801 secs
#> 2 random 14.082786 secs
#> 3 subset(first,0.05) 5.418156 secs
#> 4 subset(first,0.5) 8.008285 secs
This comparison can, however, be considered somewhat 'unfair', as not all optimization runs lead to the global optimum: runs that end in local optima may require fewer iterations and hence less computation time. We can first check which runs lead to the apparent global optimum of 12877.85. For that, we again use functions from {dplyr}.
summary(hmm_ino) %>%
mutate(global_optimum = ifelse(.optimum < 12878, 1, 0)) %>%
group_by(.strategy) %>%
summarise(proportion_global_optimum = mean(global_optimum))
#> # A tibble: 4 × 2
#> .strategy proportion_global_optimum
#> <chr> <dbl>
#> 1 fixed 1
#> 2 random 0.9
#> 3 subset(first,0.05) 0.98
#> 4 subset(first,0.5) 0.9
While all runs with fixed starting values converge to the global optimum, the random and subset initialization strategies do sometimes get stuck in local optima.
Let’s again compare the average computation time, but only for those runs that lead to the global optimum:
summary(hmm_ino) %>%
filter(.optimum < 12878) %>%
group_by(.strategy) %>%
summarise(mean_time = mean(.time))
#> # A tibble: 4 × 2
#> .strategy mean_time
#> <chr> <drtn>
#> 1 fixed 8.693801 secs
#> 2 random 14.833138 secs
#> 3 subset(first,0.05) 5.488308 secs
#> 4 subset(first,0.5) 8.375048 secs
For those runs that converged to the global optimum, we can again compare the computation time via boxplots:
summary(hmm_ino) %>%
filter(.optimum < 12878) %>%
ggplot(aes(x = "", y = .time)) +
scale_y_continuous() +
geom_boxplot() +
facet_wrap(".strategy", labeller = "label_both", nrow = 1) +
theme(
  axis.title.x = element_blank(),
  axis.text.x = element_blank(),
  axis.ticks.x = element_blank()
) +
ylab("optimization time")
While computation time is low for the fixed initialization strategy, subset initialization appears to be a promising approach: in particular, when the model is first fitted to only the first 5% of the observations, the average computation time is clearly lower than for the random initialization approach.
The vignette was built using R 4.2.1 and {ino} version 0.2.0.