This vignette serves as an example on data wrangling & visualization with
opensensmapr
,dplyr
andggplot2
.
# required packages:
library(opensensmapr) # data download
library(dplyr) # data wrangling
library(ggplot2) # plotting
library(lubridate) # date arithmetic
library(zoo) # rollmean()
openSenseMap.org has grown quite a bit in the last years; it would be interesting to see how we got to the current 3653 sensor stations, split up by various attributes of the boxes.
While opensensmapr
provides extensive methods of filtering boxes by attributes on the server, we do the filtering within R to save time and gain flexibility. So the first step is to retrieve all the boxes:
# if you want to see results for a specific subset of boxes,
# just specify a filter such as grouptag='ifgi' here
boxes = osem_boxes()
By looking at the createdAt
attribute of each box we know the exact time a box was registered. With this approach we have no information about boxes that were deleted in the meantime, but that’s okay for now.
exposure_counts = boxes %>%
group_by(exposure) %>%
mutate(count = row_number(createdAt))
exposure_colors = c(indoor = 'red', outdoor = 'lightgreen', mobile = 'blue', unknown = 'darkgrey')
ggplot(exposure_counts, aes(x = createdAt, y = count, colour = exposure)) +
geom_line() +
scale_colour_manual(values = exposure_colors) +
xlab('Registration Date') + ylab('senseBox count')
Outdoor boxes are growing fast! We can also see the introduction of mobile
sensor “stations” in 2017. While mobile boxes are still few, we can expect a quick rise in 2018 once the new senseBox MCU with GPS support is released.
Let’s have a quick summary:
exposure_counts %>%
summarise(
oldest = min(createdAt),
newest = max(createdAt),
count = max(count)
) %>%
arrange(desc(count))
exposure | oldest | newest | count |
---|---|---|---|
outdoor | 2015-02-18 16:53:41 | 2019-03-10 11:32:20 | 2930 |
indoor | 2015-02-08 17:36:40 | 2019-03-09 22:30:39 | 550 |
mobile | 2017-05-24 08:16:36 | 2019-03-09 20:51:23 | 153 |
unknown | 2014-05-28 15:36:14 | 2016-06-25 15:11:11 | 20 |
We can try to find out where the increases in growth came from, by analysing the box count by grouptag.
Caveats: Only a small subset of boxes has a grouptag, and we should assume that these groups are actually bigger. Also, we can see that grouptag naming is inconsistent (Luftdaten
, luftdaten.info
, …)
grouptag_counts = boxes %>%
group_by(grouptag) %>%
# only include grouptags with 8 or more members
filter(length(grouptag) >= 8 && !is.na(grouptag)) %>%
mutate(count = row_number(createdAt))
# helper for sorting the grouptags by boxcount
sortLvls = function(oldFactor, ascending = TRUE) {
lvls = table(oldFactor) %>% sort(., decreasing = !ascending) %>% names()
factor(oldFactor, levels = lvls)
}
grouptag_counts$grouptag = sortLvls(grouptag_counts$grouptag, ascending = FALSE)
ggplot(grouptag_counts, aes(x = createdAt, y = count, colour = grouptag)) +
geom_line(aes(group = grouptag)) +
xlab('Registration Date') + ylab('senseBox count')
grouptag_counts %>%
summarise(
oldest = min(createdAt),
newest = max(createdAt),
count = max(count)
) %>%
arrange(desc(count))
grouptag | oldest | newest | count |
---|---|---|---|
Luftdaten | 2017-03-14 17:01:16 | 2019-03-10 08:22:41 | 177 |
Futurium | 2018-11-23 17:21:50 | 2018-12-21 17:25:30 | 48 |
ifgi | 2016-06-17 08:04:54 | 2019-03-05 12:33:48 | 42 |
Bad_Hersfeld | 2017-07-18 13:32:03 | 2019-02-12 10:52:19 | 40 |
TKS Bonn | 2018-06-18 11:21:21 | 2018-12-10 16:33:10 | 37 |
Save Dnipro | 2018-12-24 10:47:45 | 2019-03-07 09:37:49 | 33 |
Luchtwachters Delft | 2018-03-09 21:39:11 | 2019-02-14 11:42:03 | 27 |
Feinstaub | 2017-04-08 06:38:25 | 2019-03-06 11:12:45 | 19 |
luftdaten.info | 2017-05-01 10:15:44 | 2019-02-25 10:48:07 | 19 |
Luftdaten.info | 2017-04-03 14:10:20 | 2019-02-14 14:53:17 | 15 |
MakeLight | 2015-02-18 16:53:41 | 2018-02-02 13:50:21 | 15 |
PGKN | 2018-03-22 16:44:00 | 2018-12-20 13:24:32 | 12 |
dwih-sp | 2016-08-09 08:06:02 | 2016-11-23 10:16:04 | 11 |
Che Aria Tira? | 2018-03-11 10:50:42 | 2018-03-11 23:11:20 | 10 |
IRESA | 2019-02-04 13:51:44 | 2019-02-14 12:37:41 | 10 |
Netlight | 2019-01-06 18:14:26 | 2019-03-08 17:46:33 | 10 |
Sofia | 2017-04-11 04:40:11 | 2018-06-07 11:00:54 | 10 |
IKG | 2017-03-21 19:02:11 | 2018-12-16 07:15:59 | 9 |
Raumanmeri | 2017-03-13 11:35:39 | 2017-04-27 05:36:20 | 9 |
esri-de | 2018-09-15 10:11:25 | 2018-11-27 13:56:07 | 9 |
luftdaten | 2017-04-28 06:33:07 | 2019-01-05 22:03:49 | 9 |
GIS-FH | 2018-11-02 13:17:01 | 2018-11-02 13:26:01 | 8 |
First we group the boxes by createdAt
into bins of one week:
bins = 'week'
mvavg_bins = 6
growth = boxes %>%
mutate(week = cut(as.Date(createdAt), breaks = bins)) %>%
group_by(week) %>%
summarize(count = length(week)) %>%
mutate(event = 'registered')
We can do the same for updatedAt
, which informs us about the last change to a box, including uploaded measurements. This method of determining inactive boxes is fairly inaccurate and should be considered an approximation, because we have no information about intermediate inactive phases. Also deleted boxes would probably have a big impact here.
inactive = boxes %>%
# remove boxes that were updated in the last two days,
# b/c any box becomes inactive at some point by definition of updatedAt
filter(updatedAt < now() - days(2)) %>%
mutate(week = cut(as.Date(updatedAt), breaks = bins)) %>%
group_by(week) %>%
summarize(count = length(week)) %>%
mutate(event = 'inactive')
Now we can combine both datasets for plotting:
boxes_by_date = bind_rows(growth, inactive) %>% group_by(event)
ggplot(boxes_by_date, aes(x = as.Date(week), colour = event)) +
xlab('Time') + ylab(paste('rate per ', bins)) +
scale_x_date(date_breaks="years", date_labels="%Y") +
scale_colour_manual(values = c(registered = 'lightgreen', inactive = 'grey')) +
geom_point(aes(y = count), size = 0.5) +
# moving average, make first and last value NA (to ensure identical length of vectors)
geom_line(aes(y = rollmean(count, mvavg_bins, fill = list(NA, NULL, NA))))
We see a sudden rise in early 2017, which lines up with the fast growing grouptag Luftdaten
. This was enabled by an integration of openSenseMap.org into the firmware of the air quality monitoring project luftdaten.info. The dips in mid 2017 and early 2018 could possibly be explained by production/delivery issues of the senseBox hardware, but I have no data on the exact time frames to verify.
While we are looking at createdAt
and updatedAt
, we can also extract the duration of activity of each box, and look at metrics by exposure and grouptag once more:
duration = boxes %>%
group_by(exposure) %>%
filter(!is.na(updatedAt)) %>%
mutate(duration = difftime(updatedAt, createdAt, units='days'))
ggplot(duration, aes(x = exposure, y = duration)) +
geom_boxplot() +
coord_flip() + ylab('Duration active in Days')
The time of activity averages at only 197 days, though there are boxes with 1716 days of activity, spanning a large chunk of openSenseMap’s existence.
duration = boxes %>%
group_by(grouptag) %>%
# only include grouptags with 8 or more members
filter(length(grouptag) >= 8 && !is.na(grouptag) && !is.na(updatedAt)) %>%
mutate(duration = difftime(updatedAt, createdAt, units='days'))
ggplot(duration, aes(x = grouptag, y = duration)) +
geom_boxplot() +
coord_flip() + ylab('Duration active in Days')
duration %>%
summarize(
duration_avg = round(mean(duration)),
duration_min = round(min(duration)),
duration_max = round(max(duration)),
oldest_box = round(max(difftime(now(), createdAt, units='days')))
) %>%
arrange(desc(duration_avg))
grouptag | duration_avg | duration_min | duration_max | oldest_box |
---|---|---|---|---|
dwih-sp | 828 days | 581 days | 943 days | 943 days |
Sofia | 450 days | 267 days | 652 days | 698 days |
IKG | 361 days | 0 days | 514 days | 719 days |
Che Aria Tira? | 348 days | 256 days | 364 days | 364 days |
luftdaten.info | 327 days | 13 days | 648 days | 678 days |
ifgi | 305 days | 0 days | 788 days | 996 days |
Luftdaten | 300 days | 0 days | 712 days | 726 days |
Feinstaub | 265 days | 0 days | 701 days | 701 days |
Luftdaten.info | 250 days | 24 days | 706 days | 706 days |
PGKN | 202 days | 5 days | 353 days | 353 days |
luftdaten | 195 days | 0 days | 680 days | 681 days |
Bad_Hersfeld | 169 days | 0 days | 595 days | 600 days |
Luchtwachters Delft | 152 days | 0 days | 366 days | 366 days |
esri-de | 108 days | 0 days | 176 days | 176 days |
TKS Bonn | 95 days | 0 days | 265 days | 265 days |
Futurium | 53 days | 0 days | 107 days | 107 days |
Raumanmeri | 45 days | 7 days | 318 days | 727 days |
Netlight | 19 days | 2 days | 63 days | 63 days |
Save Dnipro | 15 days | 0 days | 76 days | 76 days |
GIS-FH | 0 days | 0 days | 0 days | 128 days |
IRESA | 0 days | 0 days | 0 days | 34 days |
The time of activity averages at only 221 days, though there are boxes with 943 days of activity, spanning a large chunk of openSenseMap’s existence.
This is less useful, as older boxes are active for a longer time by definition. If you have an idea how to compensate for that, please send a Pull Request!
# NOTE: boxes older than 2016 missing due to missing updatedAt in database
duration = boxes %>%
mutate(year = cut(as.Date(createdAt), breaks = 'year')) %>%
group_by(year) %>%
filter(!is.na(updatedAt)) %>%
mutate(duration = difftime(updatedAt, createdAt, units='days'))
ggplot(duration, aes(x = substr(as.character(year), 0, 4), y = duration)) +
geom_boxplot() +
coord_flip() + ylab('Duration active in Days') + xlab('Year of Registration')
Other visualisations come to mind, and are left as an exercise to the reader. If you implemented some, feel free to add them to this vignette via a Pull Request.