Data Model

Jonathan Callahan

2022-10-31

This vignette explores the mts_monitor data model used throughout the AirMonitor package to store and work with monitoring data.

The AirMonitor package is designed to provide a compact, full-featured suite of utilities for working with PM2.5 data. A uniform data model provides consistent data access across monitoring data available from different agencies. The core data model in this package is defined by the mts_monitor object used to store data associated with groups of individual monitors.

To work efficiently with the package, it is important to understand the structure of this data object and the functions that operate on it. Package functions whose names begin with monitor_ expect objects of class mts_monitor as their first argument. (‘mts’ stands for ‘Multiple Time Series’.)

Data Model

The AirMonitor package uses the mts data model defined in the MazamaTimeSeries package.

In this data model, each unique time series is referred to as a “device-deployment” – a time series collected by a particular device at a specific location. Multiple device-deployments are stored in memory as an mts_monitor object, typically called monitor. Each monitor is just a list with two dataframes:

monitor$meta – rows = unique device-deployments; cols = device/location metadata

monitor$data – rows = UTC times; cols = device-deployment data (plus an additional datetime column)

A key feature of this data model is the use of the deviceDeploymentID as a “foreign key” that allows data columns to be mapped onto the associated spatial and device metadata in a meta row. The following will always be true:

identical(names(monitor$data), c('datetime', monitor$meta$deviceDeploymentID))
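The relationship between the two dataframes can be sketched in base R without loading the package. The IDs, site names, and values below are made up for illustration, and the object is a plain list rather than a real mts_monitor.

```r
# Toy stand-in for an mts_monitor object (all IDs and values are invented)
meta <- data.frame(
  deviceDeploymentID = c("loc1_devA", "loc2_devB"),
  locationName = c("Site One", "Site Two"),
  longitude = c(-120.1, -117.4),
  latitude = c(48.4, 47.7),
  stringsAsFactors = FALSE
)

datetime <- seq(
  as.POSIXct("2015-08-01 00:00:00", tz = "UTC"),
  by = "hour",
  length.out = 3
)

data <- data.frame(
  datetime = datetime,
  loc1_devA = c(10.2, 11.5, NA),
  loc2_devB = c(5.1, 4.8, 6.0)
)

monitor <- list(meta = meta, data = data)

# The column names of 'data' line up with the rows of 'meta'
invariant_holds <- identical(
  names(monitor$data),
  c("datetime", monitor$meta$deviceDeploymentID)
)

# Use a data column name as a key to look up its metadata row
id <- names(monitor$data)[3]
site <- monitor$meta[monitor$meta$deviceDeploymentID == id, ]

print(invariant_holds)
print(site$locationName)
```

Because the column order in data matches the row order in meta, a column name is all that is needed to find the location and device information for a time series.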

Each column of monitor$data represents a time series associated with a particular device-deployment while each row of monitor$data represents a synoptic snapshot of all measurements made at a particular time.

In this manner, software can create both time series plots and maps from a single monitor object in memory.
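As a rough illustration of these two views, the base R sketch below pulls a single column for a time series plot and a single row, joined to coordinates, for a map. The toy IDs, coordinates, and values are assumptions made for this example.

```r
# Toy 'meta' and 'data' dataframes (all values are invented)
meta <- data.frame(
  deviceDeploymentID = c("loc1_devA", "loc2_devB"),
  longitude = c(-120.1, -117.4),
  latitude = c(48.4, 47.7),
  stringsAsFactors = FALSE
)

data <- data.frame(
  datetime = seq(
    as.POSIXct("2015-08-01 00:00:00", tz = "UTC"),
    by = "hour",
    length.out = 3
  ),
  loc1_devA = c(10.2, 11.5, 12.0),
  loc2_devB = c(5.1, 4.8, 6.0)
)

# Time series view: one column, all times
series <- data[, c("datetime", "loc1_devA")]

# Map view: one row, all device-deployments, joined to their coordinates
row_index <- 2
snapshot <- data.frame(
  deviceDeploymentID = meta$deviceDeploymentID,
  longitude = meta$longitude,
  latitude = meta$latitude,
  value = as.numeric(unlist(data[row_index, meta$deviceDeploymentID])),
  stringsAsFactors = FALSE
)

print(series)
print(snapshot)
```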

The data dataframe contains all hourly measurements organized with rows (the ‘unlimited’ dimension) as unique timesteps and columns as unique device-deployments. The very first column is always named datetime and contains the POSIXct datetime in Coordinated Universal Time (UTC). This time axis is guaranteed to be a regular hourly axis with no gaps.
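One way to picture a regular hourly axis is to place irregular measurements onto a complete sequence of hours, leaving NA wherever an hour has no measurement. This is a base R sketch with invented values, not the package's own ingest code.

```r
# Raw measurements with one missing hour (02:00 UTC is absent)
raw <- data.frame(
  datetime = as.POSIXct(
    c("2015-08-01 00:00:00", "2015-08-01 01:00:00", "2015-08-01 03:00:00"),
    tz = "UTC"
  ),
  value = c(10, 12, 9)
)

# Build the complete hourly axis spanning the raw data
datetime <- seq(min(raw$datetime), max(raw$datetime), by = "hour")

# Place each measurement on the axis; hours without data become NA
regular <- data.frame(
  datetime = datetime,
  value = raw$value[match(datetime, raw$datetime)]
)

# Every step on the axis is exactly 3600 seconds
is_regular <- all(diff(as.numeric(regular$datetime)) == 3600)

print(regular)
print(is_regular)
```

A check like is_regular is a quick way to confirm that a hand-built data dataframe still has a gap-free hourly time axis.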

The meta dataframe contains all metadata associated with device-deployments and is organized with rows as unique device-deployments and columns containing both location and device metadata. The following columns are guaranteed to exist in the meta dataframe. Those marked with “(optional)” may contain NAs. Additional columns may also be present depending on the data source.

Example 1: Exploring mts_monitor objects

We will use the built-in “NW_Megafires” dataset and various monitor_filter~() functions to subset an mts_monitor object, which we then examine.

library(AirMonitor)

# Recipe to select Washington state monitors in August of 2015:
monitor <-
  
  # 1) start with NW Megafires
  NW_Megafires %>%
  
  # 2) filter to only include Washington state
  monitor_filter(stateCode == "WA") %>%
  
  # 3) filter to only include August
  monitor_filterDate(20150801, 20150901) %>%
  
  # 4) remove monitors with all missing values
  monitor_dropEmpty()

# 'mts_monitor' objects can be identified by their class
class(monitor)
## [1] "mts_monitor" "mts"         "list"
# They always have two elements called 'meta' and 'data'
names(monitor)
## [1] "meta" "data"
# Examine the 'meta' dataframe
dim(monitor$meta)
## [1] 67 82
names(monitor$meta)
##  [1] "deviceDeploymentID"    "deviceID"              "deviceType"           
##  [4] "deviceDescription"     "deviceExtra"           "pollutant"            
##  [7] "units"                 "dataIngestSource"      "dataIngestURL"        
## [10] "dataIngestUnitID"      "dataIngestExtra"       "dataIngestDescription"
## [13] "locationID"            "locationName"          "longitude"            
## [16] "latitude"              "elevation"             "countryCode"          
## [19] "stateCode"             "countyName"            "timezone"             
## [22] "houseNumber"           "street"                "city"                 
## [25] "zip"                   "AQSID"                 "fullAQSID"            
## [28] "airnow_stationID"      "airnow_parameterName"  "airnow_monitorType"   
## [31] "airnow_siteCode"       "airnow_status"         "airnow_agencyID"      
## [34] "airnow_agencyName"     "airnow_EPARegion"      "airnow_GMTOffsetHours"
## [37] "airnow_CBSA_ID"        "airnow_CBSA_Name"      "airnow_stateAQSCode"  
## [40] "airnow_countyAQSCode"  "airnow_MSAName"        "address"              
## [43] "wrcc_type"             "wrcc_serialNumber"     "wrcc_monitorName"     
## [46] "wrcc_monitorType"      "deploymentType"        "airnow_countryCode"   
## [49] "airnow_stateCode"      "airnow_timezone"       "airnow_houseNumber"   
## [52] "airnow_street"         "airnow_city"           "airnow_zip"           
## [55] "airsis_Alias"          "airsis_dataFormat"     "airsis_provider"      
## [58] "airsis_unitID"         "aqs_address"           "siteEstablishedDate"  
## [61] "siteClosedDate"        "GMTOffset"             "owningAgency"         
## [64] "cityName"              "CBSAName"              "tribeName"            
## [67] "parameterCode"         "parameterName"         "POC"                  
## [70] "firstYearOfData"       "lastSampleDate"        "monitorType"          
## [73] "reportingAgency"       "PQAO"                  "collectingAgency"     
## [76] "exclusions"            "monitoringObjective"   "lastMethodCode"       
## [79] "lastMethod"            "measurementScale"      "NAAQSPrimaryMonitor"  
## [82] "QAPrimaryMonitor"
# Examine the 'data' dataframe
dim(monitor$data)
## [1] 744  68
# This should always be true
identical(names(monitor$data), c('datetime', monitor$meta$deviceDeploymentID))
## [1] TRUE

Example 2: Basic manipulation of mts_monitor objects

The AirMonitor package has numerous functions that work with mts_monitor objects, all of which begin with monitor_. If you need to do something that the package functions do not provide, you can manipulate mts_monitor objects directly as long as you retain the structure of the data model.
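As a minimal sketch of what "retaining the structure" means, the hypothetical helper below (subset_by_id() is not an AirMonitor function) subsets meta and data together so that the data column names still match the meta rows. It uses a toy list with invented IDs rather than a real mts_monitor object.

```r
# Toy monitor-like list with three device-deployments (invented values)
monitor <- list(
  meta = data.frame(
    deviceDeploymentID = c("loc1_devA", "loc2_devB", "loc3_devC"),
    stateCode = c("WA", "ID", "WA"),
    stringsAsFactors = FALSE
  ),
  data = data.frame(
    datetime = seq(
      as.POSIXct("2015-08-01 00:00:00", tz = "UTC"),
      by = "hour",
      length.out = 2
    ),
    loc1_devA = c(10, 11),
    loc2_devB = c(20, 21),
    loc3_devC = c(30, 31)
  )
)

# Hypothetical helper: keep only the requested device-deployments,
# updating 'meta' rows and 'data' columns in the same order
subset_by_id <- function(monitor, ids) {
  keep <- monitor$meta$deviceDeploymentID %in% ids
  kept_ids <- monitor$meta$deviceDeploymentID[keep]
  monitor$meta <- monitor$meta[keep, , drop = FALSE]
  monitor$data <- monitor$data[, c("datetime", kept_ids), drop = FALSE]
  monitor
}

wa_ids <- monitor$meta$deviceDeploymentID[monitor$meta$stateCode == "WA"]
small <- subset_by_id(monitor, wa_ids)

# The data model invariant still holds after manual manipulation
still_valid <- identical(
  names(small$data),
  c("datetime", small$meta$deviceDeploymentID)
)
print(still_valid)
```

The important design point is that any change to the set of device-deployments is applied to both dataframes at once, never to only one of them.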

Functions that accept and return mts_monitor objects, such as monitor_select(), monitor_collapse(), monitor_filterDate(), monitor_dailyStatistic() and monitor_mutate(), can be chained with the magrittr package pipe operator (%>%) as in the following example:

# First, obtain the monitor IDs by clicking on dots in the interactive map:
NW_Megafires %>% monitor_leaflet()
# Calculate daily means for the Methow Valley from monitors in Twisp and Winthrop

TwispID <- "99a6ee8e126ff8cf_530470009_04"
WinthropID <- "123035bbdc2bc702_530470010_04"

# Recipe to calculate Methow Valley August Means:
Methow_Valley_AugustMeans <- 
  
  # 1) start with NW Megafires
  NW_Megafires %>%
  
  # 2) select monitors from Twisp and Winthrop
  monitor_select(c(TwispID, WinthropID)) %>%
  
  # 3) average them together hour-by-hour
  monitor_collapse(deviceID = 'MethowValley') %>%
  
  # 4) restrict data to August
  monitor_filterDate(20150801, 20150901) %>%
  
  # 5) calculate daily mean
  monitor_dailyStatistic(mean, minHours = 18) %>%
  
  # 6) round data to one decimal place
  monitor_mutate(round, 1)

# Look at the first week
Methow_Valley_AugustMeans$data[1:7,]
##     datetime c2de3yc0jc_MethowValley
## 1 2015-08-01                    20.3
## 2 2015-08-02                    30.7
## 3 2015-08-03                    12.1
## 4 2015-08-04                     9.0
## 5 2015-08-05                     3.7
## 6 2015-08-06                     3.2
## 7 2015-08-07                    11.0
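To make the arithmetic behind steps 3 and 5 of the recipe concrete, here is a base R sketch of an hour-by-hour average followed by a daily mean with a minimum-hours requirement. It uses invented data, and it cuts days in UTC purely to stay short; the package functions may define day boundaries differently (for instance in local time), so treat this as an illustration of the idea rather than a reimplementation.

```r
# 48 hours of made-up hourly values for two nearby monitors
datetime <- seq(
  as.POSIXct("2015-08-01 00:00:00", tz = "UTC"),
  by = "hour",
  length.out = 48
)
a <- rep(c(10, 20), each = 24)
b <- rep(c(30, 40), each = 24)

# Remove 10 hours on day two from both monitors (only 14 valid hours remain)
a[25:34] <- NA
b[25:34] <- NA

# Hour-by-hour average across monitors (NaN where both are missing)
collapsed <- rowMeans(cbind(a, b), na.rm = TRUE)

# Daily mean, reported only when enough valid hours exist
day <- format(datetime, "%Y-%m-%d", tz = "UTC")
min_hours <- 18
daily_mean <- tapply(collapsed, day, function(x) {
  if (sum(!is.na(x)) < min_hours) NA_real_ else mean(x, na.rm = TRUE)
})

print(daily_mean)
```

Day one has all 24 hours and yields a mean of 20, while day two has only 14 valid hours and is reported as NA.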

Example 3: Advanced manipulation of mts_monitor objects

The following code demonstrates how to write a custom function and apply it to the data tibble of an mts_monitor object with monitor_mutate().

# Monitors within 100 km of Spokane, WA
Spokane <-
  NW_Megafires %>%
  monitor_filterByDistance(-117.42, 47.70, 100000) %>%
  monitor_filterDate(20150801, 20150901) %>%
  monitor_dropEmpty()

# Show the daily statistic for one week
Spokane %>% 
  monitor_filterDate(20150801, 20150808) %>%
  monitor_dailyStatistic(mean) %>%
  monitor_getData()
## # A tibble: 7 × 11
##   datetime            `70de0a70970655a0_530630047_04` a79f97f86cb2a7d7_1605500…¹
##   <dttm>                                        <dbl>                      <dbl>
## 1 2015-08-01 00:00:00                           18.2                        9.83
## 2 2015-08-02 00:00:00                           47.1                       31.4 
## 3 2015-08-03 00:00:00                           37.1                       33.7 
## 4 2015-08-04 00:00:00                            7.31                       9.70
## 5 2015-08-05 00:00:00                            5.82                       9.25
## 6 2015-08-06 00:00:00                            3.74                       7.46
## 7 2015-08-07 00:00:00                            4.50                       5.79
## # ℹ abbreviated name: ¹​a79f97f86cb2a7d7_160550003_03
## # ℹ 8 more variables: `7216002af320d683_530650002_04` <dbl>,
## #   `8c0517d4b648fe54_530750006_04` <dbl>,
## #   `345833eaf05eac18_160090011_03` <dbl>, `9b8e3d84ace997b6_wrcc.e925` <lgl>,
## #   `7eb0c7f361adfacb_160090010_03` <dbl>, b31e89974a3db049_160790017_04 <dbl>,
## #   e7dee084705d75eb_160170003_03 <dbl>, c891e750c3fc35ef_530010003_04 <dbl>
# Custom function to convert from metric ug/m3 to imperial grain/gallon 
my_FUN <- function(x) { return( x * 15.43236 / 0.004546 ) }
Spokane %>% 
  monitor_filterDate(20150801, 20150808) %>%
  monitor_mutate(my_FUN) %>%
  monitor_dailyStatistic(mean) %>%
  monitor_getData()
## # A tibble: 7 × 11
##   datetime            `70de0a70970655a0_530630047_04` a79f97f86cb2a7d7_1605500…¹
##   <dttm>                                        <dbl>                      <dbl>
## 1 2015-08-01 00:00:00                          61812.                     33381.
## 2 2015-08-02 00:00:00                         159990.                    106509.
## 3 2015-08-03 00:00:00                         125944.                    114289.
## 4 2015-08-04 00:00:00                          24824.                     32914.
## 5 2015-08-05 00:00:00                          19746.                     31401.
## 6 2015-08-06 00:00:00                          12688.                     25319.
## 7 2015-08-07 00:00:00                          15262.                     19661.
## # ℹ abbreviated name: ¹​a79f97f86cb2a7d7_160550003_03
## # ℹ 8 more variables: `7216002af320d683_530650002_04` <dbl>,
## #   `8c0517d4b648fe54_530750006_04` <dbl>,
## #   `345833eaf05eac18_160090011_03` <dbl>, `9b8e3d84ace997b6_wrcc.e925` <lgl>,
## #   `7eb0c7f361adfacb_160090010_03` <dbl>, b31e89974a3db049_160790017_04 <dbl>,
## #   e7dee084705d75eb_160170003_03 <dbl>, c891e750c3fc35ef_530010003_04 <dbl>
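Conceptually, a mutate step of this kind applies a function to every measurement column while leaving the datetime column untouched. The base R sketch below shows that idea on a toy dataframe; apply_to_values() is a hypothetical helper written for this example and is not how monitor_mutate() is necessarily implemented.

```r
# Toy 'data' dataframe (invented values)
data <- data.frame(
  datetime = seq(
    as.POSIXct("2015-08-01 00:00:00", tz = "UTC"),
    by = "hour",
    length.out = 2
  ),
  loc1_devA = c(10.24, 11.56),
  loc2_devB = c(NA, 4.81)
)

# Hypothetical helper: apply FUN to every column except 'datetime'
apply_to_values <- function(data, FUN, ...) {
  value_cols <- setdiff(names(data), "datetime")
  data[value_cols] <- lapply(data[value_cols], FUN, ...)
  data
}

# Extra arguments are passed through to FUN, here the number of digits
rounded <- apply_to_values(data, round, 1)
print(rounded)
```

Any function that takes a numeric vector and returns a vector of the same length fits this pattern, including a user-defined unit conversion.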

Understanding that monitor$data is just a dataframe of measurements prepended with a datetime column, we can pull out the measurements and do analyses independent of the mts_monitor data model. Here we look for correlations among the PM2.5 time series.

# Pull out the time series data to calculate correlations
Spokane_data <- 
  Spokane %>%
  monitor_getData() %>%
  dplyr::select(-1) # omit 'datetime' column

# Provide human readable names
names(Spokane_data) <- Spokane$meta$locationName

# Find correlation among monitors
cor(Spokane_data, use = "complete.obs")
##                                              Spokane - Monroe St
## Spokane - Monroe St                                    1.0000000
## Coeur D'alene - Lancaster Rd.                          0.6069221
## Spokane - Wellpinit  Ford Rd (Spokane Tribe)           0.5697110
## Rosalia - Josephine St                                 0.5230965
## us.16_345833                                           0.5377632
## Mobile_White_Salmon                                    0.4789670
## St. Maries                                             0.5269798
## Pinehurst                                              0.3571274
## Sandpoint                                              0.6250559
## Ritzville - Alder St                                   0.2926409
##                                              Coeur D'alene - Lancaster Rd.
## Spokane - Monroe St                                              0.6069221
## Coeur D'alene - Lancaster Rd.                                    1.0000000
## Spokane - Wellpinit  Ford Rd (Spokane Tribe)                     0.7166013
## Rosalia - Josephine St                                           0.6398958
## us.16_345833                                                     0.6977911
## Mobile_White_Salmon                                              0.5974232
## St. Maries                                                       0.8067682
## Pinehurst                                                        0.7513249
## Sandpoint                                                        0.7680706
## Ritzville - Alder St                                             0.3193411
##                                              Spokane - Wellpinit  Ford Rd (Spokane Tribe)
## Spokane - Monroe St                                                             0.5697110
## Coeur D'alene - Lancaster Rd.                                                   0.7166013
## Spokane - Wellpinit  Ford Rd (Spokane Tribe)                                    1.0000000
## Rosalia - Josephine St                                                          0.5669491
## us.16_345833                                                                    0.6838587
## Mobile_White_Salmon                                                             0.5944653
## St. Maries                                                                      0.6516175
## Pinehurst                                                                       0.5916583
## Sandpoint                                                                       0.7859552
## Ritzville - Alder St                                                            0.3089826
##                                              Rosalia - Josephine St
## Spokane - Monroe St                                       0.5230965
## Coeur D'alene - Lancaster Rd.                             0.6398958
## Spokane - Wellpinit  Ford Rd (Spokane Tribe)              0.5669491
## Rosalia - Josephine St                                    1.0000000
## us.16_345833                                              0.8296218
## Mobile_White_Salmon                                       0.2921147
## St. Maries                                                0.6990201
## Pinehurst                                                 0.5092533
## Sandpoint                                                 0.5076184
## Ritzville - Alder St                                      0.7953586
##                                              us.16_345833 Mobile_White_Salmon
## Spokane - Monroe St                             0.5377632           0.4789670
## Coeur D'alene - Lancaster Rd.                   0.6977911           0.5974232
## Spokane - Wellpinit  Ford Rd (Spokane Tribe)    0.6838587           0.5944653
## Rosalia - Josephine St                          0.8296218           0.2921147
## us.16_345833                                    1.0000000           0.4046114
## Mobile_White_Salmon                             0.4046114           1.0000000
## St. Maries                                      0.7906070           0.5136138
## Pinehurst                                       0.7222093           0.5176399
## Sandpoint                                       0.5780285           0.7119107
## Ritzville - Alder St                            0.5874713           0.1093694
##                                              St. Maries Pinehurst Sandpoint
## Spokane - Monroe St                           0.5269798 0.3571274 0.6250559
## Coeur D'alene - Lancaster Rd.                 0.8067682 0.7513249 0.7680706
## Spokane - Wellpinit  Ford Rd (Spokane Tribe)  0.6516175 0.5916583 0.7859552
## Rosalia - Josephine St                        0.6990201 0.5092533 0.5076184
## us.16_345833                                  0.7906070 0.7222093 0.5780285
## Mobile_White_Salmon                           0.5136138 0.5176399 0.7119107
## St. Maries                                    1.0000000 0.7679051 0.6232813
## Pinehurst                                     0.7679051 1.0000000 0.5626732
## Sandpoint                                     0.6232813 0.5626732 1.0000000
## Ritzville - Alder St                          0.3896507 0.2514426 0.2851257
##                                              Ritzville - Alder St
## Spokane - Monroe St                                     0.2926409
## Coeur D'alene - Lancaster Rd.                           0.3193411
## Spokane - Wellpinit  Ford Rd (Spokane Tribe)            0.3089826
## Rosalia - Josephine St                                  0.7953586
## us.16_345833                                            0.5874713
## Mobile_White_Salmon                                     0.1093694
## St. Maries                                              0.3896507
## Pinehurst                                               0.2514426
## Sandpoint                                               0.2851257
## Ritzville - Alder St                                    1.0000000
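A large correlation matrix like the one above can be hard to scan by eye. As one possible follow-up, the base R sketch below finds the most strongly correlated pair of columns in a small matrix built from invented data; the column names A, B and C are placeholders.

```r
# Made-up time series for three monitors
x <- c(1, 2, 3, 4, 5, 6)
values <- data.frame(
  A = x,
  B = x * 2 + c(0.1, -0.1, 0.1, -0.1, 0.1, -0.1),
  C = c(3, 1, 4, 1, 5, 9)
)

cor_matrix <- cor(values, use = "complete.obs")

# Blank out the diagonal and the duplicated upper triangle
cor_matrix[upper.tri(cor_matrix, diag = TRUE)] <- NA

# Locate the largest remaining correlation
idx <- which(cor_matrix == max(cor_matrix, na.rm = TRUE), arr.ind = TRUE)
best_pair <- c(
  rownames(cor_matrix)[idx[1, "row"]],
  colnames(cor_matrix)[idx[1, "col"]]
)

print(best_pair)
```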

This introduction to the mts_monitor data model should be enough to get you started. Many more examples are available in the package documentation.

Best of luck exploring and understanding PM2.5 air quality data!