The tidyverse has had an enormous impact on the use of R by dictating a strict approach to variables and observations. This very general insight can be used for any form of data but when it comes to large data obviously we can’t store everything in one table.

There is a tension between the tidyverse and scientific array data that comes down to data storage, and the intermediate forms used to get between one form and another.

The tidync package provides a compromise position, allowing efficient array slicing by dimension coordinate or index, and delaying any data-reading until the output format is chosen. In particular tidync exists in order to reduce the amount of plumbing code required to get to the data, and allows an interactive way to convert between coordinate spaces and index spaces.

Following along with the code below requires all of the following packages.

install.packages(c("tidync", "maps", "stars", "ggplot2", "devtools", 
                   "RNetCDF", "raster", "dplyr", "tidyr", "ncdfgeom"))
devtools::install_github("hypertidy/ncmeta")

NetCDF

NetCDF is a very widely used file format for storing array-based data as variables. The space occupied by a variable is defined by its dimensions and their metadata. Dimensions are by definition one-dimensional (i.e. an atomic vector in R of length 1 or more): each is an array with coordinate metadata, units, type and interpretation. The space of a variable is defined as one or more of the dimensions in the file. A given variable won’t necessarily use all the available dimensions and no dimensions are mandatory or particularly special.
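
As a rough illustration of how dimensions define the space of a variable, here is a minimal base R sketch. The lists `dims` and `var_dims` are hypothetical stand-ins for what a file stores, not part of any NetCDF API, and the values are made up for the example.

```r
# Toy stand-in for a NetCDF file: dimensions are named 1D coordinate vectors
dims <- list(
  lon  = seq(0, 358, by = 2),
  lat  = seq(-89, 89, by = 2),
  zlev = 0,
  time = 1460
)

# Each variable names the dimensions that define its space;
# a variable need not use every dimension in the file
var_dims <- list(
  sst  = c("lon", "lat", "zlev", "time"),
  lon  = "lon",
  time = "time"
)

# The shape of a variable follows from the lengths of its dimensions
var_shape <- lapply(var_dims, function(d) lengths(dims[d]))
var_shape$sst
```

The point is only that a variable's shape is never stored on its own: it is always derived from the dimensions the variable refers to.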

NetCDF is very general, but it’s quite common to see subcultures that rally around the way their particular domain’s data are used and stored without encompassing very many other valid ways of using NetCDF. Tidync really tries to be as general as possible, sacrificing high level interpretations for lower-level control.

Tidync limitations

There are some limitations specific to the tidync R package that are unrelated to the capabilities of the latest NetCDF library.

No groups: a group can be specified by providing the group-within-a-source as a source.

No compound types.

No attribute metadata: coordinates of 1D axes are stored as transform tables, but coordinates of pairs (or higher sets) of axes are not explicitly linked to their array data.

Curvilinear coordinates are not automatically expanded, because they exist (usually) on a different grid to the active one.

Different kinds of NetCDF

NetCDF can be used to store raster data, and very commonly scientific data is provided as a global grid. Here we use a snapshot of global ocean surface temperature generated by blending remote sensing, local observations and physical model output.

There’s an example file “reduced.nc” in the stars package, derived from the daily [OISSTV2 product](https://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html).

We will explore the description of this source in detail to give an introduction to the tidync summary.

oisstfile <- system.file("nc/reduced.nc", package = "stars")

To connect to this file we use tidync().

oisst <- tidync(oisstfile)

NB: this is not a real connection, like that used by ncdf4 or RNetCDF - tidync functions always open the file in read-only mode, extract information and/or data, and then close the open file connection.

To see the available data in the file, print a summary of the source.

print(oisst)
## 
## Data Source (1): reduced.nc ...
## 
## Grids (5) <dimension family> : <associated variables> 
## 
## [1]   D0,D1,D2,D3 : sst, anom, err, ice    **ACTIVE GRID** ( 16200  values per variable)
## [2]   D0          : lon
## [3]   D1          : lat
## [4]   D2          : zlev
## [5]   D3          : time
## 
## Dimensions 4 (all active): 
##   
##   dim   name  length   min   max start count  dmin  dmax unlim coord_dim 
##   <chr> <chr>  <dbl> <dbl> <dbl> <int> <int> <dbl> <dbl> <lgl> <lgl>     
## 1 D0    lon      180     0   358     1   180     0   358 FALSE TRUE      
## 2 D1    lat       90   -89    89     1    90   -89    89 FALSE TRUE      
## 3 D2    zlev       1     0     0     1     1     0     0 FALSE TRUE      
## 4 D3    time       1  1460  1460     1     1  1460  1460 TRUE  TRUE

There are three kinds of information in this summary: the data source, the grids, and the dimensions.

There is only one Grid available for multidimensional data, which is the first one “D0,D1,D2,D3” - all other Grids are one-dimensional. The 4D grid has four variables sst, anom, ice, and err and each 1D grid has a single variable.

NB: the 1D grids have a corresponding named dimension and named variable, making these “coordinate dimensions” (see coord_dim in the Dimensions table). It’s not necessarily true that a 1D grid will have a single 1D variable: it may have more than one variable, or it may only have an “index variable”, i.e. no data values, only the position 1:length(dimension).

Each dimension’s label, name, length, min and max value are seen in the Dimensions table and these values can never change. Also see the flags unlim (is this an unlimited dimension?) and coord_dim.

The other Dimensions columns start, count, dmin, dmax apply when we slice into data variables with hyper_filter().

Relationship to ncmeta

Tidync relies on the package ncmeta to extract information about NetCDF sources. There are functions to find available Grids, Dimensions and Attributes in ncmeta.

Each grid has a name, size (ndims), and set of variables. Each grid is listed only once, which is an important pattern for each kind of entity when we are programming; the same applies to variables. See that there are 5 grids and 8 variables, with a row for each.
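
To make the idea of a grid concrete, the following base R sketch groups variables by the set of dimensions they use. The `var_dims` listing is typed in by hand from the summary printed above, and the grouping is only an illustration of the concept, not how ncmeta is implemented.

```r
# Hand-typed variable-to-dimension listing (dimension ids as in the summary above)
var_dims <- list(
  lon  = "D0", lat = "D1", zlev = "D2", time = "D3",
  sst  = c("D0", "D1", "D2", "D3"),
  anom = c("D0", "D1", "D2", "D3"),
  err  = c("D0", "D1", "D2", "D3"),
  ice  = c("D0", "D1", "D2", "D3")
)

# A grid is identified by its comma-separated dimension family
family <- vapply(var_dims, paste, character(1), collapse = ",")

# One entry per distinct grid, each holding the variables that share it
grids <- split(names(family), family)
grids
```

Because each grid appears exactly once, looking up "which variables live together" becomes a simple list lookup by grid name.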

PLEASE NOTE: some of the code that follows requires the GitHub version of ncmeta to be installed. This can be done with

devtools::install_github("hypertidy/ncmeta")
ncmeta::nc_grids(oisstfile)
## # A tibble: 5 x 4
##   grid        ndims variables        nvars
##   <chr>       <int> <list>           <int>
## 1 D0,D1,D2,D3     4 <tibble [4 × 1]>     4
## 2 D0              1 <tibble [1 × 1]>     1
## 3 D1              1 <tibble [1 × 1]>     1
## 4 D2              1 <tibble [1 × 1]>     1
## 5 D3              1 <tibble [1 × 1]>     1
ncmeta::nc_vars(oisstfile)
## # A tibble: 8 x 5
##      id name  type     ndims natts
##   <dbl> <chr> <chr>    <dbl> <dbl>
## 1     0 lon   NC_FLOAT     1     4
## 2     1 lat   NC_FLOAT     1     4
## 3     2 zlev  NC_FLOAT     1     4
## 4     3 time  NC_FLOAT     1     5
## 5     4 sst   NC_SHORT     4     6
## 6     5 anom  NC_SHORT     4     6
## 7     6 err   NC_SHORT     4     6
## 8     7 ice   NC_SHORT     4     6

Some grids have more than one variable, so they are nested in the grid rows - use unnest to see all variables with their parent grid.

ncmeta::nc_grids(oisstfile) %>% tidyr::unnest()
## # A tibble: 8 x 4
##   grid        ndims nvars variable
##   <chr>       <int> <int> <chr>   
## 1 D0,D1,D2,D3     4     4 sst     
## 2 D0,D1,D2,D3     4     4 anom    
## 3 D0,D1,D2,D3     4     4 err     
## 4 D0,D1,D2,D3     4     4 ice     
## 5 D0              1     1 lon     
## 6 D1              1     1 lat     
## 7 D2              1     1 zlev    
## 8 D3              1     1 time

Similar functions exist for dimensions and attributes.

ncmeta::nc_dims(oisstfile)
## # A tibble: 4 x 4
##      id name  length unlim
##   <dbl> <chr>  <dbl> <lgl>
## 1     0 lon      180 FALSE
## 2     1 lat       90 FALSE
## 3     2 zlev       1 FALSE
## 4     3 time       1 TRUE
ncmeta::nc_atts(oisstfile)
## # A tibble: 50 x 4
##       id name          variable value    
##    <dbl> <chr>         <chr>    <list>   
##  1     0 standard_name lon      <chr [1]>
##  2     1 long_name     lon      <chr [1]>
##  3     2 units         lon      <chr [1]>
##  4     3 axis          lon      <chr [1]>
##  5     0 standard_name lat      <chr [1]>
##  6     1 long_name     lat      <chr [1]>
##  7     2 units         lat      <chr [1]>
##  8     3 axis          lat      <chr [1]>
##  9     0 long_name     zlev     <chr [1]>
## 10     1 units         zlev     <chr [1]>
## # … with 40 more rows

There are corresponding functions to find out more about individual variables, dimensions and attributes by name or by internal index.

ncmeta::nc_var(oisstfile, "anom")
## # A tibble: 1 x 5
##      id name  type     ndims natts
##   <dbl> <chr> <chr>    <dbl> <dbl>
## 1     5 anom  NC_SHORT     4     6
ncmeta::nc_var(oisstfile, 5)
## # A tibble: 1 x 5
##      id name  type     ndims natts
##   <dbl> <chr> <chr>    <dbl> <dbl>
## 1     5 anom  NC_SHORT     4     6
ncmeta::nc_dim(oisstfile, "lon")
## # A tibble: 1 x 4
##      id name  length unlim
##   <dbl> <chr>  <dbl> <lgl>
## 1     0 lon      180 FALSE
ncmeta::nc_dim(oisstfile, 0)
## # A tibble: 1 x 4
##      id name  length unlim
##   <dbl> <chr>  <dbl> <lgl>
## 1     0 lon      180 FALSE
ncmeta::nc_atts(oisstfile)
## # A tibble: 50 x 4
##       id name          variable value    
##    <dbl> <chr>         <chr>    <list>   
##  1     0 standard_name lon      <chr [1]>
##  2     1 long_name     lon      <chr [1]>
##  3     2 units         lon      <chr [1]>
##  4     3 axis          lon      <chr [1]>
##  5     0 standard_name lat      <chr [1]>
##  6     1 long_name     lat      <chr [1]>
##  7     2 units         lat      <chr [1]>
##  8     3 axis          lat      <chr [1]>
##  9     0 long_name     zlev     <chr [1]>
## 10     1 units         zlev     <chr [1]>
## # … with 40 more rows
ncmeta::nc_atts(oisstfile, "zlev")
## # A tibble: 4 x 4
##      id name         variable value    
##   <dbl> <chr>        <chr>    <list>   
## 1     0 long_name    zlev     <chr [1]>
## 2     1 units        zlev     <chr [1]>
## 3     2 axis         zlev     <chr [1]>
## 4     3 actual_range zlev     <chr [1]>

And we can find the internal metadata for each variable by expanding the value.

ncmeta::nc_atts(oisstfile, "time") %>% tidyr::unnest()
## # A tibble: 5 x 4
##      id name          variable value                         
##   <dbl> <chr>         <chr>    <chr>                         
## 1     0 standard_name time     time                          
## 2     1 long_name     time     Center time of the day        
## 3     2 units         time     days since 1978-01-01 00:00:00
## 4     3 calendar      time     standard                      
## 5     4 axis          time     T

With this information we may now apply the right interpretation to the time values

tunit <- ncmeta::nc_atts(oisstfile, "time") %>% tidyr::unnest() %>% dplyr::filter(name == "units")
RNetCDF::utcal.nc(tunit$value, 1460)
##      year month day hour minute second
## [1,] 1981    12  31    0      0      0
## alternatively we can do this by hand
as.POSIXct("1978-01-01 00:00:00", tz = "UTC") + 1460 * 24 * 3600
## [1] "1981-12-31 UTC"

and check that other independent systems provide the same information.

raster::brick(oisstfile, varname = "anom")
## class      : RasterBrick 
## dimensions : 90, 180, 16200, 1  (nrow, ncol, ncell, nlayers)
## resolution : 2, 2  (x, y)
## extent     : -1, 359, -90, 90  (xmin, xmax, ymin, ymax)
## crs        : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0 
## source     : /usr/local/lib/R/site-library/stars/nc/reduced.nc 
## names      : X1981.12.31 
## Date       : 1981-12-31 
## varname    : anom 
## level      : 1
stars::read_stars(oisstfile)
## sst, anom, err, ice,
## stars object with 4 dimensions and 4 attributes
## attribute(s):
##     sst [°C]       anom [°C]          err [°C]     ice [percent]  
##  Min.   :-1.80   Min.   :-10.160   Min.   :0.110   Min.   :0.010  
##  1st Qu.:-0.03   1st Qu.: -0.580   1st Qu.:0.160   1st Qu.:0.470  
##  Median :13.65   Median : -0.080   Median :0.270   Median :0.920  
##  Mean   :12.99   Mean   : -0.186   Mean   :0.263   Mean   :0.718  
##  3rd Qu.:24.81   3rd Qu.:  0.210   3rd Qu.:0.320   3rd Qu.:0.960  
##  Max.   :32.97   Max.   :  2.990   Max.   :0.840   Max.   :1.000  
##  NA's   :4448    NA's   :4448      NA's   :4448    NA's   :13266  
## dimension(s):
##      from  to         offset delta  refsys point values    
## x       1 180             -1     2      NA    NA   NULL [x]
## y       1  90             90    -2      NA    NA   NULL [y]
## zlev    1   1          0 [m]    NA      NA    NA   NULL    
## time    1   1 1981-12-31 UTC    NA POSIXct    NA   NULL

Tidync is scared of doing this automatically for you, but in combination the ncmeta and tidync packages provide the tools to program around the vagaries presented by NetCDF sources. If you want software that aims to interpret all this for you then check out stars, GDAL, ferret, or Panoply.
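
As one small example of what "programming around" can look like, here is a sketch of a base R helper that decodes a time value from a units string. The function name `decode_days_since` is hypothetical, and it assumes the units have exactly the form "days since <origin>" with a standard calendar, as in this file; anything else is rejected.

```r
# Hypothetical helper: decode "days since <origin>" values into POSIXct (UTC).
# Only the "days since" form with a standard calendar is handled.
decode_days_since <- function(units, values) {
  stopifnot(grepl("^days since ", units))
  origin <- as.POSIXct(sub("^days since ", "", units), tz = "UTC")
  origin + values * 24 * 3600
}

decoded <- decode_days_since("days since 1978-01-01 00:00:00", 1460)
format(decoded, "%Y-%m-%d")
```

This should agree with the 1981-12-31 reported by the other tools above. Real sources use many other units and calendars, which is where a package such as RNetCDF earns its keep.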

Degenerate dimensions

See that both zlev and time are listed as dimensions but have length 1, and that their min and max values are constants. The zlev tells us that this grid exists at elevation = 0 (the sea surface), and time tells us that the data applies to time = 1460. The time is not expressed as a duration (though it presumably applies to the entire day). These are degenerate dimensions, i.e. the data is really 2D but we have a record of the 4D space from which it is expressed as a slice. This can cause problems as we would usually treat this data as a matrix in R, and so the ncdf4 and RNetCDF functions have arguments that are analogous to R’s array indexing argument drop = TRUE: if we encounter a dimension of length 1 then drop it. Tidync will also drop dimensions by default when reading data, see the drop argument in ?hyper_array.
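
The effect of dropping degenerate dimensions can be seen with a plain R array. This is a base R sketch with made-up values, shaped like one of the 4D variables; it does not read the file.

```r
# A 4D array with two degenerate (length 1) dimensions
x4 <- array(seq_len(180 * 90), dim = c(180, 90, 1, 1))
dim(x4)

# drop() removes every dimension of length 1, leaving a plain matrix
x2 <- drop(x4)
dim(x2)

# Indexing with drop = FALSE keeps the degenerate dimensions in place
x_sub <- x4[1:10, 1:5, 1, 1, drop = FALSE]
dim(x_sub)
```

Keeping the degenerate dimensions is sometimes what you want, for example when binding slices from several files back together along time.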

Reading the OISST data

At this point only metadata has been read, so let’s read some sea surface temperatures already!

The fastest way to get all the data is to call the function hyper_array(). This is the lowest level and is very close to using the ncdf4 or RNetCDF package directly.

(oisst_data <- oisst %>% hyper_array())
## Class: tidync_data (list of tidync data arrays)
## Variables (4): 'sst', 'anom', 'err', 'ice'
## Dimension (2): lon,lat,zlev,time (180, 90)
## Source: /usr/local/lib/R/site-library/stars/nc/reduced.nc

What happened there? We got a classed object, tidync_data, but this is just a list of arrays.

length(oisst_data)
## [1] 4
names(oisst_data)
## [1] "sst"  "anom" "err"  "ice"
dim(oisst_data[[1]])
## [1] 180  90
image(oisst_data[[1]])

This is exactly the data provided by ncdf4::ncvar_get() or RNetCDF::var.get.nc() but we can do it in a single line of code.

oisst_data <- tidync(oisstfile) %>% hyper_array()
op <- par(mfrow = n2mfrow(length(oisst_data)))
pals <- c("YlOrRd", "viridis", "Grays", "Blues")
for (i in seq_along(oisst_data)) {
  image(oisst_data[[i]], main = names(oisst_data)[i], col = hcl.colors(20, pals[i], rev = i ==1))
}

par(op)

Transforms

We have done nothing with the spatial side of these data, ignoring the lon and lat values completely.

oisst_data
## Class: tidync_data (list of tidync data arrays)
## Variables (4): 'sst', 'anom', 'err', 'ice'
## Dimension (2): lon,lat,zlev,time (180, 90)
## Source: /usr/local/lib/R/site-library/stars/nc/reduced.nc
lapply(oisst_data, dim)
## $sst
## [1] 180  90
## 
## $anom
## [1] 180  90
## 
## $err
## [1] 180  90
## 
## $ice
## [1] 180  90

The print summary of the oisst_data object shows that it knows there are four variables and that they each have 2 dimensions (zlev and time were dropped). The data is now stored as a list of native R arrays, but there is also the transforms attribute, available with hyper_transforms().

The values on each transform table may be used directly.

(trans <- attr(oisst_data, "transforms"))
## $lon
## # A tibble: 180 x 6
##      lon index    id name  coord_dim selected
##    <dbl> <int> <dbl> <chr> <lgl>     <lgl>   
##  1     0     1     0 lon   TRUE      TRUE    
##  2     2     2     0 lon   TRUE      TRUE    
##  3     4     3     0 lon   TRUE      TRUE    
##  4     6     4     0 lon   TRUE      TRUE    
##  5     8     5     0 lon   TRUE      TRUE    
##  6    10     6     0 lon   TRUE      TRUE    
##  7    12     7     0 lon   TRUE      TRUE    
##  8    14     8     0 lon   TRUE      TRUE    
##  9    16     9     0 lon   TRUE      TRUE    
## 10    18    10     0 lon   TRUE      TRUE    
## # … with 170 more rows
## 
## $lat
## # A tibble: 90 x 6
##      lat index    id name  coord_dim selected
##    <dbl> <int> <dbl> <chr> <lgl>     <lgl>   
##  1   -89     1     1 lat   TRUE      TRUE    
##  2   -87     2     1 lat   TRUE      TRUE    
##  3   -85     3     1 lat   TRUE      TRUE    
##  4   -83     4     1 lat   TRUE      TRUE    
##  5   -81     5     1 lat   TRUE      TRUE    
##  6   -79     6     1 lat   TRUE      TRUE    
##  7   -77     7     1 lat   TRUE      TRUE    
##  8   -75     8     1 lat   TRUE      TRUE    
##  9   -73     9     1 lat   TRUE      TRUE    
## 10   -71    10     1 lat   TRUE      TRUE    
## # … with 80 more rows
## 
## $zlev
## # A tibble: 1 x 6
##    zlev index    id name  coord_dim selected
##   <dbl> <int> <dbl> <chr> <lgl>     <lgl>   
## 1     0     1     2 zlev  TRUE      TRUE    
## 
## $time
## # A tibble: 1 x 6
##    time index    id name  coord_dim selected
##   <dbl> <int> <dbl> <chr> <lgl>     <lgl>   
## 1  1460     1     3 time  TRUE      TRUE
image(trans$lon$lon, trans$lat$lat,  oisst_data[[1]])
maps::map("world2", add = TRUE)

In this case these transforms are somewhat redundant: there is a value stored for every step in lon and every step in lat and they are completely regular, whereas the usual approach in graphics is to store an offset and scale rather than each dimension’s coordinates. Sometimes, though, these coordinate values are not reducible this way.
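
A sketch of that reduction in base R follows. The helper `as_offset_scale` is hypothetical (it is not part of tidync), and the tolerance used to decide whether spacing is regular is an arbitrary choice for the example.

```r
# Hypothetical helper: reduce a regular coordinate vector to offset and scale
as_offset_scale <- function(coord, tol = 1e-8) {
  step <- diff(coord)
  if (length(step) == 0 || any(abs(step - step[1]) > tol)) {
    return(NULL)  # single value or irregular spacing: not reducible
  }
  list(offset = coord[1], scale = step[1], n = length(coord))
}

lon <- seq(0, 358, by = 2)
os <- as_offset_scale(lon)

# Rebuild the coordinates from the compact form
rebuilt <- os$offset + os$scale * (seq_len(os$n) - 1)
all.equal(rebuilt, lon)

# Irregular coordinates cannot be reduced this way
irregular <- as_offset_scale(c(0, 1, 3, 7))
is.null(irregular)
```

Storing every coordinate, as the transforms do, costs a little space but covers the irregular case without special handling.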

Slicing

We can slice into these dimensions using a tidyverse approach. For example, say we wanted to slice out only the data for the waters of the Pacific Ocean, we need a range in longitude and a range in latitude.

We can put these ranges directly on our raw plot from earlier.

lonrange <- c(144, 247)
latrange <- c(-46, 47)

image(trans$lon$lon, trans$lat$lat,  oisst_data[[1]])
#abline(v = lonrange, h = latrange)
rect(lonrange[1], latrange[1], lonrange[2], latrange[2])

It’s common on the internet to see posts that explain how to drive the NetCDF library with start and count indices. To do that we need to compare our ranges with the transforms of each dimension.

xs <- findInterval(lonrange, trans$lon$lon)
ys <- findInterval(latrange, trans$lat$lat)
print(xs)
## [1]  73 124
print(ys)
## [1] 22 69
start <- c(xs[1], ys[1])
count <- c(diff(xs), diff(ys))

print(start)
## [1] 73 22
print(count)
## [1] 51 47

The idea here is that xs and ys tell us the columns and rows of interest, based on our geographic input in longitude and latitude values that we understand.

Let’s try to read with RNetCDF. Hmmm … what goes wrong?

con <- RNetCDF::open.nc(oisstfile)
try(sst_matrix <- RNetCDF::var.get.nc(con, "sst", start = start, count = count))
## Error in RNetCDF::var.get.nc(con, "sst", start = start, count = count) : 
##   length(start) == ndims is not TRUE

We have been bitten by thinking that this source data is 2D! So we just add start and count of 1 for each extra dimension (but what if it was 3D, or 5D, or time comes first - all of these things complicate these simple solutions).

start <- c(start, 1, 1)
count <- c(count, 1, 1)
sst_matrix <- RNetCDF::var.get.nc(con, "sst", start = start, count = count)
RNetCDF::close.nc(con)  # release the file handle once the read is done

And we’re good! Except, we now don’t have the coordinates for the mapping. We have to slice the lon and lat values as well, but let’s cut to the chase and go back to tidync.
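
For completeness, slicing the coordinates by hand might look like the base R sketch below. It assumes the lon and lat values are the regular 2-degree vectors shown in the transforms, rebuilt here with seq() rather than read from the file, and `idx` is a hypothetical helper.

```r
# Coordinate vectors rebuilt from the transforms printed earlier
lon <- seq(0, 358, by = 2)
lat <- seq(-89, 89, by = 2)

# The start/count pairs used for the first two dimensions above
start <- c(73, 22)
count <- c(51, 47)

# Hypothetical helper: positions covered by a start/count pair
idx <- function(start, count) seq(start, length.out = count)

lon_slice <- lon[idx(start[1], count[1])]
lat_slice <- lat[idx(start[2], count[2])]

range(lon_slice)
range(lat_slice)
```

The slice does not line up exactly with the requested lonrange and latrange, which is typical of the off-by-one bookkeeping that hand-built start and count values invite.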

Rather than slice the arrays read into memory, we can filter the object that understands the source. It does not do any data slicing at all, but records the slices to be done in future. This is the lazy beauty of the tidyverse, applied to NetCDF.

Here I used between() for lon and standard R inequality syntax for lat simply to show that both kinds of expression are available. We don’t have to specify the redundant slice into zlev or time.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
oisst_slice <- oisst %>% hyper_filter(lon = between(lon, lonrange[1], lonrange[2]), 
                       lat = lat > latrange[1] & lat <= latrange[2])

oisst_slice
## 
## Data Source (1): reduced.nc ...
## 
## Grids (5) <dimension family> : <associated variables> 
## 
## [1]   D0,D1,D2,D3 : sst, anom, err, ice    **ACTIVE GRID** ( 16200  values per variable)
## [2]   D0          : lon
## [3]   D1          : lat
## [4]   D2          : zlev
## [5]   D3          : time
## 
## Dimensions 4 (all active): 
##   
##   dim   name  length   min   max start count  dmin  dmax unlim coord_dim 
##   <chr> <chr>  <dbl> <dbl> <dbl> <int> <int> <dbl> <dbl> <lgl> <lgl>     
## 1 D0    lon      180     0   358    73    52   144   246 FALSE TRUE      
## 2 D1    lat       90   -89    89    23    47   -45    47 FALSE TRUE      
## 3 D2    zlev       1     0     0     1     1     0     0 FALSE TRUE      
## 4 D3    time       1  1460  1460     1     1  1460  1460 TRUE  TRUE

The print summary has now updated the start and count columns to match our laboriously acquired versions above. They are slightly different because findInterval() picks the cell at or below each bound and diff() does not count the end cell, whereas the inequality expressions select exactly the cells that satisfy them.

The dmin and dmax (data-min, data-max) columns are also updated, reporting the coordinate value at the start and end of the slice we have specified.
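
The numbers in the Dimensions table can be reproduced by hand. The base R sketch below assumes regular 2-degree coordinates rebuilt with seq() and a contiguous selection; `slice_summary` is a hypothetical helper, not a tidync function.

```r
lon <- seq(0, 358, by = 2)
lat <- seq(-89, 89, by = 2)
lonrange <- c(144, 247)
latrange <- c(-46, 47)

# Hypothetical helper: summarise a contiguous logical selection
# in the same terms as the Dimensions table
slice_summary <- function(coord, selected) {
  i <- which(selected)
  data.frame(start = i[1], count = length(i),
             dmin = min(coord[i]), dmax = max(coord[i]))
}

# between(lon, a, b) is equivalent to lon >= a & lon <= b
lon_sel <- slice_summary(lon, lon >= lonrange[1] & lon <= lonrange[2])
lat_sel <- slice_summary(lat, lat > latrange[1] & lat <= latrange[2])
rbind(lon = lon_sel, lat = lat_sel)

# The findInterval() version for comparison: diff() omits the end cell,
# and the lat start falls one cell below the lower bound
xs <- findInterval(lonrange, lon)
ys <- findInterval(latrange, lat)
c(start = xs[1], count = diff(xs))
c(start = ys[1], count = diff(ys))
```

Expressing the slice as a condition on coordinate values leaves the index arithmetic to the software, which is the main convenience hyper_filter() offers.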

Now we can break the lazy chain and call for the data.

oisst_slice_data <- oisst_slice %>% hyper_array()
trans <- attr(oisst_slice_data, "transforms")

One unfortunate issue here is that we cannot use the transforms directly: they have been updated, but not in the obvious way (this is something that should probably be fixed in tidync).

First filter the lon and lat transforms based on the selected column.

lon <- trans$lon %>% dplyr::filter(selected)
lat <- trans$lat %>% dplyr::filter(selected)

image(lon$lon, lat$lat, oisst_slice_data[[1]])
maps::map("world2", add = TRUE)

It is laborious to work with hyper_array() but it does give total control over what we can get.

It’s much easier to use other output types.

tcube <- tidync(oisstfile) %>% 
  hyper_filter(lon = between(lon, lonrange[1], lonrange[2]), 
                       lat = lat > latrange[1] & lat <= latrange[2]) %>% 
  hyper_tbl_cube()

library(ggplot2)
ggplot(as_tibble(tcube)) + geom_raster(aes(lon, lat, fill = sst))

For those that prefer old-school ggplot2 we can read our slice in directly as a tibble.

tdata <- tidync(oisstfile) %>% 
  hyper_filter(lon = between(lon, lonrange[1], lonrange[2]), 
                       lat = lat > latrange[1] & lat <= latrange[2]) %>% 
  hyper_tibble()

ggplot(tdata, aes(lon, lat, fill = anom)) + geom_raster()

By default, all variables are available but we can limit them with select_var.

tidync(oisstfile) %>% 
  hyper_filter(lon = between(lon, lonrange[1], lonrange[2]), 
                       lat = lat > latrange[1] & lat <= latrange[2]) %>% 
  hyper_tibble(select_var = c("err", "ice"))
## # A tibble: 2,377 x 6
##      err   ice   lon   lat  zlev  time
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 0.230    NA   144   -45     0  1460
##  2 0.320    NA   146   -45     0  1460
##  3 0.280    NA   148   -45     0  1460
##  4 0.210    NA   150   -45     0  1460
##  5 0.200    NA   152   -45     0  1460
##  6 0.240    NA   154   -45     0  1460
##  7 0.230    NA   156   -45     0  1460
##  8 0.250    NA   158   -45     0  1460
##  9 0.260    NA   160   -45     0  1460
## 10 0.290    NA   162   -45     0  1460
## # … with 2,367 more rows

The same workflow applies to a source with more than one time step, such as the tos example file in stars, where each time step becomes a facet.

tos <- tidync(system.file("nc/tos_O1_2001-2002.nc", package = "stars"))
library(dplyr)
stos <- tos %>% hyper_filter(lon = between(lon, 140, 220), 
                     lat = between(lat, -60, 0)) %>% hyper_tibble()

library(ggplot2)
ggplot(stos, aes(lon, lat, fill = tos)) + geom_raster() + facet_wrap(~time)

Future helpers

A feature being considered for an upcoming version is to expand out all available linked coordinates. This occurs when an array has a dimension but only stores its index. When a dimension stores values directly this is known as a dim-coord, and usually occurs for time values. One way to expand this out would be to include an expand_coords argument to hyper_tibble() and have it run the following code:

#' Expand coordinates stored against dimensions
#'
#' @param x tidync object
#' @param ... ignored
#'
#' @return data frame of all variables and any linked-coordinates 
#' @noRd
#'
#' @examples

full_expand <- function(x, ...) {
  ad <- active(x)
  spl <- strsplit(ad, ",")[[1L]]
  out <- hyper_tibble(x)
  
  for (i in seq_along(spl)) {
    out <- dplyr::inner_join(out, activate(x, spl[i]) %>% hyper_tibble())
  } 
  out
}

It’s not clear to me how consistently this fits in the wider variants found in the NetCDF world, so any feedback is welcome.
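
The join at the heart of full_expand() can be illustrated without any NetCDF file. In this base R sketch the two data frames are made-up stand-ins for the tables hyper_tibble() would return for the main grid and for a 1D grid on the station dimension; merge() plays the role of dplyr::inner_join().

```r
# Toy stand-in for the main grid: station is stored only as an index
main <- data.frame(
  et      = c(10, 19, 21, 12),
  time    = c(1, 2, 1, 2),
  station = c(1, 1, 2, 2)
)

# Toy stand-in for a 1D grid on station that carries linked coordinates
station_grid <- data.frame(
  station = c(1, 2),
  lat     = c(36.5, 35.1),
  lon     = c(-80.4, -79.8)
)

# Joining on the shared dimension expands the coordinates onto every row
expanded <- merge(main, station_grid, by = "station")
expanded
```

Naming the join column explicitly, as here, avoids relying on whichever column names happen to be shared, which is one of the places a general-purpose version would need care.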

A real world example is available in the ncdfgeom package. This package provides much more in terms of storing geometry within a NetCDF file, but here we only extract the lon, lat and station name that hyper_tibble() isn’t seeing by default.

huc <- system.file('extdata','example_huc_eta.nc', package = 'ncdfgeom')

full_expand(tidync(huc))
## Joining, by = "time"
## Joining, by = "station"
## # A tibble: 50 x 5
##       et  time station   lat   lon
##    <int> <dbl>   <int> <dbl> <dbl>
##  1    10 10957       1  36.5 -80.4
##  2    19 10988       1  36.5 -80.4
##  3    21 11017       1  36.5 -80.4
##  4    36 11048       1  36.5 -80.4
##  5   105 11078       1  36.5 -80.4
##  6   110 11109       1  36.5 -80.4
##  7   128 11139       1  36.5 -80.4
##  8   121 11170       1  36.5 -80.4
##  9    70 11201       1  36.5 -80.4
## 10    25 11231       1  36.5 -80.4
## # … with 40 more rows
hyper_tibble(tidync(huc))
## # A tibble: 50 x 3
##       et  time station
##    <int> <dbl>   <int>
##  1    10 10957       1
##  2    19 10988       1
##  3    21 11017       1
##  4    36 11048       1
##  5   105 11078       1
##  6   110 11109       1
##  7   128 11139       1
##  8   121 11170       1
##  9    70 11201       1
## 10    25 11231       1
## # … with 40 more rows

Thanks!

The tidync package recently hit CRAN after a fairly long review process on rOpenSci. In early 2018 I just wasn’t sure if it was really going to be wrapped up in a neat way, but thanks to very helpful reviewers and also some key insights about obscure types it was done.

The package benefitted greatly from feedback provided by Jakub Nowosad and Tim Lucas.