sampling

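Currently, there are 8 functions associated with the sample verb in the sgsR package: sample_srs(), sample_systematic(), sample_strat(), sample_nc(), sample_clhs(), sample_balanced(), sample_ahels(), and sample_existing().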

Access

One key feature of some sample_* functions is their ability to define access corridors. Users can supply a road access network (must be sf line objects) and define buffers around access within which samples should be excluded or included.

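The relevant parameters when access is defined are buff_inner (inner buffer; no samples are placed within this distance from a road) and buff_outer (outer buffer; no samples are placed further than this distance from a road).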

sample_srs

We have demonstrated a simple example of using the sample_srs() function in vignette("sgsR"). We will demonstrate additional examples below.

The input required for sample_srs() is a raster. This means that sraster and mraster are supported for this function.

#--- perform simple random sampling ---#
sample_srs(raster = sraster, # input sraster
           nSamp = 200, # number of desired samples
           plot = TRUE) # plot

#> Simple feature collection with 200 features and 0 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 431110 ymin: 5337730 xmax: 438490 ymax: 5343230
#> CRS:           +proj=utm +zone=17 +ellps=GRS80 +units=m +no_defs
#> First 10 features:
#>                  geometry
#> 1  POINT (431310 5338270)
#> 2  POINT (431310 5338270)
#> 3  POINT (431490 5338230)
#> 4  POINT (435030 5340130)
#> 5  POINT (431950 5343110)
#> 6  POINT (431430 5340610)
#> 7  POINT (436150 5340170)
#> 8  POINT (434510 5339890)
#> 9  POINT (434370 5341050)
#> 10 POINT (431990 5338390)
sample_srs(raster = mraster, # input mraster
           nSamp = 200, # number of desired samples
           access = access, # define access road network
           mindist = 200, # minimum distance samples must be apart from one another
           buff_inner = 50, # inner buffer - no samples within this distance from road
           buff_outer = 200, # outer buffer - no samples further than this distance from road
           plot = TRUE) # plot

#> Simple feature collection with 200 features and 0 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 431170 ymin: 5337710 xmax: 438470 ymax: 5343230
#> CRS:           +proj=utm +zone=17 +ellps=GRS80 +units=m +no_defs
#> First 10 features:
#>                  geometry
#> 1  POINT (436490 5341010)
#> 2  POINT (435750 5343050)
#> 3  POINT (435870 5339230)
#> 4  POINT (436130 5339890)
#> 5  POINT (435150 5342710)
#> 6  POINT (437550 5342430)
#> 7  POINT (434930 5339330)
#> 8  POINT (435130 5340310)
#> 9  POINT (433590 5342210)
#> 10 POINT (435690 5340050)
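
The effect of buff_inner and buff_outer can be pictured as a distance filter on candidate locations. The base R sketch below is not the sgsR implementation: the helper dist_to_segment(), the single straight "road", and the random candidate points are all assumptions made for illustration.

```r
# Hypothetical helper: shortest distance from points (px, py) to segment a-b
dist_to_segment <- function(px, py, ax, ay, bx, by) {
  dx <- bx - ax
  dy <- by - ay
  # position of the closest point along the segment, clamped to [0, 1]
  t <- ((px - ax) * dx + (py - ay) * dy) / (dx^2 + dy^2)
  t <- pmin(pmax(t, 0), 1)
  sqrt((px - (ax + t * dx))^2 + (py - (ay + t * dy))^2)
}

set.seed(1)
# toy candidate locations in a 1000 m x 1000 m area (assumption)
cand <- data.frame(x = runif(1000, 0, 1000), y = runif(1000, 0, 1000))

# one straight road running west to east through the middle of the area
d <- dist_to_segment(cand$x, cand$y, ax = 0, ay = 500, bx = 1000, by = 500)

buff_inner <- 50   # no samples closer than this to the road
buff_outer <- 200  # no samples further than this from the road
in_corridor <- d >= buff_inner & d <= buff_outer
eligible <- cand[in_corridor, ]
nrow(eligible)
```

Only the rows in eligible would remain available for sampling; everything nearer than buff_inner or further than buff_outer is dropped before any sample is drawn.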

sample_systematic

The sample_systematic() function applies systematic sampling across an area, with the cellsize parameter defining the resolution of the tessellation. The tessellation shape can be modified using the square parameter. Assigning TRUE (default) to the square parameter results in a regular grid, and assigning FALSE results in a hexagonal grid. The location of samples can also be adjusted using the location parameter, where centers takes the center, corners takes all corners, and random takes a random location within each tessellation cell.

#--- perform grid sampling ---#
sample_systematic(raster = sraster, # input sraster
                  cellsize = 1000, # grid distance
                  plot = TRUE) # plot

#> Simple feature collection with 33 features and 0 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 431101.8 ymin: 5337851 xmax: 438384.8 ymax: 5342899
#> CRS:           +proj=utm +zone=17 +ellps=GRS80 +units=m +no_defs
#> First 10 features:
#>                    geometry
#> 1    POINT (436895 5337851)
#> 2  POINT (437805.4 5338265)
#> 3  POINT (434660.5 5337934)
#> 4    POINT (438302 5339589)
#> 5  POINT (432425.9 5338016)
#> 6  POINT (433336.3 5338430)
#> 7  POINT (434246.7 5338844)
#> 8    POINT (435157 5339258)
#> 9  POINT (436067.4 5339672)
#> 10 POINT (436977.8 5340085)
#--- perform grid sampling ---#
sample_systematic(raster = sraster, # input sraster
                  cellsize = 500, # grid distance
                  square = FALSE, # hexagonal tessellation
                  location = "random", # random sample within tessellation
                  plot = TRUE) # plot

#> Simple feature collection with 170 features and 0 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 431100.4 ymin: 5337727 xmax: 438479.6 ymax: 5343236
#> CRS:           +proj=utm +zone=17 +ellps=GRS80 +units=m +no_defs
#> First 10 features:
#>                    geometry
#> 1  POINT (437958.9 5337915)
#> 2  POINT (438449.9 5338095)
#> 3  POINT (437556.1 5338145)
#> 4  POINT (436731.7 5337971)
#> 5  POINT (435935.9 5337919)
#> 6  POINT (437884.6 5338165)
#> 7  POINT (437140.6 5338080)
#> 8  POINT (436214.2 5338134)
#> 9  POINT (435619.4 5338004)
#> 10   POINT (433808 5337727)
sample_systematic(raster = sraster, # input sraster
                  cellsize = 500, # grid distance
                  access = access, # define access road network
                  buff_outer = 200, # outer buffer - no samples further than this distance from road
                  square = FALSE, # hexagonal tessellation
                  location = "corners", # take corners instead of centers
                  plot = TRUE) # plot

#> Simple feature collection with 608 features and 0 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 431167.5 ymin: 5337705 xmax: 438450.4 ymax: 5343235
#> CRS:           +proj=utm +zone=17 +ellps=GRS80 +units=m +no_defs
#> First 10 features:
#>                    geometry
#> 1  POINT (437807.7 5343071)
#> 2  POINT (437520.7 5343103)
#> 3  POINT (434767.4 5343153)
#> 4  POINT (434480.5 5343185)
#> 5  POINT (435628.3 5343059)
#> 6  POINT (435341.3 5343090)
#> 7  POINT (436489.1 5342964)
#> 8  POINT (436202.2 5342996)
#> 9  POINT (437520.7 5343103)
#> 10   POINT (437350 5342870)
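
To make the cellsize and location options concrete, here is a minimal base R sketch of a square tessellation. It is an illustration only: grid_samples() is a hypothetical helper, the extent is invented, and the sketch ignores any rotation or offset of the grid that the package may apply. It also returns each shared corner once, which may differ from the package's behaviour.

```r
# Hypothetical helper: sample locations on an axis-aligned square tessellation
grid_samples <- function(xmin, xmax, ymin, ymax, cellsize,
                         location = c("centers", "corners", "random")) {
  location <- match.arg(location)
  # lower-left corner of every grid cell
  x0 <- seq(xmin, xmax - cellsize, by = cellsize)
  y0 <- seq(ymin, ymax - cellsize, by = cellsize)
  cells <- expand.grid(x = x0, y = y0)
  if (location == "centers") {
    data.frame(x = cells$x + cellsize / 2, y = cells$y + cellsize / 2)
  } else if (location == "corners") {
    # every grid node, each listed once
    expand.grid(x = seq(xmin, xmax, by = cellsize),
                y = seq(ymin, ymax, by = cellsize))
  } else {
    # one uniformly random location inside each cell
    data.frame(x = cells$x + runif(nrow(cells), 0, cellsize),
               y = cells$y + runif(nrow(cells), 0, cellsize))
  }
}

centers <- grid_samples(0, 4000, 0, 3000, cellsize = 1000)
corners <- grid_samples(0, 4000, 0, 3000, cellsize = 1000, location = "corners")
nrow(centers)  # 12 cells
nrow(corners)  # 20 grid nodes
```

Halving cellsize roughly quadruples the number of cells, which is why the cellsize = 500 examples above return many more samples than the cellsize = 1000 example.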

sample_strat

The sample_strat() function contains two methods to perform sampling:

method = "Queinnec"

Queinnec, M., White, J. C., & Coops, N. C. (2021). Comparing airborne and spaceborne photon-counting LiDAR canopy structural estimates across different boreal forest types. Remote Sensing of Environment, 262, 112510.

This algorithm uses a moving window (wrow and wcol parameters) to filter the input sraster and prioritize sample locations where stratum pixels are spatially grouped, rather than dispersed individually across the landscape.

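Sampling is performed using 2 rules:

  1. Rule 1: sample within spatially grouped stratum pixels, with the moving window defined by wrow and wcol.

  2. Rule 2: if Rule 1 cannot supply enough samples to reach the desired sample count (nSamp), individual stratum pixels are sampled.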

The rule applied to select a particular sample is recorded in the rule attribute of the output samples. We give a few examples below:

#--- perform stratified sampling ---#
sample_strat(sraster = sraster, # input sraster
             nSamp = 200, # desired sample number
             plot = TRUE) # plot

#> Simple feature collection with 200 features and 3 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 431110 ymin: 5337710 xmax: 438530 ymax: 5343230
#> CRS:           NA
#> First 10 features:
#>    strata type  rule               geometry
#> x       1  new rule1 POINT (436810 5337970)
#> x1      1  new rule2 POINT (436050 5342150)
#> x2      1  new rule2 POINT (434390 5341370)
#> x3      1  new rule2 POINT (433830 5342770)
#> x4      1  new rule2 POINT (435130 5339170)
#> x5      1  new rule2 POINT (433650 5341430)
#> x6      1  new rule2 POINT (437970 5343130)
#> x7      1  new rule2 POINT (431250 5340390)
#> x8      1  new rule2 POINT (436810 5338930)
#> x9      1  new rule2 POINT (431550 5342570)

In some cases, users might want to include existing samples within the algorithm. In order to adjust the total number of samples needed per stratum to reflect those already present in existing, we can use the intermediate function extract_strata().

This function uses the sraster and existing samples and extracts the stratum for each. These samples can be included within sample_strat(), which adjusts total samples required per class based on representation in existing.

#--- extract strata values to existing samples ---#              
e.sr <- extract_strata(sraster = sraster, # input sraster
                       existing = existing) # existing samples to add strata value to

Notice that e.sr now has an attribute named strata. If that attribute is missing, sample_strat() will give an error.

sample_strat(sraster = sraster, # input sraster
             nSamp = 200, # desired sample number
             access = access, # define access road network
             existing = e.sr, # existing samples with strata values
             mindist = 200, # minimum distance samples must be apart from one another
             buff_inner = 50, # inner buffer - no samples within this distance from road
             buff_outer = 200, # outer buffer - no samples further than this distance from road
             plot = TRUE) # plot

#> Simple feature collection with 400 features and 3 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 431150 ymin: 5337710 xmax: 438550 ymax: 5343230
#> CRS:           NA
#> First 10 features:
#>    strata     type     rule               geometry
#> 1       1 existing existing POINT (435890 5339450)
#> 2       1 existing existing POINT (437870 5341090)
#> 3       1 existing existing POINT (434370 5340970)
#> 4       1 existing existing POINT (437910 5342390)
#> 5       1 existing existing POINT (436210 5343010)
#> 6       1 existing existing POINT (433990 5341570)
#> 7       1 existing existing POINT (433610 5343230)
#> 8       1 existing existing POINT (435930 5339590)
#> 9       1 existing existing POINT (434910 5339530)
#> 10      1 existing existing POINT (435270 5340650)

As seen in the code above, the mindist parameter specifies the minimum Euclidean distance that samples must be apart from one another.
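
One simple way to honour a minimum distance is greedy thinning: walk through candidates in random order and keep a candidate only if it is far enough from everything kept so far. The sketch below is an assumption about how such a constraint can be enforced, not a description of the sgsR internals; thin_by_mindist() and the toy coordinates are hypothetical.

```r
# Hypothetical helper: keep candidates (in the order given) that are at least
# `mindist` away from every sample accepted so far, stopping at `n` samples
thin_by_mindist <- function(xy, n, mindist) {
  keep <- integer(0)
  for (i in seq_len(nrow(xy))) {
    if (length(keep) == n) break
    if (length(keep) == 0) {
      keep <- i
      next
    }
    # Euclidean distance from candidate i to all accepted samples
    d <- sqrt((xy[keep, 1] - xy[i, 1])^2 + (xy[keep, 2] - xy[i, 2])^2)
    if (all(d >= mindist)) keep <- c(keep, i)
  }
  xy[keep, , drop = FALSE]
}

set.seed(42)
# toy candidates in a 5 km x 5 km area, already in random order
cand <- cbind(x = runif(500, 0, 5000), y = runif(500, 0, 5000))
samples <- thin_by_mindist(cand, n = 50, mindist = 200)

nrow(samples)
min(dist(samples))  # smallest pairwise distance among accepted samples
```

A large mindist relative to the study area can make the requested number of samples unreachable, so it is worth checking the returned sample count.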

Notice that the sample outputs have type and rule attributes, which outline whether the samples are existing or new and whether rule1 or rule2 was used to select each sample. If type is existing (a user-provided existing sample), rule will be existing as well, as seen above.

sample_strat(sraster = sraster, # input
             nSamp = 200, # desired sample number
             access = access, # define access road network
             existing = e.sr, # existing samples with strata values
             include = TRUE, # include existing plots in nSamp total
             buff_outer = 200, # outer buffer - no samples further than this distance from road
             plot = TRUE) # plot

#> Simple feature collection with 200 features and 3 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 431150 ymin: 5337730 xmax: 438550 ymax: 5343230
#> CRS:           NA
#> First 10 features:
#>    strata     type     rule               geometry
#> 1       1 existing existing POINT (435890 5339450)
#> 2       1 existing existing POINT (437870 5341090)
#> 3       1 existing existing POINT (434370 5340970)
#> 4       1 existing existing POINT (437910 5342390)
#> 5       1 existing existing POINT (436210 5343010)
#> 6       1 existing existing POINT (433990 5341570)
#> 7       1 existing existing POINT (433610 5343230)
#> 8       1 existing existing POINT (435930 5339590)
#> 9       1 existing existing POINT (434910 5339530)
#> 10      1 existing existing POINT (435270 5340650)

The include parameter determines whether existing samples should be included in the total count of samples defined by nSamp. By default, the include parameter is set to FALSE.

method = "random"

Stratified random sampling with equal probability for all cells (using default algorithm values for mindist and no use of access functionality). In essence, this method performs the sample_srs algorithm for each stratum separately to meet the specified sample allocation.

#--- perform stratified random sampling ---#
sample_strat(sraster = sraster, # input sraster
             method = "random", # stratified random sampling
             nSamp = 200, # desired sample number
             plot = TRUE) # plot

#> Simple feature collection with 200 features and 1 field
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 431130 ymin: 5337710 xmax: 438330 ymax: 5343230
#> Projected CRS: UTM Zone 17, Northern Hemisphere
#> First 10 features:
#>    strata               geometry
#> 1       1 POINT (436310 5338050)
#> 2       1 POINT (436310 5338050)
#> 3       1 POINT (435590 5342550)
#> 4       1 POINT (437930 5337910)
#> 5       1 POINT (434470 5342450)
#> 6       1 POINT (434810 5342190)
#> 7       1 POINT (437930 5338510)
#> 8       1 POINT (434790 5341550)
#> 9       1 POINT (437110 5338490)
#> 10      1 POINT (435050 5339630)
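
The idea of running simple random sampling inside each stratum can be sketched in a few lines of base R. The toy cell table and the proportional allocation rule below are assumptions for illustration; they are not taken from the sgsR source.

```r
set.seed(7)
# toy "sraster": one row per cell with a stratum label (4 strata, unequal size)
cells <- data.frame(
  id = 1:10000,
  strata = sample(1:4, 10000, replace = TRUE, prob = c(0.4, 0.3, 0.2, 0.1))
)
nSamp <- 200

# proportional allocation (assumption): samples per stratum follow stratum area
prop <- table(cells$strata) / nrow(cells)
alloc <- setNames(as.integer(round(nSamp * prop)), names(prop))

# simple random sample, without replacement, inside each stratum
picked <- do.call(rbind, lapply(names(alloc), function(s) {
  pool <- cells[cells$strata == as.integer(s), ]
  pool[sample(nrow(pool), alloc[[s]]), ]
}))

table(picked$strata)  # matches the allocation
```

Because of rounding, the allocation can differ from nSamp by a sample or two; a production implementation would need a rule for distributing that remainder.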

sample_nc

The sample_nc() function implements the Nearest Centroid sampling algorithm described in Melville & Stone (2016). The algorithm uses kmeans clustering where the number of clusters (centroids) is equal to the desired number of samples (nSamp). Cluster centers are located, and the mraster pixel nearest to each center is then identified (assuming the default k parameter). These nearest neighbours are the output samples. Basic usage is as follows:

#--- perform nearest centroid sampling ---#
sample_nc(mraster = mraster, # input
          nSamp = 25, # desired sample number
          plot = TRUE)
#> K-means being performed on 3 layers with 25 centers.

#> Simple feature collection with 25 features and 4 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 431170 ymin: 5338190 xmax: 438510 ymax: 5343210
#> CRS:           +proj=utm +zone=17 +ellps=GRS80 +units=m +no_defs
#> First 10 features:
#>        zq90 pzabove2  zsd kcenter               geometry
#> 54287 14.70     88.8 3.50       1 POINT (435130 5340330)
#> 44432 16.90     87.5 4.22       2 POINT (431990 5340850)
#> 66355 17.70     59.3 5.05       3 POINT (437770 5339690)
#> 21638 12.70     51.9 3.57       4 POINT (431170 5342070)
#> 78491  3.00      7.6 0.58       5 POINT (434310 5339030)
#> 77930  6.88     14.2 1.77       6 POINT (438010 5339070)
#> 77033 10.70     17.2 3.07       7 POINT (434990 5339110)
#> 87095 18.00     92.9 3.44       8 POINT (434810 5338570)
#> 6209  15.60     71.8 4.12       9 POINT (435910 5342910)
#> 16935 15.50     94.8 2.71      10 POINT (434090 5342330)

Altering the k parameter leads to a multiplicative increase in output samples, where total output samples = nSamp * k.

#--- perform nearest centroid sampling with k = 2 ---#
samples <- sample_nc(mraster = mraster, # input
                    k = 2, # number of nearest neighbours to take for each kmeans center
                    nSamp = 25, # desired sample number
                    plot = TRUE)
#> K-means being performed on 3 layers with 25 centers.


#--- total samples = nSamp * k (25 * 2) = 50 ---#
nrow(samples)
#> [1] 50

Setting details = TRUE makes it possible to visualize the kmeans centers and the samples nearest to them. The $kplot output provides a quick visualization of where the centers are, based on a scatter plot of the first 2 layers in mraster. Notice that the centers are well distributed in covariate space and that the chosen samples are the closest pixels to each center (nearest neighbours).

#--- perform nearest centroid sampling with details ---#
details <- sample_nc(mraster = mraster, # input
                     nSamp = 25, # desired sample number
                     details = TRUE)
#> K-means being performed on 3 layers with 25 centers.

#--- plot ggplot output ---#

details$kplot
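
The nearest centroid idea can be reproduced on a toy table with stats::kmeans(). This is a sketch under stated assumptions, not the package code: the covariate values are simulated (only the column names are borrowed from the output above), and scaling the covariates before clustering is a choice made here for illustration.

```r
set.seed(123)
# toy "mraster": 2000 cells with three simulated covariates
metrics <- data.frame(zq90 = runif(2000, 2, 25),
                      pzabove2 = runif(2000, 0, 100),
                      zsd = runif(2000, 0.5, 7))
nSamp <- 25  # number of kmeans centers
k <- 2       # nearest neighbours taken per center

# scale so that no single covariate dominates the distances (assumption)
X <- scale(metrics)
km <- kmeans(X, centers = nSamp, iter.max = 100)

# for each center, find the k cells closest to it in covariate space
nearest <- lapply(seq_len(nSamp), function(j) {
  d <- sqrt(rowSums(sweep(X, 2, km$centers[j, ])^2))
  data.frame(cell = order(d)[seq_len(k)], kcenter = j)
})
samples <- do.call(rbind, nearest)

nrow(samples)  # 50, i.e. nSamp * k
```

The selected cell indices would then be mapped back to pixel coordinates to give the sample locations.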

sample_clhs

The sample_clhs() function implements the conditioned Latin hypercube sampling (clhs) methodology from the clhs package. A number of other functions in the sgsR package help to provide guidance on clhs sampling, including calculate_pop() and calculate_lhsOpt(). Check out these functions to better understand how sample numbers could be optimized.

The syntax for this function is similar to others shown above, although parameters like iter, which defines the number of iterations within the Metropolis-Hastings process, are important to consider. In these examples we use a low iter value because it takes less time to run. The default value for iter within the clhs package is 10,000.

sample_clhs(mraster = mraster, # input
            nSamp = 200, # desired sample number
            plot = TRUE, # plot 
            iter = 100) # number of iterations

sample_clhs(mraster = mraster, # input
            nSamp = 300, # desired sample number
            iter = 100, # number of iterations
            existing = existing, # existing samples
            access = access, # define access road network
            buff_inner = 100, # inner buffer - no samples within this distance from road
            buff_outer = 300, # outer buffer - no samples further than this distance from road
            plot = TRUE) # plot
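
To see why iter matters, it helps to look at a stripped-down version of the Latin hypercube objective. The sketch below is a deliberate simplification and not the clhs algorithm: it uses toy covariates, scores a sample only by how evenly it covers the quantile strata of each covariate, and accepts a swap only when the score does not get worse, whereas a Metropolis-Hastings process also accepts some worse moves. All names are hypothetical.

```r
set.seed(99)
pop <- data.frame(a = rnorm(1000), b = runif(1000))  # toy covariates
n <- 20                                              # desired sample size

# quantile breaks splitting each covariate into n equally populated strata
breaks <- lapply(pop, function(v) quantile(v, probs = seq(0, 1, length.out = n + 1)))

# objective: total deviation from "exactly one sample per stratum, per covariate"
lhs_objective <- function(idx) {
  sum(mapply(function(v, b) {
    counts <- tabulate(findInterval(v[idx], b, rightmost.closed = TRUE), nbins = n)
    sum(abs(counts - 1))
  }, pop, breaks))
}

idx <- sample(nrow(pop), n)  # random starting sample
start <- lhs_objective(idx)
best <- start

for (i in seq_len(500)) {    # plays the role of iter
  cand <- idx
  # swap one sampled cell for one unsampled cell
  cand[sample(n, 1)] <- sample(setdiff(seq_len(nrow(pop)), idx), 1)
  val <- lhs_objective(cand)
  if (val <= best) {         # greedy acceptance (simplification)
    idx <- cand
    best <- val
  }
}

c(start = start, end = best)
```

More iterations give the search more chances to find swaps that lower the objective, which is the trade-off between run time and sample quality that iter controls.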

The cost parameter defines the mraster covariate used to constrain the clhs sampling. This could be any number of variables. Examples include the distance of a pixel from road access (e.g. from calculate_distance(); see the example below), terrain slope, the output from calculate_coobs(), or many others.

#--- cost constrained examples ---#
#--- calculate distance to access layer for each pixel in mr ---#
mr.c <- calculate_distance(raster = mraster, # input
                           access = access, # define access road network
                           plot = TRUE) # plot

sample_clhs(mraster = mr.c, # input
            nSamp = 250, # desired sample number
            iter = 100, # number of iterations
            cost = "dist2access", # cost parameter - name defined in calculate_distance()
            plot = TRUE) # plot

sample_balanced

The sample_balanced() algorithm performs a balanced sampling methodology from the stratifyR / SamplingBigData packages.

sample_balanced(mraster = mraster, # input
                nSamp = 200, # desired sample number
                plot = TRUE) # plot

#> Simple feature collection with 200 features and 0 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 431110 ymin: 5337710 xmax: 438510 ymax: 5343210
#> CRS:           +proj=utm +zone=17 +ellps=GRS80 +units=m +no_defs
#> First 10 features:
#>                  geometry
#> 1  POINT (438010 5343210)
#> 2  POINT (432410 5343170)
#> 3  POINT (432430 5343170)
#> 4  POINT (437850 5343170)
#> 5  POINT (435570 5343090)
#> 6  POINT (431610 5343070)
#> 7  POINT (434490 5343070)
#> 8  POINT (432210 5343030)
#> 9  POINT (431970 5343010)
#> 10 POINT (431590 5342990)
sample_balanced(mraster = mraster, # input
                nSamp = 100, # desired sample number
                algorithm = "lcube", # algorithm type
                access = access, # define access road network
                buff_inner = 50, # inner buffer - no samples within this distance from road
                buff_outer = 200) # outer buffer - no samples further than this distance from road
#> Simple feature collection with 100 features and 0 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 431110 ymin: 5337710 xmax: 438550 ymax: 5343230
#> CRS:           +proj=utm +zone=17 +ellps=GRS80 +units=m +no_defs
#> First 10 features:
#>                  geometry
#> 1  POINT (433510 5341110)
#> 2  POINT (437890 5338910)
#> 3  POINT (432670 5341050)
#> 4  POINT (434450 5339710)
#> 5  POINT (438290 5341050)
#> 6  POINT (436150 5342070)
#> 7  POINT (436310 5342750)
#> 8  POINT (437870 5337810)
#> 9  POINT (437030 5338110)
#> 10 POINT (434530 5338730)

sample_ahels

The sample_ahels() function performs the adapted Hypercube Evaluation of a Legacy Sample (ahels) algorithm using existing sample data and an mraster. New samples are allocated based on quantile ratios between the existing sample and the mraster covariate dataset.

This algorithm was adapted from that presented in the paper below, which we highly recommend.

Malone BP, Minasny B, Brungard C. 2019. Some methods to improve the utility of conditioned Latin hypercube sampling. PeerJ 7:e6451 DOI 10.7717/peerj.6451

This algorithm:

  1. Determines the quantile distributions of existing samples and mraster covariates.

  2. Determines quantiles where there is a disparity between samples and covariates.

  3. Prioritizes sampling within those quantiles to improve representation.

To use this function, users must first specify the number of quantiles (nQuant), followed by either nSamp (the total number of desired samples to be added) or threshold (sampling ratio vs. covariate coverage ratio for quantiles; default is 0.9). We recommend setting the threshold value at or below 0.9.

sample_ahels(mraster = mraster, 
             existing = existing, # existing samples
             plot = TRUE) # plot

#> Simple feature collection with 230 features and 7 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 431150 ymin: 5337730 xmax: 438550 ymax: 5343230
#> CRS:           +proj=utm +zone=17 +ellps=GRS80 +units=m +no_defs
#> First 10 features:
#>      type.x  zq90 pzabove2  zsd strata type.y  rule               geometry
#> 1  existing  9.44     39.4 2.47      1    new rule1 POINT (435890 5339450)
#> 2  existing  8.77     94.8 1.61      1    new rule2 POINT (437870 5341090)
#> 3  existing 10.50     66.2 2.89      1    new rule2 POINT (434370 5340970)
#> 4  existing  3.41     14.1 0.66      1    new rule2 POINT (437910 5342390)
#> 5  existing  8.50     58.6 2.11      1    new rule2 POINT (436210 5343010)
#> 6  existing  4.15     17.8 1.00      1    new rule2 POINT (433990 5341570)
#> 7  existing  5.68     68.3 1.23      1    new rule2 POINT (433610 5343230)
#> 8  existing 10.70     39.7 2.93      1    new rule2 POINT (435930 5339590)
#> 9  existing  6.52     10.3 1.66      1    new rule2 POINT (434910 5339530)
#> 10 existing  9.69      3.7 2.54      1    new rule2 POINT (435270 5340650)

Notice that threshold, nSamp, and nQuant were not defined. That is because the defaults are threshold = 0.9 and nQuant = 10.

The first matrix output shows the quantile ratios between the sample and the covariates. A value of 1.0 indicates that samples are represented in proportion to the quantile coverage. Values > 1.0 indicate over-representation of samples, while values < 1.0 indicate under-representation.
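
A quantile ratio of this kind can be computed by hand for a single covariate. The base R sketch below uses simulated data and a deliberately biased "legacy" sample; it illustrates the ratio and the threshold test only, and is an assumption about the calculation rather than the package code.

```r
set.seed(2024)
nQuant <- 10
population <- rgamma(5000, shape = 2, scale = 5)     # toy covariate, all cells
existing <- sample(population[population > 8], 200)  # biased legacy sample

# quantile breaks taken from the population
brks <- quantile(population, probs = seq(0, 1, length.out = nQuant + 1))

# count values falling in each quantile bin
bin <- function(v) {
  tabulate(findInterval(v, brks, rightmost.closed = TRUE), nbins = nQuant)
}
pop_prop <- bin(population) / length(population)   # about 0.1 per bin
samp_prop <- bin(existing) / length(existing)

ratio <- samp_prop / pop_prop      # 1.0 means proportional representation
threshold <- 0.9
under <- which(ratio < threshold)  # quantiles that would be prioritized

round(ratio, 2)
under
```

Here the low quantiles have a ratio of 0 because the legacy sample contains no small values, so new samples would be directed there first.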

sample_ahels(mraster = mraster, 
             existing = existing, # existing samples
             nQuant = 20, # define 20 quantiles
             nSamp = 300) # total samples desired

#> Simple feature collection with 500 features and 7 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 431110 ymin: 5337730 xmax: 438550 ymax: 5343230
#> CRS:           +proj=utm +zone=17 +ellps=GRS80 +units=m +no_defs
#> First 10 features:
#>      type.x  zq90 pzabove2  zsd strata type.y  rule               geometry
#> 1  existing  9.44     39.4 2.47      1    new rule1 POINT (435890 5339450)
#> 2  existing  8.77     94.8 1.61      1    new rule2 POINT (437870 5341090)
#> 3  existing 10.50     66.2 2.89      1    new rule2 POINT (434370 5340970)
#> 4  existing  3.41     14.1 0.66      1    new rule2 POINT (437910 5342390)
#> 5  existing  8.50     58.6 2.11      1    new rule2 POINT (436210 5343010)
#> 6  existing  4.15     17.8 1.00      1    new rule2 POINT (433990 5341570)
#> 7  existing  5.68     68.3 1.23      1    new rule2 POINT (433610 5343230)
#> 8  existing 10.70     39.7 2.93      1    new rule2 POINT (435930 5339590)
#> 9  existing  6.52     10.3 1.66      1    new rule2 POINT (434910 5339530)
#> 10 existing  9.69      3.7 2.54      1    new rule2 POINT (435270 5340650)

Notice that the total number of samples is 500. This value is the sum of existing samples (200) and number of samples defined by nSamp = 300.

sample_existing

Acknowledging that existing sample networks exist is important. There is significant investment in these samples, and in order to keep inventories up to date, we often need to collect new data at these locations. The sample_existing() algorithm provides a method for sub-sampling an existing sample network should the financial or logistical resources not be available to collect data at all sample units. The algorithm leverages Latin hypercube sampling using the clhs package to effectively sample within an existing network.

The algorithm has two fundamental approaches:

  1. Sample exclusively using the sample network and the attributes it contains

  2. Should raster information be available and co-located with the sample, use these data as population values to improve sub-sampling of existing.

Much like the sample_clhs() algorithm, users can define a cost parameter, which will be used to constrain sub-sampling. A cost parameter is a user-defined metric/attribute such as distance from roads (e.g. calculate_distance()), elevation, etc.

Here are some basic examples:

Basic sub-sampling of existing

First, we create an existing dataset for our example. Let’s imagine we have a systematically sampled dataset of ~900 samples, and we know we only have resources to sample 300 of them. We have some ALS data available (mraster), which we will use as our distributions to sample within.

#--- generate existing samples and extract metrics ---#
existing <- sample_systematic(raster = mraster, cellsize = 200, plot = TRUE) %>%
  extract_metrics(mraster = mraster, existing = .)

We see our systematic sample. Notice that we used extract_metrics() after creating it. If the user provides a raster to the algorithm, this isn’t necessary: when no attributes are present, extraction is handled internally. If only samples are given, however, attributes must be provided, and sampling will be conducted on all included attributes. Now let’s sub-sample within it.

#--- sub-sample using the attributes of existing only ---#
sample_existing(existing = existing, # our existing sample
                nSamp = 300, # the number of samples we want
                plot = TRUE) # plot

#> Simple feature collection with 300 features and 3 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 431123.9 ymin: 5337704 xmax: 438501.5 ymax: 5343231
#> CRS:           +proj=utm +zone=17 +ellps=GRS80 +units=m +no_defs
#> First 10 features:
#>      zq90 pzabove2  zsd                 geometry
#> 207 20.00     87.8 5.29 POINT (437375.9 5342066)
#> 415  3.44     12.7 0.70 POINT (435960.4 5342913)
#> 221  9.72     86.9 2.34 POINT (436546.9 5340018)
#> 268 11.90     57.2 3.34 POINT (435874.2 5339177)
#> 79   9.36     52.6 2.44 POINT (437463.5 5338905)
#> 13  15.80     80.4 4.31 POINT (437997.4 5337908)
#> 466 13.60     98.9 2.67 POINT (435577.6 5343029)
#> 584  3.92     32.0 0.83   POINT (433983 5341213)
#> 899 18.40     89.7 3.54   POINT (431477 5342598)
#> 85  18.00     58.5 5.00 POINT (437811.3 5340054)

We see from the output that we get 300 samples that are a sub-sample of the original existing sample. The plotted output shows cumulative frequency distributions of the population (all existing samples) and the sub-sample (the 300 samples we requested). Notice that the distributions match quite well. This is a simple example, so let’s do another with a bit more complexity.
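
How well two cumulative distributions match can also be checked numerically. The sketch below uses simulated values and a plain random sub-sample (not Latin hypercube sampling) purely to show the comparison; the data and variable names are assumptions.

```r
set.seed(11)
population <- rnorm(900, mean = 12, sd = 5)  # toy metric, all existing samples
sub <- sample(population, 300)               # plain random sub-sample

# largest vertical gap between the two empirical cumulative distributions
grid <- sort(population)
gap <- max(abs(ecdf(population)(grid) - ecdf(sub)(grid)))
gap
```

A gap close to 0 means the sub-sample reproduces the population distribution closely; larger values flag parts of the distribution that the sub-sample under- or over-represents.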

Sub-sampling using raster distributions

Our systematic sample of ~900 plots is fairly comprehensive; however, we can generate a true population distribution through the inclusion of the ALS metrics in the sampling process. The metrics will be included in internal Latin hypercube sampling to help guide sub-sampling of existing.

#--- sub-sample using raster distributions ---#
sample_existing(existing = existing, # our existing sample
                nSamp = 300, # the number of samples we want
                raster = mraster, # include mraster metrics to guide sampling of existing
                plot = TRUE) # plot

#> Simple feature collection with 300 features and 3 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 431111.7 ymin: 5337716 xmax: 438554.1 ymax: 5343237
#> CRS:           +proj=utm +zone=17 +ellps=GRS80 +units=m +no_defs
#> First 10 features:
#>         zq90 pzabove2  zsd                 geometry
#> 203100 20.60     86.6 4.30   POINT (437144 5341300)
#> 3289   22.00     93.3 4.59   POINT (436013 5341016)
#> 79463  20.50     95.1 3.95 POINT (432451.6 5341676)
#> 1781   19.00     84.4 5.33 POINT (436987.7 5340094)
#> 742    11.40     79.4 3.14 POINT (432370.8 5340029)
#> 90010  20.20     85.5 6.61 POINT (431534.9 5342790)
#> 295     8.56     39.3 2.28 POINT (435798.7 5339618)
#> 61386  19.10     83.4 4.44 POINT (433907.5 5341654)
#> 8678    9.27     81.3 2.36 POINT (431222.3 5340377)
#> 3416   19.20     89.4 6.11 POINT (438385.6 5339880)

The sample distribution again mimics the population distribution quite well! Now let’s try using a cost variable to constrain the sub-sample.

#--- create distance from roads metric ---#
dist <- calculate_distance(raster = mraster, access = access)
#--- sub-sample using a cost constraint ---#
sample_existing(existing = existing, # our existing sample
                nSamp = 300, # the number of samples we want
                raster = dist, # include mraster metrics to guide sampling of existing
                cost = 4, # either provide the index (band number) or the name of the cost layer
                plot = TRUE) # plot

#> Simple feature collection with 300 features and 4 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 431146.8 ymin: 5337716 xmax: 438541.9 ymax: 5343232
#> CRS:           +proj=utm +zone=17 +ellps=GRS80 +units=m +no_defs
#> First 10 features:
#>         zq90 pzabove2  zsd dist2access                 geometry
#> 5596    3.58     46.7 0.64    8.962260 POINT (434348.2 5341729)
#> 14434   8.33     88.3 1.92    5.430326 POINT (436541.5 5337930)
#> 414    12.70     86.6 3.40  131.483569 POINT (435902.5 5342721)
#> 4095   19.90     72.5 6.78  573.894077 POINT (435554.7 5341573)
#> 16413  17.00      8.3 3.99   35.620475 POINT (437874.6 5342333)
#> 76237  19.20     96.5 4.45   62.229414 POINT (431715.7 5338556)
#> 12198  21.50     92.1 6.42  105.813195 POINT (437939.4 5337716)
#> 86268  15.40     74.6 3.41  251.990210 POINT (432109.2 5342616)
#> 33440  18.80     74.3 2.82  319.602934 POINT (436418.8 5342356)
#> 443100 14.20     95.3 3.33  124.498306 POINT (434012.7 5337860)

Finally, should the user wish to further constrain the sample based on access, as with other sampling approaches in sgsR, that is also possible.

#--- ensure access and existing are in the same CRS ---#

sf::st_crs(existing) <- sf::st_crs(access)

#--- sub-sample using cost and access constraints ---#
sample_existing(existing = existing, # our existing sample
                nSamp = 300, # the number of samples we want
                raster = dist, # include mraster metrics to guide sampling of existing
                cost = 4, # either provide the index (band number) or the name of the cost layer
                access = access, # roads layer
                buff_inner = 50, # inner buffer - no samples within this distance from road
                buff_outer = 300, # outer buffer - no samples further than this distance from road
                plot = TRUE) # plot

#> Simple feature collection with 300 features and 4 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 431146.8 ymin: 5337704 xmax: 438541.9 ymax: 5343237
#> Projected CRS: UTM_Zone_17_Northern_Hemisphere
#> First 10 features:
#>        zq90 pzabove2  zsd dist2access                 geometry
#> 414    11.4     79.4 3.14   171.66392 POINT (432370.8 5340029)
#> 4915   12.4     80.2 3.60   175.86487 POINT (437289.7 5338331)
#> 44436  12.7     87.9 2.96   219.41798 POINT (432103.9 5340528)
#> 31535  16.5     96.8 2.80    63.08106 POINT (434098.9 5341596)
#> 48037  18.2     76.3 3.47   101.20574 POINT (431726.4 5342732)
#> 188100 16.1     84.4 4.35   214.88028 POINT (436227.3 5342414)
#> 4542   13.9     87.7 3.39    43.50336 POINT (438060.7 5340187)
#> 128100 16.9     91.3 4.53    85.28301 POINT (435468.5 5337837)
#> 45928  12.8     84.2 2.69   295.61542 POINT (432434.1 5342309)
#> 463100 18.1     94.7 4.48    73.93218 POINT (431663.1 5340452)

The more constraints we add to the samples, the less likely we are to have strong correlations between the population and the sample, so it’s always important to understand these limitations and plan accordingly.