Geospatial Regression Equation for European Nutrient losses (GREEN)


This Vignette illustrates how to use the GREENeR package for assessing annual time series of nutrient (total nitrogen TN and total phosphorus TP) loads in surface water from a basin or region of interest, including assessing land and river retention, and contribution shares by source.

Description of the tool

GREENeR is an open-source R package for assessing annual time series of nutrient loads in a river network and at the basin outlets, and contributions of nutrient sources to these loads. The package provides tools and methods to apply the model Geospatial Regression Equation for European Nutrient losses (GREEN; (B. Grizzetti et al. 2005; B. Grizzetti, Bouraoui, and De Marsily 2008; Bruna Grizzetti, Bouraoui, and Aloe 2012; Bruna Grizzetti et al. 2021)) to an area of interest. A brief description of the model, including sources and parameters, can be found at the end of this document.

The package includes functions to perform graphical summaries of model inputs, calibrate model parameters, run sensitivity analysis checks, and visualize model inputs and outputs through graphs and maps of total loads and contributions by source. The package works for both total nitrogen (TN) and total phosphorus (TP). It allows the analysis of different scenarios of nutrient input in the river basin or region of study. The package is parallel-capable to alleviate the computational burden in large basins.

An example case: the Lay river Basin

GREENeR functionalities are illustrated for the Lay river basin, in France. Please note that the basin data are extracted from a pan-European dataset (Bruna Grizzetti et al. 2021) for the purpose of showing the package tools and the analysis steps for a generic region. Since the dataset may not reflect local sources correctly, implications of results in terms of nutrient fluxes may be incorrect and are not evaluated nor commented herein.

The package includes two nutrient type scenarios (one for TN and one for TP), with disaggregated information on nutrient inputs for all catchments that compose the basin (spatial resolution of the Catchment Characterisation Model; (De Jager and Vogt 2007)). Both input (TN and TP) time series span from 1990 to 2018. The dataset includes also spatial information in vectorial format to enable mapping of tool results.

Loading and description of the scenario

# load the GREENeR package
library(GREENeR)

GreeneR needs information on the topology of the catchment and nutrient sources. All information is contained in two data.frames. Note that data.frame structures and column names must be respected when creating a new dataset, otherwise the functions will fail.

The topology is contained in the data.frame “catch_data_TN” (or catch_data_TP; in the example dataset, the Lay river Basin topology is presented).

# load the topological sequence of catchments and complementary info
data(catch_data_TN)
head(catch_data_TN)
#>     HydroID To_catch Shreve LakeFrRet NrmLengthKm
#> 1    361076   361195      1         0    0.010270
#> 28   361521   362009      4         0    0.059295
#> 55   362311   362426      2         0    0.011319
#> 84   361187   361195      2         0    0.011948
#> 109  361383   361521      1         0    0.005731
#> 136  362149   362175      1         0    0.002374

The fields are defined as follow:

The nutrient sources in each catchment per year are contained in the data.frame “annual_data_TN” for the TN scenario (“annual_data_TP” for the TP scenario).

# load the sources of TN for each year and catchment
data(annual_data_TN)
head(annual_data_TN)
#>   BasinID YearValue HydroID NextDownID   Atm    Min    Man   Fix  Soil    Sd
#> 1  291994      1990  361076     361195 9.608 28.111 21.379 2.909 2.108 0.514
#> 2  291994      1991  361076     361195 9.608 28.295 20.318 2.888 2.108 0.589
#> 3  291994      1992  361076     361195 9.608 26.482 21.514 2.917 2.108 0.665
#> 4  291994      1993  361076     361195 9.608 24.060 19.861 2.868 2.108 0.741
#> 5  291994      1994  361076     361195 9.608 23.725 20.199 2.890 2.108 0.816
#> 6  291994      1995  361076     361195 9.608 24.271 21.772 2.830 2.108 0.892
#>      Ps YearlyMass ForestFraction InvNrmRain
#> 1 0.314         NA      0.1700787   0.081113
#> 2 0.251         NA      0.1700787   0.075192
#> 3 0.188         NA      0.1700787   0.069081
#> 4 0.126         NA      0.1700787   0.066110
#> 5 0.063         NA      0.1700787   0.047314
#> 6 0.000         NA      0.1700787   0.061390

# load the sources of TP for each year and catchment
data(annual_data_TP)
head(annual_data_TP)
#>   BasinID YearValue HydroID NextDownID    Bg   Min   Man    Sd    Ps YearlyMass
#> 1  291994      1990  361187     361195 0.082 8.577 3.108 0.330 0.315         NA
#> 2  291994      1991  361187     361195 0.082 8.300 2.819 0.319 0.300         NA
#> 3  291994      1992  361187     361195 0.082 7.688 3.393 0.307 0.285         NA
#> 4  291994      1993  361187     361195 0.082 6.614 3.660 0.296 0.270         NA
#> 5  291994      1994  361187     361195 0.082 6.119 4.508 0.284 0.256         NA
#> 6  291994      1995  361187     361195 0.082 6.025 4.676 0.273 0.241         NA
#>   ForestFraction InvNrmRain
#> 1              0   0.081765
#> 2              0   0.075672
#> 3              0   0.068371
#> 4              0   0.066009
#> 5              0   0.047398
#> 6              0   0.061488

In the annual data frames some fields are in common:

The other fields specify nutrient sources, and differs in the two nutrient scenarios. In the case of TN, fields are:

In the case of TP fields are:

For some functionalities of the GREENeR package, the geographical information of the catchments is required in the form of an sf (simple feature) object (the sf package, Pebesma & Bivand, implements the Simple Features open standard)

# load the geographical information of the basin (require for some functionalities)
data(sh_file)
head(sh_file)
#> Simple feature collection with 6 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 3440600 ymin: 2645700 xmax: 3476900 ymax: 2691300
#> Projected CRS: ETRS89-extended / LAEA Europe
#>   OBJECTID HydroID NextDownID BasinID JunctionID AreaSqkm DrainAreaS Shreve
#> 1    71204  368447         -1  291994   20837332    43.68    1971.10     95
#> 2   350329  362484     362498  291994   33061724     0.58       0.58      1
#> 3   350337  368467     368429  291994   33061732    21.37      21.37      1
#> 4   350345  362497     362454  291994   33061740    25.52      25.52      1
#> 5   350346  362499     362498  291994   33061741     0.03     158.18      8
#> 6   350347  362498     362537  291994   33061742     1.44     160.20      9
#>     FecID   CsbID SuppID  Csb0ID Shape_STAr Shape_STLe
#> 1 6031419 7000265  71204 7100074   43680000      59800
#> 2 6031445 7000265 350329 7100074     580000       3600
#> 3 6031420 7000265 350337 7100074   21370000      32600
#> 4 6031450 7000265 350345 7100074   25520000      36400
#> 5 6031445 7000265 350346 7100074      30000        800
#> 6 6031445 7000265 350347 7100074    1440000       8000
#>                         geometry
#> 1 MULTIPOLYGON (((3447500 264...
#> 2 MULTIPOLYGON (((3476600 268...
#> 3 MULTIPOLYGON (((3440600 265...
#> 4 MULTIPOLYGON (((3458000 268...
#> 5 MULTIPOLYGON (((3476700 268...
#> 6 MULTIPOLYGON (((3476000 269...

Exploring scenario inputs

The library includes functions to examine the nutrient sources in a Basin, and explore how these are applied, either globally, or distributed over time or in space.

A first summary is provided by the input_plot() function, which - when option “B” is given, shows the mean annual nutrient loads per source for the whole period of analysis.

# the barplot for the Lay TN and TP Scenarios
input_plot(annual_data_TN, sh_file, "Lay", "B")
input_plot(annual_data_TP, sh_file, "Lay", "B")
Barplot for the Lay TN scenario.

Barplot for the Lay TN scenario.

Barplot for the Lay TP scenario.

Barplot for the Lay TP scenario.

By selecting option “D”, input_plot() returns density plots of nutrient loads per catchment. These are useful to see the distribution of nutrient source inputs in the Basin.

# the density plots for the Lay TN and TP Scenarios
input_plot(annual_data_TN, sh_file, "Lay", "D")
input_plot(annual_data_TP, sh_file, "Lay", "D")
Density plot for the Lay TN scenario.

Density plot for the Lay TN scenario.

Density plot for the Lay TP scenario.

Density plot for the Lay TP scenario.

The function input_Tserie() allows to examine the temporal evolution of the inputs. There are three options: displaying cumulative total amounts (“gr1”), total amounts (“gr2”), or total amounts per unit area (“gr3”). The latter graph is particularly useful when catchment sizes are when the range of catchment areas is large to compare nutrient application intensity between catchments.

In the Lay river example, these graphs show, among other things, how the amount of P from mineral fertilization (Min) has decreased over time, whereas the amount of P from manure has increased.

# the time serie plot type 1 (areas)
input_Tserie(catch_data_TN, annual_data_TN, sh_file, "Lay", "gr1")
input_Tserie(catch_data_TP, annual_data_TP, sh_file, "Lay", "gr1")
Time series plot type 1 (areas) for TN and TP.Time series plot type 1 (areas) for TN and TP.

Time series plot type 1 (areas) for TN and TP.

# the time serie plot type 2 (lines)
input_Tserie(catch_data_TN, annual_data_TN, sh_file, "Lay", "gr2")
input_Tserie(catch_data_TP, annual_data_TP, sh_file, "Lay", "gr2")
Time series plot type 2 (lines) for TN and TP.Time series plot type 2 (lines) for TN and TP.

Time series plot type 2 (lines) for TN and TP.

Time series plot type 3 (areas by year and km2) for TN and TP.Time series plot type 3 (areas by year and km2) for TN and TP.

Time series plot type 3 (areas by year and km2) for TN and TP.

There is a fourth alternative for the function input_Tseries(), “gr4”. This option compares the levels of nutrient inputs in three zones of the basin (upper, middle and lower part). The upper zone includes the catchments up to the median (50%) shreve value, the middle zone includes the catchments from 50% to 75% of the shreve values, and the lower part includes the catchments from 75% to 100%. At the top of each graph, the corresponding area of each part is indicated. In the example below, it can be seen that differences in N inputs between the upper and lower part of the basin are noticeable over the whole period.

# the time serie plot type 4 (by km2 and Shreve)
input_Tserie(catch_data_TN, annual_data_TN, sh_file, "Lay", "gr4")
input_Tserie(catch_data_TP, annual_data_TP, sh_file, "Lay", "gr4")
Time series plot type 4, thatcompares the levels of nutrient inputs in three zones of the basin.Time series plot type 4, thatcompares the levels of nutrient inputs in three zones of the basin.

Time series plot type 4, thatcompares the levels of nutrient inputs in three zones of the basin.

The function input_maps() shows nutrient inputs distribution in the Basin. The function generates a map for each nutrient source, plus a map (in yellow-green scale) of the total diffuse nutrient sources, which can be contrasted to the point sources map. The first option (“gr1”) shows total inputs (kt/year).

# The title of the plot
mapTitle <- "Lay Basin"

# the Input Load Map by source type 1 (kt/year)
input_maps(catch_data_TN, annual_data_TN, sh_file, mapTitle, "gr1", legend_position = 3)
Input map by source type 1 for TN scenario

Input map by source type 1 for TN scenario

For the TP scenario:

# the Input Load Map by source type 1 (kt/year)
input_maps(catch_data_TP, annual_data_TP, sh_file, mapTitle, "gr1", legend_position = 3)
Input map by source type 1 for TP scenario

Input map by source type 1 for TP scenario

Option “gr2” shows inputs per area (in kt/y/km2):

# the Input Load Map by source type 2 (kt/y/km2)
input_maps(catch_data_TN, annual_data_TN, sh_file, mapTitle, "gr2", legend_position = 3)
Input map by source type 2 for TN scenario

Input map by source type 2 for TN scenario

# the Input Load Map by source type 2 (kt/y/km2)
input_maps(catch_data_TP, annual_data_TP, sh_file, mapTitle, "gr2", legend_position = 3)
Input map by source type 2 for TP scenario

Input map by source type 2 for TP scenario

Model calibration

The quality of model calibration depends on the quality and number of available observations. Before diving into calibration it is thus useful to start by exploring observations.

# Lay Basin scenario calibration for TN
# subset of data rows with reference values to be used in the calibration
TN_ref_values <- annual_data_TN[!is.na(annual_data_TN$YearlyMass),]
TN_ref_values
#>      BasinID YearValue HydroID NextDownID    Atm     Min     Man    Fix   Soil
#> 400   291994      2003  361226     361187 12.021  38.799  27.080  4.763  2.972
#> 401   291994      2004  361226     361187 11.832  39.125  27.297  4.526  2.960
#> 402   291994      2005  361226     361187 11.472  37.502  27.631  4.356  2.960
#> 969   291994      2007  363217     363876 25.208  84.777  74.632  9.129  7.660
#> 970   291994      2008  363217     363876 22.580  84.131  77.208  9.018  7.660
#> 971   291994      2009  363217     363876 23.018  71.262  86.496 10.419  7.660
#> 1793  291994      2007  362500     363555 34.932 108.107  95.171 11.641  9.768
#> 1794  291994      2008  362500     363555 30.246 107.284  98.455 11.500  9.768
#> 1795  291994      2009  362500     363555 31.590  90.874 110.299 13.286  9.768
#> 3016  291994      2007  363878     363855 10.129  33.955  29.892  3.656  3.068
#> 3017  291994      2008  363878     363855  8.744  33.696  30.923  3.612  3.068
#> 3018  291994      2009  363878     363855  9.047  28.542  34.644  4.173  3.068
#> 3633  291994      2007  365569     365715 66.515 258.580 227.638 27.844 23.364
#> 3634  291994      2008  365569     365715 59.035 256.612 235.494 27.506 23.364
#> 3635  291994      2009  365569     365715 59.319 217.360 263.824 31.779 23.364
#> 4532  291994      2003  366683     367700 35.532 191.333 133.540 23.489 14.656
#> 4533  291994      2004  366683     367700 36.139 193.511 135.009 22.385 14.640
#> 4534  291994      2005  366683     367700 32.287 185.481 136.662 21.546 14.640
#> 4535  291994      2006  366683     367700 37.763 161.308 144.156 21.797 14.640
#> 4536  291994      2007  366683     367700 37.096 162.028 142.639 17.447 14.640
#> 4780  291994      2008  366683     367700 32.642 160.794 147.562 17.236 14.640
#> 4781  291994      2009  366683     367700 32.023 136.199 165.314 19.913 14.640
#>         Sd    Ps YearlyMass ForestFraction InvNrmRain
#> 400  1.496 1.111     17.724    0.152901786   0.062910
#> 401  1.430 1.118     12.667    0.152901786   0.070694
#> 402  1.364 1.126      8.762    0.152901786   0.090762
#> 969  1.815 0.744    832.388    0.099490976   0.059976
#> 970  1.963 0.683    934.193    0.099490976   0.050057
#> 971  2.111 0.621    778.023    0.099490976   0.059081
#> 1793 4.305 4.011    351.973    0.062302006   0.060989
#> 1794 4.493 4.007    391.211    0.062302006   0.050472
#> 1795 4.682 4.003    283.156    0.062302006   0.059423
#> 3016 1.244 0.344    354.944    0.000000000   0.060897
#> 3017 1.095 0.505    418.703    0.000000000   0.050287
#> 3018 0.946 0.665    311.261    0.000000000   0.058547
#> 3633 8.045 4.348    235.181    0.019378028   0.059086
#> 3634 8.295 4.210    277.517    0.019378028   0.049728
#> 3635 8.546 4.072    190.167    0.019378028   0.056961
#> 4532 3.998 0.982   2685.982    0.001319609   0.067024
#> 4533 3.869 1.092   2476.694    0.001319609   0.073931
#> 4534 3.740 1.201   1555.908    0.001319609   0.088373
#> 4535 3.951 1.096   4353.326    0.001319609   0.055746
#> 4536 4.162 0.990   3665.792    0.001319609   0.064562
#> 4780 4.373 0.885   4288.808    0.001319609   0.054822
#> 4781 4.584 0.779   3264.154    0.001319609   0.064394

# number of reference data
nrow(TN_ref_values)
#> [1] 22

# distribution of the references by year and location
table(TN_ref_values$HydroID,TN_ref_values$YearValue)
#>         
#>          2003 2004 2005 2006 2007 2008 2009
#>   361226    1    1    1    0    0    0    0
#>   362500    0    0    0    0    1    1    1
#>   363217    0    0    0    0    1    1    1
#>   363878    0    0    0    0    1    1    1
#>   365569    0    0    0    0    1    1    1
#>   366683    1    1    1    1    1    1    1

In the example basin there are 22 observations (derived from publicly available dataset of European Environment Agency, EEA for 6 different stations from 2003 to 2009; note however that only one station has references for all years. (For the TP case there are 58 observations, collected in eight catchments from 1997 to 2012, with one complete time-series).

To run the calibration process for a scenario (function calib_green()), the following settings must be defined:

  1. The expected range for each parameter. This range is defined by two vectors of three values, one for the lower limits and one for the upper limits of the three parameters. The values correspond to each of the parameters of the model in sequence: alpha_P (eq 4), alpha_L (eq 5), and sd_coeff (eqs 6 and 7).

  2. The number of iterations to be performed during the calibration process. The higher the number of iterations, the more likely it is to achieve quality parameters, but the longer the computation time required. Although it depends on the catchment and the proposed intervals, it is recommended to run at least 200 iterations to have enough information to continue the calibration process.

  3. The years to be included in the calibration process.

# Parameter for the calibration of the model
# the lower limits for all params (alpha_P, alpha_L, sd_coef)
low <- c(10, 0.000, 0.1) 
# the upper limits for all params (alpha_P, alpha_L, sd_coef)  
upp <- c(50, 0.08, 0.9)       

# number of iterations
n_iter <- 2000    

# years in which the model should be executed
years <- c(2003:2009)

In the example below, a calibration of TN in the Lay basin is performed with the parameters indicated above. The process is parallelised and will use all the cores of the computer. The computation time depends on the computer, the basin, and the number of iterations.

# execution of the calibration  
DF_calib_TN <- calib_green(catch_data_TN, annual_data_TN, n_iter, low, upp, years)

The calibration function applies a Latin Hypercube sampling scheme to the three parameters within the possible range (defined by lower and upper limits) and evaluates model performance (predictions against available observations) for the calibration period (specified in ‘years’) by calculating several “goodness-of-fit” metrics. The function returns a dataframe with parameters and goodness-of-fit scores that can be further analyzed.

The function applies the following goodness-of-fit metrics (Althoff and Rodrigues 2021):

Exploring calibration results

Choosing the right goodness-of-fit metric to select the best set of parameters for a model largely depends on the overall study scope, the area of interest (e.g. high or low load, upper or lower catchment area), and the available observation dataset (size and quality). The library includes a series of functions to examine the calibration results and help select the most suitable set of parameters.

The calib_boxplot() function shows relationships between best parameter sets chosen according to one goodness-of-fit parameter (title of each boxplot) according to one goodness-of-fit metric (title of plot) in relation to others (x label). Only the six metrics that are used most frequently in hydrological calibration are included in this figure. In the lower panel, the figure shows as well the distribution of model parameters in the best parameter sets. The absolute best parameter set according to each goodness-of-fit metrics is marked as red dots in each boxplot.

# Generating the box plots  
rateBS <-  5  #  percent of parameters selected from the whole calibration set
calib_boxplot(DF_calib_TN, rateBS)

The calib_dot() function shows the distribution of parameters in relation to each other for a chosen goodness-of-fit metric. With this function the figure can be generated for all the previously described goodness-of-fit metrics.

# Generating the dot plots  
Gof_mes <- "NSE"
calib_dot(DF_calib_TN, Gof_mes)

In the example figure, the best results for alpha_P to achieve best NSE are in the range 0.25-0.30 (cyan color), while there are no clear patterns for the other two parameters. Also, no significant interaction between parameters can be appreciated. We can conclude that the most sensitive parameter in this case is alpha_P.

The scatter_plot() function shows all parameter realizations in the calibration dataset against a selected goodness-of-fit metric to visualize the influence of each parameter on model results. The result depends on the goodness-of-fit metric the user wants to consider. With this function the figure can be generated for all metrics generated by green_calib() function.

# Generating the scatter plots  
Gof_mes <- "NSE"
scatter_plot(DF_calib_TN, Gof_mes)

Selecting the best parameter sets

The best parameter set according to a selected goodness-of-fit metrics can be extracted with the select_params() function:

Gof_mes <- "NSE" # according the NSE goodness of fit metric
NSE_bestParams <- select_params(DF_calib_TN, Gof_mes)
NSE_bestParams
Param_NSE2 <- as.numeric(NSE_bestParams[2:4])

The function can be applied with different goodness-of-fit metric to obtain different potential model realizations:

Gof_mes <- "rNSE" 
rNSE_bestParams <- select_params(DF_calib_TN,Gof_mes)
rNSE_bestParams    
Param_rNSE2 <- as.numeric(rNSE_bestParams[2:4])

(note that actual values returned by the function could change at each calibration run)

Annual observed vs predicted values for a parameter set

Once a parameter set has been selected, it is useful to verify graphically the differences between the observations (ObsLoad) and the model predictions (PredictLoad). Included in the tool is the function simobs_annual_plot(), which allows a year-to-year comparison of observed and predicted values for a given parameter sets.

# annual scatter plot comparing observed vs modeled loads by year
label_plot <- "NSE best params for TN in the Lay" 
simobs_annual_plot(catch_data_TN, annual_data_TN, Param_NSE2[1], Param_NSE2[2], Param_NSE2[3], years, label_plot)

Comparing Observed vs Predicted for two sets of parameters

The best parameter set depends on the goodness-of-fit metric. It is advisable to pre-select alternative parameter sets (according to different GoF) and compare model results, before making the final selection of the parameter set. The compare_calib() function shows simultaneously all observations against modelled values obtained from two parameter sets.

setPlabels <- c("NSE", "rNSE")    
label_plot <- "Comparing two sets of parameters for Lay TN" 
compare_calib(catch_data_TN, annual_data_TN, Param_NSE2[1], Param_NSE2[2], Param_NSE2[3], Param_rNSE2[1], Param_rNSE2[2], Param_rNSE2[3], years, label_plot, setPlabels)

At the top left of the figure, the scores of six frequently used goodness-of-fit metrics for the two parameter sets (as identified in setPlabels) are reported.

Estimation of nutrient fluxes and source contribution in the basin

Once the most appropriate parameter set has been selected, it is possible to run the model to estimate nutrient loads across the basin.

The function region_nut_balance() runs the GREEN model with the selected parameter set, and returns the region nutrient mass balance of the nutrient fluxes for the whole simulation period. The results of this function can be visualized in a Sankey diagram with the function N4_sankey().

# Computing the nutrient balance  
years <- c(2003:2009)
Nut_bal_TN <- region_nut_balance(catch_data_TN, annual_data_TN, Param_NSE2[1], Param_NSE2[2], Param_NSE2[3], years)
# Plot the sankey plot with the result of the balance
sank <- N4_sankey(Nut_bal_TN)