Working With Regional, Sub-National Statistical Products

library(regions)
library(dplyr)

Most economic, social, and environmental situations and developments have a specific territorial dimension. When a statistic is computed to describe a population parameter, the computation is often carried out for a population within the boundaries of this territory.

Probably the most widely used and cited statistics are defined within the scope of national boundaries. The population of France relates to the counting of people living within the boundaries of France (unless we specifically count French citizens living outside these boundaries.)

Most national boundaries are stable, often throughout centuries. The number of national boundaries is approaching 200, which means that global datasets have boundary changes almost every year somewhere. Just the current European Union has witnessed several such changes in the last decades with the unification of Germany, the breakup of current member states Czechia and Slovakia, and the breakup of the former Yugoslavia to several current and prospective member states. When we are working with long-term data panels, obviously we have to make sure that we are comparing Czechia, Slovakia correctly with Czechoslovakia.

National boundary changes require complicated diplomatic negotiations and they are meant to be difficult, but changes of boundaries within a national boundary, i.e. the changes of sub-national or regional boundaries is very frequent. In fact, it is so frequent that there is no consistent way of describing them globally. Probably the most comprehensive international system, the NUTS of the European Union contains dozens of changes every three years. It describes currently 1845 territorial units, but far more are being used within the 27 member states of the EU. [The NUTS is de facto used in a number of non-EU member states, too, and much of these uses will be part of the NUTS2021 definition.]

Unless we work with shapefiles that contain the exact boundaries of territorial units, and we have access to microdata to replicate the statistical computation over varying boundaries, for example, before and after a regional border change, we cannot be sure that we are comparing exactly the same thing. The population of a region may have increased because more people were born or moved there than those who moved away or died, or simply because the boundaries were extended and new people who were earlier assigned to a different region became part of the count.

To make sure that we are working with the same territorial units, i.e. countries, provinces, (member or federal) states, administrative units, we must unambiguously attribute observation of our statistics to a territorial unit. This requires the use of controlled vocabularies as geographical identifiers.

Geographical names are not particularly useful identifiers in international datasets, because the names are subject to many changes in term of spelling, renaming, character codes or natural language that describes them and so on. The geographical names of the European Union use various extensions of the Latin alphabet, furthermore Cyril and Greek characters. In global datasets, we must deal with a lot more characters.

A machine-readable, controlled vocabulary, the geocode is used to describe the geographical territories with Latin alphanumerical combinations that are usually easy to use programmatically, too, as variable names.

The ISO standard for geographical territories is the ISO-3166-1 standard for national boundaries and ISO-3166-2 standard for sub-national divisions, such a province in the Netherlands, states in the USA or Australia, or Land in Germany. The ISO-3166-2 is rather often used for statistical purposes; however, it is mainly a public administration metadata that is very impractical for statistical use. Its greatest disadvantage is that it is not well versioned. Because of the very frequent changes before in sub-national boundaries, we can almost never be sure if a certain ISO-3166-2 refers to the same boundary. Usually, it refers to a same legal entity, for example, a county with its own local regulations, budgets and contracts; municipalities however switch country borders. For legal purposes, the ISO-3166-2 is good. For one-time, cross sectional comparison of data, the ISO-3166-2 is usable. For any statistical computation that is periodically repeated, it is almost useless.

The European Union uses its own nomenclature, the NUTS, whose geocodes, and geographical names are controlled vocabularies with about new editions every three years. When boundaries or names are changing, usually the geocode is also changing for disambiguation. Furthermore, the NUTS is a hierarchical system, which is very helpful for missing data imputation. It is a good system to join statistical data tables across time, for example, to compare the regional employment data of the European Union in 2010 and 2020. Yet it is surprisingly difficult to use.

A typical workflow is this:

  1. Validate the geographical metadata (for which periods do you have data where you know which the territory was used for the statistical computation?)

  2. Filter unchanged regions – these can be joined from data tables.

  3. Recode regions that whose boundaries did not change to one single controlled vocabulary, such as the current, 2016 edition of NUTS; failing to do so will result in smaller data tables, and loss of observations that were recoded.

  4. Aggregate and disaggregate unambiguous changes to the units in the NUTS2016 – in these cases all the statistical information is present in different form, and any other imputation method would result in loss of existing data.

  5. Impute or estimate other regions, using the structural information about the remaining data. This will result in far more superior data than imputation techniques designed for non-aggregated, individual data points; usually a large part of the information is present in statistical tables that can be used for proximation.

When you are working with individual data, you are omitting steps 2-4. When working with national data, especially with small datasets, we can often join data tables by a country name or country code without the problems of 2-4, but even in these cases, imputation rules like using the median value for missing cases is obviously wrong. If the missing value represents a very small member state like Malta, which has a total population smaller than most capital cities in Europe, then using a median or mean country value will very much overestimate a count data for Malta.

There are two main international efforts are taking place to standardize sub-national statistics: on the level of the European Union and on the level of the OECD. Our programmatic solution aims to work with both international data sources, and as much as possible, with any country’s data that uses the ISO-3166-2 standard. However, at this stage we have only included the typology used by the European Union in helper functions and metadata (typological vocabulary) files.

Using a Consistent Typology

The European Union harmonises the national statistical systems of 27 members states, 4 EEA countries and several other potential member states. Its regional typology, the hierarchical NUTS system is the most complex international system, and the regions package is mainly developed to help working with European regional data. However, the tasks related to such regional typologies are not unique the European Union. They are present in almost any national statistical system, and they can be similarly complex in federal states like the United States or Australia.

The Methodological manual on territorial typologies 2018 edition introduces three types of typologies.

The availability of the sub-national statistical products is lower in terms of data table completeness or timeliness. This is often due to the fact that the creation of sub-national statistical products poses significant methodological challenges.

The changes of national or provincial boundaries are rare – within a single polity. However, when we work with global data panels, or data panels of many sub-national territorial units, boundary changes are rather frequent. On a national level, just within Europe we have witnessed in the last three decades the dissolution of federal states like the Soviet Union, Yugoslavia and Czechoslovakia and the unification of Germany. Especially the break-up of the former Yugoslavia took a long time, meaning that annual national statistics contain different configurations over the years for Serbia, Montenegro and Kosovo.

On sub-national levels, within Europe regional boundary changes are so frequent that they happen on average every three years. Of course, at one time, this procedure leaves most regional boundaries intact, and change only a relatively small number of boundaries. Whenever we work with European regional data, we still have to adjust panels of statistical aggregates every three years; similarly, we must validate any regional time series every three years.

In either case, producers of statistical products do not necessarily re-cast historical statistical data for the new national or sub-national boundaries. For example, when the regional boundary definition within Central Hungary (HU1) changed in 2016, Eurostat did not require the Hungarian national statistical authorities to re-cast Hungary’s NUTS2 level data according to the new boundary definition, i.e. creating separate observations for HU10 (Budapest) and HU12 (surrounding Pest county.) Earlier HU101 (Budapest) on NUTS3 level became a NUTS2 region under the same name. Furthermore, the former NUTS2 region Közép-Magyarország or Central Hungary still exists on NUTS1 level under the code HU1. Although you will not find a history for the HU10 code before 2017, in most cases, the very same data about Budapest can be found under the code HU101 in NUTS3 datasets, or from national statistical sources that follow Hungary’s ISO-3166-2 borders.

To put it another way, the HU-BP ISO-3166-2 code, Budapest, did not change, it was just reassigned from the NUTS3 typology to the NUTS2 typology, which is reflecting the size of Budapest better. Budapest can be better compared with NUTS2 regions than NUTS3 small regions.

Validation

A very important and basic task is the validation of our data: which data is valid in a certain typology? For example, if we have a data source from 2014 that describes data within Europe in 2014 and we currently download data from the Eurostat website in 2020, we may find a discrepancy in how the regions are described. The dataset from 2014 is likely to represent sub-national territories according to their NUTS2013 boundary definition, and the new dataset according to the currently valid NUTS2016 boundary definition. We may not be able to correctly join the two datasets.

The validate function family helps in finding out to conformity of your regional coding with various typologies. In the first CRAN release, only Eurostat typologies are coded, but some of the functions can be used with other typologies, for example, with the ISO-3166-1 and ISO-3166-2 typologies. OECD uses the NUTS typologies for the EU member states, so these typologies are the most frequently used in international statistics.

Eurostat offers so-called correspondence tables to follow boundary changes, recoding and relabelling for all NUTS changes since the formalization of the NUTS typology. Unfortunately, these Excel tables do not conform with the requirements of tidy data, and their vocabulary for is not standardized, either. For example, recoding changes are often labelled as recoding, recoding and renaming, code change, Code change, etc.

The data-raw library contains these Excel tables and very long data wrangling code that unifies the relevant vocabulary of these Excel files and brings the tables into a single, tidy format , starting with the definition NUTS1999. The resulting data file nuts_changes is included in the regions package. It already contains the changes that will come into force in 2021.

The vignette Validating Your Typology gives a real-life example to the use of these functions. Validations are the starting points for Recoding and re-labelling, Aggregation And Re-Aggregation and Disaggregation And Projection operations.

Recoding and re-labelling

In the European Union, the most frequent regional statistical change is recoding and relabelling. Such changes do not affect the statistical data, only the statistical metadata.

For example, the overseas department and region of France, Réunion, was officially renamed La Réunion in 2010: this is relabelling. In earlier sources, you may find data about this region under the identifier Réunion, and later you may find it extended with the French article La Réunion. But the data from Réunion in 2009 and La Réunion in 2012 refers to the same territory. Furthermore, when France changed its administrative boundaries, it changed their official code abbreviations. Although the typology of La Réunion did not change, to indicate that a dataset is comparable across French departments according to the new administrative borders (which did change in many other regions) the geolocational code of this region changed from FR94 to FRA4 in 2013 and to FRY4 in 2016. This is recoding.

The geocodes of Eurostat are elements of a controlled metadata vocabulary. To avoid confusion, when a part of the vocabulary changes, for example, when France is changing some elements in its NUTS regions and their coding, it often changes the complete vocabulary, including for unchanged elements, for clarity. This is why FR94 becomes FRA4 even though it is describing the same overseas department and region, La Réunion.

If you want to work with French or international data that include any statistical information about the overseas departments, you must make sure that observations coded with FR94, FRA4, FRY4 are treated as a time-series of the La Réunion and not as different entities.

Names are often dubious identifiers. La Réunion remains in the ISO-3166-1 and ISO-3166-2 libraries Réunion, which is no longer used as name in NUTS. What makes working with ISO and NUTS vocabularies is that while NUTS is “versioned”, ISO is not, so a correspondence for a given year, for example, for the year 2013 is almost impossible to create. This is why the use of Eurostats NUTS geo codes, which are usually unique in a given year, should be preferred over ISO.

Many of Eurostat’s regional datasets contain recoded geographical IDs that violate the definition of tidy data, because under the same geo column they may contain observations for FR94, FRA4, FRY4. However, this is not a tidy description of the territory. Like in our nuts_changes data, they should be reported in three different columns, code_2010, code_2013 and code_2016. The descriptive metadata code, i.e. row identifier FR94 is compatible with FR71 for example, but not with FRY4.

If you are joining data from different tables, you must not join by their geo code, but by the explicit code_2013 geo codes for the observation described or aggregated according to the NUTS2013 regional boundaries and code_2016 for the NUTS2016. In other words, you must make sure that you join by valid elements in a given version of NUTS.

Another important note: if you join with the raw geo data, you will often have seemingly missing data. You may have data available for FR93, FR94, FRY3, FRY4, with missingness for the first two in 2016 and missingness for the latter two in 2010, but in fact, these observations or related to two, unchanged territorial units (Guyane and La Réunion.) You must choose one consistent coding, either using only FRY4 for La Réunion or FR93 for La Réunion. You can also identify FR93, FRY3 explicitly with the name Guyane. However, you must choose between the use of La Réunion or Réunion.

data(nuts_changes)
overseas <- nuts_changes[ 
  which(nuts_changes$geo_name_2016 %in% c("La Réunion", "Guyane")),
  ]

knitr::kable(
  overseas[, c("code_2010", "code_2013", "code_2016",
               "geo_name_2010", "geo_name_2013", "geo_name_2016")]
  )
code_2010 code_2013 code_2016 geo_name_2010 geo_name_2013 geo_name_2016
FR93 FRA3 FRY3 Guyane Guyane Guyane
FR94 FRA4 FRY4 Réunion La Réunion La Réunion
FR930 FRA30 FRY30 Guyane Guyane Guyane
FR940 FRA40 FRY40 Réunion La Réunion La Réunion

This is in fact a simpler recoding than the earlier example of Budapest. La Réunion did change its name and code, so you find its data under different labels, but it is still within the NUTS2 typology. Its comparators are likely to be found in the same statistical table. In the case of Budapest, the name did not change, but the change of the geo label reflects the fact that Budapest was moved from the NUTS3 typology to the NUTS2 typology, because it can be much better compared with NUTS2 regions. For data concerning the time period before 2017, you need to look for data about Budapest in datasets that follow the NUTS3 typology, or in national data sources, but you can fill in these historical data to pre-2017 cells of your NUTS2 regional data. The size of Budapest did not change that much - it should have been compared with NUTS2 regions in 2014, like in 2020, and not with smaller NUTS3 units.

See in more detail the vignette Recoding & Relabelling for our programmatic solution.

Imputation

R has many packages that implement imputation algorithms designed for non-aggregated, individual data. In almost all the cases, these methods are not adequate for territorially aggregated data. With individual data imputation, we usually take a typical individual data point (such as the median value) or hypothesize a probability distribution of unobserved data and pick a hypothetically likely observation. We must never use such methods when data among observations is differently aggregated. Furthermore, very often some elements of the information, or all the information is present in the dataset, and a far less biased estimate can be given to the missing data.

Disaggregation And Projection

Often, we have missing data about smaller territorial units, and we would like to impute their wider region’s representative data to a higher density dataset. For example, we do not have data for Schwabia, which is a region of the German federal state of Bavaria, and therefore we would like to impute the Bavarian (NUTS1) level data to this NUTS2 level missing observation for Schwabia. This is a valid imputation technique for non-aggregating descriptive statistics, such as mean or median values derived from surveys.

A typical problem of missingness is that we are comparing NUTS1 or NUTS2 level regions of Europe, and we do not find information about the small members states of Cyprus, Estonia, Luxembourg, Malta, because, due to their size, these countries do not have such divisions. Because of their size and territorial integrity, they can be well compared with Schwabia, for example, but the data is missing in a NUTS2 data table. In this case, we may want to impute the technical NUTS0 (country) level data into the NUTS2 data table, creating the technical NUTS2 data of CY00, EE00, LU00 or MT00, which is identical to the national CY, EE, LU, MT observations. In this case, the imputation is valid to any type of data, even to count data like population, because EE = (NUTS0) EE0 (NUTS1) = EE00 (NUTS2).

The function impute_down creates a generic solution to the problem when we want to project data from a larger territorial unit to a smaller one. For example, some state level data is missing for certain Australian or German federal states, and we want to use the national (average) values for imputation.

The function impute_down_nuts gives and EU-specific solution, with a single data frame as input. The mixed_nuts_example data table shows a mixture of potential problems in a small, fictional dataset with random numbers. For simplicity, we take the fictional, country-level data from Malta, and input it down to all potential sub-divisions.

Malta is a very transparent example, because the country has only 2 subdivisions on the lowest, NUTS3 level, i.e. MT001 and MT002. You can find Malta among countries (technical NUTS0 level) as MT, among NUTS1 regions as MT0 (because there are no subdivisions on this level), and among NUTS2 regions again as MT00. For any data that cannot be assigned for a geographical location or area, the technical codes ending with Z are used.

So, what are our possibilities if we have only country-level MTdata, but we would like to use it in a NUTS3 database and impute this variable to both MT001 and MT002. (You can safely do this with average values taken from a survey, but you should not do this with additive values such as the population.)

data("mixed_nuts_example")
mixed_nuts_example %>%
  dplyr::filter(substr(geo,1,2) == "MT") %>%
  knitr::kable()
geo values method
MT 93.3514 actual

Malta offers two technical imputation slots on NUTS1 level, MT0 for territorially attributed data, and MTZ for non-territorial exceptions. On NUTS2 level, MT00 and MTZZ. And at last, what is a real imputation, you can impute the MT = MT0 = MT00 values to MT001, should you have only data for MT002, etc. (In this simple example, we impute the MT values to both subdivisions, MT001 and MT002.)

data("mixed_nuts_example")
impute_down_nuts(mixed_nuts_example) %>%
  dplyr::filter(substr(geo,1,2) == "MT") %>%
  knitr::kable ()
geo values method typology valid_2016
MT 93.3514 country TRUE
MT0 93.3514 imputed from country MT nuts_level_1 TRUE
MT00 93.3514 imputed from country MT nuts_level_2 TRUE
MT001 93.3514 imputed from country MT nuts_level_3 TRUE
MT002 93.3514 imputed from country MT nuts_level_3 TRUE
MTZ 93.3514 imputed from country MT nuts_level_1 TRUE
MTZZ 93.3514 imputed from country MT nuts_level_2 TRUE
MTZZZ 93.3514 imputed from country MT nuts_level_3 TRUE

Aggregation And Re-Aggregation

The European Union sometimes publishes data on NUTS2 level regions, but not on the larger NUTS1 level macroregions or provinces. In many cases, this information is fully present in the NUTS2 datasets, and a simple aggregation can produce a NUTS1 level dataset, because each NUTS2 region is strictly assigned to one broader NUTS1 macroregion. For count-type data, we can simply aggregate the constituent NUTS2 level data to NUTS1 level.

A frequent problem is the re-assignment of a smaller territorial units, such as lower level administrative unit or a lower NUTS unit to another larger unit, or a pre-existing larger unit.

For example, LT02, i.e. Vidurio ir vakarų Lietuvos regionas in Lithuania consists of the former LT00 (undivided) Lithuania minus LT00A, i.e. Vilnius. LT01, or Sostinės regionas, i.e. the metropolitan area of Vilnius is a separate region from 2016. If you have national and NUTS3 or Vilnius municipality level data from Lithuania, you can present it according to the undivided NUTS2013 definition or the more granular NUTS2016 definition.

The Southern region of Ireland, IE05 was created in the same year from the earlier NUTS3 level regions IE023, IE024, and IE025. You can fill in the missing data for earlier years with the simple additive formula IE05=IE023+IE024+IE025 for count/additive variables, or you can create a re-weighted average for average values.

This information can be found in the change_2016 of the nuts_changes dataset.

data(nuts_changes)
southern <- nuts_changes[ 
  which(nuts_changes$geo_name_2016 == "Southern"), ]
knitr::kable(
  southern[, c("code_2013", "code_2016", "geo_name_2016", "change_2016")]
  )
code_2013 code_2016 geo_name_2016 change_2016
NA IE05 Southern new region, made from ex-ie023, ie024 and ie025

Boundary Changes With No Solution

In some cases the boundary change cuts across all territorial typologies that re-configuration is impossible from aggregated data. Unless we have access to the individual level data (which is not within scope of the tasks helped by this package), it is not possible to create time-wise comparable data for these units. We must treat the observations on these units logically missing, which may or may not be imputed with any techniques. It is usually helpful to mark territorial observations in panels or time series that have such a structural change.

Time-series

Because of the complexity of sub-national data, there are likely to be missing elements in time-wise comparisons. For example, we may find that some data is available for many years on a German federal level, and for most states, but a particular annual data is missing for Bavaria. It would be tempting to use a simple interpolation of the available data, but often a naïve interpolation is not the best method, because, as we have seen, often the data is available at a different aggregation level or under a different metadata label. It is imperative to first make sure that we exploit all information that is present in a different territorial structure for any given year, and then start inter-temporal imputation with interpolation, or carry forward/backward, forecasting and backcasting algorithms.

At last, we may go back to the cross-sectional data, and may fill in some cross-sectional data gaps with interpolated or extrapolated data. Aggregating an estimate for Bavaria from previous or next-years constituent lower level (NUTS2 or NUTS3) data is very likely to be a far better estimate than any imputation results, for example, a simple substitution of a median value from all German or EU NUTS1 units. The territorial aggregation structure gives plenty of exploitable information.

Boundary definitions

Our package uses a number of boundary definitions that are easily expanded. Our functions are generic, i.e. they treat boundary information as an input variable. Well formatted metadata about such typologies can be used with any country.

The Online Browsing Platform (OBP) of the International Organization for Standardization contains all coding information, and metadata about recent changes for each country. (See, for example: ISO 3166 — Codes for the representation of names of countries and their subdivisions: NL - Netherlands (the))

It would be tempting to use ISO-3166-2 definitions for sub-national boundaries, however, the ISO-3166-2 itself is changing relatively often, and in time series or panel data we can only refer to a certain version of the ISO-3166-2. All the boundary definitions are valid for a certain time period and certain boundaries. We currently incorporated only European metadata for the last 20 years, and we are planning to continue with OECD metadata in the forthcoming releases. However, we are welcoming any pull requests for other countries or country-groups.