raceland: R package for a pattern-based, zoneless method for analysis and visualization of racial topography

Anna Dmowska, Tomasz Stepinski, Jakub Nowosad

2019-08-24

INTRODUCTION

The raceland package implements a computational framework for a pattern-based, zoneless analysis and visualization of (ethno)racial topography. It is a reimagined approach for analyzing residential segregation and racial diversity based on the concept of ‘landscape’ used in the domain of landscape ecology.

The proposed approach adopts a bird’s-eye perspective in which visualization and quantification of racial topography (the overall organization of the spatial pattern formed by the locations of people of different races) are tightly intertwined because they use the same data: a high-resolution raster grid in which each cell contains only inhabitants of a single race. Such a grid represents a racial landscape (RL). A racial landscape consists of a mosaic of many large and small patches (racial enclaves) formed by adjacent raster grid cells that have the same race category. The distribution of racial enclaves creates a specific spatial pattern.

The racial landscape is described by an exposure matrix and quantified by two metrics (entropy and mutual information) derived from Information Theory (IT). Entropy measures racial diversity, and mutual information measures racial segregation.

The racial landscape method is based on gridded raster data and, unlike previous methods, does not depend on a division into specific zones (census tracts, census blocks, etc.). Racial diversity (entropy) and racial segregation (mutual information) can be calculated for the whole area of interest (e.g., a metropolitan area) without introducing any arbitrary divisions. The racial landscape method also allows the calculation to be performed at different spatial scales.

A COMPUTATIONAL FRAMEWORK

The computational framework implemented in the raceland package allows for:

  1. Constructing a racial landscape based on race-specific raster grids.

  2. Describing the racial pattern of a racial landscape at different scales and/or for the whole area of interest using metrics derived from Information Theory (entropy and mutual information).

  3. Mapping the racial landscape.

  4. Mapping racial diversity and segregation at different scales.

The computational framework consists of four steps (see the figure below; the blue font indicates the names of functions from the raceland package).

Input data

The racial landscape method is based on high-resolution race-specific raster grids. Each cell in a race-specific grid contains the density of that race's subpopulation. The SocScape project provides high-resolution raster grids for 1990, 2000 and 2010 for 365 metropolitan areas and for each county in the conterminous US. The data is available at http://sil.uc.edu

The calculation can also be performed using a vector shapefile whose attribute table contains race counts for aggregated data. In that case the shapefile is first rasterized using the zones_to_rasters() function from the raceland package (the people of a given race are redistributed to cells by dividing the number of people by the number of cells belonging to the particular spatial unit).
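The redistribution rule described above can be sketched in base R. This is only an illustration of the equal-share idea, not the code of zones_to_rasters(); the zone layout and the counts are hypothetical toy values.

```r
# Hypothetical toy data: a 4x4 grid of zone ids and per-zone counts for one race.
zone_id <- matrix(c(1, 1, 2, 2,
                    1, 1, 2, 2,
                    1, 1, 3, 3,
                    1, 1, 3, 3), nrow = 4, byrow = TRUE)
zone_count <- c(`1` = 80, `2` = 10, `3` = 6)

# Number of cells belonging to each zone.
cells_per_zone <- table(zone_id)

# Each cell receives an equal share of its zone's count.
per_cell <- zone_count[names(cells_per_zone)] / as.vector(cells_per_zone)
cell_value <- matrix(per_cell[as.character(zone_id)], nrow = nrow(zone_id))

cell_value
```

Zone 1 covers 8 cells, so each of its cells receives 80 / 8 = 10 people, and the sum over all cells equals the sum of the zone counts.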

Please note that the rasterization time depends on the number of divisions and can be long for large areas (e.g., metropolitan areas). It is recommended to use the smallest available divisions.

Here we demonstrate the computational framework using an example area of 60x60 cells. The input data are high-resolution (30 m) raster grids. Race-specific rasters are stored as GeoTIFFs (the directory contains 5 files for 5 race groups: Asians, Blacks, Hispanics, others and Whites).

The RasterStack with race-specific grids is created from the GeoTIFF files (files are sorted and read into the RasterStack in alphabetical order; the categories in the racial landscape depend on the order of layers in the input data, see details in the next section).

Working with vector data

When using vector data, the shapefile should be read into R using the st_read() function from the sf package. In the next step the spatial object is rasterized using zones_to_rasters() from the raceland package.

The zones_to_rasters() function requires 3 arguments:

  • v - an sf object with aggregated attribute data
  • resolution - the resolution of the output raster (below we use resolution=30, which is the same resolution as the SocScape grids)
  • variables - a character vector with column names from v. The values from these columns will be (1) rasterized and (2) recalculated to densities. Each column will be represented as a layer in the output RasterStack.

Once the vector data is rasterized, it makes no difference whether the race_raster object or the race_raster_from_vect object is used for further analysis.

Constructing racial landscape

Racial landscape is a high-resolution grid in which each cell contains only inhabitants of a single race.

A racial landscape is constructed from race-specific grids. The racial composition of each cell is translated into probabilities of drawing a person of a specific race from that cell. Thus, the race label of a cell is a random variable. To obtain a stochastic realization of the racial landscape, we use each cell’s race probabilities and a random number generator to randomly assign a specific race label to the cell (Monte Carlo procedure).

Multiple draws yield a series of realizations with slightly different patterns (see example below). The pattern uncertainty occurs only at the sub-block scale and only if there is significant sub-block racial diversity.

A single realization-based visualization is sufficiently accurate. For increased accuracy, racial topography is quantified as an ensemble average from multiple realizations. It is recommended to calculate at least 30 realizations.
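The Monte Carlo procedure can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package implementation: the three race-specific density matrices are hypothetical toy data, and uninhabited cells are assumed to receive NA.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical race-specific densities for a 3x3 area (one matrix per race).
densities <- list(
  asian = matrix(c(0, 1, 0, 0, 0, 2, 0, 0, 0), nrow = 3),
  black = matrix(c(4, 1, 0, 0, 3, 0, 0, 0, 0), nrow = 3),
  white = matrix(c(0, 2, 5, 0, 3, 2, 0, 6, 0), nrow = 3)
)

draw_realization <- function(densities) {
  # Cells in rows, races in columns.
  dens <- sapply(densities, as.vector)
  labels <- apply(dens, 1, function(d) {
    if (sum(d) == 0) return(NA_integer_)  # uninhabited cell (assumption: NA)
    # Draw one race label with probability proportional to the cell's composition.
    sample.int(length(d), size = 1, prob = d / sum(d))
  })
  matrix(labels, nrow = nrow(densities[[1]]))
}

# 30 realizations, each a matrix of race labels (1 = asian, 2 = black, 3 = white).
realizations <- replicate(30, draw_realization(densities), simplify = FALSE)
realizations[[1]]
```

Cells inhabited by a single race receive the same label in every realization; only racially mixed cells vary between draws.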

Realizations are constructed using the create_realizations() function, which takes two arguments: the RasterStack with race-specific grids and the number of realizations n.

The function returns a RasterStack object containing n realizations. The race label in the racial landscape is assigned based on the order of the race-specific grids stored in the RasterStack (for example, the race_raster object has five layers named asian, black, hispanic, other and white, so the race labels in the racial landscape raster will be 1 - asian, 2 - black, 3 - hispanic, 4 - other, 5 - white).
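The link between alphabetical file order and race labels can be illustrated with a few lines of base R. The file names below are hypothetical placeholders for the five GeoTIFFs.

```r
# Hypothetical file names, listed in arbitrary order.
files <- c("white.tif", "asian.tif", "other.tif", "hispanic.tif", "black.tif")

# Files are read in alphabetical order, so sorting gives the layer order.
layer_names <- sub("\\.tif$", "", sort(files))

# The position of a layer becomes its race label in the racial landscape.
race_labels <- setNames(seq_along(layer_names), layer_names)
race_labels
```

Adding, removing or renaming a file changes this mapping, so it is worth checking the layer order before interpreting the labels.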

Racial landscape provides a skewed visualization of the racial pattern because it does not take into consideration spatial variability of the population densities. To obtain an accurate depiction of racial distribution the values of RL must be modified to reflect race and subpopulation densities.

The plot_realization() function displays a realization while also taking subpopulation density into account. The function takes 3 arguments.

Describing racial patterns of racial landscape.

A racial pattern is described by an exposure matrix. In the domain of landscape ecology, a landscape pattern can be described by a co-occurrence matrix, which is a tabulation of cells’ adjacencies. Adjacencies between pairs of cells are defined by the 4-connectivity rule (a cell has at most 4 adjacent cells: to the north, east, south and west, as shown in the figure below).

The co-occurrence matrix has the size KxK (K - number of categories), is symmetrical and can be calculated for any region regardless of its size or shape.
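A co-occurrence matrix under the 4-connectivity rule can be tallied with a short base R sketch. It assumes that every cell is visited as a focal cell, so each adjacent pair is counted once from each side (which is what makes the matrix symmetric); the 3x3 landscape is a hypothetical toy example and the counting convention of the package may differ in detail.

```r
cooccurrence <- function(landscape, k) {
  coma <- matrix(0, nrow = k, ncol = k)
  offsets <- rbind(c(-1, 0), c(0, 1), c(1, 0), c(0, -1))  # N, E, S, W
  for (i in seq_len(nrow(landscape))) {
    for (j in seq_len(ncol(landscape))) {
      for (o in seq_len(nrow(offsets))) {
        ni <- i + offsets[o, 1]
        nj <- j + offsets[o, 2]
        inside <- ni >= 1 && ni <= nrow(landscape) &&
          nj >= 1 && nj <= ncol(landscape)
        if (!inside) next                    # neighbour falls outside the area
        a <- landscape[i, j]                 # category of the focal cell
        b <- landscape[ni, nj]               # category of the neighbour
        if (is.na(a) || is.na(b)) next       # skip cells without a label
        coma[a, b] <- coma[a, b] + 1
      }
    }
  }
  coma
}

# Hypothetical landscape with K = 3 categories.
landscape <- matrix(c(1, 1, 2,
                      1, 2, 2,
                      3, 3, 2), nrow = 3, byrow = TRUE)
coma <- cooccurrence(landscape, k = 3)
coma
```

The function only needs the labels and their adjacencies, so it works for a region of any size or shape as long as cells outside the region are set to NA.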

The exposure matrix is a modification of the co-occurrence matrix. It is calculated in the same way as the co-occurrence matrix, but each adjacency contributes a location-specific value to the matrix instead of the constant value 1. The contributed value is calculated as the average of the local population densities in the two adjacent cells.

Let us consider the example racial landscape presented below. The co-occurrence matrix is constructed using only adjacencies from the racial landscape. To obtain the exposure matrix, each cell in the racial landscape is assigned 2 types of information: a single race class and a local population density. Consider the 2 adjacent green cells: this pair contributes 1 to the co-occurrence matrix (one adjacent pair) and the average of 2 and 1 [(2+1)/2 = 1.5] to the exposure matrix.
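The weighting can be sketched in base R on a hypothetical 2x2 landscape. The sketch assumes that each adjacent pair is tallied from both of its cells (so a same-class pair adds its weight twice to the diagonal) and that no cell is NA; it illustrates the idea and is not the get_wecoma() implementation.

```r
exposure_matrix <- function(landscape, density, k) {
  nr <- nrow(landscape)
  nc <- ncol(landscape)
  idx <- matrix(seq_len(nr * nc), nrow = nr)  # linear index of every cell
  # Linear-index pairs of horizontally and vertically adjacent cells.
  pairs <- rbind(
    cbind(as.vector(idx[, -nc, drop = FALSE]), as.vector(idx[, -1, drop = FALSE])),
    cbind(as.vector(idx[-nr, , drop = FALSE]), as.vector(idx[-1, , drop = FALSE]))
  )
  wecoma <- matrix(0, nrow = k, ncol = k)
  for (p in seq_len(nrow(pairs))) {
    a <- landscape[pairs[p, 1]]
    b <- landscape[pairs[p, 2]]
    w <- mean(density[pairs[p, ]])      # average density of the two cells
    wecoma[a, b] <- wecoma[a, b] + w    # pair seen from the first cell
    wecoma[b, a] <- wecoma[b, a] + w    # pair seen from the second cell
  }
  wecoma
}

# Hypothetical toy data: race classes and local population densities.
landscape <- matrix(c(1, 1,
                      2, 2), nrow = 2, byrow = TRUE)
density <- matrix(c(2, 1,
                    4, 4), nrow = 2, byrow = TRUE)
wecoma <- exposure_matrix(landscape, density, k = 2)
wecoma
```

The two class-1 cells in the top row have densities 2 and 1, so each visit of that pair adds (2+1)/2 = 1.5 instead of 1, matching the worked example above.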

Calculating local subpopulation densities

Local densities of subpopulations (race-specific local densities), along with racial landscapes, are used to construct the exposure matrix. Local densities are calculated using the create_densities() function, which requires 3 arguments:

  • x - RasterStack with realizations
  • y - RasterStack with shares of subpopulations (input data RasterStack)
  • window_size - the size, expressed in the number of cells, of a square-shaped window for which local densities will be calculated; it is recommended to use a small window_size, e.g., 10 (window_size=10 means that the local densities will be calculated from an area of 10x10 cells).

The output is a RasterStack with local densities calculated separately for each realization.

Exposure matrix.

Here we show an example of how the exposure matrix is calculated. Please note that the calculation of the exposure matrix is built into the calculate_metrics() function, so there is no need to calculate the exposure matrix separately.

The exposure matrix can be calculated separately using the get_wecoma() function from the comat package. The calculation of the exposure matrix requires 2 arguments:

  • x - RasterLayer with one selected realization
  • y - RasterLayer with local densities corresponding to the selected realization.

By default, the exposure matrix is calculated with 4-direction adjacencies (neighbourhood=4) using the average values from 2 adjacent cells (fun=‘mean’).

Information theory metrics (IT metrics)

For a lucid quantification of racial topography, further compression of the exposure matrix is required. A racial pattern can be described (in the same way as a landscape pattern in the domain of landscape ecology) using Information Theory metrics: entropy and mutual information. Entropy is associated with measuring racial diversity, and mutual information is associated with measuring racial segregation.

Information theory metrics are calculated using the calculate_metrics() function. This function calculates the exposure matrix and summarizes it by calculating four IT-derived metrics: entropy (ent), joint entropy (joinent), conditional entropy (condent) and mutual information (mutinf). The function requires the following arguments:

  • x - RasterStack with realizations
  • w - RasterStack with local densities.
  • neighbourhood - adjacencies between cells can be defined in 4 directions (neighbourhood=4) or 8 directions (neighbourhood=8).
  • fun - the function used to calculate the value that a pair of adjacent cells contributes to the exposure matrix; fun=‘mean’ calculates the average population density of the two adjacent cells. Other available options are the geometric mean (‘geometry_mean’) or the value from the focal cell (‘focal’)
  • size = NULL - the calculation will be performed for the whole area (see explanation later)
  • threshold - the maximum share of NA cells for which metrics are still calculated. With threshold=1 the calculation will be performed even if 100% of the cells have NA values (recommended with size=NULL)

IT metrics are calculated for each realization separately, and in the next step the average value over all realizations is calculated.
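The four metrics can be derived from a (weighted) co-occurrence matrix with a few lines of base R. The sketch below rests on assumptions that are not stated in this document: the matrix is normalized by its sum to obtain joint probabilities, logarithms are base 2, and mutual information is entropy minus conditional entropy. The exact conventions of calculate_metrics() may differ. The two input matrices are hypothetical and stand in for the exposure matrices of two realizations.

```r
it_metrics <- function(m) {
  p <- m / sum(m)                    # joint probabilities of category pairs
  px <- rowSums(p)                   # marginal probabilities of categories
  h <- function(q) {
    q <- q[q > 0]                    # 0 * log(0) is treated as 0
    -sum(q * log2(q))
  }
  ent <- h(px)                       # marginal entropy (diversity)
  joinent <- h(as.vector(p))         # joint entropy
  condent <- joinent - ent           # conditional entropy
  mutinf <- ent - condent            # mutual information (segregation)
  c(ent = ent, joinent = joinent, condent = condent, mutinf = mutinf)
}

# Hypothetical matrices for two realizations with K = 2 categories.
segregated <- diag(c(10, 10))        # categories never adjacent to each other
mixed <- matrix(5, nrow = 2, ncol = 2)  # all adjacencies equally likely

per_realization <- sapply(list(segregated, mixed), it_metrics)
ensemble_average <- rowMeans(per_realization)

per_realization
ensemble_average
```

Both toy patterns have the same diversity (entropy of 1 bit), but mutual information is 1 for the fully segregated pattern and 0 for the fully mixed one, which is why the two metrics are reported together.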

Describing local racial patterns of racial landscape (calculation at different spatial scales).

Unlike previous methods, the racial landscape method does not depend on a division into specific zones (census tracts, census blocks, etc.). Racial diversity (entropy) and racial segregation (mutual information) can be calculated for the whole area of interest without introducing any arbitrary divisions.

The racial landscape method also allows the calculation to be performed at different spatial scales.

Defining local patterns

Let us consider the example presented below. The racial landscape covers an area of 16 by 16 cells. Such an area can be divided into square-shaped blocks of cells. Each square of cells represents a local pattern (a local landscape), and IT metrics (entropy and mutual information) are calculated for each local pattern. The extent of a local pattern is defined by 2 parameters: size and shift.

  • Size parameter, expressed in the numbers of cells, is a length of the side of a square-shaped block of cells. It defines the extent of a local pattern.

  • Shift parameter defines the shift between adjacent squares of cells along the n-s and w-e directions. It describes the density (resolution) of the output grid. The resolution of the resultant map will be reduced to the new resolution = original resolution x shift.

    • shift=size - the input map will simply be divided into a grid of non-overlapping square windows. Each square window defines the extent of a local pattern.

    • shift < size - results in a grid of overlapping square windows. A local pattern is calculated from the square window defined by the size parameter; the next square window is shifted (in the n-s and w-e directions) by the number of cells defined by the shift parameter.
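The effect of size and shift can be sketched in base R for a 60x60-cell area with 30 m cells, such as the example area used in this document. The size and shift values are illustrative, and the sketch assumes that only windows fitting entirely inside the area are kept; the package may treat windows at the edges differently.

```r
window_starts <- function(n_cells, size, shift = size) {
  # First row (or column) index of every window that fits inside the area.
  seq(1, n_cells - size + 1, by = shift)
}

windows <- function(n_rows, n_cols, size, shift = size) {
  # One row per window: the index of its top-left cell.
  expand.grid(row_start = window_starts(n_rows, size, shift),
              col_start = window_starts(n_cols, size, shift))
}

non_overlapping <- windows(60, 60, size = 20)             # shift = size
overlapping <- windows(60, 60, size = 20, shift = 10)     # shift < size

nrow(non_overlapping)   # 3 x 3 = 9 windows
nrow(overlapping)       # 5 x 5 = 25 windows

# New resolution = original resolution x shift (30 m cells, shift = 10).
output_resolution <- 30 * 10
output_resolution       # 300 m
```

With shift < size the windows still cover 20x20 cells each, but the output grid is denser because consecutive windows start only 10 cells apart.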

The example presented below consists of a racial landscape of 16 by 16 cells. Setting size = 4 (and shift = 4) divides the racial landscape into 16 square windows of 4x4 cells each. Each window represents a local pattern. IT metrics can be calculated for each local pattern, and the results are assigned to the resultant grid of square windows. In effect, the original racial landscape of 16x16 cells is reduced to 4x4 ‘large cells’.

Setting size = 4 and shift = 2 results in overlapping square windows. First, a window of size 4x4 defines the local pattern (see dark blue square). In the next step this window is shifted by 2 cells to the right and a new local pattern is selected (see light blue square). This creates a resultant grid whose cell size is defined by the shift parameter.

Calculate IT metrics from local patterns (1)

The create_grid() function creates a spatial object with a grid (each ‘cell’ is defined by size and shift). The function requires the RasterStack object with realizations (racial landscapes) and the size parameter. If the shift parameter is not set, it is assumed that shift=size.

Below, such a grid is overlaid on the racial landscape to show the local patterns.

The calculate_metrics() function is used to calculate IT metrics. The parameter size=20 means that the area of interest will be divided into a grid of local patterns of 20x20 cells (which in this case corresponds to a square of 0.6 km x 0.6 km). Setting neighbourhood=4 means that adjacencies between cells are defined in four directions, fun=‘mean’ calculates the average population density of adjacent cells, and threshold=0.5 means that the calculation is performed only if at least 50% of the cells are non-NA.

IT metrics are calculated for each local pattern for each realization. The output table will have 900 rows (there are 9 local patterns of size 20x20 cells and 100 realizations).

Racial topography at the analysed scale is quantified as an ensemble average from multiple realizations. First, the average values of entropy and mutual information over the 100 realizations are calculated for each square window. The table below shows the mean (ent_mean, mutinf_mean) and standard deviation (ent_sd, mutinf_sd) values for each square window.

Then the average from the mean values of entropy and mutual information is calculated.
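The two-step averaging can be sketched in base R. The table below is a hypothetical stand-in for the output of calculate_metrics(): its column names (realization, row, col, ent, mutinf) are assumptions, and the metric values are random placeholders rather than results.

```r
set.seed(42)

# Hypothetical long table: 9 windows (3 x 3) times 100 realizations = 900 rows.
metrics <- expand.grid(realization = 1:100, row = 1:3, col = 1:3)
metrics$ent <- runif(nrow(metrics), min = 0, max = 2)      # placeholder values
metrics$mutinf <- runif(nrow(metrics), min = 0, max = 1)   # placeholder values

# Step 1: mean and standard deviation over realizations for each window.
window_mean <- aggregate(cbind(ent, mutinf) ~ row + col, data = metrics, FUN = mean)
window_sd <- aggregate(cbind(ent, mutinf) ~ row + col, data = metrics, FUN = sd)

# Step 2: average of the window means for the whole area.
overall <- colMeans(window_mean[, c("ent", "mutinf")])

window_mean
overall
```

Because every window has the same number of realizations, the average of the window means equals the mean over all rows of the table.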

Calculate IT metrics from local patterns (2)

The next example shows how to calculate local patterns using overlapping windows. This option is recommended, especially for larger size values, because overlapping windows do not introduce arbitrary boundaries.

To obtain overlapping windows, the calculate_metrics() function requires an additional argument: shift.