This tutorial demonstrates how to perform Bayesian Improved Surname Geocoding when the race/ethnicity of individuals in a dataset is unknown.

What is Bayesian Improved Surname Geocoding?

Bayesian Improved Surname Geocoding (BISG) is a method that applies Bayes’ Rule to predict the race/ethnicity of an individual using the individual’s surname and geocoded location [Elliott et al. 2008, Elliott et al. 2009, Imai and Khanna 2016].

Specifically, BISG first calculates the prior probability that individual i belongs to a certain racial group r given their surname s, or \[Pr(R_i=r|S_i=s)\]. This surname-based prior is then updated with the probability that an individual of racial group r lives in geographic location g, or \[Pr(G_i=g|R_i=r)\]. The following equation describes how BISG uses Bayes’ Theorem to calculate the probability of each race/ethnicity for an individual whose race/ethnicity is unknown, given the surname and geographic location:

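\[Pr(R_i=r|S_i=s, G_i=g)=\frac{Pr(G_i=g|R_i=r)Pr(R_i=r|S_i=s)}{\sum_{r' \in \mathcal{R}} Pr(G_i=g|R_i=r')Pr(R_i=r'|S_i=s)}\]

The sum in the denominator runs over all racial/ethnic groups r' in the set of groups considered, so that the resulting probabilities sum to one for each individual.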
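
To make the update concrete, here is a small worked example in base R. The surname and geography probabilities are made-up numbers chosen for illustration; they are not values from the Census surname list or the Georgia census data.

```r
# Hypothetical inputs for one voter (illustrative numbers only)

# Pr(R = r | S = s): race distribution for the voter's surname
p_race_given_surname <- c(whi = 0.60, bla = 0.25, his = 0.08, asi = 0.04, oth = 0.03)

# Pr(G = g | R = r): share of each racial group that lives in the voter's block
p_geo_given_race <- c(whi = 0.0002, bla = 0.0010, his = 0.0004, asi = 0.0001, oth = 0.0003)

# Numerator of Bayes' rule for each race, then normalize across races
unnormalized <- p_geo_given_race * p_race_given_surname
posterior <- unnormalized / sum(unnormalized)

round(posterior, 3)
```

With these numbers the surname alone favors White, but the block-level information shifts the largest posterior probability to Black.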

In R, the wru package (WRU: Who Are You) performs BISG. This vignette walks you through cleaning your geocoded voter file, preparing the voter data for BISG, and finally performing BISG to obtain racial/ethnic probabilities for the individuals in the voter file.

Performing BISG on your data

We will perform BISG using the Gwinnett and Fulton County voter registration data, ga_geo.csv, which was geocoded in the eiCompare: Geocoding vignette.

The first step in performing BISG is to geocode your voter file addresses. For information on geocoding, visit the Geocoding Vignette.

Let’s begin by loading your geocoded voter data into R/RStudio.

Step 1: Load R libraries/packages, voter file, and census data

Load the R packages needed to perform BISG. If any of the following packages are not yet installed, install them first.

# Load libraries
suppressPackageStartupMessages({
  library(eiCompare)
  library(stringr)
  library(sf)
  library(wru)
  library(tidyr)
  library(ggplot2)
  library(dplyr)
})

Load the census data, the shape file, and the geocoded voter registration data with latitude and longitude coordinates.

# Load Georgia census data
data(georgia_census)

We will use data(gwin_fulton_shape) to load the shape file. The shape file includes FIPS code information for Gwinnett and Fulton counties and the associated multipolygon shape geometries, stored in the geometry column.

# Shape file for Gwinnett and Fulton counties
data(gwin_fulton_shape)

Loading a shapefile using the tigris package (optional)

The shapefile can also be loaded using the tigris package. The tigris package downloads TIGER/Line shapefiles from the US Census Bureau; these files are publicly available, so no API key is needed. With the tigris package, you can load Census geographies at a chosen geographic level (e.g., counties, cities, tracts, blocks). The code below loads the shape file using tigris. Remember to remove the # in order to use the code.

# install.packages("tigris")
# library(tigris)
# gwin_fulton_shape <- blocks(state = "GA", county = c("Gwinnett", "Fulton"))

Load geocoded voter file.

# Load geocoded voter registration file
data(ga_geo)

View the first six rows of the voter file to check that the file has loaded properly.

# Check the first six rows of the voter file
head(ga_geo, 6)
#> # A tibble: 6 × 25
#>   county…¹ count…² regis…³ voter…⁴ last_…⁵ first…⁶ str_num str_n…⁷ str_s…⁸ city 
#>      <dbl> <chr>     <dbl> <chr>   <chr>   <chr>     <dbl> <chr>   <chr>   <chr>
#> 1       60 Fulton        1 A       LOCKLER GABRIE…    1084 Howell… NW      Atla…
#> 2       60 Fulton        2 A       RADLEY  OLIVIA     7305 Villag… <NA>    Fair…
#> 3       60 Fulton        3 A       BOORSE  KEISHA     6200 Bakers… SW      Atla…
#> 4       67 Gwinne…      12 A       MAZ     SAVANN…    1359 Beaver… <NA>    Norc…
#> 5       67 Gwinne…      13 A       GAULE   NATASH…    2961 Lenora… <NA>    Snel…
#> 6       67 Gwinne…      15 A       MCMELL… ISMAEL     4305 Paxton… <NA>    Lilb…
#> # … with 15 more variables: state <chr>, zipcode <dbl>, street_address <chr>,
#> #   final_address <chr>, cxy_address <chr>, cxy_status <chr>,
#> #   cxy_quality <chr>, cxy_matched_address <chr>, cxy_tiger_line_id <dbl>,
#> #   cxy_tiger_side <chr>, STATEFP10 <chr>, COUNTYFP10 <chr>, TRACTCE10 <chr>,
#> #   BLOCKCE10 <chr>, geometry <chr>, and abbreviated variable names
#> #   ¹​county_code, ²​county_name, ³​registration_number, ⁴​voter_status,
#> #   ⁵​last_name, ⁶​first_name, ⁷​str_name, ⁸​str_suffix
#> # ℹ Use `colnames()` to see all variable names

View the column names of the voter file. Some of these columns will be used along the journey to performing BISG.

# Find out names of columns in voter file
names(ga_geo)
#>  [1] "county_code"         "county_name"         "registration_number"
#>  [4] "voter_status"        "last_name"           "first_name"         
#>  [7] "str_num"             "str_name"            "str_suffix"         
#> [10] "city"                "state"               "zipcode"            
#> [13] "street_address"      "final_address"       "cxy_address"        
#> [16] "cxy_status"          "cxy_quality"         "cxy_matched_address"
#> [19] "cxy_tiger_line_id"   "cxy_tiger_side"      "STATEFP10"          
#> [22] "COUNTYFP10"          "TRACTCE10"           "BLOCKCE10"          
#> [25] "geometry"

Check the dimensions (the number of rows and columns) of the voter file.

# Get the dimensions of the voter file
dim(ga_geo)
#> [1] 12 25

There are 12 voters (or observations) and 25 columns in the voter file.

Split the geometry column into two columns holding the longitude and latitude of each point.

ga_geo <- ga_geo %>%
  tidyr::extract(geometry, c("lon", "lat"), "\\((.*), (.*)\\)", convert = TRUE)
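
The regular expression captures the two comma-separated values inside the parentheses. The base R sketch below applies the same pattern to two example strings. It assumes the geometry column holds text such as "c(-84.41194, 33.78431)"; the exact string format in your file may differ.

```r
# Hypothetical geometry strings; the real column's format may differ
geometry <- c("c(-84.41194, 33.78431)", "c(-84.59827, 33.55365)")

# Same pattern as in the tidyr::extract() call: two capture groups in parentheses
pattern <- "\\((.*), (.*)\\)"

# regexec() returns the full match followed by one entry per capture group
parts <- regmatches(geometry, regexec(pattern, geometry))

# Element 2 is the first group (longitude), element 3 the second (latitude)
coords <- data.frame(
  lon = as.numeric(vapply(parts, `[`, character(1), 2)),
  lat = as.numeric(vapply(parts, `[`, character(1), 3))
)
coords
```

In the tidyr::extract() call, convert = TRUE plays the role of as.numeric() here by converting the captured text to numbers.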

Step 2: De-duplicate the voter file.

The next step involves removing duplicate voter IDs from the voter file, using the dedupe_voter_file function.

# Remove duplicate voter IDs (the unique identifier for each voter)
voter_file_dedupe <- dedupe_voter_file(voter_file = ga_geo, voter_id = "registration_number")

There are no duplicate voter IDs in the dataset.
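
Conceptually, de-duplication keeps one row per voter ID. The base R sketch below shows that idea on a toy table. It is not the implementation of dedupe_voter_file, and the repeated row is invented for illustration.

```r
# Toy voter file with one repeated registration number (hypothetical data)
toy_voters <- data.frame(
  registration_number = c(1, 2, 2, 3),
  last_name = c("LOCKLER", "RADLEY", "RADLEY", "BOORSE"),
  stringsAsFactors = FALSE
)

# Count the rows whose ID has already appeared earlier in the file
n_dupes <- sum(duplicated(toy_voters$registration_number))

# Keep only the first row for each ID
toy_dedupe <- toy_voters[!duplicated(toy_voters$registration_number), ]

n_dupes
nrow(toy_dedupe)
```

Counting duplicates before dropping them is a quick way to confirm a statement such as "there are no duplicate voter IDs" on your own file.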

Step 3: Perform BISG and obtain the predicted race/ethnicity of each voter.

# Convert the de-duplicated voter file into a data frame for performing BISG.
voter_file_complete <- as.data.frame(voter_file_dedupe)
class(voter_file_complete)
#> [1] "data.frame"

Note that wru requires an internet connection to pull in supplemental data. If the connection cannot be made, wru_predict_race_wrapper will return NULL.

georgia_census$GA$year <- 2010

# Perform BISG
bisg_df <- eiCompare::wru_predict_race_wrapper(
  voter_file = voter_file_complete,
  census_data = georgia_census,
  voter_id = "registration_number",
  surname = "last_name",
  state = "GA",
  county = "COUNTYFP10",
  tract = "TRACTCE10",
  block = "BLOCKCE10",
  census_geo = "block",
  use_surname = TRUE,
  surname_only = FALSE,
  surname_year = 2010,
  use_age = FALSE,
  use_sex = FALSE,
  return_surname_flag = TRUE,
  return_geocode_flag = TRUE,
  verbose = TRUE
)
#> Matching surnames.
#> Performing BISG to obtain race probabilities.
#> Proceeding with last name predictions...
#> Proceeding with Census geographic data at block level...
#> Using Census geographic data from provided census.data object...
#> State 1 of 1: GA
#> ℹ All local files already up-to-date!
#> BISG complete.

# Check BISG data frame
head(bisg_df)
#>   county_code county_name registration_number voter_status last_name first_name
#> 1          60      Fulton                   1            A   LOCKLER  GABRIELLA
#> 2          60      Fulton                   2            A    RADLEY     OLIVIA
#> 3          60      Fulton                   3            A    BOORSE     KEISHA
#> 4          67    Gwinnett                  12            A       MAZ   SAVANNAH
#> 5          67    Gwinnett                  13            A     GAULE   NATASHIA
#> 6          67    Gwinnett                  15            A  MCMELLEN     ISMAEL
#>   str_num            str_name str_suffix       city state zipcode
#> 1    1084   Howell Mill Rd NW         NW    Atlanta    GA   30318
#> 2    7305 Village Center Blvd       <NA>   Fairburn    GA   30213
#> 3    6200  Bakers Ferry Rd SW         SW    Atlanta    GA   30331
#> 4    1359    Beaver Ruin Road       <NA>   Norcross    GA   30093
#> 5    2961    Lenora Church Rd       <NA> Snellville    GA   30078
#> 6    4305           Paxton Ln       <NA>    Lilburn    GA   30047
#>               street_address                               final_address
#> 1  1084 Howell Mill Rd NW NW  1084 Howell Mill Rd NW NW,Atlanta,GA,30318
#> 2   7305 Village Center Blvd  7305 Village Center Blvd,Fairburn,GA,30213
#> 3 6200 Bakers Ferry Rd SW SW 6200 Bakers Ferry Rd SW SW,Atlanta,GA,30331
#> 4      1359 Beaver Ruin Road     1359 Beaver Ruin Road,Norcross,GA,30093
#> 5      2961 Lenora Church Rd   2961 Lenora Church Rd,Snellville,GA,30078
#> 6             4305 Paxton Ln             4305 Paxton Ln,Lilburn,GA,30047
#>                                      cxy_address cxy_status cxy_quality
#> 1  1084 Howell Mill Rd NW NW, Atlanta, GA, 30318      Match   Non_Exact
#> 2 7305 Village Center Blvd , Fairburn, GA, 30213      Match       Exact
#> 3 6200 Bakers Ferry Rd SW SW, Atlanta, GA, 30331      Match       Exact
#> 4    1359 Beaver Ruin Road , Norcross, GA, 30093      Match       Exact
#> 5  2961 Lenora Church Rd , Snellville, GA, 30078      Match       Exact
#> 6            4305 Paxton Ln , Lilburn, GA, 30047      Match       Exact
#>                             cxy_matched_address cxy_tiger_line_id
#> 1       1084 HOWELL MILL RD, ATLANTA, GA, 30318          17341994
#> 2 7305 VILLAGE CENTER BLVD, FAIRBURN, GA, 30213         650118829
#> 3   6200 BAKERS FERRY RD SW, ATLANTA, GA, 30331          17378238
#> 4      1359 BEAVER RUIN RD, NORCROSS, GA, 30093         638089058
#> 5  2961 LENORA CHURCH RD, SNELLVILLE, GA, 30078         647930651
#> 6            4305 PAXTON LN, LILBURN, GA, 30047          88948522
#>   cxy_tiger_side STATEFP10 COUNTYFP10 TRACTCE10 BLOCKCE10       lon      lat
#> 1              L        13        121    000600      1064 -84.41194 33.78431
#> 2              L        13        121    010514      3038 -84.59827 33.55365
#> 3              L        13        121    010303      2019 -84.57126 33.72542
#> 4              L        13        135    050424      1011 -84.14037 33.92892
#> 5              R        13        135    050719      1008 -84.00938 33.83915
#> 6              R        13        135    050714      2027 -84.07677 33.83707
#>   matched_surname matched_geocode     pred.whi     pred.bla     pred.his
#> 1            TRUE            TRUE 8.236113e-07 2.200425e-08 2.129911e-08
#> 2            TRUE            TRUE 1.442265e-01 8.279974e-01 7.321964e-03
#> 3            TRUE            TRUE 9.699416e-01 0.000000e+00 2.532548e-02
#> 4            TRUE            TRUE 5.214638e-02 5.520133e-02 7.600797e-01
#> 5            TRUE            TRUE 6.707808e-01 1.917867e-01 1.374325e-01
#> 6            TRUE            TRUE 7.297028e-01 1.509212e-01 5.958378e-02
#>       pred.asi     pred.oth
#> 1 0.0000000000 3.836384e-07
#> 2 0.0007539897 1.970021e-02
#> 3 0.0047329064 0.000000e+00
#> 4 0.1325725492 0.000000e+00
#> 5 0.0000000000 0.000000e+00
#> 6 0.0284345396 3.135771e-02
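
Each pred.* column holds the predicted probability of one race/ethnicity category. If a single label per voter is needed, one option is to take the category with the largest probability. The sketch below does this in base R for two rows rounded from the output above; the pred_race column name is introduced here for illustration.

```r
# Two rows of predictions, rounded from the head(bisg_df) output
preds <- data.frame(
  registration_number = c(2, 12),
  pred.whi = c(0.1442, 0.0521),
  pred.bla = c(0.8280, 0.0552),
  pred.his = c(0.0073, 0.7601),
  pred.asi = c(0.0008, 0.1326),
  pred.oth = c(0.0197, 0.0000)
)

race_cols <- c("pred.whi", "pred.bla", "pred.his", "pred.asi", "pred.oth")

# max.col() returns the index of the largest value in each row
best <- max.col(as.matrix(preds[, race_cols]), ties.method = "first")

# Strip the "pred." prefix to get a short label (hypothetical column name)
preds$pred_race <- sub("^pred\\.", "", race_cols[best])

preds[, c("registration_number", "pred_race")]
```

A hard label discards the uncertainty in the probabilities, so aggregate analyses generally work better with the pred.* columns themselves.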

Summarizing BISG output

summary(bisg_df)
#>   county_code    county_name        registration_number voter_status      
#>  Min.   :60.00   Length:12          Min.   : 1.00       Length:12         
#>  1st Qu.:65.25   Class :character   1st Qu.: 9.00       Class :character  
#>  Median :67.00   Mode  :character   Median :14.00       Mode  :character  
#>  Mean   :65.25                      Mean   :12.25                         
#>  3rd Qu.:67.00                      3rd Qu.:17.25                         
#>  Max.   :67.00                      Max.   :20.00                         
#>   last_name          first_name           str_num       str_name        
#>  Length:12          Length:12          Min.   : 287   Length:12         
#>  Class :character   Class :character   1st Qu.:1394   Class :character  
#>  Mode  :character   Mode  :character   Median :3498   Mode  :character  
#>                                        Mean   :3306                     
#>                                        3rd Qu.:4162                     
#>                                        Max.   :7305                     
#>   str_suffix            city              state              zipcode     
#>  Length:12          Length:12          Length:12          Min.   :30045  
#>  Class :character   Class :character   Class :character   1st Qu.:30078  
#>  Mode  :character   Mode  :character   Mode  :character   Median :30094  
#>                                                           Mean   :30167  
#>                                                           3rd Qu.:30239  
#>                                                           Max.   :30518  
#>  street_address     final_address      cxy_address         cxy_status       
#>  Length:12          Length:12          Length:12          Length:12         
#>  Class :character   Class :character   Class :character   Class :character  
#>  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
#>                                                                             
#>                                                                             
#>                                                                             
#>  cxy_quality        cxy_matched_address cxy_tiger_line_id   cxy_tiger_side    
#>  Length:12          Length:12           Min.   : 17341994   Length:12         
#>  Class :character   Class :character    1st Qu.: 88908516   Class :character  
#>  Mode  :character   Mode  :character    Median :637257860   Mode  :character  
#>                                         Mean   :399987381                     
#>                                         3rd Qu.:641834456                     
#>                                         Max.   :650118829                     
#>   STATEFP10          COUNTYFP10         TRACTCE10          BLOCKCE10        
#>  Length:12          Length:12          Length:12          Length:12         
#>  Class :character   Class :character   Class :character   Class :character  
#>  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
#>                                                                             
#>                                                                             
#>                                                                             
#>       lon              lat        matched_surname matched_geocode
#>  Min.   :-84.60   Min.   :33.55   Mode:logical    Mode:logical   
#>  1st Qu.:-84.23   1st Qu.:33.82   TRUE:12         TRUE:12        
#>  Median :-84.14   Median :33.89                                  
#>  Mean   :-84.19   Mean   :33.88                                  
#>  3rd Qu.:-84.04   3rd Qu.:33.97                                  
#>  Max.   :-83.97   Max.   :34.09                                  
#>     pred.whi            pred.bla           pred.his           pred.asi        
#>  Min.   :0.0000008   Min.   :0.000000   Min.   :0.000000   Min.   :0.0000000  
#>  1st Qu.:0.0288208   1st Qu.:0.008116   1st Qu.:0.007098   1st Qu.:0.0005655  
#>  Median :0.3390319   Median :0.103061   Median :0.026733   Median :0.0034098  
#>  Mean   :0.4106619   Mean   :0.232251   Mean   :0.169077   Mean   :0.0896576  
#>  3rd Qu.:0.7395151   3rd Qu.:0.241425   3rd Qu.:0.079046   3rd Qu.:0.0164522  
#>  Max.   :1.0000000   Max.   :0.950896   Max.   :0.943675   Max.   :0.8884684  
#>     pred.oth       
#>  Min.   :0.000000  
#>  1st Qu.:0.000000  
#>  Median :0.007979  
#>  Mean   :0.015019  
#>  3rd Qu.:0.022615  
#>  Max.   :0.064206
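
The summary shows that some predicted probabilities are extremely close to zero. A useful follow-up check is whether each voter's five probabilities sum to one. The sketch below uses the first three rows of the head(bisg_df) output; the 0.01 tolerance and the sums_to_one column name are arbitrary choices for this example.

```r
# First three rows of predictions, copied from the head(bisg_df) output
preds <- data.frame(
  registration_number = c(1, 2, 3),
  pred.whi = c(8.236113e-07, 1.442265e-01, 9.699416e-01),
  pred.bla = c(2.200425e-08, 8.279974e-01, 0),
  pred.his = c(2.129911e-08, 7.321964e-03, 2.532548e-02),
  pred.asi = c(0, 0.0007539897, 0.0047329064),
  pred.oth = c(3.836384e-07, 1.970021e-02, 0)
)

race_cols <- grep("^pred\\.", names(preds), value = TRUE)
row_totals <- rowSums(preds[, race_cols])

# Flag voters whose probabilities do not sum to roughly one
# (the 0.01 tolerance is an arbitrary choice for this sketch)
preds$sums_to_one <- unname(abs(row_totals - 1) < 0.01)

preds[, c("registration_number", "sums_to_one")]
```

Here the first voter is flagged, so that row deserves a closer look before the predictions are aggregated.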

Look at BISG race predictions by county

# Obtain aggregate values for the BISG results by county
bisg_agg <- precinct_agg_combine(
  voter_file = bisg_df,
  group_col = "COUNTYFP10",
  race_cols = c("pred.whi", "pred.bla", "pred.his", "pred.asi", "pred.oth"),
  true_race_col = "race",
  include_total = FALSE
)

# Table with BISG race predictions by county
head(bisg_agg)
#> # A tibble: 2 × 6
#>   COUNTYFP10 pred.whi_prop pred.bla_prop pred.his_prop pred.asi_prop pred.oth_…¹
#>   <chr>              <dbl>         <dbl>         <dbl>         <dbl>       <dbl>
#> 1 121                0.371         0.276        0.0109       0.00183     0.00657
#> 2 135                0.424         0.218        0.222        0.119       0.0178 
#> # … with abbreviated variable name ¹​pred.oth_prop
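
The _prop columns appear to be within-county means of the predicted probabilities. Averaging the three Fulton County rows (COUNTYFP10 121) from the head(bisg_df) output reproduces the first row of the table, assuming those are the only Fulton voters in the file. A base R check:

```r
# Predictions for the three Fulton County voters (COUNTYFP10 == "121"),
# copied from the head(bisg_df) output
fulton <- data.frame(
  pred.whi = c(8.236113e-07, 1.442265e-01, 9.699416e-01),
  pred.bla = c(2.200425e-08, 8.279974e-01, 0),
  pred.his = c(2.129911e-08, 7.321964e-03, 2.532548e-02),
  pred.asi = c(0, 0.0007539897, 0.0047329064),
  pred.oth = c(3.836384e-07, 1.970021e-02, 0)
)

# Mean predicted probability per race category, to three significant digits
fulton_means <- signif(colMeans(fulton), 3)
fulton_means
```

The Fulton proportions sum to well under one because the first voter's probabilities are all close to zero.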

Barplot of BISG results

bisg_bar <- bisg_agg %>%
  tidyr::gather("Type", "Value", -COUNTYFP10) %>%
  ggplot(aes(COUNTYFP10, Value, fill = Type)) +
  geom_bar(position = "dodge", stat = "identity") +
  labs(title = "BISG Predictions for Fulton and Gwinnett Counties", y = "Proportion", x = "Counties") +
  theme_bw()

# The bars are mapped to the fill aesthetic, so the legend title is set on the fill scale
bisg_bar + scale_fill_discrete(name = "Race/Ethnicity Proportions")

Choropleth Map

Finally, we will map the BISG data onto choropleth maps.

bisg_dfsub <- bisg_df %>%
  dplyr::select(BLOCKCE10, pred.whi, pred.bla, pred.his, pred.asi, pred.oth)

bisg_dfsub
#>    BLOCKCE10     pred.whi     pred.bla     pred.his     pred.asi     pred.oth
#> 1       1064 8.236113e-07 2.200425e-08 2.129911e-08 0.0000000000 3.836384e-07
#> 2       3038 1.442265e-01 8.279974e-01 7.321964e-03 0.0007539897 1.970021e-02
#> 3       2019 9.699416e-01 0.000000e+00 2.532548e-02 0.0047329064 0.000000e+00
#> 4       1011 5.214638e-02 5.520133e-02 7.600797e-01 0.1325725492 0.000000e+00
#> 5       1008 6.707808e-01 1.917867e-01 1.374325e-01 0.0000000000 0.000000e+00
#> 6       2027 7.297028e-01 1.509212e-01 5.958378e-02 0.0284345396 3.135771e-02
#> 7       3010 7.689519e-01 1.789950e-01 1.700555e-02 0.0011966137 3.385091e-02
#> 8       1011 3.722963e-03 9.508964e-01 2.814040e-02 0.0020866235 1.515359e-02
#> 9       4002 1.000000e+00 0.000000e+00 0.000000e+00 0.0000000000 0.000000e+00
#> 10      2002 3.032605e-02 1.082159e-02 9.436752e-01 0.0124580620 2.719115e-03
#> 11      3000 5.338374e-01 3.903416e-01 6.427546e-03 0.0051871643 6.420623e-02
#> 12      3003 2.430521e-02 3.005071e-02 4.393773e-02 0.8884683982 1.323796e-02

# Join bisg and shape file
bisg_sf <- dplyr::left_join(gwin_fulton_shape, bisg_dfsub, by = "BLOCKCE10")
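
Census block codes are unique only within a tract, and the table above contains block code 1011 twice (rows 4 and 8). The base R sketch below uses toy data frames to show how a join on the block code alone differs from a join on county, tract, and block together. The second tract code and the rounded values are hypothetical, and whether gwin_fulton_shape carries COUNTYFP10 and TRACTCE10 columns is an assumption to verify with names(gwin_fulton_shape).

```r
# Toy block table: the same block code occurs in two different tracts
toy_blocks <- data.frame(
  COUNTYFP10 = c("135", "135"),
  TRACTCE10 = c("050424", "050500"), # second tract code is hypothetical
  BLOCKCE10 = c("1011", "1011"),
  stringsAsFactors = FALSE
)

# Toy predictions: one voter in each of the two blocks
toy_preds <- data.frame(
  COUNTYFP10 = c("135", "135"),
  TRACTCE10 = c("050424", "050500"),
  BLOCKCE10 = c("1011", "1011"),
  pred.bla = c(0.055, 0.951),
  stringsAsFactors = FALSE
)

# Joining on the block code alone pairs every block with every voter
by_block <- merge(toy_blocks, toy_preds[, c("BLOCKCE10", "pred.bla")],
                  by = "BLOCKCE10", all.x = TRUE)

# Joining on county + tract + block keeps one row per block
by_full_key <- merge(toy_blocks, toy_preds,
                     by = c("COUNTYFP10", "TRACTCE10", "BLOCKCE10"), all.x = TRUE)

nrow(by_block)
nrow(by_full_key)
```

If the shape file does include the county and tract columns, passing all three names to the by argument of dplyr::left_join avoids attaching a voter's prediction to an unrelated block that happens to share the same block code.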

Plot Map of Proportion of Black Voters

# Plot choropleth map of race/ethnicity predictions for Fulton and Gwinnett counties
plot(bisg_sf["pred.bla"], main = "Proportion of Black Voters identified by BISG")

Plot Map of Proportion of White Voters

plot(bisg_sf["pred.whi"], main = "Proportion of White Voters identified by BISG")

Plot Map of Proportion of Hispanic Voters

plot(bisg_sf["pred.his"], main = "Proportion of Hispanic Voters identified by BISG")

Plot Map of Proportion of Asian Voters

plot(bisg_sf["pred.asi"], main = "Proportion of Asian Voters identified by BISG")