
The Importance of Being Ernest, Ekundayo, or Eswari: An Interpretable Machine Learning Approach to Name-Based Ethnicity Classification

Published on Jul 28, 2022

Abstract

Name-based ethnicity classification is the task of predicting ethnicity from a name. Ethnicity classification can be a key tool for assessing the fairness of algorithms, demographic studies, and political analysis. While previous state-of-the-art approaches for this task rely on complex neural networks that are difficult to understand, troubleshoot, and tune, we provide an interpretable and intuitive solution that outperforms these (overly) complicated models at a fraction of the computational cost. Using our technique, we can analyze patterns in name-ethnicity databases to show connections between ethnic groups in terms of their overlap of names. We provide techniques to generalize under domain shift, leveraging ‘indistinguishables,’ which are names common to multiple ethnic groups. We provide an application of our method to the estimation of how many political donations for each political party were provided by individuals from various ethnic groups in 2020 leading up to the presidential election.

Keywords: name ethnicity classification, interpretable machine learning, campaign contributions


1. Introduction

Can you guess which ethnicities the names ‘Vaishali Jain,’ ‘Ted Enamorado,’ and ‘Cynthia Rudin’ are associated with? The question we consider here is whether a computer algorithm can also perform this task. The task of associating a name with an ethnicity is called ‘name-based’ ethnicity classification. Automated name-based ethnicity classifiers have had wide applications and have been used for estimating the fraction of political contributions given by different ethnic groups (Besco & Tolley, 2022; Grumbach & Sahn, 2019; Sood & Laohaprapanon, 2018), for measuring fairness in employment (Bertrand & Mullainathan, 2003), for biomedical and epidemiology research (Burchard et al., 2003; Donaldson & Clayton, 1984; Fiscella & Fremont, 2006; Gill et al., 2005; Harkness et al., 2016; Senior & Bhopal, 1994), for population studies (Mateos, 2007), and for marketing and advertising (Appiah, 2001; Burton, 2000). Name-based ethnicity classification can be a key tool in troubleshooting policy decisions over large populations and assessing fairness. For example, credit lenders are forbidden from collecting data on ethnicity or race, so when assessing racial bias they often estimate fairness through other data, and name data may be a valuable resource for this purpose. Similarly, in assessing the fairness of judicial decisions or policing, in the absence of explicit race data, name data can be quite valuable.

Prior to this work, the state-of-the-art techniques in name-based ethnicity classification have generally been deep learning methods, and in particular, long short-term memory (LSTM) recurrent neural networks (Kang, 2020; Lee et al., 2017; Sood & Laohaprapanon, 2018; Xie, 2022). Such complex methods consider the ordering of letters and patterns among them in subtle ways that might be difficult for a human to understand. They are difficult to troubleshoot and to train, and they are computationally expensive. Ideally, we would want a much simpler, more interpretable, and scalable solution.

While the subtle patterns in the ordering of letters leveraged by deep learning are definitely useful for classification, we have found that there are much bigger overarching patterns in the popularity of names that are more powerful. These patterns are not naturally captured within LSTM architectures, and have been overlooked in the quest for bigger and more automated prediction models. In particular, we point out the simple facts that most first and last names are popular and that simple patterns based on popularity statistics can be used to resolve uncertainty when classifying ambiguous names. To determine the ethnicity of the name ‘Chang,’ we do not need to consider subsets of letters within it. We have seen this name before, and can use its popularity among different ethnic groups to determine that it is an Asian name with high probability. To determine the ethnicity of the name ‘Eric Chang,’ even though the name ‘Eric’ is popular among White people, the last name is more important in classifying Asian names. Similarly, for the name ‘Keisha Richardson,’ the last name is not as important as the first in classifying Black names. We can find patterns like this in data (e.g., for Asian names, the last name is more important, for Hispanic names the first name and last name are both important) even from a relatively small training set of names. After learning these general patterns from a small data set of names, we can use a much larger training set of names to determine whether each first and last name is associated with each ethnicity. For instance, even if the last name ‘Gigliotti’ was not used for finding patterns on whether the first or last name is more important, if it appears in the test set, we can leverage all the Gigliotti’s in the large training set to estimate the name’s popularity for different ethnicities. 
Then, leveraging the learned patterns about whether first name or last name information is more important for each ethnicity, we can classify the new name.

Using these basic but important ideas, in this work, we derive a simple and intuitive algorithm based on meaningful features that achieves significantly better performance than prior approaches. Our algorithm is massively scalable, requiring optimization on only a relatively small data set (perhaps 20,000 names) to find the types of patterns discussed above, concerning how important the first and last names are for classifying a name for each ethnicity. The algorithm then leverages millions of additional names from the training set to compute the features on the test set. Using only a small subset of the names for optimization leads to a massive computational speedup without sacrificing accuracy. Our models can be built and used on enormous data sets in about 5 minutes on a laptop. Our models are easy to fine-tune for an accuracy boost by including misclassified names within the features without needing to reoptimize. Our proposed approach is general and can easily be extended to any number of ethnicities, provided we have the relevant data to train the model.

To reduce bias in our estimates involving ethnicity popularities, we noticed something else important about names, which is that a fraction of names (about 0.9% in our North Carolina data set, i.e., 56,000 out of 6.6 million) are common to multiple ethnicities (e.g., Kimmy Jones, Amala John, Kamal Hasan, or Soni Philip could easily be Black, Asian, or White; Ariel Musa, Elena Bonnet, or Bertha Gardin could easily be Black, Hispanic, or White). We call such names ‘indistinguishables,’ and do not attempt the futile goal of classifying them. These are names we are certain cannot be classified accurately. Classifying them all according to the majority ethnicity in which these names appear would lead to poor estimates of the distribution of ethnicities across names: if we predicted each indistinguishable name according to the majority class (say we classify them all as White people), then we would overestimate the number of people in the majority and underestimate the number in minority classes. If our aim is to estimate the fraction of names from each ethnic group, we must handle indistinguishables more carefully. We propose to redistribute counts for indistinguishable names according to a probabilistic adjustment that considers the probability for a person of each ethnicity to have an indistinguishable name, and the probability of a person with a distinguishable name to belong to a certain ethnicity. We describe this model in Section 4.3. For instance, if ‘Kendra Hill’ is indistinguishable between White and Black in the training set, our classifier for the training set may predict that this name is Black. If the test set contains fewer Black people, our probabilistic adjustment would make it more likely to classify this name as White. 
This way of handling indistinguishables helps our models to generalize between training and test sets under distributional shift: even if there are more Black people in the training set with the name Kendra Hill, when there are more White people in the test population, we would still predict correctly that there are more White people with this name in the test set.

Section 2 provides related work, and Section 3 describes our methodology, which is called EthnicIA, which stands for Ethnicity via Interpretable Artificial Intelligence. In Section 4.1, we present experimental results under distributional shift, where a model trained on data from Florida would be used in North Carolina, or vice versa, despite the fact that they have very different ethnic populations. Section 4.2 displays feature importance. In Section 4.3, we discuss the use of indistinguishables for bias reduction in our estimates. Section 4.4 shows how computationally easy it is to train our models. In Section 4.5, we show the robustness of our approach to domain shift. In Section 4.6, we show how our models can easily be updated in a way that would change the classification of a particular name without having to reoptimize. Section 4.7 discusses the possibility to make our model sparser, and Section 4.8 provides visualization and intuition about general relationships between names and how they relate to each other in our feature space. Section 5 provides a case study, where we explore the ethnic composition of the donorate in the 2020 U.S. elections.

2. Related Work

Name-ethnicity classification has been used extensively for assessments of fairness, as discussed above, for medical studies, population demographic studies, employment, and marketing. In what follows, we review work in related areas, starting with the work closest to ours.

Most Closely Related to Our Work. In their work on name-ethnicity classification, Sood and Laohaprapanon (2018) use an LSTM recurrent neural network for the classification of race and ethnicity using the character sequence (letter sequence) information in a full name. They apply their method to estimate campaign donations by racial group. A problem with LSTMs is that they can be challenging to train, and the resulting models are not interpretable or easy to troubleshoot or adjust. Lee et al. (2017) also use an LSTM-based recurrent neural network with character-level features for training. These authors predict the ethnicity of a pool of Olympic athletes based on their names. The works of Lee et al. (2017) and Sood and Laohaprapanon (2018) are extremely similar: both use the same LSTM approach and they use extremely similar sets of features (e.g., popular n-grams). Sood and Laohaprapanon (2018) use data similar to ours, and their model is easily available for comparison as a Python package, so we compare our model results with their LSTM model results.

In their paper, Ambekar et al. (2009) use a hierarchical decision tree of hidden Markov models on character sequences from names to classify them into 13 cultural/ethnic groups. Treeratpituk and Giles (2012) build another model that is simpler and uses character sequences as well as phonetic features. In their algorithm, called EthnicSeer, they use a multinomial logistic regression with n-gram, non-ASCII, and phonetic features. This model is closest to the model we propose in that they use interpretable multinomial logistic regression to train the model, though our features are different. They conclude that name strings are usually more important than phonetic features for name-ethnicity classification. The code for training the models from Ambekar et al. (2009) and Treeratpituk and Giles (2012) is not available, and Ambekar et al.'s (2009) was too complicated to reproduce. We tried to replicate the work by Treeratpituk and Giles (2012): we trained a multinomial logistic model on the suggested features and compared its results with our model results in the experimental section. It is worth noting that the 30 features used by our model represent a very small set compared with the roughly 100,000 features used by their model.

Lookup-Table-Based Approaches. A summary of works on name-based ethnicity classifiers using lookup tables is in the review of Mateos (2007). Lookup tables have two major problems: (1) they cannot classify names that are not present in a database, and (2) they do not probabilistically model conflicts, where one name appears with multiple associated ethnicities, which is the main interesting problem in ethnicity classification. Our machine learning framework handles both problems.

An earlier approach using a similar idea with a lookup table is that of Coldman et al. (1989) for classifying Chinese vs. non-Chinese names. These authors use a likelihood ratio approach between the conditional probability of observing the components of a name in one class vs. the other. It is worth mentioning that there are records of much earlier approaches (e.g., Buechley, 1976, and Bradford Health Authority, 1983) exploiting similar ideas, though we were not able to access these historical works. Analysis of these works was done by Cummins et al. (2000), Harding et al. (1999), and Stewart et al. (1999).

Note that there is a larger literature on classifying Asian names via lookup tables. Quan et al. (2006) develop a Chinese surname list and discuss its validity for identifying Chinese ethnicity. Taylor et al. (2009) write about the use of surname lists of Vietnamese names for ethnicity classification. A similar approach is that of Nanchahal et al. (2001) (SANGRA; South Asian Names and Group Recognition Algorithm), which is lookup-table based, where the lookup-table is manually curated after merging a few data sets to get a list of first and last names. In case of a conflict between the categories assigned to the first name and the last name, SANGRA chooses the category assigned to the first name. Lauderdale and Kestenbaum (2000) develop a lookup table for surnames from six Asian ethnicities by merging data sets, and Nicoll et al. (1986) evaluate how well humans perform at the name-ethnicity classification task based on last name only, and based on full name.

In their paper on the Ethnea algorithm, Torvik and Agarwal (2016) create a lookup table ethnicity classifier based on geocoded authors’ names in the large-scale bibliographic database PubMed. Given a name, their algorithm retrieves all of its instances (or the most similar ones) from PubMed, coupled with their respective country of affiliation. It uses a lookup table for each name and country, without using machine learning. Once the country is assigned, logistic regression is used to probabilistically map each country to a set of 26 predefined ethnicities. The lookup table has a manually defined scoring system for three- and four-character n-grams on names that are rare. They do not use the types of features we use, and they use additional country information.

Approaches That Use Additional Information Besides Names. There is a vast literature that uses both name and auxiliary information to classify ethnicity. This auxiliary information includes geolocation (Fiscella and Fremont, 2006; Elliott et al., 2008, 2009; Voicu, 2018; Wong et al., 2020), distributions of names for each country (Chang et al., 2010; Imai & Khanna, 2016), social networks (Ye et al., 2017), Twitter profiles and activity (Pennacchiotti & Popescu, 2011), voters’ registration information (Imai & Khanna, 2016), or census statistics (Harris, 2015; Imai & Khanna, 2016; Sood & Laohaprapanon, 2018). While some of these works use some features similar to ours (e.g., using the first three or four letters of a name), none of them directly use the popularity-based features we introduce here.

An interesting approach called ‘Who are you?’ (WRU) (Imai & Khanna, 2016) leverages the overall popularity of a name given race. They use Bayes' rule to invert: $P(\textrm{race} | \textrm{name} = \textrm{`Ted'}) = P(\textrm{name} = \textrm{`Ted'} | \textrm{race}) P(\textrm{race}) / P(\textrm{name} = \textrm{`Ted'})$. We also use $P(\textrm{race} | \textrm{name} = \textrm{`Ted'})$, but as one small part of a larger model that uses other features, and whose parameters are trained. Similar to Imai and Khanna (2016), our simple formulation can be extended easily to accommodate auxiliary information like location.
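As a purely illustrative check of the Bayes' rule inversion above, the following sketch computes $P(\textrm{race} | \textrm{name})$ from $P(\textrm{name} | \textrm{race})$ and $P(\textrm{race})$. The function name and all numbers are hypothetical, chosen only for illustration; they are not estimates from any data set and this is not the WRU implementation.

```python
def posterior_race_given_name(p_name_given_race, p_race):
    """Invert P(name | race) into P(race | name) with Bayes' rule."""
    # Joint probability P(name, race) for every race.
    joint = {r: p_name_given_race[r] * p_race[r] for r in p_race}
    # The marginal P(name) is the sum of the joint over all races.
    p_name = sum(joint.values())
    return {r: joint[r] / p_name for r in joint}


# Hypothetical inputs (illustrative only, not taken from real data).
p_name_given_race = {"asian": 0.001, "black": 0.002, "hispanic": 0.001, "white": 0.004}
p_race = {"asian": 0.05, "black": 0.15, "hispanic": 0.20, "white": 0.60}

posterior = posterior_race_given_name(p_name_given_race, p_race)
```

The posterior always sums to one, and a name that is rare within a group can still be assigned mostly to that group if the group itself is large.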

Associations With Multiple Ethnicities. One challenge with name-ethnicity classification is that multiple ethnicities could be associated with one individual. Our method handles that by providing a probability of each ethnicity with each name, so one could probabilistically assign multiple ethnicities for one individual. However, for the purpose of evaluating classification accuracy on a data set, we (and other studies) consider only a single reported ethnicity; otherwise, the number of class labels would be exponential in the number of classes, and we would have no way to verify a correct prediction of multiple ethnicities since most data sources have only a single ethnicity associated with a name.

3. Methodology

As discussed, the main insights we make seem to have been often overlooked in past approaches to this problem. The insights are that most names are somewhat common, and simple patterns based on popularity statistics can be used to resolve uncertainty when classifying ambiguous names. To support the point that most names are common, we point out that very few individuals among the registered voters in Florida (0.5%) have both first and last names that appear fewer than 5 times; that is, 68,260 out of 13,043,270 names (Sood, 2017). Moreover, only 109,624 individuals (0.84%) have both first and last names that appear fewer than 10 times, and 141,858 (1.1%) appear fewer than 15. When we see a last name such as ‘Patel,’ we know it is Indian with high probability because we have seen it before. Similarly, the names ‘Choi’ and ‘Chen’ are obvious, as well as ‘Rodriguez’ and ‘Nguyen.’ Even if a name is not extremely popular, chances are that there is at least one person in the training set who shares either a first or last name with it. Our features embody this insight.

3.1. Features

Let us discuss the features. For each observation $i \in \{1, \ldots, N\}$, we decompose the name associated with it into its components: first name (${\rm FN}_i$) and last name (${\rm LN}_i$). Using that decomposition, we construct our features as follows:

  • Probability of observing a name for a given ethnic group. Let $\mathcal{R}$ be the set of all possible ethnic groups, and let $\mathcal{M}$ represent the set of all the unique instances of a first name in the data. For each first name $m \in \mathcal{M}$, we ask: how likely is it to observe $m$ in ethnicity $r \in \mathcal{R}$? Formally, for each observation $i$ in our data, we define the following quantities:

    $$\Psi_{i, r} = \sum_{\textrm{first name } m \in \mathcal{M}} \bigg( 1\kern-0.25em\text{l}\{ {\rm FN}_i = m \} \times \frac{\zeta_{m, r}}{N_m + 1} \bigg)$$

    where $N_m = \sum_{i=1}^N 1\kern-0.25em\text{l}\{ {\rm FN}_i = m \}$ is the number of times we have seen first name $m$ and $\zeta_{m, r} = \sum_{i=1}^N 1\kern-0.25em\text{l}\{ {\rm FN}_i = m \textrm{ \& } {\rm ethnicity}_i = r \}$ is the number of times we have seen first name $m$ with ethnicity $r$. We add 1 in the denominator for numerical stability. Therefore, $\Psi_{i, r}$ measures how often the first name of observation $i$ has ethnicity $r$. For instance, for name $m$ = ‘Carlos’ and ethnicity $r$ = ‘Hispanic,’ $\zeta_{m, r}$ is the number of people named Carlos in the training set who are Hispanic, so $\Psi_{i, r}$ is (up to the added 1 in the denominator) the fraction of people with first name Carlos who are Hispanic.


    Similarly, for each last name $w \in \mathcal{W}$ (where $\mathcal{W}$ is the set of all unique instances of a last name in the data), we define

    $$\Phi_{i, r} = \sum_{\textrm{last name } w \in \mathcal{W}} \bigg( 1\kern-0.25em\text{l}\{ {\rm LN}_i = w \} \times \frac{\kappa_{w, r}}{N_w + 1} \bigg),$$

    where $N_w = \sum_{i=1}^N 1\kern-0.25em\text{l}\{ {\rm LN}_i = w \}$ and $\kappa_{w, r} = \sum_{i=1}^N 1\kern-0.25em\text{l}\{ {\rm LN}_i = w \textrm{ \& } {\rm ethnicity}_i = r \}$. For instance, for last name $w$ = ‘Salah’ and $r$ = ‘Black,’ $\Phi_{i, r}$ is the fraction of people with last name ‘Salah’ that are Black. Note that $\Psi_{i, r}$ (and $\Phi_{i, r}$) is defined for each unique instance of a first (last) name, which means that the value of this feature is the same for all observations $i$ such that ${\rm FN}_i = m$ (${\rm LN}_i = w$).

    In our experiments, the naming convention for these features is as follows:

    $$\texttt{probability}\_\langle\texttt{ethnicity}\rangle\_\langle\texttt{name}\_\texttt{component}\rangle,$$

    where ethnicity is one of four—asian, black, hispanic, white—and name_component is either first_name or last_name.

  • Probability features for the first and the last four letters of a name component. The probability features defined above are replicated to form additional features for the first four letters of a component of a name. We create another set of features for the last four letters of the name component (both first and last name).

    In our experiments, the naming convention for these features is

    $$\texttt{probability}\_\langle\texttt{ethnicity}\rangle\_\langle\texttt{name}\_\texttt{component}\rangle\_\langle\texttt{name}\_\texttt{part}\rangle,$$

    where name_part is either f4 or l4 depending on whether we choose the first four letters (f4) or the last four letters (l4) of a name component.

  • Best evidence feature of a name. We create features that compare probability features for the first and last names and select the one that has the most solid evidence in favor of a particular ethnic group rr. For each observation ii,

    $$\texttt{best\_ev}_{i, r} = \max (\Psi_{i, r}, \Phi_{i, r}),$$

    where $\Psi_{i, r}$ and $\Phi_{i, r}$ are as defined above.

    For example, for the name ‘Tony Chang,’ the maximum probability feature would consider that the last name ‘Chang’ has a larger value of the probability feature for the Asian ethnicity class as compared with the first name ‘Tony.’ The naming convention for this feature set is:

    $$\texttt{best\_evidence\_}\langle \texttt{ethnicity}\rangle.$$
  • Other basic statistics are included as features, such as an indicator for whether the name contains a dash (e.g., Sharonda Aasiya-Bey), as well as the number of sub-names connected by spaces or dashes (e.g., Sharon Pena-De La Paz will have the feature value set as four). We code this feature to take the following values: 1, 2, 3, and 4 or more. The naming convention for these features is dash_indicator and n_sub_names.
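The feature definitions above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the authors' implementation: the helper names (`fit_tables`, `featurize`, `VIEWS`), the lowercasing of names, and the toy records are all hypothetical. The feature names follow the naming conventions stated in the list, and the construction yields 6 x 4 probability features, 4 best-evidence features, and 2 basic statistics.

```python
import re
from collections import Counter

ETHNICITIES = ("asian", "black", "hispanic", "white")

# Each "view" selects the string whose popularity is counted.
VIEWS = {
    "first_name": lambda first, last: first,
    "last_name": lambda first, last: last,
    "first_name_f4": lambda first, last: first[:4],
    "first_name_l4": lambda first, last: first[-4:],
    "last_name_f4": lambda first, last: last[:4],
    "last_name_l4": lambda first, last: last[-4:],
}


def fit_tables(records):
    """Build count tables from (first, last, ethnicity) training records."""
    tables = {view: (Counter(), Counter()) for view in VIEWS}
    for first, last, ethnicity in records:
        for view, extract in VIEWS.items():
            totals, by_ethnicity = tables[view]
            key = extract(first.lower(), last.lower())
            totals[key] += 1                     # N_m (or N_w)
            by_ethnicity[(key, ethnicity)] += 1  # zeta_{m,r} (or kappa_{w,r})
    return tables


def featurize(first, last, tables):
    """Return the feature dictionary for one (first, last) name."""
    feats = {}
    for view, extract in VIEWS.items():
        totals, by_ethnicity = tables[view]
        key = extract(first.lower(), last.lower())
        for r in ETHNICITIES:
            # Count with ethnicity r divided by (total count + 1).
            feats[f"probability_{r}_{view}"] = by_ethnicity[(key, r)] / (totals[key] + 1)
    for r in ETHNICITIES:
        # Best evidence: the larger of the first- and last-name probabilities.
        feats[f"best_evidence_{r}"] = max(
            feats[f"probability_{r}_first_name"], feats[f"probability_{r}_last_name"]
        )
    full_name = f"{first} {last}"
    feats["dash_indicator"] = int("-" in full_name)
    sub_names = [part for part in re.split(r"[\s-]+", full_name) if part]
    feats["n_sub_names"] = min(4, len(sub_names))  # coded as 1, 2, 3, or 4 or more
    return feats


# Toy records, invented for illustration only.
toy_records = [
    ("Tony", "Chang", "asian"),
    ("Tony", "Smith", "white"),
    ("Tony", "Brown", "white"),
    ("Mei", "Chang", "asian"),
    ("Carlos", "Garcia", "hispanic"),
]
tables = fit_tables(toy_records)
feats = featurize("Tony", "Chang", tables)
```

In this toy example, the last name ‘Chang’ carries stronger evidence for the Asian class than the first name ‘Tony,’ so `best_evidence_asian` equals the last-name probability.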

While we tried numerous other statistics (e.g., 3-grams for the first three letters and the last three letters) as possible features, none of these classes of features (although being large in size) worked as well as those provided here. In general, we chose to either include or exclude whole classes of features rather than choosing individual features in our final model.

3.2. Simple Patterns

Our simple model is a multinomial logistic regression classifier using the collection of features provided above. This model produces scores for each ethnicity, namely $P(\textrm{Asian} | \textrm{name})$, $P(\textrm{Hispanic} | \textrm{name})$, and so on. Each score is a simple pattern that is a sparse linear combination of the features. For each name, our model will predict the ethnicity with the highest probability across all ethnic groups. We train the model using a small subset of data (perhaps 20,000 names) and leverage the full training set to compute probability features for each name in the test set.
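To make the prediction step concrete, the sketch below turns per-ethnicity linear scores into probabilities with a softmax and predicts the most probable ethnicity. The weights, biases, and feature values are hypothetical placeholders chosen for illustration; they are not the learned coefficients of the model.

```python
import math


def predict_proba(features, weights, biases):
    """Softmax over per-ethnicity linear combinations of the features."""
    scores = {
        r: biases[r] + sum(w * features.get(name, 0.0) for name, w in weights[r].items())
        for r in weights
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {r: math.exp(s - top) for r, s in scores.items()}
    total = sum(exps.values())
    return {r: e / total for r, e in exps.items()}


def predict(features, weights, biases):
    """Return the ethnicity with the highest predicted probability."""
    proba = predict_proba(features, weights, biases)
    return max(proba, key=proba.get)


# Hypothetical sparse weights: each class uses only a few features.
weights = {
    "asian": {"probability_asian_last_name": 6.0, "probability_asian_first_name": 2.0},
    "black": {"probability_black_first_name": 5.0, "probability_black_last_name": 2.0},
    "hispanic": {"probability_hispanic_first_name": 4.0, "probability_hispanic_last_name": 4.0},
    "white": {"probability_white_first_name": 3.0, "probability_white_last_name": 3.0},
}
biases = {"asian": 0.0, "black": 0.0, "hispanic": 0.0, "white": 0.0}

# Hypothetical feature values for a name such as 'Eric Chang'.
features = {"probability_white_first_name": 0.8, "probability_asian_last_name": 0.9}
proba = predict_proba(features, weights, biases)
label = predict(features, weights, biases)
```

With these placeholder weights the Asian last-name evidence outweighs the White first-name evidence, which mirrors the pattern described in the Introduction.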

Before we go on, we must discuss how to handle indistinguishables.

3.3. Indistinguishables

We define indistinguishables as names where both the first and last names are approximately equally popular in multiple ethnicities.

Let $\mathbf{\Psi}_i = \{\Psi_{i, \textrm{Asian}}, \Psi_{i, \textrm{Black}}, \Psi_{i, \textrm{Hispanic}}, \Psi_{i, \textrm{White}} \}$ and define $\mathbf{\Phi}_i$ analogously. Then, the name associated with the $i$th row in the data is classified as indistinguishable between ethnicities $r$ and $r^\prime$ using the following rule:

$\textbf{if} \ \bigg( |\Psi_{i, r} - \Psi_{i, r^\prime}| \leq 0.15 \textrm{ and } |\Phi_{i, r} - \Phi_{i, r^\prime}| \leq 0.15 \bigg)$

that is, the first and last names are similarly popular in ethnicities $r$ and $r^\prime$

$\textbf{and}$

$\bigg(\max \mathbf{\Psi}_i - \min \{ \Psi_{i, r}, \Psi_{i, r^\prime} \} \leq 0.15 \textrm{ and } \max \mathbf{\Phi}_i - \min \{ \Phi_{i, r}, \Phi_{i, r^\prime} \} \leq 0.15 \bigg)$

that is, the most likely class is not far from either $r$ or $r^\prime$ in popularity

$\textbf{then}$ $i$'s name is classified as $r$-and-$r^\prime$ Indistinguishable

$\textbf{else}$ $i$'s name is NOT $r$-and-$r^\prime$ Indistinguishable

$\textbf{end if}$

Here, 0.15 represents the level of closeness between the relative probabilities of observing a first name and a last name across ethnic groups. If the conditions are met, we say that the name is $r$ and $r^\prime$ (with $r \neq r^\prime$) indistinguishable. As noted above, our rule classifies a full name as indistinguishable only if both its first name and its last name are indistinguishable. A list of sample indistinguishable first names and last names is provided in Tables 1 and 2, respectively.
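The rule can be written as a short function. The sketch below is a direct transcription of the two conditions, with the 0.15 cutpoint exposed as a parameter; the function names and the probability values in the example are hypothetical and serve only to illustrate the logic.

```python
from itertools import combinations


def is_indistinguishable(psi, phi, r, r_prime, closeness=0.15):
    """Apply the two-part rule to one name for the pair (r, r_prime).

    psi and phi map each ethnicity to the first-name and last-name
    probability features of the name, respectively.
    """
    # Condition 1: first and last names are similarly popular in r and r_prime.
    similar = (
        abs(psi[r] - psi[r_prime]) <= closeness
        and abs(phi[r] - phi[r_prime]) <= closeness
    )
    # Condition 2: the most likely class is not far from either r or r_prime.
    near_top = (
        max(psi.values()) - min(psi[r], psi[r_prime]) <= closeness
        and max(phi.values()) - min(phi[r], phi[r_prime]) <= closeness
    )
    return similar and near_top


def indistinguishable_pairs(psi, phi, closeness=0.15):
    """List every ethnicity pair for which the name is indistinguishable."""
    return [
        (r, r_prime)
        for r, r_prime in combinations(sorted(psi), 2)
        if is_indistinguishable(psi, phi, r, r_prime, closeness)
    ]


# Hypothetical probability features for a single name.
psi = {"asian": 0.02, "black": 0.45, "hispanic": 0.03, "white": 0.49}
phi = {"asian": 0.01, "black": 0.50, "hispanic": 0.02, "white": 0.46}
pairs = indistinguishable_pairs(psi, phi)
```

In this example the Asian and Hispanic probabilities are close to each other, so the first condition holds for that pair, but both are far below the most likely class, so the second condition rules the pair out. Only the Black-White pair is flagged.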

Table 1. Indistinguishable first names.

| Indistinguishable First Name | Label |
| --- | --- |
| Addie, Dave, Erica, Herman, Tasha, Ruby, Sonya, Tony | Black-White |
| Edwin, Nicolas, Cecilia, Rudy, Jaime, Esther, Claudia, Johanna | Hispanic-White |
| Adil, Dimple, Mala, Leena, Cintra, Sherin, Marilou, Shan, Neena | Asian-White |
| Ria, Zahra, Kamal, Zia, Salim, Kimmy, Kamala, Hasan | Asian-Black-White |
| Monica, Sasha, Vanessa, Veronica, Franklyn, Bertha, Jocelyn | Black-Hispanic-White |
| Clarita, Mirza, Ernani, Nazira, Tessy | Asian-Hispanic-White |
| Gracy, Jaison, Lerissa, Romeo, Nadir, Zorina, Yasir | Asian-Black-Hispanic-White |


Table 2. Indistinguishable last names.

| Indistinguishable Last Name | Label |
| --- | --- |
| Brown, Clarke, Green, Harris, Lewis, Richardson, Woods, Mitchell | Black-White |
| Antonio, Barba, Bosch, Carlo, Francisco, Da Silva, Leonardo | Hispanic-White |
| Aziz, Jung, Kamal, Lau, Ling, Mohan, Sagar, Yeo, Durrani | Asian-White |
| Hasan, Ismail, Philip, Bacchus, Pasha, Yasin, Salim | Asian-Black-White |
| Bonnet, Camara, Laborde, Musa, Roger, Schettini, Bodden | Black-Hispanic-White |
| Ancheta, Ayub, Carino, Domingo, Pablo, Quinto | Asian-Hispanic-White |
| Almeda, Gokool, Nabor, Tibo | Asian-Black-Hispanic-White |

Note that the results are not particularly sensitive to modeler choices. To illustrate this, we present a contour plot in Figure 1 of the total number of indistinguishables as a function of our closeness cutpoint for a name to be classified as indistinguishable (horizontal axis) and the minimum frequency of a name (vertical axis). The figure shows that the values we chose are in a range where the number of indistinguishables is not very sensitive to the closeness parameter. In particular, for any level of the closeness parameter, the number of indistinguishables does not change much even if we remove less-common names from the list of indistinguishables. For example, when the closeness parameter is equal to 0.15, the number of indistinguishables falls only from 153,152 to 152,131 when indistinguishable names that appear 30 times or fewer are removed.

As we show in Section 4.3, indistinguishables are important for reducing bias when making predictions across populations with different underlying distributions of ethnicity.

Figure 1. Parameter sensitivity of indistinguishables.
Note. The figure presents the contour plot for the number of indistinguishable names as a function of the minimum frequency of a name and the closeness parameter.

3.4. Scalability Benefit for Learning With Probability and Best-Evidence Features

The full set of features outlined above consists of 30 total features. We compute these features, train on part of the training set, and calculate indistinguishables on the training set. During testing, we first identify the indistinguishables and either remove them or treat them separately (as discussed in Section 4.3). For the rest of the test set, the features are computed using the full training set, and classified using our trained classifier.

Since we have only 30 features, we are able to run multinomial logistic regression for a fairly large number of observations. As we will show in the experimental section, test performance tends to stabilize after approximately 20,000 names have been included in the training set used for multinomial logistic regression (with the rest of the names in the training set used only for computing features). Beyond 20,000 names, the test accuracy of the learned classifier tends not to improve. Interestingly, even though the size of the training set used for multinomial logistic regression does not affect test accuracy, the part of the training set that is not used for multinomial logistic regression (and used only for creating features) does impact test accuracy: more names in the training set means better whole-name probability features. In other words, even though the learned weights do not improve with more training data, the quality of the features themselves improves dramatically with more training data. Thus, in practice, we would train on the largest number of units feasible for running multinomial logistic regression (at least 20,000, but perhaps not too much more), but we will have used the full database to compute the probability and best-evidence features for any given name.

Hence, our technique improves the scalability of learning using probability features, summarized as follows: use the whole training set to compute the probability and best-evidence features for each observation. Since we have a large training set, these features will be high-quality. Then train a model on the largest number of units feasible for running multinomial logistic regression, even though it is much smaller than the total number of units used for computing the probability and best-evidence features. Use the learned weights and the full-sample-derived features for testing.

Let us give an example to show why this works. Consider the name ‘William Oyuela.’ The name ‘William’ is popular among White people, whereas ‘Oyuela’ is a Hispanic last name that is not popular, and might not appear in a database of 20,000 names used to train the classifier. However, ‘Oyuela’ does appear in our larger database of 80,000 names. The trained Hispanic classifier on the subset of 20,000 names places weights on the Hispanic last name feature and the max first/last name feature. Now, when we calculate the probability features for ‘Oyuela’ using the 80,000-name database, we find that it is 100% Hispanic, whereas the name ‘William’ is substantially less than 100% White (i.e., 88% of people named ‘William’ are White in our training data set). The Hispanic last name probability and the best evidence feature would be large, and the name would receive a high-probability Hispanic rating. Thus, even though we trained the weights on only 20,000 names, we can still leverage the much larger database to correctly classify names that were not among these 20,000. The procedure is extremely computationally efficient, involving only a small multinomial logistic regression problem and calculation of probability and best-evidence features—no neural network or computationally-intensive calculation. We will revisit computation in the experimental section.
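The point of this example can be reproduced on toy data. In the sketch below, the records, their counts, and the helper name `last_name_share` are invented for illustration. A rare last name is absent from the small subset used to fit the weights, yet its probability feature is still informative because the feature is computed from the full training set.

```python
def last_name_share(records, last, ethnicity):
    """kappa_{w,r} / (N_w + 1), computed from the given records."""
    n_w = sum(1 for _, l, _ in records if l == last)
    kappa = sum(1 for _, l, e in records if l == last and e == ethnicity)
    return kappa / (n_w + 1)


# Toy stand-in for the full training set (illustrative only).
full_training = [("William", "Smith", "white")] * 20 + [("Maria", "Oyuela", "hispanic")] * 3

# Toy stand-in for the small subset used to fit the regression weights.
# In practice this would be a random sample; a slice keeps the example deterministic.
fit_subset = full_training[:20]

share_in_subset = last_name_share(fit_subset, "Oyuela", "hispanic")
share_in_full = last_name_share(full_training, "Oyuela", "hispanic")
```

Here the subset alone gives the rare last name a feature value of zero, while the full set gives it a high Hispanic share, which the weights learned on the subset can then use at test time.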

Now that we have described the methodology, let us move on to experiments.

4. Experiments

Data. We use the first names and last names from the voter files from Georgia (2020), Florida (2017), and North Carolina (2017). We obtained these raw voter files from Georgia and North Carolina from the corresponding electoral offices. The voter file from Florida is the same one used by Sood (2017), and is a snapshot from February 7, 2017. In addition to names, these voter files contain detailed demographic and electoral information (turnout and registration) for all the registered voters in each state at the time they were collected. Together, these files account for more than 26.5 million voters (more than 86% of the combined voting-eligible population of the three states—North Carolina, Florida, and Georgia—in 2020 according to McDonald [2021]).

The Florida data set has $\sim$13 million observations, and the distribution of ethnicities is 1.95% Asian, 14.21% Black, 16.70% Hispanic, and 67.14% White. The North Carolina data set has $\sim$6.6 million observations, and the distribution of ethnicities is 1.24% Asian, 23.4% Black, 2.58% Hispanic, and 72.78% White. The Georgia data set has $\sim$6.5 million observations, and the distribution of ethnicities is 2.74% Asian, 33.83% Black, 3.82% Hispanic, and 59.61% White. We note that Florida is demographically different from Georgia and North Carolina, especially with respect to its large Hispanic population. Note that in our analyses below, we remove those observations with a missing first name, last name, or ethnicity information.

Training of EthnicIA. Despite the enormous size of the data sets, experiments with EthnicIA are easily possible on a standard laptop using multinomial logistic regression with the features described above. Multinomial logistic regression was implemented in PyTorch using the Adam optimizer (Kingma & Ba, 2015), which allowed us to scale efficiently to the size of our data set. There are no parameters to tune in the model (we used 500,000 observations and 30 features, which is in a regime where regularization terms would be overwhelmed and would not be needed to prevent overfitting), so nested parameter tuning with a validation set is not necessary. The learning rate was set at the standard rate of 0.001, and we ran for ten epochs for a training set of size 500,000 and batch size 1,024. In Appendix A Section A.4, we show the training loss curve for our EthnicIA model trained on the Florida data and show that we generally converge after ten epochs.

To optimize the model for balanced accuracy, we weigh some of the classes more than others. ‘Accuracy’ refers to the ratio of correctly classified examples to the total number of examples, while ‘balanced accuracy’ refers to the average of the accuracy over the four ethnicities. To adjust the weight for balanced accuracy optimization for the classifier corresponding to each ethnicity (e.g., the classifier that differentiates Asian names from non-Asian names), we add a weight to each positive example while calculating the loss, and the weight is equal to the ratio of the number of negative examples to the number of positive examples.
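The two metrics and the class weight can be illustrated as follows. This is a minimal sketch under one interpretive assumption: ‘the average of the accuracy over the four ethnicities’ is read as the mean of the per-class accuracies (the fraction of each true class that is classified correctly). The function names and the toy labels are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class accuracy, computed over the classes present in y_true."""
    correct, total = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

def positive_weight(y_true, cls):
    """Weight on positive examples of `cls`: number of negatives / number of positives."""
    n_pos = sum(t == cls for t in y_true)
    n_neg = len(y_true) - n_pos
    return n_neg / n_pos

# Hypothetical toy labels: a classifier that always predicts the majority class.
y_true = ["White"] * 8 + ["Asian"] * 2
y_pred = ["White"] * 10

print(accuracy(y_true, y_pred))
print(balanced_accuracy(y_true, y_pred))
print(positive_weight(y_true, "Asian"))
```

In this toy case the majority-class predictor has high accuracy but low balanced accuracy, which is the gap that weighting the positive examples is meant to close.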

Our model (with 500,000 samples) trains in less than 5 minutes on a standard Macintosh laptop computer with a 2.4 GHz Quad-Core Intel Core i5 processor and 16 GB of RAM.

Baselines. As discussed earlier, we compared with the most relevant baselines, namely those of Sood and Laohaprapanon (2018) and Treeratpituk and Giles (2012). Other available baselines either do not have publicly available code (and would be too complex to reproduce), or use information besides names. The source code for Ambekar et al.'s (2009) hidden Markov models–based decision tree model is not available and would be difficult to reproduce. However, Ambekar et al.'s (2009) character features are similar to those of Treeratpituk and Giles (2012), who use both character and phonetic features, and who we do compare with. Treeratpituk and Giles's (2012) code is not publicly available but the description of the method was clear enough that we were able to reproduce it.

Each of our voter lists is large, so methods such as multinomial logistic regression on many thousands of n-gram features (which would be required for Treeratpituk and Giles's [2012] model) are not easily feasible. Due to memory constraints, we were able to train Treeratpituk and Giles’s models only up to 200,000 observations with 100,000 features and 512 GB of memory allocated. Training this baseline model takes 8 hours.

4.1. Performance Results

In this section, we conduct an experimental comparison of our method to ethnicolr, which is the LSTM model of Sood and Laohaprapanon (2018). Since Sood and Laohaprapanon (2018) did not remove indistinguishables, we compute performance both before and after removing them. To avoid issues with training the LSTM, we used the designers’ trained version instead of training it ourselves. Since the LSTM was trained on the Florida voter registration data, we also trained on these data. The choices made regarding indistinguishables actually have a large impact on the results because Black names and White names often overlap (e.g., Jessie Brown, Tony Harris, Johnny Jones), and often cannot be easily classified. Both methods were tested on North Carolina voter registration data, which has a different underlying distribution than the Florida data and just 55% overlap in terms of full names.

Table 3 shows a comparison of our model with the ethnicolr model on the North Carolina test data, without indistinguishables. We aim to optimize balanced accuracy, but we also present overall accuracy; ethnicolr aims to optimize accuracy rather than balanced accuracy. Balanced accuracy is an important metric, as it treats all ethnicities equally, rather than just favoring the majority ethnicity (which in the United States is White).

The result is that our EthnicIA model achieves a 10.97 percentage point improvement in balanced accuracy over ethnicolr (76.27% vs. 65.30%). As a result, EthnicIA's overall (non-balanced) accuracy drops by 5.8 percentage points (74.37% vs. 80.17%), essentially because it predicts ‘White’ less often.

Table 3. Performance Results of EthnicIA vs. ethnicolr.

Note: The numbers in the upper tables are counts of names in North Carolina (after removing indistinguishable names), where the model was trained on data from Florida. EthnicIA is our proposed model and ethnicolr is the model proposed by Sood and Laohaprapanon (2018). EthnicIA outperforms ethnicolr, with EthnicIA attaining a balanced accuracy of 76.27%, compared to ethnicolr’s 65.30%.

In Table 4, we show the difference in the number and percentage of names for which the ethnicity classification is different between the two models. For instance, 30.41% of Asian names were misclassified as White by ethnicolr, whereas EthnicIA misclassified only 10.68% as White, which is an improvement of 19.73% (upper right of Table 4). In fact, EthnicIA classified 18.57% more Asian names correctly than ethnicolr (upper left of Table 4). EthnicIA is able to achieve significant improvements in the classification of names from minority ethnicities, essentially at the expense of misclassifying more White names (it is easy to guess that a name is White since it is the majority class). In particular, we are able to correctly classify 18.57% more Asian names, 37.45% more Black names, and 8.37% more Hispanic names.

Table 4. Confusion Matrices for EthnicIA vs. ethnicolr.

Note: EthnicIA is our proposed model and ethnicolr is the model proposed by Sood and Laohaprapanon (2018). The table presents the difference in the confusion matrices between EthnicIA vs. ethnicolr. Green indicates better results from EthnicIA than ethnicolr, red means worse results. Worse results were only obtained by EthnicIA on Black/White names; EthnicIA predicts Black more often when there is uncertainty than ethnicolr. The left table contains raw numbers, the right table converts to percentages.

Indistinguishability is a characteristic introduced in this article, so the results for ethnicolr presented above use a technique developed in this work. To provide a more faithful comparison with the original ethnicolr work, we present ethnicolr results on the full North Carolina test data set—without removing names we recognize as indistinguishable—in Appendix A Section A.1.

It is possible that for some applications, accuracy might be more helpful than balanced accuracy. In Appendix A Section A.2, we removed the balancing weights and optimized for accuracy, and the accuracy values between EthnicIA and ethnicolr are comparable (81.31% for EthnicIA and 80.17% for ethnicolr), but in that case, EthnicIA achieves a better balanced accuracy (69.35% for EthnicIA and 65.30% for ethnicolr). Thus, if one is interested in the combination of accuracy and balanced accuracy, EthnicIA may have an advantage.

In Appendix A Section A.3, we show performance results of various models, trained on data sets from different states. For instance, we trained our model on the combination of Florida and Georgia, and tested it using the data set from North Carolina, whose results are in Table A3 in Appendix A Section A.3. The results from this experiment follow a similar trend to the one we have discussed in this section.

In Appendix A Section A.4, we show the training loss curve for our EthnicIA model trained on the Florida data and show that we generally converge after 10 epochs. In Appendix A Section A.5, we show ethnicity-specific Receiver Operating Characteristic (ROC) curves and Area Under the ROC Curve (AUC) values that reflect how well we perform on identifying certain ethnicity groups. Finally, in Appendix A Section A.6, we show that our model predictions are well-calibrated.

4.2. Feature Importance

To determine which features are important for different classifiers, we plotted the coefficient values of the classifiers for each feature in Figure 2. For a given ethnicity classifier, features with large positive coefficients contribute heavily to a prediction that the name belongs to that ethnicity. A large negative coefficient on a feature indicates that if the name possesses that feature, then it is less likely to belong to that ethnicity.

In Figure 2, we observe that the probability features on either first name or last name are quite useful in differentiating among ethnicities (reflected by large bars for these features). For the Asian vs. non-Asian classifier, we have a large positive bar for feature best_evidence_asian, while the bars for the same feature of other classifiers are negative. This shows that the feature is quite useful in differentiating Asian names from other ethnicities. Patterns along similar lines are observed for other best_evidence features. The negative bar for feature dash_indicator for the Asian vs. non-Asian classifier (together with the positive bars for other classifiers for the same feature) indicates that the presence of a dash makes a name less likely to be an Asian name. In general, the features corresponding to the first four letters or last four letters of a name are less important, because we generally have better features that leverage the full (first or last) name. However, the four-letter features play a significant role when a rare name is encountered. If we have rarely or never seen a first name or last name before, the model tells us that its first and last four letters can be useful.

We conclude that the probability features and best evidence features tend to be extremely useful in distinguishing between ethnicities. Lexical features (dash indicator, number of sub names) are also shown useful for some ethnicities.
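To make the role of the coefficients concrete, the sketch below shows how a positive or negative coefficient moves a predicted probability. It assumes a softmax over per-ethnicity linear scores, and every coefficient value is hypothetical (chosen only to mirror the signs described for Figure 2, not the fitted values); only the feature names best_evidence_asian and dash_indicator come from the text.

```python
import math

def softmax(scores):
    """Convert a dict of real-valued scores into probabilities that sum to 1."""
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: v / z for k, v in exps.items()}

def predict(weights, features):
    """Linear score per ethnicity (sum of coefficient * feature), then softmax."""
    scores = {
        eth: sum(w.get(f, 0.0) * x for f, x in features.items())
        for eth, w in weights.items()
    }
    return softmax(scores)

# Hypothetical coefficients, for illustration only.
weights = {
    "Asian":    {"best_evidence_asian": 4.0,  "dash_indicator": -1.0},
    "Black":    {"best_evidence_asian": -1.5, "dash_indicator": 0.5},
    "Hispanic": {"best_evidence_asian": -1.5, "dash_indicator": 0.5},
    "White":    {"best_evidence_asian": -1.5, "dash_indicator": 0.5},
}

low = predict(weights, {"best_evidence_asian": 0.1, "dash_indicator": 0.0})
high = predict(weights, {"best_evidence_asian": 0.9, "dash_indicator": 0.0})
with_dash = predict(weights, {"best_evidence_asian": 0.9, "dash_indicator": 1.0})

print(low["Asian"], high["Asian"], with_dash["Asian"])
```

Raising a feature with a positive coefficient increases the corresponding class probability, and switching on a feature with a negative coefficient decreases it, which is how the bars in a coefficient plot can be read.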

Figure 2. Feature importance for EthnicIA.

Note. To determine which features are important, we plotted the coefficient values of EthnicIA. For a given ethnicity classifier, features with large positive (negative) coefficients contribute heavily (less) to a prediction that the name belongs to that ethnicity.

4.3. Bias Correction for Indistinguishables

The assignment of ethnicities to indistinguishables can play an important role in estimation bias. As an example: consider a model trained to optimize accuracy on a population that is highly skewed White. This classifier would learn that most indistinguishable names should be classified as White. However, when transferring this classifier to work in a different region that has a larger minority population, the classifier would still classify the indistinguishables as White, creating a severe bias. In particular, it could easily create a situation where all the indistinguishables were classified as White, many of whom are actually minorities. (Conversely, if the model was trained to optimize balanced accuracy on a population that is mostly White, it would tend to classify many of the indistinguishable names as Black, Hispanic, or Asian. Either way, classification of indistinguishables would be biased.)

The fix we propose for this problem is to, first, identify indistinguishables and set them aside, second, to make predictions for the distinguishables, and, third, classify the indistinguishables probabilistically using the following expression:

$$
\begin{aligned}
P(r|\textrm{Indist.}, \textrm{Test.}) &\approx \frac{P(\textrm{Indist.}|r, \textrm{Tr.})\,P(r|\textrm{Distin.}, \textrm{Test.})}{P(\textrm{Indist.}, \textrm{Test.})}, \textrm{ where the denominator is:} && (4.1) \\
P(\textrm{Indist.}, \textrm{Test.}) &\approx \sum_{\textrm{ethnicities } r'} P(\textrm{Indist.}|r', \textrm{Tr.})\,P(r'|\textrm{Distin.}, \textrm{Test.}), && (4.2)
\end{aligned}
$$

where $P(\textrm{Indist.}|r, \textrm{Tr.})$ is the fraction of people of ethnicity $r$ in the training set whose names are indistinguishable, and $P(r|\textrm{Distin.}, \textrm{Test.})$ is the fraction of ethnicity $r$ among the predictions made on the distinguishables in the test set in the second step. This model arises from the following derivation. For any given population:

$$
\begin{aligned}
P(r|\textrm{Indist.}) &= \frac{P(\textrm{Indist.}|r)\,P(r)}{P(\textrm{Indist.})} \\
&= \frac{P(\textrm{Indist.}|r)\,P(r)}{\sum_{\textrm{ethnicities } r'} P(\textrm{Indist.} \cap r')} \\
&= \frac{P(\textrm{Indist.}|r)\,P(r)}{\sum_{\textrm{ethnicities } r'} P(\textrm{Indist.}|r')\,P(r')}.
\end{aligned}
$$

The above steps are for any population, so let us choose the test population:

$$
P(r|\textrm{Indist.}, \textrm{Test.}) = \frac{P(\textrm{Indist.}|r, \textrm{Test.})\,P(r|\textrm{Test.})}{\sum_{\textrm{ethnicities } r'} P(\textrm{Indist.}|r', \textrm{Test.})\,P(r'|\textrm{Test.})} \qquad (4.3)
$$

We assume that, given a person’s ethnicity, regardless of where a person is located, they have the same probability of having an indistinguishable name. That is, for each ethnicity $r$, we have:

$$
P(\textrm{Indist.}|r, \textrm{Test.}) = P(\textrm{Indist.}|r, \textrm{Tr.}) \qquad (4.4)
$$

We also assume that, since the fraction of indistinguishables is small compared to the population, we can estimate the fraction of each ethnicity based on the distinguishables:

$$
P(r|\textrm{Test.}) \approx P(r|\textrm{Distin.}, \textrm{Test.}) \qquad (4.5)
$$

Plugging (4.4) and (4.5) into (4.3) yields (4.1) and (4.2).
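The correction in (4.1) and (4.2) amounts to a single normalization step, sketched below. This is an illustration rather than the authors' code: the function name, the two-ethnicity setup, and all input values are hypothetical.

```python
def indistinguishable_posterior(p_indist_given_r_train, p_r_given_distin_test):
    """Apply (4.1)-(4.2): P(r | Indist., Test.) for every ethnicity r.

    p_indist_given_r_train: P(Indist. | r, Tr.), estimated on the training set.
    p_r_given_distin_test:  P(r | Distin., Test.), from predictions on the
                            distinguishable names in the test set.
    """
    numerators = {
        r: p_indist_given_r_train[r] * p_r_given_distin_test[r]
        for r in p_indist_given_r_train
    }
    denominator = sum(numerators.values())  # right-hand side of (4.2)
    return {r: v / denominator for r, v in numerators.items()}

# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
p_indist_train = {"Black": 0.04, "White": 0.01}
p_r_distin_test = {"Black": 0.25, "White": 0.75}

posterior = indistinguishable_posterior(p_indist_train, p_r_distin_test)

# Expected number of indistinguishables per ethnicity in a test set of 1,000 such names.
n_indist_test = 1000
expected_counts = {r: p * n_indist_test for r, p in posterior.items()}
print(posterior)
print(expected_counts)
```

With these toy inputs the numerators are 0.01 and 0.0075, so the posterior is 4/7 Black and 3/7 White: the split depends on the test population's composition rather than on a single hard label for every indistinguishable name.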

Thus, taking indistinguishables into account helps reduce bias in predictions. As far as we know, indistinguishables have not been previously studied in this way, nor used in prior works to assist with bias reduction. In fact, other works may suffer substantial bias from assigning the majority class to all indistinguishable names. For example, when we train the model using data from Florida and test using data from North Carolina, we find that among the indistinguishables in North Carolina (24,859), our model predicts that none of them are White and 23,962 are Black. However, in North Carolina, the ratio of White to Black among the non-indistinguishables is 3.1 to 1. These differing statistics indicate that the model carries a bias present in the Florida data over to North Carolina. If we instead estimate the number of indistinguishables of each ethnicity using the correction in (4.1) and (4.2), we find that approximately 3,263 of these individuals should potentially be classified as White and 20,801 as Black; these predictions are closer to the corresponding true values: 7,227 White people and 16,967 Black people. Similarly, if we train the model using data from Georgia and test using data from North Carolina, our model predicts that among the 138,600 indistinguishables, none of them are White and 132,574 are Black. Using the proposed correction, we find that approximately 57,078 of these individuals should potentially be classified as White and 77,908 as Black; again, these numbers are closer to the truth (69,589 White and 66,961 Black).

4.4. Computation

In order to compare with Treeratpituk and Giles's (2012) model based on n-grams and phonetic features, we replicated their multinomial logistic regression model, and trained on our data set. We could not use the full training data set due to memory constraints (even on a larger system with 512 GB RAM), so we used a subset of 200,000 name-ethnicity pairs, which is the most we could handle. To create a comparison with EthnicIA, we optimized their model on balanced accuracy. We added regularization to their model to avoid overfitting, and tried a considerable range of $\ell_2$ regularization parameters. The number of features for this model was $\sim$100,000; the large number of features made it difficult to optimize the model’s coefficients and tune its parameters, and regardless of which value of the regularization parameter we chose, it still tended to overfit. Figure 3 shows that across a large range of regularization parameter values between $10^{-6}$ and $10^{-4}$, the training accuracies were substantially higher than test accuracies, indicating overfitting. As we increased the regularization parameter upwards of $10^{-2}$, both the train and test accuracy dropped, likely due to the bias caused by more regularization (see Figure 3). At these extreme regularization values, the model tends to classify test data names into rarer ethnicity categories. At the highest regularization value we tested, $10^{4}$, the model classifies almost every test data name as Asian. Training of the model of Treeratpituk and Giles (2012) takes around 8 hours to complete (compared to less than 5 minutes taken by our model).

Figure 3. N-gram models’ accuracy and regularization.

Note. Accuracy and balanced accuracy for Treeratpituk and Giles’s (2012) n-gram model as the regularization parameter changes. EthnicIA’s results are presented for comparison (note, there is no regularization for EthnicIA).

Training our model’s weights does not require much data. Interestingly, as discussed earlier, our model becomes better as more training data arrive, even if we do not retrain the model. This is because our probability features become more accurate as more data arrive, and we simply recompute the features. The probability and best-evidence features require only counts, that is, minimal computation. The accuracy and balanced accuracy are fairly stable when training on 20,000 or more observations (see Figure 4). Thus, our model is extremely computationally efficient, requiring only $\sim$20,000 names to train multinomial logistic regression coefficients (after that, the prediction accuracy is stable). All computations can be done easily, even on a standard laptop. We do not require training of a neural network. We do not require training of a multinomial logistic regression model that is large in any dimension. Only the computation of the features requires large data sets, and the complexity of these computations is linear in sample size and parallelizable (though parallelization is often not necessary even for huge data sets).

Figure 4. EthnicIA’s performance as a function of data size.

Note. Accuracy and balanced accuracy are fairly stable as more data are used for training the model’s coefficients.

4.5. Robustness to Domain Shift

Does the performance of EthnicIA suffer from being trained using data from ethnically heterogeneous states and tested on a state where White people are the vast majority? In this section, we perform exactly this experiment.

The three states we consider in the previous experiments are Southern states and, as noted above, while White people are the majority group in each state, other ethnic groups are relatively large.

In order to perform the experiment for this section, we need a state like Ohio, where the vast majority of the population is White. However, the only states that include information about ethnicity on their voter registration cards are Alabama, Florida, Georgia, Louisiana, Mississippi, North Carolina, Pennsylvania, and South Carolina (Imai & Khanna, 2016). The only non-Southern state on the list, Pennsylvania, does not release self-reported information about ethnicity. Therefore, to assess the extent to which marked differences across ethnic groups in the test and train data could lead to problems in performance for EthnicIA, we conduct the following experiment: in North Carolina, a large number of counties have more than 90% of their population being non-Hispanic White, so we use voter data coming from these counties as our test data (which results in a sample of more than 700,000 observations with 95% being White).1 We trained our model using data from 1) the remaining counties in North Carolina and 2) the Florida and Georgia voter files.

Our results indicate that when there are marked changes in the distribution of ethnic groups across train and test data, EthnicIA performs well in terms of balanced accuracy, achieving $\sim$70–71% balanced accuracy both when trained on the rest of North Carolina and when trained on both the Georgia and Florida data. For both training sets, EthnicIA also performs well in terms of accuracy (78% when trained on the rest of North Carolina, and 74% when trained on both the Georgia and Florida data). Note that EthnicIA is trained with weights such that balanced accuracy is optimized. Without such weights, we find that accuracy can be dramatically improved (to over 90%) at the cost of losses in balanced accuracy (which is 63–65%). Overall, our results show that EthnicIA is robust to shifts in the domain.

4.6. Troubleshooting

The ability to troubleshoot our model easily stems from the facts that (1) it is interpretable, and thus it is easy to determine where exactly a problem originated, should one arise; (2) a simple data augmentation can be used to adjust the model in order to change the classification assigned to a given name. Importantly, this type of troubleshooting and adjustment can be done without retraining. Figure 5 shows how the model can be easily fixed to accommodate a misclassified name; we simply augment the data set by adding duplicate copies of the misclassified name with the correct ethnicity labels to the training set and recompute the features (without retraining the weights). This change to the training set adjusts the probability and best-evidence features so that the name will be correctly classified in the future.

Figure 5. Data augmentation for EthnicIA’s misclassified names.
Note. We selected 300 misclassified names (full name, including first and last name), added duplicates of these names to the training data set, and regenerated the training features (we did not retrain the multinomial logistic regression coefficients). The plot shows that as more duplicates of just these names were added, the performance improved. In other words, if the classifier encountered these names again in the future, it would classify them correctly.

One warning about using this technique is that it allows the analyst to manually change the features for a name, so it should only be done if the analyst is certain they know the name’s correct ethnicity.
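The augmentation step can be illustrated with count-based features. In this minimal sketch the name, the labels, the counts, and the helper `p_ethnicity_given_name` are all hypothetical; it shows only how adding labeled duplicates shifts a probability feature while the regression weights stay untouched.

```python
from collections import Counter

def p_ethnicity_given_name(records, name):
    """Estimate P(ethnicity | name) from counts of (name, ethnicity) records."""
    counts = Counter(eth for n, eth in records if n == name)
    total = sum(counts.values())
    return {eth: c / total for eth, c in counts.items()}

# Hypothetical training records for one placeholder last name.
training = [("Examplename", "White")] * 6 + [("Examplename", "Asian")] * 4

before = p_ethnicity_given_name(training, "Examplename")

# Augment with duplicates carrying the label the analyst is certain of,
# then recompute the feature (no weights are retrained).
augmented = training + [("Examplename", "Asian")] * 10
after = p_ethnicity_given_name(augmented, "Examplename")

print(before["Asian"], after["Asian"])
```

Here the feature for the corrected label rises from 4/10 to 14/20, so the more duplicates are added, the more strongly the recomputed features favor that label.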

4.7. Can We Make a Sparser Model?

EthnicIA uses only a small number of interpretable features (30 in total), but it is likely that the number of features can be made even smaller if desired. In this section, we perform an experiment where we use only the probability features $P(\textrm{ethnicity}|\textrm{first name})$, $P(\textrm{ethnicity}|\textrm{last name})$ and lexical features (e.g., whether name contains a dash, number of words in name), but we do not include the features based on the first four and last four characters of a first and a last name, respectively. To overcome the problem that some names have zero values for their popularity features because they are slight and unusual variations of popular names, we deduplicated first and last names using probabilistic record linkage (Enamorado et al., 2019). In other words, before training and testing our model, we classified as the same first (last) name, those first (last) names that are similar to each other, for example, the first names Cynthia and Cinthia are considered the same. We compared first and last names using the Jaro-Winkler similarity measure with 0.92 as the cutpoint to declare a pair of names to be the same. We implement this via the R package fastLink (Enamorado et al., 2017). We then trained the model with the smaller feature set.

Our results, given in Appendix A Section A.7, show that this sparser model has comparable accuracy and balanced accuracy to the EthnicIA model using the full feature set. This is not surprising as the first four and last four character features aim to address problems related to tying together similar but not exact names (e.g., Cynthia and Cinthia agree on the last four characters and thus, information from Cynthia in the training set would be transferred to Cinthia in the test set, even if it is not in the training set). This observation leads to the possibility that there may be many models with high-quality performance (i.e., a large Rashomon set, see Semenova et al., 2022). One caveat with the sparser model is computation: if we let $N_1$ represent the number of unique first names in the data, the complexity of our deduplication task for first names is $O\left(\frac{N_1 \times (N_1 - 1)}{2}\right)$, which can be computationally expensive compared to the calculations of our preferred model; in our model, the computational complexity grows linearly with $N_1$.

4.8. Visualizations of the Space of Names

Each name is associated with a four-dimensional vector:

$$
[P(\textrm{Asian}|\textrm{name}),\ P(\textrm{Black}|\textrm{name}),\ P(\textrm{Hispanic}|\textrm{name}),\ P(\textrm{White}|\textrm{name})].
$$

We projected the set of four-dimensional vectors for all names to two dimensions using a dimension reduction algorithm called PaCMAP (Wang et al., 2021), shown in Figure 6. PaCMAP aims to preserve both local and global high-dimensional structure. We also show how the names look in 3-D projected space as shown in Figure 7. (We use a subset of $\sim$250,000 names from our test data set for these visualizations.)

Figure 6. Visualization of EthnicIA’s name separation in 2-D.

Note. In this figure, we projected the set of four-dimensional vectors (ethnic-specific predicted probabilities) for all names to two dimensions using a dimension reduction algorithm called PaCMAP (Wang et al., 2021).

Figure 7. Visualization of EthnicIA’s name separation in 3-D.
Note. Visualization of EthnicIA’s name separation, projected into 3-D space (Black, Hispanic, White).

In both the 2-D and 3-D projections, we observe overlap of names among different ethnicities, particularly for Black and White names. Figure 6 shows names such as Ebony Washington and Olumide Adeniyi at the right green tip which are high-probability Black names. Similarly, names like Suzanne Olshefski and Scott Olaughlin appear on the extreme left (in blue) which are high-probability White names. The upper left orange-colored area contains names that are quite common among Hispanic people, but rarely found in other ethnicities (e.g., Jose Rivera Velazquez, Luis Gonzalez-Chavez). The upper-right tip (in red) contains names that are highly likely to be Asian names (e.g., Ushaben Patel, Phuong Nguyen, Yun Kim). The names on the intersecting boundaries of multiple ethnicities are often interesting in that either:

  • The first name belongs strongly to one ethnicity while the last name belongs to another. For instance, Malavika William has a high-probability Asian first name with a last name that is predominantly Black. Sergio Mitchell, Carlos Davis, and Carmen Smith lie at the boundary of Hispanic and the manifold of Black-White names, since the first name is predominantly Hispanic, while the last name is common among Black people and White people. For Ricardo Wallace, Ricardo is a predominantly Hispanic first name, while Wallace is a predominantly White last name. Or,

  • The name was close to being indistinguishable in that the first and last names appear in multiple ethnicities. For example, Amisha and Sidra are Asian-Black indistinguishable first names, while Owens and Polk are common last names in Black people and White people.

5. Case Study: Political Donations by Ethnicity

Name-ethnicity classifiers can be used to analyze patterns and racial biases in different domains including health care, social media, and politics. In this case study, we use EthnicIA to explore campaign donation trends for the 2020 presidential race (Trump vs. Biden) and 2020 Georgia Senate race. Note that all our findings are based on predictions rather than ground truth.

Political Campaign Donation Data Set. From the Federal Election Commission (Federal Election Commission, 2021), we gathered information about all the donations to political campaigns made by individual donors across the United States. The data set contains information about 10 million donation transactions to the two main contenders for the presidency (Donald Trump and Joseph R. Biden) and the campaigns of the candidates for the two Senate seats from the state of Georgia (Jon Ossoff, David Perdue, Raphael Warnock, and Kelly Loeffler). These data do not include labels for the ethnic group of any donor.

Training Data Set for EthnicIA model. To estimate the ethnicity of donors, we trained a model on the combined voter registration data set from Florida, North Carolina, and Georgia. The data set contains $\sim$26.1 million records of names labeled with their ethnicities, with 66.6% White people, 21.4% Black people, 10% Hispanic people, and 2% Asian people. The voter registration data overlap with the donor names data set (as one would expect). However, when we performed a database join on the set of unique names from the voter registration data and the set of unique names from the donation data, we found that 44% of the observations in the donation data are not present in the training data.

Georgia’s Senate Race 2020. A clear pattern that emerges from EthnicIA’s results is that in terms of the dollar amount, out-of-state donations seem to come from a completely different ethnic distribution than in-state donations for the 2020 Georgia senate race (which includes the general and runoff elections). In Figure 8, we illustrate how the amounts of campaign donations for the Georgia Senate race differ between in-state and out-of-state, across all the ethnicities. The thickness of each arrow in the figure is scaled by the total campaign donations estimated from that arrow’s ethnic group. We observe that in-state donation amounts seem to be more balanced across parties, for all ethnic groups, than their out-of-state counterparts, which leaned Democratic across all ethnic groups. However, we find that the number of donation transactions (not the monetary amount) leans Democratic across all ethnic groups whether we consider in-state or out-of-state donations. For example, across ethnic groups, the number of donation transactions to Democratic candidates is at least 6% larger (out-of-state Hispanic donors) and at most 84% larger (in-state Asian donors) than the number of donations to the Republican candidates (see Table A10 in the Appendix). While the FEC data does not contain a unique identifier for donors, a similar pattern is observed if we focus on the unique donor names. For example, Table A11 in the Appendix shows that across ethnic groups, the number of unique donor names for the Democratic candidates is always larger than for the Republicans.

As shown in the Appendix, Figures A6–A8, our results are robust to 1) the inclusion of the Republican candidate, Doug Collins, who obtained 20% of the votes (but not strong support in terms of contributions) in the Senate special election in Georgia but did not advance to the runoff (Loeffler and Warnock did), 2) restricting attention to small donations (those less than $1,000), and 3) focusing on donors with names that appear in the voter file for Georgia.

Figure 8. EthnicIA’s results for Georgia’s Senate race in 2020.

Note. Each panel presents the campaign donations by ethnicity as predicted by EthnicIA. The thickness of each arrow in the figure is scaled by the total campaign donations estimated from that arrow’s ethnic group. In-state (respectively outside-state) donations are those made by donors that reside in (outside of) Georgia.

Presidential Race 2020. Another discernible pattern from EthnicIA’s results is that the number of small donors (those that donate less than $1,000) that favored Joseph Biden increased as Election Day approached; this is true across ethnic groups (see Figure 9). Moreover, the week before the election, the dollar amount of small donations for Joseph Biden is several times larger across all ethnic groups as well (6.4 times larger for White people, 7.5 for Black people, 7.6 for Hispanic people, and 4.9 for Asian people). To illustrate the advantages of our approach, we now explore donation patterns across ethnic groups for three states: Florida, Texas, and Louisiana.

Figure 9. EthnicIA’s results for the presidential race in 2020.

Note. Each panel presents the number of small donors by ethnicity as predicted by EthnicIA. We define small donors as those who donate $1,000 or less during an election cycle.

Presidential Race 2020: Results From Florida. Figures 10 and 11 show the campaign donation amounts for the presidential race for different ethnicities at the city level for Florida and Texas (two states that were considered strongholds of Donald Trump). In general, non-White people heavily favored the Democratic candidate (Biden), while White people favored the Republican candidate (Trump). More interesting questions arise when we analyze donations by people of different ethnicities, which we will do next, leveraging the bird’s eye view of the geographic distribution of donations for different ethnic groups provided in Figures 10 and 11. We also present the aerial view of campaign donations in Georgia and North Carolina in Appendix A Section A.8.

Figure 10. Campaign donations in Florida (presidential race).

Note. The figure shows EthnicIA’s predicted campaign donation amounts for the presidential race for different ethnicities at the city level for Florida.

Figure 11. Campaign donations in Texas (presidential race).

Note. The figure shows EthnicIA’s predicted campaign donation amounts for the presidential race for different ethnicities at the city level for Texas.

In Florida, we notice that there are large donations to Republicans by Asian people in the Jupiter and Miami areas. Let us delve into the data for those two locations to explain these patterns. In Miami, the total number of donation transactions by Asian people who donated to the Trump campaign (505 donations) is comparable to the number of donation transactions by Asian people donating to the Biden campaign (433 donations); however, the total donation amount was much larger for Trump. The Biden campaign received approximately $210,000, out of which approximately $100,000 was donated by a single person, Shashikant Gupta (cofounder and CEO, Apex CoVantage). The Trump campaign received approximately $661,000, out of which $350,000 was donated by Nirmal Mulye (President, Nostrum Energy) and $200,000 was donated by Chang and Salah Turkmani (Managing Director, The Mega Company). In Jupiter, a small number of people donated to the campaigns: overall, 23 donation transactions to Republicans and 136 donation transactions to Democrats. The Trump campaign received around $1.26 million, while the Biden campaign received only around $8,000. A major portion of the $1.25 million was donated by the couple Ramalinga and Padmaja Mantena from Integra Connect. In conclusion, we observe that even though the number of Asian people donating to the Biden campaign was higher, wealthy Asian people donated large amounts to the Trump campaign.

We also observe larger donations by Hispanic people in Florida, owing to the large Cuban population, which generally tended to support Trump (Kelly, 2020; Sesin, 2020). Trump received approximately $110 million, while Biden received only approximately $4.2 million, from Hispanic people in Florida. Considering the major donors to the Trump campaign, we found that they were either born in Cuba, or their parents had moved to the United States from Cuba. For example, Jose Pepe Fanjul (Florida Crystals) and his family contributed over $1.5 million, Benjamin III Leon and family (Leon Medical Centers) donated around $1 million, Maximo Alvarez (Sunshine Gasoline) donated around $100,000, and Carlos Silva (Attorney, Silva and Silva) and family donated around $150,000. They all had a connection to Cuba.

Among Black people in Florida, the total number of transactions is lower for Trump (33,079) than Biden (48,517), but the estimated amounts are generally higher. Biden received around $7.6 million, while Trump received around $15 million. These findings are not driven by big donors. If we restrict our attention to those that donate less than $1,000, we find the very same patterns.

Presidential Race 2020: Results From Texas. In Texas, across all ethnicities, the donations from cities do not have a defined inclination for Democrats vs. Republicans, while the donations from rural areas favor Republicans. The most striking pattern from the aerial view of Texas donations in Figure 11 comes from West Texas: it is heavily Republican across all ethnicities. In the Midland region, the major donors are all in the energy sector, and heavily favored Trump regardless of ethnicity. We also observe a large number of donations by Asian people to both Biden and Trump from within Texas.

Our model estimated that there were a large number of donations by Black people in the Dallas region, but upon a closer look, we realized that a few of the wealthier donors were actually White, and misclassified by our model as Black. Let us look at the names of the major donors, and try to analyze why we misclassified those names.

  • Kelcy Warren (CEO, Energy Transfer Partners) donated around $1 million to Trump, and was misclassified by our model. In our training set, the frequency of this first and last name for different ethnicities is as follows: Warren (13,367 White, 5,480 Black, 55 Asian, 138 Hispanic), Kelcy (44 White, 30 Black, 1 Hispanic). The first name Kelcy is not identified as a Black-White indistinguishable by our model, but it is quite close to being indistinguishable. When accounting for the imbalanced training data set (66.7% White people, 21.4% Black people), the model provided extra weight to the chance of Warren being Black, while the almost-indistinguishable name Kelcy did not contribute to the classification. Hence, the name was classified as Black. However, there is a problem with this setup because we are dealing with wealthy donors, who are more likely to be White—our calculation did not condition on wealth, it considered only the fraction of distinguishable Black names in Texas. If Kelcy Warren were not extremely wealthy, there might have been a higher chance that a ‘Black’ prediction would have been correct.

  • Vaughn Vennerberg (President, XTO Energy) donated around $210,000 to Biden, and was misclassified by our model. In our training set, the frequency of Vaughn for different ethnicities is as follows: 744 White, 448 Black, 7 Asian, and 10 Hispanic. The last name Vennerberg does not appear in our training set. Our model optimizes for balanced accuracy (thus implicitly favoring minority groups), and hence classifies this name as Black, based on the relative balance of ‘Vaughn’ between Black people and White people.

  • Annette Simmons (Philanthropist), who donated $650,000, is a White person misclassified as Black. In our training data set, Simmons is classified as indistinguishable, while the frequency of Annette is: 5,386 Black, 10,194 White, 81 Asian, 1,443 Hispanic. The full name ‘Annette Simmons’ appears in the database as Black 16 times, while it appears as White 14 times. Here, the model actually has seen more training points for the name ‘Annette Simmons’ as Black than as White. This, coupled with the fact that the model optimizes balanced accuracy, would make it lean toward classifying more names as Black, which explains the misclassification.
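The effect of class balancing described in these three cases can be illustrated with a small Python sketch. It is not EthnicIA's model: it simply compares the raw majority label for a name with the label obtained after dividing each count by that group's share of the training data (an inverse-prior weighting assumed here for illustration). The counts and class shares are the ones reported in the text; the function names are hypothetical.

```python
# Class shares of the combined training data, as reported in the text.
PRIORS = {"White": 0.666, "Black": 0.214, "Hispanic": 0.10, "Asian": 0.02}

# Name counts by ethnicity, as reported in the three cases above.
COUNTS = {
    "Warren": {"White": 13367, "Black": 5480, "Asian": 55, "Hispanic": 138},
    "Vaughn": {"White": 744, "Black": 448, "Asian": 7, "Hispanic": 10},
    "Annette": {"White": 10194, "Black": 5386, "Asian": 81, "Hispanic": 1443},
}

def raw_majority(counts):
    """Group with the largest raw count for this name."""
    return max(counts, key=counts.get)

def balanced_majority(counts, priors=PRIORS):
    """Group with the largest count after dividing by the group's training share.

    Dividing by the share asks how common the name is *within* each group,
    which is the kind of adjustment that favors smaller groups.
    """
    return max(counts, key=lambda group: counts[group] / priors[group])

for name, counts in COUNTS.items():
    print(name, raw_majority(counts), balanced_majority(counts))
```

For all three names the raw majority is White while the reweighted majority is Black, which is consistent with the explanation given for the misclassifications.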

As mentioned in Section 4.6, our model is easy to interpret, troubleshoot, and fix. To fix the specific names we misclassified, we would simply add copies of these names to the training set (without retraining the coefficients) and recalculate the features. If we conditioned on additional information such as wealth, we would classify more names correctly.
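As a rough illustration of this kind of fix, the Python sketch below recomputes a simple count-based quantity after copies of a name are appended to the training records, without touching any learned coefficients. The per-name label share used here is a stand-in chosen for illustration and is not claimed to be one of EthnicIA's actual features; the 16 and 14 counts come from the text, while the 10 added copies and the function name are hypothetical.

```python
from collections import Counter, defaultdict

def name_shares(records):
    """Share of each ethnicity label per name, from (name, label) records."""
    counts = defaultdict(Counter)
    for name, label in records:
        counts[name][label] += 1
    return {
        name: {label: c / sum(cnt.values()) for label, c in cnt.items()}
        for name, cnt in counts.items()
    }

# 'Annette Simmons' appears 16 times as Black and 14 times as White (from the text).
records = [("annette simmons", "Black")] * 16 + [("annette simmons", "White")] * 14
before = name_shares(records)["annette simmons"]

# Hypothetical correction: append 10 copies with the verified label, then recompute.
records += [("annette simmons", "White")] * 10
after = name_shares(records)["annette simmons"]

print(before)
print(after)
```

Only the counts change; a model that consumes such quantities as features would see the updated values the next time the features are recalculated.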

6. Discussion and Conclusion

There are several major advantages to our approach, compared to past methods. First, the whole approach is interpretable. Our features are easy to compute, and all of them are meaningful predictors. No previous works on ethnicity classification have set the problem up this way. Second, our approach provides a better way to assess bias and fairness than prior approaches. Our key idea is to recognize that bias will exist in the indistinguishables. By handling them using the test population of distinguishables and our probabilistic adjustment approach, our predictions will tend to be less biased. Other approaches simply train a model and predict; we know this cannot possibly work for indistinguishables: an ‘Annette Simmons’ in a primarily White neighborhood is more likely to be White, whereas an ‘Annette Simmons’ in a primarily Black neighborhood is likely to be Black. Methods that do not recognize this will carry the bias from their training distributions to their test predictions. Third, our approach easily scales to massive databases, even consisting of many millions of names. Our model has a total of 30 features, and the aspects of its performance that depend on learned coefficients tend to stabilize at around 20,000 names. Training, even with half a million names, took us less than five minutes. The scalability aspect of our method is a huge divergence from the deep learning and n-gram approaches that are much more expensive (and actually perform worse). Our model is accurate because its features are computed on a large data set. For updating the model, we could recompute the features and use the already learned coefficients (without retraining them). Our method is easy to troubleshoot. By simply repeating a name entry in a database several times, that name becomes much less likely to be misclassified in the future. It is also easy to troubleshoot the code for the model, since each feature is easy to compute, and to check whether it has been correctly computed. 
There is no clear downside to EthnicIA; it is simply better than past approaches for name-ethnicity classification.

Notes on ethical use: Any classifier that detects fairness can be used to detect (and potentially propagate) unfairness as well, to target specific ethnic groups. On the other hand, name-ethnicity classification is a well-established field with a long track record. Lookup tables and other types of models have been used for this task for decades, as described in our related work section. There are good reasons one cannot create a purely name-based ethnicity classifier that has extremely targeted precision. As we showed, there are many indistinguishable names, which would preclude perfect targeting of ethnic groups (but can still be used to detect overall bias in the population). On the other hand, without tools such as EthnicIA, it could be extremely difficult to estimate bias toward an ethnicity in a large, important application, such as credit risk monitoring, health care services, recommender systems, or job interviews.


Disclosure Statement

Vaishali Jain, Ted Enamorado, and Cynthia Rudin have no financial or non-financial disclosures to share for this article.


References

Ambekar, A., Ward, C., Mohammed, J., Male, S., & Skiena, S. (2009). Name-ethnicity classification from open sources. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 49–58). https://doi.org/10.1145/1557019.1557032

Appiah, O. (2001). Ethnic identification on adolescents’ evaluations of advertisements. Journal of Advertising Research, 41(5), 7–22. https://doi.org/10.2501/JAR-41-5-7-22

Bertrand, M., & Mullainathan, S. (2003). Are Emily and Greg more employable than Lakisha and Jamal? A field experiment on labor market discrimination (Working Paper No. 9873). National Bureau of Economic Research. https://doi.org/10.3386/w9873

Besco, R., & Tolley, E. (2022). Ethnic group differences in donations to electoral candidates. Journal of Ethnic and Migration Studies, 48(5), 1072–1094. https://doi.org/10.1080/1369183X.2020.1804339

Bradford Health Authority. (1983). Nam Pehchan [Computer software]. Computer Services, City of Bradford Metropolitan Council (Dept 13), Britannia House.

Buechley, R. W. (1976, February 6–7). Generally Useful Ethnic Search System (GUESS) [Presentation]. Annual Meeting of the American Names Society, New York, NY.

Burchard, E. G., Ziv, E., Coyle, N., Gomez, S. L., Tang, H., Karter, A. J., Mountain, J. L., Pérez-Stable, E. J., Sheppard, D., & Risch, N. (2003). The importance of race and ethnic background in biomedical research and clinical practice. New England Journal of Medicine, 348(12), 1170–1175. https://doi.org/10.1056/nejmsb025007

Burton, D. (2000). Ethnicity, identity and marketing: A critical review. Journal of Marketing Management, 16(8), 853–877. https://doi.org/10.1362/026725700784683735

Chang, J., Rosenn, I., Backstrom, L., & Marlow, C. (2010). ePluribus: Ethnicity on social networks. Proceedings of the International AAAI Conference on Web and Social Media, 4(1), 18–25. https://ojs.aaai.org/index.php/ICWSM/article/view/14029/13878

Coldman, A., Braun, T., & Gallagher, R. (1989). The classification of ethnic status using name information. Journal of Epidemiology and Community Health, 42(4), 390–395. https://doi.org/10.1136/jech.42.4.390

Cummins, C., Winter, H., Cheng, K., Maric, R., Silcocks, P., & Varghese, C. (2000). An assessment of the Nam Pehchan computer program for the identification of names of South Asian ethnic origin. Journal of Public Health Medicine, 21(4), 401–406. https://doi.org/10.1093/pubmed/21.4.401

Donaldson, L. J., & Clayton, D. G. (1984). Occurrence of cancer in Asians and non-Asians. Journal of Epidemiology and Community Health, 38(3), 203–207. https://doi.org/10.1136/jech.38.3.203

Elliott, M., Fremont, A., Morrison, P., Pantoja, P., & Lurie, N. (2008). A new method for estimating race/ethnicity and associated disparities where administrative records lack self-reported race/ethnicity. Health Services Research, 43(5p1), 1722–1736. https://doi.org/10.1111/j.1475-6773.2008.00854.x

Elliott, M., Morrison, P., Fremont, A., McCaffrey, D., Pantoja, P., & Lurie, N. (2009). Using the Census Bureau’s surname list to improve estimates of race/ethnicity and associated disparities. Health Services and Outcomes Research Methodology, 9(4), 252–253. https://doi.org/10.1007/s10742-009-0055-1

Enamorado, T., Fifield, B., & Imai, K. (2017). fastLink: Fast probabilistic record linkage. R Foundation for Statistical Computing. https://CRAN.R-project.org/package=fastLink

Enamorado, T., Fifield, B., & Imai, K. (2019). Using a probabilistic model to assist merging of large-scale administrative records. American Political Science Review, 113(2), 353–371. https://doi.org/10.1017/S0003055418000783

Federal Election Commission. (2021). Campaign finance data. https://www.fec.gov/data/

Fiscella, K., & Fremont, A. (2006). Use of geocoding and surname analysis to estimate race and ethnicity. Health Services Research, 41(4p1), 1482–1500. https://doi.org/10.1111/j.1475-6773.2006.00551.x

Gill, P., Bhopal, R., Wild, S., & Kai, J. (2005). Limitations and potential of country of birth as proxy for ethnic group. BMJ, 330(7484), 196. https://doi.org/10.1136/bmj.330.7484.196-a

Grumbach, J., & Sahn, A. (2019). Race and representation in campaign finance. American Political Science Review, 114(1), 1–16. https://doi.org/10.1017/S0003055419000637

Harding, S., Dews, H., & Simpson, S. (1999). The potential to identify South Asians using a computerised algorithm to classify names. Population Trends, 97, 46–49.

Harkness, E. F., Bashir, F., Foden, P., Bydder, M., Gadde, S., Wilson, M., Maxwell, A., Hurley, E., Howell, A., Evans, D. G., & Astley, S. M. (2016). Variations in breast density and mammographic risk factors in different ethnic groups. In K. L. Anders Tingberg & P. Timberg (Eds.), Lecture notes in computer science: Vol. 9699. International Workshop on Breast Imaging (pp. 510–517). Springer-Verlag. https://doi.org/10.1007/978-3-319-41546-8_64

Harris, J. (2015). What’s in a name? A method for extracting information about ethnicity from names. Political Analysis, 23(2), 212–224. https://doi.org/10.1093/pan/mpu038

Imai, K., & Khanna, K. (2016). Improving ecological inference by predicting individual ethnicity from voter registration records. Political Analysis, 24(2), 263–272. https://doi.org/10.1093/pan/mpw001

Kang, Y. (2020). Name-nationality classification technology under Keras deep learning. In C. Wu & D. Kang (Eds.), Proceedings of the 2020 2nd international conference on big data engineering (pp. 70–74). Association for Computing Machinery. https://doi.org/10.1145/3404512.3404517

Kelly, W. (2020, November 6). Perspective | the Cuban revolution explains why younger Cuban Americans supported Trump. The Washington Post. https://www.washingtonpost.com/outlook/2020/11/06/cuban-revolution-explains-why-younger-cuban-americans-supported-trump/

Kingma, D. P., & Ba, J. L. (2015). Adam: A method for stochastic optimization. In Y. Bengio & Y. LeCun (Eds.), 3rd international conference on learning representations (pp. 1–15). arXiv. https://doi.org/10.48550/arXiv.1412.6980

Lauderdale, D. S., & Kestenbaum, B. (2000). Asian American ethnic identification by surname. Population Research and Policy Review, 19(3), 283–300. https://doi.org/10.1023/A:1026582308352

Lee, J., Kim, H., Ko, M., Choi, D., Choi, J., & Kang, J. (2017). Name nationality classification with recurrent neural networks. In C. Sierra (Ed.), Proceedings of the 26th international joint conference on artificial intelligence (pp. 2081–2087). AAAI Press.

Mateos, P. (2007). A review of name-based ethnicity classification methods and their potential in population studies. Population, Space and Place, 13(4), 243–263. https://doi.org/10.1002/psp.457

McDonald, M. P. (2021). United States Elections Project. http://www.electproject.org/

Nanchahal, K., Mangtani, P., Alston, M., & Silva, I. D. S. (2001). Development and validation of a computerized South Asian names and group recognition algorithm (SANGRA) for use in British health-related studies. Journal of Public Health, 23(4), 278–285. https://doi.org/10.1093/pubmed/23.4.278

Nicoll, A., Bassett, K., & Ulijaszek, S. J. (1986). What’s in a name? Accuracy of using surnames and forenames in ascribing Asian ethnic identity in English populations. Journal of Epidemiology and Community Health, 40(4), 364–368. https://doi.org/10.1136/jech.40.4.364

Pennacchiotti, M., & Popescu, A.-M. (2011). A machine learning approach to Twitter user classification. Proceedings of the International AAAI Conference on Web and Social Media, 5(1), 281–288. https://ojs.aaai.org/index.php/ICWSM/article/view/14139

Quan, H., Wang, F.-L., Schopflocher, D., Norris, C., Galbraith, P., Faris, P., Graham, M., Knudtson, M., & Ghali, W. (2006). Development and validation of a surname list to define Chinese ethnicity. Medical Care, 44(4), 328–333. https://doi.org/10.1097/01.mlr.0000204010.81331.a9

Semenova, L., Rudin, C., & Parr, R. (2022). On the existence of simpler machine learning models. In F. Borgesius, M. Kearns, K. Lum, & A. X. Wu (Eds.), ACM conference on fairness, accountability, and transparency (ACM FAccT) (pp. 1827–1858). Association for Computing Machinery. https://doi.org/10.1145/3531146.3533232

Senior, P. A., & Bhopal, R. (1994). Ethnicity as a variable in epidemiological research. BMJ, 309(6950), 327–330. https://doi.org/10.1136/bmj.309.6950.327

Sesin, C. (2020). Trump cultivated the Latino vote in Florida, and it paid off. NBCUniversal News Group. https://www.nbcnews.com/news/latino/trump-cultivated-latino-vote-florida-it-paid-n1246226

Sood, G. (2017). Florida voter registration data. Harvard Dataverse. https://doi.org/10.7910/DVN/UBIG3F

Sood, G., & Laohaprapanon, S. (2018). Predicting race and ethnicity from the sequence of characters in a name. arXiv. https://doi.org/10.48550/arXiv.1805.02109

Stewart, S., Swallen, K., Glaser, S., Horn-Ross, P., & West, D. (1999). Comparison of methods for classifying Hispanic ethnicity in a population-based cancer registry. American Journal of Epidemiology, 149(11), 1063–1071. https://doi.org/10.1093/oxfordjournals.aje.a009752

Taylor, V. M., Nguyen, T. T., Do, H. H., Li, L., & Yasui, Y. (2009). Lessons learned from the application of a Vietnamese surname list for survey research. Journal of Immigrant and Minority Health, 13(2), 345–351. https://doi.org/10.1007/s10903-009-9296-x

Torvik, V., & Agarwal, S. (2016). Ethnea—An instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science. https://experts.illinois.edu/en/publications/ethnea-an-instance-based-ethnicity-classifier-based-on-geo-coded-

Treeratpituk, P., & Giles, C. L. (2012). Name-ethnicity classification and ethnicity-sensitive name matching. Proceedings of the AAAI Conference on Artificial Intelligence, 26(1), 1141–1147. https://doi.org/10.1609/aaai.v26i1.8324

Voicu, I. (2018). Using first name information to improve race and ethnicity classification. Statistics and Public Policy, 5(1). https://doi.org/10.1080/2330443X.2018.1427012

Wang, Y., Huang, H., Rudin, C., & Shaposhnik, Y. (2021). Understanding how dimension reduction tools work: An empirical approach to deciphering t-SNE, UMAP, TriMap, and PaCMAP for data visualization. Journal of Machine Learning Research, 22(201), 1–73. http://jmlr.org/papers/v22/20-1061.html

Wong, K. O., Zaïane, O. R., Davis, F. G., & Yasui, Y. (2020). A machine learning approach to predict ethnicity using personal name and census location in Canada. PLOS ONE, 15(11), Article e0241239. https://doi.org/10.1371/journal.pone.0241239

Xie, F. (2022). Rethnicity: An R package for predicting ethnicity from names. SoftwareX, 17, Article 100965. https://doi.org/10.1016/j.softx.2021.100965

Ye, J., Han, S., Hu, Y., Coskun, B., Liu, M., Qin, H., & Skiena, S. (2017). Nationality classification using name embeddings. In E.-P. Lim & M. Winslett (Eds.), Proceedings of the 2017 ACM on conference on information and knowledge management (pp. 1897–1906). Association for Computing Machinery. https://doi.org/10.1145/3132847.3133008


Appendix

A.1. Ethnicolr Model Performance Without Removing Indistinguishables

Considering that indistinguishability is a characteristic introduced by our method, we also show the ethnicolr results on the full North Carolina test data set without removing names that we recognize as indistinguishable in Table A1.

Table A1. Ethnicolr results for North Carolina.

Note. The full North Carolina data set, including indistinguishable names, is used for testing.

A.2. EthnicIA Model Results Trained Without Balancing Weights

When we removed the balancing weights, the scores were comparable to those of ethnicolr. Our EthnicIA model without balancing weights gives an accuracy of 81.31%, while ethnicolr has 80.17% accuracy. Our model still maintains advantages over ethnicolr: (1) it is much smaller and simpler, (2) it is much easier to train, and (3) it trains in a fraction of the time. We show the results of the model trained without balancing weights in Table A2. Although the two models are essentially tied on accuracy, EthnicIA has an advantage on balanced accuracy (69.35% for EthnicIA and 65.30% for ethnicolr). Thus, if one is interested in the combination of accuracy and balanced accuracy, EthnicIA has the advantage.

In general, balanced accuracy is more useful if our goal is to characterize minority groups. In applications where we aim to characterize the full population, then accuracy might be a more useful metric. For instance, if we are interested in finding the donors to political campaigns that are members of minority groups, balanced accuracy might be more useful. If we are instead interested in patterns in the overall population of donors, then accuracy may be a suitable metric.
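The difference between the two metrics can be seen in a short Python sketch. Balanced accuracy is computed here as the unweighted mean of per-class recall, which is the standard definition; the toy labels are invented for illustration and are not results from the paper.

```python
def accuracy(y_true, y_pred):
    """Share of all observations that are labeled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall, so every class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Toy example: 8 majority-group names, 2 minority-group names,
# and a classifier that always predicts the majority group.
y_true = ["White"] * 8 + ["Black"] * 2
y_pred = ["White"] * 10

print(accuracy(y_true, y_pred))           # high, driven by the majority group
print(balanced_accuracy(y_true, y_pred))  # penalized for missing the minority group
```

A classifier that ignores the minority group can still score well on accuracy, which is why balanced accuracy is the more informative metric when the goal is to characterize minority groups.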

Table A2. EthnicIA results for North Carolina (no weights).

Note. EthnicIA was trained on the Florida data, and no balancing weights were used while training.

A.3. Performance Based on Models From Different Data Set Combinations

In this section, we show the performance results of different EthnicIA models based on training data sets from Florida, Georgia, North Carolina, or combination data sets from two states.

In Table A3, we show the performance results of our main model trained on a combination of Florida and Georgia data, and tested on the North Carolina data. In Tables A4–A8, we show performance results of our main model trained on data from one state, and tested on data from another state.

Table A3. North Carolina (test), and Florida and Georgia (train).

Note. The table presents the results of EthnicIA.


Table A4. Georgia (test), and Florida (train).

Note. The table presents the results of EthnicIA.


Table A5. North Carolina (test), and Georgia (train).

Note. The table presents the results of EthnicIA.


Table A6. Florida (test), and Georgia (train).

Note. The table presents the results of EthnicIA.


Table A7. Florida (test), and North Carolina (train).

Note. The table presents the results of EthnicIA.


Table A8. Georgia (test), and North Carolina (train).

Note. The table presents the results of EthnicIA.

A.4. Algorithmic Convergence

In Figure A1, we show the training loss over iterations for the EthnicIA model trained on the Florida data set in the experiment of Section 4.1. We ran for ten epochs with a training set of size 500,000 and batch size 1,024 (that is, around 5,000 iterations). The loss curve starts flattening around ten epochs, so we choose it as the cutoff point.
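The iteration count quoted above follows from simple arithmetic, sketched here in Python. The helper name is hypothetical, and the sketch assumes that the final, partially filled batch of each epoch counts as one iteration.

```python
import math

def total_iterations(n_examples, batch_size, epochs):
    """Number of gradient steps, counting the last partial batch of each epoch."""
    return math.ceil(n_examples / batch_size) * epochs

# 500,000 / 1,024 is about 488.3, so 489 batches per epoch over 10 epochs.
print(total_iterations(500_000, 1_024, 10))  # 4890, i.e., around 5,000
```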

Figure A1. Training loss over iterations for EthnicIA.

A.5. ROC Curves and AUC Values

In Figure A2, we show ethnicity-specific ROC curves and AUC values that reflect performance in identifying ethnic groups for the test data set for the EthnicIA model in Experiment 4.1.

Figure A2. Performance of EthnicIA classification.
Note. ROC (Receiver Operating Characteristic) curves and AUC (Area Under the ROC Curve) values on the test set for EthnicIA.

A.6. Bracketed Probabilities: Predictions vs. Truth

For each panel of Figure A3, we bin observations according to our predictions into four groups: [0, 0.25), [0.25, 0.50), [0.50, 0.75), and [0.75, 1]. The figure shows that across ethnic groups when we train the model with data from Florida and Georgia and test on North Carolina, the model predictions have a positive association with the true proportion of observations per race.
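The binning described above can be sketched in Python as follows. The four brackets match those in the text; the function names and the toy predictions are assumptions made for illustration and do not reproduce the paper's data.

```python
BRACKETS = ["[0, 0.25)", "[0.25, 0.50)", "[0.50, 0.75)", "[0.75, 1]"]

def bracket_index(p):
    """Index of the bracket containing probability p; the last bracket includes 1."""
    return min(int(p * 4), 3)

def observed_share_by_bracket(probs, is_member):
    """Observed share of true group members among observations in each bracket.

    Returns None for a bracket that contains no observations.
    """
    totals = [0] * 4
    hits = [0] * 4
    for p, y in zip(probs, is_member):
        b = bracket_index(p)
        totals[b] += 1
        hits[b] += int(y)
    return [hits[b] / totals[b] if totals[b] else None for b in range(4)]

# Toy predictions for one ethnic group (invented for illustration).
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 1.0]
is_member = [0, 0, 0, 1, 1, 0, 1, 1]

for label, share in zip(BRACKETS, observed_share_by_bracket(probs, is_member)):
    print(label, share)
```

If the predictions are informative, the observed share should tend to rise from the lowest bracket to the highest, which is the positive association shown in Figure A3.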

Figure A3. Bracketed probabilities.
Note. The figure presents the results of EthnicIA, when trained with data from Florida and Georgia, and tested with data from North Carolina.

A.7. Performance Based on Sparser Model

In Table A9, we show the performance results of the sparser model described in Section 4.7. The model has been trained on the Florida data set and tested on the North Carolina data set. This model uses fewer features but additionally uses probabilistic record linkage to overcome the problem that some names appear to be unique because they cannot be exactly matched, despite the fact that they are variations of more popular names.

Table A9. Performance results for a sparser model.

Note. EthnicIA results based on a sparser model (as described in Section 4.7) tested on North Carolina and trained on the Florida data.

A.8. Spatial Plots

In Figures A4 and A5, we show aerial views of the campaign donations at the city level for all ethnicities in Georgia and North Carolina. In Georgia, we do not observe much difference in the proportion of donations between ethnicities to the two parties. However, we see that city centers are inclined toward Democrats, while the rural areas are more Republican across all ethnicities. In North Carolina, across ethnicities, Republican and Democratic areas are similar in the intensity of donations.

Figure A4. Estimated campaign donations in Georgia.

Note. EthnicIA’s results for the 2020 presidential race.

Figure A5. Estimated campaign donations in North Carolina.

Note. EthnicIA’s results for the 2020 presidential race.

A.9. Georgia’s 2020 Senate Election

This section presents additional results for the 2020 Senate election in Georgia. As noted in the main text, Democratic candidates received more donation transactions than Republican candidates across all ethnic groups (Table A10). In addition, while the Federal Election Commission (FEC) data does not contain a unique identifier for donors, a similar pattern is observed if we focus on the number of unique donors’ names per state (a pseudo unique identifier for donors); see Table A11. Finally, Figures A6 to A8 show that our results are robust to 1) the inclusion of the Republican candidate, Doug Collins, who obtained 20% of the votes in the Senate special election in Georgia but did not advance to the runoff (Loeffler and Warnock did), 2) restricting attention to small donations (those less than $1,000), and 3) focusing on donors with names that appear in the voter file for Georgia.

Table A10. Ratio of the number of donation transactions for Democratic vs. Republican candidates in Georgia’s 2020 Senate election.

Ethnicity    In-State    Out-of-State
Asian        1.84        1.34
Black        1.54        1.09
Hispanic     1.45        1.06
White        1.21        1.26

Note. The table shows that the number of donation transactions favored the Democratic candidates both in-state and out-of-state.
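The percentages quoted in the main text (at least 6% and at most 84% more Democratic transactions) follow directly from the ratios in Table A10, as the Python sketch below shows. The ratios are copied from the table; the dictionary layout and function name are hypothetical.

```python
# Democratic-to-Republican transaction ratios, copied from Table A10.
TABLE_A10 = {
    "Asian": {"in_state": 1.84, "out_of_state": 1.34},
    "Black": {"in_state": 1.54, "out_of_state": 1.09},
    "Hispanic": {"in_state": 1.45, "out_of_state": 1.06},
    "White": {"in_state": 1.21, "out_of_state": 1.26},
}

def percent_larger(ratio):
    """Convert a ratio into 'percent larger', rounded to a whole percent."""
    return round((ratio - 1.0) * 100)

cells = {
    (ethnicity, location): percent_larger(ratio)
    for ethnicity, row in TABLE_A10.items()
    for location, ratio in row.items()
}
smallest = min(cells, key=cells.get)
largest = max(cells, key=cells.get)

print(smallest, cells[smallest])  # out-of-state Hispanic donors, 6
print(largest, cells[largest])    # in-state Asian donors, 84
```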


Table A11. Ratio of the number of unique donor names for Democratic vs. Republican candidates in Georgia’s 2020 Senate election.

Ethnicity    In-State    Out-of-State
Asian        2.70        3.15
Black        1.32        1.79
Hispanic     1.59        1.88
White        1.08        1.85

Note. The table shows that the number of unique donor names favored the Democratic candidates both in-state and out-of-state when we focus on names that appear both in the voter files and the Federal Election Commission data.

Figure A6. Georgia’s senate race 2020 (Doug Collins included).

Note. The figure presents the campaign donations distribution by ethnicity (as predicted by EthnicIA) including Doug Collins, who received 20% of the votes in the senate special election but was less fortunate in terms of contributions compared to Loeffler and Warnock (the two that advanced to the runoff).

Figure A7. Georgia’s senate race 2020 (small donors).

Note. The figure presents the campaign donations distribution by ethnicity (as predicted by EthnicIA) for small donors (those that donate less than $1,000 per election cycle).

Figure A8. Georgia’s senate race 2020 (names in the voter file).

Note. The figure presents the campaign donations distribution by ethnicity (as predicted by EthnicIA) for donors with names that appear in the voter file.


Data Repository/Code

Code for our method using public data is available at this link: https://github.com/VaishaliJain/ethnicIA


©2022 Vaishali Jain, Ted Enamorado, and Cynthia Rudin. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
