GUIDELINES: This method must be used cautiously and responsibly. Please be sure to see the guidelines and warnings about usage in the README or the package documentation.
A common problem for researchers who work with data, especially historians, is that a dataset has a list of people with names but does not identify the gender of the person. Since first names often indicate gender, it should be possible to predict gender using names. However, the gender associated with names can change over time. To illustrate, take the names Madison, Hillary, Jordan, and Monroe. For babies born in the United States, those predominant gender associated with those names has changed over time.
Predicting gender from names requires a fundamentally historical
method. The gender
package provides a way to calculate the
proportion of male and female names given a year or range of birth years
and a country or set of countries. The predictions are based on
calculations from historical datasets.
This vignette offers a brief guide to the gender
package. For a fuller historical explanation and a sample case study
using the package, please see our journal article: Cameron Blevins and
Lincoln Mullen, “Jane, John … Leslie? A Historical Method for
Algorithmic Gender Prediction,” Digital Humanities Quarterly
(forthcoming 2015).
The main function in this package is gender()
. That
function lets you choose a dataset and pass in a set of names and a
birth year or range of birth years. The result is always a data frame
that includes a prediction of the gender of the name and the relative
proportions between male and female. For example:
## # A tibble: 2 × 6
## name proportion_male proportion_female gender year_min year_max
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Hillary 1 0 male 1940 1940
## 2 Madison 1 0 male 1940 1940
## # A tibble: 2 × 6
## name proportion_male proportion_female gender year_min year_max
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Hillary 0 1 female 2000 2000
## 2 Madison 0.0069 0.993 female 2000 2000
The gender
package itself contains only demonstration
data. Datasets which permit you to make predictions for various times
and places are available in the genderdata package.
This package is not available on CRAN because of its size. The first
time that you need to use the dataset you will be prompted to install
it, or you can install it yourself:
# install.packages("remotes")
remotes::install_github("lmullen/genderdata")
You specify which dataset you wish to use with the
method =
parameter. Below are some sample
United States in the 1960s:
United States in the 1860s:
North Atlantic countries in the 1860s:
Just Sweden in the 1879:
Each method is associated with a dataset suitable for a particular time and place.
method = "ipums"
: United States from 1789 to 1930.
Drawn from Census data.method = "ssa"
: United States from 1930 to 2012. Drawn
from Social Security Administration data.method = "napp"
: Any combination of Canada, the United
Kingdom, Germany, Iceland, Norway, and Sweden from the years 1758 to
1910, though the nineteenth-century data is likely more reliable than
the eighteenth-century data.U.S. Census data is provided by IPUMS USA from the Minnesota Population Center, University of Minnesota. The IPUMS data includes 1% and 5% samples from the Census returns. The Census, taken decennially, includes respondent’s birth dates and gender. With the gender package, it is possible to use this dataset for years between 1789 and 1930. The dataset includes approximately 339,967 unique names.
U.S. Social Security Administration data was collected from applicants to Social Security. The Social Security Board was created in the New Deal in 1935. Early applicants, however, were people who were nearing retirement age not people who were being born, so the dataset extends further into the past. However, the Social Security Administration did not immediately require all persons born in the United States to register for a Social Security Number. (See Shane Landrum, “The State’s Big Family Bible: Birth Certificates, Personal Identity, and Citizenship in the United States, 1840–1950” [PhD dissertation, Brandeis University, 2014].) A consequence of this—for reasons that are not entirely clear—is that for years before 1918, the SSA dataset is heavily female; after about 1940 it skews slightly male. For this reason this package corrects the prediction to assume a secondary sex ratio that is evenly distributed between males and females. Also, the SSA dataset only includes names that were used more than five times in a given year, so the “long tail” of names is excluded. Even so, the dataset includes 91,320 unique names. The SSA dataset extends from 1880 to 2012, but for years before 1930 you should use the IPUMS method.
The North Atlantic Population Project provides data for Canada, the United Kingdom, Germany, Iceland, Norway, and Sweden for years between 1758 and 1910, based on census microdata from those countries.
Most often you have a dataset and you want to predict gender for multiple names. Consider this sample dataset.
library(dplyr)
demo_names <- c("Susan", "Susan", "Madison", "Madison",
"Hillary", "Hillary", "Hillary")
demo_years <- c(rep(c(1930, 2000), 3), 1930)
demo_df <- tibble(first_names = demo_names,
last_names = LETTERS[1:7],
years = demo_years,
min_years = demo_years - 3,
max_years = demo_years + 3)
demo_df
## # A tibble: 7 × 5
## first_names last_names years min_years max_years
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Susan A 1930 1927 1933
## 2 Susan B 2000 1997 2003
## 3 Madison C 1930 1927 1933
## 4 Madison D 2000 1997 2003
## 5 Hillary E 1930 1927 1933
## 6 Hillary F 2000 1997 2003
## 7 Hillary G 1930 1927 1933
Should you wish, you can use dplyr’s do()
function to
run the gender()
function on each name and birth year
(i.e., each row). This will result in a dataframe containing a column of
dataframes. Another call to do()
and
bind_rows()
will create a the single data frame that we
expect.
demo_df %>%
distinct(first_names, years) %>%
rowwise() %>%
do(results = gender(.$first_names, years = .$years, method = "demo")) %>%
do(bind_rows(.$results))
## # A tibble: 6 × 6
## # Rowwise:
## name proportion_male proportion_female gender year_min year_max
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Susan 0 1 female 1930 1930
## 2 Susan 0 1 female 2000 2000
## 3 Madison 1 0 male 1930 1930
## 4 Madison 0.0069 0.993 female 2000 2000
## 5 Hillary 1 0 male 1930 1930
## 6 Hillary 0 1 female 2000 2000
That method of using dplyr is the most intuitive, since it calls
gender()
once for each row. (In the example above, there
are six calls to the function.) However, because of the way that the
gender()
function works, it can handle multiple names
provided that they all use the same range of years. In other words, we
will do better to group the data frame by the year. In the code below,
we call gender()
once for each year (i.e. two times) which
results in a considerable time savings.
demo_df %>%
distinct(first_names, years) %>%
group_by(years) %>%
do(results = gender(.$first_names, years = .$years[1], method = "demo")) %>%
do(bind_rows(.$results))
## # A tibble: 6 × 6
## # Rowwise:
## name proportion_male proportion_female gender year_min year_max
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Hillary 1 0 male 1930 1930
## 2 Madison 1 0 male 1930 1930
## 3 Susan 0 1 female 1930 1930
## 4 Hillary 0 1 female 2000 2000
## 5 Madison 0.0069 0.993 female 2000 2000
## 6 Susan 0 1 female 2000 2000
These results can then be joined back into your original dataset.