Title: | Predict Gender from Names Using Historical Data |
---|---|
Description: | Infers state-recorded gender categories from first names and dates of birth using historical datasets. By using these datasets instead of lists of male and female names, this package is able to more accurately infer the gender of a name, and it is able to report the probability that a name was male or female. GUIDELINES: This method must be used cautiously and responsibly. Please be sure to see the guidelines and warnings about usage in the 'README' or the package documentation. See Blevins and Mullen (2015) <http://www.digitalhumanities.org/dhq/vol/9/3/000223/000223.html>. |
Authors: | Lincoln Mullen [aut, cre] , Cameron Blevins [ctb], Ben Schmidt [ctb] |
Maintainer: | Lincoln Mullen <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.6.0 |
Built: | 2025-01-01 04:21:13 UTC |
Source: | https://github.com/lmullen/gender |
Gender: predict gender from names using historical data
Encodes gender based on names and dates of birth, using U.S. Census or Social
Security data sets. Requires a separate download of the datasets, which should
happen automatically but can also be done manually by running
install_genderdata_package().
This package attempts to infer gender (or more precisely, sex assigned at birth) based on first names using historical data, typically data that was gathered by the state. This method has many limitations, and before you use this package be sure to take into account the following guidelines.
(1) Your analysis and the way you report it should take into account the limitations of this method, which include its reliance on data created by the state and its inability to see beyond the state-imposed gender binary. At a minimum, be sure to read our article explaining the limitations of this method, as well as the review article that is critical of this sort of methodology, both cited below.
(2) Do not use this package to study individuals: it is at most useful for studying populations in the aggregate.
(3) Resort to this method only when the alternative is not studying gender at all, and not when a more nuanced and justifiable approach to studying gender is available. For instance, for many historical sources this approach might be the only way to get a sense of the sex ratios in a population. But ask whether you really need to use this method, whether you are using it responsibly, or whether you could use a better approach instead.
Blevins, Cameron, and Lincoln A. Mullen, “Jane, John … Leslie? A Historical Method for Algorithmic Gender Prediction,” *Digital Humanities Quarterly* 9, no. 3 (2015). http://www.digitalhumanities.org/dhq/vol/9/3/000223/000223.html
Mihaljević, Helena, Marco Tullney, Lucía Santamaría, and Christian Steinfeldt. “Reflections on Gender Analyses of Bibliographic Corpora.” *Frontiers in Big Data* 2 (August 28, 2019): 29. https://doi.org/10.3389/fdata.2019.00029.
If the genderdata package is not installed, install it from GitHub using devtools. If it is not up to date, reinstall it.
check_genderdata_package()
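The check described above can be sketched in base R as follows. This is only an illustration of the "missing or out of date" logic, not the package's actual implementation: the helper `needs_install()` and the minimum version string are hypothetical, and the installation itself is left to `install_genderdata_package()`.

```r
# Hypothetical helper: TRUE when `pkg` is missing or older than `min_version`.
needs_install <- function(pkg, min_version) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(TRUE)
  }
  utils::packageVersion(pkg) < package_version(min_version)
}

# The minimum version here is a placeholder, not the package's real requirement.
if (needs_install("genderdata", "0.0.1")) {
  message("genderdata is missing or outdated; run install_genderdata_package().")
}
```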
This function predicts the gender of a first name given a year or range of
years in which the person was born. The prediction can use one of several
data sets suitable for different time periods or geographical regions. See
the package vignette for suggestions on using this function with multiple
names and for a discussion of which data set is most suitable for your
research question. When using certain methods, the genderdata package is
required; you will be prompted to install it if it is not already
available.
gender(
  names,
  years = c(1932, 2012),
  method = c("ssa", "ipums", "napp", "kantrowitz", "genderize", "demo"),
  countries = c("United States", "Canada", "United Kingdom", "Denmark", "Iceland", "Norway", "Sweden")
)
names |
First names as a character vector. Names are case insensitive. |
years |
The birth year of the name whose gender is to be predicted. This
argument can be either a single year, a range of years in the form
|
method |
This value determines the data set that is used to predict the
gender of the name. The |
countries |
The countries for which datasets are being used. For the
|
Returns a data frame containing the results of predicting the gender. The exact columns of the returned data frame will depend on the specific method used. They include the following:
name |
The name for which the gender has been predicted. |
proportion_male |
The proportion of male names for the given range of years. |
proportion_female |
The proportion of female names for the given range of years. |
gender |
The predicted gender based on the proportion of male and female names. Possible values are |
year_min |
The lower bound (inclusive) of the year range used for the prediction. |
year_max |
The upper bound (inclusive) of the year range used for the prediction. |
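To make the relationship between these columns concrete, here is a minimal base R sketch that derives proportion_male, proportion_female, and gender from name counts. The counts are invented for illustration, and the majority rule with a 0.5 cutoff is an assumption rather than a description of the package's internals.

```r
# Hypothetical counts of recorded births per name over some range of years.
# These numbers are made up and do not come from any of the package's datasets.
counts <- data.frame(
  name   = c("alex", "mary", "john"),
  female = c(400, 9900, 50),
  male   = c(600, 100, 9950)
)

# Share of female and male uses of each name; the two shares sum to 1 per row.
total <- counts$female + counts$male
counts$proportion_female <- counts$female / total
counts$proportion_male   <- counts$male / total

# Assumed decision rule: choose the majority category (ties go to "female" here).
counts$gender <- ifelse(counts$proportion_female >= 0.5, "female", "male")

counts[, c("name", "proportion_male", "proportion_female", "gender")]
```

Keeping the proportions alongside the predicted category lets an analysis set aside ambiguous names instead of treating every prediction as equally certain.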
gender("madison", method = "demo", years = 1985)
gender("madison", method = "demo", years = c(1900, 1985))
# SSA method
## Not run: gender("madison", method = "ssa", years = c(1900, 1985))
# IPUMS method
## Not run: gender("madison", method = "ipums", years = 1860)
# NAPP method
## Not run: gender("madison", method = "napp", countries = c("Sweden", "Denmark"))
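Because names is a character vector, several names can be passed in one call and the result summarized for the group as a whole, in keeping with the guideline to study populations rather than individuals. The sketch below assumes the gender package is installed and that the "demo" method runs without the genderdata package. The second name is an arbitrary choice and may not appear in the demo data, in which case it may simply be missing from the result.

```r
# Run only when the gender package is available.
if (requireNamespace("gender", quietly = TRUE)) {
  first_names <- c("madison", "hazel")  # illustrative input
  predictions <- gender::gender(first_names, method = "demo",
                                years = c(1900, 1985))

  # Summarize at the aggregate level instead of reading off individual rows.
  print(table(predictions$gender))
}
```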
Install the genderdata package after checking with the user
install_genderdata_package()