Calling All Donor Data Analysts: A New Simulated Philanthropy Dataset for Data Exploration
By Andy McMahon | June 08, 2023
Sample datasets are vital for those who are learning data science or want to test statistical modeling functions on realistic data from a particular domain. Unfortunately, there is little real non-profit fundraising data in the public domain. One exception is the Paralyzed Veterans of America dataset released for the Knowledge Discovery and Data Mining (KDD) Cup ’98 competition, although that file only contains data related to a single mail fundraising campaign. As others have noted, the lack of free and available donor datasets has consequences for non-profit data science professionals.
While real data is the gold standard, realistic simulated data can often serve as a stand-in for the genuine article. The challenge is that real charitable giving data, like databases of other types of transactions (for instance, those of a large grocery store), contains patterns and relationships that a simulation has to reproduce.
A database of grocery store transactions would show relationships at the individual level — customers who tend to favor certain products or spend similar amounts on each trip — and relationships at the group level. For instance, customers with young children may be more likely to buy certain products compared to those without them.
To work toward meeting this challenge, Apra is releasing a new sample dataset for the 2023 Data Science Now Challenge. This new dataset is inspired by and builds on an earlier iteration created by the Data Science Committee for Apra Fundamentals: Data Science With R. The dataset represents a single giving cohort from a fictitious nonprofit organization. The file contains both simulated demographics (marital status, gender and age range, among others) and giving patterns that are individually random but related to more general patterns corresponding to each fictional donor’s demographic and geographic characteristics.
The dataset was simulated in the R programming language using fictitious weights. These weights were generated out of whole cloth rather than empirically, though they are loosely inspired by real-world data from the U.S. Census and Indiana University’s Lilly Family School of Philanthropy. Each simulated donor also has a specific giving interest, assigned at rates that vary by state; the interest areas more or less correspond to the Urban Institute’s National Taxonomy of Exempt Entities (NTEE) codes. Since the dataset is fictitious, none of the donors actually exist and no donor privacy concerns arise.
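To make the idea of weight-driven simulation concrete, here is a minimal sketch in Python (the Apra dataset itself was built in R, and its actual code is not shown here). The age ranges, weights, and median gift amounts below are invented purely for illustration and are not the values used by the committee. Each donor's gift is individually random, but the group means still reflect the weights assigned to each age range.

```python
import random

# Hypothetical group-level weights, invented for illustration only.
# They are NOT the weights used to build the Apra dataset.
AGE_WEIGHTS = {"18-34": 0.25, "35-54": 0.40, "55+": 0.35}
MEDIAN_GIFT = {"18-34": 50.0, "35-54": 120.0, "55+": 200.0}


def simulate_donors(n, seed=0):
    """Return n fictitious donors whose gifts are random but shaped by age range."""
    rng = random.Random(seed)  # seeded so the simulation is reproducible
    ages = list(AGE_WEIGHTS)
    weights = list(AGE_WEIGHTS.values())
    donors = []
    for donor_id in range(1, n + 1):
        # Draw a demographic category according to the group weights.
        age_range = rng.choices(ages, weights=weights, k=1)[0]
        # Multiply the group's median gift by log-normal noise:
        # random for each individual, patterned at the group level.
        gift = MEDIAN_GIFT[age_range] * rng.lognormvariate(0.0, 0.5)
        donors.append(
            {"donor_id": donor_id, "age_range": age_range, "gift": round(gift, 2)}
        )
    return donors


def mean_gift_by_group(donors):
    """Average gift per age range, to check that the group pattern survives the noise."""
    totals = {}
    for donor in donors:
        running_sum, count = totals.get(donor["age_range"], (0.0, 0))
        totals[donor["age_range"]] = (running_sum + donor["gift"], count + 1)
    return {group: s / c for group, (s, c) in totals.items()}


if __name__ == "__main__":
    simulated = simulate_donors(5000, seed=42)
    for group, avg in sorted(mean_gift_by_group(simulated).items()):
        print(group, round(avg, 2))
```

A log-normal multiplier is only one plausible choice for right-skewed gift amounts; any distribution whose center depends on the donor's group would show the same principle.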
Data Science Now Challenge participants will receive the Apra dataset as a .csv file, which they can load into Excel, R, Python or the statistical software of their choice. The dataset lends itself to several uses. First, it can be used for data exploration as you try to determine the relationships between the dataset’s variables and create visualizations. Second, it can be used to test regression, classification or clustering machine learning algorithms. Finally, each donor has a randomly generated geographic point within a randomly generated ZIP code, making the dataset suitable for mapping with GIS software.
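As a starting point for the exploration step, the sketch below reads a donor .csv with Python's standard library and summarizes giving by a demographic column. The column names (`marital_status`, `age_range`, `total_giving`) and the six sample rows are assumptions made up for this example; the real challenge file may use different names, so adjust them after inspecting its header row.

```python
import csv
import io
from collections import defaultdict
from statistics import mean, median

# Toy stand-in for the challenge file. These rows and column names are
# invented for illustration and do not come from the Apra dataset.
SAMPLE_CSV = """donor_id,marital_status,age_range,total_giving
1,Married,35-54,250.00
2,Single,18-34,40.00
3,Married,55+,900.00
4,Single,35-54,120.00
5,Married,18-34,75.00
6,Single,55+,300.00
"""


def load_rows(handle):
    """Read donor rows from an open CSV handle, converting giving to float."""
    rows = []
    for row in csv.DictReader(handle):
        row["total_giving"] = float(row["total_giving"])
        rows.append(row)
    return rows


def summarize_by(rows, column):
    """Group rows by a categorical column and summarize total_giving per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row["total_giving"])
    return {
        key: {"n": len(values), "mean": mean(values), "median": median(values)}
        for key, values in groups.items()
    }


rows = load_rows(io.StringIO(SAMPLE_CSV))
summary = summarize_by(rows, "marital_status")
for status, stats in sorted(summary.items()):
    print(status, stats)
```

To run this against the actual file, pass `open(path, newline="")` to `load_rows` in place of the `io.StringIO` object, where `path` points to the downloaded .csv.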
This dataset is, first and foremost, a simulation of real-world non-profit data. As mentioned above, however, the dataset only represents a single fictitious giving cohort for a fictitious organization. Any inferences drawn from it do not relate to any real-world giving population.
This dataset is not intended to serve as a single, authoritative example donor dataset that meets the needs of non-profit data science professionals for all time. Additional fictitious datasets that better simulate real giving behavior will be a welcome development and should be encouraged.
The Apra Data Science Committee is excited to release this sample dataset as our profession continues to explore how data science can make us better prospect researchers and portfolio managers. We encourage all Apra members and prospect development professionals to take advantage of this opportunity by signing up for the Apra Data Science Now Challenge. More information about the event can be found online here; the deadline for submissions is Monday, July 17.

Andy McMahon
Lead Data and Prospect Analyst, United States Holocaust Memorial Museum
Andrew (Andy) McMahon is the lead data and prospect analyst at the United States Holocaust Memorial Museum. He is interested in every aspect of fundraising operations and is particularly passionate about optimizing prospect management and discovery with machine learning in R and Python.
Previously, Andy worked on the donor operations team at Share Our Strength. He is an '09 graduate of Carleton College with a degree in philosophy and a member of Apra International's Data Science Committee. He has two amazing cats.