It all started with an email from Karen.
At that time, Karen was our gift officer for the Ophthalmology department at the University of Nebraska Medical Center’s College of Medicine. I was a recently hired junior prospect researcher and report writer at the University of Nebraska Foundation. For my first few months, I got the hang of the routine work — prospect profiles, contact lists and the occasional statistical analysis. But I wanted to pursue more challenging work. I am a natural self-teacher and had begun to experiment with the R programming language, and I was eager to find a project that would further develop my data science skills.
And then, Karen sent me a doozy of a challenge.
“Could you find a list of people who have given to eye-related nonprofits?”
At first, I was flummoxed. How would I find that kind of information, and for such a narrow philanthropic interest? Further, this question raised an even bigger challenge common to many nonprofits: How do we prospect for community needs beyond natural constituencies like alumni or patients, or needs that have no natural constituencies? Rather than brush off Karen’s request, I resolved to find the answer. I had no idea that my search for a solution would lead me to the world of text analytics.
Finding the Data
Before undertaking the project, I had a discussion with my supervisor, Jessie Rader, about the possibilities of a prospect interests project. She was immediately supportive and suggested I broaden its scope beyond Ophthalmology. The University of Nebraska system has several independent research institutes and museums that would benefit from a project like this.
My first task was determining whether this project was even possible. Where would I find data on individual charitable giving? Fortunately, our sales contact at Blackbaud informed me about the NOZA database the company had acquired several years earlier. NOZA is an online database of charitable gift information derived from publicly available sources. After getting buy-in from our head of prospect research, we purchased a screening on all assigned prospects and suspects who had previously been identified by wealth screening.
A couple of weeks later, the data was in my hands. If you aren’t familiar with NOZA data, each result comes back with, among other fields, a link to the source of the gift information, such as a news release or an annual report. I took a sample of the largest reported gift amounts and spot-checked the sources for accuracy. With the exception of a single entry, the data was accurate.
Next, I inspected the charity type coding. The dataset Blackbaud provides contains coding that describes each charity’s mission in broad terms, such as “Arts and Humanities.” I decided the charity type coding was not granular enough for our needs — we needed more information about a prospect’s interests than the fact that they give to “Arts and Humanities.” With this dataset verified for quality, I could proceed with the project.
Planning the Project
The most important step in planning a prospect interests tool was determining objectives early on. Who will use the end product? For what will the product be used? What form should it take? Making sure you define those answers before writing a single line of code will save you much grief later on. You are less likely to meander aimlessly without producing results, or worse, to produce work that goes unused by your stakeholders. It was helpful for me to write a project plan, roughly following the CRISP-DM guidelines.
For this project, I defined the answers to the above questions as such:
- Who will use the end product? The prospect managers, who, at UNF, are responsible for identifying prospects and managing portfolios.
- For what will the product be used? The end product will be used for generating prospects for university needs that are narrow in scope and/or lack natural constituencies such as alumni.
- What form should that end product take? Prospects should have a descriptive, human-readable interest code that captures the suspected level of interest the prospect has in each philanthropic mission category. Each prospect can have more than one philanthropic interest code, but not more than one interest level for each philanthropic mission category.
With a plan in place, I needed to determine my tools and process. As stated before, the coding from the NOZA dataset did not provide the level of granularity I sought. I needed to identify an alternative data source for describing the charities our prospects gave to.
The best data source for describing a charity’s mission is the charity itself — specifically, its website. Using web scraping and text analytics, I could systematically categorize the charities through topic modeling. Knowing what data I needed, I could now broadly define my data pipeline. A pipeline is an ordered chain of computer processes that will make the data suitable for analysis. I determined I needed to write code to:
- Download the web pages of each charity our prospects gave to and store those pages in folders — one per charity.
- Convert the downloaded pages from HTML to plain text.
- Process the text, including removing symbols and numbers, to make it compatible with text analytics software.
- Pick a text analytics software package and use it to identify how the charities naturally cluster by topic.
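In R, that pipeline can be sketched as a chain of functions. Every function name here is a hypothetical placeholder for one of the four steps above, not code from the actual project:

```r
# High-level sketch of the pipeline; all function names are hypothetical
# placeholders for the steps described in the list above.
run_pipeline <- function(charity_urls) {
  scrape_charities(charity_urls)        # 1. download each charity's website
  texts <- html_to_text("charities/")   # 2. convert HTML pages to plain text
  bags  <- clean_text(texts)            # 3. strip symbols, numbers, stop words
  fit_topic_model(bags)                 # 4. cluster the charities by topic
}
```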
The first step was getting the text data. After researching my options, I went with a piece of software outside the R ecosystem: wget, a free, open-source program that allows you to systematically download files from the internet. Wget is operated via the command line, like an old DOS program. While programs such as wget have a learning curve, it was the only software I found at the time that fit my needs precisely. Needless to say, I spent much of my time Googling Stack Overflow!
I required software that would allow me to scrape not just efficiently, but also ethically. There are certain ethical guidelines you should follow when undertaking a scraping project.
- Respect the website owner’s wish not to be scraped: Websites can use a file called “robots.txt” to tell software not to download their content. Wget has a setting that will ignore this file, but I recommend you don’t use it.
- Do not hammer websites with numerous automated requests: If a server notices you are sending multiple automated requests, the operator may perceive you as a threat and block your organization from the site. Wget allows you to limit the number of hits made on a site, randomize the time between hits, limit the download speed of each request and limit the amount of data downloaded. These settings make it less likely you will be blocked.
In addition to ethical concerns, wget also allowed me to download all files on a site without knowing the file names beforehand. Wget will take a base URL, climb up and down the site’s directories and grab whatever is there. Lastly, wget lets you specify what types of files you want. In this case, I used a list of common file extensions for web pages, including .HTML, .HTM and .ASP.
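Putting those settings together, a wget invocation along the following lines covers the politeness options and file-type filters described above. The URL, output directory and specific values are placeholders, not the exact settings I used:

```shell
# Hypothetical wget invocation for a single charity; the URL and output
# directory are placeholders. wget respects robots.txt by default.
URL="https://example-charity.org"
OUTDIR="charities/example-charity"

WGET_CMD="wget --recursive --no-parent \
  --wait=2 --random-wait \
  --limit-rate=200k --quota=50m \
  --accept html,htm,asp \
  --directory-prefix=$OUTDIR $URL"

# Print the assembled command rather than hitting the network here.
echo "$WGET_CMD"
```

The `--wait` and `--random-wait` flags space out requests, `--limit-rate` and `--quota` cap bandwidth and total download size, and `--accept` restricts downloads to common web-page extensions.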
Once I figured out wget, I wrote my first piece of R code: a function that called wget with my desired settings and looped over the list of charities our prospects gave to. Overall, there were around 5,500 charities to scrape. Running on my MacBook, the scraping took two months. (I will address strategies for reducing this time later in the article.)
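My original script is not reproduced here, but a wrapper along these lines captures the idea. The function and column names are hypothetical, and the wget flags are illustrative rather than my exact settings:

```r
# Hypothetical sketch of the scraping loop; `charities` is assumed to be
# a data frame with `name` and `url` columns.
scrape_charity <- function(name, url) {
  out_dir <- file.path("charities", name)
  args <- c("--recursive", "--no-parent", "--wait=2", "--random-wait",
            "--limit-rate=200k", "--accept", "html,htm,asp",
            "--directory-prefix", out_dir, url)
  system2("wget", args)  # invoke wget as an external program
}

for (i in seq_len(nrow(charities))) {
  scrape_charity(charities$name[i], charities$url[i])
}
```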
When the scraping finally finished, I had about 5,000 folders on my hard drive containing 9 GB of HTML. These files now needed to be loaded into R and processed. At the time, I used the tidyverse packages to systematically read the HTML files into memory, and the XML package to turn the HTML into plain text. There is a more modern package in the R ecosystem, rvest, that could likely do the same thing; however, the XML package worked for me at the time and is still actively developed.
Once I finished processing the HTML into text, I merged the text files for each charity into one list of words per charity. That way, there would be one “bag” of words to analyze for each charity. This approach is called the “bag of words” method. When using this simple method, you need only the words of your text; there is no need to preserve word order, capitalization or punctuation. In addition to lowercasing the words and removing punctuation, I also removed all numbers and stop words (common words like “the” and “that”). Removing these from your analysis helps the algorithm focus on important, meaningful words.
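A rough sketch of that cleanup step might look like the following. Here `raw_text` is a hypothetical character vector holding one merged text string per charity; the stop word list comes from the tm package:

```r
library(tm)  # provides stopwords(); install.packages("tm") if needed

clean_words <- function(txt) {
  txt <- tolower(txt)                      # lowercase everything
  txt <- gsub("[^a-z ]", " ", txt)         # drop punctuation, numbers, symbols
  words <- unlist(strsplit(txt, "\\s+"))   # split into individual words
  words[words != "" & !(words %in% stopwords("en"))]  # remove stop words
}

# `raw_text` is a hypothetical character vector, one string per charity.
bags <- lapply(raw_text, clean_words)
```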
After I had my dataset prepared, the fun began — modeling! I chose the topicmodels package from the CRAN repository. This package contains an implementation of the Latent Dirichlet Allocation (LDA) algorithm, which can be used for seeing how texts align with unlabeled topics. Kailash Awati published a non-mathematical tutorial on both the algorithm and the topicmodels package that is immensely helpful.
One issue with this package is that you must pick the number of topics (that is, the number of unlabeled clusters) before running the analysis. I experimented with 25, 50 and 75 topics in each run, eventually settling on the 50-topic model, as it struck the right balance between granular segmentation of charities and interpretability of topics. While the topicmodels package will not label topics for you, it will provide a list of words that are strongly associated with each topic, which helps in ascribing human-readable labels. For example, a topic with words like “dogs,” “cats” and “rescue” could be labeled Animal Welfare.
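Fitting the model with topicmodels looks roughly like this. The `bags` object (a hypothetical list with one word vector per charity), the seed and the exact control settings are assumptions for illustration:

```r
library(tm)
library(topicmodels)

# Build a document-term matrix with one document per charity; `bags` is
# a hypothetical list of word vectors, one per charity.
docs <- vapply(bags, paste, character(1), collapse = " ")
dtm  <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Fit Latent Dirichlet Allocation with k = 50 topics.
lda_fit <- LDA(dtm, k = 50, control = list(seed = 1234))

terms(lda_fit, 10)                    # top 10 words per topic, for hand labeling
scores <- posterior(lda_fit)$topics   # per-charity topic probabilities
```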
I was very satisfied with my results. While I did not get to the level of granularity I hoped for, the model was able to tell the difference between Protestant and Catholic organizations. It could also segment art museums, symphonies, community theaters, ballets and folk arts organizations into separate topics. However, it was not able to differentiate between zoos and botanical gardens. My assumption was that there were too few examples of gifts to those types of charities in my dataset.
With topic scores calculated for each charity, I now needed a way to relate the topics to prospect giving behavior. For this stage, I collaborated with the prospect management team, as they were the intended users. Together, we created a rating system with three levels: Very Interested, Interested and Somewhat Interested. To qualify as “Very Interested” in a philanthropic interest, a prospect had to make a gift over $100,000 (major gifts level), a planned gift, a capital gift, or gifts to multiple charities of the same interest over multiple years. To qualify as “Interested,” a prospect had to give to multiple charities of the same interest or over multiple years. To qualify as “Somewhat Interested,” a prospect needed to give at least once to an interest. Those whose only gift was a memorial gift were not counted as “Interested.”
I took our rating system rules and created the logical expressions that would assign the interest levels to each individual. The final product had one row for each prospect, prospect interest and interest level rating.
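The logical expressions themselves do not appear in this article, but with dplyr the rules could be encoded roughly as follows. Every column name here is a hypothetical stand-in for fields in our gift data:

```r
library(dplyr)

# `gifts` is a hypothetical data frame of prospect gifts with columns
# prospect_id, interest, amount, charity_id, gift_year and gift_type.
rated <- gifts %>%
  group_by(prospect_id, interest) %>%
  summarise(
    total         = sum(amount),
    n_charities   = n_distinct(charity_id),
    n_years       = n_distinct(gift_year),
    planned       = any(gift_type == "planned"),
    capital       = any(gift_type == "capital"),
    memorial_only = all(gift_type == "memorial"),
    .groups = "drop"
  ) %>%
  mutate(rating = case_when(
    memorial_only                        ~ NA_character_,   # memorial-only gifts are not rated
    total > 1e5 | planned | capital |
      (n_charities > 1 & n_years > 1)    ~ "Very Interested",
    n_charities > 1 | n_years > 1        ~ "Interested",
    TRUE                                 ~ "Somewhat Interested"
  ))
```

The result has one row per prospect and interest, matching the final product’s shape of prospect, prospect interest and interest level rating.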
Using the Ratings
The initial impetus for the project was to find Ophthalmology prospects. Unfortunately, the data and the algorithm I used did not provide that level of granularity. I may experiment with more modern approaches, such as word2vec or GloVe word embeddings, to see if they can deliver the granularity I am after.
The work was still useful, though. One of the prospect interest types I discovered was Folk Arts. My team realized that prospects interested in Folk Arts philanthropy were excellent prospects for the Quilt Museum. The Quilt Museum was another difficult prospecting project my colleagues were engaged in. The museum did not have a natural pool of identifiable prospects to draw from, so this was a boon to our team’s efforts. In the near term, the team at UNF will also use the list to prospect for the Buffett Early Childhood Institute and the Daugherty Water for Food Global Institute.
In the course of my project, I learned important lessons. These tips will save you many headaches when you undertake your own data science project.
- Use containerization technology: Containerization packages software libraries so a project can be run at any time, on any system, producing the same results. Packaging R or Python scripts with exact versions of their libraries enables consistent results, even if the scripts are run years later. It also increases the portability of your code. I ran into incompatibilities between the macOS and Windows versions of R when I tried to run my analysis on my work computer; using containerization from the start would have saved me much grief. The most well-known containerization technology is Docker, but alternatives such as Podman and containerd exist.
- Use the cloud: I ran the web scraping from my home laptop rather than tie up a machine at work. This meant my laptop was effectively unusable for two months, since I could not risk crashing the computer while the script ran. If I repeat this, I will buy time on a cloud provider such as Google Cloud Platform, Microsoft Azure or Amazon Web Services. Their pricing is affordable, and you can select an appropriately powered machine for your use case.
- Use multiple threads: By default, R is single-threaded, meaning it can only do one task at a time, in order. A multi-threaded program, by contrast, can use a computer’s multiple processing cores to do work in parallel. Since I first wrote my code, doing parallel work in R has become much easier with the furrr package. I plan to eventually rewrite my code to use this package to make the web scraping go faster.
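As an illustration of how furrr parallelizes a loop like mine — `charity_urls` and `scrape_charity()` are hypothetical names standing in for the scraping code described earlier:

```r
library(future)
library(furrr)

plan(multisession, workers = 4)  # launch 4 parallel background R sessions

# `charity_urls` and `scrape_charity()` are hypothetical; each URL is
# handed to whichever worker session is free.
future_walk(charity_urls, scrape_charity)
```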
Through the use of both screenings and analytics, it is possible to find prospects outside normal constituencies. The web and our own databases are full of text with valuable insights. Contact reports, news articles and pre-campaign feasibility interviews are good starts for dipping your toes into text analytics. With a thoughtful plan, buy-in from management and collaboration with stakeholders, your own text analytics project can be successful.