There’s a lot more to speech in society than simply the words we speak. A whole other level of data comes with our spoken and written interactions, even when what we’re saying has nothing to do with numbers or statistics. Experts are increasingly recognizing the importance of text as data.
How can there be data in things we say if we’re not talking about data?
Good question. The whole thing is a little bit meta. I’m referring to data about the words being said:
- Who is saying them? (gender, age, race, sexual orientation, class)
- How often are they being said?
- What is the response?
- Is there a pattern in what happens next?
Text as Data to Provide Context
In a World Bank study on inequality in deliberative institutions, Ramya Parthasarathy and her co-authors examined text as data across assemblies in the Indian state of Tamil Nadu. Analyzing transcripts of these village assemblies, the researchers studied not only the discussion topics but also the gender and social status of each speaker. The goal was to see whether influence differed between men and women, or between government officials and ordinary citizens.
Parthasarathy found that, in general, citizens had a fair bit of power in setting the agenda of meetings compared to officials. But the difference between male and female speakers was much more apparent. On average:
- More than half of meeting attendees were women
- Only one-third of available floor time was held by women
- Topics raised by women were significantly less likely to elicit a topical response
While the meetings were not about gender inequality, treating the transcripts as data shows that the power differential between men and women was still affecting policy in these villages.
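To make the floor-time idea concrete, here is a minimal sketch of how a speaker's share of the conversation could be measured from a transcript. It is not the study's method: the transcript below is invented for illustration, the `floor_share` helper is hypothetical, and word count is only a rough stand-in for floor time.

```python
from collections import Counter

# Hypothetical toy transcript: (speaker gender, utterance).
# Invented for illustration; this is NOT data from the World Bank study.
transcript = [
    ("F", "The water pump near the school is broken"),
    ("M", "We will discuss the road repairs first and then the budget for next year"),
    ("M", "The road contractor has not been paid yet"),
    ("F", "What about the pump"),
]

def floor_share(turns):
    """Return each group's share of all words spoken (a rough proxy for floor time)."""
    words = Counter()
    for group, utterance in turns:
        words[group] += len(utterance.split())
    total = sum(words.values())
    return {group: count / total for group, count in words.items()}

shares = floor_share(transcript)
for group, share in sorted(shares.items()):
    print(f"{group}: {share:.0%} of words spoken")
```

In this toy example women take half of the speaking turns but hold only about 35% of the words, which is the kind of gap that only shows up once the text is counted.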
Machine Learning and Text Analysis
Advancements in text analysis technology and machine learning are what make viewing text as data possible. Prior to these developments, it would have been nearly impossible (or at the very least, prohibitively time-consuming) to collect and measure this sort of data.
The value of text as data goes beyond highlighting inequalities in the social structure. Last year, I shared the story of Crisis Text Line, which used a data mining program to search through all the information it had collected on incoming messages. The software uncovered a significant link between the word “ibuprofen” in texts and the texter seriously contemplating suicide.
In fact, “ibuprofen” occurred more frequently than “suicide” in messages from people seriously considering ending their lives. The relationship isn’t an obvious one. It probably never would have been found without a data mining program.
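The basic comparison behind a finding like this can be sketched in a few lines: count how often a word appears in one group of messages versus another. This is only an illustration, not Crisis Text Line's actual software. The messages and the two group labels are made up, and `term_rate` is a hypothetical helper.

```python
import re

def term_rate(messages, term):
    """Fraction of messages containing `term` as a whole word (case-insensitive)."""
    if not messages:
        return 0.0
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return sum(1 for message in messages if pattern.search(message)) / len(messages)

# Hypothetical toy messages, split by a made-up risk label (not real data).
high_risk = ["ibuprofen bottle", "cannot sleep", "Ibuprofen again", "so tired"]
other = [
    "exam stress", "ibuprofen for headache", "argument with friend",
    "lonely tonight", "work stress", "cannot sleep",
]

rate_high = term_rate(high_risk, "ibuprofen")
rate_other = term_rate(other, "ibuprofen")
print(f"high risk: {rate_high:.2f}, other: {rate_other:.2f}")
```

Here the word shows up in half of the first group and one in six of the second, so it is three times as common in the first. A real analysis would run this kind of comparison over every word in millions of messages and test whether the difference is statistically meaningful.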
The sheer volume of data we can gather on our words makes collection and analysis using traditional methods a non-starter. But the power hidden within the data can be life-altering.
Accessing the Information
Julia Silge, a data scientist I’m proud to say I’ve worked with on a number of big Datassist projects, has developed a great way to handle text as data. Together with David Robinson, Julia has developed an R package called tidytext to make text mining simpler and more efficient.
As a data scientist, Julia recognized how challenging it could be to deal with ever-expanding volumes of text. The tidytext package aims to let analysts manipulate and visualize text as data within their existing workflows, so that valuable information doesn’t go unused.
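The core idea behind tidytext is a table with one token per row, which ordinary data tools can then count, filter, and join. The sketch below imitates that idea in plain Python. It is not the tidytext API: the helper is loosely named after the package's `unnest_tokens` function, but its signature, the simple word pattern, and the sample rows are all assumptions made for this example.

```python
import re
from collections import Counter

def unnest_tokens(rows, text_key="text", token_key="word"):
    """Expand each row into one row per lowercase word, keeping the other fields."""
    tidy = []
    for row in rows:
        meta = {key: value for key, value in row.items() if key != text_key}
        for token in re.findall(r"[a-z']+", row[text_key].lower()):
            tidy.append({**meta, token_key: token})
    return tidy

# Hypothetical sample rows: one row per utterance.
docs = [
    {"speaker": "A", "text": "Text is data."},
    {"speaker": "B", "text": "Data about text!"},
]

tidy = unnest_tokens(docs)                      # one row per word
counts = Counter(row["word"] for row in tidy)   # word frequencies across all rows
print(counts.most_common(2))
```

Once the text is in this shape, questions like "which words does each speaker use most?" become simple grouping and counting operations.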
Do you need help analyzing text as data? Want to know more about how you can collect the unspoken information we’re all generating every day? The team at Datassist is here to help. Get in touch with us today.