A consortium of researchers dedicated to improving the understanding of the human causes and consequences of terrorism

Discussion Point: Embrace "Big Data," but don't ignore the human element in data coding


The following is part of a series of thought pieces authored by members of the START Consortium. These editorial columns reflect the opinions of the author(s), and not necessarily the opinions of the START Consortium. This series is penned by scholars who have grappled with complicated and often politicized topics, and our hope is that they will foster thoughtful reflection and discussion by professionals and students alike.


An August article on ForeignPolicy.com presented a map of every protest around the world, drawn from the Global Database of Events, Language, and Tone (GDELT). The GDELT dataset promises real-time updates through an automated process that scans online news sources for relevant information and adds observations to the dataset. This process essentially replaces the laborious coding procedures that research centers like START use to develop datasets and eliminates the significant lag time between current events and their inclusion in the dataset.

Data sources like GDELT represent the promise of "big data" to researchers and policymakers. The massive amount of information made available through the internet makes it easier to find out what is happening all over the world. At the same time, the thousands of articles and blog posts on an event can overwhelm researchers, who lack the time to sift through all of that data; more importantly, this glut can lead to significant biases if the first pieces of information one receives are not the most accurate reports (which may be buried deep in the search results).

As many a current and former college research assistant can attest, most datasets develop through human beings reading sources, taking note of important information, and quantifying that information in a dataset researchers can later use. This is difficult enough with phenomena like international treaties or terrorist attacks—which START's Global Terrorism Database collects—given the significant number of events and the multiple sources on each one. For something like protests (which GDELT covers), it would be almost impossible to collect this data on a global scale.

In response, new, powerful computer platforms have been developed to wade through information, automatically code it into useful formats, sort it and help researchers and policymakers access the most relevant or important pieces. These platforms, their proponents claim, can revolutionize our understanding of the world and make it easier to solve some of the difficult problems we face today.

One of the most important ways in which they do this is by removing the human element from data collection and coding; automating collection and coding can greatly improve the speed of the process and ensure important information is not overlooked because of human coders' limited resources.

It is true that "big data" platforms—like GDELT—make possible things researchers could only dream of in previous decades. But there is a downside to abandoning the human element of coding. Computers cannot assess whether a spike in the frequency of an event—such as terrorist attacks—is due to an actual intensification of a conflict or to greater attention to a country by information sources (like major newspapers).

And while computers are useful for finding key terms in a news source, they struggle to judge the significance of those terms; a program directed to identify instances of "war" connected to "trade" cannot tell us whether trade provoked a war, averted a war, or whether the "war" was really just escalating tariffs between two trading rivals. Granted, at the current rate of technological progress these obstacles may cease to be obstacles in the near future; with today's technology, however, they still complicate attempts at automated coding.
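To illustrate the point, here is a minimal sketch of naive keyword co-occurrence coding. It is a hypothetical Python example, not the pipeline GDELT or any other project actually uses; the sentences and the naive_code function are invented for illustration. The takeaway is simply that all three sentences trigger the same flag, even though at most one of them describes a "trade war."

```python
# A minimal sketch (not any real event-coding pipeline) of naive keyword
# co-occurrence coding, illustrating why it cannot distinguish
# "trade provoked a war" from "trade prevented a war."

sentences = [
    "Analysts say expanded trade prevented a war between the rivals.",
    "The border dispute over trade routes provoked a brief war.",
    "The tariff fight is widely described as a trade war.",
]

def naive_code(sentence, keywords=("war", "trade")):
    """Flag a sentence as a 'trade-war event' if both keywords appear."""
    text = sentence.lower()
    return all(k in text for k in keywords)

for s in sentences:
    print(naive_code(s), "-", s)
# All three sentences are flagged True, even though only one (at most)
# describes the kind of event a human coder would record.
```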

Human coders—while slow and occasionally distracted—excel at these tasks. Slowly and methodically reading through sources and filling out a codebook ensures a "war prevented through trade" is not coded as a "trade war." Double-coding observations and comparing the level of agreement among coders not only tells us how accurate the coding is, but also how clearly and usefully the coding instrument—the set of questions the coding answers—captures the phenomenon we are studying. And reviewing the coding results with subject matter experts can further validate the process and results.
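As a rough illustration of what double-coding buys you, the sketch below (a toy Python example with made-up labels, not START's actual reliability procedure) compares two coders' answers on the same observations and computes raw agreement alongside Cohen's kappa, a standard statistic that discounts the agreement expected by chance.

```python
# A toy illustration (not START's actual procedure) of double-coding:
# two coders label the same observations, and we compare their agreement.

from collections import Counter

coder_a = ["yes", "no", "yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
coder_b = ["yes", "no", "yes", "no",  "no", "yes", "no", "yes", "yes", "yes"]

n = len(coder_a)
observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

# Cohen's kappa corrects raw agreement for agreement expected by chance.
freq_a, freq_b = Counter(coder_a), Counter(coder_b)
expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(coder_a) | set(coder_b))
kappa = (observed - expected) / (1 - expected)

print(f"Percent agreement: {observed:.2f}")
print(f"Cohen's kappa:     {kappa:.2f}")
# Low agreement on a particular variable suggests the codebook question is
# ambiguous, not necessarily that the coders are careless.
```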

More importantly, the amount of time it takes to build a dataset gives researchers an incentive to think carefully about their questions and the scope of the data they will collect; automated collection and coding make it tempting to lump different types of events and locations together—which may not be appropriate—whereas a more narrowly scoped project would be both feasible for human coding and better reflect the context-specific nature of the phenomenon.

The importance of the human element of data coding is apparent in the ongoing Profiles of Individual Radicalization in the United States (PIRUS) project at START. PIRUS collects information on 154 variables that, based on an extensive literature review, have been deemed relevant to the study of radicalization. Data coders compile the information from thousands of sources—newspaper articles, books, websites, court cases, think tank reports, and so on—resulting in an enormous amount of data.

We also double-code a subset of the data to ensure its reliability. The process is time-consuming and labor-intensive. Automated coding procedures could be implemented to cut costs and reduce turnaround time, but human discretion is critical for interpreting contextual clues within and across sources.

The human element has proved particularly useful for one variable in PIRUS, which asks what role the internet played in the individual's radicalization. This information will yield important insight into an ongoing debate in the field of terrorism studies between those who believe radicalization often begins and advances online and those who believe the internet is an important, albeit secondary, means of radicalization.

Merely finding a mention of the "internet" in an information source—as a computer program would—does not tell us what role, if any, it played in someone's radicalization. In one example from the dataset, an individual adopted radical beliefs, began acting on them, and only later watched many extremist videos online; a program would likely code the internet as having played an important role, but the human coder's close reading of the context led to the variable being coded correctly.

This is not to say that researchers should avoid "big data" approaches, or that datasets like GDELT are flawed. GDELT is an incredible resource for scholars, and hopefully other projects will incorporate some of its procedures. And it is not an "either/or" choice. Researchers can still review the results of automated coding to assess their validity and reliability, as GDELT has done. Or coding procedures could combine the respective strengths of human and automated coding, as Daniel J. Hopkins and Gary King (of Georgetown University and Harvard University, respectively) proposed in a 2010 article and accompanying software program. But the promise and excitement of "big data" should not lead researchers and policymakers to ignore the human element in coding.