Building a Unified Infrastructure for Data Integration on Political Violence and Conflict

Project Details

Abstract:

This conceptualization project brings together leaders of several major data collections, along with experts in insurgency, ethnic conflict, terrorism, computer science, and data management, to build consensus around a unified infrastructure for political violence data collection and integration.

The project will create a scalable ontology focused on four domains relevant to all forms of political violence: conflicts, actors, geographies, and events. This ontological infrastructure will allow researchers and policy makers who study political violence, including its etiology and consequences, to access and integrate existing data from the major political violence data collections currently maintained at academic institutions and non-governmental organizations around the world. This infrastructure will also facilitate the development of new non-linear and semantic analytical methods to decipher the complex dynamics of conflict and violence.

Finally, this infrastructure will provide a platform for the next generation of data collection on violence and conflict by creating a flexible framework for new data collection projects, one that is easily exchangeable and adheres to the consensus-based ontologies the project develops. The proposed project incorporates a two-day stakeholder workshop, a schematic design of the ontological infrastructure for political violence data collection, and a web-based, community-moderated issue-tracking wiki to encourage participatory development and refinement of the ontologies.

Additionally, the proposed project supports one of the core missions of NSF’s Big Data initiative, focusing on the development of a data infrastructure relevant to national security.

Primary Findings:

Over the past decade, the empirical understanding of varying types of political conflict has progressed rapidly with the development of large-N databases on political violence, terrorism, insurgencies, riots, and ethnic violence. However, given the largely uncoordinated and independent nature of data collection and analysis efforts, our knowledge about conflict and political violence remains limited, confined to research “silos” maintained by individual research institutions. Despite advances, the predominant mode of data collection for these individual databases remains relatively unsophisticated: projects use identifiers that are unique to each project, collect at various levels of granularity with little integration, and adopt varying definitions and protocols. Moreover, no existing database provides an overarching framework that allows new data sources and variables to be seamlessly incorporated with past work.

In this project we brought together representatives from the world’s leading data collections on political violence and conflict, along with experts on insurgency, ethnic conflict, terrorism, computer science, and data management, to determine whether we could build consensus for developing a unified infrastructure for the collection and integration of political violence data. Our goal was to explore the possibility of creating a scalable integration framework that would allow researchers and policymakers to access and integrate existing data from the major political violence data collections currently maintained at research institutions around the world.

The project produced four major outcomes with the potential to advance knowledge of political violence and conflict and, ultimately, to better the world through an improved understanding of the causes and consequences of political violence. First, by reaching out to the custodians of the world’s most important databases on political violence, we have built a network of stakeholders from around the world and learned that these stakeholders share our vision for moving forward on data integration. This “buy-in” is a critical first step to successful data integration.

Second, in the course of reviewing the relevant literature and existing databases, and working extensively with experts, especially computer scientists, we greatly changed our thinking about how data integration would best be achieved. In our original NSF proposal we had planned to proceed by what could be described as deep integration. Deep integration requires individual data collectors to adopt standard protocols and code variables in a uniform manner and, by implication, assumes that there is a single way in which all integrated data can be defined. A downside of this approach is that if later work suggests that the definitions selected are not optimal, participants are stuck with earlier decisions made on protocols and variables. In the course of this grant, both our team and our stakeholders have become convinced that shallow integration provides a more effective model. Shallow integration requires only that there be agreement on some definitions across data collections, not full conformity across coding efforts. It requires few changes to existing workflows, allows decentralized methods for development and coding, and builds on the technology already available to each data collection effort being integrated. The advantages of shallow integration are that it provides a means to develop common coding standards without imposing these standards on stakeholders, and that it builds in flexibility as knowledge in particular areas accumulates.
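To make the contrast concrete, the following is a minimal Python sketch of one way shallow integration could work, assuming each collection publishes a small crosswalk from its own identifiers to a shared actor identifier. The collections, field names, identifiers, and the `integrate` helper are all hypothetical and are not drawn from any participating database.

```python
# Hypothetical records from two independent data collections. Each keeps
# its own identifiers, variables, and coding rules.
collection_a = [
    {"group_code": "A-017", "year": 2009, "attacks": 12},
    {"group_code": "A-044", "year": 2009, "attacks": 3},
]
collection_b = [
    {"org": "ORG_92", "year": 2009, "fatalities": 40},
    {"org": "ORG_15", "year": 2009, "fatalities": 7},
]

# Crosswalks: the only thing the projects agree on is a shared actor ID.
# Source records are never rewritten (all IDs here are invented).
crosswalk_a = {"A-017": "actor:001", "A-044": "actor:002"}
crosswalk_b = {"ORG_92": "actor:001"}  # ORG_15 has no agreed mapping yet


def integrate(records_a, records_b, xwalk_a, xwalk_b):
    """Link two collections on (shared actor ID, year), keeping native fields."""
    merged = {}
    for rec in records_a:
        shared = xwalk_a.get(rec["group_code"])
        if shared is not None:
            merged.setdefault((shared, rec["year"]), {})["a"] = rec
    for rec in records_b:
        shared = xwalk_b.get(rec["org"])
        if shared is not None:
            merged.setdefault((shared, rec["year"]), {})["b"] = rec
    return merged


linked = integrate(collection_a, collection_b, crosswalk_a, crosswalk_b)
for key in sorted(linked):
    print(key, sorted(linked[key]))
```

In this sketch, unmapped records are simply left out of the linked view rather than forced into a common scheme, and a crosswalk can be revised later without touching either source collection.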

Third, in our original NSF proposal we had planned to focus data integration on four elements: conflict, actors, geography, and events (the CAGE model). Based on our literature review and interaction with experts, we now conclude that two components of CAGE, conflicts and events, are considerably more complex to integrate than actors and geography. Actors and geography are common to virtually all political violence databases and, in general, are measured with a reasonable degree of fidelity. Hence, we believe the best way to advance data integration in this area is to concentrate first on actors and geography.

Finally, in our original NSF proposal we had conceived of data integration as a top-down process in which we would specify a priori the exact ways that the data would be modeled and documented. For example, to integrate data on violent groups, we would need agreement on the universe of violent groups, the geopolitical boundaries of the countries in which they operate, the number of violent events attributed to them, and so forth. Again, after interacting extensively with experts, especially computer scientists, we are convinced that the best way forward is instead a Linked Data (LD) approach. Most generally, LD describes a method of publishing structured data so that it can be interlinked, connected, and queried. The structure of LD emerges from the bottom up, accommodating the ever-evolving nature of the data. An LD approach will allow us to handle unanticipated changes to data points and to capture the complex and evolving relationships between them.
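As a rough illustration of the Linked Data idea, the Python sketch below stores facts as subject-predicate-object triples and answers questions by following links. Plain tuples stand in for RDF triples here; the `ex:` identifiers, predicate names, and helper functions are hypothetical, and a real deployment would presumably use an RDF store and a query language such as SPARQL rather than this toy code.

```python
# Toy triple store: every fact is a (subject, predicate, object) tuple.
# All identifiers and predicates below are invented for illustration.
triples = {
    ("ex:group/1", "ex:name", "Example Group"),
    ("ex:group/1", "ex:operatesIn", "ex:place/10"),
    ("ex:place/10", "ex:name", "Example Province"),
    ("ex:place/10", "ex:within", "ex:place/1"),
    ("ex:place/1", "ex:name", "Example Country"),
}


def match(store, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def objects(store, s, p):
    """Return the sorted objects of all triples with subject s and predicate p."""
    return sorted(t[2] for t in match(store, s=s, p=p))


def containing_places(store, place):
    """Follow ex:within links upward to find every place that contains `place`."""
    found, frontier = [], [place]
    while frontier:
        current = frontier.pop()
        for parent in objects(store, current, "ex:within"):
            if parent not in found:
                found.append(parent)
                frontier.append(parent)
    return found


# A new kind of relationship can be added later without redesigning a schema.
triples.add(("ex:group/1", "ex:alliedWith", "ex:group/2"))

for place in objects(triples, "ex:group/1", "ex:operatesIn"):
    print(place, "is within", containing_places(triples, place))
```

The point of the sketch is that no table layout is fixed in advance: the alliance triple is added after the fact, and existing queries keep working unchanged.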