A few years ago a friend of a friend introduced me to a data set known as the Global Database of Events, Language, and Tone (GDELT).
GDELT is an open source event data repository based on news articles and currently contains over a quarter-billion event records starting from the year 1979. The database is updated daily through automated coding and includes over 300 types of events.
GDELT has recently gained the attention of researchers, particularly in the domain of social unrest and conflict prediction. Researchers who have used GDELT include:
- Yonamine (2013), who predicted future levels of mass violence in Afghanistan;
- Qiao (2017), who predicted social unrest in five Southeast Asian countries; and
- Smith et al. (2017), who predicted material conflict in Afghanistan.
These studies established a baseline for event prediction using GDELT and suggest that GDELT data contains useful precursor indicators that reveal the causes or development of future events.
For the past year I have been working with this data set as part of my master's thesis. I decided to take on the problem “Predicting social unrest events in South Africa using recurrent neural networks (RNNs)”. In this blog post I would like to introduce this data by:
- Demonstrating how I have used it in my research;
- Showing how you can access it using an R API; and
- Walking through some simple data analysis to get started.
The goal is to gain the interest of scholars and perhaps encourage them to think of other creative ways GDELT data may be useful to them.
Example Use Case: Predicting social unrest events in South Africa using recurrent neural networks
I presented my research at the Deep Learning Indaba 2019. The Deep Learning Indaba brings together researchers from across the world to discuss some of the toughest problems Africa is facing. I was very honored to be given the opportunity to present my research in front of hundreds of people. Below I have provided the video and the slides, which give an overview of how I used GDELT data to predict social unrest events in South Africa.
Note: I am still working on the thesis and therefore the final results may look slightly different.
GDELT R API
GDELT data can be downloaded using the R package “GDELTtools”. Note that this package downloads the raw event files from the GDELT website into a local folder rather than querying Google BigQuery. Getting the data from BigQuery is the recommended method, especially if you plan on collecting large volumes of it. However, the code below works well for quick analysis.
#load libraries
library(GDELTtools)
library(data.table)

#apply location filter South Africa
test.filter <- list(ActionGeo_CountryCode = "SF")

#query the data from 2019-01-01 to 2019-03-01
test.results <- GetGDELT(start.date = "2019-01-01",
                         end.date = "2019-03-01",
                         filter = test.filter,
                         local.folder = getwd())

#convert to data.table format for further analysis
mydat <- data.table(test.results)
The first plot is a count of all collected events, filtered to only include the EventRootCode “14”. The EventRootCode defines the highest-level category the event code falls under in CAMEO format. For example, EventCode 1452 (engage in violent protest for policy change) has a root CAMEO code of 14 (PROTEST). Details of all CAMEO codes and their definitions are available here. In the plot below we can see that, over this period, demonstrations are the most frequently reported protest events.
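To make the relationship between EventCode and EventRootCode concrete, here is a minimal sketch on a handful of hand-picked CAMEO codes. It assumes the codes are stored as character strings (so leading zeros survive) and that the root code is simply the first two characters of the event code. GDELT already ships an EventRootCode column, so this is only an illustration of the hierarchy, not something you need to compute yourself.

```r
# Toy CAMEO event codes, kept as character strings so leading zeros are preserved
event_codes <- c("1452", "141", "140", "0231", "190")

# Assumption: the root code is the first two characters of the event code
root_codes <- substr(event_codes, 1, 2)

# Flag the events that fall under root category 14 (PROTEST)
is_protest <- root_codes == "14"

print(data.frame(EventCode = event_codes,
                 EventRootCode = root_codes,
                 is_protest = is_protest))
```

Comparing against the string "14" rather than the number 14 avoids relying on implicit type coercion when the column is read as character.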
The next plot is an aggregation of all protest events over time. Here, we can see how variable the event data is and perhaps why neural networks outperformed traditional approaches. See the code snippet below to replicate the plots.
library(dplyr)
library(ggplot2)

#get event description data
url <- "https://www.gdeltproject.org/data/lookups/CAMEO.eventcodes.txt"
download.file(url, "eventcodes.txt")
eventcodes <- data.table(read.table("eventcodes.txt", sep = "\t", header = TRUE, colClasses = "character"))

#rename column
setnames(eventcodes, old = "CAMEOEVENTCODE", new = "EventCode")

#transform date variable
mydat$SQLDATE <- as.character(mydat$SQLDATE)
mydat[, Date := readr::parse_datetime(SQLDATE, "%Y%m%d")]
mydat[, Date := as.Date(Date)]
mydat[, monthYear := as.numeric(format(Date, '%Y%m'))]

#get count of events
event_counts <- mydat %>%
  group_by(Year, EventCode) %>%
  filter(EventRootCode == 14) %>%
  summarize(count = n()) %>%
  left_join(eventcodes, by = "EventCode") %>%
  arrange(Year, count)

#plot count of events
ggplot(event_counts, aes(fill = EVENTDESCRIPTION, y = count, x = EVENTDESCRIPTION)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  theme(legend.position = "none") +
  coord_flip()

#time series plot
event_counts <- mydat %>%
  group_by(Date, EventCode) %>%
  filter(EventRootCode == 14) %>%
  summarize(count = n()) %>%
  left_join(eventcodes, by = "EventCode") %>%
  arrange(Date, count)

myevent_sum <- event_counts %>%
  group_by(Date) %>%
  summarize(total_events = sum(count, na.rm = TRUE)) %>%
  arrange(Date)

ggplot(myevent_sum, aes(x = as.Date(Date), y = total_events)) +
  geom_line(color = "#69C4F2", size = 1.5) +
  theme_minimal() +
  xlab(label = "Day")
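One thing to keep in mind with a per-day aggregation like this is that days with no reported protest events simply do not appear in the result, which can matter if the daily series is later fed to a model. The sketch below shows one possible way to fill those gaps with zeros. It uses only base R and a tiny synthetic data frame (not real GDELT data); the column names SQLDATE and EventRootCode mimic the GDELT columns used in this post, and total_events mirrors the name used in the time series code.

```r
# Synthetic stand-in for the GDELT columns used in this post (not real data)
toy <- data.frame(
  SQLDATE       = c(20190101, 20190101, 20190102, 20190102, 20190102, 20190104),
  EventRootCode = c("14", "14", "14", "19", "14", "14"),
  stringsAsFactors = FALSE
)

# Parse the YYYYMMDD value into a Date
toy$Date <- as.Date(as.character(toy$SQLDATE), format = "%Y%m%d")

# Keep protest events (root code 14) and count them per day
protests <- toy[toy$EventRootCode == "14", ]
protests$n <- 1L
daily <- aggregate(n ~ Date, data = protests, FUN = sum)
names(daily)[2] <- "total_events"

# Build the full calendar of days and fill days without events with zero
all_days <- data.frame(Date = seq(min(daily$Date), max(daily$Date), by = "day"))
daily_full <- merge(all_days, daily, by = "Date", all.x = TRUE)
daily_full$total_events[is.na(daily_full$total_events)] <- 0L

print(daily_full)
```

In this toy example 3 January has no protest events, so it is missing from daily but appears in daily_full with a count of zero.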
“It isn’t enough to talk about peace. One must believe in it. And it isn’t enough to believe in it. One must work at it.” - Eleanor Roosevelt