Influential Nodes in Worldwide Terror Networks: Centrality + Improved Graphics

I’ve improved the presentation of my network model for global terrorist collaborations. You can take a look at the code on my github, and definitely follow the link to view the network in full.

CLICK HERE TO SEE THE FULL NETWORK.

[Figure: the full collaboration network]

Please note that I replaced the node IDs for "the" Taliban (T), Boko Haram (BH), ISIL/ISIS (IS), Hamas (H), and "the" Al-Qaeda (the one without a regional modifier in its name) (AQ) with these initials, so that they can be easily pinpointed on the graph. You'll probably want to open the lengthy node key in another window.

A few notes:

Criteria for inclusion. Please refer to my previous post.

Node Clean-up. I got rid of the nodes "Unknown", "Individual" (meaning a non-organization), and "Other", which had escaped my attention and unduly linked some pairs of organizations as having one degree of separation (e.g. if both Group A and Group B collaborated with persons who were never identified, that doesn't mean they collaborated with the SAME person!). I'm also noticing some nodes here and there with basically the same problem, such as "Palestinians"; that is not an organization. I will return to these sorts of nodes and remove them on a case-by-case basis.

Community Detection. I used the “fast greedy” community detection algorithm to assign and color the communities. Here is a comparison of community detection algorithms for networks with various properties. Before executing this algorithm, I combined any multiple edges between a pair of nodes into a single weighted edge, and got rid of loops (since “collaboration with oneself” is not what I was intending to portray in this model).
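A minimal sketch of these steps in igraph for R, assuming the graph object is named g (a hypothetical name):

```r
library(igraph)

# Collapse parallel edges into single weighted edges and drop self-loops;
# fast greedy requires a simple graph.
E(g)$weight <- 1
g <- simplify(g, remove.multiple = TRUE, remove.loops = TRUE,
              edge.attr.comb = list(weight = "sum"))

# Fast greedy modularity optimization.
comms <- cluster_fast_greedy(g, weights = E(g)$weight)
membership(comms)   # community assignment for each node
plot(comms, g)      # color nodes by community
```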

Let's take a look at the output given by R. Upon inspection, these groupings seem to make sense; the organizations seem plausibly affiliated and frequently refer to the same cultures, regions, or ideologies. Some of the names could use a bit of clarification (for example, "Dissident Republicans" refers to breakaways from the IRA toward the end of the Northern Ireland conflict) or expansion/consolidation. Note that the numbers to the left of the group members are not the node IDs that appear in the rainbow graph later, but rather numberings within the communities (only the first number is shown per line of community members).

[Figure: community detection output from R]

SEE COMMUNITY CLUSTERS HERE

Cliques. The largest cliques (complete subgraphs) were revealed as the following; a code sketch follows the list.

Clique 1. Bangsamoro Islamic Freedom Movement (BIFM), New People's Army (NPA), Moro National Liberation Front (MNLF), Moro Islamic Liberation Front (MILF), Abu Sayyaf Group (ASG)

Clique 2. Popular Resistance Committees, Popular Front for the Liberation of Palestine (PFLP), Hamas, al-Aqsa Martyrs Brigade, Democratic Front for the Liberation of Palestine (DFLP)

Clique 3. Popular Resistance Committees, Popular Front for the Liberation of Palestine (PFLP), Hamas, al-Aqsa Martyrs Brigade, Palestinian Islamic Jihad (PIJ)
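A sketch of how the largest cliques can be extracted with igraph, again assuming the graph g:

```r
library(igraph)

# All maximum cliques of the undirected collaboration graph.
cliques <- largest_cliques(as.undirected(g))

# Show the member names of each clique.
lapply(cliques, function(clq) V(g)$name[clq])
```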

Centrality. I wanted to know how “influential” each node was. Of course, centrality is not the only way to measure this, especially in a case like the GTD where we have so much other information, such as victim counts. Even going on centrality, there are several centrality measure options in igraph for R; I went with eigencentrality. To quote from the manual:

“Eigenvector centrality scores correspond to the values of the first eigenvector of the graph adjacency matrix; these scores may, in turn, be interpreted as arising from a reciprocal process in which the centrality of each actor is proportional to the sum of the centralities of those actors to whom he or she is connected. In general, vertices with high eigenvector centralities are those which are connected to many other vertices which are, in turn, connected to many others (and so on).”

[Figure: eigencentrality output from R]

The "scale" option normalizes the scores so that the maximum is 1.
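A sketch of the centrality computation, continuing with the same graph g:

```r
library(igraph)

# Eigenvector centrality; scale = TRUE normalizes the top score to 1.
ec <- eigen_centrality(g, scale = TRUE, weights = E(g)$weight)

# Nodes sorted by decreasing eigencentrality.
sort(ec$vector, decreasing = TRUE)
```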

Nodes Sorted by Eigencentrality (Decreasing) + Commentary:

de Bruijn Graphs, etc. (co-authored)

Had the pleasure of putting this survey paper together alongside Camille Scott and Luiz Irber for Raissa D'Souza's Network Theory class in Spring 2016. It's about the network-theoretic aspects and applications of genomic data, with a bit of a history lesson tied in. The data comprised all invertebrate and mammalian genomes available on NCBI, a whopping 84 GB. Luckily my co-authors had access to "supercomputers". Please don't be intimidated by the wall of text; I started this project with zero knowledge of genomics (thanks Camille and Luiz!) and co-wrote with a similar audience in mind. All of the graphics are by Ms. Scott.


Feature Extraction on Global Terror Events

 

The GTD is incredible.

The GTD is an index of terrorist and suspected terrorist events from 1970 to 2014, compiled by the University of Maryland for the U.S. Department of Homeland Security. The documentation for the project can be found at [4]. It contains over 100,000 events with no geographical restriction.

From the source material:

"The original set of incidents that comprise the GTD occurred between 1970 and 1997 and were collected by the Pinkerton Global Intelligence Service (PGIS), a private security agency. After START completed digitizing these handwritten records in 2005, we collaborated with the Center for Terrorism and Intelligence Studies (CETIS) to continue data collection beyond 1997 and expand the scope of the information recorded for each attack. CETIS collected GTD data for terrorist attacks that occurred from January 1998 through March 2008, after which ongoing data collection transitioned to the Institute for the Study of Violent Groups (ISVG). ISVG continued as the primary collector of data on attacks that occurred from April 2008 through October 2011. […] These categories include, quote, 'incident date, incident location, incident information, attack information, target/victim information, perpetrator information, perpetrator statistics, claims of responsibility, weapon information, casualty information, consequences, kidnapping/hostage taking information, additional information, and source information,' as well as an internal indexing system. […] The GTD defines a terrorist attack as the threatened or actual use of illegal force and violence by a non-state actor to attain a political, economic, religious, or social goal through fear, coercion, or intimidation." (More on their criteria in a moment.)

Take a look at the source codebook for yourself and enjoy the rich array of data that this project has! I tried to compile a small subset of this information myself once upon a time and it was a ton of work, so props to these people for stepping up.


Transforming Qualitative Data Into Quantitative Data

I originally selected this data for a class project, part of which was concerned with dimension reduction. Most dimension reduction and feature extraction algorithms seem designed with continuous, or at least ordered, data in mind. For this reason I sought to convert the GTD data from categorical strings into numbers. Goals:

1. Make the data easier to dimension-reduce.
2. Interpret the information in the GTD in a way that allows internal comparison, despite the disparate value ranges and types the various features take.
3. Identify characteristics that predict other characteristics in an arbitrary or restricted-domain terrorist incident.

I transformed the data as follows.

Some of the data was simple enough that I was able to directly convert it into an ordered numerical scale. I converted the "target types" (the intended victims of the acts) by classifying them on a scale from civilian to state targets:

1 is "most civilian", or an infrastructural target intended to affect daily living (the GTD's categories of private citizens/property, journalists & media, educational institutions, abortion-related, business, tourists, food/water supply, telecommunication, utilities, and transportation);

2 is semi-state, or other loosely organized or less-empowered political organizations (airports & aircraft, maritime, NGO, other, religious figures/institutions, terrorists/non-state militias, and violent political parties);

3 is "most statelike" (general government, police, military, diplomatic government, and unknown).

For the ambiguous categories (other, unknown, etc.) I looked at what was actually in each set to determine its placement. Let's take a look at the GTD's criteria for inclusion while we're at it:

[Figure: the GTD's criteria for inclusion]
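Back to the target-type scale: here is a hypothetical sketch of that recoding in R. The data frame and column name (gtd, targtype1_txt) follow the GTD codebook's conventions, but the exact category strings should be checked against the codebook.

```r
# Hypothetical recoding of GTD target types onto the 1-3 civilian-to-state scale.
civilian <- c("Private Citizens & Property", "Journalists & Media",
              "Educational Institution", "Abortion Related", "Business",
              "Tourists", "Food or Water Supply", "Telecommunication",
              "Utilities", "Transportation")
semi_state <- c("Airports & Aircraft", "Maritime", "NGO", "Other",
                "Religious Figures/Institutions", "Terrorists/Non-State Militia",
                "Violent Political Party")
# Everything else (general government, police, military, diplomatic, unknown) -> 3.
gtd$target_scale <- ifelse(gtd$targtype1_txt %in% civilian, 1,
                    ifelse(gtd$targtype1_txt %in% semi_state, 2, 3))
```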

At this point in my exploration I wasn't sure which techniques I would wind up using, but I wanted to prepare the data to be as malleable as possible without losing much. If I decided to use compressive sensing techniques to reduce the dimensionality of the data, a sparse matrix representation would be preferable. Sparse intuitively means that for every feature of an incident/entry, the expected value is near zero due to a high number of zero instances of this feature across entries. The GTD gave me a lot of categorical variables that take, say, N values on the dataset, so I reasoned that these might best be decomposed into N features that each take a binary value. For example, the original variables "weapon type 1", "weapon type 2", and "weapon type 3" were converted into a column answering: was there a firearm involved? (y/n), i.e. a binary-valued "weapfirearm" column. I made separate binary features for each possible weapon type. Chemical, biological, nuclear, and radiological weapons occurred so seldom that I threw them away as features. I also made binary columns for whether hostages were taken, whether the attack was coordinated between multiple parties, whether the perpetrator is known or unknown, and whether the perpetrators were from the region in which they committed the crimes. Regions were broken down by the GTD into simple cultural regions like the Middle East and North Africa, South Asia, Europe, and so on.

[Figure: the binary feature columns]
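A minimal sketch of this binary decomposition, assuming the data frame gtd and GTD-style weapon-type columns (names per the codebook; verify against your copy):

```r
# Collapse the three weapon-type text columns into one binary indicator
# per weapon category ("one-hot" style). Shown here for firearms.
weapon_cols <- c("weaptype1_txt", "weaptype2_txt", "weaptype3_txt")
gtd$weapfirearm <- as.integer(
  rowSums(gtd[, weapon_cols] == "Firearms", na.rm = TRUE) > 0
)
# Repeat for each retained weapon category; rare categories (chemical,
# biological, nuclear, radiological) were dropped as features.
```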

 

Preliminary Reduction: How Much Data, Exactly?

I worked with 141,967 incidents (before filtration), each having over 50 numerical and categorical variables, some null/missing. To deal with the missing data, depending on the type, I either threw away the entire incident row or imputed values by averaging within the same incident in a way that wouldn't distort the overall statistics. Statistical concerns sometimes necessitated reframing the way I conceived the variables.

Geographical data is abundantly provided by the GTD. As well as the regional classifications, we have access not only to the country, state, province, and/or city, but even the exact longitude and latitude of the vast majority of the events. In fact, the presence of this information is what persuaded me to wrangle the entire dataset rather than sticking to the smaller file of only the events that occurred in 1993 (the GTD sets that year aside, with its own documentation, due to a loss-of-data incident in the archives). I first tried and failed to open the data (.5 MB) in R. After a bit of looking around online, I concluded that the first thing I needed to do was convert the xlsx file to a csv file via e.g. Python, and then it would be advisable to throw away any data I would definitely not be using (i.e. make a new file with a refined dataset). I had to put my grown-up pants on and learn to selectively read and manipulate dataframes without opening the whole file in Excel.
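A sketch of the selective-read step in R, using data.table (file and column names hypothetical):

```r
library(data.table)

# After converting the xlsx to csv (e.g. with Python/pandas), read only
# the columns actually needed instead of loading the whole file.
keep <- c("iyear", "country_txt", "region_txt", "latitude", "longitude",
          "attacktype1_txt", "targtype1_txt", "weaptype1_txt", "nkill")
gtd <- fread("gtd.csv", select = keep, data.table = FALSE)

# Save the refined dataset so later runs skip the full file entirely.
fwrite(gtd, "gtd_refined.csv")
```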

After all this sparse-data wrangling, here is where it would be appropriate to subset the sparse columns (event features) and use the Johnson-Lindenstrauss transform (JLT) to reduce dimensionality. I didn't actually wind up doing that, partly because, after the alterations I mentioned, the data management turned out to be not that bad in terms of what my computer could handle.

Some Preliminary Results

 

The first thing I wanted to check was whether terrorism consists primarily of isolated incidents by unaffiliated actors, or whether it is the primary mode of warfare for many major organizations. I used R for this: there are 440 nodes and 1156 edges. Note that many incidents involved more than two actors. The big components are who you might guess: ISIL, various Talibans, and al-Qaeda. FARC was also a high-degree actor. I don't know whether some of these supposedly different organizations are just subsidiaries of their connected organizations, or what. I'm playing with a Gephi representation right now and I'll come back with some labeling so you all can see what's what. I'll tag some other famous groups like the ALF and ELF.

[Figure: the network]

Below: constructing the graph for igraph and Gephi.

[Figure: perps, the perpetrator list]
[Figure: Edges, the edge list]
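The screenshots above show the actual construction; as a minimal sketch of the same idea, assuming a two-column data frame of collaborating perpetrator names called edges (a hypothetical name):

```r
library(igraph)

# One row of `edges` per collaborating pair of groups on an incident.
g <- graph_from_data_frame(edges, directed = FALSE)

vcount(g)   # 440 nodes
ecount(g)   # 1156 edges

# Export to GraphML so the same graph can be styled in Gephi.
write_graph(g, "terror_network.graphml", format = "graphml")
```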


Feature Extraction: PCA and K-Means Clustering

I got into PCA by watching this video demo. Really, the video is good enough, and uses a clear enough example, that I am delegating the explanation of what PCA is to it. But I'll try to explain it here too, just in case.

PCA is data-agnostic. There do exist "spatial PCAs" tailored to dimension reduction of "big" data while maintaining spatial correlations; see [2]. There is also precedent for factor extraction on census-type data; see [1]. For PCA on discrete data, see [3]. That's all stuff I still have to do, especially with the geographical data I'm eager to use.

I proceeded to attempt a less-tailored PCA, as well as k-means clustering, on the dataset to see what the archetypal incidents would be; that is, are there meaningful "eigenincidents" that represent archetypes of terrorism? I wondered whether there would be a significant correlation between geographical coordinates and method, varying with culture and resources. For example, we might find that one canonical type of incident takes place in Location X and involves firearms, hostages, and multiparty coordination, whereas another might be the suicide bombing of an individual in a public marketplace in Location Y.

Due to the differing scales of the data, it was particularly necessary to scale and center the data before proceeding with PCA. All that binary data wasn't great for this "naive" PCA either, so I had to stash it for later. So let's take a look at what I got when I ran PCA on the prepared data in R.

[Figure: the R code used for the PCA]

Using some code from Thiago G. Martins’s data science blog.
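For reference, the core of that approach is a single centered-and-scaled prcomp call; a minimal sketch, with gtd_numeric standing in for the prepared numeric data frame (a hypothetical name):

```r
# Center and scale each feature, then run PCA.
pca <- prcomp(gtd_numeric, center = TRUE, scale. = TRUE)

pca$sdev       # standard deviation of each principal component
pca$rotation   # loadings: features (rows) by components (columns)
```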

[Figure: prcomp output in R: feature loadings by principal component, with standard deviations at top]
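A quick way to pull out the dominant features of a component, continuing the sketch above:

```r
# Sort PC1's loadings by absolute value, keeping their signs visible.
pc1 <- pca$rotation[, 1]
pc1[order(abs(pc1), decreasing = TRUE)]
```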

To read what the PCA is telling us, we examine which features' (rows') absolute values are largest for a given principal component (column). Note the list of standard deviations at the top. The algorithm attempts to impose a natural delineation of the clusters of correlation, given by the different PCs that appear. What the principal components really are is this: each PC is the combination of features that maximizes the remaining variance, and the table shows each feature's weight in the (Feature, PC#) spot, with the standard deviations (not proportions of total variance) at the top. We then exclude all of the variance just "used" in creating the latest PC and iterate, N times in total. The resulting vectors form a linearly uncorrelated, orthogonal basis of the feature space.

It appears that the strongest correlation in the first principal component is between the event being an explosion/bombing and the use of explosive weapons. Okay, at least this is a good sign that our calculations are working, because that correlation is practically tautological. When that is the case, the attack is less likely to be an assault (attackassault ≈ -0.34) or involve firearms (weapfirearms ≈ -0.42). I had hoped for something more insightful, but it's a first run. I will experiment with excluding subsets of features from the PCA process. Let's take a look at how much variance each of these components accounts for, with all of the features included.

[Figure: variance by principal component] The mark "1" denotes PC1, and so on.
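The same information can be computed directly from the sketch's pca object:

```r
# Proportion of total variance explained by each component, and cumulative.
var_prop <- pca$sdev^2 / sum(pca$sdev^2)
round(var_prop, 3)
cumsum(var_prop)   # how much the first several components cover
```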

Approximately the first eight to ten principal components account for most of the variance. The first component is the dominant one. The second through fourth components could be considered the next "batch", and the fifth through [arguably final] components give almost all of the remaining variance in the dataset. Let's look at other representations.

 

[Figure: another representation of the variance by component]

[Figure: summary statistics]

We could also subset the data to compare variables that we suspect are correlated.

But that is way too many features for me to try to visualize in simulated 3D.

Below, I restricted the features to year, whether the attack was a suicide attack (those are usually bombings), and target type, in that order. This data was adjusted for individual variation of the variables before processing.

[Figure: PCA on year, suicide attack, and target type]

It appears that target type (remember, higher values are more state-like targets) is inversely correlated with the suicidality of the method: that is, as the target becomes more public, the chance of a suicidal terror act increases. This makes sense, because suicide missions create a stir and disarm the public. The following figure illustrates how these three principal components constitute the overwhelming majority of the variance.

[Figure: variance by component for the restricted feature set]

 

[Figure: prediction code]

Still following the Martins tutorial, we use MATLAB to simulate "predicting" the tail end of our own data, the 113,117th and 113,118th incidents. Since the data is in chronological order, it only makes sense to fix the year and just get predictions for latitude and longitude. The output is in the transformed, variance-scaled space, so I still need to translate it back to coordinates and compare to the actual last incidents.
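In R rather than MATLAB, the same idea can be sketched with predict() on a prcomp object; the dataset name is hypothetical and the row indices come from the text above:

```r
# Project the held-out final rows into the fitted PC space,
# keeping only the first three components.
held_out <- gtd_numeric[c(113117, 113118), ]
scores <- predict(pca, newdata = held_out)[, 1:3, drop = FALSE]

# Reconstruct approximate original values from those components,
# then undo the scaling and centering that prcomp applied.
approx_orig <- scores %*% t(pca$rotation[, 1:3])
approx_orig <- sweep(approx_orig, 2, pca$scale, "*")
approx_orig <- sweep(approx_orig, 2, pca$center, "+")
```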

K-Means Clustering

 

Finding the ideal k for a k-means clustering is "the big question" in the procedure. To get a heuristic sense of what works for this dataset, we can experiment with various k. In this case it seems that 3 is better than 5: look how feeble some of the clusters are when we choose k = 5. Compare these images of two clustering runs using MATLAB's cosine distance function.

[Figure: k = 5 clustering, cosine distance]
[Figure: k = 3 clustering, cosine distance]
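R's kmeans uses Euclidean distance (MATLAB's 'cosine' option has no base-R equivalent), but the k = 3 vs. k = 5 comparison can be sketched the same way, with silhouettes as the yardstick; scaled_data is a hypothetical name for the standardized feature matrix:

```r
library(cluster)  # for silhouette()

set.seed(1)
idx <- sample(nrow(scaled_data), 5000)  # a full dist() on 100k+ rows is too big
for (k in c(3, 5)) {
  km <- kmeans(scaled_data[idx, ], centers = k, nstart = 25)
  sil <- silhouette(km$cluster, dist(scaled_data[idx, ]))
  cat("k =", k, "mean silhouette width:", mean(sil[, "sil_width"]), "\n")
}
```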

The following are k-means clusterings on the subset of year, whether the attack was a suicide, and the target-type scale. I used Euclidean distance.

[Figure: k-means clustering, Euclidean distance]

[Figure: cluster correlations]

Making the silhouettes in MATLAB:

[Figure: MATLAB silhouette code sample]

Lifted directly from the good MathWorks documentation for k-means clustering.

Back to the Lone Wolf Thing

Around the time of the 2001 attacks on the WTC, an increase in suicide bombing attacks was already under way, and then in 2010 the violence took another drastic climb. I would speculatively infer that the high profiles of these events inspired many small-cell copycats, but high-profile events seemed to occur only when a local upward trend was already underway. The fever-chart graph is courtesy of the search feature on the University of Maryland's GTD page [4]. I don't know what accounts for the drop after 2007.

[Figure: GTD search fever chart]

I'm going to mess around with the estimable Peter Langman's rampage-shooter data soon and compare it to what I got here. Excited for that. That's all for right now.

[Figure: references [1]-[4]]
By the way, you want to know the list of actors in the GTD database, right?

Disclaimer: I am utterly unfamiliar with the vast majority of these organizations. I'm not commenting on anyone's politics or the status of their organization, since I didn't collect this data myself. You can check out the methodology at the source site.