Tag Archives: gephi

The Graphs Of Football, Basketball, Ice Hockey, Baseball and Soccer

26 Jul

This time: SPORTS. My goal here is to see if I can create a network of famous sports people based on only two pieces of information: their name and their team. I’m going to need your help on this one so keep your wits about you as you read on.

The sports I chose for this study were:

  1. American Football
  2. Basketball
  3. Ice Hockey
  4. Baseball
  5. Soccer (Football)

I simply went with the sports which had the most data available on Freebase’s sports section. Just to be clear, I will be calling Association Football ‘Soccer’ to distinguish it from American Football. Sorry Europe.


To get the core dataset I simply queried Freebase for players who had an available team. I then had to think of a way to connect them. I figured players move around quite a bit over a given career, so I thought perhaps I could link both teams and players together based on their affiliations (i.e. two distinct but related graphs).

I settled on two types of graphs for each sport:

  1. The Player Graph: This consists of people only, e.g. if Diego Maradona played for FC Barcelona and Gary Lineker also played for FC Barcelona, then a connection can be drawn between the players: Maradona and Lineker.
  2. The Team Graph: This consists of teams only, e.g. if Diego Maradona played for FC Barcelona and Boca Juniors, then a connection can be drawn between the clubs: FC Barcelona and Boca Juniors.

In this way, the player map builds up a network of affiliation. Sure, the players might never have met or had any influence on one another, but their connection to the same club does tie them together via a common strand. Similarly for the club/team maps, clubs which have had similar players move between them will be more closely connected. Writing this just now, I haven’t yet seen the graph I’m about to create so it will be interesting to see if certain club allegiances come out of the woodwork. The player map should, like the TV Actor map, create sub-networks of actual teams within the larger network. We shall see…

Lastly, most of the people and teams will be American (I’m limited to what Freebase gave me!). Can American viewers please comment on any interesting features, particularly within Baseball, Basketball and Football! As Freebase’s data becomes more complete, these graphs will become more complete.


Again I decided to approach this one using matrices. Please see my post on Wikipedia personalities to see how the datasets are preprocessed. I did however have to design a brief program to connect the various teams and players together. It essentially involves a triple for loop (there are definitely better ways of doing this!). I chose Matlab since a lot of the code from previous posts can easily be copied across to treat new datasets. I’ve written a few functions now which help process Freebase’s somewhat annoying csv outputs – especially if I have to obtain them from the data dumps and not queries. If there is sufficient interest in how my program works, I can make it available. Most of my time was spent trying to understand Unicode and encoding formats to interpret non-English names. I actually learned quite a bit about script blocks and how symbols, East Asian scripts etc. are stored – it was something I had wondered about for a long time. You can find your Sunday afternoon readings on this here (intense) and here (less intense).

Anyway, my little program basically converts this:

Diego Maradona,Boca Juniors
David Beckham,LA Galaxy
David Beckham,A.C. Milan
David Beckham,Preston North End F.C.
David Beckham,Real Madrid
David Beckham,Manchester United F.C.
Gary Lineker,England national football team
Gary Lineker,Leicester City F.C.
etc. etc.

into two matrices which connect players with players and teams with teams. I really should get better at Perl for this sort of stuff but alas… I haven’t the time.
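For anyone who would rather not touch Matlab, here is a minimal Python sketch of the same conversion. It is not the program described above: the function name `build_graphs` is made up for illustration, and instead of a triple for loop it groups players by team (and teams by player) before linking every pair inside each group.

```python
from collections import defaultdict
from itertools import combinations

def build_graphs(pairs):
    """Turn (player, team) pairs into player-player and team-team 0/1 matrices."""
    teams_of = defaultdict(set)    # player -> teams they appeared for
    players_of = defaultdict(set)  # team -> players who appeared for it
    for player, team in pairs:
        teams_of[player].add(team)
        players_of[team].add(player)

    players = sorted(teams_of)
    teams = sorted(players_of)
    p_idx = {p: i for i, p in enumerate(players)}
    t_idx = {t: i for i, t in enumerate(teams)}

    player_graph = [[0] * len(players) for _ in players]
    team_graph = [[0] * len(teams) for _ in teams]

    # Any two players who share a team get a symmetric link.
    for roster in players_of.values():
        for a, b in combinations(sorted(roster), 2):
            player_graph[p_idx[a]][p_idx[b]] = 1
            player_graph[p_idx[b]][p_idx[a]] = 1

    # Any two teams that share a player get a symmetric link.
    for career in teams_of.values():
        for a, b in combinations(sorted(career), 2):
            team_graph[t_idx[a]][t_idx[b]] = 1
            team_graph[t_idx[b]][t_idx[a]] = 1

    return players, teams, player_graph, team_graph

pairs = [
    ("Diego Maradona", "Boca Juniors"),
    ("Diego Maradona", "FC Barcelona"),
    ("Gary Lineker", "FC Barcelona"),
    ("Gary Lineker", "Leicester City F.C."),
    ("David Beckham", "Real Madrid"),
]
players, teams, player_graph, team_graph = build_graphs(pairs)
```

With this toy input Maradona and Lineker end up linked through FC Barcelona, while Beckham stays isolated because nobody else in the list shares a club with him.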

The Graphs

Since I’ve done a number of sports I’ve broken the next section down into the various sports I selected. In each category you’ll find two graphs. One connects players with players and the other connects teams with teams. Rather than commenting on each individual graph in turn, there are a few general observations I’ve made (let me know if I’ve missed something in the comments section):

  • Clustered names usually will indicate an entire team. The more central a player name the more likely that person has been in a range of clubs with no major allegiances. People closer to only two or three isolated clusters of people will likely have strong ties to only the neighbouring clubs. The bigger the node, the more people they have played with over the course of their career. Keep in mind many players are still currently active and so their network is still being formed.
  • Clustered clubs are a bit more interesting because they bring out some underlying structure. See if you can notice certain club types sticking together. In the more international sports the various colours will represent individual countries e.g. in soccer, English and German football teams cluster together because the players moving between the ranks usually belong to the country the club originates from.
  • Knowledge of the players, teams and how they are related will probably allow you to get more out of the graph than I did. My sports knowledge is mediocre at best. Let me know if there is anything peculiar/interesting in the comments.
  • There may be a few names which look strange. Whenever you see a country, say ‘Italy’, that refers to the national team of that sport. I cut the labels down so it was more manageable. Hopefully they are self-explanatory. That reminds me, please let me know if there are duplicates of anything.
  • Lastly, the size of the node in every graph has nothing whatsoever to do with the strength of the team or player in their respective sports!

Without further ado, here is my latest batch of graphs… remember to click on the high-res version if you would like to explore the graphs properly!

American Football

The Graph Of American Football Players (zoom hi-res version)

The Graph Of American Football Teams (zoom hi-res version)


Basketball

The Graph Of Basketball Players (zoom high-res version)

The Graph Of Basketball Teams (zoom high-res version)

Ice Hockey

The Graph Of Ice Hockey Players (zoom high-res version)

The Graph Of Ice Hockey Teams (zoom high-res version)


Baseball

The Graph Of Baseball Players (zoom high-res version)

The Graph Of Baseball Teams (zoom high-res version)

Football (Soccer)

The Graph Of Soccer Players (zoom high-res version)

The Graph Of Soccer Teams (zoom high-res version)


As with all of my graphs, there are a few important things to keep in mind:

  1. The datasets are incomplete. Many of your favourite players and teams could very likely be missing from the graph (especially non-Americans) – I’m sorry – I can’t do anything about that. This incompleteness will also lead to slight confusion as to what the various sized nodes actually mean. For the player graphs, the bigger the node, the more connections that person has to other people within the network. This essentially just means that the biggest nodes have shared the largest number of clubs with the largest number of people. Similarly for the club maps, the larger nodes are simply clubs which have the greatest reach in terms of the number of connections to other clubs through their current or previous players.
  2. The network is simply an exploration in connecting information. If you want to read facts or obtain clear-cut answers to your questions about sports players and teams: go read Wikipedia or the original Freebase entries. This work, as with most of my others, straddles a ground somewhere between entertainment and information. Where these networks fall, I do not know – I am at the mercy of you, the reader.

Keep the suggestions coming in – I’ve got about a billion projects at various stages of production. Thank you: they have all been great. My other computer has been thinking for two days to create the dataset for one of the next graphs so stay tuned, friend.

Until next time… stay crunchy.


The Graph Of TV Actors

20 Jul

This time I wanted to see the relationship between TV actors. I’m not especially interested in TV series but I am quite interested in how they work together. The fact that many actors have been in a number of TV series creates a great network of information.


  1. I first went to Freebase and tried to download every available actor along with their corresponding TV shows. Unfortunately, Freebase had over 57,000 nodes, which prevented me from running the query I wanted. I decided to do it manually.
  2. Freebase has regular data dumps where they store the entire networks on an FTP server. I simply navigated to where the TV actors were and downloaded the appropriate file.
  3. I then imported these into Matlab and ran a script which connected every actor with every other actor based on the TV shows they had been in. Once this had been run I then exported the list into Excel, did some formatting and produced the required input for Gephi.
  4. I exported these and then manually went around and added the labels for each of the TV series in Gimp. Let me know if any are wrong!
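Step 3 can be sketched in a few lines of Python. This is not the Matlab script or the Excel formatting described above, just an illustration under a few assumptions: the function names are invented, the actors and shows are placeholders, and the output is a plain `Source,Target,Weight` edge table where the weight counts how many shows two actors share.

```python
import csv
import io
from collections import Counter, defaultdict
from itertools import combinations

def actor_edges(credits):
    """Count the shows shared by every pair of actors."""
    cast = defaultdict(set)  # show -> actors who appeared in it
    for actor, show in credits:
        cast[show].add(actor)
    weights = Counter()
    for actors in cast.values():
        # Sorting keeps each pair in one fixed order, e.g. (A, B) never (B, A).
        for a, b in combinations(sorted(actors), 2):
            weights[(a, b)] += 1
    return weights

def to_csv(weights):
    """Render the weighted pairs as a CSV edge table."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (a, b), w in sorted(weights.items()):
        writer.writerow([a, b, w])
    return out.getvalue()

# Placeholder credits, not real Freebase data.
credits = [
    ("Actor A", "Show 1"),
    ("Actor B", "Show 1"),
    ("Actor A", "Show 2"),
    ("Actor B", "Show 2"),
    ("Actor C", "Show 2"),
]
edges = actor_edges(credits)
print(to_csv(edges))
```

Keeping a weight rather than a bare 0/1 link is a design choice of the sketch: actors who keep turning up in the same series get a heavier edge, which a layout can use to pull them closer together.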

The Graph:

The Graph Of TV Actors

Click here to zoom around.

As one would expect there are sub-networks within the entire graph. I’ve labelled to the best of my ability the TV series each of the sub-networks belongs to. Now obviously there is going to be some overlap and so there might be the odd actor who doesn’t belong to the neighbouring label. The majority of the network should however.

Some of the sub-networks include:

Gilmore Girls, Alias and Arrested Development

Saved By The Bell and Frasier

The Power Rangers

All of these actors have worked together in a number of TV series. Hence the mess.

As you might have noticed, the TV series here are reasonably old. This is probably a result of the TV actor information on Freebase being incomplete. It is growing at an incredible rate and so I don’t think it will be too long until more modern series appear on the graph.

I couldn’t label the central regions because they are so entangled. I’ll let you try and work out who belongs to what series on your own.


One could feasibly create a map for film actors also. I have downloaded the data but it is in a slightly more technical format which requires a more sophisticated program. Film actors would be much richer and have so much more structure which would be fascinating to see.

The same could be said of directors, producers, writers etc. so there really is no end to how many different types of networks you could create. Lastly, as an option, here is a poster version.

Anyway, just a short one today.

The Graph Of Ideas 2.0

20 Jul

First of all, thanks very much for the feedback you all gave me on my graph of ideas. I wasn’t quite aware of how many people are interested in this sort of stuff. I now have lots of great ideas for new projects which will keep me busy for a long while. I must say making the graphs is the easy part – it is obtaining the data which takes time. I’ve made a note of all of your suggestions and will try to create something out of them soon. If you haven’t already, you can submit an idea here. I read them all.


There were a great number of comments about my last graph and so I’ll try to answer the main questions here. I think many of them were from people who hadn’t actually read the post at all but went straight to the graphic with whatever catch line someone shared along with it. It was reposted at Gizmodo, Spiegel.de, Business Insider and FlowingData.com, all of whom omitted the very important caveats I listed in the post. Please read the original post for the full discussion. These were a few of the common themes to the criticisms I saw floating around the interwebs:

“It is way too biased towards Western ideas.”
Yes, see point one of the original blog post. I simply plotted what Wikipedia (dbpedia) gave me – of course it is biased, like any dataset and this was stated.

“Where are all of the musicians and artists?”
— See original blog post. Artists don’t have the available information in the Wiki info-boxes (well except some, see bottom left, green part). I hope to make a musician/artist graph soon!

“The title is very misleading.”
— The original post had an asterisk on the word ‘every’ which was meant to highlight the fact the graph had caveats. I didn’t anticipate people leaving this out when they shared it with their friends. I changed it to be simply ‘The Graph Of Ideas’. I’ll be more careful in future.

Now that that is out of the way, I’d like to present my latest work. I’ve broken this post down into two sections: the network and the method. In order to fully understand the network, I suggest you also read the method. If you have better things to be doing with your life, quickly check out the plot below, glance at the caveats and then move on – I’ll catch up with you soon enough.

Network: Graph Of Ideas vs. Graph Of Ideas 2.0

The first graph connected people via a single connection. That is to say, if Socrates influenced Plato and Plato influenced Aristotle then the following connections were made:

Socrates –> Plato –> Aristotle

Easy, right?

However, as I briefly mentioned in the previous post, each individual in time represents the sum of their ancestors. This means that Socrates should technically be linked to Aristotle too! Whether we like to think about it or not, Socrates’ contribution to our body of understanding of the world is embodied in the way we speak and interact on a daily basis. Sure there is some dilution, but Socrates’ philosophies are, for better or for worse, buried deep within you somewhere. This isn’t just true for Socrates either – it is true for everyone who has ever existed.

On September 5th, 1948, Jiddu Krishnamurti gave a public talk in Poona, India. In it he stated:

“You and I are not isolated; we are the result of the total process, the outcome of the whole human struggle, whether we live in India, Japan, or America. The sum total of humanity is you and me. Either we are conscious of that, or we are unconscious of it.”

Now click here.

Welcome back… so with all of this in mind I went ahead and made a little program which calculates these upstream connections which were missing in the first graph: hence the ‘2.0’.

Here is the resulting graph (~20% of the total ~4,200 nodes available):

The Graph Of Ideas 2.0 (connections not visible). Click here for dynamic zoom.

The most connected names cluster together.

Traditionally less prominent historical figures become larger.

The biggest/smallest names of the first Graph Of Ideas are now smaller/larger.

If you would like to see the full graph with the underlying connections click here. It is ~50MB so put the kettle on. BUT, if you would like to see it with an easy to use zoom (scroll) function click here (recommended).  It naturally is quite messy and difficult to read and I wanted it to be this way: it shows a more honest picture of how people are connected through history.

I must apologise for the overlapping names in some places. Due to the number of background nodes, label adjust (used to make non-overlapping names) seemed to crash every time I tried to run it. You can still make out almost all of the names however. I’ll try and upload a better version soon.

Your first immediate reaction might be “riiiight”. Then after a careful examination you might see how dominated the graph is by people who died a long, long time ago. This is because the graph is biased toward the oldest generation of thinkers. Take for instance the biggest group – the Greeks. They influenced a great many people who in turn influenced a whole bunch more. Unlike the previous graph, these 3rd generation thinkers are now connected to the 1st generation thinkers thus amplifying their overall size and connectedness within the network. For example Socrates, Plato and Aristotle are now connected to every person Nietzsche was connected to in the previous graph.

I also suspect there are a great number of people you might not have heard of who have large nodes. For me, this is a quick way to find interesting people to read about on Wikipedia. Lastly, many people who were tiny in the previous graph are now quite large e.g. Confucius, Socrates etc.

I compiled a list of the most connected people.

Name               Connections  Nationality  Born (B.C.)
Thales             3390         Greek        624
Pythagoras         3386         Greek        570
Zeno of Elea       3378         Greek        490
Socrates           3376         Greek        469
Parmenides         3368         Greek        5th cent.
Protagoras         3352         Greek        490
Plato              3351         Greek        423
Melissus of Samos  3332         Greek        5th cent.
Leucippus          3329         Greek        5th cent.
Zeno of Citium     3306         Greek        334
Pyrrho             3306         Greek        360
Stilpo             3300         Greek        360
Posidonius         3288         Greek        135
Panaetius          3286         Greek        185
Lucretius          3275         Roman        99

Bertrand Russell in his History of Western Philosophy (1945) wrote “Western Philosophy begins with Thales”. As we can see, his claim is backed up by this graph. Thales is the most connected individual with 3390 connections. This doesn’t mean he is the most influential or humanity’s biggest asset – it just means that if the data were complete (which it isn’t), then his ideas would have influenced (in whatever arbitrary way you define it) the greatest number of people.

The margin separating the top 5 is also quite small. This is presumably because there are only one or two degrees of separation connecting them all. Interestingly there is only one person not of Greek origin in the entire top 15. Again, this is largely due to the incompleteness of the dataset – these gentlemen also have antecedents who either a) have not been entered into Wikipedia or b) have been lost to history. As one Redditor wrote: “the group that came up with fire should be in the middle and bigger than everything combined”.

This type of graph just shows how much our perceptions of ideas can change depending on how we present information. This is my take home message for today.

Those in a rush — scroll to the caveats!

The Method:

You might at first think this is quite a trivial problem to solve but it did require some careful thought and even more careful programming. I mean in plain English, you just want to connect person A to whoever they have a forward connection with elsewhere on the network – how hard can it be?! Let me explain. Put your scuba suit on… now.

Here we have a basic (less pretty) graph of influences between a few people:

A basic influence network.

For example, here we have A influencing B and B influencing E. The crucial point is that the graph records no link from A to E because there is no direct connection between the two (this was the case in the previous graph). The problem in trying to reconnect A with E is that you have to find a way to traverse the connections in the correct direction and ensure you don’t end up in an infinite loop (A influences B, B influences E and E influences A!). Basically I want to convert a graph like this:


To a graph like this:


If you have the time, spend a few minutes trying to think of a way to make a list of unique names and every person they are connected to through someone else. Over my lunch break today, a friend of mine and I came up with a way to do this quite quickly. Here is the page we scratched on during lunch:

Planning. Chickens were here.

As you can see, we tossed around nested do and while loops but they were just too complicated. There are a whole host of ugly problems you encounter if you try to solve this using nested do and while loops. The solution: matrices. All this means is that my connection map will look something like this:


The number 1 represents a connection between the row and corresponding column, e.g. A is connected to B, B is connected to A and C, and C is connected to only A. I wrote a script in Matlab which calculates just this and generates a new list of connections. Essentially this works by looping through the rows and checking if there is a connection (=1). If there is a connection, it then finds the row belonging to the person they are connected to. Once that row is found, it is added in its entirety to the original person’s row. If you’re a bit of a coding oracle please let me know if there are faster ways of achieving the same result. So for our matrix above, C influenced A and A influenced B, so C should also influence B, right? This algorithm turns the above matrix into this (just for the looping component on the 3rd row):


Specifically, row A is added to row C and the element-wise product of the two is subtracted. This ensures there is always either a ‘1’ or a ‘0’ in all cells. The algorithm loops over every row in the matrix and carries out this procedure.
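The row-merging idea can be sketched in Python as follows. This is not the Matlab script itself: the function name `add_upstream_links` is invented, and repeating the passes until nothing changes is an assumption on my part about how to make sure long chains are fully resolved. The merge `a + b - a*b` is the add-then-subtract-the-product trick, which behaves like a logical OR on 0/1 values.

```python
def add_upstream_links(matrix):
    """Return a copy of a 0/1 influence matrix with indirect links filled in."""
    m = [row[:] for row in matrix]
    n = len(m)
    changed = True
    while changed:  # keep sweeping until a full pass adds nothing new
        changed = False
        for i in range(n):
            for j in range(n):
                if i == j or m[i][j] != 1:
                    continue
                # Person i influences person j, so i inherits everyone j influences.
                # a + b - a*b keeps every cell at 0 or 1.
                merged = [a + b - a * b for a, b in zip(m[i], m[j])]
                if merged != m[i]:
                    m[i] = merged
                    changed = True
    return m

# Socrates -> Plato -> Aristotle, rows and columns in that order.
chain = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
closed = add_upstream_links(chain)

# Two people who influence each other: a loop that must not run forever.
loop = [
    [0, 1],
    [1, 0],
]
closed_loop = add_upstream_links(loop)
```

Because a merge can only ever turn a 0 into a 1, the sweeps are guaranteed to stop, which is what sidesteps the infinite-loop problem. In the chain example Socrates gains a link to Aristotle; in the loop example both people end up with a 1 on the diagonal, i.e. a self-reference.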

The original list had 14,560 connections so it took a reasonable while to do all of the permutations on my laptop (10 minutes). This new list has 4,239 nodes with over 830,000 connections. Last time I checked, 830,000 > 14,560 so there is a lot more connectivity going on in this graph than the previous one. My code can also produce self-references. This is because two people may be contemporaries and influence one another. If I influence my brother and he influences me, do I not have a slight influence on myself through the actions of my brother? I thought I would leave these in just to see where they would turn up. No harm done here.

Once you make the matrix there are a whole heap of interesting things you can do. For example, who has the most connections in the network? Well, you just sum each row of the matrix and sort the totals in descending order (shown at the start). The last part of my script does this for you. I understand many of you won’t have Matlab and so I apologise in advance for this. I might try to do it in Python next time. In the meantime, you could try a free trial version or use Octave, which is a free alternative to Matlab.
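For what it is worth, the row-sum ranking is only a few lines of Python. This is a sketch rather than a port of my script, and `rank_by_connections` is a name invented here; it assumes the matrix is a list of 0/1 rows in the same order as the list of names.

```python
def rank_by_connections(names, matrix):
    """Pair each name with its row total, most connected first."""
    totals = [(name, sum(row)) for name, row in zip(names, matrix)]
    return sorted(totals, key=lambda item: item[1], reverse=True)

names = ["Socrates", "Plato", "Aristotle"]
# Toy matrix: Socrates reaches Plato and Aristotle, Plato reaches Aristotle.
matrix = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
ranking = rank_by_connections(names, matrix)
for name, count in ranking:
    print(name, count)
```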

For our example above, once you create the matrix, all you need to do now is simply create a .dl file which contains the following:

dl n=3
format = fullmatrix
Person A, Person B, Person C
0 1 0
0 0 1
1 0 1

This is the information which helped me. Once I obtained this matrix for the Wikipedia network, Gephi was able to import it. Thank you to whoever made this extension – it is genius.
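Writing the file by hand gets old quickly, so here is a small Python sketch that builds the text from a list of names and a matrix. Two things in it are assumptions rather than anything taken from my script: the helper name `to_dl`, and the `labels:` and `data:` marker lines, which follow the UCINET DL full-matrix layout as I understand it. The short listing above omits them, so drop them if your importer is happy without. The labels are single words to sidestep any question of quoting.

```python
def to_dl(names, matrix):
    """Build the text of a DL full-matrix file from names and 0/1 rows."""
    lines = [
        f"dl n={len(names)}",
        "format = fullmatrix",
        "labels:",  # assumption: UCINET-style marker before the label row
        ",".join(names),
        "data:",    # assumption: UCINET-style marker before the matrix rows
    ]
    lines += [" ".join(str(value) for value in row) for row in matrix]
    return "\n".join(lines) + "\n"

names = ["A", "B", "C"]
matrix = [
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
]
dl_text = to_dl(names, matrix)
print(dl_text)
```

To save it, write `dl_text` to a file ending in `.dl`, for example with `open("network.dl", "w")`, where the file name is just a placeholder.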

Sorry to waffle on a bit but last time I had a number of people requesting more detail on how I go about making the graphs. Finally, to save you going back through the text, here are links to the data I used to create my map. I’ve compressed some of them but at most, they will expand to about 100MB.

All the data:

  1. You’ll need Gephi (free) and Matlab (or Octave).
  2. Original list of people and their influences from dbpedia.
  3. The Matlab script which generated the linking matrix.
  4. The list of names used in the network.
  5. The csv matrix of 1’s and 0’s only.
  6. The final .dl file required to import into Gephi.


Caveats:

  1. Many important people have been left out of the network. I am limited by the information provided by dbpedia. I mean I had to cut 80% of the network I had available just to make the plot I showed here!
  2. The communities are coloured by the Modularity module in Gephi – I do not personally colour anything.
  3. The graph is biased towards Western ideologies. The graph is biased towards Western ideologies. Yes 2x.

That’s all for now. Let me know in the comments section if you have any questions.
