Over the weekend, I put together a map showing connections with my Facebook friends. I thought I'd jot down how it came about.
Recently, I've been experimenting with the R programming environment. It is a tool for statistical data analysis and can produce some really nice visualizations. A few days back, I came across this tutorial from the Flowing Data website. It shows how to plot great circle arcs between different airports and get results that look really nice with only a few lines of code. In the tutorial, the author also mentioned the famous visualization that a Facebook engineer created to show how Facebook connections can trace the outline of civilization. I wondered whether I could do the same for my own Facebook friends.
First, how can I get my Facebook friends' locations? After some spelunking around the Facebook developer website, I found a Facebook Developer tool you can use without going through any extra developer registration steps. You can grab your own data in JSON, which is a nice format to parse in a script. Of course, the online tool is too limited to scrape all of your friends' info by hand. But if you grab the access token from that site, you can get at the same data from the command line...
curl "https://graph.facebook.com/me/friends?access_token=<token>"
curl "https://graph.facebook.com/<friendid>?access_token=<token>"
If I can do that on the command line, then I can script it.
But how can I translate locations like "Milwaukee, Wisconsin" to a latitude and longitude? I have many friends in small cities, so I'll need a good database to do these lookups. A bit of googling turned up http://www.geonames.org/export/, which has some datasets that look perfect for me.
At this point, it seems like I have all the information I need to accomplish this. Now, it is just a matter of stitching it together.
First, a script to download the names & locations. It just wraps the two HTTP fetches from above in order to get the friend['location']['name'] from the JSON. The first unexpected roadblock was Unicode. One of my friends has an accent in her name, and printing it out caused Python to barf chunks. [Thanks, Renée! :-)] Unicode characters are not something I normally have to deal with, so it was back off to the Python website for some help. Unicode is still not very clear to me, but by adding friend['name'].encode('ascii','ignore') to that print statement, I was back up and running. The catch is that this silently drops the accented characters rather than converting them.
After this, I now have a file that looks something like:
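For illustration, here is a minimal sketch of the extraction step, not the original script. It assumes each friend record has already been parsed from JSON into a dict with a "name" key and an optional "location" dict; the helper name friend_line and the sample surname are made up for this example.

```python
def friend_line(friend):
    """Format one parsed friend record as 'name location'.

    Assumes `friend` is a dict shaped like {"name": ..., "location": {"name": ...}},
    where "location" may be missing (hypothetical shape for this sketch).
    """
    name = friend.get("name", "")
    # A missing location becomes None, which prints as the string "None".
    location = (friend.get("location") or {}).get("name")
    # encode(..., 'ignore') drops any character that has no ASCII equivalent;
    # decoding again gives a plain str under Python 3.
    ascii_name = name.encode("ascii", "ignore").decode("ascii")
    return "%s %s" % (ascii_name, location)


if __name__ == "__main__":
    sample = {"name": "Renée Example", "location": {"name": "Portland, Oregon"}}
    print(friend_line(sample))
```

Note that the accented letter is dropped outright rather than replaced with a plain "e", which is the trade-off of the 'ignore' error handler.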
Sname Ename Portland, Oregon
Kname Bname None
Mname Zname Orlando, Florida
Mname Bname Kuala Lumpur, Malaysia
...
Someday, I'll need to take a stab at hand-translating all those "None" locations to real locations, but for now I'll just ignore them. Since I only have a couple of international friends, the world map isn't that interesting. For now, I'll just restrict this map to the US.
Now for a stab at a pythonic way to translate those Facebook locations into latitude and longitude. I wish I could use the cities1000 file on geonames.org, but I have friends in smaller towns than that, so I used the full US datafile instead. That file is pretty huge: US.zip is 55M and expands to 265M. Loading it takes more than a minute on my laptop, so I won't want to do this translation too many times. After the fact, I realized that it has a gigantic amount of information I do not care about, such as the locations of schools, parks, etc. I'll have to come back later & filter that out for future use.
Now, on to the next hurdle: cleaning up the states & countries so they match across these two pieces of data. Both need to translate to two-letter codes (Oregon to OR, Malaysia to MY, etc.). More googling and I have some code to do that mapping. Oh, then another few minutes removing more annoying Unicode characters.
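The filtering mentioned above could look something like the following sketch. It relies on the tab-separated column layout of the GeoNames dump files (latitude and longitude in columns 5 and 6, feature class in column 7, admin1 code in column 11, population in column 15, counting from 1) and keeps only feature class "P", the populated places. The function name and the returned dict keys are my own choices for illustration.

```python
import csv


def load_populated_places(lines):
    """Keep only populated places (feature class 'P') from GeoNames dump rows.

    `lines` is any iterable of tab-separated text lines, e.g. an open US.txt file.
    """
    places = []
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip short or malformed rows, and anything that is not a city/town/village.
        if len(row) < 15 or row[6] != "P":
            continue
        places.append({
            "name": row[2],            # asciiname column, already free of accents
            "state": row[10],          # admin1 code; assumed to be the state code for US rows
            "country": row[8],
            "lat": float(row[4]),
            "lon": float(row[5]),
            "population": int(row[14] or 0),
        })
    return places
```

Something like `load_populated_places(open("US.txt", encoding="utf-8"))` would then drop the schools and parks in a single pass, so the slow load only has to happen once if the result is saved.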
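As a rough idea of what that mapping code does, here is a small sketch. The lookup tables are deliberately tiny excerpts and the function name split_location is hypothetical; a real version would need all the states and countries.

```python
# Excerpts only; a complete table would list every state and country.
US_STATES = {"Oregon": "OR", "Florida": "FL", "Wisconsin": "WI", "California": "CA"}
COUNTRIES = {"Malaysia": "MY"}


def split_location(location):
    """Turn 'City, Region' into (city, state_code, country_code).

    Returns None when the text has no ', ' separator or the region is unknown.
    """
    city, sep, region = location.rpartition(", ")
    if not sep:
        return None
    if region in US_STATES:
        return (city, US_STATES[region], "US")
    if region in COUNTRIES:
        return (city, None, COUNTRIES[region])
    return None


if __name__ == "__main__":
    print(split_location("Portland, Oregon"))
    print(split_location("Kuala Lumpur, Malaysia"))
```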
Finally, I have the data that I can work with. A list of cities, how many friends in each and their latitude and longitude.
San Jose,CA,US,7,34.07779,-117.77617
Evansville,IN,US,1,37.97476,-87.55585
Sierra Vista,AZ,US,1,31.55454,-110.30369
...
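A file like this can be produced by counting friends per place and writing one row per city. The sketch below is one way to do that, not the original script; the header names (city, state, country, friends, lat, lon) are my own guess at something reasonable.

```python
import csv
import io
from collections import Counter


def summarize(friend_places, coords):
    """Count friends per (city, state, country) and return CSV text.

    `friend_places` has one tuple per friend; `coords` maps the same tuples
    to (lat, lon). Places without known coordinates are skipped.
    """
    counts = Counter(friend_places)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["city", "state", "country", "friends", "lat", "lon"])
    for place, n in counts.most_common():
        if place not in coords:
            continue
        lat, lon = coords[place]
        writer.writerow(list(place) + [n, lat, lon])
    return out.getvalue()
```

Writing the header row here means the file can go straight into a CSV reader without any hand editing.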
The sharp-eyed reader will note that the city of San Jose is not really at that latitude and longitude. I discovered this later when I found some connections into Southern California that I wasn't expecting. I'll have to look into being more careful with parsing that large dataset of locations. For now, I just fixed this and an errant Mountain View location by hand.
After adding headers, I can slurp that file into R with read.csv().
Following along with the tutorial code, it worked! Well, almost... the lines going to the east coast seemed to end a bit west of their actual location. I found that the example omitted an important parameter (addStartEnd=TRUE); without it, the last segment of the arc is dropped. Earlier tutorial code has this, but the later code dropped it.
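One possible way to be more careful, which I did not actually implement, is to prefer the most populous place when several share a name and state. The sketch below assumes each candidate is a dict with a "population" key; the population numbers in the sample are invented for illustration, and the second set of coordinates is only approximate.

```python
def best_match(candidates):
    """Pick the candidate with the largest population, or None if there are none."""
    return max(candidates, key=lambda p: p["population"], default=None)


# Illustrative data: populations are made up, not taken from the dataset.
SAMPLE = [
    {"name": "San Jose", "state": "CA", "lat": 34.07779, "lon": -117.77617, "population": 0},
    {"name": "San Jose", "state": "CA", "lat": 37.33939, "lon": -121.89496, "population": 900000},
]

if __name__ == "__main__":
    print(best_match(SAMPLE))
```

This is only a heuristic: it would still guess wrong for a friend who really does live in the smaller place.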
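To see why the endpoints matter, here is a small Python sketch of great circle interpolation. It is not the R code from the tutorial; it uses the standard spherical interpolation formula, and the add_start_end argument is only loosely modeled on the R parameter of the same name.

```python
import math


def great_circle_points(lon1, lat1, lon2, lat2, n=50, add_start_end=True):
    """Return (lon, lat) pairs in degrees along the great circle between two points.

    Produces n intermediate points, plus the two endpoints when add_start_end
    is True. Does not handle exactly antipodal endpoints.
    """
    p1, l1, p2, l2 = map(math.radians, (lat1, lon1, lat2, lon2))
    # Angular distance between the endpoints (haversine formula).
    h = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin((l2 - l1) / 2) ** 2)
    d = 2 * math.asin(math.sqrt(min(1.0, h)))
    if d == 0:
        return [(lon1, lat1)]
    fractions = [i / (n + 1) for i in range(1, n + 1)]
    if add_start_end:
        fractions = [0.0] + fractions + [1.0]
    points = []
    for f in fractions:
        # Weighted blend of the two endpoint vectors on the unit sphere.
        a = math.sin((1 - f) * d) / math.sin(d)
        b = math.sin(f * d) / math.sin(d)
        x = a * math.cos(p1) * math.cos(l1) + b * math.cos(p2) * math.cos(l2)
        y = a * math.cos(p1) * math.sin(l1) + b * math.cos(p2) * math.sin(l2)
        z = a * math.sin(p1) + b * math.sin(p2)
        lat = math.degrees(math.atan2(z, math.hypot(x, y)))
        lon = math.degrees(math.atan2(y, x))
        points.append((lon, lat))
    return points
```

With add_start_end=False, the last point returned stops short of the destination, so a line drawn through the points ends west of an east coast city when the arc starts further west.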
After all that I was finally able to draw the connections that I wanted to draw. Here it is:
That sure was more work than I thought it would be, with a few detours I wasn't expecting, but I did get what I wanted. It isn't a great example of how "fun" and "easy" programming can be, though. Rather, it is a lesson in how difficult it can be to do simple things.
Update: New version with a better color scheme, points for locations and a bit of aggregation for cities that are close to each other to gather people into the most populated city.