Discovering the Geo in Social Media data

This blog post is based on research undertaken for the Contagion project at the University of Exeter.

This week Scraperwiki announced that its scraping service can now include user location information in Twitter datasets. For non-programming geographers and others wanting to explore the geographic origins of social media postings, this opens up an important dimension of digital data. Of course, many projects and users have already been successfully geolocating Twitter data by accessing the API, by developing bespoke tools such as Floating Sheep or Tweet Map, or by using paid-for platforms such as GNIP. However, researchers using open source tools and with limited coding capabilities have not been able to access Twitter’s geolocating potential.

As Scraperwiki’s announcement came on Monday 16th June, it seemed an obvious call to explore the global origins of #worldcup2014 tweets. A quick scrape harvested just under 20,000 tweets, each with two additional metadata columns: user_time_zone and user_location. Cleaning in OpenRefine reduced the dataset to 10,957 tweets. This significant reduction was due to the number of blanks in the two location columns. As expected when using data scraped from users’ profiles, not every entry contained useful information.

User location is the location a user states on their profile. It could be helpfully accurate, such as ‘London, UK’, creatively inaccurate, such as ‘Middle-Earth’, or vague, such as ‘Worldwide’ or ‘everywhere’. The time zone column might be expected to be more reliable. However, one anomaly in this column stood out: the official FIFA 2014 profile, @2014_FIFA, was given the time zone of Chennai (which is in India) and the user location of Brazil. This does not fit with their profile or their expected current activity. An SQL query in OpenOffice revealed some of the most popular time zone results, with London, UK accounting for 10.4% of time zone entries. However, ‘null’ results made up 32% of entries, pointing again to the limitations of the time zone information.
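For readers who would rather script these two steps than use OpenRefine and an SQL query, the following is a minimal Python sketch of the same idea, not the workflow actually used here. The sample rows are made up for illustration; only the column names user_location and user_time_zone come from the scrape. It also assumes that a row is dropped when either column is blank, which may not match the original cleaning exactly.

```python
from collections import Counter

# Made-up sample rows for illustration; only the two column names
# (user_location, user_time_zone) come from the scraped dataset.
rows = [
    {"user_location": "London, UK", "user_time_zone": "London"},
    {"user_location": "", "user_time_zone": ""},
    {"user_location": "Middle-Earth", "user_time_zone": "null"},
    {"user_location": "Brazil", "user_time_zone": "Chennai"},
    {"user_location": "Worldwide", "user_time_zone": "London"},
]


def drop_blank_rows(rows):
    """Keep only rows where both location columns contain some text.

    Assumption: a blank in either column removes the row.
    """
    return [
        r for r in rows
        if r["user_location"].strip() and r["user_time_zone"].strip()
    ]


def time_zone_shares(rows):
    """Return each time zone's share of rows as a percentage."""
    counts = Counter(r["user_time_zone"] for r in rows)
    total = len(rows)
    return {tz: round(100 * n / total, 1) for tz, n in counts.most_common()}


cleaned = drop_blank_rows(rows)
shares = time_zone_shares(cleaned)
print(len(cleaned), shares)
```

Note that a literal ‘null’ string survives the blank filter and is counted as a time zone in its own right, which is how ‘null’ can still account for a large share of a cleaned dataset.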

The dataset was then uploaded to Google Fusion Tables, which generated a map from the user location results.


This quick-and-dirty geolocation exercise, based on a trending hashtag, provides a window into the global spread of #worldcup2014 Twitter activity. The dataset offers an antidote to the dominance of Twitter analysis based on English-speaking countries. Clusters are visible not only in the UK and the USA, but also in India, Japan, East Africa, Indonesia and Malaysia.

The exercise is, however, limited: 32% of the cleaned dataset has a ‘null’ time zone entry. If these entries were combined with the 9,031 blank rows removed on cleaning, the proportion of unusable records would rise significantly.
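As a rough worked example of that combined figure, the short Python sketch below adds the two together. It assumes the 32% applies to all 10,957 cleaned rows and that each of the 9,031 removed rows counts as one unusable record. Because 32% is itself a rounded figure, the result is only approximate.

```python
# Figures quoted in this post.
cleaned_rows = 10_957   # rows left after cleaning
removed_blanks = 9_031  # blank rows removed during cleaning
null_share = 0.32       # share of cleaned rows with a 'null' time zone

# Assumption: the 32% applies to the whole cleaned dataset.
null_rows = round(null_share * cleaned_rows)

original_rows = cleaned_rows + removed_blanks
unusable_rows = null_rows + removed_blanks
unusable_share = round(100 * unusable_rows / original_rows, 1)

print(null_rows, original_rows, unusable_rows, unusable_share)
```

Under these assumptions, roughly 12,500 of the 19,988 scraped tweets, or about 63%, carry no usable time zone information.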