Wikipedia's "Most Interesting Place in the World"
Gathering all of Wikipedia's geotagged data to uncover the most interesting cities & countries worldwide
In June & July 2023, I was busy on a project commissioned by Explore Worldwide Australia and ideated by ProppellerNet BCORP: discovering the most interesting places in the world, using only data available online.
How do you define the most interesting place with data?
And how do you make ‘interest’ quantifiable?
The agreed-upon hypothesis was that if something were interesting, someone would find it worth talking about, and if it’s worth talking about, it likely has a Wikipedia page.
OK, so if I could download all of Wikipedia, turn every single page into an accurate last-known location of the item it discusses, and then reverse geocode those locations into addresses, I could turn the entirety of Wikipedia into a table, counting the items and associating them with cities and countries.
Well, I had a theoretical plan.
Computational Power
You may have heard people talking online about how “Wikipedia is just 21GB - it could fit on an Apple Watch!”. These people might be correct, but that version of Wikipedia would be so compressed it wouldn’t be useful to anyone.
If you download a full Wikipedia data dump, it uncompresses into a file over 3TB in size - far too large for many home PCs to open. Even your Apple Watch might struggle.
The original hypothesis had other processing issues too. Attempting to find the location of an item with a Wikipedia page entirely from the text on the page would be far too inaccurate without ChatGPT levels of processing power. Scale that up 7 million times and you’re looking at a long processing time for the GPT servers.
The scope needed to be reduced to be physically possible, but not so much it lost credibility.
Exploring the metadata stored by Wikimedia, I came across geo_tag_schema dumps - essentially, data on all the media and pages on Wikipedia which already have a geo-location associated with them.
It was exactly what we were looking for: coordinates already served up on a platter, for a chunk of Wikipedia sizeable enough to be a fair representative sample.
Analysing & Geocoding
With the geo-tagged metadata downloaded & cleaned (one tag per page, only considering tags on Earth, etc.), we were left with a sample of over a million geo-tagged pages - over 15% of English-language Wikipedia - not bad at all.
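The cleaning steps above can be sketched roughly as follows. This is a toy sketch with synthetic data - the column names (`gt_page_id`, `gt_primary`, `gt_globe`, etc.) are assumptions about the dump's schema, not a guarantee of it:

```python
import pandas as pd

# Synthetic stand-in for the geo-tag dump; column names are assumptions.
raw = pd.DataFrame({
    "gt_page_id": [1, 1, 2, 3, 4],
    "gt_primary": [1, 0, 1, 1, 1],          # 1 = the page's main coordinate
    "gt_globe":   ["earth", "earth", "earth", "moon", "earth"],
    "gt_lat":     [51.5, 51.6, 48.9, 0.7, -33.9],
    "gt_lon":     [-0.1, -0.2, 2.3, 23.5, 151.2],
})

cleaned = (
    raw[raw["gt_globe"] == "earth"]          # only consider tags on Earth
    .sort_values("gt_primary", ascending=False)
    .drop_duplicates("gt_page_id")           # keep one tag per page
    .reset_index(drop=True)
)
print(len(cleaned))  # 3 pages survive: the Moon tag and the duplicate are dropped
```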
A sample of the data at this point
Next came the process of turning coordinates into places - a process known as reverse geocoding.
If you are familiar with geocoding, you may be aware that it can be extremely expensive to secure accurate results. The world map is constantly changing, with borders between countries and cities always evolving, and accurately turning latitudes and longitudes into an address (or vice versa) can be computationally & financially expensive.
For example, we had around 1.2 million coordinates. If we were to use Google’s Places API to reverse geocode them, it would cost approximately $6,000 - far beyond the budget of the entire project, so completely unfeasible.
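The back-of-the-envelope maths here is simple, assuming a rate of roughly $5 per 1,000 requests (the ballpark published price for Google's geocoding at the time - treat the exact figure as an assumption):

```python
# Rough cost estimate for reverse geocoding via a paid API.
coords = 1_200_000          # number of coordinates to look up
rate_per_1000 = 5.00        # assumed price in USD per 1,000 requests

cost = coords / 1000 * rate_per_1000
print(f"${cost:,.0f}")      # → $6,000
```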
Praise be to the cute & simple Python module “reverse-geocode”.
If you’re not worried about the most supreme accuracy, don’t need granular details like street names, and accept some errors when handling coordinates in the ocean, this little guy is a lifesaver.
It works simply by maintaining a database of city locations. For every coordinate it processes, it places the position on its globe, finds the nearest known city, and returns that city & country.
It’s not perfect, and the data needs some additional processing to clean it up properly, but it works very well on a budget.
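To make the mechanism concrete, here is a hand-rolled miniature of that nearest-city lookup - not the module's actual code, just the idea behind it, with a three-city table standing in for its full database:

```python
import math

# Toy city table standing in for the module's bundled database.
CITIES = {
    ("Paris", "France"): (48.8566, 2.3522),
    ("London", "United Kingdom"): (51.5074, -0.1278),
    ("Sydney", "Australia"): (-33.8688, 151.2093),
}

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 6371 * 2 * math.asin(math.sqrt(a))

def nearest_city(lat, lon):
    """Return the (city, country) pair closest to the given coordinate."""
    return min(CITIES, key=lambda c: haversine(lat, lon, *CITIES[c]))

print(nearest_city(48.86, 2.35))   # → ('Paris', 'France')
```

The real module does essentially this, but against tens of thousands of cities with a k-d tree for fast lookups - which is also why coordinates in the ocean still get snapped to the nearest coastal city.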
So after some reverse geo-coding, we arrive at the first round of results and… oh… oh no…
Results Round 1
Removing Bias
We should have seen this one coming.
By using the English-language version of Wikipedia as our source for interesting locations, we’ve ended up with an incredibly biased results set.
The results in this form are essentially a list of countries based on:
How many people speak English in said country
How many people have access to the Internet in said country
How popular is Wikipedia in said country
To remove at least some of the bias from the results, we needed to weight them in a way which allows for better representation of interesting places where the English-language Wikipedia is not frequently used.
I gathered data on how widely English is spoken, what % of the population has easy access to the Internet, and the number of Wikipedia users from each country, and used these stats to inform a weighting to be placed on the results from each country.
This means that each interesting location in Zimbabwe (weighting 3.6) is worth 2 interesting locations in Italy (weighting 1.8).
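The weighting step is just a per-country multiplier on the raw counts. A minimal sketch, using the two weights quoted above (the raw counts here are illustrative, not the project's real figures):

```python
# Per-country weights from the bias-correction step (only the two quoted
# in the post are shown; the full table covered every country).
weights = {"Zimbabwe": 3.6, "Italy": 1.8}

# Illustrative raw counts of geo-tagged pages per country.
raw_counts = {"Zimbabwe": 100, "Italy": 500}

# Weighted score = raw count x country weight.
weighted = {country: raw_counts[country] * weights[country]
            for country in raw_counts}

# One Zimbabwean location (3.6) carries the same weight as two Italian ones (2 x 1.8).
print(weighted)
```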
Happy that the final results gave a fair representation of the interesting places of the world according to Wikipedia, we tied up the results, generated heatmaps, and published the final article.
You can read it here.
An ‘Interesting’ heatmap of the world