July 2, 2019

"Where today?" - Planning my Singapore trip with clusters

Outlining travel plans with R, k-medoids and Google Maps

Planning is not easy. Especially when it involves traveling and unknown places. In a couple of days, I'll have the pleasure of visiting Singapore. To prepare my visit to this multicultural, modern and expensive (ouch!) city, I read a book about it, looked for the best photo spots on Instagram, and ultimately, created a custom map on Google Maps with some of the places I'd like to visit. It looks like this.

Singapore
Singapore

Upon looking at the map, and realizing that I'll spend only four days in this busy city-state, I asked myself: "where should I start?" and "how should I accommodate my agenda so I can fit all these places in such short amount of time?" Then, I had a thought.

Being the data guy I am, and fan of applying unsupervised learning everywhere, I had the idea of using machine learning and R to group my map's locations into four clusters – one representing a day – to get a slight sense of how should I plan my days. Moreover, besides just clustering the dataset, I also wanted to interpret the different clusters as well as the overall result. By exploring cluster analysis concepts such as dissimilarity, and separation, I'll be able to deduct further knowledge, like the approximate amount of steps required to walk from one cluster to another.

"Ok Juan, is this essential?" Probably not. After all, one of the coolest things about going to a new city is getting lost and explore it. But hey, it seems like a fun experiment to do, and an excellent way to mix both data and travels.

The algorithm: k-medoids

When thinking about the algorithm I should use to cluster my data, there was a fundamental fact I had to keep in mind: the Earth is not flat. Thus, I had to resort to an algorithm that allows me to use a custom similarity measure that considers the curvature of the Earth. That algorithm is k-medoids. Unlike its famous counterpart, k-means, k-medoids is a distance-based clustering algorithm that can be used with an arbitrary distance measure, while the k-means model is based on variance minimization, which in this case is the sum of squared errors.

The similarity formula I used alongside k-medoids is the one known as Haversine, a metric that defines the "distance between two points on a sphere given their longitudes and latitudes"; (https://en.wikipedia.org/wiki/Haversine_formula) for this experiment, the distance unit is kilometers.

Earth
Earth, by NASA.

With that background out of the way, let's collect the data.

Data Collection

A great feature of Google's My Maps is that it easily allows exporting the maps and pinned locations to an external file. To do this, click on the map layer, mark the "export as KML" option, and download it. The result is a KML file, which is an XML that follows a standard format. Once you have the data, the next step is to convert it into a beautiful and tidy table, and there are two ways of doing it.

The first one is to do it manually. If your map is made of a few points, you could just search for the <coordinates> nodes in the .kml file and paste them, alongside the location's name (if you are interested), to another file (probably a CSV).

nodes
Example of two nodes.

The non-manual way involves parsing the file to extract the information from the nodes. A simple way of doing this in R is with the map tools package and the function getKMLcoordinates, which takes as parameter the path to the file, and automatically creates a matrix made of the coordinates (for the exact "how to" see the script on my GitHub; link's at the end).

Now that we have a tidy and clean dataset, it's time to cluster it.

The Clustering

To cluster the data, I used R's k-medoids implementation from the package cluster. To apply it, one has to call the function pam using as a parameter the dataset and the number of desired clusters. Like this:

pam(df, k = 4)

By default, this method clusters the data using Euclidean distance as the dissimilarity metric, and as an alternative, it supports Manhattan distance. But sadly, it doesn't include the Haversine distance. In such cases, where the algorithm doesn't have the desired dissimilarity measure, the solution is to use as an input a dissimilarity matrix – a matrix that describes the pairwise similarity between the points of the dataset – instead of the dataset. So, now, the function call looks like this (the complete process is explained in the code):

pam(distances, k = 4, diss = TRUE)

Then, we can finally cluster. As for the parameter 'k', I selected 4 since I plan to spend either four or five days in the city. The clusters look like this.

coordinates

Cool, right? However, since this article is about Singapore and the places I'm interested in visiting, we need a map to get a better understanding of the meaning behind the clustering, and of course, there's an R package (more like several) for that.

The package I selected is called ggmap, a library that provides a wrapper around Google Maps API, allowing us to download maps directly into R. Even better is that this package is fully compatible with ggplot2, making it easy to overlap additional plot layers on top of the chart. With this feature, I drew the clusters on top of a map of Singapore. This is how it looks like.

map_cluster

Here we can see the four clusters, and the regions they border, each marking the places I should visit in those four days. The north cluster (purple colored) sits on top of the Bukit Timah Nature Reserve region, hinting that I should take one day just to visit this area and the locations marked on it (which I might actually do). Right under it, there's the blue cluster, which points at the busy downtown area. Then, on red, we have the Sentosa area cluster, and lastly on the right, the airport area, which I'll surely visit because it has a Pokemon Center and shopping center that looks straight out of a sci-fi movie. To get an in-depth look detailed at the locations, I'll show the zoomed-in versions of the four different clusters.

This is the natural reserve area.

nr_cluster

Then, on the next image, we have Sentosa Island. See the point on the left? That's a bar that serves local beers. But more than that, this point represents the main pitfall I found with my methodology, and that is, a miscalculation of the actual distance I have to walk. The Haversine distance, simply tells us about the distance from one coordinate to another, considering only the curvature of the Earth. However, unfortunately, I don't have wings nor a private boat to take me directly from Sentosa to the bar. Implicating that the distance I have to walk will be higher than the one calculated by the formula – and that's without considering the ups and down, which also add a couple of meters to the total distance traversed but Singapore is quite flat, so that's unimportant. Notwithstanding, I decided to ignore this detail, and honestly, I might actually hit the bar after spending the day on the sunny island.

sentosa_cluster

Downtown.

downtown_cluster

And lastly, the airport.

airport_cluster

This is the smallest cluster, made of only two points. Does this mean I'll dedicate one day for just this? While, I might spend a reasonable amount of time at the Pokemon Center, no, I won't spend the whole day here. Realistically speaking, I'll visit some of the Downtown attractions on this day.

Regarding the fact that there are only two points in this cluster, this could be addressed by selecting another clustering algorithm, for example, a density-based one such as DBSCAN. This kind of algorithm requires as a hyperparameter an epsilon value that specifies the minimum numbers of points a cluster have to have.

Cluster Analysis

Clustering the data is quite straightforward (sometimes). You fit a model, plot them and voila, rejoice at how pretty it looks. But there's more than that. Underneath all the points of a fitted cluster model, there is a whole world of values waiting to be discovered and interpreted.

From this sea of values, I'll take a look at those regarding the intra-cluster distance, that is, the distances within each member of the cluster, and inter-cluster distance, which measures how far one cluster is from other. The table below shows the information for each cluster.

size max_diss av_diss diameter separation
1 7 6.95623129 1.95450318 7.48929349 2.948080
2 22 7.41214376 1.84317838 10.80029181 2.948080
3 2 0.04592962 0.02296481 0.04592962 10.780345
4 3 4.02376770 2.03615221 5.15214123 3.512083

The first column is about the size of the clusters. As we saw before, the smallest one is made of two observations, while the biggest one (Downtown region) has 22 places. Then, in the second column, there's the maximum dissimilarity (max_diss), that's the distance between the cluster medoid and its farthest point. For example, cluster two's max_diss is 7.41 (kilometers, remember), meaning that I have to walk (theoretically) 7.41 km. to reach its center if I start from East Coast Park. Similarly to max_diss, we have av_diss, the average dissimilarity (from the centroid's center). The fourth column contains the diameter, which is the maximal dissimilarity between two points of the same cluster, followed by the separation score, the minimal dissimilarity between a data point of one group and a data point from another.

Before wrapping up this article, I want to complement all this bunch of numbers I just presented with another piece of data. In a previous article, I deducted that the correlation between my daily number of steps and daily distance walked is 0.98, implying an almost perfect linear relationship between these two numbers. Consequently, since this seems way too good, I fitted a regression where the number of steps was my target variable, and found that this equation describes the relationship:

y = -251.4 + 1470.4 * X

Where X is the distance (in km.) So, if I want to know the approximate numbers of steps required for doing any of the things discussed before, I just need to replace X by the value. For example, walking from the Pokemon Center to East Coast Park (a distance of 10.780345 km.) would take me 15600.02 steps. Isn't that cool?

Conclusion

In this article, I described how I defined a plan for my upcoming visit to Singapore. With a bit of data, Google Maps, and k-medoids, I discovered the locations I could visit on each day. According to the clustering, I should spend one day at the natural reserve, one day in downtown, another one at Sentosa island and one at the airport. Will I follow this plan? Some parts, yes. Guess I'll have to wait until I'm there to see how practical this was.

The code used for this project, including the datasets, is available on my Github: https://github.com/juandes/wanderdata-scripts/tree/master/unsupervised-city-planning/singapore

I hope you have enjoyed this and learned something new. If you have any questions, clarification, or find any inconsistency here, leave a comment.

Thanks for reading :)