Richard Steele describes how ‘clustering’ techniques can be a useful part of any data analysis toolkit
When analysing data, it can be useful to allocate datapoints into groups with similar characteristics – ‘clustering’. This allows actuaries to better understand what attributes datapoints share, as well as how they differ. Useful applications include exploratory data analysis, consumer segmentation and outlier detection.
There are many clustering algorithms, and they commonly work by evaluating the distances between datapoints in a numeric feature space. When data has many dimensions, this concept can be hard to visualise, so it is easiest to observe in two dimensions.
To show the concept here, we used STATS19 data on fatal traffic accidents in Britain (bit.ly/Road_safety_data).
If we simply plot longitude and latitude co-ordinates, we obtain a quick rendering of accident locations.
Using this two-dimensional dataset, we can attempt some simple clustering methods. K-means is a commonly used algorithm and one of the simplest introductory concepts. We specify a required number of clusters (‘k’) – in this example, it’s 10. Figure 1 shows the 10 central points (‘centroids’) derived by the algorithm, with each accident assigned to its nearest centroid.
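As an illustration, a minimal sketch of this step using scikit-learn – with synthetic co-ordinates standing in for the STATS19 data, so the variable names and values are assumptions rather than the analysis behind Figure 1 – might look like:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the accident co-ordinates: uniform random
# (longitude, latitude) pairs roughly spanning Britain.
rng = np.random.default_rng(42)
coords = rng.uniform(low=[-5.0, 50.0], high=[1.5, 55.0], size=(1000, 2))

# Fit k-means with k = 10, as in Figure 1. Each datapoint is then
# assigned to its nearest centroid.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(coords)

labels = kmeans.labels_              # cluster index for each datapoint
centroids = kmeans.cluster_centers_  # the 10 derived central points
```

Plotting `coords` coloured by `labels`, with `centroids` overlaid, reproduces the kind of rendering shown in Figure 1.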
This simple algorithm has some drawbacks. Clusters tend towards ball shapes, evenly distributed around the centroid, so the algorithm cannot handle more irregularly shaped clusters. Every datapoint is used, meaning clusters can be influenced by outliers. There is also a requirement to specify the number of clusters in advance – something that may not be known.
Figure 1 illustrates a further drawback: South Wales and South-West England are assigned to the same cluster, based purely on their proximity to a shared centroid – one that falls close to the middle of the Bristol Channel.
We can see, therefore, that this method may be unsuitable for certain scenarios.
For our case study, it may be more appropriate to locate clusters based on accident frequency within a region. The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm allows us to identify clusters using data density. The density threshold is set by two input parameters: a distance (‘epsilon’) and the minimum number of datapoints that must fall within that distance for a point to be clustered.
In high-dimensional data, setting epsilon can require some judgment and expertise. In two-dimensional space, epsilon defines the radius of a circle. In our case study, accuracy requires the application of a conversion formula to allow for Earth’s curvature.
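As a hedged sketch of how this might be coded – using synthetic data rather than the STATS19 records, with illustrative parameter values – scikit-learn’s DBSCAN offers a haversine metric that handles the curvature conversion, provided co-ordinates are supplied as (latitude, longitude) in radians:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in data: one dense 'hotspot' of 50 points plus 100
# sparse points scattered across Britain (illustrative only).
rng = np.random.default_rng(0)
hotspot = rng.normal(loc=[51.5, -0.1], scale=0.001, size=(50, 2))
scattered = np.column_stack([rng.uniform(50.0, 55.0, 100),
                             rng.uniform(-5.0, 1.5, 100)])
lat_lon = np.vstack([hotspot, scattered])

# With the haversine metric, epsilon is expressed in radians, i.e.
# distance in km divided by Earth's radius in km.
earth_radius_km = 6371.0
db = DBSCAN(eps=0.5 / earth_radius_km,  # roughly a half-kilometre radius
            min_samples=10,             # at least 10 points per cluster
            metric='haversine',
            algorithm='ball_tree').fit(np.radians(lat_lon))

labels = db.labels_  # -1 marks unclustered 'noise' points
```

On this synthetic data, the dense hotspot forms a cluster while the sparse scattered points are labelled as noise – the filtering behaviour described below for Figure 2.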
Suppose we are interested in areas with at least 10 accidents, and we initially set an arbitrary value of epsilon. In Figure 2 we see two effects. First, the resulting clusters strip out datapoints where there are fewer than 10 incidents within the defined region, filtering out the sparse outliers. Second, we see an effect commonly observed with DBSCAN: clusters are not constrained in size or shape, but instead are defined and joined by the density of the datapoints, spanning out until they fall below the specified density threshold. This expansion effect is known as ‘chaining’.
We observe one main cluster that includes Central London expanding east to the Thames estuary and north via the M1 motorway. If we increase the minimum sample size to 25, we identify smaller, denser clusters (Figure 3).
This parameter modification is analogous to ‘raising the sea level’ to leave only the densest ‘islands’ (clusters) above water, while less dense areas get ‘submerged’. We can vary parameters until we find a density appropriate for our purpose. When we specified regions with three or more fatal accidents per approximate half-kilometre radius, the algorithm identified several hotspots. The area with the most fatal accidents was a region of Bradford that had seven incidents over the period (Figure 4).
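The ‘sea level’ idea can be sketched as a simple parameter sweep. The example below is again illustrative – two invented hotspots, one tight and one looser, with min_samples raised until only the denser hotspot remains above water:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Two synthetic hotspots in degrees (latitude, longitude):
# a very tight grouping and a looser one, far apart.
tight = rng.normal(loc=[51.5, -0.1], scale=0.0005, size=(80, 2))
loose = rng.normal(loc=[53.5, -1.5], scale=0.003, size=(40, 2))
coords = np.radians(np.vstack([tight, loose]))  # haversine needs radians

clusters_found = {}
for min_samples in (10, 40):
    labels = DBSCAN(eps=0.5 / 6371.0,  # ~half-kilometre radius
                    min_samples=min_samples,
                    metric='haversine',
                    algorithm='ball_tree').fit_predict(coords)
    # Count clusters, excluding the noise label -1.
    clusters_found[min_samples] = len(set(labels)) - (-1 in labels)
```

Raising min_samples from 10 to 40 ‘submerges’ the looser hotspot, leaving only the tight one as a cluster.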
However, when drawing conclusions from this analysis, we should exercise caution before labelling this area of Bradford ‘most deadly’. The example ignores the number of deaths per incident. In addition, higher volumes of traffic naturally result in a greater number of incidents, so a quieter road with fewer than seven deaths may be considered riskier per car-mile travelled.
Adjust the algorithm
There are many clustering algorithms, each with their own strengths and weaknesses. The most appropriate algorithm will depend on the purpose of the analysis and the features of the data. Analysis can be conducted within many disciplines, from traditional actuarial work to areas of wider public interest (as in our example on car accidents). In many scenarios, other algorithms may be more appropriate than DBSCAN.
However, given that libraries are readily accessible in Python and R, density-based algorithms such as DBSCAN should form a valuable part of the overall data analysis toolkit.
See an interactive map of the identified hotspots at bit.ly/DBSCAN_clusters
Richard Steele is head of data analytics at OAC
Image credit | Getty | Shutterstock