The top tool of data clustering

Richard Steele, Thursday 6th April 2023
Richard Steele describes how ‘clustering’ techniques can be a useful part of any data analysis toolkit

When analysing data, it can be useful to allocate datapoints into groups with similar characteristics – ‘clustering’. This allows actuaries to better understand what attributes datapoints share, as well as how they differ. Useful applications include exploratory data analysis, consumer segmentation and outlier detection.

There are many clustering algorithms, and they commonly work by evaluating the theoretical distances between the numeric values of datapoints. When dealing with many dimensions, this concept can be hard to visualise, so it is easiest to observe in two dimensions.

To show the concept here, we used STATS19 data on fatal traffic accidents in Britain (bit.ly/Road_safety_data).

K-means clustering

If we simply plot longitude and latitude co-ordinates, we obtain a quick rendering of accident locations.

Using this two-dimensional dataset, we can attempt some simple clustering methods. K-means is a commonly used algorithm and one of the simplest introductory concepts. We specify a required number of clusters (‘k’) – in this example, it’s 10. Figure 1 shows the 10 central points (‘centroids’) derived by the algorithm, with each accident assigned to its nearest centroid.
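As a rough illustration of the assign-then-update loop behind k-means, the sketch below uses only the Python standard library. The function name `kmeans`, the toy co-ordinates and the choice of k = 2 are assumptions made for this example, not the code or data behind Figure 1; in practice a library implementation (for example the `KMeans` class in scikit-learn) would normally be used.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Basic k-means on 2-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from k distinct datapoints chosen at random.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                xs, ys = zip(*members)
                new_centroids.append((sum(xs) / len(xs), sum(ys) / len(ys)))
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # centroids stopped moving, so the loop has converged
        centroids = new_centroids
    return centroids, labels


# Hypothetical toy data: two tight groups of (x, y) points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

Every point receives a label, however far it sits from its centroid, which is why outliers can pull centroids around.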

This simple algorithm has some drawbacks. Clusters tend towards ball shapes, evenly distributed around the centroid, so the algorithm cannot handle more irregularly shaped clusters. Every datapoint is used, meaning clusters can be influenced by outliers. There is also a requirement to specify the number of clusters in advance – something that may not be known.

Figure 1 also shows South Wales and South-West England being assigned to the same cluster, based purely on proximity to a shared centroid that falls close to the Bristol Channel.

We can see, therefore, that this method may be unsuitable for certain scenarios.

Density-based clustering

For our case study, it may be more appropriate to locate clusters based on accident frequency within a region. The algorithm Density-Based Spatial Clustering of Applications with Noise (DBSCAN) allows us to identify clusters using data density. Density is specified by two input parameters: a distance (‘epsilon’) and the minimum number of datapoints that must lie within that distance for a point to be clustered.

In high-dimensional data, setting epsilon can require some judgment and expertise. In two-dimensional space, epsilon defines the radius of a circle. In our case study, accuracy requires the application of a conversion formula to allow for Earth’s curvature.
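The article does not say which conversion formula was applied. One common choice for great-circle distance between latitude and longitude pairs is the haversine formula, sketched below on the assumption of a spherical Earth with a mean radius of 6,371km. The function name `haversine_km` and the constant name are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# One degree of latitude along a meridian, as a sanity check.
one_degree_km = haversine_km(0.0, 0.0, 1.0, 0.0)
print(round(one_degree_km, 1))
```

With a distance function like this, epsilon can be stated directly in kilometres rather than in degrees of longitude and latitude, whose ground length varies with location.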

Suppose we are interested in areas with at least 10 accidents, and we initially set an arbitrary value of epsilon. Figure 2 shows two effects. First, the resulting clusters strip out datapoints where there are fewer than 10 incidents per defined region, filtering out the sparse outliers. Second, in an effect commonly observed with DBSCAN, clusters are not constrained in size or shape but are defined and joined by the density of the datapoints: they span out until density falls below the specified threshold. This expansion effect is known as ‘chaining’.

We observe one main cluster that includes Central London expanding east to the Thames estuary and north via the M1 motorway. If we increase the minimum sample size to 25, we identify smaller, denser clusters (Figure 3).
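The behaviour described above (sparse points treated as noise, clusters chaining outwards, and denser clusters surviving a higher minimum) can be reproduced on toy data. The following is a minimal standard-library sketch of the DBSCAN idea, not the code behind Figures 2 and 3. It uses plain Euclidean distance, counts a point as its own neighbour, and the names `dbscan`, `eps` and `min_samples` are chosen for this example.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id, or NOISE if it is too isolated."""
    def neighbours(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not yet visited"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not dense
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_samples:
                queue.extend(nbrs)  # j is dense too, so keep expanding
        cluster += 1
    return labels


# Hypothetical toy data: a sparse chain, a denser blob and one outlier.
chain = [(float(x), 0.0) for x in range(6)]
blob = [(20.0, 20.0), (20.0, 21.0), (21.0, 20.0), (21.0, 21.0)]
outlier = [(50.0, 50.0)]
points = chain + blob + outlier

labels_min3 = dbscan(points, eps=1.5, min_samples=3)
labels_min4 = dbscan(points, eps=1.5, min_samples=4)
print(labels_min3)
print(labels_min4)
```

With a minimum of three, the whole chain forms one cluster even though its end points are far more than `eps` apart, and the outlier is left as noise. Raising the minimum to four leaves only the denser blob clustered.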

This parameter modification is analogous to ‘raising the sea level’ to leave only the densest ‘islands’ (clusters) above water, while less dense areas get ‘submerged’. We can vary parameters until we find a density appropriate for our purpose. When we specified regions with three or more fatal accidents per approximate half-kilometre radius, the algorithm identified several hotspots. The area with the most fatal accidents was a region of Bradford that had seven incidents over the period (Figure 4).

However, we should be cautious before drawing conclusions from this analysis and labelling this area of Bradford ‘most deadly’. The example ignores the number of deaths per incident. In addition, higher volumes of traffic naturally result in a greater number of incidents, so a quieter road with fewer than seven deaths may be considered riskier per car-mile travelled.
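The exposure point can be made concrete with a small calculation. All figures below are hypothetical and chosen only to illustrate the arithmetic; they are not taken from STATS19 or from any traffic survey.

```python
def incidents_per_million_vehicle_miles(incidents, vehicle_miles):
    """Incident frequency normalised by exposure (vehicle-miles travelled)."""
    return incidents * 1_000_000 / vehicle_miles


# Hypothetical exposure figures, for illustration only.
busy_road = incidents_per_million_vehicle_miles(7, 50_000_000)
quiet_road = incidents_per_million_vehicle_miles(3, 5_000_000)

print(busy_road, quiet_road)
```

In this made-up case the quiet road has fewer incidents but a higher rate per vehicle-mile, so a ranking by raw counts and a ranking by exposure-adjusted rate would disagree.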

Adjust the algorithm

There are many clustering algorithms, each with its own strengths and weaknesses. The most appropriate algorithm will depend on the purpose of the analysis and the features of the data. Analysis can be conducted within many disciplines, from traditional actuarial work to areas of wider public interest (as in our example on car accidents). In many scenarios, other algorithms may be more appropriate than DBSCAN.

However, given that libraries are readily accessible in Python and R, density-based algorithms such as DBSCAN should form a valuable part of the overall data analysis toolkit.

See an interactive map of the identified hotspots at bit.ly/DBSCAN_clusters

Richard Steele is head of data analytics at OAC

Image credit | Getty | Shutterstock

This article appeared in our April 2023 issue of The Actuary.
© 2023 The Actuary. The Actuary is published on behalf of the Institute and Faculty of Actuaries by Redactive Publishing Limited. All rights reserved. Reproduction of any part is not allowed without written permission.
