"Precision parameters: clustering approaches when binning dynamic risk factor data"

Wednesday 2nd June 2021
Authors
Paul Papenfus

Paul Papenfus explains how a clustering approach can help when binning dynamic risk factor data


Every day, more data is being recorded – and demand for useful insights from it is rising, too. In data analysis, open-source software is displacing previous generations of costly single-purpose packages, opening opportunities to anyone with an internet connection and a willingness to learn. And yet in one respect, little has changed: decision-makers still want to understand the probabilities of future events, along with the key drivers or risk factors that influence them. This calls for a frequency analysis that compares the number of claims against a corresponding measure of the possible claims – also known as the exposure to risk, or simply exposure.

In insurance, frequency analyses have been used for decades to estimate the probabilities of events such as death, illness and lapses. Assumption values are often structured into a fixed shape known as a basis. Populating the basis with new values every year is a simple, mechanical solution, but it comes with no guarantees. In particular, if the existing shape fails to capture emerging features of the risk, then projections can fail to predict future claims as the business mix changes. Fortunately, data science techniques are making it possible to take a fresh look at existing basis structures, either to confirm that they remain fit for purpose or to suggest areas to reconsider.

Frequency analysis for dynamic risk factors

In a frequency analysis, understanding the impact of different risk factors usually involves separating the available data into cells, also known as bins, and comparing the claims-to-exposure ratios between them. Categorical risk factors, such as smoker status, are relatively easy to separate because they have a limited number of possible values. Machine learning methods exist to isolate and rank the most descriptive factors.
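
As a rough illustration of the mechanics, the sketch below computes crude claims frequencies for a categorical factor using pandas; the data and column names are invented for the example.

```python
import pandas as pd

# Invented experience data: one row per policy record, with the exposure
# contributed (in years) and the number of claims observed.
df = pd.DataFrame({
    "smoker_status": ["N", "N", "S", "S", "N", "S"],
    "exposure":      [1.0, 0.5, 1.0, 0.75, 1.0, 0.25],
    "claims":        [0,   0,   1,   0,    1,   0],
})

# Crude claims frequency per bin: total claims divided by total exposure.
summary = df.groupby("smoker_status")[["claims", "exposure"]].sum()
summary["frequency"] = summary["claims"] / summary["exposure"]
print(summary)
```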

For the purpose of this discussion, continuous risk factors can be separated into stationary and dynamic factors. For a particular risk, stationary risk factors – such as vehicle engine capacity or level of carbon dioxide emissions – do not change during the analysis period. Their treatment is similar to categorical factors, except that they also have a logical progression by size. Even continuous risk factors with a finite number of historical values, such as the benefit level of a decreasing-cover life insurance policy, can be modelled as stationary if the experience for the different values is separated correctly.

A more interesting challenge is presented by dynamic risk factors such as age, policy duration and calendar year, where the exposure between two points has an infinite number of possible values. For example, as a policyholder ages, their age moves through values that can be measured in complete years, days, fractions of days or smaller increments. Dynamic risk factors have no natural separators or minimum increment sizes. The claims frequency for a dynamic risk factor only exists in relation to a chosen interval, such as ‘age from 30 to 30.3973’. This means a binning method needs to be defined, explicitly or implicitly, before frequencies can be observed.

The choice of how to divide, or bin, a dynamic risk factor becomes circular: a natural desire is to choose bins that highlight key features of the experience in a way that does not suppress them through averaging, but the experience for each bin can only be calculated once the bins have been determined.

This article considers two existing approaches to binning in a frequency analysis context, along with their challenges, and presents one data science-based alternative for use with dynamic risk factors. The focus is not machine learning itself, but the binning process that supports it. No machine learning technique can recover what has been averaged out during the binning stage. For this reason, a well-considered, impartial binning technique is essential for effective feature analysis.

Binning by expert judgment

The simplest form of binning is expert judgment based on knowledge of the domain. For example, historical knowledge of the retirement market, including the impact of regulations, could suggest sensible ranges for fund size to capture the known features. Even in the age of machine learning, the value of domain knowledge should not be underestimated, especially in portfolios with limited volumes of relevant data. However, there is a risk if consumer behaviour patterns change. Over time, existing bins could mask emerging features, especially for large, mature portfolios if the analysis fails to include a time element such as calendar year.
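
In code, binning by expert judgment simply means applying hand-chosen breakpoints. The sketch below uses pandas, with hypothetical fund-size thresholds standing in for genuine domain knowledge.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
fund_size = rng.lognormal(mean=10.5, sigma=1.0, size=10_000)  # simulated funds

# Hand-chosen edges (hypothetical): in practice these would reflect known
# market features, such as regulatory or product thresholds.
edges = [0, 10_000, 30_000, 100_000, 250_000, np.inf]
labels = ["<10k", "10k-30k", "30k-100k", "100k-250k", "250k+"]
fund_band = pd.cut(fund_size, bins=edges, labels=labels)
print(pd.Series(fund_band).value_counts(sort=False))
```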

Unit binning

Unit binning is arguably the most common approach. Experience is separated into intervals of the same length, such as age at last birthday or policy duration in complete years. The simplicity is an advantage: an interval of a year provides a convenient separator that requires little further justification. Unit binning provides a reasonable all-round analysis, detecting global trends reasonably well. But it is also a one-size-fits-all approach. Most datasets have lots of data around the central region, with few or no claims around the edges. Bins in the middle of the spectrum tend to be very full, obscuring potentially important features, while bins around the edges tend to be dominated by random variation with hardly any data to stabilise the results. Ironically, evenly spaced bins generate an uneven distribution of data volumes.
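
A minimal sketch of unit binning on simulated data, which also makes the uneven bin volumes visible (the ages, claim rates and column names are invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5_000
# Simulated experience: exact ages cluster around the middle of the
# portfolio, with sparse data at the extremes.
age = rng.normal(55, 12, n).clip(20, 95)
claims = rng.binomial(1, 0.01 * np.exp((age - 55) / 15))  # toy claim rates
df = pd.DataFrame({"age": age, "exposure": 1.0, "claims": claims})

# Unit binning: age last birthday, i.e. fixed intervals of one year.
df["age_band"] = np.floor(df["age"]).astype(int)
by_band = df.groupby("age_band")[["claims", "exposure"]].sum()
by_band["frequency"] = by_band["claims"] / by_band["exposure"]

# Evenly spaced bins, unevenly filled: central bins hold most of the
# data, while the edge bins contain very little exposure.
print(by_band["exposure"].describe())
```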

The danger of missing relevant features can be demonstrated with a fictional example. It is not impossible for the probability of surrender of a savings policy to spike significantly in the first few months of the sixth policy year, and then to drop away again sharply. This is illustrated in Figure 1a. Knowing about such a feature could be very valuable from a customer retention perspective, and would warrant further investigation. And yet an analysis that bins by policy duration in complete years is unlikely to detect the spike and drop, because the lapse frequency for that policy year as a whole might look much like its neighbours. A unit binning analysis, such as that shown in Figure 1b, is unlikely to isolate the busier and emptier lapse regions because it combines all the information inside the square. Although a previous understanding of customer behaviour is always valuable when describing emerging trends, an analysis technique adds even more value if it can identify features for further investigation without having to rely on domain knowledge. Unit binning does not necessarily offer this.

 

Figure 1: Simulated force of lapse and unit binning. A: Spike and drop between durations 5 and 6. B: Simulated claims.
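
The spike-and-drop effect is easy to reproduce in a toy simulation. The sketch below uses invented monthly hazard rates, not the analysis behind Figure 1 (the author's original notebook is linked at the end of the article); note how yearly binning hides the feature.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Toy monthly force of lapse over 10 policy years: flat at 1% per month,
# with a spike early in the sixth policy year followed by a sharp drop.
hazard = np.full(120, 0.010)
hazard[60:64] = 0.030   # spike: first four months of the year
hazard[64:72] = 0.000   # drop: remainder of the year

# Simulate the first lapse month for each policy (censored at 120 months).
n = 100_000
u = rng.random((n, 120))
hit = u < hazard                      # potential lapse event in each month
lapse_month = np.where(hit.any(axis=1), hit.argmax(axis=1), -1)

# Unit binning by complete policy year averages the spike and drop away:
lapsed = lapse_month[lapse_month >= 0]
print(pd.Series(lapsed // 12).value_counts().sort_index())
# The spike year looks much like its neighbours, because the spike and
# the drop offset each other within the year.
```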

 

Cluster binning

Unit binning can be described as a top-down process because it slices all the available data into pre-determined bins. By contrast, cluster binning offers a bottom-up alternative that overcomes the problem of uneven data volumes. The process is demonstrated in Figure 2:

  1.  Start with ‘nearest claim binning’: for each exposure unit, such as a day, determine the nearest claim. Although other options are possible, the notion of distance that is easiest to interpret is Euclidean distance: the distance between two points in n-dimensional space is given by the square root of the sum of the squared distances along each dimension. Each exposure unit is labelled with its nearest claim. All exposure units with the same nearest claim define a bin. Step 1 generates one bin for every claim, as shown in Figures 2a and 2b.

  2.  Although it is possible to analyse directly after Step 1, the crude claims frequencies in the bins might be too volatile to identify key features. To overcome this, combine the claims into the required number of clusters. The clusters can be chosen so that experience in the same cluster is more similar than experience between clusters. Several methods are available, including k-means clustering.

  3.  The exposure for each cluster is the combination of all exposure units that have nearest claims in the cluster. Each cluster is one bin. Figure 2c shows how two neighbouring groups in Figure 2a and Figure 2b fit together as two clusters.

 

Figure 2: Neighbouring clusters A) and B) with nearest decrement shading, combined in C).

 

Finding the nearest claim in Step 1 can be computationally intensive and slow if performed in a brute-force way, comparing the distance between every possible pair of points. Fortunately, data science packages offer algorithms that speed up this search. An added benefit of nearest claim binning is that Step 1, which deals with the exposure, is run only once, regardless of the number of clusters ultimately chosen in Step 2.
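
Putting the three steps together, the sketch below assumes scikit-learn's NearestNeighbors and KMeans on simulated two-dimensional data; the dimensions and parameter choices are illustrative only, not the author's published implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)

# Simulated two-dimensional experience: each exposure unit is an
# (age, duration) point; claims are points in the same space.
exposure_pts = rng.uniform([30, 0], [70, 10], size=(50_000, 2))
claim_pts = rng.uniform([30, 0], [70, 10], size=(500, 2))

# Step 1 - nearest claim binning: label each exposure unit with the index
# of its nearest claim (Euclidean distance). A tree-based index avoids
# the brute-force comparison of every exposure/claim pair.
nn = NearestNeighbors(n_neighbors=1).fit(claim_pts)
_, nearest_claim = nn.kneighbors(exposure_pts)
nearest_claim = nearest_claim.ravel()

# Step 2 - combine the claims into the required number of clusters.
k = 25  # hyperparameter: the number of bins
claim_cluster = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(claim_pts)

# Step 3 - each cluster is one bin: its exposure is the union of all
# exposure units whose nearest claim belongs to the cluster.
exposure_cluster = claim_cluster[nearest_claim]
claims_per_bin = np.bincount(claim_cluster, minlength=k)
exposure_per_bin = np.bincount(exposure_cluster, minlength=k)
print(claims_per_bin / exposure_per_bin)  # crude frequency per bin
```

Because the nearest-claim labels from Step 1 do not depend on the number of clusters, trying a different value of k only means re-running the k-means step.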

Hyperparameter optimisation

Cluster binning offers an impartial, automated way to separate experience into evenly populated bins across continuous dimensions. As with many machine learning applications, the improvement in the power to isolate useful features comes at the cost of having to make choices in new areas – in this case, the choice of the number of clusters. This is an example of a hyperparameter: a value used to control the learning process. Although a comprehensive list of possible tests will not fit in this article, there are a few descriptive statistics that could help with making the choice:

  • Old-fashioned expert judgment. Familiarity with the subject matter can guide the choice of bins to strike a balance between better feature isolation and lower overfitting risk. Domain knowledge continues to be valuable, even in a process that automates some of the work.

  • The standard deviation of the experience in each bin is an indication of the statistical homogeneity of the information inside the bin. Evidence of bins with high standard deviations might suggest that too many bins are being used, or that the bins are combining data points which do not really belong together.

  • The average distance between claims in each cluster reduces as the number of clusters increases, but the reduction becomes more marginal once there are many clusters. When analysing by number of clusters, it may be possible to find the earliest point at which the average distance ceases to reduce significantly. This is an example of what is sometimes called the elbow method: when plotting the average distance by number of clusters, the point where the steep decrease levels off resembles the bend of an elbow (see the sketch after this list).

  • Sensitivity to number of bins. When checking for the data’s interesting features, it could be helpful to vary the number of bins to ensure that a perceived feature persists for any sensible number. This mitigates the risk of over-fitting to features that might turn out to be random noise. For example, one particularly important dimension to consider is time as measured in calendar days or months. Sense checks are needed to ensure a perceived feature continues to be significant, and relevant for predicting future experience, regardless of how many time-related bins are used.
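
As an illustration of the elbow diagnostic, the sketch below assumes scikit-learn and uses KMeans's inertia_ (the sum of squared distances to the nearest cluster centre) as a convenient stand-in for average distance:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
claim_pts = rng.uniform([30, 0], [70, 10], size=(500, 2))  # simulated claims

# Elbow diagnostic: root-mean-square distance from each claim to its
# cluster centre, for a range of candidate cluster counts.
for k in range(5, 55, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(claim_pts)
    rms_dist = np.sqrt(km.inertia_ / len(claim_pts))
    print(f"k={k:2d}  RMS distance to centre: {rms_dist:.3f}")

# Plotted against k, the point where the steep decrease levels off (the
# 'elbow') suggests a sensible number of clusters.
```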

Actuaries in data science

Data science techniques undoubtedly have the potential to help companies monetise their data. Many open-source languages and tools, such as Python and R, are free to use and have large and active user communities. This often means that solutions to even very specific problems are easy to find with a search engine. Another advantage of this scale is that even developers of paid-for software design their offerings with existing user communities in mind, limiting the time investment for any user looking to add something new to what is already familiar.

Although they are by no means alone, actuaries appear well placed to get more involved. In many areas, including insurance, they can add their domain knowledge – such as when binning by expert judgment, as outlined earlier.

Their approach to analysis and reliability could be a key differentiator in all areas, as long as the premium they charge continues to be justifiable. 

The charts and underlying analysis for this article were produced in Python using Jupyter Notebook. The code is available at github.com/paulpapenfus/a/blob/main/MB1.ipynb

Paul Papenfus is an actuary at LV

This article appeared in the June 2021 issue of The Actuary.