**Paul Papenfus explains how a clustering approach can help when binning dynamic risk factor data**

Every day, more data is being recorded – and demand for useful insights from it is rising, too. For data analysis, the proliferation of open-source software is overtaking previous generations of costly single-use packages, opening opportunities to anyone with an internet connection and a willingness to learn. And yet in one aspect, little has changed: decision-makers still want to understand the likelihoods or probabilities of future events, along with the key drivers or risk factors that influence them. This calls for a frequency analysis that compares the number of claims against the corresponding proxy of possible claims – also known as the exposure to risk, or simply exposure.

In insurance, frequency analyses have been used for decades to estimate the probabilities of events such as death, illness and lapses. Assumption values are often structured into a fixed shape known as a basis. Populating the basis with new values every year is a simple, mechanical solution but it comes with no guarantees. In particular, if the existing shape fails to capture emerging features of the risk, then projections can fail to predict future claims due to changes in the business mix. Fortunately, data science techniques are making it possible to take a fresh look at existing basis structures, either to confirm that they remain fit for purpose or to suggest areas to reconsider.

**Frequency analysis for dynamic risk factors**

In a frequency analysis, understanding the impact of different risk factors usually involves separating the available data into cells, also known as bins, and comparing the claims versus exposure ratios between them. Categorical risk factors, such as smoker status, are relatively easy to separate because they have a limited number of possible values. Machine learning methods exist to isolate and rank the most descriptive factors.

For the purpose of this discussion, continuous risk factors can be separated into stationary and dynamic factors. For a particular risk, stationary risk factors – such as vehicle engine capacity or level of carbon dioxide emissions – do not change during the analysis period. Their treatment is similar to categorical factors, except that they also have a logical progression by size. Even continuous risk factors with a finite number of historical values, such as benefit level for a decreasing cover life insurance policy, can be modelled as being stationary if the experience for the different values is separated correctly.

A more interesting challenge is presented by dynamic risk factors such as age, policy duration and calendar year, where the exposure between two points has an infinite number of possible values. For example, as a policyholder ages, their age moves through values that can be measured in complete years, days, fractions of days or smaller increments. Dynamic risk factors have no natural separators or minimum increment sizes. The claims frequency for a dynamic risk factor only exists in relation to a chosen interval, such as ‘age from 30 to 30.3973’. This means a binning method needs to be defined, explicitly or implicitly, before frequencies can be observed.
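To make the interval dependence concrete, the sketch below computes a crude frequency for the interval 'age 30 to 30.3973'. The entry ages, exit ages and claim ages are made-up figures for illustration, not real experience:

```python
import numpy as np

# Hypothetical records: age at entry to and exit from observation for four
# policies, plus the exact ages at which claims occurred.
entry = np.array([28.0, 29.5, 30.1, 30.2])
exit_ = np.array([31.0, 30.2, 32.0, 30.3])
claim_ages = np.array([30.15, 30.9])

lo, hi = 30.0, 30.3973  # the chosen interval defines the frequency

# Exposure: total time the policies spend inside the interval (in years).
exposure = np.clip(np.minimum(exit_, hi) - np.maximum(entry, lo), 0, None).sum()

# Claims: those occurring inside the interval.
claims = int(np.sum((claim_ages >= lo) & (claim_ages < hi)))

print(f"claims={claims}, exposure={exposure:.4f}, "
      f"frequency={claims / exposure:.4f}")
```

Choosing a different interval changes both the exposure and the claim count, and therefore the observed frequency, which is why the binning decision has to come first.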

The choice of how to divide, or bin, a dynamic risk factor becomes circular: a natural desire is to choose bins that highlight key features of the experience in a way that does not suppress them through averaging, but the experience for each bin can only be calculated once the bins have been determined.

This article considers two existing approaches to binning in a probability analysis context, along with their challenges, and presents one data science-based alternative for use with dynamic risk factors. The focus is not machine learning itself, but the binning process that supports it. No machine learning technique can recover what has been averaged out during the binning stage. For this reason, a well-considered, impartial binning technique is essential for effective feature analysis.

**Binning by expert judgment**

The simplest form of binning is expert judgment based on knowledge of the domain. For example, historical knowledge of the retirement market, including the impact of regulations, could suggest sensible ranges for fund size to capture the known features. Even in the age of machine learning, the value of domain knowledge should not be underestimated, especially in portfolios with limited volumes of relevant data. However, there is a risk if consumer behaviour patterns change. Over time, existing bins could mask emerging features, especially for large, mature portfolios if the analysis fails to include a time element such as calendar year.

**Unit binning**

Unit binning is arguably the most common approach. Experience is separated into intervals of the same length, such as age at last birthday or policy duration in complete years. The simplicity is an advantage: an interval of a year provides a convenient separator that requires little further justification. Unit binning provides a reasonable all-round analysis, detecting global trends reasonably well. But it is also a one-size-fits-all approach. Most datasets have lots of data around the central region, with few or no claims around the edges. Bins in the middle of the spectrum tend to be very full, obscuring potentially important features, while bins around the edges tend to be dominated by random variation with hardly any data to stabilise the results. Ironically, evenly spaced bins generate an uneven distribution of data volumes.
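As an illustration, unit binning by complete policy years amounts to truncating each duration to its integer part. The durations below are simulated at random purely to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(2)
# Simulated data: 5,000 exposure units (durations in years, e.g. one per
# policy-day observed) and 60 claim durations, all between 0 and 8 years.
durations = rng.uniform(0, 8, 5000)
claim_durs = rng.uniform(0, 8, 60)

# Unit binning: the bin is the complete policy year, i.e. floor(duration).
exposure_per_year = np.bincount(durations.astype(int), minlength=8)
claims_per_year = np.bincount(claim_durs.astype(int), minlength=8)

# Crude frequency per bin: claims divided by exposure units in the bin.
frequency = claims_per_year / exposure_per_year
for year, f in enumerate(frequency):
    print(f"year {year}: frequency {f:.4f}")
```

In this simulation the exposure happens to be spread evenly across the years; in a real portfolio the central bins would be far fuller than those at the edges, which is exactly the imbalance described above.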

The danger of missing relevant features can be demonstrated with a fictional example. The probability of surrender of a savings policy could spike significantly in the first few months of the sixth policy year, and then drop away again sharply. This is illustrated in *Figure 1a*. It could be very valuable from a customer retention perspective to know about such a feature for further investigation. And yet an analysis that bins by policy duration in complete years is unlikely to detect the spike and drop, because the lapse frequency for the whole of the fifth year might look much like the neighbouring ones. A unit binning analysis, such as that shown in *Figure 1b*, is unlikely to isolate the busier and emptier lapse regions because it combines all the information inside the square. Although a previous understanding of customer behaviour is always valuable when describing emerging trends, an analysis technique can add even more value if it can identify features for further investigation without having to rely on domain knowledge. Unit binning does not necessarily offer this.

**Cluster binning**

Unit binning can be described as a top-down process because it slices all the available data into pre-determined bins. By contrast, cluster binning offers a bottom-up alternative that overcomes the problem of uneven data volumes. The process is demonstrated in *Figure 2*:

- Start with ‘nearest claim binning’: for each exposure unit, such as a day, determine the nearest claim. Although other options are possible, the notion of distance that is easiest to interpret is Euclidean distance: the distance between two points in n-dimensional space is the square root of the sum of the squared distances along each dimension. Each exposure unit is labelled with its nearest claim, and all exposure units with the same nearest claim define a bin. Step 1 therefore generates one bin for every claim, as shown in *Figure 2a* and *Figure 2b*.
- Although it is possible to analyse directly after Step 1, the crude claims frequencies in the bins might be too volatile to identify key features. To overcome this, combine the claims into the required number of clusters, chosen so that experience within a cluster is more similar than experience between clusters. Several methods are available, including k-means clustering.
- The exposure for each cluster is the combination of all exposure units whose nearest claims lie in the cluster. Each cluster is one bin. *Figure 2c* shows how two neighbouring groups in *Figure 2a* and *Figure 2b* fit together as two clusters.
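The steps above can be sketched in one dimension. The claim and exposure positions are simulated, and the tiny k-means routine is a minimal stand-in for a library implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated 1-D example: policy duration in years.
claims = np.sort(rng.uniform(0, 10, 20))   # durations at which claims occurred
exposure = np.linspace(0, 10, 1001)        # exposure units (roughly 'days')

# Step 1 - nearest claim binning. Brute force here for clarity; a k-d tree
# (e.g. scipy.spatial.cKDTree) avoids comparing every pair of points.
nearest = np.abs(exposure[:, None] - claims[None, :]).argmin(axis=1)

# Step 2 - combine the claims into k clusters with a minimal 1-D k-means.
def kmeans_1d(points, k, iters=50):
    centres = np.quantile(points, np.linspace(0, 1, k))  # spread-out start
    for _ in range(iters):
        labels = np.abs(points[:, None] - centres[None, :]).argmin(axis=1)
        centres = np.array([points[labels == j].mean() if np.any(labels == j)
                            else centres[j] for j in range(k)])
    return labels

claim_cluster = kmeans_1d(claims, k=4)

# Step 3 - each exposure unit inherits the cluster of its nearest claim,
# so each cluster of claims becomes one bin of exposure.
exposure_cluster = claim_cluster[nearest]

for j in range(4):
    print(f"cluster {j}: {np.sum(claim_cluster == j)} claims, "
          f"{np.sum(exposure_cluster == j)} exposure units")
```

Note that the exposure assignment in Step 1 never needs rerunning: trying a different number of clusters only repeats Steps 2 and 3.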

Finding the nearest claim in Step 1 can be computationally intensive and slow if performed in a brute force way by comparing the distance between every possible pair of points. Fortunately, data science packages benefit from algorithms that speed up this process. An added benefit of nearest claim binning is that Step 1, which deals with the exposure, is run only once, regardless of the number of clusters ultimately chosen in Step 2.

**Hyperparameter optimisation**

Cluster binning offers an impartial, automated way to separate experience in continuous dimensions into bins with more even data volumes. As with many machine learning applications, the improvement in the power to isolate useful features comes at the cost of having to make choices in new areas – in this case, the choice of the number of clusters. This is an example of a hyperparameter, a value used to control the learning process. Although a comprehensive list of possible tests will not fit in this article, there are a few descriptive statistics that could help with making the choice:

- **Old-fashioned expert judgment.** Familiarity with the subject matter can guide the choice of bins to strike a balance between better feature isolation and lower overfitting risk. Domain knowledge continues to be valuable, even in a process that automates some of the work.
- **The standard deviation** of the experience in each bin is an indication of the statistical homogeneity of the information inside the bin. Bins with high standard deviations might suggest that too many bins are being used, or that the bins are combining data points that do not really belong together.
- **The average distance between claims** in each cluster reduces as the number of clusters increases, but the reduction becomes more marginal once there are many clusters. When analysing by number of clusters, it may be possible to find the earliest point at which the average distance ceases to reduce significantly. This is sometimes called the elbow method because, when plotting the average distance by number of clusters, the point where the rate of reduction changes most sharply resembles the bend of an elbow.
- **Sensitivity to number of bins.** When checking for the data’s interesting features, it can be helpful to vary the number of bins to ensure that a perceived feature persists for any sensible number. This mitigates the risk of overfitting to features that might turn out to be random noise. One particularly important dimension to consider is time, measured in calendar days or months: sense checks are needed to ensure a perceived feature remains significant, and relevant for predicting future experience, regardless of how many time-related bins are used.
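The elbow test can be illustrated with simulated claim positions. The data below is generated with two deliberately dense regions so that the curve bends visibly, and the minimal 1-D k-means again stands in for a library implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated claim durations with two dense regions around 2 and 7 years.
claims = np.concatenate([rng.normal(2, 0.3, 30), rng.normal(7, 0.3, 30)])

def kmeans_1d(points, k, iters=50):
    centres = np.quantile(points, np.linspace(0, 1, k))
    for _ in range(iters):
        labels = np.abs(points[:, None] - centres[None, :]).argmin(axis=1)
        centres = np.array([points[labels == j].mean() if np.any(labels == j)
                            else centres[j] for j in range(k)])
    return labels, centres

# Average distance from each claim to its cluster centre, by number of clusters.
avg_dist = []
for k in range(1, 8):
    labels, centres = kmeans_1d(claims, k)
    avg_dist.append(np.abs(claims - centres[labels]).mean())

for k, d in enumerate(avg_dist, start=1):
    print(f"k={k}: average distance {d:.3f}")
```

With this data the average distance falls sharply from one cluster to two and then flattens, suggesting two clusters – the elbow – which matches the two dense regions built into the simulation.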

**Actuaries in data science**

Data science techniques undoubtedly have the potential to help companies monetise their data. Many open-source tools, such as Python and R, are free to use and have large and active user communities. This often means that the solutions to even very specific problems are easily available using a search engine. Another advantage of this scale is that even paid-for software developers design their offerings with existing user communities in mind, limiting the time investment for any user looking to add something new to what is already familiar.

Although they are by no means alone, actuaries appear well placed to get more involved. In many areas, including insurance, they can add their domain knowledge – such as when binning by expert judgment, as outlined earlier.

Their approach to analysis and reliability could be a key differentiator in all areas, as long as the premium they charge continues to be justifiable.

*The charts and underlying analysis for this article were produced in Python using Jupyter Notebook. The code is available at github.com/paulpapenfus/a/blob/main/MB1.ipynb*

**Paul Papenfus **is an actuary at LV