Sen Hu and Adrian O’Hagan investigate how cluster analysis with copulas can improve insurance claims forecasting

Machine learning has increasingly become a tool for actuaries in the era of big data, and the idea of actuaries teaming up with data scientists has been continually debated by industry leaders. In a nutshell, machine learning is a subfield of artificial intelligence that provides suites of algorithms and models for computers to learn from data, helping them find patterns and therefore make inferences, decisions or predictions.
Machine learning is a rather broad term that includes various approaches. For example, logistic regression (within generalised linear models) is a classic example of a machine learning classification algorithm. One field of particular use for actuaries is cluster analysis. How can cluster analysis, together with the copula approach, improve insurance claims forecasting?
Clustering methods
Cluster analysis has long been a popular technique within statistical data analysis and machine learning, helping to uncover group structures in data. It groups objects in such a way that objects in the same group (‘cluster’) are relatively more similar to each other than to those in other groups. In an actuarial setting, it has been used in applications such as insurance product marketing, and variable annuity valuation and ratemaking.
There are many clustering algorithms available, commonly categorised as either partitional or hierarchical. Partitional methods, most notably k-means, segregate observations into a required number of clusters by optimising a chosen similarity measure. Hierarchical methods create a hierarchical decomposition of the observations, forming a tree-like structure that progressively splits the dataset into smaller subsets. We focus on model-based clustering, a partitional approach that represents each cluster with a probability distribution rather than a single datapoint, providing a more theoretically sound statistical framework.
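To make the distinction concrete, here is a minimal sketch (not the authors' code) contrasting k-means with a model-based clustering of the same bivariate data, assuming scikit-learn and NumPy are available; the data and settings are purely illustrative.

```python
# Minimal sketch: partitional k-means versus model-based (Gaussian mixture)
# clustering. Assumes scikit-learn and NumPy; the data are synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Two synthetic clusters of bivariate observations (illustrative only).
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[3.0, 3.0], scale=1.0, size=(200, 2)),
])

# k-means: each cluster is summarised by a single centroid.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)

# Model-based clustering: each cluster is a full probability distribution
# (here Gaussian), giving soft membership probabilities for every point.
gmm = GaussianMixture(n_components=2, random_state=1).fit(X)
gmm_labels = gmm.predict(X)
membership = gmm.predict_proba(X)   # posterior cluster probabilities

print(membership[:3].round(3))
```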
“By segregating policyholders’ data via cluster analysis, dependence structures within clusters can be amplified”
Risks are dependent and heterogeneous across categories
Cluster analysis is especially useful for actuaries dealing with dependent risks, although it can also be used for individual risks. Dependence is reflected in the fact that information about one risk category tells us something about the likely distribution of other risk values. Insurance companies, in particular, should investigate such dependencies between different lines of business, and the effects that an extreme loss event has across multiple lines, when assessing multiple risks simultaneously. Copulas provide a convenient approach for this.
The risk heterogeneity commonly present in an insurance portfolio also needs to be addressed. For example, in general insurance some policyholders are more prone to making multiple claims, but their claim sizes are usually small compared with those of other policyholders who represent higher risks overall. Risk heterogeneity can also be caused by different behaviour and attitudes – for example, the differing attitudes of motor insurance policyholders towards driving. As a result, an important part of claims forecasting is risk classification, which groups policies into clusters that share more homogeneous risk potential.
One consequence of risk heterogeneity is that the empirical joint claims data are usually very dispersed, so the data may present only weak correlation overall. This could be used as justification for modelling the risks independently, without considering the dependencies among them, meaning the underlying claims structure is ignored. However, by segregating policyholders’ data using cluster analysis, dependence structures within clusters can be amplified, and different modelling strategies can be implemented to suit different clusters.
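A small simulation can illustrate the dilution effect described above. The sketch below (ours, with made-up numbers and a simple bivariate normal as a stand-in for dependent claim amounts) shows how strong dependence inside one cluster can be masked when it is pooled with a more dispersed, weakly dependent cluster; it assumes NumPy and SciPy.

```python
# Illustrative simulation (not real data): strong dependence within one
# cluster can be masked once it is pooled with a dispersed, weakly
# dependent cluster. Requires NumPy and SciPy.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(7)

def bivariate_normal(n, corr, scale):
    # Simple stand-in for a pair of dependent claim amounts.
    cov = [[scale**2, corr * scale**2], [corr * scale**2, scale**2]]
    return rng.multivariate_normal([0.0, 0.0], cov, size=n)

cluster_a = bivariate_normal(500, corr=0.8, scale=1.0)   # strongly dependent
cluster_b = bivariate_normal(500, corr=0.0, scale=4.0)   # dispersed, weak dependence
pooled = np.vstack([cluster_a, cluster_b])

for name, data in [("cluster A", cluster_a), ("cluster B", cluster_b),
                   ("pooled", pooled)]:
    tau, _ = kendalltau(data[:, 0], data[:, 1])
    print(f"{name:9s} Kendall's tau = {tau:.2f}")
```

In this toy setting the pooled sample shows a much weaker Kendall's tau than cluster A on its own, which is exactly the pattern that motivates clustering before modelling dependence.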
Cluster analysis with copulas
For simplicity, let’s look at scenarios involving two risks. When modelling such bivariate risk perils simultaneously, bivariate distributions such as the bivariate Poisson naturally come to mind. However, although bivariate distributions are a natural extension of their univariate counterparts, they can be restrictive and conceptually challenging, owing to their potentially complex specification and implementation (especially in higher dimensions). There is also only a limited number of options that suit the required data characteristics. For example, bivariate gamma distributions would be well suited to bivariate claim severity modelling, yet they are complex and not commonly used.
Copulas are a popular choice for analysing the dependence between risks in joint claims modelling. A copula is a distribution function of random variables with uniform marginals; it contains all the information on the dependence structure between the risks represented by the marginals, while the marginal distribution functions contain all the information on the individual risks. This gives copulas their key advantage: they allow the marginals and the dependence structure to be modelled separately. Thanks to the rich variety of existing copulas, a wide range of flexible dependence structures is possible.
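To make the separation of marginals and dependence concrete, the hedged sketch below builds a bivariate claim severity distribution from two gamma marginals joined by a Gumbel copula. It assumes the copula tools in statsmodels (version 0.13 or later); all parameter values are illustrative rather than fitted.

```python
# Sketch: joining two gamma marginals with a Gumbel copula, so that the
# marginals and the dependence structure are specified separately.
# Assumes statsmodels >= 0.13 and SciPy; all parameter values are made up.
from scipy import stats
from statsmodels.distributions.copula.api import GumbelCopula, CopulaDistribution

# Marginal severity distributions for the two risks (illustrative shapes/scales).
marginal_ad = stats.gamma(a=2.0, scale=1500.0)   # accidental damage
marginal_tp = stats.gamma(a=1.5, scale=2500.0)   # third-party property damage

# Dependence structure, chosen independently of the marginals.
copula = GumbelCopula(theta=2.0)                 # allows upper tail dependence

joint = CopulaDistribution(copula, [marginal_ad, marginal_tp])

# Simulate joint claim severities from the combined model.
sims = joint.rvs(5, random_state=42)
print(sims.round(2))
```

Swapping the Gumbel copula for, say, a Frank or Clayton copula changes only the dependence structure; the gamma marginals, and any covariate effects built into them, stay untouched.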
Because of this joint risk heterogeneity, it is easy to envisage different dependence features among the policies in a portfolio. To capture this within the finite mixture models used for model-based clustering, we can employ a finite mixture of copulas, segregating policies according to the different dependence structures present in the data, each represented by its own copula; together they constitute a more flexible dependence structure overall.
Because copulas are themselves distribution functions governed by certain parameter(s), a finite mixture of copulas can be expressed in a fashion similar to a standard finite mixture model. For our claims modelling using parametric methods, estimation of the copula depends on estimation of the marginal distributions. Furthermore, other independent predictors (covariates), such as characteristics of the policies, can be incorporated into the marginals via generalised linear model (GLM) frameworks to improve both marginal and copula estimation; this is called copula regression. In the finite mixture of copulas setting, we can further use a GLM framework to allow the mixing proportions to depend on covariates, in order to better identify the cluster to which each observation belongs. In machine learning, such a model is known as a mixture of experts.
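Written out for the bivariate case, one way to express such a finite mixture of copula regressions (our notation, consistent with the description above) is

\[
f(y_1, y_2 \mid x) \;=\; \sum_{g=1}^{G} \tau_g(x)\,
  c_g\big(F_{1g}(y_1 \mid x),\, F_{2g}(y_2 \mid x);\, \theta_g\big)\,
  f_{1g}(y_1 \mid x)\, f_{2g}(y_2 \mid x),
\]

where \(c_g\) is the copula density of cluster \(g\) with parameter \(\theta_g\), \(F_{jg}\) and \(f_{jg}\) are the marginal distribution and density functions (for instance gamma marginals whose means depend on the covariates \(x\) through a GLM link), and \(\tau_g(x)\) are the mixing proportions, which in the mixture-of-experts setting depend on \(x\) through, for example, a multinomial logistic model.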
Suppose we have a sample of real-world motor insurance claim severity data for accidental damage and third-party property damage risks, together with some characteristics of the policies. Cluster analysis using a finite mixture of copulas with covariates, and univariate gamma distributions as marginals, leads to the clustering result in Figure 1. One cluster (in orange) captures the medium claim sizes, where the dependence is modelled using a Gumbel copula, while the other cluster (in teal) accounts for the very small and very large claim sizes with a Frank copula. This clustering reflects the fact that many more policies lead to medium-sized claims (the dense scatter cloud in the middle) than to very small or very large claims.
Through such cluster analysis, a better understanding of the risk structure is achieved, in which each cluster shows a more prominent dependence structure. Each cluster is characterised by a copula that explains various aspects of dependence, leading to better estimation of dependence features such as tail dependence. Future claims forecasting can be achieved because predictors are incorporated via GLM frameworks, similarly to univariate claims modelling via standard GLMs. This model can therefore be regarded as a finite mixture of copula regressions for predictive analysis. Furthermore, once clusters are identified, different models can be fitted based on each cluster’s characteristics for claims forecasting. Cluster analysis with copulas can thus not only identify different claim behaviours and high- or low-risk policies, but also provide claims forecasting that takes each cluster’s dependence into account.
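As a final illustration of how forecasts come out of such a model, the sketch below computes the mixture-weighted expected claim severities for a new policy from an already fitted two-cluster model. Every fitted quantity in it (the gating coefficients and the per-cluster gamma GLM coefficients) is a hypothetical placeholder, not an estimate from the data behind Figure 1.

```python
# Sketch: forecasting expected claim severities for a new policy from a
# fitted two-cluster mixture of copula regressions. All "fitted" values
# below are hypothetical placeholders, purely for illustration.
import numpy as np

def expected_severity(x, mixing_coefs, marginal_coefs):
    """Mixture-weighted expected claim for each risk, given covariates x."""
    # Gating model (mixture of experts): probability of cluster 1 vs cluster 2.
    eta = mixing_coefs @ x
    tau1 = 1.0 / (1.0 + np.exp(-eta))
    tau = np.array([tau1, 1.0 - tau1])

    # Per-cluster expected severities from gamma GLMs with a log link;
    # rows index clusters, columns index the two risks.
    mu = np.exp(marginal_coefs @ x)

    # Overall forecast: cluster means weighted by membership probabilities.
    return tau @ mu

# Hypothetical covariate vector (intercept plus two policy characteristics)
# and hypothetical fitted coefficients.
x = np.array([1.0, 0.3, 1.0])
mixing_coefs = np.array([0.2, -1.1, 0.5])
marginal_coefs = np.array([[[7.1, 0.4, -0.2], [7.8, 0.1, 0.3]],
                           [[6.0, 0.9, 0.1],  [6.5, -0.3, 0.2]]])

print(expected_severity(x, mixing_coefs, marginal_coefs).round(1))
```

Note that the copula itself does not alter these marginal expectations; its role is in the joint behaviour, for example when simulating scenarios in which both risks produce large claims at once.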
Dr Sen Hu is a post-doctoral researcher at University College Dublin
Dr Adrian O’Hagan is an assistant professor at University College Dublin