Yafei (Patricia) Wang looks at the use of machine learning to predict underwriting decisions for life and health insurance

Advances in machine learning, and an explosion of unstructured data, have created huge scope for the application of machine learning models in life and health, including predictive underwriting. In the July 2022 issue of The Actuary, Reza Hekmat and Balint Bone consider, on a commercial level, how such machine-learning models could help to automate and enhance the underwriting process (bit.ly/EndUnderwriters).
Real-world data can demonstrate how machine learning models may be used in practice to predict underwriting decisions for cases that cannot be processed by prescriptive rule-based engines. The data used for training the model and testing its performance are cases referred to reinsurers – cases that have not passed through the rule-based engine or manual underwriting for various reasons, typically complicated medical conditions or family medical histories. How are the data processed, how are the models built, and how well do they perform when predicting underwriting decisions?
Modelling process
The first step is to use natural language processing to process the free-text variables in the data, for example: descriptions of medical conditions, lifestyle risk factors, hobbies and occupations (Figure 1). After keywords such as ‘stomach cancer’ are extracted, machine learning techniques are used to sort applications into different groups of medical conditions and occupation classes, so dummy variables and categorical variables can be created in the second step.
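As a rough sketch of what this step might look like in Python, the snippet below extracts condition keywords from a free-text field and turns the resulting categories into dummy variables; the column names, keyword list and condition groups are hypothetical rather than those used in the actual model.

```python
# A minimal, hypothetical sketch of turning free-text fields into model-ready
# features. Column names, keywords and groupings are illustrative only.
import pandas as pd

applications = pd.DataFrame({
    "medical_text": [
        "history of stomach cancer, treated 2015",
        "mild asthma, uses inhaler occasionally",
        "no significant medical history",
    ],
    "occupation": ["office worker", "scaffolder", "teacher"],
})

# Step 1: extract keywords from the free-text medical description
# and map them to a condition group.
condition_keywords = {"cancer": "oncology", "asthma": "respiratory"}

def condition_group(text: str) -> str:
    text = text.lower()
    for keyword, group in condition_keywords.items():
        if keyword in text:
            return group
    return "none_disclosed"

applications["condition_group"] = applications["medical_text"].apply(condition_group)

# Step 2: convert the derived categorical variables into dummy variables.
features = pd.get_dummies(applications[["condition_group", "occupation"]])
print(features)
```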
The third step is preliminary data analysis, so we can gain some early insights into the data and decide on the appropriate treatment of missing values. Word clouds are often used to get a sense of the medical conditions that are frequently referred to reinsurers. This highlights the conditions with which underwriters are less familiar, and where they may require more training.
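Such a word cloud can be produced in a few lines with the open-source wordcloud package; the referral texts below are invented placeholders.

```python
# Illustrative sketch: a word cloud of free-text medical descriptions
# from referred cases (placeholder texts).
import matplotlib.pyplot as plt
from wordcloud import WordCloud

medical_text = [
    "stomach cancer family history",
    "type 2 diabetes raised bmi",
    "stomach cancer surveillance",
]

cloud = WordCloud(width=600, height=400, background_color="white").generate(
    " ".join(medical_text)
)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```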
We then use various feature selection techniques to eliminate irrelevant and redundant variables in order to reduce run time and overfitting. The data typically contains a few thousand features, so run time can be a real issue in practice. More importantly, overfitting is a common problem: the model fits too closely to the training dataset because it has learnt that dataset's randomness and noise, and so would not perform well on unseen future data. Feature selection helps to tackle this, and parameter tuning during model training can also be used to reduce overfitting.
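As an illustration of filter-based feature selection, the sketch below keeps only the highest-scoring features of a wide synthetic dataset; the scoring function and the number of features kept are assumptions, not the choices made in the actual project.

```python
# A hedged sketch of filter-based feature selection on a wide feature
# matrix; the data here are synthetic stand-ins.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=500, n_features=2000, n_informative=30,
                           random_state=0)

# Keep only the 200 features most associated with the outcome (ANOVA F-score);
# the rest are dropped before model training.
selector = SelectKBest(f_classif, k=200)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)
```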
Table 1 shows how the output variable – the underwriting decision – is coded.
The cases that are accepted on standard terms are labelled class 0; those that are declined are labelled class 100. In practice, underwriters give loadings only in multiples of 25%, so the models are designed to mimic this: the loadings are divided by 25% and labelled as shown in Table 1. Underwriters rarely give loadings of greater than 400%, so the cases with extremely large loadings are grouped into one class to give it enough data points. In short, we are training machine learning models for a multi-class classification problem; as shown in Table 1, there are 19 classes.
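A hypothetical encoding function along these lines is sketched below; the exact class codes in Table 1 are not reproduced here, so the numbers used (in particular the code for the pooled class of loadings above 400%) are illustrative only.

```python
# A hypothetical sketch of the class encoding described above.
def encode_decision(decision: str, loading_pct: float = 0.0) -> int:
    """Map an underwriting decision to a multi-class label.

    decision    -- 'standard', 'loaded' or 'decline'
    loading_pct -- loading applied, e.g. 75 for +75%
    """
    if decision == "standard":
        return 0                      # accepted on standard terms
    if decision == "decline":
        return 100                    # declined outright
    if loading_pct > 400:
        return 17                     # pooled class for very large loadings (illustrative code)
    return int(loading_pct // 25)     # loadings in multiples of 25% -> classes 1..16

print(encode_decision("loaded", 150))   # -> 6
print(encode_decision("loaded", 500))   # -> 17
```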
Some insurers may want to implement the model to classify the applications into three broader categories – ‘accept on standard terms’, ‘accept with loading’ and ‘decline’ – and then manually underwrite only certain classes, such as ‘accept with loading’. In the case of Table 1, the model could be adapted to be a three-class classification in which all of the classes with loadings are grouped into one class, and we have only three classes: ‘standard’, ‘loaded’, and ‘decline’.
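That collapse into three broader categories is a simple mapping; the sketch below reuses the illustrative encode_decision codes from above.

```python
# Collapse the illustrative 19-class codes into the three broader categories.
def to_three_class(label: int) -> str:
    if label == 0:
        return "standard"       # accepted on standard terms
    if label == 100:
        return "decline"
    return "loaded"             # every loading class, however large

print(to_three_class(encode_decision("loaded", 150)))   # -> 'loaded'
```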
Overall model performance
We trained and tested 10 machine-learning algorithms: random forests, decision tree, gradient boosting, extreme gradient boosting (XGB), bagging, AdaBoost, support-vector machine (SVM), stochastic gradient descent (SGD), k-nearest neighbours and ordinal logistic regression. Some are classification models, while others are regression models.
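A sketch of how these candidates might be set up with scikit-learn and xgboost is shown below; a plain multinomial LogisticRegression stands in for the ordinal logistic regression (which would need a dedicated package), and all hyperparameters are left at illustrative defaults.

```python
# A hedged sketch of the candidate models, assuming scikit-learn >= 1.1
# and xgboost are installed.
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

models = {
    "random_forest": RandomForestClassifier(),
    "decision_tree": DecisionTreeClassifier(),
    "gradient_boosting": GradientBoostingClassifier(),
    "xgb": XGBClassifier(),
    "bagging": BaggingClassifier(),
    "adaboost": AdaBoostClassifier(),
    "svm_rbf": SVC(kernel="rbf"),
    "sgd_log_loss": SGDClassifier(loss="log_loss"),
    "k_neighbours": KNeighborsClassifier(),
    "logistic_regression": LogisticRegression(max_iter=1000),  # stand-in for ordinal
}
```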
The dataset is randomly split into a training dataset and a testing dataset in the ratio 80:20. The models are trained on the training dataset and then tested on the unseen data in the testing dataset. The performance on the testing dataset gives us an idea of how the models will perform on future data, provided there are no fundamental changes in underwriting philosophy (for example, how lenient the underwriting is at the entry stage), so the performance on the testing dataset is the more important measure.
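A minimal sketch of this split-and-evaluate loop, reusing the synthetic X_reduced and y from the feature-selection sketch and the models dictionary above:

```python
# 80:20 split, then fit each candidate model and score it on the held-out data.
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.2, random_state=42   # 80:20 split
)

for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {test_accuracy:.2%}")
```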
The accuracy score, defined as the number of correct predictions divided by the total number of data points, is approximately 80% on the testing dataset for the 19-class classification and 89% for the three-class classification. This accuracy is achieved across all product lines – life, critical illness, income protection and so on – and across more than 25 insurers, whose underwriting manuals differ from product to product and from insurer to insurer.
The best performing algorithm is XGB, followed by random forests and bagging, for both 19-class and three-class classifications. All three of these algorithms combine the outputs of weak learners into the final output, and the underlying weak learners are decision trees. This is not surprising, because manual underwriting processes resemble a decision tree.
The XGB algorithm has won multiple Kaggle challenges (bit.ly/3a91NRA). One of the main reasons it often outperforms other machine learning models is that it handles sparse data well, which is important here because some classes have few or no data points. Another major advantage of XGB is that a penalty term is built into its loss function, so the algorithm prefers less complex models and is thus less prone to overfitting.
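For illustration, XGBoost's complexity penalties can be set directly on the classifier; the values below are arbitrary examples rather than tuned settings.

```python
# Illustrative (untuned) settings for XGBoost's built-in complexity penalties.
from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=300,
    max_depth=4,        # shallower trees mean simpler weak learners
    learning_rate=0.1,
    reg_lambda=1.0,     # L2 penalty on leaf weights
    reg_alpha=0.5,      # L1 penalty on leaf weights
    gamma=1.0,          # minimum loss reduction required to make a split
)
```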
Bagging outperforms the boosting algorithms AdaBoost and gradient boosting. The insurance dataset contains noise, so this result echoes Dietterich’s research suggesting that bagging performs better than boosting when used on datasets with a lot of noise (bit.ly/3y3VnLo).
Not surprisingly, ordinal logistic regression, SGD with a log loss function, and SVM with a radial basis function kernel did not perform well. The underwriting decision resembles a decision tree, so the relationships between the outcome variable and the input variables are unlikely to be simply linear on the log-odds scale. Furthermore, regression models do not typically perform well on high-dimensional datasets, which is the case here.
Practical implementations
The overall accuracy scores of the algorithms can be broken down by class. For example, the XGB model achieved 92% accuracy on the testing dataset for the standard class in the 19-class classification, so it is extremely accurate at predicting standard cases. This means that insurers wanting to improve their straight-through rates can do so by implementing the XGB model. It would not only improve operational efficiency but also boost sales by increasing the number of cases that can be accepted straight away.
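One way to produce such a per-class breakdown is scikit-learn's confusion matrix and classification report, where the recall for each class plays the role of the per-class accuracy quoted above; the sketch reuses the fitted models and the test data from the earlier snippets.

```python
# Per-class breakdown of the XGB predictions on the held-out test set.
from sklearn.metrics import classification_report, confusion_matrix

y_pred = models["xgb"].predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, zero_division=0))
```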
Furthermore, underwriters can now focus on the high-risk, high-cost cases – such as cases with large loadings – so the quality of underwriting decisions on these cases can be improved, too. Cost-benefit analysis is also useful, where the cost saving of using machine-learning models can be analysed against the cost of manual underwriting.
Yafei (Patricia) Wang has more than 10 years of experience in financial reporting, financial modelling, machine learning and data analytics in the South African and London markets