Nefeli Pamballi, Phanis Ioannou and Yiannis Parizas outline how machine learning could help increase the efficiency of fraud detection in motor insurance
Fraudulent claims are a significant cost to personal motor insurance products, typically increasing the combined ratio of the insurer by 5%-10%. Traditionally, firms would use expert judgment algorithms to decide which claims would be investigated for fraud. We set up a machine learning pipeline to help optimise processes in the Fraud Management Unit (FMU), to reduce the cost of fraudulent claims. Using data science techniques, we focused on reducing costs and increasing the efficiency of fraud detection processes by concentrating fraud investigation efforts on claims that were more likely to be fraudulent. Organisations would benefit from:
- A reduction in operating expenses, as claims with low probability of being fraudulent will be fast-tracked
- A better customer experience from fast-tracked customers, leading to higher customer satisfaction and retention levels
- An increased fraud detection rate that reduces the combined ratio.
In our case study, claims were assigned a fraud-likeliness score, with two thresholds for intervention. The lower threshold was interpreted as the cut-off for fast-tracked claims and anything above the higher threshold was interpreted as requiring anti-fraud action; anything between the two thresholds was sent for assessment by the FMU. The number of claims falling between the two thresholds was driven by the FMU’s monthly capacity for investigating claims.
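As an illustrative sketch in Python (the 2% and 23% cut-off values used as defaults are the ones arrived at in this case study; any recalibrated thresholds could be substituted), the triage logic is simply:

```python
def triage(score, lower=0.02, upper=0.23):
    """Route a claim according to its fraud-likeliness score.

    Below `lower`: fast-track the claim.
    At or above `upper`: initiate anti-fraud action.
    In between: send to the FMU for investigation.
    """
    if score < lower:
        return "fast-track"
    if score >= upper:
        return "anti-fraud action"
    return "FMU investigation"
```

In practice the lower threshold would be revisited whenever the FMU's investigation capacity changes, since it is capacity that determines how many claims can sit between the two cut-offs.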
The training and testing data came from past fraud cases, and we tested various statistical and machine learning models for predicting fraudulent claims. The conclusion was that three particular models, in combination, yielded the best predictions. Future monitoring will be an important element of the process, as it will help ensure the sustainability of the work by keeping the framework up to date and fit for purpose.
Data preparation and exploratory analysis
Before exploratory analysis or modelling was performed, an extract, transform, load (ETL) process was set up for data preparation and cleansing. Building the ETL took up most of the project time, but making it flexible and easy to use will pay dividends by speeding up future recalibrations.
The average fraud rate across the entire dataset was 0.83%. This represents the reported or confirmed fraud rate rather than the actual fraud rate, which would be expected to be higher, since it is unlikely that 100% of actual frauds were detected. Based on exploratory analysis, the most important dimensions to include in the model were:
- The time taken to report the claim from the beginning of the policy: We saw evidence that higher fraud rates occurred closer to the policy start date
- Policy duration: It seems reasonable that the perpetrator of premeditated fraud does not require long cover
- Customer duration days: The longer the claimant has been a customer, the less likely they are to file a fraudulent claim. This was expected, as fraudulent customers tend to switch insurers frequently
- Number of previous claims: Another reasonable assumption is that a large number of previous claims could mean that the latest claim is fraudulent
- Claim cover type: The fraud rate was lowest for third-party liability claims, highest for own damages and theft, and somewhere in the middle for glass. This is not a surprising observation, as the claimant does not benefit directly from a third-party liability claim.
The above findings were discussed with the FMU and validated for reasonableness. The FMU provided additional possible risk drivers, but the analysis only supported the smaller set above. However, the FMU’s expertise was key, and other drivers that emerge in the future could be incorporated into the model if the data supports this.
Only supervised learning algorithms were considered appropriate in this case and, since the response variable was binary, we framed this as a binary classification problem, with the model predicting one of two classes: Fraud or Non-Fraud.
Taking into consideration the results of our preliminary analysis and our choice of response variable, we opted to test logistic regression, classification tree and gradient boosting (XGBoost) algorithms, as well as an ensemble model that combined all three methods.
- Logistic regression: Logistic regression is a probabilistic statistical classification model that can be used when the dependent variable Y is binary. It involves transforming the linear regression model using a sigmoid function. Logistic regression is a powerful classification algorithm that assigns observations to one of a discrete set of classes (in this case binary: Fraud or Not Fraud).
- Decision trees: Decision trees work for both categorical and continuous input and output variables. There are two types of decision trees: regression trees and classification trees. Regression trees predict a quantitative response, while classification trees predict a qualitative one. There are many decision tree variants, but they all do the same thing – subdivide the feature space into regions with mostly the same label. Decision trees are easy to understand and implement.
- Gradient boosting (XGBoost): The algorithm builds trees sequentially so that each subsequent tree aims to reduce the errors of its predecessors, using them as learning sources. This technique is called ‘boosting’ in the field of data science; it builds small, individually interpretable trees and gives the modeller the option to choose and optimise hyperparameters during the process.
- Ensemble model: The ensemble method is a technique that takes the predictions generated by several base models and combines them to reach a single prediction. In this exercise, we have averaged the three models to combine them. This method usually generates more accurate results than a single model. The key to the success of this method is that base models perform better in different parts of the dataset – so by averaging the results of several models, we could improve the model performance as a whole.
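The averaging step is straightforward to sketch. In the snippet below, the per-claim probabilities are hypothetical stand-ins for the outputs of the three fitted models:

```python
def ensemble_average(*model_probs):
    """Combine base models' fraud probabilities by simple averaging."""
    return [sum(ps) / len(ps) for ps in zip(*model_probs)]

# Hypothetical predicted fraud probabilities for three claims:
glm_probs   = [0.10, 0.02, 0.40]  # logistic regression
tree_probs  = [0.20, 0.00, 0.30]  # classification tree
boost_probs = [0.15, 0.01, 0.50]  # gradient boosting (XGBoost)

combined = ensemble_average(glm_probs, tree_probs, boost_probs)
```

Weighted averages (giving more credibility to the stronger base models) are a natural refinement, but a simple mean was used in this exercise.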
The data available to us for this exercise was split into three sub-samples:
- 60% for training these models: used to fit the parameters
- 20% for validation: used in optimising the hyperparameters and keeping track of the performance
- 20% for testing: used to provide an unbiased evaluation of a final model fit.
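A minimal sketch of the split (a plain random shuffle; in practice stratified sampling may be preferred, so that the rare fraud cases appear at a similar rate in each sub-sample):

```python
import random

def split_60_20_20(rows, seed=42):
    """Shuffle and split data into 60% train / 20% validation / 20% test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed for reproducibility
    n = len(rows)
    i, j = 6 * n // 10, 8 * n // 10
    return rows[:i], rows[i:j], rows[j:]

train, validation, test = split_60_20_20(range(1000))
```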
We fitted the four models and then compared them using different measures to determine which one offered the most predictive power for fraud detection. Figure 1 shows the model predictions compared with the actual fraud rate for the out-of-sample data, plotted against the time taken to report the claim from the beginning of the policy – which turned out to be the most important variable in the models. The bars represent the number of claims in each 50-day period. We can see that all models capture the fact that earlier reported claims are more likely to be fraudulent. The decision tree method is less smooth than the other methods and does not decrease sufficiently after the first two periods.
To be able to compare the four models and decide on the most appropriate one, we used the ‘area under the ROC curve’ metric, which is standard practice in such cases. A receiver operating characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold changes. Table 1 summarises the performance of the fitted models in the three data categories (training, validation and testing). The most appropriate for decision-making is the final test data category, because this is an out-of-sample test based on data not used in the calibration.
We observe that the ensemble model is not ranked first in the training set, but it offers better performance for the validation and test sets. We can thus conclude that it is the best fraud detection model in this scenario.
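The area under the ROC curve has an equivalent rank-based definition: the probability that a randomly chosen fraud case receives a higher score than a randomly chosen non-fraud case. A small self-contained sketch (quadratic in sample size, so suitable for illustration rather than production use):

```python
def roc_auc(y_true, y_score):
    """Area under the ROC curve via the Mann-Whitney U statistic:
    the probability that a random positive case outscores a random
    negative case, with ties counting as half a win."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to random guessing and 1.0 to perfect discrimination, which is why the metric is a natural yardstick for comparing the four candidate models.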
“We fitted four models and compared them to determine which one offered the most predictive power for fraud detection”
In addition to the ‘area under the curve’ performance metric, we analysed the confusion matrices for the four models, to further evaluate models and extract additional information about their performance on the test set. By confusion matrix, we refer to a table used to describe the performance of a classification model on a set of test data for which the true values are known. A 5% cut-off point was considered for the probability of fraud at this stage.
Table 2 summarises the results of the confusion matrix analysis. We observe that the accuracy (how often the classifier is correct in its predictions), precision (when the classifier predicts a fraudulent case, how often it is correct) and sensitivity (of all the actually fraudulent cases, how often the classifier identifies them) levels of the ensemble method show the best performance among the four models. As such, we decided to proceed with the ensemble model for implementation and deployment.
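All three measures follow directly from the four cells of the confusion matrix. A minimal sketch, using the 5% cut-off mentioned above:

```python
def confusion_metrics(y_true, y_prob, cutoff=0.05):
    """Accuracy, precision and sensitivity at a probability cut-off."""
    y_pred = [int(p >= cutoff) for p in y_prob]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
    }
```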
From model to decision making
For the practical implementation of the model, operational use by the FMUs and subsequent use for decision-making purposes, two threshold levels will have to be determined, as mentioned above: an upper threshold that initiates anti-fraud actions, and a lower threshold for determining whether to investigate or fast-track.
When determining the upper threshold level, we considered how to minimise the number of claims wrongfully classified as fraud (false positives). False positives could adversely affect the company’s reputation and customers’ satisfaction.
When choosing the level of the lower threshold, we considered the FMU's operational capacity and the number of claims it can investigate daily. A higher proportion of claims falling below the lower threshold would also translate into lower operational costs and a more pleasant customer journey. The FMU's current monthly operational capacity allows for 150 claims to be investigated, on average. As shown in Figure 2, this claim investigation capacity corresponds to a cut-off probability of fraud of 2%.
Furthermore, considering the precision level at several cut-off points, as shown in Figure 3, we observed that precision is maximised at a 23% probability of fraud – meaning this is the cut-off at which we can be most confident that the detected cases are genuinely fraudulent.
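The two threshold choices can be sketched as follows: the lower threshold is backed out from FMU capacity (the score of the capacity-th highest-scoring claim), while the upper one is chosen where precision peaks. The scores in the usage example are hypothetical:

```python
def capacity_cutoff(y_prob, capacity):
    """Lower threshold: the lowest score among the `capacity`
    highest-scoring claims, so the flagged volume fits FMU capacity."""
    return sorted(y_prob, reverse=True)[:capacity][-1]

def precision_at(y_true, y_prob, cutoff):
    """Share of claims flagged at `cutoff` that are truly fraudulent."""
    flagged = [t for t, p in zip(y_true, y_prob) if p >= cutoff]
    return sum(flagged) / len(flagged) if flagged else 0.0

# Hypothetical scores: with capacity to investigate two claims,
# the lower threshold falls at the second-highest score.
lower = capacity_cutoff([0.10, 0.50, 0.30, 0.20], capacity=2)  # 0.30
```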
Based on the above considerations, we set the thresholds and respective actions shown in Table 3. The FMU receives a fraud score for each claim and takes the corresponding action. Based on the choices in Table 3, we could achieve accuracy of 93.5% at the 2% cut-off point and 99% at the 23% cut-off.
Once the model is in production, a monitoring framework should be set up to identify performance drops that would trigger model recalibration or redevelopment. This requires comparing actual and predicted fraud rates on a regular basis, a process that can be automated and presented on dashboards. Once more data is available, we will be able to assess how the other dimensions interact with the time factor, to confirm that the patterns are consistent over time.
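A minimal sketch of such a monitoring trigger, assuming a simple relative-drift rule (the 50% tolerance is an illustrative choice, not a figure from the case study):

```python
def needs_recalibration(actual_frauds, n_claims, predicted_rate,
                        tolerance=0.5):
    """Flag recalibration when the observed fraud rate drifts more
    than `tolerance` (relative) from the model's predicted rate
    for the monitoring period."""
    observed_rate = actual_frauds / n_claims
    drift = abs(observed_rate - predicted_rate) / predicted_rate
    return drift > tolerance
```

Run monthly, a rule like this turns the actual-versus-expected comparison into an automatic alert rather than a manual review.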
We have seen that the traditional rule-based approach can be blended with a machine learning pipeline to benefit FMU operations. Existing fraud case data can be modelled using different methods, and the predictions used to optimise FMU operations. Monitoring can then be used to assess the effectiveness of the model and initiate future recalibrations. The overall benefit of setting up this process is a reduction in the combined ratio, through more fraud being identified and, possibly, the feeding back of insights to underwriting.
Nefeli Pamballi is a senior consultant at EY
Phanis Ioannou is a risk modelling manager at RCB Bank
Yiannis Parizas is head of pricing and actuarial analytics at Hellas Direct