Is there a role for automation in selecting a reserving technique? Caesar Balona and Ronald Richman share their thoughts

Many techniques for estimating incurred but not reported (IBNR) reserves are available for use in a reserving analysis and, usually, each has several variations that can be applied. Reserving techniques are often selected using the actuary’s subjective judgment – so, to an outside observer, the process of IBNR reserving can seem more art than science. Moreover, it is difficult to assess the contribution of each of these judgments to the predictive accuracy of the reserving exercise, and actuaries auditing or reviewing reserves may require different approaches to be applied.
Judgment is certainly required in actuarial calculations, but what should be left to judgment and what can be determined in a scientific manner? We addressed this question in a recent paper (bit.ly/3uJRvNh) and summarise our approach here.
Too many choices
When applying a chosen reserving method, we often have many parameters to set. Even before choosing an IBNR projection methodology, we must select an approach to deriving loss development factors: simple or weighted averages? How many origin periods to include when calculating the average? Should high or low factors be excluded?
Then we turn to the choice of method for estimating the ultimate losses: do we use a basic chain ladder, or a Bornhuetter-Ferguson method? If we do use Bornhuetter-Ferguson, what should our a priori loss ratio be? Could a Cape Cod estimate of the loss ratio be more appropriate?
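To make the scale of the problem concrete, these choices can be written down as a search space. The sketch below is purely illustrative – the names and values are ours, not those of any particular software package or of the case study later in this article:

```python
# An illustrative search space over reserving choices. All names and values
# are hypothetical, not taken from any library or from the case study.
search_space = {
    "ldf_average": ["simple", "volume_weighted"],  # how development factors are averaged
    "n_periods": [3, 5, 10, None],                 # origin periods in the average; None = all
    "drop_high": [False, True],                    # exclude the highest factor?
    "drop_low": [False, True],                     # exclude the lowest factor?
    "method": ["chain_ladder", "bornhuetter_ferguson"],
    "apriori_loss_ratio": [0.50, 0.60, 0.70],      # only relevant to Bornhuetter-Ferguson
}
```

Even this small illustrative grid already implies 2 × 4 × 2 × 2 × 2 × 3 = 192 combinations to consider.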
In navigating these choices, the actuary can often rely upon well-founded heuristics: for example, using exposure-based methods on less mature accident years. However, it is difficult to guarantee that judgments made on the basis of these heuristics result in optimal predictive accuracy. Ideally, we would like to have an objective measure of how well a selected reserving method (and selected variation) will perform on unseen data.
The role of machine learning
This situation of having to choose from many different techniques brings to mind the process of selecting a machine learning (ML) technique. Within ML, many different techniques have been proposed – for example, to perform regression, you must select a technique from many categories, such as regularised regression, gradient boosted trees and neural networks. Then, you must choose the hyperparameters (which are not inferred directly from the data) that determine how the technique is applied – for example, the extent of regularisation in a regression model.
Within ML, the automatic selection of techniques and parameters is done by estimating the expected predictive accuracy of every reasonable combination of techniques and parameters. Within the context of a single class of models, this is called hyperparameter optimisation. Simply put, you test all combinations and then select the set of parameters that is shown to be the most accurate.
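As a sketch of what “testing all combinations” might look like in code (with `expected_accuracy` as a hypothetical stand-in for whatever accuracy estimator is used – for example, the cross validation described below):

```python
# A sketch of exhaustive hyperparameter search over a grid like the one above.
# `expected_accuracy` is a hypothetical callable returning an error estimate
# (lower is better) for one combination of parameters.
import itertools

def grid_search(search_space, expected_accuracy):
    names = list(search_space)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(search_space[n] for n in names)):
        params = dict(zip(names, values))
        score = expected_accuracy(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```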
How are these tests performed? Several approaches are available for estimating the expected predictive accuracy (information criteria are also worth noting, although they are not available for most commonly used machine learning techniques). The common thread among the most popular approaches is that the selected model is fitted on one subset of the data and tested on another. The performance on this unseen data is used as a proxy for the expected predictive accuracy.
The gold standard approach to hyperparameter optimisation is k-fold cross validation. In this approach, the dataset is split into k separate subsets (each subset is called a ‘fold’). Each of the k subsets is used in turn to assess the accuracy obtained after training the model (with its chosen parameters) on the other k-1 subsets. The accuracy is then averaged across the k folds. This approach can be extended to time series problems, as we discuss next.
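A bare-bones sketch of k-fold cross validation, assuming numpy arrays X and y and a hypothetical `fit_and_predict` callable that trains the chosen model on one set of rows and predicts another:

```python
# A bare-bones sketch of k-fold cross validation. `fit_and_predict` is a
# hypothetical callable that trains the model (with its chosen hyperparameters)
# on the training folds and returns predictions for the held-out fold.
import numpy as np

def k_fold_score(X, y, k, fit_and_predict, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        preds = fit_and_predict(X[train_idx], y[train_idx], X[test_idx])
        errors.append(np.mean((preds - y[test_idx]) ** 2))
    return float(np.mean(errors))  # average error across the k folds as the accuracy proxy
```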
Given that the process of selecting machine learning techniques and hyperparameters is quite mature, actuaries can readily leverage it for actuarial purposes and, in our case of reserving, it potentially gives us an objective, scientific approach to selecting IBNR techniques. In our paper, we use a variation of k-fold cross validation to select the optimal reserving parameters.
Applying k-fold cross validation to a claims triangle
When reserving, an important way of gaining a feel for accuracy is to assess how accurate our claims projections are over the next period. Of course, we are interested in the total IBNR reserve and thus ultimate claims, but we generally reassess our reserves each subsequent period and base some of our choices of technique over time on our performance in the most recent period – that is, we perform what is commonly called an actual versus expected (AvE) analysis.
Over time, then, our approaches are guided by numerous AvE analyses. Scoring a technique and a set of parameters by its AvE performance is therefore a natural way to determine whether one set of parameters is more accurate than another. This leads us to the following way of viewing a reserving triangle: we can consider the triangle as a collection of smaller triangles, each with subsequent claims to predict on the next diagonal. This is a natural way to apply k-fold cross validation to a reserving triangle.
We start by considering a sub-triangle – say, the first five calendar periods, shown in blue in Figure 1. Given a set of parameters and a chosen methodology, we can determine what our expected claims will be over the next calendar period, shown in green. Using this result we can calculate an AvE score, since we know what the actual claims are in the next calendar period, ie on the next diagonal.
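The sketch below shows one such ‘fold’, using a plain volume-weighted chain ladder purely for illustration (the case study instead searches over Bornhuetter-Ferguson variations): fit on the first m calendar periods and score the prediction of the next diagonal.

```python
# One cross-validation "fold" on a cumulative claims triangle, illustrated with
# a simple volume-weighted chain ladder (not the exact method of the case study).
# `triangle` is an n x n numpy array with accident periods in rows, development
# periods in columns and np.nan in unobserved cells; we fit on cells with
# i + j < m and score the next diagonal, i + j == m.
import numpy as np

def next_diagonal_ave(triangle, m):
    n = triangle.shape[0]
    sq_errors = []
    for i in range(n):
        j = m - i                               # cell (i, j) sits on the next diagonal
        if not (0 < j < n) or np.isnan(triangle[i, j]):
            continue                            # actual value not available
        rows = [r for r in range(n) if r + j < m]
        if not rows:
            continue                            # no factor estimable from the sub-triangle
        # volume-weighted development factor from column j-1 to column j
        f = sum(triangle[r, j] for r in rows) / sum(triangle[r, j - 1] for r in rows)
        expected = triangle[i, j - 1] * f
        sq_errors.append((triangle[i, j] - expected) ** 2)
    return float(np.sqrt(np.mean(sq_errors)))   # AvE summarised as an RMSE
```

In terms of Figure 1, the blue sub-triangle corresponds to m = 5 and the green cells are the diagonal being scored.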

We can repeat this process for all the remaining calendar periods until we have reached the final diagonal of the triangle and can no longer calculate an AvE score. We then average the AvE scores across all the calendar periods to arrive at a single score for our chosen parameters and methodology. If we perform this for all the parameters and methodologies under consideration, we can simply select the parameter set and methodology that results in the best score. This selection rests on the underlying assumption that good predictive performance in the past will result in good performance in the future; indeed, we discuss later the extent to which we found this to be true empirically.
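Putting the pieces together, the selection loop might be sketched as follows, with `score_fold(triangle, m, params)` standing in for an AvE-style score (such as the one sketched above) of the chosen method and parameters on the diagonal after the first m calendar periods:

```python
# A sketch of the overall selection loop. `score_fold(triangle, m, params)` is a
# hypothetical callable returning a score (lower is better) for the diagonal
# after the first m calendar periods, under one parameter set and methodology.
import numpy as np

def select_reserving_parameters(triangle, candidate_params, score_fold, min_periods=3):
    n = triangle.shape[0]
    best_params, best_score = None, float("inf")
    for params in candidate_params:
        # average the score over every diagonal that can be held out
        scores = [score_fold(triangle, m, params) for m in range(min_periods, n)]
        avg_score = float(np.mean(scores))
        if avg_score < best_score:
            best_params, best_score = params, avg_score
    return best_params, best_score
```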
In practice, it is usually also necessary to balance accuracy with overall reserve stability. One might imagine that the approach above could prioritise accuracy over stability, so that our total IBNR fluctuates unnecessarily. To combat this, we proposed using the claims development result (CDR) (Merz M and Wüthrich MV. Modelling the claims development result for solvency purposes. CAS E-Forum 2008; 542–568) as an alternative scoring metric. The claims development result for accident period i over calendar period k+1 is defined loosely as

\mathrm{CDR}_{i,k+1} = \hat{R}_{i,k} - \left(X_{i,k+1} + \hat{R}_{i,k+1}\right)

where \hat{R}_{i,k} is the estimated IBNR reserve for accident period i at the end of calendar period k and X_{i,k+1} is the claims amount emerging for accident period i during calendar period k+1.
In other words, the CDR adds on to the AvE the change in the IBNR reserve from one calendar period to the next. When using the CDR as a scoring metric, this extra term helps minimise fluctuations in the IBNR reserve, thus achieving more stable IBNR estimates.
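A toy illustration of this loose definition – as a scoring metric, one would penalise CDRs far from zero, for example by averaging their squared values across accident periods and folds:

```python
# A toy illustration of the loose CDR definition above: opening IBNR, less the
# claims that emerged over the calendar period, less the closing IBNR. A CDR
# near zero indicates reserves that were both accurate and stable.
def claims_development_result(ibnr_open, claims_emerged, ibnr_close):
    return ibnr_open - (claims_emerged + ibnr_close)

# Opening IBNR of 100, claims of 30 emerge, 65 of IBNR remains at the close:
print(claims_development_result(100.0, 30.0, 65.0))  # 5.0, a small reserve release
```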
How does the approach fare in practice?
In our paper, we apply this approach to three triangles; this article discusses the results for only one of them, a triangle from a Swiss accident insurer (we refer the reader to the full paper for all of the case studies). In the case study, we search over all the variations shown in Table 1 when applying the Bornhuetter-Ferguson method.
In Figure 2, we see the CDR scores over the entire parameter space. Note that the score does not vary considerably across the n_periods parameter beyond 14 periods. The a priori loss ratio provides the largest range of scores, as expected, settling on a minimum at 59%. Finally, dropping high or low loss development factors does have some impact on the result; a more granular approach than simply dropping the lowest or highest development factors could be considered.

Table 2 shows the results. To assess true performance, we measure the root-mean-squared error (RMSE) of the predictions of ultimate claims on the unseen part of the triangle (the yellow lower triangle in Figure 1). Our optimised approach achieved an RMSE 8.4% lower than that of the base Bornhuetter-Ferguson method with an assumed loss ratio of 60% (which we calculated as the long-term average trend).
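For concreteness, the comparison metric can be computed as below; the numbers are made up purely to illustrate the calculation and are not the figures from the case study.

```python
# Comparing two sets of ultimate claim predictions against the outcomes on the
# unseen lower triangle. The numbers are made up purely for illustration.
import numpy as np

def rmse(predicted, actual):
    return np.sqrt(np.mean((np.asarray(predicted) - np.asarray(actual)) ** 2))

actual_ultimates    = [105.0,  98.0, 120.0, 113.0]   # outcomes on the unseen lower triangle
base_bf_ultimates   = [100.0, 104.0, 112.0, 118.0]   # BF with a fixed 60% a priori loss ratio
optimised_ultimates = [103.0, 101.0, 116.0, 115.0]   # parameters selected by cross validation

reduction = 1 - (rmse(optimised_ultimates, actual_ultimates)
                 / rmse(base_bf_ultimates, actual_ultimates))
print(f"Relative reduction in RMSE: {reduction:.1%}")
```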

TABLE 2: Performance comparison of the algorithm against a basic Bornhuetter-Ferguson method.
The ‘ultimate’ reserving method?
Our proposed approach allows for some automation of the process of selecting a reserving technique. Nonetheless, what we are proposing is not to replace the reserving actuary with an algorithm, but to assist the actuary by removing the repetitive manual process of selecting parameters by judgment and replacing that element with a robust and objective machine learning approach.
In this way, the actuary can spend less time on technique selection and more time applying actuarial judgment by considering external factors that are not inherent in the data. For example, given the impact COVID-19 has had on the claims experience of insurers, it is unlikely that the result chosen by our algorithm based on experience from 2020 will be optimal. On the other hand, our algorithm could provide a baseline approach based on previous accident years, after which the actuary could spend significantly more time considering how best to allow for the impact of COVID-19.
Caesar Balona is an assistant actuarial manager at QED Actuaries and Consultants
Ronald Richman is an associate director (R&D and Special Projects) at QED Actuaries and Consultants