Erica Nicholson and Andrew Smith investigate whether natural disasters could account for more losses in 2013 than ever before


Could there be $500bn of global insured weather-related catastrophe losses in 2013? This would be four times bigger than the worst recorded year's losses, in 2005, which included hurricanes Katrina, Rita and Wilma.
Figure 1 (below left) shows distributions fitted to historic weather-related catastrophe losses. We have used a regression model to estimate an underlying exponential trend, and then fitted distributions to the mean and standard deviation of the de-trended data. According to the lognormal model, a $500bn loss is a one-in-5,000-year event, while the fitted Generalised Pareto Distribution (GPD) implies a $500bn loss has a probability of zero.
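As a rough illustration of this calibration, the Python sketch below applies the same recipe to made-up annual losses: a log-linear regression for the trend, de-trending to a common exposure year, and lognormal and GPD fits matched to the mean and standard deviation. The data, the reference year and the moment-matching details are illustrative assumptions, not the calibration behind Figure 1.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2013)

# Hypothetical annual insured weather-catastrophe losses in $bn (placeholder data).
years = np.arange(1970, 2013)
losses = np.exp(1.5 + 0.04 * (years - 1970)) * rng.lognormal(0.0, 0.8, size=years.size)

# 1. Estimate an underlying exponential trend by regressing log-losses on year.
slope, intercept = np.polyfit(years, np.log(losses), 1)
trend = np.exp(intercept + slope * years)

# 2. De-trend, rescaling every year to the assumed 2013 exposure level.
detrended = losses / trend * np.exp(intercept + slope * 2013)

# 3. Fit distributions by matching the mean and standard deviation.
m, s = detrended.mean(), detrended.std(ddof=1)

# Lognormal: mu and sigma recovered from the moment formulas.
sigma2 = np.log(1.0 + (s / m) ** 2)
mu = np.log(m) - 0.5 * sigma2
lognorm_fit = stats.lognorm(s=np.sqrt(sigma2), scale=np.exp(mu))

# GPD (zero location): matching moments always gives shape xi < 1/2,
# where the GPD variance exists.
cv2 = (s / m) ** 2
xi = 0.5 * (1.0 - 1.0 / cv2)
beta = m * (1.0 - xi)
gpd_fit = stats.genpareto(c=xi, scale=beta)

print("P(loss > $500bn), lognormal:", lognorm_fit.sf(500.0))
print("P(loss > $500bn), GPD      :", gpd_fit.sf(500.0))
```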
We should already be suspicious about model risk. The probability of a loss exceeding $500bn can be debated, but zero must be an underestimate, however many statistical test results we invoke in support of the GPD.
If it were to happen, a $500bn loss would have a huge impact on the insurance industry. After paying the claims, we would recalibrate our models and find some distributions consistent both with 2013 and earlier years, just as we recalibrated in 2006 following 2005 losses. But why wait until 2014 to reflect possible extreme losses in commercial decisions? We can get ahead of the game and identify better models today.
The range of possible models
Validation is not a proof of model correctness. Low volumes of data and constantly changing parameters allow us at best to assert that a model could plausibly have generated the historical data. There may be many such plausible models - the weather catastrophe data rejects neither the lognormal nor the GPD. We can rank the models according to how closely they fit the data; while all are within random tolerances, some fit more closely than others. Figure 2 (page 32) presents this wide range of possible non-rejected models (and, within each model, non-rejected parameter values), in blue. The better-fitting models are higher up the chart.
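The sketch below makes the point on placeholder data: a lognormal and a GPD can both pass a Kolmogorov-Smirnov test on a 40-year series, and can be ranked by log-likelihood, while implying very different 99.5 percentiles. The data and the choice of test are our assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2013)
detrended = rng.lognormal(np.log(20.0), 0.9, size=40)   # placeholder de-trended losses, $bn

candidates = {
    "lognormal": stats.lognorm(*stats.lognorm.fit(detrended, floc=0)),
    "GPD": stats.genpareto(*stats.genpareto.fit(detrended, floc=0)),
}

for name, dist in candidates.items():
    ks = stats.kstest(detrended, dist.cdf)        # non-rejection is not proof
    loglik = np.sum(dist.logpdf(detrended))       # used to rank the fits
    print(f"{name:9s} KS p-value {ks.pvalue:.2f}  "
          f"log-likelihood {loglik:.1f}  "
          f"99.5 percentile {dist.ppf(0.995):.0f}")
```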
Within the non-rejected models, there is also a range of possible values for the 99.5 percentile. The more we allow the model to deviate from the data, the greater the possible range of the 99.5 percentile, particularly at the upper end, where tail information is by definition scarce. This is shown on the horizontal axis of our chart. Models to the right of this range produce a higher estimate of potential future losses at the 99.5 percentile than models towards the left.
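One way to trace out that range is to keep every parameter combination that a likelihood-ratio test would not reject and record the 99.5 percentile each one implies. The sketch below does this for the GPD on placeholder data; the grid, the 5% tolerance and the data itself are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2013)
detrended = rng.lognormal(np.log(20.0), 0.9, size=40)   # placeholder losses, $bn

def gpd_loglik(xi, beta, x):
    return np.sum(stats.genpareto.logpdf(x, c=xi, scale=beta))

# Grid over shape and scale; keep pairs inside the 95% likelihood-ratio band.
xis = np.linspace(-0.4, 0.9, 80)
betas = np.linspace(5.0, 80.0, 80)
loglik = np.array([[gpd_loglik(xi, b, detrended) for b in betas] for xi in xis])
tol = stats.chi2.ppf(0.95, df=2) / 2.0            # allowed drop in log-likelihood
keep = loglik >= loglik.max() - tol

# 99.5 percentile implied by every parameter pair on the grid.
p995 = np.array([[stats.genpareto.ppf(0.995, c=xi, scale=b) for b in betas] for xi in xis])
best = np.unravel_index(loglik.argmax(), loglik.shape)
print("99.5 percentile, best fit on grid  :", p995[best])
print("99.5 percentile, non-rejected range:", p995[keep].min(), "to", p995[keep].max())
```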
Approaches to analysing model error vary according to the nature of the error to be detected. Table 1 (below) describes some of the techniques.

Model benchmarking
The range of possible models is vast, and it is usual practice to employ an expert to help in the selection. There is a danger that this manual intervention introduces statistical bias. Experts might seek safety in numbers by following a herd. They may over-estimate the effectiveness of reviews of rates or policy conditions. In the worst cases, a model may be cherry-picked purely on the basis of a commercial outcome.
The main defence against human bias is benchmarking, that is, comparing the implications of different models, each fitted to the same data set. This is harder than it seems. You might benchmark three commercial catastrophe models, but forget another 10 extinct models whose answers were commercially unattractive. The three models you check might produce capital requirements within 10% of each other, but there might be models in the graveyard with results four times bigger. There is a risk of self-delusion if this is not explored thoroughly.

Stochastic methods
At the other extreme, the compound interval lies at the bottom right of the chart, representing the most robust outcome across all the non-rejected models. This is a 99.5%-confidence interval for the 99.5 percentile. When applied to the Swiss Re CAT data, the compound interval implies a possible loss in excess of $1,000bn. Such an event would require an extreme set of parameters to coincide with an extreme 99.5 percentile result.
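A minimal sketch of a compound interval is given below, assuming a lognormal model and a parametric bootstrap for the parameter uncertainty; this is our construction for illustration, not necessarily the calculation behind the $1,000bn figure.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2013)
detrended = rng.lognormal(np.log(20.0), 0.9, size=40)   # placeholder losses, $bn

log_x = np.log(detrended)
mu_hat, sigma_hat = log_x.mean(), log_x.std(ddof=1)
z995 = stats.norm.ppf(0.995)

# Parametric bootstrap: re-estimate the parameters from resampled 40-year
# histories and record the 99.5 percentile each re-estimate implies.
boot = rng.normal(mu_hat, sigma_hat, size=(20_000, detrended.size))
boot_p995 = np.exp(boot.mean(axis=1) + boot.std(axis=1, ddof=1) * z995)

print("best-estimate 99.5 percentile:", np.exp(mu_hat + sigma_hat * z995))
print("compound 99.5% upper bound   :", np.quantile(boot_p995, 0.995))
```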
The compound interval is one interpretation of what a 99.5 percentile means in the context of underlying model uncertainty. An alternative is to allow diversification between the parameter and process error, which leads to the concept of a prediction interval. A prediction interval is a function of past data that has a given probability of containing a future observation. This allows for the fact that both the past data and the future observation have come from a random process. Unlike the compound interval, a prediction interval does not require the worst-case parameter estimate and the worst-case outcome to occur simultaneously.
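By contrast, a prediction-interval bound can be sketched by mixing parameter error and process error in a single simulation, as below. The placeholder data and the bootstrap construction are again our assumptions.

```python
import numpy as np

rng = np.random.default_rng(2013)
detrended = rng.lognormal(np.log(20.0), 0.9, size=40)   # placeholder losses, $bn

log_x = np.log(detrended)
mu_hat, sigma_hat = log_x.mean(), log_x.std(ddof=1)

# Parameter error: re-estimate (mu, sigma) from resampled 40-year histories.
n_sims = 200_000
resamples = rng.normal(mu_hat, sigma_hat, size=(n_sims, detrended.size))
mu_b = resamples.mean(axis=1)
sigma_b = resamples.std(axis=1, ddof=1)

# Process error: one simulated future year from each re-estimated model.
future = np.exp(rng.normal(mu_b, sigma_b))

print("prediction-interval upper bound:", np.quantile(future, 0.995))
```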
In a recent paper, Andreas Tsanakas and Russell Gerrard of Cass Business School explained that estimated percentiles might be less prudent than you think. If you fit a model and try to estimate a 99.5 percentile from the fitted parameters, you might only end up covering 97% of future observations, because of the impact of parameter error. They then proposed working backwards, initially using a higher level of confidence so that, after allowing for parameter error, the prediction interval covers 99.5% of future outcomes.
To test a proposed prediction interval based on 40 years of data, we generate random samples of 41 years. For each sample, we use the first 40 years to construct a 99.5 percentile, using the same algorithm as we would apply to the real catastrophe data. As we are testing a formula and not a single value, the actual catastrophe history is ignored in this calculation. Instead, we use Monte Carlo simulation to estimate the 'prediction frequency': how often the claimed 99.5 percentile exceeds the 41st observation. We do this for a range of distribution shapes, which we parameterise on the horizontal axis of Figure 3 (left) in terms of the ratio of the 99.5 percentile to the median.
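The sketch below implements this consistency test for an assumed lognormal 'truth', fitted by matching the mean and standard deviation of the logs; the true parameters and the number of trials are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2013)
true_mu, true_sigma = np.log(20.0), 0.9           # assumed 'true' lognormal
n_trials, n_years = 50_000, 40
z995 = stats.norm.ppf(0.995)

# 41-year histories from the known model; fit on the first 40 years only.
histories = rng.lognormal(true_mu, true_sigma, size=(n_trials, n_years + 1))
log_x = np.log(histories[:, :n_years])
claimed_p995 = np.exp(log_x.mean(axis=1) + log_x.std(axis=1, ddof=1) * z995)

# Prediction frequency: how often the claimed percentile exceeds year 41.
freq = np.mean(claimed_p995 >= histories[:, n_years])
print("prediction frequency:", freq)              # typically below 0.995
```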
Figure 3 shows a prediction frequency analysis. We can choose to fit a lognormal model (green) or generalised Pareto model (grey). On real data, we can't choose the 'true' model, as this is unknown. The beauty of a Monte Carlo setting is that we know exactly what model generated the data. The solid lines show the prediction frequency where we have correctly guessed the true model family. This is sometimes called a 'consistency test'. These prediction frequencies lie below the 99.5% we might have hoped for, because of the effect of using parameter estimates based on 40 years' data, rather than the 'true' parameters. In other words, if we input a target 99.5 percentile, what comes out might represent only a 97% or 98% prediction percentile. The dashed lines show the effect of fitting a mis-specified model, which is called a 'robustness test'. In this example, the worst case is fitting a GPD when the data comes from a lognormal distribution. If we want the prediction probability to stay as close as possible to 99.5% even when the model is mis-specified, it is better to fit lognormal distributions.
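The companion robustness test can be sketched in the same way: the histories still come from a lognormal, but the fitted model is now a GPD. Because the GPD here is fitted by matching moments (an assumption of this sketch), the numbers will not match Figure 3 exactly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2013)
true_mu, true_sigma = np.log(20.0), 0.9           # data still comes from a lognormal
n_trials, n_years = 50_000, 40

histories = rng.lognormal(true_mu, true_sigma, size=(n_trials, n_years + 1))
past = histories[:, :n_years]

# Mis-specified fit: GPD shape and scale from the sample mean and standard deviation.
m, s = past.mean(axis=1), past.std(axis=1, ddof=1)
xi = 0.5 * (1.0 - (m / s) ** 2)
beta = m * (1.0 - xi)
claimed_p995 = stats.genpareto.ppf(0.995, c=xi, scale=beta)

freq = np.mean(claimed_p995 >= histories[:, n_years])
print("prediction frequency, GPD fit to lognormal data:", freq)
```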

Working backwards
Following the Tsanakas-Gerrard technique, we ask what initial percentile achieves a 99.5% prediction frequency, when fitting a lognormal distribution. We can use Figure 4 to consider the consequences of using a one-in-2,000 quantile of the fitted GPD to estimate a one-in-200 event. This produces an exceedance probability of at most one-in-200 (that is, a prediction frequency of at least 99.5%) if the data has really come from a GPD, and also for lognormal data provided the 99.5 percentile does not exceed fifteen times the median. Generally, the more skewed the underlying distribution, the higher the percentile you have to pick to get the prediction interval you want.
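The sketch below mimics that calculation on simulated data: lognormal histories, a moment-fitted GPD, and the one-in-2,000 quantile claimed as the capital estimate. The 'true' parameters and the fitting method are our assumptions, so the numbers will not reproduce Figure 4 exactly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2013)
true_mu, true_sigma = np.log(20.0), 0.9           # 99.5 percentile roughly 10x the median
n_trials, n_years = 50_000, 40

histories = rng.lognormal(true_mu, true_sigma, size=(n_trials, n_years + 1))
past = histories[:, :n_years]

# Moment-based GPD fit to each simulated 40-year history.
m, s = past.mean(axis=1), past.std(axis=1, ddof=1)
xi = 0.5 * (1.0 - (m / s) ** 2)
beta = m * (1.0 - xi)

# Claim the one-in-2,000 quantile of the fitted GPD, then test coverage of year 41.
claimed = stats.genpareto.ppf(1.0 - 1.0 / 2000.0, c=xi, scale=beta)
freq = np.mean(claimed >= histories[:, n_years])
print("prediction frequency:", freq)              # target: at least 0.995
```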
One way to interpret this is to say that, if we will tolerate a ruin probability of one-in-200, we have to plan for a one-in-2,000 event from a best-estimate model, with up to another nine-in-2,000 chance of ruin arising from the model or parameters turning out to be wrong (because 1/2,000 + 9/2,000 = 10/2,000 = 1/200). At the one-in-200 level, the problem of estimating the wrong model or parameters is nine times bigger than the risk captured within your best-fit model.

Conclusion
When data is scarce, there can be many models capable of passing validation, with a wide range of implied capital requirements. Benchmarking different models can help us to understand the scope for human bias in model selection. Taking random data from one model and using it to fit another can inform us of the potential for mis-estimation, which is exacerbated by the skew of the distribution. We have given an indication of the possible uplift required to modelled percentiles to allow for the risk that the model was incorrectly selected. Model builders can reduce the risk of nasty surprises and get a step ahead by investigating plausible models that both fit the data and produce more significant tail events than those captured in the available historical data.
We all hope a $500bn catastrophe won't happen, but if it does we should be prepared.