**Karol Gawlowski, Christian Richard and Dylan Liew show how Shapley values can be used to make opaque models more transparent**

It is becoming evident that actuaries will need to embrace and adapt to the age of machine learning. Techniques such as gradient boosting machines (GBM) and neural networks (NN) have been shown to outperform ‘traditional’ generalised linear models (GLMs), but seem to require a trade-off in terms of the model output’s explainability. With actuaries operating under strict frameworks, it is unsurprising that many prefer to use less powerful but more explainable models.

What if we could use models that are both high-performance and interpretable? Techniques are now being developed in explainable artificial intelligence to achieve this. One of the most popular comes from a novel application of game theory, and can explain any type of model.

**An opaque model**

Imagine we have a pricing model that predicts claim severity on a policy and takes variables *x*_{1}, *x*_{2} and *x*_{3} as inputs. For policy *d*_{1}, the model predicts a claim severity of £30. The mean claim severity for this line of business is £20, so the model is suggesting that the values of the inputs add £10 to this policy (above the mean). Policyholder *d*_{1} asks what is driving their premium – *x*_{1}, *x*_{2} or *x*_{3}. What is the cause of this additional £10?

The model uses machine learning such as deep learning or extreme gradient boosting with hundreds of parameters and interaction effects, so we cannot just rely on reading coefficients. We could instead imagine *x*_{1}, *x*_{2} and *x*_{3} as players in a prediction ‘game’ who all want to claim credit for the pay-out value of £10, and consider counterfactual ‘what-if’ scenarios.

Perhaps we remove player *x*_{1} and the model predicts a claim severity of £18. Could we then say that when the model only uses variables *x*_{2} and *x*_{3}, it predicts this policy will experience –£2 claim severity compared to the average? Could we also say that *x*_{1} adds +£12 expected claim severity to the game? What if *x*_{1} is interacting with *x*_{2} and *x*_{3}?

Let’s consider a game (or model) that only uses *x*_{1}. Perhaps, in this case, £23.50 claim severity is predicted – £3.50 above the mean prediction. Has *x*_{1} contributed £12 or £3.50 towards the (excess mean) prediction? Is it an average of the two? What about other ways the game can be played?

**Considering all variables**

Suppose we extend this logic to all possible combinations of parameters the model could use – {*x*_{1}, *x*_{2}, *x*_{3}}, {*x*_{2}, *x*_{3}}, {*x*_{3}}, … and so on – and then in every possible combination we work out the impact of adding each variable to the prediction (*Figure 1*).
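Enumerating ‘all possible combinations’ simply means working through the power set of the features. A minimal sketch:

```python
from itertools import combinations

features = ["x1", "x2", "x3"]

# Every subset of features the model could use, from {} up to {x1, x2, x3}
subsets = [set(combo)
           for size in range(len(features) + 1)
           for combo in combinations(features, size)]
# 2**3 = 8 possible 'games'
```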

We then consider what each variable has added, on average, to the prediction across all possible scenarios. As *Figure 1* shows, the contribution of *x*_{1} is

(£3.50 + £3.50 + £2.00 + £10.50 + £12.00 + £12.00) / 6 = £7.25

while the contributions of *x*_{2} and *x*_{3} are £4.25 and –£1.50 respectively.

Why go to all this effort? It can be shown that such an allocation is unique in satisfying several desirable properties all at once:

- It is efficient – in this case, the sum of contributions matches the extra £10 predicted claims severity above the mean (£7.25 + £4.25 – £1.50 = £10).
- It is symmetrical – if two players contribute the same amount to all possible combinations, they will be given the same allocation.
- It allocates zero value to null players (players that don’t contribute anything to any group).
- The allocation is additive when ‘subgames’ are considered.
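For this three-variable game, the averaging can be written out exactly. In the sketch below, the coalition values marked ‘assumed’ are not given in the article (they would be read off *Figure 1*), but are chosen to be consistent with every figure quoted in the text:

```python
from itertools import permutations

# Value of each coalition = predicted claim severity minus the £20 mean.
# Figures for {}, {x1}, {x2,x3} and {x1,x2,x3} come from the article;
# the others are ASSUMED for illustration.
v = {
    frozenset(): 0.0,
    frozenset({1}): 3.5,         # £23.50 - £20 (x1 alone)
    frozenset({2}): 5.0,         # assumed
    frozenset({3}): -5.0,        # assumed
    frozenset({1, 2}): 7.0,      # assumed
    frozenset({1, 3}): 5.5,      # assumed
    frozenset({2, 3}): -2.0,     # £18 - £20 (x1 removed)
    frozenset({1, 2, 3}): 10.0,  # £30 - £20 (full model)
}

def shapley(player, players=(1, 2, 3)):
    """Average the player's marginal contribution over all join orders."""
    orders = list(permutations(players))
    total = 0.0
    for order in orders:
        before = frozenset(order[:order.index(player)])
        total += v[before | {player}] - v[before]
    return total / len(orders)

phi = {p: shapley(p) for p in (1, 2, 3)}
# phi[1] = 7.25, phi[2] = 4.25, phi[3] = -1.5; they sum to 10 (efficiency)
```

Whatever interior values *Figure 1* actually contains, the same enumeration applies; the quoted marginals and the £10 efficiency total pin down the three allocations.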

Such allocations are called ‘Shapley values’ after the economist Lloyd Shapley, who first proposed the method in the 1950s and went on to win the Nobel Prize in Economic Sciences in 2012.

**Shapley Additive exPlanations**

A Python package called SHAP (SHapley Additive exPlanations) is a popular implementation used to calculate approximate Shapley values for models.

The example in *Figure 1* has only three variables and can be calculated exhaustively, but for a model of *n* variables we require 2^{n} possible model evaluations. A model with 15 variables would, for example, require 2^{15} (32,768) scenarios. To speed up this calculation, SHAP uses a specific sample of 2*n* + 2,048 models. With a model of 15 features, only 2,078 models out of 32,768 are calculated.

SHAP prioritises models with very few or very many features, and tends to disregard models with a ‘medium’ number of features. The idea is that when a model contains very few features (close to zero), adding a new one can tell you a lot about its ‘direct contribution’. However, this is in the absence of other features, so these models tell you little about its interaction effects. Conversely, if a model has, say, *n* – 1 features, adding the last remaining variable tells you little about its direct contribution but a lot about its interaction contribution, because it is interacting with all possible variables.

In the ‘medium’ case, when a variable is added to a model which contains half its features, it is hard to unpick its direct contribution because it is interacting with other features. It is also hard to unpick its interaction effects, because the model is still missing features. However, SHAP does include a small random sample of some of these models, introducing a stochastic element to its allocations.
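This preference can be sketched with the coalition weighting function used in the Kernel SHAP formulation. The formula below is the standard Shapley kernel, and the budget line mirrors the 2*n* + 2,048 sampling rule described above:

```python
from math import comb

def shapley_kernel(M, s):
    """Shapley kernel weight for coalitions with s of M features present
    (0 < s < M): highest at the extremes, lowest in the middle."""
    return (M - 1) / (comb(M, s) * s * (M - s))

M = 15
weights = {s: shapley_kernel(M, s) for s in range(1, M)}
budget = 2 * M + 2048  # the 2n + 2,048 sampling budget -> 2,078 models
```

Coalitions of size 1 and *M* – 1 receive the largest weights, while the ‘medium’ sizes near *M*/2 receive the smallest, which is exactly the prioritisation described above.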

SHAP also addresses how to make a prediction when a variable is excluded from the model. Recall that we are trying to explain how a model works. Consider a scenario where the model is missing variable *x*_{1}. If we rebuild the model without *x*_{1}, it is no longer the model we’re trying to explain. We need to make sure the model is fixed under every scenario. However, if we build the model with *x*_{1}, it will need some value for *x*_{1} to produce a prediction. Most models cannot take null or missing input. SHAP instead uses a ‘background’ dataset to mimic *x*_{1} being missing from the dataset, similar to imputing its value.

“SHAP’s ability to consistently unpick and understand any type of model is what makes it so powerful”

**Applying SHAP**

Let us return to policy *d*_{1} and introduce the actual values of *x*_{1}, *x*_{2} and *x*_{3} (*Table 1*).

If SHAP is required to predict the number of claims without *x*_{1}, it will first take random samples from *x*_{1} with replacement (bootstrapping) – say, 1,000 times. Each of these 1,000 random samples is then attached to the same values for *x*_{2} and *x*_{3} – effectively creating 1,000 new synthetic datapoints (where *x*_{2} and *x*_{3} remain the same but *x*_{1} varies). It then uses the model in question and predicts the number of claims for the 1,000 synthetic datapoints – let us call them *y*_{1}, …, *y*_{1,000}.

The average of *y*_{1}, …, *y*_{1,000} is computed and said to represent what the model predicts when *x*_{1} is missing. *x*_{1} is treated as behaving almost like a random white-noise variable. Notice that this introduces another stochastic element to SHAP, which means results may vary between runs.
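A minimal sketch of this sampling scheme, with a hypothetical stand-in model and made-up background data (none of these values come from the article):

```python
import numpy as np

rng = np.random.default_rng(0)

# HYPOTHETICAL stand-in for the fitted model we are trying to explain;
# the coefficients are illustrative only.
def model(X):
    return 20.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2]

background_x1 = rng.normal(size=5000)  # observed x1 values ('background' data)
x2_d1, x3_d1 = 1.0, 2.0                # policy d1's actual x2 and x3 (made up)

# Bootstrap x1 from the background 1,000 times, keeping x2 and x3 fixed
x1_samples = rng.choice(background_x1, size=1000, replace=True)
synthetic = np.column_stack([
    x1_samples,
    np.full(1000, x2_d1),
    np.full(1000, x3_d1),
])

# Average prediction over the synthetic points stands in for the
# prediction 'without x1'
prediction_without_x1 = model(synthetic).mean()
```

Because the bootstrap is random, `prediction_without_x1` will wobble slightly between runs – the stochastic element noted above.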

SHAP thus assumes that all features are independent, because the missing features are drawn without varying the features present in the model. In *Table 1*, *x*_{2} and *x*_{3} remain fixed for all random values of *x*_{1}. But what if there is some collinearity or dependency structure between the variables? Synthetic datapoints such as (0, 1, 2) in the first row may be impossible to observe in the real world, which could lead to errors in SHAP’s approximation of Shapley values.

Models not included in the sample are not thrown away – their predictions are approximated by linear regression. Going back to a model with 15 features, SHAP fits a line of best fit to the predictions from the 2,078 models included in the sample, to approximate the predictions from the (32,768 – 2,078 =) 30,690 missing models. The coefficients from this linear regression are ultimately used as SHAP’s approximation of Shapley values.
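The regression step can be sketched for the three-variable game. The coalition values below are hypothetical (only the 3.5, –2.0 and 10.0 excess figures come from the article), and the kernel weights and efficiency constraint follow the Kernel SHAP formulation:

```python
import numpy as np
from math import comb

M = 3
# Binary vectors: which of (x1, x2, x3) each sampled 'model' includes
coalitions = np.array([
    [1, 0, 0], [0, 1, 0], [0, 0, 1],
    [1, 1, 0], [1, 0, 1], [0, 1, 1],
])
# Excess predictions v(S) for each coalition; values other than 3.5,
# -2.0 and the full-model total of 10.0 are ASSUMED for illustration
v = np.array([3.5, 5.0, -5.0, 7.0, 5.5, -2.0])
v_full = 10.0

# Shapley kernel weights: largest for very small / very large coalitions
sizes = coalitions.sum(axis=1)
w = (M - 1) / (np.array([comb(M, int(s)) for s in sizes]) * sizes * (M - sizes))

# Enforce efficiency (phi_1 + phi_2 + phi_3 = v_full) by eliminating phi_3,
# then solve the weighted least-squares 'line of best fit'
Z = coalitions[:, :2] - coalitions[:, [2]]
y = v - coalitions[:, 2] * v_full
sw = np.sqrt(w)
phi12, *_ = np.linalg.lstsq(Z * sw[:, None], y * sw, rcond=None)
phi = np.append(phi12, v_full - phi12.sum())
# phi recovers the exact Shapley allocation [7.25, 4.25, -1.5]
```

With all coalitions included, the weighted regression reproduces the exact Shapley values; with only a sample, its coefficients become the approximation described above.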

**Visualising the model**

SHAP comes with visualisation tools to aid understanding. Going back to policy *d*_{1} and Shapley values of 7.25, 4.25 and –1.5, SHAP often represents these in waterfall-type charts (*Figure 2*).

This shows us that *x*_{3} has a negative but small impact in producing the predicted 12 claims. *x*_{2} has a slightly larger positive impact, but the significant factor is clearly *x*_{1}.

_{1}Suppose we want to explain the predictions of a large portfolio of policies. We could produce separate waterfall charts to examine each of them one by one, but we may miss some big picture effects, and it would take an unfeasible amount of time to review each chart.

To fit all points on the same graph, SHAP:

- Replaces bars with dots (so there’s no overlap with the points)
- Removes the mean model prediction for convenience
- Rotates the graph by 90°.

It also colours the points based on their underlying input value.

If we add in two more hypothetical policies (say, *d*_{2} and *d*_{3}), the graph might look like *Figure 3*. This is known as a SHAP summary plot.

We can infer that *x*_{1} seems to drive the predictions up the most – not only for *d*_{1} but also for *d*_{2} and *d*_{3}. *x*_{2} appears to be second most significant, and *x*_{3} doesn’t seem to affect the predictions much.

From the colours, we can see not only that *x*_{1} drives up the claims in the model, but also that the lower its value, the higher the predictions. Perhaps the factor is something like policy excess?

**Bringing clarity**

This method is agnostic to whether the model is linear, non-linear, a GLM, GBM or NN, or has any particular form – the same methodology can be applied. SHAP’s ability to consistently unpick and understand any type of model is what makes it so powerful. While there are some limitations and assumptions, it still provides model-agnostic explanations that mean actuaries no longer need to sacrifice model accuracy for interpretability.

Combining the above with other explainable artificial intelligence techniques, such as partial dependence or accumulated local effects plots, could spell the end of the GLM. Advanced machine learning techniques can be used with confidence, without fear that the output can’t be explained or understood, allowing actuaries to adapt to the brave new world of artificial intelligence.

Github repo: github.com/karol-gawlowski/ifoa_sha


**Karol Gawlowski** is a senior actuarial analyst at AXA XL London

**Christian Richard** is a life actuary at Zurich International

**Dylan Liew** is the deputy chair of the IFoA’s Explainable Artificial Intelligence Working Party and a data science actuary at Bupa

*Image credit | Getty*