**Francesca Perla, Ronald Richman, Salvatore Scognamiglio and Mario Wüthrich investigate time series forecasting of mortality using neural networks and deep learning techniques**

Recent advances in machine learning have been propelled by deep learning techniques, a modern approach to applying neural networks to large-scale prediction tasks. Many of these advances have been in computer vision and natural language processing – for example, the accuracy of models built to classify the 14m images in the ImageNet database has steadily increased since 2011, according to the Papers with Code website. Characteristically, the models used within these fields are specialised to deal with the types of data that must be processed to produce predictions. For example, when processing text data, which conveys meaning via the placement of words in a specific order, models that incorporate sequential structures are usually used.

Interest in applying deep learning to actuarial topics has grown, and there is now a body of research illustrating these applications across the actuarial disciplines, including mortality forecasting. Deep learning is a promising technique for actuaries due to the strong links between these models and the familiar technique of generalised linear models (GLMs). Wüthrich (2019) discusses how neural networks can be seen as generalised GLMs that first process the data input to the network to create new variables, which are then used in a GLM to make predictions (this is called ‘representation learning’). This is illustrated in Figure 1. By deriving new features from input data, deep learning models can solve difficult problems of model specification, making these techniques promising for analysing complex actuarial problems such as multi-population mortality forecasting.

### The Lee-Carter Model

Mortality rates, and the rates at which mortality rates are expected to change over time, are basic inputs into a variety of actuarial models. A starting point for setting mortality improvement assumptions is often population data, from which assumptions can be derived using mortality forecasting models. One of the most famous of these is the Lee-Carter (LC) model (Lee and Carter, 1992), which defines the force of mortality as:

$$\log(\mu_{x,t}) = a_x + b_x k_t$$

This equation states that the (log) force of mortality at age $x$ in year $t$ is the base mortality at that age, $a_x$, plus the rate of change of mortality at that age, $b_x$, multiplied by a time index, $k_t$, that applies to all ages under consideration. Like most mortality forecasting models, the LC model is fitted in a two-stage process: first the parameters of the model are calibrated and then, for forecasting, the time index is extrapolated.
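The two-stage process can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in any particular study: it assumes the classical singular value decomposition fit and a random walk with drift for the time index, and it runs on synthetic data. The function names `fit_lee_carter` and `forecast_k` and all parameter values are hypothetical.

```python
import numpy as np

def fit_lee_carter(log_m):
    """Stage 1: fit log m[x, t] = a[x] + b[x] * k[t] using an SVD.

    log_m has shape (n_ages, n_years).
    """
    a = log_m.mean(axis=1)                    # base log-mortality per age
    centred = log_m - a[:, None]
    U, s, Vt = np.linalg.svd(centred, full_matrices=False)
    b = U[:, 0]
    k = s[0] * Vt[0]
    # Usual identifiability constraint: rescale so that sum(b) = 1.
    scale = b.sum()
    return a, b / scale, k * scale

def forecast_k(k, horizon):
    """Stage 2: extrapolate the time index as a random walk with drift
    (central forecast only, no uncertainty bands)."""
    drift = (k[-1] - k[0]) / (len(k) - 1)
    return k[-1] + drift * np.arange(1, horizon + 1)

# Synthetic data for illustration only (not real mortality experience).
rng = np.random.default_rng(0)
ages = np.arange(100)
years = np.arange(1950, 2000)
true_a = -9.0 + 0.08 * ages
true_b = np.full(ages.size, 1.0 / ages.size)
true_k = -(years - years.mean())
noise = rng.normal(0.0, 0.01, size=(ages.size, years.size))
log_m = true_a[:, None] + np.outer(true_b, true_k) + noise

a, b, k = fit_lee_carter(log_m)
k_future = forecast_k(k, horizon=3)
log_m_future = a[:, None] + np.outer(b, k_future)   # forecast log rates
```

The choice of fitting period enters through the columns of `log_m`: a different window changes the estimated drift, and hence every forecast.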

The LC model is usually applied to forecast the mortality of a single population – but forecasts are often needed for multiple populations simultaneously. While the LC model could be applied to each population separately, the period over which the model is fitted needs to be chosen carefully so that the rates of change in mortality over time correctly reflect expectations about the future. A strong element of judgment is therefore needed, which makes the LC model less suitable for multi-population forecasting.

### Mortality forecasting using deep learning

Recently, several papers have applied deep neural networks to forecast mortality rates. This article focuses on the model in our recent paper (Perla et al., 2020), which applies specialised neural network architectures to model two mortality databases: the Human Mortality Database (HMD), containing mortality information for 41 countries, and the associated United States Mortality Database (USMD), providing life tables for each state.

Our goal is to investigate whether, in common with the findings in the wider machine learning literature, neural networks specialised to process time series data can produce more accurate mortality forecasts than those produced by general neural network architectures. We also want to develop a model that is adaptable to changes in mortality rates by avoiding the need to follow a two-step calibration process. Thus, our model directly processes time series of mortality data with the goal of outputting new variables that can be used for forecasting. Finally, we wish to preserve the form of the LC model, due to the simplicity with which this model can be interpreted.

### Convolutional neural networks

Here, we focus on the convolutional neural network (CNN) presented in our paper. A CNN works by directly processing matrices of data that are input into the network, which could represent images or time series. We present a toy example of how this works in Figure 2. Data processing is accomplished by multiplying the data matrix with a ‘filter’, which is a smaller matrix composed of parameters that are calibrated when fitting the model. Each filter is slid across the entire input data matrix, resulting in a processed matrix called a ‘feature map’. By calibrating the parameters of the filter in a suitable manner, CNNs can derive feature maps that represent important characteristics of the input data. See *Figure 2’s* caption for more detail.
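To make the filtering operation concrete, here is a small sketch in Python. It is not the example from Figure 2: the 4×4 data matrix and the 2×2 filter are hypothetical, and the filter values are fixed by hand rather than calibrated. The helper `apply_filter` is an illustrative name, not a library function.

```python
import numpy as np

def apply_filter(data, filt):
    """Slide `filt` over `data` (no padding, step of one) and return
    the resulting feature map."""
    f_rows, f_cols = filt.shape
    n_rows = data.shape[0] - f_rows + 1
    n_cols = data.shape[1] - f_cols + 1
    feature_map = np.empty((n_rows, n_cols))
    for i in range(n_rows):
        for j in range(n_cols):
            window = data[i:i + f_rows, j:j + f_cols]
            # Element-wise product of window and filter, then sum.
            feature_map[i, j] = np.sum(window * filt)
    return feature_map

# A matrix whose values jump from 0 to 1 between columns 2 and 3.
data = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1],
                 [0, 0, 1, 1],
                 [0, 0, 1, 1]], dtype=float)

# A hand-picked filter that responds to left-to-right increases.
edge_filter = np.array([[-1.0, 1.0],
                        [-1.0, 1.0]])

feature_map = apply_filter(data, edge_filter)
# Each row of feature_map is [0, 2, 0]: the filter fires only where
# the jump occurs.
```

In a fitted CNN the filter entries would be parameters learned from data, so the network itself decides which patterns are worth detecting.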

### Defining the model

The CNN we apply for mortality forecasting works in a similar manner: we populate a matrix with mortality rates at ages 0-99, observed over 10 years for each population and gender. This matrix is processed by multiplying the observed values of mortality rates with filters that span the entire age range of the matrix and extend over three years, as shown in the top part of *Figure 3.* The filters derive a feature map that feeds into the rest of the model.
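The geometry described above can be sketched as follows. The input values are random placeholders rather than real mortality rates, the number of filters (`N_FILTERS`) is a hypothetical choice, and the filter weights are random rather than calibrated; only the shapes (100 ages, 10 years, filters three years wide) come from the text.

```python
import numpy as np

N_AGES, N_YEARS, WIDTH = 100, 10, 3   # ages 0-99, 10 years, 3-year filters
N_FILTERS = 4                          # hypothetical number of filters

rng = np.random.default_rng(1)
# Placeholder input for one population and gender (not real data).
rates = rng.uniform(0.0001, 0.3, size=(N_AGES, N_YEARS))
# Each filter covers every age and WIDTH consecutive years.
filters = rng.normal(0.0, 0.01, size=(N_FILTERS, N_AGES, WIDTH))

def convolve_over_time(x, filters):
    """Apply full-height filters to x, sliding only along the time axis."""
    width = filters.shape[2]
    n_positions = x.shape[1] - width + 1
    # windows has shape (n_positions, n_ages, width)
    windows = np.stack([x[:, t:t + width] for t in range(n_positions)])
    # Sum over ages (a) and years-in-window (w) for each filter (f)
    # and window position (p).
    return np.einsum("paw,faw->fp", windows, filters)

feature_map = convolve_over_time(rates, filters)   # shape (4, 8)
```

Because each filter spans the whole age range, it can only move along the time axis, so ten years of data and a three-year filter give eight positions per filter.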

We also provide the model with variables representing the country being analysed and the gender of the population. To encode these variables, we apply a technique that maps categorical variables to low dimensional vectors called embeddings. In other words, each level of the category is mapped to a vector containing several new parameters – specifically a five-dimensional embedding layer, shown in the middle part of *Figure 3*.
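An embedding is, in effect, a lookup table with one row of parameters per category level. The sketch below assumes a hypothetical subset of three country codes and random initial values; in a fitted network the rows would be calibrated together with the other parameters. Only the dimension of five is taken from the text.

```python
import numpy as np

EMBED_DIM = 5                          # five-dimensional, as in the text
countries = ["ITA", "CHE", "USA"]      # hypothetical subset of populations
genders = ["female", "male"]

rng = np.random.default_rng(2)
# One row of EMBED_DIM parameters per level (random here, learned in practice).
country_embeddings = rng.normal(0.0, 0.1, size=(len(countries), EMBED_DIM))
gender_embeddings = rng.normal(0.0, 0.1, size=(len(genders), EMBED_DIM))

country_index = {name: i for i, name in enumerate(countries)}
gender_index = {name: i for i, name in enumerate(genders)}

def embed(country, gender):
    """Return the embedding vectors for one country and gender."""
    return (country_embeddings[country_index[country]],
            gender_embeddings[gender_index[gender]])

country_vec, gender_vec = embed("CHE", "male")
```

Compared with one-hot encoding, this keeps the number of inputs per variable fixed at five regardless of how many countries are included.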

Finally, we use the feature map and the embeddings directly in a GLM to forecast mortality rates in the next year. No other model components process the features before they enter the GLM. This is represented in the last part of *Figure 3*, which shows the direct connection of the output of the network to the feature layer.
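The final step can be sketched as a single linear predictor applied to the concatenated features. The sizes, the random weights and the log link used here are assumptions for illustration only; the paper's exact output layer may differ.

```python
import numpy as np

N_AGES = 100
rng = np.random.default_rng(3)

# Stand-ins for the three sets of features (sizes are hypothetical).
country_vec = rng.normal(size=5)       # country embedding
gender_vec = rng.normal(size=5)        # gender embedding
cnn_features = rng.normal(size=32)     # flattened feature map

features = np.concatenate([country_vec, gender_vec, cnn_features])

# GLM parameters: one intercept and one weight vector per age.
intercept = np.full(N_AGES, -5.0)
weights = rng.normal(0.0, 0.01, size=(N_AGES, features.size))

linear_predictor = intercept + weights @ features
predicted_rates = np.exp(linear_predictor)   # log link assumed here

# Because the predictor is linear, it splits into one additive
# contribution per feature set.
from_country = weights[:, :5] @ country_vec
from_gender = weights[:, 5:10] @ gender_vec
from_cnn = weights[:, 10:] @ cnn_features
```

The additive split in the last three lines is what makes an interpretation of the network in Lee-Carter terms possible.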

### Results

We calibrated this model to the mortality experience in the HMD in the years 1950-1999 and tested the out-of-sample forecasting performance of the model on the experience in the years 2000-2016. The benchmarks against which the model was tested were the original LC model and the deep learning model from Richman and Wüthrich (2019), which is constructed without a processing layer geared towards time series data. We found that the out-of-sample forecasts of the CNN model were more accurate than those of the LC model 75 out of 76 times, and that the CNN model significantly outperformed the benchmark deep learning model. Residuals from the models are shown in Figure 4, indicating that while both deep learning models have better forecasting performance than the LC model, the CNN model fits the data for males significantly better than any other model. In the paper, we also show that the CNN model works well on the data in the USMD without any modifications.

### Interpretation within the Lee-Carter Paradigm

Deep learning has been criticised as difficult to interpret. We can provide an intuitive explanation of how the convolutional model works in the framework of the LC paradigm for mortality forecasting. As mentioned above, the three sets of features derived with the neural network – relating to population, gender and those derived using the convolutional network – are used directly in a GLM to forecast mortality. We show this mathematically using simplified notation in the following equation:

$$\log(\hat{\mu}_{x,t}) = \hat{a}_x^{\text{pop}} + \hat{a}_x^{\text{gender}} + b_x \hat{k}_t$$

This states that the neural network predicts mortality based on new variables that have been estimated from the data, represented as variables with a ‘hat’. The first two of these, $\hat{a}_x^{\text{pop}}$ and $\hat{a}_x^{\text{gender}}$, play the role of estimating the average mortality for the population and gender under consideration, respectively, and in combination are equivalent to the base mortality term $a_x$ in the Lee-Carter model. The third of these variables, $\hat{k}_t$, is a time index derived directly from the mortality data, which is equivalent to the time index term $k_t$ in the LC model. This time index is calibrated each time new data is fed to the network, meaning we have eliminated the two-stage procedure of fitting the model and then producing forecasts through extrapolation.

The seemingly complex model presented can therefore be interpreted in terms that are familiar to actuaries working in mortality forecasting.

**Francesca Perla** is professor for financial mathematics at Parthenope University of Naples

**Ronald Richman** is an associate director (R&D and Special Projects) at QED Actuaries and Consultants

**Salvatore Scognamiglio** is a postdoctoral research fellow at Parthenope University of Naples

**Mario Wüthrich** is professor for actuarial science at ETH Zürich