**Jonathan Khanlian outlines how the machine learning technique of reinforcement learning can be viewed from an actuarial perspective**

Reinforcement learning (RL) is a machine learning technique that features an agent learning in a real or simulated environment. Alphabet’s DeepMind artificial intelligence subsidiary used RL to develop systems that could play video games such as Breakout and board games such as chess and Go at superhuman levels. RL has also been used to land rockets and spacecraft in simulation.

On the surface, this may seem to have nothing to do with actuarial science, but if we look under the bonnet of this technique, we’ll see equations with a striking resemblance to the equations seen in actuarial exams.

**States, actions and rewards**

A simplified way to think about an agent learning in an environment, and what that means, is to imagine two interconnected functions glued together through their input and output. The agent is one function and the environment is the other. The agent function takes in a state output from the environment, does some computation according to its policy, and then returns an output called an action. The environment function takes that action as its input and computes and returns the next state as its output. Examples of states and actions could include:

**• States (of the environment)**

- Sensor readings of a robot
- Board configuration in chess
- Capital market indices
- The last three frames of pixels

**• Actions (our agent could take)**

- Open valve to turn on thruster
- Move pawn from E2 to E4
- Buy AA corporate bonds
- Move Breakout paddle left
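The two-function picture can be sketched in a few lines of Python. This is a minimal illustration only: the counter environment, the fixed policy and the names `environment`, `agent` and `run` are all hypothetical, chosen to show the state-in, action-out loop and nothing more.

```python
def environment(state: int, action: int) -> int:
    """Hypothetical environment: take the agent's action, return the next state."""
    return state + action


def agent(state: int, target: int = 3) -> int:
    """Hypothetical fixed policy: step towards the target, then stay put."""
    if state < target:
        return +1
    if state > target:
        return -1
    return 0


def run(start: int, steps: int) -> list:
    """Alternate agent and environment calls, recording each state visited."""
    states = [start]
    state = start
    for _ in range(steps):
        action = agent(state)                # state in, action out
        state = environment(state, action)   # action in, next state out
        states.append(state)
    return states


print(run(start=0, steps=5))  # → [0, 1, 2, 3, 3, 3]
```

Each function only ever sees the other's output, which is all that "glued together through their input and output" means here.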

This process of inputs/outputs being passed between the agent and environment functions is repeated over and over: state computed, action computed, state computed, action computed, and so on. If we want our agent to learn to do something over time, we need to give it some feedback, to encourage the behaviour we’d like to see. In RL, then, instead of the environment function simply passing the agent function a set of numbers that represents the state of the system, the environment also computes and passes the agent a second output called a reward. The rewards are chosen by the programmer, and the agent is programmed to try to maximise these rewards. You can think of the rewards as points in a game – or, to use a concept more familiar to actuaries, as cash flows. Expected discounted rewards are just as important for RL agents as expected discounted cash flows are for actuaries!

**• Rewards (our agent could receive)**

- +1000 – Our robot sensors indicate our lunar landing robot has touched down.
- 0 – The sum of black vs. white chess pieces remains unchanged.
- -100 – Our asset portfolio just dropped by US$100 in value.
- +5 – Another brick was smashed.
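To make the cash flow analogy concrete, here is a small sketch that discounts a sequence of rewards exactly as one would discount cash flows. The reward sequence and the discount factor of 0.9 are hypothetical, and the sketch assumes the first reward is received immediately and is therefore undiscounted.

```python
def discounted_return(rewards, discount):
    """Present value of a reward sequence, with the first reward undiscounted.

    Works backwards through the sequence: value now = reward now
    + discount * value from the next step onwards.
    """
    total = 0.0
    for reward in reversed(rewards):
        total = reward + discount * total
    return total


# Hypothetical sequence of rewards, one per time step.
rewards = [5, 5, 0, -100, 1000]
discount = 0.9

pv = discounted_return(rewards, discount)

# The same figure computed the long way, as a sum of discounted cash flows.
direct = sum(r * discount**t for t, r in enumerate(rewards))

print(round(pv, 4), round(direct, 4))
```

Both calculations come to roughly 592.7. The backward recursion is worth noting because it has the same shape as the recursive present value formulas that actuaries already use.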

**Bellman equations**

How does an agent learn to maximise its expected future rewards? That’s where a Bellman equation usually comes in. There are lots of variations of Bellman equations, but we’re going to stick to understanding and working with this one:

$$v_\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right]$$

In this formula, the function v(s) represents the expected present value of future rewards when starting in state s. The subscript π indicates that this value is dependent on the agent’s policy π, an algorithm that dictates what action it takes in each state. This equation decomposes the present value of expected rewards from state s under a policy π into its immediate reward R, plus a discounted expectation of the present value of rewards in the next state. Here γ is a discount factor.

Although the notation is new, hopefully this Bellman equation calls to mind the recursive formula for the present value of a set of cash flows or a recursive annuity formula. And perhaps it also reminds actuaries of Markov models. In fact, the state value function above is applicable to systems that can be modelled as a Markov decision process. There are also action value functions that look very similar but relate to the value of different actions in each state.
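To make the comparison concrete, consider the familiar recursion for a whole life annuity-due paying 1 per year, written in standard actuarial notation (chosen here purely for illustration):

$$\ddot{a}_x = 1 + v\,p_x\,\ddot{a}_{x+1} = 1 + v\left[\,p_x\,\ddot{a}_{x+1} + q_x \cdot 0\,\right]$$

The payment of 1 plays the role of the immediate reward R, the actuarial discount factor $v$ plays the role of γ, and the bracketed term is an expectation over the next state: alive at age $x+1$ with probability $p_x$, or dead with probability $q_x$ and no further value. One notational trap is that the actuarial $v$ is a discount factor, whereas $v(s)$ in the Bellman equation is a value function.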

**Refining the estimates**

The expected present value of future rewards that an agent estimates for each state will not be correct right away. The agent’s state value estimates at the beginning reflect what the programmer initialises them to be. The agent must learn the true value of the states under its current policy by taking actions and receiving feedback in the form of rewards from the environment (in certain situations, state values can be solved analytically, but this is usually not the case). Iteratively, over time, the agent uses this Bellman equation to refine its state value estimates by adjusting its current expectations in the direction of the actual rewards experienced. The agent is almost doing an actuarial experience study and repricing at each step.
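A minimal sketch of this refinement step follows. It assumes a temporal-difference style update, often called TD(0), which is one common way of nudging estimates towards experienced rewards and not necessarily the only variant that fits the description above. The three-state chain, its rewards, the discount factor and the step size are all hypothetical, and the environment is deterministic to keep the example easy to check by hand.

```python
GAMMA = 0.9   # discount factor (hypothetical value)
ALPHA = 0.1   # step size: how far each estimate moves towards new experience

# Hypothetical deterministic chain under a fixed policy:
# state -> (reward received, next state); None marks the end of an episode.
TRANSITIONS = {"A": (1.0, "B"), "B": (10.0, "C"), "C": (0.0, None)}


def evaluate_policy(episodes: int) -> dict:
    """Refine state value estimates from experience, one step at a time."""
    values = {state: 0.0 for state in TRANSITIONS}  # the programmer's initial guess
    for _ in range(episodes):
        state = "A"
        while state is not None:
            reward, next_state = TRANSITIONS[state]
            next_value = values[next_state] if next_state is not None else 0.0
            # Experienced reward plus discounted estimate of what comes next.
            target = reward + GAMMA * next_value
            # Move the current estimate part of the way towards that target.
            values[state] += ALPHA * (target - values[state])
            state = next_state
    return values


values = evaluate_policy(episodes=500)
print({state: round(value, 3) for state, value in values.items()})
```

For this chain the values can also be solved by hand: B is worth 10, and A is worth 1 + 0.9 × 10 = 10. The estimates start at zero and drift towards those figures as episodes accumulate, much like repricing after each new slice of experience.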

In some systems, the state value estimates converge (ie the set of state values stops changing) and the agent has found a consistent set of state values under its current policy. At that point, the Bellman equation above has been satisfied. Once the agent has an accurate estimate of the value of being in each state under its current policy, it can then update its policy in the following way: in each state, choose the ‘greedy’ action, the one offering the highest immediate reward plus discounted present value of the next state. After this algorithmic update, the agent has a better policy, but it is not done learning. The whole iterative process repeats. The agent again refines its estimate of the value of being in each state under its new policy by taking actions according to that policy and updating its present value estimates based on the actual rewards experienced. After the values of each state are determined under its new policy, the agent again updates its policy to choose the action with the new highest value in each state. This process continues until neither the state value estimates nor the policy changes any further. At that point, your agent has learned an optimal policy that maximises its expected rewards, and you’re done – your agent is now an expert decision-maker in this environment.
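The evaluate-then-improve cycle described above can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: it assumes the environment's transitions and rewards are known, so values are found by repeated sweeps rather than sampled experience, and the four-position line, the rewards and the discount factor are all hypothetical.

```python
GAMMA = 0.9              # discount factor (hypothetical value)
STATES = [0, 1, 2, 3]    # positions on a line; state 3 is the terminal goal
ACTIONS = [-1, +1]       # step left or step right


def step(state, action):
    """Hypothetical environment model: return (reward, next_state)."""
    next_state = min(max(state + action, 0), 3)
    reward = 10.0 if next_state == 3 else -1.0
    return reward, next_state


def action_value(state, action, values):
    """Immediate reward plus discounted value of the state the action leads to."""
    reward, next_state = step(state, action)
    return reward + GAMMA * values[next_state]


def evaluate(policy, sweeps=200):
    """Apply the Bellman equation repeatedly until values settle for this policy."""
    values = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        for s in STATES[:-1]:   # the terminal state keeps a value of 0
            values[s] = action_value(s, policy[s], values)
    return values


def improve(values):
    """Greedy update: in each state, pick the action with the highest value."""
    return {s: max(ACTIONS, key=lambda a: action_value(s, a, values))
            for s in STATES[:-1]}


def policy_iteration():
    policy = {s: -1 for s in STATES[:-1]}   # deliberately poor start: always left
    while True:
        values = evaluate(policy)
        new_policy = improve(values)
        if new_policy == policy:            # nothing changed: steady state reached
            return policy, values
        policy = new_policy


policy, values = policy_iteration()
print(policy)  # → {0: 1, 1: 1, 2: 1}
```

The agent starts by always moving left and ends by always moving right, with state values of roughly 6.2, 8 and 10 for positions 0, 1 and 2. Each pass through the loop fixes the policy in at least one more state, which is the "better policy, but not done learning" behaviour in miniature.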

**A technique with potential**

This is just one algorithmic approach in RL; it won’t work in all types of systems, but for some it will. Although some details were glossed over here, this example does capture a lot of RL’s key concepts. For actuaries who are interested in autonomous systems, RL is a great framework for exploring them. The fields of RL and dynamic programming are still being explored, but these kinds of systems have already been used to develop production-grade solutions in various industries, including dispatch systems in the trucking industry, data centre cooling in the IT industry, and robotic simulations in the engineering industry, to name a few.

If you are interested in learning more about RL, there are several places to start: the University of Alberta offers a series of courses on the topic via Coursera, there is a Deep RL Bootcamp lecture series on YouTube, and Richard S Sutton and Andrew G Barto’s book Reinforcement Learning: An Introduction is a standard text.

**Jonathan Khanlian** is a senior actuary at MetLife.