
Environment agents: Introduction to reinforcement learning

Thursday 2nd September 2021
Jonathan Khanlian

Jonathan Khanlian outlines how the machine learning technique of reinforcement learning can be viewed from an actuarial perspective


Reinforcement learning (RL) is a machine learning technique in which an agent learns in a real or simulated environment. Alphabet’s artificial intelligence subsidiary DeepMind used RL to develop systems that play video games such as Breakout and board games such as chess and Go at superhuman levels. RL has also been used to land rockets and spacecraft in simulation.

On the surface, this may seem to have nothing to do with actuarial science, but if we look under the bonnet of this technique, we’ll see equations that bear a striking resemblance to those seen in actuarial exams.

States, actions and rewards
A simplified way to think about an agent learning in an environment is to imagine two interconnected functions, each one’s output feeding the other’s input. The agent is one function and the environment is the other. The agent function takes in a state output by the environment, does some computation according to its policy, and returns an output called an action. The environment function takes that action as its input and computes and returns the next state as its output. Examples of states and actions could include:

 
• States (of the environment)

  • Sensor readings of a robot
  • Board configuration in chess
  • Capital market indices
  • The last three frames of pixels

• Actions (our agent could take)

  • Open valve to turn on thruster
  • Move pawn from E2 to E4
  • Buy AA corporate bonds 
  • Move Breakout paddle left
     

This process of passing inputs and outputs between the agent and environment functions is repeated over and over: state computed, action computed, state computed, action computed, and so on. If we want our agent to learn to do something over time, we need to give it feedback that encourages the behaviour we’d like to see. In RL, then, instead of simply passing the agent function a set of numbers representing the state of the system, the environment also computes and passes the agent a second output called a reward. The rewards are chosen by the programmer, and the agent is programmed to try to maximise them. You can think of the rewards as points in a game – or, to use a concept more familiar to actuaries, as cash flows. Expected discounted rewards are just as important for RL agents as expected discounted cash flows are for actuaries! (A short code sketch of this interaction loop follows the examples below.)

• Rewards (our agent could receive)

  • +1000 – Our sensors indicate that our lunar landing robot has touched down.
  • 0 – The balance of black versus white chess pieces remains unchanged.
  • -100 – Our asset portfolio has just dropped by US$100 in value.
  • +5 – Another brick was smashed.
Figure 1: RL optimisation process.
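To make this loop concrete, here is a minimal Python sketch of the agent-environment interaction described above. The toy Environment and Agent classes, the reward definition and the method names are invented purely for illustration; they are not taken from the article or from any particular RL library.

```python
class Environment:
    """Toy environment: the state is a single number the agent tries to drive to zero."""
    def __init__(self):
        self.state = 10.0

    def step(self, action):
        # Apply the agent's action, then return the next state and a reward.
        self.state += action
        reward = -abs(self.state)  # closer to zero means a larger (less negative) reward
        return self.state, reward


class Agent:
    """Toy agent with a fixed policy: always nudge the state towards zero."""
    def act(self, state):
        return -1.0 if state > 0 else 1.0


env, agent = Environment(), Agent()
state = env.state
for t in range(5):
    action = agent.act(state)         # agent function: state -> action
    state, reward = env.step(action)  # environment function: action -> (next state, reward)
    print(t, state, reward)
```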

 
Bellman equations
How does an agent learn to maximise its expected future rewards? That’s where a Bellman equation usually comes in. There are many variations of Bellman equations (including the Bellman optimality equations), but we’re going to stick to understanding and working with this one:

$$v_\pi(s) = \mathbb{E}_\pi\left[\, R_{t+1} + \gamma \, v_\pi(S_{t+1}) \mid S_t = s \,\right]$$

In this formula, the function v_π(s) represents the expected present value of future rewards when starting in state s. The subscript π indicates that this value depends on the agent’s policy π, an algorithm that dictates which action it takes in each state. The equation decomposes the present value of expected rewards from state s under policy π into the immediate reward R, plus a discounted expectation of the present value of rewards in the next state. Here γ is the discount factor.

Although the notation is new, hopefully this Bellman equation calls to mind the recursive formula for the present value of a set of cash flows or a recursive annuity formula. And perhaps it also reminds actuaries of Markov models. In fact, the state value function above is applicable to systems that can be modelled as a Markov decision process. There are also action value functions that look very similar but relate to the value of different actions in each state. 
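As a point of comparison (not taken from the article), the recursive relationship for a whole-life annuity-due has the same shape:

$$\ddot{a}_x = 1 + v \, p_x \, \ddot{a}_{x+1}$$

Here the value at age x is the immediate payment of 1 plus a discounted (v), survival-probability-weighted (p_x) version of the value one year later – exactly the structure of the Bellman equation: immediate reward plus a discounted expectation of the value of the next state.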

Refining the estimates
The expected present value of future rewards that an agent estimates for each state will not be correct right away. The agent’s state value estimates at the beginning reflect what the programmer initialises them to be. The agent must learn the true value of the states under its current policy by taking actions and receiving feedback in the form of rewards from the environment (in certain situations, state values can be solved analytically, but this is usually not the case). Iteratively, over time, the agent uses this Bellman equation to refine its state value estimates by adjusting its current expectations in the direction of the actual rewards experienced. The agent is almost doing an actuarial experience study and repricing at each step.
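One common way to implement this incremental refinement is a temporal-difference style update, sketched below. The step size alpha, the discount factor gamma and the dictionary of value estimates are illustrative assumptions rather than details from the article.

```python
# Sketch of a temporal-difference (TD(0)) update for refining state value estimates.
gamma = 0.9   # discount factor
alpha = 0.1   # step size: how far to move the estimate towards the observed target

def td_update(V, state, reward, next_state):
    # Bellman-style target: immediate reward plus discounted value of the next state.
    target = reward + gamma * V[next_state]
    # Nudge the current estimate towards what was actually experienced
    # (loosely, a small experience study and repricing at every step).
    V[state] += alpha * (target - V[state])

# Value estimates start at whatever the programmer initialises them to (here, zero).
V = {s: 0.0 for s in ["s1", "s2", "s3"]}
td_update(V, "s1", reward=5.0, next_state="s2")
print(V["s1"])  # 0.5 after one update
```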

In some systems, the state value estimates converge (ie the state values stop changing) and the agent has worked out a true and consistent set of state values under its current policy. At that point, the Bellman equation above is satisfied. Once the agent has a true estimate of the value of being in each state under its current policy, it can update its policy in the following way: in each state, choose the ‘greedy’ action that leads to the next state with the highest present value. After this algorithmic update, the agent has a better policy – but it is not done learning. The whole iterative process repeats. The agent again refines its estimate of the value of being in each state under its new policy by taking actions according to that policy and updating its present value estimates based on the actual rewards experienced. Once the value of each state has been determined under the new policy, the agent again updates its policy to choose the action with the highest value in each state. This continues until both the state value estimates and the policy improvement algorithm reach a steady state. At that point, your agent has learned an optimal policy that maximises its expected rewards, and you’re done – your agent is now an expert decision-maker in this environment.
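This evaluate-then-improve cycle is known as policy iteration. The sketch below runs it on a tiny, made-up Markov decision process with a deterministic transition model; the states, actions, rewards and convergence threshold are all invented for illustration.

```python
# Policy iteration on a tiny, made-up MDP. The transition model is deterministic:
# P[(state, action)] = (next_state, reward).
gamma = 0.9
states = ["A", "B", "terminal"]
actions = ["left", "right"]
P = {
    ("A", "left"):  ("terminal", 0.0),
    ("A", "right"): ("B", 1.0),
    ("B", "left"):  ("A", 0.0),
    ("B", "right"): ("terminal", 10.0),
    ("terminal", "left"):  ("terminal", 0.0),
    ("terminal", "right"): ("terminal", 0.0),
}

policy = {s: "left" for s in states}   # arbitrary initial policy
V = {s: 0.0 for s in states}           # initial value estimates

while True:
    # 1. Policy evaluation: iterate the Bellman equation until the values settle.
    while True:
        delta = 0.0
        for s in states:
            s_next, r = P[(s, policy[s])]
            new_v = r + gamma * V[s_next]
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < 1e-8:
            break
    # 2. Policy improvement: act greedily with respect to the current value estimates.
    stable = True
    for s in states:
        best = max(actions, key=lambda a: P[(s, a)][1] + gamma * V[P[(s, a)][0]])
        if best != policy[s]:
            policy[s], stable = best, False
    if stable:
        break

print(policy)  # the loop settles on 'right' in both non-terminal states
print(V)
```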

A technique with potential
This is just one algorithmic approach in RL; it won’t work for every type of system, but for some it will. Although some details were glossed over here, this example captures many of RL’s key concepts. For actuaries interested in autonomous systems, RL is a great framework to explore. The fields of RL and dynamic programming are still developing, but these kinds of systems have already been used to build production-grade solutions in various industries, including dispatch systems in trucking, data centre cooling in IT, and robotic simulation in engineering.

If you are interested in learning more about RL, the University of Alberta offers a series of courses on the topic via Coursera, there is a Deep RL Bootcamp lecture series on YouTube, and you can read Richard S Sutton and Andrew G Barto’s book Reinforcement Learning: An Introduction.

Jonathan Khanlian is a senior actuary at MetLife.

Image Credit | iStock
This article appeared in our September 2021 issue of The Actuary.
