With the ONS starting to release the data from the 2021 census, Jeremy Keating considers how those working in insurance can make use of it

The Office for National Statistics (ONS) will release the first batch of 2021 census data for England and Wales in early summer 2022. The UK completes a census every 10 years and these are the most recent results; Scotland and Northern Ireland’s figures are delayed until 2023 due to COVID-19.
Census data consists of statistics collected about a country’s population. The upcoming release will contain multiple categories of information, covering topics such as demographics, economics, population, housing, employment and transportation, broken down by individual areas and regions.
The ONS will release the data on its website for free. The finest granularity available immediately from the website will be the Lower Layer Super Output Area (LSOA) level, with each LSOA containing roughly 1,000 people. The most granular level that can be obtained from the ONS is the Output Area (OA), each containing about 250 people.
There are several ways that census data can be used in modelling and rating.
Clustering
Clustering is useful for turning multiple continuous factors into a single discrete factor. It works by putting observations from the exposure into groups whose members are similar to each other. This is done by treating each factor as a coordinate, so that every observation becomes a point in space, then putting observations that are ‘close’ to each other into the same group; observations with similar census factors will end up in the same groups.
To prevent distortions, it is crucial to normalise the data – make sure all factors are on the same scale – before using it. Clustering works best with continuous factors, or at least factors that are naturally ordered, but there are techniques for applying it to binary and discrete factors.
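To illustrate why normalisation matters, here is a minimal sketch using NumPy. The four areas and their two factors (median age and median income) are invented for illustration, and standardising to mean 0 and standard deviation 1 is only one of several possible scaling choices.

```python
import numpy as np

# Hypothetical factors per area: [median age (years), median income (GBP)]
areas = np.array([
    [30.0, 30_000.0],  # area A
    [70.0, 30_500.0],  # area B: very different age, similar income to A
    [31.0, 33_000.0],  # area C: similar age to A, somewhat higher income
    [68.0, 45_000.0],  # area D: older and much higher income
])

def standardise(X):
    """Rescale each column to mean 0 and standard deviation 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def distance(X, i, j):
    """Euclidean distance between rows i and j."""
    return float(np.linalg.norm(X[i] - X[j]))

# Raw scale: income is measured in tens of thousands, so it swamps age.
raw_ab, raw_ac = distance(areas, 0, 1), distance(areas, 0, 2)

# Standardised scale: both factors contribute on a comparable footing.
Z = standardise(areas)
std_ab, std_ac = distance(Z, 0, 1), distance(Z, 0, 2)

print(f"raw:          A-B = {raw_ab:.1f}, A-C = {raw_ac:.1f}")
print(f"standardised: A-B = {std_ab:.2f}, A-C = {std_ac:.2f}")
```

On the raw scale, area A looks closer to area B, because a 40-year age gap counts for less than a few thousand pounds of income. After standardising, A is closer to C, which is the more intuitive pairing for these made-up figures.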
A key question is how many clusters are needed. There is no right or wrong answer here and, often, multiple numbers of clusters should be tested and assessed.
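One way to test several numbers of clusters is sketched below. It assumes k-means as the clustering method and the within-cluster sum of squares as the yardstick; both are common choices rather than anything prescribed here, and the data is synthetic (three invented groups of areas with two already-normalised factors).

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Plain k-means. Returns (labels, within-cluster sum of squares)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations picked at random.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every observation to every centre.
        dist = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Move each centre to the mean of its members (keep it if empty).
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    # Final assignment and fit measure against the final centres.
    dist = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    wss = float((dist.min(axis=1) ** 2).sum())
    return labels, wss

# Synthetic, already-normalised factors for 90 hypothetical areas.
rng = np.random.default_rng(42)
group_centres = np.array([[-2.0, -2.0], [0.0, 2.0], [2.0, -1.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in group_centres])

# Fit a range of cluster counts and compare how tight the groups are.
wss_by_k = {k: k_means(X, k)[1] for k in range(1, 7)}
for k, wss in wss_by_k.items():
    print(f"k = {k}: within-cluster sum of squares = {wss:.1f}")
```

A common heuristic is to look for the point where adding another cluster stops giving a worthwhile reduction. In a pricing context, the candidate groupings would also be assessed on how well they explain claims experience, not on tightness alone.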
Principal component analysis
Principal component analysis (PCA) is a way of reducing many continuous factors down to a smaller number by capturing only the essence of each existing factor. The output from a PCA is a weight to apply to each factor to make a new factor; the new factor is like a weighted average of the existing factors.
PCA usually creates more than one factor to capture all the information. Judgment must often be applied to decide how many of the new factors to use. As with clustering, data must be normalised. Mathematically, PCA finds a set of directions (vectors) in the multi-dimensional factor space that are at right angles to one another, ordered by how much of the variation in the data each one captures, and keeps only the leading ones.
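A minimal sketch of PCA on standardised factors, using NumPy's singular value decomposition, is shown here. The three factors are synthetic: two are driven by a hidden 'affluence' variable invented for the example, and the third is unrelated noise.

```python
import numpy as np

def pca(X):
    """Return (weights, explained variance share, new factors)."""
    X = np.asarray(X, dtype=float)
    # Normalise first so no factor dominates because of its units.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal directions, ordered by importance.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    # Each new factor is a weighted sum of the standardised originals.
    scores = Z @ Vt.T
    return Vt, explained, scores

# Synthetic census-style factors for 200 hypothetical areas.
rng = np.random.default_rng(0)
n_areas = 200
affluence = rng.standard_normal(n_areas)  # hidden driver (assumption)
X = np.column_stack([
    affluence + 0.1 * rng.standard_normal(n_areas),  # e.g. % with a degree
    affluence + 0.1 * rng.standard_normal(n_areas),  # e.g. % home ownership
    rng.standard_normal(n_areas),                    # unrelated factor
])

weights, explained, scores = pca(X)
for i, share in enumerate(explained, start=1):
    print(f"component {i}: {share:.1%} of variance, weights {weights[i - 1].round(2)}")
```

Because the first two factors move together, most of their information collapses into a single component. The printed variance shares are one input to the judgment call on how many new factors to keep.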
Clustering and PCA can be combined for improved results; first, a PCA is performed to get the essence of the factors, then clustering is done to find which observations are similar, reducing down so there is just one factor left that best captures all of the information.
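The combined approach can be sketched as a short pipeline: normalise, keep the leading principal components, then cluster on those. Everything below is illustrative. The data is synthetic, k-means stands in for whichever clustering method is preferred, and the choice of two components and two clusters is an assumption made for the example.

```python
import numpy as np

def standardise(X):
    """Rescale each column to mean 0 and standard deviation 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def top_components(Z, n_components):
    """Project standardised data onto its leading principal directions."""
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_components].T

def k_means(X, k, n_iter=100):
    """k-means with a simple deterministic farthest-point start."""
    centres = [X[0]]
    for _ in range(k - 1):
        # Next centre: the observation farthest from all centres so far.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centres], axis=0)
        centres.append(X[d.argmax()])
    centres = np.array(centres)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels

# Synthetic data: 80 hypothetical areas, five factors, two underlying groups.
rng = np.random.default_rng(1)
offset = np.array([4.0, -4.0, 4.0, -4.0, 4.0])
group_a = 0.5 * rng.standard_normal((40, 5))
group_b = offset + 0.5 * rng.standard_normal((40, 5))
X = np.vstack([group_a, group_b])

scores = top_components(standardise(X), n_components=2)  # 5 factors -> 2
area_group = k_means(scores, k=2)                        # 2 factors -> 1 label

print("areas per group:", np.bincount(area_group))
```

The result is one discrete label per area, which is the kind of single factor that can be carried into a rating table.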
Directly rating
Directly rating means taking the individual factors and using them as they are in modelling and rating. This usually produces better results than reduction methods, but is more complex. Rating engine constraints often prevent the use of all individual factors live in the rating, as they need extra tables for looking up all the values. With more than 1m postcodes in the UK, these tables are sizable.
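As a rough sketch of what direct rating involves, the example below looks up census factors by postcode and applies them in a log-linear (GLM-style) premium formula. The postcodes, factor names, coefficients and base premium are all hypothetical, and a live rating engine would hold the lookup in its own table structures rather than a Python dictionary.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class CensusFactors:
    pct_renting: float  # % of households renting (hypothetical factor)
    median_age: float   # median age in years (hypothetical factor)

# Hypothetical lookup table: in practice, one row per postcode.
CENSUS_BY_POSTCODE = {
    "AB1 2CD": CensusFactors(pct_renting=62.0, median_age=29.0),
    "EF3 4GH": CensusFactors(pct_renting=11.0, median_age=54.0),
}

# Reference levels, also used when a postcode is missing from the table.
REFERENCE = CensusFactors(pct_renting=35.0, median_age=40.0)

BASE_PREMIUM = 300.0           # illustrative base premium in GBP
COEF_PCT_RENTING = 0.004       # illustrative model coefficients
COEF_MEDIAN_AGE = -0.003

def normalise_postcode(postcode):
    """Upper-case and put one space before the three-character inward code."""
    compact = "".join(postcode.split()).upper()
    return f"{compact[:-3]} {compact[-3:]}"

def premium(postcode):
    """Base premium times a relativity driven by the census factors."""
    factors = CENSUS_BY_POSTCODE.get(normalise_postcode(postcode), REFERENCE)
    linear = (
        COEF_PCT_RENTING * (factors.pct_renting - REFERENCE.pct_renting)
        + COEF_MEDIAN_AGE * (factors.median_age - REFERENCE.median_age)
    )
    return BASE_PREMIUM * math.exp(linear)

for pc in ["ab12cd", "EF3 4GH", "ZZ99 9ZZ"]:
    print(f"{normalise_postcode(pc)}: {premium(pc):.2f}")
```

Each extra census factor used directly adds another column to a table of this kind, which is where the size constraint comes from.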
Real-life uses and outcomes
Once these steps are complete, census data can be included in a pricing modelling exercise. The data becomes a feature, or features, in the chosen model.
The Financial Conduct Authority (FCA) used census data in its interim report on general insurance pricing practices to enhance its analysis. The results of this are in Annex 1 of its report on consumer outcomes; the FCA modelled loss ratios using granular policy and claims data provided by most UK general insurers, in what is probably one of the most comprehensive insurance modelling exercises ever completed. The body assessed customer outcomes using the census data and its findings are included in the report.
Using census data in pricing leads to reduced loss ratios by protecting the insurer against anti-selection. This is also the reason that a company has an advantage if it is using data that its competitors are not. If the data adds explanatory power and can be used to segment customers into groups based on their expected claims, the company using the extra data can offer some customers lower prices because of their lower propensity to claim or their lower cost of claims. This attracts those customers away from competitors, who see their loss ratios worsen while the company with the extra data expands its book profitably.
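The anti-selection effect can be made concrete with a toy calculation. All the numbers below are invented: a market with two equally sized segments, a 75% target loss ratio, one insurer that can tell the segments apart and a competitor that cannot, and customers who always buy the cheaper quote.

```python
TARGET_LOSS_RATIO = 0.75

# Hypothetical market: (number of policies, expected claims cost per policy)
segments = {
    "low_risk": (1_000, 200.0),
    "high_risk": (1_000, 400.0),
}

def loss_ratio(book):
    """book is a list of (policies, premium, expected_claims) tuples."""
    claims = sum(n * cost for n, _, cost in book)
    premium = sum(n * price for n, price, _ in book)
    return claims / premium

# Competitor cannot see the segments, so it prices off the average cost.
total_policies = sum(n for n, _ in segments.values())
average_cost = sum(n * cost for n, cost in segments.values()) / total_policies
flat_price = average_cost / TARGET_LOSS_RATIO

# Insurer with the extra data prices each segment at its own cost.
segment_price = {
    name: cost / TARGET_LOSS_RATIO for name, (_, cost) in segments.items()
}

# Every customer buys the cheaper of the two quotes.
data_insurer_book, competitor_book = [], []
for name, (n, cost) in segments.items():
    if segment_price[name] < flat_price:
        data_insurer_book.append((n, segment_price[name], cost))
    else:
        competitor_book.append((n, flat_price, cost))

competitor_before = loss_ratio(
    [(n, flat_price, cost) for n, cost in segments.values()]
)
competitor_after = loss_ratio(competitor_book)
data_insurer_after = loss_ratio(data_insurer_book)

print(f"competitor, whole market at flat price: {competitor_before:.0%}")
print(f"competitor, after losing low risks:     {competitor_after:.0%}")
print(f"insurer with extra data:                {data_insurer_after:.0%}")
```

With these figures the competitor's loss ratio moves from 75% to 100% once the low-risk customers leave, while the insurer using the extra data writes the low-risk segment at its 75% target.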
Jeremy Keating is an underwriting data expert at errars.com.