Chantal Bond and Kai Zhu discuss how machine learning techniques can be used in consumer segmentation analysis
Marriage and birth rates continue to decline worldwide, and home ownership rates have plummeted in a number of developed economies. A traditional life insurance consumer segmentation approach, which seeks to focus on the socioeconomic and demographic drivers for life events that lead to insurance purchases, will begin to lose its relevance in this context.
At the same time, insurers have access to a rapidly growing pool of data about consumers – but few have managed to really get to grips with it. How can this data be used to enhance understanding of consumer needs and therefore gain insights that lead to better outcomes for insurers and customers?
We will demonstrate a data analytics approach to consumer segmentation that uses machine learning techniques such as k-means clustering and random forest classification, which can be applied to a variety of data sources. In this example, we will use data from a consumer needs survey commissioned by the IFoA Life Asia Sub-committee to identify and describe three distinct consumer segments based on their responses to the survey questions.
“Future financial priorities were more important than country, age, income or education level in predicting which segment the consumer belonged to”
A practical three-step approach to consumer segmentation analysis
Consumer survey results are a typical example of an unlabelled data source, and it can be resource-intensive to derive insights from the large datasets that result. Here we use data from an independently commissioned consumer needs survey, which had more than 1,000 participants across three Asian markets (Mainland China, Hong Kong and Singapore), to show that, with a quantitative framework and process in place, consumer insights can be obtained quickly and reliably from a non-traditional, high-dimensional dataset such as this.
In this example we outline a practical three-step approach that could be automated to significantly reduce the turnaround time from analysing data to generating actionable insights.
In some ways, this process is the reverse of the traditional approach to consumer segmentation – rather than first defining demographic or socioeconomic buckets and then sorting consumers into them, we first segment the consumers into homogeneous groups based on their survey responses (step 1), then look at which variable connects the consumers in each group (step 2), and finally describe the groups based on this variable (step 3).
Step 1: Segment your data
When given an unlabelled dataset, the first step of our process is to segment it using k-means clustering, a common unsupervised machine learning method used to understand data structure. K-means clustering groups unlabelled data points into a pre-specified number of segments such that the data points within each segment are as homogeneous as possible.
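The grouping described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the survey analysis: the data is synthetic, the function name segment_responses is hypothetical, and we assume the survey answers have already been encoded as numbers (for example, as ranks) and standardised before clustering.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def segment_responses(responses, n_segments, seed=0):
    """Assign each respondent (one row per respondent) to a segment.

    Hypothetical helper: standardises the columns, then runs k-means.
    """
    scaled = StandardScaler().fit_transform(responses)
    model = KMeans(n_clusters=n_segments, n_init=10, random_state=seed)
    labels = model.fit_predict(scaled)
    return labels, model


# Synthetic stand-in for encoded survey answers (NOT the IFoA survey data):
# three groups of 100 respondents, each answering four questions.
rng = np.random.default_rng(42)
centres = np.array([
    [1.0, 5.0, 3.0, 2.0],
    [5.0, 1.0, 2.0, 4.0],
    [3.0, 3.0, 5.0, 1.0],
])
responses = np.vstack(
    [centre + rng.normal(scale=0.4, size=(100, 4)) for centre in centres]
)

labels, model = segment_responses(responses, n_segments=3)
print(np.bincount(labels))  # number of respondents in each segment
```

The segment numbers returned by k-means are arbitrary identifiers; they only become meaningful once the segments are profiled.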
Before we apply k-means clustering, we need to decide how many segments to divide the data into. We use the elbow method, which examines the proportion of variance explained by the segmentation as a function of the number of segments used. The number of segments is chosen at the point beyond which adding a further segment yields only a small marginal gain in the variance explained.
Based on interviewees’ responses to the IFoA Asia consumer needs survey, we found for each market that the marginal gain from k-means clustering diminishes once more than three segments are used to divide the data.
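The elbow check can be illustrated as follows. This is a sketch under stated assumptions rather than the analysis behind the survey results: the data is synthetic, the helper variance_explained_by_k is hypothetical, and 'variance explained' is taken to be one minus the ratio of the within-segment sum of squares (sklearn's inertia_ attribute) to the total sum of squares.

```python
import numpy as np
from sklearn.cluster import KMeans


def variance_explained_by_k(data, k_values, seed=0):
    """Return {k: share of total variance explained by a k-segment split}."""
    total_ss = ((data - data.mean(axis=0)) ** 2).sum()
    explained = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data)
        # inertia_ is the within-segment sum of squared distances.
        explained[k] = 1.0 - model.inertia_ / total_ss
    return explained


# Synthetic data with three underlying groups (NOT the IFoA survey data).
rng = np.random.default_rng(42)
centres = np.array([
    [1.0, 5.0, 3.0, 2.0],
    [5.0, 1.0, 2.0, 4.0],
    [3.0, 3.0, 5.0, 1.0],
])
data = np.vstack(
    [centre + rng.normal(scale=0.4, size=(100, 4)) for centre in centres]
)

explained = variance_explained_by_k(data, range(1, 7))
# Marginal gain from adding the k-th segment.
gains = {k: explained[k] - explained[k - 1] for k in range(2, 7)}
for k in range(2, 7):
    print(k, round(explained[k], 3), round(gains[k], 3))
```

On data like this, the gain should fall away sharply after the third segment, which is the 'elbow'. On real survey data the bend is usually less clear-cut, so the choice involves some judgement.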
Step 2: Identify the key independent variables that define the segments
After the data is segmented, the unlabelled dataset becomes a ‘labelled’ dataset, because each data point now carries the segment assigned to it by the k-means clustering in step 1. The random forest classifier is an ensemble classification model made up of a large number of decision trees. Each tree is built top-down, splitting at each node on the independent variable that offers the greatest information gain in predicting the outcome. The classifier is trained on the labelled dataset to predict the segment to which each data point belongs, and the information gain attributable to each independent variable is then used to identify the variables that are most influential in predicting that segment.
We used Python’s sklearn library to train the random forest classifier on the already segmented dataset from step 1. After the model was trained, we used the random forest feature selection method in the sklearn library to rank the variables in order of the information gained from each of them. We found that the most important variable, accounting for more than two-thirds of the total information gain, was how an interviewee ranked their future financial priorities. This was more important than country, age, income or education level in predicting which segment the consumer belonged to – showing some of the limitations of a traditional demographics-based segmentation approach.
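A minimal sketch of this step is shown below. It is illustrative only and does not reproduce the settings used for the survey analysis: the data and feature names are hypothetical, the segment labels are generated directly rather than taken from a clustering step, and we assume the ranking is read from sklearn's feature_importances_ attribute with the split criterion set to "entropy" so that importance reflects information gain.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def rank_feature_importance(features, segment_labels, feature_names, seed=0):
    """Train a random forest on labelled data and rank the variables."""
    forest = RandomForestClassifier(
        n_estimators=200,
        criterion="entropy",  # split quality measured by information gain
        random_state=seed,
    )
    forest.fit(features, segment_labels)
    # feature_importances_ is normalised so that the scores sum to 1.
    pairs = zip(feature_names, forest.feature_importances_)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Synthetic labelled data (NOT the IFoA survey data). The segment labels
# stand in for the output of step 1; only the first variable is related
# to the segment, the rest are noise.
rng = np.random.default_rng(7)
n = 600
segments = rng.integers(0, 3, size=n)
features = np.column_stack([
    segments + rng.normal(scale=0.2, size=n),  # hypothetical priority score
    rng.integers(20, 65, size=n),              # age
    rng.normal(50, 15, size=n),                # income
    rng.integers(1, 5, size=n),                # education level
])
names = ["future_priority_rank", "age", "income", "education"]

ranked = rank_feature_importance(features, segments, names)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Impurity-based importance scores of this kind can favour variables with many distinct values, so it is worth cross-checking the ranking with another measure before relying on it.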
Step 3: Profile the segments’ characteristics based on key independent variables
As the last step, we generate characteristic profiles for all the segments based on the most influential variables identified in step 2. As ‘future financial priorities’ was identified as the most important determinant of the segment to which an interviewee belongs, future financial priority profiles were generated for the three consumer segments. Figure 2 shows how we characterised the three consumer segments identified in the Asia consumer markets.
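Profiling of this kind amounts to summarising the key variable within each segment. The sketch below assumes, purely for illustration, that each respondent's record holds a segment number and a numeric rank for each financial priority; the field names, the records and the helper profile_segments are hypothetical and are not taken from the survey.

```python
from collections import defaultdict
from statistics import mean


def profile_segments(records, segment_key, variables):
    """Return {segment: {variable: average value within that segment}}."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[segment_key]].append(record)
    return {
        segment: {var: mean(row[var] for row in rows) for var in variables}
        for segment, rows in sorted(grouped.items())
    }


# Made-up records: a lower rank means a higher priority.
records = [
    {"segment": 0, "retirement_rank": 1, "housing_rank": 3},
    {"segment": 0, "retirement_rank": 2, "housing_rank": 4},
    {"segment": 1, "retirement_rank": 4, "housing_rank": 1},
    {"segment": 1, "retirement_rank": 3, "housing_rank": 1},
]

profiles = profile_segments(
    records, "segment", ["retirement_rank", "housing_rank"]
)
for segment, profile in profiles.items():
    print(segment, profile)
```

In this toy example, segment 0 puts retirement ahead of housing and segment 1 does the opposite; it is contrasts of this sort that allow a segment to be given a descriptive name.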
“While it’s clear that the life stage model is still useful, its relevance is waning”
Differences between markets
Generally, the segmentation results for the three markets (Mainland China, Hong Kong and Singapore) were remarkably similar, demonstrating the wide applicability of a financial priorities-based segmentation. Nonetheless, there were a few key differences, reflecting different economic contexts, for instance:
- Consumers in Hong Kong tend to move into each of the segments at a later age. This may be linked to Hong Kong’s housing market, which is one of the least affordable globally.
- Singapore has fewer individuals in the ‘managing competing needs’ segment (39%) than the other two markets (around 50%). This may be because of affordable public housing and accessible high-quality public education, which reduces some of the financial needs for working families.
- Singapore respondents in all segments ranked buying a car as a low priority (Singapore’s car ownership rate is very low), whereas Mainland China respondents gave greater priority to paying taxes (the top income tax rate is 45% in China, vs 22% in Singapore and 17% in Hong Kong).
What are the uses of this technique for the life insurance industry?
A data analytical approach to segmentation can yield results that are more relevant to today’s consumer landscape, and can make better use of a wider range of data sources. These could include any labelled or unlabelled consumer data already available to insurers, including consumer interactions and feedback on social media, purchasing patterns and web browsing data, call centre transcripts, postcode/location insights, and commercially available data. It could also include emerging sources of consumer data such as connected devices. Setting up an enterprise-level analytical framework and processes to derive consumer insights in real time from the ever-growing pool of data can, for example, improve sales conversion rates and facilitate cross-selling by creating a richer understanding of financial needs.
This, of course, has implications for marketing and sales strategies for both insurers and distributors, as they seek to identify the most relevant markets for different products. There are also opportunities for improved product design. For example, one of the key findings of our survey was a strong desire for more flexibility in insurance products. Hence, products that can grow with consumers or be adapted for different customer segments would likely be well received by policyholders, while also having persistency benefits for insurers.
While machine learning techniques are already used in predictive underwriting and may also be used for analysing insurers’ claims and persistency experience, they are rarely applied to the more qualitative data sources discussed here – but the real value is in looking at these data sets together. Once we have a richer understanding of, say, the lapse behaviour of a particular consumer segment, we can use these insights in a predictive context, which in turn can create more proactive opportunities for engagement, communication, sales and retention.
Chantal Bond is head of Actuarial, APAC at SCOR Global Life and chair of the IFoA Life Asia Sub-committee
Kai Zhu is a manager at KPMG Advisory (Hong Kong) Limited and a member of the IFoA Life Asia Sub-committee