While open data sources can present a challenge for insurers, it can be worth persevering, says Sandrine Huynh
Consider a small regional health insurance company that would like to expand its sales nationwide. To price its products for other, unfamiliar regions, it needs data on those areas. Due to its limited budget, buying data is not its first choice; it decides to use ‘open data’. These are easily accessible and completely free online data sources. They often come from public institutions, some are regularly updated, and they can be very large (in other words, big data). Example sources of such data are data.gov.uk and data.gouv.fr.
Data to the rescue
Open data can help resolve the lack of data that actuaries often encounter. In pricing, for instance, a minimum amount of data is required to adequately evaluate a risk. Instead of relying on expert opinions, open data provides empirical observations on the ground. This is helpful not only when there is no other data, but also when partial data is otherwise available and the actuary would like to enhance or complete it. For example, open data can provide quantitative information on healthcare use by geographic and demographic variables, such as region, age, gender and so on, that is not already in the insurer’s own insured portfolios. By cross-referencing this new information with in-house datasets, updated or more adequate prices on insurance products can then be determined.
Accessible and free
In France, the Open DAMIR database contains more than 4.7bn lines and over 0.5 terabytes of public data that cover the healthcare benefits paid for by the French public health insurance system. Independently of external needs, some of this data is generated anyway because it is necessary – the public insurance system records its transactions for its own financial needs. Importantly, however, this data is made freely accessible to the public (under the condition that it is anonymised, of course).
Can such data be useful to someone else for a completely different purpose? In our example, the purpose is to extend private health insurance premiums from one region to the rest of the country. This data removes the need for the insurer to take potentially biased surveys on small populations.
Since open data tends to be empirical, it can also shed light on the impacts of current social phenomena or regulations. A recent example is hospitals generating public data on the number of COVID-19 hospitalisations, discharges, deaths, intensive care admissions and so on. This enabled researchers to study the effects of the virus statistically in order to unravel what was then unknown. Even though, as clearly stated (in France) at the time, the data was preliminary and prone to errors, there was an effort to release it quickly to advance understanding and help deal with the crisis. Even virtual machine providers gave priority to allocating machines to those researching COVID-19. In other words, open data can keep up with the trends and new topics in society.
On many such occasions, the benefit of open data can be two-way. The data can help the user, but the user can also help the data publisher. By making the above COVID-19 data available and enabling independent research, public bodies and governments could then take advantage of the outcomes of such research to help inform their own decisions.
The empirical nature and large size of these datasets do pose certain challenges. Depending on how they are generated, and how many people or services contributed to their construction, open databases can contain inconsistencies in both value definition and format. A frequent offender when using European datasets is decimal notation: values can be written as ‘2.1’ or ‘2,1’ depending on the region, which may not be compatible with the programs or algorithms used.
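A small normalisation step usually resolves this. The sketch below is illustrative and assumes plain values with no thousands separators (strings such as ‘1.234,5’ would need extra handling):

```python
def parse_decimal(value: str) -> float:
    """Parse a numeric string that may use either '.' or ','
    as the decimal separator (e.g. '2.1' or '2,1').

    Assumes no thousands separators are present.
    """
    return float(value.replace(",", "."))

# Both regional notations yield the same number
print(parse_decimal("2,1"))  # 2.1
print(parse_decimal("2.1"))  # 2.1
```

In practice, data-handling libraries often cover this case directly; for example, pandas’ `read_csv` accepts a `decimal` parameter for comma-decimal files.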
Other issues can be more difficult to resolve. To give one practical example we have faced: for the variable counting the number of uses of the healthcare system, we had floating-point values (such as 2.1 or 2.5) in the open databases but integers in the in-house databases. While investigating this seemingly innocuous inconsistency, we realised that the counting method differed between the two sources, and that the variable carried a different meaning depending on the dataset. In other words, variables with the same name are not necessarily the same thing, and care must be taken to make sure they represent exactly what you think they do.
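A cheap automated check can flag this kind of mismatch before it does damage. The sketch below is purely illustrative — the values and the heuristic are invented for this example, not taken from any particular database:

```python
def looks_like_counts(values) -> bool:
    """Cheap heuristic: True if every value is a whole number,
    as a genuine 'number of uses' count would be."""
    return all(float(v).is_integer() for v in values)

open_uses = [2.1, 2.5, 1.0]   # 'number of uses' in an open dataset
inhouse_uses = [2, 3, 1]      # same variable name in-house

# A disagreement here is a hint that the two variables may not
# measure the same thing, despite sharing a name.
print(looks_like_counts(open_uses))     # False -> not a simple count
print(looks_like_counts(inhouse_uses))  # True
```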
Large amounts of data can also mean potentially time-consuming algorithms, numerous data reprocessing iterations and less visibility. The user is therefore more prone to making mistakes, and a simple mistake stemming from a misunderstanding of the open data can be critical. To give an example, we joined the DAMIR datasets to another dataset to enrich the demographic information available, initially without fully understanding the different scopes covered by each row of data in the different datasets. We realised our error when the result turned out to be nonsensical. It is important to have a sense of which results are within the realms of possibility and to keep a sceptical eye on the numbers.
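That failure mode can be made concrete with a minimal, invented example: joining per-region population onto rows that are split by age band, then naively summing the population. All names and figures below are hypothetical, not drawn from DAMIR:

```python
# Hypothetical mini-datasets with mismatched scopes: 'claims' rows are
# per (region, age band); 'population' rows are per region only.
claims = [
    {"region": "A", "age_band": "18-39", "paid": 100.0},
    {"region": "A", "age_band": "40-64", "paid": 250.0},
    {"region": "B", "age_band": "18-39", "paid": 80.0},
]
population = [{"region": "A", "pop": 1000}, {"region": "B", "pop": 500}]

pop_by_region = {row["region"]: row["pop"] for row in population}
joined = [{**row, "pop": pop_by_region[row["region"]]} for row in claims]

# Sanity checks: the join must not drop or duplicate claim rows,
# and the total paid amount must be unchanged.
assert len(joined) == len(claims)
assert sum(r["paid"] for r in joined) == sum(r["paid"] for r in claims)

# But naively summing 'pop' over the joined rows double-counts
# region A -- the kind of nonsensical result that reveals the
# scope mismatch.
naive_pop = sum(r["pop"] for r in joined)
print(naive_pop)                     # 2500, not the true total
print(sum(pop_by_region.values()))   # 1500
```

Cheap invariants like these (row counts and column totals before and after a join) catch many scope errors before any pricing work begins.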
Finding the right open databases is also important: not all open data is good data, and having a lot of data is not necessarily useful. The user must choose carefully. In our case, we had to evaluate the different health open data available in France before starting our study, and – considering the challenges above – it took us a long time working on the data before it was useful for the actuarial project (health pricing) in question.
Finally, the biggest limitation that we encountered was the fact that the data was aggregated for anonymisation. Medical confidentiality is important due to the sensitive nature of health data. Institutions must make sure that, when they publish an open database, the data cannot be used to deduce the medical conditions of a specific individual. While it is not rare for actuaries to work with aggregate data, it can be difficult when the data lacks key indicators such as the number of individuals aggregated in each record. In our case, this was an obstacle for determining the frequency of healthcare uses. We knew how many healthcare uses were paid for per row of data, but not how many people it concerned.
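To make the obstacle concrete, here is a hypothetical sketch; the field names and figures are invented for illustration, not taken from Open DAMIR:

```python
# Hypothetical aggregated open-data rows: each gives how many
# healthcare uses were paid for, but not how many people they concern.
open_rows = [
    {"region": "A", "uses": 420},
    {"region": "B", "uses": 150},
]
total_uses = sum(r["uses"] for r in open_rows)
print(total_uses)  # 570 uses in total, but frequency = uses / people
                   # cannot be derived: the person count is missing

# With individual-level in-house data, the denominator is simply the
# number of records, so the frequency follows directly:
inhouse = [
    {"person_id": 1, "uses": 3},
    {"person_id": 2, "uses": 0},
    {"person_id": 3, "uses": 5},
]
frequency = sum(p["uses"] for p in inhouse) / len(inhouse)
print(round(frequency, 2))  # 2.67 uses per person
```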
Working with aggregate, instead of individual, data also requires carefully thinking through the statistical models used to ensure that they can still be appropriately applied.
Open data can help enable and enrich actuarial studies, and current technology is advanced enough to make use of it. There are plenty of programming tools, packages of algorithms and even virtual machines for processing big data. Nevertheless, we are at an early stage, and work needs to be done to explore the full potential of this data. With new technologies entering our lives – including Internet of Things devices – this data will only continue to grow.
This article is based on the author’s findings in writing her memorandum ‘Open data et Assurance santé: l’union fait la force?’ (June 2021). It is not intended to be an exhaustive overview of open data uses.
Sandrine Huynh is a junior consultant at Actuelia in Paris and a member of the Institut des Actuaires, France.