Coding: those who love it can benefit those who don’t by creating open-source tools. Yiannis Parizas outlines two popular data science programming languages, and the simulator he devised and shared
An open-source programming language is one that is not proprietary; the source code is accessible for the public to view, modify and redistribute. R is an example of one such language. It is specialised in statistical analysis, and rich in packages for data visualisation, processing, analysis and reporting. It is one of the IFoA’s ‘preferred’ programming languages and is now included in the CS1 and CS2 exam curricula. The IFoA’s official guide to installing and using R can be found at bit.ly/IFoA_RGuide
Insurers can use R to perform analysis in all areas of pricing, reserving and capital modelling, and there are already specialised R packages that support our work. There are significant benefits in creating and sharing such packages with the public.
R versus Python
Alongside R, Python is the other open-source language that is popular in insurance. While R is specialised in statistical analysis, Python is a multipurpose programming language. Users occasionally opt for Python because it is considered more robust for production and has a larger community of users – and therefore more resources – for non-statistically specialised tools. An example is when deploying a machine learning model or pricing infrastructure as an application programming interface (API); here, Python wins out over R. Moreover, R is perceived as harder for a beginner to learn, and IT department colleagues are more likely to have knowledge of Python.
Both languages have a large community of users but R has a larger academic community. This means that newer technologies from statistical research are likely to appear in R before Python.
In general, Python is faster than R, but both are slow compared to languages such as C, C++ and Rust. The insurance industry uses R and Python because they have many of the tools we need built-in, providing a degree of programming ease that outweighs the loss of speed.
R versus Excel
The most widely used data analysis tool in our industry is still Microsoft Excel, whose main benefit is that everyone can use and understand it. Traditionally, analyses and tools have been run through Excel, but this has started to change, with external programming languages proving better for scaling and automation.
Excel’s capacity shows its limits as datasets get bigger. Users can access more powerful tools, such as PowerPivot in Excel x64 and PowerBI, but not everyone has experience with these. Excel can be automated with VBA, the programming language embedded within it, but this is slow compared with R and Python.
A simulation of 250,000 years with a high claim frequency could take VBA a couple of hours to run, while R or Python would take a few minutes. Considering this, a commercial business may opt to upskill its employees in R or Python to benefit from time savings in the medium term.
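To make the kind of simulation described above concrete, here is a minimal frequency-severity sketch in Python using only the standard library. The Poisson claim counts, lognormal severities, parameter values and the function name `simulate_annual_losses` are all illustrative assumptions, not taken from any tool mentioned in this article.

```python
import math
import random
import statistics


def simulate_annual_losses(n_years, claim_rate, mu, sigma, seed=1):
    """Simulate aggregate annual losses (hypothetical helper).

    Assumes Poisson claim counts with mean `claim_rate` per year and
    lognormal claim severities with parameters `mu` and `sigma`.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(n_years):
        # Count Poisson arrivals in one year using exponential waiting times.
        n_claims = 0
        t = rng.expovariate(claim_rate)
        while t < 1.0:
            n_claims += 1
            t += rng.expovariate(claim_rate)
        # Aggregate loss for the year is the sum of the individual severities.
        totals.append(sum(rng.lognormvariate(mu, sigma) for _ in range(n_claims)))
    return totals


# Illustrative parameters only: 10,000 years, 5 claims a year on average.
losses = simulate_annual_losses(n_years=10_000, claim_rate=5.0, mu=8.0, sigma=1.0)
expected = 5.0 * math.exp(8.0 + 0.5 * 1.0 ** 2)  # frequency x mean severity
print(f"simulated mean:  {statistics.fmean(losses):,.0f}")
print(f"analytical mean: {expected:,.0f}")
```

Comparing the simulated mean with the analytical one is a quick sanity check; a production tool would typically use vectorised libraries rather than a plain loop.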
Another issue with VBA is that it does not have as big a data science community as R or Python, so there are not as many packages and resources available for actuarial use. Analyses written in R and Python are easily reproducible, can connect to multiple data sources, and can be shared with other users as reports or tools through R Markdown, Shiny or an API. In addition, VBA’s syntax is not as clean as those of R and Python.
However, it should be noted that while use of Excel is declining for more complex work, we will still use it for a long time because of its ease, transparency and flexibility for simple calculations.
R packages and CRAN
R packages are extensions of the R programming language. Packages may contain code or functions, data and documentation in a standardised format.
The author can also set automated tests to spot bugs in the code. Documentation comes either per function or as vignettes (longer guides in which the authors may include proofs, case studies and so on). Documentation per function is summarised in the reference manual, and you can access it in R while you have the package loaded by typing “?function_name”.
R contributors can share their code through centralised software repositories. The most common repository for R is the Comprehensive R Archive Network (CRAN), which is supported by the R Foundation. This is where the packages come from, by default, when you run “install.packages(package_name)”. Alternatively, users can share their code with software development services that offer both internal and public access, such as GitHub and GitLab.
The ability to publish and share code internally means companies can reap the benefits of an open-source language while keeping developments confidential, internally controlled and secure. Developments are secured by file keys, and permissions are centrally governed. Backups are made for every change to the code and changes can easily be reverted. Python works in the same way.
For open-source contributors, CRAN’s benefit is that a package installs easily with the standard command and must conform to a relatively strict specification, ensuring a better user experience. The requirements include a standardised documentation format, a directory structure and metadata. When R is installed, it comes with 15 base packages and an additional 15 recommended packages.
Make a contribution
For those who love coding and enjoy giving back to the profession, there are many benefits in contributing to open-source projects. You can develop your coding skills and learn about technical aspects in a practical way, both of which are valuable for employers looking to make internal developments.
In collaborating with others, you learn from them and their perspectives, and improve your teamwork and communication skills. You learn to accept and understand feedback in order to improve. The more you get involved with projects, the more confident you become in your skills, which can help in interviews and work presentations. You can also make a reputation for yourself in the open-source community, meeting people and forming a network that could help you land a job that utilises your coding abilities, if that is something you are interested in.
Another benefit is that you will stay up to date with the industry’s latest technologies and tools. You will have access to open-source tools in every job (unless IT security does not allow it), which is not the case with licensed software, and the skills you develop through your contributions will be available to your employer.
It really is a win-win.
Yiannis explains his open-source tool, NetSimR
I published the R package NetSimR on CRAN (bit.ly/CRAN_NetSimR) to accompany two previous articles I wrote for The Actuary, ‘Taken to excess’ (March 2019) and ‘Escaping the triangle’ (June 2019). NetSimR has been downloaded around 14,000 times so far, which I don’t think is bad for an actuarial product!
The first version of the tool included functions that calculated the analytical mean of capped severity distributions, expanding to increased limit factor curves, exposure curves from severity distributions, and pure incurred but not reported claims functions. Here, I will focus on the latest update to the package: adding a claims simulation tool.
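As an illustration of what an analytical capped mean and an increased limit factor look like, here is a minimal Python sketch. It assumes a lognormal severity purely for illustration, and the function names are hypothetical; it is not NetSimR’s code or interface. For a lognormal with parameters mu and sigma, the capped mean is E[min(X, c)] = exp(mu + sigma²/2) Φ(z − sigma) + c (1 − Φ(z)), where z = (ln c − mu) / sigma.

```python
import math
from statistics import NormalDist


def lognormal_capped_mean(cap, mu, sigma):
    """E[min(X, cap)] for X ~ Lognormal(mu, sigma) (hypothetical helper)."""
    phi = NormalDist().cdf  # standard normal CDF
    z = (math.log(cap) - mu) / sigma
    # Expected cost of claims below the cap, plus the cap for claims above it.
    return math.exp(mu + 0.5 * sigma ** 2) * phi(z - sigma) + cap * (1.0 - phi(z))


def increased_limit_factor(limit, base_limit, mu, sigma):
    """Ratio of the capped mean at `limit` to the capped mean at `base_limit`."""
    return (lognormal_capped_mean(limit, mu, sigma)
            / lognormal_capped_mean(base_limit, mu, sigma))


# Illustrative parameters only.
mu, sigma = 8.0, 1.0
print(lognormal_capped_mean(10_000, mu, sigma))
print(increased_limit_factor(50_000, 10_000, mu, sigma))
```

Because the result is a closed-form expression, no simulation is needed, which is what makes such functions fast enough to use inside pricing tools.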
Many (re)insurers have Excel versions of tools to run analyses and simulate claims, usually using VBA. As we have already noted, such tools are slow and cannot handle hundreds of thousands of simulations or a very high claims frequency. Proprietary tools such as MetaRisk can handle more complex setups, but actuaries do not always have a licence, or indeed the experience and training, to use such tools. In addition, many practitioners, especially those from the older generation, are not familiar with coding in R. An R tool with a front end would allow them to use this language without having to code.
The aim was to set up a claims simulation tool with a user interface that could be used by people without coding experience. It would not capture every complex scenario but basic cases; more complex cases could be handled through external manipulation and multiple simulation analyses.
Initially, the tool was set up using CUDA (Nvidia’s platform for programming graphics processing units) and C#.net. The implementation and production set-up were complex and could not easily be shared with other people. Simulation speed was therefore sacrificed and the tool rewritten in pure R, with the front end in Shiny, an R package that allows the functions to be used via a website using buttons and other inputs, thus turning the code into an app.
My claims simulation tool, the NetSimR simulator, is a website that can be used by those without any coding knowledge, using the following R commands:
- install.packages("NetSimR") (only when using the simulator for the first time, to install or update the package)
Once the user runs the simulator, two buttons appear so that they can export the results – either by saving the simulation data as a CSV file or outputting an HTML report produced by a markdown file.
Yiannis Parizas is an actuarial pricing consultant