Olga Mierzwa-Sulima, Robin Whytock and Jędrzej Świeżewski share their experience of building a machine learning algorithm that helps track biodiversity in Gabon’s tropical forests

Climate change is increasingly affecting the distribution and composition of ecosystems. This has profound implications for global biodiversity and the prosperity of at-risk communities.
Although technology has accelerated climate and environmental degradation, it can also be used to mitigate impacts and correct our current trajectory.
A ‘big data’ revolution is underway in ecology, including the use of satellite imagery, GPS tags and other sensor arrays. The data generated has the potential to support and streamline conservation efforts. However, with big data come big challenges: from collection and storage to validation and interpretation, data handling poses a daunting task for researchers and stakeholders alike.
If we can resolve these issues, we will open the door to ecological ‘forecasting’ and automated pipelines for ecosystem monitoring and response frameworks. We will have the opportunity to streamline biodiversity conservation efforts and tackle large-scale challenges such as wildlife tracking, deforestation and greenhouse gas emissions.
Data scientists can be a part of the solution by using exploratory machine learning (ML) approaches to address climate change and its impacts on ecosystems. Appsilon’s Data for Good initiative aims to help the scientific community, NGOs and non-profits by developing data analytics solutions and ML pipelines to provide actionable and reproducible insights that can help combat climate change and support environmental protection projects.
In one such case, it partnered with researchers at the National Parks Agency of Gabon and the University of Stirling to assist with biodiversity conservation efforts in Gabon. Gabon’s tropical forests in central Africa are home to 80% of the world’s critically endangered forest elephants, among other endangered species. Using computer vision, Data for Good provided artificial intelligence (AI) assisted biodiversity monitoring via an easy-to-use, open-source software tool called Mbaza AI. This automatically detects and classifies wildlife species in images captured by researchers using automated ‘camera traps’.
The challenge
Gabon’s National Parks Agency uses hundreds of camera traps to survey reclusive mammalian and avian species in the central African forests. The camera trap arrays are typically spread over large areas and generate hundreds of thousands of images, which require manual inspection and interpretation. The resulting delay impedes conservation and reaction times to ecological problems. If the agency can identify species quickly and accurately, it can mount appropriate responses to time-sensitive projects, including land and conservation management and anti-poaching efforts.
ML algorithms can improve data processing, but the models are often not accurate enough to be relied upon for full automation. Instead, they serve as a ‘first pass’ whose output must then be validated manually, either in part or in full. A new approach was needed to test an ML model for fully automated labelling using computer vision.
In our case, the model’s precision and accuracy were measured in an ecological context by comparing species richness, activity patterns and occupancy estimates derived from ML labels with those derived from expert manual labels. By evaluating predictive performance in a domain-specific context such as ecological modelling, we showed that ML labelling can be used in fully automated pipelines (bit.ly/ML_CameraTraps).
The application needed to be standalone and available offline. Add a multi-platform, multi-language user interface that doesn’t require familiarity with programming, and such a tool would open access to projects without geographic or skillset constraints.
Training data
To achieve a highly accurate model for classifying forest animals, we used a sizeable training dataset (n = 347,120) curated from a raw collection of more than 1.5m images. The dataset contained samples from multiple countries, with each source using different camera trap models and field protocols. The resulting variations in resolution, quality and error type produced a challenging, but effective, training dataset for creating an ML model that could generalise to other sites.
The images used to train the model were ‘real-life’ camera trap data. This unprocessed dataset required an iterative approach to handle errors ranging from hardware faults to human labelling mistakes. The iterative process consisted of model training, validation, error correction and subsequent model updating, which allowed us to accurately assess the model’s performance. The process proved particularly beneficial for under-exposed and seemingly blank images that contained animals undetectable to the human eye but labelled by the model with high confidence.
An established architecture, ResNet50, was selected for the ML model. To speed up training, we used transfer learning, an ML technique that reuses knowledge learned on one task – here, general image recognition – as the starting point for another. Most of the approaches and mechanisms used to augment the training were taken from fast.ai, an easy-to-use and robust open-source Python library. We trained the models on various virtual machines run on the Google Cloud Platform; this was made possible by a Google Cloud Education grant.
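As an illustration, a minimal fast.ai transfer-learning recipe of this kind might look as follows. This is a sketch, not the project’s actual training script: the directory name, validation split, image size and epoch count are all placeholder assumptions.

```python
from fastai.vision.all import (
    ImageDataLoaders, Resize, accuracy, resnet50, vision_learner)

# Load labelled camera trap images from per-species folders
# (hypothetical directory layout, not the project's dataset).
dls = ImageDataLoaders.from_folder(
    "camera_trap_images", valid_pct=0.2, item_tfms=Resize(224))

# Transfer learning: start from ResNet50 weights pre-trained on
# ImageNet, then fine-tune on the camera trap classes.
learn = vision_learner(dls, resnet50, metrics=accuracy)
learn.fine_tune(5)
```

The appeal of this approach is that the pre-trained backbone already encodes general visual features, so far less labelled data and compute are needed than when training from scratch.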
Model performance and thresholding
To be applicable to broader use cases, the model needed to perform well when generalising to new ‘out-of-sample’ data. For fully automated pipelines, we needed to ensure it learned the features of the animals in the study rather than features of the camera sites (the backgrounds). It should be noted that valid identification from ML labelling requires every species of interest to be included in the training dataset.
Four focal species – the African golden cat, the chimpanzee, the leopard and the African forest elephant – were selected, as they are conservation priority species. Three ecological metrics were used in the study:
- Species richness, for quantifying the species count both temporally and spatially
- Activity patterns, for determining behavioural traits such as nocturnal or crepuscular activity
- Occupancy – a hierarchical model that can account for imperfect detection.
With these metrics in place, the model could be evaluated for accuracy and precision in an ecological context.
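To make the first two metrics concrete, here is a small sketch of how species richness and an activity pattern can be derived from a table of classified detections. The records, site names and field layout are invented for illustration and are not project data.

```python
from collections import Counter

# Hypothetical detections: (site, species, hour of capture).
detections = [
    ("site_A", "leopard", 2), ("site_A", "chimpanzee", 14),
    ("site_A", "leopard", 23), ("site_B", "forest_elephant", 3),
]

# Species richness per site: number of distinct species detected.
richness = {}
for site, species, _ in detections:
    richness.setdefault(site, set()).add(species)
richness = {site: len(spp) for site, spp in richness.items()}

# Activity pattern for one species: detections per hour of day.
leopard_activity = Counter(
    hour for _, species, hour in detections if species == "leopard")

print(richness)          # {'site_A': 2, 'site_B': 1}
print(leopard_activity)  # Counter({2: 1, 23: 1})
```

Occupancy is deliberately omitted here: it is a hierarchical statistical model rather than a simple count, and in practice is fitted with dedicated ecological modelling packages.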
We found that thresholding improved model performance for the ecological metrics in the study. Thresholding means accepting a prediction only when the model’s confidence score exceeds a chosen cutoff value, and flagging the rest for manual review. The model had a top-five accuracy of 95% – that is, in 95% of cases, the ‘actual’ expert labels were among the top five ML-predicted labels – and this held at around 95% for out-of-sample data regardless of the threshold selected. Overall, we recommended that users apply a threshold of 70% for general monitoring in central African forests. However, different thresholds can affect inference, and the threshold should be adjusted when targeting particular species. For example, the model’s elephant occupancy estimates improved significantly as the threshold was raised.
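A minimal sketch of such a confidence threshold follows; the labels and scores are illustrative values, not the model’s actual outputs.

```python
# Each prediction pairs a predicted label with the model's
# confidence score for that label (illustrative values).
predictions = [
    ("forest_elephant", 0.92),
    ("leopard", 0.45),
    ("chimpanzee", 0.81),
]

THRESHOLD = 0.70  # the 70% cutoff recommended for general monitoring

# Keep high-confidence labels; flag the rest for manual review.
accepted = [(lbl, s) for lbl, s in predictions if s >= THRESHOLD]
for_review = [(lbl, s) for lbl, s in predictions if s < THRESHOLD]

print(accepted)    # [('forest_elephant', 0.92), ('chimpanzee', 0.81)]
print(for_review)  # [('leopard', 0.45)]
```

Raising the threshold trades coverage for confidence: fewer images are labelled automatically, but the labels that remain are more reliable.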
Meeting real needs
The outcome of the project was Mbaza AI (github.com/Appsilon/mbaza): an ML algorithm for classifying camera trap images offline and with 90% accuracy for predictions on out-of-sample data. The tool is free to use and can rapidly process data with output accuracy and precision levels that are high enough for ecological analyses. It has decreased the time needed to analyse thousands of images from two-to-three weeks to one day. Depending on the hardware used, the model can classify roughly 4,000 images per hour and operate 24/7.
Mbaza AI is just one example of how data science and ML can be used in environmental mitigation. With expert guidance, data can be leveraged with interactive data visualisations, applications and AI for climate change solutions. Automated workflows, open-source software and practical problem-solving, keeping stakeholders in mind, help make large-scale environmental efforts achievable.
When we understand the needs of researchers and practitioners, data scientists and AI developers can address the limitations of current technologies and offer applicable solutions. The field of data science and ML has made impressive advances in the past decade; we must continue to demonstrate that the available technology can play a meaningful role in the preservation of the planet.
Olga Mierzwa-Sulima is an engineering manager at Appsilon and leads Data for Good.
Dr Robin Whytock is a scientist, ecologist and conservationist interested in forest biodiversity.
Dr Jędrzej Świeżewski is machine learning lead at Appsilon.
Image Credit | Shutterstock | ANPN