Supported by:

Embedded Computing

Next Event



8 June 2022

10am - 5pm

9 June 2022

10am - 4pm

LV Convention Center

Research Hub: Machine Learning-Driven Credit Risk Modeling Using Smartphone Metadata

This blog was provided by Credolab


The 21st century is marked by the democratisation of the internet for mass consumption. Personal mobile phones are no longer luxury items that only the rich can afford. According to a recent Statista report, in 2022 91.54% of the world's population owns a mobile phone and 83.72% owns a smartphone. An earlier projection (Sivakumaran & Iacopino, 2018) had put these figures at 71% and 77%, respectively, by 2025.
CredoLab is at the forefront of novel credit risk modeling approaches enabled by the surge in mobile phone use. Core to CredoLab’s business is its modeling pipeline. Taking smartphone metadata as input, the pipeline consists of a series of automated steps, rooted in machine learning techniques, that ultimately output a predictive model for credit default. To protect confidentiality and to avoid bias against individual loan customers, only non-identifying metadata is used.
The purpose of this document is two-fold: (1) to provide a high-level overview of the components that make up the modeling process and (2) to report the results of an independent review of the pipeline. The independent review considered a wide range of alternative approaches for each step of the pipeline and found favorable results, including when applied to real data.

The input at the pipeline’s starting point consists of smartphone metadata. The unit of observation is a single individual’s mobile phone usage. Observed input variables include aggregated summary statistics such as the amount of time the user spent on phone calls over the past month. Some of these summary statistics are on a continuous scale (such as minutes called), others are on a discrete scale (such as the number of calls), and still others are binary (such as whether particular phone applications are installed). The number of observed variables varies based on availability, but could easily run into the thousands.
The outcome of interest in the dataset is a binary indicator of whether the phone user defaults on their loan.
We note that demographic information about the individual phone user is not included. Variables such as age, sex, income level, etc. are neither considered for modeling nor extracted from the mobile device for any other purpose.
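To make the description above concrete, here is a minimal sketch of what one observation might look like. The feature names, the values, and the rule of filling unavailable variables with 0.0 are all assumptions made for illustration. They are not taken from CredoLab's pipeline.

```python
from typing import Dict, List, Union

FeatureValue = Union[float, int, bool]

# One observation = one individual's aggregated phone usage.
# All feature names and values below are hypothetical.
record: Dict[str, FeatureValue] = {
    "call_minutes_last_30d": 412.5,  # continuous scale
    "call_count_last_30d": 96,       # discrete scale
    "has_banking_app": True,         # binary
}

# Binary outcome: 1 if the user defaulted on the loan, 0 otherwise.
defaulted: int = 0


def to_vector(rec: Dict[str, FeatureValue], feature_order: List[str]) -> List[float]:
    """Encode a record as floats in a fixed column order.

    Variables that are unavailable for this user are filled with 0.0,
    which is an illustrative choice rather than a recommendation.
    """
    return [float(rec.get(name, 0.0)) for name in feature_order]


# Fix the column order once so every record maps to the same layout.
FEATURE_ORDER: List[str] = sorted(record)
x: List[float] = to_vector(record, FEATURE_ORDER)
```

Note that the record holds no demographic fields such as age, sex, or income, in line with the exclusion described above.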

Validation and Assessment
To prioritize and assess model robustness, CredoLab’s pipeline divides the available data into a training set, a validation set, and a test set. Consistent with recommended practices, the training set is used for fitting the model, the validation set is optionally used to tune parameters, and the test set is used for reporting the accuracy of the final model with its chosen parameters. Dividing the dataset in this way provides better estimates of how the model will actually perform on individuals who are either new to credit or new to banks.
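The three-way split described above can be sketched as follows. This is a minimal illustration using only the Python standard library. The function name `three_way_split`, the 60/20/20 proportions, and the fixed seed are assumptions, not details of CredoLab's pipeline.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def three_way_split(
    rows: Sequence[T],
    train_frac: float = 0.6,  # assumed proportion, not from the source
    val_frac: float = 0.2,    # assumed proportion, not from the source
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle rows and cut them into disjoint train, validation, and test sets."""
    if not (0 < train_frac < 1 and 0 <= val_frac < 1 and train_frac + val_frac < 1):
        raise ValueError("fractions must leave a non-empty share for the test set")

    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)

    train = shuffled[:n_train]                 # used to fit the model
    val = shuffled[n_train:n_train + n_val]    # used to tune parameters
    test = shuffled[n_train + n_val:]          # held out for final reporting
    return train, val, test


rows = list(range(100))  # stand-in for 100 observations
train, val, test = three_way_split(rows)
```

Because the test rows are never seen during fitting or tuning, accuracy measured on them is a less optimistic estimate of performance on new applicants. With a rare outcome such as default, one might also split within each outcome class so that every set has a similar default rate. That refinement is omitted here.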




