April 10, 2023
Our Client’s Problem
The banking sector in the UK is facing rapid growth in fraud and scams. Clever financial criminals use a range of tactics and social engineering to pose as trustworthy people. Scams like romance and investment scams unfold over multiple repeated payments spanning days, weeks or months. As we all know, these scams can cause untold financial losses to both the bank and the customer. They can also cause serious reputational damage to banks that fail to take the appropriate preventive and reactive measures. It’s a nightmare!
Our banking client needed an in-house, machine-learning-driven solution that could predict an ongoing high-value scam at its earliest stage and alert the customer to minimise financial losses. Luckily, the bank had recorded transactional data on its previous scam and fraud cases. This data became the backbone of our solution, letting us infer the behavioural patterns of customers and scammers and predict an ongoing scam as early as possible.
So how do we decide what tech stack to use?
We had a look at the business goals and the scope of the project.
With the above business goals and scope in mind, my team and I decided on the technology stack and the sources of data we’d need. I developed the data pipelines for data extraction, data transformation, training-data generation and feature engineering using an ETL tool called StreamSets. StreamSets is internally backed by the Spark engine, which is well known for its powerful in-memory, distributed Big Data computing, and the pipelines were mostly driven by PySpark scripts (Python + Spark).

Data scientists from my team used AWS SageMaker to train and build a machine learning model capable of high-quality scam predictions. SageMaker lets you quickly build training pipelines to train models and inference pipelines to generate predictions from the trained model, and it also provides data scientists with an exploratory development area backed by Jupyter Notebooks. Intermediate data was staged on AWS S3, a leading cloud object storage service. Once the predictions were generated each day, they were persisted into Snowflake, a leading scalable cloud data warehouse. To orchestrate the end-to-end process, from data transformation to the persisting of predictions in Snowflake, I used Apache Airflow.
Initial Exploratory Analysis & Feature Engineering: Our Data Scientists’ Time to Shine
The scope of the data helped me and my team nail down the source systems and the broad filter criteria that needed to be applied. We handed the raw data over to the data science team for exploratory analysis. They applied various oversampling techniques to balance the data, then split it into training and validation sets. The two sets were kept mutually exclusive so that the trained model was not influenced during validation, and both maintained a realistic representation of “scam” vs. “genuine” customer accounts and their underlying transactions. These datasets were also decorated with hundreds of engineered features, derived by applying common aggregations (min, max, standard deviation, etc.) over time windows ranging from 3 days to 6 months.

The data scientists then went through several model iterations to find the best-performing algorithm. LightGBM, a gradient-boosted decision tree (GBDT) algorithm, was selected; others such as Random Forest and Logistic Regression were also considered. Once the algorithm was chosen, our data scientists performed a final feature selection pass to remove highly correlated features, keeping only those that clearly contributed to model performance. With the algorithm and the list of engineered features finalised, we overlaid them on the scoped data. All of this exploratory data analysis, model iteration, feature engineering and feature selection was done within the Jupyter Notebooks of AWS SageMaker.
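Keeping the training and validation sets mutually exclusive matters most at the customer level: the same customer must never appear in both. A minimal, dependency-free sketch of that kind of split, assuming each customer carries a scam/genuine label (all names here are illustrative, not the project’s actual code):

```python
import random

def split_by_customer(customers, validation_fraction=0.2, seed=42):
    """Split labelled customer records into mutually exclusive
    train/validation sets. `customers` is a list of (customer_id, label)
    pairs; splitting on distinct customer IDs guarantees no customer
    leaks into both sets."""
    ids = sorted({cid for cid, _ in customers})
    rng = random.Random(seed)
    rng.shuffle(ids)
    cut = int(len(ids) * (1 - validation_fraction))
    train_ids, val_ids = set(ids[:cut]), set(ids[cut:])
    train = [(c, y) for c, y in customers if c in train_ids]
    val = [(c, y) for c, y in customers if c in val_ids]
    return train, val

# ~10% "scam" customers, mirroring an imbalanced real-life distribution
customers = [(f"C{i:03d}", i % 10 == 0) for i in range(100)]
train, val = split_by_customer(customers)
```

Splitting on IDs first (rather than shuffling rows) is what keeps the two sets disjoint even when a customer has many transactions.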
My Implementation of the Data Pipelines
Based on the scope of the data and the calculated/derived feature list passed down to me and my team by the data scientists, I started building the data pipeline that would create a model input every day to get predictions from the trained ML model. This pipeline was implemented in StreamSets, with the transformations, including the implementation logic of the calculated/derived features, written in PySpark, since the features were compute-heavy and windowed by time. These transformations ran daily over the underlying transactions of all eligible customers, with intermediate transformed data staged on AWS S3.

The lowest grain of the data transformation was the customer level (not the transaction level), since high-value scams are spread across a long period of time. Analysing an incoming customer’s transactions over a long period helps uncover a behavioural pattern and flag the customer as a potential victim of a scam.

After applying all the computations and transformations in the StreamSets pipeline, I wrote the final data (the input to the trained model) to AWS S3, to be consumed by the ML-powered inference pipeline built inside AWS SageMaker. The inference pipeline read the model input from its S3 path and generated a score between 0 and 1 for each incoming customer; a score at or near 1 means the highest probability of the customer being a victim of a high-value scam. These output scores were written back to an AWS S3 path. Finally, my last StreamSets pipeline read the scores from S3, sorted them by score, and appended them to a Snowflake table for use by data analysts.
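The essence of the customer-grain transformation is rolling transaction-level rows up to one feature row per customer. A simplified, pure-Python sketch of that roll-up (the real pipeline did this in PySpark over time windows; these field names are illustrative):

```python
from collections import defaultdict
from statistics import pstdev

def build_model_input(transactions):
    """Roll transaction-level rows up to one feature row per customer.

    `transactions` is an iterable of (customer_id, amount) pairs; the
    output grain is the customer, mirroring the pipeline's lowest level
    of aggregation."""
    by_customer = defaultdict(list)
    for cid, amount in transactions:
        by_customer[cid].append(amount)
    rows = []
    for cid, amounts in sorted(by_customer.items()):
        rows.append({
            "customer_id": cid,
            "txn_count": len(amounts),
            "total_amount": sum(amounts),
            "max_amount": max(amounts),
            "amount_stddev": pstdev(amounts) if len(amounts) > 1 else 0.0,
        })
    return rows

txns = [("C001", 100.0), ("C001", 900.0), ("C002", 50.0)]
rows = build_model_input(txns)
```

Because every source system is reduced to this same customer grain, each one can be transformed independently and the rows later joined on `customer_id`.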
This data pipeline also generated a CSV dump of the sorted scores under a daily partitioned folder on AWS S3, for consumption by business users. The data pipelines and the ML inference pipelines were orchestrated together using Apache Airflow.
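The final "sort and dump" step can be sketched with the standard library, using a local directory to stand in for the S3 bucket prefix (path layout and column names are illustrative assumptions, not the project’s actual ones):

```python
import csv
from datetime import date
from pathlib import Path

def write_daily_scores(scores, base_dir, run_date=None):
    """Sort customer scores highest-risk-first and write them to a
    date-partitioned CSV, mimicking the daily dump for business users.

    `scores` is a list of (customer_id, score) pairs; `base_dir` stands
    in for the S3 bucket prefix."""
    run_date = run_date or date.today().isoformat()
    out_dir = Path(base_dir) / f"run_date={run_date}"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "scores.csv"
    with out_file.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["customer_id", "score"])
        # Highest scores first, so analysts see the riskiest customers on top
        for cid, score in sorted(scores, key=lambda r: r[1], reverse=True):
            writer.writerow([cid, f"{score:.4f}"])
    return out_file
```

The `run_date=YYYY-MM-DD` folder convention is a common Hive-style partitioning scheme that keeps each day’s output isolated and easy to query.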
My Non-Functional Implementations
Scalability: The data pipelines I developed could scale to handle spikes in incoming data volume. This was achieved by enabling dynamic allocation of Spark executors within the StreamSets pipeline settings.
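Dynamic allocation is controlled through standard Spark properties along these lines (the executor bounds here are illustrative values, not the project’s actual settings):

```properties
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=2
spark.dynamicAllocation.initialExecutors=4
spark.dynamicAllocation.maxExecutors=50
spark.shuffle.service.enabled=true
```

With these set, Spark requests extra executors when tasks queue up and releases them when they go idle, so a volume spike scales compute up without resizing the pipeline by hand.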
Reusability: I implemented many reusable PySpark functions. For instance, one function generates every combination of the aggregation functions (min, max, stddev, etc.) and time windows (3 days, 7 days, 6 months, etc.) passed to it for a given attribute.
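The combination-generating idea can be sketched with `itertools.product`; in the real pipeline each generated spec would drive a time-windowed PySpark aggregation (the naming scheme and defaults below are illustrative):

```python
from itertools import product

AGG_FUNCS = ["min", "max", "avg", "stddev"]
TIME_WINDOWS_DAYS = [3, 7, 30, 90, 180]

def feature_specs(attribute, agg_funcs=AGG_FUNCS, windows=TIME_WINDOWS_DAYS):
    """Generate every (aggregation, window) feature name for one
    attribute, e.g. 'txn_amount_stddev_90d'. One attribute fans out
    into len(agg_funcs) * len(windows) engineered features."""
    return [f"{attribute}_{agg}_{days}d"
            for agg, days in product(agg_funcs, windows)]

specs = feature_specs("txn_amount")
```

This is how a handful of attributes fans out into the "hundreds of engineered features" mentioned earlier: each attribute contributes one feature per aggregation-window pair.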
Data Quality Checks: I implemented data quality checks within my data pipelines using the PyDeequ library (a Python interface to AWS Deequ).
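PyDeequ expresses constraints such as completeness, uniqueness and value ranges over a DataFrame. A dependency-free sketch of equivalent checks on model-input rows, to show the kind of constraint being verified (the specific checks and messages are illustrative):

```python
def run_quality_checks(rows):
    """Run simple completeness, uniqueness and range checks on
    model-input rows, in the spirit of PyDeequ constraints. Each row is
    a dict with 'customer_id' and 'total_amount'. Returns a list of
    failure messages; an empty list means all checks passed."""
    failures = []
    ids = [r.get("customer_id") for r in rows]
    if any(i in (None, "") for i in ids):
        failures.append("customer_id is not complete")
    if len(ids) != len(set(ids)):
        failures.append("customer_id is not unique")
    if any(r.get("total_amount", 0) < 0 for r in rows):
        failures.append("total_amount contains negative values")
    return failures
```

Running checks like these before scoring stops a malformed daily batch from silently degrading the model’s predictions.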
Modularisation: I adopted a modular design to decouple the various moving parts of the data pipeline. This gives good flexibility to incorporate future additions of new features or attributes that make the overall scoring/predictions better. For example, because the lowest grain of aggregation was the customer level, I could run the data transformations for each source system in isolation and later stitch them all together at the same customer grain.
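The "stitch together at the customer grain" step is essentially an outer join keyed on customer ID. A minimal sketch, assuming each per-source pipeline emits a mapping of customer ID to its feature columns (source and feature names are hypothetical):

```python
def stitch_features(*per_source_features):
    """Join feature dicts from isolated per-source pipelines at the
    shared customer grain. Each argument maps
    customer_id -> {feature_name: value}; a customer missing from one
    source simply contributes no columns from it (an outer join)."""
    merged = {}
    for source in per_source_features:
        for cid, feats in source.items():
            merged.setdefault(cid, {}).update(feats)
    return merged

card_feats = {"C1": {"card_max_3d": 900.0}}
payment_feats = {"C1": {"pay_count_7d": 4}, "C2": {"pay_count_7d": 1}}
model_input = stitch_features(card_feats, payment_feats)
```

Because every source resolves to the same key, adding a new source system later means adding one isolated pipeline and one more argument to the stitch, with no change to the existing ones.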
Documenting the standard approach: The methodologies and practices used to solve the problem were well documented by me and my team. This documentation became a baseline and standard reference for many other similar ML-based prediction projects.
End Result: Nailed it!
We were successful in achieving the business’s main objectives: saving customers’ money by supporting early detection of scams, and making the framework robust and flexible enough to accommodate future improvements. The scores/predictions generated by the high-value scams (HVS) project served as complementary data for many downstream fraud detection systems, improving the overall performance of fraud detection at the bank. It achieved every business goal set, and we can prove this through a range of model performance metrics: False Positive Rate (FPR), Value Detection Rate (VDR) and True Detection Rate (TDR), to name a few. In just one month, it saved
£88k (£1 million annualised)
for the bank by alerting customers at an early stage of a high-value scam. This also helped the bank build trust among its customers and bolster its reputation. Our team maintains continuous model-monitoring mechanisms, driven by the model metrics, to improve performance, keep saving money for customers and the bank, and deliver a high-quality, trustworthy service.
To find out more about how bigspark can help your organisation detect and prevent scams as early as possible, get in touch below.