Credit Card Fraud Detection: Build Your Own Model — Part 1

FLock.io
8 min read · Feb 9, 2024

Don’t forget to fill in the form once you have completed the tutorial.

What You Will Learn

  1. How to handcraft a credit card fraud detection model.
  2. How to decentralize the model and run it with the FLock Client.

Technology Stack

  • Python ^3.10
  • Docker@latest

Prerequisites

  • CUDA + cuDNN (latest version)

Setup

  • Create a new directory called credit_card_fraud_detection.
mkdir credit_card_fraud_detection
cd credit_card_fraud_detection
  • Create a dependency list file named requirements.txt inside the credit_card_fraud_detection directory.
  • Add the following within the requirements.txt:
--find-links https://download.pytorch.org/whl/torch_stable.html
# CUDA 4.76GB
# torch==2.0.1+cu117; sys_platform == 'linux'
# torchvision==0.15.2+cu117; sys_platform == 'linux'

# CPU 1.85GB
# torch==2.0.1+cpu; sys_platform == 'linux'
# torchvision==0.15.2+cpu; sys_platform == 'linux'

torch==2.0.1
torchvision==0.15.2
pandas
numpy
pinatapy-vourhey
python-dotenv
requests
flock-sdk
  • Run pip install -r requirements.txt from your terminal in the current directory to ensure that all required packages are installed.

Directory Structure

credit_card_fraud_detection
┣ models
┃ ┗ CreditFraudNetMLP.py
┣ data
┃ ┗ creditCard.json
┣ .env
┣ Dockerfile
┣ flockCreditCardModel.py
┣ pinataApi.py
┣ requirements.txt
┗ uploadImage.sh

This is an overview of the file structure:

  • models folder: Stores our model definition.
  • data folder: Contains data for training and testing.
  • .env: Environment variable file.
  • Dockerfile: Text document to build Docker images.
  • flockCreditCardModel.py: Main file for the flock model.
  • pinataApi.py: Python script to upload to Pinata → IPFS node.
  • requirements.txt: Package requirements for running the model.
  • uploadImage.sh: Bash script for running the upload.

Step-by-Step Guide

We start by building the model for credit card fraud detection. Its main purpose is to measure how closely a user’s transaction records resemble known abnormal transaction behavior.

CreditFraudNetMLP

This class is a specific implementation of a neural network model using PyTorch. It defines the architecture and flow of data through the network for predictions.

  • Architecture: Defines layers and operations transforming input data into output predictions.
  • Purpose: Primarily focuses on mathematical transformations performed on the input data.

Dataset

When training a model, three main aspects are crucial: data, computation, and methodology. In this tutorial, the focus will mainly be on data and methodology. Please download the dataset here

The data schema of the credit card fraud dataset looks like this:

{
  "Time": 17187.0,
  "V1": 1.0883749383,
  "V2": 0.8984740237,
  "V3": 0.3946843291,
  "V4": 3.1702575745,
  "V5": 0.1757387969,
  // ... (other properties)
  "V28": 0.0542542721,
  "Amount": 3.79,
  "Class": 1
}

You might be confused by the numerous properties/elements. Let’s dive deeper into the dataset:

  1. Time: A continuous variable representing seconds elapsed between this transaction and the first in the dataset.
  2. V1 to V28: Likely numerical variables resulting from a PCA transformation, often used to anonymize sensitive information.
  3. Amount: Represents the transaction amount.
  4. Class: A binary variable where:
  • 1 indicates a fraudulent transaction.
  • 0 indicates a non-fraudulent transaction.
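
Before building the model, it’s worth a quick sanity check of the data. Here is an optional snippet (a sketch; it assumes you saved the download as data/creditCard.json, matching the directory structure above):

import json
from pandas import DataFrame

# Load the raw JSON records (path assumes the layout shown earlier)
with open("data/creditCard.json", "r") as f:
    records = json.load(f)

df = DataFrame.from_records(records)
print(df.shape)                    # rows x 31 columns: Time, V1..V28, Amount, Class
print(df["Class"].value_counts())  # fraud datasets are typically heavily imbalanced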

Let’s create CreditFraudNetMLP.py under the models directory inside credit_card_fraud_detection. This file will serve as our model.

Preliminaries

To begin, we need to import the required packages.

from torch import nn

Here’s a brief introduction to all the packages, in case you’re unfamiliar with their purposes:

  • torch: PyTorch is a Python package that offers two high-level features:
  • Tensor computation (similar to NumPy) with robust GPU acceleration.
  • Deep neural networks built on a tape-based autograd system.
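
If you’re new to PyTorch, here is a tiny illustration of both features (a generic example, not part of the tutorial code):

import torch

# Tensor computation: NumPy-like arrays that can also live on a GPU
a = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
b = (a @ a).sum()  # matrix multiply, then reduce to a scalar

# Autograd: gradients of b with respect to a are tracked automatically
b.backward()
print(a.grad)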

To create a model, the first step is deciding on the method we’ll use. In this case, we’ll employ an MLP (multilayer perceptron) model to detect and learn from the data.

Constructor

class CreditFraudNetMLP(nn.Module):
    def __init__(self, num_features, num_classes):
        super(CreditFraudNetMLP, self).__init__()

The __init__ method serves as the constructor of the class, invoked upon instantiating an object of this class. It requires three arguments: self, num_features, and num_classes. The super function is utilized to invoke the same method from the parent class (nn.Module).

Layers

self.fc1 = nn.Sequential(nn.Linear(num_features, 64), nn.ReLU(), nn.Dropout(0.2))
self.fc2 = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Dropout(0.5))
self.fc3 = nn.Sequential(nn.Linear(128, num_classes), nn.Sigmoid())

self.fc1 is defined as a sequential module, serving as a container for layers executed sequentially. It comprises:

  • Linear Layer (nn.Linear(num_features, 64)): Creating a fully connected (linear) layer with num_features input features and 64 output features (neurons).
  • ReLU Activation (nn.ReLU()): Applying the Rectified Linear Unit (ReLU) activation function, which turns the linear output into a non-linear one.
  • Dropout (nn.Dropout(0.2)): Applying dropout with a probability of 0.2 for regularization to prevent overfitting.

self.fc2 is another sequential module that contains:

  • Linear Layer (nn.Linear(64, 128)): A linear layer with 64 input features and 128 output features.
  • ReLU Activation: Applying the ReLU activation function.
  • Dropout (nn.Dropout(0.5)): Applying dropout with a higher probability of 0.5.

self.fc3 represents the output layer sequential module, featuring:

  • Linear Layer (nn.Linear(128, num_classes)): A linear layer with 128 input features and num_classes output features.
  • Sigmoid Activation (nn.Sigmoid()): Applying the sigmoid activation function to constrain the output within the range [0, 1], suitable for binary classification.

The subsequent step involves defining a forward method. This function determines how data x flows through the network during the forward pass, moving from one layer to the next.
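
The method body is not reproduced here, but given the three blocks defined above, a minimal forward pass would look like this (a sketch using the fc1/fc2/fc3 names from the constructor):

def forward(self, x):
    # Data flows through the three sequential blocks in order
    x = self.fc1(x)
    x = self.fc2(x)
    x = self.fc3(x)
    return x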

Example Torch Model

Let’s create a flockCreditCardModel.py file in the credit_card_fraud_detection directory and insert the following.

import json
from torch.utils.data import DataLoader, TensorDataset
import io
import torch
from pandas import DataFrame
from flock_sdk import FlockModel
from models.CreditFraudNetMLP import CreditFraudNetMLP

Initialisation Function Definition

class flockCreditCardModel(FlockModel):
    def __init__(
        self,
        features,
        epochs=1,
        lr=0.03,
    ):
        """
        Hyper parameters
        """
        self.epochs = epochs
        self.features = features
        self.lr = lr

        """
        Device setting
        """
        if torch.cuda.is_available():
            device = "cuda"
        else:
            device = "cpu"
        self.device = torch.device(device)

Here we do two things: first, define the hyperparameters; second, check whether the user’s device has a GPU available.

  • features: The number of input features for the neural network.
  • epochs: The number of epochs for training the model.
  • lr: Learning rate for the optimizer during training.

Data handling

def init_dataset(self, dataset_path: str) -> None:
    self.dataset_path = dataset_path
    with open(dataset_path, "r") as f:
        dataset = json.load(f)
    dataset_df = DataFrame.from_records(dataset)

    batch_size = 128

First, we need to load the data and convert it. Here, we open and read the dataset, then turn it into a pandas DataFrame. pandas makes data manipulation easier and provides numerous useful functions for the operations that follow.

Preparing Data for the Model

X_df = dataset_df.iloc[:, :-1]
y_df = dataset_df.iloc[:, -1]

X_tensor = torch.tensor(X_df.values, dtype=torch.float32)
y_tensor = torch.tensor(y_df.values, dtype=torch.float32)

y_tensor = y_tensor.unsqueeze(1)
dataset_in_dataset = TensorDataset(X_tensor, y_tensor)

Next, we need to convert our DataFrame into PyTorch tensors. Here, we split the data into two sets: features (X) and target (y).

  • Features: Variables or columns in the dataset that provide information for the model to learn patterns and make predictions. Examples include transaction amount and time of transaction.
  • Target: Variable (y), the ground truth that the model aims to predict. In the context of fraud detection, it is a binary variable indicating whether a transaction is fraudulent.

Let’s proceed with the code. We split the data into X_df and y_df, where X_df selects all columns except the last one, and y_df selects the last column. Then, we use the built-in function torch.tensor to convert the data into PyTorch tensors. Finally, we create a TensorDataset from the X and y tensors.

Setting Up Data Loaders

To streamline the process of feeding data into our model for both training and evaluation, we utilize PyTorch’s DataLoader. This tool allows for efficient data handling by batching, shuffling, and preparing the data for the model, ensuring optimal performance during the training process.

self.train_data_loader = DataLoader(
    dataset_in_dataset,
    batch_size=batch_size,
    shuffle=True,
    drop_last=False,
)
self.test_data_loader = DataLoader(
    dataset_in_dataset,
    batch_size=batch_size,
    shuffle=True,
    drop_last=False,
)
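
Note that the snippet above feeds the same TensorDataset to both loaders, so evaluation runs on data the model has already seen. If you want a held-out test set, one option (not part of the original tutorial) is torch.utils.data.random_split:

from torch.utils.data import random_split

# Illustrative only: carve out roughly 20% of the samples for evaluation
test_size = len(dataset_in_dataset) // 5
train_size = len(dataset_in_dataset) - test_size
train_set, test_set = random_split(dataset_in_dataset, [train_size, test_size])

self.train_data_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
self.test_data_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False)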

Training

def train(self, parameters) -> bytes:
    model = CreditFraudNetMLP(num_features=self.features, num_classes=1)
    if parameters is not None:
        model.load_state_dict(torch.load(io.BytesIO(parameters)))
    model.train()
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=self.lr,
    )
    criterion = torch.nn.BCELoss()
    model.to(self.device)

Epochs and Batches

Next, we move on to the training stage. First, we instantiate the model (CreditFraudNetMLP). Then, we check if any parameters were provided; if so, we load them into the model before training.

for epoch in range(self.epochs):
    train_loss = 0.0
    train_correct = 0
    train_total = 0
    for inputs, targets in self.train_data_loader:
        optimizer.zero_grad()

        inputs, targets = inputs.to(self.device), targets.to(self.device)
        outputs = model(inputs)

        loss = criterion(outputs, targets)
        loss.backward()

        optimizer.step()

        train_loss += loss.item() * inputs.size(0)
        predicted = torch.round(outputs).squeeze()
        train_total += targets.size(0)
        train_correct += (predicted == targets.squeeze()).sum().item()

    print(
        f"Training Epoch: {epoch}, Acc: {round(100.0 * train_correct / train_total, 2)}, Loss: {round(train_loss / train_total, 4)}"
    )

buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
return buffer.getvalue()

Next, we need to define the loop for training. The training process adjusts the model’s internal parameters to minimize the discrepancy between its predictions and the actual outcomes. This iterative process progressively improves the model’s predictive accuracy.

  1. Epoch Loop
  • Concept: An epoch represents one complete pass through the entire training dataset.
  • Importance: Multiple epochs allow the model to sufficiently learn underlying patterns in the data.
  • Note: Too many epochs might lead to overfitting, reducing the model’s ability to generalize.
  2. Batch Loop
  • Batch Training: Mini-batch training updates parameters after a specified number of samples (a batch).
  • DataLoader: The train_data_loader provides batches of data for efficient training and data shuffling.
  • Note: Batch size is a crucial hyperparameter affecting training speed and stability.

The training process meticulously adjusts the model’s internal weights to minimize loss, enhancing its ability to predict fraudulent transactions accurately. Through this methodical approach, we ensure that the model can learn effectively from the data provided, setting the stage for robust fraud detection capabilities.

Evaluation

def evaluate(self, parameters: bytes) -> float:
    criterion = torch.nn.BCELoss()

    model = CreditFraudNetMLP(num_features=self.features, num_classes=1)
    if parameters is not None:
        model.load_state_dict(torch.load(io.BytesIO(parameters)))
    model.to(self.device)
    model.eval()

    test_correct = 0
    test_loss = 0.0
    test_total = 0

The evaluation code resembles the training code but uses self.test_data_loader instead of self.train_data_loader. During evaluation, no model parameter updates occur; the purpose is to assess performance on the test set.

Calculating Loss and Accuracy

with torch.no_grad():
    for inputs, targets in self.test_data_loader:
        inputs, targets = inputs.to(self.device), targets.to(self.device)
        outputs = model(inputs)
        loss = criterion(outputs, targets)

The primary evaluation metrics are loss and accuracy. The function returns the final accuracy.

Calculating and Returning Evaluation Metrics

Lastly, we aggregate the results from our evaluation to calculate the total loss and accuracy. This involves summing up the losses, rounding the model’s outputs to determine predictions, and comparing these predictions against the actual targets to count the number of correct predictions.

test_loss += loss.item() * inputs.size(0)  # Calculating Cumulative Loss
predicted = torch.round(outputs).squeeze() # Rounding Model Outputs
test_total += targets.size(0) # Tracking Total Samples
test_correct += (predicted == targets.squeeze()).sum().item() # Calculating Correct Predictions

By completing these calculations, we can return the final accuracy, providing a clear metric to gauge the effectiveness of our model in detecting credit card fraud.
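
The closing lines of evaluate() are not shown above; here is a sketch of how the final accuracy might be computed and returned, mirroring the training loop’s bookkeeping:

# After the loop: report average loss and overall accuracy (assumed ending)
accuracy = test_correct / test_total
print(
    f"Evaluation Acc: {round(100.0 * accuracy, 2)}, Loss: {round(test_loss / test_total, 4)}"
)

# evaluate() is declared to return a float
return accuracy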

Aggregation

The aggregation step calculates the average of the model parameters collected from selected participants and returns the averaged weights/gradients to all participants. This process, central to federated learning, involves training model parameters across various participants (e.g., devices and nodes) and aggregating them at a central location.

def aggregate(self, parameters_list: list[bytes]) -> bytes:

Load and Initialize

First, we load the selected participants’ model parameters and initialize the template.

parameters_list = [
    torch.load(io.BytesIO(parameters)) for parameters in parameters_list
]
averaged_params_template = parameters_list[0]

Then, we average the model parameters by:

  • Iterating through each parameter k in the template and aggregating associated parameter values from all sets in parameters_list.
  • Calculating the average of these values and updating the template’s parameter k.

Compute Averages

for k in averaged_params_template.keys():
    temp_w = []
    for local_w in parameters_list:
        temp_w.append(local_w[k])
    averaged_params_template[k] = sum(temp_w) / len(temp_w)

  • Iteratively accessing each parameter in the template.
  • Gathering the corresponding parameter values from all models in the parameters_list.
  • Calculating the mean of these values.
  • Updating the template parameter with its new averaged value.

Serialize Averaged Parameters

Lastly, we create a buffer to store the aggregated parameters in byte format.

buffer = io.BytesIO()

# Saving state dict to the buffer
torch.save(averaged_params_template, buffer)

# Getting the byte representation
aggregated_parameters = buffer.getvalue()

return aggregated_parameters
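
To see how these pieces fit together, here is a hypothetical local smoke test (not part of the tutorial code) that exercises train, evaluate, and aggregate using the methods defined above:

from flockCreditCardModel import flockCreditCardModel

# Hypothetical usage: simulate two participants training locally,
# then average their weights and evaluate the aggregated model.
model = flockCreditCardModel(features=30, epochs=1, lr=0.03)
model.init_dataset("data/creditCard.json")

params_a = model.train(None)  # participant A's weights
params_b = model.train(None)  # participant B's weights
global_params = model.aggregate([params_a, params_b])

print(model.evaluate(global_params))  # accuracy of the averaged model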

Call the model

from flock_sdk import FlockSDK
from flockCreditCardModel import flockCreditCardModel

if __name__ == "__main__":
    epochs = 1
    lr = 0.000001
    features = 30
    model = flockCreditCardModel(features, epochs=epochs, lr=lr)
    sdk = FlockSDK(model)
    sdk.run()

Great! If all goes well, you’ve completed the first part of the tutorial. Part 2, where we cover how to run this model with the FLock Client, is available here

Reach out to us:

Website: https://flock.io/

Twitter: https://twitter.com/flock_io

Telegram: https://t.me/flock_io_community

Discord: https://discord.gg/ay8MnJCg2W
