How to Create a Model for Sentiment Analysis — Part 1

FLock.io
5 min read · Mar 26, 2024

What You Will Learn

  • How to craft a sentiment analysis model for Web3 information
  • The end-to-end model training flow
  • A comprehensive, step-by-step guide

Technology Stack

  • Python ^3.10
  • Docker@latest

Setup

  • Create a new directory called sentiment-analysis.
  • Create a requirements.txt file.
  • Within the requirements.txt, add the following:
flock-sdk==0.01

# For non-ARM architecture (CUDA build), uncomment the following instead:
# --find-links https://download.pytorch.org/whl/torch_stable.html
# CUDA 4.76GB
# torch==2.0.1+cu117; sys_platform == 'linux'
# torchvision==0.15.2+cu117; sys_platform == 'linux'

# CPU 1.85GB - ARM
torch==2.0.1
torchvision==0.15.2

# CPU 1.85GB - linux
# torch==2.0.1+cpu; sys_platform == 'linux'
# torchvision==0.15.2+cpu; sys_platform == 'linux'

# CPU 1.85GB - x86_64
# torch==2.0.1+cpu; sys_platform == 'x86_64'
# torchvision==0.15.2+cpu; sys_platform == 'x86_64'

# For development
pandas
scikit-learn
tqdm
numpy
# lightning
pinatapy-vourhey
python-dotenv
requests
  • Run pip install -r requirements.txt to ensure all required packages are installed.
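
Optionally, you can run a quick sanity check that the core packages import correctly and see whether CUDA is visible on your machine:

import torch
import torchvision

print("torch", torch.__version__, "| torchvision", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())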

How it Works

Understanding the general flow is our starting point. For sentiment analysis, the primary objective is to determine the emotional tone or sentiment of a piece of text, utilizing natural language processing techniques.

File Structure

sentiment-analysis
┣ model
┃ ┗ CNNModel.py
┣ .env
┣ Dockerfile
┣ FLockSentimentModel.py
┣ dataProcessing.py
┣ dataset.json
┣ pinata_api.py
┣ requirements.txt
┗ upload_image.sh

Model

Before delving into the CNN code, let’s first understand the code structure and the flow of the CNN. In this example, we’ll use only four layers: one embedding layer, two convolutional layers, and a sequential (fully connected) output layer.

  • Embedding Layer: Maps discrete, categorical input features (such as word indices) into dense, continuous vectors.
  • Convolutional Layers: Learn local patterns within the embedded sequences, helping the model pick up n-gram-like features across neighbouring words.
  • Sequential Layer: Outputs a number between 0 and 1, indicating the probability of belonging to the positive class in a binary classification problem.

Data Flow Chart:

Input data → Embedding Layer → Convolutional Layer 1 → Convolutional Layer 2 → Sequential Layer → Output
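
To make this concrete, assume a batch of 32 sequences of 64 token indices and an embedding size of 128 (illustrative numbers, not requirements of the model). The tensor shapes then evolve as follows:

(32, 64) token indices
→ Embedding → (32, 64, 128)
→ transpose to channels-first → (32, 128, 64)
→ Conv1d(128 → 100, kernel 5, padding 2) + ReLU → (32, 100, 64)
→ Conv1d(100 → 100, kernel 5, padding 2) + ReLU → (32, 100, 64)
→ average pooling over the sequence → (32, 100)
→ Linear(100 → 1) + Sigmoid → (32, 1)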

Let’s begin by importing the necessary packages:

from torch import nn
import torch.nn.functional as F

Now, let’s start coding the classifier with a basic structure:

class CNNClassifier(nn.Module):
    def __init__(self, vocab_size, emb_size):
        super().__init__()
        # Code goes here

    def forward(self, x):
        # Code goes here

Understanding three key variables:

  • nn.Module: The base class for all neural network modules in PyTorch; our classifier inherits from it.
  • vocab_size: Size of the vocabulary the model deals with.
  • emb_size: Size of the vectors into which words are mapped by the embedding layer.

Let’s define the layers:

  • Embedding:
self.embedding = nn.Embedding(vocab_size, emb_size)
  • Convolutional Layers:
self.conv1 = nn.Conv1d(emb_size, 100, 5, padding=2)
self.conv2 = nn.Conv1d(100, 100, 5, padding=2)
  • Each convolutional layer is followed by a Rectified Linear Unit (ReLU) activation; in this model we apply it functionally with F.relu() in the forward pass rather than adding nn.ReLU() layers.
  • Sequential layer:
self.fc3 = nn.Sequential(nn.Linear(100, 1), nn.Sigmoid())

nn.Linear(100, 1): A linear transformation of the incoming data, where the input feature size is 100 and the output size is 1.

nn.Sigmoid(): Applies the sigmoid function to squash the output values between 0 and 1.
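
Putting these pieces together, the constructor of CNNClassifier looks like this (note that the ReLU activations are applied in the forward pass rather than stored as layers):

class CNNClassifier(nn.Module):
    def __init__(self, vocab_size, emb_size):
        super().__init__()
        # Map word indices to dense vectors
        self.embedding = nn.Embedding(vocab_size, emb_size)
        # Two 1D convolutions over the embedded sequence
        self.conv1 = nn.Conv1d(emb_size, 100, 5, padding=2)
        self.conv2 = nn.Conv1d(100, 100, 5, padding=2)
        # Classifier head: 100 features -> 1 probability
        self.fc3 = nn.Sequential(nn.Linear(100, 1), nn.Sigmoid())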

The Forward Pass

The forward pass calculates and stores intermediate variables from the input layer to the output layer in neural networks. Let’s craft this function:

We start with embedding:

embs = self.embedding(x)

Next, activate the two convolutional layers via ReLU:

h = F.relu(self.conv1(embs.transpose(1, 2)))
h = F.relu(self.conv2(h))

Then, average-pool over the sequence dimension so that each sample collapses into a single 100-dimensional feature vector, regardless of its length:

h_size = h.size(dim=2)
h = F.avg_pool1d(h, h_size).squeeze(dim=2)

Lastly, map the pooled features to a probability of positive sentiment:

logits = self.fc3(h)
return logits
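
Assembled, the complete forward method reads:

    def forward(self, x):
        # x: (batch, seq_len) word indices
        embs = self.embedding(x)                      # (batch, seq_len, emb_size)
        h = F.relu(self.conv1(embs.transpose(1, 2)))  # (batch, 100, seq_len)
        h = F.relu(self.conv2(h))                     # (batch, 100, seq_len)
        h_size = h.size(dim=2)
        h = F.avg_pool1d(h, h_size).squeeze(dim=2)    # (batch, 100)
        logits = self.fc3(h)                          # (batch, 1), values in (0, 1)
        return logits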

Data Processing

Before training, we need to preprocess the dataset. Data processing plays a fundamental role in neural network training, particularly in natural language processing. Typically, the pipeline has four steps (illustrated with a toy example after the list):

  • Text Cleaning — Regular expressions
  • Tokenization — Splitting into words
  • Numerical Representation — Word indexing
  • Handling sequence length — Padding & Truncation
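
As a toy illustration of those four steps (the words and indices below are hypothetical, not output of the real pipeline):

raw      = "FLock's token launch is great!!"
cleaned  = "FLock 's token launch is great ! !"          # 1. cleaning: space out punctuation
tokens   = ["FLock", "'s", "token", "launch", "is", "great", "!", "!"]   # 2. tokenization
indices  = [57, 12, 301, 884, 9, 76, 31, 31]             # 3. word -> vocabulary index
padded   = [57, 12, 301, 884, 9, 76, 31, 31, 0, 0]       # 4. pad to max_seq_len = 10 with [PAD] (index 0)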

Let’s start with initialisation. We’ll create an __init__ function for our initialisation process. Several parameters need to be passed into this function:

  • dataset: The input dataset.
  • vocab: An optional pre-existing vocabulary. The default will be None, and if not provided, we'll generate it from the dataset.
  • max_seq_len: The maximum sequence length.
  • device: The device for computations. In this case, we want to use cuda.
  • max_samples_count: The maximum number of samples to consider from the input dataset. This parameter acts as a way to limit the dataset size.
  • max_vocab_size: The maximum number of words in the vocabulary.

Now, let’s assemble everything together. At the top of dataProcessing.py, add the imports this code relies on (re, torch, and DataLoader from torch.utils.data), then start the dataset class’s __init__:

def __init__(self, dataset, vocab=None, max_seq_len=64, device="cuda",
             max_samples_count=20000, max_vocab_size=30000):
    self.samples = []
    self.labels = []
    self.max_seq_len = max_seq_len
    self.device = device

Next, parse through the dataset (assuming CSV-style rows where the first element is the label and the second is the text):

    for row in dataset:
        label = row[0]
        sample = row[1]
        # Now we need to tokenize the samples
        # 1. Add a space before punctuation, which helps tokenization
        sample = re.sub(r"([.,!?'])", r" \1", sample)
        # 2. Retain only alphanumeric characters, the punctuation above,
        #    and apostrophes, replacing everything else with a space
        sample = re.sub(r"[^a-zA-Z0-9.,!?']", " ", sample)
        self.labels.append(int(label) - 1)
        self.samples.append(sample)
        # Once we reach the max samples count, stop the loop
        if len(self.samples) >= max_samples_count:
            break

Lastly, during the init stage, we need to create the vocab list discussed earlier.

If a vocab list is provided by the user, we simply initialize self.vocab with it; otherwise, we build the vocab list from the dataset.

    if vocab is not None:
        self.vocab = vocab
    else:
        self.vocab = self._make_vocab(max_vocab_size=max_vocab_size)

Built-in support functions to override (implementations below):

  • __len__: Returns the number of samples in the dataset.
  • __getitem__: A special method in PyTorch's Dataset class that retrieves a specific sample from the dataset by its index.
  • _make_vocab: Builds and returns the vocabulary list from the dataset.

def __len__(self):
    return len(self.samples)

def __getitem__(self, index):
    sample = self.samples[index]
    # Convert each word to its vocabulary index
    sample = [self.get_index(word) for word in sample.split()]
    # Truncate to max_seq_len, then pad with the [PAD] index
    sample = sample[:self.max_seq_len]
    pad_len = self.max_seq_len - len(sample)
    sample += [self.get_index("[PAD]")] * pad_len
    label = self.labels[index]
    return sample, label

def _make_vocab(self, max_vocab_size=30000):
    # Seed [PAD] and [UNK] with huge counts so they always survive the
    # frequency cut and end up at indices 0 and 1 after sorting
    vocab = {"[PAD]": 1000000000000001, "[UNK]": 100000000000000}
    for sample in self.samples:
        for word in sample.split():
            if word not in vocab:
                vocab[word] = 1
            else:
                vocab[word] += 1
    # Keep only the max_vocab_size most frequent words
    vocab = dict(sorted(vocab.items(), key=lambda item: item[1], reverse=True))
    vocab = list(vocab.keys())[:max_vocab_size]
    return vocab

Customized support functions (implementations below):

  • get_vocab: Returns the vocabulary list.
  • get_index: Returns the vocabulary index of a word, falling back to [UNK] for out-of-vocabulary words.

def get_vocab(self):
    return self.vocab

def get_index(self, word):
    # Fall back to [UNK] for words that are not in the vocabulary
    if word in self.vocab:
        index = self.vocab.index(word)
    else:
        index = self.vocab.index("[UNK]")
    return index
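
With these helpers in place, indexing the dataset yields a fixed-length list of word indices plus a label. A usage sketch (the class name SentimentDataset is hypothetical; use whatever name you gave the class in dataProcessing.py):

# rows is an iterable of (label, text) pairs
ds = SentimentDataset(rows, max_seq_len=8, device="cpu")
sample, label = ds[0]
# sample -> e.g. [14, 7, 52, 3, 0, 0, 0, 0]   (padded with the [PAD] index)
# label  -> 0 or 1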

Collate Function

The collate method is a utility function commonly used with PyTorch's DataLoader to determine how individual data points (samples) are combined into batches for training or evaluation.

This function takes one parameter, batch, which is a list of data points (samples).

Next, we unzip the batch using list(zip(*batch)). We expect two results from this operation: input_ids and targets.

  • input_ids is a list of lists, where each inner list corresponds to the input_ids of a data point.
  • targets is a list of labels.

def collate(self, batch):
    input_ids, targets = list(zip(*batch))

Then, we convert the data into tensors using torch.tensor(), transforming the lists of word indices and labels into PyTorch tensors that can later be moved to the GPU for training. These tensors are the return value of the collate function.

    return torch.tensor(input_ids), torch.tensor(targets, dtype=torch.float32)
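
For example, a batch of two samples would yield an input tensor of shape (2, max_seq_len) and a target tensor of shape (2,) (illustrative values, assuming a dataset instance called ds):

batch = [([14, 7, 52, 0], 1), ([3, 9, 0, 0], 0)]
input_ids, targets = ds.collate(batch)
# input_ids -> tensor of shape (2, 4), dtype int64
# targets   -> tensor([1., 0.]), dtype float32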

get_loader Function

  • Operation: Returns a DataLoader allowing iteration over the dataset in batches.
  • Parameters: Accepts the dataset and batch size as parameters.

def get_loader(dataset_df, batch_size):
    return DataLoader(dataset_df, batch_size=batch_size, num_workers=1,
                      collate_fn=dataset_df.collate)
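
Tying it together, here is a rough end-to-end sketch of the data side (the SentimentDataset name and the structure of dataset.json as a list of [label, text] pairs are assumptions; adapt them to your own files):

import json

with open("dataset.json") as f:
    rows = json.load(f)              # assumed: a list of [label, text] pairs

train_ds = SentimentDataset(rows, max_seq_len=64, device="cuda")
train_loader = get_loader(train_ds, batch_size=32)

for input_ids, targets in train_loader:
    pass  # feed each batch to CNNClassifier during training (covered in Part 2)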

Congratulations on completing the first part of the tutorial! Now, you can move on to Part 2. Let’s continue our conversation there.

Reach out to us by

Website: https://flock.io/

Twitter: https://twitter.com/flock_io

Telegram: https://t.me/flock_io_community

Discord: https://discord.gg/ay8MnJCg2W
