Sentiment Analysis - ML Made Easy

By PurplePumkiin March 11, 2026 12 min read 206 views

     Natural Language Processing (NLP) is one of the most powerful tools we have at our disposal. For myself and many others, building sentiment analysis tools has long seemed like a difficult, unreachable goal. But it doesn't have to be this way. With just a few lines of code, a basic understanding of how these models work, and the drive to learn, sentiment analysis and NLP as a whole become much easier to conquer.

     First, we must understand what a sentiment analysis model is and what roles it fills. This is not the type of AI that takes jobs; it is a tool for classification. It can process customer service calls and gauge their effectiveness, classify a news article's bias, or simply judge the tone and inflection of a response. These tools have many uses and, for better or worse, are a part of our lives. It is time to understand them.

     To start, we need data. NLP models, and neural networks as a whole, require data, and in some cases vast amounts of it. Here, we attempt to crowdsource that data through ethical means. For this project we need labeled, human-generated data to train our neural network. While generative AI can be used in some cases, it is not recommended as a source of training data, because it has a habit of using particular language that doesn't reflect human tone. For this project, the data will be open and accessible through an API call. Documentation for this API will be provided, alongside a full walkthrough of the code so you can do this yourself.

I will be using Python for this. It is a robust language and is very common for training models thanks to its library support, flexibility, and general ease of use. We will need:

  • re (for regex and input sanitization)
  • torch (for the neural network handling)
  • requests (for calling the API)

Our data pipeline will turn raw text into tokenized data that our training algorithm can later process to begin making predictions. We will use two arrays: a text array, which stores the training sentences, and an emotion array, which maps each sentence to its label.

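training_sentences = []
training_labels = []

# entries: the list of records parsed from the API response
# EMOTIONS: the hardcoded list of emotion names, in label order
for item in entries:
    for idx, emotion in enumerate(EMOTIONS):
        if item[emotion] == 1:
            # Append both together so sentences and labels stay aligned
            training_sentences.append(item['sentence'])
            training_labels.append(idx)
            break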

This code is meant to process the data retrieved from the API call; however, you can just as easily hardcode your own data or use another source altogether. In this case, we hardcoded the emotions into an array (EMOTIONS) and then mapped each API record to the index of its emotion.
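
To make the mapping concrete, here is a small sketch with made-up records. The emotion names and the sample sentences are assumptions for illustration only; the real names come from the API documentation. The loop appends the sentence and its label together, so the two lists stay the same length even if a record has no emotion flagged.

```python
# Hypothetical emotion names and sample records, for illustration only.
EMOTIONS = ["joy", "sadness", "anger"]

entries = [
    {"sentence": "I love this", "joy": 1, "sadness": 0, "anger": 0},
    {"sentence": "This is awful", "joy": 0, "sadness": 0, "anger": 1},
    {"sentence": "I miss you", "joy": 0, "sadness": 1, "anger": 0},
]

training_sentences = []
training_labels = []

for item in entries:
    for idx, emotion in enumerate(EMOTIONS):
        if item[emotion] == 1:
            # The label is the position of the flagged emotion in EMOTIONS
            training_sentences.append(item["sentence"])
            training_labels.append(idx)
            break

print(training_sentences)  # ['I love this', 'This is awful', 'I miss you']
print(training_labels)     # [0, 2, 1]
```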

We then move on to tokenizing. This is a crucial step for neural networks. Words themselves aren't very useful to a machine, so we turn them into numbers instead. This way, the model can work with integers and the probabilities it attaches to them.

import re

def tokenize(sentence):
    sentence = sentence.lower()
    # Drop everything except lowercase letters, digits, and whitespace
    sentence = re.sub(r"[^a-z0-9\s]", "", sentence)
    return sentence.split()

We start by making sure our data is consistent by turning everything to lowercase, so that "Hello" and "hello" are not treated as two different words. We then remove anything that is not a letter (a-z), a digit (0-9), or whitespace, leaving nothing but raw words. Finally, we split the sentence into its constituent parts, turning "Hello world" into the array ["hello", "world"].
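
A quick way to see what the tokenizer does is to run it on a couple of sentences. The second sentence is my own example; it shows one side effect worth knowing about, which is that apostrophes are stripped, so contractions collapse into a single token.

```python
import re

def tokenize(sentence):
    sentence = sentence.lower()
    sentence = re.sub(r"[^a-z0-9\s]", "", sentence)
    return sentence.split()

print(tokenize("Hello world"))              # ['hello', 'world']
print(tokenize("Don't stop, it's great!"))  # ['dont', 'stop', 'its', 'great']
```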

Now we build our vocabulary. It's important to make things only as granular as needed. Words can change meaning depending on how and where they appear, so for this use case it is reasonable to break things down to individual words. Letter-level embeddings would lose nearly all of that context and are generally unnecessary for our case.

def build_vocab(sentences):
    vocab = {"<PAD>": 0, "<UNK>": 1}
    for sentence in sentences:
        for word in tokenize(sentence):
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

Here we take the input sentences and start with a pre-initialized vocab: a padding token mapped to 0 and an unknown token mapped to 1. These two are incredibly important. When a neural network is trained, we train it in batches, and the tensors in a batch have predefined sizes. We must fill empty space with padding so that every sequence in a batch comes out to the same length. The function iterates over every word in every sentence and checks it against the vocabulary; if the word is not there yet, it is added with the next free ID. The finished vocabulary is returned at the end. Unknown words, while not relevant to training, are relevant in use. If the net has never seen the word "adore", then when someone uses it, the net will have no clue what it means. The unknown token is a placeholder that maps every new word to a single shared embedding.
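
As a small check, here is the vocabulary built from two short sentences of my own choosing. Each new word gets the next free ID in the order it is first seen, and repeated words are not added twice.

```python
import re

def tokenize(sentence):
    sentence = sentence.lower()
    sentence = re.sub(r"[^a-z0-9\s]", "", sentence)
    return sentence.split()

def build_vocab(sentences):
    vocab = {"<PAD>": 0, "<UNK>": 1}
    for sentence in sentences:
        for word in tokenize(sentence):
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

vocab = build_vocab(["I am a truck", "I am happy"])
print(vocab)
# {'<PAD>': 0, '<UNK>': 1, 'i': 2, 'am': 3, 'a': 4, 'truck': 5, 'happy': 6}
```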

Now we begin turning sentences into actual numbers by using that dictionary. This dictionary is the machine-human interface, in a way. It takes human words and maps them to numerical tokens. The sentence "I am a truck" will be tokenized, and those tokens are looked up in the larger dictionary of all the words the model has seen. In this case:

 {"<PAD>": 0, "<UNK>": 1, "i": 2, "am": 3, "a": 4, "truck": 5}

def numericalize(sentence, vocab):
    tokens = tokenize(sentence)
    ids = [vocab.get(tok, vocab["<UNK>"]) for tok in tokens]
    return ids

We use the previously defined tokenize function to produce sanitized tokens, then look each token up in the vocab and return its ID. Any token that is not in the vocab gets the ID of <UNK> instead.
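
Here is a short sketch of the lookup, using the "I am a truck" vocabulary written out by hand. The word "adore" is not in that vocabulary, so it falls back to the <UNK> ID of 1.

```python
import re

def tokenize(sentence):
    sentence = sentence.lower()
    sentence = re.sub(r"[^a-z0-9\s]", "", sentence)
    return sentence.split()

def numericalize(sentence, vocab):
    tokens = tokenize(sentence)
    return [vocab.get(tok, vocab["<UNK>"]) for tok in tokens]

# Vocabulary written out by hand for the example
vocab = {"<PAD>": 0, "<UNK>": 1, "i": 2, "am": 3, "a": 4, "truck": 5}

print(numericalize("I am a truck", vocab))     # [2, 3, 4, 5]
print(numericalize("I adore a truck", vocab))  # [2, 1, 4, 5]
```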

Now we bring it all together. We add the final touches to the data going into the model by finding the length of the longest sentence. This sets the sequence length that every input shares, and tells us how much padding to add to each shorter sentence. We do this with:

def get_max_length(sentences):
    return max(len(tokenize(s)) for s in sentences)

def pad_sequence(seq, max_len, pad_value=0):
    if len(seq) < max_len:
        seq = seq + [pad_value] * (max_len - len(seq))
    else:
        seq = seq[:max_len]
    return seq

def encode_dataset(sentences, vocab, max_len):
    encoded = []
    for s in sentences:
        ids = numericalize(s, vocab)
        padded = pad_sequence(ids, max_len)
        encoded.append(padded)
    return encoded
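
To see padding and truncation side by side, here is a small sketch with hand-written token IDs. The IDs and the max_len of 4 are made up for the example; in the pipeline above, max_len comes from get_max_length, so truncation would only kick in for a sentence longer than anything in the training data.

```python
def pad_sequence(seq, max_len, pad_value=0):
    if len(seq) < max_len:
        seq = seq + [pad_value] * (max_len - len(seq))
    else:
        seq = seq[:max_len]
    return seq

# Three already-numericalized sentences of different lengths
sequences = [[2, 3, 4, 5], [2, 3], [2, 3, 4, 5, 6, 7]]
max_len = 4

padded = [pad_sequence(seq, max_len) for seq in sequences]
print(padded)  # [[2, 3, 4, 5], [2, 3, 0, 0], [2, 3, 4, 5]]
```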

This more or less marks the end of the data pipeline. From here, the last things we need to do are get the data into the model and begin training it. We will do this with tensors. Many, like myself, had no clue what a tensor was before diving into this project. The simplest answer is that a tensor is just a container of numbers that can have any number of dimensions: anything from a 0D tensor, which is just a single number, to a stack of matrices.

  • 0D => 0
  • 1D => [0, 1, 2]
  • 2D => [ [0,1,2], [0,1,2] ]
  • 3D => [ [ [0,1,2], [0,1,2] ], [ [0,1,2], [0,1,2] ] ]
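
The same four cases can be checked in torch directly. This is just a sketch of the idea; dim() reports the number of dimensions and shape reports the size along each one.

```python
import torch

scalar = torch.tensor(0)                       # 0D: a single number
vector = torch.tensor([0, 1, 2])               # 1D
matrix = torch.tensor([[0, 1, 2], [0, 1, 2]])  # 2D
cube = torch.tensor([[[0, 1, 2], [0, 1, 2]],
                     [[0, 1, 2], [0, 1, 2]]])  # 3D: a stack of matrices

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("cube", cube)]:
    print(name, t.dim(), tuple(t.shape))
# scalar 0 ()
# vector 1 (3,)
# matrix 2 (2, 3)
# cube 3 (2, 2, 3)
```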

We use tensors because neural networks can be incredibly complex, relating things that are distant from one another through vectors. Tensors allow matrix multiplication, dot products, and many other mathematical operations without having to write and design everything yourself. For our network, we start with a basic 2-dimensional tensor. We have X, which holds the tokenized sentences stacked on top of one another, one sentence per row. So if each tokenized sentence is 10 tokens long and we have 3 sentences, the X tensor is 3 x 10 in size. We then have y, which is just a 1D tensor of labels, where each entry of y is the label for the matching row of X. We will expand on this later, but I create the tensors with this function:

def to_tensor(encoded_data, labels):
    X = torch.tensor(encoded_data, dtype=torch.long)
    y = torch.tensor(labels, dtype=torch.long)
    return X, y

This shapes our tensors and loads them with all the encoded data we processed earlier.
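
The training loop later in this article expects a dataloader that yields (batch_x, batch_y) pairs. One common way to build that from X and y is torch's TensorDataset and DataLoader. This is a sketch under my own assumptions: the encoded values, the labels, and the batch size of 2 are made up for the example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hand-written stand-ins for the encoded sentences and their labels
encoded = [[2, 3, 4, 5], [2, 3, 6, 0], [2, 3, 7, 6]]
labels = [0, 1, 2]

X = torch.tensor(encoded, dtype=torch.long)
y = torch.tensor(labels, dtype=torch.long)
print(tuple(X.shape))  # (3, 4): 3 sentences, 4 tokens each
print(tuple(y.shape))  # (3,): one label per sentence

# Pair each row of X with its label, then serve them in shuffled batches
dataset = TensorDataset(X, y)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

for batch_x, batch_y in dataloader:
    print(tuple(batch_x.shape), tuple(batch_y.shape))
```

With 3 sentences and a batch size of 2, the loader yields two batches, and the last one holds the single leftover sentence.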

This is the final part, and the most important of them all. This next section is what turns all of your hard work into a usable neural network: the part that trains the model. There are different ways to do this. A really simple model might just vectorize the tokens and compare those vectors to train for a response. The issue you may face with that approach is that similar sentences can have very different meanings. Take the sentences "I am happy" and "I am not happy". When we only look at the words themselves, the two share 75% of their data, yet they mean two very different things. A GRU (gated recurrent unit) model adds a memory layer to everything. Instead of analyzing the tokens as a whole, we feed them through a hidden layer of the model one at a time, and the hidden vector is updated with each step until a deeper understanding of the sentence is built.
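
The 75% figure can be reproduced by treating each sentence as a plain set of words, which is roughly what an order-blind model sees. This is only a sketch of that idea, with overlap measured as shared words divided by all distinct words across both sentences.

```python
def word_set(sentence):
    # Order is thrown away on purpose: only which words appear matters
    return set(sentence.lower().split())

a = word_set("I am happy")
b = word_set("I am not happy")

shared = a & b
overlap = len(shared) / len(a | b)

print(sorted(shared))  # ['am', 'happy', 'i']
print(overlap)         # 0.75
```

The only difference between the two sets is the word "not", which is exactly the word that flips the meaning.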

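import torch
import torch.nn as nn
import torch.optim as optim  # used by train() below

class SentimentGRU(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, num_classes=5):
        super().__init__()
        # Token IDs -> dense vectors; ID 0 (<PAD>) always maps to a zero vector
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            batch_first=True  # inputs are shaped (batch, seq_len, embed_dim)
        )
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, input_ids):
        embedded = self.embedding(input_ids)
        output, hidden = self.gru(embedded)
        final_hidden = hidden[-1]  # final hidden state: (batch, hidden_dim)
        logits = self.fc(final_hidden)
        return logits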

def train(model, dataloader, epochs=100, lr=1e-3, device='cpu'):
    model.to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(epochs):
        total_loss = 0
        correct = 0
        total = 0
        for batch_x, batch_y in dataloader:
            batch_x = batch_x.to(device)
            batch_y = batch_y.to(device)
            logits = model(batch_x)
            loss = criterion(logits, batch_y)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            preds = logits.argmax(dim=1)
            correct += (preds == batch_y).sum().item()
            total += len(batch_y)

        avg_loss = total_loss / len(dataloader)
        acc = correct / total

        print(f"Epoch {epoch+1}/{epochs} - Loss: {avg_loss:.4f} Accuracy: {acc:.3f}")

These two pieces work hand in hand. SentimentGRU is responsible for outlining the model: its size, shape, and data flow. These are crucial to getting something that works properly. For this model, we have 3 main parts. The embedding layer turns tokens into vectors, and with these vectors we can measure how different one token is from another. We then have the GRU layer, which is 128 dimensions in size. That is not 128 tensor dimensions; it means the hidden state is a vector of 128 numbers. Last is the linear layer, which outputs raw logits that can be used to infer a decision. The sentence is read token by token: each token's embedding updates the GRU's hidden state, and the final hidden state is passed to the linear layer to give the logits.

The training loop takes batched data in and processes it. With each batch, it measures the loss and accuracy of the model and makes small adjustments to the model's weights to better fit the data. These two pieces need each other. With no model, the trainer has nothing to train. With no trainer, the model can be fed data, but nothing has been learned to connect tokens to emotions.

It's important to strike a balance with the training settings. For instance, if your learning rate is far too low, training takes far longer than it needs to. If it is too high, you may run into stability issues where the optimizer overshoots good solutions. The number of training epochs is also important. The first round of training is never even close. Instead, we need to iterate over the data again and again until the model has extracted what it can from it and begins to plateau. For my data set I find that 100 epochs is more than enough for the model to understand the data and learn effective patterns.
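
To follow the data flow described above, it can help to run the three layers by hand on random token IDs and look at the shape after each step. This is a sketch with untrained layers: the vocabulary size of 50 and the random inputs are assumptions, and the softmax and argmax at the end are one common way to turn raw logits into a prediction, not something the training code requires.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, embed_dim, hidden_dim, num_classes = 50, 64, 128, 5

embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
gru = nn.GRU(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True)
fc = nn.Linear(hidden_dim, num_classes)

# 3 sentences of 10 random token IDs each (no padding, for simplicity)
input_ids = torch.randint(1, vocab_size, (3, 10))

with torch.no_grad():
    embedded = embedding(input_ids)       # (3, 10, 64): one vector per token
    output, hidden = gru(embedded)        # output (3, 10, 128), hidden (1, 3, 128)
    logits = fc(hidden[-1])               # (3, 5): one raw score per class
    probs = torch.softmax(logits, dim=1)  # each row now sums to 1
    preds = probs.argmax(dim=1)           # (3,): index of the most likely class

print(tuple(embedded.shape), tuple(output.shape), tuple(hidden.shape))
print(tuple(logits.shape), tuple(preds.shape))
```

Since the layers here are untrained, the predicted classes are meaningless; only the shapes matter. After training, preds would index into the emotion list.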

At the end of the day, neural nets are a complicated topic. They are hard to approach, but the founding principles are well established, and with enough time and practice anyone can understand how they work. I myself am a novice with this topic. Check the comments below for a link to the code, and of course check the settings to get your API key and retrieve the data to train a model yourself. Or you can check out the demo here

