In this post, we'll walk through a simple text classification task using a pretrained BERT model from Hugging Face.

The BERT model was proposed in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.

Loading data

We'll use the emotion dataset from the Hugging Face Hub.

from datasets import load_dataset
emotion = load_dataset('emotion')

Using custom data configuration default
Downloading and preparing dataset emotion/default (download: 1.97 MiB, generated: 2.07 MiB, post-processed: Unknown size, total: 4.05 MiB) to /root/.cache/huggingface/datasets/emotion/default/0.0.0/348f63ca8e27b3713b6c04d723efe6d824a56fb3d1449794716c0f0296072705...
Dataset emotion downloaded and prepared to /root/.cache/huggingface/datasets/emotion/default/0.0.0/348f63ca8e27b3713b6c04d723efe6d824a56fb3d1449794716c0f0296072705. Subsequent calls will reuse this data.

The emotion dataset comes with three splits: train, validation, and test, and covers six emotions: sadness, joy, love, anger, fear, and surprise.

emotion
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})
label_names = emotion["train"].features['label'].names
label_names
['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

Let's take a look at what the text is like:

emotion.set_format(type="pandas")
train_df = emotion['train'][:]
valid_df = emotion['validation'][:]
test_df = emotion['test'][:]
train_df.head()
text label
0 i didnt feel humiliated 0
1 i can go from feeling so hopeless to so damned... 0
2 im grabbing a minute to post i feel greedy wrong 3
3 i am ever feeling nostalgic about the fireplac... 2
4 i am feeling grouchy 3

In this post, we'll use only 350 samples per class for training, 70 per class for validation, and 50 per class for testing.

train_df = train_df.groupby('label').apply(lambda x: x.sample(350)).reset_index(drop=True)
valid_df = valid_df.groupby('label').apply(lambda x: x.sample(70)).reset_index(drop=True)
test_df = test_df.groupby('label').apply(lambda x: x.sample(50)).reset_index(drop=True)
train_df['label'].value_counts()
0    350
1    350
2    350
3    350
4    350
5    350
Name: label, dtype: int64
valid_df['label'].value_counts()
0    70
1    70
2    70
3    70
4    70
5    70
Name: label, dtype: int64
test_df['label'].value_counts()
0    50
1    50
2    50
3    50
4    50
5    50
Name: label, dtype: int64


Tokenization

Tokenization is the process of splitting raw text into tokens and encoding those tokens as numeric IDs that the model can consume.

To do this, we first initialize a BertTokenizer:

from transformers import BertTokenizer
PRETRAINED_LM = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(PRETRAINED_LM, do_lower_case=True)
tokenizer
PreTrainedTokenizer(name_or_path='bert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})
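Before encoding the whole dataset, it can help to tokenize a single sentence and see what the tokenizer produces. This quick check is illustrative only and not part of the original pipeline:

sample = "i didnt feel humiliated"
print(tokenizer.tokenize(sample))                         # WordPiece tokens; rare words are split into '##' pieces
print(tokenizer.encode(sample, add_special_tokens=True))  # token ids, with [CLS] and [SEP] added at the ends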

Next, define a helper function for encoding:

def encode(docs):
    '''
    Takes a list of texts and returns their input_ids and attention_mask tensors.
    '''
    encoded_dict = tokenizer.batch_encode_plus(docs, add_special_tokens=True, max_length=128, padding='max_length',
                            return_attention_mask=True, truncation=True, return_tensors='pt')
    input_ids = encoded_dict['input_ids']
    attention_masks = encoded_dict['attention_mask']
    return input_ids, attention_masks

Use the encode function to get the input IDs and attention masks for each split:

train_input_ids, train_att_masks = encode(train_df['text'].values.tolist())
valid_input_ids, valid_att_masks = encode(valid_df['text'].values.tolist())
test_input_ids, test_att_masks = encode(test_df['text'].values.tolist())
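Each call returns tensors of shape (num_samples, max_length). Given the sampling and max_length=128 above, the training tensors should be 2100 x 128 (a quick check, not in the original notebook):

print(train_input_ids.shape, train_att_masks.shape)  # expect torch.Size([2100, 128]) for both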


Creating Datasets and DataLoaders

We'll use PyTorch's Dataset and DataLoader to split the data into batches. For more details, you can check out another post on DataLoader.

Turn the labels into tensors:

import torch
train_y = torch.LongTensor(train_df['label'].values.tolist())
valid_y = torch.LongTensor(valid_df['label'].values.tolist())
test_y = torch.LongTensor(test_df['label'].values.tolist())
train_y.size(),valid_y.size(),test_y.size()
(torch.Size([2100]), torch.Size([420]), torch.Size([300]))

Create DataLoaders for the three splits:

from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

BATCH_SIZE = 16
train_dataset = TensorDataset(train_input_ids, train_att_masks, train_y)
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=BATCH_SIZE)

valid_dataset = TensorDataset(valid_input_ids, valid_att_masks, valid_y)
valid_sampler = SequentialSampler(valid_dataset)
valid_dataloader = DataLoader(valid_dataset, sampler=valid_sampler, batch_size=BATCH_SIZE)

test_dataset = TensorDataset(test_input_ids, test_att_masks, test_y)
test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=BATCH_SIZE)
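As a quick sanity check (not part of the original notebook), you can pull one batch from the training DataLoader and confirm the tensor shapes match the batch size and sequence length:

batch_input_ids, batch_att_mask, batch_y = next(iter(train_dataloader))
print(batch_input_ids.shape)  # expect torch.Size([16, 128])
print(batch_att_mask.shape)   # expect torch.Size([16, 128])
print(batch_y.shape)          # expect torch.Size([16])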


BertForSequenceClassification Model

We'll instantiate the BertForSequenceClassification model from Hugging Face, which makes it easy to fine-tune the pretrained BERT model for a classification task.

You will see a warning that some parts of the model are randomly initialized. This is normal since the classification head has not yet been trained.

from transformers import BertForSequenceClassification
N_labels = len(train_df.label.unique())
model = BertForSequenceClassification.from_pretrained(PRETRAINED_LM,
                                                      num_labels=N_labels,
                                                      output_attentions=False,
                                                      output_hidden_states=False)
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
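The newly initialized weights belong to the small classification head that BertForSequenceClassification places on top of BERT's pooled output. If you're curious, you can inspect it directly (illustrative check):

print(model.classifier)  # a Linear layer mapping the 768-dim pooled output to the 6 emotion labels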
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device
device(type='cuda')
model = model.to(device)


Fine-tuning

Optimizer and Scheduler

An optimizer updates the model's parameters during training and is configured with a learning rate.

Choosing the learning rate is important. In practice, it's common to use a scheduler that decreases the learning rate over the course of training (a small sketch of the schedule follows the setup code below).

from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

EPOCHS = 30
LEARNING_RATE = 2e-6

optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=len(train_dataloader)*EPOCHS)
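To make the scheduler concrete, here is a minimal sketch (assuming num_warmup_steps=0, as above) of how get_linear_schedule_with_warmup decays the learning rate linearly from LEARNING_RATE down to zero over the total number of training steps:

total_steps = len(train_dataloader) * EPOCHS

def linear_lr(step, base_lr=LEARNING_RATE, total=total_steps):
    # with no warmup, the multiplier falls linearly from 1 at step 0 to 0 at the final step
    return base_lr * max(0.0, (total - step) / total)

for step in [0, total_steps // 2, total_steps - 1]:
    print(step, linear_lr(step))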


Training Loop

The training loop is where the magic happens: the model is fine-tuned on the emotion dataset for the classification task and evaluated on the validation set after each epoch.

from torch.nn.utils import clip_grad_norm_
from tqdm.notebook import tqdm
import numpy as np
import math

train_loss_per_epoch = []
val_loss_per_epoch = []


for epoch_num in range(EPOCHS):
    print('Epoch: ', epoch_num + 1)
    '''
    Training
    '''
    model.train()
    train_loss = 0
    for step_num, batch_data in enumerate(tqdm(train_dataloader,desc='Training')):
        input_ids, att_mask, labels = [data.to(device) for data in batch_data]
        output = model(input_ids = input_ids, attention_mask=att_mask, labels= labels)
        
        loss = output.loss
        train_loss += loss.item()

        model.zero_grad()
        loss.backward()
        del loss

        clip_grad_norm_(parameters=model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()

    train_loss_per_epoch.append(train_loss / (step_num + 1))              


    '''
    Validation
    '''
    model.eval()
    valid_loss = 0
    valid_pred = []
    with torch.no_grad():
        for step_num_e, batch_data in enumerate(tqdm(valid_dataloader,desc='Validation')):
            input_ids, att_mask, labels = [data.to(device) for data in batch_data]
            output = model(input_ids = input_ids, attention_mask=att_mask, labels= labels)

            loss = output.loss
            valid_loss += loss.item()
   
            valid_pred.append(np.argmax(output.logits.cpu().detach().numpy(),axis=-1))
        
    val_loss_per_epoch.append(valid_loss / (step_num_e + 1))
    valid_pred = np.concatenate(valid_pred)

    '''
    Loss message
    '''
    print("{0}/{1} train loss: {2} ".format(step_num+1, math.ceil(len(train_df) / BATCH_SIZE), train_loss / (step_num + 1)))
    print("{0}/{1} val loss: {2} ".format(step_num_e+1, math.ceil(len(valid_df) / BATCH_SIZE), valid_loss / (step_num_e + 1)))

Epoch:  1
132/132 train loss: 1.8044047635613065 
27/27 val loss: 1.7784268856048584 
Epoch:  2
132/132 train loss: 1.7764473376852092 
27/27 val loss: 1.7467590747056183 
Epoch:  3
132/132 train loss: 1.7184552062641492 
27/27 val loss: 1.6817542844348483 
Epoch:  4
132/132 train loss: 1.6414188760699648 
27/27 val loss: 1.5867101660481207 
Epoch:  5
132/132 train loss: 1.5393207082242677 
27/27 val loss: 1.486947735150655 
Epoch:  6
132/132 train loss: 1.4170215689774714 
27/27 val loss: 1.3694858639328569 
Epoch:  7
132/132 train loss: 1.294279702685096 
27/27 val loss: 1.257334996152807 
Epoch:  8
132/132 train loss: 1.1829162167780327 
27/27 val loss: 1.1601881892592818 
Epoch:  9
132/132 train loss: 1.0817181079676657 
27/27 val loss: 1.0763609210650127 
Epoch:  10
132/132 train loss: 0.9864328092697895 
27/27 val loss: 1.014382865693834 
Epoch:  11
132/132 train loss: 0.9053400375626304 
27/27 val loss: 0.9448991947703891 
Epoch:  12
132/132 train loss: 0.829155604947697 
27/27 val loss: 0.8941110240088569 
Epoch:  13
132/132 train loss: 0.7659972933205691 
27/27 val loss: 0.838810454916071 
Epoch:  14
132/132 train loss: 0.7028251493519003 
27/27 val loss: 0.8023700261557544 
Epoch:  15
132/132 train loss: 0.6520604859247352 
27/27 val loss: 0.758354742217947 
Epoch:  16
132/132 train loss: 0.6056500298507286 
27/27 val loss: 0.7294005663306625 
Epoch:  17
132/132 train loss: 0.5692454813556238 
27/27 val loss: 0.6919559869501326 
Epoch:  18
132/132 train loss: 0.5237459628419443 
27/27 val loss: 0.6667491247256597 
Epoch:  19
132/132 train loss: 0.49814519399043283 
27/27 val loss: 0.6523853794292167 
Epoch:  20
132/132 train loss: 0.4644078386552406 
27/27 val loss: 0.6313007291820314 
Epoch:  21
132/132 train loss: 0.4391077608998978 
27/27 val loss: 0.6154146304836979 
Epoch:  22
132/132 train loss: 0.4171938304648255 
27/27 val loss: 0.6058549671261398 
Epoch:  23
132/132 train loss: 0.4054019095545465 
27/27 val loss: 0.5962215772381535 
Epoch:  24
132/132 train loss: 0.3896281441504305 
27/27 val loss: 0.5836309084185848 
Epoch:  25
132/132 train loss: 0.3799004386546034 
27/27 val loss: 0.5790578303513704 
Epoch:  26
132/132 train loss: 0.3660807737121076 
27/27 val loss: 0.5728193725700732 
Epoch:  27
132/132 train loss: 0.3604669314668034 
27/27 val loss: 0.5727653321292665 
Epoch:  28
132/132 train loss: 0.3513458561942433 
27/27 val loss: 0.5694188663253078 
Epoch:  29
132/132 train loss: 0.3464061285961758 
27/27 val loss: 0.5668780207633972 
Epoch:  30
132/132 train loss: 0.345695345464981 
27/27 val loss: 0.5667423147846151 

You can see in the output that the training and validation losses steadily decrease over the epochs.

from matplotlib import pyplot as plt
epochs = range(1, EPOCHS +1 )
fig, ax = plt.subplots()
ax.plot(epochs,train_loss_per_epoch,label ='training loss')
ax.plot(epochs, val_loss_per_epoch, label = 'validation loss' )
ax.set_title('Training and Validation loss')
ax.set_xlabel('Epochs')
ax.set_ylabel('Loss')
ax.legend()
plt.show()

Performance Metrics

It's common to use precision, recall, and F1-score as the performance metrics.
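As a quick refresher (illustrative only, not part of the original notebook), the F1-score is the harmonic mean of precision and recall; below we let sklearn's classification_report compute all three per class:

def f1_score_from(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(f1_score_from(0.80, 0.70))  # hypothetical values, gives roughly 0.747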

from sklearn.metrics import classification_report
print('classification report')
print(classification_report(valid_pred, valid_df['label'].to_numpy(), target_names=label_names))
classification report
              precision    recall  f1-score   support

           0       0.70      0.80      0.75        61
           1       0.79      0.87      0.83        63
           2       0.77      0.82      0.79        66
           3       0.83      0.72      0.77        80
           4       0.80      0.73      0.76        77
           5       0.93      0.89      0.91        73

    accuracy                           0.80       420
   macro avg       0.80      0.81      0.80       420
weighted avg       0.81      0.80      0.80       420

Error Analysis

With the predictions, we can plot the confusion matrix:

from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
def plot_confusion_matrix(y_preds, y_true, labels=None):
  cm = confusion_matrix(y_true, y_preds, normalize="true")
  fig, ax = plt.subplots(figsize=(6, 6))
  disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels) 
  disp.plot(cmap="Blues", values_format=".2f", ax=ax, colorbar=False) 
  plt.title("Normalized confusion matrix")
  plt.show()
plot_confusion_matrix(valid_pred,valid_df['label'].to_numpy(),labels=label_names)

You can see that sadness is more likely to be misclassified as anger or fear, which leads to a lower F1-score for that class.
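If you want to quantify which confusions are most frequent, one option (not in the original notebook) is to rank the off-diagonal counts of the unnormalized confusion matrix:

from sklearn.metrics import confusion_matrix  # already imported above

cm_counts = confusion_matrix(valid_df['label'].to_numpy(), valid_pred)
pairs = [(label_names[i], label_names[j], cm_counts[i, j])
         for i in range(len(label_names))
         for j in range(len(label_names)) if i != j]
for true_label, pred_label, count in sorted(pairs, key=lambda p: -p[2])[:5]:
    print(f"{true_label} -> {pred_label}: {count}")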


Prediction

Now let's use the fine-tuned model to make predictions on the test set.

model.eval()
test_pred = []
test_loss= 0
with torch.no_grad():
    for step_num, batch_data in enumerate(tqdm(test_dataloader, desc='Testing')):
        input_ids, att_mask, labels = [data.to(device) for data in batch_data]
        output = model(input_ids = input_ids, attention_mask=att_mask, labels= labels)

        loss = output.loss
        test_loss += loss.item()
   
        test_pred.append(np.argmax(output.logits.cpu().detach().numpy(),axis=-1))
test_pred = np.concatenate(test_pred)
print('classification report')
print(classification_report(test_pred, test_df['label'].to_numpy(),target_names=label_names))
classification report
              precision    recall  f1-score   support

           0       0.80      0.75      0.78        53
           1       0.82      0.82      0.82        50
           2       0.80      0.80      0.80        50
           3       0.80      0.80      0.80        50
           4       0.76      0.84      0.80        45
           5       0.92      0.88      0.90        52

    accuracy                           0.82       300
   macro avg       0.82      0.82      0.82       300
weighted avg       0.82      0.82      0.82       300

With the predictions, we can plot the confusion matrix again:

plot_confusion_matrix(test_pred,test_df['label'].to_numpy(),labels=label_names)

Finally, output the misclassified texts:

test_df['pred'] = test_pred
print(test_df[test_df['label']!=test_df['pred']].shape)
test_df[test_df['label']!=test_df['pred']][['text','label','pred']].head(10)
(55, 3)
text label pred
6 i think im making up for feeling like i missed... 0 2
7 im not sure the feeling of loss will ever go a... 0 2
9 im afraid to call the guy from yesterday becau... 0 3
14 i feel like that im hated by most of the girls... 0 3
16 i wake up feeling like something terrifyingly ... 0 4
24 i had that kinda feeling but ignored it 0 3
26 i am again not inspired and after looking at i... 0 1
35 i do feel stressed 0 3
37 i feel much more energized than on a gloomy ra... 0 1
43 i cant feel dont turn your back on me i wont b... 0 3