Fine-tuning a BERT model on social media data
Getting and preparing the data
The dataset we will use comes from Kaggle; you can download it here: https://www.kaggle.com/datasets/farisdurrani/sentimentsearch (CC BY 4.0 License). In my experiments, I only used the Facebook and Twitter datasets.
The following snippet takes the csv files and saves 3 splits (training, validation, and test) wherever you want. I recommend saving them in Google Cloud Storage.
You can run the script with:
python make_splits.py --output-dir gs://your-bucket/
import pandas as pd
import argparse
import numpy as np
from sklearn.model_selection import train_test_split


def make_splits(output_dir):
    df = pd.concat([
        pd.read_csv("data/farisdurrani/twitter_filtered.csv"),
        pd.read_csv("data/farisdurrani/facebook_filtered.csv")
    ])
    df = df.dropna(subset=['sentiment'], axis=0)
    # Map the sentiment score to 3 classes: 0 = negative, 1 = neutral, 2 = positive
    df['Target'] = df['sentiment'].apply(lambda x: 1 if x == 0 else np.sign(x) + 1).astype(int)

    df_train, df_ = train_test_split(df, stratify=df['Target'], test_size=0.2)
    df_eval, df_test = train_test_split(df_, stratify=df_['Target'], test_size=0.5)

    print(f"Files will be saved in {output_dir}")
    df_train.to_csv(output_dir + "/train.csv", index=False)
    df_eval.to_csv(output_dir + "/eval.csv", index=False)
    df_test.to_csv(output_dir + "/test.csv", index=False)

    print(f"Train : ({df_train.shape}) samples")
    print(f"Val : ({df_eval.shape}) samples")
    print(f"Test : ({df_test.shape}) samples")


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--output-dir')
    args, _ = parser.parse_known_args()
    make_splits(args.output_dir)
The data should look roughly like this:
Using a small BERT pretrained model
For our model, we will use a lightweight BERT model, BERT-Tiny. This model has already been pretrained on vast amounts of data, but not necessarily with social media data and not necessarily with the objective of doing Sentiment Analysis. This is why we will fine-tune it.
It contains only 2 layers with a 128-unit hidden dimension; the full list of models can be seen here if you want to take a larger one.
Let’s first create a main.py file, with all necessary modules:
import pandas as pd
import argparse
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
import logging
import os

os.environ["TFHUB_MODEL_LOAD_FORMAT"] = "UNCOMPRESSED"


def train_and_evaluate(**params):
    pass
    # will be updated as we go
Let’s also write down our requirements in a dedicated requirements.txt file:
transformers==4.40.1
torch==2.2.2
pandas==2.0.3
scikit-learn==1.3.2
gcsfs
We will now load 2 components to train our model:
- The tokenizer, which takes care of splitting the text inputs into the tokens that BERT has been trained with.
- The model itself.
You can obtain both from Huggingface here. You can also download them to Cloud Storage. That is what I did, and I will therefore load them with:
# Load pretrained tokenizer and bert model
tokenizer = BertTokenizer.from_pretrained('models/bert_uncased_L-2_H-128_A-2/vocab.txt')
model = BertModel.from_pretrained('models/bert_uncased_L-2_H-128_A-2')
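Note that you can also load both directly from the Hugging Face Hub instead of a local copy; the identifier below is an assumption based on the BERT-Tiny naming convention, so double check it against the model card you are using:

from transformers import BertTokenizer, BertModel

# Assumed Hub id for BERT-Tiny (2 layers, hidden size 128)
HUB_ID = "google/bert_uncased_L-2_H-128_A-2"
tokenizer = BertTokenizer.from_pretrained(HUB_ID)
model = BertModel.from_pretrained(HUB_ID)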
Let’s now add the following piece to our file:
class SentimentBERT(nn.Module):
    def __init__(self, bert_model):
        super().__init__()
        self.bert_module = bert_model
        self.dropout = nn.Dropout(0.1)
        self.final = nn.Linear(in_features=128, out_features=3, bias=True)

        # Uncomment the below if you only want to retrain certain layers.
        # self.bert_module.requires_grad_(False)
        # for param in self.bert_module.encoder.parameters():
        #     param.requires_grad = True

    def forward(self, inputs):
        ids, mask, token_type_ids = inputs['ids'], inputs['mask'], inputs['token_type_ids']
        # print(ids.size(), mask.size(), token_type_ids.size())
        x = self.bert_module(ids, mask, token_type_ids)
        x = self.dropout(x['pooler_output'])
        out = self.final(x)
        return out
A quick break here. We have several options when it comes to reusing an existing model:
- Transfer learning: we freeze the weights of the model and use it as a “feature extractor”. We can then append additional layers downstream. This is frequently used in Computer Vision, where models like VGG, Xception, etc. can be reused to train a custom model on small datasets.
- Fine-tuning: we unfreeze all or part of the weights of the model and retrain the model on a custom dataset. This is the preferred approach when training custom LLMs.
More details on Transfer learning and Fine-tuning here:
In the model, we have chosen to unfreeze the whole model, but feel free to freeze one or more layers of the pretrained BERT module and see how it influences the performance (a short sketch follows below).
The key part here is to add a fully connected layer after the BERT module to “link” it to our classification task, hence the final layer with 3 units. This allows us to reuse the pretrained BERT weights and adapt the model to our task.
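If you want to try the transfer learning option instead, here is a minimal sketch of how you could freeze the whole BERT module and only train the final classification layer (the commented-out lines in the class above hint at the same idea; attribute names follow the Hugging Face BertModel implementation):

# Minimal sketch: freeze the pretrained BERT weights, train only the classification head
classifier = SentimentBERT(bert_model=model)
classifier.bert_module.requires_grad_(False)

# Optionally unfreeze the pooler (or the last encoder layers) for a bit more capacity
# classifier.bert_module.pooler.requires_grad_(True)

# Only 'final.weight' and 'final.bias' should remain trainable
print([name for name, p in classifier.named_parameters() if p.requires_grad])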
Creating the dataloaders
To create the dataloaders we will need the Tokenizer loaded above. The Tokenizer takes a string as input and returns several outputs, among which we can find the tokens (‘input_ids’ in our case):
The BERT tokenizer is a bit special and returns several outputs, but the most important one is input_ids: these are the tokens used to encode our sentence. They might be words, or parts of words. For example, the word “looking” might be made of 2 tokens, “look” and “##ing”.
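As an illustration, here is roughly what the tokenizer returns for a short sentence (the exact ids depend on the vocabulary file you loaded):

# Quick look at the tokenizer outputs
sample = tokenizer.encode_plus(
    "Looking forward to the weekend!",
    add_special_tokens=True,
    return_attention_mask=True,
)
print(sample["input_ids"])       # integer token ids, framed by [CLS] and [SEP]
print(sample["attention_mask"])  # 1 for real tokens, 0 for padding
print(tokenizer.convert_ids_to_tokens(sample["input_ids"]))  # the subword pieces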
Let’s now create a dataloader module which will handle our datasets:
class BertDataset(Dataset):
    def __init__(self, df, tokenizer, max_length=100):
        super(BertDataset, self).__init__()
        self.df = df
        self.tokenizer = tokenizer
        self.target = self.df['Target']
        self.max_length = max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        X = self.df['bodyText'].values[idx]
        y = self.target.values[idx]

        inputs = self.tokenizer.encode_plus(
            X,
            pad_to_max_length=True,
            add_special_tokens=True,
            return_attention_mask=True,
            max_length=self.max_length,
        )
        ids = inputs["input_ids"]
        token_type_ids = inputs["token_type_ids"]
        mask = inputs["attention_mask"]

        x = {
            'ids': torch.tensor(ids, dtype=torch.long).to(DEVICE),
            'mask': torch.tensor(mask, dtype=torch.long).to(DEVICE),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long).to(DEVICE)
        }
        y = torch.tensor(y, dtype=torch.long).to(DEVICE)

        return x, y
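A quick sketch of how this dataset is meant to be used (assuming df_train holds the training split and tokenizer is the one loaded earlier); the full version appears in the complete script below:

from torch.utils.data import DataLoader

train_ds = BertDataset(df_train, tokenizer, max_length=100)
train_loader = DataLoader(dataset=train_ds, batch_size=32, shuffle=True)

# Each batch is a dict of tensors plus the labels
inputs, labels = next(iter(train_loader))
print(inputs['ids'].shape, labels.shape)  # (batch_size, max_length), (batch_size,)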
Writing the main script to train the model
Let us first define two functions to handle the training and evaluation steps:
def train(epoch, model, dataloader, loss_fn, optimizer, max_steps=None):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 50
    start_time = time.time()

    for idx, (inputs, label) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(inputs)
        loss = loss_fn(predicted_label, label)
        loss.backward()
        optimizer.step()

        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)

        if idx % log_interval == 0:
            elapsed = time.time() - start_time
            print(
                "Epoch {:3d} | {:5d}/{:5d} batches "
                "| accuracy {:8.3f} | loss {:8.3f} ({:.3f}s)".format(
                    epoch, idx, len(dataloader), total_acc / total_count, loss.item(), elapsed
                )
            )
            total_acc, total_count = 0, 0
            start_time = time.time()

        if max_steps is not None:
            if idx == max_steps:
                return {'loss': loss.item(), 'acc': total_acc / total_count}

    return {'loss': loss.item(), 'acc': total_acc / total_count}
def evaluate(model, dataloader, loss_fn):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (inputs, label) in enumerate(dataloader):
            predicted_label = model(inputs)
            loss = loss_fn(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)

    return {'loss': loss.item(), 'acc': total_acc / total_count}
We are getting closer to getting our main script up and running. Let’s sew the pieces together. We have:
- A BertDataset class to handle the loading of the data
- A SentimentBERT model which takes our Tiny-BERT model and adds an additional layer for our custom use case
- train() and evaluate() functions to handle the training and evaluation steps
- A train_and_evaluate() function that bundles everything together
We will use argparse to be able to launch our script with arguments. Such arguments are typically the train/eval/test files, so we can run our model with any datasets, the path where our model will be saved, and parameters related to the training.
import pandas as pd
import time
import torch.nn as nn
import torch
import logging
import numpy as np
import argparse

from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertModel

logging.basicConfig(format='%(asctime)s [%(levelname)s]: %(message)s', level=logging.DEBUG)
logging.getLogger().setLevel(logging.INFO)

# --- CONSTANTS ---
BERT_MODEL_NAME = 'small_bert/bert_en_uncased_L-2_H-128_A-2'

if torch.cuda.is_available():
    logging.info(f"GPU: {torch.cuda.get_device_name(0)} is available.")
    DEVICE = torch.device('cuda')
else:
    logging.info("No GPU available. Training will run on CPU.")
    DEVICE = torch.device('cpu')
# --- Data preparation and tokenization ---
class BertDataset(Dataset):
    def __init__(self, df, tokenizer, max_length=100):
        super(BertDataset, self).__init__()
        self.df = df
        self.tokenizer = tokenizer
        self.target = self.df['Target']
        self.max_length = max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        X = self.df['bodyText'].values[idx]
        y = self.target.values[idx]

        inputs = self.tokenizer.encode_plus(
            X,
            pad_to_max_length=True,
            add_special_tokens=True,
            return_attention_mask=True,
            max_length=self.max_length,
        )
        ids = inputs["input_ids"]
        token_type_ids = inputs["token_type_ids"]
        mask = inputs["attention_mask"]

        x = {
            'ids': torch.tensor(ids, dtype=torch.long).to(DEVICE),
            'mask': torch.tensor(mask, dtype=torch.long).to(DEVICE),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long).to(DEVICE)
        }
        y = torch.tensor(y, dtype=torch.long).to(DEVICE)

        return x, y
# --- Model definition ---
class SentimentBERT(nn.Module):
    def __init__(self, bert_model):
        super().__init__()
        self.bert_module = bert_model
        self.dropout = nn.Dropout(0.1)
        self.final = nn.Linear(in_features=128, out_features=3, bias=True)

    def forward(self, inputs):
        ids, mask, token_type_ids = inputs['ids'], inputs['mask'], inputs['token_type_ids']
        x = self.bert_module(ids, mask, token_type_ids)
        x = self.dropout(x['pooler_output'])
        out = self.final(x)
        return out
# --- Training loop ---
def train(epoch, model, dataloader, loss_fn, optimizer, max_steps=None):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 50
    start_time = time.time()

    for idx, (inputs, label) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(inputs)
        loss = loss_fn(predicted_label, label)
        loss.backward()
        optimizer.step()

        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)

        if idx % log_interval == 0:
            elapsed = time.time() - start_time
            print(
                "Epoch {:3d} | {:5d}/{:5d} batches "
                "| accuracy {:8.3f} | loss {:8.3f} ({:.3f}s)".format(
                    epoch, idx, len(dataloader), total_acc / total_count, loss.item(), elapsed
                )
            )
            total_acc, total_count = 0, 0
            start_time = time.time()

        if max_steps is not None:
            if idx == max_steps:
                return {'loss': loss.item(), 'acc': total_acc / total_count}

    return {'loss': loss.item(), 'acc': total_acc / total_count}
# --- Validation loop ---
def evaluate(model, dataloader, loss_fn):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (inputs, label) in enumerate(dataloader):
            predicted_label = model(inputs)
            loss = loss_fn(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)

    return {'loss': loss.item(), 'acc': total_acc / total_count}
# --- Main function ---
def train_and_evaluate(**params):
    logging.info("running with the following params :")
    logging.info(params)

    # Load pretrained tokenizer and bert model
    # update the paths to whichever you are using
    tokenizer = BertTokenizer.from_pretrained('models/bert_uncased_L-2_H-128_A-2/vocab.txt')
    model = BertModel.from_pretrained('models/bert_uncased_L-2_H-128_A-2')

    # Training parameters
    epochs = int(params.get('epochs'))
    batch_size = int(params.get('batch_size'))
    learning_rate = float(params.get('learning_rate'))

    # Load the data
    df_train = pd.read_csv(params.get('training_file'))
    df_eval = pd.read_csv(params.get('validation_file'))
    df_test = pd.read_csv(params.get('testing_file'))

    # Create dataloaders
    train_ds = BertDataset(df_train, tokenizer, max_length=100)
    train_loader = DataLoader(dataset=train_ds, batch_size=batch_size, shuffle=True)
    eval_ds = BertDataset(df_eval, tokenizer, max_length=100)
    eval_loader = DataLoader(dataset=eval_ds, batch_size=batch_size)
    test_ds = BertDataset(df_test, tokenizer, max_length=100)
    test_loader = DataLoader(dataset=test_ds, batch_size=batch_size)

    # Create the model
    classifier = SentimentBERT(bert_model=model).to(DEVICE)
    total_parameters = sum([np.prod(p.size()) for p in classifier.parameters()])
    model_parameters = filter(lambda p: p.requires_grad, classifier.parameters())
    trainable_parameters = sum([np.prod(p.size()) for p in model_parameters])
    logging.info(f"Total params : {total_parameters} - Trainable : {trainable_parameters} ({trainable_parameters/total_parameters*100}% of total)")

    # Optimizer and loss functions
    optimizer = torch.optim.Adam([p for p in classifier.parameters() if p.requires_grad], learning_rate)
    loss_fn = nn.CrossEntropyLoss()

    # If dry run, we only train for 1 epoch with 1 step
    logging.info(f'Training model with {BERT_MODEL_NAME}')
    if args.dry_run:
        logging.info("Dry run mode")
        epochs = 1
        steps_per_epoch = 1
    else:
        steps_per_epoch = None

    # Action !
    for epoch in range(1, epochs + 1):
        epoch_start_time = time.time()
        train_metrics = train(epoch, classifier, train_loader, loss_fn=loss_fn, optimizer=optimizer, max_steps=steps_per_epoch)
        eval_metrics = evaluate(classifier, eval_loader, loss_fn=loss_fn)

        print("-" * 59)
        print(
            "End of epoch {:3d} - time: {:5.2f}s - loss: {:.4f} - accuracy: {:.4f} - valid_loss: {:.4f} - valid accuracy {:.4f} ".format(
                epoch, time.time() - epoch_start_time, train_metrics['loss'], train_metrics['acc'], eval_metrics['loss'], eval_metrics['acc']
            )
        )
        print("-" * 59)

    if args.dry_run:
        # If dry run, we do not run the evaluation
        return None

    test_metrics = evaluate(classifier, test_loader, loss_fn=loss_fn)

    metrics = {
        'train': train_metrics,
        'val': eval_metrics,
        'test': test_metrics,
    }
    logging.info(metrics)

    # save model and architecture to a single file
    if params.get('job_dir') is None:
        logging.warning("No job dir provided, model will not be saved")
    else:
        logging.info("Saving model to {} ".format(params.get('job_dir')))
        torch.save(classifier.state_dict(), params.get('job_dir'))
    logging.info("Bye bye")
if __name__ == '__main__':
    # Create arguments here
    parser = argparse.ArgumentParser()
    parser.add_argument('--training-file', required=True, type=str)
    parser.add_argument('--validation-file', required=True, type=str)
    parser.add_argument('--testing-file', type=str)
    parser.add_argument('--job-dir', type=str)
    parser.add_argument('--epochs', type=float, default=2)
    parser.add_argument('--batch-size', type=float, default=1024)
    parser.add_argument('--learning-rate', type=float, default=0.01)
    parser.add_argument('--dry-run', action="store_true")

    # Parse them
    args, _ = parser.parse_known_args()

    # Execute training
    train_and_evaluate(**vars(args))
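With the arguments defined above, a run could look like this (the paths are placeholders for wherever you stored the splits and want to save the model):

python main.py \
    --training-file gs://your-bucket/train.csv \
    --validation-file gs://your-bucket/eval.csv \
    --testing-file gs://your-bucket/test.csv \
    --job-dir models/sentiment_bert.pt \
    --epochs 10 \
    --batch-size 32 \
    --learning-rate 0.0001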
This is great, but unfortunately, this model will take a long time to train. Indeed, with around 4.7M parameters to train, one step takes around 3s on a 16GB MacBook Pro with an Intel chip.
3s per step can be quite long when you have 1238 steps to go and 10 epochs to complete…
No GPU, no party.