BERT Model 1: Fine-tuning a deep-learning model for binary text classification
Institute of Computer Science,
Brandenburgische Technische Universität Cottbus-Senftenberg
Juan-Francisco Reyes
pacoreyes@protonmail.com
### THIS PAGE WILL BE UPDATED CONTINUALLY BASED ON INTERACTIONS ON THE FORUM, CHECK BACK OFTEN
This document delineates the process of preprocessing Dataset 1 and fine-tuning a BERT model for binary text classification.
This is the list of provided files:
requirements.txt: Python libraries and dependencies to install.
preprocess_dataset.py: Main text preprocessing script using spaCy.
text_utils.py: Helpers and utilities for the main preprocessing script.
utils.py: General helpers and utilities.
visualizations.py: Functions to create visualizations.
build_sliced_dataset1.py: Sliding-window script.
text_classification_dl_bert_train.py: Main BERT training script.
/shared_data/: Folder for the different versions of the dataset.
/shared_data/demo_dataset_1.jsonl: Demo version of Dataset 1 for testing purposes. Use it to compare the performance of your model against a second dataset (a short sketch for inspecting JSONL files follows this list).
/shared_images/: Folder for visualization output.
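The dataset files are in JSONL format (one JSON object per line). The sketch below, using only the Python standard library, is one possible way to inspect such a file; the path assumes the demo dataset above, and no particular field names are assumed.

```python
import json

# Path to the demo dataset provided with the assignment.
path = "shared_data/demo_dataset_1.jsonl"

# Each non-empty line is one JSON-encoded datapoint.
with open(path, "r", encoding="utf-8") as f:
    datapoints = [json.loads(line) for line in f if line.strip()]

print(f"Number of datapoints: {len(datapoints)}")
# Print the field names of the first datapoint to see the dataset schema.
if datapoints:
    print("Fields:", list(datapoints[0].keys()))
```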
IMPORTANT NOTE about updates to your dataset:
UPDATE THE COOKIE: From this point on, all team members need to update the cookie in the web application (Annotation Tool) before downloading or changing your dataset. This action needs to be done only once: using the team selector, choose another team, and then select your team again; this action updates the cookie. From then on, the cookie in the web application will know which team you belong to for further actions. If you don't do this, updates to your dataset made with the Annotation Tool will be recorded wrongly.
Your subset of Dataset 1 resides in the cloud for security reasons. You can update and download it whenever you need to.
Your dataset can be downloaded from the Annotation Tool using the "Download Dataset 1" button.
Every update to the dataset must be made through the Annotation Tool, NEVER in the JSONL file. Your grade is based on the dataset stored in the Annotation Tool.
I need to recover a datapoint: If your dataset lacks a datapoint you previously annotated, go to the Annotation Tool, load the datapoint, and save it again. Next time you download your dataset, the missing datapoint will be included.
I need to remove a datapoint: If your dataset has unnecessary datapoints, you MUST remove them through the Annotation Tool. Load the datapoint and select "---" in the class selector. When you click the "Save datapoint" button, the following message will be prompted: "ALERT: the class is empty. If you click on 'OK', the datapoint will be excluded from your dataset or not be included in your dataset. Do you want to proceed?". The next time you download your dataset, the removed datapoint won't be included. If you need to remove datapoints in bulk, contact the lecturer and provide a list of datapoint IDs.
Make all changes to your dataset using the Annotation Tool until the dataset contains what you need to fine-tune your BERT model.
Install dependencies listed in requirements.txt.
The initial version of your dataset is dataset_1_raw.jsonl. This file needs to be preprocessed and anonymized using the script preprocess_dataset.py (a generic anonymization sketch follows these steps). A new version of your dataset will be generated with the name dataset_1.jsonl.
Apply the sliding-window approach to the dataset using the script build_sliced_dataset1.py (a minimal sketch of the idea follows these steps). The new version of your dataset will be generated with the name dataset_1_sliced.jsonl.
Use the script text_classification_dl_bert_train.py to train the BERT model.
Modify the hyperparameters if necessary to achieve the highest accuracy (a fine-tuning sketch with typical hyperparameters also follows these steps).
Use the visualizations to evaluate the performance of your model: the confusion matrix (shared_images/dataset1_model_confusion_matrix.png) to track true and false positives and negatives, and the training and validation loss plot (shared_images/bert_model_losses.png) to check whether your model is overfitting or underfitting [link1] [link2].
Depending on the environment where you train the model, you may want to save the model; if not, remove the code segments that handle saving.
Depending on the environment where you train the model, you may want to use MLflow to track the results of your experiments; if not, remove the code segments that refer to it.
Early stopping is partially implemented in the code; uncomment and complete those code segments if necessary (a generic early-stopping sketch follows the loss-interpretation guide below).
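The scripts provided above are authoritative; the sketches below only illustrate the ideas behind three of the steps. First, preprocessing and anonymization (preprocess_dataset.py): the sketch shows a generic NER-based anonymization with spaCy, assuming the en_core_web_sm model and a [PERSON] placeholder; the actual script may anonymize differently.

```python
import spacy

# Small English pipeline; the model used by preprocess_dataset.py may differ.
nlp = spacy.load("en_core_web_sm")

def anonymize(text):
    """Replace person names with a placeholder using spaCy NER (illustrative only)."""
    doc = nlp(text)
    anonymized = text
    # Replace entities from the end of the string so earlier character offsets stay valid.
    for ent in reversed(doc.ents):
        if ent.label_ == "PERSON":
            anonymized = anonymized[:ent.start_char] + "[PERSON]" + anonymized[ent.end_char:]
    return anonymized

print(anonymize("Angela Merkel addressed the parliament on Tuesday."))
# -> "[PERSON] addressed the parliament on Tuesday."
```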
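Second, the sliding-window step (build_sliced_dataset1.py): the sketch below illustrates the general idea using the Hugging Face tokenizer's stride/overflow mechanism; the window size and stride values are assumptions, not the script's actual settings.

```python
from transformers import BertTokenizerFast

# Tokenizer matching the model that will be fine-tuned.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def slice_text(text, max_length=512, stride=64):
    """Split a long text into overlapping windows of at most max_length tokens."""
    encoding = tokenizer(
        text,
        max_length=max_length,
        stride=stride,                   # number of tokens shared by consecutive windows
        truncation=True,
        return_overflowing_tokens=True,  # emit one window per overflowing chunk
    )
    # Decode each window back to text so it can be stored as a new datapoint.
    return [tokenizer.decode(ids, skip_special_tokens=True)
            for ids in encoding["input_ids"]]

# Each slice keeps the label of the original datapoint when the sliced dataset is built.
```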
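Third, training (text_classification_dl_bert_train.py): the sketch below only shows where the typical fine-tuning hyperparameters live in a Hugging Face Trainer setup; the provided script may use a different training loop, and all values shown (learning rate, batch size, epochs) are starting-point assumptions to be tuned.

```python
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

# Binary classification head on top of BERT (num_labels=2).
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Commonly tuned hyperparameters; the values are assumptions, not the script's defaults.
training_args = TrainingArguments(
    output_dir="bert_model_1",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=4,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # evaluate on the validation split every epoch
)

# train_dataset / eval_dataset would come from dataset_1_sliced.jsonl after tokenization.
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train()
```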
This is an oversimplified guide to interpreting the training and validation loss plot:
Underfitting: If both training and validation losses are high, or they decrease very slowly over epochs, it’s an indicator of underfitting. The model is not complex enough to capture the underlying trend of the data. Both curves may remain high and close to each other.
Overfitting: If the training loss continues to decrease with epochs but the validation loss starts increasing after a certain point, it’s an indicator of overfitting. The model learns to memorize the training data but performs poorly on unseen data. There will be a noticeable gap between the training and validation loss curves, with the training loss much lower than the validation loss.
Good Fit: If both training and validation losses decrease and stabilize to a point, with a small gap between them, it’s an indicator of a good fit. The model has learned to generalize well from the training data to unseen data.
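Early stopping reacts exactly to the overfitting pattern described above: training is halted once the validation loss has stopped improving for a given number of epochs (the patience). The provided script has its own partial implementation; the sketch below is only a generic illustration, and the helpers train_one_epoch and evaluate, as well as the patience value, are hypothetical.

```python
# Generic early-stopping logic inside a training loop (illustrative only).
best_val_loss = float("inf")
patience = 3              # epochs to wait for an improvement before stopping (assumed)
epochs_no_improve = 0

for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, train_loader)  # hypothetical helper
    val_loss = evaluate(model, val_loader)              # hypothetical helper

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_no_improve = 0
        # Optionally save a checkpoint of the best model here.
    else:
        epochs_no_improve += 1
        if epochs_no_improve >= patience:
            print(f"Early stopping at epoch {epoch}: validation loss stopped improving.")
            break
```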
Submit the following items in a single zip file:
text_classification_dl_bert_train.py file.
dataset1_model_confusion_matrix.png file.
metrics.txt file with all metrics information.