BERT Hyperparameters: A Guide to Fine-Tuning BERT Models
Institute of Computer Science, Brandenburgische Technische Universität Cottbus-Senftenberg
Juan-Francisco Reyes
pacoreyes@protonmail.com
### THIS PAGE WILL BE UPDATED CONTINUALLY BASED ON INTERACTIONS ON THE FORUM – RETURN OFTEN
NOTE: This guide complements the Hugging Face NLP Course and assumes a good understanding of the core concepts of fine-tuning a BERT model described in that course.
Fine-tuning BERT (Bidirectional Encoder Representations from Transformers) models involves adapting a pre-trained BERT model to a specific task using a smaller, task-specific dataset. Initially, BERT is trained on a large corpus to learn a wide range of language representations. When you fine-tune BERT, you start with this pre-trained model and continue the training process, but now the model is exposed to your specific dataset related to the task at hand. During this phase, the pre-trained parameters and the additional task-specific layers are adjusted to better align with your specific data. The fine-tuning is typically quicker than the initial training, as the model has already learned a substantial amount of language understanding. The objective is to leverage the generic language model of BERT and refine it to become more specialized in the task context, thereby enhancing its performance on that particular task while retaining its extensive language understanding capabilities.
The provided code in bert_model_2_training.py follows the typical fine-tuning workflow (a minimal code sketch of the first steps appears after this list):
Load Model: This step is initiated by loading a pre-trained BERT model from the Hugging Face model repository.
Prepare Data: The dataset must be pre-processed into the format BERT expects: splitting it into training, validation, and test sets; tokenizing the text with the same tokenizer (BertTokenizer, in this case) used during BERT pre-training; and formatting it into a structure compatible with the model, which involves creating attention masks and segment IDs.
Define Task-Specific Head: A task-specific layer or 'head' is added to the BERT model, a neural network structure designed to interpret BERT's embeddings for your specific task. We load BertForSequenceClassification, a task-specific head for sequence classification, suitable for tasks like binary classification or sentiment analysis.
Configure: Configure the model's hyperparameters, such as learning rate, batch size, and the number of epochs. These three parameters are crucial in BERT model fine-tuning, influencing the model's performance and training efficiency.
Train: During this phase, the BERT model, along with the task-specific head, is fine-tuned using the dataset. The pre-existing BERT parameters and the parameters of the new head are updated to fit the specific NLP task better.
Evaluate and Validate: We evaluate the model on a validation set to assess its performance after training. This step is crucial to ensure that the model generalizes well and to adjust hyperparameters if needed to avoid issues like overfitting.
Test: Finally, we use the fine-tuned model to make predictions on new unseen data in the test split. Here, we see the practical application of our fine-tuned BERT model as it performs the NLP task for which it was trained.
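The following is a minimal sketch of the first steps of this workflow, not the actual bert_model_2_training.py script; the label map, dropout value, and example sentences are illustrative assumptions:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Assumed example values, not the seminar's actual settings
LABEL_MAP = {"monologic": 0, "dialogic": 1}
DROP_OUT_RATE = 0.1

# Load the pre-trained model with a sequence-classification head
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(LABEL_MAP),
    hidden_dropout_prob=DROP_OUT_RATE)

# Tokenize with the matching tokenizer; attention masks and segment IDs are created automatically
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
texts = ["We will rebuild our economy.", "What do you think about the new bill?"]
labels = torch.tensor([0, 1])
encodings = tokenizer(texts, truncation=True, padding=True, max_length=512, return_tensors="pt")

# A single forward pass returns the loss and the logits of the task-specific head
outputs = model(**encodings, labels=labels)
print(outputs.loss.item(), outputs.logits.shape)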
Three hyperparameters are crucial in the fine-tuning process of BERT models: the learning rate, the batch size, and the number of training epochs.
The learning rate controls the magnitude of updates to the model's weights during training, and it is crucial to find the balance between rapid convergence (how quickly the model reaches a point where it performs well) and the risk of overshooting (skipping past) optimal solutions. A lower learning rate is often chosen for fine-tuning pre-trained models like BERT so that adjustments remain incremental, avoiding the disruption of the already learned representations.
The batch size, which determines the number of training examples used in one iteration, significantly impacts the model's training dynamics. A larger batch size can lead to faster training and smoother convergence due to more stable gradient estimates. However, it also increases the computational load and may require a proportionate adjustment of the learning rate. Conversely, a smaller batch size, while less computationally intensive, might lead to noisier gradient estimates, potentially requiring more epochs to converge.
The number of epochs in the fine-tuning process influences how long the model is exposed to the training data. In the context of a pre-trained model like BERT, fewer epochs are generally required for fine-tuning compared to training from scratch. However, the optimal number of epochs is a delicate balance; too few might result in underfitting, while too many could lead to overfitting, especially when dealing with smaller, task-specific datasets.
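The sketch below shows where these three hyperparameters enter a typical PyTorch fine-tuning loop; it is not the seminar script, and it assumes that model is the loaded BertForSequenceClassification and that train_dataset yields dictionaries of input_ids, attention_mask, and labels tensors:

from torch.optim import AdamW
from torch.utils.data import DataLoader

LEARNING_RATE = 2e-5   # magnitude of each weight update
BATCH_SIZE = 16        # examples processed per gradient step
NUM_EPOCHS = 3         # full passes over the training data

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)

for epoch in range(NUM_EPOCHS):
    for batch in train_loader:
        optimizer.zero_grad()
        outputs = model(**batch)   # the batch contains input_ids, attention_mask, and labels
        outputs.loss.backward()
        optimizer.step()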
In addition to the primary hyperparameters, there are other less emphasized hyperparameters in the fine-tuning process of BERT models.
The weight decay adds a regularization term to the loss function by penalizing large weights, encouraging the model to maintain simpler, more generalizable features rather than overfitting the training data. The careful calibration of weight decay can aid in striking a balance between model complexity and its generalization ability, which is especially crucial when working with limited or highly specific datasets, like our second project in the seminar.
The dropout rate is used in various layers of the BERT model, including the attention mechanisms and the task-specific head. Dropout randomly deactivates a subset of neurons during training, which helps prevent co-adaptation of features and fosters a more robust internal representation of the data. Adjusting the dropout rate can significantly impact the model's resilience to overfitting, with higher rates often leading to better generalization at the cost of slower convergence.
The warmup steps parameter plays a nuanced role in the fine-tuning process. During the initial training phase, the learning rate incrementally increases from zero to the pre-set learning rate over a number of warmup steps. This gradual increase helps stabilize the training process, preventing the model from making excessively large weight updates in the early stages, which can destabilize a pre-trained model like BERT.
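A sketch of how these three settings are commonly wired up with PyTorch and the Hugging Face transformers API follows; the concrete values are placeholders, and the total number of training steps is the example value derived in the warmup calculation further below:

from torch.optim import AdamW
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup

WEIGHT_DECAY = 0.01          # penalty on large weights
DROP_OUT_RATE = 0.1          # dropout probability inside BERT's layers
WARMUP_STEPS = 75            # steps over which the learning rate ramps up from zero
TOTAL_TRAINING_STEPS = 750   # example value; see the calculation below

# Dropout is set through the model configuration
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2, hidden_dropout_prob=DROP_OUT_RATE)

# Weight decay is set on the optimizer
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=WEIGHT_DECAY)

# Warmup is handled by the learning-rate scheduler; call scheduler.step() after each optimizer.step()
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=TOTAL_TRAINING_STEPS)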
The adjustment of these less discussed hyperparameters should be made in conjunction with the primary ones, which demands a nuanced understanding of the model's dynamics and the specific requirements of our NLP task. It is therefore important to take a tailored and iterative approach to setting hyperparameters, aligning them with the characteristics of the NLP task and the specific dataset at hand to optimize the performance of the fine-tuned BERT model.
Detecting subtle linguistic patterns, such as classifying political discourse as monologic or dialogic, detecting topic shifts in transcribed political discourse, or classifying stance in political discourse, is challenging, and fine-tuning a BERT model to capture these nuances requires careful consideration of various factors, including hyperparameters. Here are some strategies and hyperparameter adjustments that can help your BERT model become more sensitive to subtle linguistic patterns:
For the learning rate, start with a value of 2e-5. A lower value can sometimes help the model learn finer details by making smaller updates to the weights; however, it is important to balance this, as too low a rate might lead to slow convergence or getting stuck in a local minimum (a point where the loss is lower than in its immediate surroundings but may not be the lowest value overall). Experiment with learning rates in the lower end of the typical range for BERT (e.g., 1e-5 to 3e-5).
For batch size, begin with 16 or 32 and adjust based on the available computational resources. A smaller value can lead to a more fine-grained update of weights. It might help the model to pick up subtleties in the data, but it can also increase training time and variance in training.
For the number of epochs, start with 3. Increasing the number of training epochs gives the model more opportunity to learn from the data and might improve performance, but it also raises the risk of overfitting. Implement early stopping or monitor the validation loss to prevent this. In the final presentation, you will discuss your early-stopping strategy.
For the warmup steps, the value is calculated from the size of your training dataset, the batch size, and the number of epochs. Each step corresponds to processing one batch of data; typically, 10% of the total number of training steps is a good starting point.
First, we calculate the total number of training steps:

total training steps = (number of training examples / batch size) × number of epochs

For example, suppose you have a training dataset of 8000 examples, a batch size of 32, and you plan to train the model for 3 epochs. The total number of training steps would be calculated as follows:

total training steps = (8000 / 32) × 3 = 250 × 3 = 750

We use 10% of the total training steps as warmup steps:

warmup steps = 0.1 × total training steps

Therefore, from our example, with 750 total training steps, we finally get:

warmup steps = 0.1 × 750 = 75
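The same calculation in Python, using the example values above:

import math

NUM_EXAMPLES = 8000   # size of the training split (example value)
BATCH_SIZE = 32
NUM_EPOCHS = 3
WARMUP_RATIO = 0.1    # use 10% of the total training steps for warmup

steps_per_epoch = math.ceil(NUM_EXAMPLES / BATCH_SIZE)      # 250
total_training_steps = steps_per_epoch * NUM_EPOCHS         # 750
warmup_steps = int(WARMUP_RATIO * total_training_steps)     # 75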
Weight decay and dropout rate are forms of regularization that can prevent overfitting. However, if they are too high, they might prevent the model from learning the finer details. Adjust these parameters to find a good balance: a common value for weight decay is 0.01, while the dropout rate is typically in the range of 0.1 to 0.3.
To tune these hyperparameters, monitor performance on a validation set, for example by visualizing the training and validation losses.
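A minimal early-stopping sketch based on the validation loss; the patience value and the train_one_epoch and evaluate helpers are assumptions, not part of the seminar script:

import torch

best_val_loss = float("inf")
patience = 2                     # assumed number of epochs to wait without improvement
epochs_without_improvement = 0

for epoch in range(NUM_EPOCHS):
    train_one_epoch(model, train_loader, optimizer)   # assumed training helper
    val_loss = evaluate(model, val_loader)            # assumed helper returning the validation loss

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
        torch.save(model.state_dict(), "best_model.pt")   # keep the best checkpoint
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Stopping early at epoch {epoch}")
            break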
Finally, notice that there is no one-size-fits-all set of hyperparameters. The optimal configuration can vary depending on the specific characteristics of the dataset and the subtleties we are trying to capture. Experimentation and validation are key to finding the right balance.
In our specific use cases, certain types of linguistic structures and patterns are more subtle and thus harder to classify, which creates the need to adjust the class weights to give more emphasis to these categories, especially when dealing with classes that are naturally less frequent but equally important, as in most real-world cases. For this purpose, we use the compute_class_weight function from scikit-learn, which calculates a weight for each class to be used in the training process, allowing the model to pay more attention to underrepresented classes.
When you train your model, the modified loss function (with class weights) is used in the training loop. This ensures that during the backpropagation step, the model's parameters are updated taking into account the class weights. It helps in balancing the learning process, especially when the distribution of classes is skewed. By incorporating class weights in this manner, your model is better equipped to handle imbalances in the training data, which can lead to more robust and fair performance across different classes.
from sklearn.utils.class_weight import compute_class_weight
# Balanced class weights for each label, converted to a tensor on the model's device
class_weights = compute_class_weight(class_weight="balanced", classes=np.unique(labels), y=labels)
class_weights = torch.tensor(class_weights, dtype=torch.float).to(device)
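A sketch of how the weighted loss can then be applied inside the training loop, assuming a standard cross-entropy objective (the actual script may wire this differently):

import torch.nn as nn

# Cross-entropy loss that weights each class by the computed class weights
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

outputs = model(input_ids=input_ids, attention_mask=attention_mask)
loss = loss_fn(outputs.logits, labels)   # weighted loss used for backpropagation
loss.backward()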
If computational resources allow, experimenting with larger versions of BERT (like BERT-Large) might help, as they can potentially capture more complex patterns in the data.
Replace these lines:
# Load BERT model
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(LABEL_MAP),
                                                      hidden_dropout_prob=DROP_OUT_RATE)
(...)
# Move model to device
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
With:
# Load BERT model
model = BertForSequenceClassification.from_pretrained("bert-large-uncased",
                                                      num_labels=len(LABEL_MAP),
                                                      hidden_dropout_prob=DROP_OUT_RATE)
(...)
# Move model to device
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
Grid search is a technique used to systematically work through multiple combinations of hyperparameter values, cross-validating to determine which combination performs best. In the context of fine-tuning BERT models, a grid search involves defining a grid of hyperparameters, such as learning rate, batch size, number of epochs, and potentially others like dropout rates or weight decay. The grid search algorithm then evaluates the model performance for each combination of these hyperparameters.
The primary advantage of grid search is its thoroughness. Exploring all possible combinations within the predefined grid ensures that the best-performing hyperparameters are not overlooked. However, this thoroughness comes at the cost of computational resources and time, especially when the grid is large, or the model is complex, as is the case with BERT.
Here is a basic implementation of grid search in Python:
import itertools

HYPERPARAMETERS_GRID = {
    "learning_rate": [1.5e-5, 2e-5, 2.5e-5, 3e-5, 3.5e-5],
    "batch_size": [16, 32],
    "num_epochs": [2, 3, 4],
    "warmup_steps": [0, 100, 1000],
    "weight_decay": [0, 1e-2, 1e-3],
    "drop_out_rate": [0.1, 0.2]
}

# Generate all combinations of hyperparameters
hyperparameters_combinations = list(itertools.product(*HYPERPARAMETERS_GRID.values()))

# Run experiment for each combination of hyperparameters
for hyperparameters in hyperparameters_combinations:
    # Create a dictionary of the current hyperparameters
    hyperparameters_dict = dict(zip(HYPERPARAMETERS_GRID.keys(), hyperparameters))

    # Update the global variables with the current hyperparameters
    LEARNING_RATE = hyperparameters_dict["learning_rate"]
    BATCH_SIZE = hyperparameters_dict["batch_size"]
    NUM_EPOCHS = hyperparameters_dict["num_epochs"]
    WARMUP_STEPS = hyperparameters_dict["warmup_steps"]
    WEIGHT_DECAY = hyperparameters_dict["weight_decay"]
    DROP_OUT_RATE = hyperparameters_dict["drop_out_rate"]
    (...)
As an alternative to grid search, random search randomly selects combinations of hyperparameters to evaluate. This approach can be more efficient than grid search, especially when some hyperparameters are more influential than others. Random search allows for a broader search of the hyperparameter space and can often find good combinations faster than grid search.
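A minimal random-search sketch over the same grid; the trial budget is an arbitrary choice, and the training and evaluation code would run inside the loop as in the grid-search example:

import random

NUM_TRIALS = 10   # arbitrary number of random configurations to try

for _ in range(NUM_TRIALS):
    # Sample one value per hyperparameter instead of enumerating all combinations
    sampled = {name: random.choice(values) for name, values in HYPERPARAMETERS_GRID.items()}
    LEARNING_RATE = sampled["learning_rate"]
    BATCH_SIZE = sampled["batch_size"]
    NUM_EPOCHS = sampled["num_epochs"]
    WARMUP_STEPS = sampled["warmup_steps"]
    WEIGHT_DECAY = sampled["weight_decay"]
    DROP_OUT_RATE = sampled["drop_out_rate"]
    (...)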
On the other hand, Bayesian optimization is a more sophisticated approach that models performance as a function of the hyperparameters and uses this model to select the most promising combinations to evaluate next. This approach is useful when dealing with high-dimensional hyperparameter spaces, as it can navigate the search space more efficiently based on the performance of previous evaluations.
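As one possible illustration, a Bayesian-style search can be sketched with the Optuna library; Optuna is not part of the seminar materials, and train_and_evaluate is a placeholder that should train the model with the proposed values and return a validation metric:

import optuna

def objective(trial):
    # Let the optimizer propose hyperparameters informed by previous trials
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 32])
    num_epochs = trial.suggest_int("num_epochs", 2, 4)
    weight_decay = trial.suggest_float("weight_decay", 0.0, 0.01)
    # Placeholder: train and evaluate, returning the metric to maximize (e.g., validation F1)
    return train_and_evaluate(learning_rate, batch_size, num_epochs, weight_decay)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)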