
Full Training on Colabs and Testing Prep

Colabs

Today, I managed to get a free Colab Pro subscription through their student verification process, which gives me access to better GPUs.

Instead of using the T4 (which I constantly had issues with), I decided to use an A100 to reduce my training duration significantly.

Screenshot 2025-08-09 at 10 55 48 PM

I revised my training script to optimize it for the A100 GPU and utilize most of the 40 GB of GPU memory provided. So, I set the script to use 0.95 (95%) of the available memory.
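Capping the memory fraction can be done in PyTorch with a single call. This is a minimal sketch of that step, assuming PyTorch's per-process memory cap is what the script uses (the guard lets it run on machines without a GPU):

```python
import torch

# Fraction of GPU memory this process is allowed to allocate;
# 0.95 matches the fraction described above.
GPU_MEMORY_FRACTION = 0.95

if torch.cuda.is_available():
    # Limit allocations on GPU 0 to 95% of its total memory.
    torch.cuda.set_per_process_memory_fraction(GPU_MEMORY_FRACTION, device=0)
    total = torch.cuda.get_device_properties(0).total_memory
    print(f"Usable memory: {total * GPU_MEMORY_FRACTION / 1e9:.1f} GB")
```

On a 40 GB A100 this leaves roughly 38 GB usable, with the remaining 5% as headroom for CUDA context and driver overhead.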

Screenshot 2025-08-09 at 10 59 29 PM

Setting Up for Training

Then, I set up the GPT-2 124M-parameter model, its tokenizer, and my special tokens to prepare for training.

Screenshot 2025-08-09 at 11 00 39 PM

As usual, I initialized Weights & Biases to monitor the training loss, validation loss, and learning rate.

Screenshot 2025-08-09 at 11 02 03 PM

After that, I loaded my full tokenized datasets and printed out how many examples each split had for tracking purposes.

Screenshot 2025-08-09 at 11 16 08 PM

With all of that set, I wrote my training arguments to fully utilize the A100 GPU’s capabilities.

  • Increased the batch size to 32 and reduced the gradient accumulation steps to 2
  • Increased the number of epochs to 5 for better-quality results and the dataloader workers to 16
  • Increased the learning rate to 5e-4 and set weight decay to prevent overfitting
  • Set up logging for training/validation loss

Screenshot 2025-08-09 at 11 20 02 PM

Training Results

I successfully ran the script, and training took only around 2.5 hours, a significant reduction from the ~5 hours it took on the T4 GPU.

Screenshot 2025-08-09 at 11 20 49 PM

Looking at my wandb panels, it's clear that the training process went well without any issues: both the training and validation losses decreased steadily as training progressed, meaning the model was becoming more accurate. The decay in the learning rate also shows the scheduler guiding convergence toward an optimized solution.

Training Loss Graph Validation Loss Graph Learning Rate Graph

Moving On

I will work on creating simple tests to evaluate the quality of the model's output, and eventually test it on all the code samples in my testing dataset.
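A minimal version of such a test could be an exact-match accuracy check; the function name and the toy examples here are hypothetical, not my final evaluation code:

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that match their reference exactly
    (ignoring leading/trailing whitespace)."""
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)

# Toy example: one of two predictions matches its reference.
preds = ["print('hi')", "x = 1"]
refs = ["print('hi')", "x = 2"]
print(exact_match_accuracy(preds, refs))  # → 0.5
```

Exact match is a strict metric for generated code, so a fuzzier comparison (normalized whitespace, or executing the code and comparing outputs) may be a better fit for the 80% threshold below.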

If the model performs with at least 80% accuracy, I will deploy it on Hugging Face and use API calls to build a simple web application around it.