
Full Training and Notebooks (Jupyter & Colabs)

Full Training Script Development

Today, I worked on developing a script to train GPT-2 on all of the tokenized data, and did some debugging to expedite the process.

The code itself is largely similar to test-training.py; I just tweaked some details and removed the load_small_subset function. I first set up wandb to track the model’s training progress, set up the model, and loaded the full dataset.

[Screenshot 2025-07-17 at 7:17:48 PM]
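
In outline, that setup step looks something like the sketch below; the wandb project name and the dataset path are placeholders rather than the exact names in my script.

```python
# Rough sketch of the setup step, not the exact script.
import wandb
from datasets import load_from_disk
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Track the run in Weights & Biases
wandb.init(project="gpt2-full-training")  # placeholder project name

# Start from the pretrained GPT-2 checkpoint
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Load the full tokenized dataset saved earlier with save_to_disk
dataset = load_from_disk("tokenized_data")  # placeholder path
```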

Then, I set the training arguments so that the learning rate is 5e-5 (the lowest of the rates I’m trying) and all of the other parameters are kept low so that the training process doesn’t overload my local CPU.

[Screenshot 2025-07-17 at 7:18:32 PM]
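
The arguments boil down to something like this; 5e-5 is the learning rate I actually used, while the other numbers are illustrative stand-ins for the low values in my script.

```python
from transformers import TrainingArguments

# Sketch of the training arguments: a low learning rate and a small
# per-step workload so the run does not overwhelm a local CPU.
training_args = TrainingArguments(
    output_dir="gpt2-full",            # placeholder output path
    learning_rate=5e-5,                # the lowest of the rates I'm trying
    per_device_train_batch_size=2,     # illustrative small value for CPU training
    gradient_accumulation_steps=1,
    num_train_epochs=1,
    logging_steps=50,
    save_steps=500,
    report_to="wandb",                 # send metrics to wandb
)
```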

I kept the model temperature the same (0.7) as in the testing trial, since I’m still dealing with significantly less data than industry models.

[Screenshot 2025-07-17 at 7:20:32 PM]
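
Continuing from the setup sketch above, the sampling check is roughly the following; the prompt and token count here are just examples.

```python
import torch

# `model` and `tokenizer` come from the setup sketch above.
prompt = "Once upon a time"            # example prompt
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=50,             # example length
        do_sample=True,                # sampling so temperature has an effect
        temperature=0.7,               # same temperature as the testing trial
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```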

I ran the code and everything seemed to work, but it was going to take 200+ hours. I did some testing in JupyterLab to see if I could optimize the training process, but that only brought the time down to 170 hours. So instead of using my computer’s local CPU, I decided to migrate to Google Colab, which provides a free T4 GPU.
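
For reference, the launch itself is the standard Hugging Face Trainer loop, roughly as sketched below; the data collator is my guess at how padding and labels are handled, so the exact wiring in my script may differ. The same code runs unchanged on Colab.

```python
from transformers import DataCollatorForLanguageModeling, Trainer

# `model`, `tokenizer`, `training_args`, and `dataset` come from the sketches above.
# The collator pads batches and builds the causal language-modeling labels.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,     # or dataset["train"], depending on how it was saved
    data_collator=data_collator,
)

trainer.train()                # the step that produced the 200+ hour estimate
```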

Colabs

I first mounted my Google Drive in the Colab notebook, cloned my repository to gain access to my scripts, and installed all the required libraries.

[Screenshot 2025-07-17 at 7:24:18 PM]
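
Those setup cells look roughly like this; the repository URL and paths are placeholders.

```python
# Colab setup cells (sketch): URL and paths are placeholders.
from google.colab import drive

# Mount Google Drive so outputs and checkpoints can be saved there
drive.mount("/content/drive")

# Clone the repository to get access to the training scripts
!git clone https://github.com/<username>/<repo>.git
%cd <repo>

# Install the required libraries
!pip install -q transformers datasets wandb
```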

After that, I checked the status of the GPU to make sure it was working properly.

[Screenshot 2025-07-17 at 7:25:10 PM]
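
A quick check like the one below is enough to confirm the T4 is visible to the runtime.

```python
import torch

# Confirm the runtime actually sees the GPU
print(torch.cuda.is_available())       # should print True on a T4 runtime
print(torch.cuda.get_device_name(0))   # e.g. "Tesla T4"

# Shell view of utilization and memory usage
!nvidia-smi
```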

Then, I ran my full training code and got an estimate of 10 hours, which was a significant improvement. But 10 hours was still insanely long, so I checked the GPU memory usage and it was only showing 4.9/15.0 GB in use. I did some more optimization to fully harness the GPU: I increased the batch size so that each training step processes more data, which brought the time down to 7 hours.

[Screenshot 2025-07-17 at 7:27:46 PM]
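
The change itself is just the batch size in the training arguments; the value of 16 below is illustrative, chosen to push memory usage well past the 4.9 GB I was seeing.

```python
from transformers import TrainingArguments

# Same arguments as before, but with a larger batch so each step does more
# work and the T4's 15 GB of memory is better used.
training_args = TrainingArguments(
    output_dir="gpt2-full",
    learning_rate=5e-5,
    per_device_train_batch_size=16,    # illustrative; up from the small CPU-friendly value
    gradient_accumulation_steps=1,
    num_train_epochs=1,
    logging_steps=50,
    save_steps=500,
    report_to="wandb",
)
```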

I also tried learning rates of 1e-4 and 3e-4 for fine-tuning, but neither made much of a difference.
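
The comparison only changes one argument between runs; a hypothetical sketch:

```python
from transformers import TrainingArguments

# Hypothetical sweep: only learning_rate changes between runs.
for lr in (5e-5, 1e-4, 3e-4):
    args = TrainingArguments(
        output_dir=f"gpt2-full-lr-{lr}",   # placeholder naming scheme
        learning_rate=lr,
        per_device_train_batch_size=16,
        num_train_epochs=1,
        report_to="wandb",
    )
    # ...rebuild the Trainer with `args` and compare the loss curves in wandb
```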

Moving on, I will work on further optimization so that I can train my model more efficiently.