
Testing

First, I mounted my model back onto my Colab runtime, since Colab wipes everything out when a session ends.

Screenshot 2025-08-10 at 4 09 12 PM
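
For reference, remounting boils down to something like the sketch below, assuming the fine-tuned model was saved to Google Drive; the save path and the model class are placeholders, not necessarily the exact ones used.

```python
# Minimal sketch of re-attaching Drive and reloading a saved model in Colab.
# The save path below is a placeholder, not the exact location used.
from google.colab import drive
from transformers import AutoModelForCausalLM, AutoTokenizer

drive.mount("/content/drive")  # re-attach Drive after the runtime reset

MODEL_DIR = "/content/drive/MyDrive/finetuned-model"  # hypothetical path
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR)
```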

After that, I developed a script to test the model across 10 examples. It performed horribly, with an average similarity score of 20%.

Screenshot 2025-08-10 at 4 10 14 PM
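
A similarity test like this boils down to generate-then-compare. Here is a minimal sketch, reusing the model and tokenizer loaded above; difflib's SequenceMatcher stands in for whatever metric the real script used, and the example pair is hypothetical.

```python
# Sketch of a generate-and-compare test loop. The example pair and the
# SequenceMatcher metric are stand-ins, not the ones from the real script.
from difflib import SequenceMatcher

examples = [
    # hypothetical (prompt, expected_output) pairs; the real test used 10
    ("Write a Python function that reverses a string.",
     "def reverse_string(s):\n    return s[::-1]"),
]

scores = []
for prompt, expected in examples:
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=128)
    generated = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    scores.append(SequenceMatcher(None, generated, expected).ratio())

print(f"Average similarity: {sum(scores) / len(scores):.0%}")
```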

I knew that the model itself wasn't the problem, thanks to previous comprehensive debugging, so I examined my training data and found an issue.

Screenshot 2025-08-10 at 4 11 45 PM

The training dataset had chunks of meaningless text, and a few examples repeated "<|endoftext|>" more than a hundred times. Since the meaningless text was too arbitrary to clean up reliably, I retrained the model on a smaller, better-organized dataset (python_code_instructions_18k_alpaca).
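
A quick scan is enough to surface rows like that. Here is a rough sketch, assuming the original data loads through the datasets library and exposes a "text" column; both of those, and the dataset ID, are placeholders.

```python
# Sketch: flag degenerate rows, e.g. ones that repeat the end-of-text marker
# excessively. The dataset ID and the "text" column are placeholders.
from datasets import load_dataset

raw = load_dataset("original-dataset-id", split="train")  # hypothetical ID

bad_rows = [i for i, row in enumerate(raw)
            if row["text"].count("<|endoftext|>") > 100]
print(f"{len(bad_rows)} rows repeat <|endoftext|> more than 100 times")
```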

Screenshot 2025-08-10 at 4 14 29 PM
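
Loading the replacement dataset from the Hub is a one-liner. In the sketch below, the repo ID (iamtarun/python_code_instructions_18k_alpaca) is my best guess at the exact upload, and the base tokenizer and prompt formatting are assumptions.

```python
# Sketch: load and tokenize python_code_instructions_18k_alpaca.
# The repo ID, base tokenizer, and prompt formatting are assumptions.
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("iamtarun/python_code_instructions_18k_alpaca", split="train")

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed base model
tokenizer.pad_token = tokenizer.eos_token

def tokenize(batch):
    # Alpaca-style rows: fold instruction and output into one training string.
    texts = [f"{ins}\n{out}"
             for ins, out in zip(batch["instruction"], batch["output"])]
    return tokenizer(texts, truncation=True, padding="max_length", max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
```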

The training went well and only took 17 minutes, since the dataset has only 18k examples. But this also means the model wouldn't perform as well.
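
The exact hyperparameters aren't shown here, so the sketch below is only in the right ballpark: a standard Trainer run over the tokenized dataset from the previous sketch, with report_to="wandb" handling the Weights & Biases logging. Every value is an assumption.

```python
# Sketch: Trainer setup for the retrain. Every hyperparameter here is an
# assumption, not the value actually used.
from transformers import (AutoModelForCausalLM, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model = AutoModelForCausalLM.from_pretrained("gpt2")  # assumed base model
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="finetuned-model",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    logging_steps=50,
    report_to="wandb",  # stream loss curves to Weights & Biases
)

trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized, data_collator=collator)
trainer.train()
```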

After that, I tested the model on 28 comprehensive examples, and it performed significantly better, although its similarity scores were still low.

Screenshot 2025-08-10 at 4 16 02 PM

In the end, this was a training data issue. I tried looking for larger datasets with better structure, but I couldn't find any that were open source. So this will be the end of this project, and I won't be deploying my model to Hugging Face.

Lessons learned:

  • How to tokenize datasets and load them from Hugging Face
  • How to train a model (training args)
  • How to test a model
  • How to utilize wandb

Now I'll move on to my next project: building a technical application for comprehensive alignment testing.