LLM Fine-tuning: Day 8 (END)
Testing
First, I mounted my Drive and loaded my model back onto the Colab runtime, since Colab wipes everything out when a session ends.
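Roughly, that reload step looked like the sketch below, assuming the checkpoint was saved to Google Drive and loaded with the standard Transformers API; the path is a placeholder, not my actual directory.

```python
# Sketch: remount Drive and reload the fine-tuned checkpoint in a fresh Colab session.
from google.colab import drive
from transformers import AutoModelForCausalLM, AutoTokenizer

drive.mount("/content/drive")

checkpoint_dir = "/content/drive/MyDrive/llm-finetune/checkpoint"  # hypothetical path
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModelForCausalLM.from_pretrained(checkpoint_dir)
model.eval()  # inference mode for testing
```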
After that, I wrote a script to test the model across 10 examples. It performed horribly, with an average similarity score of 20%.
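The test script was essentially a loop like the one below. The similarity metric shown (difflib's SequenceMatcher ratio) is a stand-in, and the examples list is hypothetical; this is a sketch of the approach, not the exact script.

```python
# Sketch: score the model's generations against reference outputs.
from difflib import SequenceMatcher

def similarity(expected: str, generated: str) -> float:
    # Ratio in [0, 1]; 1.0 means an exact character-level match.
    return SequenceMatcher(None, expected, generated).ratio()

def evaluate(model, tokenizer, examples, max_new_tokens=128):
    scores = []
    for prompt, expected in examples:  # examples: list of (prompt, expected_output) pairs
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        generated = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        scores.append(similarity(expected, generated))
    return sum(scores) / len(scores)

# avg_score = evaluate(model, tokenizer, test_examples)  # came out around 0.20 on the first run
```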
I knew the model itself wasn't the problem thanks to earlier comprehensive debugging, so I examined my training data and found an issue.
The training dataset had chunks of meaningless text, and a few examples repeated "<|endoftext|>" more than a hundred times. Since the meaningless text was too arbitrary to clean up reliably, I retrained the model on a smaller, better-organized dataset (python_code_instructions_18k_alpaca).
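The swap looked roughly like this. The dataset repo id, tokenizer, and Alpaca-style prompt template are assumptions on my part, just one reasonable way to format the columns, not the exact code:

```python
# Sketch: load the replacement dataset, sanity-check it, and tokenize it.
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("iamtarun/python_code_instructions_18k_alpaca", split="train")

# Sanity check: unlike the old data, no example should spam the special token.
bad = dataset.filter(lambda ex: ex["output"].count("<|endoftext|>") > 5)
print(f"examples with repeated <|endoftext|>: {len(bad)}")

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed base model
tokenizer.pad_token = tokenizer.eos_token

def format_and_tokenize(example):
    # Alpaca-style prompt: instruction (plus optional input) followed by the response.
    prompt = (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Input:\n{example['input']}\n\n"
        f"### Response:\n{example['output']}"
    )
    return tokenizer(prompt, truncation=True, max_length=512)

tokenized = dataset.map(format_and_tokenize, remove_columns=dataset.column_names)
```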
The training went well and took only 17 minutes, since there were only 18k examples. But this also means the model wouldn't perform as well.
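The retraining setup was along these lines, assuming the Hugging Face Trainer with wandb logging; the hyperparameters are placeholders rather than the exact values I used:

```python
# Sketch: retraining with the Hugging Face Trainer and wandb logging.
from transformers import (
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model = AutoModelForCausalLM.from_pretrained("gpt2")  # assumed base model

training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/llm-finetune/checkpoint",  # persists past the session
    per_device_train_batch_size=8,
    num_train_epochs=1,
    learning_rate=5e-5,
    logging_steps=50,
    save_strategy="epoch",
    report_to="wandb",  # stream loss curves to Weights & Biases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,  # tokenized dataset from the previous sketch
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

trainer.train()
```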
After that, I tested the model on 28 comprehensive examples, and it performed significantly better, although its similarity scores were still low.
In the end, this was a training data issue. I looked for larger, better-structured datasets but couldn't find any that were open source. So this is the end of this project, and I won't be deploying my model to Hugging Face.
Lessons learned:
- How to tokenize datasets and load them from Hugging Face
- How to train a model (training args)
- How to test a model
- How to utilize wandb
Now I’ll move on to my next project: building a technical application for comprehensive alignment testing.