LLM Fine-tuning: Day 2
Setup, Scraping, and Processing Data
To start this project, I began by setting up my code environment and installing the necessary libraries.
pip3 install transformers datasets torch accelerate wandb
pip3 install pandas numpy matplotlib seaborn
pip3 install requests  # zipfile and gzip ship with the Python standard library, no install needed
Then, I created a project structure to organize the model, data, scripts, and outputs:
code-explanation-model/
├── data/
├── models/
├── scripts/
├── notebooks/
└── outputs/
Now that I had the basics set up, I wrote a script to pull Python code data from CodeXGLUE on Hugging Face. Since it would be nearly impossible for me to manually create enough data to train and fine-tune my model, I decided to use a public dataset. I simply use the load_dataset method to pull the CodeXGLUE data from Hugging Face.
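Here's a minimal sketch of that pull, assuming the CodeXGLUE code-to-text (Python) configuration on Hugging Face; swap in whichever CodeXGLUE subset you actually use.

from datasets import load_dataset

# Pull the CodeXGLUE code-to-text dataset (Python subset) from Hugging Face.
# The dataset name and config here are my assumption of the exact subset.
dataset = load_dataset("code_x_glue_ct_code_to_text", "python")
print(dataset)  # shows the available splits and how many rows each contains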
Then, I created a function to explore the format of the dataset and understand the data I'm dealing with.
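A rough sketch of that exploration helper; the field names "code" and "docstring" come from the CodeXGLUE code-to-text schema, so treat them as an assumption if you pull a different subset.

def explore_dataset(dataset, n=3):
    # Print the size of each split so I know how much data I have.
    for split_name, split in dataset.items():
        print(f"{split_name}: {len(split)} examples")
    # Print a few raw examples to see what the fields actually look like.
    for example in dataset["train"].select(range(n)):
        print("CODE:", example["code"][:200])
        print("DOCSTRING:", example["docstring"][:200])
        print("-" * 40)

explore_dataset(dataset)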
After that, I formatted the data for GPT-2 training and saved it to a JSON file.
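The formatting step looks roughly like this; the prompt template and the output file name are my own illustrative choices, not anything GPT-2 requires.

import json

def format_example(example):
    # GPT-2 is a plain language model, so each code/explanation pair is
    # collapsed into a single text string it can be trained on.
    return {"text": f"# Code:\n{example['code']}\n# Explanation:\n{example['docstring']}"}

formatted = [format_example(ex) for ex in dataset["train"]]
with open("data/formatted_examples.json", "w") as f:
    json.dump(formatted, f)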
Finally, I created a three-way split of the data (train, validation, test) so that I can train the model on the majority of the data, validate it on held-out examples during training, and test it on unseen data for a sense of real-world performance. It's a standard way to catch overfitting.
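A minimal sketch of the split, continuing from the formatted list in the previous snippet; the 80/10/10 ratio is inferred from the counts below.

import json
import random

random.seed(42)            # make the split reproducible
random.shuffle(formatted)  # `formatted` is the list built in the previous snippet

n = len(formatted)
train_end = int(n * 0.8)
val_end = int(n * 0.9)

splits = {
    "train": formatted[:train_end],
    "validation": formatted[train_end:val_end],
    "test": formatted[val_end:],
}

for name, examples in splits.items():
    with open(f"data/{name}.json", "w") as f:
        json.dump(examples, f)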
By running the code, I was able to format and save 251,820 examples:
- 201,456 examples for training
- 25,182 examples for validation
- 25,182 examples for testing
Note: I had to install Git Large File Storage (Git LFS) since the data files were too large to commit directly to my Git repository. It's also common practice to use Git LFS for ML projects.
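For reference, the Git LFS setup is just a few commands; tracking data/*.json here is my assumption of where the large files live.

git lfs install
git lfs track "data/*.json"
git add .gitattributes
git commit -m "Track data files with Git LFS"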
Now that I'm done with data processing, I will move on to:
- Verifying data quality: Check the generated files to ensure good code-explanation pairs
- Data preprocessing: Clean any malformed examples
- Tokenization: Prepare the data for GPT-2 training