
Setup, Scraping, and Processing Data

To start this project, I set up my code environment by installing the necessary libraries:

pip3 install transformers datasets torch accelerate wandb
pip3 install pandas numpy matplotlib seaborn
pip3 install requests  # zipfile and gzip ship with Python's standard library, so they don't need a pip install

Then, I created a project structure to organize the model, data, outputs, and other files:

code-explanation-model/
├── data/
├── models/
├── scripts/
├── notebooks/
└── outputs/

Now that the basics were set up, I wrote a script to pull Python code data from CodeXGLUE on Hugging Face. Since it would be nearly impossible for me to manually create enough data to train and fine-tune my model, I decided to use a public dataset. I simply used the load_dataset function to pull the CodeXGLUE data from Hugging Face.
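A minimal sketch of that call, assuming the Python subset of the code_x_glue_ct_code_to_text dataset (the CodeXGLUE code-to-text task on the Hub):

from datasets import load_dataset

# Pull the Python subset of the CodeXGLUE code-to-text data from Hugging Face.
dataset = load_dataset("code_x_glue_ct_code_to_text", "python")
print(dataset)  # shows the available splits and the number of rows in each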


Then, I created a function to explore the format of the dataset and understand the data I'm dealing with.
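A rough sketch of such a helper, assuming the code and docstring fields that the code-to-text config exposes:

def explore_dataset(dataset, split="train", n=3):
    """Print the schema and a few raw examples from the given split."""
    print(dataset[split].features)  # column names and their types
    for example in dataset[split].select(range(n)):
        print("CODE:\n", example["code"][:300])
        print("DOCSTRING:\n", example["docstring"][:300])
        print("-" * 60)

explore_dataset(dataset)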


After that, I formatted the data for GPT-2 training and saved it to a JSON file.
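The exact prompt template isn't reproduced here; below is a hedged sketch that assumes a simple code-then-explanation template ending in GPT-2's end-of-text token, writing into the data/ folder from the project structure above:

import json

def format_for_gpt2(example):
    # Assumed template: the code, then its explanation, terminated with
    # GPT-2's <|endoftext|> token so examples can be concatenated later.
    return {"text": f"Code:\n{example['code']}\nExplanation:\n{example['docstring']}\n<|endoftext|>"}

formatted = [format_for_gpt2(ex) for ex in dataset["train"]]

with open("data/formatted_examples.json", "w") as f:
    json.dump(formatted, f)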


Finally, I created a three-way split of the data (train, validation, test) so that I can train the model on the majority of the data, validate it on held-out examples, and test it on unseen data for a realistic picture of real-world performance. It's a standard way to catch and prevent overfitting.
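The counts reported below correspond to an 80/10/10 split; here is a sketch of how that split could be produced (the shuffle seed and file names are my own assumptions):

import json
import random

# Continue from the formatted examples saved in the previous step.
with open("data/formatted_examples.json") as f:
    formatted = json.load(f)

random.seed(42)
random.shuffle(formatted)

n_train = int(0.8 * len(formatted))  # 80% train
n_val = int(0.1 * len(formatted))    # 10% validation, remainder is test

splits = {
    "train": formatted[:n_train],
    "validation": formatted[n_train:n_train + n_val],
    "test": formatted[n_train + n_val:],
}

for name, rows in splits.items():
    with open(f"data/{name}.json", "w") as f:
        json.dump(rows, f)
    print(f"{name}: {len(rows)} examples")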


By running the code, I was able to format and save 251820 examples in total:

  • 201456 examples for training
  • 25182 examples for validation
  • 25182 examples for testing


Note: I had to install Git Large File Storage (Git LFS) since the data files were too large to commit directly to my Git repository. It's also common practice to use Git LFS for ML projects.

Now that I'm done with data processing, I will move on to:

  • Verifying data quality: Check the generated files to ensure good code-explanation pairs
  • Data preprocessing: Clean any malformed examples
  • Tokenization: Prepare the data for GPT-2 training