LLM Fine-tuning: Day 3
Data Cleaning and Tokenization
Today, I wrote scripts to test the data quality, clean the data, tokenize it for GPT-2, and finally test the tokenized data.
Step 1: Testing Data Quality
I start by checking each example's basic structure to eliminate obviously low-quality data (e.g., empty examples).
After that, I check the quality of the code by verifying whether it is syntactically valid, has functions, has classes, has imports, and whether it is too long or too short. I define too short as a code length under 2 and too long as over 50.
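Roughly, the per-example code checks look something like this (a minimal sketch: the function name is mine, it assumes the code samples are Python, and it treats "code length" as a line count):

```python
import ast

# Thresholds from the description above; measuring "code length" in lines is my assumption.
MIN_LINES, MAX_LINES = 2, 50

def check_code_quality(code):
    """Run the structural checks described above on one code example."""
    result = {"valid_syntax": False, "has_functions": False,
              "has_classes": False, "has_imports": False, "good_length": False}
    try:
        tree = ast.parse(code)
        result["valid_syntax"] = True
    except SyntaxError:
        return result  # no point inspecting code that doesn't parse

    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            result["has_functions"] = True
        elif isinstance(node, ast.ClassDef):
            result["has_classes"] = True
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            result["has_imports"] = True

    n_lines = len(code.splitlines())
    result["good_length"] = MIN_LINES <= n_lines <= MAX_LINES
    return result
```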
Then, I check the quality of each code snippet's explanation by scanning it for common explanatory terms.
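The explanation check is essentially a keyword scan along these lines (the term list and threshold below are just an illustration):

```python
# Illustrative list of terms an explanation of code is likely to contain.
EXPLANATION_TERMS = {"function", "returns", "loops", "prints", "creates",
                     "takes", "calculates", "checks"}

def check_explanation_quality(explanation, min_hits=1):
    """Flag an explanation as reasonable if it is non-empty and mentions common code concepts."""
    if not explanation.strip():
        return False
    words = set(explanation.lower().split())
    return len(words & EXPLANATION_TERMS) >= min_hits
```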
Finally, I generate a quality report for each split so that it's easy to access in my repo.
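The report itself is just a per-split summary written to disk, something like this (the file names and fields are placeholders):

```python
import json
from pathlib import Path

def write_quality_report(split_name, results, out_dir="reports"):
    """Summarize per-example check results into one JSON report per split."""
    summary = {
        "split": split_name,
        "total_examples": len(results),
        "valid_syntax": sum(r["valid_syntax"] for r in results),
        "good_length": sum(r["good_length"] for r in results),
    }
    Path(out_dir).mkdir(exist_ok=True)
    with open(Path(out_dir) / f"{split_name}_quality_report.json", "w") as f:
        json.dump(summary, f, indent=2)
```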
Step 2: Cleaning Data
After testing the data quality, I clean the data by first removing invalid examples and then cleaning and formatting the code and explanations.
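A sketch of the cleaning pass (the `code` and `explanation` field names are placeholders for whatever the dataset schema actually uses):

```python
import textwrap

def clean_example(example):
    """Drop invalid examples and normalize the code/explanation text."""
    code = example.get("code", "")
    explanation = example.get("explanation", "")
    if not code.strip() or not explanation.strip():
        return None  # remove empty / invalid examples

    code = textwrap.dedent(code).strip()          # normalize indentation and edges
    code = code.replace("\t", "    ")             # tabs -> spaces for consistency
    explanation = " ".join(explanation.split())   # collapse stray whitespace

    return {"code": code, "explanation": explanation}
```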
Step 3: Tokenizing Data
With the data clean, I prepare it for the GPT-2 model: I tokenize every dataset and convert it into a format compatible with Hugging Face and GPT-2.
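In outline, the tokenization step looks like this (a sketch using Hugging Face `transformers` and `datasets`; the paths, column names, prompt format, and `max_length` are placeholders):

```python
from transformers import GPT2TokenizerFast
from datasets import load_from_disk

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def tokenize_batch(batch):
    # Join each explanation and its code into one training string.
    texts = [f"{exp}\n{code}{tokenizer.eos_token}"
             for exp, code in zip(batch["explanation"], batch["code"])]
    return tokenizer(texts, truncation=True, max_length=512)

dataset = load_from_disk("data/clean/train")        # hypothetical path
tokenized = dataset.map(tokenize_batch, batched=True,
                        remove_columns=dataset.column_names)
tokenized.save_to_disk("data/tokenized/train")      # hypothetical path
```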
I also compute statistics on the token lengths to understand the size of the data I'm dealing with.
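For example, on one tokenized split (continuing from the sketch above; the exact statistics I report may differ):

```python
import numpy as np

# Token length distribution for one tokenized split.
lengths = np.array([len(ids) for ids in tokenized["input_ids"]])
print(f"examples:      {len(lengths)}")
print(f"mean length:   {lengths.mean():.1f} tokens")
print(f"median length: {np.median(lengths):.1f} tokens")
print(f"95th pct:      {np.percentile(lengths, 95):.0f} tokens")
print(f"max length:    {lengths.max()} tokens")
```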
Repeat
These three steps are repeated for each data split (train, validation, and test).
Example: step 1 for the test split:
Example: step 2 for the test split:
Example: step 3 for the test split:
Testing Tokenized Data Quality
After tokenizing all the data, I created another test to verify the quality of the tokenized data.
I also check for data consistency across the different data splits.
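A minimal sketch of those checks (the specific checks and the `max_length` threshold are illustrative; I run the same function on each split and compare the results for consistency):

```python
def check_tokenized_split(ds, tokenizer, max_length=512):
    """Basic sanity checks on one tokenized split."""
    issues = {"empty": 0, "too_long": 0, "out_of_vocab": 0}
    vocab_size = tokenizer.vocab_size
    for ids in ds["input_ids"]:
        if len(ids) == 0:
            issues["empty"] += 1
        if len(ids) > max_length:
            issues["too_long"] += 1
        if any(i >= vocab_size for i in ids):
            issues["out_of_vocab"] += 1
    return issues
```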
Additionally, I created a visual representation of the data quality across all splits.
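A simple bar chart of valid examples per split does the job, using the counts listed below (the plotting details here are just one way to do it):

```python
import matplotlib.pyplot as plt

# Valid example counts per split, from the report below.
splits = ["train", "validation", "test"]
valid_counts = [164275, 20713, 20413]

plt.bar(splits, valid_counts)
plt.ylabel("valid examples")
plt.title("Valid examples per split")
plt.savefig("data_quality.png", dpi=150)
```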
Fortunately, most of the data was valid, meaning I have a good chunk of data to train GPT-2 on:
- train: 164275 examples
- validation: 20713 examples
- test: 20413 examples
Notes
I removed all the raw data that had been pushed to my GitHub repo, both to keep the repo lightweight and to respect data licenses. Anyone can still get the raw data by simply running my load-data.py script.