
Data Cleaning and Tokenization

Today, I worked on coding scripts to test the data quality, clean the data, tokenize the data for GPT-2, and finally test the tokenized data.

Step 1: Testing Data Quality

I start by checking the data’s basic structure in order to eliminate obviously low-quality examples (e.g. empty entries).

Screenshot 2025-07-10 at 3 44 51 PM
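
In essence, the check boils down to something like the sketch below (the "code" and "explanation" field names are placeholders for my actual schema):

```python
# Minimal sketch of the basic structure check; the "code" and "explanation"
# field names are placeholders for the actual schema.
def has_basic_structure(example):
    """Reject examples with missing or effectively empty fields."""
    code = example.get("code", "")
    explanation = example.get("explanation", "")
    if not isinstance(code, str) or not isinstance(explanation, str):
        return False
    # Drop whitespace-only entries
    return bool(code.strip()) and bool(explanation.strip())
```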

After that, I check the quality of the code itself by verifying whether it is syntactically valid, has functions, has classes, has imports, and whether it is too long or too short (I define too short as a length < 2 and too long as > 50).

Screenshot 2025-07-10 at 3 45 16 PM
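
A rough sketch of that check, assuming the snippets are Python and that "length" is measured in lines:

```python
import ast

# Sketch of the code-quality check, assuming Python snippets and
# length measured in lines (these thresholds mirror the ones above).
MIN_LINES, MAX_LINES = 2, 50

def check_code_quality(code):
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return {"valid": False}
    nodes = list(ast.walk(tree))
    num_lines = len(code.splitlines())
    return {
        "valid": True,
        "has_functions": any(isinstance(n, ast.FunctionDef) for n in nodes),
        "has_classes": any(isinstance(n, ast.ClassDef) for n in nodes),
        "has_imports": any(isinstance(n, (ast.Import, ast.ImportFrom)) for n in nodes),
        "too_short": num_lines < MIN_LINES,
        "too_long": num_lines > MAX_LINES,
    }
```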

Then, I check the quality of each example’s explanation by parsing it for common explanatory terms.

Screenshot 2025-07-10 at 3 47 17 PM
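
Roughly, the idea is to scan each explanation for a set of common explanatory terms (the term list below is only illustrative, not my exact list):

```python
# Sketch of the explanation check: scan for common explanatory terms.
# The term list here is illustrative only.
EXPLANATION_TERMS = {
    "function", "class", "returns", "loop", "variable",
    "prints", "creates", "calculates",
}

def check_explanation_quality(explanation):
    words = set(explanation.lower().split())
    matched = words & EXPLANATION_TERMS
    return {"has_terms": bool(matched), "num_terms": len(matched)}
```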

Finally, I generate a report for each split so that it’s easy to access on my repo.

Screenshot 2025-07-10 at 3 48 29 PM
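
A minimal sketch of the reporting step, assuming one JSON report per split (the output path and fields here are illustrative):

```python
import json
import os

# Minimal sketch of per-split reporting; path and fields are illustrative.
def write_report(split_name, results, out_dir="reports"):
    os.makedirs(out_dir, exist_ok=True)
    report = {
        "split": split_name,
        "total": len(results),
        "valid": sum(1 for r in results if r.get("valid")),
    }
    with open(os.path.join(out_dir, f"{split_name}_quality_report.json"), "w") as f:
        json.dump(report, f, indent=2)
```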

Step 2: Cleaning Data

After testing the data quality, I clean the data by first removing invalid examples and then cleaning/formatting the code and explanations.

Screenshot 2025-07-10 at 3 52 11 PM Screenshot 2025-07-10 at 3 52 32 PM
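
A simplified sketch of that cleaning pass (field names are placeholders, and the formatting rules shown are only the basics, i.e. whitespace normalization):

```python
import ast

# Simplified sketch of the cleaning pass: drop invalid examples, then
# normalize whitespace in the code and explanation.
def is_valid(example):
    code = example.get("code", "").strip()
    explanation = example.get("explanation", "").strip()
    if not code or not explanation:
        return False
    try:
        ast.parse(code)
    except SyntaxError:
        return False
    return True

def clean_example(example):
    code = "\n".join(
        line.rstrip() for line in example["code"].replace("\r\n", "\n").splitlines()
    )
    explanation = " ".join(example["explanation"].split())
    return {"code": code.strip(), "explanation": explanation}

def clean_dataset(examples):
    return [clean_example(ex) for ex in examples if is_valid(ex)]
```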

Step 3: Tokenizing Data

With all the data clean, I prepare it for the GPT-2 model: I tokenize every dataset and make it compatible with Hugging Face and GPT-2.

Screenshot 2025-07-10 at 3 53 29 PM
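
At its core, this step uses the Hugging Face GPT-2 tokenizer; a minimal sketch (the prompt layout and max_length are assumptions, not my exact settings):

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def tokenize_example(example):
    # The prompt layout below is an assumption about how code and
    # explanation are joined into a single training string.
    text = f"# Code:\n{example['code']}\n# Explanation:\n{example['explanation']}"
    return tokenizer(text, truncation=True, max_length=512, padding="max_length")
```

With a Hugging Face datasets.Dataset, the whole split can then be tokenized in one call, e.g. dataset.map(tokenize_example).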

I also statistically analyze the token lengths to understand the size of the data I’m dealing with.

Screenshot 2025-07-10 at 3 54 27 PM
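
A small sketch of the length statistics, assuming each tokenized example exposes an "input_ids" list:

```python
import numpy as np

# Sketch of the token-length statistics over a tokenized split.
def token_length_stats(tokenized_examples):
    lengths = np.array([len(ex["input_ids"]) for ex in tokenized_examples])
    return {
        "min": int(lengths.min()),
        "max": int(lengths.max()),
        "mean": float(lengths.mean()),
        "p95": float(np.percentile(lengths, 95)),
    }
```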

Repeat

These three steps are repeated for each data split (train, test, validation).

Example, step 1 for the test split:

Screenshot 2025-07-10 at 3 57 31 PM

Example, step 2 for the test split:

Screenshot 2025-07-10 at 3 58 03 PM

Example, step 3 for the test split:

Screenshot 2025-07-10 at 3 58 15 PM
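
Conceptually, the whole pipeline is just a loop over the splits that ties the sketches above together (load_split here is a hypothetical loader standing in for my actual data loading):

```python
# Sketch of the per-split driver, reusing the sketches above.
# load_split is a hypothetical loader, not my real loading code.
for split in ("train", "validation", "test"):
    examples = load_split(split)                           # load raw examples
    results = [check_code_quality(ex["code"]) for ex in examples]
    write_report(split, results)                           # step 1: quality report
    cleaned = clean_dataset(examples)                      # step 2: clean
    tokenized = [tokenize_example(ex) for ex in cleaned]   # step 3: tokenize
```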

Testing Tokenized Data Quality

After tokenizing all the data, I created another test to verify the quality of the tokenized data.

Screenshot 2025-07-10 at 3 59 46 PM
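
The checks boil down to something like this sketch (max_length and field names are assumptions; 50257 is GPT-2’s vocabulary size):

```python
# Sketch of the per-example tokenized-data checks.
def check_tokenized_example(example, vocab_size=50257, max_length=512):
    ids = example["input_ids"]
    mask = example["attention_mask"]
    return (
        0 < len(ids) <= max_length
        and len(ids) == len(mask)
        and all(0 <= token_id < vocab_size for token_id in ids)
    )
```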

I also check for data consistency across the different data splits.

Screenshot 2025-07-10 at 4 00 31 PM
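
A minimal sketch of that consistency check, comparing the fields of each split against the train split:

```python
# Sketch of the cross-split consistency check: every split should expose
# the same fields as the train split.
def check_split_consistency(tokenized_splits):
    """tokenized_splits maps a split name to its list of tokenized examples."""
    reference = set(tokenized_splits["train"][0].keys())
    return {
        name: set(split[0].keys()) == reference
        for name, split in tokenized_splits.items()
    }
```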

Additionally, I created a visual representation of the data quality across all splits.

Screenshot 2025-07-10 at 4 01 00 PM
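
The chart is essentially valid vs. total examples per split; a rough matplotlib sketch:

```python
import matplotlib.pyplot as plt

# Rough sketch of the quality chart: valid vs. total examples per split.
def plot_split_quality(counts, out_path="data_quality.png"):
    """counts maps a split name to a (valid, total) tuple."""
    splits = list(counts)
    totals = [counts[s][1] for s in splits]
    valids = [counts[s][0] for s in splits]
    plt.bar(splits, totals, color="lightgray", label="total")
    plt.bar(splits, valids, color="steelblue", label="valid")
    plt.ylabel("examples")
    plt.title("Data quality by split")
    plt.legend()
    plt.savefig(out_path)
```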

Fortunately, most of the data was valid, meaning that I have a good chunk of data to train GPT-2 on:

  • train: 164275 examples
  • validation: 20713 examples
  • test: 20413 examples

Notes

I pulled all the raw data that I had pushed to my GitHub repo, to keep the repo lightweight and respect data licenses. People can still access the raw data by simply running my load-data.py script.