Data Preprocessing and Tokenization

The Building Blocks of Understanding

After gathering extensive and diverse datasets, the next crucial step is data preprocessing and tokenization. This phase transforms raw textual data into structured formats that a Large Language Model (LLM) can efficiently understand and process. Just as humans break down language into words and sentences for comprehension, LLMs segment text into smaller, meaningful units called tokens. Effective preprocessing and tokenization significantly enhance a model’s ability to grasp context, manage vocabulary efficiently, and improve overall performance.

What is Tokenization?

Tokenization is the process of splitting textual data into discrete units called tokens. Tokens can be words, subwords, or even characters, depending on the chosen strategy. Each token is then mapped to a unique numerical identifier that the model uses during training.

Common tokenization methods include (all three are illustrated in the short sketch after this list):

  • Word-level Tokenization: Splitting text into individual words.
  • Subword Tokenization (e.g., Byte Pair Encoding): Breaking words into subwords to handle rare words and reduce vocabulary size.
  • Character-level Tokenization: Treating individual characters as tokens, useful for multilingual models.
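To make the distinction concrete, here is a minimal Python sketch. It is purely illustrative: the subword split is hand-picked rather than learned by a real BPE model, and the vocabulary is built only from this tiny phrase.

```python
# Three naive tokenization strategies applied to the same text.
text = "unbelievable results"

# Word-level: split on whitespace
word_tokens = text.split()            # ['unbelievable', 'results']

# Character-level: every character is a token
char_tokens = list(text)              # ['u', 'n', 'b', 'e', ...]

# Subword-level: hand-picked pieces for illustration (real BPE learns these merges from data)
subword_tokens = ["un", "believ", "able", "results"]

# Map each unique word token to a numerical identifier, as a model would require
vocab = {tok: idx for idx, tok in enumerate(sorted(set(word_tokens)))}
ids = [vocab[tok] for tok in word_tokens]
print(word_tokens, ids)               # ['unbelievable', 'results'] [1, 0]
```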

Why is Preprocessing Important?

Preprocessing ensures that the data fed into the model is clean, consistent, and optimized for learning. Key preprocessing steps include (a small end-to-end sketch follows the list):

  • Normalization:
    • Converting text to lowercase to maintain consistency.
    • Removing unnecessary punctuation or special characters.
  • Handling Missing or Incomplete Data:
    • Filling or removing incomplete sentences and corrupted entries to maintain data integrity.
  • Managing Vocabulary:
    • Creating a vocabulary of the most frequent tokens.
    • Assigning unique numerical identifiers to tokens for efficient processing.
  • Structuring Input Data:
    • Formatting data into sequences suitable for model input.
    • Adding special tokens (e.g., [START], [END]) to indicate sentence boundaries.
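The sketch below ties these steps together on a tiny in-memory corpus. It is a simplified illustration, assuming toy special tokens ([PAD], [START], [END], [UNK]) and aggressive lowercasing and punctuation removal; production pipelines handle these steps far more carefully.

```python
import re
from collections import Counter

def preprocess(text):
    """Minimal normalization: lowercase and strip punctuation/special characters."""
    text = text.lower()
    return re.sub(r"[^a-z0-9\s]", "", text)

corpus = [
    "The goalkeeper misunderstood the handbook!",
    "He overthrew the pass.",
]

# Normalize and tokenize each document
tokenized = [preprocess(doc).split() for doc in corpus]

# Build a vocabulary of the most frequent tokens, reserving IDs for special tokens
counts = Counter(tok for doc in tokenized for tok in doc)
specials = ["[PAD]", "[START]", "[END]", "[UNK]"]
vocab = {tok: i for i, tok in enumerate(specials + [t for t, _ in counts.most_common()])}

def encode(tokens):
    """Structure a document as an ID sequence framed by boundary markers."""
    ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    return [vocab["[START]"]] + ids + [vocab["[END]"]]

print(encode(tokenized[0]))  # [1, 4, 5, 6, 4, 7, 2] -> [START] the goalkeeper misunderstood the handbook [END]
```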

Data Preprocessing & Tokenization Interactive Demo

Preprocessing cleans the text, and tokenization splits it into smaller pieces called tokens. Enter some text below and click the button to see how it works!

Here’s an example text you can use:

The goalkeeper misunderstood the handbook, so he overthrew the pass and blamed the bystanders for the heartbreaking loss.



Tokenization Methods in the Demo

    1. Whole-Word Tokenization

    • Description:
      Splits text into individual words based on whitespace after removing punctuation and converting to lowercase. Each token is a complete word.
    • Use in Real LLMs:
      Rarely used alone in modern LLMs because it produces very large vocabularies and cannot represent unknown (out-of-vocabulary) words. Mostly serves educational or illustrative purposes.
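The demo's own code is not reproduced here, but a naive implementation consistent with this description might look like the following sketch.

```python
import re

def whole_word_tokenize(text):
    """Naive whole-word tokenizer: lowercase, strip punctuation, split on whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    return cleaned.split()

print(whole_word_tokenize("The goalkeeper misunderstood the handbook, so he overthrew the pass."))
# ['the', 'goalkeeper', 'misunderstood', 'the', 'handbook', 'so', 'he', 'overthrew', 'the', 'pass']
```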

    2. GPT-Style Tokenization (naive implementation)

    • Description:
      Separates text into tokens by recognizing alphanumeric sequences, punctuation, and whitespace distinctly. Preserves punctuation and spaces as separate tokens.
    • Use in Real LLMs:
      GPT models actually use byte-level Byte-Pair Encoding (BPE) rather than the naive splitting shown here. Real GPT tokenization merges frequently occurring byte sequences learned from data to balance vocabulary size and flexibility.
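As a hedge, the sketch below is one plausible naive implementation matching the description (alphanumeric runs, single punctuation marks, and whitespace runs as separate tokens); it is not the tokenizer real GPT models ship with.

```python
import re

def gpt_style_tokenize(text):
    """Naive GPT-style split: keep alphanumeric runs, punctuation marks,
    and whitespace runs as distinct tokens (nothing is discarded)."""
    return re.findall(r"\w+|[^\w\s]|\s+", text)

print(gpt_style_tokenize("He overthrew the pass, sadly."))
# ['He', ' ', 'overthrew', ' ', 'the', ' ', 'pass', ',', ' ', 'sadly', '.']
```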

    3. WordPiece Tokenization (naive implementation)

    • Description:
      Converts text to lowercase, removes punctuation, and splits words by whitespace. Words longer than a certain length (6 characters in this example) are divided into smaller chunks, with subsequent chunks prefixed by “##”.
    • Use in Real LLMs:
      A simplified form of the real WordPiece tokenization used by BERT and related Google models. The real WordPiece algorithm learns subword units from corpus statistics, segmenting rare words into frequent pieces so the vocabulary stays compact without losing coverage.
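One way to read that description in code is sketched below; the 6-character cutoff and fixed-size chunks are assumptions about the demo, whereas real WordPiece learns its pieces from data.

```python
import re

def wordpiece_tokenize(text, chunk=6):
    """Naive WordPiece-style split: lowercase, strip punctuation, then break
    words longer than `chunk` characters into fixed-size pieces, prefixing
    each continuation piece with '##'."""
    tokens = []
    for word in re.sub(r"[^\w\s]", "", text.lower()).split():
        pieces = [word[i:i + chunk] for i in range(0, len(word), chunk)]
        tokens.append(pieces[0])
        tokens.extend("##" + p for p in pieces[1:])
    return tokens

print(wordpiece_tokenize("The goalkeeper misunderstood the handbook."))
# ['the', 'goalke', '##eper', 'misund', '##erstoo', '##d', 'the', 'handbo', '##ok']
```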

    4. Byte-Level Tokenization (naive implementation)

    • Description:
      Treats every individual character (letters, punctuation, whitespace) as a separate token.
    • Use in Real LLMs:
      Real-world byte-level tokenizers (e.g., GPT-2 and GPT-3 tokenizers) are much more advanced. They apply Byte-level BPE, allowing them to efficiently handle arbitrary characters, emojis, multilingual texts, and unseen words without explicit preprocessing steps.
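A character-level version is essentially one line; the second print hints at what true byte-level tokenizers operate on (raw UTF-8 bytes) before any BPE merges are applied.

```python
def char_level_tokenize(text):
    """Naive character-level split: every character becomes its own token."""
    return list(text)

text = "Góal!"
print(char_level_tokenize(text))    # ['G', 'ó', 'a', 'l', '!']
print(list(text.encode("utf-8")))   # raw UTF-8 bytes: [71, 195, 179, 97, 108, 33]
```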

Why Different Tokenizers?

Choosing a tokenizer depends on balancing:

  • Vocabulary size: Smaller vocabularies keep the embedding and output layers compact, but split text into more tokens per sentence.
  • Handling unknown words: Subword tokenizers (GPT-style, WordPiece) manage unseen words better.
  • Language coverage: Byte-level methods better handle multilingual and varied character sets.

Real-world LLMs primarily use advanced subword tokenization (Byte-Pair Encoding, WordPiece, or SentencePiece) to optimize efficiency, handle multilingual inputs, and effectively manage vocabulary size.

The simplified implementations above serve as educational illustrations to demonstrate conceptually how these methods differ.
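For comparison with a production tokenizer, the optional sketch below uses the third-party tiktoken package (an assumption outside this demo) to encode the example sentence with the cl100k_base byte-level BPE vocabulary used by several recent OpenAI models.

```python
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "The goalkeeper misunderstood the handbook."
ids = enc.encode(text)

print(len(ids), "tokens:", ids)
# Decode each ID individually to inspect the learned subword pieces
print([enc.decode([i]) for i in ids])
```

Note how the learned subword pieces differ from the fixed 6-character chunks in the naive WordPiece sketch above.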

Next step: Embeddings