2023-09-17
Tokenization is the process of breaking text down into smaller units called tokens. A token can be a word, a subword, or a punctuation mark. This step makes the data more manageable and is a prerequisite for most NLP tasks (text mining, machine learning, text analysis). Let's take a look at the tokenization process of a BERT-like model:
Example:
"Hello how are U today?" - Input
|
v
"hello how are u today?" - Case normalization
|
v
["hello", "how", "are", "u", "td", "##ay", "?"] - Subword tokenization
|
v
["CLS", "hello", "how", "are", "u", "td","##ay","?","SEP"] - Assigning special tokens
Here "today" has been split into "tod" and "##ay". This technique is known as subword tokenization and is used in models like BERT to handle out-of-vocabulary words.
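The pipeline above can be reproduced in a few lines. The sketch below assumes the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint; the exact subword splits depend on the vocabulary of whichever model you load, so the printed outputs are only illustrative.

```python
# Minimal sketch of the BERT-style tokenization pipeline
# (assumes `pip install transformers` and the bert-base-uncased checkpoint).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Hello how are U today?"

# Case normalization + subword tokenization (no special tokens yet).
subwords = tokenizer.tokenize(text)
print(subwords)  # e.g. ['hello', 'how', 'are', 'u', 'today', '?'] depending on the vocabulary

# Full encoding: special tokens are added around the subword sequence.
ids = tokenizer.encode(text)
print(tokenizer.convert_ids_to_tokens(ids))
# e.g. ['[CLS]', 'hello', 'how', 'are', 'u', 'today', '?', '[SEP]']
```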
Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords. For instance, "annoyingly" might be considered a rare word and could be decomposed into "annoying" and "ly". Both "annoying" and "ly" as stand-alone subwords appear more frequently, while at the same time the meaning of "annoyingly" is kept by the composite meaning of "annoying" and "ly". This is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords. (Source)
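As a rough illustration, here is how a WordPiece tokenizer treats a frequent word versus a rarer derived form. This is a sketch assuming the same bert-base-uncased tokenizer; whether a given word is split at all depends entirely on the model's vocabulary.

```python
# Sketch: frequent vs. rare words under WordPiece (bert-base-uncased assumed).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A frequent word is usually kept whole.
print(tokenizer.tokenize("annoying"))    # e.g. ['annoying']

# A rarer derived form may be decomposed into meaningful subwords.
print(tokenizer.tokenize("annoyingly"))  # e.g. ['annoying', '##ly'], depending on the vocabulary
```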
Special tokens serve structural roles in BERT-like models. The [CLS] token represents the entire input and is typically used for tasks like classification, while the [SEP] token separates different sentences or segments within the same input.
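The sketch below shows where these special tokens end up when encoding a sentence pair, again assuming a Hugging Face BERT tokenizer: [CLS] is prepended once, and [SEP] closes each segment.

```python
# Sketch: special tokens in a sentence-pair encoding (bert-base-uncased assumed).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoding = tokenizer("how are you?", "i am fine.")
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
print(tokens)
# e.g. ['[CLS]', 'how', 'are', 'you', '?', '[SEP]', 'i', 'am', 'fine', '.', '[SEP]']

# token_type_ids mark which segment each token belongs to (0 = first, 1 = second).
print(encoding["token_type_ids"])
```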