2023-09-17
Tokenization is the process of breaking text down into smaller units called tokens. A token can be a word, a subword, or a punctuation mark. This step makes the data more manageable and is a prerequisite for most NLP tasks (text mining, machine learning, text analysis). Let's take a look at the tokenization process of a BERT-like model:
Example:
"Hello how are U today?" - Input
|
v
"hello how are u today?" - Case normalization
|
v
["hello", "how", "are", "u", "td", "##ay", "?"] - Subword tokenization
|
v
["CLS", "hello", "how", "are", "u", "td","##ay","?","SEP"] - Assigning special tokens
Here "today" has been split into "tod" and "##ay". This technique is known as subword tokenization and is used in models like BERT to handle out-of-vocabulary words.
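The pipeline above can be reproduced in a few lines. The sketch below assumes the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint; the exact subword splits depend on the vocabulary of whichever model you load, so the printed outputs are only illustrative.

```python
# Minimal sketch of the BERT-style tokenization pipeline
# (assumes `pip install transformers` and the bert-base-uncased checkpoint).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Hello how are U today?"

# Case normalization + subword tokenization (no special tokens yet).
subwords = tokenizer.tokenize(text)
print(subwords)  # e.g. ['hello', 'how', 'are', 'u', 'today', '?'] depending on the vocabulary

# Full encoding: special tokens are added around the subword sequence.
ids = tokenizer.encode(text)
print(tokenizer.convert_ids_to_tokens(ids))
# e.g. ['[CLS]', 'hello', 'how', 'are', 'u', 'today', '?', '[SEP]']
```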
Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords. For instance, "annoyingly" might be considered a rare word and could be decomposed into "annoying" and "ly". Both "annoying" and "ly" as stand-alone subwords appear more frequently, while at the same time the meaning of "annoyingly" is kept by the composite meaning of "annoying" and "ly". This is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords. (Source)
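As a rough illustration, here is how a WordPiece tokenizer treats a frequent word versus a rarer derived form. This is a sketch assuming the same bert-base-uncased tokenizer; whether a given word is split at all depends entirely on the model's vocabulary.

```python
# Sketch: frequent vs. rare words under WordPiece (bert-base-uncased assumed).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A frequent word is usually kept whole.
print(tokenizer.tokenize("annoying"))    # e.g. ['annoying']

# A rarer derived form may be decomposed into meaningful subwords.
print(tokenizer.tokenize("annoyingly"))  # e.g. ['annoying', '##ly'], depending on the vocabulary
```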
Special tokens serve structural roles in BERT-like models. The [CLS] token represents the entire input and is typically used for tasks like classification, while the [SEP] token separates different sentences or segments within the same input.
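The sketch below shows where these special tokens end up when encoding a sentence pair, again assuming a Hugging Face BERT tokenizer: [CLS] is prepended once, and [SEP] closes each segment.

```python
# Sketch: special tokens in a sentence-pair encoding (bert-base-uncased assumed).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoding = tokenizer("how are you?", "i am fine.")
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
print(tokens)
# e.g. ['[CLS]', 'how', 'are', 'you', '?', '[SEP]', 'i', 'am', 'fine', '.', '[SEP]']

# token_type_ids mark which segment each token belongs to (0 = first, 1 = second).
print(encoding["token_type_ids"])
```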