The tokenization unit in CS336 Lecture 01 was inspired by Andrej Karpathy’s video on tokenization:Let’s build the GPT Tokenizer
为了感受分词器的工作原理，可以试试这个互动网站.

词元化

词元化（Tokenization）是把文本分割成最小有意义单位（词元，token）的过程。这些词元可以是词、子词、字符或符号，用于后续自然语言处理或模型输入。
原始文本通常以Unicode字符串形式表示，例如str = "大语言模型"。分词器(Tokenizer)是连接原始文本与语言模型数值输入的桥梁，负责将Unicode字符串编码为整数标记序列供模型处理，并将模型输出的标记序列解码回可读文本。

示例：

tokenizer = get_gpt2_tokenizer()
string = "Hello, 🌍! 你好!" # @inspect string
indices = tokenizer.encode(string) # @inspect indices
reconstructed_string = tokenizer.decode(indices) # @inspect reconstructed_string
assert string == reconstructed_string
compression_ratio = get_compression_ratio(string, indices) # @inspect compression_ratio

def get_gpt2_tokenizer():
    # Code: https://github.com/openai/tiktoken
    # You can use cl100k_base for the gpt3.5-turbo or gpt4 tokenizer
    return tiktoken.get_encoding("gpt2")

def get_compression_ratio(string: str, indices: list[int]) -> float:
    """Given `string` that has been tokenized into `indices`, ."""
    num_bytes = len(bytes(string, encoding="utf-8"))  # @inspect num_bytes
    num_tokens = len(indices)                       # @inspect num_tokens
    return num_bytes / num_tokens
    
# 运行结果
string = "Hello, 🌍! 你好!"
indices = [ 15496, 11, 12520, 234, 235, 0, 220, 19526, 254, 25001, 121, 0 ]
reconstructed_string = 	"Hello, 🌍! 你好!"
compression_ratio = 1.6666666666666667

基于字符的分词

Unicode 字符串是由一系列 Unicode 字符组成的序列。每个字符都可以通过 ord() 函数转换为对应的 Unicode 码点（Code Point），而该码点也可以通过 chr() 函数重新转换回对应的字符。

assert ord("a") == 97
assert ord("🌍") == 127757
assert chr(97) == "a"
assert chr(127757) == "🌍"

我们可以构建一个基于字符的分词器，并验证其编码-解码的往返一致性

tokenizer = CharacterTokenizer()
string = "Hello, 🌍! 你好!"  # @inspect string
indices = tokenizer.encode(string)  # @inspect indices
reconstructed_string = tokenizer.decode(indices)  # @inspect reconstructed_string
assert string == reconstructed_string

vocabulary_size = max(indices) + 1
"vocabulary_size = 127758"

compression_ratio = get_compression_ratio(string, indices)
"compression_ratio = 1.5384615384615385"

class CharacterTokenizer(Tokenizer):
    """Represent a string as a sequence of Unicode code points."""
    def encode(self, string: str) -> list[int]:
        return list(map(ord, string))
    def decode(self, indices: list[int]) -> str:
        return "".join(map(chr, indices))

基于字符的分词面临两个主要问题：

词汇表非常庞大
许多字符（如 🌍）极为罕见，导致词汇利用率低下

基于字节的分词

Unicode 字符串可以表示为一系列字节，而每个字节可由 0 到 255 之间的整数表示。UTF-8 是最常见的 Unicode 编码格式。

1 2	assert bytes("a", encoding="utf-8") == b"a" assert bytes("🌍", encoding="utf-8") == b"\xf0\x9f\x8c\x8d"

我们可以构建一个基于字节的分词器，并验证其编码-解码的往返一致性

tokenizer = ByteTokenizer()
string = "Hello, 🌍! 你好!"  # @inspect string
indices = tokenizer.encode(string)  # @inspect indices
reconstructed_string = tokenizer.decode(indices)  # @inspect reconstructed_string
assert string == reconstructed_string

class ByteTokenizer(Tokenizer):
    """Represent a string as a sequence of bytes."""
    def encode(self, string: str) -> list[int]:
        string_bytes = string.encode("utf-8")  # @inspect string_bytes
        indices = list(map(int, string_bytes))  # @inspect indices
        return indices
    def decode(self, indices: list[int]) -> str:
        string_bytes = bytes(indices)  # @inspect string_bytes
        string = string_bytes.decode("utf-8")  # @inspect string
        return string

由于压缩率不理想，导致生成的序列长度较长。考虑到 Transformer 模型受限于上下文长度，且注意力机制的计算复杂度为序列长度的平方，基于字节的分词方式在实际应用中可能面临效率挑战。

基于单词的分词

基于单词的分词，顾名思义，就是将一段文本分割成独立的、有意义的“词”或“单词”来作为基本的处理单元。它的基本思想是空格或标点符号是词语之间的天然分隔符。例如：

>>> string = "I'll say supercalifragilisticexpialidocious!"
>>> segments = regex.findall(r"\w+|.", string) # @inspect segments
>>> [
  "I",
  "'ll",
  " say",
  " supercalifragilisticexpialidocious",
  "!"
]

同样的，缺点仍然是词汇表过于庞大，没有固定的词汇量。而且许多单词很少见，模型不会对它们了解太多。

Byte Pair Encoding

字节对编码（Byte Pair Encoding, BPE）是一种基于统计的子词分词算法。

def train_bpe(string: str, num_merges: int) -> BPETokenizerParams:  # @inspect string, @inspect num_merges
    # Start with the list of bytes of string.
    indices = list(map(int, string.encode("utf-8")))  # @inspect indices
    merges: dict[tuple[int, int], int] = {}  # index1, index2 => merged index
    vocab: dict[int, bytes] = {x: bytes([x]) for x in range(256)}  # index -> bytes
    
    for i in range(num_merges):
        # Count the number of occurrences of each pair of tokens
        counts = defaultdict(int)
        for index1, index2 in zip(indices, indices[1:]):  # For each adjacent pair
            counts[(index1, index2)] += 1  # @inspect counts
        
        # Find the most common pair.
        pair = max(counts, key=counts.get)  # @inspect pair
        index1, index2 = pair
        
        # Merge that pair.
        new_index = 256 + i  # @inspect new_index
        merges[pair] = new_index  # @inspect merges
        vocab[new_index] = vocab[index1] + vocab[index2]  # @inspect vocab
        indices = merge(indices, pair, new_index)  # @inspect indices
    return BPETokenizerParams(vocab=vocab, merges=merges)
    
def merge(indices: list[int], pair: tuple[int, int], new_index: int) -> list[int]:  # @inspect indices, @inspect pair, @inspect new_index
    """Return `indices`, but with all instances of `pair` replaced with `new_index`."""
    new_indices = []  # @inspect new_indices
    i = 0  # @inspect i
    while i < len(indices):
        if i + 1 < len(indices) and indices[i] == pair[0] and indices[i + 1] == pair[1]:
            new_indices.append(new_index)
            i += 2
        else:
            new_indices.append(indices[i])
            i += 1
    return new_indices