The tokenization unit in CS336 Lecture 01 was inspired by Andrej Karpathy’s video on tokenization:Let’s build the GPT Tokenizer
为了感受分词器的工作原理,可以试试这个互动网站.

词元化

词元化Tokenization)是把文本分割成最小有意义单位(词元token)的过程。这些词元可以是词、子词、字符或符号,用于后续自然语言处理或模型输入。
原始文本通常以Unicode字符串形式表示,例如str = "大语言模型"。分词器(Tokenizer)是连接原始文本与语言模型数值输入的桥梁,负责将Unicode字符串编码为整数标记序列供模型处理,并将模型输出的标记序列解码回可读文本。

示例:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
tokenizer = get_gpt2_tokenizer()
string = "Hello, 🌍! 你好!" # @inspect string
indices = tokenizer.encode(string) # @inspect indices
reconstructed_string = tokenizer.decode(indices) # @inspect reconstructed_string
assert string == reconstructed_string
compression_ratio = get_compression_ratio(string, indices) # @inspect compression_ratio

def get_gpt2_tokenizer():
# Code: https://github.com/openai/tiktoken
# You can use cl100k_base for the gpt3.5-turbo or gpt4 tokenizer
return tiktoken.get_encoding("gpt2")

def get_compression_ratio(string: str, indices: list[int]) -> float:
"""Given `string` that has been tokenized into `indices`, ."""
num_bytes = len(bytes(string, encoding="utf-8")) # @inspect num_bytes
num_tokens = len(indices) # @inspect num_tokens
return num_bytes / num_tokens

# 运行结果
string = "Hello, 🌍! 你好!"
indices = [ 15496, 11, 12520, 234, 235, 0, 220, 19526, 254, 25001, 121, 0 ]
reconstructed_string = "Hello, 🌍! 你好!"
compression_ratio = 1.6666666666666667

基于字符的分词

Unicode 字符串是由一系列 Unicode 字符组成的序列。每个字符都可以通过 ord() 函数转换为对应的 Unicode 码点(Code Point),而该码点也可以通过 chr() 函数重新转换回对应的字符。

1
2
3
4
assert ord("a") == 97
assert ord("🌍") == 127757
assert chr(97) == "a"
assert chr(127757) == "🌍"

我们可以构建一个基于字符的分词器,并验证其编码-解码的往返一致性

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
tokenizer = CharacterTokenizer()
string = "Hello, 🌍! 你好!" # @inspect string
indices = tokenizer.encode(string) # @inspect indices
reconstructed_string = tokenizer.decode(indices) # @inspect reconstructed_string
assert string == reconstructed_string

vocabulary_size = max(indices) + 1
"vocabulary_size = 127758"

compression_ratio = get_compression_ratio(string, indices)
"compression_ratio = 1.5384615384615385"

class CharacterTokenizer(Tokenizer):
"""Represent a string as a sequence of Unicode code points."""
def encode(self, string: str) -> list[int]:
return list(map(ord, string))
def decode(self, indices: list[int]) -> str:
return "".join(map(chr, indices))

基于字符的分词面临两个主要问题:

  • 词汇表非常庞大
  • 许多字符(如 🌍)极为罕见,导致词汇利用率低下

基于字节的分词

Unicode 字符串可以表示为一系列字节,而每个字节可由 0 到 255 之间的整数表示。UTF-8 是最常见的 Unicode 编码格式。

1
2
assert bytes("a", encoding="utf-8") == b"a"
assert bytes("🌍", encoding="utf-8") == b"\xf0\x9f\x8c\x8d"

我们可以构建一个基于字节的分词器,并验证其编码-解码的往返一致性

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
tokenizer = ByteTokenizer()
string = "Hello, 🌍! 你好!" # @inspect string
indices = tokenizer.encode(string) # @inspect indices
reconstructed_string = tokenizer.decode(indices) # @inspect reconstructed_string
assert string == reconstructed_string

class ByteTokenizer(Tokenizer):
"""Represent a string as a sequence of bytes."""
def encode(self, string: str) -> list[int]:
string_bytes = string.encode("utf-8") # @inspect string_bytes
indices = list(map(int, string_bytes)) # @inspect indices
return indices
def decode(self, indices: list[int]) -> str:
string_bytes = bytes(indices) # @inspect string_bytes
string = string_bytes.decode("utf-8") # @inspect string
return string

由于压缩率不理想,导致生成的序列长度较长。考虑到 Transformer 模型受限于上下文长度,且注意力机制的计算复杂度为序列长度的平方,基于字节的分词方式在实际应用中可能面临效率挑战。

基于单词的分词

基于单词的分词,顾名思义,就是将一段文本分割成独立的、有意义的“词”或“单词”来作为基本的处理单元。它的基本思想是空格或标点符号是词语之间的天然分隔符。例如:

1
2
3
4
5
6
7
8
9
>>> string = "I'll say supercalifragilisticexpialidocious!"
>>> segments = regex.findall(r"\w+|.", string) # @inspect segments
>>> [
"I",
"'ll",
" say",
" supercalifragilisticexpialidocious",
"!"
]

同样的,缺点仍然是词汇表过于庞大,没有固定的词汇量。而且许多单词很少见,模型不会对它们了解太多。

Byte Pair Encoding

字节对编码Byte Pair Encoding, BPE)是一种基于统计的子词分词算法。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
def train_bpe(string: str, num_merges: int) -> BPETokenizerParams:  # @inspect string, @inspect num_merges
# Start with the list of bytes of string.
indices = list(map(int, string.encode("utf-8"))) # @inspect indices
merges: dict[tuple[int, int], int] = {} # index1, index2 => merged index
vocab: dict[int, bytes] = {x: bytes([x]) for x in range(256)} # index -> bytes

for i in range(num_merges):
# Count the number of occurrences of each pair of tokens
counts = defaultdict(int)
for index1, index2 in zip(indices, indices[1:]): # For each adjacent pair
counts[(index1, index2)] += 1 # @inspect counts

# Find the most common pair.
pair = max(counts, key=counts.get) # @inspect pair
index1, index2 = pair

# Merge that pair.
new_index = 256 + i # @inspect new_index
merges[pair] = new_index # @inspect merges
vocab[new_index] = vocab[index1] + vocab[index2] # @inspect vocab
indices = merge(indices, pair, new_index) # @inspect indices
return BPETokenizerParams(vocab=vocab, merges=merges)

def merge(indices: list[int], pair: tuple[int, int], new_index: int) -> list[int]: # @inspect indices, @inspect pair, @inspect new_index
"""Return `indices`, but with all instances of `pair` replaced with `new_index`."""
new_indices = [] # @inspect new_indices
i = 0 # @inspect i
while i < len(indices):
if i + 1 < len(indices) and indices[i] == pair[0] and indices[i + 1] == pair[1]:
new_indices.append(new_index)
i += 2
else:
new_indices.append(indices[i])
i += 1
return new_indices