5. LOADING TOKENIZER MODEL¶
In this chapter, we'll walk through the process of loading the tokenizer (vocabulary) model stored in the "tokenizer.model" file.
In our case, Llama 3.1's tokenizer file tokenizer.model stores a Byte-Pair Encoding (BPE) tokenizer model as plain text lines, each containing a base64-encoded token and its rank.
Since Llama 3, Llama models have used OpenAI's Tiktoken tokenizer.
5.1. Calling loadVocab()¶
loadVocab() is called if includeVocab is true.
from src/model/loader.go
func LoadModelEx(modelDir string, includeTensors bool, includeVocab bool) (*Model, error) {
	model := &Model{}
	...
	if includeVocab {
		err := loadVocab(modelDir, model)
		if err != nil {
			return nil, err
		}
	}
	...
}
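For example, a caller that needs both the tensors and the vocabulary could invoke it as follows. This is a hypothetical usage snippet; the directory path is illustrative only:

// Hypothetical usage: load the model with tensors and vocabulary enabled.
// The directory path below is illustrative, not the project's actual layout.
model, err := LoadModelEx("/path/to/Meta-Llama-3.1-8B-Instruct", true, true)
if err != nil {
	log.Fatal(err)
}
fmt.Printf("Found %d tokens.\n", len(model.Vocabulary.IdToToken))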
func loadVocab(modelDir string, model *Model) error {
	vocabFilePath := filepath.Join(modelDir, "tokenizer.model")
	common.GLogger.ConsolePrintf("Loading vocabulary/tokens file: \"%s\"...", vocabFilePath)
	vocabBpe, err := tiktoken.Load(vocabFilePath)
	if err != nil {
		return err
	}
	model.Vocabulary = NewVocabulary(vocabBpe)
	common.GLogger.ConsolePrintf("Found %d tokens in the model.", len(model.Vocabulary.IdToToken))
	return nil
}
In the tiktoken.Load(...) function, we call the loadTiktokenBpe(...) function, which opens the specified file, reads it line by line, and decodes the base64 part of each line into a token-to-rank map. Then, Load(...) adds the hardcoded special tokens and returns everything as a ModelData object. A minimal sketch of such a loader is shown after the sample lines below.
For original implementation details, check out the corresponding function and class in Meta's original Llama implementation.
Some sample lines from the Llama 3.1 tokenizer.model file, with their base64-decoded forms:
IQ== 0 //"!"
Ig== 1 //"\""
Iw== 2 //"#"
JA== 3 //"$"
JQ== 4 //"%"
Jg== 5 //"&"
Jw== 6 //"'"
KA== 7 //"("
KQ== 8 //")"
Kg== 9 //"*"
Kw== 10 //"+"
...
IHRo 270 //" th"
Cgo= 271 //"\n\n"
IGM= 272 //" c"
bGU= 273 //"le"
IHM= 274 //" s"
aXQ= 275 //"it"
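Building on the description above, here is a minimal sketch of how such a loader could work. It assumes the mergeable ranks are returned as a map keyed by the decoded token bytes; the actual implementation in src/tiktoken/tiktokenreader.go may differ in its details:

package tiktoken

import (
	"bufio"
	"encoding/base64"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// loadTiktokenBpe reads a Tiktoken BPE vocabulary file line by line.
// Each non-empty line has the form "<base64 token> <rank>", e.g. "IHRo 270".
func loadTiktokenBpe(vocabFilePath string) (map[string]int, error) {
	file, err := os.Open(vocabFilePath)
	if err != nil {
		return nil, err
	}
	defer file.Close()

	mergeableRanks := make(map[string]int)
	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" {
			continue
		}
		parts := strings.Fields(line)
		if len(parts) != 2 {
			return nil, fmt.Errorf("malformed vocabulary line: %q", line)
		}
		// Decode the base64 part to get the raw token bytes, e.g. "IHRo" -> " th".
		tokenBytes, err := base64.StdEncoding.DecodeString(parts[0])
		if err != nil {
			return nil, err
		}
		rank, err := strconv.Atoi(parts[1])
		if err != nil {
			return nil, err
		}
		mergeableRanks[string(tokenBytes)] = rank
	}
	return mergeableRanks, scanner.Err()
}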
from src/tiktoken/tiktokenreader.go
func Load(vocabFilePath string) (*ModelData, error) {
	mergeableRanks, err := loadTiktokenBpe(vocabFilePath)
	if err != nil {
		return nil, err
	}
	baseTokensCount := len(mergeableRanks)
	reservedSpecialTokensCount := 256
	specialTokensArr := []string{
		"<|begin_of_text|>",
		"<|end_of_text|>",
		"<|reserved_special_token_0|>",
		"<|reserved_special_token_1|>",
		"<|finetune_right_pad_id|>",
		"<|step_id|>",
		"<|start_header_id|>",
		"<|end_header_id|>",
		"<|eom_id|>", // end of message
		"<|eot_id|>", // end of turn
		"<|python_tag|>",
	}
	reservedTokensArr := make([]string, reservedSpecialTokensCount-len(specialTokensArr))
	for i := 0; i < len(reservedTokensArr); i++ {
		reservedTokensArr[i] = fmt.Sprintf("<|reserved_special_token_%d|>", 2+i)
	}
	specialTokensArr = append(specialTokensArr, reservedTokensArr...)
	specialTokens := make(map[string]int)
	for i, token := range specialTokensArr {
		specialTokens[token] = baseTokensCount + i
	}
	result := &ModelData{
		MergeableRanks:    mergeableRanks,
		SpecialTokens:     specialTokens,
		BeginOfSentenceId: specialTokens["<|begin_of_text|>"],
		EndOfSentenceId:   specialTokens["<|end_of_text|>"],
		PadId:             -1,
		UnknownId:         -1,
		StopTokenIds:      []int{specialTokens["<|eom_id|>"], specialTokens["<|eot_id|>"]},
	}
	return result, nil
}
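Llama 3.1's tokenizer.model contains 128,000 base tokens, so with this numbering the special tokens receive IDs starting at 128000, and the total vocabulary size becomes 128,256 (128,000 base + 256 special). A small standalone illustration of the same numbering loop, using the first two special token names from the code above:

package main

import "fmt"

func main() {
	baseTokensCount := 128000 // number of base (mergeable) tokens in Llama 3.1
	specialTokensArr := []string{"<|begin_of_text|>", "<|end_of_text|>"}
	specialTokens := make(map[string]int)
	for i, token := range specialTokensArr {
		specialTokens[token] = baseTokensCount + i
	}
	fmt.Println(specialTokens["<|begin_of_text|>"]) // 128000
	fmt.Println(specialTokens["<|end_of_text|>"])   // 128001
}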
5.2. Returning Vocabulary Model¶
We get the ModelData object as vocabBpe, then call the NewVocabulary(...) function with it. This function creates and returns a Vocabulary object that has TokenToId and IdToToken maps to provide two-way lookup between token strings and token IDs.
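The NewVocabulary(...) implementation isn't reproduced in this chapter. A minimal sketch of the idea, assuming map-typed TokenToId and IdToToken fields (the project's actual Vocabulary struct may use different types and carry additional fields), could look like:

type Vocabulary struct {
	TokenToId map[string]int
	IdToToken map[int]string
}

func NewVocabulary(modelData *tiktoken.ModelData) *Vocabulary {
	vocab := &Vocabulary{
		TokenToId: make(map[string]int),
		IdToToken: make(map[int]string),
	}
	// Base BPE tokens: decoded token bytes -> rank (token id).
	for token, id := range modelData.MergeableRanks {
		vocab.TokenToId[token] = id
		vocab.IdToToken[id] = token
	}
	// Special tokens such as "<|begin_of_text|>", placed after the base tokens.
	for token, id := range modelData.SpecialTokens {
		vocab.TokenToId[token] = id
		vocab.IdToToken[id] = token
	}
	return vocab
}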
Then, we assign the Vocabulary object to the model.Vocabulary property.
from src/model/loader.go
func loadVocab(modelDir string, model *Model) error {
	vocabFilePath := filepath.Join(modelDir, "tokenizer.model")
	common.GLogger.ConsolePrintf("Loading vocabulary/tokens file: \"%s\"...", vocabFilePath)
	vocabBpe, err := tiktoken.Load(vocabFilePath)
	if err != nil {
		return err
	}
	model.Vocabulary = NewVocabulary(vocabBpe)
	common.GLogger.ConsolePrintf("Found %d tokens in the model.", len(model.Vocabulary.IdToToken))
	return nil
}
After loading completes, we can see output lines in the console as follows: