
5. LOADING TOKENIZER MODEL

In this chapter, we'll walk through the process of loading the tokenizer (vocabulary) model stored in the "tokenizer.model" file.

In our case, Llama 3.1's tokenizer file, tokenizer.model, stores a Byte-Pair Encoding (BPE) tokenizer model as plain text: each line contains a base64-encoded token followed by its rank (token id).

Starting with Llama 3, Llama models use OpenAI's Tiktoken tokenizer.
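
As a quick, standalone illustration of this file format (not part of the project's source), decoding one such base64 token is a one-liner in Go:

package main

import (
    "encoding/base64"
    "fmt"
)

func main() {
    // One line of tokenizer.model looks like: "IQ== 0"
    // The first field is the token's bytes in base64, the second is its rank (token id).
    tokenBytes, err := base64.StdEncoding.DecodeString("IQ==")
    if err != nil {
        panic(err)
    }
    fmt.Printf("token: %q, rank: %d\n", tokenBytes, 0) // prints: token: "!", rank: 0
}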

5.1. Calling loadVocab()

loadVocab() is called if includeVocab is true.

from src/model/loader.go

func LoadModelEx(modelDir string, includeTensors bool, includeVocab bool) (*Model, error) {
    model := &Model{}
    ...
    if includeVocab {
        err := loadVocab(modelDir, model)
        if err != nil {
            return nil, err
        }
    }
    ...
}
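
For example, a hypothetical call site that loads only the vocabulary (skipping tensors) might look like this; the directory path matches the console output shown later in this chapter:

// Hypothetical call site (error handling abbreviated):
m, err := LoadModelEx("/workspace/models-original/Meta-Llama-3.1-8B-Instruct", false, true)
if err != nil {
    return err
}
fmt.Printf("Vocabulary size: %d\n", len(m.Vocabulary.IdToToken))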

func loadVocab(modelDir string, model *Model) error {
    vocabFilePath := filepath.Join(modelDir, "tokenizer.model")
    common.GLogger.ConsolePrintf("Loading vocabulary/tokens file: \"%s\"...", vocabFilePath)
    vocabBpe, err := tiktoken.Load(vocabFilePath)
    if err != nil {
        return err
    }

    model.Vocabulary = NewVocabulary(vocabBpe)
    common.GLogger.ConsolePrintf("Found %d tokens in the model.", len(model.Vocabulary.IdToToken))
    return nil
}

In the tiktoken.Load(...) function, we call the loadTiktokenBpe(...) function, which opens the specified file, reads it line by line, and decodes the base64 part of each line. Load(...) then adds hardcoded special tokens on top of these base tokens and returns the result as a ModelData. A minimal sketch of such a reader is given after the sample lines below.

For the original implementation details, check out the load_tiktoken_bpe function in OpenAI's tiktoken library and the Tokenizer class in Meta's Llama 3 repository.

Some sample lines from the Llama 3.1 tokenizer.model file, with their base64-decoded forms:

IQ== 0 //"!"
Ig== 1 //"\""
Iw== 2 //"#"
JA== 3 //"$"
JQ== 4 //"%"
Jg== 5 //"&"
Jw== 6 //"'"
KA== 7 //"("
KQ== 8 //")"
Kg== 9 //"*"
Kw== 10 //"+"
...
IHRo 270 //" th"
Cgo= 271 //"\n\n"
IGM= 272 //" c"
bGU= 273 //"le"
IHM= 274 //" s"
aXQ= 275 //"it"
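
loadTiktokenBpe(...) itself isn't listed in this chapter. Below is a minimal sketch of such a reader, assuming every line has the form "<base64 token> <rank>" as in the samples above; the actual implementation in src/tiktoken/tiktokenreader.go may differ (the sketch needs the bufio, encoding/base64, fmt, os, strconv, and strings imports):

func loadTiktokenBpe(vocabFilePath string) (map[string]int, error) {
    file, err := os.Open(vocabFilePath)
    if err != nil {
        return nil, err
    }
    defer file.Close()

    mergeableRanks := make(map[string]int)
    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        line := strings.TrimSpace(scanner.Text())
        if line == "" {
            continue
        }
        // Each line: "<base64 token> <rank>", e.g. "IQ== 0"
        parts := strings.Fields(line)
        if len(parts) != 2 {
            return nil, fmt.Errorf("malformed vocabulary line: %q", line)
        }
        token, err := base64.StdEncoding.DecodeString(parts[0])
        if err != nil {
            return nil, err
        }
        rank, err := strconv.Atoi(parts[1])
        if err != nil {
            return nil, err
        }
        mergeableRanks[string(token)] = rank
    }
    return mergeableRanks, scanner.Err()
}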

from src/tiktoken/tiktokenreader.go

func Load(vocabFilePath string) (*ModelData, error) {
    mergeableRanks, err := loadTiktokenBpe(vocabFilePath)
    if err != nil {
        return nil, err
    }
    baseTokensCount := len(mergeableRanks)

    // Llama 3.1 reserves 256 special token ids at the end of the vocabulary
    reservedSpecialTokensCount := 256

    specialTokensArr := []string{
        "<|begin_of_text|>",
        "<|end_of_text|>",
        "<|reserved_special_token_0|>",
        "<|reserved_special_token_1|>",
        "<|finetune_right_pad_id|>",
        "<|step_id|>",
        "<|start_header_id|>",
        "<|end_header_id|>",
        "<|eom_id|>", // end of message
        "<|eot_id|>", // end of turn
        "<|python_tag|>",
    }

    // Fill the remaining 245 slots with placeholder reserved tokens,
    // from <|reserved_special_token_2|> to <|reserved_special_token_246|>
    reservedTokensArr := make([]string, reservedSpecialTokensCount-len(specialTokensArr))
    for i := 0; i < len(reservedTokensArr); i++ {
        reservedTokensArr[i] = fmt.Sprintf("<|reserved_special_token_%d|>", 2+i)
    }
    specialTokensArr = append(specialTokensArr, reservedTokensArr...)

    // Special token ids start right after the last base token id
    specialTokens := make(map[string]int)
    for i, token := range specialTokensArr {
        specialTokens[token] = baseTokensCount + i
    }

    result := &ModelData{
        MergeableRanks: mergeableRanks,
        SpecialTokens:  specialTokens,

        BeginOfSentenceId: specialTokens["<|begin_of_text|>"],
        EndOfSentenceId:   specialTokens["<|end_of_text|>"],
        PadId:             -1,
        UnknownId:         -1,
        StopTokenIds:      []int{specialTokens["<|eom_id|>"], specialTokens["<|eot_id|>"]},
    }

    return result, nil
}
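
For Llama 3.1, loadTiktokenBpe(...) returns 128,000 mergeable ranks, so baseTokensCount is 128000 and the 256 special tokens receive ids 128000 through 128255. That adds up to the 128,256 tokens reported in the console output at the end of this chapter.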

5.2. Returning Vocabulary Model

We get the ModelData object as vocabBpe, then pass it to the NewVocabulary(...) function. This function creates and returns a Vocabulary object with TokenToId and IdToToken maps that provide two-way lookup between token strings and token ids. We then assign this Vocabulary object to the model.Vocabulary property.
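
NewVocabulary(...) isn't listed here; the following minimal sketch shows how it could build the two maps from the ModelData, assuming IdToToken is an id-indexed slice and tokens are kept as plain strings (the actual types in the repository may differ):

// A sketch of NewVocabulary; actual field types may differ in the repository.
type Vocabulary struct {
    TokenToId map[string]int
    IdToToken []string
}

func NewVocabulary(modelData *tiktoken.ModelData) *Vocabulary {
    totalTokens := len(modelData.MergeableRanks) + len(modelData.SpecialTokens)
    vocab := &Vocabulary{
        TokenToId: make(map[string]int, totalTokens),
        IdToToken: make([]string, totalTokens),
    }
    // Base BPE tokens occupy ids [0, len(MergeableRanks))
    for token, id := range modelData.MergeableRanks {
        vocab.TokenToId[token] = id
        vocab.IdToToken[id] = token
    }
    // Special tokens were assigned ids starting at len(MergeableRanks)
    for token, id := range modelData.SpecialTokens {
        vocab.TokenToId[token] = id
        vocab.IdToToken[id] = token
    }
    return vocab
}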

from src/model/loader.go

func loadVocab(modelDir string, model *Model) error {
    vocabFilePath := filepath.Join(modelDir, "tokenizer.model")
    common.GLogger.ConsolePrintf("Loading vocabulary/tokens file: \"%s\"...", vocabFilePath)
    vocabBpe, err := tiktoken.Load(vocabFilePath)
    if err != nil {
        return err
    }

    model.Vocabulary = NewVocabulary(vocabBpe)
    common.GLogger.ConsolePrintf("Found %d tokens in the model.", len(model.Vocabulary.IdToToken))
    return nil
}

We can see output lines like these in the console:

[INFO] ... Loading vocabulary/tokens file: "/workspace/models-original/Meta-Llama-3.1-8B-Instruct/tokenizer.model"...
[INFO] ... Found 128256 tokens in the model.
Model "/workspace/models-original/Meta-Llama-3.1-8B-Instruct" was loaded.