5. LOADING TOKENIZER MODEL¶
In this chapter, we'll walk through the process of loading the tokenizer (vocabulary) model stored in the "tokenizer.model" file.
In our case, Llama 3.1's tokenizer file tokenizer.model stores a Byte-Pair Encoding (BPE) tokenizer model as plain text lines, each containing a base64-encoded token and its rank.
Since Llama 3, Llama models have used OpenAI's Tiktoken tokenizer.
5.1. Calling loadVocab()¶
loadVocab() is called if includeVocab is true.
from src/model/loader.go
func LoadModelEx(modelDir string, includeTensors bool, includeVocab bool) (*Model, error) {
	model := &Model{}
	...
	if includeVocab {
		err := loadVocab(modelDir, model)
		if err != nil {
			return nil, err
		}
	}
	...
}
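For example, a caller that needs both the tensors and the vocabulary could invoke it as follows. This is a hypothetical usage snippet; the directory path is illustrative only:

// Hypothetical usage: load the model with tensors and vocabulary enabled.
// The directory path below is illustrative, not the project's actual layout.
model, err := LoadModelEx("/path/to/Meta-Llama-3.1-8B-Instruct", true, true)
if err != nil {
	log.Fatal(err)
}
fmt.Printf("Found %d tokens.\n", len(model.Vocabulary.IdToToken))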
func loadVocab(modelDir string, model *Model) error {
	vocabFilePath := filepath.Join(modelDir, "tokenizer.model")
	common.GLogger.ConsolePrintf("Loading vocabulary/tokens file: \"%s\"...", vocabFilePath)
	vocabBpe, err := tiktoken.Load(vocabFilePath)
	if err != nil {
		return err
	}
	model.Vocabulary = NewVocabulary(vocabBpe)
	common.GLogger.ConsolePrintf("Found %d tokens in the model.", len(model.Vocabulary.IdToToken))
	return nil
}
In the tiktoken.Load(...) function, we call the loadTiktokenBpe(...) function, which opens the specified file, reads it line by line, and decodes the base64 part of each line into a token-to-rank map. Then, Load(...) adds the hardcoded special tokens and returns everything as a ModelData object. A minimal sketch of such a loader is shown after the sample lines below.
For original implementation details, check out the corresponding function and class in Meta's original Llama implementation.
Some sample lines from the Llama 3.1 tokenizer.model file, with their base64-decoded forms:
IQ== 0 //"!"
Ig== 1 //"\""
Iw== 2 //"#"
JA== 3 //"$"
JQ== 4 //"%"
Jg== 5 //"&"
Jw== 6 //"'"
KA== 7 //"("
KQ== 8 //")"
Kg== 9 //"*"
Kw== 10 //"+"
...
IHRo 270 //" th"
Cgo= 271 //"\n\n"
IGM= 272 //" c"
bGU= 273 //"le"
IHM= 274 //" s"
aXQ= 275 //"it"
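Building on the description above, here is a minimal sketch of how such a loader could work. It assumes the mergeable ranks are returned as a map keyed by the decoded token bytes; the actual implementation in src/tiktoken/tiktokenreader.go may differ in its details:

package tiktoken

import (
	"bufio"
	"encoding/base64"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// loadTiktokenBpe reads a Tiktoken BPE vocabulary file line by line.
// Each non-empty line has the form "<base64 token> <rank>", e.g. "IHRo 270".
func loadTiktokenBpe(vocabFilePath string) (map[string]int, error) {
	file, err := os.Open(vocabFilePath)
	if err != nil {
		return nil, err
	}
	defer file.Close()

	mergeableRanks := make(map[string]int)
	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" {
			continue
		}
		parts := strings.Fields(line)
		if len(parts) != 2 {
			return nil, fmt.Errorf("malformed vocabulary line: %q", line)
		}
		// Decode the base64 part to get the raw token bytes, e.g. "IHRo" -> " th".
		tokenBytes, err := base64.StdEncoding.DecodeString(parts[0])
		if err != nil {
			return nil, err
		}
		rank, err := strconv.Atoi(parts[1])
		if err != nil {
			return nil, err
		}
		mergeableRanks[string(tokenBytes)] = rank
	}
	return mergeableRanks, scanner.Err()
}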
from src/tiktoken/tiktokenreader.go
func Load(vocabFilePath string) (*ModelData, error) {
	mergeableRanks, err := loadTiktokenBpe(vocabFilePath)
	if err != nil {
		return nil, err
	}
	baseTokensCount := len(mergeableRanks)
	reservedSpecialTokensCount := 256
	specialTokensArr := []string{
		"<|begin_of_text|>",
		"<|end_of_text|>",
		"<|reserved_special_token_0|>",
		"<|reserved_special_token_1|>",
		"<|finetune_right_pad_id|>",
		"<|step_id|>",
		"<|start_header_id|>",
		"<|end_header_id|>",
		"<|eom_id|>", // end of message
		"<|eot_id|>", // end of turn
		"<|python_tag|>",
	}
	reservedTokensArr := make([]string, reservedSpecialTokensCount-len(specialTokensArr))
	for i := 0; i < len(reservedTokensArr); i++ {
		reservedTokensArr[i] = fmt.Sprintf("<|reserved_special_token_%d|>", 2+i)
	}
	specialTokensArr = append(specialTokensArr, reservedTokensArr...)
	specialTokens := make(map[string]int)
	for i, token := range specialTokensArr {
		specialTokens[token] = baseTokensCount + i
	}
	result := &ModelData{
		MergeableRanks:    mergeableRanks,
		SpecialTokens:     specialTokens,
		BeginOfSentenceId: specialTokens["<|begin_of_text|>"],
		EndOfSentenceId:   specialTokens["<|end_of_text|>"],
		PadId:             -1,
		UnknownId:         -1,
		StopTokenIds:      []int{specialTokens["<|eom_id|>"], specialTokens["<|eot_id|>"]},
	}
	return result, nil
}
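Llama 3.1's tokenizer.model contains 128,000 base tokens, so with this numbering the special tokens receive IDs starting at 128000, and the total vocabulary size becomes 128,256 (128,000 base + 256 special). A small standalone illustration of the same numbering loop, using the first two special token names from the code above:

package main

import "fmt"

func main() {
	baseTokensCount := 128000 // number of base (mergeable) tokens in Llama 3.1
	specialTokensArr := []string{"<|begin_of_text|>", "<|end_of_text|>"}
	specialTokens := make(map[string]int)
	for i, token := range specialTokensArr {
		specialTokens[token] = baseTokensCount + i
	}
	fmt.Println(specialTokens["<|begin_of_text|>"]) // 128000
	fmt.Println(specialTokens["<|end_of_text|>"])   // 128001
}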
5.2. Returning Vocabulary Model¶
We get the ModelData object as vocabBpe, then call the NewVocabulary(...) function with it. This function creates and returns a Vocabulary object that has TokenToId and IdToToken maps to provide two-way lookup between token strings and token IDs.
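The NewVocabulary(...) implementation isn't reproduced in this chapter. A minimal sketch of the idea, assuming map-typed TokenToId and IdToToken fields (the project's actual Vocabulary struct may use different types and carry additional fields), could look like:

type Vocabulary struct {
	TokenToId map[string]int
	IdToToken map[int]string
}

func NewVocabulary(modelData *tiktoken.ModelData) *Vocabulary {
	vocab := &Vocabulary{
		TokenToId: make(map[string]int),
		IdToToken: make(map[int]string),
	}
	// Base BPE tokens: decoded token bytes -> rank (token id).
	for token, id := range modelData.MergeableRanks {
		vocab.TokenToId[token] = id
		vocab.IdToToken[id] = token
	}
	// Special tokens such as "<|begin_of_text|>", placed after the base tokens.
	for token, id := range modelData.SpecialTokens {
		vocab.TokenToId[token] = id
		vocab.IdToToken[id] = token
	}
	return vocab
}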
Then, we assign the Vocabulary object to the model.Vocabulary property.
from src/model/loader.go
func loadVocab(modelDir string, model *Model) error {
	vocabFilePath := filepath.Join(modelDir, "tokenizer.model")
	common.GLogger.ConsolePrintf("Loading vocabulary/tokens file: \"%s\"...", vocabFilePath)
	vocabBpe, err := tiktoken.Load(vocabFilePath)
	if err != nil {
		return err
	}
	model.Vocabulary = NewVocabulary(vocabBpe)
	common.GLogger.ConsolePrintf("Found %d tokens in the model.", len(model.Vocabulary.IdToToken))
	return nil
}
After loading completes, we can see output lines in the console as follows: