5. LOADING TOKENIZER MODEL¶
In this chapter, we'll walk through the process of loading tokenizer (vocabulary) model stored in the "tokenizer.model" file.
In our case, LLamA 2's tokenizer file tokenizer.model stores a SentencePiece (SPM) tokenizer model in Protobuf message format.
Protobuf operates by adhering to a descriptor, which serves as a blueprint or schema defining the structure and data types within a serialized message. This descriptor defines message structures and guides the serializer and deserializer.
See: Protocol Buffers | Protobuf definition best practices | Protocol Buffers (ProtoBuf) with GoLang
The descriptor for deserializing our SentencePiece model file is this Protobuf structure, we have defined it in Go language as modelprotoDescriptor
variable, in src/sentencepiece/model.go. This modelprotoDescriptor
definition style is specific for our code infrastructure.
Although there are lots of libraries to implement this messaging format, in this project, we implement it from scratch ourselves as we always do in the "nuts and bolts" mindset.
5.1. Calling loadVocab() and Creating ProtobufReader¶
loadVocab() is called if includeVocab
is true.
from src/model/loader.go
func LoadModelEx(modelDir string, includeTensors bool, includeVocab bool) (*Model, error) {
model := &Model{}
...
if includeVocab {
err := loadVocab(modelDir, model)
if err != nil {
return nil, err
}
}
...
}
func loadVocab(modelDir string, model *Model) error {
vocabFilePath := filepath.Join(modelDir, "tokenizer.model")
common.GLogger.ConsolePrintf("Loading vocabulary/tokens file: \"%s\"...", vocabFilePath)
vocabModelProto, err := sentencepiece.Load(vocabFilePath)
if err != nil {
return err
}
model.Vocabulary = NewVocabulary(vocabModelProto)
common.GLogger.ConsolePrintf("Found %d tokens in the model.", len(model.Vocabulary.IdToToken))
return nil
}
In SentencePiece.Load(...)
function, we get a file instance by opening specified file. Then, we call protobuf.NewProtobufReader(...)
function by providing the file instance along with modelprotoDescriptor
variable defined in src/sentencepiece/model.go
from src/sentencepiece/sentencepiecereader.go
func Load(vocabFilePath string) (*ModelProto, error) {
vocabFile, err := os.Open(vocabFilePath)
if err != nil {
return nil, err
}
defer vocabFile.Close()
vocabReader := protobuf.NewProtobufReader(vocabFile, modelprotoDescriptor)
...
}
5.2. Calling ProtobufReader.Unmarshal()¶
When we call vocabReader.Unmarshal()
, it reads the given "tokenizer.model" file in guidance and help of the given modelprotoDescriptor
. At the end of this process, as modelprotoDescriptor
helps, it returns a ModelProto object that contains Pieces (token definitions) and other specifications of the tokenizer model.
Note: If you're curious about the details of how the Protobuf file structure can be read, please refer to: 6. LOADING TOKENIZER MODEL (DETAILS)
from src/sentencepiece/sentencepiecereader.go
func Load(vocabFilePath string) (*ModelProto, error) {
...
modelVal, err := vocabReader.Unmarshal()
if err != nil {
return nil, err
}
model, ok := modelVal.(*ModelProto)
if !ok {
return nil, fmt.Errorf("cannot convert %v to *ModelProto", model)
}
return &model, nil
}
5.3. Returning Vocabulary Model¶
We get the ModelProto object as vocabModelProto
, then we call NewVocabulary(...) function by specifying it. This function creates and returns a Vocabulary object that has TokenToId
, IdToTokenId
maps to provide two-way querying.
Then, we assign Vocabulary object to model.Vocabulary
property.
from src/model/loader.go
func loadVocab(modelDir string, model *Model) error {
vocabModelProto, err := sentencepiece.Load(vocabFilePath)
if err != nil {
return err
}
model.Vocabulary = NewVocabulary(vocabModelProto)
common.GLogger.ConsolePrintf("Found %d tokens in the model.", len(model.Vocabulary.IdToToken))
return nil
}
And we can see output lines in the console as follows: