6. OBSOLETE - LOADING LLAMA 2 TOKENIZER MODEL
Obsolescence note:
This chapter covers the older tokenizer, a SentencePiece (SPM) model, used by the Llama 1 and Llama 2 models. We chose not to remove the chapter entirely, because it contains comprehensive information about the SentencePiece (SPM) model and Protocol Buffers.
Ways to go:
- If you only want to learn about what the latest Llama version (3.1) uses, please continue with the next chapter: BFLOAT16 DATA TYPE
- Otherwise, if you are curious about the components used by older Llama versions, even though Llama 3.1 no longer uses them, please continue reading this chapter.
---
Obsolete Llama 2 content starts
---
6.0. The tokenizer model used by Llama 2
In this chapter, we'll walk through the process of loading the tokenizer (vocabulary) model stored in the "tokenizer.model" file.
In our case, Llama 2's tokenizer file, tokenizer.model, stores a SentencePiece (SPM) tokenizer model in Protobuf message format.
Protobuf operates by adhering to a descriptor, which serves as a blueprint or schema defining the structure and data types within a serialized message. This descriptor defines message structures and guides the serializer and deserializer.
See: Protocol Buffers | Protobuf definition best practices | Protocol Buffers (ProtoBuf) with GoLang | Protocol Buffer Basics: Go | Protocol Buffer Encoding
The descriptor for deserializing our SentencePiece model file is a Protobuf structure that we have defined in Go as the modelprotoDescriptor variable, in llama2/src/sentencepiece/model.go. This style of descriptor definition is specific to our code infrastructure.
Although many libraries implement this messaging format, in this project we implement it from scratch ourselves, in keeping with the "nuts and bolts" mindset.
6.0.1. Calling loadVocab() and Creating ProtobufReader
loadVocab() is called if includeVocab is true.
from llama2/src/model/loader.go
func LoadModelEx(modelDir string, includeTensors bool, includeVocab bool) (*Model, error) {
	model := &Model{}
	...
	if includeVocab {
		err := loadVocab(modelDir, model)
		if err != nil {
			return nil, err
		}
	}
	...
}

func loadVocab(modelDir string, model *Model) error {
	vocabFilePath := filepath.Join(modelDir, "tokenizer.model")
	common.GLogger.ConsolePrintf("Loading vocabulary/tokens file: \"%s\"...", vocabFilePath)
	vocabModelProto, err := sentencepiece.Load(vocabFilePath)
	if err != nil {
		return err
	}
	model.Vocabulary = NewVocabulary(vocabModelProto)
	common.GLogger.ConsolePrintf("Found %d tokens in the model.", len(model.Vocabulary.IdToToken))
	return nil
}
In the sentencepiece.Load(...) function, we get a file instance by opening the specified file. Then, we call the protobuf.NewProtobufReader(...) function, providing the file instance along with the modelprotoDescriptor variable defined in llama2/src/sentencepiece/model.go.
from llama2/src/sentencepiece/sentencepiecereader.go
func Load(vocabFilePath string) (*ModelProto, error) {
	vocabFile, err := os.Open(vocabFilePath)
	if err != nil {
		return nil, err
	}
	defer vocabFile.Close()
	vocabReader := protobuf.NewProtobufReader(vocabFile, modelprotoDescriptor)
	...
}
6.0.2. Calling ProtobufReader.Unmarshal()
When we call vocabReader.Unmarshal(), it reads the given "tokenizer.model" file under the guidance of the given modelprotoDescriptor. At the end of this process, it returns a ModelProto object that contains Pieces (token definitions) and other specifications of the tokenizer model.
Note: If you're curious about the details of how the Protobuf file structure can be read, please refer to: 6. LOADING TOKENIZER MODEL (DETAILS)
from llama2/src/sentencepiece/sentencepiecereader.go
func Load(vocabFilePath string) (*ModelProto, error) {
	...
	modelVal, err := vocabReader.Unmarshal()
	if err != nil {
		return nil, err
	}
	// The main object is a *ModelProto, so assert to the pointer type.
	model, ok := modelVal.(*ModelProto)
	if !ok {
		return nil, fmt.Errorf("cannot convert %v to *ModelProto", modelVal)
	}
	return model, nil
}
6.0.3. Returning Vocabulary Model
We get the ModelProto object as vocabModelProto, then we call the NewVocabulary(...) function with it. This function creates and returns a Vocabulary object that has TokenToId and IdToToken maps to provide two-way lookup.
Then, we assign the Vocabulary object to the model.Vocabulary property.
from llama2/src/model/loader.go
func loadVocab(modelDir string, model *Model) error {
	...
	vocabModelProto, err := sentencepiece.Load(vocabFilePath)
	if err != nil {
		return err
	}
	model.Vocabulary = NewVocabulary(vocabModelProto)
	common.GLogger.ConsolePrintf("Found %d tokens in the model.", len(model.Vocabulary.IdToToken))
	return nil
}
And we can see output lines in the console as follows:
[INFO] ... Loading vocabulary/tokens file: "/workspace/models-original/7B-chat/tokenizer.model"...
[INFO] ... Found 32000 tokens in the model.
Model "/workspace/models-original/7B-chat" was loaded.
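The two-way lookup described in this section can be sketched roughly as follows. This is only an illustration, not the project's NewVocabulary: the piece type and the newVocabularySketch function are hypothetical names, and the sketch assumes that a token's ID is simply its position in the list of pieces.

```go
package main

import "fmt"

// piece is a hypothetical, simplified stand-in for a SentencePiece entry.
type piece struct {
	Text  string
	Score float32
}

// vocabularySketch holds both lookup directions.
type vocabularySketch struct {
	TokenToId map[string]int
	IdToToken map[int]piece
}

// newVocabularySketch assumes a token's ID is its index in pieces.
func newVocabularySketch(pieces []piece) *vocabularySketch {
	v := &vocabularySketch{
		TokenToId: make(map[string]int, len(pieces)),
		IdToToken: make(map[int]piece, len(pieces)),
	}
	for id, p := range pieces {
		v.TokenToId[p.Text] = id
		v.IdToToken[id] = p
	}
	return v
}

func main() {
	v := newVocabularySketch([]piece{{"<unk>", 0}, {"<s>", 0}, {"</s>", 0}})
	fmt.Println(v.TokenToId["<s>"], v.IdToToken[2].Text) // 1 </s>
}
```

Keeping both maps trades a little memory for constant-time lookups in either direction, which suits a vocabulary that is built once and queried many times.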
6.1. Creating Main Object
Here, we'll walk through the process of reading some pieces (vocabulary tokens) stored in the "tokenizer.model" file, step by step.
Protobuf messages may consist of multiple separate sub-messages/objects/object types, but for the deserialization process, we need a root destination object. We call this the "main object" in our project.
The ProtobufReader.Unmarshal() function calls the given Protobuf descriptor's ProtoDescriptor.MainObjectConstructorFn function. In our case, it is modelprotoDescriptor.MainObjectConstructorFn. It instantiates an empty ModelProto with an empty Pieces array.
Once we have a main object, we can start to read and process the "message"s in the file.
The Protobuf protocol/format consists of "message"s, each with a "number" (type identifier), and the reader has "message processor" functions corresponding to these "number"s (type identifiers).
To read the Protobuf file/stream, we have a simple flow:
- We initiate a loop,
- Read a message (via ProtobufReader.readMessage()),
- If we are at EOF (end of file), break the loop,
- Find the corresponding message processor function for message.Number in the ProtoDescriptor.MessageProcessorFns function map,
- Execute it (this function makes changes on the main object if needed),
- Check for errors, and continue if there is no error.
- We return this main object.
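The flow above can be sketched as a small read-and-dispatch loop. The types below are simplified, hypothetical stand-ins for the project's ProtoDescriptor and Message, the next function stands in for ProtobufReader.readMessage(), and returning an error for an unknown message number is an assumption of this sketch rather than confirmed project behavior.

```go
package main

import "fmt"

// message mirrors the idea of a numbered Protobuf message (hypothetical type).
type message struct {
	Number int32
	Value  interface{}
}

// descriptor is a simplified stand-in for the descriptor described above.
type descriptor struct {
	MainObjectConstructorFn func() interface{}
	MessageProcessorFns     map[int32]func(mainObject interface{}, m message)
}

// unmarshalSketch creates the main object, then reads and dispatches messages
// until next reports ok=false (end of input).
func unmarshalSketch(d descriptor, next func() (message, bool)) (interface{}, error) {
	mainObject := d.MainObjectConstructorFn()
	for {
		m, ok := next()
		if !ok {
			break // EOF
		}
		fn, found := d.MessageProcessorFns[m.Number]
		if !found {
			return nil, fmt.Errorf("no processor for message number %d", m.Number)
		}
		fn(mainObject, m) // may modify the main object
	}
	return mainObject, nil
}

func main() {
	d := descriptor{
		MainObjectConstructorFn: func() interface{} { return &[]string{} },
		MessageProcessorFns: map[int32]func(interface{}, message){
			1: func(mo interface{}, m message) {
				pieces := mo.(*[]string)
				*pieces = append(*pieces, m.Value.(string))
			},
		},
	}
	queue := []message{{1, "<unk>"}, {1, "<s>"}}
	next := func() (message, bool) {
		if len(queue) == 0 {
			return message{}, false
		}
		m := queue[0]
		queue = queue[1:]
		return m, true
	}
	obj, err := unmarshalSketch(d, next)
	fmt.Println(*obj.(*[]string), err) // [<unk> <s>] <nil>
}
```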
6.2. Reading a Message
We initiate a loop that repeatedly calls the ProtobufReader.readMessage() method. This loop continues until it encounters EOF (end of file) or an error.
The ProtobufReader.readMessage() method:
- Checks if we are at EOF (end of file). If so, returns with ok=false to signal that reading has finished,
- Calls the ProtobufReader.readField(...) method to read the "message number" and "message data",
- Returns the successfully read Message object, or ok=false.
from llama2/src/protobuf/protobufreader.go
func (pbr *ProtobufReader) readMessage() (message *Message, ok bool) {
	_, err := pbr.fileReader.Peek(1)
	if err != nil {
		return nil, false
	}
	number, item, ok := pbr.readField(pbr.fileReader)
	if !ok {
		return nil, false
	}
	return &Message{number, item}, true
}
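The end-of-input check at the top of readMessage relies on bufio.Reader.Peek, which looks at upcoming bytes without consuming them and returns an error once nothing is left. The standalone sketch below isolates that idea; countBytes is a hypothetical helper written for this illustration and is not part of the project.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
)

// countBytes consumes data one byte at a time, using Peek before each read
// to detect the end of input, and returns how many bytes were consumed.
func countBytes(data []byte) int {
	r := bufio.NewReader(bytes.NewReader(data))
	n := 0
	for {
		if _, err := r.Peek(1); err != nil {
			return n // io.EOF (or another error): stop reading
		}
		if _, err := r.ReadByte(); err != nil {
			return n
		}
		n++
	}
}

func main() {
	fmt.Println(countBytes([]byte("abc")), countBytes(nil)) // 3 0
}
```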
The ProtobufReader.readField(...) method:
This part was inspired by the "protowire" implementation in Google's original Protobuf project, see: wire.go of original protobuf-go Project
- Backs up the current file position, because this method is not completely deterministic and involves some heuristics. It continues to read the file/stream under certain assumptions and expectations; if it encounters a situation that contradicts them, it reverts the stream position to the backed-up location, then tries other fallback strategies.
- Calls pbr.readTag(...), which reads a "varint" (uint64) and extracts the number (int32) and type_ (int8) values by bitwise operations on the read uint64 value.
    This number (int32) is the identifier number, which corresponds to the key numbers in our MessageProcessorFns map in modelprotoDescriptor. type_ identifies the data type of the current field/message data.
    Note that, in this project, we only need to read a Protobuf file/stream, not write one, and we have implemented only the data types that are used in the model file.
from llama2/src/protobuf/protobufreader.go
func (pbr *ProtobufReader) readField(r *bufio.Reader) (number Number, result interface{}, ok bool) {
...
}
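The tag-splitting step can be illustrated with the standard Protobuf wire format, in which the lowest three bits of the tag varint carry the wire type and the remaining upper bits carry the field number. The sketch below uses binary.Uvarint from Go's encoding/binary package; splitTag is a hypothetical name, and this is not the project's readTag implementation.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// splitTag decodes one varint from the start of buf and splits it into the
// field number (upper bits) and the wire type (lowest 3 bits). The third
// result is the number of bytes the varint occupied (<= 0 on failure).
func splitTag(buf []byte) (int32, int8, int) {
	v, n := binary.Uvarint(buf)
	if n <= 0 {
		return 0, 0, n
	}
	return int32(v >> 3), int8(v & 7), n
}

func main() {
	// 0x0A is the tag byte '\n' that starts a piece message.
	number, wireType, n := splitTag([]byte{0x0A})
	fmt.Println(number, wireType, n) // 1 2 1
}
```

The same arithmetic explains why the character '<' (0x3C) is read as number 7 with type 4 when a plain string is tentatively parsed as a nested message.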
6.3. Reading Tokens and Other Structures
6.3.1. Reading 0th token
... Reading 0th token
- Call the ProtobufReader.readMessage(...) method. Depth: 0 (reading one message, on pbr.fileReader)
    - Call the ProtobufReader.readField(...) method on pbr.fileReader. Depth: 1 (reading one field for one message, on pbr.fileReader)
        - Read tag. number: 1, type: 2 (BytesType)
        - BytesType means it contains a byte array or string,
        - Instantiate a resultMap, which has keys of type Number (int32) and values of type interface{},
        - Call pbr.readValueBytes:
            - Read a "varint": 14, this value is the length of the byte sequence,
            - Read 14 bytes into the buffer: "\n\x05<unk>\x15\x00\x00\x00\x00\x18\x02" (printed in string form). You can see a meaningful piece: "<unk>",
            - Return this byte sequence.
        - Instantiate another reader, localReader, dedicated to this 14-byte sequence,
        - Initiate a loop (to traverse this 14-byte sequence)
            - Iteration 1:
                - Call the ProtobufReader.readField(...) method on localReader. Depth: 2 (to traverse this 14-byte sequence)
                    - Read tag. number: 1, type: 2 (BytesType)
                    - BytesType means it contains a byte array or string,
                    - Instantiate a resultMap, which has keys of type Number (int32) and values of type interface{},
                    - Call the pbr.readValueBytes method on localReader:
                        - Read a "varint": 5, this value is the length of the byte sequence,
                        - Read 5 bytes into the buffer: "<unk>" (printed in string form). Yes, this is our first extracted token string: "<unk>",
                        - Return this byte sequence.
                    - Instantiate another reader, localReader, dedicated to this 5-byte sequence. Don't forget, we are in another inner call; think of it as recursion.
                    - Initiate a loop (to traverse this 5-byte sequence)
                        - Iteration 1:
                            - Call the ProtobufReader.readField(...) method on localReader. Depth: 3 (to traverse this 5-byte sequence)
                                - Read tag. number: 7, type: 4 (EndGroupType),
                                - Return ok=false, to let the parent method undo the read and fall back to the string interpretation.
                            - Because of the returned allOk=false, we call pbr.undoRead(...) on localReader to revert the reader to its previous position,
                            - Break the loop
                    - Loop was finished
                    - Reading the bytes as a nested message failed, so it continues by trying to read the sequence as a string,
                    - Check if the byte sequence is valid UTF-8 with utf8.Valid,
                    - Yes, it's a valid string, return: number: 1, result: "<unk>".
                - Set resultMap's entry: key: 1, value: "<unk>"
            - Iteration 2:
                - Call the ProtobufReader.readField(...) method on localReader. Depth: 2 (to traverse this 14-byte sequence)
                    - Read tag. number: 2, type: 5 (Fixed32Type)
                    - Fixed32Type means it contains a 4-byte float32 value,
                    - Read 4 bytes and convert them to a float32 in little-endian byte order: the read value is 0,
                    - Return this float32 value.
                - Set resultMap's entry: key: 2, value: 0 (float32)
            - Iteration 3:
                - Call the ProtobufReader.readField(...) method on localReader. Depth: 2 (to traverse this 14-byte sequence)
                    - Read tag. number: 3, type: 0 (VarintType)
                    - VarintType means it contains a variable-length integer,
                    - Read it and convert it to an int64: the read value is 2,
                    - Return this int64 value.
                - Set resultMap's entry: key: 3, value: 2 (int64)
            - Iteration 4:
                - Check for EOF (end of file) on the current 14-byte sequence: yes
                - Break the loop
        - Loop was finished
        - Return the resultMap as the field value.
    - Instantiate a Message object and return it.
- Find the message processor function corresponding to our message.Number=1 in the ProtoDescriptor.MessageProcessorFns function map,
- Execute it,
- The modelprotoDescriptor.MessageProcessorFns[1] function converts message.Value to props: map[protobuf.Number]interface{},
    - props[1]: Piece (string), string value of the token,
    - props[2]: Score (float32), score of the token,
    - props[3]: PieceType (Type/byte), type of the token (can be sentencepiece.NORMAL, sentencepiece.CONTROL, sentencepiece.BYTE, etc.; these constants are defined in llama2/src/sentencepiece/model.go). If it is not present in the props map, the default is sentencepiece.NORMAL,
    - Then it instantiates a new sentencepiece.SentencePiece from the props map,
    - Appends it to the main object's Pieces array.
from llama2/src/sentencepiece/model.go
var modelprotoDescriptor = protobuf.ProtoDescriptor{
	...
	MessageProcessorFns: map[protobuf.Number]func(interface{}, protobuf.Message){
		1: func(mainObject interface{}, message protobuf.Message) {
			mo := mainObject.(*ModelProto)
			props := message.Value.(map[protobuf.Number]interface{})
			pieceTypeVal, err := common.InterfaceToInt(props[3])
			if err != nil {
				pieceTypeVal = int(NORMAL)
			}
			item := newSentencePiece(props[1].(string), props[2].(float32), Type(pieceTypeVal))
			*mo.Pieces = append(*mo.Pieces, item)
		},
		...
	},
}
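To make the byte-level walk concrete, the 14-byte sequence traced for the 0th token can be decoded by hand. The sketch below is deliberately simpler than the project's reader: it assumes the field meanings (1: text, 2: score, 3: type) are known in advance, so it needs none of the fallback heuristics, and decodePiece is a hypothetical name introduced only for this example.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// decodePiece walks the fields of one serialized piece and returns its
// text (field 1), score (field 2) and type (field 3).
// Only the three wire types seen in the walkthrough are handled.
func decodePiece(buf []byte) (text string, score float32, pieceType int64, err error) {
	for len(buf) > 0 {
		tag, n := binary.Uvarint(buf)
		if n <= 0 {
			return "", 0, 0, fmt.Errorf("bad tag")
		}
		buf = buf[n:]
		number, wireType := tag>>3, tag&7
		switch wireType {
		case 0: // varint
			v, n := binary.Uvarint(buf)
			if n <= 0 {
				return "", 0, 0, fmt.Errorf("bad varint")
			}
			buf = buf[n:]
			if number == 3 {
				pieceType = int64(v)
			}
		case 2: // length-delimited bytes
			length, n := binary.Uvarint(buf)
			if n <= 0 || uint64(len(buf)-n) < length {
				return "", 0, 0, fmt.Errorf("bad length")
			}
			if number == 1 {
				text = string(buf[n : n+int(length)])
			}
			buf = buf[n+int(length):]
		case 5: // fixed 32-bit value, little-endian
			if len(buf) < 4 {
				return "", 0, 0, fmt.Errorf("short fixed32")
			}
			if number == 2 {
				score = math.Float32frombits(binary.LittleEndian.Uint32(buf))
			}
			buf = buf[4:]
		default:
			return "", 0, 0, fmt.Errorf("unsupported wire type %d", wireType)
		}
	}
	return text, score, pieceType, nil
}

func main() {
	raw := []byte("\n\x05<unk>\x15\x00\x00\x00\x00\x18\x02")
	fmt.Println(decodePiece(raw)) // <unk> 0 2 <nil>
}
```

The decoded type value 2 matches the PieceType shown for "<unk>" in the Pieces listings of this chapter.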
6.3.2. Reading 1st token
... Reading 1st token
- Call the ProtobufReader.readMessage(...) method,
    - Call the ProtobufReader.readField(...) method on pbr.fileReader,
        - Do the same things we dove into for the 0th token above,
        - Return the resultMap as the field value.
    - Instantiate a Message object and return it.
- Do other stuff...
mainObject.Pieces: {
{Piece: "<unk>", Score: 0, PieceType: 2(sentencepiece.UNKNOWN), ByteFallback: 0x00},
{Piece: "<s>", Score: 0, PieceType: 3(sentencepiece.CONTROL), ByteFallback: 0x00}
}
6.3.3. Reading some other tokens
..... Some steps were taken
... Reading 13th token
- Call the ProtobufReader.readMessage(...) method,
    - Call the ProtobufReader.readField(...) method on pbr.fileReader, doing the same things we dove into for the 0th token above,
    - Instantiate a Message object and return it.
- Do other stuff...
mainObject.Pieces: {
{Piece: "<unk>", Score: 0, PieceType: 2(sentencepiece.UNKNOWN), ByteFallback: 0x00},
{Piece: "<s>", Score: 0, PieceType: 3(sentencepiece.CONTROL), ByteFallback: 0x00},
...
{Piece: "<0x0A>", Score: 0, PieceType: 6(sentencepiece.BYTE), ByteFallback: 0x0A},
}
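Pieces of type sentencepiece.BYTE carry their byte value in the piece text itself, in the form "<0xNN>". How the project fills the ByteFallback field is not shown in this chapter; the following sketch shows one plausible way to derive it from the text. byteFromPiece is a hypothetical helper, not a function from the project.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// byteFromPiece extracts the byte value from a piece written as "<0xNN>".
// It reports ok=false for any text that does not match that shape.
func byteFromPiece(text string) (b byte, ok bool) {
	if len(text) != 6 || !strings.HasPrefix(text, "<0x") || !strings.HasSuffix(text, ">") {
		return 0, false
	}
	v, err := strconv.ParseUint(text[3:5], 16, 8)
	if err != nil {
		return 0, false
	}
	return byte(v), true
}

func main() {
	b, ok := byteFromPiece("<0x0A>")
	fmt.Printf("0x%02X %v\n", b, ok) // 0x0A true
}
```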
..... Some steps were taken
... Reading 259th token
- Call the ProtobufReader.readMessage(...) method,
    - Call the ProtobufReader.readField(...) method on pbr.fileReader, doing the same things we dove into for the 0th token above,
    - Instantiate a Message object and return it.
- Do other stuff...
mainObject.Pieces: {
{Piece: "<unk>", Score: 0, PieceType: 2(sentencepiece.UNKNOWN), ByteFallback: 0},
{Piece: "<s>", Score: 0, PieceType: 3(sentencepiece.CONTROL), ByteFallback: 0},
...
{Piece: "<0x0A>", Score: 0, PieceType: 6(sentencepiece.BYTE), ByteFallback: 0x0A},
...
{Piece: "▁▁", Score: -1000000000, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
}
... Reading 260th token
- Call the ProtobufReader.readMessage(...) method,
    - Call the ProtobufReader.readField(...) method on pbr.fileReader, doing the same things we dove into for the 0th token above,
    - Instantiate a Message object and return it.
- Do other stuff...
mainObject.Pieces: {
{Piece: "<unk>", Score: 0, PieceType: 2(sentencepiece.UNKNOWN), ByteFallback: 0},
{Piece: "<s>", Score: 0, PieceType: 3(sentencepiece.CONTROL), ByteFallback: 0},
...
{Piece: "<0x0A>", Score: 0, PieceType: 6(sentencepiece.BYTE), ByteFallback: 0x0A},
...
{Piece: "▁▁", Score: -1000000000, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
{Piece: "▁t", Score: -1, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
}
... Reading 261st token
- Call the ProtobufReader.readMessage(...) method,
    - Call the ProtobufReader.readField(...) method on pbr.fileReader, doing the same things we dove into for the 0th token above,
    - Instantiate a Message object and return it.
- Do other stuff...
mainObject.Pieces: {
{Piece: "<unk>", Score: 0, PieceType: 2(sentencepiece.UNKNOWN), ByteFallback: 0},
{Piece: "<s>", Score: 0, PieceType: 3(sentencepiece.CONTROL), ByteFallback: 0},
...
{Piece: "<0x0A>", Score: 0, PieceType: 6(sentencepiece.BYTE), ByteFallback: 0x0A},
...
{Piece: "▁▁", Score: -1000000000, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
{Piece: "▁t", Score: -1, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
{Piece: "er", Score: -2, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
}
..... Some steps were taken
... Reading 1,001st token
- Call the ProtobufReader.readMessage(...) method,
    - Call the ProtobufReader.readField(...) method on pbr.fileReader, doing the same things we dove into for the 0th token above,
    - Instantiate a Message object and return it.
- Do other stuff...
mainObject.Pieces: {
{Piece: "<unk>", Score: 0, PieceType: 2(sentencepiece.UNKNOWN), ByteFallback: 0},
{Piece: "<s>", Score: 0, PieceType: 3(sentencepiece.CONTROL), ByteFallback: 0},
...
{Piece: "<0x0A>", Score: 0, PieceType: 6(sentencepiece.BYTE), ByteFallback: 0x0A},
...
{Piece: "▁▁", Score: -1000000000, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
{Piece: "▁t", Score: -1, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
{Piece: "er", Score: -2, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
...
{Piece: "ied", Score: -741, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
}
..... Some steps were taken
... Reading 10,001st token
- Call the ProtobufReader.readMessage(...) method,
    - Call the ProtobufReader.readField(...) method on pbr.fileReader, doing the same things we dove into for the 0th token above,
    - Instantiate a Message object and return it.
- Do other stuff...
mainObject.Pieces: {
{Piece: "<unk>", Score: 0, PieceType: 2(sentencepiece.UNKNOWN), ByteFallback: 0},
{Piece: "<s>", Score: 0, PieceType: 3(sentencepiece.CONTROL), ByteFallback: 0},
...
{Piece: "<0x0A>", Score: 0, PieceType: 6(sentencepiece.BYTE), ByteFallback: 0x0A},
...
{Piece: "▁▁", Score: -1000000000, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
{Piece: "▁t", Score: -1, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
{Piece: "er", Score: -2, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
...
{Piece: "ied", Score: -741, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
...
{Piece: "ång", Score: -9741, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
}
..... Some steps were taken
... Reading 31,001st token
- Call the ProtobufReader.readMessage(...) method,
    - Call the ProtobufReader.readField(...) method on pbr.fileReader, doing the same things we dove into for the 0th token above,
    - Instantiate a Message object and return it.
- Do other stuff...
mainObject.Pieces: {
{Piece: "<unk>", Score: 0, PieceType: 2(sentencepiece.UNKNOWN), ByteFallback: 0},
{Piece: "<s>", Score: 0, PieceType: 3(sentencepiece.CONTROL), ByteFallback: 0},
...
{Piece: "<0x0A>", Score: 0, PieceType: 6(sentencepiece.BYTE), ByteFallback: 0x0A},
...
{Piece: "▁▁", Score: -1000000000, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
{Piece: "▁t", Score: -1, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
{Piece: "er", Score: -2, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
...
{Piece: "ied", Score: -741, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
...
{Piece: "ång", Score: -9741, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
...
{Piece: "동", Score: -30741, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
}
6.3.4. Reading TrainerSpec
..... Some steps were taken
..... After finishing all 32,000 tokens, we get a different message, which has Number=2
... Reading TrainerSpec
If you are curious about what a TrainerSpec is and what it contains, you can check out this Protobuf structure for details.
- Call the ProtobufReader.readMessage(...) method,
    - Call the ProtobufReader.readField(...) method on pbr.fileReader, doing the same things we dove into for the 0th token above,
    - Instantiate a Message object and return it:
Message{
    Number: 2,
    Value: {
        1: (string) "/large_experiments/theorem/datasets/MERGED/all.test1.merged"
        2: (string) "spm_model_32k_200M_charcov099995_allowWSO__v2"
        3: (int64) 2
        4: (int64) 32000
        6: (int64) 0
        7: (string) "text"
        10: (float32) 0.99995
        11: (int64) 200000000
        14: (int64) 1000000
        15: (float32) 0.75
        16: (int64) 80
        17: (int64) 2
        18: (int64) 4192
        19: (int64) 1
        20: (int64) 16
        21: (int64) 1
        22: (int64) 1
        23: (int64) 1
        24: (int64) 0
        25: (int64) 1
        26: (int64) 1
        32: (int64) 1
        33: (int64) 1
        34: (int64) 0
        35: (int64) 1
        36: ([]uint8) []
        40: (int64) 0
        41: (int64) 1
        42: (int64) 2
        43: (int64) -1
        44: (string) " ⁇ "
        45: (string) "<unk>"
        46: (string) "<s>"
        47: (string) "</s>"
        48: (string) "<pad>"
        49: (int64) 0
        50: (int64) 0
        51: (float32) 0
        52: (int64) 0
    }
}
- We don't use this information, so the message processor does nothing, as follows:
from llama2/src/sentencepiece/model.go
var modelprotoDescriptor = protobuf.ProtoDescriptor{
	...
	MessageProcessorFns: map[protobuf.Number]func(interface{}, protobuf.Message){
		...
		2: func(mainObject interface{}, message protobuf.Message) {
			// Do nothing, we don't need TrainerSpec at this time.
		},
		...
	},
}
6.3.5. Reading NormalizerSpec
... Reading NormalizerSpec
- Call the ProtobufReader.readMessage(...) method,
    - Call the ProtobufReader.readField(...) method on pbr.fileReader, doing the same things we dove into for the 0th token above,
    - Instantiate a Message object and return it.
- We don't use this information, but we convert it as follows:
from llama2/src/sentencepiece/model.go
var modelprotoDescriptor = protobuf.ProtoDescriptor{
	...
	MessageProcessorFns: map[protobuf.Number]func(interface{}, protobuf.Message){
		...
		3: func(mainObject interface{}, message protobuf.Message) {
			mo := mainObject.(*ModelProto)
			props := message.Value.(map[protobuf.Number]interface{})
			ns := NormalizerSpec{}
			ns.Name = props[1].(string)
			ns.PrecompiledCharsmap = props[2].([]byte)
			ns.AddDummyPrefix = common.InterfaceToBool(props[3], true)
			ns.RemoveExtraWhitespaces = common.InterfaceToBool(props[4], true)
			ns.EscapeWhitespaces = common.InterfaceToBool(props[5], true)
			// Field 6 may arrive as a string or as raw bytes, depending on
			// how the reader's string fallback classified it.
			stringVal, ok := props[6].(string)
			if !ok {
				byteArrVal, ok := props[6].([]byte)
				if !ok {
					stringVal = ""
				} else {
					stringVal = string(byteArrVal)
				}
			}
			ns.NormalizationRuleTsv = stringVal
			mo.NormalizerSpec = &ns
		},
	},
}
- Finished.
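The NormalizerSpec processor passes a default value of true for its boolean fields, because Protobuf omits optional fields from the serialized data when they are not set. The sketch below shows one way such a "value or default" helper could behave. boolOrDefault is a hypothetical function, and its handling of int64 values is an assumption of this sketch; it does not reproduce the project's common.InterfaceToBool.

```go
package main

import "fmt"

// boolOrDefault returns the boolean meaning of v, or def when v is nil or
// of an unexpected type. Varint-encoded booleans are assumed to arrive as
// int64 values, where any non-zero value means true.
func boolOrDefault(v interface{}, def bool) bool {
	switch x := v.(type) {
	case bool:
		return x
	case int64:
		return x != 0
	default:
		return def
	}
}

func main() {
	// Key 3 is present with value 0, key 4 is absent from the map.
	props := map[int32]interface{}{3: int64(0)}
	fmt.Println(boolOrDefault(props[3], true), boolOrDefault(props[4], true)) // false true
}
```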