6. LOADING TOKENIZER MODEL (DETAILS)
In this chapter, we'll walk through, step by step, the process of reading some pieces (vocabulary tokens) stored in the "tokenizer.model" file.
6.1. Creating Main Object
A Protobuf message may consist of multiple separate sub-messages/objects/object types, but the deserialization process needs a root destination object. We call this the "main object" in our project.
The ProtobufReader.Unmarshal() function calls the ProtoDescriptor.MainObjectConstructorFn function of the given Protobuf descriptor. In our case, it is modelprotoDescriptor.MainObjectConstructorFn. It instantiates an empty ModelProto with an empty Pieces array.
Once we have a main object, we can start to read and process the "message"s in the file.
The Protobuf protocol/format consists of "message"s, each with a "number" (type identifier), and the reader has a "message processor" function for each "number".
To read the Protobuf file/stream, we have a simple flow:
- We initiate a loop,
- Read a message (via ProtobufReader.readMessage()),
- If we are at EOF (end of file), break the loop,
- Find the message processor function corresponding to message.Number in the ProtoDescriptor.MessageProcessorFns function map,
- Execute it (this function makes changes on the main object if needed),
- Check for errors, and continue if there is no error,
- After the loop, we return the main object.
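The flow above can be sketched as a small standalone program. This is a minimal sketch, not the project's ProtobufReader.Unmarshal(): the types are simplified mirrors of the ones named in this chapter, the unmarshal function and the next callback (a stand-in for ProtobufReader.readMessage()) are hypothetical, and the main object is just a slice of strings.

```go
package main

import "fmt"

// Number is the Protobuf field number (simplified mirror of the project's type).
type Number int32

// Message pairs a field number with its decoded value.
type Message struct {
	Number Number
	Value  interface{}
}

// ProtoDescriptor bundles the main-object constructor with per-number processors.
type ProtoDescriptor struct {
	MainObjectConstructorFn func() interface{}
	MessageProcessorFns     map[Number]func(mainObject interface{}, message Message)
}

// unmarshal (hypothetical) drains next() until it reports ok=false and
// dispatches every message to the processor registered for its number.
func unmarshal(desc ProtoDescriptor, next func() (*Message, bool)) (interface{}, error) {
	mainObject := desc.MainObjectConstructorFn()
	for {
		message, ok := next()
		if !ok {
			break // EOF (or a read failure) ends the loop
		}
		processorFn, found := desc.MessageProcessorFns[message.Number]
		if !found {
			return nil, fmt.Errorf("no processor for message number %d", message.Number)
		}
		processorFn(mainObject, *message)
	}
	return mainObject, nil
}

func main() {
	// A fake message source standing in for ProtobufReader.readMessage().
	queue := []Message{{1, "<unk>"}, {1, "<s>"}}
	next := func() (*Message, bool) {
		if len(queue) == 0 {
			return nil, false
		}
		m := queue[0]
		queue = queue[1:]
		return &m, true
	}
	desc := ProtoDescriptor{
		MainObjectConstructorFn: func() interface{} { return &[]string{} },
		MessageProcessorFns: map[Number]func(interface{}, Message){
			1: func(mainObject interface{}, message Message) {
				pieces := mainObject.(*[]string)
				*pieces = append(*pieces, message.Value.(string))
			},
		},
	}
	result, err := unmarshal(desc, next)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(*result.(*[]string))
}
```

Keeping the processors in a map keyed by message number means that supporting a new message type only requires registering one more function; the loop itself stays unchanged.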
6.2. Reading a Message
We initiate a loop that repeatedly calls the ProtobufReader.readMessage() method. This loop continues until it encounters EOF (end of file) or an error.
The ProtobufReader.readMessage() method:
- Checks if we are at the EOF (end of file). If yes, it returns with ok=false to signal that reading has finished,
- Calls the ProtobufReader.readField(...) method to read the "message number" and "message data",
- Returns the successfully read Message object, or ok=false.
from src/protobuf/protobufreader.go
func (pbr *ProtobufReader) readMessage() (message *Message, ok bool) {
	// Peek does not consume input; an error here means EOF or a failed read.
	_, err := pbr.fileReader.Peek(1)
	if err != nil {
		return nil, false
	}
	// Read the message number and the message data.
	number, item, ok := pbr.readField(pbr.fileReader)
	if !ok {
		return nil, false
	}
	return &Message{number, item}, true
}
The ProtobufReader.readField(...) method:
This part was inspired by the "protowire" implementation in Google's original Protobuf project, see: wire.go of the original protobuf-go project.
- Backs up the current file position, because this method is not completely deterministic and relies on some heuristics. It continues to read the file/stream with some assumptions and expectations. If it encounters a situation that contradicts them, it reverts the stream position to the backed-up location, then tries other fallback strategies.
- Calls pbr.readTag(...), which reads a "varint" (uint64) and extracts the number (int32) and type_ (int8) values with bitwise operations on the read uint64 value.
    - number (int32) is the identifier number, which corresponds to the key numbers in our MessageProcessorFns map in modelprotoDescriptor.
    - type_ identifies the data type of the current field/message data.
Note that, in this project, we only need to read a Protobuf file/stream, not write one, so we have implemented only the data types that are used in the model file.
from src/protobuf/protobufreader.go
func (pbr *ProtobufReader) readField(r *bufio.Reader) (number Number, result interface{}, ok bool) {
...
}
6.3. Reading Tokens and Other Structures
6.3.1. Reading 0th token
... Reading 0th token
- Call the ProtobufReader.readMessage(...) method. Depth: 0 (reading one message, on pbr.fileReader)
    - Call the ProtobufReader.readField(...) method on pbr.fileReader. Depth: 1 (reading one field for one message, on pbr.fileReader)
        - Read tag. number: 1, type: 2 (BytesType),
        - BytesType means it contains a byte array or string,
        - Instantiate a resultMap, which maps Number (int32) keys to interface{} values,
        - Call pbr.readValueBytes:
            - Read a "varint": 14, this value is the length of the byte sequence,
            - Read 14 bytes into the buffer: "\n\x05\<unk>\x15\x00\x00\x00\x00\x18\x02" (printed in string form). You can see a meaningful piece: "\<unk>",
            - Return this byte sequence.
        - Instantiate another reader dedicated to this 14-byte sequence, localReader,
        - Initiate a loop (to traverse this 14-byte sequence):
            - Iteration 1:
                - Call the ProtobufReader.readField(...) method on localReader. Depth: 2 (to traverse this 14-byte sequence)
                    - Read tag. number: 1, type: 2 (BytesType),
                    - BytesType means it contains a byte array or string,
                    - Instantiate a resultMap, which maps Number (int32) keys to interface{} values,
                    - Call the pbr.readValueBytes method on localReader:
                        - Read a "varint": 5, this value is the length of the byte sequence,
                        - Read 5 bytes into the buffer: "\<unk>" (printed in string form). Yes, this is our first extracted token string: "\<unk>",
                        - Return this byte sequence.
                    - Instantiate another reader dedicated to this 5-byte sequence, localReader. Don't forget that we are in another inner call; think of it as recursion,
                    - Initiate a loop (to traverse this 5-byte sequence):
                        - Iteration 1:
                            - Call the ProtobufReader.readField(...) method on localReader. Depth: 3 (to traverse this 5-byte sequence)
                                - Read tag. number: 7, type: 4 (EndGroupType). The first byte is "<" (0x3C), so number = 0x3C >> 3 = 7 and type = 0x3C & 7 = 4,
                                - Return ok=false, to let the parent method undo the read and fall back to the string strategy.
                            - Because allOk=false was returned, we call pbr.undoRead(...) on localReader to revert the reader to its previous position,
                            - Break the loop.
                    - The loop has finished,
                    - Reading the byte sequence as a nested message failed, so it continues by trying to read the sequence as a string,
                    - Check whether the byte sequence is valid UTF-8 with utf8.Valid,
                    - Yes, it is a valid string, return: number: 1, result: "\<unk>".
                - Set resultMap's entry: key: 1, value: "\<unk>"
            - Iteration 2:
                - Call the ProtobufReader.readField(...) method on localReader. Depth: 2 (to traverse this 14-byte sequence)
                    - Read tag. number: 2, type: 5 (Fixed32Type),
                    - Fixed32Type means it contains a 4-byte value, read here as a float32,
                    - Read 4 bytes and convert them to a float32 in little-endian form: the read value is 0,
                    - Return this float32 value.
                - Set resultMap's entry: key: 2, value: 0 (float32)
            - Iteration 3:
                - Call the ProtobufReader.readField(...) method on localReader. Depth: 2 (to traverse this 14-byte sequence)
                    - Read tag. number: 3, type: 0 (VarintType),
                    - VarintType means it contains a variable-length integer, read here as an int64,
                    - Read it and convert it to an int64: the read value is 2,
                    - Return this int64 value.
                - Set resultMap's entry: key: 3, value: 2 (int64)
            - Iteration 4:
                - Check for EOF (end of file) on the current 14-byte sequence: yes,
                - Break the loop.
        - The loop has finished,
        - Return number: 1 and the filled resultMap (1: "\<unk>", 2: 0, 3: 2).
    - Instantiate a Message object and return it.
- Find the message processor function corresponding to our message.Number=1 in the ProtoDescriptor.MessageProcessorFns function map,
- Execute it,
- The modelprotoDescriptor.MessageProcessorFns[1] function converts message.Value to props: map[protobuf.Number]interface{},
    - props[1]: Piece (string), string value of the token,
    - props[2]: Score (float32), score of the token,
    - props[3]: PieceType (Type/byte), type of the token (one of the sentencepiece.NORMAL, sentencepiece.CONTROL, sentencepiece.BYTE, etc. constants defined in src/sentencepiece/model.go). If it is not present in the props map, the default is sentencepiece.NORMAL,
- Then it instantiates a new sentencepiece.SentencePiece from the props map,
- Appends it to the main object's Pieces array.
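The nested reading steps of this walkthrough (depth 1 to depth 3) can be reproduced with a small standalone decoder. This is a minimal sketch, not the project's readField: parseFields and decodeBytes are hypothetical names, it works on a bytes.Reader instead of a bufio.Reader with undoRead, and it supports only the three wire types that appear in this walkthrough.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
	"math"
	"unicode/utf8"
)

// Wire types used by the walkthrough.
const (
	varintType  = 0
	bytesType   = 2
	fixed32Type = 5
)

// parseFields decodes data as a sequence of fields. It reports ok=false as
// soon as something does not look like a field this sketch understands.
func parseFields(data []byte) (fields map[int32]interface{}, ok bool) {
	r := bytes.NewReader(data)
	fields = map[int32]interface{}{}
	for r.Len() > 0 {
		tag, err := binary.ReadUvarint(r)
		if err != nil {
			return nil, false
		}
		number, wireType := int32(tag>>3), tag&7
		switch wireType {
		case varintType:
			v, err := binary.ReadUvarint(r)
			if err != nil {
				return nil, false
			}
			fields[number] = int64(v)
		case fixed32Type:
			var raw [4]byte
			if _, err := io.ReadFull(r, raw[:]); err != nil {
				return nil, false
			}
			fields[number] = math.Float32frombits(binary.LittleEndian.Uint32(raw[:]))
		case bytesType:
			length, err := binary.ReadUvarint(r)
			if err != nil || length > uint64(r.Len()) {
				return nil, false
			}
			payload := make([]byte, length)
			if _, err := io.ReadFull(r, payload); err != nil {
				return nil, false
			}
			fields[number] = decodeBytes(payload)
		default:
			return nil, false // e.g. EndGroupType (4), as produced by '<'
		}
	}
	return fields, true
}

// decodeBytes applies the fallback order: nested message first, then UTF-8
// string, then raw bytes. An empty payload is kept as raw bytes.
func decodeBytes(payload []byte) interface{} {
	if len(payload) == 0 {
		return payload
	}
	if nested, ok := parseFields(payload); ok {
		return nested
	}
	if utf8.Valid(payload) {
		return string(payload)
	}
	return payload
}

func main() {
	piece := []byte("\n\x05<unk>\x15\x00\x00\x00\x00\x18\x02")
	fields, ok := parseFields(piece)
	fmt.Println(fields, ok)
}
```

Running it on the 14-byte sequence of the 0th token yields the same three entries as the walkthrough: key 1 holds "<unk>" (string), key 2 holds 0 (float32) and key 3 holds 2 (int64).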
from src/sentencepiece/model.go
var modelprotoDescriptor = protobuf.ProtoDescriptor{
	...
	MessageProcessorFns: map[protobuf.Number]func(interface{}, protobuf.Message){
		1: func(mainObject interface{}, message protobuf.Message) {
			mo := mainObject.(*ModelProto)
			props := message.Value.(map[protobuf.Number]interface{})
			// props[3] (PieceType) is optional; fall back to NORMAL when it cannot be converted.
			pieceTypeVal, err := common.InterfaceToInt(props[3])
			if err != nil {
				pieceTypeVal = int(NORMAL)
			}
			item := newSentencePiece(props[1].(string), props[2].(float32), Type(pieceTypeVal))
			*mo.Pieces = append(*mo.Pieces, item)
		},
		...
	},
}
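The processor above uses unchecked type assertions, which panic if a field is missing or has an unexpected type. As an illustration only, the same conversion can be written with checked assertions. In this minimal sketch, pieceFromProps is a hypothetical function, the types are simplified stand-ins for the project's ones, and the constant values follow the piece types printed in this chapter's dumps.

```go
package main

import "fmt"

type Number int32

// Type mirrors the piece type; values match the dumps in this chapter.
type Type byte

const (
	NORMAL  Type = 1
	UNKNOWN Type = 2
	CONTROL Type = 3
	BYTE    Type = 6
)

// SentencePiece is a simplified stand-in for sentencepiece.SentencePiece.
type SentencePiece struct {
	Piece     string
	Score     float32
	PieceType Type
}

// pieceFromProps (hypothetical) builds a SentencePiece from decoded fields.
// It returns an error instead of panicking when the piece string is missing.
func pieceFromProps(props map[Number]interface{}) (SentencePiece, error) {
	piece, ok := props[1].(string)
	if !ok {
		return SentencePiece{}, fmt.Errorf("field 1 (piece) is %T, want string", props[1])
	}
	score, _ := props[2].(float32) // stays 0 when the score is absent
	pieceType := NORMAL            // default when field 3 is absent
	if v, ok := props[3].(int64); ok {
		pieceType = Type(v)
	}
	return SentencePiece{Piece: piece, Score: score, PieceType: pieceType}, nil
}

func main() {
	p, err := pieceFromProps(map[Number]interface{}{1: "<unk>", 2: float32(0), 3: int64(2)})
	fmt.Printf("%+v %v\n", p, err)

	p, err = pieceFromProps(map[Number]interface{}{1: "er", 2: float32(-2)})
	fmt.Printf("%+v %v\n", p, err)
}
```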
6.3.2. Reading 1st token
... Reading 1st token
- Call the ProtobufReader.readMessage(...) method,
    - Call the ProtobufReader.readField(...) method on pbr.fileReader,
        - Do the things that we dove into for the 0th token above,
        - Return the number and the filled resultMap.
    - Instantiate a Message object and return it.
- Do other stuff...
mainObject.Pieces: {
{Piece: "<unk>", Score: 0, PieceType: 2(sentencepiece.UNKNOWN), ByteFallback: 0x00},
{Piece: "<s>", Score: 0, PieceType: 3(sentencepiece.CONTROL), ByteFallback: 0x00}
}
6.3.3. Reading some other tokens
..... Some steps were taken
... Reading 13th token
- Call the ProtobufReader.readMessage(...) method,
    - Call the ProtobufReader.readField(...) method on pbr.fileReader, and do the things that we dove into for the 0th token above,
    - Instantiate a Message object and return it.
- Do other stuff...
mainObject.Pieces: {
{Piece: "<unk>", Score: 0, PieceType: 2(sentencepiece.UNKNOWN), ByteFallback: 0x00},
{Piece: "<s>", Score: 0, PieceType: 3(sentencepiece.CONTROL), ByteFallback: 0x00},
...
{Piece: "<0x0A>", Score: 0, PieceType: 6(sentencepiece.BYTE), ByteFallback: 0x0A},
}
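In the dump above, the piece "<0x0A>" has PieceType BYTE and ByteFallback 0x0A, so the byte value is the hexadecimal number embedded in the piece string. The following minimal sketch shows one way such a value could be derived from the string; byteFallback is a hypothetical helper, and this chapter does not show how the project itself computes the field.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// byteFallback (hypothetical) extracts the byte value from a piece of the
// form "<0xNN>", where NN are two hexadecimal digits.
func byteFallback(piece string) (value byte, ok bool) {
	if len(piece) != 6 || !strings.HasPrefix(piece, "<0x") || !strings.HasSuffix(piece, ">") {
		return 0, false
	}
	parsed, err := strconv.ParseUint(piece[3:5], 16, 8)
	if err != nil {
		return 0, false
	}
	return byte(parsed), true
}

func main() {
	for _, piece := range []string{"<0x0A>", "<unk>", "▁t"} {
		value, ok := byteFallback(piece)
		fmt.Printf("%q -> 0x%02X, %v\n", piece, value, ok)
	}
}
```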
..... Some steps were taken
... Reading 259th token
- Call the ProtobufReader.readMessage(...) method,
    - Call the ProtobufReader.readField(...) method on pbr.fileReader, and do the things that we dove into for the 0th token above,
    - Instantiate a Message object and return it.
- Do other stuff...
mainObject.Pieces: {
{Piece: "<unk>", Score: 0, PieceType: 2(sentencepiece.UNKNOWN), ByteFallback: 0},
{Piece: "<s>", Score: 0, PieceType: 3(sentencepiece.CONTROL), ByteFallback: 0},
...
{Piece: "<0x0A>", Score: 0, PieceType: 6(sentencepiece.BYTE), ByteFallback: 0x0A},
...
{Piece: "▁▁", Score: -1000000000, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
}
... Reading 260th token
- Call the ProtobufReader.readMessage(...) method,
    - Call the ProtobufReader.readField(...) method on pbr.fileReader, and do the things that we dove into for the 0th token above,
    - Instantiate a Message object and return it.
- Do other stuff...
mainObject.Pieces: {
{Piece: "<unk>", Score: 0, PieceType: 2(sentencepiece.UNKNOWN), ByteFallback: 0},
{Piece: "<s>", Score: 0, PieceType: 3(sentencepiece.CONTROL), ByteFallback: 0},
...
{Piece: "<0x0A>", Score: 0, PieceType: 6(sentencepiece.BYTE), ByteFallback: 0x0A},
...
{Piece: "▁▁", Score: -1000000000, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
{Piece: "▁t", Score: -1, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
}
... Reading 261st token
- Call the ProtobufReader.readMessage(...) method,
    - Call the ProtobufReader.readField(...) method on pbr.fileReader, and do the things that we dove into for the 0th token above,
    - Instantiate a Message object and return it.
- Do other stuff...
mainObject.Pieces: {
{Piece: "<unk>", Score: 0, PieceType: 2(sentencepiece.UNKNOWN), ByteFallback: 0},
{Piece: "<s>", Score: 0, PieceType: 3(sentencepiece.CONTROL), ByteFallback: 0},
...
{Piece: "<0x0A>", Score: 0, PieceType: 6(sentencepiece.BYTE), ByteFallback: 0x0A},
...
{Piece: "▁▁", Score: -1000000000, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
{Piece: "▁t", Score: -1, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
{Piece: "er", Score: -2, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
}
..... Some steps were taken
... Reading 1,001st token
- Call the ProtobufReader.readMessage(...) method,
    - Call the ProtobufReader.readField(...) method on pbr.fileReader, and do the things that we dove into for the 0th token above,
    - Instantiate a Message object and return it.
- Do other stuff...
mainObject.Pieces: {
{Piece: "<unk>", Score: 0, PieceType: 2(sentencepiece.UNKNOWN), ByteFallback: 0},
{Piece: "<s>", Score: 0, PieceType: 3(sentencepiece.CONTROL), ByteFallback: 0},
...
{Piece: "<0x0A>", Score: 0, PieceType: 6(sentencepiece.BYTE), ByteFallback: 0x0A},
...
{Piece: "▁▁", Score: -1000000000, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
{Piece: "▁t", Score: -1, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
{Piece: "er", Score: -2, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
...
{Piece: "ied", Score: -741, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
}
..... Some steps were taken
... Reading 10,001st token
- Call the ProtobufReader.readMessage(...) method,
    - Call the ProtobufReader.readField(...) method on pbr.fileReader, and do the things that we dove into for the 0th token above,
    - Instantiate a Message object and return it.
- Do other stuff...
mainObject.Pieces: {
{Piece: "<unk>", Score: 0, PieceType: 2(sentencepiece.UNKNOWN), ByteFallback: 0},
{Piece: "<s>", Score: 0, PieceType: 3(sentencepiece.CONTROL), ByteFallback: 0},
...
{Piece: "<0x0A>", Score: 0, PieceType: 6(sentencepiece.BYTE), ByteFallback: 0x0A},
...
{Piece: "▁▁", Score: -1000000000, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
{Piece: "▁t", Score: -1, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
{Piece: "er", Score: -2, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
...
{Piece: "ied", Score: -741, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
...
{Piece: "ång", Score: -9741, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
}
..... Some steps were taken
... Reading 31,001st token
- Call the ProtobufReader.readMessage(...) method,
    - Call the ProtobufReader.readField(...) method on pbr.fileReader, and do the things that we dove into for the 0th token above,
    - Instantiate a Message object and return it.
- Do other stuff...
mainObject.Pieces: {
{Piece: "<unk>", Score: 0, PieceType: 2(sentencepiece.UNKNOWN), ByteFallback: 0},
{Piece: "<s>", Score: 0, PieceType: 3(sentencepiece.CONTROL), ByteFallback: 0},
...
{Piece: "<0x0A>", Score: 0, PieceType: 6(sentencepiece.BYTE), ByteFallback: 0x0A},
...
{Piece: "▁▁", Score: -1000000000, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
{Piece: "▁t", Score: -1, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
{Piece: "er", Score: -2, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
...
{Piece: "ied", Score: -741, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
...
{Piece: "ång", Score: -9741, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
...
{Piece: "동", Score: -30741, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
}
6.3.4. Reading TrainerSpec
..... Some steps were taken
..... After finishing all 32,000 tokens, we get a different message, which has Number=2
... Reading TrainerSpec
If you are curious about what a TrainerSpec is and what it contains, you can check out this Protobuf structure for details.
- Call the ProtobufReader.readMessage(...) method,
    - Call the ProtobufReader.readField(...) method on pbr.fileReader, and do the things that we dove into for the 0th token above,
    - Instantiate a Message object and return it:
Message{
    Number: 2,
    Value: {
        1: (string) "/large_experiments/theorem/datasets/MERGED/all.test1.merged"
        2: (string) "spm_model_32k_200M_charcov099995_allowWSO__v2"
        3: (int64) 2
        4: (int64) 32000
        6: (int64) 0
        7: (string) "text"
        10: (float32) 0.99995
        11: (int64) 200000000
        14: (int64) 1000000
        15: (float32) 0.75
        16: (int64) 80
        17: (int64) 2
        18: (int64) 4192
        19: (int64) 1
        20: (int64) 16
        21: (int64) 1
        22: (int64) 1
        23: (int64) 1
        24: (int64) 0
        25: (int64) 1
        26: (int64) 1
        32: (int64) 1
        33: (int64) 1
        34: (int64) 0
        35: (int64) 1
        36: ([]uint8) []
        40: (int64) 0
        41: (int64) 1
        42: (int64) 2
        43: (int64) -1
        44: (string) " ⁇ "
        45: (string) "<unk>"
        46: (string) "<s>"
        47: (string) "</s>"
        48: (string) "<pad>"
        49: (int64) 0
        50: (int64) 0
        51: (float32) 0
        52: (int64) 0
    }
}
- We don't use this information, so the processor function does nothing:
from src/sentencepiece/model.go
var modelprotoDescriptor = protobuf.ProtoDescriptor{
	...
	MessageProcessorFns: map[protobuf.Number]func(interface{}, protobuf.Message){
		...
		2: func(mainObject interface{}, message protobuf.Message) {
			// Do nothing, we don't need TrainerSpec at this time.
		},
		...
	},
}
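Although the project ignores the TrainerSpec, the decoded map can still be inspected by field number. The following minimal sketch reads the vocabulary size from such a map; vocabSize and the constant names are hypothetical, and the field numbers (3: model_type, 4: vocab_size, 10: character_coverage) come from the TrainerSpec message of SentencePiece's sentencepiece_model.proto.

```go
package main

import "fmt"

type Number int32

// Field numbers of a few TrainerSpec entries (see sentencepiece_model.proto).
const (
	fieldModelType         Number = 3
	fieldVocabSize         Number = 4
	fieldCharacterCoverage Number = 10
)

// vocabSize (hypothetical) returns the vocab_size field if it was decoded
// as an int64, and ok=false otherwise.
func vocabSize(props map[Number]interface{}) (size int64, ok bool) {
	size, ok = props[fieldVocabSize].(int64)
	return size, ok
}

func main() {
	// A few entries copied from the Message dump above.
	props := map[Number]interface{}{
		fieldModelType:         int64(2),
		fieldVocabSize:         int64(32000),
		fieldCharacterCoverage: float32(0.99995),
	}
	size, ok := vocabSize(props)
	fmt.Println(size, ok)
}
```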
6.3.5. Reading NormalizerSpec
... Reading NormalizerSpec
- Call the ProtobufReader.readMessage(...) method,
    - Call the ProtobufReader.readField(...) method on pbr.fileReader, and do the things that we dove into for the 0th token above,
    - Instantiate a Message object and return it.
- We don't use this information, but we convert it as follows:
from src/sentencepiece/model.go
var modelprotoDescriptor = protobuf.ProtoDescriptor{
	...
	MessageProcessorFns: map[protobuf.Number]func(interface{}, protobuf.Message){
		...
		3: func(mainObject interface{}, message protobuf.Message) {
			mo := mainObject.(*ModelProto)
			props := message.Value.(map[protobuf.Number]interface{})
			ns := NormalizerSpec{}
			ns.Name = props[1].(string)
			ns.PrecompiledCharsmap = props[2].([]byte)
			// The boolean fields default to true when they are absent.
			ns.AddDummyPrefix = common.InterfaceToBool(props[3], true)
			ns.RemoveExtraWhitespaces = common.InterfaceToBool(props[4], true)
			ns.EscapeWhitespaces = common.InterfaceToBool(props[5], true)
			// props[6] may be a string or raw bytes; anything else becomes "".
			stringVal, ok := props[6].(string)
			if !ok {
				byteArrVal, ok := props[6].([]byte)
				if !ok {
					stringVal = ""
				} else {
					stringVal = string(byteArrVal)
				}
			}
			ns.NormalizationRuleTsv = stringVal
			mo.NormalizerSpec = &ns
		},
	},
}
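The nested "string, else []byte, else empty" checks in this processor can also be expressed with a type switch. The following minimal sketch shows that pattern together with a default-aware boolean conversion. Both helpers (toString, toBool) are hypothetical: toBool only approximates what common.InterfaceToBool appears to do from its call sites, and the sample values in main are made up for illustration.

```go
package main

import "fmt"

// toBool (hypothetical) converts a decoded value to bool and falls back to
// defaultValue when the field is absent or has an unsupported type.
func toBool(v interface{}, defaultValue bool) bool {
	switch x := v.(type) {
	case bool:
		return x
	case int64:
		return x != 0
	default:
		return defaultValue
	}
}

// toString (hypothetical) accepts a field that was decoded either as a
// string or as raw bytes; anything else becomes the empty string.
func toString(v interface{}) string {
	switch x := v.(type) {
	case string:
		return x
	case []byte:
		return string(x)
	default:
		return ""
	}
}

func main() {
	// Made-up sample values, keyed by field number.
	props := map[int32]interface{}{1: "sample", 3: int64(0), 6: []byte("a\tb")}
	fmt.Println(toString(props[1]))     // field decoded as a string
	fmt.Println(toBool(props[3], true)) // present: the decoded value wins
	fmt.Println(toBool(props[4], true)) // absent: the default is used
	fmt.Printf("%q\n", toString(props[6]))
	fmt.Printf("%q\n", toString(props[7]))
}
```

A type switch keeps every accepted type on one level, which avoids shadowing the ok variable in nested if blocks.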
- Finished.