6. LOADING TOKENIZER MODEL (DETAILS)

In this chapter, we'll walk through, step by step, how the pieces (vocabulary tokens) stored in the "tokenizer.model" file are read.

6.1. Creating Main Object

A Protobuf message may consist of multiple separate sub-messages/objects/object types, but the deserialization process needs a root destination object. In our project, this is called the "main object".

The ProtobufReader.Unmarshal() function calls the ProtoDescriptor.MainObjectConstructorFn function of the given Protobuf descriptor. In our case, this is modelprotoDescriptor.MainObjectConstructorFn, which instantiates an empty ModelProto with an empty Pieces array.

Once we have a main object, we can start to read and process the "message"s in the file.

The Protobuf protocol/format consists of "message"s, each with a "number" (type identifier), and the reader has a "message processor" function for each "number".

To read the Protobuf file/stream, we have a simple flow:

  • We initiate a loop,
  • Read a message (via ProtobufReader.readMessage()),
  • If we are at EOF (end of file), break the loop,
  • Find the corresponding message processor function for message.Number in the ProtoDescriptor.MessageProcessorFns function map,
  • Execute it (this function makes changes on the main object if needed),
  • Check for errors, and continue the loop if there is none,
  • After the loop ends, return the main object.
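
The flow above can be sketched roughly as follows. This is a minimal illustration, not the project's actual Unmarshal code: the next callback stands in for ProtobufReader.readMessage(), the processor functions return an error (an assumption made here to show the error check), and treating an unknown message number as an error is also an assumption.

```go
package main

import "fmt"

// Number is the Protobuf field number that identifies a message type.
type Number int32

// Message is a simplified, hypothetical stand-in for the project's Message type.
type Message struct {
	Number Number
	Value  interface{}
}

// processorFn applies one message to the main object.
// Returning an error is an assumption of this sketch.
type processorFn func(mainObject interface{}, message Message) error

// unmarshal drives the read loop. next is a hypothetical callback standing in
// for ProtobufReader.readMessage(); it returns ok=false at end of input.
func unmarshal(mainObject interface{}, next func() (*Message, bool), fns map[Number]processorFn) (interface{}, error) {
	for {
		message, ok := next()
		if !ok {
			break // EOF: nothing more to read
		}
		fn, found := fns[message.Number]
		if !found {
			// Assumption: an unknown message number is reported as an error.
			return nil, fmt.Errorf("no message processor for number %d", message.Number)
		}
		if err := fn(mainObject, *message); err != nil {
			return nil, err
		}
	}
	return mainObject, nil
}

func main() {
	// A canned queue of messages replaces the file for this demonstration.
	queue := []Message{
		{Number: 1, Value: "<unk>"},
		{Number: 1, Value: "<s>"},
	}
	next := func() (*Message, bool) {
		if len(queue) == 0 {
			return nil, false
		}
		message := queue[0]
		queue = queue[1:]
		return &message, true
	}

	pieces := []string{}
	fns := map[Number]processorFn{
		1: func(mainObject interface{}, message Message) error {
			p := mainObject.(*[]string)
			*p = append(*p, message.Value.(string))
			return nil
		},
	}

	if _, err := unmarshal(&pieces, next, fns); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(pieces) // [<unk> <s>]
}
```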

6.2. Reading a Message

We initiate a loop that repeatedly calls the ProtobufReader.readMessage() method. This loop continues until it encounters EOF (end of file) or an error.

ProtobufReader.readMessage() method:

  • Checks if we are at the EOF (end of file). If yes, returns with ok=false to signal that reading has finished,
  • Calls the ProtobufReader.readField(...) method to read the "message number" and "message data",
  • Returns the successfully read Message object, or ok=false on failure.

from src/protobuf/protobufreader.go

func (pbr *ProtobufReader) readMessage() (message *Message, ok bool) {
    _, err := pbr.fileReader.Peek(1)
    if err != nil {
        return nil, false
    }
    number, item, ok := pbr.readField(pbr.fileReader)
    if !ok {
        return nil, false
    }
    return &Message{number, item}, true
}

ProtobufReader.readField(...) method:

This part was inspired by Google's original Protobuf project's "protowire" implementation, see: wire.go of original protobuf-go Project

  • Backs up the current file position, because this method is not fully deterministic and relies on some heuristics. It keeps reading the file/stream under certain assumptions and expectations; if it encounters data that contradicts them, it reverts the stream position to the backed-up location and then tries other fallback strategies.
  • Calls pbr.readTag(...), which reads a "varint" (uint64) and extracts the number (int32) and type_ (int8) values from it with bitwise operations.
    The number (int32) is the identifier number, which corresponds to the keys of the MessageProcessorFns map in modelprotoDescriptor. type_ identifies the data type of the current field/message data.
    Note that in this project we only need to read a Protobuf file/stream, not write one, so we have implemented only the data types that are used in the model file.

from src/protobuf/protobufreader.go

func (pbr *ProtobufReader) readField(r *bufio.Reader) (number Number, result interface{}, ok bool) {
    ...
}

6.3. Reading Tokens and Other Structures

6.3.1. Reading 0th token

... Reading 0th token

  • Call the ProtobufReader.readMessage(...) method. Depth: 0 (reading one message, on pbr.fileReader)

    • Call the ProtobufReader.readField(...) method on pbr.fileReader. Depth: 1 (reading one field for one message, on pbr.fileReader)

      • Read tag. number: 1, type: 2 (BytesType)
      • BytesType means it contains a byte array or string,
      • Instantiate a resultMap with Number (int32) keys and interface{} values,
      • Call pbr.readValueBytes:
        • Read a "varint": 14, this value is the length of the byte sequence,
        • Read 14 bytes into the buffer: "\n\x05<unk>\x15\x00\x00\x00\x00\x18\x02" (printed in string form). You can see a meaningful piece: "<unk>",
        • Return this byte sequence.
      • Instantiate another reader, localReader, which is dedicated to this 14-byte sequence,
      • Initiate a loop (to traverse this 14-byte sequence)

        • Iteration 1:

          • Call the ProtobufReader.readField(...) method on localReader. Depth: 2 (to traverse this 14-byte sequence)
            • Read tag. number: 1, type: 2 (BytesType)
            • BytesType means it contains a byte array or string,
            • Instantiate a resultMap with Number (int32) keys and interface{} values,
            • Call the pbr.readValueBytes method on localReader:
              • Read a "varint": 5, this value is the length of the byte sequence,
              • Read 5 bytes into the buffer: "<unk>" (printed in string form). Yes, this is our first extracted token string: "<unk>",
              • Return this byte sequence.
            • Instantiate another reader, localReader, which is dedicated to this 5-byte sequence. Don't forget that we are in another, inner call; think of it as recursion,
            • Initiate a loop (to traverse this 5-byte sequence)
              • Iteration 1:
                • Call the ProtobufReader.readField(...) method on localReader. Depth: 3 (to traverse this 5-byte sequence)
                  • Read tag. number: 7, type: 4 (EndGroupType); this is just the first byte of the string, '<' (0x3C), misread as a tag,
                  • Return ok=false, to let the parent method undo the read and fall back to the string interpretation.
                • Because allOk=false was returned, we call pbr.undoRead(...) on localReader to revert the reader to its previous position,
                • Break the loop
            • The loop has finished,
            • Reading the sequence as a nested BytesType structure failed, so it continues by trying to read the sequence as a string,
            • Check whether the byte sequence is valid UTF-8 with utf8.Valid,
            • Yes, it is a valid string, return: number: 1, result: "<unk>".
          • Set the resultMap entry: key: 1, value: "<unk>"

            resultMap: {
                1: (string) "<unk>"
            }
            
        • Iteration 2:

          • Call the ProtobufReader.readField(...) method on localReader. Depth: 2 (to traverse this 14-byte sequence)
            • Read tag. number: 2, type: 5 (Fixed32Type)
            • Fixed32Type means it contains a 4-byte float32 value,
            • Read 4 bytes and convert them to a float32 in little-endian form: the value read is 0,
            • Return this float32 value.
          • Set the resultMap entry: key: 2, value: 0 (float32)

            resultMap: {
                1: (string) "<unk>",
                2: (float32) 0
            }
            
        • Iteration 3:

          • Call the ProtobufReader.readField(...) method on localReader. Depth: 2 (to traverse this 14-byte sequence)
            • Read tag. number: 3, type: 0 (VarintType)
            • VarintType means it contains a variable-length signed integer,
            • Read it and convert it to an int64: the value read is 2,
            • Return this int64 value.
          • Set the resultMap entry: key: 3, value: 2 (int64)

            resultMap: {
                1: (string) "<unk>",
                2: (float32) 0,
                3: (int64) 2
            }
            
        • Iteration 4:

          • Check for EOF (end of file) on the current 14-byte sequence: yes,
          • Break the loop.

      • The loop has finished,
      • Return:

        number: 1,
        value: {
            1: (string) "<unk>",
            2: (float32) 0,
            3: (int64) 2
        }
        
    • Instantiate a Message object and return it.

      Message{Number: 1, Value: {1: "<unk>", 2: 0, 3: 2}}
      
  • Find the message processor function corresponding to our message.Number=1 from ProtoDescriptor.MessageProcessorFns function map,

  • Execute it,
  • The modelprotoDescriptor.MessageProcessorFns[1] function casts message.Value to props: map[protobuf.Number]interface{},
    • props[1]: Piece (string), string value of the token,
    • props[2]: Score (float32), score of the token,
    • props[3]: PieceType (Type/byte), type of the token (sentencepiece.NORMAL, sentencepiece.CONTROL, sentencepiece.BYTE, etc.; these constants are defined in src/sentencepiece/model.go). If it is not present in the props map, the default is sentencepiece.NORMAL,
    • Then instantiates a new sentencepiece.SentencePiece from the props map,
    • Appends it to the main object's Pieces array.

mainObject.Pieces: {
    {Piece: "<unk>", Score: 0, PieceType: 2(sentencepiece.UNKNOWN)}
}

from src/sentencepiece/model.go

var modelprotoDescriptor = protobuf.ProtoDescriptor{
    ...
    MessageProcessorFns: map[protobuf.Number]func(interface{}, protobuf.Message){
        1: func(mainObject interface{}, message protobuf.Message) {
            mo := mainObject.(*ModelProto)
            props := message.Value.(map[protobuf.Number]interface{})
            pieceTypeVal, err := common.InterfaceToInt(props[3])
            if err != nil {
                pieceTypeVal = int(NORMAL)
            }
            item := newSentencePiece(props[1].(string), props[2].(float32), Type(pieceTypeVal))
            *mo.Pieces = append(*mo.Pieces, item)
        },
        ...
    }
}
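
The walkthrough for the 0th token can be reproduced with a short, self-contained sketch that decodes the same 14-byte payload. This is a simplification and not the project's readField: decodePiece is a hypothetical helper that always treats BytesType data as a string (it skips the nested-message attempt and the undo/fallback heuristic) and supports only the three wire types seen in a piece message.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
	"math"
)

// decodePiece parses the payload of one piece message into a map keyed by
// field number. Hypothetical helper for illustration only.
func decodePiece(payload []byte) (map[int32]interface{}, error) {
	r := bytes.NewReader(payload)
	result := map[int32]interface{}{}
	for r.Len() > 0 {
		tag, err := binary.ReadUvarint(r)
		if err != nil {
			return nil, err
		}
		number, wireType := int32(tag>>3), int8(tag&0x7)
		switch wireType {
		case 0: // VarintType: variable-length integer
			v, err := binary.ReadUvarint(r)
			if err != nil {
				return nil, err
			}
			result[number] = int64(v)
		case 2: // BytesType: varint length prefix, then that many bytes
			length, err := binary.ReadUvarint(r)
			if err != nil {
				return nil, err
			}
			if length > uint64(r.Len()) {
				return nil, errors.New("length prefix exceeds remaining bytes")
			}
			buf := make([]byte, length)
			if _, err := io.ReadFull(r, buf); err != nil {
				return nil, err
			}
			result[number] = string(buf) // simplification: always a string
		case 5: // Fixed32Type: 4 bytes, little-endian float32
			var buf [4]byte
			if _, err := io.ReadFull(r, buf[:]); err != nil {
				return nil, err
			}
			result[number] = math.Float32frombits(binary.LittleEndian.Uint32(buf[:]))
		default:
			return nil, fmt.Errorf("unsupported wire type %d", wireType)
		}
	}
	return result, nil
}

func main() {
	// The 14-byte sequence read for the 0th token.
	payload := []byte("\n\x05<unk>\x15\x00\x00\x00\x00\x18\x02")
	props, err := decodePiece(payload)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("piece=%q score=%v type=%v\n", props[1], props[2], props[3])
}
```

For this payload, the sketch yields the same three entries as the resultMap above: the string "<unk>" for key 1, the float32 value 0 for key 2, and the int64 value 2 for key 3.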

6.3.2. Reading 1st token

... Reading 1st token

  • Call ProtobufReader.readMessage(...) method,

    • Call ProtobufReader.readField(...) method on pbr.fileReader,

      • Repeat the steps we walked through for the 0th token above,
      • Return:

        number: 1,
        value: {
            1: (string) "<s>",
            2: (float32) 0,
            3: (int64) 3
        }
        
    • Instantiate a Message object and return it.

      Message{Number: 1, Value: {1: "<s>", 2: 0, 3: 3}}
      
  • Do other stuff...

mainObject.Pieces: {
    {Piece: "<unk>", Score: 0, PieceType: 2(sentencepiece.UNKNOWN), ByteFallback: 0x00},
    {Piece: "<s>", Score: 0, PieceType: 3(sentencepiece.CONTROL), ByteFallback: 0x00}
}

6.3.3. Reading some other tokens

..... Some steps were taken

... Reading 13th token

  • Call ProtobufReader.readMessage(...) method,

    • Call the ProtobufReader.readField(...) method on pbr.fileReader and repeat the steps we walked through for the 0th token above,
    • Instantiate a Message object and return it.

      Message{Number: 1, Value: {1: "<0x0A>", 2: 0, 3: 6}}
      
  • Do other stuff...

mainObject.Pieces: {
    {Piece: "<unk>", Score: 0, PieceType: 2(sentencepiece.UNKNOWN), ByteFallback: 0x00},
    {Piece: "<s>", Score: 0, PieceType: 3(sentencepiece.CONTROL), ByteFallback: 0x00},
    ...
    {Piece: "<0x0A>", Score: 0, PieceType: 6(sentencepiece.BYTE), ByteFallback: 0x0A},
}
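
The dump above shows that a BYTE piece such as "<0x0A>" ends up with ByteFallback: 0x0A. One way such a value could be derived from the piece string is sketched below. byteFallbackOf is a hypothetical helper written for illustration; how the project actually fills ByteFallback is not shown in this chapter.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// byteFallbackOf extracts the byte value from a piece of the form "<0xNN>".
// It returns ok=false for any piece that does not match that form.
// Hypothetical helper; not taken from the project's source.
func byteFallbackOf(piece string) (value byte, ok bool) {
	if len(piece) != 6 || !strings.HasPrefix(piece, "<0x") || !strings.HasSuffix(piece, ">") {
		return 0, false
	}
	v, err := strconv.ParseUint(piece[3:5], 16, 8)
	if err != nil {
		return 0, false
	}
	return byte(v), true
}

func main() {
	for _, piece := range []string{"<0x0A>", "<unk>", "<s>"} {
		value, ok := byteFallbackOf(piece)
		fmt.Printf("%q -> 0x%02X (byte piece: %v)\n", piece, value, ok)
	}
}
```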

..... Some steps were taken

... Reading 259th token

  • Call ProtobufReader.readMessage(...) method,

    • Call the ProtobufReader.readField(...) method on pbr.fileReader and repeat the steps we walked through for the 0th token above,
    • Instantiate a Message object and return it.

      Message{Number: 1, Value: {1: "▁▁", 2: -1000000000}}
      
  • Do other stuff...

mainObject.Pieces: {
    {Piece: "<unk>", Score: 0, PieceType: 2(sentencepiece.UNKNOWN), ByteFallback: 0},
    {Piece: "<s>", Score: 0, PieceType: 3(sentencepiece.CONTROL), ByteFallback: 0},
    ...
    {Piece: "<0x0A>", Score: 0, PieceType: 6(sentencepiece.BYTE), ByteFallback: 0x0A},
    ...
    {Piece: "▁▁", Score: -1000000000, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
}

... Reading 260th token

  • Call ProtobufReader.readMessage(...) method,

    • Call the ProtobufReader.readField(...) method on pbr.fileReader and repeat the steps we walked through for the 0th token above,
    • Instantiate a Message object and return it.

      Message{Number: 1, Value: {1: "▁t", 2: -1}}
      
  • Do other stuff...

mainObject.Pieces: {
    {Piece: "<unk>", Score: 0, PieceType: 2(sentencepiece.UNKNOWN), ByteFallback: 0},
    {Piece: "<s>", Score: 0, PieceType: 3(sentencepiece.CONTROL), ByteFallback: 0},
    ...
    {Piece: "<0x0A>", Score: 0, PieceType: 6(sentencepiece.BYTE), ByteFallback: 0x0A},
    ...
    {Piece: "▁▁", Score: -1000000000, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
    {Piece: "▁t", Score: -1, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
}

... Reading 261st token

  • Call ProtobufReader.readMessage(...) method,

    • Call the ProtobufReader.readField(...) method on pbr.fileReader and repeat the steps we walked through for the 0th token above,
    • Instantiate a Message object and return it.

      Message{Number: 1, Value: {1: "er", 2: -2}}
      
  • Do other stuff...

mainObject.Pieces: {
    {Piece: "<unk>", Score: 0, PieceType: 2(sentencepiece.UNKNOWN), ByteFallback: 0},
    {Piece: "<s>", Score: 0, PieceType: 3(sentencepiece.CONTROL), ByteFallback: 0},
    ...
    {Piece: "<0x0A>", Score: 0, PieceType: 6(sentencepiece.BYTE), ByteFallback: 0x0A},
    ...
    {Piece: "▁▁", Score: -1000000000, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
    {Piece: "▁t", Score: -1, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
    {Piece: "er", Score: -2, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
}

..... Some steps were taken

... Reading 1,001st token

  • Call ProtobufReader.readMessage(...) method,

    • Call the ProtobufReader.readField(...) method on pbr.fileReader and repeat the steps we walked through for the 0th token above,
    • Instantiate a Message object and return it.

      Message{Number: 1, Value: {1: "ied", 2: -741}}
      
  • Do other stuff...

mainObject.Pieces: {
    {Piece: "<unk>", Score: 0, PieceType: 2(sentencepiece.UNKNOWN), ByteFallback: 0},
    {Piece: "<s>", Score: 0, PieceType: 3(sentencepiece.CONTROL), ByteFallback: 0},
    ...
    {Piece: "<0x0A>", Score: 0, PieceType: 6(sentencepiece.BYTE), ByteFallback: 0x0A},
    ...
    {Piece: "▁▁", Score: -1000000000, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
    {Piece: "▁t", Score: -1, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
    {Piece: "er", Score: -2, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
    ...
    {Piece: "ied", Score: -741, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
}

..... Some steps were taken

... Reading 10,001st token

  • Call ProtobufReader.readMessage(...) method,

    • Call the ProtobufReader.readField(...) method on pbr.fileReader and repeat the steps we walked through for the 0th token above,
    • Instantiate a Message object and return it.

      Message{Number: 1, Value: {1: "ång", 2: -9741}}
      
  • Do other stuff...

mainObject.Pieces: {
    {Piece: "<unk>", Score: 0, PieceType: 2(sentencepiece.UNKNOWN), ByteFallback: 0},
    {Piece: "<s>", Score: 0, PieceType: 3(sentencepiece.CONTROL), ByteFallback: 0},
    ...
    {Piece: "<0x0A>", Score: 0, PieceType: 6(sentencepiece.BYTE), ByteFallback: 0x0A},
    ...
    {Piece: "▁▁", Score: -1000000000, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
    {Piece: "▁t", Score: -1, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
    {Piece: "er", Score: -2, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
    ...
    {Piece: "ied", Score: -741, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
    ...
    {Piece: "ång", Score: -9741, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
}

..... Some steps were taken

... Reading 31,001st token

  • Call ProtobufReader.readMessage(...) method,

    • Call the ProtobufReader.readField(...) method on pbr.fileReader and repeat the steps we walked through for the 0th token above,
    • Instantiate a Message object and return it.

      Message{Number: 1, Value: {1: "동", 2: -30741}}
      
  • Do other stuff...

mainObject.Pieces: {
    {Piece: "<unk>", Score: 0, PieceType: 2(sentencepiece.UNKNOWN), ByteFallback: 0},
    {Piece: "<s>", Score: 0, PieceType: 3(sentencepiece.CONTROL), ByteFallback: 0},
    ...
    {Piece: "<0x0A>", Score: 0, PieceType: 6(sentencepiece.BYTE), ByteFallback: 0x0A},
    ...
    {Piece: "▁▁", Score: -1000000000, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
    {Piece: "▁t", Score: -1, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
    {Piece: "er", Score: -2, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
    ...
    {Piece: "ied", Score: -741, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
    ...
    {Piece: "ång", Score: -9741, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0},
    ...
    {Piece: "동", Score: -30741, PieceType: 1(sentencepiece.NORMAL), ByteFallback: 0}
}

6.3.4. Reading TrainerSpec

..... Some steps were taken

..... After all 32,000 tokens have been read, we get a different message, which has Number=2

... Reading TrainerSpec

If you are curious about what a TrainerSpec is and what it contains, you can check out this Protobuf structure for details.

  • Call ProtobufReader.readMessage(...) method,

    • Call the ProtobufReader.readField(...) method on pbr.fileReader and repeat the steps we walked through for the 0th token above,
    • Instantiate a Message object and return it.

      Message{
          Number: 2, 
          Value: {
              1: (string) "/large_experiments/theorem/datasets/MERGED/all.test1.merged"
              2: (string) "spm_model_32k_200M_charcov099995_allowWSO__v2"
              3: (int64) 2
              4: (int64) 32000
              6: (int64) 0
              7: (string) "text"
              10: (float32) 0.99995
              11: (int64) 200000000
              14: (int64) 1000000
              15: (float32) 0.75
              16: (int64) 80
              17: (int64) 2
              18: (int64) 4192
              19: (int64) 1
              20: (int64) 16
              21: (int64) 1
              22: (int64) 1
              23: (int64) 1
              24: (int64) 0
              25: (int64) 1
              26: (int64) 1
              32: (int64) 1
              33: (int64) 1
              34: (int64) 0
              35: (int64) 1
              36: ([]uint8) []
              40: (int64) 0
              41: (int64) 1
              42: (int64) 2
              43: (int64) -1
              44: (string) " ⁇ "
              45: (string) "<unk>"
              46: (string) "<s>"
              47: (string) "</s>"
              48: (string) "<pad>"
              49: (int64) 0
              50: (int64) 0
              51: (float32) 0
              52: (int64) 0
          }
      }
      
  • We don't use this information, so the processor function does nothing, as follows:

from src/sentencepiece/model.go

var modelprotoDescriptor = protobuf.ProtoDescriptor{
    ...
    MessageProcessorFns: map[protobuf.Number]func(interface{}, protobuf.Message){
        ...
        2: func(mainObject interface{}, message protobuf.Message) {
            // Do nothing, we don't need TrainerSpec at this time.
        },
        ...
    }
}

6.3.5. Reading NormalizerSpec

... Reading NormalizerSpec

  • Call ProtobufReader.readMessage(...) method,

    • Call the ProtobufReader.readField(...) method on pbr.fileReader and repeat the steps we walked through for the 0th token above,
    • Instantiate a Message object and return it.

      Message{
          Number: 3, 
          Value: {
              1: (string) "identity"
              2: ([]uint8) []
              3: (int64) 1
              4: (int64) 0
              6: ([]uint8) []
          }
      }
      
  • We don't use this information, but we still convert it into a NormalizerSpec, as follows:

from src/sentencepiece/model.go

var modelprotoDescriptor = protobuf.ProtoDescriptor{
    ...
    MessageProcessorFns: map[protobuf.Number]func(interface{}, protobuf.Message){
        ...
        3: func(mainObject interface{}, message protobuf.Message) {
            mo := mainObject.(*ModelProto)
            props := message.Value.(map[protobuf.Number]interface{})
            ns := NormalizerSpec{}
            ns.Name = props[1].(string)
            ns.PrecompiledCharsmap = props[2].([]byte)

            ns.AddDummyPrefix = common.InterfaceToBool(props[3], true)
            ns.RemoveExtraWhitespaces = common.InterfaceToBool(props[4], true)
            ns.EscapeWhitespaces = common.InterfaceToBool(props[5], true)
            stringVal, ok := props[6].(string)
            if !ok {
                byteArrVal, ok := props[6].([]byte)
                if !ok {
                    stringVal = ""
                } else {
                    stringVal = string(byteArrVal)
                }
            }
            ns.NormalizationRuleTsv = stringVal
            mo.NormalizerSpec = &ns
        },
    }
}

  • Finished.
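
The NormalizerSpec processor relies on common.InterfaceToBool(value, defaultValue) to turn the decoded values into booleans. The sketch below shows one plausible way such a helper could behave, given that varint fields arrive as int64 and missing keys yield nil. interfaceToBool here is a hypothetical reimplementation for illustration; the real function in the project's common package may differ.

```go
package main

import "fmt"

// interfaceToBool converts a decoded Protobuf value to a bool.
// Missing (nil) or unsupported values yield defaultValue.
// Hypothetical sketch; not the project's common.InterfaceToBool.
func interfaceToBool(value interface{}, defaultValue bool) bool {
	switch v := value.(type) {
	case bool:
		return v
	case int64: // varint fields are decoded as int64 in the walkthrough
		return v != 0
	case int:
		return v != 0
	default:
		return defaultValue
	}
}

func main() {
	// Keys 3 and 4 as they appear in the NormalizerSpec message; key 5 is absent.
	props := map[int32]interface{}{3: int64(1), 4: int64(0)}
	fmt.Println(
		interfaceToBool(props[3], true), // AddDummyPrefix
		interfaceToBool(props[4], true), // RemoveExtraWhitespaces
		interfaceToBool(props[5], true), // EscapeWhitespaces (missing, default used)
	)
	// prints: true false true
}
```

Under these assumptions, the message shown above would give AddDummyPrefix=true, RemoveExtraWhitespaces=false, and EscapeWhitespaces=true (the default, because key 5 is not present).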