10. RoPE (ROTARY POSITIONAL EMBEDDINGS)¶
The Llama model uses RoPE (Rotary Positional Embeddings) alongside the standard embedding layer to encode the influence of token positions within a sequence.

For decades, embeddings have been the most commonly used technique to represent words, concepts, or tokens in the NLP (Natural Language Processing) world. Typically, an embedding model is trained to learn how frequently tokens are used together. The tokens are placed at suitable positions within a multi-dimensional space, where distances reflect the similarity or difference between them.

There is a wide variety of methods to implement embeddings. Some of them take token positions into account.

Taking token positions into account is important. Consider two sentences containing exactly the same words in different orders. If you don't take positions into account, your system treats these two sentences as the same thing, despite their different meanings.
10.1. Preliminary Concepts¶
Let's explain with an example. Suppose we have a sequence of 5 tokens and an embedding layer with the shape of `{128256, 4096}`. So, we have:

- A token embedding sequence which was calculated using the embedding layer. Our input tensor will have the shape of `{5, 4096}`,
- `32` "attention heads" (according to `modelArgs.N_Heads = 32`). Each of our attention heads will have a dimension of `modelArgs.Dim / modelArgs.N_Heads`. In our case, it is `4096 / 32 = 128`. So our positional embedding tensors will have `128` dimensions,
- An array of position indices of the tokens: `{1, 2, 3, 4, 5}`.

These shapes are summarized in the small sketch below.
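The numbers above can be summarized with a minimal, dependency-free Go sketch (the constant and variable names here are only for illustration and are not part of this project's code):

```go
package main

import "fmt"

func main() {
	// Shapes used throughout this chapter (values taken from the model args above).
	const (
		vocabSize = 128256 // rows of the embedding layer
		dim       = 4096   // embedding dimension (modelArgs.Dim)
		nHeads    = 32     // attention head count (modelArgs.N_Heads)
		seqLen    = 5      // our example sequence length
	)
	headDim := dim / nHeads                                  // 4096 / 32 = 128
	fmt.Println("input tensor shape:", []int{seqLen, dim})   // {5, 4096}
	fmt.Println("positional embedding dimensions:", headDim) // 128
}
```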
We have several alternatives to calculate the positional embeddings. Some of them are:

- Taking the positions as they are: `{0, 1, 2, 3, 4}`,
- Taking the positions as normalized values between `0` and `1`, as `{0., 0.25, 0.50, 0.75, 1.}`,
- As suggested in the paper Attention Is All You Need, using sinusoidal functions. Each dimension of the positional encoding corresponds to a sinusoid:

  \[ \begin{gathered} PE_{(pos,2i)} = \sin(pos/10000^\frac{2i}{d_{model}}) \\ PE_{(pos,2i+1)} = \cos(pos/10000^\frac{2i}{d_{model}}) \end{gathered} \]

  This means that we have `128` dimensions for each position, and `i` will loop from `0` to `64` (half of `128`).

  The original paper suggests using `10000` as the base theta value, and the Llama 2 model uses this value. But newer versions of Llama (3 and higher) started to use `500000` as the base theta value, so we will stick to using `500000`.

  Update with Llama 3.1: The Llama 3.1 version comes with a small adjustment on frequencies. The apply_scaling(...) method was added into the original Llama 3.1 implementation; it calculates wavelengths from these frequencies and applies some limitations on them. Implementation details will be discussed in the following subchapters. Currently, we represent this operation with `scl(...)`.

  Our `PE` positional embedding array for the 3rd position will be like the following (see the sketch after this list for a plain-Go version):

  \[ \begin{gathered} PE = \left\lbrace \begin{array}{l} \sin\left(scl\left(\frac{3}{500000^\frac{0}{128}}\right)\right), \cos\left(scl\left(\frac{3}{500000^\frac{0}{128}}\right)\right), \sin\left(scl\left(\frac{3}{500000^\frac{2}{128}}\right)\right), \cos\left(scl\left(\frac{3}{500000^\frac{2}{128}}\right)\right), \\ \dots, \\ \sin\left(scl\left(\frac{3}{500000^\frac{124}{128}}\right)\right), \cos\left(scl\left(\frac{3}{500000^\frac{124}{128}}\right)\right), \sin\left(scl\left(\frac{3}{500000^\frac{126}{128}}\right)\right), \cos\left(scl\left(\frac{3}{500000^\frac{126}{128}}\right)\right) \end{array} \right\rbrace \\ \end{gathered} \]

- As suggested in the paper RoFormer, using sinusoidal functions in slightly different ways, as described in the following parts,
- Using the output of a custom function that takes the position indices as input.
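To make the sinusoidal alternative concrete, here is a minimal Go sketch of an "Attention Is All You Need"-style encoding for a single position (the function name `sinusoidalPE` is hypothetical, and the Llama 3.1 scaling `scl(...)` is omitted for brevity):

```go
package main

import (
	"fmt"
	"math"
)

// sinusoidalPE computes the classic "Attention Is All You Need" style positional
// encoding for a single position: sin for even dimensions, cos for odd dimensions.
// dim must be even; base is the theta base (10000 in the original paper,
// 500000 for Llama 3 and later).
func sinusoidalPE(pos int, dim int, base float64) []float64 {
	pe := make([]float64, dim)
	for i := 0; i < dim; i += 2 {
		angle := float64(pos) / math.Pow(base, float64(i)/float64(dim))
		pe[i] = math.Sin(angle)
		pe[i+1] = math.Cos(angle)
	}
	return pe
}

func main() {
	// The PE vector of the 3rd position for one 128-dimensional head, base theta 500000.
	pe := sinusoidalPE(3, 128, 500000)
	fmt.Println(pe[0], pe[1], pe[126], pe[127])
}
```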
We have several alternatives to integrate the positional embeddings with the token embedding vectors. If our positional embedding vectors and the token embedding vectors have the same dimension, we can sum or multiply them:

- Summation: As suggested in the paper Attention Is All You Need, we can simply add the two vectors, as `new_embeddings = token_embeddings + positional_embeddings`,
- Multiplication: As suggested in the paper RoFormer, we can simply multiply the two vectors element-wise, as `new_embeddings = token_embeddings * positional_embeddings`.

Think of it this way: we could take the positions as they are, `{0, 1, 2, 3, 4}`, and add them to each dimension of our embedding vectors. It may work, but it doesn't sound very meaningful, right? So, we need to find a more meaningful method.
You can find an introduction to absolute position embedding and relative position embedding in the paper RoFormer, alongside RoPE (Rotary Positional Embeddings), the approach proposed by the same paper.

The main idea behind these approaches is to let the model effectively take token positions into account. The RoPE (Rotary Positional Embeddings) approach represents the positions of tokens in the polar coordinate system, which employs angles and complex numbers.
With this approach, we have a chance to combine multiple ideas, and we get the following advantages:

- Our positional embeddings are distributed in the space within a limited range (`[-1, +1]`, because of the limits of the `cos` and `sin` functions),
- Other approaches prefer summation to integrate the positional embeddings with the token embedding vectors. However, summation corrupts the exact data that the vector carries. In this approach, we prefer multiplication instead. And because the value we multiply by is a sinusoid, we can think of it as only rotating the original embedding vector by an angle. So, theoretically, we don't corrupt the original data,
- The RoFormer approach that we use in this project takes the `128` dimensions of an attention head as `64` pairs. Then, it obtains a complex number from each pair by taking the first item of the pair as the real part and the second item as the imaginary part of the complex number (see the sketch after this list),
- Taking the items of an attention head as float pairs and representing them as complex numbers in the polar coordinate system suits the mathematical nature of the method. It also allows us to represent these complex numbers as a matrix and to perform matrix operations on them,
- With this approach, the influence of position on embeddings is high for the lower dimensions (going from the first dimension through the 128 dimensions) and converges to zero for the higher dimensions. This makes higher dimensions of embeddings less sensitive to positional data than lower dimensions, because the polar coordinates calculated for higher dimensions have nearly the same value. Important note: I read this in a few sources, then I saw it with the 3D charts that I drew in the notebook 10.BONUS-PRECOMPUTING-FREQUENCY-TENSOR.ipynb, which can also be found at the bottom of this chapter.
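The following is an illustrative Go sketch of that pairing-and-rotation idea (the function `rotatePairs` is hypothetical and not this project's actual implementation; the scaling step is omitted):

```go
package main

import (
	"fmt"
	"math"
	"math/cmplx"
)

// rotatePairs is an illustrative (hypothetical) function: it treats a head vector
// of even length as consecutive (real, imaginary) pairs, rotates each pair by
// pos * theta_i, and writes the rotated values back as plain floats.
func rotatePairs(head []float64, pos int, base float64) []float64 {
	dim := len(head)
	out := make([]float64, dim)
	for i := 0; i < dim; i += 2 {
		theta := 1.0 / math.Pow(base, float64(i)/float64(dim)) // theta_i
		rotation := cmplx.Rect(1, float64(pos)*theta)          // unit complex number: cos + i*sin
		pair := complex(head[i], head[i+1])                    // pair -> complex number
		rotated := pair * rotation                             // rotation = complex multiplication
		out[i], out[i+1] = real(rotated), imag(rotated)
	}
	return out
}

func main() {
	head := make([]float64, 128)
	for i := range head {
		head[i] = 1 // a dummy 128-dimensional head vector
	}
	rotated := rotatePairs(head, 3, 500000)
	fmt.Println(rotated[0], rotated[1]) // the first pair, rotated by 3 radians
}
```

Multiplying by a unit complex number changes only the angle of each pair, not its magnitude, which is the "rotation without corrupting the data" property mentioned above.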
Note: In this approach, some concepts from computing, mathematics, geometry, and physics were combined. For example, we can think of:

- Our positions as points in a time series,
- Our positional encoding function as a function with respect to time, so we can say it is a signal,
- Our angles as the angular frequency (\({\displaystyle \omega }\)) and our positions as the real independent variable/time of a sine wave.
In the following subchapters, we will see how these polar coordinates are precalculated. These values don't vary by input, so this calculation is made only once.
Sources:
- A Guide on Word Embeddings in NLP
- Word Embeddings in Natural Language Processing(NLP)
- RoFormer: Enhanced Transformer with Rotary Position Embedding Paper
- Youtube - RoPE (Rotary positional embeddings) explained: The positional workhorse of modern LLMs
- Youtube - Llama explained... - "Rotary Positional Embeddings" section
10.2. Introduction to Precomputing The Frequency Tensor¶
You can check out the following for more information:
- The Python codes that create the sample data and graphs used here with this Python Notebook: 10.BONUS-PRECOMPUTING-FREQUENCY-TENSOR.ipynb.
- A Gentle Introduction to Positional Encoding in Transformer Models, Part 1 article.
The precomputeFreqsCis(...) function takes four arguments: `dim`, `end`, `theta`, and `useScaled`.

The argument `dim` is computed as `int(modelArgs.Dim/modelArgs.N_Heads)`, `end` is computed as `modelArgs.MaxSequenceLength*2`, `theta` is `modelArgs.RopeTheta`, and `useScaled` is `modelArgs.UseScaledRope`.

In our case, the precomputeFreqsCis(...) function is called with `dim = 4096/32 = 128`, `end = 2048 * 2 = 4096`, `theta = 500000`, and `useScaled = true`.
from src/model/llamatransformer.go
func NewLlamaTransformer(model *Model) (*LlamaTransformer, error) {
result := &LlamaTransformer{}
...
if result.PrecomputedFreqsCis, err = precomputeFreqsCis(int(dim/modelArgs.N_Heads), modelArgs.MaxSequenceLength*2, modelArgs.RopeTheta, modelArgs.UseScaledRope); err != nil {
return nil, err
}
...
}
10.3. Initiating Angles of Frequency Tensor¶
In the original Llama 3.1 Python repository of Meta, this Python code initiates the freqs
array:
import torch
def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0, use_scaled: bool = False):
...
freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
...
Important note: The `theta` variables in both the Go and Python code are not angles. They are explained as: "Scaling factor for frequency computation. Defaults to 10000.0, but in our case, this value comes as 500000.0 for Llama 3.1.".
Instead, `freqs` is an array of angles: it corresponds to \(\Theta\), and each item of the `freqs` array corresponds to a \(\theta_i\) below.
Personally, at first sight, I was confused about why they called a scaling factor `theta`, a term that made me think it is an angle, but it isn't. The items of `freqs` are in an angle unit (radians), but in the end, they are only base values (a scaling factor) for the real angles!
The original equation in section "3.2.2 General form" of RoFormer: Enhanced Transformer with Rotary Position Embedding:

\[ \Theta = \left\lbrace \theta_i = 10000^{-\frac{2(i-1)}{d}},\ i \in [1, 2, \dots, d/2] \right\rbrace \]

If we expand it for `dim=128` and use `500000` instead of `10000` in our case:

\[ \Theta = \left\lbrace \theta_i = 500000^{-\frac{2(i-1)}{128}},\ i \in [1, 2, \dots, 64] \right\rbrace \]

Update with Llama 3.1: The Llama 3.1 version comes with a small adjustment on frequencies. Implementation details will be discussed in the following subchapters. Currently, we represent this operation with `scl(...)`.

If it is expressed with the variable names in the code and scaling is applied:

\[ freqs_i = scl\left(\frac{1}{theta^{\,val_i / dimFloat}}\right), \quad val_i \in \lbrace 0, 2, 4, \dots, 126 \rbrace \]

You can find the original Python implementation of the apply_scaling(...) function, which is represented as \(scl\left(...\right)\), here.
from src/model/llamatransformer.go
func applyScaling(freqs *ml.Tensor) error {
// See Llama 3.1 Code: https://github.com/meta-llama/llama-models/blob/f45cdfd624b98b6655540f7101d8d9cb432e631c/models/llama3_1/reference_impl/model.py#L45
// Values obtained from grid search
scaleFactor := float32(8.0)
lowFreqFactor := float32(1.0)
highFreqFactor := float32(4.0)
oldContextLen := float32(8192) // original llama3 length
lowFreqWavelen := oldContextLen / lowFreqFactor
highFreqWavelen := oldContextLen / highFreqFactor
for i := 0; i < freqs.Size[0]; i++ {
freq, err := freqs.GetItem_AsFloat32([]int{i})
if err != nil {
return err
}
var newFreq float32
wavelen := 2 * math.Pi / freq
if wavelen < highFreqWavelen {
newFreq = freq
} else if wavelen > lowFreqWavelen {
newFreq = freq / scaleFactor
} else {
smooth := (oldContextLen/wavelen - lowFreqFactor) / (highFreqFactor - lowFreqFactor)
newFreq = (1-smooth)*freq/scaleFactor + smooth*freq
}
if err := freqs.SetItem_FromFloat32([]int{i}, newFreq); err != nil {
return err
}
}
return nil
}
func precomputeFreqsCis(dim int, end int, theta float64, useScaled bool) (*ml.Tensor, error) {
...
dimFloat := float32(dim)
freqs, err := ml.ARange(0, dim, 2, ml.DT_BF16)
...
err = freqs.Apply_AsFloat32(func(val float32) float32 {
return float32(1.0 / math.Pow(theta, float64(val/dimFloat)))
})
...
if useScaled {
err = applyScaling(freqs)
if err != nil {
return nil, err
}
}
...
}
Go project (ours) values of the `freqs` array:
freqs: {0, 2, 4, 6, 8, 10, 12, ..., 124, 126} # has 64 items; at this point, the freqs array contains the "val" values that appear in the equations above.
# after running Apply_AsFloat32 and then applyScaling, freqs will be:
freqs: { # has 64 items, in radians.
1.0000e+00, 8.1250e-01, 6.6016e-01, 5.3906e-01, 4.3945e-01, 3.5742e-01,
2.9102e-01, 2.3730e-01, 1.9336e-01, 1.5723e-01, 1.2793e-01, 1.0449e-01,
8.4961e-02, 6.9336e-02, 5.6641e-02, 4.6143e-02, 3.7598e-02, 3.0518e-02,
2.4902e-02, 2.0264e-02, 1.6479e-02, 1.3489e-02, 1.0986e-02, 8.9111e-03,
7.2632e-03, 5.9204e-03, 4.8218e-03, 3.9368e-03, 3.2043e-03, 2.1515e-03,
1.3504e-03, 8.5068e-04, 5.1880e-04, 3.1090e-04, 1.7834e-04, 9.5367e-05,
7.7724e-05, 6.2943e-05, 5.1498e-05, 4.1962e-05, 3.4094e-05, 2.7895e-05,
2.2650e-05, 1.8477e-05, 1.5080e-05, 1.2279e-05, 1.0014e-05, 8.1062e-06,
6.6459e-06, 5.3942e-06, 4.4107e-06, 3.5912e-06, 2.9206e-06, 2.3842e-06,
1.9372e-06, 1.5795e-06, 1.2890e-06, 1.0431e-06, 8.5309e-07, 6.9663e-07,
5.6624e-07, 4.6194e-07, 3.7625e-07, 3.0547e-07
}
The Python + PyTorch environment produces nearly the same output; there are slight differences between the two because of floating-point precision differences.
In the table below, you can find approximate equivalents of radian angle values in degrees, with corresponding "val" indices:
val | rad | deg | val | rad | deg | val | rad | deg | val | rad | deg
---|---|---|---|---|---|---|---|---|---|---|---
0 | 1.00000000 | 57.29578 | 32 | 0.03760603 | 2.15467 | 64 | 0.00052485 | 0.03007 | 96 | 0.00000665 | 0.00038 | |||
2 | 0.81461722 | 46.67413 | 34 | 0.03063452 | 1.75523 | 66 | 0.00031269 | 0.01792 | 98 | 0.00000542 | 0.00031 | |||
4 | 0.66360128 | 38.02155 | 36 | 0.02495541 | 1.42984 | 68 | 0.00017851 | 0.01023 | 100 | 0.00000441 | 0.00025 | |||
6 | 0.54058099 | 30.97301 | 38 | 0.02032910 | 1.16477 | 70 | 0.00009556 | 0.00548 | 102 | 0.00000359 | 0.00021 | |||
8 | 0.44036663 | 25.23115 | 40 | 0.01656044 | 0.94884 | 72 | 0.00007785 | 0.00446 | 104 | 0.00000293 | 0.00017 | |||
10 | 0.35873023 | 20.55373 | 42 | 0.01349042 | 0.77294 | 74 | 0.00006342 | 0.00363 | 106 | 0.00000238 | 0.00014 | |||
12 | 0.29222783 | 16.74342 | 44 | 0.01098953 | 0.62965 | 76 | 0.00005166 | 0.00296 | 108 | 0.00000194 | 0.00011 | |||
14 | 0.23805381 | 13.63948 | 46 | 0.00895226 | 0.51293 | 78 | 0.00004208 | 0.00241 | 110 | 0.00000158 | 0.00009 | |||
16 | 0.19392276 | 11.11096 | 48 | 0.00729267 | 0.41784 | 80 | 0.00003428 | 0.00196 | 112 | 0.00000129 | 0.00007 | |||
18 | 0.15797281 | 9.05118 | 50 | 0.00594073 | 0.34038 | 82 | 0.00002793 | 0.00160 | 114 | 0.00000105 | 0.00006 | |||
20 | 0.12868738 | 7.37324 | 52 | 0.00483942 | 0.27728 | 84 | 0.00002275 | 0.00130 | 116 | 0.00000086 | 0.00005 | |||
22 | 0.10483095 | 6.00637 | 54 | 0.00394228 | 0.22588 | 86 | 0.00001853 | 0.00106 | 118 | 0.00000070 | 0.00004 | |||
24 | 0.08539710 | 4.89289 | 56 | 0.00321145 | 0.18400 | 88 | 0.00001510 | 0.00086 | 120 | 0.00000057 | 0.00003 | |||
26 | 0.06956595 | 3.98584 | 58 | 0.00216657 | 0.12414 | 90 | 0.00001230 | 0.00070 | 122 | 0.00000046 | 0.00003 | |||
28 | 0.05666962 | 3.24693 | 60 | 0.00137189 | 0.07860 | 92 | 0.00001002 | 0.00057 | 124 | 0.00000038 | 0.00002 | |||
30 | 0.04616405 | 2.64501 | 62 | 0.00085675 | 0.04909 | 94 | 0.00000816 | 0.00047 | 126 | 0.00000031 | 0.00002 |
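To sanity-check these numbers outside of the project, a minimal, dependency-free Go sketch could look like the following (plain float64 arithmetic, so the results will differ slightly from the bfloat16 values above; the constants mirror the applyScaling(...) function):

```go
package main

import (
	"fmt"
	"math"
)

// Re-computes the freqs values above in plain float64, including the
// Llama 3.1 wavelength-based scaling step.
func main() {
	const (
		dim            = 128
		theta          = 500000.0
		scaleFactor    = 8.0
		lowFreqFactor  = 1.0
		highFreqFactor = 4.0
		oldContextLen  = 8192.0 // original llama3 length
	)
	lowFreqWavelen := oldContextLen / lowFreqFactor
	highFreqWavelen := oldContextLen / highFreqFactor

	for val := 0; val < dim; val += 2 {
		freq := 1.0 / math.Pow(theta, float64(val)/dim)
		wavelen := 2 * math.Pi / freq
		switch {
		case wavelen < highFreqWavelen:
			// high-frequency (short wavelength) components are kept as-is
		case wavelen > lowFreqWavelen:
			freq /= scaleFactor
		default:
			smooth := (oldContextLen/wavelen - lowFreqFactor) / (highFreqFactor - lowFreqFactor)
			freq = (1-smooth)*freq/scaleFactor + smooth*freq
		}
		fmt.Printf("val=%3d  rad=%.8f  deg=%.5f\n", val, freq, freq*180/math.Pi)
	}
}
```

Running this prints each "val" index with the scaled frequency in radians and its degree equivalent, matching the table above up to bfloat16 rounding.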
Sources:
- RoFormer: Enhanced Transformer with Rotary Position Embedding: Paper | Papers with Code | LabML Annotated Implementation
- Llama 2: Open Foundation and Fine-Tuned Chat Models : Paper
- Llama: Open and Efficient Foundation Language Models : Paper
10.4. Getting Outer Product of Frequency Tensor and Position Indices¶
import torch
def precompute_freqs_cis(dim: int, end: int, theta: float = 500000.0):
...
t = torch.arange(end, device=freqs.device) # type: ignore
freqs = torch.outer(t, freqs).float() # type: ignore
...
from src/model/llamatransformer.go
func precomputeFreqsCis(dim int, end int, theta float64, useScaled bool) (*ml.Tensor, error) {
...
t, err := ml.ARange(0, end, 1, ml.DT_BF16)
if err != nil {
return nil, err
}
...
}
The `end` is computed as `modelArgs.MaxSequenceLength*2`. In our case, `end = 2048 * 2 = 4096`.
Note about a weirdness here: In the original implementation, `modelArgs.MaxSequenceLength` is equal to `512`, which is the limit for the input prompt token count. By multiplying it by 2, they aimed to avoid unnecessary calculations.
However, in our implementation, we specified `modelArgs.MaxSequenceLength` as `2048`, and when we multiply it by 2, we get `4096`, an unnecessarily high value. But I left it as it is; it doesn't hurt correctness, it only causes some unused values to be calculated.
We will continue with 4096, but know that it is unnecessarily high.
On this issue, the original Python code has a comment (keep in mind that this comment was taken from the Llama 2 code, not Llama 3.1):
# Note that self.params.max_seq_len is multiplied by 2 because the token limit for the Llama 2 generation of models is 4096.
# Adding this multiplier instead of using 4096 directly allows for dynamism of token lengths while training or fine-tuning.
At first, we generate a tensor named `t` with 4096 items as: `{0, 1, 2, 3, ..., 4093, 4094, 4095}`. This tensor contains our position indices.
from src/model/llamatransformer.go
func precomputeFreqsCis(dim int, end int, theta float64, useScaled bool) (*ml.Tensor, error) {
...
freqs, err = ml.Outer(t, freqs)
if err != nil {
return nil, err
}
...
}
By calling the ml.Outer(t, freqs) function, we get a tensor with shape `{4096, 64}`, which is the outer product of the tensor t with shape `{4096}` and the tensor freqs with shape `{64}`.

This "outer product" function takes the first argument as the row vector and the second argument as the column vector.

In our case, we take the items of `t` as rows and the items of `freqs` as columns, then create a 2D tensor called `result` as follows:
| row | column | set the result as | python equivalent (rad) | python equivalent (deg) |
|---|---|---|---|---|
| `t[0] = 0` | `freqs[0] = 1.0000e+00` | `result[0][0] = 0 * 1.0000e+00 = 0` | 0.00000 | 0.00000 |
| | `freqs[1] = 8.1250e-01` | `result[0][1] = 0 * 8.1250e-01 = 0` | 0.00000 | 0.00000 |
| ... | | | | |
| `t[1] = 1` | `freqs[0] = 1.0000e+00` | `result[1][0] = 1 * 1.0000e+00 = 1.0000e+00` | 1.00000000 | 57.29578 |
| | `freqs[1] = 8.1250e-01` | `result[1][1] = 1 * 8.1250e-01 = 8.1250e-01` | 0.81461722 | 46.67413 |
| ... | | | | |
| | `freqs[62] = 3.7625e-07` | `result[1][62] = 1 * 3.7625e-07 = 3.7625e-07` | 0.00000038 | 0.00002 |
| | `freqs[63] = 3.0547e-07` | `result[1][63] = 1 * 3.0547e-07 = 3.0547e-07` | 0.00000031 | 0.00002 |
| ... | | | | |
| `t[2] = 2` | `freqs[0] = 1.0000e+00` | `result[2][0] = 2 * 1.0000e+00 = 2.0000e+00` | 2.00000000 | 114.59155 |
| | `freqs[1] = 8.1250e-01` | `result[2][1] = 2 * 8.1250e-01 = 1.6250e+00` | 1.62923443 | 93.34825 |
| ... | | | | |
| | `freqs[62] = 3.7625e-07` | `result[2][62] = 2 * 3.7625e-07 = 7.5251e-07` | 0.00000075 | 0.00004 |
| | `freqs[63] = 3.0547e-07` | `result[2][63] = 2 * 3.0547e-07 = 6.1095e-07` | 0.00000061 | 0.00004 |
| ... | | | | |
| `t[4094] = 4094` | `freqs[0] = 1.0000e+00` | `result[4094][0] = 4094 * 1.0000e+00 = 4.0800e+03` | 4094.00000000 | 234568.90625 (normalized: -151.09375) |
| | `freqs[1] = 8.1250e-01` | `result[4094][1] = 4094 * 8.1250e-01 = 3.3120e+03` | 3335.04296875 | 191083.87500 (normalized: -76.12500) |
| ... | | | | |
| | `freqs[62] = 3.7625e-07` | `result[4094][62] = 4094 * 3.7625e-07 = 1.5335e-03` | 0.00154234 | 0.08837 |
| | `freqs[63] = 3.0547e-07` | `result[4094][63] = 4094 * 3.0547e-07 = 1.2436e-03` | 0.00125642 | 0.07199 |
| ... | | | | |
| `t[4095] = 4095` | `freqs[0] = 1.0000e+00` | `result[4095][0] = 4095 * 1.0000e+00 = 4.0800e+03` | 4095.00000000 | 234626.20312 (normalized: -93.79688) |
| | `freqs[1] = 8.1250e-01` | `result[4095][1] = 4095 * 8.1250e-01 = 3.3120e+03` | 3335.85742188 | 191130.54688 (normalized: -29.45312) |
| ... | | | | |
| | `freqs[62] = 3.7625e-07` | `result[4095][62] = 4095 * 3.7625e-07 = 1.5335e-03` | 0.00154272 | 0.08839 |
| | `freqs[63] = 3.0547e-07` | `result[4095][63] = 4095 * 3.0547e-07 = 1.2436e-03` | 0.00125673 | 0.07201 |
| ... | | | | |
from src/ml/operations_impl.go
func Outer(vec1 *Tensor, vec2 *Tensor) (*Tensor, error) {
if err := processErrors(
checkIsVector(vec1),
checkIsVector(vec2),
checkSameDataType(vec1, vec2),
); err != nil {
return nil, err
}
itemSize := vec1.DataType.ItemSize
result := NewEmptyTensor([]int{vec1.Size[0], vec2.Size[0]}, vec1.DataType)
for i := 0; i < vec1.Size[0]; i++ {
rowValF32, err := vec1.GetItemByOffset_AsFloat32(i * itemSize)
if err != nil {
return nil, err
}
for j := 0; j < vec2.Size[0]; j++ {
colValF32, err := vec2.GetItemByOffset_AsFloat32(j * itemSize)
if err != nil {
return nil, err
}
valF32 := rowValF32 * colValF32
if err := result.SetItem_FromFloat32([]int{i, j}, valF32); err != nil {
return nil, err
}
}
}
return result, nil
}
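For intuition, here is a tiny, dependency-free sketch of the same idea with plain slices instead of the project's Tensor type (the helper function `outer` is hypothetical):

```go
package main

import "fmt"

// outer is a plain-slice illustration of the outer product used above:
// result[i][j] = vec1[i] * vec2[j].
func outer(vec1, vec2 []float32) [][]float32 {
	result := make([][]float32, len(vec1))
	for i, rowVal := range vec1 {
		row := make([]float32, len(vec2))
		for j, colVal := range vec2 {
			row[j] = rowVal * colVal
		}
		result[i] = row
	}
	return result
}

func main() {
	positions := []float32{0, 1, 2}          // a few position indices (t)
	freqs := []float32{1.0, 0.8125, 0.66016} // the first few base angles (theta_i)
	for m, row := range outer(positions, freqs) {
		fmt.Println("position", m, "angles:", row) // row[i] = m * theta_i
	}
}
```

Each row of the result corresponds to a position m and each column to a base angle \(\theta_i\), so result[m][i] = m · \(\theta_i\), which is exactly what the table above shows.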
10.5. Calculating Frequency Tensor as Cis (Polar Coordinates)¶
cis is described at Wikipedia:
cis is a mathematical notation defined by cis x = cos x + i sin x, where cos is the cosine function, i is the imaginary unit and sin is the sine function. x is the argument of the complex number (angle between line to point and x-axis in polar form).

With this notation, we can express a point's location in the Cartesian coordinate system with the cosine and sine of one angle, which is called polar coordinates.

We've calculated the angles of our polar coordinate points as `freqs` in the previous subchapter.
import torch
def precompute_freqs_cis(dim: int, end: int, theta: float = 500000.0):
...
freqs_cis = torch.polar(torch.ones_like(freqs), freqs) # complex64
return freqs_cis
from src/model/llamatransformer.go
func precomputeFreqsCis(dim int, end int, theta float64, useScaled bool) (*ml.Tensor, error) {
...
ones, err := ml.OnesLike(freqs)
if err != nil {
return nil, err
}
freqs_cis, err := ml.Polar(ones, freqs)
if err != nil {
return nil, err
}
return freqs_cis, nil
}
We create a tensor which contains all `1` values, with the same shape `{4096, 64}` and data type as the `freqs` tensor, via ml.OnesLike(...). These `1` values will be the magnitudes of our vectors in the polar coordinate system. We use 1 for the magnitude to get a unit vector for each angle.

By calling the ml.Polar(ones, freqs) function, we get a complex tensor with shape `{4096, 64}`, whose items are built from the magnitudes in ones and the angles in freqs.
The Polar function is described in the PyTorch TORCH.POLAR documentation:
Constructs a complex tensor whose elements are Cartesian coordinates corresponding to the polar coordinates with absolute value abs and angle.
In our case, the `abs` argument is a tensor full of `1` values. The `angle` argument is our `freqs` variable.

For each item in the `angle` tensor:

- The cosine value of the angle is taken, multiplied by `absItemF64 = 1` (it is always 1 in our case), and set as the real part of the resulting complex number,
- The sine value of the angle is taken, multiplied by `absItemF64 = 1` (it is always 1 in our case), and set as the imaginary part of the resulting complex number.
Then, the `dst` tensor with the `DT_COMPLEX` data type has the cosine and sine values of our angles as complex numbers.
from src/ml/operations_impl.go
func Polar(abs *Tensor, angle *Tensor) (*Tensor, error) {
// See: (For formula) https://pytorch.org/docs/stable/generated/torch.polar.html
...
for readOffset := 0; readOffset < abs.GetBytesCount(); readOffset += absItemSize {
absItemF32, err := abs.GetItemByOffset_AsFloat32(readOffset)
if err != nil {
return nil, err
}
angleItemF32, err := angle.GetItemByOffset_AsFloat32(readOffset)
if err != nil {
return nil, err
}
absItemF64 := float64(absItemF32)
angleItemF64 := float64(angleItemF32)
realPart := absItemF64 * math.Cos(angleItemF64)
imagPart := absItemF64 * math.Sin(angleItemF64)
resultItem := complex64(complex(realPart, imagPart))
if err := dst.SetItemByOffset(writeOffset, resultItem); err != nil {
return nil, err
}
writeOffset += dstItemSize
}
return dst, nil
}
In our case, with `1` always as `absItemF64`:
| angleItemF64 | realPart | imagPart | resultItem |
|---|---|---|---|
| `freqs[0][0] = 0` | `cos(0) = 1` | `sin(0) = 0` | `(1 + 0i)` |
| `freqs[0][1] = 0` | `cos(0) = 1` | `sin(0) = 0` | `(1 + 0i)` |
| ... | | | |
| `freqs[1][0] = 1` | `cos(1) = 0.5403023` | `sin(1) = 0.84147096` | `(0.5403023 + 0.84147096i)` |
| ... | | | |
| `freqs[4095][63] = 0.0012436` | `cos(0.0012436) = 1` | `sin(0.0012436) = 0.0012436` | `(1 + 0.0012436i)` |
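As a small, dependency-free illustration of what these cis values are (not the project's code; scaling is omitted, so the values differ slightly from the scaled ones above):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// For one position and a few base angles theta_i, the precomputed entry is
	// cos(angle) + i*sin(angle), i.e. a unit complex number at angle = position * theta_i.
	const base = 500000.0
	const dim = 128.0
	pos := 1.0
	for _, val := range []float64{0, 2, 126} {
		theta := 1.0 / math.Pow(base, val/dim)
		angle := pos * theta
		cis := complex(math.Cos(angle), math.Sin(angle))
		fmt.Printf("val=%3.0f angle=%.8f rad cis=(%.7f%+.7fi)\n", val, angle, real(cis), imag(cis))
	}
}
```

For angle 1 (position 1, val 0), this prints cis ≈ (0.5403023 + 0.8414710i), matching the `freqs[1][0]` row in the table above.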
10.6. The result¶
Recap of what we have so far:
- An embedding layer with the shape of `{128256, 4096}`, which contains `128256` different token vectors that have `4096` dimensions each,
- A token embedding sequence which was calculated using the embedding layer. Our input tensor will have the shape of `{SequenceLength, 4096}`,
- `32` "attention heads" (according to `modelArgs.N_Heads = 32`). Each of our attention heads will have a dimension of `modelArgs.Dim / modelArgs.N_Heads`. In our case, it is `4096 / 32 = 128`. So our positional embedding tensors will have `128` dimensions,
- Because we have `32` attention heads and the dimension of each attention head is `128`, we will separate our `xq (queries)` matrix into 32 equal pieces; then, because we have `8` key/value heads, we will separate our `xk (keys)` matrix into 8 equal pieces, which also ends up with `128` in one dimension (see the small reshape sketch after this list). Then, the integration of the positional embeddings with the token embeddings is done on that `128` dimension,
- Suppose we have 5 tokens to encode, so we have the position indices of the tokens: `{1, 2, 3, 4, 5}`.
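As mentioned in the list above, here is a small, purely illustrative Go sketch of the head-splitting reshape (the function `splitHeads` is hypothetical and uses plain slices instead of the project's Tensor type):

```go
package main

import "fmt"

// splitHeads views a {seqLen, dim} matrix as {seqLen, nHeads, headDim} so that
// rotary embeddings can be applied per head on the last (headDim = 128) dimension.
func splitHeads(x [][]float32, nHeads int) [][][]float32 {
	headDim := len(x[0]) / nHeads
	out := make([][][]float32, len(x))
	for s, row := range x {
		heads := make([][]float32, nHeads)
		for h := 0; h < nHeads; h++ {
			heads[h] = row[h*headDim : (h+1)*headDim]
		}
		out[s] = heads
	}
	return out
}

func main() {
	seqLen, dim, nHeads := 5, 4096, 32
	x := make([][]float32, seqLen)
	for i := range x {
		x[i] = make([]float32, dim)
	}
	xq := splitHeads(x, nHeads)
	fmt.Println(len(xq), len(xq[0]), len(xq[0][0])) // 5 32 128
}
```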
After all of these processes, we will end up with a "positional encoding tensor" for a sequence having `5` positions, as follows:

Note: The \(\LaTeX\) support of the GitHub web app is limited and gives non-explanatory errors when you exceed a certain number of superscript/subscript/fraction notations. So, it was necessary to separate the biggest set notation into chunks.

The operation of the `applyScaling(...)` function is represented with `scl(...)`.
- Positional Encoding tensor for 5 positions, without converting to complex number:
- Positional Encoding tensor for 5 positions, after converting to complex number:
10.7. Visualized Form of Some Samples from The Frequency Tensor¶
The charts below aim to give you some insight into the values of the angles and the corresponding polar coordinates in the frequency tensor. The chart titles indicate which index ranges are taken as samples in each chart.
These charts are drawn to make it easy for you to compare changes between positions and dimensions.
You can check out the Python codes that create the sample data and charts used here with this Python Notebook: 10.BONUS-PRECOMPUTING-FREQUENCY-TENSOR.ipynb
3D charts of polar coordinates:
2D charts of angles: