
LLaMA Nuts and Bolts


Welcome!

This documentation website is a customized version of the original documentation of the LLaMA Nuts and Bolts repository. You can find the running Go implementation of the project's code in this repository.

A holistic way of understanding how LLaMA and its components run in practice, with code and detailed documentation. "The nuts and bolts" (practical side instead of theoretical facts, pure implementation details) of required components, infrastructure, and mathematical operations without using external dependencies or libraries.

This project intentionally doesn't support GPGPU (such as NVIDIA CUDA or OpenCL) or SIMD because it doesn't aim to be a production application, for now. Instead, the project relies on CPU cores to perform all mathematical operations, including linear algebraic computations. To increase performance, the code has been optimized as much as necessary, utilizing parallelization via goroutines.
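As an illustration of the goroutine-based parallelization mentioned above, here is a minimal sketch of a matrix-vector product whose rows are split across worker goroutines. The function name `matVecParallel` and the row-major `[]float32` layout are assumptions made for this example; they are not taken from the project's code.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// matVecParallel computes y = A*x for a row-major matrix a with the given
// number of rows and columns. Rows are divided into contiguous chunks and
// each chunk is processed by its own goroutine.
// Hypothetical helper for illustration only.
func matVecParallel(a []float32, rows, cols int, x []float32) []float32 {
	y := make([]float32, rows)
	if rows == 0 {
		return y
	}
	workers := runtime.NumCPU()
	if workers > rows {
		workers = rows
	}
	chunk := (rows + workers - 1) / workers // ceiling division

	var wg sync.WaitGroup
	for start := 0; start < rows; start += chunk {
		end := start + chunk
		if end > rows {
			end = rows
		}
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			// Each goroutine writes to a disjoint range of y, so no lock is needed.
			for r := start; r < end; r++ {
				var sum float32
				for c, v := range a[r*cols : (r+1)*cols] {
					sum += v * x[c]
				}
				y[r] = sum
			}
		}(start, end)
	}
	wg.Wait()
	return y
}

func main() {
	a := []float32{1, 2, 3, 4, 5, 6} // 2x3 matrix
	x := []float32{1, 0, -1}
	fmt.Println(matVecParallel(a, 2, 3, x)) // prints [-2 -2]
}
```

Splitting by rows keeps every output element owned by exactly one goroutine, which avoids synchronization beyond the final `wg.Wait()`.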

LLaMA Nuts and Bolts Screen Recording GIF, captured while the application was running on an Apple MacBook Pro with the M1 chip. Predefined prompts within the application were executed. The GIF is played at 20x speed.

💭 WHY THIS PROJECT?

This project was developed for educational purposes only and has not been tested for production or commercial usage. The goal is to make an experimental project that can perform inference on the LLaMa 2 7B-chat model completely outside of the Python ecosystem. Throughout this journey, the aim is to acquire knowledge and shed light on the abstracted internal layers of this technology.

This is an intentional journey of literally reinventing the wheel. As you read along, you will navigate toward the target with a deductive flow and encounter the same stops and obstacles I encountered.

If, like me, you are curious about how LLMs (Large Language Models) and transformers work, and you have delved into conceptual explanations and schematic drawings in the sources but hunger for a deeper understanding, then this project is perfect for you too!

📐 MODEL DIAGRAM

The whole flow of the LLaMa 7B-Chat model without abstraction:

Complete Model Diagram

🎯 COVERAGE

Because no existing libraries were used (except the built-in packages and a few helpers), all of the required functions were implemented by this project in the style of Go. However, since the main goal of this project is to do inference only on the LLaMa 2 7B-chat model, the functionality fulfills only the requirements of this specific model, no more and no less, because the goal of the project is not to be a production-level tensor framework.

The project provides a CLI (command line interface) application allowing users to choose from predefined prompts or write custom prompts. It then performs inference on the model and displays the generated text on the console. The application supports "streaming," enabling immediate display of generated tokens on the screen without waiting for the entire process to complete.
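The streaming behavior described above can be sketched with a Go channel: a producer goroutine sends each token as soon as it is available, and the consumer displays it immediately. The names `generate` and `stream`, and the use of a fixed token list in place of real model inference, are assumptions made for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// generate stands in for the inference loop: it sends each token on the
// returned channel as soon as it is "produced", then closes the channel.
// Hypothetical; a real loop would compute tokens from the model.
func generate(tokens []string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for _, t := range tokens {
			out <- t
		}
	}()
	return out
}

// stream calls onToken for every token as it arrives and returns the
// complete generated text once the channel is closed.
func stream(tokens <-chan string, onToken func(string)) string {
	var sb strings.Builder
	for t := range tokens {
		onToken(t)
		sb.WriteString(t)
	}
	return sb.String()
}

func main() {
	// Print each token immediately instead of waiting for the full text.
	text := stream(generate([]string{"Hello", ",", " world", "!"}), func(t string) {
		fmt.Print(t)
	})
	fmt.Printf("\n(%d bytes generated)\n", len(text))
}
```

Using an unbuffered channel means the producer and the display step stay in lockstep, so a token appears on screen before the next one is generated.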

The topics covered are laid out in the chapters of this documentation.

📦 INSTALLATION and BUILDING

Installation and building instructions are described in the GitHub README.

💻 RUNNING

Run the project by executing the go run ... command or by running the compiled executable. For higher performance, it is recommended to build the project first and run the resulting executable without virtualization.

When you run the project, you will see the following screen. It prints the summary of the loading process of model files and a summary of model details.

First start of the application

Printing Model Metadata

If you select the first item in the menu by pressing the 0 key and ENTER, the application prints the metadata of the LLaMa 2 7B-chat model on the console:

Printing metadata 1 Printing metadata 2

Executing a Prompt

In addition to selecting one of the predefined prompts in the menu, you can select one of the last two items (Other, manual input) to enter a custom prompt.

With the [Text completion] choices, the model is used only to perform the text completion task. New tokens are generated according to the input prompt text.

With the [Chat mode] choices, the application surrounds the prompt with "[INST]" and "[/INST]" strings to mark it as an instruction prompt. It also surrounds the system prompt part with <<SYS>>\n and \n<</SYS>>\n\n strings to mark that part as a system prompt.

In the end, a chat mode prompt string looks like the following:

"[INST] <<SYS>>
Always answer with emojis
<</SYS>>

How to go from Beijing to NY? [/INST]"

The output of this prompt looks like the following (it consists of emojis with their names and Unicode escape sequences):

Example emoji output
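The chat mode wrapping described above can be sketched as a small Go function. The name `buildChatPrompt` is hypothetical, and the handling of an empty system prompt is an assumption of this sketch rather than documented behavior of the application.

```go
package main

import "fmt"

// buildChatPrompt wraps a user prompt in "[INST]" / "[/INST]" markers and,
// when a system prompt is given, places it between "<<SYS>>\n" and
// "\n<</SYS>>\n\n" ahead of the user prompt.
// Hypothetical helper for illustration only.
func buildChatPrompt(systemPrompt, userPrompt string) string {
	if systemPrompt == "" {
		// Assumption: with no system prompt, only the instruction markers are added.
		return "[INST] " + userPrompt + " [/INST]"
	}
	return "[INST] <<SYS>>\n" + systemPrompt + "\n<</SYS>>\n\n" + userPrompt + " [/INST]"
}

func main() {
	// %q prints the string with escape sequences so the newlines are visible.
	fmt.Printf("%q\n", buildChatPrompt("Always answer with emojis", "How to go from Beijing to NY?"))
}
```

Called with the system prompt and question from the example above, the function reproduces that prompt string exactly.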

🧱 ASSUMPTIONS

Fully compliant, generic, production-ready, and battle-tested tensor frameworks should support a wide range of platforms, acceleration devices and processors, use cases, conversions between data types, and so on.

In the LLaMA Nuts and Bolts scenario, some assumptions have been made to focus only on the required set of details.

| Fully compliant applications/frameworks | LLaMA Nuts and Bolts |
| --- | --- |
| Use existing robust libraries to read/write file formats, perform calculations, etc. | This project aims to reinvent the wheel, so it doesn't use any existing library. It implements everything it requires, precisely as much as necessary. |
| Should support a wide range of different data types and perform calculations between different typed tensors in an optimized and performant way. | Has limited flexibility, covering only the required operations. |
| Should support a wide range of different file formats. | Has limited support for only the required file formats, with only the required instructions. |
| Should support the top-k, top-p, and temperature concepts of LLMs (Large Language Models) to randomize the outputs, explained here. | This project intentionally doesn't support randomized outputs; it just gives the outputs that have the highest probability. |
| Should support different acceleration technologies such as NVIDIA CUDA, OpenCL, Metal Framework, AVX2 instructions, and ARM Neon instructions, which enable GPGPU or SIMD (Single instruction, multiple data) usage. | This project intentionally doesn't support GPGPU or SIMD because it doesn't aim to be a production application, for now. |

However, for a few days I experimented with ARM Neon instructions on my MacBook Pro M1. It worked successfully with the float32 data type, but the CPU cycles required to convert BFloat16 to float32 negated the time saved by ARM Neon.
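To make the BFloat16 to float32 conversion mentioned above concrete: a BFloat16 value has the same layout as the upper 16 bits of an IEEE 754 float32, so widening it is a 16-bit left shift. The sketch below stores BFloat16 values as raw `uint16` bits; the function names are hypothetical and the narrowing direction uses plain truncation, which is an assumption of this example rather than a statement about the project's code.

```go
package main

import (
	"fmt"
	"math"
)

// bfloat16ToFloat32 widens a BFloat16 value, given as its raw 16 bits,
// to float32 by placing those bits in the upper half of a 32-bit word.
// Hypothetical helper for illustration only.
func bfloat16ToFloat32(b uint16) float32 {
	return math.Float32frombits(uint32(b) << 16)
}

// float32ToBFloat16Truncate narrows a float32 to BFloat16 bits by dropping
// the lower 16 mantissa bits (truncation, no rounding).
func float32ToBFloat16Truncate(f float32) uint16 {
	return uint16(math.Float32bits(f) >> 16)
}

func main() {
	fmt.Println(bfloat16ToFloat32(0x3F80)) // prints 1
	fmt.Println(bfloat16ToFloat32(0xC000)) // prints -2
}
```

Even though each conversion is only a shift, it has to run for every element that is read, which is why it can add up across large weight tensors.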

Also, I realized that the Go compiler doesn't support 2-byte floats, even though I tried using CGO, so I gave up on this issue. If you're curious about it, you can check out the single commit on the experiment branch arm_neon_experiment.
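The deterministic output selection listed among the assumptions above (always taking the output with the highest probability instead of sampling with top-k, top-p, or temperature) can be sketched as an argmax over the model's output scores. The function name `argmax` and the example values are assumptions made for this illustration.

```go
package main

import "fmt"

// argmax returns the index of the largest value in logits, or -1 if the
// slice is empty. On ties, the first occurrence wins.
// Hypothetical helper for illustration only.
func argmax(logits []float32) int {
	if len(logits) == 0 {
		return -1
	}
	best := 0
	for i := 1; i < len(logits); i++ {
		if logits[i] > logits[best] {
			best = i
		}
	}
	return best
}

func main() {
	// Each index stands for a token id; the value is that token's score.
	logits := []float32{0.1, 2.5, -1.0, 2.4}
	fmt.Println(argmax(logits)) // prints 1
}
```

Because softmax preserves ordering, picking the largest raw score selects the same token as picking the largest probability, and the same prompt always yields the same output.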

⭐ CONTRIBUTING and SUPPORTING the PROJECT

You are welcome to create issues to report any bugs or problems you encounter. At present, I'm not sure whether this project should be expanded to cover more concepts or not. Only time will tell 😊.

If you liked my project and found it helpful and valuable, I would greatly appreciate it if you could give the repo a star ⭐ on GitHub. Your support and feedback not only help the project improve and grow but also contribute to reaching a wider audience within the community. Additionally, it motivates me to create even more innovative projects in the future.

📖 REFERENCES

I want to thank the contributors of the awesome sources that were referred to during the development of this project and the writing of this documentation. You can find these sources below, as well as between the lines in the code and documentation.

You can find a complete and categorized list of references in the 19. REFERENCES chapter of this documentation.

The following resources are the most crucial ones, but checking out the 19. REFERENCES chapter is also recommended:

📜 LICENSE

LLaMA Nuts and Bolts is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.