Local 13B LLM “ChatGPT” running on M2 Max GPU


Running a local “ChatGPT” on an M2 Max is quite fun. The Text generation web UI project makes it really easy to install and run Large Language Models (LLMs) like LLaMA. As my previous article suggests, you can definitely run it on a Mac.

Let’s go ahead and get Text generation web UI installed!

Text generation web UI in a chat mode running speechless-llama2-hermes-orca-platypus-wizardlm-13b.Q4_K_M.gguf model

Getting one prerequisite installed

Honestly, the whole process is rather easy and straightforward. But there is one unmentioned prerequisite for the project: Rust. It’s not a direct dependency, but one of the Python libraries that the project uses needs it to be available. The installer should automatically install everything else that you need.

I use brew on my Mac. It makes it really easy to install packages like Rust. If you need to install brew, I wrote about it in my article on setting up a dev environment on your MacBook.

To install Rust with brew, simply run the following command.

brew install rust

Installing Text generation web UI on macOS

With that, you should be set and ready to install Text generation web UI.

Firstly, clone the one-click installer’s repository: https://github.com/oobabooga/text-generation-webui#one-click-installers.

git clone https://github.com/oobabooga/text-generation-webui.git

You can also clone it on an external SSD if you don’t have enough space on your main disk. Just make sure the external disk does not have a space in its volume name.
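Since a space in the volume name breaks the install, a quick hypothetical check (not part of the project) can catch a bad location before you clone:

```python
# The installer chokes on paths containing spaces; check the target first.
from pathlib import Path

def safe_install_path(p: str) -> bool:
    """True if no component of the path contains a space."""
    return " " not in str(Path(p))

print(safe_install_path("/Volumes/ExtSSD/llm"))   # True
print(safe_install_path("/Volumes/Ext SSD/llm"))  # False
```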

To install, run the start_macos.sh script.

./start_macos.sh


*Note: This may change again in the future as the project is very much active and ever evolving. If it does not work, visit the git repo and check the latest installation steps there.*

On its first run, it will install everything that the project needs. When it finishes, you will get a prompt asking which GPU model you have on your Mac. Choose the M1 GPU option for the Apple M2 Max (or any other M1/M2-based machine).

After installation, it will start the web interface without an LLM model loaded. The next thing to do is to download one from Hugging Face. At this point, I would just shut down Text generation web UI and complete the next few steps.

Downloading a pre-trained large language model

Any GGUF-based model *should* work. Text generation web UI’s interface provides a way to download models directly from Hugging Face. At the moment, the model I am using is speechless-llama2-hermes-orca-platypus-wizardlm-13b.Q4_K_M.gguf. It was recommended on the r/LocalLLaMA subreddit as one that’s pretty good. Download it into the models directory.

Text Generation Web UI models location

To automatically load the model and optimise it for the M2 Max, I added the necessary flags to the CMD_FLAGS.txt file.

echo '--model speechless-llama2-hermes-orca-platypus-wizardlm-13b.Q4_K_M.gguf --api --n-gpu-layers 1 --threads 8 --n_ctx 4096 --mlock' >> CMD_FLAGS.txt
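Before launching, it can be handy to double-check what ended up in CMD_FLAGS.txt. A quick sanity check (a hypothetical helper, not part of the project) that parses a flags line into flag/value pairs so a typo is easy to spot:

```python
# Parse a CMD_FLAGS.txt-style line into a dict of flag -> value pairs.
def parse_flags(line: str) -> dict:
    tokens = line.split()
    flags, i = {}, 0
    while i < len(tokens):
        flag = tokens[i]
        # a flag's value is the next token, unless that token is another flag
        if i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
            flags[flag] = tokens[i + 1]
            i += 2
        else:
            flags[flag] = True  # a boolean switch like --api or --mlock
            i += 1
    return flags

print(parse_flags("--model foo.gguf --api --threads 8 --mlock"))
# {'--model': 'foo.gguf', '--api': True, '--threads': '8', '--mlock': True}
```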

The full list of settings can be found here, but here’s a summary of the settings I used.

--model speechless-llama2-hermes-orca-platypus-wizardlm-13b.Q4_K_M.gguf: Load the speechless-llama2-hermes-orca-platypus-wizardlm-13b.Q4_K_M.gguf model by default
--api: Enable the API extension. I use this for my Workato-based integration experimentation for work
--n-gpu-layers 1: Number of layers to offload to the GPU. I don’t remember where I found this recommendation, but setting it to 1 will make use of all 38 GPU cores.
--threads 8: Number of threads to use. I matched it with the M2 Max’s 8 performance cores
--n_ctx 4096: Set the size of the prompt context. Llama 2 has a 4096-token context length. Source: https://agi-sphere.com/context-length/
--mlock: Force the system to keep the model in RAM
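With --api enabled, the running server can be called from scripts. Here’s a minimal sketch; it assumes the OpenAI-compatible completions endpoint that recent versions of the project expose on port 5000, but the exact route and port vary by version, so check the project’s wiki for your install:

```python
import json
import urllib.request

# Assumed endpoint for a local text-generation-webui started with --api;
# older versions used a different route, so adjust to match your version.
API_URL = "http://127.0.0.1:5000/v1/completions"

def build_payload(prompt: str, max_tokens: int = 200) -> dict:
    """Assemble the JSON body for a completion request."""
    return {"prompt": prompt, "max_tokens": max_tokens, "temperature": 0.7}

def generate(prompt: str) -> str:
    """POST the prompt to the local server and return the completion text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]
```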

At this point, you’re ready to start Text generation web UI again using the ./start_macos.sh script. Here you can see that it used about 11GB of the GPU’s RAM.
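That ~11GB figure is roughly what you’d expect for a 13B model at this quantisation. A back-of-envelope estimate, assuming Q4_K_M averages about 4.8 bits per weight (an approximation, not an exact figure):

```python
# Rough estimate of the weight memory for a 13B model at Q4_K_M quantisation.
params = 13e9            # 13 billion parameters
bits_per_weight = 4.8    # approximate average for Q4_K_M (assumption)
weights_gb = params * bits_per_weight / 8 / 1e9
print(round(weights_gb, 1))  # 7.8 -- KV cache and runtime buffers account for the rest
```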

Text generation web UI started with the 13B model loaded in the GPU RAM

Open the local URL shown in the terminal in your browser to load the web UI.

Text generation web ui new chat window

You can confirm that the settings in CMD_FLAGS.txt are loaded correctly in the Model tab.

Text generation web ui model tab

The following screenshot shows how the GPU is fully utilised during inference while the CPU cores run pretty light.

GPU fully utilised during inference while the CPU cores run light

20+ tokens per second

In the video below, you can see how our Local “ChatGPT” on M2 Max performs.

Speechless-Llama2-Hermes-Orca-Platypus-WizardLM-13B GGUF LLM model on an M2 Max with a 38-core GPU and 64GB RAM

Granted, this is nowhere close to high-end setups that can generate hundreds of tokens per second. But as you can see, at 20+ tokens/sec, it feels fast enough, like a human responding in real time. It’s good enough for me to build experiments around LLMs and process automations at work.
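To put 20+ tokens/sec in perspective, here’s some rough arithmetic, assuming about 1.3 tokens per English word (a common rule of thumb, not a measured figure):

```python
# Rough arithmetic: how long a chat reply takes at a given generation speed.
def reply_seconds(words: int, tokens_per_sec: float = 20.0,
                  tokens_per_word: float = 1.3) -> float:
    """Estimate generation time for a reply of `words` words."""
    return words * tokens_per_word / tokens_per_sec

print(round(reply_seconds(100), 1))  # 6.5 -- a 100-word reply in about six seconds
```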

All in all, I’ve been quite impressed with the M2 Max and how it’s proven to be a decent machine for running an LLM locally.

If this post has been useful, support me by buying me a latte or two 🙂