Local 13B LLM “ChatGPT” running on M2 Max GPU
Running a local “ChatGPT” on the M2 Max is quite fun. The Text generation web UI project makes it really easy to install and run Large Language Models (LLMs) like LLaMA. As my previous article suggests, you can definitely run one on a Mac.
Let’s go ahead and get Text generation web UI installed!

Getting one prerequisite installed
Honestly, the whole process is rather easy and straightforward, but there is one unmentioned prerequisite: Rust. It’s not a direct dependency of the project; rather, one of the Python libraries the project uses needs it to be available. The installer should automatically install everything else that you need.
I use brew on my Mac; it makes it really easy to install packages like Rust. If you need to install brew, I wrote about it here in my article on setting up a dev environment on your MacBook.
To install Rust with brew, simply run the following command.
brew install rust
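Once brew is done, you can quickly verify that the Rust toolchain is available before moving on:
# Check that the Rust compiler and Cargo are on your PATH
rustc --version
cargo --version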
Installing Text generation web UI on macOS
With that, you should be set and ready to install Text generation web UI.
Firstly, clone the one-click installer’s repository: https://github.com/oobabooga/text-generation-webui#one-click-installers.
git clone https://github.com/oobabooga/text-generation-webui.git
You can also clone it on an external SSD if you don’t have enough space on your main disk. Just make sure the external disk does not have a space in its volume name.
To install, run the start_macos.sh script:
./start_macos.sh
*Note: This may change again in the future as the project is very much active and ever-evolving. If it does not work, visit the Git repo and check the latest installation steps there.*
On its first run, it will install everything that the project needs. When it finally finishes, you will get a prompt asking which GPU your Mac has. Choose the M1 GPU option for the Apple M2 Max (or any other M1- or M2-based machine).
Upon installation, it will start the web interface without an LLM model loaded. The next thing to do is to download one from Hugging Face. At this point, I would just shut down Text generation web UI and complete the next few steps.
Downloading a pre-trained large language model
Any GGUF-based model *should* work. Text generation web UI’s interface provides a way to download models directly from Hugging Face. At the moment, the model that I am using is speechless-llama2-hermes-orca-platypus-wizardlm-13b.Q4_K_M.gguf. This was recommended on the r/LocalLLaMA subreddit as one that’s pretty good. Download it into the models directory.
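If you’d rather grab the file from the command line than through the web UI, here’s a minimal sketch using huggingface-cli (part of the huggingface_hub Python package). The repository ID below is my assumption of where this quantised model is hosted, so double-check the exact repo on Hugging Face first.
# Assumption: the GGUF file lives in TheBloke's quantised repo; verify the repo ID on Hugging Face
pip install -U huggingface_hub
huggingface-cli download TheBloke/Speechless-Llama2-Hermes-Orca-Platypus-WizardLM-13B-GGUF \
  speechless-llama2-hermes-orca-platypus-wizardlm-13b.Q4_K_M.gguf \
  --local-dir models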

To automatically load the model and optimise it for the M2 Max, I added the necessary flags to the CMD_FLAGS.txt file.
echo '--model speechless-llama2-hermes-orca-platypus-wizardlm-13b.Q4_K_M.gguf --api --n-gpu-layers 1 --threads 8 --n_ctx 4096 --mlock' >> CMD_FLAGS.txt
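You can double-check what ended up in the file with:
cat CMD_FLAGS.txt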

The full list of settings can be found here, but here’s a summary of the settings I used.
| Setting | Description |
|---|---|
| --model speechless-llama2-hermes-orca-platypus-wizardlm-13b.Q4_K_M.gguf | Load the speechless-llama2-hermes-orca-platypus-wizardlm-13b.Q4_K_M.gguf model by default |
| --api | Enable the API extension. I use this for my Workato-based integration experiments at work (see the API sketch after this table) |
| --n-gpu-layers 1 | Number of layers to offload to the GPU. I don’t remember where I found this recommendation, but setting it to 1 makes use of all 38 GPU cores. |
| --threads 8 | Number of threads to use. I matched it to the M2 Max’s 8 performance cores |
| --n_ctx 4096 | Set the size of the prompt context. Llama 2 has a 4096-token context length. Source: https://agi-sphere.com/context-length/ |
| --mlock | Force the system to keep the model in RAM |
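Since --api is enabled, the server also listens for API requests on port 5000 by default. The exact routes depend on your version: recent builds expose an OpenAI-compatible API, while older ones use the project’s own endpoints, so treat the request below as a rough sketch and check the project’s API docs for your build.
# Hypothetical example against the OpenAI-compatible endpoint of recent builds
curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Summarise what a GGUF model is in one sentence."}], "max_tokens": 200}'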
At this point, you’re ready to load up Text generation web UI again using the ./start_macos.sh script. Here you can see that it used up about 11GB of the GPU’s RAM.

Go to http://127.0.0.1:7860 in your browser to load up the web UI.

You can confirm that the settings in CMD_FLAGS.txt are loaded correctly in the Model tab.

The following screenshot shows how the GPU is fully utilised during LLM inference while the CPU cores stay relatively light.
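If you prefer to watch this live from the terminal, macOS’s built-in powermetrics tool can sample GPU activity while the model is generating (it needs sudo):
# Sample GPU usage once per second while the LLM is responding
sudo powermetrics --samplers gpu_power -i 1000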

20+ tokens per second
In the video below, you can see how our local “ChatGPT” on the M2 Max performs.
Granted, this is nowhere close to high-end setups that can generate hundreds of tokens per second. But as you can see, at 20+ tokens/sec it feels fast enough, like a human responding in real time. It’s good enough for me to build experiments around LLMs and process automations at work.
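To put that in perspective, at 20 tokens/sec a 200-token reply (roughly 150 words) streams in about 10 seconds, which is why it reads like a real-time conversation rather than a batch job.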
All in all, I’ve been quite impressed with the M2 Max and how it’s turned out to be a decent machine for running an LLM locally.
If this post has been useful, support me by buying me a latte or two 🙂
