Getting My AI Agent to Run Faster on My M1 MBP
If your text responses are lagging, you've likely hit the "architecture wall."
Even with 64GB of RAM, your M1 is currently wrestling with two factors that make Gemma 4 feel slow compared to older models:
1. The "First-Month" Performance Bug
Gemma 4 launched only about 40 days ago (April 2026). It fits comfortably in your 64GB of memory, but early benchmarks suggest the current version of Ollama has a performance bug with it.
- The Issue: Instead of offloading all the work to your M1’s GPU cores, Ollama is currently running much of Gemma 4 on the CPU.
- The Result: You're probably getting 2–5 tokens per second (the "slow crawl") instead of the 15–20 tokens per second your M1 Max is capable of.
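To see whether the CPU fallback described above is actually happening on your machine, you can look at the PROCESSOR column of `ollama ps` while the model is loaded. Here is a minimal sketch that parses that output. The helper name `gpu_percent` and the model tag in the usage note are hypothetical, and the column format is assumed to be `100% GPU`, `100% CPU`, or a split such as `48%/52% CPU/GPU`.

```python
import re

# Assumed PROCESSOR formats: "100% GPU", "100% CPU", or "48%/52% CPU/GPU".
_PROCESSOR = re.compile(r"(?:(\d+)%/(\d+)% CPU/GPU)|(?:(\d+)% (GPU|CPU))")


def gpu_percent(ps_output: str) -> dict[str, int]:
    """Map each loaded model name to the share of it running on the GPU (0-100)."""
    result: dict[str, int] = {}
    for line in ps_output.splitlines()[1:]:  # first line is the column header
        if not line.strip():
            continue
        name = line.split()[0]
        match = _PROCESSOR.search(line)
        if match is None:
            continue  # unrecognized row; skip rather than guess
        if match.group(1) is not None:
            # Split form: first number is CPU share, second is GPU share.
            result[name] = int(match.group(2))
        else:
            share = int(match.group(3))
            result[name] = share if match.group(4) == "GPU" else 100 - share
    return result
```

To use it, pass in the text of `subprocess.run(["ollama", "ps"], capture_output=True, text=True).stdout`. A value well below 100 for your Gemma 4 entry (for example a hypothetical `gemma4:26b` tag) would be consistent with the slow crawl described above.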
2. The M1 vs. M5 "Processing" Gap
Your 64GB of RAM gives you plenty of capacity, but speed depends on bandwidth: the M1's memory bandwidth is like a high-speed local road, while the new M5 (released in March) is a multi-lane superhighway.
- M1 Max: ~400 GB/s bandwidth.
- M5 Max: ~614 GB/s bandwidth.
- The Impact: Token generation speed is limited mostly by how fast the chip can read the model's weights from RAM, so an M5 Max should generate roughly 50% faster (614 / 400 ≈ 1.5) even if both machines have 64GB of memory.
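The bandwidth numbers above translate into a rough speed ceiling: if every generated token requires reading the weights once, tokens per second cannot exceed bandwidth divided by the bytes read. The sketch below works that out. The 18 GB weight size is a made-up illustrative figure, not a measured size for Gemma 4, and real throughput will land below this ceiling.

```python
def max_tokens_per_second(bandwidth_gb_s: float, weights_read_gb: float) -> float:
    """Upper bound on generation speed when each token reads the weights once."""
    if weights_read_gb <= 0:
        raise ValueError("weights_read_gb must be positive")
    return bandwidth_gb_s / weights_read_gb


M1_MAX_GB_S = 400.0   # from the figures above
M5_MAX_GB_S = 614.0   # from the figures above
ASSUMED_WEIGHTS_GB = 18.0  # hypothetical quantized model size, for illustration only

m1_ceiling = max_tokens_per_second(M1_MAX_GB_S, ASSUMED_WEIGHTS_GB)
m5_ceiling = max_tokens_per_second(M5_MAX_GB_S, ASSUMED_WEIGHTS_GB)
speedup = m5_ceiling / m1_ceiling  # independent of the assumed model size

print(f"M1 Max ceiling: {m1_ceiling:.1f} tok/s")
print(f"M5 Max ceiling: {m5_ceiling:.1f} tok/s")
print(f"M5 Max / M1 Max: {speedup:.2f}x")
```

Note that the ratio between the two chips does not depend on the model size you plug in; only the absolute ceilings do.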
How to speed things up right now:
Since you’re a power user with Ghostty and zsh, try these three tweaks to bypass the lag:
- Switch to the "26B MoE" version: Gemma 4 comes in a "Mixture of Experts" (MoE) flavor. Instead of using all 26 billion parameters for every token, it activates only about 4 billion at a time, so far fewer weights have to be read from memory per token. It's significantly faster on M1 hardware and still very capable for coding.
- Use the MLX Framework: Since you’re on Apple Silicon, skip Ollama for a moment and try the MLX version of Gemma 4. Apple’s MLX team usually ships highly optimized kernels, which can run new models up to about twice as fast on M1 chips as generic runners.
- Check your "Context" size: If OpenCode is feeding thousands of your Obsidian notes into Gemma 4's 256k context window, the "prefill" (the time before the first word appears) will take forever. Try limiting the agent to only look at the specific folder you're working in.
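For the context tweak above, one way to think about "only the specific folder" is as a token budget. Here is a rough sketch that picks Markdown notes from a single folder (no subfolders) until an estimated budget is used up. The function names and the 4-characters-per-token rule of thumb are assumptions for illustration, and this does not show how OpenCode itself is configured.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough rule of thumb, not Gemma 4's real tokenizer


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN


def select_notes(folder: str, token_budget: int) -> tuple[list[Path], int]:
    """Pick .md files directly inside `folder` that fit within `token_budget`.

    Files are visited in name order; a note that would exceed the budget is
    skipped, and smaller notes later in the order may still be included.
    """
    chosen: list[Path] = []
    used = 0
    for path in sorted(Path(folder).glob("*.md")):  # non-recursive on purpose
        cost = estimate_tokens(path.read_text(encoding="utf-8", errors="ignore"))
        if used + cost > token_budget:
            continue
        chosen.append(path)
        used += cost
    return chosen, used
```

Calling `select_notes("path/to/current-project", 8000)` returns the notes that fit plus the estimated tokens used, which gives you a feel for how much prefill work you are asking for before the first word appears.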
One quick check for your "Waiting" problem:
When you ask a question, is there a long pause before it starts typing, or does the typing itself look like a slow typewriter? The first is a "Context" problem; the second is a "Bandwidth" problem.
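If you'd rather measure than eyeball it, Ollama's API reports timing fields with each finished response: `prompt_eval_duration` and `eval_duration` (both in nanoseconds) plus `eval_count`. The sketch below turns those into the same two-way diagnosis. The function name and both thresholds are arbitrary assumptions you should tune to your own patience.

```python
NS_PER_SECOND = 1_000_000_000


def diagnose(stats: dict, slow_prefill_s: float = 5.0, slow_gen_tps: float = 8.0) -> dict:
    """Classify a finished Ollama response as a context or bandwidth problem.

    `stats` is the final JSON object of a response; durations are nanoseconds.
    The two thresholds are illustrative defaults, not official guidance.
    """
    prefill_s = stats.get("prompt_eval_duration", 0) / NS_PER_SECOND
    eval_s = stats.get("eval_duration", 0) / NS_PER_SECOND
    gen_tps = stats.get("eval_count", 0) / eval_s if eval_s > 0 else 0.0

    long_pause = prefill_s > slow_prefill_s        # slow before the first word
    slow_typing = 0 < gen_tps < slow_gen_tps       # slow once it is typing

    if long_pause and slow_typing:
        verdict = "both"
    elif long_pause:
        verdict = "context"
    elif slow_typing:
        verdict = "bandwidth"
    else:
        verdict = "ok"
    return {"prefill_seconds": prefill_s, "tokens_per_second": gen_tps, "verdict": verdict}
```

For example, a response that spent 60 seconds on the prompt but then generated at 20 tokens per second comes back as "context", which points at the notes being fed in rather than at the chip.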