I Bought a €2,500 AI Computer, and It Was a Mistake

by Felix Hoops, Venture Lead, Haven

Unified memory systems sound perfect until you're three months deep and realize you made the wrong call.

The truth about running local LLMs in 2026: unified memory systems like Apple Macs are the easy choice, but GPUs (Graphics Processing Units) are still king for inference (a.k.a. running LLMs). Here's everything I learned after dozens of hours of tweaking my setup, and several months of use. This is just the first article and more will follow with a full ready-to-copy config of my setup.

The Pitch That Sold Me

Here's the pitch I heard: Unified memory architecture solves the VRAM (Video RAM; usually in short supply on GPUs) bottleneck. Buy an AMD Strix Halo or Apple Silicon, and you get 64GB+ of RAM shared between CPU and GPU. Just plug in and run ~100B-parameter models, which is a crazy step up from the ~8B models you can run on an average RTX card.

It sounded perfect. Too perfect.

I bought an AMD Ryzen AI Max+ 395 system with 128GB for €1,700 at the start of the year. It's now selling for around €800 more. Financially, I'm a genius. Practically? I've spent the last three months wondering if I should have just bought Nvidia RTX GPUs.

But before you run out and buy hardware, let's talk about what you're actually getting into.

"We Have Opus at Home"

You have seen this on your feed. I guarantee it.

Yes, with every model release there will be hype posts claiming that this one is definitely Claude Opus-level. No, it is not. Let's set some realistic expectations.

Your local model stack is not going to fully replace Claude Opus 4.7. Or GPT-5.5. Or whatever frontier model comes out next week. Not today, not in six months. Maybe in a year, but by then frontier models will also have improved.

And that's fine.

I still use cloud models for software engineering. When I'm writing complex code, debugging issues, or doing deep architectural work, I'm on Claude. No shame in that.

But local models are my personal assistant. They handle the stuff where privacy matters, where I don't want to worry about tokens, where I want full control over what's running.

Here's what local gives you:

  • No token anxiety. You're not watching your balance drain every time you ask a question. Sure, you're paying for electricity, but that will likely cost a lot less.
  • Full privacy. Nothing leaves your machine. Your prompts, your data, your context. Depending on what you work on, this is a killer requirement or just a feel-good benefit.
  • Complete control. You know exactly which model is running, at what settings, with what parameters. Nobody can nerf it. Nobody can secretly change the quantization or the thinking effort.
  • Sovereignty. The more you adopt AI for your workflows, the more you grow dependent on the availability of inference. If you cannot at least fall back on your own inference, you may be in existential trouble if your cloud access dries up.

But here's the catch: for a local model to be enjoyable to work with, it has to fit fully in your VRAM. And that means you need serious hardware.

For agents especially, you need a dedicated machine running 24/7. Agents like Hermes or OpenClaw need constant availability. If your inference server is sleeping or busy, your agent can't work. A separate PC guarantees availability from any device and supports always-on agents.

The Hardware Fork: Red Pill or Blue Pill?

Perhaps even both?

You have two paths. They're fundamentally different, and choosing wrong means wasting a lot of money.

Blue Pill: Unified Memory (Apple Silicon, Nvidia DGX Spark or AMD Strix Halo)

This is the "wake up when the matrix tells you to" route. You take the easy path, accept the limitations, and live happily ever after. Mostly.

Apple Silicon pioneered this. You see a lot of people on X flexing their Mac Minis and Studios, so how bad of an idea could it be? I decided on AMD's Strix Halo (Ryzen AI Max+ 395) because the value proposition looked insane.

The unified memory architecture means 128GB of RAM is shared between CPU and GPU. No dedicated VRAM. By allowing up to 124GB of dynamic VRAM allocation in the bootloader, you gain access to an entirely different league of models in the 100B parameter range.
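To make "a different league" concrete, here is a rough sketch of how much memory the weights alone take. The 4.5 bits per weight (a typical 4-bit quantization with some overhead) and the 4GB allowance for context and runtime are my own placeholder assumptions, not measured values, so treat the result as a ballpark.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         memory_gb: float, overhead_gb: float = 4.0) -> bool:
    """True if weights plus a flat overhead allowance fit in the given memory.

    overhead_gb is a hypothetical placeholder for KV cache and runtime buffers.
    """
    return weight_footprint_gb(params_billion, bits_per_weight) + overhead_gb <= memory_gb


BITS_PER_WEIGHT = 4.5  # assumption: roughly a 4-bit quantization

models = [("122B MoE", 122), ("27B dense", 27)]
machines = [("Strix Halo, 124GB allocatable", 124), ("RTX 3090, 24GB", 24)]

for model_name, params in models:
    size = weight_footprint_gb(params, BITS_PER_WEIGHT)
    for machine_name, memory in machines:
        verdict = "fits" if fits(params, BITS_PER_WEIGHT, memory) else "does not fit"
        print(f"{model_name} (~{size:.0f}GB of weights) {verdict} on {machine_name}")
```

Under these assumptions the 122B model needs far more than a single 24GB card offers, while a 27B model fits on one with room left for context.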

On my Strix Halo, I'm getting ~20 tokens/sec on qwen3.5-122b-a10b with llama.cpp using the Vulkan backend. Prompt processing runs at roughly 130 tokens/sec. That sounds fast until you realize a single RTX 3090 with 24GB VRAM can run qwen3.6-27b at 8x the prompt processing speed and about 1.5x the token generation speed (and that is before you get into things like speculative decoding). Yes, this is not a completely fair comparison, but we will get back to that when talking about models.

The unified memory advantage? Massive VRAM at comparatively low prices. Also, you buy an all-in-one system that just works out of the box.

The disadvantage? Unified memory bandwidth is significantly lower than dedicated VRAM bandwidth. And for inference, bandwidth is the key factor. Compute power is not nearly as important as you may think. That means that unified memory systems are effectively locked to MoE (Mixture of Experts) models if you want any usable speed. Yes, I can run qwen3.6-27b on Strix Halo, but that results in <10 tokens/sec output, which is not what I would call usable.

MoE models use only a fraction of their parameters for each predicted token. The "a10b" in qwen3.5-122b-a10b means that the model uses only 10 billion of its total 122 billion parameters per token. Naturally, less data has to be read from memory for every token, which makes these models run faster than a dense model that uses all of its parameters every time.
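The bandwidth argument can be sketched as simple arithmetic: if every active weight has to be read from memory once per generated token, then memory bandwidth divided by the bytes read per token gives a ceiling on generation speed. The bandwidth figures below are nominal spec-sheet values I am assuming (about 256 GB/s for Strix Halo, about 936 GB/s for an RTX 3090), and the 4.5 bits per weight is again an assumed quantization. The model ignores context reads, compute, and software overhead, so real speeds land well below these ceilings.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_billion: float,
                         bits_per_weight: float = 4.5) -> float:
    """Upper bound on generated tokens/sec if each active weight is read once per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Nominal spec-sheet bandwidths (assumptions, not measurements from this article).
STRIX_HALO_GB_S = 256.0
RTX_3090_GB_S = 936.0

scenarios = [
    ("122B MoE with 10B active on Strix Halo", STRIX_HALO_GB_S, 10),
    ("27B dense on Strix Halo", STRIX_HALO_GB_S, 27),
    ("27B dense on RTX 3090", RTX_3090_GB_S, 27),
]

for label, bandwidth, active_params in scenarios:
    ceiling = decode_ceiling_tok_s(bandwidth, active_params)
    print(f"{label}: at most ~{ceiling:.0f} tokens/sec")
```

The absolute numbers are optimistic, but the ordering matches the experience described here: on the same unified memory, the MoE model with 10B active parameters has a much higher ceiling than the 27B dense model, and the same dense model has a far higher ceiling on the GPU.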

Red Pill: Dedicated GPU Rig

This is the "stay in Wonderland, see how deep the rabbit hole goes" route. You build a custom PC, deal with parts sourcing, power supplies, and cooling. But the higher memory bandwidth gets you performance that unified memory can't touch.

Multiple Nvidia cards. Maybe AMD if you enjoy debugging. This is the route where you get 2-3x the throughput for the same model.

The 122B model has only 10B active parameters and runs on Strix Halo. The 27B model running on one RTX 3090 is significantly faster. I like comparing these two models because they represent the best and fastest that each platform can currently run.

Why didn't I take this route initially? Because of two things:

  • I'd need to build an extra PC. Adding a GPU rig means a whole new build, power supply, case, cooling, the works. (Although you can use an external GPU for your first steps if you really want to.)
  • The cost for comparable VRAM is insane. A single RTX 3090 (currently the most cost-effective option) gives you 24GB VRAM and goes for roughly €1,000 (about the same in dollars) on the used market. To match the 128GB unified memory on Strix Halo, you'd need to stack a lot of cards. And that costs significantly more than the Strix Halo did.
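The VRAM math from the second point, as a quick sketch using only the prices quoted in this article. It counts the GPUs alone; the motherboard, power supply, case, and cooling needed to host them are left out, so the real gap would be wider.

```python
import math


def cards_needed(target_gb: int, gb_per_card: int) -> int:
    """Smallest number of cards whose combined VRAM reaches the target."""
    return math.ceil(target_gb / gb_per_card)


TARGET_GB = 128           # unified memory on the Strix Halo system
GB_PER_3090 = 24          # VRAM per RTX 3090
EUR_PER_USED_3090 = 1000  # rough used-market price quoted above
STRIX_HALO_EUR = 1700     # what I paid

cards = cards_needed(TARGET_GB, GB_PER_3090)
gpu_cost = cards * EUR_PER_USED_3090

print(f"{cards} cards for {cards * GB_PER_3090}GB of VRAM: about €{gpu_cost:,}")
print(f"That is about €{gpu_cost - STRIX_HALO_EUR:,} more than the Strix Halo, GPUs only")
```

That comes to six cards and roughly €6,000 before you have bought anything to plug them into.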

But here's what I learned after three months: for inference performance, one RTX 3090 would have been the better choice. Not because it can run bigger models. It can't run qwen3.5-122b in any quantization. It is better because the smaller models that run on an RTX 3090 are surprisingly useful, and they run significantly faster than a comparable model on unified memory.

I don't regret buying the Strix Halo. It's now extra real estate for large MoE models and STT/TTS models I can load on demand while my RTX 3090 is serving a stable one-model config. But for pure LLM inference in general? GPUs are king.

Using GPUs is a bit harder. But if you're serious about local inference, it's the path that actually delivers. The software stack, models, and exact configs I landed on are a lot to cover, so I have already started writing them up for my next post.

One Last Thing

Buy a used RTX 3090 yesterday.
