Jun 19, 2026

DwarfStar and the New Altitude of Local LLMs

I have been running Ollama for about two years. Right now I have gemma4:31b sitting on a MacBook Pro, M4 Pro, 48GB of unified memory, and it is genuinely useful. It drafts, it summarizes, it answers questions about code without a single packet leaving the laptop. Small open-weight models got good. That part of the story is settled.

But notice the word: small. For two years the local story has been about clever small models. You traded raw capability for privacy, offline use, and zero per-token cost. You ran the thing that fit. What you could not do was run a near-frontier model on a machine you actually own.

That is the line DwarfStar crosses. What you can run locally moved from a tiny model on a phone to a near-frontier model on a high-end workstation. Not a 7B distillation pretending to be smart. The real target is DeepSeek V4.

I should say up front: my 48GB machine is below the 96GB floor DwarfStar needs, so I have not run it. Everything here is me reading the code description, the README, the model card, the author's posts, and other people's measurements. I am an interested observer of the trend, not someone with hands-on ds4 benchmarks.

Who built it and why that matters

DwarfStar (the author also writes it "DwarfStar 4", and the binaries are ds4, ds4-server, ds4-agent) is the work of Salvatore Sanfilippo, better known as antirez. If you have touched a backend in the last decade you have touched his code: he created Redis in 2009, wrote the overwhelming majority of it, and was its BDFL for about eleven years before stepping back in 2020 and returning in late 2024.

That lineage matters here because DwarfStar is built like Redis was built. It is a from-scratch native inference engine in pure C, with Metal, CUDA, and ROCm backends. It is not a wrapper around llama.cpp or Ollama. "ds4" reads as "DeepSeek 4", and that name is the whole philosophy: this is not a general-purpose runner for every model family. It is one engine, implementing one bespoke quantization recipe, for one model family. He laid out the reasoning on his blog.

A quick detour through MoE

To understand why this works at all, you need the shape of the model. DeepSeek V4 is a Mixture-of-Experts (MoE) model. Instead of one dense network where every parameter fires on every token, the weights are split into many small "expert" networks plus a router that, for each token, picks a handful of experts to actually run.

DeepSeek V4 Flash, the primary ds4 target, has 284B total parameters but only 13B active per token. It uses the top 6 of 256 routed experts plus 1 shared expert that always runs, with a context window around 1M tokens. There is also a V4 PRO at 1.6T total, 49B active, 384 routed experts across 61 layers, for machines with very high memory.
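To make the routing concrete, here is a minimal toy sketch of one MoE step for a single token: a router scores every routed expert, the top 6 of 256 run, and one shared expert always runs. The experts here are stand-in scaling functions, the names are mine, and softmax over the selected logits is an assumed gating choice; the real model's gating function may differ.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, router_w, routed_experts, shared_expert, top_k):
    """One toy MoE feed-forward step for a single token vector x."""
    # Router: one logit per routed expert (dot product of x with each router row).
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_w]
    # Keep only the top_k experts by logit.
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Turn the chosen logits into mixing weights (assumed gating, see lead-in).
    gates = softmax([logits[i] for i in chosen])
    # The shared expert always runs; the chosen routed experts are added on top.
    out = shared_expert(x)
    for gate, idx in zip(gates, chosen):
        y = routed_experts[idx](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, chosen

def make_scaling_expert(scale):
    # Stand-in for a real expert network: just scales its input.
    return lambda x: [scale * xi for xi in x]

random.seed(0)
DIM, N_EXPERTS, TOP_K = 4, 256, 6
router_w = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_EXPERTS)]
routed = [make_scaling_expert(i / N_EXPERTS) for i in range(N_EXPERTS)]
shared = make_scaling_expert(1.0)

token = [0.5, -1.0, 0.25, 2.0]
out, chosen = moe_layer(token, router_w, routed, shared, TOP_K)
print(len(chosen), "of", N_EXPERTS, "routed experts ran for this token")
```

The other 250 expert functions are never called for this token, which is the property everything else in this post leans on.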

The consequence is the whole point. The overwhelming majority of the weight sits in the routed experts, and any given token only touches a few of them. The parts that run on every token, the attention projections, the router that picks the experts, the embeddings, the output head, are comparatively small. Most of the model is the part that is conditionally used. A small part of the model is the part that is always in the decision loop. That asymmetry is the opening DwarfStar exploits.
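The asymmetry is easy to put numbers on. This is only arithmetic on the figures quoted above, nothing measured by me:

```python
# Figures quoted in the text for DeepSeek V4 Flash and PRO.
flash_total, flash_active = 284e9, 13e9
pro_total, pro_active = 1.6e12, 49e9
routed_experts, experts_per_token = 256, 6

flash_active_fraction = flash_active / flash_total
pro_active_fraction = pro_active / pro_total
routed_fraction_used = experts_per_token / routed_experts

print(f"Flash: {flash_active_fraction:.1%} of parameters active per token")
print(f"PRO:   {pro_active_fraction:.1%} of parameters active per token")
print(f"Flash: {routed_fraction_used:.1%} of routed experts chosen per token")
```

Under 5% of Flash's parameters do work on any given token, and only 6 of the 256 routed experts are picked.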

The asymmetric quant: 2/8 bit

Quantization means storing weights at lower precision to shrink the model. The usual approach quantizes everything uniformly and accepts a uniform quality hit. DwarfStar does something more careful, and this is the core idea worth understanding.

It quantizes asymmetrically. Only the routed MoE experts, which are the bulk of the weights, get crushed down to roughly 2 bits: the up and gate projections at IQ2_XXS, the down projection at Q2_K. Everything that decides what to do is left at high precision:

Shared experts, attention projections, and the output head at Q8_0
The router (ffn_gate_inp) at F16
Embeddings at F16
Norms and biases at F32

antirez calls it a "2/8 bit" recipe. The README puts it plainly: only the routed MoE experts are quantized, and the other components are left untouched to guarantee quality. The model card says the same thing from the other direction: keeping the decision-making components at Q8_0 preserves model behavior, and crushing the experts buys the size.
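As a sketch, the recipe is just a lookup from tensor role to storage type. Only ffn_gate_inp is a name taken from the source; the other substrings follow common GGUF naming conventions and are my assumption, as is the idea that ds4 selects types by name at all. The bits-per-weight figures are the ggml block sizes as I understand them.

```python
# Approximate storage cost of each format, in bits per weight (assumed from ggml block layouts).
BITS_PER_WEIGHT = {"IQ2_XXS": 2.0625, "Q2_K": 2.625, "Q8_0": 8.5, "F16": 16.0, "F32": 32.0}

# (substring of tensor name, quant type); first match wins.
# Every substring except "ffn_gate_inp" is a hypothetical name for illustration.
RECIPE = [
    ("ffn_gate_inp", "F16"),       # router
    ("ffn_up_exps", "IQ2_XXS"),    # routed experts: up projection
    ("ffn_gate_exps", "IQ2_XXS"),  # routed experts: gate projection
    ("ffn_down_exps", "Q2_K"),     # routed experts: down projection
    ("token_embd", "F16"),         # embeddings
    ("norm", "F32"),               # norms
    ("bias", "F32"),               # biases
]
DEFAULT = "Q8_0"  # shared experts, attention projections, output head

def quant_type_for(tensor_name):
    for needle, qtype in RECIPE:
        if needle in tensor_name:
            return qtype
    return DEFAULT

# Average cost of a routed expert, assuming up, gate, and down hold equal parameter counts.
routed_bpw = (2 * BITS_PER_WEIGHT["IQ2_XXS"] + BITS_PER_WEIGHT["Q2_K"]) / 3

for name in ["blk.3.ffn_down_exps.weight", "blk.3.ffn_gate_inp.weight", "blk.3.attn_q.weight"]:
    print(name, "->", quant_type_for(name))
print("routed experts average", routed_bpw, "bits per weight")
```

The "2" in "2/8 bit" is that routed-expert average; the "8" is the default branch.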

The intuition is clean once you map it onto how the model computes. A router operating on garbled, 2-bit inputs would pick the wrong experts, and then it does not matter how precise those experts are. Attention deciding what to attend to on 2-bit math would smear the whole forward pass. But an individual expert, called occasionally, contributing one slice of a sum over several experts, can tolerate being coarse. Each token's output is a weighted sum over several experts anyway, so each expert's error is scaled down by its gate weight and, to the extent the errors are independent, they partly cancel. You spend your precision budget where a wrong bit changes the trajectory, and you save it where a wrong bit is one noisy term in a sum.
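The "noisy term in a sum" argument can be checked with a toy simulation. This is not a model of real quantization error: it assumes each expert's error is independent, zero-mean Gaussian noise and that the six experts are mixed with equal weights, neither of which is guaranteed in practice. Under those assumptions theory says the mixed error should be about 1/sqrt(6), roughly 0.41, of a single expert's error.

```python
import random
import statistics

def rms_error_of_mix(n_experts, noise_std, trials, rng):
    """RMS error of an equal-weight mix of n_experts outputs, each carrying independent noise."""
    squared = []
    for _ in range(trials):
        err = sum(rng.gauss(0.0, noise_std) for _ in range(n_experts)) / n_experts
        squared.append(err * err)
    return statistics.fmean(squared) ** 0.5

rng = random.Random(42)
single = rms_error_of_mix(1, 1.0, 20000, rng)
mixed = rms_error_of_mix(6, 1.0, 20000, rng)
ratio = mixed / single
print(f"single expert RMS error: {single:.3f}")
print(f"six-expert mix RMS error: {mixed:.3f}")
print(f"ratio: {ratio:.3f}")
```

No such averaging protects the router: a small error in its logits can flip which experts are selected, which is a discrete change rather than a slightly noisier sum.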

The GGUF quants land at about 80.8GB for the 2-bit build and about 153.3GB for the 4-bit build. That 80.8GB number is why a 96GB Mac is the entry point.
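A back-of-the-envelope check on those file sizes, assuming the file holds all 284B parameters, that GB means 10^9 bytes, and ignoring metadata. The memory left over is what the OS, the KV cache, and everything else has to share.

```python
TOTAL_PARAMS = 284e9  # DeepSeek V4 Flash, as quoted above

def bits_per_weight(file_gb, params, bytes_per_gb=1e9):
    """Average bits stored per parameter for a file of the given size."""
    return file_gb * bytes_per_gb * 8 / params

bpw_2bit = bits_per_weight(80.8, TOTAL_PARAMS)
bpw_4bit = bits_per_weight(153.3, TOTAL_PARAMS)
headroom_gb = 96 - 80.8  # RAM left on the entry-level machine after loading weights

print(f"2-bit build: {bpw_2bit:.2f} bits per weight on average")
print(f"4-bit build: {bpw_4bit:.2f} bits per weight on average")
print(f"headroom on a 96GB machine: {headroom_gb:.1f} GB")
```

The average lands a little above 2 bits because the always-on components are stored at much higher precision, but they are a small share of the total.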

The hardware and cost reality

Here is where the altitude shift becomes concrete. DwarfStar runs starting from machines with 96GB of RAM (less if you stream weights off an SSD, at a speed cost). The README's tiers: 96 to 128GB for an imatrix-tuned 2-bit quant, 256GB or more for 4-bit, 512GB for the PRO at 2-bit. Named hardware includes a MacBook Pro M3 Max 128GB, an M5 Max 128GB, a Mac Studio M3 Ultra 512GB, the NVIDIA DGX Spark GB10 128GB, and the Framework Desktop on Strix Halo.
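The tiers reduce to a small lookup. This is my paraphrase of the thresholds quoted above, not anything ds4 ships, and it ignores the SSD-streaming path.

```python
# (minimum RAM in GB, build that fits), in the order the tiers are listed above.
TIERS = [
    (96, "Flash, 2-bit (imatrix-tuned)"),
    (256, "Flash, 4-bit"),
    (512, "PRO, 2-bit"),
]

def builds_that_fit(ram_gb):
    """Return every build whose quoted RAM floor is at or below ram_gb."""
    return [name for min_gb, name in TIERS if ram_gb >= min_gb]

for ram in (48, 128, 512):
    print(ram, "GB ->", builds_that_fit(ram) or "nothing fits in RAM")
```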

Throughput is real, not a tech demo. On an M5 Max 128GB, antirez reports about 463 tokens/s prefill on Flash. On a Mac Studio M3 Ultra 512GB running the much larger PRO, about 138 tokens/s prefill and about 9.5 tokens/s generation.
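To turn those rates into wall-clock time, here is some rough arithmetic. The prompt and reply sizes are my own illustrative choices, and the calculation assumes throughput stays constant as context grows, which is optimistic.

```python
def seconds_for(tokens, tokens_per_second):
    return tokens / tokens_per_second

# Reported rates: 463 tok/s prefill (Flash, M5 Max), 138 tok/s prefill and 9.5 tok/s generation (PRO, M3 Ultra).
flash_prefill_100k = seconds_for(100_000, 463)
pro_prefill_100k = seconds_for(100_000, 138)
pro_generate_1k = seconds_for(1_000, 9.5)

print(f"Flash, 100k-token prompt: {flash_prefill_100k / 60:.1f} min to prefill")
print(f"PRO, 100k-token prompt:   {pro_prefill_100k / 60:.1f} min to prefill")
print(f"PRO, 1k-token reply:      {pro_generate_1k / 60:.1f} min to generate")
```

So a very long prompt costs minutes, not seconds, before the first output token appears.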

On cost, the README itself lists none. The numbers come from antirez's follow-up post on distributed inference: about 6 to 7k for an M5 Max 128GB laptop running Flash, which he calls one of the best deals, and about 12k total for a Mac Studio M3 Ultra 512GB running PRO, which he frames as a frontier model at home. He does not state a currency. He is based in Italy, so euros are as plausible as dollars; read the figures as approximate magnitudes, not as a quote in any particular currency. The smaller number buys the Flash laptop; the larger one buys the full PRO rig. Do not collapse those two.

How good is it, really

antirez describes Flash as "quasi-frontier", and he is careful about the hedge: "DeepSeek V4 Flash feels quasi-frontier. The PRO is even better. Both resist 2 bit quantization very well." Simon Willison independently called DeepSeek V4 "almost on the frontier". Quasi-frontier, not frontier. The distinction is the honest one.

The reception has been loud and mostly positive: roughly 13,000 GitHub stars in about a month. The praise clusters around two things people did not expect to survive 2-bit experts: reliable tool calling and usable long context. The mood is "it feels close", and nobody can tell you exactly how close.

Because here is what does not exist, as far as I can find: a reproducible, published quality benchmark comparing the 2-bit expert quant against full DeepSeek V4 Flash. No perplexity, no KL divergence, no GPQA, no MMLU, no coding eval in any accessible source. "Does not feel dumber" is a real signal from real users, but it is not a measurement, and the degradation versus full Flash stays unquantified.
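For reference, the two missing measurements are not exotic. A minimal sketch of both, with toy inputs: KL divergence compares the full model's next-token distribution against the quantized model's at the same position, and perplexity summarizes how much probability a model gave the true tokens of a text. The function names are mine; a real evaluation would average the KL over many positions of real text.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(ref_logits, quant_logits):
    """KL(ref || quant) in nats for one token position."""
    ref_lp = log_softmax(ref_logits)
    quant_lp = log_softmax(quant_logits)
    return sum(math.exp(r) * (r - q) for r, q in zip(ref_lp, quant_lp))

def perplexity(token_log_probs):
    """exp of the mean negative log-probability assigned to the true tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Toy logits over a 3-token vocabulary: identical models give 0, diverging ones give more.
same = kl_divergence([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
drifted = kl_divergence([1.0, 2.0, 3.0], [1.2, 1.9, 2.7])
flipped = kl_divergence([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
print(same, drifted, flipped)

# A model that gives every true token probability 0.25 has perplexity 4.
ppl = perplexity([math.log(0.25)] * 4)
print(ppl)
```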

Limitations and open questions

Quality loss is unproven. The theory of why crushing experts should be safe is sound and the vibes are good, but vibes are not a benchmark. Until someone runs perplexity or KL divergence against full Flash, "resists 2-bit very well" is a claim, not a result.
The 96GB memory wall is real. The 2-bit Flash quant is 80.8GB. My 48GB M4 Pro cannot load it, and most people's machines cannot either. The floor is a high-end workstation, not the laptop most people own.
Currency on the cost figures is unstated. Treat the 6 to 7k and the 12k as rough magnitudes, not invoices, and remember the smaller figure is the Flash laptop while the 12k is the PRO rig.
Single-model lock-in. DwarfStar is a bespoke engine for one model family with one quant recipe. That focus is exactly why it works as well as it does, but it is the opposite of Ollama's general-purpose design. If you want to run Gemma 4 or anything else, this is not your tool. And if DeepSeek's next release changes the architecture, the recipe has to be redone.
I have not run it. For a 48GB machine like mine, a small Gemma 4 variant on Ollama runs on-device on Apple Silicon today and costs nothing. Different tool, different job.

My take

The framing that matters is altitude, not novelty. Local inference is old news. What is new is that the model you can run locally is no longer obviously a compromise. antirez argues, and I read this as his opinion rather than settled fact, that open-weight models are closing the gap on the closed labs while access to the strongest models gets more controlled. If that trend holds, owning a near-frontier model outright, with no API in the loop and no data leaving the box, stops being a hobbyist curiosity and becomes a real option for people who care about privacy and access.

The clever part of DwarfStar is not that it runs a big model on a Mac. It is the choice of where to spend precision. An MoE model hands you a natural seam: most of the weight is in parts that are conditionally used and partly redundant, and a small minority is in the parts that steer every token. Quantize along that seam and you get an 80GB file that people say does not feel dumb. What it needs now is the boring part nobody has done: real eval numbers against full Flash. Until then I will keep running gemma4:31b on my 48GB machine and watch this one closely, because quantizing a model along the lines of how it actually computes is the right idea even if the proof is still missing.