Unsloth has published a practical run guide for Mistral Medium 3.5, and the useful part is not the launch sparkle. It is the table. In Unsloth’s Mistral 3.5 local inference documentation, the new model is described as Mistral-Medium-3.5-128B: a dense 128B-parameter, multimodal, hybrid reasoning model with text and image input, text output, and a 262,144-token maximum context length. That sounds like a cloud model. Unsloth’s pitch is that you can run it locally. The asterisk is doing bicep curls.

The recommended memory targets are the story: about 64GB total memory for a 3-bit quant, 80GB for 4-bit, and 128GB to 170GB for 8-bit. Total memory means RAM plus VRAM, or unified memory on machines like high-end Macs. Translation: this is local in the same way a home espresso machine can be local. Yes, it fits in your kitchen. No, the cheap one from the dorm room is not invited.
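A quick back-of-the-envelope check shows why those numbers land where they do. The bits-per-weight figures below are ballpark values for common GGUF quants, not Unsloth's exact Dynamic recipes, and they cover the weights only; KV cache, image inputs, and runtime overhead come on top.

```bash
# Rough weight-only sizing for a dense 128B-parameter model.
# Bits-per-weight values are approximations for typical GGUF quants,
# not Unsloth's exact Dynamic recipes.
for bpw in 3.5 4.5 8.5; do
  gb=$(echo "scale=1; 128 * ${bpw} / 8" | bc)
  echo "${bpw} bits/weight -> ~${gb} GB of weights"
done
```

Add headroom for context and runtime and you land roughly on the 64GB, 80GB, and 128GB-plus tiers in the guide.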

That distinction matters because “open model” coverage keeps flattening two different claims into one. One claim is access: weights, quantized builds, GGUFs, llama.cpp support, and the ability to run outside a vendor API. The other claim is practicality: whether the machine on your desk can produce usable speed, context length, multimodal behavior, and tool-heavy agent loops without turning into a space heater with opinions. Mistral Medium 3.5 clears more of the first bar than the closed labs would like. The second bar is where physics starts collecting rent.

Unsloth recommends Dynamic 4-bit GGUFs as the starting point for local inference, specifically its `unsloth/Mistral-Medium-3.5-128B-GGUF` build, with paths through Unsloth Studio and llama.cpp. That is the right practical framing. A 128B dense model is not a cute weekend download for an 8GB laptop. It is a serious local deployment candidate for people with high-memory desktops, workstations, Mac Studio-class unified memory, or server hardware they control. Good enough, cheap enough, and available without an API contract still adds up to a meaningful shift. It just is not free of hardware consequences.
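For the curious, that path looks roughly like the sketch below: a minimal llama-server start, assuming a recent llama.cpp build with Hugging Face download support. The quant tag is illustrative, so check the `unsloth/Mistral-Medium-3.5-128B-GGUF` repo for the exact Dynamic 4-bit filename before committing to a 70GB-plus download.

```bash
# Minimal llama-server start, assuming a recent llama.cpp build.
# The :Q4_K_XL tag is a placeholder for whichever Dynamic 4-bit file
# the unsloth/Mistral-Medium-3.5-128B-GGUF repo actually ships.
llama-server \
  -hf unsloth/Mistral-Medium-3.5-128B-GGUF:Q4_K_XL \
  --ctx-size 32768 \
  --n-gpu-layers 99 \
  --jinja
```

Tune `--n-gpu-layers` to what your VRAM can actually hold; the layers that stay on the CPU side are exactly why the guide talks about total memory rather than VRAM alone.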

The mode switch is also worth noticing. Unsloth says Mistral Medium 3.5 supports an instant instruct mode and a high reasoning mode. For llama.cpp or llama-server, the guide uses `--chat-template-kwargs '{"reasoning_effort":"high"}'` for complex prompts, coding, research, math, and agentic use, and `reasoning_effort="none"` for fast replies, extraction, chat, and simpler instructions. This is the part builders should not hand-wave. If a model exposes reasoning as an operational setting, you should treat it like a cost, latency, and quality knob, not a personality trait.
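Concretely, the switch is one extra flag on the same server command. The only piece taken from the guide here is the kwargs flag itself; the rest of the invocation mirrors the sketch above and carries the same placeholder quant tag.

```bash
# High reasoning effort for coding, research, math, and agent loops.
# Pair it with --jinja so the chat-template kwargs actually take effect.
llama-server \
  -hf unsloth/Mistral-Medium-3.5-128B-GGUF:Q4_K_XL \
  --jinja \
  --chat-template-kwargs '{"reasoning_effort":"high"}'

# For fast replies, extraction, chat, and simple instructions, swap in:
#   --chat-template-kwargs '{"reasoning_effort":"none"}'
```

Budget accordingly: high effort means more generated tokens per answer, and on local hardware that shows up directly as wall-clock time.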

The sensible test is not “can it answer one impressive prompt?” It is: can it stay useful across the work you actually want local control over? Codebase exploration. Private document extraction. Long-context review. Image-plus-text analysis. Agent runs where you would rather not ship every intermediate artifact to a hosted frontier API. Mistral Medium 3.5 is interesting because those are exactly the categories Unsloth points toward. It is less interesting as yet another leaderboard beast wandering through launch week with a glossy cape.

There are caveats, and they are refreshingly concrete. Unsloth warns that available memory should exceed the size of the quantized model, even though llama.cpp can fall back to partial RAM or disk offload at a speed penalty. Long context, larger batches, tool-heavy runs, and image prompts all need more memory. It also says multimodal or vision GGUFs do not currently work in Ollama, because the vision component ships as a separate `mmproj` file that Ollama does not handle, so llama.cpp-compatible backends are the path if vision matters, as sketched after this paragraph. And there is a very specific landmine: do not use CUDA 13.2, because Unsloth says it may produce gibberish outputs while NVIDIA works on a fix. That is not a philosophical caveat. That is the kind of caveat that saves a weekend.
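For vision, going through llama.cpp's own multimodal tooling rather than Ollama looks roughly like this, assuming a recent llama.cpp build with multimodal support and that the GGUF repo ships the separate `mmproj` file. Both filenames below are placeholders, not confirmed names.

```bash
# Hypothetical vision run via llama.cpp's multimodal CLI.
# Both filenames are placeholders; use the actual model and mmproj
# files published alongside the GGUF build.
llama-mtmd-cli \
  -m Mistral-Medium-3.5-128B-Q4_K_XL.gguf \
  --mmproj mmproj-F16.gguf \
  --image invoice.png \
  -p "Extract every line item and the total as JSON."
```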

The practical buying advice is simple. If you have 32GB of memory and want a local assistant, this is probably not your daily driver. Look at smaller models, or use Mistral Medium 3.5 through a hosted route when it appears in the places you already pay for. If you have 64GB unified memory, the 3-bit path may be a testable curiosity, but expect tradeoffs. If you have 80GB or more and you care about private local workflows, the Dynamic 4-bit route is where this becomes a legitimate experiment. If you have 128GB-plus, you are finally in the part of the map where the phrase “local 128B model” stops sounding like a prank.

For teams, the better question is not whether this replaces a frontier API. It is where local ownership changes the workflow. A local 128B model can be useful when data sensitivity, latency predictability, offline access, or cost control matter more than always having the absolute strongest model. It can also be useful as a secondary worker: run extraction, draft patches, inspect documents, or pre-chew long context locally, then escalate only the hard judgment calls to a hosted model. Open models do not have to win every benchmark to be useful. They have to make the expensive cloud call less necessary.
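As a sketch of that split, the snippet below asks a local llama-server instance, assumed to be listening on its default OpenAI-compatible endpoint at localhost:8080, to do the pre-chewing. The hosted escalation step is whatever API your team already pays for and is deliberately left out.

```bash
# Local pre-processing against llama-server's OpenAI-compatible API.
# Endpoint and port are llama-server defaults; adjust if you changed them.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user",
           "content": "Summarize this contract and list only the clauses that need expert review."}
        ]
      }' | jq -r '.choices[0].message.content'
# Only the flagged clauses, not the whole document, go on to the hosted model.
```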

The catch is maintenance. Local inference means you own the weird parts: quant choice, backend updates, GPU drivers, chat templates, context settings, memory pressure, throughput, and the little incompatibilities that show up only after you invite the model into real work. Translated into budget terms: the API bill disappears from one line item and reappears as your time, your hardware, and your tolerance for logs.

Still, this is the right kind of open-model news for Useful Machines readers because it is not abstract model worship. It tells builders what to download, what memory class they need, what backend to use, what mode switch matters, and what not to touch if they value their afternoon. Mistral Medium 3.5 being locally runnable does not mean local frontier-ish AI is suddenly casual. It means the frontier of “local” has moved up to machines with serious memory. That is progress. Just measure the desk before you buy the dream.

In short

Unsloth’s Mistral 3.5 run guide turns a model launch into a hardware reality check: this is open local inference, not laptop magic.