On-Device LLM Inference in 2026: M3/M4 Macs, Jetson, and Mobile

March 22, 2026 · Editorial Team · 8 min read · on-device-ai local-llm inference

Running a language model on your own hardware has gone from "technically possible if you're patient" to "actually practical" over the past 18 months. The hardware got better, the runtimes got more optimized, and the models got smarter at smaller parameter counts. All of that converged at roughly the same time.

This post covers what real on-device inference looks like across three hardware categories: Apple Silicon Macs (M3 and M4), NVIDIA Jetson for edge deployments, and mobile phones. Real numbers, real constraints, no theoretical maximums.

Why run locally at all

Before getting into hardware specifics, it's worth being clear about why you'd bother. The cloud is fast, cheap enough for moderate use, and requires no setup. Local inference makes sense in specific situations:

Privacy: Your data never leaves your machine. If you're processing medical records, legal documents, or proprietary code, that matters.

Latency: A locally running 7B model on fast hardware responds in under 200ms. Cloud API calls typically take 500ms-3 seconds. For interactive applications, that's noticeable.

Cost at scale: Cloud inference at $5-15 per million tokens adds up fast at high volumes. After the hardware cost, local inference is essentially free.

Air-gapped environments: Certain infrastructure, research, government, and industrial deployments can't route traffic to external APIs. Local inference is the only option.

Tinkering: Some people just want to understand what they're running, and there's value in that.

Apple Silicon: M3 and M4

Apple's unified memory architecture is what makes Mac an interesting platform for local inference. The CPU and GPU share the same physical memory pool, which means a 16GB M3 MacBook Pro has 16 gigabytes of unified memory that both the model weights and the GPU computation can access directly. There's no PCIe bandwidth bottleneck between system RAM and a discrete GPU. For LLM inference, that architectural choice turns out to be quite good.

M3 MacBook Pro (16GB)

The 16GB base config is the minimum useful configuration. You can run 7B models comfortably and quantized 13B models with some compromise.

With llama.cpp (the most common runtime), Llama 3.3-8B at Q4_K_M quantization (a 4-bit compression that trades a small amount of quality for roughly 50% smaller weight files) runs at about 60-80 tokens per second on a base M3. The model takes about 4.5GB of memory at that quantization, leaving the rest for the OS and other applications. Prompt processing (filling the context with input text) is faster than generation, typically 200-400 tokens per second.

At 16GB you can't comfortably run much above 8B parameters unless you drop to very aggressive 2-bit quantization, which hurts quality noticeably. 7B-8B is the sweet spot.

A session with Mistral-7B-Instruct using Ollama on a base M3 feels snappy for personal use. Response times in the 2-5 second range for a paragraph, which is perfectly acceptable.

M3 Pro / M3 Max (36GB, 48GB, 96GB)

The M3 Pro with 36GB and M3 Max with 48GB or 96GB change the equation significantly. At 36GB you can run 30B class models at Q4 quantization. At 48GB, a Q4 Llama-2-70B fits. At 96GB on a maxed-out M3 Max, you can run 70B models at reasonable quality with comfortable headroom.

Speed scales roughly with the memory bandwidth. The M3 Max has 400 GB/s of memory bandwidth versus the base M3's ~100 GB/s. That difference shows up directly in token generation speed.

M3 Max with 96GB running Llama 3.3-70B at Q4_K_M: expect around 18-25 tokens per second. That's meaningful, real-time generation for interactive use.

M4 and M4 Pro (released late 2024)

M4 base chips show meaningful improvements over M3 in inference performance. The neural engine got upgraded and llama.cpp with Metal optimizations now uses it more effectively. Practically, M4-based MacBooks show roughly 15-25% faster token generation at equivalent quantization levels compared to equivalent M3 configs.

M4 Pro with 24GB runs 13B models comfortably and shows noticeably better performance per dollar than comparable M3 Pro configs. If you're buying new hardware specifically for local inference, M4 Pro with 24GB is currently the best value configuration.

Runtimes: what you'll actually use

llama.cpp with Ollama on top is the most common setup. Ollama handles model downloading, management, and serving with an OpenAI-compatible API endpoint that most frameworks can talk to directly. Point Cursor, Open WebUI, or any LangChain-based app at localhost:11434 and it works. The setup takes about 10 minutes.

LM Studio is the GUI option if you want a graphical interface. Download models from Hugging Face through the app, pick quantization levels, and run them. Less configuration than Ollama but also less flexible.

MLX (Apple's ML framework) is worth knowing about for Apple Silicon specifically. Apple built MLX with their hardware in mind, and for certain model architectures it outperforms llama.cpp. The MLX-LM library on Hugging Face has optimized versions of popular models. For Phi-4, the MLX versions are noticeably faster than llama.cpp at the same quantization.

NVIDIA Jetson

Jetson is NVIDIA's family of edge computing boards designed for AI inference in embedded and deployed environments. The current relevant hardware is the Jetson Orin line.

Jetson Orin NX 16GB (~$450-500)

The Orin NX 16GB is the practical workhorse for edge LLM deployment. It has 16GB of shared CPU/GPU memory (similar concept to Apple's unified memory), a 1024-core Ampere GPU, and draws about 15-25 watts under load. That power profile means you can run it from a small UPS, embed it in a cabinet, or deploy it in settings where 200-watt GPU cards aren't practical.

On the Orin NX 16GB, Llama 3.3-8B at Q4 quantization runs at about 15-25 tokens per second. That's slower than a Mac, but the Jetson is designed for environments where a Mac would be impractical: a factory floor, a vehicle, a retail kiosk, a medical device.

The typical deployment stack on Jetson uses NVIDIA's TensorRT-LLM runtime, which compiles model weights into an optimized binary for the specific Jetson hardware. Setup is more complex than Ollama (NVIDIA's documentation is workable but not beginner-friendly), but the performance benefit over llama.cpp is real, often 30-50% faster generation.

Jetson Orin AGX 64GB (~$2,000): The large end of the Jetson line. 64GB of unified memory, a 2048-core Ampere GPU. Can run 30B models at Q4 quantization with usable generation speeds. Used in autonomous vehicles, high-end robotics, and demanding industrial inference applications. At $2,000 for the development kit (or $1,000 in volume), it's not cheap for a single unit, but for a deployed system it's often the right choice over paying cloud inference costs indefinitely.

Mobile: Android and iOS

Mobile inference is the hardest category because the constraints are genuinely tight. You've got at most 6-8GB of memory that the OS will let an app use, thermal limits that kick in after a few minutes of sustained computation, and battery draw that users notice.

What actually runs on phones

Sub-3B models are the current practical upper bound for most phones. Phi-4-mini at 3.8B parameters, heavily quantized to 2-3 bits, runs at about 10-15 tokens per second on a recent flagship Android (Snapdragon 8 Gen 3) or iPhone 15 Pro/16. That's just barely fast enough for interactive use.

Gemma-3-2B (the smallest Gemma-3 variant) runs at about 20-30 tokens per second on the same hardware, with noticeably more comfortable memory headroom.

Apple's own on-device models in iOS 18+ use 3B-class models for system features. The performance they get by writing models specifically for Apple Neural Engine, with custom kernels and tight OS integration, is better than what third-party apps can achieve through public APIs. Third-party apps go through CoreML or similar, which adds overhead.

Frameworks for mobile LLM

llama.cpp has iOS and Android ports. It works, but the setup is not beginner-friendly. Building for mobile requires cross-compilation and some platform-specific configuration.

Google's MediaPipe LLM Inference API abstracts some of this for Android. It provides a higher-level API for running quantized LLMs on Android devices, with built-in support for Gemma models. The tradeoff is you're limited to models that MediaPipe supports, but for production Android apps that's a reasonable constraint.

Apple Core ML is the iOS path. Apple publishes Llama and Gemma variants optimized for Core ML. The conversion process takes time but the resulting performance is better than running llama.cpp on-device.

Honest limitations

Mobile on-device inference is real but you have to set expectations correctly. A 3B model running at 10-20 tokens per second will handle simple summarization, classification, and short responses well. It's not going to replace a frontier model for complex reasoning. The main use cases are: on-device text classification, small private assistants, keyboard/autocomplete features, and applications where the model handles short, specific tasks repeatedly.

For anything requiring multi-step reasoning or handling long documents, mobile is still not the right platform. Send those to a server.

Comparison across platforms

Hardware	Max Comfortable Model	Typical Token Speed	Power Draw	Best For
M3 MacBook (16GB)	8B Q4	60-80 t/s	~20W	Personal dev, local tools
M3 Pro (36GB)	30B Q4	25-40 t/s	~25W	Team server, higher quality
M3 Max (96GB)	70B Q4	18-25 t/s	~30W	Near-frontier local quality
M4 Pro (24GB)	13B Q4	50-65 t/s	~22W	Best value for dev use
Jetson Orin NX (16GB)	8B Q4	15-25 t/s	15-25W	Edge deployment
Jetson Orin AGX (64GB)	30B Q4	12-20 t/s	40-60W	Industrial edge AI
Flagship phone (2024)	3B Q4	10-20 t/s	~5W	On-device private features

The practical setup in 2026

For most developers who want to experiment with local inference: get a machine with M3 or M4 Apple Silicon, install Ollama, and pull the models you want to test. The whole process takes under 20 minutes and you'll have a locally running model with an OpenAI-compatible API that you can point any framework at.

For production edge deployments: Jetson Orin NX or AGX depending on your memory requirements, with TensorRT-LLM for performance. Expect a few days of setup rather than 20 minutes, but the result is a low-power, deployable inference node.

For mobile: keep your models at or under 3B parameters, use platform-native frameworks (Core ML on iOS, MediaPipe on Android), and accept that you're building something specialized rather than general-purpose.

The gap between local and cloud quality is real. A local 7B model is not the same as GPT-4o. But for a surprising number of tasks, especially those where you need speed, privacy, or cost predictability, local inference has become a real option rather than a compromise.