Mixture of Experts Models Explained: DeepSeek-V3, Mixtral, and How MoE Works

April 15, 2026 · Editorial Team · 7 min read · llm-architecture mixture-of-experts deepseek

Mixture of experts is one of those terms that sounds more complicated than it is. The basic idea is old: it has been in the machine learning literature since the early 1990s. It became practically relevant to large language models only once transformers grew large enough that pushing every token through every parameter in every layer became unsustainably expensive.

DeepSeek-V3 and Mixtral are the most prominent open-weight MoE models right now, and understanding why they're built this way requires understanding what the architecture actually buys you, and what it costs you.

The core problem MoE solves

A standard dense transformer (what most people think of when they say "LLM") activates all of its parameters for every single token it processes. If the model has 70 billion parameters, every one of those parameters does work on every token. That's expensive. The compute cost scales linearly with parameter count.

Mixture of experts breaks this link by replacing parts of the model (in transformer LLMs, typically the feed-forward block in each layer) with many "expert" networks and routing each token to only a small subset of them. The experts that are not selected stay idle for that token, while the attention layers and embeddings still run for every token.

You end up with a model that has a lot of total parameters but activates only a fraction of them per token. The result is more capacity without proportionally more compute per token.

The routing mechanism is the clever part. A small trainable network called a "router" or "gating function" looks at each token and decides which experts to activate. It's not hand-designed: the routing is learned during training, and the experts specialize organically based on which types of inputs they tend to receive.
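The routing step described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than any particular model's implementation: the names `route_top_k` and `moe_layer` are made up for this example, the experts are toy linear maps, and renormalizing the gate weights with a softmax over the selected experts is one common choice among several.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def route_top_k(tokens, router_weights, k):
    """Pick the k highest-scoring experts for each token.

    tokens: (n_tokens, d_model), router_weights: (d_model, n_experts).
    Returns expert ids and gate weights, both shaped (n_tokens, k).
    """
    logits = tokens @ router_weights                          # one score per expert
    top_ids = np.argsort(logits, axis=-1)[:, -k:][:, ::-1]    # best expert first
    top_logits = np.take_along_axis(logits, top_ids, axis=-1)
    gates = softmax(top_logits, axis=-1)                      # weights sum to 1 per token
    return top_ids, gates

def moe_layer(tokens, router_weights, experts, k=2):
    """Run each token through its k chosen experts and mix the outputs."""
    top_ids, gates = route_top_k(tokens, router_weights, k)
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        for slot in range(k):
            expert = experts[top_ids[t, slot]]
            out[t] += gates[t, slot] * expert(tokens[t])      # unchosen experts never run
    return out, top_ids

# Toy setup: 8 experts, each a small random linear map (hypothetical sizes).
rng = np.random.default_rng(0)
d_model, n_experts = 16, 8
router_weights = rng.normal(size=(d_model, n_experts))
expert_mats = rng.normal(size=(n_experts, d_model, d_model)) * 0.1
experts = [lambda x, W=W: x @ W for W in expert_mats]

tokens = rng.normal(size=(4, d_model))
out, chosen = moe_layer(tokens, router_weights, experts, k=2)
print(chosen)  # two expert ids per token
```

In a real model the router weights are trained jointly with the experts, and the per-token loop is replaced by batched dispatch, but the control flow is the same: score, pick the top k, run only those, and mix.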

What "8 experts, 2 active" actually means

Mixtral 8x7B was the model that made MoE mainstream for open-weight LLMs. Its architecture has 8 expert networks per transformer layer, and for any given token, only 2 of those 8 experts are activated. That "8x7B" name is a bit misleading for people used to dense models.

The total parameter count is about 46.7 billion, not the 56 billion that 8 × 7B would suggest, because only the feed-forward blocks are replicated per expert; the attention layers and embeddings are shared. The activated parameter count per token is about 12.9 billion, since only 2 experts fire. Inference compute is closer to running a 13B dense model than a 47B dense model.
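The two headline numbers can be reproduced with some back-of-the-envelope arithmetic. The sketch below assumes the hyperparameters from the publicly released Mixtral 8x7B configuration (hidden size 4096, feed-forward size 14336, 32 layers, grouped-query attention with 8 key/value heads, a 32,000-token vocabulary, untied input and output embeddings); treat those constants as assumptions of this example rather than something stated in this article.

```python
# Assumed Mixtral 8x7B hyperparameters (taken from the public model config).
HIDDEN = 4096
FFN_HIDDEN = 14336
LAYERS = 32
N_EXPERTS = 8
ACTIVE_EXPERTS = 2
VOCAB = 32000
N_HEADS = 32
N_KV_HEADS = 8
HEAD_DIM = HIDDEN // N_HEADS

def mixtral_param_counts():
    """Return (total, active-per-token) parameter counts."""
    expert = 3 * HIDDEN * FFN_HIDDEN                       # gate, up and down projections
    attention = (2 * HIDDEN * HIDDEN                       # query and output projections
                 + 2 * HIDDEN * N_KV_HEADS * HEAD_DIM)     # key and value projections
    router = HIDDEN * N_EXPERTS                            # one score per expert
    norms = 2 * HIDDEN                                     # two norm layers per block
    shared_per_layer = attention + router + norms
    embeddings = 2 * VOCAB * HIDDEN + HIDDEN               # input embedding, output head, final norm

    total = LAYERS * (shared_per_layer + N_EXPERTS * expert) + embeddings
    active = LAYERS * (shared_per_layer + ACTIVE_EXPERTS * expert) + embeddings
    return total, active

total, active = mixtral_param_counts()
print(f"total:  {total / 1e9:.1f}B")   # total:  46.7B
print(f"active: {active / 1e9:.1f}B")  # active: 12.9B
```

Almost all of the parameters sit in the expert feed-forward blocks, which is why activating 2 of 8 experts cuts the per-token count so sharply while the shared attention weights barely register.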

In practice this means Mixtral 8x7B has the quality of a much larger model while costing roughly the same compute as a mid-size dense model during inference. Mistral's reported benchmark scores on MMLU and coding tasks match or exceed Llama 2 70B, even though Mixtral activates less than a fifth as many parameters per token (12.9B versus 70B).

The catch: you still need to load all 46.7B parameters into memory even though you're only computing with 12.9B of them per token. Memory requirements track total parameter count, but compute tracks activated parameter count. This is the fundamental MoE tradeoff.

DeepSeek-V3: fine-grained MoE

DeepSeek-V3 (released December 2024) pushed the MoE architecture further with what DeepSeek calls fine-grained experts. Rather than having 8 large experts, each MoE layer in DeepSeek-V3 has 256 small routed experts, with only 8 activated per token, plus one shared expert that every token passes through.

The intuition: more, smaller experts means finer-grained specialization. A single expert in an 8-expert model has to be general enough to handle a wide variety of inputs because tokens from many different domains will be routed to it. With 256 smaller experts, each expert can specialize more narrowly, theoretically capturing more nuanced patterns.
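One way to put a number on "finer-grained" is to count how many distinct sets of experts a token can be routed to. The sketch below uses only the expert counts quoted in this article; the helper name `expert_combinations` is invented for the example, and the count is an upper bound on routing flexibility, not a measure of how specialized the trained experts actually are.

```python
import math

def expert_combinations(n_experts, k):
    """Number of distinct k-expert subsets a single token could be routed to."""
    return math.comb(n_experts, k)

mixtral_style = expert_combinations(8, 2)       # 8 experts, 2 active
deepseek_style = expert_combinations(256, 8)    # 256 routed experts, 8 active

print(mixtral_style)              # 28
print(f"{deepseek_style:.2e}")    # on the order of 10^14
```

With 8 experts and 2 active there are only 28 possible pairings, so each expert has to cover a lot of ground. With 256 experts and 8 active the router has hundreds of trillions of combinations to choose from.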

Total parameters: 671 billion. Activated parameters per token: about 37 billion. That's a 671B parameter model running with the compute load of a ~37B dense model. The efficiency ratio is better than Mixtral's: about 5.5% of the parameters are active per token, versus roughly 28% for Mixtral 8x7B.
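To compare the two models on the same footing, a rough sketch like the following is enough. It uses the parameter counts quoted in this article and the common rule of thumb of about two floating-point operations per active parameter per token in the forward pass; that rule ignores the attention-score computation, so the FLOP figures are approximations.

```python
# Parameter counts as quoted in the article.
MODELS = {
    "Mixtral 8x7B": {"total": 46.7e9, "active": 12.9e9},
    "DeepSeek-V3": {"total": 671e9, "active": 37e9},
}

def active_fraction(total, active):
    """Share of the model's weights that take part in one token's forward pass."""
    return active / total

def forward_flops_per_token(active):
    """Rule of thumb: one multiply and one add per active weight."""
    return 2 * active

for name, p in MODELS.items():
    frac = active_fraction(p["total"], p["active"])
    gflops = forward_flops_per_token(p["active"]) / 1e9
    print(f"{name}: {frac:.1%} active, ~{gflops:.0f} GFLOPs per token")
```

DeepSeek-V3 does about three times the per-token work of Mixtral while holding more than fourteen times as many parameters.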

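The training cost DeepSeek disclosed was about $5.6 million for the full 671B V3 training run. DeepSeek derived that figure from GPU-hours at an assumed rental price, and it covers only the final training run, not earlier research, ablation experiments, or staff. The figure still generated a lot of discussion because GPT-4 level training runs were rumored to cost $50-100 million or more. The cost efficiency comes from the MoE architecture reducing compute requirements, combined with careful engineering to train efficiently on H800 clusters, including FP8 mixed-precision training.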
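The headline number is simple multiplication. The GPU-hour breakdown and the $2 per GPU-hour rental rate below are the figures given in the DeepSeek-V3 technical report; they are inputs to this sketch, not something derived here.

```python
# H800 GPU-hours per training stage, as reported by DeepSeek.
GPU_HOURS = {
    "pre-training": 2_664_000,
    "context extension": 119_000,
    "post-training": 5_000,
}
RENTAL_USD_PER_GPU_HOUR = 2.0  # DeepSeek's assumed rental price, not a measured cost

total_gpu_hours = sum(GPU_HOURS.values())
estimated_cost_usd = total_gpu_hours * RENTAL_USD_PER_GPU_HOUR

print(f"{total_gpu_hours:,} GPU-hours")         # 2,788,000 GPU-hours
print(f"${estimated_cost_usd / 1e6:.2f}M")      # $5.58M
```

Pre-training accounts for over 95% of the hours, so the figure is best read as "the compute bill for one successful pre-training run" rather than the cost of developing the model.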

On benchmarks through early 2026, DeepSeek-V3 performs at a level competitive with frontier closed models on many tasks, which is why its release was significant. An open-weight model at that quality level changes what's achievable without API access.

The router problem: load balancing

Here's something that gets glossed over in most MoE explanations: the router needs to distribute tokens roughly evenly across experts. If one expert receives 80% of all tokens and the rest split the remaining 20%, you're effectively running a much smaller model and carrying the other experts as dead weight. MoE only pays off if the experts are actually used.

Training MoE models requires special care to prevent "expert collapse," where a few experts dominate and the rest contribute little. The conventional remedy, which public Mixtral implementations also include, is an auxiliary loss term during training that penalizes unbalanced routing. DeepSeek-V3 uses what its authors call auxiliary-loss-free load balancing: each expert gets a bias term that is added to its routing score when the top experts are chosen, and that bias is nudged up or down during training depending on whether the expert is underloaded or overloaded. This avoids relying on a large balancing loss that can pull against the main training objective.
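Both balancing ideas can be illustrated on a toy router. The sketch below is not either model's training code: the auxiliary loss follows the general Switch-Transformer-style form (number of experts times the sum of load times mean router probability), the bias update is a simplified sign-based rule, and the function names, the `step` size, and the fixed batch of synthetic logits are all assumptions made for the example. In a real model the bias would typically affect only which experts are selected, not how their outputs are weighted.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def top_k_ids(scores, k):
    """Indices of the k highest-scoring experts for each token."""
    return np.argsort(scores, axis=-1)[:, -k:]

def expert_load(ids, n_experts):
    """Fraction of all routing assignments that each expert received."""
    counts = np.bincount(ids.ravel(), minlength=n_experts)
    return counts / counts.sum()

def auxiliary_balance_loss(logits, k):
    """About 1.0 when routing is balanced, larger when a few experts dominate."""
    n_experts = logits.shape[-1]
    load = expert_load(top_k_ids(logits, k), n_experts)
    mean_prob = softmax(logits).mean(axis=0)
    return n_experts * float(np.sum(load * mean_prob))

def update_bias(bias, load, step=0.01):
    """Raise the bias of underloaded experts, lower it for overloaded ones."""
    target = 1.0 / load.shape[0]
    return bias + step * np.sign(target - load)

# Synthetic router scores in which expert 0 is heavily favored.
rng = np.random.default_rng(0)
n_tokens, n_experts, k = 4096, 8, 2
skew = np.array([3.0, 0, 0, 0, 0, 0, 0, 0])
logits = rng.normal(size=(n_tokens, n_experts)) + skew

load_before = expert_load(top_k_ids(logits, k), n_experts)
loss_before = auxiliary_balance_loss(logits, k)

bias = np.zeros(n_experts)
for _ in range(500):
    load = expert_load(top_k_ids(logits + bias, k), n_experts)
    bias = update_bias(bias, load)
load_after = expert_load(top_k_ids(logits + bias, k), n_experts)

print("load before:", np.round(load_before, 3))
print("aux loss before:", round(loss_before, 2))
print("load after: ", np.round(load_after, 3))
```

The auxiliary loss gives the optimizer a gradient that pushes the router toward balance, while the bias approach corrects the imbalance directly without touching the loss. In this toy run the favored expert starts with nearly half of all assignments and ends close to an even one-eighth share.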

This is an active research area. Routing quality is one of the key variables that separates good MoE models from ones that don't realize their theoretical efficiency gains.

Inference infrastructure: the hidden cost

The memory/compute tradeoff becomes concrete when you think about deployment. Running DeepSeek-V3 requires loading 671 billion parameters. At 4-bit quantization, that's roughly 335GB of memory. That means multiple high-end GPUs even after aggressive quantization.
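The memory arithmetic is worth having as a small helper when planning hardware. The sketch below counts weights only, so it gives a lower bound: KV cache, activations, and framework overhead all come on top. The function names are invented for the example, and the 80GB default is the per-GPU memory of an H100.

```python
import math

def weight_memory_gb(n_params, bits_per_param):
    """Memory needed just to hold the weights, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

def min_gpus_for_weights(n_params, bits_per_param, gpu_memory_gb=80):
    """Smallest number of GPUs whose combined memory can hold the weights."""
    return math.ceil(weight_memory_gb(n_params, bits_per_param) / gpu_memory_gb)

DEEPSEEK_V3 = 671e9
MIXTRAL_8X7B = 46.7e9

for label, bits in [("FP8", 8), ("4-bit", 4)]:
    gb = weight_memory_gb(DEEPSEEK_V3, bits)
    gpus = min_gpus_for_weights(DEEPSEEK_V3, bits)
    print(f"DeepSeek-V3 {label}: {gb:.1f} GB, at least {gpus} x 80GB GPUs")

print(f"Mixtral 8x7B 4-bit: {weight_memory_gb(MIXTRAL_8X7B, 4):.1f} GB")
```

Note that the input is always the total parameter count. The 37B active figure affects how fast each token is computed, not how much memory the model occupies.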

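On a single H100 (80GB VRAM), you can't run DeepSeek-V3 at all. You need one of:

More than eight H100s for the full model in FP8 (the released format): the weights alone are about 671GB, so in practice this means multiple 8-GPU nodes or GPUs with more memory each
A quantized version (Q4) that still needs about 335GB, so at least five 80GB GPUs, and sacrifices some quality
A smaller dense model such as the DeepSeek-R1-Distill family, which is distilled from DeepSeek-R1 onto Qwen and Llama bases rather than being a compressed V3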

This is why, despite the compute efficiency argument, MoE models aren't automatically accessible for everyone. The memory requirement for the full weights is a real barrier that dense models of equivalent inference cost don't have.

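For local deployment, Mixtral 8x7B is more tractable. At Q4 quantization, the weights take roughly 24-30GB depending on the quantization scheme. That fits in the unified memory of an M3 Max Mac with 36GB or more; on a single 24GB 3090 or 4090 it is too tight to hold entirely in VRAM, so it typically needs a lower-bit quantization or some layers offloaded to CPU memory. The performance is genuinely good for an open-weight model.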

Mixtral vs DeepSeek-V3: practical differences

Size and capability: DeepSeek-V3 is significantly more capable, with 671B total parameters and 37B active versus Mixtral's 47B total and 13B active. It's a different class of model that happens to use the same architectural approach.

Deployment requirements: Mixtral 8x7B is accessible to individuals with decent hardware. DeepSeek-V3 requires serious infrastructure.

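Licensing: Mixtral is released under Apache 2.0, genuinely permissive for commercial use. The original DeepSeek-V3 weights shipped under a custom DeepSeek model license that allows commercial use with some use-based restrictions (the code itself is MIT-licensed), and later checkpoints have been released under MIT. Read the license attached to the specific checkpoint you plan to deploy before going to production.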

Provider access: DeepSeek-V3 is available through Fireworks, Together, and other inference providers at competitive rates. You don't have to run it yourself. Prices through providers like Fireworks run around $0.90-1.20 per million tokens, which is significantly cheaper than frontier closed models.

When MoE architecture matters for users

For most people using these models through chat interfaces or API calls, the MoE architecture is an implementation detail. You're not managing expert routing or load balancing yourself.

Where it matters:

Self-hosting: If you're running models on your own hardware, understanding that an MoE model needs its full weights loaded even when activating a fraction per token changes your hardware planning. You need more memory than a dense model with the same per-token compute would require.

Batched inference: MoE models get more efficient at larger batch sizes. At batch size 1, each active expert's weights are read from memory to process a single token; in a large batch, tokens spread across the experts, so each expert's weights are read once and reused for many tokens. For high-throughput serving, MoE can be more cost-efficient than dense models of comparable quality.

Behavior at the edges: MoE models can behave less smoothly at the edges of their capability than dense models of equivalent quality, plausibly because unusual inputs get routed to experts that saw little similar data during training. When it happens, it shows up as more variance in output quality across different types of tasks.
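The batched-inference point above can be made concrete with a little probability. The sketch below assumes an idealized router that picks experts uniformly and independently for each token, which real routers only approximate; the function names are invented for the example.

```python
def expected_experts_touched(n_experts, k, batch_tokens):
    """Expected number of distinct experts used by a batch in one MoE layer.

    Assumes each token independently picks k of n_experts uniformly at random,
    so a given expert is skipped by one token with probability 1 - k/n_experts.
    """
    p_unused = (1 - k / n_experts) ** batch_tokens
    return n_experts * (1 - p_unused)

def tokens_per_touched_expert(n_experts, k, batch_tokens):
    """Average number of tokens sharing each expert whose weights were loaded."""
    return batch_tokens * k / expected_experts_touched(n_experts, k, batch_tokens)

for batch in (1, 8, 64):
    touched = expected_experts_touched(8, 2, batch)
    reuse = tokens_per_touched_expert(8, 2, batch)
    print(f"batch={batch:3d}: {touched:.2f} of 8 experts used, "
          f"{reuse:.1f} tokens per loaded expert")
```

For an 8-expert, 2-active layer, a single token uses 2 experts and each loaded expert serves exactly one token. By 64 tokens essentially all 8 experts are in use and each serves about 16 tokens, so the cost of reading those weights from memory is spread much further.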

Where MoE is going

The pattern that DeepSeek-V3 established, massive total parameters with high sparsity and aggressive cost reduction through fine-grained experts, is almost certainly going to continue. The architecture allows researchers to scale to larger total parameter counts without proportionally scaling compute budgets, which is practically valuable as the cost of frontier training runs continues to climb.

Whether frontier closed models like GPT-5 use MoE internally isn't public, but the architecture would make sense at their scale. The compute savings at 100B+ parameter counts are significant enough that it would be surprising if it weren't being used.

For open-weight development, MoE is now a proven path to building highly capable models on constrained training budgets. DeepSeek's $5.6M training cost for a 671B model will be a reference point for the research community for years.

MoE isn't magic. It's a specific set of tradeoffs: higher memory requirements, more complex training, routing overhead, and infrastructure complexity, in exchange for better quality-per-compute-unit. Whether those tradeoffs make sense depends entirely on what you're trying to do. For DeepSeek building a frontier open-weight model on a limited training budget, it made obvious sense. For a team deploying a model on a single server, the memory overhead might not.

The architecture is mature enough now that you don't need to understand every detail of how expert routing works to make good decisions about using MoE models. You mainly need to understand three things: total parameter count determines your memory needs, active parameter count largely determines your per-token compute and therefore your inference speed, and the ratio between them is the efficiency win the architecture offers.