DeepSeek V4 Research Papers: What Three Architecture Documents Actually Say

Three DeepSeek V4 papers published between December 2025 and January 2026 describe the Engram memory system, the 1-trillion-parameter MoE architecture, and the Huawei Ascend training stack.

by AnyCap

Between December 2025 and January 2026, DeepSeek published three research papers that together describe the architectural foundation of V4. The papers appeared quietly, with no press release or formal announcement, but they became the primary source for everything known about V4 ahead of the model's release.

Here is what the papers actually say, and what they leave open.


Paper 1: The Engram Memory Architecture

The first paper, published in December 2025, introduces Engram — DeepSeek's conditional memory mechanism for long-context retrieval. The core problem the paper addresses is attention degradation at scale: standard transformer attention becomes increasingly unreliable as context length grows past 128K tokens.

Engram's approach is to decouple storage from retrieval. Rather than applying uniform attention across the full context, Engram identifies relevance signals at ingestion time and creates a structured memory index that can be queried more selectively during generation.
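
To make the decoupling concrete, here is a minimal sketch of the ingestion-time indexing idea in Python. The chunk size, the mean-pooled relevance encoder, and the dot-product top-k query are illustrative assumptions for this sketch; the paper does not disclose how Engram actually computes or stores its relevance signals.

```python
import numpy as np

# Hypothetical sketch of decoupled storage/retrieval. Chunking, the encoder,
# and the query are illustrative stand-ins, not DeepSeek's Engram internals.

DIM = 64      # embedding width (assumed)
CHUNK = 128   # tokens per indexed memory entry (assumed)

def embed(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for a learned relevance encoder: mean of token vectors."""
    return tokens.mean(axis=0)

class MemoryIndex:
    """Ingestion-time index: one key vector per chunk of context."""
    def __init__(self):
        self.keys = []    # relevance signals, computed once at ingestion
        self.chunks = []  # raw token blocks kept for later retrieval

    def ingest(self, context: np.ndarray) -> None:
        for i in range(0, len(context), CHUNK):
            block = context[i:i + CHUNK]
            self.keys.append(embed(block))
            self.chunks.append(block)

    def query(self, q: np.ndarray, k: int = 4) -> list[np.ndarray]:
        """At generation time, attend over only the k most relevant chunks
        instead of the full context."""
        scores = np.stack(self.keys) @ q
        return [self.chunks[i] for i in np.argsort(scores)[-k:]]

# Usage: index a long context once, then retrieve selectively per query.
rng = np.random.default_rng(0)
index = MemoryIndex()
index.ingest(rng.normal(size=(100_000, DIM)))  # stand-in token embeddings
hits = index.query(rng.normal(size=DIM))
print(f"{len(hits)} chunks retrieved out of {len(index.chunks)}")
```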

The paper reports 97% Needle-in-a-Haystack accuracy at 1M tokens, compared to 84.2% for standard attention — a 12.8 percentage-point improvement. The paper does not include independent third-party validation of these numbers.

What the paper leaves open: The computational overhead of Engram at inference time. The paper focuses on accuracy metrics, not latency or throughput comparisons. For developers evaluating V4 for production use, this is a gap that real-world benchmarks will need to fill.


Paper 2: MoE Architecture and Scaling

The second paper, published December 2025 alongside the Engram paper, describes V4's Mixture-of-Experts architecture in detail. The key parameters:

  • Total parameters: ~1 trillion
  • Active parameters per token: ~37 billion (same as V3)
  • Expert specialization: V4 has significantly more expert sub-networks than V3, allowing for greater domain specialization without increasing inference cost

The key design decision — keeping active parameters at 37B despite scaling total parameters to 1T — is explained in the paper as a deliberate choice to preserve V3's inference economics while increasing model capacity. The additional parameters go toward more specialized experts, not wider active computation.
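
The economics follow from top-k routing: each token is dispatched to a fixed number of experts, so active compute stays constant no matter how many experts exist in total. Below is a minimal sketch of that mechanism, assuming a generic softmax-gated top-2 router and toy dimensions; V4's actual routing configuration is not published.

```python
import numpy as np

# Illustrative top-k MoE layer: total capacity grows with N_EXPERTS, while
# per-token compute tracks TOP_K. Sizes and gating are generic assumptions.

D_MODEL, D_FF = 32, 64
N_EXPERTS, TOP_K = 16, 2   # grow N_EXPERTS for capacity; cost tracks TOP_K

rng = np.random.default_rng(0)
router_w = rng.normal(size=(D_MODEL, N_EXPERTS))
experts = [(rng.normal(size=(D_MODEL, D_FF)) * 0.1,
            rng.normal(size=(D_FF, D_MODEL)) * 0.1)
           for _ in range(N_EXPERTS)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token through its top-k experts only."""
    logits = x @ router_w
    top = np.argsort(logits)[-TOP_K:]             # chosen experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                          # renormalized gate weights
    out = np.zeros(D_MODEL)
    for g, i in zip(gates, top):
        w_in, w_out = experts[i]
        out += g * (np.maximum(x @ w_in, 0.0) @ w_out)  # ReLU FFN expert
    return out

y = moe_forward(rng.normal(size=D_MODEL))
# Active parameters per token scale with TOP_K, not N_EXPERTS:
per_expert = D_MODEL * D_FF * 2
print("total:", per_expert * N_EXPERTS, "active:", per_expert * TOP_K)
```

In V4's case the same logic applies at scale: the ~1T total sits in the expert pool, while the ~37B active count tracks the per-token dispatch.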

The paper includes ablation studies showing that per-expert specialization improves performance on coding, mathematics, and multilingual tasks compared to V3's expert configuration.

What the paper leaves open: How the router (the component that decides which experts handle each token) was trained. MoE routing quality is often a significant source of variance in real-world performance, and the paper's treatment of routing is notably brief.
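
For context on what router training typically involves: much of the published MoE literature trains the router with an auxiliary load-balancing loss, the best-known formulation being the Switch Transformer's, which penalizes routing collapse onto a few experts. The sketch below implements that generic loss; whether V4 uses anything like it is precisely what the paper leaves unstated.

```python
import numpy as np

# Generic Switch Transformer-style auxiliary load-balancing loss. Whether
# V4's router is trained with such a term is not stated in the paper.

def load_balancing_loss(router_logits: np.ndarray, top1: np.ndarray) -> float:
    """router_logits: (tokens, experts); top1: expert chosen per token.
    Minimized (value 1.0) when dispatch is perfectly uniform."""
    n_tokens, n_experts = router_logits.shape
    probs = np.exp(router_logits - router_logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)          # softmax over experts
    frac = np.bincount(top1, minlength=n_experts) / n_tokens  # dispatch share
    mean_prob = probs.mean(axis=0)                     # mean router confidence
    return float(n_experts * np.dot(frac, mean_prob))

rng = np.random.default_rng(0)
logits = rng.normal(size=(1024, 16))
print(f"aux loss: {load_balancing_loss(logits, logits.argmax(axis=1)):.3f}")
```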


Paper 3: Training Hardware and the Non-Nvidia Stack

The third paper, published January 2026, addresses the most technically unusual aspect of V4: it was trained entirely on Huawei Ascend 910B and Cambricon MLU chips, with no Nvidia hardware involved.

This is the first public documentation of a frontier model trained at this scale on non-Nvidia hardware. The paper describes the engineering work required to adapt the training stack (data parallelism, pipeline parallelism, and gradient communication) to hardware without CUDA.
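
As a rough illustration of what reimplementing gradient communication means, the ring all-reduce below is the standard collective that NCCL provides on Nvidia hardware and that any replacement stack must supply an equivalent of. This NumPy simulation of the algorithm is generic, not a depiction of DeepSeek's Ascend or MLU communication layer.

```python
import numpy as np

# Generic ring all-reduce, simulated on one machine. Real training stacks run
# the same pattern over a network fabric; this is not DeepSeek's code.

def ring_allreduce(grads: list[np.ndarray]) -> list[np.ndarray]:
    """Each of n workers ends with the mean of all gradients, exchanging
    data only with its two ring neighbours (2*(n-1) steps total)."""
    n = len(grads)
    data = [np.array_split(g.astype(float), n) for g in grads]
    # Phase 1, reduce-scatter: after n-1 steps, worker r holds the complete
    # sum for chunk (r+1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, data[r][(r - step) % n].copy())
                 for r in range(n)]
        for r, c, payload in sends:
            data[(r + 1) % n][c] += payload
    # Phase 2, all-gather: circulate the completed chunks for n-1 more steps.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, data[r][(r + 1 - step) % n].copy())
                 for r in range(n)]
        for r, c, payload in sends:
            data[(r + 1) % n][c] = payload
    return [np.concatenate(data[r]) / n for r in range(n)]

# Usage: four simulated workers with different local gradients.
rng = np.random.default_rng(0)
workers = [rng.normal(size=10) for _ in range(4)]
reduced = ring_allreduce(workers)
assert all(np.allclose(r, np.mean(workers, axis=0)) for r in reduced)
```

The appeal of the ring pattern is that per-worker bandwidth stays roughly constant as the cluster grows, which is why some equivalent of it appears in every large-scale training stack, Nvidia or otherwise.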

The paper does not provide direct comparisons of training efficiency between the Ascend 910B stack and equivalent Nvidia hardware, citing competitive reasons. It does document that pretraining was completed successfully and reports training loss curves consistent with prior DeepSeek models.

What the paper leaves open: The comparative cost and time of training on this hardware stack versus Nvidia equivalents. This is relevant for the broader question of whether non-Nvidia training is economically viable at frontier scale, and the paper declines to answer it directly.


What the Three Papers Together Establish

Reading the papers as a set, several things are reasonably established:

  1. V4's architecture is real and documented — not vaporware
  2. The Engram approach to long-context retrieval is a genuine architectural innovation, not a marketing claim
  3. Training at trillion-parameter scale on non-Nvidia hardware is technically feasible, even if the cost comparison remains undisclosed
  4. The active parameter count (37B) means inference cost is comparable to V3, which had favorable cost-per-output benchmarks

What remains unestablished until independent evaluation:

  • Whether Engram's 97% retrieval accuracy holds on adversarial benchmarks, not just internal evaluations
  • Whether the routing architecture produces consistent quality across domains
  • Whether the non-Nvidia training stack introduces any systematic biases or capability gaps

Timeline: From Papers to Weights

December 2025: Engram and MoE architecture papers published
January 2026: Training hardware paper published
March 9, 2026: V4 Lite (approximately 200B parameters) appears on DeepSeek's website
Late April 2026 (expected): Full V4 weights released

The gap between paper and release is longer than usual for DeepSeek, which has historically released models within weeks of architectural papers. The extended timeline likely reflects the complexity of training on non-Nvidia hardware and reaching internal performance targets.

