DeepSeek V4 Is Now Live: Weights, Benchmarks, and First Impressions

DeepSeek V4 full weights are now on HuggingFace under Apache 2.0. Here's what early benchmarks show, how Engram performs in the wild, and how to start using V4 today.

by AnyCap

DeepSeek V4 full weights are now available on HuggingFace under an Apache 2.0 license. After months of architecture papers, the V4 Lite preview, and sustained community anticipation, the complete model is out.

Here is what early data shows, and what developers need to know to start using it.


What Just Dropped

The release includes:

  • Full V4 weights (~1 trillion total parameters, 37B active per token via Mixture-of-Experts)
  • HuggingFace repository under Apache 2.0 — commercial use permitted, no usage restrictions
  • API access via DeepSeek's platform, with expected pricing around $0.30 per million input tokens

The Apache 2.0 license is significant. Unlike some recent open-weight releases that carry non-commercial or field-of-use restrictions, V4 can be deployed commercially, fine-tuned, and redistributed. For enterprise teams and startups building on open models, this is the most permissive option at this capability tier.


Early Benchmark Results

Independent evaluation began within hours of the weights going live. Here is what the first results show:

Coding (HumanEval / LiveCodeBench):
Early runs place V4 above V3 on LiveCodeBench, consistent with the ablations in the MoE scaling paper, which showed improved coding performance under the new expert configuration.

Mathematics (MATH-500):
Results are competitive with GPT-4o and Claude 3.7 Sonnet on standard math benchmarks. The per-expert specialization appears to translate into measurable gains on structured reasoning tasks.

Long-Context Retrieval (Needle-in-a-Haystack):
This is the headline test for V4. Early independent evaluations of Engram at 1M tokens are returning accuracy figures in the 93–96% range — slightly below DeepSeek's internal claim of 97%, but substantially above the 84.2% baseline for standard attention.

The 97% internal figure has not yet been fully replicated independently. The 93–96% range is the more defensible number at this stage, and it still represents a substantial improvement over the standard-attention baseline.
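
For readers who want to sanity-check the long-context numbers themselves, here is a minimal needle-in-a-haystack harness. This is a sketch, not a replication of any published evaluation: the filler text, needle, rough token estimate, and pass criterion are simplified stand-ins, and it assumes the AnyCap client interface shown later in this post.

import anycap

client = anycap.Client()

def needle_test(depth_fraction, context_tokens=100_000):
    # Build a repetitive "haystack" and bury one retrievable fact at a given depth.
    filler = "The sky was clear and the market was quiet that day. "
    n_sentences = context_tokens // 13  # rough estimate of ~13 tokens per sentence
    sentences = [filler] * n_sentences
    needle = "The secret passphrase for the vault is 'amber-falcon-42'. "
    sentences.insert(int(n_sentences * depth_fraction), needle)

    response = client.generate(
        model="deepseek-v4",
        messages=[{"role": "user", "content": "".join(sentences)
                   + "\n\nWhat is the secret passphrase for the vault?"}],
        max_tokens=64
    )
    # Pass if the model reproduces the buried fact verbatim.
    return "amber-falcon-42" in response.content

# Sample five insertion depths; accuracy is the fraction retrieved.
hits = [needle_test(d) for d in (0.1, 0.25, 0.5, 0.75, 0.9)]
print(f"Retrieval accuracy: {sum(hits) / len(hits):.0%}")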


How Engram Performs in the Wild

Engram — V4's conditional memory mechanism for long-context retrieval — is the architectural feature that attracted the most developer interest ahead of release. Early community tests on realistic long-context tasks (full codebase analysis, long contract review, extended conversation recall) are broadly positive.

Key observations from early testers:

  • Full-repo code review: V4 correctly identifies cross-file dependencies and surfaces relevant context that GPT-4o misses at the same token depth
  • Document analysis at 500K tokens: Retrieval quality is noticeably more consistent than V3 at this length
  • Latency: First-token latency on the hosted API is comparable to V3 for standard-length contexts; long-context requests are slower than short ones, as expected, but the slowdown is less severe than with naive full-attention approaches

The Engram mechanism's inference overhead — a question the architecture paper left open — appears to be moderate in practice.
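
To make the full-repo case concrete, here is a sketch of what those community tests look like in practice: concatenating a repository's source files into a single long-context request. The file-tagging convention is our own illustration, and the client interface is the AnyCap one shown in the next section.

import anycap
from pathlib import Path

client = anycap.Client()

# Concatenate every source file into one prompt, tagging each file by path
# so the model can cite cross-file dependencies in its findings.
repo = Path("./my-project")
sources = [
    f"===== {path.relative_to(repo)} =====\n{path.read_text()}"
    for path in sorted(repo.rglob("*.py"))
]

prompt = (
    "Review the following codebase for security issues. For each finding, "
    "list the cross-file dependencies involved.\n\n" + "\n\n".join(sources)
)

response = client.generate(
    model="deepseek-v4",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=4096
)
print(response.content)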


Pricing and What It Means

At ~$0.30 per million input tokens, V4 is priced approximately:

  • 16× cheaper than GPT-5.5 ($5/MTok input)
  • Comparable to GPT-4o Mini tier pricing for some providers
  • Below V3's launch pricing on most inference platforms

For agentic workflows where a single task might consume hundreds of thousands of tokens across multiple calls, this pricing difference is not cosmetic. An agent loop that costs $15 on GPT-5.5 costs under $1 on V4 at list price.
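
The arithmetic behind that comparison, using the list prices above (input tokens only; the 3M-token run size is an illustrative assumption):

# Back-of-the-envelope cost for an input-heavy agent run.
# Prices are the list figures quoted above; output tokens are ignored,
# and the 3M-token run size is an illustrative assumption.
GPT_55_INPUT_PER_MTOK = 5.00
DEEPSEEK_V4_INPUT_PER_MTOK = 0.30

run_mtok = 3.0  # e.g. ~300K tokens per call across ~10 calls

gpt_cost = run_mtok * GPT_55_INPUT_PER_MTOK       # $15.00
v4_cost = run_mtok * DEEPSEEK_V4_INPUT_PER_MTOK   # $0.90

print(f"GPT-5.5: ${gpt_cost:.2f} | DeepSeek V4: ${v4_cost:.2f} "
      f"(~{gpt_cost / v4_cost:.1f}x cheaper)")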

The caveat: self-hosted inference of a 1T-parameter MoE model requires significant infrastructure. The $0.30 figure applies to the hosted API. Self-hosting at this scale is only practical for teams with large GPU clusters.


Accessing V4 Through AnyCap

If you want to use DeepSeek V4 without managing provider accounts or infrastructure directly, AnyCap's unified model API routes to V4 alongside GPT-5.5, Claude 4, Gemini 3.1, and other frontier models — all through a single endpoint.

import anycap

# One client, one endpoint; the target model is selected per request.
client = anycap.Client()

response = client.generate(
    model="deepseek-v4",  # route this request to DeepSeek V4
    messages=[{"role": "user", "content": "Review this codebase for security issues..."}],
    max_tokens=4096
)

print(response.content)

AnyCap handles provider failover, rate limit management, and unified billing — useful for teams that want to benchmark V4 against other models without rebuilding their integration for each provider.
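
Because the endpoint is shared, that kind of side-by-side benchmark is a short loop. Here is a sketch reusing the client surface from the snippet above; the non-DeepSeek model identifiers and the timing logic are illustrative additions, not confirmed AnyCap model IDs.

import time
import anycap

client = anycap.Client()

prompt = [{"role": "user", "content": "Summarize the tradeoffs of MoE routing."}]

# Send the same prompt to each model through the one endpoint and compare.
for model in ("deepseek-v4", "gpt-5.5", "claude-4"):
    start = time.perf_counter()
    response = client.generate(model=model, messages=prompt, max_tokens=512)
    elapsed = time.perf_counter() - start
    print(f"{model}: {elapsed:.1f}s, {len(response.content)} chars")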


What to Watch in the Next 48 Hours

The most meaningful independent benchmarks typically arrive 24–72 hours after a weights release, once the larger evaluation labs complete their runs:

  • LMSYS Chatbot Arena — human preference ratings against GPT-5.5 and Claude 4
  • BigCode EvalPlus — comprehensive coding benchmark suite
  • Long-context adversarial tests — stress tests designed to break retrieval quality in ways that synthetic benchmarks miss

For developers making architecture decisions, waiting for these results before committing V4 to production long-context use cases is the prudent path.

