OPEN_SOURCE
REDDIT // 6h ago // INFRASTRUCTURE
Step 3.5 Flash Undercuts Qwen Serve Costs
This Reddit thread examines why StepFun's Step 3.5 Flash is cheaper to serve on DeepInfra than Qwen3.6-35B-A3B. The key factors are active compute per token, memory pressure, context length, quantization, throughput, and provider economics rather than total parameter count alone.
// ANALYSIS
Hot take: total parameters are a bad proxy for API price; the bill is mostly about what has to stay hot per token and how efficiently the provider can run it.
- Step 3.5 Flash is a sparse MoE model with 196B total parameters but only 11B active per token, and StepFun describes it as designed around inference cost and speed.
- Qwen3.6-35B-A3B has 35B total parameters and 3B activated per token, but its serving stack includes a 256-expert MoE design, a vision encoder, and a long 256K native context, all of which affect deployment economics.
- DeepInfra’s listed pricing is $0.10 input / $0.30 output per 1M tokens for Step 3.5 Flash versus $0.19 input / $1.00 output for Qwen3.6-35B-A3B, so the provider is clearly pricing for more than parameter count alone.
- The output-token gap is the bigger tell: vendors usually charge more where decode-time throughput, long-context KV cache pressure, and demand are harsher.
- In plain English, “4x bigger” by headline size does not mean 4x more expensive to serve; sparse activation can flip that intuition.
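The pricing gap compounds on real workloads. A minimal sketch of blended per-request cost using the DeepInfra rates quoted above; the 2M-input / 1M-output workload is an assumed example, not from the thread:

```python
# Listed DeepInfra rates from the thread, in USD per 1M tokens.
PRICES = {
    "step-3.5-flash": {"input": 0.10, "output": 0.30},
    "qwen3.6-35b-a3b": {"input": 0.19, "output": 1.00},
}

def blended_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost for a workload of input/output tokens at the listed rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Assumed example workload: 2M input tokens, 1M output tokens.
step = blended_cost("step-3.5-flash", 2_000_000, 1_000_000)   # $0.50
qwen = blended_cost("qwen3.6-35b-a3b", 2_000_000, 1_000_000)  # $1.38
print(f"Step 3.5 Flash: ${step:.2f}, Qwen3.6-35B-A3B: ${qwen:.2f}")
```

On this mix the nominally "4x bigger" model comes out roughly 2.8x cheaper, and the more output-heavy the workload, the wider the gap, since the output rates differ by over 3x.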
// TAGS
mixture-of-experts · inference · pricing · api · stepfun · qwen · llm-economics · deepinfra
DISCOVERED
6h ago
2026-05-01
PUBLISHED
8h ago
2026-04-30
RELEVANCE
7/10
AUTHOR
urarthur