A model tournament aimed at replacing a premium Anthropic model found DeepSeek V4 Flash was the cheapest strong option. MiniMax M2.7 also stood out, underscoring how quickly frontier-quality models are converging on price.
The post describes a model tournament run to find a replacement for Sonnet 4.7 in an agentic product, because the author says Sonnet 4.7 was becoming too expensive to operate. The lineup included Qwen 3.5, DeepSeek V4 Pro, DeepSeek V4 Flash, Sonnet 4.7, MiniMax M2.7, Kimi K2.6, and GLM-5. According to the author, DeepSeek V4 Flash was the clear cost winner, but MiniMax M2.7 delivered the best overall performance and “blew every model away,” reportedly reaching 92% in the test.
DeepSWE is a new benchmark from Datacurve for evaluating frontier coding agents on original, long-horizon software engineering tasks. It focuses on contamination-free tasks written from scratch across 91 repositories and 5 languages, with hand-written verifiers and reference solutions that require substantially more code than older public benchmarks. The release also includes a leaderboard showing clearer separation among top models than saturated benchmarks usually do.
Cactus Hybrid Router is a small routing model from Cactus Compute that decides, on the fly, whether a request should be handled by an on-device edge model or handed off to a frontier cloud model such as Gemini. The post claims the 65k-parameter router can help Gemma 4 2B match Gemini-3.1-Flash-Lite by sending only 15-55% of tasks to the cloud, while keeping the rest local. Cactus’s own docs and repo support the broader hybrid-inference idea, including confidence-based cloud handoff and multimodal routing across text, vision, and audio.
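As a rough illustration of confidence-based cloud handoff, the sketch below routes a request locally when a router score clears a threshold and sends it to the cloud otherwise. It is not Cactus's API: `route`, `toy_score`, `cloud_fraction`, and the 0.5 threshold are hypothetical names and values, and `toy_score` is a word-count heuristic standing in for a learned router.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RouteDecision:
    target: str        # "local" or "cloud"
    confidence: float  # router's confidence that the local model suffices


def route(prompt: str, score_fn: Callable[[str], float],
          threshold: float = 0.5) -> RouteDecision:
    """Keep the request local when confidence meets the threshold."""
    confidence = score_fn(prompt)
    target = "local" if confidence >= threshold else "cloud"
    return RouteDecision(target, confidence)


def toy_score(prompt: str) -> float:
    # Hypothetical stand-in for a learned router: treat longer prompts
    # as harder, so confidence falls as the word count grows.
    return max(0.0, 1.0 - len(prompt.split()) / 50.0)


def cloud_fraction(prompts: Sequence[str], score_fn: Callable[[str], float],
                   threshold: float = 0.5) -> float:
    """Share of requests that would be handed off to the cloud."""
    decisions = [route(p, score_fn, threshold) for p in prompts]
    return sum(d.target == "cloud" for d in decisions) / len(decisions)
```

Raising the threshold sends more traffic to the cloud, which is the knob that a cloud share such as the 15-55% range quoted above would depend on.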
A Reddit benchmark compares the RTX 5090, RTX PRO 6000 Blackwell Max-Q Workstation Edition, and RTX PRO 6000 Blackwell Workstation Edition on a diffusion-heavy Forge Neo workload. The tuned 5090 is fastest, but the Max-Q card matches it at 400W while the stock workstation card closes much of the gap at 600W.
This post benchmarks an ultra-budget dual-RTX 3060 setup running Unsloth’s Qwen3.6-27B GGUF variants in llama.cpp on CUDA. The author reports strong, stable throughput on a dated PCIe 3.0 x8/x8 platform, with MTP pushing generation into the low-40 t/s range and non-MTP mode delivering more context at a still-solid ~30 t/s. The main tradeoff is that tensor parallel mode currently blocks KV-cache quantization, which caps usable context and makes very long prompts awkward.
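To see why an unquantized KV cache caps context, a back-of-the-envelope estimate helps. The sketch below assumes a plain transformer with grouped-query attention in every layer, which may not match the actual Qwen3.6-27B architecture; the function names are hypothetical and the layer and head counts are left as parameters because the post does not state them.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: float = 2.0) -> float:
    """Approximate KV-cache size for standard attention.

    Each layer stores one K and one V vector of n_kv_heads * head_dim
    elements per token. bytes_per_elem is 2.0 for f16; an 8-bit cache
    is roughly 1.0, ignoring per-block scale overhead.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


def max_context_tokens(budget_bytes: float, n_layers: int, n_kv_heads: int,
                       head_dim: int, bytes_per_elem: float = 2.0) -> int:
    """Largest token count whose KV cache fits in the given memory budget."""
    per_token = kv_cache_bytes(n_layers, n_kv_heads, head_dim, 1, bytes_per_elem)
    return int(budget_bytes // per_token)
```

Under these assumptions, halving `bytes_per_elem` doubles the context that fits in the same VRAM budget, so losing cache quantization costs roughly that much headroom.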
Google's Gemma 4 31B and Alibaba's Qwen 3.6 27B have officially surpassed GPT-5 on the Artificial Analysis Coding Index. The shift marks a historic milestone: workstation-class local models are now outperforming last year's premier cloud systems in pure logic and software engineering.
Levent Alpoge says Claude Mythos also found a cute, simple proof for Erdős’s unit-distance problem, following OpenAI’s recent breakthrough on the same question. The post reads like a second datapoint that frontier models can independently assemble non-obvious mathematical arguments when given the right scaffolding.
This GitHub benchmark evaluates short-fiction writing by having models respond to the same constrained creative briefs and then comparing the resulting stories head-to-head with evaluator LLMs. The latest leaderboard refresh adds Baidu Ernie 5.1, Qwen 3.7 Max, Mistral Medium 3.5, and Grok 4.3, with the reported scores placing Ernie 5.1 at -0.35, Qwen 3.7 Max at -2.01, Mistral Medium 3.5 at -2.13, and Grok 4.3 at -3.81. The benchmark also tracks compliance with the 600-800 word target range and measures how well stories incorporate the required elements.
A Reddit user benchmarked llama.cpp on an RX 9070 XT under ROCm 7.2.3 and found it only matched an older MI50 on generation speed, despite the newer card’s better prompt throughput. The comparison is noisy because the test used different quants and different VM hosts, but it still raises questions about AMD ROCm performance on RDNA 4 for local LLMs.
A side-project blog reports GRPO experiments on sub-500M models for 64-token Reddit summarization, trained on a 3x Mac mini M4 cluster with MLX and distributed vLLM rollouts. The staged curriculum, where length is learned first and quality second, outperformed joint length-plus-quality training across both Qwen2.5-0.5B-Instruct and LFM-2.5-350M.
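The staged idea can be sketched with GRPO's group-relative advantage, where each rollout's reward is normalized against the other rollouts for the same prompt. This is a minimal sketch, not the blog's code: the reward shapes, the function names, and the choice that stage 2 scores quality alone are assumptions, and `quality` stands in for whatever quality score the author used.

```python
import statistics
from typing import List


def length_reward(n_tokens: int, target: int = 64) -> float:
    # Assumed shape: 1.0 at the target length, falling linearly to 0.
    return max(0.0, 1.0 - abs(n_tokens - target) / target)


def staged_reward(n_tokens: int, quality: float, stage: int) -> float:
    # Stage 1 rewards length only; stage 2 rewards quality only (assumption).
    return length_reward(n_tokens) if stage == 1 else quality


def joint_reward(n_tokens: int, quality: float) -> float:
    # Baseline for comparison: length and quality summed in one reward.
    return length_reward(n_tokens) + quality


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Group-relative advantages: (r - mean) / (std + eps) within one group.

    Uses the population standard deviation; implementations differ here.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

In a training loop, `group_advantages` would be applied to the rewards of all rollouts sampled for one prompt, with `stage` switched from 1 to 2 once outputs reliably land near the 64-token target.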
A rejected llama.cpp PR shows a narrow but real win on AMD Strix Halo: retuned warp counts and tile sizes push MoE prefill up by roughly 30% at short context, with gains tapering as context grows. It is a local patch, not an upstream mainline change, and the benefit is specific to MoE workloads.