SenseTime drops encoder-free NEO-unify multimodal model
SenseTime's NEO-unify is a 2B-parameter multimodal model that eliminates vision encoders and VAEs, processing raw pixels directly via a Mixture-of-Transformer (MoT) architecture and flow matching.
The "encoder-free" trend is hitting its stride: NEO-unify suggests that raw pixel processing can rival specialized VAEs on reconstruction quality while being significantly more data-efficient.
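For readers unfamiliar with flow matching, here is a minimal sketch of one common form of the training objective, applied to a flat list of pixel values. The straight-line path between noise and image is an assumption chosen for simplicity; the source does not say which path or loss weighting NEO-unify uses. `predict_velocity` is a hypothetical stand-in for the model.

```python
import random


def interpolate(noise, image, t):
    """Point on the straight path from noise (t=0) to image (t=1)."""
    return [(1.0 - t) * n + t * x for n, x in zip(noise, image)]


def target_velocity(noise, image):
    """Velocity of the straight path; it does not depend on t."""
    return [x - n for n, x in zip(noise, image)]


def flow_matching_loss(predict_velocity, image, rng):
    """Mean squared error between predicted and target velocity for one sample.

    predict_velocity(x_t, t) is a hypothetical model callable that returns
    one velocity value per pixel.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in image]  # Gaussian noise sample
    t = rng.random()                              # time drawn from [0, 1)
    x_t = interpolate(noise, image, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(noise, image)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(image)
```

At generation time, a model trained this way would start from noise and integrate its predicted velocity from t=0 to t=1. In this sketch the pixels are the regression target themselves, with no VAE latent in between.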
- Eliminates CLIP/SigLIP dependencies, reducing architectural bloat and lowering inference latency for local edge deployments.
- Mixture-of-Transformer backbone allows simultaneous visual understanding and image generation without the performance trade-offs of modular systems.
- Flow matching on raw pixels achieves 31.56 PSNR, nearly matching Flux's VAE while remaining a completely unified architecture.
- Extreme data efficiency, outperforming counterparts like Bagel with fewer tokens, suggests a more scalable path for native multimodal pre-training.
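To put the PSNR figure in context, here is a minimal sketch of how peak signal-to-noise ratio is conventionally computed between a reference image and its reconstruction. The 8-bit peak value of 255 is an assumption; the source does not state the pixel range or evaluation set behind the 31.56 figure.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    max_value is the largest possible pixel value (255 assumes 8-bit images).
    """
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must have the same number of pixels")
    if not reference:
        raise ValueError("inputs must not be empty")
    # Mean squared error over all pixels
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Under that 8-bit assumption, and if the reported figure is in dB, 31.56 corresponds to a root-mean-square error of roughly 6.7 gray levels per pixel.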
DISCOVERED
2026-04-14
PUBLISHED
2026-04-14
AUTHOR
Few-Personality6088