Qwen-VLA, developed by the Qwen Team, unifies diverse embodied decision-making problems such as robot manipulation and vision-language navigation into a single vision-language-action model. This unified framework achieved an 83.6% average success rate on real-world ALOHA manipulation tasks and demonstrated strong generalization, including outperforming previous specialist models on specific dynamic manipulation benchmarks.
A single-GPU template for RL fine-tuning a Qwen base model with VPO. Fork it, hit launch, and you get a ready-to-run training loop for experimenting with verifiable rewards and reasoning behavior.
Research by Klindt, LeCun, and Balestriero formally defines and proves the conditions under which a Joint-Embedding Predictive Architecture (JEPA), specifically LeJEPA, learns a "World Model" by recovering the true underlying latent variables. It demonstrates that LeJEPA achieves linear identifiability if and only if the latent variables follow an isotropic Gaussian distribution, which then enables optimal latent-space planning.
Parallax introduces a Parameterized Local Linear Attention mechanism that scales the theoretical advantages of Local Linear Attention (LLA) for large language model pretraining. It achieves consistent perplexity reductions and improved downstream performance by reformulating LLA for computational efficiency and demonstrating a strong, positive interaction with the Muon optimizer.
Google's Gemini Embedding 2 is a native multimodal embedding model capable of creating unified vector representations for video, audio, image, and text, including interleaved combinations. The model achieved state-of-the-art performance across multimodal retrieval benchmarks with a global mean score of 77.2, and demonstrated superior native audio understanding, outperforming ASR-based baselines by +3.59 mrr@10 on the MSEB benchmark.
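For readers unfamiliar with the mrr@10 figure quoted above, the following is a minimal sketch of how mean reciprocal rank at a cutoff of 10 is conventionally computed. The function name `mrr_at_k` and the toy document ids are illustrative assumptions, not part of the Gemini Embedding 2 or MSEB evaluation code, and the summary does not say whether the reported gain is on a 0-1 or 0-100 scale.

```python
def mrr_at_k(ranked_ids_per_query, relevant_ids_per_query, k=10):
    """Mean reciprocal rank, counting only the top-k results per query.

    ranked_ids_per_query: list of ranked result-id lists, best first.
    relevant_ids_per_query: list of sets of relevant ids, one per query.
    """
    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        # Reciprocal rank of the first relevant hit; 0 if none in the top k.
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_ids_per_query)


# Hypothetical toy data: three queries with made-up ids.
rankings = [["a", "b", "c"], ["x", "y", "z"], ["p", "q"]]
relevant = [{"b"}, {"x"}, {"missing"}]
score = mrr_at_k(rankings, relevant, k=10)
```

Here the first query contributes 1/2, the second 1, and the third 0, so the score is their mean.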
Gamma-World is a generative multi-agent world model designed for interactive simulation, introducing Simplex Rotary Agent Encoding for permutation-symmetric agent identities and Sparse Hub Attention for efficient cross-agent communication. The model demonstrates superior video fidelity (e.g., FVD 184.1 vs 333.8 for Solaris) and scalability (linear computational cost with agents) compared to baselines, enabling real-time interactive rollouts and zero-shot generalization to more agents.
This work presents a sample-complexity theory demonstrating that learning from a model's own latent representations significantly improves data efficiency for hierarchical structures. It shows that iterative latent prediction achieves a sample complexity proportional to the vocabulary size and branching factor, constant with respect to hierarchy depth, in contrast to the exponential dependency found in token-level prediction methods.
Researchers at Harvard and MIT developed Bidirectional Evolutionary Search (BES), a framework that integrates evolutionary operators with a backward search for goal decomposition, enabling language models and agentic systems to discover higher-quality solutions. The method showed performance gains of up to 3.8% on multi-hop reasoning for Llama-3.1-8B-Instruct and consistently outperformed baselines on open problem-solving tasks like Circle Packing and Heilbronn problems.
An "LLM sleep" mechanism enables Transformer-based models with fast-weight memories to perform offline recurrent passes for memory consolidation, enhancing their ability to perform deep reasoning on information evicted from the active attention window. This approach consistently improves task performance on benchmarks like cellular automata and math reasoning by allowing iterative processing of past contexts without increasing real-time inference latency.
AUTOSCIENTISTS is a decentralized, self-organizing multi-agent framework designed to conduct long-running scientific experimentation without central coordination. The system achieved a 74.40% mean leaderboard percentile on BioML-Bench, 1.9x acceleration in GPT training optimization, and improved protein fitness prediction by 6.5% on ProteinGym.
GenClaw introduces a code-driven agentic paradigm for image generation, empowering AI agents to use executable code as a direct 'paintbrush' for precise visual construction, rather than relying solely on text prompt optimization. The system substantially enhances compositional control, text rendering accuracy, and image editing consistency by first generating structured, code-based visual sketches before leveraging image models for photorealistic details.
DéjàView introduces a looping transformer architecture for multi-view 3D reconstruction that re-uses a single, time-conditioned block for iterative refinement. This approach achieves accuracy comparable to or better than billion-parameter feed-forward models while using only 117 million parameters and significantly less memory.
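The control flow behind "re-using a single, time-conditioned block" can be sketched as a loop that applies one function repeatedly while passing it a progress signal. This is only an illustration of weight sharing across iterations: `refine` and `toy_block` are hypothetical names, the toy block is a simple arithmetic update rather than a transformer layer, and nothing here reflects DéjàView's actual implementation.

```python
def refine(state, block, num_loops):
    """Apply the same block num_loops times, conditioning on loop progress.

    Because one block is reused, the parameter count stays that of a single
    block no matter how many refinement steps are run.
    """
    for step in range(num_loops):
        # Progress signal in [0, 1] so the shared block knows which pass it is on.
        t = step / max(num_loops - 1, 1)
        state = block(state, t)
    return state


def toy_block(state, t):
    # Hypothetical stand-in for a learned layer: halve each value, add t.
    return [0.5 * value + t for value in state]


refined = refine([4.0], toy_block, num_loops=3)
```

A conventional feed-forward stack would instead use a distinct block per step, so its parameter count grows with depth while the looped version's does not.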
minWM is an open-source framework that transforms bidirectional video diffusion models into real-time interactive, camera-controllable autoregressive (AR) world models. It achieves over 200x reduction in first-frame latency while maintaining visual quality and user-defined camera control, providing a unified and reproducible pipeline for this conversion.
GPIC, a collaborative effort from Stanford University and other institutions, introduces a permissive, stable, large, and accessible image corpus comprising 100 million images with high-quality synthetic text captions, designed as a modern benchmark for visual generative modeling. It provides a new foundation for reproducible research and evaluation using the FD-DINOv2 metric, addressing the limitations of prior benchmarks.
Researchers at The University of Tokyo analyzed the effects of compressed Chain-of-Thought (CoT) data on large language model (LLM) post-training, demonstrating that while Supervised Fine-Tuning (SFT) struggles with decomposition, subsequent Reinforcement Learning with Verifiable Rewards (RLVR) effectively enables models to decompose compressed reasoning steps and generalize to novel compositional tasks. The study provides a CoT taxonomy and guidelines for efficient data design.
LocateAnything is a vision-language framework that introduces Parallel Box Decoding (PBD) to enable the simultaneous generation of entire bounding box units, overcoming the sequential decoding bottleneck in visual grounding. This method achieves a greater than 10x speedup in decoding throughput and improves localization accuracy (for example, by 3.8% in mean F1-score on LVIS over previous methods), setting new benchmarks across diverse visual perception tasks.
Researchers at Hexo Labs and the University of Oxford developed SIA, a system that autonomously improves AI performance by iteratively updating both its operational scaffold and underlying model weights. SIA consistently surpassed harness-only approaches and prior state-of-the-art across diverse tasks, achieving gains such as a 20.1 percentage point accuracy increase in Chinese legal classification and a 12.4% faster runtime for GPU kernel optimization.
Researchers from the Harbin Institute of Technology developed Effective Feedback Compute (EFC), a trace-level scalar measure quantifying the utility of closed-loop feedback in agent harnesses. This new scaling coordinate and its task-demand-normalized variants consistently achieve higher predictive accuracy (up to R^2 = 0.99) for agent system performance compared to raw computational metrics, demonstrating that feedback quality is a primary driver of success.
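The R^2 value quoted above is the coefficient of determination, which measures how much of the variance in observed outcomes a predictor explains. The sketch below shows the standard definition only; the function name `r_squared` and the sample numbers are illustrative assumptions and do not reproduce how EFC itself is computed or fitted.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    # Residual sum of squares: error left over after prediction.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: spread of the observations around their mean.
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical values: a perfect predictor and a constant (mean) predictor.
perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

A perfect predictor scores 1 and a predictor that always outputs the mean scores 0, so a value near 0.99 indicates the coordinate accounts for nearly all observed variation in the fitted data.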
MUSE-Autoskill presents a comprehensive agent framework that enables large language model agents to autonomously create, manage, evaluate, and refine their own reusable skills, integrating a five-stage skill lifecycle and multi-level memory system. This approach leads to a 7.16 percentage point improvement in task accuracy with self-generated skills and allows for the transfer of these skills across different agent architectures, improving another agent's performance by 10.51 percentage points.
Researchers from Zhejiang University and Alibaba Group developed FluxMem, a memory framework for LLM agents that models memory as a dynamically evolving heterogeneous graph, adapting its structure and content through continuous interaction. The system achieved state-of-the-art performance across diverse benchmarks, including a 95.06% LLM-as-a-judge score on LoCoMo, an 8.1% success rate on Mind2Web Cross-Task, and a 12.73% absolute improvement on GAIA general assistant tasks.
NEO-ov, developed by S-Lab, NTU, and SenseTime Research, introduces a native, encoder-free vision-language model that unifies single-image, multi-image, and video understanding with spatial intelligence through an end-to-end autoregressive architecture. The model establishes new performance benchmarks for native VLMs, matching or surpassing leading modular counterparts on various reasoning and perception tasks, and showing strong spatial intelligence comparable to specialist models.