alphaXiv

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
28 May 2026
Qiuyue Wang
Mingsheng Li
Jian Guan

Qwen-VLA, developed by the Qwen Team, unifies diverse embodied decision-making problems such as robot manipulation and vision-language navigation into a single vision-language-action model. This unified framework achieved an 83.6% average success rate on real-world ALOHA manipulation tasks and demonstrated strong generalization, including outperforming previous specialist models on specific dynamic manipulation benchmarks.

#computer-science #artificial-intelligence #computation-and-language

Sandbox template: Qwen VPO

A single-GPU template for RL fine-tuning a Qwen base model with VPO. Fork it, hit launch, and you get a ready-to-run training loop for experimenting with verifiable rewards and reasoning behavior.

#reinforcement-learning #fine-tuning #reasoning

When Does LeJEPA Learn a World Model?
25 May 2026
David Klindt
Yann LeCun
Randall Balestriero

Research by Klindt, LeCun, and Balestriero formally defines and proves the conditions under which a Joint-Embedding Predictive Architecture (JEPA), specifically LeJEPA, learns a "World Model" by recovering the true underlying latent variables. It demonstrates that LeJEPA achieves linear identifiability if and only if the latent variables follow an isotropic Gaussian distribution, which then enables optimal latent-space planning.

#computer-science #machine-learning #deep-reinforcement-learning

Parallax: Parameterized Local Linear Attention for Language Modeling
27 May 2026
Yifei Zuo
Dhruv Pai
Zhichen Zeng

Parallax introduces a Parameterized Local Linear Attention mechanism that scales the theoretical advantages of Local Linear Attention (LLA) for large language model pretraining. It achieves consistent perplexity reductions and improved downstream performance by reformulating LLA for computational efficiency and demonstrating a strong, positive interaction with the Muon optimizer.

#attention-mechanisms #computer-science #artificial-intelligence

Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini
26 May 2026
Madhuri Shanbhogue
Zhe Li
Shanfeng Zhang

Google's Gemini Embedding 2 is a native multimodal embedding model that creates unified vector representations for video, audio, image, and text, including interleaved combinations. The model achieved state-of-the-art performance across multimodal retrieval benchmarks with a global mean score of 77.2, and demonstrated superior native audio understanding, outperforming ASR-based baselines by 3.59 in MRR@10 on the MSEB benchmark.
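
For readers unfamiliar with the retrieval metric quoted above, the following is a minimal sketch of the standard MRR@10 definition (mean reciprocal rank, truncated at rank 10). The function name `mrr_at_k` and the toy queries are hypothetical and are not taken from the paper or from MSEB. The function returns a value in [0, 1]; reported figures are often scaled by 100, and the summary does not say which scale the 3.59 figure uses.

```python
def mrr_at_k(ranked_lists, relevant, k=10):
    """Mean reciprocal rank truncated at rank k.

    ranked_lists: dict mapping query id -> list of retrieved ids, best first.
    relevant:     dict mapping query id -> set of ids judged relevant.
    A query with no relevant item in its top k contributes 0.
    """
    total = 0.0
    for query_id, ranking in ranked_lists.items():
        targets = relevant[query_id]
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id in targets:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_lists)


# Toy data (hypothetical): hits at rank 2, rank 1, and no hit.
toy_rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"], "q3": ["m", "n"]}
toy_relevant = {"q1": {"b"}, "q2": {"x"}, "q3": {"unseen"}}
toy_score = mrr_at_k(toy_rankings, toy_relevant)  # (1/2 + 1 + 0) / 3
```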

#computer-science #contrastive-learning #computer-vision-and-pattern-recognition

Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players
27 May 2026
Fangfu Liu
Kai He
Tianchang Shen

Gamma-World is a generative multi-agent world model designed for interactive simulation, introducing Simplex Rotary Agent Encoding for permutation-symmetric agent identities and Sparse Hub Attention for efficient cross-agent communication. Compared to baselines, the model demonstrates better video fidelity (e.g., an FVD of 184.1 versus 333.8 for Solaris) and better scalability (computational cost that grows linearly with the number of agents), enabling real-time interactive rollouts and zero-shot generalization to more agents.

#attention-mechanisms #computer-science #computer-vision-and-pattern-recognition

Learn from your own latents and not from tokens: A sample-complexity theory
26 May 2026
Daniel J. Korchinski
Alessandro Favero
Matthieu Wyart

This work presents a sample-complexity theory demonstrating that learning from a model's own latent representations significantly improves data efficiency for hierarchical structures. It shows that iterative latent prediction achieves a sample complexity proportional to the vocabulary size and branching factor, constant with respect to hierarchy depth, in contrast to the exponential dependency found in token-level prediction methods.

#computer-science #machine-learning #representation-learning

Self-Improving Language Models with Bidirectional Evolutionary Search
27 May 2026
Guowei Xu
Zhenting Qi
Huangyuan Su

Researchers at Harvard and MIT developed Bidirectional Evolutionary Search (BES), a framework that integrates evolutionary operators with a backward search for goal decomposition, enabling language models and agentic systems to discover higher-quality solutions. The method showed performance gains of up to 3.8% on multi-hop reasoning for Llama-3.1-8B-Instruct and consistently outperformed baselines on open problem-solving tasks like Circle Packing and Heilbronn problems.

#agentic-frameworks #agents #computer-science

Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference
27 May 2026
Sangyun Lee
Sean McLeish
Tom Goldstein

An "LLM sleep" mechanism enables Transformer-based models with fast-weight memories to perform offline recurrent passes for memory consolidation, enhancing their ability to perform deep reasoning on information evicted from the active attention window. This approach consistently improves task performance on benchmarks like cellular automata and math reasoning by allowing iterative processing of past contexts without increasing real-time inference latency.

#attention-mechanisms #computer-science #artificial-intelligence

AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation
27 May 2026
Shanghua Gao
Ada Fang
Marinka Zitnik

AutoScientists is a decentralized, self-organizing multi-agent framework designed to conduct long-running scientific experimentation without central coordination. The system achieved a 74.40% mean leaderboard percentile on BioML-Bench, a 1.9x acceleration in GPT training optimization, and a 6.5% improvement in protein fitness prediction on ProteinGym.

#agentic-frameworks #agents #ai-for-genomics

GenClaw: Code-Driven Agentic Image Generation
28 May 2026
Junyan Ye
Jun He
Zilong Huang

GenClaw introduces a code-driven agentic paradigm for image generation, empowering AI agents to use executable code as a direct 'paintbrush' for precise visual construction, rather than relying solely on text prompt optimization. The system substantially enhances compositional control, text rendering accuracy, and image editing consistency by first generating structured, code-based visual sketches before leveraging image models for photorealistic details.

#agents #computer-science #computer-vision-and-pattern-recognition

Déjà View: Looping Transformers for Multi-View 3D Reconstruction
28 May 2026
Alessandro Burzio
Tobias Fischer
Sven Elflein

Déjà View introduces a looping transformer architecture for multi-view 3D reconstruction that reuses a single, time-conditioned block for iterative refinement. This approach achieves accuracy comparable to or better than billion-parameter feed-forward models while using only 117 million parameters and significantly less memory.

#computer-science #computer-vision-and-pattern-recognition #efficient-transformers

minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models
28 May 2026
Min Zhao
Hongzhou Zhu
Bokai Yan

minWM is an open-source framework that transforms bidirectional video diffusion models into real-time, interactive, camera-controllable autoregressive (AR) world models. It achieves a more than 200x reduction in first-frame latency while maintaining visual quality and user-defined camera control, providing a unified and reproducible pipeline for this conversion.

#computer-science #computer-vision-and-pattern-recognition #fine-tuning

GPIC: A Giant Permissive Image Corpus for Visual Generation
28 May 2026
Keshigeyan Chandrasegaran
Kyle Sargent
Suchir Agarwal

GPIC, a collaborative effort from Stanford University and other institutions, introduces a permissive, stable, large, and accessible image corpus comprising 100 million images with high-quality synthetic text captions, designed as a modern benchmark for visual generative modeling. It provides a new foundation for reproducible research and evaluation using the FD-DINOv2 metric, addressing the limitations of prior benchmarks.

#computer-science #artificial-intelligence #computer-vision-and-pattern-recognition

Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training
27 May 2026
Kohsei Matsutani
Gouki Minegishi
Takeshi Kojima

Researchers at The University of Tokyo analyzed the effects of compressed Chain-of-Thought (CoT) data on large language model (LLM) post-training, demonstrating that while Supervised Fine-Tuning (SFT) struggles with decomposition, subsequent Reinforcement Learning with Verifiable Rewards (RLVR) effectively enables models to decompose compressed reasoning steps and generalize to novel compositional tasks. The study provides a CoT taxonomy and guidelines for efficient data design.

#chain-of-thought #computer-science #artificial-intelligence

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding
27 May 2026
Shihao Wang
Shilong Liu
Yuanguo Kuang

LocateAnything is a vision-language framework that introduces Parallel Box Decoding (PBD) to generate entire bounding box units simultaneously, overcoming the sequential decoding bottleneck in visual grounding. This method achieves a more than 10x speedup in decoding throughput and improves localization accuracy, for example by 3.8% in mean F1-score on LVIS over previous methods, setting new benchmarks across diverse visual perception tasks.

#computer-science #artificial-intelligence #computer-vision-and-pattern-recognition

SIA: Self Improving AI with Harness & Weight Updates
28 May 2026
Prannay Hebbar
Yogendra Manawat
Samuel Verboomen

Researchers at Hexo Labs and the University of Oxford developed SIA, a system that autonomously improves AI performance by iteratively updating both its operational scaffold and underlying model weights. SIA consistently surpassed harness-only approaches and prior state-of-the-art across diverse tasks, achieving gains such as a 20.1 percentage point accuracy increase in Chinese legal classification and a 12.4% faster runtime for GPU kernel optimization.

#agentic-frameworks #agents #ai-for-health

Scaling Laws for Agent Harnesses via Effective Feedback Compute
28 May 2026
Xuanliang Zhang
Dingzirui Wang
Keyan Xu

Researchers from the Harbin Institute of Technology developed Effective Feedback Compute (EFC), a trace-level scalar measure quantifying the utility of closed-loop feedback in agent harnesses. This new scaling coordinate and its task-demand-normalized variants consistently achieve higher predictive accuracy (up to R^2 = 0.99) for agent system performance compared to raw computational metrics, demonstrating that feedback quality is a primary driver of success.
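
As background for the R^2 figure quoted above, here is a minimal sketch of the usual coefficient of determination, which measures how much of the variance in observed values a set of predictions explains. This is the generic textbook definition only; the paper's actual fitting procedure, its EFC computation, and its data are not described in this summary, and the function name `r_squared` and the sample numbers are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    observed:  measured values (e.g. task success rates).
    predicted: values predicted from some scaling coordinate.
    Assumes the observed values are not all identical (SS_tot > 0).
    """
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical numbers: a perfect fit gives 1.0, and always predicting
# the mean of the observations gives 0.0.
sample_observed = [1.0, 2.0, 3.0, 4.0]
perfect_fit = r_squared(sample_observed, [1.0, 2.0, 3.0, 4.0])
mean_only_fit = r_squared(sample_observed, [2.5, 2.5, 2.5, 2.5])
```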

#agentic-frameworks #agents #computer-science

MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation
26 May 2026
Huawei Lin
Peng Li
Jie Song

MUSE-Autoskill presents a comprehensive agent framework that enables large language model agents to autonomously create, manage, evaluate, and refine their own reusable skills, integrating a five-stage skill lifecycle and multi-level memory system. This approach leads to a 7.16 percentage point improvement in task accuracy with self-generated skills and allows for the transfer of these skills across different agent architectures, improving another agent's performance by 10.51 percentage points.

#agentic-frameworks #agents #computer-science

Rethinking Memory as Continuously Evolving Connectivity
27 May 2026
Jizhan Fang
Buqiang Xu
Zhixian Wang

Researchers from Zhejiang University and Alibaba Group developed FluxMem, a memory framework for LLM agents that models memory as a dynamically evolving heterogeneous graph, adapting its structure and content through continuous interaction. The system achieved state-of-the-art performance across diverse benchmarks, including a 95.06% LLM-as-a-judge score on LoCoMo, an 8.1% success rate on Mind2Web Cross-Task, and a 12.73% absolute improvement on GAIA general assistant tasks.

#agentic-frameworks #agents #computer-science

From Pixels to Words -- Towards Native One-Vision Models at Scale
27 May 2026
Haiwen Diao
Jiahao Wang
Penghao Wu

NEO-ov, developed by S-Lab, NTU, and SenseTime Research, introduces a native, encoder-free vision-language model that unifies single-image, multi-image, and video understanding with spatial intelligence through an end-to-end autoregressive architecture. The model establishes new performance benchmarks for native VLMs, matching or surpassing leading modular counterparts on various reasoning and perception tasks, and showing strong spatial intelligence comparable to specialist models.

#computer-science #computer-vision-and-pattern-recognition #multi-modal-learning