
Together AI at ICML 2026: frontier research across the full stack
Quick Take
Together AI presents nine papers at ICML 2026, spanning the AI stack from frontier agents such as DSGym and ThunderAgent down to GPU kernels. Highlights include up to 3.6x higher agent throughput and state-of-the-art discoveries in four fields using open models.
Key Points
- DSGym standardizes evaluation for 1,000+ data science tasks across 10+ domains.
- ThunderAgent achieves up to 3.6x higher agent throughput with minimal code changes.
- TTT-Discover beats the best human results in multiple fields using an open 120B model.
- Aurora's adaptive speculative decoding adds a further 1.25x speedup over a static speculator as traffic shifts.
- ParallelKernelBench benchmarks LLMs on 87 multi-GPU kernel generation tasks.
Nine papers is a lot to take in as a list. The better way to read them is by where they sit in the stack. Frontier AI is not built at a single layer. It is the product of research that runs from the agent down to the GPU kernel, and a gain at any one layer is wasted if the layers around it cannot keep up.
This is at the core of how we work. From frontier agents at the top to kernels at the bottom, our research touches each one, and each layer feeds the next. The research becomes part of the Together platform, and the production workloads running on that platform point us to the next research problem. Aurora, our ICML paper on adaptive speculative decoding, is a clear example: the same line of work ships today as our ATLAS speculator in production.
Here is this year's work, layer by layer, top down.
01 Frontier agents
Agents that do real work, measured on tasks you can't fake your way through.
- DSGym: 1,000+ tasks across 10+ domains
- ThunderAgent: up to 3.6× faster agent inference
- TTT-Discover: beats best human, open model
02 Model shaping
Turning a base model into a reasoner, even where there's no answer key to check.
- RARO: 25% win rate, no verifier
- V1: up to 10% more correct answers
03 Algorithmic optimizations
Cutting the cost of every token: speculative decoding that adapts to live traffic.
- Aurora: 1.25× as traffic shifts
04 Systems optimizations
Fitting more on the same GPUs: longer context, bigger batches, faster MoE decode.
- Untied Ulysses: 5M-token context on one node
- OEA: up to 39% faster MoE decode
05 Kernels
The bedrock of the stack: where raw hardware becomes usable speed.
- ParallelKernelBench: 87 multi-GPU kernel tasks
Frontier agents
The top of the stack: building agents that do real work, and measuring them honestly.
Paper 1
DSGym: A Holistic Framework for Evaluating and Training Data Science Agents
1,000+ tasks
10+ domains, one API
More than 1,000 data-science tasks across 10+ domains, unified behind a single evaluation and training API. Data-science agents have been hard to measure fairly: every benchmark has its own interface, and many of their tasks can be solved without ever opening the data. DSGym standardizes the measurement and closes that loophole.
It puts diverse evaluation suites behind one API with shared abstractions for datasets, agents, and metrics, and runs each task in a self-contained execution environment where the agent has to work with the data rather than recall an answer. On top of the refined existing suites, it adds 90 expert bioinformatics tasks grounded in academic literature and 92 end-to-end Kaggle-style modeling competitions. The same environment then runs in reverse as a training engine: trajectory generation and synthetic query pipelines produce execution-verified data, which we used to train a 4B model into a state-of-the-art open-source data-science agent, with no human labeling. Evaluate honestly, synthesize from the same harness, fine-tune, re-evaluate, all in one framework.
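One way to picture "one API with shared abstractions for datasets, agents, and metrics" is the minimal sketch below. Every name in it (`Task`, `Agent`, `exact_match`, `evaluate`, the two toy agents) is hypothetical and chosen for illustration; none of it is DSGym's actual interface, and the sandboxed execution environment is reduced here to handing the agent a data payload it has to read.

```python
from dataclasses import dataclass
from typing import Callable, Protocol, Sequence


@dataclass(frozen=True)
class Task:
    """One benchmark item: a question plus the data needed to answer it."""
    task_id: str
    question: str
    data: Sequence[float]  # stand-in for files mounted in a sandbox
    expected: float


class Agent(Protocol):
    """Anything with a solve() method can be evaluated (hypothetical API)."""
    def solve(self, task: Task) -> float: ...


Metric = Callable[[float, float], float]


def exact_match(predicted: float, expected: float) -> float:
    """Score 1.0 for a numerically matching answer, else 0.0."""
    return 1.0 if abs(predicted - expected) < 1e-9 else 0.0


def evaluate(agent: Agent, tasks: Sequence[Task], metric: Metric) -> float:
    """Run every task through the same loop and average the metric."""
    scores = [metric(agent.solve(t), t.expected) for t in tasks]
    return sum(scores) / len(scores)


class MeanAgent:
    """Toy agent that actually reads the data."""
    def solve(self, task: Task) -> float:
        return sum(task.data) / len(task.data)


class GuessAgent:
    """Toy agent that ignores the data and guesses."""
    def solve(self, task: Task) -> float:
        return 0.0


TASKS = [
    Task("t1", "What is the mean?", [1.0, 2.0, 3.0], 2.0),
    Task("t2", "What is the mean?", [10.0, 20.0], 15.0),
]

data_score = evaluate(MeanAgent(), TASKS, exact_match)
guess_score = evaluate(GuessAgent(), TASKS, exact_match)
```

Because both agents go through the same `evaluate` loop and the same metric, the gap between them reflects only whether the agent worked with the data.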
Authors: Fan Nie, Junlin Wang, Harper Hua, Federico Bianchi, Yongchan Kwon, Zhenting Qi, Owen Queen, Shang Zhu, James Zou. With collaborators at Stanford, Duke, and Harvard.
Paper: arXiv:2601.16344
Paper 2
ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System
up to 3.6×
agent throughput
1.5 to 3.6x higher serving throughput for agent workloads, with three lines of code to adopt. Parallel agent workloads stop collapsing under load, and the fix is not a faster model.
The problem is that the inference engine has no idea it is running an agent. It treats each step of a multi-turn workflow as an isolated request, inflating latency by up to 7.14x under load. ThunderAgent makes the workflow itself a first-class object the scheduler can reason about end to end. Alongside the throughput gain, it delivers 1.8 to 3.9x faster RL rollouts and up to 4.2x disk savings over state-of-the-art systems.
Authors: Hao Kang, Ziyang Li, Xinyu Yang, Weili Xu, Yinfang Chen, Junxiong Wang, Beidi Chen, Tushar Krishna, Chenfeng Xu, Simran Arora. With collaborators at Georgia Tech, CMU, and UIUC.
Paper: arXiv:2602.13692
Paper 3
Learning to Discover at Test Time (TTT-Discover)
Beats best human
open model, ~$500
State-of-the-art discoveries across four fields (mathematics, GPU kernels, competitive algorithms, and biology), all with open models. Every prior result at this level relied on closed frontier models behind an API you cannot inspect or run. TTT-Discover reaches it with an open 120B model and one unchanged method, for a few hundred dollars per problem.
The usual recipe for AI discovery is search: prompt a frozen model thousands of times and keep the best sample. TTT-Discover instead runs reinforcement learning at test time on the single problem in front of it, so every attempt becomes training data for the next one and the model improves as it works. With the same sampling budget, plain best-of-N never catches up. The same setup, unchanged, set a tighter bound on a 60-year-old Erdős problem in mathematics (past the previous AI record, which used closed models), discovered a GPU kernel faster than the best prior submission on the GPUMode leaderboard, produced a first-place-level finish in a competitive-programming contest, and set a new high on a single-cell denoising benchmark in biology, all on the open gpt-oss-120b. The code and the record-setting kernels are public.
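The contrast between a frozen sampler and one that learns from its own attempts can be seen in a toy setting. The sketch below is not the paper's method: it swaps the RL update for a simple cross-entropy-style update (move the proposal mean to the average of the top-scoring samples), uses a made-up one-dimensional objective, and all function names and constants are assumptions for illustration.

```python
import random


def score(x: float) -> float:
    """Toy objective: higher is better, with the optimum at x = 6."""
    return -(x - 6.0) ** 2


def best_of_n(rng: random.Random, budget: int) -> float:
    """Frozen sampler: every draw comes from the same N(0, 1) proposal."""
    return max(score(rng.gauss(0.0, 1.0)) for _ in range(budget))


def learn_while_sampling(rng: random.Random, rounds: int,
                         per_round: int, elite: int):
    """Adaptive sampler: after each round, move the proposal mean to the
    average of the top-scoring samples (a cross-entropy-style update,
    standing in for the RL update described in the text)."""
    mean, best = 0.0, float("-inf")
    for _ in range(rounds):
        samples = [rng.gauss(mean, 1.0) for _ in range(per_round)]
        samples.sort(key=score, reverse=True)
        best = max(best, score(samples[0]))
        mean = sum(samples[:elite]) / elite
    return best, mean


ROUNDS, PER_ROUND = 10, 32  # both samplers get the same 320-sample budget
frozen_best = best_of_n(random.Random(0), ROUNDS * PER_ROUND)
adaptive_best, final_mean = learn_while_sampling(
    random.Random(0), ROUNDS, PER_ROUND, elite=8
)
```

With the optimum six standard deviations away from the frozen proposal, extra samples barely help the frozen sampler, while the adaptive one walks its proposal toward the optimum within a few rounds.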
Authors: Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, Yu Sun. With collaborators at Stanford, NVIDIA, UC San Diego, and the Astera Institute.
Paper: arXiv:2601.16175
Model shaping
How you train and shape a model: reasoning, fine-tuning, and reinforcement learning.
Paper 4
Escaping the Verifier: Learning to Reason via Demonstrations (RARO)
25% vs 5.9%
vs experts · no verifier
A 25% win rate against expert responses, versus 5.9% for supervised fine-tuning, with no verifier at all. You can get RL-grade reasoning on tasks that have no checker, like poetry writing or financial analysis, not just math and code.
RL-based reasoning normally assumes a verifier that can score correctness. RARO (Relativistic Adversarial Reasoning Optimization) replaces it with an adversarial game: one model acts both as a policy that produces expert-quality answers and as a relativistic critic, which learns to pick the better of two responses. Pairwise comparison with a tie option is key for training stability. On Countdown, RARO reaches 54.4% accuracy without a verifier, versus 57.7% for RL with a ground-truth verifier; by contrast, SFT and iterative DPO do not go beyond 40.7%. The learned critic can double as a test-time reranker that lifts DeepMath from 57.5% to 68.4% at 7B.
Authors: Locke Cai, Max Ryabinin, Ivan Provilkov
Paper: arXiv:2511.21667
Paper 5
V1: Unifying Generation and Self-Verification for Parallel Reasoners
+10%
Pass@1 · same compute
Up to 10% more correct answers, pulled from generations you already paid for. The win is better answer selection, not more compute.
When you sample many answers with no oracle, scoring each one independently fails: the judge hands almost everything a 10 out of 10 and loses the ability to discriminate. V1 reframes selection as comparison through a near-linear Swiss-tournament verifier. The training recipe, V1-PairRL, teaches a single model to generate and to pairwise-verify its own outputs at the same time, which lifts base accuracy even with no test-time verification at all.
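To see why a Swiss-style tournament keeps the number of comparisons near-linear, here is a minimal sketch. It is a generic Swiss pairing (sort by current record, pair neighbours), not V1's actual verifier; `swiss_select`, the `QUALITY` table, and the deterministic `judge` are hypothetical stand-ins for a model-based pairwise verifier, and rematch avoidance is omitted for brevity.

```python
from typing import Callable, Sequence, Tuple


def swiss_select(
    candidates: Sequence[str],
    prefer_first: Callable[[str, str], bool],
    rounds: int,
) -> Tuple[int, int]:
    """Return (index of the top candidate, number of pairwise comparisons)."""
    wins = [0] * len(candidates)
    comparisons = 0
    for _ in range(rounds):
        # Pair candidates with similar records; ties are broken by index.
        order = sorted(range(len(candidates)), key=lambda i: (-wins[i], i))
        for a, b in zip(order[0::2], order[1::2]):
            comparisons += 1
            winner = a if prefer_first(candidates[a], candidates[b]) else b
            wins[winner] += 1
    top = max(range(len(candidates)), key=lambda i: (wins[i], -i))
    return top, comparisons


# Hidden quality scores play the role of the pairwise verifier's judgment.
ANSWERS = ["a0", "a1", "a2", "a3", "a4", "a5", "a6", "a7"]
QUALITY = {"a0": 0.2, "a1": 0.5, "a2": 0.1, "a3": 0.7,
           "a4": 0.3, "a5": 0.9, "a6": 0.4, "a7": 0.6}


def judge(x: str, y: str) -> bool:
    """Toy comparator: prefer the answer with higher hidden quality."""
    return QUALITY[x] > QUALITY[y]


top_index, num_comparisons = swiss_select(ANSWERS, judge, rounds=3)
```

Eight candidates over three rounds cost 12 comparisons, against 28 for a full round-robin; in general the cost is about (N / 2) × rounds, so with a logarithmic number of rounds it grows only slightly faster than N.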
Authors: Harman Singh, Xiuyu Li, Kusha Sareen, Monishwaran Maheswaran, Sijun Tan, Xiaoxia Wu, Junxiong Wang, Alpay Ariyak, Qingyang Wu, Samir Khaki, Rishabh Tiwari, Long Lian, Yucheng Lu, Boyi Li, Alane Suhr, Ben Athiwaratkun, Kurt Keutzer. With collaborators at UC Berkeley, NVIDIA, and Mila.
Paper: arXiv:2603.04304
Algorithmic optimizations
Making the math of inference cheaper: speculative decoding, quantization, RL inference.
Paper 6
When RL Meets Adaptive Speculative Training: A Unified Training-Serving System (Aurora)
1.25×
as traffic shifts
A 1.5x day-0 speedup on brand-new frontier models like MiniMax M2.1 229B and Qwen3-Coder-Next 80B, plus an additional 1.25x over a strong static speculator as traffic shifts. Speculative decoding that is fast on day 0 and keeps getting faster the longer it runs.
Most deployments train the speculator offline and freeze it, so it is slow to deploy and goes stale as traffic and target models change. Aurora reframes online speculator learning as an asynchronous reinforcement-learning problem running in production: accepted and rejected tokens are the reward signal, the training server updates the speculator continuously, and new weights hot-swap into the server with zero downtime.
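As a rough illustration of how accepted and rejected tokens can act as a training signal, the sketch below labels one drafted block after verification. It assumes greedy, exact-match verification and a +1 / -1 reward; both are simplifying assumptions of ours, not details of Aurora, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def accepted_prefix(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count the leading draft tokens that match the target model's tokens."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n


def token_rewards(draft: Sequence[int],
                  target: Sequence[int]) -> List[Tuple[int, float]]:
    """Label verified draft tokens: +1.0 if accepted, -1.0 for the first
    rejected one. Tokens after the first rejection were never verified,
    so they get no label (assumed reward scheme)."""
    n = accepted_prefix(draft, target)
    labels = [(tok, 1.0) for tok in draft[:n]]
    if n < min(len(draft), len(target)):
        labels.append((draft[n], -1.0))
    return labels


def acceptance_rate(draft: Sequence[int], target: Sequence[int]) -> float:
    """Fraction of drafted tokens the target model accepted."""
    return accepted_prefix(draft, target) / len(draft)


DRAFT = [5, 9, 2, 7]   # tokens proposed by the speculator
TARGET = [5, 9, 4, 7]  # tokens the target model would have produced
REWARDS = token_rewards(DRAFT, TARGET)
RATE = acceptance_rate(DRAFT, TARGET)
```

Every served request produces labels like these as a by-product of verification, which is why a speculator can keep training on live traffic without separate data collection.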
Authors: Junxiong Wang, Fengxiang Bie, Jisen Li, Zhongzhu Zhou, Zelei Shao, Yubo Wang, Yinghui Liu, Qingyang Wu, Avner May, Sri Yanamandra, Yineng Zhang, Ce Zhang, Tri Dao, Percy Liang, Ben Athiwaratkun, Shuaiwen Leon Song, Chenfeng Xu, Xiaoxia Wu.
Paper: arXiv:2602.06932
Systems optimizations
The systems that train and serve models: disaggregation, batching, scheduling, context parallelism.
Paper 7
Untied Ulysses: Memory-Efficient Context Parallelism via Headwise Chunking
5M tokens
one node · ~87.5% mem
5M-token context training on a single 8xH100 node, with up to 87.5% less attention memory. You can train at a very long context without a bigger cluster.
Activation memory inside the attention layer caps long-context training, and adding GPUs does not move that ceiling. UPipe processes a few attention heads at a time and reuses the same buffers across stages, cutting peak attention memory by up to 87.5% on a 32B Transformer. The result is 5M tokens on one node, about 25% beyond prior methods, and 8M tokens on two nodes, while matching their throughput. It is a drop-in replacement for DeepSpeed-Ulysses on the same FlashAttention-3 kernels, and the code is open.
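The headline saving follows from simple arithmetic if the attention buffer scales linearly with the number of heads that are live at once and each chunk reuses the same buffer. The sketch below encodes that assumption; the function names and the 8-to-1 ratio are illustrative choices of ours, not a configuration stated above.

```python
def peak_attention_fraction(local_heads: int, heads_per_chunk: int) -> float:
    """Fraction of the all-heads-at-once attention buffer that is needed
    when only `heads_per_chunk` heads are live at a time.

    Assumes buffer size is linear in the number of live heads and that
    successive chunks reuse the same buffer.
    """
    if not 0 < heads_per_chunk <= local_heads:
        raise ValueError("heads_per_chunk must be in 1..local_heads")
    return heads_per_chunk / local_heads


def memory_saving(local_heads: int, heads_per_chunk: int) -> float:
    """Relative reduction in peak attention memory from headwise chunking."""
    return 1.0 - peak_attention_fraction(local_heads, heads_per_chunk)


# Keeping one eighth of the heads live at once gives the 87.5% figure.
saving_one_eighth = memory_saving(local_heads=8, heads_per_chunk=1)
saving_none = memory_saving(local_heads=8, heads_per_chunk=8)
```

Under this model the saving is 1 - 1/8 = 87.5% when one eighth of the heads is processed at a time, and zero when all heads are processed together.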
Authors: Ravi Ghadia, Maksim Abraham, Sergei Vorobyov, Max Ryabinin.
Paper: arXiv:2602.21196
Paper 8
Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining (OEA)
up to 39%
faster MoE decode
Up to 39% lower MoE decode latency, with no retraining and no architecture change. You reclaim the sparsity that batching quietly destroys, for free.
A Mixture-of-Experts model is supposed to be cheap because each token touches only a few experts, but the moment you batch, that sparsity collapses: at batch size 16, a model where each token wants eight experts ends up loading around 82 of them. OEA's batch-aware routing recovers the sparsity at inference time in two phases, the second of which lets tokens share experts the batch has already loaded, buying back quality at zero latency cost. Accuracy holds flat across AIME24, GPQA, LiveCodeBench, and MATH 500, and there is a single hyperparameter to tune.
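The "around 82" figure can be reproduced with a short calculation under two assumptions that are ours, not stated above: the model has 128 experts per layer, and each token picks its eight experts uniformly and independently of the other tokens. An expert is then missed by one token with probability 1 - 8/128, and by all 16 tokens with that probability raised to the 16th power.

```python
def expected_active_experts(num_experts: int, top_k: int,
                            batch_size: int) -> float:
    """Expected number of distinct experts touched by a batch, assuming
    each token selects `top_k` distinct experts uniformly at random and
    independently of every other token."""
    p_untouched = (1.0 - top_k / num_experts) ** batch_size
    return num_experts * (1.0 - p_untouched)


# Assumed configuration: 128 experts, top-8 routing, batch size 16.
ACTIVE_AT_16 = expected_active_experts(num_experts=128, top_k=8, batch_size=16)

# A single token touches exactly its own eight experts.
ACTIVE_AT_1 = expected_active_experts(num_experts=128, top_k=8, batch_size=1)
```

Under these assumptions the batch touches roughly 82 of the 128 experts, about ten times what a single token needs, which is the sparsity that batch-aware routing tries to win back.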
Authors: Costin-Andrei Oncescu, Qingyang Wu, Wai Tong Chung, Robert Wu, Bryan Gopal, Junxiong Wang, Tri Dao, Ben Athiwaratkun. With collaborators at Harvard and Princeton.
Paper: arXiv:2511.02237
Kernels
The bedrock of the stack: where raw hardware becomes usable speed. Built by the teams behind FlashAttention, ThunderKittens, and Megakernel.
Paper 9
ParallelKernelBench: Benchmarking LLMs on Multi-GPU Kernel Generation
87 workloads
multi-GPU kernel bench
We built the first benchmark for multi-GPU kernel generation. It has 87 real distributed workloads, and today's best frontier model solves fewer than a third.
ParallelKernelBench takes 87 problems from production codebases like Megatron-LM, DeepSpeed, NeMo-RL, and TensorRT-LLM, spanning 12 forms of parallelism. Each one asks a model to replace a PyTorch and NCCL reference with a CUDA kernel that uses symmetric memory for direct NVLink communication. Zero-shot, the best model clears 28 of 87. In an agentic loop with compile, correctness, and performance feedback, Gemini 3 Pro climbs from 24 to 35 correct and beats the PyTorch baseline on 26. A handful run faster than anything publicly available. Single-GPU code generation is getting solved. Multi-GPU is where it breaks, because communication, not compute, sets the ceiling.
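The agentic setting described above, where a model gets compile, correctness, and performance feedback between attempts, can be sketched as a plain refinement loop. This is an illustration only: `Feedback`, `refine`, and the stub generator and results are hypothetical, and a real harness would compile and run CUDA kernels rather than look up canned outcomes.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Feedback:
    compiled: bool
    correct: bool
    speedup: float  # candidate vs. reference; above 1.0 means faster
    message: str


def refine(
    generate: Callable[[Optional[Feedback]], str],
    evaluate: Callable[[str], Feedback],
    max_iters: int,
) -> Tuple[Optional[str], Optional[Feedback]]:
    """Ask for a kernel, evaluate it, feed the result back, and keep the
    fastest candidate that both compiles and is correct."""
    best_src, best_fb = None, None
    feedback = None
    for _ in range(max_iters):
        src = generate(feedback)
        feedback = evaluate(src)
        if feedback.compiled and feedback.correct:
            if best_fb is None or feedback.speedup > best_fb.speedup:
                best_src, best_fb = src, feedback
    return best_src, best_fb


def make_stub_generator() -> Callable[[Optional[Feedback]], str]:
    """Stand-in for an LLM: returns 'v0', 'v1', ... on successive calls."""
    seen = []

    def generate(feedback: Optional[Feedback]) -> str:
        seen.append(feedback)
        return f"v{len(seen) - 1}"

    return generate


# Canned evaluation outcomes in place of a real compile-and-benchmark step.
STUB_RESULTS = {
    "v0": Feedback(False, False, 0.0, "compile error"),
    "v1": Feedback(True, True, 0.8, "correct but slower than reference"),
    "v2": Feedback(True, True, 1.3, "correct and faster than reference"),
}

best_src, best_fb = refine(make_stub_generator(),
                           STUB_RESULTS.__getitem__, max_iters=3)
```

The loop separates the two bars the benchmark reports: a candidate first has to be correct to count at all, and only then is it compared against the reference on speed.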
Authors: Willy Chan, Nathan Paek, Simon Guo, Simran Arora, Daniel Y. Fu.
Paper: https://www.alphaxiv.org/abs/2606.parallel-kernel-bench
By the numbers
- DSGym: 1,000+ tasks · 10+ domains, one API
- ThunderAgent: up to 3.6× agent throughput
- TTT-Discover: beats best human · open model, ~$500
- RARO: 25% vs 5.9% win rate against experts · no verifier
- V1: +10% Pass@1 · same compute
- Aurora: 1.25× faster as traffic shifts
- Untied Ulysses: 5M tokens on one node · up to 87.5% less attention memory
- OEA: up to 39% faster MoE decode
- ParallelKernelBench: 87 workloads · multi-GPU kernel bench
Where to find us in Seoul
All nine papers are at ICML 2026, July 6 to 11, in Seoul. Stop by booth B714 to dive deeper into the work.
Together AI @ ICML 2026
COEX, Seoul · July 6 to 11 · Booth B714, all week
Paper · Date & time (Seoul, KST) · Session
Frontier agents
- DSGym · Thu 7/9, 2:30 PM – 4:15 PM · #66567
- ThunderAgent · Tue 7/7, 10:30 AM – 12:15 PM · #62040
- TTT-Discover · Tue 7/7, 5:00 PM – 6:45 PM · #62199
Model shaping
- RARO · Thu 7/9, 10:30 AM – 12:15 PM · #61507
- V1 · Thu 7/9, 5:00 PM – 6:45 PM · #64825
Algorithmic optimizations
- Aurora · Thu 7/9, 10:30 AM – 12:15 PM · #66675
Systems optimizations
- Untied Ulysses · Thu 7/9, 5:00 PM – 6:45 PM · #66102
- OEA · Wed 7/8, 10:30 AM – 12:15 PM · #62958
Kernels
- ParallelKernelBench · Fri 7/10, 8:00 AM – 5:00 PM · DL4C, Hall B2
We will also be at our booth all week. Stop by to:
- Talk through any of the nine papers with the people who wrote them
- See how this research shows up in the Together platform, from fine-tuning to production inference
- Meet the team and hear what we are working on next
Come build this with us
We are hiring researchers and research engineers who want to work across the whole stack, not just one layer of it. See open roles at together.ai/careers, or come find us at the booth.
For everyone else: request a meeting in Seoul, browse the full research blog, and follow along on X at @togethercompute as the deep dives land across the conference.
— Originally published at together.ai
