Together AI at ICML 2026: frontier research across the full stack

6/30/2026 · ~10 min read

Quick Answer

Together AI presents nine papers at ICML 2026, showcasing advancements from frontier agents like DSGym and ThunderAgent to kernel optimizations, emphasizing a holistic approach across the AI stack.

Quick Take

Noteworthy results include up to 3.6x higher agent throughput and state-of-the-art discoveries across four fields using open models.

Key Points

DSGym standardizes evaluation for 1,000+ data science tasks across 10+ domains.
ThunderAgent achieves up to 3.6x higher agent throughput with minimal code changes.
TTT-Discover beats human benchmarks in multiple fields using an open 120B model.
Aurora's adaptive speculative decoding adds a further 1.25x speedup over a strong static speculator as traffic shifts.
ParallelKernelBench evaluates 87 multi-GPU kernel tasks for performance optimization.

Together Research

Nine papers is a lot to take in as a list. The better way to read them is by where they sit in the stack. Frontier AI is not built at a single layer. It is the product of research that runs from the agent down to the GPU kernel, and a gain at any one layer is wasted if the layers around it cannot keep up.

This is at the core of how we work. From frontier agents at the top to kernels at the bottom, our research touches each one, and each layer feeds the next. The research becomes part of the Together platform, and the production workloads running on that platform point us to the next research problem. Aurora, our ICML paper on adaptive speculative decoding, is a clear example: the same line of work ships today as our ATLAS speculator in production.

Here is this year's work, layer by layer, top down.

01 Frontier agents

Agents that do real work, measured on tasks you can't fake your way through.

DSGym: 1,000+ tasks across 10+ domains
ThunderAgent: up to 3.6× faster agent inference
TTT-Discover: beats best human, open model

02 Model shaping

Turning a base model into a reasoner, even where there's no answer key to check.

RARO: 25% win rate, no verifier
V1: up to 10% more correct answers

03 Algorithmic optimizations

Cutting the cost of every token: speculative decoding that adapts to live traffic.

Aurora: 1.25× as traffic shifts

04 Systems optimizations

Fitting more on the same GPUs: longer context, bigger batches, faster MoE decode.

Untied Ulysses: 5M-token context on one node
OEA: up to 39% faster MoE decode

05 Kernels

The bedrock of the stack: where raw hardware becomes usable speed.

ParallelKernelBench: 87 multi-GPU kernel tasks

Frontier agents

The top of the stack: building agents that do real work, and measuring them honestly.

Paper 1

DSGym: A Holistic Framework for Evaluating and Training Data Science Agents

1,000+ tasks

10+ domains, one API

More than 1,000 data-science tasks across 10+ domains, unified behind a single evaluation and training API. Data-science agents have been hard to measure fairly: every benchmark has its own interface, and many of their tasks can be solved without ever opening the data. DSGym standardizes the measurement and closes that loophole.

It puts diverse evaluation suites behind one API with shared abstractions for datasets, agents, and metrics, and runs each task in a self-contained execution environment where the agent has to work with the data rather than recall an answer. On top of the refined existing suites, it adds 90 expert bioinformatics tasks grounded in academic literature and 92 end-to-end Kaggle-style modeling competitions. The same environment then runs in reverse as a training engine: trajectory generation and synthetic query pipelines produce execution-verified data, which we used to train a 4B model into a state-of-the-art open-source data-science agent, with no human labeling. Evaluate honestly, synthesize from the same harness, fine-tune, re-evaluate, all in one framework.

Authors: Fan Nie, Junlin Wang, Harper Hua, Federico Bianchi, Yongchan Kwon, Zhenting Qi, Owen Queen, Shang Zhu, James Zou. With collaborators at Stanford, Duke, and Harvard.

Paper: arXiv:2601.16344

Paper 2

ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System

up to 3.6×

agent throughput

1.5 to 3.6x higher serving throughput for agent workloads, with three lines of code to adopt. Parallel agent workloads stop collapsing under load, and the fix is not a faster model.

The problem is that the inference engine has no idea it is running an agent. It treats each step of a multi-turn workflow as an isolated request, inflating latency by up to 7.14x under load. ThunderAgent makes the workflow itself a first-class object the scheduler can reason about end to end. Alongside the throughput gain, it delivers 1.8 to 3.9x faster RL rollouts and up to 4.2x disk savings over state-of-the-art systems.

Authors: Hao Kang, Ziyang Li, Xinyu Yang, Weili Xu, Yinfang Chen, Junxiong Wang, Beidi Chen, Tushar Krishna, Chenfeng Xu, Simran Arora. With collaborators at Georgia Tech, CMU, and UIUC.

Paper: arXiv:2602.13692

Paper 3

Learning to Discover at Test Time (TTT-Discover)

Beats best human

open model, ~$500

State-of-the-art discoveries across four fields (mathematics, GPU kernels, competitive algorithms, and biology), all with open models. Every prior result at this level relied on closed frontier models behind an API you cannot inspect or run. TTT-Discover reaches it with an open 120B model and one unchanged method, for a few hundred dollars per problem.

The usual recipe for AI discovery is search: prompt a frozen model thousands of times and keep the best sample. TTT-Discover instead runs reinforcement learning at test time on the single problem in front of it, so every attempt becomes training data for the next one and the model improves as it works. Given the same sampling budget, plain best-of-N never catches up. The same setup, unchanged, set a tighter bound on a 60-year-old Erdős problem in mathematics (past the previous AI record, which used closed models), discovered a GPU kernel faster than the best prior submission on the GPUMode leaderboard, produced a first-place-level finish on a competitive-programming contest, and set a new high on a single-cell denoising benchmark in biology, all on the open gpt-oss-120b. The code and the record-setting kernels are public.

Authors: Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, Yu Sun. With collaborators at Stanford, NVIDIA, UC San Diego, and the Astera Institute.

Paper: arXiv:2601.16175

Model shaping

How you train and shape a model: reasoning, fine-tuning, and reinforcement learning.

Paper 4

Escaping the Verifier: Learning to Reason via Demonstrations (RARO)

25% vs 5.9%

vs experts · no verifier

A 25% win rate against expert responses, versus 5.9% for supervised fine-tuning, with no verifier at all. You can get RL-grade reasoning on tasks that have no checker, like poetry writing or financial analysis, not just math and code.

RL-based reasoning normally assumes a verifier that can score correctness. RARO (Relativistic Adversarial Reasoning Optimization) replaces it with an adversarial game: one model acts both as a policy that produces expert-quality answers and as a relativistic critic, which learns to pick the better of two responses. Pairwise comparison with a tie option is key for training stability. On Countdown, RARO reaches 54.4% accuracy without a verifier, versus 57.7% for RL with a ground-truth verifier; by contrast, neither SFT nor iterative DPO exceeds 40.7%. The learned critic can double as a test-time reranker that lifts DeepMath from 57.5% to 68.4% at 7B.

Authors: Locke Cai, Max Ryabinin, Ivan Provilkov

Paper: arXiv:2511.21667

Paper 5

V1: Unifying Generation and Self-Verification for Parallel Reasoners

+10%

Pass@1 · same compute

Up to 10% more correct answers, pulled from generations you already paid for. The win is better answer selection, not more compute.

When you sample many answers with no oracle, scoring each one independently fails: the judge hands almost everything a 10 out of 10 and loses the ability to discriminate. V1 reframes selection as comparison through a near-linear Swiss-tournament verifier. The training recipe, V1-PairRL, teaches a single model to generate and to pairwise-verify its own outputs at the same time, which lifts base accuracy even with no test-time verification at all.

Authors: Harman Singh, Xiuyu Li, Kusha Sareen, Monishwaran Maheswaran, Sijun Tan, Xiaoxia Wu, Junxiong Wang, Alpay Ariyak, Qingyang Wu, Samir Khaki, Rishabh Tiwari, Long Lian, Yucheng Lu, Boyi Li, Alane Suhr, Ben Athiwaratkun, Kurt Keutzer. With collaborators at UC Berkeley, NVIDIA, and Mila.

Paper: arXiv:2603.04304

Algorithmic optimizations

Making the math of inference cheaper: speculative decoding, quantization, RL inference.

Paper 6

When RL Meets Adaptive Speculative Training: A Unified Training-Serving System (Aurora)

1.25×

as traffic shifts

A 1.5x day-0 speedup on brand-new frontier models like MiniMax M2.1 229B and Qwen3-Coder-Next 80B, plus an additional 1.25x over a strong static speculator as traffic shifts. Speculative decoding that is fast on day 0 and keeps getting faster the longer it runs.

Most deployments train the speculator offline and freeze it, so it is slow to deploy and goes stale as traffic and target models change. Aurora reframes online speculator learning as an asynchronous reinforcement-learning problem running in production: accepted and rejected tokens are the reward signal, the training server updates the speculator continuously, and new weights hot-swap into the server with zero downtime.
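Why a stale speculator costs throughput can be seen from the standard analysis of speculative decoding. This is not Aurora's algorithm. The sketch assumes each drafted token is accepted independently with the same probability, and it ignores the cost of running the speculator itself. Under those assumptions, a draft of length k yields (1 - a^(k+1)) / (1 - a) tokens per target-model forward pass at acceptance rate a. The function names and example rates are illustrative.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_length: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    Assumes every drafted token is accepted independently with probability
    acceptance_rate; the target model always contributes one token itself.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_length < 0:
        raise ValueError("draft_length must be non-negative")
    if acceptance_rate == 1.0:
        return float(draft_length + 1)
    return (1.0 - acceptance_rate ** (draft_length + 1)) / (1.0 - acceptance_rate)


def relative_gain(stale_rate: float, adapted_rate: float, draft_length: int) -> float:
    """Ratio of tokens per step after adaptation versus a stale speculator."""
    return expected_tokens_per_step(adapted_rate, draft_length) / expected_tokens_per_step(
        stale_rate, draft_length
    )


if __name__ == "__main__":
    # Illustrative numbers only: acceptance drifts down as traffic shifts,
    # and online training brings it back up.
    for stale, adapted in [(0.5, 0.6), (0.6, 0.7), (0.7, 0.8)]:
        print(stale, adapted, round(relative_gain(stale, adapted, draft_length=4), 3))
```

The ratio grows faster than the acceptance rate itself because accepted tokens compound along the draft, which is why keeping the speculator matched to live traffic pays off.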

Authors: Junxiong Wang, Fengxiang Bie, Jisen Li, Zhongzhu Zhou, Zelei Shao, Yubo Wang, Yinghui Liu, Qingyang Wu, Avner May, Sri Yanamandra, Yineng Zhang, Ce Zhang, Tri Dao, Percy Liang, Ben Athiwaratkun, Shuaiwen Leon Song, Chenfeng Xu, Xiaoxia Wu.

Paper: arXiv:2602.06932

Systems optimizations

The systems that train and serve models: disaggregation, batching, scheduling, context parallelism.

Paper 7

Untied Ulysses: Memory-Efficient Context Parallelism via Headwise Chunking

5M tokens

one node · ~87.5% mem

5M-token context training on a single 8xH100 node, with up to 87.5% less attention memory. You can train at a very long context without a bigger cluster.

Activation memory inside the attention layer caps long-context training, and adding GPUs does not move that ceiling. UPipe, the method behind Untied Ulysses, processes a few attention heads at a time and reuses the same buffers across stages, cutting peak attention memory by up to 87.5% on a 32B Transformer. The result is 5M tokens on one node, about 25% beyond prior methods, and 8M tokens on two nodes, while matching the throughput of those methods. It is a drop-in replacement for DeepSpeed-Ulysses on the same FlashAttention-3 kernels, and the code is open.
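The reason headwise chunking is safe is that attention heads are independent: each head's output depends only on its own queries, keys, and values. The toy sketch below, in pure Python, shows that processing heads a chunk at a time gives the same result as processing them all at once, while the number of heads whose intermediates are live at any moment drops to the chunk size. It is not the UPipe implementation and models no communication or GPU buffers. All function names, the peak-live-heads counter, and the 8-head example are assumptions for illustration.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]


def attention_head(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Softmax attention for one head: softmax(q k^T / sqrt(d)) v."""
    d = len(q[0])
    out = []
    for qi in q:
        logits = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(x - m) for x in logits]
        z = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / z
                    for c in range(len(v[0]))])
    return out


def attention_all_heads(qs, ks, vs) -> List[Matrix]:
    """Baseline: every head is processed in one pass."""
    return [attention_head(q, k, v) for q, k, v in zip(qs, ks, vs)]


def attention_headwise_chunked(qs, ks, vs, heads_per_chunk: int) -> Tuple[List[Matrix], int]:
    """Process heads_per_chunk heads at a time.

    Returns the outputs plus the largest number of heads handled in one pass,
    a stand-in for the peak size of the per-pass working buffers.
    """
    outputs: List[Matrix] = []
    peak_live_heads = 0
    for start in range(0, len(qs), heads_per_chunk):
        chunk = slice(start, start + heads_per_chunk)
        chunk_out = attention_all_heads(qs[chunk], ks[chunk], vs[chunk])
        peak_live_heads = max(peak_live_heads, len(chunk_out))
        outputs.extend(chunk_out)
    return outputs, peak_live_heads


def make_heads(num_heads: int, seq_len: int, dim: int, seed: int) -> List[Matrix]:
    """Random per-head matrices for the demo."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(seq_len)]
            for _ in range(num_heads)]


if __name__ == "__main__":
    qs, ks, vs = (make_heads(8, 4, 3, seed) for seed in (0, 1, 2))
    full = attention_all_heads(qs, ks, vs)
    chunked, peak = attention_headwise_chunked(qs, ks, vs, heads_per_chunk=1)
    print(chunked == full, peak, 1 - peak / len(qs))
```

In this toy, one head per pass out of eight means the per-pass working set is one eighth of the baseline. That ratio is chosen only to illustrate the arithmetic and is not a claim about how the paper's figure is derived.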

Authors: Ravi Ghadia, Maksim Abraham, Sergei Vorobyov, Max Ryabinin.

Paper: arXiv:2602.21196

Paper 8

Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining (OEA)

up to 39%

faster MoE decode

Up to 39% lower MoE decode latency, with no retraining and no architecture change. You reclaim the sparsity that batching quietly destroys, for free.

A Mixture-of-Experts model is supposed to be cheap because each token touches only a few experts, but the moment you batch, that sparsity collapses: at batch size 16, a model where each token wants eight experts ends up loading around 82 of them. OEA's batch-aware routing recovers the sparsity at inference time in two phases, the second of which lets tokens share experts the batch has already loaded, buying back quality at zero latency cost. Accuracy holds flat across AIME24, GPQA, LiveCodeBench, and MATH 500, and there is a single hyperparameter to tune.
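The "around 82" figure can be sanity-checked with a back-of-the-envelope model. The sketch below assumes 128 experts per layer and that each token picks its eight experts uniformly at random, independently of other tokens. Neither assumption is stated in the text, and real routers are not uniform, so treat this as an estimate rather than the paper's measurement. An expert stays unloaded only if every token in the batch skips it, which gives an expected count of E · (1 - (1 - k/E)^B).

```python
def expected_distinct_experts(num_experts: int, top_k: int, batch_size: int) -> float:
    """Expected number of distinct experts a batch touches in one MoE layer.

    Assumes each token independently selects top_k of num_experts uniformly
    at random (an idealization of real, learned routing).
    """
    p_skipped_by_one_token = 1.0 - top_k / num_experts
    p_skipped_by_whole_batch = p_skipped_by_one_token ** batch_size
    return num_experts * (1.0 - p_skipped_by_whole_batch)


if __name__ == "__main__":
    # 128 experts is an assumed configuration; top-8 and batch 16 are from the text.
    for batch_size in (1, 4, 16, 64):
        loaded = expected_distinct_experts(128, 8, batch_size)
        print(f"batch {batch_size:>2}: ~{loaded:.1f} experts loaded")
```

Under these assumptions a single token loads 8 experts while a batch of 16 loads about 82, so decode at that batch size reads roughly ten times the expert weights that the per-token sparsity suggests.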

Authors: Costin-Andrei Oncescu, Qingyang Wu, Wai Tong Chung, Robert Wu, Bryan Gopal, Junxiong Wang, Tri Dao, Ben Athiwaratkun. With collaborators at Harvard and Princeton.

Paper: arXiv:2511.02237

Kernels

The bedrock of the stack: where raw hardware becomes usable speed. Built by the teams behind FlashAttention, ThunderKittens, and Megakernel.

Paper 9

ParallelKernelBench: Benchmarking LLMs on Multi-GPU Kernel Generation

87 workloads

multi-GPU kernel bench

We built the first benchmark for multi-GPU kernel generation. It has 87 real distributed workloads, and today's best frontier model solves fewer than a third.

ParallelKernelBench takes 87 problems from production codebases like Megatron-LM, DeepSpeed, NeMo-RL, and TensorRT-LLM, spanning 12 forms of parallelism. Each one asks a model to replace a PyTorch and NCCL reference with a CUDA kernel that uses symmetric memory for direct NVLink communication. Zero-shot, the best model clears 28 of 87. In an agentic loop with compile, correctness, and performance feedback, Gemini 3 Pro climbs from 24 to 35 correct and beats the PyTorch baseline on 26. A handful run faster than anything publicly available. Single-GPU code generation is getting solved. Multi-GPU is where it breaks, because communication, not compute, sets the ceiling.

Authors: Willy Chan, Nathan Paek, Simon Guo, Simran Arora, Daniel Y. Fu.

Paper: https://www.alphaxiv.org/abs/2606.parallel-kernel-bench

By the numbers

DSGym: 1,000+ tasks · 10+ domains, one API
ThunderAgent: up to 3.6× agent throughput
TTT-Discover: beats best human · open model, ~$500
RARO: 25% vs 5.9% win rate vs experts · no verifier
V1: +10% Pass@1 · same compute
Aurora: 1.25× faster as traffic shifts
Untied Ulysses: 5M tokens · one node · up to 87.5% less attention memory
OEA: up to 39% faster MoE decode
ParallelKernelBench: 87 workloads · multi-GPU kernel bench

Where to find us in Seoul

All nine papers are at ICML 2026, July 6 to 11, in Seoul. Stop by booth B714 to dive deeper into the work.

Together AI @ ICML 2026

COEX, Seoul · July 6 to 11 · Booth B714 · all week

Paper | Date & time (Seoul, KST) | Session

Frontier agents
DSGym | Thu 7/9 · 2:30 PM – 4:15 PM | #66567
ThunderAgent | Tue 7/7 · 10:30 AM – 12:15 PM | #62040
TTT-Discover | Tue 7/7 · 5:00 PM – 6:45 PM | #62199

Model shaping
RARO | Thu 7/9 · 10:30 AM – 12:15 PM | #61507
V1 | Thu 7/9 · 5:00 PM – 6:45 PM | #64825

Algorithmic optimizations
Aurora | Thu 7/9 · 10:30 AM – 12:15 PM | #66675

Systems optimizations
Untied Ulysses | Thu 7/9 · 5:00 PM – 6:45 PM | #66102
OEA | Wed 7/8 · 10:30 AM – 12:15 PM | #62958

Kernels
ParallelKernelBench | Fri 7/10 · 8:00 AM – 5:00 PM | DL4C · Hall B2

We will also be at our booth all week. Stop by to:

Talk through any of the nine papers with the people who wrote them
See how this research shows up in the Together platform, from fine-tuning to production inference
Meet the team and hear what we are working on next

Come build this with us

We are hiring researchers and research engineers who want to work across the whole stack, not just one layer of it. See open roles at together.ai/careers, or come find us at the booth.

For everyone else: request a meeting in Seoul, browse the full research blog, and follow along on X at @togethercompute as the deep dives land across the conference.

— Originally published at together.ai
