Benchmarking inference at scale: coding agents

5/19/2026


Quick Answer

Together AI's new benchmark for coding agents reports that its Together Inference Engine achieves 31% higher throughput in tokens per second (TPS) than TensorRT-LLM while keeping time to first token (TTFT) under 1 second at 625 TPM per GPU.

Quick Take

Throughput and latency at this level matter for production environments that must serve many concurrent requests with long contexts.

Key Points

Together Inference Engine outperforms TensorRT-LLM with 31% higher TPS.
TTFT remains under 1 second, crucial for user experience.
Benchmark simulates high concurrency with long input requests.
Performance gains achieved through full-stack profiling and optimization.
EAGLE speculative decoding used for improved efficiency.
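
The two metrics named above can be computed from per-request timing traces. The following is a minimal sketch, not Together AI's benchmark harness: it assumes TPS means output tokens per second aggregated across concurrent requests and TTFT is the delay until the first output token. The names `RequestTrace`, `ttft`, `aggregate_tps`, and `relative_gain` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timing for one streamed request (hypothetical structure, times in seconds)."""
    sent_at: float             # when the request was issued
    token_times: List[float]   # arrival time of each output token


def ttft(trace: RequestTrace) -> float:
    """Time to first token: delay between sending and the first output token."""
    return trace.token_times[0] - trace.sent_at


def aggregate_tps(traces: List[RequestTrace]) -> float:
    """Output tokens per second across all requests over the wall-clock window."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    total_tokens = sum(len(t.token_times) for t in traces)
    return total_tokens / (end - start)


def relative_gain(candidate: float, baseline: float) -> float:
    """Fractional improvement over a baseline (0.31 means 31% higher)."""
    return candidate / baseline - 1.0
```

Measuring throughput over the shared wall-clock window, rather than averaging per-request rates, reflects what a saturated server delivers under concurrency, which appears to be the regime this benchmark targets.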

Source Excerpt

Real-world inference benchmarks for coding agents: 31% more TPS than TensorRT-LLM, 2× better TTFT at saturation, and 76% lower cost than Claude Opus 4.6.

Read the full article on together.ai


