Benchmarking inference at scale: coding agents | AI Deep Signal