(Written after NeurIPS 2025)
This year in San Diego, NeurIPS drew over 20,000 attendees. Despite the massive turnout, there was surprising convergence around two buzzwords: RL and agents. This echoes Sam Altman's prediction from early this year: 2025 is the year of agents.
Everyone I met at company socials talked about RL at some point, and mentions of post-training were really about RL too. According to OpenAI employees, the terms "RL" and "reasoning" are now pretty much interchangeable. The broader term "agent" is even more ubiquitous. Nearly every company at the job fairs pitched "agents" when recruiting research scientists. Even mechanistic interpretability researchers want agents to do interpretability studies for them!
Topics that excite current RL researchers
RL interactions are still largely limited to settings with verifiable rewards.
Learning is still very sample-inefficient (which is why systems work like PipelineRL is so exciting).

We still haven't figured out "exploration": the agent's ability to efficiently and autonomously acquire information that helps it learn.
Topics that excite current agent researchers
The chaotic part underlying this RL convergence is that it's hard to tell which takeaways are actually correct. Quoting Yejin from her keynote: "conclusions from effortless RL ≠ effortful RL". She suggested that many RL papers are overclaiming due to poor experimental setups.

Screenshot from Yejin’s talk: “The Art of (Artificial) Reasoning”
Despite the buzz around agents, this paper released during NeurIPS week found that production agents are still built using simple scaffolds. "68% execute at most 10 steps before requiring human intervention, 70% rely on prompting off-the-shelf models instead of weight tuning, and 74% depend primarily on human evaluation."
I also heard people disagreeing about whether and how we should invest more resources to scale up RL training. The field hasn't really agreed on how to resolve sample inefficiency and is exploring different directions:
Some folks bet on language-native optimization (e.g., GEPA, Stanford's blog on Following the Text Gradient at Scale) instead of raw rollouts.
AllenAI uses paired preference learning on weak-vs-weaker data (building on this delta learning paper) to improve reasoning; a minimal sketch of the pairing idea follows right after this list. Similarly, Stanford's blog on Following the Text Gradient at Scale shows that providing rich text feedback for preference-pair data can outperform GRPO.
Some folks focus on using off-policy SFT samples to enable smarter generations, for example by curating data that teaches models to generate reasoning abstractions before going into finer-grained CoT (a toy example of such data appears below the figure).
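To make the weak-vs-weaker idea concrete, here is a minimal sketch of how such a pair could be scored with a DPO-style objective. This is my own illustration, not AllenAI's actual recipe; the log-prob numbers are placeholders.

```python
# Minimal sketch (my own, not AllenAI's recipe) of a "weak vs. weaker"
# preference pair scored with a DPO-style loss. The "chosen" response comes
# from a weak model and the "rejected" one from an even weaker model, so no
# gold labels or verifiable rewards are needed.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss over summed sequence log-probs (one pair per row)."""
    chosen_margin = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_margin = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Placeholder log-probs for one pair; in practice these come from scoring the
# two responses under the trainable policy and a frozen reference model.
loss = dpo_loss(
    policy_chosen_logp=torch.tensor([-42.0]),    # weak model's answer
    policy_rejected_logp=torch.tensor([-55.0]),  # even weaker model's answer
    ref_chosen_logp=torch.tensor([-44.0]),
    ref_rejected_logp=torch.tensor([-50.0]),
)
print(loss.item())  # ~0.40 for these placeholder numbers
```

The point, as I understand the delta learning framing, is that the relative gap between two weak responses still carries a useful training signal even when neither response is strong.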

Paper cited during Aviral's talk at the Foundations of Reasoning in Language Models workshop.
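And to make the abstraction-before-CoT idea more concrete, here is a toy illustration of what a single SFT target might look like; the tags and wording are hypothetical, not taken from the cited paper.

```python
# Hypothetical SFT example: the target commits to a high-level plan
# (the abstraction) before spelling out the detailed CoT.
sft_example = {
    "prompt": "How many positive divisors does 360 have?",
    "target": (
        "<abstraction>Factor into primes, then multiply (exponent + 1) "
        "across the prime factors.</abstraction>\n"
        "<cot>360 = 2^3 * 3^2 * 5, so the count is (3+1)(2+1)(1+1) = 24.</cot>\n"
        "Answer: 24"
    ),
}
print(sft_example["target"])
```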
One common sentiment is a renewed focus on pretraining. I had a brief meetup with somebody from OpenAI who had worked with Ilya on turning CoT from an emergent property into a learned ability, which led to the o1 breakthrough. He said he's now more interested in how models are pretrained and believes we might have to change the pretraining recipe itself (too bad he didn't share what exactly to change). This sentiment is echoed by speculation that Gemini's fast catch-up with GPT models comes from heavy investment in pretraining.

Met Sara and learned more about her startup Adaption. Really excited for it to take off one day!
Expos are crazy. Many cool demos and lots and lots of hiring for post-training researchers.