At Databricks, we use reinforcement learning (RL) to develop reasoning models for problems that our customers face as well as for our products, such as the Databricks Assistant and AI/BI Genie. These tasks include generating code, analyzing data, integrating organizational knowledge, domain-specific research, and information extraction (IE) from documents. Tasks like coding or information extraction often have verifiable rewards: correctness can be checked directly (e.g., passing tests, matching labels). This allows for reinforcement learning without a learned reward model, known as RLVR (reinforcement learning with verifiable rewards). In other domains, a custom reward model may be required, which Databricks also supports. In this post, we focus on the RLVR setting.
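
To make the idea concrete, here is a minimal sketch of what a verifiable reward can look like for text-to-SQL. BIRD's databases are SQLite and its execution-accuracy metric compares the rows returned by the two queries; the function name and the exact comparison below are our illustrative choices, not the official scorer or any Databricks API.

```python
import sqlite3
from collections import Counter

def execution_match_reward(predicted_sql: str, gold_sql: str, db_path: str) -> float:
    """Binary verifiable reward: 1.0 if the predicted query returns the same
    multiset of rows as the gold query on the target database, else 0.0.
    No learned reward model is involved; correctness is checked directly."""
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(predicted_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return 0.0  # invalid or failing SQL earns no reward
    finally:
        conn.close()
    return float(Counter(pred_rows) == Counter(gold_rows))
```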

To illustrate the power of RLVR, we applied our training stack to a popular academic benchmark in data science called BIRD. This benchmark studies the task of transforming a natural language query into SQL code that runs on a database. This is an important problem for Databricks users, enabling non-SQL experts to talk to their data. It is also a challenging task where even the best proprietary LLMs do not work well out of the box. While BIRD captures neither the real-world complexity of this task nor the full breadth of real products like Databricks AI/BI Genie (Figure 1), its popularity allows us to measure the efficacy of RLVR for data science on a well-understood benchmark.
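
For concreteness, here is a toy example of the kind of input/output pair the task involves. The schema, question, and SQL are invented for illustration; real BIRD examples additionally come with full database contents and external-knowledge hints, and are considerably harder.

```python
# A toy text-to-SQL pair in the spirit of a BIRD example (invented for illustration).
schema = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
"""
question = "Which region generated the highest total order amount?"
gold_sql = """
SELECT c.region
FROM customers AS c JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.region
ORDER BY SUM(o.amount) DESC
LIMIT 1;
"""
```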

We focus on improving a base SQL coding model using RLVR, isolating these gains from improvements driven by agentic designs. Progress is measured on the single-model, single-generation track of the BIRD leaderboard (i.e., no self-consistency), which evaluates on a private test set.

We set a new state-of-the-art test accuracy of 73.5% on this benchmark. We did so using our standard RLVR stack and training only on the BIRD training set. The previous best score on this track was 71.8%[1], achieved by augmenting the BIRD training set with additional data and using a proprietary LLM (GPT-4o). Our score is also 8.7 percentage points better than the original base model, and it is a substantial improvement over proprietary LLMs (see Figure 2). This result showcases the simplicity and generality of RLVR: we reached this score with off-the-shelf data and the standard RL components we are rolling out in Agent Bricks, and we did so on our first submission to BIRD. RLVR is a strong baseline that AI developers should consider whenever enough training data is available.

We built our submission based on the BIRD dev set. We found that Qwen 2.5 32B Coder Instruct was the best starting point. We fine-tuned this model using both Databricks TAO, an offline RL method, and our RLVR stack. This approach, alongside careful prompt and model selection, was sufficient to get us to the top of the BIRD benchmark. This result is a public demonstration of the same techniques we are using to improve popular Databricks products like AI/BI Genie and Assistant, and to help our customers build agents using Agent Bricks.
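
Databricks has not published the internals of TAO or this RLVR stack, so the following is only a rough sketch of the general pattern many RLVR recipes follow (e.g., GRPO-style group-relative advantages): sample several completions per prompt, score each with the verifier, and reinforce completions that beat their group's average. Every name below is a hypothetical stand-in, not Databricks' API.

```python
import random

def sample_candidates(question: str, k: int = 8) -> list[str]:
    """Stand-in for sampling k SQL completions from the policy LLM."""
    pool = ["SELECT 1",
            "SELECT region FROM customers",
            "SELECT region FROM customers GROUP BY region"]
    return [random.choice(pool) for _ in range(k)]

def verifiable_reward(sql: str) -> float:
    """Stand-in for the execution-match check sketched earlier."""
    return 1.0 if "GROUP BY" in sql else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Center each completion's reward on the group mean, so correct SQL
    is pushed up and incorrect SQL is pushed down."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

candidates = sample_candidates("Which region generated the highest total order amount?")
advantages = group_relative_advantages([verifiable_reward(c) for c in candidates])
# A policy-gradient step would then upweight the tokens of high-advantage
# completions; that update (and TAO's offline variant) is omitted here.
print(list(zip(candidates, advantages)))
```
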
Our results highlight the power of RLVR and the efficacy of our training stack. Databricks customers have also reported great results using our stack in their reasoning domains. We think this recipe is powerful, composable, and broadly applicable to a wide range of tasks. If you would like to preview RLVR on Databricks, contact us here.

[1] See Table 1 in https://arxiv.org/pdf/2505.20315
Authors: Alnur Ali, Ashutosh Baheti, Jonathan Chang, Ta-Chung Chi, Brandon Cui, Andrew Drozdov, Jonathan Frankle, Abhay Gupta, Pallavi Koppol, Sean Kulinski, Jonathan Li, Dipendra Kumar Misra, Jose Javier Gonzalez Ortiz, Krista Opsahl-Ong