Medical language models (MLMs) have become pivotal in advancing medical natural language processing. However, prior models that rely on pre-training or supervised fine-tuning often exhibit low data efficiency and limited practicality in real-world clinical applications. While OpenAI’s o1 highlights test-time scaling in mathematics, attempts to replicate this approach in medicine typically distill responses from GPT-series models to open-source models, focusing primarily on multiple-choice tasks. This strategy, though straightforward, neglects critical concerns such as data privacy and realistic deployment in clinical settings. In this work, we present MedS3, a deployable, small-scale medical reasoning system designed for long-chain reasoning in clinical tasks using a self-evolution paradigm. Starting with a seed dataset of around 8,000 instances spanning five domains and 16 datasets, we prompt a base policy model to perform Monte Carlo Tree Search (MCTS) to construct rule-verifiable reasoning chains. Each reasoning step is assigned an evolution rollout value, allowing verified trajectories to train the policy model and the process reward model (PRM). During inference, the policy model generates multiple responses, and the reward model selects the final one using a newly proposed PRM-guided Vote-Sum (P-VS) strategy. Experiments on eleven evaluation datasets demonstrate that MedS3 outperforms not only the prior strongest medical model by 6.59 points, but also 32B-level general reasoning models by 8.71 points.
MedS3 uses a Monte Carlo Tree Search pipeline to self-generate step-by-step reasoning paths for each question in the seed dataset (a). During this process, MedS3 runs result simulations to obtain a rollout value for each node (b). Once a child's rollout value is obtained, MedS3 back-propagates it so that precise value estimates from deeper layers transfer back to shallower nodes (c). After gathering all correct and incorrect terminal nodes, we use supervised fine-tuning on the correct reasoning trajectories to optimize the policy model \(\pi\), and a step-wise discriminative loss to obtain a process reward model \(V_{\theta}\) (d).
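The following is a minimal sketch of the rollout-value back-propagation in step (c), assuming a simple tree of reasoning-step nodes; the `Node` structure and `backpropagate` helper are illustrative and not taken from the MedS3 codebase.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One reasoning step in the search tree."""
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    value_sum: float = 0.0   # accumulated rollout values
    visits: int = 0

    @property
    def value(self) -> float:
        # Average rollout value used as the node's value estimate.
        return self.value_sum / self.visits if self.visits else 0.0


def backpropagate(leaf: Node, rollout_value: float) -> None:
    """Propagate a simulated rollout value from a deep node back to the root,
    so shallow nodes inherit value estimates obtained from deeper simulations."""
    node: Optional[Node] = leaf
    while node is not None:
        node.visits += 1
        node.value_sum += rollout_value
        node = node.parent
```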
The PRM-guided Vote-Sum (P-VS) strategy is a novel approach to selecting the final response from the policy model: it considers both the frequency of semantically equivalent outputs and the confidence scores predicted by the PRM. The PRM scores each reasoning path generated by the policy model according to the likelihood that the path is correct; P-VS then selects the response whose answer group has the highest total score. A small sketch of this selection is given below.
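This is a hedged sketch of P-VS selection, assuming each candidate response carries an extracted final answer and a PRM confidence score; `normalize_answer` is a hypothetical helper for grouping semantically equivalent answers (e.g., the chosen option letter or a normalized answer string).

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def normalize_answer(answer: str) -> str:
    # Placeholder canonicalization: lowercase and strip whitespace.
    return answer.strip().lower()


def p_vs_select(candidates: List[Tuple[str, str, float]]) -> str:
    """candidates: (response_text, extracted_answer, prm_score) triples.

    Combines answer frequency (vote) with PRM confidence (sum): each answer
    group accumulates the PRM scores of its responses, and the response from
    the group with the largest summed score is returned.
    """
    group_score: Dict[str, float] = defaultdict(float)
    best_in_group: Dict[str, Tuple[str, float]] = {}
    for text, answer, score in candidates:
        key = normalize_answer(answer)
        group_score[key] += score
        # Keep the highest-scored response as the representative of its group.
        if key not in best_in_group or score > best_in_group[key][1]:
            best_in_group[key] = (text, score)
    best_key = max(group_score, key=group_score.get)
    return best_in_group[best_key][0]
```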
With only 500 training instances per task, we collect 7,465 data instances from 16 datasets across 5 domains, covering diverse medical applications and demonstrating a data-efficient approach. The data statistics are shown in the table below.
@article{jiang2025meds,
title={MedS$^3$: Towards Medical Small Language Models with Self-Evolved Slow Thinking},
author={Jiang, Shuyang and Liao, Yusheng and Chen, Zhe and Zhang, Ya and Wang, Yanfeng and Wang, Yu},
journal={arXiv preprint arXiv:2501.12051},
year={2025}
}