Medical language models (MLMs) have become pivotal in advancing medical natural language processing. However, prior models that rely on pre-training or supervised fine-tuning often exhibit low data efficiency and limited practicality in real-world clinical applications. While OpenAI’s o1 highlights test-time scaling in mathematics, attempts to replicate this approach in medicine typically distill responses from GPT-series models to open-source models, focusing primarily on multiple-choice tasks. This strategy, though straightforward, neglects critical concerns such as data privacy and realistic deployment in clinical settings. In this work, we present MedS3, a deployable, small-scale medical reasoning system designed for long-chain reasoning in clinical tasks using a self-evolution paradigm. Starting with a seed dataset of around 8,000 instances spanning five domains and 16 datasets, we prompt a base policy model to perform Monte Carlo Tree Search (MCTS) to construct rule-verifiable reasoning chains. Each reasoning step is assigned an evolution rollout value, allowing verified trajectories to train the policy model and the process reward model (PRM). During inference, the policy model generates multiple responses, and the reward model selects the final one using a newly proposed PRM-guided Vote-Sum (P-VS) strategy. Experiments on eleven evaluation datasets demonstrate that MedS3 outperforms not only the prior strongest medical model by 6.59 points, but also 32B-level general reasoning models by 8.71 points.
MedS3 uses a Monte Carlo Tree Search pipeline to self-generate step-by-step reasoning paths for each question in the seed dataset (a). During this process, MedS3 runs result simulations to obtain a rollout value for each node (b). Once a child's rollout value is obtained, MedS3 back-propagates it so that precise value estimates from deeper layers transfer back to shallower nodes (c). After gathering all correct and incorrect terminal nodes, we use supervised fine-tuning on the correct reasoning trajectories to optimize the policy model \(\pi\), and a step-wise discriminative loss to obtain a process reward model \(V_{\theta}\) (d).
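The following is a minimal sketch of the rollout-value back-propagation in step (c), assuming a simple tree of reasoning-step nodes; the `Node` structure and `backpropagate` helper are illustrative and not taken from the MedS3 codebase.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One reasoning step in the search tree."""
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    value_sum: float = 0.0   # accumulated rollout values
    visits: int = 0

    @property
    def value(self) -> float:
        # Average rollout value used as the node's value estimate.
        return self.value_sum / self.visits if self.visits else 0.0


def backpropagate(leaf: Node, rollout_value: float) -> None:
    """Propagate a simulated rollout value from a deep node back to the root,
    so shallow nodes inherit value estimates obtained from deeper simulations."""
    node: Optional[Node] = leaf
    while node is not None:
        node.visits += 1
        node.value_sum += rollout_value
        node = node.parent
```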
The PRM-guided Vote-Sum (P-VS) strategy is a novel approach to selecting the final response from the policy model: it considers both the frequency of semantically equivalent outputs and the confidence scores predicted by the PRM. The PRM scores each reasoning path generated by the policy model according to the likelihood that the path is correct; P-VS then selects the response whose answer group has the highest total score. A small sketch of this selection is given below.
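This is a hedged sketch of P-VS selection, assuming each candidate response carries an extracted final answer and a PRM confidence score; `normalize_answer` is a hypothetical helper for grouping semantically equivalent answers (e.g., the chosen option letter or a normalized answer string).

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def normalize_answer(answer: str) -> str:
    # Placeholder canonicalization: lowercase and strip whitespace.
    return answer.strip().lower()


def p_vs_select(candidates: List[Tuple[str, str, float]]) -> str:
    """candidates: (response_text, extracted_answer, prm_score) triples.

    Combines answer frequency (vote) with PRM confidence (sum): each answer
    group accumulates the PRM scores of its responses, and the response from
    the group with the largest summed score is returned.
    """
    group_score: Dict[str, float] = defaultdict(float)
    best_in_group: Dict[str, Tuple[str, float]] = {}
    for text, answer, score in candidates:
        key = normalize_answer(answer)
        group_score[key] += score
        # Keep the highest-scored response as the representative of its group.
        if key not in best_in_group or score > best_in_group[key][1]:
            best_in_group[key] = (text, score)
    best_key = max(group_score, key=group_score.get)
    return best_in_group[best_key][0]
```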
With only 500 training instances per task, we collect 7,465 data instances from 16 datasets across 5 domains, covering diverse medical applications and demonstrating a data-efficient approach. The data statistics are shown in the table below.
@article{jiang2025meds,
title={MedS$^3$: Towards Medical Small Language Models with Self-Evolved Slow Thinking},
author={Jiang, Shuyang and Liao, Yusheng and Chen, Zhe and Zhang, Ya and Wang, Yanfeng and Wang, Yu},
journal={arXiv preprint arXiv:2501.12051},
year={2025}
}