MedS3: Towards Medical Small Language Models with Self-Evolved Slow Thinking

1Shanghai Jiao Tong University
2Fudan University
3Shanghai Artificial Intelligence Laboratory


Abstract

Medical language models (MLMs) have become pivotal in advancing medical natural language processing. However, prior models that rely on pre-training or supervised fine-tuning often exhibit low data efficiency and limited practicality in real-world clinical applications. While OpenAI's o1 highlights test-time scaling in mathematics, attempts to replicate this approach in medicine typically distill responses from GPT-series models into open-source models, focusing primarily on multiple-choice tasks. This strategy, though straightforward, neglects critical concerns such as data privacy and realistic deployment in clinical settings. In this work, we present MedS3, a deployable, small-scale medical reasoning system designed for long-chain reasoning in clinical tasks through a self-evolution paradigm. Starting with a seed dataset of around 8,000 instances spanning five domains and 16 datasets, we prompt a base policy model to perform Monte Carlo Tree Search (MCTS) to construct rule-verifiable reasoning chains. Each reasoning step is assigned an evolution rollout value, and verified trajectories are used to train both the policy model and the process reward model (PRM). During inference, the policy model generates multiple candidate responses, and the reward model selects the final one with a newly proposed PRM-guided Vote-Sum (P-VS) strategy. Experiments on eleven evaluation datasets demonstrate that MedS3 outperforms not only the prior strongest medical model by 6.59 points, but also 32B-scale general reasoning models by 8.71 points.

Method: Self-Evolution System

MedS3 uses a Monte Carlo Tree Search (MCTS) pipeline to self-generate step-by-step reasoning paths for each question in the seed dataset (a). During this process, MedS3 runs result simulations to obtain a rollout value for each node (b). After a child's rollout value is obtained, MedS3 back-propagates it so that precise value estimates from deeper layers transfer back to shallower nodes (c). After gathering all correct and incorrect terminal nodes, we optimize the policy model \(\pi\) with supervised fine-tuning on the correct reasoning trajectories, and train a process reward model \(V_{\theta}\) with a step-wise discriminative loss (d).

Figure 1. Overview of the construction of the MedS3 framework.
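To make steps (b) and (c) concrete, here is a minimal sketch of the rollout-value bookkeeping, assuming a simple tree-node layout; the Node fields, rollout_value property, and backpropagate helper are illustrative names, not the released implementation. A finished rollout's rule-verified outcome is averaged into every ancestor on its path, which is how deep simulation results inform shallow nodes; step (d) can then use these per-step values as training targets for \(V_{\theta}\).

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Node:
        step_text: str                        # one reasoning step in the chain
        parent: Optional["Node"] = None
        children: List["Node"] = field(default_factory=list)
        visits: int = 0
        value_sum: float = 0.0                # accumulated rollout outcomes

        @property
        def rollout_value(self) -> float:
            # Mean outcome of all simulations passing through this step.
            return self.value_sum / self.visits if self.visits else 0.0

    def backpropagate(finish_node: Node, reward: float) -> None:
        # reward: 1.0 if the rollout's final answer is rule-verified correct, else 0.0.
        # Walking the parent links transfers the deep simulation result to shallow nodes.
        node: Optional[Node] = finish_node
        while node is not None:
            node.visits += 1
            node.value_sum += reward
            node = node.parent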

PRM Use Strategy: PRM-guided Vote-Sum (P-VS)

The PRM-guided Vote-Sum (P-VS) strategy is a novel approach to selecting the final response from the policy model that considers both the frequency of semantically equivalent outputs and the confidence scores predicted by the PRM. The PRM evaluates each reasoning path generated by the policy model and assigns it a score reflecting the likelihood that the path is correct. P-VS then groups responses by their final answer, sums the PRM scores within each group, and selects the answer with the highest total.

Figure 2. An example of P-VS.
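Concretely, a minimal sketch of the selection rule follows, assuming each sampled response reduces to an (answer, PRM score) pair; normalize_answer is a hypothetical stand-in for the semantic-equivalence check.

    from collections import defaultdict

    def normalize_answer(ans: str) -> str:
        # Canonicalize an answer so semantically equivalent outputs match.
        # A real system would use a stronger equivalence check than lowercasing.
        return ans.strip().lower()

    def pvs_select(candidates: list) -> str:
        # candidates: list of (final_answer, prm_score) pairs, one per sampled response.
        # Returns the answer whose group has the highest summed PRM score.
        totals = defaultdict(float)
        for answer, prm_score in candidates:
            totals[normalize_answer(answer)] += prm_score
        return max(totals, key=totals.get)

    # Example: two samples answer "A" with scores 0.6 and 0.5; one answers "B" with 0.9.
    # Vote-Sum picks the "A" group (1.1 > 0.9), combining frequency with PRM confidence.
    print(pvs_select([("A", 0.6), ("a", 0.5), ("B", 0.9)]))  # -> "a"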

Data Statistics

Capping each task at 500 training instances, we collect 7,465 instances from 16 datasets across 5 domains, covering diverse medical applications and demonstrating the data efficiency of our approach. The data statistics are shown in the figure below.

Figure 3. Data statistics of MedS3.

Comprehensive Improvements on 11 Clinical Testbeds

Table 1. Experiment results on 11 medical datasets across four types of models. The best results are in bold and the second-best are underlined. MedS3 with PRM-guided Vote-Sum (P-VS) achieves superior performance on real-world clinical datasets.

Ablation on PRM vs ORM, Vote-Sum vs BoN

Table 2. Comparison of P-VS with other decoding methods under the same token budget. “Majority” means we use self-consistency (SC) to select the final response; “ORM with Last” means selecting the response with the maximum ORM value.
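For reference, the two baselines in Table 2 can be sketched in the same (answer, score) candidate format as the P-VS sketch above; these are illustrative reconstructions, not the authors' evaluation code, and for "ORM with Last" the score is assumed to be the outcome reward rather than a per-step PRM value.

    from collections import Counter

    def majority_select(candidates: list) -> str:
        # Self-consistency (SC): pick the most frequent answer, ignoring scores.
        votes = Counter(normalize_answer(ans) for ans, _ in candidates)
        return votes.most_common(1)[0][0]

    def orm_last_select(candidates: list) -> str:
        # "ORM with Last": keep the single response with the maximum ORM value.
        best_answer, _ = max(candidates, key=lambda pair: pair[1])
        return best_answer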

Analysis on Test-time Scaling

Figure 4. Test-time scaling of MedS3 on five clinical datasets. MedS3 achieves superior performance on real-world clinical datasets, exhibiting a nearly unbounded scaling trend.
Figure 5. Comparison of MedS3 with other methods for achieving slow thinking. The MCTS+PRM combination used in MedS3 performs best.
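A curve like the one in Figure 4 can be traced by re-applying P-VS at growing sampling budgets and recording accuracy at each budget; the sketch below assumes a hypothetical sample_candidates stub for the policy model plus PRM scoring, and reuses pvs_select and normalize_answer from the P-VS sketch above.

    def scaling_curve(eval_set: list, sample_candidates, budgets=(1, 2, 4, 8, 16, 32)) -> dict:
        # eval_set: list of (question, gold_answer) pairs.
        # sample_candidates(question, n): returns n (answer, prm_score) pairs.
        curve = {}
        for n in budgets:
            correct = 0
            for question, gold in eval_set:
                prediction = pvs_select(sample_candidates(question, n))
                correct += int(prediction == normalize_answer(gold))
            curve[n] = correct / len(eval_set)
        return curve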

BibTeX

@article{jiang2025meds,
  title   = {MedS$^3$: Towards Medical Small Language Models with Self-Evolved Slow Thinking},
  author  = {Jiang, Shuyang and Liao, Yusheng and Chen, Zhe and Zhang, Ya and Wang, Yanfeng and Wang, Yu},
  journal = {arXiv preprint arXiv:2501.12051},
  year    = {2025}
}