Abstract

Introduction

Visual summary of overthinking failure modes and efficiency-performance trade-offs.

Overview of Failure Modes and DeCS Gains

Method

From limitation analysis to DeCS: detector, decoupled reward, and curriculum scheduling.

1) NRP Detection

Detect the necessary reasoning prefix and separate useful reasoning from redundancy.
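
As a rough illustration of this step, the sketch below assumes the necessary reasoning prefix (NRP) ends at the first reasoning chunk that states the correct answer. Both the chunking and the `contains_answer` judge are hypothetical stand-ins; the page does not describe the actual detector.

```python
from typing import Callable, Sequence


def nrp_length(chunks: Sequence[str], contains_answer: Callable[[str], bool]) -> int:
    """Return how many leading chunks form the necessary reasoning prefix.

    Assumption: the NRP ends at the first chunk that states the correct answer.
    If no chunk does, the whole response is kept (nothing is marked redundant).
    """
    for i, chunk in enumerate(chunks):
        if contains_answer(chunk):
            return i + 1
    return len(chunks)


# Toy example: the answer "7" first appears in the second chunk,
# so the last two chunks would count as redundant.
chunks = ["Let x = 3.", "So 2x + 1 = 7.", "Wait, let me double-check.", "Yes, 7."]
k = nrp_length(chunks, lambda c: "7" in c)
```

In practice the judge would likely be a learned model rather than a substring check; the substring check is used here only to keep the example self-contained.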

2) Decoupled Reward

Protect essential tokens and consistently penalize redundant tokens after NRP.
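
One way to picture the decoupling is a per-token reward vector: tokens inside the NRP receive only the outcome reward, and every token after it receives a fixed negative reward. The function name, the binary outcome reward, and the `penalty` value below are illustrative assumptions, not the method's actual formulation.

```python
from typing import List


def decoupled_token_rewards(
    num_tokens: int, nrp_end: int, correct: bool, penalty: float = 0.1
) -> List[float]:
    """Assign one reward per token.

    Tokens with index < nrp_end are treated as essential and get the outcome
    reward only (no length penalty). Tokens at or after nrp_end are treated as
    redundant and always get -penalty. The penalty value is an assumption.
    """
    if not 0 <= nrp_end <= num_tokens:
        raise ValueError("nrp_end must lie in [0, num_tokens]")
    outcome = 1.0 if correct else 0.0
    return [outcome if t < nrp_end else -penalty for t in range(num_tokens)]


# A correct 5-token response whose NRP covers the first 3 tokens.
rewards = decoupled_token_rewards(num_tokens=5, nrp_end=3, correct=True)
```

Because the penalty never touches NRP tokens, shortening the redundant tail cannot lower the reward of the reasoning that produced the answer.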

3) Curriculum Schedule

Adapt the easy-prompt ratio during training to avoid suppressing exploratory behavior.
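
The sketch below shows only the batching side of such a schedule: given a target easy-prompt ratio, it builds a batch that respects it. How the ratio is adapted over training is not described on this page, so `easy_ratio` is simply an input here, and `build_batch` and its arguments are hypothetical names.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def build_batch(
    easy: Sequence[T],
    hard: Sequence[T],
    batch_size: int,
    easy_ratio: float,
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Sample a batch in which roughly easy_ratio of the prompts are easy.

    Assumption: prompts are already split into easy and hard pools.
    The batch is smaller than batch_size if either pool runs out.
    """
    if not 0.0 <= easy_ratio <= 1.0:
        raise ValueError("easy_ratio must lie in [0, 1]")
    rng = rng or random.Random(0)
    n_easy = min(len(easy), int(batch_size * easy_ratio))
    n_hard = min(len(hard), batch_size - n_easy)
    return rng.sample(list(easy), n_easy) + rng.sample(list(hard), n_hard)


# Toy pools: ids below 100 are easy prompts, ids from 100 up are hard ones.
batch = build_batch(list(range(10)), list(range(100, 110)), batch_size=8, easy_ratio=0.25)
```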

Experiments

Core effectiveness: module ablation and post-training reasoning behavior.

Token vs PNRP

DECS converts efficiency gains into higher PNRP (proportion of NRP) scores.
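
Reading PNRP literally as the share of generated tokens that fall inside the NRP, it could be computed as below. Averaging per response (rather than pooling all tokens) is an assumption, as are the function and argument names.

```python
from typing import Iterable, Tuple


def pnrp(lengths: Iterable[Tuple[int, int]]) -> float:
    """Mean over responses of (NRP tokens / total tokens).

    Each item is (nrp_tokens, total_tokens). Per-response averaging is an
    assumption; the page only defines PNRP as the proportion of NRP.
    """
    ratios = []
    for nrp_tokens, total_tokens in lengths:
        if total_tokens <= 0 or not 0 <= nrp_tokens <= total_tokens:
            raise ValueError("need 0 <= nrp_tokens <= total_tokens and total_tokens > 0")
        ratios.append(nrp_tokens / total_tokens)
    if not ratios:
        raise ValueError("no responses given")
    return sum(ratios) / len(ratios)


# Two responses: half of the first is NRP, all of the second is NRP.
score = pnrp([(2, 4), (3, 3)])
```

A PNRP of 1.0 would mean every generated token is part of the necessary prefix, so higher values indicate less redundant reasoning.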

Pass@K

DECS preserves exploration and maintains strong scaling under multiple attempts.
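
The page does not say how Pass@K was estimated. A commonly used choice is the unbiased estimator 1 - C(n - c, k) / C(n, k), where n samples are drawn per problem and c of them are correct; the sketch below assumes that estimator.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate from n samples of which c are correct.

    Equals the probability that at least one of k samples, drawn without
    replacement from the n generated ones, is correct.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # which correctly yields a Pass@K of 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples, 3 correct: Pass@1 is the plain accuracy, about 0.3.
p1 = pass_at_k(n=10, c=3, k=1)
```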

Token Budget Scaling

Scaling trends under different generation budgets across AIME2024, AIME2025, and AMC23.

Difficulty vs Efficiency

DECS consistently improves compression quality across different difficulty levels.

Case Studies

Qualitative examples across math, coding, and science reasoning settings.

Conclusion

DeCS shows that substantial overthinking reduction is achievable with careful token-level reward design and curriculum-aware data scheduling. The method consistently improves efficiency while preserving or improving reasoning quality.

BibTeX