Abstract
Introduction
Visual summary of overthinking failure modes and efficiency-performance trade-offs.
Method
From limitation analysis to DeCS: detector, decoupled reward, and curriculum scheduling.
1) NRP Detection
Detect the necessary reasoning prefix and separate useful reasoning from redundancy.
2) Decoupled Reward
Protect essential tokens and consistently penalize redundant tokens after NRP.
3) Curriculum Schedule
Adapt easy-prompt ratio to avoid suppressing exploratory behavior during training.
DeCS Training Pipeline
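The decoupled reward described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `decoupled_token_rewards`, the `redundancy_penalty` value, and the choice of the outcome reward as the per-token reward inside the NRP are all assumptions made for illustration. It assumes a detector has already supplied `nrp_end`, the index at which the necessary reasoning prefix ends.

```python
from typing import List


def decoupled_token_rewards(
    num_tokens: int,
    nrp_end: int,
    is_correct: bool,
    redundancy_penalty: float = 0.5,  # hypothetical value, not from the source
) -> List[float]:
    """Hypothetical per-token reward that treats NRP and post-NRP tokens separately.

    Tokens at positions [0, nrp_end) are treated as essential and receive the
    outcome reward; tokens at positions [nrp_end, num_tokens) are treated as
    redundant and always receive a fixed negative reward.
    """
    if not 0 <= nrp_end <= num_tokens:
        raise ValueError("nrp_end must lie in [0, num_tokens]")

    outcome = 1.0 if is_correct else 0.0
    rewards: List[float] = []
    for position in range(num_tokens):
        if position < nrp_end:
            # Essential token: never penalized for length.
            rewards.append(outcome)
        else:
            # Redundant token after the NRP: consistently penalized.
            rewards.append(-redundancy_penalty)
    return rewards


if __name__ == "__main__":
    # A correct 5-token response whose NRP covers the first 3 tokens.
    print(decoupled_token_rewards(num_tokens=5, nrp_end=3, is_correct=True))
```

Keeping the two token groups on separate reward paths means the penalty on redundant tokens cannot leak into the tokens needed to reach the answer, which is the property the method description emphasizes.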
Experiments
Core effectiveness: module ablation and post-training reasoning behavior.
Token vs PNRP
DeCS converts efficiency gains into higher PNRP (proportion of NRP) scores.
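As a rough illustration of the metric, the sketch below computes PNRP under the assumption that it is the fraction of a response's tokens that fall inside the NRP, averaged over responses. The page only gives the expansion "proportion of NRP", so this exact definition, the function name `pnrp`, and the input format are assumptions.

```python
from typing import Sequence, Tuple


def pnrp(samples: Sequence[Tuple[int, int]]) -> float:
    """Mean of nrp_tokens / total_tokens over (nrp_tokens, total_tokens) pairs.

    Assumed definition: a response with no redundant tokens scores 1.0.
    """
    if not samples:
        raise ValueError("need at least one sample")

    ratios = []
    for nrp_tokens, total_tokens in samples:
        if total_tokens <= 0 or not 0 <= nrp_tokens <= total_tokens:
            raise ValueError("require 0 <= nrp_tokens <= total_tokens and total_tokens > 0")
        ratios.append(nrp_tokens / total_tokens)
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # One response is half redundant, the other has no redundancy.
    print(pnrp([(50, 100), (100, 100)]))
```

Under this reading, shortening a response by removing only post-NRP tokens raises PNRP, while truncating the NRP itself does not.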
Pass@K
DeCS preserves exploration and maintains strong scaling under multiple attempts.
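For readers unfamiliar with the metric, the sketch below shows the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n sampled responses of which c are correct. Whether the experiments here use this exact estimator is an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k samples is correct.

    n: total samples drawn per problem
    c: number of correct samples among the n
    k: number of attempts being evaluated (k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("require 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("require 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case every size-k subset contains a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    for k in (1, 2, 5):
        print(k, pass_at_k(n=10, c=3, k=k))
```

A model that has lost exploratory diversity tends to show pass@k flattening quickly as k grows, which is why this curve is a natural check after length-reduction training.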
Token Budget Scaling
Scaling trends under different generation budgets across AIME2024, AIME2025, and AMC23.
Difficulty vs Efficiency
DeCS consistently improves compression quality across different difficulty levels.
Case Studies
Qualitative examples across math, coding, and science reasoning settings.
Conclusion
DeCS shows that substantial overthinking reduction is achievable with careful token-level reward design and curriculum-aware data scheduling. The method consistently improves efficiency while preserving or improving reasoning quality.