State-Driven Reasoning · Paper Coming Soon

State of Thought Enables Endogenous Reasoning

State of Thought (SoT) reframes test-time reasoning as a closed loop: the model's endogenous state selects which historical evidence to use for the next step and decides when to stop, instead of following a fixed external reasoning program.

Zhiren Gong1,2, Yikun Hou1,4, Zeng Zihao1, Ming Xiao5, Chau Yuen3, Wei Yang Bryan Lim1

1 College of Computing and Data Science, Nanyang Technological University
2 Interdisciplinary Graduate Programme, Nanyang Technological University
3 School of Electrical and Electronic Engineering, Nanyang Technological University
4 Department of Mathematics and Mathematical Statistics, Umea University
5 Department of Information Science and Engineering, Royal Institute of Technology, Sweden

Across 4 models and 20 datasets, SoT improves quality while reducing generated tokens by 69.0% and latency by 48.7%.

Figure. Paradigm shift: from external control to endogenous state-conditioned reasoning.
Figure. SoT performance across models, tasks, and efficiency metrics.

Why This Paradigm Matters

  • External scripts are rigid: fixed reasoning formats are brittle across heterogeneous tasks.
  • Search-heavy methods are costly: quality gains often depend on large sampling and high latency.
  • SoT changes the control variable: reasoning is driven by endogenous state, not external templates.
  • Result: a better quality-efficiency frontier from a reusable closed-loop controller.

  • Backbones: 4
  • Benchmarks: 20
  • Token reduction: 69.0%
  • Latency reduction: 48.7%
  • Long-context gain factor: 2.63x
  • Quantitative gain factor: 1.29x
  • General understanding gain factor: 1.62x
  • Symbolic/code gain factor: 1.72x
  • Multimodal gain factor: 1.08x
  • Trajectory judge agreement: 84.1%

Abstract

Existing test-time reasoning often depends on external control: scripted reasoning formats or expanded search spaces. SoT instead introduces endogenous reasoning, where a compact state derived from internal model dynamics controls both evidence selection and stopping.

Concretely, SoT extracts a dynamics-geometric state and selectively activates historical support that matches the current reasoning regime. This turns reasoning into a state-conditioned evidence process rather than an externally prescribed token chain.

The empirical pattern is consistent across quantitative, general understanding, symbolic/code, long-context, and multimodal tasks: SoT improves task quality while reducing unnecessary reasoning cost.

Method

1) Read Endogenous State

At each step, SoT reads a compact state summarizing geometry, progress dynamics, directional consistency, and uncertainty.

Output: a control signal that reflects the current reasoning regime.
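
The four ingredients named above can be pictured with a small sketch. It is an illustration under stated assumptions, not the authors' extractor: this page does not define the features, so `read_state` and its four fields are hypothetical stand-ins computed from a short trajectory of hidden vectors and a next-token distribution.

```python
import math

def _sub(a, b):
    return [x - y for x, y in zip(a, b)]

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def _cosine(a, b):
    na, nb = _norm(a), _norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def read_state(hidden, probs):
    """Summarize the last three hidden vectors and the current token distribution.

    Hypothetical feature definitions (assumptions, not taken from the source):
      geometry    -> norm of the latest hidden vector
      progress    -> size of the latest step in hidden space
      direction   -> cosine between the last two steps
      uncertainty -> entropy of `probs`, normalized to [0, 1]
    """
    if len(hidden) < 3 or len(probs) < 2:
        raise ValueError("need at least 3 hidden vectors and 2 probabilities")
    prev_step = _sub(hidden[-2], hidden[-3])
    curr_step = _sub(hidden[-1], hidden[-2])
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return {
        "geometry": _norm(hidden[-1]),
        "progress": _norm(curr_step),
        "direction": _cosine(prev_step, curr_step),
        "uncertainty": entropy / math.log(len(probs)),
    }

# Toy trajectory moving in a straight line, with a maximally uncertain distribution.
state = read_state([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]], [0.5, 0.5])
```

In this toy input the trajectory keeps a constant direction, so `direction` is at its maximum, while the uniform distribution pushes `uncertainty` to its maximum as well.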

2) Select State-Matched Evidence

SoT activates only the subset of historical reasoning support that is useful under the current state.

Output: sparse active context for the next reasoning step.
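
Sparse activation of this kind can be sketched as a top-k match between the current state and a key stored for each past step. The scoring rule here (cosine similarity with a threshold) and the names `select_evidence`, `history_keys`, `k`, and `min_score` are assumptions made for illustration; the page does not say how SoT scores historical support.

```python
import math

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def select_evidence(state_vec, history_keys, k=2, min_score=0.0):
    """Return indices of at most k past steps whose key best matches the state.

    Assumption: each past step is represented by one key vector, and relevance
    is cosine similarity to the current state vector.
    """
    scored = [(cosine(state_vec, key), idx) for idx, key in enumerate(history_keys)]
    kept = sorted((pair for pair in scored if pair[0] > min_score), reverse=True)[:k]
    # Restore chronological order so the active context reads like the original trace.
    return sorted(idx for _, idx in kept)

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
active = select_evidence([1.0, 0.1], keys, k=2)
```

With these toy keys, steps 0 and 2 are kept; the weakly aligned step 1 and the opposing step 3 are left out of the active context.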

3) State-Conditioned Stopping

The same endogenous state also governs whether to continue or stop, avoiding fixed-depth reasoning schedules.

Output: a closed loop balancing quality and efficiency.
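
Put together, the three steps form a loop in which one state drives both selection and stopping. The skeleton below is a minimal sketch of that control flow only; `run_closed_loop` and the four callables are hypothetical names, and the toy functions stand in for a real model. `max_steps` is a safety cap added for the sketch, not a reasoning schedule.

```python
def run_closed_loop(step_fn, state_fn, select_fn, stop_fn, max_steps=16):
    """Generate steps until the state-conditioned stop rule fires."""
    history, active = [], []
    for _ in range(max_steps):
        history.append(step_fn(active))      # reason using only the active context
        state = state_fn(history)            # 1) read the endogenous state
        if stop_fn(state):                   # 3) state-conditioned stopping
            break
        active = select_fn(state, history)   # 2) state-matched evidence
    return history

# Toy stand-ins (assumptions): each "thought" is a residual that halves every step.
def toy_step(active):
    return (min(active) if active else 1.0) / 2.0

def toy_state(history):
    return {"uncertainty": history[-1]}

def toy_select(state, history):
    return history[-2:]

def toy_stop(state):
    return state["uncertainty"] < 0.1

trace = run_closed_loop(toy_step, toy_state, toy_select, toy_stop)
```

The toy run stops after four steps, as soon as the residual falls below the threshold, well before the cap is reached.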

Figure. Different reasoning paradigms occupy distinguishable endogenous state regions; SoT spans broader adaptive regimes.
Figure. Different endogenous states induce different sparse evidence activation patterns and stop tendencies.

Mechanism Insight

  • Generalization source: SoT transfers as a control principle, not as a dataset-specific prompt recipe.
  • Efficiency source: compute is redirected to relevant support rather than spent on uniformly longer chains.
  • Interpretability: state clusters align with distinct evidence-selection and stop behaviors.

Main Results

Main Result Table (Llama-3.1-8B · Quantitative + Symbolic/Code)

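| Category | Method | GSM8K | MATH | DROP | QS avg | FOLIO | ProofWriter | BBH-Temporal | HumanEval | MBPP | S&C avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Greedy | Vanilla | 79.0 | 58.2 | 6.2 | 53.9 | 29.6 | 26.0 | 26.0 | 25.0 | 55.2 | 33.0 |
| Reasoning | CoT | 80.8 | 64.2 | 1.7 | 55.5 | 4.9 | 24.5 | 70.0 | 25.0 | 51.6 | 31.2 |
| Reasoning | PS | 71.2 | 40.0 | 1.4 | 43.4 | 4.9 | 15.2 | 55.0 | 10.0 | 22.0 | 17.6 |
| Reasoning | SR | 75.8 | 44.8 | 15.6 | 50.4 | 32.5 | 31.0 | 46.0 | 25.0 | 36.4 | 32.9 |
| Reasoning | SC | 83.6 | 54.0 | 1.6 | 53.2 | 16.8 | 28.5 | 52.0 | 10.0 | 35.2 | 27.3 |
| Reasoning | BoN | 83.6 | 54.0 | 1.6 | 53.2 | 16.8 | 28.5 | 52.0 | 10.0 | 35.2 | 27.3 |
| Reasoning | CB | 48.0 | 36.5 | 19.8 | 37.1 | 40.4 | 35.8 | 34.0 | 30.0 | 49.2 | 38.6 |
| Reasoning | MCTS | 55.2 | 33.5 | 13.2 | 37.5 | 36.0 | 31.5 | 30.0 | 35.0 | 49.2 | 36.7 |
| Memory | H2O | 79.8 | 59.2 | 2.4 | 53.6 | 6.4 | 27.8 | 48.0 | 10.0 | 42.0 | 26.3 |
| Memory | SNAP | 80.4 | 60.2 | 2.3 | 54.1 | 8.4 | 26.2 | 55.0 | 15.0 | 41.2 | 27.3 |
| Memory | STREAM | 70.4 | 44.2 | 1.3 | 44.4 | 1.5 | 31.8 | 1.0 | 0.0 | 5.6 | 13.0 |
| Latent | COCO | 81.4 | 60.0 | 3.7 | 54.8 | 45.3 | 45.2 | 51.0 | 20.0 | 47.2 | 42.5 |
| RL-Based | GRPO | 0.0 | 0.5 | 0.3 | 0.2 | 2.0 | 2.2 | 7.0 | 0.0 | 0.0 | 1.8 |
| Ours | SoT | 83.6 | 54.8 | 38.9 | 62.8 | 43.4 | 60.2 | 77.0 | 48.8 | 49.2 | 54.5 |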

Main Result Table (Llama-3.1-8B · General + Long-Context)

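| Category | Method | CommonsenseQA | StrategyQA | BoolQ | MMLU | RACE | GU avg | HotpotQA | NarrativeQA | LongBench MultiFieldQA | LCR avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Greedy | Vanilla | 48.0 | 68.8 | 59.2 | 45.6 | 50.7 | 54.2 | 7.0 | 20.1 | 32.7 | 16.2 |
| Reasoning | CoT | 32.8 | 31.5 | 37.2 | 34.0 | 50.3 | 36.3 | 2.0 | 4.5 | 26.9 | 7.1 |
| Reasoning | PS | 30.5 | 18.2 | 21.5 | 31.2 | 42.0 | 28.1 | 1.7 | 3.0 | 20.4 | 5.3 |
| Reasoning | SR | 57.0 | 64.0 | 72.2 | 62.4 | 62.7 | 63.6 | 17.9 | 21.7 | 25.5 | 20.6 |
| Reasoning | SC | 28.5 | 34.5 | 43.2 | 31.0 | 45.7 | 35.8 | 1.9 | 4.3 | 25.8 | 6.8 |
| Reasoning | BoN | 28.5 | 34.5 | 43.2 | 31.0 | 45.7 | 35.8 | 1.9 | 4.3 | 25.8 | 6.8 |
| Reasoning | CB | 46.0 | 71.5 | 82.2 | 51.2 | 56.0 | 61.1 | 15.5 | 27.5 | 26.2 | 21.8 |
| Reasoning | MCTS | 42.0 | 65.5 | 82.2 | 46.4 | 49.7 | 57.0 | 13.5 | 27.1 | 27.8 | 21.0 |
| Memory | H2O | 33.5 | 30.8 | 39.2 | 35.0 | 19.7 | 32.4 | 2.0 | 2.0 | 0.0 | 1.7 |
| Memory | SNAP | 33.5 | 31.0 | 38.5 | 34.6 | 16.7 | 31.8 | 2.0 | 2.0 | 0.0 | 1.7 |
| Memory | STREAM | 34.8 | 26.2 | 38.8 | 22.4 | 0.0 | 25.6 | 2.5 | 0.0 | 0.0 | 1.1 |
| Latent | COCO | 40.5 | 63.5 | 70.0 | 42.8 | 43.3 | 52.0 | 4.4 | 10.8 | 34.6 | 11.9 |
| RL-Based | GRPO | 10.5 | 3.8 | 9.8 | 8.2 | 12.0 | 8.7 | 0.6 | 1.1 | 5.3 | 1.6 |
| Ours | SoT | 71.0 | 67.2 | 80.0 | 64.6 | 75.7 | 71.1 | 25.7 | 39.4 | 38.3 | 33.0 |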

Extended Experimental Modules

Each module below pairs an experimental table with its interpretation.

| Backbone | Quantitative Reasoning avg | Symbolic and Code avg | General Understanding avg | Long-Context avg |
| --- | --- | --- | --- | --- |
| Qwen2.5-14B (SoT) | 66.1 | 76.8 | 81.4 | See Appendix table |
| Mixtral-8x7B (SoT) | 46.3 | 51.7 | 72.6 | 23.5 |

  • On Qwen2.5-14B, SoT shows especially strong General Understanding (GU) performance across all five GU datasets.
  • On Mixtral-8x7B, SoT remains near-best on quantitative tasks and leads strongly on the symbolic, GU, and long-context averages.
  • The same SoT controller transfers across markedly different backbone architectures.

| Dataset | CoT acc. | SoT acc. | CoT latency (s) | SoT latency (s) | CoT tokens | SoT tokens |
| --- | --- | --- | --- | --- | --- | --- |
| Beans-M | 80.8 | 82.8 | 29.9 | 14.1 | 232.3 | 493.5 |
| Fashion-MNIST | 68.8 | 72.8 | 31.0 | 12.5 | 225.7 | 462.9 |
| DocVQA | 71.1 | 71.8 | 19.4 | 12.2 | 167.2 | 226.8 |
| InfographicVQA | 67.5 | 66.9 | 7.8 | 17.1 | 229.5 | 301.4 |

  • SoT improves accuracy on Beans-M, Fashion-MNIST, and DocVQA while lowering latency on the same three tasks.
  • InfographicVQA remains harder: accuracy dips slightly (67.5 to 66.9) and latency rises (7.8 s to 17.1 s), showing where multimodal control still has room to improve.
  • Overall, state-conditioned evidence organization transfers beyond text-only reasoning.

| Variant | Quantitative avg | Symbolic and Code avg | General Understanding avg | Long-Context avg |
| --- | --- | --- | --- | --- |
| Top-3 baseline average | 53.5 | 44.0 | 61.5 | 24.2 |
| SoT-Training-free | 59.0 | 49.1 | 52.0 | 27.8 |
| SoT-Embedding | 59.0 | 49.8 | 51.9 | 28.6 |

  • Even without full internal access, both SoT variants exceed the Top-3 baseline average in three of the four domains (quantitative, symbolic and code, long-context); they trail it on general understanding.
  • The largest relative resilience appears in long-context tasks, where trajectory-level organization is critical.
  • These results support the view that the mechanism is not tied to one specific implementation interface.
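
The relative comparison against the Top-3 baseline average can be reproduced from the variant table with a few lines of arithmetic. This is a sketch over the figures printed on this page; "relative change" is taken to mean (variant - baseline) / baseline, which is an assumption about how resilience is measured.

```python
DOMAINS = ["Quantitative", "Symbolic and Code", "General Understanding", "Long-Context"]
BASELINE = [53.5, 44.0, 61.5, 24.2]  # "Top-3 baseline average" row
VARIANTS = {
    "SoT-Training-free": [59.0, 49.1, 52.0, 27.8],
    "SoT-Embedding": [59.0, 49.8, 51.9, 28.6],
}

def relative_change(scores):
    """Per-domain (variant - baseline) / baseline; positive means the variant is ahead."""
    return {d: (s - b) / b for d, s, b in zip(DOMAINS, scores, BASELINE)}

changes = {name: relative_change(scores) for name, scores in VARIANTS.items()}
best_domain = {name: max(ch, key=ch.get) for name, ch in changes.items()}

for name, ch in changes.items():
    summary = ", ".join(f"{d}: {100 * v:+.1f}%" for d, v in ch.items())
    print(f"{name}: {summary}")
```

Under that definition, long-context shows the largest relative gain for both variants, and general understanding is the one domain where the change is negative.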

Performance Interpretation

SoT consistently outperforms strong baselines across heterogeneous reasoning regimes. This pattern suggests the gain is structural: it comes from changing the control variable from external token programs to endogenous, state-conditioned evidence organization.

On Llama-3.1-8B, SoT leads all four domain averages and achieves the best or tied-best score on 11 of the 16 datasets in the tables above, showing that robust gains do not require larger search budgets.

Efficiency & Trade-off

Figure. Trade-off on Qwen2.5-14B.
Figure. Trade-off on Mixtral-8x7B.
Figure. Trade-off on Llama-3.1-8B.
Figure. Trade-off on Qwen3-VL-8B.

Llama Efficiency Summary (Domain Average)

| Method | QS Tok / Lat | S&C Tok / Lat | GU Tok / Lat | LCR Tok / Lat | Avg Tok | Avg Lat (s) |
| --- | --- | --- | --- | --- | --- | --- |
| CoT | 271.6 / 17.5 | 369.8 / 27.5 | 212.9 / 12.8 | 198.1 / 19.6 | 263.1 | 19.4 |
| Self-Consistency | 736.8 / 43.1 | 953.9 / 63.5 | 626.5 / 37.6 | 533.8 / 48.9 | 712.8 | 48.3 |
| Constrained Beam | 423.9 / 22.0 | 363.9 / 19.4 | 341.7 / 16.8 | 283.7 / 19.5 | 353.3 | 19.4 |
| MCTS | 422.1 / 21.4 | 364.7 / 19.4 | 335.2 / 16.5 | 286.6 / 20.5 | 352.1 | 19.4 |
| H2O | 226.0 / 16.7 | 264.4 / 34.6 | 185.6 / 15.8 | 155.8 / 21.5 | 207.9 | 22.1 |
| COCO | 171.5 / 11.5 | 214.6 / 20.4 | 87.2 / 6.7 | 131.5 / 30.3 | 151.2 | 17.2 |
| SoT | 219.4 / 5.8 | 294.0 / 9.5 | 64.0 / 1.8 | 181.6 / 22.0 | 189.8 | 9.8 |
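
To read the Llama summary as relative savings, the average token and latency columns can be converted into percentage reductions of SoT against each baseline. This is a sketch over the averages printed above; the helper `reduction` is a hypothetical name. The results cover only these Llama-3.1-8B domain averages, so they need not match the headline 69.0% and 48.7% figures, whose reference baseline is not stated on this page.

```python
# (average tokens, average latency in seconds) from the Llama efficiency summary.
AVG_COST = {
    "CoT": (263.1, 19.4),
    "Self-Consistency": (712.8, 48.3),
    "Constrained Beam": (353.3, 19.4),
    "MCTS": (352.1, 19.4),
    "H2O": (207.9, 22.1),
    "COCO": (151.2, 17.2),
    "SoT": (189.8, 9.8),
}

def reduction(baseline, ours):
    """Percentage by which `ours` is below `baseline`; negative means it is above."""
    return 100.0 * (1.0 - ours / baseline)

sot_tok, sot_lat = AVG_COST["SoT"]
savings = {
    name: (round(reduction(tok, sot_tok), 1), round(reduction(lat, sot_lat), 1))
    for name, (tok, lat) in AVG_COST.items()
    if name != "SoT"
}

for name, (tok_pct, lat_pct) in savings.items():
    print(f"{name:>17}: token reduction {tok_pct:.1f}%, latency reduction {lat_pct:.1f}%")
```

Against CoT this works out to roughly 28% fewer tokens and 49% lower latency. Against COCO the token figure is negative, because COCO's average (151.2) is below SoT's (189.8), while SoT's latency is still lower.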

Efficiency Insight

  • Search-heavy baselines often land in high-token, high-latency zones: on Llama-3.1-8B, Self-Consistency averages 712.8 tokens and 48.3 s, against 263.1 tokens and 19.4 s for CoT.
  • Memory-only compression may cut tokens but can lose state-relevant support and hurt quality.
  • SoT frontier shift: quality gains and cost reductions arrive together rather than being traded against each other.
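
A frontier claim can be checked with a simple Pareto-dominance test over (quality, latency) pairs. The sketch below makes two assumptions that are not stated on this page: quality is summarized as the unweighted mean of the four Llama-3.1-8B domain averages, and the abbreviations SC and CB in the main tables refer to the Self-Consistency and Constrained Beam rows of the efficiency summary.

```python
# QS, S&C, GU, LCR domain averages on Llama-3.1-8B, copied from the main result tables.
DOMAIN_AVGS = {
    "CoT": [55.5, 31.2, 36.3, 7.1],
    "Self-Consistency": [53.2, 27.3, 35.8, 6.8],
    "Constrained Beam": [37.1, 38.6, 61.1, 21.8],
    "MCTS": [37.5, 36.7, 57.0, 21.0],
    "H2O": [53.6, 26.3, 32.4, 1.7],
    "COCO": [54.8, 42.5, 52.0, 11.9],
    "SoT": [62.8, 54.5, 71.1, 33.0],
}
# Average latency in seconds, copied from the Llama efficiency summary.
AVG_LATENCY = {
    "CoT": 19.4, "Self-Consistency": 48.3, "Constrained Beam": 19.4,
    "MCTS": 19.4, "H2O": 22.1, "COCO": 17.2, "SoT": 9.8,
}

def dominates(a, b):
    """True if a = (quality, latency) is no worse than b on both axes and better on one."""
    return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])

# Assumption: unweighted mean of domain averages as a single quality score.
points = {m: (sum(v) / len(v), AVG_LATENCY[m]) for m, v in DOMAIN_AVGS.items()}
frontier = [
    m for m, p in points.items()
    if not any(dominates(q, p) for n, q in points.items() if n != m)
]
```

Under these assumptions SoT is the only non-dominated method, since it has both the highest mean quality and the lowest average latency among the seven. A different quality summary or cost axis (for example tokens, where COCO is lower) could leave more than one method on the frontier.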

Case Studies

Case-Level Insight

Across arithmetic, long-context QA, and multimodal chart reasoning, SoT exhibits a shared pattern: it keeps only support still needed for the next decision, suppresses obsolete context, and raises stop readiness when sufficient evidence has accumulated.

Figure. Representative SoT case with step-level active evidence and stop probability.
Figure. Arithmetic case: staged decomposition with non-monotone evidence carry.
Figure. Long-context case: sparse retrieval after exploration stabilizes contrastive reasoning.
Figure. Multimodal case: the same state-conditioned interface transfers to chart-grounded reasoning.

Resources

Paper

Coming soon.

Code

Coming soon.

Citation

Formal BibTeX will be released soon.