State-Driven Reasoning · Paper Coming Soon

State of Thought Enables Endogenous Reasoning

SoT reframes test-time reasoning as a closed loop: the model's endogenous state selects the right historical evidence for the next step and controls when to stop, rather than following fixed external reasoning programs.

Zhiren Gong¹,², Yikun Hou¹,⁴, Zeng Zihao¹, Ming Xiao⁵, Chau Yuen³, Wei Yang Bryan Lim¹

¹ College of Computing and Data Science, Nanyang Technological University
² Interdisciplinary Graduate Programme, Nanyang Technological University
³ School of Electrical and Electronic Engineering, Nanyang Technological University
⁴ Department of Mathematics and Mathematical Statistics, Umeå University
⁵ Department of Information Science and Engineering, Royal Institute of Technology, Sweden

Across 4 models and 20 datasets, SoT improves quality while reducing generated tokens by 69.0% and latency by 48.7%.

Figure. Paradigm shift: from external control to endogenous state-conditioned reasoning.

Figure. SoT performance across models, tasks, and efficiency metrics.

Why This Paradigm Matters

External scripts are rigid: fixed reasoning formats are brittle across heterogeneous tasks.
Search-heavy methods are costly: quality gains often depend on large sampling and high latency.
SoT changes the control variable: reasoning is driven by endogenous state, not external templates.
Result: a better quality-efficiency frontier with a reusable closed-loop controller.

Metric	Value
Backbones	4
Benchmarks	20
Token reduction	69.0%
Latency reduction	48.7%
Long-context gain factor	2.63x
Quantitative gain factor	1.29x
General understanding gain factor	1.62x
Symbolic/code gain factor	1.72x
Multimodal gain factor	1.08x
Trajectory judge agreement	84.1%

Abstract

Existing test-time reasoning often depends on external control: scripted reasoning formats or expanded search spaces. SoT instead introduces endogenous reasoning, where a compact state derived from internal model dynamics controls both evidence selection and stopping.

Concretely, SoT extracts a dynamics-geometric state and selectively activates historical support that matches the current reasoning regime. This turns reasoning into a state-conditioned evidence process rather than an externally prescribed token chain.

The empirical pattern is consistent across quantitative, general understanding, symbolic/code, long-context, and multimodal tasks: SoT improves task quality while reducing unnecessary reasoning cost.

Method

1) Read Endogenous State

At each step, SoT reads a compact state summarizing geometry, progress dynamics, directional consistency, and uncertainty.

Output: a control signal that reflects the current reasoning regime.

2) Select State-Matched Evidence

SoT activates only the subset of historical reasoning support that is useful under the current state.

Output: sparse active context for the next reasoning step.

3) State-Conditioned Stopping

The same endogenous state also governs whether to continue or stop, avoiding fixed-depth reasoning schedules.

Output: a closed loop balancing quality and efficiency.
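
The three steps above can be sketched as a small closed loop. This is an illustrative toy only, not the authors' implementation (the code is not yet released). Every name here (`read_state`, `select_evidence`, `should_stop`, `run_closed_loop`), the four state features, the cosine-similarity selection rule, and the stop thresholds are assumptions chosen to make the idea concrete; the step vectors stand in for whatever internal signal the real method reads.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


@dataclass
class State:
    norm: float         # assumed "geometry" feature: size of the current step vector
    progress: float     # assumed "progress dynamics": distance moved since the last step
    consistency: float  # assumed "directional consistency": cosine of the last two moves
    uncertainty: float  # normalized entropy of the next-token distribution, in [0, 1]


def read_state(history: List[Sequence[float]], probs: Sequence[float]) -> State:
    """Step 1: summarize the trajectory so far into a compact state."""
    cur = history[-1]
    norm = math.sqrt(sum(x * x for x in cur))
    progress, consistency = math.inf, 0.0  # undefined until there are two steps
    if len(history) >= 2:
        move = [c - p for c, p in zip(cur, history[-2])]
        progress = math.sqrt(sum(x * x for x in move))
        if len(history) >= 3:
            prev_move = [p - q for p, q in zip(history[-2], history[-3])]
            consistency = cosine(move, prev_move)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    uncertainty = entropy / math.log(len(probs)) if len(probs) > 1 else 0.0
    return State(norm, progress, consistency, uncertainty)


def select_evidence(history: List[Sequence[float]], k: int = 2) -> List[int]:
    """Step 2: keep the k earlier steps most similar to the current one."""
    cur = history[-1]
    scored = sorted(((cosine(cur, h), i) for i, h in enumerate(history[:-1])), reverse=True)
    return sorted(i for _, i in scored[:k])


def should_stop(state: State, max_progress: float = 0.1, max_uncertainty: float = 0.3) -> bool:
    """Step 3: stop once the trajectory has settled and the model is confident."""
    return state.progress < max_progress and state.uncertainty < max_uncertainty


def run_closed_loop(trace: List[Tuple[Sequence[float], Sequence[float]]], k: int = 2):
    """Replay a recorded trace of (step_vector, next_token_probs) pairs."""
    history: List[Sequence[float]] = []
    active: List[int] = []
    for step, (vec, probs) in enumerate(trace):
        history.append(vec)
        state = read_state(history, probs)
        active = select_evidence(history, k)
        if should_stop(state):
            return step, active
    return len(trace) - 1, active


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
toy_trace = [([1.0, 0.0], uniform), ([0.0, 1.0], uniform),
             ([1.0, 1.0], uniform), ([1.0, 1.05], peaked)]
stop_step, active_steps = run_closed_loop(toy_trace)
print(stop_step, active_steps)  # 3 [1, 2]
```

In this toy trace the loop stops at the fourth step, where the trajectory has nearly stopped moving and the token distribution is peaked, and it keeps only the two earlier steps closest to the current one. The same state object feeds both the selection and the stop decision, which is the closed-loop property the method describes.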

Figure. Different reasoning paradigms occupy distinguishable endogenous state regions; SoT spans broader adaptive regimes.

Figure. Different endogenous states induce different sparse evidence activation patterns and stop tendencies.

Mechanism Insight

Generalization source: SoT transfers as a control principle, not as a dataset-specific prompt recipe.
Efficiency source: compute is redirected to relevant support, rather than uniformly longer chains.
Interpretability: state clusters align with distinct evidence-selection and stop behaviors.

Main Results

Main Result Table (Llama-3.1-8B · Quantitative [QS] + Symbolic/Code [S&C])

Category	Method	GSM8K	MATH	DROP	QS avg	FOLIO	ProofWriter	BBH-Temporal	HumanEval	MBPP	S&C avg
Greedy	Vanilla	79.0	58.2	6.2	53.9	29.6	26.0	26.0	25.0	55.2	33.0
Reasoning	CoT	80.8	64.2	1.7	55.5	4.9	24.5	70.0	25.0	51.6	31.2
Reasoning	PS	71.2	40.0	1.4	43.4	4.9	15.2	55.0	10.0	22.0	17.6
Reasoning	SR	75.8	44.8	15.6	50.4	32.5	31.0	46.0	25.0	36.4	32.9
Reasoning	SC	83.6	54.0	1.6	53.2	16.8	28.5	52.0	10.0	35.2	27.3
Reasoning	BoN	83.6	54.0	1.6	53.2	16.8	28.5	52.0	10.0	35.2	27.3
Reasoning	CB	48.0	36.5	19.8	37.1	40.4	35.8	34.0	30.0	49.2	38.6
Reasoning	MCTS	55.2	33.5	13.2	37.5	36.0	31.5	30.0	35.0	49.2	36.7
Memory	H2O	79.8	59.2	2.4	53.6	6.4	27.8	48.0	10.0	42.0	26.3
Memory	SNAP	80.4	60.2	2.3	54.1	8.4	26.2	55.0	15.0	41.2	27.3
Memory	STREAM	70.4	44.2	1.3	44.4	1.5	31.8	1.0	0.0	5.6	13.0
Latent	COCO	81.4	60.0	3.7	54.8	45.3	45.2	51.0	20.0	47.2	42.5
RL-Based	GRPO	0.0	0.5	0.3	0.2	2.0	2.2	7.0	0.0	0.0	1.8
Ours	SoT	83.6	54.8	38.9	62.8	43.4	60.2	77.0	48.8	49.2	54.5

Main Result Table (Llama-3.1-8B · General Understanding [GU] + Long-Context [LCR])

Category	Method	CommonsenseQA	StrategyQA	BoolQ	MMLU	RACE	GU avg	HotpotQA	NarrativeQA	LongBench MultiFieldQA	LCR avg
Greedy	Vanilla	48.0	68.8	59.2	45.6	50.7	54.2	7.0	20.1	32.7	16.2
Reasoning	CoT	32.8	31.5	37.2	34.0	50.3	36.3	2.0	4.5	26.9	7.1
Reasoning	PS	30.5	18.2	21.5	31.2	42.0	28.1	1.7	3.0	20.4	5.3
Reasoning	SR	57.0	64.0	72.2	62.4	62.7	63.6	17.9	21.7	25.5	20.6
Reasoning	SC	28.5	34.5	43.2	31.0	45.7	35.8	1.9	4.3	25.8	6.8
Reasoning	BoN	28.5	34.5	43.2	31.0	45.7	35.8	1.9	4.3	25.8	6.8
Reasoning	CB	46.0	71.5	82.2	51.2	56.0	61.1	15.5	27.5	26.2	21.8
Reasoning	MCTS	42.0	65.5	82.2	46.4	49.7	57.0	13.5	27.1	27.8	21.0
Memory	H2O	33.5	30.8	39.2	35.0	19.7	32.4	2.0	2.0	0.0	1.7
Memory	SNAP	33.5	31.0	38.5	34.6	16.7	31.8	2.0	2.0	0.0	1.7
Memory	STREAM	34.8	26.2	38.8	22.4	0.0	25.6	2.5	0.0	0.0	1.1
Latent	COCO	40.5	63.5	70.0	42.8	43.3	52.0	4.4	10.8	34.6	11.9
RL-Based	GRPO	10.5	3.8	9.8	8.2	12.0	8.7	0.6	1.1	5.3	1.6
Ours	SoT	71.0	67.2	80.0	64.6	75.7	71.1	25.7	39.4	38.3	33.0

Extended Experimental Modules

Backbone	Quantitative Reasoning avg	Symbolic and Code avg	General Understanding avg	Long-Context avg
Qwen2.5-14B (SoT)	66.1	76.8	81.4	See Appendix table
Mixtral-8x7B (SoT)	46.3	51.7	72.6	23.5

Qwen2.5-14B shows especially strong General Understanding performance across all five GU datasets.
Mixtral-8x7B remains near-best in quantitative tasks and leads strongly in symbolic, GU, and long-context averages.
The same SoT controller transfers across markedly different backbone architectures.

Dataset	CoT acc.	SoT acc.	CoT latency (s)	SoT latency (s)	CoT tokens	SoT tokens
Beans-M	80.8	82.8	29.9	14.1	232.3	493.5
Fashion-MNIST	68.8	72.8	31.0	12.5	225.7	462.9
DocVQA	71.1	71.8	19.4	12.2	167.2	226.8
InfographicVQA	67.5	66.9	7.8	17.1	229.5	301.4

SoT improves accuracy on Beans-M, Fashion-MNIST, and DocVQA while lowering latency on all three (for example, from 29.9 s to 14.1 s on Beans-M).
InfographicVQA is the exception: SoT is slightly less accurate (66.9 vs. 67.5) and slower (17.1 s vs. 7.8 s), showing where multimodal control still has room to improve.
Overall, state-conditioned evidence organization transfers beyond text-only reasoning.
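
The per-dataset comparison can be made explicit with a few lines of arithmetic. The sketch below only re-derives quantities from the table above (accuracy difference, latency speedup, token ratio); the dictionary layout and helper name `compare` are illustrative choices, not part of any released code.

```python
# Columns transcribed from the multimodal table above:
# (CoT acc, SoT acc, CoT latency s, SoT latency s, CoT tokens, SoT tokens)
multimodal = {
    "Beans-M":        (80.8, 82.8, 29.9, 14.1, 232.3, 493.5),
    "Fashion-MNIST":  (68.8, 72.8, 31.0, 12.5, 225.7, 462.9),
    "DocVQA":         (71.1, 71.8, 19.4, 12.2, 167.2, 226.8),
    "InfographicVQA": (67.5, 66.9, 7.8, 17.1, 229.5, 301.4),
}


def compare(row):
    """Return (accuracy gain in points, latency speedup, token ratio) for SoT vs. CoT."""
    cot_acc, sot_acc, cot_lat, sot_lat, cot_tok, sot_tok = row
    return round(sot_acc - cot_acc, 1), cot_lat / sot_lat, sot_tok / cot_tok


summary = {name: compare(row) for name, row in multimodal.items()}
for name, (gain, speedup, tok_ratio) in summary.items():
    print(f"{name}: acc {gain:+.1f} pts, latency x{speedup:.2f}, tokens x{tok_ratio:.2f}")
```

A speedup above 1 means SoT is faster than CoT on that dataset. As listed in the table, the SoT token counts are higher than the CoT counts on all four datasets, so the token ratio is above 1 throughout.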

Variant	Quantitative avg	Symbolic and Code avg	General Understanding avg	Long-Context avg
Top-3 baseline average	53.5	44.0	61.5	24.2
SoT-Training-free	59.0	49.1	52.0	27.8
SoT-Embedding	59.0	49.8	51.9	28.6

Even without full internal access, both SoT variants exceed the Top-3 baseline average in quantitative, symbolic/code, and long-context tasks; they trail it in General Understanding (about 52 vs. 61.5).
The largest relative resilience appears in long-context tasks, where trajectory-level organization is critical.
These results support that the mechanism is not tied to one specific implementation interface.
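
To see where the restricted-access variants hold up best, each score can be divided by the Top-3 baseline average for the same domain. This is a small arithmetic sketch over the table above; the variable and function names are illustrative.

```python
DOMAINS = ["Quantitative", "Symbolic and Code", "General Understanding", "Long-Context"]

# Values transcribed from the variant table above.
top3_baseline = [53.5, 44.0, 61.5, 24.2]
variants = {
    "SoT-Training-free": [59.0, 49.1, 52.0, 27.8],
    "SoT-Embedding":     [59.0, 49.8, 51.9, 28.6],
}


def relative_to_baseline(scores):
    """Ratio of a variant's score to the Top-3 baseline average, per domain."""
    return {d: s / b for d, s, b in zip(DOMAINS, scores, top3_baseline)}


ratios = {name: relative_to_baseline(scores) for name, scores in variants.items()}
strongest = {name: max(r, key=r.get) for name, r in ratios.items()}

for name, r in ratios.items():
    print(name, {d: round(v, 2) for d, v in r.items()}, "strongest:", strongest[name])
```

For both variants the largest ratio falls in the long-context domain, and the only ratio below 1 is General Understanding.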

Performance Interpretation

SoT consistently outperforms strong baselines across heterogeneous reasoning regimes. This pattern suggests the gain is structural: changing the control variable from external token programs to endogenous state-conditioned evidence organization.

On Llama-3.1-8B, SoT leads all four domain averages and achieves best or tied-best performance on 11 of the 16 datasets listed in the main result tables, showing that robust gains do not require larger search budgets.
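
The per-dataset picture on Llama-3.1-8B can be checked directly from the two main result tables. The sketch below pairs the best baseline score on each dataset with the SoT score (both transcribed from those tables) and counts where SoT is best or tied; the data layout is an illustrative choice.

```python
# dataset -> (best baseline score, SoT score), from the Llama-3.1-8B tables
scores = {
    "GSM8K": (83.6, 83.6),
    "MATH": (64.2, 54.8),
    "DROP": (19.8, 38.9),
    "FOLIO": (45.3, 43.4),
    "ProofWriter": (45.2, 60.2),
    "BBH-Temporal": (70.0, 77.0),
    "HumanEval": (35.0, 48.8),
    "MBPP": (55.2, 49.2),
    "CommonsenseQA": (57.0, 71.0),
    "StrategyQA": (71.5, 67.2),
    "BoolQ": (82.2, 80.0),
    "MMLU": (62.4, 64.6),
    "RACE": (62.7, 75.7),
    "HotpotQA": (17.9, 25.7),
    "NarrativeQA": (27.5, 39.4),
    "LongBench MultiFieldQA": (34.6, 38.3),
}

best_or_tied = [name for name, (baseline, sot) in scores.items() if sot >= baseline]
behind = [name for name, (baseline, sot) in scores.items() if sot < baseline]

print(f"best or tied on {len(best_or_tied)} of {len(scores)} datasets")
print("behind the best baseline on:", ", ".join(behind))
```

The datasets where a baseline stays ahead are MATH, FOLIO, MBPP, StrategyQA, and BoolQ.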

Efficiency & Trade-off

Figure. Trade-off on Qwen2.5-14B.

Figure. Trade-off on Mixtral-8x7B.

Figure. Trade-off on Llama-3.1-8B.

Figure. Trade-off on Qwen3-VL-8B.

Llama Efficiency Summary (Domain Average)

Method	QS Tok / Lat	S&C Tok / Lat	GU Tok / Lat	LCR Tok / Lat	Avg Tok	Avg Lat (s)
CoT	271.6 / 17.5	369.8 / 27.5	212.9 / 12.8	198.1 / 19.6	263.1	19.4
Self-Consistency	736.8 / 43.1	953.9 / 63.5	626.5 / 37.6	533.8 / 48.9	712.8	48.3
Constrained Beam	423.9 / 22.0	363.9 / 19.4	341.7 / 16.8	283.7 / 19.5	353.3	19.4
MCTS	422.1 / 21.4	364.7 / 19.4	335.2 / 16.5	286.6 / 20.5	352.1	19.4
H2O	226.0 / 16.7	264.4 / 34.6	185.6 / 15.8	155.8 / 21.5	207.9	22.1
COCO	171.5 / 11.5	214.6 / 20.4	87.2 / 6.7	131.5 / 30.3	151.2	17.2
SoT	219.4 / 5.8	294.0 / 9.5	64.0 / 1.8	181.6 / 22.0	189.8	9.8
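
The Avg Tok and Avg Lat columns are consistent with a simple mean over the four domains, which the sketch below recomputes along with SoT's relative reduction against a chosen baseline. The helper names are illustrative. The headline 69.0% and 48.7% figures are stated across 4 models and 20 datasets, so this Llama-only calculation is not expected to reproduce them.

```python
# method -> [(tokens, latency in seconds)] for QS, S&C, GU, LCR,
# transcribed from the Llama efficiency table above.
efficiency = {
    "CoT":              [(271.6, 17.5), (369.8, 27.5), (212.9, 12.8), (198.1, 19.6)],
    "Self-Consistency": [(736.8, 43.1), (953.9, 63.5), (626.5, 37.6), (533.8, 48.9)],
    "Constrained Beam": [(423.9, 22.0), (363.9, 19.4), (341.7, 16.8), (283.7, 19.5)],
    "MCTS":             [(422.1, 21.4), (364.7, 19.4), (335.2, 16.5), (286.6, 20.5)],
    "H2O":              [(226.0, 16.7), (264.4, 34.6), (185.6, 15.8), (155.8, 21.5)],
    "COCO":             [(171.5, 11.5), (214.6, 20.4), (87.2, 6.7), (131.5, 30.3)],
    "SoT":              [(219.4, 5.8), (294.0, 9.5), (64.0, 1.8), (181.6, 22.0)],
}


def domain_average(pairs):
    """Unweighted mean of tokens and latency over the four domains."""
    n = len(pairs)
    return sum(t for t, _ in pairs) / n, sum(lat for _, lat in pairs) / n


def reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (1.0 - ours / baseline)


averages = {method: domain_average(pairs) for method, pairs in efficiency.items()}
sot_tok, sot_lat = averages["SoT"]
for method in ("CoT", "Self-Consistency"):
    tok, lat = averages[method]
    print(f"SoT vs {method}: tokens -{reduction(tok, sot_tok):.1f}%, "
          f"latency -{reduction(lat, sot_lat):.1f}%")
```

Against CoT on this backbone, the calculation gives roughly 27.9% fewer tokens and 49.5% lower latency on average; against Self-Consistency the reductions are larger on both axes.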

Efficiency Insight

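Search-heavy baselines often move to high-token and high-latency zones: on Llama-3.1-8B, Self-Consistency averages 712.8 tokens and 48.3 s, versus 263.1 tokens and 19.4 s for CoT.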
Memory-only compression may cut tokens but can lose state-relevant support and hurt quality.
SoT frontier shift: quality gains and cost reduction are achieved jointly, not by a simple trade.

Case Studies

Case-Level Insight

Across arithmetic, long-context QA, and multimodal chart reasoning, SoT exhibits a shared pattern: it keeps only support still needed for the next decision, suppresses obsolete context, and raises stop readiness when sufficient evidence has accumulated.

Figure. Representative SoT case with step-level active evidence and stop probability.

Figure. Arithmetic case: staged decomposition with non-monotone evidence carry.

Figure. Long-context case: sparse retrieval after exploration stabilizes contrastive reasoning.

Figure. Multimodal case: the same state-conditioned interface transfers to chart-grounded reasoning.

Resources

▶ Tutorial

A narrated, animated ~4½-minute video tour of the problem, the idea, and the results.

Paper

Coming soon.

Code

Coming soon.

Citation

Formal BibTeX will be released soon.

Contact

For collaboration or questions, contact zhiren001@e.ntu.edu.sg.