s1 and the age of the gpu-poor
a new model, s1, released a week ago, shows how llm breakthroughs might be made by a thousand cuts going forward
the coolest thing about this new method is that it is unironically super simple (literally in the name, 'simple test-time scaling')
i bet thousands of researchers over the past few days have gone - "wait, i could've thought of that"
what is thinking?
thinking is forcing the model to wrap its intermediate reasoning tokens within <think> tags
once it's done generating tokens inside those tags, the model states its final answer
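roughly, a thinking model's output looks something like this (a made-up trace, purely for illustration):

```
<think>
okay, count the r's in "strawberry": s-t-r-a-w-b-e-r-r-y
one r at position 3, two more at positions 8 and 9, so 3 total
</think>
there are 3 r's in "strawberry"
```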
what is inference scaling?
with the release of o1, OpenAI claimed inference-time scaling laws. but what does this mean?
they showed that models perform better on benchmarks the longer you make them 'think'
this image, taken from the s1 repo, shows how more thinking tokens -> higher accuracy across a variety of benchmarks
how do we make models think longer?
now we know what thinking is, and that models perform better the more they 'think'
s1 shows us how to make models think longer - by replacing </think> with "Wait" ... yes it is literally that simple apparently
by doing this, we don't let the model stop thinking, and we get to control the number of tokens it generates (the paper calls this 'budget forcing')
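here's a minimal sketch of the idea with Hugging Face transformers - to be clear, this is not the paper's actual implementation: the model is a small stand-in so the sketch runs anywhere, and the <think> delimiters and budgets are my assumptions (s1 has its own chat template)

```python
# a minimal sketch of budget forcing - NOT the paper's actual implementation.
# the model is a small stand-in and the delimiters/budgets are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto")

def generate_until(text, stop, max_new_tokens=512):
    """greedy-decode a continuation of `text`, truncated at `stop` if it appears."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=max_new_tokens,
                             do_sample=False, pad_token_id=tok.eos_token_id)
    new = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    return new.split(stop)[0], stop in new

def budget_forced(prompt, num_waits=2):
    text = prompt + "\n<think>\n"
    for _ in range(num_waits):
        chunk, tried_to_stop = generate_until(text, "</think>")
        text += chunk
        if not tried_to_stop:
            break          # ran out of max_new_tokens without trying to stop
        text += "Wait"     # swallow </think> and force it to keep reasoning
    text += "\n</think>\n" # now let the thinking block close
    answer, _ = generate_until(text, tok.eos_token)
    return text + answer

print(budget_forced("how many r's are in 'strawberry'?"))
```

the same trick works in reverse too: to cap thinking, the paper force-appends the end-of-thinking delimiter early, so the model has to answer with whatever reasoning it already has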
for all we know, o1 might be doing something similar, but we can't tell because OpenAI does not give us the reasoning tokens :(
this simple change allows s1 (which is built off the comparatively tiny Qwen2.5-32B-Instruct model) to reach o1-preview levels super efficiently
is "wait" the best we can do?
well, they tried a bunch of things, and "Wait" seems to work the best:

| Model | AIME 2024 | MATH 500 | GPQA |
|---|---|---|---|
| No extrapolation | 50.0 | 93.0 | 57.6 |
| 2x without string | 50.0 | 90.2 | 55.1 |
| 2x “Alternatively” | 50.0 | 92.2 | 59.6 |
| 2x “Hmm” | 50.0 | 93.0 | 59.6 |
| 2x “Wait” | 53.3 | 93.0 | 59.6 |
more evals -
| Model | # ex. | AIME 2024 | MATH 500 | GPQA |
|---|---|---|---|---|
| **API only** | | | | |
| o1-preview | N.A. | 44.6 | 85.5 | 73.3 |
| o1-mini | N.A. | 70.0 | 90.0 | 60.0 |
| o1 | N.A. | 74.4 | 94.8 | 77.3 |
| Gemini 2.0 Flash Thinking | N.A. | 60.0 | N.A. | N.A. |
| **Open Weights** | | | | |
| Qwen2.5-32B-Instruct | N.A. | 26.7 | 84.0 | 49.0 |
| QwQ-32B | N.A. | 50.0 | 90.6 | 65.2 |
| r1 | > 800K | 79.8 | 97.3 | 71.5 |
| r1-distill | 800K | 72.6 | 94.3 | 62.1 |
| **Open Weights and Open Data** | | | | |
| Sky-T1 | 17K | 43.3 | 82.4 | 56.8 |
| Bespoke-32B | 17K | 63.3 | 93.0 | 58.1 |
| s1 w/o BF | 1K | 50.0 | 92.6 | 56.6 |
| s1-32B | 1K | 56.7 | 93.0 | 59.6 |
this cost basically nothing
according to the paper -
"We perform supervised finetuning on Qwen2.5-32B-Instruct using s1K to obtain our model s1-32B using basic hyperparameters outlined in §C. Finetuning took 26 minutes on 16 NVIDIA H100 GPUs with PyTorch FSDP."
training this entire thing ends up costing like $20 - 16 GPUs for 26 minutes is about 7 H100-hours, and at typical cloud rates of $2-3 per H100-hour that pencils out to roughly $15-20
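for a feel of how small this run is, here's a rough sketch of the recipe with TRL's SFTTrainer - every field name and hyperparameter below is a placeholder (the real config is in §C of the paper and the simplescaling repo), but the shape is just: 1K examples, vanilla supervised finetuning

```python
# a rough sketch of the s1 training recipe: plain SFT of Qwen2.5-32B-Instruct
# on the 1K-example s1K dataset. formatting fields and hyperparameters are
# placeholders, not the paper's §C values - the real run used PyTorch FSDP
# across 16 H100s.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

def to_text(ex):
    # hypothetical field names - check the s1K dataset card for the real schema
    return {"text": f"{ex['question']}\n<think>\n{ex['thinking']}\n</think>\n{ex['answer']}"}

dataset = load_dataset("simplescaling/s1K", split="train").map(to_text)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-32B-Instruct",
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="s1-repro",
        dataset_text_field="text",
        num_train_epochs=5,             # placeholder
        per_device_train_batch_size=1,  # a 32B model needs sharding in practice
        learning_rate=1e-5,             # placeholder
        bf16=True,
    ),
)
trainer.train()
```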
this should encourage researchers all over the world to just start fiddling with different ideas and seeing what works
conclusion
i love things like this
literally anyone with an idea and spare change can try to extract just that bit more from tiny models. even the gpu-poorest can now try to push sota and i'm all for it
this shit is super exciting and i can't wait to see how we decide to manipulate and gaslight these black boxes of probabilities next, until they can finally count the 'r's in strawberry