s1.1 benchmarking
recap
the stanford team behind s1 followed up with the new and improved s1.1 this week
they re-used the same s1K question dataset from s1, but swapped gemini for deepseek r1 to generate the reasoning traces
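for a sense of what that swap looks like, here's a minimal sketch of pulling a trace from r1 - the dict layout and pipeline details are my assumption, not the team's actual code, but deepseek's api does expose the chain of thought as reasoning_content:

```python
# minimal sketch of regenerating an s1K-style example with deepseek r1.
# the dict layout is an assumption for illustration, not the s1 team's
# actual pipeline code.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

def r1_trace(question: str) -> dict:
    resp = client.chat.completions.create(
        model="deepseek-reasoner",  # the r1 endpoint
        messages=[{"role": "user", "content": question}],
    )
    msg = resp.choices[0].message
    return {
        "question": question,
        "thinking": msg.reasoning_content,  # r1's chain of thought
        "answer": msg.content,              # the final solution
    }
```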
just this small change led to an improvement in AIME 2024 and AIME 2025 scores
| Model | Examples | AIME 2024 | AIME 2025 |
|---|---|---|---|
| **Open Weights and Open Data** | | | |
| LIMO | 817 | 56.3 | 44.5 |
| s1 vanilla | 1K | 50.0 | 26.7 |
| s1 with Budget Forcing "Wait" 1x | 1K | 53.3 | 26.7 |
| s1.1 vanilla | 1K | 56.7 | 53.3 |
| s1.1 with Budget Forcing "Wait" 1x | 1K | 56.7 | 60.0 |
this shows that even a tiny change to the data pipeline can push performance - on AIME 2025, by quite a lot
roadblocks
according to this tweet from xjdr (of entropix fame), "Wait ..." performed the best -
> 'Wait ...' did the best in my testing pic.twitter.com/ckWPHn2kjX
>
> — xjdr (@_xjdr) February 3, 2025
a kind online stranger offered me up to $200 in credits to confirm this
an important thing to know - i had never benchmarked, fine-tuned, or even run inference for a model before today
i spent a lot of time figuring out which gpu provider to use - eventually settling on runpod
deciding on the best gpu configuration also took a lot of trial and error - i kept running into cuda errors
i burned through ~$20 in gpu credits before settling on an 8xA40 cluster to run the benchmarks - a 32B model in float32 needs roughly 32B × 4 bytes ≈ 128GB for the weights alone, so the cluster's 8 × 48GB = 384GB of vram leaves comfortable headroom
results
finally coming to the exciting bit - the command i used to run the eval:
```bash
OPENAI_API_KEY=$OPENAI_API_KEY PROCESSOR=gpt-4o-mini lm_eval \
  --model vllm \
  --model_args pretrained=simplescaling/s1.1-32B,dtype=float32,tensor_parallel_size=8 \
  --tasks aime25_nofigures,aime24_nofigures \
  --batch_size auto \
  --apply_chat_template \
  --output_path s1.1forcingignore2wait \
  --log_samples \
  --gen_kwargs "max_gen_toks=32768,max_tokens_thinking=auto,thinking_n_ignore=2,thinking_n_ignore_str=wait..."
```
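the --gen_kwargs carry the budget forcing knobs: max_tokens_thinking=auto sizes the thinking budget, thinking_n_ignore=2 suppresses the end-of-thinking delimiter twice, and thinking_n_ignore_str is the string appended each time. roughly, the idea from the s1 paper looks like this - a paraphrase in python with stand-in names, not the lm-eval fork's actual internals:

```python
# hedged sketch of what the thinking_n_ignore gen_kwargs do, paraphrasing
# the s1 paper's budget forcing idea - NOT the lm-eval fork's actual code.

END_THINK = "<end-of-thinking>"  # placeholder; the real delimiter depends on the chat template

def generate(prompt: str, stop: str) -> str:
    """stand-in for the harness's decode-until-stop call into vllm"""
    raise NotImplementedError

def think_with_budget(prompt: str, n_ignore: int = 2, ignore_str: str = "wait...") -> str:
    # decode until the model tries to end its thinking
    thinking = generate(prompt, stop=END_THINK)
    for _ in range(n_ignore):
        # suppress the delimiter and append the forcing string instead,
        # nudging the model to keep reasoning
        thinking += ignore_str
        thinking += generate(prompt + thinking, stop=END_THINK)
    # after n_ignore suppressions the delimiter is allowed through and
    # the final answer is decoded as usual
    return thinking
```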
replacing "Wait" with "wait...", "wait ..." and "Wait ..." showed no improvement on either of the two benchmarks
| Model | Examples | AIME 2024 | AIME 2025 |
|---|---|---|---|
| s1.1 with Budget Forcing "wait..." 2x | 1K | 56.7 | 53.3 |
| s1.1 with Budget Forcing "wait ..." 2x | 1K | 56.7 | 53.3 |
| s1.1 with Budget Forcing "Wait ..." 2x | 1K | 56.7 | 46.7 |
in fact, "Wait ..." 2x performs the worst on AIME2025 "wait..." 2x does better but still worse than "Wait" 1x
full eval results -

- 'wait...'
- 'wait ...'
- 'Wait ...'
future work
tbd