s1.1 benchmarking





recap


the stanford team behind s1 followed up with the new and improved s1.1 this week

they re-used the same s1K question dataset from s1, but used deepseek r1 instead of gemini to generate the reasoning traces

just this small change led to an improvement in AIME 2024 and AIME 2025 scores (particularly in the AIME 2025 benchmark)

| Model (open weights and open data) | Examples | AIME 2024 | AIME 2025 |
|---|---|---|---|
| LIMO | 817 | 56.3 | 44.5 |
| s1 vanilla | 1K | 50.0 | 26.7 |
| s1 with Budget Forcing “Wait” 1x | 1K | 53.3 | 26.7 |
| s1.1 vanilla | 1K | 56.7 | 53.3 |
| s1.1 with Budget Forcing “Wait” 1x | 1K | 56.7 | 60.0 |


this shows that even tiny changes to the data pipeline can move the numbers - a little on AIME 2024, a lot on AIME 2025
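
quick aside on what "Budget Forcing “Wait” 1x" means: s1's test-time trick is to intercept the model when it tries to close its thinking block, strip the end-of-thinking marker, and append "Wait" so it keeps reasoning. here's a minimal sketch of the control flow - the marker string and the `generate` callable are my stand-ins, not the actual s1/lm-eval code:

```python
# minimal sketch of budget forcing; NOT the actual s1/lm-eval implementation.
# `generate` is any callable text -> continuation that stops at END_THINK.
END_THINK = "</think>"  # hypothetical end-of-thinking marker

def budget_force(generate, prompt, wait_str="Wait", n_ignore=1):
    text = prompt
    for _ in range(n_ignore):
        # suppress the model's attempt to stop thinking, then nudge it onward
        text += generate(text).removesuffix(END_THINK) + wait_str
    # final pass: this time let the model actually close its thinking block
    return text + generate(text)

# toy usage with a fake model, just to show the flow
def fake_generate(text):
    return " ...some more reasoning... " + END_THINK

print(budget_force(fake_generate, "Q: 1+1?", wait_str="Wait", n_ignore=1))
```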


roadblocks


according to this tweet from xjdr (of entropix fame), "Wait... " performed the best -


a kind online stranger offered me up to $200 in credits to confirm this

an important thing to know is - i had never benchmarked, fine-tuned, or even run inference on a model before today

i spent a lot of time figuring out which gpu provider to use - eventually settling on runpod

deciding on the best gpu configuration also took a lot of trial and error - i kept running into cuda errors


i burned through ~$20 in gpu credits before settling on an 8xA40 cluster to run the benchmarks
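
if you're in the same boat, a cheap sanity check before launching the full eval is to load the model under vllm with the same sharding settings and generate a few tokens. a minimal sketch - the model name and settings just mirror the eval command below:

```python
# minimal sketch: confirm vllm can shard the model across the cluster
# before burning credits on a full eval run (assumes 8 visible gpus)
from vllm import LLM, SamplingParams

llm = LLM(
    model="simplescaling/s1.1-32B",
    dtype="float32",         # same dtype as the eval command below
    tensor_parallel_size=8,  # one shard per A40
)
out = llm.generate(["2 + 2 ="], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```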



results


finally coming to the exciting bit, the command i used to run the eval -



```bash
OPENAI_API_KEY=$OPENAI_API_KEY PROCESSOR=gpt-4o-mini lm_eval \
  --model vllm \
  --model_args pretrained=simplescaling/s1.1-32B,dtype=float32,tensor_parallel_size=8 \
  --tasks aime25_nofigures,aime24_nofigures \
  --batch_size auto \
  --apply_chat_template \
  --output_path s1.1forcingignore2wait \
  --log_samples \
  --gen_kwargs "max_gen_toks=32768,max_tokens_thinking=auto,thinking_n_ignore=2,thinking_n_ignore_str=wait..."
```
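
the other runs only differ in the thinking_n_ignore_str value (and the output path), so a loop like this hypothetical one covers all three variants - note the quoting, since two of the strings contain a space:

```bash
# hypothetical wrapper around the command above; one eval per wait-string
for WAIT in "wait..." "wait ..." "Wait ..."; do
  OPENAI_API_KEY=$OPENAI_API_KEY PROCESSOR=gpt-4o-mini lm_eval \
    --model vllm \
    --model_args pretrained=simplescaling/s1.1-32B,dtype=float32,tensor_parallel_size=8 \
    --tasks aime25_nofigures,aime24_nofigures \
    --batch_size auto \
    --apply_chat_template \
    --output_path "s1.1forcingignore2-$WAIT" \
    --log_samples \
    --gen_kwargs "max_gen_toks=32768,max_tokens_thinking=auto,thinking_n_ignore=2,thinking_n_ignore_str=$WAIT"
done
```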

replacing "Wait" with "wait...", "wait ..." and "Wait ..." showed no improvement on either of the two benchmarks

| Model | Examples | AIME 2024 | AIME 2025 |
|---|---|---|---|
| s1.1 with Budget Forcing "wait..." 2x | 1K | 56.7 | 53.3 |
| s1.1 with Budget Forcing "wait ..." 2x | 1K | 56.7 | 53.3 |
| s1.1 with Budget Forcing "Wait ..." 2x | 1K | 56.7 | 46.7 |


in fact, "Wait ..." 2x performs the worst on AIME2025 "wait..." 2x does better but still worse than "Wait" 1x


full eval results -

- 'wait...'
- 'wait ...'
- 'Wait ...'



future work


tbd