s1.1 benchmarking
recap
the stanford team behind s1 followed up with the new and improved s1.1 this week
they re-used the same s1K question dataset from s1, but swapped gemini for deepseek r1 to generate the reasoning traces
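for a sense of what that swap looks like, here's a minimal sketch of pulling a trace from r1 - the dict layout and pipeline details are my assumption, not the team's actual code, but deepseek's api does expose the chain of thought as reasoning_content:

```python
# minimal sketch of regenerating an s1K-style example with deepseek r1.
# the dict layout is an assumption for illustration, not the s1 team's
# actual pipeline code.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

def r1_trace(question: str) -> dict:
    resp = client.chat.completions.create(
        model="deepseek-reasoner",  # the r1 endpoint
        messages=[{"role": "user", "content": question}],
    )
    msg = resp.choices[0].message
    return {
        "question": question,
        "thinking": msg.reasoning_content,  # r1's chain of thought
        "answer": msg.content,              # the final solution
    }
```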
just this small change led to an improvement in AIME 2024 and AIME 2025 scores
| Model | Examples | AIME 2024 | AIME 2025 |
|---|---|---|---|
| **Open Weights and Open Data** | | | |
| LIMO | 817 | 56.3 | 44.5 |
| s1 vanilla | 1K | 50.0 | 26.7 |
| s1 with Budget Forcing "Wait" 1x | 1K | 53.3 | 26.7 |
| s1.1 vanilla | 1K | 56.7 | 53.3 |
| s1.1 with Budget Forcing "Wait" 1x | 1K | 56.7 | 60.0 |
this shows that even a tiny change to the data pipeline can push performance - on AIME 2025, by quite a lot
roadblocks
according to this tweet from xjdr (of entropix fame), "Wait ..." performed the best -
> 'Wait ...' did the best in my testing pic.twitter.com/ckWPHn2kjX
>
> — xjdr (@_xjdr) February 3, 2025
a kind online stranger offered me up to $200 in credits to confirm this
an important thing to know - i had never benchmarked, fine-tuned, or even run inference for a model before today
i spent a lot of time figuring out which gpu provider to use - eventually settling on runpod
deciding on the best gpu configuration also took a lot of trial and error - i kept running into cuda errors
i burned through ~$20 in gpu credits before settling on an 8xA40 cluster to run the benchmarks - a 32B model in float32 needs roughly 32B × 4 bytes ≈ 128GB for the weights alone, so the cluster's 8 × 48GB = 384GB of vram leaves comfortable headroom
results
finally coming to the exciting bit - the command i used to run the eval:
```bash
OPENAI_API_KEY=$OPENAI_API_KEY PROCESSOR=gpt-4o-mini lm_eval \
  --model vllm \
  --model_args pretrained=simplescaling/s1.1-32B,dtype=float32,tensor_parallel_size=8 \
  --tasks aime25_nofigures,aime24_nofigures \
  --batch_size auto \
  --apply_chat_template \
  --output_path s1.1forcingignore2wait \
  --log_samples \
  --gen_kwargs "max_gen_toks=32768,max_tokens_thinking=auto,thinking_n_ignore=2,thinking_n_ignore_str=wait..."
```
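the --gen_kwargs carry the budget forcing knobs: max_tokens_thinking=auto sizes the thinking budget, thinking_n_ignore=2 suppresses the end-of-thinking delimiter twice, and thinking_n_ignore_str is the string appended each time. roughly, the idea from the s1 paper looks like this - a paraphrase in python with stand-in names, not the lm-eval fork's actual internals:

```python
# hedged sketch of what the thinking_n_ignore gen_kwargs do, paraphrasing
# the s1 paper's budget forcing idea - NOT the lm-eval fork's actual code.

END_THINK = "<end-of-thinking>"  # placeholder; the real delimiter depends on the chat template

def generate(prompt: str, stop: str) -> str:
    """stand-in for the harness's decode-until-stop call into vllm"""
    raise NotImplementedError

def think_with_budget(prompt: str, n_ignore: int = 2, ignore_str: str = "wait...") -> str:
    # decode until the model tries to end its thinking
    thinking = generate(prompt, stop=END_THINK)
    for _ in range(n_ignore):
        # suppress the delimiter and append the forcing string instead,
        # nudging the model to keep reasoning
        thinking += ignore_str
        thinking += generate(prompt + thinking, stop=END_THINK)
    # after n_ignore suppressions the delimiter is allowed through and
    # the final answer is decoded as usual
    return thinking
```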
replacing "Wait" with "wait...", "wait ..." and "Wait ..." showed no improvement on either of the two benchmarks
| Model | Examples | AIME 2024 | AIME 2025 |
|---|---|---|---|
| s1.1 with Budget Forcing "wait..." 2x | 1K | 56.7 | 53.3 |
| s1.1 with Budget Forcing "wait ..." 2x | 1K | 56.7 | 53.3 |
| s1.1 with Budget Forcing "Wait ..." 2x | 1K | 56.7 | 46.7 |
in fact, "Wait ..." 2x performs the worst on AIME2025 "wait..." 2x does better but still worse than "Wait" 1x
full eval results -

- 'wait...'
- 'wait ...'
- 'Wait ...'
future work
tbd