Sycophancy is noticeably higher, and a couple of tests in domains where I can assess output quality from my own expertise (outline a parts-buyer workflow given some proprietary details, explain why measures can't distinguish between two given countable subsets of the transcendentals, write a contrarian defense of Thrasymachus, show how the SEC phase of UEFI boot changed from pre-8 to 8 to 10 and 11) showed no difference in quality.
I'm gonna stick with v3-0324 and I recommend that others do the same.
even though I'm normally a fan of "release early and often," deepseek often loses some impact because they tend to release the model and the evals on different days. it wouldn't hurt anyone to just wait a day and release both together so that the conversations are more substantive.
ofc deepseek is getting the highest-order bit right: just train a good model and let everyone figure it out on their own time.
Interesting observation about the increased sycophancy. Your tests on specific domains are insightful. Seems like v3.1 might be a step back in practical quality. Thanks for sharing your experience, I'll probably hold off on upgrading for now too.
Are there any benchmarks or comparisons against gpt-oss? I believe it far exceeds gpt-oss or even gpt-5; otherwise they wouldn't have released it.
the model was released literally one hour ago so we need to be a little bit more patient.
Scores 71.6% on the Aider benchmark.
So it beats Claude 4 Opus on Aider Polyglot
https://xcancel.com/scaling01/status/1957890953026392212