This might be true for older or smaller LLMs, but in my experience it no longer holds for recent LLMs (including non-thinking ones).
Back in early 2022, I was working on automated exam grading with GPT-3. This was even before ChatGPT was released, so it required careful output-order tuning. Asking for the reasoning first instead of the grade did indeed make a huge difference.
I kept using and preaching this pattern for the next few years and assumed that it would naturally improve the accuracy of LLM predictions since the self-conditioning argument makes total sense on paper.
However, after I grew suspicious of my own assumptions, I did some evaluations in late 2024, with Gemini 2.0 I believe, and to my surprise, the output order trick hardly influenced the results anymore. This was even before reasoning became mainstream.
My guess is that as the models grew bigger and better, their latent representations became more stable and came to cover more tokens into the future. The inner representation confidently encodes the full answer, relatively independently of the (forced) answer order.
Of course there are unlucky cases where a wrong initial binary answer is indeed sampled, but recent models seem increasingly resistant to hallucinating arguments in favor of the wrong answer and instead self-correct later in the response.
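For anyone who wants to check this on their own task, here is a minimal sketch of such an output-order comparison. It is not the setup used in the evaluations above: the prompt wording, the 0 to 5 scale, the JSON reply format, and the `ask_model` callable (a stand-in for whatever client you use to call your model) are all assumptions made for illustration.

```python
import json
from typing import Callable, Dict, List

# Two hypothetical prompt templates that differ only in the requested key order.
# Doubled braces are literal braces after str.format().
REASONING_FIRST = (
    "Grade the student answer below on a scale from 0 to 5.\n"
    'Reply as JSON with the keys in this order: {{"reasoning": "...", "grade": <int>}}\n\n'
    "Question: {question}\nStudent answer: {answer}"
)
GRADE_FIRST = (
    "Grade the student answer below on a scale from 0 to 5.\n"
    'Reply as JSON with the keys in this order: {{"grade": <int>, "reasoning": "..."}}\n\n'
    "Question: {question}\nStudent answer: {answer}"
)


def accuracy(ask_model: Callable[[str], str], template: str, examples: List[dict]) -> float:
    """Fraction of examples where the model's grade matches the human grade."""
    correct = 0
    for ex in examples:
        prompt = template.format(question=ex["question"], answer=ex["answer"])
        reply = json.loads(ask_model(prompt))  # assumes the model returns valid JSON
        correct += int(reply["grade"] == ex["grade"])
    return correct / len(examples)


def compare_orders(ask_model: Callable[[str], str], examples: List[dict]) -> Dict[str, float]:
    """Run the same labeled examples through both output orders."""
    return {
        "reasoning_first": accuracy(ask_model, REASONING_FIRST, examples),
        "grade_first": accuracy(ask_model, GRADE_FIRST, examples),
    }
```

Call `compare_orders(ask_model, examples)` with a list of dicts holding `question`, `answer`, and the human `grade`. If the two accuracies are close on a reasonably large labeled set, the output order probably does not matter much for that model and task.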
@jonatanpallesen less than half as many workers per retiree, but the retirees also live longer, consume more medical resources, and young people are increasingly delaying joining the workforce, if they will ever productively work at all
feels like a predecessor to https://t.co/Cs0RS5PIFC
they trained the generator to produce the weights for an implicit neural representation that will generate the actual image
the results were actually pretty good and it had many advantages like arbitrary resolutions and aspect ratios
never understood why this didn't get more attention
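rough sketch of the idea, not the paper's actual architecture: the generator outputs a flat weight vector, and a tiny coordinate network using those weights maps (x, y) to a pixel value, so the same weights render at any resolution or aspect ratio. the layer sizes, sine activation, grayscale output and the `toy_generator` stand-in (random weights instead of a trained generator) are all assumptions for illustration

```python
import math
import random

HIDDEN = 8  # hypothetical hidden width of the tiny coordinate network


def num_params(hidden: int = HIDDEN) -> int:
    # Layer 1: 2 inputs -> hidden (weights + biases); layer 2: hidden -> 1 (weights + bias).
    return 2 * hidden + hidden + hidden + 1


def inr_pixel(params, x: float, y: float, hidden: int = HIDDEN) -> float:
    """Evaluate the implicit representation at continuous coordinates in [0, 1]."""
    w1 = params[: 2 * hidden]
    b1 = params[2 * hidden : 3 * hidden]
    w2 = params[3 * hidden : 4 * hidden]
    b2 = params[4 * hidden]
    # Sine activation is an assumption here, not taken from the source.
    h = [math.sin(w1[2 * j] * x + w1[2 * j + 1] * y + b1[j]) for j in range(hidden)]
    out = sum(w * v for w, v in zip(w2, h)) + b2
    return 1.0 / (1.0 + math.exp(-out))  # squash to a grayscale value in (0, 1)


def render(params, width: int, height: int):
    """Sample the same function on a width x height grid of pixel centers."""
    return [
        [inr_pixel(params, (c + 0.5) / width, (r + 0.5) / height) for c in range(width)]
        for r in range(height)
    ]


def toy_generator(seed: int):
    """Stand-in for a trained generator: maps a latent (here just a seed) to INR weights."""
    rng = random.Random(seed)
    return [rng.uniform(-3.0, 3.0) for _ in range(num_params())]
```

e.g. `params = toy_generator(0)` followed by `render(params, 16, 9)` and `render(params, 64, 64)` gives two samplings of the same underlying image without touching the weights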