So over in the Stable Diffusion world, after playing around long enough, I've found that certain input tokens don't map to the semantic thing the word actually denotes. The model just doesn't know what to do with certain words. My best example is Unicode symbols: sometimes you get lucky and a particular emoji actually embeds its concept in the image, but others just don't do anything.
But even though the semantic object or idea may not have come through in the output, adding any tokens at all usually drives the result SOMEWHERE different in the solution space.
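Here's a minimal sketch of what I mean, assuming the CLIP text encoder that SD 1.x conditions on (openai/clip-vit-large-patch14 via Hugging Face transformers); the prompts and the odd Unicode symbol are just placeholders I made up:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

model_id = "openai/clip-vit-large-patch14"  # text encoder used by SD 1.x
tokenizer = CLIPTokenizer.from_pretrained(model_id)
text_encoder = CLIPTextModel.from_pretrained(model_id)

def embed(prompt: str) -> torch.Tensor:
    # Tokenize the way the SD conditioning path does (pad to 77 tokens)
    tokens = tokenizer(
        prompt,
        padding="max_length",
        max_length=tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        out = text_encoder(**tokens)
    # Mean-pool the per-token hidden states into one vector (padding included,
    # which is fine for a rough comparison since both prompts pad the same way)
    return out.last_hidden_state.mean(dim=1).squeeze(0)

base = embed("a watercolor painting of a fox")
shifted = embed("a watercolor painting of a fox 🜁")  # obscure alchemical symbol

cos = torch.nn.functional.cosine_similarity(base, shifted, dim=0)
print(f"cosine similarity with vs. without the symbol: {cos.item():.4f}")
```

Even if nothing recognizably "air-like" ever shows up in the image, a similarity below 1.0 means the extra symbol still moved the conditioning the UNet sees.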
Chain-of-thought seems similar: we're generating tokens that, with luck and enough model training, might happen to correspond to a causal chain of thought matching the final action or response.
But it also might just be stuff pushing the solution somewhere inside the space...
So instead of a causal representation of the model's internal process, I'm thinking (lol) that CoT tokens cause:
1. Small shifts in the model's probability space.
2. As the CoT tokens build up, a scaffold that provides a path through the probability space toward a particular answer.
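You can see point 1 directly by comparing next-token distributions with and without a reasoning prefix. This is a minimal sketch, assuming plain GPT-2 via Hugging Face transformers as a stand-in model; the toy question and the "Reasoning:" prefix are invented for illustration:

```python
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def next_token_dist(prompt: str) -> torch.Tensor:
    # Probability distribution over the next token, conditioned on the prompt so far
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return F.softmax(logits, dim=-1)

plain = "Q: I had 5 apples and ate 2. How many are left? A:"
with_cot = ("Q: I had 5 apples and ate 2. How many are left? "
            "Reasoning: I start with 5 and remove 2, leaving 3. A:")

p = next_token_dist(plain)
q = next_token_dist(with_cot)

# KL(p || q): how far the reasoning tokens dragged the answer distribution
kl = F.kl_div(q.log(), p, reduction="sum")
print(f"KL(no-CoT || with-CoT): {kl.item():.4f}")

# Peek at the top candidate tokens in each case
for name, dist in [("no CoT", p), ("with CoT", q)]:
    top = torch.topk(dist, 3)
    toks = [tokenizer.decode(int(i)) for i in top.indices]
    print(name, list(zip(toks, [round(v.item(), 3) for v in top.values])))
```

The KL number is just a crude way to quantify how far a handful of extra tokens shifted the distribution; stack up enough of them and you've laid down the scaffold in point 2.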
And so the fact that the paths the model takes are logically coherent, usually leading to a correct answer with a usually-pretty-good causal explanation, is just an emergent property of the technology rather than a reflection of what the model 'thinks'.
Even if our tokens aren't causally faithful, they help constrain the probability space, providing guardrails that funnel the model toward a certain result.
In this sense, the model isn't "lying" - it generates what it predicts are plausible reasoning tokens for that context, regardless of what actually influenced its prediction, whether upstream in the context data or inside the model itself.