
To narrow in on the question "How does DeepSeek-R1 perform its 'reasoning' part exactly?": how exactly does the generation of the <think> step work? What would an example look like using short, made-up numerical vectors (say 4D vectors, to keep things simple)?

The prompt I am using to see how it works is something simple, but not a simple math equation:

Why do some metals rust while others do not?

Everywhere I look, the explanations talk about DeepSeek-R1:

  • "reasoning"
  • "thinking"
  • "deducing"
  • "understanding"
  • etc.

I am trying to understand exactly what is meant by these terms in this context, so I would like to know exactly how the <think> tags are generated.

After asking ChatGPT about this for several days, the clearest picture I could glean was:

  1. Tokenize prompt into numerical vectors (I got the gist of how basic LLMs generate text from prompts, written up here).
  2. Generate an initial "thought representation". Each token somehow attends to other tokens, which supposedly means the model calculates which words are important for reasoning (I don't get this part). ChatGPT says of this step: "The model learns relationships between words before generating reasoning steps. Contrast words like 'while' and 'do not' help frame an explanation." I don't get how it figures out that words and phrases "frame an explanation", or what exactly is happening at this step at a practical, vector level.
  3. Generate step-by-step reasoning. Now that attention has structured relationships, the model expands the input into multi-step reasoning (somehow?). Each token's vector is updated based on attention scores, resulting in contextualized reasoning vectors? What does that mean exactly? Then somehow, this new set of vectors is used to predict structured reasoning steps inside <think> tags.
  4. Expand the reasoning vectors. DeepSeek-R1 now predicts each step one-by-one by expanding the reasoning vectors (somehow?). This somehow involves "isolating core concepts" and "contrasting things", producing a complete multi-step explanation.

None of the information in the above steps is very useful or practical; it's still too vague. From it, I cannot picture in my head the flow of numerical vectors, and I couldn't explain what is meant by "reasoning" exactly.

Can you explain in some detail, with a basic pseudo-code/pseudo-data example using my prompt or something similar, how the "reasoning" might work to generate the <think> tags?

1 Answer

The pseudo-code you reference is simplified and only partially correct; real implementations like the DeepSeek-R1 transformer differ significantly. After the shared attention layer, the transformer has implicitly aligned the user prompt with learned patterns (akin to your referenced task_vector) that frame the task as a syntactic, logical, or linguistic type. In your case it is a contrastive reasoning task, which activates the linguistic/logic expert FFNs' parameters in the subsequent MoE layer, in addition to the chemistry and materials-science expert FFNs activated by the prompt's subject matter. The attention output embedding is then routed to these selected MoE FFNs in a weighted fashion according to the router's gating scores.
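A minimal sketch of that routing step, using made-up 4D vectors and hypothetical expert names (real experts are unlabeled, real hidden states have thousands of dimensions, and real routers are trained, not random), could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4D attention-output embedding for the token "rust"
# (invented numbers; real DeepSeek-R1 hidden states are much wider).
h_rust = np.array([0.9, -0.3, 0.5, 0.1])

# Toy "expert" FFNs, reduced to single matrices. The names are only
# illustrative: real MoE experts are not labeled, their specialization
# is learned during training.
experts = {
    "logic_contrast": rng.normal(size=(4, 4)),
    "chemistry":      rng.normal(size=(4, 4)),
    "materials":      rng.normal(size=(4, 4)),
    "syntax":         rng.normal(size=(4, 4)),
}

# Learned router: one affinity score per expert, keep the top-2,
# and normalize their scores into mixing weights.
W_router = rng.normal(size=(4, len(experts)))
scores = h_rust @ W_router
top2 = np.argsort(scores)[-2:]
weights = np.exp(scores[top2]) / np.exp(scores[top2]).sum()

# Weighted combination of the selected experts' outputs becomes the
# token's new hidden state.
names = list(experts)
out = sum(w * (experts[names[i]] @ h_rust) for w, i in zip(weights, top2))

print("routed to:", [names[i] for i in top2], "weights:", np.round(weights, 2))
print("updated 'rust' vector:", np.round(out, 3))
```

The point is that "choosing the chemistry expert" is nothing more than a matrix multiply and a top-k over its result; there is no symbolic decision that the prompt is about chemistry.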

However, concepts like "oxidation" or "protective layers" are implicitly encoded in, and emerge from, the model's weights; they are not explicit "knowledge vectors" retrieved from some embedding vector database. Likewise, the model does not explicitly group tokens into concept clusters like rusting_metals; such relationships are represented implicitly in the attention weights of the Multi-head Latent Attention (MLA) layers of its transformer blocks, which replace the standard multi-head attention (MHA) layers. MLA improves efficiency by compressing key-value pairs into low-dimensional latent vectors, significantly reducing the memory footprint of the key-value cache and thereby speeding up inference without compromising the model's performance.
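To make "represented implicitly through attention weights" concrete, here is a toy single-head scaled-dot-product attention pass over made-up 4D embeddings (all numbers and matrices are invented; the real model has many heads, many layers, and the MLA compression on top of this):

```python
import numpy as np

# Toy 4D embeddings for a few prompt tokens (entirely made up).
tokens = ["why", "do", "some", "metals", "rust", "while", "others", "do_not"]
X = np.array([
    [ 0.1,  0.0,  0.2,  0.0],   # why
    [ 0.0,  0.1,  0.0,  0.0],   # do
    [ 0.2,  0.0,  0.1,  0.1],   # some
    [ 0.8,  0.3, -0.2,  0.4],   # metals
    [ 0.9,  0.4, -0.1,  0.5],   # rust
    [-0.3,  0.7,  0.2, -0.6],   # while
    [ 0.3,  0.1,  0.0,  0.2],   # others
    [-0.4,  0.8,  0.1, -0.5],   # do_not
])

rng = np.random.default_rng(1)
W_q, W_k, W_v = (rng.normal(scale=0.5, size=(4, 4)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(4)                                   # token-to-token similarities
attn = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)   # row-wise softmax

# Each token's new vector is a weighted mix of all value vectors.
contextualized = attn @ V

i = tokens.index("rust")
print("how much 'rust' attends to each token:", dict(zip(tokens, np.round(attn[i], 2))))
print("contextualized 'rust' vector:", np.round(contextualized[i], 3))
```

A relation like rust→metals shows up only as a large entry in this attention matrix, produced by dot products between learned query/key projections; nothing is retrieved from a database, and no explicit rusting_metals cluster exists anywhere.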

Modern large language models (LLMs) often run up against memory-bandwidth and communication bottlenecks on current hardware rather than purely computational limits. Multi-head Latent Attention (MLA) addresses this by using low-rank matrices in the key-value path, so that only compressed latent key-value (KV) states need to be cached. This significantly reduces the KV cache size compared to traditional multi-head attention and thus accelerates inference.
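A rough numerical sketch of the low-rank KV idea (toy dimensions and random matrices; the real MLA also has a decoupled rotary-position branch that is omitted here):

```python
import numpy as np

rng = np.random.default_rng(2)

d_model, d_latent, n_tokens = 4, 2, 8      # toy sizes; real models are far larger

X = rng.normal(size=(n_tokens, d_model))   # hidden states of the prompt tokens

# Down-project each hidden state into a small shared latent vector ...
W_down = rng.normal(size=(d_model, d_latent))
latent = X @ W_down                        # shape (8, 2) -- this is all that gets cached

# ... and reconstruct keys and values from the cached latents only when needed.
W_up_k = rng.normal(size=(d_latent, d_model))
W_up_v = rng.normal(size=(d_latent, d_model))
K = latent @ W_up_k
V = latent @ W_up_v

print("cached floats per token:", latent.shape[1],
      "instead of", K.shape[1] + V.shape[1])
```

Only the small latent vectors live in the cache; full keys and values are recomputed on the fly, which is why the memory cost per cached token drops.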

Finally, the <think> steps are not generated via literal vector additions like your combine_vectors(); instead, they are generated autoregressively by the transformer until an end-of-reasoning marker such as </think> is predicted, just like standard LLM text generation. Each head in the MLA block specializes in a particular aspect of the contextualized attention relations: one head might track rust→iron, for instance, while another handles the "while" contrast. The final steps synthesize both the shared cross-token attention and the per-token expert-enriched transformations, which together generate a domain-coherent chain of thought (CoT) autoregressively during inference of the pretrained DeepSeek-R1.
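Putting it together, the <think> block is produced by ordinary next-token sampling in a loop. The sketch below uses a stand-in forward_pass function and made-up token ids (both are hypothetical, not DeepSeek-R1's real vocabulary or API) just to show the control flow:

```python
import numpy as np

VOCAB_SIZE = 10
END_OF_THINK = 9          # hypothetical token id standing in for "</think>"

def forward_pass(token_ids):
    """Stand-in for the full transformer (MLA attention + MoE FFN blocks).
    Returns a probability distribution over the next token."""
    rng = np.random.default_rng(len(token_ids))   # deterministic toy logits
    logits = rng.normal(size=VOCAB_SIZE)
    return np.exp(logits) / np.exp(logits).sum()

prompt_ids = [1, 4, 2, 7]              # "Why do some metals rust ..." after tokenization (made up)
generated = list(prompt_ids) + [8]     # assume 8 is the "<think>" token

# Autoregressive loop: predict one token at a time, feed everything generated
# so far back through the model, stop when "</think>" is predicted
# (or a length cap is reached).
while generated[-1] != END_OF_THINK and len(generated) < 50:
    probs = forward_pass(generated)
    next_id = int(np.argmax(probs))    # greedy decoding for simplicity
    generated.append(next_id)

print("tokens emitted inside the <think> block:", generated[len(prompt_ids) + 1:])
```

So the "reasoning" is mechanically the same next-token prediction as in any LLM; what differs is what the trained weights have learned to emit between <think> and </think> before committing to a final answer.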

cinch