Attention mechanism

Hello, I hope you are having a great day.
This question is a bit theoretical, and I am struggling to find a full answer.
Could you explain how projecting embedded tokens to queries, keys, and values, computing softmax(Q·Kᵀ / sqrt(dk)), multiplying each value vector by its weight and summing them, stacking the resulting vectors and projecting them again to a new vector, forwarding that through a neural network, and repeating this operation again and again gives computers the ability to understand, reason, and think like a human?
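To make sure I have the mechanics right, here is a minimal numpy sketch of the operation I mean (single head only, made-up dimensions, no masking or multi-head splitting):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy setup: 4 tokens, embedding dim 8, key/query dim 4 (arbitrary sizes).
rng = np.random.default_rng(0)
d_model, d_k, n_tokens = 8, 4, 4
X = rng.normal(size=(n_tokens, d_model))   # embedded tokens
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v        # project to queries/keys/values
weights = softmax(Q @ K.T / np.sqrt(d_k))  # softmax(Q·Kᵀ / sqrt(dk))
out = weights @ V                          # weighted sum of the value vectors

print(weights.shape, out.shape)            # each row of `weights` sums to 1
```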

1 Like

If you want to understand how LLMs work, I highly recommend reading and working through “Build A Large Language Model (From Scratch)” by Sebastian Raschka. I was a technical reviewer for this book, and it explains the attention mechanism (along with many other components and concepts) with step-by-step guidance.

3 Likes

+1 on Sebastian Raschka’s book

2 Likes

Hey @bachirr, welcome! That’s honestly one of the biggest open questions in AI right now.

The short answer: We don’t fully understand how attention leads to reasoning and understanding. There are entire research teams (like Anthropic’s interpretability team) trying to figure this out!

What we do know about attention mechanics:
The QK multiplication works because of vector dot products - you get ~0 when vectors are perpendicular and a large positive value when they point in the same direction (for unit vectors, up to 1 when exactly parallel). This lets the model compare tokens in the projected Q,K space and figure out relationships like “it” referring back to “cat.”
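That dot-product behaviour is easy to see with toy vectors (not real embeddings, just an illustration):

```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])  # perpendicular to a
c = np.array([2.0, 0.0])  # points the same way as a

print(a @ b)  # 0.0 -> unrelated directions score zero
print(a @ c)  # 2.0 -> aligned directions score high
```

Attention just does this comparison for every query against every key at once, then softmax turns the scores into weights.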

The deeper mystery - how does this scale to reasoning?

  • Multi-head specialization: Each attention head seems to focus on different “threads” of context (like tracking different characters in a story), with each layer handling different levels of abstraction
  • Feature composition: Early layers learn simple patterns, deeper layers combine them into increasingly abstract concepts
  • Post-training is crucial: The reasoning capabilities you see aren’t just from architecture! They emerge heavily from post-training techniques like RL and preference tuning, not just pattern matching from pretraining data
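The multi-head point above can be sketched in a few lines of numpy (toy sizes, random weights, no output projection or masking): each head runs the same attention computation, but in its own low-dimensional subspace, so heads are free to track different relationships.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Made-up sizes: 4 tokens, model dim 8, 2 heads of dim 4 each.
rng = np.random.default_rng(1)
n_tokens, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def split_heads(M):
    # (tokens, d_model) -> (heads, tokens, d_head)
    return M.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

# Every head attends over the same tokens, but in its own subspace.
weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head))
per_head = weights @ V                                        # (heads, tokens, d_head)
out = per_head.transpose(1, 0, 2).reshape(n_tokens, d_model)  # concatenate heads
print(out.shape)
```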

But honestly? We built an engine that works amazingly well, but we’re still reverse-engineering exactly why it works so well. The field needs people asking these fundamental questions.

If you’re interested in the post-training side (which is where a lot of the “intelligence” gets added), there are some great recent examples, like how DeepSeek-R1 used GRPO to push reasoning capabilities beyond what seemed possible before. I also composed a visual guide to post-training techniques that breaks down the decision tree for choosing between different approaches like SFT, DPO/APO vs RL.

2 Likes

Your link 404’d, but I’d like to view your resource. Are you also getting that issue?

1 Like

Should work now.
Let me know if you still have issues viewing it.

1 Like

I would say that it is a very good imitation of human “thinking” and reasoning, but our processes are more complex. As for the transformer architecture, it is the product of empirical observation and experimentation. There is an idea behind the attention mechanism, which is that each token attends to every other token. It was never designed from the idea that, say, head no. 1 would be responsible for the relationship between verbs and nouns; still, if you look into the activations of Q·Kᵀ, you can derive some ideas about what relationship a head might be trying to infer.

2 Likes