Hello, I hope you're having a great day.
This question is a bit theoretical, and I'm struggling to find a full answer.
Could you explain how projecting embedded tokens into queries, keys, and values, taking the softmax of (queries × keys) / sqrt(d_k), multiplying each value vector by its weight and summing them, concatenating those weighted values and projecting them into a new vector, feeding that into a neural network, and repeating the same operation again and again gives computers the ability to understand, reason, and think like a human?
If you want to understand how LLMs work, I highly recommend reading and working through “Build A Large Language Model (From Scratch)” by Sebastian Raschka. I was a technical reviewer for this book, and it explains the attention mechanism (along with many other components and concepts) with step-by-step guidance.
+1 on Sebastian Raschka’s book
Hey @bachirr, welcome! That’s honestly one of the biggest open questions in AI right now.
The short answer: We don’t fully understand how attention leads to reasoning and understanding. There are entire research teams (like Anthropic’s interpretability team) trying to figure this out!
What we do know about attention mechanics:
The QK multiplication works because of vector dot products: for unit-length vectors, you get ~0 when they’re perpendicular and close to 1 when they’re nearly parallel. This lets the model compare tokens in the projected Q, K space and pick up relationships like “it” referring back to “cat.”
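To make that concrete, here’s a toy sketch of the similarity idea. All the vectors and names (`q_it`, `k_cat`, `k_sat`) are made up for illustration; there are no trained weights involved:

```python
# Toy demo: QK dot products act as similarity scores.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Pretend these are projected query/key vectors (d_k = 2).
q_it  = np.array([1.0, 0.0])   # query for the token "it"
k_cat = np.array([0.9, 0.1])   # key roughly parallel to q_it
k_sat = np.array([0.0, 1.0])   # key perpendicular to q_it

print(q_it @ k_cat)  # high similarity (~0.9)
print(q_it @ k_sat)  # no similarity (0.0)

# Scaled scores -> softmax -> attention weights over {cat, sat}
d_k = 2
scores = np.array([q_it @ k_cat, q_it @ k_sat]) / np.sqrt(d_k)
weights = softmax(scores)
print(weights)  # "it" puts most of its attention weight on "cat"
```

The softmax turns those raw similarity scores into a probability distribution, which is exactly the set of weights used to mix the value vectors.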
The deeper mystery - how does this scale to reasoning?
- Multi-head specialization: Each attention head seems to focus on different “threads” of context (like tracking different characters in a story), with each layer handling different levels of abstraction
- Feature composition: Early layers learn simple patterns, deeper layers combine them into increasingly abstract concepts
- Post-training is crucial: The reasoning capabilities you see aren’t just from architecture! They emerge heavily from post-training techniques like RL and preference tuning, not just pattern matching from pretraining data
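Putting the mechanics from the question together, here’s a minimal sketch of one multi-head attention step. The weights are random, so the output is meaningless; this only shows the shape of the computation, not anything a trained model would produce:

```python
# Sketch of one attention step: project -> softmax(QK^T / sqrt(d_k)) V
# per head -> concatenate heads -> output projection.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2
d_k = d_model // n_heads

x = rng.normal(size=(seq_len, d_model))  # embedded tokens
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Q, K, V = x @ Wq, x @ Wk, x @ Wv
heads = []
for h in range(n_heads):                  # each head works in its own d_k-dim subspace
    sl = slice(h * d_k, (h + 1) * d_k)
    w = softmax(Q[:, sl] @ K[:, sl].T / np.sqrt(d_k))  # (seq, seq) attention weights
    heads.append(w @ V[:, sl])            # weighted sum of value vectors

out = np.concatenate(heads, axis=-1) @ Wo  # stack heads, project again
print(out.shape)  # (4, 8): same shape as the input, ready for the feed-forward layer
```

Because each head gets its own slice of the projected space, nothing forces head 1 to track one kind of relationship; any specialization emerges from training, which is part of why interpretability is hard.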
But honestly? We built an engine that works amazingly well, but we’re still reverse-engineering exactly why it works so well. The field needs people asking these fundamental questions.
If you’re interested in the post-training side (which is where a lot of the “intelligence” gets added), there are some great recent examples, like how DeepSeek-R1 used GRPO to push reasoning capabilities beyond what seemed possible before. I also put together a visual guide to post-training techniques that breaks down the decision tree for choosing between different approaches like SFT, DPO/APO, and RL.
Your link 404’d, but I’d like to view your resource. Are you also getting that issue?
Should work now.
Let me know if you still have issues viewing it.
I would say it is a very good imitation of human “thinking” and reasoning, but our processes are more complex. As for the transformer architecture, it came out of empirical observation and experimentation. The idea behind the attention mechanism is that each token attends to every other token. It was never designed with the intent that, say, head no. 1 would be responsible for the relationship between verbs and nouns, though if you look at the Q·K activations, you can get some idea of what relationship a head might be trying to infer.