Attention mechanism

Hello, I hope you are having a great day.
This question is a bit theoretical, and I am struggling to find a full answer.
Could you explain how projecting embedded tokens to queries, keys, and values, computing softmax(Q·Kᵀ / sqrt(dk)), multiplying each value vector by its weight and summing them, stacking the resulting vectors and projecting them again to a new vector, forwarding that through a neural network, and repeating this operation again and again gives computers the ability to understand, reason, and think like a human?
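To make sure I have the mechanics right, here is a minimal numpy sketch of the operation I mean (single head only, made-up dimensions, no masking or multi-head splitting):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy setup: 4 tokens, embedding dim 8, key/query dim 4 (arbitrary sizes).
rng = np.random.default_rng(0)
d_model, d_k, n_tokens = 8, 4, 4
X = rng.normal(size=(n_tokens, d_model))   # embedded tokens
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v        # project to queries/keys/values
weights = softmax(Q @ K.T / np.sqrt(d_k))  # softmax(Q·Kᵀ / sqrt(dk))
out = weights @ V                          # weighted sum of the value vectors

print(weights.shape, out.shape)            # each row of `weights` sums to 1
```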

1 Like

If you want to understand how LLMs work, I highly recommend reading and working through “Build A Large Language Model (From Scratch)” by Sebastian Raschka. I was a technical reviewer for this book, and it explains the attention mechanism (along with many other components and concepts) with step-by-step guidance.

3 Likes

+1 on Sebastian Raschka’s book

2 Likes

Hey @bachirr, welcome! That’s honestly one of the biggest open questions in AI right now.

The short answer: We don’t fully understand how attention leads to reasoning and understanding. There are entire research teams (like Anthropic’s interpretability team) trying to figure this out!

What we do know about attention mechanics:
The QK multiplication works because of vector dot products - you get ~0 when vectors are perpendicular and a large positive value when they point in the same direction (for unit vectors, up to 1 when exactly parallel). This lets the model compare tokens in the projected Q,K space and figure out relationships like “it” referring back to “cat.”
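That dot-product behaviour is easy to see with toy vectors (not real embeddings, just an illustration):

```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])  # perpendicular to a
c = np.array([2.0, 0.0])  # points the same way as a

print(a @ b)  # 0.0 -> unrelated directions score zero
print(a @ c)  # 2.0 -> aligned directions score high
```

Attention just does this comparison for every query against every key at once, then softmax turns the scores into weights.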

The deeper mystery - how does this scale to reasoning?

  • Multi-head specialization: Each attention head seems to focus on different “threads” of context (like tracking different characters in a story), with each layer handling different levels of abstraction
  • Feature composition: Early layers learn simple patterns, deeper layers combine them into increasingly abstract concepts
  • Post-training is crucial: The reasoning capabilities you see aren’t just from architecture! They emerge heavily from post-training techniques like RL and preference tuning, not just pattern matching from pretraining data
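The multi-head point above can be sketched in a few lines of numpy (toy sizes, random weights, no output projection or masking): each head runs the same attention computation, but in its own low-dimensional subspace, so heads are free to track different relationships.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Made-up sizes: 4 tokens, model dim 8, 2 heads of dim 4 each.
rng = np.random.default_rng(1)
n_tokens, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def split_heads(M):
    # (tokens, d_model) -> (heads, tokens, d_head)
    return M.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

# Every head attends over the same tokens, but in its own subspace.
weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head))
per_head = weights @ V                                        # (heads, tokens, d_head)
out = per_head.transpose(1, 0, 2).reshape(n_tokens, d_model)  # concatenate heads
print(out.shape)
```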

But honestly? We built an engine that works amazingly well, but we’re still reverse-engineering exactly why it works so well. The field needs people asking these fundamental questions.

If you’re interested in the post-training side (which is where a lot of the “intelligence” gets added), there are some great recent examples, like how DeepSeek-R1 used GRPO to push reasoning capabilities beyond what seemed possible before. I also composed a visual guide to post-training techniques that breaks down the decision tree for choosing between different approaches like SFT, DPO/APO vs RL.

2 Likes

Your link 404’d, but I’d like to view your resource. Are you also getting that issue?

1 Like

Should work now.
Let me know if you still have issues viewing it.

1 Like

I would say that it is a very good imitation of human “thinking” and reasoning, but our processes are more complex. As for the transformer architecture, it is the product of empirical observation and experimentation. There is an idea behind the attention mechanism, which is that each token attends to every other token. It was never designed from the idea that, say, head no. 1 would be responsible for the relationship between verbs and nouns; still, if you look into the activations of Q·Kᵀ, you can derive some ideas about what relationship a head might be trying to infer.

2 Likes