Attention Is All You Need - Paper Explained

  • κ²Œμ‹œμΌ 2024. 04. 23.
  • In this video, I'll try to present a comprehensive study of Ashish Vaswani and his coauthors' renowned paper, "Attention Is All You Need".
    This paper is a major turning point in deep learning research. The transformer architecture, which was introduced in this paper, is now used in a variety of state-of-the-art models in natural language processing and beyond.
    📑 Chapters:
    0:00 Abstract
    0:39 Introduction
    2:44 Model Details
    3:20 Encoder
    3:30 Input Embedding
    5:22 Positional Encoding
    11:05 Self-Attention
    15:38 Multi-Head Attention
    17:31 Add and Layer Normalization
    20:38 Feed Forward NN
    23:40 Decoder
    23:44 Decoder in Training and Testing Phase
    27:31 Masked Multi-Head Attention
    30:03 Encoder-decoder Self-Attention
    33:19 Results
    35:37 Conclusion
    πŸ“ Link to the paper:
    arxiv.org/abs/1706.03762
    πŸ‘₯ Authors:
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin
    πŸ”— Helpful Links:
    - "Vectoring Words (Word Embeddings)" by Computerphile:
    β€’ Vectoring Words (Word ...
    - "Transformer Architecture: The Positional Encoding" by Amirhossein Kazemnejad:
    kazemnejad.com/blog/transform...
    - "The Illustrated Transformer" by Jay Alammar:
    jalammar.github.io/illustrate...
    - Lennart Svensson's Video on Masked self-attention:
    β€’ Transformers - Part 7 ...
    - Lennart Svensson's Video on Encoder-decoder self-attention:
    β€’ Transformer - Part 8 -...
    πŸ™ I'd like to express my gratitude to Dr. Nasersharif, my supervisor, for suggesting this paper to me.
    πŸ™‹β€β™‚οΈ Find me on: linktr.ee/HalflingWizard
    #Transformer #Attention #Deep_Learning
  • Science & Technology

Comments • 46

  • @BenihimeLawliet
    @BenihimeLawliet a year ago +14

    Finally an understandable video! I didn't find any other clear explanation about how the decoder works and the difference between test and train phases!
    Thank you very much, my saviour

  • @pigritor
    @pigritor a year ago +3

    Exactly the level of detail I was looking for. Not too deep, but not superficial. Great video, hoping that you live long and in a state of prosperity

  • @gossipGirlMegan
    @gossipGirlMegan 6 months ago +1

    The clearest explanation on YouTube so far.

  • @snehanjalikalamkar2268
    @snehanjalikalamkar2268 a year ago +3

    Such a great video with an excellent explanation! It was very helpful.
    Being an MCU fan, your examples played a major role in keeping me hooked on the video, haha! :D

  • @nathansmith8187
    @nathansmith8187 4 months ago

    Fantastic explanation. Most clear and concise one I've seen yet for the Attention paper.

  • @ayoubelmhamdi7920
    @ayoubelmhamdi7920 6 months ago +1

    You did a great job on the encoder part, thank you very much

  • @MaryamBibi-nu6ou
    @MaryamBibi-nu6ou a year ago

    Voila! Ecstatic about the explanation of the paper "Attention Is All You Need" in this video. 👍🏼

  • @fa4954
    @fa4954 a year ago +2

    Thanks for the very great explanation and helpful extra links. It would be great if you could share the slides too, so we can use them, refer to them, and add our notes to them.

  • @atomicitee
    @atomicitee a year ago +2

    Excellent overview, thanks so much!

  • @Haniyahmadi
    @Haniyahmadi 2 years ago +2

    Thanks for sharing the information with us, it was very informative

  • @serhattadik6402
    @serhattadik6402 a year ago +2

    Thanks a lot for this informative video. Very much appreciate your effort.

  • @helenjackson9870
    @helenjackson9870 a year ago +1

    Much clearer than my text book, thanks for sharing

  • @royalarindam
    @royalarindam 9 months ago

    This is brilliant. Thanks for sharing!

  • @Josh-di2ig
    @Josh-di2ig a year ago +4

    Thanks for a great video. I have a question. Are the Query, Key, and Value matrices exact copies of the input embeddings? And through which training process are the weight matrices learned?
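
    (A minimal NumPy sketch for context, not the video's own code; shapes follow the paper's d_model = 512 with a single head of size d_k = 64, and the random values are only placeholders. Q, K, and V are not copies of the embeddings: they are linear projections of the positionally encoded embeddings through weight matrices W_q, W_k, W_v, which are learned by backpropagation together with the rest of the model.)

      import numpy as np

      np.random.seed(0)
      seq_len, d_model, d_k = 4, 512, 64

      X = np.random.randn(seq_len, d_model)        # input embeddings (+ positional encoding)

      # W_q, W_k, W_v are learned parameters, trained end to end with the whole model
      W_q = np.random.randn(d_model, d_k) * 0.01
      W_k = np.random.randn(d_model, d_k) * 0.01
      W_v = np.random.randn(d_model, d_k) * 0.01

      Q, K, V = X @ W_q, X @ W_k, X @ W_v          # projections of X, not copies of it

      scores = Q @ K.T / np.sqrt(d_k)              # scaled dot-product attention
      weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
      Z = weights @ V                              # each row is a weighted mix of the value vectors
      print(Q.shape, K.shape, V.shape, Z.shape)    # each is (4, 64)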

  • @betonassu
    @betonassu a year ago +1

    Amazing content! Thank you :)

  • @fereshtehfeizabadi3129
    @fereshtehfeizabadi3129 2 years ago

    Thanks a lot for making this informative video!

  • @huilanzhu1562
    @huilanzhu1562 a year ago

    Great video! Very intuitive!

  • @mirabirhossain1842
    @mirabirhossain1842 a year ago +5

    Holy shit! Your explanation is just so good. Also, thanks for adding those necessary links to other materials. This is the best explanation of the transformer I have come across till now. Thank you for the detailed and carefully curated work.

  • @violinplayer7201
    @violinplayer7201 a year ago

    Thanks, this is so helpful!

  • @DrJanpha
    @DrJanpha 10 months ago

    The best so far on this subject. Well done

  • @rlv8472
    @rlv8472 a year ago +6

    Evolution Is All You Need

  • @leibai9233
    @leibai9233 2 years ago +1

    Hi Mohammad, thank you. The video is great!

  • @balavenkat8911
    @balavenkat8911 6 months ago

    Fantastic work, thanks

  • @samiulalim9708
    @samiulalim9708 3 months ago

    Fantastic explanation, calm and soft 👍

  • @amandalmia6243
    @amandalmia6243 a year ago

    Regarding the embedding vector you mentioned at 5:12, can you please elaborate more on how they get the embeddings of size 512? Thanks
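
    (For context, a minimal sketch under the paper's setup: the 512-dimensional embeddings come from a learned lookup table with one row per vocabulary token (the paper uses byte-pair encoding with a shared vocabulary of roughly 37,000 tokens); each token id simply selects its row, and the table is trained together with the rest of the model. The token ids below are made up for illustration.)

      import numpy as np

      vocab_size, d_model = 37000, 512
      # In the real model this table is a learned parameter (and, in the paper, its weights
      # are shared with the pre-softmax linear layer); random values stand in for it here.
      embedding_table = np.random.randn(vocab_size, d_model) * 0.01

      token_ids = np.array([17, 924, 3])           # hypothetical ids produced by the tokenizer
      embeddings = embedding_table[token_ids]      # one 512-d row per token
      print(embeddings.shape)                      # (3, 512)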

  • @MariamMeha
    @MariamMeha 11 months ago

    no one can explain better than this. TYSM.

  • @tahamohd1409
    @tahamohd1409 a year ago

    so great. thank you!

  • @polarbear986
    @polarbear986 a year ago

    best explanation on youtube

  • @mehmetozer692
    @mehmetozer692 a year ago +1

    Excellent tutorial. Made it easier for me to understand the paper. Still, it will take some time and effort to further comprehend.

    • @mehmetozer692
      @mehmetozer692 a year ago

      One question: at 10:25, shouldn't the encoding vector values start with sin(w_0 . t) instead of sin(w_1 . t)? And shouldn't the denominator be 10000 instead of 1000?
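
      (For reference, the formula in the paper does use 10000 in the denominator and indexes the dimensions from zero: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal sketch that computes it:)

        import numpy as np

        def positional_encoding(max_len, d_model):
            # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
            # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
            pos = np.arange(max_len)[:, None]                 # (max_len, 1)
            i = np.arange(d_model // 2)[None, :]              # (1, d_model / 2)
            angles = pos / np.power(10000.0, 2 * i / d_model)
            pe = np.zeros((max_len, d_model))
            pe[:, 0::2] = np.sin(angles)                      # even dimensions
            pe[:, 1::2] = np.cos(angles)                      # odd dimensions
            return pe

        print(positional_encoding(50, 512).shape)             # (50, 512)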

  • @Cameron_Drummer
    @Cameron_Drummer a month ago

    This video saved my module

  • @dizoner2610
    @dizoner2610 7 months ago

    Thank you very much 😊

  • @miltonborgesdasilva3263
    @miltonborgesdasilva3263 11 months ago

    what an amazing channel

  • @KogDrum
    @KogDrum 7 months ago

    I noticed a potential discrepancy in the video at 17:52.
    I would appreciate it if you could clarify this for me. It appears in the figure that the output Z from the attention block is added to the input embedding after the positional encoding, rather than directly to the input embedding itself. I may be mistaken, so I kindly ask what you think the correct interpretation is. Thank you!
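
    (That reading matches the paper's Figure 1: the positional encoding is added to the embedding before the encoder stack, and the residual connection around each sub-layer adds back that same positionally encoded input, i.e. LayerNorm(x + Sublayer(x)) where x already contains the positional encoding. A minimal sketch of the data flow, with a single-head stand-in for the real multi-head attention sub-layer and random placeholder values:)

      import numpy as np

      def self_attention(x):                       # stand-in for the multi-head attention sub-layer
          scores = x @ x.T / np.sqrt(x.shape[-1])
          w = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
          return w @ x

      embeddings = np.random.randn(4, 512)         # token embeddings
      pos_enc = np.random.randn(4, 512)            # positional encodings (placeholder values)

      x = embeddings + pos_enc                     # x already includes the positional encoding ...
      z = self_attention(x)
      residual = x + z                             # ... and it is this x that the residual adds back
      # layer normalization of `residual` follows, giving the "Add & Norm" block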

  • @TJVideoChannelUTube
    @TJVideoChannelUTube a year ago

    In the Transformer model, only these layer types are involved in the deep learning / contain trainable parameters, and only (3) has activation functions:
    (1). Word Embedding Layer;
    (2). Weight matrices for K, V, Q;
    (3). Feed Forward Layer or Fully Connected Layer.
    Correct?
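
    (Broadly yes, with a few additions, assuming "Transformer" means the original encoder-decoder model from the paper: besides (1)-(3), each multi-head attention sub-layer also has an output projection W_O, each layer normalization has a learned gain and bias, and the final pre-softmax linear layer is trainable; in the paper its weights are shared with the embedding layers. The sinusoidal positional encoding has no trainable parameters. A rough per-layer inventory as a sketch, using the base model's sizes:)

      # Trainable tensors in ONE encoder layer of the base model
      # (d_model = 512, h = 8 heads, d_ff = 2048); the embedding table and the
      # final pre-softmax linear layer live outside the layer stack.
      d_model, h, d_ff = 512, 8, 2048
      d_k = d_model // h                                     # 64

      per_layer = {
          "W_q, W_k, W_v (one of each per head)":   (d_model, d_k),
          "W_o (concatenated heads -> d_model)":    (h * d_k, d_model),
          "FFN linear 1 (followed by ReLU)":        (d_model, d_ff),
          "FFN linear 2":                           (d_ff, d_model),
          "LayerNorm gain and bias (two norms)":    (d_model,),
      }
      for name, shape in per_layer.items():
          print(f"{name}: {shape}")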

  • @me4447
    @me4447 a year ago

    Doesn't the LayerNorm here happen separately for each token ("word"), that is, separately for "Popcorn" and "Popped" in your video?
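
    (Yes: layer normalization in the Transformer is applied over the feature dimension of each position independently, so each token's 512-dimensional vector is normalized with its own mean and variance. A minimal sketch, with the learned gain and bias omitted:)

      import numpy as np

      def layer_norm(x, eps=1e-6):
          # each row (each token's 512-d vector) uses its OWN mean and variance
          mean = x.mean(axis=-1, keepdims=True)
          var = x.var(axis=-1, keepdims=True)
          return (x - mean) / np.sqrt(var + eps)

      tokens = np.random.randn(2, 512)             # e.g. the vectors for "Popcorn" and "Popped"
      normed = layer_norm(tokens)
      print(normed.mean(axis=-1))                  # ~[0, 0]: statistics are computed per token
      print(normed.std(axis=-1))                   # ~[1, 1]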

  • @PasseScience
    @PasseScience 2 years ago +2

    Hi, thanks for the video! There are several things that are still unclear to me. First, I do not understand well how the architecture is dynamic with respect to the size of the input. I mean, what changes structurally when we change the size of the input: are there some inner parts that are repeated in parallel, or does this architecture fix a maximum window size that we hope will be larger than any input sequence?
    The other question is the most important one: it seems every explanation of the transformer architecture I have found so far focuses on what we WANT a self-attention or attention layer to do, but never says a word about WHY, after training, those attention layers will do, by emergence, what we expect them to do. I guess it has something to do with the chosen structure of the data at the input and output of those layers, as well as the data flow that is imposed, but I have not yet had the revelation.
    If you could help me with those, that would be great!

    • @PasseScience
      @PasseScience 2 years ago +1

      @@HalflingWizard Hello, thanks for your answer!
      I think I got what I wanted by looking again at what you explained for self-attention (or attention). Each word embedding is replaced by a linear combination of every word embedding, so the only thing the NN can do when learning is adapt the weights of that linear combination, and thus it is natural to expect it to mix relevant pieces of information together and forget what has no relationship. I think the skip connection around the attention block is also relevant to what we want, in the sense that it introduces an asymmetry toward the reference word and more or less forces the rest of the linear combination to carry the same kind of abstract information as the input. In a nutshell, it seems quite clear that this structure of information forces the attention layer's work to be an attention mechanism.
      Could you confirm that the feed-forward net after the attention blocks does not cross pieces of information from one embedding to another? (i.e. the same feed-forward net is applied independently to each input embedding, which seems to be what you are saying). If yes, I do not get exactly how it relates to the examples of shallow etc... patterns you give just after.

    • @PasseScience
      @PasseScience 2 years ago +2

      @@HalflingWizard Oh yes, good point. Indeed, if you have horizontal (spatial) mixing and then vertical (channel) mixing, you in fact get mixing of more or less any kind. So, in a nutshell, if I input the sentence "My little sister is drawing a *shape* with *4 sides* of the *same length* and *4 right angles*", the attention layer allows the pieces of information I put in bold to be gathered into a single embedding, and then the channel-mixing FFN can transform that into a single concept, say "a square", for the next layers. That makes sense. We come back to the fact that the attention encoder stack is a clever way to get a fully connected network, but in a very economical and symmetrical way. Thanks a lot.
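
      (To make the "spatial mixing, then channel mixing" picture concrete, a minimal sketch with random placeholder weights: self-attention mixes information across positions, while the position-wise FFN applies the same two weight matrices to every position independently, so no information crosses positions inside it:)

        import numpy as np

        np.random.seed(0)
        seq_len, d_model, d_ff = 5, 512, 2048
        X = np.random.randn(seq_len, d_model)

        # "Spatial" mixing: self-attention combines information ACROSS positions.
        scores = X @ X.T / np.sqrt(d_model)
        A = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
        mixed = A @ X                                 # row i is a weighted mix of all rows of X

        # "Channel" mixing: the position-wise FFN transforms each position on its own,
        # applying the same two weight matrices at every position.
        W1, b1 = np.random.randn(d_model, d_ff) * 0.01, np.zeros(d_ff)
        W2, b2 = np.random.randn(d_ff, d_model) * 0.01, np.zeros(d_model)
        ffn = np.maximum(0, mixed @ W1 + b1) @ W2 + b2

        # Applying the FFN to one position alone gives the same result for that position:
        one = np.maximum(0, mixed[2:3] @ W1 + b1) @ W2 + b2
        print(np.allclose(ffn[2:3], one))             # True: the FFN never mixes positions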

  • @user-um4xc9dz8o
    @user-um4xc9dz8o 3 months ago

    Were these slides made with Beamer? I like the theme. Can you share it?

  • @amazingpatrick4659
    @amazingpatrick4659 a year ago

    Should the dimensions of Wq, Wk, and Wv be n * n, where n equals the length of the sequence? If they are square matrices, then the matrices in the video are wrong.
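
    (For what it's worth, in the paper the projection matrices do not depend on the sequence length n at all: each of Wq, Wk, Wv maps d_model = 512 to d_k = d_v = 64 per head, so they are 512 x 64 rather than n x n; only the attention-score matrix Q K^T has shape n x n. A minimal sketch showing the same weights working for two different sequence lengths:)

      import numpy as np

      d_model, d_k = 512, 64
      W_q = np.random.randn(d_model, d_k)          # 512 x 64, independent of sequence length

      for n in (3, 7):                             # the same weights work for any sequence length
          X = np.random.randn(n, d_model)
          Q = X @ W_q
          print(n, Q.shape)                        # (3, 64) then (7, 64)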

  • @lyndenchang5637
    @lyndenchang5637 a year ago +1

    Hi, can you provide the slides?

  • @muhammadtanveer8045
    @muhammadtanveer8045 2 years ago

    Hi, the video is good, but the problem is that the on-screen English translation of your voice hides most of the model details. Please remove this English translation display so that viewers can fully understand the concepts visually.

  • @shoaibshoobi7131
    @shoaibshoobi7131 7 months ago

    Hello Muhammad, you are great

  • @KumR
    @KumR 3 months ago

    20

  • @googleyoutubechannel8554

    It's funny: all 20 YT videos on this topic are of one of two types. The first type focuses deeply on the steps and mechanics of how the binary data is transformed in detailed 'xyz' way so as to be amenable to GPU computation at step 'foo', with too much detail, offering 'reasons' for each step that are surface level, don't follow, or are just straight up wrong, and that never build up to a coherent picture of the nature of 'attention'. All the 'deep' videos are also too lazy to make better diagrams, so they use the crap ones from the paper too, which I thought was pretty funny (seriously, all of them!). The second type goes the other way, giving a very surface-level description of 'attention' but failing to explain the critical dynamics necessary to reason about this technique.
    There are basically _no videos_ in the middle; this one is basically the first type, and the paper itself is an example of the first type as well, so I'm honestly wondering whether anyone at all understands how to reason about 'attention'.
    I'm wondering if the technique just sort of worked out after trying who knows how many schemes: the researchers have some idea of what's going on, but they don't understand the core dynamics either, so they invented a concept called 'attention', but it's not a very helpful framework.