Multi Head Attention in Transformer Neural Networks with Code!

  • κ²Œμ‹œμΌ 2024. 03. 27.
  • Let's talk about multi-head attention in transformer neural networks
    Let's understand the intuition, math and code of Self Attention in Transformer Neural Networks
    ABOUT ME
    ⭕ Subscribe: krplus.net/uCodeEmporiu...
    📚 Medium Blog: / dataemporium
    💻 Github: github.com/ajhalthor
    👔 LinkedIn: / ajay-halthor-477974bb
    RESOURCES
    [1 🔎] Code for video: github.com/ajhalthor/Transfor...
    [2 🔎] Transformer Main Paper: arxiv.org/abs/1706.03762
    [3 🔎] Bidirectional RNN Paper: deeplearning.cs.cmu.edu/F20/d...
    PLAYLISTS FROM MY CHANNEL
    ⭕ ChatGPT Playlist of all other videos: • ChatGPT
    ⭕ Transformer Neural Networks: • Natural Language Proce...
    ⭕ Convolutional Neural Networks: • Convolution Neural Net...
    ⭕ The Math You Should Know: • The Math You Should Know
    ⭕ Probability Theory for Machine Learning: • Probability Theory for...
    ⭕ Coding Machine Learning: • Code Machine Learning
    MATH COURSES (7 day free trial)
    📕 Mathematics for Machine Learning: imp.i384100.net/MathML
    📕 Calculus: imp.i384100.net/Calculus
    📕 Statistics for Data Science: imp.i384100.net/AdvancedStati...
    📕 Bayesian Statistics: imp.i384100.net/BayesianStati...
    📕 Linear Algebra: imp.i384100.net/LinearAlgebra
    📕 Probability: imp.i384100.net/Probability
    OTHER RELATED COURSES (7 day free trial)
    📕 ⭐ Deep Learning Specialization: imp.i384100.net/Deep-Learning
    📕 Python for Everybody: imp.i384100.net/python
    📕 MLOps Course: imp.i384100.net/MLOps
    📕 Natural Language Processing (NLP): imp.i384100.net/NLP
    📕 Machine Learning in Production: imp.i384100.net/MLProduction
    📕 Data Science Specialization: imp.i384100.net/DataScience
    📕 Tensorflow: imp.i384100.net/Tensorflow
    TIMESTAMPS
    0:00 Introduction
    0:33 Transformer Overview
    2:32 Multi-head attention theory
    4:35 Code Breakdown
    13:47 Final Coded Class

Comments • 76

  • @Dhanush-zj7mf · 6 months ago · +8

    We are very much fortunate to have all this for free. Thank You.

  • @barni_7762 · 11 months ago · +9

    Wow! I have watched a few other transformer explanation videos (they were shorter and yet tried to cover more content) and I honestly didn't understand anything. Your video, on the other hand, was crystal clear, and not only do I now understand how every part works, but I also have an idea of WHY it is there. Also, you were super specific about the details that are otherwise left out. Great work!

  • @romainjouhameau2764 · 10 months ago · +2

    Very well explained. I really enjoy this mix between explanations and your code examples.
    Your videos are the best resources to learn about transformers.
    Really thankful for your work! Thanks a lot.

  • @ajaytaneja111 · 1 year ago · +2

    Ajay, I'm currently on holiday and have been watching your Transformer videos on my mobile while having my evening coffee with my mom! I have been doing this for the past 3 to 4 days. Today my mom, who seemed so impressed with your oratory skills, asked me if I could also lecture on a subject as spontaneously as the Ajay in the video! Now you've started giving me a complex, dude! Ha ha.

    • @CodeEmporium · 1 year ago · +1

      Hahahaha. Thanks to you and your mom for the kind words! And sorry for the tough spot :) Maybe you should show her some of your blogs since you’re pretty good at writing yourself

  • @marcviolides1565 · 18 days ago

    Good job Ajay! Best explanation I have seen so far!

  • @ulassbingol · 4 months ago

    This was one of the best explanations of multi-attention. Thanks for your effort.

  • @user-fe2mj9ze5v · 5 months ago

    Great work. One of the clearest explanations of Multi-Head Attention ever.

  • @user-wr4yl7tx3w · 1 year ago · +4

    Exactly the type of content needed. Thanks!

    • @CodeEmporium · 1 year ago

      You are so welcome! Thanks for watching!

  • @amiralioghli8622 · 6 months ago

    Thank you so much for taking the time to code and explain the transformer model in such detail. I followed your series from zero to hero. You are amazing, and if possible please do a series on how transformers can be used for time series anomaly detection and forecasting. It is extremely needed on YouTube; someone should cover it!

  • @prashantlawhatre7007 · 1 year ago · +1

    ❤❤ Loving this series on Transformers.

    • @CodeEmporium · 1 year ago

      Thanks so much for commenting and watching! I really appreciate it

  • @vivekmettu9374 · 8 months ago

    Absolutely loved your explanation. Thank you for contributing!!

  • @rajv4509 · 10 months ago

    Brilliant stuff! Thanks for the time & effort you have put in to create these videos ... thank you so much :)

  • @user-mo2wj2zu5d · 11 months ago · +1

    Exactly the content I needed. Thanks very much.

  • @ayoghes2277 · 11 months ago

    Thank you for making this video, Ajay!!

    • @CodeEmporium · 11 months ago

      My pleasure! Hope you enjoy the rest of the series!

  • @DouglasASean · 10 months ago

    Thanks for your work, much needed right now.

  • @paull923 · 1 year ago · +2

    interesting and useful

  • @surajgorai618 · 1 year ago

    Very rich content as always.. Thanks for sharing

    • @CodeEmporium · 1 year ago

      Thanks so much for commenting and watching!

  • @Slayer-dan · 1 year ago · +5

    You never, never disappoint, bro. Thank you very, very much!

    • @CodeEmporium · 1 year ago · +1

      Thanks for the kind words and the support :)

  • @simonebonato5881 · 7 months ago

    Outstanding video and clear explanation!

    • @CodeEmporium · 7 months ago

      Thanks so much! Real glad this is helpful

  • @saikiranbondi6868 · 11 months ago

    You are wonderful, my brother; your way of explaining is so good.

  • @prashantlawhatre7007 · 11 months ago · +1

    At 5:44, we should also set `bias=False` in nn.Linear().
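
    A minimal sketch of that suggestion, assuming a qkv_layer like the one in the video (the exact dimensions here are illustrative):

    ```python
    import torch.nn as nn

    d_model = 512  # illustrative; the video's input_dim may differ

    # Projection producing concatenated q, k, v features without bias terms,
    # matching the paper's formulation of the projections as plain matrix multiplications.
    qkv_layer = nn.Linear(d_model, 3 * d_model, bias=False)
    ```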

  • @jiefeiwang5330 · 9 months ago · +3

    Really nice explanation! Just a small catch: at 13:25, I believe you need to permute the variable "values" from size [1, 8, 4, 64] to [1, 4, 8, 64] before reshaping it (line 71). Otherwise, you are combining the same part of each head from multiple words, rather than combining multiple head parts from the same word.
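
    A small sketch of the reordering this comment describes, using the shapes mentioned above ([1, 8, 4, 64] -> [1, 4, 512]):

    ```python
    import torch

    batch_size, num_heads, seq_len, head_dim = 1, 8, 4, 64
    values = torch.randn(batch_size, num_heads, seq_len, head_dim)  # per-head attention output

    # Move the head axis next to head_dim before flattening, so that each word's
    # heads are concatenated together rather than mixing heads across words.
    values = values.permute(0, 2, 1, 3).contiguous()                    # [1, 4, 8, 64]
    values = values.reshape(batch_size, seq_len, num_heads * head_dim)  # [1, 4, 512]
    ```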

  • @raphango · 10 months ago

    Thanks very much again! πŸ˜„

  • @prasenjitgiri919 · 10 months ago

    Thanks for the effort you have put in, much appreciated. But will you please explain the start token? It leaves an understanding gap for me.

  • @yanlu914 · 1 year ago

    After getting the values, I think we should permute them first, like before, and then reshape.

  • @chenmargalit7375 · 8 months ago · +1

    Hi, thanks for the great series! Something I don't understand, and I'd love to hear your opinion:
    You say the initial input is a one-hot encoded vector the size of the sequence length. Let's say my vocab is 1000 (all the words I want to support) and the sequence length is 30. How do I represent one word out of 1000 in a 30-length vector? The index where I put the 1 will not be correct, as the word might actually be at position 500 in the real vocab tensor.
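
    A minimal sketch of the usual setup, which may clear up the question (the vocab of 1000 and sequence length of 30 are taken from the comment): the one-hot/embedding axis runs over the vocabulary, while the sequence length only counts how many tokens are fed in.

    ```python
    import torch
    import torch.nn as nn

    vocab_size, seq_len, d_model = 1000, 30, 512

    # Each of the 30 positions holds a token index in [0, vocab_size),
    # so a word's "one-hot" vector would be 1000-dimensional, not 30-dimensional.
    token_ids = torch.randint(0, vocab_size, (1, seq_len))   # e.g. tensor([[517, 42, 999, ...]])

    embedding = nn.Embedding(vocab_size, d_model)             # equivalent to one-hot @ embedding matrix
    x = embedding(token_ids)                                  # [1, 30, 512]
    ```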

  • @kollivenkatamadhukar5059 · 5 months ago

    Where can I get the theory part? It is good that you are explaining the code, but can you share any link where we can read the theory as well?

  • @ivantankoua9286 · 6 months ago

    Thanks!

  • @stanislavdidenko8436 · 1 year ago · +3

    Maybe you have to divide 1536 by 3 first, and then by 8. But you do it by 8 first and then by 3, which sounds like you mix up the q, k, v vector dimensions.

    • @oussamawahbi4976 · 11 months ago

      Good point, but I think that because the parameters that generate q, k, v are learned, it doesn't matter which you divide by first. I could be wrong though.

  • @creativeuser9086 · 10 months ago

    What about the weights for K, V, Q for each head, as well as the output?

  • @seddikboudissa8668 · 11 days ago

    Hello, good job, but I have a small point of confusion: in the transformer paper they compute a different key and query for each head, whereas here you split the key and query so that each head takes one slice. What's the difference between the two approaches?
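
    A rough sketch of why the two formulations coincide (the layer names here are illustrative): each 64-dimensional slice of one big projection's output plays the role of one head's separate W_i^Q from the paper.

    ```python
    import torch
    import torch.nn as nn

    d_model, num_heads = 512, 8
    head_dim = d_model // num_heads  # 64

    # Paper's description: one small query projection per head.
    per_head_q = nn.ModuleList(nn.Linear(d_model, head_dim, bias=False) for _ in range(num_heads))

    # Common implementation style (as in the video): one big projection, split afterwards.
    big_q = nn.Linear(d_model, d_model, bias=False)

    x = torch.randn(2, 10, d_model)
    q_from_split = big_q(x).reshape(2, 10, num_heads, head_dim)             # [2, 10, 8, 64]
    q_from_separate = torch.stack([proj(x) for proj in per_head_q], dim=2)  # [2, 10, 8, 64]
    # Same shapes and the same family of functions; the split version is just one batched matmul.
    ```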

  • @superghettoindian01 · 1 year ago

    You are incredible. I've seen a good chunk of your videos and wanted to thank you from the bottom of my heart! With your content, I feel like maybe even an idiot like me can understand it (one day, maybe? 🤔)!
    I hope you enjoy a lot of success!

    • @CodeEmporium · 1 year ago

      Super kind words. Thank you so much! I'm sure you aren't an idiot, and I hope we can all learn together!

  • @xdhanav5449 · 28 days ago

    Wow, this is a very intuitive explanation! I have a question though. From my understanding, attention helps the encoder and decoder blocks in the transformer understand which words that came before (or sometimes after) will have a strong impact on the generation of the next word, through the feed-forward neural network and other processes.

    Given a sentence like "The cook is always teaching the assistant new techniques and giving her advice.", what is a method I could implement to determine the pronoun-profession relationships, so as to understand that "cook" is not paired with "her", but rather "assistant" is? I have tried two methods so far: 1. Using the pretrained contextual embeddings from BERT. 2. (Relating to this video) I thought I could almost reverse-engineer the attention mechanism by creating an attention vector to understand which pronoun-profession pair WOULD be relevant, through self-attention. However, this method did not work that well either (better than method 1), and I believe this is because the sentence structures are very nuanced, so the attention process is not actually understanding the grammatical relationships between words in the sentence.

    How could I achieve this: a method that could determine which of the two professions in a sentence like the one above is referenced by the pronoun? I hope you can see why I thought that using an attention matrix would be beneficial here, because the attention would explain which profession was more important in deciding whether the pronoun would be "he" or "her". This is a brief description of what I am trying to do, so if you can, I could elaborate more about this over email or something else. Thank you in advance for your help, and thanks a million for your amazing explanations of transformer processes!

    • @xdhanav5449 · 28 days ago

      I would like to add that in my attention-based approach, I don't actually create query, key, and value vectors. I take the embeddings, do the dot product, scale it, and use softmax to convert it into a probability distribution. Possibly this is where my approach goes wrong. The original embeddings of the words in the sentence are created with BERT, so there should already be positional encoding and the other relevant pieces in the embeddings.

  • @fayezalhussein7115 · 11 months ago

    Could you please explain, with code, how to implement a hybrid model (vision transformer + CNN) for an image classification task?

  • @pi5549 · 1 year ago · +1

    5:05 Why separate variables for input_dim (embedding dimension IIUC) and d_model? Aren't these always going to be the same? Would we ever want this component to spit out a contextualized-wordVector that's a different length from the input wordVector?

    • @oussamawahbi4976 · 11 months ago

      I have the same question, and I assume that most of the time input_dim should equal d_model in order to have a consistent vocabulary between the input and the output.

    • @ShawnMorel · 9 months ago

      My understanding is that it sets you up to be able to choose different hyper-parameters, e.g. if you want a smaller input word embedding size but a larger internal representation. Table 3 of the original transformers paper shows a few different combinations of these parameters: arxiv.org/pdf/1706.03762.pdf

  • @creativityoverload2049 · 6 months ago

    From what I understand, query, key and value are representations of the embedded word after positional encoding, each with a different purpose. But why are we dividing them into multiple heads in the first place, with 64 dimensions each, when we could just have 1 head with 512-dimensional q, k, v and then perform self-attention? Even if we use multiple heads to capture more context, wouldn't 8 different 512-dimensional vectors for each of q, k, v, performing self-attention on each and combining them later, give a more accurate result? What I mean to ask is why a 512-dimensional representation of a word ends up with 64-dimensional q, k, v per head.
    Someone please explain this.

  • @tonywang7933 · 10 months ago

    At 4:57 d_model is 512, and so is input_dim. But at 14:23 input_dim is 1024. I thought they should be the same number; are you saying you reduce the dimension of the input to the dimension of the model by some compression technique like PCA?
    At 14:23, it looks like input_dim is only used at the very beginning; once we are in the model, the input dimension is shrunk to 512.

    • @jubaerhossain1865 · 8 months ago · +1

      It's not PCA. It's a dimension conversion by weight matrix multiplication. For example, to map (1x1024) -> (1x512), we need a weight matrix of size 1024x512... This is just an example, not the actual scenario demonstrated here.
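
      A tiny illustration of the point in the reply above: the 1024 -> 512 step is a learned linear map, not PCA (the dimensions mirror the comment, not necessarily the video's exact code).

      ```python
      import torch
      import torch.nn as nn

      x = torch.randn(1, 1024)     # "input_dim"-sized embedding
      proj = nn.Linear(1024, 512)  # learned 1024x512 projection (stored as 512x1024 by PyTorch), plus bias
      y = proj(x)                  # [1, 512]; same as x @ proj.weight.T + proj.bias
      ```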

  • @Handelsbilanzdefizit · 1 year ago

    But why do they do this multi-head thing? Is it to reduce computational cost? 8*(64²) < 512²

  • @wishIKnewHowToLove · 11 months ago

    thx)

    • @CodeEmporium · 11 months ago

      You are very welcome! Hope you enjoy your stay on the channel :)

  • @josephfemia8496 · 1 year ago · +1

    Hello, I was wondering what the actual difference is between key and value? I'm a bit confused about the difference between "What I can offer" vs. "What I actually offer".

    • @yashs761 · 1 year ago · +3

      This is a great video that might help you build intuition behind the difference of query, key and value. I've linked the exact timestamp: krplus.net/bidio/gdqjg2F0ZX2monI

    • @ShawnMorel · 9 months ago

      First, remember that what we're trying to learn are Q-weights, K-weights, and V-weights such that:
      - input-embedding * Q-weights = Q (a vector that can be used as a query)
      - input-embedding * K-weights = K (a vector that can be used as a key)
      - input-embedding * V-weights = V (a vector that can be used as a value)
      Linguistic / grammar intuition: let's assume that we had those Q, K and V, and we wanted to search for content matching some query Q. How might we do that, linguistically/grammatically? ...
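
      A single-head sketch of what that reply describes; the names W_q, W_k, W_v and the shapes are illustrative.

      ```python
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      d_model = 512
      x = torch.randn(1, 4, d_model)   # embeddings for a 4-token sentence

      # The learned matrices the reply calls Q-weights, K-weights and V-weights.
      W_q = nn.Linear(d_model, d_model, bias=False)
      W_k = nn.Linear(d_model, d_model, bias=False)
      W_v = nn.Linear(d_model, d_model, bias=False)

      Q, K, V = W_q(x), W_k(x), W_v(x)

      # K is what gets matched against queries ("what I can offer"); V is what actually
      # gets passed along ("what I actually offer") once the Q-K scores set the weights.
      scores = Q @ K.transpose(-2, -1) / d_model ** 0.5   # [1, 4, 4]
      out = F.softmax(scores, dim=-1) @ V                 # [1, 4, 512]
      ```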

    • @healthertsy1863 · 6 months ago · +1

      @yashs761 Thank you so much, this video has helped me a lot! The lecturer is brilliant!

  • @physicsphere · 1 year ago

    I just started with AI/ML a few months ago. Can you guide me on what I should learn to get a job? I like your videos.

    • @CodeEmporium · 1 year ago

      Nice! There are many answers to this. But to keep it short and effective, I would say know your fundamentals. This could be just picking one regression model (like linear regression) and understanding exactly how it works and why it works. Do the same for one classification model (like logistic regression). Look at both through the lens of code, math, and real-life problems.
      I think this is a good starting point for now. Honestly, it doesn't exactly matter where you start as long as you start and don't stop. I'm sure you'll succeed!
      That said, if you are interested in the content I mentioned earlier, I should have some playlists with the titles "Linear Regression" and "Logistic Regression". So do check them out if/when you're interested. Hope this helps.

    • @physicsphere · 1 year ago · +1

      @CodeEmporium Thanks for the reply. Sure, I will check them out. I am going to do some work using transformers; your videos really help, especially the coding demonstrations...

  • @pi5549 · 1 year ago · +1

    14:40 Your embedding dimension is 1024. So how come qkv.shape[-1] is 3x512 and not 3x1024?

    • @oussamawahbi4976 · 11 months ago · +1

      qkv is the result of the qkv_layer, which takes embeddings of size 1024 and has 3*d_model = 3*512 neurons; therefore the output of this layer will be of dimension (batch_size, seq_length, 3*512).
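
      A quick shape check of that explanation, assuming input_dim = 1024 and d_model = 512 as in the video:

      ```python
      import torch
      import torch.nn as nn

      input_dim, d_model = 1024, 512
      qkv_layer = nn.Linear(input_dim, 3 * d_model)   # 1024-dim embeddings in, 3*512 = 1536 features out

      x = torch.randn(1, 4, input_dim)
      print(qkv_layer(x).shape)   # torch.Size([1, 4, 1536]) -- last dim is 3*d_model, not 3*input_dim
      ```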

  • @stanislavdidenko8436 · 1 year ago

    Acceptable!

  • @suchinthanawijesundara6464

    ❤❤

  • @kaitoukid1088 · 1 year ago · +3

    Are you a full-time creator or do you work on AI while making digital content?

    • @CodeEmporium · 1 year ago · +7

      The latter. I have a full time job as a machine learning engineer. I make content like this on the side for now :)

    • @trevorthieme5157 · 1 year ago

      @CodeEmporium How complex is the work you do with AI vs. what you teach us here? Would you say it's harder to code by far, or is it mostly just scaling up, reformatting, and sorting data to train the models?

    • @Stopinvadingmyhardware · 11 months ago

      @@CodeEmporium Are you able to disclose your employer’s name?

  • @kartikpodugu · 1 month ago

    I have two doubts:
    1. How are Q, K, V calculated from the input text?
    2. How are Q, K, V calculated for multiple heads?
    Can you elaborate or point me to a proper resource?

    • @naveenpoliasetty954 · 4 days ago

      Word embeddings are fed into separate linear layers (fully connected neural networks) to generate the Q, K, and V vectors. These layers project the word embeddings into a new vector space specifically designed for the attention mechanism within the transformer architecture.
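
      A compact sketch covering both doubts, roughly in the style of the video's code (dimensions assumed: d_model = 512, 8 heads):

      ```python
      import torch
      import torch.nn as nn

      batch_size, seq_len, d_model, num_heads = 1, 4, 512, 8
      head_dim = d_model // num_heads   # 64

      x = torch.randn(batch_size, seq_len, d_model)   # word embeddings (+ positional encodings)

      # 1) One linear layer projects the embeddings into concatenated q, k, v features.
      qkv = nn.Linear(d_model, 3 * d_model)(x)                                   # [1, 4, 1536]

      # 2) Reshape so each head gets its own 3*64 slice, then split into q, k, v.
      qkv = qkv.reshape(batch_size, seq_len, num_heads, 3 * head_dim).permute(0, 2, 1, 3)
      q, k, v = qkv.chunk(3, dim=-1)                                             # each [1, 8, 4, 64]
      ```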

  • @thechoosen4240 · 2 months ago

    Good job bro, JESUS IS COMING BACK VERY SOON; WATCH AND PREPARE