Positional Encoding in Transformer Neural Networks Explained

  • ๊ฒŒ์‹œ์ผ 2024. 04. 26.
  • Positional Encoding! Let's dig into it
    ABOUT ME
    ⭕ Subscribe: krplus.net/uCodeEmporiu...
    📚 Medium Blog: / dataemporium
    💻 Github: github.com/ajhalthor
    👔 LinkedIn: / ajay-halthor-477974bb
    RESOURCES
    [1 🔎] Code for video: github.com/ajhalthor/Transfor...
    [2 🔎] My video on multi-head attention: • Multi Head Attention i...
    [3 🔎] Transformer Main Paper: arxiv.org/abs/1706.03762
    PLAYLISTS FROM MY CHANNEL
    ⭕ ChatGPT Playlist of all other videos: • ChatGPT
    ⭕ Transformer Neural Networks: • Natural Language Proce...
    ⭕ Convolutional Neural Networks: • Convolution Neural Net...
    ⭕ The Math You Should Know: • The Math You Should Know
    ⭕ Probability Theory for Machine Learning: • Probability Theory for...
    ⭕ Coding Machine Learning: • Code Machine Learning
    MATH COURSES (7 day free trial)
    📕 Mathematics for Machine Learning: imp.i384100.net/MathML
    📕 Calculus: imp.i384100.net/Calculus
    📕 Statistics for Data Science: imp.i384100.net/AdvancedStati...
    📕 Bayesian Statistics: imp.i384100.net/BayesianStati...
    📕 Linear Algebra: imp.i384100.net/LinearAlgebra
    📕 Probability: imp.i384100.net/Probability
    OTHER RELATED COURSES (7 day free trial)
    ๐Ÿ“• โญ Deep Learning Specialization: imp.i384100.net/Deep-Learning
    ๐Ÿ“• Python for Everybody: imp.i384100.net/python
    ๐Ÿ“• MLOps Course: imp.i384100.net/MLOps
    ๐Ÿ“• Natural Language Processing (NLP): imp.i384100.net/NLP
    ๐Ÿ“• Machine Learning in Production: imp.i384100.net/MLProduction
    ๐Ÿ“• Data Science Specialization: imp.i384100.net/DataScience
    ๐Ÿ“• Tensorflow: imp.i384100.net/Tensorflow
    TIMESTAMPS
    0:00 Transformer Overview
    2:23 Transformer Architecture Deep Dive
    5:11 Positional Encoding
    7:25 Code Breakdown
    11:11 Final Coded Class

Comments • 86

  • @CodeEmporium
    @CodeEmporium  a year ago +14

    If you think I deserve it, please do consider a like and subscribe to support the channel. Thanks so much for watching! :)

    • @mello1016
      @mello1016 a year ago

      Amazing videos, but dude, please remove your face from the thumbnails. It adds zero value and distracts from choosing the content. Don't follow the herd; show something unique there instead.

    • @LuizHenrique-qr3lt
      @LuizHenrique-qr3lt a year ago

      tks!!

    • @arydeshpande
      @arydeshpande a year ago

      I think you deserve it, big thank you to you

    • @wishIKnewHowToLove
      @wishIKnewHowToLove a year ago +1

      :)

    • @becayebalde3820
      @becayebalde3820 6 months ago

      Shut up @@mello1016, adding his face makes it less abstract. You can see there is a human behind it, and it makes it easier to focus.

  • @becayebalde3820
    @becayebalde3820 6 months ago +3

    Man, you are awesome! I thought transformers were too hard and needed too much effort to understand.
    While I was willing to put in that much effort, your playlist has been extraordinarily useful to me.
    Thank you! I subscribed.

  • @sabzimatic
    @sabzimatic 5 months ago +1

    Hands down!! You have put sincere effort into explaining crucial concepts in Transformers.
    Kudos to you! Wishing you the best!!

    • @CodeEmporium
      @CodeEmporium  5 months ago

      Thanks for the super kind words! Definitely more to come. In the middle of making a series on Reinforcement Learning now :)

  • @BenderMetallo
    @BenderMetallo a year ago +1

    "Attention is all you need!" (cit.)
    Your tutorials are gold, thank you

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w a year ago +4

    Really enjoying this Transformer series.

    • @CodeEmporium
      @CodeEmporium  a year ago

      Thanks so much for watching and commenting on them :)

  • @judedavis92
    @judedavis92 a year ago +1

    Thanks for the great video! Loving this series!

    • @CodeEmporium
      @CodeEmporium  a year ago +1

      Thanks so much for watching! Hope you enjoy the rest :)

  • @shauryai
    @shauryai a year ago +1

    Thanks for the detailed videos on Transformer concepts!

    • @CodeEmporium
      @CodeEmporium  a year ago

      My pleasure :) Thank you for the support

  • @DeepakKandel-go3ff
    @DeepakKandel-go3ff 9 months ago +1

    Yes, totally worth a like.

  • @PravsAI
    @PravsAI 5 months ago

    One of the great explanations, Ajay. Happy to see Kannada words here! Looking forward to more videos like this :-)
    Kudos! Great work....

  • @kimrichies
    @kimrichies a year ago

    Your efforts are much appreciated

  • @ShivarajKarki
    @ShivarajKarki 11 months ago

    A Kannadiga's greetings to the Kannada admirer.. Salutations to your treasure-house of knowledge.

  • @pizzaeater9509
    @pizzaeater9509 a year ago

    Most brilliant and simple to understand video

  • @SanjithKumar-xf4sg
    @SanjithKumar-xf4sg 7 months ago

    One of the best series for transformers 😄

  • @XNexezX
    @XNexezX 5 months ago

    Dude these videos are so nice. Starting my masters thesis on a transformer-based topic soon and this is really helping me learn the basics

    • @CodeEmporium
      @CodeEmporium  5 months ago

      Perfect! Super glad you're on this journey. The field is very fun :)

  • @paull923
    @paull923 a year ago

    Thx! Clear and concise!

  • @ziki5993
    @ziki5993 a year ago

    wonderful video

  • @srivatsa1193
    @srivatsa1193 a year ago

    Bro.. You are Awesome!

  • @guruphiji
    @guruphiji 10 months ago

    what a voice !!!

  • @ChrisHalden007
    @ChrisHalden007 a year ago

    Great video. Thanks

  • @AbhishekS-cv3cr
    @AbhishekS-cv3cr 11 months ago

    approved!

  • @sangabahati3545
    @sangabahati3545 11 months ago

    Your videos are useful for me, congratulations on the excellent work. But I suggest you demonstrate a real example of multivariate time series forecasting or classification.

  • @superghettoindian01
    @superghettoindian01 a year ago +1

    As before, great work on this Transformer Series! Am trying to go through all your code / videos slowly so I make sure I'm fully absorbing it. Where I'm struggling / slowest right now is in my intuition behind some of these tensor operations with stack / concatenate. Do you have any recommendations for study material apart from the torch documentation?

    • @CodeEmporium
      @CodeEmporium  a year ago +2

      Thanks so much! Hmm. Maybe Hugging Face has some good resources too. Aside from this, I'll be making a playlist on the evolution of language models so some design choices become more intuitive. Hope you'll stick around for that
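
      For intuition on the stack/concatenate distinction asked about above, here is a minimal PyTorch sketch (not the video's code; the shapes are illustrative):

      import torch

      a = torch.randn(2, 3)   # e.g. two tokens with 3-dimensional embeddings
      b = torch.randn(2, 3)

      # torch.stack creates a NEW dimension and places the tensors along it.
      stacked = torch.stack([a, b], dim=0)   # shape: (2, 2, 3)

      # torch.cat joins along an EXISTING dimension; no new axis is created.
      cat_rows = torch.cat([a, b], dim=0)    # shape: (4, 3)
      cat_cols = torch.cat([a, b], dim=-1)   # shape: (2, 6)

      print(stacked.shape, cat_rows.shape, cat_cols.shape)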

  • @Slayer-dan
    @Slayer-dan a year ago

    Thanks a lot 💚

  • @mihirchauhan6346
    @mihirchauhan6346 6 months ago +2

    At 6:24, reason 1 (periodicity) for positional encoding was under-specified and needed more clarity: it was mentioned that a word pays attention to other words (farther apart) in the sentence using the periodicity property of the sine and cosine functions in order to make the solution tractable. Is this mentioned in some paper, or can you cite it? Thanks.

  • @ThinAirElon
    @ThinAirElon 6 months ago +1

    Theoretically, what does it mean to add the embedding vector and the positional vector?
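
    One way to see what the addition does concretely, as a rough sketch (PyTorch, with random numbers standing in for the real sinusoidal encoding; not the video's code): the positional encoding is just another d_model-sized vector per position, so the sum shifts each token's embedding by a position-dependent offset.

    import torch

    d_model, seq_len = 8, 4
    token_embeddings = torch.randn(seq_len, d_model)     # one learned vector per token
    positional_encoding = torch.randn(seq_len, d_model)  # one vector per position (sinusoidal in the paper; random here just for shape)

    # Element-wise addition: the token at position k is shifted by the encoding of position k,
    # so the same word appearing at different positions produces different input vectors.
    x = token_embeddings + positional_encoding           # shape: (seq_len, d_model)
    print(x.shape)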

  • @andreytolkushkin3611
    @andreytolkushkin3611 8 months ago +1

    Hi, you probably won't see this since it's been 6 months since you posted the video, however: I'm trying to write code for handwritten mathematical expression recognition and am trying to recreate the BTTR model. In it they use a DenseNet as the transformer encoder and use "image positional encoding", which is supposed to be a 2D generalization of the sinusoidal positional encoding. What would be the logic behind the 2D image positional encoding? They do have code on GitHub but I have no idea how to interpret it, could you please help?
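
    One common way to generalize the sinusoidal encoding to 2D (a sketch under that assumption, not necessarily what BTTR does exactly) is to build a 1D sinusoidal encoding over the row index with d_model/2 channels, another over the column index with d_model/2 channels, and concatenate them per pixel:

    import torch

    def sinusoidal_1d(positions, channels):
        # Standard 1D sinusoidal encoding: sin on even channels, cos on odd channels.
        i = torch.arange(0, channels, 2).float()
        denom = torch.pow(10000.0, i / channels)
        angles = positions.float().unsqueeze(1) / denom   # (num_positions, channels / 2)
        pe = torch.zeros(len(positions), channels)
        pe[:, 0::2] = torch.sin(angles)
        pe[:, 1::2] = torch.cos(angles)
        return pe

    def sinusoidal_2d(height, width, d_model):
        # Half the channels encode the row index, the other half the column index.
        row_pe = sinusoidal_1d(torch.arange(height), d_model // 2)   # (H, d/2)
        col_pe = sinusoidal_1d(torch.arange(width), d_model // 2)    # (W, d/2)
        row_part = row_pe.unsqueeze(1).expand(height, width, d_model // 2)
        col_part = col_pe.unsqueeze(0).expand(height, width, d_model // 2)
        return torch.cat([row_part, col_part], dim=-1)               # (H, W, d_model)

    print(sinusoidal_2d(height=8, width=8, d_model=256).shape)       # torch.Size([8, 8, 256])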

  • @caiyu538
    @caiyu538 6 months ago

    Clear explanation. If I want to use a transformer for time series and the time is not evenly spaced, i.e., there are irregular time points, how could I positionally encode these times in the transformer?

  • @user-pm9nt6xk3c
    @user-pm9nt6xk3c 8 months ago

    Hey Ajay, first of all, this is a video so well built that I will be recommending it to our data science, AI, and robotics clubs; your content is great and I can see the next Andrew Ng before me. Regardless, I do have a question: why must there be a max number of words in a transformer architecture? I don't fully understand the reason behind it, considering most of the operations conducted in the first half don't require a fixed-length input, since this isn't your usual neural network layer. Do you mind explaining? Because I do feel like this is flying over my head.

    • @CodeEmporium
      @CodeEmporium  8 months ago

      Your words are too kind. And good question. So what is fixed in length in this specific architecture is the maximum number of words in a sentence, not the number of words in a sentence. The remaining unused positions are filled with "padding tokens". This will become clearer when you watch the videos coding out the complete transformer in the playlist "Transformers from Scratch". We essentially do this so we can pass fixed-size vector inputs through every part of the transformer. That said, I have seen more recent implementations where the size is dynamic.
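
      As a rough illustration of the padding idea described above (hypothetical token ids, with 0 standing in for the padding token; not the video's exact code):

      # Pad every tokenized sentence up to a fixed max_sequence_length so that
      # every input to the transformer has the same shape.
      PAD_ID = 0                     # hypothetical id reserved for the padding token
      max_sequence_length = 10

      token_ids = [101, 11590, 11324, 10124, 138, 78761, 102]   # "my name is ajay" (example ids)
      padded = token_ids + [PAD_ID] * (max_sequence_length - len(token_ids))
      # -> [101, 11590, 11324, 10124, 138, 78761, 102, 0, 0, 0]

      # A mask records which positions hold real tokens so attention can ignore the padding.
      padding_mask = [1] * len(token_ids) + [0] * (max_sequence_length - len(token_ids))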

  • @lexingtonjackson3657
    @lexingtonjackson3657 a month ago

    I liked you already; now that you are a Kannadiga, I like you even more.

  • @user-kd7xd2gb5s
    @user-kd7xd2gb5s 13 days ago

    i love your shit man, this was so useful i actually understood this ML shit and now can be Elon Musk up in this LLM shit

  • @ilyas8523
    @ilyas8523 a year ago

    Great videos, especially the one where you explained what a transformer is. Besides YouTube, do you have a full-time job or is this it? Just curious

    • @CodeEmporium
      @CodeEmporium  a year ago

      Thank you! And yep I have a full time job as a Machine Learning Engineer outside of KRplus :)

  • @aliwaleed9173
    @aliwaleed9173 9 months ago

    Thanks for the information in this video, however I think I have a misunderstanding:
    you said before that the vocab words go into the embedding vector, which is like a box holding related words together,
    but at the start of this video you said the words are first one-hot encoded and then passed to the positional encoding, so what I want to know is which of these scenarios is right:
    1- we take the word, look it up in the embedding space, then pass it to the positional encoder
    2- we take the word, one-hot encode it, then send it to the positional encoder

  • @ajaytaneja111
    @ajaytaneja111 a year ago

    Hi Ajay, isn't the purpose of positional encoding to figure out where the word is located in the sequence, which the attention mechanism then derives benefit from? Thanks... And again, great content, grateful

    • @CodeEmporium
      @CodeEmporium  a year ago +1

      Yes! The idea overall is to create meaningful embeddings for words that understand context. This is opposed to the traditional CBoW or Skip-gram word embeddings that don't quite get this context.

  • @LuizHenrique-qr3lt
    @LuizHenrique-qr3lt a year ago

    Hey Ajay, great video!! Congratulations, I'm learning a lot from you, thank you! Ajay, I have some doubts. The first is that I didn't quite understand the difference between max sequence length and d_model. For example, if I have texts of up to 50 tokens, that is, my largest text has up to 50 tokens, this would be my max sequence length; however, if my d_model were 10, my largest sequence would have to be divided into 5 to be able to pass through the model, because it only accepts 10 tokens at a time. Is my thinking correct?

    • @CodeEmporium
      @CodeEmporium  a year ago +1

      The way you described sequence length = 50 is correct. It is the maximum number of tokens you can pass into your network at a time (it's the max number of words/subwords/characters). d_model is the embedding dimension. Models don't understand words, but they understand numbers. And so, you transform every token into some set of numbers (called a vector), and the number of numbers in this vector is d_model. Let's say d_model is 512 and also say we have the sentence "my name is ajay". The word "my" would be converted into a 512-dimensional vector. As would "name", "is" and "ajay". The idea of these vectors/embeddings is to get some dense numeric representation of the context of a word (so similar words are represented with vectors that are close to each other, and dissimilar words are represented with vectors that are farther from each other).

    • @LuizHenrique-qr3lt
      @LuizHenrique-qr3lt a year ago

      @@CodeEmporium hm ok, good answer. Now a doubt: if d_model is the dimension in which I put my tokens, why don't some transformer models accept very long texts? For example, if I have
      sequence length = 10
      d_model = 3
      the phrase "my name is Ajay"
      would turn into 4 vectors
      my: [0, 0.2, 0.6]
      name: [0, 0.1, 0.11]
      is: [0.5, 0.2, 0.0]
      Ajay: [0, 0, 1], with d_model dimensions each.
      Why can't I put very large sequences into my model? Why does d_model interfere with this?

    • @DevelopersHutt
      @DevelopersHutt 10 months ago

      @@LuizHenrique-qr3lt The max sequence length refers to the maximum number of tokens in a sequence, while d_model represents the dimensionality of the token embeddings. They serve different purposes in the Transformer model. The max sequence length determines the size of the input that can be processed at once, whereas d_model influences the complexity and expressive power of the model. In your example, with a max sequence length of 50 and d_model of 10, each of the 50 tokens would simply be represented by a 10-dimensional vector; d_model does not limit how many tokens you can pass in, so nothing needs to be split into chunks.
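
      To make the two quantities concrete, a small PyTorch sketch (illustrative numbers, not the video's code): max sequence length caps how many tokens go in at once, while d_model is only the width of each token's vector.

      import torch
      import torch.nn as nn

      vocab_size = 30000           # illustrative
      max_sequence_length = 50     # at most 50 tokens per input
      d_model = 10                 # each token becomes a 10-dimensional vector

      embedding = nn.Embedding(vocab_size, d_model)

      token_ids = torch.randint(0, vocab_size, (max_sequence_length,))  # 50 token ids
      vectors = embedding(token_ids)                                    # shape: (50, 10)

      # Changing d_model changes the width of each vector, not how many tokens fit:
      # a 50-token sentence still passes through in one shot even with d_model = 10.
      print(vectors.shape)   # torch.Size([50, 10])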

  • @balakrishnaprasad8928
    @balakrishnaprasad8928 a year ago

    Please make a detailed video series on the math for data science

    • @CodeEmporium
      @CodeEmporium  a year ago

      I have made some math in machine learning videos. Maybe check the playlist "The Math You Should Know" on the channel.

  • @lorenzobianconi7724
    @lorenzobianconi7724 a year ago

    Hi Ajay, thanks for your videos. Why are there 512 dimensions? Who established this number? And how can we count the 175B parameters in GPT-3? Can you make a video where you break down the whole process of a transformer in one clear shot, possibly not using translation but, for example, an answering task? Thanks, love your videos and your determination to spread knowledge.

    • @giacomomunda3359
      @giacomomunda3359 6 months ago

      512 is a hyperparameter. You can actually decide which dimension to use, but it has been shown that higher dimensions usually work better, since they are able to capture more linguistic information, e.g. semantics, syntax, etc. BERT, for instance, uses 768 dimensions and the OpenAI ada embeddings have 1536 dimensions.

  • @wishIKnewHowToLove
    @wishIKnewHowToLove a year ago

    I like that your English is clean :) no disgusting non-Californian accent :)

  • @neetpride5919
    @neetpride5919 a year ago

    Is there an advantage to using one-hot encoding instead of an integer index encoding for the words? If we're gonna download a pre-existing word2vec dictionary and map each word to its word vector during data preparation anyway, the one-hot encoding seems like it'd just create an unnecessarily large sparse matrix.

    • @CodeEmporium
      @CodeEmporium  a year ago

      The idea here is we are not going to use a pre-existing word2vec for the transformer. Everything, including the embedding for every word, will be learned during training. An issue with word2vec is that the embeddings are fixed and don't necessarily capture word context very well. This concept was introduced in the paper that introduced ELMo, "Deep Contextualized Word Representations" (Peters et al., 2018). Would recommend giving this a read if you're interested.

  • @hermannangstl1904
    @hermannangstl1904 a year ago

    From what I understood, each word/token is represented by a 512-dimensional vector. The values of this vector are modified by means of (self-)attention and positional encoding.
    What is a bit counter-intuitive for me is that the position in which a word/token appears can be different in different sentences. For example, let's take the word "Ajay".
    (1) In this sentence it's in 4th position: "My name is Ajay"
    (2) In a different sentence it is in 1st position: "Ajay explains very well".
    So the positional encodings for the word "Ajay" vary; they might be different in each sentence. How can the network be trained, how can it learn, with such contradictory input data?

    • @CodeEmporium
      @CodeEmporium  a year ago

      This is a good question. But it intuitively does make sense that the same word in different sentences can have different meanings. Take the word "grounded". You can represent this as a 512-dimensional vector. But let's say "grounded" occurs in 2 sentences: (1) The truth is grounded in reality (2) You're grounded! Go to your room. In these examples, "grounded" has differing meanings and should hence have different vector representations. This is why we need surrounding context to understand word vectors individually. This is probably a little hard to see with your example since "Ajay" is a proper noun. However, for non-proper nouns, context matters.
      I think you should take a look at the paper "Deep Contextualized Word Representations" by Matthew Peters (2018). They more formally answer the question you are asking. This is the paper that introduced ELMo embeddings. According to this paper, it turns out that using different vectors based on context really improved models on part-of-speech tagging and language modeling.

    • @DevelopersHutt
      @DevelopersHutt 10 months ago

      You've raised an important point: the positional encoding for a word like "Ajay" can vary depending on its position in different sentences.
      Let's consider the word "Ajay" in two different sentences and see how the Transformer model handles it:
      (1) Sentence 1: "My name is Ajay."
      (2) Sentence 2: "Ajay explains very well."
      In both sentences, the word "Ajay" has different positions, but the Transformer model can still learn and make sense of it. Here's a simplified example of how it works (the numbers are illustrative):
      Input Encoding: Each word, including "Ajay", is initially represented by a 512-dimensional vector; this token embedding is the same in both sentences, e.g. [0.1, 0.2, 0.3, ..., 0.4].
      Positional Encoding: The model incorporates positional encodings to differentiate the positions of words.
      Sentence 1: The positional encoding for the 4th position might look like [0.4, 0.3, 0.2, ..., 0.1].
      Sentence 2: The positional encoding for the 1st position might look like [1.0, 0.9, 0.8, ..., 0.5].
      Attention and Context: The Transformer's attention mechanism considers the positional encodings along with the input representations to compute contextualized representations.
      Sentence 1: The attention mechanism incorporates the positional encoding and input embedding of "Ajay" at the 4th position to capture its contextual information within the sentence.
      Sentence 2: Similarly, the attention mechanism considers the positional encoding and input embedding of "Ajay" at the 1st position in the context of the second sentence.
      By attending to different positions and incorporating positional encodings, the model can learn to associate the word "Ajay" with its specific context and meaning in each sentence. Through training on various examples, the model adjusts its weights and learns to generate appropriate representations for words based on their positions, allowing it to make meaningful predictions and capture the contextual relationships between words effectively.

  • @convolutionalnn2582
    @convolutionalnn2582 a year ago +1

    In the code for the final class, position goes from 1 to max sequence length, which includes both even and odd values. I think we use cos for odd and sin for even. So why are all the positions passed, meaning 1 to max sequence length, with even ones also going into cos and odd ones also going into sin?

    • @CodeEmporium
      @CodeEmporium  a year ago

      I think I responded to this in another video you asked this question. Hope that helped tho :)

    • @convolutionalnn2582
      @convolutionalnn2582 a year ago

      @@CodeEmporium Yeah but you didn't answer it fully
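
      For what it's worth, in the usual sinusoidal formulation every position from 0 to max_sequence_length - 1 gets an encoding; it is the embedding dimension index i that is split, with sin used for even i and cos for odd i. A minimal PyTorch sketch of such a class (assuming an even d_model; not necessarily identical to the repo's final code):

      import torch
      import torch.nn as nn

      class PositionalEncoding(nn.Module):
          def __init__(self, d_model, max_sequence_length):
              super().__init__()
              self.d_model = d_model
              self.max_sequence_length = max_sequence_length

          def forward(self):
              # Even embedding-dimension indices i = 0, 2, 4, ...
              even_i = torch.arange(0, self.d_model, 2).float()
              denominator = torch.pow(10000.0, even_i / self.d_model)
              # Every position 0 .. max_sequence_length - 1 is used for BOTH sin and cos.
              position = torch.arange(self.max_sequence_length).float().unsqueeze(1)
              even_pe = torch.sin(position / denominator)   # fills the even dimensions
              odd_pe = torch.cos(position / denominator)    # fills the odd dimensions
              # Interleave so the column order is [sin, cos, sin, cos, ...].
              stacked = torch.stack([even_pe, odd_pe], dim=2)
              return torch.flatten(stacked, start_dim=1, end_dim=2)   # (max_sequence_length, d_model)

      print(PositionalEncoding(d_model=6, max_sequence_length=10).forward().shape)   # torch.Size([10, 6])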

  • @LuizHenrique-qr3lt
    @LuizHenrique-qr3lt a year ago

    My second doubt is that when I use BertTokenizer, for example, it transforms the text
    [my name is ajay] into a list of integers, for example [101, 11590, 11324, 10124, 138, 78761, 102]. Where does that part go? I couldn't understand that part.

    • @CodeEmporium
      @CodeEmporium  a year ago +1

      So I haven't shown the text encoding details just yet. :) Since 4 words were encoded into 7 numbers, I assume the BertTokenizer is encoding each subword/word piece into some number. Essentially, the tokenizer is taking the sentence, breaking it down into word pieces (7 in this case) and each is being mapped to a unique integer number. Later on, you will see each number being mapped to a larger vector (I explained more details about why these vectors exist in the other comment).
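
      A quick way to see what that tokenizer step produces, assuming the Hugging Face transformers library (a sketch; the exact word pieces and ids depend on the vocabulary used):

      from transformers import BertTokenizer

      # A multilingual vocabulary is assumed here; any BERT vocab illustrates the same idea.
      tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

      tokens = tokenizer.tokenize("my name is ajay")     # word pieces; the exact split depends on the vocab
      ids = tokenizer.convert_tokens_to_ids(tokens)      # one integer per word piece
      ids_with_special = tokenizer("my name is ajay")["input_ids"]  # adds [CLS] (101) and [SEP] (102) around them

      print(tokens)
      print(ids_with_special)   # 7 integers for 4 words, as in the example above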

  • @SAIDULISLAM-kc8ps
    @SAIDULISLAM-kc8ps a year ago

    Can you please tell me the difference between sequence length and dimension of embedding?

    • @CodeEmporium
      @CodeEmporium  a year ago +1

      Sequence length = maximum number of characters/words we can pass into the transformer at a time.
      Dimension of embedding = size of vector representing each character / word.

    • @SAIDULISLAM-kc8ps
      @SAIDULISLAM-kc8ps a year ago

      @@CodeEmporium thanks a lot.

  • @7_bairapraveen928
    @7_bairapraveen928 a year ago

    I am kind of a newbie here; if you think this is valid, please answer.
    Why are you introducing parameters of dimension 512 for the vocab size, making a neural network?
    I mean, what happens if we don't do that?

    • @CodeEmporium
      @CodeEmporium  a year ago

      Why are we using 512 dimensions instead of a one-hot vector of size equal to the vocabulary size? This is because of the curse of dimensionality. Vocabulary sizes are huge (often in the tens of thousands). This is a lot for any model, neural network or not, to process. There was a 2001 paper by Yoshua Bengio, "A Neural Probabilistic Language Model", that describes exactly this issue and why the dense representation was introduced. I would recommend giving it a read. Also, my next series will delve into the history of language models, so I hope you'll stay tuned for this. Maybe some of the design choices will become clearer.
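
      A tiny sketch of the size difference being described (illustrative numbers; PyTorch):

      import torch

      vocab_size, d_model = 30000, 512   # illustrative sizes

      # One-hot: one enormous, mostly-zero vector per word.
      one_hot = torch.zeros(vocab_size)
      one_hot[4217] = 1.0                # hypothetical index of some word in the vocabulary

      # Learned embedding: a small dense vector per word, looked up from a table trained with the model.
      embedding_table = torch.nn.Embedding(vocab_size, d_model)
      dense = embedding_table(torch.tensor([4217]))

      print(one_hot.shape, dense.shape)  # torch.Size([30000]) torch.Size([1, 512])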

  • @joaogoncalves1149
    @joaogoncalves1149 8 months ago

    I think that queen/king example is somewhat cherry picked, as the principle behind the analogy fails for many examples.

  • @aar953
    @aar953 4 months ago +1

    There is one mistake that you are making: we are not taking a single output as input to the decoder, but all the previous outputs up to the current time step.

    • @CodeEmporium
      @CodeEmporium  4 months ago +3

      Yea, that's correct from a practical standpoint. I dive into this when coding this out in the rest of this playlist, "Transformers from Scratch". Hope those videos clear things up!

    • @aar953
      @aar953 4 months ago +1

      @@CodeEmporium Thanks for answering. I understand that you have to make a trade-off between simplicity and accuracy. Here, I just wanted to note that a little more complexity would have added quite a lot more accuracy.
      Your content is excellent!

  • @__hannibaalbarca__
    @__hannibaalbarca__ 10 months ago

    I don't know anything about Python, but it looks extremely slow.