Positional Encoding in Transformer Neural Networks Explained
Source code
- Posted 2024. 04. 26.
- Positional Encoding! Let's dig into it
ABOUT ME
- Subscribe: krplus.net/uCodeEmporiu...
- Medium Blog: / dataemporium
- Github: github.com/ajhalthor
- LinkedIn: / ajay-halthor-477974bb
RESOURCES
[1] Code for video: github.com/ajhalthor/Transfor...
[2] My video on multi-head attention: • Multi Head Attention i...
[3] Transformer Main Paper: arxiv.org/abs/1706.03762
PLAYLISTS FROM MY CHANNEL
- ChatGPT Playlist of all other videos: • ChatGPT
- Transformer Neural Networks: • Natural Language Proce...
- Convolutional Neural Networks: • Convolution Neural Net...
- The Math You Should Know: • The Math You Should Know
- Probability Theory for Machine Learning: • Probability Theory for...
- Coding Machine Learning: • Code Machine Learning
MATH COURSES (7 day free trial)
- Mathematics for Machine Learning: imp.i384100.net/MathML
- Calculus: imp.i384100.net/Calculus
- Statistics for Data Science: imp.i384100.net/AdvancedStati...
- Bayesian Statistics: imp.i384100.net/BayesianStati...
- Linear Algebra: imp.i384100.net/LinearAlgebra
- Probability: imp.i384100.net/Probability
OTHER RELATED COURSES (7 day free trial)
- Deep Learning Specialization: imp.i384100.net/Deep-Learning
- Python for Everybody: imp.i384100.net/python
- MLOps Course: imp.i384100.net/MLOps
- Natural Language Processing (NLP): imp.i384100.net/NLP
- Machine Learning in Production: imp.i384100.net/MLProduction
- Data Science Specialization: imp.i384100.net/DataScience
- Tensorflow: imp.i384100.net/Tensorflow
TIMESTAMPS
0:00 Transformer Overview
2:23 Transformer Architecture Deep Dive
5:11 Positional Encoding
7:25 Code Breakdown
11:11 Final Coded Class
If you think I deserve it, please do consider a like and subscribe to support the channel. Thanks so much for watching ! :)
Amazing videos, but dude please remove your face from the thumbnails. It adds zero value and is distracting from choosing the content. Don't follow a herd, better represent something unique in there.
tks!!
i think you deserve it, big thank you to you
:)
shut up @@mello1016, adding his face makes it less abstract. You can see there is a human behind it, and it makes it easier to focus.
Man you are awesome! I thought transformers were too hard and needed too much effort to understand.
While I was willing to put that much effort, your playlist has been extraordinarily useful to me.
Thank you! I subscribed
Hands down!! You have put in sincere effort in explaining crucial concepts in Transformers.
Kudos to you! Wishing you the best !!
Thanks for the super kind words! Definitely more to come. In the middle of making a series on Reinforcement Learning now :)
Attention is all you need! cit
your tutorials are gold, thank you
You are so welcome !
Really enjoying this Transformer series.
Thanks so much for watching and commenting on them :)
Thanks for the great video! Loving this series!
Thanks so much for watching ! Hope you enjoy the rest :)
Thanks for the detailed videos on Transformer concepts!
My pleasure :) Thank you for the support
Yes, Totally worth a like.
One of the greatest explanations, Ajay; happy to see Kannada words here! Looking forward to more videos like this :-)
Kudos ! Great work ....
Your efforts are much appreciated
Thanks so much for watching :)
A Kannadiga's greetings to a fellow Kannada admirer. Salutations to your treasure-house of knowledge.
Most brilliant and simple to understand video
Haha thanks a lot :) I try
One of the best series for transformers!
Dude these videos are so nice. Starting my masters thesis on a transformer-based topic soon and this is really helping me learn the basics
Perfect! Super glad you're on this journey. The field is very fun :)
Thx! Clear and concise!
Thanks So much
wonderful video
Thank you so much! :D
Bro.. You are Awesome!
Nah you are awesome
what a voice !!!
Great video. Thanks
My pleasure
approved!
Your videos are useful for me; congratulations on the excellent work. But I suggest you demonstrate a real-world example of multivariate time series forecasting or classification.
As before, great work on this Transformer Series! Am trying to go through all your code / videos slowly so I make sure I'm fully absorbing it. Where I'm struggling / slowest right now is in my intuition behind some of these tensor operations with stack / concatenate. Do you have any recommendations for study material apart from the torch documentation?
Thanks so much! Hmm. Maybe Hugging Face has some good resources too. Aside from this, I'll be making a playlist on the evolution of language models so some design choices become more intuitive. Hope you'll stick around for that
Thanks a lot!
Super welcome
At 6:24, reason 1 (periodicity) for positional encoding was under-specified and needs more clarity. It was mentioned that a word pays attention to other words (farther apart) in the sentence using the periodicity property of the sine and cosine functions, in order to make the solution tractable. Is this mentioned in some paper, or can you cite it? Thanks.
Theoretically, what does it mean to add the embedding vector and the positional vector?
Hi, you probably won't see this since it's been 6 months since you posted the video. However: I'm trying to write code for handwritten mathematical expression recognition and am trying to recreate the BTTR model. In it they use a DenseNet as the transformer encoder and use "image positional encoding", which is supposed to be a 2D generalization of the sinusoidal positional encoding. What would be the logic behind the 2D image positional encoding? They do have code on GitHub but I have no idea how to interpret it. Could you please help?
Clear explanation. If I want to use a transformer for time series where the time steps are not evenly spaced (there are irregularities in the time points), how could I positionally encode these times for the transformer?
Hey Ajay, first of all, this is a video so well built that I will be recommending it to our data science, AI, and robotics clubs. Your content is great and I can see the next Andrew Ng before me. Regardless, I do have a question: why must there be a max number of words in a transformer architecture? I don't fully understand the reason behind it, considering most of the operations in the first half don't require a fixed-length input, since this isn't your usual neural network layer. Do you mind explaining? Because I do feel like this is flying over my head.
Your words are too kind. And good question. What is fixed in length in this specific architecture is the maximum number of words in a sentence, not the number of words in a sentence. The remaining unused slots are filled with "padding tokens". This will become clearer when you watch the videos coding out the complete transformer in the playlist "Transformers from Scratch". We essentially do this so we can pass fixed-size vector inputs through every part of the transformer. That said, I have seen more recent implementations where the size is dynamic.
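The padding idea can be sketched in a couple of lines (the token name "<pad>" and the max length of 10 are made up for illustration; real tokenizers use their own pad token and length):

```python
# Pad every sentence to a fixed max_sequence_length so the transformer
# always receives same-sized input; unused slots hold a padding token.
max_sequence_length = 10
PAD = "<pad>"

sentence = ["my", "name", "is", "ajay"]
padded = sentence + [PAD] * (max_sequence_length - len(sentence))
print(padded)  # 4 real tokens followed by 6 pad tokens
```

In practice these pad positions are also masked out in attention so they don't influence the real tokens.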
I liked you already , now you are a kannadiga and i like you more.
i love your shit man, this was so usefull i actually understood this ml shit and now can be elon musk up in this llm shit
Great videos, especially the one where you explained what a transformer is. Beside youtube, do you have a full time job or is this it? Just curious
Thank you! And yep I have a full time job as a Machine Learning Engineer outside of KRplus :)
Thanks for the information in this video. However, I think I have a misunderstanding:
you said before that the vocab words go into the embedding vector, which is like a bag of related words grouped together in a box,
but at the start of this video you said the words are first one-hot encoded and then passed to the positional encoding. So what I want to know is which scenario is right:
1- we take the word, look it up in the embedding space, then pass it to the positional encoder
2- we take the word, one-hot encode it, then send it to the positional encoder
Hi Ajay, isn't the purpose of positional encoding to figure out where the word is located in the sequence, which the attention mechanism then actually derives benefit from? Thanks... And again, great content, grateful
Yes! The overall idea is to create meaningful embeddings for words that understand context. This is opposed to the traditional CBoW or skip-gram word embeddings that don't quite get this context.
Hey Ajay, great video!! Congratulations, I'm learning a lot from you thank you! Ajay I have some doubts, the first is that I didn't quite understand the difference between max sequence length and d_model. For example, if I have texts with 50 tokens in size, that is, my largest text has up to 50 tokens, this would be my max sequence length, however if my d_model were 10, my largest sequence would have to be divided into 5 to be able to pass through the model because it only accepts 10 tokens at a time, is my thinking correct?
The way you described sequence length = 50 is correct. It is the maximum number of tokens you can pass into your network at a time (it's the max number of words/subwords/characters). d_model is the embedding dimension. Models don't understand words, but they understand numbers. And so, you transform every token into some set of numbers (called a vector), and the number of numbers in this vector is d_model. Let's say d_model is 512 and also say we have the sentence "my name is ajay". The word "my" would be converted into a 512-dimensional vector. As would "name", "is" and "ajay". The idea of these vectors/embeddings is to get some dense numeric representation of the context of a word (so similar words are represented with vectors that are close to each other and dissimilar words with vectors that are farther from each other).
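That mapping can be sketched like so (random vectors stand in for the learned embeddings; the sentence "my name is ajay" and d_model = 512 follow the example above):

```python
import numpy as np

d_model = 512
tokens = ["my", "name", "is", "ajay"]

# Random stand-ins for learned embeddings: one d_model-dim vector per token.
rng = np.random.default_rng(42)
table = {t: rng.normal(size=d_model) for t in tokens}

# Embed the sentence: each token becomes a row of the output matrix.
embedded = np.stack([table[t] for t in tokens])
print(embedded.shape)  # (4, 512): sequence length 4, embedding dim 512
```

So the sequence length counts the rows (tokens) and d_model counts the columns (numbers per token); the two are independent.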
@@CodeEmporium hm ok, good answer. Now a doubt: if d_model is the dimension I put my tokens in, why don't some transformer models accept very long texts? For example, if I have the
sequence length = 10
d_model = 3
the phrase "my name is Ajay"
would turn into 4 vectors
my: [0,0.2,0.6]
name: [0, 0.1, 0.11]
is: [0.5, 0.2, 0.0]
Ajay: [0,0,1] with d_model dimensions each
Why can't I put very large sequences into my model? Why does d_model interfere with this?
@@LuizHenrique-qr3lt The max sequence length refers to the maximum number of tokens in a sequence, while d_model represents the dimensionality of the token embeddings. They serve different purposes in the Transformer model. The max sequence length determines the size of the input that can be processed at once, whereas d_model influences the complexity and expressive power of the model; d_model does not limit how many tokens you can pass in. In the earlier example, if the max sequence length is 50, a longer text would need to be divided into segments or chunks of at most 50 tokens to fit within the model's input limit.
Please make a detailed video series on the math for data science
I have made some math in machine learning videos. Maybe check the playlist "The Math You Should Know" on the channel
Hi Ajay, thanks for your videos. Why are there 512 dimensions? Who established this number? And how can we count the 175B parameters in GPT-3? Can you make a video where you break down the whole process of a transformer in one clear shot, possibly not using translation but, for example, an answering task? Thanks, love your videos and your determination to spread knowledge.
512 is a hyperparameter. You can actually decide which dimension to use, but it has been shown that higher dimensions usually work better, since they are able to capture more linguistic information, e.g. semantics, syntax, etc. BERT, for instance, uses 768 dimensions and the OpenAI ada embeddings have 1536 dimensions.
I like that your English is clean :) no disgusting non-Californian accent :)
Thank you for the compliments
Is there an advantage to using one-hot encoding instead of an integer index encoding for the words? If we're gonna download a pre-existing word2vec dictionary and map each word to its word vector during data preparation anyway, the one-hot encoding seems like it'd just create an unnecessarily large sparse matrix.
The idea here is that we are not going to use a preexisting word2vec for the transformer. Everything, including the embedding for every word, will be learned during training. An issue with word2vec is that the embeddings are fixed and don't necessarily capture word context very well. This concept was introduced in the paper that introduced ELMo, "Deep Contextualized Word Representations" (Peters et al., 2018). Would recommend giving it a read if you're interested.
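On the original question of one-hot vs integer index: for a lookup the two are mathematically interchangeable, because multiplying a one-hot vector by the embedding table just selects one row. A toy demonstration (random table and toy sizes, purely for illustration):

```python
import numpy as np

vocab_size, d_model = 10, 4   # toy sizes; real vocabularies are tens of thousands
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, d_model))  # embedding table (learned in practice)

idx = 7                                     # integer index encoding of a word
one_hot = np.zeros(vocab_size)
one_hot[idx] = 1.0                          # one-hot encoding of the same word

# The one-hot matmul picks out exactly row idx of the table,
# so an integer index lookup E[idx] gives the identical vector.
assert np.allclose(one_hot @ E, E[idx])
```

This is why implementations store an integer per token and index into the table directly, avoiding the large sparse one-hot matrix.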
From what I understood, each word/token is represented by a 512-dimensional vector. The values of this vector are modified by means of (self-)attention and positional encoding.
What is a bit counter-intuitive for me is that the place in which a word/token comes can be different in different sentences. For example lets take the word "Ajay".
(1) In this sentence it's in 4th position: "My name is Ajay"
(2) In a different sentence it is on 1st position: "Ajay explains very well".
So the Positional Encodings for the word "Ajay" vary - they might be different in each sentence. How can the network be trained, how can it learn, with such contradicting input data?
This is a good question. But it intuitively does make sense that the same word in different sentences can have different meanings. Take the word "grounded". You can represent this as a 512-dimensional vector. But let's say "grounded" occurs in 2 sentences: (1) The truth is grounded in reality (2) You're grounded! Go to your room. In these examples, "grounded" has differing meanings and should hence have different vector representations. This is why we need surrounding context to understand word vectors individually. This is probably a lil hard to see with your example since "Ajay" is a proper noun. However, for non-proper nouns, context matters.
I think you should take a look at the paper "Deep Contextualized Word Representations" by Matthew Peters (2018). They more formally answer the question you are asking. This is the paper that introduced ELMo embeddings. According to this paper, it turns out that using different vectors based on context really improved models on part-of-speech tagging and language modeling.
You've raised an important point. It is true that the positional encoding for a word like "Ajay" can vary depending on its position in different sentences.
Let's consider the word "Ajay" in two different sentences and see how the Transformer model handles it:
(1) Sentence 1: "My name is Ajay."
(2) Sentence 2: "Ajay explains very well."
In both sentences, the word "Ajay" has different positions, but the Transformer model can still learn and make sense of it. Here's a simplified example of how it works:
Input Encoding: Each word, including "Ajay," is initially represented by a 512-dimensional vector.
Sentence 1: "Ajay" is represented as [0.1, 0.2, 0.3, ..., 0.4].
Sentence 2: "Ajay" is represented as [0.5, 0.6, 0.7, ..., 0.8].
Positional Encoding: The model incorporates positional encodings to differentiate the positions of words.
Sentence 1: The positional encoding for the 4th position is [0.4, 0.3, 0.2, ..., 0.1].
Sentence 2: The positional encoding for the 1st position is [1.0, 0.9, 0.8, ..., 0.5].
Attention and Context: The Transformer's attention mechanism considers the positional encodings along with the input representations to compute contextualized representations.
Sentence 1: The attention mechanism incorporates the positional encoding and input embedding of "Ajay" at the 4th position to capture its contextual information within the sentence.
Sentence 2: Similarly, the attention mechanism considers the positional encoding and input embedding of "Ajay" at the 1st position in the context of the second sentence.
By attending to different positions and incorporating positional encodings, the model can learn to associate the word "Ajay" with its specific context and meaning in each sentence. Through training on various examples, the model adjusts its weights and learns to generate appropriate representations for words based on their positions, allowing it to make meaningful predictions and capture the contextual relationships between words effectively.
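For concreteness, the sinusoidal scheme from the paper can be written in a few lines of NumPy (a minimal stand-in, not the video's exact code; the numeric vectors above were only illustrative):

```python
import numpy as np

def positional_encoding(max_seq_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    pos = np.arange(max_seq_len)[:, None]       # positions 0..max_seq_len-1
    i = np.arange(0, d_model, 2)[None, :]       # even embedding indices 2i
    angle = pos / (10_000 ** (i / d_model))     # one frequency per index pair
    pe = np.zeros((max_seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                 # even dimensions use sine
    pe[:, 1::2] = np.cos(angle)                 # odd dimensions use cosine
    return pe

pe = positional_encoding(10, 8)
# Position 0 gives sin(0)=0 on even dims and cos(0)=1 on odd dims.
```

Each row is the deterministic vector added to the word embedding at that position, which is what lets the model distinguish "Ajay" at position 1 from "Ajay" at position 4.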
In the code for the final class, position runs from 1 to max sequence length, which includes both even and odd values. I thought we use cos for odd and sin for even. So why are all positions passed to both, meaning even positions also go through cos and odd positions also through sin?
I think I responded to this in another video where you asked this question. Hope that helped tho :)
@@CodeEmporium Yeah but you didn't answer it fully
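For what it's worth, in the paper's formulation the even/odd split is over the embedding dimension index, not over positions: every position, even or odd, gets both sine and cosine components. A tiny check with hypothetical toy sizes:

```python
import math

d_model = 6
pos = 3   # an odd *position* still gets both sine and cosine components

# Even embedding indices i use sin, odd indices use cos, all at the same pos.
vec = [math.sin(pos / 10_000 ** (i / d_model)) if i % 2 == 0
       else math.cos(pos / 10_000 ** ((i - 1) / d_model))
       for i in range(d_model)]
```

So the code is right to feed every position through both functions; the parity only decides which slot of the vector each value lands in.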
my second doubt is that when I use BertTokenizer for example it transforms the text:
[my name is ajay] into a list of integers, for example [101, 11590, 11324, 10124, 138, 78761, 102]. Where does that part fit in? I couldn't understand that part
So I haven't shown the text encoding details just yet. :) Since 4 words were encoded into 7 numbers, I assume the BertTokenizer is encoding each subword/word piece into some number. Essentially, the tokenizer is taking the sentence, breaking it down into word pieces (7 in this case), and each is being mapped to a unique integer. Later on, you will see each number being mapped to a larger vector (I explained more about why these vectors exist in the other comment)
Can you please tell me the difference between sequence length and the dimension of the embedding?
Sequence length = maximum number of characters/words we can pass into the transformer at a time.
Dimension of embedding = size of vector representing each character / word.
@@CodeEmporium thanks a lot.
I am kind of a newbie here; if you think this is valid, please answer.
Why are you introducing parameters of dimension 512 for the vocab size, making a neural network?
I mean, what happens if we don't do that?
Why are we using 512 dimensions instead of the one-hot vector of size equal to the vocabulary size? This is because of the curse of dimensionality. Vocabulary sizes are huge (often in the tens of thousands). This is a lot for any model, neural network or not, to process. There was a 2001 paper by Yoshua Bengio, "A Neural Probabilistic Language Model", that describes exactly this issue and why dense embeddings were introduced. I would recommend giving it a read. Also, my next series will delve into the history of language models, so I hope you'll stay tuned for this. Maybe some of the design choices will become clearer.
I think that queen/king example is somewhat cherry picked, as the principle behind the analogy fails for many examples.
There is one mistake that you are making. We are not taking a single output as input to the decoder, but all the previous outputs up to the current time step as input to the decoder.
Yea that's correct from a practical standpoint. I dive into this when coding this out in the rest of this playlist "Transformers from Scratch". Hope those videos clear things up!
@@CodeEmporium Thanks for answering. I understand that you have to make a trade-off between simplicity and accuracy. Here, I just wanted to note that little more complexity would have added quite a lot more accuracy.
Your content is excellent!
I don't know anything about Python, but it looks extremely slow.