Transformer Encoder in 100 lines of code!
Source code
- Published: 2024. 04. 27.
- ABOUT ME
⭐ Subscribe: krplus.net/uCodeEmporiu...
Medium Blog: / dataemporium
💻 Github: github.com/ajhalthor
LinkedIn: / ajay-halthor-477974bb
RESOURCES
[1] Code for Video: github.com/ajhalthor/Transfor...
PLAYLISTS FROM MY CHANNEL
⭐ Transformers from scratch playlist: • Self Attention in Tran...
⭐ ChatGPT Playlist of all other videos: • ChatGPT
⭐ Transformer Neural Networks: • Natural Language Proce...
⭐ Convolutional Neural Networks: • Convolution Neural Net...
⭐ The Math You Should Know: • The Math You Should Know
⭐ Probability Theory for Machine Learning: • Probability Theory for...
⭐ Coding Machine Learning: • Code Machine Learning
MATH COURSES (7 day free trial)
Mathematics for Machine Learning: imp.i384100.net/MathML
Calculus: imp.i384100.net/Calculus
Statistics for Data Science: imp.i384100.net/AdvancedStati...
Bayesian Statistics: imp.i384100.net/BayesianStati...
Linear Algebra: imp.i384100.net/LinearAlgebra
Probability: imp.i384100.net/Probability
OTHER RELATED COURSES (7 day free trial)
⭐ Deep Learning Specialization: imp.i384100.net/Deep-Learning
Python for Everybody: imp.i384100.net/python
MLOps Course: imp.i384100.net/MLOps
Natural Language Processing (NLP): imp.i384100.net/NLP
Machine Learning in Production: imp.i384100.net/MLProduction
Data Science Specialization: imp.i384100.net/DataScience
Tensorflow: imp.i384100.net/Tensorflow
TIMESTAMP
0:00 What we will cover
0:53 Introducing Colab
1:24 Word Embeddings and d_model
3:00 What are Attention heads?
3:59 What is Dropout?
4:59 Why batch data?
7:46 How do sentences flow into the transformer?
9:03 Why feed forward layers in transformer?
9:44 Why Repeating Encoder layers?
11:00 The "Encoder" Class, nn.Module, nn.Sequential
14:38 The "EncoderLayer" Class
17:45 What is Attention: Query, Key, Value vectors
20:03 What is Attention: Matrix Transpose in PyTorch
21:17 What is Attention: Scaling
23:09 What is Attention: Masking
24:53 What is Attention: Softmax
25:42 What is Attention: Value Tensors
26:22 CRUX OF VIDEO: "MultiHeadAttention" Class
36:27 Returning the flow back to the "EncoderLayer" Class
37:12 Layer Normalization
43:17 Returning the flow back to the "EncoderLayer" Class
43:44 Feed Forward Layers
44:24 Why Activation Functions?
46:03 Finish the Flow of Encoder
48:03 Conclusion & Decoder for next video
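The attention steps listed in the timestamps above (query/key/value vectors, matrix transpose, scaling, masking, softmax, value weighting) can be sketched roughly as follows. The shapes and the zeros-mean-masked convention are illustrative assumptions on my part, not necessarily the exact code from the video:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product(q, k, v, mask=None):
    # q, k, v: [batch, num_heads, seq_len, head_dim]
    d_k = q.size(-1)
    # transpose the last two dims of k so matmul yields [.., seq_len, seq_len]
    scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(d_k)
    if mask is not None:
        # one common convention: positions where mask == 0 may not be attended to
        scores = scores.masked_fill(mask == 0, float('-inf'))
    attention = F.softmax(scores, dim=-1)  # each row sums to 1
    values = torch.matmul(attention, v)    # weighted sum of value vectors
    return values, attention

q = k = v = torch.randn(30, 8, 200, 64)    # batch=30, heads=8, seq=200, head_dim=64
out, attn = scaled_dot_product(q, k, v)
print(out.shape)  # torch.Size([30, 8, 200, 64])
```

The division by sqrt(d_k) keeps the dot products from growing with the head dimension, which would otherwise push the softmax into very peaky, hard-to-train regions.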
If you think I deserve it, please consider hitting the like button and subscribe for more content like this :)
This is the best explanation I have gone through
Superb and so love these classes! Will watch all of them one by one
I watched the entire series and it gave me a deeper understanding of how all of this works. Very well done!!!! It takes a real master to take a complex topic and break it down in such a consumable way. I do have one question: what is the point of the permute? Can we not specify the shape we want in the reshape call?
Next level video *especially* because of the dimensions laid out and giving intuition for things like k.transpose(-1, -2). Likely the best resource out right now!! Thanks for all your work!
Super glad you find this all useful!
Very clear, useful and helpful explanation! Thank you!
This video was really informative. Thank you for all the detailed explanations!
It's really helpful that you are going through all the sizes of the various vectors and matrices.
Glad it is helpful!
This is the best content on youtube
bro you're a legend!
Just amazing!!!
@CodeEmporium
The transformer series is awesome!
It is very informative.
I have one comment: it is usually recommended to perform dropout before normalization layers. This is because normalization layers may undo dropout effects by re-scaling the input. By performing dropout before normalization, we ensure that the inputs to the normalization layer are still diverse and have different scales.
I believe you mean after, right?
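For reference, here is a minimal sketch of the post-norm ordering used in the original Transformer, where dropout is applied to the sub-layer output before the residual add and the layer norm. The shapes and the `sublayer_block` helper are hypothetical, not code from the video:

```python
import torch
import torch.nn as nn

d_model, p_drop = 512, 0.1
norm = nn.LayerNorm(d_model)
dropout = nn.Dropout(p_drop)

def sublayer_block(x, sublayer):
    # dropout hits the sub-layer output *before* the add & norm,
    # so normalization sees the dropped-out activations
    return norm(x + dropout(sublayer(x)))

x = torch.randn(30, 200, d_model)  # [batch, seq_len, d_model]
out = sublayer_block(x, nn.Linear(d_model, d_model))
print(out.shape)  # torch.Size([30, 200, 512])
```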
This is the most detailed Transformer video, THANK YOU!
I have one question: the values tensor is [30, 8, 200, 64]; before we reshape it, shouldn't we permute it first? Like:
values = values.permute(0, 2, 1, 3).reshape(batch_size, max_sequence_length, self.num_heads * self.head_dim)
Thank you, I'm going through all your videos. Great work!
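To see why the permute matters, here is a small demonstration (with smaller hypothetical shapes than the video's [30, 8, 200, 64]). Flattening without first moving the head dimension next to head_dim just reinterprets memory, so each output row ends up mixing values from different sequence positions:

```python
import torch

batch, heads, seq, head_dim = 2, 8, 10, 64
values = torch.randn(batch, heads, seq, head_dim)

# Bring the head dimension next to head_dim BEFORE flattening:
correct = values.permute(0, 2, 1, 3).reshape(batch, seq, heads * head_dim)

# Reshaping directly reinterprets the memory layout and scrambles positions:
naive = values.reshape(batch, seq, heads * head_dim)

# row (b, s) of `correct` is the concatenation of all heads at position s
print(torch.equal(correct[0, 0], values[0, :, 0, :].reshape(-1)))  # True
print(torch.equal(correct, naive))  # False in general
```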
Really friendly for beginners!
Thanks a lot! Glad you found it useful
thank you!
bro... i love how u dive deep into explanations. You're a very good teacher holy shit
You are awesome. The way you teach is incredible.
Thanks so much for this compliment. Super glad you enjoyed this
Appreciate your work! As someone else mentioned, hope you can do an implementation of training the network for a few iterations.
Yea. That's the plan. I am currently working on setting the full thing up.
Thanks for the great series. Would be very helpful if you'd attach the Colab.
Wonderful explanation
Thanks a lot :)
you are awesome bro
Awesome content as always ! Are you planning to demonstrate a training example of training for the encoder for the next video ? For example on a wikipedia data sample or something like that ?
Hoping to get to that stage. I currently have the code ready, but it's a little strange during inference. For more context: I am running into a situation where it's predicting only the End of Sentence token. Planning to fix this soon and have a full overview of the transformer. But in the meantime there are so many more videos I can make on the decoder
@@CodeEmporium Oh ok i see, i'm also close to that step, i'll let you know if i find something
awesome content! thanks a lot!!
Thanks so much!
Hats Off to you for explaining such a complex topic with simplicity and understanding. Thanks a lot. Is there any course you're offering besides these awesome videos on youtube? Want to learn more concept from you.
Thanks so much for the compliments. At the moment, my best teaching resources are on KRplus. Luckily, there are hundreds of videos on the channel haha
@@CodeEmporium Thanks for the info, sir. I am a student of AI and ML interested very much in NLP. If you have any suggestions for research projects that I can pursue for my academic research. Kindly suggest. I am reading the papers one by one. If you have any interesting ideas, it would help me a lot.
Please, blow up the decoder as well hahaa !!
Thank Ajay, these videos were very helpful for me.
Hi Ajay. I think we need to make a small change in the forward() function of the encoder class. We should be doing `x_residual = x.clone()` (or `x_residual = x[:]`) instead of `x_residual = x`. This will ensure that x_residual contains a copy of the original x and is not affected by any changes made to x.
Oh interesting. I have been running into issues during training. I'll make this change and check. Thanks a ton for surfacing!
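For what it's worth, a plain reassignment like `x = self.attention(x)` creates a new tensor and does not touch the one a saved reference points to; `.clone()` only matters if a later operation mutates `x` in place. A tiny sketch of the distinction:

```python
import torch

x = torch.randn(3)
residual = x              # second name for the SAME tensor object
x = x * 2                 # rebinding: builds a NEW tensor; `residual` is untouched
print(torch.equal(residual, x))  # False

y = torch.randn(3)
residual_y = y
y.mul_(2)                 # in-place op: `residual_y` also sees the doubled values
print(residual_y is y)    # True
```

So if the encoder's forward only ever reassigns (no in-place ops like `add_()` or `mul_()`), the original `x_residual = x` should already be safe.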
thank u a lot
Amazing video series! At 39:07, why does the layer normalization consider just one dimension (the parameter shape) and not the batch size? Your previous video about layer normalization mentioned it should consider both. Am I missing something?
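On the question above: `nn.LayerNorm` computes its mean and variance only over the trailing `normalized_shape` dimensions, independently for every batch element and every position; the batch dimension is what *batch* norm averages over, not layer norm. A quick sketch with hypothetical shapes:

```python
import torch
import torch.nn as nn

d_model = 512
ln = nn.LayerNorm(d_model)          # learnable gamma/beta, each of shape [512]
x = torch.randn(30, 200, d_model)   # [batch, seq_len, d_model]

out = ln(x)
# Each (batch, position) slice is normalized on its own over d_model:
print(out.mean(-1).abs().max())     # ~0 for every slice
print(out.std(-1).mean())           # ~1 for every slice
```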
Great video!!!! Best content for transformers... Can you suggest ways to implement a transformer encoder for time series data?
Thanks!
Please do Cross Attention and maybe Attention visualizations next!
Yep! I plan to do some more videos on the decoder part too
Thanks!
Thanks for the donation and for watching!
I am looking forward to seeing whether you will try to put all the bits of the transformer together, i.e. the positional encoder before this "encoder" and then the decoder after. I wonder whether/how it will respond to the input text "My name is Ajay". Would it respond as though in a conversation "Hi, how are you" / "My name is Bot", or generate more text in the same vein, e.g. "I am 28 years old", or translate it to another language, or something else? To achieve an end-to-end use case I guess we will also need appropriate data to train the models, then actually train them, save the model weights somehow, etc. Am new to all this but your videos are gradually helping me understand more, e.g. the encoder input and output matrices being the same size to permit stacking. Thanks
This is the goal. I am constructing this transformer bit by bit and showing my findings. We will eventually have the full thing
What would be the best book to learn probability and statistics for Machine Learning?
Before any book, just take a 500-level course each on probability and linear algebra from any university's free online classes. These two topics are not truly understood through even the best explanations, only by solving problems.
Hi. Great video, but I have a question. Referring to 19:31, why is the dimension of k found using q.size()[-1]? Shouldn't it be k.size()[-1] instead? Thnx in advance :)
I understand how the forward way works, but not how the learning works. Basically all videos I have seen so far covering Transformers "only" explain the way forward, but not the training. For example I'd like to know what the loss function is.
Question 2: afaik an Encoder can work on its own and doesn't (necessarily) need a Decoder (for example for non-translation use cases). How does the training work in this case? What is the loss function here? (-> we don't have a target sentence)
If you go further into the playlist (I just uploaded the code for this in my most recent video in the playlist), it is a cross-entropy loss. We compare every character generated to the label, take the average loss, and perform backpropagation to update all weights in the network once after seeing all sentences in the batch
For your Question 2, I am not exactly sure what you are alluding to. Yes, you can just use the encoder, but depending on the task you want to solve, you'll need to define an appropriate loss. For example, BERT architectures are encoder-only architectures that may append additional feed-forward networks to solve a specific task. These architectures will also learn via backpropagation once we are able to quantify a loss.
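As a rough sketch of the cross-entropy setup described above (the vocabulary size and shapes here are hypothetical, and the random logits stand in for the transformer's per-position output):

```python
import torch
import torch.nn as nn

vocab_size, batch, seq = 1000, 30, 200
# stand-in for the network's output logits, one distribution per position
logits = torch.randn(batch, seq, vocab_size, requires_grad=True)
targets = torch.randint(0, vocab_size, (batch, seq))  # label token ids

loss_fn = nn.CrossEntropyLoss()
# CrossEntropyLoss expects [N, num_classes], so flatten batch and sequence dims
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()  # gradients flow back to every weight that produced the logits
print(loss.item())
```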
@@CodeEmporium Thank you for your reply. For Q2: my plan is to deal with/code/understand the Encoder and the Decoder parts separately, starting with the Encoder. Especially how these attention vectors develop over time, and how they actually look for a small example trained with a couple of sentences. Visualize them. See how, for example, "dog" is closer to "cat" than to, say, "screwdriver".
But I don't know what the loss function would be to train this model. Could I maybe feed the network with parts of a sentence so that it can learn how to predict the next word?
E.G. Full sentence could be: "my dog likes to chase the cat of my neighbor".
X: "my" Y: "dog"
X: "my dog" Y: "likes"
X: "my dog likes" Y: "to"
X: "my dog likes to" Y: "chase"
... and so on ...
Would this kind of training be sufficient for the network to calculate the Attention vectors?
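The (context, next-word) pairs listed above can be generated mechanically; this is the standard next-token language-modeling objective, and training on it with a cross-entropy loss is indeed how such models learn their attention weights. A tiny sketch:

```python
sentence = "my dog likes to chase the cat of my neighbor".split()

# one (context, next-word) training pair per position, as in the example above
pairs = [(sentence[:i], sentence[i]) for i in range(1, len(sentence))]
for x, y in pairs[:4]:
    print(" ".join(x), "->", y)
# my -> dog
# my dog -> likes
# my dog likes -> to
# my dog likes to -> chase
```

In practice a decoder with a causal mask computes all of these pairs in one pass over the sentence rather than as separate examples.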
Overall your explanation is great, but I'm a little confused. I could not understand the difference between positional encoding and the position-wise feed-forward network. Can anyone explain it to me?
Your code is pretty clean, though I prefer "black" code formatting.
I know it's a lazy question, but can someone tell me why multi-head is better than single-head for performing attention?
a video about how to code chatgpt in which the code is generated by chatgpt
Where did you get the 3 for 3 times 512 = 1536? Is it 3 because you have query, key, and value?
For every token (word or character), we have 3 vectors: query, key and value. Each token is represented by a 512 dimensional vector. This is encoded into the query key and value vectors that are also 512 dimensions each. Hence 3 * 512
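The reply above can be sketched as a single linear layer that produces all three vectors at once and then splits them per head, a common multi-head attention pattern. The batch/sequence shapes follow the [30, 8, 200, 64] example from the comments; the exact reshape order is my assumption, not necessarily the video's code:

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
head_dim = d_model // num_heads              # 64
qkv_layer = nn.Linear(d_model, 3 * d_model)  # 3 * 512 = 1536: q, k and v stacked

x = torch.randn(30, 200, d_model)            # [batch, seq_len, d_model]
qkv = qkv_layer(x)                           # [30, 200, 1536]
qkv = qkv.reshape(30, 200, num_heads, 3 * head_dim)
qkv = qkv.permute(0, 2, 1, 3)                # [30, 8, 200, 192]
q, k, v = qkv.chunk(3, dim=-1)               # each [30, 8, 200, 64]
print(q.shape, k.shape, v.shape)
```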
I think you forgot to pass the mask value in your MHA code. I think you need a ModuleList here and can't use nn.Sequential
I definitely need this for the decoder, and I get around it by implementing my custom "Sequential" class. I was able to run this code just fine as is, though (sorry if I missed exactly what you are alluding to)
@@CodeEmporium Ah of course - I missed that we don't need it for the encoder (and that you could implement custom nn.Sequential as opposed to a ModuleList of the Layers. Although I'm not sure which of the approaches would be nicer).
Yolov8
Did he just mimic what Andrej Karpathy was doing? Explanation not even 10% as clear as what Andrej did. So bad.