Transformer Encoder in 100 lines of code!

  • Posted 2024. 04. 27.
  • ABOUT ME
    ⭕ Subscribe: krplus.net/uCodeEmporiu...
    📚 Medium Blog: / dataemporium
    💻 Github: github.com/ajhalthor
    👔 LinkedIn: / ajay-halthor-477974bb
    RESOURCES
    [1 🔎] Code for Video: github.com/ajhalthor/Transfor...
    PLAYLISTS FROM MY CHANNEL
    ⭕ Transformers from scratch playlist: • Self Attention in Tran...
    ⭕ ChatGPT Playlist of all other videos: • ChatGPT
    ⭕ Transformer Neural Networks: • Natural Language Proce...
    ⭕ Convolutional Neural Networks: • Convolution Neural Net...
    ⭕ The Math You Should Know: • The Math You Should Know
    ⭕ Probability Theory for Machine Learning: • Probability Theory for...
    ⭕ Coding Machine Learning: • Code Machine Learning
    MATH COURSES (7 day free trial)
    📕 Mathematics for Machine Learning: imp.i384100.net/MathML
    📕 Calculus: imp.i384100.net/Calculus
    📕 Statistics for Data Science: imp.i384100.net/AdvancedStati...
    📕 Bayesian Statistics: imp.i384100.net/BayesianStati...
    📕 Linear Algebra: imp.i384100.net/LinearAlgebra
    📕 Probability: imp.i384100.net/Probability
    OTHER RELATED COURSES (7 day free trial)
    📕 ⭐ Deep Learning Specialization: imp.i384100.net/Deep-Learning
    📕 Python for Everybody: imp.i384100.net/python
    📕 MLOps Course: imp.i384100.net/MLOps
    📕 Natural Language Processing (NLP): imp.i384100.net/NLP
    📕 Machine Learning in Production: imp.i384100.net/MLProduction
    📕 Data Science Specialization: imp.i384100.net/DataScience
    📕 Tensorflow: imp.i384100.net/Tensorflow
    TIMESTAMPS
    0:00 What we will cover
    0:53 Introducing Colab
    1:24 Word Embeddings and d_model
    3:00 What are Attention heads?
    3:59 What is Dropout?
    4:59 Why batch data?
    7:46 How to pass sentences into the transformer?
    9:03 Why feed forward layers in transformer?
    9:44 Why Repeating Encoder layers?
    11:00 The "Encoder" Class, nn.Module, nn.Sequential
    14:38 The "EncoderLayer" Class
    17:45 What is Attention: Query, Key, Value vectors
    20:03 What is Attention: Matrix Transpose in PyTorch
    21:17 What is Attention: Scaling
    23:09 What is Attention: Masking
    24:53 What is Attention: Softmax
    25:42 What is Attention: Value Tensors
    26:22 CRUX OF VIDEO: "MultiHeadAttention" Class
    36:27 Returning the flow back to "EncoderLayer" Class
    37:12 Layer Normalization
    43:17 Returning the flow back to "EncoderLayer" Class
    43:44 Feed Forward Layers
    44:24 Why Activation Functions?
    46:03 Finish the Flow of Encoder
    48:03 Conclusion & Decoder for next video

Comments • 67

  • @CodeEmporium
    @CodeEmporium  a year ago +21

    If you think I deserve it, please consider hitting the like button and subscribe for more content like this :)

  • @surajgorai618
    @surajgorai618 a year ago +2

    This is the best explanation I have gone through

  • @jingcheng2602
    @jingcheng2602 2 months ago

    Superb and so love these classes! Will watch all of them one by one

  • @danielbrooks6246
    @danielbrooks6246 a year ago

    I watched the entire series and it gave me a deeper understanding on how all of this works. Very well done!!!! Takes a real master to take a complex topic and break it down in such a consumable way. I do have one question: What is the point of the permute? Can we not specify the shape we want in the reshape call?

  • @sushantmehta7789
    @sushantmehta7789 a year ago +3

    Next level video *especially* because of the dimensions laid out and giving intuition for things like k.transpose(-1, -2). Likely the best resource out right now!! Thanks for all your work!
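
    For readers following along, a quick shape check of what k.transpose(-1, -2) does. This is a sketch using the running example's dimensions, not the notebook's exact code:

      import torch

      q = torch.randn(30, 8, 200, 64)   # [batch, heads, seq_len, head_dim]
      k = torch.randn(30, 8, 200, 64)

      # transpose(-1, -2) swaps only the last two axes; batch and head axes stay put.
      print(k.transpose(-1, -2).shape)          # torch.Size([30, 8, 64, 200])

      # So q @ k^T compares every query position with every key position, per head.
      print((q @ k.transpose(-1, -2)).shape)    # torch.Size([30, 8, 200, 200])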

  • @gigabytechanz9646
    @gigabytechanz9646 a year ago

    Very clear, useful and helpful explanation! Thank you!

  • @seyedmatintavakoliafshari8272
    @seyedmatintavakoliafshari8272 2 months ago

    This video was really informative. Thank you for all the detailed explanations!

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w a year ago +1

    It's really helpful that you are going through all the sizes of the various vectors and matrices.

  • @DeanLa
    @DeanLa 10 months ago

    This is the best content on youtube

  • @FAHMIAYARI
    @FAHMIAYARI 10 months ago

    bro you're a legend!

  • @Zero-ss6pn
    @Zero-ss6pn 2 months ago

    Just amazing!!!

  • @salemibrahim2933
    @salemibrahim2933 a year ago +1

    @CodeEmporium
    The transformer series is awesome!
    It is very informative.
    I have one comment: it is usually recommended to perform dropout before normalization layers. This is because normalization layers may undo dropout effects by re-scaling the input. By performing dropout before normalization, we ensure that the inputs to the normalization layer are still diverse and have different scales.
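
    A minimal sketch of the two orderings being discussed. The layer names and shapes here are illustrative, not the video's exact code:

      import torch
      import torch.nn as nn

      d_model = 512
      x = torch.randn(30, 200, d_model)    # residual input  [batch, seq_len, d_model]
      sub = torch.randn(30, 200, d_model)  # sublayer output (e.g. attention or FFN)

      norm = nn.LayerNorm(d_model)
      drop = nn.Dropout(p=0.1)

      # Dropout applied before the normalization layer (what the comment recommends):
      out_a = norm(x + drop(sub))

      # Dropout applied after the normalization layer (the ordering being cautioned against):
      out_b = drop(norm(x + sub))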

  • @user-yk2bh8ns5y
    @user-yk2bh8ns5y a year ago +7

    This is the most detailed Transformer video, THANK YOU!
    I have one question: the values tensor is [30, 8, 200, 64]; before we reshape it, shouldn't we permute it first? Like:
    values = values.permute(0, 2, 1, 3).reshape(batch_size, max_sequence_length, self.num_heads * self.head_dim)
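
    A small sketch using the shapes from the video ([30, 8, 200, 64]) to illustrate the commenter's point; variable names are illustrative:

      import torch

      batch_size, num_heads, max_sequence_length, head_dim = 30, 8, 200, 64
      values = torch.randn(batch_size, num_heads, max_sequence_length, head_dim)

      # Permute so that, for each token, the head and head_dim axes sit next to
      # each other, then flatten them into one 512-dim vector per token.
      recombined = values.permute(0, 2, 1, 3).reshape(
          batch_size, max_sequence_length, num_heads * head_dim
      )                                          # [30, 200, 512]

      # Reshaping without the permute also gives [30, 200, 512], but it mixes
      # values from different token positions into the same row.
      naive = values.reshape(batch_size, max_sequence_length, num_heads * head_dim)
      print(torch.equal(recombined, naive))      # False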

  • @pierrelebreton7634
    @pierrelebreton7634 a year ago

    Thank you, I'm going through all your videos. Great work!

  • @xingfenyizhen
    @xingfenyizhen 8 months ago +2

    Really friendly for beginners!

    • @CodeEmporium
      @CodeEmporium  8 months ago

      Thanks a lot! Glad you found it useful

  • @user-ut2xu8eb7c
    @user-ut2xu8eb7c 7 months ago

    thank you!

  • @moseslee8761
    @moseslee8761 7 months ago

    bro... i love how u dive deep into explanations. You're a very good teacher holy shit

  • @user-ul2mw6fu2e
    @user-ul2mw6fu2e 4 months ago

    You are awesome. The way you teach is incredible.

    • @CodeEmporium
      @CodeEmporium  4 months ago +1

      Thanks so much for this compliment. Super glad you enjoyed this

  • @KurtGr
    @KurtGr a year ago

    Appreciate your work! As someone else mentioned, hope you can do an implementation of training the network for a few iterations.

    • @CodeEmporium
      @CodeEmporium  a year ago

      Yea. That's the plan. I am currently working on setting the full thing up.

  • @chenmargalit7375
    @chenmargalit7375 9 months ago

    Thanks for the great series. Would be very helpful if you'd attach the Colab.

  • @nallarajeshkumar9036
    @nallarajeshkumar9036 8 months ago

    Wonderful explanation

  • @ramanshariati5738
    @ramanshariati5738 9 months ago

    you are awesome bro

  • @TransalpDave
    @TransalpDave a year ago +1

    Awesome content as always! Are you planning to demonstrate a training example for the encoder in the next video? For example, on a Wikipedia data sample or something like that?

    • @CodeEmporium
      @CodeEmporium  a year ago +1

      Hoping to get to that stage. I currently have the code ready but it's a lil strange during inference. For more context: I am running into a situation where it's predicting the End of Sentence token only. Planning to fix this soon and have a full overview of the transformer soon. But in the meantime there are so many more videos I can make on the decoder

    • @TransalpDave
      @TransalpDave a year ago

      @@CodeEmporium Oh ok i see, i'm also close to that step, i'll let you know if i find something

  • @li-pingho1441
    @li-pingho1441 a year ago

    awesome content! thanks a lot!!

  • @dwarakanathchandra7611
    @dwarakanathchandra7611 7 months ago

    Hats off to you for explaining such a complex topic with simplicity and understanding. Thanks a lot. Is there any course you're offering besides these awesome videos on YouTube? Want to learn more concepts from you.

    • @CodeEmporium
      @CodeEmporium  7 months ago

      Thanks so much for the compliments. At the moment, my best teaching resources are on KRplus. Luckily, there are hundreds of videos on the channel haha

    • @dwarakanathchandra7611
      @dwarakanathchandra7611 7 months ago

      @@CodeEmporium Thanks for the info, sir. I am a student of AI and ML interested very much in NLP. If you have any suggestions for research projects that I can pursue for my academic research, kindly suggest them. I am reading the papers one by one. If you have any interesting ideas, it would help me a lot.

  • @cmacompilation4649
    @cmacompilation4649 a year ago

    Please blow up the decoder as well hahaa!!
    Thanks Ajay, these videos were very helpful for me.

  • @prashantlawhatre7007
    @prashantlawhatre7007 a year ago +2

    Hi Ajay. I think we need to make a small change in the forward() function of the encoder class. We should be doing `x_residual = x.clone() # or x_residual = x[:]` instead of `x_residual = x`. This will ensure that x_residual contains a copy of the original x and is not affected by any changes made to x.
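
    For context, a minimal sketch of the residual pattern being discussed; the layer names are hypothetical stand-ins, not the video's EncoderLayer:

      import torch
      import torch.nn as nn

      class ResidualSketch(nn.Module):
          def __init__(self, d_model=512):
              super().__init__()
              self.ff = nn.Linear(d_model, d_model)   # stand-in for attention / FFN
              self.norm = nn.LayerNorm(d_model)

          def forward(self, x):
              # Explicit copy, as the comment suggests. A plain `residual = x` also
              # works as long as later steps reassign x rather than mutate it in place.
              residual = x.clone()
              x = self.ff(x)
              return self.norm(x + residual)

      out = ResidualSketch()(torch.randn(30, 200, 512))   # [30, 200, 512]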

    • @CodeEmporium
      @CodeEmporium  a year ago +1

      Oh interesting. I have been running into issues during training. I'll make this change and check. Thanks a ton for surfacing!

  • @user-im5qb7ix6e
    @user-im5qb7ix6e 11 months ago

    thank u a lot

  • @qingjieqi3379
    @qingjieqi3379 a year ago

    Amazing video series! At 39:07, why does the layer normalization only consider one dimension (the parameter shape) and not the batch size? Your previous video about layer normalization mentioned it should consider both. Am I missing something?
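
    For reference, a small sketch of layer normalization over just the last dimension, which is what nn.LayerNorm(d_model) does; whether it matches the video's custom class exactly is not verified here:

      import torch
      import torch.nn as nn

      batch, seq, d_model = 30, 200, 512
      x = torch.randn(batch, seq, d_model)

      # Each token's 512 features are normalized independently; the batch and
      # sequence dimensions are never mixed, so they don't appear in the
      # parameter shape.
      ln = nn.LayerNorm(d_model)
      y = ln(x)

      # Equivalent manual computation, per token:
      mean = x.mean(dim=-1, keepdim=True)                 # [30, 200, 1]
      var = x.var(dim=-1, unbiased=False, keepdim=True)   # [30, 200, 1]
      manual = (x - mean) / torch.sqrt(var + ln.eps) * ln.weight + ln.bias
      print(torch.allclose(y, manual, atol=1e-5))         # True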

  • @chrisillas3010
    @chrisillas3010 a year ago

    Great video!!!! Best content for transformers... Can you suggest ways to implement a transformer encoder for time series data?

  • @-mwolf
    @-mwolf a year ago

    Thanks!
    Please do Cross Attention and maybe Attention visualizations next!

    • @CodeEmporium
      @CodeEmporium  a year ago +1

      Yep! I plan to do some more videos on the decoder part too

  • @RanDuan-dp6oz
    @RanDuan-dp6oz 11 months ago +1

    Thanks!

    • @CodeEmporium
      @CodeEmporium  11 months ago

      Thanks for the donation and for watching!

  • @GIChow
    @GIChow a year ago +1

    I am looking forward to seeing whether you will try to put all the bits of the transformer together, i.e. the positional encoder before this "encoder" and then the decoder after. I wonder whether/how it will respond to the input text "My name is Ajay". Would it respond as though in a conversation "Hi, how are you" / "My name is Bot", or generate more text in the same vein e.g. "I am 28 years old", or translate it to another language, or something else? To achieve an end-to-end use case I guess we will also need appropriate data to be able to train the models and then actually train the models, save the model weights somehow, etc. Am new to all this but your videos are gradually helping me understand more, e.g. the encoder input and output matrices being of the same size to permit stacking. Thanks 👍

    • @CodeEmporium
      @CodeEmporium  a year ago +2

      This is the goal. I am constructing this transformer bit by bit and showing my findings. We will eventually have the full thing

  • @convolutionalnn2582
    @convolutionalnn2582 a year ago +2

    What would be the best book to learn probability and statistics for Machine Learning?

    • @linkinlinkinlinkin654
      @linkinlinkinlinkin654 a year ago

      Before any book, just take a 500-level course on probability and linear algebra, each from any university's free online classes. These two topics are not truly understood through even the best explanations, only by solving problems.

  • @eekinchan6620
    @eekinchan6620 6 months ago

    Hi. Great video but I have a question. Referring to 19:31, why is the dimension of k found using q.size()[-1]? Shouldn't it be k.size()[-1] instead? Thanks in advance :)
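
    A minimal sketch of the scaled dot-product step, assuming the shapes used in the video. Since q and k are both projected to head_dim = 64, q.size()[-1] and k.size()[-1] give the same number, which is presumably why either works:

      import math
      import torch
      import torch.nn.functional as F

      batch, heads, seq, head_dim = 30, 8, 200, 64
      q = torch.randn(batch, heads, seq, head_dim)
      k = torch.randn(batch, heads, seq, head_dim)
      v = torch.randn(batch, heads, seq, head_dim)

      d_k = q.size()[-1]                                    # == k.size()[-1] == 64

      scores = q @ k.transpose(-1, -2) / math.sqrt(d_k)     # [30, 8, 200, 200]
      attention = F.softmax(scores, dim=-1)
      values = attention @ v                                # [30, 8, 200, 64]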

  • @hermannangstl1904
    @hermannangstl1904 a year ago

    I understand how the forward pass works, but not how the learning works. Basically all the videos I have seen so far covering Transformers "only" explain the forward pass, not the training. For example, I'd like to know what the loss function is.
    Question 2: afaik an Encoder can work on its own and doesn't (necessarily) need a Decoder (for example for non-translation use cases). How does the training work in this case? What is the loss function here? (-> we don't have a target sentence)

    • @CodeEmporium
      @CodeEmporium  a year ago

      If you go further into the playlist (I just uploaded the code for this in my most recent video in the playlist), it is a cross entropy loss. We compare every character generated to the label, take the average loss, and perform backpropagation to update all weights in the network once after seeing all sentences in the batch.
      For your Question 2, I am not exactly sure what you are alluding to. Yes, you can just use the encoder, but depending on the task you want to solve, you'll need to define an appropriate loss. For example, BERT architectures are encoder-only architectures that may append additional feed forward networks to solve a specific task. These architectures will also learn via backpropagation once we are able to quantify a loss.
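
      A rough sketch of the kind of token-level cross-entropy described above; vocabulary size and shapes are made up for illustration, and this is not the channel's training code:

        import torch
        import torch.nn as nn

        vocab_size, batch, seq = 100, 30, 200

        # Model output: one score per vocabulary entry at every position.
        logits = torch.randn(batch, seq, vocab_size, requires_grad=True)
        # Ground-truth token ids for every position.
        targets = torch.randint(0, vocab_size, (batch, seq))

        # CrossEntropyLoss expects [N, C], so flatten batch and sequence together;
        # the loss is then averaged over every predicted token in the batch.
        criterion = nn.CrossEntropyLoss()
        loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
        loss.backward()   # gradients for backpropagation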

    • @hermannangstl1904
      @hermannangstl1904 a year ago

      @@CodeEmporium Thank you for your reply. For Q2: My plan is to deal with/code/understand the Encoder and the Decoder parts separately, starting with the Encoder. Especially how these attention vectors develop over time. How they actually look for a small example, trained with a couple of sentences. Visualize them. See how, for example, "dog" is closer to "cat" than to, for example, "screwdriver".
      But I don't know what the loss function would be to train this model. Could I maybe feed the network with parts of a sentence so that it can learn how to predict the next word?
      E.G. Full sentence could be: "my dog likes to chase the cat of my neighbor".
      X: "my" Y: "dog"
      X: "my dog" Y: "likes"
      X: "my dog likes" Y: "to"
      X: "my dog likes to" Y: "chase"
      ... and so on ...
      Would this kind of training be sufficient for the network to calculate the Attention vectors?
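
      A small sketch of the prefix / next-word pairs listed above, just to make the data construction concrete (plain Python, not a full training setup):

        sentence = "my dog likes to chase the cat of my neighbor".split()

        # Build (prefix, next word) training pairs exactly as listed above.
        pairs = [(sentence[:i], sentence[i]) for i in range(1, len(sentence))]

        for x, y in pairs[:4]:
            print(f"X: {' '.join(x):<20} Y: {y}")
        # X: my                   Y: dog
        # X: my dog               Y: likes
        # X: my dog likes         Y: to
        # X: my dog likes to      Y: chase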

  • @amiralioghli8622
    @amiralioghli8622 8 months ago

    Overall your explanation is great, but I'm a little confused. I could not understand the difference between positional encoding and the position-wise feed-forward network. Can anyone explain it to me?

  • @froozynoobfan
    @froozynoobfan a year ago

    Your code is pretty clean, though I'd prefer "black" code formatting.

  • @creativeuser9086
    @creativeuser9086 11 months ago +1

    I know it's a lazy question, but can someone tell me why multi-head is better than single-head for performing attention?

  • @godly_wisdom777
    @godly_wisdom777 a year ago

    A video about how to code ChatGPT in which the code is generated by ChatGPT

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w a year ago

    Where did you get the 3 in 3 × 512 = 1536? Is it 3 because you have query, key, and value?

    • @CodeEmporium
      @CodeEmporium  a year ago +1

      For every token (word or character), we have 3 vectors: query, key and value. Each token is represented by a 512-dimensional vector. This is encoded into the query, key and value vectors, which are also 512 dimensions each. Hence 3 * 512
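
      A minimal sketch of where the 3 × 512 = 1536 comes from, using a single fused projection; the exact layer names in the video may differ:

        import torch
        import torch.nn as nn

        d_model = 512
        batch, seq = 30, 200

        # One linear layer produces the query, key and value vectors for every
        # token in a single matrix multiply: 3 vectors x 512 dims = 1536 outputs.
        qkv_layer = nn.Linear(d_model, 3 * d_model)

        x = torch.randn(batch, seq, d_model)
        qkv = qkv_layer(x)                # [30, 200, 1536]
        q, k, v = qkv.chunk(3, dim=-1)    # each [30, 200, 512]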

  • @-mwolf
    @-mwolf a year ago

    I think you forgot to pass the mask value in your MHA code. I think here you need a ModuleList and can't use nn.Sequential.

    • @CodeEmporium
      @CodeEmporium  a year ago +1

      I definitely need this for the decoder, and I get around this by implementing my custom "Sequential" class. I was able to run this code just fine as is, though (sorry if I missed exactly what you are alluding to)

    • @-mwolf
      @-mwolf a year ago

      @@CodeEmporium Ah of course - I missed that we don't need it for the encoder (and that you could implement custom nn.Sequential as opposed to a ModuleList of the Layers. Although I'm not sure which of the approaches would be nicer).
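
      For readers wondering about the limitation discussed here: nn.Sequential.forward only passes a single input along, so a mask has to be threaded through either a small custom Sequential subclass or a loop over an nn.ModuleList. A sketch of both options (not the channel's exact implementation; the layers themselves are assumed to accept (x, mask)):

        import torch.nn as nn

        class SequentialWithMask(nn.Sequential):
            """Like nn.Sequential, but forwards an extra mask argument to each layer."""
            def forward(self, x, mask=None):
                for layer in self:
                    x = layer(x, mask)
                return x

        class EncoderStack(nn.Module):
            """Equivalent alternative using nn.ModuleList and an explicit loop."""
            def __init__(self, layers):
                super().__init__()
                self.layers = nn.ModuleList(layers)

            def forward(self, x, mask=None):
                for layer in self.layers:
                    x = layer(x, mask)
                return x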

  • @vigneshvicky6720
    @vigneshvicky6720 8 months ago

    Yolov8

  • @user-kz2es8sg3f
    @user-kz2es8sg3f 7 months ago

    Did he just mimic what Andrej Karpathy was doing? The explanation is not even 10% as clear as what Andrej did. So bad.