The Complete Guide to Transformer Neural Networks!

  • κ²Œμ‹œμΌ 2024. 04. 19.
  • Let's do a deep dive into the Transformer Neural Network Architecture for language translation.
    ABOUT ME
    ⭕ Subscribe: krplus.net/uCodeEmporiu...
    📚 Medium Blog: / dataemporium
    💻 Github: github.com/ajhalthor
    👔 LinkedIn: / ajay-halthor-477974bb
    RESOURCES
    [1 🔎] Transformer Architecture Image: github.com/ajhalthor/Transfor...
    [2 🔎] draw.io version of the image for clarity: github.com/ajhalthor/Transfor...
    PLAYLISTS FROM MY CHANNEL
    ⭕ Transformers from scratch playlist: • Self Attention in Tran...
    ⭕ ChatGPT Playlist of all other videos: • ChatGPT
    ⭕ Transformer Neural Networks: • Natural Language Proce...
    ⭕ Convolutional Neural Networks: • Convolution Neural Net...
    ⭕ The Math You Should Know: • The Math You Should Know
    ⭕ Probability Theory for Machine Learning: • Probability Theory for...
    ⭕ Coding Machine Learning: • Code Machine Learning
    MATH COURSES (7 day free trial)
    📕 Mathematics for Machine Learning: imp.i384100.net/MathML
    📕 Calculus: imp.i384100.net/Calculus
    📕 Statistics for Data Science: imp.i384100.net/AdvancedStati...
    📕 Bayesian Statistics: imp.i384100.net/BayesianStati...
    📕 Linear Algebra: imp.i384100.net/LinearAlgebra
    📕 Probability: imp.i384100.net/Probability
    OTHER RELATED COURSES (7 day free trial)
    📕 ⭐ Deep Learning Specialization: imp.i384100.net/Deep-Learning
    📕 Python for Everybody: imp.i384100.net/python
    📕 MLOps Course: imp.i384100.net/MLOps
    📕 Natural Language Processing (NLP): imp.i384100.net/NLP
    📕 Machine Learning in Production: imp.i384100.net/MLProduction
    📕 Data Science Specialization: imp.i384100.net/DataScience
    📕 Tensorflow: imp.i384100.net/Tensorflow
    TIMESTAMPS
    0:00 Introduction
    1:38 Transformer at a high level
    4:15 Why Batch Data? Why Fixed Length Sequence?
    6:13 Embeddings
    7:00 Positional Encodings
    7:58 Query, Key and Value vectors
    9:19 Masked Multi Head Self Attention
    14:46 Residual Connections
    15:50 Layer Normalization
    17:57 Decoder
    20:12 Masked Multi Head Cross Attention
    22:47
    24:03 Tokenization & Generating the next translated word
    26:00 Transformer Inference Example

Comments • 98

  • @CodeEmporium
    @CodeEmporium  a year ago +9

    The link to the image and its raw file are in the description. If you think I deserve it, please give this video a like and subscribe for more! If you think it's worth sharing, please do so as well. I would love to grow to 100k subscribers this year with your help :) Thank you!

    • @RanDuan-dp6oz
      @RanDuan-dp6oz a year ago

      Just gave the thumbs up! Just curious: what software did you use to draw such a wonderful diagram?

    • @junningdeng7385
      @junningdeng7385 a year ago

      Sooooo nice! Where can we find the link to the image? 😂

    • @CodeEmporium
      @CodeEmporium  a year ago +1

      Thanks! I used draw.io to draw the image.

    • @CodeEmporium
      @CodeEmporium  a year ago

      The image can be found in the description of the video on GitHub

    • @user-np2jc9km3u
      @user-np2jc9km3u 4 months ago

      But what is the source of the Kannada words that were fed into the output? How can we get those words in reality? Could you explain if you are willing to? Thank you.

  • @siddheshdandagavhal9804
    @siddheshdandagavhal9804 9 months ago +7

    Most underrated YouTuber. You explain these complex topics with such ease. Many big channels avoid explaining these topics. Really appreciate your work, man.

    • @CodeEmporium
      @CodeEmporium  8 months ago

      Thanks a lot for the kind words. I try :)

    • @ShimoriUta77
      @ShimoriUta77 3 months ago

      Bro, for real! Learning ML never felt like a possibility for me, but this guy took me by the hand and is teaching all this for free!
      I can't even thank this dude enough.

  • @menghan9260
    @menghan9260 10 months ago +4

    The way you approach this topic makes it so easy to understand, and I appreciate the pace of your talking. Best content on Transformers.

    • @CodeEmporium
      @CodeEmporium  10 months ago

      You are very welcome. And thanks so much for that super thanks. You didn’t have to, but very appreciated

  • @ianrugg
    @ianrugg a year ago +4

    Great overview! Thanks for taking the time to put all this together!

  • @Anirudh-cf3oc
    @Anirudh-cf3oc 6 months ago +2

    You are the most underrated KRplusr. This is the best video explaining Transformers completely in the most intuitive way. I started my journey with Transformers with your first Transformers video a few years ago, which was very helpful. Also, I am so happy to see an AI tutorial video using an Indian language. I really appreciate your work.

  • @ramakantshakya5478
    @ramakantshakya5478 a year ago +3

    Amazing explanations throughout the series, and top-notch content, as always. Waiting for a detailed explanation/visualisation of the backward pass in the encoder/decoder during training. I would appreciate it if you were thinking in the same way.

  • @asdfasdf71865
    @asdfasdf71865 9 months ago +1

    I like your visualization of the matrices. Those residual connections and positional embeddings were good details to mention here.

  • @ArunKumar-bp5lo
    @ArunKumar-bp5lo 5 months ago

    Love the visualization, it makes it so clear.

  • @Mr.AIFella
    @Mr.AIFella a year ago

    Your explanation is the most realistic explanation of the Transformer that I've ever seen on the internet.
    Thanks, dude.

    • @CodeEmporium
      @CodeEmporium  a year ago

      That means a lot. Thank you. Please like, subscribe, and share around if you can :)

  • @helloansuman
    @helloansuman a year ago +2

    Amazing ❤ Salute to the dedication in making this video, the visual explanation and the knowledge.

    • @CodeEmporium
      @CodeEmporium  a year ago

      Thanks so much for watching and commenting!

  • @aintgonhappen
    @aintgonhappen a year ago

    Video quality is amazing.
    Keep it up, buddy!

  • @bhashganti9483
    @bhashganti9483 3 months ago

    Awesome tutorial on applying the "transformer" architecture to language translation.
    This is my very first lesson on the topic and I will give it 5+ stars.
    Thanks, dude, you inspired me to subscribe to your channel -- my very first YouTube subscription.
    Can't thank you enough!!

    • @CodeEmporium
      @CodeEmporium  3 months ago

      Thanks for the kind words! And super glad this video was helpful. Hope you enjoy the full playlist "Transformers from scratch", of which this video is a part :)

  • @triloksachin4826
    @triloksachin4826 21 days ago

    Amazing video, keep up the good work. Thanks for this!!

  • @moseslee8761
    @moseslee8761 7 months ago

    You explain really well! I think it's quite complex, but as you explained it, it became clearer. Together with the coding video, it is extremely useful.

  • @Sneha-Sivakumar
    @Sneha-Sivakumar 5 months ago

    this was a brilliant video!! super comprehensive

  • @wireghost897
    @wireghost897 9 months ago

    Very well explained. Thank you.

  • @amiralioghli8622
    @amiralioghli8622 7 months ago +1

    Thank you so much for taking the time to code and explain the transformer model in such detail; I followed your series from zero to hero. You are amazing, and if possible please do a series on how transformers can be used for time series anomaly detection and forecasting. It is extremely needed on YouTube from someone!

  • @enrico1976
    @enrico1976 3 months ago

    That was awesome. Thank you man!!!

  • @soumilyade1057
    @soumilyade1057 a year ago

    hopefully the series is completed soon ❤️ would binge watch 😁

    • @CodeEmporium
      @CodeEmporium  a year ago +1

      Yep. Maybe 1 or 2 videos left. I am running into some issues, but I’ll probably either have them solved or just have a fun community help video. Either way, it should be good

    • @soumilyade1057
      @soumilyade1057 a year ago

      @@CodeEmporium ♥️♥️ 😌

  • @lakshman587
    @lakshman587 5 months ago +1

    Thank you so much for all these videos, I have learnt a lot from your videos!!!
    I thought you were from Tamil Nadu, but today I got to know that you are from Karnataka!!
    Where in Karnataka? I'm staying in Bangalore, would like to meet you in person!!!!!

  • @cyberpunkbuilds
    @cyberpunkbuilds a month ago

    Your Kannada written language is really beautiful!

  • @user-pu4iz8wb4d
    @user-pu4iz8wb4d a year ago

    THIS IS AMAZING, helped me a lot, thanks :)

    • @CodeEmporium
      @CodeEmporium  a year ago

      Thanks so much for watching and commenting!

  • @Diego-nw4rt
    @Diego-nw4rt a year ago

    Great channel and very useful video, thank you very much! I will watch other videos on your channel as well.
    I have a question. After you perform layer normalization and obtain an output tensor, how do you give a three-dimensional tensor as input to a feed-forward layer?
    Do you flatten the input?

  • @abirbenaissa3717
    @abirbenaissa3717 8 months ago

    Life saver, thank you

  • @codeative
    @codeative a year ago

    Very well explained 👍

    • @CodeEmporium
      @CodeEmporium  a year ago

      Thanks a ton for commenting and watching :)

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w a year ago

    Really well presented.

  • @prashantlawhatre7007
    @prashantlawhatre7007 a year ago

    Eagerly waiting for the upcoming videos in the series.

    • @CodeEmporium
      @CodeEmporium  a year ago

      Thanks! Probably just 1-2 long form video(s) more

  • @amitsingha1637
    @amitsingha1637 6 months ago

    Bro, all of my confusion vanished like a vanishing gradient.
    Thanks. Really worth it.

  • @k-c
    @k-c a year ago

    Will have to brush up my basics and then come back to this.

    • @CodeEmporium
      @CodeEmporium  a year ago +1

      Yea. This can be a lot of info. Hopefully the earlier videos in this playlist will help too

    • @k-c
      @k-c a year ago

      @@CodeEmporium Your channel is really good! Thanks for all the work.

  • @charleskangai4618
    @charleskangai4618 a month ago

    Excellent!

  • @phaZZi6461
    @phaZZi6461 a year ago

    Hi, I really love your complete model overview!
    Also, at 8:08 you mention that the difference between K, Q and V isn't very explicit to the model. What would be your personal intuitive interpretation of what a Key vector might extract/learn from an input word? I find the key concept a bit odd and wondered how the authors came up with the idea of training a Key vector (/matrix), where previous attention papers only had a value vector, which would be used in both places (K and V) of the equation.
    When I think about information retrieval concepts, where we have a search query and documents to be ranked, IIRC the intuition there is to compute a dot product to get a similarity/relevance score between them. In my mind the concept of "how relevant is each document" isn't that far off from "how much attention should I pay to each document".
    And analogously I would interpret documents to be Values, and the idea of a key seems to be absent? (Unless IR in practice computes a key for each document, basically a key_of(document)-query similarity; then I just answered the question myself.)
    Anyway, I wondered if it wouldn't be possible to simplify the attention mechanism while keeping it conceptually similar. Not sure where I should look to learn more about this.
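
    For anyone mulling over the same Q/K/V question, here is a minimal NumPy sketch of scaled dot-product attention as defined in "Attention Is All You Need" (not the video's exact code; the shapes and names are illustrative assumptions). The query-key dot product plays exactly the role of the IR relevance score described above, while the values are what actually get mixed together.

        import numpy as np

        def softmax(x, axis=-1):
            x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
            e = np.exp(x)
            return e / e.sum(axis=axis, keepdims=True)

        def scaled_dot_product_attention(Q, K, V, mask=None):
            # Q, K, V: (seq_len, d_k) -- illustrative sizes, not the video's exact tensors
            d_k = Q.shape[-1]
            scores = Q @ K.T / np.sqrt(d_k)           # query-key "relevance" scores
            if mask is not None:
                scores = scores + mask                # e.g. -inf above the diagonal for causal masking
            weights = softmax(scores, axis=-1)        # how much attention each position pays to the others
            return weights @ V                        # mix the values by those weights

        # toy example: 5 tokens, 64-dimensional head (assumed sizes)
        rng = np.random.default_rng(0)
        Q, K, V = (rng.standard_normal((5, 64)) for _ in range(3))
        print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 64)

    Roughly speaking, keeping K separate from V lets the model learn one projection for "how should this word be matched against" (keys) and another for "what content does it contribute once matched" (values); merging them, as suggested above, is possible but removes that flexibility.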

  • @ravikumarnaduvin5399
    @ravikumarnaduvin5399 a year ago

    My friend Ajay, your playlist "Transformers from scratch" is great. It was very appealing to me to see your block diagram representation. Waiting with great anticipation for the final video. Would you be able to make it available soon?

    • @CodeEmporium
      @CodeEmporium  a year ago +1

      Glad you like it! I am hitting a few roadblocks though I feel I am 99% there. I’ll make a video on this to mostly ask the community. So it should be a fun exercise for everyone too :) hoping when that is resolved, we can make a final video :D

  • @anandgupta2892
    @anandgupta2892 11 months ago

    very well 👍

  • @DanielTorres-gd2uf
    @DanielTorres-gd2uf a year ago

    Damn, could've used this a few weeks ago for my OMSCS quiz. Solid review though, nice job!

  • @davefaulkner6302
    @davefaulkner6302 a month ago

    Fantastic lecture. The attention layers and their inter-relationships are very well explained. Thank you. However, this and other videos gloss over the use of the fully-connected layers following the attention layer. Using FC layers with language model embeddings makes little sense to me. Are there 512x50 inputs to the FC, i.e., is the input sentence simply flattened as input to the FC layer?

  • @naveenrs7460
    @naveenrs7460 a year ago +1

    Lovely brother. I am your Neighbour Tamizhan. Lovely brotherhood

  • @josephfemia8496
    @josephfemia8496 a year ago +1

    If I can recommend next steps for this series, going into BERT, GPT, and DETR would be lovely extensions.

    • @CodeEmporium
      @CodeEmporium  a year ago +2

      I was kind of thinking the same! For now, I have videos on BERT and GPT on the channel if you haven't checked them out. But an architecture deep dive would be fun too :)

    • @RanDuan-dp6oz
      @RanDuan-dp6oz a year ago

      @@CodeEmporium Yes, that will be super fun! Also, it would be great if you could introduce how an ML practitioner could fine-tune based on these complex models.

  • @sarahgh8756
    @sarahgh8756 2 months ago

    Thank you for all the videos about Transformers. Although I understood the architecture, I still don't know what to set as the input of the decoder (embedded target) and the mask for the TEST phase.

  • @fayezalhussein7115
    @fayezalhussein7115 a year ago

    amaaazing

  • @capyk5455
    @capyk5455 a year ago

    Amazing

  • @joegarcia8935
    @joegarcia8935 a year ago

    Thanks!

    • @CodeEmporium
      @CodeEmporium  a year ago

      You are super welcome! I appreciate the donation! Thanks!

  • @whiteroadism
    @whiteroadism a year ago

    Great video. At 12:09, how does dividing all the numbers by 8 ensure the small values are not too small and the large values are not too large? Wouldn't dividing by 8 just make every number 8 times smaller?
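
    The division happens before the softmax, so what matters is the spread of the scores, not their absolute size: dot products of roughly unit-variance vectors with 64 components have a standard deviation of about √64 = 8, and dividing by 8 brings that spread back to about 1 so the softmax doesn't saturate into a near one-hot distribution. A quick NumPy check, assuming the head dimension is 64 (which dividing by 8 = √64 suggests):

        import numpy as np

        rng = np.random.default_rng(0)
        d_k = 64                                  # assumed head dimension (8 = sqrt(64))
        q = rng.standard_normal((10000, d_k))     # many random unit-variance query vectors
        k = rng.standard_normal((10000, d_k))     # many random unit-variance key vectors

        raw = (q * k).sum(axis=1)                 # raw dot products
        scaled = raw / np.sqrt(d_k)               # divided by 8

        print(raw.std(), scaled.std())            # ~8.0 vs ~1.0

    Since every score is shrunk by the same factor, the ranking of the keys is unchanged; only the gaps between scores shrink, which keeps the softmax weights smooth and the gradients usable.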

  • @abulfahadsohail466
    @abulfahadsohail466 a year ago

    Please, can you apply the transformer you have built to text summarisation? It would be really helpful.

  • @rafaelgp9072
    @rafaelgp9072 11 months ago

    A video like this explaining the LLaMA model would be nice.

  • @wishIKnewHowToLove
    @wishIKnewHowToLove a year ago

    concise

  • @markusnascimento210
    @markusnascimento210 11 months ago

    Very good. In general, articles don't show the dimensions when explaining. It helps a lot. Thanks.

  • @CyKeulz
    @CyKeulz a year ago

    Great! Still a bit too hard for me, but I still learned stuff.
    Question: would it be possible to use the same encoder across multiple languages, without retraining it after the first time, I mean?

    • @CodeEmporium
      @CodeEmporium  a year ago

      I hope the full playlist "Transformers from scratch" helps with pacing this.
      To your second question: this is a simple transformer neural network and not the typical language model like BERT/GPT. The transformer on its own doesn't typically make use of transfer learning, so some retraining will be required. That said, if you were using the language models, then you might just need to fine-tune your parameters to the target language (which is technically training). Or if you go the GPT-3 route, you could get away without fine-tuning and use meta-learning techniques instead.

  • @paragbhardwaj5753
    @paragbhardwaj5753 a year ago

    Do a video on this new model called RWKV-LM.

  • @colinmaharaj50
    @colinmaharaj50 8 months ago

    Can this be done in pure C++?

  • @susmitjaiswal136
    @susmitjaiswal136 11 months ago

    What is the use of the feed-forward network in the Transformer? Please answer.

  • @gabrielnilo6101
    @gabrielnilo6101 10 months ago

    11:08 I am sorry if I am wrong, but isn't the transposed K matrix 50x30x64?
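
    I can't confirm the exact shapes shown on screen, but in a batched attention computation the transpose normally swaps only the last two axes (sequence length and head dimension) and leaves the batch axis in place. Assuming, purely for illustration, a batch of 30 sentences, 50 tokens each, and 64-dimensional heads:

        import numpy as np

        batch, seq_len, d_k = 30, 50, 64          # assumed sizes, for illustration only
        Q = np.random.randn(batch, seq_len, d_k)  # (30, 50, 64)
        K = np.random.randn(batch, seq_len, d_k)  # (30, 50, 64)

        K_T = K.transpose(0, 2, 1)                # swap only the last two axes -> (30, 64, 50)
        scores = Q @ K_T / np.sqrt(d_k)           # batched matmul -> (30, 50, 50) attention scores
        print(K_T.shape, scores.shape)

    So under those assumed sizes the batch of 30 stays in front; only the 50 and 64 trade places.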

  • @anwarulislam6823
    @anwarulislam6823 a year ago

    Is a multi-head attention process possible with the human brain, without a BCI?

  • @venkideshk2413
    @venkideshk2413 a year ago

    Masked multi-head attention is for the decoder, right? Is that a typo in your encoder architecture?

  • @erikschmidt3067
    @erikschmidt3067 a year ago

    What's in the feed-forward layers? Just an input and output layer? Are there hidden layers? What are the sizes of the layers?

    • @CodeEmporium
      @CodeEmporium  a year ago

      Feed-forward layers are hidden layers. Each is essentially 2,048 neurons in size. You can think of it as mapping a 512-dimensional vector to a 2,048-dimensional vector, and then mapping the 2,048-dimensional vector back to 512 dimensions. All of this to capture additional information about the word.
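
      To make that concrete (and to address the flattening question asked earlier in the thread): the feed-forward block is applied position-wise, i.e. the same 512 → 2048 → 512 mapping is applied independently to every token, so the three-dimensional tensor is never flattened. A minimal PyTorch sketch, with the batch and sequence sizes being assumptions:

          import torch
          import torch.nn as nn

          d_model, d_ff = 512, 2048                 # sizes quoted in the reply above

          ffn = nn.Sequential(
              nn.Linear(d_model, d_ff),             # 512 -> 2048
              nn.ReLU(),
              nn.Linear(d_ff, d_model),             # 2048 -> 512
          )

          x = torch.randn(30, 50, d_model)          # (batch, seq_len, d_model) -- assumed sizes
          y = ffn(x)                                # nn.Linear acts on the last dimension only
          print(y.shape)                            # torch.Size([30, 50, 512]) -- no flattening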

  • @user-np2jc9km3u
    @user-np2jc9km3u 4 months ago

    But what is the source of the Kannada words that were fed into the output? How can we get those words in reality? Could someone explain if you are willing to? Thank you.

  • @samurock100
    @samurock100 2 months ago

    1kth like

  • @TheTimtimtimtam
    @TheTimtimtimtam a year ago

    First :)

  • @jamesroy9027
    @jamesroy9027 9 months ago

    The background music creates a lot of disturbance, especially that pop-out sound; otherwise, the content delivery is the best.

  • @creativeuser9086
    @creativeuser9086 11 months ago +1

    So you're from the Silicon Valley of India. We all know it.

  • @wintobisakul1848
    @wintobisakul1848 a year ago

    Amazing, fluent in English, you speak like a native speaker.

    • @CodeEmporium
      @CodeEmporium  a year ago +1

      I am a native English speaker, but I’ve lived a good amount of my adolescence and early adult life in India

    • @wintobisakul1848
      @wintobisakul1848 a year ago

      @@CodeEmporium Wow, so that means you also speak the Indian dialect, which I assume makes you fluent in three languages?

    • @wintobisakul1848
      @wintobisakul1848 a year ago

      I truly appreciate your explanation regarding content, tone, accent, and other related aspects.