Layer Normalization - EXPLAINED (in Transformer Neural Networks)

  • κ²Œμ‹œμΌ 2024. 04. 26.
  • Lets talk about Layer Normalization in Transformer Neural Networks!
    ABOUT ME
    ⭕ Subscribe: krplus.net/uCodeEmporiu...
    📚 Medium Blog: / dataemporium
    💻 Github: github.com/ajhalthor
    👔 LinkedIn: / ajay-halthor-477974bb
    RESOURCES
    [1 🔎] Code for video: github.com/ajhalthor/Transfor...
    [2 🔎] The paper that introduced the concept: arxiv.org/pdf/1607.06450.pdf
    [3 🔎] Layer normalization in transformer architecture: arxiv.org/pdf/2002.04745.pdf
    [4 🔎] Batch Normalization underperforms with NLP tasks. Reasons are empirical: arxiv.org/pdf/2003.07845.pdf
    [5 🔎] Residual Connections minimize vanishing gradients: stats.stackexchange.com/quest...
    [6 🔎] Transformer Main Paper: arxiv.org/abs/1706.03762
    PLAYLISTS FROM MY CHANNEL
    ⭕ ChatGPT Playlist of all other videos: • ChatGPT
    ⭕ Transformer Neural Networks: • Natural Language Proce...
    ⭕ Convolutional Neural Networks: • Convolution Neural Net...
    ⭕ The Math You Should Know: • The Math You Should Know
    ⭕ Probability Theory for Machine Learning: • Probability Theory for...
    ⭕ Coding Machine Learning: • Code Machine Learning
    MATH COURSES (7 day free trial)
    📕 Mathematics for Machine Learning: imp.i384100.net/MathML
    📕 Calculus: imp.i384100.net/Calculus
    📕 Statistics for Data Science: imp.i384100.net/AdvancedStati...
    📕 Bayesian Statistics: imp.i384100.net/BayesianStati...
    📕 Linear Algebra: imp.i384100.net/LinearAlgebra
    📕 Probability: imp.i384100.net/Probability
    OTHER RELATED COURSES (7 day free trial)
    📕 ⭐ Deep Learning Specialization: imp.i384100.net/Deep-Learning
    📕 Python for Everybody: imp.i384100.net/python
    📕 MLOps Course: imp.i384100.net/MLOps
    📕 Natural Language Processing (NLP): imp.i384100.net/NLP
    📕 Machine Learning in Production: imp.i384100.net/MLProduction
    📕 Data Science Specialization: imp.i384100.net/DataScience
    📕 Tensorflow: imp.i384100.net/Tensorflow
    TIMESTAMPS
    0:00 Transformer Encoder Overview
    0:56 "Add & Norm": Transformer Encoder Deep Dive
    5:13 Layer Normalization: What & why
    7:33 Layer Normalization: Working out the math by hand
    12:10 Final Coded Class

Comments • 49

  • @superghettoindian01 • a year ago • +1

    Another great video - I like the structure you use of summarising the concept and then diving into the implementation. The code really helps bring it together as some others have commented. I look forward to seeing more of this series and would love to see a longer video of you deploying a transformer on some dummy data (perhaps you already have one - still going through the content)!

    • @CodeEmporium • a year ago • +1

      Thanks so much for commenting on all the videos! I really appreciate it. And yea, going to be introducing the code behind the encoder and decoder in the coming sessions!

    • @superghettoindian01 • a year ago

      @@CodeEmporium I love your content, so I will do what I can to comment, like and spread!

  • @yangkewen • 4 months ago

    Very clear and sound explanation of a complex concept, thumbs up for the hard work!

  • @minorinxx • 6 months ago • +2

    the diagram is GOLD

  • @saahilnayyer6865 • 9 months ago

    Nice series on transformers. Really liked it. Btw, interesting design choice for the video to use a landscape layout of the transformer architecture during the intro :D

  • @_.hello._.world_ • a year ago • +3

    Great! Since you’re covering transformer components, I would love to see TransformerXL and RelativePositionalEmbedding concepts explained in the upcoming videos! ☺️

    • @MsFearco • a year ago

      I can help with relpos: it's the same as the usual positional encoding, but instead of being fixed, the positions are encoded relative to each other, pairwise.

  • @ashutoshtripathi5699 • 2 months ago

    best explanation ever!

  • @jbca • 11 months ago

    I really like your voice and delivery. It’s quite reassuring, which is nice when the subject of the videos can be pretty complicated.

    • @CodeEmporium • 11 months ago

      Thanks so much! Very glad you liked this. There will be more to come

  • @caiyu538 • 6 months ago

    Great lectures.

  • @Ibrahimnada1995 • 9 months ago

    Thanks man,
    you deserve more than a like

    • @CodeEmporium • 9 months ago

      Thanks a ton for the kind words :)

  • @luvsuneja • a year ago • +1

    If our batch has 2 sequences of 2 words x 3 embedding size, say
    [[1,2,3],[4,5,6]] and [[1,3,5],[2,4,6]],
    then for layer normalization,
    is nu_1 = mean(1,2,3,1,3,5)
    and nu_2 = mean(4,5,6,2,4,6)?
    Just wanted to clarify. Keep up the great work, brother. Like the small bite-sized videos.
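
  To make the numbers above concrete, here is a small PyTorch sketch (mine, not from the video's repo) that works out both readings on that exact batch: statistics taken over the embedding and batch dims for each word position, versus over the embedding dim only.

    import torch

    # The two sequences from the comment above: batch of 2, 2 words each, embedding size 3.
    inputs = torch.tensor([[[1., 2., 3.], [4., 5., 6.]],
                           [[1., 3., 5.], [2., 4., 6.]]])   # shape [batch=2, words=2, embed=3]

    # Reading 1: statistics over the embedding AND batch dims per word position
    # ("across the layer and also the batch"), which is what the question describes.
    print(inputs.mean(dim=(0, 2)))   # tensor([2.5000, 4.5000]) = mean(1,2,3,1,3,5), mean(4,5,6,2,4,6)

    # Reading 2: statistics over the embedding dim only, one mean per word
    # (this is what torch.nn.LayerNorm(3) would use).
    print(inputs.mean(dim=-1))       # tensor([[2., 5.], [3., 4.]])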

  • @TD-vi1jx • a year ago • +1

    Great video! I find these very informative, so please keep them going! Question on the output dimensions though. In your transformer overview video the big diagram shows that after the layer normalization, you have a matrix of shape [batch_size, sequence_len, dmodel] (in the video 30x50x512 I believe.) However here you end up with an output matrix (out) of [sequence_len, batch_size, dmodel] (5x3x8). Do we need to reshape these output matrices again to [batch_size,sequence_len,dmodel], or am I missing something? Thanks again for all the informative content!

    • @dhawajpunamia1000 • 9 months ago

      I wonder the same. Do you know the reason for it?
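
  On the shape question above: if the layer-norm output comes out as [sequence_len, batch_size, d_model], getting back to [batch_size, sequence_len, d_model] is just an axis permutation and does not change any values. A minimal sketch, assuming the 5 x 3 x 8 shape mentioned in the comment:

    import torch

    out = torch.randn(5, 3, 8)        # [sequence_len, batch_size, d_model]
    out_bsd = out.permute(1, 0, 2)    # back to [batch_size, sequence_len, d_model]
    print(out_bsd.shape)              # torch.Size([3, 5, 8])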

  • @vib2810 • 2 months ago

    As per my understanding, and from the LayerNorm code in PyTorch: in NLP, for an input of size [N, T, Embed], statistics are computed using only the Embed dim, and layer norm is applied to each token in each batch. But for vision, with an input of size [N, C, H, W], statistics are computed using the [C, H, W] dimensions.
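
  The two conventions described above can be seen directly in how nn.LayerNorm is constructed; a short PyTorch sketch with made-up shapes:

    import torch
    import torch.nn as nn

    # NLP: [N, T, Embed]; nn.LayerNorm(Embed) computes statistics over the last dim only,
    # i.e. separately for every token in every batch element.
    x_nlp = torch.randn(4, 10, 512)
    print(nn.LayerNorm(512)(x_nlp).shape)          # torch.Size([4, 10, 512])

    # Vision: [N, C, H, W]; nn.LayerNorm([C, H, W]) pools statistics over channels and
    # spatial positions, giving one mean/std per image in the batch.
    x_img = torch.randn(4, 3, 32, 32)
    print(nn.LayerNorm([3, 32, 32])(x_img).shape)  # torch.Size([4, 3, 32, 32])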

  • @misterx3321 • 8 months ago

    Beautifully done video, but isn't layer normalization essentially a batch normalization layer?

  • @mohammadyahya78 • a year ago

    Wow. Amazing channel

  • @shreejanshrestha1931 • a year ago • +1

    EXCITED!!!!!

  • @user-wr4yl7tx3w • a year ago • +2

    the python example really helped to solidify understanding.

  • @MatheusHenrique-jz1dc • a year ago

    Thank you very much friend, very good!!

  • @chargeca3573 • 11 months ago

    Thanks! This video helps me a lot.

  • @tylerknight99 • a year ago • +2

    10:30 When layer normalizations are "computed across the layer and also the batch", are their means and std connected as if the batch boundaries aren't there? So does that mean there's a different learnable gamma and beta parameter for each word?

    • @CodeEmporium • a year ago • +2

      β€œMeans and std connected as if the batch boundary isn’t there” ~ yes
      But we have the same gammas and betas for the same layer across ALL words in the dataset
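
  One way to check the "same gammas and betas for all words" point in PyTorch (assuming an embedding size of 512): the learnable parameters have the shape of the normalized dims only, so the same values are reused for every word and every batch element.

    import torch.nn as nn

    ln = nn.LayerNorm(512)
    print(ln.weight.shape, ln.bias.shape)   # torch.Size([512]) torch.Size([512]) -- gamma and beta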

  • @fenglema36 • a year ago • +2

    Amazing, your explanations are so clear. Can this help with exploding gradients?

    • @CodeEmporium • a year ago • +1

      Thanks so much! And yea, layer normalization should help with exploding and vanishing gradients, so training is stable.

    • @fenglema36 • a year ago

      @@CodeEmporium Perfect thanks.

  • @superghettoindian01 • a year ago

    Have an actual question this time! While trying to understand the differences between layer and batch normalization, I was wondering whether it’s also accurate to say you are normalising across the features of a vector when normalising the activation function - since each layer is a matrix multiply across all features of a row, would normalising across activation functions be similar to normalising across the features?
    In the same thread, can/should layer and batch normalization be run concurrently? If not, are there reasons to choose one over the other?

    • @CodeEmporium • a year ago • +4

      Good questions. From what I understand, we normalize the activation values across the layer (i.e. make sure the values across the layer follow a bell curve of sorts), while in batch normalization we do the same exact thing but across the batch dimension instead.
      The only issue I see with batch normalization is that it is dependent on the batch size (which is typically an order of magnitude smaller than the size of layers). If we apply a normalization to a small number of items, we might get erratic results (that is, the mean and standard deviation of a small number of numbers simply don't lead to values that are truly "normalized"). One remedy would be to increase your batch size to a sizable number (like in the hundreds). I have heard this is an issue with NLP problems specifically, but I would need to do my own experimentation to see why.
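
  A small PyTorch sketch of the contrast drawn above, with made-up shapes: layer norm's statistics come from each word's own embedding values, so they do not depend on the batch size, while batch norm pools statistics across the batch (and, for sequences, across positions), which gets noisy when the batch is tiny.

    import torch
    import torch.nn as nn

    x = torch.randn(2, 5, 8)     # [batch, words, embed]; deliberately tiny batch

    y_ln = nn.LayerNorm(8)(x)                                     # per word, over its 8 embedding values
    y_bn = nn.BatchNorm1d(8)(x.transpose(1, 2)).transpose(1, 2)   # per feature, over batch * words values

    print(y_ln.shape, y_bn.shape)   # both torch.Size([2, 5, 8])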

  • @darshantank554 • a year ago

    🔥🔥🔥

  • @madhukarmukkamula1515 • a year ago

    Great video!! Just a question: why do we need to swap the dimensions and all that other stuff? Why can't we do something like this?
    # assuming inputs is of shape (batch_size, sentence_length, embedding_dim)
    mean = inputs.mean(dim=-1, keepdim=True)
    var = ((inputs - mean) ** 2).mean(dim=-1, keepdim=True)
    epsilon = 1e-5
    std = (var + epsilon).sqrt()
    y = (inputs - mean) / std
    y

    • @alexjolly1689 • 10 months ago • +1

      I have the same doubt. Did you get it?
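
  For what it's worth, the snippet in the question is a valid layer norm over the embedding dim and matches PyTorch's nn.LayerNorm at initialization (gamma = 1, beta = 0); the video's class appears to differ only in that its parameter_shape also pulls the batch dim into the statistics (see the 10:30 discussion above). A quick check (my sketch, not from the repo):

    import torch
    import torch.nn as nn

    inputs = torch.randn(2, 5, 8)   # (batch_size, sentence_length, embedding_dim)

    mean = inputs.mean(dim=-1, keepdim=True)
    var = ((inputs - mean) ** 2).mean(dim=-1, keepdim=True)
    y = (inputs - mean) / (var + 1e-5).sqrt()

    # nn.LayerNorm initializes gamma to 1 and beta to 0, so the two agree at init.
    print(torch.allclose(y, nn.LayerNorm(8, eps=1e-5)(inputs), atol=1e-6))   # True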

  • @yanlu914 • a year ago • +1

    Where can I find the reason why we need to calculate the mean and standard deviation over the parameter shapes? In PyTorch, they just calculate over the last dimension, the hidden size.

    • @yanlu914 • a year ago

      I'm sorry, I just checked the PyTorch code; there is a normalized_shape parameter like parameter_shape in your code.
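
  To connect the two names: normalized_shape in nn.LayerNorm plays the same role as parameter_shape, listing the trailing dims that the statistics (and gamma/beta) cover. A sketch assuming the video's [sequence_len, batch_size, embed] layout:

    import torch
    import torch.nn as nn

    x = torch.randn(5, 3, 8)          # [sequence_len, batch_size, embed]

    ln_embed = nn.LayerNorm(8)        # statistics over the embedding dim only (PyTorch's usual choice)
    ln_both = nn.LayerNorm([3, 8])    # statistics over the batch and embedding dims together

    print(ln_embed(x).shape, ln_both(x).shape)   # both torch.Size([5, 3, 8])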

  • @saurabhnirwan549 • 8 months ago

    Is PyTorch better for NLP tasks than TensorFlow?

    • @CodeEmporium • 8 months ago

      I wouldn't necessarily say that is the case. TensorFlow and PyTorch are frameworks we can use for building these complex models. PyTorch might be easier to use since you don't need to code out tensors themselves. But TensorFlow (or even going as low as numpy) can be used to train models too.

    • @saurabhnirwan549 • 8 months ago

      @@CodeEmporium Thanks for clarifying

  • @yagneshbhadiyadra7938 • a month ago

    The value matrix should be on the right side of the multiplication with the attention weights matrix.

  • @user-wr4yl7tx3w • a year ago

    Given that we have 8 heads, is it 512 / 8, which is 64? Are we actually going to split the 512 into 8 equal 64-length parts?

    • @CodeEmporium • a year ago • +1

      In theory, that's what's happening. In code, we make "8 parallel heads" by introducing another dimension. For example, vectors of shape 30 (max sequence length) x 10 (batch size) x 512 (embedding size) would form query, key and value tensors of shape 30 x 10 x 8 x 64. So there is a new dimension for heads that essentially acts like another batch dimension for more parallel computation.
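
  A minimal sketch of the reshape described in the reply above, using the same shapes:

    import torch

    x = torch.randn(30, 10, 512)                      # [max_sequence_length, batch_size, d_model]
    num_heads, head_dim = 8, 512 // 8                 # 8 heads of size 64

    x_heads = x.reshape(30, 10, num_heads, head_dim)  # heads get their own, batch-like dimension
    print(x_heads.shape)                              # torch.Size([30, 10, 8, 64])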

  • @pussiestroker • 4 months ago

    I am actually not sure why it is okay to reshape your input from (B, S, E) to (S, B, E) @ 10:10. It doesn't matter in this case as B=1, but in general wouldn't you be changing your data? In particular, you would change the number of words in each batch, i.e., change the maximum sentence length.
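
  On the question above: if the swap is done as a permutation (e.g. tensor.permute), every word's embedding stays attached to its own sentence and the sentence length is untouched; a raw reshape, by contrast, reinterprets memory order and would mix sentences once B > 1. A toy check (mine, not from the video):

    import torch

    B, S, E = 2, 3, 4
    x = torch.arange(B * S * E).reshape(B, S, E)

    permuted = x.permute(1, 0, 2)    # (S, B, E); permuted[s, b] is still x[b, s]
    reshaped = x.reshape(S, B, E)    # (S, B, E) by memory order; rows no longer line up with x[b, s]

    print(torch.equal(permuted[0, 1], x[1, 0]))   # True
    print(torch.equal(reshaped[0, 1], x[1, 0]))   # False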