The Complete Guide to Transformer Neural Networks!
Source Code
- Published 2024. 04. 19.
- Let's do a deep dive into the Transformer Neural Network Architecture for language translation.
ABOUT ME
- Subscribe: krplus.net/uCodeEmporiu...
- Medium Blog: / dataemporium
- Github: github.com/ajhalthor
- LinkedIn: / ajay-halthor-477974bb
RESOURCES
[1] Transformer Architecture Image: github.com/ajhalthor/Transfor...
[2] draw.io version of the image for clarity: github.com/ajhalthor/Transfor...
PLAYLISTS FROM MY CHANNEL
- Transformers from scratch playlist: Self Attention in Tran...
- ChatGPT Playlist of all other videos: ChatGPT
- Transformer Neural Networks: Natural Language Proce...
- Convolutional Neural Networks: Convolution Neural Net...
- The Math You Should Know: The Math You Should Know
- Probability Theory for Machine Learning: Probability Theory for...
- Coding Machine Learning: Code Machine Learning
MATH COURSES (7 day free trial)
- Mathematics for Machine Learning: imp.i384100.net/MathML
- Calculus: imp.i384100.net/Calculus
- Statistics for Data Science: imp.i384100.net/AdvancedStati...
- Bayesian Statistics: imp.i384100.net/BayesianStati...
- Linear Algebra: imp.i384100.net/LinearAlgebra
- Probability: imp.i384100.net/Probability
OTHER RELATED COURSES (7 day free trial)
- Deep Learning Specialization: imp.i384100.net/Deep-Learning
- Python for Everybody: imp.i384100.net/python
- MLOps Course: imp.i384100.net/MLOps
- Natural Language Processing (NLP): imp.i384100.net/NLP
- Machine Learning in Production: imp.i384100.net/MLProduction
- Data Science Specialization: imp.i384100.net/DataScience
- Tensorflow: imp.i384100.net/Tensorflow
TIMESTAMPS
0:00 Introduction
1:38 Transformer at a high level
4:15 Why Batch Data? Why Fixed Length Sequence?
6:13 Embeddings
7:00 Positional Encodings
7:58 Query, Key and Value vectors
9:19 Masked Multi Head Self Attention
14:46 Residual Connections
15:50 Layer Normalization
17:57 Decoder
20:12 Masked Multi Head Cross Attention
22:47
24:03 Tokenization & Generating the next translated word
26:00 Transformer Inference Example
The link to the image and its raw file are in the description. If you think I deserve it, please give this video a like and subscribe for more! If you think it's worth sharing, please do so as well. I would love to grow to 100k subscribers this year with your help :) Thank you!
Just gave the thumb up! Just curious: what software did you use to draw such a wonderful diagram?
Sooooo nice! Where can we find the link to the image?
Thanks I used draw.io to draw the image
The image can be found in the description of the video on GitHub
But what is the source for the Kannada words that were fed into the output? How can we get those words in reality? Could you explain if you are willing to? Thank you.
Most underrated YouTuber. You explain these complex topics with such ease. Many big channels avoid explaining these topics. Really appreciate your work, man.
Thanks a lot for the kind words. I try :)
Bro for real! It never felt a possibility for me to learn ML but this guy took me by hand and is teaching all this for free!
I can't even thank this dude enough
The way you approach this topic make it so easy to understand, and I appreciate the pace of your talking. Best content on transformer.
You are very welcome. And thanks so much for that Super Thanks. You didn't have to, but it's very appreciated.
Great overview! Thanks for taking the time to put all this together!
Thanks so much! My pleasure
You are the most underrated KRplusr. This is the best video explaining Transformers completely and in the most intuitive way. I started my journey with Transformers with your first Transformers video a few years ago, which was very helpful. Also, I am so happy to see an AI tutorial video using an Indian language. I really appreciate your work.
Amazing explanations throughout the series, and top-notch content, as always. Waiting for a detailed explanation/visualisation of the backward pass in the encoder/decoder during training. I would appreciate it if you were thinking in the same way.
I like your visualization of the matrices. Those residual connections and positional embeddings were good details to mention here.
love the visualization makes it so clear
Your explanation is the most realistic explanation of the Transformer that I've ever seen on the internet.
Thanks dude.
That means a lot. Thank you. Please like subscribe and share around if you can :)
Amazing! Salute to the dedication in making this video, the visual explanation, and the knowledge.
Thanks so much for watching and commenting!
Video quality is amazing.
Keep it up, buddy!
I shall. Thanks so much!
Awesome tutorial on application of "transformer" architecture for language translation.
This is my very first lesson on the topic and I will give a 5+ stars.
Thx dude, you inspired me to subscribe to your channel -- my very first YouTube subscription.
Can't thank you enough!!
Thanks for the kind words! And super glad this video was helpful. Hope you enjoy the full playlist "Transformers from scratch", of which this video is a part :)
Amazing video, keep up the good work. Thanks for this!!
You explain really well! I think it's quite complex, but as you explained it, it has become clearer. I think with the coding video, it is extremely useful.
this was a brilliant video!! super comprehensive
Very well explained. Thank you.
Thank you so much for taking the time to code and explain the transformer model in such detail; I followed your series from zero to hero. You are amazing, and if possible, please do a series on how transformers can be used for time series anomaly detection and forecasting. It is extremely needed on YouTube from someone!
That was awesome. Thank you man!!!
Hopefully the series is completed soon! Would binge watch.
Yep. Maybe 1 or 2 videos left. I am running into some issues, but I'll probably either have them solved or just have a fun community help video. Either way, it should be good.
Thank you so much for all these videos, I have learnt a lot from your videos!!!
I thought you were from Tamil Nadu, but today I got to know that you are from Karnataka!
Where in Karnataka? I'm staying in Bangalore; would like to meet you in person!
Your Kannada written language is really beautiful!
THIS IS AMAZING, helped me a lot, thanks :)
Thanks so much for watching and commenting!
Great channel and very useful video, thank you very much! I will watch other videos of your channel as well.
I have a question. After you perform layer normalization and obtain an output tensor, how do you give a three-dimensional tensor as input to a feed-forward layer?
Do you flatten the input?
Life saver, thank you
You are very welcome
Very well explained
Thanks a ton for commenting and watching :)
Really well presented.
Thanks a ton! :)
Eagerly waiting for the upcoming videos in the series.
Thanks! Probably just 1-2 long form video(s) more
Bro, all of my confusion vanished like a vanishing gradient.
Thanks. Really worth it.
Will have to brush up my basics and then come back to this.
Yea. This can be a lot of info. Hopefully the earlier videos in this playlist will help too
@@CodeEmporium Your channel is really good! Thanks for all the work.
Excellent!
Hi, I really love your complete model overview!
Also, at 8:08 you mention that the difference between K, Q, and V isn't very explicit to the model. What would be your personal intuitive interpretation of what a Key vector might extract/learn from an input word? I find the key concept a bit odd and wondered how the authors came up with the idea of training a Key vector (or matrix), where previous attention papers only had a value vector, which would be used in both places (K and V) of the equation.
When I think about information retrieval concepts, where we have a search query and documents to be ranked, IIRC the intuition there is to compute a dot product to get a similarity/relevance score between them. In my mind, the concept of "how relevant is each document" isn't that far off from "how much attention should I pay to each document".
And analogously I would interpret documents to be Values, and the idea of a key seems to be absent? (Unless IR in practice computes a key for each document, basically a key_of(document)-query similarity; then I just answered the question myself.)
Anyway, I wondered if it wouldn't be possible to simplify the attention mechanism while keeping it conceptually similar. Not sure where I should look to learn more about this.
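For context, the retrieval analogy in the comment above maps directly onto standard scaled dot-product attention: queries are scored against keys, and the resulting weights mix the values. A minimal NumPy sketch (the shapes here are illustrative assumptions, not the video's exact tensors):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    # query-key similarity scores, like ranking documents against a search query
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    # numerically stable softmax over the key axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # weighted mix of value vectors
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 64))  # 5 query positions, d_k = 64 (assumed)
K = rng.standard_normal((5, 64))
V = rng.standard_normal((5, 64))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (5, 64)
```

Setting K = V in this function recovers the commenter's "value used in both places" variant, which makes the separate Key projection easy to experiment with.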
My friend Ajay, your playlist "Transformers from scratch" is great. It was very appealing to me to see your block diagram representation. Waiting with great anticipation for the final video. Would you be able to make it available soon?
Glad you like it! I am hitting a few roadblocks, though I feel I am 99% there. I'll make a video on this to mostly ask the community, so it should be a fun exercise for everyone too :) Hoping when that is resolved, we can make a final video :D
Very well!
Damn, could've used this a few weeks ago for my OMSCS quiz. Solid review though, nice job!
Fantastic lecture. The attention layers and their inter-relationships are very well explained. Thank you. However, this and other videos gloss over the use of the fully-connected layers following the attention layer. Using FC with language model embeddings makes little sense to me. Are there 512x50 inputs to the FC, i.e., is the input sentence simply flattened as input to the FC layer?
Lovely brother. I am your Neighbour Tamizhan. Lovely brotherhood
Thanks so much! :)
If I can recommend next steps for this series: going into BERT, GPT, and DETR would be lovely extensions.
I was kind of thinking the same! For now, I have videos on BERT and GPT on the channel if you haven't checked them out. But an architecture deep dive would be fun too :)
@@CodeEmporium Yes, that would be super fun! Also, it would be great if you could introduce how an ML practitioner could fine-tune based on these complex models.
Thank you for all the videos about the transformer. Although I understood the architecture, I still don't know what to set as the input of the decoder (embedded target) and the mask for the TEST phase.
amaaazing
Thanks so much :)
Amazing
Thanks so much!
Thanks!
You are super welcome! I appreciate the donation! Thanks!
Great video. At 12:09, how will dividing all the numbers by 8 ensure the small values are not too small or the large values not too large? Wouldn't dividing by 8 just make every number 8 times smaller?
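For reference on the question above: the 8 is sqrt(d_k) = sqrt(64), and the point of the division is to bring the dot products back to roughly unit variance before the softmax, not to shrink small values further. A quick NumPy check with assumed unit-variance random vectors illustrates this:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64

# dot products of two independent unit-variance vectors have variance ~ d_k,
# so their standard deviation is ~ sqrt(d_k) = 8
dots = np.array([rng.standard_normal(d_k) @ rng.standard_normal(d_k)
                 for _ in range(10_000)])
print(round(float(dots.std())))                   # roughly 8
print(round(float((dots / np.sqrt(d_k)).std())))  # roughly 1
```

Without the scaling, scores of magnitude ~8 would push the softmax into a near-one-hot regime with tiny gradients; dividing by sqrt(d_k) keeps the distribution softmax-friendly.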
Please, can you apply the transformer you have built to text summarization? It would be really helpful.
Would be nice to have a video like this explaining the LLaMA model.
concise
Thanks! I try not to bore :)
Very good. In general, articles don't show the dimensions when explaining. It helps a lot. Thanks.
My pleasure!
Great! Still a bit too hard for me, but I still learned stuff.
Question: would it be possible to use the same encoder across multiple languages? Without retraining it after the first time, I mean.
I hope the full playlist "Transformers from scratch" helps with pacing this.
To your second question: this is a simple transformer neural network and not a typical language model like BERT/GPT. The transformer on its own doesn't typically make use of transfer learning, so some retraining will be required. That said, if you were using the language models, then you might just need to fine-tune your parameters to the target language (which is technically training). Or, if you go the GPT-3 route, you could get away without fine-tuning and use meta-learning techniques instead.
Do a video on this new model called RWKV-LM.
Can this be done in pure C++?
What is the use of the feed forward network in the transformer? Please answer.
At 11:08, I am sorry if I am wrong, but the transposed K matrix, isn't it 50x30x64?
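Assuming the tensors at that point are (batch = 30, sequence = 50, d_k = 64), which is an assumption based on the dimensions used in the video, the transpose in Q·K^T only swaps the last two axes and leaves the batch axis alone, which may be the source of the confusion. A quick NumPy shape check:

```python
import numpy as np

batch, seq, d_k = 30, 50, 64      # assumed shapes from the video
Q = np.zeros((batch, seq, d_k))
K = np.zeros((batch, seq, d_k))

K_T = K.swapaxes(-2, -1)          # transpose the last two axes, NOT the batch axis
print(K_T.shape)                  # (30, 64, 50)
print((Q @ K_T).shape)            # (30, 50, 50): one seq x seq score matrix per example
```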
Without a BCI, is a multi-head attention process possible with the human brain?
Masked multi-head attention is for the decoder, right? Is that a typo in your encoder architecture?
What's in the feed forward layers? Just an input and output layer? Are there hidden layers? What are the sizes of the layers?
Feed forward layers are hidden layers. It's essentially 2,048 neurons in size. You can think of it as mapping a 512-dimension vector to a 2,048-dimension vector, and then mapping the 2,048 vector back to 512 dimensions. All of this captures additional information about the word.
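The 512 -> 2048 -> 512 mapping described in the reply above is applied independently at every position, so no flattening of the 3D tensor is needed. A minimal NumPy sketch (layer sizes follow the reply; the batch/sequence sizes and random initialization are illustrative assumptions):

```python
import numpy as np

d_model, d_ff = 512, 2048  # sizes from the reply above

rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

def feed_forward(x):
    """Position-wise FFN: the same weights act on every token vector."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # ReLU, shape (batch, seq, 2048)
    return hidden @ W2 + b2                # back to (batch, seq, 512)

x = rng.standard_normal((30, 50, d_model))  # assumed (batch, seq, d_model)
print(feed_forward(x).shape)  # (30, 50, 512), no flattening needed
```

Because the matmul broadcasts over the leading axes, the 3D tensor passes straight through; only the last axis is transformed.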
1kth like
First :)
Please keep being the first! :)
The background music creates a lot of disturbance, especially that pop sound; otherwise the content delivery is the best.
So you're from the Silicon Valley of India. We all know it.
Haha kinda yea.
Amazing, fluent in English; you speak like a native speaker.
I am a native English speaker, but I've lived a good amount of my adolescence and early adult life in India.
@@CodeEmporium Wow, so that means you also speak the Indian dialect, which I assume makes you fluent in three languages?
I truly appreciate your explanation regarding content, tone, accent, and other related aspects.