Transformer Neural Networks - EXPLAINED! (Attention is all you need)
Source code
- Published: 2020. 01. 12.
- Please subscribe to keep me alive: krplus.net/uCodeEmporiu...
BLOG: medium.com/@dataemporium
MATH COURSES (7 day free trial)
- Mathematics for Machine Learning: imp.i384100.net/MathML
- Calculus: imp.i384100.net/Calculus
- Statistics for Data Science: imp.i384100.net/AdvancedStati...
- Bayesian Statistics: imp.i384100.net/BayesianStati...
- Linear Algebra: imp.i384100.net/LinearAlgebra
- Probability: imp.i384100.net/Probability
OTHER RELATED COURSES (7 day free trial)
- Deep Learning Specialization: imp.i384100.net/Deep-Learning
- Python for Everybody: imp.i384100.net/python
- MLOps Course: imp.i384100.net/MLOps
- Natural Language Processing (NLP): imp.i384100.net/NLP
- Machine Learning in Production: imp.i384100.net/MLProduction
- Data Science Specialization: imp.i384100.net/DataScience
- Tensorflow: imp.i384100.net/Tensorflow
REFERENCES
[1] The main paper: arxiv.org/abs/1706.03762
[2] Tensor2Tensor has some code with a tutorial: www.tensorflow.org/tutorials/...
[3] Transformer very intuitively explained - Amazing: jalammar.github.io/illustrated...
[4] Medium Blog on intuitive explanation: medium.com/inside-machine-lea...
[5] Pretrained word embeddings: nlp.stanford.edu/projects/glove/
[6] Intuitive explanation of Layer normalization: mlexplained.com/2018/11/30/an...
[7] Paper that gives even better results than transformers (Pervasive Attention): arxiv.org/abs/1808.03867
[8] BERT uses transformers to pretrain neural nets for common NLP tasks: ai.googleblog.com/2018/11/ope...
[9] Stanford Lecture on RNN: cs231n.stanford.edu/slides/201...
[10] Colah's blog: colah.github.io/posts/2015-08...
[11] Wiki with a timeline of events: en.wikipedia.org/wiki/Transfo...
For more details and code on building a translator using a transformer neural network, check out my playlist "Transformers from scratch":
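The playlist contains the actual code; purely as a rough sketch of the overall shape (not the playlist's implementation), here is a minimal translator skeleton built on PyTorch's stock nn.Transformer, with made-up vocabulary sizes and the positional encoding left out for brevity:

import torch
import torch.nn as nn

# Minimal encoder-decoder translator on top of PyTorch's built-in nn.Transformer.
# Vocabulary sizes and dimensions here are illustrative only.
class TinyTranslator(nn.Module):
    def __init__(self, src_vocab=10000, tgt_vocab=10000, d_model=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6,
                                          batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Causal mask: each target position may only attend to earlier positions.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        h = self.transformer(self.src_embed(src_ids), self.tgt_embed(tgt_ids),
                             tgt_mask=tgt_mask)
        return self.out(h)  # logits over the target vocabulary

# Example: a batch of 2 source sentences (length 7) and shifted target inputs (length 5).
logits = TinyTranslator()(torch.randint(0, 10000, (2, 7)), torch.randint(0, 10000, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 10000])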
What a hugely underrated video. You did a much better job of explaining this across multiple abstraction layers in such a short video than most videos I could find on the topic, which were more than twice as long.
Great video! Watched it a few times already so these timestamps will help me out:
Incredibly well explained and concise. I can't believe you pulled off such a complete explanation in just 13 minutes!
Wow.
The multi-pass approach to progressively explaining the internals worked well. Thanks for your content!
I love the multi-pass way of explanation, so that the viewer can process high-level concepts and then build upon that knowledge. Great job.
This is awesome!!! Thank you for breaking it down concisely, understandably, and deeply! It's hard to find explanations that aren't so simplistic they're useless, or so involved they don't save time in achieving understanding. Thank you!!
Great video!! I am taking a course at my university and one of the lectures was about RNNs and transformers. Your 13-minute video explains it way better than the 100-minute lecture I attended. Thank you!
Really great video. As someone transitioning from pure math into machine learning and AI, I find the language barrier to be the biggest hurdle and you broke down these concepts in a really clear way. I love the multiple layer approach you took to this video, I think it worked really well to first give a big picture overview of the architecture before delving deeper.
LOVE the multipass strategy for explaining the architecture. I don't think I've seen this approach used with ML, and it's a shame as this is an incredibly useful strategy for people like me trying to play catch up. I hopped on the ML train a little late, but stuff like this makes me feel not nearly as lost.
I went through several videos on the 'Attention Is All You Need' paper before this; the amount of detail you managed to cover in thirteen minutes is amazing. I could not find an explanation this easy to understand anywhere else. Great job!
Great explanation. Could you do another video on positional encoding specifically? It seems to be very important, but I've found it the most confusing part of this architecture.
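For reference, the fixed encoding in the paper [1] is just sines and cosines at geometrically spaced frequencies, added to the word embeddings; a small NumPy sketch of those formulas:

import numpy as np

# Sinusoidal positional encoding from "Attention Is All You Need":
#   PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
#   PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # (max_len, 1) token positions
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2) dimension pairs
    angles = pos / np.power(10000.0, i / d_model)  # one frequency per pair of dimensions
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions get cosine
    return pe                                      # shape (max_len, d_model), added to the embeddings

print(positional_encoding(max_len=50, d_model=512).shape)  # (50, 512)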
That was an awesome explanation. I have a question about the Add & Norm block: do you add the embedded vector before or after performing normalization? Is there even a difference if we do one instead of the other?
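In the paper itself, the residual is added first and layer normalization is applied to the sum (post-norm); many later implementations normalize before the sublayer instead (pre-norm), which is reported to train more stably in deep stacks. A small PyTorch sketch of the two orderings, using a plain linear layer as a stand-in for the attention or feed-forward sublayer:

import torch
import torch.nn as nn

d_model = 512
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)   # stand-in for self-attention or the feed-forward block

x = torch.randn(2, 10, d_model)          # (batch, sequence length, features)

post_norm = norm(x + sublayer(x))        # Add, then Norm (the ordering in the original paper)
pre_norm = x + sublayer(norm(x))         # Norm inside the residual branch (a common later variant)

print(post_norm.shape, pre_norm.shape)   # both torch.Size([2, 10, 512])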
This is one of the best explanations for transformers I've come across online! Awesome job, man! Thanks. I'll totally recommend your channel to some classmates!! :)
Very underrated. Please keep making these videos. You have no idea how great a service this is doing for the young research communities who are just learning to read research papers. Instantly subscribed.
This is by far the best explanation of the Transformers architecture that I have ever seen! Thanks a lot!
Thanks, man. This is a really clear, high-level explanation. Really helpful for people like me who have just stepped into this area.
Thank you for making this! As a curious outsider I have been anxious about falling behind in recent years and this was perfect to bring me up to speed - at least enough to follow the conversation.
I've worked on none of these things, but listening to this, I think of mechanisms that might improve them. I think real brains do a bit of combinatorial optimization: in understanding, ambiguity is resolved by simply trying the alternatives and seeing which works out better, but it's not always in parallel. And in writing there's a bit of trying, failing, trying something different, etc. In a machine vocabulary you might think of a simulated annealing algorithm. So when I see multiple encodings being combined with weightings, I immediately think "this is not optimal": it's better to go through combinatorial optimization of encodings for different words, finding better and worse fits, than to try to resolve multiple encodings without trying each in the context of the others.
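Purely to illustrate the idea in that comment (this is not something the video or the paper does), here is a toy simulated-annealing loop over discrete choices, where the energy function stands in for whatever "fit" score one would actually use:

import math
import random

# Toy simulated annealing: pick one option per slot, propose single-slot changes,
# always accept improvements, and accept worsenings with a probability that shrinks
# as the temperature cools, so the search can escape local optima.
def anneal(choices_per_slot, energy, steps=5000, t_start=2.0, t_end=0.01):
    state = [random.choice(c) for c in choices_per_slot]
    best, best_e = list(state), energy(state)
    for step in range(steps):
        t = t_start * (t_end / t_start) ** (step / steps)   # geometric cooling schedule
        proposal = list(state)
        slot = random.randrange(len(state))
        proposal[slot] = random.choice(choices_per_slot[slot])
        delta = energy(proposal) - energy(state)
        if delta < 0 or random.random() < math.exp(-delta / t):
            state = proposal
            if energy(state) < best_e:
                best, best_e = list(state), energy(state)
    return best, best_e

# Example: choose one "sense" per word so that neighbouring senses agree (a made-up energy).
choices = [[0, 1, 2]] * 6
best, e = anneal(choices, lambda s: sum(abs(a - b) for a, b in zip(s, s[1:])))
print(best, e)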