This is the third video about the transformer decoder and the final video introducing the transformer architecture. Here we mainly learn about the encoder-decoder multi-head attention layer, which incorporates information from the encoder into the decoder. This layer is also commonly known as the cross-attention layer.
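To make the mechanism concrete, here is a minimal single-head sketch of cross-attention in NumPy: queries come from the decoder, while keys and values come from the encoder. The function and weight names (cross_attention, Wq, Wk, Wv) are illustrative assumptions, not taken from the video or the paper:

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, Wq, Wk, Wv):
    Q = decoder_states @ Wq          # queries from the decoder, (T_dec, d_k)
    K = encoder_states @ Wk          # keys from the encoder,    (T_enc, d_k)
    V = encoder_states @ Wv          # values from the encoder,  (T_enc, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scaled dot-product scores, (T_dec, T_enc)
    return softmax(scores) @ V       # each decoder position attends over the encoder, (T_dec, d_v)

In the full multi-head layer this computation is repeated for several heads in parallel and the results are concatenated and projected, but the single-head version above captures the core idea.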
This video is part of a series on the transformer architecture ("Attention Is All You Need", https://arxiv.org/abs/1706.03762). You can find the complete series and a longer motivation here:
https://www.youtube.com/playlist?list=PLDw5cZwIToCvXLVY2bSqt7F2gu8y-Rqje
Slides are available here:
https://chalmersuniversity.box.com/s/c2a64rz0hlp44pdouq9mc24msbz60xf2