PyTorch Seq2Seq

The tutorials use PyTorch with Hugging Face Datasets for loading Multi30k and a lightweight local preprocessing pipeline instead of legacy text-processing dependencies.

Tutorials

1 - Sequence to Sequence Learning with Neural Networks

This first tutorial covers the workflow of a PyTorch seq2seq project. We'll cover the basics of seq2seq networks using encoder-decoder models, how to implement these models in PyTorch, and how to prepare the text data with a small local preprocessing pipeline. The model itself will be based on an implementation of Sequence to Sequence Learning with Neural Networks, which uses multi-layer LSTMs.
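As a rough illustration of what a small local preprocessing pipeline can look like, the sketch below tokenizes sentences, builds a vocabulary, converts text to integer ids, and pads a batch. It uses plain Python so it runs without any dependencies. The special-token names, the whitespace tokenizer, and all function names are assumptions made for this example and are not taken from the notebooks.

```python
from collections import Counter

# Hypothetical special tokens; the notebooks may use different names or order.
SPECIALS = ["<unk>", "<pad>", "<sos>", "<eos>"]


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real tokenizer)."""
    return text.lower().split()


def build_vocab(sentences, min_freq=1):
    """Map each token seen at least min_freq times to an integer index."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    itos = SPECIALS + sorted(t for t, c in counts.items() if c >= min_freq)
    return {tok: idx for idx, tok in enumerate(itos)}


def numericalize(text, vocab):
    """Wrap the sentence in <sos>/<eos> and map unknown tokens to <unk>."""
    unk = vocab["<unk>"]
    ids = [vocab.get(tok, unk) for tok in tokenize(text)]
    return [vocab["<sos>"]] + ids + [vocab["<eos>"]]


def pad_batch(batch, pad_idx):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_idx] * (max_len - len(seq)) for seq in batch]


corpus = ["two dogs run", "a dog sleeps"]
vocab = build_vocab(corpus)
# "cat" is not in the vocabulary, so it becomes <unk>.
batch = pad_batch(
    [numericalize(s, vocab) for s in corpus + ["a cat"]], vocab["<pad>"]
)
```

In a PyTorch project, the padded lists would typically be converted to a tensor before being passed to the encoder.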
2 - Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

Now that we have the basic workflow covered, this tutorial will focus on improving our results. Building on the knowledge of PyTorch gained from the previous tutorial, we'll cover a second model, which helps with the information compression problem faced by encoder-decoder models. This model will be based on an implementation of Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, which uses GRUs.
3 - Neural Machine Translation by Jointly Learning to Align and Translate

Next, we learn about attention by implementing Neural Machine Translation by Jointly Learning to Align and Translate. This further alleviates the information compression problem by letting the decoder "look back" at the input sentence through context vectors, which are weighted sums of the encoder hidden states. The weights for this weighted sum are calculated by an attention mechanism, through which the decoder learns to pay attention to the most relevant words in the input sentence.
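The context vector described above can be sketched in a few lines of plain Python. Note the simplification: the paper scores each encoder state with a small learned network (additive attention), whereas this example substitutes a plain dot product so that it stays dependency-free. The function names and toy vectors are assumptions for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step.

    Scores use a dot product here; the paper uses a learned additive score.
    """
    scores = [dot(decoder_state, h) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the attention-weighted sum of encoder states.
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
```

The first and third encoder states align best with the decoder state, so they receive equal, larger weights than the second.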
4 - Packed Padded Sequences, Masking, Inference and BLEU

In this notebook, we will improve the previous model architecture by adding packed padded sequences and masking. These are two methods commonly used in NLP. Packed padded sequences allow us to only process the non-padded elements of our input sentence with our RNN. Masking is used to force the model to ignore certain elements we do not want it to look at, such as attention over padded elements. Together, these give us a small performance boost. We also cover a very basic way of using the model for inference, which lets us translate any sentence we give to the model and view the attention values over the source sequence for those translations. Finally, we show how to calculate the BLEU metric from our translations.
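To make the masking idea concrete, here is a minimal plain-Python sketch of a softmax that ignores padded positions. The padding index `PAD_IDX`, the function names, and the toy scores are assumptions for this example. It also assumes every sequence contains at least one non-padding token.

```python
import math

PAD_IDX = 1  # hypothetical padding index, chosen for this example


def make_mask(src_ids, pad_idx=PAD_IDX):
    """True for real tokens, False for padding."""
    return [tok != pad_idx for tok in src_ids]


def masked_softmax(scores, mask):
    """Softmax that assigns exactly zero weight to masked-out positions.

    Assumes at least one position in the mask is True.
    """
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


src_ids = [2, 7, 9, 3, 1, 1]             # last two positions are padding
scores = [0.5, 1.0, 0.2, 0.1, 3.0, 3.0]  # padding happens to score highest
mask = make_mask(src_ids)
weights = masked_softmax(scores, mask)
true_length = sum(mask)  # the kind of length that sequence packing relies on
```

Without the mask, the two padded positions would dominate the attention weights. With it, they contribute nothing.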
5 - Convolutional Sequence to Sequence Learning

We finally move away from RNN-based models and implement a fully convolutional model. One of the downsides of RNNs is that they are sequential. That is, before a word is processed by the RNN, all previous words must also be processed. Convolutional models can be fully parallelized, which allows them to be trained much more quickly. We will be implementing the Convolutional Sequence to Sequence model, which uses multiple convolutional layers in both the encoder and decoder, with an attention mechanism between them.
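The difference in data dependencies can be illustrated with a toy comparison in plain Python. This is a sketch over scalar inputs only: the actual model operates on embedding vectors and adds further components that are not shown here. The function names, the kernel, and the `decay` parameter are assumptions for illustration.

```python
def conv1d(sequence, kernel, pad_value=0.0):
    """1D convolution with padding so the output length equals the input length."""
    k = len(kernel)
    assert k % 2 == 1, "this sketch only handles odd kernel sizes"
    half = k // 2
    padded = [pad_value] * half + list(sequence) + [pad_value] * half
    # Each output depends only on a fixed window of inputs and never on
    # another output, so every position could be computed in parallel.
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(sequence))
    ]


def rnn_like(sequence, decay=0.5):
    """A toy recurrence: each state needs the previous state first."""
    state, states = 0.0, []
    for x in sequence:
        state = decay * state + x  # sequential dependency on the last state
        states.append(state)
    return states


seq = [1.0, 2.0, 3.0, 4.0]
conv_out = conv1d(seq, [1.0, 1.0, 1.0])
rnn_out = rnn_like(seq)
```

The list comprehension in `conv1d` has no dependency between iterations, while the loop in `rnn_like` cannot start step `t` until step `t - 1` has finished.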
6 - Attention Is All You Need

Continuing with the non-RNN-based models, we implement the Transformer model from Attention Is All You Need. This model is based solely on attention mechanisms and introduces Multi-Head Attention. The encoder and decoder are made of multiple layers, with each layer consisting of Multi-Head Attention and Positionwise Feedforward sublayers. This model is currently used in many state-of-the-art sequence-to-sequence and transfer learning tasks.
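A stripped-down view of Multi-Head Attention is sketched below in plain Python. It computes scaled dot-product attention separately on slices of the feature dimension and concatenates the results. The learned query, key, value, and output projections of the real model are deliberately omitted, and all names and toy inputs are assumptions for this example.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d_k)) V for lists of vectors."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        out.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return out


def multi_head_attention(queries, keys, values, n_heads):
    """Attend per feature slice (head), then concatenate the head outputs.

    The learned input and output projections are omitted in this sketch.
    """
    d_model = len(queries[0])
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_head = d_model // n_heads
    outputs = [[] for _ in queries]
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        head = scaled_dot_product_attention(
            [q[sl] for q in queries],
            [k[sl] for k in keys],
            [v[sl] for v in values],
        )
        for row, head_row in zip(outputs, head):
            row.extend(head_row)
    return outputs


x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_attention(x, x, x, n_heads=2)  # self-attention over x
```

Because each head sees a different slice of the features, the heads can weight the sequence positions differently before their outputs are joined.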
7 - KV Caching for the Decoder

The Key-Value (KV) cache is a foundational inference optimization in transformer models. Instead of recomputing the Key and Value vectors for the entire text history at every decoding step, the model stores them and appends new ones as tokens are generated. This example gives a glimpse of how this is done in LLMs. It makes small adjustments to the previous notebook, Attention Is All You Need.
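The caching idea can be checked with a small plain-Python sketch that decodes a toy sequence twice, once with a cache and once recomputing everything, and compares the results. The `project` function is a hypothetical stand-in for the learned linear projections, and the class and function names are assumptions for this example rather than the notebook's code.

```python
import math


def attention(query, keys, values):
    """Single-query scaled dot-product attention over all given keys."""
    scale = math.sqrt(len(query))
    scores = [sum(a * b for a, b in zip(query, k)) / scale for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [
        sum(e / total * v[i] for e, v in zip(exps, values))
        for i in range(len(values[0]))
    ]


def project(x, scale):
    """Hypothetical stand-in for a learned linear projection."""
    return [scale * xi for xi in x]


class KVCache:
    """Stores the keys and values of all tokens processed so far."""

    def __init__(self):
        self.keys, self.values = [], []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def decode_with_cache(tokens):
    cache, outputs = KVCache(), []
    for x in tokens:
        # Only the new token is projected; earlier keys/values are reused.
        cache.append(project(x, 0.5), project(x, 2.0))
        outputs.append(attention(project(x, 1.5), cache.keys, cache.values))
    return outputs


def decode_without_cache(tokens):
    outputs = []
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]
        keys = [project(x, 0.5) for x in prefix]    # recomputed every step
        values = [project(x, 2.0) for x in prefix]  # recomputed every step
        outputs.append(attention(project(tokens[t], 1.5), keys, values))
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cached = decode_with_cache(tokens)
uncached = decode_without_cache(tokens)
```

Both functions produce the same outputs. The cached version performs one key and one value projection per step, while the uncached version repeats them for the whole prefix each time.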

LLM in progress

7 - 7_Tokenization_and_Language_Modeling_Data.ipynb
- BPE/tokenization basics
- causal LM dataset windows
- train/val split, packing, masking
8 - 8_Decoder_Only_Transformer.ipynb
- GPT-style blocks
- causal self-attention
- tied embeddings, RMSNorm/LayerNorm, GELU/SwiGLU
9 - 9_Pretraining_a_Small_Language_Model.ipynb
- cross-entropy next-token training
- sampling: temperature, top-k/top-p
- checkpointing and eval perplexity
10 - 10_Scaling_and_Systems_Notes.ipynb
- parameter/FLOP counting
- mixed precision
- gradient accumulation
- optional torch.compile
11 - 11_Data_Filtering_and_Dedup.ipynb
12 - 12_SFT_and_DPO_Minimal.ipynb
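
As one illustration of the planned topics, the sketch below shows how causal LM dataset windows and a train/val split might be built from a flat list of token ids. The notebooks listed above are still in progress, so this is an assumption about the general technique and not their actual code. All function names and parameters are hypothetical.

```python
def train_val_split(token_ids, val_fraction=0.1):
    """Hold out the final fraction of the token stream for validation."""
    n_val = int(len(token_ids) * val_fraction)
    cut = len(token_ids) - n_val
    return token_ids[:cut], token_ids[cut:]


def causal_lm_windows(token_ids, block_size, stride=None):
    """Return (inputs, targets) pairs where targets are inputs shifted by one.

    Trailing tokens that cannot fill a complete window are dropped.
    """
    stride = stride or block_size
    windows = []
    for start in range(0, len(token_ids) - block_size, stride):
        chunk = token_ids[start : start + block_size + 1]
        # At every position the model is trained to predict the next token.
        windows.append((chunk[:-1], chunk[1:]))
    return windows


token_ids = list(range(12))  # toy stand-in for a tokenized corpus
# Split first so that no validation token leaks into a training window.
train_ids, val_ids = train_val_split(token_ids, val_fraction=0.25)
windows = causal_lm_windows(train_ids, block_size=3)
```

Each target sequence is the input sequence shifted left by one token, which is the setup that cross-entropy next-token training relies on.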

Folders and files

assets
1_Sequence_to_Sequence_Learning_with_Neural_Networks.ipynb
2_Learning_Phrase_Representations_using_RNN_Encoder_Decoder_for_Statistical_Machine_Translation_ipynb.ipynb
3_Neural_Machine_Translation_by_Jointly_Learning_to_Align_and_Translate.ipynb
4_Packed_Padded_Sequences,_Masking,_Inference_and_BLEU.ipynb
5_Convolutional_Sequence_to_Sequence_Learning.ipynb
6_Attention_is_All_You_Need.ipynb
7_KVCaching.ipynb
README.md
