A Practical Guide to Building a Seq2Seq Transformer from scratch with PyTorch
Discover how to build a Seq2Seq Transformer with PyTorch to translate legacy PL/I code into modern Kotlin, transforming outdated codebases into contemporary solutions.
Introduction
This Jupyter notebook details how to build a sequence-to-sequence (Seq2Seq) model using PyTorch, focusing on the Transformer architecture. Our goal is to develop a transpiler that translates PL/I code into Kotlin. Throughout the notebook, we'll cover everything from setting up the necessary libraries to training the model and evaluating its performance.
Libraries and setup
We start by installing the required libraries:
!pip install torch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2
!pip install torchtext==0.8.1
!pip install antlr4-python3-runtime==4.9.2
We then import all the necessary libraries. json will be used to load our data. antlr4 provides the backbone of our transpiler, letting us parse PL/I code and manipulate its structure with ease. torch (PyTorch) empowers us to build and train neural networks, while torchtext complements it with utilities for text processing and dataset handling. jinja2 simplifies code generation with its template engine, ensuring a smooth transition between languages. Finally, matplotlib will come in handy during training for visualising our progress.
import json
from antlr4 import *
from pli.PLILexer import PLILexer
from pli.PLIParser import PLIParser
from pli.PLIVisitor import PLIVisitor
import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.data import Field, TabularDataset, BucketIterator
from jinja2 import Template
import matplotlib.pyplot as plt
The Transformer Model
The Transformer class encapsulates a custom implementation of the Transformer model, a powerful architecture for sequence-to-sequence tasks. It comprises various components, including embedding layers for source and target sequences, positional embeddings to capture sequence order, and a multi-layer Transformer module. The model utilizes dropout for regularization and employs masks to handle padding and prevent information leakage during training. With these components, the Transformer class can efficiently process source and target sequences, facilitating tasks like language translation or code generation.
class Transformer(nn.Module):
    # Initialises the Transformer model
    def __init__(
        self,
        embedding_size,
        src_vocab_size,
        trg_vocab_size,
        src_pad_idx,
        num_heads,
        num_encoder_layers,
        num_decoder_layers,
        forward_expansion,
        dropout,
        max_len,
        device,
    ):
        super(Transformer, self).__init__()
        self.src_word_embedding = nn.Embedding(src_vocab_size, embedding_size)
        self.src_position_embedding = nn.Embedding(max_len, embedding_size)
        self.trg_word_embedding = nn.Embedding(trg_vocab_size, embedding_size)
        self.trg_position_embedding = nn.Embedding(max_len, embedding_size)
        self.device = device
        self.transformer = nn.Transformer(
            embedding_size,
            num_heads,
            num_encoder_layers,
            num_decoder_layers,
            forward_expansion,
            dropout,
        )
        self.fc_out = nn.Linear(embedding_size, trg_vocab_size)
        self.dropout = nn.Dropout(dropout)
        self.src_pad_idx = src_pad_idx

    # Generates a mask for the source sequence to handle padding
    def make_src_mask(self, src):
        src_mask = src.transpose(0, 1) == self.src_pad_idx
        # (N, src_len)
        return src_mask.to(self.device)

    # Forward pass of the Transformer model
    def forward(self, src, trg):
        src_seq_length, N = src.shape
        trg_seq_length, N = trg.shape

        src_positions = (
            torch.arange(0, src_seq_length)
            .unsqueeze(1)
            .expand(src_seq_length, N)
            .to(self.device)
        )
        trg_positions = (
            torch.arange(0, trg_seq_length)
            .unsqueeze(1)
            .expand(trg_seq_length, N)
            .to(self.device)
        )

        embed_src = self.dropout(
            self.src_word_embedding(src) + self.src_position_embedding(src_positions)
        )
        embed_trg = self.dropout(
            self.trg_word_embedding(trg) + self.trg_position_embedding(trg_positions)
        )

        src_padding_mask = self.make_src_mask(src)
        trg_mask = self.transformer.generate_square_subsequent_mask(trg_seq_length).to(
            self.device
        )

        out = self.transformer(
            embed_src,
            embed_trg,
            src_key_padding_mask=src_padding_mask,
            tgt_mask=trg_mask,
        )
        out = self.fc_out(out)
        return out
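Before moving on, it can help to sanity-check the expected tensor shapes. The snippet below is a quick smoke test with made-up dimensions (the vocabulary sizes, pad index and sequence lengths here are arbitrary, not the hyperparameters used later in this notebook):

# Quick shape check with toy, illustrative values only.
toy_model = Transformer(
    embedding_size=32, src_vocab_size=20, trg_vocab_size=20, src_pad_idx=1,
    num_heads=4, num_encoder_layers=2, num_decoder_layers=2,
    forward_expansion=64, dropout=0.1, max_len=50, device="cpu",
)
src = torch.randint(0, 20, (12, 3))   # (src_len, batch) of token indices
trg = torch.randint(0, 20, (9, 3))    # (trg_len, batch) of token indices
out = toy_model(src, trg)
print(out.shape)                      # torch.Size([9, 3, 20]) -> (trg_len, batch, trg_vocab_size)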
Data preparation and tokenization
For our model to understand and generate text, we need to tokenize our input data. We also set up our datasets so we can manage training and testing data efficiently.
Tokenize
We define a simple tokenizer that splits a string into tokens on whitespace. We then create two Field objects with the required specifications and, finally, a dictionary that maps field names to tuples, which will be used later to specify how the data is loaded and processed.
import torchtext.data as data
tokenizer = lambda x: x.split()

pli = data.Field(sequential=True, use_vocab=True, tokenize=tokenizer, lower=True, init_token="<sos>", eos_token="<eos>")
ktl = data.Field(sequential=True, use_vocab=True, tokenize=tokenizer, lower=True, init_token="<sos>", eos_token="<eos>")

fields = {'pli': ('p', pli), 'ktl': ('k', ktl)}
Splits
We use the TabularDataset class from the torchtext.data package to create separate datasets for training and testing. The data is loaded from JSON files located in the 'data' directory. We specify the format of the data as JSON and define the fields to be extracted from the JSON files using the fields parameter.
train, test = data.TabularDataset.splits(
    path='data',
    train='train.json',
    test='test.json',
    format='json',
    fields=fields,
)
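TabularDataset with format='json' expects one JSON object per line, with keys matching those in the fields dictionary ('pli' and 'ktl'). The real training pairs are not shown in this notebook; the line below is a purely hypothetical illustration of that layout:

# Hypothetical example of one line of data/train.json (illustration only;
# the actual dataset used for this project is not reproduced here).
example_line = {
    "pli": "PROCEDURE MAIN {{type0}} {{type1}}",
    "ktl": "fun main (args: {{type0}}<{{type1}}>)",
}
print(json.dumps(example_line))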
Vocab
Building a vocabulary involves creating a dictionary that maps each unique word in the dataset to a unique index. This process is crucial for natural language processing tasks, as it allows machine learning models to represent words as numerical values, which they can process and understand. In this case, we build vocabularies for the PL/I (pli) and Kotlin (ktl) datasets, ensuring that the model has a predefined set of tokens it can understand and process during training and inference.
pli.build_vocab(train, max_size=10000, min_freq=1)
ktl.build_vocab(train, max_size=10000, min_freq=1)
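Once the vocabularies are built, tokens map to integer indices through stoi and back through itos, and the special tokens added by the Field objects are included automatically. A quick inspection (the printed sizes and indices depend on your dataset):

# Inspect the vocabularies (actual values depend on the training data).
print(len(pli.vocab), len(ktl.vocab))
print(ktl.vocab.stoi["<sos>"], ktl.vocab.stoi["<eos>"], ktl.vocab.stoi["<pad>"])
print(ktl.vocab.itos[:10])   # specials first, then the most frequent tokens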
Translate and transpile
Translate
Here we define a translate_sequence function. It tokenizes the input sentence, adds <sos> and <eos> tokens at the beginning and end respectively, converts the tokens to indices using the vocabulary of the pli field, converts the indices to a PyTorch tensor, and iteratively predicts the next token in the translated sequence using the trained model until either the <eos> token is predicted or the maximum length is reached. Finally, it converts the predicted indices back to tokens using the vocabulary of the ktl field and returns the translated sentence, removing the start token <sos>, which is only used to signal the model to start generating and is not part of the actual translated text.
def translate_sequence(sentence, pli, ktl, device, max_length=50):
    if type(sentence) == str:
        tokens = [token.lower() for token in sentence.split()]
    else:
        tokens = [token.lower() for token in sentence]

    # Add <SOS> and <EOS> in beginning and end respectively
    tokens.insert(0, pli.init_token)
    tokens.append(pli.eos_token)

    # Iterate over each language token and convert it to an index
    text_to_indices = [pli.vocab.stoi[token] for token in tokens]

    # Convert to Tensor
    sentence_tensor = torch.LongTensor(text_to_indices).unsqueeze(1).to(device)

    outputs = [ktl.vocab.stoi["<sos>"]]
    for i in range(max_length):
        trg_tensor = torch.LongTensor(outputs).unsqueeze(1).to(device)

        with torch.no_grad():
            output = model(sentence_tensor, trg_tensor)

        best_guess = output.argmax(2)[-1, :].item()
        outputs.append(best_guess)

        if best_guess == ktl.vocab.stoi["<eos>"]:
            break

    translated_sentence = [ktl.vocab.itos[idx] for idx in outputs]
    # remove start token
    return translated_sentence[1:]
Transpile
We also define a transpile_sequence function. It retrieves the code tokens and context data from the input dictionary, then initializes an empty list to store the transpiled code with proper indentation. The function iterates through the tokens, adjusting the indentation level based on the curly braces {} encountered in the code. It ignores the <eos> token, joins the remaining tokens into a single line, and renders that line as a Jinja2 template using the context data, returning the rendered code together with the updated indentation level.
def transpile_sequence(translated, level):
    tokens = translated["code"]
    data = translated["context"]
    lint = []

    for t in tokens:
        spacer = "".rjust(level * 4)
        if t == "{":
            level += 1
        elif t == "}" and level > 0:
            level -= 1
            spacer = "".rjust(level * 4)

        if t != "<eos>":
            lint.append(spacer + t)

    code = " ".join(lint)
    t = Template(code)

    return t.render(data), level
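As an illustration, here is a hypothetical call with hand-written tokens and context (not actual model output), showing both the brace-driven indentation and the Jinja2 substitution:

# Hypothetical tokens and context, purely for illustration.
demo_level = 0
line1, demo_level = transpile_sequence(
    {"code": ["fun", "main", "(args:", "{{type0}}<{{type1}}>)", "<eos>"],
     "context": {"type0": "Array", "type1": "String"}},
    demo_level,
)
line2, demo_level = transpile_sequence({"code": ["{", "<eos>"], "context": {}}, demo_level)
line3, demo_level = transpile_sequence({"code": ["}", "<eos>"], "context": {}}, demo_level)
print(line1)   # fun main (args: Array<String>)
print(line2)   # {
print(line3)   # }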
Specifics
Before defining the final parameters and training our model, we have a few useful functions to define.
The remove_eos function removes all <eos> tokens and then concatenates the remaining tokens into a single string.
def remove_eos(witheos):
    noeos = []
    for w in witheos:
        if w != '<eos>':
            noeos.append(w)
    return " ".join(noeos)
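For instance, given a made-up token list (not model output):

print(remove_eos(["fun", "main", "<eos>"]))   # -> "fun main"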
The following functions, save_checkpoint and load_checkpoint, are essential for saving and loading the state of our model during training or inference.
def save_checkpoint(state, filename="checkpoint.pth.tar"):
    print("=> Saving checkpoint")
    torch.save(state, filename)

def load_checkpoint(checkpoint, model, optimizer):
    print("=> Loading checkpoint")
    model.load_state_dict(checkpoint["state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer"])
Training
Hyperparameters
We start by defining some model hyperparameters that determine how our model processes and transforms the input data. We also define some training hyperparameters that regulate the training process of our model.
# ready to define everything we need for training our Seq2Seq model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

load_model = True
save_model = True

# Model hyperparameters
src_vocab_size = len(pli.vocab)
trg_vocab_size = len(ktl.vocab)
embedding_size = 512
num_heads = 8
num_encoder_layers = 6
num_decoder_layers = 6
dropout = 0.10
max_len = 100
forward_expansion = 4
src_pad_idx = ktl.vocab.stoi["<pad>"]

global level

# Training hyperparameters
num_epochs = 200
learning_rate = 3e-4
batch_size = 32
training_losses = []
Model Initialization and Training Setup
We then initialize the model, optimizer, scheduler, and criterion for training the Transformer model. We also create iterators for the training and test datasets, which facilitate efficient data management, improve computational efficiency, and enhance the overall effectiveness of the training process.
train_iterator, test_iterator = BucketIterator.splits(
    (train, test),
    batch_size=batch_size,
    sort_within_batch=True,
    sort_key=lambda x: len(x.p),
    device=device,
)

model = Transformer(
    embedding_size,
    src_vocab_size,
    trg_vocab_size,
    src_pad_idx,
    num_heads,
    num_encoder_layers,
    num_decoder_layers,
    forward_expansion,
    dropout,
    max_len,
    device,
).to(device)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.1, patience=10, verbose=True
)

pad_idx = ktl.vocab.stoi["<pad>"]
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
Training loop
Finally, we define our custom training loop. Within the loop, batches of data are fetched from train_iterator. Input and target sequences are transferred to the appropriate device (CPU or GPU), and the model is trained using forward and backward passes. Gradient clipping is applied to prevent exploding gradients, and the optimizer updates the model parameters based on the computed gradients. Every 50 epochs and again after training, we check the model's behaviour on example PL/I sequences, translating them to Kotlin with the trained model. Finally, we use matplotlib to visualise the training loss over iterations and optionally save the trained model checkpoint for future use.
if __name__ == '__main__':
    for epoch in range(num_epochs):
        model.eval()
        # little bit of output to check the progress
        if epoch % 50 == 0:
            print(f"[Epoch {epoch} / {num_epochs}]")
            # Iterate over the dataset and extract PLI sequences
            sentences = [
                {'pli': ['PROCEDURE', 'MAIN', '{{type0}}', '{{type1}}'],
                 'context': {'type0': 'Array', 'type1': 'String'}},
                {'pli': ['DO'], 'context': {}},
                {'pli': ['END'], 'context': {}}
            ]
            print(f"Translated example sentence:")
            level = 0
            for s in sentences:
                translated = translate_sequence(
                    s['pli'], pli, ktl, device, max_length=50
                )
                transpiled, level = transpile_sequence({
                    'code': translated,
                    'context': s['context']
                }, level)
                print(f"{transpiled}")
            last_5_losses = training_losses[-5:]
            print("Last 5 training losses:", last_5_losses)

        model.train()
        losses = []

        for batch_idx, batch in enumerate(train_iterator):
            # Get input and targets and get to cuda
            inp_data = batch.p.to(device)
            target = batch.k.to(device)

            # Forward
            output = model(inp_data, target[:-1, :])
            output = output.reshape(-1, output.shape[2])
            target = target[1:].reshape(-1)

            optimizer.zero_grad()
            loss = criterion(output, target)
            losses.append(loss.item())

            # Back prop
            loss.backward()
            # Clip to avoid exploding gradient issues, makes sure grads are
            # within a healthy range
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)

            # Gradient descent step
            optimizer.step()
            training_losses.append(loss.item())

        mean_loss = sum(losses) / len(losses)
        scheduler.step(mean_loss)

    # Print the final epoch
    print(f"[Epoch {num_epochs} / {num_epochs}]")
    # Translate and transpile here if needed
    for s in sentences:
        translated = translate_sequence(
            s['pli'], pli, ktl, device, max_length=50
        )
        transpiled, level = transpile_sequence({
            'code': translated,
            'context': s['context']
        }, level)
        print(f"{transpiled}")

    last_5_losses = training_losses[-5:]
    print("Last 5 training losses:", last_5_losses)

    plt.plot(training_losses, label='Training Loss')
    plt.xlabel('Iterations')
    plt.ylabel('training_losses')
    plt.title('Training Loss Over Iterations')
    plt.legend()
    plt.show()

    if save_model:
        checkpoint = {
            "state_dict": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        }
        save_checkpoint(checkpoint)
[Epoch 0 / 200]
Translated example sentence:
<unk> <unk> <unk> <unk> <unk> <unk> <unk> var var var var var var var var var var var var var var var var var var var var var var var var var var var var var var var var var var var var var var var var var var var
<unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk>
<unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk>
Last 5 training losses: []
[Epoch 50 / 200]
Translated example sentence:
fun main
{
}
Last 5 training losses: [0.20085901021957397, 0.1371316760778427, 0.10002613812685013, 0.07097212225198746, 0.059380751103162766]
Epoch 61: reducing learning rate of group 0 to 3.0000e-05.
[Epoch 100 / 200]
Translated example sentence:
fun main (args: Array<String>)
{
}
Last 5 training losses: [0.007832802832126617, 0.007300992030650377, 0.009152849204838276, 0.009059373289346695, 0.012209202162921429]
[Epoch 150 / 200]
Translated example sentence:
fun main (args: Array<String>)
{
}
Last 5 training losses: [0.003737865248695016, 0.0034258407540619373, 0.003976386971771717, 0.004198224283754826, 0.0028304841835051775]
Epoch 161: reducing learning rate of group 0 to 3.0000e-06.
Epoch 172: reducing learning rate of group 0 to 3.0000e-07.
Epoch 183: reducing learning rate of group 0 to 3.0000e-08.
Epoch 194: reducing learning rate of group 0 to 3.0000e-09.
[Epoch 200 / 200]
fun main (args: Array<String>)
{
}
Last 5 training losses: [0.0040907906368374825, 0.0033304481767117977, 0.0034919297322630882, 0.0036950635258108377, 0.003289699787274003]
=> Saving checkpoint
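Note that test_iterator is created but never consumed by the loop above. As a minimal sketch (not part of the original notebook), a held-out loss could be computed the same way the training loss is, with gradients disabled:

# Sketch of an evaluation pass over the held-out set (assumption: same batch
# layout and shifted-target scheme as in the training loop).
def evaluate(model, iterator, criterion, device):
    model.eval()
    total_loss = 0.0
    with torch.no_grad():
        for batch in iterator:
            inp_data = batch.p.to(device)
            target = batch.k.to(device)
            output = model(inp_data, target[:-1, :])
            output = output.reshape(-1, output.shape[2])
            loss = criterion(output, target[1:].reshape(-1))
            total_loss += loss.item()
    return total_loss / len(iterator)

# print(f"Test loss: {evaluate(model, test_iterator, criterion, device):.4f}")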
Running the Transpiler
And as a little bonus, we can show an example of how our transpiler works. We define a run_model function that processes a PL/I file containing the original PL/I code. We load our pre-trained model and its optimizer from the checkpoint file, then set up a lexer and parser for the PL/I code using ANTLR4. We generate a dataset from the parsed code using a visitor pattern, which walks the AST (Abstract Syntax Tree) produced by the parser. For each statement in the PL/I code, we translate it into Kotlin tokens with our sequence-to-sequence model and transpile them into Kotlin code. The transpiled Kotlin code is accumulated and printed as the output. This function essentially automates the process of translating our PL/I code to Kotlin.
def run_model(filename):
    with open(filename, 'r') as file:
        original_code = file.read()
    print("PL1:")
    print(original_code)
    print()

    load_checkpoint(torch.load("checkpoint.pth.tar"), model, optimizer)

    # Lexer setup
    input_stream = FileStream(filename)
    lexer = PLILexer(input_stream)
    stream = CommonTokenStream(lexer)

    # Parser setup
    parser = PLIParser(stream)
    tree = parser.program()

    # Dataset generation
    visitor = PLIVisitor()
    statements = visitor.visit(tree)

    # Accumulate transpiled sequences
    transpiled_code = ""
    level = 0
    for s in statements:
        translated = translate_sequence(
            s["pli"], pli, ktl, device, max_length=50
        )
        transpiled, level = transpile_sequence({
            'code': translated,
            'context': s['context']
        }, level)
        transpiled_code += transpiled + "\n"

    # Print the entire block of transpiled code
    print("KTL:")
    print("\n" + transpiled_code)

# Example usage:
filename = "FIB.PLI"  # Replace with the actual filename
run_model(filename)
PL1:
Factorial: proc options (main);
dcl (n,result) fixed bin(31);
n = 5;
result = Compute_factorial(n);
end Factorial;
/***********************************************/
/* Subroutine */
/***********************************************/
Compute_factorial: proc (n) returns (fixed bin(31));
dcl n fixed bin(15);
if n <= 1 then
return(1);
return( n*Compute_factorial(n-1) );
end Compute_factorial;
=> Loading checkpoint
KTL:
fun main (args: Array<String>)
{
var n : Int
var result : Int
n = 5
result = compute_factorial(n)
}
fun compute_factorial(n : Int) : Int
{
if(n<=1)
{
return 1
}
return n*compute_factorial(n-1)
}
Conclusion
And there you have it! This notebook provides a comprehensive look at building and training a Seq2Seq Transformer model in PyTorch, from scratch, for the specific task of transpiling PL/I code to Kotlin. The model not only demonstrates the capabilities of neural networks in handling language translation tasks but also highlights the flexibility of the Transformer architecture in processing sequence data.
For those interested in further exploring machine translation models, another article is available where I use the Hugging Face API to achieve similar tasks. That API simplifies many of the steps covered here and provides powerful tools for deploying machine translation models in production environments.