<a href="https://colab.research.google.com/github/parth-pai/Learners_Space_2023_NLP/blob/main/LS_Training_Model_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Final Project**
This is my final project for the Learner's Space course on Natural Language Processing. I load a parallel dataset and a pretrained machine translation model, and then fine-tune that model on the dataset. I also discussed the project with friends who were taking the same course, so that we could help each other with doubts and errors.

First, we switch the Colab runtime to a T4 GPU and check that the GPU is visible with the command below.

In [1]:
!nvidia-smi

Mon Aug 14 15:53:22 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   60C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
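The same check can also be scripted from Python. The sketch below is only an illustration: the helper name `gpu_summary` is hypothetical, and it assumes that `nvidia-smi` is on the `PATH` whenever a GPU runtime is active. It returns `None` instead of failing when no GPU tool is found.

```python
import shutil
import subprocess


def gpu_summary():
    """Return a one-line-per-GPU summary from nvidia-smi, or None if unavailable."""
    # No nvidia-smi binary usually means no NVIDIA driver, hence no GPU runtime.
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip()


summary = gpu_summary()
print(summary if summary else "No GPU visible to nvidia-smi")
```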
                                                                               

With the GPU enabled, we install the required libraries:


1.   `transformers` for the pipelines, the tokenizer, the model and the trainer
2.   `accelerate`, which the trainer in `transformers` uses for PyTorch training
3.   `sentencepiece`, which the tokenizer of the translation model needs
4.   `gradio` for an interactive app
5.   `datasets` to load the dataset
6.   `evaluate` to load and compute the metric
7.   `sacrebleu` for the metric
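
Once the install cell that follows has run, a quick way to confirm that every library in the list is present is to query the installed package metadata. This is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical.

```python
from importlib import metadata

REQUIRED = [
    "transformers", "accelerate", "sentencepiece", "gradio",
    "datasets", "evaluate", "sacrebleu",
]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```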


In [2]:
! pip install -q transformers accelerate sentencepiece gradio datasets evaluate sacrebleu

## **Dataset and Preprocessing**
Now we load the "enimai/MuST-C-it" dataset, which contains parallel English and Italian sentences.

In [3]:
from datasets import load_dataset
raw_dataset = load_dataset("enimai/MuST-C-it")
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['en', 'it'],
        num_rows: 253588
    })
    validation: Dataset({
        features: ['en', 'it'],
        num_rows: 1309
    })
    test: Dataset({
        features: ['en', 'it'],
        num_rows: 2574
    })
})

We split the original training set into a new training set (80%) and a test set (20%), fixing the seed so that the split is reproducible.

In [4]:
# train_test_split is a method of the datasets.Dataset object, so no scikit-learn import is needed
split_dataset = raw_dataset["train"].train_test_split(test_size=0.2, seed=30)
split_dataset

DatasetDict({
    train: Dataset({
        features: ['en', 'it'],
        num_rows: 202870
    })
    test: Dataset({
        features: ['en', 'it'],
        num_rows: 50718
    })
})
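
The two sizes reported above can be cross-checked with a little arithmetic. The sketch assumes that the test share is rounded up and that the training set receives the remaining rows; this assumption is consistent with the numbers printed by the cell, but it is not taken from the library's source.

```python
import math

n_rows = 253588      # rows in the original training split
test_size = 0.2      # fraction requested for the new test split

# Assumption: the test share is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(f"train: {n_train}, test: {n_test}")
```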

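Next, we define a function that tokenizes the English sentences as inputs and the Italian sentences as targets. It relies on the `tokenizer` object that is loaded in a later cell, so it can only be called after that cell has run.

In [5]:
def pre_process_text(text):
  # `text` is a batch: a dict that maps each column name to a list of sentences
  inputs = list(text['en'])    # source sentences (English)
  outputs = list(text['it'])   # target sentences (Italian)

  # max_length=200 caps the sequence length. Truncation is not requested explicitly here,
  # so the tokenizer falls back to its default strategy and prints a warning when the
  # map cell runs; passing truncation=True would make this explicit.
  tokenized_text=tokenizer(inputs, text_target=outputs, max_length=200)
  return tokenized_text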

## **Importing Model**
Now we load the "Helsinki-NLP/opus-mt-en-it" model and use a translation pipeline to try it on an example sentence.


In [6]:
from transformers import pipeline
model_name="Helsinki-NLP/opus-mt-en-it"
translator=pipeline("translation", model=model_name)
translator("The buildings in that city look so organised and beautiful. A perfect concrete jungle.")


[{'translation_text': 'Gli edifici in quella città sembrano così organizzati e belli. Una giungla di cemento perfetto.'}]

`AutoTokenizer` loads the tokenizer that belongs to a given checkpoint. From the model name, it selects the matching tokenizer class and vocabulary files automatically.

In [7]:
from transformers import AutoTokenizer
# return_tensors is an argument of the tokenizer call, not of from_pretrained, so it is dropped here
tokenizer=AutoTokenizer.from_pretrained(model_name)

We apply the preprocessing function to both splits in batches. The original text columns (`en` and `it`) are removed, so only the tokenized fields remain.

In [8]:
tokenized_dataset=split_dataset.map(
    pre_process_text,
    batched=True,
    remove_columns=split_dataset['train'].column_names
)
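
With `batched=True`, `map` calls the function with a dict that maps each column name to a list of values, and expects a dict of lists back. The sketch below illustrates that contract and the effect of a length cap. It uses a toy whitespace tokenizer (`toy_tokenize`, a hypothetical stand-in) rather than the real SentencePiece tokenizer, so the tokens are words instead of ids.

```python
def toy_tokenize(sentence, max_length):
    """Hypothetical stand-in for a tokenizer: one token per whitespace-separated word."""
    return sentence.split()[:max_length]  # keep at most max_length tokens


def toy_pre_process(batch, max_length=200):
    """Mimic the shape of a batched preprocessing function: dict of lists in, dict of lists out."""
    return {
        "input_ids": [toy_tokenize(s, max_length) for s in batch["en"]],
        "labels": [toy_tokenize(s, max_length) for s in batch["it"]],
    }


batch = {
    "en": ["Thank you very much .", "Good morning ."],
    "it": ["Grazie mille .", "Buongiorno ."],
}
out = toy_pre_process(batch, max_length=3)
print(out["input_ids"])
print(out["labels"])
```

Because the returned dict no longer contains `en` and `it`, removing those columns in the real `map` call leaves a dataset with only the tokenized fields.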

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.



In the code below:

1.   The `AutoModelForSeq2SeqLM` class loads the pretrained sequence-to-sequence model
2.   The `DataCollatorForSeq2Seq` data collator pads the tokenized examples into batches and prepares the decoder inputs for training or evaluation



In [9]:
from transformers import AutoModelForSeq2SeqLM
from transformers import DataCollatorForSeq2Seq

model= AutoModelForSeq2SeqLM.from_pretrained(model_name)
data_collator=DataCollatorForSeq2Seq(tokenizer,model=model)

In [10]:
batches = data_collator([tokenized_dataset["train"][i] for i in range(1,3)])
batches['decoder_input_ids']

tensor([[80034,    54,  3317,  3469,  5490,  9477,   225,   708,   932,    86,
           235,    16,     1,     6,   528,    45,  1057,    17,  7009,   296,
            18,  2897,    17,     7,  1230,   343,    18,   365,   114,  9716,
            10,     7,  1142,  8188,    43,   350, 12394,    18,   343,  2386,
             6,   337,     3,    17,     9,  3317,  2339,  1451,    51,    46,
          1027,    23,   421,     5,   100,    10,  2216,  8188,     2],
        [80034,  7572,  1146, 20765,   630,    42, 16200, 25431,    86,    23,
          4373,     2,     0, 80034, 80034, 80034, 80034, 80034, 80034, 80034,
         80034, 80034, 80034, 80034, 80034, 80034, 80034, 80034, 80034, 80034,
         80034, 80034, 80034, 80034, 80034, 80034, 80034, 80034, 80034, 80034,
         80034, 80034, 80034, 80034, 80034, 80034, 80034, 80034, 80034, 80034,
         80034, 80034, 80034, 80034, 80034, 80034, 80034, 80034, 80034]])
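
The tensor above shows the pattern behind `decoder_input_ids`: each row starts with the id 80034, continues with the label tokens, and is padded with 80034 again. The sketch below reproduces that pattern in plain Python. It is an illustrative reimplementation, not the library's code; the constant names are hypothetical, the id 80034 is read from the output above, and -100 is the value the collator uses to pad labels so that the loss ignores them.

```python
PAD_ID = 80034        # padding id seen in the tensor above; each row also starts with it
IGNORE_INDEX = -100   # label padding value that the loss ignores


def shift_right(labels, start_id=PAD_ID, pad_id=PAD_ID):
    """Build decoder inputs: prepend the start id, drop the last label, replace -100 by the pad id."""
    shifted = [start_id] + list(labels[:-1])
    return [pad_id if token == IGNORE_INDEX else token for token in shifted]


# A short label row: three tokens, the end-of-sequence id 0, then label padding.
labels = [7572, 1146, 2, 0, IGNORE_INDEX, IGNORE_INDEX]
print(shift_right(labels))  # [80034, 7572, 1146, 2, 0, 80034]
```

Shifting the labels one position to the right means that, at each step, the decoder sees the previous target tokens and is trained to predict the next one.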

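## **Evaluating the Accuracy**
Here we evaluate the model with the SacreBLEU score. BLEU measures how many word n-grams of the predicted translation also occur in the reference translation, with a penalty for translations that are too short. The score ranges from 0 to 100, and higher is better.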

In [11]:
import evaluate
metric_eval = evaluate.load("sacrebleu")
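
To make the idea behind the score concrete, here is a simplified sentence-level BLEU in plain Python. It is a sketch under stated assumptions, not the SacreBLEU implementation: it splits on whitespace, uses no smoothing, and scores a single sentence pair, whereas SacreBLEU applies its own tokenization and works at corpus level, so its numbers will differ. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(prediction, reference, max_n=4):
    """Simplified BLEU on whitespace tokens, scaled to the range 0-100."""
    pred, ref = prediction.split(), reference.split()
    if not pred:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
        total = sum(pred_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: predictions shorter than the reference are scaled down.
    if len(pred) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(pred))
    # Geometric mean of the n-gram precisions, times the brevity penalty.
    return 100 * brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "Una giungla di cemento perfetta ."
print(simple_bleu(reference, reference))                             # identical sentences
print(simple_bleu("Una giungla di cemento perfetto .", reference))   # one word differs
```

An exact match gets the maximum score, and a single differing word lowers the score because it breaks every n-gram that contains it.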


Here we define the function that computes the metric. It decodes the predictions and the labels back into text and returns the BLEU score.

In [12]:
import numpy as np

def compute_metrics(eval_preds):
  preds, labels = eval_preds
  # Some models return a tuple; the generated token ids are its first element
  if isinstance(preds, tuple):
    preds=preds[0]
  decoded_preds= tokenizer.batch_decode(preds, skip_special_tokens=True)

  # Labels are padded with -100 (ignored by the loss); swap in the pad token id so they can be decoded
  labels=np.where(labels != -100, labels, tokenizer.pad_token_id)
  decoded_labels=tokenizer.batch_decode(labels, skip_special_tokens=True)

  decoded_preds=[pred.strip() for pred in decoded_preds]
  # sacrebleu expects a list of references for each prediction
  decoded_labels=[[label.strip()] for label in decoded_labels]

  result=metric_eval.compute(predictions=decoded_preds, references=decoded_labels)
  return {"bleu": result["score"]}

Now we log in to the Hugging Face Hub from the notebook, so that the fine-tuned model can be pushed there.

In [13]:
from huggingface_hub import notebook_login
notebook_login()


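## **Fine-Tuning**
Now we use the `Seq2SeqTrainingArguments` class to specify the training settings and hyperparameters for the sequence-to-sequence model. With the values below, evaluation during training is switched off (`evaluation_strategy="no"`), a checkpoint is saved after every epoch, and the model is pushed to the Hugging Face Hub.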

In [17]:
from transformers import Seq2SeqTrainingArguments

arg= Seq2SeqTrainingArguments(
    "model-en-to-it",  # output directory; its name is also used for the Hub repository
    evaluation_strategy="no",
    save_strategy="epoch",
    num_train_epochs=3,
    weight_decay=0.01,
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    save_total_limit=3,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True
)

We now create a `Seq2SeqTrainer`, which trains and evaluates sequence-to-sequence models. Once it is initialized, training can start.

In [18]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    arg,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

/content/model-en-to-it is already a clone of https://huggingface.co/parthpai/model-en-to-it. Make sure you pull the latest changes with `repo.git_pull()`.


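Now we start the training. In this run, three epochs took about 76 minutes on the T4 GPU (see `train_runtime` in the output below); the time depends on the dataset size, the batch size and the number of epochs.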

In [19]:
trainer.train()



| Step | Training Loss |
|------|---------------|
| 500  | 1.294  |
| 1000 | 1.267  |
| 1500 | 1.2414 |
| 2000 | 1.26   |
| 2500 | 1.2522 |
| 3000 | 1.2502 |
| 3500 | 1.2415 |
| 4000 | 1.2492 |
| 4500 | 1.2364 |
| 5000 | 1.2277 |


Adding files tracked by Git LFS: ['source.spm', 'target.spm']. This may take a bit of time if the files are large.
Several commits (2) will be pushed upstream.
Several commits (3) will be pushed upstream.


TrainOutput(global_step=19020, training_loss=1.1422517776489258, metrics={'train_runtime': 4588.4073, 'train_samples_per_second': 132.641, 'train_steps_per_second': 4.145, 'total_flos': 1.3068977520771072e+16, 'train_loss': 1.1422517776489258, 'epoch': 3.0})
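
The figures in `TrainOutput` can be cross-checked against the training settings. The sketch below assumes a single GPU and no gradient accumulation, so one optimizer step processes one batch of 32 examples; all input numbers are copied from the outputs above.

```python
import math

train_rows = 202870          # size of the training split
batch_size = 32              # per_device_train_batch_size
epochs = 3                   # num_train_epochs
train_runtime = 4588.4073    # seconds, from TrainOutput

# The last batch of an epoch may be smaller, hence the rounding up.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_rows * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(f"steps per epoch:    {steps_per_epoch}")
print(f"total steps:        {total_steps}")
print(f"samples per second: {samples_per_second:.3f}")
print(f"steps per second:   {steps_per_second:.3f}")
print(f"runtime in minutes: {train_runtime / 60:.1f}")
```

The total of 19020 steps matches `global_step`, and the two throughput values agree with `train_samples_per_second` and `train_steps_per_second` in the output.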

## **Saving in Drive**
After training, we save the model to Google Drive so that it can be loaded later, for example to build a Gradio interface.

In [22]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [23]:
model.save_pretrained('/content/drive/MyDrive/LS_NLP_Parth')