---
library_name: transformers
license: mit
base_model: microsoft/prophetnet-large-uncased-cnndm
datasets:
- eilamc14/wikilarge-clean
language:
- en
tags:
- prophetnet
- text-simplification
- WikiLarge
model-index:
- name: prophetnet-large-uncased-cnndm-text-simplification
  results:
  - task:
      type: text2text-generation
      name: Text Simplification
    dataset:
      name: ASSET
      type: facebook/asset
      url: https://huggingface.co/datasets/facebook/asset
      split: test
    metrics:
    - type: SARI
      value: 38.01
    - type: FKGL
      value: 7.70
    - type: BERTScore
      value: 67.82
    - type: LENS
      value: 60.85
    - type: Identical ratio (ci)
      value: 0.11
  - task:
      type: text2text-generation
      name: Text Simplification
    dataset:
      name: MEDEASI
      type: cbasu/Med-EASi
      url: https://huggingface.co/datasets/cbasu/Med-EASi
      split: test
    metrics:
    - type: SARI
      value: 36.86
    - type: FKGL
      value: 8.50
    - type: BERTScore
      value: 38.31
    - type: LENS
      value: 51.36
    - type: Identical ratio (ci)
      value: 0.04
  - task:
      type: text2text-generation
      name: Text Simplification
    dataset:
      name: OneStopEnglish
      type: OneStopEnglish
      url: https://github.com/nishkalavallabhi/OneStopEnglishCorpus
      split: advanced→elementary
    metrics:
    - type: SARI
      value: 39.17
    - type: FKGL
      value: 7.00
    - type: BERTScore
      value: 65.22
    - type: LENS
      value: 61.53
    - type: Identical ratio (ci)
      value: 0.13
---

# Model Card for prophetnet-large-uncased-cnndm-text-simplification

This is one of the models fine-tuned for text simplification in the [Simplify This](https://github.com/eilamc14/Simplify-This) project.

## Model Details

### Model Description

Fine-tuned **sequence-to-sequence (encoder–decoder) Transformer** for **English text simplification**, trained on the dataset **`eilamc14/wikilarge-clean`** (cleaned WikiLarge-style pairs).

- **Model type:** Seq2Seq Transformer (encoder–decoder)
- **Language (NLP):** English
- **License:** `mit`
- **Finetuned from model:** `microsoft/prophetnet-large-uncased-cnndm`

### Model Sources

- **Repository (code):** https://github.com/eilamc14/Simplify-This
- **Dataset:** https://huggingface.co/datasets/eilamc14/wikilarge-clean

## Uses

### Direct Use

The model is intended for **English text simplification**.

- **Input format:** `Simplify: <source text>`
- **Output:** `<simplified text>`

**Typical uses**

- Research on automatic text simplification
- Benchmarking against other simplification systems
- Demos/prototypes that require simpler English rewrites

### Downstream Use

This repository already contains a **fine-tuned** model specialized for text simplification. Further fine-tuning is **optional** and mainly relevant when:

- Adapting to a markedly different domain (e.g., medical/legal/news)
- Addressing specific failure modes (e.g., over/under-simplification, factual drops)
- Distilling/quantizing for deployment constraints

When fine-tuning further, keep the same input convention, `Simplify: <...>`, as in the sketch below.
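As a rough illustration, further fine-tuning data can be prepared with the same prefix. This is a minimal, hedged sketch, not the project's actual preprocessing code: the `source`/`target` column names and the 128-token limit are assumptions.

```python
from datasets import Dataset
from transformers import AutoTokenizer

model_id = "eilamc14/prophetnet-large-uncased-cnndm-text-simplification"
tokenizer = AutoTokenizer.from_pretrained(model_id)

PREFIX = "Simplify: "

def preprocess(batch):
    # Column names "source"/"target" are assumptions; adapt them to your dataset schema.
    inputs = [PREFIX + text for text in batch["source"]]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True)
    # Tokenize the simplified references as labels.
    labels = tokenizer(text_target=batch["target"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Toy example with a single complex/simple pair:
dataset = Dataset.from_dict({
    "source": ["The committee deemed the proposal unnecessarily complicated."],
    "target": ["The committee thought the proposal was too complicated."],
})
tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)
```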
### Out-of-Scope Use

Not intended for:

- Tasks unrelated to simplification (dialogue, translation, etc.)
- Production use without additional safety filtering (no toxicity/bias mitigation)
- Languages other than English
- High-stakes settings (legal/medical advice, safety-critical decisions)

## Bias, Risks, and Limitations

The model was trained on **Wikipedia and Simple English Wikipedia** alignments (via WikiLarge). As a result, it inherits the characteristics and limitations of this data:

- **Domain bias:** Simplifications may reflect encyclopedic style; performance may degrade on informal, technical, or domain-specific text (e.g., medical/legal/news).
- **Content bias:** Wikipedia content itself contains biases in coverage, cultural perspective, and phrasing. Simplified outputs may reflect or amplify these.
- **Simplification quality:** The model may:
  - Over-simplify (drop important details)
  - Under-simplify (retain complex phrasing)
  - Produce ungrammatical or awkward rephrasings
- **Language limitation:** Only suitable for English. Applying it to other languages is unsupported.
- **Safety limitation:** The model has not been aligned to avoid toxic, biased, or harmful content. If the input text contains such content, the output may reproduce or modify it without safeguards.

### Recommendations

- **Evaluation required:** Always evaluate the model in the target domain before deployment. Benchmark simplification quality (e.g., with SARI, FKGL, BERTScore, LENS, or human evaluation).
- **Human oversight:** Use human-in-the-loop review for applications where meaning preservation is critical (education, accessibility tools, etc.).
- **Attribution:** Preserve source attribution where required (Wikipedia → CC BY-SA).
- **Not for high-stakes use:** Avoid legal, medical, or safety-critical applications without extensive validation and domain adaptation.

## How to Get Started with the Model

Load the model and tokenizer directly from the Hugging Face Hub:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "eilamc14/prophetnet-large-uncased-cnndm-text-simplification"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Example input
PREFIX = "Simplify: "
text = "The committee deemed the proposal unnecessarily complicated."

# Tokenize and generate
inputs = tokenizer(PREFIX + text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

### Training Data

[WikiLarge-clean](https://huggingface.co/datasets/eilamc14/wikilarge-clean) dataset

### Training Procedure

- **Hardware:** NVIDIA L4 GPU on Google Colab
- **Objective:** Standard sequence-to-sequence cross-entropy loss
- **Training type:** Full fine-tuning of all parameters (no LoRA/PEFT used)
- **Batching:** Dynamic padding with Hugging Face `Trainer` / PyTorch DataLoader
- **Evaluation:** Monitored on the `validation` split with metrics (SARI and `identical_ratio`)
- **Stopping criteria:** Early-stopping callback based on validation performance

#### Preprocessing

The dataset was preprocessed by prefixing each source sentence with **"Simplify: "** and tokenizing both the source (inputs) and target (labels).

#### Memory & Checkpointing

To reduce VRAM usage during training, gradient checkpointing was enabled and the KV cache was disabled:

```python
model.config.use_cache = False          # required when using gradient checkpointing
model.gradient_checkpointing_enable()   # saves memory at the cost of extra compute
```

**Notes**

- Disabling `use_cache` avoids warnings/conflicts with gradient checkpointing and reduces memory usage in the forward pass.
- Gradient checkpointing trades **GPU memory ↓** for **training speed ↓** (extra recomputation).
- For **inference/evaluation**, re-enable the cache for faster generation:

```python
model.config.use_cache = True
```

#### Training Hyperparameters

The models were trained with Hugging Face `Seq2SeqTrainingArguments`. Hyperparameters varied slightly across models and runs during tuning, and full logs (batch size, steps, exact LR schedule) were not preserved. Below are the **typical defaults** used; a sketch of the corresponding configuration follows the list:

- **Epochs:** 5
- **Evaluation strategy:** every 300 steps
- **Save strategy:** every 300 steps (keep best model, `eval_loss` as criterion)
- **Learning rate:** ~3e-5
- **Batch size:** ~8–64, depending on model size
- **Optimizer:** `adamw_torch_fused`
- **Precision:** bf16
- **Generation config (during eval):** `max_length=128`, `num_beams=4`, `predict_with_generate=True`
- **Other settings:**
  - Weight decay: 0.01
  - Label smoothing: 0.1
  - Warmup ratio: 0.1
  - Max grad norm: 0.5
  - Dataloader workers: 8 (L4 GPU)
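A hedged reconstruction of a typical run is shown below. It is a sketch, not the exact setup: the output directory, the batch size of 16, and the early-stopping patience are illustrative assumptions.

```python
from transformers import Seq2SeqTrainingArguments, EarlyStoppingCallback

# Approximate reconstruction of a typical run; exact values varied between runs
# and were not fully logged.
training_args = Seq2SeqTrainingArguments(
    output_dir="prophetnet-wikilarge-clean",  # illustrative output path
    num_train_epochs=5,
    learning_rate=3e-5,
    per_device_train_batch_size=16,   # actual runs used ~8-64 depending on model size
    per_device_eval_batch_size=16,
    eval_strategy="steps",            # `evaluation_strategy` on older transformers releases
    eval_steps=300,
    save_strategy="steps",
    save_steps=300,
    load_best_model_at_end=True,      # keep the best checkpoint
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    optim="adamw_torch_fused",
    bf16=True,
    weight_decay=0.01,
    label_smoothing_factor=0.1,
    warmup_ratio=0.1,
    max_grad_norm=0.5,
    dataloader_num_workers=8,
    predict_with_generate=True,       # generate during evaluation
    generation_max_length=128,
    generation_num_beams=4,
)

# Early stopping was driven by a callback on validation performance, e.g.:
# Seq2SeqTrainer(..., args=training_args, callbacks=[EarlyStoppingCallback(early_stopping_patience=3)])
```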
> Because hyperparameters were adjusted between runs and not all of them were logged, exact reproduction may differ slightly.

## Evaluation

### Testing Data

- [**ASSET**](https://huggingface.co/datasets/facebook/asset) (test subset)
- [**MEDEASI**](https://huggingface.co/datasets/cbasu/Med-EASi) (test subset)
- [**OneStopEnglish**](https://github.com/nishkalavallabhi/OneStopEnglishCorpus) (advanced → elementary)

### Metrics

- **Identical ratio (ci)** — share of outputs identical to the source (case-insensitive), after both are normalized with basic, language-agnostic steps: strip, NFKC, collapse whitespace
- **SARI** — main simplification metric (higher is better)
- **FKGL** — readability grade level (lower is simpler)
- **BERTScore (F1)** — semantic similarity (higher is better)
- **LENS** — composite simplification quality score (higher is better)

### Generation Arguments

```python
gen_args = dict(
    max_new_tokens=64,
    num_beams=4,
    length_penalty=1.0,
    no_repeat_ngram_size=3,
    early_stopping=True,
    do_sample=False,
)
```

### Results

| Dataset            | Identical ratio (ci) |  SARI | FKGL | BERTScore |  LENS |
|--------------------|---------------------:|------:|-----:|----------:|------:|
| **ASSET**          |                 0.11 | 38.01 | 7.70 |     67.82 | 60.85 |
| **MEDEASI**        |                 0.04 | 36.86 | 8.50 |     38.31 | 51.36 |
| **OneStopEnglish** |                 0.13 | 39.17 | 7.00 |     65.22 | 61.53 |

## Environmental Impact

- **Hardware Type:** Single NVIDIA L4 GPU (Google Colab)
- **Hours used:** Approx. 5–10
- **Cloud Provider:** Google Cloud (via Colab)
- **Compute Region:** Unknown (Google Colab dynamic allocation)
- **Carbon Emitted:** Estimated to be very low (< a few kg CO₂eq), since training was limited to a single GPU for a small number of hours.

## Citation

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]