slone/canine-c-bashkir-gec-v1

Token Classification

grammatical error correction

cointegrated committed on Dec 21, 2022

Commit 6101dd6 · 1 Parent(s): d9aa21f

Create README.md

Files changed (1)

README.md +67 -0

README.md ADDED

	@@ -0,0 +1,67 @@

+---
+language:
+- ba
+license: apache-2.0
+tags:
+- grammatical error correction
+---
+# Canine-c Bashkir Spelling Correction v1
+This model is a version of [google/canine-c](https://huggingface.co/google/canine-c) fine-tuned to fix corrupted texts.
+It was trained on a mixture of two parallel datasets in the Bashkir language:
+- sentences post-edited by humans after OCR
+- artificially randomly corrupted sentences along with their original versions
+For each character, the model predicts whether to replace it and whether to insert another character next to it.
+In this way, the model can be used to fix spelling or OCR errors.
+On a held-out set, it reduces the number of required edits by 40%.
+## How to use
+To correct a sentence, pass it to the `fix_text` function defined in the following code:
+```Python
+import torch
+from transformers import CanineTokenizer, CanineForTokenClassification
+tokenizer = CanineTokenizer.from_pretrained('slone/canine-c-bashkir-gec-v1')
+model = CanineForTokenClassification.from_pretrained('slone/canine-c-bashkir-gec-v1')
+if torch.cuda.is_available():
+    model.cuda()
+LABELS_THIS = [c[5:] for c in model.config.id2label.values() if c.startswith('THIS_')]
+LABELS_NEXT = [c[5:] for c in model.config.id2label.values() if c.startswith('NEXT_')]
+def fix_text(text, boost=0):
+    """Apply the model to edit the text. `boost` is a parameter to control edit aggressiveness."""
+    bx = tokenizer(text, return_tensors='pt', padding=True)
+    with torch.inference_mode():
+        out = model(**bx.to(model.device))
+        n1, n2 = len(LABELS_THIS), len(LABELS_NEXT)
+        # The first n1 logits score the THIS_ labels, the remaining n2 score the NEXT_ labels.
+        logits1 = out.logits[0, :, :n1].view(-1, n1)
+        logits2 = out.logits[0, :, n1:].view(-1, n2)
+        if boost:
+            # Lower the logit at index 0 in each group so that edits become more likely.
+            logits1[1:, 0] -= boost
+            logits2[:, 0] -= boost
+        ids1, ids2 = logits1.argmax(-1).tolist(), logits2.argmax(-1).tolist()
+    result = []
+    # The leading ' ' stands in for the special token that the tokenizer prepends.
+    for c, id1, id2 in zip(' ' + text, ids1, ids2):
+        l1, l2 = LABELS_THIS[id1], LABELS_NEXT[id2]
+        if l1 == 'KEEP':
+            result.append(c)
+        elif l1 != 'DELETE':
+            # Any other label is the replacement character itself.
+            result.append(l1)
+        if l2 != 'PASS':
+            # Insert the predicted character after the current one.
+            result.append(l2)
+    return ''.join(result)
+text = 'У йыл дан д ың йөҙө һoрөмлэнде.'
+print(fix_text(text))  # Уйылдандың йөҙө һөрөмләнде.
+```
+The parameter `boost` can be used to control the aggressiveness of editing:
+positive values increase the probability of changing the text, negative values decrease it.
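
The per-character tagging scheme and the effect of `boost` can be sketched without loading the model. The snippet below is only an illustration: the label inventories, the toy logits, and the helper names `argmax` and `decode` are hypothetical, and it assumes (as the `boost` handling in `fix_text` suggests) that index 0 in each label group is the no-change label. Unlike `fix_text`, it has no special-token position at the start and applies `boost` uniformly to every character.

```python
# Hypothetical label inventories; the real ones come from model.config.id2label.
LABELS_THIS = ['KEEP', 'DELETE', 'a']   # what to do with the current character
LABELS_NEXT = ['PASS', 's']             # what, if anything, to insert after it


def argmax(row):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def decode(chars, logits_this, logits_next, boost=0.0):
    """Toy decoder: apply per-character THIS/NEXT decisions to `chars`."""
    result = []
    for c, row1, row2 in zip(chars, logits_this, logits_next):
        # Lowering the no-change logit (assumed to be index 0) makes edits more likely.
        row1 = [row1[0] - boost] + list(row1[1:])
        row2 = [row2[0] - boost] + list(row2[1:])
        l1, l2 = LABELS_THIS[argmax(row1)], LABELS_NEXT[argmax(row2)]
        if l1 == 'KEEP':
            result.append(c)
        elif l1 != 'DELETE':
            result.append(l1)   # replace the character with the predicted one
        if l2 != 'PASS':
            result.append(l2)   # insert a character after the current one
    return ''.join(result)


# Made-up logits for the three characters of "cot".
logits_this = [[2.0, 0.0, 0.0], [1.0, -1.0, 0.5], [3.0, 0.0, 0.0]]
logits_next = [[2.0, 0.0], [2.0, 0.0], [2.0, 0.0]]

print(decode('cot', logits_this, logits_next))             # cot
print(decode('cot', logits_this, logits_next, boost=1.0))  # cat
```

With `boost=0` the no-change label wins everywhere and the text is returned as is; with `boost=1.0` the replacement label overtakes `KEEP` for the middle character only, which is the kind of trade-off the parameter controls.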