Commit 6101dd6 · 1 parent: d9aa21f
Create README.md

README.md ADDED
@@ -0,0 +1,67 @@

---
language:
- ba
license: apache-2.0
tags:
- grammatical error correction
---

# Canine-c Bashkir Spelling Correction v1

This model is a version of [google/canine-c](https://huggingface.co/google/canine-c) fine-tuned to fix corrupted texts.
It was trained on a mixture of two parallel datasets in the Bashkir language:
- sentences post-edited by humans after OCR
- sentences with random artificial corruptions, paired with their original versions

For each character, the model predicts whether to keep, replace, or delete it, and whether to insert another character after it.

In this way, the model can be used to fix spelling or OCR errors.
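
The per-character scheme can be illustrated without the model. The sketch below is a toy, not the model's own code: it reuses the label names `KEEP`, `DELETE` and `PASS` that appear in this README's usage code, while the function name `apply_edits`, the English example string and the hand-written labels are hypothetical and chosen only for illustration.

```python
def apply_edits(chars, this_labels, next_labels):
    """Apply one 'this' label and one 'next' label to every character."""
    result = []
    for ch, this_label, next_label in zip(chars, this_labels, next_labels):
        if this_label == "KEEP":
            result.append(ch)          # leave the character unchanged
        elif this_label != "DELETE":
            result.append(this_label)  # any other label is the replacement character
        # "DELETE" appends nothing, so the character is dropped
        if next_label != "PASS":
            result.append(next_label)  # insert a character right after this one
    return "".join(result)


# Hand-written labels for illustration; a real run would take them from the model.
source = "helo wpr ld"
this_labels = ["KEEP"] * len(source)
next_labels = ["PASS"] * len(source)
this_labels[6] = "o"       # replace "p" with "o"
this_labels[8] = "DELETE"  # drop the stray space
next_labels[2] = "l"       # insert "l" after the first "l"

print(apply_edits(source, this_labels, next_labels))  # hello world
```

Because every character carries both labels, a single pass can replace, delete and insert characters without changing the alignment between input positions and predictions.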

On a held-out set, it reduces the number of required edits by 40%.
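
The README does not say how "required edits" were counted. One plausible reading, assumed here and not confirmed by the source, is the character-level Levenshtein distance between a text and its reference, compared before and after correction. The helper names `levenshtein` and `edit_reduction` and the English example strings are hypothetical.

```python
def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_reduction(corrupted, fixed, reference):
    """Share of the edits needed to reach the reference that the fix removed."""
    before = levenshtein(corrupted, reference)
    after = levenshtein(fixed, reference)
    return 1 - after / before if before else 0.0


# Assumed toy example: two edits were needed before the fix, one remains after it.
print(edit_reduction("helo wprld", "hello wprld", "hello world"))  # 0.5
```

Under this assumption, a 40% reduction would mean that the corrected texts are, in total, 40% closer to their references than the corrupted inputs were.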

## How to use

You can use the model by feeding sentences to the following code:

```python
import torch
from transformers import CanineTokenizer, CanineForTokenClassification

tokenizer = CanineTokenizer.from_pretrained('slone/canine-c-bashkir-gec-v1')
model = CanineForTokenClassification.from_pretrained('slone/canine-c-bashkir-gec-v1')
if torch.cuda.is_available():
    model.cuda()

# Each position gets two labels: THIS_* (what to do with the character itself)
# and NEXT_* (what to insert after it).
LABELS_THIS = [c[5:] for c in model.config.id2label.values() if c.startswith('THIS_')]
LABELS_NEXT = [c[5:] for c in model.config.id2label.values() if c.startswith('NEXT_')]

def fix_text(text, boost=0):
    """Apply the model to edit the text. `boost` is a parameter to control edit aggressiveness."""
    bx = tokenizer(text, return_tensors='pt', padding=True)
    with torch.inference_mode():
        out = model(**bx.to(model.device))
        n1, n2 = len(LABELS_THIS), len(LABELS_NEXT)
        # The first n1 logits score the THIS_* labels, the remaining n2 the NEXT_* labels.
        logits1 = out.logits[0, :, :n1].view(-1, n1)
        logits2 = out.logits[0, :, n1:].view(-1, n2)
        if boost:
            # Lower the score of label 0 in each group; position 0 of logits1 is left as is.
            logits1[1:, 0] -= boost
            logits2[:, 0] -= boost
        ids1, ids2 = logits1.argmax(-1).tolist(), logits2.argmax(-1).tolist()
    result = []
    # The leading ' ' stands in for the special token the tokenizer prepends to the text.
    for c, id1, id2 in zip(' ' + text, ids1, ids2):
        l1, l2 = LABELS_THIS[id1], LABELS_NEXT[id2]
        if l1 == 'KEEP':
            result.append(c)
        elif l1 != 'DELETE':
            result.append(l1)
        if l2 != 'PASS':
            result.append(l2)
    return ''.join(result)

text = 'У йыл дан д ың йөҙө һoрөмлэнде.'
print(fix_text(text))  # Уйылдандың йөҙө һөрөмләнде.
```

The parameter `boost` can be used to control the aggressiveness of editing:
positive values increase the probability of changing the text, negative values decrease it.
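
The effect of `boost` can be seen on a single position with made-up numbers. This is a minimal sketch, assuming that index 0 of each label group is the "no edit" label (which is what the usage code suggests by subtracting `boost` from column 0). The helper `pick_label`, the label list and the logit values are hypothetical.

```python
def pick_label(logits, boost=0.0):
    """Return the index of the best label after lowering logit 0 by `boost`."""
    adjusted = list(logits)
    adjusted[0] -= boost  # index 0 is assumed to be the "no edit" label
    return max(range(len(adjusted)), key=adjusted.__getitem__)


labels = ["KEEP", "DELETE", "ә"]  # assumed order, for illustration only

confident_keep = [2.0, 0.5, 1.5]
print(labels[pick_label(confident_keep)])             # KEEP
print(labels[pick_label(confident_keep, boost=1.0)])  # ә

slight_edit = [1.0, 0.2, 1.3]
print(labels[pick_label(slight_edit)])                # ә
print(labels[pick_label(slight_edit, boost=-1.0)])    # KEEP
```

With the `fix_text` function from the usage code, this corresponds to calls such as `fix_text(text, boost=1.0)` for more aggressive editing or `fix_text(text, boost=-1.0)` for more conservative editing; the specific values are illustrative, not recommendations from the source.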