# Gender Prediction from Text ✍️ → 👩‍🦰👨

This model predicts the **gender of the author** of a given English or non-English text. It is built on [DeBERTa-v3-large](https://huggingface.co/microsoft/deberta-v3-large) and fine-tuned on a diverse, multilingual, multi-domain dataset covering both formal and informal texts.

📍 **Space link**: [🔗 Try it out on Hugging Face Spaces](https://huggingface.co/spaces/fc63/Gender_Prediction)  
📍 **Model repo**: [🔗 View on Hugging Face Hub](https://huggingface.co/fc63/gender_prediction_model_from_text)  
🧠 **Source code**: [GitHub](https://github.com/fc63/gender-classification)

---

## 📊 Model Summary

- **Base model**: `microsoft/deberta-v3-large`
- **Fine-tuned for**: binary gender classification (`female` vs. `male`)
- **Best F1 score**: `0.69` on a balanced multi-domain test set
- **Max token length**: 128
- **Evaluation metrics**:
  - F1: 0.69
  - Accuracy: 0.69
  - Precision: 0.69
  - Recall: 0.69

---

## 🧾 Datasets Used

| Dataset | Domain | Language |
|---------|--------|----------|
| [samzirbo/europarl.en-es.gendered](https://huggingface.co/datasets/samzirbo/europarl.en-es.gendered) | Formal speech (Parliament) | English |
| [czyzi0/luna-speech-dataset](https://huggingface.co/datasets/czyzi0/luna-speech-dataset) | Phone conversations | Polish → translated |
| [czyzi0/pwr-azon-speech-dataset](https://huggingface.co/datasets/czyzi0/pwr-azon-speech-dataset) | Phone conversations | Polish → translated |
| [sagteam/author_profiling](https://huggingface.co/datasets/sagteam/author_profiling) | Social posts | Russian → translated |
| [kaushalgawri/nptel-en-tags-and-gender-v0](https://huggingface.co/datasets/kaushalgawri/nptel-en-tags-and-gender-v0) | Spoken transcripts | English |
| [Blog Authorship Corpus](https://u.cs.biu.ac.il/~koppel/BlogCorpus.htm) | Blog posts | English |

All datasets were normalized, translated where necessary, deduplicated, and **balanced via random undersampling** so that both genders are equally represented; a minimal sketch of the balancing step is shown below.
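
The deduplication and balancing steps fit in a few lines of pandas. This is a minimal sketch rather than the project's exact code; it assumes the merged data lives in a DataFrame with `text` and `gender` columns (both names are illustrative):

```python
import pandas as pd

def balance(df: pd.DataFrame, label_col: str = "gender", seed: int = 42) -> pd.DataFrame:
    """Deduplicate, then randomly undersample every class to the smallest class size."""
    df = df.drop_duplicates(subset="text")   # drop exact duplicate texts
    n = df[label_col].value_counts().min()   # size of the smallest class
    return (
        df.groupby(label_col)
          .sample(n=n, random_state=seed)    # random undersampling per class
          .reset_index(drop=True)
    )
```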

---

## 🛠️ Preprocessing & Training

- **Normalization**: cleaned quotes, dashes, placeholders, noise, and HTML/code fragments from all datasets.
- **Translation**: used `Helsinki-NLP/opus-mt-*` models for the Polish and Russian data (see the first sketch after this list).
- **Training strategy** (see the `Trainer` sketch after this list):
  - An LR finder was used to pick the learning rate (`2.66e-6`)
  - Fine-tuned with early stopping on both F1 and loss
  - Step-based evaluation every 250 steps
  - Best checkpoint (step 24,750) saved and evaluated
- **Second-phase fine-tuning**:
  - Performed on the full merged dataset for 2 epochs
  - Used a cosine learning-rate scheduler with warm-up steps
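
For the translation step, a Polish-to-English pass with an OPUS-MT model might look like the sketch below; the exact model variant and batching are assumptions, not the project's verbatim pipeline:

```python
from transformers import MarianMTModel, MarianTokenizer

# Assumed checkpoint for Polish → English; the Russian data would use an
# analogous model such as Helsinki-NLP/opus-mt-ru-en.
mt_name = "Helsinki-NLP/opus-mt-pl-en"
mt_tokenizer = MarianTokenizer.from_pretrained(mt_name)
mt_model = MarianMTModel.from_pretrained(mt_name)

def translate(texts: list[str]) -> list[str]:
    batch = mt_tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = mt_model.generate(**batch)
    return mt_tokenizer.batch_decode(generated, skip_special_tokens=True)

print(translate(["Dzień dobry, jak się masz?"]))  # e.g. "Good morning, how are you?"
```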
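The training strategy maps naturally onto the Hugging Face `Trainer` API. Below is a hedged sketch of what the second-phase setup could look like; the dataset variables, warm-up count, and early-stopping patience are illustrative assumptions, not the exact recipe:

```python
import numpy as np
from sklearn.metrics import f1_score
from transformers import (
    AutoModelForSequenceClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

train_dataset = ...  # assumed: tokenized HF Dataset with input_ids/attention_mask/labels
eval_dataset = ...   # assumed: matching validation split

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-large", num_labels=2
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds)}  # binary F1, as reported on the card

args = TrainingArguments(
    output_dir="gender-clf",
    learning_rate=2.66e-6,        # value reported from the LR finder
    num_train_epochs=2,           # second-phase length
    lr_scheduler_type="cosine",   # cosine schedule with warm-up
    warmup_steps=500,             # assumed warm-up count
    eval_strategy="steps",
    eval_steps=250,               # step-based evaluation every 250 steps
    save_strategy="steps",
    save_steps=250,
    load_best_model_at_end=True,
    metric_for_best_model="f1",   # the card tracks both F1 and loss; F1 shown here
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # assumed patience
)
trainer.train()
```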

---

## 📈 Performance (on the full merged test set)

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| Female | 0.70 | 0.65 | 0.68 | 591,027 |
| Male | 0.68 | 0.72 | 0.70 | 591,027 |
| **Macro avg** | 0.69 | 0.69 | 0.69 | 1,182,054 |

Overall accuracy: **0.69** (1,182,054 samples).

---

## 📦 Usage Example

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

model_name = "fc63/gender_prediction_model_from_text"
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is available

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval().to(device)

def predict(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = F.softmax(outputs.logits, dim=1)
    pred = torch.argmax(probs, dim=1).item()
    confidence = round(probs[0][pred].item() * 100, 1)
    gender = "Female" if pred == 0 else "Male"  # label 0 = female, 1 = male
    return f"{gender} (Confidence: {confidence}%)"
```

```python
sample_text = "I love writing in my journal every night. It helps me reflect on the day and plan for tomorrow."
print(predict(sample_text))
```

Output for this sample:

```
Female (Confidence: 84.1%)
```
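
If you only need quick labels, the high-level `pipeline` API is an alternative to the helper above; this sketch reuses the slow tokenizer from the example, and the displayed label names depend on the model's config (`LABEL_0`/`LABEL_1` by default, with 0 = female and 1 = male as above):

```python
from transformers import AutoTokenizer, pipeline

model_name = "fc63/gender_prediction_model_from_text"
clf = pipeline(
    "text-classification",
    model=model_name,
    tokenizer=AutoTokenizer.from_pretrained(model_name, use_fast=False),
)
print(clf("I love writing in my journal every night."))
```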

---

## 🛠️ Model Card Metadata

```yaml
datasets:
- samzirbo/europarl.en-es.gendered
- czyzi0/luna-speech-dataset
- czyzi0/pwr-azon-speech-dataset
- sagteam/author_profiling
- kaushalgawri/nptel-en-tags-and-gender-v0
metrics:
- f1
- accuracy
- precision
- recall
base_model:
- microsoft/deberta-v3-large
pipeline_tag: text-classification
```

---

## 👨‍🔬 Author & License

**Author**: Furkan Çoban  
**Project**: CENG-481 Gender Prediction Model  
**License**: MIT