Baseline_BERTimbau-large-Topic_Classification-Council-PT: Multi-Label Topic Classification for Portuguese Council Discussion Subjects
Model Description
BERTimbau Large Topic Classification Council PT is a baseline model: a fine-tuned version of neuralmind/bert-large-portuguese-cased (BERTimbau Large) for multi-label topic classification. It automatically identifies and categorizes discussion subjects from Portuguese municipal council meeting minutes, and uses dynamic per-label thresholds to detect multiple simultaneous topics within a single discussion subject.
Key Features
- 🎯 Specialized for Municipal Topics: Fine-tuned on Portuguese council meeting minutes discussion subjects
- 🧠 Large Transformer Model: 334M parameters (24 layers, 1024 hidden dim, 16 attention heads)
- 📊 Multi-Label Classification: Identifies multiple co-occurring topics per subject
- ⚡ Dynamic Thresholds: Optimized per-label classification thresholds (not fixed 0.5)
- 🇵🇹 Portuguese-Native: Built on BERTimbau, pre-trained on a Brazilian Portuguese corpus
- 🔄 End-to-End Learning: Direct fine-tuning on task-specific data
Model Details
- Architecture: BERT Large (Transformer Encoder)
- Base Model: neuralmind/bert-large-portuguese-cased (BERTimbau Large)
- Parameters: ~334M
- 24 transformer layers
- 1024 hidden dimensions
- 16 attention heads
- 4096 intermediate size
- Vocabulary Size: 29,794 WordPiece tokens
- Max Sequence Length: 256 tokens
- Classification Head: Linear layer (1024 → 22 labels)
- Loss Function: BCEWithLogitsLoss (multi-label)
- Optimization: Dynamic per-label thresholds (F1-maximization)
- Framework: PyTorch + Hugging Face Transformers
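The classification head and BCEWithLogitsLoss follow directly from the multi-label setup in Hugging Face Transformers. A minimal sketch of how this configuration can be instantiated; only the base checkpoint and num_labels=22 come from the details above, the rest is illustrative:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load BERTimbau Large with a fresh 1024 -> 22 classification head.
# problem_type="multi_label_classification" makes Transformers apply
# BCEWithLogitsLoss during fine-tuning, matching the loss listed above.
base = "neuralmind/bert-large-portuguese-cased"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(
    base,
    num_labels=22,
    problem_type="multi_label_classification",
)
```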
How It Works
The model processes Portuguese municipal texts through a fine-tuned transformer architecture:
Text Preprocessing
- Basic normalization (lowercasing, whitespace cleanup)
- Minimum text length filtering (>10 chars)
- Punctuation removal for noise reduction
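A minimal sketch of this preprocessing, assuming simple lowercasing, punctuation stripping, and whitespace cleanup (the exact cleaning rules used during training are not published):

```python
import re
import string
from typing import Optional

def preprocess(text: str) -> Optional[str]:
    """Lowercase, strip punctuation, collapse whitespace; drop very short texts."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()
    return text if len(text) > 10 else None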
Tokenization
- WordPiece tokenization (BERTimbau vocabulary)
- Max length truncation at 256 tokens
- Padding to max length for batching
Transformer Encoding
- 24-layer BERT encoder processes token sequences
- Self-attention captures contextual dependencies
- [CLS] token representation used for classification
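For illustration only, the [CLS] representation mentioned above can be inspected with the base encoder (a sketch; the released model applies its classification head to this token internally):

```python
import torch
from transformers import AutoTokenizer, AutoModel

base = "neuralmind/bert-large-portuguese-cased"
tokenizer = AutoTokenizer.from_pretrained(base)
encoder = AutoModel.from_pretrained(base)

inputs = tokenizer("Reunião da Câmara Municipal", return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # shape: (1, seq_len, 1024)
cls_vector = hidden[:, 0]  # [CLS] hidden state fed to the classification head
print(cls_vector.shape)    # torch.Size([1, 1024])
```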
Multi-Label Prediction
- Linear classification head outputs 22 logits
- Sigmoid activation for independent probabilities
- Dynamic per-label thresholds for final predictions
Threshold Optimization
- Each label has its own optimal threshold (searched in the 0.10-0.90 range)
- Optimized via F1-score grid search on validation set
- Handles class imbalance better than fixed 0.5 threshold
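A sketch of the per-label grid search described above; only the 0.10-0.90 range and the F1 criterion come from this card, while the 0.05 step size and the helper name tune_thresholds are assumptions:

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_thresholds(val_probs: np.ndarray, val_labels: np.ndarray) -> np.ndarray:
    """Pick, for each label, the threshold in [0.10, 0.90] that maximizes F1
    on the validation set. val_probs and val_labels have shape (n_samples, n_labels)."""
    grid = np.arange(0.10, 0.91, 0.05)
    thresholds = np.zeros(val_probs.shape[1])
    for j in range(val_probs.shape[1]):
        scores = [
            f1_score(val_labels[:, j], (val_probs[:, j] >= t).astype(int), zero_division=0)
            for t in grid
        ]
        thresholds[j] = grid[int(np.argmax(scores))]
    return thresholds
```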
Usage
```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from joblib import load

# Load model and tokenizer
model_path = "path/to/model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Load optimal thresholds and label encoder
thresholds = np.load("optimal_thresholds.npy")
mlb = load("mlb_encoder.joblib")

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# Classify text
text = "A Câmara Municipal aprovou o orçamento de 2024 para obras públicas e educação."

# Tokenize
inputs = tokenizer(
    text,
    return_tensors="pt",
    max_length=256,
    truncation=True,
    padding="max_length"
).to(device)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.sigmoid(outputs.logits).cpu().numpy()[0]

# Apply optimized thresholds
predictions = (probs >= thresholds).astype(int)
predicted_labels = mlb.inverse_transform(predictions.reshape(1, -1))[0]

# Display results
print(f"Text: {text}")
print("\nPredicted Topics:")
for label in predicted_labels:
    idx = list(mlb.classes_).index(label)
    prob = probs[idx]
    thresh = thresholds[idx]
    print(f" - {label}: {prob:.4f} (threshold: {thresh:.2f})")
```
Categories
The model classifies topics into 22 Portuguese administrative categories:
| Category | Portuguese Name |
|---|---|
| General Administration | Administração Geral, Finanças e Recursos Humanos |
| Environment | Ambiente |
| Economic Activities | Atividades Económicas |
| Social Action | Ação Social |
| Science | Ciência |
| Communication | Comunicação e Relações Públicas |
| External Cooperation | Cooperação Externa e Relações Internacionais |
| Culture | Cultura |
| Sports | Desporto |
| Education | Educação e Formação Profissional |
| Energy & Telecommunications | Energia e Telecomunicações |
| Housing | Habitação |
| Private Construction | Obras Particulares |
| Public Works | Obras Públicas |
| Territorial Planning | Ordenamento do Território |
| Other | Outros |
| Heritage | Património |
| Municipal Police | Polícia Municipal |
| Animal Protection | Proteção Animal |
| Civil Protection | Proteção Civil |
| Health | Saúde |
| Traffic & Transport | Trânsito, Transportes e Comunicações |
Evaluation Results
Comprehensive Performance Metrics
| Metric | Score | Description |
|---|---|---|
| F1-macro | 0.4819 | Macro-averaged F1 score |
| F1-micro | 0.7300 | Micro-averaged F1 score |
| F1-weighted | 0.727 | Weighted-averaged F1 score |
| Accuracy | 0.4234 | Subset accuracy (exact match) |
| Hamming Loss | 0.0435 | Label-wise error rate |
| Average Precision (macro) | 0.538 | Macro-averaged AP |
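These correspond to standard scikit-learn multi-label metrics. A sketch of how they can be recomputed from model probabilities, the per-label thresholds, and a binarized label matrix (function and variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, hamming_loss)

def report_metrics(y_true: np.ndarray, probs: np.ndarray, thresholds: np.ndarray) -> dict:
    """y_true: (n_samples, 22) binary label matrix; probs: sigmoid outputs;
    thresholds: per-label decision thresholds."""
    y_pred = (probs >= thresholds).astype(int)
    return {
        "f1_macro": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "f1_micro": f1_score(y_true, y_pred, average="micro", zero_division=0),
        "f1_weighted": f1_score(y_true, y_pred, average="weighted", zero_division=0),
        "subset_accuracy": accuracy_score(y_true, y_pred),   # exact-match accuracy
        "hamming_loss": hamming_loss(y_true, y_pred),        # label-wise error rate
        "ap_macro": average_precision_score(y_true, probs, average="macro"),
    }
```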
License
This model is released under the cc-by-nc-nd-4.0 license.