Baseline_BERTimbau-large-Topic_Classification-Council-PT: Multi-Label Topic Classification for Portuguese Council Discussion Subjects

Model Description

BERTimbau Large Topic Classification Council PT is a baseline model: a fine-tuned version of neuralmind/bert-large-portuguese-cased (BERTimbau Large) for multi-label topic classification. It automatically identifies and categorizes discussion subjects from Portuguese municipal council meeting minutes, and uses dynamic per-label thresholds to detect multiple simultaneous topics within a single discussion subject.

Key Features

  • 🎯 Specialized for Municipal Topics: Fine-tuned on Portuguese council meeting minutes discussion subjects
  • 🧠 Large Transformer Model: 334M parameters (24 layers, 1024 hidden dim, 16 attention heads)
  • 📊 Multi-Label Classification: Identifies multiple co-occurring topics per subject
  • ⚡ Dynamic Thresholds: Per-label classification thresholds optimized on validation data (rather than a fixed 0.5)
  • 🇵🇹 Portuguese-Native: Built on BERTimbau, pre-trained on Brazilian Portuguese corpus
  • 🔄 End-to-End Learning: Direct fine-tuning on task-specific data

Model Details

  • Architecture: BERT Large (Transformer Encoder)
  • Base Model: neuralmind/bert-large-portuguese-cased (BERTimbau Large)
  • Parameters: ~334M
    • 24 transformer layers
    • 1024 hidden dimensions
    • 16 attention heads
    • 4096 intermediate size
  • Vocabulary Size: 29,794 WordPiece tokens
  • Max Sequence Length: 256 tokens
  • Classification Head: Linear layer (1024 → 22 labels)
  • Loss Function: BCEWithLogitsLoss (multi-label)
  • Optimization: Dynamic per-label thresholds (F1-maximization)
  • Framework: PyTorch + Hugging Face Transformers
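
The classification head and loss follow the standard Hugging Face multi-label setup. As a minimal sketch (assuming the base checkpoint and the 22-label configuration listed above), the fine-tuning model can be instantiated so that Transformers applies BCEWithLogitsLoss automatically:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Sketch: BERTimbau Large with a 22-label multi-label head.
# With problem_type="multi_label_classification", Transformers uses
# BCEWithLogitsLoss whenever labels are passed during fine-tuning.
base = "neuralmind/bert-large-portuguese-cased"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(
    base,
    num_labels=22,
    problem_type="multi_label_classification",
)

# Multi-hot float label vectors are expected by BCEWithLogitsLoss.
enc = tokenizer("Exemplo de assunto municipal.", return_tensors="pt",
                truncation=True, max_length=256)
labels = torch.zeros((1, 22))
labels[0, [0, 13]] = 1.0  # hypothetical label indices, for illustration only
loss = model(**enc, labels=labels).loss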

How It Works

The model processes Portuguese municipal texts through a fine-tuned transformer architecture:

  1. Text Preprocessing

    • Basic normalization (lowercasing, whitespace cleanup)
    • Minimum text length filtering (>10 chars)
    • Punctuation removal for noise reduction
  2. Tokenization

    • WordPiece tokenization (BERTimbau vocabulary)
    • Max length truncation at 256 tokens
    • Padding to max length for batching
  3. Transformer Encoding

    • 24-layer BERT encoder processes token sequences
    • Self-attention captures contextual dependencies
    • [CLS] token representation used for classification
  4. Multi-Label Prediction

    • Linear classification head outputs 22 logits
    • Sigmoid activation for independent probabilities
    • Dynamic per-label thresholds for final predictions
  5. Threshold Optimization

    • Each label has optimal threshold (0.10-0.90 range)
    • Optimized via F1-score grid search on validation set
    • Handles class imbalance better than a fixed 0.5 threshold (see the sketch below)
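
The exact search procedure is not included in this card, but a minimal sketch of per-label F1 threshold tuning (assuming validation probabilities val_probs and multi-hot targets val_labels as NumPy arrays) could look like this:

import numpy as np
from sklearn.metrics import f1_score

# Sketch: grid-search one decision threshold per label on validation data.
# val_probs: (n_samples, n_labels) sigmoid probabilities
# val_labels: (n_samples, n_labels) multi-hot ground truth
def optimize_thresholds(val_probs, val_labels, grid=np.arange(0.10, 0.91, 0.01)):
    n_labels = val_probs.shape[1]
    thresholds = np.full(n_labels, 0.5)
    for j in range(n_labels):
        scores = [f1_score(val_labels[:, j], (val_probs[:, j] >= t).astype(int),
                           zero_division=0) for t in grid]
        thresholds[j] = grid[int(np.argmax(scores))]
    return thresholds

# thresholds = optimize_thresholds(val_probs, val_labels)
# np.save("optimal_thresholds.npy", thresholds)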

Usage

import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from joblib import load

# Load model and tokenizer
model_path = "path/to/model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Load optimal thresholds and label encoder
thresholds = np.load("optimal_thresholds.npy")
mlb = load("mlb_encoder.joblib")

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# Classify text
text = "A Câmara Municipal aprovou o orçamento de 2024 para obras públicas e educação."

# Tokenize
inputs = tokenizer(
    text,
    return_tensors="pt",
    max_length=256,
    truncation=True,
    padding="max_length"
).to(device)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.sigmoid(outputs.logits).cpu().numpy()[0]

# Apply optimized thresholds
predictions = (probs >= thresholds).astype(int)
predicted_labels = mlb.inverse_transform(predictions.reshape(1, -1))[0]

# Display results
print(f"Text: {text}")
print(f"\nPredicted Topics:")
for label in predicted_labels:
    idx = list(mlb.classes_).index(label)
    prob = probs[idx]
    thresh = thresholds[idx]
    print(f"  - {label}: {prob:.4f} (threshold: {thresh:.2f})")

Categories

The model classifies topics into 22 Portuguese administrative categories:

| Category | Portuguese Name |
|---|---|
| General Administration | Administração Geral, Finanças e Recursos Humanos |
| Environment | Ambiente |
| Economic Activities | Atividades Económicas |
| Social Action | Ação Social |
| Science | Ciência |
| Communication | Comunicação e Relações Públicas |
| External Cooperation | Cooperação Externa e Relações Internacionais |
| Culture | Cultura |
| Sports | Desporto |
| Education | Educação e Formação Profissional |
| Energy & Telecommunications | Energia e Telecomunicações |
| Housing | Habitação |
| Private Construction | Obras Particulares |
| Public Works | Obras Públicas |
| Territorial Planning | Ordenamento do Território |
| Other | Outros |
| Heritage | Património |
| Municipal Police | Polícia Municipal |
| Animal Protection | Proteção Animal |
| Civil Protection | Proteção Civil |
| Health | Saúde |
| Traffic & Transport | Trânsito, Transportes e Comunicações |

Evaluation Results

Comprehensive Performance Metrics

| Metric | Score | Description |
|---|---|---|
| F1-macro | 0.4819 | Macro-averaged F1 score |
| F1-micro | 0.7300 | Micro-averaged F1 score |
| F1-weighted | 0.727 | Weighted-averaged F1 score |
| Accuracy | 0.4234 | Subset accuracy (exact match) |
| Hamming Loss | 0.0435 | Label-wise error rate |
| Average Precision (macro) | 0.538 | Macro-averaged AP |
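
These correspond to standard scikit-learn multi-label metrics. A sketch of how they could be reproduced (assuming test-set probabilities probs, thresholded binary predictions preds, and multi-hot ground truth y_true) is:

from sklearn.metrics import (f1_score, accuracy_score, hamming_loss,
                             average_precision_score)

# Sketch: recompute the reported multi-label metrics with scikit-learn.
results = {
    "F1-macro": f1_score(y_true, preds, average="macro", zero_division=0),
    "F1-micro": f1_score(y_true, preds, average="micro", zero_division=0),
    "F1-weighted": f1_score(y_true, preds, average="weighted", zero_division=0),
    "Subset accuracy": accuracy_score(y_true, preds),
    "Hamming loss": hamming_loss(y_true, preds),
    "Average precision (macro)": average_precision_score(y_true, probs, average="macro"),
}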

License

This model is released under the cc-by-nc-nd-4.0 license.
