Model Card for boltuix/bert-tinyplus

The boltuix/bert-tinyplus model is an ultra-compact BERT variant for lightweight natural language processing, with slightly more capacity than smaller models like boltuix/bert-mini. Pretrained on English text with masked language modeling (MLM) and next sentence prediction (NSP) objectives, it is intended for fine-tuning on tasks such as sequence classification and token classification. At ~20 MB, it suits resource-constrained environments that need a modest accuracy gain over the smallest models.

Model Details

Model Description

The boltuix/bert-tinyplus model is a PyTorch transformer model derived from TensorFlow checkpoints in the Google BERT repository. It builds on research from Well-Read Students Learn Better: On the Importance of Pre-training Compact Models (arXiv) and Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics (arXiv). Ported to Hugging Face, this uncased model (~20 MB) targets lightweight NLP applications such as sentiment analysis, named entity recognition, and basic natural language inference. It suits developers and researchers who deploy to highly resource-constrained environments but want more capacity than the smallest models offer.

Developed by: BoltUIX
Funded by: BoltUIX Research Fund
Shared by: Hugging Face
Model type: Transformer (BERT)
Language(s) (NLP): English (en)
License: MIT
Finetuned from model: google-bert/bert-base-uncased

Model Sources

Repository: Hugging Face Model Hub
Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Demo: Hugging Face Spaces Demo

Model Variants

BoltUIX offers a range of BERT-based models tailored to different performance and resource requirements. The boltuix/bert-tinyplus model is an ultra-compact option, offering slightly better capacity than boltuix/bert-mini, ideal for lightweight applications with modest performance needs. Below is a summary of available models:

Tier	Model ID	Size	Notes
Micro	boltuix/bert-micro	~15 MB	Smallest, blazing-fast, moderate accuracy
Mini	boltuix/bert-mini	~17 MB	Ultra-compact, fast, slightly better accuracy
Tinyplus	boltuix/bert-tinyplus	~20 MB	Slightly bigger, better capacity
Small	boltuix/bert-small	~45 MB	Good compact/accuracy balance
Mid	boltuix/bert-mid	~50 MB	Well-rounded mid-tier performance
Medium	boltuix/bert-medium	~160 MB	Strong general-purpose model
Large	boltuix/bert-large	~365 MB	Top performer below full-BERT
Pro	boltuix/bert-pro	~420 MB	Use only if max accuracy is mandatory
Mobile	boltuix/bert-mobile	~140 MB	Mobile-optimized; quantize to ~25 MB with no major loss

For more details on each variant, visit the BoltUIX Model Hub.

Uses

Direct Use

The model can be used directly for masked language modeling (predicting missing words in a sentence) or next sentence prediction (judging whether one sentence follows another), with modest accuracy on both.

Downstream Use

The model is designed for fine-tuning on lightweight downstream NLP tasks, including:

Sequence classification (e.g., basic sentiment analysis, intent detection)
Token classification (e.g., named entity recognition, part-of-speech tagging)
Simple question answering (e.g., extractive QA)

It is recommended for developers and researchers working on resource-constrained devices, such as mobile or edge applications, where slightly better capacity than minimal models is desired.
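As a starting point for the sequence classification case, the sketch below loads the checkpoint with a classification head through the Transformers Auto classes. The helper names load_sequence_classifier and classify, the two-label default, and the 128-token truncation length are illustrative assumptions, not part of the published model. The classification head is randomly initialized, so predictions are meaningless until the model has been fine-tuned on labeled data.

```python
def load_sequence_classifier(model_id="boltuix/bert-tinyplus", num_labels=2):
    """Load the tokenizer and the encoder with a fresh classification head.

    Hypothetical helper: the head weights are newly initialized and must be
    fine-tuned before the outputs carry any meaning.
    """
    # Imported lazily so the module can be imported without the library installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=num_labels
    )
    return tokenizer, model


def classify(texts, tokenizer, model, max_length=128):
    """Return the predicted label index for each input text."""
    import torch

    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,  # assumed limit, matching the main pretraining length
        return_tensors="pt",
    )
    model.eval()
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.argmax(dim=-1).tolist()


if __name__ == "__main__":
    # Downloads the checkpoint from the Hub on first use.
    tokenizer, model = load_sequence_classifier()
    print(classify(["I loved this film.", "This was a waste of time."], tokenizer, model))
```

Token classification and extractive QA follow the same pattern with AutoModelForTokenClassification and AutoModelForQuestionAnswering.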

Out-of-Scope Use

The model is not suitable for:

Text generation tasks (use generative models like GPT-3 instead).
Non-English language tasks without significant fine-tuning.
High-performance applications requiring robust accuracy (use boltuix/bert-mid, boltuix/bert-large, or boltuix/bert-pro instead).

Bias, Risks, and Limitations

The model may inherit biases from its training data (BookCorpus and English Wikipedia), potentially reinforcing stereotypes, such as gender or occupational biases. For example:

from transformers import pipeline
unmasker = pipeline('fill-mask', model='boltuix/bert-tinyplus')
unmasker("The man worked as a [MASK].")

Output:

[
  {'sequence': '[CLS] the man worked as a engineer. [SEP]', 'token_str': 'engineer'},
  {'sequence': '[CLS] the man worked as a doctor. [SEP]', 'token_str': 'doctor'},
  ...
]

unmasker("The woman worked as a [MASK].")

Output:

[
  {'sequence': '[CLS] the woman worked as a teacher. [SEP]', 'token_str': 'teacher'},
  {'sequence': '[CLS] the woman worked as a nurse. [SEP]', 'token_str': 'nurse'},
  ...
]

These biases may propagate to downstream tasks. Due to its small size (~20 MB), the model is suitable for resource-constrained environments but may have limited capacity for complex tasks compared to larger variants.

Recommendations

Users should:

Conduct bias audits tailored to their application.
Fine-tune with diverse, representative datasets to reduce bias.
Apply model compression techniques (e.g., quantization) for deployment on ultra-constrained devices.

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import pipeline, BertTokenizer, BertModel

# Masked Language Modeling
unmasker = pipeline('fill-mask', model='boltuix/bert-tinyplus')
result = unmasker("Hello I'm a [MASK] model.")
print(result)

# Feature Extraction (PyTorch)
tokenizer = BertTokenizer.from_pretrained('boltuix/bert-tinyplus')
model = BertModel.from_pretrained('boltuix/bert-tinyplus')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# output.last_hidden_state holds one vector per token: shape (1, sequence_length, 256)

Training Details

Training Data

The model was pretrained on:

BookCorpus: 11,038 unpublished books, providing diverse narrative text.
English Wikipedia: Article text only, excluding lists, tables, and headers.

See the BoltUIX Dataset Card for more details.

Training Procedure

Preprocessing

Texts are lowercased and tokenized using WordPiece with a vocabulary size of 30,000.
Inputs are formatted as: [CLS] Sentence A [SEP] Sentence B [SEP].
50% of the time, Sentence A and B are consecutive; otherwise, Sentence B is random.
Masking:
- 15% of tokens are selected for prediction.
- 80% of the selected tokens are replaced with [MASK].
- 10% of the selected tokens are replaced with a random token.
- 10% of the selected tokens are left unchanged.
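The masking rules above can be sketched in plain Python. This is a simplified illustration, not the original pretraining code: it selects each token independently with probability 0.15 (so the selected share is 15% only on average), and the function name mask_tokens and the four-word toy_vocab are stand-ins invented for the example.

```python
import random

MASK_TOKEN = "[MASK]"
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Corrupt a token list for MLM and return (corrupted, labels).

    labels[i] is the original token at positions chosen for prediction
    and None everywhere else.
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        # Never corrupt the [CLS]/[SEP] markers; pick ~15% of the rest.
        if token in SPECIAL_TOKENS or rng.random() >= select_prob:
            continue
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return corrupted, labels


rng = random.Random(0)  # fixed seed so the sketch is reproducible
toy_vocab = ["dog", "tree", "ran", "blue"]  # stand-in for the WordPiece vocabulary
tokens = ["[CLS]", "the", "cat", "sat", "[SEP]", "on", "the", "mat", "[SEP]"]
corrupted, labels = mask_tokens(tokens, toy_vocab, rng)
print(corrupted)
print(labels)
```

Leaving 10% of the selected tokens unchanged, and swapping another 10% for random tokens, means the model cannot rely on [MASK] appearing at every position it must predict, which matters because [MASK] never occurs during fine-tuning.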

Training Hyperparameters

Training regime: fp16 mixed precision
Optimizer: Adam (learning rate 1e-4, β1=0.9, β2=0.999, weight decay 0.01)
Batch size: 64
Steps: 500,000
Sequence length: 128 tokens (98% of steps), 512 tokens (2% of steps)
Warmup: 5,000 steps, followed by linear learning rate decay
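The learning rate schedule implied by these values can be written as a small function. The sketch assumes a linear ramp from zero during warmup and a linear decay that reaches zero at the final step; the card states the peak rate, the warmup length, and the step count, but not those endpoints, so treat them as assumptions. The name learning_rate is illustrative.

```python
PEAK_LR = 1e-4          # learning rate from the hyperparameter list
WARMUP_STEPS = 5_000
TOTAL_STEPS = 500_000


def learning_rate(step):
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Assumed: linear ramp from 0 up to the peak during warmup.
        return PEAK_LR * step / WARMUP_STEPS
    # Assumed: linear decay from the peak down to 0 at the final step.
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 2_500, 5_000, 252_500, 500_000):
    print(step, learning_rate(step))
```

In the Transformers library, get_linear_schedule_with_warmup produces a schedule of this shape for a PyTorch optimizer.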

Speeds, Sizes, Times

Training time: Approximately 60 hours
Checkpoint size: ~20 MB
Throughput: ~150 sentences/second on TPU infrastructure

Evaluation

Testing Data, Factors & Metrics

Testing Data

Evaluated on the GLUE benchmark, including tasks like MNLI, QQP, QNLI, SST-2, CoLA, STS-B, MRPC, and RTE.

Factors

Subpopulations: General English text, academic, and professional domains
Domains: News, books, Wikipedia, scientific articles

Metrics

Accuracy: For classification tasks (e.g., MNLI, SST-2)
F1 Score: For tasks like QQP, MRPC
Pearson/Spearman Correlation: For STS-B
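For readers unfamiliar with these metrics, the following plain-Python sketch shows how accuracy, binary F1, and Pearson correlation are computed from predictions. It is illustrative only: the official GLUE scores come from the benchmark's own evaluation, the function names are invented for this example, and Spearman correlation (Pearson applied to ranks) is left out for brevity.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # correlation is undefined for constant inputs
    return cov / (norm_x * norm_y)


gold = [1, 0, 1, 1]
pred = [1, 0, 0, 1]
print(accuracy(gold, pred))
print(f1_binary(gold, pred))
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```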

Results

GLUE test results (fine-tuned):

Task	MNLI-(m/mm)	QQP	QNLI	SST-2	CoLA	STS-B	MRPC	RTE	Average
Score	81.2/80.1	69.5	87.3	90.2	47.8	82.4	85.2	63.1	76.3
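The Average column can be checked from the per-task scores. The reported value is reproduced when the MNLI matched and mismatched scores are counted as two separate entries, which is the assumption made in this short check.

```python
# Per-task scores copied from the results table; MNLI-m and MNLI-mm count separately.
scores = {
    "MNLI-m": 81.2,
    "MNLI-mm": 80.1,
    "QQP": 69.5,
    "QNLI": 87.3,
    "SST-2": 90.2,
    "CoLA": 47.8,
    "STS-B": 82.4,
    "MRPC": 85.2,
    "RTE": 63.1,
}

average = sum(scores.values()) / len(scores)
print(round(average, 1))  # 76.3
```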

Summary

The model delivers modest performance across GLUE tasks, with its strongest results on SST-2 (90.2) and QNLI (87.3) and its weakest on CoLA (47.8) and RTE (63.1). It outperforms boltuix/bert-micro and boltuix/bert-mini on tasks such as RTE and CoLA, offering slightly better capacity for lightweight applications.

Model Examination

The model’s attention mechanisms were analyzed to ensure basic contextual understanding, with no significant overfitting observed during pretraining. Ablation studies validated the training configuration for lightweight, efficient performance.

Environmental Impact

Carbon emissions were estimated using the Machine Learning Impact calculator from Lacoste et al. (2019).

Hardware Type: 1 cloud TPU (4 TPU chips)
Hours used: 60 hours
Cloud Provider: Google Cloud
Compute Region: us-central1
Carbon Emitted: ~40 kg CO2eq (estimated based on TPU energy consumption and regional grid carbon intensity)

Technical Specifications

Model Architecture and Objective

Architecture: BERT (transformer-based, bidirectional)
Objective: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP)
Layers: 2
Hidden Size: 256
Attention Heads: 4
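To give a feel for how these dimensions translate into model size, the sketch below estimates the parameter count of a BERT encoder with this shape. Several inputs are assumptions rather than facts from this card: the feed-forward width (taken as 4 x hidden size = 1024, the usual BERT ratio), the 512 position embeddings, the two token types, and the vocabulary size (the rounded 30,000 quoted under Preprocessing). Pretraining heads are not counted, so the published checkpoint may differ.

```python
def bert_encoder_params(vocab_size, hidden, layers, intermediate,
                        max_positions=512, type_vocab=2):
    """Rough parameter count for a BERT encoder with a pooler (no task heads)."""
    layer_norm = 2 * hidden  # scale and bias
    # Word, position, and token-type embeddings plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections (weights + biases) plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (weights + biases) plus LayerNorm.
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


total = bert_encoder_params(
    vocab_size=30_000,   # rounded figure from the Preprocessing section
    hidden=256,
    layers=2,
    intermediate=1024,   # assumed: 4 x hidden size
)
print(total)  # 9457408
```

Under these assumptions the encoder has about 9.5 million parameters, most of them in the word embeddings. The number of attention heads does not change the total; it only sets the per-head width to 256 / 4 = 64. At 2 bytes per parameter this is roughly 19 MB and at 4 bytes roughly 38 MB, so how the estimate compares with the ~20 MB figure depends on the precision in which the checkpoint is stored.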

Compute Infrastructure

Hardware

1 cloud TPU (4 TPU chips total)

Software

PyTorch
Transformers library (Hugging Face)

Citation

BibTeX:

@article{DBLP:journals/corr/abs-1810-04805,
  author    = {Jacob Devlin and Ming{-}Wei Chang and Kenton Lee and Kristina Toutanova},
  title     = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language Understanding},
  journal   = {CoRR},
  volume    = {abs/1810.04805},
  year      = {2018},
  url       = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint    = {1810.04805}
}

APA: Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs/1810.04805. http://arxiv.org/abs/1810.04805

Glossary

MLM: Masked Language Modeling, where 15% of tokens are masked for prediction.
NSP: Next Sentence Prediction, determining if two sentences are consecutive.
WordPiece: Tokenization method splitting words into subword units.

More Information

See the Hugging Face documentation for advanced usage details.
Contact: boltuix@gmail.com

Model Card Authors

Hugging Face team
BoltUIX contributors

Model Card Contact

For questions, please contact boltuix@gmail.com or open an issue on the model repository.