Update README to include transformers version requirement
README.md CHANGED
@@ -124,6 +124,12 @@ model-index:
 ModChemBERT is a ModernBERT-based chemical language model (CLM), trained on SMILES strings for masked language modeling (MLM) and downstream molecular property prediction (classification & regression).
 
 ## Usage
+Install the `transformers` library (v4.56.1 or later):
+
+```bash
+pip install -U "transformers>=4.56.1"
+```
+
 ### Load Model
 ```python
 from transformers import AutoModelForMaskedLM, AutoTokenizer
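Because the new requirement is a minimum version rather than a pin, it can be worth confirming the environment after installing. A minimal sketch, assuming `transformers` (and its `packaging` dependency) is importable; this check is not part of the commit itself:

```python
# Minimal sketch: confirm the installed transformers version meets the
# README's new >=4.56.1 floor. packaging ships as a transformers dependency.
import transformers
from packaging.version import Version

assert Version(transformers.__version__) >= Version("4.56.1"), (
    f"transformers {transformers.__version__} is older than 4.56.1"
)
```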
@@ -146,19 +152,6 @@ fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
 print(fill("c1ccccc1[MASK]"))
 ```
 
-## Intended Use
-* Primary: Research and development for molecular property prediction, experimentation with pooling strategies, and as a foundational model for downstream applications.
-* Appropriate for: Binary / multi-class classification (e.g., toxicity, activity) and single-task or multi-task regression (e.g., solubility, clearance) after fine-tuning.
-* Not intended for generating novel molecules.
-
-## Limitations
-- Out-of-domain performance may degrade for very long (>128 token) SMILES, inorganic / organometallic compounds, polymers, and charged / enumerated tautomers, which are not well represented in training.
-- No guarantee of synthesizability, safety, or biological efficacy.
-
-## Ethical Considerations & Responsible Use
-- Potential biases arise from training corpora skewed to drug-like space.
-- Do not deploy in clinical or regulatory settings without rigorous, domain-specific validation.
-
 ## Architecture
 - Backbone: ModernBERT
 - Hidden size: 768
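Stitched together from the context lines visible in these two hunks, the README's usage example plausibly runs end to end as sketched below. The checkpoint ID is a placeholder assumption, since the diff never names the repository:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Placeholder checkpoint ID (assumption) -- the hunks do not show the repo name.
model_id = "<namespace>/ModChemBERT"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Mask filling over a SMILES string (benzene plus one masked token),
# matching the README's example.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill("c1ccccc1[MASK]"))
```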
@@ -289,6 +282,19 @@ Optimal parameters (per dataset) for the `MLM + DAPT + TAFT OPT` merged model:
 
 </details>
 
+## Intended Use
+* Primary: Research and development for molecular property prediction, experimentation with pooling strategies, and as a foundational model for downstream applications.
+* Appropriate for: Binary / multi-class classification (e.g., toxicity, activity) and single-task or multi-task regression (e.g., solubility, clearance) after fine-tuning.
+* Not intended for generating novel molecules.
+
+## Limitations
+- Out-of-domain performance may degrade for very long (>128 token) SMILES, inorganic / organometallic compounds, polymers, and charged / enumerated tautomers, which are not well represented in training.
+- No guarantee of synthesizability, safety, or biological efficacy.
+
+## Ethical Considerations & Responsible Use
+- Potential biases arise from training corpora skewed to drug-like space.
+- Do not deploy in clinical or regulatory settings without rigorous, domain-specific validation.
+
 ## Hardware
 Training and experiments were performed on 2 NVIDIA RTX 3090 GPUs.
 
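The Architecture context lines above (ModernBERT backbone, hidden size 768) can be cross-checked against the published configuration. Again a sketch under the same placeholder-ID assumption:

```python
from transformers import AutoConfig

# Same placeholder checkpoint ID assumption as above.
config = AutoConfig.from_pretrained("<namespace>/ModChemBERT")

# The README's Architecture bullets suggest a ModernBERT backbone
# with hidden_size 768.
print(config.model_type, config.hidden_size)
```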