Veronica-Polymorphic

Veronica-Polymorphic is a decoder‑only transformer featuring a polymorphic MLP layer: each token is processed by a soft mixture of specialized branches (SwiGLU, GLU, Depthwise Causal Conv) under an entropy‑regularized router. The design enables adaptive capacity, incremental expansion (adding new branches post‑pretrain), and targeted specialization (e.g. translation modules) without full retraining from scratch.

TL;DR

| Feature | Description |
|---|---|
| Architecture | 24‑layer causal Transformer (RoPE, untied embeddings, 551M params) |
| Polymorphic MLP | Soft routing over 3 base branches (SwiGLU, GLU, DepthwiseConv) |
| Routing Control | Depth-scaled temperature (√depth) + entropy maximization |
| Precision | BF16 with FP32 LayerNorm for stability |
| Positional Encoding | Rotary (RoPE, θ=10,000) |
| Dataset Mix | FinePDFs‑1B 50% • DCLM Baseline‑1B 30% • FineWeb‑Edu 20% |
| Context Length | 1024 (0–30k) → 2048 (30k–60k); 512 causes router collapse on 24L |
| Expansion | Add new branches (e.g. Translation) via lightweight migration + fine‑tune |

Installation

pip install -e .

Quick start:

from veronica import VeronicaConfig, VeronicaForCausalLM

cfg = VeronicaConfig(n_layer=24, num_funcs=3)  # base polymorphic setup
model = VeronicaForCausalLM(cfg)

Notes

  • The collection link aggregates additional samples (e.g., educational/web sources) used to complete the 50/30/20 composition.
  • Please refer to each dataset’s license/terms; FinePDFs is curated from public PDFs and is referenced, not redistributed here.

Total tokens target (example): ~60B. The composition balances semantic density (FinePDFs) and generality (DCLM) per codelion’s guidance.

Generation example:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # or your saved tokenizer
prompt = "The theory of relativity states that"
ids = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=64, do_sample=True, temperature=0.7, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))

Architecture Overview

High Level

Input Embeddings → [Block × N] → Untied LM Head
   Block: Pre-LN → Multi-Head Self-Attention (RoPE) → Residual → Pre-LN → Polymorphic MLP (Routing + Branch Fusion) → Residual

Dataset Citations

If you use these datasets or composition, please cite:

@misc{sharma2025billion,
  title  = {The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author = {Sharma, Asankhaya},
  year   = {2025},
  url    = {https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}

Related collection and datasets:


Polymorphic MLP

Per token & layer:

router_logits = Router(x)          # Linear → GELU → Linear
α = softmax(router_logits / τ)
branches = [SwiGLU(x), GLU(x), DepthwiseConvMLP(x)]
output = Σ α_i * branch_i(x)
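A minimal PyTorch sketch of this routing, assuming the branch modules already exist and share the hidden size (the class name PolymorphicMLP and the Linear → GELU → Linear router follow the pseudocode above; the shipped layer may differ in detail):

import torch
import torch.nn as nn

class PolymorphicMLP(nn.Module):
    """Soft mixture of branches under a temperature-scaled router (illustrative sketch)."""
    def __init__(self, hidden_size: int, branches: list, router_hidden: int = 128):
        super().__init__()
        self.funcs = nn.ModuleList(branches)           # e.g. [SwiGLU, GLU, DepthwiseConvMLP]
        self.router = nn.Sequential(                   # Linear → GELU → Linear, as above
            nn.Linear(hidden_size, router_hidden),
            nn.GELU(),
            nn.Linear(router_hidden, len(branches)),
        )

    def forward(self, x: torch.Tensor, tau: float = 1.0):
        logits = self.router(x)                        # (B, T, num_funcs)
        alpha = torch.softmax(logits / tau, dim=-1)    # per-token soft routing weights
        outs = torch.stack([f(x) for f in self.funcs], dim=-1)  # (B, T, H, num_funcs)
        out = (outs * alpha.unsqueeze(-2)).sum(dim=-1)           # Σ α_i · branch_i(x)
        return out, alpha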

Routing stabilized by:

  • Temperature schedule (τ high early → softer mixing)
  • Entropy-max aux-loss (subtract entropy from total loss to maximize it)
  • Optional forcing during warmup to guarantee gradient flow to new branches
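As a concrete illustration of the first two stabilizers, a hedged sketch of a linear τ schedule with freeze and the entropy-max auxiliary term (values taken from the 24L training table below; the actual training script may implement these differently):

import torch

def router_tau(step, tau_start=2.2, tau_end=1.4, freeze_steps=6000, total_steps=60000):
    """Hold tau at tau_start during the freeze, then decay linearly to tau_end."""
    if step < freeze_steps:
        return tau_start
    frac = min(1.0, (step - freeze_steps) / max(1, total_steps - freeze_steps))
    return tau_start + frac * (tau_end - tau_start)

def routing_entropy(alpha, eps=1e-9):
    """Mean normalized routing entropy in [0, 1] over tokens."""
    h = -(alpha * (alpha + eps).log()).sum(dim=-1)
    return (h / torch.log(torch.tensor(float(alpha.size(-1))))).mean()

# total_loss = ce_loss - aux_weight * routing_entropy(alpha)   # subtracting maximizes entropy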

Branch Types

| Branch | Purpose | Structure |
|---|---|---|
| SwiGLU | Smooth gated MLP | Linear(up 2×) → split → SiLU × gate → Linear(down) |
| GLU | Alternative gating dynamics | Linear(up 2×) → split → Sigmoid × gate → Linear(down) |
| DepthwiseConv | Local token patterns | Depthwise causal conv (k=3) → expand → GELU → contract |
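A sketch of the three base branches as described in the table above (dimensions, defaults, and class names are illustrative and may not match the shipped modules exactly):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, hidden_size: int, mlp_mult: float = 4.0):
        super().__init__()
        inner = int(hidden_size * mlp_mult)
        self.up = nn.Linear(hidden_size, 2 * inner)    # up-project 2×, then split
        self.down = nn.Linear(inner, hidden_size)

    def forward(self, x):
        a, g = self.up(x).chunk(2, dim=-1)
        return self.down(F.silu(a) * g)                # SiLU × gate

class GLU(nn.Module):
    def __init__(self, hidden_size: int, mlp_mult: float = 4.0):
        super().__init__()
        inner = int(hidden_size * mlp_mult)
        self.up = nn.Linear(hidden_size, 2 * inner)
        self.down = nn.Linear(inner, hidden_size)

    def forward(self, x):
        a, g = self.up(x).chunk(2, dim=-1)
        return self.down(torch.sigmoid(g) * a)         # Sigmoid × gate

class DepthwiseConvMLP(nn.Module):
    def __init__(self, hidden_size: int, mlp_mult: float = 4.0, kernel_size: int = 3):
        super().__init__()
        # Depthwise conv; trimming the right-hand padding keeps it causal
        self.conv = nn.Conv1d(hidden_size, hidden_size, kernel_size,
                              groups=hidden_size, padding=kernel_size - 1)
        inner = int(hidden_size * mlp_mult)
        self.expand = nn.Linear(hidden_size, inner)
        self.contract = nn.Linear(inner, hidden_size)

    def forward(self, x):                              # x: (B, T, H)
        h = self.conv(x.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)  # trim right pad
        return self.contract(F.gelu(self.expand(h)))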

Positional Encoding

Rotary embeddings (RoPE) applied to Q/K heads with cached cos/sin; no absolute learned positions.
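A compact sketch of rotary application with cached cos/sin, assuming Q/K of shape (B, heads, T, head_dim) and θ=10,000 (the model's own rotary module may cache and broadcast differently):

import torch

def rope_cos_sin(seq_len, head_dim, theta=10_000.0, device="cpu"):
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2, device=device).float() / head_dim))
    pos = torch.arange(seq_len, device=device).float()
    freqs = torch.outer(pos, inv_freq)                  # (T, head_dim/2)
    emb = torch.cat([freqs, freqs], dim=-1)             # (T, head_dim)
    return emb.cos(), emb.sin()                         # cached once per sequence length

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(q, k, cos, sin):
    # q, k: (B, n_heads, T, head_dim); cos/sin broadcast over batch and heads
    cos, sin = cos[None, None], sin[None, None]
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin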

Stability Choices

| Mechanism | Rationale |
|---|---|
| FP32 LayerNorm | Prevent BF16 precision drift |
| Entropy-Max Aux | Avoid early router collapse |
| High initial τ | Encourage exploration across branches |
| Gradient Checkpointing | Memory efficiency for depth |

Dataset Mixture (codelion / DataComp inspired)

Training uses a curated blend guided by open mixture studies:

| Source | Share | Notes |
|---|---|---|
| FinePDFs | 50% | Technical & academic PDFs (higher semantic density) |
| DCLM Baseline | 30% | General web corpus (DataComp LM baseline) |
| FineWeb‑Edu | 20% | Educational domain for structured explanatory patterns |

Total tokens target (example): ~60B (adjustable). The composition balances semantic density vs generality, echoing codelion’s optimal ratio analyses.


Training Setup

| Hyperparameter | Value (example) |
|---|---|
| Layers | 24 |
| Hidden size | 768 |
| Heads | 12 |
| MLP mult | 4.0 |
| Batch (per device) | 4 |
| Grad Accumulation | 8 (effective batch 32) |
| LR | 1.2e-4, cosine decay |
| Warmup | 10% of steps |
| Weight Decay | 0.01 |
| Label Smoothing | 0.01 |
| Precision | BF16 + FP32 LayerNorm |
| Max Seq Len | 1024 → 2048 (curriculum) |
| Router τ | 2.2 → 1.4 (frozen for first 6k steps, depth-scaled) |
| Aux weight λ | 0.008 → 0.016 (depth-scaled ×√2) |
| Router forcing | 10% prob for first 5k steps |
| Rep penalty (α) | 0.05 (smoke quality) |

Launch:

python scripts/train_veronica.py \
  --config configs/veronica-pretrain-24L.json \
  --dataset_paths data/mix_optimal_50_30_20 \
  --output_dir runs/veronica-pretrain-24L \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --max_steps 60000 \
  --learning_rate 1.2e-4 \
  --warmup_ratio 0.10 \
  --weight_decay 0.01 \
  --max_seq_len 1024 \
  --router_tau_start 2.2 --router_tau_end 1.4 --router_tau_freeze_steps 6000 \
  --router_aux_start 0.008 --router_aux_end 0.016 \
  --router_force_prob 0.10 --router_force_warmup_steps 5000 \
  --rep_alpha 0.05 \
  --seed 42

Critical Discovery: Context Length & Router Stability on Deep Models

The 512 Token Trap (24L Only)

Finding: With 24 layers, starting training at 512 context length causes router collapse by step 3k:

Step 3000: alpha=[0.73, 0.14, 0.12], entropy=0.70 (UNHEALTHY)

Root Cause:

  • With 512 tokens per sequence and 24 routing decisions per token (one per layer) → 12,288 routing examples per sequence
  • But distributed across 3 branches and 24 layers → each branch-layer combination receives only ~170 gradient samples
  • Insufficient signal for stable gradient descent on router parameters
  • Weak branches cannot recover from random initialization noise
  • Router collapses toward dominant branch to minimize aux loss conflict

Why This Doesn't Happen on 12L:

  • Same 512 tokens per sequence → 6,144 routing examples
  • Each branch-layer: ~170 samples (same as 24L)
  • But 12 layers = shorter gradient path → less noise accumulation
  • Router can stabilize before collapse

Solution: Start at 1024 for Deep Models

Corrected curriculum for 24L:

0–20k steps:   1024 tokens  ✅ 24,576 routing examples = stable gradients
20k–60k steps: 2048 tokens  🎯 49,152 examples = final quality

DO NOT use 512 ctx on 24L — this is an empirical hard constraint, not a performance optimization.

For 12L and shallower: 512→1024→2048 curriculum works fine.

Mathematical threshold: ~15–20 layers appears to be the crossover where 512 becomes unstable. Always use 1024 or higher for ≥24L.
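A back-of-the-envelope check of the routing-sample arithmetic above (plain arithmetic, no model code; note the per-branch-layer count depends only on sequence length, which is why depth matters through gradient-path noise rather than sample count):

def routing_samples(seq_len, n_layers, n_branches=3):
    decisions = seq_len * n_layers                             # one routing decision per token per layer
    per_branch_layer = decisions / (n_branches * n_layers)     # = seq_len / n_branches
    return decisions, per_branch_layer

print(routing_samples(512, 24))    # (12288, ~170.7): too sparse to stabilize a 24L router
print(routing_samples(1024, 24))   # (24576, ~341.3): stable phase 1
print(routing_samples(2048, 24))   # (49152, ~682.7): final phase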


Depth Scaling for 24L (Mathematical Rationale)

With 24 independent routing decisions per token (one per layer), naive parameters from shallower models (12L) cause amplified specialization and branch collapse. We apply square-root depth scaling to maintain equivalent "softness" across architectures:

Temperature Scaling

Softmax sharpness compounds across layers. To preserve exploration:

τ_24L = τ_12L × √(24/12) = τ_12L × √2 ≈ τ_12L × 1.41

For 12L baseline τ=1.6, we use τ=2.2 for 24L (start) and τ=1.4 (end).

Aux Weight Scaling

Entropy gradient must compete with 24 layers pulling toward specialization:

λ_24L = λ_12L × √2 ≈ λ_12L × 1.41

For 12L baseline λ=0.005→0.012, we use λ=0.008→0.016 for 24L.

Forcing Probability

Each branch needs more examples across deeper network:

P_force_24L ≈ P_force_12L × (24/12) = 2 × P_force_12L

For 12L 5%, we use 10% for 24L during warmup (0–5k steps).
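The three scaling rules collected into one hedged helper (12L baseline values taken from the text; the assumed 12L end temperature of 1.0 comes from the standard schedule mentioned later, and the function name is illustrative):

import math

def scale_router_params(n_layers, base_layers=12,
                        tau=(1.6, 1.0), aux=(0.005, 0.012), force_prob=0.05):
    """Scale 12L router hyperparameters to a deeper model."""
    s = math.sqrt(n_layers / base_layers)                 # √(depth ratio) for tau and aux
    return {
        "tau_start": round(tau[0] * s, 2),                # 1.6·√2 ≈ 2.26 (2.2 used in practice)
        "tau_end": round(tau[1] * s, 2),                  # 1.0·√2 ≈ 1.41 (1.4 used)
        "aux_start": round(aux[0] * s, 4),                # 0.005·√2 ≈ 0.007 (0.008 used)
        "aux_end": round(aux[1] * s, 4),                  # 0.012·√2 ≈ 0.017 (0.016 used)
        "force_prob": force_prob * (n_layers / base_layers),  # linear in depth: 5% → 10%
    }

print(scale_router_params(24))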

Empirical Results (Training Logs)

  • Step 300: Entropy 1.00, perfect uniform distribution [0.33, 0.33, 0.33]
  • Step 5k: Entropy 0.73, healthy distribution [0.71, 0.11, 0.18]
  • Step 7k: Entropy 0.80–0.93 (exploration phase post tau-freeze)
  • Step 10k: Loss ~34, no branch collapse
  • Step 11k (post branch-1 recovery): Entropy 0.84–0.93, distribution [0.57, 0.15, 0.27]
  • Step 12k: Stable soft routing, eval loss 4.07

Router Health Metrics

Monitor log lines:

[router] alpha=[a0, a1, a2, ...] entropy_norm=E
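For reference, entropy_norm in these lines can be reproduced directly from the alpha vector (a sketch; the training script's exact reduction over tokens may differ):

import math

def entropy_norm(alpha):
    """Normalized entropy of a routing distribution: 1.0 = uniform, 0.0 = fully collapsed."""
    h = -sum(a * math.log(a) for a in alpha if a > 0)
    return h / math.log(len(alpha))

print(round(entropy_norm([0.571, 0.153, 0.276]), 3))   # ≈ 0.876, matching the observed distribution below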

Targets by Training Phase

| Phase | Steps | Entropy Target | Min Branch Share | Notes |
|---|---|---|---|---|
| Warmup | 0–5k | ≥0.90 | ≥0.25 | Forcing active, near-uniform |
| Post-freeze | 5k–10k | ≥0.75 | ≥0.12 | Specialization begins |
| Stable | 10k+ | ≥0.70 | ≥0.15 | Soft routing converged |
| Final | 40k–60k | ≥0.65 | ≥0.12 | Acceptable specialization |

Observed Distribution (24L, Step 12k)

alpha=[0.571, 0.153, 0.276]  entropy_norm=0.876

Ideal soft routing: dominant branch ~55–65%, minorities ~15–25% each.


Context Length Curriculum

Architecture-Dependent Strategy

For 24L (≥20L in general):

1024 tokens: Steps 0–20k   (NO 512 phase — causes router collapse)
2048 tokens: Steps 20k–60k

For 12L and shallower:

512 tokens:  Steps 0–10k
1024 tokens: Steps 10k–30k
2048 tokens: Steps 30k–60k

Phase 1 (24L): 1024 Tokens (Steps 0–20k)

  • Purpose: Router stability + pattern learning (REQUIRED for 24L from step 0)
  • VRAM: ~8–9GB (batch=4, accum=8)
  • Throughput: ~8–10 sec/step
  • Why not 512: Insufficient routing examples cause branch collapse by 3k steps

Phase 2 (24L): 2048 Tokens (Steps 20k–60k)

  • Purpose: Final capacity, long-document coherence
  • VRAM: ~12–13GB (batch=4, accum=8)
  • Switching criteria: Stable routing on 1024 (entropy ≥0.75, branches ≥0.15)
  • Expected dip: Temporary entropy −0.02–0.04, recovers within 500 steps

Note: the VRAM figures above assume BF16 training.

Switching Template

python scripts/train_veronica.py \
  --resume_from runs/veronica-24L-1024/checkpoint-12000 \
  --output_dir runs/veronica-24L-2048 \
  --max_seq_len 2048 \
  # ... keep all other router params unchanged

Incremental Expansion (Add New Branch Post‑Pretrain)

Goal: Increase capacity or add a specialization (e.g. translation) without a full restart. A consolidated sketch follows the numbered steps below.

Steps

  1. Load original checkpoint + config:
    cfg = VeronicaConfig.from_pretrained(old_dir)
    old_funcs = cfg.num_funcs
    cfg.num_funcs = old_funcs + 1  # adding one branch
    model = VeronicaForCausalLM.from_pretrained(old_dir, config=cfg, ignore_mismatched_sizes=True)
    
  2. Implement new branch class (see Translation branch below) and extend PolymorphicMLP construction.
  3. Keep the existing router weights and initialize the new output slice with small values:
    import torch, torch.nn as nn
    for blk in model.blocks:
      lin = blk.mlp.router[-1]  # final Linear
      with torch.no_grad():
        # existing weights remain; new slice initialized
        nn.init.normal_(lin.weight[old_funcs:], mean=0.0, std=0.02)
        if lin.bias is not None:
          nn.init.zeros_(lin.bias[old_funcs:])
    
  4. Freeze old branches & attention for warmup:
    for name, p in model.named_parameters():
      if "funcs.%d" % (old_funcs) in name or "router.2" in name:  # new branch + router final layer
        p.requires_grad = True
      else:
        p.requires_grad = False
    
  5. High τ + light forcing (0–1k steps): router_tau_start=1.8, router_force_prob≈0.15.
  6. Blend phase (1–3k steps): unfreeze old branches, lower τ → 1.2, increase aux to mid (e.g. 0.006).
  7. Stabilize: restore standard schedule (τ→1.0, aux→0.01), disable forcing.
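A consolidated sketch of steps 1–4 (the checkpoint path is an example, and attribute names such as model.blocks and mlp.router follow the snippets above; they are assumptions about the actual module layout):

import torch
import torch.nn as nn
from veronica import VeronicaConfig, VeronicaForCausalLM

old_dir = "runs/veronica-pretrain-24L/checkpoint-60000"   # example checkpoint path

# 1) Expand the config by one branch and reload with size-mismatch tolerance
cfg = VeronicaConfig.from_pretrained(old_dir)
old_funcs = cfg.num_funcs
cfg.num_funcs = old_funcs + 1
model = VeronicaForCausalLM.from_pretrained(old_dir, config=cfg, ignore_mismatched_sizes=True)

# 2/3) Initialize only the router's new output slice; existing rows keep their trained weights
for blk in model.blocks:                      # assumed attribute holding the transformer blocks
    lin = blk.mlp.router[-1]                  # final Linear of the router MLP
    with torch.no_grad():
        nn.init.normal_(lin.weight[old_funcs:], mean=0.0, std=0.02)
        if lin.bias is not None:
            nn.init.zeros_(lin.bias[old_funcs:])

# 4) During warmup, train only the new branch and the router's final layer
new_branch_tag = "funcs.%d" % old_funcs
for name, p in model.named_parameters():
    p.requires_grad = new_branch_tag in name or "router.2" in name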

Recommended Minimal Fine‑Tune Command

python scripts/train_veronica.py \
  --config expanded-config.json \  # updated num_funcs
  --resume_from runs/veronica-pretrain-24L/checkpoint-60000 \
  --output_dir runs/veronica-expand-translation \
  --max_steps 8000 \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --learning_rate 8e-5 \
  --router_tau_start 1.8 --router_tau_end 1.2 --router_tau_freeze_steps 1500 \
  --router_aux_start 0.001 --router_aux_end 0.008 \
  --router_force_prob 0.15 --router_force_warmup_steps 1200

Translation Specialization Branch

Add a branch focusing on cross‑lingual adaptation without retraining entire backbone.

Design Goals

| Requirement | Implementation Choice |
|---|---|
| Lightweight | Low‑rank adapters + language conditioning |
| Reusable | Shares main hidden size; no separate encoder |
| Controllable | Can be forced via force_func for targeted tuning |

Example Branch Implementation

from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F

class TranslationBranch(nn.Module):
  def __init__(self, hidden_size: int, mlp_mult: float = 2.0, rank: int = 64, num_langs: int = 16):
    super().__init__()
    self.rank = rank
    self.lang_embed = nn.Embedding(num_langs, hidden_size)
    inner = int(hidden_size * mlp_mult)
    self.up = nn.Linear(hidden_size, inner)
    self.down = nn.Linear(inner, hidden_size)
    # Low-rank adapters
    self.A = nn.Linear(hidden_size, rank, bias=False)
    self.B = nn.Linear(rank, hidden_size, bias=False)
    self.gate = nn.Linear(hidden_size, 1)

  def forward(self, x: torch.Tensor, lang_ids: Optional[torch.Tensor] = None) -> torch.Tensor:
    # x: (B, T, H); lang_ids: (B,) sentence-level or (B, T) token-level
    if lang_ids is not None:
      if lang_ids.dim() == 1:  # broadcast sentence-level language embedding
        lang_vec = self.lang_embed(lang_ids).unsqueeze(1)  # (B, 1, H)
      else:
        lang_vec = self.lang_embed(lang_ids)               # (B, T, H)
      x = x + lang_vec
    # Main MLP path
    h = self.down(F.gelu(self.up(x)))
    # Low-rank adapter path, gated per token
    a = self.B(F.gelu(self.A(x)))
    g = torch.sigmoid(self.gate(x))  # (B, T, 1)
    return h + g * a

Integrate Into PolymorphicMLP

Inside branch construction:

if num_funcs >= 4:
  funcs.append(TranslationBranch(hidden_size, mlp_mult=2.0))

Passing Language IDs

  • Add lang_ids to model forward signature (optional).
  • Modify the branch call so TranslationBranch receives lang_ids (func(x, lang_ids=lang_ids)) while branches that don't expect it are called unchanged (see the dispatch sketch below).
  • For multilingual fine‑tune, prepend special language tokens or maintain a side tensor of language indices.
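One possible way to dispatch lang_ids only to branches that accept it (a sketch; the real PolymorphicMLP forward may handle this differently):

import inspect

def call_branch(func, x, lang_ids=None):
    """Pass lang_ids only to branches whose forward declares it (e.g. TranslationBranch)."""
    if lang_ids is not None and "lang_ids" in inspect.signature(func.forward).parameters:
        return func(x, lang_ids=lang_ids)
    return func(x)

# inside PolymorphicMLP.forward:
#   outs = torch.stack([call_branch(f, x, lang_ids) for f in self.funcs], dim=-1)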

Fine‑Tuning Strategy

  1. Collect multilingual parallel / monolingual corpora (e.g. FLORES, WikiMatrix, OSCAR subset).
  2. Freeze base transformer + existing branches initially.
  3. Force translation branch (force_func = translation_index) for exploratory steps.
  4. Gradually unfreeze attention + other branches for joint adaptation.
  5. Evaluate on BLEU / COMET vs baseline; adjust rank / mlp_mult if underfitting.

Evaluation & Monitoring

| Metric | Purpose |
|---|---|
| CE / PPL | Language modeling convergence |
| Router Entropy | Diversity of branch usage |
| Alpha Distribution | Detect collapse or dominance |
| Translation BLEU (if added) | Cross-lingual quality |

Limitations

| Area | Limitation |
|---|---|
| Alignment | Base LM (no RLHF / instruction tuning) |
| Multilingual | Requires added translation branch + fine‑tune |
| Safety | No filtering; may reproduce dataset biases |
| Interpretability | Router decisions not fully explainable |

Router Stability (Important)

Dynamic soft‑routing is powerful but sensitive. The training methodology has been refined through empirical testing on 24L to ensure healthy branch growth.

Known Issues & Solutions

| Issue | Symptom | Solution |
|---|---|---|
| Early collapse | Branch <10% by 3k steps | Increase tau_start (2.2→2.4), extend freeze (6k→8k) |
| Post-freeze oscillation | Entropy spikes 0.75→0.95 | Expected; aux pushes exploration. Monitor for 500 steps. |
| Weak branch stagnation | Branch <12% after 10k | Targeted forcing: --force_branch_idx X --force_branch_until +1000, aux=0 during window |
| Adaptive forcing loops | Repeated forced windows | Do not use adaptive forcing; rely on aux + tau only |

Failed Experiment: Adaptive Forcing (DO NOT USE)

Attempted solution: auto-detect weak branches (share below a threshold) and dynamically apply forcing windows.

# BROKEN CODE — DO NOT USE
if min(alpha) < 0.15 and not in_cooldown:
    weak_idx = argmin(alpha)
    force_branch_idx = weak_idx
    force_until = current_step + 1000
    in_cooldown = True

Why it failed:

  1. Cascade loops: Forcing branch A → weakens branch B → triggers forcing B → weakens A → infinite oscillation
  2. Artificial alpha: During forced windows, alpha reflects forcing distribution [0,0,1], not learned preferences
  3. Gradient confusion: Aux loss receives artificial entropy signals, disrupts learning
  4. Manual intervention superior: Targeted forcing with aux=0 isolates signal cleanly

Lesson: Router needs consistent pressure (tau + aux), not reactive intervention. Manual forcing for recovery only, not automated.

Safeguards Implemented (Validated)

  1. Depth-scaled parameters: τ and λ scaled by √(depth_ratio) to maintain effective softness
  2. Extended freeze: Tau held constant for 6k steps (10% of training) to prevent premature specialization
  3. Entropy-max loss: Subtract (not add) aux_loss to maximize branch diversity
  4. Warmup forcing: 10% probability during first 5k steps ensures all branches receive gradients
  5. FP32 LayerNorm: Prevents BF16 precision drift in routing logits
  6. NO adaptive forcing: Rely on tau/aux scheduling + manual intervention when needed

Intervention Playbook (Step-by-Step)

Scenario: Branch drops <10% before 5k steps

  1. Stop training, resume from last good checkpoint
  2. Increase --router_tau_start by +0.2 (e.g., 2.2→2.4)
  3. Extend --router_tau_freeze_steps by +2000
  4. Increase --router_force_prob to 0.12–0.15

Scenario: Branch stuck <12% after 10k steps

  1. Run targeted forcing (see Incremental Expansion section)
  2. Force weak branch for 1k steps with aux=0, LR=5e-5
  3. Resume normal training with aux restored
  4. Expected recovery: +3–8% share within 500 steps

Scenario: Entropy <0.70 and falling after 15k

  1. Increase --router_aux_end by +0.002 (e.g., 0.016→0.018)
  2. Consider raising --router_tau_end slightly (1.4→1.5) to slow sharpening

Fine‑Tuning Note

If using standard HF Trainer without custom loss, set router_aux_weight=0 in config to avoid incorrect gradient direction. Use scripts/train_veronica.py for full entropy-max support.
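For example, assuming the config exposes router_aux_weight as the note implies (the field name and checkpoint path are illustrative):

from veronica import VeronicaConfig, VeronicaForCausalLM

ckpt = "runs/veronica-pretrain-24L/checkpoint-60000"
cfg = VeronicaConfig.from_pretrained(ckpt)
cfg.router_aux_weight = 0.0      # plain HF Trainer: avoid applying the aux term with the wrong sign
model = VeronicaForCausalLM.from_pretrained(ckpt, config=cfg)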

Empirical Training Log (24L Complete Journey)

First attempt (FAILED — 512 ctx):

  • Step 0–300: Perfect init (entropy 1.0) with high tau + forcing
  • Step 3000: Router collapse — alpha=[0.73, 0.14, 0.12], entropy 0.70 ❌
  • Diagnosis: 512 ctx insufficient for 24L depth
  • Action: Abandoned run, restarted from scratch with 1024 ctx

Adaptive forcing experiment (FAILED):

  • Implementation: Auto-detect weak branches, dynamic forcing windows
  • Outcome: Cascade loops, artificial alpha patterns [0,0,1], gradient confusion
  • Action: Reverted code, relied on tau/aux only

Final successful run (1024 ctx from step 0):

  • Step 0–300: Perfect uniformity (entropy 1.0), high tau (2.2) + 10% forcing
  • Step 1000: Loss 87→52, entropy 0.92, balanced [0.39, 0.32, 0.29]
  • Step 3000: Loss 41, entropy 0.73, distribution [0.71, 0.13, 0.16] (healthy)
  • Step 5000: Loss 37, forcing disabled, entropy 0.72 maintained
  • Step 6000: Tau unfreezes (2.2→1.4 schedule begins)
  • Step 6000-7000: Entropy spikes 0.80→0.93 (exploration phase, expected)
  • Step 10000: Loss 34, branch 1 weakened to ~10% (concern threshold)
  • Intervention: Targeted forcing on branch 1 (10k→11k steps)
    • --force_branch_idx 1 --force_branch_until 11000
    • --router_aux_start 0.0 (isolate gradient signal)
    • --learning_rate 5e-5 (gentle nudge)
  • Step 11000: Branch 1 recovered to 15%, entropy 0.84–0.93 ✅
  • Step 12000: Stable soft routing [0.57, 0.15, 0.27], entropy 0.876
    • Eval loss 4.41→4.07 (intervention improved generalization)
    • Loss trend: 34→33 (continued healthy descent)
    • All branches active and contributing

Key learnings:

  1. ✅ 1024 ctx required from step 0 for 24L
  2. ✅ Depth-scaled tau/aux/forcing parameters validated
  3. ✅ Targeted forcing (aux=0, short window) effective for recovery
  4. ❌ Adaptive forcing causes more problems than it solves
  5. ✅ Entropy 0.84–0.93 with min branch 15% = healthy soft routing

Status: Methodology validated on 24L/551M through 12k steps. Ready for 2048 ctx phase (30k+). Core API stable; default schedules proven effective.


Practical Training Tips

DO

  • ✅ Use 1024 ctx from step 0 for 24L models (512 causes router collapse)
  • ✅ Scale tau/aux with √(depth_ratio) when changing layer count
  • ✅ Use depth-scaled forcing probability (10% for 24L vs 5% for 12L)
  • ✅ Freeze tau for ~10% of total training steps (6k for 60k total)
  • ✅ Monitor entropy every 100 steps; save checkpoints every 500
  • ✅ Apply targeted forcing (aux=0, short window) for weak branches after 10k
  • ✅ Keep aux weight increasing throughout training (e.g., 0.008→0.016)
  • ✅ Trust depth-scaled parameters — they're empirically validated

DON'T

  • ❌ Use 512 ctx on 24L (causes collapse by 3k steps — empirically proven)
  • ❌ Implement adaptive forcing (causes cascade loops and artificial alpha)
  • ❌ Lower tau too aggressively (<1.2 for 24L can cause collapse)
  • ❌ Set aux=0 for normal training (only during targeted forcing windows)
  • ❌ Switch context length without verifying entropy stability (≥0.72 for 1k steps)
  • ❌ Expect perfect uniformity throughout training (soft routing allows specialization)
  • ❌ Panic if entropy spikes post tau-freeze (oscillation is expected; monitor 500 steps)
  • ❌ Use curriculum 512→1024→2048 on deep models (≥20L requires 1024 start)

VRAM Optimization

If hitting OOM on 2048 ctx:

--per_device_train_batch_size 2 \
--gradient_accumulation_steps 16  # keeps effective batch = 32

Quick Health Check (Per 1k Steps)

grep "\[router\]" logs/train.log | tail -10

Look for:

  • Entropy trend (should be ≥0.70)
  • Min branch value (should be ≥0.12)
  • Loss trend (should decrease or stabilize)
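Equivalently, a small parsing sketch for the same check (the log format is assumed to match the [router] line shown earlier):

import re

pattern = re.compile(r"\[router\] alpha=\[([^\]]+)\] entropy_norm=([0-9.]+)")

with open("logs/train.log") as f:
    for line in f:
        m = pattern.search(line)
        if not m:
            continue
        alpha = [float(a) for a in m.group(1).split(",")]
        entropy = float(m.group(2))
        if entropy < 0.70 or min(alpha) < 0.12:
            print("WARNING:", line.strip())   # below the health targets listed above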

Roadmap

| Version | Goal |
|---|---|
| v0.1 | Core polymorphic MLP + tests |
| v0.2 | Router logging + entropy regularization |
| v0.3 | Channel attention option |
| v0.4 | FlashAttention integration |
| v0.5 | Expansion utilities (branch migration helpers) |
| v0.6 | Translation branch reference implementation |

Current status: between v0.2 and v0.3.

Contributing

PRs welcome for: new branch types, expansion helpers, multilingual adapters, evaluation scripts.


License

Apache-2.0


Citation

@misc{veronica-2025,
  title={Veronica: Entropy-Regularized Polymorphic Branching for Adaptive Language Modeling},
  author={Emanuele D'Angelo (GG-Ally)},
  year={2025},
  howpublished={\url{https://huggingface.co/MhaWay/Veronica}}
}

Acknowledgments

  • Mixture & routing concepts inspired by Switch Transformer, GLaM, MoE literature.
  • Dataset composition ratios guided by codelion’s DataComp LM mixture studies.
  • RoPE adaptation referencing GPT-NeoX implementation details.

FAQ

Q: Why entropy-max instead of load-balancing penalty?
To avoid premature specialization and keep new branches trainable; diversity pressure is maintained by an increasing aux weight schedule.

Q: Can I add many branches at once?
Incremental addition (3→4→5) is recommended to prevent starvation of new branches.

Q: How to specialize for translation?
Add TranslationBranch, warmup with forced routing, then blended fine-tune with multilingual data.

Q: Does expansion erase prior knowledge?
No; existing branches retain weights. Router + new branch adapt during short fine‑tune.


Happy branching! 🌿
