Veronica-Polymorphic

Veronica-Polymorphic is a decoder‑only transformer featuring a polymorphic MLP layer: each token is processed by a soft mixture of specialized branches (SwiGLU, GLU, Depthwise Causal Conv) under an entropy‑regularized router. The design enables adaptive capacity, incremental expansion (adding new branches post‑pretrain), and targeted specialization (e.g. translation modules) without full retraining from scratch.

TL;DR

| Feature | Description |
|---|---|
| Architecture | 24‑layer causal Transformer (RoPE, untied embeddings, 551M params) |
| Polymorphic MLP | Soft routing over 3 base branches (SwiGLU, GLU, DepthwiseConv) |
| Routing Control | Depth-scaled temperature (√depth) + entropy maximization |
| Precision | BF16 with FP32 LayerNorm for stability |
| Positional Encoding | Rotary (RoPE, θ=10,000) |
| Dataset Mix | FinePDFs‑1B 50% • DCLM Baseline‑1B 30% • FineWeb‑Edu 20% |
| Context Length | 1024 (0–30k) → 2048 (30k–60k); 512 causes router collapse on 24L |
| Expansion | Add new branches (e.g. Translation) via lightweight migration + fine‑tune |

Installation

pip install -e .

Quick start:

from veronica import VeronicaConfig, VeronicaForCausalLM

cfg = VeronicaConfig(n_layer=24, num_funcs=3)  # base polymorphic setup
model = VeronicaForCausalLM(cfg)

Notes

  • The collection link aggregates additional samples (e.g., educational/web sources) used to complete the 50/30/20 composition.
  • Please refer to each dataset’s license/terms; FinePDFs is curated from public PDFs and is referenced, not redistributed here.

Total tokens target (example): ~60B. The composition balances semantic density (FinePDFs) and generality (DCLM) per codelion’s guidance.

Generation example:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # or your saved tokenizer
prompt = "The theory of relativity states that"
ids = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=64, do_sample=True, temperature=0.7, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))

Architecture Overview

High Level

Input Embeddings → [Block × N] → Untied LM Head
   Block: Pre-LN → Multi-Head Self-Attention (RoPE) → Residual → Pre-LN → Polymorphic MLP (Routing + Branch Fusion) → Residual

Dataset Citations

If you use these datasets or composition, please cite:

@misc{sharma2025billion,
  title  = {The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author = {Sharma, Asankhaya},
  year   = {2025},
  url    = {https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}

Related collection and datasets:


Polymorphic MLP

Per token & layer:

router_logits = Router(x)          # Linear → GELU → Linear
α = softmax(router_logits / τ)
branches = [SwiGLU(x), GLU(x), DepthwiseConvMLP(x)]
output = Σ α_i * branch_i(x)
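A minimal PyTorch sketch of this routing, assuming the branch modules already exist and share the hidden size (the class name PolymorphicMLP and the Linear → GELU → Linear router follow the pseudocode above; the shipped layer may differ in detail):

import torch
import torch.nn as nn

class PolymorphicMLP(nn.Module):
    """Soft mixture of branches under a temperature-scaled router (illustrative sketch)."""
    def __init__(self, hidden_size: int, branches: list, router_hidden: int = 128):
        super().__init__()
        self.funcs = nn.ModuleList(branches)           # e.g. [SwiGLU, GLU, DepthwiseConvMLP]
        self.router = nn.Sequential(                   # Linear → GELU → Linear, as above
            nn.Linear(hidden_size, router_hidden),
            nn.GELU(),
            nn.Linear(router_hidden, len(branches)),
        )

    def forward(self, x: torch.Tensor, tau: float = 1.0):
        logits = self.router(x)                        # (B, T, num_funcs)
        alpha = torch.softmax(logits / tau, dim=-1)    # per-token soft routing weights
        outs = torch.stack([f(x) for f in self.funcs], dim=-1)  # (B, T, H, num_funcs)
        out = (outs * alpha.unsqueeze(-2)).sum(dim=-1)           # Σ α_i · branch_i(x)
        return out, alpha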

Routing stabilized by:

  • Temperature schedule (τ high early → softer mixing)
  • Entropy-max aux-loss (subtract entropy from total loss to maximize it)
  • Optional forcing during warmup to guarantee gradient flow to new branches
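As a concrete illustration of the first two stabilizers, a hedged sketch of a linear τ schedule with freeze and the entropy-max auxiliary term (values taken from the 24L training table below; the actual training script may implement these differently):

import torch

def router_tau(step, tau_start=2.2, tau_end=1.4, freeze_steps=6000, total_steps=60000):
    """Hold tau at tau_start during the freeze, then decay linearly to tau_end."""
    if step < freeze_steps:
        return tau_start
    frac = min(1.0, (step - freeze_steps) / max(1, total_steps - freeze_steps))
    return tau_start + frac * (tau_end - tau_start)

def routing_entropy(alpha, eps=1e-9):
    """Mean normalized routing entropy in [0, 1] over tokens."""
    h = -(alpha * (alpha + eps).log()).sum(dim=-1)
    return (h / torch.log(torch.tensor(float(alpha.size(-1))))).mean()

# total_loss = ce_loss - aux_weight * routing_entropy(alpha)   # subtracting maximizes entropy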

Branch Types

| Branch | Purpose | Structure |
|---|---|---|
| SwiGLU | Smooth gated MLP | Linear(up 2×) → split → SiLU × gate → Linear(down) |
| GLU | Alternative gating dynamics | Linear(up 2×) → split → Sigmoid × gate → Linear(down) |
| DepthwiseConv | Local token patterns | Depthwise causal conv (k=3) → expand → GELU → contract |
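A sketch of the three base branches as described in the table above (dimensions, defaults, and class names are illustrative and may not match the shipped modules exactly):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, hidden_size: int, mlp_mult: float = 4.0):
        super().__init__()
        inner = int(hidden_size * mlp_mult)
        self.up = nn.Linear(hidden_size, 2 * inner)    # up-project 2×, then split
        self.down = nn.Linear(inner, hidden_size)

    def forward(self, x):
        a, g = self.up(x).chunk(2, dim=-1)
        return self.down(F.silu(a) * g)                # SiLU × gate

class GLU(nn.Module):
    def __init__(self, hidden_size: int, mlp_mult: float = 4.0):
        super().__init__()
        inner = int(hidden_size * mlp_mult)
        self.up = nn.Linear(hidden_size, 2 * inner)
        self.down = nn.Linear(inner, hidden_size)

    def forward(self, x):
        a, g = self.up(x).chunk(2, dim=-1)
        return self.down(torch.sigmoid(g) * a)         # Sigmoid × gate

class DepthwiseConvMLP(nn.Module):
    def __init__(self, hidden_size: int, mlp_mult: float = 4.0, kernel_size: int = 3):
        super().__init__()
        # Depthwise conv; trimming the right-hand padding keeps it causal
        self.conv = nn.Conv1d(hidden_size, hidden_size, kernel_size,
                              groups=hidden_size, padding=kernel_size - 1)
        inner = int(hidden_size * mlp_mult)
        self.expand = nn.Linear(hidden_size, inner)
        self.contract = nn.Linear(inner, hidden_size)

    def forward(self, x):                              # x: (B, T, H)
        h = self.conv(x.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)  # trim right pad
        return self.contract(F.gelu(self.expand(h)))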

Positional Encoding

Rotary embeddings (RoPE) applied to Q/K heads with cached cos/sin; no absolute learned positions.
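A compact sketch of rotary application with cached cos/sin, assuming Q/K of shape (B, heads, T, head_dim) and θ=10,000 (the model's own rotary module may cache and broadcast differently):

import torch

def rope_cos_sin(seq_len, head_dim, theta=10_000.0, device="cpu"):
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2, device=device).float() / head_dim))
    pos = torch.arange(seq_len, device=device).float()
    freqs = torch.outer(pos, inv_freq)                  # (T, head_dim/2)
    emb = torch.cat([freqs, freqs], dim=-1)             # (T, head_dim)
    return emb.cos(), emb.sin()                         # cached once per sequence length

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(q, k, cos, sin):
    # q, k: (B, n_heads, T, head_dim); cos/sin broadcast over batch and heads
    cos, sin = cos[None, None], sin[None, None]
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin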

Stability Choices

| Mechanism | Rationale |
|---|---|
| FP32 LayerNorm | Prevent BF16 precision drift |
| Entropy-Max Aux | Avoid early router collapse |
| High initial τ | Encourage exploration across branches |
| Gradient Checkpointing | Memory efficiency for depth |

Dataset Mixture (codelion / DataComp inspired)

Training uses a curated blend guided by open mixture studies:

| Source | Share | Notes |
|---|---|---|
| FinePDFs | 50% | Technical & academic PDFs (higher semantic density) |
| DCLM Baseline | 30% | General web corpus (DataComp LM baseline) |
| FineWeb‑Edu | 20% | Educational domain for structured explanatory patterns |

Total tokens target (example): ~60B (adjustable). The composition balances semantic density vs generality, echoing codelion’s optimal ratio analyses.


Training Setup

| Hyperparameter | Value (example) |
|---|---|
| Layers | 24 |
| Hidden size | 768 |
| Heads | 12 |
| MLP mult | 4.0 |
| Batch (per device) | 4 |
| Grad Accumulation | 8 (effective batch 32) |
| LR | 1.2e-4, cosine decay |
| Warmup | 10% of steps |
| Weight Decay | 0.01 |
| Label Smoothing | 0.01 |
| Precision | BF16 + FP32 LayerNorm |
| Max Seq Len | 1024 → 2048 (curriculum) |
| Router τ | 2.2 → 1.4 (frozen for first 6k steps, depth-scaled) |
| Aux weight λ | 0.008 → 0.016 (depth-scaled ×√2) |
| Router forcing | 10% prob for first 5k steps |
| Rep penalty (α) | 0.05 (smoke quality) |

Launch:

python scripts/train_veronica.py \
  --config configs/veronica-pretrain-24L.json \
  --dataset_paths data/mix_optimal_50_30_20 \
  --output_dir runs/veronica-pretrain-24L \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --max_steps 60000 \
  --learning_rate 1.2e-4 \
  --warmup_ratio 0.10 \
  --weight_decay 0.01 \
  --max_seq_len 1024 \
  --router_tau_start 2.2 --router_tau_end 1.4 --router_tau_freeze_steps 6000 \
  --router_aux_start 0.008 --router_aux_end 0.016 \
  --router_force_prob 0.10 --router_force_warmup_steps 5000 \
  --rep_alpha 0.05 \
  --seed 42

Critical Discovery: Context Length & Router Stability on Deep Models

The 512 Token Trap (24L Only)

Finding: With 24 layers, starting training at 512 context length causes router collapse by step 3k:

Step 3000: alpha=[0.73, 0.14, 0.12], entropy=0.70 (UNHEALTHY)

Root Cause:

  • With 512 tokens per sequence and 24 routing decisions per token (one per layer) → 12,288 routing examples per sequence
  • But distributed across 3 branches and 24 layers → each branch-layer combination receives only ~170 gradient samples
  • Insufficient signal for stable gradient descent on router parameters
  • Weak branches cannot recover from random initialization noise
  • Router collapses toward dominant branch to minimize aux loss conflict

Why This Doesn't Happen on 12L:

  • Same 512 tokens per sequence → 6,144 routing examples
  • Each branch-layer: ~170 samples (same as 24L)
  • But 12 layers = shorter gradient path → less noise accumulation
  • Router can stabilize before collapse

Solution: Start at 1024 for Deep Models

Corrected curriculum for 24L:

0–20k steps:   1024 tokens  ✅ 24,576 routing examples = stable gradients
20k–60k steps: 2048 tokens  🎯 49,152 examples = final quality

DO NOT use 512 ctx on 24L — this is an empirical hard constraint, not a performance optimization.

For 12L and shallower: 512→1024→2048 curriculum works fine.

Mathematical threshold: ~15–20 layers appears to be the crossover where 512 becomes unstable. Always use 1024 or higher for ≥24L.
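A back-of-the-envelope check of the routing-sample arithmetic above (plain arithmetic, no model code; note the per-branch-layer count depends only on sequence length, which is why depth matters through gradient-path noise rather than sample count):

def routing_samples(seq_len, n_layers, n_branches=3):
    decisions = seq_len * n_layers                             # one routing decision per token per layer
    per_branch_layer = decisions / (n_branches * n_layers)     # = seq_len / n_branches
    return decisions, per_branch_layer

print(routing_samples(512, 24))    # (12288, ~170.7): too sparse to stabilize a 24L router
print(routing_samples(1024, 24))   # (24576, ~341.3): stable phase 1
print(routing_samples(2048, 24))   # (49152, ~682.7): final phase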


Depth Scaling for 24L (Mathematical Rationale)

With 24 independent routing decisions per token (one per layer), naive parameters from shallower models (12L) cause amplified specialization and branch collapse. We apply square-root depth scaling to maintain equivalent "softness" across architectures:

Temperature Scaling

Softmax sharpness compounds across layers. To preserve exploration:

τ_24L = τ_12L × √(24/12) = τ_12L × √2 ≈ τ_12L × 1.41

For 12L baseline τ=1.6, we use τ=2.2 for 24L (start) and τ=1.4 (end).

Aux Weight Scaling

Entropy gradient must compete with 24 layers pulling toward specialization:

λ_24L = λ_12L × √2 ≈ λ_12L × 1.41

For 12L baseline λ=0.005→0.012, we use λ=0.008→0.016 for 24L.

Forcing Probability

Each branch needs more examples across deeper network:

P_force_24L ≈ P_force_12L × (24/12) = 2 × P_force_12L

For 12L 5%, we use 10% for 24L during warmup (0–5k steps).
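The three scaling rules collected into one hedged helper (12L baseline values taken from the text; the assumed 12L end temperature of 1.0 comes from the standard schedule mentioned later, and the function name is illustrative):

import math

def scale_router_params(n_layers, base_layers=12,
                        tau=(1.6, 1.0), aux=(0.005, 0.012), force_prob=0.05):
    """Scale 12L router hyperparameters to a deeper model."""
    s = math.sqrt(n_layers / base_layers)                 # √(depth ratio) for tau and aux
    return {
        "tau_start": round(tau[0] * s, 2),                # 1.6·√2 ≈ 2.26 (2.2 used in practice)
        "tau_end": round(tau[1] * s, 2),                  # 1.0·√2 ≈ 1.41 (1.4 used)
        "aux_start": round(aux[0] * s, 4),                # 0.005·√2 ≈ 0.007 (0.008 used)
        "aux_end": round(aux[1] * s, 4),                  # 0.012·√2 ≈ 0.017 (0.016 used)
        "force_prob": force_prob * (n_layers / base_layers),  # linear in depth: 5% → 10%
    }

print(scale_router_params(24))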

Empirical Results (Training Logs)

  • Step 300: Entropy 1.00, perfect uniform distribution [0.33, 0.33, 0.33]
  • Step 5k: Entropy 0.73, healthy distribution [0.71, 0.11, 0.18]
  • Step 7k: Entropy 0.80–0.93 (exploration phase post tau-freeze)
  • Step 10k: Loss ~34, no branch collapse
  • Step 11k (post branch-1 recovery): Entropy 0.84–0.93, distribution [0.57, 0.15, 0.27]
  • Step 12k: Stable soft routing, eval loss 4.07

Router Health Metrics

Monitor log lines:

[router] alpha=[a0, a1, a2, ...] entropy_norm=E
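For reference, entropy_norm in these lines can be reproduced directly from the alpha vector (a sketch; the training script's exact reduction over tokens may differ):

import math

def entropy_norm(alpha):
    """Normalized entropy of a routing distribution: 1.0 = uniform, 0.0 = fully collapsed."""
    h = -sum(a * math.log(a) for a in alpha if a > 0)
    return h / math.log(len(alpha))

print(round(entropy_norm([0.571, 0.153, 0.276]), 3))   # ≈ 0.876, matching the observed distribution below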

Targets by Training Phase

| Phase | Steps | Entropy Target | Min Branch Share | Notes |
|---|---|---|---|---|
| Warmup | 0–5k | ≥0.90 | ≥0.25 | Forcing active, near-uniform |
| Post-freeze | 5k–10k | ≥0.75 | ≥0.12 | Specialization begins |
| Stable | 10k+ | ≥0.70 | ≥0.15 | Soft routing converged |
| Final | 40k–60k | ≥0.65 | ≥0.12 | Acceptable specialization |

Observed Distribution (24L, Step 12k)

alpha=[0.571, 0.153, 0.276]  entropy_norm=0.876

Ideal soft routing: dominant branch ~55–65%, minorities ~15–25% each.


Context Length Curriculum

Architecture-Dependent Strategy

For 24L (≥20L in general):

1024 tokens: Steps 0–20k   (NO 512 phase — causes router collapse)
2048 tokens: Steps 20k–60k

For 12L and shallower:

512 tokens:  Steps 0–10k
1024 tokens: Steps 10k–30k
2048 tokens: Steps 30k–60k

Phase 1 (24L): 1024 Tokens (Steps 0–20k)

  • Purpose: Router stability + pattern learning (REQUIRED for 24L from step 0)
  • VRAM: ~8–9GB (batch=4, accum=8)
  • Throughput: ~8–10 sec/step
  • Why not 512: Insufficient routing examples cause branch collapse by 3k steps

Phase 2 (24L): 2048 Tokens (Steps 20k–60k)

  • Purpose: Final capacity, long-document coherence
  • VRAM: ~12–13GB (batch=4, accum=8)
  • Switching criteria: Stable routing on 1024 (entropy ≥0.75, branches ≥0.15)
  • Expected dip: Temporary entropy −0.02–0.04, recovers within 500 steps

Note: the VRAM figures above assume BF16 training.

Switching Template

python scripts/train_veronica.py \
  --resume_from runs/veronica-24L-1024/checkpoint-12000 \
  --output_dir runs/veronica-24L-2048 \
  --max_seq_len 2048 \
  # ... keep all other router params unchanged

Incremental Expansion (Add New Branch Post‑Pretrain)

Goal: Increase capacity or add a specialization (e.g. translation) without a full restart. A consolidated sketch follows the numbered steps below.

Steps

  1. Load original checkpoint + config:
    cfg = VeronicaConfig.from_pretrained(old_dir)
    old_funcs = cfg.num_funcs
    cfg.num_funcs = old_funcs + 1  # adding one branch
    model = VeronicaForCausalLM.from_pretrained(old_dir, config=cfg, ignore_mismatched_sizes=True)
    
  2. Implement new branch class (see Translation branch below) and extend PolymorphicMLP construction.
  3. Keep the existing router weights and initialize the new output slice with small values:
    import torch, torch.nn as nn
    for blk in model.blocks:
      lin = blk.mlp.router[-1]  # final Linear
      with torch.no_grad():
        # existing weights remain; new slice initialized
        nn.init.normal_(lin.weight[old_funcs:], mean=0.0, std=0.02)
        if lin.bias is not None:
          nn.init.zeros_(lin.bias[old_funcs:])
    
  4. Freeze old branches & attention for warmup:
    for name, p in model.named_parameters():
      if "funcs.%d" % (old_funcs) in name or "router.2" in name:  # new branch + router final layer
        p.requires_grad = True
      else:
        p.requires_grad = False
    
  5. High τ + light forcing (0–1k steps): router_tau_start=1.8, router_force_prob≈0.15.
  6. Blend phase (1–3k steps): unfreeze old branches, lower τ → 1.2, increase aux to mid (e.g. 0.006).
  7. Stabilize: restore standard schedule (τ→1.0, aux→0.01), disable forcing.
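A consolidated sketch of steps 1–4 (the checkpoint path is an example, and attribute names such as model.blocks and mlp.router follow the snippets above; they are assumptions about the actual module layout):

import torch
import torch.nn as nn
from veronica import VeronicaConfig, VeronicaForCausalLM

old_dir = "runs/veronica-pretrain-24L/checkpoint-60000"   # example checkpoint path

# 1) Expand the config by one branch and reload with size-mismatch tolerance
cfg = VeronicaConfig.from_pretrained(old_dir)
old_funcs = cfg.num_funcs
cfg.num_funcs = old_funcs + 1
model = VeronicaForCausalLM.from_pretrained(old_dir, config=cfg, ignore_mismatched_sizes=True)

# 2/3) Initialize only the router's new output slice; existing rows keep their trained weights
for blk in model.blocks:                      # assumed attribute holding the transformer blocks
    lin = blk.mlp.router[-1]                  # final Linear of the router MLP
    with torch.no_grad():
        nn.init.normal_(lin.weight[old_funcs:], mean=0.0, std=0.02)
        if lin.bias is not None:
            nn.init.zeros_(lin.bias[old_funcs:])

# 4) During warmup, train only the new branch and the router's final layer
new_branch_tag = "funcs.%d" % old_funcs
for name, p in model.named_parameters():
    p.requires_grad = new_branch_tag in name or "router.2" in name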

Recommended Minimal Fine‑Tune Command

python scripts/train_veronica.py \
  --config expanded-config.json \  # updated num_funcs
  --resume_from runs/veronica-pretrain-24L/checkpoint-60000 \
  --output_dir runs/veronica-expand-translation \
  --max_steps 8000 \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --learning_rate 8e-5 \
  --router_tau_start 1.8 --router_tau_end 1.2 --router_tau_freeze_steps 1500 \
  --router_aux_start 0.001 --router_aux_end 0.008 \
  --router_force_prob 0.15 --router_force_warmup_steps 1200

Translation Specialization Branch

Add a branch focusing on cross‑lingual adaptation without retraining entire backbone.

Design Goals

| Requirement | Implementation Choice |
|---|---|
| Lightweight | Low‑rank adapters + language conditioning |
| Reusable | Shares main hidden size; no separate encoder |
| Controllable | Can be forced via force_func for targeted tuning |

Example Branch Implementation

from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F

class TranslationBranch(nn.Module):
  def __init__(self, hidden_size: int, mlp_mult: float = 2.0, rank: int = 64, num_langs: int = 16):
    super().__init__()
    self.rank = rank
    self.lang_embed = nn.Embedding(num_langs, hidden_size)
    inner = int(hidden_size * mlp_mult)
    self.up = nn.Linear(hidden_size, inner)
    self.down = nn.Linear(inner, hidden_size)
    # Low-rank adapters
    self.A = nn.Linear(hidden_size, rank, bias=False)
    self.B = nn.Linear(rank, hidden_size, bias=False)
    self.gate = nn.Linear(hidden_size, 1)

  def forward(self, x: torch.Tensor, lang_ids: Optional[torch.Tensor] = None) -> torch.Tensor:
    # x: (B, T, H); lang_ids: (B,) sentence-level or (B, T) token-level
    if lang_ids is not None:
      if lang_ids.dim() == 1:  # broadcast sentence-level language embedding
        lang_vec = self.lang_embed(lang_ids).unsqueeze(1)  # (B, 1, H)
      else:
        lang_vec = self.lang_embed(lang_ids)               # (B, T, H)
      x = x + lang_vec
    # Main MLP path
    h = self.down(F.gelu(self.up(x)))
    # Low-rank adapter path, gated per token
    a = self.B(F.gelu(self.A(x)))
    g = torch.sigmoid(self.gate(x))  # (B, T, 1)
    return h + g * a

Integrate Into PolymorphicMLP

Inside branch construction:

if num_funcs >= 4:
  funcs.append(TranslationBranch(hidden_size, mlp_mult=2.0))

Passing Language IDs

  • Add lang_ids to model forward signature (optional).
  • Modify the branch call so TranslationBranch receives lang_ids (func(x, lang_ids=lang_ids)) while branches that don't expect it are called unchanged (see the dispatch sketch below).
  • For multilingual fine‑tune, prepend special language tokens or maintain a side tensor of language indices.
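One possible way to dispatch lang_ids only to branches that accept it (a sketch; the real PolymorphicMLP forward may handle this differently):

import inspect

def call_branch(func, x, lang_ids=None):
    """Pass lang_ids only to branches whose forward declares it (e.g. TranslationBranch)."""
    if lang_ids is not None and "lang_ids" in inspect.signature(func.forward).parameters:
        return func(x, lang_ids=lang_ids)
    return func(x)

# inside PolymorphicMLP.forward:
#   outs = torch.stack([call_branch(f, x, lang_ids) for f in self.funcs], dim=-1)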

Fine‑Tuning Strategy

  1. Collect multilingual parallel / monolingual corpora (e.g. FLORES, WikiMatrix, OSCAR subset).
  2. Freeze base transformer + existing branches initially.
  3. Force translation branch (force_func = translation_index) for exploratory steps.
  4. Gradually unfreeze attention + other branches for joint adaptation.
  5. Evaluate on BLEU / COMET vs baseline; adjust rank / mlp_mult if underfitting.

Evaluation & Monitoring

| Metric | Purpose |
|---|---|
| CE / PPL | Language modeling convergence |
| Router Entropy | Diversity of branch usage |
| Alpha Distribution | Detect collapse or dominance |
| Translation BLEU (if added) | Cross-lingual quality |

Limitations

| Area | Limitation |
|---|---|
| Alignment | Base LM (no RLHF / instruction tuning) |
| Multilingual | Requires added translation branch + fine‑tune |
| Safety | No filtering; may reproduce dataset biases |
| Interpretability | Router decisions not fully explainable |

Router Stability (Important)

Dynamic soft‑routing is powerful but sensitive. The training methodology has been refined through empirical testing on 24L to ensure healthy branch growth.

Known Issues & Solutions

| Issue | Symptom | Solution |
|---|---|---|
| Early collapse | Branch <10% by 3k steps | Increase tau_start (2.2→2.4), extend freeze (6k→8k) |
| Post-freeze oscillation | Entropy spikes 0.75→0.95 | Expected; aux pushes exploration. Monitor for 500 steps. |
| Weak branch stagnation | Branch <12% after 10k | Targeted forcing: --force_branch_idx X --force_branch_until +1000, aux=0 during window |
| Adaptive forcing loops | Repeated forced windows | Do not use adaptive forcing; rely on aux + tau only |

Failed Experiment: Adaptive Forcing (DO NOT USE)

Attempted solution: auto-detect weak branches (share below a threshold) and dynamically apply forcing windows.

# BROKEN CODE — DO NOT USE
if min(alpha) < 0.15 and not in_cooldown:
    weak_idx = argmin(alpha)
    force_branch_idx = weak_idx
    force_until = current_step + 1000
    in_cooldown = True

Why it failed:

  1. Cascade loops: Forcing branch A → weakens branch B → triggers forcing B → weakens A → infinite oscillation
  2. Artificial alpha: During forced windows, alpha reflects forcing distribution [0,0,1], not learned preferences
  3. Gradient confusion: Aux loss receives artificial entropy signals, disrupts learning
  4. Manual intervention superior: Targeted forcing with aux=0 isolates signal cleanly

Lesson: Router needs consistent pressure (tau + aux), not reactive intervention. Manual forcing for recovery only, not automated.

Safeguards Implemented (Validated)

  1. Depth-scaled parameters: τ and λ scaled by √(depth_ratio) to maintain effective softness
  2. Extended freeze: Tau held constant for 6k steps (10% of training) to prevent premature specialization
  3. Entropy-max loss: Subtract (not add) aux_loss to maximize branch diversity
  4. Warmup forcing: 10% probability during first 5k steps ensures all branches receive gradients
  5. FP32 LayerNorm: Prevents BF16 precision drift in routing logits
  6. NO adaptive forcing: Rely on tau/aux scheduling + manual intervention when needed

Intervention Playbook (Step-by-Step)

Scenario: Branch drops <10% before 5k steps

  1. Stop training, resume from last good checkpoint
  2. Increase --router_tau_start by +0.2 (e.g., 2.2→2.4)
  3. Extend --router_tau_freeze_steps by +2000
  4. Increase --router_force_prob to 0.12–0.15

Scenario: Branch stuck <12% after 10k steps

  1. Run targeted forcing (see Incremental Expansion section)
  2. Force weak branch for 1k steps with aux=0, LR=5e-5
  3. Resume normal training with aux restored
  4. Expected recovery: +3–8% share within 500 steps

Scenario: Entropy <0.70 and falling after 15k

  1. Increase --router_aux_end by +0.002 (e.g., 0.016→0.018)
  2. Consider raising --router_tau_end slightly (1.4→1.5) to slow sharpening

Fine‑Tuning Note

If using standard HF Trainer without custom loss, set router_aux_weight=0 in config to avoid incorrect gradient direction. Use scripts/train_veronica.py for full entropy-max support.
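For example, assuming the config exposes router_aux_weight as the note implies (the field name and checkpoint path are illustrative):

from veronica import VeronicaConfig, VeronicaForCausalLM

ckpt = "runs/veronica-pretrain-24L/checkpoint-60000"
cfg = VeronicaConfig.from_pretrained(ckpt)
cfg.router_aux_weight = 0.0      # plain HF Trainer: avoid applying the aux term with the wrong sign
model = VeronicaForCausalLM.from_pretrained(ckpt, config=cfg)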

Empirical Training Log (24L Complete Journey)

First attempt (FAILED — 512 ctx):

  • Step 0–300: Perfect init (entropy 1.0) with high tau + forcing
  • Step 3000: Router collapse — alpha=[0.73, 0.14, 0.12], entropy 0.70 ❌
  • Diagnosis: 512 ctx insufficient for 24L depth
  • Action: Abandoned run, restarted from scratch with 1024 ctx

Adaptive forcing experiment (FAILED):

  • Implementation: Auto-detect weak branches, dynamic forcing windows
  • Outcome: Cascade loops, artificial alpha patterns [0,0,1], gradient confusion
  • Action: Reverted code, relied on tau/aux only

Final successful run (1024 ctx from step 0):

  • Step 0–300: Perfect uniformity (entropy 1.0), high tau (2.2) + 10% forcing
  • Step 1000: Loss 87→52, entropy 0.92, balanced [0.39, 0.32, 0.29]
  • Step 3000: Loss 41, entropy 0.73, distribution [0.71, 0.13, 0.16] (healthy)
  • Step 5000: Loss 37, forcing disabled, entropy 0.72 maintained
  • Step 6000: Tau unfreezes (2.2→1.4 schedule begins)
  • Step 6000-7000: Entropy spikes 0.80→0.93 (exploration phase, expected)
  • Step 10000: Loss 34, branch 1 weakened to ~10% (concern threshold)
  • Intervention: Targeted forcing on branch 1 (10k→11k steps)
    • --force_branch_idx 1 --force_branch_until 11000
    • --router_aux_start 0.0 (isolate gradient signal)
    • --learning_rate 5e-5 (gentle nudge)
  • Step 11000: Branch 1 recovered to 15%, entropy 0.84–0.93 ✅
  • Step 12000: Stable soft routing [0.57, 0.15, 0.27], entropy 0.876
    • Eval loss 4.41→4.07 (intervention improved generalization)
    • Loss trend: 34→33 (continued healthy descent)
    • All branches active and contributing

Key learnings:

  1. ✅ 1024 ctx required from step 0 for 24L
  2. ✅ Depth-scaled tau/aux/forcing parameters validated
  3. ✅ Targeted forcing (aux=0, short window) effective for recovery
  4. ❌ Adaptive forcing causes more problems than it solves
  5. ✅ Entropy 0.84–0.93 with min branch 15% = healthy soft routing

Status: Methodology validated on 24L/551M through 12k steps. Ready for 2048 ctx phase (30k+). Core API stable; default schedules proven effective.


Practical Training Tips

DO

  • ✅ Use 1024 ctx from step 0 for 24L models (512 causes router collapse)
  • ✅ Scale tau/aux with √(depth_ratio) when changing layer count
  • ✅ Use depth-scaled forcing probability (10% for 24L vs 5% for 12L)
  • ✅ Freeze tau for ~10% of total training steps (6k for 60k total)
  • ✅ Monitor entropy every 100 steps; save checkpoints every 500
  • ✅ Apply targeted forcing (aux=0, short window) for weak branches after 10k
  • ✅ Keep aux weight increasing throughout training (e.g., 0.008→0.016)
  • ✅ Trust depth-scaled parameters — they're empirically validated

DON'T

  • ❌ Use 512 ctx on 24L (causes collapse by 3k steps — empirically proven)
  • ❌ Implement adaptive forcing (causes cascade loops and artificial alpha)
  • ❌ Lower tau too aggressively (<1.2 for 24L can cause collapse)
  • ❌ Set aux=0 for normal training (only during targeted forcing windows)
  • ❌ Switch context length without verifying entropy stability (≥0.72 for 1k steps)
  • ❌ Expect perfect uniformity throughout training (soft routing allows specialization)
  • ❌ Panic if entropy spikes post tau-freeze (oscillation is expected; monitor 500 steps)
  • ❌ Use curriculum 512→1024→2048 on deep models (≥20L requires 1024 start)

VRAM Optimization

If hitting OOM on 2048 ctx:

--per_device_train_batch_size 2 \
--gradient_accumulation_steps 16  # keeps effective batch = 32

Quick Health Check (Per 1k Steps)

grep "\[router\]" logs/train.log | tail -10

Look for:

  • Entropy trend (should be ≥0.70)
  • Min branch value (should be ≥0.12)
  • Loss trend (should decrease or stabilize)
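Equivalently, a small parsing sketch for the same check (the log format is assumed to match the [router] line shown earlier):

import re

pattern = re.compile(r"\[router\] alpha=\[([^\]]+)\] entropy_norm=([0-9.]+)")

with open("logs/train.log") as f:
    for line in f:
        m = pattern.search(line)
        if not m:
            continue
        alpha = [float(a) for a in m.group(1).split(",")]
        entropy = float(m.group(2))
        if entropy < 0.70 or min(alpha) < 0.12:
            print("WARNING:", line.strip())   # below the health targets listed above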

Roadmap

| Version | Goal |
|---|---|
| v0.1 | Core polymorphic MLP + tests |
| v0.2 | Router logging + entropy regularization |
| v0.3 | Channel attention option |
| v0.4 | FlashAttention integration |
| v0.5 | Expansion utilities (branch migration helpers) |
| v0.6 | Translation branch reference implementation |

Current status: between v0.2 and v0.3.

Contributing

PRs welcome for: new branch types, expansion helpers, multilingual adapters, evaluation scripts.


License

Apache-2.0


Citation

@misc{veronica-2025,
  title={Veronica: Entropy-Regularized Polymorphic Branching for Adaptive Language Modeling},
  author={Emanuele D'Angelo (GG-Ally)},
  year={2025},
  howpublished={\url{https://huggingface.co/MhaWay/Veronica}}
}

Acknowledgments

  • Mixture & routing concepts inspired by Switch Transformer, GLaM, MoE literature.
  • Dataset composition ratios guided by codelion’s DataComp LM mixture studies.
  • RoPE adaptation referencing GPT-NeoX implementation details.

FAQ

Q: Why entropy-max instead of load-balancing penalty?
To avoid premature specialization and keep new branches trainable; diversity pressure is maintained by an increasing aux weight schedule.

Q: Can I add many branches at once?
Incremental addition (3→4→5) is recommended to prevent starvation of new branches.

Q: How to specialize for translation?
Add TranslationBranch, warmup with forced routing, then blended fine-tune with multilingual data.

Q: Does expansion erase prior knowledge?
No; existing branches retain weights. Router + new branch adapt during short fine‑tune.


Happy branching! 🌿
