---
license: apache-2.0
datasets:
- Allanatrix/Scientific_Research_Tokenized
language:
- en
base_model:
- Allanatrix/NexaSci
pipeline_tag: text-generation
tags:
- Science
- Hypothesis
- Methodology
---

# NexaSci Family of Models

## Welcome to the NexaSci Repository!

Get ready to supercharge your scientific research with the **NexaSci family of models**! This Hugging Face repository hosts a powerful suite of Mixture-of-Experts (MoE) models designed to generate hypotheses and methodologies across **physics**, **biology**, and **materials science**. Built with efficiency and scalability in mind, the NexaSci family includes the baseline **NexaSci-1-Mini**, the reasoning-enhanced **NexaSci-1-CoT**, and the long-context powerhouse **NexaSci-1-Max**. Whether you’re a researcher tackling complex STEM problems, a data scientist exploring scientific ML, or a student learning about domain-specific AI, this repository is your go-to resource for cutting-edge scientific computation.

## Model Overview

The NexaSci family is a 110 million to 2.2 billion parameter architecture that uses a **Semantic Router** to direct queries to domain-specific expert modules (Physics, Biology, Materials Science). It’s optimized for resource-constrained environments, leveraging advanced training strategies, hardware optimizations, and techniques like reinforcement learning and sparse attention. Below are the current and planned models:
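
To make the routing idea concrete, here is a minimal sketch of how a classifier-based router could dispatch a query to one of three domain experts. It is illustrative only: the zero-shot classifier stands in for the trained BERT-based Semantic Router, and the expert repo IDs are placeholders, not released checkpoints.

```
from transformers import pipeline

# Hypothetical setup: a zero-shot classifier stands in for the trained Semantic Router,
# and placeholder checkpoint names stand in for the T5-based expert modules.
DOMAINS = ["physics", "biology", "materials science"]
EXPERTS = {
    "physics": "your-username/nexasci-expert-physics",            # placeholder repo IDs
    "biology": "your-username/nexasci-expert-biology",
    "materials science": "your-username/nexasci-expert-materials",
}

router = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def route_and_generate(query: str) -> str:
    # 1. Classify the query into a domain (the router step).
    domain = router(query, candidate_labels=DOMAINS)["labels"][0]
    # 2. Only the selected expert is loaded/activated (the sparse MoE step).
    expert = pipeline("text2text-generation", model=EXPERTS[domain])
    return expert(query, max_length=200)[0]["generated_text"]

print(route_and_generate("Suggest a hypothesis for high-temperature superconductivity."))
```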

### 1. NexaSci-1-Mini (In development; indefinite timeline)
- **Parameters**: ~110 million
- **Purpose**: Generates hypotheses and methodological scaffolding for scientific tasks in physics, biology, and materials science.
- **Architecture**:
  - **Semantic Router**: BERT-based classifier routes queries to domain-specific experts.
  - **Expert Modules**: T5-based submodules for Physics, Biology, and Materials Science.
  - **Inference & Validation Pipeline**: Aggregates expert outputs and ensures consistency.
  - **Knowledge Feedback Loop**: Refines routing using reinforcement learning.
- **Training**:
  - Pretrained on ~2B tokens from arXiv, PubMed, and other scientific corpora.
  - Fine-tuned with QLoRA on 500k instruction-style samples.
  - Uses AzureSky Optimizer (Stochastic Approximation + Adam hybrid).
- **Use Cases**:
  - Generate plausible hypotheses (e.g., new material properties).
  - Suggest experimental methods (e.g., protein folding protocols).
  - Summarize scientific texts with domain-specific insights.

### 2. NexaSci-1-CoT (Coming Soon)
- **Parameters**: 756 million to 1.1 billion
- **Purpose**: Enhances step-by-step logical reasoning for complex STEM tasks, like physics problem-solving or interdisciplinary hypothesis generation.
- **Architecture**:
  - Adds a **Chain of Thought (CoT) Processor** with sparse attention (Longformer-style) for multi-step reasoning.
  - Includes **Conditional Routing** to engage the CoT Processor based on a “reasoning_required” flag (see the sketch after this section).
  - Integrates with expert modules for structured, logical outputs.
- **Training**:
  - Trained in three stages: Easy (basic logic), Moderate (complex tasks), Hard (advanced reasoning).
  - Uses ~2B tokens
  - Employs AzureSky Optimizer with reinforcement learning fine-tuning.
- **Use Cases**:
  - Solve multi-step physics problems (e.g., astrophysics simulations).
  - Generate detailed, logical methodologies (e.g., combining CFD and alloy modeling).
  - Teach scientific reasoning in educational settings.
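
As a rough illustration of the conditional routing described above, a prompt could be checked for the reasoning flag before the CoT path is engaged; the helper and prefix below are assumptions, not the released CoT Processor.

```
# Sketch of conditional routing on a "reasoning_required" flag (hypothetical helper).
COT_PREFIX = "Think step by step and justify each intermediate conclusion.\n"

def build_prompt(user_prompt: str) -> str:
    if "[reasoning_required]" in user_prompt:
        # Engage the CoT path: prepend explicit step-by-step instructions.
        return COT_PREFIX + user_prompt.replace("[reasoning_required]", "").strip()
    # Otherwise, the query goes straight to the domain expert.
    return user_prompt

print(build_prompt("[BIO] [reasoning_required] Propose a method to predict protein folding."))
```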

### 3. NexaSci-1-Max (Coming Soon)
- **Parameters**: ~2.2 billion
- **Purpose**: Processes large scientific documents (up to 20,000 tokens) with deep contextual understanding.
- **Architecture**:
  - Features a **Long Context Attention Layer** with two Flash Attention v2 layers for efficient long-sequence processing.
  - Includes a **Longform Context Manager** to chunk inputs while preserving semantic coherence (see the sketch after this section).
  - Scales parameters using mixed precision training and gradient checkpointing.
- **Training**:
  - Trained on ~2B tokens, including a Long-Context Corpus of full arXiv papers and NIH grants.
  - Uses AzureSky Optimizer with mixed precision (FP16/BF16) and gradient checkpointing.
- **Use Cases**:
  - Summarize or analyze long scientific papers (e.g., 120K-token preprints).
  - Generate hypotheses from extended contexts (e.g., patent methods).
  - Support multi-query tasks requiring deep document understanding.
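
The chunking behaviour of the Longform Context Manager might look roughly like this sketch; the window and overlap sizes are illustrative assumptions, and `tokenizer` is loaded as in the Usage section below.

```
# Sketch: split a long document into overlapping token windows so each chunk fits
# the model's context while neighbouring chunks still share some surrounding text.
def chunk_tokens(text, tokenizer, window=4096, overlap=256):
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks, start = [], 0
    while start < len(ids):
        chunks.append(tokenizer.decode(ids[start:start + window]))
        start += window - overlap
    return chunks

with open("arxiv_paper.txt") as f:
    chunks = chunk_tokens(f.read(), tokenizer)
print(len(chunks), "chunks")
```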

### Future Models (Planned)
- **NexaSci-1-Scout**: A lightweight model (~50M parameters) optimized for distilling and curating datasets and building the corpora for the model family.
- **NexaSci-1-Super**: A larger-scale model (~10B parameters) for advanced scientific tasks, using ~1B tokens. Planned for high-performance computing clusters.
- **NexaSci-1-MultiModal**: Integrates text, images, and graphs for scientific data analysis (e.g., protein structures, simulation plots). Planned for future research.

## Dataset and Training Details

The NexaSci family is trained on a **tiered token strategy** to maximize efficiency and domain specificity, as outlined in the architecture document:

- **Warm Start Corpus** (100M tokens): General language understanding from FineWeb-Edu, OpenWebMath, Wikipedia, and Aristo Science Questions.
- **Scientific Pretraining Corpus** (1-2B tokens): Domain-specific data from arXiv (physics), PubMed/BioRxiv (biology), and Materials Project/ChemRxiv (materials science).
- **Instruction Fine-Tune Dataset** (500K tokens): 5k high-quality instruction-style samples for hypothesis and method generation.

**Token Efficiency Strategies**:
- Entropy scoring to remove low-information samples (see the sketch after this list).
- Semantic tagging (e.g., [PHYS], [BIO], [MAT]) for domain routing.
- Distillation using larger models (e.g., GPT-4) to summarize and structure data.
- Routing and filtering to activate only relevant expert paths.
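
For illustration, the entropy-scoring and tagging steps could be approximated with a simple word-level entropy filter like the sketch below; the threshold and tag mapping are assumptions, and the production pipeline is more involved.

```
import math
from collections import Counter

def word_entropy(text):
    # Shannon entropy over the word distribution; low values indicate repetitive,
    # low-information samples (boilerplate, number tables, etc.).
    words = text.split()
    total = len(words)
    if not total:
        return 0.0
    counts = Counter(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def keep_sample(text, domain, threshold=3.0):
    # Drop low-information samples (arbitrary illustrative threshold),
    # then prepend the semantic tag used later for routing.
    if word_entropy(text) < threshold:
        return None
    tags = {"physics": "[PHYS]", "biology": "[BIO]", "materials": "[MAT]"}
    return f"{tags[domain]} {text}"

print(keep_sample("We measure the critical temperature of a doped cuprate superconductor under pressure.", "physics"))
```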

**Total Token Budget**:
~2B tokens across all models.

**Hardware**:
Hardware availability is currently limited; we are still sourcing additional compute.

**Optimization Techniques**:
- Sparse attention, mixed precision training, gradient checkpointing.
- Hyperparameter tuning with Optuna (sketched below), Just-in-Time (JIT) compilation, multi-threading.
- AzureSky Optimizer for efficient convergence.
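
As an example of the Optuna step, a study could tune a couple of training hyperparameters along these lines; the objective, parameter names, and ranges are placeholders rather than the actual search space.

```
import optuna

def objective(trial):
    # Placeholder search space; in practice this would wrap a short training run
    # and return a validation loss.
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    warmup = trial.suggest_int("warmup_steps", 0, 1000)
    val_loss = (lr * 1e4 - 1.0) ** 2 + warmup * 1e-5  # stand-in for a real train/eval loop
    return val_loss

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```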


## Download Models

Model weights are hosted on Hugging Face. Download them using the `transformers` library or directly from the repository’s model card.

Example: `huggingface-cli download your-username/nexamoe-base`
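
The same download can also be scripted from Python with `huggingface_hub`; a minimal sketch, using the same placeholder repo ID as above:

```
from huggingface_hub import snapshot_download

# Downloads all model files and returns the local cache path.
local_dir = snapshot_download(repo_id="your-username/nexamoe-base")
print("Model files downloaded to:", local_dir)
```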


## Usage

**Load a model**: Use the `transformers` library to load NexaSci models:

```
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-username/nexasci-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
```

**Generate hypotheses or methods**: Provide a prompt with optional domain tags:

```
prompt = "[PHYS] Suggest a hypothesis for dark matter detection."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**Use NexaSci-1-CoT for reasoning**: Enable the CoT Processor for step-by-step logic:

```
prompt = "[BIO] [reasoning_required] Propose a method to predict protein folding."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=500)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**Process long documents with NexaSci-1-Max**: Handle large inputs (up to 20,000 tokens):

```
with open("arxiv_paper.txt", "r") as f:
    document = f.read()
prompt = f"[MAT] Summarize this document: {document}"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=20000).to("cuda")
outputs = model.generate(**inputs, max_length=1000)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**Fine-tune with QLoRA**: Use the provided instruction dataset for fine-tuning, then train with your preferred trainer (e.g., the Hugging Face Trainer; a sketch follows below):

```
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

dataset = load_dataset("your-username/nexamoe-instruction-data")
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q", "v"])
model = get_peft_model(model, lora_config)
```
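
A minimal training sketch with the Hugging Face `Trainer` could look like the following; the dataset field name (`text`), sequence length, and training arguments are assumptions to adapt to the actual instruction data.

```
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

def tokenize(batch):
    # Assumes the instruction dataset exposes a "text" column; adjust to the real schema.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True,
                                 remove_columns=dataset["train"].column_names)

trainer = Trainer(
    model=model,  # the PEFT-wrapped model from the QLoRA snippet above
    args=TrainingArguments(
        output_dir="nexasci-qlora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        fp16=True,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```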

**Run inference via CLI or GUI**:

- Command-line: `python inference.py --model your-username/nexamoe-base --prompt "[PHYS] Hypothesise a new superconductor."`
- GUI: opens a web interface for interacting with the model.

## Performance Metrics

- **Extreme specialisation**: Modular experts improve response fidelity and interpretability.
- **Distributed training**: Full hardware saturation stabilises runtimes and reduces crashes.
- **Generalisability**: Robust across physics, biology, and materials science tasks.
- **Optimiser efficiency**: The AzureSky Optimizer enhances convergence speed and precision.

See the architecture document for detailed loss curves and metrics.

## Similar Models

Explore related models for inspiration:

- **Grok (xAI)**: General-purpose conversational AI with scientific capabilities.
- **LLaMA (Meta AI)**: Efficient research models for NLP tasks.
- **SciBERT**: BERT variant for scientific text processing.
- **Galactica (Meta AI)**: Scientific language model for paper summarisation.
- **BioBERT**: BERT variant for biomedical text.

## Citation

For the models, cite:

Allanatrix. (2025). NexaMOE Family of Models. Retrieved June 17, 2025.

## Acknowledgements

We thank the scientific and AI communities for advancing Mixture-of-Experts architectures and domain-specific LLMs. Special thanks to the authors of the datasets used (arXiv, PubMed, Materials Project) and the developers of tools like Transformers, PEFT, and Optuna.

For more information, see https://materialsproject.org/, https://arxiv.org/, and https://pubmed.ncbi.nlm.nih.gov/.

## License

MIT License (see the LICENSE file for details).

Have questions or ideas? Open an issue on GitHub or join the discussion on Hugging Face. Happy researching!