Update README.md
README.md (CHANGED)

@@ -13,17 +13,17 @@ tags:
- Methodology
---

# NexaSci Family of Models

## Welcome to the NexaSci Repository!

Get ready to supercharge your scientific research with the **NexaSci family of models**! This Hugging Face repository hosts a powerful suite of Mixture-of-Experts (MoE) models designed to generate hypotheses and methodologies across **physics**, **biology**, and **materials science**. Built with efficiency and scalability in mind, the NexaSci family includes the baseline **NexaSci**, the reasoning-enhanced **NexaSci-1-CoT**, and the long-context powerhouse **NEXA-1-Max**. Whether you're a researcher tackling complex STEM problems, a data scientist exploring scientific ML, or a student learning about domain-specific AI, this repository is your go-to resource for cutting-edge scientific computation.

## Model Overview

The NexaSci family spans 110 million to 2.2 billion parameters and uses a **Semantic Router** to direct queries to domain-specific expert modules (Physics, Biology, Materials Science). It is optimized for resource-constrained environments, leveraging advanced training strategies, hardware optimizations, and techniques such as reinforcement learning and sparse attention. Below are the current and planned models:
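The routing idea can be pictured with a short, illustrative sketch. This is not the NexaSci implementation (which is not published here); the class name, hidden size, and domain labels are assumptions chosen only to show top-1 gating over three expert domains.

```python
# Illustrative sketch only: module and expert names are placeholders,
# not the actual NexaSci routing code.
import torch
import torch.nn as nn


class SemanticRouter(nn.Module):
    """Toy top-1 router: scores a pooled query embedding against three domains."""

    def __init__(self, hidden_dim: int = 512, domains=("physics", "biology", "materials")):
        super().__init__()
        self.domains = domains
        self.gate = nn.Linear(hidden_dim, len(domains))  # one logit per expert domain

    def forward(self, query_embedding: torch.Tensor):
        logits = self.gate(query_embedding)          # (batch, num_domains)
        weights = torch.softmax(logits, dim=-1)      # routing probabilities
        top1 = weights.argmax(dim=-1)                # hard top-1 routing decision
        return top1, weights


router = SemanticRouter()
pooled = torch.randn(2, 512)                         # stand-in for pooled query embeddings
choice, probs = router(pooled)
print([router.domains[i] for i in choice.tolist()], probs.shape)
```

In a full MoE setup, the selected expert module would then process the query, and the routing weights could be refined by the feedback loop described below.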

### 1. NexaSci-1-Mini (still in progress; indefinite timeline)
- **Parameters**: ~110 million
- **Purpose**: Generates hypotheses and methodological scaffolding for scientific tasks in physics, biology, and materials science.
- **Architecture**:

@@ -32,8 +32,8 @@
  - **Inference & Validation Pipeline**: Aggregates expert outputs and ensures consistency.
  - **Knowledge Feedback Loop**: Refines routing using reinforcement learning.
- **Training**:
  - Pretrained on ~2B tokens from arXiv, PubMed, and other scientific corpora.
  - Fine-tuned with QLoRA on 500k instruction-style samples (see the sketch below).
  - Uses the AzureSky Optimizer (a Stochastic Approximation + Adam hybrid).
- **Use Cases**:
  - Generate plausible hypotheses (e.g., new material properties).
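The QLoRA step above can be sketched with the Hugging Face `peft` and `transformers` libraries. The base checkpoint, target modules, and hyperparameters below are placeholders (the README does not specify them); the point is only the shape of a 4-bit base model plus low-rank adapters.

```python
# Hedged sketch of QLoRA-style fine-tuning; requires a CUDA GPU with
# bitsandbytes installed. Nothing here is the actual NexaSci configuration.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base weights to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "gpt2",                                  # placeholder base model, not a NexaSci checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn"],               # GPT-2 attention projection; adjust per architecture
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)    # only the small LoRA adapters are trainable
model.print_trainable_parameters()
```

Only the LoRA adapter weights are updated during fine-tuning, which keeps memory usage close to inference-time requirements.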

@@ -49,7 +49,7 @@
  - Integrates with expert modules for structured, logical outputs.
- **Training**:
  - Trained in three stages: Easy (basic logic), Moderate (complex tasks), Hard (advanced reasoning); see the staged-training sketch below.
  - Uses ~2B tokens.
  - Employs the AzureSky Optimizer with reinforcement-learning fine-tuning.
- **Use Cases**:
  - Solve multi-step physics problems (e.g., astrophysics simulations).
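A minimal sketch of that three-stage curriculum is shown below. The stage names come from the list above, but the helper callables and epoch counts are hypothetical; the actual NexaSci training loop is not described in this README.

```python
# Minimal sketch of the Easy -> Moderate -> Hard curriculum described above.
# `load_stage_dataset` and `train_one_epoch` are hypothetical helpers, and the
# epoch counts are invented for illustration only.
from typing import Callable, Dict

CURRICULUM: Dict[str, int] = {
    "easy": 1,       # basic logic
    "moderate": 2,   # complex tasks
    "hard": 3,       # advanced, multi-step reasoning
}


def run_curriculum(load_stage_dataset: Callable, train_one_epoch: Callable, model):
    """Train on each stage in order, reusing the same model weights throughout."""
    for stage, epochs in CURRICULUM.items():
        dataset = load_stage_dataset(stage)          # stage-specific reasoning data
        for _ in range(epochs):
            model = train_one_epoch(model, dataset)  # ordinary fine-tuning pass per stage
    return model


# Tiny demo with stand-in callables:
trained = run_curriculum(lambda s: f"{s}-dataset", lambda m, d: m + 1, model=0)
print(trained)  # 6 epochs total across the three stages
```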

@@ -64,10 +64,10 @@
  - Includes a **Longform Context Manager** to chunk inputs while preserving semantic coherence (see the chunking sketch below).
  - Scales parameters using mixed-precision training and gradient checkpointing.
- **Training**:
  - Trained on ~2B tokens, including a Long-Context Corpus of full arXiv papers and NIH grants.
  - Uses the AzureSky Optimizer with mixed precision (FP16/BF16) and gradient checkpointing.
- **Use Cases**:
  - Summarize or analyze long scientific papers (e.g., 120K-token preprints).
  - Generate hypotheses from extended contexts (e.g., patent methods).
  - Support multi-query tasks requiring deep document understanding.
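As a rough illustration of what such a context manager might do, the sketch below splits a long document into overlapping chunks at paragraph boundaries. The function name, chunk size, and whitespace-based token count are assumptions, not the actual component.

```python
# Rough sketch of overlap-based chunking in the spirit of the Longform Context
# Manager; token counting here is a crude whitespace approximation.
from typing import List


def chunk_document(text: str, max_tokens: int = 2048, overlap: int = 128) -> List[str]:
    """Split a long document into overlapping chunks at paragraph boundaries."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        para_len = len(para.split())                 # whitespace "tokens"
        if current and current_len + para_len > max_tokens:
            chunks.append("\n\n".join(current))
            # carry the tail of the previous chunk forward to preserve coherence
            carry = " ".join("\n\n".join(current).split()[-overlap:])
            current, current_len = [carry], len(carry.split())
        current.append(para)
        current_len += para_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks


pieces = chunk_document("A long preprint...\n\nMethods section...\n\nResults...", max_tokens=50)
print(len(pieces))
```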

@@ -81,10 +81,8 @@
The NexaSci family is trained with a **tiered token strategy** to maximize efficiency and domain specificity, as outlined in the architecture document (a toy mixing sketch follows the list):

- **Warm Start Corpus** (100M tokens): General language understanding from FineWeb-Edu, OpenWebMath, Wikipedia, and Aristo Science Questions.
- **Scientific Pretraining Corpus** (1-2B tokens): Domain-specific data from arXiv (physics), PubMed/BioRxiv (biology), and Materials Project/ChemRxiv (materials science).
- **Instruction Fine-Tune Dataset** (500K tokens): 5k high-quality instruction-style samples for hypothesis and method generation.
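To make the tiering concrete, here is a toy configuration that turns the budgets listed above into relative sampling weights. The dictionary keys and the helper are invented for illustration; the 1.5B figure is simply the midpoint of the 1-2B range quoted above.

```python
# Toy representation of the tiered token budgets listed above; the helper is
# hypothetical and only the budget figures come from the text.
TOKEN_TIERS = {
    "warm_start":             100_000_000,    # FineWeb-Edu, OpenWebMath, Wikipedia, Aristo
    "scientific_pretraining": 1_500_000_000,  # midpoint of the 1-2B range above
    "instruction_finetune":   500_000,        # 5k instruction-style samples
}


def sampling_weights(tiers):
    """Convert absolute token budgets into relative sampling weights."""
    total = sum(tiers.values())
    return {name: budget / total for name, budget in tiers.items()}


print(sampling_weights(TOKEN_TIERS))
```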

**Token Efficiency Strategies**:
- Entropy scoring to remove low-information samples (see the filtering sketch below).
@@ -93,14 +91,10 @@
- Routing and filtering to activate only relevant expert paths.
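The entropy-scoring idea can be illustrated with a small filter. The README only names the technique, so the character-level entropy and the 3-bit threshold below are one plausible interpretation rather than the actual pipeline.

```python
# Hedged sketch of entropy-based filtering for low-information samples.
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of a text sample."""
    counts = Counter(text)
    total = len(text)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def filter_low_information(samples, min_bits: float = 3.0):
    """Drop samples whose character entropy falls below a threshold."""
    return [s for s in samples if char_entropy(s) >= min_bits]


samples = ["aaaaaaaaaaaaaaaa", "Band gaps in perovskite oxides vary with strain."]
print(filter_low_information(samples))  # the repetitive sample is removed
```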

**Total Token Budget**:
- Approximately 2B tokens across all models.

**Hardware**:
- Currently limited; we are still hunting for additional compute.

**Optimization Techniques**:
- Sparse attention, mixed-precision training, and gradient checkpointing (a short sketch of the latter two appears below).
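For readers unfamiliar with the last two techniques, here is a minimal PyTorch sketch combining BF16 autocast with activation checkpointing. The tiny MLP, shapes, and CPU device choice are placeholders; NexaSci's actual training stack is not shown in this README.

```python
# Minimal sketch of mixed precision (BF16 autocast) plus gradient checkpointing.
# The tiny MLP is a stand-in, not a NexaSci module.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))
head = nn.Linear(256, 10)
opt = torch.optim.AdamW(list(block.parameters()) + list(head.parameters()), lr=1e-4)

x = torch.randn(8, 256)
target = torch.randint(0, 10, (8,))

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    # checkpoint() recomputes the block's activations during backward to save memory
    hidden = checkpoint(block, x, use_reentrant=False)
    loss = nn.functional.cross_entropy(head(hidden), target)

loss.backward()
opt.step()
```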