I loved this part since it matches much of what is evolving in AI, Biology and Neuroscience: “Vibe Computer” framing flips that:
Prompting becomes programming, just in a pre-language era.
Tools become I/O and syscalls, not “plugins.”
Memory becomes state, not merely “context.” <-- This might also become timeline and I/O in the future. The memory we need later might be distilled state that matches human intelligence, e.g. better wiring through reuse and more efficient pathways.
Evaluation becomes replay and verification, not only benchmarks.
Great read! Thanks - Aaron
**HAO (Human AI Orchestra)** is a next-generation collaborative development framework that maximizes synergy between **human intuition** and the **diverse strengths of multiple LLMs**—turning the human from a “coder” into a **conductor**.
At its core is an **11-step workflow** you can run immediately: divergence for wild ideas → convergence into architecture → critique & voting → synthesis → blueprinting (Gantree) → prototyping (PPR) → cross-review → refinement → roadmap → implementation. The philosophy is intentionally **anti-standardization**, treats **conflict as a resource**, and keeps **orchestration** (human-in-control) as the center.
This repo includes the **developer manual** (with concrete prompt templates), plus real artifact histories from two full runs: **Dancing with Noise** and **Dancing with Time**.
I just released omarkamali/wikipedia-labels, with all the structural labels and namespaces from Wikipedia in 300+ languages. A gift for the data preprocessors and cleaners among us.
I've been busy working on some new ranking/position methodologies and am excited to start sharing some results.
Plot legends:
- X = truncation rate (low = good)
- ? = confusion rate (low = good)
- blue bars = average completion tokens (low = good)
- black diamonds = CI-banded performance (high = good)
- cluster squares = models inside this group are equivalent
openai/gpt-oss-120b remains the king in all dimensions of interest: truncation rates, completion lengths and performance. If I had but one complaint, it's that reason_effort does not seem to actually work - more on this soon.
Second is a 3-way tie in performance between the Qwen3-235B-2507 we all know and love and an unexpected entrant: ByteDance-Seed/Seed-OSS-36B-Instruct.
This is a very capable model and its reasoning effort controls actually work, but you should absolutely not leave them on the default "unlimited" - set a sensible limit (4k works well for 8k context length).
Third place is another 3-way tie, this one between Seed-OSS-36B (it straddles the CI boundary between 2nd and 3rd place), Qwen/Qwen3-Next-80B-A3B-Instruct (demonstrating that full attention may be overrated after all and gated is the way to go), and the newly released zai-org/GLM-4.7, which offers excellent across-the-board performance with some of the shortest reasoning traces I've seen so far.
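For readers wondering how a model can sit in two places at once: the post doesn't spell out how the equivalence clusters are computed, so the following is only a rough sketch of one plausible reading, where models whose confidence intervals overlap a tier leader's interval are treated as tied. The model names, scores, and the `overlaps`/`tiers` helpers are all hypothetical, not the actual methodology or data behind the plots.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    name: str
    mean: float  # point estimate of performance
    lo: float    # lower bound of the confidence interval
    hi: float    # upper bound of the confidence interval


def overlaps(a: Score, b: Score) -> bool:
    """Two closed intervals overlap when each starts before the other ends."""
    return a.lo <= b.hi and b.lo <= a.hi


def tiers(scores: list[Score]) -> list[list[str]]:
    """Group models into tiers of statistically indistinguishable results.

    Assumption: a tier is led by the best model not yet placed, and contains
    every model whose interval overlaps the leader's. A model can therefore
    appear in two adjacent tiers if its interval straddles the boundary.
    """
    ranked = sorted(scores, key=lambda s: s.mean, reverse=True)
    placed: set[str] = set()
    result: list[list[str]] = []
    for leader in ranked:
        if leader.name in placed:
            continue  # already belongs to a higher tier, so it cannot lead one
        group = [s.name for s in ranked if overlaps(s, leader)]
        placed.update(group)
        result.append(group)
    return result


# Made-up numbers purely for illustration.
EXAMPLE = [
    Score("model_a", 0.90, 0.88, 0.92),
    Score("model_b", 0.80, 0.78, 0.82),
    Score("model_c", 0.77, 0.75, 0.79),  # overlaps both model_b and model_d
    Score("model_d", 0.74, 0.72, 0.76),
    Score("model_e", 0.73, 0.71, 0.75),
]

if __name__ == "__main__":
    for rank, group in enumerate(tiers(EXAMPLE), start=1):
        print(rank, group)
```

With these made-up numbers, `model_c` lands in both the second and third tiers, which is the same kind of straddling described for Seed-OSS-36B. Other tie rules (pairwise tests, bootstrap rank intervals) would give different groupings.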
The list of hands-on notebooks (some beginner-friendly!) to get started with fine-tuning using TRL keeps growing!!
• SFT
• GRPO
• Tool calling & agents
• RL environments with OpenEnv
• LLMs and VLMs
✨ Many run on FREE Colab, making it super easy to get started fast!