DeepFabric: Generate, Train and Evaluate with Datasets curated for Model Behavior Training.


Introduction

Training language models to reason and call tools correctly is central to an Agent's success. The main reason Agents often fail to see the light of day in production is inefficient and incorrect tool calling. To call tools correctly, a model must use the correct function signatures, construct valid JSON parameters that conform to the schema, use the correct types (str, int, bool, etc.), and determine when to use which tool and, just as importantly, when not to use a tool at all. This requires training data with thousands of examples where every call is structurally correct; if it isn't, malformed examples teach malformed behavior and your Agent fails in production.
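
For concreteness, here is the kind of structural error that creeps into naively generated data. Both calls below are illustrative (a hypothetical get_weather tool, not taken from a real dataset); the second uses a parameter name that was never declared and the wrong type for units:

{"name": "get_weather", "arguments": {"city": "Berlin", "units": "celsius"}}
{"name": "get_weather", "arguments": {"location": "Berlin", "units": 3}}

A model trained on samples like the second one will happily reproduce those mistakes against your real API.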

When an Agent fails as a result of calling a function incorrectly, the failure typically becomes a software problem to be patched with post-processing and retry loops. These failures then compound, so a chain of tool calls in a multi-turn conversation becomes a debugging nightmare. The challenge intensifies further when training around newer, custom tools that are unlikely to be in the model's training data, such as a specific set of MCP servers that an agent will need to call reliably.

Fortunately, there is now an open source solution, developed with a community and born out of these same frustrations: a lack of training data, and of data generation tools capable of producing the kind of data required to train robust Agents.

DeepFabric

DeepFabric is a framework for generating the datasets needed to train model behavior for complex Agents. It uses novel algorithms to produce domain-specific yet diverse samples with low duplication, and it can take user-declared tools and generate thousands of structurally valid tool-calling training samples, complete with reasoning traces.

DeepFabric addresses the need for strict structure conformity through constrained decoding and type validation, both pre and post sample generation. During generation, it guides the LLM to produce tool calls that match your declared tool schemas exactly. After generation, it validates each sample to ensure all tool calls are structurally correct - parameter names match, types are valid, and required fields are present. If a sample fails validation, it gets regenerated. This loop continues until you have a dataset where every tool call is guaranteed to be valid against your tool definitions.
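
The exact pipeline lives inside DeepFabric, but the post-generation check is easy to picture. Here is a minimal sketch of the idea using the standard jsonschema package; the schema and tool_call structure are illustrative, not DeepFabric internals:

import json
from jsonschema import ValidationError, validate

# The "parameters" block of a declared tool, in OpenAI function-calling style.
tool_schema = {
    "type": "object",
    "properties": {
        "repository": {"type": "string"},
        "title": {"type": "string"},
        "body": {"type": "string"},
    },
    "required": ["repository", "title", "body"],
    "additionalProperties": False,
}

def call_is_valid(tool_call: dict) -> bool:
    """Return True if the generated call parses and matches the declared schema."""
    try:
        args = json.loads(tool_call["function"]["arguments"])
        validate(instance=args, schema=tool_schema)
    except (KeyError, json.JSONDecodeError, ValidationError):
        return False
    return True

# Any sample whose tool calls fail a check like this is sent back for regeneration.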

The result is training data that can be uploaded to Hugging Face and then used directly with apply_chat_template and TRL's trainers such as SFTTrainer, as well as the popular Unsloth training framework, with no post-processing required.

The same dataset can be used to split off an evaluation set that challenges the model to perform in line with its training data, giving you effective end-to-end training and evaluation for tool-calling agents.

Topic Trees & Graphs for Diversity and Low Duplication

Synthetic datasets have two failure modes that are surprisingly hard to avoid. The first is repetition - when you generate hundreds of samples from similar prompts, you end up with variations on the same few patterns. The model overfits to these patterns and fails to generalize. The second is drift - the generation process wanders away from your intended domain, producing samples that are technically valid but not relevant to the knowledge domain you're trying to train the Agent to perform in.

Most dataset generation tools struggle with this tradeoff. Push for more diversity and the samples drift off-topic. Constrain to the domain and you get repetitive patterns. This is where popular tools often fall short: they either produce high duplication within a narrow scope, or diverse samples that stray from the training objective.

They also require you to provide seed data upfront, which is not always possible or easy.

DeepFabric addresses these problems through topic structures. Before generating any training samples, it builds a hierarchical tree or graph of topics starting from a single root prompt (no upfront data required). So if you start with "Python programming", it branches into data structures, control flow, and functions. Each of those branches further into specific subtopics - data structures splits into lists, dictionaries, and sets; control flow splits into conditionals, loops, and exception handling. The branching continues until you reach the target depth.

["Python programming fundamentals", "variables and data types", "primitive data types"]
["Python programming fundamentals", "variables and data types", "composite data types"]
["Python programming fundamentals", "variables and data types", "type casting and conversion"]
["Python programming fundamentals", "functions and scope", "defining functions"]
["Python programming fundamentals", "functions and scope", "function arguments"]
["Python programming fundamentals", "functions and scope", "return values"]
["Python programming fundamentals", "control flow statements", "if-else statements"]
["Python programming fundamentals", "control flow statements", "for loops"]
["Python programming fundamentals", "control flow statements", "while loops"]

A sample topic tree for Python programming fundamentals with depth 3 and branching factor 3.

This structure solves the diversity-relevance tradeoff. Every topic in the tree traces back to your root prompt, so samples can't drift into unrelated domains. But because each leaf represents a unique path through the hierarchy, no two samples are generated from the same prompt. You get high diversity with low duplication, all while staying anchored to your intended training domain.

Each leaf in the tree becomes the basis for a training sample. With a depth of 3 and branching factor of 3, you get 27 unique leaf topics, each generating distinct examples. The samples are diverse because they're grounded in different parts of the topic space, not because you're adding random noise to the same prompt. And because the tree structure is deterministic, you can reproduce exactly the same topic coverage across runs.
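
Each JSONL line above is exactly one such root-to-leaf path. A small sketch of the idea, using a hypothetical in-memory nesting rather than DeepFabric's internal representation:

def leaf_paths(node, path=()):
    """Yield every root-to-leaf topic path from a nested {topic: subtopics} dict."""
    if not node:                      # an empty dict marks a leaf
        yield list(path)
        return
    for topic, children in node.items():
        yield from leaf_paths(children, path + (topic,))

tree = {"Python programming fundamentals": {
    "variables and data types": {"primitive data types": {}, "composite data types": {}},
    "functions and scope": {"defining functions": {}, "function arguments": {}},
}}

for p in leaf_paths(tree):
    print(p)  # each path seeds the prompt for a distinct batch of samples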

DeepFabric also supports topic graphs for more complex domain structures where concepts interconnect rather than branch hierarchically. But for most training scenarios, trees provide the right balance of coverage, diversity, and domain focus.

Reasoning Traces

A tool call without context is just syntax. To train a model that can generalize to new situations, you need to capture the reasoning that led to the tool selection. DeepFabric supports two reasoning styles that serve different training objectives.

Freetext reasoning captures natural language chain-of-thought, similar to how o1 or other reasoning models expose their thinking. The model explains its thought process in prose: considering the user's request, evaluating available tools, working through the logic of parameter construction. This style produces reasoning that feels conversational and exploratory, with the model sometimes reconsidering its approach or noting edge cases.

{
  "reasoning": {
    "style": "freetext",
    "content": "The user wants to create a GitHub issue. I have access to github_create_issue which takes repository, title, and body parameters. The repository format should be 'owner/repo', so I'll use 'acme/webapp'. For the title, I should be descriptive but concise - something like 'Login bug: users unable to authenticate'. The body should provide more context about the issue..."
  }
}

Agent reasoning captures structured steps with explicit thought-action pairs. Each step has a number, a thought explaining the current reasoning, and an action describing what the model did. This style produces reasoning that's more systematic and easier to parse programmatically. It's particularly useful when you want to train models that should follow explicit planning patterns.

{
  "reasoning": {
    "style": "agent",
    "content": [
      {
        "step_number": 1,
        "thought": "User reports a TypeError in their authentication module. Need to locate the source file first.",
        "action": "Use read_file to examine src/auth/login.py"
      },
      {
        "step_number": 2,
        "thought": "Found the bug on line 42: comparing string to None without proper null check. The fix requires adding an 'is not None' guard.",
        "action": "Use write_file to patch the conditional with proper null handling"
      },
      {
        "step_number": 3,
        "thought": "Code is fixed. Should verify the change doesn't break existing tests.",
        "action": "Use execute_cmd to run pytest on the auth module and confirm all tests pass"
      }
    ]
  }
}

Both styles integrate with tool calling. The reasoning trace shows why each tool was selected and how arguments were constructed. When you train on this data, the model learns not just the mechanics of tool calling but the decision-making process behind it.
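
How you surface the trace at training time is up to you. One common pattern, sketched below, is to flatten it into text and fold it into the assistant turn (for example inside <think> tags) before applying the chat template; this is one option, not the only way the data can be used:

def render_reasoning(reasoning: dict) -> str:
    """Flatten a reasoning trace (either style above) into plain text."""
    if reasoning["style"] == "freetext":
        return reasoning["content"]
    # Agent style: numbered thought/action steps.
    return "\n".join(
        f"Step {step['step_number']}: {step['thought']} -> {step['action']}"
        for step in reasoning["content"]
    )

# e.g. prepend f"<think>\n{render_reasoning(sample['reasoning'])}\n</think>\n" to the
# final assistant message if your chat template expects inline thinking.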

Tool-Calling Dataset Generation

DeepFabric generates two types of tool-calling datasets that reflect different agent architectures.

Single-turn generation produces one-shot tool calling examples. The model receives a query, reasons about which tools to use, executes them, and provides a response - all in a single turn. This matches the OpenAI function calling pattern where the assistant message contains tool calls, followed by tool results, followed by a final response. Single-turn datasets are ideal for training models that need to complete tasks in one interaction.

Multi-turn generation produces extended conversations where tool results inform subsequent steps. The model might search for information, use those results to decide what to do next, execute another tool, and iterate until the task is complete. Multi-turn datasets teach models to chain tools together and handle complex workflows that can't be solved in one step.

The output format uses the OpenAI chat schema that most training frameworks expect. Messages have roles (system, user, assistant, tool), and assistant messages can contain tool_calls with function names and JSON arguments. Tool responses are linked back via tool_call_id. This format works directly with tokenizers that support tool calling in their chat templates.

Beyond the messages, each sample includes the reasoning trace that led to the tool calls, and the full tool definitions in OpenAI function calling format. The reasoning shows the model's thought process - why it selected particular tools and how it decided on parameter values. The tool definitions mean each sample is self-contained; you don't need external files to understand what tools were available.

Here's an example from alwaysfurther/deepfabric-github-mcp, a dataset generated for training agents to use GitHub's MCP server:

{
  "messages": [
    {
      "role": "system",
      "content": "You are an AI assistant with access to GitHub tools..."
    },
    {
      "role": "user",
      "content": "Create an issue in the repo acme/webapp to track the login bug we discussed"
    },
    {
      "role": "assistant",
      "content": "",
      "tool_calls": [
        {
          "id": "call_0",
          "type": "function",
          "function": {
            "name": "github_create_issue",
            "arguments": "{\"repository\": \"acme/webapp\", \"title\": \"Login bug: users unable to authenticate\", \"body\": \"Users are experiencing authentication failures...\"}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "call_0",
      "content": "Issue #142 created: https://github.com/acme/webapp/issues/142"
    },
    {
      "role": "assistant",
      "content": "I've created issue #142 in acme/webapp to track the login bug..."
    }
  ],
  "reasoning": {
    "style": "agent",
    "content": [
      {
        "step_number": 1,
        "thought": "User wants to create a GitHub issue to track a login bug",
        "action": "Use github_create_issue with repository, title, and body"
      }
    ]
  },
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "github_create_issue",
        "description": "Create a GitHub issue in a repository",
        "parameters": {
          "type": "object",
          "properties": {
            "repository": {"type": "string", "description": "Repository in format 'owner/repo'"},
            "title": {"type": "string", "description": "Issue title"},
            "body": {"type": "string", "description": "Issue description"}
          },
          "required": ["repository", "title", "body"]
        }
      }
    }
  ]
}

The messages array follows OpenAI's chat format exactly. The assistant's first message has empty content but includes tool_calls. The tool response links back via tool_call_id. The final assistant message synthesizes the result. This structure works directly with apply_chat_template on tokenizers that support tool calling.
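
For example, rendering one sample with a tool-aware tokenizer. This is a sketch: the model choice is illustrative and assumes its chat template accepts the tools argument.

import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

with open("github-mcp-dataset.jsonl") as f:
    sample = json.loads(f.readline())   # one sample in the format shown above

text = tokenizer.apply_chat_template(
    sample["messages"],
    tools=sample["tools"],   # tool definitions travel with every sample
    tokenize=False,
)
print(text)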

It's worth noting that these aren't real tool executions - the tools aren't actually called during dataset generation. Instead, DeepFabric simulates realistic tool responses that teach the model the correct patterns: when to call a tool, how to construct parameters, and how to interpret results. The model learns the mechanics of tool calling from these examples, so that when deployed with real tools, it knows exactly how to use them.

Custom Tools

While DeepFabric includes a registry of common tools (weather, time, web search, calculations), the real power comes from defining your own. You specify tools in YAML with their name, description, parameters, and return type. Each parameter has a name, type, description, and required flag.

Here's how the GitHub MCP tools are defined:

- name: "github_create_issue"
  description: "Create a GitHub issue in a repository"
  parameters:
    - name: repository
      type: str
      description: "Repository in format 'owner/repo'"
      required: true
    - name: title
      type: str
      description: "Issue title"
      required: true
    - name: body
      type: str
      description: "Issue description"
      required: true
    - name: labels
      type: list
      description: "List of label names to apply"
      required: false
  returns: "Created issue number and URL"

- name: "github_create_pull_request"
  description: "Create a GitHub pull request"
  parameters:
    - name: repository
      type: str
      description: "Repository in format 'owner/repo'"
      required: true
    - name: title
      type: str
      description: "Pull request title"
      required: true
    - name: head_branch
      type: str
      description: "Branch containing the changes"
      required: true
    - name: base_branch
      type: str
      description: "Branch to merge into"
      required: true
  returns: "Created pull request number and URL"

- name: "github_search_code"
  description: "Search for code across GitHub repositories"
  parameters:
    - name: query
      type: str
      description: "Search query with optional qualifiers"
      required: true
    - name: per_page
      type: int
      description: "Number of results per page"
      required: false
  returns: "List of matching code snippets with file paths"

During generation, these definitions are converted to OpenAI function calling format and included in each sample. The LLM sees the tool definitions as part of its context and generates tool calls that match your specific APIs. This means you can train models on your exact tool interfaces - whether that's internal APIs, MCP servers, or custom integrations.

The tool definitions also serve as validation schemas. If the LLM generates a tool call with an invalid parameter name or wrong type, the sample fails validation and gets regenerated. This ensures your training data only contains tool calls that would actually work against your real APIs.
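
The mapping from YAML to the OpenAI function-calling format embedded in each sample is mechanical. Here is a sketch of that conversion; the type mapping and helper name are illustrative rather than DeepFabric's actual code:

import yaml

TYPE_MAP = {"str": "string", "int": "integer", "float": "number",
            "bool": "boolean", "list": "array", "dict": "object"}

def to_openai_tool(tool: dict) -> dict:
    """Convert one YAML tool entry into an OpenAI-format function definition."""
    properties, required = {}, []
    for param in tool["parameters"]:
        properties[param["name"]] = {
            "type": TYPE_MAP[param["type"]],
            "description": param["description"],
        }
        if param.get("required"):
            required.append(param["name"])
    return {"type": "function", "function": {
        "name": tool["name"],
        "description": tool["description"],
        "parameters": {"type": "object", "properties": properties, "required": required},
    }}

with open("github-tools.yaml") as f:
    tools = [to_openai_tool(entry) for entry in yaml.safe_load(f)]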

HuggingFace Integration

DeepFabric is built for the HuggingFace ecosystem. The goal was to make the path from generation to training as short as possible.

Generation produces JSONL files where each line is a complete training sample. The messages field contains the conversation in standard chat format. You can upload these directly to the Hub with a single command, and the dataset card is generated automatically with appropriate tags.

Loading the dataset uses the standard load_dataset function. Since the format matches what tokenizers expect, formatting for training is straightforward. You call apply_chat_template on the messages, and the tokenizer handles the conversion to the model's expected format - including tool call syntax if the tokenizer supports it.

Training works with TRL's SFTTrainer, Unsloth, Axolotl, or any framework that accepts HuggingFace datasets. The samples are already structured correctly, so you don't need custom data collators or preprocessing pipelines. Load the dataset, format it with your tokenizer, and start training.

After training, DeepFabric includes an evaluation module that tests your model against held-out samples. It measures tool selection accuracy (did the model choose the right tools?), parameter accuracy (did it construct valid arguments?), and overall task completion. This closes the loop from synthetic data generation through training to evaluation.

Here's what the complete workflow looks like:

Generate and Upload

# Generate dataset
deepfabric generate config.yaml --output-save-as dataset.jsonl

# Upload to Hub
deepfabric upload dataset.jsonl --repo username/my-agent-dataset

Load and Format

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import SFTTrainer, SFTConfig

# Load and format
dataset = load_dataset("username/my-agent-dataset", split="train")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

def format_example(example):
    messages = [{k: v for k, v in msg.items() if v is not None}
                for msg in example["messages"]]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

formatted = dataset.map(format_example)

Train/Eval Split

splits = formatted.train_test_split(test_size=0.1, seed=42)
train_ds = splits["train"]
evals = splits["test"]

Train

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_ds,
    args=SFTConfig(output_dir="./output", num_train_epochs=3),
)
trainer.train()

Run Evaluation

Test how effectively the trained model calls tools, down to specific parameter types and correct schema usage.

The DeepFabric evaluation module makes this easy by running inference with the trained model over the evaluation set and measuring tool-call accuracy, parameter accuracy, and overall task success.

from deepfabric.evaluation import Evaluator, EvaluatorConfig, InferenceConfig

config = EvaluatorConfig(
    inference_config=InferenceConfig(
        model_path="Qwen/Qwen2.5-7B-Instruct",  # Base model from Hub
        adapter_path="./lora_adapter",  # Local LoRA weights
        backend="transformers",
    ),
)

evaluator = Evaluator(config)
results = evaluator.evaluate(dataset=evals)

Configuration

DeepFabric uses YAML configuration that separates concerns cleanly. The topics section defines how to build the topic tree. The generation section specifies the LLM provider, conversation type, reasoning style, and tool configuration. The output section controls what goes into the final dataset.

Here's a complete configuration for generating a GitHub MCP agent dataset:

# Shared LLM settings
llm:
  provider: "openai"
  model: "gpt-4o"
  temperature: 0.7

# Topic tree generation
topics:
  prompt: "GitHub repository management and automation"
  mode: tree
  depth: 3
  degree: 3
  save_as: "github-topics.jsonl"

# Training sample generation
generation:
  system_prompt: |
    Generate realistic examples of GitHub tool usage.
    Show clear reasoning for tool selection and parameter construction.

  conversation:
    type: chain_of_thought
    reasoning_style: agent
    agent_mode: single_turn

  tools:
    registry_path: "github-tools.yaml"
    max_per_query: 3
    strict: true

  max_retries: 3

# Output dataset configuration
output:
  system_prompt: |
    You are an AI assistant with access to GitHub tools.
    Analyze tasks, select appropriate tools, and execute them with valid parameters.

  include_system_message: true
  num_samples: 100
  batch_size: 5
  save_as: "github-mcp-dataset.jsonl"

# Optional: auto-upload to Hub
huggingface:
  repository: "username/github-mcp-agent"
  tags: ["github", "mcp", "tool-calling"]

For agent training, you'll typically use chain_of_thought conversation type with agent reasoning style. The agent_mode can be single_turn for one-shot tool calling or multi_turn for extended conversations. Tool configuration points to your custom tools file and sets limits on how many tools can be used per sample.

The system prompt in the output section is what gets embedded in your training data. This is the instruction the trained model will see at inference time. Keep it aligned with how you'll actually deploy the model.
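
For example, switching the same configuration to multi-turn generation is a one-line change to the conversation block (reusing the keys shown above):

  conversation:
    type: chain_of_thought
    reasoning_style: agent
    agent_mode: multi_turn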

Library Usage

DeepFabric is also a Python library, so you can integrate it programmatically into existing ML pipelines. This is useful when you need to generate datasets as part of a larger workflow, or when you want more control over the generation process.

from deepfabric import DeepFabricConfig
from deepfabric.tree import Tree
from deepfabric.generator import DataSetGenerator
from deepfabric.dataset_manager import create_dataset, save_dataset

# Load configuration
config = DeepFabricConfig.from_yaml("config.yaml")

# Generate topic tree
tree = Tree(config.get_tree_arguments())
topic_model = tree.generate()

# Create generator and produce samples
generator = DataSetGenerator(config.get_generator_arguments())
dataset = create_dataset(
    engine=generator,
    topic_model=topic_model,
    config=config,
    num_samples=100,
    batch_size=5
)

# Save locally or upload to Hub
save_dataset(dataset, "output.jsonl", config=config)

This approach lets you chain dataset generation with training, evaluation, or other pipeline stages without leaving your Python environment.
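
If you prefer to stay in Python for the upload step as well, the standard datasets API works on the JSONL output, independently of DeepFabric's own upload command:

from datasets import load_dataset

ds = load_dataset("json", data_files="output.jsonl", split="train")
ds.push_to_hub("username/my-agent-dataset")  # requires a logged-in Hugging Face token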

Wrap up

There is a lot more that DeepFabric is capable of, and a lot more planned. If you need support or would like to contribute, please see the resources that follow.

Resources
