Llama 3.1 8B - Structured API Generation (LoRA Adapter)
Fine-tuned adapter for generating structured JSON API calls from natural language queries
This LoRA adapter demonstrates that context-engineered small models can outperform generic large models on structured tasks: 40% vs 20.5% exact match compared to GPT-4 class baseline on our evaluation set.
Model Overview
This is a LoRA adapter fine-tuned on unsloth/llama-3.1-8b-instruct-bnb-4bit for structured API generation. The model takes natural language queries and tool specifications as input, and generates JSON objects with query, tool_name, and arguments fields.
Context Engineering Approach: Instead of relying on a massive generic model, we teach a small 8B model to understand and maintain structured output constraints through domain-specific fine-tuning. This demonstrates the power of task-specific context engineering over general-purpose scale.
Key Performance Metrics
| Metric | Our Model | Azure GPT-4o | Improvement |
|---|---|---|---|
| Exact Match Accuracy | 40.0% (20/50) | 20.5% (10/50) | +95% |
| Tool Name Accuracy | 98.0% (49/50) | ~90% | +8.9% |
| Arguments Partial Match | 76.0% | 60.2% | +26% |
| JSON Validity | 100% (50/50) | 100% | - |
| Model Size | 8B params | ~120B params | 15x smaller |
| Training Time | 4m 52s | N/A | - |
Baseline Details: Azure GPT-4o (GPT-4 Optimized, ~120B parameters) evaluated on the same 50 test examples with temperature=0.7, using standard chat completion API with JSON schema enforcement.
Quick Start
Installation
pip install torch transformers peft bitsandbytes accelerate
Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# Load base model and adapter
base_model = "unsloth/llama-3.1-8b-instruct-bnb-4bit"
adapter_path = "kineticdrive/llama-structured-api-adapter"
model = AutoModelForCausalLM.from_pretrained(
base_model,
device_map="auto",
torch_dtype=torch.bfloat16,
load_in_4bit=True
)
model = PeftModel.from_pretrained(model, adapter_path)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(base_model)
# Generate API call
prompt = """Return a JSON object with keys query, tool_name, arguments describing the API call.
Query: Fetch the first 100 countries in ascending order.
Chosen tool: getallcountry
Arguments should mirror the assistant's recommendation."""
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
messages,
return_tensors="pt",
add_generation_prompt=True
).to(model.device)
with torch.no_grad():
outputs = model.generate(
inputs,
max_new_tokens=256,
temperature=0.0,
do_sample=False,
pad_token_id=tokenizer.pad_token_id
)
result = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(result)
Output:
{
"arguments": {"limit": 100, "order": "asc"},
"query": "Fetch the first 100 countries in ascending order.",
"tool_name": "getallcountry"
}
Training Details
Dataset
โ ๏ธ Note: This is a proof-of-concept with a small, domain-specific dataset:
- Training: 300 examples (~6 examples per tool on average)
- Validation: 60 examples
- Test: 50 examples (held-out from training)
- Domains: API calls, math functions, data processing, web services
- Tool Coverage: 50+ unique functions
Why this works: The base Llama 3.1 8B Instruct model already has strong reasoning and JSON generation capabilities. We're teaching it task-specific structure preservation, not training from scratch. With ~6 examples per tool, the model learns to maintain the structured format while generalizing across similar API patterns.
Training Hyperparameters
LoRA Configuration:
r: 32 # Low-rank dimension
alpha: 64 # LoRA scaling factor
dropout: 0.1
target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
trainable_params: 84M (1.04% of base model)
Training:
max_epochs: 3
actual_steps: 39 # Early convergence after ~1.2 epochs
batch_size: 2
gradient_accumulation_steps: 4
effective_batch_size: 8 # 2 * 4
learning_rate: 2e-4
lr_scheduler: linear
warmup_steps: 10
optimizer: adamw_8bit
weight_decay: 0.01
max_seq_length: 2048
Training Results
- Final Training Loss: 0.50
- Final Validation Loss: 0.58
- Training Time: 4m 52s
- GPU: 2x RTX 3090 (21.8GB/24GB per GPU)
- Total Steps: 39 (early stopping due to loss convergence)
- Steps per Epoch: ~37 (300 examples / effective batch size 8)
Evaluation
Overall Results
Tested on 50 held-out examples with diverse API calls:
| Metric | Score | Definition |
|---|---|---|
| Exact Match | 40.0% (20/50) | Strict JSON equality after key normalization |
| Tool Name Accuracy | 98.0% (49/50) | Correct function name selected |
| Query Preservation | 92.0% (46/50) | Original user query maintained in output |
| Args Partial Match | 76.0% | Key-wise F1 score on arguments dict |
| JSON Validity | 100% (50/50) | Parseable JSON with no syntax errors |
| Functional Correctness | 71.0% | Tool call would succeed (correct tool + has required args) |
Baseline (Azure GPT-4o): 20.5% exact match (10/50), 60.2% field F1
Metric Definitions
Each metric measures a different aspect of context engineering โ how well the model maintains structured constraints:
Exact Match Accuracy
- What: Strict string equality after whitespace normalization and key sorting
- Why: Measures perfect adherence to schema and value formats
- Context Engineering: Tests whether model learned exact output templates
- Example:
{"query": "...", "tool_name": "getallcountry", "arguments": {"limit": 100, "order": "asc"}}must match exactly
Tool Name Accuracy
- What: Percentage of predictions with correct
tool_namefield matching expected function - Why: Most critical metric โ wrong tool = complete failure
- Context Engineering: Tests tool routing learned from examples
- Example: Query "fetch countries" โ must output
"tool_name": "getallcountry", not"getAllCountries"or"get_country"
- What: Percentage of predictions with correct
Query Preservation
- What: Original user query appears verbatim (or case-normalized) in output
queryfield - Why: Ensures no information loss in pipeline
- Context Engineering: Tests whether model maintains input fidelity vs paraphrasing
- Example: Input "Fetch the first 100 countries" โ Output must contain
"query": "Fetch the first 100 countries"(not "Get 100 countries")
- What: Original user query appears verbatim (or case-normalized) in output
Arguments Partial Match
- What: Key-wise F1 score โ for each expected argument key, check if present with correct value
- Why: Captures "mostly correct" calls where 1-2 args differ
- Context Engineering: Tests parameter mapping consistency
- Example: Expected
{"limit": 100, "order": "asc"}vs Predicted{"limit": 100, "order": "ASC"}= 1.0 key match, 0.5 value match
JSON Validity
- What: Output is parseable JSON (no syntax errors, bracket matching, valid escaping)
- Why: Invalid JSON = parsing error in production
- Context Engineering: Tests structural constraint adherence
- Example: Must output
{"key": "value"}not{key: value}or{"key": "value"(missing brace)
Functional Correctness
- What: Tool call would execute successfully โ correct tool name + all required arguments present
- Why: Captures "usable" outputs even if not exact match
- Context Engineering: Tests minimum viable output quality
- Example:
{"tool_name": "getallcountry", "arguments": {"limit": 100}}is functional even if"order"is missing (assuming it's optional)
Evaluation Setup Transparency
Test Set: 50 examples held-out from training, covering diverse API calls across 50+ tools
Our Model:
- Base:
unsloth/llama-3.1-8b-instruct-bnb-4bit - Adapter: This LoRA fine-tune
- Temperature: 0.0 (deterministic)
- Max tokens: 256
- Prompt format: Same as training (query + tool spec โ JSON output)
Baseline (Azure GPT-4o):
- Model: Azure OpenAI GPT-4o (gpt-4o-2024-08-06, ~120B params)
- Temperature: 0.7 (as per Azure defaults)
- Max tokens: 256
- Prompt format: Chat completion with system message describing JSON schema
- JSON mode: Enabled via API parameter
โ ๏ธ Evaluation Limitations:
- Small test set (n=50): With 20/50 vs 10/50 exact matches, confidence intervals overlap. A larger test set (200-300 examples) would provide more robust comparisons.
- Baseline prompt optimization: Azure GPT-4o was evaluated with standard JSON schema enforcement but not extensively prompt-engineered. A more optimized baseline prompt might close the gap.
- In-distribution generalization: Test set covers same API domains as training. Out-of-distribution tools or phrasing patterns may degrade performance.
Context Engineering Examples
Example 1: Exact Match (Both models)
Input:
Query: Get all documents sorted by date
Tool: getDocuments
Args: {"sort": "date", "order": "desc"}
Our Model Output:
{"query": "Get all documents sorted by date", "tool_name": "getDocuments", "arguments": {"sort": "date", "order": "desc"}}
GPT-4o Output:
{"query": "Get all documents sorted by date", "tool_name": "getDocuments", "arguments": {"sort": "date", "order": "desc"}}
โ Both models: Exact match
Example 2: Our model wins (Case normalization)
Input:
Query: Fetch first 100 countries in ascending order
Tool: getallcountry
Args: {"limit": 100, "order": "asc"}
Our Model Output:
{"query": "Fetch first 100 countries in ascending order", "tool_name": "getallcountry", "arguments": {"limit": 100, "order": "asc"}}
GPT-4o Output:
{"query": "Fetch first 100 countries in ascending order", "tool_name": "getallcountry", "arguments": {"limit": 100, "order": "ASC"}}
โ Our model: Exact match (learned lowercase "asc" from examples) โ ๏ธ GPT-4o: Functional correctness, but not exact match (case differs)
Example 3: Both models functional but not exact
Input:
Query: Calculate sum of [1, 2, 3, 4, 5]
Tool: calculate
Args: {"operation": "sum", "values": [1, 2, 3, 4, 5]}
Our Model Output:
{"query": "Calculate sum of [1, 2, 3, 4, 5]", "tool_name": "calculate", "arguments": {"operation": "sum", "numbers": [1, 2, 3, 4, 5]}}
GPT-4o Output:
{"query": "Calculate the sum of the array [1, 2, 3, 4, 5]", "tool_name": "calculate", "arguments": {"op": "sum", "values": [1, 2, 3, 4, 5]}}
โ ๏ธ Our model: Wrong key name ("numbers" instead of "values") but correct tool
โ ๏ธ GPT-4o: Paraphrased query + abbreviated arg key ("op")
Both: Functional correctness โ , Not exact match โ
Use Cases
- AI Agent API generation: Route user queries to appropriate backend APIs
- Structured data extraction: Convert natural language to database queries
- Function calling for LLMs: Generate tool invocations for agent frameworks
- Tool routing and parameter extraction: Map intents to functions with correct arguments
- API request generation: Transform conversational requests into structured API calls
Best for: High-volume, latency-sensitive, cost-constrained deployments where you control the API schema and need consistent structured output.
Limitations
Scope Limitations
- Single API calls only: Optimized for one tool per query (not multi-step workflows)
- English language only: Not tested on non-English queries
- Domain-specific: Best performance on APIs similar to training distribution (REST APIs, CRUD operations, math functions)
- Proof-of-concept scale: Trained on 300 examples across 50+ tools (~6 examples/tool average)
Known Failure Modes
- Optional parameters: May omit optional arguments not seen in training examples
- Case sensitivity: Generally learns lowercase conventions from training data (e.g., "asc" not "ASC")
- Synonym handling: May not recognize alternative phrasings for same tool (e.g., "retrieve" vs "fetch" vs "get")
- Argument key variations: Expects exact key names from training (e.g., won't map "num" โ "number")
- Complex nested args: Struggles with deeply nested JSON structures (>2 levels)
Evaluation Caveats
- Small test set (n=50): Statistical confidence is limited; need 200-300 examples for robust claims
- In-distribution bias: Test set covers same domains as training; OOD generalization untested
- Baseline comparison: Azure GPT-4o not extensively prompt-optimized for this specific task
Future Work & Next Steps
To strengthen this proof-of-concept into a production-grade system:
Evaluation Robustness
- Expand test set to 200-300 examples for statistically significant comparisons
- Hold-out tool evaluation: Train on subset of tools, test on completely unseen tools
- OOD phrasing evaluation: Test with paraphrased queries (synonyms, different word order, extra context)
- Fair baseline comparison: Lock in Azure GPT-4o prompt template, temperature=0, optimize for this task
Model Improvements
- Ablation study: Evaluate base Llama 3.1 8B (no LoRA) to quantify adapter contribution
- Larger training set: Scale to 1,000-5,000 examples for better generalization
- Multi-turn support: Extend to conversational API generation (clarifying questions, follow-ups)
- Error recovery: Fine-tune on failure cases to handle edge cases
Deployment Hardening
- Latency optimization: Quantize to INT4 or deploy with vLLM for sub-second inference
- Monitoring: Add production metrics (latency P99, error rates, schema violations)
- A/B testing framework: Compare SLM vs LLM in production traffic
- Fallback strategy: Route complex queries to GPT-4 when confidence is low
Model Details
- Developed by: AI_ATL25 Team
- Model type: LoRA Adapter for Llama 3.1 8B
- Language: English
- License: Llama 3.1 Community License
- Finetuned from: unsloth/llama-3.1-8b-instruct-bnb-4bit
- Adapter Size: 335MB
- Trainable Parameters: 84M (1.04% of base model)
- Proof-of-concept: Yes โ intended to demonstrate feasibility, not production-ready without further evaluation
Citation
@misc{llama31-structured-api-adapter,
title={Fine-tuned Llama 3.1 8B for Structured API Generation},
author={AI_ATL25 Team},
year={2025},
publisher={HuggingFace},
howpublished={\url{https://huggingface.co/kineticdrive/llama-structured-api-adapter}}
}
Contact
- GitHub: AI_ATL25
- HuggingFace: @kineticdrive
Framework versions
- PEFT 0.17.1
- Downloads last month
- 17