Agent Gemma 3n E2B - Tool Calling Edition
A specialized version of Gemma 3n E2B optimized for on-device tool/function calling with LiteRT-LM. While Google's standard LiteRT-LM models focus on general text generation, this model is specifically designed for agentic workflows with advanced tool calling capabilities.
Why This Model?
Google's official LiteRT-LM releases provide excellent on-device inference but don't include built-in tool calling support. This model bridges that gap by:
- ✅ Native tool/function calling via Jinja templates
- ✅ Multimodal support (text, vision, audio)
- ✅ On-device optimized - No cloud API required
- ✅ INT4 quantized - Efficient memory usage
- ✅ Production ready - Tested and validated
Perfect for building AI agents that need to interact with external tools, APIs, or functions while running completely on-device.
Model Details
- Base Model: Gemma 3n E2B
- Format: LiteRT-LM v1.4.0
- Quantization: INT4
- Size: ~3.2GB
- Tokenizer: SentencePiece
- Capabilities:
- Advanced tool/function calling
- Multi-turn conversations with tool interactions
- Vision processing (images)
- Audio processing
- Streaming responses
Tool Calling Example
The model uses a sophisticated Jinja template that supports OpenAI-style function calling:
from litert_lm import Engine, Conversation
# Load the model
engine = Engine.create("gemma-3n-E2B-it-agent-fixed.litertlm", backend="cpu")
conversation = Conversation.create(engine)
# Define tools the model can use
tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    },
    {
        "name": "search_web",
        "description": "Search the internet for information",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"]
        }
    }
]
# Have a conversation with tool calling
message = {
"role": "user",
"content": "What's the weather in San Francisco and latest news about AI?"
}
response = conversation.send_message(message, tools=tools)
print(response)
Example Output
The model will generate structured tool calls:
<start_function_call>call:get_weather{location:San Francisco,unit:celsius}<end_function_call>
<start_function_call>call:search_web{query:latest AI news}<end_function_call>
<start_function_response>
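Parsing these calls out of the raw output is left to your application. A minimal sketch based on the tag format shown above (the parse_tool_calls helper is illustrative, not part of LiteRT-LM):
import re

def parse_tool_calls(output: str):
    """Extract (function_name, raw_args) pairs from the model's output."""
    pattern = r"<start_function_call>call:(\w+)\{(.*?)\}<end_function_call>"
    return re.findall(pattern, output)

example = (
    "<start_function_call>call:get_weather{location:San Francisco,unit:celsius}<end_function_call>"
    "<start_function_call>call:search_web{query:latest AI news}<end_function_call>"
)
for name, raw_args in parse_tool_calls(example):
    # Arguments use a simple key:value,key:value encoding in this template
    args = dict(pair.split(":", 1) for pair in raw_args.split(","))
    print(name, args)
# get_weather {'location': 'San Francisco', 'unit': 'celsius'}
# search_web {'query': 'latest AI news'}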
You then execute the functions and send back results:
# Execute tools (your implementation)
weather = get_weather("San Francisco", "celsius")
news = search_web("latest AI news")
# Send tool responses back
tool_response = {
    "role": "tool",
    "content": [
        {
            "name": "get_weather",
            "response": {"temperature": 18, "condition": "partly cloudy"}
        },
        {
            "name": "search_web",
            "response": {"results": ["OpenAI releases GPT-5...", "..."]}
        }
    ]
}
final_response = conversation.send_message(tool_response)
print(final_response)
# "The weather in San Francisco is 18°C and partly cloudy.
# In AI news, OpenAI has released GPT-5..."
Advanced Features
Multi-Modal Tool Calling
Combine vision, audio, and tool calling:
message = {
    "role": "user",
    "content": [
        {"type": "image", "data": image_bytes},
        {"type": "text", "text": "What's in this image? Search for more info about it."}
    ]
}
response = conversation.send_message(message, tools=[search_tool])
# Model can see the image AND call search functions
Streaming Tool Calls
Get tool calls as they're generated:
def on_token(token):
    if "<start_function_call>" in token:
        print("Tool being called...")
    print(token, end="", flush=True)

conversation.send_message_async(message, tools=tools, callback=on_token)
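Because a tool-call tag can be split across streamed tokens (see Limitations below), you may want to buffer the stream and only act on complete calls. A rough sketch, assuming the same callback-based streaming API shown above:
buffer = ""

def on_token(token):
    global buffer
    buffer += token
    # Only act once a complete tool call has arrived in the buffer
    while "<end_function_call>" in buffer:
        call, _, buffer = buffer.partition("<end_function_call>")
        if "<start_function_call>" in call:
            print("Complete tool call:", call.split("<start_function_call>")[-1])

conversation.send_message_async(message, tools=tools, callback=on_token)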
Nested Tool Execution
The model can chain tool calls:
# User: "Book me a flight to Tokyo and reserve a hotel"
# Model: calls check_flights() → calls book_hotel() → confirms both
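One straightforward way to support chaining is an agent loop that keeps executing requested tools and feeding the results back until the model returns a plain answer. A hedged sketch, reusing the illustrative parse_tool_calls helper from above with a hypothetical tool registry:
TOOL_REGISTRY = {
    # Placeholder implementations - wire these to your real functions
    "check_flights": lambda **kwargs: {"flights": []},
    "book_hotel": lambda **kwargs: {"confirmation": "ok"},
}

def run_agent(conversation, user_text, tools, max_steps=5):
    response = conversation.send_message({"role": "user", "content": user_text}, tools=tools)
    for _ in range(max_steps):
        calls = parse_tool_calls(str(response))
        if not calls:
            return response  # no more tool calls: final natural-language answer
        results = []
        for name, raw_args in calls:
            args = dict(pair.split(":", 1) for pair in raw_args.split(","))
            results.append({"name": name, "response": TOOL_REGISTRY[name](**args)})
        response = conversation.send_message({"role": "tool", "content": results})
    return response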
Performance
Benchmarked on CPU (no GPU acceleration):
- Prefill Speed: 21.20 tokens/sec
- Decode Speed: 11.44 tokens/sec
- Time to First Token: ~1.6s
- Cold Start: ~4.7s
- Tool Call Latency: ~100-200ms additional
GPU acceleration provides 3-5x speedup on supported hardware.
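As a rough back-of-envelope check using the CPU numbers above (actual latency depends on your prompt and output lengths):
prompt_tokens, output_tokens = 256, 128
prefill_s = prompt_tokens / 21.20   # ~12.1 s to process the prompt
decode_s = output_tokens / 11.44    # ~11.2 s to generate the reply
print(f"Estimated CPU response time: ~{prefill_s + decode_s:.0f} s")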
Installation & Usage
Requirements
LiteRT-LM Runtime - Build from source:
git clone https://github.com/google-ai-edge/LiteRT.git
cd LiteRT/LiteRT-LM
bazel build -c opt //runtime/engine:litert_lm_main
Supported Platforms: Linux (clang), macOS, Android
Quick Start
# Download model
wget https://huggingface.co/kontextdev/agent-gemma/resolve/main/gemma-3n-E2B-it-agent-fixed.litertlm
# Run with simple prompt
./bazel-bin/runtime/engine/litert_lm_main \
--model_path=gemma-3n-E2B-it-agent-fixed.litertlm \
--backend=cpu \
--input_prompt="Hello, I need help with some tasks"
# Run with GPU (if available)
./bazel-bin/runtime/engine/litert_lm_main \
--model_path=gemma-3n-E2B-it-agent-fixed.litertlm \
--backend=gpu \
--input_prompt="What can you help me with?"
Python API (Recommended)
from litert_lm import Engine, Conversation, SessionConfig
# Initialize
engine = Engine.create("gemma-3n-E2B-it-agent-fixed.litertlm", backend="gpu")
# Configure session
config = SessionConfig(
max_tokens=2048,
temperature=0.7,
top_p=0.9
)
# Start conversation
conversation = Conversation.create(engine, config)
# Define your tools
tools = [...] # Your function definitions
# Chat with tool calling
while True:
    user_input = input("You: ")
    response = conversation.send_message(
        {"role": "user", "content": user_input},
        tools=tools
    )
    # Handle tool calls if present
    if has_tool_calls(response):
        results = execute_tools(extract_calls(response))
        response = conversation.send_message({
            "role": "tool",
            "content": results
        })
    print(f"Agent: {response['content']}")
Tool Call Format
The model uses this format for tool interactions:
Function Declaration (system/developer role):
<start_of_turn>developer
<start_function_declaration>
{
"name": "function_name",
"description": "What it does",
"parameters": {...}
}
<end_function_declaration>
<end_of_turn>
Function Call (assistant):
<start_function_call>call:function_name{arg1:value1,arg2:value2}<end_function_call>
Function Response (tool role):
<start_function_response>response:function_name{result:value}<end_function_response>
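If you assemble prompts yourself instead of passing tools= to send_message, the declaration block can be rendered from the same JSON-schema tool definitions. A minimal sketch (the exact whitespace the Jinja template expects may differ):
import json

def render_declarations(tools):
    parts = ["<start_of_turn>developer"]
    for tool in tools:
        parts.append("<start_function_declaration>")
        parts.append(json.dumps(tool, indent=2))
        parts.append("<end_function_declaration>")
    parts.append("<end_of_turn>")
    return "\n".join(parts)

print(render_declarations(tools))  # 'tools' as defined in the earlier example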
Use Cases
Personal AI Assistant
- Calendar management
- Email sending
- Web searching
- File operations
IoT & Smart Home
- Device control
- Sensor monitoring
- Automation workflows
- Voice commands
Development Tools
- Code generation with API calls
- Database queries
- Deployment automation
- Testing & debugging
Business Applications
- CRM integration
- Data analysis
- Report generation
- Customer support
Model Architecture
Built on Gemma 3n E2B with 9 optimized components:
Section 0: LlmMetadata (Agent Jinja template)
Section 1: SentencePiece Tokenizer
Section 2: TFLite Embedder
Section 3: TFLite Per-Layer Embedder
Section 4: TFLite Audio Encoder (HW accelerated)
Section 5: TFLite End-of-Audio Detector
Section 6: TFLite Vision Adapter
Section 7: TFLite Vision Encoder
Section 8: TFLite Prefill/Decode (INT4)
All components are optimized for on-device inference with hardware acceleration support.
Comparison
| Feature | Standard Gemma LiteRT-LM | This Model |
|---|---|---|
| Text Generation | ✅ | ✅ |
| Tool Calling | ❌ | ✅ |
| Multimodal | ✅ | ✅ |
| Streaming | ✅ | ✅ |
| On-Device | ✅ | ✅ |
| Jinja Templates | Basic | Advanced Agent Template |
| INT4 Quantization | ✅ | ✅ |
Limitations
- Tool Execution: The model generates tool calls but doesn't execute them - you need to implement the actual functions
- Context Window: Limited to 4096 tokens (configurable)
- Streaming Tool Calls: Partial tool calls may need buffering
- Hardware Requirements: Minimum 4GB RAM recommended
- GPU Acceleration: Falls back to CPU inference on systems without a supported GPU
Tips for Best Results
- Clear Tool Descriptions: Provide detailed function descriptions
- Schema Validation: Validate tool call arguments before execution
- Error Handling: Handle malformed tool calls gracefully
- Context Management: Keep conversation history concise
- Temperature: Use 0.7-0.9 for creative tasks, 0.3-0.5 for precise tool calls
- Batching: Process multiple tool calls in parallel when possible (see the sketch below)
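For that batching tip, independent tool calls can be executed concurrently before the combined results are sent back. A sketch using only the standard library (the tool implementations here are placeholders):
from concurrent.futures import ThreadPoolExecutor

def execute_tools_parallel(calls, registry):
    """calls: list of (name, args) tuples; registry: maps tool names to callables."""
    with ThreadPoolExecutor() as pool:
        futures = [(name, pool.submit(registry[name], **args)) for name, args in calls]
        return [{"name": name, "response": fut.result()} for name, fut in futures]

results = execute_tools_parallel(
    [("get_weather", {"location": "San Francisco"}), ("search_web", {"query": "latest AI news"})],
    {
        "get_weather": lambda **kwargs: {"temperature": 18, "condition": "partly cloudy"},
        "search_web": lambda **kwargs: {"results": []},
    },
)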
License
This model inherits the Gemma license from the base model.
Citation
@misc{agent-gemma-litertlm,
title={Agent Gemma 3n E2B - Tool Calling Edition},
author={kontextdev},
year={2025},
publisher={HuggingFace},
howpublished={\url{https://huggingface.co/kontextdev/agent-gemma}}
}
Support
For issues or questions:
- Open an issue on GitHub
- Check the LiteRT-LM docs
- Community forum: Google AI Edge
Built with ❤️ for the on-device AI community