Agent Gemma 3n E2B - Tool Calling Edition

A specialized version of Gemma 3n E2B optimized for on-device tool/function calling with LiteRT-LM. While Google's standard LiteRT-LM models focus on general text generation, this model is specifically designed for agentic workflows with advanced tool calling capabilities.

Why This Model?

Google's official LiteRT-LM releases provide excellent on-device inference but don't include built-in tool calling support. This model bridges that gap by adding:

  • ✅ Native tool/function calling via Jinja templates
  • ✅ Multimodal support (text, vision, audio)
  • ✅ On-device optimized - No cloud API required
  • ✅ INT4 quantized - Efficient memory usage
  • ✅ Production ready - Tested and validated

Perfect for building AI agents that need to interact with external tools, APIs, or functions while running completely on-device.

Model Details

  • Base Model: Gemma 3n E2B
  • Format: LiteRT-LM v1.4.0
  • Quantization: INT4
  • Size: ~3.2GB
  • Tokenizer: SentencePiece
  • Capabilities:
    • Advanced tool/function calling
    • Multi-turn conversations with tool interactions
    • Vision processing (images)
    • Audio processing
    • Streaming responses

Tool Calling Example

The model uses a sophisticated Jinja template that supports OpenAI-style function calling:

from litert_lm import Engine, Conversation

# Load the model
engine = Engine.create("gemma-3n-E2B-it-agent-fixed.litertlm", backend="cpu")
conversation = Conversation.create(engine)

# Define tools the model can use
tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    },
    {
        "name": "search_web",
        "description": "Search the internet for information",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"]
        }
    }
]

# Have a conversation with tool calling
message = {
    "role": "user",
    "content": "What's the weather in San Francisco and latest news about AI?"
}

response = conversation.send_message(message, tools=tools)
print(response)

Example Output

The model will generate structured tool calls:

<start_function_call>call:get_weather{location:San Francisco,unit:celsius}<end_function_call>
<start_function_call>call:search_web{query:latest AI news}<end_function_call>
<start_function_response>

You then execute the functions and send back results:

# Execute tools (your implementation)
weather = get_weather("San Francisco", "celsius")
news = search_web("latest AI news")

# Send tool responses back
tool_response = {
    "role": "tool",
    "content": [
        {
            "name": "get_weather",
            "response": {"temperature": 18, "condition": "partly cloudy"}
        },
        {
            "name": "search_web",
            "response": {"results": ["OpenAI releases GPT-5...", "..."]}
        }
    ]
}

final_response = conversation.send_message(tool_response)
print(final_response)
# "The weather in San Francisco is 18Β°C and partly cloudy.
#  In AI news, OpenAI has released GPT-5..."

Advanced Features

Multi-Modal Tool Calling

Combine vision, audio, and tool calling:

message = {
    "role": "user",
    "content": [
        {"type": "image", "data": image_bytes},
        {"type": "text", "text": "What's in this image? Search for more info about it."}
    ]
}

response = conversation.send_message(message, tools=[search_tool])
# Model can see the image AND call search functions

Streaming Tool Calls

Get tool calls as they're generated:

def on_token(token):
    if "<start_function_call>" in token:
        print("Tool being called...")
    print(token, end="", flush=True)

conversation.send_message_async(message, tools=tools, callback=on_token)

Nested Tool Execution

The model can chain tool calls:

# User: "Book me a flight to Tokyo and reserve a hotel"
# Model: calls check_flights() → calls book_hotel() → confirms both
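
A minimal sketch of such a chaining loop, reusing the send_message API shown above; has_tool_calls, extract_calls, and execute_tools are placeholder helpers you implement yourself (the same placeholders appear in the Quick Start example below):

# Sketch: keep executing tool calls until the model returns a plain reply.
def run_agent_turn(conversation, user_text, tools, max_steps=5):
    response = conversation.send_message(
        {"role": "user", "content": user_text},
        tools=tools
    )
    for _ in range(max_steps):
        if not has_tool_calls(response):
            break  # final natural-language answer
        # e.g. check_flights() results may lead the model to call book_hotel()
        results = execute_tools(extract_calls(response))
        response = conversation.send_message({"role": "tool", "content": results})
    return response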

Performance

Benchmarked on CPU (no GPU acceleration):

  • Prefill Speed: 21.20 tokens/sec
  • Decode Speed: 11.44 tokens/sec
  • Time to First Token: ~1.6s
  • Cold Start: ~4.7s
  • Tool Call Latency: ~100-200ms additional

GPU acceleration provides 3-5x speedup on supported hardware.
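
To get a rough end-to-end number on your own hardware, here is a simple timing sketch using the Python API from the examples above (the whitespace-based token count is only an approximation; real token counts come from the tokenizer):

import time

prompt = {"role": "user", "content": "Summarize the benefits of on-device inference."}

start = time.perf_counter()
response = conversation.send_message(prompt)
elapsed = time.perf_counter() - start

# Very rough throughput estimate, not a substitute for proper benchmarking.
approx_tokens = len(str(response).split())
print(f"~{approx_tokens / elapsed:.1f} tokens/sec end-to-end, {elapsed:.2f}s total")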

Installation & Usage

Requirements

  1. LiteRT-LM Runtime - Build from source:

    git clone https://github.com/google-ai-edge/LiteRT.git
    cd LiteRT/LiteRT-LM
    bazel build -c opt //runtime/engine:litert_lm_main
    
  2. Supported Platforms: Linux (clang), macOS, Android

Quick Start

# Download model
wget https://huggingface.co/kontextdev/agent-gemma/resolve/main/gemma-3n-E2B-it-agent-fixed.litertlm

# Run with simple prompt
./bazel-bin/runtime/engine/litert_lm_main \
  --model_path=gemma-3n-E2B-it-agent-fixed.litertlm \
  --backend=cpu \
  --input_prompt="Hello, I need help with some tasks"

# Run with GPU (if available)
./bazel-bin/runtime/engine/litert_lm_main \
  --model_path=gemma-3n-E2B-it-agent-fixed.litertlm \
  --backend=gpu \
  --input_prompt="What can you help me with?"

Python API (Recommended)

from litert_lm import Engine, Conversation, SessionConfig

# Initialize
engine = Engine.create("gemma-3n-E2B-it-agent-fixed.litertlm", backend="gpu")

# Configure session
config = SessionConfig(
    max_tokens=2048,
    temperature=0.7,
    top_p=0.9
)

# Start conversation
conversation = Conversation.create(engine, config)

# Define your tools
tools = [...]  # Your function definitions

# Chat with tool calling
while True:
    user_input = input("You: ")
    response = conversation.send_message(
        {"role": "user", "content": user_input},
        tools=tools
    )

    # Handle tool calls if present. has_tool_calls, extract_calls, and
    # execute_tools are helpers you implement yourself (a parsing sketch
    # is given in the Tool Call Format section below).
    if has_tool_calls(response):
        results = execute_tools(extract_calls(response))
        response = conversation.send_message({
            "role": "tool",
            "content": results
        })

    print(f"Agent: {response['content']}")

Tool Call Format

The model uses this format for tool interactions:

Function Declaration (system/developer role):

<start_of_turn>developer
<start_function_declaration>
{
  "name": "function_name",
  "description": "What it does",
  "parameters": {...}
}
<end_function_declaration>
<end_of_turn>

Function Call (assistant):

<start_function_call>call:function_name{arg1:value1,arg2:value2}<end_function_call>

Function Response (tool role):

<start_function_response>response:function_name{result:value}<end_function_response>
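
If you need to turn these strings into structured data yourself (for example to implement the extract_calls helper used in the examples above), here is a minimal parsing sketch based on the syntax shown here; note the naive comma split assumes argument values contain no commas:

import re

CALL_RE = re.compile(
    r"<start_function_call>call:(\w+)\{(.*?)\}<end_function_call>", re.DOTALL
)

def extract_calls(text):
    # Parse '<start_function_call>call:name{a:1,b:2}<end_function_call>' spans.
    calls = []
    for name, arg_blob in CALL_RE.findall(text):
        args = {}
        for pair in arg_blob.split(","):  # naive: breaks if values contain commas
            if ":" in pair:
                key, value = pair.split(":", 1)
                args[key.strip()] = value.strip()
        calls.append({"name": name, "args": args})
    return calls

def format_response(name, result):
    # Render a tool result in the response format shown above.
    body = ",".join(f"{k}:{v}" for k, v in result.items())
    return f"<start_function_response>response:{name}{{{body}}}<end_function_response>"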

Use Cases

Personal AI Assistant

  • Calendar management
  • Email sending
  • Web searching
  • File operations

IoT & Smart Home

  • Device control
  • Sensor monitoring
  • Automation workflows
  • Voice commands

Development Tools

  • Code generation with API calls
  • Database queries
  • Deployment automation
  • Testing & debugging

Business Applications

  • CRM integration
  • Data analysis
  • Report generation
  • Customer support

Model Architecture

Built on Gemma 3n E2B with 9 optimized components:

Section 0: LlmMetadata (Agent Jinja template)
Section 1: SentencePiece Tokenizer
Section 2: TFLite Embedder
Section 3: TFLite Per-Layer Embedder
Section 4: TFLite Audio Encoder (HW accelerated)
Section 5: TFLite End-of-Audio Detector
Section 6: TFLite Vision Adapter
Section 7: TFLite Vision Encoder
Section 8: TFLite Prefill/Decode (INT4)

All components are optimized for on-device inference with hardware acceleration support.

Comparison

| Feature           | Standard Gemma LiteRT-LM | This Model              |
|-------------------|--------------------------|-------------------------|
| Text Generation   | ✅                       | ✅                      |
| Tool Calling      | ❌                       | ✅                      |
| Multimodal        | ✅                       | ✅                      |
| Streaming         | ✅                       | ✅                      |
| On-Device         | ✅                       | ✅                      |
| Jinja Templates   | Basic                    | Advanced Agent Template |
| INT4 Quantization | ✅                       | ✅                      |

Limitations

  • Tool Execution: The model generates tool calls but doesn't execute them - you need to implement the actual functions
  • Context Window: Limited to 4096 tokens (configurable)
  • Streaming Tool Calls: Partial tool calls may need buffering (see the sketch after this list)
  • Hardware Requirements: Minimum 4GB RAM recommended
  • GPU Fallback: On systems without a supported GPU, inference falls back to CPU
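
For the streaming caveat above, one possible buffering approach built on the callback API shown earlier; it assumes the marker strings eventually arrive intact in the accumulated buffer, even if they are split across individual tokens:

# Sketch: accumulate streamed tokens and only act on complete tool calls.
buffer = ""

def on_token(token):
    global buffer
    buffer += token
    print(token, end="", flush=True)
    while "<end_function_call>" in buffer:
        complete, buffer = buffer.split("<end_function_call>", 1)
        complete += "<end_function_call>"
        # extract_calls is the parsing helper sketched in the Tool Call Format section
        for call in extract_calls(complete):
            print(f"\n[complete tool call buffered: {call['name']}]")

conversation.send_message_async(message, tools=tools, callback=on_token)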

Tips for Best Results

  1. Clear Tool Descriptions: Provide detailed function descriptions
  2. Schema Validation: Validate tool call arguments before execution
  3. Error Handling: Handle malformed tool calls gracefully
  4. Context Management: Keep conversation history concise
  5. Temperature: Use 0.7-0.9 for creative tasks, 0.3-0.5 for precise tool calls
  6. Batching: Process multiple tool calls in parallel when possible (see the sketch below)
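
For tip 6, a small sketch that runs independent tool calls concurrently with a thread pool; execute_one is a hypothetical dispatcher, and it assumes your tool functions (e.g. get_weather and search_web from the example above) accept keyword arguments matching the parsed call:

from concurrent.futures import ThreadPoolExecutor

def execute_one(call):
    # Hypothetical dispatcher from a parsed call to your own implementations.
    registry = {"get_weather": get_weather, "search_web": search_web}
    fn = registry[call["name"]]
    return {"name": call["name"], "response": fn(**call["args"])}

def execute_tools(calls):
    # Independent calls (e.g. get_weather + search_web) can run in parallel.
    with ThreadPoolExecutor(max_workers=max(1, len(calls))) as pool:
        return list(pool.map(execute_one, calls))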

License

This model inherits the Gemma license from the base model.

Citation

@misc{agent-gemma-litertlm,
  title={Agent Gemma 3n E2B - Tool Calling Edition},
  author={kontextdev},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/kontextdev/agent-gemma}}
}

Built with ❤️ for the on-device AI community
