TinyLlama-JSON-Intent (GPTQ 4-bit)

This is a fine-tuned version of TinyLlama/TinyLlama-1.1B-Chat-v1.0 trained to act as an e-commerce intent detection model. Given a catalog of products and a user's request, it outputs a structured JSON object containing the user's intent (add or remove), the product name, and the quantity.
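
For example (illustrative; the field names match the inference example later in this card), a request such as "Add 2 jars of peanut butter" would produce:

{"action": "add", "product": "Peanut Butter (340g jar)", "quantity": 2}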

This version of the model is quantized to 4-bit with GPTQ, which reduces memory usage and speeds up inference. The QLoRA adapter has already been merged into the final GPTQ model, so no separate adapter loading is required.

Model Description

The base model, TinyLlama-Chat, was fine-tuned using the QLoRA method on a synthetic dataset of 100 examples. The training objective was to teach the model to ignore conversational pleasantries and strictly output a JSON object that can be directly parsed by a backend system for managing a shopping cart.
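
Below is a hypothetical training record in the prompt/completion JSONL layout described under Training Procedure; the catalog, request, and product strings are illustrative and not taken from the actual dataset.

{"prompt": "<|user|>\nCatalog:\nHeadphones\nGreen Tea (25 tea bags)\n\nUser:\nPlease add two boxes of green tea.\n<|assistant|>\n", "completion": "{\"action\": \"add\", \"product\": \"Green Tea (25 tea bags)\", \"quantity\": 2}"}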

Intended Use & Limitations

This model is designed for a specific task: parsing user requests in an e-commerce context. It should not be used as a general-purpose chatbot.

  • Primary Use: Backend service for intent detection from user text.
  • Out-of-Scope: General conversation, answering questions, or any task not related to adding/removing items from a list.

How to Use

The model expects a prompt in the TinyLlama-Chat template, containing both the Catalog and the User request, as shown in the example below.

Important: You need to install optimum and auto-gptq to run this 4-bit GPTQ model.

pip install -q optimum auto-gptq transformers

Here's how to run inference in Python:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Model repository on the Hugging Face Hub
model_id = "jtlicardo/tinyllama-ecommerce-intent-gptq"

# Load the tokenizer and the 4-bit quantized model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16 # Recommended for inference
)

# --- Define the prompt ---
catalog = """Catalog:
Shampoo (400ml bottle)
Hand Soap (250ml dispenser)
Peanut Butter (340g jar)
Headphones
Green Tea (25 tea bags)"""

user_query = "Could you please take off 4 pairs of headphons from my cart?"

# --- Format the prompt using the model's chat template ---
# The model was trained to see this structure.
prompt = f"<|user|>\n{catalog}\n\nUser:\n{user_query}\n<|assistant|>\n"

# --- Generate the output ---
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
outputs = pipe(
    prompt,
    max_new_tokens=50,       # Max length of the JSON output
    do_sample=False,         # Use deterministic output
    temperature=None,        # Not needed for do_sample=False
    top_p=None,              # Not needed for do_sample=False
    return_full_text=False   # Only return the generated part
)

# The output will be a clean JSON string
generated_json = outputs[0]['generated_text'].strip()
print(generated_json)
# Expected output:
# {"action": "remove", "product": "Headphones", "quantity": 4}

Training Procedure

This model was fine-tuned using the trl library's SFTTrainer (a configuration sketch follows the summary below).

  • Method: QLoRA (4-bit quantization with LoRA adapters)
  • Dataset: A custom JSONL file with 100 prompt/completion pairs.
  • Configuration: completion_only_loss=True was used to ensure the model only learned to generate the assistant's JSON response.
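
The exact training script is not included in this card; the snippet below is a hedged sketch of a comparable QLoRA + SFTTrainer setup. The hyperparameters, LoRA targets, and the intent_dataset.jsonl file name are illustrative assumptions; only completion_only_loss=True and the prompt/completion dataset layout mirror the configuration described above.

import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

base_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Load the base model in 4-bit (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Prompt/completion pairs; the file name here is hypothetical
dataset = load_dataset("json", data_files="intent_dataset.jsonl", split="train")

# LoRA adapters on the attention projections (illustrative choice)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# completion_only_loss=True: compute the loss only on the assistant's JSON completion
training_args = SFTConfig(
    output_dir="tinyllama-ecommerce-intent",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-4,
    completion_only_loss=True,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()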