TinyLlama-JSON-Intent (GPTQ 4-bit)
This is a fine-tuned version of TinyLlama/TinyLlama-1.1B-Chat-v1.0 that has been specifically trained to act as an e-commerce intent detection model. Given a catalog of products and a user's request, it outputs a structured JSON object representing the user's intent (add or remove), the product name, and the quantity.
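For example, a request like "Add two bottles of shampoo" against a catalog containing "Shampoo (400ml bottle)" would produce output of this shape (illustrative values):
{"action": "add", "product": "Shampoo (400ml bottle)", "quantity": 2}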
This version of the model is quantized to 4-bit using GPTQ, making it highly efficient for inference in terms of memory usage and speed. The QLoRA adapter was merged into the final GPTQ model—no separate adapter loading is required.
- Adapter Version: jtlicardo/tinyllama-ecommerce-intent-adapter
Model Description
The base model, TinyLlama-Chat, was fine-tuned using the QLoRA method on a synthetic dataset of 100 examples. The training objective was to teach the model to ignore conversational pleasantries and strictly output a JSON object that can be directly parsed by a backend system for managing a shopping cart.
Intended Use & Limitations
This model is designed for a specific task: parsing user requests in an e-commerce context. It should not be used as a general-purpose chatbot.
- Primary Use: Backend service for intent detection from user text.
- Out-of-Scope: General conversation, answering questions, or any task not related to adding/removing items from a list.
How to Use
The model expects a prompt formatted in a specific way, following the TinyLlama-Chat template. You must provide the Catalog and the User request.
Important: You need to install optimum and auto-gptq to run this 4-bit GPTQ model.
pip install -q optimum auto-gptq transformers
Here's how to run inference in Python:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
# Model repository on the Hugging Face Hub
model_id = "jtlicardo/tinyllama-ecommerce-intent-gptq"
# Load the tokenizer and the 4-bit quantized model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16  # Recommended for inference
)
# --- Define the prompt ---
catalog = """Catalog:
Shampoo (400ml bottle)
Hand Soap (250ml dispenser)
Peanut Butter (340g jar)
Headphones
Green Tea (25 tea bags)"""
user_query = "Could you please take off 4 pairs of headphons from my cart?"
# --- Format the prompt using the model's chat template ---
# The model was trained to see this structure.
prompt = f"<|user|>\n{catalog}\n\nUser:\n{user_query}\n<|assistant|>\n"
# --- Generate the output ---
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
outputs = pipe(
    prompt,
    max_new_tokens=50,       # Max length of the JSON output
    do_sample=False,         # Use deterministic output
    temperature=None,        # Not needed for do_sample=False
    top_p=None,              # Not needed for do_sample=False
    return_full_text=False   # Only return the generated part
)
# The output will be a clean JSON string
generated_json = outputs[0]['generated_text'].strip()
print(generated_json)
# Expected output:
# {"action": "remove", "product": "Headphones", "quantity": 4}
Training Procedure
This model was fine-tuned using the trl library's SFTTrainer.
- Method: QLoRA (4-bit quantization with LoRA adapters)
- Dataset: A custom JSONL file with 100 prompt/completion pairs (a sample record and a training sketch are shown below).
- Configuration: completion_only_loss=True was used to ensure the model only learned to generate the assistant's JSON response.
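The exact training records were not published; a hypothetical JSONL line, matching the prompt format shown in the inference example above, might look like:
{"prompt": "<|user|>\nCatalog:\nShampoo (400ml bottle)\n...\n\nUser:\nAdd two bottles of shampoo\n<|assistant|>\n", "completion": "{\"action\": \"add\", \"product\": \"Shampoo (400ml bottle)\", \"quantity\": 2}"}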
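A minimal sketch of the fine-tuning setup, assuming current trl/peft/transformers APIs; the file name, output directory, and LoRA hyperparameters are illustrative assumptions, not the values actually used:
from datasets import load_dataset
from peft import LoraConfig
from transformers import BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

# Load the 100-example prompt/completion dataset (file name is hypothetical)
dataset = load_dataset("json", data_files="ecommerce_intent.jsonl", split="train")

trainer = SFTTrainer(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="tinyllama-ecommerce-intent-adapter",
        completion_only_loss=True,  # compute loss only on the assistant's JSON completion
        # QLoRA: load the base model in 4-bit before attaching LoRA adapters
        model_init_kwargs={"quantization_config": BitsAndBytesConfig(load_in_4bit=True)},
    ),
    # LoRA adapter configuration (r/alpha/dropout values are illustrative)
    peft_config=LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"),
)
trainer.train()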