DiSTER - Term Extraction
Dataset and model for term extraction from the paper "Crossing Domains without Labels: Distant Supervision for Term Extraction".
This is a fine-tuned version of meta-llama/Meta-Llama-3-8B-Instruct, trained on our dataset SynTerm (ElenaSenger/SynTerm-fine-tuning) for cross-domain technical/scientific term extraction. The model is optimized to output a list of domain-relevant terms for a given input text.

Because the model was trained with conversation-style prompts, the same format should be used for inference:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "ElenaSenger/DiSTER-Llama-3-8B-Instruct"

# Load the tokenizer and model (device_map="auto" places weights on GPU if available)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

# Build the prompt in the same conversation format as in SynTerm-fine-tuning
prompt = (
    '{"id": "test_0", "conversations": [\n'
    '  {"from": "human", "value": "Text: We used dropout regularization and AdamW optimizer to train a CNN on MRI images."},\n'
    '  {"from": "gpt", "value": "I\'ve read this text."},\n'
    '  {"from": "human", "value": "What describes (technical or scientific) terms in the text, that are relevant to the domain medical-imaging?"},\n'
    '  {"from": "gpt", "value": ""}\n'
    ']}\n'
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,   # enable sampling so that temperature takes effect
        temperature=0.1,  # low temperature keeps output close to greedy decoding
        top_p=1.0,
    )

decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("=== Decoded output ===")
print(decoded)

# The generation echoes the prompt; strip it to keep only the model's reply
reply = decoded[len(prompt):].strip()
print("\n=== Model reply only ===")
print(reply)
```
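The reply is free text, so downstream use typically needs a small amount of post-processing. The sketch below is a minimal, hypothetical example: `build_prompt` is our own helper that reproduces the prompt layout shown above for arbitrary inputs, and `parse_terms` assumes the reply is a comma- or newline-separated list of terms. Neither helper is part of the released model or dataset; adapt them to the output format you actually observe.

```python
import re


def build_prompt(text: str, domain: str, example_id: str = "test_0") -> str:
    """Hypothetical helper: rebuild the conversation-format prompt shown above.

    Escaping of quotes inside `text` is omitted for brevity.
    """
    return (
        f'{{"id": "{example_id}", "conversations": [\n'
        f'  {{"from": "human", "value": "Text: {text}"}},\n'
        '  {"from": "gpt", "value": "I\'ve read this text."},\n'
        '  {"from": "human", "value": "What describes (technical or scientific) terms in the text, '
        f'that are relevant to the domain {domain}?"}},\n'
        '  {"from": "gpt", "value": ""}\n'
        ']}\n'
    )


def parse_terms(reply: str) -> list[str]:
    """Hypothetical helper: split a reply into terms, assuming comma/newline separators."""
    parts = re.split(r"[,\n]", reply)
    return [p.strip(" \t'\"") for p in parts if p.strip(" \t'\"")]


prompt = build_prompt(
    "We used dropout regularization and AdamW optimizer to train a CNN on MRI images.",
    "medical-imaging",
)
# ... run generation as above, then:
# print(parse_terms(reply))
```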