metadata
language: ko
license: mit
tags:
  - sentence-transformers
  - semantic-search
  - medical
  - pharmaceutical
  - korean
datasets:
  - drug_product_similarity_train
library_name: sentence-transformers
pipeline_tag: feature-extraction
base_model: intfloat/multilingual-e5-small
model_name: Yoonyoul/fine-tuned-e5-small-drugproduct
model_type: sentence-transformer

🧬 Fine-tuned E5-small for Korean Drug Product Semantic Embedding

📘 Model Overview

This is a SentenceTransformer model based on intfloat/multilingual-e5-small.
It was fine-tuned in three stages for the Korean pharmaceutical domain using drug summary and detail data (drug_summary, drug_details), product type definitions (drug_type_definition), and DUR (Drug Utilization Review) regulation definitions (drug_dur_type_definition).


🧩 Base Model Selection Rationale

To embed the complex semantic relationships among drug names, efficacy, and DUR regulations accurately, even in a multilingual setting,
this project selected intfloat/multilingual-e5-small from the E5 (multilingual-E5) model family.

The reasons for this choice are as follows:

  1. Multilingual sentence representation

    • Delivers balanced semantic representation across many languages, including Korean, Japanese, Chinese, and German as well as English.
    • Drug data often mixes loanwords and academic terminology, so a multilingual encoder is advantageous.
  2. Parameter size that is efficient for its performance (Small Variant)

    • The small model has about 33M parameters, so it could be fine-tuned stably even in local environments such as M1/M2 MacBooks.
    • FP16 and bfloat16 support allows efficient computation on GPU and MPS backends.
  3. Optimized for sentence-level semantic retrieval

    • E5 models are trained for sentence-level semantic embedding (Sentence Embedding),
      so they perform well at matching simple queries ("기침약" [cough medicine], "열 내리는 약" [fever-reducing medicine]) to product names ("판콜에이", "타이레놀" [Tylenol]).
  4. Full compatibility with Sentence-Transformers

    • The model is fully compatible with the SentenceTransformer interface, which made integration into a PyTorch-based pipeline straightforward.

🔹 Step 1: Drug Type Semantic Alignment

  • Dataset: drug_type_def_list.csv
  • Goal: learn concept mappings such as "해열제" (antipyretic) → "체온을 낮추는 약" (a medicine that lowers body temperature)
  • Output model: /model/fine_tuned_e5_small_drugtype
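
As a rough illustration of the concept mapping in Step 1, the sketch below turns term/definition rows into (anchor, positive) text pairs. The column names `term` and `definition`, the `load_pairs` helper, and the second sample row are assumptions made for illustration; the actual schema of drug_type_def_list.csv is not documented in this card.

```python
import csv
import io

# Hypothetical sample rows; the real drug_type_def_list.csv schema is not documented here.
# Row 1 is the example quoted in Step 1; row 2 (antitussive -> "a medicine that calms a cough")
# is made up for illustration.
SAMPLE_CSV = """term,definition
해열제,체온을 낮추는 약
진해제,기침을 가라앉히는 약
"""

def load_pairs(csv_text):
    """Return (anchor, positive) text pairs from CSV text with `term` and `definition` columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    pairs = []
    for row in reader:
        term = row["term"].strip()
        definition = row["definition"].strip()
        if term and definition:  # skip incomplete rows
            pairs.append((term, definition))
    return pairs

pairs = load_pairs(SAMPLE_CSV)
for anchor, positive in pairs:
    print(f"{anchor} -> {positive}")
```

Pairs in this shape could then be fed to whichever pairwise training objective is used; the card does not say which loss was applied.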

🔹 Step 2: DUR Type Semantic Alignment

  • Dataset: drug_dur_type_similarity_train.csv
  • Goal: learn semantic mappings between DUR types such as "임부금기" (contraindicated in pregnancy), "노인주의" (caution in the elderly), and "병용금기" (contraindicated in combination) and their technical descriptions
  • Output model: /model/fine_tuned_e5_small_drugdurtype

🔹 Step 3: Drug Product Semantic Alignment

  • Dataset: drug_product_similarity_train.csv (about 3,000 samples)
  • Goal: strengthen semantic matching between real products such as "판콜에이내복액" and queries such as "열을 내리는 약" (a medicine that brings down a fever)
  • Output model: /model/fine_tuned_e5_small_drugproduct_accum
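
The three steps can be summarized as a small staged plan. The sketch below assumes each step starts from the previous step's output, which the card implies but does not state outright. The `Stage` class and `build_plan` helper are hypothetical names used only for illustration; the dataset files and output paths are the ones listed in Steps 1 to 3.

```python
from dataclasses import dataclass

BASE_MODEL = "intfloat/multilingual-e5-small"

@dataclass(frozen=True)
class Stage:
    name: str
    dataset: str
    output_dir: str

# Dataset files and output paths as listed in Steps 1-3.
STAGES = [
    Stage("drug_type", "drug_type_def_list.csv",
          "/model/fine_tuned_e5_small_drugtype"),
    Stage("dur_type", "drug_dur_type_similarity_train.csv",
          "/model/fine_tuned_e5_small_drugdurtype"),
    Stage("drug_product", "drug_product_similarity_train.csv",
          "/model/fine_tuned_e5_small_drugproduct_accum"),
]

def build_plan(base_model, stages):
    """Chain stages so each one starts from the previous stage's output (an assumption)."""
    plan = []
    start = base_model
    for stage in stages:
        plan.append({
            "stage": stage.name,
            "init_from": start,
            "dataset": stage.dataset,
            "save_to": stage.output_dir,
        })
        start = stage.output_dir  # the next stage resumes from this checkpoint
    return plan

plan = build_plan(BASE_MODEL, STAGES)
for step in plan:
    print(f"{step['stage']}: {step['init_from']} + {step['dataset']} -> {step['save_to']}")
```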

🔹 Experimental: Drug Ingredient + Product Type Fine-tuning

Starting from the fine_tuned_e5_small_drugdurtype model,
an additional embedding training run (fine_tuned_e5_small_drug_ptype_ingredients) was carried out
that combines drug ingredients (ingredient_name) with product types (product_type).

⚙️ Setup and Results

| Item | Value |
|---|---|
| Training data | Template sentence: "성분명은(는) 제품유형 제제에서 사용되는 의약 성분이다." ("[ingredient name] is a pharmaceutical ingredient used in [product type] preparations.") |
| Sample size | 1,289 |
| Average loss | 0.0012 |
| Similarity evaluation | Semantic distinctions were not sufficiently learned |
| Observed example | Ibuprofen, which belongs to the "소염·진통제" (anti-inflammatory/analgesic) class, and the unrelated ingredients sodium chloride and cetirizine all scored similarities of about 0.91 to 0.94 |
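
To make the training-data format concrete, here is a minimal sketch of how such template sentences might be generated. It assumes the ingredient name and product type are substituted directly into the quoted template; the `build_sentences` and `token_overlap` helpers and the cetirizine record are hypothetical and not taken from the project code.

```python
# Template quoted in the table; {ingredient} and {product_type} mark the two slots.
TEMPLATE = "{ingredient}은(는) {product_type} 제제에서 사용되는 의약 성분이다."

def build_sentences(records):
    """Fill the template for each (ingredient_name, product_type) record."""
    return [TEMPLATE.format(ingredient=ing, product_type=ptype) for ing, ptype in records]

def token_overlap(a, b):
    """Jaccard overlap of whitespace-separated tokens, a crude measure of repetition."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Hypothetical records; only the ibuprofen pairing comes from the card.
records = [
    ("이부프로펜", "소염·진통제"),  # ibuprofen, anti-inflammatory/analgesic
    ("세티리진", "항히스타민제"),    # cetirizine, antihistamine (assumed pairing)
]
sentences = build_sentences(records)
for sentence in sentences:
    print(sentence)
print(f"token overlap: {token_overlap(sentences[0], sentences[1]):.2f}")
```

Even for two unrelated ingredients, most tokens in the generated sentences are identical, which is consistent with the repetitive sentence pattern noted for this experiment.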

📋 Observations

  • The model converged stably, but because the sentence pattern was repetitive and the data contained only positive pairs,
    semantic boundaries between efficacy groups did not form properly.
  • The overall similarity distribution converged to excessively high values, suggesting the model learned the stylistic pattern of the sentences rather than their meaning.
  • In conclusion, combined ingredient/product-type training was found to offer no practical benefit for semantic search quality, and it was decided not to apply it to the current pipeline.
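
One simple way to spot this kind of collapse is to compute the mean pairwise cosine similarity among embeddings of items that should be unrelated. The sketch below uses toy vectors rather than real model output: each vector has a large shared component, standing in for the common sentence template, plus a small item-specific component. The helper names are illustrative only.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy vectors, not real model output: the shared first coordinate dominates,
# and each item differs only in one small coordinate.
toy = {
    "ibuprofen":       [3.0, 1.0, 0.0, 0.0],
    "sodium chloride": [3.0, 0.0, 1.0, 0.0],
    "cetirizine":      [3.0, 0.0, 0.0, 1.0],
}
mean_sim = mean_pairwise_cosine(list(toy.values()))
print(f"mean pairwise cosine: {mean_sim:.2f}")
```

With real embeddings, a mean near the top of the range for unrelated ingredients would point to the same problem described in the observations.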

🧠 Use Case Example

from sentence_transformers import SentenceTransformer, util

# Load the fine-tuned model from the Hugging Face Hub
model = SentenceTransformer("Yoonyoul/fine-tuned-e5-small-drugproduct")

# Query: "Which medicine brings down a fever?"
query = "열을 내리는 약은?"
# Candidates: an antipyretic-analgesic, an antituberculosis agent, an antihistamine
docs = [
    "판콜에이내복액은 해열진통제입니다.",
    "마이암부톨정은 항결핵제입니다.",
    "지르텍정은 항히스타민제입니다."
]

emb_q = model.encode(query, convert_to_tensor=True)
emb_d = model.encode(docs, convert_to_tensor=True)

# Cosine similarity between the query and each document
scores = util.cos_sim(emb_q, emb_d)[0]
for doc, score in zip(docs, scores):
    print(f"{doc} → similarity: {score.item():.4f}")
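
For retrieval, the scores are usually sorted so the best match comes first. The sketch below shows one way to do that with plain Python floats. The `rank_documents` helper is a hypothetical name, and the scores are placeholders for illustration, not actual model output.

```python
def rank_documents(docs, scores, top_k=None):
    """Return (doc, score) pairs sorted by descending score, optionally cut to top_k."""
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]

# Placeholder inputs for illustration only. With the model loaded, pass the real
# documents and [s.item() for s in scores] from the example above instead.
sample_docs = ["doc A", "doc B", "doc C"]
placeholder_scores = [0.42, 0.87, 0.55]

top = rank_documents(sample_docs, placeholder_scores, top_k=2)
for doc, score in top:
    print(f"{score:.2f}  {doc}")
```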

⚙️ Training Environment

| Item | Version |
|---|---|
| Python | 3.12.4 |
| torch | 2.4.1 |
| transformers | 4.44.2 |
| sentence-transformers | 3.0.1 |
| accelerate | 0.27.0 |
| pandas | 2.2.3 |
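
To compare a local environment against these versions, a small check along the following lines may help. It is only a sketch: the `find_mismatches` helper is a hypothetical name, and it relies solely on the standard-library `importlib.metadata` module.

```python
from importlib import metadata

# Package versions listed in the training environment table.
EXPECTED = {
    "torch": "2.4.1",
    "transformers": "4.44.2",
    "sentence-transformers": "3.0.1",
    "accelerate": "0.27.0",
    "pandas": "2.2.3",
}

def find_mismatches(expected, get_version=metadata.version):
    """Map each package to (expected, installed) where they differ; installed is None if missing."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches

for name, (wanted, installed) in find_mismatches(EXPECTED).items():
    print(f"{name}: expected {wanted}, found {installed}")
```

The comparison is an exact string match, so a build with a local suffix (for example a CUDA-tagged torch wheel) would be reported as a mismatch.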

📅 Release Info