---
language: ko
license: mit
tags:
- sentence-transformers
- semantic-search
- medical
- pharmaceutical
- korean
datasets:
- drug_product_similarity_train
library_name: sentence-transformers
pipeline_tag: feature-extraction
base_model: intfloat/multilingual-e5-small
model_name: Yoonyoul/fine-tuned-e5-small-drugproduct
model_type: sentence-transformer
---

# 🧬 Fine-tuned E5-small for Korean Drug Product Semantic Embedding

## 📘 Model Overview

This is a SentenceTransformer model built on **[intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small)** and adapted to the Korean pharmaceutical domain through **three-stage fine-tuning**. Training used drug summary and detail data (`drug_summary`, `drug_details`), product type definitions (`drug_type_definition`), and DUR (Drug Utilization Review) regulation definitions (`drug_dur_type_definition`).

- GitHub Repository: [https://github.com/ryukato/fine-tuned-e5-drugmodel](https://github.com/ryukato/fine-tuned-e5-drugmodel)

---

## 🧩 Base Model Selection Rationale

This project selected `intfloat/multilingual-e5-small` from the **E5 (multilingual-E5)** family in order to **accurately embed the complex semantic relationships among drug names, efficacy, and DUR regulations**, even in a multilingual setting. The reasons for the choice are as follows:

1. **Multilingual sentence representation**
   - It shows balanced semantic representation across many languages, including Korean, Japanese, Chinese, and German as well as English.
   - Drug data often mixes loanwords with academic terminology, so a multilingual encoder is an advantage.
2. **Efficient parameter size for its performance (Small Variant)**
   - The `small` model has about **33M parameters**, so fine-tuning ran stably even in local environments such as M1/M2 MacBooks.
   - FP16 or bfloat16 support allows efficient computation on GPU and MPS.
3. **Optimized for sentence-level semantic retrieval**
   - E5 models are trained for sentence-level semantic embedding, so they perform well at matching simple queries (`"기침약"`, "cough medicine"; `"열 내리는 약"`, "medicine that lowers a fever") with product names (`"판콜에이"`, `"타이레놀"`).
4. **Full compatibility with Sentence-Transformers**
   - The model works directly with the `SentenceTransformer` interface, which made integration into a PyTorch-based pipeline straightforward.

---

## 🔹 Step 1: Drug Type Semantic Alignment

- Dataset: `drug_type_def_list.csv`
- Goal: learn concept mappings such as `"해열제" → "체온을 낮추는 약"` ("antipyretic" → "medicine that lowers body temperature")
- Resulting model: `/model/fine_tuned_e5_small_drugtype`

## 🔹 Step 2: DUR Type Semantic Alignment

- Dataset: `drug_dur_type_similarity_train.csv`
- Goal: learn semantic mappings between DUR types such as `"임부금기"` (contraindicated in pregnancy), `"노인주의"` (caution in the elderly), and `"병용금기"` (contraindicated in combination) and their technical descriptions
- Resulting model: `/model/fine_tuned_e5_small_drugdurtype`

## 🔹 Step 3: Drug Product Semantic Alignment

- Dataset: `drug_product_similarity_train.csv` (about 3,000 samples)
- Goal: strengthen semantic matching between actual products such as `"판콜에이내복액"` and queries such as `"열을 내리는 약"` ("medicine that lowers a fever")
- Resulting model: `/model/fine_tuned_e5_small_drugproduct_accum`

---

## 🔹 Experimental: Drug Ingredient + Product Type Fine-tuning

Starting from the `fine_tuned_e5_small_drugdurtype` model, an additional embedding training run (`fine_tuned_e5_small_drug_ptype_ingredients`) was carried out that combines drug ingredients (`ingredient_name`) with product types (`product_type`).
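As a rough illustration of how such combined training text could be produced, the sketch below fills the sentence template listed in the table that follows with an ingredient name and a product type. The `IngredientRecord` class, the `build_training_sentences` helper, the placeholder names inside `TEMPLATE`, and the de-duplication step are all assumptions made for this example; the repository's actual preprocessing may differ.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Sentence template from the "training data" row of the table below.
# The {ingredient_name} / {product_type} placeholders are hypothetical names
# standing in for the "성분명" and "제품유형" slots of that template.
TEMPLATE = "{ingredient_name}은(는) {product_type} 제제에서 사용되는 의약 성분이다."


@dataclass(frozen=True)
class IngredientRecord:
    """One (ingredient, product type) pair; a hypothetical container."""
    ingredient_name: str
    product_type: str


def build_training_sentences(records: Iterable[IngredientRecord]) -> List[str]:
    """Fill the template for each record, dropping exact duplicates.

    De-duplication is an assumption of this sketch, not a documented step.
    """
    seen = set()
    sentences = []
    for record in records:
        sentence = TEMPLATE.format(
            ingredient_name=record.ingredient_name.strip(),
            product_type=record.product_type.strip(),
        )
        if sentence not in seen:
            seen.add(sentence)
            sentences.append(sentence)
    return sentences


if __name__ == "__main__":
    demo = [IngredientRecord("이부프로펜", "소염·진통제")]
    for line in build_training_sentences(demo):
        print(line)
```

Because every generated sentence shares the same frame and differs only in the two filled slots, a dataset built this way is highly repetitive, which is relevant when reading the observations reported for this run.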
### ⚙️ Applied Settings

| Item | Value |
|------|-------|
| **Training data** | `"성분명은(는) 제품유형 제제에서 사용되는 의약 성분이다."` ("<ingredient name> is a pharmaceutical ingredient used in <product type> preparations.") |
| **Sample size** | 1,289 |
| **Average loss** | 0.0012 |
| **Similarity evaluation** | Semantic separation was insufficient |
| **Observed example** | `이부프로펜` (ibuprofen), from the "소염·진통제" (anti-inflammatory analgesic) class, and the unrelated ingredients `염화나트륨` (sodium chloride) and `세티리진` (cetirizine) all showed similarity scores of about 0.91~0.94 |

### 📋 Observations

- The model converged stably, but because of the repetitive sentence pattern and the positive-only data composition, semantic boundaries between efficacy groups did not form properly.
- The overall similarity distribution converged to excessively high values, which suggests the model learned the stylistic pattern of the sentences rather than their meaning.
- **In conclusion, combined ingredient and product-type training was found to provide no practical benefit for semantic search quality, so it is not applied in the current pipeline.**

---

## 🧠 Use Case Example

```python
from sentence_transformers import SentenceTransformer, util

# Load the fine-tuned model from the Hugging Face Hub.
model = SentenceTransformer("Yoonyoul/fine-tuned-e5-small-drugproduct")

# Query: "Which medicine lowers a fever?"
query = "열을 내리는 약은?"
docs = [
    "판콜에이내복액은 해열진통제입니다.",  # Pancol A oral solution is an antipyretic analgesic.
    "마이암부톨정은 항결핵제입니다.",      # Myambutol tablets are an antituberculosis agent.
    "지르텍정은 항히스타민제입니다.",      # Zyrtec tablets are an antihistamine.
]

# Encode the query and the documents, then score each document by cosine similarity.
emb_q = model.encode(query, convert_to_tensor=True)
emb_d = model.encode(docs, convert_to_tensor=True)
scores = util.cos_sim(emb_q, emb_d)[0]

for doc, score in zip(docs, scores):
    print(f"{doc} → similarity: {score.item():.4f}")
```

---

## ⚙️ Training Environment

| Item | Version |
|------|---------|
| Python | 3.12.4 |
| torch | 2.4.1 |
| transformers | 4.44.2 |
| sentence-transformers | 3.0.1 |
| accelerate | 0.27.0 |
| pandas | 2.2.3 |

---

## 📅 Release Info

- Author: **@Yoonyoul**
- Base Model: `intfloat/multilingual-e5-small`
- Fine-tuned Model: `Yoonyoul/fine-tuned-e5-small-drugproduct`
- Repository: [https://github.com/ryukato/fine-tuned-e5-drugmodel](https://github.com/ryukato/fine-tuned-e5-drugmodel)
- Last Updated: **2025-10-27**
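
To compare a local setup against the Training Environment table above, a small check along the following lines may help. It is only a sketch: the `EXPECTED_VERSIONS` mapping and the `find_version_mismatches` helper are hypothetical names introduced here, and dropping local build suffixes (such as `+cu121`) before comparing is an assumption about what counts as a match.

```python
from importlib import metadata

# Package versions copied from the Training Environment table.
# Python itself (3.12.4 in the table) is not checked here.
EXPECTED_VERSIONS = {
    "torch": "2.4.1",
    "transformers": "4.44.2",
    "sentence-transformers": "3.0.1",
    "accelerate": "0.27.0",
    "pandas": "2.2.3",
}


def find_version_mismatches(expected, get_version=metadata.version):
    """Return {package: (expected, installed)} for packages that differ.

    `installed` is None when the package is not installed. A local build
    suffix such as "+cu121" is ignored when comparing (an assumption).
    """
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        public = installed.split("+", 1)[0] if installed else None
        if public != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, installed) in find_version_mismatches(EXPECTED_VERSIONS).items():
        print(f"{package}: expected {wanted}, found {installed or 'not installed'}")
```

A version mismatch does not necessarily prevent the model from loading; the table records the environment used for training, so the check is mainly useful when trying to reproduce the fine-tuning runs.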