---
language: ko
license: mit
tags:
  - sentence-transformers
  - semantic-search
  - medical
  - pharmaceutical
  - korean
datasets:
  - drug_product_similarity_train
library_name: sentence-transformers
pipeline_tag: feature-extraction
base_model: intfloat/multilingual-e5-small
model_name: Yoonyoul/fine-tuned-e5-small-drugproduct
model_type: sentence-transformer
---

# 🧬 Fine-tuned E5-small for Korean Drug Product Semantic Embedding

## 📘 Model Overview
This is a SentenceTransformer model based on **[intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small)**.  
It was **fine-tuned in three stages** for the Korean pharmaceutical domain, using drug summary and detail data (`drug_summary`, `drug_details`), product type definitions (`drug_type_definition`), and DUR (Drug Utilization Review) regulation definitions (`drug_dur_type_definition`).

- GitHub Repository: [https://github.com/ryukato/fine-tuned-e5-drugmodel](https://github.com/ryukato/fine-tuned-e5-drugmodel)

---

## 🧩 Base Model Selection Rationale

To **accurately embed the complex semantic relationships among drug names, efficacy, and DUR regulations**, even in a multilingual setting,  
this project chose `intfloat/multilingual-e5-small` from the **E5 (multilingual-E5)** model family.

The reasons for this choice are as follows:

1. **Multilingual sentence representation**  
   - The model shows balanced semantic representation performance across many languages, including Korean, Japanese, Chinese, and German as well as English.  
   - Drug data often mixes loanwords and scientific terminology, so a multilingual encoder is advantageous.

2. **Efficient parameter size relative to performance (Small Variant)**  
   - The `small` model has about **33M parameters**, which made stable fine-tuning possible even in local environments such as M1/M2 MacBooks.  
   - FP16 or bfloat16 support allows efficient computation on GPU and MPS devices.

3. **Optimized for sentence-level semantic retrieval**  
   - E5 models are trained for sentence-level semantic embedding (Sentence Embedding),  
     so they perform well at matching simple queries (`"기침약"` "cough medicine", `"열 내리는 약"` "fever-reducing medicine") to product names (`"판콜에이"` Pancol-A, `"타이레놀"` Tylenol).

4. **Full compatibility with Sentence-Transformers**  
   - The model is fully compatible with the `SentenceTransformer` interface, which made integration into a PyTorch-based pipeline straightforward.
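
Item 2 mentions GPU and MPS devices. A minimal sketch of how a device might be chosen before loading the model, assuming PyTorch is installed. The helper names `pick_device` and `detect_device` are hypothetical and not part of this project:

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple-silicon MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Ask PyTorch which backends are available; fall back to CPU if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return pick_device(torch.cuda.is_available(), torch.backends.mps.is_available())


if __name__ == "__main__":
    print(detect_device())
```

The returned string can be passed as the `device` argument when constructing a `SentenceTransformer`.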

---

## 🔹 Step 1: Drug Type Semantic Alignment
- Dataset: `drug_type_def_list.csv`  
- Goal: learn concept mappings such as `"해열제" → "체온을 낮추는 약"` ("antipyretic" → "medicine that lowers body temperature")  
- Output model: `/model/fine_tuned_e5_small_drugtype`

## 🔹 Step 2: DUR Type Semantic Alignment
- Dataset: `drug_dur_type_similarity_train.csv`  
- Goal: learn the semantic mapping between DUR types such as `"임부금기"` (contraindicated in pregnancy), `"노인주의"` (caution in the elderly), and `"병용금기"` (contraindicated combination) and their technical descriptions  
- Output model: `/model/fine_tuned_e5_small_drugdurtype`

## 🔹 Step 3: Drug Product Semantic Alignment
- Dataset: `drug_product_similarity_train.csv` (about 3,000 samples)  
- Goal: strengthen semantic matching between real products such as `"판콜에이내복액"` (Pancol-A oral solution) and queries such as `"열을 내리는 약"` ("medicine that reduces fever")  
- Output model: `/model/fine_tuned_e5_small_drugproduct_accum`
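
The three steps can be laid out as a chain of checkpoints. The sketch below only plans the chain and does not train anything. It assumes that each step starts from the previous step's output model, which this card implies but does not state outright. `Stage` and `plan` are hypothetical names used for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    dataset: str      # training CSV named in the step
    output_dir: str   # model directory written by the step


BASE_MODEL = "intfloat/multilingual-e5-small"

STAGES = [
    Stage("drug_type", "drug_type_def_list.csv",
          "/model/fine_tuned_e5_small_drugtype"),
    Stage("dur_type", "drug_dur_type_similarity_train.csv",
          "/model/fine_tuned_e5_small_drugdurtype"),
    Stage("drug_product", "drug_product_similarity_train.csv",
          "/model/fine_tuned_e5_small_drugproduct_accum"),
]


def plan(stages, base_model=BASE_MODEL):
    """Return one (start_checkpoint, dataset, output_dir) tuple per stage.

    Assumption: every stage resumes from the previous stage's output.
    """
    steps, start = [], base_model
    for stage in stages:
        steps.append((start, stage.dataset, stage.output_dir))
        start = stage.output_dir
    return steps


for start, dataset, out in plan(STAGES):
    print(f"{start} + {dataset} -> {out}")
```

Each tuple would then be handed to whatever training routine the project uses; the loss function and CSV column layout are not described here, so they are left out of the sketch.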

---

## 🔹 Experimental: Drug Ingredient + Product Type Fine-tuning

Starting from the `fine_tuned_e5_small_drugdurtype` model,  
an additional embedding training run (`fine_tuned_e5_small_drug_ptype_ingredients`) was performed  
that combines drug ingredients (`ingredient_name`) with product types (`product_type`).  

### ⚙️ Applied Settings
| Item | Value |
|------|-------|
| **Training data** | `"성분명은(는) 제품유형 제제에서 사용되는 의약 성분이다."` ("{ingredient name} is a pharmaceutical ingredient used in {product type} preparations.") |
| **Sample size** | 1,289 |
| **Average loss** | 0.0012 |
| **Similarity evaluation** | Insufficient semantic separation |
| **Observed example** | `이부프로펜` (ibuprofen), from the anti-inflammatory and analgesic class, and the unrelated ingredients `염화나트륨` (sodium chloride) and `세티리진` (cetirizine) all score in the 0.91 to 0.94 similarity range |
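
To illustrate how much text such templated sentences share, the sketch below fills the template with the three ingredients from the observed example and measures character-bigram overlap between the results. The product-type strings are made-up placeholders, not values from the real dataset, and bigram overlap is only a rough proxy for what the tokenizer sees:

```python
TEMPLATE = "{ingredient}은(는) {product_type} 제제에서 사용되는 의약 성분이다."

# Ingredients come from the observed example; the product types are
# illustrative placeholders, not values taken from the real dataset.
ROWS = [
    ("이부프로펜", "소염진통제"),
    ("염화나트륨", "수액제"),
    ("세티리진", "항히스타민제"),
]


def bigrams(text: str) -> set:
    """Character bigrams, a crude stand-in for subword tokens."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a: str, b: str) -> float:
    """Share of bigrams that two sentences have in common."""
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y)


sentences = [TEMPLATE.format(ingredient=i, product_type=p) for i, p in ROWS]
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        score = jaccard(sentences[i], sentences[j])
        print(f"{ROWS[i][0]} vs {ROWS[j][0]}: {score:.2f}")
```

Because the fixed template makes up most of each sentence, unrelated ingredients still produce sentences with a large shared surface form.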

### 📋 Observations
- The model converged stably, but because of the repetitive sentence pattern and the positive-only data composition,  
  semantic boundaries between efficacy groups did not form properly.  
- The overall similarity distribution converged to excessively high values, which suggests the model learned the sentence style rather than the meaning.  
- **In conclusion, combined ingredient and product-type training was found to give no substantial improvement in semantic search quality, so it is not applied in the current pipeline.**
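
One simple way to check for this kind of collapse is to look at the range of pairwise cosine similarities. The sketch below uses made-up 3-dimensional vectors in place of real embeddings; `similarity_spread` is a hypothetical helper, not part of the project:

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_spread(vectors):
    """Min, max and mean cosine similarity over all distinct pairs."""
    sims = [cosine(u, v) for u, v in combinations(vectors, 2)]
    return {"min": min(sims), "max": max(sims), "mean": sum(sims) / len(sims)}


# Toy vectors standing in for real embeddings (made-up numbers).
collapsed = [[1.0, 0.9, 1.0], [1.0, 1.0, 0.9], [0.9, 1.0, 1.0]]
separated = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(similarity_spread(collapsed))
print(similarity_spread(separated))
```

In practice the vectors would come from `model.encode(...)` on ingredients drawn from different efficacy groups. A narrow, high range such as the 0.91 to 0.94 reported above would point to poor separation.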

---

## 🧠 Use Case Example

```python
from sentence_transformers import SentenceTransformer, util

# Load the fine-tuned model from the Hugging Face Hub
model = SentenceTransformer("Yoonyoul/fine-tuned-e5-small-drugproduct")

# Query: "Which medicine reduces fever?"
query = "열을 내리는 약은?"
docs = [
    "판콜에이내복액은 해열진통제입니다.",  # Pancol-A oral solution is an antipyretic analgesic.
    "마이암부톨정은 항결핵제입니다.",  # Myambutol tablets are an antituberculosis agent.
    "지르텍정은 항히스타민제입니다.",  # Zyrtec tablets are an antihistamine.
]

# Encode the query and the documents as tensors
emb_q = model.encode(query, convert_to_tensor=True)
emb_d = model.encode(docs, convert_to_tensor=True)

# Cosine similarity between the query and each document
scores = util.cos_sim(emb_q, emb_d)[0]
for doc, score in zip(docs, scores):
    print(f"{doc} → similarity: {score.item():.4f}")
```
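
The base E5 models are documented as expecting a `query: ` prefix on queries and a `passage: ` prefix on documents. This card does not say whether the fine-tuned checkpoint was trained with those prefixes, so the sketch below is an optional variant to compare against the unprefixed example above, not a stated requirement. The helper name `with_prefix` is hypothetical:

```python
def with_prefix(texts, kind):
    """Prepend the E5-style role prefix ("query" or "passage") to each text."""
    if kind not in ("query", "passage"):
        raise ValueError(f"unknown kind: {kind!r}")
    return [f"{kind}: {t}" for t in texts]


queries = with_prefix(["열을 내리는 약은?"], "query")
passages = with_prefix(
    ["판콜에이내복액은 해열진통제입니다.", "지르텍정은 항히스타민제입니다."],
    "passage",
)
print(queries)
print(passages)
```

The prefixed lists can be passed to `model.encode(...)` in place of the raw strings to see whether ranking quality changes.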

---

## ⚙️ Training Environment

| Component | Version |
|-----------|---------|
| Python | 3.12.4 |
| torch | 2.4.1 |
| transformers | 4.44.2 |
| sentence-transformers | 3.0.1 |
| accelerate | 0.27.0 |
| pandas | 2.2.3 |
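
To compare a local environment against the package versions in this table, a small standard-library check such as the following could be used. `version_mismatches` is a hypothetical helper, and the Python interpreter version is not checked here:

```python
from importlib import metadata

# Package versions copied from the table above.
EXPECTED = {
    "torch": "2.4.1",
    "transformers": "4.44.2",
    "sentence-transformers": "3.0.1",
    "accelerate": "0.27.0",
    "pandas": "2.2.3",
}


def version_mismatches(expected):
    """Map each package whose installed version differs to (expected, installed).

    A package that is not installed is reported with installed=None.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


for name, (wanted, installed) in version_mismatches(EXPECTED).items():
    print(f"{name}: expected {wanted}, found {installed}")
```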

---

## 📅 Release Info
- Author: **@Yoonyoul**
- Base Model: `intfloat/multilingual-e5-small`
- Fine-tuned Model: `Yoonyoul/fine-tuned-e5-small-drugproduct`
- Repository: [https://github.com/ryukato/fine-tuned-e5-drugmodel](https://github.com/ryukato/fine-tuned-e5-drugmodel)
- Last Updated: **2025-10-27**