---
datasets:
- "VERITABLE RECORDS of the JOSEON DYNASTY"
---

SillokBert-NER: ์กฐ์„ ์™•์กฐ์‹ค๋ก ํŠนํ™” ๊ฐœ์ฒด๋ช… ์ธ์‹ ๋ชจ๋ธ

SillokBert-NER: A Domain-Specific NER Model for the Veritable Records of the Joseon Dynasty

๋ชจ๋ธ ์„ค๋ช… (Model Description)

SillokBert-NER์€ ์กฐ์„ ์™•์กฐ์‹ค๋ก ์›๋ฌธ์— ํŠนํ™”๋œ ๊ฐœ์ฒด๋ช… ์ธ์‹(Named Entity Recognition, NER) ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. ์ด ๋ชจ๋ธ์€ ์กฐ์„ ์™•์กฐ์‹ค๋ก ์ „์ฒด ์›๋ฌธ(ํ•œ๋ฌธ)์œผ๋กœ ์ง€์†์  ์‚ฌ์ „ํ•™์Šต(continued pre-training)์„ ์ง„ํ–‰ํ•œ ์–ธ์–ด ๋ชจ๋ธ ddokbaro/SillokBert ํ”„๋กœ์ ํŠธ์˜ Trial 11 ์ฒดํฌํฌ์ธํŠธ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํŒŒ์ธํŠœ๋‹๋˜์—ˆ์œผ๋ฉฐ, ์—ญ์‚ฌ ๊ธฐ๋ก๋ฌผ ์†์—์„œ ๋‹ค์Œ์˜ 4๊ฐ€์ง€ ํ•ต์‹ฌ ๊ฐœ์ฒด ์œ ํ˜•์„ ์ •ํ™•ํ•˜๊ฒŒ ์‹๋ณ„ํ•˜๋„๋ก ์„ค๊ณ„๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

(SillokBert-NER is a Named Entity Recognition (NER) model specialized for the Veritable Records of the Joseon Dynasty (์กฐ์„ ์™•์กฐ์‹ค๋ก). It is fine-tuned from the Trial 11 checkpoint of the ddokbaro/SillokBert project, a language model that was continually pre-trained on the full-text classical Chinese (Hanja) corpus of the Veritable Records. This model is designed to accurately identify four key entity types within the historical texts.)

  • PER: ์ธ๋ช… (Person)
  • LOC: ์ง€๋ช… (Location)
  • POH: ์„œ์ฑ…๋ช… (Publication of History)
  • DAT: ์—ฐํ˜ธ (Date / Era Name)
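
For the token-classification head, these four types are presumably expanded into a BIO tag set. Below is a minimal sketch of the assumed label inventory; the authoritative mapping is the id2label field in this model's config.json.

# Assumed BIO label inventory; verify against id2label in config.json.
ENTITY_TYPES = ["PER", "LOC", "POH", "DAT"]
LABELS = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I")]
# -> ['O', 'B-PER', 'I-PER', 'B-LOC', 'I-LOC', 'B-POH', 'I-POH', 'B-DAT', 'I-DAT']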

๋ณธ ๋ชจ๋ธ์€ ํ•œ๊ตญํ•™์ค‘์•™์—ฐ๊ตฌ์› ๋””์ง€ํ„ธ์ธ๋ฌธํ•™์—ฐ๊ตฌ์†Œ์˜ "ํ•œ๊ตญ ๊ณ ์ „ ๋ฌธํ—Œ ๊ธฐ๋ฐ˜ ์ง€๋Šฅํ˜• ํ•œ๊ตญํ•™ ์–ธ์–ด๋ชจ๋ธ ๊ฐœ๋ฐœ" ํ”„๋กœ์ ํŠธ์˜ ์ผํ™˜์œผ๋กœ ๊ฐœ๋ฐœ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ๋ณธ ๋ชจ๋ธ์˜ ํ•™์Šต ํ™˜๊ฒฝ์€ ๊ณผํ•™๊ธฐ์ˆ ์ •๋ณดํ†ต์‹ ๋ถ€ ์ •๋ณดํ†ต์‹ ์‚ฐ์—…์ง„ํฅ์›์˜ 2025๋…„ ๊ณ ์„ฑ๋Šฅ์ปดํ“จํŒ…์ง€์›(GPU) ์‚ฌ์—…(G2025-0450)์˜ ์ง€์›์„ ๋ฐ›์•˜์Šต๋‹ˆ๋‹ค. ์—ฐ๊ตฌ์— ํ•„์ˆ˜์ ์ธ ๊ณ ์„ฑ๋Šฅ ์ปดํ“จํŒ… ํ™˜๊ฒฝ์„ ์ง€์›ํ•ด์ฃผ์…”์„œ ์ง„์‹ฌ์œผ๋กœ ๊ฐ์‚ฌ๋“œ๋ฆฝ๋‹ˆ๋‹ค.

This model was developed as part of the "Development of an Intelligent Korean Studies Language Model based on Classical Korean Texts" project at the Digital Humanities Research Institute, The Academy of Korean Studies. The training environment was supported by the 2025 High-Performance Computing Support (GPU) Program (No. G2025-0450) of the National IT Industry Promotion Agency (NIPA) under the Ministry of Science and ICT. We sincerely thank NIPA for providing the high-performance computing environment essential to this research.

์‚ฌ์šฉ ๋ชฉ์  ๋ฐ ํ•œ๊ณ„ (Intended Uses & Limitations)

์ด ๋ชจ๋ธ์€ ํ•™์ˆ  ๋ฐ ์—ฐ๊ตฌ ๋ชฉ์ ์œผ๋กœ ์ œ์ž‘๋˜์—ˆ์œผ๋ฉฐ, ํŠนํžˆ ์กฐ์„ ์™•์กฐ์‹ค๋ก์ด๋‚˜ ์œ ์‚ฌํ•œ ํ•œ๋ฌธ ์—ญ์‚ฌ ๊ธฐ๋ก์„ ๋‹ค๋ฃจ๋Š” ์—ฐ๊ตฌ์ž์™€ ๊ฐœ๋ฐœ์ž์—๊ฒŒ ์œ ์šฉํ•ฉ๋‹ˆ๋‹ค.

This model is intended for academic and research purposes, specifically for scholars and developers working with the Veritable Records of the Joseon Dynasty or similar historical Korean texts written in classical Chinese.

ํ•œ๊ณ„ (Limitations):

  • ์ด ๋ชจ๋ธ์€ ํŠน์ • ๋„๋ฉ”์ธ์— ๊ณ ๋„๋กœ ํŠนํ™”๋˜์–ด ์žˆ์œผ๋ฏ€๋กœ, ํ˜„๋Œ€ ํ•œ๊ตญ์–ด๋‚˜ ๋‹ค๋ฅธ ์ข…๋ฅ˜์˜ ํ…์ŠคํŠธ์— ๋Œ€ํ•œ ๋ฒ”์šฉ NER ๋ชจ๋ธ๋กœ๋Š” ์ ํ•ฉํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. (This model is a highly domain-specific model and is not suitable for general-purpose NER on modern Korean or other types of texts.)
  • ์‹œ๋Œ€๋‚˜ ๋ฌธ์ฒด์  ํŠน์ง•์ด ๋‹ค๋ฅธ ์—ญ์‚ฌ ๋ฌธํ—Œ์—์„œ๋Š” ์„ฑ๋Šฅ์ด ๋‹ค๋ฅด๊ฒŒ ๋‚˜ํƒ€๋‚  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. (Performance may vary on historical documents from different eras or with different stylistic features.)

์‚ฌ์šฉ ๋ฐฉ๋ฒ• (How to Get Started)

transformers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์˜ ํŒŒ์ดํ”„๋ผ์ธ์„ ํ†ตํ•ด ๊ฐ„๋‹จํ•˜๊ฒŒ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. (You can use this model with the transformers library pipeline.)

from transformers import pipeline

# ์˜ต์…˜ 1 (๊ถŒ์žฅ): ํ—ˆ๊น…ํŽ˜์ด์Šค ํ—ˆ๋ธŒ์—์„œ ์ง์ ‘ ๋ชจ๋ธ ๋กœ๋“œ
# Option 1 (Recommended): Load the model directly from the Hugging Face Hub
ner_pipeline = pipeline("token-classification", model="ddokbaro/SillokBert-NER")

# ์˜ต์…˜ 2: ๋กœ์ปฌ์— ์ €์žฅ๋œ ๋ชจ๋ธ ๋กœ๋“œ (๊ฒฝ๋กœ๋ฅผ ์‹ค์ œ ํ™˜๊ฒฝ์— ๋งž๊ฒŒ ์ˆ˜์ •ํ•ด์•ผ ํ•จ)
# Option 2: Load the model from a local directory (the path must be adjusted to your environment)
# local_model_path = "/home/work/baro/sillokner20250618/models/SillokBert-NER-trial11"
# ner_pipeline = pipeline("token-classification", model=local_model_path)


text = "ๆ™‚ๅคชๅฎ—ๅœจๆฝ›้‚ธ้ฃ่ถ™่‹ฑ่Œ‚่ซญๆ„ไธ”ๆ›ฐไปŠๆˆ‘ๅœ‹ๅฎถๅœŸๅฎ‡้š˜่ฅฟๅŒ—่ท้ดจ็ถ ๆœชๅŠ็™พ้‡Œ"
# ํƒœ์ข…์‹ค๋ก 1๊ถŒ, ํƒœ์กฐ 1๋…„ 1์›” 15์ผ (Veritable Records of Taejong, Vol. 1, 15th day of the 1st month of the 1st year of King Taejo)

results = ner_pipeline(text)
for entity in results:
    print(entity)

# Expected output (illustrative; exact scores, token indices, and character offsets depend on the tokenizer):
# {'entity': 'B-PER', 'score': 0.99..., 'index': 2, 'word': 'ๅคชๅฎ—', 'start': 3, 'end': 5}
# {'entity': 'B-PER', 'score': 0.99..., 'index': 6, 'word': '่ถ™่‹ฑ่Œ‚', 'start': 15, 'end': 18}
# {'entity': 'B-LOC', 'score': 0.99..., 'index': 13, 'word': '้ดจ็ถ ', 'start': 43, 'end': 45}
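
Since BERT-style tokenizers typically split Hanja text character by character, it is often convenient to let the pipeline merge per-token predictions into whole entity spans. A minimal sketch continuing the example above, using the transformers aggregation option (with aggregation, each result dict carries an 'entity_group' key instead of 'entity'):

# Optional: group per-character predictions into whole entity spans.
ner_grouped = pipeline(
    "token-classification",
    model="ddokbaro/SillokBert-NER",
    aggregation_strategy="simple",  # "first", "max", "average" are also available
)
for entity in ner_grouped(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 4))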

์‚ฌ์ „ํ•™์Šต ๋ชจ๋ธ ์›๋ณธ (Original Pre-trained Model)

๋ณธ ๋ฆฌํฌ์ง€ํ† ๋ฆฌ์—๋Š” ์ด NER ๋ชจ๋ธ์˜ ๊ธฐ๋ฐ˜์ด ๋œ ์›๋ณธ SillokBert (Trial 11) ์ฒดํฌํฌ์ธํŠธ ํŒŒ์ผ๋“ค์ด 'SillokBert_trial11/' ํด๋”์— ํ•จ๊ป˜ ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. ๋‹ค๋ฅธ ๋‹ค์šด์ŠคํŠธ๋ฆผ ํƒœ์Šคํฌ์— ์ง์ ‘ ํŒŒ์ธํŠœ๋‹์„ ์‹œ๋„ํ•ด๋ณด๊ณ ์ž ํ•˜๋Š” ์—ฐ๊ตฌ์ž๋“ค์€ ํ•ด๋‹น ํด๋”์˜ ํŒŒ์ผ๋“ค์„ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

This repository also contains the original SillokBert (Trial 11) checkpoint files in the 'SillokBert_trial11/' folder. Researchers who wish to fine-tune this model on other downstream tasks can utilize the files in that directory.
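
A minimal loading sketch, assuming the files in 'SillokBert_trial11/' form a standard transformers checkpoint inside this repository; the task head and num_labels below are placeholders for your own downstream task:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the base SillokBert (Trial 11) checkpoint from the repository subfolder;
# swap in the Auto* class that matches your downstream task.
tokenizer = AutoTokenizer.from_pretrained(
    "ddokbaro/SillokBert-NER", subfolder="SillokBert_trial11"
)
model = AutoModelForSequenceClassification.from_pretrained(
    "ddokbaro/SillokBert-NER", subfolder="SillokBert_trial11", num_labels=2
)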

ํ•™์Šต ๋ฐ ํ‰๊ฐ€ ๋ฐ์ดํ„ฐ (Training and Evaluation Data)

๋ฐ์ดํ„ฐ์…‹ (Dataset)

์ด ๋ชจ๋ธ์€ ์กฐ์„ ์™•์กฐ์‹ค๋ก ์›๋ณธ XML ํŒŒ์ผ๋กœ๋ถ€ํ„ฐ ๊ตฌ์ถ•๋œ **Sillok NER Corpus**๋กœ ํ•™์Šต๋˜์—ˆ์Šต๋‹ˆ๋‹ค. (This model was trained on the Sillok NER Corpus, a custom dataset built from the original XML files of the Veritable Records of the Joseon Dynasty.)

  • ์›์ฒœ ๋ฐ์ดํ„ฐ (Source Data): ๊ณต๊ณต๋ฐ์ดํ„ฐํฌํ„ธ - ๊ต์œก๋ถ€ ๊ตญ์‚ฌํŽธ์ฐฌ์œ„์›ํšŒ_์กฐ์„ ์™•์กฐ์‹ค๋ก ์ •๋ณด_์‹ค๋ก์›๋ฌธ https://www.data.go.kr/data/15053647/fileData.do. ์—ฐ๊ตฌ์˜ ํ† ๋Œ€๊ฐ€ ๋œ ๊ท€์ค‘ํ•œ ์ž๋ฃŒ๋ฅผ ์ œ๊ณตํ•ด์ฃผ์‹  ๊ต์œก๋ถ€ ๊ตญ์‚ฌํŽธ์ฐฌ์œ„์›ํšŒ ์ธก์— ๊ฐ์‚ฌ์˜ ๋ง์”€์„ ์ „ํ•œ๋‹ค.
    We express our gratitude to the National Institute of Korean History (Ministry of Education) for providing the invaluable data that formed the foundation of this research.
  • ๋ฐ์ดํ„ฐ ๋ฒ„์ „ ๋ฐ ์žฌํ˜„์„ฑ (Data Version and Reproducibility): ๋ณธ ์—ฐ๊ตฌ๋Š” 2022๋…„ 11์›” 03์ผ์— ๋“ฑ๋ก๋œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•ฉ๋‹ˆ๋‹ค. ๊ณต์‹ ๋ฐฐํฌ์ฒ˜์˜ ๋ฐ์ดํ„ฐ๊ฐ€ ์—…๋ฐ์ดํŠธ๋  ์ˆ˜ ์žˆ์–ด, ์™„๋ฒฝํ•œ ์žฌํ˜„์„ฑ์„ ๋ณด์žฅํ•˜๊ธฐ ์œ„ํ•ด ํ•™์Šต์— ์‚ฌ์šฉ๋œ ์›๋ณธ XML ํŒŒ์ผ ์ „์ฒด๋ฅผ raw_data/sillok_raw_xml.zip ํŒŒ์ผ๋กœ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ, ์ฆ‰์‹œ ํ™œ์šฉ ๊ฐ€๋Šฅํ•œ ์ „์ฒ˜๋ฆฌ ์™„๋ฃŒ ํ…์ŠคํŠธ ํŒŒ์ผ(train.txt, validation.txt, test.txt)์€ preprocessed_data/ ํด๋”์—์„œ ํ™•์ธํ•˜์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
    This research is based on the data registered on November 3, 2022. As the data from the official distributor may be updated, we provide the entire original XML files used for training as raw_data/sillok_raw_xml.zip in this repository to ensure perfect reproducibility. Additionally, the preprocessed text files (train.txt, validation.txt, test.txt) ready for immediate use can be found in the preprocessed_data/ folder.
  • ์ „์ฒ˜๋ฆฌ (Preprocessing): XML์˜ <index> ํƒœ๊ทธ๋ฅผ ํŒŒ์‹ฑํ•˜์—ฌ ๊ฐœ์ฒด๋ช… ํ…์ŠคํŠธ, ์œ ํ˜•(์ด๋ฆ„, ์ง€๋ช…, ์„œ๋ช…, ์—ฐํ˜ธ), ๊ณ ์œ  ์ฐธ์กฐ ID๋ฅผ ์ถ”์ถœํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด ์ •๋ณด๋Š” 3์—ด์˜ CoNLL ํ˜•์‹(token ner_tag ref_id)์œผ๋กœ ๋ณ€ํ™˜๋˜์—ˆ์Šต๋‹ˆ๋‹ค. (The <index> tags in the XML were parsed to extract the entity text, its type (person name, place name, book title, or era name), and a unique reference ID. This information was converted into a 3-column CoNLL format (token ner_tag ref_id); see the sketch after this list.)
  • ๋ฐ์ดํ„ฐ ๋ถ„ํ•  (Data Split): ์ „์ฒด ๋ง๋ญ‰์น˜๋Š” ํ•™์Šต(80%), ๊ฒ€์ฆ(10%), ํ‰๊ฐ€(10%) ์„ธํŠธ๋กœ ๋ฌด์ž‘์œ„ ๋ถ„ํ• ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. (The full corpus was randomly split into training (80%), validation (10%), and test (10%) sets.)
    • ํ•™์Šต ์„ธํŠธ (Training Set): 375,366 ๋ฌธ์žฅ (sentences)
    • ๊ฒ€์ฆ ์„ธํŠธ (Validation Set): 46,920 ๋ฌธ์žฅ (sentences)
    • ํ‰๊ฐ€ ์„ธํŠธ (Test Set): 46,922 ๋ฌธ์žฅ (sentences)
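
A minimal sketch of the preprocessing and split described above. The XML details (attribute names "type" and "ref", a per-sentence element, per-character tokenization) and the type-to-tag mapping are assumptions to be checked against raw_data/sillok_raw_xml.zip; this is not the project's actual script.

from xml.etree import ElementTree as ET  # e.g. ET.parse(path).getroot()
import random

# Hypothetical mapping from the XML "type" values to NER tags.
TYPE_TO_TAG = {"์ด๋ฆ„": "PER", "์ง€๋ช…": "LOC", "์„œ๋ช…": "POH", "์—ฐํ˜ธ": "DAT"}

def element_to_rows(elem):
    """Turn one sentence element into (token, ner_tag, ref_id) rows,
    one Hanja character per token."""
    rows = [(ch, "O", "_") for ch in (elem.text or "")]
    for index in elem.findall("index"):            # entity annotations
        tag = TYPE_TO_TAG.get(index.get("type"))   # None for unknown types
        for i, ch in enumerate(index.text or ""):
            label = ("B-" if i == 0 else "I-") + tag if tag else "O"
            rows.append((ch, label, index.get("ref", "_") if tag else "_"))
        rows.extend((ch, "O", "_") for ch in (index.tail or ""))  # trailing text
    return rows

# Random 80/10/10 split over all sentences.
sentences = []   # populate by applying element_to_rows over the parsed XML
random.seed(42)  # arbitrary seed; the card does not state one
random.shuffle(sentences)
n = len(sentences)
train = sentences[: int(0.8 * n)]
val = sentences[int(0.8 * n) : int(0.9 * n)]
test = sentences[int(0.9 * n) :]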

๋ฐ์ดํ„ฐ์…‹ ๋‹ค์šด๋กœ๋“œ (Dataset Download)

๋ณธ ๋ฆฌํฌ์ง€ํ† ๋ฆฌ์—๋Š” ๋ชจ๋ธ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, ์—ฐ๊ตฌ์— ์‚ฌ์šฉ๋œ ์ „์ฒ˜๋ฆฌ ์™„๋ฃŒ ๋ฐ์ดํ„ฐ์™€ ์›๋ณธ ๋ฐ์ดํ„ฐ๊ฐ€ ๋ชจ๋‘ ํฌํ•จ๋˜์–ด ์žˆ์–ด ์ฆ‰์‹œ ํ™œ์šฉ ๋ฐ ์žฌํ˜„์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

This repository contains not only the model but also the pre-processed and raw data used in the research, allowing for immediate use and reproducibility.

'data/raw_xml/': ์—ฐ๊ตฌ์˜ ๊ธฐ๋ฐ˜์ด ๋œ ์›๋ณธ XML ํŒŒ์ผ ์ „์ฒด๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. (Contains the complete original XML files that formed the basis of this research.)

'preprocessed_data/': ์ฆ‰์‹œ ํ™œ์šฉ ๊ฐ€๋Šฅํ•œ CoNLL ํ˜•์‹์˜ 'train.txt', 'validation.txt', 'test.txt' ํŒŒ์ผ์ด ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. (Contains ready-to-use CoNLL formatted files: 'train.txt', 'validation.txt', and 'test.txt'.)
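
The 3-column files can be read back with a few lines of Python. A sketch assuming whitespace-separated columns and blank-line sentence boundaries; the reference IDs shown in the trailing comment are hypothetical placeholders:

def read_conll(path):
    """Read 'token ner_tag ref_id' lines; blank lines separate sentences."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if current:
                    sentences.append(current)
                current = []
                continue
            token, ner_tag, ref_id = line.split()
            current.append((token, ner_tag, ref_id))
    if current:
        sentences.append(current)
    return sentences

train_sentences = read_conll("preprocessed_data/train.txt")
# e.g. [('ๅคช', 'B-PER', 'REF_0001'), ('ๅฎ—', 'I-PER', 'REF_0001'), ('ๅœจ', 'O', '_'), ...]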

๊ฐœ์ฒด๋ช… ์œ ํ˜• (Entity Types)

| ํƒœ๊ทธ (Tag) | ์„ค๋ช… (Description) | XML type | ์›๋ณธ ๋ฐ์ดํ„ฐ ์ˆ˜ (Raw Data Count) |
| --- | --- | --- | --- |
| PER | Person Name (์ธ๋ช…) | ์ด๋ฆ„ | 1,495,199 |
| LOC | Location Name (์ง€๋ช…) | ์ง€๋ช… | 490,163 |
| POH | Publication of History (์„œ์ฑ…๋ช…) | ์„œ๋ช… | 49,506 |
| DAT | Date / Era Name (์—ฐํ˜ธ) | ์—ฐํ˜ธ | 5,964 |

ํ•™์Šต ์ ˆ์ฐจ (Training Procedure)

๊ณต์ •ํ•œ ํ‰๊ฐ€๋ฅผ ์œ„ํ•ด ๋ชจ๋“  ๋น„๊ต ๋ชจ๋ธ์— ๋™์ผํ•œ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํŒŒ์ธํŠœ๋‹์„ ์ง„ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค. (The model was fine-tuned using the same set of hyperparameters across all comparative models to ensure a fair evaluation.)

  • ํ•™์Šต๋ฅ  (Learning Rate): 2e-5
  • ๋ฐฐ์น˜ ์‚ฌ์ด์ฆˆ (Batch Size): 16
  • ์—ํญ (Epochs): 3
  • ๊ฐ€์ค‘์น˜ ๊ฐ์‡  (Weight Decay): 0.01

์„ฑ๋Šฅ ํ‰๊ฐ€ (Evaluation)

๋„๋ฉ”์ธ ํŠนํ™” ์‚ฌ์ „ํ•™์Šต์˜ ํšจ๊ณผ๋ฅผ ๊ฒ€์ฆํ•˜๊ธฐ ์œ„ํ•ด ํฌ๊ด„์ ์ธ ๋น„๊ต ๋ถ„์„์„ ์ˆ˜ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค. (We conducted a comprehensive comparative analysis to validate the effectiveness of domain-specific pre-training.)

๋น„๊ต ๋ชจ๋ธ (Models for Comparison)

  • ๊ทธ๋ฃน 1 (์ž์ฒด ๋ชจ๋ธ / Our Models): SillokBert (Top 3 Trials) vs. bert-base-multilingual-cased (๋ฒ ์ด์Šค๋ผ์ธ / Baseline).
  • ๊ทธ๋ฃน 2 (์™ธ๋ถ€ ๋ชจ๋ธ / External Models): ํ˜„๋Œ€ ํ•œ๊ตญ์–ด(klue/roberta-large) ๋˜๋Š” ๋‹ค๋ฅธ ์ค‘๊ตญ ๊ณ ๋ฌธ(SIKU-BERT, guwenbert-large)์œผ๋กœ ์‚ฌ์ „ํ•™์Šต๋œ ๋ชจ๋ธ. (Models pre-trained on modern Korean (klue/roberta-large) or other classical Chinese texts (SIKU-BERT, guwenbert-large).)
  • ๊ทธ๋ฃน 3 (SOTA ๋ฒค์น˜๋งˆํฌ / SOTA Benchmark): ์ค‘๊ตญ ๊ณ ๋ฌธ NER ๊ณผ์ œ๋กœ ๊ธฐํ•™์Šต๋œ ๋ชจ๋ธ(ethanyt/guwen-ner). (A pre-trained NER model for classical Chinese (ethanyt/guwen-ner).)
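
The card does not name the scoring tool, but entity-level F1, precision, and recall of the kind reported below are conventionally computed with seqeval; a tiny sketch:

# pip install seqeval
from seqeval.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)

y_true = [["B-PER", "I-PER", "O", "B-LOC"]]  # gold BIO tags per sentence
y_pred = [["B-PER", "I-PER", "O", "O"]]      # model predictions
print(f1_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), accuracy_score(y_true, y_pred))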

๊ฒฐ๊ณผ (Results)

๋‹ค์Œ ํ‘œ๋Š” ๊ฐ ๋ชจ๋ธ์˜ ๊ฒ€์ฆ ์„ธํŠธ์— ๋Œ€ํ•œ ์ตœ๊ณ  F1 ์ ์ˆ˜๋ฅผ ์š”์•ฝํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค. (The following table summarizes the best F1 scores on the validation set for each model.)

| ๊ทธ๋ฃน (Group) | ๋ชจ๋ธ๋ช… (Model) | ๊ธฐ๋ฐ˜ ๋ฐ์ดํ„ฐ (Base Data) | F1 ์ ์ˆ˜ (F1) | ์ •๋ฐ€๋„ (P) | ์žฌํ˜„์œจ (R) | ์ •ํ™•๋„ (Acc) | ๋น„๊ณ  (Notes) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | SillokBert (Trial 11) | ์‹ค๋ก (Sillok, ours) | 0.9569 | 0.9485 | 0.9655 | 0.9959 | ์ตœ๊ณ  ์„ฑ๋Šฅ ๋‹ฌ์„ฑ (best performance) |
| 1 | SillokBert (Trial 10) | ์‹ค๋ก (Sillok, ours) | 0.9565 | 0.9572 | 0.9558 | 0.9960 | ์ตœ๊ณ  ์„ฑ๋Šฅ๊ณผ ๋Œ€๋“ฑ (on par with the best) |
| 1 | SillokBert (Trial 4) | ์‹ค๋ก (Sillok, ours) | 0.9564 | 0.9586 | 0.9542 | 0.9959 | ddokbaro/SillokBert ๊ณต์‹ ๋ชจ๋ธ (official release) |
| 1 | bert-base-multilingual-cased | ๋‹ค๊ตญ์–ด (multilingual, general) | 0.9530 | 0.9544 | 0.9516 | 0.9956 | ์‚ฌ์ „ํ•™์Šต ํšจ๊ณผ ๋น„๊ต์šฉ ๋ฒ ์ด์Šค๋ผ์ธ (baseline for the pre-training effect) |
| 2 | klue/roberta-large | ํ˜„๋Œ€ ํ•œ๊ตญ์–ด (modern Korean) | 0.9488 | 0.9501 | 0.9475 | 0.9952 | ์ตœ์‹  ์•„ํ‚คํ…์ฒ˜, ๋„๋ฉ”์ธ ๋ถˆ์ผ์น˜๋กœ ์„ฑ๋Šฅ ํ•˜๋ฝ (newer architecture, hurt by domain mismatch) |
| 2 | ethanyt/guwenbert-large | ์ค‘๊ตญ ๊ณ ๋ฌธ (classical Chinese, general) | 0.9461 | 0.9450 | 0.9472 | 0.9951 | ์œ ์‚ฌ ๋„๋ฉ”์ธ, SillokBert ๋Œ€๋น„ ํ•˜๋ฝ (related domain, below SillokBert) |
| 2 | SIKU-BERT/sikubert | ์ค‘๊ตญ ๊ณ ๋ฌธ (์‚ฌ๊ณ ์ „์„œ / Siku Quanshu) | 0.9421 | 0.9380 | 0.9463 | 0.9948 | ํŠน์ • ๊ณ ๋ฌธํ—Œ, SillokBert ๋Œ€๋น„ ํ•˜๋ฝ (specific classical corpus, below SillokBert) |
| 3 | ethanyt/guwen-ner (SOTA) | ์ค‘๊ตญ ๊ณ ๋ฌธ (๊ธฐํ•™์Šต NER / pre-trained NER) | 0.1749 | 0.2601 | 0.1317 | 0.9288 | ๋ผ๋ฒจ/๋„๋ฉ”์ธ ๋ถˆ์ผ์น˜๋กœ ์œ ์˜๋ฏธํ•œ ์ธก์ • ๋ถˆ๊ฐ€ (label/domain mismatch, not meaningfully measurable) |

๊ฒฐ๊ณผ ๋ถ„์„ (Analysis of Results)

  • SillokBert์˜ ์šฐ์ˆ˜์„ฑ (Superiority of SillokBert): SillokBert๋Š” ๋‹ค๋ฅธ ๋ชจ๋“  ๋น„๊ต ๋ชจ๋ธ๋ณด๋‹ค ์ผ๊ด€๋˜๊ฒŒ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๋ณด์ด๋ฉฐ, ๋„๋ฉ”์ธ ํŠนํ™” ์ง€์†-์‚ฌ์ „ํ•™์Šต(domain-specific continued pre-training)์˜ ๋ช…๋ฐฑํ•œ ์ด์ ์„ ๋ณด์—ฌ์ฃผ์—ˆ์Šต๋‹ˆ๋‹ค. (SillokBert consistently outperformed all other models, demonstrating the clear advantage of domain-specific continued pre-training.)
  • ๋„๋ฉ”์ธ ์ •ํ•ฉ์„ฑ์˜ ์ค‘์š”์„ฑ (Importance of Domain Alignment): klue/roberta-large์™€ ๊ฐ™์ด ํ˜„๋Œ€ ํ•œ๊ตญ์–ด ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต๋œ ๊ฐ•๋ ฅํ•œ ๋ชจ๋ธ์ด๋‚˜, guwenbert-large, SIKU-BERT ๋“ฑ ๋‹ค๋ฅธ ์ค‘๊ตญ ๊ณ ๋ฌธ ํ…์ŠคํŠธ๋กœ ํ•™์Šต๋œ ๋ชจ๋ธ์กฐ์ฐจ SillokBert์˜ ์„ฑ๋Šฅ์—๋Š” ๋ฏธ์น˜์ง€ ๋ชปํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ๋ณธ ๊ณผ์ œ์—์„œ ๋„๋ฉ”์ธ ์ •ํ•ฉ์„ฑ์ด ์•„ํ‚คํ…์ฒ˜ ๊ฐœ์„ ์ด๋‚˜ ์ผ๋ฐ˜์ ์ธ ์–ธ์–ด ๋Šฅ๋ ฅ๋ณด๋‹ค ๋” ์ค‘์š”ํ•œ ์š”์†Œ์ž„์„ ๊ฐ•์กฐํ•ฉ๋‹ˆ๋‹ค. (Even powerful models trained on modern Korean (klue/roberta-large) or other classical Chinese texts (guwenbert-large, SIKU-BERT) could not match the performance of SillokBert. This highlights that domain alignment is more critical than architectural improvements or general language capabilities for this specific task.)
  • ๊ธฐ์„ฑ SOTA ๋ชจ๋ธ์˜ ํ•œ๊ณ„ (Limitations of Out-of-the-Box SOTA Models): ์‚ฌ์ „ํ•™์Šต๋œ guwen-ner ๋ชจ๋ธ์€ ๋ ˆ์ด๋ธ” ์ฒด๊ณ„์™€ ๋„๋ฉ”์ธ์˜ ๋ถˆ์ผ์น˜๋กœ ์ธํ•ด ์šฐ๋ฆฌ ๋ฐ์ดํ„ฐ์…‹์—์„œ ์‹คํŒจํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ์™ธ๋ถ€ ๋„๊ตฌ๋ฅผ ๋ฌด๋น„ํŒ์ ์œผ๋กœ ์ ์šฉํ•˜๊ธฐ๋ณด๋‹ค, ํŠนํ™”๋œ ๋ฐ์ดํ„ฐ๋ฅผ ์œ„ํ•œ ๋งž์ถคํ˜• ๋ชจ๋ธ์„ ๊ฐœ๋ฐœํ•  ํ•„์š”์„ฑ์„ ๊ฐ•์กฐํ•ฉ๋‹ˆ๋‹ค. (The pre-trained guwen-ner model failed on our dataset due to a mismatch in label schemas and domains. This underscores the necessity of developing custom models for specialized data rather than uncritically applying external tools.)

์ธ์šฉ (Citation)

์ด ๋ชจ๋ธ์ด๋‚˜ Sillok NER Corpus๋ฅผ ์—ฐ๊ตฌ์— ์‚ฌ์šฉํ•˜์‹ ๋‹ค๋ฉด, ์ด ๋ฆฌํฌ์ง€ํ† ๋ฆฌ๋ฅผ ์ธ์šฉํ•ด ์ฃผ์‹ญ์‹œ์˜ค. (If you use this model or the Sillok NER Corpus in your research, please cite this repository.)

@misc{SillokBertNER2025,
  author = {Kim, Baro},
  title = {SillokBert-NER: A Domain-Specific NER Model for the Veritable Records of the Joseon Dynasty},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/ddokbaro/SillokBert-NER}}
}