---
datasets:
- "VERITABLE RECORDS of the JOSEON DYNASTY"
---

SillokBert-NER: ์กฐ์„ ์™•์กฐ์‹ค๋ก ํŠนํ™” ๊ฐœ์ฒด๋ช… ์ธ์‹ ๋ชจ๋ธ

SillokBert-NER: A Domain-Specific NER Model for the Veritable Records of the Joseon Dynasty

๋ชจ๋ธ ์„ค๋ช… (Model Description)

SillokBert-NER์€ ์กฐ์„ ์™•์กฐ์‹ค๋ก ์›๋ฌธ์— ํŠนํ™”๋œ ๊ฐœ์ฒด๋ช… ์ธ์‹(Named Entity Recognition, NER) ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. ์ด ๋ชจ๋ธ์€ ์กฐ์„ ์™•์กฐ์‹ค๋ก ์ „์ฒด ์›๋ฌธ(ํ•œ๋ฌธ)์œผ๋กœ ์ง€์†์  ์‚ฌ์ „ํ•™์Šต(continued pre-training)์„ ์ง„ํ–‰ํ•œ ์–ธ์–ด ๋ชจ๋ธ ddokbaro/SillokBert ํ”„๋กœ์ ํŠธ์˜ Trial 11 ์ฒดํฌํฌ์ธํŠธ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํŒŒ์ธํŠœ๋‹๋˜์—ˆ์œผ๋ฉฐ, ์—ญ์‚ฌ ๊ธฐ๋ก๋ฌผ ์†์—์„œ ๋‹ค์Œ์˜ 4๊ฐ€์ง€ ํ•ต์‹ฌ ๊ฐœ์ฒด ์œ ํ˜•์„ ์ •ํ™•ํ•˜๊ฒŒ ์‹๋ณ„ํ•˜๋„๋ก ์„ค๊ณ„๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

(SillokBert-NER is a Named Entity Recognition (NER) model specialized for the Veritable Records of the Joseon Dynasty (์กฐ์„ ์™•์กฐ์‹ค๋ก). It is fine-tuned from the Trial 11 checkpoint of the ddokbaro/SillokBert project, a language model that was continually pre-trained on the full-text classical Chinese (Hanja) corpus of the Veritable Records. This model is designed to accurately identify four key entity types within the historical texts.)

  • PER: ์ธ๋ช… (Person)
  • LOC: ์ง€๋ช… (Location)
  • POH: ์„œ์ฑ…๋ช… (Publication of History)
  • DAT: ์—ฐํ˜ธ (Date / Era Name)
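
For the token-classification head, these four types are presumably expanded into a BIO tag set. Below is a minimal sketch of the assumed label inventory; the authoritative mapping is the id2label field in this model's config.json.

# Assumed BIO label inventory; verify against id2label in config.json.
ENTITY_TYPES = ["PER", "LOC", "POH", "DAT"]
LABELS = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I")]
# -> ['O', 'B-PER', 'I-PER', 'B-LOC', 'I-LOC', 'B-POH', 'I-POH', 'B-DAT', 'I-DAT']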

๋ณธ ๋ชจ๋ธ์€ ํ•œ๊ตญํ•™์ค‘์•™์—ฐ๊ตฌ์› ๋””์ง€ํ„ธ์ธ๋ฌธํ•™์—ฐ๊ตฌ์†Œ์˜ "ํ•œ๊ตญ ๊ณ ์ „ ๋ฌธํ—Œ ๊ธฐ๋ฐ˜ ์ง€๋Šฅํ˜• ํ•œ๊ตญํ•™ ์–ธ์–ด๋ชจ๋ธ ๊ฐœ๋ฐœ" ํ”„๋กœ์ ํŠธ์˜ ์ผํ™˜์œผ๋กœ ๊ฐœ๋ฐœ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ๋ณธ ๋ชจ๋ธ์˜ ํ•™์Šต ํ™˜๊ฒฝ์€ ๊ณผํ•™๊ธฐ์ˆ ์ •๋ณดํ†ต์‹ ๋ถ€ ์ •๋ณดํ†ต์‹ ์‚ฐ์—…์ง„ํฅ์›์˜ 2025๋…„ ๊ณ ์„ฑ๋Šฅ์ปดํ“จํŒ…์ง€์›(GPU) ์‚ฌ์—…(G2025-0450)์˜ ์ง€์›์„ ๋ฐ›์•˜์Šต๋‹ˆ๋‹ค. ์—ฐ๊ตฌ์— ํ•„์ˆ˜์ ์ธ ๊ณ ์„ฑ๋Šฅ ์ปดํ“จํŒ… ํ™˜๊ฒฝ์„ ์ง€์›ํ•ด์ฃผ์…”์„œ ์ง„์‹ฌ์œผ๋กœ ๊ฐ์‚ฌ๋“œ๋ฆฝ๋‹ˆ๋‹ค.

This model was developed as part of the "Development of an Intelligent Korean Studies Language Model based on Classical Korean Texts" project at the Digital Humanities Research Institute, The Academy of Korean Studies. The training environment was supported by the 2025 High-Performance Computing Support (GPU) Program (No. G2025-0450) of the National IT Industry Promotion Agency (NIPA) under the Ministry of Science and ICT. We sincerely thank NIPA for providing the high-performance computing environment essential to this research.

์‚ฌ์šฉ ๋ชฉ์  ๋ฐ ํ•œ๊ณ„ (Intended Uses & Limitations)

์ด ๋ชจ๋ธ์€ ํ•™์ˆ  ๋ฐ ์—ฐ๊ตฌ ๋ชฉ์ ์œผ๋กœ ์ œ์ž‘๋˜์—ˆ์œผ๋ฉฐ, ํŠนํžˆ ์กฐ์„ ์™•์กฐ์‹ค๋ก์ด๋‚˜ ์œ ์‚ฌํ•œ ํ•œ๋ฌธ ์—ญ์‚ฌ ๊ธฐ๋ก์„ ๋‹ค๋ฃจ๋Š” ์—ฐ๊ตฌ์ž์™€ ๊ฐœ๋ฐœ์ž์—๊ฒŒ ์œ ์šฉํ•ฉ๋‹ˆ๋‹ค.

This model is intended for academic and research purposes, specifically for scholars and developers working with the Veritable Records of the Joseon Dynasty or similar historical Korean texts written in classical Chinese.

ํ•œ๊ณ„ (Limitations):

  • ์ด ๋ชจ๋ธ์€ ํŠน์ • ๋„๋ฉ”์ธ์— ๊ณ ๋„๋กœ ํŠนํ™”๋˜์–ด ์žˆ์œผ๋ฏ€๋กœ, ํ˜„๋Œ€ ํ•œ๊ตญ์–ด๋‚˜ ๋‹ค๋ฅธ ์ข…๋ฅ˜์˜ ํ…์ŠคํŠธ์— ๋Œ€ํ•œ ๋ฒ”์šฉ NER ๋ชจ๋ธ๋กœ๋Š” ์ ํ•ฉํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. (This model is a highly domain-specific model and is not suitable for general-purpose NER on modern Korean or other types of texts.)
  • ์‹œ๋Œ€๋‚˜ ๋ฌธ์ฒด์  ํŠน์ง•์ด ๋‹ค๋ฅธ ์—ญ์‚ฌ ๋ฌธํ—Œ์—์„œ๋Š” ์„ฑ๋Šฅ์ด ๋‹ค๋ฅด๊ฒŒ ๋‚˜ํƒ€๋‚  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. (Performance may vary on historical documents from different eras or with different stylistic features.)

์‚ฌ์šฉ ๋ฐฉ๋ฒ• (How to Get Started)

transformers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์˜ ํŒŒ์ดํ”„๋ผ์ธ์„ ํ†ตํ•ด ๊ฐ„๋‹จํ•˜๊ฒŒ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. (You can use this model with the transformers library pipeline.)

from transformers import pipeline

# ์˜ต์…˜ 1 (๊ถŒ์žฅ): ํ—ˆ๊น…ํŽ˜์ด์Šค ํ—ˆ๋ธŒ์—์„œ ์ง์ ‘ ๋ชจ๋ธ ๋กœ๋“œ
# Option 1 (Recommended): Load the model directly from the Hugging Face Hub
ner_pipeline = pipeline("token-classification", model="ddokbaro/SillokBert-NER")

# ์˜ต์…˜ 2: ๋กœ์ปฌ์— ์ €์žฅ๋œ ๋ชจ๋ธ ๋กœ๋“œ (๊ฒฝ๋กœ๋ฅผ ์‹ค์ œ ํ™˜๊ฒฝ์— ๋งž๊ฒŒ ์ˆ˜์ •ํ•ด์•ผ ํ•จ)
# Option 2: Load the model from a local directory (the path must be adjusted to your environment)
# local_model_path = "/home/work/baro/sillokner20250618/models/SillokBert-NER-trial11"
# ner_pipeline = pipeline("token-classification", model=local_model_path)


text = "ๆ™‚ๅคชๅฎ—ๅœจๆฝ›้‚ธ้ฃ่ถ™่‹ฑ่Œ‚่ซญๆ„ไธ”ๆ›ฐไปŠๆˆ‘ๅœ‹ๅฎถๅœŸๅฎ‡้š˜่ฅฟๅŒ—่ท้ดจ็ถ ๆœชๅŠ็™พ้‡Œ"
# ํƒœ์ข…์‹ค๋ก 1๊ถŒ, ํƒœ์กฐ 1๋…„ 1์›” 15์ผ (Veritable Records of Taejong, Vol. 1, 15th day of the 1st month of the 1st year of King Taejo)

results = ner_pipeline(text)
for entity in results:
    print(entity)

# Expected output (illustrative; exact scores, token indices, and character offsets depend on the tokenizer):
# {'entity': 'B-PER', 'score': 0.99..., 'index': 2, 'word': 'ๅคชๅฎ—', 'start': 3, 'end': 5}
# {'entity': 'B-PER', 'score': 0.99..., 'index': 6, 'word': '่ถ™่‹ฑ่Œ‚', 'start': 15, 'end': 18}
# {'entity': 'B-LOC', 'score': 0.99..., 'index': 13, 'word': '้ดจ็ถ ', 'start': 43, 'end': 45}
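
Since BERT-style tokenizers typically split Hanja text character by character, it is often convenient to let the pipeline merge per-token predictions into whole entity spans. A minimal sketch continuing the example above, using the transformers aggregation option (with aggregation, each result dict carries an 'entity_group' key instead of 'entity'):

# Optional: group per-character predictions into whole entity spans.
ner_grouped = pipeline(
    "token-classification",
    model="ddokbaro/SillokBert-NER",
    aggregation_strategy="simple",  # "first", "max", "average" are also available
)
for entity in ner_grouped(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 4))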

์‚ฌ์ „ํ•™์Šต ๋ชจ๋ธ ์›๋ณธ (Original Pre-trained Model)

๋ณธ ๋ฆฌํฌ์ง€ํ† ๋ฆฌ์—๋Š” ์ด NER ๋ชจ๋ธ์˜ ๊ธฐ๋ฐ˜์ด ๋œ ์›๋ณธ SillokBert (Trial 11) ์ฒดํฌํฌ์ธํŠธ ํŒŒ์ผ๋“ค์ด 'SillokBert_trial11/' ํด๋”์— ํ•จ๊ป˜ ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. ๋‹ค๋ฅธ ๋‹ค์šด์ŠคํŠธ๋ฆผ ํƒœ์Šคํฌ์— ์ง์ ‘ ํŒŒ์ธํŠœ๋‹์„ ์‹œ๋„ํ•ด๋ณด๊ณ ์ž ํ•˜๋Š” ์—ฐ๊ตฌ์ž๋“ค์€ ํ•ด๋‹น ํด๋”์˜ ํŒŒ์ผ๋“ค์„ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

This repository also contains the original SillokBert (Trial 11) checkpoint files in the 'SillokBert_trial11/' folder. Researchers who wish to fine-tune this model on other downstream tasks can utilize the files in that directory.
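
A minimal loading sketch, assuming the files in 'SillokBert_trial11/' form a standard transformers checkpoint inside this repository; the task head and num_labels below are placeholders for your own downstream task:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the base SillokBert (Trial 11) checkpoint from the repository subfolder;
# swap in the Auto* class that matches your downstream task.
tokenizer = AutoTokenizer.from_pretrained(
    "ddokbaro/SillokBert-NER", subfolder="SillokBert_trial11"
)
model = AutoModelForSequenceClassification.from_pretrained(
    "ddokbaro/SillokBert-NER", subfolder="SillokBert_trial11", num_labels=2
)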

ํ•™์Šต ๋ฐ ํ‰๊ฐ€ ๋ฐ์ดํ„ฐ (Training and Evaluation Data)

๋ฐ์ดํ„ฐ์…‹ (Dataset)

์ด ๋ชจ๋ธ์€ ์กฐ์„ ์™•์กฐ์‹ค๋ก ์›๋ณธ XML ํŒŒ์ผ๋กœ๋ถ€ํ„ฐ ๊ตฌ์ถ•๋œ **Sillok NER Corpus**๋กœ ํ•™์Šต๋˜์—ˆ์Šต๋‹ˆ๋‹ค. (This model was trained on the Sillok NER Corpus, a custom dataset built from the original XML files of the Veritable Records of the Joseon Dynasty.)

  • ์›์ฒœ ๋ฐ์ดํ„ฐ (Source Data): ๊ณต๊ณต๋ฐ์ดํ„ฐํฌํ„ธ - ๊ต์œก๋ถ€ ๊ตญ์‚ฌํŽธ์ฐฌ์œ„์›ํšŒ_์กฐ์„ ์™•์กฐ์‹ค๋ก ์ •๋ณด_์‹ค๋ก์›๋ฌธ https://www.data.go.kr/data/15053647/fileData.do. ์—ฐ๊ตฌ์˜ ํ† ๋Œ€๊ฐ€ ๋œ ๊ท€์ค‘ํ•œ ์ž๋ฃŒ๋ฅผ ์ œ๊ณตํ•ด์ฃผ์‹  ๊ต์œก๋ถ€ ๊ตญ์‚ฌํŽธ์ฐฌ์œ„์›ํšŒ ์ธก์— ๊ฐ์‚ฌ์˜ ๋ง์”€์„ ์ „ํ•œ๋‹ค.
    We express our gratitude to the National Institute of Korean History (Ministry of Education) for providing the invaluable data that formed the foundation of this research.
  • ๋ฐ์ดํ„ฐ ๋ฒ„์ „ ๋ฐ ์žฌํ˜„์„ฑ (Data Version and Reproducibility): ๋ณธ ์—ฐ๊ตฌ๋Š” 2022๋…„ 11์›” 03์ผ์— ๋“ฑ๋ก๋œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•ฉ๋‹ˆ๋‹ค. ๊ณต์‹ ๋ฐฐํฌ์ฒ˜์˜ ๋ฐ์ดํ„ฐ๊ฐ€ ์—…๋ฐ์ดํŠธ๋  ์ˆ˜ ์žˆ์–ด, ์™„๋ฒฝํ•œ ์žฌํ˜„์„ฑ์„ ๋ณด์žฅํ•˜๊ธฐ ์œ„ํ•ด ํ•™์Šต์— ์‚ฌ์šฉ๋œ ์›๋ณธ XML ํŒŒ์ผ ์ „์ฒด๋ฅผ raw_data/sillok_raw_xml.zip ํŒŒ์ผ๋กœ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ, ์ฆ‰์‹œ ํ™œ์šฉ ๊ฐ€๋Šฅํ•œ ์ „์ฒ˜๋ฆฌ ์™„๋ฃŒ ํ…์ŠคํŠธ ํŒŒ์ผ(train.txt, validation.txt, test.txt)์€ preprocessed_data/ ํด๋”์—์„œ ํ™•์ธํ•˜์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
    This research is based on the data registered on November 3, 2022. As the data from the official distributor may be updated, we provide the entire original XML files used for training as raw_data/sillok_raw_xml.zip in this repository to ensure perfect reproducibility. Additionally, the preprocessed text files (train.txt, validation.txt, test.txt) ready for immediate use can be found in the preprocessed_data/ folder.
  • ์ „์ฒ˜๋ฆฌ (Preprocessing): XML์˜ <index> ํƒœ๊ทธ๋ฅผ ํŒŒ์‹ฑํ•˜์—ฌ ๊ฐœ์ฒด๋ช… ํ…์ŠคํŠธ, ์œ ํ˜•(์ด๋ฆ„, ์ง€๋ช…, ์„œ๋ช…, ์—ฐํ˜ธ), ๊ณ ์œ  ์ฐธ์กฐ ID๋ฅผ ์ถ”์ถœํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด ์ •๋ณด๋Š” 3์—ด์˜ CoNLL ํ˜•์‹(token ner_tag ref_id)์œผ๋กœ ๋ณ€ํ™˜๋˜์—ˆ์Šต๋‹ˆ๋‹ค. (The <index> tags in the XML were parsed to extract the entity text, its type (person name, place name, book title, or era name), and a unique reference ID. This information was converted into a 3-column CoNLL format (token ner_tag ref_id); see the sketch after this list.)
  • ๋ฐ์ดํ„ฐ ๋ถ„ํ•  (Data Split): ์ „์ฒด ๋ง๋ญ‰์น˜๋Š” ํ•™์Šต(80%), ๊ฒ€์ฆ(10%), ํ‰๊ฐ€(10%) ์„ธํŠธ๋กœ ๋ฌด์ž‘์œ„ ๋ถ„ํ• ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. (The full corpus was randomly split into training (80%), validation (10%), and test (10%) sets.)
    • ํ•™์Šต ์„ธํŠธ (Training Set): 375,366 ๋ฌธ์žฅ (sentences)
    • ๊ฒ€์ฆ ์„ธํŠธ (Validation Set): 46,920 ๋ฌธ์žฅ (sentences)
    • ํ‰๊ฐ€ ์„ธํŠธ (Test Set): 46,922 ๋ฌธ์žฅ (sentences)
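
A minimal sketch of the preprocessing and split described above. The XML details (attribute names "type" and "ref", a per-sentence element, per-character tokenization) and the type-to-tag mapping are assumptions to be checked against raw_data/sillok_raw_xml.zip; this is not the project's actual script.

from xml.etree import ElementTree as ET  # e.g. ET.parse(path).getroot()
import random

# Hypothetical mapping from the XML "type" values to NER tags.
TYPE_TO_TAG = {"์ด๋ฆ„": "PER", "์ง€๋ช…": "LOC", "์„œ๋ช…": "POH", "์—ฐํ˜ธ": "DAT"}

def element_to_rows(elem):
    """Turn one sentence element into (token, ner_tag, ref_id) rows,
    one Hanja character per token."""
    rows = [(ch, "O", "_") for ch in (elem.text or "")]
    for index in elem.findall("index"):            # entity annotations
        tag = TYPE_TO_TAG.get(index.get("type"))   # None for unknown types
        for i, ch in enumerate(index.text or ""):
            label = ("B-" if i == 0 else "I-") + tag if tag else "O"
            rows.append((ch, label, index.get("ref", "_") if tag else "_"))
        rows.extend((ch, "O", "_") for ch in (index.tail or ""))  # trailing text
    return rows

# Random 80/10/10 split over all sentences.
sentences = []   # populate by applying element_to_rows over the parsed XML
random.seed(42)  # arbitrary seed; the card does not state one
random.shuffle(sentences)
n = len(sentences)
train = sentences[: int(0.8 * n)]
val = sentences[int(0.8 * n) : int(0.9 * n)]
test = sentences[int(0.9 * n) :]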

๋ฐ์ดํ„ฐ์…‹ ๋‹ค์šด๋กœ๋“œ (Dataset Download)

๋ณธ ๋ฆฌํฌ์ง€ํ† ๋ฆฌ์—๋Š” ๋ชจ๋ธ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, ์—ฐ๊ตฌ์— ์‚ฌ์šฉ๋œ ์ „์ฒ˜๋ฆฌ ์™„๋ฃŒ ๋ฐ์ดํ„ฐ์™€ ์›๋ณธ ๋ฐ์ดํ„ฐ๊ฐ€ ๋ชจ๋‘ ํฌํ•จ๋˜์–ด ์žˆ์–ด ์ฆ‰์‹œ ํ™œ์šฉ ๋ฐ ์žฌํ˜„์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

This repository contains not only the model but also the pre-processed and raw data used in the research, allowing for immediate use and reproducibility.

'data/raw_xml/': ์—ฐ๊ตฌ์˜ ๊ธฐ๋ฐ˜์ด ๋œ ์›๋ณธ XML ํŒŒ์ผ ์ „์ฒด๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. (Contains the complete original XML files that formed the basis of this research.)

'preprocessed_data/': ์ฆ‰์‹œ ํ™œ์šฉ ๊ฐ€๋Šฅํ•œ CoNLL ํ˜•์‹์˜ 'train.txt', 'validation.txt', 'test.txt' ํŒŒ์ผ์ด ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. (Contains ready-to-use CoNLL formatted files: 'train.txt', 'validation.txt', and 'test.txt'.)
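
The 3-column files can be read back with a few lines of Python. A sketch assuming whitespace-separated columns and blank-line sentence boundaries; the reference IDs shown in the trailing comment are hypothetical placeholders:

def read_conll(path):
    """Read 'token ner_tag ref_id' lines; blank lines separate sentences."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if current:
                    sentences.append(current)
                current = []
                continue
            token, ner_tag, ref_id = line.split()
            current.append((token, ner_tag, ref_id))
    if current:
        sentences.append(current)
    return sentences

train_sentences = read_conll("preprocessed_data/train.txt")
# e.g. [('ๅคช', 'B-PER', 'REF_0001'), ('ๅฎ—', 'I-PER', 'REF_0001'), ('ๅœจ', 'O', '_'), ...]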

๊ฐœ์ฒด๋ช… ์œ ํ˜• (Entity Types)

| ํƒœ๊ทธ (Tag) | ์„ค๋ช… (Description) | XML type | ์›๋ณธ ๋ฐ์ดํ„ฐ ์ˆ˜ (Raw Data Count) |
| --- | --- | --- | --- |
| PER | Person Name (์ธ๋ช…) | ์ด๋ฆ„ | 1,495,199 |
| LOC | Location Name (์ง€๋ช…) | ์ง€๋ช… | 490,163 |
| POH | Publication of History (์„œ์ฑ…๋ช…) | ์„œ๋ช… | 49,506 |
| DAT | Date / Era Name (์—ฐํ˜ธ) | ์—ฐํ˜ธ | 5,964 |

ํ•™์Šต ์ ˆ์ฐจ (Training Procedure)

๊ณต์ •ํ•œ ํ‰๊ฐ€๋ฅผ ์œ„ํ•ด ๋ชจ๋“  ๋น„๊ต ๋ชจ๋ธ์— ๋™์ผํ•œ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํŒŒ์ธํŠœ๋‹์„ ์ง„ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค. (The model was fine-tuned using the same set of hyperparameters across all comparative models to ensure a fair evaluation.)

  • ํ•™์Šต๋ฅ  (Learning Rate): 2e-5
  • ๋ฐฐ์น˜ ์‚ฌ์ด์ฆˆ (Batch Size): 16
  • ์—ํญ (Epochs): 3
  • ๊ฐ€์ค‘์น˜ ๊ฐ์‡  (Weight Decay): 0.01

์„ฑ๋Šฅ ํ‰๊ฐ€ (Evaluation)

๋„๋ฉ”์ธ ํŠนํ™” ์‚ฌ์ „ํ•™์Šต์˜ ํšจ๊ณผ๋ฅผ ๊ฒ€์ฆํ•˜๊ธฐ ์œ„ํ•ด ํฌ๊ด„์ ์ธ ๋น„๊ต ๋ถ„์„์„ ์ˆ˜ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค. (We conducted a comprehensive comparative analysis to validate the effectiveness of domain-specific pre-training.)

๋น„๊ต ๋ชจ๋ธ (Models for Comparison)

  • ๊ทธ๋ฃน 1 (์ž์ฒด ๋ชจ๋ธ / Our Models): SillokBert (Top 3 Trials) vs. bert-base-multilingual-cased (๋ฒ ์ด์Šค๋ผ์ธ / Baseline).
  • ๊ทธ๋ฃน 2 (์™ธ๋ถ€ ๋ชจ๋ธ / External Models): ํ˜„๋Œ€ ํ•œ๊ตญ์–ด(klue/roberta-large) ๋˜๋Š” ๋‹ค๋ฅธ ์ค‘๊ตญ ๊ณ ๋ฌธ(SIKU-BERT, guwenbert-large)์œผ๋กœ ์‚ฌ์ „ํ•™์Šต๋œ ๋ชจ๋ธ. (Models pre-trained on modern Korean (klue/roberta-large) or other classical Chinese texts (SIKU-BERT, guwenbert-large).)
  • ๊ทธ๋ฃน 3 (SOTA ๋ฒค์น˜๋งˆํฌ / SOTA Benchmark): ์ค‘๊ตญ ๊ณ ๋ฌธ NER ๊ณผ์ œ๋กœ ๊ธฐํ•™์Šต๋œ ๋ชจ๋ธ(ethanyt/guwen-ner). (A pre-trained NER model for classical Chinese (ethanyt/guwen-ner).)
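
The card does not name the scoring tool, but entity-level F1, precision, and recall of the kind reported below are conventionally computed with seqeval; a tiny sketch:

# pip install seqeval
from seqeval.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)

y_true = [["B-PER", "I-PER", "O", "B-LOC"]]  # gold BIO tags per sentence
y_pred = [["B-PER", "I-PER", "O", "O"]]      # model predictions
print(f1_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), accuracy_score(y_true, y_pred))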

๊ฒฐ๊ณผ (Results)

๋‹ค์Œ ํ‘œ๋Š” ๊ฐ ๋ชจ๋ธ์˜ ๊ฒ€์ฆ ์„ธํŠธ์— ๋Œ€ํ•œ ์ตœ๊ณ  F1 ์ ์ˆ˜๋ฅผ ์š”์•ฝํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค. (The following table summarizes the best F1 scores on the validation set for each model.)

| ๊ทธ๋ฃน (Group) | ๋ชจ๋ธ๋ช… (Model) | ๊ธฐ๋ฐ˜ ๋ฐ์ดํ„ฐ (Base Data) | F1 ์ ์ˆ˜ (F1) | ์ •๋ฐ€๋„ (P) | ์žฌํ˜„์œจ (R) | ์ •ํ™•๋„ (Acc) | ๋น„๊ณ  (Notes) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | SillokBert (Trial 11) | ์‹ค๋ก (Sillok, ours) | 0.9569 | 0.9485 | 0.9655 | 0.9959 | ์ตœ๊ณ  ์„ฑ๋Šฅ ๋‹ฌ์„ฑ (best performance) |
| 1 | SillokBert (Trial 10) | ์‹ค๋ก (Sillok, ours) | 0.9565 | 0.9572 | 0.9558 | 0.9960 | ์ตœ๊ณ  ์„ฑ๋Šฅ๊ณผ ๋Œ€๋“ฑ (on par with the best) |
| 1 | SillokBert (Trial 4) | ์‹ค๋ก (Sillok, ours) | 0.9564 | 0.9586 | 0.9542 | 0.9959 | ddokbaro/SillokBert ๊ณต์‹ ๋ชจ๋ธ (official release) |
| 1 | bert-base-multilingual-cased | ๋‹ค๊ตญ์–ด (multilingual, general) | 0.9530 | 0.9544 | 0.9516 | 0.9956 | ์‚ฌ์ „ํ•™์Šต ํšจ๊ณผ ๋น„๊ต์šฉ ๋ฒ ์ด์Šค๋ผ์ธ (baseline for the pre-training effect) |
| 2 | klue/roberta-large | ํ˜„๋Œ€ ํ•œ๊ตญ์–ด (modern Korean) | 0.9488 | 0.9501 | 0.9475 | 0.9952 | ์ตœ์‹  ์•„ํ‚คํ…์ฒ˜, ๋„๋ฉ”์ธ ๋ถˆ์ผ์น˜๋กœ ์„ฑ๋Šฅ ํ•˜๋ฝ (newer architecture, hurt by domain mismatch) |
| 2 | ethanyt/guwenbert-large | ์ค‘๊ตญ ๊ณ ๋ฌธ (classical Chinese, general) | 0.9461 | 0.9450 | 0.9472 | 0.9951 | ์œ ์‚ฌ ๋„๋ฉ”์ธ, SillokBert ๋Œ€๋น„ ํ•˜๋ฝ (related domain, below SillokBert) |
| 2 | SIKU-BERT/sikubert | ์ค‘๊ตญ ๊ณ ๋ฌธ (์‚ฌ๊ณ ์ „์„œ / Siku Quanshu) | 0.9421 | 0.9380 | 0.9463 | 0.9948 | ํŠน์ • ๊ณ ๋ฌธํ—Œ, SillokBert ๋Œ€๋น„ ํ•˜๋ฝ (specific classical corpus, below SillokBert) |
| 3 | ethanyt/guwen-ner (SOTA) | ์ค‘๊ตญ ๊ณ ๋ฌธ (๊ธฐํ•™์Šต NER / pre-trained NER) | 0.1749 | 0.2601 | 0.1317 | 0.9288 | ๋ผ๋ฒจ/๋„๋ฉ”์ธ ๋ถˆ์ผ์น˜๋กœ ์œ ์˜๋ฏธํ•œ ์ธก์ • ๋ถˆ๊ฐ€ (label/domain mismatch, not meaningfully measurable) |

๊ฒฐ๊ณผ ๋ถ„์„ (Analysis of Results)

  • SillokBert์˜ ์šฐ์ˆ˜์„ฑ (Superiority of SillokBert): SillokBert๋Š” ๋‹ค๋ฅธ ๋ชจ๋“  ๋น„๊ต ๋ชจ๋ธ๋ณด๋‹ค ์ผ๊ด€๋˜๊ฒŒ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๋ณด์ด๋ฉฐ, ๋„๋ฉ”์ธ ํŠนํ™” ์ง€์†-์‚ฌ์ „ํ•™์Šต(domain-specific continued pre-training)์˜ ๋ช…๋ฐฑํ•œ ์ด์ ์„ ๋ณด์—ฌ์ฃผ์—ˆ์Šต๋‹ˆ๋‹ค. (SillokBert consistently outperformed all other models, demonstrating the clear advantage of domain-specific continued pre-training.)
  • ๋„๋ฉ”์ธ ์ •ํ•ฉ์„ฑ์˜ ์ค‘์š”์„ฑ (Importance of Domain Alignment): klue/roberta-large์™€ ๊ฐ™์ด ํ˜„๋Œ€ ํ•œ๊ตญ์–ด ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต๋œ ๊ฐ•๋ ฅํ•œ ๋ชจ๋ธ์ด๋‚˜, guwenbert-large, SIKU-BERT ๋“ฑ ๋‹ค๋ฅธ ์ค‘๊ตญ ๊ณ ๋ฌธ ํ…์ŠคํŠธ๋กœ ํ•™์Šต๋œ ๋ชจ๋ธ์กฐ์ฐจ SillokBert์˜ ์„ฑ๋Šฅ์—๋Š” ๋ฏธ์น˜์ง€ ๋ชปํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ๋ณธ ๊ณผ์ œ์—์„œ ๋„๋ฉ”์ธ ์ •ํ•ฉ์„ฑ์ด ์•„ํ‚คํ…์ฒ˜ ๊ฐœ์„ ์ด๋‚˜ ์ผ๋ฐ˜์ ์ธ ์–ธ์–ด ๋Šฅ๋ ฅ๋ณด๋‹ค ๋” ์ค‘์š”ํ•œ ์š”์†Œ์ž„์„ ๊ฐ•์กฐํ•ฉ๋‹ˆ๋‹ค. (Even powerful models trained on modern Korean (klue/roberta-large) or other classical Chinese texts (guwenbert-large, SIKU-BERT) could not match the performance of SillokBert. This highlights that domain alignment is more critical than architectural improvements or general language capabilities for this specific task.)
  • ๊ธฐ์„ฑ SOTA ๋ชจ๋ธ์˜ ํ•œ๊ณ„ (Limitations of Out-of-the-Box SOTA Models): ์‚ฌ์ „ํ•™์Šต๋œ guwen-ner ๋ชจ๋ธ์€ ๋ ˆ์ด๋ธ” ์ฒด๊ณ„์™€ ๋„๋ฉ”์ธ์˜ ๋ถˆ์ผ์น˜๋กœ ์ธํ•ด ์šฐ๋ฆฌ ๋ฐ์ดํ„ฐ์…‹์—์„œ ์‹คํŒจํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ์™ธ๋ถ€ ๋„๊ตฌ๋ฅผ ๋ฌด๋น„ํŒ์ ์œผ๋กœ ์ ์šฉํ•˜๊ธฐ๋ณด๋‹ค, ํŠนํ™”๋œ ๋ฐ์ดํ„ฐ๋ฅผ ์œ„ํ•œ ๋งž์ถคํ˜• ๋ชจ๋ธ์„ ๊ฐœ๋ฐœํ•  ํ•„์š”์„ฑ์„ ๊ฐ•์กฐํ•ฉ๋‹ˆ๋‹ค. (The pre-trained guwen-ner model failed on our dataset due to a mismatch in label schemas and domains. This underscores the necessity of developing custom models for specialized data rather than uncritically applying external tools.)

์ธ์šฉ (Citation)

์ด ๋ชจ๋ธ์ด๋‚˜ Sillok NER Corpus๋ฅผ ์—ฐ๊ตฌ์— ์‚ฌ์šฉํ•˜์‹ ๋‹ค๋ฉด, ์ด ๋ฆฌํฌ์ง€ํ† ๋ฆฌ๋ฅผ ์ธ์šฉํ•ด ์ฃผ์‹ญ์‹œ์˜ค. (If you use this model or the Sillok NER Corpus in your research, please cite this repository.)

@misc{SillokBertNER2025,
  author = {Kim, Baro},
  title = {SillokBert-NER: A Domain-Specific NER Model for the Veritable Records of the Joseon Dynasty},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/ddokbaro/SillokBert-NER}}
}