---
datasets:
- "VERITABLE RECORDS of the JOSEON DYNASTY"
---
# SillokBert: A Language Model Specialized for Veritable Records of the Joseon Dynasty

## Model Description
SillokBert is a language model based on `bert-base-multilingual-cased`, further fine-tuned with the Masked Language Modeling (MLM) objective on the complete original text (Classical Chinese) of the Veritable Records of the Joseon Dynasty (Annals of the Joseon Dynasty) provided by the National Institute of Korean History. The model is designed to learn in depth the unique vocabulary (personal names, place names, official titles, etc.), literary style, and complex contextual structures found in the Annals. It can serve as a foundational model for a variety of historical and digital humanities research tasks, such as fill-in-the-blank inference on the original Sillok text, text correction and restoration, semantic search, and feature extraction.
This model was developed as part of the "Development of an Intelligent Korean Studies Language Model based on Classical Korean Texts" project at the Digital Humanities Research Institute, The Academy of Korean Studies. The training environment was supported by the 2025 High-Performance Computing Support (GPU) Program (No. G2025-0450) of the National IT Industry Promotion Agency (NIPA) under the Ministry of Science and ICT. We sincerely appreciate the support for providing the high-performance computing environment essential to this research.
## Intended Use

### Direct Use
You can use the `fill-mask` pipeline to test the model's linguistic understanding directly, or to predict the most probable words in a given context.
```python
# !pip install transformers torch  # Uncomment and run if the libraries are not installed.

# Import pipeline and AutoTokenizer from the transformers library.
from transformers import pipeline, AutoTokenizer

# Path of the model and tokenizer on the Hugging Face Hub.
model_path = "ddokbaro/SillokBert"

# Load the tokenizer from the specified path.
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Create a pipeline for the "fill-mask" task,
# using the model path and the tokenizer loaded above.
fill_mask = pipeline("fill-mask", model=model_path, tokenizer=tokenizer)

# Example sentence (in the style of the Veritable Records of the Joseon Dynasty):
# 上曰, "予 無 [MASK] 之意." (The King said, "I have no intention of [MASK].")
text = "上曰, \"予 無 [MASK] 之意.\""

# Run the pipeline to infer the word that fills the [MASK] slot.
results = fill_mask(text)

# Print the results.
print(f"Inference results for '{text}':")
for item in results:
    # 'sequence' is the full sentence with [MASK] filled in; 'score' is the confidence of that prediction.
    print(f"  - sequence: {item['sequence']}, score: {item['score']:.4f}, token: {item['token_str']}")
```
### Downstream Use

This model can be used as a pre-trained model for various downstream tasks targeting the text of the Veritable Records of the Joseon Dynasty.
- Text Classification: Classification of articles on specific topics (e.g., military, diplomacy, rituals).
- Named Entity Recognition: Automatic extraction of named entities such as personal names, place names, and official titles.
- Semantic Search: Retrieval of semantically similar articles, going beyond simple keyword matching (see the embedding sketch below).
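As a minimal illustration of feature extraction for semantic search, the sketch below pulls sentence embeddings from the MLM encoder using mean pooling. The pooling strategy and the example passages are illustrative choices, not part of this repository; classification and NER would additionally require fine-tuning a task-specific head.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "ddokbaro/SillokBert"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)  # encoder only; the MLM head is dropped
model.eval()

def embed(texts):
    # Tokenize a batch of Sillok passages (Classical Chinese).
    enc = tokenizer(texts, padding=True, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    # Mean-pool token embeddings, ignoring padding positions.
    mask = enc["attention_mask"].unsqueeze(-1).float()
    summed = (out.last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

# Cosine similarity between two illustrative passages as a simple semantic-search score.
emb = embed(["上曰, 予無此意.", "命修築城堡."])
score = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)
print(f"cosine similarity: {score.item():.4f}")
```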
## Training Data

### Data Source and Collection
- Source Data: Public Data Portal, National Institute of Korean History (Ministry of Education), "Veritable Records of the Joseon Dynasty - Original Sillok Text": https://www.data.go.kr/data/15053647/fileData.do. We express our gratitude to the National Institute of Korean History for providing the invaluable data that formed the foundation of this research.
- Data Version and Reproducibility: This research is based on the data registered on November 3, 2022. As the data from the official distributor may be updated, we provide the entire set of original XML files used for training as `raw_data/sillok_raw_xml.zip` in this repository to ensure full reproducibility. In addition, the preprocessed text files (`train.txt`, `validation.txt`, `test.txt`), ready for immediate use, can be found in the `preprocessed_data/` folder.
### Data Preprocessing

The original XML in `raw_data` was processed into `preprocessed_data` through the following steps:
- Structural Parsing: Parsed the XML files using the `lxml` library.
- Text Extraction: Extracted text from the `paragraph` tags within each article (`level5`).
- Annotation Exclusion: To ensure the model learned purely from the original source text, modern annotations (`<annotation>`), which do not affect the meaning of the original text but could interfere with training, were excluded entirely during extraction.
- Normalization: Normalized unnecessary whitespace and line breaks in the extracted text.
- Filtering: Filtered out articles shorter than 10 characters to focus on meaningful contextual learning.
The detailed preprocessing logic and the complete code can be found in the `scripts/prepare_data.py` script uploaded to this repository.
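For orientation, here is a condensed sketch of the steps above, assuming the XML nests `paragraph` and `annotation` elements inside each `level5` article; the authoritative implementation remains `scripts/prepare_data.py`.

```python
import re
from lxml import etree

def extract_articles(xml_path):
    """Return cleaned article texts from one Sillok XML file (illustrative sketch)."""
    tree = etree.parse(xml_path)                                  # 1. Structural parsing with lxml
    articles = []
    for article in tree.iter("level5"):                           # 2. One element per article
        etree.strip_elements(article, "annotation", with_tail=False)  # 3. Drop modern annotations
        text = " ".join("".join(p.itertext()) for p in article.iter("paragraph"))
        text = re.sub(r"\s+", " ", text).strip()                  # 4. Normalize whitespace / line breaks
        if len(text) >= 10:                                       # 5. Filter out very short articles
            articles.append(text)
    return articles
```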
### Data Statistics

- Total Articles: 402,339
- Data Split (90/5/5):
  - Train: 362,107 articles
  - Validation: 20,116 articles
  - Test: 20,116 articles
- Additional Information:
  - Total Characters: 66,322,312
  - Vocabulary Size: 119,547
## Training Procedure

### Hyperparameter Optimization (HPO)
To obtain the best performance within limited research resources, this model adopted a staged-exploration HPO strategy: a funnel-like approach that progressively narrows down promising candidates from a wide search space, designed to ensure both efficiency and reliability.
#### Stage 1: Broad Exploration
- Objective: Quickly identify and exclude underperforming regions of a wide hyperparameter space.
- Method: Each trial was trained for a very short period (at most 20 steps) on only **10%** of the training data. Wide search ranges were set for the key hyperparameters influencing model performance, such as `learning_rate` (1e-6 to 1e-3), effective batch size (16 to 512), and `optimizer` type (`AdamW`, `Adafactor`), to cover the space of potentially good configurations.
- Result: Minimized computational waste and produced initial insights into the promising parameter regions used in the subsequent focused search (lowest `eval_loss` ~3.83).
#### Stage 2: Focused Search
- Objective: Narrow the promising regions identified in Stage 1 down to a reliable set of candidates by using more data and longer training.
- Method: Based on the Stage 1 results, a focused search was conducted over the following hyperparameter space, training for 2-4 epochs on **40%** of the dataset:
  - learning_rate: `2e-5` to `1e-4` (log-uniform)
  - effective batch size (`per_device_train_batch_size` * `gradient_accumulation_steps`): 32 to 256
  - weight_decay: `0.0` to `0.1`
  - lr_scheduler_type: `linear`, `cosine`, `constant_with_warmup`
- Result: The `eval_loss` improved substantially, down to 1.8822, and the top 10 hyperparameter combinations were selected for final evaluation. Parameter importance analysis showed that `learning_rate` and `weight_decay` had the most decisive influence on model performance (see `hpo_visualizations/stage2_param_importances.html` for details; a sketch of reloading this analysis from the shipped study database follows below).
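The importance analysis can be recomputed from the Stage 2 study database shipped with this repository (see "Analysis of HPO Results" below); a minimal Optuna sketch, assuming the DB path and study name listed there:

```python
import optuna

# Load the Stage 2 study from the SQLite file in hpo_databases/.
study = optuna.load_study(
    study_name="Sillok-LM_MLM_HyperOpt_Heavier_bert_base_multilingual_cased",
    storage="sqlite:///hpo_databases/hpo_stage2_search.db",
)

# Recompute parameter importances; learning_rate and weight_decay are
# reported above as the most influential parameters.
for name, score in optuna.importance.get_param_importances(study).items():
    print(f"{name}: {score:.3f}")
```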
#### Stage 3: Final Validation
- Objective: Precisely measure, on the full dataset, the actual performance and overfitting point of the top 10 candidates selected in Stage 2, and finalize the model.
- Method: Each candidate was trained for 10 epochs on the **entire dataset (100%)**. The validation loss was recorded at every epoch, and the checkpoint with the best performance was kept.
- Result: The Trial 4 model, which recorded a final Test Loss of 1.4163 and a Perplexity of 4.1219, was selected as the final model.
### Final Model Hyperparameters

The hyperparameters of the Trial 4 model, selected through the Stage 3 final validation, are as follows; a sketch of how they map onto the Hugging Face `Trainer` API appears after the list.
- learning_rate: 9.66e-05
- per_device_train_batch_size: 8
- gradient_accumulation_steps: 4 (Effective batch size: 32)
- weight_decay: 0.0401
- lr_scheduler_type: linear
- adam_beta1: 0.8943
- adam_beta2: 0.9923
- warmup_ratio: 0.0983
- optimizer: AdamW
- mlm_probability: 0.15
- max_seq_length: 256
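As a rough sketch, these values map onto the Hugging Face `Trainer` API as shown below. The output path and the (tokenized) datasets are placeholders, and this is not the repository's actual training script.

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForMaskedLM.from_pretrained(base_model)

# Dynamic masking with the listed probability (mlm_probability: 0.15);
# max_seq_length (256) is applied when tokenizing the corpus.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="sillokbert-trial4",        # illustrative path
    learning_rate=9.66e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,         # effective batch size 32
    weight_decay=0.0401,
    lr_scheduler_type="linear",
    adam_beta1=0.8943,
    adam_beta2=0.9923,
    warmup_ratio=0.0983,                   # AdamW is the default optimizer
    num_train_epochs=10,                   # Stage 3 trained each candidate for 10 epochs
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,           # keep the best epoch checkpoint
)

# trainer = Trainer(model=model, args=args, data_collator=collator,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train()
```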
## Training Environment

- Hardware: 1 x NVIDIA A100-PCIE-40GB
- Software: For reproducibility, the following versions of the major libraries were used in this study:
  - `transformers`: v4.47.1
  - `datasets`: v3.0.1
  - `torch`: v2.6.0a0+ecf3bae40a.nv25.01
  - `optuna`: v4.3.0
  - `accelerate`: v1.5.2
  - `pandas`: v2.2.2
  - `lxml`: v5.3.0
  - `tqdm`: v4.67.1
  - `scikit-learn`: v1.6.1
## Evaluation

### Evaluation Metrics

- Test Loss: The loss value of the model on the test dataset.
- Perplexity (PPL): A standard metric for evaluating language models, representing the model's uncertainty; a lower value indicates that the model predicts the masked tokens better. (Formula: PPL = e^loss; a small worked example follows.)
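The reported perplexities follow directly from the losses; for the final model:

```python
import math

# Perplexity is the exponential of the (cross-entropy) test loss.
test_loss = 1.4163                    # final model (Trial 4), see the table below
print(f"PPL = exp({test_loss}) ≈ {math.exp(test_loss):.4f}")  # ≈ 4.1219
```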
### Evaluation Results

The final performance of the top 10 candidate models, evaluated on the held-out test dataset (`test.txt`), which was never used during training, is as follows:
| Rank | Trial No. | Test Loss | Perplexity (PPL) | Change vs. Validation Rank |
|---|---|---|---|---|
| 1 | 4 | 1.4163 | 4.1219 | - |
| 2 | 11 | 1.4182 | 4.1296 | ▲ 2 |
| 3 | 10 | 1.4197 | 4.1357 | ▲ 2 |
| 4 | 3 | 1.4198 | 4.1362 | ▼ 1 |
| 5 | 2 | 1.4202 | 4.1381 | ▼ 3 |
| 6 | 7 | 1.4800 | 4.3931 | - |
| 7 | 8 | 1.5229 | 4.5853 | - |
| 8 | 6 | 1.5269 | 4.6040 | - |
| 9 | 5 | 1.5688 | 4.8010 | - |
| 10 | 9 | 1.5757 | 4.8339 | - |
**Analysis:** The Trial 4 model, which ranked first on the validation dataset, also showed the best generalization performance on the test dataset, supporting the validity of the staged HPO strategy adopted in this study. Furthermore, the fact that Trial 11, ranked 4th on the validation set, rose to 2nd place in the final evaluation underlines the importance of validating all of the top 10 candidates rather than only the single best one.
### Analysis of HPO Results

Detailed results of the entire hyperparameter optimization process can be found in the `hpo_visualizations` and `hpo_databases` folders of this repository.
- Static Visualization Reports
  - Location: `hpo_visualizations/`
  - Description: HTML files containing interactive graphs that visualize the results of each HPO stage. They can be opened directly in a browser for quick exploration. (Visualizations for Stage 1 are excluded due to its short training time and low statistical significance.)
- Raw HPO Data
  - Location: `hpo_databases/`
  - Description: Original SQLite database files containing the records of all HPO trials. Other researchers can load these files to reproduce the results of this study or to conduct their own analyses.
  - Usage Example: You can perform the analysis yourself using the provided `scripts/hpo_result_analyzer_universal.py` script and the DB files, as shown below.
```bash
# Reproduce the Stage 2 analysis
# (DB and study names should be adjusted to match the actual uploaded files and settings.)
python scripts/hpo_result_analyzer_universal.py \
  --db_path "hpo_databases/hpo_stage2_search.db" \
  --study_name "Sillok-LM_MLM_HyperOpt_Heavier_bert_base_multilingual_cased" \
  --file_prefix "reproduced_stage2_"

# Reproduce the Stage 3 analysis
python scripts/hpo_result_analyzer_universal.py \
  --db_path "hpo_databases/hpo_stage3_validation.db" \
  --study_name "Sillok-LM_Final_Top10_Run" \
  --file_prefix "reproduced_stage3_"
```
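The same databases can also be opened directly with Optuna (the version used is listed under Training Environment); a minimal sketch for the Stage 3 study, using the file and study names from the commands above:

```python
import optuna

study = optuna.load_study(
    study_name="Sillok-LM_Final_Top10_Run",
    storage="sqlite:///hpo_databases/hpo_stage3_validation.db",
)

print("Best trial:", study.best_trial.number, "value:", study.best_value)

# All trial records as a pandas DataFrame for custom analysis.
df = study.trials_dataframe()
print(df[["number", "value", "state"]].head(10))
```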
## Limitations and Bias

- Since this model was trained on the original text of the Veritable Records of the Joseon Dynasty, its performance may degrade on modern Korean or on Classical Chinese texts from other periods.
- The Veritable Records were written from the perspective of a specific class (kings and scholar-officials), so the content generated or predicted by the model may inherit these historical and ideological biases. Users should interpret the model's outputs critically.
## Team and Citation

### Team

- Baro Kim (김바로): Principal Investigator, Digital Humanities Research Institute, The Academy of Korean Studies

### Citation

If you use this model in your research, please cite it as follows:
```bibtex
@misc{kim2025sillokbert,
  title        = {{SillokBert: A Language Model for Veritable Records of the Joseon Dynasty}},
  author       = {Kim, Baro},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/ddokbaro/SillokBert}}
}
```