---
language: ja
license: apache-2.0
tags:
- mulan
- japanese-mulan
pipeline_tag: feature-extraction
---

# japanese-mulan-base

This is a Japanese [MuLan (Music-Language pretraining)](https://arxiv.org/abs/2208.12415) model developed by [LY Corporation](https://www.lycorp.co.jp/en/).

This model was trained on ~20k internal music-text pairs and is applicable to various music tasks, including zero-shot music classification and text-to-music or music-to-text retrieval.

## How to use

1. Install packages

```sh
pip install transformers[torch] torchaudio sentence-transformers sentencepiece
```

2. Run

```python
import torch
import torch.nn.functional as F
import torchaudio
from transformers import AutoModel, AutoProcessor


HF_MODEL_PATH = "line-corporation/japanese-mulan-base"

model = AutoModel.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)

url = "https://cdn.bensound.com/bensound-happyrock.mp3"  # music by Bensound.com
waveform, sample_rate = torchaudio.load(url)

# stereo to mono + unbatched to batched
waveform = waveform.mean(dim=0, keepdim=True)

# Candidate genre labels in Japanese: rock, hip-hop, jazz, classical
labels = ["ロック", "ヒップホップ", "ジャズ", "クラシック"]

processor.eval()
model.eval()

with torch.no_grad():
    music_feature = processor.get_music_feature(waveform, sample_rate=sample_rate)
    text_feature = processor.get_text_feature(labels)

    music_embedding = model.get_music_features(**music_feature)
    text_embedding = model.get_text_features(**text_feature)

# batched to unbatched
music_embedding = music_embedding.squeeze(dim=0)

# NOTE: music_embedding is not L2-normalized; cosine_similarity normalizes internally.
similarity = F.cosine_similarity(music_embedding, text_embedding, dim=-1)
label_index = torch.argmax(similarity, dim=-1)
label = labels[label_index.item()]

print("Estimated label:", label)
# Estimated label: ロック  (rock)
```

## Model architecture

The model uses an [Audio Spectrogram Transformer (AST)](https://arxiv.org/abs/2104.01778) as the music encoder and GLuCoSE as the text encoder.
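
Because the usage example above compares the outputs of the two encoders by cosine similarity, text-to-music retrieval can be viewed as ranking a set of music embeddings against one text embedding. The following is a minimal, dependency-free sketch of that ranking step only. The helper names `cosine` and `rank_by_cosine` and the toy vectors are hypothetical and are not part of the model's API; with real embeddings you would pass in the tensors from `get_music_features` and `get_text_features` (for example after `.tolist()`).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_cosine(query, candidates, top_k=3):
    """Return indices of the top_k candidates most similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    # Sort candidate indices by descending similarity.
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]


# Toy stand-ins for embeddings (hypothetical values, not model outputs).
music_library = [
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.6, 0.8, 0.0],
]
text_query = [0.0, 1.0, 0.0]

ranking = rank_by_cosine(text_query, music_library, top_k=2)
print(ranking)  # [1, 2]
```

Music-to-text retrieval follows the same pattern with the roles swapped: one music embedding is the query and the text embeddings are the candidates.
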
The music encoder was initialized from the [official AST model pretrained on AudioSet](https://github.com/YuanGongND/ast).
The text encoder was initialized from [pkshatech/GLuCoSE-base-ja](https://huggingface.co/pkshatech/GLuCoSE-base-ja).

## Licenses

[The Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Citation

```
@misc{mulan-japanese-base,
  title = {Japanese MuLan Base},
  author = {Takuya Hasumi and Yusuke Fujita},
  url = {https://huggingface.co/line-corporation/japanese-mulan-base},
}
```