---
language: ja
license: apache-2.0
tags:
  - mulan
  - japanese-mulan
pipeline_tag: feature-extraction
---

# japanese-mulan-base

This is a Japanese [MuLan (Music-Language pretraining)](https://arxiv.org/abs/2208.12415) model developed by [LY Corporation](https://www.lycorp.co.jp/en/). It was trained on approximately 20k internal music-text pairs and can be applied to music tasks such as zero-shot music classification, text-to-music retrieval, and music-to-text retrieval.

## How to use

1. Install packages

```sh
pip install "transformers[torch]" torchaudio sentence-transformers sentencepiece
```

2. Run

```python
import torch
import torch.nn.functional as F
import torchaudio
from transformers import AutoModel, AutoProcessor

HF_MODEL_PATH = "line-corporation/japanese-mulan-base"

model = AutoModel.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)

url = "https://cdn.bensound.com/bensound-happyrock.mp3"  # music by Bensound.com
waveform, sample_rate = torchaudio.load(url)
# stereo to mono + unbatched to batched
waveform = waveform.mean(dim=0, keepdim=True)

labels = ["ロック", "ヒップホップ", "ジャズ", "クラシック"]

processor.eval()
model.eval()

with torch.no_grad():
    music_feature = processor.get_music_feature(waveform, sample_rate=sample_rate)
    text_feature = processor.get_text_feature(labels)
    music_embedding = model.get_music_features(**music_feature)
    text_embedding = model.get_text_features(**text_feature)

# batched to unbatched
music_embedding = music_embedding.squeeze(dim=0)

# NOTE: music_embedding is not L2-normalized; F.cosine_similarity normalizes internally.
similarity = F.cosine_similarity(music_embedding, text_embedding, dim=-1)
label_index = torch.argmax(similarity, dim=-1)
label = labels[label_index.item()]

print("Estimated label:", label)
# Estimated label: ロック
```
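The same embeddings can also drive text-to-music retrieval: embed a text query, embed a collection of tracks, and rank the tracks by cosine similarity. The following is a minimal sketch of that ranking step only. The helper name `rank_by_cosine` and the toy 3-dimensional vectors are hypothetical stand-ins for illustration; in practice the inputs would be the outputs of `model.get_text_features` and `model.get_music_features` shown above.

```python
import torch
import torch.nn.functional as F


def rank_by_cosine(query: torch.Tensor, candidates: torch.Tensor, top_k: int = 3):
    """Return (indices, scores) of the top_k candidates most similar to query.

    query:      tensor of shape (dim,)
    candidates: tensor of shape (num_candidates, dim)
    """
    # L2-normalize both sides so the dot product equals cosine similarity.
    query = F.normalize(query, dim=-1)
    candidates = F.normalize(candidates, dim=-1)
    scores = candidates @ query  # shape: (num_candidates,)
    # Never ask for more results than there are candidates.
    top_k = min(top_k, scores.numel())
    top = torch.topk(scores, k=top_k)
    return top.indices, top.values


# Toy stand-ins (assumption): one row per track embedding.
track_embeddings = torch.tensor(
    [
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.9, 0.1, 0.0],
    ]
)
# Toy stand-in (assumption): embedding of a single text query.
query_embedding = torch.tensor([1.0, 0.0, 0.0])

indices, scores = rank_by_cosine(query_embedding, track_embeddings, top_k=2)
print(indices.tolist())
# [0, 2]
```

Swapping the roles of the two inputs (a music embedding as the query, text embeddings as the candidates) gives music-to-text retrieval with the same helper.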

## Model architecture

The model uses an [Audio Spectrogram Transformer (AST)](https://arxiv.org/abs/2104.01778) as the music encoder and GLuCoSE as the text encoder.
The music encoder was initialized from the [official AST checkpoint pretrained on AudioSet](https://github.com/YuanGongND/ast).
The text encoder was initialized from [pkshatech/GLuCoSE-base-ja](https://huggingface.co/pkshatech/GLuCoSE-base-ja).

## Licenses

[The Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Citation

```
@misc{mulan-japanese-base,
    title = {Japanese MuLan Base},
    author = {Takuya Hasumi and Yusuke Fujita},
    url = {https://huggingface.co/line-corporation/japanese-mulan-base},
}
```