DejanX13/Javne_Nabavke_embedding_1000

This is a sentence-transformers model fine-tuned specifically for Serbian public procurement documents ("Javne Nabavke"). It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for tasks such as clustering, semantic search, and document retrieval in the context of Serbian public procurement.

Model Description

This model has been fine-tuned on a dataset of 1000 Serbian public procurement documents to improve semantic understanding and retrieval performance for:

  • Public procurement document analysis
  • Tender document similarity matching
  • Legal document search and retrieval
  • Procurement process automation
  • Serbian legal text understanding

The model is based on a multilingual transformer architecture and has been optimized for both Serbian and English text in the public procurement domain.

Usage (Sentence-Transformers)

Using this model is straightforward once you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer

# Example Serbian public procurement texts
sentences = [
    "Javni poziv za nabavku računarske opreme",
    "Tender za izgradnju javnih objekata",
    "Specifikacija tehničkih zahteva za softver"
]

model = SentenceTransformer('DejanX13/Javne_Nabavke_embedding_1000')
embeddings = model.encode(sentences)
print(embeddings)
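
Since the model ends with a Normalize() module (see the Model Architecture section below), cosine similarity between embeddings can be computed directly. The following sketch ranks two of the example documents against a Serbian query; the query string ("procurement of laptop computers for the administration's needs") is an illustrative placeholder, not taken from the training data:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('DejanX13/Javne_Nabavke_embedding_1000')

# Illustrative query (hypothetical, not from the training data)
query = "Nabavka laptop računara za potrebe uprave"
documents = [
    "Javni poziv za nabavku računarske opreme",
    "Tender za izgradnju javnih objekata"
]

query_embedding = model.encode(query)
doc_embeddings = model.encode(documents)

# Cosine similarity between the query and each document
scores = util.cos_sim(query_embedding, doc_embeddings)
print(scores)  # the computer-equipment notice is expected to score highest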

Usage (LlamaIndex)

You can also use this model with LlamaIndex for document retrieval:

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embedding_model = HuggingFaceEmbedding(
    model_name="DejanX13/Javne_Nabavke_embedding_1000",
    embed_batch_size=16
)

# Use with VectorStoreIndex for document retrieval
from llama_index.core import VectorStoreIndex, Document

documents = [Document(text="Your procurement document text here")]
index = VectorStoreIndex.from_documents(documents, embed_model=embedding_model)
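
To actually query the index, you can create a retriever from it. Below is a minimal retrieval-only sketch (no LLM required), assuming the index built above and an illustrative Serbian query:

# Rank the indexed documents against a query ("procurement of computer equipment")
retriever = index.as_retriever(similarity_top_k=3)
results = retriever.retrieve("nabavka računarske opreme")

for result in results:
    print(result.score, result.node.get_content()[:80])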

Performance

This model has been evaluated on Serbian public procurement document retrieval tasks, where it shows a significant improvement over general-purpose multilingual models.

Training Details

The model was fine-tuned with the following parameters (a reproduction sketch follows at the end of this section):

Base Model: multilingual-e5-large

Training Dataset: 1000 Serbian public procurement documents with query-document pairs

Training Parameters:

  • Epochs: 2
  • Batch Size: 5
  • Learning Rate: 2e-05
  • Loss Function: MultipleNegativesRankingLoss
  • Evaluation Steps: 50
  • Warmup Steps: 94
  • Weight Decay: 0.01
  • Max Gradient Norm: 1
  • Optimizer: AdamW

DataLoader: torch.utils.data.dataloader.DataLoader of length 470 with parameters:

{'batch_size': 5, 'sampler': 'torch.utils.data.sampler.SequentialSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss Function: sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss with parameters:

{'scale': 20.0, 'similarity_fct': 'cos_sim'}
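
For reference, a run with these settings could be reproduced roughly as follows using the sentence-transformers fit API. This is a sketch under stated assumptions: the base checkpoint is assumed to be intfloat/multilingual-e5-large, and the query-document pairs are placeholders, since the training data itself is not published here.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumed base checkpoint for multilingual-e5-large
model = SentenceTransformer('intfloat/multilingual-e5-large')

# Placeholder query-document pairs; the real dataset contains 1000 procurement documents
train_examples = [
    InputExample(texts=["Javni poziv za nabavku računarske opreme",
                        "Tekst konkursne dokumentacije za nabavku računara"]),
    # ...
]

# SequentialSampler as in the DataLoader description above, i.e. shuffle=False
train_dataloader = DataLoader(train_examples, shuffle=False, batch_size=5)
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=2,
    warmup_steps=94,
    evaluation_steps=50,   # takes effect only if an evaluator is passed
    optimizer_params={'lr': 2e-5},
    weight_decay=0.01,
    max_grad_norm=1.0,
)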

Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
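
Because the final Normalize() module produces unit-length vectors, dot product and cosine similarity are equivalent for this model, and inputs longer than 512 tokens are truncated. A quick check:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('DejanX13/Javne_Nabavke_embedding_1000')
emb = model.encode(["Javni poziv za nabavku računarske opreme"])

print(emb.shape)               # (1, 1024)
print(np.linalg.norm(emb[0]))  # ~1.0, thanks to the Normalize() module
print(model.max_seq_length)    # 512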

Use Cases

This model is particularly useful for:

  1. Document Retrieval: Finding relevant procurement documents based on queries
  2. Tender Matching: Matching suppliers with relevant tender opportunities (see the sketch after this list)
  3. Legal Document Analysis: Understanding legal requirements in procurement documents
  4. Compliance Checking: Identifying similar regulatory requirements across documents
  5. Procurement Automation: Building AI systems for procurement process automation
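
For example, tender matching can be framed as a semantic search of a supplier's capability description against a corpus of tender titles. The corpus and query below are hypothetical placeholders:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('DejanX13/Javne_Nabavke_embedding_1000')

# Hypothetical corpus of tender titles (placeholders, not real data)
tenders = [
    "Javna nabavka usluga održavanja softvera",
    "Nabavka kancelarijskog materijala",
    "Izgradnja i rekonstrukcija lokalnih puteva"
]
tender_embeddings = model.encode(tenders, convert_to_tensor=True)

# A supplier's capability description ("development and maintenance of information systems")
query_embedding = model.encode("Razvoj i održavanje informacionih sistema", convert_to_tensor=True)

hits = util.semantic_search(query_embedding, tender_embeddings, top_k=2)[0]
for hit in hits:
    print(tenders[hit['corpus_id']], hit['score'])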

Languages

  • Primary: Serbian (sr)
  • Secondary: English (en)
  • Optimized for: Serbian public procurement terminology and legal language

Limitations

  • Optimized specifically for the Serbian public procurement domain
  • May not perform optimally on general-purpose text outside this domain
  • Performance may vary on other Serbian text domains not related to public procurement

Citation

If you use this model in your research or applications, please cite:

@misc{javne_nabavke_embedding_1000,
  author = {DejanX13},
  title = {Javne_Nabavke_embedding_1000: Fine-tuned Embeddings for Serbian Public Procurement},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/DejanX13/Javne_Nabavke_embedding_1000}
}

Contact

For questions or issues related to this model, please open an issue in the model repository or contact the author through Hugging Face.
