PatentSim: Semantic Similarity Model for U.S. Patent Phrase Matching

Model Name: PatentSim Word Semantic Similarity
Author: Michael Posso
Source: Trained for the Kaggle "U.S. Patent Phrase to Phrase Matching" competition
License: Apache 2.0


Model Description

PatentSim is a lightweight, hybrid machine learning pipeline designed to evaluate the semantic similarity between pairs of phrases in the context of patent literature. The model combines a pre-trained transformer-based sentence encoder with a ridge regression model trained on cosine similarity scores. It was developed as part of the Kaggle competition hosted by the U.S. Patent and Trademark Office (USPTO).

This model addresses the critical need for semantic equivalence detection in legal and technical language, supporting tasks such as patent search, prior art retrieval, and claim comparison.


Intended Use

  • Patent Prior Art Search
  • Legal Text Similarity
  • Technical Paraphrase Detection
  • Domain-aware Semantic Matching

Model Architecture

  • Sentence Embeddings: all-mpnet-base-v2
  • Similarity Function: Cosine Similarity between sentence embeddings
  • Regressor: Scikit-learn Ridge Regression (alpha=1.0), trained to map cosine similarity to semantic relatedness scores

Dataset

  • Source: Kaggle – U.S. Patent Phrase to Phrase Matching
  • Size: 45,000+ anchor-target phrase pairs with similarity scores from 0.0 to 1.0
  • Domains: Technical, scientific, commercial IP content
  • Features: CPC classification codes to provide domain context
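For illustration, a minimal sketch of inspecting the data, assuming the competition's train.csv layout with anchor, target, context (CPC code), and score columns:

import pandas as pd

# Load the competition training file (assumed filename and column names)
df = pd.read_csv("train.csv")

# anchor/target: the phrase pair, context: CPC classification code, score: 0.0-1.0 label
print(df[["anchor", "target", "context", "score"]].head())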

Training Procedure

  1. Sentence embeddings for anchor and target phrases were generated using sentence-transformers/all-mpnet-base-v2.
  2. Each phrase embedding was averaged with the embedding of its lowercased version to normalize for casing.
  3. The cosine similarity between anchor and target embeddings served as the single input feature, as shown in the sketch below.
  4. Ridge regression was trained to predict human-labeled semantic similarity scores.
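A minimal training sketch of steps 1-4, under stated assumptions (a train.csv with anchor, target, and score columns; Ridge alpha=1.0 as described above); this is not the exact competition notebook:

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge
import joblib
import pandas as pd
import torch

encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
df = pd.read_csv("train.csv")  # assumed columns: anchor, target, score

def embed(phrases):
    # Step 2: average the embeddings of the original and lowercased phrase
    original = encoder.encode(list(phrases), convert_to_tensor=True)
    lowered = encoder.encode([p.lower() for p in phrases], convert_to_tensor=True)
    return (original + lowered) / 2

anchor_emb = embed(df["anchor"])
target_emb = embed(df["target"])

# Step 3: per-pair cosine similarity is the single feature
cosine_sim = torch.nn.functional.cosine_similarity(anchor_emb, target_emb, dim=1)
X = cosine_sim.cpu().numpy().reshape(-1, 1)
y = df["score"].values

# Step 4: fit ridge regression to the human-labeled scores
reg = Ridge(alpha=1.0).fit(X, y)
joblib.dump(reg, "ridge_model.joblib")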

Usage

from sentence_transformers import SentenceTransformer, util
import joblib
import numpy as np

# Load the sentence embedding model
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Load the trained ridge regression model
reg = joblib.load("ridge_model.joblib")

def predict_similarity(anchor, target):
    # Encode both phrases and compute their cosine similarity
    a_emb = model.encode(anchor, convert_to_tensor=True)
    t_emb = model.encode(target, convert_to_tensor=True)
    cosine_sim = util.cos_sim(a_emb, t_emb).item()
    # Map the cosine similarity to the 0.0-1.0 semantic similarity score
    score = reg.predict(np.array([[cosine_sim]]))[0]
    return float(score)
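For example, scoring a hypothetical phrase pair:

similarity = predict_similarity("abatement of pollution", "pollution reduction")
print(f"Predicted semantic similarity: {similarity:.3f}")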