PatentSim: Semantic Similarity Model for U.S. Patent Phrase Matching

Model Name: PatentSim Word Semantic Similarity
Author: Michael Posso
Source: Trained for the Kaggle "U.S. Patent Phrase to Phrase Matching" competition
License: Apache 2.0


Model Description

PatentSim is a lightweight, hybrid machine learning pipeline designed to evaluate the semantic similarity between pairs of phrases in the context of patent literature. The model combines a pre-trained transformer-based sentence encoder with a ridge regression model trained on cosine similarity scores. It was developed as part of the Kaggle competition hosted by the U.S. Patent and Trademark Office (USPTO).

This model addresses the critical need for semantic equivalence detection in legal and technical language, supporting tasks such as patent search, prior art retrieval, and claim comparison.


Intended Use

  • Patent Prior Art Search
  • Legal Text Similarity
  • Technical Paraphrase Detection
  • Domain-aware Semantic Matching

Model Architecture

  • Sentence Embeddings: all-mpnet-base-v2
  • Similarity Function: Cosine Similarity between sentence embeddings
  • Regressor: Scikit-learn Ridge Regression (alpha=1.0), trained to map cosine similarity to semantic relatedness scores

Dataset

  • Source: Kaggle – U.S. Patent Phrase to Phrase Matching
  • Size: 45,000+ anchor-target phrase pairs with similarity scores from 0.0 to 1.0
  • Domains: Technical, scientific, commercial IP content
  • Features: CPC classification codes to provide domain context
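For illustration, a minimal sketch of inspecting the data, assuming the competition's train.csv layout with anchor, target, context (CPC code), and score columns:

import pandas as pd

# Load the competition training file (assumed filename and column names)
df = pd.read_csv("train.csv")

# anchor/target: the phrase pair, context: CPC classification code, score: 0.0-1.0 label
print(df[["anchor", "target", "context", "score"]].head())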

Training Procedure

  1. Sentence embeddings for anchor and target phrases were generated using sentence-transformers/all-mpnet-base-v2.
  2. Each phrase embedding was averaged with the embedding of its lowercased version to normalize for casing.
  3. The cosine similarity between anchor and target embeddings served as the single input feature, as shown in the sketch below.
  4. Ridge regression was trained to predict human-labeled semantic similarity scores.
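A minimal training sketch of steps 1-4, under stated assumptions (a train.csv with anchor, target, and score columns; Ridge alpha=1.0 as described above); this is not the exact competition notebook:

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge
import joblib
import pandas as pd
import torch

encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
df = pd.read_csv("train.csv")  # assumed columns: anchor, target, score

def embed(phrases):
    # Step 2: average the embeddings of the original and lowercased phrase
    original = encoder.encode(list(phrases), convert_to_tensor=True)
    lowered = encoder.encode([p.lower() for p in phrases], convert_to_tensor=True)
    return (original + lowered) / 2

anchor_emb = embed(df["anchor"])
target_emb = embed(df["target"])

# Step 3: per-pair cosine similarity is the single feature
cosine_sim = torch.nn.functional.cosine_similarity(anchor_emb, target_emb, dim=1)
X = cosine_sim.cpu().numpy().reshape(-1, 1)
y = df["score"].values

# Step 4: fit ridge regression to the human-labeled scores
reg = Ridge(alpha=1.0).fit(X, y)
joblib.dump(reg, "ridge_model.joblib")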

Usage

from sentence_transformers import SentenceTransformer, util
import joblib
import numpy as np

# Load the sentence embedding model
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Load the trained ridge regression model
reg = joblib.load("ridge_model.joblib")

def predict_similarity(anchor, target):
    # Encode both phrases and compute their cosine similarity
    a_emb = model.encode(anchor, convert_to_tensor=True)
    t_emb = model.encode(target, convert_to_tensor=True)
    cosine_sim = util.cos_sim(a_emb, t_emb).item()
    # Map the cosine similarity to the 0.0-1.0 semantic similarity score
    score = reg.predict(np.array([[cosine_sim]]))[0]
    return float(score)
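For example, scoring a hypothetical phrase pair:

similarity = predict_similarity("abatement of pollution", "pollution reduction")
print(f"Predicted semantic similarity: {similarity:.3f}")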