MedCPT Hit 5 Million Downloads — Here's How to Use It in Your Medical RAG Pipeline

By Prahlad Menon

NIH’s MedCPT just crossed 5 million downloads on Hugging Face. That’s a meaningful milestone — and if you’re building anything that touches medical text, it’s worth understanding why this model exists, how it works, and when you should use it instead of a general-purpose embedding model.

Why General Embeddings Fall Short in Medicine

If you’ve tried plugging text-embedding-3-small or a general BGE model into a medical RAG pipeline, you’ve probably noticed the cracks:

  • “MI” is ambiguous (myocardial infarction vs motivational interviewing), and general models rarely pull the two senses apart from context
  • “Positive” means something very different in oncology vs psychiatry
  • Clinical shorthand (SOB, c/o, prn) doesn’t exist in general training corpora
  • PubMed abstracts have their own syntactic patterns that general models never saw

The core problem: general embedding models are trained on web text. Medical language has its own vocabulary, syntax, and semantics. A query like “ACE inhibitor contraindications in bilateral renal artery stenosis” requires a model that has seen thousands of papers discussing exactly that trade-off.

What MedCPT Is

MedCPT (Medical Contrastive Pre-trained Transformers) is a family of three models from NIH/NLM, trained on a dataset of unusual scale and provenance: 255 million real user query-article pairs from PubMed search logs.

That last part is key. It’s not synthetic data or document pairs — it’s 255 million times a real clinician, researcher, or student typed a query into PubMed and clicked an article. That’s a behavioral signal that captures what relevance actually means in biomedical search.
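To make that training signal concrete, here's a toy sketch (not MedCPT's actual training code) of the kind of contrastive objective such query-article pairs typically feed: an InfoNCE-style loss with in-batch negatives, where each query should score its clicked article above every other article in the batch.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_embs: torch.Tensor, article_embs: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """In-batch-negatives contrastive loss: row i of query_embs should
    score highest against row i of article_embs (the clicked article)
    and lower against every other article in the batch."""
    # (B, B) similarity matrix of every query against every article
    logits = query_embs @ article_embs.T / temperature
    # The "correct" article for query i is article i
    targets = torch.arange(query_embs.size(0))
    return F.cross_entropy(logits, targets)

# Toy batch: 4 queries, 4 clicked articles, 768-dim random embeddings
q = F.normalize(torch.randn(4, 768), dim=-1)
a = F.normalize(torch.randn(4, 768), dim=-1)
print(info_nce_loss(q, a).item())
```

Click data makes this loss powerful: the positives are not synthetic paraphrases but articles real users judged relevant, so the learned space encodes behavioral relevance.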

The family has three components:

| Model | Role | Max tokens | Use for |
| --- | --- | --- | --- |
| ncbi/MedCPT-Query-Encoder | Query embedding | 64 | Short queries, questions, clinical notes |
| ncbi/MedCPT-Article-Encoder | Document embedding | 512 | PubMed abstracts, clinical docs, guidelines |
| ncbi/MedCPT-Cross-Encoder | Re-ranking | 512 | Scoring query-doc pairs after retrieval |

The query and article encoders share the same embedding space — dot product similarity works across them directly.
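A toy illustration of what the shared space buys you, with 3-dimensional stand-in vectors in place of the real 768-dimensional encoder outputs:

```python
import torch

# Stand-ins for real encoder outputs: 1 query vector, 3 article vectors
query_emb = torch.tensor([[1.0, 0.0, 0.0]])        # (1, 3)
doc_embs = torch.tensor([[0.9, 0.1, 0.0],          # closest to the query
                         [0.0, 1.0, 0.0],
                         [0.1, 0.0, 0.9]])         # (3, 3)

# Dot-product scores across the shared space: (1, 3)
scores = query_emb @ doc_embs.T
best = scores.argmax(dim=1).item()
print(scores, best)  # doc 0 wins
```

Because both encoders were trained against the same objective, no projection or alignment step is needed between query vectors and article vectors.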

Setting Up MedCPT

pip install transformers torch

No API keys. No accounts. Runs locally.

Step 1: Embed Your Documents

import torch
from transformers import AutoTokenizer, AutoModel

# Load article encoder once (reuse across documents)
article_model = AutoModel.from_pretrained("ncbi/MedCPT-Article-Encoder")
article_tokenizer = AutoTokenizer.from_pretrained("ncbi/MedCPT-Article-Encoder")
article_model.eval()

def embed_articles(articles: list[list[str]]) -> torch.Tensor:
    """
    articles: list of [title, abstract] pairs
    returns: (N, 768) tensor of embeddings
    """
    with torch.no_grad():
        encoded = article_tokenizer(
            articles,
            truncation=True,
            padding=True,
            return_tensors="pt",
            max_length=512,
        )
        # [CLS] token embedding serves as the document representation
        return article_model(**encoded).last_hidden_state[:, 0, :]

# Example: embed a small corpus
corpus = [
    [
        "Metformin as first-line therapy for type 2 diabetes",
        "Metformin remains the recommended first-line pharmacological therapy for type 2 diabetes due to its efficacy, safety profile, low cost, and potential cardiovascular benefits...",
    ],
    [
        "SGLT2 inhibitors in heart failure with reduced ejection fraction",
        "Sodium-glucose cotransporter-2 (SGLT2) inhibitors have demonstrated significant reductions in cardiovascular death and hospitalization in patients with HFrEF...",
    ],
    [
        "GLP-1 receptor agonists and weight management in obesity",
        "GLP-1 receptor agonists (GLP-1 RAs) reduce body weight through multiple mechanisms including delayed gastric emptying, increased satiety, and reduced food intake...",
    ],
]

doc_embeddings = embed_articles(corpus)
print(f"Corpus embedded: {doc_embeddings.shape}")  # (3, 768)
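One caveat: the snippet above embeds the whole corpus in a single forward pass, which is fine for a handful of documents but will exhaust memory on thousands. A small batching wrapper (a hypothetical helper, shown here with a stub encoder standing in for the real model) keeps memory bounded:

```python
import torch

def embed_in_batches(articles, embed_fn, batch_size: int = 32) -> torch.Tensor:
    """Embed a large corpus in fixed-size chunks and concatenate.
    embed_fn is any function mapping a list of items to an (n, d) tensor,
    e.g. an embed_articles function wrapping the article encoder."""
    chunks = []
    for start in range(0, len(articles), batch_size):
        chunks.append(embed_fn(articles[start:start + batch_size]))
    return torch.cat(chunks, dim=0)

# Demo with a stub encoder (swap in the real encoder in practice)
stub = lambda batch: torch.zeros(len(batch), 768)
out = embed_in_batches([["title", "abstract"]] * 100, stub, batch_size=32)
print(out.shape)  # torch.Size([100, 768])
```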

Step 2: Embed Your Query

query_model = AutoModel.from_pretrained("ncbi/MedCPT-Query-Encoder")
query_tokenizer = AutoTokenizer.from_pretrained("ncbi/MedCPT-Query-Encoder")
query_model.eval()

def embed_query(query: str) -> torch.Tensor:
    """Returns (768,) embedding for a single query."""
    with torch.no_grad():
        encoded = query_tokenizer(
            [query],
            truncation=True,
            padding=True,
            return_tensors="pt",
            max_length=64,
        )
        return query_model(**encoded).last_hidden_state[:, 0, :].squeeze()

query_emb = embed_query(
    "What is the best first-line treatment for a newly diagnosed type 2 diabetic patient?"
)
print(f"Query embedded: {query_emb.shape}")  # (768,)

Step 3: Retrieve

import torch.nn.functional as F

def retrieve(query_emb, doc_embeddings, corpus, top_k=3):
    # MedCPT was trained with dot-product similarity; cosine similarity
    # also ranks well here and keeps scores in a readable [-1, 1] range
    scores = F.cosine_similarity(query_emb.unsqueeze(0), doc_embeddings)
    top_indices = scores.argsort(descending=True)[:top_k]
    return [(corpus[i][0], scores[i].item()) for i in top_indices]

results = retrieve(query_emb, doc_embeddings, corpus)
for title, score in results:
    print(f"  [{score:.3f}] {title}")

Output:

  [0.847] Metformin as first-line therapy for type 2 diabetes
  [0.612] GLP-1 receptor agonists and weight management in obesity
  [0.489] SGLT2 inhibitors in heart failure with reduced ejection fraction

Step 4: Re-Rank with the Cross-Encoder

The cross-encoder is slower but more accurate; use it to re-rank your top-K results:

from transformers import AutoModelForSequenceClassification

cross_tokenizer = AutoTokenizer.from_pretrained("ncbi/MedCPT-Cross-Encoder")
cross_model = AutoModelForSequenceClassification.from_pretrained("ncbi/MedCPT-Cross-Encoder")
cross_model.eval()

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[tuple]:
    pairs = [[query, doc] for doc in candidates]
    with torch.no_grad():
        encoded = cross_tokenizer(
            pairs, truncation=True, padding=True,
            return_tensors="pt", max_length=512,
        )
        scores = cross_model(**encoded).logits.squeeze(dim=1)
    ranked = sorted(zip(candidates, scores.tolist()), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

# In a real pipeline you'd pass only the top retrieved docs to the
# cross-encoder; with a three-document corpus we just re-rank everything
candidate_texts = [f"{c[0]}. {c[1]}" for c in corpus]
reranked = rerank(
    "best first-line treatment for type 2 diabetes",
    candidate_texts,
)
for doc, score in reranked:
    print(f"  [{score:.3f}] {doc[:80]}...")

Full RAG Pipeline in ~50 Lines

Here’s the complete pattern — drop in your own document collection and LLM:

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

# --- Models (load once) ---
q_tok = AutoTokenizer.from_pretrained("ncbi/MedCPT-Query-Encoder")
q_mod = AutoModel.from_pretrained("ncbi/MedCPT-Query-Encoder").eval()

a_tok = AutoTokenizer.from_pretrained("ncbi/MedCPT-Article-Encoder")
a_mod = AutoModel.from_pretrained("ncbi/MedCPT-Article-Encoder").eval()

x_tok = AutoTokenizer.from_pretrained("ncbi/MedCPT-Cross-Encoder")
x_mod = AutoModelForSequenceClassification.from_pretrained("ncbi/MedCPT-Cross-Encoder").eval()

def encode(model, tokenizer, texts, max_len):
    with torch.no_grad():
        enc = tokenizer(texts, truncation=True, padding=True,
                        return_tensors="pt", max_length=max_len)
        return model(**enc).last_hidden_state[:, 0, :]

def medcpt_rag(query: str, docs: list[str], top_k: int = 3) -> list[str]:
    # 1. Embed (the article encoder expects [title, abstract] pairs, so
    #    each doc is passed as a title with an empty abstract)
    q_emb = encode(q_mod, q_tok, [query], 64)
    d_emb = encode(a_mod, a_tok, [[d, ""] for d in docs], 512)

    # 2. Dense retrieval (top 10)
    scores = F.cosine_similarity(q_emb, d_emb)
    top10 = scores.argsort(descending=True)[:10].tolist()
    candidates = [docs[i] for i in top10]

    # 3. Cross-encoder re-rank (top k)
    pairs = [[query, c] for c in candidates]
    with torch.no_grad():
        enc = x_tok(pairs, truncation=True, padding=True,
                    return_tensors="pt", max_length=512)
        rerank_scores = x_mod(**enc).logits.squeeze(dim=1)

    ranked = sorted(zip(candidates, rerank_scores.tolist()),
                    key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# --- Use it ---
context_docs = medcpt_rag(
    query="contraindications for ACE inhibitors",
    docs=your_document_collection,
)

# Feed to any LLM
prompt = f"""You are a clinical AI assistant. Use only the following sources:

{chr(10).join(f'[{i+1}] {d}' for i, d in enumerate(context_docs))}

Question: What are the contraindications for ACE inhibitors?
Answer:"""

When to Use MedCPT vs General Embeddings

| Use case | MedCPT | General (text-embedding-3, BGE) |
| --- | --- | --- |
| PubMed / clinical literature search | ✅ Better | ❌ Misses medical semantics |
| EHR / clinical notes retrieval | ✅ Better | ⚠️ Mediocre |
| Drug/disease ontology matching | ✅ Better | ❌ Struggles with synonyms |
| General FAQ / product docs | ⚠️ Overkill | ✅ Fine |
| Multilingual content | ❌ English only | ✅ Better |
| Very short texts (<10 words) | ⚠️ OK | ✅ Fine |

The rule of thumb: if your documents would appear on PubMed or in a clinical system, use MedCPT.

Pairing with a Vector DB

MedCPT outputs 768-dimensional vectors. Drop them straight into any vector store:

# Qdrant example
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(":memory:")  # or your Qdrant URL
client.create_collection(
    collection_name="pubmed",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# Index your corpus
points = [
    PointStruct(id=i, vector=emb.tolist(), payload={"title": c[0], "abstract": c[1]})
    for i, (c, emb) in enumerate(zip(corpus, doc_embeddings))
]
client.upsert(collection_name="pubmed", points=points)

# Search (newer qdrant-client versions prefer client.query_points)
hits = client.search(
    collection_name="pubmed",
    query_vector=query_emb.tolist(),
    limit=5,
)

Works identically with Chroma, Weaviate, Pinecone, pgvector — it’s just a 768-dim vector.
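For exact search, all of those stores are doing the same underlying operation: comparing your query vector against every indexed vector. A brute-force equivalent in plain torch is handy for testing before wiring up a real store (approximate indexes like HNSW only change how candidates are found, not the similarity itself):

```python
import torch
import torch.nn.functional as F

def brute_force_search(query_emb: torch.Tensor,
                       index: torch.Tensor, limit: int = 5):
    """Exact cosine-similarity search over an (N, d) index, i.e. what a
    vector DB does before approximate indexing enters the picture."""
    scores = F.cosine_similarity(query_emb.unsqueeze(0), index)
    k = min(limit, index.size(0))
    top = scores.topk(k)  # sorted descending
    return list(zip(top.indices.tolist(), top.values.tolist()))

torch.manual_seed(0)
index = F.normalize(torch.randn(10, 768), dim=-1)
query = index[3] + 0.01 * torch.randn(768)  # near-duplicate of doc 3
hits = brute_force_search(query, index, limit=3)
print(hits[0][0])  # doc 3 comes back first
```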

The 5M Number in Context

Five million downloads is more than a vanity metric. It tells you:

  1. The model is production-tested: most of the rough edges have already been hit and reported by someone else
  2. LitSense 2.0 uses it in production at NIH, serving millions of PubMed searches
  3. The community has done the integration work — there are examples for LangChain, LlamaIndex, Haystack, custom pipelines
  4. It’s not going away — NIH is committed to it, and the HuggingFace repo is actively maintained

For healthcare AI builders, that stability matters. General-purpose embedding models get deprecated, fine-tuned, versioned, and priced. MedCPT is open-weight, free, and institutionally backed.

Wrapping Up

If you’re building a medical AI system and you haven’t looked at MedCPT yet, now’s the time. Five million downloads suggest plenty of builders already have.