RAG Context Stuck at 512 for Too Long: The 32K Context Era for Embedding Models Begins with Granite R2

If your RAG system needs to handle multilingual content—Chinese contracts, Japanese technical manuals, German patent documents—you'll discover an awkward truth: the embedding models used for retrieval often have a context window of only 512 tokens. Meanwhile, LLMs themselves have long been capable of handling 128K or even 200K tokens.

This means that if you embed a long document without chunking it, the model may see only the first page or so, because input beyond the model's limit is typically truncated before encoding. The truncated content isn't just "ignored"; it simply doesn't exist within the vector space. The upper limit of your retrieval recall is fixed the moment you choose your embedding model.

On May 14th, IBM open-sourced the Granite Embedding Multilingual R2 series, pushing the context window of its multilingual embedding models from R1's 512 tokens to 32K, all while maintaining an Apache 2.0 license and a small parameter scale. This isn't a simple version iteration: it addresses a long-ignored engineering bottleneck with far-reaching consequences.

Why the Context Window of Embedding Models Matters

The classic RAG workflow is: chunk documents → encode each chunk into a vector → retrieve the most similar chunks upon a query. There's an implicit assumption here: that the content of each chunk is self-contained enough to independently express a complete unit of information.
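The retrieval step of this workflow reduces to a nearest-neighbor search over chunk vectors. Here is a minimal sketch, assuming the vectors have already been produced by some embedding model. The helper names `cosine` and `top_k` and the toy 3-dimensional vectors are illustrative assumptions, not part of any library:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=3):
    """Return (chunk_index, score) pairs for the k most similar chunks."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors standing in for real chunk embeddings (hypothetical values).
chunk_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query_vec = [0.9, 0.1, 0.0]
ranked = top_k(query_vec, chunk_vecs, k=2)
print(ranked)
```

A production system would hand this search to a vector index, but the ranking logic is the same: whatever a chunk vector fails to capture cannot be retrieved.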

When your embedding model can only see 512 tokens (roughly 400 English words), the chunk size is artificially constrained. For short texts such as news summaries, product descriptions, and FAQ entries, this is sufficient. But for inherently long-form content like technical documentation, legal contracts, and academic papers, 512 tokens force you to choose between chopping the text too finely and exceeding the window, in which case the overflow is cut off.

The problem with chopping too finely is more insidious: once a logical paragraph is split in half, the first half loses its conclusion and the second half loses its premise. Neither vector is sufficient to match a user's query. Classic RAG tutorials teach you to "set an appropriate chunk size and overlap," but rarely does anyone tell you that even if you maximize the overlap, the true ceiling is the embedding model's own context window.
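The interaction between chunk size, overlap, and the model's window can be sketched as a sliding window over a token list. This is a minimal illustration, assuming the text has already been tokenized. The function name `chunk_tokens` and the integer stand-in tokens are hypothetical:

```python
def chunk_tokens(tokens, window=512, overlap=64):
    """Split a token list into windows of at most `window` tokens,
    where consecutive windows share `overlap` tokens."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

# Integers stand in for token ids of a 1,200-token document.
tokens = list(range(1200))
chunks = chunk_tokens(tokens, window=512, overlap=64)
print([len(c) for c in chunks])
```

Raising `overlap` only makes neighboring chunks share more tokens. No chunk ever grows beyond `window`, which is why the model's context length is the real ceiling.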

Granite Embedding R2 directly raises this ceiling to 32K tokens—64 times the 512 tokens of R1. This means a 30-page PDF technical manual can be encoded without any chunking at all. More importantly, it doesn't sacrifice retrieval accuracy for length: in the LongEmbed benchmark, R2 improved by over 30 points compared to R1, leaping directly from a score of 34.3 to 65.6 (for the 97M model). This performance leap comes almost entirely from the context window expansion—previous models simply couldn't see the full text of long documents.
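To make the arithmetic concrete, the sketch below counts how many overlapping chunks a document needs under each window size. The 20,000-token document length and the 64-token overlap are assumptions chosen for illustration, and `chunks_needed` is a hypothetical helper:

```python
import math

def chunks_needed(n_tokens, window, overlap=0):
    """Number of sliding windows required to cover n_tokens."""
    if n_tokens <= 0:
        return 0
    if n_tokens <= window:
        return 1
    step = window - overlap
    # One full window, then one more window per `step` of remaining tokens.
    return 1 + math.ceil((n_tokens - window) / step)

doc_tokens = 20_000  # assumed length of a long technical manual
small = chunks_needed(doc_tokens, window=512, overlap=64)
large = chunks_needed(doc_tokens, window=32_768, overlap=64)
ratio = 32_768 // 512
print(small, large, ratio)
```

Under these assumptions the 512-token window splits the document into dozens of fragments, while the 32K window encodes it in a single pass.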

A Two-Pronged Approach: The Compact 97M and Full-Size 311M

Granite Embedding R2 launched two models, both based on the ModernBERT encoder architecture. ModernBERT is a modernization of BERT that emerged last year: it uses rotary position embeddings instead of absolute position embeddings (which makes extending the context length easier), alternates local and global attention layers to reduce computation for long sequences, and integrates Flash Attention 2.0 to accelerate GPU encoding.

The key choice lies in the parameter scale:

The 97M model (granite-embedding-97m-multilingual-r2) has an output dimension of 384 and only 97M parameters, yet it achieves a score of 60.3 on multilingual MTEB retrieval, surpassing multilingual-e5-base (278M parameters, 52.7 score) and gte-multilingual-base (305M parameters, 57.2 score). That is a model roughly a third of the size with a higher score. For production environments, this means you can process more documents with the same hardware budget, or deploy the embedding service on smaller instances.

The 311M model (granite-embedding-311m-multilingual-r2) has an output dimension of 768 and supports Matryoshka Embedding, meaning you can truncate dimensions to 512, 384, 256, or even 128 with almost no loss in retrieval quality. It scores 65.2 on MTEB multilingual retrieval, ranking second among open-source models with under 500M parameters.

In terms of encoding speed, the 97M model can encode over 2,500 documents per second on a single H100, while the 311M does about 1,800 docs/sec—over 5.5 times faster than the jina-embeddings-v5-text-nano at a comparable retrieval quality.
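These throughput figures translate directly into indexing time. Below is a rough back-of-the-envelope sketch, assuming a hypothetical corpus of 10 million documents and that the quoted docs/sec rates hold for your document lengths and batch sizes, which they may not for long inputs:

```python
def indexing_hours(n_docs, docs_per_sec):
    """Wall-clock hours to encode n_docs at a steady throughput."""
    return n_docs / docs_per_sec / 3600

corpus_size = 10_000_000  # hypothetical corpus size
hours_97m = indexing_hours(corpus_size, 2500)   # 97M model rate quoted above
hours_311m = indexing_hours(corpus_size, 1800)  # 311M model rate quoted above
print(f"97M: {hours_97m:.2f} h, 311M: {hours_311m:.2f} h")
```

Measure throughput on a sample of your own corpus before planning capacity, since long documents will encode more slowly than short passages.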

The True Boundaries of Multilingual Support

"Supporting 200+ languages" is no longer a novelty in embedding model marketing, but the real quality gaps lie in the details. Granite Embedding R2 performed explicit retrieval pair training and cross-lingual training for 52 languages, meaning the model can truly distinguish between "relevant" and "irrelevant" documents in these languages, rather than just pulling text vectors of the same language closer.

From a practical engineering perspective, the most critical point is tokenizer coverage. Many embedding models claim to support a language, but their tokenizer efficiency for that language is extremely low—a single paragraph of Thai might consume half of the context window. The R2 97M model uses a pruned 180K-token vocabulary (pruned from 262K), significantly reducing the embedding table size while preserving multilingual coverage. The 311M model directly adopts the Gemma 3 tokenizer with its 262K-token vocabulary.
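Tokenizer efficiency is easy to measure before committing to a model. The sketch below computes tokens per character and estimates how many characters of similar text fit in one window. The two stand-in tokenizers are assumptions for illustration: the byte-level one mimics a worst case where a script is poorly covered by the vocabulary.

```python
def tokens_per_char(text, tokenize):
    """Average number of tokens the tokenizer emits per input character."""
    if not text:
        return 0.0
    return len(tokenize(text)) / len(text)

def chars_per_window(text, tokenize, window=32_768):
    """Estimated characters of similar text that fit in one context window."""
    n_tokens = len(tokenize(text))
    if n_tokens == 0:
        return 0
    return window * len(text) // n_tokens

# Hypothetical stand-in tokenizers, not real model tokenizers.
def whitespace_tokenize(text):
    return text.split()

def byte_tokenize(text):
    return list(text.encode("utf-8"))  # one token per UTF-8 byte

sample = "สวัสดีครับ"  # a short Thai greeting
print(tokens_per_char(sample, byte_tokenize))
print(chars_per_window(sample, byte_tokenize, window=512))
```

To evaluate a real model, one option is to pass the `tokenize` method of its Hugging Face tokenizer in place of the stand-ins and run the functions on representative text from each language you serve.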

On cross-lingual retrieval, the 311M model achieved a score of 66.5 on Belebele (cross-lingual passage matching across 122 languages) and 67.1 on MLQA (cross-lingual question-answering retrieval across 7 languages)—improvements of 4.3 and 4.1 points over R1, respectively.

Code Retrieval: An Underestimated RAG Scenario

Beyond multilingual capabilities, R2 also added support for code retrieval, covering nine programming languages (Python, Go, Java, JavaScript, PHP, Ruby, SQL, C, and C++) and supporting cross-lingual code retrieval.

This is very useful in practical scenarios. Imagine your team maintains an international codebase with English documentation, comments mixing Chinese, Japanese, and English, and a specific module handling German localization. With previous embedding models, searching for "handle date format" and "日付フォーマット処理" might yield completely different results. With cross-lingual code retrieval, the two queries map to nearby points in the same vector space and therefore retrieve similar results.

On the MTEB Code benchmark, the 97M model scored 60.4 and the 311M model scored 63.8—improvements of 19.7 and 15.3 points over R1, respectively.

How to Integrate into Your Existing RAG System

Granite Embedding R2 can serve as a drop-in replacement for the embedding model in existing RAG frameworks. For sentence-transformers users, only the model name changes:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "ibm-granite/granite-embedding-97m-multilingual-r2"
)

For LangChain users:

from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="ibm-granite/granite-embedding-97m-multilingual-r2"
)

A similar one-line replacement is supported for LlamaIndex and Haystack.
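One caveat applies to any model swap: vectors produced by different models are not comparable, so an existing index has to be rebuilt with the new model. A minimal sketch of a guard follows, assuming the index keeps its metadata in a plain dict. The field names `model_name` and `dim` and the old model name are hypothetical:

```python
def needs_reindex(index_meta, model_name, model_dim):
    """True when stored vectors came from a different model or dimension."""
    return (
        index_meta.get("model_name") != model_name
        or index_meta.get("dim") != model_dim
    )

# Hypothetical metadata saved when the old index was built.
index_meta = {"model_name": "some-older-embedding-model", "dim": 768}
new_model = "ibm-granite/granite-embedding-97m-multilingual-r2"

stale = needs_reindex(index_meta, new_model, 384)
print(stale)  # True
```

Running a check like this at startup prevents queries encoded by the new model from being compared against vectors left over from the old one.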

If you use the 311M model and wish to save storage and computation costs, Matryoshka truncation is a very practical feature:

from sentence_transformers import SentenceTransformer

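model = SentenceTransformer(
    "ibm-granite/granite-embedding-311m-multilingual-r2"
)

# Truncate to 256 dimensions with less than 1% quality loss.
# If your installed sentence-transformers version does not accept
# truncate_dim in encode(), pass truncate_dim=256 to the
# SentenceTransformer constructor instead.
embeddings = model.encode(
    ["example text"], truncate_dim=256
)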

On MTEB multilingual retrieval, the 256-dimension version only drops by 0.5 points compared to the 768-dimension one (65.2→64.7), while reducing storage and similarity computation costs by a factor of 3.
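What truncation does to the vectors and to storage can be sketched without any model at all. The example below assumes float32 storage and a hypothetical index of one million vectors. `truncate_and_normalize` and `index_bytes` are illustrative helpers, not library functions:

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale to unit length,
    so cosine or dot-product scores stay comparable."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head

def index_bytes(n_vectors, dim, bytes_per_value=4):
    """Raw vector storage in bytes (float32), ignoring index overhead."""
    return n_vectors * dim * bytes_per_value

n_vectors = 1_000_000  # hypothetical index size
full = index_bytes(n_vectors, 768)
small = index_bytes(n_vectors, 256)
print(full, small, full // small)

short_vec = truncate_and_normalize([3.0, 4.0, 12.0], 2)
print(short_vec)
```

Re-normalizing after truncation matters when the index scores by dot product, because a truncated vector no longer has unit length.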

The Boundaries for Engineering Implementation

The Apache 2.0 license for Granite Embedding R2 permits commercial use, modification, and redistribution, subject to the license's attribution and notice requirements. However, there are several boundaries to be aware of when choosing a model:

The 97M model shows a decline on Belebele cross-lingual retrieval compared to R1 (52.9 vs. 55.1), a trade-off resulting from vocabulary pruning and parameter reduction—it vastly leads in broader multilingual retrieval but regresses on narrow cross-lingual tasks. If your core scenario involves retrieval across a vast array of language pairs, the 311M is the better choice.

The 97M model does not support Matryoshka truncation; it is natively 384 dimensions. If you need flexible control over vector dimensions, you must choose the 311M model.

Both models are encoder-only, meaning they do not support generation—they are pure embedding models and cannot replace an LLM for understanding and answering. The embedding model and the LLM each play their own role in the RAG pipeline; choosing Granite Embedding R2 boosts retrieval recall, not generation quality.

Finally, while the 32K context is powerful, the latency and computational cost of encoding grow with document length: at least linearly, and quadratically in any layer that applies full self-attention over the whole sequence. For latency-sensitive scenarios, you need to evaluate whether full-document encoding is truly necessary, or if better results can be achieved with intelligent slicing combined with the 32K window.
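One possible reading of "intelligent slicing" is to split only at paragraph boundaries and pack as many whole paragraphs as fit into a token budget. The sketch below assumes a whitespace word count as a stand-in for a real tokenizer. `pack_paragraphs` is a hypothetical helper, and the budget would be tuned to your latency target rather than set to the full 32K:

```python
def pack_paragraphs(paragraphs, max_tokens,
                    count_tokens=lambda p: len(p.split())):
    """Greedily group consecutive paragraphs into chunks whose token
    count stays within max_tokens, never splitting a paragraph."""
    chunks, current, used = [], [], 0
    for para in paragraphs:
        n = count_tokens(para)
        if current and used + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        # A single paragraph longer than max_tokens becomes its own
        # oversized chunk; a caller would need to split it further.
        current.append(para)
        used += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

paragraphs = ["a b c", "d e", "f g h i", "j"]
packed = pack_paragraphs(paragraphs, max_tokens=5)
print(packed)
```

With a larger window the budget can be raised until whole sections fit in one chunk, trading encoding latency for fewer broken logical units.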

The Embedding Layer is Upgrading

The emergence of Granite Embedding R2 is a clear signal: embedding models are no longer static components in the RAG pipeline; they are undergoing a context extension similar to that of LLMs. When LLMs are already competing on context length at the million-token level, the 512-token limit at the embedding layer has become a real bottleneck.

For production-grade RAG systems—especially those needing to handle multilingual content, long documents, or code retrieval—upgrading the embedding model to a 32K context window might be one of the highest-ROI optimizations in the near future.
