What Hard Negative Mining Actually Means in Practice
When you train a bi-encoder for retrieval, the standard setup is contrastive learning: given a query and a positive document, pull their embeddings together and push everything else apart. The “everything else” is your negatives, and how you choose them is the most important decision you will make.
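For reference, this is roughly what that standard setup looks like as an in-batch contrastive (InfoNCE-style) loss. A minimal PyTorch sketch; the function name and temperature value are illustrative, not the exact configuration we used.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, doc_emb, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss: each query's positive is the
    document at the same index; all other documents in the batch act as
    negatives."""
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    logits = query_emb @ doc_emb.T / temperature          # (B, B) similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```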
We built a system that matches entities (text) to internal tasks (text) using a bi-encoder. The model learns a similarity function and retrieves the best matching task for a given entity. The challenge was not general relevance. It was fine-grained discrimination between candidates that are almost identical. In many cases, candidates differed by a single token, a formatting variation, or a minor semantic shift. Standard training setups struggled with this. This post covers what actually worked.
The naive approach fails quickly. Random negatives are too easy. The model learns to separate obviously irrelevant documents from relevant ones, a trivial task that does not generalize to the hard cases you actually care about.
Hard negatives are documents that look relevant but are not. These are the candidates that BM25 or a weaker embedding model would rank highly, but that a human annotator would mark incorrect. Training on these forces the model to learn fine-grained distinctions rather than coarse separability.
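One common way to collect negatives like these is to mine them from a lexical first-stage ranker. A minimal sketch using the rank_bm25 package; the whitespace tokenization and the cutoff k are placeholder choices.

```python
from rank_bm25 import BM25Okapi

def bm25_hard_negatives(query, corpus, positive_idx, k=10):
    """Rank the corpus with BM25 and keep top-scoring documents that are not
    the labeled positive: lexically plausible but incorrect candidates."""
    bm25 = BM25Okapi([doc.split() for doc in corpus])
    scores = bm25.get_scores(query.split())
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:k] if i != positive_idx]
```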
The real failure mode
The main failure was not obvious mismatches. It was near-duplicates, minimal lexical differences, and structurally similar text with different meaning. In these cases, embedding distances collapsed, the model assigned similar scores to multiple candidates, and ranking became unstable. Hard example mining exposed this issue, but did not solve it by itself.
Why standard hard negative mining was not enough
We used both offline and online hard negative mining. Offline mining became stale as the model improved. Online mining provided a better training signal since negatives stayed challenging throughout training. But both hit a limit. When negatives became extremely similar to positives, gradients became noisy, false negatives increased, and training stability degraded. Hardness alone is not sufficient; the objective and the training data need more structure to separate very close candidates.
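Online mining, in rough outline: periodically re-embed the corpus with the current model, rebuild an approximate-nearest-neighbor index, and re-mine negatives so they keep pace with training. A sketch with FAISS, assuming embeddings are L2-normalized so inner product equals cosine similarity:

```python
import faiss
import numpy as np

def refresh_negative_index(doc_embeddings):
    """Rebuild the ANN index from the current model's document embeddings so
    mined negatives track the model as it improves."""
    index = faiss.IndexFlatIP(doc_embeddings.shape[1])  # inner product == cosine on unit vectors
    index.add(doc_embeddings.astype(np.float32))
    return index

def mine_online(index, query_embeddings, positive_ids, k=20):
    """Retrieve top-k per query and drop the labeled positive; the rest are
    candidate hard negatives for the next training steps."""
    _, ids = index.search(query_embeddings.astype(np.float32), k)
    return [[int(d) for d in row if d != pos] for row, pos in zip(ids, positive_ids)]
```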
Controlling difficulty
The first improvement was filtering by similarity instead of always selecting the hardest negatives. We removed very easy negatives and also removed extremely hard ones, keeping a band of moderately hard examples. This reduced noise and improved convergence. Semi-hard negatives outperform the hardest negatives because they carry less label ambiguity.
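A sketch of that band filter; the thresholds here are illustrative and would be tuned per dataset.

```python
def filter_negatives_by_band(neg_ids, neg_scores, low=0.4, high=0.9):
    """Keep moderately hard negatives: drop very easy ones (low similarity) and
    very hard ones (so similar they are likely false negatives)."""
    return [doc_id for doc_id, s in zip(neg_ids, neg_scores) if low <= s <= high]
```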
Cross-encoder guided mining
A key upgrade was adding a stronger scorer. The bi-encoder retrieves candidates, the cross-encoder scores them, and high-scoring incorrect candidates become negatives. This filters out false negatives and surfaces truly confusing pairs. This is now a standard approach in modern retrieval systems.
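A sketch of that mining step, assuming a sentence-transformers-style cross-encoder with a predict method over (query, text) pairs; the score band is illustrative.

```python
def cross_encoder_mined_negatives(query, candidates, positive_id, cross_encoder,
                                  keep_band=(0.3, 0.9)):
    """Score bi-encoder candidates with a stronger cross-encoder and keep
    high-scoring incorrect ones as hard negatives. Candidates above the upper
    bound are treated as likely false negatives and dropped.
    candidates: list of (doc_id, text) pairs."""
    scores = cross_encoder.predict([(query, text) for _, text in candidates])
    low, high = keep_band
    return [doc_id for (doc_id, _), s in zip(candidates, scores)
            if doc_id != positive_id and low <= s <= high]
```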
Margin-based contrastive learning
The biggest improvement came from changing the loss. Standard contrastive loss treats all negatives equally, which fails when differences are very small. We moved to margin-based training, enforcing a minimum score gap between positive and negative and increasing the margin for harder negatives. This forced separation even when texts were very similar and stabilized ranking among near-duplicates.
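One way to implement this is a margin ranking loss where the required gap grows with the negative's similarity. A PyTorch sketch; the shapes and margin constants are illustrative.

```python
import torch
import torch.nn.functional as F

def hardness_scaled_margin_loss(pos_scores, neg_scores, base_margin=0.1, scale=0.2):
    """Margin ranking loss where the required gap grows with negative hardness.
    pos_scores: (B,) similarity to the positive; neg_scores: (B, K) similarities
    to K mined negatives. Harder negatives (higher similarity) get a larger
    margin, forcing separation between near-duplicates."""
    margin = base_margin + scale * neg_scores.detach()                 # (B, K)
    violations = F.relu(margin + neg_scores - pos_scores.unsqueeze(1))
    return violations.mean()
```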
Curriculum over hardness
We did not keep hardness fixed throughout training. Early stages used easier negatives, and we gradually introduced harder ones. This prevented early collapse and let the model build stable representations before handling difficult cases.
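A minimal version of that schedule, linearly shifting the similarity band used for negative selection over training; the band endpoints are illustrative.

```python
def hardness_band_for_epoch(epoch, total_epochs,
                            start_band=(0.2, 0.6), end_band=(0.5, 0.9)):
    """Linear curriculum: begin with easier negatives and shift the similarity
    band used for mining toward harder ones as training progresses."""
    t = min(epoch / max(total_epochs - 1, 1), 1.0)
    low = start_band[0] + t * (end_band[0] - start_band[0])
    high = start_band[1] + t * (end_band[1] - start_band[1])
    return low, high
```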
Aligning training with inference
We also aligned training with the actual retrieval pipeline. Instead of training only on random or pre-mined negatives, we used candidates from the live lexical and embedding-based retrieval system. This ensured the training distribution matched inference and that hard negatives reflected real system errors rather than synthetic ones.
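In code, that simply means mining from the same retrievers the system calls at inference time. The retriever objects and their search interface below are stand-ins for whatever the serving pipeline exposes.

```python
def mine_from_serving_pipeline(query, positive_id, lexical_retriever,
                               dense_retriever, k=20):
    """Pull candidates from the same lexical and embedding retrievers used at
    inference, so mined negatives reflect real system errors rather than
    synthetic ones. Both retrievers are assumed to return (doc_id, score) pairs."""
    candidates = {doc_id for doc_id, _ in lexical_retriever.search(query, k=k)}
    candidates |= {doc_id for doc_id, _ in dense_retriever.search(query, k=k)}
    candidates.discard(positive_id)
    return list(candidates)
```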
Handling near-duplicate collapse
The most difficult issue was near-identical candidates. The fixes that helped most were stronger normalization, embedding regularization to reduce collapse, and combining dense similarity with lexical signals during training. The hybrid signal helps distinguish subtle differences that embeddings alone miss.
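A sketch of what a hybrid training-time signal can look like, blending dense cosine similarity with a simple token-overlap score; the weighting and the overlap measure are illustrative.

```python
def hybrid_similarity(dense_sim, query_tokens, doc_tokens, alpha=0.8):
    """Blend dense cosine similarity with a simple lexical overlap signal
    (token Jaccard) so near-duplicates that differ by a single token still
    receive distinguishable scores."""
    q, d = set(query_tokens), set(doc_tokens)
    lexical = len(q & d) / max(len(q | d), 1)
    return alpha * dense_sim + (1 - alpha) * lexical
```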
What finally worked
The most stable setup combined in-batch negatives for diversity, index-based online mining, cross-encoder filtering, semi-hard negative selection, margin-based loss, and a curriculum over hardness. This consistently outperformed random negatives, purely offline mining, and aggressively hard negatives.
The broader lesson is that hard example mining alone does not solve fine-grained matching. Extremely hard negatives introduce noise. Semi-hard negatives provide a better learning signal. Margin-based objectives are critical for separating near-duplicates. Training must reflect the actual retrieval pipeline. And when candidates differ by very small signals, the problem shifts from retrieval to precision ranking. Solving it requires changes in both data and objective, not just mining strategy.
