Editions classifies every article into your topics using on-device machine learning. We benchmarked six embedding models across 70 labeled articles from three RSS feeds to find the right one. Here's what we learned.
When you create a focus like "Climate Change" or "Local News", Editions needs to decide which of your incoming articles belong there. It does this by computing an embedding — a compact mathematical fingerprint of meaning — for both the article and the focus label, then measuring how similar they are.
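Under the hood, "measuring how similar they are" is cosine similarity between the two vectors. A minimal sketch in TypeScript — function and variable names are illustrative, not Editions' actual API:

```typescript
// Cosine similarity: dot product divided by the product of magnitudes.
// Returns a score in [-1, 1]; higher means the article and the focus
// label are closer in meaning.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 3-dimensional vectors (real models emit 384 or 768 dimensions):
const article = [0.2, 0.8, 0.1];
const focus = [0.25, 0.75, 0.05];
console.log(cosineSimilarity(article, focus).toFixed(3)); // ≈ 0.995
```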
The choice of embedding model matters. A better model means you see the right articles in the right topics. A worse one means you're manually re-sorting articles the system should have caught.
We tested six models — three small (384 dimensions, ~23–33 MB) and three medium (768 dimensions, ~67 MB) — to find which one earns its place on your server.
- **The Verge** — 30 articles across overlapping categories — gadgets, gaming, AI, deals. The hardest test: is this review about hardware or entertainment?
- **Ars Technica** — 20 articles with more distinct topics — science, security, EVs, policy. Cleaner category boundaries.
- **NYT World** — 20 articles on clearly separable geopolitical topics. The easiest test — but it validates that the system works at all.
Every article was labeled by hand: for each topic, does this article belong? Yes or no. An article about an AI-powered gadget might be labeled "yes" for both Gadgets & Hardware and AI & Machine Learning.
We then ran each model against these labels, found the optimal confidence threshold per topic, and measured precision (how many predicted matches were correct) and recall (how many true matches were found).
F1 score is the harmonic mean of precision and recall — the higher, the better. We averaged F1 across all topics and feeds for each model.
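The scoring described above can be sketched as follows — a simplified version of the calculation, not the benchmark's actual code:

```typescript
// Precision, recall, and F1 for one topic, given per-article
// predictions and hand labels (true = article belongs to the topic).
function f1Score(predicted: boolean[], actual: boolean[]): number {
  let tp = 0, fp = 0, fn = 0;
  for (let i = 0; i < actual.length; i++) {
    if (predicted[i] && actual[i]) tp++;        // true positive
    else if (predicted[i] && !actual[i]) fp++;  // false positive
    else if (!predicted[i] && actual[i]) fn++;  // false negative
  }
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  return precision + recall === 0
    ? 0
    : (2 * precision * recall) / (precision + recall);
}

// Macro average: compute F1 per topic, then take the unweighted mean,
// so small topics count as much as large ones.
function macroF1(perTopic: { predicted: boolean[]; actual: boolean[] }[]): number {
  const scores = perTopic.map((t) => f1Score(t.predicted, t.actual));
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}
```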
bge-small-en-v1.5 achieved the highest classification F1 (86.9%) despite being a small 384-dimensional model at just 33 MB. It outperformed models twice its size.
NLI (natural language inference via BART-MNLI) scored 84.6% — lower than simple embedding similarity — at roughly 60 times the compute cost.
| Model | Ars Technica F1 | The Verge F1 |
|---|---|---|
| **bge-small-en-v1.5** (winner) | 86.4% | 82.9% |
| MiniLM-L12 | 82.2% | 83.4% |
| gte-small | 82.8% | 82.2% |
| MiniLM-L6 | 80.3% | 79.0% |
| bge-base | 82.6% | 62.4% |
| gte-base | 80.1% | 70.3% |
NYT World results excluded from the table — all models scored 88–93% on clearly separable geopolitical topics, revealing little differentiation.
The intuitive assumption is that larger models with more dimensions should perform better. They have twice the capacity to represent meaning. More parameters, more nuance, right?
Wrong — at least for topic classification of news articles. The 768-dimensional "base" models consistently underperformed their smaller 384-dimensional counterparts. The effect was dramatic on The Verge, where overlapping consumer tech categories exposed the problem.
The likely explanation: larger models capture finer semantic distinctions that don't align with coarse topic boundaries. A "base" model can tell that a gaming laptop review is about both gaming and hardware — but that extra nuance produces ambiguous scores that hurt precision.
Compare bge-small against bge-base: same model family, same training data. The only difference is embedding dimension — and on The Verge the smaller one wins by 20 points.
Beyond classification, embeddings power vote propagation and future semantic search. We measured separation ratio — how much more similar articles within the same topic are versus articles in different topics. Higher is better.
MiniLM models spread articles out more in their embedding space — articles about the same topic cluster together, and different topics stay apart. This is ideal for vote propagation, where a vote on one article should influence similar articles but not unrelated ones.
bge-small wins on classification F1 but has a tighter embedding space (ratio 1.10 vs 1.45). This means vote propagation needs a higher similarity threshold to avoid bleeding across topics — which is why we raised it from 0.3 to 0.4.
We simulated voting (2 upvotes, 1 downvote per topic) and measured NDCG — a ranking metric where 1.0 means perfect ordering and 0.5 means essentially random.
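NDCG itself is a short calculation. A sketch of NDCG@K under the standard definition — simplified relative to whatever the benchmark actually runs:

```typescript
// NDCG@K: discounted cumulative gain of the produced ranking divided
// by the DCG of the ideal (relevance-sorted) ordering. 1.0 = perfect.
// `rankedRelevance` lists each item's relevance in ranked order.
function ndcgAtK(rankedRelevance: number[], k: number): number {
  const dcg = (rels: number[]) =>
    rels
      .slice(0, k)
      .reduce((sum, rel, i) => sum + rel / Math.log2(i + 2), 0);
  const ideal = [...rankedRelevance].sort((a, b) => b - a);
  const idealDcg = dcg(ideal);
  return idealDcg === 0 ? 0 : dcg(rankedRelevance) / idealDcg;
}
```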
The six models spanned the full range of outcomes:

- Votes improved ranking — the model we chose (bge-small-en-v1.5)
- Also benefits from the vote signal
- No meaningful change
- Slightly worse with votes
- Votes hurt — embedding space too compressed
- Votes hurt despite good cluster separation
Vote propagation works by finding similar articles and transferring the vote signal. If the embedding space is too spread out (MiniLM), votes on one article barely reach related ones — the signal dissipates. If the space is too compressed (base models), votes bleed across topic boundaries — an upvote on a gaming article accidentally boosts unrelated tech policy.
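The mechanism can be sketched in a few lines. Types and names here are illustrative, not Editions' actual internals; 0.4 is the threshold discussed above:

```typescript
type Article = { id: string; embedding: number[]; voteBoost: number };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / Math.sqrt(na * nb);
}

// Transfer a vote to similar articles, weighted by similarity.
function propagateVote(
  voted: Article,
  others: Article[],
  voteValue: number, // +1 for an upvote, -1 for a downvote
  threshold = 0.4,
): void {
  for (const other of others) {
    const sim = cosine(voted.embedding, other.embedding);
    // Below the threshold, the signal would bleed into unrelated topics.
    if (sim >= threshold) other.voteBoost += voteValue * sim;
  }
}
```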
bge-small sits in the sweet spot: compact enough for votes to propagate to genuinely similar articles, but separable enough to respect topic boundaries.
One more thing we learned: with only a few votes, the vote signal is noisy. A single upvote on a clickbait article shouldn't dramatically reshape the entire feed.
So we added vote ramping — the vote weight scales linearly from zero to full based on how many votes you've cast. Until you've voted on at least 5 articles, the system leans more heavily on confidence and recency. As you vote more, personalization gradually takes effect.
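The ramp reduces to one line. A sketch — the function name and full-weight parameter are our own, only the linear ramp and the 5-vote threshold come from the text:

```typescript
// Vote weight ramps linearly from 0 to full strength over the first
// `rampVotes` votes, then stays at full weight.
function voteWeight(totalVotesCast: number, rampVotes = 5, fullWeight = 1.0): number {
  return Math.min(totalVotesCast / rampVotes, 1) * fullWeight;
}
```

With the default ramp, your first vote carries 20% weight, your third 60%, and everything from the fifth vote onward counts in full.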
NLI asks a fundamentally different question — not "how similar is this text to this label?" but "does this text entail this label?" In theory, that's more precise. In practice, for well-defined topic categories, the simpler approach wins.
We still offer NLI and hybrid modes for users who want them (set analysis.classifier in your config). But the default is now pure embedding similarity — instant classification, no 400 MB model download on first run.
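A hypothetical config fragment showing the key path mentioned above — the file format and the exact mode names may differ in your installation:

```yaml
analysis:
  classifier: embedding   # the default; alternatives: nli, hybrid
```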
We switched from all-MiniLM-L6-v2 to bge-small-en-v1.5: same 384 dimensions, 10 MB larger, but 2.8 percentage points better on classification and measurably better vote propagation.
NLI offered no accuracy advantage at 60x the cost. Classification now completes in seconds instead of minutes. NLI and hybrid remain available for users who want them.
Vote influence now scales linearly from 0 to full over your first 5 votes. Until you've cast enough votes for a meaningful signal, the system relies more on topic confidence and recency.
Every classification score now records which model produced it. When you change models, the system automatically rescores all articles on the next reconciliation — no manual intervention needed.
The benchmark lives in apps/eval/ — a standalone workspace that imports from the server but exports nothing back. It uses the same transformers.js inference engine and the same in-memory SQLite setup as the real pipeline.
The benchmark exposes four commands:

- `pnpm fetch-feed` — download and extract RSS feeds into JSON fixtures
- `pnpm label` — interactive CLI to label articles per topic
- `pnpm bench` — run all benchmarks (classify, embed, rank)
- `pnpm report` — generate HTML report with SVG charts

It reports three groups of metrics:

Classification F1 — precision, recall, and F1 per focus, with thresholds optimized per focus via grid search. Macro-averaged across focuses, then averaged across fixtures.
Separation ratio — mean intra-focus cosine similarity divided by mean inter-focus similarity. A ratio above 1.0 means same-topic articles are more similar than different-topic articles.
NDCG@K and MRR — standard information retrieval metrics measuring how well the ranked output matches the ideal ordering. Tested with four weight configurations and two vote scenarios.
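The per-focus threshold optimization is a plain grid search. A sketch under our own naming — the real benchmark's grid and tie-breaking may differ:

```typescript
// For one focus, try candidate confidence thresholds and keep the one
// that maximizes F1 against the hand labels.
function bestThreshold(
  scores: number[],  // model confidence per article for this focus
  labels: boolean[], // hand label: does the article belong?
  candidates: number[] = Array.from({ length: 19 }, (_, i) => 0.05 * (i + 1)),
): { threshold: number; f1: number } {
  let best = { threshold: candidates[0], f1: -1 };
  for (const t of candidates) {
    let tp = 0, fp = 0, fn = 0;
    scores.forEach((s, i) => {
      const pred = s >= t;
      if (pred && labels[i]) tp++;
      else if (pred && !labels[i]) fp++;
      else if (!pred && labels[i]) fn++;
    });
    // Equivalent F1 formula that avoids separate precision/recall steps.
    const f1 = tp === 0 ? 0 : (2 * tp) / (2 * tp + fp + fn);
    if (f1 > best.f1) best = { threshold: t, f1 };
  }
  return best;
}
```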
No cloud. No API keys. No data leaving your machine.
That's the point.