Technical report

How we chose what reads your news.

Editions classifies every article into your topics using on-device machine learning. We benchmarked six embedding models across 70 labeled articles from three RSS feeds to find the right one. Here's what we learned.

The benchmark
6 models compared
3 RSS feeds, 70 labeled articles
15 topic categories
3 metrics: classification, embeddings, ranking
The question

Can a 33 MB model read your news?

When you create a focus like "Climate Change" or "Local News", Editions needs to decide which of your incoming articles belong there. It does this by computing an embedding — a compact mathematical fingerprint of meaning — for both the article and the focus label, then measuring how similar they are.
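The matching step can be sketched in a few lines. This is an illustrative sketch, not Editions' actual code: the function names are invented, and it assumes plain number arrays for embeddings (for normalized embeddings, cosine similarity and the dot product coincide).

```typescript
// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// An article belongs to a focus when its similarity to the focus label
// clears that focus's confidence threshold.
function matchesFocus(article: number[], focus: number[], threshold: number): boolean {
  return cosineSimilarity(article, focus) >= threshold;
}
```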

The choice of embedding model matters. A better model means you see the right articles in the right topics. A worse one means you're manually re-sorting articles the system should have caught.

We tested six models — three small (384 dimensions, ~23–33 MB) and three medium (768 dimensions, ~67 MB) — to find which one earns its place on your server.

The method

Three feeds, 70 articles, all labeled by hand.

Tech / Consumer

The Verge

30 articles across overlapping categories — gadgets, gaming, AI, deals. The hardest test: is this review about hardware or entertainment?

Gadgets · AI & ML · Deals · Gaming · Big Tech

Tech / Specialist

Ars Technica

20 articles with more distinct topics — science, security, EVs, policy. Cleaner category boundaries.

Science · Computing · EVs · Security · Policy

Geopolitics

NYT World

20 articles on clearly separable geopolitical topics. The easiest test — but validates the system works at all.

Middle East · Oil & Energy · US Policy · Ukraine · Latin America

Every article was labeled by hand: for each topic, does this article belong? Yes or no. An article about an AI-powered gadget might be labeled "yes" for both Gadgets & Hardware and AI & Machine Learning.

We then ran each model against these labels, found the optimal confidence threshold per topic, and measured precision (how many predicted matches were correct) and recall (how many true matches were found).
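The per-topic evaluation can be sketched as below. This is a minimal illustration under stated assumptions, not the eval framework's actual code: each labeled article is reduced to a similarity score plus a ground-truth label, and the threshold is grid-searched in 0.01 steps to maximize F1.

```typescript
interface Example { score: number; label: boolean }

// Precision, recall, and F1 for one topic at a given confidence threshold.
function f1At(examples: Example[], threshold: number): number {
  let tp = 0, fp = 0, fn = 0;
  for (const { score, label } of examples) {
    const predicted = score >= threshold;
    if (predicted && label) tp++;
    else if (predicted && !label) fp++;
    else if (!predicted && label) fn++;
  }
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  return precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
}

// Grid search: the threshold that maximizes F1 for this topic.
function bestThreshold(examples: Example[]): { threshold: number; f1: number } {
  let best = { threshold: 0, f1: 0 };
  for (let t = 0; t <= 1; t += 0.01) {
    const f1 = f1At(examples, t);
    if (f1 > best.f1) best = { threshold: t, f1 };
  }
  return best;
}
```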

Classification

The small model won.

F1 score is the harmonic mean of precision and recall; higher is better. We averaged F1 across all topics and feeds for each model.

Average macro F1 across all fixtures
bge-small    86.9%
MiniLM-L12   84.7%
NLI*         84.6%
MiniLM-L6    84.1%
gte-small    84.0%
gte-base     79.6%

*BART-MNLI, ~60x slower
Finding

bge-small-en-v1.5 achieved the highest classification F1 (86.9%) despite being a small 384-dimensional model at just 33 MB. It outperformed models twice its size.

Surprise

NLI (natural language inference via BART-MNLI) scored 84.6% — lower than simple embedding similarity — at roughly 60 times the compute cost.

Similarity F1 by feed
Model                Ars Technica   The Verge
bge-small (winner)   86.4%          82.9%
MiniLM-L12           82.2%          83.4%
gte-small            82.8%          82.2%
MiniLM-L6            80.3%          79.0%
bge-base             82.6%          62.4%
gte-base             80.1%          70.3%

NYT World results excluded from the table — all models scored 88–93% on clearly separable geopolitical topics, revealing little differentiation.

The surprise

Bigger isn't better.

The intuitive assumption is that larger models with more dimensions should perform better. They have twice the capacity to represent meaning. More parameters, more nuance, right?

Wrong — at least for topic classification of news articles. The 768-dimensional "base" models consistently underperformed their smaller 384-dimensional counterparts. The effect was dramatic on The Verge, where overlapping consumer tech categories exposed the problem.

The likely explanation: larger models capture finer semantic distinctions that don't align with coarse topic boundaries. A "base" model can tell that a gaming laptop review is about both gaming and hardware — but that extra nuance produces ambiguous scores that hurt precision.

The Verge F1 — small vs base
bge-small 384d
82.9%
bge-base 768d
62.4%

Same model family, same training data. The only difference is embedding dimension — and the smaller one wins by 20 points.

Embedding quality

How well do articles cluster by topic?

Beyond classification, embeddings power vote propagation and future semantic search. We measured separation ratio — how much more similar articles within the same topic are versus articles in different topics. Higher is better.

Average separation ratio (intra / inter-focus similarity)
MiniLM-L12   1.45
MiniLM-L6    1.41
bge-small    1.10
bge-base     1.05
gte-small    1.03
gte-base     1.03
What this means

MiniLM models spread articles out more in their embedding space — articles about the same topic cluster together, and different topics stay apart. This is ideal for vote propagation, where a vote on one article should influence similar articles but not unrelated ones.

The tradeoff

bge-small wins on classification F1 but has a tighter embedding space (ratio 1.10 vs 1.45). This means vote propagation needs a higher similarity threshold to avoid bleeding across topics — which is why we raised it from 0.3 to 0.4.

Ranking

Do votes actually help?

We simulated voting (2 upvotes, 1 downvote per topic) and measured NDCG, a ranking metric where 1.0 means perfect ordering and lower values mean relevant articles sit further down the list.
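For reference, NDCG@K can be computed as below. This is the standard textbook formulation (exponential gain, log2 discount), not necessarily the exact variant the eval framework uses.

```typescript
// Discounted cumulative gain over the top K positions of a ranking.
function dcg(relevances: number[], k: number): number {
  return relevances
    .slice(0, k)
    .reduce((sum, rel, i) => sum + (Math.pow(2, rel) - 1) / Math.log2(i + 2), 0);
}

// NDCG@K: DCG of the actual ranking divided by DCG of the ideal ranking.
function ndcg(ranked: number[], k: number): number {
  const ideal = [...ranked].sort((a, b) => b - a);
  const idealDcg = dcg(ideal, k);
  return idealDcg === 0 ? 0 : dcg(ranked, k) / idealDcg;
}
```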

NDCG change after voting

bge-small    +0.073   Votes improved ranking; the model we chose
gte-small    +0.050   Also benefits from vote signal
gte-base     -0.001   No meaningful change
MiniLM-L12   -0.021   Slightly worse with votes
bge-base     -0.037   Votes hurt; embedding space too compressed
MiniLM-L6    -0.048   Votes hurt despite good cluster separation

Why votes hurt some models

Vote propagation works by finding similar articles and transferring the vote signal. If the embedding space is too spread out (MiniLM), votes on one article barely reach related ones — the signal dissipates. If the space is too compressed (base models), votes bleed across topic boundaries — an upvote on a gaming article accidentally boosts unrelated tech policy.

bge-small sits in the sweet spot: compact enough for votes to propagate to genuinely similar articles, but separable enough to respect topic boundaries.
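A minimal sketch of the propagation rule described above, under stated assumptions: the function name is invented, the transferred signal is assumed to scale with similarity, and 0.4 is the similarity threshold mentioned earlier.

```typescript
// Propagate a vote (+1 upvote, -1 downvote) to other articles in proportion
// to their embedding similarity to the voted article. Articles below the
// similarity threshold receive no signal at all.
function propagateVote(
  voteValue: number,
  similarities: number[],
  threshold = 0.4,
): number[] {
  return similarities.map(sim => (sim >= threshold ? voteValue * sim : 0));
}
```

With a compressed embedding space, many unrelated articles clear the threshold and receive signal; with an over-spread space, almost none do.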

Vote weight ramping

One more thing we learned: with only a few votes, the vote signal is noisy. A single upvote on a clickbait article shouldn't dramatically reshape the entire feed.

So we added vote ramping — the vote weight scales linearly from zero to full based on how many votes you've cast. Until you've voted on at least 5 articles, the system leans more heavily on confidence and recency. As you vote more, personalization gradually takes effect.
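The ramp itself is a one-liner. A sketch, assuming a linear ramp to full weight at 5 votes as described; the function and parameter names are illustrative.

```typescript
// Vote weight scales linearly from 0 to fullWeight over the first
// rampVotes votes, then stays at fullWeight.
function voteWeightRamp(votesCast: number, rampVotes = 5, fullWeight = 1.0): number {
  return Math.min(votesCast / rampVotes, 1.0) * fullWeight;
}
```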

Classification strategy

The 400 MB model you don't need.

Embedding similarity
  How it works: dot product between article and focus embeddings
  Model size:   33 MB
  Speed:        ~7 seconds for 70 articles
  Best F1:      86.9%

NLI (zero-shot)
  How it works: "Does this text entail 'Technology'?"
  Model size:   ~400 MB (BART-MNLI)
  Speed:        ~500 seconds for 70 articles
  Best F1:      84.6%

NLI asks a fundamentally different question — not "how similar is this text to this label?" but "does this text entail this label?" In theory, that's more precise. In practice, for well-defined topic categories, the simpler approach wins.

We still offer NLI and hybrid modes for users who want them (set analysis.classifier in your config). But the default is now pure embedding similarity — instant classification, no 400 MB model download on first run.

What we changed

Four changes from one benchmark.

01

Switched to bge-small-en-v1.5

From all-MiniLM-L6-v2 to bge-small-en-v1.5. Same 384 dimensions, 10 MB larger, but 2.8 percentage points better on classification and measurably better vote propagation.

02

Defaulted to similarity-only classification

NLI offered no accuracy advantage at 60x the cost. Classification now completes in seconds instead of minutes. NLI and hybrid remain available for users who want them.

03

Added vote weight ramping

Vote influence now scales linearly from 0 to full over your first 5 votes. Until you've cast enough votes for a meaningful signal, the system relies more on topic confidence and recency.

04

Model tracking for automatic rescoring

Every classification score now records which model produced it. When you change models, the system automatically rescores all articles on the next reconciliation — no manual intervention needed.
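The reconciliation check reduces to comparing a stored model identifier against the active one. A hedged sketch; the field and function names are assumptions, not Editions' actual schema.

```typescript
interface ScoredArticle {
  id: string;
  scoreModel: string; // which embedding model produced the stored score
}

// Articles whose scores came from a different model than the one now
// configured are flagged for rescoring on the next reconciliation.
function needsRescore(articles: ScoredArticle[], activeModel: string): string[] {
  return articles.filter(a => a.scoreModel !== activeModel).map(a => a.id);
}
```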

Methodology

How to reproduce this.

The eval framework

The benchmark lives in apps/eval/ — a standalone workspace that imports from the server but exports nothing back. It uses the same transformers.js inference engine and the same in-memory SQLite setup as the real pipeline.

pnpm fetch-feed   Download and extract RSS feeds into JSON fixtures
pnpm label        Interactive CLI to label articles per topic
pnpm bench        Run all benchmarks (classify, embed, rank)
pnpm report       Generate HTML report with SVG charts

Metrics used

Classification

Precision, recall, F1 per focus, with thresholds optimized per focus via grid search. Macro-averaged across focuses, then averaged across fixtures.

Embedding quality

Separation ratio — mean intra-focus cosine similarity divided by mean inter-focus similarity. A ratio above 1.0 means same-topic articles are more similar than different-topic articles.
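The separation ratio is straightforward to compute from labeled embeddings. An illustrative sketch of the metric as defined above: group article embeddings by focus, average cosine similarity over same-focus pairs and over cross-focus pairs, and divide.

```typescript
// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Mean intra-focus similarity divided by mean inter-focus similarity.
// A ratio above 1.0 means same-topic articles cluster together.
function separationRatio(byFocus: Map<string, number[][]>): number {
  const intra: number[] = [];
  const inter: number[] = [];
  const entries = [...byFocus.entries()];
  for (let f = 0; f < entries.length; f++) {
    const [, embsA] = entries[f];
    // Pairs within the same focus.
    for (let i = 0; i < embsA.length; i++)
      for (let j = i + 1; j < embsA.length; j++)
        intra.push(cosine(embsA[i], embsA[j]));
    // Pairs across different focuses.
    for (let g = f + 1; g < entries.length; g++) {
      const [, embsB] = entries[g];
      for (const a of embsA) for (const b of embsB) inter.push(cosine(a, b));
    }
  }
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  return mean(intra) / mean(inter);
}
```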

Ranking

NDCG@K and MRR — standard information retrieval metrics measuring how well the ranked output matches the ideal ordering. Tested with four weight configurations and two vote scenarios.

A 33 MB model, running on your server, classifying your news at 87% F1.

No cloud. No API keys. No data leaving your machine.

That's the point.