DuckDB's FTS Extension: BM25-Powered Search Transforms Local Analytics Without Extra Servers

Data analysts and engineers have long chased fast text search in analytical workflows. DuckDB’s full-text search extension changes that equation. Load it with two SQL commands. Index millions of rows. Query with Okapi BM25 ranking. All in-process, zero infrastructure.

Peter Doherty kicked off the recent buzz. In his April 29 post on peterdohertys.website, he processes thousands of .eml files into JSON, imports them to DuckDB, and builds an FTS index on email bodies. His Python script uses BeautifulSoup to strip HTML, extracts metadata like sender and subject, then dumps clean text. Simple. Effective. He runs INSTALL fts; LOAD fts; then PRAGMA create_fts_index(’emails’, ‘id’, ‘body’);. Queries follow via match_bm25, scoring results by relevance.

Doherty draws from Postgres and Elasticsearch experience. DuckDB’s version, modeled on SQLite FTS5, handles stemming—reducing ‘running’ to ‘run’—stopwords like ‘the’ and ‘and’, accent stripping. It lacks phrase queries or match highlighting. No ts_headline equivalent. He greps results manually. Still, for local troves of emails or docs, it delivers.

Official docs confirm the mechanics, as detailed on DuckDB.org. The extension autoloads now in v1.5, but manual install works too. create_fts_index takes table name, ID column, text columns—or ‘*’ for all VARCHARs. Options abound: stemmer=’english’ or ‘porter’, stopwords=’english’, ignore regex for punctuation. Defaults lower everything, strip accents. It spits out a schema like fts_main_emails with inverted indexes.

Query via match_bm25(id_col, ‘query terms’, fields:=’body’, k:=1.2, b:=0.75). K tunes term frequency saturation; b normalizes doc length. Conjunctive mode demands all terms present. Scores NULL on no match—filter those out, order descending. A macro like fts_main_corpus.match_bm25 auto-generates for convenience. Combine with WHERE clauses: score IS NOT NULL AND date > ‘2026-01-01’ ORDER BY score DESC LIMIT 10. Analytical joins, aggregates follow naturally.

Mark Rachapoom pushes boundaries in his March 26 guide on dench.com. One million rows? Sub-100ms queries. Ten million? 200-500ms on standard hardware. Index balloons 30-50% over raw text—manageable for hundreds of GB. He searches CRM notes: ‘contract renewal Q2’. Logs for anomalies. Feedback for ‘churn’. Boolean ops shine: ‘pipeline AND revenue’, ‘customer NOT spam’. No Elasticsearch Docker sprawl. Just SQL.

But gaps persist. Indexes don’t auto-update. Edit data? Drop and rebuild: PRAGMA drop_fts_index(‘table’); recreate. Fine for batch loads, painful for live streams. No fuzzy matching. Limited snippets—hack with substr and position. Rachapoom notes Elasticsearch edges on distributed scale, facets, fuzziness. DuckDB owns single-node speed, SQL purity.

DuckDB’s guide on full-text search demos on Shakespeare: PRAGMA create_fts_index(‘corpus’, ‘line_id’, ‘text_entry’);. Hunt ‘butter’: SELECT fts_main_corpus.match_bm25(line_id, ‘butter’) AS score, * FROM corpus WHERE score IS NOT NULL ORDER BY score DESC. King Lear lines surface first. Integrate filters: speaker != ‘Chorus’. Group by play_name for counts.

Recent chatter amplifies adoption. Hacker News thread on Doherty’s post, news.ycombinator.com, hit 62 points fast. Users praise DuckDB’s data-source magic—S3 Parquet to SQL instantly—now supercharged with FTS. One flags trigram tokenizer wishes on GitHub, eyeing better partial matches.

The FTS repo on GitHub updated February 2026, bumping to DuckDB v1.5. Core team maintains. Multilingual stemmers: Arabic to Turkish. Stem function for debugging: stem(‘running’, ‘english’) yields ‘run’.

Real-world pulls it together. Jiby on jiby.tech indexes song lyrics April 17. Medium’s Praxen touts it for logs, tickets. No server. Triages freeform text fast.

And here’s the shift. Analysts ditch LIKE ‘%query%’ scans—brutal on GBs. Elasticsearch setups? Overkill for local stacks. DuckDB FTS slots in: read Parquet/JSON/CSV, index text columns, query ranked. Pair with vss extension for hybrid lexical-semantic search. MotherDuck cloud scales it further.

Performance holds on hardware today. Rachapoom’s 10M rows in half a second. Doherty’s 13k emails? Instant. Index build times scale linear—minutes for millions.

Limitations force trade-offs. Rebuilds. No phrases natively. Workarounds emerge: custom tokenizers, vector extensions.

Yet for internal tools, prototypes, edge analytics—FTS nails it. DuckDB turns laptops into search engines. No clusters. Pure SQL.

DuckDB’s FTS Extension: BM25-Powered Search Transforms Local Analytics Without Extra Servers

Notice an error?

Ready to get started?