Changelog
- 2026-03-10: Initial publication
Preface
This blog post is the result of two humans bringing up self-hosted search within days of each other. I was curious and ended up dancing with the Red Queen.
This post will discuss two options I find useful for self-hosted search and my thoughts regarding both. However, it must be noted: this is a wholly different approach to search than going to some big name (Google / Bing / Duck Duck Go / Others) search engine and casting about the internet.
Self-hosted search is going to provide a very different view upon the internet. This is A Good Thing.
Safety
I wish this section didn’t need to be here but… governments around the globe are authoritarian nightmares and want to control every aspect of life, including what kinds of content you view online. Thanks to this, VPNs are A Thing and do help keep various governments from spying on every facet of your online life.
Now imagine you’re hosting a search engine and scraping the web. How much of that web scraping are you ok with your government, ISP, etc seeing? None? Good, me too.
You can put a VPN at the head of a search engine so that the search engine’s outbound traffic goes through the VPN while inbound access still comes in on a different IP. This is especially easy with Docker and I’ve provided an example setup in the Config Files section below. This gives you an extra layer of safety and privacy, one that should insulate you from a lot of flavors of spying and surveillance.
If you decide to deploy self-hosted search: please think hard about whether or not you should have a VPN ahead of your node. It’s probably worth it, even if you decide you have nothing to hide. If you do decide you have nothing to hide, maybe read this article and its offshoots before actually deploying services.
Yacy
What is…
Yacy is essentially a distributed search engine. They use a bunch of p2p tech (p2p tech itself is legal, that’s just not how it’s usually seen being used) to spread the search index across a network of public nodes. You can also run it in a kind of ‘private’ mode that’s disconnected from their public index and wholly self-contained. You can even index an intranet and search it with a Yacy node.
After some time with Yacy I’ve come to like it as a generic search that gives me wildly different results than the bigger, common search indexes. It’s like the early days of the web when AOL was considered ‘the internet’, long before companies like Google emerged.
Yacy is not perfect but it does a good job showing me things I wouldn’t normally see, which is the whole point for me.
Usage
Using Yacy is pretty simple: use it like you’d use any other internet search. Type some stuff into a box, boop the ‘search’ button and read the results. It’s not fancy or interesting.
That said: there are advanced search parameters (the Yacy search page has details) and you can perform an image search, etc. It’s very similar to the ‘early’ search engines that were common before Google slaughtered all other search.
One thing I did notice during use: the search result page shows how many results came from that node’s local index and how many came from remote nodes. Sometimes Yacy does not pull from remote nodes ‘right away’. When I see the results aren’t pulling from remote nodes I usually run the search again in a few minutes to try to get remote results.
Seeing remote results has proven to be important in my use of Yacy and it does work. It just may need some time, and sometimes a second run of the same search, before the remote results show up.
Self-Hosting
Self-hosting Yacy was pretty straightforward with Docker. I simply followed the docs to get it deployed initially. After that… I had to do some extra work.
They have config and tuning options scattered across the management web UI and I strongly recommend going through all the screens. I made the following changes:
- Tweaked the main search page
- Removed top bar
- Set default theme to gray
- Enabled node SSL
- Set ‘Prefer SSL’ anywhere it shows as an option
- Created a new admin user
- Created a power user with all permissions but admin
- Set max ram to 4g via the main yacy.conf file
- Turned on auto crawling so it only runs a shallow crawl of depth 0
- Crawled key sites that I care about
- Set up automated crawls of the sites I care about with a frequency of weekly
The Config Files section below has some additional info and files that shed more insight into how I have Yacy deployed.
One final note about self-hosting: if you allow remote nodes to initiate crawling, make sure your Yacy node is behind a VPN (or similar) for outbound traffic. I’m not keen on having the militarized police showing up at my house because my Yacy node accepted a remote crawl request. This is very similar to what can happen if you run a Tor exit node.
SearXNG
What is…
SearXNG is a meta search engine. It searches other search engines and is touted as not tracking users. It’s self-hostable and a very different kind of search than most are used to. It’s designed to talk to other search endpoints and show you the results from all of the endpoints as a merged list of results.
I find I like it best for highly specialized searches. If I want to search ‘just academic sources’ or ‘just forums’ or ‘just source forges’ then SearXNG is where I head. I use it much like folk use Kagi Lenses: to focus my efforts.
One thing to note: a lot of folk talk about how SearXNG is ‘private’ and can be used to hide your search queries. This only works if you have many users of an instance. The only way you can ‘get lost in the noise’ is if you’re one of the many, not few. Self-hosting SearXNG violates this constraint (generally speaking). Be mindful of claims others are making in this regard.
Usage
My usage is based on categories and specialized searches. I did set up the usual ‘big tech’ search engines like Duck Duck Go, Bing, Google, etc. but I rarely use them. I kept them because every now and again I need them, as Kagi (my main search engine) doesn’t always yield the best results.
I have categories like ‘Academic’, ‘Dictionary’, ‘Source Code’ and whatnot that I use to do targeted searches via specialty engines. It works really well for me, more so than Kagi lenses despite Kagi lenses working very similarly to how SearXNG works (in practice).
I also use ‘bang searches’ (e.g. !unsplash or !source-code) heavily to hone my searches further or to avoid my first search using the default category (side note: if you know how to show categories as soon as the engine is loaded, I’d love to know how).
The last ‘big thing’ I do is keep an eye on the per-engine info for each search: how long it took to run, whether an engine timed out, etc. This helps me ensure my config is 100% and clues me in to possible tuning I may want to add. This is mostly not needed anymore but I still keep an eye out.
Self-Hosting
For self-hosting, I’ve published my config below. I made serious changes to weights, categories and some other tunables. Do not underestimate this tuning. The default SearXNG engine config is, frankly, total trash. Dumpster fire grade total trash.
Spend time tuning. It makes a huge difference. The kind of difference that makes other admins go ‘huh, well that is wildly different and a massive improvement’ when they try out an Actually Tuned config. Note: I have not tinkered with the torrent ‘stuff’ or similar. I don’t recommend you do either, it’s not worth the risk IMO.
Additionally, I recommend folk mind the fact you need a lot of users to have your searches ‘get lost in the noise’ and I very much recommend putting a VPN ahead of SearXNG. Much like Yacy above, it’s to keep things a bit separate from the server’s main internet line.
If you use Duck Duck Go, they have a captcha setup to limit how often requests can be made from a single IP address. See the links below for details if you want to use DDG via SearXNG.
- https://github.com/searxng/searxng/issues/4824
- https://docs.searxng.org/admin/answer-captcha.html#answer-captcha-from-server-s-ip
Config Files
General
The below is a Docker Compose file and a simple shell script to launch Gluetun to provide VPN service as the network for Yacy and SearXNG. This config also sets up inbound Yacy on a stable port so you can have outgoing networking via the VPN and inbound via your usual IP/DNS.
Definitely review the compose.yml file and adjust accordingly. Especially given I use traefik as a reverse proxy and diun for update notifications.
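As a rough illustration of the layout described above, here is a minimal sketch of a Compose file. It is not the full compose.yml (Traefik and diun are left out) and several things in it are assumptions: the image names and ports are the upstream defaults as far as I know, the VPN provider values are placeholders, and the volume names and paths are made up for the example. Check the Gluetun docs for the variables your provider actually needs.

```yaml
# Minimal sketch, not a drop-in file. Placeholder values are marked.
services:
  gluetun:
    image: qmcgaw/gluetun
    cap_add:
      - NET_ADMIN                     # needed so Gluetun can manage the tunnel
    devices:
      - /dev/net/tun:/dev/net/tun
    environment:
      # Placeholders: provider, VPN type and credentials depend on your VPN.
      - VPN_SERVICE_PROVIDER=your-provider
      - VPN_TYPE=wireguard
      - WIREGUARD_PRIVATE_KEY=change-me
    ports:
      # Inbound ports are published here, on the gluetun container,
      # because the services below share its network namespace.
      - "8090:8090"                   # Yacy web UI (assumed default HTTP port)
      - "8080:8080"                   # SearXNG web UI (assumed default port)
    restart: unless-stopped

  yacy:
    image: yacy/yacy_search_server:latest
    network_mode: "service:gluetun"   # all outbound traffic leaves via the VPN
    depends_on:
      - gluetun
    volumes:
      - yacy_data:/opt/yacy_search_server/DATA   # hypothetical volume name
    restart: unless-stopped

  searxng:
    image: searxng/searxng:latest
    network_mode: "service:gluetun"
    depends_on:
      - gluetun
    volumes:
      - ./searxng:/etc/searxng        # hypothetical host path for settings.yml
    restart: unless-stopped

volumes:
  yacy_data:
```

The key design choice is network_mode: "service:gluetun". The search containers have no network of their own, so crawls and upstream queries can only leave through the tunnel, while the published ports on the gluetun service keep the web UIs reachable from your usual IP/DNS.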
SearXNG
The below is my SearXNG config, inclusive of my tuning. Definitely review this closely and adjust as you see fit. One thing to note: the default category SearXNG uses is ‘general’ and I’ve intentionally disabled that category in this config. I don’t like the approach and prefer to ‘force’ folk to pick the categories they want to use.
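To give a feel for what this kind of tuning looks like, here is a small sketch of a settings.yml fragment. It is an illustration only, not the config described above: the category names, weights and the choice of engines are assumptions picked for the example, and the engine names are the ones I believe ship in the default SearXNG settings, so verify them against your instance.

```yaml
# Illustrative fragment only; values are assumptions, not a tuned config.
use_default_settings: true     # start from the defaults, override what follows

server:
  secret_key: "change-me"      # placeholder, generate your own

# Categories that should show up as tabs in the UI (hypothetical names).
categories_as_tabs:
  academic:
  dictionary:
  source-code:

engines:
  - name: arxiv
    categories: [academic]
    weight: 2                  # rank this engine's results higher
  - name: wiktionary
    categories: [dictionary]
    disabled: false
  - name: github
    categories: [source-code]
    timeout: 4.0               # seconds; loosen for slow upstreams
  - name: duckduckgo
    disabled: true             # keep it out of searches unless re-enabled
```

With categories along these lines, a query such as !academic followed by the search terms restricts the search to the engines in that category.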
Gluetun
The below is a Gluetun auth config for the API. This is handy if you need/want port forwarding tricks. See the compose.yml for insights on how this may be useful.