Sphinx

StatusNet supports using Sphinx as a backend for searching user profiles and notice text; for a larger site this can perform much better than MySQL's built-in full-text search.

General background
Sphinx provides an advanced full-text search server, which can be configured to pull source data directly from our MySQL tables. (Handy!) We can then query Sphinx instead of our own MySQL to find text matches in notices and user profiles.

One limitation is that Sphinx doesn't have a facility for live updates to the index, so if the data set is large it can take some time to rebuild indexes and have new notices appear.

Work in progress
The current state of Sphinx support in 0.9.x is a little funky; the config is set up for a single site only and requires a full index rebuild to update.

I'm doing some branch work restructuring things a bit... --brion 13:37, 5 November 2009 (UTC)


 * moving Sphinx-specific code to a plugin
 * added GetSearchEngine event hook for plugins to provide search engines
 * hardcoded "table" names like 'identica_people' changed to consistent things like '{sitename}_profile'
 * added gen_conf.php script to generate a sphinx.conf file based on site config
 * can attempt to use the status_network site routing table, but I'm not sure it's working right yet
 * config has been enhanced to do incremental updates...
 * indexes split into 'main' and 'delta' components, which are consolidated for search on the sphinx side
 * adds a table to keep the cutoff points for delta rebuilds
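The main/delta layout above can be sketched as a sphinx.conf fragment. This is a hypothetical, hand-written example, not actual gen_conf.php output: the index names (examplesite_notice_*), the cutoff table name (sn_search_cutoff), and the paths are all illustrative, and connection settings are omitted.

```
# Main index: everything up to the recorded cutoff point.
source examplesite_notice_main
{
    type      = mysql
    # sql_host / sql_user / sql_pass / sql_db omitted here
    sql_query = SELECT id, profile_id, UNIX_TIMESTAMP(created) AS created_ts, content \
                FROM notice WHERE id <= (SELECT notice_id FROM sn_search_cutoff)
    sql_attr_uint      = profile_id
    sql_attr_timestamp = created_ts
}

# Delta source inherits connection settings, overrides only the query.
source examplesite_notice_delta : examplesite_notice_main
{
    sql_query = SELECT id, profile_id, UNIX_TIMESTAMP(created) AS created_ts, content \
                FROM notice WHERE id > (SELECT notice_id FROM sn_search_cutoff)
}

index examplesite_notice_main
{
    source = examplesite_notice_main
    path   = /usr/local/sphinx/data/examplesite_notice_main
}

index examplesite_notice_delta : examplesite_notice_main
{
    source = examplesite_notice_delta
    path   = /usr/local/sphinx/data/examplesite_notice_delta
}
```

At search time the client queries both indexes together ("examplesite_notice_main, examplesite_notice_delta"), which is how the results get consolidated on the Sphinx side.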

Development considerations
Does $config['site']['name'] correspond to status_network.sitename? If not, what should I use that'll be consistently accessible from routing table & within a running SN instance and will be a clean table name for Sphinx? (eg need 'identica_profile' not 'Identi.ca_profile')


 * We're starting to expose the 'nickname' internally now which should become our primary site identifier for queues, search indexes, etc. For single sites the default should work ok.
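The cleanup from 'Identi.ca_profile' to 'identica_profile' amounts to stripping the nickname down to characters that are safe in a Sphinx index name. A minimal sketch (the function name is illustrative; in StatusNet this would live in the plugin's PHP, not Python):

```python
import re

def sphinx_index_base(nickname):
    """Reduce a site identifier to a clean Sphinx index name component.

    Assumes 'nickname' is the status_network.nickname value; which
    field is actually authoritative is the open question above.
    """
    base = re.sub(r'[^a-z0-9_]', '', nickname.lower())
    if not base:
        raise ValueError("no usable characters in site identifier: %r" % nickname)
    return base
```

For example, sphinx_index_base('Identi.ca') yields 'identica', so the profile index becomes 'identica_profile'.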

Will Sphinx fall down and barf holding thousands of files open? I've got 42 files just for my single test site's index. That's up to 336,000 files for 8k sites -- will this kill stuff?

How long do index rebuilds take on a real Identica-sized site? Merging the delta index to the main index might save time over doing occasional full rebuilds.


 * End-of-2009 figures were something on the order of 10-15 minutes for a full rebuild on identica; copying the indexes around to multiple servers is actually a slow part too. Unsure of benefit of merge vs rebuild for updating the base index.
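For reference, Sphinx's indexer can fold a delta into its main index in place, which skips re-scanning the full source data; this is the operation whose cost vs. a full rebuild is unclear. Index names and the config path here are illustrative:

```
indexer --config /etc/sphinx/sphinx.conf \
    --merge examplesite_notice_main examplesite_notice_delta --rotate
```

The --rotate flag tells a running searchd to pick up the merged index without a restart.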

How long will it take to cycle through delta updates for 8k+ sites, most of which probably have few messages coming through in any given short time period?


 * Maybe use queues to schedule index updates? Eg have queue handler mark down a list of all sites which had changes in the last 5 minutes, then cronjob builds and distributes delta indexes for just those that need it.
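The queue idea above boils down to deduplicating per-site changes into a small "dirty set" that the cronjob drains. A hypothetical sketch (mark_dirty/drain_dirty are illustrative names, not actual StatusNet APIs, and the real version would be PHP with persistent storage):

```python
# Sites with at least one new notice since the last cron run.
dirty_sites = set()

def mark_dirty(site_id):
    """Called from the queue handler whenever a notice arrives for a site."""
    dirty_sites.add(site_id)

def drain_dirty():
    """Called from cron every ~5 minutes; returns sites needing a delta rebuild."""
    batch = sorted(dirty_sites)
    dirty_sites.clear()
    return batch

# 8k sites may exist, but only the ones with recent traffic get rebuilt:
mark_dirty(42)
mark_dirty(7)
mark_dirty(42)   # duplicate notices collapse into one rebuild
for site_id in drain_dirty():
    pass         # here: run indexer for that site's delta index, then distribute
```

This keeps the cron cost proportional to active sites rather than total sites.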

Questions
Q: The generated sphinx.conf references two other files (see below) that are not found when Sphinx is run - where do they come from?

WARNING: index 'mublog_profile': preload: failed to open /usr/local/sphinx/data/mublog_profile.sph: No such file or directory; NOT SERVING
WARNING: index 'mublog_notice': preload: failed to open /usr/local/sphinx/data/mublog_notice.sph: No such file or directory; NOT SERVING

A: You need to build the indexes! Run Sphinx's "indexer" command.
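Concretely (paths here assume a default-style install under /usr/local/sphinx; adjust to your layout):

```
# First build, before searchd is serving the indexes:
indexer --config /usr/local/sphinx/etc/sphinx.conf --all

# Later rebuilds while searchd is running:
indexer --config /usr/local/sphinx/etc/sphinx.conf --all --rotate
```

The missing .sph files under /usr/local/sphinx/data are written by this step; searchd will serve the indexes once they exist.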

Possible structure changes
Consider a combined index for all sites, which we can query with a restriction on target site. This keeps the number of files low, and could use the queues to schedule new notices for delta updates to avoid the 8000-site loop.
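A hypothetical sketch of what that combined index could look like. This assumes each notice row can be tagged with its site's numeric id (e.g. via a join against the routing table); the names all_notice and site_id are illustrative:

```
source all_notice
{
    type          = mysql
    # assumes a way to attach the owning site's numeric id to each row
    sql_query     = SELECT id, site_id, content FROM notice
    sql_attr_uint = site_id
}

index all_notice
{
    source = all_notice
    path   = /usr/local/sphinx/data/all_notice
}
```

Queries would then restrict by attribute instead of picking a per-site index, e.g. SetFilter('site_id', array($site_id)) in the Sphinx PHP client API.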

Other issues
There have been problems in the past with the UTF-8 config; currently many non-ASCII searches fail. It _should_ work more cleanly now that the databases have been cleaned up, but this still needs testing to confirm.
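The usual culprits are the per-index charset settings: if charset_type isn't utf-8, or charset_table treats accented letters as separators, non-ASCII terms never make it into the index. A sketch (the Latin-1 range shown is illustrative; a real charset_table would need to cover whatever scripts the sites actually use):

```
index examplesite_notice
{
    charset_type  = utf-8
    # index digits, fold A-Z to a-z, and keep accented Latin-1 letters
    # as word characters instead of treating them as separators
    charset_table = 0..9, A..Z->a..z, _, a..z, U+C0..U+FF
}
```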