Search redesign for 1.0
From StatusNet
Desired features
Low-level
- wildcards
- short names
- i18n: stopwords, min length
- stopwords
- non-latin symbols failing
- literal URL search
- user search should not require exact match
- like search stuff
- paging for API
Filtering
- search within an individual user's notices
- filter notice stream by profile_id
- people search function - include subscription tags? [are those public? could at least use own maybe]
- filter people by sub tags
- search based on location
- filter notice stream by location/distance
- filter people by location/distance
- hide blocked users from search results
Output
Current infrastructure
As of 0.9.x....
SearchEngine parent class and children in lib/search_engines.php...
- you take your given class and initialize it with a target table (notice or profile)
- you toss an opaque query into it with $engine->query()
- that loads up something into a DB_DataObject in $engine->target
- you go through that list outputting things
Problems:
- can only support specific tables mentioned
- fields and filtering are not really provided as options; the query type is fixed to the table
MySQL fulltext
- profile
- MATCH(nickname, fullname, location, bio, homepage) AGAINST (q IN BOOLEAN MODE)
- and does a second with strtolower() if query isn't all lowercase [may break on utf-8]
- notice
- excludes (notice.is_local = Notice::GATEWAY)
- MATCH(content) AGAINST (q IN BOOLEAN MODE)
- also does the second with strtolower() if query isn't all lowercase
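The two-pass behaviour described above can be sketched roughly as follows. This is an illustration in Python, not StatusNet's PHP code: the function name `fulltext_clauses`, the column tuples, and the `%s` placeholder style are assumptions made for the example, and how the two clauses are combined in the final query is not specified here.

```python
PROFILE_COLUMNS = ("nickname", "fullname", "location", "bio", "homepage")
NOTICE_COLUMNS = ("content",)


def fulltext_clauses(columns, q):
    """Return (sql, params) pairs: one for q, plus one for lowercased q if it differs."""
    sql = "MATCH(%s) AGAINST (%%s IN BOOLEAN MODE)" % ", ".join(columns)
    clauses = [(sql, (q,))]
    # Python's str.lower() is Unicode-aware. PHP's strtolower() works on bytes,
    # which is the likely reason for the UTF-8 concern in the note above.
    lowered = q.lower()
    if lowered != q:
        clauses.append((sql, (lowered,)))
    return clauses
```

Passing the query as a bound parameter, as here, also avoids having to escape it by hand.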
MySQL LIKE
- profile
- (nickname LIKE "%q%" OR
- fullname LIKE "%q%" OR
- location LIKE "%q%" OR
- bio LIKE "%q%" OR
- homepage LIKE "%q%")
- notice
- not excluding gatewayed notices
- content LIKE "%q%"
PostgreSQL
- profile
- textsearch @@ plainto_tsquery(q)
- notice
- not excluding gatewayed notices
- to_tsvector('english', content) @@ plainto_tsquery(q)
Sphinx plugin
- profile
- passes query through to Sphinx
- indexing query:
- sql_query = SELECT id, UNIX_TIMESTAMP(created) as created_ts, nickname, fullname, location, bio, homepage FROM profile
- sql_attr_timestamp = created_ts
- notice
- passes query through to Sphinx
- not excluding gatewayed notices
- sql_query = SELECT id, UNIX_TIMESTAMP(created) as created_ts, content FROM notice
- sql_attr_timestamp = created_ts
Notes
General
- notices
- filter by:
- poster (profile_id)
- group...? tag...?
- simulate attributes by adding special tags into the fulltext?
- "this is a nifty post @profile_id:1234 @group:32 @group:65"
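The "special tags" idea above could look something like this at indexing time. This is a sketch in Python rather than StatusNet's PHP; `with_pseudo_attributes` is a hypothetical helper, and the token format is simply the one from the example in the note. Whether a given engine keeps such a token intact depends on its tokenizer, so that would need checking per backend.

```python
def with_pseudo_attributes(content, profile_id, group_ids=()):
    """Append attribute-like tokens to the text that is handed to the fulltext index."""
    tokens = ["@profile_id:%d" % profile_id]
    tokens.extend("@group:%d" % gid for gid in group_ids)
    return content + " " + " ".join(tokens)


# Text to index for a notice by profile 1234 posted to groups 32 and 65.
indexed = with_pseudo_attributes("this is a nifty post", 1234, [32, 65])
```

A search restricted to one poster would then add the matching token to the query as a required term.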
MySQL
- Minimum lengths and stopwords can be disabled/changed at server config level...
- Or can be gotten around with a customized search target field -- but that's harder
- Wildcards can be supported, at least to some degree
- Default 'OR' search sucks horribly, probably could benefit from query rewriting
- Doesn't work with InnoDB, which we prefer for main tables
- if used by default, consider splitting out MyISAM search index tables
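One possible form of the query rewriting mentioned above is to mark every bare term as required, so that BOOLEAN MODE behaves like an AND search. The sketch below is in Python for illustration (StatusNet is PHP); `require_all_terms` is a hypothetical name, and the tokenizing is deliberately simple: parenthesized groups and unbalanced quotes are not treated specially.

```python
import re

# A token is either a double-quoted phrase (optionally preceded by an operator)
# or a run of non-whitespace characters.
_TOKEN = re.compile(r'[+\-<>~]?"[^"]*"|\S+')
_OPERATORS = "+-<>~"


def require_all_terms(q):
    """Prefix terms with '+' unless the user already supplied a boolean operator."""
    out = []
    for token in _TOKEN.findall(q):
        out.append(token if token[0] in _OPERATORS else "+" + token)
    return " ".join(out)
```

For example, `require_all_terms('status net')` gives `+status +net`, and a trailing `*` wildcard such as `micro*` passes through with only the `+` added.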
MySQL like
- I'd prefer to kill this as it scales very poorly
- Current implementation isn't escaping properly (looks like % and _ wildcards will actually go through)
- no implementation for fancier search keywords
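If the LIKE engine is kept, the escaping problem noted above could be handled along these lines. This is a sketch in Python, not the StatusNet code; `like_pattern` is a hypothetical helper, and it assumes backslash as the LIKE escape character (MySQL's default).

```python
def like_pattern(q, escape="\\"):
    """Build a '%q%' substring pattern with LIKE metacharacters in q escaped."""
    escaped = (
        q.replace(escape, escape + escape)  # escape the escape character first
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )
    return "%" + escaped + "%"
```

The resulting pattern still has to reach the database as a bound parameter or through the usual string escaping; this step only stops `%` and `_` in user input from acting as wildcards.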
Postgres
- will need to look up some docs...
Sphinx
- sql_query specifies which fields get pulled for actual indexing
- http://sphinxsearch.com/docs/current.html#conf-sql-query
- first column is document id
- sql_query_info setting is used for debugging queries with the cli client only
- certain column values can be marked as "attributes", which can be used for filtering or sorting results (but not for searching)
- should work to implement search within notice stream etc
- http://sphinxsearch.com/docs/current.html#attributes
- sql_attr_timestamp = created_ts (used for sorting)
- double-check the query language but I recall it being roughly sensible