Sorting changes

From StatusNet
Jump to: navigation, search

Emblem-important-red.svg.png The changes described below have been implemented on 0.9.x as of commit 3a831ff, except for search. If updating manually, run db/096to097.sql to update indexes on MySQL.

When we start actively using the data portability features, imports of old notice streams from other sites will start breaking the general assumption that notice.id and notice.created both increase together as time goes by.

Since id numbers are generated in order of insertion, but timestamps are generated in order of authorship, imported data will often have a higher id number but lower creation timestamp than other notices posted in real time before the import.

With our current id-based ordering on output, this could lead to oddities like a bunch of 3-year old imported messages clogging up a profile view or search results on top of newer items.

Note: if importing can only happen when an account is created locally, this wouldn't be as much of an issue: the imported posts won't be sent to user or group inboxes, and any notices posted in real-time after the import will get later IDs. However, if we allow folks to import stuff later -- like merging their old Twitter posts into an existing identi.ca account -- or if we allow the account to start posting new notices *while* a long import process is going -- then we can end up with mixed ones, and an import could flood somebody's profile confusingly.

Contents

[edit] New ordering

Where we previously had notice.id as an element in sort order, we would now have notice.created, notice.id. (Including the ID number as well gives us consistent ordering for notices with the same second-resolution timestamp; including it after timestamp means timestamp is our primary ordering.)

Most realtime-posted notices will stay in the same order (with possible slight differences where clocks were slightly out of sync or where it took a little while to complete an insertion).

This will order imported notices more naturally in the time stream, keeping newer real-time posted notices on top.

[edit] API semantics

Twitter-compatible API clients expect timeline results that are in reverse chronological order of posting, with increasing ID numbers roughly corresponding to increasing time. This isn't 100% guaranteed though, and there's even more leeway with the new 64-bit ID numbers on twitter.com.

Typical flow for a desktop/mobile client looks like:

  • hit the API for a timeline
    • receive the latest 20 statuses and show/store them
    • store the most recent status ID for later
  • ...time passes...
  • hit the API for a timeline again, passing a since_id= parameter with the saved status ID
    • receive the latest (up to) 20 statuses newer than the one we asked about -- avoiding transfer of duplicates we've already seen -- and show/store them
    • store the most recent status ID for later
  • ...lather, rinse, repeat.

Long ago, in the before-time, the Twitter API used a since param instead of since_id, taking a timestamp, but there ain't one anymore.

So we've got the task of being given an id number as a cutoff/starting point, but on a stream ordered by timestamp. Eg, if 200 ancient messages get imported after my latest post, clients won't want to see my ancient imports (the next items by id) -- they want to see what else I've been posting lately, which might be mixed in or after.

This probably makes sense for our processing:

  • if given a since_id, look up that notice.
  • pull the given notice's created timestamp
    • if it's been deleted, pull TS from deleted_notice
      • if never seen it, skip the since_id
  • use the combo of timestamp *and* the id as cutoff, ordering on both of them:
    • select blah from notice where (notice.created = since_ts and notice.id > since_id) or (notice.created > since_ts) order by notice.created desc, notice.id desc;

If a client takes the first id number in the timeline for its next since_id, this should work perfectly for the common case of fetching the most recent posts without shipping duplicates you've already seen.

If the client sorts the numbers and takes the highest id number for its since_id, it could end up with an earlier timestamp cutoff value than expected. Since we pull from the top of the timeline, this should be just fine: we'll always get the latest stuff, but might end up with some duplicates that have already been seen.


There's also max_id available in some places, which a client could use to page backwards, showing what comes before some status already known.

Working much the same as since_id should probably work in most cases:

  • if given a max_id, look up that notice.
  • pull the given notice's created timestamp
    • if it's been deleted, pull TS from deleted_notice
      • if never seen it, skip the max_id
  • use the combo of timestamp *and* the id as cutoff, ordering on both of them:
    • select blah from notice where (notice.created = since_ts and notice.id <= since_id) or (notice.created < since_ts) order by notice.created desc, notice.id desc;

[edit] DB indexing

Choosing and sorting rows from large tables can be very inefficient on a relational database unless indexes are provided that let the database engine take some shortcuts for you.

Public timeline looks something like this:

 select * from notice
   order by created desc, id desc
   limit 20

This requires a compound index whose first element is 'created' and whose second element is 'id' to run efficiently; otherwise MySQL will end up copying data to a temporary table and sorting it again. Yuck! This index is currently missing.

Home timeline pulls from #inbox packing, see below.

Profile timeline is more like:

 select * from notice
   where profile_id=1234
   order by created desc, id desc
   limit 20

This requires narrowing down to the posting profile first, then using our timestamp & id to sort. This index is currently defined as KEY `notice_profile_id_idx` (`profile_id`,`created`,`id`)

Group timeline pulls in from another table, which might be scarier...

 select * from group_inbox
   where group_id=1234
   order by created desc, id desc
   limit 20

This requires an index on (group_id, created, notice_id) This index must be added.

Tag timeline pulls from another table, very similar...

 select * from notice_tag
   where tag='foo'
   order by created desc, id desc
   limit 20

Needs an index covering (tag, created, id). This index must be added.

Tagged timelines within a given user or group are currently not well indexable as they must use either the profile/group-based indexes, or the tag-based indexes. Adding a profile_id field (and an index covering it with the others) to notice_tag could fix that up for users; groups might take throwing in another table duplicating the tag info.

Search either runs in the DB then sorts, or runs in external Sphinx and has it sort for us. Doing the sort by (created, id) should be ok on sql search; sphinx might need to be tweaked to provide the proper sortable field combo.

[edit] Inbox packing

Currently, user home timeline inboxes are stored in a packed structure for efficiency, which simply packs notice IDs as 32-bit integers in a big linear blob.

New notices are always inserted at the top -- older notices such as imports shouldn't actually show up in inboxes.

As long as ordering is built consistently, we might actually be able to get away without changing this format. :D

Personal tools
Namespaces
Variants
Actions
Navigation
Status.net
Toolbox