StatusNet Scalability and Virtual Appliances

Notes on how the infrastructure scales with the number of users on the system.




 * Scaling from 10^0 to 10^8 users:
   * 10^0-10^1:
     * shared hosting
     * limited servers
     * LAMP
   * 10^2-10^3:
     * virtual/limited environment
   * 10^4-10^6:
     * larger servers
     * owned by the site
     * multiple servers
   * 10^7-10^x:
     * many DBs
     * arrays of servers
     * offline systems
     * non-LAMP implementations

As the infrastructure grows, different services are moved to dedicated servers, and multiple DB master/slave servers are added as bottlenecks appear.

On the single-user end: needs some work to make sure it's really usable and integrates well.

On the really big end: some issues with performance, and with the interface as well.

Small-scaling and background issues
Some external-party connections require background daemons to keep network connections open (IM, etc.); these often aren't feasible on shared hosting -> point people to our hosting, where we have the infrastructure.

Other things need occasional background processing that's OK to do during web hits -> those we could provide (say, for Twitter status fetching, maybe?)
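The "occasional processing during web hits" idea could look roughly like this minimal Python sketch: at the end of a request, run any deferred task whose interval has elapsed. The task registry and in-memory last-run table are hypothetical; in a real LAMP setup the timestamps would live in the DB or memcached, not in process memory.

```python
import time

# Hypothetical last-run timestamps, keyed by task name.
_last_run = {}

def run_due_tasks(tasks, now=None):
    """Run any task whose interval has elapsed; meant to piggyback on a web hit.

    tasks maps a task name to (interval_seconds, callable)."""
    now = time.time() if now is None else now
    ran = []
    for name, (interval, func) in tasks.items():
        if now - _last_run.get(name, 0) >= interval:
            _last_run[name] = now
            func()
            ran.append(name)
    return ran
```

A request handler would call `run_due_tasks()` just before returning, so sites with no daemon support still get periodic work done whenever they get traffic.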

Some, like Facebook, are more overhead in setup than in actual operation; can we make those easier?

Note to self

 * build a mail-image-to-wiki-upload to make it piss-easy for us to upload meeting whiteboard photos to the wiki ;)

Search notes

 * short-term: delta indexes for quicker updates (in progress)
 * long-term: split the current-ish dataset from the old archives (don't search 5-10 years of archives when 99% of the time you return results from the last week)

Per-user data metrics?

 * What can we measure clearly?
   * amount of data produced by a user
   * amount of traffic related to a user
 * -> helps people know when they might want to split out an instance
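As a rough illustration, the "data produced" metric could be aggregated like this (the `(user_id, text)` notice tuples are a stand-in for however notices are actually stored; a real version would also fold in delivery fan-out as a traffic proxy):

```python
def per_user_metrics(notices):
    """Aggregate rough per-user load from (user_id, text) notice tuples:
    returns user_id -> (notice_count, total_bytes_produced)."""
    stats = {}
    for user_id, text in notices:
        count, size = stats.get(user_id, (0, 0))
        stats[user_id] = (count + 1, size + len(text.encode("utf-8")))
    return stats
```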

Replication lag

 * we have a *lot* of writes and do real-time work, so any lag causes massive trouble
 * big operations need to be broken up to avoid lag (say, a user delete with a million notices)
 * needs more work on our end to do these consistently and not run into locking trouble
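The breaking-up of big operations could be sketched like this in Python. The `delete_batch` callable is an assumption: it stands in for something like a `DELETE ... LIMIT n` statement that returns the number of rows it actually removed.

```python
import time

def chunked_delete(delete_batch, batch_size=1000, pause=0.0):
    """Delete in small batches instead of one huge statement, so each write
    replicates quickly and the slaves never fall far behind.

    delete_batch(n) is assumed to delete up to n rows and return how many
    it actually deleted."""
    total = 0
    while True:
        deleted = delete_batch(batch_size)
        total += deleted
        if deleted < batch_size:
            return total
        if pause:
            time.sleep(pause)  # give replication a chance to catch up
```

Each small batch also takes its locks briefly, instead of one giant statement holding locks (and stalling every slave) for the whole million-row delete.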

Scaling out number of sites

 * per-site processes - bad!
 * daemons -> run a few shared processes and dispatch based on actual activity via a queue
 * polling is dangerous (lots of background processes)
 * cycling through a list of all sites is dangerous (lots of lag)
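The queue-driven approach above could look like this Python sketch: a few shared workers drain one cross-site queue, and each queue item names the site it belongs to, so idle sites cost nothing. The handler mapping and the `None` shutdown sentinel are illustrative assumptions.

```python
from queue import Queue

def worker(queue, handlers):
    """One shared worker draining a single cross-site activity queue.
    Each item is (site_id, payload); handlers maps site_id to a callable.
    A None item shuts the worker down and returns the sites it served."""
    processed = []
    while True:
        item = queue.get()
        if item is None:
            return processed
        site, payload = item
        handlers[site](payload)
        processed.append(site)
```

The key property is that work is driven by actual activity arriving on the queue, not by polling every site or cycling through the full site list.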

Privacy and file storage

 * site themes are always public
 * avatars, attachments, etc. need to be private for private sites
 * can store them in the filesystem just fine, but need to serve them through a secure loading layer
   * that auth layer could be slow if we get a lot of traffic here
 * file storage gets slow with a large amount of data -> hash subdirectories
 * public files are easy to push to a CDN
   * performance is now someone else's problem :D
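The hash-subdirectory trick could be sketched like this in Python: derive a couple of directory levels from a hash of the filename, so no single directory ever holds millions of entries. The layout (two hex characters per level) is one common convention, not necessarily what StatusNet uses.

```python
import hashlib
import os

def hashed_path(root, filename, levels=2):
    """Map a filename into nested subdirectories derived from its hash,
    e.g. 'avatar.png' -> root/<aa>/<bb>/avatar.png, where <aa> and <bb>
    are the first hex bytes of the filename's SHA-1 digest."""
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    parts = [digest[i * 2:i * 2 + 2] for i in range(levels)]
    return os.path.join(root, *parts, filename)
```

Because the path is a pure function of the name, readers and writers agree on the location without any lookup table.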

Meteor

 * CPU usage is really high with Meteor... considering Orbited

SMTP

 * solved problem :)
 * postfix etc

XMPP

 * ejabberd scaling problems :(
   * huge number of accounts, lots of friend pinging, slow to restart, etc.
 * is Prosody better...?
 * more direct integration, so we only have to ping relevant remotes?
 * can we find out what Google does? (GTalk and Wave are large XMPP systems with reasonably reliable servers)

Reliability

 * verrry important for business use within the firewall
 * can we make it easier to do failover, etc.?