StatusNet Scalability and Virtual Appliances
A discussion of how the infrastructure scales with the number of users on the system.
- scaling from 10^0 to 10^8 users
  - 10^0-10^1:
    - Shared hosting
    - limited servers
    - LAMP
  - 10^2-10^3:
    - Virtual/limited environment
  - 10^4-10^6:
    - Larger servers
    - Owned by the site
    - multiple servers
  - 10^7 and up:
    - Many DBs
    - Arrays of servers
    - Offline systems
    - Non-LAMP implementations
As the infrastructure grows, individual services move to dedicated servers, and additional DB master/slave servers are added as bottlenecks appear.
On the single-user end: more work is needed to make sure it's really usable and integrates well.
On the really big end: there are performance issues, and interface issues as well.
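Adding master/slave DB servers usually means sending writes to the master and spreading reads across the slaves. Below is a minimal Python sketch of that routing. It assumes queries arrive as SQL strings and that hosts are plain names. `DbRouter` and the host names are hypothetical and are not StatusNet's own code.

```python
import itertools
from typing import Sequence


class DbRouter:
    """Send writes to the master and round-robin reads across slaves.

    Hypothetical helper: hosts are plain strings here; a real deployment
    would hold connections or connection pools instead.
    """

    def __init__(self, master: str, slaves: Sequence[str]) -> None:
        self.master = master
        # With no slaves configured, reads fall back to the master.
        self._slaves = itertools.cycle(list(slaves) or [master])

    def host_for(self, sql: str) -> str:
        # Crude classification: only plain SELECTs are sent to a slave.
        if sql.lstrip().upper().startswith("SELECT"):
            return next(self._slaves)
        return self.master
```

Because slaves can lag behind the master, a read that must see a write the same request just made should still go to the master.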
Small-scaling and background issues
Some third-party connections (IM, etc.) require background daemons to keep network connections open. These often aren't feasible on shared hosting -> point people to our hosting, where we have the infrastructure.
Other features need only occasional background processing, which is OK to do during web hits -> those we could provide (say, for Twitter status fetching?)
Some, like Facebook, have more overhead in setup than in actual operation; can we make setup easier?
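Doing occasional background work during web hits is often done by piggybacking a small, bounded batch onto ordinary requests whenever the task is due. The following is a minimal Python sketch of that idea, assuming the task can process work in batches and report how much it did. `PiggybackRunner` and its parameters are hypothetical.

```python
import time
from typing import Callable


class PiggybackRunner:
    """Run a small slice of background work during ordinary web requests.

    Hypothetical helper for hosts where no daemon can stay resident: each
    request checks whether the task is due and, if so, runs one bounded batch.
    """

    def __init__(self, task: Callable[[int], int], interval: float, batch: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.task = task          # task(batch) handles up to `batch` items, returns count
        self.interval = interval  # minimum seconds between runs
        self.batch = batch
        self.clock = clock
        self.last_run = float("-inf")

    def on_request(self) -> int:
        now = self.clock()
        if now - self.last_run < self.interval:
            return 0  # not due yet; keep this request fast
        self.last_run = now
        return self.task(self.batch)
```

On real shared hosting, each request may run in a fresh process. In that case `last_run` would have to live in the database or a file, and two simultaneous requests could both decide the task is due. Keeping batches small limits the damage from that.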
Note to self
- build a mail-image-to-wiki upload tool to make it piss-easy for us to upload meeting whiteboard photos to the wiki ;)
Search notes
- short term: delta indexes for quicker updates
  - in progress
- long term: split the current-ish dataset from old archives (don't search 5-10 years of archives when 99% of the time you return results from last week)
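Splitting recent data from archives lets most queries stop after the small recent index. Here is a minimal Python sketch of such a tiered lookup. It assumes each tier exposes a search function returning newest-first results, and that the recent tier is always newer than the archive. `Notice`, `tiered_search` and the two search callables are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Notice:  # hypothetical minimal record
    id: int
    created: float  # Unix timestamp
    text: str


def tiered_search(query: str, limit: int,
                  search_recent: Callable[[str, int], List[Notice]],
                  search_archive: Callable[[str, int], List[Notice]]) -> List[Notice]:
    """Search the small recent index first; touch the archive only if needed."""
    hits = search_recent(query, limit)
    if len(hits) >= limit:
        return hits[:limit]  # common case: the archive is never queried
    # Ask the archive only for the results still missing.
    older = search_archive(query, limit - len(hits))
    return hits + older[: limit - len(hits)]
```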
Per-user data metrics?
- what can we measure clearly?
  - amount of data produced by a user
  - amount of traffic related to a user
- -> help people know when they might want to split a user out into a separate instance
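The two metrics above can be aggregated per user and expressed as shares of the site-wide totals, which makes outliers easy to spot. The following is a minimal Python sketch, assuming notice sizes and request hits can be attributed to a user name. The input shapes and `per_user_metrics` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def per_user_metrics(notices: Iterable[Tuple[str, int]],
                     hits: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Aggregate data produced and traffic generated per user.

    notices: hypothetical (user, size_in_bytes) pairs, one per notice posted
    hits:    hypothetical user names, one per request touching that user
    Returns each user's totals plus their share of the site-wide totals.
    """
    data = Counter()
    for user, size in notices:
        data[user] += size
    traffic = Counter(hits)
    total_data = sum(data.values()) or 1       # avoid dividing by zero
    total_traffic = sum(traffic.values()) or 1
    return {
        user: {
            "bytes": data[user],
            "hits": traffic[user],
            "data_share": data[user] / total_data,
            "traffic_share": traffic[user] / total_traffic,
        }
        for user in set(data) | set(traffic)
    }
```

The share at which a user should move to their own instance is a judgment call, not something these numbers decide on their own.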
Replication lag
- we have a *lot* of writes and do real-time work, so any lag causes massive trouble
- big operations need to be broken up to avoid lag (say, deleting a user with a million notices)
  - needs more work on our end to do this consistently without running into locking trouble
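Breaking up a big delete usually means deleting a bounded batch per transaction and pausing while the slaves catch up. Below is a minimal Python sketch using the standard `sqlite3` module so it runs standalone. The `notice(id, profile_id)` schema and the `replica_lag` callable are hypothetical stand-ins. With MySQL, the single-table `DELETE ... WHERE ... LIMIT n` form is the usual equivalent of the subquery used here.

```python
import sqlite3
import time
from typing import Callable


def delete_in_batches(conn: sqlite3.Connection, profile_id: int,
                      batch_size: int = 1000,
                      replica_lag: Callable[[], float] = lambda: 0.0,
                      max_lag: float = 1.0, pause: float = 0.5) -> int:
    """Delete a user's notices in small transactions instead of one huge one.

    Each batch commits separately, so locks are held briefly and each
    replicated statement stays small. Between batches, wait while the
    (hypothetical) replica_lag() reports more than max_lag seconds.
    """
    total = 0
    while True:
        # Hypothetical schema: notice(id INTEGER PRIMARY KEY, profile_id INTEGER)
        cur = conn.execute(
            "DELETE FROM notice WHERE id IN "
            "(SELECT id FROM notice WHERE profile_id = ? LIMIT ?)",
            (profile_id, batch_size),
        )
        conn.commit()
        if cur.rowcount == 0:
            return total
        total += cur.rowcount
        while replica_lag() > max_lag:
            time.sleep(pause)
```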
Scaling out number of sites
- per-site processes - bad!
- daemons -> run a few shared processes and schedule work based on actual activity, via a queue
- polling is dangerous (lots of background processes)
- cycling through the list of all sites is dangerous (lots of lag)
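A queue-driven design replaces per-site daemons with a fixed pool of workers shared by every site, each job carrying the site it belongs to. The following is a minimal Python sketch using the standard `queue` and `threading` modules. The `(site, payload)` job shape, `STOP` and `start_workers` are hypothetical.

```python
import logging
import queue
import threading
from typing import Callable, List, Optional, Tuple

Job = Tuple[str, str]  # hypothetical (site, payload) shape
STOP = None            # sentinel telling a worker to exit


def start_workers(jobs: "queue.Queue[Optional[Job]]",
                  handle: Callable[[str, str], None],
                  count: int = 4) -> List[threading.Thread]:
    """Start a fixed pool of workers shared by every site.

    Workers block on the queue, so an idle site costs nothing: no per-site
    process, no polling loop, and no sweep over the full list of sites.
    """
    def worker() -> None:
        while True:
            job = jobs.get()
            if job is STOP:
                jobs.task_done()
                return
            site, payload = job
            try:
                handle(site, payload)  # handler loads that site's config itself
            except Exception:
                # One bad job must not kill a worker shared by all sites.
                logging.exception("job failed for site %s", site)
            finally:
                jobs.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(count)]
    for t in threads:
        t.start()
    return threads
```

In production the in-process queue would be a shared queue server, but the shape is the same: worker count follows load, not the number of sites.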
Privacy and file storage
- site themes always public
- avatars, attachments, etc. need to be private for private sites
- can store in filesystem just fine, but need to push it through a secure loading layer
  - the auth layer could be slow if we have a lot of traffic here
- file storage gets slow with large amounts of data -> use hashed subdirectories
- public files are easy to push to a CDN
  - performance is now someone else's problem :D
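Hashed subdirectories and a secure loading layer can be combined in a few lines. This minimal Python sketch spreads files by the leading hex characters of a SHA-1 of the file name. It serves a file only after an access check when the site is private. `hashed_path`, `load_file`, the two-level layout and the `user_may_view` callable are all hypothetical choices, and file names are assumed to be plain names generated by the application.

```python
import hashlib
import os
from typing import Callable, Optional


def hashed_path(root: str, filename: str, levels: int = 2) -> str:
    """Spread files over nested subdirectories named by hash prefix.

    With two levels of two hex characters, files land in up to 256 * 256
    directories, so no single directory grows huge.
    """
    if filename in ("", ".", "..") or os.path.basename(filename) != filename:
        raise ValueError("expected a plain file name")  # no path traversal
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    parts = [digest[2 * i: 2 * i + 2] for i in range(levels)]
    return os.path.join(root, *parts, filename)


def load_file(root: str, filename: str, site_is_private: bool,
              user_may_view: Callable[[], bool]) -> Optional[bytes]:
    """Hypothetical secure loading layer: check access on private sites only."""
    if site_is_private and not user_may_view():
        return None  # caller turns this into a 403/404
    with open(hashed_path(root, filename), "rb") as fh:
        return fh.read()
```

Only private sites pay for the access check; public sites can skip this layer entirely and serve the same paths from a CDN.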
Meteor
- CPU usage is really high with Meteor... considering Orbited
SMTP
- solved problem :)
- Postfix, etc.
XMPP
- ejabberd scaling problems :(
  - huge number of accounts, lots of friend pinging, slow to restart, etc.
- Prosody better...?
- more direct integration so we only have to ping relevant remotes?
- Can we find out what Google does? (Google Talk and Wave are large XMPP systems with reasonably reliable servers)
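One possible reading of "only ping relevant remotes" is to group a notice's subscribers by server and contact only the remote servers that actually have subscribers, skipping local users. This minimal Python sketch does that grouping from JIDs of the form `user@domain/resource`. `remotes_to_notify` and the example domains are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def remotes_to_notify(subscriber_jids: Iterable[str],
                      local_domain: str) -> Dict[str, List[str]]:
    """Group subscribers by server so we contact only servers that need it.

    Local users are skipped (delivered in-process); each remote domain gets
    one entry listing only the bare JIDs that actually subscribe.
    """
    by_domain: Dict[str, List[str]] = defaultdict(list)
    for jid in subscriber_jids:
        bare = jid.split("/", 1)[0]              # drop any resource part
        domain = bare.rsplit("@", 1)[-1].lower()
        if domain != local_domain.lower():
            by_domain[domain].append(bare)
    return dict(by_domain)
```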
Reliability
- verrry important for businesses running it within the firewall
- can we make failover, etc. easier?
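On the client side, one small piece of easier failover is trying a list of hosts in order and using the first that answers. This minimal Python sketch assumes a `connect(host)` callable that raises `OSError` (or a subclass) on network failure. `connect_with_failover` is hypothetical, and real database drivers may raise their own exception types instead.

```python
from typing import Callable, List, Sequence, TypeVar

Conn = TypeVar("Conn")


def connect_with_failover(hosts: Sequence[str],
                          connect: Callable[[str], Conn]) -> Conn:
    """Try each host in order and return the first connection that succeeds."""
    errors: List[str] = []
    for host in hosts:
        try:
            return connect(host)
        except OSError as exc:  # network-level failures; adjust for your driver
            errors.append(f"{host}: {exc}")
    raise ConnectionError("all hosts failed: " + "; ".join(errors))
```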