StatusNet Scalability and Virtual Appliances

From StatusNet


A discussion of how the infrastructure scales with the number of users on the system.

[Image: Candrews scale.jpeg]

  • scaling from 10^0-10^8 users
    • 10^0-10^1:
      • Shared hosted
      • limited servers
      • LAMP
    • 10^2-10^3:
      • Virtual/limited environment
    • 10^4-10^6:
      • Larger servers
      • Owned by the site
      • multiple servers
    • 10^7-10^x:
      • Many DBs
      • Arrays of servers
      • Offline systems
      • Non-LAMP implementations


As the infrastructure grows, different services are moved to dedicated servers, and multiple master/slave DB servers are added as bottlenecks appear.

On the single-user end: needs some work to make sure it's really usable and integrates well.

On the really big end: some issues with performance, and with the interface as well.

Small-scaling and background issues

Some third-party connections (IM etc.) require background daemons to keep network connections open. These often aren't feasible on shared hosting -> point people to our hosting, where we have the infrastructure.

Other things need occasional background processing that's OK to do during web hits -> those we could provide (say, for Twitter status fetching, maybe?)
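A minimal sketch of doing occasional background work during web hits, assuming a single long-lived process. The class name `OpportunisticTask` and its parameters are hypothetical and not part of StatusNet; on real shared hosting the last-run timestamp would have to live in the database or a cache so that all web processes see it.

```python
import time


class OpportunisticTask:
    """Runs a cheap job at most once per interval, triggered by web hits."""

    def __init__(self, job, interval_seconds, clock=time.monotonic):
        self.job = job                  # callable doing the background work
        self.interval = interval_seconds
        self.clock = clock              # injectable so the timing can be tested
        self.last_run = None

    def maybe_run(self):
        """Call from the request handler; returns True if the job ran."""
        now = self.clock()
        if self.last_run is not None and now - self.last_run < self.interval:
            return False
        # Record the time first so a failing job is not retried on every hit.
        self.last_run = now
        self.job()
        return True
```

A request handler would call `maybe_run()` after sending its response, so the extra work never delays the page that triggered it.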

Some, like Facebook, involve more overhead in setup than in actual operation; can we make these easier?

Note to self

  • build a mail-image-to-wiki-upload to make it piss-easy for us to upload meeting whiteboard photos to the wiki ;)

Search notes

  • short-term: delta indexes for quicker updates
    • in progress
  • long-term: split current-ish dataset from old archives (don't search 5-10 years of archives when 99% of time you return results from last week)
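The long-term idea above (search the current-ish dataset first, touch the archives only when needed) could look roughly like this. Both index objects are hypothetical callables taking `(query, limit)`; `make_index` is a toy stand-in for a real search backend, not an actual search API.

```python
def make_index(notices):
    """Toy stand-in for a search index: substring match, in list order."""
    def lookup(query, limit):
        return [n for n in notices if query in n][:limit]
    return lookup


def search(query, recent_index, archive_index, limit=20):
    """Search the recent index first; consult the archive only if needed."""
    hits = recent_index(query, limit)
    if len(hits) >= limit:
        return hits  # common case: the archive is never touched
    # Not enough recent results: top up from the (slower) archive.
    return hits + archive_index(query, limit - len(hits))
```

If most queries are satisfied from last week's data, the archive index is only hit for the rare query that runs out of recent results.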

Per-user data metrics?

  • what can we measure clearly?
    • amount of data produced by a user
    • amount of traffic related to a user
  • -> help people know when they might want to split out an instance
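As a rough illustration of the two measurements above, the sketch below sums data produced and traffic per user and flags users who exceed a limit. The event format, function names, and thresholds are all assumptions made for the example; nothing here reflects how StatusNet actually records usage.

```python
from collections import defaultdict


def per_user_totals(events):
    """Sum bytes per user.

    events: iterable of (user, kind, nbytes), where kind is
    "data" (content produced) or "traffic" (bytes served).
    """
    totals = defaultdict(lambda: {"data": 0, "traffic": 0})
    for user, kind, nbytes in events:
        totals[user][kind] += nbytes
    return dict(totals)


def split_candidates(totals, data_limit, traffic_limit):
    """Users whose data or traffic exceeds a limit, sorted by name."""
    return sorted(
        user
        for user, t in totals.items()
        if t["data"] > data_limit or t["traffic"] > traffic_limit
    )
```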

Replication lag

  • we have a *lot* of writes and do real-time work, so any lag causes massive trouble
  • big operations need to be broken up to avoid lag (say, user delete w/ a million notices)
    • needs more work on our end to do them consistently and not have trouble with locking
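One way to break up a big operation such as a user delete is to remove rows in small batches, committing between them so locks are short-lived and replicas can keep up. The sketch below uses SQLite only to stay self-contained; the table and column names (`notice`, `profile_id`) and the function itself are assumptions for illustration. MySQL does not accept `LIMIT` inside an `IN` subquery, so there the same idea would use `DELETE ... LIMIT n` directly.

```python
import sqlite3
import time


def delete_notices_in_batches(conn, user_id, batch_size=1000, pause=0.0):
    """Delete a user's notices in small transactions; returns rows deleted."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM notice WHERE id IN "
            "(SELECT id FROM notice WHERE profile_id = ? LIMIT ?)",
            (user_id, batch_size),
        )
        conn.commit()  # short transaction: locks are released between batches
        if cur.rowcount == 0:
            return total
        total += cur.rowcount
        time.sleep(pause)  # optional breathing room for replicas to catch up
```

The trade-off is that the delete is no longer atomic: other requests can observe a half-deleted user, which is the consistency work the notes refer to.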

Scaling out number of sites

  • per-site processes - bad!
    • daemons -> run a few processes and base on actual activity via queue
    • polling is dangerous (lots of bg processes)
    • cycling through list of all sites is dangerous (lots of lag)
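A sketch of the "few processes driven by a queue" approach: a fixed pool of workers blocks on one shared queue, and each item carries the site it belongs to, so idle sites cost nothing and no process loops over the site list. Threads and an in-process `queue.Queue` stand in for real daemons and a real message queue; `run_workers` and the `(site, payload)` item shape are hypothetical.

```python
import queue
import threading


def run_workers(work_queue, handler, num_workers=2):
    """Start a fixed pool of workers that serve every site from one queue."""
    def worker():
        while True:
            item = work_queue.get()  # blocks until there is work; no polling
            if item is None:         # sentinel: shut this worker down
                work_queue.task_done()
                return
            site, payload = item
            try:
                handler(site, payload)
            finally:
                work_queue.task_done()

    threads = [
        threading.Thread(target=worker, daemon=True)
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    return threads
```

The number of workers is sized to actual activity across all sites rather than to the number of sites.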

Privacy and file storage

  • site themes always public
  • avatars, attachments, etc need to be private for private sites
    • can store in filesystem just fine, but need to push it through a secure loading layer
      • ^ auth layer could be slow if we have a lot of traffic here
  • file storage gets slow with large amount of data -> hash subdirs
  • public files are easy to push to a CDN
    • performance is now someone else's problem :D
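The "hash subdirs" idea can be sketched as follows: derive a couple of short directory names from a hash of the filename so that no single directory accumulates millions of entries. The function name, the choice of SHA-1, and the two-level, two-character layout are assumptions for illustration, not StatusNet's actual layout.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_path(root, filename, levels=2, width=2):
    """Return root/<h1>/<h2>/filename, with h1, h2 taken from a hash prefix."""
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    # Take consecutive slices of the hex digest as nested directory names.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(root, *parts, filename)
```

With the defaults this spreads files over 256 * 256 directories, and the path for a given filename is always the same, so no lookup table is needed.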

Meteor

  • CPU usage really high w/ Meteor... considering Orbited

SMTP

  • solved problem :)
    • postfix etc

XMPP

  • ejabberd scaling problems :(
    • huge number of accounts, lots of friend pinging, slow to restart etc
  • Prosody better...?
    • more direct integration so we only have to ping relevant remotes?
  • Can we find out what Google does? (Google Talk and Wave: large XMPP systems with reasonably reliable servers)

Reliability

  • very important for business use within the firewall
    • can we make it easier to do failover etc.?