July 9, 2008
Scaling in the 'cloud', a few case scenarios : Memcached & EC2
I've seen a few stories in recent months all having to do with scaling. There's of course the Google I/O conference, which covered what can be considered 'proprietary', 'non-available' technology; the Twitter scaling nightmare -- which is backed by Ruby/Rails BTW -- and yet another on how Facebook is buying thousands of servers to support its traffic -- which is backed by PHP of all things. It's no doubt an interesting topic, but what are the real technical options out there for scaling? This entry will cover a few case scenarios I've worked on related to scaling, elaborating on the software that makes it all happen.
I will spare you the introduction on how applications written in 'x' programming language scale better than those written in 'y' programming language, since it's somewhat obvious that some languages are better designed to handle memory management, others are compiled, others are interpreted (scripts), etc. That's not the point. Facebook uses PHP, which is considered a 'dog' by many in terms of scaling, and yet they manage to serve millions of pages a day without major hiccups.
Case in point: if you have enough money to throw at it, you can get anything to scale. Not a very enlightening argument of course, so money aside, what routes are there to scaling? I've seen two: caching and using more hardware resources, the latter of which 'theoretically' has become easier with 'grid' or 'elastic' computing.
Caching and Memcached
If an application requires minimal input from users and has several thousand visitors on a daily basis, you almost certainly can't go wrong using caching. The issue with caching is that it's supported at many levels: you can have database caching, the middle tier can have some type of programmatic caching at the code level, or caching can be done on an entire page, not to mention the series of specialized products designed to solve caching issues.
I've personally seen mixed results when caching is attempted at any point other than the last possible tier, which is on entire pages right before they are dispatched by a web server. The mixed results I've seen using caching in deeper tiers -- programmatically or on the DB -- are due to 'stale' data that was never intended to behave that way. In other words, having cache upon cache can make it more difficult to get 'fresh' data snapshots the way they are intended.
As for caching entire pages right before they are dispatched, I've used memcached with great success on applications written in both Java and Python, with the biggest pluses of memcached being: it's application-language agnostic, it's open source and it simply works.
Many high-traffic sites use memcached, among them Slashdot, Wikipedia and Kayak (a travel search engine) [ Ref. Memcached users ]. Memcached lets you store page snapshots with an expiration time, sparing the underlying hardware the trouble of crunching together the same page over and over again. So if a page is heavily based on DB reads and it's not that time sensitive -- say it can be 1-2 hours old -- it can take a big load off a server, not to mention it makes a site extremely responsive.
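To make the page-snapshot idea concrete, here is a minimal sketch in Python. It is not tied to any particular memcached client: `TinyCache` is a hypothetical in-process stand-in with a `get`/`set` shape I'm assuming for illustration, and `cached_page`, `render` and the one-hour default are likewise made up for the example. In a real deployment a memcached client would take the place of `TinyCache`.

```python
import time


class TinyCache:
    """Hypothetical in-process stand-in for a memcached client."""

    def __init__(self, clock=time.monotonic):
        self._store = {}
        self._clock = clock

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # The snapshot is too old, treat it as a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value, expire):
        # 'expire' is a lifetime in seconds from now.
        self._store[key] = (value, self._clock() + expire)


def cached_page(cache, path, render, ttl_seconds=3600):
    """Return the snapshot for 'path', rendering it only on a miss."""
    key = "page:" + path
    html = cache.get(key)
    if html is None:
        html = render(path)  # the expensive, DB-heavy step
        cache.set(key, html, ttl_seconds)
    return html
```

With this in place, the first request for a path pays for the render and every request within the next hour is served straight from memory.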
But alas, memcached can only take you so far in scaling. The first obvious rub comes when a page is based on user data, causing every single user to have their own page snapshot and making this type of caching pretty much a moot point. The next rub is not so obvious at first glance, but it comes when a site has too many pages to cache.
Since memcached places everything in memory, it will eventually run out of it, and as a consequence start throwing out what it has saved up (memcached evicts the least recently used entries to make room). So suppose you have a server with 3 GB and assign a conservative 1 GB just for memcached. On a site having a little more than this amount of page data, it will soon do the following: Cache first 1000 pages amounting to 1 GB - incoming request for page 1001 not found in memcached - ( EVICT * no more memory ) - incoming request for page 50 - (Oops, no longer in memcached) - CRUNCH-Page 50-CRUNCH - incoming request for page 100 - (Oops, no longer in memcached) - CRUNCH-Page 100-CRUNCH...and repeat.
The CRUNCH-CRUNCH represents hitting the DB again, using up CPU and all the resources it took to build the page in the first place, a process that can get pretty nasty when the data that needs to be cached is over the 4 GB memory limit of 32-bit servers. But you get the point: if you have too much data, you will eventually need to use some other caching strategy or throw more hardware at it.
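The 'too many pages' scenario can be sketched with a small simulation. This is only an illustration under assumptions of mine: a least-recently-used cache modeled with an `OrderedDict`, pages counted rather than measured in bytes, and visitors requesting every page in the same order over and over, which is the worst case. The function name `simulate` is made up for the example.

```python
from collections import OrderedDict


def simulate(capacity, total_pages, rounds):
    """Request pages 1..total_pages in order, 'rounds' times, through an LRU cache.

    Returns (hits, misses); every miss stands for one CRUNCH.
    """
    cache = OrderedDict()
    hits = misses = 0
    for _ in range(rounds):
        for page in range(1, total_pages + 1):
            if page in cache:
                cache.move_to_end(page)  # mark as most recently used
                hits += 1
            else:
                misses += 1
                cache[page] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
    return hits, misses


print(simulate(1000, 1000, 3))  # (2000, 1000): everything fits
print(simulate(1000, 1001, 3))  # (0, 3003): one page too many, every request misses
```

Real traffic is rarely this regular, so an actual site would land somewhere between these two extremes, but it shows how a cache slightly smaller than the data it serves can end up doing almost no good.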
More Hardware and 'grid'/'cloud' computing
From what I've experienced, adding more hardware in order to scale an application is done when all other options -- including caching -- have been completely exhausted. This is not only because scaling with more hardware is more expensive, but also because it can be more complicated to get right.
I hear you: "now there is 'cloud/elastic' computing where you pay for what you use and scale at will!" Yes, I've heard of it and even tinkered with the leading provider, which is apparently Amazon EC2, but I'm not too sold on the idea, though I have seen some pretty nifty stuff which I will mention toward the end.
I don't know about you, but choosing a hosting/service provider was probably the last frontier where a technology company didn't have you locked in. Providers of Windows, Linux, Solaris and other OS hosting still abound. 'Price jack-up?' 'Incompetence?' Move along to the next provider. Well, not so with 'cloud/elastic' computing.
If you want to take advantage of the current model offered by 'elastic' computing, paying only for what you use and scaling at will, you will have to pay another not-so-evident cost: buying into the 'elastic' architecture of a provider, something that will surely scare more than a few potential users -- including myself.
There is however one circumstance in which it just might be worth buying into a provider's 'elastic' platform: CPU scaling, which I would consider a monkey no one can get off their back. Scaling an application with pre-existing snapshots of data can be done by resorting to software like memcached or other caching techniques, without surrendering an application to any provider's 'elastic' lock-in sauce, not to mention that caching and similar techniques can put off the addition of hardware until later.
Not so with CPU scaling. If generating an application page requires constant crunching of data for each visitor, then CPU can be a huge bottleneck, requiring immediate expansion to accommodate new traffic.
While I haven't had a need to do just this, I read up on one impressive case scenario where this was precisely the case. The requirements: convert 405,000 very large TIFF images, 3.3 million articles in SGML and 405,000 XML files into 810,000 PNG images. Now that's some serious CPU crunching! Not to mention a one-time event, for which 'pay as you use' is more than compelling.
The project was undertaken by the New York Times, and not only uses 'elastic' computing through Amazon's EC2, but also MapReduce by way of Hadoop, an open-source implementation of the technique used heavily inside Google for the purpose of scaling -- material for another entry in itself.
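For readers who have not run into MapReduce, the programming model itself fits in a few lines. What follows is a single-process sketch of the map, shuffle and reduce steps using the classic word-count example. It is not the Times' pipeline and does not use Hadoop's API; every function name here is made up for illustration. The point of the model is that the map and reduce steps work on independent pieces, which is what lets a framework spread them over many rented machines.

```python
from collections import defaultdict


def map_phase(documents, mapper):
    """Apply the mapper to every input and collect its (key, value) pairs."""
    for doc in documents:
        yield from mapper(doc)


def shuffle(pairs):
    """Group all values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def count_words(doc):
    # Mapper: emit (word, 1) for every word in one document.
    for word in doc.lower().split():
        yield word, 1


def total(key, values):
    # Reducer: add up the counts collected for one word.
    return sum(values)


def word_count(documents):
    return reduce_phase(shuffle(map_phase(documents, count_words)), total)
```

Calling `word_count(["a b a", "B c"])` gives `{"a": 2, "b": 2, "c": 1}`.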
You can read more about this N.Y. Times scaling project at: New York Times - Timemachine, and more on the technology they used at: Self-service, Prorated Super Computing Fun.
Posted by Daniel at July 9, 2008 1:05 PM