Thomas Brox Røst


Me
Hi, my name is Thomas Brox Røst. I am currently doing a PhD at the Norwegian University of Science and Technology in Trondheim, Norway. I work on finding ways of reducing information overload in general practice patient records through automated knowledge extraction from the written narrative in patient histories. You can find me at the Department of Computer and Information Science or at the Norwegian EHR Research Centre.

In my spare time I develop Eventseer.net.


Homepage: http://thomas.broxrost.com


Other: Posted on August 14, 2008 - 7:15 p.m.

Serving static files with Django and AWS - going fast on a budget

Speed matters.

When Google tried adding 20 additional results to their search pages, traffic dropped by 20%. The reason? Page generation took an extra .5 seconds.

This article will show how Eventseer utilizes an often overlooked way of improving the responsiveness of a web application: Pre-generating and serving static files instead of dynamic pages.

The tools I will be using include Django, Django StaticGenerator, Django Queue Service and Amazon Web Services (AWS). Interfacing with AWS from Python is best done with the boto library, but that will not be covered in this article.

Some knowledge of Django and AWS is helpful but not required. I will also be using lighttpd with mod_magnet—this can of course be replaced with the web server of your choice.

About the site

Eventseer is an academic event tracker that at the time of writing contains some 8,000 event listings. It also has a database of 573K people, 4K research topics and 3K organizations, each of which has its own page. This adds up to almost 600K pages. All the pages are highly interconnected, so each added event tends to require a considerable number of page updates.

On the backend we use Django with a PostgreSQL database, hidden behind a lighttpd/FastCGI web server. The service is running on a single box with 4GB of RAM. Background processing (e.g. recalculating the social graph of featured academics) is done on EC2 (Elastic Compute Cloud) instances on AWS.

The problem

As traffic grew, Eventseer was becoming slower. Navigating the site would involve noticeable delays.

Traffic analysis showed the average server load to be consistently high. Some of the obvious solutions, such as reducing the number of database queries per page view or caching rendered pages, were helpful but not satisfactory.

Ultimately, the remaining bottleneck was that each and every request was sent through the full Django dynamic rendering cycle.

This is not a Django-specific problem—the same pattern holds for every site where there are dynamic elements to the pages. Having a language interpreter instance—be it PHP, Ruby or Python—render HTML will incur some unwanted overhead, no matter what.

With increasing traffic, the overhead compounds. Spreading the load across additional servers would only address the symptom, not the core problem.

For Eventseer, we were being particularly punished by search engine traffic. Being continuously pounded by multiple search engines crawling those 600K dynamically generated pages took its toll.

The fix

So why not serve static pages instead?

Web servers such as lighttpd and Apache are blazingly fast at serving static pages. Reading a file from disk and sending its contents back to the browser is as simple as it gets. No matter how trivial the web application, this will always be the fastest option.

You may have spotted a problem with this line of thinking. Every page on Eventseer has to be customized towards authenticated users. We need to show whether or not the user is keeping track of an event, the user's username, and so on. In these situations, generic static pages won't do.

Still, that does not make them useless. In fact, for most of the traffic on Eventseer.net a static page will do just fine.

Take search engine traffic. Search engines never authenticate themselves and would be perfectly happy with an un-customized static page.

The same can be said for first-time visitors. To take them to the point where they decide to sign up for the service we need to give them the best experience possible. How do we do that? In part by minimizing page load latency.

Sure, once they sign up they too will be subjected to the slower-loading dynamic pages. That, however, is tolerable. They registered with us because we were able to keep their interest long enough for them to be convinced of the product's value. In other words: We made the sale.

Also, shifting all that traffic towards static pages would free up valuable CPU cycles that could make rendering of dynamic pages even snappier.

The rest of this article will give an overview of the steps we took to make this happen on Eventseer.

Hurdle 1: Getting a static page out of Django

This is the easy part. Jared Kuolt's excellent StaticGenerator does exactly what we need.

It integrates readily with Django, but a couple of points about how to do this are worth mentioning. The easiest (and preferred) method of integration is to install it as Django middleware. In this case, each static page is generated the first time it is requested and then served from disk on subsequent requests.

If you want to save yourself a lot of effort, this is the way to go. To use it, you just add the supplied middleware and list regular expressions matching the URLs that should be served as static files in settings.py like this:

STATIC_GENERATOR_URLS = (
    r"^/$",
    r"^/(about/faq|contact)",
    ...
)

For our purposes this was, however, not a satisfactory solution.

We wanted to maximize the speed boost, so we had to bypass Django completely. In addition, generating all those static pages on demand would not give us the immediate speedup that we craved.

For instant gratification all 600K pages had to be generated in advance.
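Stripped of the Django specifics, pre-generation boils down to rendering each URL once and writing the HTML to the place where the web server will look for it. The following is a minimal sketch of that idea and not StaticGenerator's actual API: the names `static_path_for` and `publish_page`, the `index.html` convention and the `render` callable are all assumptions made for illustration.

```python
import os
import tempfile


def static_path_for(web_root, url):
    """Map a URL path such as '/about/' to a file below web_root."""
    relative = url.lstrip('/')
    if relative == '' or relative.endswith('/'):
        # Directory-style URLs are stored as <dir>/index.html (assumed convention)
        relative += 'index.html'
    return os.path.join(web_root, relative)


def publish_page(web_root, url, render):
    """Render url with the given callable and write the result to disk.

    render is a hypothetical stand-in for whatever produces the HTML.
    """
    target = static_path_for(web_root, url)
    target_dir = os.path.dirname(target)
    os.makedirs(target_dir, exist_ok=True)
    html = render(url)
    # Write to a temporary file first and rename it into place, so the
    # web server never sees a half-written page.
    fd, tmp_name = tempfile.mkstemp(dir=target_dir)
    with os.fdopen(fd, 'w', encoding='utf-8') as tmp:
        tmp.write(html)
    # mkstemp creates the file readable by its owner only; open it up so
    # that a web server running as another user can read it.
    os.chmod(tmp_name, 0o644)
    os.replace(tmp_name, target)
    return target
```

Writing through a temporary file matters once pages are regenerated while the site is live: a visitor gets either the old page or the new one, never a truncated file.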

Hurdle 2: Generating the static pages

While StaticGenerator allows you to script the page generation, the time needed to do so can be daunting. In our case, the whole job would take some 7 days on a single server. Also, while the generation job was running, several thousand of the generated pages would already have become outdated.

This is where Amazon Web Services and EC2 really shine. We had a one-off problem and needed the computing power to solve it.

Static page generation can easily be split across several servers. We divided the processing tasks into batches that would each take roughly 5 hours to complete and split them across 25 EC2 instances.
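As a rough illustration of the splitting step, the sketch below divides a list of pages into equally sized batches, one per instance. It assumes that every page costs about the same to render, which may not hold in practice, and the function name `split_into_batches` is made up for this example.

```python
def split_into_batches(items, batch_count):
    """Split items into batch_count batches whose sizes differ by at most one."""
    if batch_count < 1:
        raise ValueError('batch_count must be at least 1')
    items = list(items)
    base, extra = divmod(len(items), batch_count)
    batches = []
    start = 0
    for index in range(batch_count):
        # The first `extra` batches take one additional item each
        size = base + (1 if index < extra else 0)
        batches.append(items[start:start + size])
        start += size
    return batches
```

With 25 instances, `split_into_batches(all_urls, 25)` would hand each instance one batch, where `all_urls` stands for whatever list of page URLs the site can enumerate.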

At the time of writing, a simple EC2 instance is priced at $0.10 per hour. This would cost us a total of $0.10 x 5 x 25 = $12.50—or roughly the price of a pint of beer in Norway. Not too bad.
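The same arithmetic as a small helper, in case you want to plug in your own numbers. This is only a sketch: the function name is invented here, and it assumes that every instance runs for the same number of hours at a flat hourly rate.

```python
from decimal import Decimal


def ec2_job_cost(hourly_rate, hours_per_instance, instance_count):
    """Total cost when every instance runs for the same number of hours."""
    # Decimal avoids binary floating point surprises with money amounts
    return Decimal(hourly_rate) * hours_per_instance * instance_count


total = ec2_job_cost('0.10', 5, 25)
print(total)  # prints 12.50
```

By the same formula, a single instance running for the 7 days (168 hours) mentioned earlier would come to $16.80, so spreading the work across many instances buys the speedup without raising the bill.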

Priming the EC2 instances is a matter of making them look like replicas of the production server. Without getting into the finer details, an AMI with the same packages and software as your production server has to be created. Then you just have to make sure that the instances launched from that AMI are working with the freshest available data.

For Eventseer, the full database is regularly synced with Amazon Simple Storage Service (S3); see the figure below. After launching the EC2 instances, each instance is initialized with the latest data from S3. Once the processing job is done, the results are sent to the production server and the EC2 instances terminate themselves.

Hurdle 3: Showing the pages at the right time

So, we had our 600K pages; now we just needed to show them when it was safe to do so.

As you recall, a static page can only be shown to a visitor that is not authenticated. All authenticated visitors must be sent through the standard Django request processing cycle.

There is a subtle problem here. In a Django view or middleware class you can easily detect if a visitor is authenticated or not. However, we don't want lighttpd to route requests through Django unless it is absolutely necessary.

Our solution was to have a custom Django middleware class set a cookie to indicate authentication status:

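class SetStatusCookieMiddleware(object):
    """Mirror the login state in a cookie that lighttpd can read."""

    def process_response(self, request, response):
        # request.user is missing if an earlier middleware short-circuited
        user = getattr(request, 'user', None)
        auth_val = '0'
        # is_authenticated is a method in the Django release used here
        # (it became a property in later releases)
        if user is not None and user.is_authenticated():
            auth_val = '1'

        response.set_cookie('auth', auth_val,
                            domain='.eventseer.net')
        return response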

Now we just have to force lighttpd to filter requests based on the cookie value. This can be done via the mod_magnet module.

mod_magnet gives you complete control over the lighttpd request handling. To enable it you just include the module and add a line such as the following to your lighttpd.conf file:

magnet.attract-physical-path-to = ( server.document-root + "/rewrite.lua" )

This routes all requests through a Lua script of our choosing. Below is an excerpt of my rewrite.lua that shows how to serve the start page static file to unauthenticated requests. If you are not familiar with Lua then the Lua-users wiki should give you a head start.

-- Serves the file with the given filename if it exists and is non-empty.
local function returnStaticFile (filename)
    local statinfo = lighty.stat(filename)
    if (statinfo and (statinfo['st_size'] ~= 0)) then
        lighty.content = {{ filename = filename }}
        lighty.header['Content-Type'] = 'text/html'
        return 200
    end
end

local base_dir = '/home/eventseer/static/'
-- Make sure that cookie is set
local cookie = lighty.request['Cookie'] or ''

if (cookie:find('auth=1')) then
    -- The user is authenticated, so we serve the
    -- standard dynamic page
    return
else
    -- The user is not authenticated, so we search for a static page
    -- Get the uri
    local uri = lighty.env['request.orig-uri']
    -- Match start page
    if uri:match('^/$') then
        return returnStaticFile(base_dir .. 'index.html')
    end
    -- (excerpt ends here)
end

Remember to keep your mod_magnet script as short and simple as possible: any long-running scripts will block all other lighttpd connections.
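Because the routing rule lives inside the web server, it can be handy to have a model of it that runs in an ordinary test suite. The following Python sketch mirrors the decision made in the Lua excerpt above; it is not part of the Eventseer code, and the function names are invented for this example. One deliberate difference: it parses the cookie header rather than searching for a substring, so a cookie named, say, `xauth` is not mistaken for `auth`.

```python
import os
from http.cookies import SimpleCookie


def is_authenticated(cookie_header):
    """Return True if the Cookie header carries auth=1."""
    cookies = SimpleCookie()
    cookies.load(cookie_header or '')
    morsel = cookies.get('auth')
    return morsel is not None and morsel.value == '1'


def choose_static_file(cookie_header, uri, base_dir):
    """Return the static file to serve, or None to fall through to Django."""
    if is_authenticated(cookie_header):
        return None
    if uri != '/':
        # Only the start page is handled, as in the Lua excerpt
        return None
    filename = os.path.join(base_dir, 'index.html')
    # Mirror the Lua check: the file must exist and be non-empty
    if os.path.isfile(filename) and os.path.getsize(filename) > 0:
        return filename
    return None
```

Returning `None` corresponds to the bare `return` in the Lua script: lighttpd carries on with its normal request handling, which in this setup hands the request to Django.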

Hurdle 4: Keeping the static files up to date

This was probably the most difficult part of our dynamic to static conversion. For each added event, hundreds of pages may have to be updated. Doing these updates on the fly is too time-consuming. Besides, it is not crucial that all pages are immediately updated—delegating this task to regularly executed AWS jobs is therefore a good compromise.

To keep track of pending changes we set up a queueing system. Each page update request is first added to a local queue, which is regularly kept in sync with a remote queue on AWS. We keep two separate queues to cover those rare moments when AWS is unavailable.

Specifically, we use the Django Queue Service for the local queue and Amazon Simple Queue Service (SQS) for the remote AWS queue.
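The point of the local queue is that nothing is lost when the remote side is down. A minimal sketch of such a sync step follows, using a plain `deque` in place of the Django Queue Service and a hypothetical `send` callable in place of the SQS client; neither reflects the real APIs of those services.

```python
from collections import deque


def sync_queues(local_queue, send):
    """Forward pending page-update requests to the remote queue.

    local_queue is a deque of page identifiers; send is a hypothetical
    callable that pushes one message to the remote queue and raises an
    exception when the remote side is unreachable.
    Returns the number of messages forwarded.
    """
    forwarded = 0
    while local_queue:
        message = local_queue[0]
        try:
            send(message)
        except Exception:
            # Remote queue unavailable: keep this message and the rest of
            # the backlog locally and try again on the next sync run.
            break
        # Remove the message only after the remote side has accepted it
        local_queue.popleft()
        forwarded += 1
    return forwarded
```

Messages are removed from the local queue only after a successful send, so an outage simply leaves a backlog that the next run picks up.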



Twice a day the required number of EC2 instances is automatically launched on AWS. The most current version of the database is fetched from S3 and installed on each instance. Then page update requests are fetched one at a time from SQS until the queue is empty. Finally, all the generated static files are sent to the production server and installed at their correct locations. Once the EC2 instances are no longer needed they shut themselves down.
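The loop each instance runs can be sketched as follows. A standard-library `queue.Queue` stands in for SQS, and `generate` is a hypothetical callable that regenerates one page. Skipping URLs that were queued more than once is an assumption added here, not something the setup described above is known to do.

```python
import queue


def drain_update_queue(pending, generate):
    """Process page-update requests one at a time until the queue is empty.

    pending is a queue.Queue of URL paths (a local stand-in for the remote
    queue); generate is a hypothetical callable that regenerates one page.
    Returns the URLs that were regenerated, in order.
    """
    done = []
    seen = set()
    while True:
        try:
            url = pending.get_nowait()
        except queue.Empty:
            break
        # The same page may have been queued several times since the last
        # run; regenerating it once is enough.
        if url not in seen:
            generate(url)
            seen.add(url)
            done.append(url)
    return done
```

Since a single new event can touch hundreds of pages, and several events may touch the same page between runs, de-duplicating within a run is a cheap way to avoid repeated work.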

This is a fairly inexpensive operation, especially when compared with the cost of purchasing dedicated server capacity.

Final words

There are many ways of improving web site performance. The technique described in this article should probably not be your first weapon of choice—sometimes simply tweaking the generated HTML can yield dramatic performance improvements.

Nonetheless: If your site has situations where serving static files is an option, the potential performance improvement may make it well worth the effort.

The benefits of taking the time to set up an AWS infrastructure are also very tangible. For Eventseer, we use AWS not only for static file generation but for all our background processing jobs.

The savings from not keeping dedicated servers on standby are immediate and lasting, especially for a low-budget operation such as ours. Also, this leaves us in a far better position to scale the service as required.

August 17, 2008 - 9:44 p.m.

Thank you for useful article! Never thought about optimizing sites on django this way.

zuko

November 7, 2008 - 3:45 a.m.

Yes, very useful. Thanks

richr

March 11, 2009 - 4:05 p.m.

Lots of really useful information here. I'm new to python and Django although I've done a lot of Perl and PHP programming for very well used sites.

dantagg
