
Tips: Deploying a web application to the cloud

I am wrapping up helping a client with a build out of a Drupal site to EC2. The site itself is a pretty standard CMS implementation–custom content types, etc. The site is an extension to an existing brand, and exists to collect email addresses and send out email newsletters. It was a team of three technical people (there were some designers and other folks involved, but I was pretty much insulated from them by my client) and I was lucky enough to do a lot of the infrastructure work, which is where a lot of the challenge, exploration and experimentation was.

The biggest attraction of the cloud was the ability to spin up and spin down extra servers as the expected traffic on the site increased or decreased. We chose Amazon’s EC2 for hosting. They seem a bit like the IBM of the cloud–no one ever got fired, etc. They have a rich set of offerings and great documentation.

Below are some lessons I learned from this project about EC2. While it was a Drupal project, I believe many of these lessons are applicable to anyone who is building a similar system in the cloud. If you are building a video processing supercomputer, maybe not so much.

Fork your AMI

Amazon EC2 running instances are instantiations of a machine image (AMI). Anyone can create a machine image and make it available for others to use. If you start an instance off an image, and then the owner of the image deletes the image (or otherwise removes it), your instance continues to run happily, but, if you ever need to spin up a second instance off the same AMI, you can’t. In this case, we were leveraging some of the work done by Chapter Three called Project Mercury. This was an evolving project that released several times while we were developing with it. Each time, there was a bit of suspense to see if what we’d done on top of it worked with the new release.

This was suboptimal, of course, but the solution is easy. Once you find an AMI that works, you can start up an instance, and then create your own AMI from the running instance. Then, you use that AMI as a foundation for all your instances. You can control your upgrade cycle. Unless you are running against a very generic AMI that is unlikely to go away, forking is highly recommended.

Use Capistrano

For remote deployment, I haven’t seen or heard of anything that compares to Capistrano. Even if you do have to learn a new scripting language (Ruby), the power you get from ‘cap’ is fantastic. There’s pretty good EC2 integration, though you’ll want to have the EC2 response XML documentation close by when you’re trying to parse responses. There’s also some hassle involved in getting cap to run on EC2. Mostly it involves making sure the right set of ssh keys is in the correct place. But once you’ve got it up and running, you’ll be happy. Trust me.

There’s also a direct capistrano/EC2 integration project, but I didn’t use that. It might be worth a look too.

Use EBS

If you are doing any kind of database driven website, there’s really no substitute for persistent storage. Amazon’s Elastic Block Store (EBS) is relatively cheap. Here’s an article explaining setting up MySQL on EBS. I do have a friend who uses EC2 in a different, very write-intensive manner and is having some performance issues with his database on EBS, but for a write-seldom, read-often website like this one, EBS seems plenty fast.

EC2 Persistence

Some of the reasons to use Capistrano are that it forces you to script everything, and makes it easy to keep everything in version control. The primary reason to do that is that EC2 instances aren’t guaranteed to be persistent. While there is an SLA around overall EC2 availability, individual instances don’t have any such assurances. That’s why you should use EBS. But, surprisingly, the EC2 instances that we are using for the website haven’t bounced at all. I’m not sure what I was expecting, but they (between three and eight instances) have been up and running for over 30 days, and we haven’t seen a single failure.

Use ElasticFox

This is a Firefox extension that lets you do every workaday task, and almost every conceivable operation, to your EC2 instances. Don’t delay, use this today.

Consider CloudFront

For distributed images, CloudFront is a natural fit. Each instance can then reference the image, without you needing to sync files across instances. You could use this for other files as well.

Use Internal Network Addressing where possible

When you start an EC2 instance, Amazon assigns it two addresses–an external one that can be used to access it from the internet, and an internal one. Each has a DNS name. For most contexts, the external name is more useful, but when you are communicating within the cloud (pushing files around, or a database connection), prefer the internal DNS. It looks like there are some performance benefits, but there are definitely pricing benefits. “Always use the internal address when you are communicating between Amazon EC2 instances. This ensures that your network traffic follows the highest bandwidth, lowest cost, and lowest latency path through our network.” We actually used the internal DNS, but it makes more sense to use the IP address, as you don’t get any abstraction benefits from the internal DNS, which you don’t control–that takes a bit of mental adjustment for me.

Consider reserved instances

If you are planning to use Amazon for hosting, make sure you explore reserved instance pricing. For an upfront cost, you get significant savings on your runtime costs.

On Flexibility

You have a lot of flexibility with EC2–AMIs are essentially yours to customize as you want, starting up another node takes about 5 minutes, you control your own DNS, etc. However, there are some things that are set at startup time. Make sure you spend some time thinking about security groups (built in firewall rules)–they fall into this category. Switching between AMIs requires starting up a new instance. Right now we’re using DNS round robin to distribute load across multiple nodes, but we are planning to use elastic IPs, which allow you to remap a routable IP address to a new instance without waiting for DNS timeouts. EBS volumes and the instances they attach to must be in the same availability zone. None of this is groundbreaking news; it’s really just a matter of reading all the documentation, especially the FAQs.

Documentation

Be aware that there is a ton of documentation, one set for each API release, for EC2 and the other web services that Amazon provides. Rather than starting with Google, which often leads you to an outdated version of documentation, you should probably start at the AWS documentation center. This is especially true if you’re working with any of the newer systems, whose APIs are perhaps not as stable.

In the end

Remember that, apart from new tools and a few catches, using EC2 is not that different from using a managed server where you don’t have access to the hardware. The best document I found on deploying Drupal to EC2 doesn’t talk about EC2 at all–it focuses on the architecture of Drupal (Drupal 5 at that) and how to best scale that with additional servers.

[tags]ec2,amazon web services,capistrano rocks[/tags]

Optimizing a distance calculation in a mysql query

If you have a query that sorts by a derived field, and then takes a limited number of the results, it can be a real dog.  Here’s how I optimized a situation like this.  Imagine this table.

create table office_building (
id int primary key,
latitude float not null,
longitude float not null,
rent int,
address varchar(20),
picture_url varchar(255)
);

If you want to find the nearest 100 office buildings to a point on a map, you run a query something like this (plug your lat/lng into the question marks):

explain select *, round( sqrt( ( ( (latitude - ?) * (latitude - ?) ) *  69.1 * 69.1) +
((longitude - ?) * (longitude - ?) * 53 * 53 ) ) ) as distance
from office_building order by distance limit 100

(See here for an explanation of the 69.1 and 53 constants–basically they convert roughly from lat/lng to miles.) Unfortunately, you are ordering by a derived field, and mysql can no longer do order by optimization.
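To make the arithmetic in that query concrete, here is a minimal Ruby sketch of the same flat-earth approximation. The method name approx_miles and the constant names are mine, made up for illustration; only the 69.1 and 53 values come from the query. Note that 53 is roughly 69.1 times the cosine of 40 degrees, so the longitude constant fits mid-latitudes and would need adjusting elsewhere.

```ruby
# Rough miles per degree, the same constants used in the SQL query.
MILES_PER_DEGREE_LAT = 69.1
MILES_PER_DEGREE_LNG = 53.0 # assumption: reasonable only near 40 degrees latitude

# Approximate distance in whole miles between two lat/lng points,
# treating the earth as flat near them.
# approx_miles is a hypothetical helper name, not part of any library.
def approx_miles(lat1, lng1, lat2, lng2)
  dlat = (lat1 - lat2) * MILES_PER_DEGREE_LAT
  dlng = (lng1 - lng2) * MILES_PER_DEGREE_LNG
  Math.sqrt(dlat * dlat + dlng * dlng).round
end
```

One degree of latitude apart comes out as 69 miles and one degree of longitude apart as 53, which is all the SQL expression is doing for each row.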

This means that you’ll be doing a filesort (which does not actually have anything to do with the filesystem, but is just a sort not on an index).  And this, in turn, means that your performance will suck if the query matches a large number of rows.

You can help things out a bit by limiting your office building query to a box of a certain size around the point.  Here’s the query with a 5 mile box:

select *, round( sqrt( ( ( (latitude - ?) * (latitude - ?) ) *  69.1 * 69.1 ) +
( (longitude - ?) * (longitude - ?) * 53 * 53 ) ) ) as distance
from office_building
where latitude < ?  + (1/69.1)*5 and latitude > ? - (1/69.1)*5 and longitude < ? + (1/53)*5 and longitude > ? - (1/53)*5
order by distance limit 100
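The box arithmetic in that where clause can be sketched in Ruby as well, which is handy if you would rather compute the bounds in application code and bind four plain numbers. This is only a sketch: bounding_box is a hypothetical helper name, and it assumes the same 69.1 and 53 miles-per-degree constants as the query.

```ruby
# Returns [min_lat, max_lat, min_lng, max_lng] for a box that extends
# `miles` in each direction from the point (so a "5 mile box" is
# 10 miles on a side).
# bounding_box is a hypothetical helper name, not part of any library.
def bounding_box(lat, lng, miles)
  dlat = miles / 69.1 # degrees of latitude covered by `miles`
  dlng = miles / 53.0 # degrees of longitude covered by `miles`
  [lat - dlat, lat + dlat, lng - dlng, lng + dlng]
end

min_lat, max_lat, min_lng, max_lng = bounding_box(40.0, -105.0, 5)
```

The four values correspond, in order, to the lower and upper latitude bounds and the lower and upper longitude bounds in the query.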

But if you still have too many results, the sorting on distance will be slow.  Also, even if you have an index on latitude and longitude, (such as create index idx_nearby on office_building (latitude,longitude)) because you are not using equality, only the first column will be used.

This is worth repeating, because it took me a while to understand.  If you have an index: create index idx on tbl (col1,col2,col3,col4,col5) and you run a query like select count(*) from tbl where col1 = 1 and col2 > 2 and col3 < 3 and col4 > 4 only col1 and col2 will be used from the index.  Mysql goes to the table data files for col3 and beyond (assuming no other indices on the table).  This makes sense when you think about how indices are created and stored, but I didn’t really understand it until I’d been beaten over the head with it.

As stated here: “[mysql] will use the fields [in the index], from left to right, as long as the WHERE clause has “=”. Once it hits a ‘range’ (>, IN, BETWEEN, …), it stops with that field.”  I don’t know why it is not in the mysql index documentation–maybe it is obvious?
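Here is a toy Ruby model of the rule as quoted, which I find easier to reason about than the prose. It is not how MySQL's optimizer is implemented, just the left-to-right rule written down; the method name and the :eq/:range labels are made up for this sketch.

```ruby
# index_columns: column names in the order they appear in the index.
# predicates: hash of column name => :eq or :range, describing the WHERE clause.
# Returns the index columns the quoted rule says can be used.
# usable_index_columns is a hypothetical name for illustration only.
def usable_index_columns(index_columns, predicates)
  used = []
  index_columns.each do |col|
    kind = predicates[col]
    break if kind.nil?      # no condition on this column: stop here
    used << col
    break if kind == :range # a range uses this column, then stops
  end
  used
end

usable_index_columns(
  %w[col1 col2 col3 col4 col5],
  "col1" => :eq, "col2" => :range, "col3" => :range, "col4" => :range
)
```

For the example index this yields col1 and col2, and for the (latitude,longitude) index with two range conditions it yields only latitude, matching what is described above.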

The solution I found was to separate what I wanted to find in the select clause from how I find it, in the where and order by clause:

select select_clause.*,
round( sqrt( ( ( where_clause.latitude - ?) * (where_clause.latitude - ? ) *  69.1 * 69.1 ) +
( (where_clause.longitude - ? ) *(where_clause.longitude - ? ) * 53 * 53 ) ) ) as distance
from office_building where_clause, office_building select_clause
where where_clause.latitude < ? + (1/69.1)*5 and where_clause.latitude > ? - (1/69.1)*5
and where_clause.longitude < ? + (1/53)*5 and where_clause.longitude > ? - (1/53)*5
and where_clause.id = select_clause.id
order by distance
limit 100

You also need to add an index:

create index idx_nearby on office_building (latitude,longitude,id);

Then, when you run the query, you still have the filesort, but you also see the magic ‘Using index’ in your explain plan.  You never have to go to the table to do the sort!  You also have a join now, but it’s on the primary key, and you only need to go to the table for the 100 rows that you know you want.

Using this approach on one live system made queries one to two orders of magnitude faster, depending on the query.  It works not only for distance queries, but any time you want to order by a calculated value.

More useful links: geo search suggestions, index explanation

[tags]mysql, performance, query optimization[/tags]

Interesting GWO Case Study

I’ve written before about Google Website Optimizer.  But it’s always nice to see hard data.

Here’s an interesting GWO Case Study I found online, via a presentation by Angie Pascale.  It focuses on optimizing landing pages for a college system.  Conclusions:

Although the SEM agency did not find a correlation between brain lateralization and form location, they did succeed in optimizing Westwood’s program landing pages. On average, the program pages saw a 39.87% conversion rate improvement, with 83.1% being the highest upgrade. After significant results were revealed, the agency stopped each experiment and changed the format for every page to reflect the best-performing contact form location.

[tags]gwo, case study[/tags]

Setting variables across tasks in capistrano

I am learning to love capistrano–it’s a fantastic deployment system for remote server management.  I’m even learning enough ruby to be dangerous.

One of the issues I ran into was I wanted to set a variable in one task and use it in another (or, more likely, in more than one other task).  I couldn’t find any examples of how to do this online, so here’s how I did it:

task :set_var do
  # localvar stands for whatever value this task computed
  self[:myvar] = localvar
end

task :read_var do
  puts self[:myvar]
end

Note that myvar and localvar need to be different identifiers–“local variables take precedence”.  Also, the variable can be anything, I think.  I use this method to create an array in one task, then iterate over it in another.
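The "local variables take precedence" note is plain Ruby behavior, and a small stand-alone sketch shows why the two names must differ. The Settings class below is a stand-in I wrote for illustration, not Capistrano itself; it assumes, as I understand Capistrano does, that stored variables can also be read as bare method names.

```ruby
# Stand-in for a configuration object that stores variables by symbol
# and also exposes them as methods. Hypothetical class, not Capistrano.
class Settings
  def initialize
    @vars = {}
  end

  def []=(name, value)
    @vars[name] = value
  end

  def [](name)
    @vars[name]
  end

  # Expose stored variables as bare method names.
  def method_missing(name, *args)
    @vars.key?(name) ? @vars[name] : super
  end

  def respond_to_missing?(name, include_private = false)
    @vars.key?(name) || super
  end

  def demo
    self[:myvar] = "from the settings"
    first = myvar         # no local named myvar yet, so this reads the stored value
    myvar = "local value" # from here on, myvar means the local variable
    [first, myvar]        # the stored value is now hidden behind the local
  end
end

Settings.new.demo
```

Once a local named myvar exists in a task, a bare myvar refers to it rather than to the stored variable, which is why going through self[:myvar] with a differently named local is the safer habit.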

[tags]capistrano, remote deployment, ruby newbie[/tags]

New Release of GWT Crypto Library

I just released a new version of gwt-crypto.  You can download it here.  While encryption in JavaScript has its limits, it also has its place.  Currently, I am using it for some data (lat/lng) that we want obscured, but that is not top secret.

If you’re using this library, please let me know what you’ve found it useful for.

Overall, this has been a fun experience.  I’ve learned at least the basics of maven, had some interaction with users and written tests for bugs they file.  (I got involved in this project earlier this summer, because I contacted the maintainer.)

[tags]encryption, tripledes, gwt,open source[/tags]

Boco: Colorado’s SXSW?

I spent yesterday at boco.me, a one-day, one-track conference in Boulder, Colorado. The focus was on three different areas: food, tech, and music.  Apparently, South by Southwest (SXSW) has a similar multidimensional focus.

I was looking forward to meeting people from different spheres with different interests, and it certainly delivered that. Most attendees I talked to were tech people, however. Many thanks to Andrew Hyde and company for organizing this. I hope it’s the first of many.

Before signing up, and really right up until the conference itself, I did not have a very good idea of how much I was getting.  It was actually quite affordable: $99. For this modest price, attendees received:

  • entry to a concert: value $15
  • $30 worth of dinner at one of Boulder’s many fine restaurants
  • happy hour with beer and wine and apps
  • three sessions with about six speakers per session
  • three breakout sessions
  • a free T-shirt
  • a thank you note from Andrew(!)

Boco was, to put it mildly, a hell of a deal.

The conference had, as first year conferences tend to, a few flaws. The things I would change were:

  • allow users to ask questions of the speakers
  • have the breakout sessions be a bit more organized–they felt very ad hoc.

What follows are my notes from yesterday.  Here’s what the Daily Camera had to say.

First up was Rachel Weidinger (her slides are here). She mentioned the “big here and long now” and talked about tools that make our here bigger–“handheld awesome detectors”.  The tool that excited me the most was the Good Guide. This site offers what I’ve been looking for for a long time, which is detailed information on products, so that price and marketing are not the sole guides when you purchase something off the cuff. This guide has an API so that third-party developers can access their data. Oh, and Rachel is also looking for someone to build snake detecting goggles.

Next, Mark Menagh spoke on the differences between eating organic and eating locally.  I paraphrase, but he said that folks who eat organics are pessimists who want rules to prevent bad things from happening to their food and locavores are optimists.  He also emphasized that this November, Boulder voters are going to be asked to extend the Open Space sales tax (till 2034! [pdf]) and that while we do that, voters should let the county commissioners know how they feel about GMO crops on open space land.

Then, Justin Perkins, from Olomomo Nut Company discussed some of the similarities he had noticed between building a band fanbase, as he did in the 1990s, and building one for a local food company, as he is doing now.  I can tell you from experience that his nut products are quite good.  He talked about engaging users in the product so that they feel it’s part of their story. Takeaway quote: entrepreneurs “have to be consistent and persistent as hell”.

Cindy O’Keeffe spoke about her experience fighting the GMO beets on Boulder County Open Space land.  I had heard about this issue before (Mark also discussed it), but she gave a good overview of the issues, and she had a compelling story about her personal journey from detached global environmentalist to local leader opposing the GMO planting.

Rick Levine, an author of the Cluetrain Manifesto (read it if you haven’t!) and now chocolatier, gave an overview of the Cluetrain ideas, and then talked about his new venture into high end chocolates, including some of the physics of chocolate.  Seth Ellis, his company, has shiny candy bar wrappers that he claimed were home compostable.  When talking about the Cluetrain and his experiences in technology, he offered up the observation that while he had been really interested in technology, his really great moments were talking to people.

The Autumn Film, a two person Boulder band, talked a bit about their experiences in music creation at this time.  Takeaway: music used to be “work hard, get lucky, hit it big”, but the industry changes have now just made it “work hard, hit it big, work harder”.  You can check out some of their music for free (well, you have to give them some of your personal information). Then, one member of the band performed.

I enjoyed the first breakout in which five of us gathered outside and discussed a wide variety of topics.  It was great to have a framework for getting to know the other conference participants.

Amber Case led off the second session by talking about cyborg anthropology–basically the idea that humans extend themselves via their tools, and that the malleability of current tools (think iphone) far exceeds the malleability of previous tools (think hammer).  Several of the other attendees found her ideas fascinating, but I wasn’t as astonished.  I guess I have thought about this topic, though certainly not with the rigor that Amber has.  (Reading Snow Crash is no thesis.)  She did have some neat pointers to other work going on in this field: human-blender ‘communication’ and hug storage. Humorously, her email sig reads “Sent from my external prosthetic device”.

Rich Grote and Dave Angulo then talked about what makes an online influencer–relevance, audience, access and one other thing I forgot to write down.  They are working on a company, which I was unable to find a link to, to leverage online influencers for marketing purposes.  It reminded me a bit of what Lijit presented on in June at the BDNT.  They also talked a bit about Dunbar’s number, which is the “theoretical cognitive limit to the number of people with whom one can maintain stable social relationships”.

Scott Andreas discussed his experiences building social software for non profits.  The takeaway for me was that when you have a cohesive group and you provide them social software, it can enrich the community.  The most important thing is that the community (and their norms) exists and is enforced outside of the software.  He also talked about Sunlight Labs, an open data source about the US government. Also, Andrew Hyde mentioned at this time the idea of floating your revenue through Kiva.  I certainly am not earning a lot of interest on my business savings right now, and using the funds to do microloans could be a great social good.  I would be a bit concerned about loan losses, though (98% loan repayment is a bit worrisome).

Sean Porter of Gigbot gave a breakdown of the live music industry ecosystem.  There are a lot of middlemen between the fan and the band when it comes to concerts–ticketing agencies, promoters, management.  He started down the path of explaining how much of the ticket price you and I pay each of these folks gets, but didn’t go all the way; if he had, I think his presentation would have been much stronger.

Ingrid Alongi talked about how she learned about work life balance, and techniques for maintaining it.  Good ideas in there–having a status meeting with coworkers while on a bike ride was probably my favorite, though.  Incidentally, she was laid off on Monday and had found a new job by the time she talked on Friday.

Grant Blakeman and Reid Phillips (the latter being a member of The Autumn Film) talked about the new music business models.  Takeaway quote: “things always change”.  Sounds like Abe Lincoln. They are building tools that allow musicians to use some new media to market and connect with their fans. I enjoyed their insistence on musicians retaining control of their work, and using new technology to facilitate that.  It reminded me of this great article by Joel Spolsky where he talks about how your business should never outsource core business functions.  Fan interaction seems a pretty core part of the band business, so I doubt it should be outsourced.

Ari Newman of Filtrbox talked about the realtime web: how we’ve reached a technology tipping point and that Twitter and its open API pushed the real time web into the forefront, but that it is larger than the Twitterstream.  Ari also mentioned how the real time web actually isn’t all that real time–even if the technology delivers news to your computer in half a second, if it is not in front of you, it doesn’t matter.  Maybe he should collaborate with Amber on some goggles that would push realtime news to you all the time 🙂 .  He had real neat slide effects, too.  I chatted with him a bit and it was great to hear stories of his old sysadmin days–Linux on a Mac 8500!

The second breakout session was over lunch.  It was really interesting to talk with Ryan and Angie of Location 3, a Denver interactive agency, as well as Andrew Hyde, Ef, Rahoul(sp?) and Dan Kohler; the discussion was wide ranging and not too focused.

The third set of sessions was more informal.  Half of the speakers did not follow their topics on the program…

First, Emily Olson, from Foodzie, discussed how she had turned her passion (food) into a job (Foodzie, among others).  Her main points: pay attention to what you do in your free time–that’s an indication of your passion; find a mentor; be willing to work for free, especially at first; don’t try to find the one true vocation.

Dan Kohler, of Renegade Kitchen, discussed how to not have your blog/website suck.  He had 3 people up on stage read 3 different posts, and critiqued them.  Takeaway–“put your voice into” your blog.  I have a pretty vanilla voice on this blog, but part of that is due to professional concerns; however, Dan made the point that really, if you do drive some people off with the tone of your blog, the people you have left will be fiercer fans.

There was a panel on where the local music scene was heading, moderated by Sharon Glassman, a local bluegrass musician, and featuring Jason Bradley and Ira Leibtag.  I stepped out during this panel, but I do remember Jason Bradley discussing how “lots of people live in a box” in reference to his bringing an accordion to a bluegrass jam (and the reaction of the other players).

Brad Feld discussed the startup visa movement.  The idea is anyone who wants to move to the United States and start a company would get a 2 year visa; it would be automatically renewable for achieving certain goals (raising more funding, employing a certain number of people).  The founder would have to show proof of funding.  More information here.  I like anything that gets more smart folks to move to the USA.

Elana Amsterdam spoke on her experience turning a blog she wrote into a recipe book, and stated that her experience showed how you could really build a full fledged business out of a blog, using your passion and the blog as a platform to publish.  She also recommended “Write the Perfect Book Proposal” by Jeff Herman.  Updated 10/4: I asked a friend in the book publishing business about this book and she said: “Yikes. Any book that says “it’s easier to get published than you think” makes me want to hurt myself. Proposals aren’t about capturing a publisher’s attention. They’re about showing your expertise, your marketability, and just plain having an idea that fits within what a company actually publishes.”  For what that’s worth…  I think that she’s absolutely correct, for certain kinds of blogs.  I know that Eric Sink did the same thing with “Eric Sink on the Business of Software”, a fine book that has a collection of blog posts at its core.

Finally, Lilly Allison, a personal chef, spoke about eating seasonally and consciously.  She is using the web to extend her reach (and her brand!) as a personal chef–if you sign up, she’ll send you weekly meal plans with in season menus.  I signed up and will let you know how it goes; I do have lots of food from my CSA (here’s a list of Colorado CSAs).

There was a third breakout session, but I had to run some errands, so I missed it.

Then, it was happy hour time.  Off to the Boulder Digital Works, above Brasserie 1010.  It’s a beautiful space in downtown Boulder, and I talked with some of the incoming students who are doing the first 60 week advertising certificate.  In addition, I had conversations on a variety of topics, from the success of boco to how to scale a custom chocolate business to whether presenting at BDNT helped business (answer, indirectly, yes) to what to do with consulting requests that interfere with your core business (with the Occipital folks).

At the end of happy hour, we gathered into groups of four.  I had dinner with Scott Andreas, Dan Kohler, and Jen Myronuk; a fine meal at Centro and then to a concert at the Boulder Theater: Paper Bird.

All in all, a fantastic conference.  It was eclectic and not as focused as other conferences I’ve been to, but for that reason alone has value.  I get bored if I only educate myself in one dimension.  Thanks again to the boco team, and here’s hoping that next year is as good, if not better.