Image Recognition Automation Fail

I had a friend who asked me to take a quick look at a business problem he was having.  He had a set of photos (of vinyl record albums) that he was looking at to identify the artist.  After finding the artist, he’d do some additional categorization work and then push the image and metadata to an ecommerce site he is running.

He wanted a way to more quickly identify the artist and album, preferably without his intervention.

The first thing I suggested was Mechanical Turk, as this seemed like a great example of a Human Interaction Task.  However, my friend tried this and found it to be more work (mostly proofing, I think) than it was help.

He also pointed out that Google Image Search does exactly what he wants.  You can post an image or URL to it (you have to use the camera in the search box), and it will give you back similar images, a best guess at what the image shows, and links related to the image.  Pretty cool!

However, there is no API for the new Google Image Search.  Sure, you have the old, deprecated Images API, but it doesn’t have access to post an image or a URL, just keywords.  A bit of looking around revealed this StackOverflow post, which deconstructs the new Image Search parameters.  However, the deconstructed URL gives back JavaScript which needs to be executed by a browser.  I looked briefly at Selenium and WebDriver to do that, but couldn’t figure it out.

I also looked at Kooaba’s API, but they didn’t get back to me when I signed up for a free developer account, and their API only covers books, CDs and DVDs.  Also from a StackOverflow post, I looked at MacroGlossa and IQEngines.  Neither of them seemed to work–MacroGlossa wanted a category (and, shocker, vinyl record albums was not a category) and IQEngines let me submit an image but wasn’t successful in identifying it.

I had to admit I was defeated.

Running a Google Apps Script Once a Month

I needed a way to email a Google spreadsheet to my boss once a month, for some reporting purposes.  I could have put an entry in my calendar reminding me to do it, but I thought it would be a great time to try out the Google Docs scripting that I had read about for a year or two, and seen an AppSumo video about.  (I got the AppSumo video for free, from an ad on HARO.)

It was laughably easy to write the actual script (here’s a great set of tutorials).  The only rub was Google doesn’t allow you to run scripts at monthly intervals, only hourly, daily or weekly.  A small bit of scripting got around that.

Here’s the final script (edited to remove sensitive data):

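function myFunction() {
  // Runs from a daily trigger; "dd" yields a zero-padded day of month, e.g. "05".
  var dayOfMonth = Utilities.formatDate(new Date(), "GMT", "dd");
  // Compare as strings; the bare literal 05 is a legacy octal number.
  if (dayOfMonth === "05") {
    MailApp.sendEmail("email@example.com", "Spreadsheet Report Subject",
      'https://spreadsheets.google.com/a/mydomain.com/ccc?key=' + SpreadsheetApp.getActiveSpreadsheet().getId());
  }
}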

I set up a daily trigger for this script and installed it within the spreadsheet I needed to send.
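
The day-of-month check is what turns a daily trigger into a monthly one. As a minimal sketch (not the script that was actually deployed), the check can be pulled out into a small pure function and tried outside of Apps Script. The name `isReportDay` and its parameters are made up for illustration, and it works in UTC to match the "GMT" formatting used in the script.

```javascript
// Hypothetical helper (not part of the original script): true when `date`
// falls on day-of-month `targetDay`, evaluated in UTC.
function isReportDay(date, targetDay) {
  return date.getUTCDate() === targetDay;
}

// Simulate a daily trigger firing at 06:00 UTC on every day of November 2010
// and collect the days on which the report would be sent.
const sentOn = [];
for (let day = 1; day <= 30; day++) {
  const firedAt = new Date(Date.UTC(2010, 10, day, 6, 0, 0));
  if (isReportDay(firedAt, 5)) {
    sentOn.push(firedAt.toISOString().slice(0, 10));
  }
}
console.log(sentOn); // [ '2010-11-05' ]
```

Inside Apps Script, the trigger function would call something like `isReportDay(new Date(), 5)` before sending the email.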

I really really like Google Apps Script.  I think it has the power to be the VB of the web, in the way that VB made it easy to automate MS Office, reduce drudgery, and allow non developers to build business solutions.  It also ties together some really powerful tools–check out all the APIs you can access.

Once you let non developers develop, which is what Google Apps Script does, you do run into some maintenance issues (versioning, sharing the code, testing), but the same is true with Excel Macros, and solving those issues is for greater minds than mine.

Useful Tools: StatsMix makes it easy to build a dashboard

I haven’t been to a BDNT lately, but still get their email announcements.  In August, all the 2010 TechStars folks presented, and were listed in the email.  I took a look at each company, and signed up when the company seemed to be doing something cool.  I always want to capture my preferred username, mooreds!

One that was very interesting to me was StatsMix; I signed up for their beta.  On Nov 1, I got invited to sign up.  Wahoo!

StatsMix lets users build custom dashboards.  I am developing an interest in web analytics (aside: if you are interested in this topic, I highly recommend Web Analytics 2.0, by Avinash Kaushik).  I’ve been playing with Piwik, an open source analytics toolkit, but StatsMix offers a slicker solution.

They have made it dead simple to create a custom dashboard for users.  They offer integration with, at this time, 29 services (Twitter, MailChimp, YouTube, Google Analytics, etc).  I could not find an up to date list of integration services outside of their web application!  The best I could find was this list from September.  While the integration interface is slick, the data integration is rudimentary.  For example, they will let you monitor the number of rows in a Google Spreadsheet, but nothing more (like rows in different columns, or the value in a particular cell–would be nice to see them integrate with Google Apps Script); you can track the number of likes on Facebook, but not the number of comments.

The real power of StatsMix comes from the ease of integration with your own custom stats.  They offer an API which is accessible via REST.  This means that you can push information from your database to a beautiful looking dashboard with shell scripts and a cron job.  Very cool!  It would be nice to see a plugin for Magento or other ecommerce vendors; I recently had a client, The Game Frame, that would have been a great fit for this type of dashboard, since it aggregates beyond what the ecommerce software provides.
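
To make the "push your own stats" idea a bit more concrete, here is a rough sketch of what a cron-driven push might build. It has not been checked against the StatsMix API documentation: the endpoint URL, the `metric_id`, `value` and `generated_at` field names, and the API key header are all placeholders, so treat every name here as an assumption.

```javascript
// All names below are placeholders for illustration, not the real StatsMix API.
const ENDPOINT = "https://example.com/api/stats"; // hypothetical endpoint

// Build the HTTP request for one data point without sending it.
function buildStatRequest(apiKey, metricId, value, when) {
  const body = new URLSearchParams({
    metric_id: String(metricId),
    value: String(value),
    generated_at: when.toISOString(),
  });
  return {
    url: ENDPOINT,
    method: "POST",
    headers: {
      "X-Api-Key": apiKey, // hypothetical auth header
      "Content-Type": "application/x-www-form-urlencoded",
    },
    body: body.toString(),
  };
}

// Example: 17 orders counted on 5 Nov 2010, reported against metric 42.
const request = buildStatRequest("my-key", 42, 17, new Date(Date.UTC(2010, 10, 5)));
console.log(request.method, request.url);
console.log(request.body);
```

A cron job would run a script like this after querying the database, then hand the pieces to whatever HTTP client is available (for example `fetch(request.url, { method, headers, body })` on a recent Node).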

Other cool features:

  • The whole UI is beautiful and fairly intuitive.
  • The dashboard supports custom date ranges.
  • They will send you an email of stats every day, and apparently have some kind of limited version you can pass onto clients.  I didn’t play with the email feature at all, though it looks extremely useful.

However, all is not perfect.  Some issues with StatsMix include:

  • As mentioned above, the integration with third party services leaves something to be desired.  What they offer is a nice start, but it’d be great to see them create some kind of marketplace where developers could build solutions.  For example, the Twitter widget only tracks the number of followers.  From the Twitter API, it appears to be pretty easy to track the number of mentions, which could be a useful metric.
  • It wasn’t clear how to share a dashboard, though that may be an upcoming feature.
  • The terms of use are, as always, pretty punishing.
  • Once you develop a number of custom metrics, you are tied to their platform.  That wouldn’t be so bad, except…
  • They are planning to charge for the service, but give no insight into what to expect.  There is a tab called ‘Billing’ but all it says is: “During our beta, StatsMix is free to use. After the beta, you’ll be able to manage your billing preferences on this page.”  If I were considering using this as part of my business, I would want much more insight into possible costs before I committed much time to custom metric buildouts.  I’m fine with them making money; I just want more insight into this key aspect of their web app.

All in all, it is well worth a try.  If you want to, let me know by posting a comment.  I have 5 invites to give out.

BrowserMob: Load test your applications using the cloud

Via this tweet from Matt Raible, I learned of BrowserMob.  This service allows you to easily load test your web application.

I set it up in about 2 minutes to do a simple load test of a client’s site (through 5 pages).  They make it free to ‘test drive’ their service (though the free version is not enough to actually stress your site).  It is extremely easy to test a path through a publicly facing system.

The report was good enough; you get screen captures of pages that have failures, and they do a good job of making some of the performance data pretty and intelligible.  Again, I didn’t really load test anything, so I didn’t examine the report as closely as I would have in a real world scenario.  The service is built using Selenium, and I believe they allow you to upload full featured Selenium tests (useful if you have already invested in this technology, but don’t want to build out a cloud network).

This service is of particular interest to me because last year I was part of a project that built a Selenium grid on Amazon EC2, using these instructions.

If we’d known about BrowserMob, I’m not sure we would have used them, as I don’t know what our budget was, but it would have been nice to have that in the evaluation mix.

[tags]browsermob, cloud services,load testing[/tags]

In source your EC2 instances

If you have built a killer application on Amazon Web Services, you may reach a point where you don’t want to continue to use them.  I can think of any number of reasons you may want to migrate your servers.

It may be because you’ve reached the 20 server instance limit, or because you want more control, or because you want to buy your own machines and spend money on a system administrator instead of paying Amazon, or because there’s something that you need customized that’s ‘behind the curtain’ of AWS.

For whatever reason, if you decide to move off Amazon’s elastic compute cloud, you probably should take a look at Eucalyptus (thanks to George Fairbanks for pointing this out to me!).  From the overview, this is an AWS compatible environment, so you can continue to use the same tools (Capistrano!) to manage your instances.  You also gain the same abilities to spin up or spin down servers easily.

What you don’t get is AMI compatibility.  That is, you can’t transfer your AMI to a Eucalyptus server farm and expect it to run.  They have a FAQ about AMIs (for 1.5, which is an older version of the software) that points to some forum posts about turning an AMI into an EMI (Eucalyptus Machine Image), but it doesn’t look like a trivial or easy operation.

Still, it’s good to know that it is possible, and that a company can have a migration path off AWS if need be.

[tags]eucalyptus, open source, freedom in the cloud[/tags]

Tips: Deploying a web application to the cloud

I am wrapping up helping a client with a build out of a Drupal site to EC2. The site itself is a pretty standard CMS implementation–custom content types, etc. The site is an extension to an existing brand, and exists to collect email addresses and send out email newsletters. It was a team of three technical people (there were some designers and other folks involved, but I was pretty much insulated from them by my client) and I was lucky enough to do a lot of the infrastructure work, which is where a lot of the challenge, exploration and experimentation was.

The biggest attraction of the cloud was the ability to spin up and spin down extra servers as the expected traffic on the site increased or decreased. We chose Amazon’s EC2 for hosting. They seem a bit like the IBM of the cloud–no one ever got fired, etc. They have a rich set of offerings and great documentation.

Below are some lessons I learned from this project about EC2. While it was a Drupal project, I believe many of these lessons are applicable to anyone who is building a similar system in the cloud. If you are building a video processing supercomputer, maybe not so much.

Fork your AMI

Amazon EC2 running instances are instantiations of a machine image (AMI). Anyone can create a machine image and make it available for others to use. If you start an instance off an image, and then the owner of the image deletes the image (or otherwise removes it), your instance continues to run happily, but, if you ever need to spin up a second instance off the same AMI, you can’t. In this case, we were leveraging some of the work done by Chapter Three called Project Mercury. This was an evolving project that released several times while we were developing with it. Each time, there was a bit of suspense to see if what we’d done on top of it worked with the new release.

This was suboptimal, of course, but the solution is easy. Once you find an AMI that works, you can start up an instance, and then create your own AMI from the running instance. Then, you use that AMI as a foundation for all your instances. You can control your upgrade cycle. Unless you are running against a very generic AMI that is unlikely to go away, forking is highly recommended.

Use Capistrano

For remote deployment, I haven’t seen or heard of anything that compares to Capistrano. Even if you do have to learn a new scripting language (Ruby), the power you get from ‘cap’ is fantastic. There’s pretty good EC2 integration, though you’ll want to have the EC2 response XML documentation close by when you’re trying to parse responses. There’s also some hassle involved in getting cap to run on EC2. Mostly it involves making sure the right set of ssh keys is in the correct place. But once you’ve got it up and running, you’ll be happy. Trust me.

There’s also a direct capistrano/EC2 integration project, but I didn’t use that. It might be worth a look too.

Use EBS

If you are doing any kind of database driven website, there’s really no substitute for persistent storage. Amazon’s Elastic Block Store (EBS) is relatively cheap. Here’s an article explaining setting up MySQL on EBS. I do have a friend who is using EC2 in a different, very write intensive manner, and who is having some performance issues with his database on EBS, but for a write seldom, read often website, like this one, EBS seems plenty fast.

EC2 Persistence

Some of the reasons to use Capistrano are that it forces you to script everything, and makes it easy to keep everything in version control. The primary reason to do that is that EC2 instances aren’t guaranteed to be persistent. While there is an SLA around overall EC2 availability, individual instances don’t have any such assurances. That’s why you should use EBS. But, surprisingly, the EC2 instances that we are using for the website haven’t bounced at all. I’m not sure what I was expecting, but they (between three and eight instances) have been up and running for over 30 days, and we haven’t seen a single failure.

Use ElasticFox

This is a Firefox extension that lets you do every workaday task, and almost every conceivable operation, to your EC2 instances. Don’t delay, use this today.

Consider CloudFront

For distributed images, CloudFront is a natural fit. Each instance can then reference the image, without you needing to sync files across instances. You could use this for other files as well.

Use Internal Network Addressing where possible

When you start an EC2 instance, Amazon assigns it two addresses–an external DNS name that can be used to access it from the internet, and an internal DNS name. For most contexts, the external name is more useful, but when you are communicating within the cloud (pushing files around, or a database connection), prefer the internal DNS. It looks like there are some performance benefits, but there are definitely pricing benefits. “Always use the internal address when you are communicating between Amazon EC2 instances. This ensures that your network traffic follows the highest bandwidth, lowest cost, and lowest latency path through our network.” We actually used the internal DNS, but it makes more sense to use the IP address, as you don’t get any abstraction benefits from the internal DNS, which you don’t control–that takes a bit of mental adjustment for me.
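
One way to apply this rule in configuration is to keep both names for each instance and hand out the internal one whenever the caller is also inside EC2. This is only a small sketch: the record shape (`publicDnsName`, `privateDnsName`), the sample hostnames and the `addressFor` helper are all invented for illustration.

```javascript
// Hypothetical instance record; the field names and hostnames are invented.
const dbServer = {
  publicDnsName: "ec2-203-0-113-10.compute-1.amazonaws.com",
  privateDnsName: "ip-10-0-0-10.ec2.internal",
};

// Pick the address a client should use to reach `instance`.
// Callers inside EC2 get the internal name; everyone else gets the external one.
function addressFor(instance, callerIsInsideEc2) {
  return callerIsInsideEc2 ? instance.privateDnsName : instance.publicDnsName;
}

console.log(addressFor(dbServer, true));  // ip-10-0-0-10.ec2.internal
console.log(addressFor(dbServer, false)); // ec2-203-0-113-10.compute-1.amazonaws.com
```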

Consider reserved instances

If you are planning to use Amazon for hosting, make sure you explore reserved instance pricing. For an upfront cost, you get significant savings on your runtime costs.

On Flexibility

You have a lot of flexibility with EC2–AMIs are essentially yours to customize as you want, starting up another node takes about 5 minutes, you control your own DNS, etc. However, there are some things that are set at startup time. Make sure you spend some time thinking about security groups (built in firewall rules)–they fall into this category. Switching between AMIs requires starting up a new instance. Right now we’re using DNS round robin to distribute load across multiple nodes, but we are planning to use elastic IPs, which allow you to remap a routable IP address to a new instance without waiting for DNS timeouts. EBS volumes and the instances they attach to must be in the same availability zone. None of this is groundbreaking news; it’s really just a matter of reading all the documentation, especially the FAQs.

Documentation

Be aware that there is a ton of documentation, one set for each API release, for EC2 and the other web services that Amazon provides. Rather than starting with Google, which often leads you to an outdated version of documentation, you should probably start at the AWS documentation center. This is especially true if you’re working with any of the newer systems, whose APIs may not be as stable.

In the end

Remember that, apart from new tools and a few catches, using EC2 is not that different from using a managed server where you don’t have access to the hardware. The best document I found on deploying Drupal to EC2 doesn’t talk about EC2 at all–it focuses on the architecture of Drupal (Drupal 5 at that) and how to best scale that with additional servers.

[tags]ec2,amazon web services,capistrano rocks[/tags]

Setting variables across tasks in capistrano

I am learning to love Capistrano–it’s a fantastic deployment system for remote server management.  I’m even learning enough Ruby to be dangerous.

One of the issues I ran into was I wanted to set a variable in one task and use it in another (or, more likely, in more than one other task).  I couldn’t find any examples of how to do this online, so here’s how I did it:

task :set_var do
  # localvar stands in for whatever value this task computes
  self[:myvar] = localvar
end

task :read_var do
  puts self[:myvar]
end

Note that myvar and localvar need to be different identifiers–“local variables take precedence”.  Also, the variable can be anything, I think.  I use this method to create an array in one task, then iterate over it in another.

[tags]capistrano, remote deployment, ruby newbie[/tags]

Amazon AMI search

It’s interesting to me that there is no Amazon Machine Image (AMI) search.  AMIs are virtual machine images that you can run on EC2, Amazon’s cloud computing offering.  Sure, you can browse the list of AMIs, but that doesn’t really help.  Finding an image seems to be haphazard, via a Google search (how I found this Alfresco image) or via the community around a product on an image (like this image for Pressflow, a high performance Drupal).

I’m not the only person with this complaint.  The Amazon EC2 API only provides limited data about various images, but surely some kind of search mechanism wouldn’t be too hard to whip up, if only on the image owner and platform fields.
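
As a rough idea of how little such a search would need, here is a sketch that filters a list of image descriptions on owner and platform, with an optional free text match. The record fields (`imageId`, `imageLocation`, `imageOwnerId`, `platform`) are loosely modeled on the kind of data the API returns for an image, and the sample data and the `searchImages` helper are invented; treat all of it as an assumption.

```javascript
// Invented sample data; field names are assumptions modeled on image metadata.
const images = [
  { imageId: "ami-11111111", imageLocation: "example-bucket/alfresco.manifest.xml", imageOwnerId: "111122223333", platform: "linux" },
  { imageId: "ami-22222222", imageLocation: "example-bucket/pressflow.manifest.xml", imageOwnerId: "444455556666", platform: "linux" },
  { imageId: "ami-33333333", imageLocation: "example-bucket/iis.manifest.xml", imageOwnerId: "111122223333", platform: "windows" },
];

// Return the images matching every criterion that is supplied.
// `text` is matched case-insensitively against the image location.
function searchImages(list, { owner, platform, text } = {}) {
  const needle = text ? text.toLowerCase() : null;
  return list.filter((img) =>
    (!owner || img.imageOwnerId === owner) &&
    (!platform || img.platform === platform) &&
    (!needle || img.imageLocation.toLowerCase().includes(needle))
  );
}

const hits = searchImages(images, { owner: "111122223333", platform: "linux" });
console.log(hits.map((img) => img.imageId)); // [ 'ami-11111111' ]
```

The hard part is not the filter but getting and refreshing the full list of public images, which this sketch leaves out entirely.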

Does anyone know where this exists?  My current best solution for finding a specific AMI is to use the fantastic ElasticFox Firefox plugin and just search free form on the ‘Images’ tab.

[tags]amazon, ec2, can I get a ‘search search'[/tags]