Using Amazon Mechanical Turk

chess-1215079_640So, after over a decade, I finally found a use case where I had the clout and the need to use mechanical turk. I wanted to write about my experiences.

What I used it for: We were looking for some data on businesses.  We had business name, city and state, and wanted full contact information.  We paid a dime for each listing, and asked for email address and physical address.  We asked about each listing twice so that we’d have some kind of double check.

How effective was it? This varied.  If you were using the master workers, it was very effective, but slower.  If you open it up to all workers, you have to review their work more closely.  The few times I rejected someone’s task, they wrote back and asked why and tried to make it right, which was a testament to the power of the system (it records rejections).  Make sure you break the work into a couple of smaller groups so you can iterate on your instruction set (when workers asked questions on the first set, the answers went into the instructions for the second set).  We still had to review all the listings and double check any that didn’t match between both task answers, but that was a lot quicker than googling for each business and doing the research ourselves.

How much did it cost? On the order of a couple hundred bucks to process around fifteen hundred listings.

What kind of time savings did we see? Assume we had 1500 business names, and it took us 90 seconds to google the business name and find the information.  That is 1500 listings * 1.5 minutes == 37.5 hours, and this is on the low end.  Instead, it took about 2-3 hours of setup, and then 36 hours of calendar time (when I was able to do other things like sleep and work on other problems), and we were done.  Then I would say it was about 7-10 hours of review. So you are trading a couple hundred bucks for at least 20 hours of saved time.

Would I do it again? I think mturk is perfect if your problem has the following three attributes: more money than time, a task that is extremely simple, and time to review the finished product.

Other tips? You have to build it some kind of sampling for correctness. I have no idea what the quality is if you pay more than a dime per task.  Make sure you think about edge cases.  Provide tips to your workers (“check whois records as well as google”).


Can I customize Sharetribe?

I’ve been talking to some people about customizing Sharetribe.  It’s a big piece of software and there are a lot of moving parts, but many of the people I’m talking to are business folks.  They chose Sharetribe because they have a marketplace in mind, and the software is just a tool.  However, all tools have their limits.

So, I thought it’d be fun to organize it as a ‘choose your own adventure’. Click the link to learn how to customize Sharetribe’s awesome marketplace software.


Getting started with sharetribe development–vagrant style

I have recently spent a fair bit of time working with Sharetribe, an open source, MIT licensed marketplace platform that also powers a hosted solution.

First, let me say that the software is the least significant piece of a marketplace (like AirBnB).  The least significant! (Check out the Sharetribe Academy for some great content about the other steps.)  

But it is still a necessary component.  If you can get by with the hosted solution to prove out your idea, I suggest you do so–$100/month is a lot cheaper than hours of software development. There may come a time when you want to customize the sharetribe interface beyond what javascript injection can do.  If this is the case, you need a developer.  And that developer needs an environment.  That’s what this post is really about.

The sharetribe github readme explains the installation process pretty well, but I find it tedious, so I created a quick start vagrant VM. This VM has a sharetribe installation ready to go.  I use vagrant 1.6.3 and Virtualbox 5–google around for instructions on how to get those up and running. The guest VM is Ubuntu 14.04. This VM uses rvm to manage ruby versions, but I couldn’t be bothered with nvm. It will install sharetribe 5.8.0 and all needed components.

Assuming you have vagrant and virtual box installed, download the Vagrant file and put it in the directory where you want to work. Edit it and change any options you’d like. The options I changed were the port forwarding (I set it to 3003), networking options, and the amount of memory used (I allocate 4GB).

Then run vagrant box add http://www.mooreds.com/sharetribe/sharetribe-base-mooreds.box --name sharetribe-base-mooreds to download the file. It’s downloading a large file from my (small) server, so expect it to take a while (hours).

Then run vagrant init.

When you can login (password is vagrant if you use vagrant putty or you can use vagrant ssh) you can go to the sharetribe directory and do the following:

  • fork the sharetribe repo
  • update your git remote so that your origin is your forked repo (and not mine, because you won’t get write access to mine)
  • create a branch in your repo off of the 5.8.0 tag. There’s one startup script I tweaked a bit, but you can just ignore those changes.
  • update your mysql password: SET PASSWORD FOR 'root'@'localhost' = PASSWORD('MyNewPass');
  • start mailcatcher listening to all interfaces: mailcatcher --ip=0.0.0.0
  • start the rails/react server: foreman start -f Procfile.static
  • visit lvh.me:3003 and start your server
  • set up your local super user and first marketplace
  • edit the url you are redirected to to have the correct external port on it (from the vagrant settings): from http://testdev.lvh.me:3000/?auth=baEOj7kFrsw to http://testdev.lvh.me:3003/?auth=baEOj7kFrsw for example

This is running sharetribe 5.8.0, and I’m sure there will be follow on releases. Here’s how to sync the releases coming from the sharetribe team with your current repo. I’ve taken the liberty of creating an upstream branch for you.

This doesn’t cover deploying the code anywhere–I’d recommend this gist. Make sure you read the comments! Or I can install a vanilla version of sharetribe to heroku for a flat fee–contact me for details.


Adding a sitemap to sharetribe

map-1434486_640I have using the excellent Sharetribe framework to build a marketplace for food businesses and commercial kitchens for my new startup, The Food Corridor.  However, it didn’t have support for generating a sitemap.xml file for all the listings available.

How is someone going to find the right kitchen space when they use google, but we don’t have a sitemap so google can keep apprised of all the options?

This wouldn’t do.  So, I added the ability to generate a sitemap for all the listings in the marketplace.

First off, install the gem–I used sitemap_generator as it seemed to do what I needed–allow me to call out certain routes and add them to my sitemap.  Then you need to create a configuration file, at config/sitemap.rb.  Mine looks like:


SitemapGenerator::Sitemap.default_host = "https://"+APP_CONFIG.domain

SitemapGenerator::Sitemap.create do
  Listing.where(deleted: false, open: true).find_each do |listing|
    add listing_path(listing), :lastmod => listing.updated_at
  end
end

 

Then I just ran bundle exec rake sitemap:refresh:no_ping and a sitemap.xml.gz was generated in my public directory.

If you are running on AWS or someplace else with a persistent filesystem, you can skip to the text starting with “Then, I scheduled”.

If you are running on a PAAS like Heroku, where you don’t get a persistent filesystem, you’ll want to push this generated file to a persistent place. I chose S3. Since sharetribe already has paperclip as a dependency, I used the instructions here and here, with a few modifications for sharetribe.

My rake task to upload the sitemap file was:


require 'aws'
namespace :sitemap do
  desc 'Upload the sitemap files to S3'
  task upload_to_s3: :environment do
    s3 = AWS::S3.new(
      access_key_id: ENV['aws_access_key_id'],
      secret_access_key: ENV['aws_secret_access_key']
    )
    bucket = s3.buckets[ENV['s3_bucket_name']]
      file = File.join(Rails.root, "public", "sitemap.xml.gz")
      path = "sitemap/sitemap.xml.gz"

      begin
        object = bucket.objects[path]
        object.write(file: file)
        object.acl=(:public_read)
      rescue Exception => e
        raise e
      end
  end
end


I then run the sitemap:refresh:no_ping and upload_to_s3 tasks in the same heroku scheduled task: rake sitemap:refresh:no_ping sitemap:upload_to_s3. If you don’t do that (and instead do separate dynos) then the upload task won’t have access to the file (because it will have been generated on the first dyno’s filesystem).

You also need to make sure to add a sitemap controller to redirect from yourdomain.com/sitemap.xml.gz to the S3 bucket (again, as outlined in the articles linked above.

Then, I scheduled a daily refresh of the sitemap.xml file and submitted the file to relevant search engines.

Things I didn’t do:

  • handle more than 50k urls
  • support multiple communities (not really needed for me, but I bet if the folks behind sharetribe.com wanted to use this, they’d want such support).
  • add the sitemap.xml file to my robots.txt file, as outlined here.

How to get a YouTube channel as a podcast for $15/month

Want to create a podcast from a YouTube channel, but don’t have access to the original video or don’t have the time to set up a podcast?

Here’s how to set this up, in 6 easy steps.

  1. Find the YouTube channel URL, something like: https://www.youtube.com/channel/UCsjtSWdw9t1XlBnyrV7mAOQ
  2. Note the part after channel: UCsjtSWdw9t1XlBnyrV7mAOQ
  3. Download and install YouCast.  This is a Windows only program, so if you don’t have access to Windows, or you want to have this accessible across the world without leaving your home PC on, I’d recommend signing up for Azure hosting.  You can have a decent server (an A0) for ~$15/month.  I just used a standard Windows 2008 server, but I did have to upgrade .NET.  This and this will be helpful for setup.
  4. Start up YouCast, and set the channel id to what you have above.  Select the audio output format, and generate the URL.  You’ll get a funky looking URL like http://hostname:22703/FeedService/GetUserFeed?userId=UCsjtSWdw9t1XlBnyrV7mAOQ&encoding=Audio&maxLength=0&isPopular=False
  5. Set YouCast up as a service.  I used NSSM.  Otherwise when your server reboots (as it occasionally will), you won’t have access to your podcast.
  6. Add the URL to your podcast catcher (I like Podcast Addict).

For bonus points, use a service like FeedBurner or RapidFeeds to capture stats about the podcast and make the URL nicer.speaker photo

I did this for one of my favorite YouTube channels, the Startup Therapist.  (I’ve asked Jeff at least once to start a podcast, and I hope he does soon.)

Not sure exactly what the issue is, but even though the podcast RSS feed has all the episodes in it, I can only download three podcasts per channel at a time.  I’ve no idea why–whether it is a limit of Podcast Addict, YouTube, YouCast or some combination.  But apart from that this solution works nicely.

 

 

 


Heroku drains

drain photoSo, I’ve learned a lot more than I wanted to about heroku drains. These are sinks to which heroku applications can write.  After the logs are out of heroku, you analyze these logs just as you would in any other application living outside of a PaaS.  Logs are very useful to see long term trends, debug, etc.  (I’ve worked both on a rails3 app and a java spring/camel app that are deploying to heroku.)

Here are some things I’ve learned:

  • Heroku drains are well documented.
  • You want definitely want them for any production application, because only 1500 lines of heroku logs are retained at any one time.
  • They can go to either syslog (great for applications with a lot of other infrastructure) or https (great for applications without as much infrastructure support).
  • They can’t do any kind of authorization.
  • You can’t know what ip address the logs are coming from, so you can’t limit access by IP.
  • There are third party extensions you can pay for to avoid dealing with drains at all (I’ve heard good things about papertrail.)
  • You can use logstash to pull heroku logs from a syslog drain into elastic search.
  • There are numerous github projects that can drain to databases, etc.  There’s even one that, with echos of Ouroboros, drains to another heroku app.
  • Drains have intelligent behavior if your listener (or listeners) fails.  From heroku support: “The short answer is yes, the drain will drop logs when the sink is not responsive, but this isn’t really the full story. There are a number of undocumented limits and backoff retries that happen when a drain connection is lost.”  And then they go on to explain how the backoff behaviour happens.  I’m not going to cut and paste their entire answer because I assume it is undocumented for a reason (maybe it changes, maybe they don’t want to commit to supporting this behavior).  Ask them yourself 🙂
  • A simple drain can be as easy as <?php error_log(file_get_contents('php://input'), 3, "/var/log/logfile.log"); ?>, but make sure you rotate that log file.
  • You can use puppet to manage drains if you are bringing servers up and down, using the heroku toolbelt and CLI authentication.

If you are deploying anything beyond a toy app on heroku, don’t forget the ops folks and make sure you set up your drain!


“Wave a magic wand”

wand photoThat was what a previous boss said when I would ask him about some particularly knotty, unwieldy issue. “What would the end solution look like if you could wave a magic wand and have it happen?”

For instance, when choosing a vendor to revamp the flagship website, don’t think about all the million details that need to be done to ensure this is successful. Don’t think about who has the best process. Certainly don’t think about the technical details of redirects, APIs and integrations. Instead, “wave a magic wand” and envision the end state–what does that look like? Who is using
it? What do they do with it? What do you want to be able to do with the site? What do you want it to look like?

Or if an employee is unhappy in their role, ask them to “wave the magic wand” and talk about what role they’d rather be in. With no constraints you find out what really matters to them (or what they think really matters to them, to be more precise).

When you think about issues through this lens, you focus on the ends, not the means.  It lets you think about the goal and not the obstacles.

Of course, then you have to hunker down, determine if the goal is reachable, and if so, plan how to reach it. I like to think of this as projecting the vector of the ideal solution into the geometric plane of solutions that are possible to you or your organization–the vector may not lie in the plane, but you can get as close as possible.

“Waving a magic wand” elevates your thinking. It is a great way to think about how to solve a problem not using known methods and processes, but rather determining the ideal end goal and working backwards from there to the “hows”.


The power of SQL–WP edition

lightning photoI’ve noticed a lot of comment spam lately, and had been dealing with it piecemeal. I have all comments moderated, so it wasn’t affecting the user experience, but I was getting email reminders to moderate lots of comments that were not English and/or advertising something.

WordPress lets you turn off comments for a post, but it is tedious through the GUI. So, I dove into the wordpress database, and ran this SQL query:

update wp_posts set comment_status = 'closed' where post_date < '2014-01-01';

This sets comments to closed for all posts from 2013 or earlier (over 1000 posts).  The power of the command line is the ability to repeat yourself effortlessly.  (The drawback of the command line is its obscurity.)

I should probably write a cron job to do that every year, as posts older than two years old aren’t commented on much. To be honest, the comments section of ‘Dan Moore!’ isn’t used much at all–I know some blogs (AVC) have vibrant comments areas, but I never tried to foster a community.


Twitversation: how much do you converse on Twitter?

twitter photo

Photo by eldh

You know what I said a few days ago?

I’d love to have stats on this to make myself more accountable, but I wasn’t able to find an easy way to show my Twitter usage (new tweets vs replys vs retweets)–does anyone know one?

Well, I didn’t find anything and thought it’d be fun to learn some of the Twitter API, a bit of Django, Bootstrap, and how to host something on Heroku.  So, I wrote an app, Twitversation, which gives you a rough approximation of how much you converse on Twitter, as opposed to broadcasting.  You can enter your Twitter username and it presents a breakdown graph and a numeric score (I’m 60 out of 100, whereas patio11 scores 78 and Gary V scores a hefty 83.

Twitversation only pulls the last 200 tweets, so it’s not canonical, but it should be enough to give you a flavor.  Sarah Allen has a post up about her score.

What’d I learn?  Among other things:

  • Heroku is super easy to get started on. And it’s free!  Perfect for your MVP.
  • Django has an unfortunate term for the C in the MVC (they call it a view).
  • You can create a pie graph using only CSS and HTML.
  • Side projects take longer than you think.
  • Picking a side project that doesn’t require any feeding is liberating.  Twitversation will keep running without any attention on my part, as opposed to my other side project.
  • Python’s dependency management is a bear for a newbie.  I didn’t have to do much with this project, because it had its own vagrant vm, but I saw some of the complexity out of the corner of my eye.  Makes me long for the JVM and classpaths, and I never thought I’d say that.
  • Catchy names are hard to come up with.

Hope you enjoy!



© Moore Consulting, 2003-2017 +