Skip to content

Building a Bridge as Your Clients Walk Across It

There was an interesting article posted to hacker news about the nuts and bolts of a SaaS product that you might not expect (article, discussion).  I commented based on my experience that the early days of a SaaS product are like building a bridge while your clients are walking across it.  You want the bridge to be far enough ahead of your clients that they won’t fall off it.  But, not so far that if you or they want to go a different direction you’ve wasted time and materials building a useless walkway section.

So, don’t build features your customers aren’t going to use. But do build features they are going to need. How do you know what the difference is?

  • you can ask them.  This is the only way to start unless you are a target user of your SaaS product (in which case, ask yourself).  Depending on the technical sophistication of your users, you may or may not get good requirements, but there’s no better way to understand their pain.  They will speak very confidently about their pain, however they will also try to give you suggested solutions.  Don’t take those as gospel, as they may not have thought through the ramifications of said solutions.  Find them by looking where they congregate online (facebook groups, forums, reddit).  Targeted email may be OK if you have a relationship.
  • you can build a placeholder.  This is a great way to see if folks want the feature, if you have some folks using your app.  We built a placeholder for document management: “email us and we’ll upload your documents”.  After a few emails, we knew it would be worth it to build out some way for folks to self serve.
  • you can build a MVF (minimum viable feature).  A feature does not need to spring from your mind fully tested, polished and automated.  Sprinkle in manual steps, use emails to people instead of automation, or release only a subset of a feature.  The goal here is again to see usage before you fully develop it.  Another benefit is that the MVF may be all that is needed.
  • you can wait until clients ask for it.  The value of this depends on when they need what they ask for.  If they need it when they ask for it, then it’s just another data point (“thanks for the request.  We’ve noted it in our roadmap”).  If they need it a week or a month after they ask for it, then you can actually build it for them.

It actually can be quite helpful to checkpoint feature usage every so often.  I’ve seen this done two different ways, though I’m sure there are more.  The first is to look at the data and see what features clients are using.  This is nice because it just takes developer time, digging through your OLTP database.  Make sure you write down the results and the queries.  However, this won’t work until you have some users who’ve been using your system for some period of time.  The second is to schedule user interviews and watch your clients or prospects use your system.  This is time intensive, but can lead to many many insights and gives you definite user empathy.

Now, this type of development doesn’t free you from having a strategy. You need to pop your head up every three months or so and revisit the strategy and see if your business is working toward it. But if you are a completionist than early stage SaaS is not for you.

Smashing: A Quick Dashboard Solution

I’m putting together a business metrics dashboard for The Food Corridor (what is old is new again, I remember a project at XOR, my first job out of school, that was all about creating a dashboard). I could have just thrown together some rails views, but I looked around and saw Smashing, which is a fork of Dashing, a dashboard project that came out of shopify.

Smashing is a sinatra app and is fairly simple to set up. It looks gorgeous, a lot better than anything I could hack together. I could install it on a free heroku dyno. Even though it will take a bit of time to spin up, it is now running for free. Smashing has nice MVC separation–you have dashboards which assemble widgets, and then jobs which push data to widgets on a schedule. Sending data looks something like this: send_event('val', { current: current }) where val is referenced in the widget.

You can create more than one dashboard (I did only one). They aren’t customizable by non developers, but once the widgets are written, they can be created by someone with a modicum of experience editing HTML.

Some tips:

  • Smashing stores its state in a file. If you are running on heroku, the filesystem is ephemeral. You have two options. You can store the state in an external data store like redis (patch mentioned here, I didn’t try it). Or you can rely on the systems you are polling for metrics to maintain the state. That’s the path I took.
  • The number widget has the ability to display percentage changed since last updated: send_event('val', { current: current, last: last }). Make sure that val is an integer–I sent a string like “100000” and that was treated as a zero for purposes of calculation.
  • If you are accessing any external systems, make sure you inject any secrets via environment variables.  For local development, I used dotenv.
  • You’ll want some kind of authentication system.
  • The widgets that come with Smashing aren’t complicated, but neither are they documented, so prepare to spend some time understanding what they expect.
  • I grouped jobs, which gather the data, by data source.  You can send multiple events per job, and I thought that made it clearer.  Connections to APIs or databases only needed to happen once as well.
  • The business metrics which I was displaying really only change on a monthly basis.  So I wanted to run the data gathering immediately, then in a week or two weeks.  Because of the ephemeral state, I expect the second run will never happy, but wanted to be prepared for it.  I did so by creating a function and calling it once on job load and then in the scheduler.

Here’s pseudo code for the job that pulls data from stripe:


Stripe.api_key = ENV['STRIPE_SECRET_KEY']

def stripe
  # pull data from stripe...
  send_event('stripeval', { current: current })
end

stripe

SCHEDULER.interval '1w' do
  stripe
end

Smashing is no full on technical metrics solution (like Scout or New Relic), but can be useful for displaying limited data in a beautiful format with a minimum of developmetn effort. If you’re looking for a dead simple dashboard, Smashing will work for you.

Interview with a early stage SaaS founder

I had the chance to talk with my good friend and former colleague Corey Snipes about his SaaS project. He recently launched Meeting Star, a lightweight SaaS tool to help coordinate tech meetups. This interview has been lightly edited for clarity.

——-

Why did you come up with this product?

It began as a desire to fill my own need, for a lightweight and inexpensive place to manage small local tech events. It seemed like a fairly straightforward set of features, which wouldn’t take long to build and release as a product. (Don’t they always?) I was aware of several other tech meetup organizers who were looking for an alternative to meetup.com (often due to price, sometimes due to dislike of the feature set or UI). I did quite a bit of research last fall and didn’t find any suitable alternatives so I built it.

What do you hope to achieve with this app?

I have a few parallel interests here. I want a tool that’s useful to me as a meetup organizer. I want to leverage my experience building, marketing, and operating other software products — both my own, and for customers I’ve had over the years. I want to add a business line in my portfolio that provides value and makes people happy, while also being financially sustainable. I also run a separate, conference-related application and I anticipate some complementary lift between the two.

How much research did you do before plunging in and writing it?

Quite a bit. You can always do more, of course. I poked around online and found several lists, articles, and discussion threads about alternative platforms. I followed conversations of other meetup organizers discussing the relative merits of various methods. I made a list and tried seven or eight of what seemed like the top contender products. I was looking for a place to run my meetups, though, not specifically looking at competition. But in the end, everything was either trying to be a full-featured community management piece, or was such a terribly crafted alternative that I felt there were exactly zero real usable options for what I wanted to do.

Who is the product aimed at?

This particular [app] was born of my own needs, and I tend toward tech and entrepreneurship meetups. It’s well-suited to tech and software meetups. Those are the people in my network. Those are the meetups I attend and organize, and those are the users whose needs I can most easily identify and meet. Since it’s a reasonably lightweight application, it’s actually well-suited to many different kinds of groups, but software/tech/biz meetups are my focus.

What would make you consider this product a success?

Right now, success is getting ten paying, happy customers and getting things dialed in to their needs. I subscribe to Patrick McKenzie’s wisdom that the first ten customers are critical for turning your idea into something people want to use. And also, for proving it’s not a fluke. If you can get to ten, you can get to a hundred. And if you can get to a hundred, you can get to a thousand. For me, long-term success is a useful, sustainable product that has revenue to turn into improvements, runs smoothly, and maybe also puts a little money in my kids’ college fund.

——-

You can find out more about Corey, including why he’s moving to Cleveland, at his website.

If the images in your Rails image_tag calls don’t have a checksum…

This last week, I spent a lot of time learning about how Rails serves static files, how it interacts with a CDN like CloudFront, and how misconfigurations can really screw up your application.  I wanted to document this here so that if I run into these situations again, I can troubleshoot them more easily.

Problem #1: We did an large release, with a lot of moving pieces

We (The Food Corridor) recently engaged with a consulting company to do a refresh of the look and feel of our application.  They don’t just do UX and design, but also implementation using overseas developers and QA.  I was excited to let said developers focus on look and feel (not my strength) but made a mistake in not setting up an entire environment for them.  Instead, I let them use our staging environment.  Things took longer than predicted (as they always do, and some of it was due to my availability).  I was doing some tidying up, changing the deploy process, etc, and ended up merging a lot of changes into our codebase.  This meant that the first release had a lot more risk that previous releases–there was some of the new look and feel as well as all my changes.

That made troubleshooting any issues that came up with the release difficult, because it wasn’t clear what caused it–was it deployment changes, some of my code, some of the look and feel code?

Solution #1: Set up a ‘review’ environment for any long running branches.  I haven’t gone the full ‘review app’ path yet, but based on the docs, it doesn’t seem too hard to set up.  For now we just have one review environment that can be shared.

Problem #2: Certain images appeared on staging but not on production.

This was one of the issues that caused some heartache during the first release.  There were some images (svgs, to be exact) that were present on staging and not on production (from the browser, you’d get a 404).  But staging and production had the exact same codebase, including images.  We were doing a fresh deploy to production.  What was the issues?

I rolled back a lot of the changes I’d made to make sure it wasn’t a deployment issue (turning off pipelines, unsetting environment variables, etc).  No love.  We were still seeing 404s.  Looking at the HTML, on production the image had a different name, without the hash at the end.  From slack:

Interesting. on prod the envelope image is here: &ltlimg alt=”Message” src=”https://d1soonciftqo56.cloudfront.net/images/tfc/message.svg”>

and on staging it is here: <img alt=”Message” src=”https://d1wspyydkkjqvw.cloudfront.net/assets/tfc/message-c9189c257a23964ea6b97b89416b25a4.svg”>
… the svg isn’t being compiled on production for some reason

That led me to learn about the heroku asset pipeline and in particular the way rails4 apps are treated.  I then dove into the compiled slug, and then saw that the message-c9189....svg image was present in the staging slug under public/assets but that there was no message.svg file in production. There was, however, a message-text-89ab....svg file.

Solution #2: The staging environment had some old copies of the image files. The files had been renamed, but the image_tag calls still referenced those old files. We had to update them. I also updated the build process to run a heroku repo purge-cache every time staging was built so that we wouldn’t have any of those old files lying around, using the heroku repo plugin/. (It’s fine to make a mistake, but try not to make the same mistake twice.)

Useful Tool: Delighted for NPS reports

So, I was looking to add a simple widget to collect net promoter scores (often called NPS).  I was astonished at the dearth of options for standalone NPS tracking.  I assume that most customer relationship software has an NPS tracker embedded in it, but we didn’t want to use anything other than a simple standalone widget.

Two options that popped up: Delighted and Murm.  I did a quick spike to evaluate each of these tools.  At the time I reviewed it, I couldn’t get Murm to work, and they didn’t respond to my customer support request.  Delighted, on the other hand, did so.  They let me have access to web display, which was a beta feature at the time and worked with us on price.  It was trivial to install, though I’m still a bit unclear how the form determines when to display.  Highly recommended.

The nice thing about having a NPS tracker on your website is you can get direct feedback from your users.  This has led to numerous useful conversations and feature requests, as someone who is using our software for the first time brings new clarity to confusing features or user interface.  Plus, it is a great number to track.

Heroku 503 errors

One day recently I woke up to a text from our monitoring service (part of minimal heroku operations tasks) saying our application was down.

Darn it.

Here’s the recreation of my steps in troubleshooting this issue:

  • Login, see that the app is indeed down.
  • Look at newrelic.
  • Look at the logs (papertrail!).
  • Look at the deployment history.
  • Note when the issue started–curses, not when I did a deploy.
  • Open a ticket with heroku (after doing some research). (Love their support.)
  • Double check that the database is good and hasn’t hiccuped.
  • Look at the logs more.
  • Add more dynos, see if that helps.
  • Google the error message.
    •  <app-name> heroku/router: at=info method=POST path=”<url>” host=app.thefoodcorridor.com request_id=8a17648f-2d84-46ea-abf2-5903be894a2c fwd=”216.191.191.58″ dyno=web.3 connect=1ms service=4ms status=503 bytes=477 protocol=https“Notice that the issue is being stated right in the log file (passenger request queue filling up).  Here are sample error messages?
    • <app-name> app/web.3: [ 2017-07-12 14:47:44.3688 65/7f652dffd700 age/Cor/Con/CheckoutSession.cpp:261 ]: [Client 3-281] Returning HTTP 503 due to: Request queue full (configured max. size: 100)
  • Find some posts about the error message.  Here and here.
  • Start researching how to increase request queue size.
  • Talk a walk to clear my head.
  • Think about what external services we call, as that seems to be what might cause the request queue to back up.
  • Read another post that says restarting passenger helped.
  • Restart all dynos.
  • Problem disappears.
  • Look at logs more closely.
  • Last dyno to be restarted was the only problematic dyno.
  • Add comment to ticket about this being the cause.
  • Heroku confirms that the issue may have been the dyno: “sometimes individual dynos will hang and cause errors with 503 responses”
  • Write note to customers about the issue explaining how access to app was affected.
  • Lower number of dynos.
  • Breath a sigh of relief.

Kibana Visualizations that Change With Browser Reload

I ran into a weird problem with Kibana recently.  We are using the ELK stack to ingest some logs and do some analysis, and when the Kibana webapp was reloaded, it showed different results for certain visualizations, especially averages.  Not all of them, and the results were always close to the actual value, but when you see 4.6 one time and 4.35 two seconds later on a system under light load and for the exact same metric, it doesn’t inspire confidence in your analytics system.

I dove into the issue.  By using Chrome Webtools, I noticed that the visualizations that were most squirrely were loaded last.  That made me suspicious that there was some failure causing missing data, which caused the average to change. However, the browser API calls weren’t failing, they were succeeding.

I first looked in the Elastic and Kibana configuration files to see if there was any easy timeout configuration values that I was missing.  But I didn’t see any.

I then tried to narrow down the issue.  When it was originally noted, we had about 15 visualizations working on about a months worth of data.  After a fair bit of URL manipulation, I determined that the discrepancies appeared regularly when there were about 10 visualizations, or when I cut the data down to four hours worth.  This gave me more confidence in my theory that some kind of timeout or other resource constraint was the issue. But where was the issue?

I then looked in the ElasticSearch logs.  We have a mapping issue, related to a scripted field and outlined here, which caused a lot of white noise, but I did end up seeing an exception:


org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.search.action.SearchServiceTransportAction$23@3c26b1f5
        at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:62)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
        at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:79)
        at org.elasticsearch.search.action.SearchServiceTransportAction.execute(SearchServiceTransportAction.java:551)
        at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:228)
        at org.elasticsearch.action.search.type.TransportSearchCountAction$AsyncAction.sendExecuteFirstPhase(TransportSearchCountAction.java:71)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:176)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.start(TransportSearchTypeAction.java:158)
        at org.elasticsearch.action.search.type.TransportSearchCountAction.doExecute(TransportSearchCountAction.java:55)
        at org.elasticsearch.action.search.type.TransportSearchCountAction.doExecute(TransportSearchCountAction.java:45)
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:75)
        at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:108)
        at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:43)
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:75)
        at org.elasticsearch.action.search.TransportMultiSearchAction.doExecute(TransportMultiSearchAction.java:62)
        at org.elasticsearch.action.search.TransportMultiSearchAction.doExecute(TransportMultiSearchAction.java:39)
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:75)
        at org.elasticsearch.client.node.NodeClient.execute(NodeClient.java:98)
        at org.elasticsearch.client.FilterClient.execute(FilterClient.java:66)
        at org.elasticsearch.rest.BaseRestHandler$HeadersAndContextCopyClient.execute(BaseRestHandler.java:92)
        at org.elasticsearch.client.support.AbstractClient.multiSearch(AbstractClient.java:364)
        at org.elasticsearch.rest.action.search.RestMultiSearchAction.handleRequest(RestMultiSearchAction.java:66)
        at org.elasticsearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:53)
        at org.elasticsearch.rest.RestController.executeHandler(RestController.java:225)
        at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:170)
        at org.elasticsearch.http.HttpServer.internalDispatchRequest(HttpServer.java:121)
        at org.elasticsearch.http.HttpServer$Dispatcher.dispatchRequest(HttpServer.java:83)
        at org.elasticsearch.http.netty.NettyHttpServerTransport.dispatchRequest(NettyHttpServerTransport.java:329)
        at org.elasticsearch.http.netty.HttpRequestHandler.messageReceived(HttpRequestHandler.java:63)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.http.netty.pipelining.HttpPipeliningHandler.messageReceived(HttpPipeliningHandler.java:60)
        at org.elasticsearch.common.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:88)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.handler.codec.http.HttpChunkAggregator.messageReceived(HttpChunkAggregator.java:145)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.handler.codec.http.HttpContentDecoder.messageReceived(HttpContentDecoder.java:108)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:459)
        at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:536)
        at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)

Which led me to this StackOverflow post.  Which led me to run this command on my ES instance:


$ curl -XGET localhost:9200/_cat/thread_pool?v
host            ip           bulk.active bulk.queue bulk.rejected index.active index.queue index.rejected search.active search.queue search.rejected
ip-10-253-44-49 10.253.44.49           0          0             0            0           0              0             0            0               0
ip-10-253-44-49 10.253.44.49           0          0             0            0           0              0             0            0           31589

And as I ran that command repeatedly, I saw the search.rejected number getting larger and larger. Clearly I had a misconfiguration/limit around my search thread pool. After looking at the CPU and memory and i/o on the box, I could tell it wasn’t stressed, so I decided to increase the queue size for this pool. (I thought briefly about modifying the search thread pool size, but this article warned me off.)

This GH issue helped me understand how to modify the threadpool briefly so I could test the theory.

After making this configuration change, search.rejected went to zero, and the visualization aberrations disappeared. I will modify the elasticsearch.yaml file to make this persist across server restarts and re-provisions, but for now, the issue seems to be addressed.

Five rules for troubleshooting an unfamiliar system

trouble photo
Photo by Ken and Nyetta

A few weeks ago, I engaged with a client who had a real issue.  They sold a variety of goods via a website (if this was the 90s, they would have been called an ‘e-tailer’), and had been receiving intermittent double orders through their ecommerce system.  Some customers were charged two times for one order.  This led, as you can imagine, to very unhappy customers.  This had been happening for a while and, unfortunately, due to some external obstacles, internal staff were not available to investigate the issue–they had their hands full with an existing higher priority project.

I was called in to see if I could solve this issue.  I had absolutely no familiarity with the system.  But in less than ten hours of time, I was able to find the issue and resolve it.  How I approached the situation can be summed up in five rules:

Number one: define the problem.  Ask questions, and capture the answers.  What is the exact undesired behavior?  When is the undesired behavior happening?  What seems to trigger it?  When did it start?  Were there any changes that happened recently?  Does the client have reproduction steps?

I gathered as much information as I could, but keep it high level.  I asked for architecture and system diagrams.  For the history of the application.  For access to all systems that could possibly be relevant (this will save you time in the future).  For locations of log files, source repositories, configuration files.  For database credentials and credentials for third party systems like CC processors.  It is important at this time to resist the temptation to dive in–at this point the job is to get a high level understanding so I can be efficient in the next steps.

You will get speculation about what the solution is when you are asking about the problem.  Feel free to capture that, but don’t be influenced by it.

Number two–find the finish line.  After getting a clear definition of the problem, I looked in the orders database and find out if the double orders were showing up there.  They were, which was a clue as to which part of the system was malfunctioning, but more importantly let me see the effectiveness of any changes I was making.  It also lets the customer know the objective end goal, which can be important if this is a t&m project, and it let me know the end state to which I was headed–important for morale.  (BTW, don’t do fixed bids for this type of project–overruns will be unpleasant, and there will be overruns.)

I was able to write a SQL script to find double orders over a given time frame.  I ended up writing a script which emailed the results of this query to myself and the client nightly, as an easy way to track progress.  The results of this query were a quantifiable, objective measure of the problem.

Number three–start where you are familiar.  I could have dove in and looked at the codebase, but due to my problem definition, I knew that there had been no changes to the checkout portion of the code base for years.  I also was unfamiliar with the particular software that managed the ecommerce site and could have wasted a lot of time getting up to speed on the control flow.  Instead, once I had the SQL query, I could find users that had been double charged, and look at their sessions in the web server logs.  I’ve been looking at apache http logs for over a decade and was very familiar with this piece of the system.

Number four–follow your nose. I followed a few of the user sessions using grep and noticed some weirdness in the logs.  There were an awful lot of messages that indicated the server had been restarted, and all the double orders I looked at had completed 5-6 seconds after the minute changed.  (It’s hard to define weirdness explicitly, which is why it behooved me to start with a portion of the system that I was experienced with–it made the “weirdness” more obvious.)  From here, I ended up looking at why or how the server was being restarted regularly.  Ended up finding an errant cron job which was restarting the server often enough that the ecommerce system was getting confused and double booking orders–once before the restart and once after.  This was easily fixed by commenting out the cron job.

Number five–know when to stop.  This ecommerce system obviously had a logic flaw–after all, restarting the web server shouldn’t cause an order to be entered twice, whether you restart it every hour or once a year.  I could have dug through the code to find that out.  But instead, I commented out the cron job, let the system run for a week or so and waited for more double orders.  There were none, indicating that the site was low traffic enough that whatever flaw was present didn’t get exercised often, if at all.  I confirmed with the client that this situation met his expectations of completeness, and called it good.

Being thrown into a new system, especially when troubleshooting, is a difficult task.  I am thankful the client was relatively responsive to my questions, and that pressure, while present, wasn’t intense.  These five steps should help you, if you are put in any troubleshooting situation.

Twitversation: how much do you converse on Twitter?

twitter photo
Photo by eldh

You know what I said a few days ago?

I’d love to have stats on this to make myself more accountable, but I wasn’t able to find an easy way to show my Twitter usage (new tweets vs replys vs retweets)–does anyone know one?

Well, I didn’t find anything and thought it’d be fun to learn some of the Twitter API, a bit of Django, Bootstrap, and how to host something on Heroku.  So, I wrote an app, Twitversation, which gives you a rough approximation of how much you converse on Twitter, as opposed to broadcasting.  You can enter your Twitter username and it presents a breakdown graph and a numeric score (I’m 60 out of 100, whereas patio11 scores 78 and Gary V scores a hefty 83.

Twitversation only pulls the last 200 tweets, so it’s not canonical, but it should be enough to give you a flavor.  Sarah Allen has a post up about her score.

What’d I learn?  Among other things:

  • Heroku is super easy to get started on. And it’s free!  Perfect for your MVP.
  • Django has an unfortunate term for the C in the MVC (they call it a view).
  • You can create a pie graph using only CSS and HTML.
  • Side projects take longer than you think.
  • Picking a side project that doesn’t require any feeding is liberating.  Twitversation will keep running without any attention on my part, as opposed to my other side project.
  • Python’s dependency management is a bear for a newbie.  I didn’t have to do much with this project, because it had its own vagrant vm, but I saw some of the complexity out of the corner of my eye.  Makes me long for the JVM and classpaths, and I never thought I’d say that.
  • Catchy names are hard to come up with.

Hope you enjoy!

400 error on heroku git:clone

heroku photo
Photo by jacobian

I’m working on an estimate for changes to a heroku hosted web application.

I was trying to run heroku git:clone --app appname after adding myself as a collaborator. I was running this on a new vagrant box running ubuntu.

However, I kept getting this error message:

vagrant@precise64:~$ heroku git:clone --app appname
Cloning from app 'appname'...
Cloning into 'appname'...
error: The requested URL returned error: 400 while accessing https://git.heroku.com/appname.git/info/refs
fatal: HTTP request failed

And I couldn’t understand why.

After some fiddling, I determined that first you need to have an ssh key generated:

vagrant@precise64:~$ ssh-keygen -t rsa

And then you can run:

vagrant@precise64:~$ heroku git:clone --app appname --ssh-git
Cloning from app 'appname'...
Cloning into 'appname'...
Warning: Permanently added the RSA host key for IP address 'ipaddress' to the list of known hosts.
Fetching repository, done.
remote: Counting objects: 902, done.
remote: Compressing objects: 100% (473/473), done.
remote: Total 902 (delta 433), reused 826 (delta 379)
Receiving objects: 100% (902/902), 28.06 MiB | 556 KiB/s, done.
Resolving deltas: 100% (433/433), done.

Hope this helps someone, somewhere.