One day recently I woke up to a text from our monitoring service (part of our minimal Heroku operations tasks) saying our application was down.
Here’s a recreation of my steps in troubleshooting the issue:
- Log in, see that the app is indeed down.
- Look at New Relic.
- Look at the logs (Papertrail!).
- Look at the deployment history.
- Note when the issue started (curses, not when I did a deploy).
- Open a ticket with Heroku (after doing some research). (Love their support.)
- Double check that the database is good and hasn’t hiccuped.
- Look at the logs more.
- Add more dynos, see if that helps.
- Google the error message.
- Notice that the issue is stated right in the log file (the Passenger request queue filling up). Here are sample error messages:
  - <app-name> heroku/router:
  - <app-name> app/web.3:
- Find some posts about the error message. Here and here.
- Start researching how to increase request queue size.
- Take a walk to clear my head.
- Think about what external services we call, as that seems to be what might cause the request queue to back up.
- Read another post that says restarting Passenger helped.
- Restart all dynos.
- Problem disappears.
- Look at logs more closely.
- Last dyno to be restarted was the only problematic dyno.
- Add comment to ticket about this being the cause.
- Heroku confirms that the issue may have been the dyno: “sometimes individual dynos will hang and cause errors with 503 responses”
- Write note to customers about the issue explaining how access to app was affected.
- Lower number of dynos.
- Breathe a sigh of relief.
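
In hindsight, the "look at logs more closely" step could have been done much earlier with a small script. Here is a minimal sketch in Python that counts 5xx responses per dyno, so a single misbehaving dyno stands out. It assumes router log lines carry `dyno=` and `status=` key=value pairs. The function name `count_5xx_by_dyno` and the sample lines are made up for illustration and are not taken from our actual logs.

```python
import re
from collections import Counter

# Assumption: each router line contains key=value pairs such as
# dyno=web.3 and status=503. Adjust the patterns if your logs differ.
DYNO_RE = re.compile(r"\bdyno=(\S+)")
STATUS_RE = re.compile(r"\bstatus=(\d{3})\b")


def count_5xx_by_dyno(lines):
    """Return a Counter mapping dyno name to its number of 5xx responses."""
    counts = Counter()
    for line in lines:
        dyno = DYNO_RE.search(line)
        status = STATUS_RE.search(line)
        # Skip lines that are not router lines or are not server errors.
        if dyno and status and status.group(1).startswith("5"):
            counts[dyno.group(1)] += 1
    return counts


# Invented sample lines, only shaped like router output.
SAMPLE_LINES = [
    'heroku/router: at=info method=GET path="/" dyno=web.1 status=200',
    'heroku/router: at=error method=GET path="/" dyno=web.3 status=503',
    'heroku/router: at=error method=GET path="/a" dyno=web.3 status=503',
    'heroku/router: at=info method=GET path="/b" dyno=web.2 status=200',
]

if __name__ == "__main__":
    # Worst offenders first.
    for dyno, n in count_5xx_by_dyno(SAMPLE_LINES).most_common():
        print(dyno, n)
```

Run against a saved log export instead of the sample list, a lopsided count like this would point at one dyno rather than at the app as a whole, which suggests restarting just that dyno before adding more.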