...or the aftermath of a DNS downtime
Some of you will have noticed that Planio was unavailable for a large part of our user base starting at 21:35:27 UTC on Friday, November 2nd. First and foremost, we're deeply sorry about this and about the inconvenience that Planio's unavailability may have caused you. And while it wasn't the servers under our control that failed, we obviously did fail in making a poor choice of DNS provider. This is entirely our fault and there's nobody else to blame. We are - I am - truly sorry.
In this blog post, I would like to explain exactly what happened, what we did to resolve the problem short-term, and what we will do to address these types of situations so that they'll hopefully never happen again.
When our monitoring alerted us for the first time at 21:35:32, I was working on my laptop and immediately went to check whether there was anything wrong with our servers. Since I'm logged in to at least a couple of Planio servers almost all the time during the day, it took me some time to realize it wasn't actually our servers that weren't responding. In fact, the servers were quite idle and it seemed as if not many users were currently using them. I quickly went on to check our network infrastructure, only to find that everything looked okay. I was also able to connect to the Planio web site and our own Planio account, which left me puzzled for a minute or so. Only after I tried to connect to another remote server I hadn't used in a while, to check the situation from that point of the globe, did I notice that my laptop was unable to resolve its IP address from the hostname I provided. I quickly tried connecting using the IP address itself and had no problems accessing the machine. At this point, it was obvious that the DNS servers for the plan.io domain weren't resolving names.
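The diagnosis described above boils down to two independent questions: does the hostname still resolve, and does the server still answer on its known IP address? The following is a minimal Python sketch of that reasoning, not the tooling we actually used. The function names, the result labels, and the default port 443 are assumptions made for illustration.

```python
import socket
from typing import Callable, Optional


def resolve(hostname: str) -> Optional[str]:
    """Return one IPv4 address for hostname, or None if resolution fails."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def can_connect(ip: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to ip:port can be opened."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections and timeouts
        return False


def diagnose(
    hostname: str,
    known_ip: str,
    resolver: Callable[[str], Optional[str]] = resolve,
    connector: Callable[[str], bool] = can_connect,
) -> str:
    """Classify an outage by separating name resolution from reachability.

    known_ip is an address recorded while everything was healthy
    (hypothetical input; it has to come from your own records).
    """
    resolved = resolver(hostname)
    reachable = connector(known_ip)
    if resolved is None and reachable:
        return "dns-failure"  # the situation described in this post
    if resolved is None:
        return "dns-and-server-unreachable"
    if not reachable:
        return "server-unreachable"
    return "ok"
```

Keep in mind that a locally cached DNS answer can make `resolve` succeed even while the authoritative servers are down, which is exactly why I could still reach Planio from my own laptop.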
Folks, our DNS provider @hosteurope has a problem currently and for some of you Planio may be unavailable. No worries, servers are fine.— Planio (@planio) November 2, 2012
For DNS, we're not running our own servers.
We have been using HostEurope, with whom we've also registered most of our domain names. A short phone call later, I had confirmation that HostEurope was indeed having problems with their DNS infrastructure. However, nobody on their staff was able to tell me what the problem was or how long it would take to fix. At least Twitter showed that we weren't the only ones affected by the outage, but this didn't help solve the problem at hand: all Planio users who hadn't been using Planio in the last couple of minutes before the DNS servers went down were likely unable to connect, since their browsers had to ask a DNS server for the IP address of plan.io and couldn't get one while the DNS servers at HostEurope were down. Most users who had been using Planio shortly before the outage were in luck, though: their browsers or local DNS caches had cached the IP addresses for Planio and the service was working as expected.
As soon as I got off the phone call with HostEurope, the team and I thought about possible actions. Unfortunately, there's no way we could have helped fix their DNS servers. Their DNS infrastructure serves hundreds of thousands of clients and millions of domains. Obviously, they won't let anyone lay a hand on them and I’m sure their engineers were also already hard at work at that time. In this case, too many cooks would have definitely spoiled the broth. Since we didn't have NS records in place for backup DNS servers other than the ones operated by HostEurope, we decided to switch DNS servers entirely. This was admittedly a pretty bold move, but with no information on the downtime from HostEurope and the clock ticking, it seemed to be our best option.
We had been using the services of Dyn for over 10 years for smaller DNS-related things and had always been impressed with their professional infrastructure and service. So, less than 5 minutes after getting off the phone, but already 10 minutes into the DNS downtime, we had set up DNS records for the most important Planio hostnames at Dyn and now had to switch the NS records at NIC.IO, the official registry for .io domain names. At 21:50:54, almost exactly 15 minutes after the DNS downtime started, we were done and back up and running.
We've switched to a different DNS provider. Changes are propagating. All our Pingdom probes are already green again.— Planio (@planio) November 2, 2012
Well, technically: As you may know, DNS is a huge system of hierarchically structured servers. Each server gets its information from one or many other servers around the globe and it obviously takes some time for updated information to propagate to all servers.
This is why we offered a quick fix via Twitter shortly after recognizing the problem. Since it wasn't our servers that couldn't do their work but rather your browsers that couldn't “find” them, a simple change to a local file on your hard disk was able to solve the problem for as long as the change had not yet propagated to all DNS servers. The hosts file can be used to tell a computer which IP address to use for a certain hostname, regardless of what any DNS server may or may not have to say. So, for those of you checking our tweets or getting in touch with us directly by phone or direct message, we were able to provide help and the perceived downtime was a lot shorter.
We're working on a temporary set up. Planio's main IP address is 18.104.22.168 if you want to set up hosts entries... Apologies for that!— Planio (@planio) November 2, 2012
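For readers who have never edited a hosts file: each entry is a single line with the IP address first, followed by one or more hostnames. The file usually lives at `/etc/hosts` on Linux and macOS and at `C:\Windows\System32\drivers\etc\hosts` on Windows. Here is a small Python sketch that builds such a line. The helper name `hosts_entry` is made up for this example, and `203.0.113.10` is a reserved documentation address used as a placeholder, not Planio's real IP.

```python
import ipaddress


def hosts_entry(ip: str, *hostnames: str) -> str:
    """Build one hosts-file line: the IP address, then the hostnames."""
    ipaddress.ip_address(ip)  # raises ValueError for a malformed address
    if not hostnames:
        raise ValueError("at least one hostname is required")
    return f"{ip}\t{' '.join(hostnames)}"


# Placeholder address from the documentation range, not a real server.
print(hosts_entry("203.0.113.10", "plan.io"))
```

An entry like this overrides DNS for the listed names only on the machine where it is added, so it should be removed again once DNS works normally. Otherwise the machine keeps using the old address even after the service has moved.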
For people who didn't try the hosts file trick, our monitoring shows that it may have taken as much as an entire hour until Planio became available again. The last users for whom Planio still appeared to be down were probably those behind poorly designed home routers or modems that don't update their DNS cache as they should. This is why we recommended via Twitter that you restart your router if you still weren't able to "see" Planio an hour and a half into the DNS downtime. It sounds like a very strange thing for a cloud provider to ask for, but it helped in all cases we're aware of, and Planio is now back to full availability - hopefully for all our customers.
If you have a home router or modem you may need to force a DNS cache refresh or simply restart it. Deeply sorry for the trouble...— Planio (@planio) November 2, 2012
We certainly won't move back to HostEurope. In fact, we will move more of our domains to a more professional DNS service provider, and we will also sign up with a second provider whose servers will be listed in our NS records from the start, just in case Dyn ever goes down. We will also add a monitoring check dedicated to DNS resolution. Should a situation like this ever occur again, this will hopefully save us some valuable minutes of figuring out what the problem is in the first place.
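To give an idea of what a dedicated DNS check could look like, here is a minimal Python sketch using only the standard library. It is an illustration under assumptions, not the check we will actually deploy: the function names and the hostname list are hypothetical, and the lookup goes through the local resolver.

```python
import socket
from typing import Callable, Dict, Iterable, List


def lookup(hostname: str) -> List[str]:
    """Return the sorted addresses a hostname resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except OSError:  # socket.gaierror is a subclass of OSError
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # the first element of sockaddr is the address itself.
    return sorted({info[4][0] for info in infos})


def check_dns(
    hostnames: Iterable[str],
    resolver: Callable[[str], List[str]] = lookup,
) -> Dict[str, List[str]]:
    """Resolve every monitored hostname and collect the answers."""
    return {name: resolver(name) for name in hostnames}


def failed(results: Dict[str, List[str]]) -> List[str]:
    """Return the hostnames that did not resolve to any address."""
    return sorted(name for name, addrs in results.items() if not addrs)
```

A monitoring job would call `failed(check_dns([...]))` with the hostnames that matter and raise an alert whenever the list is not empty. Because local caches can hide an outage for a while, a more robust check would query each provider's authoritative nameservers directly, which needs a DNS library and is not shown here.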
Now - in closing - I cannot stress enough how terribly sorry I am that this incident happened and that some of you weren't able to access Planio for such a long time. We are aware of the trust and confidence you as our clients are placing in us. We are proud to be serving you and we will continue to do our very best to improve and harden our infrastructure as well as our setup with external providers to be up and running 24/7 - there for you when you need us.
Again, my apologies for the inconveniences caused,
Jan Schulz-Hofen, Founder and CEO of Planio