What Caused Bluehost’s MASSIVE FAIL
An unknown number of Bluehost servers went down yesterday, April 16th, at 1pm central time. This may have been limited to their Dedicated (which I own) and virtual private servers (VPS) but that’s unknown too. It’s also unknown what caused it, even approximately when it will be fixed, or other pretty basic items a paying customer wants to know when a service is failing.
In this post I will tell you about two Bluehost fails: how poorly they communicated with customers about the outage, and what caused the outage in the first place.
BLUEHOST COMMUNICATION FAIL
Outages do occur at webhosts…they just do. But why so many unknowns and a clear reluctance to be transparent? Because Bluehost has failed dramatically at THE MOST BASIC customer relations item: communicating with customers about why something isn’t working as promised. Rather than have a status page at Bluehost.com that either has status updates on it or embeds their Twitter and Facebook feeds, they ask people to follow them “and check our Twitter feed and Facebook page for updates.” How incredibly bush-league.
For hours and hours and hours they have been telling people essentially, “I dunno,” which is unacceptable. Not only is this impacting an untold number of people (the tweets are numerous), it is a PR disaster and customers will undoubtedly flee, especially those who have clients on Bluehost due to their recommendation, one that now makes those recommenders look like a bunch of clueless imbeciles.
I’ve also been evangelizing Bluehost’s new Dedicated server offering since it has been very fast and their Level III tech support access is the best I’ve had with any host I’ve ever used. Several of my clients have purchased Dedicated servers (and yes, ALL of them pinged me about where they should go next because they are absolutely getting off Bluehost!).
Will I continue to evangelize? Nope. I might have cut Bluehost some slack IF they had been communicative. I may continue to evangelize IF Bluehost provides recompense for my server downtime and IF they provide a plan on how NOT to repeat a fiasco like this in the future. If they say or do nothing I’ll take my business and that of my clients elsewhere.
But here is what caused the outage.
BLUEHOST PARTNER FAIL: THE CAUSE
Some servers are back up and fortunately all of my sites on our dedicated server are up, except the most important one. This site is on its own dedicated IP address (since it uses an SSL certificate) and runs our key ecommerce site. Based upon our transaction history—and because we launched a new product and are simultaneously holding a sale—I estimate we’ve lost between $3,000 – $4,000 in sales since the site went down yesterday. Damn.
After waiting over two and a half hours this morning to talk with Level III technical support, I learned what caused the outage. While on hold I thought I’d poke around and see what data I could uncover so I could ask intelligent questions if I ever connected with someone!
I did a traceroute on my site’s dedicated IP address and learned that it stopped dead at ve15.ar04.prov.acedc.net. Acedc.net is run by Ace Data Centers, a colocation and IP transit company. In my poking around I also discovered that all of Bluehost’s dedicated servers (which are rack-mounted blades and might include their VPS servers too) are colocated at this data center in Orem, Utah. Headquarters for Bluehost and Ace are just over two miles apart.
When I connected with support I got into a conversation with the tech rep in order to ferret out the reason for the outage.
Turns out that the fail was caused by a small team which did a backend FIRMWARE UPDATE ON ROUTERS AND SWITCHES yesterday morning and was performed at Ace Data Center, the company that provides the IP transit service for Bluehost. Apparently the Ace team is small, obviously hosed up the firmware update so domain names were no longer resolving, and couldn’t fix it themselves. A netops team from Bluehost’s parent company, Endurance International Group (EIG), apparently scrambled to get over and help to fix the problem.
EIG is the company that owns Bluehost and numerous other hosting companies and businesses. EIG holds a colocation Master Service Agreement, an IP Transit Service (Carrier Services) Agreement, and a Data Center Rack Cabinet and Power Services Agreement with Ace Data Centers, Inc. (more here). EIG and Bluehost obviously knew the gravity of the screwup by Ace and have clearly pulled out all the stops to get servers back up and running.
The fix was done at about 9am this morning and all but one of my sites are up (the exception being the one with its own dedicated IP…obviously those are being updated last). But now the netops team is apparently working on “the flow” from the router/switch to the Bluehost servers so the DNS works (i.e., so the domain name will properly “point” to the IP address for a server…or my site!). In router-speak they’re obviously doing something to fix DNS-based X.25 routing data flow and I have no idea what they’re doing or how long it will take.
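If you want to check for yourself whether a site is down because the name won’t resolve or because the IP simply isn’t answering, here is a minimal Python sketch. The `diagnose` function and its three labels are my own invention for illustration, not anything Bluehost or Ace uses, and it only tells you which step failed from where you are sitting, not why.

```python
import socket


def diagnose(hostname, port=443, timeout=3.0,
             resolve=socket.gethostbyname,
             connect=socket.create_connection):
    """Return 'dns-failure', 'unreachable', or 'ok' for hostname:port.

    resolve and connect are injectable so the logic can be exercised
    without touching the network; by default they are the real
    standard-library calls.
    """
    try:
        # Step 1: does the domain name resolve to an IP address at all?
        ip = resolve(hostname)
    except OSError:
        # socket.gaierror (lookup failure) is a subclass of OSError.
        return "dns-failure"
    try:
        # Step 2: does that IP accept a TCP connection on the port?
        conn = connect((ip, port), timeout=timeout)
    except OSError:
        # Covers timeouts, refused connections, and "no route to host".
        return "unreachable"
    conn.close()
    return "ok"
```

Calling `diagnose("yourdomain.tld")` (a placeholder name) during an outage like this one would at least let you ask support a sharper question: “the name resolves but the IP is unreachable” is a different problem from “the name doesn’t resolve.”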
When I asked for an ETA on when my single dedicated IP might resolve, that was also an unknown: it could be “10 minutes to 6-7 hours” from now.
Holy shit. What a massive screwup.
Yes, it pisses me off that my site is down and server was down for hours and hours yesterday. BUT I CANNOT EXCUSE BLUEHOST FOR NOT TELLING US WHAT I JUST DISCOVERED TODAY! Do they think we’re stupid or that the screwup won’t be found out? That maybe we little customers can’t handle the truth? Or perhaps they don’t want to demonstrate publicly how badly their partner Ace screwed up? Whatever the reason, there is NO excuse for not being honest, transparent, and forthright.
So here is what I want Bluehost to do:
- Create a status update page that EXPLAINS what happened (and, God forbid there are future events, do actual updates on that page every 30 MINUTES!).
- Provide redundancy/failover services. I would pay for a redundant data center so my dedicated server was never offline. Yes, I could move our site to Amazon Web Services or another multi-data-center provider and pay A LOT more for the service, but if I had those technical chops (or could afford to hire them) I’d not need Bluehost or the managed Dedicated server!
- Create a mechanism so that, if a server is down or another DNS outage occurs, a DNS request is automatically captured and a page appears which states something like this so all of us running domains don’t look like a bunch of idiots that can’t find our ass with both hands:
“We apologize that the site, domainname.tld, is temporarily offline due to a server or network malfunction. This may be impacting both the website and email. Please use other means to contact the person or organization.”
- By the way, PLEASE PROVIDE AN OPTION TO TURN OFF THAT LOOPING SALES PITCH WHILE ON HOLD FOR TECH SUPPORT! I heard it at least 50 times while waiting and had to listen since I didn’t want to miss the tech support person when they finally answered.
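To make the outage-notice idea from the list above concrete, here is a rough Python sketch of a tiny standby web server that answers every request with that apology. Everything in it is an assumption for illustration: the names `outage_notice`, `OutageHandler`, and `serve`, the port 8080, and above all the premise that requests can still reach a standby machine outside the failed data center (if DNS or the network path itself is dead, no page like this can be served).

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

NOTICE = (
    "We apologize that the site, {domain}, is temporarily offline due to "
    "a server or network malfunction. This may be impacting both the "
    "website and email. Please use other means to contact the person or "
    "organization."
)


def outage_notice(host_header):
    """Build the apology text for the domain named in the Host header."""
    # Naive port stripping ("example.com:443" -> "example.com");
    # IPv6 literals are not handled in this sketch.
    domain = (host_header or "this domain").split(":")[0]
    return NOTICE.format(domain=domain)


class OutageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = outage_notice(self.headers.get("Host")).encode("utf-8")
        # 503 tells browsers and search engines the outage is temporary.
        self.send_response(503)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        # Suggest a retry in 30 minutes (1800 seconds).
        self.send_header("Retry-After", "1800")
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Run the standby server until interrupted (hypothetical port)."""
    HTTPServer(("", port), OutageHandler).serve_forever()
```

Running `serve()` on a standby box and pointing traffic at it during an outage would show visitors the notice for whatever domain they asked for, instead of a browser error.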
In any event please step it up Bluehost. If you’re the CEO or in leadership reading this, EIG owns a lot of other hosting companies and I’ll bet the shit will roll downhill pretty fast (or maybe already has) and you need to get your act together.
6 Comments
About Steve Borsch
Strategist. Learner. Idea Guy. Salesman. Connector of Dots. Friend. Husband & Dad. CEO. Janitor. More here.
I read your blog. It seems the explanation for the outage you give on the blog is contrary to the statement given by the CEO.
Specifically how is it contrary, “MR”? It was a firmware bug. Router. He didn’t go into all the detail, obviously.
It’s contrary because you claim that the outage was caused by the Ace team who made a mistake on installing a firmware update. Bluehost’s CEO claims that it was a bug. One is human error. The other is not.
Do you still say that it was human error or will you be amending your blog post to say that it was a bug?
Jeffrey,
*I* don’t claim. As you can read that is how it was explained to me. Paraphrasing that explanation was the paragraph, “Turns out that the fail was caused by a small team which did a backend FIRMWARE UPDATE ON ROUTERS AND SWITCHES yesterday morning and was performed at Ace Data Center, the company that provides the IP transit service for Bluehost. Apparently the Ace team is small, obviously hosed up the firmware update so domain names were no longer resolving, and couldn’t fix it themselves. A netops team from Bluehost’s parent company, Endurance International Group (EIG), apparently scrambled to get over and help to fix the problem.”
Yes, it was a firmware bug on the routers at Ace. The Bluehost CEO did not specify in his email that it was person “A” on small team “B” employed by Ace or a subcontractor.
I won’t be amending my post. ‘Nuff said.
Is it true that WordPress officially recommends BlueHost? Do you have any other recommendations?
Hey Dan-o,
Yes. WP does recommend Bluehost.
As you might have seen in the post above, I have a dedicated server and have had incredibly great support from their Level III support crew. Because the switching costs are so high I’m reluctant to go elsewhere since it took so long to settle on their relatively new Dedicated offering.
The old adage about the builder constructing your new home, “You’ll start off loving your builder but you will hate them by the end,” is true of webhosts too. As long as your host never has an outage (and believe me, every host I’ve ever used has had major and minor outages) you’ll be in love. The moment they do, they’re dirt and you hunt for a new one!