What Caused Bluehost’s MASSIVE FAIL
An unknown number of Bluehost servers went down yesterday, April 16th, at 1pm central time. This may have been limited to their Dedicated servers (I have one) and virtual private servers (VPS), but that’s unknown too. It’s also unknown what caused it, when it will be fixed (even approximately), or any of the other pretty basic things a paying customer wants to know when a service is failing.
In this post I will tell you about two Bluehost fails: how they communicated with customers about the outage, and what caused the outage in the first place.
BLUEHOST COMMUNICATION FAIL
Outages do occur at webhosts…they just do. But why so many unknowns and a clear reluctance to be transparent? Because Bluehost has failed dramatically at THE MOST BASIC customer relations item: communicating with customers about why something isn’t working as promised. Rather than have a status page at Bluehost.com that either has status updates on it or embeds their Twitter and Facebook feeds, they ask people to follow them “and check our Twitter feed and Facebook page for updates.” How incredibly bush-league.
For hours and hours and hours they have been telling people essentially, “I dunno,” which is unacceptable. Not only is this impacting an untold number of people (the tweets are numerous), it is also a PR disaster, and customers will undoubtedly flee. Especially those who have clients on Bluehost due to their recommendation, one that now makes those recommenders look like a bunch of clueless imbeciles.
I’ve also been evangelizing Bluehost’s new Dedicated server offering since it has been very fast and their Level III tech support access is the best I’ve had with any host I’ve ever used. Several of my clients have purchased Dedicated servers (and yes, ALL of them pinged me about where they should go next because they are absolutely getting off Bluehost!).
Will I continue to evangelize? Nope. I might have cut Bluehost some slack IF they had been communicative. I may continue to evangelize IF Bluehost provides recompense for my server downtime and IF they provide a plan on how NOT to repeat a fiasco like this in the future. If they say or do nothing I’ll take my business and that of my clients elsewhere.
But here is what caused the outage.
BLUEHOST PARTNER FAIL: THE CAUSE
Some servers are back up and fortunately all of my sites on our dedicated server are up, except the most important one. That site is on its own dedicated IP address (since it uses an SSL certificate) and is our key ecommerce site. Based upon our transaction history, and because we launched a new product and are simultaneously holding a sale, I estimate we’ve lost between $3,000 and $4,000 in sales since the site went down yesterday. Damn.
After waiting over two and a half hours this morning to talk with Level III technical support, I learned what caused the outage. While on hold I thought I’d poke around and see what data I could uncover so I could ask intelligent questions if I ever connected with someone!
I did a traceroute on my site’s dedicated IP address and learned that it stopped dead at ve15.ar04.prov.acedc.net. Acedc.net is run by Ace Data Centers, a colocation and IP transit company. In my poking around I also discovered that all of Bluehost’s dedicated servers (which are rack-mounted blades and might include their VPS servers too) are colocated at this data center in Orem, Utah. Headquarters for Bluehost and Ace are just over two miles apart.
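For anyone who wants to repeat that kind of check, here is a minimal sketch of pulling the last responding hop out of traceroute output. It assumes the common Unix `traceroute` line layout of hop number, hostname, then IP in parentheses. The sample text is invented for illustration and is not my actual output: the IP addresses are documentation placeholders, and only the hostname ve15.ar04.prov.acedc.net comes from what I saw. The function name `last_responding_hop` is likewise made up.

```python
import re

# Matches a hop line such as " 3  host.example.net (192.0.2.1)  12.3 ms ..."
# Assumes the usual Unix traceroute layout; other tools format lines differently.
HOP_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+\((\d{1,3}(?:\.\d{1,3}){3})\)")


def last_responding_hop(traceroute_output):
    """Return (hop_number, hostname, ip) for the last hop that answered, or None."""
    last = None
    for line in traceroute_output.splitlines():
        match = HOP_RE.match(line)
        if match:  # timed-out hops ("4  * * *") do not match and are skipped
            last = (int(match.group(1)), match.group(2), match.group(3))
    return last


# Hypothetical sample: placeholder IPs, invented timings.
SAMPLE = """\
traceroute to 203.0.113.10 (203.0.113.10), 30 hops max, 60 byte packets
 1  gateway.example.net (192.0.2.1)  1.2 ms  1.1 ms  1.0 ms
 2  core1.example.net (198.51.100.7)  9.8 ms  9.9 ms  10.1 ms
 3  ve15.ar04.prov.acedc.net (198.51.100.20)  31.4 ms  31.2 ms  31.9 ms
 4  * * *
 5  * * *
"""
```

Calling `last_responding_hop(SAMPLE)` returns the hop-3 entry, which is the point where the trace "stops dead". A trace that ends at a transit provider's router hints at where the problem is, but it does not prove it, since some routers simply don't answer traceroute probes.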
When I connected with support I got into a conversation with the tech rep in order to ferret out the reason for the outage.
Turns out the fail was caused by a backend FIRMWARE UPDATE ON ROUTERS AND SWITCHES performed yesterday morning by a small team at Ace Data Centers, the company that provides the IP transit service for Bluehost. Apparently the Ace team hosed up the firmware update so that domain names were no longer resolving, and couldn’t fix it themselves. A netops team from Bluehost’s parent company, Endurance International Group (EIG), apparently scrambled to get over and help fix the problem.
EIG is the company that owns Bluehost and numerous other hosting companies and businesses. EIG holds a colocation Master Service Agreement, an IP Transit Service (Carrier Services) Agreement, and a Data Center Rack Cabinet and Power Services Agreement with Ace Data Centers, Inc. (more here). EIG and Bluehost obviously knew the gravity of the screwup by Ace so have clearly pulled out all the stops to get servers back up and running.
The fix was done about 9am this morning and all but one of my sites are up (the one with its own dedicated IP…obviously those are being updated last). But now the netops team is apparently working on “the flow” from the router/switch to the Bluehost servers so the DNS works (i.e., so the domain name will properly “point” to the IP address for a server…or my site!). In router-speak they’re obviously doing something to fix DNS-based X.25 routing data flow and I have no idea what they’re doing or how long it will take.
When I asked for an ETA on when my single dedicated IP might resolve, that was also an unknown: it could be “10 minutes to 6-7 hours” from now.
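Since "the domain doesn't resolve" and "the IP address can't be reached" look the same in a browser, a quick script can help tell the two apart. The following is a rough sketch, not anything Bluehost provides: the function name `diagnose`, the result labels, and the choice of port 443 are all my own assumptions. It uses only Python's standard `socket` module.

```python
import socket


def diagnose(hostname, port=443, timeout=5.0,
             resolve=socket.getaddrinfo, connect=socket.create_connection):
    """Return (status, ip): 'dns_failure', 'unreachable', or 'ok'.

    resolve/connect are injectable so the logic can be exercised without
    a network; by default they are the real standard-library calls.
    """
    try:
        infos = resolve(hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # The name itself could not be turned into an address.
        return ("dns_failure", None)

    ip = infos[0][4][0]  # first address returned by the resolver
    try:
        conn = connect((ip, port), timeout=timeout)
    except OSError:
        # The name resolved, but nothing answered at that address and port.
        return ("unreachable", ip)
    conn.close()
    return ("ok", ip)
```

Running `diagnose("yourdomain.tld")` from a machine outside the data center gives one of the three labels. A result of `unreachable` means the name resolved fine and the trouble is on the path to the server, which is a different problem from DNS even though visitors see the same dead page.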
Holy shit. What a massive screwup.
Yes, it pisses me off that my site is down and my server was down for hours and hours yesterday. BUT I CANNOT EXCUSE BLUEHOST FOR NOT TELLING US WHAT I JUST DISCOVERED TODAY! Do they think we’re stupid or that the screwup won’t be found out? That maybe we little customers can’t handle the truth? Or perhaps they don’t want to demonstrate publicly how badly their partner Ace screwed up? Whatever the reason, there is NO excuse for not being honest, transparent, and forthright.
So here is what I’d like to see Bluehost do:
- Create a status update page that EXPLAINS what happened (and, God forbid there are future events, do actual updates on that page every 30 MINUTES!).
- Provide redundancy/failover services. I would pay for a redundant data center so my dedicated server was never offline. Yes, I could move our site to Amazon Web Services or another multi-data-center provider and pay A LOT more for the service, but if I had those technical chops (or could afford to hire them) I wouldn’t need Bluehost or the managed Dedicated server!
- Create a mechanism so that, if a server is down or another DNS outage occurs, a DNS request is automatically captured and a page appears which states something like this so all of us running domains don’t look like a bunch of idiots that can’t find our ass with both hands:
“We apologize that the site, domainname.tld, is temporarily offline due to a server or network malfunction. This may be impacting both the website and email. Please use other means to contact the person or organization.”
- By the way, PLEASE PROVIDE AN OPTION TO TURN OFF THAT LOOPING SALES PITCH WHILE ON HOLD FOR TECH SUPPORT! I heard it at least 50 times while waiting and had to listen since I didn’t want to miss the tech support person when they finally answered.
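To make the fallback-page idea from the list above a bit more concrete, here is a minimal sketch using Python's built-in `http.server`. It is an illustration only, not how Bluehost works. It assumes traffic can somehow be steered to a standby machine that is still reachable, which is the hard part and is not solved here. The names `build_notice`, `OutageHandler`, and `serve`, along with port 8080, are made up for the example. It also only helps when the name still resolves; if DNS itself is down, no page can be served.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Wording taken from the suggested notice above; {domain} is filled per request.
MESSAGE = (
    "We apologize that the site, {domain}, is temporarily offline due to a "
    "server or network malfunction. This may be impacting both the website "
    "and email. Please use other means to contact the person or organization."
)


def build_notice(host_header):
    """Build the notice text from the request's Host header (port stripped)."""
    # Assumes ordinary hostnames, not bracketed IPv6 literals.
    domain = (host_header or "this site").split(":")[0]
    return MESSAGE.format(domain=domain)


class OutageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = build_notice(self.headers.get("Host")).encode("utf-8")
        self.send_response(503)  # Service Unavailable: tells clients it is temporary
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Retry-After", "1800")  # hint: try again in 30 minutes
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Run the standby notice server until interrupted."""
    HTTPServer(("0.0.0.0", port), OutageHandler).serve_forever()
```

Calling `serve()` on the standby box answers every GET with the notice and a 503 status, so visitors see an explanation instead of a timeout and search engines treat the outage as temporary.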
In any event, please step it up, Bluehost. If you’re the CEO or in leadership reading this: EIG owns a lot of other hosting companies, I’ll bet the shit will roll downhill pretty fast (or maybe already has), and you need to get your act together.