Multiple Microsoft services went down last week, and in a detailed post-incident report, the Redmond-based software giant says the outage was caused by an error affecting Azure DNS.
More specifically, the outage, which lasted almost 40 minutes, happened on April 1, and Microsoft has classified it as a service availability issue.
In other words, customers trying to connect to Microsoft servers were unable to resolve domain names, which in turn made it impossible to load the affected services. Microsoft explains that all services returned to normal a little over an hour later.
“Azure DNS servers experienced an anomalous surge in DNS queries from across the globe targeting a set of domains hosted on Azure. Normally, Azure’s layers of caches and traffic shaping would mitigate this surge. In this incident, one specific sequence of events exposed a code defect in our DNS service that reduced the efficiency of our DNS Edge caches,” the company explains.
“As our DNS service became overloaded, DNS clients began frequent retries of their requests which added workload to the DNS service. Since client retries are considered legitimate DNS traffic, this traffic was not dropped by our volumetric spike mitigation systems. This increase in traffic led to decreased availability of our DNS service.”
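The retry feedback loop described in the statement can be illustrated with a toy model. The sketch below is not Microsoft's code and does not reflect how Azure DNS works internally. It assumes a server with a fixed per-tick capacity and clients that resend every failed query in the next tick. The function name `simulate` and all numbers are hypothetical, chosen only to show how a short surge can keep a service overloaded after the surge itself has ended.

```python
def simulate(new_queries, capacity, retry=True):
    """Return the offered load per tick for a server with fixed capacity.

    new_queries: first-attempt queries arriving in each tick (hypothetical values).
    capacity:    queries the server can answer per tick (assumed constant).
    retry:       if True, every query that failed is resent in the next tick.
    """
    offered_per_tick = []
    carried_over = 0  # failed queries that clients will retry next tick
    for new in new_queries:
        offered = new + carried_over
        failed = max(0, offered - capacity)  # queries beyond capacity go unanswered
        carried_over = failed if retry else 0
        offered_per_tick.append(offered)
    return offered_per_tick


# A two-tick surge (150) on top of a baseline (80) that fits within capacity (100).
surge = [80, 150, 150, 80, 80, 80]

print(simulate(surge, capacity=100, retry=False))  # [80, 150, 150, 80, 80, 80]
print(simulate(surge, capacity=100, retry=True))   # [80, 150, 200, 180, 160, 140]
```

Without retries, the load drops back under capacity as soon as the surge ends. With retries, the backlog of failed queries keeps the offered load above capacity for several more ticks, which is consistent with the statement that retry traffic added workload and could not simply be dropped as illegitimate.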
DNS services recovered on their own 30 minutes later
The company goes on to explain that its DNS services recovered automatically approximately 30 minutes after the outage began, with engineers then working to bring the remaining services back to normal. All services were operating normally by 22:30 UTC.
“We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future,” the company further adds.