March 10, 2020
Let’s Encrypt CAA Rechecking Bug Affected 3 Million Certificates
February 29th wasn’t only Leap Day. It was also the day Let’s Encrypt discovered a bug in its Certification Authority Authorization (CAA) code.
Let’s Encrypt is a Certificate Authority (CA) that enables HTTPS on 190 million websites, and does it for free! It recently celebrated a milestone of one billion issued TLS certificates.
Sadly enough, a bug was found in Boulder, the CA software Let’s Encrypt runs to implement the Automated Certificate Management Environment (ACME) protocol. There are a couple of things we need to know about Boulder: it checks for CAA records, and it validates a subscriber’s control of a domain name.
Rechecking Problem That Turned Into a Headache
The problem that caused all this mess is the following: when a certificate request contained X domain names that needed CAA rechecking, Boulder picked one domain name and checked it X times, rather than checking each of the X names once.
This kind of iteration problem is a common mistake in Go code: taking a reference to a loop iterator variable, as Let’s Encrypt’s lead developer Jacob Hoffman-Andrews explained in his bug report.
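To make that class of mistake concrete, here is a minimal sketch in Go. It is not Boulder’s actual code: the function names and the example domains are hypothetical and exist only for illustration. Since Go 1.22 a range loop creates a fresh variable on every iteration, so the buggy version below declares the shared variable explicitly to reproduce the older behavior on any Go version.

```go
package main

import "fmt"

// sharedVariablePointers mimics the pre-Go 1.22 behavior of a range loop:
// a single variable is reused on every iteration, so every stored pointer
// refers to the same memory location. (Hypothetical name, not from Boulder.)
func sharedVariablePointers(names []string) []*string {
	var ptrs []*string
	var name string // one variable shared by all iterations
	for i := range names {
		name = names[i]
		ptrs = append(ptrs, &name) // always the same address
	}
	return ptrs
}

// perIterationPointers copies each element into a new variable before
// taking its address, so every pointer refers to a distinct value.
func perIterationPointers(names []string) []*string {
	var ptrs []*string
	for i := range names {
		name := names[i] // new variable per iteration
		ptrs = append(ptrs, &name)
	}
	return ptrs
}

// deref reads back the strings the pointers refer to.
func deref(ptrs []*string) []string {
	out := make([]string, 0, len(ptrs))
	for _, p := range ptrs {
		out = append(out, *p)
	}
	return out
}

func main() {
	names := []string{"a.example", "b.example", "c.example"}
	fmt.Println(deref(sharedVariablePointers(names))) // [c.example c.example c.example]
	fmt.Println(deref(perIterationPointers(names)))   // [a.example b.example c.example]
}
```

In the first function, three names go in but only one of them is seen three times when the pointers are read back, which is the same shape as checking one domain X times.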
Issuance was halted immediately after the bug was confirmed, and the code fix was deployed about two hours later.
At first the Let’s Encrypt team thought the problem had something to do with error message generation, but then one of the engineers discovered that it was a bigger problem.
The Scale Of The Issue In Numbers
2.6% might not look bad, right?
3,048,289 digital certificates out of about 116 million active ones were affected, and Let’s Encrypt warned its subscribers to renew and replace affected certificates by Wednesday, March 4, 2020.
The notification email put it this way: “If you're not able to renew your certificate by March 4, the date we are required to revoke these certificates, visitors to your site will see security warnings until you do renew the certificate. Your ACME client documentation should explain how to renew.”
After that announcement, Let’s Encrypt staff worked with subscribers to replace affected certificates, and more than 1.7 million of them were replaced in less than 48 hours! On the other hand, more than a million certificates were not replaced before the deadline.
The Let’s Encrypt team then decided not to revoke all of the affected certificates by the deadline, to avoid harming so many sites. Because this certificate authority only issues certificates with 90-day lifetimes, those that were not revoked “will leave the ecosystem relatively quickly”.
It must be noted that 1,706,505 certificates were revoked because Let’s Encrypt was confident they had been replaced during the incident period. Let’s Encrypt will revoke more certificates only if it is absolutely sure that doing so won’t disrupt the experience of web users.
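The figures in this section can be checked with a few lines of Go. This is only a sketch of the arithmetic: the function names are made up for illustration, and the total of 116 million is the approximate figure quoted above, so the percentage is approximate too.

```go
package main

import "fmt"

// affectedShare returns affected as a percentage of total.
func affectedShare(affected, total float64) float64 {
	return affected / total * 100
}

// notRevoked returns how many affected certificates were left unrevoked.
func notRevoked(affected, revoked int) int {
	return affected - revoked
}

func main() {
	const affected = 3048289 // affected certificates, as reported above
	const total = 116000000  // "about 116 million", an approximate figure
	const revoked = 1706505  // certificates revoked, as reported above

	fmt.Printf("share affected: %.1f%%\n", affectedShare(affected, total)) // share affected: 2.6%
	fmt.Println("affected but not revoked:", notRevoked(affected, revoked)) // affected but not revoked: 1341784
}
```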
Something To Learn About From This Problem
While some subscribers reacted to this problem relatively quickly, there will undoubtedly be a lot of people unaware of this issue.
The backlash would surely have drawn more media coverage had the same snafu happened at a commercial certificate authority. Still, this is not great for the mental health of Let’s Encrypt staff.
Speaking of the subscribers, while many system administrators handled the emergency certificate renewal well, it’s also easy to botch the update command and face Let’s Encrypt's API rate limiting.
To summarize, no language can guarantee the absence of bugs, nothing can save companies from human error, and there aren’t a lot of alternatives to SSL/TLS out there.
I hope this post will be helpful to people who might have been affected, or those who didn’t catch the news lately because of all this craziness happening.