SSL certificate expiry — Microsoft Teams, Bazel, and you
A pillar piece on why expired SSL certificates remain one of the most embarrassing and most preventable outages in 2025. Microsoft Teams in February. Bazel in December. Two Let's Encrypt API outages. Apple's 47-day cert lifespan move. Keyfactor's $2.86M-per-outage number.
The 150-word answer
In 2025, four things made SSL certificates the most expensive boring problem in infrastructure:
- 3 February 2025: Microsoft let an SSL cert on a Teams subdomain expire, taking down a portion of Teams for ~9 hours globally. (SOCRadar)
- 26 December 2025: Bazel's releases.bazel.build cert expired with no automated renewal. The whole Bazel CI ecosystem stalled for hours on Boxing Day. (Surfing Complexity)
- Two Let's Encrypt API outages, on 21 July 2025 and 16 December 2025, each lasting hours and blocking renewals across the entire ACME ecosystem.
- Apple's 47-day max-cert-lifespan vote in the CA/Browser Forum, phasing in through 2028. (Slashdot coverage)
Keyfactor puts the average cost of a single cert outage at $2.86M. The math now demands monitoring, automation, and rehearsed renewal workflows. Not optional.
Why this is still happening in 2025
SSL certificates are a textbook well-understood problem. Let's Encrypt has been free since 2016. ACME automation is built into every modern web stack. Cert lifetimes have been shortening for a decade.
And yet, in 2025, Microsoft — the company with possibly the largest cert estate on earth — let a Teams cert expire and took the product down for nine hours. Bazel — a Google-owned build tool used by tens of thousands of engineering organizations — let a release CDN cert expire on Boxing Day with no monitoring on the renewal pipeline.
The reason it keeps happening is the same reason it's always happened: certs are someone-else's-problem until they expire. The team that owns the cert, the team that renewed it last time, the system that emails warnings, the inbox that receives the warnings — all of these can drift. Subdomains get spun up by teams who don't know about the org-wide cert process. Renewals get scheduled by people who leave the company. Email warnings go to mailing lists that became defunct three reorgs ago.
The fix is not "be more careful." The fix is continuous monitoring, redundant alerting, rehearsed renewal, and architectural choices that make cert expiry impossible to ignore.
Microsoft Teams — 3 February 2025
The most embarrassing one of the year. From SOCRadar's writeup:
A subdomain serving Microsoft Teams traffic had its SSL certificate expire. Teams clients globally started failing TLS handshakes against the affected subdomain. The outage lasted approximately 9 hours from first reports to full restoration.
Specifics that matter:
- The cert had a publicly visible expiry date. Anyone on the internet running an SSL monitor knew when it would expire.
- Microsoft's internal cert management presumably has alerting. The alerting either didn't fire or didn't reach the right people.
- The renewal process, when it finally kicked in during the incident, took hours — suggesting it wasn't a routine path.
The lesson: a company with the cert-management resources of Microsoft can still ship an expired cert. If they can, your team can. The defense is not "we're more careful than Microsoft." The defense is monitoring + automation + drills.
Bazel — 26 December 2025
The Boxing Day case study. From Surfing Complexity's writeup:
releases.bazel.build — a subdomain serving Bazel binaries that thousands of CI systems pull from — had its cert expire on 26 December. The renewal pipeline existed but didn't run. There was no proactive notification. The cert simply went past its expiry, the next CI build that pulled from releases.bazel.build got a TLS failure, and within an hour Bazel-based CI systems worldwide that hit cache misses were failing.
What's instructive about this case: Bazel is a Google project. Google has massive cert-management infrastructure. The subdomain in question, however, was a side-channel that didn't sit on the same renewal pipeline as the main bazel.build domain. Two adjacent subdomains, two different renewal paths, one of them silently broken.
This is the subdomain drift failure mode. Big organizations have multiple cert-issuance paths. Subdomains created by different teams, at different times, for different purposes, often end up on different renewal pipelines. Some are automated, some are manual, some are semi-automated with human-in-the-loop steps that fail when the human leaves.
Defense: a single source of truth for "every certificate the org owns" — with monitoring that doesn't depend on the team that issued the cert remembering to register it. External monitoring that watches every public hostname an org operates is the only way to catch subdomain drift.
Let's Encrypt — twice in 2025
Even when your infrastructure is correct, the certificate authority can have a bad day:
21 July 2025 — complete API outage
The Let's Encrypt community thread documents a complete ACME API outage on 21 July. Issuance and renewal across the entire Let's Encrypt ecosystem stalled; anything scheduled in the affected window failed. Teams whose certs were due for renewal during the outage had to either wait it out or scramble.
For most teams, this didn't matter — certs renew well in advance of expiry, so a few-hour outage at the CA is recoverable. For teams with tight renewal windows (e.g., short-lifetime certs that were renewing close to expiry), it was a real risk.
16 December 2025 — second ACME API outage
The December community thread documents a second outage. Same pattern, different date.
The lesson: the CA itself is a single point of failure. Mature SSL strategies use multiple CAs (Let's Encrypt + ZeroSSL + Sectigo) and have fallback issuance configured — if one CA is unavailable, the renewal pipeline tries another.
Most teams don't do this. Most teams won't, until their renewal-day coincides with a CA outage and bites them.
The 47-day cert lifespan, coming 2028
Apple proposed, and the CA/Browser Forum accepted, a phased reduction of maximum public TLS certificate lifespans. The full schedule (per Slashdot's coverage and CA/B Forum minutes):
| Year | Max cert lifespan |
|---|---|
| 2024 | 398 days |
| 2026 | 200 days |
| 2027 | 100 days |
| 2028 | 47 days |
The 47-day number is the eventual target. By 2028, a public TLS certificate cannot be valid for more than 47 days.
Implications:
- Manual renewal stops being viable. No team can sustainably renew certs every 47 days by hand. Automation is mandatory.
- The window for "we'll get to it" disappears. A 47-day cert that doesn't auto-renew has roughly a one-week window between "should be renewed" and "expired" before downtime.
- CA outage risk rises. A 4-hour Let's Encrypt outage eats a far larger fraction of a 47-day cert's renewal window than of a 398-day cert's. Multiple-CA fallback becomes less optional.
- Cert monitoring becomes a P0 operational concern. Not a P3 SRE backlog item.
The right way to read the 47-day move is: the bar for cert ops is rising, and the cost of getting it wrong is rising in lockstep.
The $2.86M number
Sectigo and Keyfactor cite a survey-derived figure: the average cost of a single expired-certificate outage is $2.86 million, across surveyed enterprises.
The number comes from a few buckets:
- Direct revenue impact during downtime
- SLA credits and contractual penalties
- Customer-support load and ticket-handling cost
- Engineering response cost (incident bridge, all-hands, rollback effort)
- Reputational impact (harder to quantify; this number is conservative)
Whether $2.86M is the right number for your org depends on your size, but for most enterprises it's directionally correct. The number is a reminder that "we'll renew it manually next month" is, mathematically, a $2.86M gamble.
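The gamble framing is just expected-value arithmetic. A minimal sketch, with an assumed outage probability that is purely illustrative (the only sourced input is the $2.86M average):

```python
# Hypothetical inputs: the probability is an assumption, not from any survey.
p_outage_per_year = 0.10       # assumed chance of one expiry outage in a year
cost_per_outage = 2_860_000    # Keyfactor's survey average, USD

# Expected annual loss from doing nothing.
annual_expected_loss = p_outage_per_year * cost_per_outage
print(f"expected annual loss: ${annual_expected_loss:,.0f}")  # prints "expected annual loss: $286,000"
```

If that expected loss exceeds the cost of monitoring plus automation (it almost always does), the investment case closes itself.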
CSC's research corroborates: a majority of surveyed enterprises had at least one cert-related incident in the prior 12 months. Most preventable.
What "good" looks like
Five commitments, in order of priority:
1. External monitoring on every public hostname
Not "the certs we know about." Every hostname your org serves on the public internet. Tooling: Sutrace, Better Stack, Pingdom, or a custom script wrapped around openssl s_client. The point is: external, daily, with a database of all hostnames. New subdomains get auto-discovered via certificate transparency logs.
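A custom check along those lines is small. Here is a minimal Python sketch, the stdlib equivalent of wrapping openssl s_client: it fetches the cert a host is actually serving and reports days to expiry (hostname and thresholds are yours to supply):

```python
import socket
import ssl
from datetime import datetime, timezone

def days_left(not_after: str, now: datetime) -> int:
    """Days from `now` until an OpenSSL-style notAfter string,
    e.g. 'Dec 26 00:00:00 2025 GMT'."""
    expiry = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return (expiry.replace(tzinfo=timezone.utc) - now).days

def check_host(hostname: str, port: int = 443) -> int:
    """Fetch the certificate a host is actually serving (not the one you
    think you deployed) and return days remaining before it expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    return days_left(cert["notAfter"], datetime.now(timezone.utc))
```

Run it daily from outside your own network against every hostname in the inventory; the external vantage point is what catches the subdomains your internal tooling doesn't know about.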
2. Multi-channel alerting, far in advance
Email, Slack, PagerDuty. At 30 days, 14 days, 7 days, 1 day, expired-now. Every channel, every threshold. Cert expiry alerts must be impossible to miss. The Bazel case is what happens when one channel fails silently.
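The threshold logic is simple enough to sketch. This is an illustrative shape, not any particular tool's API; the channel names and thresholds are the ones from the paragraph above:

```python
THRESHOLDS = (30, 14, 7, 1, 0)          # days remaining at which alerts fire
CHANNELS = ("email", "slack", "pagerduty")  # fan out to every channel

def due_alerts(days_remaining: int, already_sent: set) -> list:
    """Thresholds that should fire now: every threshold the cert has
    crossed that hasn't already produced an alert."""
    return [t for t in THRESHOLDS if days_remaining <= t and t not in already_sent]

def fire(cert_name: str, days_remaining: int, already_sent: set) -> list:
    """Build (channel, message) pairs for every due threshold, on every
    channel, and record the thresholds as sent."""
    sends = []
    for t in due_alerts(days_remaining, already_sent):
        for channel in CHANNELS:
            sends.append((channel, f"{cert_name} expires in {days_remaining}d (crossed {t}d threshold)"))
        already_sent.add(t)
    return sends
```

The important property is the cross-product: a threshold fires on every channel, so one silently broken channel (the Bazel failure mode) still leaves two working ones.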
3. Automated renewal pipelines with health checks
ACME automation everywhere possible. The pipeline needs a health check — a synthetic check that confirms the renewal succeeded and the new cert is actually serving — not just that the renewal command exited 0.
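A sketch of that synthetic check, under the assumption that your renewal pipeline can hand over the freshly issued cert in DER form: compare what the host is serving against what was just renewed, rather than trusting the renewal command's exit code.

```python
import hashlib
import socket
import ssl

def serving_cert_der(hostname: str, port: int = 443) -> bytes:
    """The DER-encoded certificate the host is serving right now."""
    ctx = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert(binary_form=True)

def renewal_is_live(renewed_der: bytes, served_der: bytes) -> bool:
    """True only if the cert now on the wire is the one just renewed,
    not merely that the renewal command exited 0."""
    return (hashlib.sha256(renewed_der).hexdigest()
            == hashlib.sha256(served_der).hexdigest())
```

This catches the classic failure where the renewal succeeded on disk but the web server was never reloaded and keeps serving the old cert.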
4. Multi-CA fallback
Not every team needs this; the small ones don't. Any team where SSL downtime costs > $100k should have it. Configure renewal to try Let's Encrypt first, fall back to ZeroSSL or Sectigo if Let's Encrypt is unreachable.
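The fallback loop itself is the easy part. A minimal sketch, with the per-CA issue functions left abstract since each CA's ACME client configuration differs:

```python
class IssuanceError(Exception):
    """Raised when a CA cannot issue (API down, rate-limited, etc.)."""

def issue_with_fallback(domain, issuers):
    """Try each (ca_name, issue_fn) pair in priority order; return
    (ca_name, cert) from the first CA that succeeds. Raise only if
    every configured CA fails."""
    failures = []
    for name, issue in issuers:
        try:
            return name, issue(domain)
        except IssuanceError as exc:
            failures.append(f"{name}: {exc}")
    raise IssuanceError("every CA failed: " + "; ".join(failures))
```

In practice ACME clients such as certbot or acme.sh are pointed at different directory URLs per CA; the loop above is the policy layer you wrap around them.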
5. Rehearsed renewal drill
Once a quarter, manually trigger a renewal on a non-production cert. Confirm the pipeline works end-to-end. Confirm the alerting fires correctly. Confirm the new cert serves correctly. The drill catches the silent failures that monitoring misses.
What Sutrace does
We monitor SSL certificate expiry, chain integrity, and SAN coverage as a first-class object. The trigger thresholds are configurable per cert. New subdomains are auto-discovered via certificate transparency log subscription, so a team that spins up staging-2.yourapp.com doesn't have to remember to register the cert with us — we see it in the CT log within minutes and start monitoring.
When a cert is approaching expiry, we alert via email + Slack + PagerDuty (configurable). When it actually expires, we open an incident on the bound public status page and the affected component flips to degraded — automatically. (See the honest status page use-case.)
The free tier includes SSL monitoring on 5 hostnames. The Team tier ($99/month) covers 100 hostnames with CT-log auto-discovery. See /pricing.
What you should do this week
- List every public hostname your org serves. Get the actual list, not the one in the wiki. CT logs are the right source.
- Run an external check on every cert's expiry date. Find the ones expiring in the next 90 days.
- For each, identify the renewal pipeline owner and the alerting setup. The ones that fail this exercise are your time bombs.
- Decide whether you're investing in multi-CA fallback this year. If your downtime cost is high enough, do it now.
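The first step above, building the hostname list from CT logs, can be sketched against crt.sh, a public CT-log search service. The URL shape and the `name_value` field are assumptions about its JSON interface; verify them before relying on this:

```python
import json
import urllib.request

def hostnames_from_ct_entries(entries) -> set:
    """Flatten crt.sh-style JSON entries into a deduplicated hostname set.
    Each entry's 'name_value' can hold several newline-separated names."""
    names = set()
    for entry in entries:
        for raw in entry.get("name_value", "").splitlines():
            names.add(raw.strip().removeprefix("*."))  # fold wildcards into the base name
    return names

def ct_hostnames(domain: str) -> set:
    """Every name ever publicly certified under `domain`, per crt.sh.
    Assumed endpoint: https://crt.sh/?q=%25.<domain>&output=json"""
    url = f"https://crt.sh/?q=%25.{domain}&output=json"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return hostnames_from_ct_entries(json.load(resp))
```

Diff the result against your wiki's list; the names that appear only in the CT output are exactly the subdomain-drift candidates.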
The 47-day cert future is three years away. Build the muscle now.
Read more in our pillar on why status pages lie — many "the page lied" cases trace back to cert mis-handling. Try Sutrace free at sutrace.io; SSL monitoring is included at every tier.