
The Unglamorous List of Things That Actually Break After Launch

A field report of the real failures we see after go-live — SSL renewals, inbox reputation, dusty cron jobs, silent integrations — and what to watch for.


DevLK Editorial Team

  • 20 Apr 2026



Launching software is the easy day. The interesting work happens weeks, months, and years later — at 3 a.m., quietly, when nobody is watching.

This post is a field report. It is the list of things we actually see break in production on SME-scale web applications after launch, ranked roughly by how often they fail and how expensive each failure is. If you are about to launch something, or you already did and you are wondering what to worry about next, use this as a checklist.

1. SSL certificates

Failure mode: certificate expires, browsers refuse to load the site, customers assume you are hacked.

If you are on Let's Encrypt, renewals are usually automated. "Usually" is doing a lot of work in that sentence. We have seen renewals fail because:

  • The renewal cron job was never enabled.
  • The server ran out of disk space and the ACME client could not write its challenge.
  • The site was put behind a CDN after launch and the certificate was still being issued against the origin.
  • A web server config change moved the webroot and nobody updated the ACME client path.

What to do: actually test what happens when a certificate is 15 days from expiry. Many monitoring tools will warn you. Very few default setups tell you until it is too late.
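That test can be automated cheaply. A minimal sketch in Python against the date format `openssl x509 -enddate -noout` prints (the date and the 15-day threshold here are illustrative; in a real check you would compare against the live clock):

```python
from datetime import datetime, timezone

def days_until_expiry(not_after: str, now: datetime) -> int:
    """Days left before a certificate expires.

    `not_after` uses the openssl text format, i.e. the output of
    `openssl x509 -enddate -noout` minus its "notAfter=" prefix.
    """
    expiry = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    expiry = expiry.replace(tzinfo=timezone.utc)
    return (expiry - now).days

# A fixed "today" keeps the example deterministic; in production
# you would pass datetime.now(timezone.utc).
today = datetime(2026, 5, 20, tzinfo=timezone.utc)
remaining = days_until_expiry("Jun  1 12:00:00 2026 GMT", today)
if remaining < 15:
    print(f"WARN: certificate expires in {remaining} days -- check the ACME client")
```

Run it from a cron job that is itself monitored (see item 5), or you have just moved the silent failure one layer up.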

2. Email deliverability

Failure mode: your transactional emails (password resets, invoices, order confirmations) silently start landing in spam.

This is the single most common post-launch complaint we receive, and it almost never comes from a developer. It comes from a customer who "didn't get the email."

Common causes, in order of how often we see them:

  • SPF, DKIM, or DMARC records are missing or wrong for the actual sending domain.
  • The app was switched to a new sending service but the DNS records still point at the old one.
  • The sending IP was put on a blocklist because someone elsewhere in the shared pool sent spam.
  • The "from" address is a no-reply address on a domain with no meaningful DMARC policy.

The fix is almost always the same: use a dedicated transactional email provider, authenticate with SPF, DKIM, and DMARC, monitor your sender reputation, and never use a personal Gmail address as the "from" address for a business app.
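As a sketch of what "authenticate" means in practice, here is a deliberately shallow Python audit over TXT records of the kind you would collect with `dig TXT <name> +short`. The domain names and wording are illustrative; this is not a substitute for a real deliverability tool:

```python
def audit_email_auth(txt_records: dict[str, list[str]], domain: str) -> list[str]:
    """Flag obvious SPF/DMARC gaps for a sending domain."""
    findings = []
    spf = [r for r in txt_records.get(domain, []) if r.startswith("v=spf1")]
    if not spf:
        findings.append("no SPF record on the sending domain")
    elif "all" not in spf[0].split()[-1]:
        findings.append("SPF record does not end with an all mechanism")
    dmarc = [r for r in txt_records.get(f"_dmarc.{domain}", []) if r.startswith("v=DMARC1")]
    if not dmarc:
        findings.append("no DMARC record")
    elif "p=none" in dmarc[0].replace(" ", "").split(";"):
        findings.append("DMARC policy is p=none (monitor only, nothing enforced)")
    return findings

# Hypothetical records for a hypothetical domain.
records = {
    "example.lk": ["v=spf1 include:_spf.mailprovider.example ~all"],
    "_dmarc.example.lk": ["v=DMARC1; p=none; rua=mailto:dmarc@example.lk"],
}
print(audit_email_auth(records, "example.lk"))
```

Note that `p=none` passes every validator and still leaves spoofed mail unenforced; it is the second bullet's "records exist but are wrong" failure in miniature.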

3. Backups that run but are never tested

Failure mode: the database gets corrupted, the backups "exist," the restore fails.

This one is almost comically common. Teams spend a day setting up nightly backups, tick the box, and move on for three years. When the day comes that they need to restore, they discover one of:

  • The backup script has been failing silently for six months.
  • The backup file is there but unreadable because a schema migration was partially applied before the dump.
  • Nobody has the decryption key, or the credentials for the bucket it is stored in.
  • Restoring takes 18 hours and nobody knew.

What to do: run a real restore test at least quarterly. Pick a copy of yesterday's backup, restore it to a scratch server, run the app against it, confirm the numbers match. If that test has never been done, you do not have backups; you have hope.
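The restore test itself can be scripted. A minimal sketch with SQLite standing in for the real database (the table and counts are illustrative; the principle — restore into a scratch database, compare the numbers — is the point):

```python
import sqlite3

def restore_and_verify(dump_sql: str, expected_counts: dict[str, int]) -> bool:
    """Restore a SQL dump into a scratch database and compare row counts.

    `expected_counts` would be recorded from the live system at dump time.
    """
    scratch = sqlite3.connect(":memory:")
    scratch.executescript(dump_sql)
    for table, expected in expected_counts.items():
        (actual,) = scratch.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
        if actual != expected:
            return False
    return True

# Simulate yesterday's dump; in real life this is the file your backup job wrote.
live = sqlite3.connect(":memory:")
live.executescript(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY);"
    "INSERT INTO orders VALUES (1), (2), (3);"
)
dump = "\n".join(live.iterdump())
assert restore_and_verify(dump, {"orders": 3})
```

If this assertion has never run against a real backup on a real scratch server, the quarterly test is still outstanding.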

4. Third-party integrations drifting

Failure mode: a payment gateway, shipping API, or SMS provider quietly changes behaviour, and your app is wrong before you notice.

Integrations are the most volatile part of any modern application. In the last two years we have seen:

  • A payment provider change the rounding rule on foreign-currency refunds by one cent, breaking reconciliation.
  • A shipping API deprecate an endpoint with six months' notice that nobody read.
  • An SMS provider tighten sender-ID rules for a specific country, silently dropping messages.
  • An OAuth provider rotate a signing key and orphan every long-lived session.

What to do:

  • Subscribe to the status page and changelog of every third-party you depend on.
  • Write an integration smoke test that runs nightly against each one's sandbox and alerts on divergence.
  • Keep a single file in the repository listing every integration, its API version, and the expiry date of any secret or certificate it uses.
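The nightly divergence check in the second bullet can start as something very small: compare tonight's sandbox response against a recorded baseline and alert on any difference. A sketch, with a hypothetical refund payload:

```python
def divergences(baseline: dict, current: dict) -> list[str]:
    """Fields where a provider's sandbox response no longer matches the
    recorded baseline. Run nightly; alert on any non-empty result."""
    diffs = []
    for key, expected in baseline.items():
        if key not in current:
            diffs.append(f"{key}: missing from response")
        elif current[key] != expected:
            diffs.append(f"{key}: expected {expected!r}, got {current[key]!r}")
    return diffs

# Recorded at integration time vs. tonight's sandbox run.
baseline = {"status": "refunded", "amount": "10.00", "currency": "USD"}
tonight = {"status": "refunded", "amount": "9.99", "currency": "USD"}
print(divergences(baseline, tonight))
```

That one-cent rounding change from the first bullet shows up here the night it happens, not at month-end reconciliation.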

5. The dusty cron job

Failure mode: a scheduled task (nightly billing, weekly report, monthly close) stops running. Nobody notices until someone asks "why didn't I get the report?"

Cron jobs die silently. Servers get rebuilt without them. Daylight-saving changes shift them by an hour. A deploy accidentally changes the user the job runs as, and now it has no permissions.

What to do: every scheduled job should ping a monitoring service when it succeeds. If the ping does not arrive in its expected window, you get alerted. This inverts the failure mode — instead of "someone will notice when the report is missing," you find out the same night.
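The success-ping pattern is a one-screen wrapper. In production `ping` would be an HTTP GET to the job's heartbeat URL at whatever monitoring service you use; it is injected here so the sketch can be exercised without a network:

```python
from typing import Callable

def run_with_heartbeat(job: Callable[[], None], ping: Callable[[], None]) -> bool:
    """Run a scheduled job and ping the monitor only on success."""
    try:
        job()
    except Exception:
        # No ping arrives, the monitor's expected window passes,
        # and someone is alerted tonight -- not when the report is missed.
        return False
    ping()
    return True
```

The crucial property is the inversion: the alert fires on the *absence* of a signal, so a server rebuilt without its crontab still triggers it.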

6. The disk that slowly fills up

Failure mode: uploads, logs, or temporary files fill the disk. Writes start failing. The app crashes, sometimes partially, at 2 a.m.

This is boring, predictable, and regularly the cause of the most expensive outages we are called in to fix. Usually:

  • Log rotation was never configured.
  • An upload directory was never cleaned of orphaned files.
  • A session-store directory is growing by 200 MB a day because nobody added expiry.

What to do: monitor disk usage with a threshold, rotate logs, purge temp files on a schedule, and include disk usage in your handover checklist.

7. Secrets left on developer laptops

Failure mode: a developer leaves the company or changes laptops. Nobody rotates the API keys they had. One day the old laptop leaks and a stranger has production credentials.

We treat every production secret as if it will eventually leak. That means:

  • Every secret has an owner and an expiry.
  • We rotate production credentials at a regular cadence, whether or not there has been an incident.
  • When a team member leaves, we run through the secret inventory the same week and rotate anything they touched.

This is tedious. It also prevents the kind of incident that ends vendor relationships.
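The secret inventory does not need tooling to start; a list with an owner and a last-rotated date per secret is enough to make the cadence enforceable. A sketch — the names, secrets, and 90-day cadence are all illustrative:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Secret:
    name: str
    owner: str
    rotated_on: date
    max_age_days: int = 90  # the rotation cadence; pick one and enforce it

def due_for_rotation(inventory: list[Secret], today: date) -> list[str]:
    """Names of secrets past their cadence. Run weekly, and on every offboarding."""
    return [
        s.name for s in inventory
        if today - s.rotated_on > timedelta(days=s.max_age_days)
    ]

inventory = [
    Secret("stripe_api_key", "nadia", date(2026, 1, 5)),
    Secret("smtp_password", "ruwan", date(2025, 9, 1)),
]
print(due_for_rotation(inventory, date(2026, 2, 1)))
```

When someone leaves, the same inventory answers "what did they touch?" in one pass instead of a week of guessing.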

8. The forgotten user list

Failure mode: an ex-employee, ex-vendor, or former contractor still has admin access months after they left.

Access control rot is a silent failure. Nothing breaks. Everything works. Right up until something embarrassing or criminal happens, and you find out the person who did it should not have been in the system in the first place.

What to do:

  • Quarterly admin-user review. Export the list. Challenge every name. Revoke anything that cannot be justified.
  • Tie deactivation to HR offboarding, not to memory.
  • Prefer SSO where possible so deactivating one account removes access everywhere.
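Mechanically, the quarterly review in the first bullet is a set difference between your admin export and HR's active-staff list. A sketch with hypothetical accounts:

```python
def stale_admins(admin_users: set[str], active_staff: set[str]) -> set[str]:
    """Admin accounts with no matching active employee. Every name in the
    result must be justified in writing or revoked."""
    return admin_users - active_staff

admins = {"amal@example.lk", "kasun@example.lk", "old-vendor@agency.example"}
staff = {"amal@example.lk", "kasun@example.lk"}
print(stale_admins(admins, staff))  # {'old-vendor@agency.example'}
```

The hard part is not the code; it is getting a trustworthy staff list, which is exactly why deactivation should hang off HR offboarding rather than memory.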

9. The monitoring that nobody reads

Failure mode: the alerting exists, it even fires, but the alerts go to a shared inbox or Slack channel that nobody checks, and the real incident arrives via a customer phone call.

If you cannot name the person who would see a production alert at 2 a.m. and who is expected to respond, you do not have monitoring. You have decoration.

What to do: on-call is a role, not a group chat. Someone owns it, they know they own it, and they have the credentials to act. If that is outsourced to your vendor, put it in the contract. If it is not, put it on your own team's rota.

10. Documentation that drifted from reality

Failure mode: the wiki says the deploy command is X. The real deploy command has been Y for eight months. A new developer joins, follows the wiki, breaks production.

Documentation rot is inevitable. The only defences are:

  • Treat the deployment script and the runbook as code. If they change, the docs change in the same pull request.
  • Run the handover walkthrough with a new hire at least once a year, following the docs exactly. Every point of friction is a doc bug.
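The "docs change in the same pull request" rule in the first bullet can be enforced as a small CI gate. A sketch; the file paths are placeholders for whatever pairs matter in your repository:

```python
def doc_drift_warnings(changed_files: set[str]) -> list[str]:
    """CI gate: if an operational file changes, its doc must change in
    the same pull request."""
    pairs = {
        "scripts/deploy.sh": "docs/runbook.md",
        "config/crontab": "docs/scheduled-jobs.md",
    }
    return [
        f"{script} changed but {doc} did not"
        for script, doc in pairs.items()
        if script in changed_files and doc not in changed_files
    ]

# The changed-file set would come from your CI's diff of the pull request.
print(doc_drift_warnings({"scripts/deploy.sh", "src/app.py"}))
```

Fail the build on a non-empty result and the wiki stops drifting by construction, not by discipline.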

The honest takeaway

None of these items are exotic. None of them require a senior engineer to fix. Almost all of them are invisible until they fail, and expensive the moment they do.

The good vendors and good in-house teams are not the ones that never have these problems. They are the ones that have quietly built the habits above, so that when each failure arrives, it is a small Tuesday task instead of a Saturday disaster.

If you have launched something recently and any of this feels uncomfortably familiar, we offer a flat-fee post-launch review that goes through this list and about forty more items against your specific setup. No long engagement required. You get a document with each finding, a severity, and a fix.

Original Source: The unglamorous list of things that actually break after launch
