A field report of the real failures we see after go-live — SSL renewals, inbox reputation, dusty cron jobs, silent integrations — and what to watch for.
DevLK Editorial Team
20 Apr 2026
Launching software is the easy day. The interesting work happens weeks, months, and years later — at 3 a.m., quietly, when nobody is watching.
This post is a field report: the things we actually see break in production on SME-scale web applications after launch, ranked roughly by how often they fail and how expensive the failure is. If you are about to launch something, or you already have and are wondering what to worry about next, use this as a checklist.
Failure mode: certificate expires, browsers refuse to load the site, customers assume you are hacked.
If you are on Let's Encrypt, renewals are usually automated. "Usually" is doing a lot of work in that sentence. We have seen renewals fail because:
- the certbot timer or cron entry was lost when a server was rebuilt;
- a new firewall rule blocked port 80, so HTTP-01 validation could no longer reach the server;
- a DNS change broke DNS-01 validation for a wildcard certificate;
- the renewal itself succeeded, but the web server was never reloaded, so it kept serving the old certificate.
What to do: actually test what happens when a certificate is 15 days from expiry. Many monitoring tools will warn you; very few default setups say anything before it is too late.
Failure mode: your transactional emails (password resets, invoices, order confirmations) silently start landing in spam.
This is the single most common post-launch complaint we receive, and it almost never comes from a developer. It comes from a customer who "didn't get the email."
Common causes, in order of how often we see them:
- missing or misconfigured SPF and DKIM records, so receiving servers cannot verify the sender;
- sending from the web server's own IP, which has no sending reputation at all;
- a shared IP on a cheap host that another tenant has already burned;
- a sudden spike in volume from a previously quiet domain;
- no DMARC policy, which some large providers now hold against you.
The fix is almost always: use a proper transactional email provider, authenticate properly with SPF + DKIM + DMARC, monitor your sender reputation, and do not use a personal Gmail as the "from" address for a business app.
Failure mode: the database gets corrupted, the backups "exist," the restore fails.
This one is almost comically common. Teams spend a day setting up nightly backups, tick the box, and move on for three years. When the day comes that they need to restore, they discover one of:
- the backup job silently stopped months ago after a password or path change;
- the backups are of the wrong database, or are empty;
- the files exist but are corrupt, and nobody ever tried opening one;
- the restore procedure was never documented, and the one person who knew it has left;
- the restore works, but takes longer than the business can survive.
What to do: run a real restore test at least quarterly. Pick a copy of yesterday's backup, restore it to a scratch server, run the app against it, confirm the numbers match. If that test has never been done, you do not have backups; you have hope.
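The restore test itself has to run against your real stack, but cheap pre-checks can run every day. A sketch with invented thresholds (`backup_looks_plausible` is our name for it, not a standard tool), and note the caveat in the docstring:

```python
import time
from pathlib import Path

def backup_looks_plausible(path: str, max_age_hours: float = 26,
                           min_bytes: int = 1024) -> list[str]:
    """Cheap pre-restore sanity checks: the file exists, is recent, is not tiny.
    Passing these proves nothing about restorability; only a real restore does."""
    p = Path(path)
    if not p.is_file():
        return [f"{path}: backup file missing"]
    problems = []
    age_hours = (time.time() - p.stat().st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"{path}: last written {age_hours:.0f} hours ago")
    if p.stat().st_size < min_bytes:
        problems.append(f"{path}: only {p.stat().st_size} bytes")
    return problems
```

This catches the "the job silently stopped months ago" failure for almost no effort; the quarterly restore test catches everything else.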
Failure mode: a payment gateway, shipping API, or SMS provider quietly changes behaviour, and your app is wrong before you notice.
Integrations are the most volatile part of any modern application. In the last two years we have seen:
- payment gateways deprecate old API versions with short migration windows;
- webhook signature schemes change, so event verification quietly started failing;
- SMS providers tighten sender-ID rules, so messages stopped arriving in some regions;
- shipping APIs alter response formats and rate limits with little notice.
What to do:
- pin the API version you integrate against, and upgrade deliberately, not accidentally;
- subscribe to every provider's changelog and status page;
- alert on error rates and payload-validation failures, not just on downtime;
- keep raw responses in your logs long enough to debug a dispute.
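One of the cheapest defences is a small contract test in CI that fails the build when a response shape changes under you. A sketch; the field names here are invented for illustration, so match them to whatever your actual provider sends:

```python
def validate_payment_webhook(payload: dict) -> list[str]:
    """Minimal contract check for a hypothetical payment webhook payload.
    Returns a list of problems; an empty list means the shape is as expected."""
    expected = {"id": str, "amount": int, "currency": str, "status": str}
    errors = []
    for field, expected_type in expected.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}, "
                          f"got {type(payload[field]).__name__}")
    return errors
```

Run the same check against live webhook traffic and alert on failures, and a provider-side change becomes a page to you instead of a wrong invoice to a customer.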
Failure mode: a scheduled task (nightly billing, weekly report, monthly close) stops running. Nobody notices until someone asks "why didn't I get the report?"
Cron jobs die silently. Servers get rebuilt without them. Daylight-saving changes shift them by an hour. A deploy accidentally changes the user the job runs as, and now it has no permissions.
What to do: every scheduled job should ping a monitoring service when it succeeds. If the ping does not arrive in its expected window, you get alerted. This inverts the failure mode — instead of "someone will notice when the report is missing," you find out the same night.
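The pattern is small enough to sketch. The ping URL is whatever your monitoring service issues (a healthchecks-style endpoint); the `ping` argument is injectable so the wrapper can be exercised without a network:

```python
import urllib.request

def run_with_heartbeat(job, ping_url: str, ping=None):
    """Run a scheduled job; ping the monitoring URL only if it succeeds.
    If the job raises, no ping is sent, the window lapses, and the alert fires."""
    if ping is None:
        ping = lambda url: urllib.request.urlopen(url, timeout=10)
    result = job()      # any exception propagates before the ping
    ping(ping_url)
    return result
```

Wrap every cron entry this way and "the report never arrived" becomes an alert the same night instead of a question a week later.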
Failure mode: uploads, logs, or temporary files fill the disk. Writes start failing. The app crashes, sometimes partially, at 2 a.m.
This is boring, predictable, and regularly the cause of the most expensive outages we are called in to fix. Usually it is some combination of:
- application logs that nobody configured rotation for;
- user uploads and generated exports that are never purged;
- old database dumps kept "just in case" on the same partition;
- orphaned temp files from jobs that crashed halfway.
What to do: monitor disk usage with a threshold, rotate logs, purge temp files on a schedule, and include disk usage in your handover checklist.
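A minimal version of the threshold check, using only the standard library (the 85% threshold is an arbitrary example; pick one that gives you time to react):

```python
import shutil

def disk_percent_used(total: int, free: int) -> float:
    """Percentage of a filesystem in use, given totals in bytes."""
    return 100.0 * (total - free) / total

def check_disk(path: str = "/", threshold: float = 85.0) -> bool:
    """True if the filesystem holding `path` is above the alert threshold."""
    usage = shutil.disk_usage(path)
    percent = disk_percent_used(usage.total, usage.free)
    if percent >= threshold:
        print(f"ALERT: {path} is {percent:.1f}% full")
    return percent >= threshold
```

Any monitoring agent does this for you; the point is that *something* must, because a full disk announces itself only by breaking.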
Failure mode: a developer leaves the company or changes laptops. Nobody rotates the API keys they had. One day the old laptop leaks and a stranger has production credentials.
We treat every production secret as if it will eventually leak. That means:
- secrets live in a secrets manager or environment config, never in the repository;
- each environment and each service gets its own credentials, scoped as narrowly as possible;
- every key has a known owner and a documented rotation procedure;
- rotation actually happens: on a schedule, and immediately when anyone with access leaves.
This is tedious. It also prevents the kind of incident that ends vendor relationships.
Failure mode: an ex-employee, ex-vendor, or former contractor still has admin access months after they left.
Access control rot is a silent failure. Nothing breaks. Everything works. Right up until something embarrassing or criminal happens, and you find out the person who did it should not have been in the system in the first place.
What to do:
- keep a single offboarding checklist that names every system, including the ones outside single sign-on;
- run a periodic access review: list every account on every system and ask who it belongs to;
- prefer individual accounts over shared logins, so removal is even possible;
- treat vendors and contractors exactly like employees in that review.
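A periodic access review can be largely mechanical once you have the lists: diff each system's accounts against the current staff roster and chase whatever is left over. A sketch (the system and user names are illustrative):

```python
def access_review(active_staff: set[str], systems: dict[str, set[str]]) -> dict[str, set[str]]:
    """For each system, the accounts that do not belong to current staff.
    `systems` maps a system name to the set of usernames with access."""
    report = {}
    for system, users in systems.items():
        orphans = users - active_staff
        if orphans:
            report[system] = orphans
    return report
```

The hard part is not the diff; it is maintaining an honest list of systems, which is exactly what the offboarding checklist gives you.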
Failure mode: the alerting exists, it even fires, but the alerts go to a shared inbox or Slack channel that nobody checks, and the real incident arrives via a customer phone call.
If you cannot name the person who would see a production alert at 2 a.m. and who is expected to respond, you do not have monitoring. You have decoration.
What to do: on-call is a role, not a group chat. Someone owns it, they know they own it, and they have the credentials to act. If that is outsourced to your vendor, put it in the contract. If it is not, put it on your own team's rota.
Failure mode: the wiki says the deploy command is X. The real deploy command has been Y for eight months. A new developer joins, follows the wiki, breaks production.
Documentation rot is inevitable. The only defences are:
- keep the docs next to the code, so the pull request that changes the behaviour can change the page;
- make the documented command the executable one — a script or Makefile target, not prose;
- have every new joiner follow the docs literally and fix what they find, as their first task.
None of these items are exotic. None of them require a senior engineer to fix. Almost all of them are invisible until they fail, and expensive the moment they do.
The good vendors and good in-house teams are not the ones that never have these problems. They are the ones that have quietly built the habits above, so that when each failure arrives, it is a small Tuesday task instead of a Saturday disaster.
If you have launched something recently and any of this feels uncomfortably familiar, we offer a flat-fee post-launch review that goes through this list and about forty more items against your specific setup. No long engagement required. You get a document with each finding, a severity, and a fix.
Original Source: The unglamorous list of things that actually break after launch