Our first outage from LLM-written code

skip0110@lemmy.zip · 6 months ago

deegeese@lemmy.dbzer0.com · edited 6 months ago

I gasped when I saw this:

“A bit of discussion indicated that the trigger for the CPU spikes both times was our CEO logging in. We re-deployed to get a clean start, permanently banned him from the service, and moved on.”

This is like finding a live grenade under your bed and putting it under the rug.

They found a way to reproduce a system-killing bug, and instead of taking the time to understand it, they threw away their test case.

BlazeDaley@lemmy.world · 6 months ago

They contained the impact. Root-causing or “understanding” should come after impact mitigation. If needed, find a safe way to reproduce the bug without customer impact.

“We reverted the refactoring, deployed, un-banned the CEO, and set about analysis.”

FizzyOrange@programming.dev · 6 months ago

Yeah, me too, but if you keep reading, they didn’t actually “move on” in the way that it sounds.