• deegeese@lemmy.dbzer0.com
    2 days ago

    I gasped when I saw this:

    A bit of discussion indicated that the trigger for the CPU spikes both times was our CEO logging in. We re-deployed to get a clean start, permanently banned him from the service, and moved on.

    This is like finding a live grenade under your bed and putting it under the rug.

    They found a way to reproduce a system-killing bug and, instead of taking the time to understand it, threw away their test case.

    • BlazeDaley@lemmy.world
      2 days ago

      They contained the impact. Root-causing, or “understanding,” should come after impact mitigation. If needed, find a safe way to reproduce the bug without customer impact.

      We reverted the refactoring, deployed, un-banned the CEO, and set about analysis.