Or seen from someone else!

  • TootSweet@lemmy.world
    5 hours ago

    If you use autoscaling for your node counts, then at the start of any prod outage that keeps traffic from reaching your application, pin the number of nodes high. If you don’t, the reduced traffic during the outage will let the autoscaler scale your node counts way down. Then, when the outage ends, the flood of traffic onto too few nodes will overwhelm your application and prolong the outage.

    Basically, I let a 2-hour outage turn into a 3-hour outage by not thinking to pin node counts high.
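
    To make the failure mode concrete, here is a toy simulation, not any real autoscaler's API. The `Autoscaler` class, the 500 requests/sec per node figure, and the traffic numbers are all made up for illustration; the only point is that the node floor decides how much capacity is left when traffic comes back.

```python
from dataclasses import dataclass


@dataclass
class Autoscaler:
    """Hypothetical autoscaler: sizes the pool from observed traffic,
    clamped to the range [min_nodes, max_nodes]."""
    min_nodes: int
    max_nodes: int
    rps_per_node: int  # assumed capacity of one node, in requests/sec

    def desired_nodes(self, observed_rps: int) -> int:
        # Ceiling division: nodes needed to serve the observed traffic.
        needed = -(-observed_rps // self.rps_per_node)
        return max(self.min_nodes, min(self.max_nodes, needed))

    def capacity(self, nodes: int) -> int:
        return nodes * self.rps_per_node


NORMAL_RPS = 10_000  # illustrative steady-state traffic
OUTAGE_RPS = 0       # traffic cannot reach the application

# Default floor: the autoscaler is free to shrink during the outage.
unpinned = Autoscaler(min_nodes=2, max_nodes=50, rps_per_node=500)
healthy_nodes = unpinned.desired_nodes(NORMAL_RPS)
unpinned_nodes = unpinned.desired_nodes(OUTAGE_RPS)

# Pinned: raise the floor to the healthy node count when the outage starts.
pinned = Autoscaler(min_nodes=healthy_nodes, max_nodes=50, rps_per_node=500)
pinned_nodes = pinned.desired_nodes(OUTAGE_RPS)

print("healthy nodes:", healthy_nodes)
print("unpinned nodes during outage:", unpinned_nodes,
      "-> capacity at recovery:", unpinned.capacity(unpinned_nodes))
print("pinned nodes during outage:", pinned_nodes,
      "-> capacity at recovery:", pinned.capacity(pinned_nodes))
```

    In this sketch the unpinned pool comes back with a tenth of the capacity it needs, while the pinned pool can absorb the returning traffic right away. In a real system the pin would probably be whatever "minimum size" setting your autoscaler exposes, and you would lower it again once traffic has stabilized.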

    A lesser sin than my coworker who once deleted the whole prod database.