GitHub Outages Since Microslop Acquisition

0x0@lemmy.zip · 1 day ago

GitHub Outages Since Microslop Acquisition

Possibly linux@lemmy.zip · edit-2 29 minutes ago

It is impressive how badly Microsoft is fumbling the bag

GitHub has gotten extremely popular but it also sucks really bad

k0e3@lemmy.ca · 7 hours ago

Surely they could just Copilot their way out of this mess lmao

TJA!@sh.itjust.works · 50 minutes ago

They are trying ^^

bagsy@lemmy.world · 11 hours ago

But the payment processing service has 9 nines of uptime…

Possibly linux@lemmy.zip · 31 minutes ago

I highly doubt it

merc@sh.itjust.works · 15 hours ago

I’ve worked on services with 5 nines of availability (i.e. 99.999% available, less than 5 minutes of downtime allowed per year). I’ve more frequently worked on ones with 4 nines, where you’re allowed almost an hour of downtime per year. GitHub is now barely maintaining 2 nines. That’s just embarrassing.

Each “nine” you add is much more difficult. To get four nines you need people on call who can start working on a problem within 5 minutes and fix it within a few more minutes, and you can only get those calls once every couple of months. Five nines means that you need people at their desks in shifts ready to start fixing something the moment there’s a problem because it would take too long for someone on-call to get their computer out, connect and authenticate. It requires warm backup systems that are sitting idle but ready to take over fully at a moment’s notice.

A two nines system is allowed to be down for 100x as long as a four nines system, and 1000x as long as a five nines system. It’s almost 15 minutes of downtime allowed per day, compared to about 15 minutes every 3 months for a four-nines system. Gamers wouldn’t even put up with a two-nines system for a video game. It’s absurd to allow that for a critical piece of infrastructure for software.
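The downtime figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming a 365-day year and treating N nines as an unavailability of 10^-N; the function name `allowed_downtime_minutes` is made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60
MINUTES_PER_YEAR = 365 * MINUTES_PER_DAY  # assumption: ignores leap years


def allowed_downtime_minutes(nines: int, period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime budget for an availability of `nines` nines (2 -> 99%, 5 -> 99.999%)."""
    unavailability = 10 ** -nines  # e.g. 2 nines -> 0.01
    return period_minutes * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        per_year = allowed_downtime_minutes(n)
        per_day = allowed_downtime_minutes(n, MINUTES_PER_DAY)
        print(f"{n} nines: {per_year:8.1f} min/year, {per_day:7.3f} min/day")
```

Under these assumptions five nines works out to roughly 5.3 minutes per year, four nines to roughly 53 minutes per year, and two nines to 14.4 minutes per day, which matches the rough numbers quoted in the comment.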

HrabiaVulpes@europe.pub · 43 minutes ago

I call bullshit on “Gamers wouldn’t put up with a two-nines system for a video game”

Elder Scrolls Online has a weekly scheduled outage for about 8h. Every Monday. Players have been complaining about it for years, but the game is still popular.

P03 Locke@lemmy.dbzer0.com · edit-2 13 hours ago

Five nines means that you need people at their desks in shifts ready to start fixing something the moment there’s a problem

No, it means you don’t have outages. Ever.

Five-nines is something like 7 minutes of downtime throughout the entire year. At best, you might have automated failover systems that require tiny outages. No human involvement, though, unless you’re dealing with some major breakage that would have killed the five-nines commitment that year, anyway.

It takes a human something like 5-10 minutes just to get out of bed and figure out the situation, anyway.

merc@sh.itjust.works · 13 hours ago

No, it means you don’t have outages. Ever.

No, that’s infinite nines, which isn’t possible.

Five-nines is something like 7 minutes of downtime throughout the entire year. At best, you might have automated failover systems that require tiny outages. No human involvement, though, unless you’re dealing with some major breakage that would have killed the five-nines commitment that year, anyway.

Yes, you have automated failover systems. But, if something happens which causes those systems to fail over, you need to immediately investigate what happened and why. Even at four nines you have automatic failover, redundant systems, hot spares, etc. But, you accept that sometimes not everything will work as planned and you’ll need to fix something. Five nines is just that and more.

It takes a human something like 5-10 minutes just to get out of bed and figure out the situation, anyway.

Right, which is why I said that four nines is your realistic maximum if you’re going to have people on call who aren’t actually at their desks. To get better than four nines you need to have around the clock coverage with people at their desks so when a system breaks you have eyes on it in something like 30s.

P03 Locke@lemmy.dbzer0.com · 13 hours ago

No, that’s infinite nines, which isn’t possible.

It’s not impossible. Large reliable websites do it all the time. It’s called 100% uptime.

Sure, it’s measured per year, and sometimes they have some outage that breaks the record. But, it is possible to have 100% uptime throughout the year.

merc@sh.itjust.works · 12 hours ago

It’s not impossible. Large reliable websites do it all the time. It’s called 100% uptime.

No, no website does it. There is no such thing as 100% uptime. If it happens, great, but I can guarantee you that no website even aims for 5 nines of uptime.

Google is the benchmark for website availability and in 2022 they had an outage that lasted an hour, meaning they didn’t meet 4 nines for the year.

Sure, it’s measured per year, and sometimes they have some outage that breaks the record. But, it is possible to have 100% uptime throughout the year.

If you miss your SLO target for the year, then you missed your SLO target. If you’re down for 60 minutes but fine for the other 11 months, 29 days and 23 hours, you still missed your yearly SLO.
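The point about a single outage blowing the yearly target can be checked directly. The following is only a sketch: the 365-day year and the helper name `meets_yearly_slo` are assumptions for illustration, not anything a real SLO tool defines.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: ignores leap years


def meets_yearly_slo(downtime_minutes: float, target_availability: float) -> bool:
    """True if the total downtime fits inside the yearly error budget."""
    budget_minutes = MINUTES_PER_YEAR * (1 - target_availability)
    return downtime_minutes <= budget_minutes


if __name__ == "__main__":
    outage_minutes = 60  # one hour-long outage, nothing else all year
    for target in (0.999, 0.9999, 0.99999):
        verdict = "met" if meets_yearly_slo(outage_minutes, target) else "missed"
        print(f"target {target:.3%}: {verdict}")
```

With these assumptions a single 60-minute outage still fits a three-nines target (about 526 minutes of budget), but misses four nines (about 53 minutes) and five nines (about 5 minutes), regardless of how clean the rest of the year was.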

P03 Locke@lemmy.dbzer0.com · 7 hours ago

No, no website does it. There is no such thing as 100% uptime. If it happens, great, but I can guarantee you that no website even aims for 5 nines of uptime.

Google is the benchmark for website availability and in 2022 they had an outage that lasted an hour, meaning they didn’t meet 4 nines for the year.

In 2022. In the other years, they had 100% uptime.

Also, yes, there are plenty of clients that ask for five-nines. Is it realistic? Probably not. But, they definitely ask.

If you miss your SLO target for the year, then you missed your SLO target. If you’re down for 60 minutes but fine for the other 11 months, 29 days and 23 hours, you still missed your yearly SLO.

I understand how SLO targets work. If somebody is asking for a five-nines as an SLO, they are basically asking for 100% uptime, because there is no such thing as a “five minute outage”, especially not one that is fixable without total automation.

Again, a human hasn’t even gotten paged and out of bed in 5 minutes’ time.

YeahToast@aussie.zone · 23 minutes ago

Again, a human hasn’t even gotten paged and out of bed in 5 minutes’ time.

Dude, why do you keep referencing that people won’t get out of bed in time, when that’s exactly what the OC originally said XD

Waraugh@lemmy.dbzer0.com · edit-2 14 hours ago

I’m used to environments where they expect five nines, get 3 (maybe 4) nines, and fund for 1 nine.

9point6@lemmy.world · 1 day ago

Damage@feddit.it · edit-2 1 day ago

I see two nines

caseyweederman@lemmy.ca · 14 hours ago

I see six

huquad@lemmy.ml · 1 day ago

Microsoft never promised where the nines would be

chellomere@lemmy.world · 21 hours ago

0.99%

AnUnusualRelic@lemmy.world · 20 hours ago

Lies! 89.98% has two nines in it!

cyberduck@aussie.zone · 1 day ago

I’m stupid what does zero nines uptime mean?

0x0@lemmy.zip · 1 day ago

These services measure their uptimes in number of nines, the more the better.

dohpaz42@lemmy.world · 1 day ago

Sometimes the humorous term “nine fives” (55.5555555%) is used to contrast with “five nines” (99.999%),[18][19][20] though this is not an actual goal….

Maybe Microsoft misunderstood the assignment, and thought this was a goal. At their current rate, it’s certainly more achievable than the more traditional “five nines”.

As an aside, I love how the following is prefaced as “casual”, and then the author starts arguing semantics:

Similarly, percentages ending in a 5 have conventional names, traditionally the number of nines, then “five”, so 99.95% is “three nines five”, abbreviated 3N5.[13][14] This is casually referred to as “three and a half nines”,[15] but this is incorrect….

Hazel@piefed.blahaj.zone · 1 day ago

the author starts arguing semantics

Legendary levels of pedantry, gave me a real good chuckle 🤭

Tiresia@slrpnk.net · 17 hours ago

Being casual does not shield you from your mathematical incorrectness.

cyberduck@aussie.zone · 1 day ago

Ah makes sense. Thanks

ByteJunk@lemmy.world · edit-2 1 day ago

When contracting a service, usually there are clauses that specify that it needs to be fully working and available x% of time, and compensation may be due in case this goal isn’t met.

Let’s say GitHub was down for 1 full day in the last year, that’s 99.7% availability. That’s “2 nines”, but sometimes people might say “2 nines five”, meaning “better than 99.5% uptime”.

I’d say that the expectation for a high availability service nowadays is “5 nines”: 99.999% uptime. That’s around 5 minutes of downtime in a full year. This kind of performance from a site like GitHub is just unacceptable…
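To make the "1 full day down is 99.7%" example concrete, here is a rough sketch that converts downtime into availability and into a fractional "number of nines". It assumes a 365-day year and defines nines as the negative base-10 logarithm of unavailability; both helper names are hypothetical.

```python
import math


def availability(downtime_days: float, period_days: float = 365) -> float:
    """Fraction of the period the service was up."""
    return 1 - downtime_days / period_days


def nines(avail: float) -> float:
    """Fractional nines: -log10 of unavailability (0.99 -> 2, 0.999 -> 3)."""
    return -math.log10(1 - avail)


if __name__ == "__main__":
    a = availability(1)  # one full day of downtime in a year
    print(f"availability: {a:.3%}")         # availability: 99.726%
    print(f"nines: {nines(a):.2f}")          # nines: 2.56
    print(f"beats 99.5%: {a >= 0.995}")      # beats 99.5%: True
```

So one day of downtime per year lands between two and three nines, and above the 99.5% mark that the comment calls "2 nines five".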

raspberriesareyummy@lemmy.world · 24 hours ago

Thank you, that is much more helpful than OP’s graph

paris@lemmy.blahaj.zone · edit-2 18 hours ago

https://damrnelson.github.io/github-historical-uptime/

A lot of this is GitHub Actions alone, but a lot of it isn’t. I also don’t know how well GitHub tracked outages before the Microsoft acquisition. It’s entirely possible the graph looks so bad because they only took outage tracking seriously after being acquired. I don’t know.

Further related discussion on Hacker News

frank@sopuli.xyz · 1 day ago

Move slow and break shit

InvalidName2@lemmy.zip · 20 hours ago

It’s the best of both worsts.

raspberriesareyummy@lemmy.world · 24 hours ago

Nothing to make a point like snipping off the y-axis scaling.

I hate Microslop like any person with > 2 brain cells, but that graph is useless - all visible y-entries end in a 0 - might as well be 99.990, 99.980, 99.970, …

Jordan117@lemmy.world · 23 hours ago

It’s just Xitter’s image viewer cropping it automatically; the original upload has it.

prenatal_confusion@feddit.org · 21 hours ago

It is still bad practice to select a narrow window of an axis like this. It shows a difference that seems massive relative to what is shown, but isn’t that significant once we can see it in relation to the whole.

Graph 101

Obi@sopuli.xyz · 8 hours ago

This is a commonly known issue with graphs and one that gets repeated without a lot of consideration for context. While it’s generally a good basic rule to have graphs show the full vertical axis, it’s not like it’s a hard rule that needs to be followed 100% of the time. In this case for example, moving from 99.999% (five nines) to 99% (two nines) is a significant effect, it has importance. Displaying the full axis would make that difference unnoticeable and render the graph useless.

prenatal_confusion@feddit.org · 4 hours ago

Yes, I absolutely agree, but it has to be transparent, and for me it is intentionally misleading to show it like this. Yes, it’s still significant and still shows a lack of care from Microslop, but context matters to me. Maybe more than to others :)) I acknowledge that I am special that way and that this is fine for others.

DahGangalang@infosec.pub · 1 day ago

Obv a gross looking chart, but I am bothered that the left hand scale is trimmed off. I expect those are 10% increments, but wouldn’t be shocked if Original was like 99.0, 98.0, 97.0, etc.

vogi@piefed.social · 1 day ago

You’d be surprised: https://damrnelson.github.io/github-historical-uptime/

But weirdly enough it feels much worse using gh professionally than the scale makes it seem.

lemmyman@lemmy.world · edit-2 1 day ago

The graph is neat.

Saving some people a click: the cut-off y scale in the OP image is in 0.1% increments. So the lowest point is a little above 99.5%

raspberriesareyummy@lemmy.world · 24 hours ago

Thank you! I was thinking “it can’t just be me that’s bothered”

Safeguard@beehaw.org · 20 hours ago

Is that real? Because that… Makes it real clear…

bitjunkie@lemmy.world · 18 hours ago

That’s just fucking disgraceful.

Possibly linux@lemmy.zip · 29 minutes ago

You should see what they are doing to Minecraft

MBech@feddit.dk · edit-2 1 day ago

How does this correspond with growth? I imagine having 100% uptime is much harder the bigger a platform is, so did GitHub grow a lot in the same period?

I’m not questioning whether or not Microsoft has issues, I just find it relevant whether or not they very suddenly saw a 2000% increase in server usage or something.

jatone@lemmy.dbzer0.com · edit-2 1 day ago

I imagine having 100% uptime is much harder the bigger a platform is, so did Github grow a lot in the same period?

it’s not. there are scale points where, once you hit a critical number, you need to re-architect your backend: 1k, 10k, 1mil, etc. these usually vary based on your app, but they’re usually exponential, so once you hit the higher levels it takes much longer to reach the next level.

on top of that, by the higher tiers you usually have proper backpressure and signals being sent to the frontend systems to dynamically manage the load generated. so suddenly uptime is much easier.

when you see large repeated failures like this, the cause is almost always corporate causing issues:

reducing engineering budget.
not listening to the engineering department on product decisions. (see the recent product manager AI generated commit that got merged and caused a mild uproar over ‘co authored by copilot’)
rushing nonsense out before it’s ready.

in this particular case i bet it’s cutting engineering head count and increasing AI slop generated code without proper review by engineers, which i’ve been hearing about a lot more from my engineering friends.