GitHub: Git operation failures (githubstatus.com)
362 points by wilhelmklopp 12 hours ago | 87 comments
aeldidi 10 hours ago
I'm becoming concerned with the rate at which major software systems seem to be failing as of late. For context, last year I only logged four outages that actually disrupted my work; this quarter alone I'm already on my fourth, all within the past few weeks. This is, of course, just an anecdote and not evidence of any wider trend (not to mention that I might not have even logged everything last year), but it was enough to nudge me into writing this today (helped by the fact that I suddenly had some downtime). Keep in mind, this isn't necessarily specific to this outage, just something that's been on my mind enough to warrant writing about it.

It feels like resiliency is becoming a bit of a lost art in networked software. I've spent a good chunk of this year chasing down intermittent failures at work, and I really underestimated how much work goes into shrinking the "blast radius", so to speak, of any bug or outage. Even though we mostly run a monolith, we still depend on a bunch of external pieces like daemons, databases, Redis, S3, monitoring, and third-party integrations, and we generally assume that these things are present and working in most places, which wasn't always the case. My response was to better document the failure conditions, and once I did, I realized there were many more than we initially thought. Since then we've done things like: move some things to a VPS instead of cloud services, automate deployment more than we already had, greatly improve the test suite and docs to include these newly considered failure conditions, and generally cut down on moving parts. It was a ton of effort, but the payoff has finally shown up: our records show fewer surprises, which means fewer distractions and a much calmer system overall. Without that unglamorous work, things would've only grown more fragile as complexity crept in. And I worry that, more broadly, we're slowly un-learning how to build systems that stay up even when the inevitable bug or failure shows up.

For completeness, here are the outages that prompted this: the AWS us-east-1 outage in October (took down the Lightspeed R series API), the Azure Front Door outage (prevented Playwright from downloading browsers for tests), today’s Cloudflare outage (took down Lightspeed’s website, which some of our clients rely on), and the Github outage affecting basically everyone who uses it as their git host.

HardwareLust 10 hours ago
It's money, of course. No one wants to pay for resilience/redundancy. I've launched over a dozen projects going back to 2008, clients simply refuse to pay for it, and you can't force them. They'd rather pinch their pennies, roll the dice and pray.
stinkbeetle 8 hours ago
> It's money, of course.

100%

> No one wants to pay for resilience/redundancy. I've launched over a dozen projects going back to 2008, clients simply refuse to pay for it, and you can't force them. They'd rather pinch their pennies, roll the dice and pray.

Well, fly by night outfits will do that. Bigger operations like GitHub will try to do the math on what an outage costs vs what better reliability costs, and optimize accordingly.

Look at a big bank or a big corporation's accounting systems, they'll pay millions just for the hot standby mainframes or minicomputers that, for most of them, would never be required.

solid_fuel 6 hours ago
> Bigger operations like GitHub will try to do the math on what an outage costs vs what better reliability costs, and optimize accordingly.

Used to, but it feels like there is no corporate responsibility in this country anymore. These monopolies have gotten so large that they don't feel any impact from these issues. Microsoft is huge and doesn't really have large competitors. Google and Apple aren't really competing in the source code hosting space in the same way GitHub is.

collingreen 4 hours ago
> Take the number of vehicles in the field, A, multiply it by the probable rate of failure, B, then multiply it by the result of the average out of court settlement, C. A times B times C equals X. If X is less than the cost of a recall, we don't do one.

https://youtu.be/SiB8GVMNJkE

Jenk 8 hours ago
I've worked at many big banks and corporations. They are all held together with the proverbial sticky tape, bubblegum, and hope.

They do have multiple layers of redundancy, and thus the big budgets, but they won't be kept hot, or there will be some critical flaws that all of the engineers know about but haven't been given permission/funding to fix, and the engineers are so badly managed by the firm that they dgaf either and secretly want the thing to burn.

There will be sustained periods of downtime if their primary system blips.

They will all still be dependent on some hyper-critical system that nobody really understands; the last change was introduced in 1988 and it (probably) requires a terminal emulator to operate.

stinkbeetle 8 hours ago
I've worked on software used by these and have been called in to help support from time to time. One customer which is a top single digit public company by market cap (they may have been #1 at the time, a few years ago) had their SAP systems go down once every few days. This wasn't causing a real monetary problem for them because their hot standby took over.

They weren't using mainframes, just "big iron" servers, but each one would have been north of $5 million for the box alone, I guess on a 5ish year replacement schedule. Then there's all the networking, storage, licensing, support, and internal administration costs for it which would easily cost that much again.

Now people will say SAP systems are made entirely of duct tape and bubblegum. But it all worked. This system ran all their sales/purchasing sites and portals and was doing a million dollars every couple of minutes, so that all paid for itself many times over during the course of that bug. Cold standby would not have cut it. Especially since these big systems take many minutes to boot and HANA takes even longer to load from storage.

lopatin 9 hours ago
I agree that it's all money.

That's why it's always DNS right?

> No one wants to pay for resilience/redundancy

These companies do take it seriously, on the software side, but when it comes to configurations, what are you going to do:

Either play it by ear, or literally double your cloud costs for a true, real prod-parallel to mitigate that risk. It looks like even the most critical and prestigious companies in the world are doing the former.

macintux 7 hours ago
> Either play it by ear, or literally double your cloud costs for a true, real prod-parallel to mitigate that risk.

There's also the problem that doubling your cloud footprint to reduce the risk of a single point of failure introduces new risks: more configuration to break, new modes of failure when both infrastructures are accidentally live and processing traffic, etc.

Back when companies typically ran their own datacenters (or otherwise heavily relied on physical devices), I was very skeptical about redundant switches, fearing the redundant hardware would cause more problems than it solved.

paulddraper 2 hours ago
Complexity breeds bugs.

Which is why the “art” of engineering is reducing complexity while retaining functionality.

ForHackernews 9 hours ago
Why should they? Honestly most of what we do simply does not matter that much. 99.9% uptime is fine in 99.999% of cases.
porridgeraisin 7 hours ago
This is true. But unfortunately the exact same process is used even for critical stuff (the crowdstrike thing for example). Maybe there needs to be a separate swe process for those things as well, just like there is for aviation. This means not using the same dev tooling, which is a lot of effort.
roxolotl 7 hours ago
To agree with the comments it seems likely it's money which has begun to result in a slow "un-learning how to build systems that stay up even when the inevitable bug or failure shows up."
suddenlybananas 10 hours ago
To be deliberately provocative, LLMs are being more and more widely used.
zdragnar 7 hours ago
Word on the street is github was already a giant mess before the rise of LLMs, and it has not improved with the move to MS.
dsagent 7 hours ago
They are also in the process of moving most of the infra from on-prem to Azure. I'm sure we'll see more issues over the next couple of months.

https://thenewstack.io/github-will-prioritize-migrating-to-a...

Tadpole9181 3 hours ago
To be deliberately provocative, so is offshoring work.
blibble 9 hours ago
imagine what it'll be like in 10 years time

Microsoft: the film Idiocracy was not supposed to be a manual

mandus 11 hours ago
Good thing git was designed as a decentralized revision control system, so you don’t really need GitHub. It’s just a nice convenience
jimbokun 11 hours ago
As long as you didn't go all in on GitHub Actions. Like my company has.
esafak 11 hours ago
Then your CI host is your weak point. How many companies have multi-cloud or multi-region CI?
IshKebab 11 hours ago
Do you think you'd get better uptime with your own solution? I doubt it. It would just be at a different time.
wavemode 10 hours ago
Uptime is much, much easier at low scale than at high scale.

The reason for buying centralized cloud solutions is not uptime, it's to save the headache of developing and maintaining the thing.

manquer 6 hours ago
It is easier until things go down.

Meaning the cloud may go down more frequently than small-scale self-deployments; however, downtimes are on average much shorter on the cloud. A lot of money is at stake for cloud providers, so GitHub et al. have the resources to throw at fixing a problem, compared to you or me when self-hosting.

On the other hand, when things go down self-hosted, it is far more difficult or expensive to have on-call engineers who can actually restore services quickly.

The skill to understand and fix a problem is scarce, so it takes semi-skilled talent longer to do so, even though the failure modes are simpler (but not simple).

The skill needed to set up something locally that works is vastly different from the skill needed to make it work reliably. Talent for the latter is scarce to find or retain.

tyre 10 hours ago
My reason for centralized cloud solutions is also uptime.

Multi-AZ RDS is 100% higher availability than me managing something.

wavemode 10 hours ago
Well, just a few weeks ago we weren't able to connect to RDS for several hours. That's way more downtime than we ever had at the company I worked for 10 years ago, where the DB was just running on a computer in the basement.

Anecdotal, but ¯\_(ツ)_/¯

sshine 8 hours ago
An anecdote that repeats.

Most software doesn’t need to be distributed. But it’s the growth paradigm where we build everything on principles that can scale to world-wide low-latency accessibility.

A UNIX pipe gets replaced with a $1200/mo. maximum IOPS RDS channel, bandwidth not included in price. Vendor lock-in guaranteed.

jakewins 10 hours ago
“Your own solution” should be that CI isn’t doing anything you can’t do on developer machines. CI is a convenience that runs your Make or Bazel or Just or whatever you prefer builds, that your production systems work fine without.

I’ve seen that work first hand to keep critical stuff deployable through several CI outages, and also has the upside of making it trivial to debug “CI issues”, since it’s trivial to run the same target locally
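
As a rough sketch of the shape this takes (the Make targets and the deploy script here are made up, not anything specific):

    # everything CI knows how to do lives in the repo itself
    make test      # unit/integration tests
    make build     # produce the release artifact

    # the CI job is just a thin wrapper that runs the same targets,
    # so during a CI outage any developer machine can still do:
    make test && make build && ./scripts/deploy.sh

The point is that CI adds scheduling, caching, and gatekeeping, not capability.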

IshKebab 19 minutes ago
> should be that CI isn’t doing anything you can’t do on developer machines

You should aim for this but there are some things that CI can do that you can't do on your own machine, for example running jobs on multiple operating systems/architectures. You also need to use CI to block PRs from merging until it passes, and for merge queues/trains to prevent races.

CGamesPlay 7 hours ago
Yes, this, but it’s a little more nuanced because of secrets. Giving every employee access to the production deploy key isn’t exactly great OpSec.
deathanatos 9 hours ago
Yes. I've quite literally run a self-hosted CI/CD solution, and yes, in terms of total availability, I believe we outperformed GHA when we did so.

We moved to GHA b/c nobody ever got fired ^W^W^W^W leadership thought eng running CI was not a good use of eng time. (Without much question into how much time was actually spent on it… which was pretty close to none. Self-hosted stuff has high initial cost for the setup … and then just kinda runs.)

Ironically, one of our self-hosted CI outages was caused by Azure — we have to get VMs from somewhere, and Azure … simply ran out. We had to swap to a different AZ to merely get compute.

The big upside to a self-hosted solution is that when stuff breaks, you can hold someone over the fire. (Above, that would be me, unfortunately.) With Github? Nobody really cares unless it is so big, and so severe, that they're more or less forced to, and even then, the response is usually lackluster.

tcoff91 11 hours ago
Compared to 2025 github yeah I do think most self-hosted CI systems would be more available. Github goes down weekly lately.
Aperocky 10 hours ago
Aren't they halting all work to migrate to Azure? That does not sound like an easy thing to do, and it feels like an easy way to cause unexpected problems.
macintux 7 hours ago
I recall the Hotmail acquisition and the failed attempts to migrate the service to Windows servers.
drykjdryj 6 hours ago
Yes, this is not the first time GitHub has tried to migrate to Azure. It's like the fourth time or something.
prescriptivist 10 hours ago
It's fairly straightforward to build resilient, affordable and scalable pipelines with DAG orchestrators like tekton running in kubernetes. Tekton in particular has the benefit of being low level enough that it can just be plugged into the CI tool above it (jenkins, argo, github actions, whatever) and is relatively portable.
davidsainez 11 hours ago
Doesn’t have to be an in-house system, just basic redundancy is fine, e.g. a simple hook that pushes to both GitHub and GitLab.
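
A minimal version of that, without any hook at all, uses git's support for multiple push URLs on one remote (the repo paths here are placeholders):

    # keep fetching from GitHub as usual, but push to both hosts
    git remote set-url --add --push origin git@github.com:org/repo.git
    git remote set-url --add --push origin git@gitlab.com:org/repo.git

    # a single push now updates both mirrors
    git push origin main

Note that adding an explicit push URL means the fetch URL is no longer used for pushes, which is why the GitHub URL is re-added explicitly.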
nightski 10 hours ago
I mean yes. We've hosted internal apps that have four nines reliability for over a decade without much trouble. It depends on your scale of course, but for a small team it's pretty easy. I'd argue it is easier than it has ever been because now you have open source software that is containerized and trivial to spin up/maintain.

The downtime we do have each year is typically also on our terms, not in the middle of a work day or at a critical moment.

__MatrixMan__ 11 hours ago
This escalator is temporarily stairs, sorry for the convenience.
Akronymus 10 hours ago
Tbh, I personally don't trust a stopped escalator. Some of the videos of brake failures on them scared me off of ever going on them.
collingreen 10 hours ago
You've ruined something for me. My adult side is grateful but the rest of me is throwing a tantrum right now. I hope you're happy with what you've done.
rvnx 10 hours ago
I read a book about elevator accidents; don't.
yjftsjthsd-h 9 hours ago
elevator accidents or escalator accidents?
rvnx 9 hours ago
elevators. for escalators, make sure not to watch videos of people falling in "the hole".
Akronymus 9 hours ago
I am genuinely sorry about that. And no, I am not happy about what I've done.
fishpen0 9 hours ago
Not really comparable at any compliance or security oriented business. You can't just zip the thing up and sftp it over to the server. All the zany supply chain security stuff needs to happen in CI and not be done by a human or we fail our dozens of audits
__MatrixMan__ 9 hours ago
Why is it that we trust those zany processes more than each other again? Seems like a good place to inject vulnerabilities to me...
cyberax 2 hours ago
Hi! My name is Jia Tan. Here's a nice binary that I compiled for you!
lopatin 11 hours ago
The issue is that GitHub is down, not that git is down.
drob518 6 hours ago
Aren’t they the same thing? /sarc
ElijahLynn 11 hours ago
You just lose the "hub" of connecting others and providing a way to collaborate with others with rich discussions.
parliament32 11 hours ago
All of those sound achievable by email, which, coincidentally, is also decentralized.
Aurornis 11 hours ago
Some of my open source work is done on mailing lists through e-mail

It's more work and slower. I'm convinced half of the reason they keep it that way is because the barrier to entry is higher and it scares contributors away.

octoberfranklin 6 hours ago
Well it does prevent brigading.
drykjdryj 6 hours ago
Email at a company is very much not decentralized. Most use Microsoft 365, also hosted in Azure, i.e. the same cloud GitHub is trying to move its stuff into.
awesome_dude 11 hours ago
Wait, email is decentralised?

You mean, assuming everyone in the conversation is using different email providers. (ie. Not the company wide one, and not gmail... I think that covers 90% of all email accounts in the company...)

paulddraper 2 hours ago
For sure.

You can commit, branch, tag, merge, etc and be just fine.

Now, if you want to share that work, you have to push.

PhilippGille 1 hour ago
You can push to any other Git server during a GitHub outage to still share work, trigger a CI job, deploy etc, and later when GitHub is reachable again you push there too.

Yes, you lose some convenience (GitHub's pull request UI can't be used, for example), but you can temporarily use the other Git server's UI for that.

I think their point was that you're not fully locked in to GitHub. You have the repo locally and can mirror it on any Git remote.

Conscat 11 hours ago
I'm on HackerNews because I can't do my job right now.
brokenmachine 5 hours ago
I'm on HN because I don't want to do my job right now.
y42 11 hours ago
I work in the wrong time zone. Good night.
keybored 10 hours ago
I don’t use GitHub that much. I think the thing about “oh no you have centralized on GitHub” point is a bit exaggerated.[1] But generally, thinking beyond just pushing blobs to the Internet, “decentralization” as in software that lets you do everything that is Not Internet Related locally is just a great thing. So I can never understand people who scoff at Git being decentralized just because “um, actually you end up pushing to the same repository”.

It would be great to also have the continuous build and test and whatever else you “need” to keep the project going as local alternatives as well. Of course.

[1] Or maybe there is just that much downtime on GitHub now that it can’t be shrugged off

ramon156 11 hours ago
SSH also down
gertlex 11 hours ago
My pushing was failing for reasons I hadn't seen before. I then tried my sanity check of `ssh git@github.com` (I think I'm supposed to throw a -t flag there, but never care to), and that worked.

But yes ssh pushing was down, was my first clue.

My work laptop had just been rebooted (it froze...) and the CPU was pegged by security software doing a scan (insert :clown: emoji), so I just wandered over to HN and learned of the outage at that point :)

kragen 11 hours ago
SSH works fine for me. I'm using it right now. Just not to GitHub!
blueflow 11 hours ago
SSH is as decentralized as git - just push to your own server? No problem.
jimbokun 11 hours ago
Well sure but you can't get any collaborators commits that were only pushed to GitHub before it went down.

Well you can with some effort. But there's certainly some inconvenience.

stevage 11 hours ago
Curious whether you actually think this, or was it sarcasm?
0x457 11 hours ago
It was sarcasm, but git itself is a decentralized VCS. Technically speaking, every git checkout is a full repo in itself. GitHub doesn't stop me from having the entire repo history up to the last pull, and I can still push either to the company backup server or to my coworker directly.

However, since we use github.com for more than just git hosting, it is a SPOF in most cases, and we treat it as a snow day.

stevage 9 hours ago
Yep, agreed - Issues being down would be a bit of a killer.
SteveNuts 11 hours ago
I have a serious question, not trying to start a flame war.

A. Are these major issues with cloud/SaaS tools becoming more common, or is it just that they get a lot more coverage now? It seems like we see major issues across AWS, GCP, Azure, Github, etc. at least monthly now and I don't remember that being the case in the past.

B. If it's becoming more common, what are the reasons? I can think of a few, but I don't know the answer, so if anyone in-the-know has insight I'd appreciate it.

Operations budget cuts/layoffs? Replacing critical components/workflows with AI? Just overall growing pains, where a service has outgrown what it was engineered for?

Thanks

wnevets 11 hours ago
> A. Are these major issues with cloud/SaaS tools becoming more common, or is it just that they get a lot more coverage now? It seems like we see major issues across AWS, GCP, Azure, Github, etc. at least monthly now and I don't remember that being the case in the past.

FWIW Microsoft is convinced moving Github to Azure will fix these outages

Lammy 11 hours ago
codethief 10 hours ago
From the second link:

> In 2002, the amusement continued when a network security outfit discovered an internal document server wide open to the public internet in Microsoft's supposedly "private" network, and found, among other things, a whitepaper[0] written by the hotmail migration team explaining why unix is superior to windows.

Hahaha, that whitepaper is pure gold!

[0]: https://web.archive.org/web/20040401182755/http://www.securi...

hotsauceror 5 hours ago
And 25 years later, a significant portion of the issues in that whitepaper remain unresolved. They were still shitting on people like Jeffrey Snover who were making attempts to provide more scalable management technologies. Such a clown show.
einsteinx2 11 hours ago
The same Azure that just had a major outage this month?
tombert 6 hours ago
Microsoft is a company that hasn't even figured out how to get system updating working consistently on their premier operating system in three decades. It seems unlikely to me that somehow moving to Azure is going to make anything more stable.
bovermyer 11 hours ago
Microsoft is also convinced that its works are a net benefit for humanity, so I would take that with a grain of salt.
andrewstuart2 11 hours ago
I think it would be pretty hard to argue against that point of view, at least thus far. If DOS/Windows hadn't become the dominant OS, something else would have, and a whole generation of engineers cut their teeth on their parents' Windows PCs.
cdaringe 11 hours ago
There are some pretty zany alternative realities in the Multiverses I’ve visited. Xerox Parc never went under and developed computing as a much more accessible commodity. Another, Bell labs invented a whole category of analog computers that’s supplanted our universe’s digital computing era. There’s one where IBM goes directly to super computers in the 80s. While undoubtedly Microsoft did deliver for many of us, I am a hesitant to say that that was the only path. Hell, Steve Jobs existed in the background for a long while there!
bilegeek 10 hours ago
I wish things had gone differently too, but a couple of nitpicks:

1.) It's already a miracle Xerox PARC escaped their parent company's management for as long as they did.

2.) IBM was playing catch-up on the supercomputer front since the CDC 6400 in 1964. Arguably, they did finally catch up in the mid-late 80's with the 3090.

noir_lord 11 hours ago
AT&T sold Unix machines (actually a rebadged Olivetti for the hardware) and Microsoft had Xenix when Windows wasn't a thing.

So many weird paths we could have gone down it's almost strange Microsoft won.

andrewstuart2 8 hours ago
Yeah, I'm absolutely not saying it was the only path. It's just the path that happened. If not MS maybe it would have been Unix and something else. Either way most everyone today uses UX based on Xerox Parc's which was generously borrowed by, at this point, pretty much everyone.
saghm 2 hours ago
I'm not sure I understand this logic. You're saying that the gap would have been filled even if their product didn't exist, which means that the net benefit isn't that the product exists. How are you concluding that whatever we might have gotten instead would have been worse?
tombert 6 hours ago
If Microsoft hadn't tried to actively kill all its competition then there's a good chance that we'd have a much better internet. Microsoft is bigger than just an operating system, they're a whole corporation.

Instead they actively tried to murder open standards [1] that they viewed as competitive and normalized the antitrust nightmare that we have now.

I think by nearly any measure, Microsoft is not a net good. They didn't invent the operating system, there were lots of operating systems that came out in the 80's and 90's, many of which were better than Windows, that didn't have the horrible anticompetitive baggage attached to them.

[1] https://en.wikipedia.org/wiki/Embrace,_extend,_and_extinguis...

switchbak 11 hours ago
DOS and Windows kept computing behind for a VERY long time, not sure what you're trying to argue here?
krabizzwainch 11 hours ago
What’s funny is that we were some bad timing away from IBM giving the DOS money to Gary Kildall and we’d all be working with CP/M derivatives!

Gary was on a flight when IBM called up Digital Research looking for an OS for the IBM PC. Gary’s wife, Dorothy, wouldn’t sign an NDA without it going through Gary, and supposedly they never got negotiations back on track.

goda90 11 hours ago
What if that alternate someone had been better than DOS/Windows and then engineers cut their teeth on that instead?
andrewstuart2 8 hours ago
Then my comment may have been about a different OS. Or I might never have been born. Who knows?
bovermyer 11 hours ago
I'm not convinced of your first point. Just because something seems difficult to avoid given the current context does not mean it was the only path available.

Your second point is a little disingenuous. Yes, Microsoft and Windows have been wildly successful from a cultural adoption standpoint. But that's not the point I was trying to argue.

andrewstuart2 8 hours ago
My first comment is simply pointing out that there's always a #1 in anything you can rank. Windows happened to be what won. And I learned how to use a computer on Windows. Do I use it now? No. But I learned on it as did most people whose parents wanted a computer.
tombert 6 hours ago
The comment you were replying to was about Microsoft.

Even if Windows weren't a dogshit product, which it is, Microsoft is a lot more than just an operating system. In the 90's they actively tried to sabotage any competition in the web space, and held web standards back by refusing to make Internet Explorer actually work.

hobs 8 hours ago
And how does it follow that microsoft is the good guy in a future where we did it with some other operating system? You could argue that their system was so terrible that its displacement of other options harmed us all with the same level of evidence.
junon 11 hours ago
Been on GitHub for a long time. It feels like outages are happening more often. It used to be yearly, if at all, that GitHub was noticeably impacted. Now it's monthly, and recently, seemingly weekly.
0x457 10 hours ago
Definitely not how I remember it. First, I remember seeing the unicorn page multiple times a day some weeks. There were also times when webhook delivery didn't work, so CircleCI users couldn't kick off any builds.

What's changed is how many GitHub services can be having issues.

chadac 11 hours ago
I suspect that the Azure migration is influencing this one. Just a bunch of legacy stuff being moved around along with Azure not really being the most reliable on top... I can't imagine it's easy.
zackify 11 hours ago
There have been five incidents between Actions and push/pull issues just this month. It is happening more often.
cmrdporcupine 11 hours ago
In the early days of GitHub (like before 2010) outages were extremely common.
bovermyer 11 hours ago
I agree, for what that's worth.

However, this is an unexpected bell curve. I wonder if GitHub is seeing more frequent adversarial action lately. Alternatively, perhaps there is a premature reliance on new technology at play.

cmrdporcupine 11 hours ago
I pulled my project off github and onto codeberg a couple months ago but this outage still screws me over because I have a Cargo.toml w/ git dependency into github.

I was trying to do a 1.0 release today. Codeberg went down for "10 minutes maintenance" multiple times while I was running my CI actions.

And then github went down.

Cursed.

netghost 11 hours ago
I think it was generally news when there were "up-ages" and the site was actually up. Similar with Twitter, for that matter.
junon 11 hours ago
Not from my recollection. Not like this. BitBucket on the other hand had a several day outage at one point. That one I do recall.
sampullman 11 hours ago
I remember periods of time when GitHub was down every few weeks, my impression is that it's become more stable over the years.
kkarpkkarp 11 hours ago
> If it's becoming more common, what are the reasons?

Someone answered this morning, during the Cloudflare outage, that it's AI vibe coding, and I tend to think there is something true in this. At some point there might be some tiny grain of AI involved which starts the avalanche that ends like this.

AIorNot 11 hours ago
Well, layoffs across tech probably haven't helped.

https://techrights.org/n/2025/08/12/Microsoft_Can_Now_Stop_R...

Ever since Musk greenlighted firing people again, CEOs can't wait to pull the trigger.

smsm42 11 hours ago
It certainly feels that way, though it may be an instance of availability bias. Not sure what's causing it - maybe extra load from AI bots (certainly a lot of smaller sites complain about it, maybe major providers feel the pain too), maybe some kind of general quality erosion... It's certainly something that is waiting for serious research.
pm90 11 hours ago
GitHub isn't in the same reliability class as the hyperscalers or Cloudflare; it's comically bad now, to the point that at a previous job we invested in building a read-only cache layer specifically to prevent GitHub outages from bringing our system down.
tingletech 10 hours ago
Years ago on Hacker News I saw a link about probability describing a statistical technique one could use to answer whether a specific type of event was becoming more common or not. Maybe related to the birthday paradox? The gist that I remember is that sometimes a rare event will seem to be happening more often, when in reality there is some cognitive bias that makes it non-intuitive to make that call without running the numbers. I think it was a blog post that went through a few different examples, and maybe only one of them was actually happening more often.
ambicapter 10 hours ago
If the events are independent, you could use a binomial distribution. Not sure if you can consider these kinds of events to be independent, though.
grayhatter 10 hours ago
End of year, pre-holiday break, code/project completion for perf review rush.

Be good to your stability/reliability engineers for the next few months... it's downtime season!

Wowfunhappy 11 hours ago
I’m more interested in how this and the Cloudflare outage occurred on the same day. Is it really just a coincidence?
averageRoyalty 11 hours ago
I suspect there is more tech out there. 20 years ago we didn't have smartphones. 10 years ago, 20mbit on mobile was a good connection. Gigabit is common now, infrastructure no longer has the hurdles it used to, AI makes coding and design much easier, phones are ubiquitous and usage of them at all times (in the movies, out at dinner, driving) has become super normalised.

I suspect (although have not researched) that global traffic is up, by throughput but also by session count.

This contributes to a lot more awareness. Slack being down wasn't impactful when most tech companies didn't use Slack. An AWS outage was less relevant when the 10 apps (used to be websites) you use most didn't rely on a single AZ in AWS or you were on your phone less.

I think as a society it just has more impact than it used to.

dlenski 10 hours ago
> Are these major issues with cloud/SaaS tools becoming more common, or is it just that they get a lot more coverage now?

I think that "more coverage" is part of it, but also "more centralization." More and more of the web is centralized around a tiny number of cloud providers, because it's just extremely time-intensive and cost-prohibitive for all but the largest and most specialized companies to run their own datacenters and servers.

Three specific examples: Netflix and Dropbox do run their own datacenters and servers; Strava runs on AWS.

> If it's becoming more common, what are the reasons? I can think of a few, but I don't know the answer, so if anyone in-the-know has insight I'd appreciate it.

I worked at AWS from 2020-2024, and saw several of these outages so I guess I'm "in the know."

My somewhat-cynical take is that a lot of these services have grown enormously in complexity, far outstripping the ability of their staff to understand them or maintain them:

- The OG developers of most of these cloud services have moved on. Knowledge transfer within AWS is generally very poor, because it's not incentivized, and has gotten worse due to remote work and geographic dispersion of service teams.

- Managers at AWS are heavily incentivized to develop "new features" and not to improve the reliability, or even security, of their existing offerings. (I discovered numerous security vulnerabilities in the very-well-known service that I worked for, and was regularly punished-rather-than-rewarded for trying to get attention and resources on this. It was a big part of what drove me to leave Amazon. I'm still sitting on a big pile of zero-day vulnerabilities in ______ and ______.)

- Cloud services in most of the world are basically a 3-way oligopoly between AWS, Microsoft/Azure, and Google. The costs of switching from one provider to another are often ENORMOUS due to a zillion fiddly little differences and behavior quirks ("bugs"). It's not apparent to laypeople — or even to me — that any of these providers are much more or less reliable than the others.

myth_drannon 11 hours ago
Looking around, I noticed that many senior, experienced individuals were laid off, sometimes replaced by juniors/contractors without institutional knowledge or experience. That's especially evident in ops/support, where the management believes those departments should have a smaller budget.
sunshine-o 11 hours ago
1/ Most of the big corporations moved to big cloud providers in the last 5 years. Most of them started 10 years ago but it really accelerated in the last 5 years. So there is for sure more weight and complexity on cloud providers, and more impact when something goes wrong.

2/ Then we cannot expect big tech to stay as sharp as in the 2000s and 2010s.

There was a time banks had all the smart people, then the telco had them, etc. But people get older, too comfortable, layers of bad incentive and politics accumulate and you just become a dysfunctional big mess.

swed420 10 hours ago
> B. If it's becoming more common, what are the reasons?

Among other mentioned factors like AI and layoffs: mass brain damage caused by never-ending COVID re-infections.

Since vaccines don't prevent transmission, and each re-infection increases the chances of long COVID complications, the only real protection right now is wearing a proper respirator everywhere you go, and basically nobody is doing that anymore.

There are tons of studies to back this line of reasoning.

__MatrixMan__ 11 hours ago
I think it's cancer, and it's getting worse.
xmprt 11 hours ago
One possibility is increased monitoring. In the past, issues that happened weren't reported because they went under the radar. Whereas now, those same issues which only impact a small percentage of users would still result in a status update and postmortem. But take this with a grain of salt because it's just a theory and doesn't reflect any actual data.

A lot of people are pointing to AI vibe coding as the cause, but I think more often than not, incidents happen due to poor maintenance of legacy code. But I guess this may be changing soon as AI written code starts to become "legacy" faster than regular code.

Kostic 11 hours ago
At least with GitHub it's hard to hide when you get "no healthy upstream" on a git push.
chrsstrm 11 hours ago
I thought I was going crazy when I couldn't push changes but now it seems it's time to just call it for the day. Back at it tomorrow.
Mossly 11 hours ago
Seeing auth succeed but push fail was an exercise in hair pulling.
curioussquirrel 11 hours ago
Same, even started adding new ssh keys to no avail... (I was getting some nondescript user error first, then unhealthy upstream)
chrsstrm 11 hours ago
Would love to see a global counter for the number of times ‘ssh -T git@github.com’ was invoked.
peciulevicius 11 hours ago
same, i've started pulling my hair out, was about to nuke my setup and set it up all from scratch
keepamovin 11 hours ago
lol same. Hilarious when this shit goes down that we all rely on like running water. I'm assuming GitHub was hacked by the NSA because someone uploaded "the UFO files" or sth.
_jab 11 hours ago
GitHub is pretty easily the most unreliable service I've used in the past five years. Is GitLab better in this regard? At this point my trust in GitHub is essentially zero - they don't deserve my money any longer.
jakub_g 10 hours ago
My company self-hosts GitLab. Gitaly (the git server) is a weekly source of incidents, it doesn't scale well (CPU/memory spikes which end up taking down the web interface and API). However we have pretty big monorepos with hundreds of daily committers, probably not very representative.
ecshafer 11 hours ago
We self-host GitLab, so it's very stable. But GitLab is also kind of enterprise software. It hits every feature checkbox, but the features aren't well integrated, and they are kind of halfway done. I don't think it's as smooth an experience as GitHub personally, or as feature rich. But GitLab can self-host your project repos, CI/CD, issues, wikis, etc., and it does it at least okay.
input_sh 10 hours ago
I would argue GitLab CI/CD is miles ahead of the dumpster fire that is GitHub Actions. Also the homepage is actually useful, unlike GitHub's.
tottenhm 10 hours ago
Frequently use both `github.com` and self-hosted Gitlab. IMHO, it's just... different.

Self-hosted Gitlab periodically blocks access for auto-upgrades. Github.com upgrades are usually invisible.

Github.com is periodically hit with the broad/systemic cloud-outage. Self-hosted Gitlab is more decentralized infra, so you don't have the systemic outages.

With self-hosted GitLab, you'll likely have to deal with rude bots on your own. Github.com has an ops team that deals with the rude bots.

I'm sure the list goes on. (shrug)

noosphr 11 hours ago
You can make it as reliable as you want by hosting it on prem.
jakub_g 10 hours ago
> as reliable as you want

We self-host GitLab but the team owning it is having a hard time scaling it. From my understanding talking to them, the design of gitaly makes it very hard to scale beyond a certain repo size and # of pushes per day (for reference: our repos are GBs in size, ~1M commits, hundreds of merges per day)

themafia 11 hours ago
Flashbacks to me pushing hard for GitLab self hosting a few months ago. The rest of the team did not feel the lift was worth it.

I utterly hate being at the mercy of a third party with an after thought of a "status page" to stare at.

tapoxi 11 hours ago
Another GitLab self-hosting user here, we've run it on Kubernetes for 6 years. It's never gone down for us, maybe an hour of downtime yearly as we upgrade Postgres to a new version.
globular-toast 1 hour ago
Same. Also running on-prem for 6+ years with no issues. GitLab CI is WAY better than GitHub too.
yoyohello13 11 hours ago
We've been self hosting GitLab for 5 years and it's the most reliable service in our organization. We haven't had a single outage. We use Gitlab CI and security scanning extensively.
markbnj 11 hours ago
Ditto, self-hosted for over eight years at my last job. SCM server and 2-4 runners depending on what we needed. Very impressive stability and when we had to upgrade their "upgrade path" tooling was a huge help.
loloquwowndueo 11 hours ago
Forgejo, my dudes.
esafak 11 hours ago
Do we know its uptime statistics?
loloquwowndueo 8 hours ago
What do you mean. Forgejo is self-hosted, uptime is up to you.
esafak 7 hours ago
Mea culpa. I forgot, perhaps thinking of codeberg.
geoffbp 8 hours ago
GitLab has regular issues (we use SaaS) and the support isn’t great. They acknowledge problems, but the same ones happen again and again. It’s very hard to get anything on their roadmap etc.
JonChesterfield 11 hours ago
Couldn't log into it this morning when cloudflare was down so there's that.
cactusfrog 10 hours ago
There’s this Gitlab incident https://www.youtube.com/watch?v=tLdRBsuvVKc
cjonas 12 hours ago
I didn't really want to work today anyways. First cloudflare, now this... Seems like a sign to get some fresh air
dlahoda 11 hours ago
We depend too much on US-centralized tech.

We need more sovereignty and decentralization.

worldsavior 11 hours ago
How is this related to them being located in the USA?
lorenzleutgeb 11 hours ago
Please check out radicle.dev, helping hands always welcome!
letrix 11 hours ago
> Repositories are replicated across peers in a decentralized manner

You lost me there

hungariantoast 10 hours ago
"Replicated across peers in a decentralized manner" could just as easily be written about regular Git. Radicle just seems to add a peer-to-peer protocol on top that makes it less annoying to distribute a repository.

So I don't get why the project has "lost you", but I also suspect you're the kind of person any project could readily afford to lose as a user.

lorenzleutgeb 9 hours ago
What this is trying to say:

- "peers": participants in the network are peers, i.e. both ends of a connection run the same code, in contrast to a client-and-server architecture, where both sides often run pretty different code. To exemplify: the code GitHub's servers run is very different from the code that your IDE with Git integration runs.

- "replicated across peers": the Git objects in the repository, and "social artifacts" like discussions in issues and revisions in patches, are copied to other peers. This copy is kept up to date by doing Git fetches for you in the background.

- "in a decentralized manner": every peer/node in the network gets to locally decide which repositories they intend to replicate, i.e. you can talk to your friends and replicate their cool projects. And when you first initialize a repository, you can decide to make it public (which allows everyone to replicate it) or private (which allows a select list of nodes identified by their public key to replicate it). There's no centralized authority which may tell you which repositories to replicate or not.

I do realize that we're trying to pack quite a bit of information in this sentence/tagline. I think it's reasonably well phrased, but for the uninitiated might require some "unpacking" on their end.

If we "lost you" on that tagline, and my explanation or that of hungariantoast (which is correct as well) helped you understand, I would appreciate if you could criticize more constructively and suggest a better way to introduce these features in a similarly dense tagline, or say what else you would think is a meaningful but short explanation of the project. If you don't care to do that, that's okay, but Radicle won't be able to improve just based on "you lost me there".

In case you actually understood the sentence just fine and we "lost you" for some other reason, I would appreciate if you could elaborate on the reason.

CivBase 11 hours ago
The sad part is both the web and git were developed as decentralized technologies, both of which we foolishly centralized later.

The underlying tech is still decentralized, but what good does that do when we've made everything that uses it dependent on a few centralized services?

kennysmoothx 12 hours ago
FYI in an emergency you can edit files directly on Github without the need to use git.

Edit: ugh... if you rely on GH Actions for workflows, though, actions/checkout@v4 is also currently experiencing the git issues, so no dice if you depend on that.

ruuda 11 hours ago
FYI in an emergency you can `git push` to and `git pull` from any SSH-capable host without the need to use GitHub.
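
For anyone who hasn't tried it, a minimal sketch (host and paths are placeholders):

    # on any machine you can SSH into, create a bare repo to push into
    ssh me@myserver.example 'git init --bare /srv/git/myrepo.git'

    # locally, add it as an extra remote and push your branch there
    git remote add fallback me@myserver.example:/srv/git/myrepo.git
    git push fallback my-branch

    # collaborators pull the same branch straight from that host
    git pull fallback my-branch

Once GitHub is back, you push the same commits to origin and nothing is lost.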
cluckindan 11 hours ago
FYI in an emergency you can SSH to your server and edit files and the DB directly.

Where is your god now, proponents of immutable filesystems?!

BadBadJellyBean 11 hours ago
I love when people do that because they always say "I will push the fix to git later". They never do and when we deploy a version from git things break. Good times.

I started packing things into docker containers because of that. Makes it a bit more of a hassle to change things in production.

noir_lord 11 hours ago
Depends on the org; at the big ones I've worked for, regular devs, even seniors, don't have anything like the level of access to be able to pull a stunt like that.

At the largest place I did have prod creds for everything, because sometimes they are necessary and I had the seniority (sometimes you do need them in an "oh crap" scenario).

They were all set up on a second account on my work Mac, which had a "Danger, Will Robinson" wallpaper, because I know myself; it's far, far too easy to mentally fat-finger when you have two sets of creds.

egeozcan 11 hours ago
FYI in an emergency, you can buy a plane ticket and send someone to access the server directly.

I actually had the privilege of being sent to the server.

noir_lord 11 hours ago
Had a coworker have to drive across the country once to hit a power button (many years ago).

Because my suggestion that they have a spare ADSL connection for out-of-channel stuff was an unnecessary expense... 'til he broke the firewall, knocked a bunch of folks offline across a huge physical site, and locked himself out of everything.

The spare line got fitted the next month.

lenerdenator 11 hours ago
I'm actually getting "ERROR: no healthy upstream" on `git pull`.

They done borked it good.

avree 11 hours ago
If your remote is set to a git@github.com remote, it won't work. They're just pointing out that you could use git to set origin/your remote to a different ssh capable server, and push/pull through that.
rco8786 11 hours ago
Yup, we were just trying to hotfix prod and ran into this. What is happening to the internet lately.
shrikant 11 hours ago
We're not using Github Actions, but CircleCI is also failing git operations on Github (it doesn't recognise our SSH keys).
vielite1310 11 hours ago
True that, and this time GitHub's AI actually has a useful answer: check githubstatus.com.
lopatin 11 hours ago
Can you create a branch through GitHub UI?
hobofan 11 hours ago
Yes. Just start editing a file and when you hit the "commit changes" button it will ask you what name to use for the branch.
captainkrtek 9 hours ago
Reflecting on the last decade, with my career spanning big tech and startups, I've seen a common arc:

Small and scrappy startup -> taking on bigger customers for greater profits / ARR -> re-architecting for "enterprise" customers and resiliency / scale -> more idealism in engineering -> profit chasing -> product bloat -> good engineers leave -> replaced by other engineers -> failures expand.

This may be an acceptable lifecycle for individual companies as they each follow the destiny of chasing profits ultimately. Now picture it though for all the companies we've architected on top of (AWS, CloudFlare, GCP, etc.) Even within these larger organizations, they are comprised of multiple little businesses (eg: EC2 is its own business effectively - people wise, money wise)

Having worked at a $big_cloud_provider for 7 yrs, I saw this internally on a service level. What started as a foundational service grew in scale and complexity, was architected for resiliency, and then slowly eroded its engineering culture to chase profits. Fundamental services became skeletons of their former selves, all while holding up the internet.

There isn't a singular cause here, and I can't say I know what's best, but it's concerning as the internet becomes more centralized into a handful of players.

tldr: how much of one's architecture and resiliency is built on the trust of "well (AWS|GCP|CloudFlare) is too big to fail" or "they must be doing things really well"? The various providers are not all that different from other tech companies on the inside. Politics, pressure, profit seeking.

Esophagus4 8 hours ago
Well said. I definitely agree (you’re absolutely right!) that the product will get worse through that re-architecting for enterprise transition.

But the small product also would not be able to handle any real amount of growth as it was, because it was a mess of tech debt and security issues and manual one-off processes and fragile spaghetti code that only Jeff knows because he wrote it in a weekend, and now he’s gone.

So by definition, if a service is large enough to serve a zillion people, it is probably big and bloated and complex.

I’m not disagreeing with you, I liked your comment and I’m just rambling. I have worked with several startups and was surprised at how poorly their tech scaled (and how riddled with security issues they were) as we got into it.

Nothing will shine a flashlight on all the stress cracks of a system like large-scale growth on the web.

captainkrtek 8 hours ago
> So by definition, if a service is large enough to serve a zillion people, it is probably big and bloated and complex.

Totally agree with your take as well.

I think the unfortunate thing is that there can exist a "Goldilocks zone" to this, where the service is capable of serving a zillion people AND is well architected. Unfortunately it can't seem to last forever.

I saw this in my career. More product SKUs were developed, new features/services defined by non-technical PMs, MBAs entered the chat, sales became the new focus over availability, and the engineering culture that made this possible eroded day by day.

The years I worked in this "Goldilocks zone" I'd attribute to:

- strong technical leadership at the SVP+ level that strongly advocated for security, availability, then features (in that order).

- a strong operational culture. Incidents were exciting internally, post mortems shared at a company wide level, no matter how small.

- recognition for the engineers who chased ambulances and kept things running, beyond their normal job, this inspired others to follow in their footsteps.

grepfru_it 7 hours ago
There was a comment on another GitHub thread that I replied to. I got a response saying it’s absurd how unreliable Gh is when people depend on it for CI/CD. And I think this is the problem. At GitHub the developers think it’s only a problem because their ci/cd is failing. Oh no, we broke GitHub actions, the actions runners team is going to be mad at us! Instead of, oh no, we broke GitHub actions, half the world is down!

That larger view held only by a small sliver of employees is likely why reliability is not a concern. That leads to the every team for themselves mentality. “It’s not our problem, and we won’t make it our problem so we don’t get dinged at review time” (ok that is Microsoft attitude leaking)

Then there’s their entrenched status. Real talk, no one is leaving GitHub. So customers will suck it up and live with it while angry employees grumble on an online forum. I saw this same attitude in major companies like Verio and Verisign in the early 2000s. “Yeah we’re down but who else are you going to go to? Have a 20% discount since you complained. We will only be 1% less profitable this quarter due to it” The Kang and Kodos argument personified.

These views are my own and not related to my employer or anyone associated with me.

rurban 1 hour ago
So was this caused by the Cloudflare duplicate features file, causing 5xx errors internally? They said they shipped a fix, but no details yet
lol768 11 hours ago
> We are seeing failures for some git http operations and are investigating

It's not just HTTPS, I can't push via SSH either.

I'm not convinced it's just "some" operations either; every single one I've tried fails.

deathanatos 9 hours ago
I'm convinced the people who write status pages are incapable of escaping the phrasing "Some users may be experiencing problems". Too much attempting to save face by PR types, instead of just being transparent with information (… which is what would actually save face…)

And that's if you get a status page update at all.

olivia-banks 11 hours ago
A friend of mine was able to get through a few minutes ago, apparently. Everyone else I know is still fatal'ing.
shooker435 12 hours ago
https://www.githubstatus.com/incidents/5q7nmlxz30sk

it's up now (the incident, not the outage)

laurentiurad 11 hours ago
A lot of failures lately during the aI ReVoLuTiOn.
SOLAR_FIELDS 11 hours ago
GitHub has a long history of garbage reliability that long predates AI
personjerry 11 hours ago
Looks like Gemini 3 figured out the best way to save costs on its compute time was to shut down github!
JonChesterfield 9 hours ago
What's the local workaround for this?

Git is distributed, it should be possible to put something between our servers and github which pulls from github when it's running and otherwise serves whatever it used to have. A cache of some sort. I've found the five year old https://github.com/jonasmalacofilho/git-cache-http-server which is the same sort of idea.

I've run a git instance on a local machine which I pull from, where a cron job fetches from upstream into it, which solved the problem of cloning llvm over a slow connection, so it's doable on a per-repo basis.

I'd like to replace it globally though because CI looks like "pull from loads of different git repos" and setting it up once per-repo seems dreadful. Once per github/gitlab would be a big step forward.
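
One way to sketch the "global" version, assuming the cache box serves repos over HTTP at the same org/repo paths (e.g. via `git http-backend` or the project linked above), with a hypothetical `git-cache.internal` hostname:

    # on the cache box: mirror each upstream repo once...
    git clone --mirror https://github.com/example/repo.git /srv/git-cache/example/repo.git
    # ...and refresh it from cron, e.g. every 15 minutes:
    #   */15 * * * *  git -C /srv/git-cache/example/repo.git remote update --prune

    # on the CI machines: rewrite every github.com URL to the cache, once
    git config --global url."https://git-cache.internal/".insteadOf "https://github.com/"

The mirroring is still per-repo, but the client-side config is per-host, so CI jobs keep their normal github.com URLs and transparently hit the cache instead.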

arbol 12 hours ago
I'm also getting this. Cannot pull or push but can authenticate with SSH

    myrepo git:(fix/context-types-settings) gp
    ERROR: user:1234567:user
    fatal: Could not read from remote repository.

    myrepo git:(fix/context-types-settings) ssh -o ProxyCommand=none git@github.com
    PTY allocation request failed on channel 0
    Hi user! You've successfully authenticated, but GitHub does not provide shell access.
    Connection to github.com closed.
smashah 12 hours ago
same
bhouston 12 hours ago
I cannot push/pull to any repos. Scared me for a second, but of course I then checked here.
OptionOfT 11 hours ago
It is insane how many failures we've been getting lately, especially related to actions.

    * jobs not being picked up
    * jobs not being able to be cancelled
    * jobs running but showing up as failed
    * jobs showing up as failed but not running
    * jobs showing containers as pushed successfully to GitHub's registry, but then we get errors while pulling them
    * ID token failures (E_FAIL) and timeouts. 
I don't know if this is related to GitHub moving to Azure, or because they're allowing more AI generated code to pass through without proper reviews, or something else, but as a paying customer I am not happy.
manbitesdog 11 hours ago
Same! The current self-hosted runner gets hung every so often
veighnsche 11 hours ago
Probably because AI-generated reviews have made QA way worse.
sgreene570 12 hours ago
github has had a few of these as of late, starting to get old
baq 11 hours ago
Remember talking about the exact same thing with very similar wording sometime pre-COVID
cluckindan 11 hours ago
MSFT intentionally degrading operations to get everyone to move onto Azure… oh, wait, they just moved GitHub there, carry on my wayward son!
blasphemers 10 hours ago
GitHub hasn't been moved onto Azure yet; they just announced that it's their goal to move over in 2026.
JLCarveth 12 hours ago
The last outage was a whole 5 days ago https://news.ycombinator.com/item?id=45915731
jmclnx 12 hours ago
Didn't I hear GitHub is moving to Microsoft Azure? I wonder if these outages are related to the move.

Remember hotmail :)

bhouston 12 hours ago
Huh? What were they on before? The acquisition by MSFT was 7 years ago; they maintained their own infrastructure for that long?
JLCarveth 11 hours ago
The GitHub CEO did step down a few months ago, and they never named a successor. Could have something to do with the recent issues. https://news.ycombinator.com/item?id=44865560
silverwind 11 hours ago
Yes they are/were on their own hardware. The outages will only get worse with this move.
mendyberger 11 hours ago
After restarting my computer, reinstalling git, and getting almost ready to reinstall my OS, I find out it's not even my fault.
consumer451 11 hours ago
We live in a house of cards. I hope that eventually people in power realize this. However, their incentive structures do not seem to be a forcing function for that eventuality.

I have been thinking about this a lot lately. What would be a tweak that might improve this situation?

sznio 10 hours ago
Not exactly for this situation, but I've been thinking about distributed caching of web content.

Even if a website is down, someone somewhere most likely has it cached. Why can't I read it from their cache? If I'm trying to reach a static image file, why do I have to get it from the source?

I guess I want torrent DHT for the web.

consumer451 7 hours ago
That is genuinely interesting. But, let's put all "this nerd talk" into terms that someone in the average C-suite could understand.

How can C-suite stock RSU/comp/etc be tweaked to make them give a crap about this, or security?

---

Decades ago, I was a teenager and I realized that going to fancy hotel bars was really interesting. I looked old enough, and I was dressed well. This was in Seattle. I once overheard a low-level cellular company exec/engineer complain about how he had to climb a tower, and check the radiation levels (yes non-ionizing). But this was a low level exec, who had to take responsibility.

He joked about how while checking a building on cap hill, he waved his wand above his head, and when he heard the beeps... he noped tf out. He said that it sucked that he had to do that, and sign-off.

That is actually cool, and real engineering/responsibility at the executive level.

Can we please get more of that type of thing?

rewgs 23 minutes ago
I think that kind of domain knowledge and getting your hands dirty is more necessary when you're actually having to solve real problems that real people pay real money for -- money that can't be borrowed for free.

It's no coincidence that the clueless MBA who takes pride in knowing nothing about the business they're a part of proliferated during economic "spring time" -- low interest rates, genuine technological breakthroughs to capitalize on, early mover advantage, etc. When everyone is swimming in money, it's easier to get a slice without adequately proving why you deserve it.

Now we're in "winter." Interest rates are high, innovation is debatably slowing, and the previous early movers are having to prove their staying power.

All that to say: the bright side, I hope, of this pretty shitty time is that hopefully we don't _need_ to "put all this nerd talk into terms that someone in the average C-suite could understand," because hopefully the kinds of executives who are simultaneously building and running _tech companies_ and who are allergic to "nerd talk" will very simply fail to compete.

That's the free market (myth as it may often be in practice) at work -- those who are totally uninterested in the subject matter of their own companies aren't rewarded for their ignorance.

mistercheph 11 hours ago
Using p2p or self hosted, and accepting the temporary tradeoffs of no network effects.
keepamovin 11 hours ago
It's weird to think all of our data lives on physical servers (not "in the cloud") that are fallible and made and maintained by fallible humans, and could fail at any moment. So long to all the data! Good ol' byzantine backups.
silverwind 11 hours ago
It's not only HTTP, but also SSH.
dadof4 11 hours ago
Same for me, fatal: unable to access 'https://github.com/repository_example.git/': The requested URL returned error: 500
netsharc 11 hours ago
I remember a colleague setting up a CI/CD system (on an aaS obviously) depending on Docker, npm, and who knows what else... I thought "I wonder what % of time all those systems are actually up at the same time"
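
As a back-of-the-envelope answer (assuming the dependencies fail independently, which is optimistic given how much they share underlying providers): a pipeline that touches five external services that are each up 99.9% of the time finds all of them up simultaneously only about 0.999^5 ≈ 99.5% of the time, i.e. roughly 43 hours a year during which at least one of them is down.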
bstsb 11 hours ago
side effect that isn't immediately obvious: all raw.githubusercontent.com content responds with a "404: Not Found" response.

this has broken a few pipeline jobs for me, seems like they're underplaying this incident

pm90 11 hours ago
Yeah, something major is borked and they're unwilling to admit it. The status page initially claimed "https git operations are affected" when it was clear that SSH was too (it's been updated to reflect that now).
whinvik 11 hours ago
Haha, I don't know if it's a good test or not, but I could not figure out why git pull was failing, and Claude just went crazy trying so many random things.

Gemini 3 Pro, after 3 random things, announced GitHub was the issue.

olivia-banks 11 hours ago
This is incredibly annoying. I've been trying to fix a deployment action on GitHub for the past bit, so my entire workflow for today has been push, wait, check... push, wait, check... et cetera.
luca616 11 hours ago
You should really check out (pun intended) `act` https://github.com/nektos/act
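
Roughly, it runs your workflow jobs in local Docker containers, so you can iterate on an Action without pushing every change. Something like this (exact flags may vary between act versions, and "deploy" here is just a hypothetical job name):

    # list the jobs act can see under .github/workflows/
    act -l

    # run the jobs triggered by a push event
    act push

    # run a single job, with secrets loaded from a local file
    act -j deploy --secret-file .secrets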
agnishom 9 hours ago
(epistemic status: frustrated, irrational)

This is what happens when they decide that all the budget should be spent on AI stuff rather than solid infra and devops

futurestef 11 hours ago
I wonder how much of this stuff has been caused by AI agents running on the infra? Claude Code is amazing for devops, until it kubectl deletes your ArgoCD root app
h4kunamata 7 hours ago
With Microsoft (behind GitHub) going full AI mode, expect things to get worse.

I worked for one of the largest companies in my country; they had a "catch-up" with GitHub, and it is no longer about GitHub as you folks are used to, but about AI, a.k.a. Copilot.

We are seeing major techs such as, but not limited to, Google, AWS, and Azure going under after making public that their code is 30% AI generated (Google).

Even Xbox (Microsoft) and its gaming studios got destroyed (COD BO7) for heavy dependency on AI.

Don't you find it a coincidence that all of these system outages worldwide are happening right after they proudly shared their heavy dependency on AI?

Companies aren't using AI/ML to improve processes but to replace people, full stop. The AI stock market is having a massive meltdown as we speak, with indications that the AI bubble has burst.

If you, as a company, want to keep your productivity at 99.99% from now on:

* GitLab: self-hosted GitLab and runners.
* Datacenter: AWS/GCP/Azure is no longer a safe or cheaper option. We have datacenter companies such as Equinix which have a massive backup plan in place; I have visited one, and they are prepared for a nuclear war, and I am not even being dramatic. If I were starting a new company in 2025, I would go back to a datacenter over AWS/GCP/Azure.
* Self-host everything you can, and no, it does not require 5 days in the office to manage all of that.

amazingman 7 hours ago
I didn't see a case made for self-hosting as the better option; instead, I see that proposition being assumed true. Why would it be better for my company to roll its own CI/CD?
h4kunamata 6 hours ago
I worked at a bank that self-hosted GitLab/runners.

As the AI bubble goes sideways, you don't know how your company's data is being handled; Copilot uses GitHub to train its AI, for instance. Yes, the big company I worked for had a clause that forbids GitHub from using the company's repos for AI training.

How many companies can afford a dedicated GitHub team to speak to? How many companies read the contracts or have any say?

Not many, really.

Yeah sure, the cloud is easier, you just pay the bills, but at what cost?

shooker435 12 hours ago
The internet is having one heck of a day! We focus on ecommerce technology, and I can't help but think our customers will be getting nervous pre-BFCM.
85392_school 11 hours ago
Seeing "404: Not Found" for all raw files
alexskr 11 hours ago
Mandatory break time has officially been declared. Please step away from your keyboard, hydrate, and pretend you were productive today.
mysteria 11 hours ago
Cloudflare this morning, and now this. A bunch of work isn't getting done today.

Maybe this will push more places towards self-hosting?

thinkindie 12 hours ago
I really can't believe this. I had issues with CircleCI too earlier, soon after the incident with Cloudflare was resolved.
zackify 12 hours ago
This is actually the 5th or 6th time this month. Actions have been degraded constantly, and now push and pull are breaking back to back.
randall 11 hours ago
can’t go down is better than won’t go down.

the problem isn’t with centralized internet services, the problem is a fundamental flaw with http and our centralized client server model. the solution doesn’t exist. i’ll build it in a few years if nobody else does.

clbrmbr 12 hours ago
Same issue here for me. Downdetector [1] agrees, and github status page was just updated now.
matkv 11 hours ago
Just as I was wondering why my `git push` wasn't working all of a sudden :D
stuffn 11 hours ago
Centralized internet continues to show its wonderful benefits.

At least Microsoft decided we all deserve a couple-hour break from work.

sre2025 11 hours ago
Why are there outages everywhere all the time now? AWS, Azure, GitHub, Cloudflare, etc. Is this the result of "vibe coding"? Because before "vibe coding", I don't remember having this many outages around the clock. Just saying.
elicash 11 hours ago
I think it has more to do with layoffs.

"Why do we need so many people to keep things running!?! We never have downtime!!"

Refreeze5224 11 hours ago
Which is the true reason for AI: reducing payroll costs.
themafia 11 hours ago
This is the reason I detest those who push AI as a technological solution. I think AI as a field is interesting but highly immature; it's been over-hyped to the point of absurdity, and now it is putting real negative pressure on wages. That pressure has carry-over effects, and I agree that we're starting to observe those.
brovonov 11 hours ago
Has to be a mix of both.
noosphr 11 hours ago
They fired a ton of employees with no rhyme or reason to cut costs; this was the predictable outcome. It will get worse, if it ever gets better.

The funny thing is that the over-hiring during the pandemic also had the predictable result of mass layoffs.

Whoever manages HR should be the one fired after two back-to-back disasters like this.

baq 11 hours ago
And yet we keep paying the company
harshalizee 11 hours ago
Could also be that the hack-and-slash layoffs are starting to show their results. Remove crucial personnel, spread teams thin, combine that with low morale industry-wide, and you've got the perfect recipe for disaster.
nawgz 11 hours ago
AI use being pushed, team sizes being reduced, continued lack of care towards quality… enshittification marches on, gaining speed every day
mepage 11 hours ago
Seeing "ERROR: no healthy upstream" in push/pull operations
brovonov 11 hours ago
Good thing I already moved away from gh to a self-hosted Forgejo instance.
etchalon 11 hours ago
What is today and who do I blame for it
baq 11 hours ago
Computers are great at solving problems that wouldn’t have existed without computers
stronglikedan 11 hours ago
Computers and alcohol.
spapas82 11 hours ago
Having a self hosted gitea server is a godsend in times like this!
Argonaut998 11 hours ago
Mercury is in retrograde
0dayman 11 hours ago
Github is down a lot...
kevinlajoye 11 hours ago
Pain

My guess is that it has to do with the Cloudflare outage this morning.

case0x 11 hours ago
I wish I could say something smart such as "People/Organisations should host their own git servers", but as someone who had the misfortune of doing that in the past, I'd rather have a non-functional GitHub.
Mossly 11 hours ago
I've found Gitea to be pretty rock solid, at least for a small team.
gelbphoenix 11 hours ago
Would even recommend Forgejo (the same project Codeberg also uses as the base for their service)
mkreis 11 hours ago
I'm curious to learn from your mistakes, can you please elaborate what went wrong?
swedishuser 11 hours ago
Almost one hour down now. What differentiates this from the recent AWS and Cloudflare issues is that this appears to be a global issue.
chazeon 11 hours ago
Seems like images on the GitHub web UI also aren't showing.
kennysmoothx 12 hours ago
What a day...
usui 11 hours ago
It's working again now.
ashishb 11 hours ago
I have said this before, and I will say this again: GitHub stars[1] are the real lock-in for GitHub. That's why all open-core startups are always requesting you to "star them on GitHub".

The VCs look at stars before deciding which open-core startup to invest in.

The 4 or 5 9s of reliability simply do not matter as much.

1 - https://news.ycombinator.com/item?id=36151140

mrguyorama 11 hours ago
I'm going to awkwardly bring up that we have avoided all github downtime and bugs and issues by simply not using github.

Our git server is hosted by Atlassian. I think we've had one outage in several years?

Our self-hosted Jenkins setup is similarly robust; we've had a handful of hours of "Can't build" in, again, several years.

We are not a company made up of rockstars. We are not especially competent at infrastructure. None of the dev teams have ever had to care about our infrastructure (occasionally we read a wiki or ask someone a question).

You don't have to live in this broken world. It's pretty easy not to. We had self-hosted Mercurial and Jenkins before we were bought by the megacorp, and the megacorp's version was even better and more reliable.

Self host. Stop pretending that ignoring complexity is somehow better.

fidotron 12 hours ago
It used to be having GitHub in the critical path for deployment wasn't so bad, but these days you'd have to be utterly irresponsible to work that way.

They need to get a grip on this.

MattGaiser 12 hours ago
Eh, the lesson from the us-east-1 outage is that you should cling to the big ones instead. You get the convenience, plus nobody gets mad at you over the failure.
bhouston 11 hours ago
Everything will have periods of unreliability. The only solution is to be multi-everything (multi-provider for most things), but the costs for that are quite high, and it's hard to see the value in it.
dylan604 11 hours ago
Yes, but if you are going to provide assurances like SLAs, you need to be aware of your own upstream dependencies and allow for them. If your customers require working with known problem areas, you should add a clause exempting those areas when they are the cause.
angrydev 11 hours ago
Ton of people in the comments here wanting to blame AI for these outages. Either you are very new to the industry or have forgotten how frequently they happen. Github in particular was a repeat offender before the MS acquisition. us-east-1 went down many times before LLMs came about. Why act like this is a new thing?
ssawchenko 11 hours ago
Same.

    ERROR: no healthy upstream
    fatal: Could not read from remote repository.

imdsm 11 hours ago
Cloudflare, GitHub...
pyenvmanger 11 hours ago
Git push and pull not working. Getting a 500 response.
Aeroi 11 hours ago
just realized my world stops when github does.
theoldgreybeard 11 hours ago
can the internet work for 5 minutes, please?
SimoncelloCT 11 hours ago
Same issue, and I need to complete my work :(
dogman123 12 hours ago
hell yea brother
whynotmaybe 11 hours ago
We're gonna need the xkcd "compiling" comic, but with "cloudflare||github||chatgpt||spotify down".

https://xkcd.com/303/

pyenvmanger 11 hours ago
Git pull and push not working
theideaofcoffee 11 hours ago
Man, I sound like a broken record, but... Love that for them.

How many more outages until people start to see that farming out every aspect of their operations maybe, might, could have a big effect on their overall business? What's the breaking point?

Then again, the skills to run this stuff properly are getting rarer, so we'll probably see big incidents like this popping up more frequently as time goes on.

arbol 11 hours ago
It's back.
lenerdenator 11 hours ago
It would be nice if this was actually broken down bit-by-bit after it happened, if only for paying customers of these cloud services.

These companies are supposed to have the top people on site reliability. That these things keep happening and no one really knows why makes me doubt them.

Alternatively,

The takeaway for today: clearly, Man was not meant to have networked, distributed computing resources.

We thought we could gather our knowledge and become omniscient, to be as the Almighty in our faculties.

The folly.

The hubris.

The arrogance.

WesolyKubeczek 11 hours ago
So that’s how the Azure migration is going.
smashah 12 hours ago
Spooky day today on the internet. Huge CF outage, Gemini 3 launches, and now I can't push anything to my repos.
MattGaiser 12 hours ago
broosted 11 hours ago
Can't do git pull or git push; 503 and 500 errors.
saydus 11 hours ago
Cherry on top will be another aws outage
linsomniac 11 hours ago
Funny you should say that, I'm here looking because our monitoring server is seeing 80-90% packet loss on our wireguard from our data center to EC2 Oregon...
linsomniac 11 hours ago
FYI: Not AWS. Been doing some more investigation, it looks like it's either at our data center, or something on the path to AWS, because if I fail over to our secondary firewall it takes a slightly different path both internally and externally, but the packet loss goes away.
_pdp_ 11 hours ago
Is it just me, or does it seem that there's an increased frequency of these types of incidents as of late?
RGamma 11 hours ago
ICE keeps finding immigrants in the server cabinets.
lherron 11 hours ago
Gemini 3 = Skynet ?
treeroots 11 hours ago
what else is out there like github?
kragen 11 hours ago
Gitlab, Forgejo, Gitea, Gogs, ... or you can just push to your own VPS over SSH, with or without an HTTP server. We had a good discussion of this last option here three weeks ago: https://news.ycombinator.com/item?id=45710721
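
The bare-repo-on-a-VPS option is only a few commands; a minimal sketch (the yourvps hostname, paths, and branch name are placeholders):

    # on the VPS: create a bare repository to push into
    ssh you@yourvps 'git init --bare ~/repos/myproject.git'

    # locally: add it as a second remote alongside github and push
    git remote add vps you@yourvps:repos/myproject.git
    git push vps main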
parliament32 11 hours ago
GitLab is probably the next-largest competitor. Unlike GitHub it's actually open source, so you can use their managed offering or self-host.
robowo 11 hours ago
projproj 11 hours ago
Obviously just speculation, but maybe don't let AI write your code...

Microsoft CEO says up to 30% of the company’s code was written by AI https://techcrunch.com/2025/04/29/microsoft-ceo-says-up-to-3...

tauchunfall 11 hours ago
It's degraded availability of Git operations.

The enterprise cloud in EU, US, and Australia has no issues.

If you look at the incident history, disruptions have been happening often in the public cloud for years already, since before AI wrote code for them.

TimTheTinker 11 hours ago
The enterprise cloud runs on older stable versions of GitHub's backend/frontend code.
smsm42 11 hours ago
That sounds very bad, but I guess it depends also on which code it is. And whether Nadella actually knows what he's talking about, too.
dollylambda 11 hours ago
Maybe AI is the tech support too
Aloisius 11 hours ago
Sweet. 30% of Microsoft's code isn't protected by copyright.

Time to leak that.

angrydev 11 hours ago
What a ridiculous comment, as if these outages didn't happen before LLMs became more commonplace.
projproj 8 hours ago
I admit it was a bit ridiculous. However, if Microsoft is going to brag about how much AI code they are using but not also brag about how good the code is, then we are left to speculate. The two outages in two weeks are _possible_ data points and all we have to go on unless they start providing data.
malfist 11 hours ago
What a ridiculous comment, as if these outages haven't been increasing in quantity since LLMs became more commonplace
meirp 11 hours ago
[dead]