Everyone is eager to point a finger at Google, but I've been a user of Railway for a while now, and I've seen enough nonsense to want to hear what GCP has to say about this before drawing any conclusions. Let's just say Railway has had problems like this before, and the way their team handles them does not inspire any confidence.
Regardless of how it happened, for me, this is the straw that broke the camel's back.
Two years ago I needed their support and they were so toxic that I just moved to vercel and told them to f off.
But I wanted something similar for other services and then I found coolify.
There’s absolutely no reason to use railway when you can use coolify.
another ditto from me, albeit anecdotal again. Railway dev teams play fast and loose with sprinkles of vibe coding everywhere on top. There's 'oops yea bear with us we are still a startup' and then there's railway.
I mean, even Google and AWS are not without sin on this one. Maybe wait for an RCA before punching someone who is currently down. There's a reason classy people do "hugops" when a competitor goes down, regardless of reputation.
Personally, I don't see this as people punching someone who's down. This is the sort of real life experience and necessary context from actual technical users that I come to HN comments for.
Someone is just asking to get Google's side and explaining why they want that, which seems reasonable since we're in a post where Google is being punched/blamed for this, and it sounds like it isn't Railway's first questionable outage.
> Let's just say Railway has had problems like this before, and the way their team handles them does not inspire any confidence.
This. It's very odd that in other threads we see a bunch of accounts heavily invested in criticizing a cloud provider, but what's conspicuously absent from this wave of indignation is any curiosity in the root cause, or even any interest in exploring what it might have been. Quite odd.
A joint statement from UniSuper CEO Peter Chun and Google Cloud CEO Thomas Kurian
8 May 2024
UniSuper and Google Cloud understand the disruption to services experienced by members has been extremely frustrating and disappointing. We extend our sincere apologies to all members.
While supporting UniSuper to bring its systems back online, Google Cloud has been conducting a root cause analysis.
Thomas Kurian has confirmed that the disruption arose from an unprecedented sequence of events, where an inadvertent misconfiguration during provisioning of UniSuper’s Private Cloud services ultimately resulted in the deletion of UniSuper’s Private Cloud subscription.
This is described as an isolated, “one-of-a-kind occurrence” that has never before occurred with any Google Cloud client globally. This should not have happened. Google Cloud has identified the sequence of events and taken measures to ensure it does not happen again.
Why did the outage last so long?
UniSuper had duplication across two geographies as protection against outages and data loss. However, the deletion of the Private Cloud subscription triggered deletion across both geographies.
Restoring the Private Cloud required significant coordination and effort between UniSuper and Google Cloud, including recovery of hundreds of virtual machines, databases, and applications.
I wrote about the UniSuper issue at the time: https://danielcompton.net/google-cloud-unisuper. It was a pretty nasty bug where their VMware environment was created with a one-year expiry date, but was one "resource" from the perspective of Google Cloud.
"UniSuper’s production Google Cloud VMware Engine (GCVE) private cloud was automatically deleted one year after it’s creation due to a misconfiguration in how it was created. When it was created, there was a bug in the creation script which passed a null value."
That's pretty amazing. Not due to a cascading failure from someone changing a config deep inside of a system that caused a bunch of unintended effects, just someone who messed up writing a shell script?
That's one footgun, but then pushing that into production and actually deleting things rather than queuing them to be deleted later after a sanity check until the system is stable, not informing users that the 1-year policy exists, (probably) not documenting that the expiry exists, not testing 'what happens if we pass in null?', etc. are a whole series of mistakes.
This was less "Oh look, a rare edge case that was easy to miss!" and more "We don't bother putting guardrails into critical systems. Oops!"
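To make the "what happens if we pass in null?" point concrete, here is a minimal Python sketch of a provisioning helper that refuses an unset term instead of quietly falling back to a default expiry. Everything here (`PrivateCloudRequest`, `resolve_expiry`, `term_days`, `no_expiry`) is hypothetical and made up for illustration; it is not the GCVE tooling, whose internals aren't public.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class PrivateCloudRequest:
    """Hypothetical provisioning request; not a real Google Cloud type."""
    name: str
    # None means "caller did not say"; it must never mean "use a default term".
    term_days: Optional[int]
    # Running without an expiry has to be requested explicitly.
    no_expiry: bool = False


def resolve_expiry(req: PrivateCloudRequest, now: datetime) -> Optional[datetime]:
    """Return the expiry timestamp, or None only when no_expiry is explicit."""
    if req.no_expiry:
        if req.term_days is not None:
            raise ValueError("no_expiry and term_days are mutually exclusive")
        return None
    if req.term_days is None:
        # Fail loudly rather than guess a term on the caller's behalf.
        raise ValueError(f"{req.name}: term_days is unset; refusing to pick a default")
    if req.term_days <= 0:
        raise ValueError(f"{req.name}: term_days must be positive")
    return now + timedelta(days=req.term_days)
```

The design choice is simply that a missing value is an error at creation time, not something discovered a year later when the resource disappears.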
The instant cascading worldwide deletion upon closing or deleting a subscription sounds like a recipe for disaster. Why not mark it for deletion and delete say... a day or a week later?
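A rough sketch of what "mark it for deletion and delete later" could look like, in Python. The `Registry` class, its method names, and the 7-day grace period are all assumptions for illustration; this is not how Google Cloud implements subscription deletion.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

GRACE_PERIOD = timedelta(days=7)  # assumed value; "a day or a week" per the comment


@dataclass
class Subscription:
    name: str
    active: bool = True
    delete_after: Optional[datetime] = None  # set once marked for deletion


class Registry:
    """Hypothetical in-memory registry with two-phase (soft, then hard) deletion."""

    def __init__(self) -> None:
        self.subs: Dict[str, Subscription] = {}

    def add(self, name: str) -> None:
        self.subs[name] = Subscription(name)

    def mark_for_deletion(self, name: str, now: datetime) -> None:
        sub = self.subs[name]
        sub.active = False                      # services stop immediately
        sub.delete_after = now + GRACE_PERIOD   # data survives until this time

    def restore(self, name: str) -> bool:
        sub = self.subs.get(name)
        if sub is None or sub.delete_after is None:
            return False                        # already purged, or never marked
        sub.active = True
        sub.delete_after = None
        return True

    def purge_expired(self, now: datetime) -> List[str]:
        expired = [n for n, s in self.subs.items()
                   if s.delete_after is not None and s.delete_after <= now]
        for n in expired:
            del self.subs[n]                    # the only irreversible step
        return expired
```

Note that marking still takes the subscription offline, so the outage happens either way; what changes is that `restore` works until `purge_expired` runs after the grace period.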
From personal experience, as a customer who once did something stupid: Google Cloud does soft deletes.
But you need to reach out to support fast enough. And really, if you deleted something important and discovered it only the next day, and not within minutes, you're having a bigger issue that a soft delete won't solve.
No, not really. That's pretty basic stuff. You would do well in reading up on the shared responsibility model. Customers are responsible for setting up their own infrastructure, and platform/service providers are only responsible for the services they manage. Even then, stuff like persisted data is still recoverable by design.
But you are absolutely responsible for the service you put together. This has been a basic principle for around two decades. Infrastructure as code tools have been pervasive and ubiquitous for over a decade.
Either mark-for-delete has the same impact as deleting in terms of shooting all the Cloud resources associated with the subscription, at which point the outage still happens but maybe the recovery is smoother or you've just delayed the inevitable by a week because no one will look at it unless there is actual impact.
You just turn it all off. So yes, the disruption is the same but restoration is much smoother. Much easier said than done - that has to be baked into every service and there would certainly be a cost from it that would have to be passed along to everyone.
> The instant cascading worldwide deletion upon closing or deleting a subscription sounds like a recipe for disaster.
I don't agree. What do you expect to happen when you explicitly delete your user account? Do you expect your systems to remain in operation for a week? That itself would be a major risk and liability, as your whole infrastructure would still be up even though you cut your access to it.
Also, isn't your whole infrastructure expected to be automatically deployed with IaC? The notable exception is data, which is already soft deleted and recoverable through customer support.
All in all, where do you expect the customer's responsibility to end and the cloud provider's to start? The shared responsibility model is covered by any intro course in no uncertain terms.
It has been 0 days since GCP has taken down a startup (again).
You see this at least once a year. Never heard of this from AWS or Azure.
In all seriousness, this is why we don't use them. They have the most ergonomic cloud of the big three, then absolutely murder it by having this kind of reputation.
On the other hand I can’t remember when there was a serious outage on GCP, unlike AWS/Azure who seem to go down catastrophically a couple of times per year.
I've been in AWS for almost twenty years at this point. It's been a long time since I've seen a global outage of the data plane on anything. The control plane, especially the US-east-1 services? Yes - but if you're off of east-1, your outages are measured in missile strikes, not botched deployments.
The impacts are usually partial. For example, scaling is impacted but everything already deployed contributes to work up to capacity. Or, you can't change configuration but the previous configuration works as configured. Often surprisingly not so impactful even if there can be limited work stoppage.
Considering how many AWS and non-AWS services go down at least partially when us-east-1 fails, this reads somewhat like "Don't worry that the steering wheel and pedals aren't working, your engine is still running on cruise control".
The problem with the us-east-1 outage is that a lot of big companies are there, so even if you try your best not to depend on us-east-1, your third party providers are most likely there. In my previous company, we were completely down during us-east-1 outage because of other dependencies that are beyond our control.
There is a mobile game I know of that had an outage as a result of a GCP service outage. That is the only time I've noticed GCP outages.
With that said, I would not say few companies rely on GCP. Search for "GCP" in this month's HN hiring thread. There are 23 hits, more than Azure's 21. AWS has 90 hits, which I guess shows its sheer dominance in the startup space. But these figures more or less agree with my intuition of the major clouds being AWS/GCP/Azure.
> Perhaps you don't notice GCP outages because so few companies rely on them?
GCP is the world's third largest cloud provider, and has around half of AWS' market share. Claiming no one uses it reads like Yogi Berra's "no one goes there anymore, it's too crowded".
GCP has a lot of customers. But you wouldn't know the companies that do, unless you worked there and wanted to leak it, or it publicly comes out. Eg it's been publicly acknowledged that Apple uses GCP for iCloud, https://www.cnbc.com/amp/2018/02/26/apple-confirms-it-uses-g... , and Home Depot is another that's used as a case study, https://cloud.google.com/customers/the-home-depot but most customers don't want to make a big deal about being on GCP as it's none of our business who's hosting them.
Apple also uses AWS, and I won't be surprised if they also use Azure. Big companies are multicloud, and not because it's a good idea (it rarely is), but because they inherited multiple environments on different CSPs, and maintaining those where they are is often cheaper than migrating them to a different CSP.
upvoted & favourited because you taught me a really interesting fact which I feel makes up for an amazing discussion (regarding icloud using GCP).
also, I can't help but imagine if instead of render, it was Apple's account which could've been auto-banned (Render is almost a billion dollar company or series-B, I am not sure)
I haven't read the articles and I admit that, but can you please elaborate on why Apple uses GCP themselves for iCloud? I would love to know the technical decisions behind it on a genuinely curious level.
From my (let's face it) limited understanding of GCP, it isn't particularly good or price performant, and one of the wonders is that Google sells it directly with Google Photos too and has a competitive lineup on Android.
So in some sense if Apple is using gcp's for icloud then aren't they just reselling google storage themselves and google can always beat them in pricing while also wanting to chew away at the percentage of iphones themselves too?
I mean, I can still try to understand the google search pays apple 10 billion dollars (right?) deal but I don't quite understand why apple would pick GCP when the hosting market is one of the more competitive ones with lots of companies.
I would love to get some explanations or theories as to why exactly that is the case
(Also given its HN, if anyone from apple is reading or knows the answer, I would love that too!)
Firstly, apple doesn’t compete on price. Even if icloud is priced more than google people would always buy apple just for the ecosystem integration. It’s not even a competition to be honest.
Look up “buy or build” which is the industry term for this kind of evaluation: buy product and use it/resell it or build your own.
Apple has gone for different strategies in various areas:
Build own Apple silicon chips, do not buy off the shelf chips from intel or nvidia or amd.
Buy and resell google storage but don’t want to build their own distributed data store for end users.
It’s about what matters more for the company and the core products. Apple’s laptops, cell phones are considered core products. Icloud is a value add.
This is also why apple is making their own cellular modem chips. For most companies, this is a “buy from Qualcomm” but apple needs to build their own for independence for their number 1 core product: the iphone.
> So in some sense if Apple is using gcp's for icloud then aren't they just reselling google storage themselves and google can always beat them in pricing while also wanting to chew away at the percentage of iphones themselves too?
Apple uses Samsung displays and Sony camera sensors, iirc, both of which are flagship Android phone makers. That doesn't really seem to be a concern in their procurement thinking. iCloud and Google Photos are not that direct competitors because which one is native depends on which phone you already bought. Google Photos definitely does have some market share on iOS due to having 3x the free storage and a handy compression mode (which used to be entirely unmetered at launch but now still uses storage, just less of it). But it will never be a full competitor because it is a separate app you have to install and it can't magically fetch cloud-only photos from the camera roll and photo picker UI like iCloud can.
The pricing of Google One and Apple One/iCloud+ isn't really dictated by underlying storage costs. At the higher tiers like 2TB, many don't come close to using it all, while the laughable 5GB iCloud free tier clearly costs almost nothing in raw storage, even on NVMe SSD, if you compare it to S3/Backblaze or even raw disk pricing on the cloud.
Let's also not ignore enterprise realities: in your example, Samsung Displays is likely giving a great price to Apple for displays based on long-term commitment of large quantities: it allows them to optimize production and possibly give a better price than maybe Samsung Mobile for smaller-runs of phones.
Each division also cross-charges, so Samsung Mobile would be paying Samsung Displays for the screens, and possibly at a small, guaranteed and non-negotiable margin.
Without a global strategy not to do so, divisions within an enterprise optimize for their own bottom line and have internal discussions on build-vs-buy even if they have an internal factory.
AWS goes down catastrophically but are back up in minutes/hours most of the time (as long as they aren't down because Iran blew up their data center). That's obviously REALLY bad for certain industries, but I suspect for the vast majority of their customers it's not a big deal. We've been able to isolate the damage almost every time just by having AZ failover in place and avoiding us-east-1 where we can.
Failover is supposed to protect you every time, unless something really exceptional happens.
While it's possible to isolate the effects, judging by how many things stop working when there is an AWS failure, a lot of people fail to do that. I think the shift of responsibility to AWS removes the incentive to put effort into resilience against AWS failure.
You can't have 100% uptime. It's unfeasible, especially for a startup. You should be telling your customers that downtime might happen, sometimes for reasons beyond your control, and that if it does then you'll do your best to recover and to compensate them for the inconvenience. You should cultivate a relationship with your early customers that makes them feel bad for you when there's an outage rather than angry about how it impacts them. Maybe even go as far as firing the customers who give you a hard time over it. That way if your cloud provider falls over it's really annoying but not a big deal.
Your cloud provider blocking your business from running is far worse.
None of the AWS “outages” have impacted us. They have either been regional, in which case we stand down the region (we run multiple hot regions), or didn’t involve things we need to maintain operation.
I can’t imagine AWS ever doing such a cascading delete. I mean, they have made deletion protection a difficult thing to ignore even for individual resources.
There was a pretty bad one last summer - their IAM system got a bad update and it broke almost all GCP services for an hour or so, since every authenticated API call reaches out to IAM.
It had lasting effects for us for a little over 3 hours.
I still remember the one where they nuked all the storage of, I think, an Australian insurance company. Luckily the IT department had done a multi-cloud setup for backups.
That’s an entirely different type of problem, and avoidable by just using us-east-2 (I still don’t understand why people default to us-east-1 unless they require some highly specific services).
Is it that easily avoidable? A lot of AWS's control plane seems to have dependencies on us-east-1, or at least that's what it's looked like as a non-us-east-1 user during recent outages.
During the 5 years of my startup, we had only 1 outage due to AWS because we picked us-west-2 as the primary region. If anyone starting a company picks us-east-1 as the primary region, they should be fired. There's absolutely no reason to be in that region.
> Why do people want to be in that region? Is it the default or something?
It's one of the oldest and largest regions. It hosts the most services, both low-level platform stuff and higher level managed services (which run on the low-level platform stuff), so services tend to be more performant.
Geographic location is also good.
Also, due to scale their pricing ends up being cheaper.
Let's say that it's the region people use by default, unless they have a compelling reason to have a presence in any other particular region.
AWS has throttled our service so badly that we couldn't operate. I was thinking of writing a blog post about how they stalled our growth for a month but it seems moot
It's AWS and Azure that are the outliers and tend not to care too much what their customers do with their infrastructure. AWS is perfectly fine with allowing me to run copies of 15 year old vulnerable AMIs copied from AMIs they've long since deprecated and removed. Even for removed features like NAT AMIs.
The only anecdotal thing I've seen is we hired a vendor to do a pentest a few years ago, and they setup some stuff in an AWS account and that account got totally yeeted out of existence by AWS if memory serves.
Having done this for both Azure and AWS, there's a specific ticket that needs to be filed with each provider that documents the scope of your pen test, where you're coming from, and a time frame over which you're doing it (which ISTR was "not more than 24 hours")
You should not be conducting unauthorized penetration tests against third party infrastructure providers without permission. They have processes and systems and usually just want a heads-up about what you plan to test and the duration / timestamps.
Cuz otherwise you look like a threat actor.
That’s assuming your vendor was pentesting AWS systems. If you meant you hired a vendor to pentest your own systems on AWS, that’s of course a totally different matter.
>That’s assuming your vendor was pentesting AWS systems. If you meant you hired a vendor to pentest your own systems on AWS, that’s of course a totally different matter.
Sorry for being unclear, the vendor was attacking our organization only, and any other company was expressly forbidden in the contract. As I recall it was a fake SSO sign-in page to collect credentials that they would try and social engineer our employees with.
Yup, I thought it was great. Although one concern I always had in the back of my mind was where the line is drawn. Such as if an adversary gains access to one of my org's accounts and does something similar, do we get 100% taken out?
How the heck do these things happen, especially with companies with huge monthly spend? At my last job we had some suspicious workloads running on AWS and our TAM reached out to us before taking any action. Who wants to bet this was some AI automation gone wrong and because GCP seems to be allergic to actually contacting a human to get a response, this just sits in some support queue that outsourced workers look at after a few hours just to give a canned response?
Nothing surprises me with anything related to support on GCP. While we absolutely do not need them, I have been through no less than 12 different Account Executives over the last 6y and they're all ENTIRELY and COMPLETELY useless.
They all introduce themselves, beg me to setup a meeting w/them and some sort of engineering resource(s), and they come to a meeting with a canned slide deck that is so absurdly unrelated to us that I just laugh, and then the next time I hear from them it's because we have a new AE.
This is my most recent reply (right after Next '26):
> I really appreciate you reaching out; however, we have met with, I dunno at this point, more than a dozen GCP Account reps, execs, technical teams, etc over the years and there's little to no value for us or you, now or in the future. Please do feel free to invest your time on your other clients. We're good; truly.
I love GCP and its services; we have been very pleased with it over the years, but the human side of it? Fucking sucks and I just don't see why they even bother.
This is actually kind of validating. I work for a company that spends almost 1mm a year on GCP. We've never had an actual support contract with them because the numbers work out to, at a minimum, being 10% of our spend. We've yet to encounter a situation where we actually needed GCP support, so we've held off. In the moments where we'd like to get some support (mostly around datastore behavior) we've managed to work around it or figure it out ourselves. So it's good to know we haven't missed out on much. Beyond the offensive aspect of GCP offering no support if we aren't willing to cough up a non-trivial percentage of our spend, I'm pretty happy with it.
Don't know about GCP, but our AE on AWS was also continuously rotating, and as best I can tell, their job was to figure out what we are planning to build, and to ensure that we should always use <INSERT AWS SERVICE DU JOUR> for that, rather than a competitor product or build it ourselves.
Before I just cut them off entirely, I used to tell them my primary concern was cost savings and that I wanted them to recommend ways I could cut 25% off my bill every month and watch the glorified salespeople fumble over trying to avoid that conversation.
It’s ok though, Claude helped us cut >45% of our monthly costs. I’m surprised they haven’t been beating down my door after we made that level-shift. Probably in AE transition. ¯\_(ツ)_/¯
My experience with a large-ish ($5m/year) AWS account was quite different. They were happy to support us with cost optimizations, discounts, and one time credits for certain activities (co-innovation and achieving certain milestones in their partner program).
Their primary concern seemed to have been to keep as much of our workload inside AWS as possible and to win workload from 3rd party services we used (e.g. CDNs). The actual revenue appeared secondary.
This is my experience as well with AWS accounts in the 2-4M/year range, the biggest upsell they always try to make is Enterprise Support, but for the rest they are usually happy to help you cut cost in the short term - as long as you stay with them in the long term.
For what it's worth - I'm not sure what the criteria is (I assume we're "medium sized / not a big upsell opportunity"?) - our GCP rep quickly pushed us to switching to using a GCP reseller. They took over our billing so that we can pay via ACH, and provide both free first-line support/escalation and paid engagements for bigger projects; they don't charge a premium on top, apparently Google pays them for supporting us. Hasn't made much of a difference in how we operate, but at least we have a direct-ish line for issues when they come up.
That's exactly why I'm less pleased with GCP: to trust a CSP (or any service), I need to be assured that when (not if) things go wrong, I could escalate to a team that would have my back.
huh- I guess there are two HN submissions with meaningful replies...
I said this in the other thread, we got access to our account back, but even with an Account Rep. and a CSM on our account- it still took them a while to figure out what was going on.
I'm sure it could have been worse if we didn't have a rep on our account.
What does blocked mean? Is there a different post that I am missing? There is shared infrastructure in GCP for networking (ex-googler here) and if only railway is affected, then it is not clear if it is only GCP or if there is something from Railway's perspective that needs to be addressed.
> Around 22:20 UTC, our Google Cloud account was placed into a "restricted" status hence removing all of our cloud overflow VMs, our CloudSQL instance, and our API.
As someone who runs some public APIs, the amount of spam from Railway IPs is insane. They have horrible abuse prevention. Hopefully this encourages them to improve their operations.
I continue to receive phishing via AWS pretending to be Amazon. And not even the Unicode-lookalike shenanigans that my spam filter refuses for excessive mixed scripts, no; literally claiming to be Amazon as in: the company that operates the relay.
When you sign up for Railway, they have an uncommon way of making sure you have read and understood their T&C regarding abuse of their systems, including crypto mining, etc.
My guess is that many are abusing their free tier, causing them trouble with their service providers.
I take no joy in seeing Railway take a hit like this, even as a competitor, but free compute attracts all sorts of strange users. We've been there and decided early on to avoid free compute even if it costs us our top of the funnel.
This is bad. Even their own website is down at railway.com. Looks like total dependency on google cloud. Surprising for a company of their scale with all this VC money.
Well, as a 2 week tenured and very happy Railway customer until now, I am now a Render customer. Somehow DNS cut over within 1 min(!) and live after about 30 minutes of work. Not bad!
In my experience, DNS changes are a lot faster than they used to be. There’s some website that has a map that tries to resolve your domain with a bunch of name servers around the world that was pretty neat to look at last time I migrated something.
Is Google allergic to humans or something? Can't they just send an email or call the company before taking a wrecking ball to the entire company's infra? Are they stupid?
It surprises me there's not a manual review for $$$$ accounts. Speculation at this stage, but it's weird they would be put in the Recycle Bin like that.
I didn’t know Railway, so with this misleading headline I thought a Google Cloud data centre was being built in the way of a railroad. That would have been a funny story to read.
An elevated railroad once ran through one end of what is now a Google-owned building (Chelsea Market in Manhattan). It's now part of the High Line elevated pedestrian park.
If you don't happen to know that "Railway" is referring to a company, then you might reasonably read that as "a GCP outage caused issues in the train network somewhere".
I will never leverage GCP in an enterprise setting, it's honestly amazing how hard they fumble the bag. Will be interesting to see when GCP support started working with them, from the updates there was an hour and change from when they identified the issue and GCP support was confirmed.
In the cloud space it seems like AWS does nothing and wins.
Does anyone know how this even happens inside the walls of google? Is it an automated process? How is such a (presumably) high revenue account just magically blocked without human intervention? I'm quite perplexed.
There would have been efforts to contact them, but it would have been via their contact method, aka the email they set it up with.
Common ways this happens? They are using a credit card to run their business with no backup payment method. Then the company's contact person is on vacation.
Yeah, I'm not sure what to think here. We know Google is not the best at customer service and has automated account suspensions. But, what I'm curious about here is why this happened.
Railway hosts applications for customers. An uneducated guess for some possible reasons: 1) one of those customers hosted something they shouldn't have 2) railway had something spawn that took up too many resources 3) Or their account balance was too high 4) Or something...
But all of this probably culminates in someone needed to read an email that was missed.
Scaling a customer infrastructure setup like Railway is hard. This is one of the non-technical hard parts - how to make sure your account with your primary vendor is safe. But, I'm willing to wait to pass judgement here until more information is available. I'm sure the post-mortem will have lessons. I'd like to know more.
Honestly still insane to nuke a high-volume client's business after a single payment issue. There would be no reason for Google to believe that a single hiccup like that is evidence that they won't get paid and have to cut account access immediately.
I've managed several accounts with GCP over the years and I've always maintained a great relationship with our contacts there. Some of these accounts were quite small, on the order of <$20k/mo, and even then we were kept abreast of anything that might be cause for concern. I always maintain a standing biweekly meeting with at least someone on the other side (account exec, technical staff, whatever) and I've yet to be blindsided by anything.
Is Google's communication good? No, not particularly. The only way something like TFA happens is if the relationship is neglected (by one or both parties). I'm not saying Railway did something wrong, but there are usually many flags and opportunities to correct long before drastic actions.
I get the impression that Railway plays fast and loose with a lot of their limits and resources and that Google may not be a fan of that.
Edit: would also like to say that if you put all your resources in one GCP project you are going to have a bad time. If you organize stuff over many projects it is very unlikely that they will ever take account wide action. I've had issues with, for example, a particular tenant's behavior, but it never jeopardized the other tenants.
I don't think you can ever trust one service with critical data. Some Claude instance deletes your prod database, you have to restore from an offsite backup because it also deleted your local backups. Even at small startups we did pg_dump to AWS from GCP because ... who knows what is going to happen to GCP, and we want to continue to be in business if that happens.
I don't feel safe with any one single point of failure. "Your credit card bounced", "you thought it was dev", "you got hacked", etc. are all the same problem to me and no cloud provider solves those merely by setting up an account.
At this point you can’t trust Google anymore, it keeps breaking things. Imagine having Google AI do these things automatically. We'll have an apocalypse in a day.
The 3-2-1 backup rule is pretty outdated in the world of cloud. You could have 3 complete copies of your data in different S3 buckets, but if they're all under the same account you've lost your blast radius protection
It's not outdated, you just actually need to follow it. 3 copies of data in separate S3 buckets is ignoring the "2" in the 3-2-1 rule: 2 different mediums, and also the "1" rule: 1 copy offsite. In the cloud era, offsite means not on the same cloud provider. Different mediums ideally means a non-cloud provider (e.g. a NAS at your office under your control).
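As a rough illustration of that reading of 3-2-1, here is a small Python check that flags the "three S3 buckets in one account" setup. The `BackupCopy` fields and the `check_321` helper are hypothetical, and "offsite" is interpreted, as in the comment above, as "not on the primary cloud provider".

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BackupCopy:
    provider: str  # e.g. "aws", "gcp", "office"
    account: str   # account or project that owns the copy
    medium: str    # e.g. "object-storage", "nas", "tape"


def check_321(copies: List[BackupCopy], primary_provider: str) -> List[str]:
    """Return the 3-2-1 rules this set of copies violates (empty list = OK)."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies")
    if len({c.medium for c in copies}) < 2:
        problems.append("fewer than 2 distinct media")
    if not any(c.provider != primary_provider for c in copies):
        problems.append("no copy outside the primary provider")
    return problems


# Three buckets in one AWS account: passes "3", fails "2" and "1".
same_account = [BackupCopy("aws", "prod", "object-storage") for _ in range(3)]
print(check_321(same_account, primary_provider="aws"))
```

The check deliberately does not count distinct accounts; a stricter version could also require that at least one copy lives under separate credentials.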
Well, having backups helps, but I certainly can’t migrate my infra to rsync.net at a moment’s notice (or ever, since rsync.net does storage and nothing else) so my customers aren’t affected.
Wild to me that any tech sector business would want to rent an operating environment to park their entire infrastructure in. This is the equivalent of traveling shoe salesmen setting up a tent in the parking lot of a strip mall.
I just... I don't really understand why startups even use AWS, GCP, or any other cloud hosted software? Hetzner, etc. are all extremely cheap, and honestly scale so well... Code nowadays is cheaper for configs, and having full control over your compute is... liberating.
Low cost of entry, easy to get scale from the beginning if you need it. The large cloud providers throw free credit at startups to lock them in all the time. I had a short-lived stint trying to get my own startup off the ground and it was really easy to get free compute from Google with no strings attached. This was many years ago now, but I would be surprised if it is any different.
I am with you entirely and would not have taken that route today, but it is really easy to see why people go that route.
A few years ago, when I was kinda active in the startup scene in my area, you had people selling access to cloud credits at pennies on the dollar. The credits are given out liberally by AWS/GCP to big corps and organizations through workshops, webinars, and events, all in the hope of roping departments into building MVPs and demos on AWS/GCP, but people also found a way to cheat that system and make a quick buck.
I know a startup run by acquaintances of mine that has been on AWS for 5 years straight without paying AWS a single dollar. When the credits are about to run out, they migrate their data over to another account with credits. That has happened twice already.
It helps to have a portable, replicable IaC config. But also this is sustainable because they are a pretty small struggling shop. You will probably not be able to do this if you are trying to maintain more than 3 nines for an enterprise client.
Perhaps Railway does a bit more than you think; they have some great functionality (I'm not affiliated with them). Check out "PR Environments" under [Features | Railway](https://railway.com/features); they are incredible for the QA process.
Oh absolutely... and many use architectures that have evolved out of the needs of really big companies and are not really a good fit for a startup. But I guess they want to be "ready for growth".
Yeah, but only until you find that the new cloud provider won't approve your compute quota, or doesn't have enough capacity in the region, or you hit fraud flags for a stagnant account suddenly spinning up lots of compute.
There's a lot of, what seems to me, unfounded blame being directed at Google for this. Isn't Railway the company that just blamed Anthropic for deleting their prod database?
Nope, Railway was the company who was hosting PocketOS, which is the company that blamed Cursor for deleting their prod database. Railway is only involved insofar as their API allowed an instant delete of the prod database.
Why does Railway deserve any blame here at all? It was an MCP with elevated infra access, which the user willingly connected through Cursor, and which allowed an LLM agent to manage infra on Railway. The user would first have gone through OAuth, confirming the access level scope (I would have rejected it the moment it indicated that it could delete critical infra and backups...). So obviously it had access to all the commands the user would also have access to. From my perspective the blame is entirely on the user, and partly on Cursor for not enforcing HITL correctly across their agents.
Putting AI aside, people make mistakes. One of the most common mistakes people make is deleting the wrong thing. After they realize the mistake, people want to restore the thing they deleted from backups. Thus deleting the thing and deleting the backups of the thing should always be separate operations.
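A toy sketch of that separation, with no claim about how Railway or any real provider implements it: deleting the live data never touches backups, and purging backups is its own call with its own confirmation. The `Store` class and its method names are hypothetical.

```python
class Store:
    """Toy in-memory store where live data and backups have separate lifecycles."""

    def __init__(self) -> None:
        self._live: dict[str, bytes] = {}
        self._backups: dict[str, list[bytes]] = {}

    def put(self, name: str, data: bytes) -> None:
        self._live[name] = data

    def backup(self, name: str) -> None:
        self._backups.setdefault(name, []).append(self._live[name])

    def delete(self, name: str) -> None:
        # Removes only the live copy; backups are deliberately left alone.
        del self._live[name]

    def restore_latest(self, name: str) -> bytes:
        data = self._backups[name][-1]
        self._live[name] = data
        return data

    def purge_backups(self, name: str, confirm_name: str) -> None:
        # Separate, explicit operation: refuses unless the caller repeats the name.
        if confirm_name != name:
            raise ValueError("confirmation does not match; backups kept")
        self._backups.pop(name, None)
```

With this shape, a mistaken `delete("prod")` is recoverable through `restore_latest("prod")`, and losing the backups takes a second, deliberate action.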
fairly certain you are remembering the goofy article that was going around where a railway user allowed an agent to delete his db. iirc he questioned the agent after and the agent told him it should have read the file that told him not to do things, so just sounds like he deleted his db and blamed his tools.
How to handle domains? The rest is easy, but your domain registrar blocking you sounds like a pain. My current solution is to use a local small provider, just for the domain. Then if there is a problem with your play account it is out of any blast radius.
What the deuce are you blathering on about. An account got blocked, this has nothing to do with a domain.
And I’m talking about having disparate failovers that don’t rely on a single hosting provider. At that point, who cares what Google does to your cloud account… work with the hot failover and spin up another hot failover somewhere else.
Looks like they were sold at the beginning of the year to a company without a Wikipedia page, whose parent company doesn’t have one either: https://en.wikipedia.org/wiki/Markmonitor
Acquired in November 2022 by Newfold Digital, it was later announced that the firm would be sold to Com Laude, a company owned by PX3 Partners.
PX3 stands for purpose, passion, and performance. It is a pan-European private equity firm with headquarters in London. It invests behind transformative themes and targets companies operating within select segments of the business services, consumer and leisure, and industrials sectors with strong business fundamentals.
Precisely. If you’re going to have a hot failover, it behoves you to have an entirely separate entity billing you for that hosting.
Honestly, I don’t know where the downvotes are coming from. Do people have no clue about service resiliency? I can understand if it’s a personal project or you haven’t yet scaled to paying customers, but anything at scale with serious money involved needs to be completely independent of the underlying hosting. It should remain up even if an entire provider goes tits up.
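As a bare-bones illustration of provider-independent failover, assuming you already run the same service at two or more independently billed providers: pick the first endpoint that passes a health check. The endpoint URLs and the `is_healthy` probe are hypothetical and injected by the caller; a real setup would also need data replication and DNS or load balancer changes, which this sketch does not cover.

```python
from typing import Callable, Optional


def pick_active(endpoints: list[str], is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy endpoint in priority order, or None if all are down."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None
```

For example, `pick_active(["https://primary.example", "https://failover.example"], probe)` returns the failover URL whenever `probe` reports the primary as down, regardless of why the primary provider stopped serving.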