Don't rent the cloud, own instead

1042 points | 438 comments | 17 hours ago
adamcharnock

This is an industry we're[0] in. Owning is at one end of the spectrum, with cloud at the other, and broadly a couple of options in between:

1 - Cloud – This is minimising cap-ex, hiring, and risk, while largely maximising operational costs (it's expensive) and cost variability (usage-based).

2 - Managed Private Cloud - What we do. Still minimal-to-no cap-ex, hiring, risk, and medium-sized operational cost (around 50% cheaper than AWS et al). We rent or colocate bare metal, manage it for you, handle software deployments, deploy only open-source, etc. Only really makes sense above €$5k/month spend.

3 - Rented Bare Metal – Let someone else handle the hardware financing for you. Still minimal cap-ex, but with greater hiring/skilling and risk. Around 90% cheaper than AWS et al (plus time).

4 - Buy and colocate the hardware yourself – Certainly the cheapest option if you have the skills, scale, cap-ex, and if you plan to run the servers for at least 3-5 years.

A good provider for option 3 is someone like Hetzner. Their internal ROI on server hardware seems to be around the 3 year mark. After which I assume it is either still running with a client, or goes into their server auction system.
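
As a rough sanity check on those ratios, here is a toy break-even sketch; every figure in it is a hypothetical placeholder rather than a quote from any provider, and real numbers will vary a lot with workload and staffing.

```python
# Toy break-even comparison of the options above.
# All prices are hypothetical placeholders, not real quotes.

CLOUD_PER_MONTH = 20_000         # option 1: managed cloud
RENTED_METAL_PER_MONTH = 2_500   # option 3: rented bare metal, similar capacity
OWNED_CAPEX = 60_000             # option 4: hardware purchase up front
OWNED_OPEX_PER_MONTH = 1_200     # option 4: colo space, power, remote hands

def cumulative_cost(months: int) -> dict[str, int]:
    """Total spend per option after a given number of months of steady load."""
    return {
        "cloud": CLOUD_PER_MONTH * months,
        "rented_metal": RENTED_METAL_PER_MONTH * months,
        "owned_colo": OWNED_CAPEX + OWNED_OPEX_PER_MONTH * months,
    }

for months in (12, 36, 60):
    print(months, cumulative_cost(months))
# With these placeholder numbers, owning overtakes rented bare metal around
# month 46, i.e. within the 3-5 year horizon mentioned above.
```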

Options 3 & 4 generally become more appealing either at scale, or when infrastructure is part of the core business. Option 1 is great for startups who want to spend very little initially, but then grow very quickly. Option 2 is pretty good for SMEs with baseline load, regular-sized business growth, and maybe an overworked DevOps team!

[0] https://lithus.eu, adam@

scalemaxx

Everything comes full circle. Back in my day, we just called it a "data center". Or on-premise. You know, before the cloud even existed. A 1990s VP of IT would look at this post and say, what's new? Better computing for sure. Better virtualization and administration software, definitely. Cooling and power and racks? More of the same.

The argument made 2 decades ago was that you shouldn't own the infrastructure (capital expense) and instead just account for the cost as operational expense (opex). The rationale was you exchange ownership for rent. Make your headache someone else's headache.

The ping pong between centralized vs decentralized, owned vs rented, will just keep going. It's never an either or, but when companies make it all-or-nothing then you have to really examine the specifics.

tgtweak

>San Diego has a mild climate and we opted for pure outside air cooling. This gives us less control of the temperature and humidity, but uses only a couple dozen kW. We have dual 48” intake fans and dual 48” exhaust fans to keep the air cool. To ensure low humidity (<45%) we use recirculating fans to mix hot exhaust air with the intake air. One server is connected to several sensors and runs a PID loop to control the fans to optimize the temperature and humidity.

Oh man, this is bad advice. Airborne humidity and contaminants will KILL your servers on a very short horizon in most places - even San Diego. I highly suggest enthalpy wheel coolers (KyotoCooling is one vendor - Switch runs very similar units in their massive datacenters in the Nevada desert) as they remove the heat from the indoor air using outdoor air (and can boost slightly with an integrated refrigeration unit to hit target intake temps) without transferring air from one side to the other. This has huge benefits for air quality control and outdoor air tolerance, and a single 500 kW heat rejection unit uses only 25 kW of input power (when it needs to boost the AC unit's output). You can combine this with evaporative cooling on the exterior intakes to lower the temps even further at the expense of some water consumption (typically far cheaper than the extra electricity to boost the cooling through an HVAC cycle).

Not knocking the achievement, just speaking from experience: taking outdoor air (even filtered + mixed) into a datacenter is a recipe for hardware failure, and the mean time to failure is highly dependent on your outdoor air conditions. I've run 3MW facilities with passive air cooling, and taking outdoor air directly into servers requires a LOT more conditioning and consideration than is outlined in this article.
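
For reference, the control scheme in the quoted passage is essentially a PID loop driving fan duty from the temperature error (the real setup also mixes exhaust air to manage humidity, which is omitted here). A minimal sketch of that idea follows; the sensor and fan interfaces are hypothetical stubs, and the gains and setpoint are made up.

```python
import time

# Minimal PID fan-control sketch of the scheme described in the quote above.
# read_intake_temp() and set_fan_duty() are hypothetical stand-ins for whatever
# sensor and fan interface a real facility would expose; gains are made up.

KP, KI, KD = 4.0, 0.05, 1.0
SETPOINT_C = 27.0  # target intake temperature

def read_intake_temp() -> float:
    return 27.0  # placeholder: poll a real sensor here (IPMI, SNMP, 1-wire, ...)

def set_fan_duty(percent: float) -> None:
    print(f"fan duty -> {percent:.1f}%")  # placeholder: drive a real fan controller

def control_loop(dt: float = 5.0) -> None:
    integral = 0.0
    prev_error = 0.0
    while True:
        error = read_intake_temp() - SETPOINT_C   # positive when too hot
        integral += error * dt
        derivative = (error - prev_error) / dt
        duty = KP * error + KI * integral + KD * derivative
        set_fan_duty(min(100.0, max(0.0, duty)))  # clamp to 0-100%
        prev_error = error
        time.sleep(dt)

if __name__ == "__main__":
    control_loop()
```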

alecco

Counterpoint: "Why I'm Selling All My GPUs" https://www.youtube.com/watch?v=C6mu2QRVNSE

TL;DW: GPU rental arbitrage is dead. Regulation hell. GPU prices. Rental price erosion. Building costs rising. Complexity of things like backup power. Delays of connection to energy grid. Staffing costs.

speedgoose

I would suggest using both on-premise hardware and cloud computing. Which is probably what comma is doing.

For critical infrastructure, I would rather pay a competent cloud provider than be responsible for reliability issues. Maintaining one server room in the headquarters is something, but two server rooms in different locations, with resilient power and network, is a bit too much effort IMHO.

For running many Slurm jobs on good servers, cloud computing is very expensive, and buying your own hardware can pay for itself in a matter of months. And who cares if the server room is a total loss after a while; worst case, you write some more YAML and Terraform and deploy a temporary replacement in the cloud.

Another thing between is colocation, where you put hardware you own in a managed data center. It’s a bit old fashioned, but it may make sense in some cases.

I can also mention that research HPCs may be worth considering. In research, we have some of the world's fastest computers at a fraction of the cost of cloud computing. It's great as long as you don't mind not being root and having to use Slurm.

I don't know about the USA, but in Norway you can run your private company's Slurm AI workloads on research HPCs, though you will pay quite a bit more than universities and research institutions do. But you can also have research projects together with universities or research institutions, and everyone will be happy if your business benefits a lot from the collaboration.

IFC_LLC

This is cool. Yet, there are levels of insanity and those depend on your inability to estimate things.

When I'm launching a project it's easier for me to rent $250 worth of compute from AWS. When the project consumes $30k a month, it's easier for me to rent a colocation.

My point is that a good engineer should know how to calculate all the ups and downs here to propose a sound plan to the management. That's the winning thing.

kevinkatzke

Feels like I’ve lived through a full infrastructure fashion cycle already. I started my career when cloud was the obvious answer and on-prem was “legacy.”

Now on-prem is cool again.

Makes me wonder whether we’re already setting up the next cycle 10 years from now, when everyone rediscovers why cloud was attractive in the first place and starts saying “on-prem is a bad idea” again.

drnick1

On premises isn't only about saving money (that's not always clear). The article neglects the most important benefits which are freedom (control) and privacy. It's basically the same considerations that apply to owning vs renting a house.

jillesvangurp

At scale (like comma.ai), it's probably cheaper. But until then it's a long term cost optimization with really high upfront capital expenditure and risk. Which means it doesn't make much sense for the majority of startup companies until they become late stage and their hosting cost actually becomes a big cost burden.

There are in between solutions. Renting bare metal instead of renting virtual machines can be quite nice. I've done that via Hetzner some years ago. You pay just about the same but you get a lot more performance for the same money. This is great if you actually need that performance.

People obsess about hardware but there's also the software side to consider. For smaller companies, operations/devops people are usually more expensive than the resources they manage. The cost to optimize is that cost. The hosting cost usually is a rounding error on the staffing cost. And on top of that the amount of responsibilities increases as soon as you own the hardware. You need to service it, monitor it, replace it when it fails, make sure those fans don't get jammed by dust puppies, deal with outages when they happen, etc. All the stuff that you pay cloud providers to do for you now becomes your problem. And it has a non zero cost.

The right mindset for hosting cost is to think of it in FTEs (full time employee cost for a year). If it's below 1 (most startups until they are well into scale up territory), you are doing great. Most of the optimizations you are going to get are going to cost you in actual FTEs spent doing that work. 1 FTE pays for quite a bit of hosting. Think 10K per month in AWS cost. A good ops person/developer is more expensive than that. My company runs at about 1K per month (GCP and misc managed services). It would be the wrong thing to optimize for us. It's not worth spending any amount of time on for me. I literally have more valuable things to do.

This flips when you start getting into the multiple FTEs per month in cost for just the hosting. At that point you probably have additional cost measured in 5-10 FTE in staffing anyway to babysit all of that. So now you can talk about trading off some hosting FTEs for modest amount of extra staffing FTEs and make net gains.
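
The FTE framing above boils down to one division; here is a tiny sketch with a hypothetical fully-loaded salary figure:

```python
# Express hosting spend in FTE-equivalents, per the framing above.
# The fully-loaded FTE cost is a hypothetical placeholder.

FTE_ANNUAL_COST = 150_000  # fully loaded cost of one ops engineer per year

def hosting_in_ftes(monthly_hosting_cost: float) -> float:
    return (monthly_hosting_cost * 12) / FTE_ANNUAL_COST

print(hosting_in_ftes(1_000))    # ~0.08 FTE: not worth anyone's time to optimize
print(hosting_in_ftes(10_000))   # ~0.8 FTE: borderline
print(hosting_in_ftes(100_000))  # ~8 FTE: worth trading hosting spend for headcount
```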

simianwords

The reason companies don't go with on-premises even when cloud is way more expensive is the risk involved in running on-premises.

You can see it quite clearly here that there’s so many steps to take. Now a good company would concentrate risk on their differentiating factor or the specific part they have competitive advantage in.

It’s never about “is the expected cost in on premises less than cloud”, it’s about the risk adjusted costs.

Once you’ve spread risk not only on your main product but also on your infrastructure, it becomes hard.

I would be wary of a smallish company building their own Jira in house in a similar way.

3acctforcom

The lowest grade I got in my business degree was in the "IT management" course. That's because the ONLY acceptable answer to any business IT problem is to move everything to the cloud. Renting is ALWAYS better than owning because you transfer cost and risk to a 3rd party.

That's pretty much the dogma of the 2010s.

It doesn't matter that my org runs a line-of-business datacentre that is a fraction of the cost of public cloud. It doesn't matter that my "big" ERP and admin servers take up half a rack in that datacentre. MBA dogma says that I need to fire every graybeard sysadmin, raze our datacentre facility to the ground, and move to AWS.

Fun fact, salaries and hardware purchases typically track inflation, because switching cost for hardware is nil and hiring isn't that expensive. Whereas software is usually 5-10% increases every year because they know that vendor lock-in and switching costs for software are expensive.

vadepaysa

I was an on-prem maxi (if that's a thing) for a long time. I've run clusters that cost more than $5M, but these days I am a changed man. I start with PaaS like Vercel and work my way down to on-prem depending on how important and cost conscious that workload is.

Pains I faced running BIG clusters on-prem.

1. Supply Chain Management -- everything from power supplies all the way to GPUs and storage has to be procured, shipped, disassembled and installed. You need a labor pool and dedicated management.

2. Inventory Management -- You also need to manage inventory on hand for parts that WILL fail. You can expect 20% of your cluster to have some degree of issues on an ongoing basis

3. Networking and security -- You are on your own defending your network or have to pay a ton of money to vendors to come in and help you. Even with the simplest of storage clusters, we've had to deal with pretty sophisticated attacks.

When I ran massive clusters, I had a large team dealing with these. Obviously, with PaaS, you don't need anyone.

swordsith

Recently learned about Tailscale and have been accessing my project from my phone. It's been a game changer. The fact that they support teams of up to 3 people and 100 devices on the free plan is awesome imo. Running locally just makes me feel so much more comfortable.

sgarland

Note that they're running R630/R730s for storage. Those are 12-year-old servers, and yet they say each one can do 20 Gbps (2.5 GB/s) of random reads. In comparison, the same generation of hardware at AWS ({c,m,r}4 instances) maxes out at 50% of that for EBS throughput on m4, and 70% on r4 - and that assumes carefully tuned block sizes.

Old hardware is _plenty_ powerful for a lot of tasks today.
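
Putting the parent comment's numbers side by side (figures are taken from the comment itself, not independently verified):

```python
# Side-by-side of the figures quoted in the comment above (not independently verified).
node_random_read_gbps = 20                                 # per R630/R730 storage node
node_random_read_gigabytes_per_s = node_random_read_gbps / 8   # = 2.5 GB/s

m4_ebs_cap_gbps = 0.50 * node_random_read_gbps   # "50% of that" -> 10 Gbps
r4_ebs_cap_gbps = 0.70 * node_random_read_gbps   # "70%"         -> 14 Gbps

print(f"old node: {node_random_read_gbps} Gbps ({node_random_read_gigabytes_per_s:.1f} GB/s)")
print(f"m4 EBS:   {m4_ebs_cap_gbps:.0f} Gbps, r4 EBS: {r4_ebs_cap_gbps:.0f} Gbps")
```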

insuranceguru

The own vs rent calculus for compute is starting to mirror the market value vs replacement cost divergence we see in physical assets. Cloud is convenient because it lowers OpEx initially, but you lose control over the long-term CapEx efficiency. Once you reach a certain scale, paying the premium for AWS flexibility stops making sense compared to the raw horsepower of owned metal.

hbogert

Datacenters need cool dry air? <45%

No, low isn't good per se. I worked in a datacenter which in winters had less than 40%, and RAM was failing all over the place. Low humidity causes static electricity.

regular_trash

The distinction between rent/own is kind of a false dichotomy. You never truly own your platform - you just "rent" it in a more distributed way that shields you from a single stress point. The tradeoff is that you have to manage more resources to take care of it, but you have much greater flexibility.

I have a feeling AI is going to be similar in the future. Sure, you can "rent" access to LLMs and have agents doing all your code. And in the future, it'll likely be as good as most engineers today. But the tradeoff is that you are effectively renting your labor from a single source instead of having a distributed workforce. I don't know what the long-term ramifications are here, if any, but I thought it was an interesting parallel.

butterisgood

I think this is how IBM is making tons of money on mainframes. A lot of what people are doing with cloud can be done on premises with the right levels of virtualization.

https://intellectia.ai/news/stock/ibm-mainframe-business-ach...

60% YoY growth is pretty excellent for an "outdated" technology.

sakopov

Does anyone remember how cloud prices used to trend down? That was about 6 years ago and then seemingly after the pandemic everything started going the other way.

sys42590

It would be interesting to hear their contingency plan for any kind of disaster (most commonly a fire) that hits their data center.

dh2022

LOL'ed IRL at “In a future blog post I hope I can tell you about how we produce our own power and you should too.” Producing your own power as a prerequisite for running on-prem is a non-starter for many.

pja

I’m impressed that San Diego electrical power manages to be even more expensive than in the UK. That takes some doing.

epistasis

Ah Slurm, so good to see it still being used. As soon as I touched it in ~2010 I realized this was finally the solid queue management system we needed. Things like Sun Grid Engine or PBS were always such awful and burdensome PoS.

IIRC, Slurm came out of LLNL, and it finally made both usage and management of a cluster of nodes really easy and fun.

Compare Slurm to something like AWS Batch or Google Batch and just laugh at what the cloud has created...
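
To the point about ease of use: a typical interaction is just a small batch script handed to sbatch. A minimal sketch follows; the partition name, resource requests, and training script are hypothetical placeholders, while the #SBATCH directives themselves are standard.

```python
import subprocess
import tempfile

# Minimal Slurm submission sketch. Partition, resources, and train.py are
# hypothetical placeholders.
batch_script = """#!/bin/bash
#SBATCH --job-name=train-model
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --time=04:00:00
#SBATCH --output=train-%j.log

srun python train.py --epochs 10
"""

with tempfile.NamedTemporaryFile("w", suffix=".sbatch", delete=False) as f:
    f.write(batch_script)
    script_path = f.name

# sbatch replies with something like "Submitted batch job 12345"
result = subprocess.run(["sbatch", script_path], capture_output=True, text=True)
print(result.stdout.strip() or result.stderr.strip())
```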

yomismoaqui

This quote is gold:

The cloud requires expertise in company-specific APIs and billing systems. A data center requires knowledge of Watts, bits, and FLOPs. I know which one I rather think about.

pu_pe

> Self-reliance is great, but there are other benefits to running your own compute. It inspires good engineering.

It's easy to inspire people when you have great engineers in the first place. That's a given at a place like comma.ai, but there are many companies out there where administering a datacenter is far beyond their core competencies.

I feel like skilled engineers have a hard time understanding the trade-offs from cloud companies. The same way that comma.ai employees likely don't have an in-house canteen, it can make sense to focus on what you are good at and outsource the rest.

MagicMoonlight

For ML it makes sense, because you’re using so much compute that renting it is just burning money.

For most businesses, it’s a false economy. Hardware is cheap, but having proper redundancy and multiple sites isn’t. Having a 24/7 team available to respond to issues isn’t.

What happens if their data centre loses power? What if it burns down?

ghc

If it were me, instead of writing all these bespoke services to replicate cloud functionality, I'd just buy oxide.computer systems.

ynac

Not nearly on the article's level, but I've been operating what I call a fog machine (itsy bitsy personal cloud) for about 15 years. It's just a bunch of local and off-site NAS boxes. It has kinda worked out great. Mostly Synology, but probably won't be when their scheduled retirement comes up. The networking is dead simple, the power use is distributed, and the size of it all is still a monster for me - back in the day, I had to use it for a very large audio project to keep backups of something like 750,000 albums and other audio recordings along with their metadata and assets.

apothegm

This also depends so much on your scaling needs. If you need 3 mid-sized ECS/EC2 instances, a load balancer, and a database with backups, renting those from AWS isn’t going to be significantly more expensive for a decent-sized company than hiring someone to manage a cluster for you and dealing with all the overhead of keeping it maintained and secure.

If you’re at the scale of hundreds of instances, that math changes significantly.

And a lot of it depends on what type of business you have and what percent of your budget hosting accounts for.

siliconc0w

You can also buy the hardware and hire an IT vendor to rack and help manage it as smart hands so you never need to visit the datacenter. With modern beefy hardware, even large web services only need a few racks, so most orgs don't even need to manage a large footprint.

Sure you have to schedule your own hardware repairs or updates but it also means you don't need to wrangle with the ridiculous cost-engineering, reserved instances, cloud product support issues or API deprecations, proprietary configuration languages, etc.

Bare metal is better for a lot of non-cost reasons too, as the article notes it's just easier/better to reason about the lower level primitives and you get more reliable and repeatable performance.

JKCalhoun

Naive comment from a hobbyist with nothing close to $5M: I'm curious about the degree to which you build a "home lab" equivalent. I mean if "scaling" turned out to be just adding another Raspberry Pi to the rack (where is Mr. Geerling when you need him?) I could grow my mini-cloud month by month as spending money allowed.

(And it would be fun too.)

komali2

> The cloud requires expertise in company-specific APIs and billing systems.

This is one reason I hate dealing with AWS. It feels like a waste of time in some ways. Like learning a fly-by-night javascript library - maybe I'm better off spending that time writing the functionality on my own, to increase my knowledge and familiarity?

Maro

Working at a non-tech regional bigco, where ofc cloud is the default, I see every day how AWS costs get out of hand; it's a constant struggle just to keep costs flat. In our case, the reality is that NONE of our services require scalability, and high uptime is nice primarily for my blood pressure... we only really need uptime during business hours, nobody cares what happens at night when everybody is sleeping.

On the other hand, there's significant vendor lockin, complexity, etc. And I'm not really sure we actually end up with less people over time, headcount always expands over time, and there's always cool new projects like monitoring, observability, AI, etc.

My feeling is, if we rented 20-30 chunky machines and ran Linux on them, with k8s, we'd be 80% there. For specific things I'd still use AWS, like infinite S3 storage, or RDS instances for super-important data.

If I were to do a startup, I would almost certainly not base it off AWS (or other cloud), I'd do what I write above: run chunky servers on OVH (initially just 1-2), and use specific AWS services like S3 and RDS.

A bit unrelated to the above, but I'd also try to keep away from expensive SaaS like Jira, Slack, etc. I'd use the best self-hosted open source version, and be done with it. I'd try Gitea for git hosting, Mattermost for team chat, etc.

And actually, given the geo-political situation as an EU citizen, maybe I wouldn't even put my data on AWS at all and self-host that as well...

0xbadcafebee

  If your business relies on compute, and you run that compute in the cloud, you are putting a lot of trust in your cloud provider. Cloud companies generally make onboarding very easy, and offboarding very difficult. If you are not vigilant you will sleepwalk into a situation of high cloud costs and no way out. If you want to control your own destiny, you must run your own compute.
This is not a valid reason for running your own datacenter, or running your own server.

  Self-reliance is great, but there are other benefits to running your own compute. It inspires good engineering. Maintaining a data center is much more about solving real-world challenges. The cloud requires expertise in company-specific APIs and billing systems. A data center requires knowledge of Watts, bits, and FLOPs. I know which one I rather think about.
This is not a valid reason for running your own datacenter, or running your own server.

  Avoiding the cloud for ML also creates better incentives for engineers. Engineers generally want to improve things. In ML many problems go away by just using more compute. In the cloud that means improvements are just a budget increase away. This locks you into inefficient and expensive solutions. Instead, when all you have available is your current compute, the quickest improvements are usually speeding up your code, or fixing fundamental issues.
This is not a valid reason for owning a datacenter, or running your own server.

  Finally there’s cost, owning a data center can be far cheaper than renting in the cloud. Especially if your compute or storage needs are fairly consistent, which tends to be true if you are in the business of training or running models. In comma’s case I estimate we’ve spent ~5M on our data center, and we would have spent 25M+ had we done the same things in the cloud.
This is one of only two valid reasons for owning a datacenter, and one of several valid reasons for running your own server.

The only two valid reasons to build/operate a datacenter: 1) what you're doing is so costly that building your own factory is the only profitable way for your business to produce its widgets, 2) you can't find a datacenter with the location or capacity you need and there is no other way to serve your business needs.

There's many valid reasons to run your own servers (colo), although most people will not run into them in a business setting.

nubela

Same thing. I was previously spending 5-8K on DigitalOcean, supposedly a "budget" cloud. Then the company was sold, and I started a new company on entirely self-hosted hardware. Cloudflare tunnel + CC + microk8s made it trivial! And I spend close to nothing beyond the internet connection I'm already paying for. I do have solar power too.

juvoly

> Cloud companies generally make onboarding very easy, and offboarding very difficult. If you are not vigilant you will sleepwalk into a situation of high cloud costs and no way out. If you want to control your own destiny, you must run your own compute.

Cost and lock-in are obvious factors, but "sovereignty" has also become a key factor in the sales cycle, at least in Europe.

Handling health data, Juvoly is happy to run AI workloads on-premise.

ex-aws-dude

I can see how this would work fine if the primary purpose is for training rather than serving large volumes of customer traffic in multiple regions

It would probably even make sense for some companies to still use cloud for their API but do the training on prem as that may be the expensive part.

bob1029

The #1 reason I would advocate for using AWS today is the compliance package they bring to the party. No other cloud provider has anything remotely like Artifact. I can pull Amazon's PCI-DSS compliance documentation using an API call. If you have a heavily regulated business (or work with customers who do), AWS is hard to beat.

If you don't have any kind of serious compliance requirement, using Amazon is probably not ideal. I would say that Azure AD is ok too if you have to do Microsoft stuff, but I'd never host an actual VM on that cloud.

Compliance and "Microsoft stuff" covers a lot of real world businesses. Going on prem should only be done if it's actually going to make your life easier. If you have to replicate all of Azure AD or Route53, it might be better to just use the cloud offerings.

cgsmith

I used to colocate a 2U server that I purchased with a local data center. It was a great learning experience for me. I'm curious why a company wouldn't colocate their own hardware? Proximity isn't an issue when you can have the datacenter perform physical tasks. Bravo to the comma team regardless. It'll be a great learning experience and make each person on their team better.

PS: BX cable instead of conduit for electrical looks cringe.

kavalg

This was one of the coolest job ads that I've ever read :). Congrats for what you have done with your infrastructure, team and product!

eubluue

On top of that, now when the US cloud act is again a weapon against EU, most European companies know better and are migrating in droves to colo, on-prem and EU clouds. Bye bye US hyperscalers!

Dormeno

The company I work for used to have a hybrid where 95% was on-prem, but became closer to 90% in the cloud when it became more expensive to do on-prem because of VMware licensing. There are alternatives to VMware, but not officially supported with our hardware configuration, so the switch requires changing all the hardware, which still drives it higher than the cloud. Almost everything we have is cloud agnostic, and for anything that requires resilience, it sits in two different providers.

Now the company is looking at further cost savings, as the buildings rented for running on-prem are sitting mostly unused and building prices have gone up notably in recent years, so we're likely to save money by moving into the cloud. This is likely to make the cloud transition permanent.

danpalmer

> Cloud companies generally make onboarding very easy, and offboarding very difficult.

I reckon most on-prem deployments have significantly worse offboarding than the cloud providers. As a cloud provider you can win business by having something for offboarding, but internally you'd never get buy-in to spend on a backup plan if you decide to move to the cloud.

comrade1234

15-years ago or so a spreadsheet was floating around where you could enter server costs, compute power, etc and it would tell you when you would break-even by buying instead of going with AWS. I think it was leaked from Amazon because it was always three-years to break-even even as hardware changed over time.

b8

SSDs don't last longer than HDDs. Also, they're much more expensive due to AI now. They should move to cut down on power costs.

durakot

There's the HN I know and love

imcritic

I love articles like this and companies with this kind of openness. Mad respect to them for this article and for sharing software solutions!

evertheylen

> Maintaining a data center is much more about solving real-world challenges. The cloud requires expertise in company-specific APIs and billing systems. A data center requires knowledge of Watts, bits, and FLOPs. I know which one I rather think about.

I find this to be applicable on a smaller scale too! I'd rather set up and debug a beefy Linux VPS via SSH than fiddle with various proprietary cloud APIs/interfaces. Doesn't go as low-level as Watts, bits and FLOPs, but I still consider knowledge about Linux more valuable than knowing which Azure knobs to turn.

wessorh

What is the underlying filesystem for your KV store? It doesn't appear to use raw devices.

bovermyer

I'm thinking about doing a research project at my university looking into distributed "data centers" hosted by communities instead of centralized cloud providers.

The trick is in how to create mostly self-maintaining deployable/swappable data centers at low cost...

arjie

Realistically, it's the speed with which you can expand and contract. The cloud gives unbounded flexibility - not on the per-request scale or whatever, but on the per-project scale. To try things out with a bunch of EC2s or GCEs is cheap. You have it for a while and then you let it go. I say this as someone with terabytes of RAM in servers, and a cabinet I have in the Bay Area.

satvikpendem

I just read about Railway doing something similar. Sadly their prices are still high compared to other bare-metal providers, and even to a VPS from Hetzner with Dokploy: a very similar feature set, yet for the same 5 dollars you get way more CPU, storage and RAM.

https://blog.railway.com/p/launch-week-02-welcome

bradley13

Goes for small business and individuals as well. Sure, there are times that cloud makes sense, but you can and should do a lot on your own hardware.

Hasz

This is hackernews, do the math for the love of god.

There are good business and technical reasons to choose a public cloud.

There are good business and technical reasons to choose a private cloud.

There are good business and technical reasons to do something in-between or hybrid.

The endless "public cloud is a ripoff" or "private clouds are impossible" is just a circular discussion past each other. Saying to only use one or another is textbook cargo-culting.

rmoriz

Cloud, in terms of "other company's infrastructure" always implies losing the competence to select, source and operate hardware. Treating hardware as commodity will eventually treat your very own business as commodity: Someone can just copy your software/IP and ruin your business. Every durable business needs some kind of intellectual property and human skills that are not replaceable easily. This sounds binary, but isn't. You can build long-lasting partnerships. German Mittelstand did that over decades.

dagi3d

> San Diego power cost is over 40c/kWh, ~3x the global average. It’s a ripoff, and overpriced simply due to political dysfunction.

Would anyone mind elaborating? I always thought this was a direct result of the free market. Not sure if by dysfunction the OP means lack of intervention.

infecto

I love this article. Great write-up. Gave me the same feeling as when I would read about Stack Overflow's handful of servers that ran all of the sites.

monster_truck

Don't even have to go this far. Colocating in a couple regions will give you most of the logistical thrills at a fraction of the cost!

lawrenceyan

Hetzner bare metal ran much of crypto for many years before they cracked down on it.

nickorlow

Even at the personal blog level, I'd argue it's worth it to run your own server (even if it's just an old PC in a closet). Gets you on the path to running a home lab.

throwaway-aws9

The cloud is a psyop, a scam. Except at the tiniest free-tier / near free-tier use cases, or true scale to zero setups.

I've helped a startup with 2.5M revenue reduce their cloud spend from close to 2M/yr to below 1M/yr. They could have reached 250k/yr renting bare-metal servers. Probably 100k/yr in colos by spending 250k once on hardware. They had the staff to do it but the CEO was too scared.

Cloud evangelism (is it advocacy now?) messed up the minds of swaths of software engineers. Suddenly costs didn't matter and scaling was the answer to poor designs. Sizing your resource requirements became a lost art, and getting into reaction mode became law.

Welcome to "move fast and get out of business", all enabled by cloud architecture blogs that recommend tight integration with vendor lock-in mechanisms.

Use the cloud to move fast, but stick to cloud-agnostic tooling so that it doesn't suck you in forever.

I've seen how much cloud vendors are willing to spend to get business. That's when you realize just how massive their margins are.

segmondy

I cancelled my digital ocean server of almost a decade late last year and replaced it with a raspberry pi 3 that was doing nothing. We can do it, we should do it.

CodeCompost

Microsoft made the TCO argument and won. Self-hosting is only an option if you can afford expensive SysOps/DevOps/WhateverWeAreCalledTheseDays to manage it.

faust201

Look at the bottom of that page:

An error occurred: API rate limit already exceeded for installation ID 73591946.

Error from https://giscus.app/

Fellow says one thing and uses another.

nottorp

> We use SSDs for reliability and speed.

Hey, how do SSDs fail lately? Do they ... vanish off the bus still? Or do they go into read only mode?

gwbas1c

TLDR:

> In comma’s case I estimate we’ve spent ~5M on our data center, and we would have spent 25M+ had we done the same things in the cloud.

IMO, that's the biggie. It's enough to justify paying someone to run their datacenter. I wish there was a bit more detail to justify those assumptions, though.

That being said, if their needs grow by orders of magnitude, I'd anticipate that they would want to move their servers somewhere with cheaper electricity.

Havoc

Interesting that they go for no redundancy

tirant

Well, their comment section is for sure not running on premises, but in the cloud:

"An error occurred: API rate limit already exceeded for installation ID 73591946."

langarus

This is a great solution for a very specific type of team but I think most companies with consistent GPU workloads will still just rent dedicated servers and call it a day.

assaddayinh

Is there a client for selling off your own unused private cloud capacity?

rvz

Not long ago Railway moved from GCP to their own infrastructure since it was very expensive for them. [0] Some go for an Oxide rack [1] as a full-stack solution (both hardware and software) for intense GPU workloads, instead of building it themselves.

It's very expensive and only makes sense if you really need infrastructure sovereignty. It makes more sense if you're profitable in the tens of millions after raising hundreds of millions.

It also makes sense for governments (including those in the EU) which should think about this and have the compute in house and disconnected from the internet if they are serious about infrastructure sovereignty, rather than depending on US-based providers such as AWS.

[0] https://blog.railway.com/p/data-center-build-part-one

[1] https://oxide.computer/

RT_max

The observation about incentives is underappreciated here. When your compute is fixed, engineers optimize code. When compute is a budget line, engineers optimize slide decks. That's not really a cloud vs on-prem argument, it's a psychology-of-engineering argument.

Semaphor

In case anyone from comma.ai reads this: "CTO @ comma.ai" the link at the end is broken, it’s relative instead of absolute.

stego-tech

IT dinosaur here, who has run and engineered the entire spectrum over the course of my career.

Everything is a trade-off. Every tool has its purpose. There is no "right way" to build your infrastructure, only a right way for you.

In my subjective experience, the trade-offs are generally along these lines:

* Platform as a Service (Vercel, AWS Lambda, Azure Functions, basically anything where you give it your code and it "just works"): great for startups, orgs with minimal talent, and those with deep pockets for inevitable overruns. Maximum convenience means maximum cost. Excellent for weird customer one-offs you can bill for (and slap a 50% margin on top). Trade-off is that everything is abstracted away, making troubleshooting underlying infrastructure issues nigh impossible; also that people forget these things exist until the customer has long since stopped paying for them or a nasty bill arrives.

* Infrastructure as a Service (AWS, GCP, Azure, Vultr, etc; commonly called the "Public Cloud"): great for orgs with modest technical talent but limited budgets or infrastructure that's highly variable (scales up and down frequently). Also excellent for everything customer-facing, like load balancers, frontends, websites, you name it. If you can invoice someone else for it, putting it in here makes a lot of sense. Trade-off is that this isn't yours, it'll never be yours, you'll be renting it forever from someone else who charges you a pretty penny and can cut you off or raise prices anytime they like.

* Managed Service/Hosting Providers (e.g., ye olde Rackspace): you don't own the hardware, but you're also not paying the premium for infrastructure orchestrators. As close to bare metal as you can get without paying for actual servers. Excellent for short-term "testing" of PoCs before committing CapEx, or for modest infrastructure needs that aren't likely to change substantially enough to warrant a shift either on-prem or off to the cloud. You'll need more talent though, and you're ultimately still renting the illusion of sovereignty from someone else in perpetuity.

* Bare Metal, be it colocation or on-premises: you own it, you decide what to do with it, and nobody can stop you. The flip side is you have to bootstrap everything yourself, which can be a PITA depending on what you actually want - or what your stakeholders demand you offer. Running VMs? Easy-peasy. Bare metal K8s clusters? I mean, it can be done, but I'd personally rather chew glass than go without a managed control plane somewhere. CapEx is insane right now (thanks, AI!), but TCO is still measured in two to three years before you're saving more than you'd have spent on comparable infrastructure elsewhere, even with savings plans. Talent needs are highly variable - a generalist or two can get you 80% to basic AWS functionality with something like Nutanix or VCF (even with fancy stuff like DBaaS), but anything cutting edge is going to need more headcount than a comparable IaaS build. God help you if you opt for a Microsoft stack, as any on-prem savings are likely to evaporate at your next True-Up.

In my experience, companies have bought into the public cloud/IaaS because they thought it'd save them money versus the talent needed for on-prem; to be fair, back when every enterprise absolutely needed a network team and a DB team and a systems team and a datacenter team, this was technically correct. Nowadays, most organizational needs can be handled with a modest team of generalists or a highly competent generalist and one or two specialists for specific needs (e.g., a K8s engineer and a network engineer); modern software and operating systems make managing even huge orgs a comparable breeze, especially if you're running containers or appliances instead of bespoke VMs.

As more orgs like Comma or Basecamp look critically at their infrastructure needs versus their spend, or they seriously reflect on the limited sovereignty they have by outsourcing everything to US Tech companies, I expect workloads and infrastructure to become substantially more diversified than the current AWS/GCP/Azure trifecta.

squeefers

Mark my words: cloud will fall out of fashion, but it will come back in fashion under another name in some number of years. It's cyclical.

lovegrenoble

I've just shifted to Hetzner, no regret

petesergeant

One thing I don't really understand here is why they're incurring the costs of having this physically in San Diego, rather than further afield with a full-time server tech essentially living on-prem, especially if their power numbers are correct. Is everyone being able to physically show up on site immediately that much better than a 24/7 pair of remote hands + occasional trips for more team members if needed?

intalentive

I like Hotz’s style: simply and straightforwardly attempting the difficult and complex. I always get the impression: “You don’t need to be too fancy or clever. You don’t need permission or credentials. You just need to go out and do the thing. What are you waiting for?”

deadbabe

Clouds suck. But so does “on premises”. Or co-location.

In the future, what you will need to remain competitive is computing at the edge. Only one company is truly poised to deliver on that at massive scale.

kaon_2

Am I the only one that is simply scared of running your own cloud? What happens if your administrator credentials get leaked? At least with Azure I can phone microsoft and initiate a recovery. Because of backups and soft deletion policies quite a lot is possible. I guess you can build in these failsafe scenarios locally too? But what if a fire happens like in South Korea? Sure most companies run more immediate risks such as going bankrupt, but at least Cloud relieves me from the stuff of nightmares.

Except now I have nightmares that the USA will enforce the patriot act and force Microsoft to hand over all their data in European data centers and then we have to migrate everything to a local cloud provider. Argh...

rob_c

And finally we reach the point where you're not shot down for explaining that if you invest in ownership, then after everything is over you still have something left with intrinsic value, regardless of what you were doing with it.

Otherwise, well just like that gym membership, you get out what you put into it...

architsingh15

Looks insanely daunting imo

pelasaco

If I understood correctly, you don't use Kubernetes, right? Did you consider it?

devmor

> In a future blog post I hope I can tell you about how we produce our own power and you should too.

Rackmounted fusion reactors, I hope. Would solve my homelab wattage issues too.

vasco

Having worked only with the cloud, I really wonder if these companies don't use other software with subscriptions. Even though AWS is "expensive", it's just another line item compared to most companies' overall SaaS spend. Most businesses don't need that much compute or data transfer in the grand scheme of things.

mrbluecoat

Stopped reading at "Our main storage arrays have no redundancy". This isn't a data center, it's a volatile AI memory bank.

jongjong

Or better: write your software such that you can scale to tens of thousands of concurrent users on a single machine. This can really put the savings into perspective.
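
One common way to get there is an event-loop server that keeps per-connection overhead tiny. Below is a minimal sketch using Python's asyncio; the echo handler is a placeholder for real application logic, and OS limits such as the file-descriptor cap still apply.

```python
import asyncio

# Minimal single-machine event-loop server: one process, one thread, yet it can
# hold tens of thousands of mostly-idle connections because each one is just a
# coroutine. The echo handler below is a placeholder for real application logic.

async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    while data := await reader.read(4096):
        writer.write(data)        # echo back; swap in real request handling
        await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main() -> None:
    server = await asyncio.start_server(handle, host="0.0.0.0", port=8080)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```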

macmac_mac

Chatgpt:

# don’t own the cloud, rent instead

the “build your own datacenter” story is fun (and comma’s setup is undeniably cool), but for most companies it’s a seductive trap: you’ll spend your rarest resource (engineer attention) on watts, humidity, failed disks, supply chains, and “why is this rack hot,” instead of on the product. comma can justify it because their workload is huge and steady, they’re willing to run non-redundant storage, and they’ve built custom GPU boxes and infra around a very specific ML pipeline. ([comma.ai blog][1])

## 1) capex is a tax on flexibility

a datacenter turns “compute” into a big up-front bet: hardware choices, networking choices, facility choices, and a depreciation schedule that does not care about your roadmap. cloud flips that: you pay for what you use, you can experiment cheaply, and you can stop spending the minute a strategy changes. the best feature of renting is that quitting is easy.

## 2) scaling isn’t a vibe, it’s a deadline

real businesses don’t scale smoothly. they spike. they get surprise customers. they do one insane training run. they run a migration. owning means you either overbuild “just in case” (idle metal), or you underbuild and miss the moment. renting means you can burst, use spot/preemptible for the ugly parts, and keep steady stuff on reserved/committed discounts.

## 3) reliability is more than “it’s up most days”

comma explicitly says they keep things simple and don’t need redundancy for ~99% uptime at their scale. ([comma.ai blog][1]) that’s a perfectly valid trade—if your business can tolerate it. many can’t. cloud providers sell multi-zone, multi-region, managed backups, managed databases, and boring compliance checklists because “five nines” isn’t achieved by a couple heroic engineers and a PID loop.

## 4) the hidden cost isn’t power, it’s people

comma spent ~$540k on power in 2025 and runs up to ~450kW, plus all the cooling and facility work. ([comma.ai blog][1]) but the larger, sneakier bill is: on-call load, hiring niche operators, hardware failures, spare parts, procurement, security, audits, vendor management, and the opportunity cost of your best engineers becoming part-time building managers. cloud is expensive, yes—because it bundles labor, expertise, and economies of scale you don’t have.

## 5) “vendor lock-in” is real, but self-lock-in is worse

cloud lock-in is usually optional: you choose proprietary managed services because they’re convenient. if you’re disciplined, you can keep escape hatches: containers, kubernetes, terraform, postgres, object storage abstractions, multi-region backups, and a tested migration plan. owning your datacenter is also lock-in—except the vendor is past you, and the contract is “we can never stop maintaining this.”

## the practical rule

*if you have massive, predictable, always-on utilization, and you want to become good at running infrastructure as a core competency, owning can win.* that’s basically comma’s case. ([comma.ai blog][1]) *otherwise, rent.* buy speed, buy optionality, and keep your team focused on the thing only your company can do.

if you want, tell me your rough workload shape (steady vs spiky, cpu vs gpu, latency needs, compliance), and i’ll give you a blunt “rent / colo / own” recommendation in 5 lines.

[1]: https://blog.comma.ai/datacenter/ "Owning a $5M data center - comma.ai blog"

BoredPositron

capex vs opex the Opera.

barbazoo

And now go do that in another region. Bam, savings gone. /s

What I mean is that I'm assuming the math here works because the primary purpose of the hardware is training models. You don't need 6 or 7 nines for that is what I'm imagining. But when you have customers across geography that use your app hosted on those servers pretty much 24/7 then you can't afford much downtime.