(Almost) Every infrastructure decision I endorse or regret after 4 years at a startup

Title and idea blatantly stolen from Jack Lindamood.

"I’ve led infrastructure at a startup for the past 4 years that has had to scale quickly. From the beginning I made some core decisions that the company has had to stick to, for better or worse, these past four years. This post will list some of the major decisions made and if I endorse them for your startup, or if I regret them and advise you to pick something else."

I'm almost at the 4 year mark at Tend, which also means the main technical stack is around the 4 year mark. A lot of the same things are on my mind - did we (and sometimes, I) make the right decision around some key things, or could we have done better? If they were not right - is it something we can/should roll back?

Unlike Jack, I've not spent 4 years "running infrastructure", tho it was my main focus for a lot of the first 18 months. And not all the decisions were mine - we started with 4 engineers, and a lot of the decisions were group ones, tho with some pretty heavy biases - with good reasons.

💚 : Good idea; 🔴 : Bad idea; 🟠 : Undecided

Cloud / AWS / Data / Infra

Decisions made around our infrastructure

AWS

💚 unreserved endorsement

I'm not sure we actually had any other choice. Google was a non-starter (hello killedbygoogle.com), Azure might have worked but... We had a very good relationship with AWS at the previous job (of which our CPTO was the VPEng), and it's what I knew well. Likely we should have put more stuff in Sydney, not Oregon (can't engineer around physics!), but once the NZ region is bedded in, we'll likely move here, at least for customer-facing stuff. 150ms RTT across the Pacific for every call is a bit of a killer.
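To put rough numbers on that, here is a back-of-the-envelope sketch. The 150ms figure is the one above; the 30ms Sydney round trip and the five sequential calls per screen are assumptions for illustration, not measurements.

```typescript
// Total network wait when each call has to finish before the next starts.
function sequentialLatencyMs(calls: number, rttMs: number): number {
  return calls * rttMs;
}

const CALLS_PER_SCREEN = 5; // assumed, for illustration
const OREGON_RTT_MS = 150;  // from the text above
const SYDNEY_RTT_MS = 30;   // assumed ballpark, not measured

const oregonWaitMs = sequentialLatencyMs(CALLS_PER_SCREEN, OREGON_RTT_MS); // 750
const sydneyWaitMs = sequentialLatencyMs(CALLS_PER_SCREEN, SYDNEY_RTT_MS); // 150
```

That is network wait only, before any server time, which is why region choice matters more than most code-level tuning for chatty clients.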

Aurora Serverless v1

💚 going to 🟠

Serverless v1 came out about when we started. It was easy to set up and use from Lambda - especially with DataAPI, and, well, it just worked.

We definitely hit some walls with it last year tho - old versions of Postgres, didn't look like it was getting any love at all, no performance insights... So we moved to Aurora-proper.

I don't regret using it - it was right at the time. But for now? No.

Aurora Serverless v2

🔴

We tried to move to using Serverless v2 once we got on Aurora, but we have a fairly consistent workload, and the extra cost was substantial. We don't need scale to zero, so just having some regular db.x2g.xlarge instances works better for us.

We DO use it for dev tho. At least at the moment. That is likely to change soon.

Aurora in general

💚

Brilliant. No notes.

We had to do a load of re-engineering to get to using it tho - DataAPI wasn't available for Aurora, so we had to go to using normal DB connections and being in the VPC. Conveniently, AWS made this quick and painless (no cold start tax in VPCs anymore), so it was mostly on our end. Having direct, proxied, scalable connections to the database has been wonderful tho. DataAPI is great for small stuff, but for larger apps, having a direct connection has been priceless.
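One common pattern when moving Lambda functions to direct connections is to create the connection once per container and reuse it across invocations. The sketch below shows that idea in general terms; `DbClient`, `memoizeConnection`, and the `connect` callback are hypothetical names, and this is not necessarily how our code does it.

```typescript
// Minimal stand-in for whatever database client is in use (hypothetical).
interface DbClient {
  query(sql: string): Promise<unknown>;
}

// Wrap a connect function so the connection is opened at most once per
// container. If connecting fails, the cache is cleared so the next call retries.
function memoizeConnection(
  connect: () => Promise<DbClient>
): () => Promise<DbClient> {
  let cached: Promise<DbClient> | undefined;
  return () => {
    if (!cached) {
      cached = connect().catch((err) => {
        cached = undefined;
        throw err;
      });
    }
    return cached;
  };
}
```

The returned function would live at module scope, outside the handler, so warm invocations skip the connection setup entirely.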

Using a "id+data" structure, not columns (and DIY ORM)

🟠

Our table structure is largely a GUID id column, created_at/updated_at/deleted_at timestamps, and a data blob of JSONB. Sometimes we hoist some of the fields in the data blob up into their own columns (historical reasons around Postgres not indexing JSONB well).

This has worked remarkably well. We get a lot of the benefits of a KV store like DynamoDB, without needing to work out how to get it out and into a data warehouse. Being able to iterate on the data block (data model / schema) incredibly rapidly has been priceless.

We had to do our own "it's not an ORM" on top. That part I'm not sure I'd recommend. But it's given us a load of flexibility and agility. Swings and roundabouts.
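As a rough sketch of the shape described above, in TypeScript. Everything here is illustrative: `Row`, `PatientData`, `findByDataField`, and `hydratePatient` are made-up names, not our actual layer.

```typescript
// Fixed bookkeeping columns plus a JSONB "data" blob.
interface Row<T> {
  id: string;
  created_at: Date;
  updated_at: Date;
  deleted_at: Date | null;
  data: T;
}

// Example payload; the fields are invented for illustration.
interface PatientData {
  name: string;
  email?: string;
  schemaVersion: number;
}

interface Query {
  text: string;
  values: unknown[];
}

// Build a parameterised lookup on one top-level key inside the JSONB blob.
// `table` and `key` are interpolated into the SQL, so they are validated
// and must come from code, never from user input.
function findByDataField(table: string, key: string, value: string): Query {
  const ident = /^[a-z_][a-z0-9_]*$/i;
  if (!ident.test(table) || !ident.test(key)) {
    throw new Error(`unsafe identifier: ${table}.${key}`);
  }
  return {
    text: `SELECT * FROM ${table} WHERE deleted_at IS NULL AND data->>'${key}' = $1`,
    values: [value],
  };
}

// Turn a raw row into a typed one, filling defaults for fields that older
// blobs predate. This is where rapid schema iteration gets absorbed.
function hydratePatient(raw: Row<Partial<PatientData>>): Row<PatientData> {
  return {
    ...raw,
    data: {
      ...raw.data,
      name: raw.data.name ?? "",
      schemaVersion: raw.data.schemaVersion ?? 1,
    },
  };
}
```

The `{ text, values }` shape is the query config form that node-postgres accepts, so a builder like this can sit directly in front of a normal client. The hydrate step is the part that grows over time, which is most of why I hesitate to recommend rolling your own.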

Metabase

💚 unreserved

So easy to just point it at a DB and go write reports (see previous data blobs). This has been expanded a LOT since, and now works with data from a number of different sources, APIs, and other things via DBT and DuckDB.

When they took away having a write connection a few of us were not happy. But for read-only, it's great.

DuckDB

💚 unreserved

A lot of our data from our core system comes in as Parquet in S3, and we use DBT to pull that in and drop it into a DuckDB database. So fast, so easy, I have no idea how they have made it do this. It's magic.

Terraform

💚 mostly

Terraform: fine.

My use of Terraform 4 years ago: less fine. We need to do a bunch of refactoring / rework now, largely cos I could ignore the infrastructure while we built the rest of the product out.

Still wouldn't use another tool. Just might structure it differently.

Serverless (ie, serverless.com)

💚 mostly

This is an odd one. It's worked for us - especially locally. But it's not hard to see the rough edges and compromise points.

But outside of a few things I'd like to change (one build -> deploy to many environments with config changes only; redeploying an old package! Tho that's CloudFormation), it's served us well.

The landscape has changed a lot in 4 years tho. I'd re-evaluate before picking it today. We might move to deployments being a container + Terraform, but that's a fairly decent lift for not a huge return.
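For what "one build, many environments" could look like on the application side, here is a minimal sketch. The `STAGE` variable name, the `StageConfig` fields, and the URLs are all assumptions for illustration; the point is only that the artifact stays identical and the environment picks the config at start-up.

```typescript
// Hypothetical per-environment settings.
interface StageConfig {
  apiBaseUrl: string;
  logLevel: "debug" | "info" | "warn";
}

const CONFIG: Record<string, StageConfig> = {
  dev: { apiBaseUrl: "https://api.dev.example.com", logLevel: "debug" },
  prod: { apiBaseUrl: "https://api.example.com", logLevel: "info" },
};

// Resolve config from environment variables at start-up, so the same built
// package can be promoted from dev to prod without a rebuild.
function loadConfig(env: Record<string, string | undefined>): StageConfig {
  const stage = env["STAGE"] ?? "dev";
  const config = CONFIG[stage];
  if (!config) {
    throw new Error(`unknown stage: ${stage}`);
  }
  return config;
}
```

In Node this would be called as `loadConfig(process.env)`. It does nothing for the deploy tooling itself, which is where the real limitation sits.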

Tailscale

💚

So useful to have a mesh network which can drop us into the dev VPC from our laptops, or prod if the ACLs allow for it.

Or drop us onto one of the clinic's networks to see why something isn't working over there. Incredibly useful.

Process

How we do things - or don't do things.

Being very light on process

💚 tho it might be starting to bite us a bit

That's less because the team has gotten bigger, and more because the complexity and scope of what we are doing has changed. More external input, less "just build the product out".

Exposing costs to engineers

💚

Definitely a win. OK, maybe only for me... but we have a bot which posts the spend every Monday and Friday. Handy to spot large jumps.
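The "spot large jumps" part is simple enough to sketch. This is not our bot, just an illustration: `DailySpend`, `spendSummary`, and the 25% threshold are made up, and a real version would pull the numbers from whatever billing export is in use.

```typescript
// One day's total spend (hypothetical shape).
interface DailySpend {
  date: string; // ISO date, e.g. "2024-03-04"
  usd: number;
}

// Compare the latest day against the average of the preceding days and
// flag it when it exceeds that average by more than `threshold` (0.25 = 25%).
function spendSummary(days: DailySpend[], threshold = 0.25): string {
  if (days.length < 2) {
    throw new Error("need at least two days of spend to compare");
  }
  const latest = days[days.length - 1];
  const earlier = days.slice(0, -1);
  const baseline = earlier.reduce((sum, d) => sum + d.usd, 0) / earlier.length;
  const change = baseline === 0 ? 0 : (latest.usd - baseline) / baseline;
  const sign = change >= 0 ? "+" : "";
  const pct = (change * 100).toFixed(1);
  const flag = change > threshold ? " - large jump" : "";
  return `${latest.date}: $${latest.usd.toFixed(2)} (${sign}${pct}% vs ${earlier.length}-day average)${flag}`;
}
```

The returned string is what would get posted to chat; the flag is the only part anyone really reads.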

SaaS

External services we use for the platform, or the processes

Honeycomb

💚 mostly

The "mostly" comes from the 60 day window. Otherwise: love it.

We do need to go to OTEL tho - just need to work out how we do the longer retention we need. We get that from CloudWatch right now, but CWL's ingestion is expensive, even if the storage is dirt cheap. Still haven't solved this one.

I've written here about it quite a bit, and our local dev logs go in there now too.

Mixpanel

💚

Once we stopped using it as a log for the app and website... it's a lot more useful 😄. Cheap to run, easy to integrate, good to answer things like "how many people went from the enrolment to the booking flow....". The API lets us get to it from Airbyte and into the Data Warehouse.

Sentry

💚

Speaking of cheap to run... Another service we are either not holding right, or have only recently started to hold right. Get used to breadcrumbs and the other bits of it, and use them.

Stripe

💚

Gets out of the way. Does the job. No notes.

We MIGHT need to move soon tho - some of the things they are required not to do fall in the space of things we have to do. Or we vault cards with them and with someone local, for the cases where we can't use Stripe.

Future problem, but likely to happen.

YouTrack

🟠

As a bug tracker / ticket system, it's... ok. Problem is, it doesn't play with anything else - or rather, nothing else plays with it.

Outside of that, it works, and works fairly well. I don't get the hate for Jira, and I really miss the Jira + Confluence integration. If I could have that with YouTrack + something (Notion, even Coda) that'd be great.

Coda

🔴

Well, it works. Mostly. But its document model is weird, its tables are.... data tables (nope, I don't want to embed it in 2 places thanks), it remaps a lot of keys on me...

I swing between "it works and largely gets out of my way" and "I hate this with a fiery passion reserved for people who kick puppies and kittens".

We have a lot of cross-department content in here tho, mostly cos its pricing model allows a lot of readers (free) and only a few writers (paid). Which works for us.

We've started moving a lot of engineering (not product) docs into Docusaurus + GitHub + S3/CloudFront. If that had a hosted editor, it'd be an easy decision to move more or all of product + engineering.

If I had to pick again, for Product / Engineering, I'd go with Notion, I think. Or Confluence maybe. The pricing model for bits of the business where there is a 1 writer to 10 or 50 readers is hard to beat tho.

Buildkite

💚

While I don't love YAML, I can't complain at all about Buildkite. It's... just worked. I much prefer the model of a hosted control plane, with self-hosted build agents.

The AWS setup is easy, and it's been rock solid reliable for the whole time.

Hard Recommend, to the point where I use it to manage my home lab, too.

Bitrise

💚

See Buildkite, except for mobile. It's been a solid workhorse for 4 years. We use their build agents tho - hell no to AWS Mac pricing - but they have limited access to anything in our environment.

Software

Lambda as compute

💚

They require a bit of re-thinking, but the maintenance aspect is hard to beat. Not to mention the cost. Our model is a monorepo with a load of services (apis, feature/area specific ones), which are made of 1 to 20 lambda functions.

Not sure I'd split things up into lambda-calling-lambda services tho. Slowly rolling that one back a bit (outside of async and queue-based usages). I prefer packages at the language level.

I'm trying to think about the extra work we'd have if we were using ECS - even via Fargate. I think we'll likely move to using containers as a distribution mechanism soon, and possibly Terraform to deploy it all (hello rollback!) but lambda itself has been rock solid.

Postgres

💚 💚

Are there any other (relational) databases, really? Handy that we have a Postgres Savant (or 2) on staff tho, but the database itself has more than proven itself.

GraphQL

💚 (with a bit of 🟠 )

GraphQL itself I like, especially for mobile. Not sure that Apollo is ideal tho - especially on mobile devices. It's more of a state management system with a networking layer, and we are holding it a bit wrong.

Much nicer than having to version REST tho, even if that's more of a known quantity.

Lambda + API Gateway

💚

Best thing about serverless is you can largely set it up and ignore it. API Gateway (HTTP) has covered our needs so far - GraphQL, REST, websockets...

Typescript / Node

💚

Yeah, not doing JavaScript without types. I wish Node would run it without transpilation, but it's not that big a deal.

React

💚

Largely gets out of the way - and pretty much everyone understands it.

React Native

💚

Only gotcha is: make sure you keep fairly up to date. Stay off the bleeding edge, but don't lag too far back.

I don't think we'd ship at the rate we do with 2 native apps, and we'd need a much much bigger team. I'm pretty sure the result would be worse, or at best the same, too.

For clarity, a single engineer might touch the mobile, web, and backend api, all in one ticket. Having them all in a single, consistent language / platform is pretty valuable, even if we are not sharing code between them.

Cognito

🟠

I have a bit of a love/hate relationship with Cognito. I love that it works, and gives us almost no issues. However, I'm aware that getting our users out if we needed to would be problematic, and it has some big "black box" components.

We also use the Amplify libraries for some of the front end bits (mobile, web) which is where we hit issues. Just don't expect reported issues to get fixed - they'll just be ignored.

SNS for SMS

🔴 🔴 Just don't.

We started using AWS SNS to send SMS. We got a 50-90% hit rate across various NZ carriers. AWS had no idea - just passed it on to the next carrier level up. Who also had no idea. So...

Twilio

💚

... we moved to Twilio, who (before we even sent a single SMS) told us "oh, you're in NZ, you need to fill out this form to get allow listed with the carriers".

Near 100% success rate since.

We also use them for voice, and while it'd be nice to be able to call out from an 0800 number (not actually a Twilio issue) it's otherwise been great.

OpenTok / Tokbox

💚

We use them for video calling in the app. The service itself has been stable, but while they provide an unsupported React Native component, it's... unsupported. Or was for ages. Better recently, but it was odd using something produced by the service, but not at all supported.

A few bits where I'd like to get some more telemetry, but it's worked well and given us next to no issues - video calling is an absolutely critical part of the app, so for it to go down would be a big problem.

The backup is Twilio, but their video offering doesn't do React Native either, and moving would be a HUGE change.

Metrics for context

We are a team of 7 engineers, 2 product managers and 2 designers, with a data engineer and a clinical specialist (our core system). We are mostly in Auckland, but 3 of us are around NZ or Australia. The wider company has around 450 staff.

We have a customer base in the high tens of thousands, heading towards the low hundreds of thousands, but one which expects 100% uptime, or close to it. Not being able to have an appointment with your doctor can be a life or death situation.

We deploy at will - it takes about 20 mins from merge to master, to code being in production, outside of however long it takes to check it in dev. 10 mins for the front end web app, and about 15 to get an app build into TestFlight.

We deploy to production between 5 and 20 times a day, depending on the day. Yes, including Fridays.

We do app releases around once a week, sometimes more, sometimes once every 2-3 weeks if we don't have a lot to ship.

Our AWS spend is mostly database - compute is <10%.
