It’s been a couple of years of working full-time on Luro, and we’ve travelled through at least three (or four?) distinct architectures. If that sounds like a lot, I’d agree. It’s been educational, to say the least.
I think it’s interesting to think about how apps grow and adapt over years of contributions, so I thought I’d share how we evolved Luro. Each refactor and re-architecture was a considered choice made with the entire team, balancing business and customer needs at the time while maintaining new feature velocity.
Here is the great retelling…
v0: The Not-so Serverless Jamstack Hybrid Prototype
- Customer Need: Ability to preview application
- Internal Need: Prove idea could work
The prototype version of Luro was a Serverless Nuxt app deployed entirely on Netlify. Each CRUD operation was its own serverless function that connected to our Postgres database on Digital Ocean. Netlify and its branch deploys allowed for rapid deployment and the ability to spin up demo instances at a custom subdomain as we shopped the idea around.
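To make that concrete, a v0-style endpoint was roughly one Netlify Function per CRUD operation, each talking straight to Postgres. Here’s a sketch of that shape (the table and environment variable names are illustrative, not Luro’s actual schema):

```ts
// Hypothetical sketch: one Netlify Function per CRUD endpoint, each opening
// its own connection to Postgres. Table and env var names are illustrative.
import type { Handler } from "@netlify/functions";
import { Client } from "pg";

export const handler: Handler = async () => {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  const { rows } = await client.query("SELECT id, title FROM prototypes");
  await client.end();
  return { statusCode: 200, body: JSON.stringify(rows) };
};
```

Multiply that by every endpoint in the app and you get the long, flaky builds described below.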
This was a great proof of concept but we soon hit big pain points. The biggest hurdle was that the serverless functions for all our endpoints were taking upwards of 20 minutes to build. That led to flaky deployments so we combined all our serverless functions into an Express app inside a single function. That took 5 minutes to build. This approach worked until…
Some features (e.g., Lighthouse) didn’t fit within Lambda’s size limitations (69 MB at the time) and our tasks were too long for background functions. The solution was to stand up a small Node service that could handle longer tasks. Each Lighthouse run generates about 2 MB of JSON data, so we stored that inside a MongoDB that sat inside the Node server.
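Roughly, that service’s job looked something like this sketch — run Lighthouse programmatically and dump the full report into MongoDB (the database and collection names here are my own illustration):

```ts
// Hypothetical sketch: run Lighthouse in the long-task Node service and store
// the full ~2 MB report in MongoDB. Names are illustrative.
import lighthouse from "lighthouse";
import { launch } from "chrome-launcher";
import { MongoClient } from "mongodb";

async function runLighthouse(url: string) {
  const chrome = await launch({ chromeFlags: ["--headless"] });
  const result = await lighthouse(url, { port: chrome.port, output: "json" });
  await chrome.kill();

  const mongo = new MongoClient(process.env.MONGO_URL!);
  await mongo.connect();
  await mongo
    .db("luro")
    .collection("lighthouse_reports")
    .insertOne({ url, report: result?.lhr, createdAt: new Date() });
  await mongo.close();
}
```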
A lot of moving parts, but it worked well enough to start having conversations with customers and investors.
v1: The Single-tenant Supermodel Monolith
- Customer Need: Isolated deployment environments
- Internal Need: Reduce number of servers and services
Our first batch of potential (enterprise) customers expressed a need to deploy the app behind their firewall. Looking at our Nuxt + Express API inside a Netlify Function + Node + Postgres + MongoDB franken-app deployed across two separate web hosting platforms, we started exploring a Docker setup to bundle all the moving parts.
This was an opportunity to simplify! And simplify we did.
- We merged the Nuxt, Express, and Node app into a single app server. This simplified the Docker setup and allowed us to start server rendering Nuxt for a performance boost.
- We also simplified our content model to a single database table. Before, our content types (prototypes, user tests, research, pages, etc.) each lived in their own table, but their schemas were nearly identical (`id`, `title`, `description`). That separation is good until you need to duplicate a model and its views or roll a feature out across all models and views. We made the decision to use a single “supermodel” approach where all content types live in the same table.
- We eliminated MongoDB from the stack. One major challenge was that our task queue (agenda) had a hard dependency on MongoDB. Through some effort we moved over to `graphile-worker` and used a `json` field type in our PostgreSQL database to store the big blobs of JSON (see the sketch after this list).
- To make client responses smaller, we stripped out the unused parts of the Lighthouse report and streamlined it from 2 MB to 30 KB per report.
- To smooth out customer deployments, we built a Terraform script with a CLI interface to automate building isolated environments on Digital Ocean with an app server, database, and S3 bucket, as well as adding a new DNS record to give each instance its own vanity subdomain.
- We moved the application off of Netlify, which simplified our hosting situation. On the downside, in the move to Digital Ocean we lost our branch deploy strategy. We propped up a staging server but it wasn’t the same as an automated branch deploy.
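As a rough illustration of how a few of those pieces fit together (the `graphile-worker` queue, the Postgres `json` column, and the trimmed Lighthouse report), here’s a sketch of what a worker task might look like. The task name, table, and columns are hypothetical, not Luro’s actual schema:

```ts
// Hypothetical sketch: a graphile-worker task that trims a Lighthouse report
// and stores it in a Postgres json column on the "supermodel" table.
import { run } from "graphile-worker";

async function main() {
  const runner = await run({
    connectionString: process.env.DATABASE_URL,
    concurrency: 2,
    taskList: {
      lighthouse: async (payload: any, helpers) => {
        const { contentId, report } = payload;
        // keep only what the UI needs; the full report is ~2 MB, this is closer to ~30 KB
        const trimmed = {
          fetchTime: report.fetchTime,
          categories: report.categories,
        };
        await helpers.query(
          "UPDATE content SET lighthouse_report = $1 WHERE id = $2",
          [JSON.stringify(trimmed), contentId]
        );
      },
    },
  });
  await runner.promise;
}

main().catch(console.error);
```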
After all these improvements the app felt like a lightweight machine.
Docker wasn’t utilized well but it made an easy pathway for deploying to Digital Ocean. I’m glad we didn’t invest too much into Docker, because over time it became a bottleneck we’d have to rip out.
v2: The Multi-tenant Monolith
- Customer Need: Reduce onboarding friction
- Internal Need: Less manual labor in setting up, tearing down, and updating instances
Through some positive investor pressure and some exhaustion setting up customer installs, we decided the app should be more self-service. This was a big undertaking.
While terraforming a new customer install wasn’t hard (a CLI command + a 10-minute build), the customer feedback loop was too long. Someone would sign up for Luro, then we’d email them later, then they’d say “yes I want it”, then it’d take 10 minutes to set up, then we’d send an email saying “okay it’s ready”… that experience cut the winds of excitement out of the sales funnel.
To make this work, we needed to create the idea of a `Team` in Luro and make everything belong to that team. Having Teams meant building a new permission system, authentication system, and invitation system. But the biggest need was to design and build new onboarding flows throughout the application because we wouldn’t be doing “white glove” installs anymore. This was a big lift for a small team.
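At its core, “everything belongs to that team” mostly means every request resolves a team and every query gets scoped by it. A minimal sketch of that pattern, assuming Express middleware and a `team_id` column (the names, session shape, and table are illustrative):

```ts
// Hypothetical sketch: resolve the current team on every request and scope
// all queries by team_id. Column, table, and session shape are illustrative.
import express, { type Request, type Response, type NextFunction } from "express";
import { Pool } from "pg";

const db = new Pool({ connectionString: process.env.DATABASE_URL });
const app = express();

interface TeamRequest extends Request {
  teamId?: string;
}

function requireTeam(req: TeamRequest, res: Response, next: NextFunction) {
  // e.g. set at login or when an invitation is accepted
  const teamId = (req as any).session?.teamId;
  if (!teamId) return res.status(403).json({ error: "No team membership" });
  req.teamId = teamId;
  next();
}

app.get("/api/content", requireTeam, async (req: TeamRequest, res) => {
  // every content row now belongs to a team instead of an isolated instance
  const { rows } = await db.query(
    "SELECT id, title FROM content WHERE team_id = $1",
    [req.teamId]
  );
  res.json(rows);
});
```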
At the end of the three-month sprint we made an impromptu decision to move from Digital Ocean to Render. Our Docker deploys were taking 20 minutes to build the image plus 10 minutes to switch over to the new image and stand up the environment. That long feedback loop was killing us with design reviews. We made the jump to Render, killed our `Dockerfile`, got back to 5-minute deploys, and regained the ability to have Preview Deploys for each PR. That was a slam dunk win-win.
This refactor was a lot of work (and re-work) but thanks to two full-time hires we were ready to begin our Private Beta by the end of the summer.
v3: The Multi-tenant Microservice Monorepo
- Customer Need: Application uptime, reliability
- Internal Need: Observability and isolation of services
As people signed up and began to use the application, we discovered a new set of problems. If one CPU-intensive service (e.g. Lighthouse) had a problem during its run, the entire application server would fall over, causing cascading service failures. The “stampede” of failures made it difficult to debug the source of our problems.
The biggest improvement in observability was converting our monolith into a microservice monorepo using Turborepo. Using Render’s Blueprints (Infrastructure as Code), we could use YAML to configure servers for each app and cronjob in the `apps/` folder.
We had some intermittent problems with the `graphile-worker` queue crashing, so we pivoted to RabbitMQ hosted on CloudAMQP as a message queue. RabbitMQ is like Redis but a little more robust, with a lot less “build your own” functionality (monitoring, etc.). Our apps or cronjobs would post messages to specific queues and then the services would process those queues.
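The publish/consume flow is small. Here’s a sketch using amqplib against CloudAMQP, assuming a queue named “lighthouse” (the queue name and payload shape are my own illustration):

```ts
// Hypothetical sketch: an app posts a message to a named queue on CloudAMQP
// and a worker service consumes it. Queue name and payload are illustrative.
import amqp from "amqplib";

async function publishLighthouseRun(url: string) {
  const conn = await amqp.connect(process.env.CLOUDAMQP_URL!);
  const channel = await conn.createChannel();
  await channel.assertQueue("lighthouse", { durable: true });
  channel.sendToQueue("lighthouse", Buffer.from(JSON.stringify({ url })), {
    persistent: true,
  });
  await channel.close();
  await conn.close();
}

async function consumeLighthouseRuns() {
  const conn = await amqp.connect(process.env.CLOUDAMQP_URL!);
  const channel = await conn.createChannel();
  await channel.assertQueue("lighthouse", { durable: true });
  await channel.consume("lighthouse", async (msg) => {
    if (!msg) return;
    const { url } = JSON.parse(msg.content.toString());
    // ...run the Lighthouse job here, then acknowledge the message...
    channel.ack(msg);
  });
}
```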
With our services properly split up, we spent more time improving our logging to ensure we had end-to-end error reporting. Each service logged to its own bucket, which was possibly overkill, but if an error showed up in the Lighthouse bucket we could pinpoint which part of the application had failed.
And this is the shape of the application today. It’s a pretty elegant machine. We’re far from the days of servers crashing every weekend. We’ve worked hard to stabilize the application and now it doesn’t make much more than a peep.
101 things I’ve learned about software architecture
I’ve learned a lot about building and organizing large software projects through this. If I had to do it all over again, I don’t think it would be much different given the information we had at the time. With the power of hindsight, we probably could have skipped one or two evolutions, but that was less a lack of engineering imagination or foresight and more a matter of the product needing clearer business objectives. If we had dropped the “white glove” idea earlier or had a narrower market (TAM/SAM/SOM), we could have spotted opportunities further out.
I learned hundreds or thousands of lessons from all this, but here are my Top 3 key takeaways…
1. It’s always a “100 duck-sized horses vs one horse-sized duck” problem
There’s an old internet question about “Would you rather fight 100 duck-sized horses or one horse-sized duck?” Well. Software is a lot like that. You expand, then consolidate, expand, then consolidate. Over and over. For example…
- The Jamstack app = 100 duck-sized horses
- The Supermodel monolith = 1 horse-sized duck
- Single-tenant deployments = 100 duck-sized horses
- Multi-tenant app = 1 horse-sized duck
- Microservice monorepo = 100 duck-sized horses
Even on the component level, you don’t want a bunch of bespoke tables, so you create a `<SuperTable/>` component, but then you need a bunch of `<BespokeCell/>` components to slot inside the table. At a certain point you can’t reduce complexity any further and you’re left shifting complexity around and reshaping it. The goal is to end up with something less complex or more predictable.
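A framework-agnostic sketch of that supertable/bespoke-cell split (the names and shape are illustrative, not Luro’s actual components):

```ts
// Hypothetical sketch: one generic "super table" that accepts per-column
// "bespoke cell" renderers, instead of many one-off table components.
type Column<Row> = {
  key: keyof Row;
  label: string;
  cell?: (row: Row) => string; // bespoke rendering slotted into the shared table
};

function renderTable<Row>(rows: Row[], columns: Column<Row>[]): string {
  const header = columns.map((c) => c.label).join(" | ");
  const body = rows
    .map((row) =>
      columns.map((c) => (c.cell ? c.cell(row) : String(row[c.key]))).join(" | ")
    )
    .join("\n");
  return `${header}\n${body}`;
}

// Usage: the table stays generic, the cells carry the bespoke logic.
const columns: Column<{ title: string; score: number }>[] = [
  { key: "title", label: "Title" },
  { key: "score", label: "Score", cell: (r) => `${Math.round(r.score * 100)}%` },
];
console.log(renderTable([{ title: "Homepage", score: 0.92 }], columns));
```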
I think the correct answer is 100 duck-sized horses, but you need good systems in place for managing that. And 5 to 10 is probably a better number than 100.
2. Monorepos are pretty great
For managing all those small apps, I recommend Turborepo! Monorepos make it easy to break up your application into separate services or shared packages (like a design system). You can start the entire system with one command from the terminal. You have fewer mysteries and phantom configs menacing your codebase when everyone is working in the same neighborhood. You can maintain velocity by changing an upstream API service and the UI layer in a single PR, with less flight control.
We had hints of a service-oriented architecture in our v0 Node Background Server, but it existed outside (not alongside) the UI application. Monorepos are not without their issues, but I’d reach for Turborepo again for applications that don’t fit inside one folder on one server.
3. I over-index on the belief that we can fix it later
I’m a firm believer in iteration and believe that in most situations we can fix issues later. That can be an admirable quality, but it’s a choice to create technical debt. At times the technical or style debt I created blew up and we spent more time teasing out an abstraction later on. If we had “done it right” the first time we would have saved time, but we also didn’t have enough information to make a proper abstraction and would have fallen victim to YAGNI (You Ain’t Gonna Need It) over-optimization.
In the end, there were micro-adjustments we could have made, but with the pace of startup life (groan) you’re always in a hurry to do the next thing. I think we balanced that fairly well.
Just build websites
This whole experience has taught me one valuable lesson: You learn a lot by doing. You learn a lot by doing it the long, hard, stupid way. And you learn a lot by working with other talented people.
Thanks to everyone who has helped and participated and contributed code over the years: Trent Walton, Reagan Ray, Julian Jones, Scott Robbin, Kyle Zinter, Emmett Naughton, Eli Van Zoeren, and René Merino.