One of the decisions a CTO frequently has to make is whether to build some piece of functionality in-house or buy it from a third-party vendor. In this post, I share my framework for making these decisions.
I also include a case study each for a buy decision and a build decision. Added bonus – some thoughts on whether you should sell something you’ve decided to build.
Functional Complexity
One of the most important determinants of a buy-vs-build decision is the ability to spec out the entire functionality. Often we know what the primary functionality should be – e.g. being able to show metrics on a graph (a monitoring solution). The devil is in the details, though, and spending some time discovering peripheral functionality helps you avoid jumping head-first into building something seemingly simple and then getting stuck.
E.g. how does the monitoring solution handle server-churn for the services it is monitoring? Does it rerun the entire query when auto-refresh is turned on? etc.
One effective way to discover peripheral functionality is to look across competing products that offer the solution and see how they differentiate themselves from competitors. Each of those differentiators is functionality that is relevant to some customer, and could be relevant to you as well.
Technical Complexity
Knowing what to build (functionality) is a starting point. The next step is to understand whether you have the capability to build that functionality. It’s not uncommon to find that functionally complex products are not technically complex and vice-versa. Functional complexity arises out of the uncertainty in specifying the desired behaviour in a way that is known to be easy to code.
Technical complexity arises out of the difficulty of meeting the functional requirements in repeatable and reliable ways. E.g. interactive information retrieval remains a technically complex domain despite years and years of R&D into the field. The business requirements and scale requirements keep growing fast enough to outpace any simplifying developments that have taken place in the field.
Having said that, this is one of the less difficult criteria for evaluation. Provided you have a good enough understanding of the functional complexity, you know whether you have the skills to build it in-house or whether you can hire to build the skill-set.
Operational Complexity
Things that are easy to build could be deceptively difficult to operate. Keeping something running reliably with predictable performance under stress is especially hard.
Having spent the last 10+ years handling “discount sale” events for everything from book publishers to game engines to e-commerce platforms, I have a special distaste for all the F/OSS technology that works beautifully at small scale, only to blow up spectacularly under high load at a time when the business critically depends on it to hold up well.
And therein is a signal for you to understand whether DIY might be a good idea. If you see a product for which there’s no shortage of F/OSS offerings, yet there are “enterprise” offerings that are highly profitable, it’s a strong indicator of operational complexity.
In general, if you want to treat some technology as a black box, don’t use F/OSS [that you operate yourself]. If you’re operating something on your own, it’s best if you have built it or you have the ability to benefit from the openness of its source and understand exactly how it works under the hood. Another alternative is to understand its behaviour empirically by stressing it beyond your foreseeable requirements and figuring out how you’ll handle its failures. Both of these – understanding through code or empirical behaviour – are expensive propositions.
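To make the empirical route a little more concrete, here is a minimal sketch of a stress harness in Go. It is an illustration under stated assumptions, not a tool from any of the projects mentioned in this post: the names Stress, Result and ErrSimulated are hypothetical, and the op in main is a stand-in that you would replace with a real call to the component you are evaluating.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
	"time"
)

// ErrSimulated is a hypothetical failure returned by the fake operation in main.
var ErrSimulated = errors.New("simulated failure")

// Result summarises one stress run.
type Result struct {
	Requests int
	Errors   int
	P50, P99 time.Duration
}

// Stress calls op perWorker times from each of workers goroutines and
// records how long every call took and whether it failed.
func Stress(workers, perWorker int, op func() error) Result {
	var (
		mu        sync.Mutex
		wg        sync.WaitGroup
		latencies []time.Duration
		failed    int
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				start := time.Now()
				err := op()
				elapsed := time.Since(start)

				mu.Lock()
				latencies = append(latencies, elapsed)
				if err != nil {
					failed++
				}
				mu.Unlock()
			}
		}()
	}
	wg.Wait()

	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	return Result{
		Requests: len(latencies),
		Errors:   failed,
		P50:      percentile(latencies, 50),
		P99:      percentile(latencies, 99),
	}
}

// percentile returns the nearest-rank p-th percentile of a sorted slice.
func percentile(sorted []time.Duration, p int) time.Duration {
	if len(sorted) == 0 {
		return 0
	}
	rank := (len(sorted)*p + 99) / 100 // integer form of ceil(n*p/100)
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// Stand-in for the component under test: sleeps briefly and fails every 50th call.
	var mu sync.Mutex
	calls := 0
	op := func() error {
		mu.Lock()
		calls++
		n := calls
		mu.Unlock()
		time.Sleep(time.Millisecond)
		if n%50 == 0 {
			return ErrSimulated
		}
		return nil
	}

	r := Stress(20, 50, op)
	fmt.Printf("requests=%d errors=%d p50=%v p99=%v\n", r.Requests, r.Errors, r.P50, r.P99)
}
```

The point of a harness like this is to keep raising workers and perWorker well past your expected peak and watch how the error count and the p99 latency move, since that is where the failure patterns you will have to operate around tend to show up.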
Unique Business Value
The last evaluation criterion (but not the least important by any means) is how the component you’re to buy or build acts as a differentiator for your business.
The differentiator consideration is extremely important. I’ve been in many situations where people argue that we should build something in-house because it’s business-critical. Well, the Internet is business-critical but pretty much no one is building their own telco network, are they?
There is a very dire consequence of deciding to build something that is not a differentiator for your business – lack of long term investment into the component. When deciding to staff a team, the business will always prefer to allocate more to components or teams where they can see a clear, tangible business benefit. The less that a component contributes to business differentiation, the less focus it will have from a staffing perspective.
Soon you’ll find that the component that was cutting-edge at the time you built it has turned into a dinosaur a few years down the line while the state-of-the-art has progressed much further, potentially becoming even more cost effective.
How to measure cost in Buy vs. Build
Here’s a retrospective mini-game for your decision-making maturity on cost comparisons. You’ve already played this before. Now see what level you’ve landed at.
Level 1. Technology Cost: You compare the cost of servers and other operational components and see how much you might save.
Level 2. Staffing Cost: You throw in the CTC (cost to company) of the developers who will be building and operating the component.
Level 3. Business Development Cost: Now this is a bit tricky. If you allocate the budget from Level 2 cost (i.e. your sharp engineers) to building direct business value how much additional revenue would you gain? If you pull existing engineers who are directly building business value, how much revenue would you lose? For a high growth startup that’s still executing on its product plan, this cost could be really high.
Level 4. Loss of Business Cost: This is even trickier than Level 3 and requires comparing the monetary impact of SLAs and Quality of Service. Things to consider here include the impact on revenue per minute if the component fails and the impact on revenue if the component degrades in performance or reliability. Revenue lost because poor performance reduces engineering and business productivity is another cost factor.
Level 5. Business Opportunity Cost: This is the most interesting aspect of making a buy vs. build decision. In a nutshell, it captures the impact of lack of investment (build) or control (buy) in the evolution of the product under consideration. It’s also closely tied to the differentiation factor. If it’s a low differentiator built in-house, the most likely future is that it will get understaffed, leading to growing Level 4 costs while the rest of the world advances in capability. In-house high differentiators have the potential to increase future differentiation to the extent that they could become sources of direct revenue.
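The five levels above can be sketched as a small Go model that totals each option and compares them. This is only an illustration: every number in main is a made-up placeholder, the type and function names are hypothetical, and treating Level 4 as revenue per minute times expected downtime is a simplifying assumption on my part rather than a formula from the framework.

```go
package main

import "fmt"

// CostEstimate holds one option's yearly cost at each of the five levels.
// Amounts are whole currency units; all values used below are hypothetical.
type CostEstimate struct {
	Technology          int // Level 1: servers, licences, vendor fees
	Staffing            int // Level 2: people who build and operate the component
	BusinessDevelopment int // Level 3: revenue forgone by diverting engineers
	LossOfBusiness      int // Level 4: revenue lost to outages and degradation
	BusinessOpportunity int // Level 5: cost of under-investment (build) or lack of control (buy)
}

// Total sums all five levels.
func (c CostEstimate) Total() int {
	return c.Technology + c.Staffing + c.BusinessDevelopment +
		c.LossOfBusiness + c.BusinessOpportunity
}

// ExpectedOutageLoss is a deliberately simple Level 4 estimate:
// revenue per minute multiplied by expected minutes of downtime per year.
func ExpectedOutageLoss(revenuePerMinute, downtimeMinutesPerYear int) int {
	return revenuePerMinute * downtimeMinutesPerYear
}

// Cheaper reports which option has the lower total. Ties go to "buy",
// which is an arbitrary choice made for this sketch.
func Cheaper(build, buy CostEstimate) string {
	if build.Total() < buy.Total() {
		return "build"
	}
	return "buy"
}

func main() {
	build := CostEstimate{
		Technology:          50000,
		Staffing:            300000,
		BusinessDevelopment: 200000,
		LossOfBusiness:      ExpectedOutageLoss(500, 240),
		BusinessOpportunity: 100000,
	}
	buy := CostEstimate{
		Technology:          400000,
		Staffing:            50000,
		BusinessDevelopment: 0,
		LossOfBusiness:      ExpectedOutageLoss(500, 60),
		BusinessOpportunity: 50000,
	}
	fmt.Println(build.Total(), buy.Total(), Cheaper(build, buy)) // 770000 530000 buy
}
```

With these placeholder numbers, building wins comfortably if you stop at Level 1 and loses once Levels 2 through 5 are included, which is exactly the trap of stopping the game too early.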
Case Study 1: Chat Server – Build
Not too long ago, at Tokopedia we needed to build peer-to-peer and group chat capabilities. We already had some experience with using 3rd party solutions for reference. Here’s how the above decision framework applied to this decision.
Functional Complexity: chat functionality is well understood and we were able to nail down most of the core and peripheral functional requirements with clarity. This was validated against the core feature-set from 3rd party solutions.
Technical Complexity: chat servers have traditionally been notoriously complicated pieces of technology to build. Keeping end-to-end latency under predictable bounds while handling hundreds of thousands of concurrent connections is a well known problem (cf. C10k or C10M Problem).
10 years ago, one of the big technology decisions used to be “which webserver”. Apache2 was the most widely deployed webserver for C/C++ or PHP while the Java folks swore by Tomcat and later Jetty, etc. That was until Go (yeah, golang) arrived and suddenly the application was its own webserver. What’s more, you could use that webserver to do long-lived, bidirectional I/O over WebSockets, which were supported out-of-the-box in Go before any mainstream webservers picked them up. Anyway, I digress.
One of the coolest gifts that Go has given to the software development community is the select statement. Combined with goroutines and the rapid benchmarking capabilities, chat servers have become trivial (I don’t use the t-word lightly) to build with Go. Fortunately, Tokopedia is a Go shop and double-fortunately, we also had engineers who could turn this capability into a reliable working solution.
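To show why select and goroutines make this kind of server approachable, here is a minimal in-memory sketch of a chat room. It is not the Tokopedia implementation, and it leaves out networking, persistence and authentication entirely; the Hub type and its methods are hypothetical names chosen for the example. One goroutine owns the member set and uses select to multiplex joins, leaves and broadcasts, so no locks are needed.

```go
package main

import "fmt"

// Hub is a hypothetical in-memory chat room. A single goroutine (Run) owns
// the member set, and every other goroutine talks to it through channels.
type Hub struct {
	join      chan chan string
	leave     chan chan string
	broadcast chan string
	done      chan struct{}
}

func NewHub() *Hub {
	return &Hub{
		join:      make(chan chan string),
		leave:     make(chan chan string),
		broadcast: make(chan string),
		done:      make(chan struct{}),
	}
}

// Run multiplexes all hub events with select until Stop is called.
func (h *Hub) Run() {
	members := make(map[chan string]bool)
	for {
		select {
		case c := <-h.join:
			members[c] = true
		case c := <-h.leave:
			if members[c] {
				delete(members, c)
				close(c)
			}
		case msg := <-h.broadcast:
			for c := range members {
				select {
				case c <- msg:
				default: // slow consumer: drop the message rather than stall the room
				}
			}
		case <-h.done:
			for c := range members {
				close(c)
			}
			return
		}
	}
}

func (h *Hub) Join(c chan string)  { h.join <- c }
func (h *Hub) Leave(c chan string) { h.leave <- c }
func (h *Hub) Send(msg string)     { h.broadcast <- msg }
func (h *Hub) Stop()               { close(h.done) }

func main() {
	hub := NewHub()
	go hub.Run()

	alice := make(chan string, 8)
	bob := make(chan string, 8)
	hub.Join(alice)
	hub.Join(bob)

	hub.Send("hello")
	fmt.Println(<-alice, <-bob) // hello hello

	hub.Stop()
}
```

The inner select with a default case is one possible policy for slow consumers; a real server would more likely disconnect them or apply back-pressure, and would attach each member channel to a WebSocket connection served by its own goroutine.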
Operational Complexity: Given that we decided to build our own chat server from scratch, we had full control over its behaviour and failure patterns. We could dig right down into the TCP implementation of Go if needed, leaving very little behaviour that we couldn’t understand or control. We also had a scale target to design for and we frequently ran stress tests to push the implementation. This allowed us to develop implementation and operations hand-in-hand for a low maintenance solution that could be operated with minimal overhead.
Unique Business Value: Chat is one of the most crucial touch-points that a business has with their customers. Having the ability to connect the user with a business representative via chat messaging is a very basic functionality but we saw that as an opportunity to build a lot of differentiation.
Being able to build workflows over chat, being able to integrate our own business artefacts into the chat protocol and being able to control the experience end-to-end has a very high business value. We’ve been continuously enhancing our protocol to support all kinds of business use-cases. Having full control over the solution, rather than building to a 3rd party API, means that we can do the integrations at whatever level is most optimal, rather than only doing surface-level integrations.
Cost: We worked through pretty much all the levels of cost evaluation. While I can’t share a lot of data on this, we did find that our Level 1 and Level 2 costs were 10x lower at full scale for one of the use cases.
Case Study 2: Log Platform – Buy
Of late, one of the other considerations that we had to make at Tokopedia was how to provide a scalable log aggregation platform that could be used across the whole company, not just for application-specific logs but also for fast, interactive cross-application analysis. Here’s how the decision framework applied to it.
Functional Complexity: On the surface, log aggregation is a well understood functionality that has limited scope. However, our prior experience and analysis of 3rd party offerings revealed important peripheral functionality such as the ability to handle instance churn in a containerised environment, highly responsive interaction, analytics and alerting on log data, etc.
Technical Complexity: Technically, log aggregation can be simple or complex depending on the scope of functionality. I mentioned earlier that interactive information retrieval at scale is a complex problem. We could build log aggregators that were interactive at low volume and supported basic filters but basic log aggregation wasn’t sufficient functionality and building full scale log analytics is a technically complex problem.
Operational Complexity: Given the high technical complexity, our only option was to assemble and operate a F/OSS solution as a black box. That didn’t go very well for us, mainly because understanding a log platform is extremely costly – whether as a white-box (reading the code) or empirically (understanding failures through stress testing). We didn’t want log platform maintenance to be someone’s full-time job, and it wasn’t easy to keep ingestion and query times low enough to count as “real-time, interactive” performance while ingesting TBs of log data daily – let alone scale that 10x for the discount sale spikes.
Unique Business Value: Log aggregation and analysis is one of those components that are business-critical but not differentiators. It is extremely important to have a reliable log platform because it’s the foundation on which we build our failure-feedback during development, testing and – most importantly – production operation. However, we did not find any use cases specific to us that couldn’t be satisfied through off-the-shelf solutions. Thus the business differentiation was very low.
Costs: Again, while it’s difficult to share hard numbers on this, our analysis revealed that our Level 2 through Level 5 costs would have been higher for an in-house solution. High enough to overcome any savings in Level 1 costs, and the gap increased with increase in scale. So we decided to go with a 3rd party solution instead.
When to Sell What You Build
Here’s a bonus! While most of us stop after making a buy vs. build decision, the same framework used to make that decision can help in deciding whether you should sell what you’ve built. We just need a couple of extra points to consider.
Externally Relevant Differentiation
When you decide to build something that has high complexity, most likely it’s because it has high business differentiation value for you or you have done it in a way that alters the economics of third party solutions. This means that you’ve built something that is not only hard to build but is also different from existing solutions in ways that are meaningful to you.
Given that your business is not operating in isolation, there may be others who might find that same kind of differentiation meaningful. So you could potentially sell your solution to other parties either commercially or as a F/OSS offering. However, there’s one more thing to validate before you make that decision.
Cost of Multi-Tenancy
When you build a solution in-house, you do it for your company, for your use cases. Selling this solution externally means that you have to do a lot of additional work to separate out your company as a tenant of your solution. That’s followed by ensuring that your solution can work for multiple tenants in a way that they don’t step on each other’s toes or discover each other’s business secrets. Add to that the requirements for cost computation and usage audits and more robust end-user security.
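As a small taste of that extra work, here is a sketch of per-tenant usage metering with quotas in Go. It covers only two of the concerns above (usage accounting and stopping one tenant from consuming another’s share) and says nothing about data isolation or security. The Meter type, the error names and the tenant names are all hypothetical, and the sketch does not guard against negative unit counts.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	ErrUnknownTenant = errors.New("unknown tenant")
	ErrQuotaExceeded = errors.New("tenant quota exceeded")
)

// Meter tracks usage per tenant and enforces a fixed quota for each one.
type Meter struct {
	mu    sync.Mutex
	quota map[string]int
	used  map[string]int
}

// NewMeter copies the quotas so later changes to the caller's map have no effect.
func NewMeter(quotas map[string]int) *Meter {
	m := &Meter{
		quota: make(map[string]int, len(quotas)),
		used:  make(map[string]int, len(quotas)),
	}
	for tenant, q := range quotas {
		m.quota[tenant] = q
	}
	return m
}

// Record charges units to a tenant, or rejects the request without charging
// anything if the tenant is unknown or would exceed its quota.
func (m *Meter) Record(tenant string, units int) error {
	m.mu.Lock()
	defer m.mu.Unlock()

	q, ok := m.quota[tenant]
	if !ok {
		return ErrUnknownTenant
	}
	if m.used[tenant]+units > q {
		return ErrQuotaExceeded
	}
	m.used[tenant] += units
	return nil
}

// Usage returns the units charged to a tenant so far, e.g. for charge-backs.
func (m *Meter) Usage(tenant string) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.used[tenant]
}

func main() {
	m := NewMeter(map[string]int{"marketplace": 100, "logistics": 10})

	fmt.Println(m.Record("marketplace", 90))                  // <nil>
	fmt.Println(m.Record("logistics", 11))                    // tenant quota exceeded
	fmt.Println(m.Usage("marketplace"), m.Usage("logistics")) // 90 0
}
```

Even this toy version hints at the questions a single-tenant system never had to answer: who sets the quotas, what happens to a request that is rejected, and how usage is reported back to each tenant.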
Sometimes the cost of multi-tenancy itself becomes too high. Maybe the solution that you built turned out to be profitable precisely because it wasn’t multi-tenant. One of the reasons enterprises buy enterprise solutions is that any platform in a large company inherently needs to support multi-tenancy. These companies don’t just have different teams, they have teams that belong to different business units that need to calculate their internal charge-backs for shared technology and don’t take loss of reliability due to a sister unit’s misdemeanour very lightly.
I do find that different markets have different appetites to buy or build. There’s a lot of “emotional” decision-making that goes on. Functional vs. Technical complexity is often not differentiated and the wrong trade-offs between short term and long term prospects are made.
I have documented here my decision framework for making buy-vs-build decisions and I hope it helps you make better decisions. Conversely, you can help me by adding to the framework.