The technology universe is in a constant state of flux, with new advancements arriving faster than anyone can keep up. A technology leader therefore needs something durable on which to build the foundation of their new (or improved) technology organisation.
Spending 36 months leading the charge (and occasionally failing) at a fast-growing, technology-powered business can teach a lot. Coming off the back of a career built in customer-facing development teams at companies serving over 100 million customers, I now have some idea of what it takes to build a strong technology foundation for a modern business.
After a reflective three-week, slate-wiping vacation, I’m ready to write down and share my thoughts.
0. The Technology Landscape
To understand the rationale behind the foundational traits, it’s worth spending some time on the overall landscape of the technology domain. This will allow us to identify the essential traits that help a technology organisation succeed.
- Reaching a wide audience is easy – and expensive. Internet Marketing tools and social media provide an incredibly easy, albeit expensive, way to get people directly to your product. This is different from OOH, TV, Radio and Print marketing that could only take your brand to the audience but not your product. In order to leverage your spend on Internet Marketing, you need to ensure that:
- You are ready to accept any amount of traffic that marketing directs to you. A website or app that crashes due to traffic is a waste of money and opportunity
- You provide a smooth glitch-free experience to every user every time
- There is a large supply of developers, yet there’s a talent crunch. This seemingly paradoxical situation has an easy explanation: the hiring requirements are too specific. We don’t just look for smart, committed problem solvers; we hire people who know programming language X, database Y and message queue infrastructure Z. Even if the job description doesn’t include all of this, I’ve seen enough people get dropped in interviews because “they don’t have enough experience with our tech stack”.
- Competition is mostly copy-paste. There are very few domains where you can bring an innovation to market without it being replicated before it can become your forte. Rapidly improving execution, along with consistent innovation, is the key to sustained leadership in the market.
- Digital Transformation. Half a century ago, only large corporations could think of “computerising” office operations like finance, payroll and leave management, because computers were hard to procure, maintain and program. Now, almost every mom & pop store has spreadsheets to do these jobs, with no programmers or IT support staff. Similarly, offline businesses are now looking to set up transactional access through online channels. They can’t afford to be left behind, and they can’t afford to hire and maintain large software development teams either.
Based on the above scenario, I can identify the following three essential traits that a technology organisation should exhibit:
Speed. To thrive in today’s business environment, a business needs to leverage economies of speed, irrespective of whether they have economies of scale. Traditional software development teams roll out new features in a 6-month time frame. The modern software development organisation needs to be able to roll out entire business models in that time. And I’m including the conception and ideation time in this window too! This is what enables copy-paste competition and what can thwart copy-paste competition as well.
Reliability. In his 1975 book, The Mythical Man-Month, Fred Brooks Jr. argued that it’s important to build a throw-away prototype before building a production-ready version. It is telling of the advancements in software development practice, as well as of the added environmental pressure to deliver, that 20 years later the author acknowledged throw-away prototyping as a mistake borne out of the waterfall model.
Today, we need the prototype to be production ready because no closed-room estimation of a feature’s effectiveness could be better than testing in the market and having a tight loop from feedback to incremental change and release to market. If prototyping and production deployment can’t be done by the same team with the same tech stack and the same skill requirements, the overheads pile up. It’s crucially important to have prototypes be robust and reliable enough for production from the beginning.
Effectiveness. The agricultural revolution ca. 10,000 BC allowed humans to settle down in communities and the key to scaling up those settlements from hamlets to villages to towns and cities was to grow even more crops. The key to scaling up in the industrial revolution was to build bigger machines that processed more raw material and finished more products per time.
The key to scaling in the IT revolution is to do more with computers. The more we focus on transferring human knowledge to computers, the more humans can achieve – and there’s a lot to achieve. After all, there have been five agricultural revolutions since 10,000 BC, the most recent just half a century ago! We need to be able to deliver more business value through smaller teams with lower average experience and skill levels.
Now that we have established the three traits of a competitive technology organisation, let’s look at the top 5 competitive advantages that enable these traits in an organisation. Spoiler: there’s no AI/ML, Blockchain or IoT involved.
1. The Public Cloud
Not being on a public cloud platform today is a disadvantage. Back in 200x, being able to provision hardware with the click of a button was a novelty. Now it’s essential. But that’s not even the point here.
Public Cloud is more about commoditisation of development and production environments. Look carefully at the product offerings of mainstream cloud providers and you’ll notice how you can choose from a variety of capabilities as well as bundled solutions. It’s like going to a computer store and either picking and choosing parts (and relying on yourself for ensuring cost effectiveness, interoperability and end-to-end optimisation) or just picking up a ready-to-use laptop/tablet… or smartphone that matches your budget and requirements.
The Cost Angle
One common perception to address is that managing your own (usually virtualised) physical infrastructure is more cost-effective. I’ve come across anecdotal figures of a 3-5x difference in cost. However, this only includes the cost of primary compute resources (CPU, memory, storage, network). The elephant in the room that is usually not addressed is the cost of speed and the cost of reliability. Most modern businesses have so much business to lose from unreliability that they would readily buy reliability, and recover more as revenue for every dollar spent as “reliability opex”.
Leasing a data centre, running CloudStack or Kubernetes, etc. on it is easy. Making everything work together to form an integrated “production runtime environment” is difficult. Managing the environment 24×7 is hard. Making it developer friendly is an incredible design challenge. Managing it while ensuring end-to-end optimisation and reliability is a whole different ballgame. Think redundancy, failover, backups, audits, security and access controls, incident management, 24×7 monitoring, performance bottlenecks….
Now consider that a production environment consisting of many components typically exhibits the “weakest link” phenomenon – it doesn’t matter much how robust a few of your components are, or how many components are robust. A single weak link can spoil the whole party. Given the business pressure to deliver more, faster, with understaffed teams, the occurrence of weak links is unsurprisingly frequent.
There’s no shortage of companies that offer to manage these components better for money – I’m talking about things like enterprise backup solutions, enterprise monitoring and analytics systems, and so on. They are great at solving a specific problem, but while buying these solutions à la carte can improve the components they apply to, ensuring global optimisation is still your responsibility.
The upside of using public cloud is that you get a more well integrated, globally optimised production environment. You get mature infrastructure operations, security and disaster recovery capabilities without having to acquire talent for them first. Without having to hire supervisors for that talent. Without the risk of failing time consuming certification processes when you’re audited for raising funds or going public.
The Capability Angle
I’ve had the good fortune and opportunity to develop my career in a highly competitive environment. So much so that while I’ve usually been recognised as having strong technology skills, I still believe it’s my weak point – after all, I’ve never worked at Google or Microsoft! Anyway, the teams that I worked in throughout my career had a strong “we can do it” attitude, and for us, buying a solution was demeaning. This environment worked great for our growth as technologists at an individual level, but for the businesses? In some cases, not so much.
The reason is simple. It’s not a question of capability alone. It’s also about focus. If you lead a team of smart, strong developers, would you rather have them focus on building the competitive business capabilities faster or would you rather have them rebuild what has been built elsewhere many times over? (The justification for re-invention often is “we can customise it for our business” and “we can do it better/cheaper”. My empirical observation is that those assertions are usually valid but also usually not worth the loss of focus. Lack of focus is immeasurably costly).
A business has some core competencies that differentiate it from competitors, and a lot of auxiliary supporting functionality that is required to make a complete product for end customers. The core competency requires far more focus than the auxiliary functionality, but both sets require an equal level of maturity. With public clouds, you no longer have to invest the same level of knowledge and commitment in an auxiliary function as in your business-critical core. For example, you might very well use carefully optimised, self-managed database instances and custom-tuned VMs for your core, high-volume, mature transaction processing systems, while using something like App Engine or Lambda for low-complexity applications. You can test new features in the market with something that gets you started quicker, and move to something more cost-effective if and when your scale and talent pool justify the move.
Speaking of the talent pool, most graduating developers now acquire hands-on knowledge through public clouds. The more of your stack that runs on standard public cloud offerings, the more ready-made (and usually free) training material you can leverage as part of your developer on-boarding process.
Any product offering that’s already part of a public cloud is now commodity. It’s not adding value to your business unless it actually is the business you’re building. Trying to duplicate that functionality as part of your technology capabilities is a distraction from your core business and it’s going to have diminishing value over time.
2. Elasticity
Elasticity is simple to understand: it’s the ability to add or remove resources on the fly, with minimal effort and without the risk of unexpected change in functionality. The resources could be anything: more CPU, more memory, more storage, more bandwidth, etc. This resource elasticity must also translate into cost elasticity. In my rulebook for 2019 and beyond, if a technology component is not elastic, it’s not viable.
Elasticity and Software
When I try to apply the elasticity mandate to every technology component, I realise how much more there is to be done, how it can bring about a paradigm shift in design of applications. Being elastic requires the ability to chunk all resources into small units and quickly add or remove those units without disrupting functionality. Most of us are pretty comfortable with being able to add more servers behind a load-balancer for our stateless application layer or sharding databases. That’s how we scale. So what’s the deal with Elasticity?
Elasticity is about doing scalability in style. If adding more instances to the application requires filing a change request that someone spends a day to get to and an hour to provision instances, set them up and update DNS records, you can claim scalability but not elasticity. Elasticity is all about how those instances are added – how quick is the decision-making, how quick is the turnaround, how failure-resistant is the operation, how consistent is the application functionality in the event of scaling and how hands-off is the whole process.
This has interesting side-effects, such as the need for application software to treat instance failure or shutdown as a normal operation rather than an exceptional event. It also shifts failure handling from instance recovery to instance replacement. This is a subtle difference on the surface but has great consequences. Why? Because recovering a failing instance requires knowing the cause of failure and knowing how to remediate based on that cause. This troubleshoot-and-fix method of failure recovery is complicated and time-consuming (high MTTR). With elastic applications, failure can be handled by simply replacing a failing instance with a new one, while optionally quarantining the failed instance for forensics. The faster the replacement, the lower the MTTR. (There are related strategies for handling total failures, but those are outside the scope of this discussion.)
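This replace-not-repair approach can be sketched in a few lines of Go. All the names here are illustrative, not from any real orchestrator API; the point is only that reconciliation replaces unhealthy instances instead of diagnosing them:

```go
package main

import "fmt"

// Instance models one replaceable unit of capacity.
type Instance struct {
	ID      int
	Healthy bool
}

// Pool keeps a set of running instances. On failure it does not
// troubleshoot; it quarantines the failed instance for forensics and
// launches a fresh replacement, keeping MTTR close to launch time.
type Pool struct {
	nextID     int
	Running    []*Instance
	Quarantine []*Instance
}

func (p *Pool) Launch() {
	p.nextID++
	p.Running = append(p.Running, &Instance{ID: p.nextID, Healthy: true})
}

// Reconcile replaces every unhealthy instance with a new one.
func (p *Pool) Reconcile() {
	var alive []*Instance
	failed := 0
	for _, inst := range p.Running {
		if inst.Healthy {
			alive = append(alive, inst)
		} else {
			p.Quarantine = append(p.Quarantine, inst) // keep for forensics
			failed++
		}
	}
	p.Running = alive
	for i := 0; i < failed; i++ {
		p.Launch() // replacement, not repair
	}
}

func main() {
	p := &Pool{}
	for i := 0; i < 3; i++ {
		p.Launch()
	}
	p.Running[1].Healthy = false // a health check flags instance 2
	p.Reconcile()
	fmt.Println("running:", len(p.Running), "quarantined:", len(p.Quarantine))
	// prints "running: 3 quarantined: 1"
}
```

Notice that the reconcile loop never asks why an instance failed; that question is deferred to forensics on the quarantined instance, off the critical recovery path.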
Having an elastic application layer just scratches the surface. As I seek to apply the principle of elasticity to more and more technology components, I realise how much there is to desire in the area of elastic data persistence and networking. This brings us nicely to the next part of this discussion.
Elasticity and Vendor Selection
Building a technology organisation for speed, reliability and effectiveness requires keeping Not-Invented-Here (NIH) syndrome at bay, and that means leveraging third-party vendors for such functionality. My definition of “vendor” in this article is broad enough to include non-commercial Open Source Software as well.
Vendor selection is tough. For the most part, every provider in a domain covers the same feature set as the others. In today’s world of copy-paste competition, this is hardly surprising. To make matters worse, the SaaS model makes it very easy to “hide” production deficiencies under the hood.
So far, I’ve found that bringing elasticity as a deal-breaker tends to differentiate mature and immature vendors fairly well. It is also a criterion to judge how well the vendor can fit with the desire to leverage public clouds, which was the competitive advantage I previously talked about. For commercial vendors, it also simplifies cost considerations as it converts most of the cost into “opex” that can scale with usage.
As I said earlier, elasticity is about scaling in style. A competitive technology team must not only target scalability, but also bring more functionality into the realm of being elastic.
3. Lean Development Environment
One of the biggest differentiators between an effective technology team and an ineffective one is not the average technology skill level of its members, but the maturity of the team’s software development methodology. In other words, a team of less skilled developers with better methodology would outperform a team of more skilled developers with poor methodology. Wouldn’t it be wonderful if you could succeed with competent talent instead of battling for the elusive “rockstars” in the talent market? Of course it would! But here’s another kicker: top talent actually sticks around in companies with great culture, and that includes the way the daily grind of software development works.
While there is no specific methodology that I recommend, I can list a few properties that a good methodology would exhibit. These properties directly contribute to the Build-Measure-Learn loop, one of the principles outlined in the book The Lean Startup. A Lean Development Environment is one that enables the methodology to embed seamlessly into the process of creating and maintaining software, rather than feeling like an add-on or overhead.
The following are the properties that a Lean Development Environment should embody:
Effective Change Annotation
The first version of any useful system starts with “clean code”. The team that writes it feels a sense of pride at how pristine and up-to-date the design is. But if it’s a useful piece of software, it will be used, small bugs will be discovered, minor edge cases will be handled. It will scale in usage beyond what it was envisioned to do. Within six months, the code’s authors won’t be able to recognise what they wrote and why.
And then it happens. An unusual bug surfaces. The troubleshooting leads to a particular line of code that seems too absurd to have been written. Was there a reason why this code was written this way? Changing it to fix the bug causes another test to break. Is that test still valid or does it need to be updated too? That’s when you wish you knew what was going on in the mind of the developer who wrote that code.
This wish is granted by effective change annotation. Any change in code should be annotated with information that answers the “why, when, by whom” questions. There should be a traceable reverse path from any line of code to the commit that applied it and its author. The commit message ideally explains the change, but for good measure, it also links to a task in a task tracker, which has more context around the change. The task links to a technical design document and a specification document (or PRD). This is the most effective method of connecting the documentation and code that I have ever practised.
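The “no commit without a ticket” discipline described below can be enforced mechanically. As a sketch, here is a hypothetical Go program that a commit-msg hook could run; the `#\d+` ticket pattern is an assumption to be adapted to your own tracker:

```go
package main

import (
	"fmt"
	"regexp"
)

// ticketRef matches a tracker reference such as "Fix Bug #10234".
// The pattern is an assumption; adapt it to your own task tracker.
var ticketRef = regexp.MustCompile(`#\d+`)

// ValidCommitMessage reports whether a commit message carries the
// traceability discussed above: a link from the change back to a task.
func ValidCommitMessage(msg string) bool {
	return ticketRef.MatchString(msg)
}

func main() {
	fmt.Println(ValidCommitMessage("Fix Bug #10234: escape review titles")) // true
	fmt.Println(ValidCommitMessage("quick fix"))                            // false
}
```

Wired into git’s commit-msg hook (which receives the message file as its argument), a check like this rejects any commit that cannot be traced back to a task.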
The team that did this best among ones that I worked with was at Yahoo! ca. 2005. We were tasked to build a multi-lingual, multi-format Reviews & Ratings platform that would cater to all of Yahoo’s properties throughout the world serving half a billion users. The team took a cue from Philip Tellis, who refused to do any work if there wasn’t a Bugzilla entry for it. Through strict discipline and a clever combination of CVS and Bugzilla, we managed to ensure that no code was committed for production deployment if it didn’t have a corresponding Bugzilla ticket. The dopamine kick came from writing “Fix Bug #XXXXX” in the commit message to automatically mark the bug as fixed.
This practice was immensely helpful in incorporating customer-specific requirements since they had to be converted into platform-wide capabilities that could be used by any other customer. More often than not, this required being able to refer back to an ancient design choice, understand how changing it would impact the system and working out the peripheral changes required to bring in the new capability. Sounds a lot like what happens in a typical Build-Measure-Learn cycle, right? Except the code is no longer an opaque wall of incomprehension.
Built-in Telemetry
When we write code for a new application, we focus on getting the functionality right first. We don’t really think much about the diagnostics and telemetry we should be getting from the application. However, it’s impossibly tough to manage an application in production if it’s not sending out any signals about its functioning.
Signals related to health checks, latency and error rates are so fundamental to the production operation of networked services that they should not even need to be coded up manually. Anything that requires manual effort has a chance of being skipped, inadvertently or through ignorance. This is why I’m a huge fan of metrics-enabled infrastructure components, such as Envoy proxy. Just last week, one of my teams at Tokopedia debugged an OOM incident using our in-house continuous profiler for Go. The continuous profiler had been quietly recording profiles every 5 minutes. The developers simply located the spike on a graph and analysed it. The cost to developers for getting this capability? One never-changing line of code in application initialisation.
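Go’s standard library makes the “one line” idea concrete: blank-importing net/http/pprof during initialisation is enough to expose profiling endpoints on the default HTTP mux. The sketch below binds an ephemeral port and probes the heap profile just to demonstrate that the telemetry is live; in a real service you would simply start the listener and move on:

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	_ "net/http/pprof" // the one line: registers /debug/pprof/* handlers
)

// probePprof starts an HTTP listener on an ephemeral port and returns the
// status code of the heap profile endpoint, proving the profiling signals
// are available with no further application code.
func probePprof() int {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	go http.Serve(ln, nil) // nil handler = DefaultServeMux, where pprof registered itself

	resp, err := http.Get(fmt.Sprintf("http://%s/debug/pprof/heap?debug=1", ln.Addr()))
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
	return resp.StatusCode
}

func main() {
	// In a real service, initialisation would simply include:
	//   go http.ListenAndServe("localhost:6060", nil)
	fmt.Println(probePprof()) // 200
}
```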
Operational metrics is one aspect of telemetry. However the Build-Measure-Learn cycle requires more than measuring network parameters and success rates. The need to perform deep data analysis makes it important to integrate event generation and consumption as a first class capability in the stack from development environment all the way to production.
One of the companies where this was most culturally well integrated was, unsurprisingly, Zynga. Even as far back as 2010, they could overlay change events on computational and business performance metrics to easily narrow down what change created what impact. Some of the high-brand-recognition startups of today still require ad-hoc work from Data Analysts to figure out this kind of information.
The way Zynga achieved this was through a clever system to capture events in a hierarchical model that was fixed throughout the company, yet flexible enough to allow each event to provide its own meaning to the hierarchy. Crucially, everyone from the Product Managers to the Software Engineers could speak and understand this hierarchy. The events to be captured were specified as part of the product specifications. This was combined with a well integrated multi-variate testing (experimentation) framework to allow deep analysis of player behaviour, spending habits, game progression blockers, and even system issues such as bugs or exploits.
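A fixed-but-flexible event hierarchy of this kind is easy to picture in code. The sketch below is in Go, and the level names are purely illustrative, not Zynga’s actual schema; the key property is a company-wide fixed shape with event-specific meaning at the leaves:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event sketches a fixed, company-wide hierarchy in which every event
// supplies its own meaning at the leaves. The level names are illustrative.
type Event struct {
	Class  string            `json:"class"`  // fixed vocabulary, e.g. "monetisation"
	Family string            `json:"family"` // e.g. "purchase"
	Genus  string            `json:"genus"`  // e.g. "energy_pack"
	Detail map[string]string `json:"detail"` // event-specific meaning
}

// Track encodes an event for the analytics pipeline. Product specs can name
// events by their (class, family, genus) path, and analysts can slice on it.
func Track(e Event) string {
	b, _ := json.Marshal(e)
	return string(b)
}

func main() {
	fmt.Println(Track(Event{
		Class:  "monetisation",
		Family: "purchase",
		Genus:  "energy_pack",
		Detail: map[string]string{"experiment": "price_test_b", "amount": "4.99"},
	}))
}
```

Because the hierarchy is fixed, Product Managers can specify events in PRDs using the same vocabulary engineers use in code, which is what makes change events trivially overlayable on business metrics.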
Contrast this with how much effort and money teams spend on building or renting log aggregation and analysis services. If I had a penny for every thousand log messages that get pushed into these aggregators… wait, there are companies that get these pennies, and they are getting quite rich!
From Laptop to Production: Build/Deploy Pipeline
The Build-Measure-Learn loop wraps from Learn back to Build. The tighter the Learn-Build link, the faster the team can move forward with enhancements. Once a change is done and tested in an individual’s development environment, every other step until production release is overhead for the developer. The faster and more reliable this journey is, the speedier and more confident developers are in making changes.
On the process side, the intra-team and cross-team collaboration models have seen a lot of change over the last few years. There was a period of confusion as git – an SCM created to handle a very large codebase with contributors from all over the world – caught the fancy of small teams of developers (thanks mostly to GitHub). It took a while of experimenting with fancy development models for developers to realise that developing closed software within a team or company requires different methods than developing software in the open, across the world.
The CI/CD pipeline is a great place to infuse a lot of automation. The build environment, for example, can be sandboxed to only pull dependencies from a local repository, wherein the dependencies have gone through an automated or manual security review. The build environment can have “tricked-out” versions of language SDKs and standard libraries to inject additional security, auditing and diagnostics tooling, that the development environment may not require.
Current advancements in infrastructure technology such as Kubernetes, Infrastructure-as-Code, GitOps, etc. have made this journey super high speed, especially for Cloud Native development environments. Teams that are not already in a position to leverage these advancements are going to lose out on velocity and reliability under change. The good news is, public cloud providers and independent SaaS vendors are already adding a lot of value into this ecosystem so the barrier to setting up a great pipeline is not so high. The caveat is that it does make the team dependent on one cloud provider or SaaS.
A Lean Development Environment is hard to build without embodying its principles from the start and maintaining a relentless focus on automation. Adding an emphasis on automation to the culture triggers a self-reinforcing cycle that improves maturity across all the above properties. A team that scales through automation grows by adding new capabilities or specialisations more often than by adding headcount with the same skills.
Perhaps the most advanced Lean Development Environment in the world is being run at scale at Facebook. That gives them a solid competitive technology advantage against everyone else trying to build yet another social network.
4. Test Driven Development
Most technology organisations spend too much effort on software testing, but they don’t test the software enough. This may seem paradoxical but it is actually a result of unsuitable methodology borrowed from manufacturing industry. Industrial testing involves verification that each of the thousands or millions of units of mass produced items is identical and meets the same set of quality expectations. On the other hand, software development is about change, not about producing identical copies of the same bits of data.
Please hold that thought while I briefly recount some of my personal experience with software testing. I spent the first three years of my career building web applications and rewriting a payroll processing system (from PL/SQL to MSSQL – no “coding”). I used to build the UI and its back-end, take the resulting UI for a spin using dummy data and aasdffg input. The new payroll system ran as a batch in parallel with the existing system and we just eyeballed the output to find differences to be fixed.
In my next job at Yahoo!, I had some big responsibilities to fulfil and the inadequacy of my testing practices was laid bare by my manager when she told me quite simply, “you write buggy code”. Luckily, I happened to be reading Kent Beck’s book, Extreme Programming Explained, which talked about “Test First Programming” and other things. It sounded extreme enough to make a difference so I tried it. Soon enough test first programming took me to a state where I could make large changes with speed and confidence.
Changing Software with Speed and Confidence
“Test Driven Development” is all about making change with speed and confidence. Unfortunately, outside of software development, testing has very little to do with change. We get our blood tested when we feel sick. We sit tests (exams) when we need to prove our possession of skill or knowledge. Chemical labs run tests to ascertain the purity of their products. Factories run tests to determine that a unit manufactured today is identical to one produced last month.
That makes it a lot more difficult for software developers to get motivated to follow TDD practices, because on the surface it just looks like a hurdle or overhead in the process of creating change through programming.
Despite the great success of my Yahoo! project, I didn’t always work in teams following TDD – not until I discovered the Go programming language and its built-in unit testing capabilities. That was back in 2012, and since then I have never written a non-trivial program without following TDD. Only after several years of regular TDD practice did I come to understand the full scope of its benefits. So, last year I discussed go test in a tech-talk and wrote about it in a long blog post titled The Best Feature of Go.
The Programmer’s Assistant
In traditional waterfall process, testing comes after the implementation is finished and just before shipping the software. On the other hand, TDD practices encompass the full gamut of software development activities, acting as a “programmer’s assistant” at every stage, helping them make progress with speed and confidence. This is illustrated and exemplified in the blog post that I mentioned previously. To quickly recap, here are the points of assistance:
- Tests help the developer work out the program design by starting with a skeleton and validating the intended implementation as it happens. Difficult testing is a sign of insufficient modularity or lack of functional cohesion
- Tests help the developer switch to the “user” mindset by actually trying to use the software’s public interface
- Tests help as shared API contracts – breaking API tests indicate impending integration failure
- Tests help as examples and documentation (especially prevalent in F/OSS now) to speed up integration and adoption
- Tests help in faster refactoring and identifying breaking changes
- Tests help in assessing impact of change on performance
- Tests help in finding safety and security issues before deployment in production environment
- Tests help in confident releases to production through CI/CD automation
- Tests help in automated operations and validating failure recovery mechanisms
- Tests help in validating the non-functional properties of software (scalability, reliability, efficiency, responsiveness)
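The first few points above take a concrete shape in Go’s table-driven test idiom. Here is a sketch around a hypothetical Slugify function; in a real project the table would live in a _test.go file as func TestSlugify(t *testing.T) and run under go test, but the shape is identical:

```go
package main

import (
	"fmt"
	"strings"
)

// Slugify is a deliberately small, hypothetical piece of business logic.
func Slugify(s string) string {
	s = strings.ToLower(strings.TrimSpace(s))
	return strings.ReplaceAll(s, " ", "-")
}

func main() {
	// Each row doubles as documentation of the public interface: it shows
	// exactly what a caller puts in and gets out.
	cases := []struct{ in, want string }{
		{"Hello World", "hello-world"},
		{"  Trim Me  ", "trim-me"},
		{"already-good", "already-good"},
	}
	failures := 0
	for _, c := range cases {
		if got := Slugify(c.in); got != c.want {
			fmt.Printf("Slugify(%q) = %q, want %q\n", c.in, got, c.want)
			failures++
		}
	}
	fmt.Println("failures:", failures) // prints "failures: 0"
}
```

Adding a row to the table is also the cheapest possible way to pin down a newly discovered edge case before refactoring, which is where the speed-and-confidence payoff comes from.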
When Not to Test
One of the biggest psychological detractors to TDD or Test First Programming is reading too literally into the title. Test First Programming doesn’t always mean that you have to start with a unit test for each and every task. During the conceptualisation and design phase, often a developer would be testing a few premises – verifying that the available technology is viable for a particular way of solving the problem.
The test for viability or proof of concept does not necessarily require unit testing, although it really is a test. For example you might want to try out mutual exclusion among distributed processes through a remote lock. Trying out the remote locking mechanism to synchronise a couple of dummy processes spun up by hand doesn’t need a unit test. The moment you’re convinced that it’s a viable plan and decide to use it in a project is the point where you should start writing a unit test. In fact, I’ve often found it helpful to write rigorous unit tests to even convince myself of the viability of a solution. Occasionally, I found that the preliminary evaluation gave a false positive on viability. For example, a benchmarking unit test for our remote locking example is highly recommended to understand its scalability limits.
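For the remote-lock example, such a benchmarking test is easy to sketch in Go. Here a local sync.Mutex stands in for the hypothetical remote lock so the sketch is self-contained; a real evaluation would acquire and release the lock over the network inside the loop:

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

// mu stands in for the remote lock under evaluation.
var mu sync.Mutex

func acquireRelease() {
	mu.Lock()
	mu.Unlock()
}

func main() {
	// testing.Benchmark runs a benchmark outside `go test`; in a project
	// this body would be func BenchmarkRemoteLock(b *testing.B).
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			acquireRelease()
		}
	})
	fmt.Printf("%d iterations, %d ns/op\n", res.N, res.NsPerOp())
}
```

The ns/op figure, multiplied out against expected contention, is what reveals the scalability limit of the locking scheme before it is baked into the project.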
This inflexion point – between proving something viable and applying the solution to the project at hand – is something that often gets missed. Starting a project’s implementation unassisted by tests is likely to cause some corner-cutting in modularity, in the haste to see something working. That corner-cutting makes retrofitting unit tests more burdensome, further sapping the motivation to test.
The bottom line, then, is that TDD is not required while exploring the solution space, but tests should be put in place as soon as the implementation starts taking shape.
Developers and the Test Organisation
One of the more controversial decisions for a technology organisation these days is whether or not to have a testing team or QA department. There are many ways to address this but the most polished organisations tend to hold two principles as invariant:
- People involved in software production must be ultimately responsible for quality of implementation
- There must be a human evaluation for all human-computer interactions (GUI, UX, etc.)
The best engineering organisations hold both these invariants while striving to improve execution on both. Facebook’s engineers, for example, are required to maintain a high level of ownership of code quality through unit testing, code reviews and so on. At the same time, everyone at Facebook runs the pre-production version of the app, so they act as human evaluators of the app’s HCI too.
The question, then, is no longer about having a QA department since quality assurance is a shared responsibility. What does make sense is to have a group of people focused on improving the infrastructure, tools and execution of automated and human testing activities.
Having a software development culture rooted in rigorous automated and human testing is no longer a distinguishing mark of the technology giants. It’s a fundamental prerequisite for a competitive technology organisation.
5. Abstraction of Evolution
Software development has gone through several large paradigm shifts over the last 50 years or so. However, the essence of programming – data input, logical processing and data output – has largely remained. This is illustrated by the fact that even today, there are systems still running COBOL from the ’60s on dated hardware architectures.
How is it that businesses get locked into a period of technology? How can a business make itself adaptable to technology change?
The Roles that Matter
A typical enterprise technology group consists of several functional divisions. There are infrastructure engineers, network engineers, system administrators, emergency responders, security engineers, database administrators, testers, release managers and developers.
To really understand which of these technology roles are truly, uniquely part of the business capability, we could look at which headcounts fall under “cost centres” and which under “revenue centres”. Another dimension to consider is which of these could be contracted out to a third party without revealing business secrets.
The answer to the above would be different for different businesses. I reckon, though, that for most technology driven businesses the real value is in the business logic code, and the most important technology people are the developers.
The Innovation Triumvirate
We’ve agreed (or agreed to disagree) that the most important technology role is that of the developer – the person who writes business logic and makes it work in a production environment. There are two other roles, though, that are just as vital to fuelling innovation: the product owner and the data analyst.
These roles map to the “Build-Measure-Learn” loop of innovation in Lean businesses. It’s not necessary that there be three different people playing these three roles. Very often, there is a pair of people – the business person and the technology person – who split the roles among themselves. In fact, the “data analyst” role is something that the other two should be able to play in a hands-on manner to some extent.
The Technology Churn Treadmill
Technology churn is a very real phenomenon in the software industry. Most technology innovation in the last two decades has been driven by open source software, which tends to advance in small increments and which presents the problem of choosing among too many alternatives, each minimally supported and with a short shelf life.
The alternative is to build on top of “mature” technology that has been around for 10+ years – not a bad choice, except that it might leave your developers with FOMO (fear of missing out).
The tension between the need for stability and the urge to innovate means that most developers are either permanently busy experimenting with the next shiny thing, or planning to quit the job where they can’t be. And that’s bad, because developers are a part of the Innovation Triumvirate.
If only there were a way to jump off this unhealthy treadmill.
The Program and the Platform
The most important architectural decision that a business needs to make in order to assure longevity and agility of its technology practice is to separate the program and the platform.
In its most simple essence, the program is the implementation of the business logic, along with the required data, telemetry and mission control dashboards. The platform is the underlying infrastructure that the program runs on.
One of the most illustrious examples of this separation is the UNIX family of operating systems and the programs built on top of it. The first UNIX was written in 1970. Today, everything from the most powerful computing infrastructure at Google and other technology giants to the little smartwatch on your wrist runs some variant of this operating system. How did UNIX achieve such longevity, and yet remain at the forefront of innovation, unlike the stagnant COBOL systems?
One of the defining features of UNIX was the “system call interface” – a set of C APIs that captured the basic intent of what a developer wanted to do, along with a set of expectations of what the underlying machine would do when the system call was invoked. Here’s what a system call looks like, pulled from the latest macOS Catalina documentation:
ssize_t read(int fildes, void *buf, size_t nbyte);
read() attempts to read nbyte bytes of data from the object referenced by the descriptor fildes into the buffer pointed to by buf.
This is the system call that you would make to read some data from… anything! Note that fildes is just an integer that identifies some notion of a file (a collection of bytes) without saying anything about where the file is to be found or how it is stored on specific hardware. Storage devices have gone from magnetic tapes to floppy disks to magnetic disk drives and now NVMe solid state drives, but the caller of read() doesn’t need to care; UNIX makes the machine fulfil its promise of performing the read. For reference, the corresponding entry in the 1973 AT&T UNIX manual reads almost identically.
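To see how stable this intent has remained, the same call can be issued from a high-level language today: Python’s `os.read` is a thin wrapper over that very system call, and it works identically whatever medium backs the descriptor.

```python
import os
import tempfile

# The "object referenced by the descriptor" here is an ordinary temp
# file; it could just as well be a pipe, a socket, or a device node.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"hello, platform")
    os.lseek(fd, 0, os.SEEK_SET)  # rewind to the start of the file
    data = os.read(fd, 15)        # read(fildes, nbyte) via the syscall
    print(data)                   # b'hello, platform'
finally:
    os.close(fd)
    os.unlink(path)
```

Fifty years of storage hardware later, a program written against this intent still doesn’t know or care what it’s actually reading from.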
A successful, innovative Platform is built upon a collection of programming intents that it fulfils in a well defined manner. The set of intents should be minimal and orthogonal. Orthogonality in software design is the ability to combine any set of features with meaningful, consistent results. The intents that the platform supports should completely abstract the infrastructure, yet be low level enough to support a wide variety of business requirements through various combinations. A great example of such intents is the RISC CPU architecture, now deployed everywhere from smart devices to supercomputers and personal computers.2
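As a small illustration of coding the Program against platform intents (all names here are invented for the sketch), the business logic below depends only on a minimal storage intent, never on the infrastructure behind it:

```python
from abc import ABC, abstractmethod


class Storage(ABC):
    """Hypothetical platform intent: durable byte storage.

    The program codes against these two intents only; whether the bytes
    land on a local disk, an object store, or an in-memory fake is the
    platform's concern, not the program's.
    """

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class InMemoryStorage(Storage):
    # A trivial platform implementation, handy for tests.
    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


# Business logic (the "Program") expressed purely in terms of intents:
def archive_order(storage: Storage, order_id: str, payload: bytes) -> None:
    storage.put(f"orders/{order_id}", payload)


store = InMemoryStorage()
archive_order(store, "42", b'{"total": 100}')
print(store.get("orders/42"))  # b'{"total": 100}'
```

Swapping `InMemoryStorage` for a production implementation changes nothing in `archive_order`, which is exactly the longevity the read() example demonstrates at the operating-system level.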
Separating the Program and the Platform is a powerful and purpose-driven way to keep innovating from one point of stability to the next, without getting worn down by incessant experimentation or stagnating with fear of change.
The Way Forward
To move forward, we look back at the five competitive advantages in technology that have been covered so far.
A technology business that is built as a Program that runs on a separated Platform which enables Lean Development using TDD as a stabilising keystone and works on top of an Elastic infrastructure in the Public Cloud has very strong competitive technology advantages.
If that’s not what you have right now, you might want to plan for it soon because every business is being transformed into a technology business and it’s not fun to have it driven by a handful of technology giants that already have these advantages.
- and know classic data structures and algorithms while also being aware of memory fragmentation tuning in redis and threadpool configuration in Java and setting up load balancing properties in nginx while spinning up VMs and deploying 100+ microservices, and are diligent enough to deploy circuit breakers and can write a rate limiter while half asleep… and then also know how to build Android and iOS apps for good measure. Well, that’s a good full-stack developer if you ask me!
- Although Intel x86 CPUs are CISC, not RISC, the CPU has been designed with a RISC core. The CISC instructions supported by the CPU are broken down into RISC micro-instructions before being executed.