Building and operating large-scale software is a relatively young field, and many practitioners, software engineers included, overlook what it takes to do it well. “Hello World” examples make good tutorials, but they can also give the impression that solving software problems is trivial, which understates the work and dedication that quality results require.
When developing a typical service solution at a big tech company, the following factors must be taken into account:
- Requirements: Why is this a problem, and for whom? How are users solving the problem today, and what new challenges may arise once a new solution is implemented? How do we define success, and what constraints do we have in terms of time, budget, and personnel?
- Architecture: Where will the new solution exist for users, and is it brand new or part of an existing system? If it is part of an existing system, how will it interact with the existing components? Do we fully understand the existing system, and is it easy to modify? Who can help us understand the system better, and how do we ensure the new solution is secure, performant, reliable, available, and compliant?
- Design: What data will be managed by the solution, and what is the expected asymptotic run time? How do we recover from errors and failures? What message protocols do we need to interact with the new service? What is the overall cost of building and operating the solution? How will the solution scale? What tools do we need to keep track of development and support? What are the current assumptions and dependencies?
- Coding: What programming languages and tools are best suited to the problem and the personnel? How should the code base be organized so that it is both modular and testable? How do we handle common concerns such as authentication, authorization, auditing, logging, and metrics?
- Building: What are the configuration requirements, and how will secrets be managed? How is the software built and versioned, and how are artifacts distributed? How do we introduce and roll back new versions in a live environment?
- Testing: How do we divide testing between unit, integration, regression, and performance tests? What processes are needed for disaster management?
- Observability: How are the system and software monitored, how are alerts raised, and how are problems troubleshot? What promises are made to users, and how do we ensure that we are meeting those promises (SLI, SLO, SLA)?
- Maintainability: How are parts of the solution and code deprecated? How do we document the system and code? What permissions are required for development and operation of the software, and is user training needed for the system?
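To make the Observability question about promises more concrete, here is a minimal sketch of how a service level indicator (SLI) can be compared against a service level objective (SLO). It assumes a simple request-based availability objective; the class name `AvailabilitySlo`, its methods, and the numbers in the usage note are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySlo:
    """Hypothetical availability objective over one window of requests."""

    target: float  # e.g. 0.999 means 99.9% of requests should succeed

    def __post_init__(self) -> None:
        # A target of exactly 1.0 leaves no error budget, so it is rejected here.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")

    def sli(self, good: int, total: int) -> float:
        """Indicator: the fraction of requests in the window that succeeded."""
        if total <= 0:
            raise ValueError("total must be positive")
        return good / total

    def is_met(self, good: int, total: int) -> bool:
        """True when the measured indicator reaches the objective."""
        return self.sli(good, total) >= self.target

    def error_budget(self, total: int) -> float:
        """Number of failed requests the objective tolerates in the window."""
        return (1.0 - self.target) * total

    def budget_remaining(self, good: int, total: int) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        if total <= 0:
            raise ValueError("total must be positive")
        failed = total - good
        return 1.0 - failed / self.error_budget(total)
```

As a usage example, `AvailabilitySlo(target=0.999).budget_remaining(good=999_500, total=1_000_000)` reports roughly how much room is left for failures before the objective is missed. An SLA would typically be a looser, contractual version of the same objective, which this sketch does not model.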
Equally important is creating a high-performing software engineering team that does the actual work:
- Composition: What is the optimal balance between seasoned, experienced engineers and talented new hires for the problem we are solving? How do we recruit the right people for the team?
- Respectful: How do we create a culture of listening, free from biases and stereotypes, and a safe environment for candid discussions?
- Motivated: How do we articulate the problem, users, and benefits to stakeholders to keep the team motivated? What are the vision, mission, and charter for the team?
- Measurable: How do we keep our goals clear and challenging, and how do we encourage the team to take deliberate risks?
- Feedback: How do we gather feedback from users to understand how we are performing and iterate?
- Bonding: How do we celebrate successes and learn from failures as a team?
- Enable: What processes and tools do we need to build into our software development life cycle to ensure that the team can do their job well? What support is required from the organization (project management)?