Best Practices for High-Capacity Cloud Applications

Background: In 2021, Trigyn deployed a large-scale, cloud-native vaccination management system to serve a population of more than one billion people. Because of the fears surrounding the pandemic and the pent-up demand for a vaccine, the system had to be capable of withstanding extreme surges in service demand. Trigyn was tasked with designing, building, and managing an application that would not only withstand the stress of extremely high traffic from users seeking access to a vaccine, but also manage the entire vaccination program, from the supply chain through appointment booking, clinic management, and the vaccine passport system.

Trigyn’s Cloud Services Team, part of our Digital Transformation Practice, was tasked with completing the initial release in just three months, from planning to delivery. The cloud-native solution architected by Trigyn was delivered on time and bug-free, and it exceeded performance expectations. The application successfully handled more than 1 billion transactions a day on multiple occasions throughout 2021.

Approach: Building a cloud-native application capable of handling extreme and sustained loads requires much more than well-written code. The task required a systematic analysis of every aspect of the solution, from software design and infrastructure planning through to the website’s operational plan. Because of the program’s high profile, the need for transparency in the appointment booking process, and the sensitive personal information being captured, security was a paramount concern. With this in mind, our team employed a DevSecOps approach to planning and implementing the solution, making security a key consideration throughout every phase of the project lifecycle.

Our team developed a project plan they were confident could handle the expected loads. To get there, Trigyn’s cloud services team of senior architects critiqued every aspect of the project from start to finish. They worked closely with technology vendors and partners to understand their capabilities and limitations, evaluating numerous cloud-native services, open-source systems, and tools available on cloud marketplaces. With a thorough understanding of the available technologies, Trigyn developed a best-of-breed approach to make the system scalable and secure while controlling the overall spend.

After choosing Amazon Web Services (AWS) as the cloud vendor, Trigyn engaged AWS’s professional services team to validate the architected framework, including a review of the services chosen and those intentionally omitted, to arrive at the final project plan.

The final application was subjected to performance and security testing by an independent professional testing service which confirmed the security and scalability of the application and validated our design decisions.

Scalability: Scalability was a paramount requirement for the cloud application. To achieve it, Amazon’s Elastic Kubernetes Service (EKS) was chosen. A serverless approach was selected because it was deemed better able to handle the anticipated service demands. Appropriate limits were set in EKS so that nodes would scale in and out automatically based on pre-set parameters such as CPU utilization. Containers were built from Node.js code and developed as microservices, with EKS orchestrating these services. After basic testing, stress testing was performed to confirm the system scaled to the forecasted demand.
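
As a rough illustration of the container pattern described above, the sketch below shows what one such Node.js microservice might look like in TypeScript, with the health-check endpoint an orchestrator such as EKS probes and a graceful-shutdown hook so scale-in events drain cleanly. It assumes Express; the service route, port, and response shape are hypothetical and not taken from the project.

```typescript
// Minimal sketch of a containerized Node.js microservice (assumes Express; names are illustrative).
import express, { Request, Response } from "express";

const app = express();
const PORT = Number(process.env.PORT ?? 3000);

// Health-check endpoint probed by the orchestrator; failing probes trigger a container restart.
app.get("/healthz", (_req: Request, res: Response) => {
  res.status(200).send("ok");
});

// Hypothetical business endpoint served by this microservice.
app.get("/appointments/:id", (req: Request, res: Response) => {
  res.json({ id: req.params.id, status: "confirmed" });
});

const server = app.listen(PORT, () => console.log(`listening on ${PORT}`));

// Graceful shutdown so open connections drain when the autoscaler scales the deployment in.
process.on("SIGTERM", () => server.close(() => process.exit(0)));
```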

Performance could not be compromised. System testing included performance optimization for tens of thousands of concurrent writes per second using a variety of load testing tools and techniques. Through continuous optimization, the team was subsequently able to scale the cloud application to even higher concurrency levels. With adjustments to system parameters, optimization of APIs, minification of front-end JavaScript code, and various other initiatives, performance was increased to the point where the system demonstrated the ability to scale far beyond any scenario we anticipated and easily handle loads exceeding 1 billion transactions per day. Every aspect of the system was scrutinized to ensure it warmed up and scaled to the client’s needs during demand surges, and also scaled down during periods of reduced demand to lower total cost of ownership (TCO). To ensure optimal performance at scale, CloudFront was used as the Content Delivery Network (CDN), with a Redis cache engine used to cache database transactions. Overall, approximately 60% of hits were served from Redis rather than the database.
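
The caching behavior behind those Redis figures follows the familiar cache-aside pattern, sketched below in TypeScript. This is illustrative only: it assumes the ioredis client, and the key name, TTL, and loadFromDatabase helper are hypothetical stand-ins for the project’s actual data access code.

```typescript
// A minimal cache-aside sketch, assuming the ioredis client; all names and TTLs are illustrative.
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

// Placeholder for the real database read.
async function loadFromDatabase(slotId: string): Promise<string> {
  return JSON.stringify({ slotId, remainingDoses: 42 });
}

export async function getVaccinationSlot(slotId: string): Promise<string> {
  const cacheKey = `slot:${slotId}`;

  // Serve from Redis when possible; in the deployment described above,
  // roughly 60% of hits were answered from the cache rather than the database.
  const cached = await redis.get(cacheKey);
  if (cached !== null) return cached;

  // Cache miss: read from the database and populate the cache with a short TTL
  // so surges of identical reads do not all reach the database.
  const fresh = await loadFromDatabase(slotId);
  await redis.set(cacheKey, fresh, "EX", 30);
  return fresh;
}
```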

Security: Security of the cloud application was of paramount importance. Security planning started at the account level, following the principle of least privilege. A Web Application Firewall (WAF) was deployed to protect the system from vulnerabilities. At the network level, the entire application was hosted in private subnets, with a single public subnet containing the bastion host. Network Access Control Lists (NACLs) and security groups were created to allow only the expected traffic on specific ports and protocols. An API Gateway was configured with appropriate rate limits to manage API calls to the containers hosting the application code. This provided flexibility in restricting traffic, while the load balancer distributed traffic over HTTPS for secure data transfer. All HTTP requests were redirected to HTTPS to ensure data in transit was always encrypted.
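
The HTTP-to-HTTPS redirection mentioned above is commonly handled behind a load balancer by inspecting the standard X-Forwarded-Proto header. The TypeScript sketch below shows one way this might look, assuming Express; it is an illustration of the technique, not the project’s actual code.

```typescript
// A minimal sketch of forcing HTTPS behind a load balancer (assumes Express; illustrative only).
import express, { NextFunction, Request, Response } from "express";

const app = express();

// Redirect any request that arrived over plain HTTP to its HTTPS equivalent,
// so data in transit is always encrypted end to end.
app.use((req: Request, res: Response, next: NextFunction) => {
  if (req.headers["x-forwarded-proto"] === "http") {
    return res.redirect(301, `https://${req.headers.host}${req.originalUrl}`);
  }
  next();
});

app.get("/", (_req: Request, res: Response) => res.send("secure hello"));
app.listen(8080);
```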

AWS Shield was used to protect against DDoS attacks. GuardDuty was used for threat detection and to protect accounts and sensitive data, while Amazon Inspector helped with vulnerability management and compliance. AWS Key Management Service (KMS) managed the keys the application used to connect to various services securely. Encryption at the OS and database levels ensured data at rest was always encrypted. In addition, numerous application-level measures were implemented to secure the overall application.
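
For illustration, the sketch below shows how an application might call KMS to encrypt and decrypt a connection secret using the AWS SDK for JavaScript v3. The key alias and function names are hypothetical and not taken from the project; they simply show the kind of call pattern involved.

```typescript
// A minimal KMS encrypt/decrypt sketch, assuming the AWS SDK v3 client; key alias is hypothetical.
import { KMSClient, EncryptCommand, DecryptCommand } from "@aws-sdk/client-kms";

const kms = new KMSClient({ region: process.env.AWS_REGION ?? "us-east-1" });
const KEY_ID = "alias/app-database-credentials"; // hypothetical key alias

// Encrypt a secret (e.g. a database connection string) under the KMS key.
export async function encryptSecret(plaintext: string): Promise<Uint8Array> {
  const { CiphertextBlob } = await kms.send(
    new EncryptCommand({ KeyId: KEY_ID, Plaintext: Buffer.from(plaintext) })
  );
  return CiphertextBlob!;
}

// Decrypt the stored ciphertext back into the original secret at startup.
export async function decryptSecret(ciphertext: Uint8Array): Promise<string> {
  const { Plaintext } = await kms.send(new DecryptCommand({ CiphertextBlob: ciphertext }));
  return Buffer.from(Plaintext!).toString("utf8");
}
```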

Having a secure application also calls for proper logging and monitoring of the overall system, including limiting access to logs because of the sensitive information they contain at the network, application, and service levels. The monitoring system combined Amazon VPC Flow Logs, application-level logs, and service-level logs, providing an ongoing perspective on network activity, application performance, and the health of the various services that make up the system. Amazon CloudWatch and Lambda functions were used to proactively alert Trigyn teams of any issues.
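
A typical way to wire up that kind of proactive alerting is a Lambda function subscribed to an SNS topic that receives CloudWatch alarm notifications. The TypeScript sketch below illustrates the idea; the notification target is a hypothetical placeholder, not the alerting channel the project actually used.

```typescript
// A minimal sketch of a Lambda reacting to CloudWatch alarms delivered via SNS (illustrative only).
import type { SNSEvent } from "aws-lambda";

// Placeholder for the real alerting channel (email, chat webhook, pager, etc.).
async function notifyOnCallTeam(message: string): Promise<void> {
  console.log("ALERT:", message);
}

export const handler = async (event: SNSEvent): Promise<void> => {
  for (const record of event.Records) {
    const alarm = JSON.parse(record.Sns.Message);
    // Only escalate when the alarm actually transitions into the ALARM state.
    if (alarm.NewStateValue === "ALARM") {
      await notifyOnCallTeam(`${alarm.AlarmName}: ${alarm.NewStateReason}`);
    }
  }
};
```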

Technologies Used: All aspects of the system, including registration, notifications, analytics, and reporting, were built on a microservices architecture using current technologies such as HTML5, Angular, Node.js, Power BI, SQL and NoSQL databases, AWS services including S3, EKS, ALB, Redis Cache, SNS, Pinpoint, and API Gateway, together with JWT authentication and other services.
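
As an example of one of the listed techniques, the sketch below shows a minimal JWT-authentication middleware in TypeScript, assuming Express and the jsonwebtoken package. The secret source and claim handling are illustrative only and do not reflect the project’s actual token scheme.

```typescript
// A minimal JWT-verification middleware sketch (assumes Express and jsonwebtoken; illustrative only).
import { NextFunction, Request, Response } from "express";
import jwt from "jsonwebtoken";

const JWT_SECRET = process.env.JWT_SECRET ?? "replace-me"; // hypothetical secret source

export function requireAuth(req: Request, res: Response, next: NextFunction): void {
  const header = req.headers.authorization ?? "";
  const token = header.startsWith("Bearer ") ? header.slice(7) : null;
  if (!token) {
    res.status(401).json({ error: "missing bearer token" });
    return;
  }
  try {
    // Throws if the signature is invalid or the token has expired.
    const claims = jwt.verify(token, JWT_SECRET);
    (req as Request & { user?: unknown }).user = claims;
    next();
  } catch {
    res.status(401).json({ error: "invalid or expired token" });
  }
}
```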

Key Learnings:

  1. Collaboration with the Cloud Services Provider is essential. Our discussions with AWS staff helped validate our architecture decisions and technology selections to ensure delivery of a successful project.

  2. Third-Party Systems played an Important Role. In the end, a combination of native and third-party systems was used based on the needs of the cloud application. Although third-party systems can add cost, in many cases they offered significant advantages over native solutions which made them an easily justifiable investment.

  3. Automation is critical. For applications of this scale, automation is an essential component of the project plan. Automation was implemented wherever possible, especially in the areas of DevSecOps and IaC.

  4. Plan for Upward and Downward Scalability. Discussions of scalability usually focus on ramping up capacity to match surges in service demand. Proactively managing scalability to also adjust to demand lulls can result in significant operational savings.

  5. Use Parallel Development Environments. Due to the tight timelines, the team was challenged to release multiple features within very short timeframes. To achieve this, the team would temporarily spin up separate environments for various components and stakeholders to ensure unimpeded progress.

  6. Plan DevSecOps early. Early prioritization of DevSecOps, IaC, monitoring, and instrumentation allowed DevSecOps requirements to be incorporated into the system design before any code had been written, giving the team the flexibility needed to try new features and quickly integrate third-party services.