Best Practices for High-Capacity Cloud Applications

Background: In 2021, Trigyn deployed a large-scale, cloud-native vaccination management system to serve a population of more than one billion people. Because of the fears surrounding the pandemic and the pent-up demand for a vaccine, the system had to be capable of withstanding extreme surges in service demand. Trigyn was tasked with designing, building, and managing an application that would not only withstand the stress of extremely high traffic from users seeking access to a vaccine, but also manage the entire vaccination program, from the supply chain through appointment booking, clinic management, and the vaccine passport system.

Trigyn’s Cloud Services Team, part of our Digital Transformation Practice, was tasked with completing the initial release in just three months, from planning to delivery. The cloud-native solution architected by Trigyn was delivered on time and bug-free, and it exceeded performance expectations. The application successfully handled more than 1 billion transactions a day on multiple occasions throughout 2021.

Approach: Building a cloud-native application capable of handling extreme and sustained loads requires much more than well-written code. The task required a systematic analysis of every aspect of the solution, from software design and infrastructure planning through to the website’s operational plan. Because of the program’s high profile, the need for transparency in the appointment booking process, and the sensitive personal information being captured, security was a paramount concern. With this in mind, our team employed a DevSecOps approach to planning and implementing the solution, making security a key consideration throughout every phase of the project lifecycle.

Our team developed a project plan they were confident could handle the expected loads. To get there, Trigyn’s cloud services team of senior architects critiqued every aspect of the project from start to finish. They worked closely with technology vendors and partners to understand their capabilities and limitations, evaluating numerous cloud-native services, open-source systems, and tools available on cloud marketplaces. With a thorough understanding of the available technologies, Trigyn developed a best-of-breed approach to make the system scalable and secure while controlling the overall spend.

After choosing Amazon Web Services (AWS) as the cloud vendor, Trigyn engaged AWS’s professional services team to validate the architected framework, including a review of the services chosen and those intentionally omitted, to arrive at the final project plan.

The final application was subjected to performance and security testing by an independent professional testing service which confirmed the security and scalability of the application and validated our design decisions.

Scalability: Scalability was a paramount requirement for the cloud application. To achieve it, Amazon’s Elastic Kubernetes Service (EKS) was chosen. A serverless approach was selected because it was deemed better able to handle the anticipated service demands. Appropriate limits were set in EKS so that nodes would scale in and out automatically based on pre-set parameters such as CPU utilization. Containers were built from Node.js code and developed as microservices, with EKS orchestrating these services. After basic testing, stress testing was performed to confirm the system scaled to the forecasted demand.
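
As a rough illustration of the container pattern described above, the sketch below shows what one such Node.js microservice might look like in TypeScript, with the health-check endpoint an orchestrator such as EKS probes and a graceful-shutdown hook so scale-in events drain cleanly. It assumes Express; the service route, port, and response shape are hypothetical and not taken from the project.

```typescript
// Minimal sketch of a containerized Node.js microservice (assumes Express; names are illustrative).
import express, { Request, Response } from "express";

const app = express();
const PORT = Number(process.env.PORT ?? 3000);

// Health-check endpoint probed by the orchestrator; failing probes trigger a container restart.
app.get("/healthz", (_req: Request, res: Response) => {
  res.status(200).send("ok");
});

// Hypothetical business endpoint served by this microservice.
app.get("/appointments/:id", (req: Request, res: Response) => {
  res.json({ id: req.params.id, status: "confirmed" });
});

const server = app.listen(PORT, () => console.log(`listening on ${PORT}`));

// Graceful shutdown so open connections drain when the autoscaler scales the deployment in.
process.on("SIGTERM", () => server.close(() => process.exit(0)));
```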

Performance could not be compromised. System testing included performance optimization for tens of thousands of concurrent writes per second using a variety of load testing tools and techniques. Through continuous optimization, the team was subsequently able to scale the cloud application to even higher concurrency levels. With adjustments to system parameters, optimization of APIs, minification of front-end JavaScript code, and various other initiatives, performance was increased to the point where the system demonstrated the ability to scale far beyond any scenario we anticipated and easily handle loads exceeding 1 billion transactions per day. Every aspect of the system was scrutinized to ensure it warmed up and scaled to the client’s needs during demand surges, and also scaled down during periods of reduced demand to lower total cost of ownership (TCO). To ensure optimal performance at scale, CloudFront was used as the Content Delivery Network (CDN), with a Redis cache engine used to cache database transactions. Overall, approximately 60% of hits were served from Redis rather than the database.
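
The caching behavior behind those Redis figures follows the familiar cache-aside pattern, sketched below in TypeScript. This is illustrative only: it assumes the ioredis client, and the key name, TTL, and loadFromDatabase helper are hypothetical stand-ins for the project’s actual data access code.

```typescript
// A minimal cache-aside sketch, assuming the ioredis client; all names and TTLs are illustrative.
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

// Placeholder for the real database read.
async function loadFromDatabase(slotId: string): Promise<string> {
  return JSON.stringify({ slotId, remainingDoses: 42 });
}

export async function getVaccinationSlot(slotId: string): Promise<string> {
  const cacheKey = `slot:${slotId}`;

  // Serve from Redis when possible; in the deployment described above,
  // roughly 60% of hits were answered from the cache rather than the database.
  const cached = await redis.get(cacheKey);
  if (cached !== null) return cached;

  // Cache miss: read from the database and populate the cache with a short TTL
  // so surges of identical reads do not all reach the database.
  const fresh = await loadFromDatabase(slotId);
  await redis.set(cacheKey, fresh, "EX", 30);
  return fresh;
}
```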

Security: Security of the cloud application was of paramount importance. Security planning started at the account level, following the principle of least privilege. A Web Application Firewall (WAF) was deployed to protect the system from vulnerabilities. At the network level, the entire application was hosted in private subnets, with a single public subnet containing the bastion host. Network Access Control Lists (NACLs) and security groups were created to allow only the expected traffic on specific ports and protocols. An API Gateway was configured with appropriate rate limits to manage API calls to the containers hosting the application code. This provided flexibility in restricting traffic, while the load balancer distributed traffic over HTTPS for secure data transfer. All HTTP requests were redirected to HTTPS to ensure data in transit was always encrypted.
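
The HTTP-to-HTTPS redirection mentioned above is commonly handled behind a load balancer by inspecting the standard X-Forwarded-Proto header. The TypeScript sketch below shows one way this might look, assuming Express; it is an illustration of the technique, not the project’s actual code.

```typescript
// A minimal sketch of forcing HTTPS behind a load balancer (assumes Express; illustrative only).
import express, { NextFunction, Request, Response } from "express";

const app = express();

// Redirect any request that arrived over plain HTTP to its HTTPS equivalent,
// so data in transit is always encrypted end to end.
app.use((req: Request, res: Response, next: NextFunction) => {
  if (req.headers["x-forwarded-proto"] === "http") {
    return res.redirect(301, `https://${req.headers.host}${req.originalUrl}`);
  }
  next();
});

app.get("/", (_req: Request, res: Response) => res.send("secure hello"));
app.listen(8080);
```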

AWS Shield was used to protect against DDoS attacks. GuardDuty was used for threat detection and to protect accounts and sensitive data, while Amazon Inspector helped with vulnerability management and compliance. AWS Key Management Service (KMS) managed the keys the application used to connect to various services securely. Encryption at the OS and database levels ensured data at rest was always encrypted. In addition, numerous application-level measures were implemented to secure the overall application.
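
For illustration, the sketch below shows how an application might call KMS to encrypt and decrypt a connection secret using the AWS SDK for JavaScript v3. The key alias and function names are hypothetical and not taken from the project; they simply show the kind of call pattern involved.

```typescript
// A minimal KMS encrypt/decrypt sketch, assuming the AWS SDK v3 client; key alias is hypothetical.
import { KMSClient, EncryptCommand, DecryptCommand } from "@aws-sdk/client-kms";

const kms = new KMSClient({ region: process.env.AWS_REGION ?? "us-east-1" });
const KEY_ID = "alias/app-database-credentials"; // hypothetical key alias

// Encrypt a secret (e.g. a database connection string) under the KMS key.
export async function encryptSecret(plaintext: string): Promise<Uint8Array> {
  const { CiphertextBlob } = await kms.send(
    new EncryptCommand({ KeyId: KEY_ID, Plaintext: Buffer.from(plaintext) })
  );
  return CiphertextBlob!;
}

// Decrypt the stored ciphertext back into the original secret at startup.
export async function decryptSecret(ciphertext: Uint8Array): Promise<string> {
  const { Plaintext } = await kms.send(new DecryptCommand({ CiphertextBlob: ciphertext }));
  return Buffer.from(Plaintext!).toString("utf8");
}
```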

Having a secure application also calls for proper logging and monitoring of the overall system, including limiting access to logs because of the sensitive information they contain at the network, application, and service levels. The monitoring system combined Amazon VPC Flow Logs, application-level logs, and service-level logs, providing an ongoing perspective on network activity, application performance, and the health of the various services that make up the system. Amazon CloudWatch and Lambda functions were used to proactively alert Trigyn teams of any issues.
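
A typical way to wire up that kind of proactive alerting is a Lambda function subscribed to an SNS topic that receives CloudWatch alarm notifications. The TypeScript sketch below illustrates the idea; the notification target is a hypothetical placeholder, not the alerting channel the project actually used.

```typescript
// A minimal sketch of a Lambda reacting to CloudWatch alarms delivered via SNS (illustrative only).
import type { SNSEvent } from "aws-lambda";

// Placeholder for the real alerting channel (email, chat webhook, pager, etc.).
async function notifyOnCallTeam(message: string): Promise<void> {
  console.log("ALERT:", message);
}

export const handler = async (event: SNSEvent): Promise<void> => {
  for (const record of event.Records) {
    const alarm = JSON.parse(record.Sns.Message);
    // Only escalate when the alarm actually transitions into the ALARM state.
    if (alarm.NewStateValue === "ALARM") {
      await notifyOnCallTeam(`${alarm.AlarmName}: ${alarm.NewStateReason}`);
    }
  }
};
```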

Technologies Used: All aspects of the system, including registration, notifications, analytics, and reporting, were built on a microservices architecture using current technologies such as HTML5, Angular, Node.js, Power BI, SQL and NoSQL databases, AWS services including S3, EKS, ALB, Redis Cache, SNS, Pinpoint, and API Gateway, together with JWT authentication and other services.
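
As an example of one of the listed techniques, the sketch below shows a minimal JWT-authentication middleware in TypeScript, assuming Express and the jsonwebtoken package. The secret source and claim handling are illustrative only and do not reflect the project’s actual token scheme.

```typescript
// A minimal JWT-verification middleware sketch (assumes Express and jsonwebtoken; illustrative only).
import { NextFunction, Request, Response } from "express";
import jwt from "jsonwebtoken";

const JWT_SECRET = process.env.JWT_SECRET ?? "replace-me"; // hypothetical secret source

export function requireAuth(req: Request, res: Response, next: NextFunction): void {
  const header = req.headers.authorization ?? "";
  const token = header.startsWith("Bearer ") ? header.slice(7) : null;
  if (!token) {
    res.status(401).json({ error: "missing bearer token" });
    return;
  }
  try {
    // Throws if the signature is invalid or the token has expired.
    const claims = jwt.verify(token, JWT_SECRET);
    (req as Request & { user?: unknown }).user = claims;
    next();
  } catch {
    res.status(401).json({ error: "invalid or expired token" });
  }
}
```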

Key Learnings:

  1. Collaboration with the Cloud Services Provider is essential. Our discussions with AWS staff helped validate our architecture decisions and technology selections to ensure delivery of a successful project.

  2. Third-Party Systems played an Important Role. In the end, a combination of native and third-party systems was used based on the needs of the cloud application. Although third-party systems can add cost, in many cases they offered significant advantages over native solutions which made them an easily justifiable investment.

  3. Automation is critical. For applications of this scale, automation is an essential component of the project plan. Automation was implemented wherever possible, especially in the areas of DevSecOps and IaC.

  4. Plan for Upward and Downward Scalability. Discussions of scalability usually focus on ramping up capacity to match surges in service demand. Proactively managing scalability to also adjust to demand lulls can result in significant operational savings.

  5. Use Parallel Development Environments. Due to the tight timelines, the team was challenged to release multiple features within very short timeframes. To achieve this, the team would temporarily spin up separate environments for various components and stakeholders to ensure unimpeded progress.

  6. Plan DevSecOps early. Early prioritization of DevSecOps, IaC, monitoring, and instrumentation allowed DevSecOps requirements to be incorporated into the system design before any code had been written, giving the team the flexibility needed to try new features and quickly integrate third-party services.