Introduction
Akumina is the leading global Employee Experience Platform (EXP) software provider. Akumina’s EXP operates by leveraging the Microsoft Azure cloud, Microsoft’s cloud productivity suite, and Microsoft’s Office 365 cloud. Akumina maintains an industry-accepted Business Continuity Program (“BCP”) which ensures that our Customers’ business is not affected by any interruption. The BCP is functionally continuous and able to meet Akumina’s stated Service Level Agreements with its Customers. Akumina’s BCP restores standard operating procedures for its software platform, including AppManager, ServiceHub, Employee Experience Platform Base, and Headless solutions.
In the event of a disaster occurring inside Microsoft’s Cloud, the primary goals of this plan are the following:
• Minimize interruptions to normal operations.
• Limit the extent of disruption and damage.
• Minimize the economic impact of the interruption.
• Establish alternative means of operation in advance.
• Train personnel in emergency procedures.
• Provide for rapid restoration of service.
Akumina’s cloud service architecture enables shared compute (WebApp, Functions, and Batch Processor) and shared data stores such as Cosmos DB and Azure Storage. The purpose of this document is to explain the technical architecture of the application disaster recovery plan and how it conforms to Akumina’s formal BCP.
Architecture
Akumina cloud service architecture using Microsoft Azure.
SLA and Uptime
Each Akumina customer agrees to Akumina’s Service Level Agreement as part of their EXP subscription purchase, and all uptimes depend on the uptime provided by Microsoft Azure services. The following table is for reference only; for the latest Microsoft uptime and Service Level Agreement, please refer to Microsoft’s Service Level Agreement on Microsoft.com: https://azure.microsoft.com/en-us/support/legal/sla/
In addition to, and outside of, the uptime guarantees committed to by Microsoft, Akumina has a scheduled maintenance window that falls outside normal business hours and is communicated to each Customer in advance. Akumina’s maintenance window is up to 4 hours per month.
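To put the maintenance window in perspective, the short calculation below shows what share of a month remains outside a 4-hour window. It is only an illustrative sketch: the 30-day month is an assumption, and the function name is hypothetical rather than part of any Akumina or Microsoft tooling.

```python
# Assumption: a 30-day month. The 4-hour figure comes from the text above.
HOURS_PER_MONTH = 30 * 24  # 720 hours


def availability_after_maintenance(maintenance_hours: float,
                                   hours_in_month: float = HOURS_PER_MONTH) -> float:
    """Return the percentage of the month that falls outside the maintenance window."""
    if not 0 <= maintenance_hours <= hours_in_month:
        raise ValueError("maintenance_hours must be between 0 and hours_in_month")
    return 100.0 * (hours_in_month - maintenance_hours) / hours_in_month


# A full 4-hour window leaves 716 of 720 hours available.
print(f"{availability_after_maintenance(4):.2f}%")  # prints 99.44%
```

This figure is separate from, and in addition to, whatever uptime Microsoft commits to for the underlying Azure services.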
Disaster Recovery for Multi-Region AppManager
Multi-Region AppManager
Akumina’s top priority is to provide business resilience and continuity. Disaster recovery plans as defined by the BCP are built carefully to minimize the business impact of natural and human-made disasters such as power outages, catastrophic software failures, and network outages. Akumina employs a multi-region strategy that is deployed with backup in geographically distributed Microsoft Azure data centers (regions). When the physical infrastructure in one region is unavailable, the service can still be continued in another region. Akumina cannot guarantee uptimes as it relates to Customer’s networks, the Internet’s performance, Acts of God, and anything else outside of Akumina’s direct control.
Backing up data
Microsoft Azure Storage: Akumina stores configurations in Azure Storage blobs and tables. Azure Storage Queues are used for background tasks such as content distribution, streams, and people sync. Akumina leverages the Azure Storage Geo-Redundant Storage feature to enable a secondary endpoint; Microsoft replicates the data to a secondary region. In addition, Akumina enables continuous vault backup for the Blob containers.
Microsoft Cosmos DB (SQL API): Microsoft Cosmos DB (SQL API) is used to store Akumina application data such as users, groups, streams, and social data. Akumina configures Cosmos data replication to at least two regions (i.e., 2 × 4 replicas) for high availability.
Application Files: Application files are updated using the package URL or DevOps, and a second region is always deployed using the same package URL to keep both primary and secondary application files consistent.
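As an illustration of the secondary endpoint mentioned under Microsoft Azure Storage, the sketch below derives the primary and secondary endpoint URLs for a storage account. It assumes read access to the secondary region is enabled on the account, in which case Azure exposes the secondary under the account name with a `-secondary` suffix. The account name `examplestorage` and the helper function are hypothetical.

```python
def storage_endpoints(account: str, service: str = "blob") -> dict[str, str]:
    """Build primary and secondary endpoint URLs for an Azure Storage account.

    Assumption: the account uses a geo-redundant option with read access
    to the secondary region, so the "-secondary" host name resolves.
    """
    if service not in {"blob", "table", "queue"}:
        raise ValueError(f"unsupported service: {service}")
    suffix = "core.windows.net"
    return {
        "primary": f"https://{account}.{service}.{suffix}",
        "secondary": f"https://{account}-secondary.{service}.{suffix}",
    }


# "examplestorage" is a placeholder account name, not a real Akumina account.
for role, url in storage_endpoints("examplestorage").items():
    print(role, url)
```

Replication to the secondary is asynchronous, so reads from the secondary endpoint may lag slightly behind the primary.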
Backup strategy
Akumina carefully evaluated the following approaches:
• Redeploy on disaster: In this approach, the AppManager and other services are redeployed from scratch at the time of disaster.
• Warm Spare (Active/Passive): A secondary hosted service is created in an alternate region, and related services are deployed to guarantee minimal capacity; however, the secondary services do not receive production traffic.
• Hot Spare (Active/Active): The application is designed to receive a production load in multiple regions. All required services in multiple regions are configured for higher capacity than needed for disaster recovery purposes. Alternatively, the cloud services might scale out as necessary at the time of a disaster and failover.
At Akumina, we have enabled the Active/Active approach for all Microsoft Azure services.
Failover
Akumina implements its disaster recovery strategy using the automatic failover capabilities of Azure services. Specifically, Akumina uses Azure Front Door for failover across regions.
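The failover behavior described here can be pictured with a small model of health-based, priority-based routing. This is a simplified sketch of the general technique, not the Azure Front Door API or Akumina’s actual configuration; the `Origin` class, the region names, and the priority values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Origin:
    name: str       # hypothetical region label
    priority: int   # lower number = preferred
    healthy: bool   # result of the most recent health probe


def pick_origins(origins: list[Origin]) -> list[str]:
    """Return the names of the healthy origins that share the best priority."""
    healthy = [o for o in origins if o.healthy]
    if not healthy:
        return []  # nothing can serve traffic
    best = min(o.priority for o in healthy)
    return [o.name for o in healthy if o.priority == best]


# Active/Active: both regions share the same priority and both receive traffic.
origins = [Origin("region-a", 1, True), Origin("region-b", 1, True)]
print(pick_origins(origins))  # ['region-a', 'region-b']

# When a health probe marks region-a as down, traffic goes to region-b only.
origins[0].healthy = False
print(pick_origins(origins))  # ['region-b']
```

In this model, failback is simply the reverse: once the probe reports the first region healthy again, it rejoins the set of serving origins.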
Disaster Recovery Plan
Recovery Time Objective (RTO)
Microsoft’s estimated recovery response time is less than one hour. Recovery time includes failing over Azure Storage and switching traffic to the secondary region using Azure Front Door.
Recovery Point Objective (RPO)
Geo-redundant storage in Microsoft Azure is used for all storage and configuration files. These configuration files are backed up weekly and whenever they change. In this case, the estimated recovery point objective is a maximum of 60 minutes.
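The two objectives can be checked against the timeline of an incident or a failover test. The sketch below is one way to do that, using the figures stated above (an RTO of less than one hour and an RPO of at most 60 minutes). The function name and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

RTO = timedelta(hours=1)      # recovery must take less than this
RPO = timedelta(minutes=60)   # data loss window must not exceed this


def meets_objectives(last_replicated: datetime,
                     outage_start: datetime,
                     service_restored: datetime) -> dict[str, bool]:
    """Compare an incident timeline against the stated RTO and RPO."""
    data_loss_window = outage_start - last_replicated
    downtime = service_restored - outage_start
    return {
        "rpo_met": data_loss_window <= RPO,
        "rto_met": downtime < RTO,
    }


# Hypothetical timeline: last replication 15 minutes before the outage,
# service restored 40 minutes after it began.
outage = datetime(2024, 1, 1, 12, 0)
print(meets_objectives(outage - timedelta(minutes=15),
                       outage,
                       outage + timedelta(minutes=40)))
# {'rpo_met': True, 'rto_met': True}
```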
Steps to enable disaster recovery:
1. Enable communication with committee members
2. Redeploy Azure Redis cache
3. Verify the Key Vault secrets
4. Switch the traffic to a secondary region
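The steps above are sequential, and each one depends on the one before it. A minimal sketch of a runbook driver that runs them in order and stops at the first failure follows. The `execute` callback is a hypothetical placeholder for whatever manual or automated action performs each step; this is not Akumina’s actual tooling.

```python
from typing import Callable

# Step names are taken from the list above.
STEPS = [
    "Enable communication with committee members",
    "Redeploy Azure Redis cache",
    "Verify the Key Vault secrets",
    "Switch the traffic to a secondary region",
]


def run_runbook(steps: list[str], execute: Callable[[str], bool]) -> list[str]:
    """Run steps in order; stop at the first failing step.

    Returns the list of steps that completed successfully.
    """
    completed = []
    for step in steps:
        if not execute(step):
            break  # do not switch traffic if an earlier step failed
        completed.append(step)
    return completed


# Hypothetical dry run in which every step succeeds.
done = run_runbook(STEPS, lambda step: True)
print(f"{len(done)} of {len(STEPS)} steps completed")  # 4 of 4 steps completed
```

Stopping at the first failure keeps traffic from being switched to a secondary region whose cache or secrets have not been verified.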
Failover and failback testing:
Akumina runs failover and failback testing twice a year, in accordance with the testing requirements in our BCP policy.