High Availability and Disaster recovery approach in Oracle Cloud Infrastructure

  • Writer: Subham Dutta
  • Jul 15, 2022
  • 6 min read

Data is the most important asset of an organisation. It can take any form, structured or unstructured, and must be protected at all times for the organisation to benefit from it. Data generated by applications can be categorised as high, medium or low priority, and at a minimum it should be backed up and highly available when required. Organisations today and in the future will rely more on data than on any other asset to gain an edge over their competition and create business insights.


In traditional environments, creating a disaster recovery setup was an uphill task: it required new hardware and licences, and implementing and maintaining an effective disaster recovery plan could be difficult.


Enter the cloud. With the rapid pace of innovation and the low cost of public cloud infrastructure, customers are increasingly looking to the public cloud to host their DR solution, in both hybrid and full-fledged cloud DR setups.


Oracle Cloud provides disaster recovery and high availability solutions optimised for both Oracle and non-Oracle applications. They are built around a well-constructed set of best practices, techniques and architectures with low complexity, and aim to assure business operability in the aftermath of a disaster. These solutions allow businesses to move from otherwise rigid business continuity plans to a new level of control and flexibility.


Disaster Scenarios

Planning for DR requires a thorough understanding of all the possible scenarios that can cause disasters.

  • Application Failure

An application can fail because of failures in the underlying infrastructure or issues related to changes in software or hardware configuration. It’s important to include monitoring capability in your DR solution design so that application failures are detected and alerts are sent. Depending on your requirements, your DR solution can range from simply backing up application data and configurations to a fully active-to-active failover setup that seamlessly mitigates many types of failures.
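The detect-and-alert idea above can be illustrated with a minimal sketch. It assumes the health probe results are already available as a sequence of booleans and that an alert should fire only after several consecutive failures, so that a single blip does not page anyone. The function name `detect_failure` and the default threshold of three are illustrative assumptions, not part of any OCI monitoring API.

```python
from typing import Iterable, List


def detect_failure(probe_results: Iterable[bool], threshold: int = 3) -> List[int]:
    """Return the probe indices at which an alert would fire.

    An alert fires once `threshold` consecutive probes have failed. It does
    not fire again until the application has reported healthy at least once.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")

    alerts: List[int] = []
    consecutive_failures = 0
    alerted = False  # True while the current outage has already been reported

    for index, healthy in enumerate(probe_results):
        if healthy:
            # Recovery resets both the failure streak and the alert latch.
            consecutive_failures = 0
            alerted = False
            continue
        consecutive_failures += 1
        if consecutive_failures >= threshold and not alerted:
            alerts.append(index)
            alerted = True

    return alerts
```

In a real deployment the boolean sequence would come from whatever health endpoint the application exposes, and the returned indices would be replaced by a call to the alerting channel of your choice.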

  • Network Failure

For DR, consider potential network outages in your cloud environment. For example, if you use an IPSec VPN connection to connect your on-premise data centers to Oracle Cloud, the IPSec VPN connection could experience network performance or outage issues. We recommend setting up multiple IPSec VPN connections or using both FastConnect and IPSec VPN connections so that you have sufficient redundancy for your network connections.

  • Data Center Failure

An unexpected event could affect an entire data center (availability domain). In your DR solution design, plan for this kind of failure. If your region has multiple availability domains, we recommend deploying your applications across the availability domains to accommodate potential issues for a particular data center. If your region has only one availability domain, consider a combination of multiple fault domains and multiple-region configurations, as defined in the recommendations for a region failure.
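As a rough illustration of the placement advice above, the sketch below assigns instances round-robin so that consecutive instances land in different availability domains when several exist, and in different fault domains when the region has only one. The function `spread_instances` and the labels used for domains are placeholders chosen for this example; they are not real OCI identifiers.

```python
from itertools import cycle
from typing import Dict, List, Tuple


def spread_instances(
    instance_names: List[str],
    availability_domains: List[str],
    fault_domains: List[str],
) -> Dict[str, Tuple[str, str]]:
    """Map each instance to an (availability domain, fault domain) pair.

    Slots are ordered so the availability domain changes fastest. With one
    availability domain, this degrades to spreading across fault domains.
    """
    if not availability_domains or not fault_domains:
        raise ValueError("at least one availability domain and one fault domain are required")

    # Outer loop over fault domains, inner loop over availability domains,
    # so neighbouring slots differ in availability domain whenever possible.
    slots = [(ad, fd) for fd in fault_domains for ad in availability_domains]

    return dict(zip(instance_names, cycle(slots)))
```

For example, three web servers in a single-AD region with three fault domains end up in three different fault domains, while the same call with three ADs places them in three different ADs.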

  • Region Failure

A natural disaster could cause an entire Oracle Cloud Infrastructure region to be out of service. This scenario could be one of the most severe cases in your DR design. To protect against this scenario, deploy your workloads across multiple Oracle Cloud Infrastructure regions. Depending on your DR goals (RTO and RPO), you can back up or replicate your data to another region, or set up a fully active-to-active standby in another region.


Identifying the Right DR strategy for Business-Critical Apps

The most important factors in selecting a solution for your business are tolerable data loss, expressed as the recovery point objective (RPO), and tolerable downtime, expressed as the recovery time objective (RTO).


Based on the criticality of the application we discussed, there can be three scenarios:


Ø The application is less critical and can tolerate hours of lost data with no guaranteed recovery time. In that scenario, a plain vanilla backup to the cloud is the answer. Probably every application needs, or at least deserves, this level of protection. The data is backed up to the cloud or to some distant location so that something can be recovered when disaster strikes. There is no guarantee, though, and there is a trade-off in the time between the disaster striking and the systems being started, or the data recovered, at the offsite location. Still, this is the baseline level every application should be at.


Ø Next are the applications at the heart of the enterprise, which cannot sustain even a minute or half an hour of downtime or data loss. If you need near-zero or zero downtime, or less than a few seconds of data loss (basically zero data loss after a site-wide outage), then you want the active-active solution. Of course this comes with more cost and effort, but if you need it, Oracle can deliver it. Here identical applications run across multiple cloud regions, or multi-cloud regions, with proper connectivity.


Ø Most other applications fall into the middle category. They might not be super critical, but they still deserve some respect in terms of backup and accessibility when disaster strikes. They might not need a zero downtime/zero data loss solution: if the RTO/RPO requirement is, say, under ~3 hrs, active-standby should be a good bet, and if the requirement can be stretched further, to under ~20-24 hrs, pilot light can be an approach. This is what most enterprises look for in their applications. We can run database replication as active-passive, but we still need to back up the servers, so that when disaster strikes the backup can be used as a golden image to spin up the application and servers in DR. An extension of this approach is active-passive with standby servers kept ready, so that when disaster strikes we can switch the systems on and switch them over to production. This brings recovery time down to minutes, alongside the already low data loss.


As discussed earlier, the decision to move from simple plain vanilla backup -> to pilot light -> to active-passive -> to active-active depends on the tolerance and practicality of your application. The lower the RPO and RTO, the stronger the inclination towards active-active. Cost and complexity obviously play a critical role too, since an active-active setup in the cloud has its own cost implications.
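The progression above can be expressed as a small decision helper. This is only a sketch: the ~3 hour and ~24 hour limits come from the rough figures mentioned in this post, while the 5 second "near zero" cutoff, the use of the tighter of RPO and RTO as the deciding value, and the name `choose_dr_strategy` are assumptions made for illustration. Cost and complexity, which matter just as much, are not modelled.

```python
from datetime import timedelta

# Assumed cutoff for "a few seconds" of tolerable loss or downtime.
NEAR_ZERO_LIMIT = timedelta(seconds=5)
# Approximate limits taken from the discussion above.
ACTIVE_PASSIVE_LIMIT = timedelta(hours=3)
PILOT_LIGHT_LIMIT = timedelta(hours=24)


def choose_dr_strategy(rpo: timedelta, rto: timedelta) -> str:
    """Suggest a DR tier from the tighter of the two objectives."""
    tightest = min(rpo, rto)
    if tightest <= NEAR_ZERO_LIMIT:
        return "active-active"
    if tightest < ACTIVE_PASSIVE_LIMIT:
        return "active-passive"
    if tightest < PILOT_LIGHT_LIMIT:
        return "pilot light"
    return "backup and restore"


if __name__ == "__main__":
    # A payment system that can lose at most one second of data.
    print(choose_dr_strategy(timedelta(seconds=1), timedelta(minutes=1)))
    # An internal reporting tool that can be down for two days.
    print(choose_dr_strategy(timedelta(days=2), timedelta(days=2)))
```

Treat the result as a starting point for discussion rather than a verdict; an application whose budget cannot support active-active may have to renegotiate its RPO and RTO instead.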


Disaster Recovery: On prem and multi-AD and multi-region


Ø The first-phase approach can be to establish disaster recovery within a single region. Most newer regions have a single-AD architecture, while older regions such as Phoenix have multiple ADs. Customers can set up DR in a single-AD region with workloads spread across multiple fault domains, or in a multi-AD region with workloads spread across multiple ADs. The diagram below covers the latter case: workloads spread across multiple availability domains within a single OCI region.


[Diagram: workloads spread across multiple availability domains within a single OCI region]

Ø The next approach is more robust: it extends the previous one to multi-AD, multi-region, a true-blue disaster recovery setup for the most demanding workloads.



Components across ADs run in active-active clustering to mitigate AD failure, and in active-passive or active-active across regions depending on appetite or necessity. VCN peering is required for connectivity between regions. Block storage can be replicated across regions, and so can Object Storage between buckets in different regions. Data Guard or Active Data Guard can be used here, depending on the use case and database version.


Ø Customers can use a hybrid DR scenario with a stripped-down version of their on-premise setup, using a standard Object Storage bucket in an Oracle Cloud Infrastructure region as a cold standby environment, which is a powered-off replica of the primary on-premise environment. This setup requires an established and tested method to replicate the production data asynchronously from the on-premise environment to Oracle Cloud Infrastructure Object Storage through a storage gateway. If the primary environment is affected by a disaster, the cold standby can be powered on programmatically, and the data can be restored from Object Storage to the standby environment. The grade of the stateful-stateless split affects the effort required for creating the data-restoration routine and its complexity. The higher the stateless grade at the application layer, the easier it is to create a restoration routine that is based on only the database layer.
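The phrase "powered on programmatically" implies an ordered runbook: power on the standby, restore data, then redirect traffic, stopping as soon as a step fails. The sketch below shows only that ordering logic. The step names and the `run_failover` helper are hypothetical, and the actions are stubs; real steps would call your own automation for the standby environment and Object Storage.

```python
from typing import Callable, List, Optional, Tuple

# A step is a human-readable name plus a zero-argument action.
Step = Tuple[str, Callable[[], None]]


def run_failover(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run failover steps in order and stop at the first failure.

    Returns the names of the completed steps and the name of the failed
    step, or None if every step succeeded.
    """
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception:
            # Later steps depend on earlier ones, so do not continue.
            # A real runbook would also log the exception here.
            return completed, name
        completed.append(name)
    return completed, None


if __name__ == "__main__":
    audit_log: List[str] = []
    # Stub actions standing in for real automation (hypothetical names).
    runbook: List[Step] = [
        ("power_on_standby", lambda: audit_log.append("standby powered on")),
        ("restore_from_object_storage", lambda: audit_log.append("data restored")),
        ("switch_traffic", lambda: audit_log.append("traffic switched")),
    ]
    print(run_failover(runbook))
```

Keeping the runbook as data makes it easy to rehearse: the same list of steps can be exercised in a DR drill with stubbed actions before it is trusted in a real outage.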



Ø If the customer environment is VMware heavy, Oracle offers a VMware solution in Oracle Cloud, OCVS (Oracle Cloud VMware Solution), which can be used as an extension of the on-prem environment. More on the link:


Note: If anyone is interested in a hands-on lab, Oracle provides LiveLabs. Please find the link for HA deployment in OCI: https://apexapps.oracle.com/pls/apex/r/dbpm/livelabs/view-workshop?wid=651&clear=RR,180&session=102370043762212

 
 
 


©2022 by CloudIdeas. Proudly created with Wix.com
