Elevating Disaster Recovery With Kubernetes-native Document Databases (part 2)


Reading Time: 12 minutes

Want more insights? Check out the video on this topic.

Nova in Action: A Deep Dive into Disaster Recovery Demo

In the blog post "Elevating Disaster Recovery With Kubernetes-native Document Databases" with Selvi Kadirvel, we explored the theoretical aspects of disaster recovery automation with Nova. Now, let's delve into the practical side by examining a live demonstration presented by Maciek Urbanski. This demo showcases the step-by-step process of setting up a disaster recovery workflow for a document database within a multi-cluster Kubernetes environment using Nova.

The demo environment consists of four key components:

  • Nova Management Cluster: The central hub that orchestrates the entire process.
  • Workload Clusters (2x): These Kubernetes clusters host our applications and Postgres databases.
  • Percona Operator (in Workload Clusters): Installed on each workload cluster to manage the Postgres cluster configurations deployed there. (FerretDB is used as the example document database.)
  • HAProxy (Load Balancer): In this demonstration, HAProxy runs on the same cluster as Percona cluster two for simplicity. In production, HAProxy should reside in a separate Kubernetes cluster with its own dedicated disaster recovery plan for enhanced fault tolerance.

Check out part one of this series, "Elevating Disaster Recovery With Kubernetes-native Document Databases," by Selvi Kadirvel (Tech Lead Engineering Manager at Elotl) for a theoretical foundation on Nova's disaster recovery capabilities.

The demo will be divided into two parts:

Part 1: Day Zero Operations

  • Schedule Policy Creation: Defining policies that dictate workload placement across clusters based on labels or namespaces.
  • Percona Operator Installation: Setting up the Percona operator to manage the Postgres deployments.
  • Percona Cluster Configuration: Configuring and deploying the Postgres clusters in the designated workload clusters.
  • FerretDB Deployment: Installing FerretDB as a sample document database within the workload clusters.
  • HAProxy Setup: Configuring and deploying the HAProxy instance as the load balancer.
  • Recovery Plan Creation: Defining a recovery plan that outlines the automated actions Nova will take in case of a failure.

By completing these steps, we establish a functional database deployment with disaster recovery capabilities orchestrated by Nova.

Part 2: Simulating a Failure and Recovery

  • Failure Simulation: Simulating a database failure to trigger the disaster recovery plan.
  • Simulating the Alert Trigger: In a real-world scenario, an alerting system would detect a database failure and trigger a webhook to notify Nova. For this demonstration, we simulate the alert by injecting an alert object directly into the Nova control plane using the Kubernetes API.
  • Recovery Plan Execution: Observing Nova automatically execute the predefined plan to restore database availability.
  • Post-Recovery Verification: Confirming successful recovery and database accessibility through the HAProxy endpoint.

With this comprehensive demo, you'll gain valuable insights into how Nova simplifies disaster recovery management for your document databases in a Kubernetes environment. Let's proceed with the detailed breakdown of Maciek's demo, starting with the installation process.

Defining FerretDB Placement Policies

Following the creation of scheduling policies for Percona deployments, Maciek moves on to defining policies for FerretDB placement. He creates two separate policies for services that interact with FerretDB, simplifying the visualization of these policies. Additionally, a third policy is introduced with the following functionalities:

  • Spreading FerretDB Deployments: This policy distributes FerretDB deployments across both workload clusters, ensuring redundancy.
  • Replication with Configuration Overrides: It replicates the FerretDB deployments but applies configuration overrides.
  • Tailoring Configuration Overrides: These overrides ensure each deployment connects to the appropriate Percona cluster (Cluster One or Cluster Two) based on its location within the workload cluster.
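Expressed as a Nova custom resource, this third policy might look roughly like the sketch below. Note that the apiVersion, field names, cluster names, and override mechanism shown here are all assumptions for illustration, not the demo's actual manifest; consult the Nova documentation for the real SchedulePolicy schema.

```yaml
# Hypothetical Nova SchedulePolicy: spread FerretDB across both workload
# clusters and override the PostgreSQL connection URL per cluster.
# All field names and values are illustrative assumptions.
apiVersion: policy.elotl.co/v1alpha1
kind: SchedulePolicy
metadata:
  name: ferretdb-spread
spec:
  resourceSelectors:
    labelSelectors:
      - matchLabels:
          app: ferretdb
  spreadConstraints:
    spreadMode: Duplicate            # replicate to every matching cluster
  clusterOverrides:
    - clusterName: workload-cluster-1
      env:
        FERRETDB_POSTGRESQL_URL: postgres://cluster1-ha.percona.svc:5432/ferretdb
    - clusterName: workload-cluster-2
      env:
        FERRETDB_POSTGRESQL_URL: postgres://cluster2-ha.percona.svc:5432/ferretdb
```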

Creating the Percona Operator

The next step involves creating the Percona operator, the critical component responsible for managing the Postgres deployments within the workload clusters. This process entails:

  • Namespace Creation: Establishing a dedicated namespace to house the Percona operator, providing a clear organizational structure.
  • Label Application: Assigning labels defined in the scheduling policies to the relevant resources. This allows Nova to identify and manage these resources based on the predefined policies.
  • Object Labeling with Python Script: Utilizing a Python script to automate the process of adding necessary labels to all objects. This streamlines the labeling process and reduces the risk of human error.
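The core of such a labeling script can be sketched in a few lines of Python. This is a minimal illustration of the idea, not Maciek's actual script, and the label key/value shown is an assumption:

```python
# Sketch of a labeling helper: stamp every parsed Kubernetes manifest with
# the label that the Nova scheduling policy selects on. The label key and
# value ("nova.elotl.co/policy: percona") are illustrative assumptions.
def add_policy_label(manifests, key="nova.elotl.co/policy", value="percona"):
    for obj in manifests:
        # Create metadata/labels maps if missing, then set the policy label.
        labels = obj.setdefault("metadata", {}).setdefault("labels", {})
        labels[key] = value
    return manifests

manifests = [
    {"kind": "Namespace", "metadata": {"name": "percona"}},
    {"kind": "Deployment", "metadata": {"name": "percona-operator",
                                        "labels": {"app": "operator"}}},
]
labeled = add_policy_label(manifests)
print(labeled[0]["metadata"]["labels"])  # {'nova.elotl.co/policy': 'percona'}
```

In practice the script would load the manifests from YAML files (or the cluster API) and write them back, but the labeling logic is the same.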

Once these preparations are complete, Maciek verifies the Percona operator installation within both workload clusters (cluster one and cluster two), confirming its successful deployment across the multi-cluster environment.

Percona Cluster Configuration and Deployment

With the operator in place, the next stage focuses on configuring and deploying the Percona clusters. This involves:

  • Cluster Configuration Distribution: Sending configuration files specifically for cluster one to the first workload cluster and for cluster two to the second workload cluster. This ensures each cluster receives the appropriate configuration for its designated role (primary or secondary).
  • Deployment Verification in Nova: Observing Nova's confirmation of having "two out of one," indicating the successful deployment of replicated objects across the workload clusters. This provides a visual confirmation of redundancy within the environment.

Note: Percona cluster configurations are hidden within workload clusters, managed by the Percona operator. Nova positions these operators and configurations, but doesn't directly display the specifics.
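The primary/secondary split boils down to one flag in each cluster's custom resource. The fragments below are an illustrative sketch based on the Percona operator for PostgreSQL's CRD; the demo's actual names, versions, and sizing may differ:

```yaml
# Cluster one: sent to workload cluster 1, runs as the primary.
apiVersion: pgv2.percona.com/v2
kind: PerconaPGCluster
metadata:
  name: cluster1
spec:
  postgresVersion: 16
  standby:
    enabled: false        # primary: serves reads and writes
  instances:
    - name: instance1
      replicas: 1
---
# Cluster two: sent to workload cluster 2, runs as a hot standby.
apiVersion: pgv2.percona.com/v2
kind: PerconaPGCluster
metadata:
  name: cluster2
spec:
  postgresVersion: 16
  standby:
    enabled: true         # standby: replays WAL from the shared backup repo
    repoName: repo1
```

Flipping these two `standby.enabled` flags is exactly what the recovery plan will automate later in the demo.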

S3 Bucket Access and Cluster Configuration

Before proceeding, Maciek sets up access to an S3 bucket simulated by a Minio instance. This S3 bucket facilitates data synchronization between the primary and secondary Percona clusters, ensuring data consistency in the event of a failover.

Following the S3 access configuration, the actual Percona cluster configurations are applied. However, these configurations remain hidden within the workload clusters and are not directly reflected in Nova. To observe the container initialization and Percona operator activity, Maciek switches the view to the workload clusters. This allows for a more granular inspection of the deployment process.
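The synchronization works because both Percona clusters point their pgBackRest backup configuration at the same bucket. A sketch of the relevant section, with an assumed MinIO endpoint, bucket, and secret name:

```yaml
# Illustrative pgBackRest backup section shared by both Percona clusters.
# The secret holds the MinIO access key and secret key; endpoint, bucket,
# and names are assumptions for this sketch.
spec:
  backups:
    pgbackrest:
      configuration:
        - secret:
            name: cluster-pgbackrest-secrets
      repos:
        - name: repo1
          s3:
            bucket: percona-dr-demo
            endpoint: minio.minio.svc.cluster.local:9000
            region: us-east-1
```

The primary pushes WAL archives to this repo, and the standby continuously restores from it, which is what makes the later promotion near-seamless.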

This concludes the first part of the Nova disaster recovery demo, focusing on Day Zero operations. The environment is now prepared with Percona clusters deployed strategically across workload clusters, FerretDB instances with tailored configurations, and a HAProxy instance ready for configuration in the upcoming section. The groundwork has been laid for a robust disaster recovery solution managed by Nova.

FerretDB Deployment and Service Configuration

Having established the groundwork with Percona clusters and access to an S3 bucket, Maciek proceeds with the deployment of FerretDB. Here’s a breakdown of this step:

  • FerretDB Deployment: A standard FerretDB deployment is initiated. It's important to note that the PostgreSQL URL displayed is a placeholder, as it will be overridden based on the policies defined earlier.
  • FerretDB Service Creation: Two separate services are created, each designated for a specific workload cluster. This segregation is reflected in the service names, making it easier to identify which service corresponds to which cluster.
  • Service Connection Mapping: The first FerretDB service connects to the FerretDB instance deployed within Cluster One (where the primary Percona cluster resides). The second FerretDB service connects to the FerretDB instance deployed within workload Cluster Two (where the secondary Percona cluster resides).
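A minimal sketch of the FerretDB deployment and one of its per-cluster services is shown below. The image tag, service name, and placeholder URL are assumptions; the key point is that `FERRETDB_POSTGRESQL_URL` is the field Nova's policy overrides rewrite per cluster:

```yaml
# Minimal FerretDB Deployment; the PostgreSQL URL is a deliberate
# placeholder that the scheduling-policy overrides replace per cluster.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ferretdb
  labels:
    app: ferretdb
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ferretdb
  template:
    metadata:
      labels:
        app: ferretdb
    spec:
      containers:
        - name: ferretdb
          image: ghcr.io/ferretdb/ferretdb:latest
          ports:
            - containerPort: 27017   # MongoDB wire protocol
          env:
            - name: FERRETDB_POSTGRESQL_URL
              value: postgres://placeholder:5432/ferretdb  # overridden by policy
---
# One externally reachable service per workload cluster (second one analogous).
apiVersion: v1
kind: Service
metadata:
  name: ferretdb-cluster-1
spec:
  type: LoadBalancer
  selector:
    app: ferretdb
  ports:
    - port: 27017
      targetPort: 27017
```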

HAProxy Integration for Single Entry Point

To establish a single point of access for database interactions, Maciek introduces HAProxy. Here’s how it integrates with the existing setup:

  • HAProxy Configuration: The configuration for HAProxy is prepared. This configuration will be crucial for directing user traffic to the appropriate FerretDB service.
  • External IP Retrieval: The external IP address of the first FerretDB service (connected to the primary Percona cluster) is retrieved. This IP address will be used within the HAProxy configuration.
  • HAProxy Deployment: With the configuration in place, HAProxy is deployed.
  • Configuration Verification: The configuration map is checked to confirm that the retrieved external IP has been successfully incorporated for FerretDB service one within the HAProxy configuration.
  • HAProxy Placement: It's important to note that while HAProxy is currently configured to point to service one, it's typically deployed in a separate cluster with its own disaster recovery plan. In this demo, the focus remains on Percona cluster recovery.
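Since FerretDB speaks the MongoDB wire protocol, HAProxy can simply pass TCP traffic through to the current primary's service. A sketch of the ConfigMap, with a made-up backend IP standing in for the retrieved external address:

```yaml
# Illustrative HAProxy ConfigMap: TCP pass-through to the primary FerretDB
# service. The backend IP (203.0.113.10) is a placeholder for the external
# IP retrieved in the previous step.
apiVersion: v1
kind: ConfigMap
metadata:
  name: haproxy-config
data:
  haproxy.cfg: |
    defaults
      mode tcp
      timeout connect 5s
      timeout client  1m
      timeout server  1m
    frontend ferretdb
      bind *:27017
      default_backend primary
    backend primary
      server ferretdb1 203.0.113.10:27017 check
```

During failover, the recovery plan only needs to swap the `server` line's address to the new primary's service IP.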

Shifting View to Console Verification

After successfully deploying HAProxy, Maciek switches to the console view to demonstrate the following:

  • Service Verification: He verifies that both FerretDB services and the HAProxy instance are running as expected using the kubectl get all command.
  • Pre-Failure Database State: He connects to the FerretDB database using a shell container to showcase some pre-populated data. This represents the state of the database before simulating a failure.
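The console session looks roughly like the following; context names, namespaces, and the collection are assumptions for this sketch:

```shell
# Verify that the FerretDB services and HAProxy are up (names assumed).
kubectl get all -n ferretdb

# Connect through HAProxy and look at the pre-populated data.
mongosh "mongodb://<haproxy-external-ip>:27017/demo" --eval 'db.items.find()'
```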

With this setup complete, the environment is prepared for the next stage of the demo: simulating a failure and observing Nova’s disaster recovery capabilities in action.

Defining the Recovery Plan

The final step of Day Zero operations involves creating a recovery plan. This plan outlines the automated actions Nova will take upon detecting a failure within the Percona cluster. Here’s a breakdown of the four-step recovery plan outlined by Maciek:

  • Alert Trigger and Failover Initiation: The plan is triggered by an alert received from Prometheus (or any other monitoring tool) indicating a failure in the primary Percona cluster (Cluster One in this scenario). Upon receiving the alert, Nova triggers a pre-defined webhook.
  • Primary Cluster Standby Mode: Nova patches the manifest for Cluster One, effectively disabling its primary mode. This ensures Cluster One no longer acts as the primary database.
  • Standby Cluster Promotion: Nova patches the manifest for the standby cluster (Cluster Two), enabling its primary mode. Consequently, Cluster Two is promoted to become the primary database.
  • HAProxy Configuration Update: The external IP address of the new primary FerretDB service (connected to Cluster Two) is retrieved. The retrieved IP address is used to update the HAProxy configuration, ensuring it directs traffic to the newly promoted primary database.
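The four steps above could be captured declaratively along these lines. This is a hypothetical sketch only: the apiVersion, kind, and step schema are invented for illustration, so consult the Nova documentation for the real recovery plan format.

```yaml
# Hypothetical recovery plan resource encoding the four steps.
# All field names below are illustrative assumptions.
apiVersion: recovery.elotl.co/v1alpha1
kind: RecoveryPlan
metadata:
  name: percona-failover
spec:
  alertLabels:
    alertname: PerconaPrimaryDown          # 1. matches the incoming alert/webhook
  steps:
    - patchObject:                         # 2. demote cluster one to standby
        kind: PerconaPGCluster
        name: cluster1
        patch: '{"spec":{"standby":{"enabled":true}}}'
    - patchObject:                         # 3. promote cluster two to primary
        kind: PerconaPGCluster
        name: cluster2
        patch: '{"spec":{"standby":{"enabled":false}}}'
    - readValue:                           # 4a. fetch the new primary's external IP
        kind: Service
        name: ferretdb-cluster-2
        fieldPath: status.loadBalancer.ingress[0].ip
        saveAs: newPrimaryIP
    - patchObject:                         # 4b. repoint HAProxy at it
        kind: ConfigMap
        name: haproxy-config
        patchTemplate: 'server ferretdb1 {{ .newPrimaryIP }}:27017 check'
```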

Simulating Failure and Recovery

Since setting up Prometheus falls outside the scope of this demo, Maciek injects an alert object directly into Nova. This simulates the behavior of a webhook triggered by an actual alert.
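The injection can be done with a plain kubectl apply against the Nova control plane. The Alert resource shown here is an assumed schema for illustration, not Nova's documented API:

```shell
# Simulate the Prometheus webhook by applying an alert object directly to
# the Nova control plane (context name and Alert schema are assumptions).
kubectl --context=nova apply -f - <<'EOF'
apiVersion: recovery.elotl.co/v1alpha1
kind: Alert
metadata:
  name: simulated-primary-failure
spec:
  labels:
    alertname: PerconaPrimaryDown
    cluster: workload-cluster-1
EOF
```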

Verifying Recovery Outcomes

Following the simulated failure and recovery plan execution, Maciek checks various aspects to confirm successful disaster recovery:

  • Cluster Status Verification: Cluster One is expected to be in standby mode (primary mode disabled). Cluster Two is expected to be in primary mode (assuming successful promotion).
  • HAProxy Configuration Verification: The HAProxy configuration map is checked to verify that it reflects the updated IP address of the new primary FerretDB service.
  • Database Accessibility Test: Maciek connects to the MongoDB console (using the HAProxy address) to confirm continued database accessibility after the failover. A db find command is executed, successfully retrieving the expected data objects, demonstrating that the database remains functional even after the simulated failure.
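These checks map to commands along the following lines; context names, resource names, and namespaces are assumptions for this sketch:

```shell
# 1. Cluster roles should have swapped (expect true for cluster1, false for cluster2).
kubectl --context=workload-1 get perconapgcluster cluster1 \
  -o jsonpath='{.spec.standby.enabled}'
kubectl --context=workload-2 get perconapgcluster cluster2 \
  -o jsonpath='{.spec.standby.enabled}'

# 2. The HAProxy backend should now carry the new primary's IP.
kubectl get configmap haproxy-config -o yaml | grep server

# 3. Data remains reachable through the unchanged HAProxy endpoint.
mongosh "mongodb://<haproxy-external-ip>:27017/demo" --eval 'db.items.find()'
```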


This blog post has provided a comprehensive walkthrough of Maciek Urbanski’s live demonstration of disaster recovery automation with Nova in a multi-cluster Kubernetes environment. We’ve delved into the step-by-step process of setting up Day Zero operations, including:

  • Defining scheduling policies for workload placement.
  • Installing and configuring the Percona operator for Postgres deployments.
  • Deploying FerretDB instances as sample document databases.
  • Setting up an HAProxy instance as the load balancer for database traffic.
  • Creating a recovery plan to outline Nova’s automated actions during a failure.

By following these steps, Maciek established a functional database environment with disaster recovery capabilities orchestrated by Nova. We also explored how Nova integrates with external monitoring tools like Prometheus to trigger recovery plans upon detecting failures.

Note: To gain a deeper understanding of the theoretical foundation behind Nova and its disaster recovery capabilities, be sure to read the first part of this blog series, "Elevating Disaster Recovery With Kubernetes-native Document Databases," where Selvi Kadirvel, Tech Lead Engineering Manager at Elotl, explores the advantages of disaster recovery automation with Nova and its unique functionalities within a Kubernetes landscape.

We extend our gratitude to Maciek Urbanski for showcasing Nova’s disaster recovery features in detail through this insightful live demonstration. Watch the full webinar here. Join us, Document Database Community on Slack, and share your comments below.

