Multi-datacenter failover scenarios

Learn about the expected behavior in a multi-datacenter deployment in case of failover, and how the system will behave in different scenarios.

3 minute read

One API Gateway instance is down

In this case, one API Gateway instance is down in DC 1:

One API Gateway instance is down

The following applies in this scenario:

API requests can no longer be serviced by the API Gateway instance that is not running in DC 1.
API requests will be serviced by the remaining API Gateway instance in DC 1.
You must restart the API Gateway instance that is not running in DC 1.

One Cassandra node is down

In this case, one Cassandra node is down in DC 1:

One Cassandra node is down

The following applies in this scenario:

Cassandra is inherently HA and can tolerate the loss of one Cassandra node only in a datacenter (DC 1 in this case). This ensures 100% data consistency when Cassandra is configured for multiple datacenters. For more details, see Configure Cassandra for multiple datacenters.
You must restart the Cassandra node that is not running in DC 1.

Note

When a node has been absent from a cluster for a time, it is brought back into the cluster after restart, and becomes eventually consistent by design. Node repair is required after re-integration into the cluster. For more details, see Perform essential Cassandra operations.

A full datacenter is down

In this case, DC 1 is down:

A full datacenter is down

The following applies in this scenario:

API requests can no longer be serviced by DC 1
API requests are automatically directed to DC 2 by the load balancer
API Manager quotas remain the same but over less servers
You should not deploy updates for the following file-based data types to the API Gateway group:
- API Gateway configuration
- API Gateway KPS custom table structure

Caution

There is a risk that end-users may need to re-initiate sessions using the following data types stored in Ehcache:

API Gateway OAuth token store
API Gateway custom cache

Restart the datacenter

To restart the datacenter:

Restart each of the Cassandra nodes.
When all Cassandra nodes are running normally and have synchronized with the rest of the cluster in DC2, you can restart the API Gateway instances. The API Gateway should not be restarted prior to a successful restart of all Cassandra nodes.

Note

You should run the nodetool repair command on each of the three Cassandra nodes in DC 1 when the datacenter has been down for more than two hours.

The network between both datacenters is down

In this case, the network between DC 1 and DC 2 is down, while both datacenters remain active:

Network between both datacenters is down

The following applies in this scenario:

OAuth or throttling data stored in Ehcache per datacenter is not affected
You should not deploy updates for the following file-based data types to the API Gateway group:
- API Gateway configuration
- API Gateway KPS custom table structure
For the following data types stored in Cassandra, both datacenters are not synchronized until the network recovers, and then automatically resynchronize:
- API Manager catalog, client registry, web-based settings
- API Gateway KPS custom table structure
For the following file based data types, a single API Gateway Manager web console cannot access both datacenters while the network is down, and can only access each datacenter separately:
- API Gateway logs
- API Gateway traffic monitoring

If the network connection has been down for more than two hours, the following steps are recommended:

Run nodetool repair on each of the six nodes in the Cassandra cluster to ensure that the data has synchronized. For more details, see Perform essential Cassandra operations.
Restart the API Gateway instances to resynchronize data from Cassandra (potentially in both datacenters if Cassandra changes have occurred in both datacenters).

It may take a minute for newly created, deleted, or updated APIs in one datacenter to synchronize successfully with the other datacenter.

Last modified December 12, 2022: Merge developnov22 branch into master branch (#2511) (c285a3fd)