Resilience strategies — redundancy, backing up and restoring data

This topic discusses the resilience, redundancy, backup and restore strategies to consider when deploying and maintaining your Pexip Infinity platform.

Resilience and redundancy

Pexip Infinity is designed with multiple layers of resilience and redundancy. Companies and service providers should consider which failure scenarios they want to protect against. All of the options below can be combined; the choice is typically a trade-off between cost and benefit, and depends on how much downtime can be tolerated in a worst-case scenario.

The main levels of resilience and redundancy, and our associated recommendations, are described below.

Hardware and physical considerations

Dual hot swap power supplies for each server connected to different power circuits, optionally with UPS or backup power.
Dual network cards in each server, connected to dual switches (VMware NIC Teaming). Switches are connected to redundant routers, so that any single component in the network path can fail. Also consider whether the data center remains reachable if its fiber connection is cut.
Redundant storage, either by adding a hardware RAID controller and operating two disks in RAID 1 (mirror) or by using redundant external SAN solutions.
Redundant servers — we recommend that service providers deploy n+1 to always allow for one physical server to be unavailable.
Redundant datacenters — consider providing Pexip Infinity from multiple data centers (either multiple data centers in one region or in various international regions).

VMware considerations

Note that some of the options listed below are not relevant for all node types in a Pexip deployment. Typically a Pexip Management Node, Reverse Proxy and the Conferencing Nodes responsible for signaling should be kept up by using technologies such as vMotion or DRS.

For Conferencing Nodes with a consistently high media load, we recommend deploying n+1 (or more) Conferencing Nodes, so that you always have spare capacity if a node becomes unavailable.
vMotion: proactively move the node to another server if maintenance/changes are to be conducted on the server where it is currently running. Do not vMotion a Conferencing Node that has active conferences.
DRS: automatically move VMs between servers depending on server load. This should not be used with Conferencing Nodes that handle media.
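
The n+1 recommendation above is simple arithmetic. The following Python sketch is illustrative only: the function name and the capacity unit (for example, concurrent HD calls per node) are assumptions made for this example, not Pexip sizing figures.

```python
import math


def nodes_required(peak_load: int, capacity_per_node: int, spare: int = 1) -> int:
    """Return how many Conferencing Nodes to deploy for an n+spare design.

    peak_load and capacity_per_node use the same hypothetical unit,
    for example concurrent HD calls.
    """
    if capacity_per_node <= 0:
        raise ValueError("capacity_per_node must be positive")
    if peak_load < 0 or spare < 0:
        raise ValueError("peak_load and spare must not be negative")
    # n: nodes needed to carry the peak load (at least one node).
    n = max(1, math.ceil(peak_load / capacity_per_node))
    # Add spare nodes so the peak is still covered if a node is unavailable.
    return n + spare
```

For example, a peak load of 100 units on nodes that each handle 30 units needs four nodes to carry the load, so an n+1 design deploys five.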

Call control or DNS considerations

If using Pexip Infinity to provide registration and call control services, deploy multiple Conferencing Nodes to provide resiliency options.
If using DNS SRV records for call control server discovery, ensure that each SRV record points to multiple A records (the names of multiple servers) so that call control can still be provided by other Conferencing Nodes if one fails.
Consider DNS SRV records with different priorities that fail over to an alternative datacenter if available.
Be prepared to modify DNS configurations to direct traffic to alternative datacenters in case of a major disaster in one datacenter.
Ensure that call control systems outside of Pexip (other SIP registrars, H.323 gatekeepers, Skype for Business servers, or other components) are integrated with multiple Pexip Conferencing Nodes (often via the DNS strategies mentioned above) to allow service continuity should any node in the Pexip deployment be unavailable.

See DNS record examples for more information.
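
The priority-based failover described above can be sketched in Python as follows. This is a simplified model, not a DNS client: the host names, port and record values are hypothetical, and a real SRV client selects among records of equal priority using weighted random choice (RFC 2782) rather than the fixed ordering used here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class SrvRecord:
    priority: int  # lower value is tried first
    weight: int    # relative share among records of equal priority
    port: int
    target: str    # host name that resolves via an A record


def failover_order(
    records: Iterable[SrvRecord],
    is_reachable: Callable[[str], bool],
) -> List[str]:
    """Return reachable targets in the order a client would try them.

    Simplification: ties on priority are ordered by descending weight,
    then by name, instead of weighted random selection.
    """
    ordered = sorted(records, key=lambda r: (r.priority, -r.weight, r.target))
    return [r.target for r in ordered if is_reachable(r.target)]


# Hypothetical records: two nodes in the primary datacenter (priority 10)
# and one node in an alternative datacenter (priority 20).
RECORDS = [
    SrvRecord(10, 50, 5061, "conf01.dc1.example.com"),
    SrvRecord(10, 50, 5061, "conf02.dc1.example.com"),
    SrvRecord(20, 50, 5061, "conf01.dc2.example.com"),
]
```

With these records, traffic moves to the second node in the primary datacenter if the first fails, and to the alternative datacenter only when no priority-10 target is reachable.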

Physical locations of Pexip nodes

As the Pexip Infinity platform can be deployed with multiple nodes in multiple locations, make use of this flexibility and design the setup as required for the spread of customers being served.

Backup mechanisms

It is mandatory for any critical installation to have a proper backup of the Management Node.

The Pexip Infinity platform can be backed up in several ways, each with different advantages. You can use multiple (or all) of the options together to speed up recovery from different situations, and in most cases we recommend using at least two backup strategies.

For on-premises deployments, we recommend that you use both the hypervisor and Pexip's inbuilt methods to preserve your configuration data. A VM snapshot should be your primary mechanism prior to an upgrade, as this allows you to easily restore your system back to its state at the time the snapshot was taken. The Pexip Infinity backup and restore mechanism is your fallback mechanism, as this allows you to preserve a copy of your data in an alternative location, in case you lose your VM environment. For cloud-based deployments (Azure, AWS, GCP or Oracle) if you want to take a VM snapshot we recommend that you perform a graceful OS shutdown (not a hard power-off) before taking the snapshot; alternatively, you may use the Pexip Infinity backup and restore mechanism.

In all cases ensure that you take regular backups, refreshing your backups after making configuration changes to your Management Node.

VMware backup

VMware backups should be performed in the same datacenter (with an offsite replication of the backup in case of a local disaster) to allow fast restoration of the data.

Be aware that snapshots are not backups. Snapshots are a tool to roll back to a given time. Therefore, we recommend taking snapshots only when necessary (such as prior to an upgrade) and deleting the snapshot as soon as possible after the upgrade is confirmed to be successful. You should only create and delete VMware snapshots at a time of minimal usage. Taking or removing snapshots can cause virtual machines to become unresponsive.

VM backups should use a proper hypervisor VM backup tool (e.g. VMware VDP — vSphere Data Protection) or similar, and restoration should be tested and verified (preferably after the inbuilt backup methods have been set up, to ensure that you have another way of recovering if your restoration fails).

Conferencing Nodes do not need to be backed up. They receive all their configuration information from the Management Node and can simply be redeployed if necessary (i.e. delete and recreate). However, if your Conferencing Nodes are geographically spread out and redeploying them would consume significant bandwidth or take a significant length of time, they can also be backed up with your hypervisor's backup tools.

Pexip's inbuilt Management Node configuration backup process

You can use Pexip Infinity's inbuilt backup and restore mechanism to back up and restore the configuration data on the Management Node.

You can enable regular automatic backups, and you can also take a manual backup whenever it is appropriate, for example, before and after you make any configuration changes or perform a software upgrade.

All backup files are encrypted — the administrator supplies a passphrase and must remember this for any subsequent restoration.
Restoration must occur on exactly the same version that the backup was taken from.
The backup contains all configuration data, including IP addresses, custom themes, certificates and call history.
The backup data does not contain licenses, the administrator log, the support log, usage statistics or the operating system password.
The system keeps on the host VM only the 5 most recent manually-taken backups and the 5 most recent automatic backups. Older backup files are deleted.
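
The retention rule in the last item can be modeled as below. This is a sketch of the rule as described, not Pexip's implementation; the BackupFile type and its fields are hypothetical. It can help when mirroring the same policy on an external backup server.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class BackupFile:
    name: str
    created: datetime
    automatic: bool  # True for scheduled backups, False for manual ones


def apply_retention(
    backups: Iterable[BackupFile], keep: int = 5
) -> Tuple[List[BackupFile], List[BackupFile]]:
    """Split backups into (kept, deleted).

    The `keep` newest manual backups and the `keep` newest automatic
    backups are retained independently of each other.
    """
    backups = list(backups)
    kept: List[BackupFile] = []
    deleted: List[BackupFile] = []
    for automatic in (False, True):
        group = sorted(
            (b for b in backups if b.automatic == automatic),
            key=lambda b: b.created,
            reverse=True,  # newest first
        )
        kept.extend(group[:keep])
        deleted.extend(group[keep:])
    return kept, deleted
```

Because the two groups are counted separately, a burst of manual backups never pushes out the automatic ones, and vice versa.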

As the restore does not contain the license key, if you use this method to restore your configuration data onto a fresh Management Node you will need to contact your Pexip authorized support representative to obtain a new license key.

We recommend that you configure Pexip Infinity to schedule regular backups and send the backup file to an external FTP server. See Backing up and restoring configuration for instructions.

Pexip import/export of service configuration data

This is your "last resort" if you do not have the ability to restore configuration from a VM backup or from a Pexip backup. You can export and import your service configuration data (Virtual Meeting Rooms, Virtual Auditoriums, Virtual Receptions, device aliases and Automatically Dialed Participants) to and from CSV files.

You can apply this data to a fresh (or existing) Pexip installation if, for example, you had to redeploy the entire Pexip platform. Note that Call Routing Rules and other platform configuration are not included in these CSV files, so they must be documented separately for manual restoration in this scenario. See Bulk import/export of service configuration data for instructions.

If you use provisioning you can restore your VMRs and device aliases from your LDAP/AD source.

Impact of lost connectivity to the Management Node

If the Management Node disappears from the network (due to network outage, server outage, lack of VMware HA, or during HA restoration etc.) there will be some impact on your Pexip Infinity platform:

No configuration updates pushed to Conferencing Nodes: a Conferencing Node typically checks in with the Management Node once every 60 seconds for configuration updates. This will fail, hence no new VMRs, Call Routing Rules, or platform configuration updates such as DNS servers etc. will be added to the Conferencing Node's local replicated database.
No CDR/log data sent back to the Management Node: the Conferencing Nodes will try to push their log data back to the Management Node via syslog. This will also fail, and the Conferencing Nodes will buffer the log data until the Management Node is operational again. The buffer is limited: if the Management Node is down for a very long time and there is a lot of traffic, the Conferencing Node logs will eventually rotate to avoid filling up the disk, but you should not normally have such long outages.
No visibility in Pexip Management Node interface: new calls will not appear in the Management Node's Administrator interface. This applies when the Administrator interface itself is accessible but some Conferencing Nodes cannot reach the Management Node, for example because of a network split.
Licensing: if a Conferencing Node is unable to contact the Management Node, the call is permitted on the assumption that a license is available. After the Management Node has been out of contact for a grace period of 14 days, all calls will be rejected.
Rebooting/restarting Conferencing Nodes: do not restart or reboot a Conferencing Node while the Management Node is unavailable. The 14-day licensing grace period only works if the Conferencing Node has had a valid sync with the Management Node after the Conferencing Node's most recent reboot.
Syslog to external syslog server (TCP/TLS): for reliable syslog output, each Conferencing Node can send its syslog data directly to your own external syslog server, for quick real-time analysis with tools such as Splunk. TCP/TLS-based syslog uses a reliable data channel, so you know that the traffic is received at the other end. As the log data is sent directly from the Conferencing Nodes to your configured syslog server, it is not affected by the Management Node being unavailable.
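
The licensing and reboot items above combine into one rule, sketched below in Python. This models the behavior as described in this topic and is not Pexip's implementation; the function and parameter names are hypothetical, and the exact handling at the 14-day boundary is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional

GRACE_PERIOD = timedelta(days=14)  # grace period stated in this topic


def call_permitted_during_outage(
    now: datetime,
    last_sync: Optional[datetime],
    last_reboot: datetime,
) -> bool:
    """Decide whether a Conferencing Node that cannot reach the
    Management Node would still accept a call.

    last_sync is the time of the node's last valid sync with the
    Management Node, or None if it has never synced.
    """
    # The grace period only applies if the node has synced since it
    # last rebooted.
    if last_sync is None or last_sync < last_reboot:
        return False
    # Assumption: calls are rejected once 14 full days have elapsed.
    return now - last_sync < GRACE_PERIOD
```

The second check is why rebooting a Conferencing Node during an outage is harmful: the reboot time becomes later than the last sync, so the grace period no longer applies.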

Some features are not affected or are only partly affected:

New and existing calls: Conferencing Nodes will continue to handle both new and existing calls as before — this is because no conference media or signaling passes through the Management Node.

Considerations when restoring the Management Node from backup

Your procedures to restore the Management Node from a backup depend upon the type of backup available and the problem that has occurred:

Restoring from VMware

This is probably the quickest and most reliable way of restoring the Management Node.

You can perform a Management Node restore from VMware during production hours.
When the restore is complete, configuration replication will start as soon as the IPsec tunnels between the Management Node and the Conferencing Nodes are back up. Due to the nature of IPsec, some Conferencing Nodes might take a while to sync back up. Rebooting a Conferencing Node will solve this, but that will impact operations and will drop any active calls that are being handled on that node. Note also that if the node is not in sync with the Management Node, you cannot use the Administrator interface to check whether that node has any running calls.
Any configuration on the Conferencing Nodes will be refreshed with the data that has been restored to the Management Node.
Any logs / call data records (CDRs) that have already been sent to the old Management Node will be lost, because Conferencing Nodes do not check what is in the Management Node database; they only know that the records had been delivered. To recover these CDRs, use external syslog data and parse it as an additional import to your billing system. New CDRs will be queued until the Management Node is operational again.

Restoring a file created by Pexip's inbuilt Management Node configuration backup process

We recommend using this approach if the Management Node needs to be reinstalled or rebuilt.

When you perform a restore of Management Node configuration data, the node's IP address is restored (as this is used by the IPsec tunnels to all of the Conferencing Nodes). To be able to restore a Management Node into another datacenter (in case of local disaster recovery), the Management Node needs an IP address from your ISP that can be re-routed to another site. For high-availability deployments, this is worth discussing with your network provider, as you cannot simply move nodes to a different network after they have been deployed.

You can perform a Management Node restore from a Pexip backup during production hours.
If you are restoring onto a fresh Management Node:
- As the IP address of the new Management Node will be the same as that of the old node, do not power on both nodes in the same VLAN, to avoid an IP address conflict.
- It will not replicate configuration to the Conferencing Nodes until the backup file is restored (as the new Management Node does not contain the right certificates to communicate with the Conferencing Nodes). When the Pexip backup is restored, it will connect with the Conferencing Nodes and replicate configuration. Any configuration on the Conferencing Nodes will be refreshed with the data that has been restored to the Management Node.
- Any logs / call data records (CDRs) that have already been sent to the old Management Node will be lost, because Conferencing Nodes do not check what is in the Management Node database; they only know that the records had been delivered. To recover these CDRs, use external syslog data and parse it as an additional import to your billing system. New CDRs will be queued until the Management Node is operational again.

Restoring service configuration data from CSV files

An import of service configuration data can be performed at any time.

The primary key for the Virtual Meeting Room / Virtual Auditorium / Virtual Reception database is the name of that service. So, for example, if you import VMRs on top of an existing VMR database, the existing configuration for those VMR names will be overwritten by the imported VMR's configuration. Any other existing VMR records will be left unchanged.
The primary key of device alias data is the alias itself, and for Automatically Dialed Participants it is the combination of the conference name and the alias to be dialed.
Any new or modified service configuration will take effect after approximately 60 seconds.
CDR data is not affected by a restore of service configuration data.
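
The primary-key behavior described above amounts to an "insert or overwrite" merge. The Python sketch below illustrates it with hypothetical column names ("name", "pin"); it does not reproduce the actual Pexip CSV layout, which is described in Bulk import/export of service configuration data.

```python
import csv
import io
from typing import Dict, Sequence, Tuple

Row = Dict[str, str]
Key = Tuple[str, ...]


def import_rows(
    existing: Dict[Key, Row], csv_text: str, key_fields: Sequence[str]
) -> Dict[Key, Row]:
    """Merge CSV rows into a copy of `existing`.

    Rows whose key already exists overwrite the stored record; rows with
    a new key are added; records not present in the CSV are unchanged.
    """
    merged = dict(existing)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # A key may be a single column (the service name or the alias) or
        # several (conference name plus alias for dialed participants).
        key = tuple(row[field] for field in key_fields)
        merged[key] = dict(row)
    return merged
```

For example, importing rows for "Alice VMR" and "Carol VMR" into a database that already holds "Alice VMR" and "Bob VMR" overwrites the first, adds the second, and leaves "Bob VMR" untouched. For Automatically Dialed Participants the same function would be called with two key fields, such as a conference column and an alias column.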