
Basics on the Disaster Recovery System

High availability and reliability are becoming more important in automation technology. Even a short breakdown can result in significant costs and security risks. The WinCC OA Disaster Recovery System is designed to prevent this.

As a management system, WinCC OA has an integrated Hot Standby redundancy concept. It meets the high demands of system integrators and operators for availability as well as process and data security. Hot Standby is a safety concept that consists of two interconnected servers. Both are permanently operational and are subject to the same functional load, but only one server is active at any time. The second, passive server synchronizes the data at runtime. If a unit fails, a “flying switch” (a switchover during operation) is executed and the server that was passive until then takes over control.
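The Hot Standby principle described above can be pictured with a minimal Python sketch. It is only an illustration and not WinCC OA code; the class names Server and HotStandbyPair and the data point name are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    alive: bool = True
    data: dict = field(default_factory=dict)


class HotStandbyPair:
    """Two servers: one is active, the other mirrors its data at runtime."""

    def __init__(self, first: Server, second: Server):
        self.active, self.passive = first, second

    def write(self, key, value):
        # The active server processes the value ...
        self.active.data[key] = value
        # ... and the passive server is kept in sync while it is running.
        if self.passive.alive:
            self.passive.data[key] = value

    def fail(self, server: Server):
        server.alive = False
        # Switch over only if the active unit failed and the partner is available.
        if server is self.active and self.passive.alive:
            self.active, self.passive = self.passive, self.active


pair = HotStandbyPair(Server("A"), Server("B"))
pair.write("pump1.speed", 1200)
pair.fail(pair.active)
print(pair.active.name, pair.active.data)
```

Because the passive server already holds the synchronized values, it can take over without reloading any data.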

With the new Disaster Recovery System, the redundancy concept is extended by a Warm Standby System, so that the system remains operable on another system even in the event of a complete failure or a shutdown of the redundant system, e.g. for maintenance. Thus, data loss and idle time are kept as low as possible. This is achieved by assigning a second system, the so-called secondary server system (SSS), to the first redundant Hot Standby System (primary server system, PSS) and implementing a “Warm Standby” between the two systems. This means that the data between the two systems is permanently synchronized.

This has two advantages:

  1. In the case of a complete system failure, the system remains operable.

  2. The historical data can be retrospectively synchronized.

The main demand on the Disaster Recovery System is to keep data loss, inoperability and idle time on the part of the management system as low as possible. To guarantee this, constant synchronization of the online and configuration data between the PSS (Primary Server System) and the SSS (Secondary Server System) is essential. However, since the quantity of this data is very extensive and depends on the size of the project, the system operator or the integrator should define and manage the scope and the interval of the synchronization between the two systems as far as possible.

The following functions are provided by the Disaster Recovery System:

  • Synchronization of the online data changes between PSS and SSS at runtime.

  • Synchronization of the alarm status (acknowledgement status, acknowledgement time, acknowledgement user) between PSS and SSS at runtime.

  • Cyclical synchronization of the configuration changes (alert handling, data point functions, etc.) between PSS and SSS.

  • Automatic (cyclical) or manually triggered synchronization of the project files (panel files, control scripts and libraries, data point lists, color databases, graphic files and images, text catalogues).

  • Synchronization of the historical data (via Oracle packages) after triggering by the user after an SSS system failure or an interruption in the connection between PSS and SSS.

  • Synchronization of the user administration (user data).

  • Automatic switchover between PSS and SSS and automatic/manual shift-in between SSS and PSS.

  • Working with a user interface that is connected either to the PSS or to the SSS (two different shortcuts are required on the desktop).

  • Automatic switchover at the client between the user interface of the PSS and the SSS to the currently running system (two user interfaces active in parallel), if a second UI license is available. Otherwise a manual start of the first system is required, if this has broken down.

System architecture

drs-01.png

The functionality of the Disaster Recovery System is based on two WinCC OA standard functions: WinCC OA Hot Standby Redundancy and WinCC OA Distributed Systems. The latter is used between the PSS and the SSS.

Connection

All servers of PSS and SSS are connected by means of LAN or WAN (TCP/IP protocol).

Normal Operation Mode

In the normal operation mode, the PSS maintains the connection to the field devices (or master control station with OPC UA port) and communicates all values to the SSS by means of the Control Manager.

On the work station, there are two possibilities:

  • Two WinCC OA user interfaces are started. One has a connection to the PSS and the other to the SSS. The UI of the managing system runs in the foreground. All panel switchovers from this UI are automatically communicated to the other UI, so that both WinCC OA user interfaces always display the same image. The user only sees this once the connection to the PSS has failed.

  • A single WinCC OA user interface is started, connected either to the PSS or to the SSS. The user decides which one to use after the connection to the active system is lost. The user receives a notification when the other system becomes passive again, and the user interface with the connection to the other system then has to be opened.

PSS (Primary Server System)

The PSS consists of a redundant WinCC OA project in which various drivers and control managers run, acquiring and further processing the current data of the field devices (or master control station with OPC UA port). The Hot Standby concept applies between the two servers within the primary server system. More information on WinCC OA redundancy can be found in the chapter Redundancy, Basics.

SSS (Secondary Server System)

The SSS is intended to take over management in case of a complete failure of the PSS or maintenance on the PSS. It is also a redundant WinCC OA project that has the same drivers and control managers configured as the PSS. Put simply, it is a mirror image of the PSS.

Normally, the SSS has no connection to the field devices (master control stations) and also does not carry out calculation procedures (except for WinCC OA internal calculations such as error quantifiers, compressions, etc.). Nevertheless, the process data is available with a very low delay on this system, since the values of the data points and the alarm status are continually communicated from the PSS with the mechanism of distributed WinCC OA systems.

If both computers of the PSS fail, the servers of the SSS take over the complete monitoring and control of the project. For the user, this simply means a short interruption in the operation of the application before the SSS takes over the control, then configures the connection to the field devices (or master control stations) and provides the current values for the user.

When a failed server of the PSS resumes operation, the Disaster Recovery System executes the reverse data migration. During such a fallback, the WinCC OA managers are started again on the PSS and the data is synchronized with the current data on the SSS. Furthermore, the historical data can also be synchronized in the course of a fallback. This ensures that all changes that occurred after the failover are also available on the PSS.
 

Figure: Failure of the Primary Server System

drs-09.png

If the connection between the PSS and the SSS has failed, both systems are active, since each system must assume that the other one has failed. In the process, an alarm for the loss of connection of the DIST manager is triggered. Operation is possible on both systems, and both systems establish a connection to the field devices.

 

Figure: Interruption in the Connection between the Primary and the Secondary Server Systems

drs-10.png

The Disaster Recovery System can deal with both cases. After an interruption in the connection between the PSS and the SSS, the data is synchronized automatically at the next synchronization cycle. Alternatively, once the synchronization alarm has been triggered, the synchronization can be started manually, and repeatedly, via the user interface. After the connection has been re-established, the data is synchronized from the PSS (master) to the SSS (slave).

Additional range of functions

  1. By default, the synchronization of the driver configuration contains the IEC, OPC, S7 and Modbus drivers.

  2. Split mode is supported by the Disaster Recovery System. When split mode is ended, the configuration is synchronized on the active computer so that the changes to the configuration are applied. Optionally, all of the changes can be rejected.

Data synchronization between PSS and SSS

In the normal operating mode, the PSS has a connection to the field devices (or master control stations with OPC UA port) and the drivers run only on the PSS. The online data (process data) is synchronized between the PSS and the SSS by a special control manager and the mechanism of distributed systems, which communicates the data from the PSS to the SSS with the help of the WinCC OA DIST manager and thereby keeps the WinCC OA last value databases identical. All data points and data point types whose values are to be synchronized between the two systems must be configured in the corresponding configuration panels (see Configuration - Introduction). This requires that both distributed systems contain the same data points, which is ensured during operation by the synchronization of the data point configuration.
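The selective forwarding of online values can be sketched as follows. This is a simplified Python illustration, not the control manager shipped with WinCC OA; ValueChange, forward_changes and the data point names are hypothetical, and the real mechanism works via the DIST manager.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValueChange:
    dp_type: str
    dp_name: str
    value: float


def forward_changes(changes, synced_types, synced_names, sss_last_values):
    """Apply only those PSS value changes that are configured for synchronization."""
    for change in changes:
        if change.dp_type in synced_types or change.dp_name in synced_names:
            sss_last_values[change.dp_name] = change.value
    return sss_last_values


changes = [
    ValueChange("PUMP", "pump1.speed", 1200.0),
    ValueChange("VALVE", "valve7.position", 0.4),
    ValueChange("LOCAL", "pss_only.counter", 17.0),
]
sss_values = forward_changes(
    changes,
    synced_types={"PUMP"},
    synced_names={"valve7.position"},
    sss_last_values={},
)
print(sss_values)
```

Values of data points that are neither listed by name nor covered by a configured type never reach the SSS, which is why the selection has to be configured deliberately.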

The (data point) configuration is synchronized using WinCC OA control functions (timed functions) and the WinCC OA ASCII manager. At a freely definable interval (default: 60 minutes), the changed configuration data is exported from the primary system and then imported into the secondary system. This synchronization can also be deactivated if the configuration data of the PSS should not be transferred to the SSS periodically, or if no more configuration data should be transferred after the first automatic synchronization because no further configuration changes are expected on the system.
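The cyclical export and import can be pictured with the following Python sketch. It only models the timing and the "changed since last cycle" idea; the function names, the dictionary layout and the use of minutes as plain integers are assumptions for illustration and do not reflect the ASCII manager interface.

```python
SYNC_INTERVAL_MIN = 60  # default interval mentioned in the text


def export_changes(primary_config, since):
    """Collect entries changed after the last synchronization (stand-in for the export)."""
    return {
        name: value
        for name, (value, changed_at) in primary_config.items()
        if changed_at > since
    }


def import_changes(secondary_config, changes, now):
    """Write the exported entries into the secondary system (stand-in for the import)."""
    for name, value in changes.items():
        secondary_config[name] = (value, now)


def run_cycle(primary_config, secondary_config, last_sync, now, enabled=True):
    """Run one synchronization if it is enabled and the interval has elapsed."""
    if not enabled or now - last_sync < SYNC_INTERVAL_MIN:
        return last_sync
    import_changes(secondary_config, export_changes(primary_config, last_sync), now)
    return now


# Each entry maps a name to (value, minute of last change).
primary = {"alert_class": ("high", 10), "dp_function": ("avg", 45)}
secondary = {"alert_class": ("low", 0)}

last_sync = run_cycle(primary, secondary, last_sync=0, now=60)
skipped = run_cycle(primary, secondary, last_sync=last_sync, now=90)
print(secondary, last_sync, skipped)
```

Passing enabled=False corresponds to deactivating the periodic synchronization once no further configuration changes are expected.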

The historical data is synchronized using Oracle functions. This synchronization is required in order to completely synchronize the systems again after a fallback, so that historical queries on the databases of both systems return the same result.

Failover procedure or manual switchover procedure between PSS and SSS

If the connection between the SSS and the PSS is lost, or if the managing system fails completely, all drivers and control managers are started in hierarchical order on the active server of the secondary system, and the secondary system therefore becomes the managing system. The waiting time between the individual steps is configurable via the configuration wizard. Additionally, the driver triggers a general query.

On the workstation:

  • With 2 UIs -> the WinCC OA user interface of the system becoming active (the secondary system) is brought to the foreground, so that operation remains possible without a major interruption.

  • With 1 UI -> the WinCC OA user interface of the primary system loses the connection upon total failure and the user starts the user interface of the secondary system via a second program shortcut.

The same actions are also carried out in the case of a manual switchover from the PSS to the SSS.
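The hierarchical start with a configurable waiting time between the steps can be sketched in Python as follows. The function start_hierarchically and the manager names are hypothetical; in a real project the start order and the delay come from the Disaster Recovery configuration.

```python
import time


def start_hierarchically(levels, delay_s, start=print, sleep=time.sleep):
    """Start managers level by level, waiting delay_s seconds between levels."""
    started = []
    for index, level in enumerate(levels):
        if index > 0:
            sleep(delay_s)  # configurable waiting time between the steps
        for manager in level:
            start(manager)
            started.append(manager)
    return started


# For the demonstration, the waits are recorded instead of actually sleeping.
waits = []
order = start_hierarchically(
    [["driver_opcua", "driver_modbus"], ["ctrl_calculation"], ["ctrl_reporting"]],
    delay_s=5,
    start=lambda manager: None,
    sleep=waits.append,
)
print(order, waits)
```

Starting the drivers before the control managers that depend on them is the reason for the level structure; the delay gives each level time to come up.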

Fallback procedure

When the failed system resumes the normal operating mode, a complete synchronization of the online data and of the alarm status between the two systems is carried out.

Behavior of the Disaster Recovery System upon failure of one or several servers

The following subsections show the behavior of the Disaster Recovery System in various error scenarios.

The server designations A, B, C and D correspond to the server designations in the figures shown above.

Failure of Server A. Server B, C and D are operational.

This fault is handled by the standard WinCC OA redundancy. A redundancy switchover takes place: the passive server of the PSS becomes active and takes over all tasks and the communication with the field devices (or master control stations with OPC UA port).

Failure of Server B. Server A, C and D are operational.

If the passive server of the PSS has failed, this has no effect on the operation of the system.

Failure of Servers A and B. Servers C and D are operational.

If both computers of the PSS fail, the SSS takes over the control, starts the control managers and the drivers, establishes the connection/communication to the field devices (or master control station with OPC UA port) and processes the data. The control managers and the drivers are started hierarchically, and the time between the individual steps is configurable.

Failure of Servers A and C. Servers B and D are operational.

This has no effect on the operation of the system, since one computer of each of the two systems is still running. The behavior is generally the same as described in the first case: the standard Hot Standby Redundancy switches over to server B on the PSS.

Failure of Servers B and D. Servers A and C are operational.

This has no effect on the operation of the system, since one computer of each of the two systems is still running. The behavior is generally the same as described in the second case.

Failure of Servers A, B and C. Server D is operational.

If both servers of the PSS and the active computer of the SSS fail, the system behaves much as in the third case described above. The only difference is that the standby server of the SSS now takes over control of all tasks.
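The six scenarios above follow one rule: the PSS manages the plant as long as one of its servers is running, otherwise the SSS takes over. A minimal Python sketch of this rule, assuming A is the active and B the passive server of the PSS and C is the active and D the passive server of the SSS (the function name managing_server is hypothetical):

```python
def managing_server(failed):
    """Return (system, server) that manages the plant, or None if all servers are down."""
    failed = set(failed)
    # The PSS is preferred; within each system the active server comes first.
    for system, servers in (("PSS", "AB"), ("SSS", "CD")):
        for server in servers:
            if server not in failed:
                return system, server
    return None


for scenario in ("A", "B", "AB", "AC", "BD", "ABC"):
    print(scenario, "->", managing_server(scenario))
```

The sketch covers server failures only. It does not model an interruption of the connection between the PSS and the SSS, in which both systems become active.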

Note

The config entry useOfflineErrorstateInfo in the [DisRec] section defines whether the system that had the higher error state during the interruption becomes passive, even if it was active before. To enable this behavior, set the entry to 1.
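As a sketch, the entry in the project config file could look as follows. The section and entry names are taken from the note above; the "name = value" layout is assumed to follow the usual WinCC OA config file syntax.

```
[DisRec]
useOfflineErrorstateInfo = 1
```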

Chapter Overview

Chapter: Description

  • Basics on the Disaster Recovery System: Basic information on the Disaster Recovery System, its functions, system architecture (PSS, SSS), operation and behavior in case of the failure of a server.

  • Requirements and Installation: Requirements and installation of the Disaster Recovery System.

  • Configuration in WinCC OA: Step by step instructions for setting up a Disaster Recovery System.

  • Configuration of the Disaster Recovery System

      • Configuration - Introduction: Introductory information on the configuration of the Disaster Recovery System.

      • Configuration: Description of the available Wizards for general configuration of the Disaster Recovery System (divided into 5 steps).

      • System Overview: Description of the overview panel of the installed and set up Disaster Recovery System.

      • File Synchronization: Description of the panel for configuring the synchronization of the project files.

      • Database Synchronization: Description of the panel for the configuration of a historical database synchronization.

  • Database Configuration

      • Requirements and Installation: Requirements and preparation of the database for historical database synchronization.

      • Synchronization Process: Description of the procedure for historical database synchronization.

      • Status of the Synchronization: Description of the possible status conditions for historical database synchronization.

  • Control of the Client Behavior: Controlling the client behavior with the aid of a Disaster Recovery System reference object.

  • Internal Data point Types of the Disaster Recovery System: Description of the internal data point types.

  • Details on the Disaster Recovery System: Details on the Disaster Recovery System, i.e. config entries and debug flags.

  • Notes and Restrictions: Details and restrictions that should be noted when using the Disaster Recovery System.

  • Glossary: Explanation of the terms and abbreviations used in the help documentation of the Disaster Recovery System.

 

V 3.11 SP1

Copyright ETM professional control GmbH 2013 All Rights Reserved