High availability and reliability are becoming increasingly important in automation technology. Even a short outage can result in significant costs and security risks. The WinCC OA Disaster Recovery System is intended to prevent this.

As a management system, WinCC OA has an integrated Hot Standby redundancy concept, which meets the high demands of system builders and operators for availability as well as process and data security. Reliability in a redundant WinCC OA system is implemented with Hot Standby, a security concept that consists of two interconnected servers. Both are permanently operational and are subject to the same functional load, but only one server is active at any time. The second, passive server synchronizes the data at runtime. If a unit fails, an on-the-fly switchover is executed and the server that was passive until then takes over control.

The new Disaster Recovery System extends the redundancy concept by a Warm Standby system, so that operation can continue on another system even in the event of a complete failure of the redundant system or its shutdown, e.g. for maintenance. Data loss and downtime are thereby kept as low as possible. This is achieved by assigning a second system, the so-called secondary server system (SSS), to the first redundant Hot Standby system (primary server system, PSS) and implementing a “Warm Standby” between the two systems. This means that the data between the two systems is permanently synchronized. This has two advantages:
The main requirement for the Disaster Recovery System is to keep data loss, inoperability and downtime of the management system as low as possible. To guarantee this, constant synchronization of the online and configuration data between the PSS (Primary Server System) and the SSS (Secondary Server System) is essential. Because the volume of this data can be very large and depends on the size of the project, the system operator or integrator should define and manage the scope of the synchronization and the synchronization interval between the two systems as far as possible. The Disaster Recovery System provides the following functions:
System architecture

The functionality of the Disaster Recovery System is based on two WinCC OA standard functions: the WinCC OA Hot Standby Redundancy and the WinCC OA Distributed Systems, which are used between the PSS and the SSS.

Connection

All servers of the PSS and the SSS are connected via LAN or WAN (TCP/IP protocol).

Normal Operation Mode

In the normal operation mode, the PSS maintains the connection to the field devices (or master control station with OPC UA port) and communicates all values to the SSS by means of the Control Manager. On the workstation, there are two possibilities:
PSS (Primary Server System)

The PSS consists of a redundant WinCC OA project in which various drivers and control managers are operated and thus acquire and further process the current data of the field devices (or master control station with OPC UA port). The two servers within the primary server system operate according to the Hot Standby concept. More information on WinCC OA redundancy can be found in the chapter Redundancy, Basics.

SSS (Secondary Server System)

The SSS is intended to take over management in case of a complete failure of the PSS or maintenance on the PSS. It is also a redundant WinCC OA project that has the same configured drivers and control managers as the PSS. Put simply, it is a mirror of the PSS. Normally, the SSS has no connection to the field devices (master control stations) and does not carry out calculations (except for WinCC OA internal calculations such as error quantifiers, compressions, etc.). Nevertheless, the process data is available on this system with very little delay, since the values of the data points and the alarm status are continually communicated from the PSS using the mechanism of distributed WinCC OA systems.

If both computers of the PSS fail, the servers of the SSS take over the complete monitoring and control of the project. For the user, this simply means a short interruption in the operation of the application before the SSS takes over control, establishes the connection to the field devices (or master control stations) and provides the current values.

If a failed server of the PSS resumes operation, the Disaster Recovery System executes the reverse data migration. During such a fallback switchover, the WinCC OA managers are started again on the PSS and the data is synchronized with the current data on the SSS. Furthermore, the historical data can also be synchronized in the course of a fallback procedure. This ensures that all changes that occurred after the failover are also available on the PSS.

Figure: Failure of the Primary Server System

If the connection between the PSS and the SSS fails, both systems become active, since each must assume that the other system has failed. In the process, an alarm for the loss of connection of the DIST manager is triggered. Both systems can be operated, and both systems establish a connection to the field devices.
Figure: Interruption in the Connection between the Primary and the Secondary Server Systems

The Disaster Recovery System can deal with both cases. After an interruption in the connection between the PSS and the SSS, the data synchronization occurs automatically at the next synchronization cycle, or it can be started manually (and repeated as needed) via the operating interface once the synchronization alarm has been triggered. After the connection has been re-established, the data is synchronized from the PSS (master) to the SSS (slave).

Additional range of functions
Data synchronization between PSS and SSS

In the normal operating mode, the PSS has a connection to the field devices (master control stations with OPC UA port) and the drivers only run on the PSS.

The synchronization of the online data (process data) between the PSS and the SSS is carried out by a special control manager and the mechanism of the distributed systems, which communicates the data from the PSS to the SSS with the help of the WinCC OA DIST manager and thereby keeps the WinCC OA last value databases identical. All data points and data point types whose values are synchronized between the two systems must be configured using the corresponding configuration panels (see Configuration - Introduction). This requires that both distributed systems contain the same data points, which is achieved during operation by synchronizing the data point configuration.

The synchronization of the (data point) configuration is achieved by using WinCC OA control functions (timed functions) and the WinCC OA ASCII manager. At a freely configurable interval (default: 60 minutes), the changed configuration data is exported from the primary system and then imported into the secondary system. This synchronization can also be deactivated, either if the configuration data should not be transferred from the PSS to the SSS periodically, or if no more configuration data should be transferred after the first automatic synchronization because no further configuration changes are expected on the system.

The synchronization of the historical data is implemented using Oracle functions. This synchronization is required in order to completely synchronize the systems again after a fallback, so that historical queries on the databases of both systems return the same result.
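The timing rule of the periodic configuration synchronization can be illustrated with a small sketch. This is not WinCC OA code: the class ConfigSyncSchedule and its fields are hypothetical names chosen for illustration, and the sketch assumes that the first synchronization runs as soon as it is enabled. Only the 60-minute default and the ability to deactivate the synchronization come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConfigSyncSchedule:
    """Hypothetical model of the configuration sync timing (illustration only)."""

    interval_minutes: int = 60               # default interval named in the text
    enabled: bool = True                     # the synchronization can be deactivated
    last_sync_minute: Optional[int] = None   # None: no synchronization has run yet

    def is_due(self, now_minute: int) -> bool:
        """Return True if an export/import cycle should run at now_minute."""
        if not self.enabled:
            return False
        if self.last_sync_minute is None:
            # Assumption: the first synchronization runs immediately.
            return True
        return now_minute - self.last_sync_minute >= self.interval_minutes

    def mark_synced(self, now_minute: int) -> None:
        """Record that an export/import cycle finished at now_minute."""
        self.last_sync_minute = now_minute
```

Deactivating the schedule after the first cycle (setting enabled to False) models the case in which no further configuration changes are expected on the system.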
Failover procedure or manual switchover procedure between PSS and SSS

If the connection between the SSS and the PSS is lost, or if the managing system fails completely, all drivers and control managers are started in their hierarchical order on the active server of the secondary system, and the secondary system thereby becomes the managing system. The idle time between the individual steps is configurable via the configuration wizard. Additionally, the driver triggers a general query. On the workstation:
The same actions are also carried out when switching manually from the PSS to the SSS.

Fallback procedure

If the failed system resumes the normal operating mode, a complete synchronization of the online data and of the alarm status between the two systems is carried out.

Behavior of the Disaster Recovery System upon failure of one or several servers

The following subsections show the behavior of the Disaster Recovery System in various error scenarios. The server designations A, B, C and D correspond to the server designations in the figures shown above.

Failure of Server A. Servers B, C and D are operational.

This fault is handled by the default WinCC OA Redundancy. In this case, there is a redundancy switchover: the passive server of the PSS becomes active and takes over all tasks and the communication with the field devices (or master control stations with OPC UA port).

Failure of Server B. Servers A, C and D are operational.

If the passive server of the PSS has failed, this has no effect on the operation of the system.

Failure of Servers A and B. Servers C and D are operational.

If both computers of the PSS fail, the SSS takes over control, starts the control managers and the drivers, establishes the connection/communication to the field devices (or master control station with OPC UA port) and processes the data. The control managers and the drivers are started hierarchically, and the time between the individual steps is configurable.

Failure of Servers A and C. Servers B and D are operational.

This has no effect on the operation of the system, since one computer of each of the two systems is still running. The behavior is generally the same as in the first case, i.e. the standard Hot Standby Redundancy switches over to server B on the PSS.

Failure of Servers B and D. Servers A and C are operational.

This has no effect on the operation of the system, since one computer of each of the two systems is still running. The behavior is generally the same as in the second case.

Failure of Servers A, B and C. Server D is operational.

If both servers of the PSS and the active computer of the SSS fail, the system behaves very similarly to the third case described above. The only difference is that the standby server of the SSS now takes over control of all tasks.
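The failure scenarios above follow one simple pattern, which the following sketch makes explicit. It is an illustration only, not part of WinCC OA: the function managing_server is a hypothetical name, and the roles of the servers (A active and B passive on the PSS, C active and D standby on the SSS) are inferred from the scenario descriptions. The sketch ignores a loss of connection between the PSS and the SSS, in which case both systems become active.

```python
from typing import Optional, Set, Tuple

# Roles inferred from the scenario descriptions (assumption):
PSS_SERVERS = ("A", "B")  # A: active PSS server, B: passive PSS server
SSS_SERVERS = ("C", "D")  # C: active SSS server, D: standby SSS server


def managing_server(failed: Set[str]) -> Optional[Tuple[str, str]]:
    """Return (system, server) that manages the project, or None if all failed.

    The PSS manages as long as one of its servers is operational; within a
    system, the normally active server is preferred over its partner.
    """
    for system, servers in (("PSS", PSS_SERVERS), ("SSS", SSS_SERVERS)):
        for server in servers:
            if server not in failed:
                return system, server
    return None


if __name__ == "__main__":
    scenarios = [{"A"}, {"B"}, {"A", "B"}, {"A", "C"}, {"B", "D"}, {"A", "B", "C"}]
    for failed in scenarios:
        print(sorted(failed), "->", managing_server(failed))
```

Combinations that the text does not describe, such as a failure of servers C and D, are handled by the same rule in the sketch; that behavior is an extrapolation, not a documented one.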
Setting the config entry useOfflineErrorstateInfo to 1 in the [DisRec] section defines that the system which had the higher error state during the interruption becomes passive, even if it was active before.
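As a minimal illustration, the entry might appear in the project config file as shown below. This assumes the usual WinCC OA config file layout of a section header followed by key = value entries; only the section name and the key are taken from the text.

```
[DisRec]
useOfflineErrorstateInfo = 1
```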
V 3.11 SP1
Copyright ETM professional control GmbH 2013 All Rights Reserved