Ian Verhappen is a contributor and blogger for Control and Control Design. He has 25+ years of experience in instrumentation, controls and automation.
There is the common assumption that redundancy always leads to improved system reliability, and in most cases this is certainly true, which is why it's often the first step used with control systems to help achieve the target of 100% availability. However, to be effective when designing redundant systems two important principles must be followed: KISS (keep it system-simple) and avoid common-mode failures. A good corollary is to implement standards, whether these are industry standards or your own internal corporate standards, which include preferred supplier lists. This is because standardizing on one supplier makes it possible to effectively implement supplier-specific protocols. Since all the supplier's equipment supports the "proprietary" protocol, it becomes a corporate standard protocol. In the case of Ethernet ring protocols, this is still an important consideration (Table 1).
Ring Options Abound
Table 1: Ring topologies are available from a wide range of companies. Though they provide coverage for only a single point of failure in the network, this is in most cases sufficient for control systems, which are relatively simple in structure, at least within a facility. This table was developed based on an examination of literature and a similar table in Glenn Johnson’s “Redundancy in Industrial Networks” (www.controldesign.com/networkredundancy). The table also illustrates that many manufacturers partner with other companies to build products on their behalf, and as a result you may in fact be able to obtain a proprietary protocol from more than one supplier.
Company
URL
Technology
Claimed recovery time
Network size
Advantech
X-ring
30
Hirschmann
HiPER ring
200
Fast HiPER ring
200
Moxa
www.moxa.com
Turbo ring
250
Weidmüller
www.weidmuller.com.au
Turbo ring
N-Tron
www.n-tron.com
N-ring
~30 ms
250
O-Ring
www.oring-networking.com
O-RSTP
~20 ms
40
O-ring
250
Open-ring
Variable
250
Rockwell Automation
www.rockwellautomation.com
Cisco REP
20-250 ms
Westermo
www.westermo.com
Cisco REP
200
Red Lion/Sixnet
www.sixnet.com
Real-time ring
30 ms plus 5 ms per hop
50
Why does redundancy increase reliability? The simple answer is mathematics. When two devices or components are connected in series, the system availability is simply the product of the availability of the two components.
Conversely, when two devices are connected in parallel, which is effectively what we're doing when implementing a redundant system, the combined availability is calculated as $1 - (1 - A)^2$, where $A$ is the availability of each component.
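Consequently, the parallel system is unavailable only when both components are down at the same time. Because the unavailability (1 - A) is a small fraction, squaring it makes it far smaller: two components that are each 99% available combine for 99.99% availability, and each additional parallel unit multiplies the unavailability by (1 - A) again.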
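The series and parallel rules above can be checked with a few lines of Python. This is a minimal sketch that assumes identical, independently failing components; the function names and the 99% figure are illustrative, not taken from any vendor data.

```python
from math import prod


def series_availability(availabilities):
    """All components must work, so the availabilities multiply."""
    return prod(availabilities)


def parallel_availability(a, n=2):
    """n identical redundant components: the system fails only if all n fail."""
    return 1 - (1 - a) ** n


a = 0.99  # assumed availability of one component
print(f"two in series:   {series_availability([a, a]):.4f}")   # 0.9801
print(f"two in parallel: {parallel_availability(a):.4f}")      # 0.9999
```

Putting the same two components in series lowers availability below that of either one, while putting them in parallel raises it, which is the whole case for redundancy in one comparison.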
Though hardware redundancy does result in a genuine improvement in system availability, it's not always by the degree that the theory predicts, mainly due to common-mode failures in nonredundant elements and lack of diagnostics coverage.
Mean Time to Reliability
In addition, introducing hardware redundancy can add considerable complexity while making only a marginal improvement in overall availability. This is most likely when the original hardware element is itself simple, with a high availability or long mean time to failure (MTTF).
In that case, introducing redundancy involves the addition of complex hardware just to perform the fail-over function. The component count and number of interconnects needed to implement it can be high enough that overall MTTF actually decreases. This situation may be a rare occurrence, but it's still one to be mindful of. Hence, the KISS admonishment.
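A rough way to see this effect is to treat the fail-over hardware as one more element in series with the redundant pair. The sketch below works in availability terms rather than MTTF for simplicity, and every number in it is an assumption chosen for illustration.

```python
def redundant_pair_with_failover(a_unit, a_failover):
    """Two redundant units in parallel, in series with the fail-over hardware."""
    pair = 1 - (1 - a_unit) ** 2
    return pair * a_failover


a_unit = 0.9999      # assumed: a simple element that is already highly available
a_failover = 0.9995  # assumed: complex switchover hardware that is less available

system = redundant_pair_with_failover(a_unit, a_failover)
print(f"single unit:     {a_unit:.6f}")
print(f"redundant setup: {system:.6f}")
print("redundancy helps" if system > a_unit else "redundancy hurts")
```

With these assumed figures the redundant arrangement ends up less available than the single unit, because the fail-over hardware becomes the new weakest link.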
Unfortunately, there are multiple standards by which MTTF calculations can be performed, and, of course, each standard provides different results. So, when using information on a supplier's website, always be sure that you're comparing calculations using the same methods.
One of the more widely used calculation methods for electronic equipment reliability is based on the 1991 MIL-HDBK-217 published by the U.S. Dept. of Defense. Most recently updated in February 1995, it contains failure rate models for a wide range of electronic components, including integrated circuits, transistors, diodes, resistors, capacitors, relays, switches and connectors, or, in other words, all the pieces used to manufacture the equipment we use to build control systems.
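To give a feel for how component failure rates roll up into an equipment MTTF, here is a simplified parts-count style sketch. It assumes constant failure rates and that any single part failure fails the board; the part list and the rates are hypothetical placeholders, not handbook values, and the quality and environment factors a real handbook calculation applies are left out.

```python
# Hypothetical failure rates in failures per million hours (NOT handbook values).
# Each entry maps a part type to (quantity on the board, failure rate per part).
PARTS = {
    "integrated circuit": (12, 0.05),
    "capacitor": (40, 0.002),
    "resistor": (60, 0.001),
    "connector": (4, 0.02),
}


def total_failure_rate(parts):
    """Sum quantity times per-part failure rate (a series model)."""
    return sum(qty * rate for qty, rate in parts.values())


def mttf_hours(parts):
    """With constant failure rates, MTTF is the reciprocal of the total rate."""
    return 1e6 / total_failure_rate(parts)


print(f"failure rate: {total_failure_rate(PARTS):.2f} per million hours")  # 0.82
print(f"MTTF:         {mttf_hours(PARTS):,.0f} hours")                     # 1,219,512
```

Because the result depends entirely on the per-part rates fed in, two suppliers using different handbooks can publish very different MTTF figures for similar hardware.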
Because of MIL-217's age, the underlying data doesn't reflect the most recent advances in the reliability of the above components. As a result, it often provides pessimistic results, so many companies are moving to AT&T Bell Labs' Bellcore (Telcordia) standards: TR-332 Issue 6 (1997) and SR-332 Issue 3 (2011). They're based on simplified MIL-217 but have the ability to incorporate real-world data to correct the theoretical models.
A further alternative, IEC 62380, also takes thermal cycling and duty cycle into account. Thermal cycling refers to the temperature changes the chips see due to changes in ambient temperature, as well as heat produced from other components on the same boards in the enclosure. Duty cycle is the on/off operation of the equipment. As a result, the IEC 62380 standard's models can handle continuous working, on/off cycles and dormant applications and include failures related to component soldering in the calculated failure rate.
All these models, however, give us an indication of the reliability of the devices or nodes in a control system, but not of the network itself. Since Ethernet is now the most commonly used network to link nodes together, system reliability also becomes a function of how we can make Ethernet-based communications between these devices or nodes reliable. Unfortunately, Ethernet's broadcast nature doesn't permit physical loops, and effectively forbids redundant communication paths. So, like any good group of engineers, we've developed ways to circumvent this limitation—at the price of increased complexity.
Network redundancy can be achieved at both the data link layer (Layer 2) and the network layer (Layer 3). Layer 2 redundancy is provided by switches within a TCP/IP subnet, while Layer 3 redundancy is done by routers, routing traffic via different TCP/IP subnets. Naturally, routing means higher overhead and lower performance, and, because industrial networks have a need for speed, they tend to rely on Layer 2 redundancy options for reliability.
From Trees to Rings
One simple form of redundancy is the Link Aggregation Control Protocol (originally IEEE 802.3ad, now IEEE 802.1AX), which bundles groups of switch ports between switches to form one link with the aggregated bandwidth of the individual links by splitting the communications across multiple paths. With link aggregation or link redundancy, if a single connection fails, the remaining links keep working with reduced bandwidth. Because the most likely cause of a cable failure is mechanical disruption, the physical links (cables) should be routed via different paths. Otherwise, damage at one location risks multiple simultaneous link failures, negating the benefits of link aggregation.
One of the first protocols developed in the early 1990s to implement redundancy was the Spanning Tree Protocol (STP). Though it can handle different network topologies, including mesh, the failover time for this protocol is typically 30 to 50 seconds with default timers. Due to the time required to converge on a new configuration, STP has a practical limit on the number of switches between endpoints in the network. The IEEE 802.1D standard that defines STP recommends that the number of hops (the number of bridges or switches between any two devices) should be no more than seven.
STP has generally been replaced by the improved Rapid Spanning Tree Protocol (RSTP), which was standardized as IEEE 802.1w in 2001 and later folded into IEEE 802.1D-2004. Because the failover time will vary depending on the particular implemented topology and the location of the individual failure, neither STP nor RSTP can provide deterministic failover.
STP and RSTP both allow an Ethernet network to contain physically redundant links by putting selected links into standby, which prevents data loops from overloading the network. Without this capability, the circular connections that make up the mesh and ring topologies would bog down the network and ultimately lead to a complete communications failure. Standby links are activated to "heal" the network when a link fails. RSTP networks support a larger number of switches (20 in a path), and the typical failover time is around 1 second. RSTP recovers or rebuilds the network far faster than STP, but it is still too slow for many industrial applications.
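The idea of keeping a loop-free set of active links and holding the rest in standby can be sketched with a plain breadth-first search. This is a deliberate simplification: real STP and RSTP elect a root bridge and choose links by bridge ID and path cost, whereas the sketch below assumes the root is given and all links cost the same. The switch names are hypothetical.

```python
from collections import deque


def spanning_tree(links, root):
    """Return (active, standby) link sets for an undirected network."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    visited = {root}
    active = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in sorted(neighbors[node]):
            if nxt not in visited:
                visited.add(nxt)
                active.add(frozenset((node, nxt)))  # link joins the tree
                queue.append(nxt)

    standby = {frozenset(link) for link in links} - active
    return active, standby


ring = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
active, standby = spanning_tree(ring, root="A")
print("standby:", [sorted(link) for link in standby])   # [['C', 'D']]

# Link A-B fails: recompute, and the former standby link is put to use.
healed, spare = spanning_tree([l for l in ring if l != ("A", "B")], root="A")
print("active after failure:", sorted(sorted(link) for link in healed))
```

Recomputing the tree after a failure is what "healing" amounts to here; the time real switches need to agree on the new tree is the failover time discussed above.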
A very common approach to provide redundancy for industrial networks is to use a ring, in part because it is a much lower-cost option than building a mesh between all the nodes. The HiPER-ring protocol first released by Hirschmann and Siemens in 1999 has been adopted in IEC 62439-2 as the Media Redundancy Protocol (MRP).
MRP requires that one of the switches is configured in the role of media redundancy manager (MRM). While maintaining one port closed to normal data, the MRM sends frames out of one of its ring ports and receives them on its other ring port, thus communicating in both directions. All other switches on the network act as media redundancy clients (MRCs).
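The manager's role can be illustrated with a toy model. The sketch assumes the MRM simply checks whether a test frame makes it all the way around the ring and blocks or unblocks its port accordingly; real MRP adds timers, topology-change messages and link-change notifications from the clients. The class, method and link names are hypothetical.

```python
class RingManager:
    """Toy model of a media redundancy manager (MRM) on a ring of switches."""

    def __init__(self, ring_links):
        self.link_up = {link: True for link in ring_links}
        self.blocked = True  # one ring port is closed to normal data

    def test_frame_returns(self):
        # A test frame sent out of one ring port comes back on the other
        # port only if every link around the ring is intact.
        return all(self.link_up.values())

    def poll(self):
        # Ring closed: keep the port blocked so no loop forms.
        # Ring open: unblock so traffic can flow the other way around.
        self.blocked = self.test_frame_returns()
        return self.blocked


mrm = RingManager(["sw1-sw2", "sw2-sw3", "sw3-sw1"])
print("healthy ring, port blocked:", mrm.poll())   # True

mrm.link_up["sw2-sw3"] = False                     # a ring link fails
print("broken ring, port blocked:", mrm.poll())    # False
```

In a healthy ring the blocked port turns the physical loop into a logical line; when the ring breaks, opening that port restores a path to every switch.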
Success with Subrings and Redboxes
Like many proprietary ring technologies, MRP also has the ability to support subrings. Depending on the support that's included by your hardware vendor, some switches can be configured as subring managers (SRMs). Two such switches then take part in two rings, the original ring being known as the basis ring. The subring needs at least one other switch, since a switch must take the role of MRM for the subring. The subrings also need to be configured on different VLANs, so further configuration is required to share traffic between the rings.
One weakness of loop topologies is that they can recover completely from only one failure. Conversely, a partial or full mesh network has multiple backup links and, if properly designed, can tolerate the failure of two or more links.
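This difference can be checked by brute force on a small example. The sketch below assumes a four-node network with hypothetical node names and tests every combination of failed links for a ring and for a full mesh.

```python
from itertools import combinations


def connected(nodes, links):
    """True if every node can reach every other node over the given links."""
    nodes = list(nodes)
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        current = stack.pop()
        for a, b in links:
            other = b if a == current else a if b == current else None
            if other is not None and other not in seen:
                seen.add(other)
                stack.append(other)
    return len(seen) == len(nodes)


def survives(nodes, links, failures):
    """True if the network stays connected for every set of `failures` failed links."""
    return all(
        connected(nodes, [link for link in links if link not in failed])
        for failed in combinations(links, failures)
    )


nodes = ["A", "B", "C", "D"]
ring = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
full_mesh = list(combinations(nodes, 2))

print("ring, 1 failure: ", survives(nodes, ring, 1))       # True
print("ring, 2 failures:", survives(nodes, ring, 2))       # False
print("mesh, 2 failures:", survives(nodes, full_mesh, 2))  # True
```

The full mesh buys its extra tolerance with six links instead of four, which is the cost trade-off that makes rings attractive in the first place.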
Similar to the fieldbus standard, there are a number of clauses in IEC 62439, each covering a different protocol related to network management. The Parallel Redundancy Protocol (PRP), IEC 62439-3, is implemented in the end devices, which are connected by two independent, completely separated paths that are assumed to fail independently. Once the paths have been created, a source node with PRP functionality simultaneously sends two copies of a frame, one over each of two ports. The two frames travel through their respective separate networks until they reach a destination node. Because each frame took a separate route, they will arrive at the target device at slightly different times. The destination node accepts the first frame of a pair and discards the second, taking advantage of a sequence number in each frame that's incremented for each frame sent.
Because PRP is implemented in software in the end nodes and can be installed on any platform supporting standard operating systems, the switches in the network do not need to have any PRP functionality. An end device with PRP functionality, typically a sensor, is a doubly attached node (DAN): it has a connection to each of the two networks, and both connections share the same MAC and IP address.
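The accept-first, discard-second behavior at the destination node can be sketched in a few lines. This is a minimal illustration that remembers every (source, sequence number) pair it has delivered; an actual PRP stack would bound this memory with a sliding window and handle sequence-number wraparound. The class name, node name and payloads are hypothetical.

```python
class PrpReceiver:
    """Accepts the first copy of each frame and drops the later duplicate."""

    def __init__(self):
        self.seen = set()  # (source, sequence number) pairs already delivered

    def receive(self, source, seq, payload):
        key = (source, seq)
        if key in self.seen:
            return None       # second copy of the pair: discard
        self.seen.add(key)
        return payload        # first copy: deliver to the application


rx = PrpReceiver()
print(rx.receive("node1", 1, "valve open"))    # copy via LAN A: delivered
print(rx.receive("node1", 1, "valve open"))    # copy via LAN B: None (discarded)
print(rx.receive("node1", 2, "valve closed"))  # LAN A down, only LAN B copy arrives
```

The third call shows why switchover time is zero: when one network fails, the surviving copy is simply the first to arrive, and nothing has to be reconfigured.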
A standard device with a single network interface, a singly attached node (SAN), can only be attached to one network. Such a device has no redundant path in the event of network failure between it and another SAN. A device called a redundancy box (redbox) can be used to connect standard devices or networks of standard devices to both networks. The high-availability seamless redundancy (HSR) receiver removes unicast frames from the ring, while the sender removes multicast and broadcast frames.
A redbox has three external Ethernet ports. Two of the ports are connected to a redundant network, which in the case of HSR, discussed below, is a ring, and one port is a traditional Ethernet port. When forwarding frames to the ring, the redbox duplicates each frame and sends one copy in each direction around the ring. When forwarding frames from the ring, the redbox forwards the first copy and removes the one that arrives later.
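The two-directions behavior on a ring can be illustrated by counting hops. The sketch below assumes a ring of numbered nodes, where link i joins node i and node i+1 (wrapping around), and reports how many hops each surviving copy needs to reach the destination; the numbering scheme and function name are assumptions made for the example.

```python
def hsr_deliveries(ring_size, src, dst, broken_link=None):
    """Return the hop counts of the frame copies that reach dst.

    Nodes are numbered 0..ring_size-1; link i joins node i and node (i+1) % ring_size.
    """
    arrivals = []
    for step in (+1, -1):  # one copy travels in each direction
        node, hops = src, 0
        while node != dst:
            link = node if step == +1 else (node - 1) % ring_size
            if link == broken_link:
                break      # this copy is lost at the break
            node = (node + step) % ring_size
            hops += 1
        else:
            arrivals.append(hops)  # loop ended without a break: copy arrived
    return arrivals


print(hsr_deliveries(6, src=0, dst=2))                 # [2, 4]
print(hsr_deliveries(6, src=0, dst=2, broken_link=1))  # [4]
```

In the healthy ring the destination forwards the copy that took two hops and removes the one that took four; with the link between nodes 1 and 2 broken, the longer-path copy still arrives, so no frame is lost.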
Because all frames are sent twice over the same network, even when there's no failure, only about half of the network bandwidth is available to applications relying on HSR rings when compared to RSTP. As a result, in large implementations, consideration needs to be given to increasing the network speed from 100 Mbps to 1 Gbps, depending on the architecture.
HSR shares its operating principle with PRP (IEC 62439-3 Clause 4) and was standardized alongside it as IEC 62439-3 Clause 5. It is one of the redundancy protocols selected for substation automation in the IEC 61850 standard. Like PRP, HSR is application-protocol-independent and can be used by most industrial Ethernet implementations that use the IEC 61784 suite.
Similar to PRP, HSR functions with zero switchover time, but unlike PRP, it doesn't require two parallel networks. Rather as the name implies, it takes the form of a ring or a structure of coupled rings, with the result that it's less flexible than PRP at the installation stage. HSR rings can also be connected via a redbox to a standard RSTP or MRP redundant network as a backbone or even to a PRP network using two redboxes.
A double redbox, quadbox or quadruple port device is used to connect two HSR rings to each other. As one quadbox would itself be a single point of failure, two adjacent quadboxes are typically used between HSR rings. As you can see, there are a number of options to increase the overall reliability of industrial networks by implementing redundancy in a variety of different ways and combinations (Figure 1).
Options for Network Redundancy
Figure 1: To help understand the differences between network redundancy options, this table from “Applying PRP and HSR Protocol for Redundant Industrial Ethernet” by Thomas Siegrist summarizes the various protocols.
Protocol | Topology | Max Devices | Networking Equipment | Reconfiguration Time |
STP (IEEE 802.1D-2004) | Mesh/TCP | Any | STP-compliant | 60-300 seconds |
STP (IEEE 802.1D-2004) | Mesh/UDP | Any | STP-compliant | 40-150 seconds |
Trunking | TCP | 200 - 1000 | Standard switch | 100-200 ms |
Trunking | UDP | 200 - 1000 | Standard switch | 0-10 ms |
RSTP (IEEE 802.1D-2004) | Ring | 40 | Switch with RSTP support | >2 seconds |
RSTP (IEEE 802.1D-2004) | Any | Any | Switch with RSTP support | >2 seconds |
MRP (IEC 62439-3, clause 2) | Ring | 50 | Device supports MRP | 10, 30, 200, 500 ms |
PRP (IEC 62439-3, clause 4) | Double, Any | Any | Standard switch | 0 ms |
HSR (IEC 62439-3, clause 5) | Coupled rings | 512 | Device supports HSR | 0 ms |
After doing all this work to make sure your network is robust, don't forget the balance of your infrastructure, such as power supplies. Otherwise, all you have done is relocate the weakest link to another location that may in fact be of lower reliability: for example, a single bus power supply that is down for maintenance, or a UPS that is not maintained or is undersized. In that case, all your efforts and additional system complexity could make the situation worse rather than better. In the end, it still comes down to understanding your system and implementing good engineering practices, which in most cases still means KISS.