SN10300 High Availability mechanisms
Posted by Michal Podoski, Last modified by Danny Staub on 15 November 2017 02:58 PM

The SN10300 comes with several layers of High Availability.

 

1. Single MGW HA design

a) PSU failover
- hot-swappable PSUs with real-time failover

b) SS7/MTP3 signaling link failover

c) TDM Clock failover between Line Services

d) Automatic protection mechanisms on the TelcoBoard
- On-board process monitoring: deadlocked or crashed processes are automatically detected and the board is rebooted to ensure minimal downtime.
- HW Watchdog: when a SmartMedia system is active, the TelcoBoard is polled continuously by the System Manager. If this polling stops, the TelcoBoard has lost contact with the SmartMedia applications and it will reboot automatically if communication is not restored in a timely manner.
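As a rough illustration of the watchdog idea described above, the sketch below shows a board-side timer that reboots the TelcoBoard when System Manager polling stops. The class, method names and the 30-second timeout are assumptions made for illustration only, not the actual firmware.

    import time

    POLL_TIMEOUT_S = 30          # assumed grace period, not the real value

    class BoardWatchdog:
        """Board-side view of the HW Watchdog idea (illustrative only)."""

        def __init__(self):
            self.last_poll = time.monotonic()

        def on_poll(self):
            # Called each time the System Manager polls the TelcoBoard.
            self.last_poll = time.monotonic()

        def check(self):
            # Run periodically: if polling stopped and communication was not
            # restored in time, the board reboots itself.
            if time.monotonic() - self.last_poll > POLL_TIMEOUT_S:
                self.reboot()

        def reboot(self):
            print("No poll received in time - rebooting TelcoBoard")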

e) IP Network (SIP Signaling)
For outgoing calls, the system can be configured with alternate addresses to reach the SIP Proxy, so if one network path is down, the other one is used. For incoming calls, you can configure SIP stacks on two separate IP interfaces so that if one is not reachable by a peer, the second can still be reached. Note, however, that mechanisms such as virtual IP are present only in the 1+1 HA model between Units, so the peer SIP agent must be configured with the two IP addresses used on the TelcoBoard and must be able to switch from one to the other if one is down.
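A minimal sketch of the alternate-address idea for outgoing calls, assuming two example proxy addresses and a crude TCP reachability probe (real SIP failover is protocol-aware; nothing below is SmartMedia code):

    import socket

    PROXY_ADDRESSES = [("10.0.1.10", 5060), ("10.0.2.10", 5060)]  # example addresses

    def reachable(addr, timeout=1.0):
        # Crude reachability probe over TCP, used here only to illustrate
        # "use the other path when one is down".
        try:
            with socket.create_connection(addr, timeout=timeout):
                return True
        except OSError:
            return False

    def pick_proxy():
        # Return the first SIP Proxy address reachable over a working path.
        for addr in PROXY_ADDRESSES:
            if reachable(addr):
                return addr
        raise RuntimeError("no SIP proxy reachable on any configured path")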

f) VOIP interface bonding
Bonding of VOIP interfaces was introduced in SmartMedia 2.7, making it possible to configure one IP address on two interfaces.

g) IP Network (Voice Path)
We support redundant voice paths for SIP calls. If one path is down, the other one is selected for all new calls. If both paths are up, they can be used in load sharing.
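The selection logic can be pictured roughly as below (a sketch with assumed names, not the actual implementation): new calls are spread over the paths that are up, and a path marked down is simply skipped.

    import itertools

    class VoicePathSelector:
        """Illustrative load-sharing / failover selection between voice paths."""

        def __init__(self, paths):
            self.paths = list(paths)             # e.g. ["VOIP0", "VOIP1"]
            self.up = {p: True for p in self.paths}
            self._rr = itertools.cycle(self.paths)

        def mark_down(self, path):
            self.up[path] = False

        def mark_up(self, path):
            self.up[path] = True

        def path_for_new_call(self):
            # Load-share across all paths that are up; skip the ones that are down.
            for _ in range(len(self.paths)):
                candidate = next(self._rr)
                if self.up[candidate]:
                    return candidate
            raise RuntimeError("no voice path available")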

h) Flexible IP (SmartMedia 2.8 and above)
Version 2.8 allows creating multiple dot1q VLAN-based Virtual Interfaces on Physical Interfaces and assigning them different roles. Bonding is also possible on all TelcoBoard interfaces (not only VOIP0 and VOIP1).

i) STM1 - Automatic Protection Switching
STM1 versions of the SN10k support linear-topology APS fiber protection for 1+1 configurations only. In some implementations this standard is also called MSP 1+1 (Multiplex Section Protection).

2. Cluster HA design

The main advantage of the SN10300 series is its cluster architecture, which allows the system to achieve an additional level of High Availability. The main SN10300 components are:

  • SN10300/STM1/z, SN10300/xE/z, SN10300/xDS3/z - media gateway Units
  • SN10300/CTRL/RUI - Control Host for the SN10300 Cluster, running applications and database
  • SN10300/SMS/RUI - backplane switch, responsible for low-latency, real-time media, signaling and clocking transport between MGW Units.

Component redundancy
The SN10300 design makes it possible to duplicate the SN10300/CTRL and SN10300/SMS roles in the system. CTRL servers run as a Primary/Secondary pair, sharing the load of running applications. Combined with event polling from the other cluster components, this achieves a rapid switchover time and service continuity when a failure is detected.
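Conceptually, the Secondary CTRL watches the event and heartbeat flow and takes over the Primary role when the Primary is seen as failed. The sketch below is an illustration only; the names and the 5-second timeout are assumptions, not the real values.

    import time

    HEARTBEAT_TIMEOUT_S = 5      # assumed value for illustration

    class SecondaryCtrl:
        def __init__(self):
            self.last_heartbeat = time.monotonic()
            self.role = "secondary"

        def on_heartbeat(self):
            # Called whenever an event/heartbeat from the Primary is observed.
            self.last_heartbeat = time.monotonic()

        def poll(self):
            # Run periodically: promote this server when the Primary goes silent.
            silent = time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT_S
            if self.role == "secondary" and silent:
                self.role = "primary"
                print("Primary CTRL failure detected - Secondary taking over")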

Example SN10300 N+1 fully redundant Cluster setup:

[Diagram: SN10300 N+1]

Networking redundancy
All of the internal IP and TDM networks used to build the SN10300 Cluster are based on redundant physical and logical connections (ports and addresses).

SMS

  • Proprietary protocol
  • Clocking
  • Backplane circuit switching

CTRL (ETH)

  • Ethernet / IP based – strict IP Addressing
  • Internal network for C-based API calls
  • Watchdog and Heartbeat

VoIP

  • Exclusively for SIP, RTP and SIGTRAN
  • Possible to run DNS queries
  • VOIP0 and VOIP1 in different broadcast domains – however, bonding is possible

MGMT

  • Web and SSH access
  • IP addressing can be chosen freely

HW Cluster Architecture
The SN10300 architecture is based on dividing Application processing (Control Host - SN10300/CTRL) and Real Time Traffic (MGW Units - SN10300/STM1, xE, xDS3) at the hardware level. This approach delivers remarkable call processing power and makes full use of the Cluster's potential, mainly thanks to the ability to share the load between applications running on separate servers as well as protocol stacks running on the MGW Units.

Application Level Cluster
The SN10300 SmartMedia applications support Active-Standby mode, so you can have two hosts, each with a running instance of a SmartMedia application. If the active server crashes, the application on the standby server takes over. Furthermore, on each host SmartMedia installs a service that monitors all SmartMedia applications and automatically restarts crashed or deadlocked applications.
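The monitoring-service behaviour can be sketched as a simple supervisor loop; the application list and command below are hypothetical placeholders, not the real SmartMedia service.

    import subprocess
    import time

    APPS = {"example_app": ["/bin/sleep", "3600"]}   # hypothetical app -> command

    def supervise(poll_interval=5):
        # Start every application, then restart any instance that has exited.
        procs = {name: subprocess.Popen(cmd) for name, cmd in APPS.items()}
        while True:
            for name, proc in procs.items():
                if proc.poll() is not None:          # process crashed or exited
                    print(f"{name} died with code {proc.returncode}, restarting")
                    procs[name] = subprocess.Popen(APPS[name])
            time.sleep(poll_interval)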

Another feature is the possibility to manually decide which server will run the Active instance of a particular application, which, for example, allows isolating routing decisions from monitoring daemon operations.

Signaling Stack Cluster
Protocol stacks run on the SN10300/STM1, xE and xDS3 Units in order to ensure minimal latency in signaling processing and to offload the Application Servers. After parsing, signaling data is pushed from the MGW Units to the Application Servers via a standardized proprietary API. Thanks to the robust SMS TDM backplane, signaling parsing can be performed on any MGW Unit in the Cluster, which means signaling data received on a particular MGW doesn't need to be processed there.

It is also important that protocol stacks (e.g. ISUP) can run as an Active/Standby pair on separate MGW Units, making the state information resilient against MGW Unit or protocol stack crashes.

Configuration Database Cluster
The SN10300 supports HA at the DB level. SmartMedia can configure two DB Servers and set up replication between them so that all changes made to the Master DB are replicated on the Slave DB. If the Master DB is lost, SmartMedia automatically switches to using the Slave DB. After the failure, the database roles are switched to ensure the proper replication direction. In order to ensure data consistency and minimize the risk of cross-writing, SmartMedia runs a dedicated database replication director that controls the redundancy mechanism.
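The replication director's failover logic can be pictured along these lines (a deliberately simplified sketch; the Db class and its methods are placeholders, not the SmartMedia component):

    class Db:
        """Stand-in for a real database node; all methods are placeholders."""
        def __init__(self, name):
            self.name, self.alive = name, True
        def is_alive(self):
            return self.alive
        def promote(self):
            print(f"{self.name} promoted to Master")
        def replicate_from(self, other):
            print(f"{self.name} now replicates from {other.name}")

    class ReplicationDirector:
        def __init__(self, master, slave):
            self.master, self.slave = master, slave

        def db_for_writes(self):
            # All writes go to the Master; fail over when it is lost.
            return self.master if self.master.is_alive() else self.failover()

        def failover(self):
            # Swap the roles and reverse the replication direction so the old
            # Master cannot come back and cross-write stale data.
            self.master, self.slave = self.slave, self.master
            self.master.promote()
            self.slave.replicate_from(self.master)
            return self.master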

IP Network (GW Control and Management)
All communications between application<-->application, application<-->TelcoBoard and TelcoBoard<-->TelcoBoard are done using redundant IP interfaces. This is built into all our code from the ground up.

3. N+1 Failover design

The SN10300 delivers an N+1 HA model, which consists of multiple Active (regular) SN10300/STM1 Units, a dedicated Backup (N+1) SN10300/STM1BU Unit and an active N+1 STM1 patch panel. Currently the N+1 redundancy model is available only with optical interfaces (STM1/OC3). The SN10300 can run up to 15 Active SN10300/STM1 Units backed up by a single SN10300/STM1BU. The system can run up to two N+1 patch panels: one securing the STM1 Working channel and a second securing the STM1 APS Protection channel.

Simplified diagram of N+1 setup operating in normal circumstances

[Diagram: N+1 normal operation]

The most outstanding feature of this setup is the N+1 patch panel, which allows the Backup Unit to take over the STM1 line from a failed Active Unit. The N+1 patch panel is controlled by the CTRL servers, which constantly poll the Event bus of the SN10300 Cluster. During normal operation each Active MGW processes all of its calls, receives the optical path through the N+1 patch panel and holds the configured protocol stacks. In case of an Active Unit failure, the configuration of the failed Unit is copied to the Backup Unit, which is instantly brought online while the N+1 patch panel switches the optical path from the failed Active Unit to the Backup MGW. All of the networking configuration, including MTP3 and IP, is cloned to the Backup Unit.
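The switchover sequence can be summarised as below; every name and value in this sketch is an illustrative placeholder (plain dictionaries instead of the real Units and patch panel), not the SmartMedia API.

    def n_plus_one_switchover(failed_unit, backup_unit, patch_panel):
        # Clone the configuration (MTP3, IP, protocol stacks), bring the Backup
        # online and move the optical path over on the N+1 patch panel.
        backup_unit["config"] = dict(failed_unit["config"])
        backup_unit["state"] = "active"
        patch_panel["optical_path"] = backup_unit["name"]

    active = {"name": "STM1-unit-3", "state": "failed",
              "config": {"mtp3": "linkset-A", "ip": "10.0.0.3"}}
    backup = {"name": "STM1BU", "state": "standby", "config": {}}
    panel = {"optical_path": active["name"]}

    n_plus_one_switchover(active, backup, panel)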

Floating IP is based on a proprietary protocol working as a type of FHRP, announcing the IP address switchover to a new MAC address with a Gratuitous ARP.
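For reference, a gratuitous ARP announcement of the kind mentioned above can be generated with Scapy as sketched below. The interface, MAC and floating IP are example values, and this is plain gratuitous ARP for illustration, not the proprietary protocol itself (sending requires root privileges).

    from scapy.all import ARP, Ether, sendp

    FLOATING_IP = "192.0.2.10"          # example floating address
    NEW_MAC = "02:00:00:00:00:01"       # example MAC of the unit taking over

    # Gratuitous ARP: sender and target IP are both the floating address, so
    # peers update their ARP caches to point the IP at the new MAC.
    garp = Ether(dst="ff:ff:ff:ff:ff:ff", src=NEW_MAC) / ARP(
        op=2, hwsrc=NEW_MAC, psrc=FLOATING_IP,
        hwdst="ff:ff:ff:ff:ff:ff", pdst=FLOATING_IP)

    sendp(garp, iface="eth0")           # example interface name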

Simplified diagram of a switched N+1 system

[Diagram: N+1 switched operation]

 

N+1 failure scenarios:
a) Switchovers without loss of active calls:
- SmartMedia stopped on one of the units
- An application is shut down
b) Switchovers without loss of active calls running specific protocols (only SS7 call legs running on distributed signaling links are preserved*/**)
- TelcoBoard unit shutdown
- TelcoBoard reboot requested
- Package installation
- Primary unit is back (and auto-switch-back is used)
- Loss of communication with the TelcoBoard
- Process crash on TelcoBoard
c) Host applications switchovers (no loss of active calls)
- Application crash
- Control Host crash
d) TelcoBoard switchovers with loss of active call legs running non-SS7 traffic**
- TelcoBoard unit shutdown
- TelcoBoard reboot requested
- Package installation
- Primary unit is back (and auto-switch-back is used)
- Loss of communication with the TelcoBoard
- Process crash on TelcoBoard

* SIP leg resiliency during N+1 switchover is currently on the roadmap
** Due to the signaling design, ISDN and CAS call legs can't be preserved during an N+1 switchover

 
