1

Topic: HF2 problem version 2

I decided to start a new thread for this topic. It is similar to the CM1 failure reported on other threads (which I have experienced as well) but not caused by the same issue or covered on other topics. Sorry for this long post....

The cases discussed before are either a hardware failure or a CM1 crashing due to a network storm caused by incompatibilities between RSTP/MSTP and CobraNet. In the case of my latest problem, the CM1 will crash - and restart completely as verified by the SysUptime variable - under normal operation without any kind of network changes or spanning-tree renegotiations. Here is a brief description of the system in question:

- 53 CobraNet devices: 43 1 channel transmit/receive devices, 2 CAB16d (16ch tx/rx), 7 CAB8i (8ch tx) and 2 CAB16o (2x8ch rx)
- 6 total NIONs
- 18 of the 1ch devices are also receiving serial data using serial bridging. The RS-485 data is input via the CAB16d.

We have designed and installed about 20-30 similar systems but this one has the most devices receiving serial data.
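
For anyone who wants to automate the SysUptime check mentioned above, here is a minimal sketch. It assumes you already have a way to poll the variable (SNMP or otherwise) and have logged the readings in order; the function name and the sample numbers are made up for illustration.

```python
from typing import List, Sequence


def find_restarts(uptime_readings: Sequence[int]) -> List[int]:
    """Return the indexes of readings where the uptime went backwards.

    uptime_readings: successive SysUptime values from one CM-1, oldest first.
    A reading lower than the one before it means the counter started over,
    i.e. the module restarted somewhere between those two polls.
    """
    restarts = []
    for i in range(1, len(uptime_readings)):
        if uptime_readings[i] < uptime_readings[i - 1]:
            restarts.append(i)
    return restarts


# Hypothetical readings in arbitrary counter units, one per poll.
samples = [6000, 12000, 18000, 1500, 7500]
print(find_restarts(samples))  # [3]
```

A counter that wraps around would also show up as a decrease, so a single hit after a very long uptime is worth a second look before calling it a crash.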

On a Cobranet level, the system is arranged as follows:

Nion 1 talks to 16 1ch tx/rx devices without serial data +
Nion 2 talks to 9 1ch tx/rx devices without serial data + 6 1ch tx/rx devices with serial data
Nion 3 talks to 6 1ch tx/rx devices with serial data + 2 CAB8i
Nion 4 talks to 6 1ch tx/rx devices with serial data + 2 CAB8i
Nion 5 talks to 4 CAB8i + 2 16o
Nion 6 talks to 1 CAB8i + 2 16d

If the system is run with all devices connected and serial bridging enabled, the CM1 on Nion 4 or Nion 3 will crash in less than 3 minutes after the roles start. The flash code on the CM1 is 3.4.3 which stands for:

"Byte code: 77; Flash code: 3,4,3;  Type: FATAL; Name: ILLEGAL_INST; Description: Illegal instruction encountered.; Expected conditions: - ; Unexpected conditions: Hardware problem with main memory or address/data busses."

Quoting Kevin Gross on this code: "The illegal instruction error usually would occur as the result of a software programming error resulting in corrupted memory. The fact that we only see it under certain stressful situations is not entirely surprising. The error could be invoked by particular timing relationship between host access, network traffic and serial bridging activity. The error could be invoked by an overflow condition caused by multiple concurrent activities, receipt of a malformed Ethernet packet or host request."

Interestingly enough, here are some tests that I have performed and their results:
- Only N3 and N4 seem to crash (at least over 3-4 days), not N2, which also handles serial data and a significant number of 1ch devices
- The NION that has the problem is dependent on the role it has and not the hardware, meaning that I have swapped roles and the problem follows the role not the NION box.
- The problem only happens if serial bridging is enabled
- The problem becomes worse (happens more often) with more CobraNet traffic (as desk unit CobraNet bundles are removed the problem becomes much less frequent, getting spaced to happen every day or so vs. every 3 minutes or so)
- The problem happens in a separate system installed 3 years ago with a smaller network, a different project file and, of course, a different infrastructure and equipment
- The problem becomes less frequent (but does eventually happen) if the network ring is broken as compared to having a ring with RSTP or STP (both behave the same), but it still happens. It also seems to be less frequent if the ring is made smaller (remove 5 switches from the 12 that are typically a part of the network).

At this point, I am considering this problem either an inherent CobraNet problem or a problem with the way NIONs communicate with the CM1. There is no technical reason (i.e. bandwidth, CM1 capabilities, etc) that should cause this problem. Any ideas or suggestions will of course be greatly appreciated but also be aware that the CM1 crashing problems are not only related to RSTP or bad hardware, and that there are other scenarios under which this problem might occur.

Thoughts, ideas, etc? Thanks!
Rodrigo

2

Re: HF2 problem version 2

Rodrigo,

This sounds very much like a stack overflow problem. Have you tried it with the Beta CM-1 firmware designed to mitigate stack ovf issues?
Give me a ring or a PM and I can get that to you if you do not already have it.

Can you run Wireshark and see what the traffic in and out of the failing node looks like? You will probably have to mirror the device's port on the switch to do this.
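
Once a capture from the mirrored port has been saved, a quick first pass is to count frames by EtherType and by multicast versus unicast destination. The sketch below is only that, a sketch: it assumes the capture was saved in the classic pcap format (not pcapng) with microsecond timestamps, and it does not look inside VLAN tags. CobraNet frames use EtherType 0x8819; the function name and the file name in the usage comment are made up.

```python
import struct
from collections import Counter

COBRANET_ETHERTYPE = 0x8819  # EtherType used by CobraNet frames


def summarize_pcap(data: bytes) -> Counter:
    """Count frames in a classic pcap by (ethertype, 'multicast' or 'unicast')."""
    magic = data[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # file written on a little-endian machine
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # file written on a big-endian machine
    else:
        raise ValueError("not a classic microsecond pcap file")

    counts = Counter()
    offset = 24  # skip the 24-byte global header
    while offset + 16 <= len(data):
        # Per-packet header: ts_sec, ts_usec, captured length, original length.
        _sec, _usec, incl_len, _orig_len = struct.unpack_from(
            endian + "IIII", data, offset
        )
        offset += 16
        frame = data[offset:offset + incl_len]
        offset += incl_len
        if len(frame) < 14:
            continue  # too short to hold a full Ethernet header
        # Bytes 12-13 hold the EtherType (0x8100 here means a VLAN tag,
        # which this sketch does not unwrap).
        ethertype = struct.unpack_from("!H", frame, 12)[0]
        # The low bit of the first destination octet marks group addresses
        # (multicast and broadcast).
        kind = "multicast" if frame[0] & 0x01 else "unicast"
        counts[(ethertype, kind)] += 1
    return counts


# Usage (hypothetical file name):
#   with open("nion4_mirror.pcap", "rb") as f:
#       counts = summarize_pcap(f.read())
#   print(counts[(COBRANET_ETHERTYPE, "multicast")])
```

A large multicast count on 0x8819 at a node that should only be handling a few bundles would be a hint that it is receiving traffic it does not need.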

Nihilism is best done by professionals

3

Re: HF2 problem version 2

53 CobraNet devices,powerful,

learning

Achievements of others, and create ourself

4

Re: HF2 problem version 2

Although I have talked to cobraguy offline about the issue, I wanted to give everyone an update.

I tried the Beta CM-1 firmware and that didn't correct our problem. I don't think it is an overflow problem since network monitoring doesn't show any unexpected traffic or bursts of data.

I did see some interesting behavior. Currently, the CAB16d that I use to input the serial data to the network is connected to a different switch than the NIONs. If I move it to the same switch as the NIONs, the serial data gets corrupted. I applied some source-port filtering between the CAB and the NIONs and the serial data was fixed, so I was hoping that would also stop the CM1 from crashing, but it did not. I am wondering if both issues are related....

We are trying to reproduce our problem at our office in a more controlled environment. I will let you all know what we find (if we find something!)

5

Re: HF2 problem version 2

Wow, that's pretty weird. Can you define the corruption a bit more? Is it completely unrelated to the input (garbage), bytes and/or segments out of order, loss of data, etc.? If the data is input via the 16d, where is the output on which you are detecting this corruption? When you mention source-port, I assume you mean the physical switch source port? Are these switches all members of the same broadcast domain (VLAN)?

6

Re: HF2 problem version 2

It is definitely weird!

The serial data we are sending is a timer that counts down. When it gets corrupted, the timer counts down 2-3 seconds and suddenly jumps to show a completely invalid time for a second or so, then comes back to count down 2-3 seconds correctly and jumps again to show bad data. I did not have time to analyze the actual serial data to look in more detail.
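
If the timer readings can be logged, the jumps described above could be picked out automatically rather than by eye. This is a rough sketch under several assumptions: the serial data has already been decoded into one whole-second reading per second, the first reading is good, and the countdown never pauses. The function name and the sample values are invented.

```python
from typing import List, Sequence


def find_bad_readings(seconds_remaining: Sequence[int]) -> List[int]:
    """Return the indexes of readings that do not continue the countdown.

    Each reading is checked against the last good reading minus the number
    of samples that have gone by since then, so a short burst of bad data
    does not cause the good readings after it to be flagged as well.
    """
    bad = []
    if not seconds_remaining:
        return bad
    last_good_index = 0
    last_good_value = seconds_remaining[0]
    for i in range(1, len(seconds_remaining)):
        expected = last_good_value - (i - last_good_index)
        if seconds_remaining[i] == expected:
            last_good_index = i
            last_good_value = seconds_remaining[i]
        else:
            bad.append(i)
    return bad


# Hypothetical log: counts down normally, with two corrupted readings.
log = [600, 599, 598, 17, 596, 595, 9999, 593]
print(find_bad_readings(log))  # [3, 6]
```

The spacing between flagged indexes would show whether the corruption really is periodic (every 2-3 seconds) or only looks that way on the display.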

The network is a large ring (12 total switches in the loop). Typically the CAB is on, let's say, switch 5, and the devices that receive and output the data are on switches 3, 7 and 11. The NIONs are on switch 1, and that is where I moved the CAB when the data got corrupted. The original issue I have had with the CM1 also changes as the network is modified to have fewer switches on the ring, which seems to correlate in some ways with this other problem.

Source-port filtering is a feature on some HP switches that lets you block traffic arriving on a given source port from being forwarded to other ports on the same switch. Yes, all switches are members of the same VLAN of course.

7

Re: HF2 problem version 2

It looks like we may now have a handle on the cause of this problem. An analysis of the processor load on the CM-1 CobraNet module revealed that the processor was being overworked and did not have enough bandwidth to support the assigned configuration.

One of the key factors was the use of the 'Advanced' CobraNet configuration mode. This enables 16x16 bundle transmitters and receivers and adds quite a bit to the processor load. This mode usually works well, and in this case it would also have been OK if that were all the processor had to do. As configured, the processor bandwidth was pretty much all consumed, but the module would still work. The serial bridge data on the network was multicast, so it was being received by all devices whether they needed it or not. Reception of the serial bridge data by the NION was enough to push the processor bandwidth requirements over the edge and cause the CM-1 to fail.

The solution is either to use unicast serial bridging, so that the data goes point to point only where needed, or to reduce the load on the failing NION by dropping the 16x16 advanced bundle mode and backing off to 8x8 or 4x4 bundle mode. Changing the bundle mode will require moving processing of bundles to a NION or NIONs that have available processing capacity in their CM-1s.
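
To make the multicast versus unicast point concrete, here is a toy model of which nodes have to take in serial bridge traffic in each case. It is not a model of CM-1 processor load (that is what the Cirrus tool is for); it only counts how many serial streams land on each node. The node names and the function are hypothetical.

```python
from typing import Dict, List


def serial_streams_seen(nodes: List[str], destinations: List[str]) -> Dict[str, int]:
    """Count how many serial bridge streams each node has to receive.

    nodes: every CobraNet node in the broadcast domain except the sender.
    destinations: one entry per serial stream, either the string
        "multicast" or the name of the single node a unicast stream targets.
    """
    seen = {name: 0 for name in nodes}
    for dest in destinations:
        for name in nodes:
            # A multicast stream reaches every node; a unicast stream
            # reaches only the node it is addressed to.
            if dest == "multicast" or dest == name:
                seen[name] += 1
    return seen


nodes = ["NION 3", "NION 4", "receiver 1", "receiver 2"]

# One multicast stream: every node, including the NIONs, has to receive it.
print(serial_streams_seen(nodes, ["multicast"]))
# {'NION 3': 1, 'NION 4': 1, 'receiver 1': 1, 'receiver 2': 1}

# Unicast streams to the receivers only: the NIONs see nothing.
print(serial_streams_seen(nodes, ["receiver 1", "receiver 2"]))
# {'NION 3': 0, 'NION 4': 0, 'receiver 1': 1, 'receiver 2': 1}
```

In the unicast case the NIONs drop out of the delivery set entirely, which is the load reduction described above.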

The processor capacity of a CM-1 module can be modeled using a tool from Cirrus Logic called cycle.exe. This tool has historically only been available under NDA. I have contacted Cirrus and asked if we can make it available on this forum. I will follow up on this forum with news of their response.

Nihilism is best done by professionals

8

Re: HF2 problem version 2

The CM-1 modeling tool is available at:

http://mm.peavey.com/assets/software/CM … le_271.zip

Unzip it to a folder. Make a copy of the cycle.ini file to keep as a reference, then edit it to match your configuration and run cycle.exe from a command line.
Read the PDF file for more instructions.
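
If you expect to try several configurations, the "keep a reference copy" step can be scripted so the original cycle.ini is never overwritten by accident. A minimal sketch; the ".orig" suffix and the function name are my own choices, not something the tool requires, and the folder in the usage comment is made up.

```python
import shutil
from pathlib import Path


def backup_ini(folder: str, name: str = "cycle.ini") -> Path:
    """Copy the ini file to '<name>.orig' once and return the backup path.

    If the backup already exists it is left alone, so the pristine
    reference copy survives repeated edits of the working file.
    """
    source = Path(folder) / name
    backup = source.with_name(source.name + ".orig")
    if not backup.exists():
        shutil.copy2(source, backup)
    return backup


# Usage (hypothetical folder where the zip was extracted):
#   backup_ini(r"C:\tools\cycle")
```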

Nihilism is best done by professionals