1

Topic: n3 cm1 failure

In the last two weeks I've started having system failures. The system, composed of two n3s and a ConMan node, stopped passing audio, and the n3 in question required a reboot. Here are the suspect n3's log entries beginning at the time of the error:

10/28/2007 7:40:54    12408    note    mcp/processes    shutting down gracefully
10/28/2007 7:34:27    12407    note    piond/role_manager    role is stopped
10/28/2007 7:34:24    12406    note    project    user logged off: pwadmin
10/28/2007 7:34:24    12405    note    project    user logged off: etech
10/28/2007 7:34:24    12404    note    piond/role_manager    role is running
10/28/2007 7:34:24    12403    fault    piond/fault_policy    more than one error in less than one minute; stopping engine
10/28/2007 7:34:20    12402    note    project    user logged on: pwadmin
10/28/2007 7:34:20    12401    note    project    user logged on: etech
10/28/2007 7:33:58    12400    error    piond/cm1    cm1 not detected : /dev/pion/cm10: timeout waiting for HF2 to go high
10/28/2007 7:33:58    12399    error    piond/cm1    peek aborted after 5 tries: /dev/pion/cm10: timeout waiting for HF2 to go high
10/28/2007 7:33:57    12398    note    piond/role_manager    restarting role : USF SS new/DSP-01/JFb7-bfKY0Jcl7tj8I3pKUZ6uV8/xkD1T16dBJ-l9g11zxr5QbowFDS
10/28/2007 7:33:55    12397    note    project    user logged off: pwadmin
10/28/2007 7:33:55    12396    note    project    user logged off: etech
10/28/2007 7:33:55    12395    error    piond/cm1    peek aborted after 5 tries: /dev/pion/cm10: timeout waiting for HF2 to go high
10/28/2007 7:33:55    12394    error    piond/cm1    mute assertion failed: /dev/pion/cm10: timeout waiting for HF2 to go high
10/28/2007 7:33:55    12393    error    piond/cm1    poke aborted after 5 tries: /dev/pion/cm10: timeout waiting for HF2 to go high
10/28/2007 7:33:55    12392    error    piond/cm1    poke/peek driver exception : /dev/pion/cm10: timeout waiting for HF2 to go high
10/28/2007 7:33:55    12391    note    piond/mute    muted: menu command
10/28/2007 7:33:55    12390    error    piond/fault_policy    restarting audio engine
10/28/2007 7:33:55    12389    error    piond/cm1    peek aborted after 5 tries: /dev/pion/cm10: timeout waiting for HF2 to go high


The graceful shutdown was initiated by the technicians to bring the system back online.

As mentioned above, this has happened at least twice: the same errors from the same n3, about 10 days apart. The other n3's log just contains the normal complaints about losing XDAB at the time the above errors occurred. Its log entries are as follows:

10/28/2007 7:40:55    3889    note    mcp/processes    shutting down gracefully
10/28/2007 7:40:37    3888    note    project    user logged off: pwadmin
10/28/2007 7:34:04    3887    note    piond/xdab/leader    arbitration done; ring is incomplete in redundant failed mode
10/28/2007 7:33:55    3886    error    piond/xdab/leader    communication failure
10/28/2007 7:33:54    3885    note    piond/xdab/leader    poll returned false: 'DSP-01'
10/28/2007 7:33:54    3884    note    piond/mute    muted: xdab loss of clock signal

2

Re: n3 cm1 failure

The "timeout waiting for HF2 to go high" indicates that the CM-1 CobraNet interface has crashed.  CM-1's can crash when they receive too much Ethernet traffic.  Typically, this happens when there is an Ethernet "storm".  Storms occur when there is a loop on the Ethernet network.

Another possible cause is excessive broadcast traffic on the network. We've also seen this happen with certain fast spanning tree network configurations. Finally, any port bandwidth throttling (sometimes called "storm control") can cause problems as well.

So, check for any changes on your network.  By the way, Cirrus (keeper of all things CobraNet) is aware of this problem.  They may have suggestions as well.
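If you want to put a rough number on the broadcast load, something like the sketch below will do it. This is purely an illustration, assuming Python with the scapy library on a PC whose switch port is mirrored to the CobraNet VLAN; the interface name is a placeholder.

    # Count broadcast frames per second on a monitor port (rough sketch).
    # Assumes Python 3 with scapy installed; replace "eth0" with the capture NIC.
    from scapy.all import sniff, Ether

    IFACE = "eth0"      # placeholder capture interface
    WINDOW = 10         # seconds to sample

    counts = {"broadcast": 0, "total": 0}

    def tally(pkt):
        if Ether in pkt:
            counts["total"] += 1
            if pkt[Ether].dst.lower() == "ff:ff:ff:ff:ff:ff":
                counts["broadcast"] += 1

    sniff(iface=IFACE, prn=tally, store=False, timeout=WINDOW)
    print("%.1f broadcast frames/s (%d of %d frames in %ds)"
          % (counts["broadcast"] / WINDOW, counts["broadcast"],
             counts["total"], WINDOW))

On a healthy, isolated CobraNet VLAN that rate should be close to zero; a storm will show orders of magnitude more.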

3

Re: n3 cm1 failure

Well, the n3 that is crashing is the only n3 receiving Cobranet bundles, so as far as inbound traffic goes, it definitely handles the most. But it's still only about nine bundles at three channels each.

The possibility of a loop is there, I guess, however remote. I could go through and disable all the unused ports just to make sure no one is making patches.

I am not using RSTP.

As for Cirrus, do you know any way to contact them? As an end user, I've never been able to get a response from them on the few occasions that I've tried.

I did recently add two switches, a ConMan node, and a CAB to the network, but the Cobranet VLAN is completely isolated, even from the Nion control network. I guess I can take a look at that Nion's port with a network sniffer, but I'm pretty sure I'm only going to find Cobranet frames. With the exception of the ConMan's second NIC for CAB control, there are no non-Cobranet devices on this VLAN.
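When I do put a sniffer on it, I'll probably just tally frames by EtherType, something like the rough sketch below (assuming Python with scapy on a mirrored port; the interface name is a placeholder). Anything other than the Cobranet EtherType 0x8819 showing up in volume would be a red flag.

    # Tally frames by EtherType on the mirrored Nion port (rough sketch).
    # Assumes Python 3 with scapy installed; replace "eth0" with the capture NIC.
    from collections import Counter
    from scapy.all import sniff, Ether

    IFACE = "eth0"      # placeholder capture interface
    ethertypes = Counter()

    def tally(pkt):
        if Ether in pkt:
            ethertypes[hex(pkt[Ether].type)] += 1

    sniff(iface=IFACE, prn=tally, store=False, timeout=30)
    for etype, count in ethertypes.most_common():
        print(etype, count)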

4

Re: n3 cm1 failure

Hmm, now that you've put me on to this inbound traffic problem, I think I might have an idea. That n3 is actually receiving nine bundles. All the bundles are from CAB 4ns. I'm only picking off 3 channels from most of them, but the CABs are still sending four, which makes for a total of 36 inbound channels.
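Just to sanity-check the math on what that one CM-1 is being asked to take in — rough figures only, assuming Cobranet's usual 48 kHz / 20-bit audio and ignoring packet overhead:

    # Back-of-envelope inbound load on the CM-1 (illustration only).
    # Assumes 48 kHz / 20-bit audio; real bundles carry extra packet overhead.
    bundles = 9
    channels_on_wire = 4        # the CAB 4n sends 4 even if I only use 3
    sample_rate = 48000         # Hz
    bits_per_sample = 20

    channels = bundles * channels_on_wire
    audio_mbps = channels * sample_rate * bits_per_sample / 1e6
    print(channels, "channels,", round(audio_mbps, 1), "Mbit/s of audio payload")
    # -> 36 channels, 34.6 Mbit/s of audio payload

Not huge on a 100 Mbit link, but it all lands on the one CM-1.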

5

Re: n3 cm1 failure

I think you've got it, Jason. Even though you're only using 3 of the 4 channels in a bundle, it still has to receive all 4 channels, as the Conductor allocates enough bandwidth for all 4. We are working on advanced subchannel mapping in the CAB4n so you will be able to send a 3-channel bundle, so stay tuned. Bear in mind, however, that the CM-2 module in the CAB4n only has 4 transmitters to work with.

I have seen situations where the CM-1 gets "pounded" by Crest CKi amps with Cobranet inputs, causing it to crash. Crest has new firmware to correct that. In the meantime, we have found that moving the Conductor to a CAB takes enough load off the CM-1 for it to function correctly.

The only true wisdom is in knowing you know nothing. -Socrates

6

Re: n3 cm1 failure

Jason, 

You mentioned... "With the exception of the ConMan's second NIC for CAB control, there are no non-Cobranet devices on this VLAN"

...out of curiosity, what type of control are you doing with the CABs from this NIC?

Thanks,

Joe

7

Re: n3 cm1 failure

To Ivor: I split the receive bundles across the two n3s last week. So far, no problems. From what cwa said about too much incoming traffic, I'm pretty positive that was the problem. My stupid mistake.

To Joe: One of the reasons I added a ConMan node was to manage the CAB devices along with the scripts, leaving the n3s for audio only. Normal CAB control traffic is carried in a Cobranet frame (Ethernet type 0x8819), which of course must be on the same LAN as Cobranet. The ConMan node has to have a NIC connected to the control LAN, which carries the normal NioNode control traffic but is isolated from the Cobranet network. A second NIC was added to give the ConMan node access to both LANs, which now allows ConMan to communicate normally with the NioNodes and natively with the CABs.

As a side note, I found out when I added the second NIC, which was an Intel, that it is capable of being 802.1Q (VLAN) aware. One could simply use one of these physical network interfaces to connect to many VLANs through a trunked switch port.
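To illustrate what that 802.1Q awareness buys you: on a trunked switch port each frame just carries a VLAN tag, so one physical NIC can sit on several VLANs at once. Here's a rough sketch of what the tagged frames look like, assuming Python with scapy; the VLAN numbers and addresses are made up for the example.

    # Two frames leaving one physical NIC on a trunked port: the 802.1Q tag
    # (VLAN ID plus the encapsulated EtherType) is all that separates the
    # Cobranet VLAN from the control VLAN. VLAN IDs and addresses are made up.
    from scapy.all import Ether, Dot1Q, Raw

    cobranet_frame = (Ether(dst="01:60:2b:00:00:01") /      # placeholder multicast MAC
                      Dot1Q(vlan=10, type=0x8819) /         # hypothetical Cobranet VLAN
                      Raw(b"cobranet payload stub"))
    control_frame = (Ether(dst="ff:ff:ff:ff:ff:ff") /
                     Dot1Q(vlan=20, type=0x0800) /          # hypothetical control VLAN, IP payload
                     Raw(b"control traffic stub"))

    cobranet_frame.show()
    control_frame.show()

Whether that's worth doing instead of just adding the second NIC is another question, of course.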

8

Re: n3 cm1 failure

Right, I was curious if you were doing SNMP control of the CABs or something else.  Thanks for the details.

Thanks,

Joe

9

Re: n3 cm1 failure

I looked into SNMP initially, until I found out that ConMan can do native control. As you probably know, the catch with SNMP control is that it requires IP. It seems that you use BootP to give the CABs IP addresses, then use SNMP to control the CABs from there. However, there are some hoops to jump through because BootP won't always give the same IP address to the same MAC. Some folks from Peak Audio gave me the rundown on making that association with various (highly undocumented) devices. It would probably work, but it comes with a lot of headaches, including requiring the maintenance staff to be aware of MAC address changes when replacing equipment.
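For reference, here's roughly what SNMP control would have looked like once BootP had handed a CAB an address — a minimal sketch using Python's pysnmp, just reading the standard sysDescr object. The IP and community string are placeholders, and real control would go through the Cobranet enterprise MIB rather than this generic object.

    # Minimal SNMP GET against a CAB once it has an IP (illustration only).
    # The address and community string are placeholders.
    from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                              ContextData, ObjectType, ObjectIdentity, getCmd)

    errorIndication, errorStatus, errorIndex, varBinds = next(getCmd(
        SnmpEngine(),
        CommunityData('public', mpModel=0),                 # SNMPv1
        UdpTransportTarget(('192.0.2.50', 161)),            # placeholder CAB IP
        ContextData(),
        ObjectType(ObjectIdentity('1.3.6.1.2.1.1.1.0'))))   # sysDescr.0

    if errorIndication:
        print(errorIndication)
    else:
        for name, value in varBinds:
            print(name, "=", value)

Native ConMan control sidesteps all of that, which is why I went that way.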

10

Re: n3 cm1 failure

It seems that this problem has not been fixed after all. I changed the program so that the failing n3 is now only receiving 20 channels: five bundles from various CAB 4ns at four channels each. Since then we have experienced two new failures, four days apart. The logs show the same problem as in the original post, with the same NioNode. We do have a spare n3 if the NioNode itself is suspect.

As a side note, this problem seems to have cropped up after a system modification we recently completed, during which we added a single CAB 4n, two switches, and a ConMan node. Before the change I have logs covering at least a year that contain no such error. The new CAB is used for output only. No additional CobraNet input bundles were added to the NioNodes during this modification, which is interesting because the original system had already been running with the 'overloaded' CM1. In fact, the NioNodes were also handling the CAB control traffic at that time.

11

Re: n3 cm1 failure

Hi jvalenzuela, I'm now getting the same trouble you had. In my system I use Crest Ci20x8 amplifiers, and I have a CAB4n and ConMan too. I use 12 Nions to handle the audio processing and ConMan to handle the control scripts. After I did that, I started getting this error every day, which is a nightmare. I'm afraid the Ci20x8s generate a lot of network traffic. Have you solved this problem yet?

love peace and music
Tibet is one part of China!!!

12

Re: n3 cm1 failure

Unfortunately I have not yet solved this problem, although my failures seem to be much less frequent than yours. I am interested to know exactly what type of traffic these amplifiers produce, in order to see if/how it could affect a NioNode. It is quite possible on a switched network, depending on the traffic type, that a NioNode would never see this traffic on its Cobranet interfaces.

13

Re: n3 cm1 failure

I think a network broadcast storm will cause the CM-1 to crash. We did some tests with a network engineer: our network switches run MSTP, and if a switch restarts, the root has to be rebuilt, and there is a network storm while that happens. So a broadcast storm is one reason that will cause the CM-1 to crash.

love peace and music
Tibet is one part of China!!!

14

Re: n3 cm1 failure

I'm not sure what 'MSTP' is; perhaps you mean 'STP', as in Spanning Tree Protocol. With STP, if a switch goes down, STP will not cause a loop and the associated storm(s) while the network converges on a new topology. STP's whole purpose in life is to prevent network loops and the general unhappiness they bring.

15

Re: n3 cm1 failure

Yes, MSTP means 'Multiple Spanning Tree Protocol'.

love peace and music
Tibet is one part of China!!!

16

Re: n3 cm1 failure

Jason, did you ever get to the bottom of this? Did the problem mysteriously disappear?

This CM1/Piond crash is still popping up fairly frequently (speaking for the UK and Europe, of course). Just trying to establish a pattern right now.

All energy flows according to the whims of the Great Magnet. What a fool I was to defy him.

17

Re: n3 cm1 failure

Nope, I'm still working down the list of things that changed since the problem started. Since the failure tends to be rather infrequent, one or two failures a week (sometimes more, sometimes less), the process is a little slow. As soon as I get a clue, I'll be sure to post my findings...

18

Re: n3 cm1 failure

Jason, did you ever get to the bottom of this? I am getting similar errors on a large system.

19

Re: n3 cm1 failure

Unfortunately not. Last time I visited the site, they had installed a watchdog monitor to reboot the Nions when a failure was detected.