I'm wondering whether anyone has seen this before, what the cause may be, and whether there is an automated way to recover from it.
Situation: We run a 3-node cluster (with a 3-node GCO cluster at a remote site) running VCS 5.1-SP1 on Dell R411 servers. This past Saturday, our operations team was performing a standard switchover of our Primary resources (applications) to a Standby node. On switchover to the new node, the IP resource (which is first in the dependency tree) started up. VCS then reported that it was starting the first of our seven Application resources, but none of them started. [As an aside, this node had run the Application resources within the previous 3 weeks and they are currently running on that node, so the problem was not with the applications themselves.] The Application Agent appeared to be hung: we could still interact with the had daemon for stats and some commands, but hastop commands (and variants) would not complete (i.e., we had to CTRL-C them because they never finished).
This left us in a no-brain situation. There were no log entries or traps indicating that the had daemon was having a problem with the Application Agent. Worse, the had daemon did not try to recover from the no-brain situation, at least not during the 15 minutes we spent trying CLI commands to clear the issue. We were eventually able to recover by rebooting the server where the issue was occurring. We have a 24x7 operation, and outages of more than 4 minutes can be very detrimental to our customers.
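To illustrate what we mean by commands that "would not complete": a minimal sketch of running a CLI command under a deadline, so that a stuck command is abandoned automatically instead of waiting for someone to press CTRL-C. The helper name run_with_deadline and the timeout value are our own placeholders; nothing here is part of VCS.

```python
import subprocess


def run_with_deadline(cmd, timeout_s):
    """Run cmd (a list of arguments) and give up after timeout_s seconds.

    Returns (completed, returncode). completed is False when the command
    had to be abandoned, in which case returncode is None.
    """
    try:
        # subprocess.run kills the child and reaps it when the timeout expires
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, None
    return True, result.returncode
```

A caller would pass the real command as a list of arguments and treat completed == False as "this command is hung", rather than as an ordinary failure.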
How do we know it was an Application Agent hang? We have been able to create the same situation in our lab by attaching to one or more of the Application Agent threads and causing them to halt on a Standby node, then switching over to that node. The Application resources are not started, and the had daemon does not try to recover from the situation (or, if it does, it reports that it is restarting the Application Agent and then reports that the agent is already up), which basically leaves us in no-brain. Also, we are migrating to VCS 6.0.1 in the next month, and we see the same behavior with that release.
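For anyone who wants to try something similar, here is a rough sketch of the general idea, not our exact lab procedure. We halted individual agent threads by attaching to them; the sketch below instead uses SIGSTOP, which freezes every thread of a process at once, and it uses a short-lived stand-in child process rather than a real agent. The function names are hypothetical.

```python
import os
import signal
import subprocess
import sys


def freeze(pid):
    """Stop all threads of the process; it stays alive but does no work."""
    os.kill(pid, signal.SIGSTOP)


def thaw(pid):
    """Let a previously frozen process continue."""
    os.kill(pid, signal.SIGCONT)


def finishes_within(proc, seconds):
    """Return True if proc exits within the given number of seconds."""
    try:
        proc.wait(timeout=seconds)
        return True
    except subprocess.TimeoutExpired:
        return False


def demo():
    # Stand-in child that would normally exit after half a second.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(0.5)"]
    )
    freeze(child.pid)
    hung = not finishes_within(child, 2.0)      # frozen: does not exit
    thaw(child.pid)
    recovered = finishes_within(child, 10.0)    # resumed: exits normally
    return hung, recovered
```

The point of the demo is that a frozen process still exists and looks "up" to anything that only checks for its presence, which matches what we saw when the agent was reported as already running.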
Has anyone seen this before? Is it a known VCS bug? Is there some way to automatically recover from this to keep us out of extended no-brain?
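To make the last question concrete, this is the kind of external watchdog we have in mind, as a minimal sketch only. It assumes we supply two callables of our own: probe(), which returns True once the resource is online, and recover(), which performs whatever escalation we choose (for us, so far, that has meant rebooting the node by hand). Both are hypothetical placeholders, and watch_online is not a VCS feature.

```python
import time


def watch_online(probe, recover, deadline_s, poll_s=1.0):
    """Poll probe() until it returns True or deadline_s seconds elapse.

    If the deadline passes without probe() succeeding, call recover()
    exactly once. Returns "online" or "recovery_triggered".
    """
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        if probe():
            return "online"
        time.sleep(poll_s)
    # The resource never came online within the allowed window.
    recover()
    return "recovery_triggered"
```

With a 4-minute outage budget, deadline_s would need to be well under 240 seconds to leave time for the recovery action itself.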