Hello,
We are doing some failure tests for a customer. We have VCS 6.2 running on Solaris 10, with an Oracle database and, of course, the listener associated with it.
We are simulating different kinds of failures, one of which is killing the listener. In this case the cluster detects that the listener has died and fails the service over to the other node. However, the listener resource remains in the FAULTED state on the original node, and the group it belongs to stays in the OFFLINE FAULTED state there. As a result, if something then goes wrong on the second node, the service will not fail back to the original node until we manually run hagrp -clear.
Is there anything we can do to fix this, i.e. to have the fault cleared automatically?
Here are some lines from the log:
2015/03/30 17:26:10 VCS ERROR V-16-2-13067 (node2p) Agent is calling clean for resource(ora_listener-res) because the resource became OFFLINE unexpectedly, on its own.
2015/03/30 17:26:11 VCS INFO V-16-2-13068 (node2p) Resource(ora_listener-res) - clean completed successfully.
2015/03/30 17:26:11 VCS INFO V-16-1-10307 Resource ora_listener-res (Owner: Unspecified, Group: oracle_rg) is offline on node2p (Not initiated by VCS)
These messages say that clean for the resource completed successfully, but the resource is still FAULTED.
However, if I run hares -clear manually, the fault goes away:
20150330-173628:root@node1p:~# hares -state ora_listener-res
#Resource Attribute System Value
ora_listener-res State node1p ONLINE
ora_listener-res State node2p FAULTED
20150330-173636:root@node1p:~# hares -clear ora_listener-res
20150330-173653:root@node1p:~# hares -state ora_listener-res
#Resource Attribute System Value
ora_listener-res State node1p ONLINE
ora_listener-res State node2p OFFLINE
20150330-173655:root@node1p:~#
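For reference, the manual step above could be scripted roughly as follows. This is only a sketch of what we currently do by hand, not a recommended fix: the function names faulted_systems and clear_faults are made up for illustration, and it assumes a POSIX shell and awk (on Solaris 10 that would mean /usr/xpg4/bin/sh and nawk rather than the legacy /bin/sh and awk). It parses the hares -state output format shown above and runs hares -clear with -sys for each system where the resource is FAULTED.

```shell
# Read `hares -state <resource>` output on stdin and print the name of
# every system on which the resource is reported as FAULTED.
# Expected columns: Resource Attribute System Value
faulted_systems() {
    awk '$1 !~ /^#/ && $2 == "State" && $4 ~ /FAULTED/ { print $3 }'
}

# Hypothetical wrapper: clear the fault of one resource on each system
# where it is currently FAULTED. Must run on a cluster node as root.
clear_faults() {
    res="$1"
    for sys in $(hares -state "$res" | faulted_systems); do
        hares -clear "$res" -sys "$sys"
    done
}
```

With the state shown above, clear_faults ora_listener-res would run hares -clear ora_listener-res -sys node2p. What we are really asking is whether VCS can do this by itself instead of us wrapping it in a script.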