OCFS2 1.4.3 single node failure mount freeze (OCFS2 1.4.3 SLES11 SP1)
OCFS2 1.4.3 single node failure mount freeze [message #477779] Mon, 04 October 2010 07:44
thrixxxadmin
Messages: 3
Registered: October 2010
Junior Member
Hi there.

I've set up an OCFS2 cluster with 5 nodes; every node is a VM on a VMware ESX server.
The SCSI controller for every VM is set to Physical (disks can be shared between ESX servers).
I've created one LUN on our SAN and used "Raw Mapped LUN" to add the LUN to every VM.

Then I created one partition and formatted it with OCFS2.
For all timeouts I used the default settings:
Specify heartbeat dead threshold = 31
Specify network idle timeout in ms = 30000
Specify network keepalive delay in ms = 2000
Specify network reconnect delay in ms = 2000
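
For reference, these defaults can be converted into wall-clock times. The following is a minimal sketch, assuming the usual O2CB rule that the disk heartbeat timeout in seconds is (threshold - 1) * 2; the function and variable names are made up for illustration and are not part of any OCFS2 tool.

```python
# Sketch: convert the o2cb settings listed above into seconds.
# Assumption: disk heartbeat timeout = (threshold - 1) * 2 seconds,
# i.e. one heartbeat iteration every 2 seconds.

def heartbeat_dead_seconds(threshold, interval_s=2):
    """Seconds without a disk heartbeat before a node is declared dead."""
    return (threshold - 1) * interval_s


def ms_to_seconds(milliseconds):
    """Convert a millisecond setting to seconds."""
    return milliseconds / 1000.0


# Values copied from the settings in this post.
settings = {
    "heartbeat_dead_threshold": 31,
    "network_idle_timeout_ms": 30000,
    "network_keepalive_delay_ms": 2000,
    "network_reconnect_delay_ms": 2000,
}

summary = {
    "disk heartbeat timeout (s)": heartbeat_dead_seconds(
        settings["heartbeat_dead_threshold"]),
    "network idle timeout (s)": ms_to_seconds(
        settings["network_idle_timeout_ms"]),
    "network keepalive delay (s)": ms_to_seconds(
        settings["network_keepalive_delay_ms"]),
    "network reconnect delay (s)": ms_to_seconds(
        settings["network_reconnect_delay_ms"]),
}

for name, value in summary.items():
    print(name, "=", value)
```

Under that assumption, the network idle timeout (30 s) expires before the disk heartbeat timeout (60 s).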

Everything works OK, except sometimes when I power off one node for testing.

After the power-off I check the other nodes and look at the kernel messages. The node does get kicked out after 30 s, and after an additional 30 s another node jumps in as the recovery node.

However, when I start this node again, it freezes during the OCFS2 mount.

Kernel messages:
[  207.393896] o2net: connected to node de-db02 (num 2) at 192.168.128.8:7777
[  209.899312] ocfs2_dlm: Node 2 joins domain 6BECF48BA0524275BEA1E6F32AF9D756
[  209.899315] ocfs2_dlm: Nodes in domain ("6BECF48BA0524275BEA1E6F32AF9D756"): 1 2 3 4 5


It looks like the node successfully joins the OCFS2 cluster, but the mount doesn't finish.
It hangs forever.
The only thing I can do is power off EVERY node and boot them one after another.

A restart is not a problem (with shutdown -r now, because OCFS2 then successfully leaves the cluster).

OCFS2 cluster.conf:
node:
        name = de-db01
        cluster = htdocs01
        number = 1
        ip_address = 192.168.128.7
        ip_port = 7777

node:
        name = de-db02
        cluster = htdocs01
        number = 2
        ip_address = 192.168.128.8
        ip_port = 7777

node:
        name = de-www01
        cluster = htdocs01
        number = 3
        ip_address = 192.168.128.110
        ip_port = 7777

node:
        name = de-www02
        cluster = htdocs01
        number = 4
        ip_address = 192.168.128.111
        ip_port = 7777

node:
        name = mail01
        cluster = htdocs01
        number = 5
        ip_address = 192.168.128.240
        ip_port = 7777

cluster:
        name = htdocs01
        node_count = 5
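
A quick way to rule out a plain configuration mistake is to check the file mechanically. This is only a sketch: the parser below assumes the simple "stanza:" plus indented "key = value" layout shown above, and parse_cluster_conf and check_cluster are hypothetical helper names, not part of ocfs2-tools.

```python
# Sketch: sanity-check a cluster.conf laid out like the one in this post.
# Assumption: stanza headers are lines such as "node:" or "cluster:",
# followed by "key = value" lines.

def parse_cluster_conf(text):
    """Return a list of {"type": ..., "params": {...}} stanzas."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            current = {"type": line[:-1], "params": {}}
            stanzas.append(current)
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            current["params"][key.strip()] = value.strip()
    return stanzas


def check_cluster(stanzas):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    nodes = [s["params"] for s in stanzas if s["type"] == "node"]
    clusters = [s["params"] for s in stanzas if s["type"] == "cluster"]

    # node_count must match the number of node stanzas for that cluster.
    for cluster in clusters:
        members = [n for n in nodes if n.get("cluster") == cluster.get("name")]
        declared = int(cluster.get("node_count", "0"))
        if declared != len(members):
            problems.append(
                "cluster %s: node_count = %d but %d node stanzas found"
                % (cluster.get("name"), declared, len(members)))

    # Node numbers and address/port pairs should be unique.
    numbers = [n.get("number") for n in nodes]
    if len(set(numbers)) != len(numbers):
        problems.append("duplicate node numbers")
    endpoints = [(n.get("ip_address"), n.get("ip_port")) for n in nodes]
    if len(set(endpoints)) != len(endpoints):
        problems.append("duplicate ip_address/ip_port pairs")
    return problems
```

Calling check_cluster(parse_cluster_conf(open("/etc/ocfs2/cluster.conf").read())) on each node and getting an empty list on all of them would at least confirm that the node list is internally consistent; it says nothing about the mount hang itself.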


Thank you for any advice!
I have no idea what's wrong...
Re: OCFS2 1.4.3 single node failure mount freeze [message #477883 is a reply to message #477779] Tue, 05 October 2010 01:22
thrixxxadmin
I forgot to mention the mount options: _netdev,noatime,data=writeback,nointr
Re: OCFS2 1.4.3 single node failure mount freeze [message #477912 is a reply to message #477883] Tue, 05 October 2010 03:50
thrixxxadmin
Is it possible that the VMs start too fast after a short power-off, so that OCFS2 hasn't finished the recovery process yet?
Does the restarted node then get locked because OCFS2 doesn't know what to do?
Maybe a sleep in the init script would help?
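
To make the sleep idea concrete, here is a rough sketch of how such a delay could be derived from the configured timeouts instead of being guessed. It is untested against the actual hang; the 30 s safety margin is an arbitrary assumption, the heartbeat formula (threshold - 1) * 2 is the usual O2CB rule, and rejoin_delay_seconds and delay_before_mount are hypothetical names.

```python
import time

# Sketch: wait long enough for the surviving nodes to declare this node
# dead and finish recovery before it tries to mount again.
# Assumptions: heartbeat timeout = (threshold - 1) * 2 seconds, and an
# arbitrary 30 s margin for the recovery itself.

def rejoin_delay_seconds(heartbeat_threshold=31,
                         idle_timeout_ms=30000,
                         margin_s=30):
    """Longest configured timeout plus a safety margin, in seconds."""
    heartbeat_s = (heartbeat_threshold - 1) * 2
    idle_s = idle_timeout_ms / 1000
    return max(heartbeat_s, idle_s) + margin_s


def delay_before_mount(sleep=time.sleep, **timeouts):
    """Sleep for the computed delay; 'sleep' is injectable for testing."""
    delay = rejoin_delay_seconds(**timeouts)
    sleep(delay)
    return delay
```

With the default settings from the first post this gives 90 s. An init script could call delay_before_mount() right before the OCFS2 mount step; whether that actually avoids the freeze is exactly the open question here.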