OCFS2 is a POSIX-compliant cluster filesystem. Its feature set includes:
Optimized Allocations (extents, reservations, sparse, unwritten extents, punch holes)
REFLINKs (inode-based writeable snapshots)
Indexed Directories
Metadata Checksums
Extended Attributes (unlimited number of attributes per inode)
Advanced Security (POSIX ACLs and SELinux)
User and Group Quotas
Variable Block and Cluster sizes
Journaling (Ordered and Writeback data journaling modes)
Endian and Architecture Neutral (x86, x86_64, ia64 and ppc64) - yes, you can mount the filesystem in a heterogeneous cluster.
Buffered, Direct, Asynchronous, Splice and Memory Mapped I/Os
In-built Clusterstack with a Distributed Lock Manager
Cluster-aware Tools (mkfs, fsck, tunefs, etc.)
One of the main features added most recently is Global Heartbeat. OCFS2 was typically used with what is called local heartbeat: for every filesystem you mounted, it would start its own local heartbeat/membership mechanism. This disk heartbeat means a disk I/O every 1 or 2 seconds for every node in the cluster, for every device. That was never a problem when the number of mounted volumes was relatively small, but once customers were using 20+ volumes, the overhead of the multiple disk heartbeats became significant and at times turned into a stability issue.
Global heartbeat was written to solve the problem of multiple heartbeats. It is now possible to specify on which device(s) you want a heartbeat thread; you can then mount many other volumes that do not have their own heartbeat, and the heartbeat is shared amongst that one thread (or those few threads), significantly reducing the disk I/O overhead.
I was playing with this a little the other day and noticed that it wasn't very well documented, so why not write it up here and share it with everyone. Getting started with OCFS2 is really easy, and within just a few minutes it is possible to have a complete installation.
I started with two servers installed with Oracle Linux 6.3. Each server has 2 network interfaces, one public and one private. The servers have a local disk and a shared storage device. For cluster filesystems, this shared storage device should typically be either a shared SAN disk or an iSCSI device, but with Oracle Linux and UEK2 it is also possible to create a shared virtual device on an NFS server and use that device for the cluster filesystem. This technique is used with Oracle VM where the shared storage is NAS-based. I just wrote a blog entry about how to do that here.
While it is technically possible to create a working OCFS2 configuration using just one network and a single IP per server, it is certainly not ideal and not a recommended configuration for real-world use. In any cluster environment it is highly recommended to have a private network for cluster traffic. The biggest reason for instability in a clustering environment is a bad or unreliable network and/or storage. Many times the environment has an overloaded network that causes network heartbeats to fail, or disks where failover takes longer than the default configuration allows, and the only alternative we have at that point is to reboot the node(s).
Typically when I do a test like this, I make sure I use the latest versions of the OS release. So after an installation of Oracle Linux 6.3, I just do a yum update on all my nodes to get the latest packages and latest kernel version installed, and then reboot. That gets me to 2.6.39-300.17.3.el6uek.x86_64 at the time of writing. Of course, all this is freely accessible from http://public-yum.oracle.com.
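In command form that is simply the following on each node (the kernel version shown is just what I ended up with at the time of writing; yours may differ):

# yum update -y
# reboot
# uname -r
2.6.39-300.17.3.el6uek.x86_64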
Depending on the type of installation you did (basic, minimal, etc.) you may or may not have to add RPMs. Do a simple check with rpm -q ocfs2-tools to see if the tools are installed; if not, just run yum install ocfs2-tools. And that's it, all required software is now installed. The kernel modules are already part of the UEK2 kernel, and the required tools (mkfs, fsck, o2cb, ...) are part of the ocfs2-tools RPM.
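For example, on each node:

# rpm -q ocfs2-tools
# yum install ocfs2-tools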
Next up: create the filesystem on the shared disk device and configure the cluster.
One requirement for using global heartbeat is that the heartbeat device needs to be a NON-partitioned disk. Other OCFS2 volumes you want to create and mount can be on partitioned disks, but the device used for the heartbeat needs to be a whole, empty disk. Let's assume /dev/sdb in this example.
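A quick way to double-check that the disk really has no partition table (the device name is just the example used here) is to list it and confirm that no partition entries show up:

# fdisk -l /dev/sdb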
# mkfs.ocfs2 -b 4K -C 4K -J size=4M -N 4 -L ocfs2vol1 --cluster-name=ocfs2 --cluster-stack=o2cb --global-heartbeat /dev/sdb

This creates a filesystem with a 4K blocksize (the normal value) and a clustersize of 4K (if you have many small files, this is a good value; if you have mostly large files, go up to 1M).
Journal size of 4M: if you have a large filesystem with a lot of metadata changes, you might want to increase this. I did not add an option for 32-bit or 64-bit journals; if you want to create huge filesystems, use block64, which uses JBD2.
The filesystem is created for 4 nodes (-N 4); if your cluster needs to grow larger, you can always tune this later with tunefs.ocfs2 (see the example after this list).
Label ocfs2vol1: this is a disk label you can later use to mount the filesystem by label.
cluster-name=ocfs2: this is the default name, but if you want your own name for your cluster you can put a different value here. Remember it, because you will need to configure the cluster stack with this cluster name later.
cluster-stack=o2cb: it is possible to use a different cluster stack, such as pacemaker or cman.
global-heartbeat: makes sure that the filesystem is prepared and built to support global heartbeat.
/dev/sdb : the device to use for the filesystem.
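As an example of growing the node slot count later on, something along these lines should do it (the value 8 is just an illustration; tunefs.ocfs2 is cluster-aware, so the cluster stack needs to be running when you use it):

# tunefs.ocfs2 -N 8 /dev/sdb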
# mkfs.ocfs2 -b 4K -C 4K -J size=4M -N 4 -L ocfs2vol1 --cluster-name=ocfs2 --cluster-stack=o2cb --force --global-heartbeat /dev/sdb
mkfs.ocfs2 1.8.0
Cluster stack: o2cb
Cluster name: ocfs2
Stack Flags: 0x1
NOTE: Feature extended slot map may be enabled
Overwriting existing ocfs2 partition.
WARNING: Cluster check disabled.
Proceed (y/N): y
Label: ocfs2vol1
Features: sparse extended-slotmap backup-super unwritten inline-data strict-journal-super xattr indexed-dirs refcount discontig-bg
Block size: 4096 (12 bits)
Cluster size: 4096 (12 bits)
Volume size: 10725765120 (2618595 clusters) (2618595 blocks)
Cluster groups: 82 (tail covers 5859 clusters, rest cover 32256 clusters)
Extent allocator size: 4194304 (1 groups)
Journal size: 4194304
Node slots: 4
Creating bitmaps: done
Initializing superblock: done
Writing system files: done
Writing superblock: done
Writing backup superblock: 2 block(s)
Formatting Journals: done
Growing extent allocator: done
Formatting slot map: done
Formatting quota files: done
Writing lost+found: done
mkfs.ocfs2 successful
Now, we just have to configure the o2cb stack and we're done.
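If /etc/ocfs2/cluster.conf does not exist yet, the cluster itself may first need to be defined before nodes can be added; with the o2cb tool in ocfs2-tools 1.8 that would be:

# o2cb add-cluster ocfs2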
Then add each node (node name, node number and private IP) to the cluster configuration:

# o2cb add-node --ip 192.168.199.1 --number 0 ocfs2 host1
# o2cb add-node --ip 192.168.199.2 --number 1 ocfs2 host2
Run the following command and take the UUID value of the filesystem/device you want to use for heartbeat:

# mounted.ocfs2 -d
Device     Stack  Cluster  F  UUID                              Label
/dev/sdb   o2cb   ocfs2    G  244A6AAAE77F4053803734530FC4E0B7  ocfs2vol1

Then register that device as the heartbeat device:

# o2cb add-heartbeat ocfs2 244A6AAAE77F4053803734530FC4E0B7
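Depending on your ocfs2-tools version you may also want to make sure the cluster's heartbeat mode is actually set to global; the o2cb tool has a heartbeat-mode subcommand for that:

# o2cb heartbeat-mode ocfs2 global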
That's it. If you want to enable this at boot time, you can configure o2cb to start automatically by running /etc/init.d/o2cb configure. This allows you to set different heartbeat timeout values and also whether or not to start the cluster stack at boot time.
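The configure step is interactive; with ocfs2-tools 1.8 the prompts look roughly like this (defaults in brackets, and the exact wording and defaults may differ between versions):

# /etc/init.d/o2cb configure
Load O2CB driver on boot (y/n) [n]: y
Cluster stack backing O2CB [o2cb]:
Cluster to start on boot (Enter "none" to clear) [ocfs2]:
Specify heartbeat dead threshold (>=7) [31]:
Specify network idle timeout in ms (>=5000) [30000]:
Specify network keepalive delay in ms (>=1000) [2000]:
Specify network reconnect delay in ms (>=2000) [2000]: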
Now that the first node is configured, all you have to do is copy the file /etc/ocfs2/cluster.conf to all the other nodes in your cluster. You do not have to edit it on the other nodes; you just need an exact copy everywhere. You also do not need to redo the above commands, except: 1) make sure ocfs2-tools is installed everywhere, and 2) if you want to start at boot time, re-run /etc/init.d/o2cb configure on the other nodes as well.
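For reference, the resulting /etc/ocfs2/cluster.conf for this two-node setup should look roughly like the following (the layout can vary a bit between tool versions, and the region value is the heartbeat UUID shown above):

cluster:
        heartbeat_mode = global
        node_count = 2
        name = ocfs2

node:
        number = 0
        cluster = ocfs2
        ip_port = 7777
        ip_address = 192.168.199.1
        name = host1

node:
        number = 1
        cluster = ocfs2
        ip_port = 7777
        ip_address = 192.168.199.2
        name = host2

heartbeat:
        cluster = ocfs2
        region = 244A6AAAE77F4053803734530FC4E0B7

Copying it over can be as simple as scp /etc/ocfs2/cluster.conf host2:/etc/ocfs2/.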
From here on, you can just mount your filesystems: run mount /dev/sdb /mountpoint1 on each node.
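Since the volume was labeled ocfs2vol1, mounting by label works as well, and for a permanent mount an /etc/fstab entry along these lines is common (the _netdev option makes the mount wait until networking is up; the mountpoint is just the example used above):

# mount -L ocfs2vol1 /mountpoint1

and in /etc/fstab:

LABEL=ocfs2vol1   /mountpoint1   ocfs2   _netdev,defaults   0 0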
If you create more OCFS2 volumes you can just keep mounting them all, and with global heartbeat you will just have one (or a few) heartbeats going on.
have fun...
Here is vmstat output: the first run shows a single global heartbeat with 8 mounted filesystems, the second run shows 8 mounted filesystems each with their own local heartbeat. Even though the I/O amount is low, it shows that about 8x more I/Os are happening (from 1 every other second to 4 every second). As these are small I/Os, they constantly move the disk head to a specific place and interrupt performance when every device has its own heartbeat. Hopefully this shows the benefits of global heartbeat.
# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 789752  26220  97620    0    0     1     0   41   34  0  0 100  0  0
 0  0      0 789752  26220  97620    0    0     0     0   46   22  0  0 100  0  0
 0  0      0 789752  26220  97620    0    0     1     1   38   29  0  0 100  0  0
 0  0      0 789752  26228  97620    0    0     0    52   52   41  0  0 100  1  0
 0  0      0 789752  26228  97620    0    0     1     0   28   26  0  0 100  0  0
 0  0      0 789760  26228  97620    0    0     0     0   30   30  0  0 100  0  0
 0  0      0 789760  26228  97620    0    0     1     1   26   20  0  0 100  0  0
 0  0      0 789760  26228  97620    0    0     0     0   54   37  0  1 100  0  0
 0  0      0 789760  26228  97620    0    0     1     0   29   28  0  0 100  0  0
 0  0      0 789760  26236  97612    0    0     0    16   43   48  0  0 100  0  0
 0  0      0 789760  26236  97620    0    0     1     1   48   28  0  0 100  0  0
 0  0      0 789760  26236  97620    0    0     0     0   42   30  0  0 100  0  0
 0  0      0 789760  26236  97620    0    0     1     0   26   30  0  0 100  0  0
 0  0      0 789760  26236  97620    0    0     0     0   35   24  0  0 100  0  0
 0  1      0 789760  26240  97616    0    0     1    21   29   27  0  0 100  0  0
 0  0      0 789760  26244  97620    0    0     0     4   51   44  0  0 100  0  0
 0  0      0 789760  26244  97620    0    0     1     0   31   24  0  0 100  0  0
 0  0      0 789760  26244  97620    0    0     0     0   25   28  0  0 100  0  0
 0  0      0 789760  26244  97620    0    0     1     1   30   20  0  0 100  0  0
 0  0      0 789760  26244  97620    0    0     0     0   41   30  0  0 100  0  0
 0  0      0 789760  26252  97616    0    0     1    16   56   44  0  0 100  0  0

# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 784364  28732  98620    0    0     4    46   54   64  0  0 100  0  0
 0  0      0 784364  28732  98628    0    0     4     2   60   48  0  0 100  0  0
 0  0      0 784364  28732  98628    0    0     4     2   51   53  0  0 100  0  0
 0  0      0 784364  28732  98628    0    0     4     2   58   50  0  0 100  0  0
 0  0      0 784364  28732  98628    0    0     4     2   56   44  0  0 100  0  0
 0  0      0 784364  28732  98628    0    0     4     2   46   47  0  0 100  0  0
 0  0      0 784364  28732  98628    0    0     4     2   65   54  0  0 100  0  0
 0  0      0 784388  28740  98620    0    0     4    14   65   55  0  0 100  0  0
 0  0      0 784388  28740  98628    0    0     4     2   46   48  0  0 100  0  0
 0  0      0 784388  28740  98628    0    0     4     2   52   42  0  0 100  0  0
 0  0      0 784388  28740  98628    0    0     4     2   51   58  0  0 100  0  0
 0  0      0 784388  28740  98628    0    0     4     2   36   43  0  0 100  0  0
 0  0      0 784396  28740  98628    0    0     4     2   39   47  0  0 100  0  0
 0  0      0 784396  28740  98628    0    0     4     2   52   54  0  0 100  0  0
 0  0      0 784396  28740  98628    0    0     4     2   42   48  0  0 100  0  0
 0  0      0 784404  28748  98620    0    0     4    14   52   63  0  0 100  0  0
 0  0      0 784404  28748  98628    0    0     4     2   32   42  0  0 100  0  0
 0  0      0 784404  28748  98628    0    0     4     2   50   40  0  0 100  0  0
 0  0      0 784404  28748  98628    0    0     4     2   58   56  0  0 100  0  0
 0  0      0 784412  28748  98628    0    0     4     2   39   46  0  0 100  0  0
 0  0      0 784412  28748  98628    0    0     4     2   45   50  0  0 100  0  0
 0  0      0 784412  28748  98628    0    0     4     2   43   42  0  0 100  0  0
 0  0      0 784288  28748  98628    0    0     4     6   48   52  0  0 100  0  0