Glusterfs - Part A

Last updated: 2014-08-01



Glusterfs is a network cluster file system, supporting sizes up to 72 brontobytes:

1 Brontobyte = 1024 Yottabytes
1 Yottabyte = 1024 Zettabytes
1 Zettabyte = 1024 Exabytes
1 Exabyte =  1024 Petabytes
1 Petabyte = 1024 Terabytes
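A quick sanity check of the arithmetic above (1 TB = 1024^4 bytes, each step multiplies by 1024):

```python
# Bytes per unit, scaling by 1024 at each step (TB -> PB -> EB -> ZB -> YB -> BB).
units = ["TB", "PB", "EB", "ZB", "YB", "BB"]
size = {"TB": 1024 ** 4}  # 1 TB in bytes
for prev, cur in zip(units, units[1:]):
    size[cur] = size[prev] * 1024

# 1 Brontobyte is therefore 1024^9 bytes.
print(size["BB"] == 1024 ** 9)  # True
print(72 * size["BB"])          # 72 brontobytes, in bytes
```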

Its most distinctive feature is that the unit shared by a node is the Brick (a directory); the bricks on a node can sit on ext3 or ext4.

In addition, it is implemented with FUSE, so no kernel modification is required.

When an unfortunate split-brain situation occurs, the problem is resolved at the file level.

The version in squeeze is rather old, so I chose version 3.2.4-1 from backports for testing.


  • File-based mirroring and replication
  • File-based striping
  • File-based load balancing
  • Volume failover
  • Scheduling
  • Disk caching
  • Storage quotas




Version 3.3

Ability to change replica count on an active volume

Granular locking – Allows large files to be accessed even during self-healing, a feature that is particularly important for VM images. ( Glusterfs will not have to lock & freeze the whole VM disk image, only the parts that need healing. )

Replication improvements – With quorum enforcement you can be confident that your data has been written in at least the configured number of places before the file operation returns, allowing a user-configurable adjustment to fault tolerance vs performance.




apt-get -t squeeze-backports install glusterfs-server
apt-get -t squeeze-backports install glusterfs-client



Check the version:

glusterfs --version

Due to the dependency chain, the system automatically installs glusterfs-common and libfuse2.

Since glusterfs relies on FUSE to implement the FS, libfuse2 is of course indispensable.

After installation, the available tools are gluster and glusterfs:

  • glusterfs (mount.glusterfs)       <-- mount helper (client package)
  • gluster                           <-- console (server package)
  • glusterfsd                        <-- the daemon behind everything glusterfs does (glusterfs-common)



  • DHT translator (Distribute translator)
  • AFR translator (Replicate)
  • striping translator (block-size default: 128k)
  • HA translator (cluster/ha)
# The subvolumes can be the same server over two different interfaces (IB and TCP).

volume ha
  type cluster/ha
  subvolumes interface1 interface2


Server Side:




Start glusterfsd (needed on both client and server !!)

/etc/init.d/glusterfs-server [start / stop]


Console commands:

gluster [commands] [options]


  • Volume-related (followed by create, start, stop, info, set, perf, rename)   <-- for help: gluster volume help
  • Quota ( volume quota VOLNAME < enable | disable | limit-usage /dir 10GB | list | remove /dir > )
  • Rebalance ( volume rebalance <volume> <start | stop | status> )
  • Performance ( volume top < open | read | write | opendir | readdir | read-perf | write-perf > [list-cnt cnt] )
                ( volume profile VOLNAME < start | stop | info > )
  • Brick-related ( volume add-brick|replace-brick|remove-brick )
  • Log ( volume log <filename|rotate,locate> )
  • Peer ( probe, detach, status )
  • Geo-replication ( volume geo-replication Master Slave start|stop|config[opt] )


Getting Started 1 - Forming a Cluster


Cluster = storage pool


# Create

gluster peer probe <ServerName>

# Detach

gluster peer detach <ServerName>


Example: form a cluster with another machine, debianB

debianA:# gluster peer probe debianb



Getting Started 2 - Creating a Volume


The command to create a volume is as follows:

gluster volume create VOLNAME [stripe COUNT | replica COUNT] BRICK...


Volumes come in 3 types:

  • distributed volume             <-- capacity first (default)
  • replicated volume              <-- safety first
  • striped volume                 <-- R/W speed first


  • gluster volume create VOLNAME BRICK1 BRICK2...
  • gluster volume create VOLNAME replica COUNT NEW-BRICK...
  • gluster volume create VOLNAME stripe COUNT  NEW-BRICK...

The format of a BRICK is Server:/Dir


Example: create a network RAID1 FS (the raidfs volume shown in the status examples below)

gluster volume create raidfs replica 2 debiana:/home/gfs debianb:/home/gfs


# Delete a volume
gluster volume delete <VOLNAME>


About the elements

    Brick: the storage filesystem that has been assigned to a volume.

    Subvolume: a brick after being processed by at least one translator.

    Volume: the final share after it passes through all the translators.

    Translator: connects to one or more subvolumes
    * Distribute ( hashes the filename to decide which server each file is written to )
                 ( If the filename changes, a pointer file is written
                 to the server that the new hash code would point to )
    * Replicate
    * Stripe     ( stripes data across bricks in the volume )
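The Distribute behaviour can be illustrated with a toy sketch. This is only an approximation (hash modulo brick count); Gluster's real elastic hashing assigns hash ranges to bricks, and pick_brick here is a made-up helper name:

```python
import hashlib

def pick_brick(filename, bricks):
    # Hash the filename (md5 here, purely for determinism in this sketch)
    # and map the hash onto one of the bricks.
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return bricks[int(digest, 16) % len(bricks)]

bricks = ["debiana:/home/gfs", "debianb:/home/gfs"]
for name in ("a.txt", "b.txt", "c.txt"):
    print(name, "->", pick_brick(name, bricks))

# Renaming a file changes its hash, which is why Gluster leaves a
# pointer file on the brick the new name hashes to.
```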


Getting Started 3 - Starting a Volume


# Start

gluster volume start VOLNAME

Clients can only mount a volume after it has been started !!

# Stop

gluster volume stop VOLNAME

A pid file will appear in the run/ directory.


Getting Started 4 - Checking Status


Peer status:

debianA:# gluster peer status

Number of Peers: 1

Hostname: debianb                            <--- /etc/glusterd/peers/UUID
Uuid: ecad14d9-f4e5-4523-a6e5-28ad13dd92bc   <--- the remote host's, from its /etc/glusterd/
State: Peer in Cluster (Connected)


Contents of /etc/glusterd/peers/UUID:



Volume status:

gluster volume info [VOLNAME]


Volume Name: test
Type: Distribute
Status: Created
Number of Bricks: 2
Transport-type: tcp
Brick1: debianb:/home/gfs
Brick2: debianb:/home/gfs


gluster volume info raidfs

Volume Name: raidfs
Type: Replicate
Status: Created
Number of Bricks: 2
Transport-type: tcp
Brick1: debiana:/home/gfs
Brick2: debianb:/home/gfs

The directory /etc/glusterd/vols/raidfs/ will be created


Getting Started 5 - Mounting a Volume on the Client



Client Side:

Generally speaking, there are two ways to connect to a volume: FUSE and NFS (limitations: v3, tcp)

  • Method 1: FUSE: the Gluster Native Client; supports POSIX ACLs (the backend bricks must support them too)
  • Method 2: NFS: served by glusterfsd (no ACL support)


Method 1: FUSE

mount -t glusterfs -o acl IP:/VOLNAME  MOUNTPOINT


glusterfs -f /etc/glusterfsd-server.vol <mountpoint>


Example: mount raidfs from host debiana

mount -t glusterfs debiana:raidfs /mnt/gfs



debiana:raidfs on /mnt/gfs type fuse.glusterfs (rw,allow_other,default_permissions,max_read=131072)

!!! Notes !!!

  • Replication only happens for volumes that are mounted
  • When a mount fails, the system does not report an error, especially when the server does not exist
  • Operating directly on the exported volume on the server is meaningless (only operations through a glusterfs mount point are reflected)



server:/volume                /mnt/glusterfs    glusterfs    defaults  0  0


/etc/glusterfs/glusterfs.vol  /mnt/glusterfs    glusterfs    defaults  0  0 



..........  -o log-level=WARNING,logfile=/var/log/gluster.log,ro,defaults .............



glusterfs [options] [mountpoint]


glusterfs --volfile-server=SERVER [MOUNT-POINT]
glusterfs --volfile=VOLFILE [MOUNT-POINT]


--volfile (-f)          specify the volfile to use

--volfile-server (-s)   the server to fetch the volfile from


Method 2: NFS


For example:

mount -o proto=tcp,vers=3    nfs://server1:38467/test-volume    /mnt/glusterfs




IP:/VOLNAME    MOUNT_POINT   nfs     defaults,_netdev,mountproto=tcp    0    0



Getting Started 6 - Triggering Self-Heal on Replicate


find /path_to_glusterfs_mount

Walking the entire mount point makes the client stat every file, which is what triggers self-heal on each one.




On a Glusterfs server, TCP and UDP ports 24007 and 24008 are used to talk to clients.

After that, each brick needs one more TCP port, starting from 49152 (older versions used ports 24009, 24010, ...).


$ sudo iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 24007:24008 -j ACCEPT
$ sudo iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 49152:49156 -j ACCEPT

* Ensure that TCP and UDP ports 24007 and 24008 are open on all Gluster servers.

* open one port for each brick starting from port 49152

P.S. The default transport is tcp



Where is the meta data stored?

The metadata is stored with the file data itself on its backend disk.


Can I directly access the data on the underlying storage volumes?

If you are only doing read()/access()/stat()-like operations, you should be fine. If you are not using any of the newer features (quota, geo-replication, etc.) then, technically, you can modify the data directly (but certainly not rename(2) or link(2)).


Loop mounting image files(Xen) stored in GlusterFS file system

glusterfs -f <your_spec_file>  --disable-direct-io-mode /<mount_path>



# reading many small files (PHP web serving)
NFS client > native client

# write-heavy load
native client > NFS client


Allow more than one IP in auth.addr (the value is a comma-separated list):

option auth.addr.<volumename>.allow 192.168.*



( a scheduler decides how to distribute new file-creation operations across the clustered file system, based on ...)


ALU (Adaptive Least Usage) Scheduler

* entry-threshold
* exit-threshold

volume bricks
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4 brick5
  # This option makes brick5 read-only
  option alu.read-only-subvolumes brick5
  # use the ALU scheduler
  option scheduler alu
  # Don't create files on a volume with less than 5% free disk space
  option alu.limits.min-free-disk  5%
  # Don't create files on a volume with more than 10000 files open
  option alu.limits.max-open-files 10000   
  # When deciding where to place a file, first look at the disk-usage, then at  
  # read-usage, write-usage, open files, and finally the disk-speed-usage.
  option alu.order disk-usage:read-usage:write-usage:open-files-usage:disk-speed-usage

  # Kick in if the discrepancy in disk-usage between volumes is more than 2GB
  option alu.disk-usage.entry-threshold 2GB   
  # Don't stop writing to the least-used volume until the discrepancy is 1988MB
  option alu.disk-usage.exit-threshold  60MB   

  # Kick in if the discrepancy in open files is 1024
  option alu.open-files-usage.entry-threshold 1024
  # Don't stop until 992 files have been written to the least-used volume
  option alu.open-files-usage.exit-threshold 32

# option alu.read-usage.entry-threshold 20%   # Kick in when the read-usage discrepancy is 20%
# option alu.read-usage.exit-threshold 4%     # Don't stop until the discrepancy has been reduced to 16% (20% - 4%)

# option alu.write-usage.entry-threshold 20%   # Kick in when the write-usage discrepancy is 20%
# option alu.write-usage.exit-threshold 4%     # Don't stop until the discrepancy has been reduced to 16%

  # Refresh the statistics used for decision-making every 10 seconds
  option alu.stat-refresh.interval 10sec

  # Refresh the statistics used for decision-making after creating 10 files
  # option alu.stat-refresh.num-file-create 10
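The alu.order line can be read as an ordered tie-break: compare bricks on the first metric, and fall back to the next metric on ties. A simplified illustration of that selection (assumed semantics, not the actual ALU code; pick_brick and the stats numbers are invented):

```python
def pick_brick(stats, order):
    # stats: {brick: {metric: usage}}; lower usage wins.
    # order: "disk-usage:read-usage:..." as in the volfile.
    metrics = order.split(":")
    # Tuples compare element by element, so later metrics only matter on ties.
    return min(stats, key=lambda brick: tuple(stats[brick][m] for m in metrics))

stats = {
    "brick1": {"disk-usage": 50, "read-usage": 10},
    "brick2": {"disk-usage": 40, "read-usage": 90},
    "brick3": {"disk-usage": 40, "read-usage": 20},
}
print(pick_brick(stats, "disk-usage:read-usage"))  # brick3 (ties on disk, least read)
```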



NUFA scheduler

# gives the local system more priority for file creation over other nodes.

volume posix1
  type storage/posix               # POSIX FS translator
  option directory /home/export    # Export this directory

volume bricks
  type cluster/unify
  subvolumes posix1 brick2 brick3 brick4
  option scheduler nufa
  option nufa.local-volume-name posix1
  option nufa.limits.min-free-disk 5%
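NUFA's local-first preference can be sketched like this (nufa_pick and the numbers are hypothetical; min-free-disk is treated as a fraction of free space):

```python
def nufa_pick(local, free_frac, min_free=0.05):
    # free_frac: {brick: fraction of disk free}. Prefer the local brick
    # while it still has more than min-free-disk free; otherwise fall
    # back to the remote brick with the most free space.
    if free_frac[local] > min_free:
        return local
    others = {b: f for b, f in free_frac.items() if b != local}
    return max(others, key=others.get)

free = {"posix1": 0.02, "brick2": 0.40, "brick3": 0.10, "brick4": 0.25}
print(nufa_pick("posix1", free))  # posix1 is nearly full, so brick2 wins
```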


Random Scheduler

# Randomly scatters file creation across storage bricks

volume bricks
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4
  option scheduler random
  option random.limits.min-free-disk 5%


Round-Robin Scheduler

# Round-Robin (RR) scheduler creates files in a round-robin fashion.

volume bricks
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4
  option scheduler rr
  option rr.read-only-subvolumes brick4     # No files will be created in 'brick4'
  option rr.limits.min-free-disk 5%         # Unit in %
  option rr.refresh-interval 10             # Re-check the server bricks every 10s
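Round-robin is just a rotating pointer over the writable bricks, e.g.:

```python
import itertools

bricks = ["brick1", "brick2", "brick3"]  # the read-only brick is excluded
rr = itertools.cycle(bricks)
for filename in ["a", "b", "c", "d"]:
    # Each new file goes to the next brick, wrapping around at the end.
    print(filename, "->", next(rr))
```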


Switch Scheduler

# schedules files according to the filename

volume bricks
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4 brick5 brick6 brick7
  option scheduler switch
  option switch.case *jpg:brick1,brick2;*mpg:brick3;*:brick4,brick5,brick6
  option switch.read-only-subvolumes brick7
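The pattern string above can be parsed and matched like this (a toy parser; switch_pick is an invented helper):

```python
import fnmatch

def switch_pick(filename, case_option):
    # case_option: "*jpg:brick1,brick2;*mpg:brick3;*:brick4,brick5,brick6"
    # Rules are tried in order; the first glob that matches wins.
    for rule in case_option.split(";"):
        pattern, targets = rule.split(":")
        if fnmatch.fnmatch(filename, pattern):
            return targets.split(",")
    return []

cases = "*jpg:brick1,brick2;*mpg:brick3;*:brick4,brick5,brick6"
print(switch_pick("photo.jpg", cases))  # ['brick1', 'brick2']
print(switch_pick("notes.txt", cases))  # ['brick4', 'brick5', 'brick6']
```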


Kernel Performance Tuning


