Glusterfs - Part A

Last updated: 2014-08-01


Introduction:

GlusterFS is a network cluster file system, supporting sizes of up to 72 brontobytes

1 Brontobyte = 1024 Yottabytes
1 Yottabyte = 1024 Zettabytes
1 Zettabyte = 1024 Exabytes
1 Exabyte =  1024 Petabytes
1 Petabyte = 1024 Terabytes

Its most distinctive feature is that the unit of sharing on a node is the brick (a directory); the bricks on a node can sit on ext3 or ext4

Besides that, it is implemented with FUSE, so no changes to the kernel are needed

When the unfortunate split-brain situation occurs, the problem is resolved at the file level

The version in squeeze is rather old, so I chose version 3.2.4-1 from backports for testing

Its features include:

  • File-based mirroring and replication
  • File-based striping
  • File-based load balancing
  • Volume failover
  • Scheduling
  • Disk caching
  • Storage quotas

 

HomePage: http://www.gluster.org/

 

Version 3.3

Ability to change replica count on an active volume

Granular locking – Allows large files to be accessed even during self-healing, a feature that is particularly important for VM images. ( Glusterfs will not have to lock & freeze the whole VM disk image, only the parts that need healing. )

Replication improvements – With quorum enforcement you can be confident that your data has been written in at least the configured number of places before the file operation returns, allowing a user-configurable adjustment to fault tolerance vs performance.
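
On 3.3 quorum is exposed as a volume option; a minimal, hedged example (cluster.quorum-type is the documented option name, VOLNAME is a placeholder):

# Enable automatic client-side quorum enforcement on a replicated volume
gluster volume set VOLNAME cluster.quorum-type auto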

 


Installation:

Debian6:

apt-get -t squeeze-backports install glusterfs-server
apt-get -t squeeze-backports install glusterfs-client

Centos6:
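
A minimal sketch, assuming the EPEL or upstream Gluster yum repository is already enabled (package names per the RHEL/CentOS packaging):

yum install glusterfs-server      # server side
yum install glusterfs-fuse        # client side (FUSE mount support)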

 

Check the version:

glusterfs --version

Because of dependencies, the system automatically installs glusterfs-common and libfuse2 as well

Since glusterfs relies on FUSE to implement the filesystem, libfuse2 is naturally indispensable.

After installation, the tools available are gluster and glusterfs:

  • glusterfs (mount.glusterfs)       <-- mount helper (client package)
  • gluster                           <-- console (server package)
  • glusterfsd                        <-- the daemon responsible for everything glusterfs does (glusterfs-common)

 

Translator

  • DHT translator (Distribute translator)
  • AFR translator (Replicate)
  • striping translator (block-size default: 128k)
  • Translator cluster/ha
# It can be the same server reached over two different interfaces (IB and TCP).

volume ha
  type cluster/ha
  subvolumes interface1 interface2
end-volume

 


Server Side:

 

The configuration files are in the following locations:

/etc/glusterd/
/etc/glusterfs/

Start glusterfsd (needed on both client and server !!)

/etc/init.d/glusterfs-server [start / stop]

 

Console commands:

gluster [commands] [options]

The commands fall into eight broad categories (a few concrete examples follow the list):

  • Volume related (followed by create, start, stop, info, set, perf, rename)  <-- for help: gluster volume help
  • Quota ( volume quota VOLNAME < enable | disable | limit-usage /dir 10GB | list | remove /dir > )
  • Rebalance ( volume rebalance <volume> <start | stop | status> )
  • Performance ( volume top < open | read | write | opendir | readdir | read-perf | write-perf > [list-cnt cnt] )
                ( volume profile VOLNAME < start | stop | info > )
  • Brick related ( volume add-brick | replace-brick | remove-brick )
  • Log ( volume log < filename | rotate | locate > )
  • Peer ( probe, detach, status )
  • Geo-replication ( volume geo-replication Master Slave start | stop | config [opt] )
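
A few of the above in concrete form, using the syntax listed (VOLNAME and /dir are placeholders):

gluster volume quota VOLNAME enable
gluster volume quota VOLNAME limit-usage /dir 10GB
gluster volume rebalance VOLNAME start
gluster volume profile VOLNAME start
gluster peer status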

 


Getting Started 1 - Forming a Cluster

 

Cluster = storage pool

 

# Create

gluster peer probe <ServerName>

# Detach

gluster peer detach <ServerName>

 

Example: form a cluster with another host, debianB

debianA:# gluster peer probe debianb

 


 

Getting Started 2 - Creating a volume

 

The command to create a volume is as follows:

gluster volume create VOLNAME [stripe COUNT | replica COUNT] \
NEW-BRICK1 NEW-BRICK2 NEW-BRICK3...

 

Volumes come in three types:

  • distributed volume             <-- capacity first (default)
  • replicated volume              <-- safety first
  • striped volume                 <-- R/W speed first

The corresponding create commands:

  • gluster volume create VOLNAME BRICK1 BRICK2...
  • gluster volume create VOLNAME replica COUNT NEW-BRICK...
  • gluster volume create VOLNAME stripe COUNT  NEW-BRICK...

The format of a BRICK is Server:/Dir

 

Example: build a network RAID1 FS
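
The create command would be along these lines (inferred from the raidfs volume shown in the status section below):

gluster volume create raidfs replica 2 debiana:/home/gfs debianb:/home/gfs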

 

# Delete a volume
gluster volume delete <VOLNAME>

 

Related elements:

brick
    The brick is the storage filesystem that has been assigned to a volume.

subvolume
    a brick after being processed by at least one translator.
    
volume
    The final share after it passes through all the translators.

Translator
    A translator connects to one or more subvolumes
    
    * Distribute ( writes different files to different servers based on a hash of the filename; a minimal sketch follows this list )
                 ( If the filename changes, a pointer file is written
                 to the server that the new hash code would point to)
    * Replicate
    * Stripe     (Stripes data across bricks in the volume)
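
A minimal volfile sketch of the Distribute translator (dist is an illustrative name; brick1 and brick2 are assumed to be brick volumes, e.g. protocol/client or storage/posix, defined earlier in the same file):

volume dist
  type cluster/distribute
  subvolumes brick1 brick2
end-volume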
  


 

Getting Started 3 - Starting a volume

 

# Start

gluster volume start VOLNAME

Only after a volume has been started can clients mount it !!

# Stop

gluster volume stop VOLNAME

A pid file will appear in the run/ directory

 



Getting Started 4 - Checking status

 

Peer status:

debianA:# gluster peer status

Number of Peers: 1

Hostname: debianb                            <--- /etc/glusterd/peers/UUID
Uuid: ecad14d9-f4e5-4523-a6e5-28ad13dd92bc   <--- the peer's /etc/glusterd/glusterd.info
State: Peer in Cluster (Connected)

 

Contents of /etc/glusterd/peers/UUID:

uuid=ecad14d9-f4e5-4523-a6e5-28ad13dd92bc
state=3
hostname1=debianb

 

Volume status:

gluster volume info [VOLNAME]

 

Volume Name: test
Type: Distribute
Status: Created
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: debiana:/home/gfs
Brick2: debianb:/home/gfs

 

gluster volume info raidfs

Volume Name: raidfs
Type: Replicate
Status: Created
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: debiana:/home/gfs
Brick2: debianb:/home/gfs

A directory /etc/glusterd/vols/raidfs/ will be created

 


Getting Started 5 - Mounting the volume on a client

 

 

Client Side:

In general there are two ways to connect to a volume: FUSE and NFS (limitations: v3 and tcp only)

  • Method 1: FUSE: the Gluster Native Client, supports POSIX ACLs (the backend bricks must support them too)
  • Method 2: NFS: served by glusterfsd (no ACL support)

 

Method 1: FUSE

mount -t glusterfs -o acl IP:/VOLNAME  MOUNTPOINT

                                     OR

glusterfs -f /etc/glusterfsd-server.vol <mountpoint>

 

Example: mount raidfs from host debiana

mount -t glusterfs debiana:raidfs /mnt/gfs

Check:

mount

debiana:raidfs on /mnt/gfs type fuse.glusterfs (rw,allow_other,default_permissions,max_read=131072)

!!! Note !!!

  • Only a mounted volume gets replication
  • When a mount fails, the system does not report an error, especially when the server does not exist
  • Writing directly to the shared volume's bricks on the server is pointless (only operations through a glusterfs mount point are reflected)

Mount automatically at boot:

/etc/fstab

server:/volume                /mnt/glusterfs    glusterfs    defaults  0  0

                                     OR

/etc/glusterfs/glusterfs.vol  /mnt/glusterfs    glusterfs    defaults  0  0 

 

Other available options:

..........  -o log-level=WARNING,logfile=/var/log/gluster.log,ro,defaults .............

 

Additional notes:

glusterfs [options] [mountpoint]

Usage:

glusterfs --volfile-server=SERVER [MOUNT-POINT]
glusterfs --volfile=VOLFILE [MOUNT-POINT]

/etc/glusterfs/glusterfs.vol    <-- the default volfile

--volfile (-f)          use a different volfile

--volfile-server (-s)   the server to fetch the volfile from

 

Method 2: NFS

mount -t nfs    IPADDRESS:/VOLNAME    MOUNTDIR

For example:

mount -o proto=tcp,vers=3    nfs://server1:38467/test-volume    /mnt/glusterfs

 

Mount automatically at boot:

/etc/fstab

IP:/VOLNAME    MOUNT_POINT   nfs     defaults,_netdev,mountproto=tcp    0    0

 

 


Getting Started 6 - Triggering Self-Heal on Replicate

 

find /path_to_glusterfs_mount
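
A hedged sketch of the fuller form commonly quoted for this: stat'ing every file through the mount forces the replicate translator to check (and heal) each one:

find /path_to_glusterfs_mount -noleaf -print0 | xargs --null stat > /dev/null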

 


Network

 

On the GlusterFS server, TCP and UDP ports 24007 and 24008 are used to talk to clients

Beyond that, every brick needs one more TCP port, starting from 49152 (older versions used ports 24009, 24010, ...)

Server:

$ sudo iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 24007:24008 -j ACCEPT
$ sudo iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 49152:49156 -j ACCEPT

* Ensure that TCP and UDP ports 24007 and 24008 are open on all Gluster servers.

* Open one port for each brick, starting from port 49152

P.S. The default transport is tcp
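
To pick the transport explicitly at create time, the create syntax accepts a transport keyword (tcp shown here; rdma is the other documented value):

gluster volume create VOLNAME transport tcp NEW-BRICK1 NEW-BRICK2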


FAQ

 

Where is the meta data stored?

The metadata is stored with the file data itself in its backend disk

===============================================

Can I directly access the data on the underlying storage volumes?

If you are only doing read()/access()/stat()-like operations, you should be fine. If you are not using any of the newer features (quota, geo-replication, etc.), then technically you can modify the data underneath (but certainly not rename(2) or link(2) it).

===============================================

Loop-mounting image files (Xen) stored in a GlusterFS file system

glusterfs -f <your_spec_file>  --disable-direct-io-mode /<mount_path>

===============================================

performance

# reading many small files (PHP web serving)
NFS client > native client

# write-heavy load
native client > NFS client

===============================================

allow more than one IP in auth.addr

option auth.addr.<volumename>.allow 127.0.0.1,192.168*
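
On CLI-managed volumes the same restriction is normally set per volume; a hedged equivalent using the auth.allow option:

gluster volume set VOLNAME auth.allow 127.0.0.1,192.168.*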

 


Scheduler

( decides how to distribute new file-creation operations across the clustered file system, based on ... )

 

ALU (Adaptive Least Usage) Scheduler

* entry-threshold
* exit-threshold

volume bricks
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4 brick5
 
  # This option makes brick5 read-only
  option alu.read-only-subvolumes brick5
 
  # use the ALU scheduler
  option scheduler alu
 
  # Don't create files on a volume with less than 5% free disk space
  option alu.limits.min-free-disk  5%      
 
  # Don't create files on a volume with more than 10000 files open
  option alu.limits.max-open-files 10000   
 
  # When deciding where to place a file, first look at the disk-usage, then at  
  # read-usage, write-usage, open files, and finally the disk-speed-usage.
  option alu.order disk-usage:read-usage:write-usage:open-files-usage:disk-speed-usage

  # Kick in if the discrepancy in disk-usage between volumes is more than 2GB
  option alu.disk-usage.entry-threshold 2GB   
  # Don't stop writing to the least-used volume until the discrepancy is back down to 1988MB (2GB - 60MB)
  option alu.disk-usage.exit-threshold  60MB   

  # Kick in if the discrepancy in open files is 1024
  option alu.open-files-usage.entry-threshold 1024
  # Don't stop until 992 files have been written to the least-used volume (1024 - 32)
  option alu.open-files-usage.exit-threshold 32   

# option alu.read-usage.entry-threshold 20%    # Kick in when the read-usage discrepancy is 20%
# option alu.read-usage.exit-threshold 4%      # Don't stop until the discrepancy has been reduced to 16% (20% - 4%)

# option alu.write-usage.entry-threshold 20%   # Kick in when the write-usage discrepancy is 20%
# option alu.write-usage.exit-threshold 4%     # Don't stop until the discrepancy has been reduced to 16%

  # Refresh the statistics used for decision-making every 10 seconds
  option alu.stat-refresh.interval 10sec

  # Refresh the statistics used for decision-making after creating 10 files
  # option alu.stat-refresh.num-file-create 10

end-volume

 

NUFA scheduler

# gives the local system more priority for file creation over other nodes.

volume posix1
  type storage/posix               # POSIX FS translator
  option directory /home/export    # Export this directory
end-volume

volume bricks
  type cluster/unify
  subvolumes posix1 brick2 brick3 brick4
  option scheduler nufa
  option nufa.local-volume-name posix1
  option nufa.limits.min-free-disk 5%
end-volume

 

Random Scheduler

# Randomly scatters file creation across storage bricks

volume bricks
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4
  option scheduler random
  option random.limits.min-free-disk 5%
end-volume

 

Round-Robin Scheduler

# Round-Robin (RR) scheduler creates files in a round-robin fashion.

volume bricks
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4
  option scheduler rr
  option rr.read-only-subvolumes brick4  # No files will be created in 'brick4'
  option rr.limits.min-free-disk 5%          # Unit in %
  option rr.refresh-interval 10               # Check server brick after 10s.
end-volume

 

Switch Scheduler

# schedules files according to the filename

volume bricks
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4 brick5 brick6 brick7
  option scheduler switch
  option switch.case *jpg:brick1,brick2;*mpg:brick3;*:brick4,brick5,brick6
  option switch.read-only-subvolumes brick7
end-volume

 


Kernel Performance Tuning

link

 
