Last updated: 2014-08-01
Glusterfs - Part A
Introduction:
GlusterFS is a network cluster file system, supporting sizes of up to 72 brontobytes
1 Brontobyte = 1024 Yottabytes
1 Yottabyte = 1024 Zettabytes
1 Zettabyte = 1024 Exabytes
1 Exabyte = 1024 Petabytes
1 Petabyte = 1024 Terabytes
Its most distinctive feature is that the unit shared by each node is a brick (a directory); the bricks on a node can sit on ext3 or ext4.
In addition, it is implemented with FUSE, so no kernel changes are needed.
When an unlucky split-brain situation does occur, the problem is resolved at the file level.
The version in squeeze is rather old, so I chose version 3.2.4-1 from backports for testing.
Its features include:
- File-based mirroring and replication
- File-based striping
- File-based load balancing
- Volume failover
- Scheduling
- Disk caching
- Storage quotas
HomePage: http://www.gluster.org/
Version 3.3
Ability to change replica count on an active volume
Granular locking – Allows large files to be accessed even during self-healing, a feature that is particularly important for VM images. ( Glusterfs will not have to lock & freeze the whole VM disk image, only the parts that need healing. )
Replication improvements – With quorum enforcement you can be confident that your data has been written in at least the configured number of places before the file operation returns, allowing a user-configurable adjustment to fault tolerance vs performance.
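A hedged sketch of turning quorum enforcement on for a replicated volume in 3.3+ (the raidfs volume name comes from the examples further down; cluster.quorum-* are the stock option names):
# Only allow writes while a majority of the replica set is reachable
gluster volume set raidfs cluster.quorum-type auto
# Or demand a fixed number of live replicas instead:
# gluster volume set raidfs cluster.quorum-type fixed
# gluster volume set raidfs cluster.quorum-count 2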
Installation:
Debian6:
apt-get -t squeeze-backports install glusterfs-server
apt-get -t squeeze-backports install glusterfs-client
Centos6:
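(Left blank in the original notes. A likely equivalent, assuming the EPEL or upstream Gluster repository is already enabled:)
yum install glusterfs-server glusterfs-fuse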
Check the version:
glusterfs --version
Because of dependencies, the system automatically installs glusterfs-common and libfuse2 as well.
Since glusterfs relies on FUSE to implement the filesystem, libfuse2 is of course indispensable.
After installation, the available tools are gluster and glusterfs:
- glusterfs (mount.glusterfs) <-- mounts volumes (client package)
- gluster <-- console (server package)
- glusterfsd <-- the daemon that does all the real work (glusterfs-common)
Translator
- DHT translator (Distribute translator)
- AFR translator (Replicate)
- striping translator (block-size default: 128k)
- Translator cluster/ha
# It can be the same server over two different (IB and TCP) interfaces.
volume ha
  type cluster/ha
  subvolumes interface1 interface2
end-volume
Server Side:
The configuration files live in:
/etc/glusterd/
/etc/glusterfs/
Start glusterfsd (required on both client and server !!)
/etc/init.d/glusterfs-server [start / stop]
Console commands:
gluster [commands] [options]
The commands fall into eight broad categories (a few worked examples follow this list):
- Volume ( followed by create, start, stop, info, set, perf, rename ) <-- for help: gluster volume help
- Quota ( volume quota VOLNAME < enable | disable | limit-usage /dir 10GB | list | remove /dir > )
- Rebalance ( volume rebalance <volume> <start | stop | status> )
- Performance ( volume top < open | read | write | opendir | readdir | read-perf | write-perf > [list-cnt cnt] )
  ( volume profile VOLNAME < start | stop | info > )
- Brick ( volume add-brick | replace-brick | remove-brick )
- Log ( volume log <filename | rotate | locate> )
- Peer ( probe, detach, status )
- Geo-replication ( volume geo-replication Master Slave start|stop|config [opt] )
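A few worked examples across these categories, using the raidfs and test volumes created further down (the /backup quota path is hypothetical):
gluster volume quota raidfs enable
gluster volume quota raidfs limit-usage /backup 10GB
gluster volume rebalance test start
gluster volume rebalance test status
gluster volume profile raidfs start
gluster volume profile raidfs info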
Getting started 1 - Forming a cluster
Cluster = storage pool
# Create
gluster peer probe <ServerName>
# Detach
gluster peer detach <ServerName>
Example: form a cluster with another host, debianB
debianA:# gluster peer probe debianb
Getting started 2 - Creating a volume
The command to create a volume is as follows:
gluster volume create VOLNAME [stripe COUNT | replica COUNT] \
NEW-BRICK1 NEW-BRICK2 NEW-BRICK3...
Volumes come in three types:
- Distributed volume <-- maximizes capacity (default)
- Replicated volume <-- maximizes safety
- Striped volume <-- maximizes R/W speed
The corresponding create commands:
- gluster volume create VOLNAME BRICK1 BRICK2...
- gluster volume create VOLNAME replica COUNT NEW-BRICK...
- gluster volume create VOLNAME stripe COUNT NEW-BRICK...
A BRICK is written as Server:/Dir
Example: building a network RAID1 filesystem
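No command is shown here in the original; matching the raidfs volume that appears in the status output below (replica 2 over debiana and debianb), it would be something like:
gluster volume create raidfs replica 2 debiana:/home/gfs debianb:/home/gfs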
# Delete a volume
gluster volume delete <VOLNAME>
Related elements
brick
The brick is the storage filesystem that has been assigned to a volume.
subvolume
A brick after it has been processed by at least one translator.
volume
The final share after it passes through all the translators.
Translator
A translator connects to one or more subvolumes.
* Distribute ( writes different files to different servers based on a hash of the filename )
( If the filename changes, a pointer file is written
to the server that the new hash code would point to)
* Replicate
* Stripe (Stripes data across bricks in the volume)
Getting started 3 - Starting a volume
# Start
gluster volume start VOLNAME
Only after a volume has been started can clients mount it !!
# Stop
gluster volume stop VOLNAME
A pid file will appear in the run/ directory.
Getting started 4 - Checking status
Peer status:
debianA:# gluster peer status
Number of Peers: 1

Hostname: debianb                              <--- /etc/glusterd/peers/UUID
Uuid: ecad14d9-f4e5-4523-a6e5-28ad13dd92bc     <--- the peer's /etc/glusterd/glusterd.info
State: Peer in Cluster (Connected)

Contents of /etc/glusterd/peers/UUID:
uuid=ecad14d9-f4e5-4523-a6e5-28ad13dd92bc
state=3
hostname1=debianb
Volume status:
gluster volume info [VOLNAME]
Volume Name: test
Type: Distribute
Status: Created
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: debianb:/home/gfs
Brick2: debianb:/home/gfs
gluster volume info raidfs
Volume Name: raidfs
Type: Replicate
Status: Created
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: debiana:/home/gfs
Brick2: debianb:/home/gfs
The directory /etc/glusterd/vols/raidfs/ will be created.
Getting started 5 - Mounting a volume on the client
Client Side:
In general there are two ways to connect to a volume: FUSE and NFS (NFS is limited to v3 over TCP).
- Method 1: FUSE: the Gluster native client; supports POSIX ACLs (only if the backend bricks support them too)
- Method 2: NFS: served by glusterfsd (no ACL support)
Method 1: FUSE
mount -t glusterfs -o acl IP:/VOLNAME MOUNTPOINT
OR
glusterfs -f /etc/glusterfsd-server.vol <mountpoint>
Example: mount raidfs from host debiana
mount -t glusterfs debiana:raidfs /mnt/gfs
Check:
mount
debiana:raidfs on /mnt/gfs type fuse.glusterfs (rw,allow_other,default_permissions,max_read=131072)
!!! Notes !!!
- Only mounted volumes get replication.
- When a mount fails, the system does not report an error, especially when the server does not exist.
- Operating directly on the shared directories on the server is meaningless (only operations done through a glusterfs mount point are reflected).
Mount automatically at boot:
/etc/fstab
server:/volume /mnt/glusterfs glusterfs defaults 0 0
OR
/etc/glusterfs/glusterfs.vol /mnt/glusterfs glusterfs defaults 0 0
Other available options:
.......... -o log-level=WARNING,logfile=/var/log/gluster.log,ro,defaults .............
Additional notes:
glusterfs [options] [mountpoint]
Usage:
glusterfs --volfile-server=SERVER [MOUNT-POINT]
glusterfs --volfile=VOLFILE [MOUNT-POINT]
/etc/glusterfs/glusterfs.vol
--volfile (-f)          override it with another volfile
--volfile-server (-s)   the location of the server
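A hedged example of mounting directly with these flags (using the raidfs volume and /mnt/gfs mount point from above; --volfile-id is the standard way to name the volume):
glusterfs --volfile-server=debiana --volfile-id=raidfs /mnt/gfs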
Method 2: NFS
mount -t nfs IPADDRESS:/VOLNAME MOUNTDIR
For example:
mount -o proto=tcp,vers=3 nfs://server1:38467/test-volume /mnt/glusterfs
Mount automatically at boot:
/etc/fstab
IP:/VOLNAME MOUNT_POINT nfs defaults,_netdev,mountproto=tcp 0 0
Getting started 6 - Triggering self-heal on Replicate
find /path_to_glusterfs_mount
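A fuller form from the self-heal documentation stats every file so that each one is opened and healed; the mount point matches the example above, and the log path is only an illustration:
find /mnt/gfs -noleaf -print0 | xargs --null stat > /dev/null 2> /var/log/gfs-selfheal.log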
Network
On the GlusterFS server, TCP and UDP ports 24007 and 24008 are used to talk to clients.
After that, each brick needs one more TCP port, starting from 49152 (older versions used 24009, 24010, ...).
Server:
$ sudo iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 24007:24008 -j ACCEPT
$ sudo iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 49152:49156 -j ACCEPT
* Ensure that TCP and UDP ports 24007 and 24008 are open on all Gluster servers.
* open one port for each brick starting from port 49152
P.S. The default transport is TCP.
FAQ
Where is the meta data stored?
The metadata is stored with the file data itself in its backend disk
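The metadata lives in extended attributes next to the file on the brick; a sketch assuming the /home/gfs brick used above (the file name is hypothetical):
getfattr -d -m . -e hex /home/gfs/somefile
# typically shows trusted.gfid plus trusted.afr.* / trusted.glusterfs.dht attributes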
===============================================
Can I directly access the data on the underlying storage volumes?
If you are only doing read()/access()/stat()-like operations, you should be fine. If you are not using any new features (like quota, geo-replication, etc.), then technically you can modify the data inside (but definitely not rename(2) or link(2)).
===============================================
Loop-mounting image files (Xen) stored on a GlusterFS file system
glusterfs -f <your_spec_file> --disable-direct-io-mode /<mount_path>
===============================================
Performance
# reading many small files (PHP web serving)
NFS client > native client
# write-heavy load
native client > NFS client
===============================================
Allow more than one IP in auth.addr
option auth.addr.<volumename>.allow 127.0.0.1,192.168*
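With the volume-based CLI the same restriction can be applied per volume (volume name taken from the examples above; auth.allow is the stock option name):
gluster volume set raidfs auth.allow 127.0.0.1,192.168.*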
Scheduler
( decides how to distribute the new creation operations across the clustered file system based on ...)
ALU (Adaptive Least Usage) Scheduler
* entry-threshold
* exit-threshold
volume bricks
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4 brick5
  # This option makes brick5 read-only
  option alu.read-only-subvolumes brick5
  # Use the ALU scheduler
  option scheduler alu
  # Don't create files on a volume with less than 5% free disk space
  option alu.limits.min-free-disk 5%
  # Don't create files on a volume with more than 10000 files open
  option alu.limits.max-open-files 10000
  # When deciding where to place a file, first look at the disk-usage, then at
  # read-usage, write-usage, open files, and finally the disk-speed-usage.
  option alu.order disk-usage:read-usage:write-usage:open-files-usage:disk-speed-usage
  # Kick in if the discrepancy in disk-usage between volumes is more than 2GB
  option alu.disk-usage.entry-threshold 2GB
  # Don't stop writing to the least-used volume until the discrepancy is 1988MB
  option alu.disk-usage.exit-threshold 60MB
  # Kick in if the discrepancy in open files is 1024
  option alu.open-files-usage.entry-threshold 1024
  # Don't stop until 992 files have been written to the least-used volume
  option alu.open-files-usage.exit-threshold 32
  # option alu.read-usage.entry-threshold 20%    # Kick in when the read-usage discrepancy is 20%
  # option alu.read-usage.exit-threshold 4%      # Don't stop until the discrepancy has been reduced to 16% (20% - 4%)
  # option alu.write-usage.entry-threshold 20%   # Kick in when the write-usage discrepancy is 20%
  # option alu.write-usage.exit-threshold 4%     # Don't stop until the discrepancy has been reduced to 16%
  # Refresh the statistics used for decision-making every 10 seconds
  option alu.stat-refresh.interval 10sec
  # Refresh the statistics used for decision-making after creating 10 files
  # option alu.stat-refresh.num-file-create 10
end-volume
NUFA scheduler
# gives the local system more priority for file creation over other nodes.
volume posix1
  type storage/posix              # POSIX FS translator
  option directory /home/export   # Export this directory
end-volume

volume bricks
  type cluster/unify
  subvolumes posix1 brick2 brick3 brick4
  option scheduler nufa
  option nufa.local-volume-name posix1
  option nufa.limits.min-free-disk 5%
end-volume
Random Scheduler
# Randomly scatters file creation across storage bricks
volume bricks
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4
  option scheduler random
  option random.limits.min-free-disk 5%
end-volume
Round-Robin Scheduler
# Round-Robin (RR) scheduler creates files in a round-robin fashion.
volume bricks
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4
  option scheduler rr
  option rr.read-only-subvolumes brick4   # No files will be created in 'brick4'
  option rr.limits.min-free-disk 5%       # Unit in %
  option rr.refresh-interval 10           # Check server bricks every 10s
end-volume
Switch Scheduler
# Schedules files according to the filename
volume bricks
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4 brick5 brick6 brick7
  option scheduler switch
  option switch.case *jpg:brick1,brick2;*mpg:brick3;*:brick4,brick5,brick6
  option switch.read-only-subvolumes brick7
end-volume
Kernel Performance Tuning