최근에, ZFS기반의 Lustre 환경에 대해서, 성능 및 HA관련 강화되는 부문에 대한 기사가 있어서 원문 그대로 붙여 놓고, 중간중간에 한글로 관련 추가 설명을 곁들였습니다.
Bolstering Lustre on ZFS: Highlights of Continuing Work
January 11, 2017 Ken Strandberg
The Zettabyte File System (ZFS), as a back-end file system to Lustre, has had support in Lustre for a long time. But in the last few years it has gained greater importance, likely due to Lustre’s push into enterprise and the increasing demands by both enterprise and non-enterprise IT to add more reliability and flexibility features to Lustre. So, ZFS has had significant traction in recent Lustre deployments.
Lustre 에서 ZFS를 지원하기 시작한지 꽤 많은 시간이 흘렀죠, 최근들어서 Lustre를 Enterprise 환경으로 확장하기 위한 요구 사항에 따라, ZFS를 활용하여, Lustre에 그 기능을 확장하고 있습니다.
However, over the last 18 months, a few challenges have been the focus of several open source projects in the Lustre developer community to improve performance, align the enterprise-grade features in ZFS with Lustre, and enhance functionality that eases Lustre deployments on ZFS. These features are being added to open source Lustre (Intel Foundation Edition for Lustre* software) and the Intel Enterprise Edition for Lustre software (Intel EE for Lustre software).
ZFS 기반의 Lustre의 기능 확장으로, IEEL 및 IFEL 버젼에 추가 되고 있죠.
ZFS has capabilities of a back-end file system to Lustre that are sought after by IT architects. Enterprise IT is especially interested for the file system’s many features of reliability and availability, such as end-to-end checksum and on-block data corruption protection. As hard drives continue to increase in size, data protection becomes more critical to data center architects building Lustre clusters. ZFS also provides effective snapshotting, compression, and scrubbing and silvering without taking the system offline, which keeps the data available to clients 24/7 while providing high reliability. For system integrators, ZFS, with its own RAID feature called RAIDZ, allows them to build JBOD-based storage solutions instead of more expensive hardware-based RAID arrays. With this rich pool of features that more organizations look for in a high-performance parallel file system, over the last year and a half the Lustre open source community has continued to work on enhancing ZFS for Lustre deployments. The most recent work includes:
- Improving RAIDZ performance : zfs 기반의 software raid를 사용시 sw raid 관련한 성능 확장
- Enabling snapshots directly from Lustre : lustre 에서 native로 제공하는 snapshot(ZFS기능 활용)
- Boosting Metadata performance : mdt on ZFS의 성능 개선
- Improving streaming I/O performance : ost on ZFS의 streaming io 성능 개선
- Enhancing high availability using Pacemaker and Corosync : zfs기반의 mdt/ost의 HA기능 확장
- Automatically file compress feature on Lustre : ZFS에서 제공하는 FS 레벨의 자동 파일 압축 기능 제공, 지난해부터, 북미 및 아시아쪽 의 몇몇 고객쪽에서, IEEL on ZFS환경을 사용하여, 대형 Backup Storage로 사용하는 사례도 지속적으로 추가 되고 있습니다.
80X Faster RAIDZ3 Reconstruction
Gvozden Nešković, working with the Gesellschaft fuer Schwerionenforschung (GSI) and Facility for Antiproton and Ion Research (FAIR), committed a patch to the ZFS master branch that boosts performance of RAIDZ parity generation and reconstruction for its various levels of data protection called RAIDZ1, RAIDZ2, and RAIDZ3. These levels cover the number of possible ZFS pool hard drive failures that can occur (one, two, or three drives) and still recover the data in the pool. The calculations for parity generation and reconstruction use Galois field equations and vary in complexity from simple carryless addition to complex, complex, finite field algebra constructs. “Modern multi-core CPUs do not have built-in instructions to perform these operations,” said Mr. Nešković, “so I had to revise and optimize the operations to take advantage of the vector instruction sets in today’s CPUs.” Running his new codes on benchmarks returned over 80X speedup on RAIDZ3 reconstruction of three hard drives using the Intel AVX2 instruction set on Intel processors, such as Intel Xeon processors E5 family.
“Codes are based on the vector extensions of the binary Galois field,” stated Nešković. “This is finite field mathematics with 256 elements.” Erasure code is specified by three basic equations for the parity, called P, Q, and R. Regeneration has the most processing demand, depending on the failure mode of RAIDZ level and number of involved hard drives. RAIDZ3 with three drive failures requires solving all three equations in seven different ways to regenerate all data blocks.
Mr. Nešković’s work covered both optimizing all the equations and implementing new constructs. The table shows speedup results compared to the original, unoptimized code running his optimized versions of the original code on a scalar processor and running vectorized operations on SSE and Intel AVX2-enabled processors. “I built new optimized code on top of an optimized RAIDZ framework.. My work focused on x86 processors, but the framework can be extended to other architectures. Others in the community are already working on extending it,” he commented.
“What this means to GSI is that we can dedicate fewer cores to parity and reconstruction processing, and that saves costs building and running of Lustre clusters with a ZFS back end file system using RAIDZ for high availability and reliability,” added Nešković. His presentation on his work at the Luster User Group (LUG) can be found here; the video of it is here.
Creating Lustre Snapshots
Fan Yong of Intel’s High Performance Data Division (Intel HPDD) is adding ZFS snapshot capability using command lines to Lustre with ZFS back end file systems. Lustre snapshots create a mountable view of the entire Lustre filesystem captured at a moment in time. Snapshots are useful for a number of reasons, from recovering files to creating backups to enabling users to see differences across the data over time, and more.
File System 전체를 snapshot 적용이 가능하게 되어, 시간에 따른 변경된 파일의 추적이 가능할것입니다.
The process of creating the snapshot is very fast, because it takes advantage of a copy-on-write file system, rather than copying the entire file system’s data. The snapshot is given its own unique fsname, which can be mounted as a read-only file system. “Creating the snapshot is fast,” stated Fan Yong. “Setting up the system to do that is time consuming, because we have to create a global write barrier across the pool.” The barrier freezes the file system by preventing transactions on the Metadata Targets (MDTs), inhibiting any changes to Metadata during the snapshot. “In theory, this sounds simple,” stated Yong. “Efficiently implementing it is not.”
Snapshot를 사용하게 되면 copy-on-write방식이기 때문에, 아주 빠르게 생성할 수 있습니다. 그러나 snapshot을 생성하는 순간에는, IO가 잠깐 freeze 되는데, 아래 그래프를 보면, idle ~ busy일 경우 대략 3~4초 정도의 차이를 보이고 있스습니다. 최대 대략 20초 정도의 freeze time 이 필요한것으로 보이고 있습니다.
Yong’s research and benchmarks show that while the write barrier can take several seconds to complete, it is highly scalable across their test system with little change as the number of MDTs are added. And the impact on MDT IO performance is minimal, making the wait for the snapshot very worth the value of having a complete picture of the file system at a given moment in time.
The snapshot capability is being added as command line routines to Lustre that in turn execute commands on both the Lustre and ZFS environments.
Yong’s work is multi-phasic and ongoing. Phase 1 with snapshotting capability is expected to land in the community release of Lustre 2.10 and in the Intel Enterprise Edition for Lustre software version 3.0. Future work will integrate additional capabilities, such as auto-mounting the snapshot, and other features. Yong’s presentation at LUG 2016 can be found here, and the vide of it is here.
Making Metadata Performance Scalable and Fast
Alexey Zhuravlev of Intel HPDD has contributed valuable improvements to Metadata performance on ZFS. Lustre’s ldiskfs has been an order of magnitude faster than ZFS, much of it due to the Metadata server performance, which on ZFS is both much slower than ldiskfs and not scalable. Thus, to enable ZFS deployments, some integrators mix back end file systems with Metadata servers based on ldiskfs and object servers based on ZFS. Mr. Zhuravlev’s work so far has shown Metadata performance increases as much as 2x on ZFS 0.6.5.7, which is good news for ZFS installations.
통상적으로는 ZFS보다는 ldiskfs가 더 좋은 MDT 성능을 보여주고 있습니다. 그렇기 때문에, 몇몇 회사에서는 MDT는 ldiskfs 로, OST는 ZFS로 사용하는형태로 구현하는 케이스도 있다고 합니다. 하지만 최근 ZFS 버젼이 향상되며, ZFS기반의 MST성능도 기존 대비 2배 이상 증가 하였고, ZFS기반의 성능또한 지속적으로 개선될것이기 때문에, 굳지 섞어서 구현을 할 필요가 없어지지 않을까 합니다.
“There are several areas where bottlenecks occur,” stated Zhuravlev. “Some are due to Lustre code and some due to ZFS.” Declaration time is a significant one, where ldiskfs is a mere fraction of time compared to ZFS for all declaration operations. “The more specific the declaration, the more expensive it is on ZFS,” added Zhuravlev. And Lustre declarations are very specific, which, it turns out, are not required. Lustre patch LU-7898, which landed in Lustre 2.9, brings ZFS declaration times in line with ldiskfs.
ZFS dnodes also presented a performance challenge to Lustre. ZFS 0.6.5 dnodes are 512 bytes, which are not nearly large enough to accommodate Lustre’s extended attributes. ZFS’ two extra 4k blocks include a spill block for additional data, but accessing the spill block results in an extra disk seek. “It took more than eight gigabytes of raw writes to create one million files,” commented Zhuravlev. A patch for a variable dnode will be landed in the ZFS master 0.7 release that will allow dnodes to be sized from 512 bytes to a few kBs, eliminating the need to look at the spill block, which is expected to reduce the disk seeks by half. “With large dnode support in ZFS and Lustre, 1 million files now take only one gigabyte of raw writes,” added Zhuravlev. Additionally, ZFS didn’t support dnode accounting, and Lustre implements its own primitive support for file quotas, which is very expensive and doesn’t scale well. Patch LU-2435, which is nearly ready for landing, will add native dnode accounting to ZFS. “We estimate performance on dnode accounting will be both highly scalable and higher performance,” stated Zhuravlev.
Fixes to declaration times (Lustre 2.9), dnode size increase (ZFS 0.7), and dnode accounting (Lustre 2.10) are showing significant improvements to MDS performance on ZFS, delivering both high scalability and performance that is expected to pass ldiskfs. These features will appear in the upcoming release of Intel Enterprise Edition for Lustre software.
Lustre 2.10 와 zfs0.7 환경에서는 확실히 이전 버젼 보다는 thread 의 증가에 따른 파일 생성 속도 또한 개선되고 있습니다.
“We continue to work on MDS performance on ZFS,” added Zhuravlev. We are actively looking at other optimizations and working the upstream ZFS community.” He presented his work at the European Open File System’s (EOFS) 2016 Lustre Administrator and Developer Workshop. His presentation is here.
Large IO Streaming Improvements
Write performance of Lustre on ZFS runs near to its theoretical maximum, but block allocation is not file-aware, so it tends to scatter blocks around the disks, which means more disk seeks when he file is read back. This keeps read performance lagging Lustre’s write performance on ZFS as presented at LUG 2016 by Jinshan Xiong of Intel’s HPDD. A lot of patches have been landed during 2015 that has helped performance on ZFS, including Xiong’s vectorization for Intel AVX instructions of the Fletcher 4 algorithm used to compute checksums. Work continues to enhance IO performance.
IO performance work at Intel is being done and upstreamed to the ZFS on Linux project with expectations to be landed in Lustre 2.11. This work includes
- Increasing support on Lustre for a 16 MB block size—already supported by ZFS—which will increase the size of data blocks written to each disk. A larger block size will reduce disk seeks and boost read performance. This, in turn, will require supporting a dynamic OSD-ZFS block size to prevent an increase in read/modify/write operations. : 예전에는 1MB였고, 이후에 4MB까지 확장되었으며, IEEL 3.1(Lustre2.7) 부터는 16MB까지(ldiskfs기준) 확장이 되었습니다. zfs기준의 16MB는 Lustre 2.11에 포함될 예정입니다.
- Implementing a dRAID mechanism instead of RAIDZ to boost performance when a drive fails. With RAIDZ, throughput of a disk group is limited by the spare disk’s bandwidth. dRAID will use a mechanism that distributes data to spare blocks among the remaining disks. Throughput is expected to improve even when the group is degraded because of a failed drive. : ZFS에서 사용하는 Software RAID를 활용하여, Z3방식 대신에 dRAID를 사용하여 AVX2와 함께 활용 활용하면, rebuild 성능을 극대화 할수 있습니다.
- Creating a separate Metadata allocation class to allow a dedicated high throughput VDEV for storing Metadata. Since ZFS Metadata is smaller, but fundamental, reading it faster will result in enhanced IO performance. The VDEV should be an SSD or NVRAM, and it can be mirrored for redundancy. : SSD혹은 NVRAM을 VDEV(파일시스템에서 사용하는 block cache)로 사용함으로써, ZFS의 inodes allocation 성능을 최대한 올릴수 있습니다.
New Resource Agents for Lustre
Additional to performance improvements for ZFS, Gabriele Paciucci, HPC Architect for Intel, released patches to the Lustre main branch for enabling the Linux Pacemaker* and Corosync* high availability framework for Lustre on ZFS. Previously, Lustre’s use of Pacemaker and Corosync* was only supported on ldiskfs. Paciucci’s work introduces three new resource agents released in the following patches:
- LU-8455 supports Pacemaker on a ZFS-based file system. : IEEL3.1x 부터는 ZFS HA관련 Full Managed 형태로 사용할 수 있습니다.
- LU-8457 increases monitoring capabilities for LNet. : IEEL3.x 부터 Dynamic Lnet을 지원하면서, LNET의 모니터 확장 기능 추가 하였습니다.
- LU-8458 increases monitoring capabilities for Lustre services.
- IEEL 3.1xx 부터는 IML 에서 HPC scheduler 와 연동하여, 각 Job의 IO를 볼수 있도록 GUI에 통합 되었습니다.
His work was presented at LAD 2016, and it can be viewed here.
With critical features enterprises and other modernized data centers need, ZFS as a back end file system for Lustre is already being used in many deployments. The Lustre community continues to enhance performance across Lustre and ZFS to close the gap between ldiskfs and ZFS.
아래는 ZFS기반의 Full Managed 형태의 구성인, IEEL 3.1의 Screenshot 입니다. 이전의 IEEL3.0과는 다르게 죄측에 Resource메뉴가 추가되고, 상단에 HPC Scheduler 와 연동하여, 각 Job의 IO영향도를 볼수 있도록 GUI에 jobstats 메뉴가 통합이 되었습니다.