Wednesday, August 26, 2015

Notes on the bcache file system

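Today I was browsing the Linux mailing lists again and came across a rather interesting post, so I am taking notes here to see how things develop.
It is a post about a new file system, apparently written by its author (who used to work at Google).
The post introduces bcachefs. As described there, bcachefs's performance and reliability are comparable to ext4 and xfs, while it also offers btrfs/zfs-style features. When I have time I should look into this file system properly.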

Back to the main topic:
bcachefs is a modern COW (copy-on-write) filesystem with checksumming, compression, multiple devices, caching,
and eventually snapshots and all kinds of other nifty features.


Next, here is the poster's own description of bcachefs's features:

FEATURES:
 - multiple devices
   (replication is like 80% done, but the recovery code still needs to be
   finished).

 - caching/tiering (naturally)
   you can format multiple devices at the same time with bcacheadm, and assign
   them to different tiers - right now only two tiers are supported, tier 0
   (default) is the fast tier and tier 1 is the slow tier. It'll effectively do
   writeback caching between tiers.

 - checksumming, compression: currently only zlib is supported for compression,
   and for checksumming there's crc32c and a 64 bit checksum. There's mount
   options for them:
   # mount -o data_checksum=crc32c,compression=gzip

   Caveat: don't try to use tiering and checksumming or compression at the same
   time yet, the read path needs to be reworked to handle both at the same time.

PLANNED FEATURES:
 - snapshots (might start on this soon)
 - erasure coding
 - native support for SMR drives, raw flash
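
To make the checksumming and compression items above more concrete, here is a minimal sketch of a write path that compresses a block and checksums the stored bytes, and a read path that verifies before decompressing. This is not bcachefs code. The names `write_extent` and `read_extent` are hypothetical, checksumming the compressed bytes (rather than the original data) is an assumption, and Python's `zlib.crc32` is plain CRC-32, used here only as a stand-in for crc32c.

```python
import zlib


def write_extent(data: bytes) -> dict:
    """Compress a block and record a checksum of the bytes that get stored."""
    stored = zlib.compress(data)
    return {
        "stored": stored,
        "csum": zlib.crc32(stored),  # stand-in for crc32c (assumption)
        "size": len(data),           # uncompressed length
    }


def read_extent(extent: dict) -> bytes:
    """Verify the checksum first, then decompress."""
    if zlib.crc32(extent["stored"]) != extent["csum"]:
        raise IOError("checksum mismatch: stored data is corrupt")
    data = zlib.decompress(extent["stored"])
    if len(data) != extent["size"]:
        raise IOError("size mismatch after decompression")
    return data


if __name__ == "__main__":
    block = b"bcachefs " * 100
    extent = write_extent(block)
    print(len(block), "bytes stored as", len(extent["stored"]), "bytes")
    assert read_extent(extent) == block
```

Verifying before decompressing means a corrupted block is rejected without ever being handed to the decompressor.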


The poster also ran some performance tests on the file system, using dbench on a high-end PCIe flash device.

Here's some dbench numbers, running on a high end pcie flash device:

1 thread, O_SYNC:       Throughput              Max latency
bcache:                 225.812 MB/sec          18.103 ms
ext4:                   454.546 MB/sec          6.288 ms
xfs:                    268.81 MB/sec           1.094 ms
btrfs:                  271.065 MB/sec          74.266 ms

20 threads, O_SYNC:     Throughput              Max latency
bcache:                 1050.03 MB/sec          6.614 ms
ext4:                   2867.16 MB/sec          4.128 ms
xfs:                    3051.55 MB/sec          10.004 ms
btrfs:                  665.995 MB/sec          1640.045 ms

60 threads, O_SYNC:     Throughput              Max latency
bcache:                 2143.45 MB/sec          15.315 ms
ext4:                   2944.02 MB/sec          9.547 ms
xfs:                    2862.54 MB/sec          14.323 ms
btrfs:                  501.248 MB/sec          8470.539 ms

1 thread:               Throughput              Max latency
bcache:                 992.008 MB/sec          2.379 ms
ext4:                   974.282 MB/sec          0.527 ms
xfs:                    715.219 MB/sec          0.527 ms
btrfs:                  647.825 MB/sec          108.983 ms

20 threads:             Throughput              Max latency
bcache:                 3270.8 MB/sec           16.075 ms
ext4:                   4879.15 MB/sec          11.098 ms
xfs:                    4904.26 MB/sec          20.290 ms
btrfs:                  647.232 MB/sec          2679.483 ms

60 threads:             Throughput              Max latency
bcache:                 4644.24 MB/sec          130.980 ms
ext4:                   4405.16 MB/sec          69.741 ms
xfs:                    4413.93 MB/sec          131.194 ms
btrfs:                  803.926 MB/sec          12367.850 ms
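
One way to read these tables is to normalize each row against a baseline. The sketch below copies the throughput figures from the tables and expresses them relative to ext4. The helper name `relative_to` and the label "no O_SYNC" for the second set of runs are my own; choosing ext4 as the baseline is arbitrary.

```python
# Throughput in MB/sec, copied from the dbench tables above.
# Keys are (thread count, mode); "no O_SYNC" is just a label for the runs
# that were not marked O_SYNC.
THROUGHPUT = {
    (1, "O_SYNC"): {"bcache": 225.812, "ext4": 454.546, "xfs": 268.81, "btrfs": 271.065},
    (20, "O_SYNC"): {"bcache": 1050.03, "ext4": 2867.16, "xfs": 3051.55, "btrfs": 665.995},
    (60, "O_SYNC"): {"bcache": 2143.45, "ext4": 2944.02, "xfs": 2862.54, "btrfs": 501.248},
    (1, "no O_SYNC"): {"bcache": 992.008, "ext4": 974.282, "xfs": 715.219, "btrfs": 647.825},
    (20, "no O_SYNC"): {"bcache": 3270.8, "ext4": 4879.15, "xfs": 4904.26, "btrfs": 647.232},
    (60, "no O_SYNC"): {"bcache": 4644.24, "ext4": 4405.16, "xfs": 4413.93, "btrfs": 803.926},
}


def relative_to(baseline: str, row: dict) -> dict:
    """Express every file system's throughput as a ratio of the baseline's."""
    return {fs: round(mbps / row[baseline], 2) for fs, mbps in row.items()}


if __name__ == "__main__":
    for (threads, mode), row in THROUGHPUT.items():
        print(f"{threads:>2} threads, {mode:<9}", relative_to("ext4", row))
```

Read this way, bcache reaches roughly half of ext4's throughput with a single O_SYNC thread and slightly exceeds it at 60 threads without O_SYNC.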



The post mentions a few points that I found interesting:
"Where'd that 20% of my space go?" - you'll notice the capacity shown by df is
lower than what it should be. Allocation in bcachefs (like in upstream bcache)
is done in terms of buckets, with copygc required if no buckets are empty, hence
we need copygc and a copygc reserve (much like the way SSD FTLs work).


It's quite conceivable at some point we'll add another allocator that doesn't
work in terms of buckets and doesn't require copygc (possibly for rotating
disks), but for a COW filesystem there are real advantages to doing it this way.
So for now just be aware - and the 20% reserve is probably excessive, at some
point I'll add a way to change it.


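So users will notice that part of their capacity has been eaten up!? That seems like something storage vendors would care about a lot.
What ordinary users are most sensitive to is how much capacity they get for the money they spend XD
The author notes that the reserve is currently 20%, and that a way to change its size may be added at some point.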
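
To see why bucket-based allocation needs a reserve, here is a toy model, not bcachefs's actual allocator. Buckets are append-only lists of blocks, overwritten blocks become stale, and copygc frees a bucket by copying its live blocks into an empty one. All names (`usable_buckets`, `copygc`, `STALE`) are hypothetical, and the victim-selection policy is an assumption; only the 20% figure comes from the post.

```python
STALE = None  # marker for a block that has been overwritten elsewhere


def usable_buckets(total_buckets: int, reserve_percent: int = 20) -> int:
    """Buckets left for user data after setting the copygc reserve aside."""
    reserve = total_buckets * reserve_percent // 100
    return total_buckets - reserve


def live_count(bucket: list) -> int:
    return sum(1 for blk in bucket if blk is not STALE)


def copygc(buckets: list):
    """Free one bucket by copying its live blocks into an empty bucket.

    Returns the index of the freed bucket, or None if nothing is worth moving.
    """
    empty = next((i for i, b in enumerate(buckets) if not b), None)
    if empty is None:
        raise RuntimeError("no empty bucket left: the copygc reserve is exhausted")
    # Only buckets that contain stale blocks give space back when compacted.
    candidates = [i for i, b in enumerate(buckets) if b and STALE in b]
    if not candidates:
        return None
    victim = min(candidates, key=lambda i: live_count(buckets[i]))
    buckets[empty] = [blk for blk in buckets[victim] if blk is not STALE]
    buckets[victim] = []
    return victim


if __name__ == "__main__":
    buckets = [["a", STALE, STALE, "b"], ["c", "d", "e", "f"], []]
    freed = copygc(buckets)
    print("freed bucket", freed, "->", buckets)
    print("usable buckets out of 10:", usable_buckets(10))
```

The point of the toy is the `RuntimeError` branch: without at least one empty bucket to copy into, compaction cannot start, which is why some capacity has to be held back and does not show up in df.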

Mount times:
bcachefs is partially garbage collection based - we don't persist allocation
information. We no longer require doing mark and sweep at runtime to reclaim
space, but we do have to walk the extents btree when mounting to find out what's
free and what isn't.

(We do retain the ability to do a mark and sweep while the filesystem is in use
though - i.e. we have the ability to do a large chunk of what fsck does at
runtime).


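On this point, the post says bcachefs no longer needs to run mark and sweep at runtime to reclaim space. Instead, it walks the extents btree at mount time to find out what is free and what is not.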
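
The mount-time walk can be pictured roughly as follows. This is a sketch under heavy simplification: a flat list of `(inode, offset, bucket)` tuples stands in for the extents btree, and the function name `rebuild_free_buckets` and the tuple layout are hypothetical.

```python
def rebuild_free_buckets(extents, nr_buckets: int) -> list:
    """Reconstruct which buckets are free by walking every extent once.

    `extents` is any iterable of (inode, offset, bucket) tuples; in a real
    file system this would be an in-order walk of the extents btree.
    """
    in_use = set()
    for _inode, _offset, bucket in extents:
        in_use.add(bucket)
    return [b for b in range(nr_buckets) if b not in in_use]


if __name__ == "__main__":
    extents = [(1, 0, 0), (1, 8, 2), (2, 0, 2)]
    print(rebuild_free_buckets(extents, nr_buckets=4))
```

Because allocation information is not persisted, the cost of this walk grows with the number of extents, which is why the post discusses it under mount times.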

Further down the thread, an engineer who apparently works at HGST asked him:
How do you imagine SMR drives support? How do you feel about libzbc
using for SMR drives support? I am not very familiar with bcachefs
architecture yet. But I suppose that maybe libzbc model can be useful
for SMR drives support on bcachefs side. Anyway, it makes sense to
discuss proper model.


Another question that caught my interest:
How do you imagine raw flash support in bcachefs architecture? Frankly
speaking, I am implementing NAND flash oriented file system. But this
project is proprietary yet and I can't share any details. However,
currently, I've implemented NAND flash related approaches in my file
system only. So, maybe, it make sense to consider some joint variant of
bcachefs and implementation on my side for NAND flash support. I need to
be more familiar with bcachefs architecture for such decision. But,
unfortunately, I suspect that it can be not so easy to support raw flash
for bcachefs. Of course, I can be wrong.


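He mentions that he is implementing a NAND flash oriented file system. Judging from the way he asks, he is probably developing a file system that does not need a flash controller. I have also been wondering when flash controllers will stop being necessary. Setting aside the recently announced and much-hyped XPoint, will future products go back to carrying raw flash directly? And where would the performance bottleneck be?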