Btrfs:RAID 5 Rsync Freeze
=========================
:author: Aaron Ball
:email: nullspoon@iohq.net



My server's _/home/_ directory is a btrfs RAID 5, spanning three drives (I did
a blog post about it Btrfs:RAID_Setup[here]). Everything worked fine, until I
used rsync to sync my files from my laptop to my server.  At that point, the
sync would go well for a little while and then slow to a crawl. I couldn't
cancel the sync with a ctrl+c. If I could get on my server over ssh, I'd find
that one of my cpus was pegged at 100%.  Sometimes though it got so bogged down
I couldn't even get to the server at all. If I were already on the server and I
did a kill -9 on rsync, it'd go defunct.

I checked my logs after trying to umount /home/ and found...

----
Nov 03 12:01:18 zion kernel: device label home devid 1 transid 1173 /dev/sdb
Nov 03 12:01:19 zion kernel: btrfs: disk space caching is enabled
Nov 03 12:11:53 zion kernel: INFO: task umount:1668 blocked for more than 120 seconds.
Nov 03 12:11:53 zion kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 03 12:11:53 zion kernel: umount D ffff880037afbc60 0 1668 1653 0x00000000
Nov 03 12:11:53 zion kernel: ffff880037afbbd0 0000000000000086 0000000000014500 ffff880037afbfd8
Nov 03 12:11:53 zion kernel: ffff880037afbfd8 0000000000014500 ffff8800aa0caa30 0000000000000010
Nov 03 12:11:53 zion kernel: 000000000d6fffff ffff880037afbb98 ffffffff8113a911 ffff8800afedb728
Nov 03 12:11:53 zion kernel: Call Trace:
Nov 03 12:11:53 zion kernel: [<ffffffff8113a911>] ?  free_pcppages_bulk+0x3b1/0x3f0
Nov 03 12:11:53 zion kernel: [<ffffffff81132700>] ? filemap_fdatawait+0x30/0x30
Nov 03 12:11:53 zion kernel: [<ffffffff814e1029>] schedule+0x29/0x70
Nov 03 12:11:53 zion kernel: [<ffffffff814e12cf>] io_schedule+0x8f/0xe0
Nov 03 12:11:53 zion kernel: [<ffffffff8113270e>] sleep_on_page+0xe/0x20
Nov 03 12:11:53 zion kernel: [<ffffffff814ddb5b>] __wait_on_bit_lock+0x5b/0xc0
Nov 03 12:11:53 zion kernel: [<ffffffff8113284a>] __lock_page+0x6a/0x70
Nov 03 12:11:53 zion kernel: [<ffffffff81084800>] ?  wake_atomic_t_function+0x40/0x40
Nov 03 12:11:53 zion kernel: [<ffffffff81141fa3>] truncate_inode_pages_range+0x613/0x660
Nov 03 12:11:53 zion kernel: [<ffffffff81142005>] truncate_inode_pages+0x15/0x20
Nov 03 12:11:53 zion kernel: [<ffffffffa07df172>] btrfs_evict_inode+0x42/0x380 [btrfs]
Nov 03 12:11:53 zion kernel: [<ffffffff811b97b0>] evict+0xb0/0x1b0
Nov 03 12:11:53 zion kernel: [<ffffffff811b98e9>] dispose_list+0x39/0x50
Nov 03 12:11:53 zion kernel: [<ffffffff811ba56c>] evict_inodes+0x11c/0x130
Nov 03 12:11:53 zion kernel: [<ffffffff811a1cc8>] generic_shutdown_super+0x48/0xe0
Nov 03 12:11:53 zion kernel: [<ffffffff811a1f22>] kill_anon_super+0x12/0x20
Nov 03 12:11:53 zion kernel: [<ffffffffa07a8ee6>] btrfs_kill_super+0x16/0x90 [btrfs]
Nov 03 12:11:53 zion kernel: [<ffffffff811a22fd>] deactivate_locked_super+0x3d/0x60
Nov 03 12:11:53 zion kernel: [<ffffffff811a28e6>] deactivate_super+0x46/0x60
Nov 03 12:11:53 zion kernel: [<ffffffff811bdeaf>] mntput_no_expire+0xef/0x150
Nov 03 12:11:53 zion kernel: [<ffffffff811bf0b1>] SyS_umount+0x91/0x3b0
Nov 03 12:11:53 zion kernel: [<ffffffff814ea5dd>] system_call_fastpath+0x1a/0x1f
----

The only way to solve the problem was to perform a restart. After that, the
problem would come back as soon as I started rsync again.


[[the-solution]]
== The Solution

I hunted around for a while until I finally just searched for the name of the
pegged process, **btrfs-endio-wri**, and cpu time. It turns out, the btrfs
folks have https://btrfs.wiki.kernel.org/index.php/Gotchas[a page] detailing a
list of current "gotchas" btrfs has. This issue was one of them. They describe
it as <pre> Files with a lot of random writes can become heavily fragmented
(10000+ extents) causing trashing on HDDs and excessive multi-second spikes of
CPU load on systems with an SSD or large amount a RAM. ... Symptoms include
btrfs-transacti and btrfs-endio-wri taking up a lot of CPU time (in spikes,
possibly triggered by syncs). You can use filefrag to locate heavily fragmented
files. </pre>

One of the best parts of rsync is that is syncs deltas instead of resyncing the
entire file. What does that result in? Lots of little random writes. Sounds
like a match to me.

**To fix this**, I defragged all of /home/ (with _compression=lzo_ of course :)
), and remounted using the *autodefrag* option.

Now I can run rsync with no problems.

One last thing to note. Their gotchas page says that once they've worked out a
few potential kinks with the autodefrag mount option, they'll make it the
default, which should prevent this from being an issue in future versions.

Category:Linux
Category:Btrfs
Category:Storage
Category:RAID


// vim: set syntax=asciidoc: