Search This Blog

Wednesday, June 14, 2017

Ceph recovery backfilling affecting production instances

In any kind of distributed system, you will have to choose between consistency, availability, and partitioning, the CAP theorem states that in the presence of a network partition, one has to choose between consistency and availability, by default (default configurations) CEPH provides consistency and partitioning, just take in count that CEPH has many config options: ~860 in hammer, ~1100 in jewel, check this out, is jewel github config_opts.h file.

If you want any specific behavior in your cluster depends on your ability to configure and/or to change on the fly in case of contingency, this post talks about specific default recovery / backfilling option clusters, maybe you have noticed that in case of a critical failure, like losing a complete node, this causes a lot of movement of data, lots of ops on the drives, by default the cluster is going to try to recover in the fastest way possible, and also needs to support the normal operation and common use, like I said at the beginning of the post, by default CEPH have consistency and partitioning, so the common response is to start to have failures in the availability and users will start to notice high latency, high CPU usage in instances using RBD backend because of the slow response.

Try to think of this in a better way and let's analyze the problem, if we have a replica 3 cluster and we have a server down (even if we have a 3 servers cluster), the operation is still possible and the recovery jobs are no that important because CEPH will try to achieve consistency all the time, it will achieve the correct 3 replica consistency eventually, so everything will be fine, no data loss, the remaining replicas will start to regenerate the missing replica in others nodes, the big problem is the backfilling will compromise the operation, so the real problem is that we need to choose between a quick recovery or a common response to the clients and watchers connected, the response is not that hard to know, operation response is priority number 0!!!!

Lost and recovery action in CRUSH (Image from Samuel Just, Vault 2015)

This is not the non-plus ultra solution, is just my solution to this problem, all this was tested in a CEPH hammer cluster:

1.- The better one is to configure at the beginning of the installation in the ceph.conf file

******* SNIP *******
osd max backfills = 1
osd recovery threads = 1
osd recovery op priority = 1
osd client op priority = 63
osd recovery max active = 1
osd snap trim sleep = 0.1
******* SNIP *******

2.- If not, you can inject the on-the-fly options, you can use osd.x where x is the number of the osd daemon, or like the next example applies cluster-wide, but remember to put in the config file after because these options will be lost on reboot.

ceph@stor01:~$ sudo ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph@stor01:~$ sudo ceph tell osd.* injectargs '--osd-recovery-threads 1'
ceph@stor01:~$ sudo ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
ceph@stor01:~$ sudo ceph tell osd.* injectargs '--osd-client-op-priority 63'
ceph@stor01:~$ sudo ceph tell osd.* injectargs '--osd-recovery-max-active 1'
ceph@stor01:~$ sudo ceph tell osd.* injectargs '--osd-snap-trim-sleep 0.1'

The final result will be a really slow recovery of the cluster, but operation without any kind of problem.