In any kind of distributed system you have to trade off consistency, availability and partition tolerance. The CAP theorem states that in the presence of a network partition you have to choose between consistency and availability. With its default configuration, Ceph provides consistency and partition tolerance. Just keep in mind that Ceph has many config options: ~860 in hammer, ~1100 in jewel. To check this, look at the config_opts.h file in the jewel branch on GitHub.
Getting a specific behaviour from your cluster depends on your ability to configure it and to change settings on the fly in case of contingency. This post is about the default recovery / backfilling options. Maybe you have noticed that a critical failure, like losing a complete node, causes a lot of data movement and a lot of ops on the drives. By default the cluster tries to recover as fast as possible, while it also has to support normal operation and common use. As I said at the beginning of the post, by default Ceph gives you consistency and partition tolerance, so the usual result is that availability starts to suffer: users notice high latency, and instances using an RBD backend show high CPU usage because of the slow response.
Let's think about this in a better way and analyse the problem. If we have a replica 3 cluster and one server goes down (even in a 3-server cluster), operation is still possible and the recovery jobs are not that urgent, because Ceph tries to achieve consistency all the time and will eventually get back to the correct 3-replica state. So everything will be fine: no data is lost, and the remaining replicas will start to regenerate the missing replica on other nodes. The big problem is that backfilling will compromise normal operation, so the real question is whether we want a quick recovery or a normal response to the clients and watchers connected. The answer is not that hard: operation response is priority number 0!!!!
Lost and recovery action in CRUSH (Image from Samuel Just, Vault 2015)
This is not the ne plus ultra solution, it is just my solution to this problem. All of this was tested on a Ceph hammer cluster:
1.- The best option is to configure this at installation time, in the ceph.conf file:
******* SNIP *******
osd max backfills = 1
osd recovery threads = 1
osd recovery op priority = 1
osd client op priority = 63
osd recovery max active = 1
osd snap trim sleep = 0.1
******* SNIP *******
2.- If not, you can inject the options on the fly. You can use osd.x, where x is the number of the OSD daemon, or apply them cluster-wide as in the next example. Remember to add them to the config file afterwards, because injected options are lost when the daemons restart.
ceph@stor01:~$ sudo ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph@stor01:~$ sudo ceph tell osd.* injectargs '--osd-recovery-threads 1'
ceph@stor01:~$ sudo ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
ceph@stor01:~$ sudo ceph tell osd.* injectargs '--osd-client-op-priority 63'
ceph@stor01:~$ sudo ceph tell osd.* injectargs '--osd-recovery-max-active 1'
ceph@stor01:~$ sudo ceph tell osd.* injectargs '--osd-snap-trim-sleep 0.1'
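If you prefer a single call instead of six, injectargs also accepts several options in one string. The following is only a minimal sketch of that idea: conf_to_flags is a hypothetical helper name (it is not part of Ceph), and the script only prints the command as a dry run, so nothing touches the cluster until you copy and run the output yourself.

```shell
#!/bin/sh
# Sketch: turn "osd max backfills = 1" style lines (ceph.conf syntax)
# into one injectargs flag string.
# conf_to_flags is a hypothetical helper, not a Ceph command.
conf_to_flags() {
  flags=""
  while IFS='=' read -r key val; do
    # Trim surrounding whitespace from key and value.
    key=$(printf '%s' "$key" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
    val=$(printf '%s' "$val" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
    # Skip empty lines.
    [ -z "$key" ] && continue
    # "osd max backfills" becomes "--osd-max-backfills".
    flag=$(printf '%s' "$key" | tr ' ' '-')
    flags="$flags --$flag $val"
  done
  # Drop the leading space before printing.
  printf '%s\n' "${flags# }"
}

# Same values as the ceph.conf snippet in option 1.
FLAGS=$(conf_to_flags <<'EOF'
osd max backfills = 1
osd recovery threads = 1
osd recovery op priority = 1
osd client op priority = 63
osd recovery max active = 1
osd snap trim sleep = 0.1
EOF
)

# Dry run: print the command instead of executing it.
echo "sudo ceph tell 'osd.*' injectargs '$FLAGS'"
```

Quoting 'osd.*' keeps the shell from expanding the asterisk against files in the current directory. After applying, you can check a value on the node that hosts a given OSD through its admin socket, for example with sudo ceph daemon osd.0 config get osd_max_backfills (osd.0 here is just an example id).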
The final result will be a really slow recovery of the cluster, but normal operation without any kind of problem.