Wednesday, December 12, 2018

How to disable Cloud-Init in an EL-like Cloud Image

So this one is pretty simple. However, I found a lot of misinformation along the way, so I figured I would jot down the proper (and simplest) process here.

Symptoms: an RHEL (or variant) VM that takes a very long time to boot. On the VM console, you can see the following output while the VM boot process is stalled and waiting for a timeout. Note that the message below has nothing to do with cloud-init, but it's the output that I have most often seen on the console while waiting for a VM to boot.

[106.325574] random: crng init done

Note that I have run into this issue in both OpenStack (when booting from external provider networks) and in KVM.

Upon initial boot of the VM, run the command below.

13:18:01 alvaro@lykan /home/alvaro/Documents/2post
$ sudo dnf install libguestfs libguestfs-tools openssl
Last metadata expiration check: 1:53:31 ago on Mon 16 Jul 2018 01:51:05 PM CDT.
Package libguestfs-1:1.38.2-1.fc27.x86_64 is already installed, skipping.
Package libguestfs-tools-1:1.38.2-1.fc27.noarch is already installed, skipping.
Package openssl-1:1.1.0h-3.fc27.x86_64 is already installed, skipping.
Dependencies resolved.
Nothing to do.
Complete!

13:18:26 alvaro@lykan /home/alvaro/Documents/2post
$ guestfish --rw -a ../../Downloads/CentOS-7-x86_64-GenericCloud-1805.qcow2
Welcome to guestfish, the guest filesystem shell for
editing virtual machine filesystems and disk images.

Type: ‘help’ for help on commands
‘man’ to read the manual
‘quit’ to quit the shell

><fs> run
100% ⟦▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒⟧ 00:00
><fs> list-filesystems
/dev/sda1: xfs
><fs> mount /dev/sda1 /
><fs> touch /etc/cloud/cloud-init.disabled
><fs> quit

Seriously, that’s it. No need to disable or remove cloud-init services.
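The interactive session above can probably be scripted too, since guestfish accepts several commands on one command line separated by a colon. Here is a minimal Python sketch that only builds that command line. The function name and its defaults are my own, and the device name is assumed to match what list-filesystems reported for this image.

```python
import shlex


def guestfish_touch_command(image, device="/dev/sda1",
                            marker="/etc/cloud/cloud-init.disabled"):
    """Build a one-shot guestfish command mirroring the interactive session.

    The helper name and defaults are illustrative; check the device with
    list-filesystems first, because it differs between images.
    """
    # guestfish runs multiple commands in one invocation when they are
    # separated by ':' arguments.
    return ["guestfish", "--rw", "-a", image,
            "run", ":", "mount", device, "/", ":", "touch", marker]


cmd = guestfish_touch_command("CentOS-7-x86_64-GenericCloud-1805.qcow2")
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) on a host where libguestfs-tools is installed.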

Monday, July 16, 2018

Change user passwords on qcow2 disks or images

Sometimes you need to change a user's password in a qcow2 image, either to test locally or because your infrastructure has no cloud-init. The procedure is the same regardless of the user.

Package names may vary slightly depending on the system. I'm using Fedora 27, and I have the following installed:

[alvaro@lykan 2post]$ sudo dnf install libguestfs libguestfs-tools openssl
Last metadata expiration check: 1:53:31 ago on Mon 16 Jul 2018 01:51:05 PM CDT.
Package libguestfs-1:1.38.2-1.fc27.x86_64 is already installed, skipping.
Package libguestfs-tools-1:1.38.2-1.fc27.noarch is already installed, skipping.
Package openssl-1:1.1.0h-3.fc27.x86_64 is already installed, skipping.
Dependencies resolved.
Nothing to do.
Complete!


Obviously, I have a QEMU environment to test and run the images, which is an important way to confirm that your steps are working.

[alvaro@lykan 2post]$ guestfish --rw -a ../../Downloads/CentOS-7-x86_64-GenericCloud-1805.qcow2

Welcome to guestfish, the guest filesystem shell for
editing virtual machine filesystems and disk images.

Type: ‘help’ for help on commands
‘man’ to read the manual
‘quit’ to quit the shell

><.fs> run
100% ⟦▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒⟧ 00:00
><.fs> list-filesystems
/dev/sda1: xfs
><.fs> mount /dev/sda1 /
><.fs> cp /etc/shadow /etc/shadow-original
><.fs> vi /etc/shadow


The vim editor opens the file, and you can now change the hash of any user (do not close the editor until you have reached the last step). In another terminal, run:

[alvaro@lykan 2post]$ openssl passwd -1 mysuperpassword
$1$GKdzYMMe$q20PpMv5i/QFbmgwOqtZy1


Copy the generated hash and paste it between the first and second colons of the user's line, replacing whatever was there.


Before

root:!!:17687:0:99999:7:::
bin:*:17632:0:99999:7:::
daemon:*:17632:0:99999:7:::
adm:*:17632:0:99999:7:::
lp:*:17632:0:99999:7:::
sync:*:17632:0:99999:7:::
shutdown:*:17632:0:99999:7:::
halt:*:17632:0:99999:7:::
mail:*:17632:0:99999:7:::
operator:*:17632:0:99999:7:::
games:*:17632:0:99999:7:::
ftp:*:17632:0:99999:7:::
nobody:*:17632:0:99999:7:::
systemd-network:!!:17687::::::
dbus:!!:17687::::::
polkitd:!!:17687::::::
rpc:!!:17687:0:99999:7:::
rpcuser:!!:17687::::::
nfsnobody:!!:17687::::::
sshd:!!:17687::::::
postfix:!!:17687::::::
chrony:!!:17687::::::


After

root:$1$GKdzYMMe$q20PpMv5i/QFbmgwOqtZy1:17687:0:99999:7:::
bin:*:17632:0:99999:7:::
daemon:*:17632:0:99999:7:::
adm:*:17632:0:99999:7:::
lp:*:17632:0:99999:7:::
sync:*:17632:0:99999:7:::
shutdown:*:17632:0:99999:7:::
halt:*:17632:0:99999:7:::
mail:*:17632:0:99999:7:::
operator:*:17632:0:99999:7:::
games:*:17632:0:99999:7:::
ftp:*:17632:0:99999:7:::
nobody:*:17632:0:99999:7:::
systemd-network:!!:17687::::::
dbus:!!:17687::::::
polkitd:!!:17687::::::
rpc:!!:17687:0:99999:7:::
rpcuser:!!:17687::::::
nfsnobody:!!:17687::::::
sshd:!!:17687::::::
postfix:!!:17687::::::
chrony:!!:17687::::::
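The same Before/After edit can be expressed as a small text transformation, which helps if you want to check the result before pasting it into the editor. This is only a sketch: the function name is hypothetical, and it works on a copy of the shadow text rather than on the image itself.

```python
def set_shadow_hash(shadow_text, user, new_hash):
    """Return shadow_text with the password field (2nd field) of `user` replaced."""
    out = []
    found = False
    for line in shadow_text.splitlines():
        fields = line.split(":")
        if len(fields) > 1 and fields[0] == user:
            fields[1] = new_hash  # the text between the first and second colon
            found = True
        out.append(":".join(fields))
    if not found:
        raise KeyError(f"user not found: {user}")
    return "\n".join(out) + "\n"


before = "root:!!:17687:0:99999:7:::\nbin:*:17632:0:99999:7:::\n"
after = set_shadow_hash(before, "root", "$1$GKdzYMMe$q20PpMv5i/QFbmgwOqtZy1")
print(after, end="")
```

Only the matching user's line changes; every other line is passed through untouched.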


Save the changes, close the vim editor, and exit guestfish:

><.fs> quit

[alvaro@lykan 2post]$


Now you can test the image on any cloud environment or using your local QEMU environment.

Wednesday, December 6, 2017

Get total provisioned size from cinder volumes

A quick way to get the total amount of provisioned space from cinder:

alvaro@skyline.local: ~
$ cinder list --all-tenants
mysql like output :)

To parse the output and add up all the values in the Size column, use the following pipeline.

alvaro@skyline.local: ~
$ . admin-openrc.sh

alvaro@skyline.local: ~
$ cinder list --all-tenants | awk -F'|' '{print $6}' | sed 's/ //g' | grep -v -e '^$' | awk '{s+=$1} END {printf "%.0f", s}'
13453

The final result is in GB.
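The awk pipeline picks the column by position, so it depends on where Size sits in the table. As a rough alternative, the sketch below looks the column up by its header name. The function name and the sample table are made up for illustration; real cinder output has more columns.

```python
def total_size_gb(table_text):
    """Sum the 'Size' column of a '|'-delimited table like `cinder list` prints."""
    size_index = None
    total = 0
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +-----+ border lines and blanks
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if size_index is None:
            size_index = cells.index("Size")  # first '|' row is the header
            continue
        total += int(cells[size_index])
    return total


# Hypothetical, shortened table used only to exercise the function.
sample = """\
+------+-----------+------+------+
| ID   | Status    | Name | Size |
+------+-----------+------+------+
| aaa1 | available | vol1 | 10   |
| bbb2 | in-use    | vol2 | 20   |
+------+-----------+------+------+
"""
print(total_size_gb(sample))
```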

Wednesday, June 14, 2017

Ceph recovery backfilling affecting production instances

In any distributed system you have to choose between consistency, availability, and partition tolerance. The CAP theorem states that in the presence of a network partition, one has to choose between consistency and availability. With its default configuration, Ceph provides consistency and partition tolerance. Keep in mind that Ceph has many config options: ~860 in hammer and ~1100 in jewel; see the config_opts.h file in the jewel branch on GitHub.

Whether you get a specific behavior from your cluster depends on your ability to configure it, and to change that configuration on the fly in a contingency. This post covers the default recovery and backfilling options. You may have noticed that a critical failure, like losing a complete node, causes a lot of data movement and a lot of ops on the drives. By default the cluster tries to recover as fast as possible while also supporting normal operation. As I said at the beginning of the post, Ceph favors consistency and partition tolerance by default, so the usual result is that availability suffers: users start to notice high latency, and instances using the RBD backend show high CPU usage because of the slow response.





Let's analyze the problem. If we have a replica 3 cluster and one server goes down (even in a 3-server cluster), operation is still possible and the recovery jobs are not that urgent. Ceph always works toward consistency and will eventually restore the full 3 replicas, so there is no data loss: the remaining replicas are used to regenerate the missing replica on other nodes. The big problem is that backfilling compromises normal operation, so we have to choose between a quick recovery and a normal response to the connected clients and watchers. The choice is not hard: operational response is priority number 0!





Lost and recovery action in CRUSH (Image from Samuel Just, Vault 2015)

This is not the ne plus ultra solution, it is just my solution to this problem. All of this was tested on a Ceph hammer cluster:

1.- The best option is to configure this at installation time in the ceph.conf file:

******* SNIP *******
[osd]
....
osd max backfills = 1
osd recovery threads = 1
osd recovery op priority = 1
osd client op priority = 63
osd recovery max active = 1
osd snap trim sleep = 0.1
....
******* SNIP *******

2.- If not, you can inject the options on the fly. You can target osd.x, where x is the number of the OSD daemon, or apply them cluster-wide as in the next example. Remember to add them to the config file afterwards, because injected options are lost when the daemons restart.

ceph@stor01:~$ sudo ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph@stor01:~$ sudo ceph tell osd.* injectargs '--osd-recovery-threads 1'
ceph@stor01:~$ sudo ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
ceph@stor01:~$ sudo ceph tell osd.* injectargs '--osd-client-op-priority 63'
ceph@stor01:~$ sudo ceph tell osd.* injectargs '--osd-recovery-max-active 1'
ceph@stor01:~$ sudo ceph tell osd.* injectargs '--osd-snap-trim-sleep 0.1'

The final result is a really slow cluster recovery, but client operation is not disrupted.
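The ceph.conf names and the injectargs flags are the same options spelled two ways (spaces versus dashes). To keep both in sync, one possible approach is to generate the injectargs commands from a single table. This is a sketch, not a Ceph tool: the dictionary and function names are my own, and the values are the ones listed above.

```python
# Option names as written in ceph.conf, with the values used in this post.
RECOVERY_SETTINGS = {
    "osd max backfills": "1",
    "osd recovery threads": "1",
    "osd recovery op priority": "1",
    "osd client op priority": "63",
    "osd recovery max active": "1",
    "osd snap trim sleep": "0.1",
}


def injectargs_commands(settings, target="osd.*"):
    """Build one `ceph tell <target> injectargs` argv list per option."""
    commands = []
    for option, value in settings.items():
        flag = "--" + option.replace(" ", "-")  # conf name -> CLI flag
        commands.append(["ceph", "tell", target, "injectargs", f"{flag} {value}"])
    return commands


for argv in injectargs_commands(RECOVERY_SETTINGS):
    print(" ".join(argv))
```

Passing each list to subprocess.run without a shell also avoids the shell expanding osd.* against files in the current directory.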

Wednesday, April 12, 2017

Keeping up to date git forked repos

A quick reminder of how to keep forked repos up to date:

First: List the tracked remote repositories.

alvaro@skyline.local: ~/docker-openstack-cli
$ git remote -v

origin https://github.com/alsotoes/docker-openstack-cli.git (fetch)
origin https://github.com/alsotoes/docker-openstack-cli.git (push)

Second: Add the remote repo to work with.

alvaro@skyline.local: ~/docker-openstack-cli
$ git remote add kionetworks https://github.com/kionetworks/docker-openstack-cli.git

Third: Print the local repo configuration to confirm the new remote.

alvaro@skyline.local: ~/docker-openstack-cli
$ git remote -v

kionetworks https://github.com/kionetworks/docker-openstack-cli.git (fetch)
kionetworks https://github.com/kionetworks/docker-openstack-cli.git (push)
origin https://github.com/alsotoes/docker-openstack-cli.git (fetch)
origin https://github.com/alsotoes/docker-openstack-cli.git (push)

Fourth: Push to the remote repo, to complete the update.

alvaro@skyline.local: ~/docker-openstack-cli
$ git push kionetworks

Counting objects: 3, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 311 bytes | 0 bytes/s, done.
Total 3 (delta 2), reused 0 (delta 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To https://github.com/kionetworks/docker-openstack-cli.git
51bb74b..33e5dce master -> master
alvaro@skyline.local: ~/docker-openstack-cli
hist:600 jobs:0 $

Pull new changes from origin.

alvaro@skyline.local: ~/docker-openstack-cli
$ git pull

Already up-to-date.

Pull new changes from a remote called kionetworks.

alvaro@skyline.local: ~/docker-openstack-cli
$ git pull kionetworks master

From https://github.com/kionetworks/docker-openstack-cli
* branch master -> FETCH_HEAD
Already up-to-date.

Sorry if this post has too little information, it is just a reminder.

Wednesday, November 16, 2016

Solve Ceph Clock Skew error

Monitors can be severely affected by significant clock skews across the monitor nodes. This usually translates into weird behavior with no obvious cause. To avoid such issues, you should run a clock synchronization tool on your monitor nodes. By default the monitors will allow clocks to drift up to 0.05 seconds.

This error can be seen using:

# ceph -s
# ceph health detail

root@ceph01:~# ceph -s
cluster 9227547b-bb6b-44f7-b877-3f6d25b942a4
health HEALTH_WARN
clock skew detected on mon.ceph01
monmap e3: 3 mons at {ceph01=172.18.3.5:6789/0,ceph02=172.18.5.6:6789/0,ceph03=172.18.5.7:6789/0}
election epoch 24, quorum 0,1,2 ceph01,ceph02,ceph03
mdsmap e17: 1/1/1 up {0=ceph01=up:active}
osdmap e245: 22 osds: 22 up, 22 in
pgmap v14727: 1408 pgs, 5 pools, 11977 MB data, 3183 objects
24729 MB used, 16361 GB / 16385 GB avail
1408 active+clean

The solution? Just re-sync the clock on the affected mon and restart the mon daemon.

root@ceph01:~# service ntp stop
* Stopping NTP server ntpd
root@ceph01:~# ntpdate ntp.ubuntu.com
16 Nov 01:24:16 ntpdate[4149434]: adjust time server 91.189.91.157 offset -0.002235 sec
root@ceph01:~# ntpd -gq
ntpd: time slew +0.003482s
root@ceph01:~# service ntp start
* Starting NTP server ntpd
root@ceph01:~# restart ceph-mon-all
ceph-mon-all start/running

Just to be sure, it is sometimes better to sync the clock on all the mons.

Also, this default (0.05 seconds) can be changed in the Ceph config file, but the fact that you can does not mean that you should; the default value is a perfectly good configuration.

root@ceph01:~# cat /etc/ceph/ceph.conf
....

[mon]
mon clock drift allowed = 10

...

Check the cluster status again; sometimes the warning takes a few seconds, around 30, to clear.

root@ceph01:~# ceph -s
cluster 9227547b-bb6b-44f7-b877-3f6d25b942a4
health HEALTH_OK
monmap e3: 3 mons at {ceph01=172.18.3.5:6789/0,ceph02=172.18.5.6:6789/0,ceph03=172.18.5.7:6789/0}
election epoch 24, quorum 0,1,2 ceph01,ceph02,ceph03
mdsmap e17: 1/1/1 up {0=ceph01=up:active}
osdmap e245: 22 osds: 22 up, 22 in
pgmap v14727: 1408 pgs, 5 pools, 11977 MB data, 3183 objects
24729 MB used, 16361 GB / 16385 GB avail
1408 active+clean
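As a rough illustration of the 0.05 second threshold mentioned in this post, the sketch below flags which nodes would be over the limit given a set of measured clock offsets. It is not how the monitors compute skew internally; the function name and the offset values are hypothetical.

```python
DEFAULT_DRIFT_ALLOWED = 0.05  # seconds, the default quoted in this post


def skewed_monitors(offsets, allowed=DEFAULT_DRIFT_ALLOWED):
    """Return the names of monitors whose absolute offset exceeds `allowed`."""
    return sorted(name for name, offset in offsets.items()
                  if abs(offset) > allowed)


# Hypothetical offsets in seconds, e.g. collected from an NTP query per node.
offsets = {"ceph01": 0.120, "ceph02": -0.002, "ceph03": 0.010}
print(skewed_monitors(offsets))
```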

Cloning a Ceph client auth key

 I don't recall any reason to do this other than using the same user and auth key to authenticate in different Ceph clusters, like in a multi-backend solution, or just because things get messy when you are not using a default configuration.

Sometimes, things get easy when we use the same user and auth key on both clusters for services to connect to, so let's see some background commands for managing users, keys, and permissions:

Create a new user and auth token (cinder client example):

root@ceph-admin:~# ceph auth get-or-create client.jerry
client.jerry
key: AQAZT05WoQuzJxAAX5BKxCbPf93CwihuHo27VQ==

As you can see, the key is not a parameter; on a different server this will produce a completely different key.
Just to check, print the complete list of keys:

root@ceph-admin:~# ceph auth list
installed auth entries:

osd.0
key: AQCvCbtToC6MDhAATtuT70Sl+DymPCfDSsyV4w==
caps: [mon] allow profile osd
caps: [osd] allow *
osd.1
key: AQC4CbtTCFJBChAAVq5spj0ff4eHZICxIOVZeA==
caps: [mon] allow profile osd
caps: [osd] allow *
client.admin
key: AQBHCbtT6APDHhAA5W00cBchwkQjh3dkKsyPjw==
caps: [mds] allow
caps: [mon] allow *
caps: [osd] allow *
client.jerry
key: AQAZT05WoQuzJxAAX5BKxCbPf93CwihuHo27VQ==

Or, to print a user's authentication key to standard output, run the command in the following format:

ceph auth print-key {TYPE}.{ID}

root@ceph-admin:~# ceph auth print-key client.jerry
AQAZT05WoQuzJxAAX5BKxCbPf93CwihuHo27VQ==

To make the key match the one on another cluster, we need to update the user's key and/or capabilities, and the import command does this. Remember that importing updates the keys and capabilities of existing users and creates any users that do not exist yet. Use the following format:

ceph auth import -i /path/to/keyring

The keyring file must be in the following format; otherwise the command will not work and will just hang.

root@ceph-admin:~# cat jerry.key
[client.jerry]
key = AQAMP01WS8i8ERAAPspjwMzUm4SL00n+WppM6A==
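Since the layout of the file matters, it may be safer to generate it than to type it by hand. A minimal sketch, assuming only the two-line layout shown above (section header plus key line); the function name is hypothetical and capabilities are left out.

```python
def render_keyring(entity, key):
    """Render a keyring entry in the two-line layout shown in this post."""
    if not entity or not key:
        raise ValueError("entity and key are both required")
    return f"[{entity}]\nkey = {key}\n"


text = render_keyring("client.jerry", "AQAMP01WS8i8ERAAPspjwMzUm4SL00n+WppM6A==")
print(text, end="")
```

The key would come from ceph auth print-key on the cluster whose key you want to reuse.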

Now we can update the auth key for the user jerry:

root@ceph-admin:~# ceph auth import -i ./jerry.key
imported keyring

List again.

root@ceph-admin:~# ceph auth list
installed auth entries:

osd.0
key: AQCvCbtToC6MDhAATtuT70Sl+DymPCfDSsyV4w==
caps: [mon] allow profile osd
caps: [osd] allow *
osd.1
key: AQC4CbtTCFJBChAAVq5spj0ff4eHZICxIOVZeA==
caps: [mon] allow profile osd
caps: [osd] allow *
client.admin
key: AQBHCbtT6APDHhAA5W00cBchwkQjh3dkKsyPjw==
caps: [mds] allow
caps: [mon] allow *
caps: [osd] allow *
client.jerry
key: AQAMP01WS8i8ERAAPspjwMzUm4SL00n+WppM6A==

Done, I will continue posting these little helping tricks until the last post about multi-backend ceph is out.