Configuring Ceph pg_autoscale with Rook for OpenStack Deployments: A Guide to Balanced Data Distribution
At Cloudification, we deploy private clouds based on OpenStack, leveraging Rook-Ceph as a highly available storage solution. During installation, one recurring issue we faced was configuring the Ceph cluster correctly to ensure balanced data distribution across OSDs (Object Storage Daemons).
The Problem: PG Imbalance Alerts
Right after a fresh installation, we started receiving PGImbalance alerts from Prometheus, indicating poorly distributed data across hosts. PG stands for Placement Group, an abstraction that sits below the storage pool: every object in the cluster is assigned to a PG. Since a cluster can hold hundreds of millions of objects, PGs allow Ceph to operate and rebalance without having to address each object individually. Let’s have a look at the Placement Groups in the cluster:
$ ceph pg dump
...
OSD_STAT USED AVAIL USED_RAW TOTAL HB_PEERS PG_SUM PRIMARY_PG_SUM
23 33 GiB 1.7 TiB 33 GiB 1.7 TiB [0,2,5,7,8,10,16,18,19,22] 4 0
4 113 MiB 1.7 TiB 113 MiB 1.7 TiB [0,1,2,3,5,6,8,9,11,12,14,15,16,17,20,23] 2 1
1 49 GiB 1.7 TiB 49 GiB 1.7 TiB [0,2,5,6,9,10,12,13,15,16,17,18,21,22] 26 19
19 23 GiB 1.7 TiB 23 GiB 1.7 TiB [1,2,3,5,10,16,18,20,21,22] 15 17
22 19 GiB 1.7 TiB 19 GiB 1.7 TiB [4,5,6,11,15,17,19,20,21,23] 11 0
21 226 GiB 1.5 TiB 226 GiB 1.7 TiB [1,3,9,10,13,16,17,18,20,22] 108 17
20 117 MiB 1.7 TiB 117 MiB 1.7 TiB [0,4,7,12,14,17,18,19,21,22] 5 0
18 258 GiB 1.5 TiB 258 GiB 1.7 TiB [1,5,8,10,11,14,16,17,19,21,22,23] 122 19
17 34 GiB 1.7 TiB 34 GiB 1.7 TiB [0,1,2,3,5,6,8,9,11,12,13,15,16,18,20,21,22,23] 6 4
16 33 GiB 1.7 TiB 33 GiB 1.7 TiB [0,5,7,8,11,12,13,15,17,20] 23 2
15 109 MiB 1.7 TiB 109 MiB 1.7 TiB [2,10,12,14,16,18,19,21,22,23] 4 0
0 109 MiB 1.7 TiB 109 MiB 1.7 TiB [1,2,7,8,12,13,14,17,20,23] 5 1
13 111 MiB 1.7 TiB 111 MiB 1.7 TiB [0,1,2,3,8,9,12,14,15,17,19,21] 7 2
2 116 MiB 1.7 TiB 116 MiB 1.7 TiB [1,3,8,11,15,17,18,19,20,22] 3 0
3 33 GiB 1.7 TiB 33 GiB 1.7 TiB [2,4,5,7,8,9,10,11,16,23] 12 0
5 52 GiB 1.7 TiB 52 GiB 1.7 TiB [1,4,6,11,12,13,14,16,17,18,19,20,21,22,23] 16 2
6 23 GiB 1.7 TiB 23 GiB 1.7 TiB [4,5,7,9,10,11,15,19,20,22] 4 2
7 793 MiB 1.7 TiB 793 MiB 1.7 TiB [0,1,3,4,6,8,10,12,13,14,15,16,18,19,21,23] 4 20
8 34 GiB 1.7 TiB 34 GiB 1.7 TiB [0,5,7,9,12,13,14,18,20,22] 5 2
9 60 GiB 1.7 TiB 60 GiB 1.7 TiB [0,1,3,8,10,12,13,16,17,21] 5 2
10 216 GiB 1.5 TiB 216 GiB 1.7 TiB [1,3,4,5,6,7,9,11,12,14,15,16,18,19,21,22] 101 18
11 101 MiB 1.7 TiB 101 MiB 1.7 TiB [1,2,5,10,12,16,18,19,22,23] 4 1
12 54 GiB 1.7 TiB 54 GiB 1.7 TiB [0,1,3,5,6,7,8,9,10,11,13,14,18,20,21] 16 34
14 25 GiB 1.7 TiB 25 GiB 1.7 TiB [4,5,6,7,10,12,13,15,19,20,22] 5 2
sum 1.1 TiB 41 TiB 1.1 TiB 42 TiB
Let’s check how many PGs are configured for pools:
bash-5.1$ for pool in $(ceph osd lspools | awk '{print $2}') ; do echo "pool: $pool - pg_num: `ceph osd pool get $pool pg_num`" ; done
pool: .rgw.root - pg_num: pg_num: 1
pool: replicapool - pg_num: pg_num: 1
pool: .mgr - pg_num: pg_num: 1
pool: rgw-data-pool - pg_num: pg_num: 1
pool: s3-store.rgw.log - pg_num: pg_num: 1
pool: s3-store.rgw.control - pg_num: pg_num: 1
pool: s3-store.rgw.buckets.index - pg_num: pg_num: 1
pool: s3-store.rgw.otp - pg_num: pg_num: 1
pool: s3-store.rgw.buckets.non-ec - pg_num: pg_num: 1
pool: s3-store.rgw.meta - pg_num: pg_num: 1
pool: rgw-meta-pool - pg_num: pg_num: 1
pool: s3-store.rgw.buckets.data - pg_num: pg_num: 1
pool: cephfs-metadata - pg_num: pg_num: 1
pool: cephfs-data0 - pg_num: pg_num: 1
pool: cinder.volumes.hdd - pg_num: pg_num: 1
pool: cinder.backups - pg_num: pg_num: 1
pool: glance.images - pg_num: pg_num: 1
pool: nova.ephemeral - pg_num: pg_num: 1
This directly explains the imbalanced OSD utilization: Ceph had created only 1 Placement Group per pool, leading to inefficient data distribution.
To diagnose the issue, we used the rados df command to identify the pools consuming the most space and then adjusted their pg_num. The Ceph documentation on placement groups explains how to calculate a suitable number.
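As a rough illustration of that calculation (with placeholder values, not the exact numbers from this cluster), the classic rule of thumb targets about 100 PGs per OSD: take OSDs × 100 divided by the replica count, weight it by the pool’s expected share of the data, and round to a power of two:
# Show per-pool usage to find the pools holding the most data
rados df

# Rule-of-thumb sizing sketch (illustrative values only)
OSDS=24 REPLICAS=3 SHARE=0.5   # SHARE: expected fraction of data in this pool
awk -v o=$OSDS -v r=$REPLICAS -v s=$SHARE 'BEGIN {
  raw = o * 100 / r * s        # ~100 PGs per OSD, divided by the replica count
  p = 1; while (p < raw) p *= 2
  print "suggested pg_num:", p # rounded up to the next power of two
}'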
If we manually reconfigure the current number of PGs for several pools, for example Cinder, Nova, Glance and CephFS:
$ ceph osd pool set cinder.volumes.nvme pg_num 256
$ ceph osd pool set nova.ephemeral pg_num 16
$ ceph osd pool set glance.images pg_num 16
$ ceph osd pool set cephfs-data0 pg_num 16
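If you bump pg_num by hand like this, it is worth confirming that the change was applied and keeping an eye on the resulting data movement (pool names as above):
# Confirm the new pg_num was applied
ceph osd pool get nova.ephemeral pg_num

# Watch the backfill and rebalance progress
ceph pg stat
ceph -s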
Increasing pg_num triggers rebalancing, resulting in more balanced usage and the resolution of the alert:
bash-5.1$ ceph -s
cluster:
id: a6ab9446-2c0d-42f4-b009-514e989fd4a0
health: HEALTH_OK
services:
mon: 3 daemons, quorum b,d,f (age 3d)
mgr: b(active, since 3d), standbys: a
mds: 1/1 daemons up, 1 hot standby
osd: 24 osds: 24 up (since 3d), 24 in (since 3d)
rgw: 3 daemons active (3 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 17 pools, 331 pgs
objects: 101.81k objects, 371 GiB
usage: 1.2 TiB used, 41 TiB / 42 TiB avail
pgs: 331 active+clean
io:
client: 7.4 KiB/s rd, 1.7 MiB/s wr, 9 op/s rd, 166 op/s wr
...
OSD_STAT USED AVAIL USED_RAW TOTAL HB_PEERS PG_SUM PRIMARY_PG_SUM
23 68 GiB 1.7 TiB 68 GiB 1.7 TiB [0,1,2,3,4,5,6,10,11,12,13,14,16,17,18,19,22] 37 12
4 33 GiB 1.7 TiB 33 GiB 1.7 TiB [0,1,2,3,5,6,7,8,9,10,11,12,13,14,15,16,17,20,22,23] 34 13
1 37 GiB 1.7 TiB 37 GiB 1.7 TiB [0,2,3,5,6,7,9,10,11,12,13,14,15,16,17,18,20,21,22] 42 13
19 39 GiB 1.7 TiB 39 GiB 1.7 TiB [0,2,3,6,7,9,10,11,12,13,15,16,17,18,20,22,23] 41 12
22 36 GiB 1.7 TiB 36 GiB 1.7 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,15,16,19,21,23] 36 11
21 62 GiB 1.7 TiB 62 GiB 1.7 TiB [0,1,2,3,5,6,8,9,10,13,14,15,16,17,18,19,20,22] 37 9
20 35 GiB 1.7 TiB 35 GiB 1.7 TiB [0,1,4,6,7,8,10,12,14,15,16,17,18,19,21] 39 10
18 67 GiB 1.7 TiB 67 GiB 1.7 TiB [1,2,5,7,8,9,10,11,13,14,16,17,19,20,21,22,23] 37 12
17 65 GiB 1.7 TiB 65 GiB 1.7 TiB [0,1,2,3,4,5,6,8,9,11,12,13,15,16,18,19,20,21,22,23] 34 14
16 35 GiB 1.7 TiB 35 GiB 1.7 TiB [0,1,4,5,7,8,9,10,11,12,13,15,17,18,19,20,21,22,23] 39 13
15 39 GiB 1.7 TiB 39 GiB 1.7 TiB [1,2,6,10,12,13,14,16,18,19,21,23] 41 5
0 34 GiB 1.7 TiB 34 GiB 1.7 TiB [1,2,4,5,7,8,9,10,11,12,13,14,15,16,17,19,20,21,22,23] 37 13
13 31 GiB 1.7 TiB 31 GiB 1.7 TiB [0,1,2,3,4,5,6,7,8,9,12,14,15,16,17,18,19,20,21,22,23] 36 16
2 33 GiB 1.7 TiB 33 GiB 1.7 TiB [0,1,3,6,8,11,13,14,15,16,17,18,19,20,21,22] 34 11
3 33 GiB 1.7 TiB 33 GiB 1.7 TiB [2,4,5,7,8,9,10,12,13,15,16,17,19,21,22,23] 33 12
5 64 GiB 1.7 TiB 64 GiB 1.7 TiB [0,1,3,4,6,8,10,11,12,13,14,15,16,17,18,19,20,21,22,23] 37 9
6 54 GiB 1.7 TiB 54 GiB 1.7 TiB [1,4,5,7,8,9,10,11,12,13,14,15,16,19,20,21,22,23] 32 9
7 38 GiB 1.7 TiB 38 GiB 1.7 TiB [0,1,3,4,6,8,10,11,12,13,14,15,16,17,18,19,20,22,23] 39 11
8 65 GiB 1.7 TiB 65 GiB 1.7 TiB [0,3,5,6,7,9,10,12,13,14,15,17,18,20,22] 33 14
9 95 GiB 1.7 TiB 95 GiB 1.7 TiB [0,1,3,6,8,10,11,12,13,14,15,16,17,18,19,20,21,23] 36 11
10 62 GiB 1.7 TiB 62 GiB 1.7 TiB [0,3,4,5,6,7,8,9,11,14,15,16,17,18,19,20,21,22,23] 36 14
11 35 GiB 1.7 TiB 35 GiB 1.7 TiB [0,1,2,3,5,8,9,10,12,14,15,16,18,19,20,22,23] 37 14
12 58 GiB 1.7 TiB 58 GiB 1.7 TiB [0,1,3,4,5,6,7,8,9,11,13,14,15,17,18,19,20,21,23] 35 13
14 56 GiB 1.7 TiB 56 GiB 1.7 TiB [1,2,4,5,6,7,8,9,10,12,13,15,18,19,20,21,22,23] 34 15
sum 1.1 TiB 41 TiB 1.1 TiB 42 TiB
Why did this happen?
By default, Ceph might not create the optimal number of PGs for each pool, resulting in data skew and uneven utilization of storage devices. Manually setting the pg_num for each pool is not a sustainable solution, as data volume is expected to grow over time.
That means Ceph’s automatic PG autoscaling isn’t working as expected, even though the pg_autoscaler module is active and pg_autoscale_mode is enabled in the Ceph cluster configuration.
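Before digging deeper, a quick sanity check (a hedged sketch, reusing the pool loop from above) is to confirm that the autoscaler is actually switched on per pool and as the cluster-wide default:
# Per-pool autoscale mode
for pool in $(ceph osd lspools | awk '{print $2}') ; do
  echo "pool: $pool - $(ceph osd pool get $pool pg_autoscale_mode)"
done

# Cluster-wide default for newly created pools
ceph config get mon osd_pool_default_pg_autoscale_mode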
By running ceph osd pool autoscale-status you can see that the statistics are empty:
bash-5.1$ ceph osd pool autoscale-status
<no-data>
Immediately after executing the above command, the following logs appear in the Ceph MGR Pod:
debug 2024-09-27T13:39:06.888+0000 7f8136222640 0 log_channel(cluster) log [DBG] : pgmap v168593: 301 pgs: 301 active+clean; 371 GiB data, 1.1 TiB used, 41 TiB / 42 TiB avail
debug 2024-09-27T13:39:07.728+0000 7f8137224640 0 log_channel(audit) log [DBG] : from='client.6497139 -' entity='client.admin' cmd=[{"prefix": "osd pool autoscale-status", "target": ["mon-mgr", ""]}]: dispatch
debug 2024-09-27T13:39:07.736+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool .rgw.root won't scale due to overlapping roots: {-2, -1}
debug 2024-09-27T13:39:07.736+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] Please See: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#automated-scaling
debug 2024-09-27T13:39:07.736+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool cephfs-metadata won't scale due to overlapping roots: {-2, -1}
debug 2024-09-27T13:39:07.736+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] Please See: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#automated-scaling
debug 2024-09-27T13:39:07.736+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool rgw-data-pool won't scale due to overlapping roots: {-2, -1}
debug 2024-09-27T13:39:07.736+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] Please See: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#automated-scaling
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool cephfs-data0 won't scale due to overlapping roots: {-2, -1}
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] Please See: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#automated-scaling
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool s3-store.rgw.log won't scale due to overlapping roots: {-2, -1}
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] Please See: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#automated-scaling
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool s3-store.rgw.buckets.index won't scale due to overlapping roots: {-2, -1}
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] Please See: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#automated-scaling
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool s3-store.rgw.otp won't scale due to overlapping roots: {-2, -1}
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] Please See: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#automated-scaling
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool s3-store.rgw.control won't scale due to overlapping roots: {-2, -1}
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] Please See: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#automated-scaling
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool s3-store.rgw.meta won't scale due to overlapping roots: {-2, -1}
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] Please See: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#automated-scaling
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool s3-store.rgw.buckets.non-ec won't scale due to overlapping roots: {-2, -1}
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] Please See: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#automated-scaling
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool rgw-meta-pool won't scale due to overlapping roots: {-2, -1}
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] Please See: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#automated-scaling
debug 2024-09-27T13:39:07.744+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool s3-store.rgw.buckets.data won't scale due to overlapping roots: {-2, -1}
debug 2024-09-27T13:39:07.744+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] Please See: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#automated-scaling
debug 2024-09-27T13:39:07.744+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool 1 contains an overlapping root -2... skipping scaling
debug 2024-09-27T13:39:07.744+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool 2 contains an overlapping root -1... skipping scaling
debug 2024-09-27T13:39:07.744+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool 3 contains an overlapping root -2... skipping scaling
debug 2024-09-27T13:39:07.744+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool 4 contains an overlapping root -1... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool 5 contains an overlapping root -1... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool 6 contains an overlapping root -1... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool 7 contains an overlapping root -1... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool 8 contains an overlapping root -1... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool 9 contains an overlapping root -1... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool 10 contains an overlapping root -1... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool 11 contains an overlapping root -1... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool 12 contains an overlapping root -1... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool 13 contains an overlapping root -1... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool 14 contains an overlapping root -1... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool 15 contains an overlapping root -2... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool 16 contains an overlapping root -2... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640 0 [pg_autoscaler WARNING root] pool 17 contains an overlapping root -2... skipping scaling
This behavior is partially explained in this article. Let’s explore it further by examining the Ceph CRUSH tree:
bash-5.1$ ceph osd crush tree --show-shadow
ID CLASS WEIGHT TYPE NAME
-2 nvme 41.91833 root default~nvme
-10 nvme 6.98639 host worker01-example-dev~nvme
10 nvme 1.74660 osd.10
11 nvme 1.74660 osd.11
12 nvme 1.74660 osd.12
13 nvme 1.74660 osd.13
-12 nvme 6.98639 host worker02-example-dev~nvme
14 nvme 1.74660 osd.14
15 nvme 1.74660 osd.15
16 nvme 1.74660 osd.16
17 nvme 1.74660 osd.17
-6 nvme 6.98639 host worker03-example-dev~nvme
2 nvme 1.74660 osd.2
5 nvme 1.74660 osd.5
7 nvme 1.74660 osd.7
9 nvme 1.74660 osd.9
-14 nvme 6.98639 host mst01-example-dev~nvme
20 nvme 1.74660 osd.20
21 nvme 1.74660 osd.21
22 nvme 1.74660 osd.22
23 nvme 1.74660 osd.23
-4 nvme 6.98639 host mst02-example-dev~nvme
0 nvme 1.74660 osd.0
3 nvme 1.74660 osd.3
6 nvme 1.74660 osd.6
18 nvme 1.74660 osd.18
-8 nvme 6.98639 host mst03-example-dev~nvme
1 nvme 1.74660 osd.1
4 nvme 1.74660 osd.4
8 nvme 1.74660 osd.8
19 nvme 1.74660 osd.19
-1 41.91833 root default
-9 6.98639 host worker01-example-dev
10 nvme 1.74660 osd.10
11 nvme 1.74660 osd.11
12 nvme 1.74660 osd.12
13 nvme 1.74660 osd.13
-11 6.98639 host worker02-example-dev
14 nvme 1.74660 osd.14
15 nvme 1.74660 osd.15
16 nvme 1.74660 osd.16
17 nvme 1.74660 osd.17
-5 6.98639 host worker03-example-dev
2 nvme 1.74660 osd.2
5 nvme 1.74660 osd.5
7 nvme 1.74660 osd.7
9 nvme 1.74660 osd.9
-13 6.98639 host mst01-example-dev
20 nvme 1.74660 osd.20
21 nvme 1.74660 osd.21
22 nvme 1.74660 osd.22
23 nvme 1.74660 osd.23
-3 6.98639 host mst02-example-dev
0 nvme 1.74660 osd.0
3 nvme 1.74660 osd.3
6 nvme 1.74660 osd.6
18 nvme 1.74660 osd.18
-7 6.98639 host mst03-example-dev
1 nvme 1.74660 osd.1
4 nvme 1.74660 osd.4
8 nvme 1.74660 osd.8
19 nvme 1.74660 osd.19
We observe two CRUSH roots containing the same OSDs:
- default~nvme
- default
This has occurred because our cluster has a dedicated device type: “nvme.” By default, Rook creates a default CRUSH root that includes all available devices, potentially of mixed types.
-1 41.91833 root default
If we explicitly specify a device class for any custom pool, a corresponding device-class (“shadow”) CRUSH root is also created. Naturally, this overlaps with the default root:
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
name: nova.ephemeral
spec:
...
deviceClass: nvme
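To see this for yourself, you can list the device classes Ceph has detected and check which root a Rook-created pool’s rule resolves to (using replicapool as an example):
ceph osd crush class ls
ceph osd crush rule dump replicapool | grep 'item_name'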
Just above the ‘Automated scaling’ section in the Ceph documentation, there’s a note explaining why pg_autoscaling may not work out of the box when initially configured.
In our case, creating a dedicated CRUSH rule for the .mgr pool is not sufficient on its own, yet it is still necessary, as indicated by the MGR logs:
-2 nvme 41.91833 root default~nvme
The Solution
Our solution includes two parts that are needed to fix the autoscaling issue.
Part 1: Create a Dedicated CRUSH Rule for the .mgr Pool
Ceph uses CRUSH (Controlled Replication Under Scalable Hashing) rules to determine data placement. The .mgr pool was using the default CRUSH rule, which contributed to the imbalance. We created a dedicated CRUSH rule and applied it to the .mgr pool:
ceph osd crush rule create-simple replicated-mgr-default default host firstn
ceph osd pool set .mgr crush_rule replicated-mgr-default
This ensures that the .mgr pool has its own data placement strategy, preventing interference with other pools.
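As an optional verification step, you can dump the new rule and confirm that the .mgr pool now references it:
ceph osd crush rule dump replicated-mgr-default | grep 'item_name'
ceph osd pool get .mgr crush_rule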
Part 2: Fix Overlapping CRUSH Roots
We found overlapping CRUSH roots because pools referenced different roots (default vs. default~nvme). This is expected Rook behaviour and has to be adjusted manually. It can also happen when Ceph is deployed on mixed storage drives, but in our case we only had NVMe drives.
To resolve the overlapping CRUSH roots, we first need to identify which pools use which CRUSH root. The following command provides reliable output as long as no custom CRUSH rules have been created manually. Otherwise, you would need to check which CRUSH rule applies to each pool using: ceph osd pool get <POOL_NAME> crush_rule
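For completeness, that per-pool check could look roughly like this:
for pool in $(ceph osd lspools | awk '{print $2}') ; do
  echo "pool: $pool - $(ceph osd pool get $pool crush_rule)"
done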
Fortunately, in our case, no manual changes were made to the CRUSH rules.
To identify pools with different CRUSH roots, we ran:
bash-5.1$ for rule in $(ceph osd crush rule ls) ; do ceph osd crush rule dump $rule | grep 'rule_name\|item_name' ; done
"rule_name": "replicated_rule",
"item_name": "default"
"rule_name": "replicapool",
"item_name": "default~nvme"
"rule_name": "cephfs-metadata",
"item_name": "default"
"rule_name": "cephfs-data0",
"item_name": "default"
"rule_name": "rgw-data-pool",
"item_name": "default"
"rule_name": "s3-store.rgw.log",
"item_name": "default"
"rule_name": "s3-store.rgw.otp",
"item_name": "default"
"rule_name": "s3-store.rgw.meta",
"item_name": "default"
"rule_name": "s3-store.rgw.buckets.index",
"item_name": "default"
"rule_name": "s3-store.rgw.buckets.non-ec",
"item_name": "default"
"rule_name": "s3-store.rgw.control",
"item_name": "default"
"rule_name": ".rgw.root",
"item_name": "default"
"rule_name": ".rgw.root_host",
"item_name": "default"
"rule_name": "s3-store.rgw.otp_host",
"item_name": "default"
"rule_name": "s3-store.rgw.control_host",
"item_name": "default"
"rule_name": "s3-store.rgw.log_host",
"item_name": "default"
"rule_name": "s3-store.rgw.buckets.non-ec_host",
"item_name": "default"
"rule_name": "rgw-meta-pool",
"item_name": "default"
"rule_name": "s3-store.rgw.meta_host",
"item_name": "default"
"rule_name": "s3-store.rgw.buckets.index_host",
"item_name": "default"
"rule_name": "s3-store.rgw.buckets.data",
"item_name": "default"
"rule_name": "rgw-meta-pool_host",
"item_name": "default"
"rule_name": "cinder.volumes.nvme",
"item_name": "default~nvme"
"rule_name": "glance.images",
"item_name": "default~nvme"
"rule_name": "nova.ephemeral",
"item_name": "default~nvme"
"rule_name": "cinder.backups",
"item_name": "default~nvme"
"rule_name": "replicated-mgr",
"item_name": "default~nvme"
The output showed that most pools were using the default CRUSH root, with a few on default~nvme. Since we only have nvme devices, we decided to standardize on the default root.
There is a related bug report in Rook proposing support for changing the deviceType; however, this feature was never implemented. As a result, we can safely update the CRUSH rules ourselves without worrying about Rook interfering in the process.
Standardizing CRUSH Roots
Since slightly more pools use the default CRUSH root than default~nvme, and our only device type is NVMe, we can simply switch the pools that use default~nvme over to the default root.
We updated the CRUSH rules for pools using default~nvme to use the default root:
POOLS="replicapool
cinder.volumes.nvme
glance.images
nova.ephemeral
replicated-mgr
cinder.backups"
for pool in $(echo $POOLS)
do
echo "update pool: ${pool}"
ceph osd crush rule rename ${pool} ${pool}-nvme
ceph osd crush rule create-simple ${pool} default host firstn
ceph osd pool set ${pool} crush_rule ${pool}
done
This change harmonized our CRUSH hierarchy and allowed pg_autoscale to actually work.
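The rename above leaves the old *-nvme rules behind. Once you have confirmed that no pool references them anymore, they can optionally be removed (a hedged cleanup step, not strictly required):
# Optional: remove the now-unused *-nvme rules
for pool in $(echo $POOLS)
do
  ceph osd crush rule rm ${pool}-nvme
done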
Now we can verify that pg_autoscale is working:
bash-5.1$ ceph osd pool autoscale-status
POOL SIZE TARGET SIZE RATE RAW CAPACITY RATIO TARGET RATIO EFFECTIVE RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE BULK
.mgr 7432k 3.0 42923G 0.0000 1.0 1 on False
.rgw.root 65536 3.0 42923G 0.0000 1.0 32 on False
replicapool 15775M 3.0 42923G 0.0011 1.0 32 on False
cephfs-metadata 258.5M 3.0 42923G 0.0000 4.0 16 on False
rgw-data-pool 0 1.5 42923G 0.0000 1.0 32 on False
cephfs-data0 27140M 3.0 42923G 0.0019 1.0 16 on False
s3-store.rgw.log 4737k 3.0 42923G 0.0000 1.0 32 on False
s3-store.rgw.buckets.index 4127 3.0 42923G 0.0000 1.0 32 on False
s3-store.rgw.otp 0 3.0 42923G 0.0000 1.0 32 on False
s3-store.rgw.control 0 3.0 42923G 0.0000 1.0 32 on False
s3-store.rgw.meta 62333 3.0 42923G 0.0000 1.0 32 on False
s3-store.rgw.buckets.non-ec 0 3.0 42923G 0.0000 1.0 32 on False
rgw-meta-pool 0 3.0 42923G 0.0000 1.0 32 on False
s3-store.rgw.buckets.data 21088k 1.5 42923G 0.0000 1.0 32 on False
cinder.volumes.nvme 270.4G 25600G 3.0 42923G 1.7892 1.0 256 on False
glance.images 34108M 3.0 42923G 0.0023 1.0 16 on False
nova.ephemeral 31414M 3.0 42923G 0.0021 1.0 16 on False
cinder.backups 4096 3.0 42923G 0.0000 1.0 32 on False
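The TARGET SIZE shown for cinder.volumes.nvme comes from a growth hint given to the autoscaler (presumably via target_size_bytes). If you expect a pool to hold a large share of the data, such a hint lets the autoscaler pre-scale the pool instead of waiting for the data to arrive. The values below are purely illustrative:
# Give the autoscaler a growth hint for a pool expected to hold most data
ceph osd pool set cinder.volumes.nvme target_size_ratio 0.6
# or an absolute hint in bytes:
# ceph osd pool set cinder.volumes.nvme target_size_bytes 27487790694400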
Result: data distribution in the Ceph cluster now looks much more even:
$ ceph pg dump
...
OSD_STAT USED AVAIL USED_RAW TOTAL HB_PEERS PG_SUM PRIMARY_PG_SUM
9 259 GiB 1.5 TiB 259 GiB 1.7 TiB [0,1,2,3,4,5,6,7,8,10,11,12,13,14,15,16,17,18,19,20,21,22,23] 149 27
5 189 GiB 1.6 TiB 189 GiB 1.7 TiB [0,1,2,3,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23] 148 26
2 212 GiB 1.5 TiB 212 GiB 1.7 TiB [0,1,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23] 162 35
7 291 GiB 1.5 TiB 291 GiB 1.7 TiB [0,1,2,3,4,5,6,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23] 174 32
23 215 GiB 1.5 TiB 215 GiB 1.7 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22] 151 32
22 193 GiB 1.6 TiB 193 GiB 1.7 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,23] 152 28
21 200 GiB 1.6 TiB 200 GiB 1.7 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,22,23] 139 27
20 337 GiB 1.4 TiB 337 GiB 1.7 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23] 172 38
19 224 GiB 1.5 TiB 224 GiB 1.7 TiB [0,1,2,3,4,5,6,7,9,10,11,12,13,14,15,16,17,18,20,21,22,23] 150 23
18 158 GiB 1.6 TiB 158 GiB 1.7 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,19,20,21,22,23] 138 30
17 249 GiB 1.5 TiB 249 GiB 1.7 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18,19,20,21,22,23] 157 32
0 187 GiB 1.6 TiB 187 GiB 1.7 TiB [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23] 164 34
13 180 GiB 1.6 TiB 180 GiB 1.7 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,20,21,22,23] 156 36
1 226 GiB 1.5 TiB 226 GiB 1.7 TiB [0,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23] 163 36
14 204 GiB 1.5 TiB 204 GiB 1.7 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,21,22,23] 144 27
3 201 GiB 1.6 TiB 201 GiB 1.7 TiB [0,1,2,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23] 144 34
16 229 GiB 1.5 TiB 229 GiB 1.7 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,17,18,19,20,21,22,23] 167 36
4 175 GiB 1.6 TiB 175 GiB 1.7 TiB [0,1,2,3,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23] 140 25
6 189 GiB 1.6 TiB 189 GiB 1.7 TiB [0,1,2,3,4,5,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23] 151 24
8 241 GiB 1.5 TiB 241 GiB 1.7 TiB [0,1,2,3,4,5,6,7,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23] 151 37
10 225 GiB 1.5 TiB 225 GiB 1.7 TiB [0,1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18,19,20,21,22,23] 145 35
11 193 GiB 1.6 TiB 193 GiB 1.7 TiB [0,1,2,3,4,5,6,7,8,9,10,12,13,14,15,16,17,18,19,20,21,22,23] 154 37
12 217 GiB 1.5 TiB 217 GiB 1.7 TiB [0,1,2,3,4,5,6,7,8,9,10,11,13,14,15,16,17,18,19,20,21,22,23] 159 37
15 229 GiB 1.5 TiB 229 GiB 1.7 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,16,17,18,19,20,21,22,23] 169 25
sum 5.1 TiB 37 TiB 5.1 TiB 42 TiB
Even data distribution is especially important in small clusters, where a single Ceph node failure takes out a large percentage of the overall storage and triggers a rebalance.
Solution (TLDR)
1. Add a CRUSH rule for the ‘.mgr’ pool:
ceph osd crush rule create-simple replicated-mgr-default default host firstn
ceph osd pool set .mgr crush_rule replicated-mgr-default
2. Change the CRUSH root for the OpenStack pools:
# Find pools with a non-default CRUSH root
for rule in $(ceph osd crush rule ls) ; do ceph osd crush rule dump $rule | grep 'rule_name\|item_name' ; done
# List of pools for which we want to change CRUSH root
POOLS="replicapool
cinder.volumes.nvme
glance.images
nova.ephemeral
replicated-mgr
cinder.backups"
for pool in $(echo $POOLS)
do
echo "update pool: ${pool}"
ceph osd crush rule rename ${pool} ${pool}-nvme
ceph osd crush rule create-simple ${pool} default host firstn
ceph osd pool set ${pool} crush_rule ${pool}
done
We updated the CRUSH root for several OpenStack data pools because our cluster contains only one device type (NVMe) and will not gain HDDs or SSDs in the future. To simplify the configuration, we used the default CRUSH root for all pools.
If you need to add another device type (e.g., SSD) in the future, you will need to take two additional steps:
- Exclude the SSD OSDs from the default CRUSH root.
- Create a new SSD CRUSH root and apply it to the appropriate pools.
However, that’s a topic for another article.
Lessons Learned
- Dedicated CRUSH Rules: Assigning dedicated CRUSH rules to critical pools like .mgr prevents data distribution conflicts.
- Consistent CRUSH Roots: Standardizing on a single CRUSH root (default in our case) avoids complexity and improves autoscaling.
- Automating PG Management: Properly configuring pg_autoscale saves time and maintains balanced data distribution as the cluster grows.
Conclusion
In this article, we addressed Ceph PG Autoscaling for OpenStack pools, ensuring balanced data distribution across the Ceph cluster.
We learned that properly configuring Ceph PG Autoscaling with Rook-Ceph is crucial for maintaining balanced data distribution in OpenStack deployments. By creating dedicated CRUSH rules and standardizing CRUSH roots, we resolved the PG imbalance issue and automated PG management for growing the cluster in the future.
At Cloudification, we specialize in deploying highly available private clouds with OpenStack and Rook-Ceph. If you’re facing challenges with Ceph configuration or looking to optimize your cloud infrastructure, contact us today.
Our team is ready to help you design and run a robust and scalable storage cluster.