Thursday, 27 February 2025

Configuring Ceph pg_autoscale with Rook for OpenStack Deployments: A Guide to Balanced Data Distribution

At Cloudification, we deploy private clouds based on OpenStack, leveraging Rook-Ceph as a highly available storage solution. One of the recurring issues we faced during installation was configuring the Ceph cluster properly to ensure balanced data distribution across OSDs (Object Storage Daemons).

The Problem: PG Imbalance Alerts

Right after a fresh installation, we started receiving PGImbalance alerts from Prometheus, indicating poorly distributed data across hosts. PG stands for Placement Group, an abstraction beneath the storage pool: each individual object in the cluster is assigned to a PG. Since the number of objects in a cluster can reach hundreds of millions, PGs allow Ceph to operate and rebalance without having to address each object individually. Let's have a look at the placement groups in the cluster:

$ ceph pg dump
...
OSD_STAT  USED     AVAIL    USED_RAW  TOTAL    HB_PEERS                                         PG_SUM  PRIMARY_PG_SUM
23         33 GiB  1.7 TiB    33 GiB  1.7 TiB                       [0,2,5,7,8,10,16,18,19,22]       4               0
4         113 MiB  1.7 TiB   113 MiB  1.7 TiB        [0,1,2,3,5,6,8,9,11,12,14,15,16,17,20,23]       2               1
1          49 GiB  1.7 TiB    49 GiB  1.7 TiB           [0,2,5,6,9,10,12,13,15,16,17,18,21,22]      26              19
19         23 GiB  1.7 TiB    23 GiB  1.7 TiB                      [1,2,3,5,10,16,18,20,21,22]      15              17
22         19 GiB  1.7 TiB    19 GiB  1.7 TiB                     [4,5,6,11,15,17,19,20,21,23]      11               0
21        226 GiB  1.5 TiB   226 GiB  1.7 TiB                     [1,3,9,10,13,16,17,18,20,22]     108              17
20        117 MiB  1.7 TiB   117 MiB  1.7 TiB                     [0,4,7,12,14,17,18,19,21,22]       5               0
18        258 GiB  1.5 TiB   258 GiB  1.7 TiB               [1,5,8,10,11,14,16,17,19,21,22,23]     122              19
17         34 GiB  1.7 TiB    34 GiB  1.7 TiB  [0,1,2,3,5,6,8,9,11,12,13,15,16,18,20,21,22,23]       6               4
16         33 GiB  1.7 TiB    33 GiB  1.7 TiB                      [0,5,7,8,11,12,13,15,17,20]      23               2
15        109 MiB  1.7 TiB   109 MiB  1.7 TiB                   [2,10,12,14,16,18,19,21,22,23]       4               0
0         109 MiB  1.7 TiB   109 MiB  1.7 TiB                      [1,2,7,8,12,13,14,17,20,23]       5               1
13        111 MiB  1.7 TiB   111 MiB  1.7 TiB                  [0,1,2,3,8,9,12,14,15,17,19,21]       7               2
2         116 MiB  1.7 TiB   116 MiB  1.7 TiB                     [1,3,8,11,15,17,18,19,20,22]       3               0
3          33 GiB  1.7 TiB    33 GiB  1.7 TiB                        [2,4,5,7,8,9,10,11,16,23]      12               0
5          52 GiB  1.7 TiB    52 GiB  1.7 TiB      [1,4,6,11,12,13,14,16,17,18,19,20,21,22,23]      16               2
6          23 GiB  1.7 TiB    23 GiB  1.7 TiB                      [4,5,7,9,10,11,15,19,20,22]       4               2
7         793 MiB  1.7 TiB   793 MiB  1.7 TiB      [0,1,3,4,6,8,10,12,13,14,15,16,18,19,21,23]       4              20
8          34 GiB  1.7 TiB    34 GiB  1.7 TiB                      [0,5,7,9,12,13,14,18,20,22]       5               2
9          60 GiB  1.7 TiB    60 GiB  1.7 TiB                      [0,1,3,8,10,12,13,16,17,21]       5               2
10        216 GiB  1.5 TiB   216 GiB  1.7 TiB       [1,3,4,5,6,7,9,11,12,14,15,16,18,19,21,22]     101              18
11        101 MiB  1.7 TiB   101 MiB  1.7 TiB                     [1,2,5,10,12,16,18,19,22,23]       4               1
12         54 GiB  1.7 TiB    54 GiB  1.7 TiB           [0,1,3,5,6,7,8,9,10,11,13,14,18,20,21]      16              34
14         25 GiB  1.7 TiB    25 GiB  1.7 TiB                   [4,5,6,7,10,12,13,15,19,20,22]       5               2
sum       1.1 TiB   41 TiB   1.1 TiB   42 TiB

Let’s check how many PGs are configured for pools:

bash-5.1$ for pool in $(ceph osd lspools | awk '{print $2}') ; do echo "pool: $pool - pg_num: `ceph osd pool get $pool pg_num`" ; done
pool: .rgw.root - pg_num: pg_num: 1
pool: replicapool - pg_num: pg_num: 1
pool: .mgr - pg_num: pg_num: 1
pool: rgw-data-pool - pg_num: pg_num: 1
pool: s3-store.rgw.log - pg_num: pg_num: 1
pool: s3-store.rgw.control - pg_num: pg_num: 1
pool: s3-store.rgw.buckets.index - pg_num: pg_num: 1
pool: s3-store.rgw.otp - pg_num: pg_num: 1
pool: s3-store.rgw.buckets.non-ec - pg_num: pg_num: 1
pool: s3-store.rgw.meta - pg_num: pg_num: 1
pool: rgw-meta-pool - pg_num: pg_num: 1
pool: s3-store.rgw.buckets.data - pg_num: pg_num: 1
pool: cephfs-metadata - pg_num: pg_num: 1
pool: cephfs-data0 - pg_num: pg_num: 1
pool: cinder.volumes.hdd - pg_num: pg_num: 1
pool: cinder.backups - pg_num: pg_num: 1
pool: glance.images - pg_num: pg_num: 1
pool: nova.ephemeral - pg_num: pg_num: 1

This directly correlates with the imbalanced OSD utilization: Ceph had created only 1 Placement Group per pool, leading to inefficient data distribution.

To diagnose the issue, we used the rados df command to identify the pools consuming the most space and adjusted their pg_num accordingly. The Ceph placement group documentation explains how to calculate a suitable number.
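As a rough illustration only (the right value depends on how much data each pool is expected to hold), the classic rule of thumb from the Ceph documentation is to target around 100 PGs per OSD:

# total PGs ≈ (number of OSDs * 100) / replica size, rounded to a power of 2,
# then divided across pools according to their expected share of the data.
# For our 24 OSDs with 3x replication:
echo $(( 24 * 100 / 3 ))    # 800 -> round to 512 or 1024 and split across pools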

If we manually reconfigure the number of PGs for several pools, for example the Cinder, Nova, Glance and CephFS pools:

$ ceph osd pool set cinder.volumes.nvme pg_num 256
$ ceph osd pool set nova.ephemeral pg_num 16
$ ceph osd pool set glance.images pg_num 16
$ ceph osd pool set cephfs-data0 pg_num 16
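On recent Ceph releases (Nautilus and later) the matching pgp_num is adjusted automatically as the PGs split; on older releases it may need to be set by hand. Either way, the split and backfill can be observed while it runs, for example:

# Older releases only: pgp_num must follow pg_num for data to actually move
# ceph osd pool set cinder.volumes.nvme pgp_num 256

# Watch the PGs split and backfill until everything is active+clean again
watch -n 10 'ceph -s'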

This triggers rebalancing, resulting in more balanced usage and the resolution of the alert:

bash-5.1$ ceph -s
  cluster:
    id:     a6ab9446-2c0d-42f4-b009-514e989fd4a0
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum b,d,f (age 3d)
    mgr: b(active, since 3d), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 24 osds: 24 up (since 3d), 24 in (since 3d)
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   17 pools, 331 pgs
    objects: 101.81k objects, 371 GiB
    usage:   1.2 TiB used, 41 TiB / 42 TiB avail
    pgs:     331 active+clean

  io:
    client:   7.4 KiB/s rd, 1.7 MiB/s wr, 9 op/s rd, 166 op/s wr

...

OSD_STAT  USED     AVAIL    USED_RAW  TOTAL    HB_PEERS                                                 PG_SUM  PRIMARY_PG_SUM
23         68 GiB  1.7 TiB    68 GiB  1.7 TiB            [0,1,2,3,4,5,6,10,11,12,13,14,16,17,18,19,22]      37              12
4          33 GiB  1.7 TiB    33 GiB  1.7 TiB     [0,1,2,3,5,6,7,8,9,10,11,12,13,14,15,16,17,20,22,23]      34              13
1          37 GiB  1.7 TiB    37 GiB  1.7 TiB      [0,2,3,5,6,7,9,10,11,12,13,14,15,16,17,18,20,21,22]      42              13
19         39 GiB  1.7 TiB    39 GiB  1.7 TiB           [0,2,3,6,7,9,10,11,12,13,15,16,17,18,20,22,23]      41              12
22         36 GiB  1.7 TiB    36 GiB  1.7 TiB            [0,1,2,3,4,5,6,7,8,9,10,11,12,15,16,19,21,23]      36              11
21         62 GiB  1.7 TiB    62 GiB  1.7 TiB          [0,1,2,3,5,6,8,9,10,13,14,15,16,17,18,19,20,22]      37               9
20         35 GiB  1.7 TiB    35 GiB  1.7 TiB                 [0,1,4,6,7,8,10,12,14,15,16,17,18,19,21]      39              10
18         67 GiB  1.7 TiB    67 GiB  1.7 TiB           [1,2,5,7,8,9,10,11,13,14,16,17,19,20,21,22,23]      37              12
17         65 GiB  1.7 TiB    65 GiB  1.7 TiB     [0,1,2,3,4,5,6,8,9,11,12,13,15,16,18,19,20,21,22,23]      34              14
16         35 GiB  1.7 TiB    35 GiB  1.7 TiB      [0,1,4,5,7,8,9,10,11,12,13,15,17,18,19,20,21,22,23]      39              13
15         39 GiB  1.7 TiB    39 GiB  1.7 TiB                       [1,2,6,10,12,13,14,16,18,19,21,23]      41               5
0          34 GiB  1.7 TiB    34 GiB  1.7 TiB   [1,2,4,5,7,8,9,10,11,12,13,14,15,16,17,19,20,21,22,23]      37              13
13         31 GiB  1.7 TiB    31 GiB  1.7 TiB   [0,1,2,3,4,5,6,7,8,9,12,14,15,16,17,18,19,20,21,22,23]      36              16
2          33 GiB  1.7 TiB    33 GiB  1.7 TiB             [0,1,3,6,8,11,13,14,15,16,17,18,19,20,21,22]      34              11
3          33 GiB  1.7 TiB    33 GiB  1.7 TiB              [2,4,5,7,8,9,10,12,13,15,16,17,19,21,22,23]      33              12
5          64 GiB  1.7 TiB    64 GiB  1.7 TiB  [0,1,3,4,6,8,10,11,12,13,14,15,16,17,18,19,20,21,22,23]      37               9
6          54 GiB  1.7 TiB    54 GiB  1.7 TiB        [1,4,5,7,8,9,10,11,12,13,14,15,16,19,20,21,22,23]      32               9
7          38 GiB  1.7 TiB    38 GiB  1.7 TiB     [0,1,3,4,6,8,10,11,12,13,14,15,16,17,18,19,20,22,23]      39              11
8          65 GiB  1.7 TiB    65 GiB  1.7 TiB                 [0,3,5,6,7,9,10,12,13,14,15,17,18,20,22]      33              14
9          95 GiB  1.7 TiB    95 GiB  1.7 TiB       [0,1,3,6,8,10,11,12,13,14,15,16,17,18,19,20,21,23]      36              11
10         62 GiB  1.7 TiB    62 GiB  1.7 TiB       [0,3,4,5,6,7,8,9,11,14,15,16,17,18,19,20,21,22,23]      36              14
11         35 GiB  1.7 TiB    35 GiB  1.7 TiB            [0,1,2,3,5,8,9,10,12,14,15,16,18,19,20,22,23]      37              14
12         58 GiB  1.7 TiB    58 GiB  1.7 TiB        [0,1,3,4,5,6,7,8,9,11,13,14,15,17,18,19,20,21,23]      35              13
14         56 GiB  1.7 TiB    56 GiB  1.7 TiB          [1,2,4,5,6,7,8,9,10,12,13,15,18,19,20,21,22,23]      34              15
sum       1.1 TiB   41 TiB   1.1 TiB   42 TiB

Why did this happen?

By default, Ceph might not create the optimal number of PGs for each pool, resulting in data skew and uneven utilization of storage devices. Manually setting the pg_num for each pool is not a sustainable solution, as data volume is expected to grow over time.

That means Ceph’s automatic PG autoscaling isn’t working as expected, even though the pg_autoscale and pg_autoscale_mode options are enabled in the Ceph cluster configuration.
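Before digging deeper, it is worth double-checking what the cluster actually has configured. A quick sketch:

# Default autoscale mode applied to newly created pools
ceph config get mon osd_pool_default_pg_autoscale_mode

# Autoscale mode of every existing pool (on / warn / off)
for pool in $(ceph osd lspools | awk '{print $2}') ; do
   echo -n "${pool}: " ; ceph osd pool get ${pool} pg_autoscale_mode
done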

By running ceph osd pool autoscale-status you can see that the statistics are empty:

bash-5.1$ ceph osd pool autoscale-status
<no-data>

Immediately after executing the above command, the following logs appear in the Ceph MGR Pod:

debug 2024-09-27T13:39:06.888+0000 7f8136222640  0 log_channel(cluster) log [DBG] : pgmap v168593: 301 pgs: 301 active+clean; 371 GiB data, 1.1 TiB used, 41 TiB / 42 TiB avail
debug 2024-09-27T13:39:07.728+0000 7f8137224640  0 log_channel(audit) log [DBG] : from='client.6497139 -' entity='client.admin' cmd=[{"prefix": "osd pool autoscale-status", "target": ["mon-mgr", ""]}]: dispatch
debug 2024-09-27T13:39:07.736+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool .rgw.root won't scale due to overlapping roots: {-2, -1}
debug 2024-09-27T13:39:07.736+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] Please See: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#automated-scaling
debug 2024-09-27T13:39:07.736+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool cephfs-metadata won't scale due to overlapping roots: {-2, -1}
debug 2024-09-27T13:39:07.736+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] Please See: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#automated-scaling
debug 2024-09-27T13:39:07.736+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool rgw-data-pool won't scale due to overlapping roots: {-2, -1}
debug 2024-09-27T13:39:07.736+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] Please See: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#automated-scaling
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool cephfs-data0 won't scale due to overlapping roots: {-2, -1}
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] Please See: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#automated-scaling
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool s3-store.rgw.log won't scale due to overlapping roots: {-2, -1}
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] Please See: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#automated-scaling
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool s3-store.rgw.buckets.index won't scale due to overlapping roots: {-2, -1}
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] Please See: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#automated-scaling
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool s3-store.rgw.otp won't scale due to overlapping roots: {-2, -1}
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] Please See: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#automated-scaling
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool s3-store.rgw.control won't scale due to overlapping roots: {-2, -1}
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] Please See: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#automated-scaling
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool s3-store.rgw.meta won't scale due to overlapping roots: {-2, -1}
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] Please See: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#automated-scaling
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool s3-store.rgw.buckets.non-ec won't scale due to overlapping roots: {-2, -1}
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] Please See: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#automated-scaling
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool rgw-meta-pool won't scale due to overlapping roots: {-2, -1}
debug 2024-09-27T13:39:07.740+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] Please See: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#automated-scaling
debug 2024-09-27T13:39:07.744+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool s3-store.rgw.buckets.data won't scale due to overlapping roots: {-2, -1}
debug 2024-09-27T13:39:07.744+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] Please See: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#automated-scaling
debug 2024-09-27T13:39:07.744+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool 1 contains an overlapping root -2... skipping scaling
debug 2024-09-27T13:39:07.744+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool 2 contains an overlapping root -1... skipping scaling
debug 2024-09-27T13:39:07.744+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool 3 contains an overlapping root -2... skipping scaling
debug 2024-09-27T13:39:07.744+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool 4 contains an overlapping root -1... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool 5 contains an overlapping root -1... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool 6 contains an overlapping root -1... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool 7 contains an overlapping root -1... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool 8 contains an overlapping root -1... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool 9 contains an overlapping root -1... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool 10 contains an overlapping root -1... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool 11 contains an overlapping root -1... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool 12 contains an overlapping root -1... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool 13 contains an overlapping root -1... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool 14 contains an overlapping root -1... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool 15 contains an overlapping root -2... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool 16 contains an overlapping root -2... skipping scaling
debug 2024-09-27T13:39:07.748+0000 7f812a1ca640  0 [pg_autoscaler WARNING root] pool 17 contains an overlapping root -2... skipping scaling
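In a Rook deployment these warnings can also be followed live from the MGR pod. A minimal sketch, assuming the default rook-ceph namespace and the usual MGR deployment name:

kubectl -n rook-ceph logs deploy/rook-ceph-mgr-a --tail=200 -f | grep pg_autoscaler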

This behavior is partially explained in the Ceph documentation on automated scaling referenced in the warnings above. Let's explore it further by examining the Ceph CRUSH tree:

bash-5.1$ ceph osd crush tree --show-shadow
ID   CLASS  WEIGHT    TYPE NAME
 -2   nvme  41.91833  root default~nvme
-10   nvme   6.98639      host worker01-example-dev~nvme
 10   nvme   1.74660          osd.10
 11   nvme   1.74660          osd.11
 12   nvme   1.74660          osd.12
 13   nvme   1.74660          osd.13
-12   nvme   6.98639      host worker02-example-dev~nvme
 14   nvme   1.74660          osd.14
 15   nvme   1.74660          osd.15
 16   nvme   1.74660          osd.16
 17   nvme   1.74660          osd.17
 -6   nvme   6.98639      host worker03-example-dev~nvme
  2   nvme   1.74660          osd.2
  5   nvme   1.74660          osd.5
  7   nvme   1.74660          osd.7
  9   nvme   1.74660          osd.9
-14   nvme   6.98639      host mst01-example-dev~nvme
 20   nvme   1.74660          osd.20
 21   nvme   1.74660          osd.21
 22   nvme   1.74660          osd.22
 23   nvme   1.74660          osd.23
 -4   nvme   6.98639      host mst02-example-dev~nvme
  0   nvme   1.74660          osd.0
  3   nvme   1.74660          osd.3
  6   nvme   1.74660          osd.6
 18   nvme   1.74660          osd.18
 -8   nvme   6.98639      host mst03-example-dev~nvme
  1   nvme   1.74660          osd.1
  4   nvme   1.74660          osd.4
  8   nvme   1.74660          osd.8
 19   nvme   1.74660          osd.19
 -1         41.91833  root default
 -9          6.98639      host worker01-example-dev
 10   nvme   1.74660          osd.10
 11   nvme   1.74660          osd.11
 12   nvme   1.74660          osd.12
 13   nvme   1.74660          osd.13
-11          6.98639      host worker02-example-dev
 14   nvme   1.74660          osd.14
 15   nvme   1.74660          osd.15
 16   nvme   1.74660          osd.16
 17   nvme   1.74660          osd.17
 -5          6.98639      host worker03-example-dev
  2   nvme   1.74660          osd.2
  5   nvme   1.74660          osd.5
  7   nvme   1.74660          osd.7
  9   nvme   1.74660          osd.9
-13          6.98639      host mst01-example-dev
 20   nvme   1.74660          osd.20
 21   nvme   1.74660          osd.21
 22   nvme   1.74660          osd.22
 23   nvme   1.74660          osd.23
 -3          6.98639      host mst02-example-dev
  0   nvme   1.74660          osd.0
  3   nvme   1.74660          osd.3
  6   nvme   1.74660          osd.6
 18   nvme   1.74660          osd.18
 -7          6.98639      host mst03-example-dev
  1   nvme   1.74660          osd.1
  4   nvme   1.74660          osd.4
  8   nvme   1.74660          osd.8
 19   nvme   1.74660          osd.19

We observe two overlapping CRUSH roots containing the same OSDs:

  • default~nvme
  • default

This has occurred because our cluster has a dedicated device type: “nvme.” By default, Rook creates a default CRUSH root that includes all available devices, potentially of mixed types.

 -1         41.91833  root default

If we explicitly specify a device type for any custom pool, a corresponding CRUSH root is also created. Naturally, this overlaps with the default root:

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: nova.ephemeral
spec:
...
  deviceClass: nvme

Just above the ‘Automated scaling’ section in the Ceph documentation, there’s a note explaining why pg_autoscaling may not work out of the box when initially configured.

In our case, simply creating a dedicated CRUSH rule for the .mgr pool is not sufficient on its own, but it is still necessary, as the MGR logs indicate.

 -2   nvme  41.91833  root default~nvme

The Solution

Our solution consists of two parts, both of which are needed to fix the autoscaling issue.

Part 1: Create a Dedicated CRUSH Rule for the .mgr Pool

Ceph uses CRUSH (Controlled Replication Under Scalable Hashing) rules to determine data placement. The .mgr pool was using the default CRUSH rule, which contributed to the imbalance. We created a dedicated CRUSH rule and applied it to the .mgr pool:

ceph osd crush rule create-simple replicated-mgr-default default host firstn
ceph osd pool set .mgr crush_rule replicated-mgr-default

This ensures that the .mgr pool has its own data placement strategy, preventing interference with other pools.
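To verify that the change took effect, the rule assignment and its root can be checked directly (a quick sketch):

ceph osd pool get .mgr crush_rule                                  # should report replicated-mgr-default
ceph osd crush rule dump replicated-mgr-default | grep item_name   # should reference the "default" root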

Part 2: Fix Overlapping CRUSH Roots

We found overlapping CRUSH roots because pools referenced different roots (default vs. default~nvme). This is expected behaviour of Rook and needs to be adjusted manually. It can also happen when Ceph is deployed on mixed storage drives, but in our case we only had NVMe drives.
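To confirm which device classes Ceph has detected, and therefore which shadow roots it may have created, you can list them:

ceph osd crush class ls    # in our case this lists only "nvme"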

To resolve the overlapping CRUSH roots, we first need to identify how many pools use different CRUSH roots. The following command provides reliable output as long as no custom CRUSH rules have been created manually. Otherwise, you would need to check which CRUSH rule applies to each pool using: ceph osd pool get <POOL_NAME> crush_rule

Fortunately, in our case, no manual changes were made to the CRUSH rules.
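If custom rules do exist in your cluster, a per-pool check could look like this (a small sketch combining the commands above):

for pool in $(ceph osd lspools | awk '{print $2}') ; do
   echo -n "${pool}: " ; ceph osd pool get ${pool} crush_rule
done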

To identify pools with different CRUSH roots, we ran:

bash-5.1$ for rule in $(ceph osd crush rule ls) ; do ceph osd crush rule dump $rule | grep 'rule_name\|item_name' ; done
    "rule_name": "replicated_rule",
            "item_name": "default"
    "rule_name": "replicapool",
            "item_name": "default~nvme"
    "rule_name": "cephfs-metadata",
            "item_name": "default"
    "rule_name": "cephfs-data0",
            "item_name": "default"
    "rule_name": "rgw-data-pool",
            "item_name": "default"
    "rule_name": "s3-store.rgw.log",
            "item_name": "default"
    "rule_name": "s3-store.rgw.otp",
            "item_name": "default"
    "rule_name": "s3-store.rgw.meta",
            "item_name": "default"
    "rule_name": "s3-store.rgw.buckets.index",
            "item_name": "default"
    "rule_name": "s3-store.rgw.buckets.non-ec",
            "item_name": "default"
    "rule_name": "s3-store.rgw.control",
            "item_name": "default"
    "rule_name": ".rgw.root",
            "item_name": "default"
    "rule_name": ".rgw.root_host",
            "item_name": "default"
    "rule_name": "s3-store.rgw.otp_host",
            "item_name": "default"
    "rule_name": "s3-store.rgw.control_host",
            "item_name": "default"
    "rule_name": "s3-store.rgw.log_host",
            "item_name": "default"
    "rule_name": "s3-store.rgw.buckets.non-ec_host",
            "item_name": "default"
    "rule_name": "rgw-meta-pool",
            "item_name": "default"
    "rule_name": "s3-store.rgw.meta_host",
            "item_name": "default"
    "rule_name": "s3-store.rgw.buckets.index_host",
            "item_name": "default"
    "rule_name": "s3-store.rgw.buckets.data",
            "item_name": "default"
    "rule_name": "rgw-meta-pool_host",
            "item_name": "default"
    "rule_name": "cinder.volumes.nvme",
            "item_name": "default~nvme"
    "rule_name": "glance.images",
            "item_name": "default~nvme"
    "rule_name": "nova.ephemeral",
            "item_name": "default~nvme"
    "rule_name": "cinder.backups",
            "item_name": "default~nvme"
    "rule_name": "replicated-mgr",
            "item_name": "default~nvme"

The output showed that most pools were using the default CRUSH root, with a few on default~nvme. Since we only have nvme devices, we decided to standardize on the default root.

There is a related bug report in Rook that proposes support for changing the deviceType; however, this feature was never implemented. As a result, we can safely update the CRUSH rules manually without worrying about Rook interfering in the process.

Standardizing CRUSH Roots

Since there are slightly more pools using the default CRUSH root than default~nvme, and our only device type is NVMe, we can simply move the pools on the default~nvme root over to default.

We updated the CRUSH rules for pools using default~nvme to use the default root:

POOLS="replicapool
cinder.volumes.nvme
glance.images
nova.ephemeral
replicated-mgr
cinder.backups"

for pool in $(echo $POOLS)
do
   echo "update pool: ${pool}"
   ceph osd crush rule rename ${pool} ${pool}-nvme
   ceph osd crush rule create-simple ${pool} default host firstn
   ceph osd pool set ${pool} crush_rule ${pool}
done

This change harmonized our CRUSH hierarchy and allowed pg_autoscale to actually work.
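Once every pool references its new rule and rebalancing has finished, the renamed *-nvme rules are no longer used and can optionally be removed. A sketch, to be run only after confirming nothing references them anymore:

# Double-check that each pool now uses the new rule on the default root
for pool in $(echo $POOLS) ; do
   echo -n "${pool}: " ; ceph osd pool get ${pool} crush_rule
done

# Then drop the unused rules
for pool in $(echo $POOLS) ; do ceph osd crush rule rm ${pool}-nvme ; done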

Now we check that pg_autoscale is working:

bash-5.1$ ceph osd pool autoscale-status
POOL                           SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
.mgr                          7432k                3.0        42923G  0.0000                                  1.0       1              on         False
.rgw.root                    65536                 3.0        42923G  0.0000                                  1.0      32              on         False
replicapool                  15775M                3.0        42923G  0.0011                                  1.0      32              on         False
cephfs-metadata              258.5M                3.0        42923G  0.0000                                  4.0      16              on         False
rgw-data-pool                    0                 1.5        42923G  0.0000                                  1.0      32              on         False
cephfs-data0                 27140M                3.0        42923G  0.0019                                  1.0      16              on         False
s3-store.rgw.log              4737k                3.0        42923G  0.0000                                  1.0      32              on         False
s3-store.rgw.buckets.index    4127                 3.0        42923G  0.0000                                  1.0      32              on         False
s3-store.rgw.otp                 0                 3.0        42923G  0.0000                                  1.0      32              on         False
s3-store.rgw.control             0                 3.0        42923G  0.0000                                  1.0      32              on         False
s3-store.rgw.meta            62333                 3.0        42923G  0.0000                                  1.0      32              on         False
s3-store.rgw.buckets.non-ec      0                 3.0        42923G  0.0000                                  1.0      32              on         False
rgw-meta-pool                    0                 3.0        42923G  0.0000                                  1.0      32              on         False
s3-store.rgw.buckets.data    21088k                1.5        42923G  0.0000                                  1.0      32              on         False
cinder.volumes.nvme          270.4G       25600G   3.0        42923G  1.7892                                  1.0     256              on         False
glance.images                34108M                3.0        42923G  0.0023                                  1.0      16              on         False
nova.ephemeral               31414M                3.0        42923G  0.0021                                  1.0      16              on         False
cinder.backups                4096                 3.0        42923G  0.0000                                  1.0      32              on         False

Result: data distribution across the Ceph cluster now looks much more even:

$ ceph pg dump

...
OSD_STAT  USED     AVAIL    USED_RAW  TOTAL    HB_PEERS                                                       PG_SUM  PRIMARY_PG_SUM
9         259 GiB  1.5 TiB   259 GiB  1.7 TiB  [0,1,2,3,4,5,6,7,8,10,11,12,13,14,15,16,17,18,19,20,21,22,23]     149              27
5         189 GiB  1.6 TiB   189 GiB  1.7 TiB  [0,1,2,3,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]     148              26
2         212 GiB  1.5 TiB   212 GiB  1.7 TiB  [0,1,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]     162              35
7         291 GiB  1.5 TiB   291 GiB  1.7 TiB  [0,1,2,3,4,5,6,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]     174              32
23        215 GiB  1.5 TiB   215 GiB  1.7 TiB   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22]     151              32
22        193 GiB  1.6 TiB   193 GiB  1.7 TiB   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,23]     152              28
21        200 GiB  1.6 TiB   200 GiB  1.7 TiB   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,22,23]     139              27
20        337 GiB  1.4 TiB   337 GiB  1.7 TiB   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23]     172              38
19        224 GiB  1.5 TiB   224 GiB  1.7 TiB     [0,1,2,3,4,5,6,7,9,10,11,12,13,14,15,16,17,18,20,21,22,23]     150              23
18        158 GiB  1.6 TiB   158 GiB  1.7 TiB   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,19,20,21,22,23]     138              30
17        249 GiB  1.5 TiB   249 GiB  1.7 TiB   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18,19,20,21,22,23]     157              32
0         187 GiB  1.6 TiB   187 GiB  1.7 TiB  [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]     164              34
13        180 GiB  1.6 TiB   180 GiB  1.7 TiB   [0,1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,20,21,22,23]     156              36
1         226 GiB  1.5 TiB   226 GiB  1.7 TiB  [0,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]     163              36
14        204 GiB  1.5 TiB   204 GiB  1.7 TiB   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,21,22,23]     144              27
3         201 GiB  1.6 TiB   201 GiB  1.7 TiB  [0,1,2,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]     144              34
16        229 GiB  1.5 TiB   229 GiB  1.7 TiB   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,17,18,19,20,21,22,23]     167              36
4         175 GiB  1.6 TiB   175 GiB  1.7 TiB  [0,1,2,3,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]     140              25
6         189 GiB  1.6 TiB   189 GiB  1.7 TiB  [0,1,2,3,4,5,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]     151              24
8         241 GiB  1.5 TiB   241 GiB  1.7 TiB  [0,1,2,3,4,5,6,7,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]     151              37
10        225 GiB  1.5 TiB   225 GiB  1.7 TiB   [0,1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18,19,20,21,22,23]     145              35
11        193 GiB  1.6 TiB   193 GiB  1.7 TiB   [0,1,2,3,4,5,6,7,8,9,10,12,13,14,15,16,17,18,19,20,21,22,23]     154              37
12        217 GiB  1.5 TiB   217 GiB  1.7 TiB   [0,1,2,3,4,5,6,7,8,9,10,11,13,14,15,16,17,18,19,20,21,22,23]     159              37
15        229 GiB  1.5 TiB   229 GiB  1.7 TiB   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,16,17,18,19,20,21,22,23]     169              25
sum       5.1 TiB   37 TiB   5.1 TiB   42 TiB

Equal distribution is very important, especially in small clusters, where a single Ceph node failure can take out a large percentage of the overall storage and trigger a large rebalance.
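A quick way to keep an eye on the balance over time, without dumping all placement groups, is ceph osd df, which reports per-OSD utilization together with the variance and standard deviation:

ceph osd df tree    # per-OSD %USE, plus MIN/MAX VAR and STDDEV in the summary line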

Solution (TLDR)

1. Add a CRUSH rule for the ‘.mgr’ BlockPool:

ceph osd crush rule create-simple replicated-mgr-default default host firstn
ceph osd pool set .mgr crush_rule replicated-mgr-default

2. Change the CRUSH root for the OpenStack pools:

# Find list of Pools with non default CRUSH root
for rule in $(ceph osd crush rule ls) ; do ceph osd crush rule dump $rule | grep 'rule_name\|item_name' ; done

# List of pools for which we want to change CRUSH root
POOLS="replicapool
cinder.volumes.nvme
glance.images
nova.ephemeral
replicated-mgr
cinder.backups"

for pool in $(echo $POOLS)
do
   echo "update pool: ${pool}"
   ceph osd crush rule rename ${pool} ${pool}-nvme
   ceph osd crush rule create-simple ${pool} default host firstn
   ceph osd pool set ${pool} crush_rule ${pool}
done

We updated the CRUSH root for several OpenStack data pools because our cluster contains only one device type, NVMe, and will not have HDDs or SSDs in the future. To simplify the configuration, we used the default CRUSH root for all pools.

If you need to add another device type (e.g., SSD) in the future, you will need to take two additional steps:

  • Exclude the SSD OSDs from the default CRUSH root.
  • Create a new SSD CRUSH root and apply it to the appropriate pools.

However, that’s a topic for another article.

Lessons Learned

  • Dedicated CRUSH Rules: Assigning dedicated CRUSH rules to critical pools like .mgr prevents data distribution conflicts.
  • Consistent CRUSH Roots: Standardizing on a single CRUSH root (default in our case) avoids complexity and improves autoscaling.

  • Automating PG Management: Properly configuring pg_autoscale saves time and maintains balanced data distribution as the cluster grows.

Conclusion

In this article, we addressed Ceph PG Autoscaling for OpenStack pools, ensuring balanced data distribution across the Ceph cluster.

We learned that properly configuring Ceph PG Autoscaling with Rook-Ceph is crucial for maintaining balanced data distribution in OpenStack deployments. By creating dedicated CRUSH rules and standardizing CRUSH roots, we resolved the PG imbalance issue and automated PG management for growing the cluster in the future.

At Cloudification, we specialize in deploying highly available private clouds with OpenStack and Rook-Ceph. If you’re facing challenges with Ceph configuration or looking to optimize your cloud infrastructure, contact us today.

Our team is ready to help you design and run a robust and scalable storage cluster.
