Fixing Ceph performance problems

First Aid Kit

Is It the Network?

The network soon became the main suspect because it was apparently the only infrastructure shared by all the components involved. However, extensive tests with iPerf and similar tools refuted this hypothesis. Between the clients of the Ceph cluster, the dual 25Gbps LACP link reliably delivered 25Gbps and more in the iPerf tests. The lack of clues was made worse by the error counters of all the network interface controllers (NICs) involved, as well as those on the network switches, stubbornly remaining at 0.
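If you want to reproduce a baseline test of this kind, a quick iPerf check between two cluster nodes is enough to rule out raw bandwidth as the bottleneck. The host name below is a placeholder for one of your Ceph nodes:

iperf3 -s
iperf3 -c ceph-node02 -P 4 -t 30

The first command starts the server on the receiving node; the second runs four parallel streams from the sending node for 30 seconds, which should saturate a 25Gbps LACP link if the network itself is healthy.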

From there on, it became a tedious search. In situations like this, you can only do one thing: trace the individual writes. As soon as a slow write process appeared in the monitoring system, I took a closer look. Ceph always reports the primary OSD of a slow write. The next step is to query that OSD's admin socket on the host on which it runs, and this proved to be very helpful.

In fact, each OSD keeps an internal record of the many operations it performs. The log of a primary OSD for an object also contains individual entries, including the start and end of copy operations for this object to the secondary OSDs. The command

ceph daemon osd.<number> dump_ops_in_flight

displays all the operations that an OSD in Ceph is currently performing. Past slow ops can be retrieved with the dump_historic_slow_ops parameter (Figure 2), whereas dump_historic_ops displays log messages about all past operations, but only for operations within a limited, recent time window.
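To avoid wading through the full JSON output, you can filter the interesting fields with jq. The OSD number is an example, and the exact field names can vary slightly between Ceph releases:

ceph daemon osd.17 dump_historic_slow_ops | jq '.ops[] | {description, duration}'

The duration value immediately tells you how long each logged operation took from start to finish.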

Figure 2: In this example, the primary OSD has waited no fewer than 16 minutes to get an OK from the secondary OSDs for the write operation.

Equipped with these tools, further monitoring became possible: For each individual slow write, the primary OSD could now be identified, and that OSD in turn revealed which secondary OSDs it had computed for the object. I expected their log messages for the same write operation to provide information about defective storage drives.

However, it quickly became clear that for most of the time the primary OSD spent waiting for responses from the secondary OSDs, the latter were not even aware of the task at hand. As soon as the write requests arrived at the secondary OSDs, they were completed within a few milliseconds; however, it often took several minutes for the requests to reach them.
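A comparison of this kind is possible because each op record contains a timestamped event list. A minimal sketch, again with a placeholder OSD number and with field and event names that can differ between Ceph versions, might look like this:

ceph daemon osd.17 dump_historic_slow_ops | jq '.ops[] | {description, duration, events: [.type_data.events[] | {time, event}]}'

Large gaps between consecutive events, for example between the entry marking the wait for the subops and the acknowledgment from a secondary OSD, show exactly where the minutes were lost.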

The Network Revisited

Because extensive testing had already ruled out the network hardware as a potential source of error, this looked like a Ceph problem. After much trial and error, the spotlight finally fell on the packet filters of the systems involved: The iptables successor nftables, which is used by default in CentOS 8, turned out to be the cause. It was not a misconfiguration. Instead, a bug in the Linux kernel caused the packet filter to drop traffic in a pattern that was never fully clear, which in turn explained why the problem in Ceph was so erratic. An update to a newer kernel finally remedied the situation.
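No single command reveals a kernel bug of this kind, but a few generic checks help narrow things down. The following lines are only a rough sketch and assume the conntrack-tools package is installed for the last command:

uname -r
nft list ruleset
conntrack -S

The first line shows which kernel is actually running, the second dumps the active nftables rules, and the third prints connection-tracking statistics, including drop counters, which can hint at a packet filter silently discarding traffic.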

As the example clearly shows, automated performance monitoring from within Ceph is one thing, but if you are dogged by persistent performance problems, you can usually look forward to an extended debugging session. At this point, it certainly does no harm to have the vendor of the distribution you are using on board as a support partner.

Conclusions

Performance monitoring in Ceph can be implemented easily with Ceph's on-board tools for recording metric data. At the end of the day, it is fairly unimportant whether you view the results in the Ceph Dashboard or Prometheus. However, this does not mean you should stop monitoring the system's classic performance parameters.
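If you have not yet wired Ceph up to Prometheus, enabling the built-in exporter in the Manager takes a single command; by default, the metrics then appear on TCP port 9283 of the active Manager node (the host name in the second line is a placeholder):

ceph mgr module enable prometheus
curl http://ceph-mgr01:9283/metrics

Prometheus then only needs a matching scrape job pointing at that endpoint.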

To a certain extent, it is unsatisfactory that detecting a problem in a cluster does not allow any direct conclusions to be drawn about its solution. In concrete terms, this means that once you know that a problem exists, the real work has just begun, and this work can rarely be automated.


The Author

Martin Gerhard Loschwitz is Cloud Platform Architect at Drei Austria and works on topics such as OpenStack, Kubernetes, and Ceph.
