Fixing Ceph performance problems

First Aid Kit

Is It the Network?

The network soon became the main suspect because it was apparently the only infrastructure shared by all the components involved. However, extensive tests with iPerf and similar tools refuted this hypothesis. Between the clients of the Ceph cluster, the dual 25Gbps LACP link reliably delivered 25Gbps and more in the iPerf tests. The lack of clues was made worse by the error counters of all the network interface controllers (NICs) involved, as well as those on the network switches, stubbornly remaining at 0.
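If you want to reproduce a baseline test of this kind, a quick iPerf check between two cluster nodes is enough to rule out raw bandwidth as the bottleneck. The host name below is a placeholder for one of your Ceph nodes:

iperf3 -s
iperf3 -c ceph-node02 -P 4 -t 30

The first command starts the server on the receiving node; the second runs four parallel streams from the sending node for 30 seconds, which should saturate a 25Gbps LACP link if the network itself is healthy.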

From there on, it became a tedious search. In situations like this, you can only do one thing: trace the individual writes. As soon as a slow write process appeared in the monitoring system, I took a closer look. Ceph always reports the primary OSD of a slow write. The next step is to query that OSD's admin socket on the host on which it runs, and this proved to be very helpful.

In fact, each OSD keeps an internal record of the many operations it performs. The log of a primary OSD for an object also contains individual entries, including the start and end of copy operations for this object to the secondary OSDs. The command

ceph daemon osd.<number> dump_ops_in_flight

displays all the operations that an OSD in Ceph is currently performing. Past slow ops can be retrieved with the dump_historic_slow_ops parameter (Figure 2), whereas dump_historic_ops displays log messages about all past operations, but only for operations within a limited, recent time window.
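To avoid wading through the full JSON output, you can filter the interesting fields with jq. The OSD number is an example, and the exact field names can vary slightly between Ceph releases:

ceph daemon osd.17 dump_historic_slow_ops | jq '.ops[] | {description, duration}'

The duration value immediately tells you how long each logged operation took from start to finish.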

Figure 2: In this example, the primary OSD has waited no fewer than 16 minutes to get an OK from the secondary OSDs for the write operation.

Equipped with these tools, further monitoring became possible: For each individual slow write, the primary OSD could now be identified, and that OSD in turn revealed which secondary OSDs it had computed for the object. I expected their log messages for the same write operation to provide information about defective storage drives.

However, it quickly became clear that for most of the time the primary OSD spent waiting for responses from the secondary OSDs, the latter were not even aware of the task at hand. As soon as the write requests arrived at the secondary OSDs, they were completed within a few milliseconds; however, it often took several minutes for the requests to reach them.
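A comparison of this kind is possible because each op record contains a timestamped event list. A minimal sketch, again with a placeholder OSD number and with field and event names that can differ between Ceph versions, might look like this:

ceph daemon osd.17 dump_historic_slow_ops | jq '.ops[] | {description, duration, events: [.type_data.events[] | {time, event}]}'

Large gaps between consecutive events, for example between the entry marking the wait for the subops and the acknowledgment from a secondary OSD, show exactly where the minutes were lost.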

The Network Revisited

Because extensive testing had already ruled out the network hardware as a potential source of error, this looked like a Ceph problem. After much trial and error, the spotlight finally fell on the packet filters of the systems involved: The iptables successor nftables, which is used by default in CentOS 8, turned out to be the cause. It was not a misconfiguration. Instead, a bug in the Linux kernel caused the packet filter to drop traffic in a pattern that was never fully clear, which in turn explained why the problem in Ceph was so erratic. An update to a newer kernel finally remedied the situation.
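No single command reveals a kernel bug of this kind, but a few generic checks help narrow things down. The following lines are only a rough sketch and assume the conntrack-tools package is installed for the last command:

uname -r
nft list ruleset
conntrack -S

The first line shows which kernel is actually running, the second dumps the active nftables rules, and the third prints connection-tracking statistics, including drop counters, which can hint at a packet filter silently discarding traffic.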

As the example clearly shows, automated performance monitoring from within Ceph is one thing, but if you are dogged by persistent performance problems, you can usually look forward to an extended debugging session. At this point, it certainly does no harm to have the vendor of the distribution you are using on board as a support partner.

Conclusions

Performance monitoring in Ceph can be implemented easily with Ceph's on-board tools for recording metric data. At the end of the day, it is fairly unimportant whether you view the results in the Ceph Dashboard or Prometheus. However, this does not mean you should stop monitoring the system's classic performance parameters.
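If you have not yet wired Ceph up to Prometheus, enabling the built-in exporter in the Manager takes a single command; by default, the metrics then appear on TCP port 9283 of the active Manager node (the host name in the second line is a placeholder):

ceph mgr module enable prometheus
curl http://ceph-mgr01:9283/metrics

Prometheus then only needs a matching scrape job pointing at that endpoint.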

To a certain extent, it is unsatisfactory that detecting a problem in a cluster does not allow any direct conclusions to be drawn about its solution. In concrete terms, this means that once you know that a problem exists, the real work has just begun, and this work can rarely be automated.


The Author

Martin Gerhard Loschwitz is Cloud Platform Architect at Drei Austria and works on topics such as OpenStack, Kubernetes, and Ceph.
