<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Hi all,</p>
<p>I've been noticing intermittent issues with mail delivery over
the past few days. I finally sat down to dig into them yesterday
and today, and have a temporary fix in place. It turns out they
trace right back to the same Ceph I/O issues that [333] described
in his last email (which has hopefully finally sent now that things
are clearing). Thanks to [MTL] and [TPG] for the troubleshooting help.</p>
<p>The chain of troubleshooting goes something like this:<br>
</p>
<ul>
<li>Issue: Mail isn't being delivered/is being rejected. Cause:
Mail delivery stops, as intended, whenever the mail server is
disconnected from Active Directory (AD).</li>
<li>Issue: Mailfish keeps losing its Active Directory connection.
Cause: Mailfish isn't able to pick up the keys necessary to
connect due to Kerberos timing out.</li>
<li>Issue: Kerberos connections to Samson (AD server) hang/time
out. Cause: The samba[kdc] process on Samson is spending most of
its time stuck writing data to disk.</li>
<li>Issue: Processes stuck in D (uninterruptible sleep, usually
waiting on I/O) state on Samson. Cause: High disk write latency
on the underlying Ceph RBD backing storage for Samson's / disk.</li>
</ul>
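<p>For anyone wanting to check the last step of that chain on another
machine, here is a minimal sketch (not what I actually ran) that lists
processes currently in D state. It assumes a Linux procps `ps`; the
function names are made up for illustration.</p>
<pre>
```python
import subprocess


def parse_d_state(ps_output):
    """Return (pid, command) pairs for processes whose state starts with D."""
    stuck = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)  # pid, state, command name
        if len(fields) == 3 and fields[1].startswith("D"):
            stuck.append((int(fields[0]), fields[2].strip()))
    return stuck


def find_stuck_processes():
    """Run ps once and report processes in uninterruptible sleep."""
    result = subprocess.run(
        ["ps", "-eo", "pid,stat,comm"],
        capture_output=True, text=True, check=True,
    )
    return parse_d_state(result.stdout)
```
</pre>
<p>A single hit means little, since processes pass through D state
briefly all the time. Calling `find_stuck_processes()` every few
seconds and seeing the same PID each time is the more telling sign.</p>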
<p>Once I had pinned the issue down to the AD connection on
Mailfish, the commands `sssctl domain-status AD.UCC.GU.UWA.EDU.AU`
(for status checks) and `sss_debuglevel 6` (to write more detailed
logs) were very useful. FYI, SSSD is the client software running on
each machine that connects to AD, while samba-ad-dc is the server
software for AD that runs on Samson.<br>
</p>
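<p>If you want to script the status check, something like the sketch
below might do. It assumes the `sssctl domain-status` output contains
a line of the form "Online status: Online" or "Online status: Offline",
which may differ between SSSD versions; the function names and the
30-second timeout are my own picks.</p>
<pre>
```python
import subprocess


def is_online(status_text):
    """Return True/False from an 'Online status:' line, or None if absent."""
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "online status":
            return value.strip().lower() == "online"
    return None  # assumed line format not found


def check_domain(domain, timeout_seconds=30):
    """Run sssctl domain-status (usually needs root) and interpret it."""
    try:
        result = subprocess.run(
            ["sssctl", "domain-status", domain],
            capture_output=True, text=True, timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return None  # a hang is itself worth investigating
    return is_online(result.stdout)
```
</pre>
<p>For example, `check_domain("AD.UCC.GU.UWA.EDU.AU")` returning
False or None on a machine would be a hint that it has lost its AD
connection.</p>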
<p>For now, I have worked around the issue by migrating Samson's disk
to local storage. While Mailfish and the queued/bounced mail were the
most visible symptoms, I believe there has been other jankiness
relating to authentication/accounts/etc. If you were having any
difficulties along those lines, retrying now might be worthwhile.<br>
</p>
<p>Note: shortly after I restarted samba-ad-dc, Pinball decided it
was time to fall off the domain and needed rejoining. /Sigh/. If
anyone reports repeated failed logins on a particular machine,
I'd try rejoining that machine to the domain as well.</p>
<p>Cheers,</p>
<p>James [MPT]<br>
</p>
</body>
</html>