[tech] [wheel] Spamassassin broken

Sun Apr 29 20:19:20 WST 2012

Hi all,

This morning I had a little free time and finally decided to take a look at 
our broken spamassassin setup on mooneye. This is my understanding of it, 
but this is all new to me, so please correct me if I've gone wrong 
somewhere.

I blew away the existing bayesian db using 'sa-learn --clear', and then used 
'sa-learn --sync', which seems to force it to create another. Now we've got 
to check the threshold settings and retrain it to detect UCC-specific spam.

In addition to the bayesian filter settings, /etc/spamassassin/local.cf also 
has some other filter settings which allocate a score based on some other 
criteria (such as: sent to a ucc group alias, html format, contains certain 
words that we've decided to block).
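
To get a feel for the "sent to a ucc group alias" criterion, the pattern 
from the __UCC_ALIAS rule quoted further down can be tried against a few 
headers. This is only a rough sketch: it assumes Python's re module treats 
this particular pattern the same way Perl does, and the sample headers are 
made up for illustration.

```python
import re

# Python translation of the __UCC_ALIAS pattern from local.cf.
# Perl's \@ is just a literal @, so it is written plainly here.
UCC_ALIAS = re.compile(
    r"(secretary|camp|coke|webmasters|door|doorgroup|david|dave|chris|webmaster)"
    r"@[^ ,]*ucc\."
)


def hits_alias_rule(headers: str) -> bool:
    """True if the pattern matches anywhere in the header text.

    The rule is declared against ALL, so any header line can trigger it.
    """
    return UCC_ALIAS.search(headers) is not None


# Made-up sample headers, for illustration only.
print(hits_alias_rule("To: webmasters@ucc.gu.uwa.edu.au"))  # True
print(hits_alias_rule("To: someone@ucc.gu.uwa.edu.au"))     # False
print(hits_alias_rule("From: dave@ucc.gu.uwa.edu.au"))      # True
```

Note the last case: because the rule looks at every header, mail sent 
from one of those addresses matches as well as mail sent to one.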

In the absence of a required_score setting in the config file, I assume it's 
on the default of 5 [3]. For those normal people who don't know what that 
setting is, it's the threshold score for determining whether something is 
spam or not. If 5 or higher, it's spam. The other important setting is 
bayes_auto_learn_threshold_spam, which is the score at which the bayesian 
filter will take that spam email and learn from it [3].

Here are the offending lines in local.cf that I believe caused our bayesian 
filter to learn the wrong thing:

====================================================
bayes_auto_learn_threshold_spam 7.0

header __UCC_ALIAS      ALL =~ /(secretary|camp|coke|webmasters|door|doorgroup|david|dave|chris|webmaster)\@[^ ,]*ucc\./
describe __UCC_ALIAS    Sent to UCC alias
meta UCC_ALIAS_HTML     (__UCC_ALIAS && HTML_MESSAGE)
describe UCC_ALIAS_HTML UCC alias mail with html
score UCC_ALIAS_HTML    7.0

score BAYES_00 0 0 0.0 -2.599
score BAYES_05 0 0 0.0 -0.413
score BAYES_20 0 0 0.8 -1.951
score BAYES_40 0 0 1.5 -1.096
score BAYES_50 0 0 4.2 0.001
score BAYES_60 0 0 5.4 1.0
score BAYES_80 0 0 6.0 2.0
score BAYES_95 0 0 9.4 3.0
score BAYES_99 0 0 10.0 3.5
=====================================================

The first line tells the bayesian filter that if an email has a score of 7.0 
or higher, it should be used as a spam email for learning. The second block 
allocates a score of 7.0 to any HTML email sent to one of the listed group 
aliases. The third block adds an extra score, taken from the third column, 
based on the probability of the email being spam according to the bayesian 
filter. As you can see, any HTML email to those aliases starts with a score 
of 7, and since none of the third-column values are negative, the bayesian 
result can only push it up. That means it will always be treated as spam, 
regardless of content. Since it has a high enough score, it is then used for 
auto-learning on the bayesian filter. Oops!
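
The arithmetic can be checked with a few lines of Python. This is a 
minimal sketch of the simplified model described in this email, not of 
SpamAssassin's real scoring or auto-learn logic (which has further 
conditions). It assumes required_score is at the default of 5, that no 
other rules fire, and that "7.0 or higher" is the right comparison for the 
auto-learn threshold. The function name classify is made up.

```python
# Values copied from the local.cf excerpt.
REQUIRED_SCORE = 5.0    # assumed default; not set in local.cf
AUTO_LEARN_SPAM = 7.0   # bayes_auto_learn_threshold_spam
UCC_ALIAS_HTML = 7.0    # score for HTML mail to a listed alias

# Third column of each "score BAYES_xx" line.
BAYES_THIRD_COLUMN = {
    "BAYES_00": 0.0, "BAYES_05": 0.0, "BAYES_20": 0.8,
    "BAYES_40": 1.5, "BAYES_50": 4.2, "BAYES_60": 5.4,
    "BAYES_80": 6.0, "BAYES_95": 9.4, "BAYES_99": 10.0,
}


def classify(bayes_rule: str) -> tuple[float, bool, bool]:
    """Return (total, tagged_as_spam, auto_learned_as_spam) for an HTML
    mail to a listed alias that also hits the given bayes rule."""
    total = UCC_ALIAS_HTML + BAYES_THIRD_COLUMN[bayes_rule]
    return total, total >= REQUIRED_SCORE, total >= AUTO_LEARN_SPAM


for rule in BAYES_THIRD_COLUMN:
    total, is_spam, learned = classify(rule)
    print(f"{rule}: total={total:.1f} spam={is_spam} auto-learned={learned}")
```

Under those assumptions every row comes out as spam and auto-learned, 
including BAYES_00, i.e. mail the bayesian filter itself thinks is almost 
certainly fine.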

We are definitely looking at the third column in that bayes block (bayes 
enabled, network tests disabled), since we pass the --local flag to 
spamassassin in /etc/default/spamassassin [1][2].
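
For anyone else puzzling over the four numbers on a score line, here is a 
small sketch of how the column gets picked, going by the scoring options 
doc in [1]: column 1 is bayes off and network tests off, column 2 is bayes 
off and network tests on, column 3 is bayes on and network tests off, and 
column 4 is both on. The helper names are made up, and whether bayes is 
actually enabled on mooneye is my assumption rather than something this 
code checks.

```python
def score_column(bayes_enabled: bool, network_tests_enabled: bool) -> int:
    """Return the 1-based column of a four-value score line that applies."""
    return 1 + (2 if bayes_enabled else 0) + (1 if network_tests_enabled else 0)


def pick_score(score_line: str, bayes_enabled: bool,
               network_tests_enabled: bool) -> float:
    """Pick the applicable value from a line like 'score NAME a b c d'."""
    fields = score_line.split()
    values = [float(v) for v in fields[2:6]]
    if len(values) != 4:
        raise ValueError("expected four score values")
    return values[score_column(bayes_enabled, network_tests_enabled) - 1]


# Our setup as I understand it: bayes on, network tests off (--local).
print(score_column(True, False))                                 # 3
print(pick_score("score BAYES_50 0 0 4.2 0.001", True, False))   # 4.2
```

If we turned network tests on, the same lines would switch to the fourth 
column, which has quite different values (some of them negative).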

So things I think we should change:
- I've already adjusted the setting for bayes_auto_learn_threshold_spam up 
to 10 so we don't have a broken filter now that it has been reset
- Adjust the html filter on list emails down to below 5.0 and let the 
bayesian filter increase the score if it's spam
- Looking at [2], I think we should investigate the use of network tests to 
reduce our spam level; it's not like mooneye is struggling for power
- I noticed skip_rbl_checks is set to true, so we're not checking any DNS 
blacklists. Would it be worth turning those checks back on?
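
As a rough sanity check of the first two changes, here is the same kind of 
arithmetic with the auto-learn threshold at 10 and the alias rule dropped 
to 3.0. The 3.0 is purely a placeholder I picked for illustration (the 
proposal only says "below 5.0"), required_score is again assumed to be the 
default of 5, and no other rules are assumed to fire.

```python
REQUIRED_SCORE = 5.0     # assumed default
AUTO_LEARN_SPAM = 10.0   # new bayes_auto_learn_threshold_spam
ALIAS_HTML_SCORE = 3.0   # hypothetical new score, only "below 5.0" is proposed

# Third column of the existing "score BAYES_xx" lines.
BAYES_THIRD_COLUMN = {
    "BAYES_00": 0.0, "BAYES_05": 0.0, "BAYES_20": 0.8,
    "BAYES_40": 1.5, "BAYES_50": 4.2, "BAYES_60": 5.4,
    "BAYES_80": 6.0, "BAYES_95": 9.4, "BAYES_99": 10.0,
}


def outcome(bayes_rule: str, alias_score: float = ALIAS_HTML_SCORE) -> dict:
    """Score an HTML mail to a listed alias under the proposed settings."""
    total = alias_score + BAYES_THIRD_COLUMN[bayes_rule]
    return {
        "total": total,
        "spam": total >= REQUIRED_SCORE,
        "auto_learned": total >= AUTO_LEARN_SPAM,
    }


for rule in BAYES_THIRD_COLUMN:
    print(rule, outcome(rule))
```

With those numbers, mail the bayesian filter rates as unlikely to be spam 
stays under 5, and only the BAYES_95 and BAYES_99 rows are high enough to 
be auto-learned.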

Cheers, Bob

[1] http://spamassassin.apache.org/full/3.1.x/doc/Mail_SpamAssassin_Conf.html#scoring_options
[2] http://wiki.apache.org/spamassassin/UsingNetworkTests
[3] http://wiki.apache.org/spamassassin/BasicConfiguration

-----Original Message----- 
From: Matt Johnston
Sent: Sunday, July 31, 2011 11:26 PM
To: tech at ucc.gu.uwa.edu.au
Subject: Re: [tech] [wheel] Spamassassin broken

This should go to tech@ not just wheel@, providing some
notes on UCC's spamassassin. If anyone wants to see the
bits on mooneye that are wheel-only let me know.

The context is that spamassassin was tagging large amounts
of genuine mail as spam so it's been (either permanently or
temporarily) disabled.

Filter on "X-SpamTest-Status: SPAM" from ITS's Ironports
instead, it's more reliable anyway.

Matt

On Sun, Jul 31, 2011 at 10:37:24PM +0800, Bob Adamson wrote:
> I'm just gonna put it out there - I have no idea how our mail spam
> filtering works or where it's configured. I've had a bit of a look at my
> procmailrc file and afaict it just looks for [SPAM] in the subject line.
> Anyway, could you possibly explain how/where it's configured and what
> exactly needs to change?

To expand on what's what:

- There's a spamd server for Spamassassin on mooneye. It
  listens on port 783
- When it was enabled, postfix (in /etc/postfix/master.cf)
  had "smtpd -o content_filter=spamfilter:".
  That then ran:
- /usr/local/sbin/newspamfilter.pl is what Bernard (iirc)
  wrote to run non-local mail through
  /usr/local/sbin/spamfilter which feeds mail to spamd. I
  think the latter script is what's packaged with spamassassin.
- The spamd learning happens with the "spamass" account. It
  has a logfile ~spamass/learnlog. I just took a look at it
  and it was complaining about
  "bayes: bad permissions on journal, can't read:
  /var/spamassassin-nobody/.spamassassin/bayes_journal"
  because that file's owned as root. I've now chowned it
  back to spamass. I wonder if that was related...
- There's a special spamass crontab:
  spamass@mooneye:~$ crontab -l -u spamass
  # m h dom mon dow command
  53/30 * * * * ~/learnspam
- That learns stuff that gets forwarded to the spamass
  user. I think spamassassin also learned from spam it
  filtered, see all the rules in /etc/spamassassin/local.cf
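
The journal ownership problem mentioned above is easy to re-check. A small
sketch in Python, assuming the journal path and the "spamass" account name
from the notes above; the function names are made up, and this only looks
at the owner, not at the rest of the permission bits.

```python
import os
import pwd


def owned_by_uid(path: str, uid: int) -> bool:
    """True if the file at path is owned by the given numeric user id."""
    return os.stat(path).st_uid == uid


def owned_by_user(path: str, username: str) -> bool:
    """Same check by account name; raises KeyError if the account is missing."""
    return owned_by_uid(path, pwd.getpwnam(username).pw_uid)


# Path and account name as given in the notes; skipped where they don't exist.
JOURNAL = "/var/spamassassin-nobody/.spamassassin/bayes_journal"
if os.path.exists(JOURNAL):
    try:
        print(f"{JOURNAL} owned by spamass: {owned_by_user(JOURNAL, 'spamass')}")
    except KeyError:
        print("no spamass account on this machine")
```

Running something like this from cron alongside learnspam would flag the
file going back to root before the learnlog fills up with errors.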

So perhaps we could try and fix the
/var/spamassassin-nobody/ bayesian database and then turn
spamassassin back on.

Matt