SpamAssassin - bayes

Last updated: 2018-04-11

Introduction

 

 * Scores mail by probability, using a Bayesian classifier

 



Basic Setting

 

Whether to use the naive-Bayesian-style classifier built into SpamAssassin.

use_bayes 1

Whether to use any machine-learning classifiers with SpamAssassin, such as the default 'BAYES_*' rules.

0 = disable use of any and all human-trained classifiers.

use_learner 1

Whether SpamAssassin should automatically feed high-scoring mails (or low-scoring mails, for non-spam) into its learning systems.

bayes_auto_learn 1

Score

# BAYES
# BAYES_00: 0 to 1%
# BAYES_05: 1 to 5%
# BAYES_20: 5 to 20%
score   BAYES_00        0
score   BAYES_05        1
score   BAYES_20        1
score   BAYES_40        2
score   BAYES_50        2
score   BAYES_60        2
score   BAYES_80        3
score   BAYES_95        3
score   BAYES_99        7
score   BAYES_999       10
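
The relationship between the Bayes probability and the scores above can be sketched as follows. This is only an illustration: the notes list the first three probability ranges, the remaining ranges are assumed from the stock 23_bayes.cf rules, and the boundary handling (half-open ranges) is a simplification. The names BAYES_RANGES, SCORES, bayes_rules and bayes_score are hypothetical.

```python
# Probability ranges per BAYES_* rule. Only the first three come from the
# notes; the others are ASSUMED from the stock 23_bayes.cf rule file.
BAYES_RANGES = [
    ("BAYES_00", 0.00, 0.01),
    ("BAYES_05", 0.01, 0.05),
    ("BAYES_20", 0.05, 0.20),
    ("BAYES_40", 0.20, 0.40),
    ("BAYES_50", 0.40, 0.60),
    ("BAYES_60", 0.60, 0.80),
    ("BAYES_80", 0.80, 0.95),
    ("BAYES_95", 0.95, 0.99),
    ("BAYES_99", 0.99, 1.00),
    ("BAYES_999", 0.999, 1.00),  # overlaps BAYES_99, so both can hit
]

# Scores copied from the "score BAYES_*" lines in the notes.
SCORES = {
    "BAYES_00": 0, "BAYES_05": 1, "BAYES_20": 1, "BAYES_40": 2,
    "BAYES_50": 2, "BAYES_60": 2, "BAYES_80": 3, "BAYES_95": 3,
    "BAYES_99": 7, "BAYES_999": 10,
}


def bayes_rules(probability):
    """Return the BAYES_* rules whose range contains the probability."""
    hits = []
    for name, low, high in BAYES_RANGES:
        # Simplification: ranges are treated as [low, high), and a
        # probability of exactly 1.0 is counted in the top ranges.
        if low <= probability < high or (high == 1.00 and probability == 1.00):
            hits.append(name)
    return hits


def bayes_score(probability):
    """Sum the configured scores of every BAYES_* rule that hits."""
    return sum(SCORES[name] for name in bayes_rules(probability))
```

With these scores, a mail rated 99.95% spam would hit both BAYES_99 and BAYES_999, so the Bayes rules alone would contribute 17 points.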

Threshold settings for Bayes auto-learning

# default: 0.1
bayes_auto_learn_threshold_nonspam n.nn 

# default: 12.0
bayes_auto_learn_threshold_spam n.nn

# default: 0
# 1 => autolearning is performed only when the Bayes classifier had a different
# opinion from what the autolearner is now trying to teach it.
# This strategy may or may not produce better future classifications,
# but it usually works very well, while also preventing unnecessary overlearning
# and slowing down database growth.
#
bayes_auto_learn_on_error 1
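
A rough sketch of how these thresholds could combine into an auto-learn decision is shown below. It is a simplified model, not SpamAssassin's actual code: the function name and parameters are hypothetical, the header/body minimum of 3 points is taken from the auto-learn notes later on this page, and the "on error" check is reduced to comparing a caller-supplied Bayes verdict.

```python
def autolearn_decision(score, header_points, body_points,
                       threshold_nonspam=0.1, threshold_spam=12.0,
                       learn_on_error=False, bayes_verdict=None):
    """Return "ham", "spam" or "no" for one scanned message.

    score          -- message score used for auto-learning
    header_points  -- points contributed by header rules
    body_points    -- points contributed by body rules
    bayes_verdict  -- hypothetical input: what Bayes already thinks
                      ("ham", "spam" or None), used only with learn_on_error
    """
    if score < threshold_nonspam:
        verdict = "ham"
    elif score >= threshold_spam and header_points >= 3 and body_points >= 3:
        # Learning as spam additionally needs 3 header and 3 body points.
        verdict = "spam"
    else:
        return "no"

    # Simplified model of bayes_auto_learn_on_error: skip learning when the
    # classifier already agrees with what would be taught.
    if learn_on_error and bayes_verdict == verdict:
        return "no"
    return verdict
```

For example, a message scoring 15.0 with only 1 header point would not be learned as spam in this model, even though it is above the spam threshold.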

 


Bayes in MySQL (SQL)

 

Dependencies

  • perl-DBI
  • perl-DBD-MySQL

 

Config DB

mysql -uroot -p

create schema spamassassin;

create user 'spamassassin'@'localhost';

ALTER USER 'spamassassin'@'localhost' IDENTIFIED BY 'auth_string';

GRANT SELECT, INSERT, UPDATE, DELETE ON spamassassin.* TO 'spamassassin'@'localhost';

FLUSH PRIVILEGES;

wget http://spamassassin.apache.org/full/3.0.x/dist/sql/bayes_mysql.sql

mysql spamassassin -p  < bayes_mysql.sql

 

Configuring spamassassin to use MySQL

/etc/mail/spamassassin/local.cf

chmod o-r local.cf

# settings to enable mysql database access
use_bayes              1
bayes_auto_expire      0
bayes_store_module     Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn          DBI:mysql:spamassassin:localhost
bayes_sql_username     spamassassin
bayes_sql_password     <password>

# Find the following 3 lines and comment them out with "#"
# disable the following since we don't need them anymore
#auto_whitelist_path   /var/lib/amavis/.spamassassin/auto-whitelist
#bayes_path            /var/lib/amavis/.spamassassin/bayes
#bayes_file_mode       0777

# This directive, if used, will override the username used for storing
# data in the database.  This could be used to group users together to
# share bayesian filter data.  You can also use this config option to
# trick sa-learn to learn data as a specific user.
# make sure that all data is added for the user running amavisd-new
# even if someone else runs sa-learn
bayes_sql_override_username   sf

service spamassassin restart

 

Test

wget http://spamassassin.apache.org/gtube/gtube.txt

sa-learn --spam gtube.txt

Learned tokens from 1 message(s) (1 message(s) examined)

mysql -u spamassassin -p

use spamassassin;

select * from bayes_seen;

sa-learn must be run as the user whose data you are loading, or

you must make use of the bayes_sql_override_username config option.

 

Import DB

ll ~amavis/.spamassassin

sudo -u amavis sa-learn --sync --force-expire

sudo -u amavis sa-learn --backup > bayes_db.txt

sa-learn --restore bayes_db.txt

Checking

SELECT COUNT(*) AS token_count FROM bayes_token;

Daily Cron Job

1 3 * * * /usr/bin/sa-learn --sync --force-expire

 

amavisd checking

# amavisd -c /etc/amavisd/amavisd.conf debug 2>&1 | grep -i 'bayes'

# sa-learn --spam --username=vmail /usr/share/doc/spamassassin-3.3.1/sample-spam.txt

Learned tokens from 1 message(s) (1 message(s) examined)


 

Roundcube plugin: markasjunk2

$rcmail_config['markasjunk2_spam_cmd'] = 'sa-learn --spam --username=vmail %f';

$rcmail_config['markasjunk2_ham_cmd'] = 'sa-learn --ham --username=vmail %f';

The cmd_learn learning driver requires the PHP function exec.

# mysql -uroot -p

mysql> USE sa_bayes;

mysql> SELECT COUNT(*) FROM bayes_token;            -- number of learned tokens (not the number of spam mails)

 


sa-learn (train SpamAssassin's Bayesian classifier)

 

SpamAssassin remembers which mail messages it has learnt already,

and will not re-learn those messages again, unless you use the --forget option.

Messages learnt as spam will have SpamAssassin markup removed, on the fly.

--forget                   # Forget a given message previously learnt.

# Version

sa-learn -V

SpamAssassin version 3.4.0

# Checking the configuration

spamassassin -D --lint |& grep  bayes

Nov 11 11:46:25.944 [15307] dbg: config: fixed relative path:
 /var/lib/spamassassin/3.004000/updates_spamassassin_org/23_bayes.cf
Nov 11 11:46:25.944 [15307] dbg: config:
 using "/var/lib/spamassassin/3.004000/updates_spamassassin_org/23_bayes.cf" for included file
Nov 11 11:46:25.945 [15307] dbg: config:
 read file /var/lib/spamassassin/3.004000/updates_spamassassin_org/23_bayes.cf
...........

There's a minimum threshold on how many messages must be in the Bayes database, before SA will use it while scanning.

By default, there must be 200 ham messages and 200 spam messages learned before it will be used.

# Setting

bayes_min_ham_num    200
bayes_min_spam_num   200

# Other Setting

# opportunistically attempt to expire the Bayes database.
# Default: 1(yes)
bayes_auto_expire 1

# Bayes database stores up to a certain number of tokens
# Each token has an access time which records when it last contributed to
# a classification or appeared in a learned email.
# sa-learn --dump magic | grep "oldest atime"
bayes_expiry_max_db_size 150000

# specifies how large the Bayes journal will grow before it is opportunistically synced.
bayes_journal_max_size 102400

# Ignore

bayes_ignore_header header_name
bayes_ignore_from <email_address>
bayes_ignore_to <email_address>

# File

bayes_journal

To avoid contention from each SpamAssassin process attempting to gain write access to the Bayes DB,
token updates are written to the bayes_journal file first.
The journal grows from 0 up to 102400 bytes, according to "bayes_journal_max_size 102400", before it is opportunistically synced.

Synchronize the journal and databases:

Upon successfully syncing (sa-learn --sync) the database with the entries in the journal, the journal file is removed.

bayes_toks

Containing the tokens learnt, their count of occurrences in ham and spam,

and the timestamp when the token was last seen in a message

bayes_seen

A map from Message-Id plus some data from the headers and body (msgid) to what that message was learnt as (flag: h/s).

It also records when the mail was learnt (CreateTime).

This is used so that SpamAssassin can avoid re-learning a message it has already seen.
 

# Fixing messages that were learned incorrectly (learn them again with the correct option)

--ham                 Learn messages as ham (non-spam)

Learned tokens from N message(s) (M message(s) examined)

--spam                Learn messages as spam

--progress

 92% [========================================    ]  38.87 msgs/sec 00m01s DONE

-f file, --folders=file      # Read the list of files/directories to learn from file

e.g.

sa-learn --ham file

# --no-sync = Skip synchronizing the database and journal

sa-learn --no-sync --ham ~/Maildir/.INBOX/{cur,new}

# backup & restore

# backup

sa-learn --sync                          # sync any outstanding journal entries

sa-learn --backup > backup.txt

# restore

sa-learn --clear                         # Wipe out existing database

sa-learn --restore backup.txt

# DB info

sa-learn --dump magic

0.000          0          3          0  non-token data: bayes db version
0.000          0          0          0  non-token data: nspam
0.000          0          1          0  non-token data: nham
0.000          0        122          0  non-token data: ntokens
0.000          0 1476287997          0  non-token data: oldest atime
0.000          0 1476287997          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count

"nham" is the number of ham messages SA has learned

"nspam" is the number of spam messages SA has learned

"ntokens" is the number of tokens in the database
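
As an illustration, the "sa-learn --dump magic" output can be parsed to check whether the database has reached the default minimum of 200 ham and 200 spam messages. This is a minimal sketch: the function names are hypothetical, the sample text is the output shown above, and the column layout is assumed to match it.

```python
import re

# Sample copied from the "sa-learn --dump magic" output above.
SAMPLE = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0          0          0  non-token data: nspam
0.000          0          1          0  non-token data: nham
0.000          0        122          0  non-token data: ntokens
0.000          0 1476287997          0  non-token data: oldest atime
0.000          0 1476287997          0  non-token data: newest atime
"""

# Four whitespace-separated columns, then "non-token data: <name>".
# The third column carries the value for these "magic" lines.
LINE_RE = re.compile(
    r"\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data: (.+?)\s*$"
)


def parse_dump_magic(text):
    """Return {name: value} for every non-token data line."""
    values = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            values[match.group(2)] = int(match.group(1))
    return values


def bayes_is_active(values, min_ham=200, min_spam=200):
    """True once both learned-message counts reach the minimums."""
    return (values.get("nham", 0) >= min_ham
            and values.get("nspam", 0) >= min_spam)
```

In practice the text would come from running the command (for example through subprocess) as the user that owns the Bayes database; the sample above would report the database as not yet active.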

 

# Auto Learn

Mail header

autolearn=no

"no"           autolearning did not occur because the message did not reach the required threshold values
"ham"          the message was learned as ham
"spam"         the message was learned as spam
"disabled"     the configuration specifies bayes_auto_learn 0 or use_bayes 0
"failed"       autolearning was attempted, but couldn't complete
"unavailable"  autolearning was not completed for any reason not covered above
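
The autolearn value can be pulled out of a scanned mail's status header with a small helper. This is a sketch only: it assumes the value appears as "autolearn=<value>" in the X-Spam-Status header, and the function name and sample header are hypothetical.

```python
import re

# The six values described in the list above.
AUTOLEARN_VALUES = {"no", "ham", "spam", "disabled", "failed", "unavailable"}


def autolearn_status(header_value):
    """Return the autolearn value from a status header, or None.

    header_value is the text of the header (assumed: X-Spam-Status),
    which may be folded over several lines.
    """
    # "autolearn=" must be followed directly by the value, so a field
    # such as "autolearn_force=no" is not matched by mistake.
    match = re.search(r"\bautolearn=([a-z]+)", header_value)
    if not match:
        return None
    value = match.group(1)
    return value if value in AUTOLEARN_VALUES else None
```

For instance, a header value such as "No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham version=3.4.0" would yield "ham".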

* If a message has already been learned by SpamAssassin,  then that message will not be learned again.

* SpamAssassin requires at least 3 points from the header and 3 points from the body, to auto-learn as spam.

# Manual training

# --no-sync: Skip synchronizing the database and journal after learning

sa-learn --ham --no-sync ham_directory

sa-learn --spam --no-sync spam_directory

sa-learn --sync

# checking

sa-learn --dump magic | egrep 'nspam|nham'

# amavis

Location: /var/spool/amavisd/.spamassassin/

learn-spam.sh

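#!/bin/bash

# Account that owns the Bayes database, the log file, and the spam folder to learn from
_USER="amavis"
_LOG="/var/log/sa-learn.log"
_SpamFolder="/home/spam/Maildir/.Junk/cur"

# Make the spam mails readable by $_USER and prepare the log file
chmod 666 "$_SpamFolder"/*
touch "$_LOG"
chown "$_USER" "$_LOG"

# Learn the folder as spam without syncing (this overwrites the previous log)
sudo -u "$_USER" sa-learn --no-sync --spam "$_SpamFolder" > "$_LOG"

# Sync the journal into the database once, then append the DB summary to the log
sudo -u "$_USER" sa-learn --sync

sudo -u "$_USER" sa-learn --dump magic >> "$_LOG"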

 


Expiring obsolete tokens

 

Over time the Bayes database grows to include a large number of tokens (in the bayes_token table) and
records of the messages already learned (in the bayes_seen table).

# SpamAssassin uses an auto-expiry mechanism - to do this sort of pruning for you at intervals.

bayes_auto_expire 1

# Force an expiry manually

sa-learn --force-expire

# Checking

# Equivalent to viewing the contents of the bayes_vars table

sa-learn --dump magic

The expiry logic

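 - tokens to keep = the larger of bayes_expiry_max_db_size * 75% or 100,000 tokens;
   the goal reduction is the total number of tokens minus the tokens to keep
 - if the reduction number is < 1000 tokens, abort (not worth the effort).
 - if the last expire was over 30 days ago, its results are not reused and
   an estimation pass is done to choose the new atime cutoff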
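
The size check at the start of an expiry run can be sketched as below. This is an approximation based on the sa-learn documentation rather than the real implementation: the 100,000-token floor is an assumption taken from that documentation, the function name is hypothetical, and the later atime estimation steps are left out.

```python
def expiry_goal_reduction(ntokens, max_db_size=150000):
    """Return how many tokens an expiry run would aim to remove.

    Returns 0 when the run would be aborted because the reduction
    is below 1000 tokens.
    """
    # Keep 75% of bayes_expiry_max_db_size, but never fewer than
    # 100,000 tokens (ASSUMED floor, per the sa-learn documentation).
    tokens_to_keep = max(int(max_db_size * 0.75), 100000)
    reduction = ntokens - tokens_to_keep
    # Fewer than 1000 tokens to remove: not worth the effort.
    return reduction if reduction >= 1000 else 0
```

With the 122 tokens from the dump shown earlier, no expiry would happen; with 200,000 tokens and the 150000 setting, the goal would be to remove 87,500 tokens.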

# number of tokens

bayes_expiry_max_db_size 250000

 


ttl

 

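# A numeric value, optionally suffixed by a time unit:
# s, m, h, d, w, indicating seconds (default), minutes, hours, days, weeks
# (these TTL settings only take effect on a backend that supports them, such as Redis)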

# default: 3w

bayes_token_ttl 3w

# default: 8d

bayes_seen_ttl 8d
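
To make the unit suffixes concrete, here is a small sketch that converts a TTL value into seconds. The helper name is hypothetical, and it only accepts whole numbers followed by at most one unit letter, which may be stricter than what SpamAssassin itself accepts.

```python
import re

# Seconds per unit suffix; no suffix means seconds.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def ttl_to_seconds(value):
    """Convert a TTL such as "3w" or "8d" into a number of seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([smhdw]?)\s*", value)
    if not match:
        raise ValueError("unsupported TTL value: %r" % value)
    number, unit = match.groups()
    return int(number) * UNIT_SECONDS[unit or "s"]
```

Under this conversion the defaults work out to 1,814,400 seconds for bayes_token_ttl (3w) and 691,200 seconds for bayes_seen_ttl (8d).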

 


bayes_token_sources

 

Default: header visible invisible uri

Controls how useful information (tokens) is extracted from a mail

  • header - tokens collected from a message header section
  • visible - words from visible text (plain or HTML) in a message body
  • invisible - hidden/invisible text in HTML parts of a message body
  • uri - URIs collected from a message body
  • mimepart - digests (hashes) of all MIME parts (textual or non-textual) of a message,
                     computed after Base64 and quoted-printable decoding,
                     suffixed by their Content-Type
  • all - adds all the above keywords to the set being assembled
  • none or noall - removes all keywords from the set
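
The way the keywords build up a set from left to right can be sketched as follows. This only models the keywords listed above; the function name is hypothetical, and any other keyword forms that SpamAssassin may accept are not covered.

```python
# Source keywords in the order listed above.
ALL_SOURCES = ("header", "visible", "invisible", "uri", "mimepart")
DEFAULT_SETTING = "header visible invisible uri"


def resolve_token_sources(setting=DEFAULT_SETTING):
    """Return the enabled token sources for a bayes_token_sources value."""
    selected = set()
    for keyword in setting.split():
        keyword = keyword.lower()
        if keyword == "all":
            # "all" adds every source keyword to the set.
            selected.update(ALL_SOURCES)
        elif keyword in ("none", "noall"):
            # "none"/"noall" empties the set assembled so far.
            selected.clear()
        elif keyword in ALL_SOURCES:
            selected.add(keyword)
        else:
            raise ValueError("unknown keyword: %r" % keyword)
    # Report in a stable order for readability.
    return [source for source in ALL_SOURCES if source in selected]
```

For example, "none uri" would leave only URI tokens enabled, while "all" would also turn on mimepart, which is not part of the default.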

 


 

 

 

 
