Last updated: 2018-04-11
Introduction
* Uses probability to score mail
Basic Setting
Whether to use the naive-Bayesian-style classifier built into SpamAssassin.
use_bayes 1
Whether to use any machine-learning classifiers with SpamAssassin, such as the default 'BAYES_*' rules.
0 = disable use of any and all human-trained classifiers.
use_learner 1
Whether SpamAssassin should automatically feed high-scoring mails (or low-scoring mails, for non-spam) into its learning systems.
bayes_auto_learn 1
Score
# BAYES
# BAYES_00: 0 to 1%
# BAYES_05: 1 to 5%
# BAYES_20: 5 to 20%
score BAYES_00 0
score BAYES_05 1
score BAYES_20 1
score BAYES_40 2
score BAYES_50 2
score BAYES_60 2
score BAYES_80 3
score BAYES_95 3
score BAYES_99 7
score BAYES_999 10
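As a rough illustration of how a Bayes probability turns into rule hits and points, here is a minimal Python sketch. Only the first three ranges come from the comments above; the remaining ranges are assumed from the stock BAYES_* rule definitions, and the function names `bayes_hits` and `bayes_score` are hypothetical. The scores are the ones configured above.

```python
# Probability range for each rule as (rule, low, high).
# Only BAYES_00/05/20 are stated in the notes; the rest are assumptions.
BAYES_RANGES = [
    ("BAYES_00", 0.00, 0.01),
    ("BAYES_05", 0.01, 0.05),
    ("BAYES_20", 0.05, 0.20),
    ("BAYES_40", 0.20, 0.40),
    ("BAYES_50", 0.40, 0.60),
    ("BAYES_60", 0.60, 0.80),
    ("BAYES_80", 0.80, 0.95),
    ("BAYES_95", 0.95, 0.99),
    ("BAYES_99", 0.99, 1.00),
    ("BAYES_999", 0.999, 1.00),  # assumed to fire in addition to BAYES_99
]

# Scores copied from the "score BAYES_*" lines in the notes.
SCORES = {
    "BAYES_00": 0, "BAYES_05": 1, "BAYES_20": 1, "BAYES_40": 2,
    "BAYES_50": 2, "BAYES_60": 2, "BAYES_80": 3, "BAYES_95": 3,
    "BAYES_99": 7, "BAYES_999": 10,
}


def bayes_hits(p):
    """Return the BAYES_* rules whose range contains probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    hits = []
    for rule, low, high in BAYES_RANGES:
        # Ranges are treated as half-open [low, high); p == 1.0 is
        # counted in the ranges that end at 1.0.
        if low <= p < high or (p == 1.0 and high == 1.0):
            hits.append(rule)
    return hits


def bayes_score(p):
    """Sum the configured scores of every rule hit by probability p."""
    return sum(SCORES[rule] for rule in bayes_hits(p))


if __name__ == "__main__":
    for p in (0.03, 0.5, 0.9995):
        print(p, bayes_hits(p), bayes_score(p))
```

With these scores, a mail rated 0.9995 would collect the points of both BAYES_99 and BAYES_999, which is why those two values matter most when tuning.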
Threshold-based discriminator settings for Bayes auto-learning
# default: 0.1
bayes_auto_learn_threshold_nonspam n.nn
# default: 12.0
bayes_auto_learn_threshold_spam n.nn
# default: 0
# 1 => autolearning will be performed only when the Bayes classifier had a different opinion
# from what the autolearner is now trying to teach it.
# This strategy may or may not produce better future classifications,
# but usually works very well, while also preventing unnecessary overlearning and slowing down database growth.
# bayes_auto_learn_on_error 1
Bayes in MySQL (SQL)
Dependencies
- perl-DBI
- perl-DBD-MySQL
Config DB
mysql -uroot -p
create schema spamassassin;
create user 'spamassassin'@'localhost';
ALTER USER 'spamassassin'@'localhost' IDENTIFIED BY 'auth_string';
GRANT SELECT, INSERT, UPDATE, DELETE ON spamassassin.* TO 'spamassassin'@'localhost';
FLUSH PRIVILEGES;
wget http://spamassassin.apache.org/full/3.0.x/dist/sql/bayes_mysql.sql
mysql spamassassin -p < bayes_mysql.sql
Configuring spamassassin to use MySQL
/etc/mail/spamassassin/local.cf
chmod o-r local.cf
# settings to enable mysql database access
use_bayes 1
bayes_auto_expire 0
bayes_store_module Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn DBI:mysql:spamassassin:localhost
bayes_sql_username spamassassin
bayes_sql_password <password>
# Find the following 3 lines and comment them out with "#"
# disable the following since we don't need them anymore
#auto_whitelist_path /var/lib/amavis/.spamassassin/auto-whitelist
#bayes_path /var/lib/amavis/.spamassassin/bayes
#bayes_file_mode 0777
# This directive, if used, will override the username used for storing
# data in the database. This could be used to group users together to
# share bayesian filter data. You can also use this config option to
# trick sa-learn to learn data as a specific user.
# make sure that all data is added for the user running amavisd-new
# even if someone else runs sa-learn
bayes_sql_override_username sf
service spamassassin restart
Test
wget http://spamassassin.apache.org/gtube/gtube.txt
sa-learn --spam gtube.txt
Learned tokens from 1 message(s) (1 message(s) examined)
mysql -u spamassassin -p
use spamassassin;
select * from bayes_seen;
sa-learn must be run as the user whose data you are loading, or
you must make use of the bayes_sql_override_username config option.
Import DB
ll ~amavis/.spamassassin
sudo -u amavis sa-learn --sync --force-expire
sudo -u amavis sa-learn --backup > bayes_db.txt
sa-learn --restore bayes_db.txt
Checking
SELECT COUNT(*) AS token_count FROM bayes_token;
Daily Cron Job
1 3 * * * /usr/bin/sa-learn --sync --force-expire
amavisd checking
# amavisd -c /etc/amavisd/amavisd.conf debug 2>&1 | grep -i 'bayes'
# sa-learn --spam --username=vmail /usr/share/doc/spamassassin-3.3.1/sample-spam.txt
Learned tokens from 1 message(s) (1 message(s) examined)
/usr/share/doc/spamassassin-3.3.1/sample-spam.txt
Roundcube plugin: markasjunk2
$rcmail_config['markasjunk2_spam_cmd'] = 'sa-learn --spam --username=vmail %f';
$rcmail_config['markasjunk2_ham_cmd'] = 'sa-learn --ham --username=vmail %f';
learning driver cmd_learn requires PHP function exec
# mysql -uroot -p
mysql> USE sa_bayes;
mysql> SELECT COUNT(*) FROM bayes_token; -- number of learned tokens (not the number of spam mail samples)
sa-learn (train SpamAssassin's Bayesian classifier)
SpamAssassin remembers which mail messages it has learnt already,
and will not re-learn those messages again, unless you use the --forget option.
Messages learnt as spam will have SpamAssassin markup removed, on the fly.
--forget # Forget a given message previously learnt.
# Version
sa-learn -V
SpamAssassin version 3.4.0
# Checking Configure
spamassassin -D --lint |& grep bayes
Nov 11 11:46:25.944 [15307] dbg: config: fixed relative path: /var/lib/spamassassin/3.004000/updates_spamassassin_org/23_bayes.cf
Nov 11 11:46:25.944 [15307] dbg: config: using "/var/lib/spamassassin/3.004000/updates_spamassassin_org/23_bayes.cf" for included file
Nov 11 11:46:25.945 [15307] dbg: config: read file /var/lib/spamassassin/3.004000/updates_spamassassin_org/23_bayes.cf
...........
There's a minimum threshold on how many messages must be in the Bayes database, before SA will use it while scanning.
By default, there must be 200 ham messages and 200 spam messages learned before it will be used.
# Setting
bayes_min_ham_num 200
bayes_min_spam_num 200
# Other Setting
# opportunistically attempt to expire the Bayes database.
# Default: 1 (yes)
bayes_auto_expire 1
# Bayes database stores up to a certain number of tokens
# Each token has an access time which records when it last contributed to
# a classification or appeared in a learned email.
# sa-learn --dump magic | grep "oldest atime"
bayes_expiry_max_db_size 150000
# specifies how large the Bayes journal will grow before it is opportunistically synced.
bayes_journal_max_size 102400
ignore
bayes_ignore_header header_name
bayes_ignore_from user@example.com
bayes_ignore_to user@example.com
# File
bayes_journal
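To avoid the contention of having each SpamAssassin process attempting to gain write access to the Bayes DB,
token updates made while scanning are first written to this journal file and synced to the DB later.
The bayes_journal file grows from 0 up to 102400 bytes, as set by "bayes_journal_max_size 102400", before it is synced.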
Synchronize the journal and databases:
Upon successfully syncing (sa-learn --sync) the database with the entries in the journal, the journal file is removed.
bayes_toks
Containing the tokens learnt, their count of occurrences in ham and spam,
and the timestamp when the token was last seen in a message
bayes_seen
A map of Message-Id and some data from headers and body (msgid) to what that message was learnt as (flag: h/s).
It also records when the mail was learned (CreateTime).
This is used so that SpamAssassin can avoid re-learning a message it has already seen
# Fixing messages that were learned incorrectly
--ham Learn messages as ham (non-spam)
Learned tokens from N message(s) (M message(s) examined)
--spam Learn messages as spam
--progress
92% [======================================== ] 38.87 msgs/sec 00m01s DONE
-f file, --folders=file
e.g.
sa-learn --ham file
# --no-sync = Skip synchronizing the database and journal
sa-learn --no-sync --ham ~/Maildir/.INBOX/{cur,new}
# backup & restore
# backup
sa-learn --sync # sync any outstanding journal entries
sa-learn --backup > backup.txt
# restore
sa-learn --clear # Wipe out existing database
sa-learn --restore backup.txt
# DB info
sa-learn --dump magic
0.000 0 3 0 non-token data: bayes db version
0.000 0 0 0 non-token data: nspam
0.000 0 1 0 non-token data: nham
0.000 0 122 0 non-token data: ntokens
0.000 0 1476287997 0 non-token data: oldest atime
0.000 0 1476287997 0 non-token data: newest atime
0.000 0 0 0 non-token data: last journal sync atime
0.000 0 0 0 non-token data: last expiry atime
0.000 0 0 0 non-token data: last expire atime delta
0.000 0 0 0 non-token data: last expire reduction count
"nham" is the number of ham messages SA has learned
"nspam" is the number of spam messages SA has learned
"ntokens" is the number of tokens in the database
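To check these counters from a script, the output of `sa-learn --dump magic` can be parsed. The following is a minimal Python sketch, assuming the layout of the sample output in these notes (the value sits in the third column, followed by `non-token data: <name>`). The helper names `parse_dump_magic` and `bayes_is_active` are hypothetical, and the 200/200 defaults are the bayes_min_ham_num and bayes_min_spam_num values mentioned earlier.

```python
# Sample taken from the notes; real output may be spaced differently.
SAMPLE = """\
0.000 0 3 0 non-token data: bayes db version
0.000 0 0 0 non-token data: nspam
0.000 0 1 0 non-token data: nham
0.000 0 122 0 non-token data: ntokens
0.000 0 1476287997 0 non-token data: oldest atime
"""


def parse_dump_magic(text):
    """Map each 'non-token data' name to its integer value."""
    values = {}
    for line in text.splitlines():
        if "non-token data:" not in line:
            continue  # skip anything that is not a magic line
        fields, name = line.split("non-token data:", 1)
        columns = fields.split()
        # Assumption: the value is the third whitespace-separated column.
        values[name.strip()] = int(columns[2])
    return values


def bayes_is_active(values, min_ham=200, min_spam=200):
    """True once enough ham and spam have been learned for Bayes to be used."""
    return (values.get("nham", 0) >= min_ham
            and values.get("nspam", 0) >= min_spam)


if __name__ == "__main__":
    magic = parse_dump_magic(SAMPLE)
    print(magic["nspam"], magic["nham"], magic["ntokens"])
    print("Bayes in use:", bayes_is_active(magic))
```

In practice the text would come from running the command (for example via `subprocess.run(["sa-learn", "--dump", "magic"], capture_output=True, text=True).stdout`) instead of the embedded sample.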
# Auto Learn
Mail header
autolearn=no
"no" autolearning did not occur, didn't achieve the proper threshold values
"ham" the message was learned as ham
"spam" the message was learned as spam
"disabled" the configuration specifies bayes_auto_learn 0 or use_bayes 0
"failed" autolearning was attempted, but couldn't complete.
"unavailable" autolearning not completed for any reason not covered above.
* If a message has already been learned by SpamAssassin, then that message will not be learned again.
* SpamAssassin requires at least 3 points from the header and 3 points from the body, to auto-learn as spam.
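The auto-learn rules described in these notes can be summarized as a small decision function. This is only a sketch in Python, not SpamAssassin's actual code: the function name `autolearn_decision` is hypothetical, the defaults are the bayes_auto_learn_threshold_nonspam (0.1) and bayes_auto_learn_threshold_spam (12.0) values from the notes, and the exact comparison at the boundaries is an assumption. The real plugin also derives the score it uses from a restricted rule set, which is simply taken as an input here.

```python
def autolearn_decision(score, header_points, body_points,
                       nonspam_threshold=0.1, spam_threshold=12.0):
    """Return 'ham', 'spam', or 'no' for a scanned message.

    score          -- score used for auto-learning (an input here)
    header_points  -- points contributed by header rules
    body_points    -- points contributed by body rules
    """
    if score < nonspam_threshold:
        # Low enough to be taught as ham.
        return "ham"
    if score >= spam_threshold:
        # Spam additionally needs at least 3 points from the header
        # and 3 points from the body, as stated in the notes.
        if header_points >= 3 and body_points >= 3:
            return "spam"
        return "no"
    # Between the two thresholds: nothing is learned.
    return "no"


if __name__ == "__main__":
    print(autolearn_decision(-1.0, 0, 0))
    print(autolearn_decision(15.0, 4.0, 5.0))
    print(autolearn_decision(15.0, 1.0, 14.0))
```

The third call illustrates why a high-scoring mail can still show autolearn=no: all of its points come from body rules, so the header requirement is not met.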
# Manual training
# --no-sync: Skip synchronizing the database and journal after learning
sa-learn --ham --no-sync ham_directory
sa-learn --spam --no-sync spam_directory
sa-learn --sync
# checking
sa-learn --dump magic | egrep 'nspam|nham'
# amavis
Location: /var/spool/amavisd/.spamassassin/
learn-spam.sh
#!/bin/bash
_USER="amavis"
_LOG="/var/log/sa-learn.log"
_SpamFolder="/home/spam/Maildir/.Junk/cur"
chmod 666 $_SpamFolder/*
touch $_LOG
chown $_USER $_LOG
sudo -u $_USER sa-learn --no-sync --spam $_SpamFolder > $_LOG
sudo -u $_USER sa-learn --sync
sudo -u $_USER sa-learn --dump magic >> $_LOG
Expiring obsolete tokens
The Bayes database will grow to include a large number of tokens (in the bayes_token table) and
records of the messages already learned (in the bayes_seen table).
# SpamAssassin uses an auto-expiry mechanism - to do this sort of pruning for you at intervals.
bayes_auto_expire 1
# Force an expiry manually
sa-learn --force-expire
# Checking
# Equivalent to viewing the contents of table bayes_vars
sa-learn --dump magic
The expiry logic
- total tokens > bayes_expiry_max_db_size * 75%
- if the reduction number is < 1000 tokens, abort (not worth the effort).
- last expire over 30 days ago
# number of tokens
bayes_expiry_max_db_size 250000
ttl
# unit: s, m, h, d, w, indicating seconds (default), minutes, hours, days, weeks
# default: 3w
bayes_token_ttl 3w
# default: 8d
bayes_seen_ttl 8d
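To make the unit suffixes concrete, here is a small Python sketch that converts such a TTL value into seconds. The function name `ttl_to_seconds` is hypothetical, and the sketch assumes whole-number values with an optional single-letter suffix, as in the examples above.

```python
import re

# Seconds per unit, following the list in the notes: s, m, h, d, w.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def ttl_to_seconds(value):
    """Convert a TTL such as '3w' or '8d' to seconds (no suffix = seconds)."""
    match = re.fullmatch(r"\s*(\d+)\s*([smhdw]?)\s*", value)
    if match is None:
        raise ValueError("unrecognized TTL value: %r" % value)
    number, unit = match.groups()
    return int(number) * UNIT_SECONDS[unit or "s"]


if __name__ == "__main__":
    print(ttl_to_seconds("3w"))  # bayes_token_ttl default
    print(ttl_to_seconds("8d"))  # bayes_seen_ttl default
```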
bayes_token_sources
Default: header visible invisible uri
Controls which parts of a mail tokens are extracted from
- header - tokens collected from a message header section
- visible - words from visible text (plain or HTML) in a message body
- invisible - hidden/invisible text in HTML parts of a message body
- uri - URIs collected from a message body
- mimepart - digests (hashes) of all MIME parts (textual or non-textual) of a message,
  computed after Base64 and quoted-printable decoding,
  suffixed by their Content-Type
- all - adds all the above keywords to the set being assembled
- none or noall - removes all keywords from the set