Mirror of https://github.com/mozilla/gecko-dev.git, synced 2024-11-07 12:15:51 +00:00
ea0646f211
Replace the core Bayesian junk mail algorithm with chi-squared probability combining modeled after SpamBayes and Gary Robinson's work. Change the model for how we count tokens across messages; token counts got out of alignment when re-training against already classified messages. Revamp the junk mail tokenizer: make it a header sink listener and add custom tokens for attachment information. Tokenize purely on whitespace, ignoring tokens longer than 13 characters or shorter than 3 bytes. There is still a lot more work to be done on the tokenizer. Many thanks to Miguel Varga for working out the initial core algorithm improvement, and to all of the folks at SpamBayes and of course Gary Robinson for helping to make this happen.
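For illustration, here is a minimal C++ sketch of the SpamBayes/Robinson-style chi-squared combining and the whitespace tokenizer rules described above. This is not the mailnews implementation: the function names are invented for this example, and the per-token spam probabilities are assumed to come from an already-trained token store that is not shown here.

```cpp
#include <algorithm>
#include <cmath>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Survival function of the chi-square distribution for an even number of
// degrees of freedom (the closed form used by SpamBayes-style filters).
static double ChiSquaredProb(double chi2, int degreesOfFreedom) {
  double m = chi2 / 2.0;
  double sum = std::exp(-m);
  double term = sum;
  for (int i = 1; i < degreesOfFreedom / 2; ++i) {
    term *= m / i;
    sum += term;
  }
  return std::min(sum, 1.0);
}

// Tokenize purely on whitespace and drop tokens shorter than 3 bytes or
// longer than 13 characters, per the rules in the commit message.
static std::vector<std::string> Tokenize(const std::string& text) {
  std::vector<std::string> tokens;
  std::istringstream stream(text);
  std::string token;
  while (stream >> token) {
    if (token.size() >= 3 && token.size() <= 13) {
      tokens.push_back(token);
    }
  }
  return tokens;
}

// Combine per-token spam probabilities with Fisher's method, as Robinson
// proposed: under a "neutral message" hypothesis, -2 * sum(ln x_i) is
// chi-square distributed with 2n degrees of freedom.
static double CombineProbabilities(const std::vector<double>& probs) {
  if (probs.empty()) return 0.5;  // no evidence either way
  double sumLnProb = 0.0;  // sum of ln(p)
  double sumLnComp = 0.0;  // sum of ln(1 - p)
  for (double p : probs) {
    // Clamp away from 0 and 1 so the logarithms stay finite.
    p = std::min(std::max(p, 1e-6), 1.0 - 1e-6);
    sumLnProb += std::log(p);
    sumLnComp += std::log(1.0 - p);
  }
  int dof = 2 * static_cast<int>(probs.size());
  double S = 1.0 - ChiSquaredProb(-2.0 * sumLnComp, dof);  // spam evidence
  double H = 1.0 - ChiSquaredProb(-2.0 * sumLnProb, dof);  // ham evidence
  return (S - H + 1.0) / 2.0;  // near 1 = spam, near 0 = ham, ~0.5 = unsure
}

int main() {
  // Toy example: pretend every surviving token has a trained spam
  // probability of 0.9 (in reality these come from the token store).
  std::vector<std::string> tokens =
      Tokenize("cheap meds click here now for a limited time offer");
  std::vector<double> probs(tokens.size(), 0.9);
  std::cout << "tokens kept: " << tokens.size()
            << ", score: " << CombineProbabilities(probs) << "\n";
  return 0;
}
```

The appeal of the chi-squared combiner over a naive Bayesian product is that messages with weak or conflicting evidence land near 0.5 instead of being pushed toward 0 or 1, giving the filter a usable "unsure" range.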
Files in this directory:

- build
- resources
- src
- .cvsignore
- Makefile.in