Antispam From The Inside Out

Mail.Ru Group - 26.03.2016 @ MIPT

Vadim Markovtsev, Mail department / Antispam team

View this on your device

goo.gl/5x4FzW

Mail.Ru Mail in numbers

Mail.Ru Antispam from a technical point of view

Analytical tasks

Operational tasks

Spam tradeoffs

How much spam is let through vs. how comfortable the user is.
System load vs. how thoroughly emails are analyzed.
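
The first tradeoff can be illustrated with a blocking threshold on a spam score. This is a minimal sketch, not the Mail.Ru implementation: the scores, labels, thresholds and the `confusion_at` helper are all made up for illustration.

```python
def confusion_at(threshold, scored):
    """Return (spam_delivered, ham_blocked) when mail scoring >= threshold is blocked."""
    spam_delivered = sum(1 for score, is_spam in scored if is_spam and score < threshold)
    ham_blocked = sum(1 for score, is_spam in scored if not is_spam and score >= threshold)
    return spam_delivered, ham_blocked

# Hypothetical (score, is_spam) pairs; a real system would get scores from a classifier.
scored = [
    (0.95, True), (0.80, True), (0.55, True), (0.30, True),
    (0.60, False), (0.20, False), (0.10, False), (0.05, False),
]

for threshold in (0.25, 0.5, 0.75):
    print(threshold, confusion_at(threshold, scored))
```

A low threshold lets less spam through but blocks more legitimate mail; a high threshold does the opposite.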

Technologies

Machine Learning

A subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence.

Wikipedia


In other words, real-life problems are stated as optimization problems (models) which are subsequently solved by various methods (machine learning algorithms) on computers without direct human participation.

Example: clustering
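
As an illustration of a model stated as an optimization problem, here is a sketch of plain k-means (Lloyd's algorithm), which minimizes the sum of squared distances from points to their cluster centroids. The slide does not say which clustering algorithm it shows, so k-means is an assumption, and the sample points are invented.

```python
def kmeans(points, k, iterations=20):
    """Lloyd's algorithm on 2-D points; returns (centroids, labels)."""
    # Naive initialization with the first k points; real code would
    # typically use random or k-means++ initialization.
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels

# Two visually obvious groups of invented points.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, labels = kmeans(points, k=2)
```

No human tells the algorithm which point belongs where; the grouping falls out of the optimization.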

Big Data?

A term for data sets that are so large or complex that traditional data processing applications are inadequate.

Wikipedia


This may include:

Big Data?

Most everyday problems are not big data.

Big Data & Machine Learning

Hadoop stack

HDFS + YARN, MapReduce, Pig, Hue

MapReduce
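
The MapReduce model can be sketched in a single process as three phases: map, shuffle, reduce. The example below counts messages per sender domain. It does not use the Hadoop API; the record layout and function names are hypothetical.

```python
from collections import defaultdict

def map_phase(record):
    """Mapper: emit (sender_domain, 1) for one log record."""
    sender = record["from"]
    yield sender.rsplit("@", 1)[-1].lower(), 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the counts collected for one key."""
    return key, sum(values)

def run_job(records):
    mapped = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

# Invented log records for illustration.
log = [
    {"from": "alice@example.com"},
    {"from": "bob@Example.com"},
    {"from": "carol@example.org"},
]
counts = run_job(log)
```

On a real cluster the mappers and reducers run on different machines and the shuffle moves data over the network, but the contract of each phase is the same.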

IPython

Highload challenges

Examples

Bots ("autoregs")

Autoregs - algorithm

  1. Merge different data streams
  2. Cluster (modified s-means, similar to DBSCAN)
  3. Classify (logistic regression)
  4. Apply negative reinforcement
  5. Merge with the previous state
  6. Back up with human review
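
The steps above can be sketched roughly as follows. This is a toy under loud assumptions, not the production pipeline: the field names, the weights and the data are invented, clustering is replaced by grouping on a shared signup IP, and steps 4 and 6 are left out.

```python
import math
from collections import defaultdict

def merge_streams(registrations, activity):
    """Step 1: join two event streams on the account id."""
    merged = {r["account"]: dict(r, sent=0) for r in registrations}
    for event in activity:
        if event["account"] in merged:
            merged[event["account"]]["sent"] += event["sent"]
    return merged

def cluster(accounts):
    """Step 2 stand-in: group accounts by signup IP (placeholder for real clustering)."""
    groups = defaultdict(list)
    for name, features in accounts.items():
        groups[features["ip"]].append(name)
    return list(groups.values())

# Hypothetical logistic regression weights; real ones would be learned from labeled data.
WEIGHTS = {"bias": -4.0, "size": 1.0, "mean_sent": 0.02}

def classify(group, accounts):
    """Step 3: logistic score in (0, 1) that the whole cluster is automated."""
    mean_sent = sum(accounts[a]["sent"] for a in group) / len(group)
    z = WEIGHTS["bias"] + WEIGHTS["size"] * len(group) + WEIGHTS["mean_sent"] * mean_sent
    return 1.0 / (1.0 + math.exp(-z))

def update_state(previous, group, score):
    """Step 5: merge new scores with the previous state, keeping the higher one."""
    state = dict(previous)
    for account in group:
        state[account] = max(state.get(account, 0.0), score)
    return state

# Invented input: three accounts from one IP sending heavily, one ordinary user.
registrations = [
    {"account": "a1", "ip": "10.0.0.1"},
    {"account": "a2", "ip": "10.0.0.1"},
    {"account": "a3", "ip": "10.0.0.1"},
    {"account": "u1", "ip": "10.0.0.9"},
]
activity = [
    {"account": "a1", "sent": 200},
    {"account": "a2", "sent": 180},
    {"account": "a3", "sent": 220},
    {"account": "u1", "sent": 3},
]

accounts = merge_streams(registrations, activity)
state = {}
for group in cluster(accounts):
    state = update_state(state, group, classify(group, accounts))
```

With these made-up weights the three accounts sharing an IP score high and the single ordinary account scores low; a threshold on `state` would then decide which accounts go to review.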

Sad but true

Yeah, and one of the most important antispam tasks has always been to reduce the amount of manual work.

Sender reputation

Break-ins

Secret projects

Finally

I want to do machine learning

To become familiar with Big Data tools, play with a single-node Hadoop installation.

I want to code highload solutions

Thanks
goo.gl/5x4FzW