Mail.Ru Group - 26.03.2016 @ MIPT

Vadim Markovtsev, Mail department / Antispam team

Mail.Ru Mail in numbers

Mail.Ru Antispam from a technical point of view

Analytical tasks

Operational tasks

Spam tradeoffs

How much spam is allowed vs. how comfortable the user is.
System load vs. how thoroughly emails are analyzed.
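The first tradeoff can be illustrated with a score threshold: lowering it catches more spam but blocks more legitimate mail, and raising it does the opposite. This is a minimal sketch, not the production logic. The scores, the labels, and the `evaluate` helper are invented for illustration.

```python
def evaluate(scored, threshold):
    """Count missed spam and blocked legitimate mail at a given score threshold."""
    # Spam that scores below the threshold reaches the inbox.
    missed_spam = sum(1 for score, is_spam in scored if is_spam and score < threshold)
    # Legitimate mail that scores at or above the threshold gets blocked.
    blocked_legit = sum(1 for score, is_spam in scored if not is_spam and score >= threshold)
    return missed_spam, blocked_legit


# Hypothetical (spam score, is_spam) pairs; not real measurements.
scored = [
    (0.95, True), (0.80, True), (0.60, True), (0.40, True),
    (0.70, False), (0.30, False), (0.10, False), (0.05, False),
]

for threshold in (0.2, 0.5, 0.9):
    missed, blocked = evaluate(scored, threshold)
    print(f"threshold={threshold}: missed spam={missed}, blocked legitimate={blocked}")
```

No single threshold drives both counts to zero on this toy data, which is the point of the slide: the operating point is a product decision, not only a technical one.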


Machine Learning

A subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence.


In other words, real-life problems are stated as optimization problems (models) which are subsequently solved by various methods (machine learning algorithms) on computers without direct human participation.

Example: clustering
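To make "a problem stated as an optimization problem" concrete, here is a minimal sketch of plain k-means (Lloyd's algorithm) in pure Python. The model is "minimize the sum of squared distances from each point to its cluster centroid", and the algorithm is the loop that lowers that sum. The sample points and the naive initialization are assumptions chosen for illustration.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm: alternate between assigning points and moving centroids."""
    centroids = list(points[:k])  # naive deterministic initialization (assumption)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels


def objective(points, centroids, labels):
    """Sum of squared distances to the assigned centroid (the quantity being minimized)."""
    return sum(math.dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))


# Two visually obvious groups of 2-D points (invented data).
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(objective(points, centroids, labels))
```

Nobody tells the program which point belongs where; the grouping falls out of minimizing the objective.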

Big Data?

A term for data sets that are so large or complex that traditional data processing applications are inadequate.


This may include:

Big Data?

Most everyday problems are not big data.

Big Data & Machine Learning

Hadoop stack

HDFS + YARN, MapReduce, Pig, Hue
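The MapReduce programming model from this stack can be sketched in a few lines of pure Python, without Hadoop. The example below only imitates the three phases (map, shuffle, reduce) in a single process. The function names and the sample lines are invented for illustration and are not part of any Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["buy cheap pills", "cheap pills now", "hello world"])
print(counts)
```

On a real cluster the map and reduce calls run on many machines over HDFS blocks, and the shuffle happens over the network. The per-record logic stays about this small.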



Highload challenges


Bots ("autoregs")

Autoregs - algorithm

  1. Merge different data streams
  2. Cluster (modified s-means, similar to DBSCAN)
  3. Classify (logistic regression)
  4. Apply negative reinforcement
  5. Merge with the previous state
  6. Backed up by humans
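Steps 1 to 3 above can be sketched as a toy pipeline. This is not the team's implementation. The clustering is a simple greedy threshold scheme swapped in for the modified s-means, the logistic regression weights are hand-picked rather than trained, and all account names, features, and numbers are hypothetical. Steps 4 to 6 are not modeled.

```python
import math


def merge_streams(*streams):
    """Step 1: merge per-account feature records coming from several data streams."""
    merged = {}
    for stream in streams:
        for account, features in stream.items():
            merged.setdefault(account, {}).update(features)
    return merged


def threshold_cluster(vectors, radius):
    """Step 2 (stand-in): a point joins the nearest centroid within `radius`,
    otherwise it starts a new cluster. The number of clusters is not fixed in advance."""
    centroids, members = [], []
    for name, vec in vectors.items():
        best = None
        for i, centroid in enumerate(centroids):
            d = math.dist(vec, centroid)
            if d <= radius and (best is None or d < best[0]):
                best = (d, i)
        if best is None:
            centroids.append(vec)
            members.append([name])
        else:
            i = best[1]
            members[i].append(name)
            n = len(members[i])
            # Incremental mean update of the centroid.
            centroids[i] = tuple(c + (v - c) / n for c, v in zip(centroids[i], vec))
    return members


def logistic_score(features, weights, bias):
    """Step 3: logistic regression scoring with already known weights."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical input streams keyed by account id.
registrations = {"a1": {"reg_minute": 0}, "a2": {"reg_minute": 1},
                 "a3": {"reg_minute": 2}, "u1": {"reg_minute": 500}}
activity = {"a1": {"sent": 90}, "a2": {"sent": 92},
            "a3": {"sent": 91}, "u1": {"sent": 3}}

merged = merge_streams(registrations, activity)
vectors = {acc: (f["reg_minute"], f["sent"]) for acc, f in merged.items()}
clusters = threshold_cluster(vectors, radius=5.0)

# Cluster-level features: size and mean number of sent messages.
WEIGHTS, BIAS = (1.0, 0.05), -5.0  # hand-picked for the example, not trained
suspects = []
for cluster in clusters:
    mean_sent = sum(merged[acc]["sent"] for acc in cluster) / len(cluster)
    if logistic_score((len(cluster), mean_sent), WEIGHTS, BIAS) > 0.5:
        suspects.extend(cluster)
print(clusters, suspects)
```

The idea the sketch tries to convey is that bots are easier to catch as a group: accounts registered together and behaving alike form a dense cluster, and the classifier judges the cluster rather than each account alone.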

Sad but true

Yeah, and one of the most important antispam tasks has always been to reduce the amount of manual work.

Sender reputation



Secret projects


I want to do machine learning

To become familiar with Big Data tools, play with a single-node Hadoop installation.

I want to code highload solutions