Fighting with high-dimensional not-so-small data

Vadim Markovtsev, Egor Bulychev - M3 London, 2017



Not-so-small

Complexity vs data size

              | Fits 1 node | ≥ 2 nodes
Casual        | easy, fast  | easy, slow
Bleeding edge | easy, slow  | hard, slow

How do you determine your size?

How do you determine your complexity?

Quadrant chart: system complexity vs. modeling difficulty.

  • Easy to model, simple domain
  • Difficult to model, simple domain
  • Easy to model, complex domain
  • Difficult to model, complex domain

Complexity vs scaling

              | Vertical | Horizontal
Casual        | useless  | required
Bleeding edge | easy     | hard

The case of Netflix

Fighting

Do we have to fight?

Approaches

PCA


Truncated SVD

Proven implementations
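Proven implementations (scikit-learn, Spark MLlib, etc.) are the right tool in production, but the idea itself is small: keep only the top-k singular triplets. A minimal sketch with NumPy; the matrix sizes and k below are illustrative:

```python
import numpy as np

def truncated_svd(X, k):
    """Keep only the top-k singular triplets of X.

    Uses a full dense SVD for clarity; large or sparse matrices would
    need a randomized or iterative solver instead.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

U, s, Vt = truncated_svd(X, 5)
X_approx = U @ np.diag(s) @ Vt   # best rank-5 approximation of X
reduced = X @ Vt.T               # the 100 x 5 dimensionality-reduced data
```

PCA is the same operation applied to mean-centered data; truncated SVD skips the centering, which is what lets it run on huge sparse matrices.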

CUR
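Unlike SVD, CUR expresses the matrix through actual columns (C) and rows (R) of the data, which keeps the factors interpretable and sparse. A minimal sketch assuming the common norm-based sampling scheme; the sizes and sampling counts are illustrative:

```python
import numpy as np

def cur(A, k, rng):
    """Pick k real columns (C) and k real rows (R) of A, sampled with
    probability proportional to their squared norms, then fit the small
    core U so that C @ U @ R approximates A."""
    col_p = (A ** 2).sum(axis=0)
    col_p /= col_p.sum()
    row_p = (A ** 2).sum(axis=1)
    row_p /= row_p.sum()
    cols = rng.choice(A.shape[1], size=k, replace=False, p=col_p)
    rows = rng.choice(A.shape[0], size=k, replace=False, p=row_p)
    C, R = A[:, cols], A[rows, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R

rng = np.random.default_rng(42)
# A low-rank matrix: CUR should reconstruct it well from real rows/columns.
A = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 40))
C, U, R = cur(A, 10, rng)
err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
```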

Topic Modeling

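LDA is the classic topic model; as a self-contained sketch, non-negative matrix factorization (a closely related technique) also decomposes a term-document matrix into additive "topics". The toy corpus, topic count, and iteration count below are assumptions:

```python
import numpy as np

def nmf(V, k, iters=200, seed=0):
    """Factor a non-negative docs x terms matrix V into W (docs x topics)
    and H (topics x terms) with Lee-Seung multiplicative updates."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + 1e-3
    H = rng.random((k, V.shape[1])) + 1e-3
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# Toy term counts: two obvious topics (first two terms vs. last two).
V = np.array([[3, 2, 0, 0],
              [4, 1, 0, 1],
              [0, 0, 5, 2],
              [0, 1, 4, 3]], dtype=float)
W, H = nmf(V, k=2)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Each row of H is a "topic" (a distribution-like weighting over terms), and each row of W says how much of each topic a document contains.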

Sequential data

Embeddings

Properties:

word2vec - skip-gram

word2vec - CBOW

word2vec - fastText
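The difference between the two word2vec objectives is easiest to see in the training pairs they generate: skip-gram predicts each context word from the center word, CBOW predicts the center word from the whole context. A simplified sketch (real implementations add subsampling, negative sampling, and, for fastText, character n-grams):

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (center, context) pair per context word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """CBOW: one (context list, center) pair per position."""
    pairs = []
    for i, center in enumerate(tokens):
        ctx = [tokens[j]
               for j in range(max(0, i - window), min(len(tokens), i + window + 1))
               if j != i]
        pairs.append((ctx, center))
    return pairs

tokens = "the quick brown fox jumps".split()
sg = skipgram_pairs(tokens)
cb = cbow_pairs(tokens)
```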

Embeddings - global vectors

The approach combines the advantages of the two major model families:

Embeddings - global vectors

Example:

  • "Text with 5 unique words".
  • Window size: 2.
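The example above can be sketched as code: slide over the text and count, for each ordered word pair, how often they appear within the window. GloVe additionally down-weights distant pairs by 1/distance, which this toy builder includes:

```python
from collections import defaultdict

def cooccurrence(tokens, window=2):
    """Symmetric co-occurrence counts, each pair weighted by 1/distance
    as in GloVe."""
    X = defaultdict(float)
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                X[(w, tokens[i + d])] += 1.0 / d
                X[(tokens[i + d], w)] += 1.0 / d
    return dict(X)

# The slide's setup: a text with 5 unique words, window size 2.
X = cooccurrence(["a", "b", "c", "d", "e"], window=2)
```

GloVe then fits word vectors so that their dot products reproduce the logarithms of these counts.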

Swivel

An embedding engine that works on co-occurrence matrices.

Like GloVe, but better: paper on arXiv.

We forked the TensorFlow implementation.

Swivel

What to choose?

             | ~1M unique words | many M unique words
10s~100s GBs | CBOW             | skip-gram
100s+ GBs    | Swivel           | bad luck

Graph embeddings

Problems

Properties:

node2vec

Approach:

node2vec
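The heart of node2vec is a second-order biased random walk: the return parameter p and the in-out parameter q interpolate between BFS-like and DFS-like exploration, and the resulting walks are fed to skip-gram as "sentences". A minimal sketch (toy graph and parameter values are illustrative; the real implementation precomputes alias tables for speed):

```python
import random

def node2vec_walk(graph, start, length, p=1.0, q=1.0, seed=0):
    """One biased walk: revisiting the previous node is weighted 1/p,
    moving to a neighbour of the previous node 1, going farther 1/q."""
    rng = random.Random(seed)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = graph[cur]
        if len(walk) == 1:               # first step is unbiased
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for n in nbrs:
            if n == prev:
                weights.append(1.0 / p)  # return to where we came from
            elif n in graph[prev]:
                weights.append(1.0)      # stay close (BFS-like)
            else:
                weights.append(1.0 / q)  # explore outward (DFS-like)
        walk.append(rng.choices(nbrs, weights=weights)[0])
    return walk

# Toy undirected graph as adjacency lists.
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
walk = node2vec_walk(graph, start=0, length=10, p=0.5, q=2.0)
```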

The case of Facebook

Recommendation system: content based

Domains:

  • Personalization.
  • Advertisements.
  • Item recommendation.
  • Search queries.
  • Etc.

Why not collaborative filtering?

  • Cold start problem.
  • New items.
  • Long tail distribution.

Deep semantic similarity model
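A DSSM scores a query and an item by passing each through its own tower network into a shared low-dimensional space and taking the cosine similarity there. A minimal untrained sketch with NumPy (layer sizes, random weights, and random inputs are placeholders; the real model hashes text into term vectors and trains the towers on click-through data):

```python
import numpy as np

def tower(x, W1, W2):
    """One DSSM-style tower: two dense layers with tanh non-linearities."""
    return np.tanh(W2 @ np.tanh(W1 @ x))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
vocab, hidden, dim = 30, 16, 8
W1 = rng.normal(scale=0.1, size=(hidden, vocab))
W2 = rng.normal(scale=0.1, size=(dim, hidden))

query = rng.random(vocab)  # stand-in for a hashed query term vector
doc = rng.random(vocab)    # stand-in for a hashed item term vector
sim = cosine(tower(query, W1, W2), tower(doc, W1, W2))
```

Because the item side needs no user history, this sidesteps the cold-start problems listed above.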

Data visualization

Task definition:

t-SNE

Why was it introduced?

t-SNE
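A minimal usage sketch with scikit-learn's implementation; the two-cluster toy data and the perplexity value are illustrative:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two well-separated 10-D clusters; t-SNE should keep them apart in 2-D.
X = np.vstack([rng.normal(0, 1, size=(30, 10)),
               rng.normal(8, 1, size=(30, 10))])

# Perplexity roughly sets the effective number of neighbours per point.
Y = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
```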

LargeVis

A large-scale replacement for t-SNE.

Conclusion

Thank you

blog.sourced.tech