Fighting with high-dimensional not-so-small data

Vadim Markovtsev - Codemotion Madrid, 2017

Not-so-small

Complexity vs data size

               Fits 1 node   ≥ 2 nodes
Casual         easy, fast    easy, slow
Bleeding edge  easy, slow    hard, slow

How do you determine your size?

How do you determine your complexity?

[Quadrant chart: system complexity vs. modeling difficulty, spanning easy-to-model simple domains through difficult-to-model complex domains]

Complexity vs scaling

               Vertical    Horizontal
Casual         useless     easy
Bleeding edge  required    hard

The case of Netflix

Fighting

Do we have to fight?

Approaches

PCA

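A minimal scikit-learn sketch (not from the talk); the data matrix and the choice of 50 components are placeholders:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dense feature matrix: 10k samples, 1k dimensions.
X = np.random.RandomState(0).randn(10_000, 1_000)

pca = PCA(n_components=50)         # keep the 50 directions of highest variance
X_reduced = pca.fit_transform(X)   # shape: (10_000, 50)

print(X_reduced.shape)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```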

Truncated SVD

Proven implementations
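
One widely used implementation is scikit-learn's TruncatedSVD (whether it is on the author's list is an assumption). It runs randomized SVD directly on sparse input without centering or densifying it; the matrix below is made up:

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

# Hypothetical sparse term-document matrix: 100k docs, 50k terms.
X = sp.random(100_000, 50_000, density=1e-4, format="csr", random_state=0)

svd = TruncatedSVD(n_components=100, random_state=0)  # randomized SVD under the hood
X_reduced = svd.fit_transform(X)                      # dense (100_000, 100) matrix

print(X_reduced.shape, svd.explained_variance_ratio_.sum())
```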

CUR
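
A minimal sketch of the CUR idea, assuming simple squared-norm sampling of rows and columns (production implementations typically use leverage scores); everything here is illustrative:

```python
import numpy as np

def cur(A, k, seed=0):
    """Toy CUR: sample k columns and k rows with probability proportional to
    their squared norms, then solve for the core matrix U so that C @ U @ R ~ A."""
    rng = np.random.default_rng(seed)
    col_p = (A ** 2).sum(axis=0); col_p /= col_p.sum()
    row_p = (A ** 2).sum(axis=1); row_p /= row_p.sum()
    cols = rng.choice(A.shape[1], size=k, replace=False, p=col_p)
    rows = rng.choice(A.shape[0], size=k, replace=False, p=row_p)
    C, R = A[:, cols], A[rows, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)  # core matrix
    return C, U, R

A = np.random.default_rng(1).standard_normal((500, 300))
C, U, R = cur(A, 50)
print(np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A))  # relative reconstruction error
```

Unlike SVD, the factors C and R are actual columns and rows of the original matrix, which keeps them interpretable.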

Topic Modeling

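A minimal LDA sketch with scikit-learn on a tiny made-up corpus; the talk does not prescribe a particular library:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "neural networks learn distributed representations",
    "stock markets reacted to the interest rate decision",
    "the embedding maps words to dense vectors",
    "central banks discuss inflation and monetary policy",
]

counts = CountVectorizer().fit(docs)
X = counts.transform(docs)                 # sparse bag-of-words matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = counts.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = topic.argsort()[-5:][::-1]       # 5 highest-weighted terms per topic
    print(f"topic {i}:", ", ".join(terms[j] for j in top))
```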

Sequential data

Embeddings

Properties:

word2vec - skip-gram

word2vec - CBOW

word2vec - fastText
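
A hedged sketch of the three flavors using gensim; the corpus and hyperparameters are placeholders, not the talk's settings:

```python
from gensim.models import Word2Vec, FastText

# Hypothetical pre-tokenized corpus; in practice this is a streaming iterator.
sentences = [
    ["high", "dimensional", "data", "needs", "embeddings"],
    ["skip", "gram", "predicts", "context", "from", "word"],
    ["cbow", "predicts", "word", "from", "context"],
] * 100

cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)  # CBOW
sg   = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)  # skip-gram
ft   = FastText(sentences, vector_size=100, window=5, min_count=1, sg=1)  # char n-grams

print(sg.wv.most_similar("data", topn=3))  # toy corpus, neighbors are not meaningful
print(ft.wv["embeddingz"][:5])             # fastText composes OOV words from n-grams
```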

Embeddings - global vectors

The approach combines the advantages of the two major model families:

Embeddings - global vectors

Example:

  • "Text with 5 unique words".
  • Window size: 2.
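
A minimal sketch of the co-occurrence matrix for exactly this example, using plain counts; GloVe itself additionally down-weights each pair by its distance inside the window:

```python
import numpy as np

# The slide's example: the text "Text with 5 unique words", window size 2.
tokens = "text with 5 unique words".split()
vocab = sorted(set(tokens))
index = {w: i for i, w in enumerate(vocab)}
window = 2

cooc = np.zeros((len(vocab), len(vocab)), dtype=int)
for pos, word in enumerate(tokens):
    for off in range(1, window + 1):      # look up to `window` tokens to the right
        if pos + off < len(tokens):
            other = tokens[pos + off]
            cooc[index[word], index[other]] += 1
            cooc[index[other], index[word]] += 1  # symmetric window

print(vocab)
print(cooc)  # the 5x5 matrix that GloVe (and Swivel) factorize
```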

Swivel

An embedding engine that works on co-occurrence matrices.

Like GloVe, but better - paper on arXiv.

We forked the TensorFlow implementation.

What to choose?

              10s~100s GBs   100s~ GBs
~1M uniq      CBOW           Swivel
many M uniq   skip-gram      bad luck

Graph embeddings

Problems

Properties:

node2vec

Approach:

node2vec
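
A compact sketch of the node2vec recipe, assuming the standard biased second-order walk (return parameter p, in-out parameter q) followed by skip-gram on the walks; the toy graph and parameters are arbitrary:

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def node2vec_walk(G, start, length, p=1.0, q=0.5):
    """One biased walk: weight 1/p to return to the previous node,
    1 to stay near it, 1/q to explore further away."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = list(G.neighbors(cur))
        if not neighbors:
            break
        if len(walk) == 1:
            walk.append(random.choice(neighbors))
            continue
        prev = walk[-2]
        weights = [1.0 / p if n == prev else
                   1.0 if G.has_edge(n, prev) else
                   1.0 / q for n in neighbors]
        walk.append(random.choices(neighbors, weights=weights)[0])
    return walk

G = nx.karate_club_graph()                      # toy graph shipped with networkx
walks = [[str(n) for n in node2vec_walk(G, node, 20)]
         for _ in range(10) for node in G.nodes()]
model = Word2Vec(walks, vector_size=32, window=5, min_count=1, sg=1)  # skip-gram on walks
print(model.wv.most_similar("0", topn=3))       # nodes similar to node 0
```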

The case of Facebook

Deep semantic similarity model

Data visualization

Task definition:

t-SNE

Why was it introduced?

t-SNE
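
A typical usage sketch with scikit-learn; reducing to ~50 dimensions with PCA first is a common recipe, not something mandated by the talk, and the input here is synthetic:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical 300-dimensional embeddings for 2k items.
X = np.random.RandomState(0).randn(2_000, 300)

X50 = PCA(n_components=50).fit_transform(X)   # denoise / speed up with PCA first
X2 = TSNE(n_components=2, perplexity=30, init="pca",
          random_state=0).fit_transform(X50)

print(X2.shape)  # (2000, 2), ready for a scatter plot
```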

LargeVis

Uniform Manifold Approximation and Projection (UMAP)
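
A usage sketch with the umap-learn package; the input and parameters are illustrative:

```python
import numpy as np
import umap  # pip install umap-learn

# Hypothetical 300-dimensional embeddings for 2k items.
X = np.random.RandomState(0).randn(2_000, 300)

# n_neighbors trades local vs. global structure; min_dist controls how tightly
# points are packed in the 2D layout.
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=0)
X2 = reducer.fit_transform(X)

print(X2.shape)
```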

Conclusion

Thank you