Vadim Markovtsev, Egor Bulychev - M³ London, 2017


                Fits 1 node    ≥ 2 nodes
Casual          easy, fast     easy, slow
Bleeding edge   easy, slow     hard, slow

- Is your data sparse, or can it be made sparse?
- Does the processing fit into 256/512 GB of RAM?
- Does it take less than an hour/day/week?
- Do you expect that this will stop being the case in the future?

System complexity →
← Modeling difficulty

- Simple domain, easy to model
- Simple domain, difficult to model
- Complex domain, easy to model
- Complex domain, difficult to model

                Casual      Bleeding edge
Vertical        useless     required
Horizontal      easy        hard

- Sparse-aware
- IO agnostic
- Single-machine

- Fuzzy duplicates of sets: SimHash, (Weighted) MinHash: minhashcuda + datasketch.
- Sparse k-NN: pysparnn.
- SVM is sparse-friendly: libsvm.
- Decision Trees are sparse-friendly: xgboost.

- PCA
- DSSM
- CUR
- t-SNE
- Topic Modeling
- LargeVis
- *2vec

- The reduced number of dimensions depends on the explained variance ratio (EVR).
- EVR = Σ λ² / trace(XᵀX)

- Centers the columns.
- Does not work with sparse matrices.
- Output is never sparse.

- Sensitive to outliers.
- Possible solutions: filtering; log or any other damping function.
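As a sketch with scikit-learn on hypothetical toy data (a fractional `n_components` asks PCA to keep just enough components to reach that EVR):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# rank-5 data embedded in 50 dimensions, plus a little noise
X = rng.randn(1000, 5) @ rng.randn(5, 50) + 0.01 * rng.randn(1000, 50)

pca = PCA(n_components=0.99, svd_solver="full")  # keep 99% of the variance
Z = pca.fit_transform(X)  # PCA centers the columns internally
evr = pca.explained_variance_ratio_.sum()
```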

- An efficient method to compute PCA.
- PCA(X, N) = TSVD(centered(X), N).
- Supports sparse input.
- Output is never sparse.
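A sketch with scikit-learn's `TruncatedSVD`, which accepts sparse input directly; note it does not center, hence the `centered(X)` above (data here is synthetic):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

# random sparse matrix: 2000 samples, 500 features, 1% non-zeros
X = sp.random(2000, 500, density=0.01, format="csr",
              random_state=np.random.RandomState(0))
svd = TruncatedSVD(n_components=20, random_state=0)
Z = svd.fit_transform(X)  # sparse in, dense (2000, 20) out
```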

- Randomized, approximate TSVD.
- Supports sparse input.
- Output is sparse.
- Vulnerable to outliers.
- Implementation: PyMF.

- Topics are not clusters!
- Good for explaining, bad for decisions.


- We are surrounded by sequential data:
- Texts, system logs, etc.
- Usenet dataset example:
- Moderate size: ~7B words, ~37 GB.
- A lot of unique entities.

- Low-dimensional representation.
- Similar contexts -> similar embeddings.
- Analogies search (hi to fb/faiss).
- Reasoning: W("woman") − W("man") ≃ W("queen") − W("king").

- Predicting the context given the word:
- Input: "mcube".
- Output: "is", "a", "great", "conference".
- Works well even with small data.
- Represents rare words well.

- Predicting the word given its context.
- Input: "is", "a", "great", "conference".
- Output: "mcube".
- Faster training than skip-gram.
- Better accuracy for frequent words.

- Embeddings for char n-grams.
- Emb(word) = Σ Emb(char n-gram)
- -> embeddings for unseen words.
- Classification of texts.
- Compression of models.

Approach combines the advantages of the two major model families:

- Global matrix factorization.
- Local context window.

Example:

- "Text with 5 unique words".
- Window size: 2.
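The global co-occurrence counting behind this example can be sketched as follows (symmetric window of 2):

```python
from collections import Counter

def cooccurrences(tokens, window=2):
    # count word pairs appearing within `window` positions of each other
    counts = Counter()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), i):
            counts[(tokens[j], word)] += 1
            counts[(word, tokens[j])] += 1
    return counts

C = cooccurrences("text with 5 unique words".split(), window=2)
# ("text", "with") co-occur; ("text", "unique") are 3 apart, outside the window
```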

Embedding engine which works on co-occurrence matrices.

Like GloVe, but better; paper on arXiv.

We forked the TensorFlow implementation.

- Features:
- Multi-GPU support.
- Scales with vocabulary size.
- Efficient sharding algorithm to distribute workload.

             10s~100s GBs    100s+ GBs
~1M uniq     CBOW            skip-gram
many M uniq  Swivel          bad luck

Problems:

- Social graphs, dependency graphs, etc.

Properties:

- Continuous feature representations for nodes in networks.
- Maximizes the likelihood of preserving network neighborhoods of nodes.

Approach:

- Random walks over the graph yield sequences of nodes.
- These sequences are a stochastic approximation of the graph.
- Use them as input to word2vec / Swivel / etc. -> embeddings.
- As a result: embeddings for nodes.

- n1 -> n1025 -> n37 ...
- n2 -> n53 -> n209 ...
- n3 -> n129 -> n548 ...
- n4 -> n78 -> n789 ...
- ...
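The walk generation above can be sketched as follows (uniform random walks; real node2vec biases the transition probabilities with its p and q parameters):

```python
import random

def random_walks(adj, walks_per_node=4, walk_len=5, seed=0):
    """adj maps a node id to the list of its neighbor ids."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in adj:
            walk = [start]
            while len(walk) < walk_len:
                neighbors = adj[walk[-1]]
                if not neighbors:
                    break  # dead end
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

adj = {"n1": ["n2", "n3"], "n2": ["n1", "n3"], "n3": ["n1", "n2"]}
walks = random_walks(adj)  # feed these "sentences" of node ids to word2vec
```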

Domains:

- Personalization.
- Advertisements.
- Item recommendation.
- Search queries.
- Etc.

Why not collaborative filtering?

- Cold start problem.
- New items.
- Long tail distribution.

- Article.
- Term vector - 500k unique words.
- 3-grams of characters (30k):
- #bad# -> #ba, bad, ad#
- Deep representation.
- Cosine similarity.
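The character 3-gram step from the example can be sketched as:

```python
def char_ngrams(word, n=3):
    # pad with '#' so word boundaries are visible, e.g. "bad" -> "#bad#"
    padded = "#" + word + "#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

grams = char_ngrams("bad")  # ['#ba', 'bad', 'ad#'], as on the slide
```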

Task definition:

- We are given a collection of high-dim data.
- How can we get a feel for how these objects are arranged in space?

Why was it introduced?

- Distance preservation is not enough.

- Preserves the neighborhood.
- Respects the relation between distances and not the exact values.
- Works with the number of dimensions N up to ~10²; requires reduction if more.
- Complexity: exact O(d*N²) & Barnes-Hut O(d*N*log(N)).
- Excellent blog post.
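A sketch with scikit-learn, following the advice above to reduce dimensionality first (synthetic data; Barnes-Hut is the default `method`):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
X = rng.randn(200, 300)                      # high-dimensional data
X50 = PCA(n_components=50).fit_transform(X)  # reduce to ~10^2 dims first
emb = TSNE(n_components=2, method="barnes_hut",
           random_state=0).fit_transform(X50)  # 2-D map for plotting
```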

Large scale t-SNE replacement.

- Complexity: O(s*M*N).

- Fit the data into a single node.
- Reduce the amount of data.
- Try a sparse-friendly model.
- Eliminate outliers.
- PCA can be good for pre-processing.
- Embed and conquer.
- Visualize early.