Open Source Stack for Machine Learning on Source Code

Vadim Markovtsev, source{d}

Read this on your device

About me

About source{d}

Master plan

Our projects

How do you guys make money?

Currently, we don't. Right now, we're focused on research, development and developer adoption

We want to be the machine learning pipeline on top of the world's source code, in both public and private repositories

In the future, we expect to make revenue by charging enterprises for deploying this pipeline on their code

Machine learning on source code



Will the coding AI take our jobs?

Relax, we are very far from it

... won’t ask what problems can be solved with computers alone. Instead, ... ask: how can computers help humans solve hard problems? ... They won’t just get better at the kinds of things people already do; they’ll help us to do what was previously unimaginable.

Peter Thiel, Zero to One

Approaches to extract information from the source code

  1. Treat it as regular text and feed it to a black box
  2. Tokenize it using regexps (Pygments, highlight.js)
  3. Parse ASTs
  4. Compile
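Approach 3 can be illustrated with Python's own stdlib `ast` module, a toy single-language stand-in for a real parser pipeline such as Babelfish: parse a snippet and collect every identifier from its tree.

```python
import ast

# Toy illustration of approach 3: extract identifiers from an AST.
# A multi-language pipeline would use a universal parser instead.
source = """
def quick_sort(arr):
    cat = models.Cat()
"""

names = set()
for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.Name):           # variable references: cat, models
        names.add(node.id)
    elif isinstance(node, ast.FunctionDef):  # function names: quick_sort
        names.add(node.name)
    elif isinstance(node, ast.arg):          # arguments: arr
        names.add(node.arg)

print(sorted(names))  # ['arr', 'cat', 'models', 'quick_sort']
```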


Universal Abstract Syntax Tree (UAST)

Shared node roles for every language

			pip3 install bblfsh
			python3 -m bblfsh -f /path/to/file
We are going to deploy bblfsh/dashboard



Source code classification

Problem: given millions of repositories, map their files to languages
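A naive baseline for this problem looks at file extensions and falls back to the shebang line; the `EXT_TO_LANG` table and `classify` helper below are hypothetical, and a production classifier (in the spirit of GitHub's Linguist) needs content-based heuristics to break ties.

```python
# Hypothetical baseline classifier: extension first, shebang as fallback.
EXT_TO_LANG = {".py": "Python", ".java": "Java", ".go": "Go", ".js": "JavaScript"}

def classify(path, first_line=""):
    for ext, lang in EXT_TO_LANG.items():
        if path.endswith(ext):
            return lang
    # Extensionless scripts: inspect the shebang line.
    if first_line.startswith("#!") and "python" in first_line:
        return "Python"
    return "Unknown"

print(classify("sort.java"))                        # Java
print(classify("setup", "#!/usr/bin/env python3"))  # Python
```

Ambiguous extensions (e.g. `.h`, `.m`) are exactly where such a baseline breaks down and content-based classification becomes necessary.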

Identifiers in source code

Names rule.

Adam Tornhill, Your Code as a Crime Scene
			cat = models.Cat()
			def quick_sort(arr):



We apply the Snowball stemmer to words longer than 6 characters
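Before stemming, identifiers have to be split into words. A minimal sketch of that splitting step, assuming the usual snake_case and camelCase conventions (the actual pipeline then runs the Snowball stemmer on the resulting words longer than 6 characters):

```python
import re

# Split snake_case / camelCase identifiers into lowercase words.
def split_identifier(name):
    parts = re.split(r"[_\W]+", name)
    words = []
    for part in parts:
        # camelCase, ALLCAPS runs, and digit runs become separate words.
        words.extend(m.group(0)
                     for m in re.finditer(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part))
    return [w.lower() for w in words if w]

print(split_identifier("quick_sort"))      # ['quick', 'sort']
print(split_identifier("XMLHttpRequest"))  # ['xml', 'http', 'request']
```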

Weighted BOW

Let's interpret every repository as a weighted bag-of-words

We calculate TF-IDF to weigh the occurring identifiers
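A minimal sketch of that weighting over toy "repositories" (bags of identifiers); the exact weighting scheme in the pipeline may differ, and this uses the classic tf · log(N/df) form:

```python
import math

# Toy corpus: each repository is a bag of its identifiers.
repos = {
    "tensorflow": ["tfreturn", "opkernel", "tfreturn", "sycl"],
    "webapp":     ["render", "request", "request"],
}
N = len(repos)

# Document frequency: in how many repositories does each word occur?
df = {}
for bag in repos.values():
    for word in set(bag):
        df[word] = df.get(word, 0) + 1

def tfidf(repo, word):
    tf = repos[repo].count(word)
    return tf * math.log(N / df[word])

print(round(tfidf("tensorflow", "tfreturn"), 3))
```

Identifiers that occur in every repository get a weight of zero, which is why generic names like `i` or `tmp` do not show up in the ranking above.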

    tfreturn	67.78
  oprequires	63.97
      doblas	63.71
    gputools	62.34
    tfassign	61.55
    opkernel	60.72
        sycl	57.06
         hlo	55.72
     libxsmm	54.82
  tfdisallow	53.67

Topic modeling

We can run topic modeling on these features

Paper on arXiv

Run on the whole world -> funny tagging

Run per ecosystem -> useful exploratory search

Topic #36 "Movies" →

Clustering TM vectors

CTO Máximo Cuadros' profile →


Identifier embeddings


Swivel: an embedding engine that works on co-occurrence matrices

Like GloVe, but better - paper on arXiv

We forked the TensorFlow implementation

Linux kernel embeddings

Launch TensorFlow Projector
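Once identifier embeddings are trained, related identifiers can be found by cosine similarity. A toy illustration, with made-up 3-d vectors standing in for the real Swivel-trained ones:

```python
import math

# Hypothetical embeddings; real id2vec vectors are high-dimensional.
embeddings = {
    "mutex":    [0.9, 0.1, 0.0],
    "spinlock": [0.8, 0.2, 0.1],
    "sort":     [0.0, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def nearest(word):
    # The most similar identifier other than the query itself.
    return max((w for w in embeddings if w != word),
               key=lambda w: cosine(embeddings[word], embeddings[w]))

print(nearest("mutex"))  # spinlock
```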


What if we combine bag-of-words model with embeddings?

We focus on finding related repositories

Word Mover's Distance (WMD)

WMD calculation in a nutshell

WMD evaluation is O(N³); it becomes slow at N ≈ 100

  1. Sort samples by centroid distance
  2. Evaluate first k WMDs
  3. For every subsequent sample, solve the relaxed LP, which gives a lower bound
  4. If it is greater than the farthest among the k NN, go next
  5. Otherwise, evaluate the WMD

This allows us to avoid 95% of WMD evaluations on average
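The steps above can be sketched with stand-ins: here the "documents" are plain vectors, the expensive exact distance is Euclidean, and the cheap estimate is the reverse triangle inequality |‖a‖ − ‖b‖| ≤ ‖a − b‖, which, like the relaxed LP for WMD, never exceeds the true distance.

```python
import math

def exact(a, b):
    # Stand-in for the expensive O(N^3) WMD evaluation.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def lower_bound(a, b):
    # Cheap estimate that never exceeds exact() (reverse triangle inequality);
    # stands in for the relaxed-LP lower bound.
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return abs(na - nb)

def k_nearest(query, samples, k=2):
    # 1. Sort candidates by the cheap estimate.
    order = sorted(samples, key=lambda s: lower_bound(query, s))
    # 2. Evaluate the first k exactly.
    knn = sorted((exact(query, s), s) for s in order[:k])
    evaluated = k
    for s in order[k:]:
        # 3./4. If the lower bound already exceeds the k-th best, skip.
        if lower_bound(query, s) > knn[-1][0]:
            continue
        # 5. Otherwise pay for the exact evaluation.
        evaluated += 1
        knn.append((exact(query, s), s))
        knn.sort()
        knn = knn[:k]
    return [s for _, s in knn], evaluated
```

For example, `k_nearest([0, 0], [[1, 0], [0, 2], [10, 10], [9, 9]])` prunes both far-away samples and performs only the 2 initial exact evaluations.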


source{d}'s approach to WMD. GitHub

We tightened up the relaxed estimation

We used google/or-tools for the heavy lifting

Result: WMD takes 0.1s on |bag| = 500 (pyemd takes 2s)


Babelfish currently supports only Python and Java, status

			pip3 install vecino
			python3 -m vecino
Models will be stored in Google Cloud Storage; they are not released yet. To use the deprecated ones:
			python3 -m vecino --gcs test-ast2vec ...

Advanced Scientific Data Format


If you wish to train the embeddings:

			pip3 install ast2vec
			python3 -m ast2vec repo2coocc -o tfcoocc.asdf \
			python3 -m ast2vec preproc -o swivel_dataset -v 10000 \
    -s 2500 --df df.asdf
			python3 -m ast2vec train --input_base_path swivel_dataset ...
			python3 -m ast2vec postproc swivel_output id2vec.asdf

Neural code completion


Bonus: Hercules


Bonus: blog posts

Thank you

My contacts:

source{d} has a community slack!