Open Source Stack for Machine Learning on Source Code

Vadim Markovtsev, source{d}

Read this on your device

About me

About source{d}

Master plan

Our projects

How do you guys make money?

Currently, we don't. Right now, we're focused on research, development and developer adoption

We want to be the machine learning pipeline on top of the world's source code, in both public and private repositories

In the future, we expect to make revenue by charging enterprises for deploying this pipeline on their code

Machine learning on source code



Will the coding AI take our jobs?

Relax, we are very far from it

... won’t ask what problems can be solved with computers alone. Instead, ... ask: how can computers help humans solve hard problems? ... They won’t just get better at the kinds of things people already do; they’ll help us to do what was previously unimaginable.

Peter Thiel, Zero to One

Approaches to extract information from the source code

  1. Treat it as regular text and feed it to a black box
  2. Tokenize it using regexps (Pygments, highlight.js)
  3. Parse ASTs
  4. Compile
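Approach 3 can be illustrated with Python's own stdlib `ast` module, a toy single-language stand-in for a real parser pipeline such as Babelfish: parse a snippet and collect every identifier from its tree.

```python
import ast

# Toy illustration of approach 3: extract identifiers from an AST.
# A multi-language pipeline would use a universal parser instead.
source = """
def quick_sort(arr):
    cat = models.Cat()
"""

names = set()
for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.Name):           # variable references: cat, models
        names.add(node.id)
    elif isinstance(node, ast.FunctionDef):  # function names: quick_sort
        names.add(node.name)
    elif isinstance(node, ast.arg):          # arguments: arr
        names.add(node.arg)

print(sorted(names))  # ['arr', 'cat', 'models', 'quick_sort']
```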


Universal Abstract Syntax Tree (UAST)

Shared node roles for every language

			pip3 install bblfsh
			python3 -m bblfsh -f /path/to/file
We are going to deploy bblfsh/dashboard



Source code classification

Problem: given millions of repositories, map their files to languages
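A naive baseline for this problem looks at file extensions and falls back to the shebang line; the `EXT_TO_LANG` table and `classify` helper below are hypothetical, and a production classifier (in the spirit of GitHub's Linguist) needs content-based heuristics to break ties.

```python
# Hypothetical baseline classifier: extension first, shebang as fallback.
EXT_TO_LANG = {".py": "Python", ".java": "Java", ".go": "Go", ".js": "JavaScript"}

def classify(path, first_line=""):
    for ext, lang in EXT_TO_LANG.items():
        if path.endswith(ext):
            return lang
    # Extensionless scripts: inspect the shebang line.
    if first_line.startswith("#!") and "python" in first_line:
        return "Python"
    return "Unknown"

print(classify("sort.java"))                        # Java
print(classify("setup", "#!/usr/bin/env python3"))  # Python
```

Ambiguous extensions (e.g. `.h`, `.m`) are exactly where such a baseline breaks down and content-based classification becomes necessary.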

Identifiers in source code

Names rule.

Adam Tornhill, Your Code as a Crime Scene
			cat = models.Cat()
			def quick_sort(arr):



We apply the Snowball stemmer to words longer than 6 characters
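Before stemming, identifiers have to be split into words. A minimal sketch of that splitting step, assuming the usual snake_case and camelCase conventions (the actual pipeline then runs the Snowball stemmer on the resulting words longer than 6 characters):

```python
import re

# Split snake_case / camelCase identifiers into lowercase words.
def split_identifier(name):
    parts = re.split(r"[_\W]+", name)
    words = []
    for part in parts:
        # camelCase, ALLCAPS runs, and digit runs become separate words.
        words.extend(m.group(0)
                     for m in re.finditer(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part))
    return [w.lower() for w in words if w]

print(split_identifier("quick_sort"))      # ['quick', 'sort']
print(split_identifier("XMLHttpRequest"))  # ['xml', 'http', 'request']
```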

Weighted BOW

Let's interpret every repository as a weighted bag-of-words

We calculate TF-IDF to weigh the occurring identifiers
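A minimal sketch of that weighting over toy "repositories" (bags of identifiers); the exact weighting scheme in the pipeline may differ, and this uses the classic tf · log(N/df) form:

```python
import math

# Toy corpus: each repository is a bag of its identifiers.
repos = {
    "tensorflow": ["tfreturn", "opkernel", "tfreturn", "sycl"],
    "webapp":     ["render", "request", "request"],
}
N = len(repos)

# Document frequency: in how many repositories does each word occur?
df = {}
for bag in repos.values():
    for word in set(bag):
        df[word] = df.get(word, 0) + 1

def tfidf(repo, word):
    tf = repos[repo].count(word)
    return tf * math.log(N / df[word])

print(round(tfidf("tensorflow", "tfreturn"), 3))
```

Identifiers that occur in every repository get a weight of zero, which is why generic names like `i` or `tmp` do not show up in the ranking above.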

    tfreturn	67.78
  oprequires	63.97
      doblas	63.71
    gputools	62.34
    tfassign	61.55
    opkernel	60.72
        sycl	57.06
         hlo	55.72
     libxsmm	54.82
  tfdisallow	53.67

Topic modeling

We can run topic modeling on these features

Paper on arXiv

Run on the whole world -> funny tagging

Run per ecosystem -> useful exploratory search

Topic #36 "Movies" →

Clustering TM vectors

CTO Máximo Cuadros' profile →


Identifier embeddings


Swivel: an embedding engine that works on co-occurrence matrices

Like GloVe, but better - paper on arXiv

We forked the TensorFlow implementation

Linux kernel embeddings

Launch TensorFlow Projector
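Once identifier embeddings are trained, related identifiers can be found by cosine similarity. A toy illustration, with made-up 3-d vectors standing in for the real Swivel-trained ones:

```python
import math

# Hypothetical embeddings; real id2vec vectors are high-dimensional.
embeddings = {
    "mutex":    [0.9, 0.1, 0.0],
    "spinlock": [0.8, 0.2, 0.1],
    "sort":     [0.0, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def nearest(word):
    # The most similar identifier other than the query itself.
    return max((w for w in embeddings if w != word),
               key=lambda w: cosine(embeddings[word], embeddings[w]))

print(nearest("mutex"))  # spinlock
```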


What if we combine bag-of-words model with embeddings?

We focus on finding related repositories

Word Mover's Distance (WMD)

WMD calculation in a nutshell

WMD evaluation is O(N³); it becomes slow at N ≈ 100

  1. Sort samples by centroid distance
  2. Evaluate first k WMDs
  3. For every subsequent sample, solve the relaxed LP, which gives a lower bound
  4. If it is greater than the farthest among the k NN, go next
  5. Otherwise, evaluate the WMD

This allows us to avoid 95% of WMD evaluations on average
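The steps above can be sketched with stand-ins: here the "documents" are plain vectors, the expensive exact distance is Euclidean, and the cheap estimate is the reverse triangle inequality |‖a‖ − ‖b‖| ≤ ‖a − b‖, which, like the relaxed LP for WMD, never exceeds the true distance.

```python
import math

def exact(a, b):
    # Stand-in for the expensive O(N^3) WMD evaluation.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def lower_bound(a, b):
    # Cheap estimate that never exceeds exact() (reverse triangle inequality);
    # stands in for the relaxed-LP lower bound.
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return abs(na - nb)

def k_nearest(query, samples, k=2):
    # 1. Sort candidates by the cheap estimate.
    order = sorted(samples, key=lambda s: lower_bound(query, s))
    # 2. Evaluate the first k exactly.
    knn = sorted((exact(query, s), s) for s in order[:k])
    evaluated = k
    for s in order[k:]:
        # 3./4. If the lower bound already exceeds the k-th best, skip.
        if lower_bound(query, s) > knn[-1][0]:
            continue
        # 5. Otherwise pay for the exact evaluation.
        evaluated += 1
        knn.append((exact(query, s), s))
        knn.sort()
        knn = knn[:k]
    return [s for _, s in knn], evaluated
```

For example, `k_nearest([0, 0], [[1, 0], [0, 2], [10, 10], [9, 9]])` prunes both far-away samples and performs only the 2 initial exact evaluations.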


source{d}'s approach to WMD. GitHub

We tightened up the relaxed estimation

We used google/or-tools for the heavy lifting

Result: WMD takes 0.1s on |bag| = 500 (pyemd takes 2s)


Babelfish currently supports only Python and Java, status

			pip3 install vecino
			python3 -m vecino
Models will be stored in Google Cloud Storage; they are not released yet. To use the deprecated ones:
			python3 -m vecino --gcs test-ast2vec ...

Advanced Scientific Data Format


If you wish to train the embeddings:

			pip3 install ast2vec
			python3 -m ast2vec repo2coocc -o tfcoocc.asdf \
			python3 -m ast2vec preproc -o swivel_dataset -v 10000 \
    -s 2500 --df df.asdf
			python3 -m ast2vec train --input_base_path swivel_dataset ...
			python3 -m ast2vec postproc swivel_output id2vec.asdf

Neural code completion


Bonus: Hercules


Bonus: blog posts

Thank you

My contacts:

source{d} has a community slack!