Source code abstracts classification using CNN

Vadim Markovtsev, source{d}

Plan

goo.gl/4zq8g9
(view this on your device)
  1. Motivation
  2. Source code feature engineering
  3. Network architecture
  4. Implementation details
  5. First results
  6. Approaches to developers clustering

Motivation

Everything is better with clusters.

Motivation

Customers buy goods, and software developers write code.

Motivation

So to understand the latter, we need to understand what they do and how they do it. Feature origins:

Motivation

(diagram: feature origins)
Motivation

Let's check how deep we can drill into source code style with machine learning.

Toy task: binary classification between two projects, using only features that originate in code style.

Feature engineering

Requirements:

  1. Ignore text files, Markdown, etc.
  2. Ignore autogenerated files
  3. Support many languages with minimal effort
  4. Include as much information about the source code as possible

Feature engineering

(1) and (2) are solved by github/linguist

...source{d} has its own, but no Python bindings exist at the moment

Feature engineering

(3) and (4) are solved by Pygments

Feature engineering

Pygments example:

# prints "Hello, World!"
if True:
    print("Hello, World!")
# prints "Hello, World!"
if True:
    print("Hello, World!")

Feature engineering

Token.Comment.Single	'# prints "Hello, World!"'
Token.Text	'\n'
Token.Keyword	'if'
Token.Text	' '
Token.Name.Builtin.Pseudo	'True'
Token.Punctuation	':'
Token.Text	'\n'
Token.Text	'    '
Token.Keyword	'print'
Token.Punctuation	'('
Token.Literal.String.Double	'"'
Token.Literal.String.Double	'Hello, World!'
Token.Literal.String.Double	'"'
Token.Punctuation	')'
Token.Text	'\n'
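For illustration, here is a minimal sketch of how such a token stream can be produced with Pygments; the file name and snippet are just examples:

# A minimal Pygments tokenization sketch: guess_lexer_for_filename
# addresses requirement (3), and the (token type, value) pairs carry
# the information needed for requirement (4).
from pygments import lex
from pygments.lexers import guess_lexer_for_filename

code = '# prints "Hello, World!"\nif True:\n    print("Hello, World!")\n'
lexer = guess_lexer_for_filename("hello.py", code)
for token_type, value in lex(code, lexer):
    print(token_type, repr(value), sep="\t")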

Feature engineering

Though extracted, identifier names are not used as words to solve this classification problem. They are the subject of separate research and are likely to improve accuracy dramatically.

Network architecture

layer          kernel  pooling  number (filters / neurons)
convolutional  4x1     2x1      250
convolutional  8x2     2x2      200
convolutional  5x6     2x2      150
convolutional  2x10    2x2      100
all2all        -       -        512
all2all        -       -        64
all2all        -       -        output
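The slides do not name a framework, so the following is only a sketch of the table above in PyTorch; the input layout (batch x channel x tokens x features) and the example sizes are assumptions:

import torch
import torch.nn as nn

# Sketch of the layer table above; input is batch x 1 x tokens x features.
class AbstractsCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 250, (4, 1)), nn.ReLU(), nn.MaxPool2d((2, 1)),
            nn.Conv2d(250, 200, (8, 2)), nn.ReLU(), nn.MaxPool2d((2, 2)),
            nn.Conv2d(200, 150, (5, 6)), nn.ReLU(), nn.MaxPool2d((2, 2)),
            nn.Conv2d(150, 100, (2, 10)), nn.ReLU(), nn.MaxPool2d((2, 2)),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),  # all2all 512
            nn.Linear(512, 64), nn.ReLU(),  # all2all 64
            nn.Linear(64, n_classes),       # all2all output
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = AbstractsCNN()(torch.randn(1, 1, 128, 100))  # hypothetical input size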

Network architecture

Activation             ReLU
Optimizer              GD with momentum (0.5)
Learning rate          0.002
Weight decay           0.955
Regularization         L2, 0.0005
Weight initialization  σ = 0.1
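A hedged sketch of wiring these hyperparameters in PyTorch; reading "weight decay 0.955" as a per-epoch learning-rate decay factor is an interpretation (L2 regularization is listed separately), not something the slides state:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for the CNN above

# Weight initialization with sigma = 0.1 (assuming Gaussian)
for p in model.parameters():
    if p.dim() > 1:
        nn.init.normal_(p, std=0.1)

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.002,           # learning rate
    momentum=0.5,       # GD with momentum (0.5)
    weight_decay=5e-4,  # L2 regularization, 0.0005
)
# Assumption: 0.955 is a per-epoch learning-rate decay factor.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.955)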

Implementation details

Normalization: $$D = E(X^2) - E(X)^2$$ Thus we can cache the following for each source file / project: \(\sum X^2\), \(\sum X\), and the number of samples \(N\). Aggregating over a unit: \[m = \frac{\sum_{unit} \sum X}{\sum_{unit} N},\quad \sigma = \sqrt{\frac{\sum_{unit} \sum X^2}{\sum_{unit} N} - m^2}\]
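A small sketch of this caching scheme (the function names are hypothetical):

import numpy as np

# Per file we cache only (sum of X^2, sum of X, N); mean and sigma for
# any aggregation unit are then recovered from the summed caches.
def cache(samples):
    x = np.asarray(samples, dtype=np.float64)
    return x @ x, x.sum(), x.size

def unit_mean_sigma(caches):
    sum_x2 = sum(c[0] for c in caches)
    sum_x = sum(c[1] for c in caches)
    n = sum(c[2] for c in caches)
    m = sum_x / n
    return m, np.sqrt(sum_x2 / n - m * m)

m, sigma = unit_mean_sigma([cache([1.0, 2.0]), cache([3.0, 4.0, 5.0])])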

First results

projects              description                                 size                accuracy
Django vs Twisted     Web frameworks, Python                      800ktok each        84%
Matplotlib vs Bokeh   Plotting libraries, Python                  1Mtok vs 250ktok    60%
Matplotlib vs Django  Plotting library vs web framework, Python   1Mtok vs 800ktok    76%
Django vs Guava       Python vs Java                              800ktok             >99%
Hibernate vs Guava    Java libraries                              3Mtok vs 800ktok    96%

First results

Conclusion: the network most likely extracts the internal similarity within each project and exploits it, just like humans do.

If the languages are different, it is very easy to distinguish projects (at least because of unique token types).

First results

Problem: how to get this for a source code network?

Approaches to developers clustering

GitHub has ≈6M active users (≈3M after some filtering). If we can extract various features for each of them, we can cluster the developers. Current approach:

  1. Run K-means with K=45000 (using src-d/kmcuda)
  2. Run t-SNE to visualize the landscape

BTW, kmcuda implements Yinyang k-means; a minimal usage sketch follows.
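The sketch assumes one feature vector per developer; the random data below is a placeholder:

import numpy as np
from libKMCUDA import kmeans_cuda  # Python bindings of src-d/kmcuda
from sklearn.manifold import TSNE

# Placeholder developer feature matrix: one float32 row per developer.
features = np.random.rand(1000000, 64).astype(np.float32)

# Step 1: K-means with K=45000 on the GPU (Yinyang k-means inside).
centroids, assignments = kmeans_cuda(features, 45000, verbosity=1, seed=3)

# Step 2: t-SNE over the centroids to visualize the landscape.
embedding = TSNE(n_components=2).fit_transform(centroids)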

Approaches to developers clustering

Another way is to employ Distance Metric Learning (DML).
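For example, a triplet loss sketch; the encoder and the triplet sampling here are assumptions, not the talk's concrete setup:

import torch
import torch.nn as nn

# Learn an embedding in which code written by the same developer is close
# and code by different developers is far apart.
encoder = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 64))
loss_fn = nn.TripletMarginLoss(margin=1.0)

anchor, positive, negative = (torch.randn(32, 256) for _ in range(3))
loss = loss_fn(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()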

Thank you

We are hiring!
