Using deep RNN to model source code

Eiso Kant, source{d}

Plan

goo.gl/wRQCLS
(view this on your device)
  1. Motivation
  2. Source code feature engineering
  3. Network architecture
  4. Results
  5. Other work

Motivation

Everything is better with clusters.

Motivation

Customers buy goods, and software developers write code.

Motivation

So to understand the latter, we need to understand what they do and how they do it. Feature origins:

Feature engineering

Requirements:

  1. Ignore text files, Markdown, etc.
  2. Ignore autogenerated files
  3. Support many languages with minimal effort
  4. Include as much information about the source code as possible

Feature engineering

(1) and (2) are solved by github/linguist

...source{d} has its own, which behaves the same but runs 10x faster
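The filtering idea can be sketched in a few lines: classify files by extension and drop documentation and generated files. The extension set and generated-file markers below are illustrative assumptions, not linguist's (or source{d}'s) actual rules:

```python
import os

# Illustrative lists -- linguist ships much larger, curated ones.
DOC_EXTENSIONS = {".md", ".txt", ".rst"}
GENERATED_MARKERS = ("Code generated by", "Autogenerated", "DO NOT EDIT")

def keep_file(path, first_lines):
    """Return True if a file should be fed to the model."""
    ext = os.path.splitext(path)[1].lower()
    if ext in DOC_EXTENSIONS:
        return False  # requirement (1): skip text/Markdown files
    head = "\n".join(first_lines)
    if any(marker in head for marker in GENERATED_MARKERS):
        return False  # requirement (2): skip autogenerated files
    return True
```

For example, `keep_file("README.md", [])` and a file whose header says "DO NOT EDIT" would both be rejected.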

Feature engineering

(3) and (4) are solved by Pygments

Feature engineering

Pygments example:

# prints "Hello, World!"
if True:
    print("Hello, World!")

Feature engineering

Token.Comment.Single	'# prints "Hello, World!"'
Token.Text	'\n'
Token.Keyword	'if'
Token.Text	' '
Token.Name.Builtin.Pseudo	'True'
Token.Punctuation	':'
Token.Text	'\n'
Token.Text	'    '
Token.Keyword	'print'
Token.Punctuation	'('
Token.Literal.String.Double	'"'
Token.Literal.String.Double	'Hello, World!'
Token.Literal.String.Double	'"'
Token.Punctuation	')'
Token.Text	'\n'
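A token stream like the one above can be produced with Pygments' `lex` helper (a minimal sketch):

```python
from pygments import lex
from pygments.lexers import PythonLexer

code = '# prints "Hello, World!"\nif True:\n    print("Hello, World!")\n'

# lex() yields (token_type, text) pairs, e.g. (Token.Keyword, 'if')
tokens = [(str(ttype), value) for ttype, value in lex(code, PythonLexer())]
for ttype, value in tokens:
    print(ttype, repr(value))
```

Exact token names vary slightly between Pygments versions (e.g. `True` is `Name.Builtin.Pseudo` in older releases, `Keyword.Constant` in newer ones).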

Feature engineering

Though extracted, identifier names as words are not used in this scheme.

We explored two approaches to using this extra information:

  1. LSTM sequence modelling (the topic of this talk)
  2. ARTM topic modelling (article in our blog)

Network architecture

Idea: apply NLP-style LSTM to Python source code.

We will experiment with Django.

Parse the code into tokens with Pygments and treat each as a word.

Comments and strings should be squashed: we do not want NLP inside our NLP.
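Squashing can be done by collapsing every comment or string token to a fixed placeholder; a sketch using Pygments' token-class hierarchy (the placeholder names are our choice):

```python
from pygments import lex
from pygments.lexers import PythonLexer
from pygments.token import Comment, String

def squash(code):
    """Lex Python code; replace comment/string token text with placeholders."""
    out = []
    for ttype, value in lex(code, PythonLexer()):
        if ttype in Comment:        # `in` checks the token-type hierarchy
            out.append("<comment>")
        elif ttype in String:
            out.append("<string>")
        else:
            out.append(value)
    return out
```

Pygments emits a string literal as several tokens (opening quote, body, closing quote), so adjacent `<string>` placeholders may want merging in a real pipeline.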

Network architecture






{ "if": 1, "self": 2, ".": 3, "active": 4, "(": 5, ")": 6,
  "and": 7, "Weather": 8, "raining": 9, "\n": 10,
  "open_umbrella": 11, "*": 12, "hands": 13 }
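A mapping like this can be built from the token stream by assigning consecutive ids in order of first appearance (a minimal sketch; the token list is illustrative):

```python
def build_vocab(tokens, start=1):
    """Assign consecutive integer ids to tokens, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = start + len(vocab)
    return vocab

tokens = ["if", "self", ".", "active", "(", ")", "and",
          "Weather", ".", "raining", "(", ")", ":", "\n"]
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]   # the sequence fed to the network
```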

Network architecture

Model from "Recurrent Neural Network Regularization" by Zaremba et al. ([arXiv](http://arxiv.org/abs/1409.2329))

Network architecture

Network architecture

Optimizer: GD with gradient clipping at 5.0
Learning rate: 1.0
Weight decay: 0.5 after 4 epochs
History size: 20
First layer size: 200
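If the decay row follows Zaremba et al.'s setup (our reading: the learning rate is multiplied by 0.5 each epoch after the fourth), the schedule can be sketched as:

```python
def learning_rate(epoch, base_lr=1.0, decay=0.5, decay_after=4):
    """Learning rate for a 1-indexed epoch: constant for the first
    `decay_after` epochs, then multiplied by `decay` every epoch after."""
    return base_lr * decay ** max(0, epoch - decay_after)
```

So epochs 1-4 train at 1.0, epoch 5 at 0.5, epoch 6 at 0.25, and so on.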

Network architecture

Implementation:

Results

Can be useful for

Other work - KMeans for devs clustering

src-d/kmcuda

Other work - CNN for code classification

Thank you

We are hiring!
