Splitting source code identifiers using BiLSTM

Vadim Markovtsev, Waren Long, Egor Bulychev
Romain Keramitas, Konstantin Slavnov, Gabor Markowski
source{d}

Objective

Naming conventions

- UpperCamelCase
- lowerCamelCase
- lower_underscore
- UPPER_UNDERSCORE

Problem

FooBarBaz = [foo, bar, baz]
foobarbaz = ?

Applications

Dataset

Processing

  1. Extract distinct identifiers from big code
  2. Split them by naming convention heuristics
  3. Use the resulting sequences as labels
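Step 2 above can be sketched as a small heuristic splitter. The regex below is illustrative, not the exact production heuristics:

```python
import re

# Matches, inside one underscore-free chunk: an uppercase run (acronym),
# a capitalized word, a lowercase run, or a digit run.
TOKEN_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|[0-9]+")

def split_identifier(identifier):
    """Split an identifier into lowercase subtokens using
    naming-convention heuristics (camel case and underscores)."""
    return [token.lower()
            for chunk in identifier.split("_")
            for token in TOKEN_RE.findall(chunk)]
```

split_identifier("FooBarBaz") yields [foo, bar, baz], but "foobarbaz" stays a single token: the conventions carry no signal there, which is the gap the trained model has to close.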

Public Git Archive (PGA)

pga.sourced.tech

Identifiers extraction

Result: 49.2 million distinct identifiers.

Labels

Result: 35.6 million distinct labeled samples.

Example

FooBarBaz -> [foo, bar, baz]

X: foobarbaz

Y: [foo, bar, baz]
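One plausible encoding of such a sample (an assumption for illustration; the exact label format may differ) marks every character position that starts a new subtoken:

```python
def make_sample(subtokens):
    """Build (X, Y): X is the subtokens joined without separators,
    Y[i] = 1 iff a new subtoken starts at character i (i > 0)."""
    x = "".join(subtokens)
    y = [0] * len(x)
    pos = 0
    for token in subtokens[:-1]:
        pos += len(token)
        y[pos] = 1  # boundary before the next subtoken
    return x, y
```

make_sample(["foo", "bar", "baz"]) gives X = "foobarbaz" and Y = [0, 0, 0, 1, 0, 0, 1, 0, 0].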

Length threshold

Result: 34.4 million distinct labeled samples.

Why 40?

Heuristics

Released

bit.ly/2zO8rJP

Per language

Baselines

Unsmoothed character-level model

Dynamic programming
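This baseline can be sketched as Viterbi-style dynamic programming over subtoken frequencies. The tiny frequency table here is purely illustrative; the real baseline estimates frequencies from the corpus (e.g. under a Zipf prior):

```python
import math

# Illustrative frequencies; the actual baseline derives these from big code.
FREQ = {"foo": 300, "bar": 200, "baz": 100, "foob": 1, "arbaz": 1}
TOTAL = sum(FREQ.values())

def segment(s, max_len=20):
    """best[i] = (log-probability, split) of the best segmentation of the
    prefix s[:i]; extend each solved prefix one known word at a time."""
    best = [(0.0, [])] + [(-math.inf, None)] * len(s)
    for i in range(1, len(s) + 1):
        for j in range(max(0, i - max_len), i):
            word = s[j:i]
            if word in FREQ and best[j][1] is not None:
                score = best[j][0] + math.log(FREQ[word] / TOTAL)
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[-1][1]  # None if the string cannot be segmented
```

segment("foobarbaz") returns [foo, bar, baz] because the frequent three-word split outscores the rare "foob" + "arbaz" path.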


Gradient boosting on decision trees

Character-level Convolutional Neural Network

BiLSTM

Single layer

Architecture
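A minimal Keras sketch of such an architecture; the layer sizes and the per-character sigmoid output head are assumptions for illustration, not the reported hyperparameters:

```python
import tensorflow as tf

MAXLEN = 40  # the identifier length threshold used for the dataset
VOCAB = 64   # alphabet size (illustrative)

def build_model(units=32):
    """Character-level BiLSTM tagger: for each character position,
    predict the probability that a new subtoken starts there."""
    chars = tf.keras.Input(shape=(MAXLEN,), dtype="int32")
    x = tf.keras.layers.Embedding(VOCAB, 16)(chars)
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(units, return_sequences=True))(x)
    probs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(chars, probs)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```

The bidirectional wrapper runs one LSTM left-to-right and one right-to-left, so every split decision sees both the prefix and the suffix of the identifier.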

Training performance

Evaluation

How we evaluated
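Precision, recall, and F1 can be computed over the predicted boundary positions; this is a sketch assuming per-split-point evaluation:

```python
def precision_recall_f1(true_splits, pred_splits):
    """Compare the sets of character positions where subtoken
    boundaries were placed."""
    true_set, pred_set = set(true_splits), set(pred_splits)
    tp = len(true_set & pred_set)  # correctly predicted boundaries
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(true_set) if true_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1
```

For example, predicting only one of the two true boundaries in "foobarbaz" gives precision 1.0, recall 0.5, and F1 2/3.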

Comparison

Model                        Precision  Recall  F1
Char. ML LM → ∨ ←                0.563   0.936  0.703
Char. ML LM → ∧ ←                0.966   0.573  0.719
Stat. dyn. prog., Wiki           0.741   0.912  0.818
Stat. dyn. prog., Zipf           0.937   0.783  0.853
Stat. dyn. prog., posterior      0.931   0.892  0.911
GBDT                             0.931   0.924  0.928
Char. CNN                        0.922   0.938  0.930
Char. BiGRU                      0.945   0.955  0.949
Char. BiLSTM                     0.947   0.958  0.952

Core vocabulary size

Summary

What we did

Future work

Thank you

bit.ly/2mlUh9d