Machine learning on source code
Origins
- GitHub
- 🚀Machine Learning hype
- 💰Big companies write piles of 💩 code
- 📜We still write code as in the 2000s
Projects similar to MariaDB/server?
Similar code detection
- By style
- By structure
- By identifiers
provides us
- 💪Global graph
- Licenses
- Refactoring
class foobar:
    def connecttoserver(self):
        myserverhost = globalconfig.server.host

class FooBar:
    def connect_to_server(self):
        myServerHost = globalConfig.server.host
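The two snippets above differ only in how identifiers are written. As a minimal sketch of rule-based identifier splitting (the function name `split_identifier` is hypothetical and not part of any source{d} tool), the second form can be broken into tokens with a regular expression, while the first cannot:

```python
import re


def split_identifier(name):
    """Split an identifier into lowercase tokens on separators and case changes."""
    tokens = []
    # First cut on anything that is not a letter or digit (underscores, dots, ...).
    for part in re.split(r"[^A-Za-z0-9]+", name):
        # Then cut camelCase / PascalCase parts; acronyms such as "HTTP" stay whole.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part))
    return [token.lower() for token in tokens]


print(split_identifier("connect_to_server"))
print(split_identifier("myServerHost"))
print(split_identifier("connecttoserver"))
```

The last call returns the whole name as a single token: without naming conventions there is nothing for a rule to cut on, which is where a learned splitter would have to take over.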
- Prediction of class, function, variable names
- Type inference
- Which comments do not make sense?
- Which comments are funny?
- Which APIs are bad or misused?
Your code is a crime scene
Tools for MLonCode
PGA
- 270k siva files
- CSV index
Details in the paper.
PGA index
- URL
- FILE_COUNT
- LANGS (bytes, strings, files)
- COMMITS_[HEAD]_COUNT
- BRANCHES_COUNT
- LINES_COUNT (empty, comments, code)
- LICENSE
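A minimal sketch of filtering such a CSV index with the standard library. The sample rows are invented for illustration, only three of the columns listed above are used, and the exact header layout of the real index file may differ:

```python
import csv
import io

# Hypothetical rows; the real index is far larger and may name columns differently.
SAMPLE_INDEX = """URL,FILE_COUNT,LICENSE
https://example.com/org/alpha,120,MIT
https://example.com/org/beta,45,Apache-2.0
https://example.com/org/gamma,300,MIT
"""


def select_repositories(index_file, license_name, min_files=0):
    """Return URLs of repositories with the given license and at least min_files files."""
    reader = csv.DictReader(index_file)
    return [
        row["URL"]
        for row in reader
        if row["LICENSE"] == license_name and int(row["FILE_COUNT"]) >= min_files
    ]


urls = select_repositories(io.StringIO(SAMPLE_INDEX), "MIT", min_files=100)
print(urls)
```

With a real index, `io.StringIO(SAMPLE_INDEX)` would be replaced by an opened file, for example `open(path, newline="")`.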
source{d} engine
- siva loader in Spark
- Classification, parsing
- Apache license
- GitHub
source{d} engine
>>> from sourced.engine import Engine
>>> engine = Engine(spark, "/path/to/siva/files", "siva")
>>> engine.repositories.references.head_ref \
...     .commits.tree_entries.blobs \
...     .classify_languages() \
...     .select("blob_id", "path", "lang") \
...     .show()
Universal AST
- ± uniform structure
- ± standard node types (roles)
- XPath queries
- 4 traversal orders
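The slide does not name the four traversal orders. The sketch below assumes they are pre-order, post-order, level-order and position order, and shows them on a toy tree for the expression `a * b + c`. The `Node` class and its `offset` field are hypothetical and are not the Babelfish node type:

```python
from collections import deque


class Node:
    """Toy tree node with a name, a source offset and child nodes."""

    def __init__(self, name, offset, children=()):
        self.name = name
        self.offset = offset  # hypothetical start offset in the source text
        self.children = list(children)


def pre_order(node):
    # Parent first, then each subtree from left to right.
    yield node
    for child in node.children:
        yield from pre_order(child)


def post_order(node):
    # Each subtree first, parent last.
    for child in node.children:
        yield from post_order(child)
    yield node


def level_order(node):
    # Breadth-first: all nodes of one depth before the next depth.
    queue = deque([node])
    while queue:
        current = queue.popleft()
        yield current
        queue.extend(current.children)


def position_order(node):
    # Nodes sorted by where they start in the source text.
    return sorted(pre_order(node), key=lambda n: n.offset)


# "a * b + c": a at 0, * at 2, b at 4, + at 6, c at 8.
tree = Node("add", 6, [Node("mul", 2, [Node("a", 0), Node("b", 4)]), Node("c", 8)])

for order in (pre_order, post_order, level_order, position_order):
    print(order.__name__, [n.name for n in order(tree)])
```

On this tree all four orders give a different sequence, which is why a client usually lets the caller choose one.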
dashboard.bblf.sh
Python client
pip3 install bblfsh
python3 -m bblfsh -f /path/to/file
import bblfsh
client = bblfsh.BblfshClient("0.0.0.0:9432")
client.parse("/file").uast
bblfsh/client-python
Integration
>>> engine.repositories.references.head_ref \
...     .commits.tree_entries.blobs \
...     .classify_languages() \
...     .filter('lang = "Python"') \
...     .extract_uasts() \
...     .query_uast('//*[@roleIdentifier]') \
...     .extract_tokens("result", "tokens") \
...     .select("blob_id", "path", "tokens")
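The XPath step in the pipeline above can be tried without Spark or a Babelfish server. This is a rough sketch on a hand-written XML tree. The element names and the `token` attribute are assumptions made for illustration, and the standard library's `ElementTree` supports only a subset of XPath:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML rendering of a tiny tree; real UASTs come from a Babelfish server.
TOY_UAST = """
<File>
  <FunctionDef token="connect_to_server" roleIdentifier="true" roleFunction="true">
    <Assign>
      <Name token="myServerHost" roleIdentifier="true"/>
      <Literal token="8080" roleNumber="true"/>
    </Assign>
  </FunctionDef>
</File>
"""

root = ET.fromstring(TOY_UAST)

# ".//*[@roleIdentifier]" selects every descendant that carries the attribute,
# mirroring the '//*[@roleIdentifier]' query used in the pipeline.
tokens = [node.get("token") for node in root.findall(".//*[@roleIdentifier]")]
print(tokens)
```

Only the two nodes marked as identifiers are returned, in document order. The literal is skipped because it lacks the attribute.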
sourced-ml
- GitHub
- Python 3.4+
- Uses TensorFlow, humanize, clint, tqdm
Requirements
- Automatically fetch from the internet
- "Model Store"
- Flexible, modern binary format (not HDF5)
- Versioning
- Reproducibility
- Support various programming languages
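To make the versioning and reproducibility requirements concrete, here is a minimal standard-library sketch of the metadata a model store could keep for each model. All function and field names are hypothetical. This is not the Modelforge API or its file format:

```python
import hashlib
import json


def make_metadata(name, version, payload, parent=None):
    """Describe a serialized model; payload is the model file's bytes."""
    return {
        "model": name,
        "version": list(version),  # e.g. [1, 2, 0]
        "sha256": hashlib.sha256(payload).hexdigest(),
        "parent": parent,  # hash of the model this one was derived from, if any
    }


def verify(metadata, payload):
    """Check that the downloaded bytes match what the index promised."""
    return hashlib.sha256(payload).hexdigest() == metadata["sha256"]


def latest(index, name):
    """Pick the highest version of a model, or None if it is not in the index."""
    candidates = [m for m in index if m["model"] == name]
    return max(candidates, key=lambda m: m["version"], default=None)


first = make_metadata("identifiers", (1, 0, 0), b"weights-v1")
second = make_metadata("identifiers", (1, 2, 0), b"weights-v2", parent=first["sha256"])
index = [first, second]

print(json.dumps(latest(index, "identifiers"), indent=2))
```

Storing the content hash next to the version lets a client fetch a model automatically and still detect a corrupted or silently replaced file. The `parent` field records which model a new one was derived from.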
Modelforge
- Automatically fetch from the internet
- "Model Store"
- Flexible, modern binary format - ASDF
- Versioning
- Reproducibility
- Support various programming languages
- GitHub
Summary
- Machine Learning on Source Code is fun
- There is data
- There are tools