Public Git Archive

Vadim Markovtsev, Waren Long, source{d} - MSR'18.

sourced.tech

10x bigger than any other dataset

100x saved time

PGA in a nutshell

Language stats

Generation

  1. Fetch GHTorrent
  2. Threshold by 50 stars
  3. Clone with borges
  4. Pack to siva
  5. Index with borges-indexer
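
The thresholding step can be illustrated with a minimal sketch. It reads the threshold as a minimum star count and treats it as inclusive, which is an assumption. The repository names and star counts are hypothetical, and `select_repositories` is an illustrative helper, not part of borges or GHTorrent.

```python
# Assumption: "threshold by 50" means keeping repositories with >= 50 stars.
STAR_THRESHOLD = 50


def select_repositories(star_counts, threshold=STAR_THRESHOLD):
    """Return the names of repositories with at least `threshold` stars, sorted."""
    return sorted(name for name, stars in star_counts.items() if stars >= threshold)


# Hypothetical sample; real star counts would come from the GHTorrent dump.
sample = {"org/alpha": 1200, "org/beta": 49, "org/gamma": 50}
selected = select_repositories(sample)
print(selected)
```

Only the repositories that pass this filter would move on to the cloning step.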

siva

90% of repositories constitute 50% of the disk size
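
A statistic of this kind could be computed along the following lines, as a rough sketch. The sizes are made-up values chosen only to reproduce a 90%/50% split, and `smallest_share` is a hypothetical helper.

```python
def smallest_share(sizes, percent):
    """Fraction of the total size held by the smallest `percent`% of repositories."""
    ordered = sorted(sizes)
    count = len(ordered) * percent // 100  # integer arithmetic avoids float rounding
    total = sum(ordered)
    return sum(ordered[:count]) / total if total else 0.0


# Hypothetical sizes in arbitrary units: nine small repositories and one large one.
sizes = [1, 1, 1, 1, 1, 1, 1, 1, 1, 9]
share = smallest_share(sizes, 90)
print(share)
```

With a skewed distribution like this one, a small number of very large repositories dominates storage and download time.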

Getting PGA

pga.sourced.tech
pga list - download and list the index
pga get - download the siva files
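
Once the index is available locally, it can be filtered before downloading anything. The sketch below assumes the index is CSV with a `URL` column and a comma-separated `LANGS` column. Those column names, the sample rows, and the helper `repositories_with_language` are all assumptions made for illustration and should be checked against the real index.

```python
import csv
import io

# Hypothetical index excerpt; the column names are an assumption about the layout.
SAMPLE_INDEX = '''URL,LANGS
https://github.com/example/a,"Python,Shell"
https://github.com/example/b,"Go"
'''


def repositories_with_language(index_text, language):
    """Return the URLs of index rows whose LANGS field lists `language`."""
    reader = csv.DictReader(io.StringIO(index_text))
    return [row["URL"] for row in reader if language in row["LANGS"].split(",")]


python_repos = repositories_with_language(SAMPLE_INDEX, "Python")
print(python_repos)
```

Filtering on the index first keeps the download limited to the siva files that are actually needed.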

Using PGA

  1. siva unpack
    + regular Git
  2. go-git
  3. source{d} engine
        >>> from sourced.engine import Engine
        >>> engine = Engine(spark, "/path/to/siva/files", "siva")
        >>> engine.repositories.references.head_ref \
        ...     .commits.tree_entries.blobs \
        ...     .classify_languages() \
        ...     .filter('lang = "Python"') \
        ...     .extract_uasts() \
        ...     .query_uast('//*[@roleIdentifier]') \
        ...     .extract_tokens("result", "tokens") \
        ...     .select("blob_id", "path", "tokens")

Legal

Licenses

Via go-license-detector

Thank you

bit.ly/2x34csF