Full software development cycle analytics in near real-time

Vadim Markovtsev, Athenian.

The product

Software development lifecycle

  1. Plan coding.
  2. Start coding.
  3. Open the pull request.
  4. Pass the review.
  5. Pass the CI.
  6. Merge the pull request.
  7. Release the pull request.
  8. Deploy the pull request.

Source code analysis?

Money estimations?

Confidence intervals

Focus

GitHub

GitHub API

Node ID

            from base64 import b64decode

            # A legacy GitHub node ID is base64-encoded text.
            print(b64decode(
                "MDY6Q29tbWl0MjI1OTI0NDI3OjA1MTkyMDUwOWJmMTk1N2Y"
                "4NDIzMDdmMzdkZGI0NDFiZmQyNTMwMGY=").decode())
            # prints: 06:Commit225924427:051920509bf1957f842307f37ddb441bfd25300f
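
The decoded string appears to follow the layout `<length of type name>:<type name><numeric ID>:<extra>`. That layout is an assumption read off this single example, not a documented contract, and GitHub has since introduced a different ID format. A minimal sketch of a parser under that assumption (the function name `parse_legacy_node_id` is hypothetical):

```python
import re
from base64 import b64decode


def parse_legacy_node_id(node_id: str):
    """Split a legacy node ID into (type name, numeric ID, extra part).

    Assumes the layout "<len>:<Type><number>[:<extra>]" seen in the example.
    """
    decoded = b64decode(node_id).decode("ascii")
    match = re.fullmatch(r"(\d+):([A-Za-z]+)(\d+)(?::(.+))?", decoded)
    if match is None:
        raise ValueError(f"unexpected node ID layout: {decoded!r}")
    length, type_name, number, extra = match.groups()
    if int(length) != len(type_name):
        raise ValueError(f"type name length mismatch in {decoded!r}")
    return type_name, int(number), extra


parsed = parse_legacy_node_id(
    "MDY6Q29tbWl0MjI1OTI0NDI3OjA1MTkyMDUwOWJmMTk1N2Y"
    "4NDIzMDdmMzdkZGI0NDFiZmQyNTMwMGY=")
print(parsed)
```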
        

Two kinds of data retrieval

Gotcha!

Eventual consistency

Incoming events:
  1. Somebody committed aabbccd as a child of dccbbaa in repo XXX.
  2. Commit aabbccd has message "Fix".
  3. Commit aabbccd belongs to pull request PPP.
  4. New pull request PPP in repo XXX titled "merge me".
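
The four events above may arrive in any order, so a consumer has to accumulate partial facts and decide when an entity is complete. A minimal sketch of that idea follows. The event schema (`kind`, `sha`, `pr`, ...) and the class names are hypothetical and invented for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Commit:
    sha: str
    repo: Optional[str] = None
    parent: Optional[str] = None
    message: Optional[str] = None
    pull_request: Optional[str] = None


class Store:
    """Accumulates partial facts; the event field names are hypothetical."""

    def __init__(self) -> None:
        self.commits: Dict[str, Commit] = {}
        self.pr_titles: Dict[str, str] = {}

    def apply(self, event: dict) -> None:
        kind = event["kind"]
        if kind == "pr_opened":
            self.pr_titles[event["pr"]] = event["title"]
            return
        if kind not in ("commit_pushed", "commit_message", "commit_in_pr"):
            raise ValueError(f"unknown event kind: {kind}")
        # Create the commit record lazily: any fact may be the first to arrive.
        commit = self.commits.setdefault(event["sha"], Commit(event["sha"]))
        if kind == "commit_pushed":
            commit.repo, commit.parent = event["repo"], event["parent"]
        elif kind == "commit_message":
            commit.message = event["message"]
        else:
            commit.pull_request = event["pr"]

    def is_complete(self, sha: str) -> bool:
        """True once the commit, its message and its pull request are all known."""
        commit = self.commits.get(sha)
        if commit is None:
            return False
        return (commit.repo is not None and commit.message is not None
                and commit.pull_request in self.pr_titles)


events = [
    {"kind": "commit_pushed", "sha": "aabbccd", "parent": "dccbbaa", "repo": "XXX"},
    {"kind": "commit_message", "sha": "aabbccd", "message": "Fix"},
    {"kind": "commit_in_pr", "sha": "aabbccd", "pr": "PPP"},
    {"kind": "pr_opened", "pr": "PPP", "repo": "XXX", "title": "merge me"},
]

# Feed the events in reverse order: the record only becomes usable at the end.
store = Store()
states = []
for event in reversed(events):
    store.apply(event)
    states.append(store.is_complete("aabbccd"))
print(states)
```

The final state does not depend on the arrival order; only the moment at which the commit becomes complete does.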

Releases

git push --force

Including git rebase

Continuous Integration

JIRA

JIRA API

JIRA identity matching

Case studies

We did our job too well

Client X

What we did

Dedicated custom analysis of their data showed:

Thank you!
This is a great insight.

Client X.

Do you know why?

Client X.

We must be hiring the wrong people.

Client X.

We switch to

Client X.

Pandas in prod

When to use Pandas

How to use Pandas

Learn how to iterate rows fast.
            for a, b in zip(df["a"].values, df["b"].values):
                print(a, b)
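
For comparison, a small sketch that sets the `zip` idiom against `DataFrame.iterrows()`, which constructs a Series object for every row. The toy frame is made up for illustration; both loops visit the same values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5), "b": np.arange(5) * 10})

# Row-by-row iteration through pandas objects: one Series per row.
slow = [(row["a"], row["b"]) for _, row in df.iterrows()]

# Iteration over the underlying numpy arrays: no per-row pandas objects.
fast = list(zip(df["a"].values, df["b"].values))

print(slow == fast)
```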
        

How to use Pandas

Learn how to select rows fast.
            df[df["a"] > 10]
            # versus
            df.take(np.flatnonzero(df["a"].values > 10))
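
A quick check, on made-up data, that the two selections return the same rows. `take` works with integer positions, so the boolean mask is first converted to positions with `np.flatnonzero`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [5, 20, 7, 30], "b": list("wxyz")})

by_mask = df[df["a"] > 10]                      # boolean indexing
positions = np.flatnonzero(df["a"].values > 10)  # integer row positions
by_take = df.take(positions)

print(by_mask.equals(by_take))
```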
        

How to use Pandas

Learn how to groupby fast.
            for key, sub_df in df.groupby("a", sort=False):
                ...  # slow: builds a DataFrame per group
            # versus
            df.groupby("a", sort=False).indices
            # dict key -> row positions
        

How to use Pandas

Ensure your dtype is not object.
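
A minimal sketch of how to spot `object` columns and convert them to native dtypes. The frame and its contents are made up for illustration.

```python
import pandas as pd

# Force object dtype to imitate data that arrived as generic Python objects.
df = pd.DataFrame({"a": ["1", "2", "3"], "b": [1.5, 2.0, 2.5]}, dtype=object)

# List the columns that still hold Python objects.
object_columns = [c for c in df.columns if df[c].dtype == object]
print(object_columns)

# Convert to native numpy dtypes so that operations run on typed arrays.
df["a"] = df["a"].astype("int64")
df["b"] = pd.to_numeric(df["b"])
print(df.dtypes)
```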

How to live without Pandas

95% of operations can be rewritten in pure numpy.

How to live without Pandas

>>> sort_values("a")
            order = np.argsort(array_a)
            index = order
            array_a = array_a[order]
            array_b = array_b[order]
        

How to live without Pandas

>>> groupby("a")
            keys, indexes, counts = np.unique(
                array_a, return_inverse=True, return_counts=True)
            order = np.argsort(indexes)
            borders = np.cumsum(counts[:-1])
            groups_a = np.split(array_a[order], borders)
            groups_b = np.split(array_b[order], borders)
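
The same recipe on a tiny made-up input, to make the intermediate arrays concrete. The only deviation is `kind="stable"` in `argsort`, an addition here so that rows keep their original order inside each group.

```python
import numpy as np

array_a = np.array([2, 1, 2, 3, 1])
array_b = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

keys, indexes, counts = np.unique(
    array_a, return_inverse=True, return_counts=True)
# Stable sort: rows of the same group stay in their original relative order.
order = np.argsort(indexes, kind="stable")
# Positions at which one group ends and the next one begins.
borders = np.cumsum(counts[:-1])
groups_b = np.split(array_b[order], borders)

grouped = {int(k): g.tolist() for k, g in zip(keys, groups_b)}
print(grouped)
```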
        

How to live without Pandas

>>> groupby("a").mean()
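            keys, indexes, counts = np.unique(
                array_a, return_inverse=True, return_counts=True)
            # masks[i, j] is True when row j belongs to group i
            masks = np.zeros((len(keys), len(array_a)), bool)
            arr_y = np.repeat(np.arange(len(keys)), counts)
            arr_x = np.argsort(indexes)
            masks[arr_y, arr_x] = True
            # broadcast the values to the mask shape so that axis=1 exists
            mean_b = np.mean(np.broadcast_to(array_b, masks.shape),
                             where=masks, axis=1)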
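
An alternative sketch of the grouped mean that swaps the dense mask for `np.bincount`. This is a different technique from the one on the slide: it sums the values per group and divides by the group sizes, so it never allocates a groups × rows matrix. The input arrays are made up for illustration.

```python
import numpy as np

array_a = np.array([2, 1, 2, 3, 1])
array_b = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

keys, indexes, counts = np.unique(
    array_a, return_inverse=True, return_counts=True)
# Sum array_b within each group; indexes maps every row to its group number.
sums = np.bincount(indexes, weights=array_b, minlength=len(keys))
mean_b = sums / counts

print(dict(zip(keys.tolist(), mean_b.tolist())))
```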
        

Tips for Python

Educating the client

Credit: workcompass.com

I can do the same with my script

The book

Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations.

Gene Kim, Jez Humble, and Nicole Forsgren.

Thank you

bit.ly/2RHdp55