Full software development cycle analytics in near real-time
Vadim Markovtsev
Athenian
Software development lifecycle
- Plan coding.
- Start coding.
- Open the pull request.
- Pass the review.
- Pass the CI.
- Merge the pull request.
- Release the pull request.
- Deploy the pull request.
Money estimates?
- Bullshit in 99% of the cases.
- Require a great cost model specific to the client.
- Require knowing the salaries.
Confidence intervals
- The distributions are neither normal, log-normal, nor any other well-known shape.
- Some are "semi-discrete", ≥ 0.
- We use the bootstrap method.
- Very fast after optimizations.
- Cached randomness.
- Control the number of iterations.
- 95%: ±50%
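The bootstrap with cached randomness can be sketched roughly like this; the function name and parameters are illustrative, not Athenian's actual API:

```python
import numpy as np

def bootstrap_ci(samples, n_iter=1000, confidence=0.95, seed=7):
    """Percentile-bootstrap CI of the mean.

    The seed is fixed so the resampling indices are reproducible; in a
    real service the index matrix can be generated once per array length
    and cached (the "cached randomness" above).
    """
    samples = np.asarray(samples, dtype=float)
    rng = np.random.default_rng(seed)
    # n_iter resamples with replacement, drawn as one index matrix
    idx = rng.integers(0, len(samples), size=(n_iter, len(samples)))
    means = samples[idx].mean(axis=1)
    alpha = 1 - confidence
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return lo, hi

lo, hi = bootstrap_ci([1, 2, 3, 4, 100])
```

Controlling `n_iter` trades precision of the interval for speed, which is why the iteration count is worth exposing as a knob.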
Focus
- Support percentiles everywhere.
- Average vs. median.
- Average on 95th percentile.
- Stale garbage.
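The difference between the average, the median, and an "average on the 95th percentile" (a trimmed mean that drops stale outliers) can be shown on a tiny skewed sample; the numbers are made up:

```python
import numpy as np

# one stale outlier (200) dominates the plain mean
values = np.array([1, 2, 2, 3, 3, 4, 5, 200], dtype=float)

mean = values.mean()                       # pulled up by the outlier
median = np.median(values)                 # robust, but hides the tail
p95 = np.quantile(values, 0.95)            # cutoff for "stale garbage"
robust_mean = values[values <= p95].mean() # mean within the 95th percentile
```

The trimmed mean keeps more information than the median while still ignoring the pathological tail.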
GitHub API
- REST
- Best suited for casual or specific analysis.
- GraphQL
- Best suited for mirroring GitHub DB.
- Rate limits.
Node ID
from base64 import b64decode
print(b64decode(
    "MDY6Q29tbWl0MjI1OTI0NDI3OjA1MTkyMDUwOWJmMTk1N2Y"
    "4NDIzMDdmMzdkZGI0NDFiZmQyNTMwMGY=").decode())
06:Commit225924427:051920509bf1957f842307f37ddb441bfd25300f
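The decoded string appears to follow the pattern `0<len>:<Type><database id>:<suffix>`; this parse is inferred from the example above, not documented GitHub behavior:

```python
import re

decoded = "06:Commit225924427:051920509bf1957f842307f37ddb441bfd25300f"
# guessed structure: "0<len>:<Type><numeric db id>:<suffix>"
m = re.fullmatch(r"0\d+:([A-Za-z]+)(\d+):(.+)", decoded)
entity, db_id, suffix = m.group(1), int(m.group(2)), m.group(3)
```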
Two kinds of data retrieval
- Cold start.
- Event stream.
- Must register a GitHub application.
- Must re-scan periodically.
Gotcha!
- Node ID format changes.
- Merges happen after closures.
- PRs close before opening.
- HTTP 500.
- Incomplete.
- Eventual consistency.
Eventual consistency
Incoming events:
- Somebody committed aabbccd as a child of dccbbaa in repo XXX.
- Commit aabbccd has message "Fix".
- Commit aabbccd belongs to pull request PPP.
- New pull request PPP in repo XXX titled "merge me".
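One way to survive such out-of-order delivery is to buffer events that reference an entity we have not seen yet and replay them once it arrives. This is a hypothetical sketch, not Athenian's implementation:

```python
from collections import defaultdict

known_prs = set()
pending = defaultdict(list)   # pr id -> events waiting for that PR
processed = []                # commits successfully attached to a PR

def process(event):
    processed.append(event["commit"])

def handle(event):
    if event["type"] == "new_pr":
        known_prs.add(event["pr"])
        # replay events that arrived before their PR existed
        for buffered in pending.pop(event["pr"], []):
            handle(buffered)
    elif event["type"] == "commit_in_pr":
        if event["pr"] in known_prs:
            process(event)
        else:
            pending[event["pr"]].append(event)  # too early: wait

# the commit event arrives before its pull request exists
handle({"type": "commit_in_pr", "pr": "PPP", "commit": "aabbccd"})
handle({"type": "new_pr", "pr": "PPP"})
```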
Releases
- What is a release?
- A tag.
- A branch merge.
- A hash sent by the client.
- Which PRs were released?
- GitHub does not show this for a reason.
git push --force
Including git rebase
- Changes the hashes.
- May change the diffs.
- May change the messages.
git push --force
Including git rebase
- Changes the hashes.
- Usually does not change the diffs.
- Usually does not change the messages.
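Because rebases usually preserve commit messages while rewriting hashes, old and new commits can often be re-linked by message. A hypothetical heuristic with made-up hashes:

```python
# commits known before and after a force push (hashes differ)
old_commits = {"aaa111": "Fix bug", "bbb222": "Add test"}
new_commits = {"ccc333": "Fix bug", "ddd444": "Add test"}

# index the new commits by message and map old hash -> new hash
by_message = {msg: sha for sha, msg in new_commits.items()}
rebase_map = {sha: by_message.get(msg) for sha, msg in old_commits.items()}
```

Duplicate messages obviously break this, so a real matcher would need tie-breakers (author, timestamp order, diff similarity).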
Continuous Integration
- Check suites point at commits.
- Same commit may belong to several pull requests.
- Check suites may launch after the PR is closed.
- Check suites may launch before the PR is opened.
- Check suites may glitch.
- E.g., stay enqueued forever.
- No finish timestamp for legacy API check runs.
JIRA API
- No event stream. Pull vs. push.
- Next gen vs. classic. Migrations.
- So many fields may be custom.
- Frequent re-opens.
- Super strict privacy.
- Only user name by default.
- No emails.
JIRA identity matching
Client X
- 200 developers.
- Growing x2 in one year.
- Strong advocate of Athenian.
- >$1,000 MRR.
What we did
Dedicated custom analysis of their data showed:
- They are growing very fast.
- The PR frequency stays the same.
- Changed lines stay the same.
When to use Pandas
- Have requirements on response time?
- Work with many small dataframes?
- Have memory constraints?
How to use Pandas
Learn how to iterate rows fast.
for a, b in zip(df["a"].values, df["b"].values):
print(a, b)
How to use Pandas
Learn how to select rows fast.
df[df["a"] > 10]
# versus
df.take(np.flatnonzero(df["a"].values > 10))
How to use Pandas
Learn how to groupby fast.
for key, sub_df in df.groupby("a", sort=False):
    ...  # process each group
# versus
df.groupby("a", sort=False).grouper.groups
# dict key -> row indexes
How to use Pandas
Ensure your dtype is not object.
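Object dtype falls back to per-element Python calls; converting once to a native dtype makes everything downstream vectorized. A minimal sketch:

```python
import numpy as np

# object dtype: every operation loops in Python
bad = np.array(["1", "2", "3"], dtype=object)

# convert once to a native dtype, then all ops are vectorized
good = bad.astype(np.int64)
total = good.sum()
```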
How to live without Pandas
95% of ops can be rewritten in pure numpy.
- datetime64
- indexing
- sort_values
- groupby
How to live without Pandas
>>> sort_values("a")
order = np.argsort(array_a)
index = order
array_a = array_a[order]
array_b = array_b[order]
How to live without Pandas
>>> groupby("a")
keys, indexes, counts = np.unique(
array_a, return_inverse=True, return_counts=True)
order = np.argsort(indexes)
borders = np.cumsum(counts[:-1])
groups_a = np.split(array_a[order], borders)
groups_b = np.split(array_b[order], borders)
How to live without Pandas
>>> groupby("a").mean()
keys, indexes, counts = np.unique(
array_a, return_inverse=True, return_counts=True)
masks = np.zeros((len(keys), len(array_a)), bool)
arr_y = np.repeat(np.arange(len(keys)), counts)
arr_x = np.argsort(indexes)
masks[arr_y, arr_x] = True
mean_a = np.mean(array_a, where=masks, axis=1)
Tips for Python
- Use a profiler, e.g. py-spy.
- Vectorize everything.
- Cython.
- Precompute everything.
- Cache everything.
- Using a DB? Prepare to optimize queries and suffer.
- Choose incremental algorithms.
- KISS.
I can do the same with my script
- It will cover 80% of what you need.
- It will not be correct.
- Write-only.
- Charts.
The book
Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations.
Nicole Forsgren, Jez Humble, and Gene Kim.