Characterizing the Distributions of Commits in Large Source Code Repositories

Abstract

Modern software development is built on source code repositories and the changes committed to them. However, the nature of the changes committed to repositories of different sizes is not well understood. A data-driven characterization of commit activity in large software hubs contributes to a better understanding of software development and can support the detection of bugs early in development. Here, we present preliminary results from characterizing the distribution of 452 million commits in a metadata listing from GitHub repositories. We fit multiple candidate distributions and identify the best and second-best fits across different ranges of the data. The characterization is aimed at generating synthetic repositories suitable for use in simulation and machine learning.
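The idea of ranking candidate distributions by goodness of fit can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the method of the paper: the three candidate families (exponential, lognormal, Pareto), the use of maximum-likelihood fits ranked by AIC, the function names, and the synthetic sample that stands in for per-repository commit counts.

```python
import math
import random


def fit_exponential(xs):
    """MLE for an exponential distribution; returns (log-likelihood, #params)."""
    n, total = len(xs), sum(xs)
    rate = n / total
    return n * math.log(rate) - rate * total, 1


def fit_lognormal(xs):
    """MLE for a lognormal distribution; returns (log-likelihood, #params)."""
    n = len(xs)
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)
    # At the MLE the squared-deviation term collapses to n / 2.
    loglik = -sum(logs) - n * math.log(sigma) - 0.5 * n * math.log(2 * math.pi) - n / 2
    return loglik, 2


def fit_pareto(xs):
    """MLE for a Pareto distribution with x_min set to the sample minimum."""
    n, xmin = len(xs), min(xs)
    alpha = n / sum(math.log(x / xmin) for x in xs)
    loglik = (n * math.log(alpha) + n * alpha * math.log(xmin)
              - (alpha + 1) * sum(math.log(x) for x in xs))
    return loglik, 2


# Hypothetical candidate set; the paper's actual candidates are not listed here.
CANDIDATES = {
    "exponential": fit_exponential,
    "lognormal": fit_lognormal,
    "pareto": fit_pareto,
}


def rank_fits(xs):
    """Return candidate names ordered from best to worst fit (lowest AIC first)."""
    scored = []
    for name, fit in CANDIDATES.items():
        loglik, k = fit(xs)
        scored.append((2 * k - 2 * loglik, name))
    return [name for _, name in sorted(scored)]


# Synthetic stand-in data (NOT real GitHub data): a lognormal sample.
random.seed(0)
sample = [random.lognormvariate(2.0, 1.5) for _ in range(5000)]

ranking = rank_fits(sample)
print("best fit:", ranking[0], "| second best:", ranking[1])

# The same ranking can be repeated on a sub-range of the data.
upper_range = [x for x in sample if x >= 10]
print("upper range ranking:", rank_fits(upper_range))
```

Applying `rank_fits` separately to each range of interest yields a best and second-best candidate per range. Real commit counts are discrete, so a fuller treatment would probably also consider discrete families.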

Publication
Kalyan Perumalla

Kalyan Perumalla is Founder and President of Discrete Computing, Inc. He led advanced research and development at ORNL and holds senior faculty appointments at UTK, GT, and UNL.
