Abstract

With the wide adoption of social collaborative coding, more and more developers participate and collaborate on platforms such as GitHub through rich social and technical relationships, forming a large-scale complex technical system. Like critical nodes in other complex systems, influential developers and projects usually play an important role in driving this technical system toward more optimized states with higher efficiency for software development, which makes identifying influential developers and projects on social collaborative coding platforms a meaningful research direction. However, traditional ranking methods seldom take into account the continuous interactions and the driving forces of human dynamics. In this paper, we combine the bursty interactions and the bipartite network structure between developers and projects and propose the BurstBiRank model. First, the burstiness of the interactions between each developer-project pair is calculated. Second, a weighted developer-project bipartite network is constructed using the burstiness as the edge weight. Finally, an iterative score diffusion process is applied to this bipartite network, and the final ranking scores are obtained at the stationary state. A real-world case study on GitHub demonstrates the effectiveness of the proposed BurstBiRank and shows that it outperforms traditional ranking methods.

1. Introduction

Social collaborative coding is now a popular paradigm among software developers, and developers from all over the world can easily collaborate using the social and technical functionalities provided by platforms such as GitHub. For example, in GitHub, developers can follow each other to form a social network, keep track of the updates of a project through the star and watch functionalities, contribute code through the commit and pull request functionalities, or participate in discussions of new feature designs or bug fixes through the issue functionality. These rich social and technical functionalities connect developers and projects to form a large-scale complex technical system. It is known that critical nodes usually play an important role in the operation, management, and optimization of complex systems. The same goes for complex technical systems such as GitHub, which are usually driven by influential developers and projects toward more optimized states with higher efficiency for software development. For example, in addition to direct collaboration, developers often look to popular developers and projects to improve their coding ability and guide their technical selection, which in turn makes collaboration more efficient. Thus, identifying influential developers and projects is of great significance for improving developers’ abilities and for the prosperity of the open source community, and it also has important applications in service recommendation [1, 2] and quality of service prediction [3–6].

Existing work on influence analysis in the open source software community often simply employs basic properties [7], network structural metrics [8], or traditional unipartite graph ranking models [9–12]. The influence of developers and projects, two major and tightly coupled components of the open source software community, is usually evaluated separately, although many new graph ranking methods for complex network structures such as bipartite networks [13] have been proposed. Moreover, the abundant activities of developers are not utilized effectively, even though our previous study [14] indicates that the statistical characteristics of developers’ behavior are useful for distinguishing elite developers from common ones. Figure 1 compares the contributions of an elite developer, Taylor Otwell, and a common developer, Franz Liedke, and reveals different statistical characteristics in their behavior.

In this paper, we aim at mutually identifying influential developers and projects in GitHub by adopting a combination of the burstiness behavior of developers and the bipartite network topology of developer-project interactions. The contributions of this paper are listed as follows:
(1) We propose a burstiness-weighted bipartite network model to incorporate bursty behaviors between developers and projects into network topology.
(2) We combine the diffusion-based ranking method BiRank and the burstiness-weighted bipartite network and propose a new ranking method called BurstBiRank for mutually identifying influential developers and projects in GitHub.
(3) We apply the proposed model to a real-world GitHub dataset, showing that burstiness can correctly measure developers’ attention to projects and that our model outperforms baseline models.

The remainder of the paper is organized as follows. Section 2 introduces related work on graph ranking methods and human dynamics. The details of our proposed BurstBiRank method are presented in Section 3. The experimental results and discussions are given in Section 4. Finally, we briefly summarize our work and outline future directions in Section 5.

2. Related Work

Influential node identification has been a hot topic in network science research for decades, and many graph ranking methods have been proposed from different views of network structures or information diffusion mechanisms on various kinds of complex networks [15–18]. PageRank [19] and HITS [20] are the most popular ones. PageRank [19] is a random walk-based ranking method that uses the probability that a random surfer appears on a web page as the influence score of that page, while HITS [20] distinguishes the authority and hub features of a web page and ranks a web page with both an authority score and a hub score.

Many graph ranking methods are based on PageRank and HITS. Considering individual’s preference, Haveliwala et al. [21] proposed a personalized PageRank algorithm, and a personalized vector was introduced for expressing individual’s preference for certain topics, novelty, and sensitivity of individuals’ generated contents. Inspired by the discrete-time Markov process interpretation of PageRank, Liu et al. [22] proposed BrowseRank based on continuous-time Markov processes and used user behavior data to rank the importance of pages. In order to overcome parameter tuning of PageRank which is caused by dangling nodes in the network, Lu et al. [23] introduced a ground node connecting to all other nodes and proposed LeaderRank. Then, Li et al. [24] extended LeaderRank to weighted network.

In addition to ranking on unipartite network, recent research studies also extend graph ranking methods of unipartite network to bipartite network. In contrast to random walk-based graph ranking methods, He et al. [13] proposed BiRank, an optimization based ranking method for bipartite network. Xu et al. [25] applied singular value decomposition to bipartite network and proposed SVDRank and SVDARank. Morone et al. [26] extended the k-core decomposition method to the bipartite network and pointed out that in the ecological symbiosis network, the extinction of the maximum k-core node would make the ecosystem reach the critical point of collapse.

The rapid development of graph ranking models also promotes research on influence analysis for the open source software community. Xuan et al. [9] modeled the communications between developers in Apache as networks and analyzed developers’ influence using degree, PageRank, and HITS. Joblin et al. [8] classified developers into core and peripheral with several network metrics. From the view of software projects, Inoue et al. [11] constructed a component graph with use relations for ranking software components. Pan et al. [27] constructed a multilayer complex network by extracting structural information from Java software systems and proposed a weighted PageRank algorithm for ranking classes or packages.

Although network structure plays an important role in identifying influential nodes in online social network, based on previous human dynamics study, Yan et al. [14] found that bursty behavior is a good indicator for distinguishing influential developers from common ones. Human dynamics studies the statistical characteristics of spatial or temporal behaviors of human beings and the potential laws behind it. Goh et al. [28] proposed the burstiness metric to measure to which extent the behavior deviates from periodic behavior.

3. Method

In this section, we present a novel bipartite network ranking framework incorporating bursty interactions, called BurstBiRank, for mutually identifying influential developers and projects in GitHub. First, we introduce the definition of burstiness, which plays an important role in our proposed method for measuring how much attention a developer pays to a project. Then, we describe the definition and construction of the burstiness-weighted developer-project bipartite network and the diffusion-based ranking process applied to it. Finally, the overall algorithm is proposed and its time complexity is analyzed. The notations used throughout the article are summarized in Table 1.

3.1. Burstiness

In many real-world or online systems, people’s activity is often intermittent, displaying intense activity during a short period followed by a long period of reduced activity or even no activity. For example, you may spend an entire afternoon searching YouTube for videos about huskies when you are free but then seldom visit YouTube on weekdays. This pattern of human behavior and the laws behind it have been studied extensively in the field of human dynamics, and Goh et al. [28] proposed the burstiness metric to measure the extent to which behavior deviates from periodic behavior, which is defined as

$$B = \frac{\sigma_\tau - \mu_\tau}{\sigma_\tau + \mu_\tau}, \tag{1}$$

where $\sigma_\tau$ and $\mu_\tau$ represent the standard deviation and the mean of the time interval series of human activities, respectively. It can be concluded from the definition that (1) the value of B ranges from −1 to 1; (2) B > 0 indicates that the behavior is bursty, and the larger B is, the stronger the burstiness is; (3) B < 0 indicates a cyclical trend, and the smaller B is, the stronger the periodicity is.
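As a minimal sketch of equation (1), the following Python snippet computes the burstiness of a series of activity dates. The function name `burstiness`, the use of day-level gaps, and the choice of the population standard deviation are assumptions made for illustration; the text does not say which estimator of the standard deviation is used.

```python
from datetime import date
from statistics import mean, pstdev


def burstiness(event_dates):
    """Return B = (sigma - mu) / (sigma + mu) for the gaps between events.

    event_dates: iterable of datetime.date objects (at least two distinct days).
    Assumption: sigma is the population standard deviation of the gaps.
    """
    ordered = sorted(set(event_dates))  # events on the same day count once
    if len(ordered) < 2:
        raise ValueError("need at least two distinct event dates")
    # Inter-event times, in days, between successive events.
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    mu = mean(gaps)
    sigma = pstdev(gaps)
    return (sigma - mu) / (sigma + mu)


# Hypothetical activity series: one strictly weekly, one with a long silence.
weekly = [date(2018, 1, 1), date(2018, 1, 8), date(2018, 1, 15), date(2018, 1, 22)]
bursty = [date(2018, 1, 1), date(2018, 1, 2), date(2018, 1, 3), date(2018, 1, 4),
          date(2018, 4, 11)]

print(burstiness(weekly))       # identical gaps: sigma = 0, so B = -1.0
print(burstiness(bursty) > 0)   # a few short gaps and one long gap: B > 0
```

Because the dates are distinct, every gap is at least one day, so the denominator is always positive.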

3.2. Burstiness-Weighted Developer-Project Bipartite Network

Definition 1. Burstiness-weighted developer-project bipartite network: a burstiness-weighted developer-project bipartite network is a weighted bipartite network $G = (U \cup P, E)$, where U and P denote two disjoint sets of nodes, that is, the set of developers and the set of projects, respectively, and E represents the edges between developers and projects. The network can be described by a bipartite weight matrix $W \in \mathbb{R}^{|U| \times |P|}$ with elements $w_{ij}$ ($1 \le i \le |U|$, $1 \le j \le |P|$) indicating the tie strength between developer i and project j, which is a function of the burstiness $B_{ij}$ of the interactions between developer i and project j, that is,

$$w_{ij} = f(B_{ij}). \tag{2}$$

Figure 2 shows a sample burstiness-weighted developer-project bipartite network. The multiple interactions between each developer-project pair are first grouped, and the burstiness of each group is calculated. Then, the bipartite weight matrix W is constructed using equation (2). In principle, the function f in equation (2) can take any form, linear or nonlinear. According to the characteristics of burstiness, to ensure that edge weights are positive and that cyclical interactions have larger weights, we choose a linear form of f, shown in equation (3), for simplicity.
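The construction can be sketched in Python as follows. This is only an illustration under stated assumptions: the weight function `linear_weight` (f(B) = 2 − B) is a hypothetical stand-in chosen because it is linear, positive on [−1, 1], and larger for more periodic interactions; it is not claimed to be the paper's equation (3). The function names, the sparse dictionary representation, and the rule of skipping pairs with fewer than two active days are likewise assumptions.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, pstdev


def linear_weight(b):
    # HYPOTHETICAL stand-in for equation (3): linear, positive on [-1, 1],
    # and decreasing in B so that periodic interactions get larger weights.
    return 2.0 - b


def build_weights(commits, weight_fn=linear_weight):
    """Map (developer, project) -> edge weight from (developer, project, date) records."""
    days_by_pair = defaultdict(set)
    for developer, project, day in commits:
        days_by_pair[(developer, project)].add(day)  # same-day commits merge

    weights = {}
    for pair, days in days_by_pair.items():
        ordered = sorted(days)
        gaps = [(b - a).days for a, b in zip(ordered, ordered[1:])]
        if not gaps:
            continue  # assumption: a single active day gives no interval, so no edge
        mu, sigma = mean(gaps), pstdev(gaps)
        weights[pair] = weight_fn((sigma - mu) / (sigma + mu))
    return weights


# Hypothetical commit log.
commits = [
    ("alice", "proj", date(2018, 1, 1)),
    ("alice", "proj", date(2018, 1, 8)),
    ("alice", "proj", date(2018, 1, 15)),
    ("bob", "proj", date(2018, 1, 1)),
    ("bob", "proj", date(2018, 1, 2)),
    ("bob", "proj", date(2018, 1, 3)),
    ("bob", "proj", date(2018, 4, 11)),
    ("carol", "proj", date(2018, 2, 1)),
]
weights = build_weights(commits)
print(weights[("alice", "proj")])  # strictly weekly commits: B = -1, weight 3.0
```

Storing only the existing edges in a dictionary mirrors the sparsity of real developer-project networks.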

3.3. BiRank

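Score diffusion is a general idea behind many popular ranking methods such as PageRank [19] and BiRank [13], which employ an iterative process of diffusing scores to neighbors until the stationary state is reached. The final scores at the stationary state are regarded as the ranking scores. The process of score diffusion can be formulated as equations (4) and (5), in which the scores of developers and projects are updated in turn:

$$u_i = \sum_{j=1}^{|P|} w_{ij}\, p_j, \tag{4}$$

$$p_j = \sum_{i=1}^{|U|} w_{ij}\, u_i, \tag{5}$$

where $u_i$ and $p_j$ denote the scores of developer i and project j, respectively.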

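To ensure convergence and stability, BiRank adopts symmetric normalization: each edge weight $w_{ij}$ is divided by $\sqrt{d_i}\sqrt{d_j}$, where $d_i$ and $d_j$ are the weighted degrees of developer i and project j, respectively.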

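BiRank also adopts a query vector in the ranking model to utilize prior beliefs on the rankings of nodes, as shown in equations (7) and (8):

$$u_i = \gamma \sum_{j=1}^{|P|} \frac{w_{ij}}{\sqrt{d_i}\sqrt{d_j}}\, p_j + (1 - \gamma)\, u_i^0, \tag{7}$$

$$p_j = \lambda \sum_{i=1}^{|U|} \frac{w_{ij}}{\sqrt{d_i}\sqrt{d_j}}\, u_i + (1 - \lambda)\, p_j^0, \tag{8}$$

where $d_i$ and $d_j$ are the weighted degrees of developer i and project j, and $u_i^0$ and $p_j^0$ are the corresponding entries of the query vectors. Prior beliefs and diffusion scores are balanced with the hyperparameters γ and λ for developers and projects, respectively.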

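Finally, the equivalent matrix form of BiRank [13] can be obtained as equations (9) and (10):

$$\mathbf{u} = \gamma S \mathbf{p} + (1 - \gamma)\, \mathbf{u}^0, \tag{9}$$

$$\mathbf{p} = \lambda S^{\top} \mathbf{u} + (1 - \lambda)\, \mathbf{p}^0, \tag{10}$$

where $\mathbf{u}^0$ and $\mathbf{p}^0$ are the query vectors and S is the symmetric normalization of the weight matrix W:

$$S = D_u^{-1/2}\, W\, D_p^{-1/2}, \tag{11}$$

with $D_u$ and $D_p$ denoting the diagonal matrices of the weighted degrees of developers and projects, respectively.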
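The iteration can be sketched in pure Python over a sparse edge dictionary. This is an illustrative sketch, not the authors' implementation: the function name `birank`, the dictionary-based representation, the stopping tolerance, and the choice to start from the query vectors rather than from random values are all assumptions. Since γ and λ are below 1, the starting point should not affect the stationary scores.

```python
import math


def birank(weights, u0, p0, gamma=0.85, lam=0.85, tol=1e-10, max_iter=1000):
    """Iterate u = gamma*S*p + (1-gamma)*u0 and p = lam*S^T*u + (1-lam)*p0.

    weights: dict {(developer, project): positive weight} (only existing edges).
    u0, p0:  dicts of query-vector entries for developers and projects.
    """
    # Weighted degrees of both node types.
    deg_u, deg_p = {}, {}
    for (i, j), w in weights.items():
        deg_u[i] = deg_u.get(i, 0.0) + w
        deg_p[j] = deg_p.get(j, 0.0) + w
    # Symmetric normalization: s_ij = w_ij / sqrt(d_i * d_j).
    s = {(i, j): w / math.sqrt(deg_u[i] * deg_p[j]) for (i, j), w in weights.items()}

    u, p = dict(u0), dict(p0)  # assumption: start from the query vectors
    for _ in range(max_iter):
        new_u = {i: (1.0 - gamma) * u0[i] for i in u0}
        for (i, j), s_ij in s.items():
            new_u[i] += gamma * s_ij * p[j]
        new_p = {j: (1.0 - lam) * p0[j] for j in p0}
        for (i, j), s_ij in s.items():
            new_p[j] += lam * s_ij * new_u[i]
        change = max(
            max(abs(new_u[i] - u[i]) for i in new_u),
            max(abs(new_p[j] - p[j]) for j in new_p),
        )
        u, p = new_u, new_p
        if change < tol:
            break
    return u, p


# Hypothetical toy network: developer "a" works on projects "x" and "y", "b" only on "y".
toy_weights = {("a", "x"): 1.0, ("a", "y"): 1.0, ("b", "y"): 1.0}
u, p = birank(toy_weights, {"a": 0.5, "b": 0.5}, {"x": 0.5, "y": 0.5})
print(u["a"] > u["b"], p["y"] > p["x"])
```

Each sweep touches every edge a constant number of times, so the cost per iteration is proportional to the number of edges.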

3.4. Overall Algorithm

Combining Sections 3.1, 3.2, and 3.3, we finally propose BurstBiRank; the overall algorithm is shown in Algorithm 1.

Input:
 Developer-project interaction set (DP); query vectors $\mathbf{u}^0$, $\mathbf{p}^0$; and hyperparameters γ, λ
Output:
 Ranking vectors u, p
(1) Group developer-project interactions by developer and project;
(2) for each developer-project interaction group do
(3)   Sort developer-project interactions by commit time in descending order;
(4)   Calculate time intervals between successive records;
(5)   Calculate burstiness $B_{ij}$;
(6)   Calculate edge weight $w_{ij}$ according to equation (3);
(7) end for
(8) Construct weight matrix W;
(9) Symmetrically normalize W according to equation (11);
(10) Randomly initialize u and p;
(11) while stopping criteria are not met do
(12)   Update u and p in turn according to equations (9) and (10);
(13) end while
(14) return u and p

3.5. Time Complexity Analysis

The time complexity of BurstBiRank consists of two parts. The first part is the calculation of time intervals, burstiness, and edge weights, whose cost is determined by the number of interactions and the number of developer-project groups. The second part is the iterative process of the BurstBiRank algorithm; with dense matrices, the time complexity of equations (9) and (10) is O(|U|·|P|) per iteration. However, most real-world networks are very sparse, and only the nonzero elements (which correspond to existing edges) need to be stored and used in the matrix multiplications of equations (9) and (10). Thus, the time complexity of the second part is O(c|E|), where c is the number of iterations and |E| is the number of edges. The overall time complexity of BurstBiRank is the sum of these two parts.

4. Experiment

In this section, the performance of BurstBiRank is evaluated on a real-world GitHub dataset [29]. All experiments are run on a Windows 10 PC with a Core i7-4790 3.6 GHz CPU and 16 GB of memory.

4.1. Datasets

The GHTorrent dataset is an offline mirror of the data offered through the GitHub REST API, and a subset of it covering the PHP development community is used in this experiment. The GHTorrent dataset as of November 1, 2018, is selected and preprocessed as follows: (1) commit interactions between developers and PHP projects are selected; (2) the commit date is extracted from the commit timestamp; (3) multiple commit interaction records on the same date are merged into one record; (4) developers who have 10 or fewer records are excluded; (5) follow relationships between developers and watch interactions between developers and projects are extracted. The statistics of the dataset after preprocessing are shown in Table 2.

4.2. Evaluation Metrics

Correlation analysis and SIR (susceptible-infected-removed) simulation are usually adopted for evaluation of graph ranking methods.

In correlation analysis, ranking results are compared with the ground truth using correlation coefficients. Kendall’s tau [30] is one such correlation coefficient; it compares ranking orders instead of exact ranking scores or ground truth values. The definition of Kendall’s tau is shown in the following equation:

$$\tau = \frac{C - D}{n(n-1)/2}. \tag{12}$$

X and Y are two different lists of length n, which are usually the predicted ranking list and the ground truth ranking list. C and D are the numbers of concordant and discordant pairs between X and Y, respectively. Let $(x_i, y_i)$ and $(x_j, y_j)$ be a pair of entries from X and Y; if $(x_i - x_j)(y_i - y_j) > 0$, then they are called a concordant pair, and if $(x_i - x_j)(y_i - y_j) < 0$, then they are called a discordant pair. In the case of $(x_i - x_j)(y_i - y_j) = 0$, the pair is neither concordant nor discordant.
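The pair-counting definition can be sketched directly in Python. The function name `kendall_tau` is chosen for illustration, and the quadratic loop over all pairs is meant only to mirror the definition; it is not an efficient implementation for long lists.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau as (C - D) / (n(n-1)/2), counting pairs explicitly."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    concordant = discordant = 0
    for (x_i, y_i), (x_j, y_j) in combinations(zip(x, y), 2):
        product = (x_i - x_j) * (y_i - y_j)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # product == 0: tied pair, neither concordant nor discordant
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # identical order: 1.0
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))  # reversed order: -1.0
```

For a Top-k evaluation, x and y would hold the predicted scores and the ground truth values of the same k nodes.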

Because influence is difficult to measure directly, the ground truth for correlation analysis simply uses the degree in the developer-developer following network or the developer-project watching network. Degree, however, is a local centrality metric that can only roughly measure a node’s influence from a local view, whereas the influence of a node in a network mainly relates to its ability to spread information to the whole network. Generally, a node with higher influence will spread information to more nodes in a network. Thus, we also adopt the SIR model [31], a classical epidemic model, in our experiment to evaluate the performance of our proposed ranking method. As shown in Figure 3, nodes in the SIR model have three statuses, that is, susceptible (S), infected (I), and removed (R). Initially, one node is selected as the infected node and the others are susceptible. Then, an iterative transmission process is applied. At each iteration, each infected node infects one of its susceptible neighbors with probability α, and each infected node recovers to the removed status with probability β. The iterative transmission process stops when there are no infected nodes in the network. The final number of removed nodes can be regarded as the influence of the node that was initially selected as infected.
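A minimal sketch of this transmission process in Python is given below. The function name `sir_spread`, the adjacency-dictionary representation, and the convention that nodes infected in an iteration can recover only from the next iteration onward are assumptions made for illustration.

```python
import random


def sir_spread(adjacency, seeds, alpha, beta, rng=None):
    """Run one SIR simulation and return the final number of removed nodes.

    adjacency: dict {node: list of neighbour nodes} of an undirected network.
    seeds:     initially infected nodes; all other nodes start susceptible.
    alpha:     probability that an infected node infects one susceptible neighbour.
    beta:      probability that an infected node recovers (must be > 0).
    """
    if beta <= 0:
        raise ValueError("beta must be positive, otherwise the process never stops")
    rng = rng or random.Random()
    infected, removed = set(seeds), set()
    while infected:
        newly_infected = set()
        for node in sorted(infected):  # sorted for reproducibility with a seeded rng
            candidates = [n for n in adjacency[node]
                          if n not in infected and n not in removed
                          and n not in newly_infected]
            if candidates and rng.random() < alpha:
                newly_infected.add(rng.choice(candidates))
        # Nodes infected in this iteration cannot recover until the next one.
        recovered = {node for node in infected if rng.random() < beta}
        removed |= recovered
        infected = (infected - recovered) | newly_infected
    return len(removed)


# Hypothetical path network 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(sir_spread(path, [0], alpha=1.0, beta=1.0))  # infection walks the path: 4
print(sir_spread(path, [0], alpha=0.0, beta=1.0))  # nothing spreads: 1
```

In practice the simulation would be repeated several times per seed set and the results averaged, since each run is random.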

4.3. Baseline Methods

To show the effectiveness of our proposed BurstBiRank, we compare it with several baseline ranking methods. In addition to the burstiness-weighted developer-project bipartite network, two other developer-project bipartite networks are constructed, one with unweighted edges (UW) and one with commit number-weighted edges (CN), and all baseline methods are evaluated on these two bipartite networks and labeled with the corresponding suffix, for example, PageRank-UW. The hyperparameters γ and λ of BurstBiRank are both set to 0.85, and the query vectors $\mathbf{u}^0$ and $\mathbf{p}^0$ are set to the degrees of developers and projects over the total number of nodes of each type, respectively, which can be calculated using equations (13) and (14).

BiRank [13] employs a diffusion-based way of utilizing the mutual reinforcement between different types of nodes in a bipartite network for mutually ranking the two types of nodes and adopts a normalization strategy in the iterative process. The hyperparameters are both set to 0.85 in our experiment.

SVDRank and SVDARank [25] apply singular value decomposition to the bipartite network and select the first eigenvector as the final ranking vector. The difference between them is that SVDRank runs on the original bipartite network, while SVDARank introduces two ground nodes to the original bipartite network before applying singular value decomposition in order to solve the problem of dangling nodes.

PageRank [19] regards ranking in a network as a score diffusion process and ranks nodes by iteratively diffusing scores on the network. In our experiment, we directly apply PageRank to the developer-project bipartite networks, ignoring the types of nodes. The hyperparameter is set to 0.85 in our experiment.

4.4. Results
4.4.1. Correlation Analysis

In the experiment of correlation analysis, the number of followers of a developer in developer-developer following network is chosen as the ground truth for the rankings of developers, and the number of watchers of a project in developer-project watching network is chosen as the ground truth for the rankings of projects. Kendall’s tau is calculated for developers and projects separately and is shown in Table 3.

From the results of correlation analysis in Table 3, we have the following observations:
(1) Our proposed BurstBiRank outperforms all baseline methods in identifying Top-20, Top-50, and Top-100 influential developers and projects, except for Top-100 developers. This indicates the effectiveness of employing burstiness as the weight between developers and projects for measuring the influence of both, instead of ignoring the edge weight or simply using the commit number as the edge weight. This result not only agrees with the previous study [14] but also conforms to the practical intuition that developers pay more attention to important projects by working on them continuously and regularly.
(2) For random walk-based ranking methods (i.e., PageRank and BiRank), higher performance is obtained with unweighted edges than with commit number-weighted edges for identifying highly influential (i.e., Top-20) developers and projects, while the opposite holds for decomposition-based ranking methods (i.e., SVDRank and SVDARank). This indicates that the way the edge weight is measured is important for modeling continuous interactions between developers and projects in influence analysis. It is also a key motivation for modeling the edge weight as a function of burstiness, and future improvements can be made to the form of the function.
(3) Almost all baseline methods have negative correlation results, which indicates that the ranking orders produced by these methods are negatively correlated with the ground truth ranking orders; that is, influential developers or projects are usually ranked after less influential ones by these methods. In contrast, our proposed method always shows a good positive correlation with the ground truth ranking orders, indicating the good stability of our method.

4.4.2. SIR Simulation

In this section, the SIR model [31] is adopted to evaluate the performance of our proposed BurstBiRank and the baseline methods by comparing the information spreading ability of the Top-k developers and projects ranked by each method. In the experiment, BurstBiRank is compared with each baseline method separately. Because the two methods being compared usually share a common group of top-ranked nodes, which would have the same effect in the SIR simulation, for each comparison the Top-50 developers (projects) ranked by each method are selected, and only those developers (projects) that do not appear in both Top-50 lists are set as the initial infected nodes. In each iteration of the SIR simulation, each infected node randomly selects one of its susceptible neighbors and infects it with probability α = 0.5, and each infected node recovers to the removed status with probability β, which is the reciprocal of the average of all node degrees. The cumulative number of infected nodes is recorded for each iteration. The iteration stops when there are no infected nodes. To reduce the effect of the randomness of the SIR simulation, 10 experiments are conducted for each pair of methods and the results are averaged as the final result. The final results are shown in Figures 4–7, and several significant observations are found:
(1) BurstBiRank outperforms all baseline methods, which indicates the effectiveness of burstiness in identifying influential developers and projects. This finding may encourage developers to work continuously and regularly to obtain high influence in the open source software community.
(2) The difference in performance between BurstBiRank and PageRank is larger than that between BurstBiRank and BiRank/SVDRank/SVDARank. PageRank is designed for unipartite networks, while BiRank, SVDRank, and SVDARank are ranking methods specific to bipartite networks, which distinguish the types of nodes and employ the mutual reinforcement between different types of nodes during ranking. This means that it is better to mutually co-rank developers and projects than to simply mix them up.

4.5. Case Study

In addition to correlation analysis and SIR simulation, we further do a detailed case study to show the effectiveness of our model in identifying influential developers and projects. Top-20 developers and projects ranked by BurstBiRank are shown in Tables 4 and 5, respectively, with their rankings in baseline methods.

From Table 4, we can see some major contributors to famous projects can be identified by BurstBiRank but they have lower rankings in baseline methods. For example, Marco Pivetta (GitHub ID: Ocramius), a major contributor of both ZendFramework and Doctrine ORM, is not ranked in Top-20 by BiRank-CN, SVDRank-UW, SVDRank-CN, SVDARank-UW, and SVDARank-CN. Taylor Otwell (GitHub ID: taylorotwell), the creator and major contributor to Laravel, is only identified by our BurstBiRank and BiRank-CN.

As for projects, famous PHP projects like Symfony (GitHub ID: symfony/symfony) and MediaWiki (GitHub ID: wikimedia/mediawiki) can also be identified as Top-20 influential projects by our method but with lower rank by baseline methods.

BurstBiRank and the baseline methods can all identify several influential developers and projects, but some less popular developers and projects are also given a high rank because the dataset is a real-world dataset to which only light filtering is applied. To sum up the correlation analysis, SIR simulation, and case study, our proposed BurstBiRank outperforms the baseline methods and can identify some influential developers and projects of a real-world open source software community, although there is still room for further improvement.

4.6. Parameter Analysis

In this section, we investigate how the performance varies with the hyperparameters that balance the prior beliefs and the diffusion scores. For simplicity, we constrain γ to be equal to λ, and Kendall’s tau is adopted to indicate the model’s performance. Figure 8 shows the ranking performance when the balance parameters γ and λ vary from 0.6 to 1. The best ranking performance is obtained when γ and λ are not equal to 1, indicating that the prior beliefs are useful for ranking developers and projects.

5. Conclusions

In this work, we aim at identifying influential developers and projects in open source software community. Continuous interactions between developers and projects are modeled as a burstiness-weighted bipartite network, and an iterative diffusion process is applied on it to calculate ranking scores for developers and projects. The proposed BurstBiRank is evaluated against four baseline methods on a real-world GitHub dataset. Extensive experimental analysis and case study show BurstBiRank outperforms baseline methods in both correlation analysis and SIR simulation.

The basic idea behind BurstBiRank is to measure the tie strength between developers and projects by the burstiness of the continuous interactions between them, under the intuitively reasonable assumption that more regular interactions mean stronger ties. Under our framework, burstiness can be incorporated into the developer-project bipartite network through any linear or nonlinear function, but in our experiment a linear function is adopted for simplicity. In addition to burstiness, there are other metrics in human dynamics, such as memory, that may reflect the tie strength between developers and projects. Attributes of developers and projects, such as programming language, also affect their rankings. In future work, we will adopt more types of functions, more metrics from human dynamics, and more attributes of developers and projects in our framework.

Data Availability

The data used in this study can be accessed via https://ghtorrent.org/.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This study was supported by the National Natural Science Foundation of China (grant no. 61872002), the University Natural Science Research Project of Anhui Province (grant no. KJ2019A0037), the University Collaborative Innovation Program of Anhui Province (grant no. GXXT-2019-013), and the Doctoral Scientific Research Foundation of Anhui University (grant no. Y040418194).