Abstract

Accurately predicting results from sports competition statistics is of great significance for participant research, commercial cooperation, advertising, and betting. To address the tendency of the PageRank page-ranking algorithm toward topic drift, the category similarity between pages is introduced into the algorithm. In the PR value formula, a factor W(u, v) between pages replaces the original Nu (the number of links to page u), so that the content category of pages is taken into account and topic drift is reduced. Borrowing the time feedback factor from the PageRank-time algorithm, a time feedback factor is then added to this first improved PR value formula. Based on statistics from the 1230 games of the NBA 2018-2019 regular season, this paper ranks team strength with the improved PageRank algorithm and compares the results with the regular-season points ranking and the playoff results. The improved PageRank ranking is consistent with the regular-season points ranking in the eastern division but differs in second place in the western division. In predicting the top four playoff teams, it correctly predicts three of the four.

1. Introduction

Many factors affect the results of competitive games, and all of them need to be considered when forecasting. Predicting team competitions is more complicated still: in addition to individual ability and on-the-spot performance, the result depends on collective capabilities such as teamwork. Predicting the outcome of a game is therefore a highly specialized problem. The NBA quantifies its games to a remarkable degree. It has always relied on cutting-edge technology and provides a large amount of data for game prediction and analysis. The strength gap between teams is small, and any game can go either way, which makes predicting games both challenging and meaningful.

PageRank is an algorithm based on link analysis; its principle builds on the hyperlinks between web pages. The basic idea can be understood as follows. First, PageRank evaluates whether a web page is important based on the number of web pages that link to it. The homepage of Phoenix.com is more important than a personal blog page, and this importance is measured by the number of web pages linking to each of them: more pages link to the Phoenix.com homepage than to a personal blog, so the homepage is more important. However, to raise the apparent importance of their pages, some site owners not only improve the quality of their content but also create pages that link to themselves, many of which are spam. Although the link count of such pages increases, they are not important pages. To avoid the drawbacks of evaluating importance by link count alone, PageRank weights each link by the importance of the page it comes from. For example, if the pages linking to a page include pages from well-known websites such as Google, the importance of that page is higher.

Evaluating the strength of competitors and predicting competition results from that strength is of considerable value. Zak et al. (1979) calculated the offensive strength and defensive strength of each team based on a statistical analysis of the technical characteristics of NBA games, so as to rank the comprehensive strength of teams and predict game results [1]. Wu established a principal component logistic regression model to predict victory or defeat based on the data of the first 30 matches in the 2010-2011 season of Italian Football League A [2]. Since the end of the 20th century, a large number of researchers have used machine learning algorithms to predict results. Cuzzocrea et al. combined deep-learning and transfer-learning approaches to support social influence prediction [3]. Huang et al. and Liu and Zhu predicted different target domains based on the PageRank and HITS algorithms [4, 5]. Goel et al. and Liu et al., respectively, proposed the sNorm(p) algorithm and the HITS-PR-HHblits algorithm to further improve predictive performance [6, 7].

Research on ranking algorithms began earlier abroad than domestically. The PageRank algorithm and the HITS algorithm are two representative ranking algorithms [8–10]. PageRank is a link analysis algorithm and a calculation model to which other search engines and academia pay close attention. Its core idea is that the more links a web page receives, and the more authoritative the pages that reference it, the more important the web page is. The importance of web pages is calculated offline and is independent of the query subject, so the algorithm responds quickly [11]. However, it also has obvious shortcomings such as subject drift, discrimination against new web pages, and ignoring the individual needs of users. The HITS algorithm uses two mutually influential weights, content authority and link authority, to evaluate the value of web content and the value of hyperlinks in the web [12–14]. It depends on the query subject. The interdependent and mutually reinforcing relationship between Authority and Hub is the basis of the HITS algorithm [15]. The algorithm also suffers from subject drift, low computational efficiency, structural instability, and vulnerability to manipulation. Relevant scholars first use the vector space model (VSM) to calculate similarity weights between web pages, then analyze and count the incremental weights of web page clicks, and finally combine the two weights, integrating feedback information and content relevance, to improve the PageRank algorithm and the relevance of search results to user queries [16]. Researchers proposed a four-level method to improve PageRank [17]. By introducing a time weight function W, a segment function F, a web page weight ratio function P, and an interest degree V, the problems of PageRank were addressed, and the improved algorithm was validated through experiments [18, 19]. Relevant scholars also proposed improved methods based on vector space model theory, which represent both user queries and web pages as vectors [20–22].

The difference between the HITS algorithm and PageRank is that HITS identifies certain web pages as Authority pages and Hub pages. The traditional PageRank algorithm is calculated from web page hyperlinks, but it cannot measure the value of each individual link and can only spread a page’s score evenly across its links. The HITS algorithm handles this problem well and is one of the classic algorithms in link analysis; the search engine Teoma uses it as its link analysis algorithm. After HITS receives the user’s query, it submits the query to an existing search engine (or a search system it has constructed itself) and extracts the top-ranked pages from the returned results to obtain an initial collection of web pages highly relevant to the query. This collection is called the Root Set. HITS then expands the root set: every web page that has a direct link to or from a page in the root set is added, forming the extended page collection. The HITS algorithm searches this expanded collection for good “Hub” pages and good “Authority” pages. When PageRank calculates a relevance ranking, each page receives a single PageRank value, whereas HITS assigns each page two scores, the Authority weight and the Hub score. The former is very useful in the search engine field.
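The two-score idea can be illustrated with a minimal sketch. It assumes the textbook HITS update (authority is the sum of the hub scores of in-linking pages, hub is the sum of the authority scores of linked pages, followed by normalization) and a made-up four-page graph; the function name `hits` and the toy data are illustrative and do not come from the paper.

```python
def hits(links, iterations=50):
    """Return (authority, hub) score dicts for a directed graph.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                new_auth[t] += hub[src]
        # Hub: sum of the (updated) authority scores of the pages linked to.
        new_hub = {p: sum(new_auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score vectors so they stay bounded.
        a_norm = sum(v * v for v in new_auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in new_hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return auth, hub


# Toy graph (hypothetical): a and b both point to c, and c points to d.
toy_links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
authority, hub_score = hits(toy_links)
```

In this toy graph, page c collects the highest authority score because two pages point to it, while a and b score highest as hubs because they point to the best authority.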

Generally speaking, the commonly used evaluation methods for competitive games are models with multiparameter input and a single result output. Although they can reflect the relative strength of the two sides in an individual match, they cannot reflect the overall characteristics of, and interactions within, the whole data set. However, these factors are the key basis for determining the ranking of teams. To overcome the shortcomings of these methods, this paper constructs a new PageRank algorithm based on the weight transfer between research objects. The algorithm is then applied to NBA games, and its predictions are compared with the regular-season points ranking. The results show that the method is effective for predicting the results of competitive competitions.

2. Method Description

This paper attempts to apply the method of ranking the importance of Google search engine pages to the prediction of NBA team playoff results. Firstly, the weight transfer matrix is constructed by using the score relationship between teams, and then, iterative calculation is carried out according to the improved PageRank matrix. Finally, the results of the game are predicted according to the strength of the teams.

2.1. PageRank Algorithm

After considering the topic identification, keyword identification, and other factors, the Google website sorts the search results fed back to users according to the PageRank value of each page. Some of the more important or classic page rankings have been improved as a result. This sorting result has been widely recognized by Google users. Specifically, Google divides the level of web pages into 10 levels based on the PageRank value, of which level 10 is the highest level. Generally speaking, when the PageRank is as low as 1 or 2, it indicates that this web page is not very popular, and when the PageRank value is greater than 7, it means that the importance of this web page is very high and it is recognized by Internet users. Generally, web pages with a PageRank value of 4 or higher are higher-quality web pages. Google’s own PageRank value is 10, which shows that this site is very well received by web users and is used frequently.

The PageRank algorithm evaluates web pages based on their links. Specifically, the higher the quality and quantity of the links pointing to a web page, the higher its PageRank value and the more important the page. The idea is that every link into a web page is equivalent to a vote for that page: the more times it is linked, the more votes it gets. This gives rise to “link popularity.” When other websites are willing to link to your website, your website is more popular, so link popularity can be used to evaluate a website’s popularity. The concept is similar to the impact factor of academic journals: the more often the articles in a journal are cited by others, the greater the influence of the journal.

Google has its own system for calculating the PageRank value. The PageRank level on the Google website ranges from 0 to a maximum of 10, and the levels are not equally spaced but follow a nonlinear relationship: the difference between a PageRank value of 6 and a PageRank value of 5 is much larger than the difference between 5 and 4. The higher the level, the greater the difference.

Because the PageRank value determines the ranking of a web page in the search results and is calculated from the links in and out of the page, people have shown great interest in web links. In the past few years, some people looked for ways to increase the number of links to their websites and even resorted to link exchanges and purchases, which had adverse effects and led Google to change its PageRank ranking system. Some types of links were then blocked. For example, “link factory” websites, which exist only to link and have no substantive content, have no PageRank value assigned to any of their pages. Likewise, when a website is linked from a website with a high PageRank value but the two are in fact barely related (for example, the website of a famous TV show links to a page on the basic principles of chemical engineering), no PageRank value is passed on. At the same time, Google also lengthened the interval between updates of each web page’s PageRank value so that network users could more easily supervise the ranking.

The PageRank algorithm, invented by Larry Page and Sergey Brin, the two founders of Google, is used by the Google search engine to rank the importance of web pages. It determines the importance of a web page from the interconnection of all pages on the Internet. If page A links to page B, page A passes part of its importance on to page B. Google calculates the importance of a page according to the quantity and quality of its inbound links: the more links a page receives, the more importance it is given, and the more important the linking pages are, the more importance each link delivers. This transfer of quality and quantity also applies to the transfer of importance between teams in competitive sports.

A classic PageRank model is as follows:

$$PR(u) = (1 - d) + d \sum_{v \in B_u} \frac{PR(v)}{N_v}, \tag{1}$$

where $u$ is the page to be evaluated; $B_u$ is the set of pages linking to $u$; $PR(u)$ and $PR(v)$ represent the PageRank values of page $u$ and page $v$, respectively; $N_v$ represents the number of outbound links of page $v$; and $d$ is the damping coefficient, which is used to calculate the PageRank value when a web page has no outbound links.
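The classic model can be sketched in a few lines of Python. The sketch assumes the common non-normalized form in which every page starts at 1, each page v splits its score equally among its outbound links, and the damping coefficient is d = 0.85. The function name `pagerank`, the tolerance, and the three-page toy web are illustrative assumptions rather than details from the paper.

```python
def pagerank(links, d=0.85, tol=1e-8, max_iter=200):
    """Iterate PR(u) = (1 - d) + d * sum(PR(v) / N_v) over pages v linking to u.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    pr = {p: 1.0 for p in pages}
    for _ in range(max_iter):
        new_pr = {p: 1.0 - d for p in pages}
        for v, targets in links.items():
            if not targets:
                continue  # a page without outbound links passes nothing on here
            share = d * pr[v] / len(targets)  # equal split over N_v links
            for u in targets:
                new_pr[u] += share
        if max(abs(new_pr[p] - pr[p]) for p in pages) < tol:
            return new_pr
        pr = new_pr
    return pr


# Toy web (hypothetical): A links to B and C, B links to C, C links to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(toy_web)
```

Page C ends up above page B because it receives a full share from B in addition to half of A's score, which is the equal-split behaviour the formula describes.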

2.2. Improved PageRank Algorithm

The PageRank algorithm is one of the classic search engine algorithms and has long received attention and application from researchers, but it still needs improvement. One conclusion can be drawn from the PageRank formula: the weight of a web page depends heavily on the number of links to it. A newly published web page has been online for a short time and has few inbound links, so its PageRank value is low, whereas an old web page that has been online for a long time has a high PageRank value because of its many links. As a result, the latest information a user is searching for is usually ranked relatively low in the query results, which does not meet the user’s actual needs. In addition, the PageRank algorithm takes the number of links in and out of web pages as the main factor and cannot tell whether a linking or linked document is related to the content of the search, which may cause the subject of the search results to drift. For example, Sina and Sohu are well-known websites with many inbound links and high PageRank values. If a user searches with Sina or Sohu as a keyword or part of a keyword, these pages usually appear near the top of the query results even though the user may not need them.

In view of these defects, improvements can be built on top of the basic PageRank algorithm. For example, factors derived from the HTML of the web page can be added to it. In topic-sensitive variants of PageRank, all pages are classified by topic and the PageRank is calculated per topic from the classification results, so that each page has a separate score for each topic and its importance is reflected by those scores. As the time since a web page was published increases, the importance of its information continues to decline. A time feedback factor can therefore be added to the PageRank algorithm to reflect the impact of publication time on search engine rankings, so that web pages with the same content receive different PageRank values when their publication times differ. The ranking results given by this latter algorithm meet the expectations of most users and effectively improve the efficiency of search engines.

As can be seen from formula (1), the classic PageRank algorithm counts the outbound links of a page and then distributes the page’s PageRank value equally among them. In this paper’s example, however, any two teams play each other two to four times, the win-loss relationship between them is not fixed, and the PageRank value a team transfers to other teams is not evenly distributed. It is therefore unreasonable to use the traditional PageRank algorithm for team ranking prediction; the transfer between teams should be weighted rather than averaged.

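A probability function $W(u, v)$ is introduced to represent the weight transferred from team $v$ to team $u$. Therefore, the improved PageRank model is as follows:

$$PR(u) = (1 - d) + d \sum_{v \in B_u} W(u, v)\, PR(v), \tag{2}$$

where $B_u$ is the set of teams that transfer weight to team $u$ and $d$ is the damping coefficient.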

As the object of this paper is a closed system in which each team has played every other team, no team lacks outbound links, and the damping coefficient can be ignored. Formula (2) can be simplified as follows:

$$PR(u) = \sum_{v \in B_u} W(u, v)\, PR(v), \tag{3}$$

where $W(u, v)$ is the weight transferred from team $v$ to team $u$.

Formula (3) is essentially a PageRank algorithm with a probability function and can be approximately understood as a Markov process. By processing the eigenvalues, the Markov matrix can be guaranteed to converge to a stable state. Finally, a Python program evaluates formula (3), updating the PageRank score of each team iteratively until the change between iterations is less than the threshold value, at which point the program ends.
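The iterative update can be sketched as follows. This is not the authors' program: it assumes the weights are stored as a nested dictionary `weights[v][u]` (the share of team v's score passed to team u, with each team's outgoing shares summing to 1), that every team starts at 1, and that the stopping threshold is 1e-6. The three teams X, Y, and Z and their weights are invented for illustration.

```python
def weighted_pagerank(weights, tol=1e-6, max_iter=1000):
    """Iterate PR(u) = sum over v of W(u, v) * PR(v) until the scores settle.

    weights[v][u] holds the weight transferred from team v to team u.
    Returns the final scores and the number of iterations performed.
    """
    teams = list(weights)
    pr = {t: 1.0 for t in teams}  # every team starts with a score of 1
    iteration = 0
    for iteration in range(1, max_iter + 1):
        new_pr = {
            u: sum(weights[v].get(u, 0.0) * pr[v] for v in teams)
            for u in teams
        }
        change = max(abs(new_pr[t] - pr[t]) for t in teams)
        pr = new_pr
        if change < tol:  # stop once the largest update is below the threshold
            break
    return pr, iteration


# Hypothetical transfer weights for three teams; each inner dict sums to 1.
toy_weights = {
    "X": {"Y": 0.4, "Z": 0.6},
    "Y": {"X": 0.7, "Z": 0.3},
    "Z": {"X": 0.6, "Y": 0.4},
}
team_scores, n_iter = weighted_pagerank(toy_weights)
ranking = sorted(team_scores, key=team_scores.get, reverse=True)
```

Because each team's outgoing weights sum to 1, the total score is conserved, so the average score stays at 1 and a team above 1 has received more than it has given away.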

2.3. Selection of Measures and Weights

The segmentation of web pages based on VIPS is an iterative process with three main steps. First is page block extraction; that is, the HTML DOM tree corresponding to the current page is obtained from the browser, and the visual information of the DOM tree is used to segment the page semantically. Second is divider detection, which finds the horizontal and vertical dividers in the page according to its visual information. Third is semantic block reconstruction; that is, the page hierarchy is laid out again on the basis of the dividers obtained in the second step, and some blocks are merged to form new semantic blocks.

After the web page is divided into blocks, it is purified by filtering out noise blocks such as advertisements and navigation bars. At the same time, the category of the page can be determined from the ratio of link text to nonlink text in each block and the ratio of the number of words to the number of pictures. Pages of irrelevant categories can then be filtered out directly.

There are two ways to determine the weight W(u, v) of the PageRank value transferred between teams. The first is to measure the weight transfer according to the win-loss relationship between teams. Taking team A and team B as an example, this approach finds all matches between the two teams and divides the number of games won by team A by the number of games played between them to obtain the weight q of team A against team B. Conversely, the weight of team B against team A is 1 − q. When the sample is large enough, this approach can reflect the real strength relationship between the two teams, but the vast majority of sports competitions cannot provide enough samples. In this paper, any two NBA teams meet only 2–4 times, and the scores of some matches are quite close. A small difference in score can decide a game and therefore seriously distort the weight, so the weight cannot accurately reflect the strength gap. This method is therefore not applicable.

The second way is to add up the scores of team A and team B over all their meetings and divide team A’s total score by the combined total of the two teams to obtain the weight of team A against team B. This paper adopts this method.
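The contrast between the two weighting schemes can be illustrated with a short sketch. The game tuples, team names, and scores are invented for illustration, and the helper names `head_to_head`, `win_ratio_weight`, and `score_share_weight` are hypothetical.

```python
def head_to_head(games, a, b):
    """Yield (points_a, points_b) for every game between teams a and b."""
    for team1, score1, team2, score2 in games:
        if (team1, team2) == (a, b):
            yield score1, score2
        elif (team1, team2) == (b, a):
            yield score2, score1


def win_ratio_weight(games, a, b):
    """First approach: games won by a divided by games played against b."""
    results = list(head_to_head(games, a, b))
    return sum(pa > pb for pa, pb in results) / len(results)


def score_share_weight(games, a, b):
    """Second approach: a's total points divided by both teams' total points."""
    results = list(head_to_head(games, a, b))
    total_a = sum(pa for pa, _ in results)
    total_b = sum(pb for _, pb in results)
    return total_a / (total_a + total_b)


# Hypothetical season series: A wins two close games, B wins one by 25 points.
toy_games = [
    ("A", 101, "B", 99),
    ("B", 120, "A", 95),
    ("A", 110, "B", 108),
]
q_wins = win_ratio_weight(toy_games, "A", "B")      # 2 wins out of 3 games
q_points = score_share_weight(toy_games, "A", "B")  # 306 of 633 total points
```

With only three meetings, two narrow wins push the win-ratio weight of team A to 2/3, while the score share stays just below one half. This is the small-sample sensitivity that leads the paper to prefer the score-based weight.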

2.4. The Application of Improvement Algorithm in This Example

The main NBA competition (the summer league does not count toward team performance) is divided into two stages: the regular season and the playoffs. The regular season, which ends in April each year, determines the 16 teams participating in the playoffs, namely, 8 Eastern teams and 8 Western teams. In the middle of the regular season there is also a special event, the All-Star Exhibition Game, held in February every year. The trade deadline for players falls on the Thursday of the 16th week of the regular season. After the trade deadline, each team can only complete the remaining regular-season and playoff games of the year with its existing players. The trade deadline usually falls around the All-Star Exhibition Game.

NBA games are divided into home and away games. Under the influence of factors such as familiarity with the home court, support from the audience, and referees, a team usually performs better at home than away against the same opponent. Therefore, this article classifies and counts each team’s results by home and away games and estimates the team’s win rate at home and away. Taking into account the stability of each team’s roster and the running-in period, this article uses the season data before the All-Star Game to predict each team’s wins and losses after the All-Star Game. After the start of the main NBA competition, the results of each team are counted.

Unlike football and volleyball, the other two of the three major ball sports, basketball games between two teams of similar strength have a more uncertain outcome. Basketball games cannot end in a draw, so between evenly matched teams overtime is played whenever the score is tied at the end of regulation. Similarly, such games often end with a last-second game-winner, in which a team scores at the very end of the game, changes the score, and wins. Therefore, to prevent accidental factors such as overtime or a last-second game-winner from inflating the apparent winning probability of one of two evenly matched teams, this article proposes to exclude these highly random games from the original data set, so that the retained results measure the strength of a team more accurately.
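A minimal sketch of this exclusion step follows. It assumes each game record carries two boolean flags, `overtime` and `decided_at_buzzer`; both field names, the `Game` record, and the sample games are hypothetical, since the paper does not describe how such games are identified in its data.

```python
from dataclasses import dataclass


@dataclass
class Game:
    home: str
    away: str
    home_points: int
    away_points: int
    overtime: bool = False           # hypothetical flag: game went to overtime
    decided_at_buzzer: bool = False  # hypothetical flag: last-second game-winner


def drop_high_randomness(games):
    """Keep only games decided in regulation without a last-second winner."""
    return [g for g in games if not (g.overtime or g.decided_at_buzzer)]


# Invented sample: one ordinary game, one overtime game, one buzzer-beater.
sample_games = [
    Game("A", "B", 112, 101),
    Game("B", "C", 125, 123, overtime=True),
    Game("C", "A", 99, 98, decided_at_buzzer=True),
]
retained = drop_high_randomness(sample_games)
```

Only the retained games would then feed the weight calculation, at the cost of a smaller sample for each pair of teams.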

The two basic assumptions of PageRank algorithm make the PageRank algorithm insensitive to the initial value assigned to participate in the calculation, that is, the result of PageRank calculation is determined by the topology and transmission relationship of each node in the network. The algorithm constantly calculates and determines the PageRank score of each page node and finally reaches a stable state. PageRank obtains the importance of the web page based on this. In this paper, the improved PageRank algorithm is used to calculate the importance of each team, and the value is used as the weight of the team. The larger the PageRank value is, the stronger the team is.
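The insensitivity to initial values can be checked numerically with a small sketch. It assumes the simplified update in which each team's new score is the weighted sum of the other teams' scores, and it reuses an invented three-team weight table; the function name `stationary_shares` and both starting vectors are illustrative.

```python
def stationary_shares(weights, start, steps=200):
    """Apply the weighted update repeatedly and return scores normalized to sum to 1.

    weights[v][u] is the weight transferred from team v to team u.
    """
    teams = list(weights)
    pr = dict(start)
    for _ in range(steps):
        pr = {
            u: sum(weights[v].get(u, 0.0) * pr[v] for v in teams)
            for u in teams
        }
    total = sum(pr.values())
    return {t: pr[t] / total for t in teams}


# Hypothetical transfer weights; each team's outgoing weights sum to 1.
demo_weights = {
    "X": {"Y": 0.4, "Z": 0.6},
    "Y": {"X": 0.7, "Z": 0.3},
    "Z": {"X": 0.6, "Y": 0.4},
}
from_uniform = stationary_shares(demo_weights, {"X": 1.0, "Y": 1.0, "Z": 1.0})
from_skewed = stationary_shares(demo_weights, {"X": 5.0, "Y": 0.1, "Z": 2.0})
```

Both starting vectors lead to the same normalized shares, so in this example the final ordering depends only on the transfer weights and not on the scores assigned at the start.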

In this paper, Python language is used to develop the improved PageRank algorithm. The flow of the experiment is shown in Figure 1.

3. Empirical Analysis

3.1. Data Selection and Analysis

This paper selects the game data of NBA teams in the 2018-2019 regular season as the research object, for two reasons. First, compared with other competitive sports, the NBA offers more research samples: a regular season has a total of 1230 games, more than most other competitions. Second, every pair of NBA teams plays at least two games, so no two teams fail to meet. If some teams did not meet, no importance would be transferred between them, and the resulting sample would not be suitable for this method. Figure 2 shows the big data platform for NBA playoff ranking prediction.

In this paper, the weight of team A against team B is calculated by summing the head-to-head scores for each pair among the 30 teams. After arranging these weights, a 30 × 30 weight matrix of losers is obtained.

Normalizing the rows of the matrix and then transposing it gives formula (5).
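The row normalization and transposition step can be sketched as follows. The 3 × 3 matrix is invented for illustration (the paper's matrix is 30 × 30), and the helper names `row_normalize` and `transpose` are hypothetical; every row is assumed to have a nonzero sum.

```python
def row_normalize(matrix):
    """Divide every entry by the sum of its row so that each row sums to 1."""
    return [[x / sum(row) for x in row] for row in matrix]


def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]


# Hypothetical pairwise score shares for three teams (zero diagonal).
raw_weights = [
    [0.00, 0.52, 0.49],
    [0.48, 0.00, 0.51],
    [0.51, 0.49, 0.00],
]
normalized = row_normalize(raw_weights)
transfer = transpose(normalized)

# After transposition each column sums to 1, so multiplying the matrix by the
# score vector redistributes the scores without changing their total.
column_sums = [sum(row[j] for row in transfer) for j in range(len(transfer))]
```

Entry `transfer[u][v]` then plays the role of the weight passed from team v to team u in the iterative update.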

Because the improved PageRank algorithm is not sensitive to the initial value of the evaluation object, the initial value of PageRank of each team is assigned as 1 and calculated by formula (3). The matrix reaches a stable state at the 7th iteration, and the team’s PR value does not change any more. The results are shown in Table 1.

From the definition of PR value, it can be understood that the larger the team’s PR value, the more positive the transmission of other teams’ PR value to the team. The higher the team’s ranking in the league, the stronger the overall strength.

3.2. Data Comparison

To verify the rationality of the experiment, this paper compares the team ranking produced by the improved PageRank algorithm with the regular-season ranking by number of wins, as shown in Figure 3. The similar trend between the team’s PR value and the number of winning games continues from 90% to 94%. The Bucks rank first in the league by both wins and PR value, while the weakest team in the league is the Cavaliers according to the improved PageRank algorithm and the Knicks according to the number of wins.

According to the NBA competition system, the teams of the league are divided into eastern and western regions, each of which has 15 teams. Of the 30 teams, 16 are selected to compete in the playoffs, where strong teams are matched against weak teams according to the regular-season points. The PageRank values and the numbers of winning games are sorted separately for the eastern and western regions. The results are shown in Figures 4 and 5.

Dividing the league into regions and matching strong teams against weak ones prevents strong teams from meeting too early and being eliminated prematurely. At the same time, apart from the strongest teams in the east, the improved PageRank values of the eastern teams are below the equilibrium value of 1, so these teams would be at a disadvantage against western teams. The division therefore helps to increase the audience of the games and improve the viewing experience.

As can be seen from Figures 4 and 5, the improved PageRank indicates that, in the eastern division, the win-based ranking overestimates the strength of the Celtics and 76ers and underestimates that of the Pacers; in the western division, it overestimates the Nuggets and underestimates the Jazz and Blazers. Because the NBA competition system has no third-place game, we rank only the first two teams in the eastern and western regions according to the division rules and compare them with the actual results of the games. The comparison results are shown in Table 2.

According to Table 2, the number of winning games and the improved PageRank algorithm agree on the western champion. For the western runner-up, the ranking by winning games gives the Nuggets and the improved PageRank algorithm gives the Jazz, whereas the actual western runner-up was the Trail Blazers. It should be further pointed out that the improved PageRank algorithm rates the Blazers above the Nuggets, which is closer to the actual game results. In the east, both methods give the same champion and runner-up. In predicting the championship, both the win-count ranking and the improved PageRank method choose the Bucks, but because the Bucks were defeated by the Raptors, neither predicts correctly. In predicting the league’s top four, both methods successfully predict three of the four teams. The ranking predictions of eastern and western teams are shown in Figures 6 and 7, respectively.

4. Conclusions

The traditional evaluation model with multiparameter input and a single result output can only reflect the relative strength of the two sides in individual matches and cannot reflect the season as a whole. Ranking team strength by points, meanwhile, is subject to chance, and a slight difference in score can sometimes affect the overall result. In this paper, an improved PageRank algorithm is used to rank team strength from the overall perspective of the competition data, and the results of the playoffs are predicted. The experimental results show that its prediction accuracy is comparable to that of the points-ranking method and that it is closer to the actual results in some respects. This paper only uses the 2018-2019 season data and does not consider other historical data. Factors such as playing style, sudden injuries to key players, and the inherent randomness of competitive sports cause the predictions to deviate to some extent. The effect of other historical data needs to be further verified.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by Soft Science Research Program of Shaanxi Province (2019KRM101) and Scientific Research Foundation for doctors of Xi’an Polytechnic University (3100401016).