|
5 | 5 | * SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later |
6 | 6 | */ |
7 | 7 |
|
| 8 | +//! Computes an approximation of the neighborhood function, of the size of the |
| 9 | +//! reachable sets, and of (discounted) positive geometric centralities of a |
| 10 | +//! graph using HyperBall. |
| 11 | +//! |
| 12 | +//! HyperBall is an algorithm computing by dynamic programming an approximation |
| 13 | +//! of the sizes of the balls of growing radius around the nodes of a graph. |
| 14 | +//! Starting from these data, it can approximate the _neighborhood function_ of |
| 15 | +//! a graph (i.e., the function returning for each _t_ the number of pairs of |
| 16 | +//! nodes at distance at most _t_), the number of nodes reachable from each |
| 17 | +//! node, Bavelas's closeness centrality, Lin's index, and _harmonic centrality_ |
| 18 | +//! (studied by Paolo Boldi and Sebastiano Vigna in "[Axioms for |
| 19 | +//! Centrality]", _Internet Math._, 10(3-4):222–262, 2014). HyperBall can also |
| 20 | +//! compute _discounted centralities_, in which the _discount_ assigned to a |
| 21 | +//! node is some specified function of its distance. All centralities are |
| 22 | +//! computed in their _positive_ version (i.e., using distance _from_ the |
| 23 | +//! source: see below how to compute the more usual, and useful, _negative_ |
| 24 | +//! version). |
| 25 | +//! |
| 26 | +//! HyperBall has been described by Paolo Boldi and Sebastiano Vigna in |
| 27 | +//! "[In-Core Computation of Geometric Centralities with HyperBall: A Hundred |
| 28 | +//! Billion Nodes and Beyond][HyperBall paper]", _Proc. of 2013 IEEE 13th |
| 29 | +//! International Conference on Data Mining Workshops (ICDMW 2013)_, IEEE, 2013, |
| 30 | +//! and it is a generalization of the method described in "[HyperANF: |
| 31 | +//! Approximating the Neighborhood Function of Very Large Graphs on a |
| 32 | +//! Budget][HyperANF paper]", by Paolo Boldi, Marco Rosa, and Sebastiano Vigna, |
| 33 | +//! _Proceedings of the 20th international conference on World Wide Web_, pages |
| 34 | +//! 625–634, ACM, 2011. |
| 35 | +//! |
| 36 | +//! Incidentally, HyperBall (actually, HyperANF) has been used to show that |
| 37 | +//! Facebook has just [four degrees of separation]. |
| 38 | +//! |
| 39 | +//! # Algorithm |
| 40 | +//! |
| 41 | +//! At step _t_, for each node we (approximately) keep track (using |
| 42 | +//! [HyperLogLog counters]) of the set of nodes at distance at most _t_. At |
| 43 | +//! each iteration, the sets associated with the successors of each node are |
| 44 | +//! merged, thus obtaining the new sets. A crucial component in making this |
| 45 | +//! process efficient and scalable is the usage of broadword programming to |
| 46 | +//! implement the merge phase, which requires maximising in parallel the list of |
| 47 | +//! registers associated with each successor. |
| 48 | +//! |
| 49 | +//! Using the approximate sets, for each _t_ we estimate the number of pairs of |
| 50 | +//! nodes (_x_, _y_) such that the distance from _x_ to _y_ is at most _t_. |
| 51 | +//! Since during the computation we also know the number of nodes at distance |
| 52 | +//! at most _t_ − 1, we can also perform computations using the number of |
| 53 | +//! nodes at distance _exactly_ _t_ (e.g., centralities). |
| 54 | +//! |
| 55 | +//! # Systolic Computation |
| 56 | +//! |
| 57 | +//! If you additionally pass the _transpose_ of your graph, when three quarters |
| 58 | +//! of the nodes stop changing their value HyperBall will switch to a _systolic_ |
| 59 | +//! computation: using the transpose, when a node changes it will signal back to |
| 60 | +//! its predecessors that at the next iteration they could change. At the next |
| 61 | +//! scan, only the successors of signalled nodes will be scanned. In particular, |
| 62 | +//! when a very small number of nodes is modified by an iteration, HyperBall |
| 63 | +//! will switch to a systolic _local_ mode, in which all information about |
| 64 | +//! modified nodes is kept in (traditional) dictionaries, rather than being |
| 65 | +//! represented as arrays of booleans. This strategy makes the last phases of |
| 66 | +//! the computation orders of magnitude faster, and makes in practice the |
| 67 | +//! running time of HyperBall proportional to the theoretical bound |
| 68 | +//! _O_(_m_ log _n_), where _n_ is the number of nodes and _m_ is the number of |
| 69 | +//! arcs of the graph. Note that graphs with a large diameter require a |
| 70 | +//! correspondingly large number of iterations, and these iterations will have |
| 71 | +//! to pass over all nodes if you do not provide the transpose. |
| 72 | +//! |
| 73 | +//! # Stopping Criterion |
| 74 | +//! |
| 75 | +//! Deciding when to stop iterating is a rather delicate issue. The only safe |
| 76 | +//! way is to iterate until no counter is modified, and systolic (local) |
| 77 | +//! computation makes this goal easily attainable. However, in some cases one |
| 78 | +//! can assume that the graph is not pathological, and stop when the relative |
| 79 | +//! increment of the number of pairs goes below some threshold. |
| 80 | +//! |
| 81 | +//! # Computing Centralities |
| 82 | +//! |
| 83 | +//! Note that usually one is interested in the _negative_ version of a |
| 84 | +//! centrality measure, that is, the version that depends on the _incoming_ |
| 85 | +//! arcs. HyperBall can compute only _positive_ centralities: if you are |
| 86 | +//! interested (as usually happens) in the negative version, you must pass to |
| 87 | +//! HyperBall the _transpose_ of the graph (and if you want to run in systolic |
| 88 | +//! mode, the original graph, which is the transpose of the transpose). Note |
| 89 | +//! that the neighborhood function of the transpose is identical to the |
| 90 | +//! neighborhood function of the original graph, so the exchange does not alter |
| 91 | +//! its computation. |
| 92 | +//! |
| 93 | +//! # Node Weights |
| 94 | +//! |
| 95 | +//! HyperBall can manage to a certain extent a notion of _node weight_ in its |
| 96 | +//! computation of centralities. Weights must be nonnegative integers, and the |
| 97 | +//! initialization phase requires generating a random integer for each unit of |
| 98 | +//! overall weight, as weights are simulated by loading the counter of a node |
| 99 | +//! with multiple elements. Combining this feature with discounts, one can |
| 100 | +//! compute _discounted-gain centralities_ as defined in the [HyperBall paper]. |
| 101 | +//! |
| 102 | +//! # Performance |
| 103 | +//! |
| 104 | +//! Most of the memory goes into storing HyperLogLog registers. By tuning the |
| 105 | +//! number of registers per counter, you can modify the memory allocated for |
| 106 | +//! them. Note that you can only choose a number of registers per counter that |
| 107 | +//! is a power of two, so your latitude in adjusting the memory used for |
| 108 | +//! registers is somewhat limited. |
| 109 | +//! |
| 110 | +//! If there are several available cores, the iterations will be _decomposed_ |
| 111 | +//! into relatively small tasks (small blocks of nodes) and each task will be |
| 112 | +//! assigned to the first available core. Since all tasks are completely |
| 113 | +//! independent, this behavior ensures a very high degree of parallelism. Be |
| 114 | +//! careful, however, because this feature requires a graph with a reasonably |
| 115 | +//! fast random access (e.g., in the case of short reference chains in a |
| 116 | +//! [`BvGraph`](webgraph::prelude::BvGraph)) and a good choice of the granularity. |
| 117 | +//! |
| 118 | +//! [Axioms for Centrality]: <http://vigna.di.unimi.it/papers.php#BoVAC> |
| 119 | +//! [HyperBall paper]: <http://vigna.di.unimi.it/papers.php#BoVHB> |
| 120 | +//! [HyperANF paper]: <http://vigna.di.unimi.it/papers.php#BoRoVHANF> |
| 121 | +//! [four degrees of separation]: <http://vigna.di.unimi.it/papers.php#BBRFDS> |
| 122 | +//! [HyperLogLog counters]: <https://docs.rs/card-est-array/latest/card_est_array/impls/struct.HyperLogLog.html> |
| 123 | +
|
8 | 124 | use anyhow::{Context, Result, bail, ensure}; |
9 | 125 | use card_est_array::impls::{HyperLogLog, HyperLogLogBuilder, SliceEstimatorArray}; |
10 | 126 | use card_est_array::traits::{ |
|
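
As a reading aid for the `# Algorithm` and `# Stopping Criterion` sections documented in this diff, the recurrence can be sketched with exact sets standing in for HyperLogLog counters. This is not the crate's implementation: `BallStats` and `exact_balls` are hypothetical names, the adjacency-list input is an assumption, and the set union here plays the role that the register-wise maximum plays in the real merge phase. The sketch iterates until no ball changes, which is the safe stopping criterion described in the docs, and it uses the difference between consecutive ball sizes (the nodes at distance exactly _t_) to accumulate positive harmonic centrality.

```rust
use std::collections::BTreeSet;

/// Output of the exact computation (hypothetical type, for illustration only).
pub struct BallStats {
    /// `neighborhood[t]` is the number of pairs (x, y) with d(x, y) <= t.
    pub neighborhood: Vec<usize>,
    /// Positive harmonic centrality: for each x, the sum of 1 / d(x, y)
    /// over all y != x reachable from x.
    pub harmonic: Vec<f64>,
}

/// Runs the ball recurrence with exact sets instead of approximate counters.
/// `succ[x]` lists the successors of node `x` (assumed input format).
pub fn exact_balls(succ: &[Vec<usize>]) -> BallStats {
    let n = succ.len();
    // The ball of radius 0 around x contains just x.
    let mut balls: Vec<BTreeSet<usize>> =
        (0..n).map(|x| BTreeSet::from([x])).collect();
    let mut neighborhood = vec![n];
    let mut harmonic = vec![0.0; n];
    let mut t = 0usize;

    loop {
        t += 1;
        // New ball of x = old ball of x merged with the old balls of its
        // successors. All new balls are built from the previous iteration.
        let next: Vec<BTreeSet<usize>> = (0..n)
            .map(|x| {
                let mut ball = balls[x].clone();
                for &y in &succ[x] {
                    ball.extend(balls[y].iter().copied());
                }
                ball
            })
            .collect();

        let mut changed = false;
        for x in 0..n {
            // Nodes at distance exactly t from x.
            let exactly_t = next[x].len() - balls[x].len();
            if exactly_t > 0 {
                changed = true;
                harmonic[x] += exactly_t as f64 / t as f64;
            }
        }
        // Safe stopping criterion: stop only when no ball was modified.
        if !changed {
            break;
        }
        balls = next;
        neighborhood.push(balls.iter().map(BTreeSet::len).sum());
    }

    BallStats { neighborhood, harmonic }
}
```

On the path 0 → 1 → 2, calling `exact_balls(&[vec![1], vec![2], vec![]])` gives a neighborhood function of `[3, 5, 6]` and positive harmonic centralities `[1.5, 1.0, 0.0]`. Passing the transpose instead would yield the negative centralities, with the same neighborhood function, as the `# Computing Centralities` section notes.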