Commit 95b5cb5

Cosmetic fixes; docs for ExactSumSweep and HyperBall

1 parent: bfecbfe

7 files changed: 192 additions & 28 deletions

algo/CHANGELOG.md

Lines changed: 1 addition & 5 deletions

@@ -17,13 +17,9 @@
 ### Changed
 
 - Several methods previously accepting a `&ThreadPool` now
-  they don't. The user can use the standard rayon global thread pool
+  they don't. The user can use the standard Rayon global thread pool
   or configure their own and use `ThreadPool::install`.
 
-## [0.3.0] -
-
-### Changed
-
 - Visits have been moved to the main WebGraph crate.
 
 ## [0.2.0] - 2025-05-23

algo/README.md

Lines changed: 1 addition & 1 deletion

@@ -53,7 +53,7 @@ Union nor the Italian MUR can be held responsible for them.
 [symm_par]: https://docs.rs/webgraph-algo/latest/webgraph_algo/sccs/fn.symm_par.html
 [Topological Sorting]: https://docs.rs/webgraph-algo/latest/webgraph_algo/fn.top_sort.html
 [Acyclicity Testing]: https://docs.rs/webgraph-algo/latest/webgraph_algo/fn.is_acyclic.html
-[HyperBall]: https://docs.rs/webgraph-algo/latest/webgraph_algo/distances/hyperball/struct.HyperBallBuilder.html
+[HyperBall]: https://docs.rs/webgraph-algo/latest/webgraph_algo/distances/hyperball/index.html
 [ExactSumSweep]: https://docs.rs/webgraph-algo/latest/webgraph_algo/distances/exact_sum_sweep/index.html
 [Layered Label Propagation]: https://docs.rs/webgraph-algo/latest/webgraph_algo/llp/index.html
 [command-line interface]: https://docs.rs/webgraph-cli/latest/index.html

algo/src/distances/exact_sum_sweep/mod.rs

Lines changed: 67 additions & 14 deletions

@@ -5,28 +5,81 @@
  * SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later
  */
 
-//! An implementation of the ExactSumSweep algorithm.
+//! Computes the radius and/or the diameter and/or all eccentricities of a
+//! graph, using the ExactSumSweep algorithm.
 //!
 //! The algorithm has been described by Michele Borassi, Pierluigi Crescenzi,
-//! Michel Habib, Walter A. Kosters, Andrea Marino, and Frank W. Takes in [Fast
+//! Michel Habib, Walter A. Kosters, Andrea Marino, and Frank W. Takes in "[Fast
 //! diameter and radius BFS-based computation in (weakly connected) real-world
-//! graphs–With an application to the six degrees of separation
-//! games](https://doi.org/10.1016/j.tcs.2015.02.033)”.
-//!
-//! The algorithm can compute the diameter, the radius, and even the
-//! eccentricities (forward and backward) of a graph. These tasks are quadratic
-//! in nature, but ExactSumSweep uses a number of heuristic to reduce the
-//! computation to a relatively small number of visits on real-world graphs. It
-//! has been used, for examples, on the [whole Facebook
-//! graph](https://doi.org/10.1145/2380718.2380723).
+//! graphs—With an application to the six degrees of separation
+//! games][ExactSumSweep paper]", _Theoretical Computer Science_,
+//! 586:59–80, 2015.
+//!
+//! # Definitions
+//!
+//! We define the _positive_, or _forward_ (resp., _negative_, or _backward_)
+//! _eccentricity_ of a node _v_ in a graph _G_ = (_V_, _E_) as
+//! ecc⁺(_v_) = max{_d_(_v_, _w_) : _w_ reachable from _v_} (resp.,
+//! ecc⁻(_v_) = max{_d_(_w_, _v_) : _w_ reaches _v_}), where _d_(_v_, _w_) is
+//! the number of arcs in a shortest path from _v_ to _w_. The _diameter_ is
+//! max{ecc⁺(_v_) : _v_ ∈ _V_}, which is also equal to
+//! max{ecc⁻(_v_) : _v_ ∈ _V_}, while the _radius_ is
+//! min{ecc⁺(_v_) : _v_ ∈ _V_'}, where _V_' is a set of vertices specified by
+//! the user. These definitions are slightly different from the standard ones due
+//! to the restriction to reachable nodes. In particular, if we simply define the
+//! radius as the minimum eccentricity, the radius of a graph containing a
+//! vertex with out-degree 0 would be 0, and this does not make much sense. For
+//! this reason, we restrict our attention only to a subset _V_' of the set of
+//! all vertices: by choosing a suitable _V_', we can specialize this definition
+//! to all definitions proposed in the literature. If _V_' is not specified, we
+//! include in _V_' all vertices from which it is possible to reach the largest
+//! strongly connected component, as suggested in the aforementioned paper.
+//!
+//! # Algorithm
+//!
+//! The algorithm performs some BFSs from "clever" vertices, and uses these BFSs
+//! to bound the eccentricity of all vertices. More specifically, for each vertex
+//! _v_, the algorithm keeps a lower and an upper bound on the forward and
+//! backward eccentricity of _v_, named _lF_\[_v_\], _lB_\[_v_\],
+//! _uF_\[_v_\], and _uB_\[_v_\]. Furthermore, it keeps a lower bound _dL_ on
+//! the diameter and an upper bound _rU_ on the radius. At each step, the
+//! algorithm performs a BFS and updates all these bounds: the radius is found as
+//! soon as _rU_ is smaller than the minimum value of _lF_, and the diameter is
+//! found as soon as _dL_ is bigger than _uF_\[_v_\] for each _v_, or _dL_ is
+//! bigger than _uB_\[_v_\] for each _v_.
+//!
+//! More specifically, the upper bound on the radius (resp., lower bound on the
+//! diameter) is defined as the minimum forward (resp., maximum forward or
+//! backward) eccentricity of a vertex from which we performed a BFS. Moreover,
+//! if we perform a forward (resp., backward) BFS from a vertex _s_, we update
+//! _lB_\[_v_\] = max(_lB_\[_v_\], _d_(_s_, _v_)) (resp.,
+//! _lF_\[_v_\] = max(_lF_\[_v_\], _d_(_v_, _s_))). Finally, for the upper
+//! bounds, a more complicated procedure handles different strongly connected
+//! components separately.
+//!
+//! # Performance
+//!
+//! Although the running time is _O_(_mn_) in the worst case, the algorithm is
+//! usually much more efficient on real-world networks when only radius and
+//! diameter are needed. It has been used, for example, on the [whole Facebook
+//! graph][Facebook].
+//!
+//! If all eccentricities are needed, the algorithm could be faster than
+//! _O_(_mn_), but in many networks it achieves performance similar to the
+//! textbook algorithm that performs a breadth-first search from each node.
+//!
+//! # Usage
 //!
 //! Depending on what you intend to compute, you have to choose the right
-//! [*level*](Level) between [`All`], [`AllForward`], [`RadiusDiameter`],
+//! [_level_](Level) between [`All`], [`AllForward`], [`RadiusDiameter`],
 //! [`Diameter`], and [`Radius`]. Then you have to invoke [`run`](Level::run) or
 //! [`run_symm`](Level::run_symm). In the first case, you have to provide a
 //! graph and its transpose; in the second case, you have to provide a symmetric
-//! graph. The methods returns a suitable structure containing the result of the
-//! algorithm.
+//! graph. The methods return a suitable structure containing the result of the
+//! computation.
+//!
+//! [ExactSumSweep paper]: <https://doi.org/10.1016/j.tcs.2015.02.033>
+//! [Facebook]: <https://doi.org/10.1145/2380718.2380723>
 //!
 //! # Examples
 //!

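The definitions in the new module docs (forward eccentricity, and a radius restricted to a vertex set _V_') can be made concrete with the quadratic textbook baseline that the Performance section mentions: one BFS per node. The sketch below is plain, self-contained Rust and deliberately does not use the `webgraph-algo` API; the adjacency-list representation and the `bfs`/`forward_ecc` names are illustrative only.

```rust
use std::collections::VecDeque;

/// Distances from `src` by BFS; `usize::MAX` marks unreachable nodes.
fn bfs(adj: &[Vec<usize>], src: usize) -> Vec<usize> {
    let mut dist = vec![usize::MAX; adj.len()];
    dist[src] = 0;
    let mut queue = VecDeque::from([src]);
    while let Some(v) = queue.pop_front() {
        for &w in &adj[v] {
            if dist[w] == usize::MAX {
                dist[w] = dist[v] + 1;
                queue.push_back(w);
            }
        }
    }
    dist
}

/// ecc⁺(v): the maximum distance from `v` to a node reachable from it.
fn forward_ecc(adj: &[Vec<usize>], v: usize) -> usize {
    bfs(adj, v).into_iter().filter(|&d| d != usize::MAX).max().unwrap()
}

fn main() {
    // Toy digraph: a 3-cycle 0 → 1 → 2 → 0 plus an arc 2 → 3 into a sink.
    let adj = vec![vec![1], vec![2], vec![0, 3], vec![]];
    let ecc: Vec<usize> = (0..adj.len()).map(|v| forward_ecc(&adj, v)).collect();
    assert_eq!(ecc, [3, 2, 2, 0]); // ecc⁺(3) = 0: the sink reaches only itself
    // Diameter: max over all nodes. Radius: min over V′; taking V′ = {0, 1, 2}
    // (the nodes that reach the cycle) avoids the trivial radius 0 of the sink.
    let diameter = *ecc.iter().max().unwrap();
    let radius = *ecc[..3].iter().min().unwrap();
    println!("diameter = {diameter}, radius = {radius}"); // 3 and 2
}
```

ExactSumSweep computes the same quantities with far fewer visits by maintaining the per-node bounds _lF_, _uF_, _lB_, _uB_ described above instead of running all _n_ BFSs.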
algo/src/distances/hyperball.rs

Lines changed: 116 additions & 0 deletions

@@ -5,6 +5,122 @@
  * SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later
  */
 
+//! Computes an approximation of the neighborhood function, of the size of the
+//! reachable sets, and of (discounted) positive geometric centralities of a
+//! graph using HyperBall.
+//!
+//! HyperBall is an algorithm computing by dynamic programming an approximation
+//! of the sizes of the balls of growing radius around the nodes of a graph.
+//! Starting from these data, it can approximate the _neighborhood function_ of
+//! a graph (i.e., the function returning for each _t_ the number of pairs of
+//! nodes at distance at most _t_), the number of nodes reachable from each
+//! node, Bavelas's closeness centrality, Lin's index, and _harmonic centrality_
+//! (studied by Paolo Boldi and Sebastiano Vigna in "[Axioms for
+//! Centrality]", _Internet Math._, 10(3-4):222–262, 2014). HyperBall can also
+//! compute _discounted centralities_, in which the _discount_ assigned to a
+//! node is some specified function of its distance. All centralities are
+//! computed in their _positive_ version (i.e., using distance _from_ the
+//! source: see below how to compute the more usual, and useful, _negative_
+//! version).
+//!
+//! HyperBall has been described by Paolo Boldi and Sebastiano Vigna in
+//! "[In-Core Computation of Geometric Centralities with HyperBall: A Hundred
+//! Billion Nodes and Beyond][HyperBall paper]", _Proc. of 2013 IEEE 13th
+//! International Conference on Data Mining Workshops (ICDMW 2013)_, IEEE, 2013,
+//! and it is a generalization of the method described in "[HyperANF:
+//! Approximating the Neighborhood Function of Very Large Graphs on a
+//! Budget][HyperANF paper]", by Paolo Boldi, Marco Rosa, and Sebastiano Vigna,
+//! _Proceedings of the 20th international conference on World Wide Web_, pages
+//! 625–634, ACM, 2011.
+//!
+//! Incidentally, HyperBall (actually, HyperANF) has been used to show that
+//! Facebook has just [four degrees of separation].
+//!
+//! # Algorithm
+//!
+//! At step _t_, for each node we (approximately) keep track (using
+//! [HyperLogLog counters]) of the set of nodes at distance at most _t_. At
+//! each iteration, the sets associated with the successors of each node are
+//! merged, thus obtaining the new sets. A crucial component in making this
+//! process efficient and scalable is the usage of broadword programming to
+//! implement the merge phase, which requires maximising in parallel the list of
+//! registers associated with each successor.
+//!
+//! Using the approximate sets, for each _t_ we estimate the number of pairs of
+//! nodes (_x_, _y_) such that the distance from _x_ to _y_ is at most _t_.
+//! Since during the computation we are also in possession of the number of
+//! nodes at distance _t_ − 1, we can also perform computations using the
+//! number of nodes at distance _exactly_ _t_ (e.g., centralities).
+//!
+//! # Systolic Computation
+//!
+//! If you additionally pass the _transpose_ of your graph, when three quarters
+//! of the nodes stop changing their value HyperBall will switch to a _systolic_
+//! computation: using the transpose, when a node changes it will signal back to
+//! its predecessors that at the next iteration they could change. At the next
+//! scan, only the successors of signalled nodes will be scanned. In particular,
+//! when a very small number of nodes is modified by an iteration, HyperBall
+//! will switch to a systolic _local_ mode, in which all information about
+//! modified nodes is kept in (traditional) dictionaries, rather than being
+//! represented as arrays of booleans. This strategy makes the last phases of
+//! the computation orders of magnitude faster, and makes in practice the
+//! running time of HyperBall proportional to the theoretical bound
+//! _O_(_m_ log _n_), where _n_ is the number of nodes and _m_ is the number of
+//! arcs of the graph. Note that graphs with a large diameter require a
+//! correspondingly large number of iterations, and these iterations will have
+//! to pass over all nodes if you do not provide the transpose.
+//!
+//! # Stopping Criterion
+//!
+//! Deciding when to stop iterating is a rather delicate issue. The only safe
+//! way is to iterate until no counter is modified, and systolic (local)
+//! computation makes this goal easily attainable. However, in some cases one
+//! can assume that the graph is not pathological, and stop when the relative
+//! increment of the number of pairs goes below some threshold.
+//!
+//! # Computing Centralities
+//!
+//! Note that usually one is interested in the _negative_ version of a
+//! centrality measure, that is, the version that depends on the _incoming_
+//! arcs. HyperBall can compute only _positive_ centralities: if you are
+//! interested (as it usually happens) in the negative version, you must pass to
+//! HyperBall the _transpose_ of the graph (and if you want to run in systolic
+//! mode, the original graph, which is the transpose of the transpose). Note
+//! that the neighborhood function of the transpose is identical to the
+//! neighborhood function of the original graph, so the exchange does not alter
+//! its computation.
+//!
+//! # Node Weights
+//!
+//! HyperBall can manage to a certain extent a notion of _node weight_ in its
+//! computation of centralities. Weights must be nonnegative integers, and the
+//! initialization phase requires generating a random integer for each unit of
+//! overall weight, as weights are simulated by loading the counter of a node
+//! with multiple elements. Combining this feature with discounts, one can
+//! compute _discounted-gain centralities_ as defined in the [HyperBall paper].
+//!
+//! # Performance
+//!
+//! Most of the memory goes into storing HyperLogLog registers. By tuning the
+//! number of registers per counter, you can modify the memory allocated for
+//! them. Note that you can only choose a number of registers per counter that
+//! is a power of two, so your latitude in adjusting the memory used for
+//! registers is somewhat limited.
+//!
+//! If there are several available cores, the iterations will be _decomposed_
+//! into relatively small tasks (small blocks of nodes) and each task will be
+//! assigned to the first available core. Since all tasks are completely
+//! independent, this behavior ensures a very high degree of parallelism. Be
+//! careful, however, because this feature requires a graph with reasonably
+//! fast random access (e.g., short reference chains in a
+//! [`BvGraph`](webgraph::prelude::BvGraph)) and a good choice of the granularity.
+//!
+//! [Axioms for Centrality]: <http://vigna.di.unimi.it/papers.php#BoVAC>
+//! [HyperBall paper]: <http://vigna.di.unimi.it/papers.php#BoVHB>
+//! [HyperANF paper]: <http://vigna.di.unimi.it/papers.php#BoRoVHANF>
+//! [four degrees of separation]: <http://vigna.di.unimi.it/papers.php#BBRFDS>
+//! [HyperLogLog counters]: <https://docs.rs/card-est-array/latest/card_est_array/impls/struct.HyperLogLog.html>
+
 use anyhow::{Context, Result, bail, ensure};
 use card_est_array::impls::{HyperLogLog, HyperLogLogBuilder, SliceEstimatorArray};
 use card_est_array::traits::{

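The dynamic programming described in the new HyperBall docs (merge each node's ball with its successors' balls; stop when no counter changes) can be illustrated with exact sets standing in for the HyperLogLog counters. This is a hedged sketch, not the crate's implementation: real HyperBall replaces each `HashSet` with a constant-size approximate counter and merges by maximizing registers, which is what makes it scale to billions of nodes.

```rust
use std::collections::HashSet;

/// Exact neighborhood function of a digraph: element t is the number of
/// ordered pairs (x, y) with d(x, y) ≤ t. HyperBall estimates the same
/// quantity with a HyperLogLog counter in place of each exact set.
fn neighborhood_function(adj: &[Vec<usize>]) -> Vec<usize> {
    let n = adj.len();
    // ball[v] is, at step t, the set of nodes at distance at most t from v.
    let mut ball: Vec<HashSet<usize>> = (0..n).map(|v| HashSet::from([v])).collect();
    let mut nf = vec![n]; // t = 0: every node reaches exactly itself
    loop {
        // Merge each ball with the balls of the node's successors: this is
        // the register-maximization step when the sets are HLL counters.
        let next: Vec<HashSet<usize>> = (0..n)
            .map(|v| {
                let mut s = ball[v].clone();
                for &w in &adj[v] {
                    s.extend(ball[w].iter().copied());
                }
                s
            })
            .collect();
        let pairs: usize = next.iter().map(|s| s.len()).sum();
        if pairs == *nf.last().unwrap() {
            return nf; // no counter changed: the safe stopping criterion
        }
        nf.push(pairs);
        ball = next;
    }
}

fn main() {
    // Toy path 0 → 1 → 2 → 3.
    let adj = vec![vec![1], vec![2], vec![3], vec![]];
    println!("{:?}", neighborhood_function(&adj)); // [4, 7, 9, 10]
}
```

The exact version above uses linear memory per node and so cannot scale; it exists only to make the iteration structure and the stopping criterion visible.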
cli/CHANGELOG.md

Lines changed: 1 addition & 1 deletion

@@ -6,7 +6,7 @@
 
 - Support for π codes in the CLI tools.
 
-- `webgraph-dist` now support running the ExactSumSweep algorithm.
+- `webgraph-dist` now supports running the ExactSumSweep algorithm.
 
 ### Changed
 

webgraph/CHANGELOG.md

Lines changed: 4 additions & 4 deletions

@@ -35,7 +35,7 @@
 
 ### Fixed
 
-- `BfsOrder` and `BfsOrderFromRoots` where reporting wrong distances,
+- `BfsOrder` and `BfsOrderFromRoots` were reporting wrong distances,
   and reporting roots multiple times.
 
 - `JavaPermutation::set_unchecked` was not properly handling endianness.

@@ -81,8 +81,8 @@
 
 ### Changed
 
-- Several methods previously accepting a `&ThreadPool` now
-  they don't. The user can use the standard Rayon global thread pool
+- Several methods previously accepting a `&ThreadPool` no
+  longer do. The user can use the standard Rayon global thread pool
   or configure their own and use `ThreadPool::install`.
 
 - `JavaPermutation` just implements `SliceByValue` and `SliceByValueMut`,

@@ -97,7 +97,7 @@
 ### Changed
 
 - There is a workspace containing three crates: `webgraph` (basic
-  infrastructure, `algo` (algorithms), and `cli` (command line
+  infrastructure), `algo` (algorithms), and `cli` (command line
   interface).
 
 - Layered Label Propagation has been moved to the `algo` crate.

webgraph/src/utils/sort_pairs.rs

Lines changed: 2 additions & 3 deletions

@@ -240,11 +240,10 @@ impl<T, I: Iterator<Item = ((usize, usize), T)>> PartialEq for HeadTail<T, I> {
 
 impl<T, I: Iterator<Item = ((usize, usize), T)>> Eq for HeadTail<T, I> {}
 
-#[allow(clippy::non_canonical_partial_ord_impl)]
 impl<T, I: Iterator<Item = ((usize, usize), T)>> PartialOrd for HeadTail<T, I> {
     #[inline(always)]
     fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
-        Some(other.head.0.cmp(&self.head.0))
+        Some(self.cmp(other))
     }
 }
 

@@ -316,7 +315,7 @@ unsafe impl<T, I: Iterator<Item = ((usize, usize), T)> + SortedIterator> SortedIterator
     for KMergeIters<I, T>
 {
 }
-#[allow(clippy::uninit_assumed_init)]
+
 impl<T, I: Iterator<Item = ((usize, usize), T)>> Iterator for KMergeIters<I, T> {
     type Item = ((usize, usize), T);
 

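The first `sort_pairs.rs` hunk above replaces a hand-written comparison with `Some(self.cmp(other))`, the canonical `PartialOrd` form that the removed `#[allow(clippy::non_canonical_partial_ord_impl)]` attribute had been silencing. For that change to preserve behavior, the `Ord` implementation must itself carry the reversed comparison, presumably used to turn `std`'s max-heap `BinaryHeap` into a min-heap over pair heads during the k-way merge. The stand-in below is a simplified, hypothetical illustration (non-generic, with a bare `head` field), not the crate's actual `HeadTail`:

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

// Simplified stand-in for a merge-heap entry ordered by its head value.
#[derive(PartialEq, Eq)]
struct HeadTail {
    head: usize,
}

impl Ord for HeadTail {
    fn cmp(&self, other: &Self) -> Ordering {
        // Reversed on purpose: a smaller head compares as greater, so the
        // max-heap BinaryHeap pops the smallest head first.
        other.head.cmp(&self.head)
    }
}

impl PartialOrd for HeadTail {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other)) // canonical form: delegate to Ord
    }
}

fn main() {
    let mut heap = BinaryHeap::new();
    heap.push(HeadTail { head: 3 });
    heap.push(HeadTail { head: 1 });
    heap.push(HeadTail { head: 2 });
    // The reversed Ord yields min-heap behavior on `head`.
    assert_eq!(heap.pop().map(|h| h.head), Some(1));
    assert_eq!(heap.pop().map(|h| h.head), Some(2));
    assert_eq!(heap.pop().map(|h| h.head), Some(3));
}
```

Keeping the reversal in `Ord` (rather than in `partial_cmp`) keeps the two trait impls consistent, which is exactly what the Clippy lint is designed to enforce.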