Commit 4acb543

Format documentation for webgraph
1 parent 95b5cb5 commit 4acb543

1 file changed

Lines changed: 124 additions & 10 deletions

File tree

  • webgraph/src/graphs/bvgraph

webgraph/src/graphs/bvgraph/mod.rs

Lines changed: 124 additions & 10 deletions
@@ -5,21 +5,135 @@
55
* SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later
66
*/
77

8-
//! An implementation of the Bv format.
8+
//! A compressed graph representation using the techniques described in "[The
9+
//! WebGraph Framework I: Compression Techniques][BvGraph paper]", by Paolo
10+
//! Boldi and Sebastiano Vigna, in _Proc. of the 13th international conference
11+
//! on World Wide Web_, WWW 2004, pages 595–602, ACM.
912
//!
10-
//! The format has been described by Paolo Boldi and Sebastiano Vigna in “[The
11-
//! WebGraph Framework I: Compression
12-
//! Techniques](http://vigna.di.unimi.it/papers.php#BoVWFI)”, in *Proc. of the
13-
//! 13th international conference on World Wide Web*, WWW 2004, pages 595-602,
14-
//! ACM. [DOI
15-
//! 10.1145/988672.988752](https://dl.acm.org/doi/10.1145/988672.988752).
13+
//! This module provides a flexible way to store and access graphs in compressed
14+
//! form. A compressed graph with basename `BASENAME` is described by:
15+
//!
16+
//! - a _graph file_ (`BASENAME.graph`): a bitstream containing the compressed
17+
//! representation of the graph;
18+
//! - a _properties file_ (`BASENAME.properties`): metadata about the graph and
19+
//! the compression parameters;
20+
//! - an _offsets file_ (`BASENAME.offsets`): a bitstream of γ-coded gaps between
21+
//! the bit offsets of each successor list in the graph file.
22+
//!
23+
//! Additionally, an [Elias–Fano] representation of the offsets
24+
//! (`BASENAME.ef`), necessary for random access, can be built using the
25+
//! `webgraph build ef` command.
1626
//!
1727
//! The implementation is compatible with the [Java
1828
//! implementation](http://webgraph.di.unimi.it/), but it also provides a
19-
//! little-endian version, too.
29+
//! little-endian version.
30+
//!
31+
//! The main access points to the implementation are [`BvGraph::with_basename`]
32+
//! and [`BvGraphSeq::with_basename`], which provide a [`LoadConfig`] that can
33+
//! be further customized (e.g., selecting endianness, memory mapping, etc.).
34+
//!
35+
//! # The Graph File
36+
//!
37+
//! The graph is stored as a bitstream. The format depends on a number of
38+
//! parameters and encodings that can be mixed orthogonally. The parameters are:
39+
//!
40+
//! - the _window size_, a nonnegative integer;
41+
//! - the _maximum reference count_, a positive integer (it is meaningful only
42+
//! when the window is nonzero);
43+
//! - the _minimum interval length_, an integer ≥ 2, or 0, which is interpreted
44+
//! as infinity.
45+
//!
46+
//! ## Successor lists
47+
//!
48+
//! The graph file is a sequence of successor lists, one for each node. The list
49+
//! of node _x_ can be thought of as a sequence of natural numbers (even though,
50+
//! as we will explain later, this sequence is further coded suitably as a
51+
//! sequence of bits):
52+
//!
53+
//! 1. The _outdegree_ of the node; if it is zero, the list ends here.
54+
//!
55+
//! 2. If the window size is not zero, the _reference part_, that is:
56+
//! 1. a nonnegative integer, the _reference_, which never exceeds the window
57+
//! size; if the reference is _r_, the list of successors will be specified
58+
//! as a modified version of the list of successors of _x_ − _r_; if _r_
59+
//! is 0, then the list of successors will be specified explicitly;
60+
//! 2. if _r_ is nonzero:
61+
//! - a natural number β, the _block count_;
62+
//! - a sequence of β natural numbers *B*₁, …, *B*ᵦ, called the
63+
//! _copy-block list_; only the first number can be zero.
64+
//!
65+
//! 3. Then comes the _extra part_, specifying additional entries that the list
66+
//! of successors contains (or all of them, if _r_ is zero), that is:
67+
//! 1. If the minimum interval length is finite:
68+
//! - an integer _i_, the _interval count_;
69+
//! - a sequence of _i_ pairs, whose first component is the left extreme
70+
//! of an interval, and whose second component is the length of the
71+
//! interval (the number of integers contained in it).
72+
//! 2. Finally, the list of _residuals_, which contains all successors not
73+
//! specified by previous methods.
74+
//!
75+
//! The above data should be interpreted as follows:
76+
//!
77+
//! - The reference part, if present (i.e., if both the window size and the
78+
//! reference are positive), specifies that part of the list of successors of
79+
//! node _x_ − _r_ should be copied; the successors of node _x_ − _r_ that
80+
//! should be copied are described in the copy-block list; more precisely, one
81+
//! should copy the first *B*₁ entries of this list, discard the next *B*₂,
82+
//! copy the next *B*₃, etc. (the last remaining elements of the list of
83+
//! successors will be copied if β is even, and discarded if β is odd).
84+
//!
85+
//! - The extra part specifies additional successors (or all of them, if the
86+
//! reference part is absent); the extra part is not present if the number of
87+
//! successors that are to be copied according to the reference part already
88+
//! coincides with the outdegree of _x_; the successors listed in the extra
89+
//! part are given in two forms:
90+
//! - some of them are specified as belonging to (integer) intervals, if the
91+
//! minimum interval length is finite; the interval count indicates how many
92+
//! intervals there are, and the intervals themselves are listed as pairs (left
93+
//! extreme, length);
94+
//! - the residuals are the remaining "scattered" successors.
95+
//!
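The interpretation rules above can be sketched in a few lines of Rust. This is a minimal illustration, not the crate's decoder: the function names `apply_copy_blocks` and `successors` are hypothetical, the block values are taken before the decrement applied at coding time, and all the numbers in `main` are made up.

```rust
/// Copies entries from `reference` according to a copy-block list:
/// copy the first B1 entries, discard the next B2, copy the next B3, and so on.
/// The tail left after the last block is copied when the block count is
/// even and discarded when it is odd.
fn apply_copy_blocks(reference: &[u64], blocks: &[usize]) -> Vec<u64> {
    let mut copied = Vec::new();
    let mut pos = 0;
    for (i, &len) in blocks.iter().enumerate() {
        let end = (pos + len).min(reference.len());
        if i % 2 == 0 {
            // Blocks in even positions (B1, B3, ...) are copied.
            copied.extend_from_slice(&reference[pos..end]);
        }
        pos = end;
    }
    if blocks.len() % 2 == 0 {
        copied.extend_from_slice(&reference[pos..]);
    }
    copied
}

/// Merges the copied successors, the intervals given as (left extreme, length),
/// and the residuals into one sorted successor list.
fn successors(copied: &[u64], intervals: &[(u64, u64)], residuals: &[u64]) -> Vec<u64> {
    let mut out = copied.to_vec();
    for &(left, len) in intervals {
        out.extend(left..left + len);
    }
    out.extend_from_slice(residuals);
    out.sort_unstable();
    out
}

fn main() {
    // Hypothetical successor list of the referenced node x - r.
    let reference = [13, 15, 16, 17, 18, 19, 23, 24, 203];
    // Copy 1, discard 1, copy 3, discard 1; four blocks, so the tail is copied.
    let copied = apply_copy_blocks(&reference, &[1, 1, 3, 1]);
    assert_eq!(copied, vec![13, 16, 17, 18, 23, 24, 203]);

    let list = successors(&copied, &[(315, 3)], &[3041]);
    println!("{list:?}");
}
```

Sorting at the end is a simplification for readability; a real decoder would more likely merge the three already-sorted sources lazily.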
96+
//! ## How Successor Lists Are Coded
97+
//!
98+
//! The list of integers corresponding to each successor list is coded into a
99+
//! sequence of bits. This is done in two phases: we first modify the sequence
100+
//! so as to obtain another sequence of integers (some of them might be negative).
101+
//! Then each integer is coded, using a coding that can be specified as an
102+
//! option; the integers that may be negative are first turned into natural
103+
//! numbers using the standard bijection.
104+
//!
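As an illustration of the "standard bijection" mentioned above, the sketch below assumes it is the usual interleaving that sends 0, −1, 1, −2, 2, … to 0, 1, 2, 3, 4, …. The function names `int_to_nat` and `nat_to_int` are hypothetical and are not taken from the crate.

```rust
/// Maps an integer to a natural number: nonnegative x goes to 2x,
/// negative x goes to 2|x| - 1 (assumed form of the bijection).
/// Values close to the i64 limits would overflow; this is only a sketch.
fn int_to_nat(x: i64) -> u64 {
    if x >= 0 {
        2 * x as u64
    } else {
        2 * x.unsigned_abs() - 1
    }
}

/// Inverse mapping: even n goes back to n / 2, odd n to -(n / 2) - 1.
fn nat_to_int(n: u64) -> i64 {
    if n % 2 == 0 {
        (n / 2) as i64
    } else {
        -((n / 2) as i64) - 1
    }
}

fn main() {
    for x in [-3i64, -2, -1, 0, 1, 2, 3] {
        let n = int_to_nat(x);
        assert_eq!(nat_to_int(n), x);
        println!("{x} -> {n}");
    }
}
```

Small absolute values map to small naturals, which is what lets the subsequent instantaneous code spend few bits on gaps that are close to zero, whether positive or negative.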
105+
//! 1. The outdegree of the node is left unchanged, as well as the reference and
106+
//! the block count.
107+
//! 2. All blocks are decremented by 1, except for the first one.
108+
//! 3. The interval count is left unchanged.
109+
//! 4. All interval lengths are decremented by the minimum interval length.
110+
//! 5. The first left extreme is expressed as its difference from _x_ (it will
111+
//! be negative if the first extreme is less than _x_); the remaining left
112+
//! extremes are expressed as their distance from the previous right extreme
113+
//! plus 2 (e.g., if the interval is \[5..11\] and the previous one was
114+
//! \[1..3\], then the left extreme 5 is expressed as 5 − (3 + 2) = 0).
115+
//! 6. The first residual is expressed as its difference from _x_ (it will be
116+
//! negative if the first residual is less than _x_); the remaining residuals
117+
//! are expressed as decremented differences from the previous residual.
118+
//!
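The transformations in the list above can be sketched as follows. The helper names (`encode_blocks`, `encode_intervals`, `encode_residuals`) are hypothetical, intervals are assumed to be given as (left extreme, length) pairs, and the output is the intermediate integer sequence before any bit-level code is applied.

```rust
/// Step 2: every block except the first is decremented by 1.
/// Only the first block may be zero, so the subtraction cannot underflow
/// on well-formed input.
fn encode_blocks(blocks: &[u64]) -> Vec<u64> {
    blocks
        .iter()
        .enumerate()
        .map(|(i, &b)| if i == 0 { b } else { b - 1 })
        .collect()
}

/// Steps 4 and 5: lengths are decremented by the minimum interval length;
/// the first left extreme is coded relative to the node x, the others
/// relative to the previous right extreme plus 2.
fn encode_intervals(x: i64, intervals: &[(i64, i64)], min_len: i64) -> Vec<(i64, i64)> {
    let mut out = Vec::with_capacity(intervals.len());
    let mut prev_right: Option<i64> = None;
    for &(left, len) in intervals {
        let coded_left = match prev_right {
            None => left - x, // may be negative
            Some(right) => left - (right + 2),
        };
        out.push((coded_left, len - min_len));
        prev_right = Some(left + len - 1);
    }
    out
}

/// Step 6: the first residual is coded relative to x, the others as
/// decremented differences from the previous residual.
fn encode_residuals(x: i64, residuals: &[i64]) -> Vec<i64> {
    let mut out = Vec::with_capacity(residuals.len());
    let mut prev: Option<i64> = None;
    for &r in residuals {
        out.push(match prev {
            None => r - x, // may be negative
            Some(p) => r - p - 1,
        });
        prev = Some(r);
    }
    out
}

fn main() {
    // Intervals [1..3] and [5..11] for a hypothetical node x = 4, minimum length 2.
    println!("{:?}", encode_intervals(4, &[(1, 3), (5, 7)], 2));
    println!("{:?}", encode_residuals(10, &[7, 12, 13, 20]));
    println!("{:?}", encode_blocks(&[1, 1, 3, 1]));
}
```

Only the first left extreme and the first residual can come out negative, so in this sketch those are the only values that would need the integer-to-natural mapping before coding.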
119+
//! # The Offsets File
120+
//!
121+
//! Since the graph is stored as a bitstream, we must have some way to know
122+
//! where each successor list starts. This information is stored in the offsets
123+
//! file, which contains the bit offset of each successor list as a γ-coded gap
124+
//! from the previous offset (in particular, the offset of the first successor
125+
//! list will be zero). As a convenience, the offsets file contains an additional
126+
//! offset pointing just after the last successor list (providing, as a
127+
//! side effect, the actual bit length of the graph file).
128+
//!
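To make the gap representation concrete, the sketch below rebuilds absolute bit offsets from a sequence of gaps that is assumed to have been γ-decoded already. The function name `offsets_from_gaps` and the numbers are hypothetical; the γ decoding itself is out of scope here.

```rust
/// Turns a sequence of gaps into absolute bit offsets by prefix sums.
/// For a graph with n nodes the input holds n + 1 gaps: the first is 0
/// (the first successor list starts at bit 0) and the last one leads to
/// the offset just past the final successor list.
fn offsets_from_gaps(gaps: &[u64]) -> Vec<u64> {
    gaps.iter()
        .scan(0u64, |acc, &gap| {
            *acc += gap;
            Some(*acc)
        })
        .collect()
}

fn main() {
    // Hypothetical gaps for a graph with three nodes.
    let offsets = offsets_from_gaps(&[0, 17, 5, 30]);
    assert_eq!(offsets, vec![0, 17, 22, 52]);

    // The successor list of node i occupies bits offsets[i]..offsets[i + 1].
    for i in 0..offsets.len() - 1 {
        println!("node {i}: bits {}..{}", offsets[i], offsets[i + 1]);
    }
    // The last offset is the bit length of the graph file.
    println!("graph length in bits: {}", offsets[offsets.len() - 1]);
}
```

Keeping all the prefix sums in a plain vector costs one word per node; a compact monotone-sequence structure such as Elias–Fano stores the same offsets in less space while still allowing constant-time lookup.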
129+
//! For random access, the list of offsets is stored as an [Elias–Fano]
130+
//! representation using [ε-serde]. Building such a representation is a
131+
//! prerequisite for random access and can be done using the `webgraph build ef`
132+
//! command.
20133
//!
21-
//! The main access point to the implementation is [`BvGraph::with_basename`],
22-
//! which provides a [`LoadConfig`] that can be further customized.
134+
//! [BvGraph paper]: <http://vigna.di.unimi.it/papers.php#BoVWFI>
135+
//! [Elias–Fano]: <https://docs.rs/sux/latest/sux/dict/elias_fano/struct.EliasFano.html>
136+
//! [ε-serde]: <https://docs.rs/epserde/latest/epserde/>
23137
24138
use std::path::Path;
25139
