|
5 | 5 | * SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later |
6 | 6 | */ |
7 | 7 |
|
8 | | -//! An implementation of the Bv format. |
| 8 | +//! A compressed graph representation using the techniques described in "[The |
| 9 | +//! WebGraph Framework I: Compression Techniques][BvGraph paper]", by Paolo |
| 10 | +//! Boldi and Sebastiano Vigna, in _Proc. of the 13th international conference |
| 11 | +//! on World Wide Web_, WWW 2004, pages 595–602, ACM. |
9 | 12 | //! |
10 | | -//! The format has been described by Paolo Boldi and Sebastiano Vigna in “[The |
11 | | -//! WebGraph Framework I: Compression |
12 | | -//! Techniques](http://vigna.di.unimi.it/papers.php#BoVWFI)”, in *Proc. of the |
13 | | -//! 13th international conference on World Wide Web*, WWW 2004, pages 595-602, |
14 | | -//! ACM. [DOI |
15 | | -//! 10.1145/988672.988752](https://dl.acm.org/doi/10.1145/988672.988752). |
| 13 | +//! This module provides a flexible way to store and access graphs in compressed |
| 14 | +//! form. A compressed graph with basename `BASENAME` is described by: |
| 15 | +//! |
| 16 | +//! - a _graph file_ (`BASENAME.graph`): a bitstream containing the compressed |
| 17 | +//! representation of the graph; |
| 18 | +//! - a _properties file_ (`BASENAME.properties`): metadata about the graph and |
| 19 | +//! the compression parameters; |
| 20 | +//! - an _offsets file_ (`BASENAME.offsets`): a bitstream of γ-coded gaps between |
| 21 | +//! the bit offsets of each successor list in the graph file. |
| 22 | +//! |
| 23 | +//! Additionally, an [Elias–Fano] representation of the offsets |
| 24 | +//! (`BASENAME.ef`), necessary for random access, can be built using the |
| 25 | +//! `webgraph build ef` command. |
16 | 26 | //! |
17 | 27 | //! The implementation is compatible with the [Java |
18 | 28 | //! implementation](http://webgraph.di.unimi.it/), but it also provides a
19 | | -//! little-endian version, too. |
| 29 | +//! little-endian version. |
| 30 | +//! |
| 31 | +//! The main access points to the implementation are [`BvGraph::with_basename`] |
| 32 | +//! and [`BvGraphSeq::with_basename`], which provide a [`LoadConfig`] that can |
| 33 | +//! be further customized (e.g., selecting endianness, memory mapping, etc.). |
| 34 | +//! |
| 35 | +//! # The Graph File |
| 36 | +//! |
| 37 | +//! The graph is stored as a bitstream. The format depends on a number of |
| 38 | +//! parameters and encodings that can be mixed orthogonally. The parameters are: |
| 39 | +//! |
| 40 | +//! - the _window size_, a nonnegative integer; |
| 41 | +//! - the _maximum reference count_, a positive integer (it is meaningful only |
| 42 | +//! when the window size is nonzero);
| 43 | +//! - the _minimum interval length_, an integer ≥ 2, or 0, which is interpreted |
| 44 | +//! as infinity. |
| 45 | +//! |
| 46 | +//! ## Successor lists |
| 47 | +//! |
| 48 | +//! The graph file is a sequence of successor lists, one for each node. The list |
| 49 | +//! of node _x_ can be thought of as a sequence of natural numbers (even though, |
| 50 | +//! as we will explain later, this sequence is further coded suitably as a |
| 51 | +//! sequence of bits): |
| 52 | +//! |
| 53 | +//! 1. The _outdegree_ of the node; if it is zero, the list ends here. |
| 54 | +//! |
| 55 | +//! 2. If the window size is not zero, the _reference part_, that is: |
| 56 | +//! 1. a nonnegative integer, the _reference_, which never exceeds the window |
| 57 | +//! size; if the reference is _r_, the list of successors will be specified |
| 58 | +//! as a modified version of the list of successors of _x_ − _r_; if _r_ |
| 59 | +//! is 0, then the list of successors will be specified explicitly; |
| 60 | +//! 2. if _r_ is nonzero: |
| 61 | +//! - a natural number β, the _block count_; |
| 62 | +//! - a sequence of β natural numbers *B*₁, …, *B*ᵦ, called the |
| 63 | +//! _copy-block list_; only the first number can be zero. |
| 64 | +//! |
| 65 | +//! 3. Then comes the _extra part_, specifying additional entries that the list |
| 66 | +//! of successors contains (or all of them, if _r_ is zero), that is: |
| 67 | +//! 1. If the minimum interval length is finite: |
| 68 | +//! - an integer _i_, the _interval count_; |
| 69 | +//! - a sequence of _i_ pairs, whose first component is the left extreme |
| 70 | +//! of an interval, and whose second component is the length of the |
| 71 | +//! interval (the number of integers contained in it). |
| 72 | +//! 2. Finally, the list of _residuals_, which contains all successors not
| 73 | +//! specified by previous methods. |
| 74 | +//! |
| 75 | +//! The above data should be interpreted as follows: |
| 76 | +//! |
| 77 | +//! - The reference part, if present (i.e., if both the window size and the |
| 78 | +//! reference are positive), specifies that part of the list of successors of |
| 79 | +//! node _x_ − _r_ should be copied; the successors of node _x_ − _r_ that |
| 80 | +//! should be copied are described in the copy-block list; more precisely, one |
| 81 | +//! should copy the first *B*₁ entries of this list, discard the next *B*₂, |
| 82 | +//! copy the next *B*₃, etc. (the last remaining elements of the list of |
| 83 | +//! successors will be copied if β is even, and discarded if β is odd). |
| 84 | +//! |
| 85 | +//! - The extra part specifies additional successors (or all of them, if the |
| 86 | +//! reference part is absent); the extra part is not present if the number of |
| 87 | +//! successors that are to be copied according to the reference part already |
| 88 | +//! coincides with the outdegree of _x_; the successors listed in the extra |
| 89 | +//! part are given in two forms: |
| 90 | +//! - some of them are specified as belonging to (integer) intervals, if the |
| 91 | +//! minimum interval length is finite; the interval count indicates how many |
| 92 | +//! intervals, and the intervals themselves are listed as pairs (left |
| 93 | +//! extreme, length); |
| 94 | +//! - the residuals are the remaining "scattered" successors. |
| 95 | +//! |
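The copy-block rule described above can be illustrated with a small sketch. This is not code from the crate: the function name `apply_copy_blocks` and the use of `usize` node identifiers are assumptions made for illustration, and the block lengths are taken as they appear in the successor list description (before any further coding).

```rust
/// Applies a copy-block list to the successor list of the reference node.
///
/// Hypothetical helper for illustration only. Blocks at odd positions
/// (B1, B3, ...) are copied, blocks at even positions (B2, B4, ...) are
/// discarded. Whatever remains after the last block is copied if the block
/// count is even and discarded if it is odd.
fn apply_copy_blocks(reference_successors: &[usize], blocks: &[usize]) -> Vec<usize> {
    let mut copied = Vec::new();
    let mut pos = 0;
    for (k, &len) in blocks.iter().enumerate() {
        // Clamp so that a malformed block list cannot index out of bounds.
        let end = (pos + len).min(reference_successors.len());
        if k % 2 == 0 {
            // k = 0 corresponds to B1, k = 2 to B3, and so on: copy.
            copied.extend_from_slice(&reference_successors[pos..end]);
        }
        pos = end;
    }
    if blocks.len() % 2 == 0 {
        // Even block count: the remaining tail is copied.
        copied.extend_from_slice(&reference_successors[pos..]);
    }
    copied
}
```

With reference successors `[10, 12, 13, 20, 25, 30]`, the block list `[2, 1, 1]` copies 10 and 12, skips 13, copies 20, and discards the tail because the block count is odd. The block list `[0, 2]` skips the first two entries and copies the rest because the block count is even.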
| 96 | +//! ## How Successor Lists Are Coded |
| 97 | +//! |
| 98 | +//! The list of integers corresponding to each successor list is coded into a |
| 99 | +//! sequence of bits. This is done in two phases: we first modify the sequence |
| 100 | +//! so as to obtain another sequence of integers (some of them might be negative).
| 101 | +//! Then each integer is coded, using a coding that can be specified as an |
| 102 | +//! option; the integers that may be negative are first turned into natural |
| 103 | +//! numbers using the standard bijection. |
| 104 | +//! |
| 105 | +//! 1. The outdegree of the node is left unchanged, as well as the reference and |
| 106 | +//! the block count. |
| 107 | +//! 2. All blocks are decremented by 1, except for the first one. |
| 108 | +//! 3. The interval count is left unchanged. |
| 109 | +//! 4. All interval lengths are decremented by the minimum interval length. |
| 110 | +//! 5. The first left extreme is expressed as its difference from _x_ (it will |
| 111 | +//! be negative if the first extreme is less than _x_); the remaining left |
| 112 | +//! extremes are expressed as their distance from the previous right extreme |
| 113 | +//! plus 2 (e.g., if the interval is \[5..11\] and the previous one was |
| 114 | +//! \[1..3\], then the left extreme 5 is expressed as 5 − (3 + 2) = 0). |
| 115 | +//! 6. The first residual is expressed as its difference from _x_ (it will be |
| 116 | +//! negative if the first residual is less than _x_); the remaining residuals |
| 117 | +//! are expressed as decremented differences from the previous residual. |
| 118 | +//! |
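Steps 4 to 6 can be made concrete with a minimal sketch. None of these functions are taken from the crate: `int_to_nat`, `code_intervals`, and `code_residuals` are hypothetical names, and the "standard bijection" is assumed here to be the map sending 0, −1, 1, −2, 2, … to 0, 1, 2, 3, 4, …. Intervals are given as (left extreme, length) pairs, and both intervals and residuals are assumed to be sorted in increasing order.

```rust
/// Assumed form of the standard bijection from integers to natural numbers:
/// x >= 0 maps to 2x, x < 0 maps to 2|x| - 1.
fn int_to_nat(x: i64) -> u64 {
    if x >= 0 {
        (x as u64) << 1
    } else {
        (((-(x + 1)) as u64) << 1) + 1
    }
}

/// Transforms (left extreme, length) pairs for node `x` (steps 4 and 5).
fn code_intervals(x: u64, intervals: &[(u64, u64)], min_interval_len: u64) -> Vec<(u64, u64)> {
    let mut coded = Vec::with_capacity(intervals.len());
    let mut prev_right: Option<u64> = None;
    for &(left, len) in intervals {
        let coded_left = match prev_right {
            // First left extreme: signed difference from x, mapped to a natural.
            None => int_to_nat(left as i64 - x as i64),
            // Later left extremes: distance from the previous right extreme plus 2.
            Some(right) => left - (right + 2),
        };
        // Lengths are decremented by the minimum interval length.
        coded.push((coded_left, len - min_interval_len));
        prev_right = Some(left + len - 1);
    }
    coded
}

/// Transforms the residuals for node `x` (step 6).
fn code_residuals(x: u64, residuals: &[u64]) -> Vec<u64> {
    let mut coded = Vec::with_capacity(residuals.len());
    let mut prev: Option<u64> = None;
    for &r in residuals {
        coded.push(match prev {
            // First residual: signed difference from x, mapped to a natural.
            None => int_to_nat(r as i64 - x as i64),
            // Later residuals: decremented difference from the previous one.
            Some(p) => r - p - 1,
        });
        prev = Some(r);
    }
    coded
}
```

For node 4 with minimum interval length 3, the intervals \[1..3\] and \[5..11\] become the pairs (5, 0) and (0, 4), matching the example in step 5, where the left extreme 5 is expressed as 5 − (3 + 2) = 0.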
| 119 | +//! # The Offsets File |
| 120 | +//! |
| 121 | +//! Since the graph is stored as a bitstream, we must have some way to know |
| 122 | +//! where each successor list starts. This information is stored in the offsets
| 123 | +//! file, which contains the bit offset of each successor list as a γ-coded gap
| 124 | +//! from the previous offset (in particular, the offset of the first successor
| 125 | +//! list will be zero). As a convenience, the offsets file contains an additional
| 126 | +//! offset pointing just after the last successor list (providing, as a
| 127 | +//! side effect, the actual bit length of the graph file).
| 128 | +//! |
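The relation between gaps and offsets can be sketched as a running sum. The sketch below assumes the γ codes have already been decoded into a slice of gaps; the function names are hypothetical and do not come from the crate.

```rust
/// Rebuilds absolute bit offsets from the decoded gaps of an offsets file.
/// For a graph with n nodes, n + 1 gaps are expected, the first being zero.
fn offsets_from_gaps(gaps: &[u64]) -> Vec<u64> {
    let mut offset = 0u64;
    gaps.iter()
        .map(|&gap| {
            offset += gap;
            offset
        })
        .collect()
}

/// Bit length of each successor list: the difference between consecutive
/// offsets. The extra final offset makes this well defined for the last node.
fn successor_list_bit_lengths(offsets: &[u64]) -> Vec<u64> {
    offsets.windows(2).map(|w| w[1] - w[0]).collect()
}
```

For the gaps `[0, 12, 7, 30]` of a three-node graph, the offsets are `[0, 12, 19, 49]`: the last value, 49, is the bit length of the graph file, and the three successor lists occupy 12, 7, and 30 bits.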
| 129 | +//! For random access, the list of offsets is stored as an [Elias–Fano] |
| 130 | +//! representation using [ε-serde]. Building such a representation is a |
| 131 | +//! prerequisite for random access and can be done using the `webgraph build ef` |
| 132 | +//! command. |
20 | 133 | //! |
21 | | -//! The main access point to the implementation is [`BvGraph::with_basename`], |
22 | | -//! which provides a [`LoadConfig`] that can be further customized. |
| 134 | +//! [BvGraph paper]: <http://vigna.di.unimi.it/papers.php#BoVWFI> |
| 135 | +//! [Elias–Fano]: <https://docs.rs/sux/latest/sux/dict/elias_fano/struct.EliasFano.html> |
| 136 | +//! [ε-serde]: <https://docs.rs/epserde/latest/epserde/> |
23 | 137 |
|
24 | 138 | use std::path::Path; |
25 | 139 |
|
|