A Rust port of MS-GF+: takes mzML/MGF spectra and a FASTA database in, writes a Percolator-ready .pin out.
At 1% FDR it matches or beats Java MS-GF+ PSM counts on three benchmark datasets (-0.26% to +9.3%), with wall time ranging from rough parity to 3.3× faster.
msgf-rust is a from-scratch Rust reimplementation of MS-GF+ (Kim & Pevzner, 2014), the canonical generating-function peptide-identification engine. It reads MS/MS spectra (mzML or MGF), searches them against a FASTA protein database, and emits Percolator-ready PIN rows (or a TSV) with per-PSM features for rescoring. The original Java implementation is preserved on the java-legacy branch.
Three datasets, three results (all at 1% FDR via Percolator 3.7.1):
| Dataset | Java MS-GF+ PSMs | msgf-rust PSMs | Δ | Java wall | msgf-rust wall | Wall Δ |
|---|---|---|---|---|---|---|
| Astral DDA (LFQ_Astral_DDA_15min_50ng) | 35,818 | 36,170 | +352 (+0.98%) | 5:49 | 5:57 | ~2% slower |
| PXD001819 (UPS1 yeast tryp) | 14,798 | 14,760 | -38 (-0.26%) | ~150s | 45.88s | 3.3× faster |
| TMT (a05058 PXD007683) | 10,166 | 11,108 | +942 (+9.3%) | ~2:55 | 2:30 | 14% faster |
What that means: on Astral we find about 1% more PSMs than Java at comparable wall time; on PXD001819 we come within 0.3% of Java's PSM count at 3.3× the speed; on TMT we find ~9% more PSMs in 14% less wall time. The remaining feature-level divergences (lnEValue, MeanRelErrorTop7 normalization) are tracked in DOCS.md §8d as research follow-up; they don't gate cutover.
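The Δ and Wall Δ columns can be re-derived from the raw numbers in the table. The following is a minimal sketch of that arithmetic; the helper names `percent_delta` and `wall_seconds` are illustrative and not part of msgf-rust, and wall times are assumed to be either `m:ss` or seconds with an `s` suffix.

```rust
/// Relative change of `new` versus `baseline`, in percent.
fn percent_delta(baseline: f64, new: f64) -> f64 {
    (new - baseline) / baseline * 100.0
}

/// Parse a wall-clock string such as "5:49" (m:ss) or "45.88s" into seconds.
/// A leading '~' (approximate value) is ignored.
fn wall_seconds(s: &str) -> Option<f64> {
    let s = s.trim().trim_start_matches('~');
    if let Some((min, sec)) = s.split_once(':') {
        Some(min.parse::<f64>().ok()? * 60.0 + sec.parse::<f64>().ok()?)
    } else {
        s.trim_end_matches('s').parse::<f64>().ok()
    }
}

fn main() {
    // (dataset, Java PSMs, msgf-rust PSMs, Java wall, msgf-rust wall), copied from the table.
    let rows = [
        ("Astral DDA", 35_818.0, 36_170.0, "5:49", "5:57"),
        ("PXD001819", 14_798.0, 14_760.0, "~150s", "45.88s"),
        ("TMT", 10_166.0, 11_108.0, "~2:55", "2:30"),
    ];
    for (name, java_psms, rust_psms, java_wall, rust_wall) in rows {
        let psm_delta = percent_delta(java_psms, rust_psms);
        let java_s = wall_seconds(java_wall).expect("valid wall time");
        let rust_s = wall_seconds(rust_wall).expect("valid wall time");
        // Negative wall delta means msgf-rust finished sooner.
        println!(
            "{name}: PSM delta {psm_delta:+.2}%, wall delta {:+.1}%, speedup {:.2}x",
            percent_delta(java_s, rust_s),
            java_s / rust_s
        );
    }
}
```

A speedup of 3.3× corresponds to roughly 70% less wall time, which is why the table reports speedups and percentage reductions separately.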
Option 1 — download a release archive (recommended):
Grab the archive for your platform from the Releases page. Five platform builds are published per release:
msgf-rust-<version>-x86_64-unknown-linux-gnu.tar.gz
msgf-rust-<version>-aarch64-unknown-linux-gnu.tar.gz
msgf-rust-<version>-x86_64-apple-darwin.tar.gz
msgf-rust-<version>-aarch64-apple-darwin.tar.gz
msgf-rust-<version>-x86_64-pc-windows-msvc.zip
Each archive contains the msgf-rust binary, the resources/ tree (39 bundled .param files + unimod.obo), and LICENSE/NOTICE/README.
Option 2 — cargo install:
cargo install --git https://github.com/bigbio/msgf-rust --bin msgf-rust
Option 3 — build from source:
git clone https://github.com/bigbio/msgf-rust
cd msgf-rust
cargo build --release
# Binary: target/release/msgf-rust
Requires Rust 1.85+ (see rust-toolchain.toml).
msgf-rust \
--spectrum BSA.mgf \
--database BSA.fasta \
--output-pin out.pin
This runs a tryptic search at 20 ppm precursor tolerance with the bundled HCD_QExactive_Tryp scoring model, writes Percolator-format PSMs to out.pin, and prints per-phase timings to stderr. Feed out.pin directly into Percolator (Docker or native) to compute q-values.
A row in out.pin is one peptide–spectrum match. With the default charge range (2–3), each row has 36 tab-separated columns: 35 Java-parity Percolator features plus Rust-only EdgeScore (inserted before Peptide). Charge one-hot columns scale with [--charge-min, --charge-max]. Full column reference: DOCS.md §3a.
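As a sanity check on the column count, here is a minimal sketch that computes the expected width for a given charge range and counts the fields in a header line. It assumes one one-hot column per charge state in the inclusive range, which is inferred from the 36-column figure above rather than taken from DOCS.md. The function names are illustrative, and the header in `main` is a synthetic placeholder, not the real PIN column names.

```rust
/// Columns that do not depend on the charge range. Derived from the README's
/// figure of 36 columns at the default range 2..=3, ASSUMING one one-hot
/// column per charge state.
const FIXED_COLUMNS: usize = 36 - 2;

/// Expected number of PIN columns for an inclusive charge range.
fn expected_pin_columns(charge_min: u32, charge_max: u32) -> usize {
    assert!(charge_min <= charge_max, "charge range must be non-empty");
    FIXED_COLUMNS + (charge_max - charge_min + 1) as usize
}

/// Count tab-separated fields in a PIN header line, ignoring the line ending.
fn count_columns(header: &str) -> usize {
    header
        .trim_end_matches(|c| c == '\r' || c == '\n')
        .split('\t')
        .count()
}

fn main() {
    // Placeholder header with 36 synthetic names (c1..c36), for illustration only.
    let header = (1..=36)
        .map(|i| format!("c{i}"))
        .collect::<Vec<_>>()
        .join("\t");
    let expected = expected_pin_columns(2, 3);
    println!(
        "header has {} columns, expected {expected}",
        count_columns(&header)
    );
}
```

The check is applied to the header line because that is where the column layout is declared; DOCS.md §3a remains the authoritative column reference.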
Tryptic DDA + Percolator (default):
msgf-rust --spectrum spectra.mzML --database db.fasta --output-pin out.pin
docker run --rm -v "$(pwd)":/data biocontainers/percolator:v3.7.1_cv1 \
percolator -X /data/weights.txt /data/out.pin
TMT 10-plex search with mods.txt:
msgf-rust \
--spectrum tmt_spectra.mzML \
--database hsapiens.fasta \
--output-pin out.pin \
--mods tmt_10plex_mods.txt \
--protocol TMT \
--fragmentation HCD \
--instrument QExactive
Direct TSV output (skip Percolator):
msgf-rust --spectrum spectra.mzML --database db.fasta \
--output-pin out.pin --output-tsv out.tsv
quantms pipeline integration:
Point quantms's PSM search step at msgf-rust and use the standard quantms post-processing. The .pin row format is the same; existing quantms scripts using legacy numeric flag values (--fragmentation 3 --instrument 3 --protocol 4) keep working without modification (see CLI_MIGRATION.md).
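For wrappers that launch the search step programmatically, the invocation can be assembled with the standard library alone. The sketch below uses only flags documented in this README. The `search_args` helper is hypothetical, the file names are placeholders, and it assumes the `msgf-rust` binary is on `PATH`.

```rust
use std::path::Path;
use std::process::Command;

/// Build the argument list for a basic search (hypothetical helper).
/// Only flags listed in this README are used.
fn search_args(spectrum: &Path, database: &Path, output_pin: &Path, tol_ppm: f64) -> Vec<String> {
    vec![
        "--spectrum".to_string(),
        spectrum.display().to_string(),
        "--database".to_string(),
        database.display().to_string(),
        "--output-pin".to_string(),
        output_pin.display().to_string(),
        "--precursor-tol-ppm".to_string(),
        // One decimal place so 20.0 is passed as "20.0" rather than "20".
        format!("{tol_ppm:.1}"),
    ]
}

fn main() {
    let args = search_args(
        Path::new("spectra.mzML"),
        Path::new("db.fasta"),
        Path::new("out.pin"),
        20.0,
    );
    // Assumes `msgf-rust` is on PATH; report the failure instead of panicking if it is not.
    match Command::new("msgf-rust").args(&args).status() {
        Ok(status) if status.success() => println!("search finished; PIN written to out.pin"),
        Ok(status) => eprintln!("msgf-rust exited with {status}"),
        Err(err) => eprintln!("could not launch msgf-rust: {err}"),
    }
}
```

Keeping argument construction separate from process launch makes the command line easy to log or unit-test without running a search.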
Most-used flags (full reference in DOCS.md §1):
| Flag | Purpose | Default |
|---|---|---|
| `--spectrum <FILE>` | Input mzML or MGF | (required) |
| `--database <FILE>` | Input FASTA | (required) |
| `--output-pin <FILE>` | Percolator PIN output | (required) |
| `--output-tsv <FILE>` | Optional TSV output | (off) |
| `--mods <FILE>` | mods.txt file (Cam-C + Ox-M built-in) | (off) |
| `--precursor-tol-ppm <FLOAT>` | Precursor mass tolerance | 20.0 |
| `--isotope-error-min/-max <INT>` | Isotope error range | -1, 2 |
| `--charge-min/-max <INT>` | Charge range when not in spectrum | 2, 3 |
| `--enzyme-specificity <auto\|...>` | NTT enforcement | fully |
| `--max-missed-cleavages <INT>` | Missed cleavages | 1 |
| `--min/-max-length <INT>` | Peptide length range | 6, 40 |
| `--min-peaks <INT>` | Min peaks per spectrum to score | 10 |
| `--top-n <INT>` | PSMs retained per spectrum | 10 |
| `--fragmentation <auto\|...>` | Frag method (auto-detect from mzML if auto) | auto |
| `--instrument <low-res\|...>` | Instrument class | low-res |
| `--protocol <auto\|...>` | Search protocol | auto |
| `--param-file <FILE>` | Override bundled scoring model | (auto-pick) |
| `--threads <INT>` | Worker threads | (logical CPUs) |
Run msgf-rust --help for the auto-generated help with full descriptions.
For mzML inputs with --fragmentation auto (the default), msgf-rust peeks at the first 64 MS2 spectra, histograms their activation methods and analyzer types, and selects a bundled .param file from the dominant values. The --instrument CLI flag is not required on this path because instrument class is read from the mzML when possible. --protocol from the CLI is still applied when resolving the bundled model. MGF files carry no activation metadata, so they use flag-based resolution (defaulting to HCD_QExactive_Tryp.param). Full resolution table: DOCS.md §4.
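The peek-and-histogram step can be illustrated with a short sketch. This is not the msgf-rust implementation: the `Ms2Meta` struct, the string labels, and the alphabetical tie-break are assumptions made for the example, and the mapping from dominant values to a specific .param file is left to the resolution table in DOCS.md §4.

```rust
use std::collections::HashMap;

/// Number of leading MS2 spectra inspected, per the README.
const PEEK_LIMIT: usize = 64;

/// Hypothetical per-spectrum metadata; values are free-form labels here.
struct Ms2Meta<'a> {
    activation: &'a str,
    analyzer: &'a str,
}

/// Return the most frequent label. Ties are broken alphabetically (an
/// assumption of this sketch) so the result is deterministic.
/// Returns `None` for empty input.
fn dominant<'a>(labels: impl Iterator<Item = &'a str>) -> Option<&'a str> {
    let mut counts: HashMap<&'a str, usize> = HashMap::new();
    for label in labels {
        *counts.entry(label).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .max_by(|a, b| a.1.cmp(&b.1).then_with(|| b.0.cmp(a.0)))
        .map(|(label, _)| label)
}

/// Inspect at most PEEK_LIMIT spectra and return the dominant
/// (activation, analyzer) pair, or `None` if there is no metadata.
fn detect<'a>(spectra: &[Ms2Meta<'a>]) -> Option<(&'a str, &'a str)> {
    let head = &spectra[..spectra.len().min(PEEK_LIMIT)];
    let activation = dominant(head.iter().map(|s| s.activation))?;
    let analyzer = dominant(head.iter().map(|s| s.analyzer))?;
    Some((activation, analyzer))
}

fn main() {
    let spectra = vec![
        Ms2Meta { activation: "HCD", analyzer: "orbitrap" },
        Ms2Meta { activation: "HCD", analyzer: "orbitrap" },
        Ms2Meta { activation: "CID", analyzer: "ion trap" },
    ];
    match detect(&spectra) {
        Some((act, ana)) => println!("dominant activation={act}, analyzer={ana}"),
        // Mirrors the MGF case: no metadata, so flags decide the model.
        None => println!("no MS2 metadata; fall back to flag-based resolution"),
    }
}
```

Capping the scan at a fixed number of spectra keeps detection cost independent of file size, at the price of mis-detecting files whose first spectra are unrepresentative; passing --fragmentation and --instrument explicitly avoids that.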
PIN output columns are bit-exact with Java MS-GF+ on the agreement bucket (same scan + same top-1 peptide) for most features. Three residual divergences remain as deferred research: lnEValue (num_distinct semantics), MeanRelErrorTop7 (error-stat normalization), and the BSA charge-3 SEV gap from deconvolution-implementation differences. None gate cutover; aggregate 1% FDR PSM counts match or beat Java on the three benchmark datasets (within 0.3% on PXD001819, higher on the other two). Full detail: DOCS.md §8d.
If you use msgf-rust in published work, please cite the original MS-GF+ paper:
Kim, S. and Pevzner, P.A. (2014). MS-GF+ makes progress towards a universal database search tool for proteomics. Nature Communications, 5:5277.
And optionally this Rust port:
bigbio (2026). msgf-rust: a Rust port of MS-GF+ for the quantms pipeline. https://github.com/bigbio/msgf-rust
msgf-rust inherits the upstream MS-GF+ UCSD-Noncommercial license. The license restricts redistribution and commercial use; see LICENSE for the full text and NOTICE for attribution. The original Java implementation is preserved on the java-legacy branch (frozen at the bigbio-optimized version) and java-legacy-original branch (synced to upstream MSGFPlus/msgfplus/master).