Pull requests: huggingface/tokenizers
Pull requests list
Document that train_new_from_iterator uses BPE for WordPiece
#2065
opened May 21, 2026 by
adityasingh2400
3 tasks done
Replace dead wikitext s3 link in quicktour with HF dataset mirror
#2064
opened May 21, 2026 by
adityasingh2400
2
chore(deps): bump qs and body-parser in /tokenizers/examples/unstable_wasm/www
dependencies
Pull requests that update a dependency file
javascript
Pull requests that update Javascript code
#2063
opened May 20, 2026 by
dependabot
Bot
chore(deps-dev): bump webpack-dev-server from 5.2.1 to 5.2.4 in /tokenizers/examples/unstable_wasm/www
dependencies
Pull requests that update a dependency file
javascript
Pull requests that update Javascript code
#2062
opened May 20, 2026 by
dependabot
Bot
security: reject nested-quantifier regex in Split/Replace to prevent ReDoS (CWE-1333)
#2060
opened May 17, 2026 by
Allen930311
4 tasks
fix(bpe): widen pair_counts from i32 to i64 to prevent overflow on large corpora
#2059
opened May 17, 2026 by
xodn348
serialize tokenizer vocab and added_tokens compactly
#2056
opened May 13, 2026 by
ArthurZucker
Collaborator
Apply type_ids and sequence_id to overflow encodings in post-processors
#2055
opened May 12, 2026 by
1fanwang
Fix invalid escape sequence in Whitespace docstring
#2054
opened May 10, 2026 by
eyupcanakman
Add scaling_bench: encode_batch vs worker-pool comparison (#1900)
#2048
opened May 1, 2026 by
stargazerZJ
5 of 6 tasks
Cyrillic normalizer and decoder for south slavic languages
#2046
opened Apr 28, 2026 by
procesaur
perf(unigram): pre-size token map and replace per-node HashMap with Vec
#2039
opened Apr 26, 2026 by
taeyun16
feat(ByteLevel): skip per-byte transform for printable-ASCII tokens
#2038
opened Apr 26, 2026 by
KimYannn
2 of 3 tasks
feat(NFC): skip Unicode pass for all-ASCII inputs
#2037
opened Apr 26, 2026 by
KimYannn
2 of 3 tasks
feat: SIMD ASCII fast path for Lowercase normalizer (~30-49x)
#2036
opened Apr 26, 2026 by
KimYannn
6 of 7 tasks
perf(byte_level): port GPT-2 split regex to logos FSM (−22% on GPT-2 encode)
#2031
opened Apr 23, 2026 by
ArthurZucker
Collaborator
Real-world (batch × input_length) tokenizer benchmark + cross-library leaderboard
#2030
opened Apr 23, 2026 by
ArthurZucker
Collaborator
4 tasks
perf: skip alignment tracking in encode_fast normalization
#2022
opened Apr 10, 2026 by
ArthurZucker
Collaborator
feat: Normalizer::normalize_str — skip NormalizedString allocation
#2020
opened Apr 10, 2026 by
ArthurZucker
Collaborator