|
1 | | -## Text Normalization & Inverse Text Normalization |
2 | | - |
3 | | -### 0. Brief Introduction |
4 | | - |
5 | | -[WeTextProcessing: Production First & Production Ready Text Processing Toolkit](https://mp.weixin.qq.com/s/q_11lck78qcjylHCi6wVsQ) |
6 | | - |
7 | | -#### 0.1 Text Normalization |
8 | | - |
9 | | -<div align=center><img src="https://user-images.githubusercontent.com/13466943/193439861-acfba531-13d1-4fca-b2f2-6e47fc10f195.png" alt="Cover" width="50%"/></div> |
10 | | - |
11 | | -#### 0.2 Inverse Text Normalization |
12 | | - |
13 | | -<div align=center><img src="https://user-images.githubusercontent.com/13466943/193439870-634c44a3-bd62-4311-bcf2-1427758d5f62.png" alt="Cover" width="50%"/></div> |
14 | | - |
15 | | -### 1. How To Use |
16 | | - |
17 | | -#### 1.1 Quick Start: |
18 | | -```bash |
19 | | -# install |
20 | | -pip install WeTextProcessing |
21 | | -``` |
22 | | - |
23 | | -```py |
24 | | -# tn usage |
25 | | ->>> from tn.chinese.normalizer import Normalizer |
26 | | ->>> normalizer = Normalizer() |
27 | | ->>> normalizer.normalize("2.5平方电线") |
28 | | -# itn usage |
29 | | ->>> from itn.chinese.inverse_normalizer import InverseNormalizer |
30 | | ->>> invnormalizer = InverseNormalizer() |
31 | | ->>> invnormalizer.normalize("二点五平方电线") |
32 | | -``` |
33 | | - |
34 | | -#### 1.2 Advanced Usage: |
35 | | - |
36 | | -DIY your own rules && Deploy WeTextProcessing with cpp runtime !! |
37 | | - |
38 | | -For users who want to modify and adapt the tn/itn rules to fix bad cases, please try: |
39 | | - |
40 | | -``` bash |
41 | | -git clone https://github.com/wenet-e2e/WeTextProcessing.git |
42 | | -cd WeTextProcessing |
43 | | -# `overwrite_cache` will rebuild all rules according to |
44 | | -# your modifications on tn/chinese/rules/xx.py (itn/chinese/rules/xx.py). |
45 | | -# After rebuild, you can find new far files at `$PWD/tn` and `$PWD/itn`. |
46 | | -python normalize.py --text "2.5平方电线" --overwrite_cache |
47 | | -python inverse_normalize.py --text "二点五平方电线" --overwrite_cache |
48 | | -``` |
49 | | - |
50 | | -Once you successfully rebuild your rules, you can deploy them either with your installed pypi packages: |
51 | | - |
52 | | -```py |
53 | | -# tn usage |
54 | | ->>> from tn.chinese.normalizer import Normalizer |
55 | | ->>> normalizer = Normalizer(cache_dir="PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn") |
56 | | ->>> normalizer.normalize("2.5平方电线") |
57 | | -# itn usage |
58 | | ->>> from itn.chinese.inverse_normalizer import InverseNormalizer |
59 | | ->>> invnormalizer = InverseNormalizer(cache_dir="PATH_TO_GIT_CLONED_WETEXTPROCESSING/itn") |
60 | | ->>> invnormalizer.normalize("二点五平方电线") |
61 | | -``` |
62 | | - |
63 | | -Or with cpp runtime: |
64 | | - |
65 | | -```bash |
66 | | -cmake -B build -S runtime -DCMAKE_BUILD_TYPE=Release |
67 | | -cmake --build build |
68 | | -# tn usage |
69 | | -cache_dir=PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn |
70 | | -./build/bin/processor_main --tagger $cache_dir/zh_tn_tagger.fst --verbalizer $cache_dir/zh_tn_verbalizer.fst --text "2.5平方电线" |
71 | | -# itn usage |
72 | | -cache_dir=PATH_TO_GIT_CLONED_WETEXTPROCESSING/itn |
73 | | -./build/bin/processor_main --tagger $cache_dir/zh_itn_tagger.fst --verbalizer $cache_dir/zh_itn_verbalizer.fst --text "二点五平方电线" |
74 | | -``` |
75 | | - |
76 | | -### 2. TN Pipeline |
77 | | - |
78 | | -Please refer to [TN.README](tn/README.md) |
79 | | - |
80 | | -### 3. ITN Pipeline |
81 | | - |
82 | | -Please refer to [ITN.README](itn/README.md) |
83 | | - |
84 | | -## Discussion & Communication |
85 | | - |
86 | | -For Chinese users, you can also scan the QR code on the left to follow the official WeNet account. |
87 | | -We created a WeChat group for better discussion and quicker responses. |
88 | | -Please scan the personal QR code on the right; its owner will invite you to the chat group. |
89 | | - |
90 | | -| <img src="https://github.com/robin1001/qr/blob/master/wenet.jpeg" width="250px"> | <img src="https://user-images.githubusercontent.com/13466943/203046432-f637180e-4c87-40cc-be05-ce48c65dd1ef.jpg" width="250px"> | |
91 | | -| ---- | ---- | |
92 | | - |
93 | | -Or you can discuss directly on [GitHub Issues](https://github.com/wenet-e2e/WeTextProcessing/issues). |
94 | | - |
95 | | -## Acknowledgements |
96 | | - |
97 | | -1. Thanks to the authors of foundational libraries like [OpenFst](https://www.openfst.org/twiki/bin/view/FST/WebHome) & [Pynini](https://www.openfst.org/twiki/bin/view/GRM/Pynini). |
98 | | -2. Thanks to the [NeMo](https://github.com/NVIDIA/NeMo) team & the NeMo open-source community. |
99 | | -3. Thanks to [Zhenxiang Ma](https://github.com/mzxcpp), [Jiayu Du](https://github.com/dophist), and the [SpeechColab](https://github.com/SpeechColab) organization. |
100 | | -4. Referred to [Pynini](https://github.com/kylebgorman/pynini) for reading the FAR and printing the shortest path of a lattice in the C++ runtime. |
101 | | -5. Referred to [TN of NeMo](https://github.com/NVIDIA/NeMo/tree/main/nemo_text_processing/text_normalization/zh) for the data to build the tagger graph. |
102 | | -6. Referred to [ITN of chinese_text_normalization](https://github.com/speechio/chinese_text_normalization/tree/master/thrax/src/cn) for the data to build the tagger graph.
| 1 | +## Text Normalization & Inverse Text Normalization |
| 2 | + |
| 3 | +### 0. Brief Introduction |
| 4 | + |
| 5 | +[WeTextProcessing: Production First & Production Ready Text Processing Toolkit](https://mp.weixin.qq.com/s/q_11lck78qcjylHCi6wVsQ) |
| 6 | + |
| 7 | +#### 0.1 Text Normalization |
| 8 | + |
| 9 | +<div align=center><img src="https://user-images.githubusercontent.com/13466943/193439861-acfba531-13d1-4fca-b2f2-6e47fc10f195.png" alt="Cover" width="50%"/></div> |
| 10 | + |
| 11 | +#### 0.2 Inverse Text Normalization |
| 12 | + |
| 13 | +<div align=center><img src="https://user-images.githubusercontent.com/13466943/193439870-634c44a3-bd62-4311-bcf2-1427758d5f62.png" alt="Cover" width="50%"/></div> |
| 14 | + |
| 15 | +### 1. How To Use |
| 16 | + |
| 17 | +#### 1.1 Quick Start: |
| 18 | +```bash |
| 19 | +# install |
| 20 | +pip install WeTextProcessing |
| 21 | +``` |
| 22 | + |
| 23 | +```py |
| 24 | +# tn usage |
| 25 | +>>> from tn.chinese.normalizer import Normalizer |
| 26 | +>>> normalizer = Normalizer() |
| 27 | +>>> normalizer.normalize("2.5平方电线") |
| 28 | +# itn usage |
| 29 | +>>> from itn.chinese.inverse_normalizer import InverseNormalizer |
| 30 | +>>> invnormalizer = InverseNormalizer() |
| 31 | +>>> invnormalizer.normalize("二点五平方电线") |
| 32 | +``` |
| 33 | + |
| 34 | +#### 1.2 Advanced Usage: |
| 35 | + |
| 36 | +DIY your own rules && Deploy WeTextProcessing with cpp runtime !! |
| 37 | + |
| 38 | +For users who want to modify and adapt the tn/itn rules to fix bad cases, please try: |
| 39 | + |
| 40 | +``` bash |
| 41 | +git clone https://github.com/wenet-e2e/WeTextProcessing.git |
| 42 | +cd WeTextProcessing |
| 43 | +# `overwrite_cache` will rebuild all rules according to |
| 44 | +# your modifications on tn/chinese/rules/xx.py (itn/chinese/rules/xx.py). |
| 45 | +# After rebuild, you can find new far files at `$PWD/tn` and `$PWD/itn`. |
| 46 | +python normalize.py --text "2.5平方电线" --overwrite_cache |
| 47 | +python inverse_normalize.py --text "二点五平方电线" --overwrite_cache |
| 48 | +``` |
| 49 | + |
| 50 | +Once you successfully rebuild your rules, you can deploy them either with your installed pypi packages: |
| 51 | + |
| 52 | +```py |
| 53 | +# tn usage |
| 54 | +>>> from tn.chinese.normalizer import Normalizer |
| 55 | +>>> normalizer = Normalizer(cache_dir="PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn") |
| 56 | +>>> normalizer.normalize("2.5平方电线") |
| 57 | +# itn usage |
| 58 | +>>> from itn.chinese.inverse_normalizer import InverseNormalizer |
| 59 | +>>> invnormalizer = InverseNormalizer(cache_dir="PATH_TO_GIT_CLONED_WETEXTPROCESSING/itn") |
| 60 | +>>> invnormalizer.normalize("二点五平方电线") |
| 61 | +``` |
| 62 | + |
| 63 | +Or with cpp runtime: |
| 64 | + |
| 65 | +```bash |
| 66 | +cmake -B build -S runtime -DCMAKE_BUILD_TYPE=Release |
| 67 | +cmake --build build |
| 68 | +# tn usage |
| 69 | +cache_dir=PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn |
| 70 | +./build/processor_main --tagger $cache_dir/zh_tn_tagger.fst --verbalizer $cache_dir/zh_tn_verbalizer.fst --text "2.5平方电线" |
| 71 | +# itn usage |
| 72 | +cache_dir=PATH_TO_GIT_CLONED_WETEXTPROCESSING/itn |
| 73 | +./build/processor_main --tagger $cache_dir/zh_itn_tagger.fst --verbalizer $cache_dir/zh_itn_verbalizer.fst --text "二点五平方电线" |
| 74 | +``` |
| 75 | + |
| 76 | +### 2. TN Pipeline |
| 77 | + |
| 78 | +Please refer to [TN.README](tn/README.md) |
| 79 | + |
| 80 | +### 3. ITN Pipeline |
| 81 | + |
| 82 | +Please refer to [ITN.README](itn/README.md) |
| 83 | + |
| 84 | +## Discussion & Communication |
| 85 | + |
| 86 | +For Chinese users, you can also scan the QR code on the left to follow the official WeNet account. |
| 87 | +We created a WeChat group for better discussion and quicker responses. |
| 88 | +Please scan the personal QR code on the right; its owner will invite you to the chat group. |
| 89 | + |
| 90 | +| <img src="https://github.com/robin1001/qr/blob/master/wenet.jpeg" width="250px"> | <img src="https://user-images.githubusercontent.com/13466943/203046432-f637180e-4c87-40cc-be05-ce48c65dd1ef.jpg" width="250px"> | |
| 91 | +| ---- | ---- | |
| 92 | + |
| 93 | +Or you can discuss directly on [GitHub Issues](https://github.com/wenet-e2e/WeTextProcessing/issues). |
| 94 | + |
| 95 | +## Acknowledgements |
| 96 | + |
| 97 | +1. Thanks to the authors of foundational libraries like [OpenFst](https://www.openfst.org/twiki/bin/view/FST/WebHome) & [Pynini](https://www.openfst.org/twiki/bin/view/GRM/Pynini). |
| 98 | +2. Thanks to the [NeMo](https://github.com/NVIDIA/NeMo) team & the NeMo open-source community. |
| 99 | +3. Thanks to [Zhenxiang Ma](https://github.com/mzxcpp), [Jiayu Du](https://github.com/dophist), and the [SpeechColab](https://github.com/SpeechColab) organization. |
| 100 | +4. Referred to [Pynini](https://github.com/kylebgorman/pynini) for reading the FAR and printing the shortest path of a lattice in the C++ runtime. |
| 101 | +5. Referred to [TN of NeMo](https://github.com/NVIDIA/NeMo/tree/main/nemo_text_processing/text_normalization/zh) for the data to build the tagger graph. |
| 102 | +6. Referred to [ITN of chinese_text_normalization](https://github.com/speechio/chinese_text_normalization/tree/master/thrax/src/cn) for the data to build the tagger graph.
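
Review note on the C++ runtime commands in the diff above: the tagger and verbalizer paths are easy to get out of sync when they are typed by hand. The sketch below is one possible way to assemble the `processor_main` argument list from a single cache directory. The helper name `build_processor_cmd` and its parameters are hypothetical and not part of WeTextProcessing; it also assumes that the rebuilt tagger and verbalizer files sit in the same directory, as the `--overwrite_cache` comment in the README suggests. Only the flags `--tagger`, `--verbalizer`, and `--text` and the file names shown in the diff are taken from the source.

```python
from pathlib import PurePosixPath
import shlex


def build_processor_cmd(cache_dir, kind, text, binary="./build/processor_main"):
    """Build the argv list for the C++ runtime (hypothetical helper).

    kind is "tn" or "itn" and selects the zh_<kind>_*.fst file names
    that appear in the README commands. Assumption: both FST files live
    directly under cache_dir.
    """
    if kind not in ("tn", "itn"):
        raise ValueError("kind must be 'tn' or 'itn'")
    cache = PurePosixPath(cache_dir)
    return [
        binary,
        "--tagger", str(cache / f"zh_{kind}_tagger.fst"),
        "--verbalizer", str(cache / f"zh_{kind}_verbalizer.fst"),
        "--text", text,
    ]


if __name__ == "__main__":
    cmd = build_processor_cmd(
        "PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn", "tn", "2.5平方电线"
    )
    # Print a shell-quoted command line; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Returning a list rather than a formatted string keeps the text argument intact when it is handed to `subprocess.run`, since no shell quoting is involved.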