Commit 7e764f2

[runtime] fix namespace (#73)
1 parent 2f7ee10 commit 7e764f2

18 files changed

Lines changed: 185 additions & 248 deletions

README.md

Lines changed: 102 additions & 102 deletions
@@ -1,102 +1,102 @@
-## Text Normalization & Inverse Text Normalization
-
-### 0. Brief Introduction
-
-[WeTextProcessing: Production First & Production Ready Text Processing Toolkit](https://mp.weixin.qq.com/s/q_11lck78qcjylHCi6wVsQ)
-
-#### 0.1 Text Normalization
-
-<div align=center><img src="https://user-images.githubusercontent.com/13466943/193439861-acfba531-13d1-4fca-b2f2-6e47fc10f195.png" alt="Cover" width="50%"/></div>
-
-#### 0.2 Inverse Text Normalization
-
-<div align=center><img src="https://user-images.githubusercontent.com/13466943/193439870-634c44a3-bd62-4311-bcf2-1427758d5f62.png" alt="Cover" width="50%"/></div>
-
-### 1. How To Use
-
-#### 1.1 Quick Start:
-```bash
-# install
-pip install WeTextProcessing
-```
-
-```py
-# tn usage
->>> from tn.chinese.normalizer import Normalizer
->>> normalizer = Normalizer()
->>> normalizer.normalize("2.5平方电线")
-# itn usage
->>> from itn.chinese.inverse_normalizer import InverseNormalizer
->>> invnormalizer = InverseNormalizer()
->>> invnormalizer.normalize("二点五平方电线")
-```
-
-#### 1.2 Advanced Usage:
-
-DIY your own rules && Deploy WeTextProcessing with cpp runtime !!
-
-For users who want modifications and adapt tn/itn rules to fix badcase, please try:
-
-``` bash
-git clone https://github.com/wenet-e2e/WeTextProcessing.git
-cd WeTextProcessing
-# `overwrite_cache` will rebuild all rules according to
-# your modifications on tn/chinese/rules/xx.py (itn/chinese/rules/xx.py).
-# After rebuild, you can find new far files at `$PWD/tn` and `$PWD/itn`.
-python normalize.py --text "2.5平方电线" --overwrite_cache
-python inverse_normalize.py --text "二点五平方电线" --overwrite_cache
-```
-
-Once you successfully rebuild your rules, you can deploy them either with your installed pypi packages:
-
-```py
-# tn usage
->>> from tn.chinese.normalizer import Normalizer
->>> normalizer = Normalizer(cache_dir="PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn")
->>> normalizer.normalize("2.5平方电线")
-# itn usage
->>> from itn.chinese.inverse_normalizer import InverseNormalizer
->>> invnormalizer = InverseNormalizer(cache_dir="PATH_TO_GIT_CLONED_WETEXTPROCESSING/itn")
->>> invnormalizer.normalize("二点五平方电线")
-```
-
-Or with cpp runtime:
-
-```bash
-cmake -B build -S runtime -DCMAKE_BUILD_TYPE=Release
-cmake --build build
-# tn usage
-cache_dir=PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn
-./build/bin/processor_main --tagger $fst_dir/zh_tn_tagger.fst --verbalizer $cache_dir/zh_tn_verbalizer.fst --text "2.5平方电线"
-# itn usage
-cache_dir=PATH_TO_GIT_CLONED_WETEXTPROCESSING/itn
-./build/bin/processor_main --tagger $fst_dir/zh_itn_tagger.fst --verbalizer $cache_dir/zh_itn_verbalizer.fst --text "二点五平方电线"
-```
-
-### 2. TN Pipeline
-
-Please refer to [TN.README](tn/README.md)
-
-### 3. ITN Pipeline
-
-Please refer to [ITN.README](itn/README.md)
-
-## Discussion & Communication
-
-For Chinese users, you can aslo scan the QR code on the left to follow our offical account of WeNet.
-We created a WeChat group for better discussion and quicker response.
-Please scan the personal QR code on the right, and the guy is responsible for inviting you to the chat group.
-
-| <img src="https://github.com/robin1001/qr/blob/master/wenet.jpeg" width="250px"> | <img src="https://user-images.githubusercontent.com/13466943/203046432-f637180e-4c87-40cc-be05-ce48c65dd1ef.jpg" width="250px"> |
-| ---- | ---- |
-
-Or you can directly discuss on [Github Issues](https://github.com/wenet-e2e/WeTextProcessing/issues).
-
-## Acknowledge
-
-1. Thank the authors of foundational libraries like [OpenFst](https://www.openfst.org/twiki/bin/view/FST/WebHome) & [Pynini](https://www.openfst.org/twiki/bin/view/GRM/Pynini).
-3. Thank [NeMo](https://github.com/NVIDIA/NeMo) team & NeMo open-source community.
-2. Thank [Zhenxiang Ma](https://github.com/mzxcpp), [Jiayu Du](https://github.com/dophist), and [SpeechColab](https://github.com/SpeechColab) organization.
-3. Referred [Pynini](https://github.com/kylebgorman/pynini) for reading the FAR, and printing the shortest path of a lattice in the C++ runtime.
-4. Referred [TN of NeMo](https://github.com/NVIDIA/NeMo/tree/main/nemo_text_processing/text_normalization/zh) for the data to build the tagger graph.
-5. Referred [ITN of chinese_text_normalization](https://github.com/speechio/chinese_text_normalization/tree/master/thrax/src/cn) for the data to build the tagger graph.
+## Text Normalization & Inverse Text Normalization
+
+### 0. Brief Introduction
+
+[WeTextProcessing: Production First & Production Ready Text Processing Toolkit](https://mp.weixin.qq.com/s/q_11lck78qcjylHCi6wVsQ)
+
+#### 0.1 Text Normalization
+
+<div align=center><img src="https://user-images.githubusercontent.com/13466943/193439861-acfba531-13d1-4fca-b2f2-6e47fc10f195.png" alt="Cover" width="50%"/></div>
+
+#### 0.2 Inverse Text Normalization
+
+<div align=center><img src="https://user-images.githubusercontent.com/13466943/193439870-634c44a3-bd62-4311-bcf2-1427758d5f62.png" alt="Cover" width="50%"/></div>
+
+### 1. How To Use
+
+#### 1.1 Quick Start:
+```bash
+# install
+pip install WeTextProcessing
+```
+
+```py
+# tn usage
+>>> from tn.chinese.normalizer import Normalizer
+>>> normalizer = Normalizer()
+>>> normalizer.normalize("2.5平方电线")
+# itn usage
+>>> from itn.chinese.inverse_normalizer import InverseNormalizer
+>>> invnormalizer = InverseNormalizer()
+>>> invnormalizer.normalize("二点五平方电线")
+```
+
+#### 1.2 Advanced Usage:
+
+DIY your own rules && Deploy WeTextProcessing with cpp runtime !!
+
+For users who want to modify and adapt the tn/itn rules to fix bad cases, please try:
+
+``` bash
+git clone https://github.com/wenet-e2e/WeTextProcessing.git
+cd WeTextProcessing
+# `overwrite_cache` will rebuild all rules according to
+# your modifications on tn/chinese/rules/xx.py (itn/chinese/rules/xx.py).
+# After rebuild, you can find new far files at `$PWD/tn` and `$PWD/itn`.
+python normalize.py --text "2.5平方电线" --overwrite_cache
+python inverse_normalize.py --text "二点五平方电线" --overwrite_cache
+```
+
+Once you have successfully rebuilt your rules, you can deploy them either with your installed pypi packages:
+
+```py
+# tn usage
+>>> from tn.chinese.normalizer import Normalizer
+>>> normalizer = Normalizer(cache_dir="PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn")
+>>> normalizer.normalize("2.5平方电线")
+# itn usage
+>>> from itn.chinese.inverse_normalizer import InverseNormalizer
+>>> invnormalizer = InverseNormalizer(cache_dir="PATH_TO_GIT_CLONED_WETEXTPROCESSING/itn")
+>>> invnormalizer.normalize("二点五平方电线")
+```
+
+Or with cpp runtime:
+
+```bash
+cmake -B build -S runtime -DCMAKE_BUILD_TYPE=Release
+cmake --build build
+# tn usage
+cache_dir=PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn
+./build/processor_main --tagger $cache_dir/zh_tn_tagger.fst --verbalizer $cache_dir/zh_tn_verbalizer.fst --text "2.5平方电线"
+# itn usage
+cache_dir=PATH_TO_GIT_CLONED_WETEXTPROCESSING/itn
+./build/processor_main --tagger $cache_dir/zh_itn_tagger.fst --verbalizer $cache_dir/zh_itn_verbalizer.fst --text "二点五平方电线"
+```
+
+### 2. TN Pipeline
+
+Please refer to [TN.README](tn/README.md)
+
+### 3. ITN Pipeline
+
+Please refer to [ITN.README](itn/README.md)
+
+## Discussion & Communication
+
+For Chinese users, you can also scan the QR code on the left to follow our official account of WeNet.
+We created a WeChat group for better discussion and quicker response.
+Please scan the personal QR code on the right; that person will invite you to the chat group.
+
+| <img src="https://github.com/robin1001/qr/blob/master/wenet.jpeg" width="250px"> | <img src="https://user-images.githubusercontent.com/13466943/203046432-f637180e-4c87-40cc-be05-ce48c65dd1ef.jpg" width="250px"> |
+| ---- | ---- |
+
+Or you can directly discuss on [GitHub Issues](https://github.com/wenet-e2e/WeTextProcessing/issues).
+
+## Acknowledge
+
+1. Thank the authors of foundational libraries like [OpenFst](https://www.openfst.org/twiki/bin/view/FST/WebHome) & [Pynini](https://www.openfst.org/twiki/bin/view/GRM/Pynini).
+2. Thank [NeMo](https://github.com/NVIDIA/NeMo) team & NeMo open-source community.
+3. Thank [Zhenxiang Ma](https://github.com/mzxcpp), [Jiayu Du](https://github.com/dophist), and [SpeechColab](https://github.com/SpeechColab) organization.
+4. Referred [Pynini](https://github.com/kylebgorman/pynini) for reading the FAR, and printing the shortest path of a lattice in the C++ runtime.
+5. Referred [TN of NeMo](https://github.com/NVIDIA/NeMo/tree/main/nemo_text_processing/text_normalization/zh) for the data to build the tagger graph.
+6. Referred [ITN of chinese_text_normalization](https://github.com/speechio/chinese_text_normalization/tree/master/thrax/src/cn) for the data to build the tagger graph.

runtime/CMakeLists.txt

Lines changed: 14 additions & 3 deletions
@@ -28,9 +28,20 @@ endif()
 
 include(openfst)
 include_directories(${PROJECT_SOURCE_DIR})
-add_subdirectory(utils)
-add_subdirectory(processor)
-add_subdirectory(bin)
+
+add_library(processor STATIC
+  processor/processor.cc
+  processor/token_parser.cc
+  utils/utf8_string.cc
+)
+if(MSVC)
+  target_link_libraries(processor PUBLIC fst)
+else()
+  target_link_libraries(processor PUBLIC dl fst)
+endif()
+
+add_executable(processor_main bin/processor_main.cc)
+target_link_libraries(processor_main PUBLIC processor)
 
 if(BUILD_TESTING)
   include(gtest)

runtime/README.md

Lines changed: 22 additions & 22 deletions
@@ -1,22 +1,22 @@
-## WeTextProcessing Runtime
-
-1. How to build
-
-``` bash
-$ cmake -B build -DCMAKE_BUILD_TYPE=Release
-$ cmake --build build
-```
-
-2. How to use
-
-``` bash
-# tn usage
-$ wget https://github.com/wenet-e2e/WeTextProcessing/releases/download/WeTextProcessing/zh_tn_tagger.fst
-$ wget https://github.com/wenet-e2e/WeTextProcessing/releases/download/WeTextProcessing/zh_tn_verbalizer.fst
-$ ./build/bin/processor_main --tagger zh_tn_tagger.fst --verbalizer zh_tn_verbalizer.fst --text "2.5平方电线"
-
-# itn usage
-$ wget https://github.com/wenet-e2e/WeTextProcessing/releases/download/WeTextProcessing/zh_itn_tagger.fst
-$ wget https://github.com/wenet-e2e/WeTextProcessing/releases/download/WeTextProcessing/zh_itn_verbalizer.fst
-$ ./build/bin/processor_main --tagger zh_itn_tagger.fst --verbalizer zh_itn_verbalizer.fst --text "二点五平方电线"
-```
+## WeTextProcessing Runtime
+
+1. How to build
+
+``` bash
+$ cmake -B build -DCMAKE_BUILD_TYPE=Release
+$ cmake --build build
+```
+
+2. How to use
+
+``` bash
+# tn usage
+$ wget https://github.com/wenet-e2e/WeTextProcessing/releases/download/WeTextProcessing/zh_tn_tagger.fst
+$ wget https://github.com/wenet-e2e/WeTextProcessing/releases/download/WeTextProcessing/zh_tn_verbalizer.fst
+$ ./build/processor_main --tagger zh_tn_tagger.fst --verbalizer zh_tn_verbalizer.fst --text "2.5平方电线"
+
+# itn usage
+$ wget https://github.com/wenet-e2e/WeTextProcessing/releases/download/WeTextProcessing/zh_itn_tagger.fst
+$ wget https://github.com/wenet-e2e/WeTextProcessing/releases/download/WeTextProcessing/zh_itn_verbalizer.fst
+$ ./build/processor_main --tagger zh_itn_tagger.fst --verbalizer zh_itn_verbalizer.fst --text "二点五平方电线"
+```

runtime/bin/CMakeLists.txt

Lines changed: 0 additions & 2 deletions
This file was deleted.

runtime/bin/processor_main.cc

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ int main(int argc, char* argv[]) {
   if (FLAGS_tagger.empty() || FLAGS_verbalizer.empty()) {
     LOG(FATAL) << "Please provide the tagger and verbalizer fst files.";
   }
-  wenet::Processor processor(FLAGS_tagger, FLAGS_verbalizer);
+  wetext::Processor processor(FLAGS_tagger, FLAGS_verbalizer);
 
   if (!FLAGS_text.empty()) {
     std::string tagged_text = processor.tag(FLAGS_text);
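
The rename from `wenet` to `wetext` means any caller that still spells `wenet::Processor` stops compiling, as the one-line change in `processor_main.cc` shows. The following is a minimal sketch of one way a downstream project might bridge the rename, assuming a namespace alias is acceptable there. The alias, the simplified `Processor` stand-in, and `legacy_tagger_path` are hypothetical and are not part of this commit.

```cpp
#include <string>
#include <type_traits>
#include <utility>

namespace wetext {
// Simplified stand-in for the runtime's Processor (assumption: the real
// class loads FSTs; this one only stores the paths it was given).
class Processor {
 public:
  Processor(std::string tagger_path, std::string verbalizer_path)
      : tagger_path_(std::move(tagger_path)),
        verbalizer_path_(std::move(verbalizer_path)) {}

  const std::string& tagger_path() const { return tagger_path_; }
  const std::string& verbalizer_path() const { return verbalizer_path_; }

 private:
  std::string tagger_path_;
  std::string verbalizer_path_;
};
}  // namespace wetext

// Hypothetical compatibility alias: lets old code keep using `wenet::`.
namespace wenet = wetext;

// Both spellings name the same class, so no conversion is involved.
static_assert(std::is_same<wenet::Processor, wetext::Processor>::value,
              "the alias must refer to the same class");

// A legacy caller that is still written against the old namespace.
inline std::string legacy_tagger_path() {
  wenet::Processor processor("zh_tn_tagger.fst", "zh_tn_verbalizer.fst");
  return processor.tagger_path();
}
```

An alias like this only helps callers; the commit itself takes the simpler route of updating every use site to `wetext`.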

runtime/processor/CMakeLists.txt

Lines changed: 0 additions & 5 deletions
This file was deleted.

runtime/processor/processor.cc

Lines changed: 12 additions & 5 deletions
@@ -14,17 +14,15 @@
 
 #include "processor/processor.h"
 
-#include "utils/utils.h"
-
 using fst::StringTokenType;
 
-namespace wenet {
-
+namespace wetext {
 Processor::Processor(const std::string& tagger_path,
                      const std::string& verbalizer_path) {
   tagger_.reset(StdVectorFst::Read(tagger_path));
   verbalizer_.reset(StdVectorFst::Read(verbalizer_path));
   compiler_ = std::make_shared<StringCompiler<StdArc>>(StringTokenType::BYTE);
+  printer_ = std::make_shared<StringPrinter<StdArc>>(StringTokenType::BYTE);
 
   if (tagger_path.find("_tn_") != tagger_path.npos) {
     parse_type_ = ParseType::kTN;
@@ -36,6 +34,15 @@ Processor::Processor(const std::string& tagger_path,
   }
 }
 
+std::string Processor::shortest_path(const StdVectorFst& lattice) {
+  StdVectorFst shortest_path;
+  fst::ShortestPath(lattice, &shortest_path, 1, true);
+
+  std::string output;
+  printer_->operator()(shortest_path, &output);
+  return output;
+}
+
 std::string Processor::compose(const std::string& input,
                                const StdVectorFst* fst) {
   StdVectorFst input_fst;
@@ -63,4 +70,4 @@ std::string Processor::normalize(const std::string& input) {
   return verbalize(tag(input));
 }
 
-}  // namespace wenet
+}  // namespace wetext
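
The constructor above picks the parse mode from the tagger file name by searching for `_tn_`. The following is a rough sketch of that check in isolation, using only the standard library. The `_itn_` branch and the exception are assumptions, since the diff does not show what the constructor does when `_tn_` is absent, and `infer_parse_type` is a hypothetical helper name.

```cpp
#include <stdexcept>
#include <string>

enum class ParseType { kTN, kITN };

// Infers the mode from the tagger path, mirroring the "_tn_" substring
// check in Processor's constructor. The "_itn_" branch and the thrown
// exception are assumptions made for this sketch.
inline ParseType infer_parse_type(const std::string& tagger_path) {
  if (tagger_path.find("_tn_") != std::string::npos) {
    return ParseType::kTN;
  }
  if (tagger_path.find("_itn_") != std::string::npos) {
    return ParseType::kITN;
  }
  throw std::invalid_argument(
      "tagger path must contain \"_tn_\" or \"_itn_\": " + tagger_path);
}
```

Because the match is a plain substring search over the whole path, a directory name containing `_tn_` would also select TN mode; renaming the FST files therefore changes how the runtime parses tokens.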

runtime/processor/processor.h

Lines changed: 5 additions & 3 deletions
@@ -22,9 +22,9 @@
 using fst::StdArc;
 using fst::StdVectorFst;
 using fst::StringCompiler;
+using fst::StringPrinter;
 
-namespace wenet {
-
+namespace wetext {
 class Processor {
  public:
   Processor(const std::string& tagger_path, const std::string& verbalizer_path);
@@ -33,14 +33,16 @@ class Processor {
   std::string normalize(const std::string& input);
 
  private:
+  std::string shortest_path(const StdVectorFst& lattice);
   std::string compose(const std::string& input, const StdVectorFst* fst);
 
   ParseType parse_type_;
   std::shared_ptr<StdVectorFst> tagger_ = nullptr;
   std::shared_ptr<StdVectorFst> verbalizer_ = nullptr;
   std::shared_ptr<StringCompiler<StdArc>> compiler_ = nullptr;
+  std::shared_ptr<StringPrinter<StdArc>> printer_ = nullptr;
 };
 
-}  // namespace wenet
+}  // namespace wetext
 
 #endif  // PROCESSOR_PROCESSOR_H_

runtime/processor/token_parser.cc

Lines changed: 2 additions & 3 deletions
@@ -17,8 +17,7 @@
 #include "utils/log.h"
 #include "utils/utf8_string.h"
 
-namespace wenet {
-
+namespace wetext {
 const std::string EOS = "<EOS>";
 const std::set<std::string> UTF8_WHITESPACE = {" ", "\t", "\n", "\r",
                                                "\x0b\x0c"};
@@ -151,4 +150,4 @@ std::string TokenParser::reorder(const std::string& input) {
   return trim(output);
 }
 
-}  // namespace wenet
+}  // namespace wetext

runtime/processor/token_parser.h

Lines changed: 2 additions & 2 deletions
@@ -20,7 +20,7 @@
 #include <unordered_map>
 #include <vector>
 
-namespace wenet {
+namespace wetext {
 
 extern const std::string EOS;
 extern const std::set<std::string> UTF8_WHITESPACE;
@@ -86,6 +86,6 @@ class TokenParser {
   std::unordered_map<std::string, std::vector<std::string>> orders;
 };
 
-}  // namespace wenet
+}  // namespace wetext
 
 #endif  // PROCESSOR_TOKEN_PARSER_H_
