wenet-e2e
diff --git a/README.md b/README.md
Lines changed: 102 additions & 102 deletions
diff --git a/runtime/CMakeLists.txt b/runtime/CMakeLists.txt
Lines changed: 14 additions & 3 deletions
diff --git a/runtime/README.md b/runtime/README.md
Lines changed: 22 additions & 22 deletions
diff --git a/runtime/bin/CMakeLists.txt b/runtime/bin/CMakeLists.txt
Lines changed: 0 additions & 2 deletions
diff --git a/runtime/bin/processor_main.cc b/runtime/bin/processor_main.cc
Lines changed: 1 addition & 1 deletion
diff --git a/runtime/processor/CMakeLists.txt b/runtime/processor/CMakeLists.txt
Lines changed: 0 additions & 5 deletions
diff --git a/runtime/processor/processor.cc b/runtime/processor/processor.cc
Lines changed: 12 additions & 5 deletions
diff --git a/runtime/processor/processor.h b/runtime/processor/processor.h
Lines changed: 5 additions & 3 deletions
diff --git a/runtime/processor/token_parser.cc b/runtime/processor/token_parser.cc
Lines changed: 2 additions & 3 deletions
diff --git a/runtime/processor/token_parser.h b/runtime/processor/token_parser.h
Lines changed: 2 additions & 2 deletions
@@ -1,102 +1,102 @@
-## Text Normalization & Inverse Text Normalization
-
-### 0. Brief Introduction
-
-[WeTextProcessing: Production First & Production Ready Text Processing Toolkit](https://mp.weixin.qq.com/s/q_11lck78qcjylHCi6wVsQ)
-
-#### 0.1 Text Normalization
-
-<div align=center><img src="https://user-images.githubusercontent.com/13466943/193439861-acfba531-13d1-4fca-b2f2-6e47fc10f195.png" alt="Cover" width="50%"/></div>
-
-#### 0.2 Inverse Text Normalization
-
-<div align=center><img src="https://user-images.githubusercontent.com/13466943/193439870-634c44a3-bd62-4311-bcf2-1427758d5f62.png" alt="Cover" width="50%"/></div>
-
-### 1. How To Use
-
-#### 1.1 Quick Start:
-```bash
-# install
-pip install WeTextProcessing
-```
-
-```py
-# tn usage
->>> from tn.chinese.normalizer import Normalizer
->>> normalizer = Normalizer()
->>> normalizer.normalize("2.5平方电线")
-# itn usage
->>> from itn.chinese.inverse_normalizer import InverseNormalizer
->>> invnormalizer = InverseNormalizer()
->>> invnormalizer.normalize("二点五平方电线")
-```
-
-#### 1.2 Advanced Usage:
-
-DIY your own rules && Deploy WeTextProcessing with cpp runtime !!
-
-For users who want modifications and adapt tn/itn rules to fix badcase, please try:
-
-``` bash
-git clone https://github.com/wenet-e2e/WeTextProcessing.git
-cd WeTextProcessing
-# `overwrite_cache` will rebuild all rules according to
-#   your modifications on tn/chinese/rules/xx.py (itn/chinese/rules/xx.py).
-#   After rebuild, you can find new far files at `$PWD/tn` and `$PWD/itn`.
-python normalize.py --text "2.5平方电线" --overwrite_cache
-python inverse_normalize.py --text "二点五平方电线" --overwrite_cache
-```
-
-Once you successfully rebuild your rules, you can deploy them either with your installed pypi packages:
-
-```py
-# tn usage
->>> from tn.chinese.normalizer import Normalizer
->>> normalizer = Normalizer(cache_dir="PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn")
->>> normalizer.normalize("2.5平方电线")
-# itn usage
->>> from itn.chinese.inverse_normalizer import InverseNormalizer
->>> invnormalizer = InverseNormalizer(cache_dir="PATH_TO_GIT_CLONED_WETEXTPROCESSING/itn")
->>> invnormalizer.normalize("二点五平方电线")
-```
-
-Or with cpp runtime:
-
-```bash
-cmake -B build -S runtime -DCMAKE_BUILD_TYPE=Release
-cmake --build build
-# tn usage
-cache_dir=PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn
-./build/bin/processor_main --tagger $fst_dir/zh_tn_tagger.fst --verbalizer $cache_dir/zh_tn_verbalizer.fst --text "2.5平方电线"
-# itn usage
-cache_dir=PATH_TO_GIT_CLONED_WETEXTPROCESSING/itn
-./build/bin/processor_main --tagger $fst_dir/zh_itn_tagger.fst --verbalizer $cache_dir/zh_itn_verbalizer.fst --text "二点五平方电线"
-```
-
-### 2. TN Pipeline
-
-Please refer to [TN.README](tn/README.md)
-
-### 3. ITN Pipeline
-
-Please refer to [ITN.README](itn/README.md)
-
-## Discussion & Communication
-
-For Chinese users, you can aslo scan the QR code on the left to follow our offical account of WeNet.
-We created a WeChat group for better discussion and quicker response.
-Please scan the personal QR code on the right, and the guy is responsible for inviting you to the chat group.
-
-| <img src="https://github.com/robin1001/qr/blob/master/wenet.jpeg" width="250px"> | <img src="https://user-images.githubusercontent.com/13466943/203046432-f637180e-4c87-40cc-be05-ce48c65dd1ef.jpg" width="250px"> |
-| ---- | ---- |
-
-Or you can directly discuss on [Github Issues](https://github.com/wenet-e2e/WeTextProcessing/issues).
-
-## Acknowledge
-
-1. Thank the authors of foundational libraries like [OpenFst](https://www.openfst.org/twiki/bin/view/FST/WebHome) & [Pynini](https://www.openfst.org/twiki/bin/view/GRM/Pynini).
-3. Thank [NeMo](https://github.com/NVIDIA/NeMo) team & NeMo open-source community.
-2. Thank [Zhenxiang Ma](https://github.com/mzxcpp), [Jiayu Du](https://github.com/dophist), and [SpeechColab](https://github.com/SpeechColab) organization.
-3. Referred [Pynini](https://github.com/kylebgorman/pynini) for reading the FAR, and printing the shortest path of a lattice in the C++ runtime.
-4. Referred [TN of NeMo](https://github.com/NVIDIA/NeMo/tree/main/nemo_text_processing/text_normalization/zh) for the data to build the tagger graph.
-5. Referred [ITN of chinese_text_normalization](https://github.com/speechio/chinese_text_normalization/tree/master/thrax/src/cn) for the data to build the tagger graph.
+## Text Normalization & Inverse Text Normalization
+
+### 0. Brief Introduction
+
+[WeTextProcessing: Production First & Production Ready Text Processing Toolkit](https://mp.weixin.qq.com/s/q_11lck78qcjylHCi6wVsQ)
+
+#### 0.1 Text Normalization
+
+<div align=center><img src="https://user-images.githubusercontent.com/13466943/193439861-acfba531-13d1-4fca-b2f2-6e47fc10f195.png" alt="Cover" width="50%"/></div>
+
+#### 0.2 Inverse Text Normalization
+
+<div align=center><img src="https://user-images.githubusercontent.com/13466943/193439870-634c44a3-bd62-4311-bcf2-1427758d5f62.png" alt="Cover" width="50%"/></div>
+
+### 1. How To Use
+
+#### 1.1 Quick Start:
+```bash
+# install
+pip install WeTextProcessing
+```
+
+```py
+# tn usage
+>>> from tn.chinese.normalizer import Normalizer
+>>> normalizer = Normalizer()
+>>> normalizer.normalize("2.5平方电线")
+# itn usage
+>>> from itn.chinese.inverse_normalizer import InverseNormalizer
+>>> invnormalizer = InverseNormalizer()
+>>> invnormalizer.normalize("二点五平方电线")
+```
+
+#### 1.2 Advanced Usage:
+
+DIY your own rules && Deploy WeTextProcessing with cpp runtime !!
+
+To modify the tn/itn rules and fix bad cases, try:
+
+``` bash
+git clone https://github.com/wenet-e2e/WeTextProcessing.git
+cd WeTextProcessing
+# `overwrite_cache` will rebuild all rules according to
+#   your modifications on tn/chinese/rules/xx.py (itn/chinese/rules/xx.py).
+#   After rebuild, you can find new far files at `$PWD/tn` and `$PWD/itn`.
+python normalize.py --text "2.5平方电线" --overwrite_cache
+python inverse_normalize.py --text "二点五平方电线" --overwrite_cache
+```
+
+Once you successfully rebuild your rules, you can deploy them either with your installed pypi packages:
+
+```py
+# tn usage
+>>> from tn.chinese.normalizer import Normalizer
+>>> normalizer = Normalizer(cache_dir="PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn")
+>>> normalizer.normalize("2.5平方电线")
+# itn usage
+>>> from itn.chinese.inverse_normalizer import InverseNormalizer
+>>> invnormalizer = InverseNormalizer(cache_dir="PATH_TO_GIT_CLONED_WETEXTPROCESSING/itn")
+>>> invnormalizer.normalize("二点五平方电线")
+```
+
+Or with cpp runtime:
+
+```bash
+cmake -B build -S runtime -DCMAKE_BUILD_TYPE=Release
+cmake --build build
+# tn usage
+cache_dir=PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn
+./build/processor_main --tagger $cache_dir/zh_tn_tagger.fst --verbalizer $cache_dir/zh_tn_verbalizer.fst --text "2.5平方电线"
+# itn usage
+cache_dir=PATH_TO_GIT_CLONED_WETEXTPROCESSING/itn
+./build/processor_main --tagger $cache_dir/zh_itn_tagger.fst --verbalizer $cache_dir/zh_itn_verbalizer.fst --text "二点五平方电线"
+```
+
+### 2. TN Pipeline
+
+Please refer to [TN.README](tn/README.md)
+
+### 3. ITN Pipeline
+
+Please refer to [ITN.README](itn/README.md)
+
+## Discussion & Communication
+
+Chinese users can also scan the QR code on the left to follow the official WeNet account.
+We created a WeChat group for better discussion and quicker response.
+Please scan the personal QR code on the right; its owner will invite you to the chat group.
+
+| <img src="https://github.com/robin1001/qr/blob/master/wenet.jpeg" width="250px"> | <img src="https://user-images.githubusercontent.com/13466943/203046432-f637180e-4c87-40cc-be05-ce48c65dd1ef.jpg" width="250px"> |
+| ---- | ---- |
+
+Or you can directly discuss on [Github Issues](https://github.com/wenet-e2e/WeTextProcessing/issues).
+
+## Acknowledge
+
+1. Thank the authors of foundational libraries like [OpenFst](https://www.openfst.org/twiki/bin/view/FST/WebHome) & [Pynini](https://www.openfst.org/twiki/bin/view/GRM/Pynini).
+2. Thank [NeMo](https://github.com/NVIDIA/NeMo) team & NeMo open-source community.
+3. Thank [Zhenxiang Ma](https://github.com/mzxcpp), [Jiayu Du](https://github.com/dophist), and [SpeechColab](https://github.com/SpeechColab) organization.
+4. Referred to [Pynini](https://github.com/kylebgorman/pynini) for reading the FAR, and printing the shortest path of a lattice in the C++ runtime.
+5. Referred to [TN of NeMo](https://github.com/NVIDIA/NeMo/tree/main/nemo_text_processing/text_normalization/zh) for the data to build the tagger graph.
+6. Referred to [ITN of chinese_text_normalization](https://github.com/speechio/chinese_text_normalization/tree/master/thrax/src/cn) for the data to build the tagger graph.
@@ -28,9 +28,20 @@ endif()
 
 include(openfst)
 include_directories(${PROJECT_SOURCE_DIR})
-add_subdirectory(utils)
-add_subdirectory(processor)
-add_subdirectory(bin)
+
+add_library(processor STATIC
+  processor/processor.cc
+  processor/token_parser.cc
+  utils/utf8_string.cc
+)
+if(MSVC)
+  target_link_libraries(processor PUBLIC fst)
+else()
+  target_link_libraries(processor PUBLIC dl fst)
+endif()
+
+add_executable(processor_main bin/processor_main.cc)
+target_link_libraries(processor_main PUBLIC processor)
 
 if(BUILD_TESTING)
   include(gtest)
 
@@ -1,22 +1,22 @@
-## WeTextProcessing Runtime
-
-1. How to build
-
-``` bash
-$ cmake -B build -DCMAKE_BUILD_TYPE=Release
-$ cmake --build build
-```
-
-2. How to use
-
-``` bash
-# tn usage
-$ wget https://github.com/wenet-e2e/WeTextProcessing/releases/download/WeTextProcessing/zh_tn_tagger.fst
-$ wget https://github.com/wenet-e2e/WeTextProcessing/releases/download/WeTextProcessing/zh_tn_verbalizer.fst
-$ ./build/bin/processor_main --tagger zh_tn_tagger.fst --verbalizer zh_tn_verbalizer.fst --text "2.5平方电线"
-
-# itn usage
-$ wget https://github.com/wenet-e2e/WeTextProcessing/releases/download/WeTextProcessing/zh_itn_tagger.fst
-$ wget https://github.com/wenet-e2e/WeTextProcessing/releases/download/WeTextProcessing/zh_itn_verbalizer.fst
-$ ./build/bin/processor_main --tagger zh_itn_tagger.fst --verbalizer zh_itn_verbalizer.fst --text "二点五平方电线"
-```
+## WeTextProcessing Runtime
+
+1. How to build
+
+``` bash
+$ cmake -B build -DCMAKE_BUILD_TYPE=Release
+$ cmake --build build
+```
+
+2. How to use
+
+``` bash
+# tn usage
+$ wget https://github.com/wenet-e2e/WeTextProcessing/releases/download/WeTextProcessing/zh_tn_tagger.fst
+$ wget https://github.com/wenet-e2e/WeTextProcessing/releases/download/WeTextProcessing/zh_tn_verbalizer.fst
+$ ./build/processor_main --tagger zh_tn_tagger.fst --verbalizer zh_tn_verbalizer.fst --text "2.5平方电线"
+
+# itn usage
+$ wget https://github.com/wenet-e2e/WeTextProcessing/releases/download/WeTextProcessing/zh_itn_tagger.fst
+$ wget https://github.com/wenet-e2e/WeTextProcessing/releases/download/WeTextProcessing/zh_itn_verbalizer.fst
+$ ./build/processor_main --tagger zh_itn_tagger.fst --verbalizer zh_itn_verbalizer.fst --text "二点五平方电线"
+```
@@ -31,7 +31,7 @@ int main(int argc, char* argv[]) {
   if (FLAGS_tagger.empty() || FLAGS_verbalizer.empty()) {
     LOG(FATAL) << "Please provide the tagger and verbalizer fst files.";
   }
-  wenet::Processor processor(FLAGS_tagger, FLAGS_verbalizer);
+  wetext::Processor processor(FLAGS_tagger, FLAGS_verbalizer);
 
   if (!FLAGS_text.empty()) {
     std::string tagged_text = processor.tag(FLAGS_text);
 
@@ -14,17 +14,15 @@
 
 #include "processor/processor.h"
 
-#include "utils/utils.h"
-
 using fst::StringTokenType;
 
-namespace wenet {
-
+namespace wetext {
 Processor::Processor(const std::string& tagger_path,
                      const std::string& verbalizer_path) {
   tagger_.reset(StdVectorFst::Read(tagger_path));
   verbalizer_.reset(StdVectorFst::Read(verbalizer_path));
   compiler_ = std::make_shared<StringCompiler<StdArc>>(StringTokenType::BYTE);
+  printer_ = std::make_shared<StringPrinter<StdArc>>(StringTokenType::BYTE);
 
   if (tagger_path.find("_tn_") != tagger_path.npos) {
     parse_type_ = ParseType::kTN;
@@ -36,6 +34,15 @@ Processor::Processor(const std::string& tagger_path,
   }
 }
 
+std::string Processor::shortest_path(const StdVectorFst& lattice) {
+  StdVectorFst shortest_path;
+  fst::ShortestPath(lattice, &shortest_path, 1, true);
+
+  std::string output;
+  printer_->operator()(shortest_path, &output);
+  return output;
+}
+
 std::string Processor::compose(const std::string& input,
                                const StdVectorFst* fst) {
   StdVectorFst input_fst;
@@ -63,4 +70,4 @@ std::string Processor::normalize(const std::string& input) {
   return verbalize(tag(input));
 }
 
-}  // namespace wenet
+}  // namespace wetext
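Note on the `Processor` constructor in the hunk above: it picks the parse mode from the tagger file name by looking for `_tn_`. The standalone sketch below mirrors that check without OpenFst so it can be tried in isolation. Only the `_tn_` test is taken from the diff; the `_itn_` branch, the `kITN` and `kUnknown` enumerators, and the helper name `InferParseType` are assumptions made for illustration, since the hunk ends before the rest of the branch is shown.

```cpp
#include <string>

// Hypothetical stand-in for the runtime's ParseType enum; only kTN
// appears in the diff, the other enumerators are assumed here.
enum class ParseType { kTN, kITN, kUnknown };

// Infers the parse mode from the tagger file name, in the same way the
// constructor checks for the "_tn_" marker.
ParseType InferParseType(const std::string& tagger_path) {
  if (tagger_path.find("_tn_") != std::string::npos) {
    return ParseType::kTN;
  }
  // Assumed counterpart for inverse text normalization models.
  if (tagger_path.find("_itn_") != std::string::npos) {
    return ParseType::kITN;
  }
  return ParseType::kUnknown;
}
```

With this convention, a file such as `zh_tn_tagger.fst` maps to `kTN` and `zh_itn_tagger.fst` maps to `kITN`, so renaming a model file would silently change (or break) the selected mode.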
@@ -22,9 +22,9 @@
 using fst::StdArc;
 using fst::StdVectorFst;
 using fst::StringCompiler;
+using fst::StringPrinter;
 
-namespace wenet {
-
+namespace wetext {
 class Processor {
  public:
   Processor(const std::string& tagger_path, const std::string& verbalizer_path);
@@ -33,14 +33,16 @@ class Processor {
   std::string normalize(const std::string& input);
 
  private:
+  std::string shortest_path(const StdVectorFst& lattice);
   std::string compose(const std::string& input, const StdVectorFst* fst);
 
   ParseType parse_type_;
   std::shared_ptr<StdVectorFst> tagger_ = nullptr;
   std::shared_ptr<StdVectorFst> verbalizer_ = nullptr;
   std::shared_ptr<StringCompiler<StdArc>> compiler_ = nullptr;
+  std::shared_ptr<StringPrinter<StdArc>> printer_ = nullptr;
 };
 
-}  // namespace wenet
+}  // namespace wetext
 
 #endif  // PROCESSOR_PROCESSOR_H_
@@ -17,8 +17,7 @@
 #include "utils/log.h"
 #include "utils/utf8_string.h"
 
-namespace wenet {
-
+namespace wetext {
 const std::string EOS = "<EOS>";
 const std::set<std::string> UTF8_WHITESPACE = {" ", "\t", "\n", "\r",
                                                "\x0b\x0c"};
@@ -151,4 +150,4 @@ std::string TokenParser::reorder(const std::string& input) {
   return trim(output);
 }
 
-}  // namespace wenet
+}  // namespace wetext
@@ -20,7 +20,7 @@
 #include <unordered_map>
 #include <vector>
 
-namespace wenet {
+namespace wetext {
 
 extern const std::string EOS;
 extern const std::set<std::string> UTF8_WHITESPACE;
@@ -86,6 +86,6 @@ class TokenParser {
   std::unordered_map<std::string, std::vector<std::string>> orders;
 };
 
-}  // namespace wenet
+}  // namespace wetext
 
 #endif  // PROCESSOR_TOKEN_PARSER_H_
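Migration note: the rename from `wenet` to `wetext` in this diff means downstream code that still writes `wenet::Processor` will no longer compile. One possible stopgap, sketched below, is a namespace alias in the downstream project. The `Processor` struct here is a simplified stand-in, not the runtime's real class, and the alias is only usable in translation units where no other `wenet` namespace is declared, because an alias cannot reuse the name of an existing namespace.

```cpp
#include <string>

namespace wetext {
// Simplified stand-in for the runtime's Processor; the real class loads
// tagger and verbalizer FSTs. This one just echoes its input.
struct Processor {
  std::string normalize(const std::string& input) { return input; }
};
}  // namespace wetext

// Compatibility alias for callers that still spell the old namespace.
// Assumption: no other `wenet` namespace is visible in this file.
namespace wenet = wetext;
```

Callers can then keep `wenet::Processor` temporarily while they move to the `wetext` spelling, after which the alias can be deleted.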