Models trained using our TensorFlow training repository can be deployed in this API. Several object detection models can be loaded and used at the same time.
This repo can be deployed using either **docker** or **docker swarm**.
Please use **docker swarm** only if you need to:
* Provide redundancy in terms of API containers: if a container goes down, incoming requests will be redirected to another running instance.
* Coordinate between the containers: Swarm orchestrates the API replicas and chooses one of them to handle each incoming request.
* Scale up the inference service in order to get faster predictions, especially if there is heavy traffic on the service.
If none of the aforementioned requirements are needed, simply use **docker**.
## Contents
```sh
Tensorflow CPU Inference API For Windows and Linux/
```

To run the API, go to the API's directory and run the following:
#### Using Linux based docker:
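
The exact run command is not included in this excerpt. As a rough sketch, assuming the image was built with the name `tensorflow_inference_api_cpu` (the image name used in the swarm configuration below) and that your models live in a local `models` directory, the container could be started as follows:

```sh
# Sketch only: the image name and container port 4343 are taken from the
# cpu-inference.yaml shown later in this README; adjust paths as needed.
docker run -itv $(pwd)/models:/models -p <docker_host_port>:4343 tensorflow_inference_api_cpu
```
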
The `<docker_host_port>` can be any unique port of your choice.
The API file will be run automatically, and the service will listen for HTTP requests on the chosen port.
If you are deploying your API without **docker swarm**, please skip the next section and proceed directly to the *API Endpoints* section.
### Docker swarm
Docker swarm can scale the API up into multiple replicas and can be used on one or multiple hosts (Linux only). In both cases, a docker swarm setup is required on all hosts.
#### Docker swarm setup
1- Initialize Swarm:
```sh
docker swarm init
```
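
If you plan to run the swarm across multiple hosts, each additional host must join the swarm as a worker. The repository covers this in its multi-host instructions; the standard Docker commands are sketched below for reference, with the token and manager IP left as placeholders.

```sh
# On the manager node, print the join command for worker nodes:
docker swarm join-token worker

# On each additional host, run the printed command, for example:
docker swarm join --token <worker_token> <manager_ip>:2377
```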
2- On the manager host, open the cpu-inference.yaml file and specify the number of replicas needed. If you are using multiple hosts (see the *With multiple hosts* section), the number of replicas will be divided across all hosts.
```yaml
version: "3"

services:
  api:
    ports:
      - "4343:4343"
    image: tensorflow_inference_api_cpu
    volumes:
      - "/mnt/models:/models"
    deploy:
      replicas: 1
      update_config:
        parallelism: 2
        delay: 10s
      restart_policy:
        condition: on-failure
```
**Notes about cpu-inference.yaml:**
* The part of the volumes field to the left of ":" should be an absolute path; it can be changed by the user and represents the models directory on your operating system.
* The part of the volumes field to the right of ":" (":/models") should never be changed.
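
The deployment command itself is not shown in this excerpt. As a hedged sketch, assuming the stack is named `tensorflow-cpu` (consistent with the `tensorflow-cpu_api` service name used in the commands below), the stack would typically be deployed with:

```sh
# Sketch only: "tensorflow-cpu" is an assumed stack name inferred from
# the "tensorflow-cpu_api" service referenced in the commands below.
docker stack deploy -c cpu-inference.yaml tensorflow-cpu
```
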
1- To scale the service up to 4 replicas, for example, use this command:
```sh
docker service scale tensorflow-cpu_api=4
```
2- To check the available workers:
```sh
docker node ls
```
3- To check on which node the container is running:
```sh
docker service ps tensorflow-cpu_api
```
4- To check the number of replicas:
```sh
docker service ls
```
## API Endpoints
To see all available endpoints, open your favorite browser and navigate to:
## Benchmarking
### Docker
<table>
<thead align="center">
<tr>
</tbody>
</table>
### Docker swarm
Here are two graphs showing the prediction time for different numbers of simultaneous requests.


Both graphs show the same behavior regardless of the number of simultaneous requests. When we increase the number of workers (hosts), we are able to speed up inference by at least 2 times. For example, as the last column shows, we were able to process 40 requests in:
- 17.5 seconds with 20 replicas on 1 machine
- 8.8 seconds with 20 replicas on each of the 2 machines
Moreover, if one of the machines goes down, the others are always ready to receive requests.
Finally, since we are predicting on CPU, scaling to more replicas does not necessarily mean faster predictions: 4 containers were faster than 20.
0 commit comments