This repository was archived by the owner on Oct 3, 2023. It is now read-only.

Commit 017d7c5

Add specification for Summary Span. (#121)
* Add specification for Summary Span. * Incorporate Summary Span spec feedback.
1 parent 342a2a0 commit 017d7c5

4 files changed

Lines changed: 223 additions & 0 deletions


Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
### Census Server Stats

The encoding is based on [BinaryEncoding](BinaryEncoding.md).

#### Fields added in Census Server Stats version 0

##### LB-Latency-Ns

* optional
* `field_id` = 0
* `len` = 8

Request processing latency observed on the Load Balancer. The unit is nanoseconds.
It is encoded as a little-endian int64.

##### Server-Latency-Ns

* optional
* `field_id` = 1
* `len` = 8

Request processing latency observed on the Server. The unit is nanoseconds.
It is encoded as a little-endian int64.

##### Trace-Options

* optional
* `field_id` = 2
* `len` = 1

It is a single byte representing an 8-bit unsigned integer. The least significant
bit indicates whether the request was sampled on the server (1 = sampled,
0 = not sampled).

The behavior of the other bits is currently undefined.

#### Valid example (Hex)
{`0,`
`0, 38, C7, 0, 0, 0, 0, 0, 0,`
`1, 50, C3, 0, 0, 0, 0, 0, 0,`
`2, 1`}

This corresponds to:
* `lb_latency_ns` = 51000 (0x000000000000C738)
* `server_latency_ns` = 50000 (0x000000000000C350)
* `trace_options` = 1 (0x01)
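
The field layout above can be read back with a small decoder. The following is a minimal sketch, not a reference implementation: the class name `ServerStatsDecoder` and its members are hypothetical, and stopping at an unknown or truncated field is an assumption made here for illustration. It treats the leading byte as the version id and reads each field as a one-byte `field_id` followed by its fixed-length value.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration; not part of any published API.
final class ServerStatsDecoder {
  static final int VERSION_ID = 0;
  static final int LB_LATENCY_ID = 0;      // len = 8, int64 little endian
  static final int SERVER_LATENCY_ID = 1;  // len = 8, int64 little endian
  static final int TRACE_OPTIONS_ID = 2;   // len = 1, 8-bit unsigned integer

  // All fields are optional; null means the field was absent.
  Long lbLatencyNs;
  Long serverLatencyNs;
  Integer traceOptions;

  static ServerStatsDecoder decode(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    if (!buf.hasRemaining() || buf.get() != VERSION_ID) {
      throw new IllegalArgumentException("missing or unsupported version id");
    }
    ServerStatsDecoder stats = new ServerStatsDecoder();
    while (buf.hasRemaining()) {
      int fieldId = buf.get() & 0xFF;
      if (fieldId == LB_LATENCY_ID && buf.remaining() >= 8) {
        stats.lbLatencyNs = buf.getLong();
      } else if (fieldId == SERVER_LATENCY_ID && buf.remaining() >= 8) {
        stats.serverLatencyNs = buf.getLong();
      } else if (fieldId == TRACE_OPTIONS_ID && buf.remaining() >= 1) {
        stats.traceOptions = buf.get() & 0xFF;
      } else {
        // Assumption: an unknown or truncated field ends parsing,
        // because its length cannot be determined.
        break;
      }
    }
    return stats;
  }

  // The least significant bit of Trace-Options is the sampling bit.
  boolean isSampled() {
    return traceOptions != null && (traceOptions & 0x01) != 0;
  }
}
```

Decoding the hex example above with this sketch yields the three values listed under "This corresponds to".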

utils/SummarySpan.md

Lines changed: 176 additions & 0 deletions
@@ -0,0 +1,176 @@
# Objective
A request to a typical Service passes through a cluster of Load Balancers
before it is processed by the server. Sometimes there are more stages in
between. Each stage introduces some processing delay, and there is also a network
delay between stages. These delays contribute to the overall response time of
the request.

From the perspective of a typical Service customer, the response time of a
query is the time elapsed between the moment the provider’s first Load Balancer (LB)
receives the request and the moment this LB sends the response back. If the LB is
co-located with the Server, then the overall response time from the LB is practically
the same as the Server processing time (assuming the LB’s processing time and the network
delay between the LB and the Server are negligible). However, if the LB is not co-located,
then a high response time could very well be caused by the Provider’s internal network between
the LB and the Server. Providing visibility into response time from various vantage
points helps the customer and the provider narrow down the problem.

Note: All references to LB here mean an L7 Load Balancer. L4 Load Balancers and
client-side Load Balancers are out of scope of this specification.

![Summary Span Overview][SummarySpanOverview]

The objective of this spec is to provide visibility into the response time of the
request from one or more stages.

# Specification
For the purpose of this specification we consider only two stages of request
processing: Load Balancer and Server. However, nothing prevents extending it
to more stages. The specification covers latency and a sampling bit, but it can
also be extended to support other measurements.

## Measurement Data
The two measurements that will be reported are lb_latency_ns and server_latency_ns.
Both fields are optional. For example, if the server is not instrumented or not
enabled to collect the measurement, then server_latency_ns will not be
reported. Similarly, if the LB is not instrumented or not enabled, then lb_latency_ns
will not be reported.

![Summary Span Measurement Data][SummarySpanMeasurementData]

### Measuring for Non-Streaming Request
1. **lb_latency_ns**: The time elapsed between the time the LB receives the request
and the time it sends the response status (Tlb1 - Tlb0). It includes processing
time on the LB + downstream network delay + server_latency_ns.
2. **server_latency_ns**: The time elapsed between the time the Server receives the
request and the time the Server sends the response status (Ts1 - Ts0).
3. **client_latency_ns**: The time elapsed between the time the client sends the request to the LB and the
time it receives the response from the LB (Tc1 - Tc0). It is also known as round-trip latency.
It is measured on the client side and is not reported; it is used only
for interpreting the measurements (see below).
4. **trace_option**: The least significant bit of trace_option
indicates whether the request is sampled. The other bits of trace_option are
reserved for future use.

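
Because lb_latency_ns includes server_latency_ns, subtracting the two isolates the time spent on the LB plus the network between the LB and the server. The sketch below illustrates that arithmetic; the class and method names are hypothetical, and treating client_latency_ns minus lb_latency_ns as the client-to-LB network time is an assumption that follows from the definitions above rather than something the spec states.

```java
// Hypothetical helper for illustration; names are not defined by this spec.
final class LatencyBreakdown {
  // LB processing time plus network delay between the LB and the server:
  // (Tlb1 - Tlb0) - (Ts1 - Ts0).
  static long lbAndDownstreamNetworkNs(long lbLatencyNs, long serverLatencyNs) {
    return lbLatencyNs - serverLatencyNs;
  }

  // Assumption: time spent outside the LB as seen by the client,
  // i.e. the network between the client and the LB:
  // (Tc1 - Tc0) - (Tlb1 - Tlb0).
  static long clientToLbNetworkNs(long clientLatencyNs, long lbLatencyNs) {
    return clientLatencyNs - lbLatencyNs;
  }
}
```

For instance, with lb_latency_ns = 51000 and server_latency_ns = 50000, the LB and the downstream network together account for 1000 ns.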
### Measurement for Streaming RPC
**trace_option**: There is no difference compared to non-streaming RPC.

#### Case 1: Server Streaming
**server_latency_ns** = Time elapsed between the request being received on the server and the
response status being sent. The response status is sent with the last chunk of the response
data or after the last chunk of the response data.

**lb_latency_ns** = Same as above, but measured on the LB.

#### Case 2: Client Streaming
**server_latency_ns** = Time elapsed between the last chunk of the request being received on
the server and the response status being sent from the server.

**lb_latency_ns** = Same as above, but measured on the LB.

#### Case 3: Both Streaming
**server_latency_ns** = Time elapsed between the last chunk of the request being received on
the server and the response status being sent from the server.

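
The three streaming cases differ only in which timestamp starts the clock. A minimal sketch of that rule follows; the class, enum, and parameter names are hypothetical, and the timestamps are assumed to come from a single monotonic clock on the measuring host.

```java
// Hypothetical helper for illustration; names are not defined by this spec.
final class StreamingLatency {
  enum RpcKind { NON_STREAMING, SERVER_STREAMING, CLIENT_STREAMING, BOTH_STREAMING }

  // Returns the latency in nanoseconds for the given kind of RPC.
  static long latencyNs(RpcKind kind,
                        long requestReceivedNs,      // first (or only) request chunk
                        long lastRequestChunkNs,     // last request chunk
                        long responseStatusSentNs) { // response status sent
    long start;
    switch (kind) {
      case CLIENT_STREAMING:
      case BOTH_STREAMING:
        // Cases 2 and 3: start when the last request chunk is received.
        start = lastRequestChunkNs;
        break;
      default:
        // Non-streaming and case 1: start when the request is received.
        start = requestReceivedNs;
    }
    return responseStatusSentNs - start;
  }
}
```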
#### Limitations
There are limitations to measuring performance when streaming is used.
- In case 1, it is hard to know whether the server continued to process the
request while serving the response. In that case the above latency may not
provide sufficient information. The measurement is meaningful only when the server
processes the request and has the entire response ready before sending it back to the
client in multiple chunks (a stream), so that the processing time
for the request is much larger than the time spent sending chunks of data on the wire.
- In case 2, the server may be able to start processing the request as soon as the
first chunk is received. We cannot know that without understanding the
application and the content of the request.
- In case 3, the limitations of both case 1 and case 2 apply.

### How to interpret the latency measurement?

| client_latency_ns | lb_latency_ns | server_latency_ns | Interpretation |
|-------------------|---------------|-------------------|----------------|
| LOW | LOW | LOW | Normal. Services are running normally. |
| HIGH | HIGH | LOW | The LB is taking longer to process, or the network between the LB and the Server is congested. |
| HIGH | HIGH | HIGH | The Server is overloaded or has some other underlying issue. Depending on the differences between the latencies, the problem could also be spread across other segments. |
| HIGH | LOW | HIGH | Not possible, because lb_latency_ns includes server_latency_ns. |
| HIGH | LOW | LOW | There is a network issue between the client and the LB. |
| HIGH | MISSING | LOW | The issue is not on the server, but one cannot conclusively determine whether it is on the LB, in the network between the LB and the server, or in the network between the client and the LB. |
| HIGH | MISSING | HIGH | The issue is on the server. Additionally, there could be an issue in other segments. |
| HIGH | HIGH | MISSING | The issue is on the LB or beyond, but one cannot conclusively determine whether it is on the server. |

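
The table can be applied mechanically once each latency has been classified. The sketch below is one way to do that; the class name, the `Level` enum, and the shortened result strings are hypothetical, and how a raw latency is classified as LOW or HIGH is left to the caller because the spec does not define thresholds.

```java
// Hypothetical helper for illustration; names are not defined by this spec.
final class LatencyInterpreter {
  enum Level { LOW, HIGH, MISSING }

  static String interpret(Level client, Level lb, Level server) {
    if (client == Level.LOW) {
      boolean allLow = lb == Level.LOW && server == Level.LOW;
      return allLow ? "Normal" : "Not covered by the table";
    }
    if (client == Level.MISSING) {
      return "Not covered by the table";
    }
    // client is HIGH: look up the (lb, server) pair.
    switch (lb + "/" + server) {
      case "HIGH/LOW":
        return "LB slow, or LB-to-server network congested";
      case "HIGH/HIGH":
        return "Server overloaded or other underlying issue";
      case "LOW/HIGH":
        return "Not possible";
      case "LOW/LOW":
        return "Network issue between client and LB";
      case "MISSING/LOW":
        return "Not the server; LB or network, inconclusive";
      case "MISSING/HIGH":
        return "Server issue; other segments possible";
      case "HIGH/MISSING":
        return "LB or beyond; server inconclusive";
      default:
        return "Not covered by the table";
    }
  }
}
```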
### How to use Sampling bit?
The sampling bit is used to compare traces and measurements end-to-end. Customers
can use the sampling-bit information to log traces and measurements on their side.
The traces and measurements from both sides can then be used to isolate the
problem.

It is particularly helpful when the server owner is different from the client
owner, because it helps the client find traces that are sampled
on both sides. Client owners typically have neither control over server-side
tracing nor access to the server traces.

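
On the client side, checking the sampling bit is a single mask operation. The sketch below shows that check plus a hypothetical filter that keeps only the client trace ids whose responses reported the bit as set; the class name, the method names, and the idea of keying by a client trace id are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; names are not defined by this spec.
final class SamplingBit {
  // Bit 0 (mask 0x01) of trace_option is the sampling bit.
  static final int SAMPLED_MASK = 0x01;

  static boolean isSampled(int traceOption) {
    return (traceOption & SAMPLED_MASK) != 0;
  }

  // Given the trace_option returned by the server for each client-sampled
  // trace id, returns the ids that were sampled on both sides.
  static List<String> sampledOnBothSides(Map<String, Integer> serverTraceOptionByTraceId) {
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, Integer> entry : serverTraceOptionByTraceId.entrySet()) {
      if (isSampled(entry.getValue())) {
        result.add(entry.getKey());
      }
    }
    return result;
  }
}
```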
### Encoding

The measurement data will be returned to the client in the response trailer. The
encoding is defined for two different transport methods: REST API
over HTTP, and gRPC.

#### Encoding with gRPC
For gRPC, the census-server-stats-bin metadata will be sent in the gRPC trailer. The
encoding of this metadata follows the format defined [here](../encodings/CensusServerStatsEncoding.md). All data is encoded
in little-endian.

#### Census-server-stats-bin Encoding
Encoding is based on [BinaryEncoding](../encodings/BinaryEncoding.md)
```
version_id = 0 (uint8),
lb_latency_ns (id) = 0 (uint8),
lb_latency_ns (value) = x (int64),
server_latency_ns (id) = 1 (uint8),
server_latency_ns (value) = y (int64),
trace-option (id) = 2 (uint8),
trace-option (value) = z (uint8)
    - bit 0 (mask 0x01) - Sampling bit.
          1 = request is sampled
          0 = request is not sampled
    - bits 1-7 (mask 0xFE) - Reserved.
```

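
A value for the census-server-stats-bin metadata could be produced along the following lines. This is a minimal sketch under stated assumptions: the class name `ServerStatsEncoder` is hypothetical, the field ids are the ones given in the Census Server Stats encoding document (`field_id` 0 for LB latency, 1 for server latency, 2 for trace options), and a `null` argument is used here to mean that an optional field is omitted.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration; not part of any published API.
final class ServerStatsEncoder {
  static byte[] encode(Long lbLatencyNs, Long serverLatencyNs, Integer traceOption) {
    // Worst case: version (1) + two latency fields (1 + 8 each) + trace option (1 + 1).
    ByteBuffer buf = ByteBuffer.allocate(1 + 9 + 9 + 2).order(ByteOrder.LITTLE_ENDIAN);
    buf.put((byte) 0); // version_id = 0
    if (lbLatencyNs != null) {
      buf.put((byte) 0).putLong(lbLatencyNs);
    }
    if (serverLatencyNs != null) {
      buf.put((byte) 1).putLong(serverLatencyNs);
    }
    if (traceOption != null) {
      buf.put((byte) 2).put((byte) (traceOption & 0xFF));
    }
    // Copy out only the bytes that were actually written.
    byte[] out = new byte[buf.position()];
    buf.flip();
    buf.get(out);
    return out;
  }
}
```

Calling `ServerStatsEncoder.encode(51000L, 50000L, 1)` produces the 21 bytes of the valid example given in the encoding document.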
# API

## Data Model

```
// This package describes the data model. It is currently experimental.

class ServerStats {
  // Latency observed at the server while processing the request.
  long serverLatencyNs;

  // Latency observed at the Load Balancer while processing the request.
  long lbLatencyNs;

  // A bitmap of tracing options.
  // Only the least significant bit (the sampling bit) is used for now.
  int traceOption;
}
```

## Interface
Each implementation should provide the following:
- A marshaller to encode to and decode from gRPC metadata as per
[this](#census-server-stats-bin-encoding)

[SummarySpanOverview]: /utils/drawings/SummarySpanOverview.png "Summary Span Overview"
[SummarySpanMeasurementData]: /utils/drawings/SummarySpanMeasurementData.png "Summary Span Measurement Data"
