[analysis] Visualize DSS operation latency from logs#1483
[analysis] Visualize DSS operation latency from logs#1483BenjaminPelletier wants to merge 9 commits into
Conversation
mickmis
left a comment
There was a problem hiding this comment.
I see the immediate added-value of having this, but I have some concerns over its maintainability with its current approach. However if, because of the current issues needing diagnosis rapidly, maintainability is not a primary concern, it should be IMO clearly identified as such, including a summary of the short-comings it has. WDYT?
| @@ -0,0 +1,109 @@ | |||
| #!/usr/bin/env python3 | |||
|
|
|||
| """Enumerate DSS routes from Go server.gen.go definitions and save them to JSON. | |||
There was a problem hiding this comment.
Run this in the CI and rename dss_routes.json to dss_routes.gen.json? In order to ensure there is no drift in the future.
Alternatively have you considered generating this file as a byproduct of the openapi-to-go-server script? That would be IMO less brittle than parsing the generated Go source code.
There was a problem hiding this comment.
Agreed on dss_routes.gen.json; changed. Running in the CI would be a little tricky as this is effectively an instance of linking to information from a version of the dss repository and the only way we do that today in monitoring is via the released docker image. If we had the CI check out a pinned version of the dss repository to regenerate and verify that dss_routes.gen.json matches, that wouldn't really provide any value since we know fixed inputs lead to fixed outputs. At best, a user changing the version of the dss repository dss_routes.gen.json reflects would need to find and update that pin in the CI and it doesn't seem like that exercise adds any value. If we had the CI check out a moving target of the dss repository, then the check would break when the dss repo moves on and someone trying to submit an unrelated PR would be blocked until they regenerated dss_routes.gen.json -- that doesn't seem like a good situation.
This information is defined in the dss repo, so I agree it makes sense for it to be generated in the dss repo long term. But, even when this information is generated in the dss repo, I don't see a good alternative to manually importing the information into a consuming repo -- this is still what we do in uas_standards. The way we link to the dss repo for most things is to use a tagged docker image. Hypothetically we could build this information and the aux interface information into that image as non-executable metadata and pull it from there, but even that would presumably require a manual sync operation to populate repo content (though the correctness of this could be tracked in CI since the pin would be in the main repo content rather than the CI). I think I would support that, but I do think that would be a future improvement rather than something we could do immediately today, so I would still expect the path to long term success to go through this PR in that case.
There was a problem hiding this comment.
Hum I did not take the reflection that far. I notably did not consider the dependency management with the DSS repo indeed. I understand better why you took this approach now.
In that case, maybe a small improvement I would suggest is that instead of relying on an externally checked out DSS repo, have the script check out a specific commit of the DSS repo. That would make the script more deterministic by pinning a specific version.
|
|
||
|
|
||
| def find_sibling_dss_dir(): | ||
| # This script is located in: monitoring/monitoring/analysis/dss_performance/enumerate_dss_routes.py |
There was a problem hiding this comment.
In that case doesn't this belong in the dss repository instead? In addition I believe we have now in the infra some Grafana deployment added by @the-glu where we could add a dashboard instead of an ad-hoc HTML template.
There was a problem hiding this comment.
If the dss repo was a subrepo of monitoring, then perhaps enumerate_dss_routes could go in the dss repo. But, it's not so we have to link the two repos a different way (see discussion above).
I would love to replace this tool with a Grafana dashboard; PRs welcome :) Note that one important feature is the ability to use this tool on local deployments and I don't believe we have Grafana there yet. Another feature useful temporarily is the ability to use this tool on deployments that don't have the latest DSS version -- if we added a DSS-deployment Grafana dashboard, that wouldn't be able to replace this tool until all the DSS instances we want to troubleshoot have adopted the DSS release that causes users to deploy that dashboard. If there were an alternate way to produce a Grafana dashboard for an existing deployment that wasn't initially set up for Grafana and pull and display information necessary for troubleshooting, that seems like an extremely valuable tool that I would definitely support. Once we have a tool like that, I think this tool could likely be decommissioned.
There was a problem hiding this comment.
A few through:
- Notice you can still achieve same graphs with tracing (even with a grafana dashboard ^^'). However that would mean: a recent DSS version and enabling it (no 'post analysis' with logs).
- (Something similar could (will) also be achieved with metrics, with lower precision however)
- For generating a grafana dashboard from the logs themself, I'm not sure there would be an advantage and keeping it like this is probably simpler. (Except if we start to have direct log access in grafana with e.g. loki).
- I also have the feeling that this script would be better in the DSS repository no? It's tightly coupled to the version (where as tests there are independent), require files from it, and would make sense in the 'deploy/operations' folder. It's more of an 'day-to-day' operation tool rather than a testing one.
There was a problem hiding this comment.
Note that one important feature is the ability to use this tool on local deployments and I don't believe we have Grafana there yet.
Ah! I thought so. That would be valuable indeed.
I also have the feeling that this script would be better in the DSS repository no?
Indeed, given it is a standalone tool aimed at analyzing DSS performances, why not putting it inside the DSS repository? That would not prevent using it on older DSS versions, and would allow a possible future integration with Grafana. Or is there some operational reason I'm not seeing?
| """A simple zero-dependency file lock implementation for Unix systems.""" | ||
|
|
||
| def __init__(self, filepath: str, timeout: int = 15): | ||
| self.lockfile = filepath + ".lock" |
|
|
||
|
|
||
| def find_sibling_dss_dir(): | ||
| # This script is located in: monitoring/monitoring/analysis/dss_performance/enumerate_dss_routes.py |
There was a problem hiding this comment.
A few through:
- Notice you can still achieve same graphs with tracing (even with a grafana dashboard ^^'). However that would mean: a recent DSS version and enabling it (no 'post analysis' with logs).
- (Something similar could (will) also be achieved with metrics, with lower precision however)
- For generating a grafana dashboard from the logs themself, I'm not sure there would be an advantage and keeping it like this is probably simpler. (Except if we start to have direct log access in grafana with e.g. loki).
- I also have the feeling that this script would be better in the DSS repository no? It's tightly coupled to the version (where as tests there are independent), require files from it, and would make sense in the 'deploy/operations' folder. It's more of an 'day-to-day' operation tool rather than a testing one.
| <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin> | ||
| <link href="https://fonts.googleapis.com/css2?family=Outfit:wght@300;400;500;600;700&family=Roboto:wght@300;400;500;700&display=swap" rel="stylesheet"> | ||
| <!-- Plotly.js CDN --> | ||
| <script src="https://cdn.plot.ly/plotly-2.24.1.min.js"></script> |
There was a problem hiding this comment.
nit: local copies of files/font would avoid external dependencies (who can be down/track user)
When troubleshooting DSS latency issues like dss#1509 using logs from actual DSS deployments, it can be difficult to understand what's going on or see any trends or patterns. This PR attempts to improve the situation by adding an analysis tool to visualize latency in DSS operations over time across multiple DSS instances.
This is accomplished in two parts. First, acquire_logs.py synthesizes raw tool output from either
docker container logs(for local deployments) orgcloud(for GCP Kubernetes deployments) into a common information format easily consumable for later analysis. Second, visualize_latency.py produces an HTML interface allowing the user to visually explore latency behavior.I would not usually make a PR this large, but I think it makes more sense to be able to fully validate the pipeline rather than trying to validate just acquire_logs by itself. Since the only consumer of visualize_latency's results will be humans (rather than downstream machines), I suggest that review of that output is probably more critical than close review of how that output is produced. This is output from an actual occurrence of dss#1509:
cloud_latency.zip
...and here is what output from a 7x3 local deployment representative of dss#1509 running netrid_concurrency looks like: