Merge pull request #19609 from henrybear327/robustness/improve_readme

Update the robustness test README
Marek Siarkowicz 2025-03-15 13:16:27 +01:00 committed by GitHub
commit f07e2ae4ed


@ -39,7 +39,7 @@ The purpose of these tests is to rigorously validate that etcd maintains its [KV
## How Robustness Tests Work ## How Robustness Tests Work
Robustness tests compare etcd cluster behavior against a simplified model of its expected behavior. Robustness tests compare the etcd cluster behavior against a simplified model of its expected behavior.
These tests cover various scenarios, including: These tests cover various scenarios, including:
* **Different etcd cluster setups:** Cluster sizes, configurations, and deployment topologies. * **Different etcd cluster setups:** Cluster sizes, configurations, and deployment topologies.
@ -52,8 +52,8 @@ These tests cover various scenarios, including:
2. **Traffic and Failures:** Client traffic is generated and sent to the cluster while failures are injected. 2. **Traffic and Failures:** Client traffic is generated and sent to the cluster while failures are injected.
3. **History Collection:** All client operations and their results are recorded. 3. **History Collection:** All client operations and their results are recorded.
4. **Validation:** The collected history is validated against the etcd model and a set of validators to ensure consistency and correctness. 4. **Validation:** The collected history is validated against the etcd model and a set of validators to ensure consistency and correctness.
5. **Report Generation:** If a failure is detected and a detailed report is generated to help diagnose the issue. 5. **Report Generation:** If a failure is detected then a detailed report is generated to help diagnose the issue.
This report includes information about the client operations, etcd data directories. This report includes information about the client operations and etcd data directories.
## Key Concepts ## Key Concepts
@ -96,26 +96,25 @@ Etcd provides strict serializability for KV operations and eventual consistency
make gofail-disable make gofail-disable
``` ```
2. Run the tests 2. Run the tests
```bash ```bash
make test-robustness make test-robustness
``` ```
Optionally you can pass environment variables: Optionally, you can pass environment variables:
* `GO_TEST_FLAGS` - to pass additional arguments to `go test`. * `GO_TEST_FLAGS` - to pass additional arguments to `go test`.
It is recommended to run tests multiple times with failfast enabled. This can be done by setting `GO_TEST_FLAGS='--count=100 --failfast'`. It is recommended to run tests multiple times with failfast enabled. This can be done by setting `GO_TEST_FLAGS='--count=100 --failfast'`.
* `EXPECT_DEBUG=true` - to get logs from the cluster. * `EXPECT_DEBUG=true` - to get logs from the cluster.
* `RESULTS_DIR` - to change location where results report will be saved. * `RESULTS_DIR` - to change the location where the results report will be saved.
* `PERSIST_RESULTS` - to persist the results report of the test. By default this will not be persisted in the case of a successful run. * `PERSIST_RESULTS` - to persist the results report of the test. By default this will not be persisted in the case of a successful run.
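As a rough sketch of how these variables combine, the wrapper below runs the suite repeatedly with failfast enabled and writes reports to a chosen directory. The function name `run_robustness` and the default `/tmp/robustness-results` path are illustrative assumptions, not part of the etcd Makefile. `PERSIST_RESULTS` is left out because its accepted values are not shown here.

```bash
# Hypothetical helper: run the robustness tests 100 times, stopping at the first failure.
# $1 (optional) is the directory for result reports; the default is an assumed example path.
run_robustness() {
  GO_TEST_FLAGS='--count=100 --failfast' \
  EXPECT_DEBUG=true \
  RESULTS_DIR="${1:-/tmp/robustness-results}" \
    make test-robustness
}
```

For example, `run_robustness /tmp/my-results` would leave any report under `/tmp/my-results`.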
## Re-evaluate existing report ## Re-evaluate existing report
Robustness test validation is constantly changing and improving. Robustness test validation is constantly changing and improving.
Errors in etcd model could be causing false positives, which makes the ability to re-evaluate the reports after we fix the issue important. Errors in the etcd model could be causing false positives, which makes the ability to re-evaluate the reports after we fix the issue important.
> Note: Robustness test report format is not stable, and it's expected that not all old reports can be re-evaluated using the newest version. > Note: Robustness test report format is not stable, and it's expected that not all old reports can be re-evaluated using the newest version.
1. Identify location of the robustness test report. 1. Identify the location of the robustness test report.
> Note: By default robustness test report is only generated for failed test. > Note: By default robustness test report is only generated for failed test.
@ -124,7 +123,7 @@ Errors in etcd model could be causing false positives, which makes the ability t
logger.go:146: 2024-04-08T09:45:27.734+0200 INFO Saving robustness test report {"path": "/tmp/TestRobustnessExploratory_Etcd_HighTraffic_ClusterOfSize1"} logger.go:146: 2024-04-08T09:45:27.734+0200 INFO Saving robustness test report {"path": "/tmp/TestRobustnessExploratory_Etcd_HighTraffic_ClusterOfSize1"}
``` ```
* **For remote runs on CI:** you need to go to the [Prow Dashboard](https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-etcd-robustness-amd64), go to a build, download one of the Artifacts (`artifacts/results.zip`), and extract it locally. * **For remote runs on CI:** you need to go to the [Prow Dashboard](https://testgrid.k8s.io/sig-etcd-robustness#Summary), go to a build, download one of the Artifacts (`artifacts/results.zip`), and extract it locally.
![Prow job run page](readme-images/prow_job.png) ![Prow job run page](readme-images/prow_job.png)
@ -144,14 +143,14 @@ Errors in etcd model could be causing false positives, which makes the ability t
The `testdata` directory can contain multiple robustness test reports. The `testdata` directory can contain multiple robustness test reports.
The name of the report directory doesn't matter, as long as it's unique to prevent clashing with reports already present in `testdata` directory. The name of the report directory doesn't matter, as long as it's unique to prevent clashing with reports already present in `testdata` directory.
For example path for `history.html` file could look like `$REPO_ROOT/tests/robustness/testdata/v3.5_failure_24_April/history.html`. For example, the path for `history.html` file could look like `$REPO_ROOT/tests/robustness/testdata/v3.5_failure_24_April/history.html`.
3. Run `make test-robustness-reports` to validate all reports in the `testdata` directory. 3. Run `make test-robustness-reports` to validate all reports in the `testdata` directory.
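The copy step can be scripted roughly as follows. This is a minimal sketch: `add_report` is a hypothetical helper name, and it assumes `REPO_ROOT` points at your etcd checkout.

```bash
# Hypothetical helper: copy a report directory into testdata under a unique name.
# $1 = path to the extracted report directory, $2 = name to use inside testdata.
add_report() {
  dest="$REPO_ROOT/tests/robustness/testdata/$2"
  # Refuse to overwrite a report that is already present.
  if [ -e "$dest" ]; then
    echo "report name already used: $dest" >&2
    return 1
  fi
  mkdir -p "$REPO_ROOT/tests/robustness/testdata"
  cp -r "$1" "$dest"
}
```

After a call such as `add_report /path/to/extracted/report v3.5_failure_24_April`, running `make test-robustness-reports` re-evaluates it together with the other reports.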
## Analysing failure ## Analysing failure
If robustness tests fails we want to analyse the report to confirm if the issue is on etcd side. Location of the directory with the report If robustness tests fail, we want to analyse the report to confirm if the issue is on etcd side. The location of the directory with the report
is mentioned `Saving robustness test report` log. Logs from report generation should look like: is mentioned in the `Saving robustness test report` log. Logs from report generation should look like:
``` ```
logger.go:146: 2024-05-08T10:42:54.429+0200 INFO Saving robustness test report {"path": "/tmp/TestRobustnessRegression_Issue14370/1715157774429416550"} logger.go:146: 2024-05-08T10:42:54.429+0200 INFO Saving robustness test report {"path": "/tmp/TestRobustnessRegression_Issue14370/1715157774429416550"}
logger.go:146: 2024-05-08T10:42:54.429+0200 INFO Saving member data dir {"member": "TestRobustnessRegressionIssue14370-test-0", "path": "/tmp/TestRobustnessRegression_Issue14370/1715157774429416550/server-TestRobustnessRegressionIssue14370-test-0"} logger.go:146: 2024-05-08T10:42:54.429+0200 INFO Saving member data dir {"member": "TestRobustnessRegressionIssue14370-test-0", "path": "/tmp/TestRobustnessRegression_Issue14370/1715157774429416550/server-TestRobustnessRegressionIssue14370-test-0"}
@ -178,21 +177,21 @@ is mentioned `Saving robustness test report` log. Logs from report generation sh
logger.go:146: 2024-05-08T10:42:54.441+0200 INFO Saving visualization {"path": "/tmp/TestRobustnessRegression_Issue14370/1715157774429416550/history.html"} logger.go:146: 2024-05-08T10:42:54.441+0200 INFO Saving visualization {"path": "/tmp/TestRobustnessRegression_Issue14370/1715157774429416550/history.html"}
``` ```
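If the test output was saved to a file, the report directory can be pulled out of the log instead of being searched for by eye. The sketch below assumes the log line looks like the `Saving robustness test report` line shown above; `report_path` is a hypothetical helper name.

```bash
# Hypothetical helper: read test output on stdin and print the report directory
# taken from each "Saving robustness test report" log line.
report_path() {
  sed -n 's/.*Saving robustness test report.*"path": *"\([^"]*\)".*/\1/p'
}
```

For example, `report_path < test-output.log` prints one path per saved report, where `test-output.log` is whatever file you redirected the test output to. The `Saving member data dir` lines are ignored.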
Report follows the hierarchy: The report follows the hierarchy:
* `server-*` - etcd server data directories, can be used to verify disk/memory corruption. * `server-*` - etcd server data directories, can be used to verify disk/memory corruption.
* `member` * `member`
* `wal` - Write Ahead Log (WAL) directory, that can be analysed using `etcd-dump-logs` command line tool available in `tools` directory. * `wal` - Write Ahead Log (WAL) directory, that can be analysed using `etcd-dump-logs` command line tool available in `tools` directory.
* `snap` - Snapshot directory, includes the bbolt database file `db`, that can be analysed using `etcd-dump-db` command line tool available in `tools` directory. * `snap` - Snapshot directory, includes the bbolt database file `db`, that can be analysed using `etcd-dump-db` command line tool available in `tools` directory.
* `client-*` - Client request and response dumps in json format. * `client-*` - Client request and response dumps in json format.
* `watch.jon` - Watch requests and responses, can be used to validate [watch API guarantees]. * `watch.json` - Watch requests and responses, can be used to validate [watch API guarantees].
* `operations.json` - KV operation history * `operations.json` - KV operation history
* `history.html` - Visualization of KV operation history, can be used to validate [KV API guarantees]. * `history.html` - Visualization of KV operation history, can be used to validate [KV API guarantees].
### Example analysis of linearization issue ### Example analysis of a linearization issue
Let's reproduce and analyse robustness test report for issue [#14370]. Let's reproduce and analyse robustness test report for issue [#14370].
To reproduce the issue by yourself run `make test-robustness-issue14370`. To reproduce the issue by yourself run `make test-robustness-issue14370`.
After a couple of tries robustness tests should fail with a log `Linearization failed` and save report locally. After a couple of tries robustness tests should fail with a log `Linearization failed` and save the report locally.
Example: Example:
``` ```
@ -211,14 +210,14 @@ Jump to the error in linearization by clicking `[ jump to first error ]` on the
You should see a graph similar to the one on the image below. You should see a graph similar to the one on the image below.
![issue14370](readme-images/issue14370.png) ![issue14370](readme-images/issue14370.png)
Last correct request (connected with grey line) is a `Put` request that succeeded and got revision `168`. The last correct request (connected with the grey line) is a `Put` request that succeeded and got revision `168`.
All following requests are invalid (connected with red line) as they have revision `167`. All following requests are invalid (connected with red line) as they have revision `167`.
Etcd guarantee that revision is non-decreasing, so this shows a bug in etcd as there is no way revision should decrease. Etcd guarantees that revision is non-decreasing, so this shows a bug in etcd as there is no way revision should decrease.
This is consistent with the root cause of [#14370] as it was issue with process crash causing last write to be lost. This is consistent with the root cause of [#14370] as it was an issue with the process crash causing the last write to be lost.
[#14370]: https://github.com/etcd-io/etcd/issues/14370 [#14370]: https://github.com/etcd-io/etcd/issues/14370
### Example analysis of watch issue ### Example analysis of a watch issue
Let's reproduce and analyse robustness test report for issue [#15271]. Let's reproduce and analyse robustness test report for issue [#15271].
To reproduce the issue by yourself run `make test-robustness-issue15271`. To reproduce the issue by yourself run `make test-robustness-issue15271`.
@ -236,22 +235,24 @@ Example:
``` ```
Watch issues are easiest to analyse by reading the recorded watch history. Watch issues are easiest to analyse by reading the recorded watch history.
Watch history is recorded for each client separated in different subdirectory under `/tmp/TestRobustnessRegression_Issue15271/1715158215866033806`
Open `watch.json` for client mentioned in log `Broke watch guarantee`. Watch history is recorded for each client separated in different subdirectory under `/tmp/TestRobustnessRegression_Issue15271/1715158215866033806`.
Open `watch.json` for the client mentioned in the log `Broke watch guarantee`.
For client `4` that broke the watch guarantee open `/tmp/TestRobustnessRegression_Issue15271/1715158215866033806/client-4/watch.json`. For client `4` that broke the watch guarantee open `/tmp/TestRobustnessRegression_Issue15271/1715158215866033806/client-4/watch.json`.
Each line consists of json blob corresponding to single watch request sent by client. Each line consists of json blob corresponding to a single watch request sent by the client.
Look for events with `Revision` equal to revision mentioned in the first log with `Broke watch guarantee`, in this case look for `"Revision":3,`. Look for events with `Revision` equal to revision mentioned in the first log with `Broke watch guarantee`, in this case, look for `"Revision":3,`.
You should see watch responses where the `Revision` decreases like ones below: You should see watch responses where the `Revision` decreases like ones below:
``` ```
{"Events":[{"Type":"put-operation","Key":"key5","Value":{"Value":"793","Hash":0},"Revision":799,"IsCreate":false,"PrevValue":null}],"IsProgressNotify":false,"Revision":799,"Time":3202907249,"Error":""} {"Events":[{"Type":"put-operation","Key":"key5","Value":{"Value":"793","Hash":0},"Revision":799,"IsCreate":false,"PrevValue":null}],"IsProgressNotify":false,"Revision":799,"Time":3202907249,"Error":""}
{"Events":[{"Type":"put-operation","Key":"key4","Value":{"Value":"1","Hash":0},"Revision":3,"IsCreate":true,"PrevValue":null}, ... {"Events":[{"Type":"put-operation","Key":"key4","Value":{"Value":"1","Hash":0},"Revision":3,"IsCreate":true,"PrevValue":null}, ...
``` ```
Up to the first response the `Revision` of events only increased up to a value of `799`. Up to the first response, the `Revision` of events only increased up to a value of `799`.
However, the following line includes an event with `Revision` equal `3`. However, the following line includes an event with `Revision` equal `3`.
If you follow the `revision` throughout the file you should notice that watch replayed revisions second time. If you follow the `revision` throughout the file you should notice that watch replayed revisions for a second time.
This is incorrect and breaks `Ordered` [watch API guarantees]. This is incorrect and breaks `Ordered` [watch API guarantees].
This is consistent with the root cause of [#15271] where member reconnecting to cluster will resend revisions. This is consistent with the root cause of [#15271] where the member reconnecting to cluster will resend revisions.
[#15271]: https://github.com/etcd-io/etcd/issues/15271 [#15271]: https://github.com/etcd-io/etcd/issues/15271
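Scanning a large `watch.json` by hand is tedious, so a quick heuristic can point at the places where a revision goes backwards. The sketch below assumes revisions appear as `"Revision":<number>` exactly as in the sample lines above; `find_revision_drops` is a hypothetical helper name. It only flags candidates for manual inspection: a new watch that legitimately starts from an older revision would be flagged as well.

```bash
# Hypothetical helper: print every place in a watch.json file where a
# "Revision" value is lower than the one immediately before it.
# $1 = path to watch.json
find_revision_drops() {
  # grep -n -o emits one match per line as: <line number>:"Revision":<number>
  grep -n -o '"Revision":[0-9]*' "$1" |
    awk -F: '{
      rev = $3 + 0
      if (NR > 1 && rev < prev)
        printf "line %s: revision dropped from %d to %d\n", $1, prev, rev
      prev = rev
    }'
}
```

For example, `find_revision_drops /tmp/TestRobustnessRegression_Issue15271/1715158215866033806/client-4/watch.json` lists the line numbers worth opening first.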