Commit Graph

1560 Commits

Author SHA1 Message Date
b14b468661 tests/robustness: Make weighted pick random generic
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-05-08 19:58:38 +02:00
7c68be4cf3 tests/robustness: Implement Range limit and count
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-05-07 09:32:07 +02:00
40f71ef3c6 tests/robustness: Implement delete request for kubernetes scenario
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-05-05 13:40:46 +02:00
92366a5338 tests/robustness: Split model code into deterministic and non-deterministic
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
Co-authored-by: Benjamin Wang <wachao@vmware.com>
Co-authored-by: chao <54131596+chaochn47@users.noreply.github.com>
2023-05-05 12:25:10 +02:00
cfe154209c tests/robustness: Separate describe model functions to dedicated file
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-05-04 14:03:18 +02:00
2c16812841 Merge pull request #15817 from serathius/robustness-k8s-1
tests/robustness: Implement first step in validating the Kubernetes-etcd contract
2023-05-04 13:52:25 +02:00
9b5680c5f1 tests/robustness: Implement first step in validating the Kubernetes-etcd contract.
* Use mod revision for optimistic concurrency.
* Introduce range requests as more general then get
* Add kubernetes specific traffic generation, for now using pull, but
  expected to evolve to use watch.
* Introduce kubernetes specific test scenario

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-05-04 13:26:54 +02:00
49b59cc8e5 Merge pull request #15656 from mitake/lease-timetolive-auth
protect LeaseTimeToLive with RBAC
2023-05-02 23:02:29 +09:00
b089474b01 Merge pull request #15795 from jmhbnz/deflake-roundrobin-resolver-test
tests: Deflake TestEtcdGrpcResolverRoundRobin
2023-05-02 06:09:46 +08:00
b9533ca98b Deflake TestEtcdGrpcResolverRoundRobin.
Increase request to 1000 to increase sample size/reduce variability and increase tolerance threshold from 10 to 15%.

Signed-off-by: James Blair <mail@jamesblair.net>
2023-04-29 14:14:16 +12:00
09d053e035 tests/robustness: tune timeout policy
In a [scheduled test][1], the error shows

```
2023-04-19T11:16:15.8166316Z     traffic.go:96: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
```

According to [grpc-keepalive@v1.51.0][2], each frame from server will
fresh the `lastRead` and it won't file `Ping` frame to server. But the
client used by [`tombstone` request][3] might hit the race. Since we use
5ms as timeout, the client might not receive the result of `Ping` from
server in time. The keepalive will mark it timeout and close the
connection.

I didn't reproduce it in my local. If we add the sleep before update
`lastRead`, it can reproduce it sometimes. Still investigating this
part.

```diff
diff --git a/internal/transport/http2_client.go b/internal/transport/http2_client.go
index d518b07e..bee9c00a 100644
--- a/internal/transport/http2_client.go
+++ b/internal/transport/http2_client.go
@@ -1560,6 +1560,7 @@ func (t *http2Client) reader(errCh chan<- error) {
                t.controlBuf.throttle()
                frame, err := t.framer.fr.ReadFrame()
                if t.keepaliveEnabled {
+                       time.Sleep(2 * time.Millisecond)
                        atomic.StoreInt64(&t.lastRead, time.Now().UnixNano())
                }
                if err != nil {
```

`DialKeepAliveTime` is always >= [10s][4]. I think we should increase
the timeout to avoid flaky caused by unstable env.

And in a [scheduled test][5], the error shows

```
logger.go:130: 2023-04-22T10:45:52.646Z	INFO	Failed to trigger failpoint	{"failpoint": "blackhole", "error": "context deadline exceeded"}
```

Before sending `Status` to member, the client doesn't [pick][6] the
connection in time (100ms) and returns the error.

The `waitTillSnapshot` is used to ensure that it is good enough to
trigger snapshot transfer. And we have 1min timeout for
injectFailpoints, so I think we can remove the 100ms timeout to reduce
unnecessary stop.

```
injectFailpoints(1min timeout)
  failpoint.Inject
    triggerBlockhole.Trigger
      blackhole
        waitTillSnapshot
```

> NOTE: I didn't reproduce it either. :(

Reference:

[1]: <https://github.com/etcd-io/etcd/actions/runs/4741737098/jobs/8419176899>
[2]: <eeb9afa1f6/internal/transport/http2_client.go (L1647)>
[3]: <7450cd886d/tests/robustness/traffic.go (L94)>
[4]: <eeb9afa1f6/dialoptions.go (L445)>
[5]: <https://github.com/etcd-io/etcd/actions/runs/4772033408/jobs/8484334015>
[6]: <eeb9afa1f6/clientconn.go (L932)>

REF: #15763

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-04-29 07:03:47 +08:00
c7d81acaf0 test: forcibly save data on pinicking
Signed-off-by: Benjamin Wang <wachao@vmware.com>
2023-04-27 14:54:35 +08:00
c9b368119e tests: e2e and integration test for timetolive
Signed-off-by: Hitoshi Mitake <h.mitake@gmail.com>
Co-authored-by: Benjamin Wang <wachao@vmware.com>
2023-04-26 20:35:20 +09:00
18e3acae0e Add new test for round robin resolver.
Signed-off-by: James Blair <mail@jamesblair.net>
2023-04-25 18:44:24 +12:00
4a8817bfb0 Merge pull request #15737 from jmhbnz/update-dependencies
Bump dependencies identified by dependabot
2023-04-21 06:35:08 +08:00
04f3e9cb9a dependency: bump golang.org/x/crypto from 0.7.0 to 0.8.0
Signed-off-by: James Blair <mail@jamesblair.net>
2023-04-21 05:34:21 +12:00
042e2e9a57 dependency: bump github.com/prometheus/client_golang from 1.14.0 to 1.15.0
Signed-off-by: James Blair <mail@jamesblair.net>
2023-04-21 05:14:40 +12:00
50aa00b203 tests: make log monitor as common helper
It's followup of #15667.

This patch is to use zaptest/observer as base to provide a similar
function to pkg/expect.Expect.

The test env

```bash
11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
mkdir /sys/fs/cgroup/etcd-followup-15667
echo 0-2 | tee /sys/fs/cgroup/etcd-followup-15667/cpuset.cpus # three cores
```

Before change:

* memory.peak: ~ 681 MiB
* Elapsed (wall clock) time (h:mm:ss or m:ss): 6:14.04

After change:

* memory.peak: ~ 671 MiB
* Elapsed (wall clock) time (h:mm:ss or m:ss): 6:13.07

Based on the test result, I think it's safe to be enabled by default.

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-04-18 09:00:24 +08:00
9f034fbaa8 chore: use tools/mod to lock the cfssl cmd version
Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-04-13 12:06:31 +08:00
8cd5969248 chore: use strict mode for tests/*/*.sh
REF: #15514

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-04-13 12:05:39 +08:00
78d2ead804 chore: deprecate tests/manual folder
REF: #15514

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-04-13 12:05:39 +08:00
e48ef9ea6a tests/robustness: Disable blackholing traffic till snapshot for v3.4.X
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-12 14:39:57 +02:00
d19752f16a tests/robustness: Unify failpoint lists by depending on availability checking
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-12 14:38:30 +02:00
625d427eb5 tests/robustness: Separate triggering failpoint from injection
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-12 14:36:50 +02:00
932415a8d5 tests/robustness: Verify cluster configuration in failpoint availability
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-12 14:36:50 +02:00
941c4afb0c tests/framwork/e2e/cluster.go: revert back to sequential cluster stop to reduce e2e test run time
Signed-off-by: Chao Chen <chaochn@amazon.com>
2023-04-11 22:08:18 -07:00
dddd4780c2 dependency: bump github.com/spf13/cobra from 1.6.1 to 1.7.0
Signed-off-by: Benjamin Wang <wachao@vmware.com>
2023-04-11 08:51:26 +08:00
eb9b15bf49 dependency: bump golang.org/x/net from 0.8.0 to 0.9.0
Signed-off-by: Benjamin Wang <wachao@vmware.com>
2023-04-11 08:44:26 +08:00
8a27dd4db4 dependency: bump github.com/jonboulle/clockwork from 0.3.0 to 0.4.0
Signed-off-by: Benjamin Wang <wachao@vmware.com>
2023-04-11 08:36:44 +08:00
1683231216 Merge pull request #15667 from fuweid/deflake-issue-15545-TestV3WatchRestoreSnapshotUnsync
tests: deflake TestV3WatchRestoreSnapshotUnsync
2023-04-11 06:00:42 +08:00
536953ec6c tests: deflake TestV3WatchRestoreSnapshotUnsync
The TestV3WatchRestoreSnapshotUnsync setups three members' cluster.
Before serving any update requests from client, after leader elected,
each member will have index 8 log: 3 x ConfChange +
3 x ClusterMemberAttrSet + 1 x ClusterVersionSet.

Based on the config (SnapshotCount: 10, CatchUpCount: 5), we need to
file update requests to trigger snapshot at least twice.

T1: L(snapshot-index: 11, compacted-index:  6) F_m0(index: 8)
T2: L(snapshot-index: 22, compacted-index: 17) F_m0(index: 8, out of date)

After member0 recovers from network partition, it will reject leader's
request and return hint (index:8, term:x). If it happens after
second snapshot, leader will find out the index:8 is out of date and
force to transfer snapshot.

However, the client only files 15 update requests and leader doesn't
finish the process of snapshot in time. Since the last of
compacted-index is 6, leader can still replicate index:9 to member0
instead of snapshot.

```bash
cd tests/integration
CLUSTER_DEBUG=true go test -v -count=1 -run TestV3WatchRestoreSnapshotUnsync ./
...

INFO    m2.raft 3da8ba707f1a21a4 became leader at term 2        {"member": "m2"}
...
INFO    m2      triggering snapshot     {"member": "m2", "local-member-id": "3da8ba707f1a21a4", "local-member-applied-index": 22, "local-member-snapshot-index": 11, "local-member-snapshot-count": 10, "snapshot-forced": false}
...

cluster.go:1359: network partition between: 99626fe5001fde8b <-> 1c964119da6db036
cluster.go:1359: network partition between: 99626fe5001fde8b <-> 3da8ba707f1a21a4
cluster.go:416: WaitMembersForLeader

INFO    m0.raft 99626fe5001fde8b became follower at term 2      {"member": "m0"}
INFO    m0.raft raft.node: 99626fe5001fde8b elected leader 3da8ba707f1a21a4 at term 2   {"member": "m0"}
DEBUG   m2.raft 3da8ba707f1a21a4 received MsgAppResp(rejected, hint: (index 8, term 2)) from 99626fe5001fde8b for index 23      {"member": "m2"}
DEBUG   m2.raft 3da8ba707f1a21a4 decreased progress of 99626fe5001fde8b to [StateReplicate match=8 next=9 inflight=15]  {"member": "m2"}

DEBUG   m0      Applying entries        {"member": "m0", "num-entries": 15}
DEBUG   m0      Applying entry  {"member": "m0", "index": 9, "term": 2, "type": "EntryNormal"}

....

INFO    m2      saved snapshot  {"member": "m2", "snapshot-index": 22}
INFO    m2      compacted Raft logs     {"member": "m2", "compact-index": 17}
```

To fix this issue, the patch uses log monitor to watch "compacted Raft
log" and expect that two members should compact log twice.

Fixes: #15545

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-04-10 22:27:58 +08:00
7153a8f2f4 Merge pull request #15646 from serathius/robustness-readme-watch-issue
tests/robustness: Document analysing watch issue
2023-04-07 23:45:42 +02:00
a5a5862e0b tests: Make using etcdctl expicit in e2e tests
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-06 13:29:37 +02:00
8b1cd036ff security: remove password after authenticating the user
fix https://nvd.nist.gov/vuln/detail/CVE-2021-28235

Signed-off-by: Benjamin Wang <wachao@vmware.com>
2023-04-06 17:11:54 +08:00
801bb4c6df test: add an e2e test to reproduce https://nvd.nist.gov/vuln/detail/CVE-2021-28235
Signed-off-by: Benjamin Wang <wachao@vmware.com>
2023-04-06 16:47:31 +08:00
2d0d3c3fdf security: bump go to 1.19.8 to fix four CVEs
Signed-off-by: Benjamin Wang <wachao@vmware.com>
2023-04-06 13:38:58 +08:00
2d9aeec91f Merge pull request #15645 from serathius/tests-cleanup-alternative-binaries
tests/framework: Cleanup alternative binaries in e2e tests
2023-04-06 07:33:17 +02:00
540d012e5e tests/robustness: Ensure that etcdctl binary is provided
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-05 23:04:20 +02:00
1e41d95ab2 tests/robustness: Document analysing watch issue
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-05 22:40:47 +02:00
651873cf7b tests/framework: Cleanup alternative binaries in e2e tests
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-05 15:32:31 +02:00
42a2643df9 tests/robustness: Reproduce issue #15220
This issue is somewhat easily reproduced simply by bombarding the
server with requests for progress notifications, which eventually
leads to one being delivered ahead of the payload message. This is
then caught by the watch response validation code previously added by
Marek Siarkowicz.

Signed-off-by: Peter Wortmann <peter.wortmann@skao.int>
2023-04-05 11:23:02 +01:00
af25936fb7 tests/integration: Demonstrate manual progress notification race
This will fail basically every time, as the progress notification
request catches the watcher in an asynchronised state.

Signed-off-by: Peter Wortmann <peter.wortmann@skao.int>
2023-04-05 11:19:07 +01:00
5bae6b1e44 tests/robustness: Detect trigger timeout and exit
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-04 15:23:58 +02:00
1227754284 Cancel watch if cluster not healthy before or after injecting failpoints.
Signed-off-by: James Blair <mail@jamesblair.net>
2023-04-04 13:58:17 +02:00
6582e349db tests: Enfoce timeout on failpoints
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-04 12:25:07 +02:00
523f235c82 Merge pull request #15603 from serathius/robustness-finish-with-success
tests: Ensure that operation history finishes with successful request
2023-04-04 12:03:36 +02:00
32acc662c9 Merge pull request #15638 from ahrtr/dependency_20230404
Bump some dependencies
2023-04-04 17:11:26 +08:00
6a5d326519 tests: Ensure that operation history finishes with successful request
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-04 09:40:17 +02:00
5e0119eadc Merge pull request #15636 from lavacat/main-test-watch-delay
tests: increase maxWatchDelay to prevent flaky TestWatchDelay*
2023-04-04 09:38:03 +02:00
138fae6246 Merge pull request #15632 from serathius/fix-comparing-etcd-version
tests: Fix comparing etcd version
2023-04-04 09:34:55 +02:00