Compare commits

...

112 Commits

Author SHA1 Message Date
cc198e22d3 version: bump to v3.0.17 2017-01-20 11:02:34 -08:00
fcf813427b lease/leasehttp: pass min TTL in TestRenewHTTP 2017-01-20 11:02:00 -08:00
518efab61c etcdctlv3: snapshot restore works with lease key 2017-01-20 10:36:02 -08:00
42f9a5ef74 etcdserver, lease: tie lease min ttl to election timeout 2017-01-20 10:35:46 -08:00
21509633ba version: bump to v3.0.16+git 2017-01-20 10:19:14 -08:00
a23109a0c6 version: bump to v3.0.16 2017-01-13 11:29:12 -08:00
219a4e9ad5 clientv3: don't reset keepalive stream on grant failure
Was triggering cancelation errors on outstanding KeepAlives if Grant
had to retry.
2017-01-13 11:28:17 -08:00
3d050630f4 v3api, rpctypes: add ErrTimeoutDueToConnectionLost
Lack of a gRPC error code was causing this to look like a halting error to the client.
2017-01-13 11:27:56 -08:00
9c66ed2798 clientv3: don't reset stream on keepaliveonce or revoke failure
Would cause the keepalive loop to cancel out.

Fixes #7082
2017-01-13 10:42:01 -08:00
a9e2d3d4d3 *: remove 'tools/etcd-top' to drop pcap.h 2016-12-07 10:34:34 -08:00
41e329cd35 *: drop breaking-changes from master branch 2016-12-07 10:34:28 -08:00
3a8b524d36 travis, test: use Go 1.6.4, skip 'gosimple' 2016-12-07 10:28:53 -08:00
11668f53db integration: use RequireLeader for TestV3LeaseFailover
Giving Renew() the default request timeout causes TestV3LeaseFailover
to miss its timing constraints. Since it only needs to wait until the
member recognizes the leader is lost, use RequireLeader to cancel the
keepalive stream before the request times out.
2016-12-07 10:06:28 -08:00
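A minimal sketch of the RequireLeader mechanism referenced above, for readers unfamiliar with it (assumes imports of "context", "fmt", "log" and "github.com/coreos/etcd/clientv3"; cli is an existing *clientv3.Client and leaseID comes from a prior Grant):

    ctx := clientv3.WithRequireLeader(context.Background())
    ka, err := cli.KeepAlive(ctx, leaseID)
    if err != nil {
        log.Fatal(err)
    }
    // If the member serving this stream loses its leader, the stream is torn
    // down with a no-leader error instead of waiting out the request timeout.
    for resp := range ka {
        fmt.Println("lease TTL refreshed to", resp.TTL)
    }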
7ceca7e046 clientv3/integration: test lease keepalive works following quorum loss 2016-12-07 10:06:28 -08:00
395bd2313c v3rpc, etcdserver, leasehttp: ctxize Renew with request timeout
Would retry a few times before returning a not primary error that
the client should never see. Instead, use proper timeouts and
then return a request timeout error on failure.

Fixes #6922
2016-12-07 10:06:19 -08:00
b357569bc6 version: bump to v3.0.15+git 2016-11-11 11:17:31 -08:00
fc00305a2e version: bump to v3.0.15 2016-11-10 13:12:43 -08:00
f322fe7f0d clientv3, ctlv3: document range end requirement 2016-11-10 13:10:18 -08:00
049fcd30ea integration: test wrong watcher range 2016-11-10 13:09:13 -08:00
1b702e79db mvcc: return -1 for wrong watcher range key >= end
Fix https://github.com/coreos/etcd/issues/6819.
2016-11-10 13:08:51 -08:00
b87190d9dc integration: test canceling a watcher on disconnected stream 2016-11-10 13:07:24 -08:00
83b493f945 clientv3: let watchers cancel when reconnecting 2016-11-10 13:06:47 -08:00
9b69cbd989 version: bump to v3.0.14+git 2016-11-04 13:06:36 -07:00
8a37349097 version: bump to v3.0.14 2016-11-04 10:54:14 -07:00
9a0e4dfe4f ctlv3: fix migration 2016-11-03 09:47:41 -07:00
f60469af16 ctlv3: Add a no-ttl flag to etcdctl migrate to discard keys on transform. 2016-11-03 09:47:39 -07:00
932370d8ca version: bump to v3.0.13+git 2016-10-24 11:22:50 -07:00
c99d0d4b25 version: bump to v3.0.13 2016-10-24 11:04:43 -07:00
d78216f528 e2e: remove 'ctlV3GetFailPerm' 2016-10-24 11:04:13 -07:00
c05c027a24 etcdctl: fix migrate when outputting client.Node to json
Using Printf tries to parse the string and replace special
characters. In the migrate code, we want to output just the raw
json string of client.Node.
For example,
    Printf("%\\") => %!\(MISSING)
    Print("%\\") => %\
Thus, we should use Print instead.
2016-10-20 10:51:16 -07:00
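To make the Printf/Print difference above concrete, here is a standalone sketch with a hypothetical JSON payload containing a literal '%' (any such payload triggers the same mangling):

    package main

    import "fmt"

    func main() {
        raw := `{"key":"/rate","value":"100%"}` // hypothetical node JSON containing '%'
        fmt.Printf(raw) // the %" sequence is parsed as a verb and prints as %!"(MISSING)
        fmt.Print(raw)  // prints the string verbatim
    }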
3fd64f913a auth: fix return type on 'hasRootRole' 2016-10-12 13:59:27 -07:00
f935290bbc mvcc: fix rev inconsistency
Try:

./etcdctl put foo bar
./etcdctl del foo
./etcdctl compact 3

restart etcd

./etcdctl get foo
mvcc: required revision has been compacted

The error is unexpected when ranging over the head revision.

Internally, we incorrectly set the current revision smaller than the
compacted revision when we remove all keys around the compacted revision.

This commit fixes the issue by recovering the current revision to at
least the compacted revision.
2016-10-12 13:08:26 -07:00
ca91f898a2 auth, e2e, clientv3: the root role should be granted access to every key
This commit changes the semantics of the root role. The role should be
able to access every key.

Partially fixes https://github.com/coreos/etcd/issues/6355
2016-10-11 12:19:46 -07:00
fcbada7798 Merge pull request #6622 from luxas/backport_arm_fixes
Backport arm fixes
2016-10-11 12:15:58 -07:00
fad9bdc3e1 etcdserver: atomic access alignment
Most fields accessed with sync/atomic functions are 64bit aligned, but a couple
are not.  This makes comments out of date and therefore misleading.

Affected fields reordered, comments scrubbed and updated.
2016-10-11 11:48:43 +03:00
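For context on the alignment constraint behind these three commits: on 32-bit platforms (386, ARM) the 64-bit sync/atomic functions require 8-byte aligned operands, and Go only guarantees that alignment for 64-bit words at the start of a struct, so atomically accessed 64-bit fields are conventionally kept first. A generic sketch of the layout rule, not etcd's actual structs:

    import "sync/atomic"

    // counters keeps its atomically accessed 64-bit fields first so they stay
    // 8-byte aligned on 32-bit platforms; smaller fields follow.
    type counters struct {
        proposals uint64 // accessed with sync/atomic
        leaderID  uint64 // accessed with sync/atomic
        name      string
    }

    func (c *counters) incProposals() uint64 {
        return atomic.AddUint64(&c.proposals, 1)
    }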
198ccb8b7b raftpb: atomic access alignment
The Entry struct has misaligned fields that are accessed atomically.  The
misalignment is caused by the EntryType enum which the Protocol Buffers
spec forces to be a 32bit int.

Moving the order of the fields without renumbering them in the .proto file
seems to align the go structure without changing the wire format.
2016-10-11 11:48:43 +03:00
dc5d5c6ac8 raft: atomic access alignment
The relevant structures are properly aligned, however, there is no comment
highlighting the need to keep it aligned as is present elsewhere in the
codebase.

Adding note to keep alignment, in line with similar comments in the codebase.
2016-10-11 11:48:43 +03:00
f771eaca47 version: bump to v3.0.12+git 2016-10-07 16:42:12 -07:00
2d1e2e8e64 version: bump to v3.0.12 2016-10-07 15:14:25 -07:00
6412758177 v3rpc: remove redundant locks 2016-10-07 15:13:56 -07:00
836c8159f6 v3rpc: lock progress and prevKV map correctly 2016-10-07 15:13:12 -07:00
e406e6e8f4 etcdctl/ctlv3: add 'prev-kv' flag to watch command 2016-10-07 14:23:09 -07:00
2fa2c6284e clientv3: add 'prevKV' field to watch request 2016-10-07 14:22:58 -07:00
2862c4fa12 v3rpc: implement 'prev-kv' watch 2016-10-07 14:22:19 -07:00
6f89fbf8b5 etcdserver: use mvcc.WatchableKV for prev-kv watch 2016-10-07 14:22:00 -07:00
6ae7ec9a3f *: regenerate proto 2016-10-07 14:21:19 -07:00
4a35b1b20a etcdserverpb: add 'prev_kv' to WatchCreateRequest 2016-10-07 14:20:46 -07:00
c859c97ee2 mvccpb: add 'prev_kv' field 2016-10-07 14:19:59 -07:00
a091c629e1 version: bump to v3.0.11+git 2016-10-07 13:25:21 -07:00
96de94a584 version: bump to v3.0.11 2016-10-07 11:27:48 -07:00
e9cd8410d7 integration: add 'prevKV' to TestV3DeleteRange 2016-10-07 11:03:19 -07:00
e37ede1d2e etcdserver: handle 'PrevKV' 2016-10-07 11:00:48 -07:00
4420a29ac4 etcdctl/ctlv3: add 'prev-kv' flag 2016-10-07 10:56:06 -07:00
0544d4bfd0 clientv3: add WithPrevKV OpOption 2016-10-07 10:54:45 -07:00
fe7379f102 clientv3: add Op.prevKV 2016-10-07 10:51:01 -07:00
c76df5052b *: update proto to add 'prev_kv' 2016-10-07 10:47:47 -07:00
3299cad1c3 *: add put prevkv 2016-10-07 10:39:08 -07:00
d9ab018c49 integration: test a canceled watch won't return a closing error 2016-10-05 14:19:36 -07:00
e853451cd2 clientv3: only return closing error to watcher if context is not canceled
Fixes #6503
2016-10-05 14:19:32 -07:00
1becf9d2f5 clientv3: fix race on watch initial revision
The initial revision was being updated in the substream goroutine defer;
this was racing with the resume path fetching the initial revision when
the substream closes during resume. Instead, update the initial revision
whenever the substream processes a new watch response. Since the substream
cannot receive a watch response while it is resuming, the write to the
initial revision is ordered to always happen after the resume read.

Fixes #6586
2016-10-05 10:56:36 -07:00
1a712cf187 clientv3: make IsProgressNotify() false on compact event and closed channel
Fixes #6549
2016-10-04 15:13:02 -07:00
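A short usage sketch of the progress-notify behavior being fixed here (assumes an existing *clientv3.Client named cli and the usual "context", "fmt" and clientv3 imports):

    rch := cli.Watch(context.Background(), "foo", clientv3.WithProgressNotify())
    for wresp := range rch {
        if wresp.IsProgressNotify() {
            // periodic empty response; only the header revision is of interest
            fmt.Println("progress notify at revision", wresp.Header.Revision)
            continue
        }
        for _, ev := range wresp.Events {
            fmt.Printf("%s %q : %q\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
        }
    }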
023f335f67 wal: set PageWriter offset in file encoder 2016-10-04 15:12:47 -07:00
bf0da78b63 pkg/ioutil: configure pageOffset in NewPageWriter 2016-10-04 15:12:46 -07:00
e8473850a2 integration: test canceling watchers when disconnected 2016-10-04 15:12:37 -07:00
b836d187fd clientv3: simplify watch synchronization
Was more complicated than it needed to be and didn't really work in the
first place. Restructured watcher registration to use a queue.
2016-10-04 15:12:18 -07:00
9b09229c4d version: bump to v3.0.10+git 2016-09-23 11:13:45 -07:00
546c0f7ed6 version: bump to v3.0.10 2016-09-23 10:49:03 -07:00
adbad1c9b5 ctlv3: close snapshot file before rename (Windows) 2016-09-23 09:11:02 -07:00
273b986751 clientv3: process closed watcherStreams in watcherGrpcStream run loop
Was racing with Watch() when closing the grpc stream on no watchers.

Fixes #6476
2016-09-21 15:52:20 -07:00
5b205729b9 rafthttp: add v3.0.0 to supported streams 2016-09-16 21:54:55 +09:00
fe900b09dd version: bump to v3.0.9+git 2016-09-15 15:10:23 -07:00
494c012659 version: bump to v3.0.9 2016-09-15 12:56:33 -07:00
4abc381ebe clientv3: drain buffered WatchResponses before resuming
Otherwise, the watcherStream can receive WatchResponses in the
middle of a resume, corrupting the stream.

Fixes #6364
2016-09-15 12:38:15 -07:00
73c8fdac53 integration: fix compilation for backported Election test 2016-09-15 11:45:37 -07:00
ee2717493a ctlv3: fix line parsing for Windows 2016-09-15 11:25:53 -07:00
2435eb9ecd clientv3: balancer panics when Up() is called after Close()
Fix the issue by adding a simple guard variable.
2016-09-15 18:46:26 +09:00
8fb533dabe embed: warn on domain name in listener 2016-09-15 18:46:19 +09:00
2f0f5ac504 Revert "Merge pull request #6365 from heyitsanthony/fix-dns-bind"
This reverts commit af5ab7b351, reversing
changes made to da6a0f0594.
2016-09-15 18:43:46 +09:00
9ab811d478 auth: fix range handling bugs.
Test 15, counting from zero, in TestGetMergedPerms
in etcd/auth/range_perm_cache_test.go was incorrectly
asserting that [a, b) merged with [b, "") should be
[a, b). Added a test specifically for this. This patch
fixes the incorrect larger test and the bugs in the
code that it was hiding.

Fixes #6359
2016-09-15 18:41:56 +09:00
e0a99fb4ba version: bump to v3.0.8+git 2016-09-09 15:56:31 -07:00
d40982fc91 version: bump to v3.0.8 2016-09-09 13:14:44 -07:00
fe3a1cc31b wal: fix error type 2016-09-09 09:11:25 +09:00
70713706a1 wal: fix err shadowing (go vet) 2016-09-09 09:07:48 +09:00
0054e7e89b etcdctl: restore should create a snapshot
Restore should create a snapshot so the new db file
can be sent to a newly joined member.
2016-09-09 09:03:51 +09:00
97f718b504 fileutil: windows OpenDir
Windows needs to open a directory with write access to fsync but the go
runtime won't open directories that way.
2016-09-09 09:01:56 +09:00
202da9270e wal: fsync directory after wal file rename
Fixes #6368
2016-09-09 09:01:49 +09:00
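The pattern behind these two commits, sketched for a unix-like system (etcd's real helpers live in pkg/fileutil; this only illustrates the idea): after renaming a file into place, the parent directory itself must be fsynced or the rename may not survive a crash. Per the commit above, Windows additionally needs the directory opened with write access, which plain os.Open does not provide, hence the OpenDir helper.

    import (
        "os"
        "path/filepath"
    )

    // renameAndSync renames src to dst and fsyncs the parent directory so the
    // rename itself is durable. Sketch only; not etcd's implementation.
    func renameAndSync(src, dst string) error {
        if err := os.Rename(src, dst); err != nil {
            return err
        }
        dir, err := os.Open(filepath.Dir(dst))
        if err != nil {
            return err
        }
        defer dir.Close()
        return dir.Sync() // persist the directory entry created by the rename
    }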
6e83ec0ed7 etcdmain: reject binding listeners to domain names
Fixes #6336
2016-09-07 08:08:35 +09:00
5c44cdfdaa etcdctl/ctlv3: don't crash when we should prompt for pw.
When 'etcdctl --user name get blah' is invoked and
prompts for a password, don't panic.

Addresses the segfault part of #6343
2016-09-04 09:02:50 +09:00
09a239f040 e2e: add quoted key/value to txn test 2016-09-04 09:02:47 +09:00
3faff8b2e2 etcdctl: fix quoted string handling in txn and watch
Fixes #6315
2016-09-04 09:02:28 +09:00
2345fda18e version: bump to v3.0.7+git 2016-08-31 16:41:06 -07:00
5695120efc version: bump to v3.0.7 2016-08-31 09:49:24 -07:00
183293e061 wal: lowercase segmentSizeBytes 2016-08-31 09:48:30 -07:00
4b48876f0e clientv3/concurrency: allow election on prefixes of keys.
After winning an election or obtaining a lock, we
auto-append a slash after the provided key prefix.
This avoids the previous deadlock due to waiting
on the wrong key.

Fixes #6278

Conflicts:
	clientv3/concurrency/election.go
	clientv3/concurrency/mutex.go
2016-08-31 09:46:05 -07:00
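Against the v3.0 concurrency API shown in this diff (NewElection and NewMutex still take a *clientv3.Client; later releases take a Session), usage looks roughly like the sketch below; cli is an existing client, and the context, log and clientv3/concurrency imports are assumed. The trailing slash is now appended internally, so callers pass a bare prefix:

    e := concurrency.NewElection(cli, "my-election") // "/" appended internally after this fix
    if err := e.Campaign(context.TODO(), "candidate-1"); err != nil {
        log.Fatal(err)
    }

    m := concurrency.NewMutex(cli, "my-lock")
    if err := m.Lock(context.TODO()); err != nil {
        log.Fatal(err)
    }
    defer m.Unlock(context.TODO())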
5089bf58fb wal: hold file lock while renaming WAL directory on non-Windows
Windows requires this lock to be released before the directory is
renamed. But on unix-like operating systems, releasing the lock and
trying to reacquire it immediately can be flaky if a process is forked
around the same time. The file descriptors are marked as close-on-exec
by the Go runtime, but there is a window between the fork and exec where
another process will be holding the lock.
2016-08-31 09:39:57 -07:00
480a347179 wal: use page buffered writer for writing records
Forces torn writes to only happen on sector boundaries.

Fixes #6271
2016-08-30 21:06:36 -07:00
59e560c7a7 ioutil: add page buffered writer
A buffered writer that only writes full pages or when explicitly flushed.
2016-08-30 21:06:33 -07:00
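A rough, generic sketch of the "only write full pages" idea (not etcd's pkg/ioutil implementation, which also tracks a starting page offset): buffer incoming bytes and hand them to the underlying writer only in whole-page multiples, so a torn write can land only on a page boundary; Flush pushes out whatever partial page remains.

    import "io"

    type pageWriter struct {
        w        io.Writer
        pageSize int
        buf      []byte
    }

    func (pw *pageWriter) Write(p []byte) (int, error) {
        pw.buf = append(pw.buf, p...)
        // emit only whole pages; the remainder stays buffered
        if n := (len(pw.buf) / pw.pageSize) * pw.pageSize; n > 0 {
            if _, err := pw.w.Write(pw.buf[:n]); err != nil {
                return 0, err
            }
            pw.buf = pw.buf[n:]
        }
        return len(p), nil
    }

    func (pw *pageWriter) Flush() error {
        if len(pw.buf) == 0 {
            return nil
        }
        _, err := pw.w.Write(pw.buf)
        pw.buf = pw.buf[:0]
        return err
    }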
0bd9bea2e9 etcdserver: allow zero kv index for cluster upgrade
If a user upgrades etcd from 2.3.x to 3.0 and shuts down the
cluster immediately without triggering any new backend writes,
the consistent index in the backend will be zero.

The user then cannot restart etcdserver due to the strict index
match checking. We have to loosen the check a bit for this case.
2016-08-30 21:05:20 -07:00
bd7581ac59 wal: zero out wal tail past its first zero record
Whenever the WAL is opened for writes, it should write zeroes to its tail
starting from the first zero record. Otherwise, if there are entries past
the first zero record due to a torn write, any new writes that overlap the
old entries will lead to a garbage record on the tail and cause a CRC
mismatch.
2016-08-26 14:27:53 -07:00
db378c3d26 wal: test for truncation on torn writes 2016-08-26 14:27:51 -07:00
23740162dc fileutil: add ZeroToEnd for zeroing files 2016-08-26 14:27:49 -07:00
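A hedged sketch of what "zeroing a file to its end" can look like (etcd's real helper is fileutil.ZeroToEnd; this only illustrates the trick): truncating the file down to the current offset and back up to its original size leaves the tail reading as zeros on most filesystems without writing it byte by byte.

    import (
        "io"
        "os"
    )

    func zeroToEnd(f *os.File) error {
        off, err := f.Seek(0, io.SeekCurrent)
        if err != nil {
            return err
        }
        end, err := f.Seek(0, io.SeekEnd)
        if err != nil {
            return err
        }
        // shrink to the current offset, then grow back; the re-extended
        // region reads back as zeros
        if err = f.Truncate(off); err != nil {
            return err
        }
        if err = f.Truncate(end); err != nil {
            return err
        }
        _, err = f.Seek(off, io.SeekStart)
        return err
    }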
96422a955f discovery: reject IP address records in SRVGetCluster
Was incorrectly trimming the trailing '.' from the target; this in turn
caused the etcd server to accept any SRV record with an IP target
instead of only targets with A records.
2016-08-24 09:14:47 -07:00
6fd996fdac version: bump to v3.0.6+git 2016-08-19 12:38:13 -07:00
9efa00d103 version: bump to v3.0.6 2016-08-19 12:03:02 -07:00
72d30f4c34 *: minor cleanup for lease 2016-08-19 11:53:38 -07:00
2e92779777 mvcc: attach keys to leases after recover all state
The previous logic is wrong. With a history like Put(foo, bar, lease1)
and Put(foo, bar, lease2), we end up attaching foo to both lease 1 and
lease 2. Similar problems occur for detach when clearing the lease of a key.

Fix this by attaching leases only at the end of recovery, using a map to
keep the last lease attachment state.
2016-08-19 11:49:05 -07:00
404415b1e3 lease: do lease deletion in the kv txn 2016-08-19 11:49:05 -07:00
07e421d245 lease: delete kvs in a txn 2016-08-19 11:49:05 -07:00
a7d6e29275 etcdserver: always recover lessor first 2016-08-19 11:49:05 -07:00
1a8b295dab vendor: update grpc/grpc-go for clientconn patch 2016-08-19 11:46:51 -07:00
ffc45cc066 rafthttp: fix race between streamReader.stop() and connection closer 2016-08-19 11:45:39 -07:00
0db1ba8093 version: bump to v3.0.5+git 2016-08-19 11:11:10 -07:00
94 changed files with 2816 additions and 2400 deletions

View File

@ -4,8 +4,7 @@ go_import_path: github.com/coreos/etcd
sudo: false
go:
- 1.6
- tip
- 1.6.4
env:
global:

View File

@ -427,6 +427,7 @@ Empty field.
| ----- | ----------- | ---- |
| key | key is the first key to delete in the range. | bytes |
| range_end | range_end is the key following the last key to delete for the range [key, range_end). If range_end is not given, the range is defined to contain only the key argument. If range_end is '\0', the range is all keys greater than or equal to the key argument. | bytes |
| prev_kv | If prev_kv is set, etcd gets the previous key-value pairs before deleting them. The previous key-value pairs will be returned in the delete response. | bool |
@ -436,6 +437,7 @@ Empty field.
| ----- | ----------- | ---- |
| header | | ResponseHeader |
| deleted | deleted is the number of keys deleted by the delete range request. | int64 |
| prev_kvs | if prev_kv is set in the request, the previous key-value pairs will be returned. | (slice of) mvccpb.KeyValue |
@ -591,6 +593,7 @@ Empty field.
| key | key is the key, in bytes, to put into the key-value store. | bytes |
| value | value is the value, in bytes, to associate with the key in the key-value store. | bytes |
| lease | lease is the lease ID to associate with the key in the key-value store. A lease value of 0 indicates no lease. | int64 |
| prev_kv | If prev_kv is set, etcd gets the previous key-value pair before changing it. The previous key-value pair will be returned in the put response. | bool |
@ -599,6 +602,7 @@ Empty field.
| Field | Description | Type |
| ----- | ----------- | ---- |
| header | | ResponseHeader |
| prev_kv | if prev_kv is set in the request, the previous key-value pair will be returned. | mvccpb.KeyValue |
@ -735,6 +739,7 @@ From google paxosdb paper: Our implementation hinges around a powerful primitive
| range_end | range_end is the end of the range [key, range_end) to watch. If range_end is not given, only the key argument is watched. If range_end is equal to '\0', all keys greater than or equal to the key argument are watched. | bytes |
| start_revision | start_revision is an optional revision to watch from (inclusive). No start_revision is "now". | int64 |
| progress_notify | progress_notify is set so that the etcd server will periodically send a WatchResponse with no events to the new watcher if there are no recent events. It is useful when clients wish to recover a disconnected watcher starting from a recent known revision. The etcd server may decide how often it will send notifications based on current load. | bool |
| prev_kv | If prev_kv is set, created watcher gets the previous KV before the event happens. If the previous KV is already compacted, nothing will be returned. | bool |
@ -767,6 +772,7 @@ From google paxosdb paper: Our implementation hinges around a powerful primitive
| ----- | ----------- | ---- |
| type | type is the kind of event. If type is a PUT, it indicates new data has been stored to the key. If type is a DELETE, it indicates the key was deleted. | EventType |
| kv | kv holds the KeyValue for the event. A PUT event contains current kv pair. A PUT event with kv.Version=1 indicates the creation of a key. A DELETE/EXPIRE event contains the deleted key with its modification revision set to the revision of deletion. | KeyValue |
| prev_kv | prev_kv holds the key-value pair before the event happens. | KeyValue |
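The [key, range_end) convention documented in these tables is driven from clientv3 with the range OpOptions; a short sketch, assuming an existing client cli and the usual context/clientv3 imports:

    // keys in ["foo", "fop"), i.e. every key with prefix "foo"
    gresp, err := cli.Get(context.TODO(), "foo", clientv3.WithRange("fop"))

    // range_end "\x00": every key greater than or equal to "a"
    dresp, err := cli.Delete(context.TODO(), "a", clientv3.WithFromKey())

    // watch a whole prefix; the range_end is computed from the key
    wch := cli.Watch(context.TODO(), "foo", clientv3.WithPrefix())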

View File

@ -1474,6 +1474,11 @@
"format": "byte",
"description": "key is the first key to delete in the range."
},
"prev_kv": {
"type": "boolean",
"format": "boolean",
"description": "If prev_kv is set, etcd gets the previous key-value pairs before deleting it.\nThe previous key-value pairs will be returned in the delte response."
},
"range_end": {
"type": "string",
"format": "byte",
@ -1491,6 +1496,13 @@
},
"header": {
"$ref": "#/definitions/etcdserverpbResponseHeader"
},
"prev_kvs": {
"type": "array",
"items": {
"$ref": "#/definitions/mvccpbKeyValue"
},
"description": "if prev_kv is set in the request, the previous key-value pairs will be returned."
}
}
},
@ -1724,6 +1736,11 @@
"format": "int64",
"description": "lease is the lease ID to associate with the key in the key-value store. A lease\nvalue of 0 indicates no lease."
},
"prev_kv": {
"type": "boolean",
"format": "boolean",
"description": "If prev_kv is set, etcd gets the previous key-value pair before changing it.\nThe previous key-value pair will be returned in the put response."
},
"value": {
"type": "string",
"format": "byte",
@ -1736,6 +1753,10 @@
"properties": {
"header": {
"$ref": "#/definitions/etcdserverpbResponseHeader"
},
"prev_kv": {
"$ref": "#/definitions/mvccpbKeyValue",
"description": "if prev_kv is set in the request, the previous key-value pair will be returned."
}
}
},
@ -1988,6 +2009,11 @@
"format": "byte",
"description": "key is the key to register for watching."
},
"prev_kv": {
"type": "boolean",
"format": "boolean",
"description": "If prev_kv is set, created watcher gets the previous KV before the event happens.\nIf the previous KV is already compacted, nothing will be returned."
},
"progress_notify": {
"type": "boolean",
"format": "boolean",
@ -2057,6 +2083,10 @@
"$ref": "#/definitions/mvccpbKeyValue",
"description": "kv holds the KeyValue for the event.\nA PUT event contains current kv pair.\nA PUT event with kv.Version=1 indicates the creation of a key.\nA DELETE/EXPIRE event contains the deleted key with\nits modification revision set to the revision of deletion."
},
"prev_kv": {
"$ref": "#/definitions/mvccpbKeyValue",
"description": "prev_kv holds the key-value pair before the event happens."
},
"type": {
"$ref": "#/definitions/EventEventType",
"description": "type is the kind of event. If type is a PUT, it indicates\nnew data has been stored to the key. If type is a DELETE,\nit indicates the key was deleted."

View File

@ -21,9 +21,9 @@ import (
proto "github.com/golang/protobuf/proto"
math "math"
)
import io "io"
io "io"
)
// Reference imports to suppress errors if they are not otherwise used.
var _ = proto.Marshal

View File

@ -22,7 +22,10 @@ import (
"github.com/coreos/etcd/mvcc/backend"
)
// isSubset returns true if a is a subset of b
// isSubset returns true if a is a subset of b.
// If a is a prefix of b, then a is a subset of b.
// Given intervals [a1,a2) and [b1,b2), is
// the a interval a subset of b?
func isSubset(a, b *rangePerm) bool {
switch {
case len(a.end) == 0 && len(b.end) == 0:
@ -32,9 +35,11 @@ func isSubset(a, b *rangePerm) bool {
// b is a key, a is a range
return false
case len(a.end) == 0:
return 0 <= bytes.Compare(a.begin, b.begin) && bytes.Compare(a.begin, b.end) <= 0
// a is a key, b is a range. need b1 <= a1 and a1 < b2
return bytes.Compare(b.begin, a.begin) <= 0 && bytes.Compare(a.begin, b.end) < 0
default:
return 0 <= bytes.Compare(a.begin, b.begin) && bytes.Compare(a.end, b.end) <= 0
// both are ranges. need b1 <= a1 and a2 <= b2
return bytes.Compare(b.begin, a.begin) <= 0 && bytes.Compare(a.end, b.end) <= 0
}
}
@ -88,12 +93,18 @@ func mergeRangePerms(perms []*rangePerm) []*rangePerm {
i := 0
for i < len(perms) {
begin, next := i, i
for next+1 < len(perms) && bytes.Compare(perms[next].end, perms[next+1].begin) != -1 {
for next+1 < len(perms) && bytes.Compare(perms[next].end, perms[next+1].begin) >= 0 {
next++
}
merged = append(merged, &rangePerm{begin: perms[begin].begin, end: perms[next].end})
// don't merge ["a", "b") with ["b", ""), because perms[next+1].end is empty.
if next != begin && len(perms[next].end) > 0 {
merged = append(merged, &rangePerm{begin: perms[begin].begin, end: perms[next].end})
} else {
merged = append(merged, perms[begin])
if next != begin {
merged = append(merged, perms[next])
}
}
i = next + 1
}

View File

@ -46,6 +46,10 @@ func TestGetMergedPerms(t *testing.T) {
[]*rangePerm{{[]byte("a"), []byte("b")}},
[]*rangePerm{{[]byte("a"), []byte("b")}},
},
{
[]*rangePerm{{[]byte("a"), []byte("b")}, {[]byte("b"), []byte("")}},
[]*rangePerm{{[]byte("a"), []byte("b")}, {[]byte("b"), []byte("")}},
},
{
[]*rangePerm{{[]byte("a"), []byte("b")}, {[]byte("b"), []byte("c")}},
[]*rangePerm{{[]byte("a"), []byte("c")}},
@ -106,7 +110,7 @@ func TestGetMergedPerms(t *testing.T) {
},
{
[]*rangePerm{{[]byte("a"), []byte("")}, {[]byte("b"), []byte("c")}, {[]byte("b"), []byte("")}, {[]byte("c"), []byte("")}, {[]byte("d"), []byte("")}},
[]*rangePerm{{[]byte("a"), []byte("")}, {[]byte("b"), []byte("c")}, {[]byte("d"), []byte("")}},
[]*rangePerm{{[]byte("a"), []byte("")}, {[]byte("b"), []byte("c")}, {[]byte("c"), []byte("")}, {[]byte("d"), []byte("")}},
},
// duplicate ranges
{

View File

@ -603,6 +603,11 @@ func (as *authStore) isOpPermitted(userName string, key, rangeEnd []byte, permTy
return false
}
// root role should have permission on all ranges
if hasRootRole(user) {
return true
}
if as.isRangeOpPermitted(tx, userName, key, rangeEnd, permTyp) {
return true
}

View File

@ -45,6 +45,8 @@ type simpleBalancer struct {
// pinAddr is the currently pinned address; set to the empty string on
// initialization and shutdown.
pinAddr string
closed bool
}
func newSimpleBalancer(eps []string) *simpleBalancer {
@ -74,15 +76,25 @@ func (b *simpleBalancer) ConnectNotify() <-chan struct{} {
func (b *simpleBalancer) Up(addr grpc.Address) func(error) {
b.mu.Lock()
defer b.mu.Unlock()
// gRPC might call Up after it called Close. We add this check
// to "fix" it up at application layer. Or our simplerBalancer
// might panic since b.upc is closed.
if b.closed {
return func(err error) {}
}
if len(b.upEps) == 0 {
// notify waiting Get()s and pin first connected address
close(b.upc)
b.pinAddr = addr.Addr
}
b.upEps[addr.Addr] = struct{}{}
b.mu.Unlock()
// notify client that a connection is up
b.readyOnce.Do(func() { close(b.readyc) })
return func(err error) {
b.mu.Lock()
delete(b.upEps, addr.Addr)
@ -128,13 +140,19 @@ func (b *simpleBalancer) Notify() <-chan []grpc.Address { return b.notifyCh }
func (b *simpleBalancer) Close() error {
b.mu.Lock()
defer b.mu.Unlock()
// In case gRPC calls close twice. TODO: remove the checking
// when we are sure that gRPC won't call close twice.
if b.closed {
return nil
}
b.closed = true
close(b.notifyCh)
// terminate all waiting Get()s
b.pinAddr = ""
if len(b.upEps) == 0 {
close(b.upc)
}
b.mu.Unlock()
return nil
}

View File

@ -40,7 +40,7 @@ type Election struct {
// NewElection returns a new election on a given key prefix.
func NewElection(client *v3.Client, pfx string) *Election {
return &Election{client: client, keyPrefix: pfx}
return &Election{client: client, keyPrefix: pfx + "/"}
}
// Campaign puts a value as eligible for the election. It blocks until
@ -59,7 +59,6 @@ func (e *Election) Campaign(ctx context.Context, val string) error {
if err != nil {
return err
}
e.leaderKey, e.leaderRev, e.leaderSession = k, resp.Header.Revision, s
if !resp.Succeeded {
kv := resp.Responses[0].GetResponseRange().Kvs[0]

View File

@ -32,7 +32,7 @@ type Mutex struct {
}
func NewMutex(client *v3.Client, pfx string) *Mutex {
return &Mutex{client, pfx, "", -1}
return &Mutex{client, pfx + "/", "", -1}
}
// Lock locks the mutex with a cancellable context. If the context is cancelled
@ -43,7 +43,7 @@ func (m *Mutex) Lock(ctx context.Context) error {
return serr
}
m.myKey = fmt.Sprintf("%s/%x", m.pfx, s.Lease())
m.myKey = fmt.Sprintf("%s%x", m.pfx, s.Lease())
cmp := v3.Compare(v3.CreateRevision(m.myKey), "=", 0)
// put self in lock waiters via myKey; oldest waiter holds lock
put := v3.OpPut(m.myKey, "", v3.WithLease(s.Lease()))

View File

@ -32,35 +32,63 @@ func ExampleAuth() {
}
defer cli.Close()
authapi := clientv3.NewAuth(cli)
if _, err = authapi.RoleAdd(context.TODO(), "root"); err != nil {
if _, err = cli.RoleAdd(context.TODO(), "root"); err != nil {
log.Fatal(err)
}
if _, err = cli.UserAdd(context.TODO(), "root", "123"); err != nil {
log.Fatal(err)
}
if _, err = cli.UserGrantRole(context.TODO(), "root", "root"); err != nil {
log.Fatal(err)
}
if _, err = authapi.RoleGrantPermission(
if _, err = cli.RoleAdd(context.TODO(), "r"); err != nil {
log.Fatal(err)
}
if _, err = cli.RoleGrantPermission(
context.TODO(),
"root", // role name
"foo", // key
"zoo", // range end
"r", // role name
"foo", // key
"zoo", // range end
clientv3.PermissionType(clientv3.PermReadWrite),
); err != nil {
log.Fatal(err)
}
if _, err = authapi.UserAdd(context.TODO(), "root", "123"); err != nil {
if _, err = cli.UserAdd(context.TODO(), "u", "123"); err != nil {
log.Fatal(err)
}
if _, err = authapi.UserGrantRole(context.TODO(), "root", "root"); err != nil {
if _, err = cli.UserGrantRole(context.TODO(), "u", "r"); err != nil {
log.Fatal(err)
}
if _, err = authapi.AuthEnable(context.TODO()); err != nil {
if _, err = cli.AuthEnable(context.TODO()); err != nil {
log.Fatal(err)
}
cliAuth, err := clientv3.New(clientv3.Config{
Endpoints: endpoints,
DialTimeout: dialTimeout,
Username: "u",
Password: "123",
})
if err != nil {
log.Fatal(err)
}
defer cliAuth.Close()
if _, err = cliAuth.Put(context.TODO(), "foo1", "bar"); err != nil {
log.Fatal(err)
}
_, err = cliAuth.Txn(context.TODO()).
If(clientv3.Compare(clientv3.Value("zoo1"), ">", "abc")).
Then(clientv3.OpPut("zoo1", "XYZ")).
Else(clientv3.OpPut("zoo1", "ABC")).
Commit()
fmt.Println(err)
// now check the permission with the root account
rootCli, err := clientv3.New(clientv3.Config{
Endpoints: endpoints,
DialTimeout: dialTimeout,
Username: "root",
@ -69,31 +97,17 @@ func ExampleAuth() {
if err != nil {
log.Fatal(err)
}
defer cliAuth.Close()
defer rootCli.Close()
kv := clientv3.NewKV(cliAuth)
if _, err = kv.Put(context.TODO(), "foo1", "bar"); err != nil {
log.Fatal(err)
}
_, err = kv.Txn(context.TODO()).
If(clientv3.Compare(clientv3.Value("zoo1"), ">", "abc")).
Then(clientv3.OpPut("zoo1", "XYZ")).
Else(clientv3.OpPut("zoo1", "ABC")).
Commit()
fmt.Println(err)
// now check the permission
authapi2 := clientv3.NewAuth(cliAuth)
resp, err := authapi2.RoleGet(context.TODO(), "root")
resp, err := rootCli.RoleGet(context.TODO(), "r")
if err != nil {
log.Fatal(err)
}
fmt.Printf("root user permission: key %q, range end %q\n", resp.Perm[0].Key, resp.Perm[0].RangeEnd)
fmt.Printf("user u permission: key %q, range end %q\n", resp.Perm[0].Key, resp.Perm[0].RangeEnd)
if _, err = authapi2.AuthDisable(context.TODO()); err != nil {
if _, err = rootCli.AuthDisable(context.TODO()); err != nil {
log.Fatal(err)
}
// Output: etcdserver: permission denied
// root user permission: key "foo", range end "zoo"
// user u permission: key "foo", range end "zoo"
}

View File

@ -455,3 +455,46 @@ func TestLeaseKeepAliveTTLTimeout(t *testing.T) {
clus.Members[0].Restart(t)
}
// TestLeaseRenewLostQuorum ensures keepalives work after losing quorum
// for a while.
func TestLeaseRenewLostQuorum(t *testing.T) {
defer testutil.AfterTest(t)
clus := integration.NewClusterV3(t, &integration.ClusterConfig{Size: 3})
defer clus.Terminate(t)
cli := clus.Client(0)
r, err := cli.Grant(context.TODO(), 4)
if err != nil {
t.Fatal(err)
}
kctx, kcancel := context.WithCancel(context.Background())
defer kcancel()
ka, err := cli.KeepAlive(kctx, r.ID)
if err != nil {
t.Fatal(err)
}
// consume first keepalive so next message sends when cluster is down
<-ka
// force keepalive stream message to timeout
clus.Members[1].Stop(t)
clus.Members[2].Stop(t)
// Use TTL-1 since the client closes the keepalive channel if no
// keepalive arrives before the lease deadline.
// The cluster has 1 second to recover and reply to the keepalive.
time.Sleep(time.Duration(r.TTL-1) * time.Second)
clus.Members[1].Restart(t)
clus.Members[2].Restart(t)
select {
case _, ok := <-ka:
if !ok {
t.Fatalf("keepalive closed")
}
case <-time.After(time.Duration(r.TTL) * time.Second):
t.Fatalf("timed out waiting for keepalive")
}
}

View File

@ -673,3 +673,131 @@ func TestWatchWithRequireLeader(t *testing.T) {
t.Fatalf("expected response, got closed channel")
}
}
// TestWatchOverlapContextCancel stresses the watcher stream teardown path by
// creating/canceling watchers to ensure that new watchers are not taken down
// by a torn down watch stream. The sort of race that's being detected:
// 1. create w1 using a cancelable ctx with %v as "ctx"
// 2. cancel ctx
// 3. watcher client begins tearing down watcher grpc stream since no more watchers
// 4. start creating watcher w2 using a new "ctx" (not canceled), attaches to old grpc stream
// 5. watcher client finishes tearing down stream on "ctx"
// 6. w2 comes back canceled
func TestWatchOverlapContextCancel(t *testing.T) {
f := func(clus *integration.ClusterV3) {}
testWatchOverlapContextCancel(t, f)
}
func TestWatchOverlapDropConnContextCancel(t *testing.T) {
f := func(clus *integration.ClusterV3) {
clus.Members[0].DropConnections()
}
testWatchOverlapContextCancel(t, f)
}
func testWatchOverlapContextCancel(t *testing.T, f func(*integration.ClusterV3)) {
defer testutil.AfterTest(t)
clus := integration.NewClusterV3(t, &integration.ClusterConfig{Size: 1})
defer clus.Terminate(t)
// each unique context "%v" has a unique grpc stream
n := 100
ctxs, ctxc := make([]context.Context, 5), make([]chan struct{}, 5)
for i := range ctxs {
// make "%v" unique
ctxs[i] = context.WithValue(context.TODO(), "key", i)
// limits the maximum number of outstanding watchers per stream
ctxc[i] = make(chan struct{}, 2)
}
// issue concurrent watches on "abc" with cancel
cli := clus.RandClient()
if _, err := cli.Put(context.TODO(), "abc", "def"); err != nil {
t.Fatal(err)
}
ch := make(chan struct{}, n)
for i := 0; i < n; i++ {
go func() {
defer func() { ch <- struct{}{} }()
idx := rand.Intn(len(ctxs))
ctx, cancel := context.WithCancel(ctxs[idx])
ctxc[idx] <- struct{}{}
wch := cli.Watch(ctx, "abc", clientv3.WithRev(1))
f(clus)
select {
case _, ok := <-wch:
if !ok {
t.Fatalf("unexpected closed channel %p", wch)
}
// may take a second or two to reestablish a watcher because of
// grpc backoff policies for disconnects
case <-time.After(5 * time.Second):
t.Errorf("timed out waiting for watch on %p", wch)
}
// randomize how cancel overlaps with watch creation
if rand.Intn(2) == 0 {
<-ctxc[idx]
cancel()
} else {
cancel()
<-ctxc[idx]
}
}()
}
// join on watches
for i := 0; i < n; i++ {
select {
case <-ch:
case <-time.After(5 * time.Second):
t.Fatalf("timed out waiting for completed watch")
}
}
}
// TestWatchCancelAndCloseClient ensures that canceling a watcher then immediately
// closing the client does not return a client closing error.
func TestWatchCancelAndCloseClient(t *testing.T) {
defer testutil.AfterTest(t)
clus := integration.NewClusterV3(t, &integration.ClusterConfig{Size: 1})
defer clus.Terminate(t)
cli := clus.Client(0)
ctx, cancel := context.WithCancel(context.Background())
wch := cli.Watch(ctx, "abc")
donec := make(chan struct{})
go func() {
defer close(donec)
select {
case wr, ok := <-wch:
if ok {
t.Fatalf("expected closed watch after cancel(), got resp=%+v err=%v", wr, wr.Err())
}
case <-time.After(5 * time.Second):
t.Fatal("timed out waiting for closed channel")
}
}()
cancel()
if err := cli.Close(); err != nil {
t.Fatal(err)
}
<-donec
clus.TakeClient(0)
}
// TestWatchCancelDisconnected ensures canceling a watcher works when
// its grpc stream is disconnected / reconnecting.
func TestWatchCancelDisconnected(t *testing.T) {
defer testutil.AfterTest(t)
clus := integration.NewClusterV3(t, &integration.ClusterConfig{Size: 1})
defer clus.Terminate(t)
cli := clus.Client(0)
ctx, cancel := context.WithCancel(context.Background())
// add more watches than can be resumed before the cancel
wch := cli.Watch(ctx, "abc")
clus.Members[0].Stop(t)
cancel()
select {
case <-wch:
case <-time.After(time.Second):
t.Fatal("took too long to cancel disconnected watcher")
}
}

View File

@ -157,14 +157,14 @@ func (kv *kv) do(ctx context.Context, op Op) (OpResponse, error) {
}
case tPut:
var resp *pb.PutResponse
r := &pb.PutRequest{Key: op.key, Value: op.val, Lease: int64(op.leaseID)}
r := &pb.PutRequest{Key: op.key, Value: op.val, Lease: int64(op.leaseID), PrevKv: op.prevKV}
resp, err = kv.remote.Put(ctx, r)
if err == nil {
return OpResponse{put: (*PutResponse)(resp)}, nil
}
case tDeleteRange:
var resp *pb.DeleteRangeResponse
r := &pb.DeleteRangeRequest{Key: op.key, RangeEnd: op.end}
r := &pb.DeleteRangeRequest{Key: op.key, RangeEnd: op.end, PrevKv: op.prevKV}
resp, err = kv.remote.DeleteRange(ctx, r)
if err == nil {
return OpResponse{del: (*DeleteResponse)(resp)}, nil

View File

@ -143,9 +143,6 @@ func (l *lessor) Grant(ctx context.Context, ttl int64) (*LeaseGrantResponse, err
if isHaltErr(cctx, err) {
return nil, toErr(ctx, err)
}
if nerr := l.newStream(); nerr != nil {
return nil, nerr
}
}
}
@ -164,9 +161,6 @@ func (l *lessor) Revoke(ctx context.Context, id LeaseID) (*LeaseRevokeResponse,
if isHaltErr(ctx, err) {
return nil, toErr(ctx, err)
}
if nerr := l.newStream(); nerr != nil {
return nil, nerr
}
}
}
@ -213,10 +207,6 @@ func (l *lessor) KeepAliveOnce(ctx context.Context, id LeaseID) (*LeaseKeepAlive
if isHaltErr(ctx, err) {
return nil, toErr(ctx, err)
}
if nerr := l.newStream(); nerr != nil {
return nil, nerr
}
}
}
@ -312,10 +302,23 @@ func (l *lessor) recvKeepAliveLoop() {
// resetRecv opens a new lease stream and starts sending LeaseKeepAliveRequests
func (l *lessor) resetRecv() (pb.Lease_LeaseKeepAliveClient, error) {
if err := l.newStream(); err != nil {
sctx, cancel := context.WithCancel(l.stopCtx)
stream, err := l.remote.LeaseKeepAlive(sctx, grpc.FailFast(false))
if err = toErr(sctx, err); err != nil {
cancel()
return nil, err
}
stream := l.getKeepAliveStream()
l.mu.Lock()
defer l.mu.Unlock()
if l.stream != nil && l.streamCancel != nil {
l.stream.CloseSend()
l.streamCancel()
}
l.streamCancel = cancel
l.stream = stream
go l.sendKeepAliveLoop(stream)
return stream, nil
}
@ -411,32 +414,6 @@ func (l *lessor) sendKeepAliveLoop(stream pb.Lease_LeaseKeepAliveClient) {
}
}
func (l *lessor) getKeepAliveStream() pb.Lease_LeaseKeepAliveClient {
l.mu.Lock()
defer l.mu.Unlock()
return l.stream
}
func (l *lessor) newStream() error {
sctx, cancel := context.WithCancel(l.stopCtx)
stream, err := l.remote.LeaseKeepAlive(sctx, grpc.FailFast(false))
if err != nil {
cancel()
return toErr(sctx, err)
}
l.mu.Lock()
defer l.mu.Unlock()
if l.stream != nil && l.streamCancel != nil {
l.stream.CloseSend()
l.streamCancel()
}
l.streamCancel = cancel
l.stream = stream
return nil
}
func (ka *keepAlive) Close() {
close(ka.donec)
for _, ch := range ka.chs {

View File

@ -47,6 +47,9 @@ type Op struct {
// for range, watch
rev int64
// for watch, put, delete
prevKV bool
// progressNotify is for progress updates.
progressNotify bool
@ -73,10 +76,10 @@ func (op Op) toRequestOp() *pb.RequestOp {
}
return &pb.RequestOp{Request: &pb.RequestOp_RequestRange{RequestRange: r}}
case tPut:
r := &pb.PutRequest{Key: op.key, Value: op.val, Lease: int64(op.leaseID)}
r := &pb.PutRequest{Key: op.key, Value: op.val, Lease: int64(op.leaseID), PrevKv: op.prevKV}
return &pb.RequestOp{Request: &pb.RequestOp_RequestPut{RequestPut: r}}
case tDeleteRange:
r := &pb.DeleteRangeRequest{Key: op.key, RangeEnd: op.end}
r := &pb.DeleteRangeRequest{Key: op.key, RangeEnd: op.end, PrevKv: op.prevKV}
return &pb.RequestOp{Request: &pb.RequestOp_RequestDeleteRange{RequestDeleteRange: r}}
default:
panic("Unknown Op")
@ -212,14 +215,15 @@ func WithPrefix() OpOption {
}
}
// WithRange specifies the range of 'Get' or 'Delete' requests.
// WithRange specifies the range of 'Get', 'Delete', 'Watch' requests.
// For example, 'Get' requests with 'WithRange(end)' returns
// the keys in the range [key, end).
// endKey must be lexicographically greater than start key.
func WithRange(endKey string) OpOption {
return func(op *Op) { op.end = []byte(endKey) }
}
// WithFromKey specifies the range of 'Get' or 'Delete' requests
// WithFromKey specifies the range of 'Get', 'Delete', 'Watch' requests
// to be equal or greater than the key in the argument.
func WithFromKey() OpOption { return WithRange("\x00") }
@ -271,3 +275,11 @@ func WithProgressNotify() OpOption {
op.progressNotify = true
}
}
// WithPrevKV gets the previous key-value pair before the event happens. If the previous KV is already compacted,
// nothing will be returned.
func WithPrevKV() OpOption {
return func(op *Op) {
op.prevKV = true
}
}
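
A brief sketch of how a client consumes the new option (assumes an existing *clientv3.Client named cli; PutResponse.PrevKv and DeleteResponse.PrevKvs carry the returned pairs):

    presp, err := cli.Put(context.TODO(), "foo", "bar2", clientv3.WithPrevKV())
    if err == nil && presp.PrevKv != nil {
        fmt.Printf("replaced %q=%q\n", presp.PrevKv.Key, presp.PrevKv.Value)
    }

    dresp, err := cli.Delete(context.TODO(), "foo", clientv3.WithPrevKV())
    if err == nil {
        for _, kv := range dresp.PrevKvs {
            fmt.Printf("deleted %q=%q\n", kv.Key, kv.Value)
        }
    }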

View File

@ -61,6 +61,9 @@ type WatchResponse struct {
// the channel sends a final response that has Canceled set to true with a non-nil Err().
Canceled bool
// created is used to indicate the creation of the watcher.
created bool
closeErr error
}
@ -89,7 +92,7 @@ func (wr *WatchResponse) Err() error {
// IsProgressNotify returns true if the WatchResponse is progress notification.
func (wr *WatchResponse) IsProgressNotify() bool {
return len(wr.Events) == 0 && !wr.Canceled
return len(wr.Events) == 0 && !wr.Canceled && !wr.created && wr.CompactRevision == 0 && wr.Header.Revision != 0
}
// watcher implements the Watcher interface
@ -102,6 +105,7 @@ type watcher struct {
streams map[string]*watchGrpcStream
}
// watchGrpcStream tracks all watch resources attached to a single grpc stream.
type watchGrpcStream struct {
owner *watcher
remote pb.WatchClient
@ -112,23 +116,25 @@ type watchGrpcStream struct {
ctxKey string
cancel context.CancelFunc
// mu protects the streams map
mu sync.RWMutex
// streams holds all active watchers
streams map[int64]*watcherStream
// substreams holds all active watchers on this grpc stream
substreams map[int64]*watcherStream
// resuming holds all resuming watchers on this grpc stream
resuming []*watcherStream
// reqc sends a watch request from Watch() to the main goroutine
reqc chan *watchRequest
// respc receives data from the watch client
respc chan *pb.WatchResponse
// stopc is sent to the main goroutine to stop all processing
stopc chan struct{}
// donec closes to broadcast shutdown
donec chan struct{}
// errc transmits errors from grpc Recv to the watch stream reconn logic
errc chan error
// closingc gets the watcherStream of closing watchers
closingc chan *watcherStream
// the error that closed the watch stream
// resumec closes to signal that all substreams should begin resuming
resumec chan struct{}
// closeErr is the error that closed the watch stream
closeErr error
}
@ -140,6 +146,8 @@ type watchRequest struct {
rev int64
// progressNotify is for progress updates.
progressNotify bool
// get the previous key-value pair before the event happens
prevKV bool
// retc receives a chan WatchResponse once the watcher is established
retc chan chan WatchResponse
}
@ -150,15 +158,18 @@ type watcherStream struct {
initReq watchRequest
// outc publishes watch responses to subscriber
outc chan<- WatchResponse
outc chan WatchResponse
// recvc buffers watch responses before publishing
recvc chan *WatchResponse
id int64
// donec closes when the watcherStream goroutine stops.
donec chan struct{}
// closing is set to true when stream should be scheduled to shutdown.
closing bool
// id is the registered watch id on the grpc stream
id int64
// lastRev is revision last successfully sent over outc
lastRev int64
// resumec indicates the stream must recover at a given revision
resumec chan int64
// buf holds all events received from etcd but not yet consumed by the client
buf []*WatchResponse
}
func NewWatcher(c *Client) Watcher {
@ -182,18 +193,19 @@ func (vc *valCtx) Err() error { return nil }
func (w *watcher) newWatcherGrpcStream(inctx context.Context) *watchGrpcStream {
ctx, cancel := context.WithCancel(&valCtx{inctx})
wgs := &watchGrpcStream{
owner: w,
remote: w.remote,
ctx: ctx,
ctxKey: fmt.Sprintf("%v", inctx),
cancel: cancel,
streams: make(map[int64]*watcherStream),
owner: w,
remote: w.remote,
ctx: ctx,
ctxKey: fmt.Sprintf("%v", inctx),
cancel: cancel,
substreams: make(map[int64]*watcherStream),
respc: make(chan *pb.WatchResponse),
reqc: make(chan *watchRequest),
stopc: make(chan struct{}),
donec: make(chan struct{}),
errc: make(chan error, 1),
respc: make(chan *pb.WatchResponse),
reqc: make(chan *watchRequest),
donec: make(chan struct{}),
errc: make(chan error, 1),
closingc: make(chan *watcherStream),
resumec: make(chan struct{}),
}
go wgs.run()
return wgs
@ -203,14 +215,14 @@ func (w *watcher) newWatcherGrpcStream(inctx context.Context) *watchGrpcStream {
func (w *watcher) Watch(ctx context.Context, key string, opts ...OpOption) WatchChan {
ow := opWatch(key, opts...)
retc := make(chan chan WatchResponse, 1)
wr := &watchRequest{
ctx: ctx,
key: string(ow.key),
end: string(ow.end),
rev: ow.rev,
progressNotify: ow.progressNotify,
retc: retc,
prevKV: ow.prevKV,
retc: make(chan chan WatchResponse, 1),
}
ok := false
@ -242,7 +254,6 @@ func (w *watcher) Watch(ctx context.Context, key string, opts ...OpOption) Watch
case reqc <- wr:
ok = true
case <-wr.ctx.Done():
wgs.stopIfEmpty()
case <-donec:
if wgs.closeErr != nil {
closeCh <- WatchResponse{closeErr: wgs.closeErr}
@ -255,7 +266,7 @@ func (w *watcher) Watch(ctx context.Context, key string, opts ...OpOption) Watch
// receive channel
if ok {
select {
case ret := <-retc:
case ret := <-wr.retc:
return ret
case <-ctx.Done():
case <-donec:
@ -286,12 +297,7 @@ func (w *watcher) Close() (err error) {
}
func (w *watchGrpcStream) Close() (err error) {
w.mu.Lock()
if w.stopc != nil {
close(w.stopc)
w.stopc = nil
}
w.mu.Unlock()
w.cancel()
<-w.donec
select {
case err = <-w.errc:
@ -300,67 +306,57 @@ func (w *watchGrpcStream) Close() (err error) {
return toErr(w.ctx, err)
}
func (w *watchGrpcStream) addStream(resp *pb.WatchResponse, pendingReq *watchRequest) {
if pendingReq == nil {
// no pending request; ignore
return
}
if resp.Canceled || resp.CompactRevision != 0 {
// a cancel at id creation time means the start revision has
// been compacted out of the store
ret := make(chan WatchResponse, 1)
ret <- WatchResponse{
Header: *resp.Header,
CompactRevision: resp.CompactRevision,
Canceled: true}
close(ret)
pendingReq.retc <- ret
return
}
ret := make(chan WatchResponse)
if resp.WatchId == -1 {
// failed; no channel
close(ret)
pendingReq.retc <- ret
return
}
ws := &watcherStream{
initReq: *pendingReq,
id: resp.WatchId,
outc: ret,
// buffered so unlikely to block on sending while holding mu
recvc: make(chan *WatchResponse, 4),
resumec: make(chan int64),
}
if pendingReq.rev == 0 {
// note the header revision so that a put following a current watcher
// disconnect will arrive on the watcher channel after reconnect
ws.initReq.rev = resp.Header.Revision
}
func (w *watcher) closeStream(wgs *watchGrpcStream) {
w.mu.Lock()
w.streams[ws.id] = ws
close(wgs.donec)
wgs.cancel()
if w.streams != nil {
delete(w.streams, wgs.ctxKey)
}
w.mu.Unlock()
// pass back the subscriber channel for the watcher
pendingReq.retc <- ret
// send messages to subscriber
go w.serveStream(ws)
}
// closeStream closes the watcher resources and removes it
func (w *watchGrpcStream) closeStream(ws *watcherStream) {
w.mu.Lock()
// cancels request stream; subscriber receives nil channel
close(ws.initReq.retc)
// close subscriber's channel
func (w *watchGrpcStream) addSubstream(resp *pb.WatchResponse, ws *watcherStream) {
if resp.WatchId == -1 {
// failed; no channel
close(ws.recvc)
return
}
ws.id = resp.WatchId
w.substreams[ws.id] = ws
}
func (w *watchGrpcStream) sendCloseSubstream(ws *watcherStream, resp *WatchResponse) {
select {
case ws.outc <- *resp:
case <-ws.initReq.ctx.Done():
case <-time.After(closeSendErrTimeout):
}
close(ws.outc)
delete(w.streams, ws.id)
w.mu.Unlock()
}
func (w *watchGrpcStream) closeSubstream(ws *watcherStream) {
// send channel response in case stream was never established
select {
case ws.initReq.retc <- ws.outc:
default:
}
// close subscriber's channel
if closeErr := w.closeErr; closeErr != nil && ws.initReq.ctx.Err() == nil {
go w.sendCloseSubstream(ws, &WatchResponse{closeErr: w.closeErr})
} else if ws.outc != nil {
close(ws.outc)
}
if ws.id != -1 {
delete(w.substreams, ws.id)
return
}
for i := range w.resuming {
if w.resuming[i] == ws {
w.resuming[i] = nil
return
}
}
}
// run is the root of the goroutines for managing a watcher client
@ -368,66 +364,79 @@ func (w *watchGrpcStream) run() {
var wc pb.Watch_WatchClient
var closeErr error
defer func() {
w.owner.mu.Lock()
w.closeErr = closeErr
if w.owner.streams != nil {
delete(w.owner.streams, w.ctxKey)
}
close(w.donec)
w.owner.mu.Unlock()
w.cancel()
}()
// substreams marked to close but goroutine still running; needed for
// avoiding double-closing recvc on grpc stream teardown
closing := make(map[*watcherStream]struct{})
// already stopped?
w.mu.RLock()
stopc := w.stopc
w.mu.RUnlock()
if stopc == nil {
return
}
defer func() {
w.closeErr = closeErr
// shutdown substreams and resuming substreams
for _, ws := range w.substreams {
if _, ok := closing[ws]; !ok {
close(ws.recvc)
}
}
for _, ws := range w.resuming {
if _, ok := closing[ws]; ws != nil && !ok {
close(ws.recvc)
}
}
w.joinSubstreams()
for toClose := len(w.substreams) + len(w.resuming); toClose > 0; toClose-- {
w.closeSubstream(<-w.closingc)
}
w.owner.closeStream(w)
}()
// start a stream with the etcd grpc server
if wc, closeErr = w.newWatchClient(); closeErr != nil {
return
}
var pendingReq, failedReq *watchRequest
curReqC := w.reqc
cancelSet := make(map[int64]struct{})
for {
select {
// Watch() requested
case pendingReq = <-curReqC:
// no more watch requests until there's a response
curReqC = nil
if err := wc.Send(pendingReq.toPB()); err == nil {
// pendingReq now waits on w.respc
break
case wreq := <-w.reqc:
outc := make(chan WatchResponse, 1)
ws := &watcherStream{
initReq: *wreq,
id: -1,
outc: outc,
// unbuffered so resumes won't cause repeat events
recvc: make(chan *WatchResponse),
}
ws.donec = make(chan struct{})
go w.serveSubstream(ws, w.resumec)
// queue up for watcher creation/resume
w.resuming = append(w.resuming, ws)
if len(w.resuming) == 1 {
// head of resume queue, can register a new watcher
wc.Send(ws.initReq.toPB())
}
failedReq = pendingReq
// New events from the watch client
case pbresp := <-w.respc:
switch {
case pbresp.Created:
// response to pending req, try to add
w.addStream(pbresp, pendingReq)
pendingReq = nil
curReqC = w.reqc
// response to head of queue creation
if ws := w.resuming[0]; ws != nil {
w.addSubstream(pbresp, ws)
w.dispatchEvent(pbresp)
w.resuming[0] = nil
}
if ws := w.nextResume(); ws != nil {
wc.Send(ws.initReq.toPB())
}
case pbresp.Canceled:
delete(cancelSet, pbresp.WatchId)
// shutdown serveStream, if any
w.mu.Lock()
if ws, ok := w.streams[pbresp.WatchId]; ok {
if ws, ok := w.substreams[pbresp.WatchId]; ok {
// signal to stream goroutine to update closingc
close(ws.recvc)
delete(w.streams, ws.id)
}
numStreams := len(w.streams)
w.mu.Unlock()
if numStreams == 0 {
// don't leak watcher streams
return
closing[ws] = struct{}{}
}
default:
// dispatch to appropriate watch stream
@ -448,7 +457,6 @@ func (w *watchGrpcStream) run() {
wc.Send(req)
}
// watch client failed to recv; spawn another if possible
// TODO report watch client errors from errc?
case err := <-w.errc:
if toErr(w.ctx, err) == v3rpc.ErrNoLeader {
closeErr = err
@ -457,48 +465,58 @@ func (w *watchGrpcStream) run() {
if wc, closeErr = w.newWatchClient(); closeErr != nil {
return
}
curReqC = w.reqc
if pendingReq != nil {
failedReq = pendingReq
if ws := w.nextResume(); ws != nil {
wc.Send(ws.initReq.toPB())
}
cancelSet = make(map[int64]struct{})
case <-stopc:
case <-w.ctx.Done():
return
}
// send failed; queue for retry
if failedReq != nil {
go func(wr *watchRequest) {
select {
case w.reqc <- wr:
case <-wr.ctx.Done():
case <-w.donec:
}
}(pendingReq)
failedReq = nil
pendingReq = nil
case ws := <-w.closingc:
w.closeSubstream(ws)
delete(closing, ws)
if len(w.substreams)+len(w.resuming) == 0 {
// no more watchers on this stream, shutdown
return
}
}
}
}
// nextResume chooses the next resuming to register with the grpc stream. Abandoned
// streams are marked as nil in the queue since the head must wait for its inflight registration.
func (w *watchGrpcStream) nextResume() *watcherStream {
for len(w.resuming) != 0 {
if w.resuming[0] != nil {
return w.resuming[0]
}
w.resuming = w.resuming[1:len(w.resuming)]
}
return nil
}
// dispatchEvent sends a WatchResponse to the appropriate watcher stream
func (w *watchGrpcStream) dispatchEvent(pbresp *pb.WatchResponse) bool {
w.mu.RLock()
defer w.mu.RUnlock()
ws, ok := w.streams[pbresp.WatchId]
ws, ok := w.substreams[pbresp.WatchId]
if !ok {
return false
}
events := make([]*Event, len(pbresp.Events))
for i, ev := range pbresp.Events {
events[i] = (*Event)(ev)
}
if ok {
wr := &WatchResponse{
Header: *pbresp.Header,
Events: events,
CompactRevision: pbresp.CompactRevision,
Canceled: pbresp.Canceled}
ws.recvc <- wr
wr := &WatchResponse{
Header: *pbresp.Header,
Events: events,
CompactRevision: pbresp.CompactRevision,
created: pbresp.Created,
Canceled: pbresp.Canceled,
}
return ok
select {
case ws.recvc <- wr:
case <-ws.donec:
return false
}
return true
}
// serveWatchClient forwards messages from the grpc stream to run()
@ -520,134 +538,169 @@ func (w *watchGrpcStream) serveWatchClient(wc pb.Watch_WatchClient) {
}
}
// serveStream forwards watch responses from run() to the subscriber
func (w *watchGrpcStream) serveStream(ws *watcherStream) {
var closeErr error
emptyWr := &WatchResponse{}
wrs := []*WatchResponse{}
// serveSubstream forwards watch responses from run() to the subscriber
func (w *watchGrpcStream) serveSubstream(ws *watcherStream, resumec chan struct{}) {
if ws.closing {
panic("created substream goroutine but substream is closing")
}
// nextRev is the minimum expected next revision
nextRev := ws.initReq.rev
resuming := false
closing := false
for !closing {
defer func() {
if !resuming {
ws.closing = true
}
close(ws.donec)
if !resuming {
w.closingc <- ws
}
}()
emptyWr := &WatchResponse{}
for {
curWr := emptyWr
outc := ws.outc
if len(wrs) > 0 {
curWr = wrs[0]
if len(ws.buf) > 0 && ws.buf[0].created {
select {
case ws.initReq.retc <- ws.outc:
default:
}
ws.buf = ws.buf[1:]
}
if len(ws.buf) > 0 {
curWr = ws.buf[0]
} else {
outc = nil
}
select {
case outc <- *curWr:
if wrs[0].Err() != nil {
closing = true
break
}
var newRev int64
if len(wrs[0].Events) > 0 {
newRev = wrs[0].Events[len(wrs[0].Events)-1].Kv.ModRevision
} else {
newRev = wrs[0].Header.Revision
}
if newRev != ws.lastRev {
ws.lastRev = newRev
}
wrs[0] = nil
wrs = wrs[1:]
case wr, ok := <-ws.recvc:
if !ok {
// shutdown from closeStream
if ws.buf[0].Err() != nil {
return
}
// resume up to last seen event if disconnected
if resuming && wr.Err() == nil {
resuming = false
// trim events already seen
for i := 0; i < len(wr.Events); i++ {
if wr.Events[i].Kv.ModRevision > ws.lastRev {
wr.Events = wr.Events[i:]
break
}
}
// only forward new events
if wr.Events[0].Kv.ModRevision == ws.lastRev {
break
}
ws.buf[0] = nil
ws.buf = ws.buf[1:]
case wr, ok := <-ws.recvc:
if !ok {
// shutdown from closeSubstream
return
}
resuming = false
// TODO don't keep buffering if subscriber stops reading
wrs = append(wrs, wr)
case resumeRev := <-ws.resumec:
wrs = nil
resuming = true
if resumeRev == -1 {
// pause serving stream while resume gets set up
break
// TODO pause channel if buffer gets too large
ws.buf = append(ws.buf, wr)
nextRev = wr.Header.Revision
if len(wr.Events) > 0 {
nextRev = wr.Events[len(wr.Events)-1].Kv.ModRevision + 1
}
if resumeRev != ws.lastRev {
panic("unexpected resume revision")
}
case <-w.donec:
closing = true
closeErr = w.closeErr
ws.initReq.rev = nextRev
case <-w.ctx.Done():
return
case <-ws.initReq.ctx.Done():
closing = true
return
case <-resumec:
resuming = true
return
}
}
// try to send off close error
if closeErr != nil {
select {
case ws.outc <- WatchResponse{closeErr: w.closeErr}:
case <-w.donec:
case <-time.After(closeSendErrTimeout):
}
}
w.closeStream(ws)
w.stopIfEmpty()
// lazily send cancel message if events on missing id
}
func (wgs *watchGrpcStream) stopIfEmpty() {
wgs.mu.Lock()
if len(wgs.streams) == 0 && wgs.stopc != nil {
close(wgs.stopc)
wgs.stopc = nil
}
wgs.mu.Unlock()
}
func (w *watchGrpcStream) newWatchClient() (pb.Watch_WatchClient, error) {
ws, rerr := w.resume()
if rerr != nil {
return nil, rerr
// mark all substreams as resuming
close(w.resumec)
w.resumec = make(chan struct{})
w.joinSubstreams()
for _, ws := range w.substreams {
ws.id = -1
w.resuming = append(w.resuming, ws)
}
go w.serveWatchClient(ws)
return ws, nil
}
// resume creates a new WatchClient with all current watchers reestablished
func (w *watchGrpcStream) resume() (ws pb.Watch_WatchClient, err error) {
for {
if ws, err = w.openWatchClient(); err != nil {
break
} else if err = w.resumeWatchers(ws); err == nil {
break
// strip out nils, if any
var resuming []*watcherStream
for _, ws := range w.resuming {
if ws != nil {
resuming = append(resuming, ws)
}
}
w.resuming = resuming
w.substreams = make(map[int64]*watcherStream)
// connect to grpc stream while accepting watcher cancelation
stopc := make(chan struct{})
donec := w.waitCancelSubstreams(stopc)
wc, err := w.openWatchClient()
close(stopc)
<-donec
// serve all non-closing streams, even if there's a client error
// so that the teardown path can shutdown the streams as expected.
for _, ws := range w.resuming {
if ws.closing {
continue
}
ws.donec = make(chan struct{})
go w.serveSubstream(ws, w.resumec)
}
if err != nil {
return nil, v3rpc.Error(err)
}
// receive data from new grpc stream
go w.serveWatchClient(wc)
return wc, nil
}
func (w *watchGrpcStream) waitCancelSubstreams(stopc <-chan struct{}) <-chan struct{} {
var wg sync.WaitGroup
wg.Add(len(w.resuming))
donec := make(chan struct{})
for i := range w.resuming {
go func(ws *watcherStream) {
defer wg.Done()
if ws.closing {
return
}
select {
case <-ws.initReq.ctx.Done():
// closed ws will be removed from resuming
ws.closing = true
close(ws.outc)
ws.outc = nil
go func() { w.closingc <- ws }()
case <-stopc:
}
}(w.resuming[i])
}
go func() {
defer close(donec)
wg.Wait()
}()
return donec
}
// joinSubstreams waits for all substream goroutines to complete
func (w *watchGrpcStream) joinSubstreams() {
for _, ws := range w.substreams {
<-ws.donec
}
for _, ws := range w.resuming {
if ws != nil {
<-ws.donec
}
}
return ws, v3rpc.Error(err)
}
// openWatchClient retries opening a watchclient until retryConnection fails
func (w *watchGrpcStream) openWatchClient() (ws pb.Watch_WatchClient, err error) {
for {
w.mu.Lock()
stopc := w.stopc
w.mu.Unlock()
if stopc == nil {
select {
case <-w.ctx.Done():
if err == nil {
err = context.Canceled
return nil, w.ctx.Err()
}
return nil, err
default:
}
if ws, err = w.remote.Watch(w.ctx, grpc.FailFast(false)); ws != nil && err == nil {
break
@ -659,48 +712,6 @@ func (w *watchGrpcStream) openWatchClient() (ws pb.Watch_WatchClient, err error)
return ws, nil
}
// resumeWatchers rebuilds every registered watcher on a new client
func (w *watchGrpcStream) resumeWatchers(wc pb.Watch_WatchClient) error {
w.mu.RLock()
streams := make([]*watcherStream, 0, len(w.streams))
for _, ws := range w.streams {
streams = append(streams, ws)
}
w.mu.RUnlock()
for _, ws := range streams {
// pause serveStream
ws.resumec <- -1
// reconstruct watcher from initial request
if ws.lastRev != 0 {
ws.initReq.rev = ws.lastRev
}
if err := wc.Send(ws.initReq.toPB()); err != nil {
return err
}
// wait for request ack
resp, err := wc.Recv()
if err != nil {
return err
} else if len(resp.Events) != 0 || !resp.Created {
return fmt.Errorf("watcher: unexpected response (%+v)", resp)
}
// id may be different since new remote watcher; update map
w.mu.Lock()
delete(w.streams, ws.id)
ws.id = resp.WatchId
w.streams[ws.id] = ws
w.mu.Unlock()
// unpause serveStream
ws.resumec <- ws.lastRev
}
return nil
}
// toPB converts an internal watch request structure to its protobuf message
func (wr *watchRequest) toPB() *pb.WatchRequest {
req := &pb.WatchCreateRequest{
@ -708,6 +719,7 @@ func (wr *watchRequest) toPB() *pb.WatchRequest {
Key: []byte(wr.key),
RangeEnd: []byte(wr.end),
ProgressNotify: wr.progressNotify,
PrevKv: wr.prevKV,
}
cr := &pb.WatchRequest_CreateRequest{CreateRequest: req}
return &pb.WatchRequest{RequestUnion: cr}
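The hunks above thread the new prev-kv option through the watch create request (the PrevKv field in toPB) and keep watchers alive across reconnects. A rough client-side sketch of how that surfaces, not part of the diff itself; the endpoint address is a placeholder and error handling is trimmed:

package main

import (
	"fmt"

	"github.com/coreos/etcd/clientv3"
	"golang.org/x/net/context"
)

func main() {
	// placeholder endpoint; any reachable member built from this code works
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// WithPrevKV sets the prevKV flag that toPB now copies into the
	// WatchCreateRequest, so events carry the prior key-value pair.
	for resp := range cli.Watch(context.Background(), "foo", clientv3.WithPrevKV()) {
		for _, ev := range resp.Events {
			fmt.Printf("%s %q -> %q", ev.Type, ev.Kv.Key, ev.Kv.Value)
			if ev.PrevKv != nil {
				fmt.Printf(" (was %q)", ev.PrevKv.Value)
			}
			fmt.Println()
		}
	}
}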

cmd/Godeps/Godeps.json generated
@ -11,10 +11,6 @@
"Comment": "null-5",
"Rev": "'75cd24fc2f2c2a2088577d12123ddee5f54e0675'"
},
{
"ImportPath": "github.com/akrennmair/gopcap",
"Rev": "00e11033259acb75598ba416495bb708d864a010"
},
{
"ImportPath": "github.com/beorn7/perks/quantile",
"Rev": "b965b613227fddccbfffe13eae360ed3fa822f8d"
@ -237,48 +233,48 @@
},
{
"ImportPath": "google.golang.org/grpc",
"Comment": "v1.0.0-174-gc278196",
"Rev": "c2781963b3af261a37e0f14fdcb7c1fa13259e1f"
"Comment": "v1.0.0-183-g231b4cf",
"Rev": "231b4cfea0e79843053a33f5fe90bd4d84b23cd3"
},
{
"ImportPath": "google.golang.org/grpc/codes",
"Comment": "v1.0.0-174-gc278196",
"Rev": "c2781963b3af261a37e0f14fdcb7c1fa13259e1f"
"Comment": "v1.0.0-183-g231b4cf",
"Rev": "231b4cfea0e79843053a33f5fe90bd4d84b23cd3"
},
{
"ImportPath": "google.golang.org/grpc/credentials",
"Comment": "v1.0.0-174-gc278196",
"Rev": "c2781963b3af261a37e0f14fdcb7c1fa13259e1f"
"Comment": "v1.0.0-183-g231b4cf",
"Rev": "231b4cfea0e79843053a33f5fe90bd4d84b23cd3"
},
{
"ImportPath": "google.golang.org/grpc/grpclog",
"Comment": "v1.0.0-174-gc278196",
"Rev": "c2781963b3af261a37e0f14fdcb7c1fa13259e1f"
"Comment": "v1.0.0-183-g231b4cf",
"Rev": "231b4cfea0e79843053a33f5fe90bd4d84b23cd3"
},
{
"ImportPath": "google.golang.org/grpc/internal",
"Comment": "v1.0.0-174-gc278196",
"Rev": "c2781963b3af261a37e0f14fdcb7c1fa13259e1f"
"Comment": "v1.0.0-183-g231b4cf",
"Rev": "231b4cfea0e79843053a33f5fe90bd4d84b23cd3"
},
{
"ImportPath": "google.golang.org/grpc/metadata",
"Comment": "v1.0.0-174-gc278196",
"Rev": "c2781963b3af261a37e0f14fdcb7c1fa13259e1f"
"Comment": "v1.0.0-183-g231b4cf",
"Rev": "231b4cfea0e79843053a33f5fe90bd4d84b23cd3"
},
{
"ImportPath": "google.golang.org/grpc/naming",
"Comment": "v1.0.0-174-gc278196",
"Rev": "c2781963b3af261a37e0f14fdcb7c1fa13259e1f"
"Comment": "v1.0.0-183-g231b4cf",
"Rev": "231b4cfea0e79843053a33f5fe90bd4d84b23cd3"
},
{
"ImportPath": "google.golang.org/grpc/peer",
"Comment": "v1.0.0-174-gc278196",
"Rev": "c2781963b3af261a37e0f14fdcb7c1fa13259e1f"
"Comment": "v1.0.0-183-g231b4cf",
"Rev": "231b4cfea0e79843053a33f5fe90bd4d84b23cd3"
},
{
"ImportPath": "google.golang.org/grpc/transport",
"Comment": "v1.0.0-174-gc278196",
"Rev": "c2781963b3af261a37e0f14fdcb7c1fa13259e1f"
"Comment": "v1.0.0-183-g231b4cf",
"Rev": "231b4cfea0e79843053a33f5fe90bd4d84b23cd3"
},
{
"ImportPath": "gopkg.in/cheggaaa/pb.v1",

@ -1,5 +0,0 @@
#*
*~
/tools/pass/pass
/tools/pcaptest/pcaptest
/tools/tcpdump/tcpdump

@ -1,27 +0,0 @@
Copyright (c) 2009-2011 Andreas Krennmair. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following disclaimer
in the documentation and/or other materials provided with the
distribution.
* Neither the name of Andreas Krennmair nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

@ -1,11 +0,0 @@
# PCAP
This is a simple wrapper around libpcap for Go. Originally written by Andreas
Krennmair <ak@synflood.at> and only minorly touched up by Mark Smith <mark@qq.is>.
Please see the included pcaptest.go and tcpdump.go programs for instructions on
how to use this library.
Miek Gieben <miek@miek.nl> has created a more Go-like package and replaced functionality
with standard functions from the standard library. The package has also been renamed to
pcap.
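The README points at the bundled pcaptest.go and tcpdump.go for usage. Since those programs are not shown here, the following is only a rough sketch of the live-capture API defined in the removed sources below; the device name and filter expression are placeholders:

package main

import (
	"fmt"

	pcap "github.com/akrennmair/gopcap"
)

func main() {
	// placeholder device; requires libpcap and usually elevated privileges
	h, err := pcap.Openlive("eth0", 1600, true, 100)
	if err != nil {
		panic(err)
	}
	defer h.Close()

	if err := h.Setfilter("tcp or udp or icmp"); err != nil {
		panic(err)
	}
	for {
		pkt := h.Next() // nil when nothing was captured in this timeout window
		if pkt == nil {
			continue
		}
		pkt.Decode()              // fills Headers, Payload, IP/TCP/UDP
		fmt.Println(pkt.String()) // one-line tcpdump-style summary
	}
}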

@ -1,527 +0,0 @@
package pcap
import (
"encoding/binary"
"fmt"
"net"
"reflect"
"strings"
)
const (
TYPE_IP = 0x0800
TYPE_ARP = 0x0806
TYPE_IP6 = 0x86DD
TYPE_VLAN = 0x8100
IP_ICMP = 1
IP_INIP = 4
IP_TCP = 6
IP_UDP = 17
)
const (
ERRBUF_SIZE = 256
// According to pcap-linktype(7).
LINKTYPE_NULL = 0
LINKTYPE_ETHERNET = 1
LINKTYPE_TOKEN_RING = 6
LINKTYPE_ARCNET = 7
LINKTYPE_SLIP = 8
LINKTYPE_PPP = 9
LINKTYPE_FDDI = 10
LINKTYPE_ATM_RFC1483 = 100
LINKTYPE_RAW = 101
LINKTYPE_PPP_HDLC = 50
LINKTYPE_PPP_ETHER = 51
LINKTYPE_C_HDLC = 104
LINKTYPE_IEEE802_11 = 105
LINKTYPE_FRELAY = 107
LINKTYPE_LOOP = 108
LINKTYPE_LINUX_SLL = 113
LINKTYPE_LTALK = 104
LINKTYPE_PFLOG = 117
LINKTYPE_PRISM_HEADER = 119
LINKTYPE_IP_OVER_FC = 122
LINKTYPE_SUNATM = 123
LINKTYPE_IEEE802_11_RADIO = 127
LINKTYPE_ARCNET_LINUX = 129
LINKTYPE_LINUX_IRDA = 144
LINKTYPE_LINUX_LAPD = 177
)
type addrHdr interface {
SrcAddr() string
DestAddr() string
Len() int
}
type addrStringer interface {
String(addr addrHdr) string
}
func decodemac(pkt []byte) uint64 {
mac := uint64(0)
for i := uint(0); i < 6; i++ {
mac = (mac << 8) + uint64(pkt[i])
}
return mac
}
// Decode decodes the headers of a Packet.
func (p *Packet) Decode() {
if len(p.Data) <= 14 {
return
}
p.Type = int(binary.BigEndian.Uint16(p.Data[12:14]))
p.DestMac = decodemac(p.Data[0:6])
p.SrcMac = decodemac(p.Data[6:12])
if len(p.Data) >= 15 {
p.Payload = p.Data[14:]
}
switch p.Type {
case TYPE_IP:
p.decodeIp()
case TYPE_IP6:
p.decodeIp6()
case TYPE_ARP:
p.decodeArp()
case TYPE_VLAN:
p.decodeVlan()
}
}
func (p *Packet) headerString(headers []interface{}) string {
// If there's just one header, return that.
if len(headers) == 1 {
if hdr, ok := headers[0].(fmt.Stringer); ok {
return hdr.String()
}
}
// If there are two headers (IPv4/IPv6 -> TCP/UDP/IP..)
if len(headers) == 2 {
// Commonly the first header is an address.
if addr, ok := p.Headers[0].(addrHdr); ok {
if hdr, ok := p.Headers[1].(addrStringer); ok {
return fmt.Sprintf("%s %s", p.Time, hdr.String(addr))
}
}
}
// For IP in IP, we do a recursive call.
if len(headers) >= 2 {
if addr, ok := headers[0].(addrHdr); ok {
if _, ok := headers[1].(addrHdr); ok {
return fmt.Sprintf("%s > %s IP in IP: ",
addr.SrcAddr(), addr.DestAddr(), p.headerString(headers[1:]))
}
}
}
var typeNames []string
for _, hdr := range headers {
typeNames = append(typeNames, reflect.TypeOf(hdr).String())
}
return fmt.Sprintf("unknown [%s]", strings.Join(typeNames, ","))
}
// String prints a one-line representation of the packet header.
// The output is suitable for use in a tcpdump program.
func (p *Packet) String() string {
// If there are no headers, print "unsupported protocol".
if len(p.Headers) == 0 {
return fmt.Sprintf("%s unsupported protocol %d", p.Time, int(p.Type))
}
return fmt.Sprintf("%s %s", p.Time, p.headerString(p.Headers))
}
// Arphdr is a ARP packet header.
type Arphdr struct {
Addrtype uint16
Protocol uint16
HwAddressSize uint8
ProtAddressSize uint8
Operation uint16
SourceHwAddress []byte
SourceProtAddress []byte
DestHwAddress []byte
DestProtAddress []byte
}
func (arp *Arphdr) String() (s string) {
switch arp.Operation {
case 1:
s = "ARP request"
case 2:
s = "ARP Reply"
}
if arp.Addrtype == LINKTYPE_ETHERNET && arp.Protocol == TYPE_IP {
s = fmt.Sprintf("%012x (%s) > %012x (%s)",
decodemac(arp.SourceHwAddress), arp.SourceProtAddress,
decodemac(arp.DestHwAddress), arp.DestProtAddress)
} else {
s = fmt.Sprintf("addrtype = %d protocol = %d", arp.Addrtype, arp.Protocol)
}
return
}
func (p *Packet) decodeArp() {
if len(p.Payload) < 8 {
return
}
pkt := p.Payload
arp := new(Arphdr)
arp.Addrtype = binary.BigEndian.Uint16(pkt[0:2])
arp.Protocol = binary.BigEndian.Uint16(pkt[2:4])
arp.HwAddressSize = pkt[4]
arp.ProtAddressSize = pkt[5]
arp.Operation = binary.BigEndian.Uint16(pkt[6:8])
if len(pkt) < int(8+2*arp.HwAddressSize+2*arp.ProtAddressSize) {
return
}
arp.SourceHwAddress = pkt[8 : 8+arp.HwAddressSize]
arp.SourceProtAddress = pkt[8+arp.HwAddressSize : 8+arp.HwAddressSize+arp.ProtAddressSize]
arp.DestHwAddress = pkt[8+arp.HwAddressSize+arp.ProtAddressSize : 8+2*arp.HwAddressSize+arp.ProtAddressSize]
arp.DestProtAddress = pkt[8+2*arp.HwAddressSize+arp.ProtAddressSize : 8+2*arp.HwAddressSize+2*arp.ProtAddressSize]
p.Headers = append(p.Headers, arp)
if len(pkt) >= int(8+2*arp.HwAddressSize+2*arp.ProtAddressSize) {
p.Payload = p.Payload[8+2*arp.HwAddressSize+2*arp.ProtAddressSize:]
}
}
// IPadr is the header of an IP packet.
type Iphdr struct {
Version uint8
Ihl uint8
Tos uint8
Length uint16
Id uint16
Flags uint8
FragOffset uint16
Ttl uint8
Protocol uint8
Checksum uint16
SrcIp []byte
DestIp []byte
}
func (p *Packet) decodeIp() {
if len(p.Payload) < 20 {
return
}
pkt := p.Payload
ip := new(Iphdr)
ip.Version = uint8(pkt[0]) >> 4
ip.Ihl = uint8(pkt[0]) & 0x0F
ip.Tos = pkt[1]
ip.Length = binary.BigEndian.Uint16(pkt[2:4])
ip.Id = binary.BigEndian.Uint16(pkt[4:6])
flagsfrags := binary.BigEndian.Uint16(pkt[6:8])
ip.Flags = uint8(flagsfrags >> 13)
ip.FragOffset = flagsfrags & 0x1FFF
ip.Ttl = pkt[8]
ip.Protocol = pkt[9]
ip.Checksum = binary.BigEndian.Uint16(pkt[10:12])
ip.SrcIp = pkt[12:16]
ip.DestIp = pkt[16:20]
pEnd := int(ip.Length)
if pEnd > len(pkt) {
pEnd = len(pkt)
}
if len(pkt) >= pEnd && int(ip.Ihl*4) < pEnd {
p.Payload = pkt[ip.Ihl*4 : pEnd]
} else {
p.Payload = []byte{}
}
p.Headers = append(p.Headers, ip)
p.IP = ip
switch ip.Protocol {
case IP_TCP:
p.decodeTcp()
case IP_UDP:
p.decodeUdp()
case IP_ICMP:
p.decodeIcmp()
case IP_INIP:
p.decodeIp()
}
}
func (ip *Iphdr) SrcAddr() string { return net.IP(ip.SrcIp).String() }
func (ip *Iphdr) DestAddr() string { return net.IP(ip.DestIp).String() }
func (ip *Iphdr) Len() int { return int(ip.Length) }
type Vlanhdr struct {
Priority byte
DropEligible bool
VlanIdentifier int
Type int // Not actually part of the vlan header, but the type of the actual packet
}
func (v *Vlanhdr) String() {
fmt.Sprintf("VLAN Priority:%d Drop:%v Tag:%d", v.Priority, v.DropEligible, v.VlanIdentifier)
}
func (p *Packet) decodeVlan() {
pkt := p.Payload
vlan := new(Vlanhdr)
if len(pkt) < 4 {
return
}
vlan.Priority = (pkt[2] & 0xE0) >> 13
vlan.DropEligible = pkt[2]&0x10 != 0
vlan.VlanIdentifier = int(binary.BigEndian.Uint16(pkt[:2])) & 0x0FFF
vlan.Type = int(binary.BigEndian.Uint16(p.Payload[2:4]))
p.Headers = append(p.Headers, vlan)
if len(pkt) >= 5 {
p.Payload = p.Payload[4:]
}
switch vlan.Type {
case TYPE_IP:
p.decodeIp()
case TYPE_IP6:
p.decodeIp6()
case TYPE_ARP:
p.decodeArp()
}
}
type Tcphdr struct {
SrcPort uint16
DestPort uint16
Seq uint32
Ack uint32
DataOffset uint8
Flags uint16
Window uint16
Checksum uint16
Urgent uint16
Data []byte
}
const (
TCP_FIN = 1 << iota
TCP_SYN
TCP_RST
TCP_PSH
TCP_ACK
TCP_URG
TCP_ECE
TCP_CWR
TCP_NS
)
func (p *Packet) decodeTcp() {
if len(p.Payload) < 20 {
return
}
pkt := p.Payload
tcp := new(Tcphdr)
tcp.SrcPort = binary.BigEndian.Uint16(pkt[0:2])
tcp.DestPort = binary.BigEndian.Uint16(pkt[2:4])
tcp.Seq = binary.BigEndian.Uint32(pkt[4:8])
tcp.Ack = binary.BigEndian.Uint32(pkt[8:12])
tcp.DataOffset = (pkt[12] & 0xF0) >> 4
tcp.Flags = binary.BigEndian.Uint16(pkt[12:14]) & 0x1FF
tcp.Window = binary.BigEndian.Uint16(pkt[14:16])
tcp.Checksum = binary.BigEndian.Uint16(pkt[16:18])
tcp.Urgent = binary.BigEndian.Uint16(pkt[18:20])
if len(pkt) >= int(tcp.DataOffset*4) {
p.Payload = pkt[tcp.DataOffset*4:]
}
p.Headers = append(p.Headers, tcp)
p.TCP = tcp
}
func (tcp *Tcphdr) String(hdr addrHdr) string {
return fmt.Sprintf("TCP %s:%d > %s:%d %s SEQ=%d ACK=%d LEN=%d",
hdr.SrcAddr(), int(tcp.SrcPort), hdr.DestAddr(), int(tcp.DestPort),
tcp.FlagsString(), int64(tcp.Seq), int64(tcp.Ack), hdr.Len())
}
func (tcp *Tcphdr) FlagsString() string {
var sflags []string
if 0 != (tcp.Flags & TCP_SYN) {
sflags = append(sflags, "syn")
}
if 0 != (tcp.Flags & TCP_FIN) {
sflags = append(sflags, "fin")
}
if 0 != (tcp.Flags & TCP_ACK) {
sflags = append(sflags, "ack")
}
if 0 != (tcp.Flags & TCP_PSH) {
sflags = append(sflags, "psh")
}
if 0 != (tcp.Flags & TCP_RST) {
sflags = append(sflags, "rst")
}
if 0 != (tcp.Flags & TCP_URG) {
sflags = append(sflags, "urg")
}
if 0 != (tcp.Flags & TCP_NS) {
sflags = append(sflags, "ns")
}
if 0 != (tcp.Flags & TCP_CWR) {
sflags = append(sflags, "cwr")
}
if 0 != (tcp.Flags & TCP_ECE) {
sflags = append(sflags, "ece")
}
return fmt.Sprintf("[%s]", strings.Join(sflags, " "))
}
type Udphdr struct {
SrcPort uint16
DestPort uint16
Length uint16
Checksum uint16
}
func (p *Packet) decodeUdp() {
if len(p.Payload) < 8 {
return
}
pkt := p.Payload
udp := new(Udphdr)
udp.SrcPort = binary.BigEndian.Uint16(pkt[0:2])
udp.DestPort = binary.BigEndian.Uint16(pkt[2:4])
udp.Length = binary.BigEndian.Uint16(pkt[4:6])
udp.Checksum = binary.BigEndian.Uint16(pkt[6:8])
p.Headers = append(p.Headers, udp)
p.UDP = udp
if len(p.Payload) >= 8 {
p.Payload = pkt[8:]
}
}
func (udp *Udphdr) String(hdr addrHdr) string {
return fmt.Sprintf("UDP %s:%d > %s:%d LEN=%d CHKSUM=%d",
hdr.SrcAddr(), int(udp.SrcPort), hdr.DestAddr(), int(udp.DestPort),
int(udp.Length), int(udp.Checksum))
}
type Icmphdr struct {
Type uint8
Code uint8
Checksum uint16
Id uint16
Seq uint16
Data []byte
}
func (p *Packet) decodeIcmp() *Icmphdr {
if len(p.Payload) < 8 {
return nil
}
pkt := p.Payload
icmp := new(Icmphdr)
icmp.Type = pkt[0]
icmp.Code = pkt[1]
icmp.Checksum = binary.BigEndian.Uint16(pkt[2:4])
icmp.Id = binary.BigEndian.Uint16(pkt[4:6])
icmp.Seq = binary.BigEndian.Uint16(pkt[6:8])
p.Payload = pkt[8:]
p.Headers = append(p.Headers, icmp)
return icmp
}
func (icmp *Icmphdr) String(hdr addrHdr) string {
return fmt.Sprintf("ICMP %s > %s Type = %d Code = %d ",
hdr.SrcAddr(), hdr.DestAddr(), icmp.Type, icmp.Code)
}
func (icmp *Icmphdr) TypeString() (result string) {
switch icmp.Type {
case 0:
result = fmt.Sprintf("Echo reply seq=%d", icmp.Seq)
case 3:
switch icmp.Code {
case 0:
result = "Network unreachable"
case 1:
result = "Host unreachable"
case 2:
result = "Protocol unreachable"
case 3:
result = "Port unreachable"
default:
result = "Destination unreachable"
}
case 8:
result = fmt.Sprintf("Echo request seq=%d", icmp.Seq)
case 30:
result = "Traceroute"
}
return
}
type Ip6hdr struct {
// http://www.networksorcery.com/enp/protocol/ipv6.htm
Version uint8 // 4 bits
TrafficClass uint8 // 8 bits
FlowLabel uint32 // 20 bits
Length uint16 // 16 bits
NextHeader uint8 // 8 bits, same as Protocol in Iphdr
HopLimit uint8 // 8 bits
SrcIp []byte // 16 bytes
DestIp []byte // 16 bytes
}
func (p *Packet) decodeIp6() {
if len(p.Payload) < 40 {
return
}
pkt := p.Payload
ip6 := new(Ip6hdr)
ip6.Version = uint8(pkt[0]) >> 4
ip6.TrafficClass = uint8((binary.BigEndian.Uint16(pkt[0:2]) >> 4) & 0x00FF)
ip6.FlowLabel = binary.BigEndian.Uint32(pkt[0:4]) & 0x000FFFFF
ip6.Length = binary.BigEndian.Uint16(pkt[4:6])
ip6.NextHeader = pkt[6]
ip6.HopLimit = pkt[7]
ip6.SrcIp = pkt[8:24]
ip6.DestIp = pkt[24:40]
if len(p.Payload) >= 40 {
p.Payload = pkt[40:]
}
p.Headers = append(p.Headers, ip6)
switch ip6.NextHeader {
case IP_TCP:
p.decodeTcp()
case IP_UDP:
p.decodeUdp()
case IP_ICMP:
p.decodeIcmp()
case IP_INIP:
p.decodeIp()
}
}
func (ip6 *Ip6hdr) SrcAddr() string { return net.IP(ip6.SrcIp).String() }
func (ip6 *Ip6hdr) DestAddr() string { return net.IP(ip6.DestIp).String() }
func (ip6 *Ip6hdr) Len() int { return int(ip6.Length) }

@ -1,206 +0,0 @@
package pcap
import (
"encoding/binary"
"fmt"
"io"
"time"
)
// FileHeader is the parsed header of a pcap file.
// http://wiki.wireshark.org/Development/LibpcapFileFormat
type FileHeader struct {
MagicNumber uint32
VersionMajor uint16
VersionMinor uint16
TimeZone int32
SigFigs uint32
SnapLen uint32
Network uint32
}
type PacketTime struct {
Sec int32
Usec int32
}
// Convert the PacketTime to a go Time struct.
func (p *PacketTime) Time() time.Time {
return time.Unix(int64(p.Sec), int64(p.Usec)*1000)
}
// Packet is a single packet parsed from a pcap file.
//
// Convenient access to IP, TCP, and UDP headers is provided after Decode()
// is called if the packet is of the appropriate type.
type Packet struct {
Time time.Time // packet send/receive time
Caplen uint32 // bytes stored in the file (caplen <= len)
Len uint32 // bytes sent/received
Data []byte // packet data
Type int // protocol type, see LINKTYPE_*
DestMac uint64
SrcMac uint64
Headers []interface{} // decoded headers, in order
Payload []byte // remaining non-header bytes
IP *Iphdr // IP header (for IP packets, after decoding)
TCP *Tcphdr // TCP header (for TCP packets, after decoding)
UDP *Udphdr // UDP header (for UDP packets after decoding)
}
// Reader parses pcap files.
type Reader struct {
flip bool
buf io.Reader
err error
fourBytes []byte
twoBytes []byte
sixteenBytes []byte
Header FileHeader
}
// NewReader reads pcap data from an io.Reader.
func NewReader(reader io.Reader) (*Reader, error) {
r := &Reader{
buf: reader,
fourBytes: make([]byte, 4),
twoBytes: make([]byte, 2),
sixteenBytes: make([]byte, 16),
}
switch magic := r.readUint32(); magic {
case 0xa1b2c3d4:
r.flip = false
case 0xd4c3b2a1:
r.flip = true
default:
return nil, fmt.Errorf("pcap: bad magic number: %0x", magic)
}
r.Header = FileHeader{
MagicNumber: 0xa1b2c3d4,
VersionMajor: r.readUint16(),
VersionMinor: r.readUint16(),
TimeZone: r.readInt32(),
SigFigs: r.readUint32(),
SnapLen: r.readUint32(),
Network: r.readUint32(),
}
return r, nil
}
// Next returns the next packet or nil if no more packets can be read.
func (r *Reader) Next() *Packet {
d := r.sixteenBytes
r.err = r.read(d)
if r.err != nil {
return nil
}
timeSec := asUint32(d[0:4], r.flip)
timeUsec := asUint32(d[4:8], r.flip)
capLen := asUint32(d[8:12], r.flip)
origLen := asUint32(d[12:16], r.flip)
data := make([]byte, capLen)
if r.err = r.read(data); r.err != nil {
return nil
}
return &Packet{
Time: time.Unix(int64(timeSec), int64(timeUsec)),
Caplen: capLen,
Len: origLen,
Data: data,
}
}
func (r *Reader) read(data []byte) error {
var err error
n, err := r.buf.Read(data)
for err == nil && n != len(data) {
var chunk int
chunk, err = r.buf.Read(data[n:])
n += chunk
}
if len(data) == n {
return nil
}
return err
}
func (r *Reader) readUint32() uint32 {
data := r.fourBytes
if r.err = r.read(data); r.err != nil {
return 0
}
return asUint32(data, r.flip)
}
func (r *Reader) readInt32() int32 {
data := r.fourBytes
if r.err = r.read(data); r.err != nil {
return 0
}
return int32(asUint32(data, r.flip))
}
func (r *Reader) readUint16() uint16 {
data := r.twoBytes
if r.err = r.read(data); r.err != nil {
return 0
}
return asUint16(data, r.flip)
}
// Writer writes a pcap file.
type Writer struct {
writer io.Writer
buf []byte
}
// NewWriter creates a Writer that stores output in an io.Writer.
// The FileHeader is written immediately.
func NewWriter(writer io.Writer, header *FileHeader) (*Writer, error) {
w := &Writer{
writer: writer,
buf: make([]byte, 24),
}
binary.LittleEndian.PutUint32(w.buf, header.MagicNumber)
binary.LittleEndian.PutUint16(w.buf[4:], header.VersionMajor)
binary.LittleEndian.PutUint16(w.buf[6:], header.VersionMinor)
binary.LittleEndian.PutUint32(w.buf[8:], uint32(header.TimeZone))
binary.LittleEndian.PutUint32(w.buf[12:], header.SigFigs)
binary.LittleEndian.PutUint32(w.buf[16:], header.SnapLen)
binary.LittleEndian.PutUint32(w.buf[20:], header.Network)
if _, err := writer.Write(w.buf); err != nil {
return nil, err
}
return w, nil
}
// Writer writes a packet to the underlying writer.
func (w *Writer) Write(pkt *Packet) error {
binary.LittleEndian.PutUint32(w.buf, uint32(pkt.Time.Unix()))
binary.LittleEndian.PutUint32(w.buf[4:], uint32(pkt.Time.Nanosecond()))
binary.LittleEndian.PutUint32(w.buf[8:], uint32(pkt.Time.Unix()))
binary.LittleEndian.PutUint32(w.buf[12:], pkt.Len)
if _, err := w.writer.Write(w.buf[:16]); err != nil {
return err
}
_, err := w.writer.Write(pkt.Data)
return err
}
func asUint32(data []byte, flip bool) uint32 {
if flip {
return binary.BigEndian.Uint32(data)
}
return binary.LittleEndian.Uint32(data)
}
func asUint16(data []byte, flip bool) uint16 {
if flip {
return binary.BigEndian.Uint16(data)
}
return binary.LittleEndian.Uint16(data)
}
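The Reader and Writer above also allow offline parsing without cgo. A minimal sketch of reading a capture file with the API shown in this file; the file name is a placeholder:

package main

import (
	"fmt"
	"os"

	pcap "github.com/akrennmair/gopcap"
)

func main() {
	f, err := os.Open("capture.pcap") // placeholder path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	r, err := pcap.NewReader(f)
	if err != nil {
		panic(err)
	}
	// Next returns nil once no more packets can be read
	for pkt := r.Next(); pkt != nil; pkt = r.Next() {
		pkt.Decode()
		fmt.Println(pkt.String())
	}
}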

@ -1,266 +0,0 @@
// Interface to both live and offline pcap parsing.
package pcap
/*
#cgo linux LDFLAGS: -lpcap
#cgo freebsd LDFLAGS: -lpcap
#cgo darwin LDFLAGS: -lpcap
#cgo windows CFLAGS: -I C:/WpdPack/Include
#cgo windows,386 LDFLAGS: -L C:/WpdPack/Lib -lwpcap
#cgo windows,amd64 LDFLAGS: -L C:/WpdPack/Lib/x64 -lwpcap
#include <stdlib.h>
#include <pcap.h>
// Workaround for not knowing how to cast to const u_char**
int hack_pcap_next_ex(pcap_t *p, struct pcap_pkthdr **pkt_header,
u_char **pkt_data) {
return pcap_next_ex(p, pkt_header, (const u_char **)pkt_data);
}
*/
import "C"
import (
"errors"
"net"
"syscall"
"time"
"unsafe"
)
type Pcap struct {
cptr *C.pcap_t
}
type Stat struct {
PacketsReceived uint32
PacketsDropped uint32
PacketsIfDropped uint32
}
type Interface struct {
Name string
Description string
Addresses []IFAddress
// TODO: add more elements
}
type IFAddress struct {
IP net.IP
Netmask net.IPMask
// TODO: add broadcast + PtP dst ?
}
func (p *Pcap) Next() (pkt *Packet) {
rv, _ := p.NextEx()
return rv
}
// Openlive opens a device and returns a *Pcap handler
func Openlive(device string, snaplen int32, promisc bool, timeout_ms int32) (handle *Pcap, err error) {
var buf *C.char
buf = (*C.char)(C.calloc(ERRBUF_SIZE, 1))
h := new(Pcap)
var pro int32
if promisc {
pro = 1
}
dev := C.CString(device)
defer C.free(unsafe.Pointer(dev))
h.cptr = C.pcap_open_live(dev, C.int(snaplen), C.int(pro), C.int(timeout_ms), buf)
if nil == h.cptr {
handle = nil
err = errors.New(C.GoString(buf))
} else {
handle = h
}
C.free(unsafe.Pointer(buf))
return
}
func Openoffline(file string) (handle *Pcap, err error) {
var buf *C.char
buf = (*C.char)(C.calloc(ERRBUF_SIZE, 1))
h := new(Pcap)
cf := C.CString(file)
defer C.free(unsafe.Pointer(cf))
h.cptr = C.pcap_open_offline(cf, buf)
if nil == h.cptr {
handle = nil
err = errors.New(C.GoString(buf))
} else {
handle = h
}
C.free(unsafe.Pointer(buf))
return
}
func (p *Pcap) NextEx() (pkt *Packet, result int32) {
var pkthdr *C.struct_pcap_pkthdr
var buf_ptr *C.u_char
var buf unsafe.Pointer
result = int32(C.hack_pcap_next_ex(p.cptr, &pkthdr, &buf_ptr))
buf = unsafe.Pointer(buf_ptr)
if nil == buf {
return
}
pkt = new(Packet)
pkt.Time = time.Unix(int64(pkthdr.ts.tv_sec), int64(pkthdr.ts.tv_usec)*1000)
pkt.Caplen = uint32(pkthdr.caplen)
pkt.Len = uint32(pkthdr.len)
pkt.Data = C.GoBytes(buf, C.int(pkthdr.caplen))
return
}
func (p *Pcap) Close() {
C.pcap_close(p.cptr)
}
func (p *Pcap) Geterror() error {
return errors.New(C.GoString(C.pcap_geterr(p.cptr)))
}
func (p *Pcap) Getstats() (stat *Stat, err error) {
var cstats _Ctype_struct_pcap_stat
if -1 == C.pcap_stats(p.cptr, &cstats) {
return nil, p.Geterror()
}
stats := new(Stat)
stats.PacketsReceived = uint32(cstats.ps_recv)
stats.PacketsDropped = uint32(cstats.ps_drop)
stats.PacketsIfDropped = uint32(cstats.ps_ifdrop)
return stats, nil
}
func (p *Pcap) Setfilter(expr string) (err error) {
var bpf _Ctype_struct_bpf_program
cexpr := C.CString(expr)
defer C.free(unsafe.Pointer(cexpr))
if -1 == C.pcap_compile(p.cptr, &bpf, cexpr, 1, 0) {
return p.Geterror()
}
if -1 == C.pcap_setfilter(p.cptr, &bpf) {
C.pcap_freecode(&bpf)
return p.Geterror()
}
C.pcap_freecode(&bpf)
return nil
}
func Version() string {
return C.GoString(C.pcap_lib_version())
}
func (p *Pcap) Datalink() int {
return int(C.pcap_datalink(p.cptr))
}
func (p *Pcap) Setdatalink(dlt int) error {
if -1 == C.pcap_set_datalink(p.cptr, C.int(dlt)) {
return p.Geterror()
}
return nil
}
func DatalinkValueToName(dlt int) string {
if name := C.pcap_datalink_val_to_name(C.int(dlt)); name != nil {
return C.GoString(name)
}
return ""
}
func DatalinkValueToDescription(dlt int) string {
if desc := C.pcap_datalink_val_to_description(C.int(dlt)); desc != nil {
return C.GoString(desc)
}
return ""
}
func Findalldevs() (ifs []Interface, err error) {
var buf *C.char
buf = (*C.char)(C.calloc(ERRBUF_SIZE, 1))
defer C.free(unsafe.Pointer(buf))
var alldevsp *C.pcap_if_t
if -1 == C.pcap_findalldevs((**C.pcap_if_t)(&alldevsp), buf) {
return nil, errors.New(C.GoString(buf))
}
defer C.pcap_freealldevs((*C.pcap_if_t)(alldevsp))
dev := alldevsp
var i uint32
for i = 0; dev != nil; dev = (*C.pcap_if_t)(dev.next) {
i++
}
ifs = make([]Interface, i)
dev = alldevsp
for j := uint32(0); dev != nil; dev = (*C.pcap_if_t)(dev.next) {
var iface Interface
iface.Name = C.GoString(dev.name)
iface.Description = C.GoString(dev.description)
iface.Addresses = findalladdresses(dev.addresses)
// TODO: add more elements
ifs[j] = iface
j++
}
return
}
func findalladdresses(addresses *_Ctype_struct_pcap_addr) (retval []IFAddress) {
// TODO - make it support more than IPv4 and IPv6?
retval = make([]IFAddress, 0, 1)
for curaddr := addresses; curaddr != nil; curaddr = (*_Ctype_struct_pcap_addr)(curaddr.next) {
var a IFAddress
var err error
if a.IP, err = sockaddr_to_IP((*syscall.RawSockaddr)(unsafe.Pointer(curaddr.addr))); err != nil {
continue
}
if a.Netmask, err = sockaddr_to_IP((*syscall.RawSockaddr)(unsafe.Pointer(curaddr.addr))); err != nil {
continue
}
retval = append(retval, a)
}
return
}
func sockaddr_to_IP(rsa *syscall.RawSockaddr) (IP []byte, err error) {
switch rsa.Family {
case syscall.AF_INET:
pp := (*syscall.RawSockaddrInet4)(unsafe.Pointer(rsa))
IP = make([]byte, 4)
for i := 0; i < len(IP); i++ {
IP[i] = pp.Addr[i]
}
return
case syscall.AF_INET6:
pp := (*syscall.RawSockaddrInet6)(unsafe.Pointer(rsa))
IP = make([]byte, 16)
for i := 0; i < len(IP); i++ {
IP[i] = pp.Addr[i]
}
return
}
err = errors.New("Unsupported address type")
return
}
func (p *Pcap) Inject(data []byte) (err error) {
buf := (*C.char)(C.malloc((C.size_t)(len(data))))
for i := 0; i < len(data); i++ {
*(*byte)(unsafe.Pointer(uintptr(unsafe.Pointer(buf)) + uintptr(i))) = data[i]
}
if -1 == C.pcap_sendpacket(p.cptr, (*C.u_char)(unsafe.Pointer(buf)), (C.int)(len(data))) {
err = p.Geterror()
}
C.free(unsafe.Pointer(buf))
return
}
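For completeness, a purely illustrative sketch of device enumeration with the same removed wrapper, using Findalldevs and the Interface/IFAddress types declared above:

package main

import (
	"fmt"

	pcap "github.com/akrennmair/gopcap"
)

func main() {
	ifs, err := pcap.Findalldevs()
	if err != nil {
		panic(err)
	}
	for _, dev := range ifs {
		fmt.Printf("%s (%s)\n", dev.Name, dev.Description)
		for _, addr := range dev.Addresses {
			fmt.Printf("  %s\n", addr.IP)
		}
	}
}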

@ -170,9 +170,9 @@ func Invoke(ctx context.Context, method string, args, reply interface{}, cc *Cli
if _, ok := err.(*rpcError); ok {
return err
}
if err == errConnClosing {
if err == errConnClosing || err == errConnUnavailable {
if c.failFast {
return Errorf(codes.Unavailable, "%v", errConnClosing)
return Errorf(codes.Unavailable, "%v", err)
}
continue
}

@ -73,7 +73,9 @@ var (
errConnDrain = errors.New("grpc: the connection is drained")
// errConnClosing indicates that the connection is closing.
errConnClosing = errors.New("grpc: the connection is closing")
errNoAddr = errors.New("grpc: there is no address available to dial")
// errConnUnavailable indicates that the connection is unavailable.
errConnUnavailable = errors.New("grpc: the connection is unavailable")
errNoAddr = errors.New("grpc: there is no address available to dial")
// minimum time to give a connection to complete
minConnectTimeout = 20 * time.Second
)
@ -213,9 +215,14 @@ func WithUserAgent(s string) DialOption {
}
}
// Dial creates a client connection the given target.
// Dial creates a client connection to the given target.
func Dial(target string, opts ...DialOption) (*ClientConn, error) {
ctx := context.Background()
return DialContext(context.Background(), target, opts...)
}
// DialContext creates a client connection to the given target
// using the supplied context.
func DialContext(ctx context.Context, target string, opts ...DialOption) (*ClientConn, error) {
cc := &ClientConn{
target: target,
conns: make(map[Address]*addrConn),
@ -472,6 +479,10 @@ func (cc *ClientConn) getTransport(ctx context.Context, opts BalancerGetOptions)
if cc.dopts.balancer == nil {
// If balancer is nil, there should be only one addrConn available.
cc.mu.RLock()
if cc.conns == nil {
cc.mu.RUnlock()
return nil, nil, toRPCErr(ErrClientConnClosing)
}
for _, ac = range cc.conns {
// Break after the first iteration to get the first addrConn.
ok = true
@ -501,11 +512,7 @@ func (cc *ClientConn) getTransport(ctx context.Context, opts BalancerGetOptions)
}
return nil, nil, errConnClosing
}
// ac.wait should block on transient failure only if balancer is nil and RPC is non-failfast.
// - If RPC is failfast, ac.wait should not block.
// - If balancer is not nil, ac.wait should return errConnClosing on transient failure
// so that non-failfast RPCs will try to get a new transport instead of waiting on ac.
t, err := ac.wait(ctx, cc.dopts.balancer == nil && opts.BlockingWait)
t, err := ac.wait(ctx, cc.dopts.balancer != nil, !opts.BlockingWait)
if err != nil {
if put != nil {
put()
@ -757,36 +764,42 @@ func (ac *addrConn) transportMonitor() {
}
// wait blocks until i) the new transport is up or ii) ctx is done or iii) ac is closed or
// iv) transport is in TransientFailure and blocking is false.
func (ac *addrConn) wait(ctx context.Context, blocking bool) (transport.ClientTransport, error) {
// iv) transport is in TransientFailure and there's no balancer/failfast is true.
func (ac *addrConn) wait(ctx context.Context, hasBalancer, failfast bool) (transport.ClientTransport, error) {
for {
ac.mu.Lock()
switch {
case ac.state == Shutdown:
err := ac.tearDownErr
if failfast || !hasBalancer {
// RPC is failfast or balancer is nil. This RPC should fail with ac.tearDownErr.
err := ac.tearDownErr
ac.mu.Unlock()
return nil, err
}
ac.mu.Unlock()
return nil, err
return nil, errConnClosing
case ac.state == Ready:
ct := ac.transport
ac.mu.Unlock()
return ct, nil
case ac.state == TransientFailure && !blocking:
ac.mu.Unlock()
return nil, errConnClosing
default:
ready := ac.ready
if ready == nil {
ready = make(chan struct{})
ac.ready = ready
}
ac.mu.Unlock()
select {
case <-ctx.Done():
return nil, toRPCErr(ctx.Err())
// Wait until the new transport is ready or failed.
case <-ready:
case ac.state == TransientFailure:
if failfast || hasBalancer {
ac.mu.Unlock()
return nil, errConnUnavailable
}
}
ready := ac.ready
if ready == nil {
ready = make(chan struct{})
ac.ready = ready
}
ac.mu.Unlock()
select {
case <-ctx.Done():
return nil, toRPCErr(ctx.Err())
// Wait until the new transport is ready or failed.
case <-ready:
}
}
}
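The vendored grpc bump above adds DialContext and the errConnUnavailable error that the retry loops above check for. A hedged sketch of the new entry point; the target address is a placeholder, and WithInsecure/WithBlock are standard grpc-go 1.0 dial options rather than part of this diff:

package main

import (
	"time"

	"golang.org/x/net/context"
	"google.golang.org/grpc"
)

func main() {
	// bound connection setup with a context instead of only dial options
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, "127.0.0.1:2379",
		grpc.WithInsecure(), // assumes a plaintext listener for the example
		grpc.WithBlock(),    // block so the context deadline applies to dialing
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	// RPCs issued with grpc.FailFast(false), as the etcd watch client does,
	// wait out TransientFailure instead of returning codes.Unavailable.
}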

@ -146,9 +146,9 @@ func NewClientStream(ctx context.Context, desc *StreamDesc, cc *ClientConn, meth
if _, ok := err.(*rpcError); ok {
return nil, err
}
if err == errConnClosing {
if err == errConnClosing || err == errConnUnavailable {
if c.failFast {
return nil, Errorf(codes.Unavailable, "%v", errConnClosing)
return nil, Errorf(codes.Unavailable, "%v", err)
}
continue
}

@ -53,8 +53,8 @@ func SRVGetCluster(name, dns string, defaultToken string, apurls types.URLs) (st
return err
}
for _, srv := range addrs {
target := strings.TrimSuffix(srv.Target, ".")
host := net.JoinHostPort(target, fmt.Sprintf("%d", srv.Port))
port := fmt.Sprintf("%d", srv.Port)
host := net.JoinHostPort(srv.Target, port)
tcpAddr, err := resolveTCPAddr("tcp", host)
if err != nil {
plog.Warningf("couldn't resolve host %s during SRV discovery", host)
@ -70,8 +70,11 @@ func SRVGetCluster(name, dns string, defaultToken string, apurls types.URLs) (st
n = fmt.Sprintf("%d", tempName)
tempName += 1
}
stringParts = append(stringParts, fmt.Sprintf("%s=%s%s", n, prefix, host))
plog.Noticef("got bootstrap from DNS for %s at %s%s", service, prefix, host)
// SRV records have a trailing dot but URL shouldn't.
shortHost := strings.TrimSuffix(srv.Target, ".")
urlHost := net.JoinHostPort(shortHost, port)
stringParts = append(stringParts, fmt.Sprintf("%s=%s%s", n, prefix, urlHost))
plog.Noticef("got bootstrap from DNS for %s at %s%s", service, prefix, urlHost)
}
return nil
}
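A standalone illustration of the trailing-dot handling introduced above: the SRV target keeps its dot for resolution, while the advertised bootstrap URL uses the trimmed host. The record values mirror the test data below:

package main

import (
	"fmt"
	"net"
	"strings"
)

func main() {
	srv := &net.SRV{Target: "1.example.com.", Port: 2480}
	port := fmt.Sprintf("%d", srv.Port)

	// used for resolveTCPAddr: keeps the trailing dot from the SRV record
	fmt.Println(net.JoinHostPort(srv.Target, port)) // 1.example.com.:2480

	// used in the member URL: trailing dot stripped
	short := strings.TrimSuffix(srv.Target, ".")
	fmt.Println(net.JoinHostPort(short, port)) // 1.example.com:2480
}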

@ -17,6 +17,7 @@ package discovery
import (
"errors"
"net"
"strings"
"testing"
"github.com/coreos/etcd/pkg/testutil"
@ -29,11 +30,22 @@ func TestSRVGetCluster(t *testing.T) {
}()
name := "dnsClusterTest"
dns := map[string]string{
"1.example.com.:2480": "10.0.0.1:2480",
"2.example.com.:2480": "10.0.0.2:2480",
"3.example.com.:2480": "10.0.0.3:2480",
"4.example.com.:2380": "10.0.0.3:2380",
}
srvAll := []*net.SRV{
{Target: "1.example.com.", Port: 2480},
{Target: "2.example.com.", Port: 2480},
{Target: "3.example.com.", Port: 2480},
}
tests := []struct {
withSSL []*net.SRV
withoutSSL []*net.SRV
urls []string
dns map[string]string
expected string
}{
@ -41,61 +53,50 @@ func TestSRVGetCluster(t *testing.T) {
[]*net.SRV{},
[]*net.SRV{},
nil,
nil,
"",
},
{
[]*net.SRV{
{Target: "10.0.0.1", Port: 2480},
{Target: "10.0.0.2", Port: 2480},
{Target: "10.0.0.3", Port: 2480},
},
srvAll,
[]*net.SRV{},
nil,
"0=https://1.example.com:2480,1=https://2.example.com:2480,2=https://3.example.com:2480",
},
{
srvAll,
[]*net.SRV{{Target: "4.example.com.", Port: 2380}},
nil,
"0=https://10.0.0.1:2480,1=https://10.0.0.2:2480,2=https://10.0.0.3:2480",
"0=https://1.example.com:2480,1=https://2.example.com:2480,2=https://3.example.com:2480,3=http://4.example.com:2380",
},
{
[]*net.SRV{
{Target: "10.0.0.1", Port: 2480},
{Target: "10.0.0.2", Port: 2480},
{Target: "10.0.0.3", Port: 2480},
},
[]*net.SRV{
{Target: "10.0.0.1", Port: 2380},
},
nil,
nil,
"0=https://10.0.0.1:2480,1=https://10.0.0.2:2480,2=https://10.0.0.3:2480,3=http://10.0.0.1:2380",
},
{
[]*net.SRV{
{Target: "10.0.0.1", Port: 2480},
{Target: "10.0.0.2", Port: 2480},
{Target: "10.0.0.3", Port: 2480},
},
[]*net.SRV{
{Target: "10.0.0.1", Port: 2380},
},
srvAll,
[]*net.SRV{{Target: "4.example.com.", Port: 2380}},
[]string{"https://10.0.0.1:2480"},
nil,
"dnsClusterTest=https://10.0.0.1:2480,0=https://10.0.0.2:2480,1=https://10.0.0.3:2480,2=http://10.0.0.1:2380",
"dnsClusterTest=https://1.example.com:2480,0=https://2.example.com:2480,1=https://3.example.com:2480,2=http://4.example.com:2380",
},
// matching local member with resolved addr and return unresolved hostnames
{
[]*net.SRV{
{Target: "1.example.com.", Port: 2480},
{Target: "2.example.com.", Port: 2480},
{Target: "3.example.com.", Port: 2480},
},
srvAll,
nil,
[]string{"https://10.0.0.1:2480"},
map[string]string{"1.example.com:2480": "10.0.0.1:2480", "2.example.com:2480": "10.0.0.2:2480", "3.example.com:2480": "10.0.0.3:2480"},
"dnsClusterTest=https://1.example.com:2480,0=https://2.example.com:2480,1=https://3.example.com:2480",
},
// invalid
}
resolveTCPAddr = func(network, addr string) (*net.TCPAddr, error) {
if strings.Contains(addr, "10.0.0.") {
// accept IP addresses when resolving apurls
return net.ResolveTCPAddr(network, addr)
}
if dns[addr] == "" {
return nil, errors.New("missing dns record")
}
return net.ResolveTCPAddr(network, dns[addr])
}
for i, tt := range tests {
@ -108,12 +109,6 @@ func TestSRVGetCluster(t *testing.T) {
}
return "", nil, errors.New("Unknown service in mock")
}
resolveTCPAddr = func(network, addr string) (*net.TCPAddr, error) {
if tt.dns == nil || tt.dns[addr] == "" {
return net.ResolveTCPAddr(network, addr)
}
return net.ResolveTCPAddr(network, tt.dns[addr])
}
urls := testutil.MustNewURLs(t, tt.urls)
str, token, err := SRVGetCluster(name, "example.com", "token", urls)
if err != nil {

@ -75,11 +75,11 @@ func authCredWriteKeyTest(cx ctlCtx) {
cx.user, cx.pass = "root", "root"
authSetupTestUser(cx)
// confirm root role doesn't grant access to all keys
if err := ctlV3PutFailPerm(cx, "foo", "bar"); err != nil {
// confirm root role can access to all keys
if err := ctlV3Put(cx, "foo", "bar", ""); err != nil {
cx.t.Fatal(err)
}
if err := ctlV3GetFailPerm(cx, "foo"); err != nil {
if err := ctlV3Get(cx, []string{"foo"}, []kv{{"foo", "bar"}}...); err != nil {
cx.t.Fatal(err)
}
@ -90,17 +90,17 @@ func authCredWriteKeyTest(cx ctlCtx) {
}
// confirm put failed
cx.user, cx.pass = "test-user", "pass"
if err := ctlV3Get(cx, []string{"foo"}, []kv{{"foo", "a"}}...); err != nil {
if err := ctlV3Get(cx, []string{"foo"}, []kv{{"foo", "bar"}}...); err != nil {
cx.t.Fatal(err)
}
// try good user
cx.user, cx.pass = "test-user", "pass"
if err := ctlV3Put(cx, "foo", "bar", ""); err != nil {
if err := ctlV3Put(cx, "foo", "bar2", ""); err != nil {
cx.t.Fatal(err)
}
// confirm put succeeded
if err := ctlV3Get(cx, []string{"foo"}, []kv{{"foo", "bar"}}...); err != nil {
if err := ctlV3Get(cx, []string{"foo"}, []kv{{"foo", "bar2"}}...); err != nil {
cx.t.Fatal(err)
}
@ -111,7 +111,7 @@ func authCredWriteKeyTest(cx ctlCtx) {
}
// confirm put failed
cx.user, cx.pass = "test-user", "pass"
if err := ctlV3Get(cx, []string{"foo"}, []kv{{"foo", "bar"}}...); err != nil {
if err := ctlV3Get(cx, []string{"foo"}, []kv{{"foo", "bar2"}}...); err != nil {
cx.t.Fatal(err)
}
}
@ -282,10 +282,6 @@ func ctlV3PutFailPerm(cx ctlCtx, key, val string) error {
return spawnWithExpect(append(cx.PrefixArgs(), "put", key, val), "permission denied")
}
func ctlV3GetFailPerm(cx ctlCtx, key string) error {
return spawnWithExpect(append(cx.PrefixArgs(), "get", key), "permission denied")
}
func authSetupTestUser(cx ctlCtx) {
if err := ctlV3User(cx, []string{"add", "test-user", "--interactive=false"}, "User test-user created", []string{"pass"}); err != nil {
cx.t.Fatal(err)

@ -89,8 +89,8 @@ func TestCtlV3Migrate(t *testing.T) {
if len(resp.Kvs) != 1 {
t.Fatalf("len(resp.Kvs) expected 1, got %+v", resp.Kvs)
}
if resp.Kvs[0].CreateRevision != 4 {
t.Fatalf("resp.Kvs[0].CreateRevision expected 4, got %d", resp.Kvs[0].CreateRevision)
if resp.Kvs[0].CreateRevision != 7 {
t.Fatalf("resp.Kvs[0].CreateRevision expected 7, got %d", resp.Kvs[0].CreateRevision)
}
}

@ -33,10 +33,18 @@ func snapshotTest(cx ctlCtx) {
}
}
leaseID, err := ctlV3LeaseGrant(cx, 100)
if err != nil {
cx.t.Fatalf("snapshot: ctlV3LeaseGrant error (%v)", err)
}
if err = ctlV3Put(cx, "withlease", "withlease", leaseID); err != nil {
cx.t.Fatalf("snapshot: ctlV3Put error (%v)", err)
}
fpath := "test.snapshot"
defer os.RemoveAll(fpath)
if err := ctlV3SnapshotSave(cx, fpath); err != nil {
if err = ctlV3SnapshotSave(cx, fpath); err != nil {
cx.t.Fatalf("snapshotTest ctlV3SnapshotSave error (%v)", err)
}
@ -44,11 +52,11 @@ func snapshotTest(cx ctlCtx) {
if err != nil {
cx.t.Fatalf("snapshotTest getSnapshotStatus error (%v)", err)
}
if st.Revision != 4 {
if st.Revision != 5 {
cx.t.Fatalf("expected 4, got %d", st.Revision)
}
if st.TotalKey < 3 {
cx.t.Fatalf("expected at least 3, got %d", st.TotalKey)
if st.TotalKey < 4 {
cx.t.Fatalf("expected at least 4, got %d", st.TotalKey)
}
}

@ -39,15 +39,23 @@ func txnTestSuccess(cx ctlCtx) {
if err := ctlV3Put(cx, "key2", "value2", ""); err != nil {
cx.t.Fatalf("txnTestSuccess ctlV3Put error (%v)", err)
}
rqs := txnRequests{
compare: []string{`version("key1") = "1"`, `version("key2") = "1"`},
ifSucess: []string{"get key1", "get key2"},
ifFail: []string{`put key1 "fail"`, `put key2 "fail"`},
results: []string{"SUCCESS", "key1", "value1", "key2", "value2"},
rqs := []txnRequests{
{
compare: []string{`version("key1") = "1"`, `version("key2") = "1"`},
ifSucess: []string{"get key1", "get key2", `put "key \"with\" space" "value \x23"`},
ifFail: []string{`put key1 "fail"`, `put key2 "fail"`},
results: []string{"SUCCESS", "key1", "value1", "key2", "value2"},
},
{
compare: []string{`version("key \"with\" space") = "1"`},
ifSucess: []string{`get "key \"with\" space"`},
results: []string{"SUCCESS", `key "with" space`, "value \x23"},
},
}
if err := ctlV3Txn(cx, rqs); err != nil {
cx.t.Fatal(err)
for _, rq := range rqs {
if err := ctlV3Txn(cx, rq); err != nil {
cx.t.Fatal(err)
}
}
}

@ -221,7 +221,7 @@ OK
### WATCH [options] [key or prefix] [range_end]
Watch watches events stream on keys or prefixes, [key or prefix, range_end) if `range-end` is given. The watch command runs until it encounters an error or is terminated by the user.
Watch watches events stream on keys or prefixes, [key or prefix, range_end) if `range-end` is given. The watch command runs until it encounters an error or is terminated by the user. If range_end is given, it must be lexicographically greater than key or "\x00".
#### Options
@ -231,6 +231,8 @@ Watch watches events stream on keys or prefixes, [key or prefix, range_end) if `
- prefix -- watch on a prefix if prefix is set.
- prev-kv -- get the previous key-value pair before the event happens.
- rev -- the revision to start watching. Specifying a revision is useful for observing past events.
#### Input Format
@ -245,7 +247,7 @@ watch [options] <key or prefix>\n
##### Simple reply
- \<event\>\n\<key\>\n\<value\>\n\<event\>\n\<next_key\>\n\<next_value\>\n...
- \<event\>[\n\<old_key\>\n\<old_value\>]\n\<key\>\n\<value\>\n\<event\>\n\<next_key\>\n\<next_value\>\n...
- Additional error string if WATCH failed. Exit code is non-zero.
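As a hedged illustration of the new prev-kv output (keys and values are placeholders): with "etcdctl watch --prev-kv foo" running, an "etcdctl put foo bar2" issued elsewhere, where foo previously held "bar", prints

PUT
foo
bar
foo
bar2

with the first key/value pair being the old one requested by --prev-kv.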

@ -23,6 +23,7 @@ import (
var (
delPrefix bool
delPrevKV bool
)
// NewDelCommand returns the cobra command for "del".
@ -34,6 +35,7 @@ func NewDelCommand() *cobra.Command {
}
cmd.Flags().BoolVar(&delPrefix, "prefix", false, "delete keys with matching prefix")
cmd.Flags().BoolVar(&delPrevKV, "prev-kv", false, "return deleted key-value pairs")
return cmd
}
@ -65,6 +67,9 @@ func getDelOp(cmd *cobra.Command, args []string) (string, []clientv3.OpOption) {
if delPrefix {
opts = append(opts, clientv3.WithPrefix())
}
if delPrevKV {
opts = append(opts, clientv3.WithPrevKV())
}
return key, opts
}

@ -243,7 +243,7 @@ func authCfgFromCmd(cmd *cobra.Command) *authCfg {
var cfg authCfg
splitted := strings.SplitN(userFlag, ":", 2)
if len(splitted) == 0 {
if len(splitted) < 2 {
cfg.username = userFlag
cfg.password, err = speakeasy.Ask("Password: ")
if err != nil {

@ -27,11 +27,14 @@ import (
"github.com/coreos/etcd/client"
etcdErr "github.com/coreos/etcd/error"
"github.com/coreos/etcd/etcdserver"
"github.com/coreos/etcd/etcdserver/api"
pb "github.com/coreos/etcd/etcdserver/etcdserverpb"
"github.com/coreos/etcd/etcdserver/membership"
"github.com/coreos/etcd/mvcc"
"github.com/coreos/etcd/mvcc/backend"
"github.com/coreos/etcd/mvcc/mvccpb"
"github.com/coreos/etcd/pkg/pbutil"
"github.com/coreos/etcd/pkg/types"
"github.com/coreos/etcd/raft/raftpb"
"github.com/coreos/etcd/snap"
"github.com/coreos/etcd/store"
@ -42,9 +45,10 @@ import (
)
var (
migrateDatadir string
migrateWALdir string
migrateTransformer string
migrateExcludeTTLKey bool
migrateDatadir string
migrateWALdir string
migrateTransformer string
)
// NewMigrateCommand returns the cobra command for "migrate".
@ -55,6 +59,7 @@ func NewMigrateCommand() *cobra.Command {
Run: migrateCommandFunc,
}
mc.Flags().BoolVar(&migrateExcludeTTLKey, "no-ttl", false, "Do not convert TTL keys")
mc.Flags().StringVar(&migrateDatadir, "data-dir", "", "Path to the data directory")
mc.Flags().StringVar(&migrateWALdir, "wal-dir", "", "Path to the WAL directory")
mc.Flags().StringVar(&migrateTransformer, "transformer", "", "Path to the user-provided transformer program")
@ -74,18 +79,17 @@ func migrateCommandFunc(cmd *cobra.Command, args []string) {
writer, reader, errc = defaultTransformer()
}
st := rebuildStoreV2()
st, index := rebuildStoreV2()
be := prepareBackend()
defer be.Close()
maxIndexc := make(chan uint64, 1)
go func() {
maxIndexc <- writeStore(writer, st)
writeStore(writer, st)
writer.Close()
}()
readKeys(reader, be)
mvcc.UpdateConsistentIndex(be, <-maxIndexc)
mvcc.UpdateConsistentIndex(be, index)
err := <-errc
if err != nil {
fmt.Println("failed to transform keys")
@ -106,7 +110,10 @@ func prepareBackend() backend.Backend {
return be
}
func rebuildStoreV2() store.Store {
func rebuildStoreV2() (store.Store, uint64) {
var index uint64
cl := membership.NewCluster("")
waldir := migrateWALdir
if len(waldir) == 0 {
waldir = path.Join(migrateDatadir, "member", "wal")
@ -122,6 +129,7 @@ func rebuildStoreV2() store.Store {
var walsnap walpb.Snapshot
if snapshot != nil {
walsnap.Index, walsnap.Term = snapshot.Metadata.Index, snapshot.Metadata.Term
index = snapshot.Metadata.Index
}
w, err := wal.OpenForRead(waldir, walsnap)
@ -143,9 +151,15 @@ func rebuildStoreV2() store.Store {
}
}
applier := etcdserver.NewApplierV2(st, nil)
cl.SetStore(st)
cl.Recover(api.UpdateCapability)
applier := etcdserver.NewApplierV2(st, cl)
for _, ent := range ents {
if ent.Type != raftpb.EntryNormal {
if ent.Type == raftpb.EntryConfChange {
var cc raftpb.ConfChange
pbutil.MustUnmarshal(&cc, ent.Data)
applyConf(cc, cl)
continue
}
@ -160,9 +174,34 @@ func rebuildStoreV2() store.Store {
applyRequest(req, applier)
}
}
if ent.Index > index {
index = ent.Index
}
}
return st
return st, index
}
func applyConf(cc raftpb.ConfChange, cl *membership.RaftCluster) {
if err := cl.ValidateConfigurationChange(cc); err != nil {
return
}
switch cc.Type {
case raftpb.ConfChangeAddNode:
m := new(membership.Member)
if err := json.Unmarshal(cc.Context, m); err != nil {
panic(err)
}
cl.AddMember(m)
case raftpb.ConfChangeRemoveNode:
cl.RemoveMember(types.ID(cc.NodeID))
case raftpb.ConfChangeUpdateNode:
m := new(membership.Member)
if err := json.Unmarshal(cc.Context, m); err != nil {
panic(err)
}
cl.UpdateRaftAttributes(m.ID, m.RaftAttributes)
}
}
func applyRequest(r *pb.Request, applyV2 etcdserver.ApplierV2) {
@ -216,11 +255,13 @@ func writeKeys(w io.Writer, n *store.NodeExtern) uint64 {
if n.Dir {
n.Nodes = nil
}
b, err := json.Marshal(n)
if err != nil {
ExitWithError(ExitError, err)
if !migrateExcludeTTLKey || n.TTL == 0 {
b, err := json.Marshal(n)
if err != nil {
ExitWithError(ExitError, err)
}
fmt.Fprint(w, string(b))
}
fmt.Fprintf(w, string(b))
for _, nn := range nodes {
max := writeKeys(w, nn)
if max > maxIndex {

@ -108,6 +108,9 @@ type simplePrinter struct {
func (s *simplePrinter) Del(resp v3.DeleteResponse) {
fmt.Println(resp.Deleted)
for _, kv := range resp.PrevKvs {
printKV(s.isHex, kv)
}
}
func (s *simplePrinter) Get(resp v3.GetResponse) {
@ -116,7 +119,12 @@ func (s *simplePrinter) Get(resp v3.GetResponse) {
}
}
func (s *simplePrinter) Put(r v3.PutResponse) { fmt.Println("OK") }
func (s *simplePrinter) Put(r v3.PutResponse) {
fmt.Println("OK")
if r.PrevKv != nil {
printKV(s.isHex, r.PrevKv)
}
}
func (s *simplePrinter) Txn(resp v3.TxnResponse) {
if resp.Succeeded {
@ -143,6 +151,9 @@ func (s *simplePrinter) Txn(resp v3.TxnResponse) {
func (s *simplePrinter) Watch(resp v3.WatchResponse) {
for _, e := range resp.Events {
fmt.Println(e.Type)
if e.PrevKv != nil {
printKV(s.isHex, e.PrevKv)
}
printKV(s.isHex, e.Kv)
}
}
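As a hedged illustration of the printer changes above (keys and values are placeholders): "etcdctl put --prev-kv foo bar2" on a key that previously held "bar" prints

OK
foo
bar

and a subsequent "etcdctl del --prev-kv foo" prints

1
foo
bar2

that is, the usual OK or deleted-count line followed by the previous key-value pair(s).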

@ -24,7 +24,8 @@ import (
)
var (
leaseStr string
leaseStr string
putPrevKV bool
)
// NewPutCommand returns the cobra command for "put".
@ -49,6 +50,7 @@ will store the content of the file to <key>.
Run: putCommandFunc,
}
cmd.Flags().StringVar(&leaseStr, "lease", "0", "lease ID (in hexadecimal) to attach to the key")
cmd.Flags().BoolVar(&putPrevKV, "prev-kv", false, "return changed key-value pairs")
return cmd
}
@ -85,6 +87,9 @@ func getPutOp(cmd *cobra.Command, args []string) (string, string, []clientv3.OpO
if id != 0 {
opts = append(opts, clientv3.WithLease(clientv3.LeaseID(id)))
}
if putPrevKV {
opts = append(opts, clientv3.WithPrevKV())
}
return key, value, opts
}

@ -21,6 +21,7 @@ import (
"fmt"
"hash/crc32"
"io"
"math"
"os"
"path"
"reflect"
@ -30,13 +31,17 @@ import (
"github.com/coreos/etcd/etcdserver"
"github.com/coreos/etcd/etcdserver/etcdserverpb"
"github.com/coreos/etcd/etcdserver/membership"
"github.com/coreos/etcd/lease"
"github.com/coreos/etcd/mvcc"
"github.com/coreos/etcd/mvcc/backend"
"github.com/coreos/etcd/pkg/fileutil"
"github.com/coreos/etcd/pkg/types"
"github.com/coreos/etcd/raft"
"github.com/coreos/etcd/raft/raftpb"
"github.com/coreos/etcd/snap"
"github.com/coreos/etcd/store"
"github.com/coreos/etcd/wal"
"github.com/coreos/etcd/wal/walpb"
"github.com/spf13/cobra"
"golang.org/x/net/context"
)
@ -112,7 +117,7 @@ func snapshotSaveCommandFunc(cmd *cobra.Command, args []string) {
partpath := path + ".part"
f, err := os.Create(partpath)
defer f.Close()
if err != nil {
exiterr := fmt.Errorf("could not open %s (%v)", partpath, err)
ExitWithError(ExitBadArgs, exiterr)
@ -131,6 +136,8 @@ func snapshotSaveCommandFunc(cmd *cobra.Command, args []string) {
fileutil.Fsync(f)
f.Close()
if rerr := os.Rename(partpath, path); rerr != nil {
exiterr := fmt.Errorf("could not rename %s to %s (%v)", partpath, path, rerr)
ExitWithError(ExitIO, exiterr)
@ -186,8 +193,8 @@ func snapshotRestoreCommandFunc(cmd *cobra.Command, args []string) {
ExitWithError(ExitInvalidInput, fmt.Errorf("data-dir %q exists", basedir))
}
makeDB(snapdir, args[0])
makeWAL(waldir, cl)
makeDB(snapdir, args[0], len(cl.Members()))
makeWALAndSnap(waldir, snapdir, cl)
}
func initialClusterFromName(name string) string {
@ -199,11 +206,18 @@ func initialClusterFromName(name string) string {
}
// makeWAL creates a WAL for the initial cluster
func makeWAL(waldir string, cl *membership.RaftCluster) {
func makeWALAndSnap(waldir, snapdir string, cl *membership.RaftCluster) {
if err := fileutil.CreateDirAll(waldir); err != nil {
ExitWithError(ExitIO, err)
}
// add members again to persist them to the store we create.
st := store.New(etcdserver.StoreClusterPrefix, etcdserver.StoreKeysPrefix)
cl.SetStore(st)
for _, m := range cl.Members() {
cl.AddMember(m)
}
m := cl.MemberByName(restoreName)
md := &etcdserverpb.Metadata{NodeID: uint64(m.ID), ClusterID: uint64(cl.ID())}
metadata, merr := md.Marshal()
@ -227,7 +241,9 @@ func makeWAL(waldir string, cl *membership.RaftCluster) {
}
ents := make([]raftpb.Entry, len(peers))
nodeIDs := make([]uint64, len(peers))
for i, p := range peers {
nodeIDs[i] = p.ID
cc := raftpb.ConfChange{
Type: raftpb.ConfChangeAddNode,
NodeID: p.ID,
@ -245,20 +261,48 @@ func makeWAL(waldir string, cl *membership.RaftCluster) {
ents[i] = e
}
w.Save(raftpb.HardState{
Term: 1,
commit, term := uint64(len(ents)), uint64(1)
if err := w.Save(raftpb.HardState{
Term: term,
Vote: peers[0].ID,
Commit: uint64(len(ents))}, ents)
Commit: commit}, ents); err != nil {
ExitWithError(ExitIO, err)
}
b, berr := st.Save()
if berr != nil {
ExitWithError(ExitError, berr)
}
raftSnap := raftpb.Snapshot{
Data: b,
Metadata: raftpb.SnapshotMetadata{
Index: commit,
Term: term,
ConfState: raftpb.ConfState{
Nodes: nodeIDs,
},
},
}
snapshotter := snap.New(snapdir)
if err := snapshotter.SaveSnap(raftSnap); err != nil {
panic(err)
}
if err := w.SaveSnapshot(walpb.Snapshot{Index: commit, Term: term}); err != nil {
ExitWithError(ExitIO, err)
}
}
// initIndex implements ConsistentIndexGetter so the snapshot won't block
// the new raft instance by waiting for a future raft index.
type initIndex struct{}
type initIndex int
func (*initIndex) ConsistentIndex() uint64 { return 1 }
func (i *initIndex) ConsistentIndex() uint64 { return uint64(*i) }
// makeDB copies the database snapshot to the snapshot directory
func makeDB(snapdir, dbfile string) {
func makeDB(snapdir, dbfile string, commit int) {
f, ferr := os.OpenFile(dbfile, os.O_RDONLY, 0600)
if ferr != nil {
ExitWithError(ExitInvalidInput, ferr)
@ -329,7 +373,10 @@ func makeDB(snapdir, dbfile string) {
// update consistentIndex so applies go through on etcdserver despite
// having a new raft instance
be := backend.NewDefaultBackend(dbpath)
s := mvcc.NewStore(be, nil, &initIndex{})
// a lessor never timeouts leases
lessor := lease.NewLessor(be, math.MaxInt64)
s := mvcc.NewStore(be, lessor, (*initIndex)(&commit))
id := s.TxnBegin()
btx := be.BatchTx()
del := func(k, v []byte) error {
@ -339,6 +386,7 @@ func makeDB(snapdir, dbfile string) {
// delete stored members from old cluster since using new members
btx.UnsafeForEach([]byte("members"), del)
// todo: add back new members when we start to deprecate old snap file.
btx.UnsafeForEach([]byte("members_removed"), del)
// trigger write-out of new consistent index
s.TxnEnd(id)

@ -77,12 +77,13 @@ func readCompares(r *bufio.Reader) (cmps []clientv3.Cmp) {
if err != nil {
ExitWithError(ExitInvalidInput, err)
}
if len(line) == 1 {
// remove space from the line
line = strings.TrimSpace(line)
if len(line) == 0 {
break
}
// remove trialling \n
line = line[:len(line)-1]
cmp, err := parseCompare(line)
if err != nil {
ExitWithError(ExitInvalidInput, err)
@ -99,12 +100,13 @@ func readOps(r *bufio.Reader) (ops []clientv3.Op) {
if err != nil {
ExitWithError(ExitInvalidInput, err)
}
if len(line) == 1 {
// remove space from the line
line = strings.TrimSpace(line)
if len(line) == 0 {
break
}
// remove trialling \n
line = line[:len(line)-1]
op, err := parseRequestUnion(line)
if err != nil {
ExitWithError(ExitInvalidInput, err)

@ -46,8 +46,23 @@ func addHexPrefix(s string) string {
}
func argify(s string) []string {
r := regexp.MustCompile("'.+'|\".+\"|\\S+")
return r.FindAllString(s, -1)
r := regexp.MustCompile(`"(?:[^"\\]|\\.)*"|'[^']*'|[^'"\s]\S*[^'"\s]?`)
args := r.FindAllString(s, -1)
for i := range args {
if len(args[i]) == 0 {
continue
}
if args[i][0] == '\'' {
// 'single-quoted string'
args[i] = args[i][1 : len(args)-1]
} else if args[i][0] == '"' {
// "double quoted string"
if _, err := fmt.Sscanf(args[i], "%q", &args[i]); err != nil {
ExitWithError(ExitInvalidInput, err)
}
}
}
return args
}
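The new argify regexp above accepts Go-style double-quoted arguments and un-quotes them with %q. A standalone check of that behavior, using a made-up input string rather than one from the test suite:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	r := regexp.MustCompile(`"(?:[^"\\]|\\.)*"|'[^']*'|[^'"\s]\S*[^'"\s]?`)
	args := r.FindAllString(`put "key \"with\" space" v1`, -1)
	for i := range args {
		if len(args[i]) > 0 && args[i][0] == '"' {
			// %q undoes Go-style quoting, so \" becomes a literal quote
			fmt.Sscanf(args[i], "%q", &args[i])
		}
	}
	fmt.Printf("%q\n", args) // ["put" "key \"with\" space" "v1"]
}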
func commandCtx(cmd *cobra.Command) (context.Context, context.CancelFunc) {

@ -29,6 +29,7 @@ var (
watchRev int64
watchPrefix bool
watchInteractive bool
watchPrevKey bool
)
// NewWatchCommand returns the cobra command for "watch".
@ -42,6 +43,7 @@ func NewWatchCommand() *cobra.Command {
cmd.Flags().BoolVarP(&watchInteractive, "interactive", "i", false, "Interactive mode")
cmd.Flags().BoolVar(&watchPrefix, "prefix", false, "Watch on a prefix if prefix is set")
cmd.Flags().Int64Var(&watchRev, "rev", 0, "Revision to start watching")
cmd.Flags().BoolVar(&watchPrevKey, "prev-kv", false, "get the previous key-value pair before the event happens")
return cmd
}
@ -52,30 +54,18 @@ func watchCommandFunc(cmd *cobra.Command, args []string) {
watchInteractiveFunc(cmd, args)
return
}
if len(args) < 1 || len(args) > 2 {
ExitWithError(ExitBadArgs, fmt.Errorf("watch in non-interactive mode requires one or two arguments as key or prefix, with range end"))
}
opts := []clientv3.OpOption{clientv3.WithRev(watchRev)}
key := args[0]
if len(args) == 2 {
if watchPrefix {
ExitWithError(ExitBadArgs, fmt.Errorf("`range_end` and `--prefix` cannot be set at the same time, choose one"))
}
opts = append(opts, clientv3.WithRange(args[1]))
}
if watchPrefix {
opts = append(opts, clientv3.WithPrefix())
}
c := mustClientFromCmd(cmd)
wc := c.Watch(context.TODO(), key, opts...)
printWatchCh(wc)
err := c.Close()
if err == nil {
ExitWithError(ExitInterrupted, fmt.Errorf("watch is canceled by the server"))
wc, err := getWatchChan(c, args)
if err != nil {
ExitWithError(ExitBadArgs, err)
}
ExitWithError(ExitBadConnection, err)
printWatchCh(wc)
if err = c.Close(); err != nil {
ExitWithError(ExitBadConnection, err)
}
ExitWithError(ExitInterrupted, fmt.Errorf("watch is canceled by the server"))
}
func watchInteractiveFunc(cmd *cobra.Command, args []string) {
@ -107,32 +97,36 @@ func watchInteractiveFunc(cmd *cobra.Command, args []string) {
fmt.Fprintf(os.Stderr, "Invalid command %s (%v)\n", l, err)
continue
}
moreargs := flagset.Args()
if len(moreargs) < 1 || len(moreargs) > 2 {
fmt.Fprintf(os.Stderr, "Invalid command %s (Too few or many arguments)\n", l)
ch, err := getWatchChan(c, flagset.Args())
if err != nil {
fmt.Fprintf(os.Stderr, "Invalid command %s (%v)\n", l, err)
continue
}
var key string
_, err = fmt.Sscanf(moreargs[0], "%q", &key)
if err != nil {
key = moreargs[0]
}
opts := []clientv3.OpOption{clientv3.WithRev(watchRev)}
if len(moreargs) == 2 {
if watchPrefix {
fmt.Fprintf(os.Stderr, "`range_end` and `--prefix` cannot be set at the same time, choose one\n")
continue
}
opts = append(opts, clientv3.WithRange(moreargs[1]))
}
if watchPrefix {
opts = append(opts, clientv3.WithPrefix())
}
ch := c.Watch(context.TODO(), key, opts...)
go printWatchCh(ch)
}
}
func getWatchChan(c *clientv3.Client, args []string) (clientv3.WatchChan, error) {
if len(args) < 1 || len(args) > 2 {
return nil, fmt.Errorf("bad number of arguments")
}
key := args[0]
opts := []clientv3.OpOption{clientv3.WithRev(watchRev)}
if len(args) == 2 {
if watchPrefix {
return nil, fmt.Errorf("`range_end` and `--prefix` are mutually exclusive")
}
opts = append(opts, clientv3.WithRange(args[1]))
}
if watchPrefix {
opts = append(opts, clientv3.WithPrefix())
}
if watchPrevKey {
opts = append(opts, clientv3.WithPrevKV())
}
return c.Watch(context.TODO(), key, opts...), nil
}
func printWatchCh(ch clientv3.WatchChan) {
for resp := range ch {
display.Watch(resp)
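
Roughly the watch that `etcdctl watch --prev-kv foo` now issues can also be expressed directly against clientv3; a minimal sketch, assuming a reachable member on localhost:2379.

package main

import (
	"fmt"
	"time"

	"github.com/coreos/etcd/clientv3"
	"golang.org/x/net/context"
)

func main() {
	// the endpoint is an assumption; adjust for the local cluster
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// same option set getWatchChan builds for `etcdctl watch --prev-kv foo`
	wch := cli.Watch(context.TODO(), "foo", clientv3.WithPrevKV())
	for resp := range wch {
		for _, ev := range resp.Events {
			fmt.Printf("%s %q -> %q (prev: %v)\n", ev.Type, ev.Kv.Key, ev.Kv.Value, ev.PrevKv)
		}
	}
}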

View File

@ -20,6 +20,7 @@ import (
"flag"
"fmt"
"io/ioutil"
"net"
"net/url"
"os"
"runtime"
@ -410,6 +411,13 @@ func (cfg *config) configFromFile() error {
}
func (cfg *config) validateConfig(isSet func(field string) bool) error {
if err := checkBindURLs(cfg.lpurls); err != nil {
return err
}
if err := checkBindURLs(cfg.lcurls); err != nil {
return err
}
// when etcd runs in member mode user needs to set --advertise-client-urls if --listen-client-urls is set.
// TODO(yichengq): check this for joining through discovery service case
mayFallbackToProxy := isSet("discovery") && cfg.fallback.String() == fallbackFlagProxy
@ -456,3 +464,27 @@ func (cfg config) isReadonlyProxy() bool { return cfg.proxy.String() == pr
func (cfg config) shouldFallbackToProxy() bool { return cfg.fallback.String() == fallbackFlagProxy }
func (cfg config) electionTicks() int { return int(cfg.ElectionMs / cfg.TickMs) }
// checkBindURLs returns an error if any URL uses a domain name.
// TODO: return error in 3.2.0
func checkBindURLs(urls []url.URL) error {
for _, url := range urls {
if url.Scheme == "unix" || url.Scheme == "unixs" {
continue
}
host, _, err := net.SplitHostPort(url.Host)
if err != nil {
return err
}
if host == "localhost" {
// special case for local address
// TODO: support /etc/hosts ?
continue
}
if net.ParseIP(host) == nil {
err := fmt.Errorf("expected IP in URL for binding (%s)", url.String())
plog.Warning(err)
}
}
return nil
}
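
The effect of the new check can be sketched with a small self-contained helper (bindURLOK is illustrative, not part of etcd): unix sockets and literal IPs pass, "localhost" is special-cased, and a domain name is the case that 3.0 only warns about above.

package main

import (
	"fmt"
	"net"
	"net/url"
)

// bindURLOK mirrors the rules of checkBindURLs: unix/unixs schemes and
// literal IPs are accepted, "localhost" is special-cased, anything else
// (i.e. a domain name) is flagged.
func bindURLOK(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	if u.Scheme == "unix" || u.Scheme == "unixs" {
		return true
	}
	host, _, err := net.SplitHostPort(u.Host)
	if err != nil {
		return false
	}
	return host == "localhost" || net.ParseIP(host) != nil
}

func main() {
	fmt.Println(bindURLOK("http://10.0.1.5:2380"))    // true: literal IP
	fmt.Println(bindURLOK("http://example.com:2380")) // false: domain name, draws the warning in 3.0
}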

View File

@ -73,14 +73,14 @@ func (ls *LeaseServer) LeaseKeepAlive(stream pb.Lease_LeaseKeepAliveServer) erro
resp := &pb.LeaseKeepAliveResponse{ID: req.ID, Header: &pb.ResponseHeader{}}
ls.hdr.fill(resp.Header)
ttl, err := ls.le.LeaseRenew(lease.LeaseID(req.ID))
ttl, err := ls.le.LeaseRenew(stream.Context(), lease.LeaseID(req.ID))
if err == lease.ErrLeaseNotFound {
err = nil
ttl = 0
}
if err != nil {
return err
return togRPCError(err)
}
resp.TTL = ttl

View File

@ -49,9 +49,13 @@ var (
ErrGRPCRoleNotGranted = grpc.Errorf(codes.FailedPrecondition, "etcdserver: role is not granted to the user")
ErrGRPCPermissionNotGranted = grpc.Errorf(codes.FailedPrecondition, "etcdserver: permission is not granted to the role")
ErrGRPCNoLeader = grpc.Errorf(codes.Unavailable, "etcdserver: no leader")
ErrGRPCNotCapable = grpc.Errorf(codes.Unavailable, "etcdserver: not capable")
ErrGRPCStopped = grpc.Errorf(codes.Unavailable, "etcdserver: server stopped")
ErrGRPCNoLeader = grpc.Errorf(codes.Unavailable, "etcdserver: no leader")
ErrGRPCNotCapable = grpc.Errorf(codes.Unavailable, "etcdserver: not capable")
ErrGRPCStopped = grpc.Errorf(codes.Unavailable, "etcdserver: server stopped")
ErrGRPCTimeout = grpc.Errorf(codes.Unavailable, "etcdserver: request timed out")
ErrGRPCTimeoutDueToLeaderFail = grpc.Errorf(codes.Unavailable, "etcdserver: request timed out, possibly due to previous leader failure")
ErrGRPCTimeoutDueToConnectionLost = grpc.Errorf(codes.Unavailable, "etcdserver: request timed out, possibly due to connection lost")
ErrGRPCUnhealthy = grpc.Errorf(codes.Unavailable, "etcdserver: unhealthy cluster")
errStringToError = map[string]error{
grpc.ErrorDesc(ErrGRPCEmptyKey): ErrGRPCEmptyKey,
@ -82,9 +86,13 @@ var (
grpc.ErrorDesc(ErrGRPCRoleNotGranted): ErrGRPCRoleNotGranted,
grpc.ErrorDesc(ErrGRPCPermissionNotGranted): ErrGRPCPermissionNotGranted,
grpc.ErrorDesc(ErrGRPCNoLeader): ErrGRPCNoLeader,
grpc.ErrorDesc(ErrGRPCNotCapable): ErrGRPCNotCapable,
grpc.ErrorDesc(ErrGRPCStopped): ErrGRPCStopped,
grpc.ErrorDesc(ErrGRPCNoLeader): ErrGRPCNoLeader,
grpc.ErrorDesc(ErrGRPCNotCapable): ErrGRPCNotCapable,
grpc.ErrorDesc(ErrGRPCStopped): ErrGRPCStopped,
grpc.ErrorDesc(ErrGRPCTimeout): ErrGRPCTimeout,
grpc.ErrorDesc(ErrGRPCTimeoutDueToLeaderFail): ErrGRPCTimeoutDueToLeaderFail,
grpc.ErrorDesc(ErrGRPCTimeoutDueToConnectionLost): ErrGRPCTimeoutDueToConnectionLost,
grpc.ErrorDesc(ErrGRPCUnhealthy): ErrGRPCUnhealthy,
}
// client-side error
@ -116,9 +124,13 @@ var (
ErrRoleNotGranted = Error(ErrGRPCRoleNotGranted)
ErrPermissionNotGranted = Error(ErrGRPCPermissionNotGranted)
ErrNoLeader = Error(ErrGRPCNoLeader)
ErrNotCapable = Error(ErrGRPCNotCapable)
ErrStopped = Error(ErrGRPCStopped)
ErrNoLeader = Error(ErrGRPCNoLeader)
ErrNotCapable = Error(ErrGRPCNotCapable)
ErrStopped = Error(ErrGRPCStopped)
ErrTimeout = Error(ErrGRPCTimeout)
ErrTimeoutDueToLeaderFail = Error(ErrGRPCTimeoutDueToLeaderFail)
ErrTimeoutDueToConnectionLost = Error(ErrGRPCTimeoutDueToConnectionLost)
ErrUnhealthy = Error(ErrGRPCUnhealthy)
)
// EtcdError defines gRPC server errors.
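
A minimal sketch of how a caller can tell the new timeout errors apart, assuming a reachable member on localhost:2379 and that the client maps gRPC errors back through rpctypes.Error (as newer clientv3 does); on a timeout the write may or may not have been applied.

package main

import (
	"fmt"
	"time"

	"github.com/coreos/etcd/clientv3"
	"github.com/coreos/etcd/etcdserver/api/v3rpc/rpctypes"
	"golang.org/x/net/context"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.TODO(), time.Second)
	_, err = cli.Put(ctx, "foo", "bar")
	cancel()

	switch err {
	case nil:
		fmt.Println("put succeeded")
	case rpctypes.ErrTimeout, rpctypes.ErrTimeoutDueToLeaderFail, rpctypes.ErrTimeoutDueToConnectionLost:
		// the request may or may not have been applied; retry only if that is acceptable
		fmt.Println("request timed out:", err)
	default:
		fmt.Println("put failed:", err)
	}
}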

View File

@ -38,6 +38,17 @@ func togRPCError(err error) error {
case etcdserver.ErrNoSpace:
return rpctypes.ErrGRPCNoSpace
case etcdserver.ErrNoLeader:
return rpctypes.ErrGRPCNoLeader
case etcdserver.ErrStopped:
return rpctypes.ErrGRPCStopped
case etcdserver.ErrTimeout:
return rpctypes.ErrGRPCTimeout
case etcdserver.ErrTimeoutDueToLeaderFail:
return rpctypes.ErrGRPCTimeoutDueToLeaderFail
case etcdserver.ErrTimeoutDueToConnectionLost:
return rpctypes.ErrGRPCTimeoutDueToConnectionLost
case auth.ErrRootUserNotExist:
return rpctypes.ErrGRPCRootUserNotExist
case auth.ErrRootRoleNotExist:

View File

@ -32,7 +32,7 @@ type watchServer struct {
clusterID int64
memberID int64
raftTimer etcdserver.RaftTimer
watchable mvcc.Watchable
watchable mvcc.WatchableKV
}
func NewWatchServer(s *etcdserver.EtcdServer) pb.WatchServer {
@ -82,6 +82,8 @@ type serverWatchStream struct {
memberID int64
raftTimer etcdserver.RaftTimer
watchable mvcc.WatchableKV
gRPCStream pb.Watch_WatchServer
watchStream mvcc.WatchStream
ctrlStream chan *pb.WatchResponse
@ -91,6 +93,7 @@ type serverWatchStream struct {
// progress tracks the watchID that stream might need to send
// progress to.
progress map[mvcc.WatchID]bool
prevKV map[mvcc.WatchID]bool
// closec indicates the stream is closed.
closec chan struct{}
@ -101,14 +104,18 @@ type serverWatchStream struct {
func (ws *watchServer) Watch(stream pb.Watch_WatchServer) (err error) {
sws := serverWatchStream{
clusterID: ws.clusterID,
memberID: ws.memberID,
raftTimer: ws.raftTimer,
clusterID: ws.clusterID,
memberID: ws.memberID,
raftTimer: ws.raftTimer,
watchable: ws.watchable,
gRPCStream: stream,
watchStream: ws.watchable.NewWatchStream(),
// chan for sending control response like watcher created and canceled.
ctrlStream: make(chan *pb.WatchResponse, ctrlStreamBufLen),
progress: make(map[mvcc.WatchID]bool),
prevKV: make(map[mvcc.WatchID]bool),
closec: make(chan struct{}),
}
@ -170,9 +177,14 @@ func (sws *serverWatchStream) recvLoop() error {
rev = wsrev + 1
}
id := sws.watchStream.Watch(creq.Key, creq.RangeEnd, rev)
if id != -1 && creq.ProgressNotify {
if id != -1 {
sws.mu.Lock()
sws.progress[id] = true
if creq.ProgressNotify {
sws.progress[id] = true
}
if creq.PrevKv {
sws.prevKV[id] = true
}
sws.mu.Unlock()
}
wr := &pb.WatchResponse{
@ -198,6 +210,7 @@ func (sws *serverWatchStream) recvLoop() error {
}
sws.mu.Lock()
delete(sws.progress, mvcc.WatchID(id))
delete(sws.prevKV, mvcc.WatchID(id))
sws.mu.Unlock()
}
}
@ -244,8 +257,19 @@ func (sws *serverWatchStream) sendLoop() {
// or define protocol buffer with []mvccpb.Event.
evs := wresp.Events
events := make([]*mvccpb.Event, len(evs))
sws.mu.Lock()
needPrevKV := sws.prevKV[wresp.WatchID]
sws.mu.Unlock()
for i := range evs {
events[i] = &evs[i]
if needPrevKV {
opt := mvcc.RangeOptions{Rev: evs[i].Kv.ModRevision - 1}
r, err := sws.watchable.Range(evs[i].Kv.Key, nil, opt)
if err == nil && len(r.KVs) != 0 {
events[i].PrevKv = &(r.KVs[0])
}
}
}
wr := &pb.WatchResponse{

View File

@ -159,6 +159,22 @@ func (a *applierV3backend) Put(txnID int64, p *pb.PutRequest) (*pb.PutResponse,
rev int64
err error
)
var rr *mvcc.RangeResult
if p.PrevKv {
if txnID != noTxn {
rr, err = a.s.KV().TxnRange(txnID, p.Key, nil, mvcc.RangeOptions{})
if err != nil {
return nil, err
}
} else {
rr, err = a.s.KV().Range(p.Key, nil, mvcc.RangeOptions{})
if err != nil {
return nil, err
}
}
}
if txnID != noTxn {
rev, err = a.s.KV().TxnPut(txnID, p.Key, p.Value, lease.LeaseID(p.Lease))
if err != nil {
@ -174,6 +190,9 @@ func (a *applierV3backend) Put(txnID int64, p *pb.PutRequest) (*pb.PutResponse,
rev = a.s.KV().Put(p.Key, p.Value, leaseID)
}
resp.Header.Revision = rev
if rr != nil && len(rr.KVs) != 0 {
resp.PrevKv = &rr.KVs[0]
}
return resp, nil
}
@ -191,6 +210,21 @@ func (a *applierV3backend) DeleteRange(txnID int64, dr *pb.DeleteRangeRequest) (
dr.RangeEnd = []byte{}
}
var rr *mvcc.RangeResult
if dr.PrevKv {
if txnID != noTxn {
rr, err = a.s.KV().TxnRange(txnID, dr.Key, dr.RangeEnd, mvcc.RangeOptions{})
if err != nil {
return nil, err
}
} else {
rr, err = a.s.KV().Range(dr.Key, dr.RangeEnd, mvcc.RangeOptions{})
if err != nil {
return nil, err
}
}
}
if txnID != noTxn {
n, rev, err = a.s.KV().TxnDeleteRange(txnID, dr.Key, dr.RangeEnd)
if err != nil {
@ -201,6 +235,11 @@ func (a *applierV3backend) DeleteRange(txnID int64, dr *pb.DeleteRangeRequest) (
}
resp.Deleted = n
if rr != nil {
for i := range rr.KVs {
resp.PrevKvs = append(resp.PrevKvs, &rr.KVs[i])
}
}
resp.Header.Revision = rev
return resp, nil
}

View File

@ -56,6 +56,9 @@ func (aa *authApplierV3) Put(txnID int64, r *pb.PutRequest) (*pb.PutResponse, er
if !aa.as.IsPutPermitted(aa.user, r.Key) {
return nil, auth.ErrPermissionDenied
}
if r.PrevKv && !aa.as.IsRangePermitted(aa.user, r.Key, nil) {
return nil, auth.ErrPermissionDenied
}
return aa.applierV3.Put(txnID, r)
}
@ -70,6 +73,9 @@ func (aa *authApplierV3) DeleteRange(txnID int64, r *pb.DeleteRangeRequest) (*pb
if !aa.as.IsDeleteRangePermitted(aa.user, r.Key, r.RangeEnd) {
return nil, auth.ErrPermissionDenied
}
if r.PrevKv && !aa.as.IsRangePermitted(aa.user, r.Key, r.RangeEnd) {
return nil, auth.ErrPermissionDenied
}
return aa.applierV3.DeleteRange(txnID, r)
}
@ -99,7 +105,7 @@ func (aa *authApplierV3) checkTxnReqsPermission(reqs []*pb.RequestOp) bool {
continue
}
if !aa.as.IsDeleteRangePermitted(aa.user, tv.RequestDeleteRange.Key, tv.RequestDeleteRange.RangeEnd) {
if tv.RequestDeleteRange.PrevKv && !aa.as.IsRangePermitted(aa.user, tv.RequestDeleteRange.Key, tv.RequestDeleteRange.RangeEnd) {
return false
}
}

View File

@ -102,9 +102,9 @@ import (
proto "github.com/golang/protobuf/proto"
math "math"
)
import io "io"
io "io"
)
// Reference imports to suppress errors if they are not otherwise used.
var _ = proto.Marshal

View File

@ -10,9 +10,9 @@ import (
proto "github.com/golang/protobuf/proto"
math "math"
)
import io "io"
io "io"
)
// Reference imports to suppress errors if they are not otherwise used.
var _ = proto.Marshal

File diff suppressed because it is too large.

View File

@ -396,10 +396,16 @@ message PutRequest {
// lease is the lease ID to associate with the key in the key-value store. A lease
// value of 0 indicates no lease.
int64 lease = 3;
// If prev_kv is set, etcd gets the previous key-value pair before changing it.
// The previous key-value pair will be returned in the put response.
bool prev_kv = 4;
}
message PutResponse {
ResponseHeader header = 1;
// if prev_kv is set in the request, the previous key-value pair will be returned.
mvccpb.KeyValue prev_kv = 2;
}
message DeleteRangeRequest {
@ -409,12 +415,17 @@ message DeleteRangeRequest {
// If range_end is not given, the range is defined to contain only the key argument.
// If range_end is '\0', the range is all keys greater than or equal to the key argument.
bytes range_end = 2;
// If prev_kv is set, etcd gets the previous key-value pairs before deleting them.
// The previous key-value pairs will be returned in the delete response.
bool prev_kv = 3;
}
message DeleteRangeResponse {
ResponseHeader header = 1;
// deleted is the number of keys deleted by the delete range request.
int64 deleted = 2;
// if prev_kv is set in the request, the previous key-value pairs will be returned.
repeated mvccpb.KeyValue prev_kvs = 3;
}
message RequestOp {
@ -563,6 +574,9 @@ message WatchCreateRequest {
// wish to recover a disconnected watcher starting from a recent known revision.
// The etcd server may decide how often it will send notifications based on current load.
bool progress_notify = 4;
// If prev_kv is set, created watcher gets the previous KV before the event happens.
// If the previous KV is already compacted, nothing will be returned.
bool prev_kv = 6;
}
message WatchCancelRequest {
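
A sketch of the prev_kv round trip from a client's point of view, assuming a reachable member on localhost:2379 and a clientv3 build that exposes the request field as WithPrevKV() on Put and Delete (as etcd 3.1's client does); the proto fields above are what travel on the wire.

package main

import (
	"fmt"
	"time"

	"github.com/coreos/etcd/clientv3"
	"golang.org/x/net/context"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.TODO()
	cli.Put(ctx, "foo", "v1")

	// PutRequest.prev_kv: the response carries the pair that was overwritten
	presp, err := cli.Put(ctx, "foo", "v2", clientv3.WithPrevKV())
	if err == nil && presp.PrevKv != nil {
		fmt.Printf("overwrote %q=%q\n", presp.PrevKv.Key, presp.PrevKv.Value)
	}

	// DeleteRangeRequest.prev_kv: the response carries every deleted pair
	dresp, err := cli.Delete(ctx, "foo", clientv3.WithPrevKV())
	if err == nil {
		for _, kv := range dresp.PrevKvs {
			fmt.Printf("deleted %q=%q\n", kv.Key, kv.Value)
		}
	}
}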

View File

@ -18,6 +18,7 @@ import (
"encoding/json"
"expvar"
"fmt"
"math"
"math/rand"
"net/http"
"os"
@ -154,13 +155,13 @@ type Server interface {
// EtcdServer is the production implementation of the Server interface
type EtcdServer struct {
// r and inflightSnapshots must be the first elements to keep 64-bit alignment for atomic
// access to fields
// count the number of inflight snapshots.
// MUST use atomic operation to access this field.
inflightSnapshots int64
Cfg *ServerConfig
// inflightSnapshots holds the number of snapshots currently in flight.
inflightSnapshots int64 // must use atomic operations to access; keep 64-bit aligned.
appliedIndex uint64 // must use atomic operations to access; keep 64-bit aligned.
// consistIndex used to hold the offset of current executing entry
// It is initialized to 0 before executing any entry.
consistIndex consistentIndex // must use atomic operations to access; keep 64-bit aligned.
Cfg *ServerConfig
readych chan struct{}
r raftNode
@ -195,10 +196,6 @@ type EtcdServer struct {
// compactor is used to auto-compact the KV.
compactor *compactor.Periodic
// consistent index used to hold the offset of current executing entry
// It is initialized to 0 before executing any entry.
consistIndex consistentIndex
// peerRt used to send requests (version, lease) to peers.
peerRt http.RoundTripper
reqIDGen *idutil.Generator
@ -212,8 +209,6 @@ type EtcdServer struct {
// wg is used to wait for the go routines that depends on the server state
// to exit when stopping the server.
wg sync.WaitGroup
appliedIndex uint64
}
// NewServer creates a new EtcdServer from the supplied configuration. The
@ -404,15 +399,23 @@ func NewServer(cfg *ServerConfig) (srv *EtcdServer, err error) {
srv.applyV2 = &applierV2store{store: srv.store, cluster: srv.cluster}
srv.be = be
srv.lessor = lease.NewLessor(srv.be)
minTTL := time.Duration((3*cfg.ElectionTicks)/2) * time.Duration(cfg.TickMs) * time.Millisecond
srv.lessor = lease.NewLessor(srv.be, int64(math.Ceil(minTTL.Seconds())))
srv.kv = mvcc.New(srv.be, srv.lessor, &srv.consistIndex)
if beExist {
kvindex := srv.kv.ConsistentIndex()
// TODO: remove kvindex != 0 checking when we do not expect users to upgrade
// etcd from pre-3.0 release.
if snapshot != nil && kvindex < snapshot.Metadata.Index {
return nil, fmt.Errorf("database file (%v index %d) does not match with snapshot (index %d).", bepath, kvindex, snapshot.Metadata.Index)
if kvindex != 0 {
return nil, fmt.Errorf("database file (%v index %d) does not match with snapshot (index %d).", bepath, kvindex, snapshot.Metadata.Index)
}
plog.Warningf("consistent index never saved (snapshot index=%d)", snapshot.Metadata.Index)
}
}
srv.consistIndex.setConsistentIndex(srv.kv.ConsistentIndex())
srv.authStore = auth.NewAuthStore(srv.be)
if h := cfg.AutoCompactionRetention; h != 0 {
srv.compactor = compactor.NewPeriodic(h, srv.kv, srv)
@ -658,6 +661,14 @@ func (s *EtcdServer) applySnapshot(ep *etcdProgress, apply *apply) {
newbe := backend.NewDefaultBackend(fn)
// always recover lessor before kv. When we recover the mvcc.KV it will reattach keys to its leases.
// If we recover mvcc.KV first, it will attach the keys to the wrong lessor before it recovers.
if s.lessor != nil {
plog.Info("recovering lessor...")
s.lessor.Recover(newbe, s.kv)
plog.Info("finished recovering lessor")
}
plog.Info("restoring mvcc store...")
if err := s.kv.Restore(newbe); err != nil {
@ -684,12 +695,6 @@ func (s *EtcdServer) applySnapshot(ep *etcdProgress, apply *apply) {
s.be = newbe
s.bemu.Unlock()
if s.lessor != nil {
plog.Info("recovering lessor...")
s.lessor.Recover(newbe, s.kv)
plog.Info("finished recovering lessor")
}
plog.Info("recovering alarms...")
if err := s.restoreAlarms(); err != nil {
plog.Panicf("restore alarms error: %v", err)

View File

@ -20,6 +20,7 @@ import (
"time"
pb "github.com/coreos/etcd/etcdserver/etcdserverpb"
"github.com/coreos/etcd/etcdserver/membership"
"github.com/coreos/etcd/lease"
"github.com/coreos/etcd/lease/leasehttp"
"github.com/coreos/etcd/mvcc"
@ -54,7 +55,7 @@ type Lessor interface {
// LeaseRenew renews the lease with given ID. The renewed TTL is returned. Or an error
// is returned.
LeaseRenew(id lease.LeaseID) (int64, error)
LeaseRenew(ctx context.Context, id lease.LeaseID) (int64, error)
}
type Authenticator interface {
@ -218,7 +219,7 @@ func (s *EtcdServer) LeaseRevoke(ctx context.Context, r *pb.LeaseRevokeRequest)
return result.resp.(*pb.LeaseRevokeResponse), nil
}
func (s *EtcdServer) LeaseRenew(id lease.LeaseID) (int64, error) {
func (s *EtcdServer) LeaseRenew(ctx context.Context, id lease.LeaseID) (int64, error) {
ttl, err := s.lessor.Renew(id)
if err == nil {
return ttl, nil
@ -228,29 +229,44 @@ func (s *EtcdServer) LeaseRenew(id lease.LeaseID) (int64, error) {
}
// renewals don't go through raft; forward to leader manually
cctx, cancel := context.WithTimeout(ctx, s.Cfg.ReqTimeout())
defer cancel()
// renewals don't go through raft; forward to leader manually
for cctx.Err() == nil && err != nil {
leader, lerr := s.waitLeader(cctx)
if lerr != nil {
return -1, lerr
}
for _, url := range leader.PeerURLs {
lurl := url + "/leases"
ttl, err = leasehttp.RenewHTTP(cctx, id, lurl, s.peerRt)
if err == nil || err == lease.ErrLeaseNotFound {
return ttl, err
}
}
}
return -1, ErrTimeout
}
func (s *EtcdServer) waitLeader(ctx context.Context) (*membership.Member, error) {
leader := s.cluster.Member(s.Leader())
for i := 0; i < 5 && leader == nil; i++ {
for leader == nil {
// wait an election
dur := time.Duration(s.Cfg.ElectionTicks) * time.Duration(s.Cfg.TickMs) * time.Millisecond
select {
case <-time.After(dur):
leader = s.cluster.Member(s.Leader())
case <-s.done:
return -1, ErrStopped
return nil, ErrStopped
case <-ctx.Done():
return nil, ErrNoLeader
}
}
if leader == nil || len(leader.PeerURLs) == 0 {
return -1, ErrNoLeader
return nil, ErrNoLeader
}
for _, url := range leader.PeerURLs {
lurl := url + "/leases"
ttl, err = leasehttp.RenewHTTP(id, lurl, s.peerRt, s.Cfg.peerDialTimeout())
if err == nil {
break
}
}
return ttl, err
return leader, nil
}
func (s *EtcdServer) Alarm(ctx context.Context, r *pb.AlarmRequest) (*pb.AlarmResponse, error) {
@ -551,4 +567,4 @@ func (s *EtcdServer) processInternalRaftRequest(ctx context.Context, r pb.Intern
}
// Watchable returns a watchable interface attached to the etcdserver.
func (s *EtcdServer) Watchable() mvcc.Watchable { return s.KV() }
func (s *EtcdServer) Watchable() mvcc.WatchableKV { return s.KV() }

View File

@ -174,3 +174,28 @@ func TestElectionSessionRecampaign(t *testing.T) {
t.Fatalf("expected value=%q, got response %v", "def", resp)
}
}
// TestElectionOnPrefixOfExistingKey checks that a single
// candidate can be elected on a new key that is a prefix
// of an existing key. To wit, check for regression
// of bug #6278. https://github.com/coreos/etcd/issues/6278
//
func TestElectionOnPrefixOfExistingKey(t *testing.T) {
clus := NewClusterV3(t, &ClusterConfig{Size: 1})
defer clus.Terminate(t)
cli := clus.RandClient()
if _, err := cli.Put(context.TODO(), "testa", "value"); err != nil {
t.Fatal(err)
}
e := concurrency.NewElection(cli, "test")
ctx, cancel := context.WithTimeout(context.TODO(), 5*time.Second)
err := e.Campaign(ctx, "abc")
cancel()
if err != nil {
// after 5 seconds, deadlock results in
// 'context deadline exceeded' here.
t.Fatal(err)
}
}

View File

@ -379,6 +379,7 @@ func TestV3DeleteRange(t *testing.T) {
keySet []string
begin string
end string
prevKV bool
wantSet [][]byte
deleted int64
@ -386,39 +387,45 @@ func TestV3DeleteRange(t *testing.T) {
// delete middle
{
[]string{"foo", "foo/abc", "fop"},
"foo/", "fop",
"foo/", "fop", false,
[][]byte{[]byte("foo"), []byte("fop")}, 1,
},
// no delete
{
[]string{"foo", "foo/abc", "fop"},
"foo/", "foo/",
"foo/", "foo/", false,
[][]byte{[]byte("foo"), []byte("foo/abc"), []byte("fop")}, 0,
},
// delete first
{
[]string{"foo", "foo/abc", "fop"},
"fo", "fop",
"fo", "fop", false,
[][]byte{[]byte("fop")}, 2,
},
// delete tail
{
[]string{"foo", "foo/abc", "fop"},
"foo/", "fos",
"foo/", "fos", false,
[][]byte{[]byte("foo")}, 2,
},
// delete exact
{
[]string{"foo", "foo/abc", "fop"},
"foo/abc", "",
"foo/abc", "", false,
[][]byte{[]byte("foo"), []byte("fop")}, 1,
},
// delete none, [x,x)
{
[]string{"foo"},
"foo", "foo",
"foo", "foo", false,
[][]byte{[]byte("foo")}, 0,
},
// delete middle with prevKV set
{
[]string{"foo", "foo/abc", "fop"},
"foo/", "fop", true,
[][]byte{[]byte("foo"), []byte("fop")}, 1,
},
}
for i, tt := range tests {
@ -436,7 +443,9 @@ func TestV3DeleteRange(t *testing.T) {
dreq := &pb.DeleteRangeRequest{
Key: []byte(tt.begin),
RangeEnd: []byte(tt.end)}
RangeEnd: []byte(tt.end),
PrevKv: tt.prevKV,
}
dresp, err := kvc.DeleteRange(context.TODO(), dreq)
if err != nil {
t.Fatalf("couldn't delete range on test %d (%v)", i, err)
@ -444,6 +453,11 @@ func TestV3DeleteRange(t *testing.T) {
if tt.deleted != dresp.Deleted {
t.Errorf("expected %d on test %v, got %d", tt.deleted, i, dresp.Deleted)
}
if tt.prevKV {
if len(dresp.PrevKvs) != int(dresp.Deleted) {
t.Errorf("preserve %d keys, want %d", len(dresp.PrevKvs), dresp.Deleted)
}
}
rreq := &pb.RangeRequest{Key: []byte{0x0}, RangeEnd: []byte{0xff}}
rresp, err := kvc.Range(context.TODO(), rreq)

View File

@ -19,11 +19,13 @@ import (
"testing"
"time"
"golang.org/x/net/context"
"google.golang.org/grpc/metadata"
"github.com/coreos/etcd/etcdserver/api/v3rpc/rpctypes"
pb "github.com/coreos/etcd/etcdserver/etcdserverpb"
"github.com/coreos/etcd/mvcc/mvccpb"
"github.com/coreos/etcd/pkg/testutil"
"golang.org/x/net/context"
)
// TestV3LeasePrmote ensures the newly elected leader can promote itself
@ -332,7 +334,9 @@ func TestV3LeaseFailover(t *testing.T) {
lreq := &pb.LeaseKeepAliveRequest{ID: lresp.ID}
ctx, cancel := context.WithCancel(context.Background())
md := metadata.Pairs(rpctypes.MetadataRequireLeaderKey, rpctypes.MetadataHasLeader)
mctx := metadata.NewContext(context.Background(), md)
ctx, cancel := context.WithCancel(mctx)
defer cancel()
lac, err := lc.LeaseKeepAlive(ctx)
if err != nil {

View File

@ -348,6 +348,51 @@ func TestV3WatchFutureRevision(t *testing.T) {
}
}
// TestV3WatchWrongRange tests that a wrong range does not create watchers.
func TestV3WatchWrongRange(t *testing.T) {
defer testutil.AfterTest(t)
clus := NewClusterV3(t, &ClusterConfig{Size: 1})
defer clus.Terminate(t)
wAPI := toGRPC(clus.RandClient()).Watch
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
wStream, err := wAPI.Watch(ctx)
if err != nil {
t.Fatalf("wAPI.Watch error: %v", err)
}
tests := []struct {
key []byte
end []byte
canceled bool
}{
{[]byte("a"), []byte("a"), true}, // wrong range end
{[]byte("b"), []byte("a"), true}, // wrong range end
{[]byte("foo"), []byte{0}, false}, // watch request with 'WithFromKey'
}
for i, tt := range tests {
if err := wStream.Send(&pb.WatchRequest{RequestUnion: &pb.WatchRequest_CreateRequest{
CreateRequest: &pb.WatchCreateRequest{Key: tt.key, RangeEnd: tt.end, StartRevision: 1}}}); err != nil {
t.Fatalf("#%d: wStream.Send error: %v", i, err)
}
cresp, err := wStream.Recv()
if err != nil {
t.Fatalf("#%d: wStream.Recv error: %v", i, err)
}
if !cresp.Created {
t.Fatalf("#%d: create %v, want %v", i, cresp.Created, true)
}
if cresp.Canceled != tt.canceled {
t.Fatalf("#%d: canceled %v, want %v", i, tt.canceled, cresp.Canceled)
}
if tt.canceled && cresp.WatchId != -1 {
t.Fatalf("#%d: canceled watch ID %d, want -1", i, cresp.WatchId)
}
}
}
// TestV3WatchCancelSynced tests Watch APIs cancellation from synced map.
func TestV3WatchCancelSynced(t *testing.T) {
defer testutil.AfterTest(t)

View File

@ -19,10 +19,10 @@ import (
"fmt"
"io/ioutil"
"net/http"
"time"
pb "github.com/coreos/etcd/etcdserver/etcdserverpb"
"github.com/coreos/etcd/lease"
"golang.org/x/net/context"
)
// NewHandler returns an http Handler for lease renewals
@ -75,15 +75,22 @@ func (h *leaseHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
// RenewHTTP renews a lease at a given primary server.
// TODO: Batch request in future?
func RenewHTTP(id lease.LeaseID, url string, rt http.RoundTripper, timeout time.Duration) (int64, error) {
func RenewHTTP(ctx context.Context, id lease.LeaseID, url string, rt http.RoundTripper) (int64, error) {
// will post lreq protobuf to leader
lreq, err := (&pb.LeaseKeepAliveRequest{ID: int64(id)}).Marshal()
if err != nil {
return -1, err
}
cc := &http.Client{Transport: rt, Timeout: timeout}
resp, err := cc.Post(url, "application/protobuf", bytes.NewReader(lreq))
cc := &http.Client{Transport: rt}
req, err := http.NewRequest("POST", url, bytes.NewReader(lreq))
if err != nil {
return -1, err
}
req.Header.Set("Content-Type", "application/protobuf")
req.Cancel = ctx.Done()
resp, err := cc.Do(req)
if err != nil {
// TODO detect if leader failed and retry?
return -1, err

View File

@ -0,0 +1,51 @@
// Copyright 2016 The etcd Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package leasehttp
import (
"net/http"
"net/http/httptest"
"os"
"testing"
"time"
"github.com/coreos/etcd/lease"
"github.com/coreos/etcd/mvcc/backend"
"golang.org/x/net/context"
)
func TestRenewHTTP(t *testing.T) {
be, tmpPath := backend.NewTmpBackend(time.Hour, 10000)
defer os.Remove(tmpPath)
defer be.Close()
le := lease.NewLessor(be, 3)
le.Promote(time.Second)
l, err := le.Grant(1, int64(5))
if err != nil {
t.Fatalf("failed to create lease: %v", err)
}
ts := httptest.NewServer(NewHandler(le))
defer ts.Close()
ttl, err := RenewHTTP(context.TODO(), l.ID, ts.URL+"/leases", http.DefaultTransport)
if err != nil {
t.Fatal(err)
}
if ttl != 5 {
t.Fatalf("ttl expected 5, got %d", ttl)
}
}

View File

@ -19,9 +19,9 @@ import (
proto "github.com/golang/protobuf/proto"
math "math"
)
import io "io"
io "io"
)
// Reference imports to suppress errors if they are not otherwise used.
var _ = proto.Marshal

View File

@ -31,8 +31,6 @@ const (
)
var (
minLeaseTTL = int64(5)
leaseBucketName = []byte("lease")
// do not use maxInt64 since it can overflow time which will add
// the offset of unix time (1970yr to seconds).
@ -45,13 +43,18 @@ var (
type LeaseID int64
// RangeDeleter defines an interface with DeleteRange method.
// RangeDeleter defines an interface with Txn and DeleteRange methods.
// We define this interface only for lessor to limit the number
// of methods of mvcc.KV to what lessor actually needs.
//
// Having a minimum interface makes testing easy.
type RangeDeleter interface {
DeleteRange(key, end []byte) (int64, int64)
// TxnBegin see comments on mvcc.KV
TxnBegin() int64
// TxnEnd see comments on mvcc.KV
TxnEnd(txnID int64) error
// TxnDeleteRange see comments on mvcc.KV
TxnDeleteRange(txnID int64, key, end []byte) (n, rev int64, err error)
}
// Lessor owns leases. It can grant, revoke, renew and modify leases for lessee.
@ -138,6 +141,10 @@ type lessor struct {
// The leased items can be recovered by iterating all the keys in kv.
b backend.Backend
// minLeaseTTL is the minimum lease TTL that can be granted for a lease. Any
// requests for shorter TTLs are extended to the minimum TTL.
minLeaseTTL int64
expiredC chan []*Lease
// stopC is a channel whose closure indicates that the lessor should be stopped.
stopC chan struct{}
@ -145,14 +152,15 @@ type lessor struct {
doneC chan struct{}
}
func NewLessor(b backend.Backend) Lessor {
return newLessor(b)
func NewLessor(b backend.Backend, minLeaseTTL int64) Lessor {
return newLessor(b, minLeaseTTL)
}
func newLessor(b backend.Backend) *lessor {
func newLessor(b backend.Backend, minLeaseTTL int64) *lessor {
l := &lessor{
leaseMap: make(map[LeaseID]*Lease),
b: b,
leaseMap: make(map[LeaseID]*Lease),
b: b,
minLeaseTTL: minLeaseTTL,
// expiredC is a small buffered chan to avoid unnecessary blocking.
expiredC: make(chan []*Lease, 16),
stopC: make(chan struct{}),
@ -188,6 +196,10 @@ func (le *lessor) Grant(id LeaseID, ttl int64) (*Lease, error) {
return nil, ErrLeaseExists
}
if l.TTL < le.minLeaseTTL {
l.TTL = le.minLeaseTTL
}
if le.primary {
l.refresh(0)
} else {
@ -211,16 +223,30 @@ func (le *lessor) Revoke(id LeaseID) error {
// unlock before doing external work
le.mu.Unlock()
if le.rd != nil {
for item := range l.itemSet {
le.rd.DeleteRange([]byte(item.Key), nil)
if le.rd == nil {
return nil
}
tid := le.rd.TxnBegin()
for item := range l.itemSet {
_, _, err := le.rd.TxnDeleteRange(tid, []byte(item.Key), nil)
if err != nil {
panic(err)
}
}
le.mu.Lock()
defer le.mu.Unlock()
delete(le.leaseMap, l.ID)
l.removeFrom(le.b)
// lease deletion needs to be in the same backend transaction as the
// kv deletion. Otherwise we might end up not executing the revoke or not
// deleting the keys if etcdserver fails in between.
le.b.BatchTx().UnsafeDelete(leaseBucketName, int64ToBytes(int64(l.ID)))
err := le.rd.TxnEnd(tid)
if err != nil {
panic(err)
}
return nil
}
@ -406,6 +432,9 @@ func (le *lessor) initAndRecover() {
panic("failed to unmarshal lease proto item")
}
ID := LeaseID(lpb.ID)
if lpb.TTL < le.minLeaseTTL {
lpb.TTL = le.minLeaseTTL
}
le.leaseMap[ID] = &Lease{
ID: ID,
TTL: lpb.TTL,
@ -443,30 +472,13 @@ func (l Lease) persistTo(b backend.Backend) {
b.BatchTx().Unlock()
}
func (l Lease) removeFrom(b backend.Backend) {
key := int64ToBytes(int64(l.ID))
b.BatchTx().Lock()
b.BatchTx().UnsafeDelete(leaseBucketName, key)
b.BatchTx().Unlock()
}
// refresh refreshes the expiry of the lease. It extends the expiry at least
// minLeaseTTL second.
// refresh refreshes the expiry of the lease.
func (l *Lease) refresh(extend time.Duration) {
if l.TTL < minLeaseTTL {
l.TTL = minLeaseTTL
}
l.expiry = time.Now().Add(extend + time.Second*time.Duration(l.TTL))
}
// forever sets the expiry of lease to be forever.
func (l *Lease) forever() {
if l.TTL < minLeaseTTL {
l.TTL = minLeaseTTL
}
l.expiry = forever
}
func (l *Lease) forever() { l.expiry = forever }
type LeaseItem struct {
Key string

View File

@ -26,6 +26,8 @@ import (
"github.com/coreos/etcd/mvcc/backend"
)
const minLeaseTTL = int64(5)
// TestLessorGrant ensures Lessor can grant wanted lease.
// The granted lease should have a unique ID with a term
// that is greater than minLeaseTTL.
@ -34,7 +36,7 @@ func TestLessorGrant(t *testing.T) {
defer os.RemoveAll(dir)
defer be.Close()
le := newLessor(be)
le := newLessor(be, minLeaseTTL)
le.Promote(0)
l, err := le.Grant(1, 1)
@ -82,7 +84,7 @@ func TestLessorRevoke(t *testing.T) {
fd := &fakeDeleter{}
le := newLessor(be)
le := newLessor(be, minLeaseTTL)
le.SetRangeDeleter(fd)
// grant a lease with long term (100 seconds) to
@ -129,10 +131,10 @@ func TestLessorRenew(t *testing.T) {
defer be.Close()
defer os.RemoveAll(dir)
le := newLessor(be)
le := newLessor(be, minLeaseTTL)
le.Promote(0)
l, err := le.Grant(1, 5)
l, err := le.Grant(1, minLeaseTTL)
if err != nil {
t.Fatalf("failed to grant lease (%v)", err)
}
@ -160,7 +162,7 @@ func TestLessorDetach(t *testing.T) {
fd := &fakeDeleter{}
le := newLessor(be)
le := newLessor(be, minLeaseTTL)
le.SetRangeDeleter(fd)
// grant a lease with long term (100 seconds) to
@ -199,7 +201,7 @@ func TestLessorRecover(t *testing.T) {
defer os.RemoveAll(dir)
defer be.Close()
le := newLessor(be)
le := newLessor(be, minLeaseTTL)
l1, err1 := le.Grant(1, 10)
l2, err2 := le.Grant(2, 20)
if err1 != nil || err2 != nil {
@ -207,7 +209,7 @@ func TestLessorRecover(t *testing.T) {
}
// Create a new lessor with the same backend
nle := newLessor(be)
nle := newLessor(be, minLeaseTTL)
nl1 := nle.get(l1.ID)
if nl1 == nil || nl1.TTL != l1.TTL {
t.Errorf("nl1 = %v, want nl1.TTL= %d", nl1.TTL, l1.TTL)
@ -223,9 +225,17 @@ type fakeDeleter struct {
deleted []string
}
func (fd *fakeDeleter) DeleteRange(key, end []byte) (int64, int64) {
func (fd *fakeDeleter) TxnBegin() int64 {
return 0
}
func (fd *fakeDeleter) TxnEnd(txnID int64) error {
return nil
}
func (fd *fakeDeleter) TxnDeleteRange(tid int64, key, end []byte) (int64, int64, error) {
fd.deleted = append(fd.deleted, string(key)+"_"+string(end))
return 0, 0
return 0, 0, nil
}
func NewTestBackend(t *testing.T) (string, backend.Backend) {

View File

@ -367,6 +367,8 @@ func (s *store) restore() error {
revToBytes(revision{main: 1}, min)
revToBytes(revision{main: math.MaxInt64, sub: math.MaxInt64}, max)
keyToLease := make(map[string]lease.LeaseID)
// restore index
tx := s.b.BatchTx()
tx.Lock()
@ -390,26 +392,15 @@ func (s *store) restore() error {
switch {
case isTombstone(key):
s.kvindex.Tombstone(kv.Key, rev)
if lease.LeaseID(kv.Lease) != lease.NoLease {
err := s.le.Detach(lease.LeaseID(kv.Lease), []lease.LeaseItem{{Key: string(kv.Key)}})
if err != nil && err != lease.ErrLeaseNotFound {
plog.Fatalf("unexpected Detach error %v", err)
}
}
delete(keyToLease, string(kv.Key))
default:
s.kvindex.Restore(kv.Key, revision{kv.CreateRevision, 0}, rev, kv.Version)
if lease.LeaseID(kv.Lease) != lease.NoLease {
if s.le == nil {
panic("no lessor to attach lease")
}
err := s.le.Attach(lease.LeaseID(kv.Lease), []lease.LeaseItem{{Key: string(kv.Key)}})
// We are walking through the kv history here. It is possible that we attached a key to
// the lease and the lease was revoked later.
// Thus attaching an old version of key to a none existing lease is possible here, and
// we should just ignore the error.
if err != nil && err != lease.ErrLeaseNotFound {
panic("unexpected Attach error")
}
if lid := lease.LeaseID(kv.Lease); lid != lease.NoLease {
keyToLease[string(kv.Key)] = lid
} else {
delete(keyToLease, string(kv.Key))
}
}
@ -417,6 +408,23 @@ func (s *store) restore() error {
s.currentRev = rev
}
// keys in the range [compacted revision - N, compaction] might all be deleted due to compaction.
// in that case the correct revision should be set to the compaction revision, not the largest revision
// we have seen.
if s.currentRev.main < s.compactMainRev {
s.currentRev.main = s.compactMainRev
}
for key, lid := range keyToLease {
if s.le == nil {
panic("no lessor to attach lease")
}
err := s.le.Attach(lid, []lease.LeaseItem{{Key: key}})
if err != nil {
plog.Errorf("unexpected Attach error: %v", err)
}
}
_, scheduledCompactBytes := tx.UnsafeRange(metaBucketName, scheduledCompactKeyName, nil, 0)
scheduledCompact := int64(0)
if len(scheduledCompactBytes) != 0 {
@ -550,7 +558,7 @@ func (s *store) put(key, value []byte, leaseID lease.LeaseID) {
err = s.le.Detach(oldLease, []lease.LeaseItem{{Key: string(key)}})
if err != nil {
panic("unexpected error from lease detach")
plog.Errorf("unexpected error from lease detach: %v", err)
}
}
@ -619,7 +627,7 @@ func (s *store) delete(key []byte, rev revision) {
if lease.LeaseID(kv.Lease) != lease.NoLease {
err = s.le.Detach(lease.LeaseID(kv.Lease), []lease.LeaseItem{{Key: string(kv.Key)}})
if err != nil {
plog.Fatalf("cannot detach %v", err)
plog.Errorf("cannot detach %v", err)
}
}
}

View File

@ -15,8 +15,10 @@
package mvcc
import (
"os"
"reflect"
"testing"
"time"
"github.com/coreos/etcd/lease"
"github.com/coreos/etcd/mvcc/backend"
@ -93,3 +95,41 @@ func TestScheduleCompaction(t *testing.T) {
cleanup(s, b, tmpPath)
}
}
func TestCompactAllAndRestore(t *testing.T) {
b, tmpPath := backend.NewDefaultTmpBackend()
s0 := NewStore(b, &lease.FakeLessor{}, nil)
defer os.Remove(tmpPath)
s0.Put([]byte("foo"), []byte("bar"), lease.NoLease)
s0.Put([]byte("foo"), []byte("bar1"), lease.NoLease)
s0.Put([]byte("foo"), []byte("bar2"), lease.NoLease)
s0.DeleteRange([]byte("foo"), nil)
rev := s0.Rev()
// compact all keys
done, err := s0.Compact(rev)
if err != nil {
t.Fatal(err)
}
select {
case <-done:
case <-time.After(10 * time.Second):
t.Fatal("timeout waiting for compaction to finish")
}
err = s0.Close()
if err != nil {
t.Fatal(err)
}
s1 := NewStore(b, &lease.FakeLessor{}, nil)
if s1.Rev() != rev {
t.Errorf("rev = %v, want %v", s1.Rev(), rev)
}
_, err = s1.Range([]byte("foo"), nil, RangeOptions{})
if err != nil {
t.Errorf("unexpect range error %v", err)
}
}

View File

@ -20,9 +20,9 @@ import (
proto "github.com/golang/protobuf/proto"
math "math"
)
import io "io"
io "io"
)
// Reference imports to suppress errors if they are not otherwise used.
var _ = proto.Marshal
@ -89,6 +89,8 @@ type Event struct {
// A DELETE/EXPIRE event contains the deleted key with
// its modification revision set to the revision of deletion.
Kv *KeyValue `protobuf:"bytes,2,opt,name=kv" json:"kv,omitempty"`
// prev_kv holds the key-value pair before the event happens.
PrevKv *KeyValue `protobuf:"bytes,3,opt,name=prev_kv,json=prevKv" json:"prev_kv,omitempty"`
}
func (m *Event) Reset() { *m = Event{} }
@ -181,6 +183,16 @@ func (m *Event) MarshalTo(data []byte) (int, error) {
}
i += n1
}
if m.PrevKv != nil {
data[i] = 0x1a
i++
i = encodeVarintKv(data, i, uint64(m.PrevKv.Size()))
n2, err := m.PrevKv.MarshalTo(data[i:])
if err != nil {
return 0, err
}
i += n2
}
return i, nil
}
@ -247,6 +259,10 @@ func (m *Event) Size() (n int) {
l = m.Kv.Size()
n += 1 + l + sovKv(uint64(l))
}
if m.PrevKv != nil {
l = m.PrevKv.Size()
n += 1 + l + sovKv(uint64(l))
}
return n
}
@ -532,6 +548,39 @@ func (m *Event) Unmarshal(data []byte) error {
return err
}
iNdEx = postIndex
case 3:
if wireType != 2 {
return fmt.Errorf("proto: wrong wireType = %d for field PrevKv", wireType)
}
var msglen int
for shift := uint(0); ; shift += 7 {
if shift >= 64 {
return ErrIntOverflowKv
}
if iNdEx >= l {
return io.ErrUnexpectedEOF
}
b := data[iNdEx]
iNdEx++
msglen |= (int(b) & 0x7F) << shift
if b < 0x80 {
break
}
}
if msglen < 0 {
return ErrInvalidLengthKv
}
postIndex := iNdEx + msglen
if postIndex > l {
return io.ErrUnexpectedEOF
}
if m.PrevKv == nil {
m.PrevKv = &KeyValue{}
}
if err := m.PrevKv.Unmarshal(data[iNdEx:postIndex]); err != nil {
return err
}
iNdEx = postIndex
default:
iNdEx = preIndex
skippy, err := skipKv(data[iNdEx:])
@ -659,23 +708,24 @@ var (
)
var fileDescriptorKv = []byte{
// 285 bytes of a gzipped FileDescriptorProto
0x1f, 0x8b, 0x08, 0x00, 0x00, 0x09, 0x6e, 0x88, 0x02, 0xff, 0x44, 0x90, 0x41, 0x4a, 0xc3, 0x40,
0x14, 0x86, 0x33, 0x4d, 0x9b, 0xd6, 0xd7, 0x52, 0xc3, 0x50, 0x70, 0x70, 0x31, 0xc4, 0x6e, 0x2c,
0x08, 0x11, 0xea, 0x0d, 0xc4, 0xac, 0x74, 0x21, 0x21, 0xba, 0x95, 0x34, 0x7d, 0x94, 0x92, 0xa6,
0x13, 0xd2, 0x38, 0x98, 0x9b, 0x78, 0x0a, 0xcf, 0xd1, 0x65, 0x8f, 0x60, 0xe3, 0x45, 0x24, 0x6f,
0x4c, 0xdd, 0x0c, 0xef, 0xff, 0xff, 0x6f, 0x98, 0xff, 0x0d, 0x0c, 0x52, 0xed, 0xe7, 0x85, 0x2a,
0x15, 0x77, 0x32, 0x9d, 0x24, 0xf9, 0xe2, 0x72, 0xb2, 0x52, 0x2b, 0x45, 0xd6, 0x6d, 0x33, 0x99,
0x74, 0xfa, 0xc5, 0x60, 0xf0, 0x88, 0xd5, 0x6b, 0xbc, 0x79, 0x47, 0xee, 0x82, 0x9d, 0x62, 0x25,
0x98, 0xc7, 0x66, 0xa3, 0xb0, 0x19, 0xf9, 0x35, 0x9c, 0x27, 0x05, 0xc6, 0x25, 0xbe, 0x15, 0xa8,
0xd7, 0xbb, 0xb5, 0xda, 0x8a, 0x8e, 0xc7, 0x66, 0x76, 0x38, 0x36, 0x76, 0xf8, 0xe7, 0xf2, 0x2b,
0x18, 0x65, 0x6a, 0xf9, 0x4f, 0xd9, 0x44, 0x0d, 0x33, 0xb5, 0x3c, 0x21, 0x02, 0xfa, 0x1a, 0x0b,
0x4a, 0xbb, 0x94, 0xb6, 0x92, 0x4f, 0xa0, 0xa7, 0x9b, 0x02, 0xa2, 0x47, 0x2f, 0x1b, 0xd1, 0xb8,
0x1b, 0x8c, 0x77, 0x28, 0x1c, 0xa2, 0x8d, 0x98, 0x7e, 0x40, 0x2f, 0xd0, 0xb8, 0x2d, 0xf9, 0x0d,
0x74, 0xcb, 0x2a, 0x47, 0x6a, 0x3b, 0x9e, 0x5f, 0xf8, 0x66, 0x4d, 0x9f, 0x42, 0x73, 0x46, 0x55,
0x8e, 0x21, 0x41, 0xdc, 0x83, 0x4e, 0xaa, 0xa9, 0xfa, 0x70, 0xee, 0xb6, 0x68, 0xbb, 0x77, 0xd8,
0x49, 0xf5, 0xd4, 0x83, 0xb3, 0xd3, 0x25, 0xde, 0x07, 0xfb, 0xf9, 0x25, 0x72, 0x2d, 0x0e, 0xe0,
0x3c, 0x04, 0x4f, 0x41, 0x14, 0xb8, 0xec, 0x5e, 0xec, 0x8f, 0xd2, 0x3a, 0x1c, 0xa5, 0xb5, 0xaf,
0x25, 0x3b, 0xd4, 0x92, 0x7d, 0xd7, 0x92, 0x7d, 0xfe, 0x48, 0x6b, 0xe1, 0xd0, 0x5f, 0xde, 0xfd,
0x06, 0x00, 0x00, 0xff, 0xff, 0xd6, 0x21, 0x8f, 0x2c, 0x75, 0x01, 0x00, 0x00,
// 303 bytes of a gzipped FileDescriptorProto
0x1f, 0x8b, 0x08, 0x00, 0x00, 0x09, 0x6e, 0x88, 0x02, 0xff, 0x6c, 0x90, 0x41, 0x4e, 0xc2, 0x40,
0x14, 0x86, 0x3b, 0x14, 0x0a, 0x3e, 0x08, 0x36, 0x13, 0x12, 0x27, 0x2e, 0x26, 0x95, 0x8d, 0x18,
0x13, 0x4c, 0xf0, 0x06, 0xc6, 0xae, 0x70, 0x61, 0x1a, 0x74, 0x4b, 0x4a, 0x79, 0x21, 0xa4, 0x94,
0x69, 0x4a, 0x9d, 0xa4, 0x37, 0x71, 0xef, 0xde, 0x73, 0xb0, 0xe4, 0x08, 0x52, 0x2f, 0x62, 0xfa,
0xc6, 0xe2, 0xc6, 0xcd, 0xe4, 0xfd, 0xff, 0xff, 0x65, 0xe6, 0x7f, 0x03, 0x9d, 0x58, 0x8f, 0xd3,
0x4c, 0xe5, 0x8a, 0x3b, 0x89, 0x8e, 0xa2, 0x74, 0x71, 0x39, 0x58, 0xa9, 0x95, 0x22, 0xeb, 0xae,
0x9a, 0x4c, 0x3a, 0xfc, 0x64, 0xd0, 0x99, 0x62, 0xf1, 0x1a, 0x6e, 0xde, 0x90, 0xbb, 0x60, 0xc7,
0x58, 0x08, 0xe6, 0xb1, 0x51, 0x2f, 0xa8, 0x46, 0x7e, 0x0d, 0xe7, 0x51, 0x86, 0x61, 0x8e, 0xf3,
0x0c, 0xf5, 0x7a, 0xb7, 0x56, 0x5b, 0xd1, 0xf0, 0xd8, 0xc8, 0x0e, 0xfa, 0xc6, 0x0e, 0x7e, 0x5d,
0x7e, 0x05, 0xbd, 0x44, 0x2d, 0xff, 0x28, 0x9b, 0xa8, 0x6e, 0xa2, 0x96, 0x27, 0x44, 0x40, 0x5b,
0x63, 0x46, 0x69, 0x93, 0xd2, 0x5a, 0xf2, 0x01, 0xb4, 0x74, 0x55, 0x40, 0xb4, 0xe8, 0x65, 0x23,
0x2a, 0x77, 0x83, 0xe1, 0x0e, 0x85, 0x43, 0xb4, 0x11, 0xc3, 0x0f, 0x06, 0x2d, 0x5f, 0xe3, 0x36,
0xe7, 0xb7, 0xd0, 0xcc, 0x8b, 0x14, 0xa9, 0x6e, 0x7f, 0x72, 0x31, 0x36, 0x7b, 0x8e, 0x29, 0x34,
0xe7, 0xac, 0x48, 0x31, 0x20, 0x88, 0x7b, 0xd0, 0x88, 0x35, 0x75, 0xef, 0x4e, 0xdc, 0x1a, 0xad,
0x17, 0x0f, 0x1a, 0xb1, 0xe6, 0x37, 0xd0, 0x4e, 0x33, 0xd4, 0xf3, 0x58, 0x53, 0xf9, 0xff, 0x30,
0xa7, 0x02, 0xa6, 0x7a, 0xe8, 0xc1, 0xd9, 0xe9, 0x7e, 0xde, 0x06, 0xfb, 0xf9, 0x65, 0xe6, 0x5a,
0x1c, 0xc0, 0x79, 0xf4, 0x9f, 0xfc, 0x99, 0xef, 0xb2, 0x07, 0xb1, 0x3f, 0x4a, 0xeb, 0x70, 0x94,
0xd6, 0xbe, 0x94, 0xec, 0x50, 0x4a, 0xf6, 0x55, 0x4a, 0xf6, 0xfe, 0x2d, 0xad, 0x85, 0x43, 0xff,
0x7e, 0xff, 0x13, 0x00, 0x00, 0xff, 0xff, 0xb5, 0x45, 0x92, 0x5d, 0xa1, 0x01, 0x00, 0x00,
}

View File

@ -43,4 +43,6 @@ message Event {
// A DELETE/EXPIRE event contains the deleted key with
// its modification revision set to the revision of deletion.
KeyValue kv = 2;
// prev_kv holds the key-value pair before the event happens.
KeyValue prev_kv = 3;
}

View File

@ -15,6 +15,7 @@
package mvcc
import (
"bytes"
"errors"
"sync"
@ -96,6 +97,12 @@ type watchStream struct {
// Watch creates a new watcher in the stream and returns its WatchID.
// TODO: return error if ws is closed?
func (ws *watchStream) Watch(key, end []byte, startRev int64) WatchID {
// prevent wrong range where key >= end lexicographically
// watch request with 'WithFromKey' has empty-byte range end
if len(end) != 0 && bytes.Compare(key, end) != -1 {
return -1
}
ws.mu.Lock()
defer ws.mu.Unlock()
if ws.closed {

View File

@ -153,6 +153,28 @@ func TestWatcherWatchPrefix(t *testing.T) {
}
}
// TestWatcherWatchWrongRange ensures that a watcher with a wrong 'end' range
// is not created; canceling such a watcher would panic in the range tree.
func TestWatcherWatchWrongRange(t *testing.T) {
b, tmpPath := backend.NewDefaultTmpBackend()
s := WatchableKV(newWatchableStore(b, &lease.FakeLessor{}, nil))
defer cleanup(s, b, tmpPath)
w := s.NewWatchStream()
defer w.Close()
if id := w.Watch([]byte("foa"), []byte("foa"), 1); id != -1 {
t.Fatalf("key == end range given; id expected -1, got %d", id)
}
if id := w.Watch([]byte("fob"), []byte("foa"), 1); id != -1 {
t.Fatalf("key > end range given; id expected -1, got %d", id)
}
// watch request with 'WithFromKey' has empty-byte range end
if id := w.Watch([]byte("foo"), []byte{}, 1); id != 0 {
t.Fatalf("\x00 is range given; id expected 0, got %d", id)
}
}
func TestWatchDeleteRange(t *testing.T) {
b, tmpPath := backend.NewDefaultTmpBackend()
s := newWatchableStore(b, &lease.FakeLessor{}, nil)

View File

@ -12,5 +12,11 @@
// See the License for the specific language governing permissions and
// limitations under the License.
// etcd-top is a utility for analyzing etcd v2 API workload traffic.
package main
// +build !windows
package fileutil
import "os"
// OpenDir opens a directory for syncing.
func OpenDir(path string) (*os.File, error) { return os.Open(path) }

View File

@ -0,0 +1,46 @@
// Copyright 2016 The etcd Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
// +build windows
package fileutil
import (
"os"
"syscall"
)
// OpenDir opens a directory in windows with write access for syncing.
func OpenDir(path string) (*os.File, error) {
fd, err := openDir(path)
if err != nil {
return nil, err
}
return os.NewFile(uintptr(fd), path), nil
}
func openDir(path string) (fd syscall.Handle, err error) {
if len(path) == 0 {
return syscall.InvalidHandle, syscall.ERROR_FILE_NOT_FOUND
}
pathp, err := syscall.UTF16PtrFromString(path)
if err != nil {
return syscall.InvalidHandle, err
}
access := uint32(syscall.GENERIC_READ | syscall.GENERIC_WRITE)
sharemode := uint32(syscall.FILE_SHARE_READ | syscall.FILE_SHARE_WRITE)
createmode := uint32(syscall.OPEN_EXISTING)
fl := uint32(syscall.FILE_FLAG_BACKUP_SEMANTICS)
return syscall.CreateFile(pathp, access, sharemode, nil, createmode, fl, 0)
}

View File

@ -96,3 +96,26 @@ func Exist(name string) bool {
_, err := os.Stat(name)
return err == nil
}
// ZeroToEnd zeros a file starting from SEEK_CUR to its SEEK_END. May temporarily
// shorten the length of the file.
func ZeroToEnd(f *os.File) error {
// TODO: support FALLOC_FL_ZERO_RANGE
off, err := f.Seek(0, os.SEEK_CUR)
if err != nil {
return err
}
lenf, lerr := f.Seek(0, os.SEEK_END)
if lerr != nil {
return lerr
}
if err = f.Truncate(off); err != nil {
return err
}
// make sure blocks remain allocated
if err = Preallocate(f, lenf, true); err != nil {
return err
}
_, err = f.Seek(off, os.SEEK_SET)
return err
}

View File

@ -118,3 +118,42 @@ func TestExist(t *testing.T) {
t.Errorf("exist = %v, want false", g)
}
}
func TestZeroToEnd(t *testing.T) {
f, err := ioutil.TempFile(os.TempDir(), "fileutil")
if err != nil {
t.Fatal(err)
}
defer f.Close()
b := make([]byte, 1024)
for i := range b {
b[i] = 12
}
if _, err = f.Write(b); err != nil {
t.Fatal(err)
}
if _, err = f.Seek(512, os.SEEK_SET); err != nil {
t.Fatal(err)
}
if err = ZeroToEnd(f); err != nil {
t.Fatal(err)
}
off, serr := f.Seek(0, os.SEEK_CUR)
if serr != nil {
t.Fatal(serr)
}
if off != 512 {
t.Fatalf("expected offset 512, got %d", off)
}
b = make([]byte, 512)
if _, err = f.Read(b); err != nil {
t.Fatal(err)
}
for i := range b {
if b[i] != 0 {
t.Errorf("expected b[%d] = 0, got %d", i, b[i])
}
}
}

pkg/ioutil/pagewriter.go (new file, 106 lines)
View File

@ -0,0 +1,106 @@
// Copyright 2016 The etcd Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package ioutil
import (
"io"
)
var defaultBufferBytes = 128 * 1024
// PageWriter implements the io.Writer interface so that writes will
// either be in page chunks or from flushing.
type PageWriter struct {
w io.Writer
// pageOffset tracks the page offset of the base of the buffer
pageOffset int
// pageBytes is the number of bytes per page
pageBytes int
// bufferedBytes counts the number of bytes pending for write in the buffer
bufferedBytes int
// buf holds the write buffer
buf []byte
// bufWatermarkBytes is the number of bytes the buffer can hold before it needs
// to be flushed. It is less than len(buf) so there is space for slack writes
// to bring the writer to page alignment.
bufWatermarkBytes int
}
// NewPageWriter creates a new PageWriter. pageBytes is the number of bytes
// to write per page. pageOffset is the starting offset of io.Writer.
func NewPageWriter(w io.Writer, pageBytes, pageOffset int) *PageWriter {
return &PageWriter{
w: w,
pageOffset: pageOffset,
pageBytes: pageBytes,
buf: make([]byte, defaultBufferBytes+pageBytes),
bufWatermarkBytes: defaultBufferBytes,
}
}
func (pw *PageWriter) Write(p []byte) (n int, err error) {
if len(p)+pw.bufferedBytes <= pw.bufWatermarkBytes {
// no overflow
copy(pw.buf[pw.bufferedBytes:], p)
pw.bufferedBytes += len(p)
return len(p), nil
}
// complete the slack page in the buffer if unaligned
slack := pw.pageBytes - ((pw.pageOffset + pw.bufferedBytes) % pw.pageBytes)
if slack != pw.pageBytes {
partial := slack > len(p)
if partial {
// not enough data to complete the slack page
slack = len(p)
}
// special case: writing to slack page in buffer
copy(pw.buf[pw.bufferedBytes:], p[:slack])
pw.bufferedBytes += slack
n = slack
p = p[slack:]
if partial {
// avoid forcing an unaligned flush
return n, nil
}
}
// buffer contents are now page-aligned; clear out
if err = pw.Flush(); err != nil {
return n, err
}
// directly write all complete pages without copying
if len(p) > pw.pageBytes {
pages := len(p) / pw.pageBytes
c, werr := pw.w.Write(p[:pages*pw.pageBytes])
n += c
if werr != nil {
return n, werr
}
p = p[pages*pw.pageBytes:]
}
// write remaining tail to buffer
c, werr := pw.Write(p)
n += c
return n, werr
}
func (pw *PageWriter) Flush() error {
if pw.bufferedBytes == 0 {
return nil
}
_, err := pw.w.Write(pw.buf[:pw.bufferedBytes])
pw.pageOffset = (pw.pageOffset + pw.bufferedBytes) % pw.pageBytes
pw.bufferedBytes = 0
return err
}
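
A usage sketch for the new PageWriter, assuming it is imported from github.com/coreos/etcd/pkg/ioutil; the countingWriter sink is purely illustrative and only records how large each underlying write was.

package main

import (
	"fmt"

	"github.com/coreos/etcd/pkg/ioutil"
)

// countingWriter records the size of every write it receives.
type countingWriter struct{ sizes []int }

func (c *countingWriter) Write(p []byte) (int, error) {
	c.sizes = append(c.sizes, len(p))
	return len(p), nil
}

func main() {
	cw := &countingWriter{}
	// 4096-byte pages, starting at offset 0
	pw := ioutil.NewPageWriter(cw, 4096, 0)

	pw.Write(make([]byte, 100))    // small write stays buffered; nothing reaches cw yet
	pw.Write(make([]byte, 200000)) // overflow is flushed to cw in page-aligned chunks
	pw.Flush()                     // pushes out the remaining unaligned tail

	fmt.Println(cw.sizes) // every write except the final flush is a multiple of 4096
}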

View File

@ -0,0 +1,129 @@
// Copyright 2016 The etcd Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package ioutil
import (
"math/rand"
"testing"
)
func TestPageWriterRandom(t *testing.T) {
// smaller buffer for stress testing
defaultBufferBytes = 8 * 1024
pageBytes := 128
buf := make([]byte, 4*defaultBufferBytes)
cw := &checkPageWriter{pageBytes: pageBytes, t: t}
w := NewPageWriter(cw, pageBytes, 0)
n := 0
for i := 0; i < 4096; i++ {
c, err := w.Write(buf[:rand.Intn(len(buf))])
if err != nil {
t.Fatal(err)
}
n += c
}
if cw.writeBytes > n {
t.Fatalf("wrote %d bytes to io.Writer, but only wrote %d bytes", cw.writeBytes, n)
}
if cw.writeBytes-n > pageBytes {
t.Fatalf("got %d bytes pending, expected less than %d bytes", cw.writeBytes-n, pageBytes)
}
t.Logf("total writes: %d", cw.writes)
t.Logf("total write bytes: %d (of %d)", cw.writeBytes, n)
}
// TestPageWriterPartialSlack tests the case where a write overflows the buffer
// but there is not enough data to complete the slack write.
func TestPageWriterPartialSlack(t *testing.T) {
defaultBufferBytes = 1024
pageBytes := 128
buf := make([]byte, defaultBufferBytes)
cw := &checkPageWriter{pageBytes: 64, t: t}
w := NewPageWriter(cw, pageBytes, 0)
// put writer in non-zero page offset
if _, err := w.Write(buf[:64]); err != nil {
t.Fatal(err)
}
if err := w.Flush(); err != nil {
t.Fatal(err)
}
if cw.writes != 1 {
t.Fatalf("got %d writes, expected 1", cw.writes)
}
// nearly fill buffer
if _, err := w.Write(buf[:1022]); err != nil {
t.Fatal(err)
}
// overflow buffer, but without enough to write as aligned
if _, err := w.Write(buf[:8]); err != nil {
t.Fatal(err)
}
if cw.writes != 1 {
t.Fatalf("got %d writes, expected 1", cw.writes)
}
// finish writing slack space
if _, err := w.Write(buf[:128]); err != nil {
t.Fatal(err)
}
if cw.writes != 2 {
t.Fatalf("got %d writes, expected 2", cw.writes)
}
}
// TestPageWriterOffset tests if page writer correctly repositions when offset is given.
func TestPageWriterOffset(t *testing.T) {
defaultBufferBytes = 1024
pageBytes := 128
buf := make([]byte, defaultBufferBytes)
cw := &checkPageWriter{pageBytes: 64, t: t}
w := NewPageWriter(cw, pageBytes, 0)
if _, err := w.Write(buf[:64]); err != nil {
t.Fatal(err)
}
if err := w.Flush(); err != nil {
t.Fatal(err)
}
if w.pageOffset != 64 {
t.Fatalf("w.pageOffset expected 64, got %d", w.pageOffset)
}
w = NewPageWriter(cw, pageBytes, w.pageOffset)
if _, err := w.Write(buf[:64]); err != nil {
t.Fatal(err)
}
if err := w.Flush(); err != nil {
t.Fatal(err)
}
if w.pageOffset != 0 {
t.Fatalf("w.pageOffset expected 0, got %d", w.pageOffset)
}
}
// checkPageWriter implements an io.Writer that fails a test on unaligned writes.
type checkPageWriter struct {
pageBytes int
writes int
writeBytes int
t *testing.T
}
func (cw *checkPageWriter) Write(p []byte) (int, error) {
if len(p)%cw.pageBytes != 0 {
cw.t.Fatalf("got write len(p) = %d, expected len(p) == k*cw.pageBytes", len(p))
}
cw.writes++
cw.writeBytes += len(p)
return len(p), nil
}

View File

@ -38,7 +38,7 @@ var (
// SoftState provides state that is useful for logging and debugging.
// The state is volatile and does not need to be persisted to the WAL.
type SoftState struct {
Lead uint64
Lead uint64 // must use atomic operations to access; keep 64-bit aligned.
RaftState StateType
}

View File

@ -25,9 +25,9 @@ import (
proto "github.com/golang/protobuf/proto"
math "math"
)
import io "io"
io "io"
)
// Reference imports to suppress errors if they are not otherwise used.
var _ = proto.Marshal
@ -183,9 +183,9 @@ func (x *ConfChangeType) UnmarshalJSON(data []byte) error {
func (ConfChangeType) EnumDescriptor() ([]byte, []int) { return fileDescriptorRaft, []int{2} }
type Entry struct {
Type EntryType `protobuf:"varint,1,opt,name=Type,json=type,enum=raftpb.EntryType" json:"Type"`
Term uint64 `protobuf:"varint,2,opt,name=Term,json=term" json:"Term"`
Index uint64 `protobuf:"varint,3,opt,name=Index,json=index" json:"Index"`
Type EntryType `protobuf:"varint,1,opt,name=Type,json=type,enum=raftpb.EntryType" json:"Type"`
Data []byte `protobuf:"bytes,4,opt,name=Data,json=data" json:"Data,omitempty"`
XXX_unrecognized []byte `json:"-"`
}

View File

@ -15,9 +15,9 @@ enum EntryType {
}
message Entry {
optional uint64 Term = 2 [(gogoproto.nullable) = false]; // must be 64-bit aligned for atomic operations
optional uint64 Index = 3 [(gogoproto.nullable) = false]; // must be 64-bit aligned for atomic operations
optional EntryType Type = 1 [(gogoproto.nullable) = false];
optional uint64 Term = 2 [(gogoproto.nullable) = false];
optional uint64 Index = 3 [(gogoproto.nullable) = false];
optional bytes Data = 4;
}

View File

@ -49,6 +49,7 @@ var (
"2.1.0": {streamTypeMsgAppV2, streamTypeMessage},
"2.2.0": {streamTypeMsgAppV2, streamTypeMessage},
"2.3.0": {streamTypeMsgAppV2, streamTypeMessage},
"3.0.0": {streamTypeMsgAppV2, streamTypeMessage},
}
)
@ -332,7 +333,16 @@ func (cr *streamReader) decodeLoop(rc io.ReadCloser, t streamType) error {
default:
plog.Panicf("unhandled stream type %s", t)
}
cr.closer = rc
select {
case <-cr.stopc:
cr.mu.Unlock()
if err := rc.Close(); err != nil {
return err
}
return io.EOF
default:
cr.closer = rc
}
cr.mu.Unlock()
for {

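The select above follows a common shutdown pattern: never publish a freshly dialed connection once stop has been requested, otherwise stop() would miss it and the connection would leak. A generic sketch of that pattern with made-up names (not the rafthttp types):

```
package main

import (
    "fmt"
    "io"
    "sync"
)

type conn struct {
    mu     sync.Mutex
    stopc  chan struct{}
    closer io.ReadCloser
}

// register installs rc as the connection's closer unless a stop was already
// requested; in that case it closes rc itself and reports io.EOF so the
// caller's read loop exits.
func (c *conn) register(rc io.ReadCloser) error {
    c.mu.Lock()
    defer c.mu.Unlock()
    select {
    case <-c.stopc:
        if err := rc.Close(); err != nil {
            return err
        }
        return io.EOF
    default:
        c.closer = rc
    }
    return nil
}

type nopReadCloser struct{}

func (nopReadCloser) Read(p []byte) (int, error) { return 0, io.EOF }
func (nopReadCloser) Close() error               { return nil }

func main() {
    c := &conn{stopc: make(chan struct{})}
    close(c.stopc) // simulate stop() having already run
    fmt.Println(c.register(nopReadCloser{}) == io.EOF) // true: rc was refused and closed
}
```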

@ -17,6 +17,7 @@ package rafthttp
import (
"errors"
"fmt"
"io"
"net/http"
"net/http/httptest"
"reflect"
@ -180,6 +181,60 @@ func TestStreamReaderDialResult(t *testing.T) {
}
}
// TestStreamReaderStopOnDial tests that a stream reader closes the connection on stop.
func TestStreamReaderStopOnDial(t *testing.T) {
defer testutil.AfterTest(t)
h := http.Header{}
h.Add("X-Server-Version", version.Version)
tr := &respWaitRoundTripper{rrt: &respRoundTripper{code: http.StatusOK, header: h}}
sr := &streamReader{
peerID: types.ID(2),
tr: &Transport{streamRt: tr, ClusterID: types.ID(1)},
picker: mustNewURLPicker(t, []string{"http://localhost:2380"}),
errorc: make(chan error, 1),
typ: streamTypeMessage,
status: newPeerStatus(types.ID(2)),
}
tr.onResp = func() {
// stop() waits for the run() goroutine to exit, but that exit
// needs a response from RoundTrip() first, so call stop() from a goroutine
go sr.stop()
// wait so that stop() is blocked on run() exiting
time.Sleep(10 * time.Millisecond)
// sr.run() completes dialing then begins decoding while stopped
}
sr.start()
select {
case <-sr.done:
case <-time.After(time.Second):
t.Fatal("streamReader did not stop in time")
}
}
type respWaitRoundTripper struct {
rrt *respRoundTripper
onResp func()
}
func (t *respWaitRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
resp, err := t.rrt.RoundTrip(req)
resp.Body = newWaitReadCloser()
t.onResp()
return resp, err
}
type waitReadCloser struct{ closec chan struct{} }
func newWaitReadCloser() *waitReadCloser { return &waitReadCloser{make(chan struct{})} }
func (wrc *waitReadCloser) Read(p []byte) (int, error) {
<-wrc.closec
return 0, io.EOF
}
func (wrc *waitReadCloser) Close() error {
close(wrc.closec)
return nil
}
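A hypothetical companion test (not part of this change) showing how the helper behaves: Read parks the caller until Close is called, which is what lets the test above hold run() inside its decode loop while stop() races against it. It assumes the same package and imports as the surrounding test file.

```
func TestWaitReadCloserUnblocksOnClose(t *testing.T) {
    wrc := newWaitReadCloser()
    done := make(chan struct{})
    go func() {
        // blocks until wrc.Close() below, then returns io.EOF
        wrc.Read(make([]byte, 1))
        close(done)
    }()
    wrc.Close()
    select {
    case <-done:
    case <-time.After(time.Second):
        t.Fatal("Read did not unblock after Close")
    }
}
```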
// TestStreamReaderDialDetectUnsupport tests that the dial func can detect
// that the stream type is not supported by the remote peer.
func TestStreamReaderDialDetectUnsupport(t *testing.T) {


@ -19,9 +19,9 @@ import (
proto "github.com/golang/protobuf/proto"
math "math"
)
import io "io"
io "io"
)
// Reference imports to suppress errors if they are not otherwise used.
var _ = proto.Marshal

test

@ -125,15 +125,6 @@ function fmt_tests {
fi
if which gosimple >/dev/null; then
echo "Checking gosimple..."
for path in $GOSIMPLE_UNUSED_PATHS; do
simplResult=`gosimple $REPO_PATH/${path} || true`
if [ -n "${simplResult}" ]; then
echo -e "gosimple checking ${path} failed:\n${simplResult}"
exit 255
fi
done
else
echo "Skipping gosimple..."
fi


@ -1,23 +0,0 @@
# etcd-top
etcd realtime workload analyzer. Useful for rapid diagnosis of production usage issues and analysis of production request distributions.
usage:
```
-iface="eth0": interface for sniffing traffic on
-period=1: seconds between submissions
-ports="2379": etcd listening ports
-promiscuous=true: whether to perform promiscuous sniffing or not.
-topk=10: submit stats for the top <K> sniffed paths
```
result:
```
go run etcd-top.go --period=1 -topk=3
1440035702 sniffed 1074 requests over last 1 seconds
Top 3 most popular http requests:
Sum Rate Verb Path
1305 22 GET /v2/keys/c
1302 8 GET /v2/keys/S
1297 10 GET /v2/keys/h
```


@ -1,229 +0,0 @@
// Copyright 2015 The etcd Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package main
import (
"bufio"
"bytes"
"flag"
"fmt"
"math"
"net/http"
"os"
"runtime"
"sort"
"strconv"
"strings"
"time"
"github.com/akrennmair/gopcap"
"github.com/spacejam/loghisto"
)
type nameSum struct {
Name string
Sum float64
Rate float64
}
type nameSums []nameSum
func (n nameSums) Len() int {
return len(n)
}
func (n nameSums) Less(i, j int) bool {
return n[i].Sum > n[j].Sum
}
func (n nameSums) Swap(i, j int) {
n[i], n[j] = n[j], n[i]
}
// This function listens for periodic metrics from the loghisto metric system,
// and upon receipt of a batch of them it will print out the desired topK.
func statPrinter(metricStream chan *loghisto.ProcessedMetricSet, topK, period uint) {
for m := range metricStream {
requestCounter := float64(0)
nvs := nameSums{}
for k, v := range m.Metrics {
// loghisto adds _rate suffixed metrics for counters and histograms
if strings.HasSuffix(k, "_rate") && !strings.HasSuffix(k, "_rate_rate") {
continue
}
nvs = append(nvs, nameSum{
Name: k,
Sum: v,
Rate: m.Metrics[k+"_rate"],
})
requestCounter += m.Metrics[k+"_rate"]
}
fmt.Printf("\n%d sniffed %d requests over last %d seconds\n\n", time.Now().Unix(),
uint(requestCounter), period)
if len(nvs) == 0 {
continue
}
sort.Sort(nvs)
fmt.Printf("Top %d most popular http requests:\n", topK)
fmt.Println("Total Sum Period Sum Verb Path")
for _, nv := range nvs[0:int(math.Min(float64(len(nvs)), float64(topK)))] {
fmt.Printf("%9.1d %7.1d %s\n", int(nv.Sum), int(nv.Rate), nv.Name)
}
}
}
// packetDecoder decodes packets and hands them off to the streamRouter
func packetDecoder(packetsIn chan *pcap.Packet, packetsOut chan *pcap.Packet) {
for pkt := range packetsIn {
pkt.Decode()
select {
case packetsOut <- pkt:
default:
fmt.Fprint(os.Stderr, "shedding at decoder!")
}
}
}
// processor tries to parse an http request from each packet, and if
// successful it records metrics about it in the loghisto metric system.
func processor(ms *loghisto.MetricSystem, packetsIn chan *pcap.Packet) {
for pkt := range packetsIn {
req, reqErr := http.ReadRequest(bufio.NewReader(bytes.NewReader(pkt.Payload)))
if reqErr == nil {
ms.Counter(req.Method+" "+req.URL.Path, 1)
}
}
}
// streamRouter takes a decoded packet and routes it to a processor that can deal with all requests
// and responses for this particular TCP connection. This allows the processor to own a local map
// of requests so that it can avoid coordinating with other goroutines to perform analysis.
func streamRouter(ports []uint16, parsedPackets chan *pcap.Packet, processors []chan *pcap.Packet) {
for pkt := range parsedPackets {
if pkt.TCP == nil {
continue
}
clientPort := uint16(0)
for _, p := range ports {
if pkt.TCP.SrcPort == p {
clientPort = pkt.TCP.DestPort
break
}
if pkt.TCP.DestPort == p {
clientPort = pkt.TCP.SrcPort
break
}
}
if clientPort != 0 {
// The client port can be assumed to have enough entropy to spread
// streams across processors, and routing the same TCP stream to the
// same processor every time keeps any future packet reconstruction
// simple.
select {
case processors[int(clientPort)%len(processors)] <- pkt:
default:
fmt.Fprint(os.Stderr, "Shedding load at router!")
}
}
}
}
// 1. parse args
// 2. start the loghisto metric system
// 3. start the processing and printing goroutines
// 4. open the pcap handler
// 5. hand off packets from the handler to the decoder
func main() {
portsArg := flag.String("ports", "2379", "etcd listening ports")
iface := flag.String("iface", "eth0", "interface for sniffing traffic on")
promisc := flag.Bool("promiscuous", true, "promiscuous mode")
period := flag.Uint("period", 1, "seconds between submissions")
topK := flag.Uint("topk", 10, "submit stats for the top <K> sniffed paths")
flag.Parse()
numCPU := runtime.NumCPU()
runtime.GOMAXPROCS(numCPU)
ms := loghisto.NewMetricSystem(time.Duration(*period)*time.Second, false)
ms.Start()
metricStream := make(chan *loghisto.ProcessedMetricSet, 2)
ms.SubscribeToProcessedMetrics(metricStream)
defer ms.UnsubscribeFromProcessedMetrics(metricStream)
go statPrinter(metricStream, *topK, *period)
ports := []uint16{}
for _, p := range strings.Split(*portsArg, ",") {
port, err := strconv.Atoi(p)
if err == nil {
ports = append(ports, uint16(port))
} else {
fmt.Fprintf(os.Stderr, "Failed to parse port \"%s\": %v\n", p, err)
os.Exit(1)
}
}
if len(ports) == 0 {
fmt.Fprint(os.Stderr, "No ports given! Exiting.\n")
os.Exit(1)
}
// We choose 1518 for the snaplen because it's the default
// ethernet MTU at the link layer. We choose 1000 for the
// timeout based on a rough measurement of its impact on latency,
// though that value is less precisely tuned.
h, err := pcap.Openlive(*iface, 1518, *promisc, 1000)
if err != nil {
fmt.Fprintf(os.Stderr, "%v", err)
os.Exit(1)
}
defer h.Close()
portArray := strings.Split(*portsArg, ",")
dst := strings.Join(portArray, " or dst port ")
src := strings.Join(portArray, " or src port ")
filter := fmt.Sprintf("tcp and (dst port %s or src port %s)", dst, src)
fmt.Println("using bpf filter: ", filter)
if err := h.Setfilter(filter); err != nil {
fmt.Fprintf(os.Stderr, "%v", err)
os.Exit(1)
}
unparsedPackets := make(chan *pcap.Packet, 16384)
parsedPackets := make(chan *pcap.Packet, 16384)
for i := 0; i < int(math.Max(2, float64(numCPU/4))); i++ {
go packetDecoder(unparsedPackets, parsedPackets)
}
processors := []chan *pcap.Packet{}
for i := 0; i < int(math.Max(2, float64(numCPU/4))); i++ {
p := make(chan *pcap.Packet, 16384)
processors = append(processors, p)
go processor(ms, p)
}
go streamRouter(ports, parsedPackets, processors)
for {
pkt := h.Next()
if pkt != nil {
select {
case unparsedPackets <- pkt:
default:
fmt.Fprint(os.Stderr, "SHEDDING IN MAIN")
}
}
}
}


@ -29,7 +29,7 @@ import (
var (
// MinClusterVersion is the min cluster version this etcd binary is compatible with.
MinClusterVersion = "2.3.0"
Version = "3.0.5"
Version = "3.0.17"
// Git SHA Value will be set during build
GitSHA = "Not provided (use ./build instead of go build)"


@ -15,28 +15,34 @@
package wal
import (
"bufio"
"encoding/binary"
"hash"
"io"
"os"
"sync"
"github.com/coreos/etcd/pkg/crc"
"github.com/coreos/etcd/pkg/ioutil"
"github.com/coreos/etcd/wal/walpb"
)
// walPageBytes is the alignment for flushing records to the backing Writer.
// It should be a multiple of the minimum sector size so that WAL repair can
// safely distinguish between torn writes and ordinary data corruption.
const walPageBytes = 8 * minSectorSize
type encoder struct {
mu sync.Mutex
bw *bufio.Writer
bw *ioutil.PageWriter
crc hash.Hash32
buf []byte
uint64buf []byte
}
func newEncoder(w io.Writer, prevCrc uint32) *encoder {
func newEncoder(w io.Writer, prevCrc uint32, pageOffset int) *encoder {
return &encoder{
bw: bufio.NewWriter(w),
bw: ioutil.NewPageWriter(w, walPageBytes, pageOffset),
crc: crc.New(prevCrc, crcTable),
// 1MB buffer
buf: make([]byte, 1024*1024),
@ -44,6 +50,15 @@ func newEncoder(w io.Writer, prevCrc uint32) *encoder {
}
}
// newFileEncoder creates a new encoder with current file offset for the page writer.
func newFileEncoder(f *os.File, prevCrc uint32) (*encoder, error) {
offset, err := f.Seek(0, os.SEEK_CUR)
if err != nil {
return nil, err
}
return newEncoder(f, prevCrc, int(offset)), nil
}
func (e *encoder) encode(rec *walpb.Record) error {
e.mu.Lock()
defer e.mu.Unlock()

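A hypothetical caller sketch (the helper name and flow are illustrative, not part of the WAL API): when a segment already holds records, the encoder has to be seeded with the current file offset so the page writer keeps flushing on walPageBytes boundaries.

```
func reopenEncoder(path string, prevCrc uint32) (*encoder, error) {
    f, err := os.OpenFile(path, os.O_WRONLY, 0600)
    if err != nil {
        return nil, err
    }
    // position at the end of the existing records; newFileEncoder reads the
    // resulting offset itself and passes it to the page writer
    if _, err = f.Seek(0, os.SEEK_END); err != nil {
        f.Close()
        return nil, err
    }
    return newFileEncoder(f, prevCrc)
}
```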

@ -69,7 +69,7 @@ func TestWriteRecord(t *testing.T) {
typ := int64(0xABCD)
d := []byte("Hello world!")
buf := new(bytes.Buffer)
e := newEncoder(buf, 0)
e := newEncoder(buf, 0, 0)
e.encode(&walpb.Record{Type: typ, Data: d})
e.flush()
decoder := newDecoder(ioutil.NopCloser(buf))


@ -67,7 +67,11 @@ var (
// A just opened WAL is in read mode, and ready for reading records.
// The WAL will be ready for appending after reading out all the previous records.
type WAL struct {
dir string // the living directory of the underlay files
dir string // the living directory of the underlay files
// dirFile is a fd for the wal directory for syncing on Rename
dirFile *os.File
metadata []byte // metadata recorded at the head of each WAL
state raftpb.HardState // hardstate recorded at the head of WAL
@ -106,45 +110,49 @@ func Create(dirpath string, metadata []byte) (*WAL, error) {
if err != nil {
return nil, err
}
if _, err := f.Seek(0, os.SEEK_END); err != nil {
if _, err = f.Seek(0, os.SEEK_END); err != nil {
return nil, err
}
if err := fileutil.Preallocate(f.File, segmentSizeBytes, true); err != nil {
if err = fileutil.Preallocate(f.File, segmentSizeBytes, true); err != nil {
return nil, err
}
w := &WAL{
dir: dirpath,
metadata: metadata,
encoder: newEncoder(f, 0),
}
w.encoder, err = newFileEncoder(f.File, 0)
if err != nil {
return nil, err
}
w.locks = append(w.locks, f)
if err := w.saveCrc(0); err != nil {
if err = w.saveCrc(0); err != nil {
return nil, err
}
if err := w.encoder.encode(&walpb.Record{Type: metadataType, Data: metadata}); err != nil {
if err = w.encoder.encode(&walpb.Record{Type: metadataType, Data: metadata}); err != nil {
return nil, err
}
if err := w.SaveSnapshot(walpb.Snapshot{}); err != nil {
if err = w.SaveSnapshot(walpb.Snapshot{}); err != nil {
return nil, err
}
// rename of directory with locked files doesn't work on windows; close
// the WAL to release the locks so the directory can be renamed
w.Close()
if err := os.Rename(tmpdirpath, dirpath); err != nil {
if w, err = w.renameWal(tmpdirpath); err != nil {
return nil, err
}
// reopen and relock
newWAL, oerr := Open(dirpath, walpb.Snapshot{})
if oerr != nil {
return nil, oerr
// directory was renamed; sync parent dir to persist rename
pdir, perr := fileutil.OpenDir(path.Dir(w.dir))
if perr != nil {
return nil, perr
}
if _, _, _, err := newWAL.ReadAll(); err != nil {
newWAL.Close()
return nil, err
if perr = fileutil.Fsync(pdir); perr != nil {
return nil, perr
}
return newWAL, nil
if perr = pdir.Close(); perr != nil {
return nil, perr
}
return w, nil
}
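The fsync of the parent directory above follows a general durability rule: a rename is only persistent once the directory entry itself has been flushed. A generic sketch of the pattern (the helper name is made up):

```
package main

import (
    "os"
    "path/filepath"
)

// renameDurably renames tmp to dst and then syncs dst's parent directory so
// the rename itself survives a crash.
func renameDurably(tmp, dst string) error {
    if err := os.Rename(tmp, dst); err != nil {
        return err
    }
    pdir, err := os.Open(filepath.Dir(dst))
    if err != nil {
        return err
    }
    defer pdir.Close()
    return pdir.Sync()
}

func main() {
    _ = renameDurably("/tmp/wal.tmp", "/tmp/wal")
}
```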
// Open opens the WAL at the given snap.
@ -154,7 +162,14 @@ func Create(dirpath string, metadata []byte) (*WAL, error) {
// the given snap. The WAL cannot be appended to before reading out all of its
// previous records.
func Open(dirpath string, snap walpb.Snapshot) (*WAL, error) {
return openAtIndex(dirpath, snap, true)
w, err := openAtIndex(dirpath, snap, true)
if err != nil {
return nil, err
}
if w.dirFile, err = fileutil.OpenDir(w.dir); err != nil {
return nil, err
}
return w, nil
}
// OpenForRead only opens the wal files for read.
@ -299,6 +314,18 @@ func (w *WAL) ReadAll() (metadata []byte, state raftpb.HardState, ents []raftpb.
state.Reset()
return nil, state, nil, err
}
// decodeRecord() will return io.EOF if it detects a zero record,
// but this zero record may be followed by non-zero records from
// a torn write. Overwriting some of these non-zero records, but
// not all, will cause CRC errors on WAL open. Since the records
// were never fully synced to disk in the first place, it's safe
// to zero them out to avoid any CRC errors from new writes.
if _, err = w.tail().Seek(w.decoder.lastOffset(), os.SEEK_SET); err != nil {
return nil, state, nil, err
}
if err = fileutil.ZeroToEnd(w.tail().File); err != nil {
return nil, state, nil, err
}
}
err = nil
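The zeroing step above can be pictured with a small sketch (not fileutil.ZeroToEnd itself): everything after the last good record is overwritten with zeros, so a later partial overwrite cannot leave a mix of stale and new bytes that would fail the CRC check.

```
package main

import "os"

// zeroFrom overwrites everything from off to the end of f with zeros and
// restores the offset, mimicking the repair described above.
func zeroFrom(f *os.File, off int64) error {
    end, err := f.Seek(0, os.SEEK_END)
    if err != nil {
        return err
    }
    if _, err = f.Seek(off, os.SEEK_SET); err != nil {
        return err
    }
    if _, err = f.Write(make([]byte, end-off)); err != nil {
        return err
    }
    _, err = f.Seek(off, os.SEEK_SET)
    return err
}

func main() {}
```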
@ -317,8 +344,10 @@ func (w *WAL) ReadAll() (metadata []byte, state raftpb.HardState, ents []raftpb.
if w.tail() != nil {
// create encoder (chain crc with the decoder), enable appending
_, err = w.tail().Seek(w.decoder.lastOffset(), os.SEEK_SET)
w.encoder = newEncoder(w.tail(), w.decoder.lastCRC())
w.encoder, err = newFileEncoder(w.tail().File, w.decoder.lastCRC())
if err != nil {
return
}
}
w.decoder = nil
@ -352,7 +381,10 @@ func (w *WAL) cut() error {
// update writer and save the previous crc
w.locks = append(w.locks, newTail)
prevCrc := w.encoder.crc.Sum32()
w.encoder = newEncoder(w.tail(), prevCrc)
w.encoder, err = newFileEncoder(w.tail().File, prevCrc)
if err != nil {
return err
}
if err = w.saveCrc(prevCrc); err != nil {
return err
}
@ -375,6 +407,10 @@ func (w *WAL) cut() error {
if err = os.Rename(newTail.Name(), fpath); err != nil {
return err
}
if err = fileutil.Fsync(w.dirFile); err != nil {
return err
}
newTail.Close()
if newTail, err = fileutil.LockFile(fpath, os.O_WRONLY, fileutil.PrivateFileMode); err != nil {
@ -387,7 +423,10 @@ func (w *WAL) cut() error {
w.locks[len(w.locks)-1] = newTail
prevCrc = w.encoder.crc.Sum32()
w.encoder = newEncoder(w.tail(), prevCrc)
w.encoder, err = newFileEncoder(w.tail().File, prevCrc)
if err != nil {
return err
}
plog.Infof("segmented wal file %v is created", fpath)
return nil
@ -477,7 +516,7 @@ func (w *WAL) Close() error {
plog.Errorf("failed to unlock during closing wal: %s", err)
}
}
return nil
return w.dirFile.Close()
}
func (w *WAL) saveEntry(e *raftpb.Entry) error {


@ -61,7 +61,7 @@ func TestNew(t *testing.T) {
}
var wb bytes.Buffer
e := newEncoder(&wb, 0)
e := newEncoder(&wb, 0, 0)
err = e.encode(&walpb.Record{Type: crcType, Crc: 0})
if err != nil {
t.Fatalf("err = %v, want nil", err)
@ -465,7 +465,7 @@ func TestSaveEmpty(t *testing.T) {
var buf bytes.Buffer
var est raftpb.HardState
w := WAL{
encoder: newEncoder(&buf, 0),
encoder: newEncoder(&buf, 0, 0),
}
if err := w.saveState(&est); err != nil {
t.Errorf("err = %v, want nil", err)
@ -636,3 +636,89 @@ func TestRestartCreateWal(t *testing.T) {
t.Fatalf("got error %v and meta %q, expected nil and %q", rerr, meta, "abc")
}
}
// TestOpenOnTornWrite ensures that entries past the torn write are truncated.
func TestOpenOnTornWrite(t *testing.T) {
maxEntries := 40
clobberIdx := 20
overwriteEntries := 5
p, err := ioutil.TempDir(os.TempDir(), "waltest")
if err != nil {
t.Fatal(err)
}
defer os.RemoveAll(p)
w, err := Create(p, nil)
if err != nil {
t.Fatal(err)
}
defer w.Close()
// get offset of end of each saved entry
offsets := make([]int64, maxEntries)
for i := range offsets {
es := []raftpb.Entry{{Index: uint64(i)}}
if err = w.Save(raftpb.HardState{}, es); err != nil {
t.Fatal(err)
}
if offsets[i], err = w.tail().Seek(0, os.SEEK_CUR); err != nil {
t.Fatal(err)
}
}
fn := path.Join(p, path.Base(w.tail().Name()))
w.Close()
// clobber some entry with 0's to simulate a torn write
f, ferr := os.OpenFile(fn, os.O_WRONLY, fileutil.PrivateFileMode)
if ferr != nil {
t.Fatal(ferr)
}
defer f.Close()
_, err = f.Seek(offsets[clobberIdx], os.SEEK_SET)
if err != nil {
t.Fatal(err)
}
zeros := make([]byte, offsets[clobberIdx+1]-offsets[clobberIdx])
_, err = f.Write(zeros)
if err != nil {
t.Fatal(err)
}
f.Close()
w, err = Open(p, walpb.Snapshot{})
if err != nil {
t.Fatal(err)
}
// seek up to clobbered entry
_, _, _, err = w.ReadAll()
if err != nil {
t.Fatal(err)
}
// write a few entries past the clobbered entry
for i := 0; i < overwriteEntries; i++ {
// Index is different from old, truncated entries
es := []raftpb.Entry{{Index: uint64(i + clobberIdx), Data: []byte("new")}}
if err = w.Save(raftpb.HardState{}, es); err != nil {
t.Fatal(err)
}
}
w.Close()
// read back the entries, confirm number of entries matches expectation
w, err = OpenForRead(p, walpb.Snapshot{})
if err != nil {
t.Fatal(err)
}
_, _, ents, rerr := w.ReadAll()
if rerr != nil {
// CRC error? the old entries were likely never truncated away
t.Fatal(rerr)
}
wEntries := (clobberIdx - 1) + overwriteEntries
if len(ents) != wEntries {
t.Fatalf("expected len(ents) = %d, got %d", wEntries, len(ents))
}
}

wal/wal_unix.go

@ -0,0 +1,44 @@
// Copyright 2016 The etcd Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
// +build !windows
package wal
import (
"os"
"github.com/coreos/etcd/pkg/fileutil"
)
func (w *WAL) renameWal(tmpdirpath string) (*WAL, error) {
// On non-Windows platforms, hold the lock while renaming. Releasing
// the lock and trying to reacquire it quickly can be flaky because
// it's possible the process will fork to spawn a process while this is
// happening. The fds are set up as close-on-exec by the Go runtime,
// but there is a window between the fork and the exec where another
// process holds the lock.
if err := os.RemoveAll(w.dir); err != nil {
return nil, err
}
if err := os.Rename(tmpdirpath, w.dir); err != nil {
return nil, err
}
w.fp = newFilePipeline(w.dir, segmentSizeBytes)
df, err := fileutil.OpenDir(w.dir)
w.dirFile = df
return w, err
}

wal/wal_windows.go

@ -0,0 +1,41 @@
// Copyright 2016 The etcd Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package wal
import (
"os"
"github.com/coreos/etcd/wal/walpb"
)
func (w *WAL) renameWal(tmpdirpath string) (*WAL, error) {
// rename of directory with locked files doesn't work on
// windows; close the WAL to release the locks so the directory
// can be renamed
w.Close()
if err := os.Rename(tmpdirpath, w.dir); err != nil {
return nil, err
}
// reopen and relock
newWAL, oerr := Open(w.dir, walpb.Snapshot{})
if oerr != nil {
return nil, oerr
}
if _, _, _, err := newWAL.ReadAll(); err != nil {
newWAL.Close()
return nil, err
}
return newWAL, nil
}


@ -20,9 +20,9 @@ import (
proto "github.com/golang/protobuf/proto"
math "math"
)
import io "io"
io "io"
)
// Reference imports to suppress errors if they are not otherwise used.
var _ = proto.Marshal