Compare commits

...

28 Commits

Author SHA1 Message Date
99dcc8c322 chore(server): bump back to 0.4.2 2014-06-02 15:25:03 -07:00
3d2523e7e0 Merge pull request #825 from unihorn/98
fix(multi_node_kill_all_and_recovery_test): ensure cluster is up
2014-06-02 15:12:05 -07:00
25e69d9659 fix(multi_node_kill_all_and_recovery_test): ensure cluster is up 2014-06-02 14:43:51 -07:00
707174b56a chore(server): bump to 0.4.2+git 2014-06-02 14:19:52 -07:00
ce92cc3dc5 feat(CHANGELOG): bump to v0.4.2 2014-06-02 14:17:38 -07:00
5bfbf3a48c Merge pull request #824 from unihorn/97
fix(remove_node_test): remove unnecessary cluster configuration
2014-06-02 14:12:08 -07:00
e04a188358 fix(remove_node_test): remove unnecessary cluster configuration
The cluster configuration operation was originally meant to make sure
the instance wouldn't be added back automatically between its removal and
the check for the number of existing peer-mode instances. But this
could cause some node to be removed before the removal command.

Use a longer sync interval instead to avoid this problem.
2014-06-02 13:30:19 -07:00
a51fda3e5e Merge pull request #822 from philips/add-notes-about-discovery
docs(cluster-discovery): add caution to use old discovery endpoint
2014-06-02 12:06:00 -07:00
ca44801650 docs(cluster-discovery): add caution to use old discovery endpoint 2014-06-02 11:34:56 -07:00
2387ef3f21 Merge pull request #819 from unihorn/97
fix(server): joinIndex is not set after recovery from full outage
2014-06-02 11:04:07 -07:00
d5bfca9465 Merge pull request #814 from unihorn/91
fix(server/v2): set correct content-type for etcdError response
2014-06-02 10:38:36 -07:00
7cb126967c fix(simple_snapshot_test): enlarge reasonable index range 2014-05-31 10:42:31 -07:00
444e017c05 fix(remove_node_test): ensure cluster config is activated 2014-05-31 10:32:03 -07:00
356675b70f fix(multi_node_kill_all_and_recovery_test): ensure cluster running 2014-05-31 10:15:03 -07:00
d7768635fd fix(server): set joinIndex when recovered 2014-05-31 10:03:39 -07:00
37796ed84c tests: add TestMultiNodeKillAllAndRecoveryAndRemoveLeader
This one breaks because it doesn't set joinIndex correctly.
2014-05-31 10:01:45 -07:00
f007cf321d Merge pull request #818 from unihorn/96
fix(standby_server): able to join the cluster containing itself
2014-05-30 18:36:58 -07:00
ca29691543 tests(standby_test): comments 2014-05-30 18:36:23 -07:00
4bebb538eb fix(standby_server): able to join the cluster containing itself
The standby server will switch to a peer server if it finds that
it is already contained in the cluster.
2014-05-30 14:03:49 -07:00
c27db1ec5e Merge pull request #816 from unihorn/95
docs(clustering): limit for peer-address changing
2014-05-30 13:45:12 -07:00
a5fc1d214d Merge pull request #817 from cholcombe973/master
Adding autodock into the libraries and tools section
2014-05-30 13:41:32 -07:00
1df0b941d7 Adding autodock into the libraries and tools section 2014-05-30 13:20:28 -07:00
3a71eb9d72 Merge pull request #808 from robszumski/update-optimal-size
fix(docs): add information about standbys
2014-05-30 12:26:07 -07:00
001cceb1cd fix(docs): update doc with standby info 2014-05-30 12:23:22 -07:00
98ff4af7f2 docs(clustering): limit for peer-address changing 2014-05-30 08:50:16 -07:00
db4c5e0eaa fix(server/v2): set correct content-type for etcdError response
"net/http".Error reset the content type, so we get rid of it and
write our own one.
2014-05-29 14:18:50 -07:00
b3c5ed60bd chore(pkg/btrfs): remove accidental swp file. 2014-05-22 09:50:40 -07:00
22c944d8ef chore(server): bump 0.4.0+git 2014-05-20 20:55:57 -07:00
17 changed files with 211 additions and 66 deletions

View File

@@ -1,3 +1,12 @@
v0.4.2
* Improvements to the clustering documents
* Set content-type properly on errors (#469)
* Standbys re-join if they should be part of the cluster (#810, #815, #818)
v0.4.1
* Re-introduce DELETE on the machines endpoint
* Document the machines endpoint
v0.4.0
* Introduced standby mode
* Added HEAD requests

View File

@@ -53,3 +53,9 @@ The Discovery API submits the `-peer-addr` of each etcd instance to the configur
The discovery API will automatically clean up the address of a stale peer that is no longer part of the cluster. The TTL for this process is a week, which should be long enough to handle any extremely long outage you may encounter. There is no harm in having stale peers in the list until they are cleaned up, since an etcd instance only needs to connect to one valid peer in the cluster to join.
[discovery-design]: https://github.com/coreos/etcd/blob/master/Documentation/design/cluster-finding.md
## Lifetime of a Discovery URL
A discovery URL identifies a single etcd cluster. Do not re-use discovery URLs for new clusters.
When a machine starts with a new discovery URL, the discovery URL is activated and records the machine's metadata. If you destroy the whole cluster and attempt to bring it back up with the same discovery URL, it will fail. This is intentional: all of the registered machines are gone, including their logs, so there is nothing left from which to recover the killed cluster.
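Since a discovery URL maps to exactly one cluster, every rebuilt cluster needs a fresh token. Below is a minimal Go sketch of fetching one; the public discovery service's `/new` endpoint and the `-discovery` flag are assumptions based on these docs, not part of this diff.

```go
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	// Ask the public discovery service for a brand-new token.
	// Reusing an old URL for a rebuilt cluster will fail by design.
	resp, err := http.Get("https://discovery.etcd.io/new")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	token, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}

	// Pass the fresh URL to every instance of the new cluster, e.g.
	//   etcd -name node1 -peer-addr 127.0.0.1:7001 -discovery <url>
	fmt.Printf("start each etcd instance with -discovery %s\n", token)
}
```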

View File

@@ -107,7 +107,7 @@ curl -L http://127.0.0.1:4001/v2/keys/foo -XPUT -d value=bar
If one machine disconnects from the cluster, it can rejoin the cluster automatically once communication is restored.
If one machine is killed, it can rejoin the cluster when started with its old name. If the peer address is changed, etcd will treat the new peer address as the refreshed one, which benefits instance migration or virtual machine boot with a different IP.
If one machine is killed, it can rejoin the cluster when started with its old name. If the peer address is changed, etcd will treat the new peer address as the refreshed one, which benefits instance migration or virtual machine boot with a different IP. The peer-address-changing functionality is only supported when the majority of the cluster is alive, because this behavior needs the consensus of the etcd cluster.
**Note:** For now, it is the user's responsibility to ensure that the machine doesn't join a cluster that already has a member with the same name; otherwise an unexpected error will occur. This will be improved in the future.
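One way to confirm that a restarted machine rejoined under its old name with a refreshed peer address is to list the members over the admin API. A minimal sketch, assuming a local peer listening on 7001 and the `/v2/admin/machines` endpoint used elsewhere in this changeset:

```go
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	// The admin machines endpoint lists each member with its name,
	// state and peer URL, so a refreshed peer address shows up here.
	resp, err := http.Get("http://127.0.0.1:7001/v2/admin/machines")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := ioutil.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```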
@@ -167,15 +167,3 @@ Etcd can also do internal server-to-server communication using SSL client certs.
To do this just change the `-*-file` flags to `-peer-*-file`.
If you are using SSL for server-to-server communication, you must use it on all instances of etcd.
### What size cluster should I use?
Every command the client sends to the master is broadcast to all of the followers.
The command is not committed until the majority of the cluster peers receive that command.
Because of this majority voting property, the ideal cluster should be kept small to keep it fast, and should be made up of an odd number of peers.
Odd numbers are good because if you have 8 peers the majority will be 5 and if you have 9 peers the majority will still be 5.
The result is that an 8 peer cluster can tolerate 3 peer failures and a 9 peer cluster can tolerate 4 peer failures.
And in the best case when all 9 peers are responding the cluster will perform at the speed of the fastest 5 machines.

View File

@@ -20,6 +20,7 @@
- [transitorykris/etcd-py](https://github.com/transitorykris/etcd-py)
- [jplana/python-etcd](https://github.com/jplana/python-etcd) - Supports v2
- [russellhaering/txetcd](https://github.com/russellhaering/txetcd) - a Twisted Python library
- [cholcombe973/autodock](https://github.com/cholcombe973/autodock) - A docker deployment automation tool
**Node libraries**

View File

@@ -1,30 +1,38 @@
# Optimal etcd Cluster Size
etcd's Raft consensus algorithm is most efficient in small clusters between 3 and 9 peers. Let's briefly explore how etcd works internally to understand why.
## Writing to etcd
Writes to an etcd peer are always redirected to the leader of the cluster and distributed to all of the peers immediately. A write is only considered successful when a majority of the peers acknowledge the write.
For example, in a 5 node cluster, a write operation is only as fast as the 3rd fastest machine. This is the main reason for keeping your etcd cluster below 9 nodes. In practice, you only need to worry about write performance in high latency environments such as a cluster spanning multiple data centers.
## Leader Election
The leader election process is similar to writing a key — a majority of the cluster must acknowledge the new leader before cluster operations can continue. The longer it takes each node to elect a new leader, the longer you have to wait before you can write to the cluster again. In low latency environments this process takes milliseconds.
## Odd Cluster Size
The other important cluster optimization is to always have an odd cluster size. Adding an odd node to the cluster doesn't change the size of the majority and therefore doesn't increase the total latency of the majority as described above. But you do gain a higher tolerance for peer failure by adding the extra machine. You can see this in practice when comparing two even and odd sized clusters:
| Cluster Size | Majority | Failure Tolerance |
|--------------|------------|-------------------|
| 8 machines | 5 machines | 3 machines |
| 9 machines | 5 machines | **4 machines** |
As you can see, adding another node to bring the cluster up to an odd size is always worth it. During a network partition, an odd cluster size also guarantees that there will almost always be a majority of the cluster that can continue to operate and be the source of truth when the partition ends.
etcd's Raft consensus algorithm is most efficient in small clusters between 3 and 9 peers. For clusters larger than 9, etcd will select a subset of instances to participate in the algorithm in order to keep it efficient. The end of this document briefly explores how etcd works internally and why these choices have been made.
## Cluster Management
Currently, each CoreOS machine is an etcd peer — if you have 30 CoreOS machines, you have 30 etcd peers and end up with a cluster size that is way too large. If desired, you may manually stop some of these etcd instances to increase cluster performance.
You can manage the active cluster size through the [cluster config API](https://github.com/coreos/etcd/blob/master/Documentation/api.md#cluster-config). `activeSize` represents the etcd peers allowed to actively participate in the consensus algorithm.
Functionality is being developed to expose two different types of followers: active and benched followers. Active followers will influence operations within the cluster. Benched followers will not participate, but will transparently proxy etcd traffic to an active follower. This allows every CoreOS machine to expose etcd on port 4001 for ease of use. Benched followers will have the ability to transition into an active follower if needed.
If the total number of etcd instances exceeds this number, additional peers are started as [standbys](https://github.com/coreos/etcd/blob/master/Documentation/design/standbys.md), which can be promoted to active participation if one of the existing active instances has failed or been removed.
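The functional tests later in this compare drive the same knob over HTTP. A minimal Go sketch of raising `activeSize` through the cluster config endpoint follows; the address and values mirror the tests and are purely illustrative.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Allow five peers to participate in consensus; instances beyond that
	// are started as standbys. syncInterval (seconds) controls how often
	// standbys resync the cluster configuration.
	cfg := bytes.NewBufferString(`{"activeSize":5, "syncInterval":5}`)

	req, err := http.NewRequest("PUT", "http://127.0.0.1:7001/v2/admin/config", cfg)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("cluster config updated:", resp.Status)
}
```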
## Internals of etcd
### Writing to etcd
Writes to an etcd peer are always redirected to the leader of the cluster and distributed to all of the peers immediately. A write is only considered successful when a majority of the peers acknowledge the write.
For example, in a cluster with 5 peers, a write operation is only as fast as the 3rd fastest machine. This is the main reason for keeping the number of active peers below 9. In practice, you only need to worry about write performance in high latency environments such as a cluster spanning multiple data centers.
### Leader Election
The leader election process is similar to writing a key — a majority of the active peers must acknowledge the new leader before cluster operations can continue. The longer it takes each peer to elect a new leader, the longer you have to wait before you can write to the cluster again. In low latency environments this process takes milliseconds.
### Odd Active Cluster Size
The other important cluster optimization is to always have an odd active cluster size (i.e. `activeSize`). Adding an odd node to the number of peers doesn't change the size of the majority and therefore doesn't increase the total latency of the majority as described above. But, you gain a higher tolerance for peer failure by adding the extra machine. You can see this in practice when comparing two even and odd sized clusters:
| Active Peers | Majority | Failure Tolerance |
|--------------|------------|-------------------|
| 1 peer | 1 peer | None |
| 3 peers | 2 peers | 1 peer |
| 4 peers | 3 peers | 1 peer |
| 5 peers | 3 peers | **2 peers** |
| 6 peers | 4 peers | 2 peers |
| 7 peers | 4 peers | **3 peers** |
| 8 peers | 5 peers | 3 peers |
| 9 peers | 5 peers | **4 peers** |
As you can see, adding another peer to bring the number of active peers up to an odd size is always worth it. During a network partition, an odd number of active peers also guarantees that there will almost always be a majority of the cluster that can continue to operate and be the source of truth when the partition ends.
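The table follows from quorum arithmetic: the majority of n active peers is ⌊n/2⌋+1, and failure tolerance is n minus that majority. A short sketch that reproduces the table, as a sanity check:

```go
package main

import "fmt"

func main() {
	fmt.Println("peers  majority  tolerance")
	for n := 1; n <= 9; n++ {
		majority := n/2 + 1       // smallest group that can commit a write
		tolerance := n - majority // peers that can fail while a quorum survives
		fmt.Printf("%5d  %8d  %9d\n", n, majority, tolerance)
	}
}
```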

View File

@@ -1,6 +1,6 @@
# etcd
README version 0.4.0
README version 0.4.2
A highly-available key value store for shared configuration and service discovery.
etcd is inspired by [Apache ZooKeeper][zookeeper] and [doozer][doozer], with a focus on being:

View File

@@ -143,5 +143,6 @@ func (e Error) Write(w http.ResponseWriter) {
status = http.StatusInternalServerError
}
}
http.Error(w, e.toJsonString(), status)
w.WriteHeader(status)
fmt.Fprintln(w, e.toJsonString())
}
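For context on why `http.Error` is dropped here: it unconditionally sets `Content-Type: text/plain; charset=utf-8` before writing the status, which clobbers the `application/json` header set earlier in the handler path (the new test further down checks exactly this). A minimal sketch of the pattern, with a hypothetical handler and error body:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

func errorJSON(w http.ResponseWriter, body string, status int) {
	// Set the JSON content type first, then write the status ourselves.
	// http.Error(w, body, status) would reset Content-Type to text/plain.
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	fmt.Fprintln(w, body)
}

func main() {
	http.HandleFunc("/v2/keys/missing", func(w http.ResponseWriter, r *http.Request) {
		errorJSON(w, `{"errorCode":100,"message":"Key not found"}`, http.StatusNotFound)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```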

View File

@@ -232,6 +232,7 @@ func (e *Etcd) Run() {
DataDir: e.Config.DataDir,
}
e.StandbyServer = server.NewStandbyServer(ssConfig, client)
e.StandbyServer.SetRaftServer(raftServer)
// Generating config could be slow.
// Put it here to make listen happen immediately after peer-server starting.
@@ -347,6 +348,7 @@ func (e *Etcd) runServer() {
raftServer.SetElectionTimeout(electionTimeout)
raftServer.SetHeartbeatInterval(heartbeatInterval)
e.PeerServer.SetRaftServer(raftServer, e.Config.Snapshot)
e.StandbyServer.SetRaftServer(raftServer)
e.PeerServer.SetJoinIndex(e.StandbyServer.JoinIndex())
e.setMode(PeerMode)

Binary file not shown.

View File

@@ -214,6 +214,7 @@ func (s *PeerServer) FindCluster(discoverURL string, peers []string) (toStart bo
// TODO(yichengq): Think about the action that should be done
// if it cannot connect any of the previous known node.
log.Debugf("%s is restarting the cluster %v", name, possiblePeers)
s.SetJoinIndex(s.raftServer.CommitIndex())
toStart = true
return
}

View File

@@ -1,3 +1,3 @@
package server
const ReleaseVersion = "0.4.1"
const ReleaseVersion = "0.4.2"

View File

@@ -36,8 +36,9 @@ type standbyInfo struct {
}
type StandbyServer struct {
Config StandbyServerConfig
client *Client
Config StandbyServerConfig
client *Client
raftServer raft.Server
standbyInfo
joinIndex uint64
@@ -62,6 +63,10 @@ func NewStandbyServer(config StandbyServerConfig, client *Client) *StandbyServer
return s
}
func (s *StandbyServer) SetRaftServer(raftServer raft.Server) {
s.raftServer = raftServer
}
func (s *StandbyServer) Start() {
s.Lock()
defer s.Unlock()
@@ -235,6 +240,13 @@ func (s *StandbyServer) syncCluster(peerURLs []string) error {
}
func (s *StandbyServer) join(peer string) error {
for _, url := range s.ClusterURLs() {
if s.Config.PeerURL == url {
s.joinIndex = s.raftServer.CommitIndex()
return nil
}
}
// Our version must match the leaders version
version, err := s.client.GetVersion(peer)
if err != nil {

View File

@@ -24,12 +24,15 @@ func TestV2GetKey(t *testing.T) {
v.Set("value", "XXX")
fullURL := fmt.Sprintf("%s%s", s.URL(), "/v2/keys/foo/bar")
resp, _ := tests.Get(fullURL)
assert.Equal(t, resp.Header.Get("Content-Type"), "application/json")
assert.Equal(t, resp.StatusCode, http.StatusNotFound)
resp, _ = tests.PutForm(fullURL, v)
assert.Equal(t, resp.Header.Get("Content-Type"), "application/json")
tests.ReadBody(resp)
resp, _ = tests.Get(fullURL)
assert.Equal(t, resp.Header.Get("Content-Type"), "application/json")
assert.Equal(t, resp.StatusCode, http.StatusOK)
body := tests.ReadBodyJSON(resp)
assert.Equal(t, body["action"], "get", "")

View File

@@ -4,6 +4,7 @@ import (
"bytes"
"os"
"strconv"
"strings"
"testing"
"time"
@@ -100,6 +101,8 @@ func TestTLSMultiNodeKillAllAndRecovery(t *testing.T) {
t.Fatal("cannot create cluster")
}
time.Sleep(time.Second)
c := etcd.NewClient(nil)
go Monitor(clusterSize, clusterSize, leaderChan, all, stop)
@@ -239,3 +242,74 @@ func TestMultiNodeKillAllAndRecoveryWithStandbys(t *testing.T) {
assert.NoError(t, err)
assert.Equal(t, len(result.Node.Nodes), 7)
}
// Create a five-node cluster
// Kill all the nodes and restart, then remove the leader
func TestMultiNodeKillAllAndRecoveryAndRemoveLeader(t *testing.T) {
procAttr := new(os.ProcAttr)
procAttr.Files = []*os.File{nil, os.Stdout, os.Stderr}
stop := make(chan bool)
leaderChan := make(chan string, 1)
all := make(chan bool, 1)
clusterSize := 5
argGroup, etcds, err := CreateCluster(clusterSize, procAttr, false)
defer DestroyCluster(etcds)
if err != nil {
t.Fatal("cannot create cluster")
}
c := etcd.NewClient(nil)
go Monitor(clusterSize, clusterSize, leaderChan, all, stop)
<-all
<-leaderChan
stop <- true
// It needs some time to sync current commits and write them to disk.
// Otherwise some instance may be restarted as a new peer, and we don't yet
// support connecting back to an old cluster that doesn't have a majority
// alive without its log.
time.Sleep(time.Second)
c.SyncCluster()
// kill all
DestroyCluster(etcds)
time.Sleep(time.Second)
stop = make(chan bool)
leaderChan = make(chan string, 1)
all = make(chan bool, 1)
time.Sleep(time.Second)
for i := 0; i < clusterSize; i++ {
etcds[i], err = os.StartProcess(EtcdBinPath, argGroup[i], procAttr)
}
go Monitor(clusterSize, 1, leaderChan, all, stop)
<-all
leader := <-leaderChan
_, err = c.Set("foo", "bar", 0)
if err != nil {
t.Fatalf("Recovery error: %s", err)
}
port, _ := strconv.Atoi(strings.Split(leader, ":")[2])
num := port - 7000
resp, _ := tests.Delete(leader+"/v2/admin/machines/node"+strconv.Itoa(num), "application/json", nil)
if !assert.Equal(t, resp.StatusCode, 200) {
t.FailNow()
}
// check the old leader is in standby mode now
time.Sleep(time.Second)
resp, _ = tests.Get(leader + "/name")
assert.Equal(t, resp.StatusCode, 404)
}

View File

@@ -31,7 +31,7 @@ func TestRemoveNode(t *testing.T) {
c.SyncCluster()
resp, _ := tests.Put("http://localhost:7001/v2/admin/config", "application/json", bytes.NewBufferString(`{"activeSize":4, "syncInterval":1}`))
resp, _ := tests.Put("http://localhost:7001/v2/admin/config", "application/json", bytes.NewBufferString(`{"activeSize":4, "syncInterval":5}`))
if !assert.Equal(t, resp.StatusCode, 200) {
t.FailNow()
}
@@ -41,11 +41,6 @@ func TestRemoveNode(t *testing.T) {
client := &http.Client{}
for i := 0; i < 2; i++ {
for i := 0; i < 2; i++ {
r, _ := tests.Put("http://localhost:7001/v2/admin/config", "application/json", bytes.NewBufferString(`{"activeSize":3}`))
if !assert.Equal(t, r.StatusCode, 200) {
t.FailNow()
}
client.Do(rmReq)
fmt.Println("send remove to node3 and wait for its exiting")
@@ -76,12 +71,7 @@ func TestRemoveNode(t *testing.T) {
panic(err)
}
r, _ = tests.Put("http://localhost:7001/v2/admin/config", "application/json", bytes.NewBufferString(`{"activeSize":4}`))
if !assert.Equal(t, r.StatusCode, 200) {
t.FailNow()
}
time.Sleep(time.Second + time.Second)
time.Sleep(time.Second + 5*time.Second)
resp, err = c.Get("_etcd/machines", false, false)
@@ -96,11 +86,6 @@ func TestRemoveNode(t *testing.T) {
// first kill the node, then remove it, then add it back
for i := 0; i < 2; i++ {
r, _ := tests.Put("http://localhost:7001/v2/admin/config", "application/json", bytes.NewBufferString(`{"activeSize":3}`))
if !assert.Equal(t, r.StatusCode, 200) {
t.FailNow()
}
etcds[2].Kill()
fmt.Println("kill node3 and wait for its exiting")
etcds[2].Wait()
@@ -131,11 +116,6 @@ func TestRemoveNode(t *testing.T) {
panic(err)
}
r, _ = tests.Put("http://localhost:7001/v2/admin/config", "application/json", bytes.NewBufferString(`{"activeSize":4}`))
if !assert.Equal(t, r.StatusCode, 200) {
t.FailNow()
}
time.Sleep(time.Second + time.Second)
resp, err = c.Get("_etcd/machines", false, false)
@@ -169,7 +149,8 @@ func TestRemovePausedNode(t *testing.T) {
if !assert.Equal(t, r.StatusCode, 200) {
t.FailNow()
}
time.Sleep(2 * time.Second)
// Wait for standby instances to update its cluster config
time.Sleep(6 * time.Second)
resp, err := c.Get("_etcd/machines", false, false)
if err != nil {

View File

@@ -89,7 +89,7 @@ func TestSnapshot(t *testing.T) {
index, _ = strconv.Atoi(snapshots[0].Name()[2:6])
if index < 1010 || index > 1025 {
if index < 1010 || index > 1029 {
t.Fatal("wrong name of snapshot :", snapshots[0].Name())
}
}

View File

@@ -8,6 +8,7 @@ import (
"time"
"github.com/coreos/etcd/server"
"github.com/coreos/etcd/store"
"github.com/coreos/etcd/tests"
"github.com/coreos/etcd/third_party/github.com/coreos/go-etcd/etcd"
"github.com/coreos/etcd/third_party/github.com/stretchr/testify/assert"
@@ -279,3 +280,61 @@ func TestStandbyDramaticChange(t *testing.T) {
}
}
}
func TestStandbyJoinMiss(t *testing.T) {
clusterSize := 2
_, etcds, err := CreateCluster(clusterSize, &os.ProcAttr{Files: []*os.File{nil, os.Stdout, os.Stderr}}, false)
if err != nil {
t.Fatal("cannot create cluster")
}
defer DestroyCluster(etcds)
c := etcd.NewClient(nil)
c.SyncCluster()
time.Sleep(1 * time.Second)
// Verify that we have two machines.
result, err := c.Get("_etcd/machines", false, true)
assert.NoError(t, err)
assert.Equal(t, len(result.Node.Nodes), clusterSize)
resp, _ := tests.Put("http://localhost:7001/v2/admin/config", "application/json", bytes.NewBufferString(`{"removeDelay":4, "syncInterval":4}`))
if !assert.Equal(t, resp.StatusCode, 200) {
t.FailNow()
}
time.Sleep(time.Second)
resp, _ = tests.Delete("http://localhost:7001/v2/admin/machines/node2", "application/json", nil)
if !assert.Equal(t, resp.StatusCode, 200) {
t.FailNow()
}
// Wait for a monitor cycle before checking for removal.
time.Sleep(server.ActiveMonitorTimeout + (1 * time.Second))
// Verify that we now have one peer.
result, err = c.Get("_etcd/machines", false, true)
assert.NoError(t, err)
assert.Equal(t, len(result.Node.Nodes), 1)
// Simulate the join failure
_, err = server.NewClient(nil).AddMachine("http://localhost:7001",
&server.JoinCommand{
MinVersion: store.MinVersion(),
MaxVersion: store.MaxVersion(),
Name: "node2",
RaftURL: "http://127.0.0.1:7002",
EtcdURL: "http://127.0.0.1:4002",
})
assert.NoError(t, err)
time.Sleep(6 * time.Second)
go tests.Delete("http://localhost:7001/v2/admin/machines/node2", "application/json", nil)
time.Sleep(time.Second)
result, err = c.Get("_etcd/machines", false, true)
assert.NoError(t, err)
assert.Equal(t, len(result.Node.Nodes), 1)
}