It is not sent out because it is useless to let the remote raft step the
message.
Moreover, the MsgApp stream reader can always assume that the length
of the entries sent is > 0.
doozer in particular is rather confusing to mention since the project
hasn't been worked on in years. While we are at it, it might simplify
people's understanding if we remove zookeeper too.
When using flags very similar to our examples, the cluster doesn't bootstrap due to mismatched protocols (`http` vs `https`) between `-initial-advertise-peer-urls` and the `-initial-cluster` list:
```
./etcd -name infra0 -initial-advertise-peer-urls https://127.0.0.1:2380 \
> -listen-peer-urls https://127.0.0.1:2380 \
> -initial-cluster-token etcd-cluster-1 \
> -initial-cluster infra0=http://127.0.0.1:2380,infra1=http://127.0.0.1:2381,infra2=http://127.0.0.1:2382 \
> -initial-cluster-state new
2015/01/29 10:32:16 no data-dir provided, using default data-dir ./infra0.etcd
2015/01/29 10:32:16 etcd: listening for peers on https://127.0.0.1:2380
2015/01/29 10:32:16 etcd: listening for client requests on http://localhost:2379
2015/01/29 10:32:16 etcd: listening for client requests on http://localhost:4001
2015/01/29 10:32:16 etcd: stopping listening for client requests on http://localhost:4001
2015/01/29 10:32:16 etcd: stopping listening for client requests on http://localhost:2379
2015/01/29 10:32:16 etcd: stopping listening for peers on https://127.0.0.1:2380
2015/01/29 10:32:16 etcd: infra0 has different advertised URLs in the cluster and advertised peer URLs list
```
This addresses a problem that comes up in the cockroach tests,
in which the order of messages may lead to deadlocks (due to
the fact that we don't have regular heartbeat timers in most
of our tests).
Example output of the `$ etcdctl exec-watch` command was wrong, showing non-existent ETCD_VALUE, ETCD_KEY, etc. keys. Submitting a PR which updates those examples with the correct keys (ETCD_WATCH_VALUE, ETCD_WATCH_KEY, etc.).
The ./test script will no longer run the integration tests. To run the
integration test, set the INTEGRATION env var to a nonzero value. For
example, `INTEGRATION=y ./test`.
So that it can easily be used in a container.
Symptoms:
```
$ sudo bin/rkt run ../etcd/etcd-${VERSION}-linux-amd64.aci
Error: Unable to open "/lib64/ld-linux-x86-64.so.2": No such file or directory
```
Now that heartbeats are distinct from MsgApp{,Resp}, the retries
currently performed in stepLeader's MsgAppResp section are only
performed on an actual MsgAppResp (or a new MsgProp). This means
that it may take a long time to recover from a dropped MsgAppResp
in a quiet cluster.
This commit adds a dedicated heartbeat response message. This message
does not convey the follower's current log position because the
MsgHeartbeat does not include the leader's term and index. Upon receipt
of a heartbeat response, the leader may retry the latest MsgApp if it
believes the follower to be behind.
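A toy sketch of that retry (illustrative Go only; the real raft types and progress tracking differ):
```go
package main

import "fmt"

// Toy model only: the real raft progress tracking differs.
type progress struct{ match uint64 }

type leader struct {
	lastIndex uint64
	prs       map[uint64]*progress
}

// onHeartbeatResp retries the latest MsgApp if the follower looks behind.
func (l *leader) onHeartbeatResp(from uint64) {
	if pr := l.prs[from]; pr.match < l.lastIndex {
		fmt.Printf("retry MsgApp to %d (match=%d < last=%d)\n", from, pr.match, l.lastIndex)
	}
}

func main() {
	l := &leader{lastIndex: 10, prs: map[uint64]*progress{2: {match: 7}}}
	l.onHeartbeatResp(2) // the dropped MsgAppResp no longer stalls recovery
}
```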
In code outside the raft package, we cannot call raft.bcastHeartbeat
directly. Instead, to control heartbeats we set heartbeatInterval to 1
and call Tick().
According to http://godoc.org/fmt#Scan, if the number of items scanned is
less than the number of arguments, err will report why. So we don't need to
handle this error case.
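For example (standard library behavior):
```go
package main

import "fmt"

func main() {
	var a, b int
	n, err := fmt.Sscan("42", &a, &b) // two arguments, only one number in the input
	fmt.Println(n, err)               // n == 1, err is non-nil and reports why
}
```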
etcd resolves DNS hostnames to IP addresses for client and peer URLs
before creating any listening sockets.
The following messages are logged during startup:
```
etcd: Resolving infra0.coreos.com:2380 to 10.0.1.10:2380
```
Fixes #1991
It is reasonable for the leader to wait for a reply before sending out the next
msgApp or msgSnap to a follower on the bad path. Otherwise the leader will send out
useless messages if the previous message was rejected or was a snapshot.
Especially in the snapshot case, the leader is certain to send out a duplicate message
including the snapshot, which is a huge waste.
This commit implements a timeout-based wait mechanism. The timeout for a normal msgApp is the
heartbeatTimeout and the timeout for a snapshot is the electionTimeout (a snapshot is larger). We
can implement a piggyback mechanism (the application notifies raft about lost messages) in the future
if necessary.
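A toy model of the wait (names and shapes are illustrative, not the actual implementation):
```go
package main

import (
	"fmt"
	"time"
)

// Toy model only; names and shapes are illustrative.
type followerState struct {
	waitUntil time.Time
}

func (f *followerState) canSend(now time.Time) bool { return !now.Before(f.waitUntil) }

// sent records the wait deadline after sending a message.
func (f *followerState) sent(now time.Time, snapshot bool, heartbeatTimeout, electionTimeout time.Duration) {
	if snapshot {
		f.waitUntil = now.Add(electionTimeout) // snapshots are larger; wait longer
	} else {
		f.waitUntil = now.Add(heartbeatTimeout)
	}
}

func main() {
	f := &followerState{}
	now := time.Now()
	f.sent(now, true, 100*time.Millisecond, time.Second)
	fmt.Println(f.canSend(now.Add(500 * time.Millisecond))) // false: still waiting for a reply
	fmt.Println(f.canSend(now.Add(2 * time.Second)))        // true: timed out, safe to resend
}
```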
As we move to the container-based testing environment on Travis, the
TCP write buffer is larger than 1MB. Change the test to match the
change in the testing environment.
Previously, etcd retried three times on a timeout error. After this
commit, etcd retries forever. This commit also makes the etcd client
aware of gateway timeouts.
If a large number of MsgProp messages go to a follower, it can batch them and
forward them to the leader. This avoids dropping MsgProp on the good path
due to exceeding the maximal serving capacity in the sender.
Moreover, batching MsgProp can increase the throughput of proposals that
come from followers.
This panic can never be reached when using raft.Node, because we only
read from propc when there is a leader. However, it is possible to see
this error when using the raft object directly (as in MultiNode),
and in this case it is better to simply drop the proposal (as if we had
sent it to a leader that immediately vanished).
Add an error return to MemoryStorage.Append for consistency.
If we cannot find `m.from` among the current peers in raft and it is a response
message, we should filter it out, or raft panics. We are not aiming to defend
against malicious peers.
This has to be done synchronously in the raft node layer. We could check
it asynchronously at the application layer, but between the check and
the message entering raft, the raft state machine might make progress and
unfortunately remove the `m.from` peer.
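A toy sketch of the filter (the message shape here is assumed for illustration):
```go
package main

import "fmt"

// Toy sketch; the message shape is assumed for illustration.
type message struct {
	from       uint64
	isResponse bool
}

// shouldStep filters out response messages from unknown peers.
func shouldStep(peers map[uint64]bool, m message) bool {
	if m.isResponse && !peers[m.from] {
		return false // stepping it would panic raft
	}
	return true
}

func main() {
	peers := map[uint64]bool{1: true, 2: true}
	fmt.Println(shouldStep(peers, message{from: 3, isResponse: true}))  // false: filtered
	fmt.Println(shouldStep(peers, message{from: 3, isResponse: false})) // true: not a response
}
```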
It is necessary to make this check because of the following case:
1. memory storage contains ents from index 0 to 50, and unstable has
ents from index 50 to 60.
2. raft receives an incoming snapshot with index 100.
3. raft restores its unstable to 100, but has not applied snapshot on memory storage.
4. raft receives an outdated MsgApp from index 60.
5. raft finds the term of index 60 to check the match.
6. raft asks memory storage about the term of index 60 after it failed to get
it from unstable.
7. memory storage panics because it knows nothing about index 60.
Lock implementation for etcd. It has three goroutines:
a) acquire - loop that watches for the lock to be free and tries to acquire it.
b) watch - to watch for lock changes
c) refresh - to refresh the ttl when the lock is acquired
All the changes in lock ownership are notified on the events channel. Any feedback welcome!
Memory storage should append all entries that have a greater index
than snap.Metadata.Index. We first truncate the old parts of the
incoming entries, then truncate the existing entries in the storage.
At last, we append the incoming entries to the existing entries.
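A toy sketch of the rule, with entries represented by their indexes only (assumes a non-empty existing log; the real MemoryStorage differs):
```go
package main

import "fmt"

// appendEntries keeps incoming entries newer than the snapshot, cuts the
// conflicting tail of the existing log, then splices them together.
func appendEntries(existing, incoming []int, snapIndex int) []int {
	// 1. drop the old part of the incoming entries (covered by the snapshot)
	for len(incoming) > 0 && incoming[0] <= snapIndex {
		incoming = incoming[1:]
	}
	if len(incoming) == 0 {
		return existing
	}
	// 2. truncate existing entries from the first incoming index onward
	if offset := incoming[0] - existing[0]; offset < len(existing) {
		existing = existing[:offset]
	}
	// 3. append the incoming entries
	return append(existing, incoming...)
}

func main() {
	fmt.Println(appendEntries([]int{1, 2, 3, 4, 5}, []int{4, 5, 6, 7}, 3))
	// [1 2 3 4 5 6 7]
}
```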
stableTo should only mark the index stable if the term matches. After raft sends out unstable
entries to the application, raft makes progress without waiting for a reply. When the application
calls stableTo to notify that the entries up to "index" are stable, raft might have truncated
some entries before "index" due to a lost leader. raft must verify the (index, term) of stableTo
before marking the entries as stable.
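A toy sketch of the check (the real unstable structure differs):
```go
package main

import "fmt"

// Toy sketch; the real unstable structure differs.
type entry struct{ index, term uint64 }

type unstable struct{ entries []entry }

// stableTo drops entries up to index only when the term still matches.
func (u *unstable) stableTo(index, term uint64) {
	for i, e := range u.entries {
		if e.index == index {
			if e.term != term {
				return // log was truncated and rewritten since; stale notice
			}
			u.entries = u.entries[i+1:]
			return
		}
	}
}

func main() {
	u := &unstable{entries: []entry{{index: 5, term: 2}, {index: 6, term: 3}}}
	u.stableTo(5, 1)            // term mismatch: no-op
	fmt.Println(len(u.entries)) // 2
	u.stableTo(5, 2)            // match: entries up to 5 are stable
	fmt.Println(len(u.entries)) // 1
}
```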
etcdmain: Fix misuse of the "-addr" flag
In the code, the "-advertise-client-urls" or "-addr" flags are used to get the list of this member's peer URLs.
It should be using the "-peer-addr" flag instead of the "-addr" flag.
The streaming buffer is used to:
1. hand over data to the io goroutine in a non-blocking way
2. absorb pressure from temporary network delay
3. allow waiting on I/O instead of on incoming data under high throughput
The old 1024 value is too small: it is very likely to fill up and
break streaming under a temporary network delay.
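Point 1 in miniature (illustrative only):
```go
package main

import "fmt"

func main() {
	// A buffered channel hands data to the io goroutine without blocking;
	// a larger buffer absorbs temporary network delay.
	msgc := make(chan string, 4096)
	select {
	case msgc <- "entry":
		fmt.Println("queued")
	default:
		fmt.Println("buffer full: streaming would break here")
	}
}
```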
It should ignore the compact operation instead of panicking, because it is
reasonable for the log to be restored from a snapshot before compact executes.
When loading from a backup with a snapshot and WAL, the length of WAL entries
can be lower than the current index integer value, causing a panic when
slicing off uncommitted entries. This looks for WAL entries higher than
the current index before slicing.
* coreos/master:
rafthttp: fix import
raft: should not decrease match and next when handling out of order msgAppResp
Fix migration to allow snapshots to have the right IDs
add snapshotted integration test
fix test import loop
fix import loop, add set to types, and fix comments
etcdserver: autodetect v0.4 WALs and upgrade them to v0.5 automatically
wal: add a bench for write entry
rafthttp: add streaming server and client
dep: use vendored imports in codegangsta/cli
dep: bump golang.org/x/net/context
Conflicts:
etcdserver/server.go
etcdserver/server_test.go
migrate/snapshot.go
* coreos/master:
scripts: build-docker tag and use ENTRYPOINT
scripts: build-release add etcd-migrate
create .godir
raft: optimistically increase the next if the follower is already matched
raft: add handleHeartbeat. handleHeartbeat commits to the commit index in the message; it never decreases the commit index of the raft state machine.
rafthttp: send takes raft message instead of bytes
*: add rafthttp pkg into test list
raft: include commitIndex in heartbeat
rafthttp: move server stats in raftHandler to etcdserver
*: etcdhttp.raftHandler -> rafthttp.RaftHandler
etcdserver: rename sender.go -> sendhub.go
*: etcdserver.sender -> rafthttp.Sender
Conflicts:
raft/log.go
raft/raft_paper_test.go
Compaction is now treated as an implementation detail of Storage
implementations; Node.Compact() and related functionality have been
removed. Ready.Snapshot is now used only for incoming snapshots.
A return value has been added to ApplyConfChange to allow applications
to track the node information that must be stored in the snapshot.
raftpb.Snapshot has been split into Snapshot and SnapshotMetadata, to
allow the full snapshot data to be read from disk only when needed.
raft.Storage has new methods Snapshot, ApplySnapshot, HardState, and
SetHardState. The Snapshot and HardState parameters have been removed
from RestartNode() and will now be loaded from Storage instead.
The only remaining difference between StartNode and RestartNode is that
the former bootstraps an initial list of Peers.
This is useful since we want to pipeline the appendEntry requests. Without
optimistic increasing enabled, the second pipelined appendEntry request
would include the entries the first one has already sent out. We decrease
next directly to match if the leader receives a rejection for a matched
follower. This happens if one pipelined request gets lost and the following
ones arrive at the follower.
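A toy model of the progress handling (names assumed; the real implementation differs):
```go
package main

import "fmt"

// Toy model; names assumed.
type progress struct{ match, next uint64 }

// optimisticUpdate bumps next right after sending, enabling pipelining.
func (pr *progress) optimisticUpdate(lastSent uint64) { pr.next = lastSent + 1 }

// maybeDecrTo falls back to match+1 on a rejection for a matched follower.
func (pr *progress) maybeDecrTo(rejected uint64) {
	if rejected < pr.match {
		return // stale rejection from a lost pipelined request; ignore
	}
	pr.next = pr.match + 1
}

func main() {
	pr := &progress{match: 10, next: 11}
	pr.optimisticUpdate(20) // first pipelined MsgApp carried entries 11..20
	pr.optimisticUpdate(25) // second carries 21..25 instead of resending 11..20
	pr.maybeDecrTo(25)      // a request was lost: resend from right after match
	fmt.Println(pr.next)    // 11
}
```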
* coreos/master: (21 commits)
etcdserver: refactor ValidateClusterAndAssignIDs
integration: add integration test for remove member
integration: add test for member restart
version: bump to alpha.3
etcdserver: add buffer to the sender queue
*: gracefully stop etcdserver
Fix up migration tool, add snapshot migration
etcd4: migration from v0.4 -> v0.5
etcdserver: export Member.StoreKey
etcdserver: recover cluster when receiving newer snapshot
etcdserver: check and select committed entries to apply
etcdserver: recover from snapshot before applying requests
raft: not set applied when restored from snapshot
sender: support elegant stop
etcdserver: add StopNotify
etcdserver: fix TestDoProposalStopped test
etcdserver: minor cleanup
etcdserver: validate new node is not registered before in best effort
etcdserver: fix server.Stop()
*: print out configuration when necessary
...
Conflicts:
etcdserver/server.go
etcdserver/server_test.go
raft/log.go
The first entry in the log is a dummy which is used for matchTerm
but may not have an actual payload. This change permits Storage
implementations to treat this term value specially instead of
storing it as a dummy Entry.
Storage.FirstIndex() no longer includes the term-only entry.
This reverses a recent decision to create entry zero as initially
unstable; Storage implementations are now required to make
Term(0) == 0 and the first unstable entry is now index 1.
stableTo(0) is no longer allowed.
Fixes all updates since bcwaldon sketched the original, with cleanup, bringing it
into an actual working state. The commit log follows:
fix pb reference and remove unused file post rebase
unbreak the migrate folder
correctly detect node IDs
fix snapshotting
Fix previous broken snapshot
Add raft log entries to the translation; fix test for all timezones. (Still in progress, but passing)
Fix etcd:join and etcd:remove
print more data when dumping the log
Cleanup based on yichengq's comments
more comments
Fix the commited index based on the snapshot, if one exists
detect nodeIDs from snapshot
add initial tool documentation and match the semantics in the build script and main
formalize migration doc
rename function and clarify docs
fix nil pointer
fix the record conversion test
add migration to test suite and fix govet
We start etcd server in this test without the cluster. Sometimes it panics when
accessing the cluster. Most of the time it does not panic, since we can stop the
server fast enough before applying the first configuration change entry.
Stop should be idempotent. It should simply send a stop signal to the server.
It is the server's responsibility to stop the go-routines and related components.
* coreos/master:
etcdserver: add sender tests
raft: Only call stableTo when we have ready entries or a snapshot.
etcdserver: add ID() function to the Server interface.
sender: use RoundTripper instead of Client in sender
The first Ready after RestartNode (with no snapshot) will have no
unstable entries, so we don't have the correct prevLastUnstablei
when Advance is called. This would cause raftLog.unstable to move
backwards and previously-stable entries would be returned to
the application again.
This should have been caught by the "unexpected Ready" portion of
TestNodeRestart, but it went unnoticed because the Node's goroutine
takes some time to read from advancec and prepare the write to
readyc. Added a small (1ms) delay to all such tests to ensure that the
goroutine has time to enter its select wait.
* coreos/master: (27 commits)
pkg/wait: move wait to pkg/wait
etcdserver: do not add/remove/update local member to/from sender hub
etcdserver: not record attributes when add member
raft: add a test for proposeConfChange
raft: block Stop() on n.done, support idempotency
raft: add a test for node proposal
integration: add increase cluster size test
integration: remove unnecessary t.Testing argument
raft: stop the node synchronously
integration: fix test to propagate NewServer errors
etcdserver: move peer URLs check to config
etcdserver: ensure initial-advertise-peer-urls match initial-cluster
raft: add a test for node.Tick
raft: add comment string for TestNodeStart
etcdserver: use member instead of node at etcd level
raft: nodes return sorted ids
raft: update unstable when calling stableTo with 0
*: support updating advertise-peer-url. Users might want to update the peer URL of an etcd member in several cases. For example, if the IP address of the physical machine etcd is running on changes, the user needs to update the advertise-peer-url accordingly. This commit makes etcd support updating the advertise-peer-url of its members.
transport: create a tls listener only if the tlsInfo is not empty and the scheme is HTTPS
etcdserver: use member pointer for all tests
...
Conflicts:
etcdserver/server.go
raft/log.go
raft/log_test.go
raft/node.go
This entry is now persisted through the normal flow instead of appearing
in the stored log at creation time. This is how things worked before
the Storage interface was introduced. (see coreos/etcd#1689)
Callers must in general have a reference to their Storage objects to
transfer entries from Ready to Storage, so it doesn't make sense to
create a hidden Storage for them.
By explicitly creating Storage objects in tests we can remove a
few casts of raftLog's storage field.
This adds a check to setupCluster to ensure that the list of URLs
specified in `initial-advertise-peer-urls` matches those configured in
`initial-cluster` for this node. Also updates the documentation to
clarify this and address some changes in wording.
Users might want to update the peer URL of an etcd member in several cases.
For example, if the IP address of the physical machine etcd is running on
changes, the user needs to update the advertise-peer-url accordingly.
This commit makes etcd support updating the advertise-peer-url of its members.
This change splits the raftLog.entries array into an in-memory
"unstable" list and a pluggable interface for retrieving entries that
have been persisted to disk. An in-memory implementation of this
interface is provided which behaves the same as the old version;
in a future commit etcdserver could replace the MemoryStorage with
one backed by the WAL.
The wait ensures that cluster goes into the stable stage, which means that
leader has been elected and starts to heartbeat to followers.
This ensures that future client requests are always handled in time, and there is no
need to retry sending requests.
This function was fundamentally buggy, as a panic could be trivially
triggered by setting the wrong environment variable (e.g.
ETCD_BIND_ADDR=foo). Instead, let's propagate the error and present it
to the user in a cleaner way.
It is the correct thing to do to ensure that the communication is full
of out-of-date messages.
As a result, it is very easy for integration testing to throw MsgProp away,
making the client wait until the 5 min timeout. The sync interval and heartbeat
are increased to alleviate the traffic.
There's no real need to expose a Discoverer interface/struct when the
only use of the interface (and indeed the module) is to invoke a single
function. This isn't Java, after all. So instead, simplify to Discovery
exposing just two functions: JoinCluster (i.e. what was formerly called
"discovery"), and GetCluster (hitherto "ProxyDiscovery")
Node sets applied to committed right after it sends out Ready to the application. This is not
correct, since the application has not actually applied the entries at that point. We add an
Advance interface to Node; the application needs to call Advance to tell the raft Node its progress.
This change also avoids unnecessary copying when the application is still applying entries but
there are more entries to be applied.
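A sketch of the resulting driver loop (saveToStorage, send and apply are hypothetical application helpers, not part of raft; imports elided):
```go
// Sketch only: saveToStorage, send and apply are hypothetical application
// helpers, not part of the raft package.
func run(n raft.Node, ticker *time.Ticker) {
	for {
		select {
		case <-ticker.C:
			n.Tick()
		case rd := <-n.Ready():
			saveToStorage(rd.HardState, rd.Entries, rd.Snapshot) // persist first
			send(rd.Messages)                                    // then transmit
			apply(rd.CommittedEntries)                           // apply committed entries
			n.Advance() // report progress; raft may now hand out the next Ready
		}
	}
}
```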
1. etcd fatals if there is a critical error in the system and the operator
should do something about it
2. etcd panics if something unexpected happens that should be
reported to us to debug.
When waiting for a watch result, we expect the good path to complete
quickly here so we don't need to time out so aggressively. (Failure
noted in #1600)
In certain cases (for example, if a cluster peer is accessible but it
has no members listed), the httpClusterClient could have an empty set of
endpoints as a result of the Sync. This means that its Do function could
potentially return a nil response and nil error, with catastrophic
consequences for callers.
To be safe (particularly about this latter behaviour), this change
errors in both Sync and Do if no endpoints are available.
The docs for the other APIs use curl for example usage, which matches
the docs for the etcd APIs.
Other cleanup include fixing usage of peer ports and using 10.0.0.x IPs
throughout.
- Provide more concrete examples and explanation
- Cleanup the formatting to one sentence per line, this makes reviewing
easier
- Point to existing docs on wal and snap instead of trying to duplicate
it here again.
This creates a simple ID type (wrapped around uint64) to provide
standard serialization/deserialization to a string (i.e. base-16
encoded). This replaces strutil, so that package is now removed.
There's no real need for do and doWithTimeout to return Responses when
the only field of interest is the status code.
This also removes the superfluous httpMembersAPIResponse struct.
If any of this initialization fails, something very bad has happened,
and we should not continue as-is (this has previously manifested in
strange bugs)
etcd checks that the data dir is writable by writing and removing an
empty file to the data dir during startup and exits non-zero if that
fails.
fixes #876
This is a simple solution to having the proxy keep up to date with the
state of the cluster. Basically, it uses the cluster configuration
provided at startup (i.e. with `-initial-cluster-state`) to determine
where to reach peer(s) in the cluster, and then it will periodically hit
the `/members` endpoint of those peer(s) (using the same mechanism that
`-cluster-state=existing` does to initialise) to update the set of valid
client URLs to proxy to.
This does not address discovery (#1376), and it would probably be better
to update the set of proxyURLs dynamically whenever we fetch the new
state of the cluster; but it needs a bit more thinking to have this done
in a clean way with the proxy interface.
Example in Procfile works again.
- Update all of the ports to use the new IANA port numbers
- Update the stats section to talk about the id fields
- Remove mention of the modules
- Remove -L from all of the curl commands since it is no longer needed
- Point people at the clustering.md guide for the cluster APIs
It returns an error message like this now:
'{"errorCode":100,"message":"Key not found","cause":"/1/pants","index":10}'
The commit trims the '/1' prefix from the cause field if it exists.
This is a hack to make it display well. It is correct because all error causes
that contain a Path put the Path at the head of the string.
As explained in #1366, the leader will fail to transmit the missed
logs if the leader receives a heartbeat response from a follower
that is not yet matched in the leader. In other words, there are
append responses that do not explicitly reject an append but
imply a gap.
This commit is based on @xiangli-cmu's idea. We should only acknowledge
up to the index of the logs in the append message. This way responses to
heartbeats will never interfere with log synchronization because
their log index is always 0.
Fixes #1366
Continue using hex everywhere. Including here.
TODO: cleanup the printing of the structs which currently have decimal
to/from:
`{Type:MsgAppResp To:9973738105406047488 From:17050684879817348455 T...`
1) Don't panic, since we know exactly where this is coming from and don't
need the user to see a full back trace
2) Add docs explaining this situation a bit further
3) Clean up the error to look like other similar errors
When adding a new member, the etcdserver generates the ID based on the current time
and the given peer URLs. We include the time to add uniqueness, since a node with the
same peer URLs should be able to be added and then removed several times.
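A toy sketch of such an ID scheme (purely illustrative; the real generation scheme may differ in detail):
```go
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
	"sort"
	"time"
)

// memberID hashes the sorted peer URLs together with the current time so
// that add/remove/add yields a fresh ID.
func memberID(peerURLs []string, now time.Time) uint64 {
	sort.Strings(peerURLs)
	h := sha1.New()
	for _, u := range peerURLs {
		h.Write([]byte(u))
	}
	binary.Write(h, binary.BigEndian, now.UnixNano()) // time adds uniqueness
	sum := h.Sum(nil)
	return binary.BigEndian.Uint64(sum[:8])
}

func main() {
	fmt.Printf("%x\n", memberID([]string{"http://10.0.1.10:2380"}, time.Now()))
}
```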
Removes the notion of name being anything more than advisory or
command-line grouping, and adds checks for bootstrapping the command
line. IDs are consistent if the URLs are consistent.
This adds the remaining two stats endpoints: `/v2/stats/self`, for
various statistics on the EtcdServer, and `/v2/stats/leader`, for
statistics on a leader's followers.
By and large most of the stats code is copied across from 0.4.x, updated
where necessary to integrate with the new decoupling of raft from
transport.
This does not satisfactorily resolve the question of name vs ID. In the
old world, names were unique in the cluster and transmitted over the
wire, so they could be used safely in all statistics. In the new world,
a given EtcdServer only knows its own name, and it is instead IDs that
are communicated among the cluster members. Hence in most places here we
simply substitute a string-encoded ID in place of name, and only where
possible do we retain the actual given name of the EtcdServer.
Previously, we used the nodeID as the default data dir. That worked fine before we did
dynamic nodeID generation (introducing time). After that change, the dynamic nodeID changes
every time we restart the etcd process, so if the user does not provide a data dir, the
default dir changes on every restart. This is not the desired behavior.
In this commit, we change the default data dir to the node name. If the user changes the node
name and does not provide a data dir, etcd still cannot recover its previous state, but this
is much better than using the nodeID, and it is really a documentation issue.
Conflicts:
main.go
SSLv3 is no longer considered secure, and is not supported by golang
clients. Set the minimum version of all TLSConfigs that etcd uses to
ensure that only TLS >=1.0 can be used.
Since these are arrays instead of maps, the "keys" here are actually
(useless) goto labels. What really matters is that the ordering is
the same between the constant declarations and the array.
Before this commit, compact always compacted the log at the current applied index of raft.
This prevented us from doing a non-blocking snapshot, since we had to make the snapshot
and compact atomically. To prepare for non-blocking snapshots, this commit makes
compact support index and nodes parameters. After completing a snapshot, the applier
should call compact with the snapshot index and the nodes at the snapshot index to do
a compaction at the snapshot index.
In preparation for adding the ability to join a machine to an existing
cluster, force the user to specify whether they expect this to be a new
cluster or an active one.
The error for not specifying the initial-cluster-state is:
```
etcd: initial cluster state unset and no wal found
```
send should not attach the current term to msgProp; it should simply proxy msgProp without
changing its term. msgProp has a special term, 0, which indicates that it is a local message.
It's slightly unclear why we expose this timeout as being configurable,
and the `-timeout` flag does not exist in 0.4.x, so for now, remove the
flag until we have evidence that it is needed.
This adds a StartIndex field to the Watcher interface, which represents
the Etcd-Index at which the Watcher is created.
Also refactors the HTTP tests to use a table for most handleWatch tests
The usage of removed:
1. tell a removed node about its removal explicitly using msgDenied
2. prevent a removed node from disrupting cluster progress by launching a leader election
It is set when applying a node removal, or on receiving msgDenied.
The -addr, -bind-addr, -peer-addr, and peer-bind-addr flags are
superseded by -advertise-client-urls, -listen-client-urls,
-advertise-peer-urls, and -listen-peer-urls, respectively.
If any of the former flags are provided, however, they will still
work. If an old flag and its new counterpart are both provided, etcd
will log an error and fail.
This introduces two new concepts: the cluster and the member.
Members are logical etcd instances that have a name, raft ID, and a list
of peer and client addresses.
A cluster is made up of a list of members.
This prevents a bug like the following:
1. a node sends join to a cluster and succeeds
2. it starts with empty peers and waits for sync, but it has not
received anything
3. the election timeout passes, and it promotes itself to leader
4. it commits some log entry
5. its log conflicts with the cluster's
The work being done in the tests is completely wasted, as we do not
need to test the underlying x509 library. Faking out the parser function
allows the tests to run much faster without having to carry massive
fixtures.
This adds an EtcdIndex field to store.Event and uses that as the header
instead of the node's modifiedIndex. To facilitate this in a non-racy
way, we set the EtcdIndex while holding the lock.
Configure is used to propose a config change. AddNode and RemoveNode are
used to apply a cluster change to the raft state machine. They are the
basics for dynamic configuration.
golang's concept of "zero" time does not align with the zero value of
a unix timestamp. Initializing time.Time with a unix timestamp of 0
makes time.Time.IsZero fail. Solve this by initializing time.Time only
if we care about the time.
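Concretely:
```go
package main

import (
	"fmt"
	"time"
)

func main() {
	var never time.Time         // Go's zero time
	fmt.Println(never.IsZero()) // true

	epoch := time.Unix(0, 0)    // unix timestamp 0 (1970-01-01)
	fmt.Println(epoch.IsZero()) // false: not Go's "zero"
}
```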
This is to fix possible testing failures caused by sync tests.
Changes:
1. Get rid of time sleep operations, which introduce uncertainty.
2. Use fake Store.
This changes etcdserver.Server to an interface, with the former Server
(now "EtcdServer") becoming the canonical/production implementation.
This will facilitate better testing of the http server et al with mock
implementations of the interface.
It also more clearly defines the boundary for users of the Server.
The ReverseProxy code from the standard library doesn't actually
give us the control that we want. Pull it down and rip out what
we don't need, adding tests in the process.
All available endpoints are attempted when proxying a request. If a
proxied request fails, the upstream will be considered unavailable
for 5s and no more requests will be proxied to it. After the 5s is
up, the endpoint will be put back to rotation.
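A toy sketch of the rotation policy (the shape here is assumed; the real proxy state differs):
```go
package main

import (
	"fmt"
	"time"
)

// Toy sketch of the rotation policy; the real proxy state differs.
type upstream struct {
	url      string
	failedAt time.Time
}

// available reports whether the endpoint is back in rotation.
func (u *upstream) available(now time.Time) bool {
	return u.failedAt.IsZero() || now.Sub(u.failedAt) >= 5*time.Second
}

func main() {
	u := &upstream{url: "http://10.0.1.10:4001"}
	u.failedAt = time.Now() // a proxied request just failed
	fmt.Println(u.available(u.failedAt.Add(2 * time.Second))) // false: still benched
	fmt.Println(u.available(u.failedAt.Add(6 * time.Second))) // true: back in rotation
}
```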
Make the raftIndex section the expected raftIndex of the next entry.
This makes the filename more intuitive and straightforward.
The commit also adds comments about the filename format.
golang convention dictates that the individual characters in an
abbreviation should all have the same case. Use ID instead of Id.
The protobuf generator still generates code that does not meet
this convention, but that's a fight for another day.
Add basic input validation of all query parameters supported by
serveKeys. Also restructures etcdhttp a bit to better facilitate
testing.
Test coverage is slightly improved.
Using 0xBEEF is annoying in examples because it makes it look
like the user can use ASCII or something. In the Procfile use
0x0,0x1,0x2,etc and use 0xBAD0 in tests.
This adds a build script that attempts to be as user friendly as
possible: if they have already set $GOPATH and/or $GOBIN, use those
environment variables. If not, create a gopath for them in this
directory. This should facilitate both `go get` and `git clone` usage.
The `test` script is updated, and the new `cover` script facilitates
easy coverage generation for the repo's constituent packages by setting
the PKG environment variable.
Why do this? `go test ./...` has a ton of annoying output:
```
? github.com/coreos/etcd [no test files]
? github.com/coreos/etcd/crc [no test files]
? github.com/coreos/etcd/elog [no test files]
? github.com/coreos/etcd/error [no test files]
ok github.com/coreos/etcd/etcdserver 0.267s
ok github.com/coreos/etcd/etcdserver/etcdhttp 0.022s
? github.com/coreos/etcd/etcdserver/etcdserverpb [no test files]
ok github.com/coreos/etcd/raft 0.157s
? github.com/coreos/etcd/raft/raftpb [no test files]
ok github.com/coreos/etcd/snap 0.018s
? github.com/coreos/etcd/snap/snappb [no test files]
third_party/code.google.com/p/gogoprotobuf/proto/testdata/test.pb.go:
undefined: __emptyarchive__.Extension
ok github.com/coreos/etcd/store 4.247s
ok
github.com/coreos/etcd/third_party/code.google.com/p/go.net/context
2.724s
FAIL
github.com/coreos/etcd/third_party/code.google.com/p/gogoprotobuf/proto
[build failed]
ok
github.com/coreos/etcd/third_party/github.com/stretchr/testify/assert
0.013s
ok github.com/coreos/etcd/wait 0.010s
ok github.com/coreos/etcd/wal 0.024s
? github.com/coreos/etcd/wal/walpb [no test files]
```
And we have now had to manually configure drone.io, which I want to avoid:
https://drone.io/github.com/coreos/etcd/admin
The documentation for these options has been incorrect ever since their
addition via commit 084dcb55. This broke upgrading a CoreOS stable
cluster to 410.0.0 because the user was using the example TOML config
which contains:
http_write_timeout = 10
This was silently ignored in the old stable version which predates the
addition of these options. After upgrading etcd began failing with:
Type mismatch for 'config.Config.http_write_timeout': Expected float
but found 'int64'.
Original issue: https://github.com/coreos/update_engine/issues/45
The default is for connections to last forever[1]. This leads to fds
leaking. I set the timeout so high by default so that watches don't have
to keep retrying, but perhaps we should set it lower.
Tested on a cluster with lots of clients and it seems to have relieved
the problem.
[1] https://groups.google.com/forum/#!msg/golang-nuts/JFhGwh1q9xU/heh4J8pul3QJ
Original Commit: 084dcb5596
From: philips <brandon@ifup.org>
Many HTTP clients will misbehave unless they get an initial HTTP
response, even when long-polling. It also saves the user/client from
having to handle headers on the first action of the watch, letting them
handle the response immediately.
Original commit: 2338481bb1
From: Christoffer Vikström
To make configuration changes safe without adding a configuration protocol:
1. We only allow adding/removing one node at a time.
2. We only allow one uncommitted configuration entry in the log.
These two rules make sure there are no disjoint quorums in both the current cluster and any
future cluster (after applying any number of committed or uncommitted entries in the log).
We add a type field to the Entry structure for two reasons:
1. The state machine needs to know if there is a pending configuration change.
2. A configuration entry should be executed by the raft package rather than by the
application using raft. A minimal sketch of the resulting Entry shape follows.
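(Names follow the text above; the generated protobuf code differs in detail.)
```go
package main

import "fmt"

// Minimal sketch of the Entry shape described above.
type EntryType int32

const (
	EntryNormal EntryType = iota
	EntryConfChange
)

type Entry struct {
	Type        EntryType
	Term, Index uint64
	Data        []byte // for EntryConfChange, an encoded config change
}

// pendingConf answers point 1: is there an uncommitted config change?
func pendingConf(log []Entry, committed int) bool {
	for _, e := range log[committed:] {
		if e.Type == EntryConfChange {
			return true
		}
	}
	return false
}

func main() {
	log := []Entry{{Type: EntryNormal}, {Type: EntryConfChange}}
	fmt.Println(pendingConf(log, 1)) // true: refuse another config change
}
```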
@@ -21,12 +21,13 @@ This is a rough outline of what a contributor's workflow looks like:
- Make sure your commit messages are in the proper format (see below).
- Push your changes to a topic branch in your fork of the repository.
- Submit a pull request to coreos/etcd.
- Your PR must receive an LGTM from two maintainers found in the MAINTAINERS file.
Thanks for your contributions!
### Code style
-The coding style suggested by the Golang community is used in etcd. See [style doc](https://code.google.com/p/go-wiki/wiki/Style) for details.
+The coding style suggested by the Golang community is used in etcd. See the [style doc](https://code.google.com/p/go-wiki/wiki/CodeReviewComments) for details.
Please follow this style to make etcd easy to review, maintain and develop.
Between 0.4.x and 2.0, the on-disk data formats have changed. In order to allow users to convert to 2.0, a migration tool is provided.
etcd will detect 0.4.x data dir and update the data automatically (while leaving a backup, in case of emergency).
### Data Migration Tips
* Keep the environment variables and etcd instance flags the same, particularly `--name`/`ETCD_NAME`.
* Don't change the cluster configuration. If there's a plan to add or remove machines, it's probably best to arrange for that after the migration, rather than before or at the same time.
### Running the tool
The tool can be run via:
```sh
./bin/etcd-migrate --data-dir=<PATH TO YOUR DATA>
```
It should autodetect everything and convert the data-dir to be 2.0 compatible. It does not remove the 0.4.x data, and is safe to convert multiple times; the 2.0 data will be overwritten. Recovering the disk space once everything is settled is covered later in the document.
If, however, it complains about autodetecting the name (which can happen, depending on how the cluster was configured), you need to supply the name of this particular node. This is equivalent to the `--name` flag (or `ETCD_NAME` variable) that etcd was run with, which can also be found by accessing the self API, e.g.:
```sh
curl -L http://127.0.0.1:4001/v2/stats/self
```
Where the `"name"` field is the name of the local machine.
Then, run the migration tool with
```sh
./bin/etcd-migrate --data-dir=<PATH TO YOUR DATA> --name=<NAME>
```
And the tool should migrate successfully. If it still fails at this point, it's a failure or bug in the tool and worth reporting.
### Recovering Disk Space
If the conversion has completed, the entire cluster is running on something 2.0-based, and the disk space is important, the following command will clear 0.4.x data from the data-dir:
```sh
rm -ri snapshot conf log
```
It will ask before every deletion, but these are the 0.4.x files and will not affect the working 2.0 data.
When first started, etcd stores its configuration into a data directory specified by the data-dir configuration parameter.
Configuration is stored in the write ahead log and includes: the local member ID, cluster ID, and initial cluster configuration.
The write ahead log and snapshot files are used during member operation and to recover after a restart.
If a member’s data directory is ever lost or corrupted then the user should remove the etcd member from the cluster via the [members API][members-api].
A user should avoid restarting an etcd member with a data directory from an out-of-date backup.
Using an out-of-date data directory can lead to inconsistency, as the member had agreed to store information via raft and would then re-join claiming it needs that information again.
For maximum safety, if an etcd member suffers any sort of data corruption or loss, it must be removed from the cluster.
Once removed the member can be re-added with an empty data directory.
If you are spinning up multiple clusters for testing it is recommended that you specify a unique initial-cluster-token for the different clusters.
This can protect you from cluster corruption in case of mis-configuration because two members started with different cluster tokens will refuse members from each other.
#### Optimal Cluster Size
The recommended etcd cluster size is 3, 5 or 7, which is decided by the fault tolerance requirement. A 7-member cluster can provide enough fault tolerance in most cases. While a larger cluster provides better fault tolerance, write performance is reduced since data needs to be replicated to more machines.
#### Fault Tolerance Table
It is recommended to have an odd number of members in a cluster. Having an odd cluster size doesn't change the number needed for majority, but you gain a higher tolerance for failure by adding the extra member. You can see this in practice when comparing even and odd sized clusters:
| Cluster Size | Majority | Failure Tolerance |
|--------------|------------|-------------------|
| 1 | 1 | 0 |
| 3 | 2 | 1 |
| 4 | 3 | 1 |
| 5 | 3 | **2** |
| 6 | 4 | 2 |
| 7 | 4 | **3** |
| 8 | 5 | 3 |
| 9 | 5 | **4** |
As you can see, adding another member to bring the size of cluster up to an odd size is always worth it. During a network partition, an odd number of members also guarantees that there will almost always be a majority of the cluster that can continue to operate and be the source of truth when the partition ends.
#### Changing Cluster Size
After your cluster is up and running, adding or removing members is done via [runtime reconfiguration](runtime-configuration.md), which allows the cluster to be modified without downtime. The `etcdctl` tool has `member list`, `member add` and `member remove` commands to complete this process.
### Member Migration
When there is a scheduled machine maintenance or retirement, you might want to migrate an etcd member to another machine without losing the data and changing the member ID.
The data directory contains all the data to recover a member to its point-in-time state. To migrate a member:
* Stop the member process
* Copy the data directory of the now-idle member to the new machine
* Update the peer URLs for that member to reflect the new machine according to the [member api] [change peer url]
* Start etcd on the new machine, using the same configuration and the copy of the data directory
This example will walk you through the process of migrating the infra1 member to a new machine:
etcd is designed to be resilient to machine failures. An etcd cluster can automatically recover from any number of temporary failures (for example, machine reboots), and a cluster of N members can tolerate up to _(N-1)/2_ permanent failures (where a member can no longer access the cluster, due to hardware failure or disk corruption). However, in extreme circumstances, a cluster might permanently lose enough members such that quorum is irrevocably lost. For example, if a three-node cluster suffered two simultaneous and unrecoverable machine failures, it would normally be impossible for the cluster to restore quorum and continue functioning.
To recover from such scenarios, etcd provides functionality to backup and restore the datastore and recreate the cluster without data loss.
#### Backing up the datastore
**NB:** Windows users must stop etcd before running the backup command.
The first step of the recovery is to backup the data directory on a functioning etcd node. To do this, use the `etcdctl backup` command, passing in the original data directory used by etcd. For example:
```sh
etcdctl backup \
--data-dir /var/lib/etcd \
--backup-dir /tmp/etcd_backup
```
This command will rewrite some of the metadata contained in the backup (specifically, the node ID and cluster ID), which means that the node will lose its former identity. In order to recreate a cluster from the backup, you will need to start a new, single-node cluster. The metadata is rewritten to prevent the new node from inadvertently being joined onto an existing cluster.
#### Restoring a backup
To restore a backup using the procedure created above, start etcd with the `-force-new-cluster` option and pointing to the backup directory. This will initialize a new, single-member cluster with the default advertised peer URLs, but preserve the entire contents of the etcd data store. Continuing from the previous example:
```sh
etcd \
-data-dir=/tmp/etcd_backup \
-force-new-cluster \
...
```
Now etcd should be available on this node and serving the original datastore.
Once you have verified that etcd has started successfully, shut it down and move the data back to the previous location (you may wish to make another copy as well to be safe):
```sh
pkill etcd
rm -fr /var/lib/etcd
mv /tmp/etcd_backup /var/lib/etcd
etcd \
-data-dir=/var/lib/etcd \
...
```
#### Restoring the cluster
Now that the node is running successfully, you can add more nodes to the cluster and restore resiliency. See the [runtime configuration](runtime-configuration.md) guide for more details.
### Client Request Timeout
etcd sets different timeouts for various types of client requests. The timeout value is not tunable now, which will be improved soon (https://github.com/coreos/etcd/issues/2038).
#### Get requests
Timeout is not set for get requests, because etcd serves the result locally in a non-blocking way.
**Note**: QuorumGet request is a different type, which is mentioned in the following sections.
#### Watch requests
Timeout is not set for watch requests. etcd will not stop a watch request until client cancels it, or the connection is broken.
#### Delete, Put, Post, QuorumGet requests
The default timeout is 5 seconds. It should be large enough to allow all key modifications if the majority of the cluster is functioning.
If the request times out, it indicates two possibilities:
1. the server the request was sent to was not functioning at that time.
2. the majority of the cluster is not functioning.
If timeouts happen several times in a row, administrators should check the status of the cluster and resolve the issue as soon as possible.
Allow-legacy is a special mode in etcd that contains logic to enable a running etcd cluster to smoothly transition between major versions of etcd. For example, the internal API versions between etcd 0.4 (internal v1) and etcd 2.0 (internal v2) aren't compatible and the cluster needs to be updated all at once to make the switch. To minimize downtime, allow-legacy coordinates with all of the members of the cluster to shut down, migrate data, and restart on the new version.
Allow-legacy helps users upgrade v0.4 etcd clusters easily, and allows your etcd cluster to have a minimal amount of downtime -- less than 1 minute for clusters storing less than 50 MB.
It currently supports upgrading from internal v1 to internal v2.
### Setup
This mode is enabled if `ETCD_ALLOW_LEGACY_MODE` is set to true, or etcd is running on a CoreOS system.
It treats `ETCD_BINARY_DIR` as the directory for etcd binaries, which is organized in this way:
```
ETCD_BINARY_DIR
|
-- 1
|
-- 2
```
`1` is etcd with internal v1 protocol. You should use etcd v0.4.7 here. `2` is etcd with internal v2 protocol, which is etcd v2.x.
The default value for `ETCD_BINARY_DIR` is `/usr/libexec/etcd/internal_versions/`.
### Upgrading a Cluster
When starting etcd with a v1 data directory and v1 flags, etcd executes the v0.4.7 binary and runs exactly the same as before. To start the migration, follow the steps below:

#### 1. Check the Cluster Health
Before upgrading, you should check the health of the cluster to double check that everything is working properly. Check the health by running:
```
$ etcdctl cluster-health
cluster is healthy
member 6e3bd23ae5f1eae0 is healthy
member 924e2e83e93f2560 is healthy
member a8266ecf031671f3 is healthy
```
If the cluster and all members are healthy, you can start the upgrade process. If not, check the unhealthy machines and repair them using the [admin guide](./admin_guide.md).
#### 2. Trigger the Upgrade
When you're ready, use the `etcdctl upgrade` command to start upgrading the etcd cluster to 2.0:
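(The invocation below is an assumption based on the surrounding text; check your version's etcdctl help for the exact flag.)
```sh
$ etcdctl upgrade --peer-url $PEER_URL
```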
`PEER_URL` can be any accessible peer URL of the cluster.
Once triggered, all peer-mode members will print out:
```
detected next internal version 2, exit after 10 seconds.
```
#### Parallel Coordinated Upgrade
As part of the upgrade, etcd does internal coordination within the cluster for a brief period and then exits. Clusters storing 50 MB should be unavailable for less than 1 minute.
#### Restart etcd Processes
After the etcd processes exit, they need to be restarted. You can do this manually or configure your unit system to do this automatically. On CoreOS, etcd is already configured to start automatically with systemd.
When restarted, the data directory of each member is upgraded, and afterwards etcd v2.0 will be running and servicing requests. The upgrade is now complete!
Standby-mode members are a special case — they will be upgraded into proxy mode (a new feature in etcd 2.0) upon restarting. When the upgrade is triggered, any standbys will exit with the message:
```
Detect the cluster has been upgraded to internal API v2. Exit now.
```
Once restarted, standbys run in v2.0 proxy mode, which proxies user requests to the etcd cluster.
#### 3. Check the Cluster Health
After the upgrade process, you can run the health check again to verify the upgrade. If the cluster is unhealthy or there is an unhealthy member, please refer to the [failure recovery](#failure-recovery) section.
### Downgrade
If the upgrade fails due to disk/network issues, you can still restart the upgrade process manually. However, once you upgrade etcd to the internal v2 protocol, you CANNOT downgrade it back to the internal v1 protocol. If you want to downgrade etcd in the future, please back up your v1 data dir beforehand.
### Upgrade Process on CoreOS
When running on a CoreOS system, allow-legacy mode is enabled by default and an automatic update will set up everything needed to execute the upgrade. The `etcd.service` on CoreOS is already configured to restart automatically. All you need to do is run `etcdctl upgrade` when you're ready, as described above.
### Internal Details
etcd v0.4.7 registers the versions of the etcd binaries available on its local machine into the key space at the bootstrap stage. When the upgrade command is executed, etcdctl checks whether each member has an internal-version-v2 etcd binary available. If so, each member is asked to record the fact that it needs to be upgraded the next time it reboots, and exits after 10 seconds.
Once restarted, etcd v2.0 sees the upgrade flag recorded. It upgrades the data directory, and executes etcd v2.0.
### Failure Recovery
If `etcdctl cluster-health` says that the cluster is unhealthy, the upgrade process has failed; this may happen if the network is broken or the disk fails.
The way to recover is to manually upgrade the whole cluster to v2.0:
- Log into machines that ran v0.4 peer-mode etcd
- Stop all etcd services
- Remove the `member` directory under the etcd data-dir
- Start etcd service using [2.0 flags](configuration.md). An example for this is:
The main goal of the etcd 2.0 release is to improve cluster safety around bootstrapping and dynamic reconfiguration. To do this, we deprecated the old error-prone APIs and provided a new set of APIs.
The other main focus of this release was a more reliable Raft implementation, but as this change is internal it should not have any notable effect on users.
#### Command Line Flags Changes
The major flag changes are mostly related to bootstrapping. The `initial-*` flags provide an improved way to specify the required criteria to start the cluster. The advertised URLs now support a list of values instead of a single value, which allows etcd users to gracefully migrate to the new set of IANA-assigned ports (2379/client and 2380/peers) while maintaining backward compatibility with the old ports.
- `-addr` is replaced by `-advertise-client-urls`.
- `-bind-addr` is replaced by `-listen-client-urls`.
- `-peer-addr` is replaced by `-initial-advertise-peer-urls`.
- `-peer-bind-addr` is replaced by `-listen-peer-urls`.
- `-peers` is replaced by `-initial-cluster`.
- `-peers-file` is replaced by `-initial-cluster`.
- `-peer-heartbeat-interval` is replaced by `-heartbeat-interval`.
- `-peer-election-timeout` is replaced by `-election-timeout`.
The documentation of new command line flags can be found at
- Default data dir location has changed from {$hostname}.etcd to {name}.etcd.
- The disk format within the data dir has changed. etcd 2.0 should be able to auto upgrade the old data format. Instructions on doing so manually are in the [migration tool doc][migrationtooldoc].
The consistent flag for read operations is removed in etcd 2.0.0. Normal read operations provide the same consistency guarantees as the 0.4.6 read operations with the consistent flag set.
The read consistency guarantees are:
The consistent read guarantees sequential consistency within one client that talks to one etcd server. Reads and writes from one client to one etcd member are observed in order. If one client writes a value to an etcd server successfully, it should be able to get the value out of the server immediately.
Each etcd member will proxy the request to the leader and only return the result to the user after the result is applied on the local member. Thus after the write succeeds, the user is guaranteed to see the value on the member it sent the request to.
Reads do not provide linearizability. If you want linearizable reads, you need to set the quorum option to true.
**Previous behavior**
We added an option for a consistent read in the old version of etcd since etcd 0.x redirects the write request to the leader. When the user gets back the result from the leader, the member it sent the request to originally might not have applied the write request yet. With the consistent flag set to true, the client will always send read requests to the leader. So one client should be able to see its last write when consistent=true is enabled. There are no ordering guarantees among different clients.
#### Standby
etcd 0.4’s standby mode has been deprecated. [Proxy mode][proxymode] is introduced to solve a subset of problems standby was solving.
Standby mode was intended for large clusters that had a subset of the members acting in the consensus process. Overall this process was too magical and allowed for operators to back themselves into a corner.
Proxy mode in 2.0 will provide similar functionality, and with improved control over which machines act as proxies due to the operator specifically configuring them. Proxies also support read only or read/write modes for increased security and durability.
`v2/admin` on peer url and `v2/keys/_etcd` are unified under the new [v2/member API][memberapi] to better explain which machines are part of an etcd cluster, and to simplify the keyspace for all your use cases.
Starting an etcd cluster requires that each node knows another in the cluster. If you are trying to bring up a cluster all at once, say using a cloud formation, you also need to coordinate who will be the initial cluster leader. The discovery protocol helps you by providing an automated way to discover other existing peers in a cluster.
For more information on how etcd can locate the cluster, see the [finding the cluster][cluster-finding] documentation.
Please note - at least 3 nodes are required for [cluster availability][optimal-cluster-size].
To use the discovery API, you must first create a token for your etcd cluster. Visit [https://discovery.etcd.io/new](https://discovery.etcd.io/new) to create a new token.
You can inspect the list of peers by viewing `https://discovery.etcd.io/<token>`.
### Start etcd With the Discovery Flag
Specify the `-discovery` flag when you start each etcd instance. The list of existing peers in the cluster will be downloaded and configured. If the instance is the first peer, it will start as the leader of the cluster.
The discovery API communicates with a separate etcd cluster to store and retrieve the list of peers. CoreOS provides [https://discovery.etcd.io](https://discovery.etcd.io) as a free service, but you can easily run your own etcd cluster for this purpose. Here's an example using an etcd cluster located at `10.10.10.10:4001`:
If you're interested in how the discovery API works behind the scenes, read about the [Discovery Protocol](https://github.com/coreos/etcd/blob/master/Documentation/discovery-protocol.md).
## Setting Peer Addresses Correctly
The Discovery API submits the `-peer-addr` of each etcd instance to the configured Discovery endpoint. It's important to select an address that *all* peers in the cluster can communicate with. For example, if you're located in two regions of a cloud provider, configuring a private `10.x` address will not work between the two regions, and communication will not be possible between all peers.
## Stale Peers
The discovery API will automatically clean up the address of a stale peer that is no longer part of the cluster. The TTL for this process is a week, which should be long enough to handle any extremely long outage you may encounter. There is no harm in having stale peers in the list until they are cleaned up, since an etcd instance only needs to connect to one valid peer in the cluster to join.
## Lifetime of a Discovery URL
A discovery URL identifies a single etcd cluster. Do not re-use discovery URLs for new clusters.
When a machine starts with a new discovery URL the discovery URL will be activated and record the machine's metadata. If you destroy the whole cluster and attempt to bring the cluster back up with the same discovery URL it will fail. This is intentional because all of the registered machines are gone including their logs so there is nothing to recover the killed cluster.
We use Raft as the underlying distributed protocol which provides consistency and persistence of the data across all of the etcd instances.
Starting an etcd cluster statically requires that each member knows another in the cluster. In a number of cases, you might not know the IPs of your cluster members ahead of time. In these cases, you can bootstrap an etcd cluster with the help of a discovery service.
Let's start by creating 3 new etcd instances.
Once an etcd cluster is up and running, adding or removing members is done via [runtime reconfiguration](runtime-configuration.md).
We use `-peer-addr` to specify the server port, `-addr` to specify the client port, and `-data-dir` to specify the directory to store the log and info of the machine in the cluster:
This guide will cover the following mechanisms for bootstrapping an etcd cluster:
**Note:** If you want to run etcd on an external IP address and still have access locally, you'll need to add `-bind-addr 0.0.0.0` so that it will listen on both external and localhost addresses.
A similar argument `-peer-bind-addr` is used to setup the listening address for the server port.
Each of the bootstrapping mechanisms will be used to create a three machine etcd cluster with the following details:
Let's join two more machines to this cluster using the `-peers` argument. A single connection to any peer will allow a new machine to join, but multiple can be specified for greater resiliency.
We can retrieve a list of machines in the cluster using the HTTP API:
```sh
curl -L http://127.0.0.1:4001/v2/machines
```
We should see there are three machines in the cluster
As we know the cluster members, their addresses and the size of the cluster before starting, we can use an offline bootstrap configuration by setting the `initial-cluster` flag. Each machine will get either the following command line or environment variables:
If one machine disconnects from the cluster, it can rejoin the cluster automatically when communication is restored.
If one machine is killed, it can rejoin the cluster when started with its old name. If the peer address has changed, etcd will treat the new peer address as the refreshed one, which benefits instance migration or virtual machines booting with a different IP. The peer-address-changing functionality is only supported when the majority of the cluster is alive, because this behavior needs the consensus of the etcd cluster.
**Note:** For now, it is the user's responsibility to ensure that the machine doesn't join a cluster that has a member with the same name; otherwise unexpected errors will happen. This will be improved in the future.
### Killing Nodes in the Cluster
Now if we kill the leader of the cluster, we can get the value from one of the other two machines:
```sh
curl -L http://127.0.0.1:4002/v2/keys/foo
```
We can also see that a new leader has been elected:
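One way to check, assuming the stats endpoint from the API docs, is to query a surviving member's self stats and look at its `state` field:

```sh
curl -L http://127.0.0.1:4002/v2/stats/self
```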
Note that the URLs specified in `initial-cluster` are the _advertised peer URLs_, i.e. they should match the value of `initial-advertise-peer-urls` on the respective nodes.
If you are spinning up multiple clusters (or creating and destroying a single cluster) with the same configuration for testing purposes, it is highly recommended that you specify a unique `initial-cluster-token` for each cluster. By doing this, etcd can generate unique cluster IDs and member IDs for the clusters even if they otherwise have the exact same configuration. This protects you from cross-cluster interaction, which might corrupt your clusters.
On each machine you would start etcd with these flags:
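A sketch for member infra0, reusing the illustrative names and addresses from above; the other members are started the same way with their own `-name` and peer URLs:

```sh
etcd -name infra0 -initial-advertise-peer-urls http://10.0.1.10:2380 \
  -listen-peer-urls http://10.0.1.10:2380 \
  -initial-cluster-token etcd-cluster-1 \
  -initial-cluster infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380 \
  -initial-cluster-state new
```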
The command line parameters starting with `-initial-cluster` will be ignored on subsequent runs of etcd. You are free to remove the environment variables or command line flags after the initial bootstrap process. If you need to make changes to the configuration later (for example, adding or removing members to/from the cluster), see the [runtime configuration](runtime-configuration.md) guide.
### Testing Persistence

Next we'll kill all the machines to test persistence. Type `CTRL-C` on each terminal and then rerun the same command you used to start each machine.

Your request for the `foo` key will return the correct value:

```json
{
    "action": "get",
    "node": {
        "createdIndex": 4,
        "key": "/foo",
        "modifiedIndex": 4,
        "value": "bar"
    }
}
```

### Error Cases

In the following example, we have not included our new host in the list of enumerated nodes. If this is a new cluster, the node _must_ be added to the list of initial cluster members.

```
etcd: infra1 not listed in the initial cluster config
exit 1
```

In this example, we are attempting to map a node (infra0) on a different address (127.0.0.1:2380) than its enumerated address in the cluster list (10.0.1.10:2380). If this node is to listen on multiple addresses, all addresses _must_ be reflected in the "initial-cluster" configuration directive.

```
etcd: conflicting cluster ID to the target cluster (c6ab534d07e8fcc4 != bc25ea2a74fb18b0). Exiting.
exit 1
```
In the previous example we showed how to use SSL client certs for client-to-server communication.
Etcd can also do internal server-to-server communication using SSL client certs.
To do this, just change the `-*-file` flags to `-peer-*-file`.
## Discovery
If you are using SSL for server-to-server communication, you must use it on all instances of etcd.
In a number of cases, you might not know the IPs of your cluster peers ahead of time. This is common when utilizing cloud providers or when your network uses DHCP. In these cases, rather than specifying a static configuration, you can use an existing etcd cluster to bootstrap a new one. We call this process "discovery".
There are two methods that can be used for discovery:
* etcd discovery service
* DNS SRV records
### etcd Discovery
#### Lifetime of a Discovery URL
A discovery URL identifies a unique etcd cluster. Instead of reusing a discovery URL, you should always create discovery URLs for new clusters.
Moreover, discovery URLs should ONLY be used for the initial bootstrapping of a cluster. To change cluster membership after the cluster is already running, see the [runtime reconfiguration][runtime] guide.
Discovery uses an existing cluster to bootstrap itself. If you are using your own etcd cluster you can create a URL like so:
```
$ curl -X PUT https://myetcd.local/v2/keys/discovery/6c007a14875d53d9bf0ef5a6fc0257c817f0fb83/_config/size -d value=3
```
By setting the size key to the URL, you create a discovery URL with an expected cluster size of 3.
If you bootstrap an etcd cluster using discovery service with more than the expected number of etcd members, the extra etcd processes will [fall back][fall-back] to being [proxies][proxy] by default.
The URL you will use in this case will be `https://myetcd.local/v2/keys/discovery/6c007a14875d53d9bf0ef5a6fc0257c817f0fb83` and the etcd members will use the `https://myetcd.local/v2/keys/discovery/6c007a14875d53d9bf0ef5a6fc0257c817f0fb83` directory for registration as they start.
Now we start etcd with those relevant flags for each member:
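A sketch for one member; the `-name` and advertised peer URL are illustrative, and every member passes the same discovery URL created above:

```sh
etcd -name infra0 -initial-advertise-peer-urls http://10.0.1.10:2380 \
  -discovery https://myetcd.local/v2/keys/discovery/6c007a14875d53d9bf0ef5a6fc0257c817f0fb83
```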
This will cause each member to register itself with the custom etcd discovery service and begin the cluster once all machines have been registered.
#### Public etcd Discovery Service
If you do not have access to an existing cluster, you can use the public discovery service hosted at `discovery.etcd.io`. You can create a private discovery URL using the "new" endpoint like so:
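(The response URL shown below is illustrative of the form returned.)

```sh
$ curl https://discovery.etcd.io/new?size=3
https://discovery.etcd.io/3e86b59982e49066c5d813af1c2e2579cbf573de
```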
This will create the cluster with an initial expected size of 3 members. If you do not specify a size, a default of 3 will be used.
DNS SRV records can also be used to configure the list of peers for an etcd server running in proxy mode:
```
$ etcd --proxy on -discovery-srv example.com
```
# 0.4 to 2.0+ Migration Guide
In etcd 2.0 we introduced the ability to listen on more than one address and to advertise multiple addresses. This makes using etcd easier when you have complex networking, such as private and public networks on various cloud providers.
To make understanding this feature easier, we changed the naming of some flags, but we support the old flags to make the migration from the old to new version easier.
|Old Flag |New Flag |Migration Behavior |
|-----------------|------------------------------|-------------------|
|-peer-addr |-initial-advertise-peer-urls |If specified, peer-addr will be used as the only peer URL. Error if both flags specified.|
|-addr |-advertise-client-urls |If specified, addr will be used as the only client URL. Error if both flags specified.|
|-peer-bind-addr |-listen-peer-urls |If specified, peer-bind-addr will be used as the only peer bind URL. Error if both flags specified.|
|-bind-addr |-listen-client-urls |If specified, bind-addr will be used as the only client bind URL. Error if both flags specified.|
|-peers |none |Deprecated. The -initial-cluster flag provides a similar concept with different semantics. Please read this guide on cluster startup.|
|-peers-file |none |Deprecated. The -initial-cluster flag provides a similar concept with different semantics. Please read this guide on cluster startup.|
etcd is configurable through command-line flags and environment variables. Options set on the command line take precedence over those from the environment.
Individual node configuration options can be set in three places:

1. Command line flags
2. Environment variables
3. Configuration file

Options set on the command line take precedence over all other sources. Options set in environment variables take precedence over options set in the configuration file.

The format of the environment variable for flag `-my-flag` is `ETCD_MY_FLAG`. It applies to all flags.

To start etcd automatically using custom settings at startup in Linux, using a [systemd][systemd-intro] unit is highly recommended.
Cluster-wide settings are configured via the `/config` admin endpoint and additionally in the configuration file. Values contained in the configuration file will seed the cluster setting with the provided value. After the cluster is running, only the admin endpoint is used.
##### -name
+ Human-readable name for this member.
+ default: "default"
The full documentation is contained in the [API docs](https://github.com/coreos/etcd/blob/master/Documentation/api.md#cluster-config).
##### -data-dir
+ Path to the data directory.
+ default: "${name}.etcd"
* `activeSize` - the maximum number of peers that can participate in the consensus protocol. Other peers will join as standbys.
* `removeDelay` - the minimum time in seconds that a machine has been observed to be unresponsive before it is removed from the cluster.
* `syncInterval` - the amount of time in seconds between cluster sync when it runs in standby mode.
##### -snapshot-count
+ Number of committed transactions to trigger a snapshot to disk.
+ default: "10000"
## Command Line Flags
##### -heartbeat-interval
+ Time (in milliseconds) of a heartbeat interval.
+ default: "100"
### Required
##### -election-timeout
+ Time (in milliseconds) for an election to timeout.
+ default: "1000"
* `-addr` - The advertised public hostname:port for client communication. Defaults to `127.0.0.1:4001`.
* `-discovery` - A URL to use for discovering the peer list. (i.e `"https://discovery.etcd.io/your-unique-key"`).
* `-http-read-timeout` - The number of seconds before an HTTP read operation is timed out.
* `-http-write-timeout` - The number of seconds before an HTTP write operation is timed out.
* `-bind-addr` - The listening hostname for client communication. Defaults to advertised IP.
* `-peers` - A comma separated list of peers in the cluster (i.e `"203.0.113.101:7001,203.0.113.102:7001"`).
* `-peers-file` - The file path containing a comma separated list of peers in the cluster.
* `-ca-file` - The path of the client CAFile. Enables client cert authentication when present.
* `-cert-file` - The cert file of the client.
* `-key-file` - The key file of the client.
* `-config` - The path of the etcd configuration file. Defaults to `/etc/etcd/etcd.conf`.
* `-cors` - A comma separated white list of origins for cross-origin resource sharing.
* `-cpuprofile` - The path to a file to output CPU profile data. Enables CPU profiling when present.
* `-data-dir` - The directory to store log and snapshot. Defaults to the current working directory.
* `-max-result-buffer` - The max size of result buffer. Defaults to `1024`.
* `-max-retry-attempts` - The max retry attempts when trying to join a cluster. Defaults to `3`.
* `-peer-addr` - The advertised public hostname:port for server communication. Defaults to `127.0.0.1:7001`.
* `-peer-bind-addr` - The listening hostname for server communication. Defaults to advertised IP.
* `-peer-ca-file` - The path of the CAFile. Enables client/peer cert authentication when present.
* `-peer-cert-file` - The cert file of the server.
* `-peer-key-file` - The key file of the server.
* `-peer-election-timeout` - The number of milliseconds to wait before the leader is declared unhealthy.
* `-peer-heartbeat-interval` - The number of milliseconds in between heartbeat requests.
* `-snapshot=false` - Disable log snapshots. Defaults to `true`.
* `-cluster-active-size` - The expected number of instances participating in the consensus protocol. Only applied if the etcd instance is the first peer in the cluster.
* `-cluster-remove-delay` - The number of seconds before one node is removed from the cluster since it cannot be connected at all. Only applied if the etcd instance is the first peer in the cluster.
* `-cluster-sync-interval` - The number of seconds between synchronization for standby-mode instances with the cluster. Only applied if the etcd instance is the first peer in the cluster.
* `-v` - Enable verbose logging. Defaults to `false`.
* `-vv` - Enable very verbose logging. Defaults to `false`.
* `-version` - Print the version and exit.
##### -max-snapshots
+ Maximum number of snapshot files to retain (0 is unlimited)
+ default: 5
+ The default for users on Windows is unlimited, and manual purging down to 5 (or your preference for safety) is recommended.
##### -max-wals
+ Maximum number of wal files to retain (0 is unlimited)
+ default: 5
+ The default for users on Windows is unlimited, and manual purging down to 5 (or your preference for safety) is recommended.

## Configuration File

The etcd configuration file is written in [TOML](https://github.com/mojombo/toml) and read from `/etc/etcd/etcd.conf` by default.
##### -cors
+ Comma-separated white list of origins for CORS (cross-origin resource sharing).
`-initial` prefix flags are used in bootstrapping ([static bootstrap][build-cluster], [discovery-service bootstrap][discovery] or [runtime reconfiguration][reconfig]) a new member, and ignored when restarting an existing member.
```
[cluster]
active_size=9
remove_delay=1800.0
sync_interval=5.0
```
`-discovery` prefix flags need to be set when using [discovery service][discovery].
## Environment Variables
* `ETCD_ADDR`
* `ETCD_BIND_ADDR`
* `ETCD_CA_FILE`
* `ETCD_CERT_FILE`
* `ETCD_CORS_ORIGINS`
* `ETCD_CONFIG`
* `ETCD_CPU_PROFILE_FILE`
* `ETCD_DATA_DIR`
* `ETCD_DISCOVERY`
* `ETCD_CLUSTER_HTTP_READ_TIMEOUT`
* `ETCD_CLUSTER_HTTP_WRITE_TIMEOUT`
* `ETCD_KEY_FILE`
* `ETCD_PEERS`
* `ETCD_PEERS_FILE`
* `ETCD_MAX_CLUSTER_SIZE`
* `ETCD_MAX_RESULT_BUFFER`
* `ETCD_MAX_RETRY_ATTEMPTS`
* `ETCD_NAME`
* `ETCD_SNAPSHOT`
* `ETCD_VERBOSE`
* `ETCD_VERY_VERBOSE`
* `ETCD_PEER_ADDR`
* `ETCD_PEER_BIND_ADDR`
* `ETCD_PEER_CA_FILE`
* `ETCD_PEER_CERT_FILE`
* `ETCD_PEER_KEY_FILE`
* `ETCD_PEER_ELECTION_TIMEOUT`
* `ETCD_CLUSTER_ACTIVE_SIZE`
* `ETCD_CLUSTER_REMOVE_DELAY`
* `ETCD_CLUSTER_SYNC_INTERVAL`
##### -initial-advertise-peer-urls
+ List of this member's peer URLs to advertise to the rest of the cluster. These addresses are used for communicating etcd data around the cluster. At least one must be routable to all cluster members.

##### -initial-cluster-state
+ Initial cluster state ("new" or "existing"). Set to `new` for all members present during initial static or DNS bootstrapping. If this option is set to `existing`, etcd will attempt to join the existing cluster. If the wrong value is set, etcd will attempt to start but fail safely.
+ default: "new"
[static bootstrap]: clustering.md#static
##### -initial-cluster-token
+ Initial cluster token for the etcd cluster during bootstrap.
+ default: "etcd-cluster"
##### -advertise-client-urls
+ List of this member's client URLs to advertise to the rest of the cluster.
##### -discovery-fallback
+ Expected behavior ("exit" or "proxy") when the discovery service fails.
+ default: "proxy"
##### -discovery-proxy
+ HTTP proxy to use for traffic to discovery service.
+ default: none
### Proxy Flags
`-proxy` prefix flags configure etcd to run in [proxy mode][proxy].
##### -proxy
+ Proxy mode setting ("off", "readonly" or "on").
+ default: "off"
### Security Flags
The security flags help to [build a secure etcd cluster][security].
##### -ca-file
+ Path to the client server TLS CA file.
+ default: none
##### -cert-file
+ Path to the client server TLS cert file.
+ default: none
##### -key-file
+ Path to the client server TLS key file.
+ default: none
##### -peer-ca-file
+ Path to the peer server TLS CA file.
+ default: none
##### -peer-cert-file
+ Path to the peer server TLS cert file.
+ default: none
##### -peer-key-file
+ Path to the peer server TLS key file.
+ default: none
### Unsafe Flags
Be CAUTIOUS when using unsafe flags because they will break the guarantees given by the consensus protocol. For example, etcd may panic if other members in the cluster are still alive. Follow the instructions carefully when using these flags.
##### -force-new-cluster
+ Forces the creation of a new one-member cluster. It forcibly commits configuration changes to remove all existing members in the cluster and add itself. It needs to be set to [restore a backup][restore].
Diagnosing issues in a distributed application is hard.
etcd will help as much as it can - just enable these debug features using the CLI flag `-trace=*` or the config option `trace=*`.
## Logging
Log verbosity can be increased to the max using either the `-vvv` CLI flag or the `very_very_verbose=true` config option.
The only supported logging mode is to stdout.
## Metrics
etcd itself can generate a set of metrics.
These metrics represent many different internal data points that can be helpful when debugging etcd servers.
#### Metrics reference
Each individual metric name is prefixed with `etcd.<NAME>`, where \<NAME\> is the configured name of the etcd server.
* `timer.appendentries.handle`: amount of time a peer takes to process an AppendEntriesRequest from the POV of the peer itself
* `timer.peer.<PEER>.heartbeat`: amount of time a peer heartbeat operation takes from the POV of the leader that initiated that operation for peer \<PEER\>
* `timer.command.<COMMAND>`: amount of time a given command took to be processed through the local server's raft state machine. This does not include time waiting on locks.
#### Fetching metrics over HTTP
Once tracing has been enabled on a given etcd server, all metric data is available at the server's `/debug/metrics` HTTP endpoint (i.e. `http://127.0.0.1:4001/debug/metrics`).
Executing a GET HTTP command against the metrics endpoint will yield the current state of all metrics in the etcd server.
#### Sending metrics to Graphite
etcd supports [Graphite's Carbon plaintext protocol](https://graphite.readthedocs.org/en/latest/feeding-carbon.html#the-plaintext-protocol) - a TCP wire protocol designed for shipping metric data to an aggregator.
To send metrics to a Graphite endpoint using this protocol, use the `-graphite-host` CLI flag or the `graphite_host` config option (i.e. `graphite_host=172.17.0.19:2003`).
See an [example graphite deploy script](https://github.com/coreos/etcd/contrib/graphite).
#### Generating additional metrics with Collectd
[Collectd](http://collectd.org/documentation.shtml) gathers metrics from the host running etcd.
While these aren't metrics generated by etcd itself, it can be invaluable to compare etcd's view of the world to that of a separate process running next to etcd.
See an [example collectd deploy script](https://github.com/coreos/etcd/contrib/collectd).
## Profiling
etcd exposes profiling information from the Go pprof package over HTTP.
The basic browsable interface is served by etcd at the `/debug/pprof` HTTP endpoint (i.e. `http://127.0.0.1:4001/debug/pprof`).
For more information on using profiling tools, see http://blog.golang.org/profiling-go-programs.
**NOTE**: In the following examples you need to ensure that the `./bin/etcd` is identical to the `./bin/etcd` that you are targeting (same git hash, arch, platform, etc).
#### Heap memory profile
```
go tool pprof ./bin/etcd http://127.0.0.1:4001/debug/pprof/heap
```
#### CPU profile
```
go tool pprof ./bin/etcd http://127.0.0.1:4001/debug/pprof/profile
```
#### Blocked goroutine profile
```
go tool pprof ./bin/etcd http://127.0.0.1:4001/debug/pprof/block
```
Peer discovery uses the following sources in this order: log data in `-data-dir`, `-discovery` and `-peers`.
If log data is provided, etcd will concatenate possible peers from three sources: the log data, the `-discovery` option, and `-peers`. Then it tries to join cluster through them one by one. If all connection attempts fail (which indicates that the majority of the cluster is currently down), it will restart itself based on the log data, which helps the cluster to recover from a full outage.
Without log data, the instance is assumed to be a brand new one. If possible targets are provided by `-discovery` and `-peers`, etcd will make a best effort attempt to join them, and if none is reachable it will exit. Otherwise, if no `-discovery` or `-peers` option is provided, a new cluster will always be started.
This ensures that users can always restart the node safely with the same command (without --force), and etcd will either reconnect to the old cluster if it is still running or recover its cluster from an outage.
Adding peers in an etcd cluster adds network, CPU, and disk overhead to the leader since each one requires replication.
Peers primarily provide resiliency in the event of a leader failure but the benefit of more failover nodes decreases as the cluster size increases.
A lightweight alternative is the standby.
Standbys are a way for an etcd node to forward requests along to the cluster but the standbys are not part of the Raft cluster themselves.
This provides an easier API for local applications while reducing the overhead required by a regular peer node.
Standbys can also be promoted to full peers in the event that a peer node in the cluster has not recovered after a long duration.
## Configuration Parameters
There are three configuration parameters used by standbys: active size, remove delay and standby sync interval.
The active size specifies a target size for the number of peers in the cluster.
If there are not enough peers to meet the active size, standbys will send join requests until the peer count is equal to the active size.
If there are more peers than the target active size then peers are removed by the leader and will become standbys.
The remove delay specifies how long the cluster should wait before removing a dead peer.
By default this is 30 minutes.
If a peer is inactive for 30 minutes then the peer is removed.
The standby sync interval specifies the synchronization interval of standbys with the cluster.
By default this is 5 seconds.
After each interval, standbys synchronize information with the cluster.
## Logical Workflow
### Start an etcd machine
#### Main logic
```
If find existing standby cluster info:
    Goto standby loop

Find cluster as required
If determine to start peer server:
    Goto peer loop
Else:
    Goto standby loop

Peer loop:
    Start peer mode
    If running:
        Wait for stop
    Goto standby loop

Standby loop:
    Start standby mode
    If running:
        Wait for stop
    Goto peer loop
```
#### [Cluster finding logic][cluster-finding.md]
#### Join request logic:
```
Fetch machine info
If cannot match version:
    return false
If active size <= peer count:
    return false
If it has existed in the cluster:
    return true
If join request fails:
    return false
return true
```
**Note**
1. [TODO] The running mode cannot be determined from the log, because the log may be outdated. But the log could be used to estimate its state.
2. Even if cluster sync fails, it will still restart, so that it can recover from a full outage.
#### Peer mode start logic
```
Start raft server
Start other helper routines
```
#### Peer mode auto stop logic
```
When removed from the cluster:
    Stop raft server
    Stop other helper routines
```
#### Standby mode run logic
```
Loop:
    Sleep for some time
    Sync cluster, and write cluster info into disk
    Check active size and send join request if needed
    If succeed:
        Clear cluster info from disk
        Return
```
#### Serve Requests as Standby
Always return '404 Page Not Found' on the peer address. This is because the peer address is used for raft communication and cluster management, which should not be used in standby mode.
Serve requests from client:
```
Redirect all requests to client URL of leader
```
**Note**
1. The leader here refers to the leader of the raft cluster as of the latest successful synchronization.
2. [IDEA] We could extend HTTP Redirect to multiple possible targets.
### Join Request Handling
```
If machine has existed in the cluster:
    Return
If peer count < active size:
    Add peer
    Increase peer count
```
### Remove Request Handling
```
If machine exists in the cluster:
    Remove peer
    Decrease peer count
```
## Cluster Monitor Logic
### Active Size Monitor:
This is only run by current cluster leader.
```
Loop:
    Sleep for some time
    If peer count > active size:
        Remove randomly selected peer
```
### Peer Activity Monitor
This is only run by current cluster leader.
```
Loop:
    Sleep for some time
    For each peer:
        If peer last activity time > remove delay:
            Remove the peer
            Goto Loop
```
## Cluster Cases
### Create Cluster with Thousands of Instances
The first few machines run in peer mode.
All the others check the status of the cluster and run in standby mode.
### Recover from full outage
Machines with log data restart with join failure.
Machines in peer mode recover heartbeat between each other.
Machines in standby mode always sync the cluster. If sync fails, it uses the first address from data log as redirect target.
### Kill one peer machine
The leader of the cluster loses its connection with the peer.
When the time exceeds the remove delay, it removes the peer from the cluster.
A machine in standby mode finds an available place in the cluster. It sends a join request and joins the cluster.
**Note**
1. [TODO] A machine which was partitioned from the majority and removed from the cluster will disturb the running cluster if a new node uses the same name.
### Kill one standby machine
No change for the cluster.
## Cons
1. New instance cannot join immediately after one peer is kicked out of the cluster, because the leader doesn't know the info about the standby instances.
2. It may introduce join collisions.
3. Cluster needs a good interval setting to balance the join delay and join collision.
## Future Attack Plans
1. Based on heartbeat miss and remove delay, standby could adjust its next check time.
2. Preregister the promotion target when heartbeat miss happens.
3. Get the estimated cluster size from the check happened in the sync interval, and adjust sync interval dynamically.
4. Accept join requests based on active size and alive peers.
Starting a new etcd cluster can be painful since each machine needs to know of at least one live machine in the cluster. If you are trying to bring up a new cluster all at once, say using an AWS cloud formation, you also need to coordinate who will be the initial cluster leader. The discovery protocol uses an existing running etcd cluster to start a second etcd cluster.
To use this feature you add the command line flag `-discovery` to your etcd args. In this example we will use `http://example.com/v2/keys/_etcd/registry` as the URL prefix.
## The Protocol
By convention the etcd discovery protocol uses the key prefix `_etcd/registry`. A full URL to the keyspace will be `http://example.com/v2/keys/_etcd/registry`.
### Creating a New Cluster
Generate a unique token that will identify the new cluster. This will be used as a key prefix in the following steps. An easy way to do this is to use uuidgen:
```
UUID=$(uuidgen)
```
### Bringing up Machines
Now that you have your cluster ID you can start bringing up machines. Every machine will follow this protocol internally in etcd if given a `-discovery`.
### Registering your Machine
The first thing etcd must do is register your machine. This is done by using the machine name (from the `-name` arg) and posting it with a long TTL to the given key.
```
curl -X PUT "http://example.com/v2/keys/_etcd/registry/${UUID}/${etcd_machine_name}?ttl=604800" -d value=${peer_addr}
```
### Discovering Peers
Now that this etcd machine is registered it must discover its peers.
But the tricky bit of starting a new cluster is that one machine needs to assume the initial role of leader and will have no peers. To figure out if another machine has already started the cluster, etcd attempts to create the `_state` key and set its value to "started":
```
curl -X PUT "http://example.com/v2/keys/_etcd/registry/${UUID}/_state?prevExist=false" -d value=started
```
If this returns a `200 OK` response then this machine is the initial leader and should start with no peers configured. If, however, this returns a `412 Precondition Failed` then you need to find all of the registered peers:
```
curl -X GET "http://example.com/v2/keys/_etcd/registry/${UUID}?recursive=true"
```
Atomic *test and set* value to a file. If the test succeeds, this operation will change the previous value of the file to the given value.
- If the prevValue is given, it will test against the previous value of the node.
- If the prevValue is empty, it will test whether the node exists.
- If the prevValue is not empty, it will test if the prevValue is equal to the current value of the file.
- If the prevIndex is given, it will test if the create/last modified index of the node is equal to prevIndex.
- **Renew** (path, ttl)
Set the node's expiration time to (current time + ttl)
## ACL
### Theory
Etcd exports a Unix-like file system interface consisting of files and directories, collectively called nodes.
Each node has various meta-data, including three names of the access control lists used to control reading, writing and changing (change ACL names for the node).
We are storing the ACL names for nodes under a special *ACL* directory.
Each node has an ACL name corresponding to one file within the *ACL* dir.
Unless overridden, a node naturally inherits the ACL names of its parent directory on creation.
Each ACL name has three children: *R (Reading)*, *W (Writing)*, *C (Changing)*.
Each permission is also a node. Under that node are the users who have this permission for the file referring to this ACL name.
This document defines the various terms used in etcd documentation, command line and source code.
### Node
Node is an instance of raft state machine.
It has a unique identification, and records other nodes' progress internally when it is the leader.
### Member
Member is an instance of etcd. It hosts a node, and provides service to clients.
### Cluster
Cluster consists of several members.
The node in each member follows the raft consensus protocol to replicate logs. The cluster receives proposals from members, commits them and applies them to the local store.
etcd has a number of modules that are built on top of the core etcd API.
These modules provide things like dashboards, locks and leader election (removed).
**Warning**: Modules are deprecated from v0.4 until we have a solid base we can apply them back onto.
For now, we are choosing to focus on the raft algorithm and core etcd to make sure that they work correctly and fast.
It is also time-consuming to maintain these modules in this period, given that etcd's API changes from time to time.
Moreover, the lock module has some unfixed bugs, which may mislead users.
But we also notice that these modules are popular and useful, and plan to add them back with full functionality as soon as possible.
### Dashboard
An HTML dashboard can be found at `http://127.0.0.1:4001/mod/dashboard/`.
This dashboard is compiled into the etcd binary and uses the same API as regular etcd clients.
Use the `-cors='*'` flag to allow your browser to request information from the current master as it changes.
### Lock
The Lock module implements a fair lock that can be used when lots of clients want access to a single resource.
A lock can be associated with a value.
The value is unique so if a lock tries to request a value that is already queued for a lock then it will find it and watch until that value obtains the lock.
You may supply a `timeout` which will cancel the lock request if it is not obtained within `timeout` seconds. If `timeout` is not supplied, it is presumed to be infinite. If `timeout` is `0`, the lock request will fail if it is not immediately acquired.
If you lock the same value on a key from two separate curl sessions they'll both return at the same time.
Here's the API:
**Acquire a lock (with no value) for "customer1"**
```sh
curl -X POST http://127.0.0.1:4001/mod/v2/lock/customer1?ttl=60
```
**Acquire a lock for "customer1" that is associated with the value "bar"**
```sh
curl -X POST http://127.0.0.1:4001/mod/v2/lock/customer1?ttl=60 -d value=bar
```
**Acquire a lock for "customer1" that is associated with the value "bar" only if it is done within 2 seconds**
```sh
curl -X POST http://127.0.0.1:4001/mod/v2/lock/customer1?ttl=60 -d value=bar -d timeout=2
```
**Renew the TTL on the "customer1" lock for index 2**
```sh
curl -X PUT http://127.0.0.1:4001/mod/v2/lock/customer1?ttl=60 -d index=2
```
**Renew the TTL on the "customer1" lock for value "bar"**
```sh
curl -X PUT http://127.0.0.1:4001/mod/v2/lock/customer1?ttl=60 -d value=bar
```
**Retrieve the current value for the "customer1" lock.**
```sh
curl http://127.0.0.1:4001/mod/v2/lock/customer1
```
**Retrieve the current index for the "customer1" lock**
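Presumably via the same endpoint with a `field` query parameter (the parameter name here is an assumption, not confirmed by this document):

```sh
curl http://127.0.0.1:4001/mod/v2/lock/customer1?field=index
```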
etcd's Raft consensus algorithm is most efficient in small clusters between 3 and 9 peers. For clusters larger than 9, etcd will select a subset of instances to participate in the algorithm in order to keep it efficient. The end of this document briefly explores how etcd works internally and why these choices have been made.
## Cluster Management
You can manage the active cluster size through the [cluster config API](https://github.com/coreos/etcd/blob/master/Documentation/api.md#cluster-config). `activeSize` represents the etcd peers allowed to actively participate in the consensus algorithm.
If the total number of etcd instances exceeds this number, additional peers are started as [standbys](https://github.com/coreos/etcd/blob/master/Documentation/design/standbys.md), which can be promoted to active participation if one of the existing active instances has failed or been removed.
## Internals of etcd
### Writing to etcd
Writes to an etcd peer are always redirected to the leader of the cluster and distributed to all of the peers immediately. A write is only considered successful when a majority of the peers acknowledge the write.
For example, in a cluster with 5 peers, a write operation is only as fast as the 3rd fastest machine. This is the main reason for keeping the number of active peers below 9. In practice, you only need to worry about write performance in high latency environments such as a cluster spanning multiple data centers.
### Leader Election
The leader election process is similar to writing a key — a majority of the active peers must acknowledge the new leader before cluster operations can continue. The longer it takes each peer to elect a new leader, the longer you have to wait before you can write to the cluster again. In low latency environments this process takes milliseconds.
### Odd Active Cluster Size
The other important cluster optimization is to always have an odd active cluster size (i.e. `activeSize`). Adding a member to bring the cluster to an odd size doesn't change the size of the majority and therefore doesn't increase the total latency of the majority as described above. But you do gain a higher tolerance for peer failure by adding the extra machine. You can see this in practice when comparing even and odd sized clusters:
| Active Peers | Majority | Failure Tolerance |
|--------------|------------|-------------------|
| 1 peer | 1 peer | None |
| 3 peers | 2 peers | 1 peer |
| 4 peers | 3 peers | 1 peer |
| 5 peers | 3 peers | **2 peers** |
| 6 peers | 4 peers | 2 peers |
| 7 peers | 4 peers | **3 peers** |
| 8 peers | 5 peers | 3 peers |
| 9 peers | 5 peers | **4 peers** |
As you can see, adding another peer to bring the number of active peers up to an odd size is always worth it. During a network partition, an odd number of active peers also guarantees that there will almost always be a majority of the cluster that can continue to operate and be the source of truth when the partition ends.
* [Change the peer urls of a member](#change-the-peer-urls-of-a-member)
## List members
Return an HTTP 200 OK response code and a representation of all members in the etcd cluster.
### Request
```
GET /v2/members HTTP/1.1
```
### Example
```sh
curl http://10.0.0.10:2379/v2/members
```
```json
{
    "members": [
        {
            "id": "272e204152",
            "name": "infra1",
            "peerURLs": [
                "http://10.0.0.10:2380"
            ],
            "clientURLs": [
                "http://10.0.0.10:2379"
            ]
        },
        {
            "id": "2225373f43",
            "name": "infra2",
            "peerURLs": [
                "http://10.0.0.11:2380"
            ],
            "clientURLs": [
                "http://10.0.0.11:2379"
            ]
        }
    ]
}
```
## Add a member
Returns an HTTP 201 response code and the representation of the added member with a newly generated memberID when successful. Returns a string describing the failure condition when unsuccessful.
If the POST body is malformed an HTTP 400 will be returned. If the member exists in the cluster or existed in the cluster at some point in the past an HTTP 409 will be returned. If any of the given peerURLs exists in the cluster an HTTP 409 will be returned. If the cluster fails to process the request within timeout an HTTP 500 will be returned, though the request may be processed later.
## Delete a member

Remove a member from the cluster. The member ID must be a hex-encoded uint64.
Returns 204 with empty content when successful. Returns a string describing the failure condition when unsuccessful.
If the member does not exist in the cluster an HTTP 500(TODO: fix this) will be returned. If the cluster fails to process the request within timeout an HTTP 500 will be returned, though the request may be processed later.
## Change the peer urls of a member

Change the peer urls of a given member. The member ID must be a hex-encoded uint64. Returns 204 with empty content when successful. Returns a string describing the failure condition when unsuccessful.
If the POST body is malformed an HTTP 400 will be returned. If the member does not exist in the cluster an HTTP 404 will be returned. If any of the given peerURLs exists in the cluster an HTTP 409 will be returned. If the cluster fails to process the request within timeout an HTTP 500 will be returned, though the request may be processed later.
etcd can now run as a transparent proxy. Running etcd as a proxy allows for easy discovery of etcd within your infrastructure, since it can run on each machine as a local service. In this mode, etcd acts as a reverse proxy and forwards client requests to an active etcd cluster. The etcd proxy does not participate in the consensus replication of the etcd cluster, thus it neither increases the resilience nor decreases the write performance of the etcd cluster.
etcd currently supports two proxy modes: `readwrite` and `readonly`. The default mode is `readwrite`, which forwards both read and write requests to the etcd cluster. A `readonly` etcd proxy only forwards read requests to the etcd cluster, and returns `HTTP 501` to all write requests.
### Using an etcd proxy
To start etcd in proxy mode, you need to provide three flags: `proxy`, `listen-client-urls`, and `initial-cluster` (or `discovery-url`).
To start a readwrite proxy, set `-proxy on`; to start a readonly proxy, set `-proxy readonly`.
The proxy will listen on `listen-client-urls` and forward requests to the etcd cluster discovered from the `initial-cluster` or the `discovery url`.
#### Start an etcd proxy with a static configuration
To start a proxy that will connect to a statically defined etcd cluster, specify the `initial-cluster` flag:
```
etcd -proxy on -listen-client-urls http://127.0.0.1:8080 -initial-cluster infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380
```
#### Start an etcd proxy with the discovery service
If you bootstrap an etcd cluster using the [discovery service][discovery-service], you can also start the proxy with the same `discovery-url`.
To start a proxy using the discovery service, specify the `discovery-url` flag. The proxy will wait until the etcd cluster defined at the `discovery-url` finishes bootstrapping, and then start to forward the requests.
```
etcd -proxy on -listen-client-urls http://127.0.0.1:8080 -discovery https://discovery.etcd.io/3e86b59982e49066c5d813af1c2e2579cbf573de
```
#### Fallback to proxy mode with discovery service
If you bootstrap an etcd cluster using the [discovery service][discovery-service] with more than the expected number of etcd members, the extra etcd processes will fall back to being `readwrite` proxies by default. They will forward the requests to the cluster as described above. For example, if you create a discovery url with `size=5`, and start ten etcd processes using that same discovery URL, the result will be a cluster with five etcd members and five proxies. Note that this behavior can be disabled with the `proxy-fallback` flag.
etcd comes with support for incremental runtime reconfiguration, which allows users to update the membership of the cluster at run time.
Reconfiguration requests can only be processed when the majority of the cluster members are functioning. It is **highly recommended** to always have a cluster size greater than two in production. It is unsafe to remove a member from a two member cluster. The majority of a two member cluster is also two. If there is a failure during the removal process, the cluster might not be able to make progress and will need to [restart from majority failure][majority failure].
Let us walk through some common reasons for reconfiguring a cluster. Most of these just involve combinations of adding or removing a member, which are explained below under [Cluster Reconfiguration Operations](#cluster-reconfiguration-operations).
### Cycle or Upgrade Multiple Machines
If you need to move multiple members of your cluster due to planned maintenance (hardware upgrades, network downtime, etc.), it is recommended to modify members one at a time.
It is safe to remove the leader, however there is a brief period of downtime while the election process takes place. If your cluster holds more than 50MB, it is recommended to [migrate the member's data directory][member migration].
Increasing the cluster size can enhance [failure tolerance][fault tolerance table] and provide better read performance. Since clients can read from any member, increasing the number of members increases the overall read throughput.
Decreasing the cluster size can improve the write performance of a cluster, with a trade-off of decreased resilience. Writes into the cluster are replicated to a majority of members of the cluster before considered committed. Decreasing the cluster size lowers the majority, and each write is committed more quickly.
If a machine fails due to hardware failure, data directory corruption, or some other fatal situation, it should be replaced as soon as possible. Machines that have failed but haven't been removed adversely affect your quorum and reduce the tolerance for an additional failure.
To replace the machine, follow the instructions for [removing the member][remove member] from the cluster, and then [add a new member][add member] in its place. If your cluster holds more than 50MB, it is recommended to [migrate the failed member's data directory][member migration] if you can still access it.
[remove member]: #remove-a-member
[add member]: #add-a-new-member
### Restart Cluster from Majority Failure
If the majority of your cluster is lost, then you need to take manual action in order to recover safely.
The basic steps in the recovery process include [creating a new cluster using the old data][disaster recovery], forcing a single member to act as the leader, and finally using runtime configuration to [add new members][add member] to this new cluster one at a time.
Now that we have the use cases in mind, let us lay out the operations involved in each.
Before making any change, the simple majority (quorum) of etcd members must be available.
This is essentially the same requirement as for any other write to etcd.
All changes to the cluster are done one at a time:

* To replace a single member you will make an add then a remove operation
* To increase from 3 to 5 members you will make two add operations
* To decrease from 5 to 3 you will make two remove operations
All of these examples will use the `etcdctl` command line tool that ships with etcd.
If you want to use the member API directly you can find the documentation [here](https://github.com/coreos/etcd/blob/master/Documentation/other_apis.md).
### Remove a Member
First, we need to find the target member's ID. You can list all members with `etcdctl`:
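The output looks roughly like the following; the member names, URLs, and all IDs except the one discussed below are illustrative:

```sh
$ etcdctl member list
6e3bd23ae5f1eae0: name=node2 peerURLs=http://localhost:23802 clientURLs=http://127.0.0.1:23792
924e2e83e93f2560: name=node3 peerURLs=http://localhost:23803 clientURLs=http://127.0.0.1:23793
a8266ecf031671f3: name=node1 peerURLs=http://localhost:23801 clientURLs=http://127.0.0.1:23791
```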
Let us say the member ID we want to remove is a8266ecf031671f3.
We then use the `remove` command to perform the removal:
```
$ etcdctl member remove a8266ecf031671f3
Removed member a8266ecf031671f3 from cluster
```
The target member will stop itself at this point and print out the removal in the log:
```
etcd: this member has been permanently removed from the cluster. Exiting.
```
It is safe to remove the leader, however the cluster will be inactive while a new leader is elected. This duration is normally the period of election timeout plus the voting process.
### Add a New Member
Adding a member is a two step process:
* Add the new member to the cluster via the [members API](https://github.com/coreos/etcd/blob/master/Documentation/other_apis.md#post-v2members) or the `etcdctl member add` command.
* Start the new member with the new cluster configuration, including a list of the updated members (existing members + the new member).
Using `etcdctl` let's add the new member to the cluster by specifying its [name](configuration.md#-name) and [advertised peer URLs](configuration.md#-initial-advertise-peer-urls):
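A sketch, with an illustrative member name and peer URL:

```sh
$ etcdctl member add infra3 http://10.0.1.13:2380
```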
The new member will run as a part of the cluster and immediately begin catching up with the rest of the cluster.
If you are adding multiple members the best practice is to configure a single member at a time and verify it starts correctly before adding more new members.
If you add a new member to a 1-node cluster, the cluster cannot make progress before the new member starts because it needs two members as majority to agree on the consensus. You will only see this behavior between the time `etcdctl member add` informs the cluster about the new member and the new member successfully establishing a connection to the existing one.
#### Error Cases
In the following case we have not included our new host in the list of enumerated nodes.
If this is a new cluster, the node must be added to the list of initial cluster members.
etcd supports SSL/TLS as well as authentication through client certificates, both for clients to server as well as peer (server to server / cluster) communication.
To get up and running you first need to have a CA certificate and a signed key pair for one member. It is recommended to create and sign a new key pair for every member in a cluster.

For convenience, the [etcd-ca](https://github.com/coreos/etcd-ca) tool provides an easy interface to certificate generation. Alternatively, this site provides a good reference on how to generate self-signed key pairs:
http://www.g-loaded.eu/2005/11/10/be-your-own-ca/

For testing you can use the certificates in the `fixtures/ca` directory.
## Basic setup
Let's configure etcd to use this keypair. etcd takes several certificate-related configuration options, either through command-line flags or environment variables:
**Client-to-server communication:**
`--cert-file=<path>`: Certificate used for SSL/TLS connections **to** etcd. When this option is set, you can set advertise-client-urls using HTTPS schema.
`--key-file=<path>`: Key for the certificate. Must be unencrypted.
`--ca-file=<path>`: When this is set, etcd will check all incoming HTTPS requests for a client certificate signed by the supplied CA; requests that don't supply a valid client certificate will fail.
The peer options work the same way as the client-to-server options:
`--peer-cert-file=<path>`: Certificate used for SSL/TLS connections between peers. This will be used both for listening on the peer address as well as sending requests to other peers.
`--peer-key-file=<path>`: Key for the certificate. Must be unencrypted.
`--peer-ca-file=<path>`: When set, etcd will check all incoming peer requests from the cluster for valid client certificates signed by the supplied CA.
If either a client-to-server or peer certificate is supplied the key must also be set. All of these configuration options are also available through the environment variables, `ETCD_CA_FILE`, `ETCD_PEER_CA_FILE` and so on.
## Example 1: Client-to-server transport security with HTTPS
For this you need your CA certificate (`ca.crt`) and signed key pair (`server.crt`, `server.key`) ready.
Let us configure etcd to provide simple HTTPS transport security step by step:
* `-f` - forces a new machine configuration, even if an existing configuration is found. (WARNING: data loss!)
* `-cert-file` and `-key-file` specify the location of the cert and key files to be used for transport layer security between the client and server.
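A minimal sketch of such a command; the paths and URLs are illustrative:

```sh
etcd -name infra0 -data-dir infra0 \
  -cert-file=/path/to/server.crt -key-file=/path/to/server.key \
  -advertise-client-urls=https://127.0.0.1:2379 -listen-client-urls=https://127.0.0.1:2379
```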
This should start up fine and you can now test the configuration by speaking HTTPS to etcd:
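For example (paths illustrative; the `--cacert` option is explained below):

```sh
curl --cacert /path/to/ca.crt https://127.0.0.1:2379/v2/keys/foo -XPUT -d value=bar -v
```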
You should be able to see the handshake succeed. Because we use self-signed certificates with our own certificate authorities you need to provide the CA to curl using the `--cacert` option. Another possibility would be to add your CA certificate to the trusted certificates on your system (usually in `/etc/ssl/certs`).
**OSX 10.9+ Users**: curl 7.30.0 on OSX 10.9+ doesn't understand certificates passed in on the command line.
Instead you must import the dummy ca.crt directly into the keychain or add the `-k` flag to curl to ignore errors.
If you want to test without the `-k` flag run `open ./fixtures/ca/ca.crt` and follow the prompts.
Please remove this certificate after you are done testing!
If you know of a workaround let us know.
```
...
SSLv3, TLS handshake, Finished (20):
...
```
## Example 2: Client-to-server authentication with HTTPS client certificates
For now we've given the etcd client the ability to verify the server identity and provide transport security. We can however also use client certificates to prevent unauthorized access to etcd.

And also the response from the etcd server:
```json
{
"action":"set",
"key":"/foo",
"modifiedIndex":3,
"value":"bar"
}
```
The clients will provide their certificates to the server and the server will check whether the cert is signed by the supplied CA and decide whether to serve the request.
You need the same files mentioned in the first example for this, as well as a key pair for the client (`client.crt`, `client.key`) signed by the same certificate authority.
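A sketch of both sides, with illustrative paths: the server is started with `-ca-file` so that it demands client certificates, and curl presents one:

```sh
etcd -name infra0 -data-dir infra0 \
  -ca-file=/path/to/ca.crt -cert-file=/path/to/server.crt -key-file=/path/to/server.key \
  -advertise-client-urls https://127.0.0.1:2379 -listen-client-urls https://127.0.0.1:2379

curl --cacert /path/to/ca.crt --cert /path/to/client.crt --key /path/to/client.key \
  -L https://127.0.0.1:2379/v2/keys/foo -XPUT -d value=bar -v
```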
## Example 3: Transport security & client certificates in a cluster
etcd supports the same model as above for **peer communication**, that means the communication between etcd members in a cluster.
Assuming we have our `ca.crt` and two members with their own keypairs (`member1.crt`&`member1.key`, `member2.crt`&`member2.key`) signed by this CA, we launch etcd as follows:
```sh
DISCOVERY_URL=... # from https://discovery.etcd.io/new
```
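The launch commands themselves are elided above; a sketch of what member1's invocation could look like, using the peer flags documented earlier (the name and address are illustrative, and member2 is launched the same way with its own keypair):

```sh
etcd -name member1 -data-dir member1 \
  -peer-cert-file=member1.crt -peer-key-file=member1.key -peer-ca-file=ca.crt \
  -initial-advertise-peer-urls https://10.0.1.10:2380 \
  -listen-peer-urls https://10.0.1.10:2380 \
  -discovery ${DISCOVERY_URL}
```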
The etcd members will form a cluster and all communication between members in the cluster will be encrypted and authenticated using the client certificates. You will see in the output of etcd that the addresses it connects to use HTTPS.
## Frequently Asked Questions
### I'm seeing a SSLv3 alert handshake failure when using SSL client authentication?
The `crypto/tls` package of `golang` checks the key usage of the certificate public key before using it.
To use the certificate public key to do client auth, we need to add `clientAuth` to `Extended Key Usage` when creating the certificate public key.
Add the following section to your openssl.cnf:
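A sketch of such a section; the section name `ssl_client` is a convention, not mandated:

```
[ ssl_client ]
extendedKeyUsage = clientAuth
```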
When creating the cert be sure to reference it in the `-extensions` flag:
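For example, assuming the `ssl_client` section above and illustrative file paths:

```sh
openssl ca -config openssl.cnf -policy policy_anything -extensions ssl_client \
  -out certs/machine.crt -infiles machine.csr
```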
### With peer certificate authentication I receive "certificate is valid for 127.0.0.1, not $MY_IP"
Make sure that you sign your certificates with a Subject Name that matches your member's public IP address. The `etcd-ca` tool, for example, provides an `--ip=` option for its `new-cert` command.
If you need your certificate to be signed for your member's FQDN in its Subject Name then you could use Subject Alternative Names (short IP SANs) to add your IP address. The `etcd-ca` tool provides a `--domain=` option for its `new-cert` command, and openssl can do [it](http://wiki.cacert.org/FAQ/subjectAltName) too.
The default settings in etcd should work well for installations on a local network where the average network latency is low.
However, when using etcd across multiple data centers or over networks with high latency you may need to tweak the heartbeat interval and election timeout settings.
The network isn't the only source of latency. Each request and response may be impacted by slow disks on both the leader and follower. Each of these timeouts represents the total time from request to successful response from the other machine.
### Time Parameters
The underlying distributed consensus protocol relies on two separate time parameters to ensure that nodes can hand off leadership if one stalls or goes offline.
The first parameter is called the *Heartbeat Interval*.
This is the frequency with which the leader will notify followers that it is still the leader.
etcd batches commands together for higher throughput so this heartbeat interval is also a delay for how long it takes for commands to be committed.
By default, etcd uses a `100ms` heartbeat interval.
The second parameter is the *Election Timeout*.
This timeout is how long a follower node will go without hearing a heartbeat before attempting to become leader itself.
By default, etcd uses a `1000ms` election timeout.
Adjusting these values is a trade off.
Lowering the heartbeat interval will cause individual commands to be committed faster but it will lower the overall throughput of etcd.
You can override the default values on the command line:
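For example, using the flags documented earlier in this guide (the values are illustrative):

```sh
etcd -heartbeat-interval=200 -election-timeout=2000
```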
etcd clusters can be upgraded by doing a rolling upgrade or all at once. We make every effort to test this process, but please be sure to back up your data [with etcd-dump](https://github.com/AaronO/etcd-dump), or make a copy of the data directory beforehand.
## Upgrade Process
- Stop the old etcd processes
- Upgrade the etcd binary
- Restart the etcd instance using the original --name, --address, --peer-address and --data-dir.
## Rolling Upgrade
During an upgrade, etcd clusters are designed to continue working in a mix of old and new versions. It's recommended to converge on the new version quickly. Using new API features before the entire cluster has been upgraded is only supported as a best effort. Each instance's version can be found with `curl http://127.0.0.1:4001/version`.
## All at Once
If downtime is not an issue, the easiest way to upgrade your cluster is to shutdown all of the etcd instances and restart them with the new binary. The current state of the cluster is saved to disk and will be loaded into the cluster when it restarts.
cli.go is a simple, fast, and fun package for building command line apps in Go. The goal is to enable developers to write fast and distributable command line applications in an expressive way.
You can view the API docs here:
http://godoc.org/github.com/codegangsta/cli
## Overview
Command line apps are usually so tiny that there is absolutely no reason why your code should *not* be self-documenting. Things like generating help text and parsing command flags/options should not hinder productivity when writing a command line app.
**This is where cli.go comes into play.** cli.go makes command line programming fun, organized, and expressive!
## Installation
Make sure you have a working Go environment (go 1.1 is *required*). [See the install instructions](http://golang.org/doc/install.html).
To install `cli.go`, simply run:
```
$ go get github.com/codegangsta/cli
```
Make sure your `PATH` includes the `$GOPATH/bin` directory so your commands can be easily used:
```
export PATH=$PATH:$GOPATH/bin
```
## Getting Started
One of the philosophies behind cli.go is that an API should be playful and full of discovery. So a cli.go app can be as little as one line of code in `main()`.
``` go
package main

import (
  "os"

  "github.com/codegangsta/cli"
)

func main() {
  cli.NewApp().Run(os.Args)
}
```
This app will run and show help text, but is not very useful. Let's give an action to execute and some help documentation:
``` go
package main

import (
  "os"

  "github.com/codegangsta/cli"
)

func main() {
  app := cli.NewApp()
  app.Name = "boom"
  app.Usage = "make an explosive entrance"
  app.Action = func(c *cli.Context) {
    println("boom! I say!")
  }
  app.Run(os.Args)
}
```
Running this already gives you a ton of functionality, plus support for things like subcommands and flags, which are covered below.
## Example
Being a programmer can be a lonely job. Thankfully by the power of automation that is not the case! Let's create a greeter app to fend off our demons of loneliness!
``` go
/* greet.go */
package main

import (
  "os"

  "github.com/codegangsta/cli"
)

func main() {
  app := cli.NewApp()
  app.Name = "greet"
  app.Usage = "fight the loneliness!"
  app.Action = func(c *cli.Context) {
    println("Hello friend!")
  }
  app.Run(os.Args)
}
```
Install our command to the `$GOPATH/bin` directory:
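With a working Go environment this is simply:

```
$ go install
```

Running the installed command with `help` then shows the generated help text; the tail of that output follows: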
```
help, h Shows a list of commands or help for one command

GLOBAL OPTIONS
--version Shows version information
```
### Arguments
You can lookup arguments by calling the `Args` function on `cli.Context`.
``` go
...
app.Action = func(c *cli.Context) {
  println("Hello", c.Args()[0])
}
...
```
### Flags
Setting and querying flags is simple.
``` go
...
app.Flags = []cli.Flag{
  cli.StringFlag{
    Name:  "lang",
    Value: "english",
    Usage: "language for the greeting",
  },
}
app.Action = func(c *cli.Context) {
  name := "someone"
  if len(c.Args()) > 0 {
    name = c.Args()[0]
  }
  if c.String("lang") == "spanish" {
    println("Hola", name)
  } else {
    println("Hello", name)
  }
}
...
```
#### Alternate Names
You can set alternate (or short) names for flags by providing a comma-delimited list for the `Name`. e.g.
``` go
app.Flags = []cli.Flag{
  cli.StringFlag{
    Name:  "lang, l",
    Value: "english",
    Usage: "language for the greeting",
  },
}
```
#### Values from the Environment
You can also have the default value set from the environment via `EnvVar`. e.g.
``` go
app.Flags = []cli.Flag{
  cli.StringFlag{
    Name:   "lang, l",
    Value:  "english",
    Usage:  "language for the greeting",
    EnvVar: "APP_LANG",
  },
}
```
That flag can then be set with `--lang spanish` or `-l spanish`. Note that giving two different forms of the same flag in the same command invocation is an error.
### Subcommands
Subcommands can be defined for a more git-like command line app.
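A sketch of the shape this takes (the command's name and behavior are illustrative):

``` go
app.Commands = []cli.Command{
  {
    Name:      "add",
    ShortName: "a",
    Usage:     "add a task to the list",
    Action: func(c *cli.Context) {
      println("added task: ", c.Args().First())
    },
  },
}
```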
Feel free to put up a pull request to fix a bug or maybe add a feature. I will give it a code review and make sure that it does not break backwards compatibility. If I or any other collaborators agree that it is in line with the vision of the project, we will work with you to get the code into a mergeable state and merge it into the master branch.
If you have contributed something significant to the project, I will most likely add you as a collaborator. As a collaborator you are given the ability to merge others' pull requests. It is very important that new code does not break existing code, so be careful about what code you do choose to merge. If you have any questions feel free to link @codegangsta to the issue in question and we can review it together.
If you feel like you have contributed to the project but have not yet been added as a collaborator, I probably forgot to add you. Hit @codegangsta up over email and we will get it figured out.
## About
cli.go is written by none other than the [Code Gangsta](http://codegangsta.io)