Compare commits

..

39 Commits

Author SHA1 Message Date
05b564a394 *: bump to v2.2.3 2015-12-30 13:41:16 -08:00
cb779b2305 etcdctl: fix syncWithPeerAPI by breaking the loop when there is no error 2015-12-30 11:24:27 -08:00
22c3208fb3 etcdserver: always check if the data dir is writable before starting etcd 2015-12-30 10:22:52 -08:00
e44372e430 etcdsever: avoid creating member dir before finishing validate bootstrap
This commit fixes the issue of creating member dir before validating
the configuration. When member dir exists, it indicates the local etcd
process is a valid etcd member. So we should only create member dir
after we finish configuration validation, joining validation or
discovery validation.
2015-12-30 10:19:51 -08:00
05a90bc1e5 etcdmain: fix incomplete proxy config file
etcd might generate incomplete proxy config file after a power failure.
It is because we use ioutil.WriteFile. And iotuile.WriteFile does
not call Sync before closing the file.
2015-12-22 12:30:39 -08:00
6751727809 etcdctl: support basic operations with etcd 0.4.
For CoreOS users, they will get a updated version of etcdctl without updating
the etcd server version. And the users cannot really control this behavior.
We do not want to suddenly break them without enough communication.

So we still want the most basic opeartions like get, set, watch of etcdctl2 work
with etcd 0.4. This patches solve the incompability issue.
2015-12-22 12:29:16 -08:00
916106c3a2 client: support reset Endpoints.
ResetEndpoints is useful when the there is a scheduled cluster
changes or when manually manage the cluster without auto-sync
enabled.
2015-12-22 12:25:59 -08:00
e0c7768f94 store: fix data race when modify event in watchHub.
The event got from watchHub should be considered as readonly.
To modify it, we first need to get a clone of it or there might
be a data race.
2015-12-14 14:08:39 -08:00
0fb2d5d4d3 client: fix goroutine leak in unreleased context
If headerTimeout is not zero then two context are created but only one is released.
cancelCtx in this case is never released which leads to goroutine leak inside it.
2015-12-14 13:58:43 -08:00
fc61fc7c7a etcdctl: cluster health exit with non-zero when cluster is unhealthy 2015-12-14 13:58:10 -08:00
09b81bad15 *: bump to v2.2.2+git 2015-11-19 14:24:59 -08:00
b4bddf685b *: bump to v2.2.2 2015-11-19 14:24:35 -08:00
af1c711270 auth: use canonical path for pre-defined guest role 2015-11-19 13:41:27 -08:00
c269426be8 *: fix various data races detected by race detector
Conflicts:
	rafthttp/transport.go
2015-11-19 13:38:28 -08:00
20b7df3c12 rafthttp: fix data races detected by go race detector
Conflicts:
	rafthttp/pipeline.go
2015-11-19 13:34:49 -08:00
e342de3cc5 etcdmain: fix parsing discovery error
The discovery error is wrapped into a struct now, and cannot be compared
to predefined errors. Correct the comparison behavior to fix the
problem.
2015-11-19 13:26:50 -08:00
26cc2111cd etcdmain: improve log when join discovery fails
Before this PR, the log is

```
2015/09/1 13:18:31 etcdmain: client: etcd cluster is unavailable or
misconfigured
```

It is quite hard for people to understand what happens.

Now we print out the exact reason for the failure, and explains the way
to handle it.
2015-11-19 13:26:42 -08:00
5d6457e658 godeps: bump coreos/pkg/capnslog
Update to catch coreos/pkg#43 which should fix SYSLOG_IDENTIFIER getting
set when etcd is logging to the journal.
2015-11-19 13:26:29 -08:00
53bc644168 client: regenerate code to unmarshal key response
Regenerate code for unmarshaling key response with a new version of
ugorji/go/codec.
2015-11-19 13:26:19 -08:00
ad3bb484ca Godeps: update ugorji/go/codec dependency
Update ugorji/go/codec dependency to the newer version.
2015-11-19 13:26:08 -08:00
15f7b736e4 etcdctl:fix health check condition 2015-11-19 13:25:57 -08:00
4dc835c718 *: bump to v2.2.1+git 2015-10-15 21:59:15 -07:00
75f8282eef *: bump to v2.2.1 2015-10-15 21:31:51 -07:00
45c86af0eb etcdctl/command: mk command with PrevNoExist
This attempts to fix #3676. `PrevNoExist` checks if the key previously exists
and if so, it returns an error, which is how `mk` command is supposed to work.
The previous code ignores the previous key and overwrites with the later value.

/cc @yichengq
2015-10-15 15:26:52 -07:00
71e5467807 Godeps: update prometheus dependency
prometheus updates its directory layout
(https://github.com/prometheus/client_golang#where-is-model-extraction-and-text)
and makes Godeps restore/save unable to work.

Remove all prometheus dependency manually and godep save again to fix
this problem.
2015-10-15 15:26:42 -07:00
0169fec873 client: regenerate code to unmarshal key response
Regenerate code for unmarshaling key response with a new version of
ugorji/go/codec
2015-10-15 15:24:21 -07:00
766023b1b0 Godeps: update ugorji/go/codec dependency
Update ugorji/go/codec dependency to the newer version (a bunch of fixed were made).
2015-10-15 15:24:12 -07:00
ca9e63dde2 pkg/types: fix unwanted unescape in NewURLsMap
We use url.ParseQuery to parse names-to-urls string, but it has side
effect that unescape the string. If the initial-cluster string has ipv6
which contains `%25`, it will unescape it to `%` and make further url
parse failed.

Fix it by modifiying the parse process.

Go1.4 doesn't support literal IPv6 address w/ zone in
URI(https://github.com/golang/go/issues/6530), so we only enable tests
in Go1.5+.
2015-10-15 15:24:00 -07:00
7659bbb1b2 etcdmain: print out error and suggestion for fixing notify issue 2015-10-15 15:23:49 -07:00
f8b98d3925 etcdhttp: add Content-Type: application/json header to version handler 2015-10-15 15:23:34 -07:00
9ee3ed777b etcdmain: exit after print out ErrDuplicateID
etcd should exit after printing log for unhandlable error.
2015-10-15 15:23:24 -07:00
c9bd125490 etcdsever: mismatch error uses the same format as the corresponding flags 2015-10-15 15:22:52 -07:00
ec49496111 proxy: improve log for retrying an unavailable endpoint
Fixes #3541

Signed-off-by: Guohua ouyang <guohuaouyang@gmail.com>
2015-10-15 15:22:40 -07:00
baaefd18e2 etcdmain: better logging when user forget to set initial flags 2015-10-15 15:22:29 -07:00
72c18eb7ba etcdctl: use a context with -total-timeout in simple commands
Like the commit 8ebc933111, this commit lets simple etcdctl commands
use a context with timeout value passed via -total-timeout.

This commit doesn't change complex commands like watch,
cluster-health, and import because it is not obvious that using the
context in the commands is good or not.
2015-10-15 15:22:13 -07:00
2e87d71bc6 etcdctl: use user specified timeout value for entire command execution
etcdctl should be capable to use a user specified timeout value for
total command execution, not only per request timeout. This commit
adds a new option --total-timeout to the command. The value passed via
this option is used as a timeout value of entire command execution.

Fixes coreos#3517
2015-10-15 15:21:38 -07:00
217dccd617 raft: improve panic error message
Give a human being some insight into how we might have gotten to this
state based on feedback from #3504.
2015-10-15 15:20:12 -07:00
3ceb5dd270 client: add Nodes to codecgen and regenerate 2015-10-15 15:19:58 -07:00
49b77a59cf client: add Nodes type to faciliate sorting
This helps users to sort easily.
2015-10-15 15:19:52 -07:00
1024 changed files with 37381 additions and 96950 deletions

1
.gitignore vendored
View File

@ -9,4 +9,3 @@
*.swp
/hack/insta-discovery/.env
*.test
tools/functional-tester/docker/bin

View File

@ -1,4 +1,4 @@
// Copyright 2016 CoreOS, Inc.
// Copyright 2014 CoreOS, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.

View File

@ -1,25 +1,11 @@
language: go
sudo: false
go:
- 1.4
- 1.5
- 1.6
- tip
matrix:
allow_failures:
- go: tip
addons:
apt:
packages:
- libpcap-dev
- libaspell-dev
- libhunspell-dev
before_install:
- go get -v github.com/chzchzchz/goword
install:
- go get github.com/barakmich/go-nyet
script:
- ./test
- INTEGRATION=y ./test

View File

@ -35,7 +35,7 @@ Thanks for your contributions!
### Code style
The coding style suggested by the Golang community is used in etcd. See the [style doc](https://github.com/golang/go/wiki/CodeReviewComments) for details.
The coding style suggested by the Golang community is used in etcd. See the [style doc](https://code.google.com/p/go-wiki/wiki/CodeReviewComments) for details.
Please follow this style to make etcd easy to review, maintain and develop.

View File

@ -1,4 +1,4 @@
# Snapshot Migration
## Snapshot Migration
You can migrate a snapshot of your data from a v0.4.9+ cluster into a new etcd 2.2 cluster using a snapshot migration. After snapshot migration, the etcd indexes of your data will change. Many etcd applications rely on these indexes to behave correctly. This operation should only be done while all etcd applications are stopped.
@ -15,7 +15,7 @@ etcdctl --endpoint new_cluster.example.com import --snap backup.snap
```
If you have a large amount of data, you can specify more concurrent works to copy data in parallel by using `-c` flag.
If you have hidden keys to copy, you can use `--hidden` flag to specify. For example fleet uses `/_coreos.com/fleet` so to import those keys use `--hidden /_coreos.com`.
If you have hidden keys to copy, you can use `--hidden` flag to specify.
And the data will quickly copy into the new cluster:

View File

@ -1,8 +1,8 @@
# Administration
## Administration
## Data Directory
### Data Directory
### Lifecycle
#### Lifecycle
When first started, etcd stores its configuration into a data directory specified by the data-dir configuration parameter.
Configuration is stored in the write ahead log and includes: the local member ID, cluster ID, and initial cluster configuration.
@ -18,7 +18,9 @@ Using an out-of-date data directory can lead to inconsistency as the member had
For maximum safety, if an etcd member suffers any sort of data corruption or loss, it must be removed from the cluster.
Once removed the member can be re-added with an empty data directory.
### Contents
[remove-a-member]: runtime-configuration.md#remove-a-member
#### Contents
The data directory has two sub-directories in it:
@ -27,18 +29,21 @@ The data directory has two sub-directories in it:
If `--wal-dir` flag is set, etcd will write the write ahead log files to the specified directory instead of data directory.
## Cluster Management
[wal-pkg]: http://godoc.org/github.com/coreos/etcd/wal
[snap-pkg]: http://godoc.org/github.com/coreos/etcd/snap
### Lifecycle
### Cluster Management
#### Lifecycle
If you are spinning up multiple clusters for testing it is recommended that you specify a unique initial-cluster-token for the different clusters.
This can protect you from cluster corruption in case of mis-configuration because two members started with different cluster tokens will refuse members from each other.
### Monitoring
#### Monitoring
It is important to monitor your production etcd cluster for healthy information and runtime metrics.
#### Health Monitoring
##### Health Monitoring
At lowest level, etcd exposes health information via HTTP at `/health` in JSON format. If it returns `{"health": "true"}`, then the cluster is healthy. Please note the `/health` endpoint is still an experimental one as in etcd 2.2.
@ -58,16 +63,16 @@ member fd422379fda50e48 is healthy: got healthy result from http://127.0.0.1:323
cluster is healthy
```
#### Runtime Metrics
##### Runtime Metrics
etcd uses [Prometheus][prometheus] for metrics reporting in the server. You can read more through the runtime metrics [doc][metrics].
etcd uses [Prometheus](http://prometheus.io/) for metrics reporting in the server. You can read more through the runtime metrics [doc](metrics.md).
### Debugging
#### Debugging
Debugging a distributed system can be difficult. etcd provides several ways to make debug
easier.
#### Enabling Debug Logging
##### Enabling Debug Logging
When you want to debug etcd without stopping it, you can enable debug logging at runtime.
etcd exposes logging configuration at `/config/local/log`.
@ -80,7 +85,7 @@ $ curl http://127.0.0.1:2379/config/local/log -XPUT -d '{"Level":"INFO"}'
$ # debug logging disabled
```
#### Debugging Variables
##### Debugging Variables
Debug variables are exposed for real-time debugging purposes. Developers who are familiar with etcd can utilize these variables to debug unexpected behavior. etcd exposes debug variables via HTTP at `/debug/vars` in JSON format. The debug variables contains
`cmdline`, `file_descriptor_limit`, `memstats` and `raft.status`.
@ -89,7 +94,7 @@ Debug variables are exposed for real-time debugging purposes. Developers who are
`file_descriptor_limit` is the max number of file descriptors etcd can utilize.
`memstats` is explained in detail in the [Go runtime documentation][golang-memstats].
`memstats` is well explained [here](http://golang.org/pkg/runtime/#MemStats).
`raft.status` is useful when you want to debug low level raft issues if you are familiar with raft internals. In most cases, you do not need to check `raft.status`.
@ -102,7 +107,7 @@ Debug variables are exposed for real-time debugging purposes. Developers who are
}
```
### Optimal Cluster Size
#### Optimal Cluster Size
The recommended etcd cluster size is 3, 5 or 7, which is decided by the fault tolerance requirement. A 7-member cluster can provide enough fault tolerance in most cases. While larger cluster provides better fault tolerance the write performance reduces since data needs to be replicated to more machines.
@ -125,7 +130,7 @@ As you can see, adding another member to bring the size of cluster up to an odd
#### Changing Cluster Size
After your cluster is up and running, adding or removing members is done via [runtime reconfiguration][runtime-reconfig], which allows the cluster to be modified without downtime. The `etcdctl` tool has `member list`, `member add` and `member remove` commands to complete this process.
After your cluster is up and running, adding or removing members is done via [runtime reconfiguration](runtime-configuration.md#cluster-reconfiguration-operations), which allows the cluster to be modified without downtime. The `etcdctl` tool has a `member list`, `member add` and `member remove` commands to complete this process.
### Member Migration
@ -133,10 +138,10 @@ When there is a scheduled machine maintenance or retirement, you might want to m
The data directory contains all the data to recover a member to its point-in-time state. To migrate a member:
* Stop the member process.
* Copy the data directory of the now-idle member to the new machine.
* Update the peer URLs for the replaced member to reflect the new machine according to the [runtime reconfiguration instructions][update-member].
* Start etcd on the new machine, using the same configuration and the copy of the data directory.
* Stop the member process
* Copy the data directory of the now-idle member to the new machine
* Update the peer URLs for that member to reflect the new machine according to the [runtime configuration] [change peer url]
* Start etcd on the new machine, using the same configuration and the copy of the data directory
This example will walk you through the process of migrating the infra1 member to a new machine:
@ -147,7 +152,7 @@ This example will walk you through the process of migrating the infra1 member to
|infra2|10.0.1.12:2380|
```sh
$ export ETCDCTL_ENDPOINT=http://10.0.1.10:2379,http://10.0.1.11:2379,http://10.0.1.12:2379
$ export ETCDCTL_PEERS=http://10.0.1.10:2379,http://10.0.1.11:2379,http://10.0.1.12:2379
```
```sh
@ -207,6 +212,8 @@ etcd -name infra1 \
-advertise-client-urls http://10.0.1.13:2379,http://127.0.0.1:2379
```
[change peer url]: runtime-configuration.md#update-a-member
### Disaster Recovery
etcd is designed to be resilient to machine failures. An etcd cluster can automatically recover from any number of temporary failures (for example, machine reboots), and a cluster of N members can tolerate up to _(N-1)/2_ permanent failures (where a member can no longer access the cluster, due to hardware failure or disk corruption). However, in extreme circumstances, a cluster might permanently lose enough members such that quorum is irrevocably lost. For example, if a three-node cluster suffered two simultaneous and unrecoverable machine failures, it would be normally impossible for the cluster to restore quorum and continue functioning.
@ -253,9 +260,9 @@ Once you have verified that etcd has started successfully, shut it down and move
#### Restoring the cluster
Now that the node is running successfully, [change its advertised peer URLs][update-member], as the `--force-new-cluster` option has set the peer URL to the default listening on localhost.
Now that if the node is running successfully, you should [change its advertised peer URLs](runtime-configuration.md#update-a-member), as the `--force-new-cluster` has set the peer URL to the default (listening on localhost).
You can then add more nodes to the cluster and restore resiliency. See the [add a new member][add-a-member] guide for more details. **NB:** If you are trying to restore your cluster using old failed etcd nodes, please make sure you have stopped old etcd instances and removed their old data directories specified by the data-dir configuration parameter.
You can then add more nodes to the cluster and restore resiliency. See the [add a new member](runtime-configuration.md#add-a-new-member) guide for more details. **NB:** If you are trying to restore your cluster using old failed etcd nodes, please make sure you have stopped old etcd instances and removed their old data directories specified by the data-dir configuration parameter.
### Client Request Timeout
@ -286,18 +293,6 @@ If timeout happens several times continuously, administrators should check statu
#### Maximum OS threads
By default, etcd uses the default configuration of the Go 1.4 runtime, which means that at most one operating system thread will be used to execute code simultaneously. (Note that this default behavior [has changed in Go 1.5][golang1.5-runtime]).
By default, etcd uses the default configuration of the Go 1.4 runtime, which means that at most one operating system thread will be used to execute code simultaneously. (Note that this default behavior [may change in Go 1.5](https://docs.google.com/document/d/1At2Ls5_fhJQ59kDK2DFVhFu3g5mATSXqqV5QrxinasI/edit)).
When using etcd in heavy-load scenarios on machines with multiple cores it will usually be desirable to increase the number of threads that etcd can utilize. To do this, simply set the environment variable GOMAXPROCS to the desired number when starting etcd. For more information on this variable, see the [Go runtime documentation][golang-runtime].
[add-a-member]: runtime-configuration.md#add-a-new-member
[golang1.5-runtime]: https://golang.org/doc/go1.5#runtime
[golang-memstats]: https://golang.org/pkg/runtime/#MemStats
[golang-runtime]: https://golang.org/pkg/runtime
[metrics]: metrics.md
[prometheus]: http://prometheus.io/
[remove-a-member]: runtime-configuration.md#remove-a-member
[runtime-reconfig]: runtime-configuration.md#cluster-reconfiguration-operations
[snap-pkg]: http://godoc.org/github.com/coreos/etcd/snap
[update-a-member]: runtime-configuration.md#update-a-member
[wal-pkg]: http://godoc.org/github.com/coreos/etcd/wal
When using etcd in heavy-load scenarios on machines with multiple cores it will usually be desirable to increase the number of threads that etcd can utilize. To do this, simply set the environment variable `GOMAXPROCS` to the desired number when starting etcd. For more information on this variable, see the Go [runtime](https://golang.org/pkg/runtime) documentation.

View File

@ -78,9 +78,12 @@ X-Raft-Index: 5398
X-Raft-Term: 1
```
* `X-Etcd-Index` is the current etcd index as explained above. When request is a watch on key space, `X-Etcd-Index` is the current etcd index when the watch starts, which means that the watched event may happen after `X-Etcd-Index`.
* `X-Raft-Index` is similar to the etcd index but is for the underlying raft protocol.
* `X-Raft-Term` is an integer that will increase whenever an etcd master election happens in the cluster. If this number is increasing rapidly, you may need to tune the election timeout. See the [tuning][tuning] section for details.
- `X-Etcd-Index` is the current etcd index as explained above. When request is a watch on key space, `X-Etcd-Index` is the current etcd index when the watch starts, which means that the watched event may happen after `X-Etcd-Index`.
- `X-Raft-Index` is similar to the etcd index but is for the underlying raft protocol
- `X-Raft-Term` is an integer that will increase whenever an etcd master election happens in the cluster. If this number is increasing rapidly, you may need to tune the election timeout. See the [tuning][tuning] section for details.
[tuning]: tuning.md
### Get the value of a key
@ -231,50 +234,6 @@ curl http://127.0.0.1:2379/v2/keys/foo -XPUT -d value=bar -d ttl= -d prevExist=t
}
```
### Refreshing key TTL
Keys in etcd can be refreshed without notifying watchers
this can be achieved by setting the refresh to true when updating a TTL
You cannot update the value of a key when refreshing it
```sh
curl http://127.0.0.1:2379/v2/keys/foo -XPUT -d value=bar -d ttl=5
curl http://127.0.0.1:2379/v2/keys/foo -XPUT -d ttl=5 -d refresh=true -d prevExist=true
```
```json
{
"action": "set",
"node": {
"createdIndex": 5,
"expiration": "2013-12-04T12:01:21.874888581-08:00",
"key": "/foo",
"modifiedIndex": 5,
"ttl": 5,
"value": "bar"
}
}
{
"action":"update",
"node":{
"key":"/foo",
"value":"bar",
"expiration": "2013-12-04T12:01:26.874888581-08:00",
"ttl":5,
"modifiedIndex":6,
"createdIndex":5
},
"prevNode":{
"key":"/foo",
"value":"bar",
"expiration":"2013-12-04T12:01:21.874888581-08:00",
"ttl":3,
"modifiedIndex":5,
"createdIndex":5
}
}
```
### Waiting for a change
@ -400,7 +359,7 @@ curl 'http://127.0.0.1:2379/v2/keys/foo?wait=true&waitIndex=2008'
#### Connection being closed prematurely
The server may close a long polling connection before emitting any events.
This can happen due to a timeout or the server being shutdown.
This can happend due to a timeout or the server being shutdown.
Since the HTTP header is sent immediately upon accepting the connection, the response will be seen as empty: `200 OK` and empty body.
The clients should be prepared to deal with this scenario and retry the watch.
@ -541,15 +500,13 @@ etcd can be used as a centralized coordination service in a cluster, and `Compar
This command will set the value of a key only if the client-provided conditions are equal to the current conditions.
*Note that `CompareAndSwap` does not work with [directories][directories]. If an attempt is made to `CompareAndSwap` a directory, a 102 "Not a file" error will be returned.*
The current comparable conditions are:
1. `prevValue` - checks the previous value of the key.
2. `prevIndex` - checks the previous modifiedIndex of the key.
3. `prevExist` - checks existence of the key: if `prevExist` is true, it is an `update` request; if `prevExist` is `false`, it is a `create` request.
3. `prevExist` - checks existence of the key: if `prevExist` is true, it is an `update` request; if prevExist is `false`, it is a `create` request.
Here is a simple example.
Let's create a key-value pair first: `foo=one`.
@ -628,8 +585,6 @@ We successfully changed the value from "one" to "two" since we gave the correct
This command will delete a key only if the client-provided conditions are equal to the current conditions.
*Note that `CompareAndDelete` does not work with [directories]. If an attempt is made to `CompareAndDelete` a directory, a 102 "Not a file" error will be returned.*
The current comparable conditions are:
1. `prevValue` - checks the previous value of the key.
@ -1093,7 +1048,6 @@ curl http://127.0.0.1:2379/v2/stats/self
### Store Statistics
The store statistics include information about the operations that this node has handled.
Note that v2 `store Statistics` is stored in-memory. When a member stops, store statistics will reset on restart.
Operations that modify the store's state like create, delete, set and update are seen by the entire cluster and the number will increase on all nodes.
Operations like get and watch are node local and will only be seen on this node.
@ -1123,8 +1077,6 @@ curl http://127.0.0.1:2379/v2/stats/store
## Cluster Config
See the [members API][members-api] for details on the cluster management.
See the [other etcd APIs][other-apis] for details on the cluster management.
[directories]: #listing-a-directory
[members-api]: members_api.md
[tuning]: tuning.md
[other-apis]: other_apis.md

View File

@ -1,92 +0,0 @@
# etcd3 API
TODO: API doc
## Data Model
etcd is designed to reliably store infrequently updated data and provide reliable watch queries. etcd exposes previous versions of key-value pairs to support inexpensive snapshots and watch history events (“time travel queries”). A persistent, multi-version, concurrency-control data model is a good fit for these use cases.
etcd stores data in a multiversion [persistent][persistent-ds] key-value store. The persistent key-value store preserves the previous version of a key-value pair when its value is superseded with new data. The key-value store is effectively immutable; its operations do not update the structure in-place, but instead always generates a new updated structure. All past versions of keys are still accessible and watchable after modification. To prevent the data store from growing indefinitely over time from maintaining old versions, the store may be compacted to shed the oldest versions of superseded data.
### Logical View
The stores logical view is a flat binary key space. The key space has a lexically sorted index on byte string keys so range queries are inexpensive.
The key space maintains multiple revisions. Each atomic mutative operation (e.g., a transaction operation may contain multiple operations) creates a new revision on the key space. All data held by previous revisions remains unchanged. Old versions of key can still be accessed through previous revisions. Likewise, revisions are indexed as well; ranging over revisions with watchers is efficient. If the store is compacted to recover space, revisions before the compact revision will be removed.
A keys lifetime spans a generation. Each key may have one or multiple generations. Creating a key increments the generation of that key, starting at 1 if the key never existed. Deleting a key generates a key tombstone, concluding the keys current generation. Each modification of a key creates a new version of the key. Once a compaction happens, any generation ended before the given revision will be removed and values set before the compaction revision except the latest one will be removed.
### Physical View
etcd stores the physical data as key-value pairs in a persistent [b+tree][b+tree]. Each revision of the stores state only contains the delta from its previous revision to be efficient. A single revision may correspond to multiple keys in the tree.
The key of key-value pair is a 3-tuple (major, sub, type). Major is the store revision holding the key. Sub differentiates among keys within the same revision. Type is an optional suffix for special value (e.g., `t` if the value contains a tombstone). The value of the key-value pair contains the modification from previous revision, thus one delta from previous revision. The b+tree is ordered by key in lexical byte-order. Ranged lookups over revision deltas are fast; this enables quickly finding modifications from one specific revision to another. Compaction removes out-of-date keys-value pairs.
etcd also keeps a secondary in-memory [btree][btree] index to speed up range queries over keys. The keys in the btree index are the keys of the store exposed to user. The value is a pointer to the modification of the persistent b+tree. Compaction removes dead pointers.
## KV API Guarantees
etcd is a consistent and durable key value store with mini-transaction(TODO: link to txn doc when we have it) support. The key value store is exposed through the KV APIs. etcd tries to ensure the strongest consistency and durability guarantees for a distributed system. This specification enumerates the KV API guarantees made by etcd.
### APIs to consider
* Read APIs
* range
* watch
* Write APIs
* put
* delete
* Combination (read-modify-write) APIs
* txn
### etcd Specific Definitions
#### operation completed
An etcd operation is considered complete when it is committed through consensus, and therefore “executed” -- permanently stored -- by the etcd storage engine. The client knows an operation is completed when it receives a response from the etcd server. Note that the client may be uncertain about the status of an operation if it times out, or there is a network disruption between the client and the etcd member. etcd may also abort operations when there is a leader election. etcd does not send `abort` responses to clients outstanding requests in this event.
#### revision
An etcd operation that modifies the key value store is assigned with a single increasing revision. A transaction operation might modifies the key value store multiple times, but only one revision is assigned. The revision attribute of a key value pair that modified by the operation has the same value as the revision of the operation. The revision can be used as a logical clock for key value store. A key value pair that has a larger revision is modified after a key value pair with a smaller revision. Two key value pairs that have the same revision are modified by an operation "concurrently".
### Guarantees Provided
#### Atomicity
All API requests are atomic; an operation either completes entirely or not at all. For watch requests, all events generated by one operation will be in one watch response. Watch never observes partial events for a single operation.
#### Consistency
All API calls ensure [sequential consistency][seq_consistency], the strongest consistency guarantee available from distributed systems. No matter which etcd member server a client makes requests to, a client reads the same events in the same order. If two members complete the same number of operations, the state of the two members is consistent.
For watch operations, etcd guarantees to return the same value for the same key across all members for the same revision. For range operations, etcd has a similar guarantee for [linearized][Linearizability] access; serialized access may be behind the quorum state, so that the later revision is not yet available.
As with all distributed systems, it is impossible for etcd to ensure [strict consistency][strict_consistency]. etcd does not guarantee that it will return to a read the “most recent” value (as measured by a wall clock when a request is completed) available on any cluster member.
#### Isolation
etcd ensures [serializable isolation][serializable_isolation], which is the highest isolation level available in distributed systems. Read operations will never observe any intermediate data.
#### Durability
Any completed operations are durable. All accessible data is also durable data. A read will never return data that has not been made durable.
#### Linearizability
Linearizability (also known as Atomic Consistency or External Consistency) is a consistency level between strict consistency and sequential consistency.
For linearizability, suppose each operation receives a timestamp from a loosely synchronized global clock. Operations are linearized if and only if they always complete as though they were executed in a sequential order and each operation appears to complete in the order specified by the program. Likewise, if an operations timestamp precedes another, that operation must also precede the other operation in the sequence.
For example, consider a client completing a write at time point 1 (*t1*). A client issuing a read at *t2* (for *t2* > *t1*) should receive a value at least as recent as the previous write, completed at *t1*. However, the read might actually complete only by *t3*, and the returned value, current at *t2* when the read began, might be "stale" by *t3*.
etcd does not ensure linearizability for watch operations. Users are expected to verify the revision of watch responses to ensure correct ordering.
etcd ensures linearizability for all other operations by default. Linearizability comes with a cost, however, because linearized requests must go through the Raft consensus process. To obtain lower latencies and higher throughput for read requests, clients can configure a requests consistency mode to `serializable`, which may access stale data with respect to quorum, but removes the performance penalty of linearized accesses' reliance on live consensus.
[persistent-ds]: [https://en.wikipedia.org/wiki/Persistent_data_structure]
[btree]: [https://en.wikipedia.org/wiki/B-tree]
[b+tree]: [https://en.wikipedia.org/wiki/B%2B_tree]
[seq_consistency]: [https://en.wikipedia.org/wiki/Consistency_model#Sequential_consistency]
[strict_consistency]: [https://en.wikipedia.org/wiki/Consistency_model#Strict_consistency]
[serializable_isolation]: [https://en.wikipedia.org/wiki/Isolation_(database_systems)#Serializable]
[Linearizability]: [#Linearizability]

View File

@ -19,7 +19,7 @@ Each role has exact one associated Permission List. An permission list exists fo
The special static ROOT (named `root`) role has a full permissions on all key-value resources, the permission to manage user resources and settings resources. Only the ROOT role has the permission to manage user resources and modify settings resources. The ROOT role is built-in and does not need to be created.
There is also a special GUEST role, named 'guest'. These are the permissions given to unauthenticated requests to etcd. This role will be created automatically, and by default allows access to the full keyspace due to backward compatibility. (etcd did not previously authenticate any actions.). This role can be modified by a ROOT role holder at any time, to reduce the capabilities of unauthenticated users.
There is also a special GUEST role, named 'guest'. These are the permissions given to unauthenticated requests to etcd. This role will be created automatically, and by default allows access to the full keyspace due to backward compatability. (etcd did not previously authenticate any actions.). This role can be modified by a ROOT role holder at any time, to reduce the capabilities of unauthenticated users.
#### Permissions
@ -40,7 +40,7 @@ Specific settings for the cluster as a whole. This can include adding and removi
## v2 Auth
### Basic Auth
We only support [Basic Auth][basic-auth] for the first version. Client needs to attach the basic auth to the HTTP Authorization Header.
We only support [Basic Auth](http://en.wikipedia.org/wiki/Basic_access_authentication) for the first version. Client needs to attach the basic auth to the HTTP Authorization Header.
### Authorization field for operations
Added to requests to /v2/keys, /v2/auth
@ -124,7 +124,7 @@ The User JSON object is formed as follows:
Password is only passed when necessary.
**Get a List of Users**
**Get a list of users**
GET/HEAD /v2/auth/users
@ -137,36 +137,7 @@ GET/HEAD /v2/auth/users
Content-type: application/json
200 Body:
{
"users": [
{
"user": "alice",
"roles": [
{
"role": "root",
"permissions": {
"kv": {
"read": ["*"],
"write": ["*"]
}
}
}
]
},
{
"user": "bob",
"roles": [
{
"role": "guest",
"permissions": {
"kv": {
"read": ["*"],
"write": ["*"]
}
}
}
]
}
]
"users": ["alice", "bob", "eve"]
}
**Get User Details**
@ -184,26 +155,7 @@ GET/HEAD /v2/auth/users/alice
200 Body:
{
"user" : "alice",
"roles" : [
{
"role": "fleet",
"permissions" : {
"kv" : {
"read": [ "/fleet/" ],
"write": [ "/fleet/" ]
}
}
},
{
"role": "etcd",
"permissions" : {
"kv" : {
"read": [ "*" ],
"write": [ "*" ]
}
}
}
]
"roles" : ["fleet", "etcd"]
}
**Create Or Update A User**
@ -261,6 +213,22 @@ A full role structure may look like this. A Permission List structure is used fo
}
```
**Get a list of Roles**
GET/HEAD /v2/auth/roles
Sent Headers:
Authorization: Basic <BasicAuthString>
Possible Status Codes:
200 OK
401 Unauthorized
200 Headers:
Content-type: application/json
200 Body:
{
"roles": ["fleet", "etcd", "quay"]
}
**Get Role Details**
GET/HEAD /v2/auth/roles/fleet
@ -284,50 +252,6 @@ GET/HEAD /v2/auth/roles/fleet
}
}
**Get a list of Roles**
GET/HEAD /v2/auth/roles
Sent Headers:
Authorization: Basic <BasicAuthString>
Possible Status Codes:
200 OK
401 Unauthorized
200 Headers:
Content-type: application/json
200 Body:
{
"roles": [
{
"role": "fleet",
"permissions": {
"kv": {
"read": ["/fleet/"],
"write": ["/fleet/"]
}
}
},
{
"role": "etcd",
"permissions": {
"kv": {
"read": ["*"],
"write": ["*"]
}
}
},
{
"role": "quay",
"permissions": {
"kv": {
"read": ["*"],
"write": ["*"]
}
}
}
]
}
**Create Or Update A Role**
PUT /v2/auth/roles/rkt
@ -508,4 +432,3 @@ PUT /v2/keys/rkt/RktData
Reads and writes outside the prefixes granted will fail with a 401 Unauthorized.
[basic-auth]: https://en.wikipedia.org/wiki/Basic_access_authentication

View File

@ -1,12 +1,14 @@
# Authentication Guide
**NOTE: The authentication feature is considered experimental. We may change workflow without warning in future releases.**
## Overview
Authentication -- having users and roles in etcd -- was added in etcd 2.1. This guide will help you set up basic authentication in etcd.
etcd before 2.1 was a completely open system; anyone with access to the API could change keys. In order to preserve backward compatibility and upgradability, this feature is off by default.
For a full discussion of the RESTful API, see [the authentication API documentation][auth-api]
For a full discussion of the RESTful API, see [the authentication API documentation](auth_api.md)
## Special Users and Roles
@ -92,7 +94,6 @@ Roles are granted access to various parts of the keyspace, a single path at a ti
Reading a path is simple; if the path ends in `*`, that key **and all keys prefixed with it**, are granted to holders of this role. If it does not end in `*`, only that key and that key alone is granted.
Access can be granted as either read, write, or both, as in the following examples:
```
# Give read access to keys under the /foo directory
$ etcdctl role grant myrolename -path '/foo/*' -read
@ -133,7 +134,7 @@ $ etcdctl role remove myrolename
## Enabling authentication
The minimal steps to enabling auth are as follows. The administrator can set up users and roles before or after enabling authentication, as a matter of preference.
The minimal steps to enabling auth follow. The administrator can set up users and roles before or after enabling authentication, as a matter of preference.
Make sure the root user is created:
@ -176,5 +177,3 @@ $ etcdctl -u user get foo
```
Otherwise, all `etcdctl` commands remain the same. Users and roles can still be created and modified, but require authentication by a user with the root role.
[auth-api]: auth_api.md

View File

@ -32,7 +32,7 @@ The consistent flag for read operations is removed in etcd 2.0.0. The normal rea
The read consistency guarantees are:
The consistent read guarantees the sequential consistency within one client that talks to one etcd server. Read/Write from one client to one etcd member should be observed in order. If one client write a value to an etcd server successfully, it should be able to get the value out of the server immediately.
The consistent read guarantees the sequential consistency within one client that talks to one etcd server. Read/Write from one client to one etcd member should be observed in order. If one client write a value to a etcd server successfully, it should be able to get the value out of the server immediately.
Each etcd member will proxy the request to leader and only return the result to user after the result is applied on the local member. Thus after the write succeed, the user is guaranteed to see the value on the member it sent the request to.
@ -60,9 +60,9 @@ A size key needs to be provided inside a [discovery token][discoverytoken].
## HTTP Admin API
`v2/admin` on peer url and `v2/keys/_etcd` are unified under the new [v2/members API][members-api] to better explain which machines are part of an etcd cluster, and to simplify the keyspace for all your use cases.
`v2/admin` on peer url and `v2/keys/_etcd` are unified under the new [v2/member API][memberapi] to better explain which machines are part of an etcd cluster, and to simplify the keyspace for all your use cases.
[members-api]: members_api.md
[memberapi]: other_apis.md
## HTTP Key Value API
- The follower can now transparently proxy write requests to the leader. Clients will no longer see 307 redirections to the leader from etcd.

View File

@ -2,17 +2,12 @@
etcd benchmarks will be published regularly and tracked for each release below:
- [etcd v2.1.0-alpha][2.1]
- [etcd v2.2.0-rc][2.2]
- [etcd v3 demo][3.0]
- [etcd v2.1.0-alpha](./etcd-2-1-0-alpha-benchmarks.md)
- [etcd v2.2.0-rc](./etcd-2-2-0-rc-benchmarks.md)
- [etcd v3 demo](./etcd-3-demo-benchmarks.md)
# Memory Usage Benchmarks
It records expected memory usage in different scenarios.
- [etcd v2.2.0-rc][2.2-mem]
[2.1]: etcd-2-1-0-alpha-benchmarks.md
[2.2]: etcd-2-2-0-rc-benchmarks.md
[2.2-mem]: etcd-2-2-0-rc-memory-benchmarks.md
[3.0]: etcd-3-demo-benchmarks.md
- [etcd v2.2.0-rc](./etcd-2-2-0-rc-memory-benchmarks.md)

View File

@ -14,7 +14,7 @@ GCE n1-highcpu-2 machine type
## Testing
Bootstrap another machine and use the [boom HTTP benchmark tool][boom] to send requests to each etcd member. Check the [benchmark hacking guide][hack-benchmark] for detailed instructions.
Bootstrap another machine and use benchmark tool [boom](https://github.com/rakyll/boom) to send requests to each etcd member.
## Performance
@ -47,6 +47,3 @@ Bootstrap another machine and use the [boom HTTP benchmark tool][boom] to send r
| 64 | 256 | all servers | 3260 | 123.8 |
| 256 | 64 | all servers | 1033 | 121.5 |
| 256 | 256 | all servers | 3061 | 119.3 |
[boom]: https://github.com/rakyll/boom
[hack-benchmark]: /hack/benchmark/

View File

@ -1,69 +0,0 @@
# Benchmarking etcd v2.2.0
## Physical Machines
GCE n1-highcpu-2 machine type
- 1x dedicated local SSD mounted as etcd data directory
- 1x dedicated slow disk for the OS
- 1.8 GB memory
- 2x CPUs
## etcd Cluster
3 etcd 2.2.0 members, each runs on a single machine.
Detailed versions:
```
etcd Version: 2.2.0
Git SHA: e4561dd
Go Version: go1.5
Go OS/Arch: linux/amd64
```
## Testing
Bootstrap another machine, outside of the etcd cluster, and run the [`boom` HTTP benchmark tool](https://github.com/rakyll/boom) with a connection reuse patch to send requests to each etcd cluster member. See the [benchmark instructions](../../hack/benchmark/) for the patch and the steps to reproduce our procedures.
The performance is calulated through results of 100 benchmark rounds.
## Performance
### Single Key Read Performance
| key size in bytes | number of clients | target etcd server | average read QPS | read QPS stddev | average 90th Percentile Latency (ms) | latency stddev |
|-------------------|-------------------|--------------------|------------------|-----------------|--------------------------------------|----------------|
| 64 | 1 | leader only | 2303 | 200 | 0.49 | 0.06 |
| 64 | 64 | leader only | 15048 | 685 | 7.60 | 0.46 |
| 64 | 256 | leader only | 14508 | 434 | 29.76 | 1.05 |
| 256 | 1 | leader only | 2162 | 214 | 0.52 | 0.06 |
| 256 | 64 | leader only | 14789 | 792 | 7.69| 0.48 |
| 256 | 256 | leader only | 14424 | 512 | 29.92 | 1.42 |
| 64 | 64 | all servers | 45752 | 2048 | 2.47 | 0.14 |
| 64 | 256 | all servers | 46592 | 1273 | 10.14 | 0.59 |
| 256 | 64 | all servers | 45332 | 1847 | 2.48| 0.12 |
| 256 | 256 | all servers | 46485 | 1340 | 10.18 | 0.74 |
### Single Key Write Performance
| key size in bytes | number of clients | target etcd server | average write QPS | write QPS stddev | average 90th Percentile Latency (ms) | latency stddev |
|-------------------|-------------------|--------------------|------------------|-----------------|--------------------------------------|----------------|
| 64 | 1 | leader only | 55 | 4 | 24.51 | 13.26 |
| 64 | 64 | leader only | 2139 | 125 | 35.23 | 3.40 |
| 64 | 256 | leader only | 4581 | 581 | 70.53 | 10.22 |
| 256 | 1 | leader only | 56 | 4 | 22.37| 4.33 |
| 256 | 64 | leader only | 2052 | 151 | 36.83 | 4.20 |
| 256 | 256 | leader only | 4442 | 560 | 71.59 | 10.03 |
| 64 | 64 | all servers | 1625 | 85 | 58.51 | 5.14 |
| 64 | 256 | all servers | 4461 | 298 | 89.47 | 36.48 |
| 256 | 64 | all servers | 1599 | 94 | 60.11| 6.43 |
| 256 | 256 | all servers | 4315 | 193 | 88.98 | 7.01 |
## Performance Changes
- Because etcd now records metrics for each API call, read QPS performance seems to see a minor decrease in most scenarios. This minimal performance impact was judged a reasonable investment for the breadth of monitoring and debugging information returned.
- Write QPS to cluster leaders seems to be increased by a small margin. This is because the main loop and entry apply loops were decoupled in the etcd raft logic, eliminating several blocks between them.
- Write QPS to all members seems to be increased by a significant margin, because followers now receive the latest commit index sooner, and commit proposals more quickly.

View File

@ -20,11 +20,11 @@ Go Version: go1.4.2
Go OS/Arch: linux/amd64
```
Also, we use 3 etcd 2.1.0 alpha-stage members to form cluster to get base performance. etcd's commit head is at [c7146bd5][c7146bd5], which is the same as the one that we use in [etcd 2.1 benchmark][etcd-2.1-benchmark].
Also, we use 3 etcd 2.1.0 alpha-stage members to form cluster to get base performance. etcd's commit head is at [c7146bd5](https://github.com/coreos/etcd/commits/c7146bd5f2c73716091262edc638401bb8229144), which is the same as the one that we use in [etcd 2.1 benchmark](./etcd-2-1-0-benchmarks.md).
## Testing
Bootstrap another machine and use the [boom HTTP benchmark tool][boom] to send requests to each etcd member. Check the [benchmark hacking guide][hack-benchmark] for detailed instructions.
Bootstrap another machine and use benchmark tool [boom](https://github.com/rakyll/boom) to send requests to each etcd member. Check [here](../../hack/benchmark/) for instructions.
## Performance
@ -65,8 +65,3 @@ Bootstrap another machine and use the [boom HTTP benchmark tool][boom] to send r
- write QPS to leader is increased by 20~30%. This is because we decouple raft main loop and entry apply loop, which avoids them blocking each other.
- write QPS to all servers is increased by 30~80% because follower could receive latest commit index earlier and commit proposals faster.
[boom]: https://github.com/rakyll/boom
[c7146bd5]: https://github.com/coreos/etcd/commits/c7146bd5f2c73716091262edc638401bb8229144
[etcd-2.1-benchmark]: etcd-2-1-0-alpha-benchmarks.md
[hack-benchmark]: /hack/benchmark/

View File

@ -14,7 +14,7 @@ GCE n1-highcpu-2 machine type
## Testing
Use [etcd v3 benchmark tool][etcd-v3-benchmark].
Use [etcd v3 benchmark tool](../../hack/v3benchmark/).
## Performance
@ -38,5 +38,3 @@ The performance is nearly the same as the one with empty server handler.
The performance with empty server handler is not affected by one put. So the
performance downgrade should be caused by storage package.
[etcd-v3-benchmark]: /tools/benchmark/

View File

@ -1,77 +0,0 @@
# Watch Memory Usage Benchmark
*NOTE*: The watch features are under active development, and their memory usage may change as that development progresses. We do not expect it to significantly increase beyond the figures stated below.
A primary goal of etcd is supporting a very large number of watchers doing a massively large amount of watching. etcd aims to support O(10k) clients, O(100K) watch streams (O(10) streams per client) and O(10M) total watchings (O(100) watching per stream). The memory consumed by each individual watching accounts for the largest portion of etcd's overall usage, and is therefore the focus of current and future optimizations.
Three related components of etcd watch consume physical memory: each `grpc.Conn`, each watch stream, and each instance of the watching activity. `grpc.Conn` maintains the actual TCP connection and other gRPC connection state. Each `grpc.Conn` consumes O(10kb) of memory, and might have multiple watch streams attached.
Each watch stream is an independent HTTP2 connection which consumes another O(10kb) of memory.
Multiple watchings might share one watch stream.
Watching is the actual struct that tracks the changes on the key-value store. Each watching should only consume < O(1kb).
```
+-------+
| watch |
+---------> | foo |
| +-------+
+------+-----+
| stream |
+--------------> | |
| +------+-----+ +-------+
| | | watch |
| +---------> | bar |
+-----+------+ +-------+
| | +------------+
| conn +-------> | stream |
| | | |
+-----+------+ +------------+
|
|
|
| +------------+
+--------------> | stream |
| |
+------------+
```
The theoretical memory consumption of watch can be approximated with the formula:
`memory = c1 * number_of_conn + c2 * avg_number_of_stream_per_conn + c3 * avg_number_of_watch_stream`
## Testing Environment
etcd version
- git head https://github.com/coreos/etcd/commit/185097ffaa627b909007e772c175e8fefac17af3
GCE n1-standard-2 machine type
- 7.5 GB memory
- 2x CPUs
## Overall memory usage
The overall memory usage captures how much [RSS][rss] etcd consumes with the client watchers. While the result may vary by as much as 10%, it is still meaningful, since the goal is to learn about the rough memory usage and the pattern of allocations.
With the benchmark result, we can calculate roughly that `c1 = 17kb`, `c2 = 18kb` and `c3 = 350bytes`. So each additional client connection consumes 17kb of memory and each additional stream consumes 18kb of memory, and each additional watching only cause 350bytes. A single etcd server can maintain millions of watchings with a few GB of memory in normal case.
| clients | streams per client | watchings per stream | total watching | memory usage |
|---------|---------|-----------|----------------|--------------|
| 1k | 1 | 1 | 1k | 50MB |
| 2k | 1 | 1 | 2k | 90MB |
| 5k | 1 | 1 | 5k | 200MB |
| 1k | 10 | 1 | 10k | 217MB |
| 2k | 10 | 1 | 20k | 417MB |
| 5k | 10 | 1 | 50k | 980MB |
| 1k | 50 | 1 | 50k | 1001MB |
| 2k | 50 | 1 | 100k | 1960MB |
| 5k | 50 | 1 | 250k | 4700MB |
| 1k | 50 | 10 | 500k | 1171MB |
| 2k | 50 | 10 | 1M | 2371MB |
| 5k | 50 | 10 | 2.5M | 5710MB |
| 1k | 50 | 100 | 5M | 2380MB |
| 2k | 50 | 100 | 10M | 4672MB |
| 5k | 50 | 100 | 50M | *OOM* |
[rss]: https://en.wikipedia.org/wiki/Resident_set_size

View File

@ -1,98 +0,0 @@
# Storage Memory Usage Benchmark
<!---todo: link storage to storage design doc-->
Two components of etcd storage consume physical memory. The etcd process allocates an *in-memory index* to speed key lookup. The process's *page cache*, managed by the operating system, stores recently-accessed data from disk for quick re-use.
The in-memory index holds all the keys in a [B-tree][btree] data structure, along with pointers to the on-disk data (the values). Each key in the B-tree may contain multiple pointers, pointing to different versions of its values. The theoretical memory consumption of the in-memory index can hence be approximated with the formula:
`N * (c1 + avg_key_size) + N * (avg_versions_of_key) * (c2 + size_of_pointer)`
where `c1` is the key metadata overhead and `c2` is the version metadata overhead.
The graph shows the detailed structure of the in-memory index B-tree.
```
In mem index
+------------+
| key || ... |
+--------------+ | || |
| | +------------+
| | | v1 || ... |
| disk <----------------| || | Tree Node
| | +------------+
| | | v2 || ... |
| <----------------+ || |
| | +------------+
+--------------+ +-----+ | | |
| | | | |
| +------------+
|
|
^
------+
| ... |
| |
+-----+
| ... | Tree Node
| |
+-----+
| ... |
| |
------+
```
[Page cache memory][pagecache] is managed by the operating system and is not covered in detail in this document.
## Testing Environment
etcd version
- git head https://github.com/coreos/etcd/commit/776e9fb7be7eee5e6b58ab977c8887b4fe4d48db
GCE n1-standard-2 machine type
- 7.5 GB memory
- 2x CPUs
## In-memory index memory usage
In this test, we only benchmark the memory usage of the in-memory index. The goal is to find `c1` and `c2` mentioned above and to understand the hard limit of memory consumption of the storage.
We calculate the memory usage consumption via the Go runtime.ReadMemStats. We calculate the total allocated bytes difference before creating the index and after creating the index. It cannot perfectly reflect the memory usage of the in-memory index itself but can show the rough consumption pattern.
| N | versions | key size | memory usage |
|------|----------|----------|--------------|
| 100K | 1 | 64bytes | 22MB |
| 100K | 5 | 64bytes | 39MB |
| 1M | 1 | 64bytes | 218MB |
| 1M | 5 | 64bytes | 432MB |
| 100K | 1 | 256bytes | 41MB |
| 100K | 5 | 256bytes | 65MB |
| 1M | 1 | 256bytes | 409MB |
| 1M | 5 | 256bytes | 506MB |
Based on the result, we can calculate `c1=120bytes`, `c2=30bytes`. We only need two sets of data to calculate `c1` and `c2`, since they are the only unknown variable in the formula. The `c1=120bytes` and `c2=30bytes` are the average value of the 4 sets of `c1` and `c2` we calculated. The key metadata overhead is still relatively nontrivial (50%) for small key-value pairs. However, this is a significant improvement over the old store, which had at least 1000% overhead.
## Overall memory usage
The overall memory usage captures how much RSS etcd consumes with the storage. The value size should have very little impact on the overall memory usage of etcd, since we keep values on disk and only retain hot values in memory, managed by the OS page cache.
| N | versions | key size | value size | memory usage |
|------|----------|----------|------------|--------------|
| 100K | 1 | 64bytes | 256bytes | 40MB |
| 100K | 5 | 64bytes | 256bytes | 89MB |
| 1M | 1 | 64bytes | 256bytes | 470MB |
| 1M | 5 | 64bytes | 256bytes | 880MB |
| 100K | 1 | 64bytes | 1KB | 102MB |
| 100K | 5 | 64bytes | 1KB | 164MB |
| 1M | 1 | 64bytes | 1KB | 587MB |
| 1M | 5 | 64bytes | 1KB | 836MB |
Based on the result, we know the value size does not significantly impact the memory consumption. There is some minor increase due to more data held in the OS page cache.
[btree]: https://en.wikipedia.org/wiki/B-tree
[pagecache]: https://en.wikipedia.org/wiki/Page_cache

View File

@ -1,13 +1,13 @@
# Branch Management
## Branch Management
## Guide
### Guide
* New development occurs on the [master branch][master].
* Master branch should always have a green build!
* Backwards-compatible bug fixes should target the master branch and subsequently be ported to stable branches.
* Once the master branch is ready for release, it will be tagged and become the new stable branch.
- New development occurs on the [master branch](https://github.com/coreos/etcd/tree/master)
- Master branch should always have a green build!
- Backwards-compatible bug fixes should target the master branch and subsequently be ported to stable branches
- Once the master branch is ready for release, it will be tagged and become the new stable branch.
The etcd team has adopted a *rolling release model* and supports one stable version of etcd.
The etcd team has adopted a _rolling release model_ and supports one stable version of etcd.
### Master branch
@ -22,5 +22,3 @@ Before the release of the next stable version, feature PRs will be frozen. We wi
All branches with prefix `release-` are considered _stable_ branches.
After every minor release (http://semver.org/), we will have a new stable branch for that release. We will keep fixing the backwards-compatible bugs for the latest stable release, but not previous releases. The _patch_ release, incorporating any bug fixes, will be once every two weeks, given any patches.
[master]: https://github.com/coreos/etcd/tree/master

View File

@ -4,7 +4,7 @@
Starting an etcd cluster statically requires that each member knows another in the cluster. In a number of cases, you might not know the IPs of your cluster members ahead of time. In these cases, you can bootstrap an etcd cluster with the help of a discovery service.
Once an etcd cluster is up and running, adding or removing members is done via [runtime reconfiguration][runtime-conf]. To better understand the design behind runtime reconfiguration, we suggest you read [the runtime configuration design document][runtime-reconf-design].
Once an etcd cluster is up and running, adding or removing members is done via [runtime reconfiguration](runtime-configuration.md). To better understand the design behind runtime reconfiguration, we suggest you read [this](runtime-reconf-design.md).
This guide will cover the following mechanisms for bootstrapping an etcd cluster:
@ -30,59 +30,59 @@ ETCD_INITIAL_CLUSTER_STATE=new
```
```
--initial-cluster infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380 \
--initial-cluster-state new
-initial-cluster infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380 \
-initial-cluster-state new
```
Note that the URLs specified in `initial-cluster` are the _advertised peer URLs_, i.e. they should match the value of `initial-advertise-peer-urls` on the respective nodes.
If you are spinning up multiple clusters (or creating and destroying a single cluster) with same configuration for testing purpose, it is highly recommended that you specify a unique `initial-cluster-token` for the different clusters. By doing this, etcd can generate unique cluster IDs and member IDs for the clusters even if they otherwise have the exact same configuration. This can protect you from cross-cluster-interaction, which might corrupt your clusters.
etcd listens on [`listen-client-urls`][conf-listen-client] to accept client traffic. etcd member advertises the URLs specified in [`advertise-client-urls`][conf-adv-client] to other members, proxies, clients. Please make sure the `advertise-client-urls` are reachable from intended clients. A common mistake is setting `advertise-client-urls` to localhost or leave it as default when you want the remote clients to reach etcd.
etcd listens on [`listen-client-urls`](configuration.md#-listen-client-urls) to accept client traffic. etcd member advertises the URLs specified in [`advertise-client-urls`](configuration.md#-advertise-client-urls) to other members, proxies, clients. Please make sure the `advertise-client-urls` are reachable from intended clients. A common mistake is setting `advertise-client-urls` to localhost or leave it as default when you want the remote clients to reach etcd.
On each machine you would start etcd with these flags:
```
$ etcd --name infra0 --initial-advertise-peer-urls http://10.0.1.10:2380 \
--listen-peer-urls http://10.0.1.10:2380 \
--listen-client-urls http://10.0.1.10:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://10.0.1.10:2379 \
--initial-cluster-token etcd-cluster-1 \
--initial-cluster infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380 \
--initial-cluster-state new
$ etcd -name infra0 -initial-advertise-peer-urls http://10.0.1.10:2380 \
-listen-peer-urls http://10.0.1.10:2380 \
-listen-client-urls http://10.0.1.10:2379,http://127.0.0.1:2379 \
-advertise-client-urls http://10.0.1.10:2379 \
-initial-cluster-token etcd-cluster-1 \
-initial-cluster infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380 \
-initial-cluster-state new
```
```
$ etcd --name infra1 --initial-advertise-peer-urls http://10.0.1.11:2380 \
--listen-peer-urls http://10.0.1.11:2380 \
--listen-client-urls http://10.0.1.11:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://10.0.1.11:2379 \
--initial-cluster-token etcd-cluster-1 \
--initial-cluster infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380 \
--initial-cluster-state new
$ etcd -name infra1 -initial-advertise-peer-urls http://10.0.1.11:2380 \
-listen-peer-urls http://10.0.1.11:2380 \
-listen-client-urls http://10.0.1.11:2379,http://127.0.0.1:2379 \
-advertise-client-urls http://10.0.1.11:2379 \
-initial-cluster-token etcd-cluster-1 \
-initial-cluster infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380 \
-initial-cluster-state new
```
```
$ etcd --name infra2 --initial-advertise-peer-urls http://10.0.1.12:2380 \
--listen-peer-urls http://10.0.1.12:2380 \
--listen-client-urls http://10.0.1.12:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://10.0.1.12:2379 \
--initial-cluster-token etcd-cluster-1 \
--initial-cluster infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380 \
--initial-cluster-state new
$ etcd -name infra2 -initial-advertise-peer-urls http://10.0.1.12:2380 \
-listen-peer-urls http://10.0.1.12:2380 \
-listen-client-urls http://10.0.1.12:2379,http://127.0.0.1:2379 \
-advertise-client-urls http://10.0.1.12:2379 \
-initial-cluster-token etcd-cluster-1 \
-initial-cluster infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380 \
-initial-cluster-state new
```
The command line parameters starting with `--initial-cluster` will be ignored on subsequent runs of etcd. You are free to remove the environment variables or command line flags after the initial bootstrap process. If you need to make changes to the configuration later (for example, adding or removing members to/from the cluster), see the [runtime configuration][runtime-conf] guide.
The command line parameters starting with `-initial-cluster` will be ignored on subsequent runs of etcd. You are free to remove the environment variables or command line flags after the initial bootstrap process. If you need to make changes to the configuration later (for example, adding or removing members to/from the cluster), see the [runtime configuration](runtime-configuration.md) guide.
### Error Cases
In the following example, we have not included our new host in the list of enumerated nodes. If this is a new cluster, the node _must_ be added to the list of initial cluster members.
```
$ etcd --name infra1 --initial-advertise-peer-urls http://10.0.1.11:2380 \
--listen-peer-urls https://10.0.1.11:2380 \
--listen-client-urls http://10.0.1.11:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://10.0.1.11:2379 \
--initial-cluster infra0=http://10.0.1.10:2380 \
--initial-cluster-state new
$ etcd -name infra1 -initial-advertise-peer-urls http://10.0.1.11:2380 \
-listen-peer-urls https://10.0.1.11:2380 \
-listen-client-urls http://10.0.1.11:2379,http://127.0.0.1:2379 \
-advertise-client-urls http://10.0.1.11:2379 \
-initial-cluster infra0=http://10.0.1.10:2380 \
-initial-cluster-state new
etcd: infra1 not listed in the initial cluster config
exit 1
```
@ -90,12 +90,12 @@ exit 1
In this example, we are attempting to map a node (infra0) on a different address (127.0.0.1:2380) than its enumerated address in the cluster list (10.0.1.10:2380). If this node is to listen on multiple addresses, all addresses _must_ be reflected in the "initial-cluster" configuration directive.
```
$ etcd --name infra0 --initial-advertise-peer-urls http://127.0.0.1:2380 \
--listen-peer-urls http://10.0.1.10:2380 \
--listen-client-urls http://10.0.1.10:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://10.0.1.10:2379 \
--initial-cluster infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380 \
--initial-cluster-state=new
$ etcd -name infra0 -initial-advertise-peer-urls http://127.0.0.1:2380 \
-listen-peer-urls http://10.0.1.10:2380 \
-listen-client-urls http://10.0.1.10:2379,http://127.0.0.1:2379 \
-advertise-client-urls http://10.0.1.10:2379 \
-initial-cluster infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380 \
-initial-cluster-state=new
etcd: error setting up initial cluster: infra0 has different advertised URLs in the cluster and advertised peer URLs list
exit 1
```
@ -103,12 +103,12 @@ exit 1
If you configure a peer with a different set of configuration and attempt to join this cluster you will get a cluster ID mismatch and etcd will exit.
```
$ etcd --name infra3 --initial-advertise-peer-urls http://10.0.1.13:2380 \
--listen-peer-urls http://10.0.1.13:2380 \
--listen-client-urls http://10.0.1.13:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://10.0.1.13:2379 \
--initial-cluster infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra3=http://10.0.1.13:2380 \
--initial-cluster-state=new
$ etcd -name infra3 -initial-advertise-peer-urls http://10.0.1.13:2380 \
-listen-peer-urls http://10.0.1.13:2380 \
-listen-client-urls http://10.0.1.13:2379,http://127.0.0.1:2379 \
-advertise-client-urls http://10.0.1.13:2379 \
-initial-cluster infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra3=http://10.0.1.13:2380 \
-initial-cluster-state=new
etcd: conflicting cluster ID to the target cluster (c6ab534d07e8fcc4 != bc25ea2a74fb18b0). Exiting.
exit 1
```
@ -124,13 +124,15 @@ There two methods that can be used for discovery:
### etcd Discovery
To better understand the design about discovery service protocol, we suggest you read [this][discovery-proto].
To better understand the design about discovery service protocol, we suggest you read [this](./discovery_protocol.md).
#### Lifetime of a Discovery URL
A discovery URL identifies a unique etcd cluster. Instead of reusing a discovery URL, you should always create discovery URLs for new clusters.
Moreover, discovery URLs should ONLY be used for the initial bootstrapping of a cluster. To change cluster membership after the cluster is already running, see the [runtime reconfiguration][runtime-conf] guide.
Moreover, discovery URLs should ONLY be used for the initial bootstrapping of a cluster. To change cluster membership after the cluster is already running, see the [runtime reconfiguration][runtime] guide.
[runtime]: runtime-configuration.md
#### Custom etcd Discovery Service
@ -146,30 +148,30 @@ If you bootstrap an etcd cluster using discovery service with more than the expe
The URL you will use in this case will be `https://myetcd.local/v2/keys/discovery/6c007a14875d53d9bf0ef5a6fc0257c817f0fb83` and the etcd members will use the `https://myetcd.local/v2/keys/discovery/6c007a14875d53d9bf0ef5a6fc0257c817f0fb83` directory for registration as they start.
**Each member must have a different name flag specified. `Hostname` or `machine-id` can be a good choice. Or discovery will fail due to duplicated name.**
Each member must have a different name flag specified. Or discovery will fail due to duplicated name.
Now we start etcd with those relevant flags for each member:
```
$ etcd --name infra0 --initial-advertise-peer-urls http://10.0.1.10:2380 \
--listen-peer-urls http://10.0.1.10:2380 \
--listen-client-urls http://10.0.1.10:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://10.0.1.10:2379 \
--discovery https://myetcd.local/v2/keys/discovery/6c007a14875d53d9bf0ef5a6fc0257c817f0fb83
$ etcd -name infra0 -initial-advertise-peer-urls http://10.0.1.10:2380 \
-listen-peer-urls http://10.0.1.10:2380 \
-listen-client-urls http://10.0.1.10:2379,http://127.0.0.1:2379 \
-advertise-client-urls http://10.0.1.10:2379 \
-discovery https://myetcd.local/v2/keys/discovery/6c007a14875d53d9bf0ef5a6fc0257c817f0fb83
```
```
$ etcd --name infra1 --initial-advertise-peer-urls http://10.0.1.11:2380 \
--listen-peer-urls http://10.0.1.11:2380 \
--listen-client-urls http://10.0.1.11:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://10.0.1.11:2379 \
--discovery https://myetcd.local/v2/keys/discovery/6c007a14875d53d9bf0ef5a6fc0257c817f0fb83
$ etcd -name infra1 -initial-advertise-peer-urls http://10.0.1.11:2380 \
-listen-peer-urls http://10.0.1.11:2380 \
-listen-client-urls http://10.0.1.11:2379,http://127.0.0.1:2379 \
-advertise-client-urls http://10.0.1.11:2379 \
-discovery https://myetcd.local/v2/keys/discovery/6c007a14875d53d9bf0ef5a6fc0257c817f0fb83
```
```
$ etcd --name infra2 --initial-advertise-peer-urls http://10.0.1.12:2380 \
--listen-peer-urls http://10.0.1.12:2380 \
--listen-client-urls http://10.0.1.12:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://10.0.1.12:2379 \
--discovery https://myetcd.local/v2/keys/discovery/6c007a14875d53d9bf0ef5a6fc0257c817f0fb83
$ etcd -name infra2 -initial-advertise-peer-urls http://10.0.1.12:2380 \
-listen-peer-urls http://10.0.1.12:2380 \
-listen-client-urls http://10.0.1.12:2379,http://127.0.0.1:2379 \
-advertise-client-urls http://10.0.1.12:2379 \
-discovery https://myetcd.local/v2/keys/discovery/6c007a14875d53d9bf0ef5a6fc0257c817f0fb83
```
This will cause each member to register itself with the custom etcd discovery service and begin the cluster once all machines have been registered.
@ -187,6 +189,9 @@ This will create the cluster with an initial expected size of 3 members. If you
If you bootstrap an etcd cluster using discovery service with more than the expected number of etcd members, the extra etcd processes will [fall back][fall-back] to being [proxies][proxy] by default.
[fall-back]: proxy.md#fallback-to-proxy-mode-with-discovery-service
[proxy]: proxy.md
```
ETCD_DISCOVERY=https://discovery.etcd.io/3e86b59982e49066c5d813af1c2e2579cbf573de
```
@ -195,30 +200,30 @@ ETCD_DISCOVERY=https://discovery.etcd.io/3e86b59982e49066c5d813af1c2e2579cbf573d
-discovery https://discovery.etcd.io/3e86b59982e49066c5d813af1c2e2579cbf573de
```
**Each member must have a different name flag specified. `Hostname` or `machine-id` can be a good choice. Or discovery will fail due to duplicated name.**
Each member must have a different name flag specified. Or discovery will fail due to duplicated name.
Now we start etcd with those relevant flags for each member:
```
$ etcd --name infra0 --initial-advertise-peer-urls http://10.0.1.10:2380 \
--listen-peer-urls http://10.0.1.10:2380 \
--listen-client-urls http://10.0.1.10:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://10.0.1.10:2379 \
--discovery https://discovery.etcd.io/3e86b59982e49066c5d813af1c2e2579cbf573de
$ etcd -name infra0 -initial-advertise-peer-urls http://10.0.1.10:2380 \
-listen-peer-urls http://10.0.1.10:2380 \
-listen-client-urls http://10.0.1.10:2379,http://127.0.0.1:2379 \
-advertise-client-urls http://10.0.1.10:2379 \
-discovery https://discovery.etcd.io/3e86b59982e49066c5d813af1c2e2579cbf573de
```
```
$ etcd --name infra1 --initial-advertise-peer-urls http://10.0.1.11:2380 \
--listen-peer-urls http://10.0.1.11:2380 \
--listen-client-urls http://10.0.1.11:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://10.0.1.11:2379 \
--discovery https://discovery.etcd.io/3e86b59982e49066c5d813af1c2e2579cbf573de
$ etcd -name infra1 -initial-advertise-peer-urls http://10.0.1.11:2380 \
-listen-peer-urls http://10.0.1.11:2380 \
-listen-client-urls http://10.0.1.11:2379,http://127.0.0.1:2379 \
-advertise-client-urls http://10.0.1.11:2379 \
-discovery https://discovery.etcd.io/3e86b59982e49066c5d813af1c2e2579cbf573de
```
```
$ etcd --name infra2 --initial-advertise-peer-urls http://10.0.1.12:2380 \
--listen-peer-urls http://10.0.1.12:2380 \
--listen-client-urls http://10.0.1.12:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://10.0.1.12:2379 \
--discovery https://discovery.etcd.io/3e86b59982e49066c5d813af1c2e2579cbf573de
$ etcd -name infra2 -initial-advertise-peer-urls http://10.0.1.12:2380 \
-listen-peer-urls http://10.0.1.12:2380 \
-listen-client-urls http://10.0.1.12:2379,http://127.0.0.1:2379 \
-advertise-client-urls http://10.0.1.12:2379 \
-discovery https://discovery.etcd.io/3e86b59982e49066c5d813af1c2e2579cbf573de
```
This will cause each member to register itself with the discovery service and begin the cluster once all members have been registered.
@ -231,11 +236,11 @@ You can use the environment variable `ETCD_DISCOVERY_PROXY` to cause etcd to use
```
$ etcd --name infra0 --initial-advertise-peer-urls http://10.0.1.10:2380 \
--listen-peer-urls http://10.0.1.10:2380 \
--listen-client-urls http://10.0.1.10:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://10.0.1.10:2379 \
--discovery https://discovery.etcd.io/3e86b59982e49066c5d813af1c2e2579cbf573de
$ etcd -name infra0 -initial-advertise-peer-urls http://10.0.1.10:2380 \
-listen-peer-urls http://10.0.1.10:2380 \
-listen-client-urls http://10.0.1.10:2379,http://127.0.0.1:2379 \
-advertise-client-urls http://10.0.1.10:2379 \
-discovery https://discovery.etcd.io/3e86b59982e49066c5d813af1c2e2579cbf573de
etcd: error: the cluster doesnt have a size configuration value in https://discovery.etcd.io/3e86b59982e49066c5d813af1c2e2579cbf573de/_config
exit 1
```
@ -245,12 +250,12 @@ exit 1
This error will occur if the discovery cluster already has the configured number of members, and `discovery-fallback` is explicitly disabled
```
$ etcd --name infra0 --initial-advertise-peer-urls http://10.0.1.10:2380 \
--listen-peer-urls http://10.0.1.10:2380 \
--listen-client-urls http://10.0.1.10:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://10.0.1.10:2379 \
--discovery https://discovery.etcd.io/3e86b59982e49066c5d813af1c2e2579cbf573de \
--discovery-fallback exit
$ etcd -name infra0 -initial-advertise-peer-urls http://10.0.1.10:2380 \
-listen-peer-urls http://10.0.1.10:2380 \
-listen-client-urls http://10.0.1.10:2379,http://127.0.0.1:2379 \
-advertise-client-urls http://10.0.1.10:2379 \
-discovery https://discovery.etcd.io/3e86b59982e49066c5d813af1c2e2579cbf573de \
-discovery-fallback exit
etcd: discovery: cluster is full
exit 1
```
@ -261,17 +266,17 @@ This is a harmless warning notifying you that the discovery URL will be
ignored on this machine.
```
$ etcd --name infra0 --initial-advertise-peer-urls http://10.0.1.10:2380 \
--listen-peer-urls http://10.0.1.10:2380 \
--listen-client-urls http://10.0.1.10:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://10.0.1.10:2379 \
--discovery https://discovery.etcd.io/3e86b59982e49066c5d813af1c2e2579cbf573de
$ etcd -name infra0 -initial-advertise-peer-urls http://10.0.1.10:2380 \
-listen-peer-urls http://10.0.1.10:2380 \
-listen-client-urls http://10.0.1.10:2379,http://127.0.0.1:2379 \
-advertise-client-urls http://10.0.1.10:2379 \
-discovery https://discovery.etcd.io/3e86b59982e49066c5d813af1c2e2579cbf573de
etcdserver: discovery token ignored since a cluster has already been initialized. Valid log found at /var/lib/etcd
```
### DNS Discovery
DNS [SRV records][rfc-srv] can be used as a discovery mechanism.
DNS [SRV records](http://www.ietf.org/rfc/rfc2052.txt) can be used as a discovery mechanism.
The `-discovery-srv` flag can be used to set the DNS domain name where the discovery SRV records can be found.
The following DNS SRV records are looked up in the listed order:
@ -280,107 +285,93 @@ The following DNS SRV records are looked up in the listed order:
If `_etcd-server-ssl._tcp.example.com` is found then etcd will attempt the bootstrapping process over SSL.
To help clients discover the etcd cluster, the following DNS SRV records are looked up in the listed order:
* _etcd-client._tcp.example.com
* _etcd-client-ssl._tcp.example.com
If `_etcd-client-ssl._tcp.example.com` is found, clients will attempt to communicate with the etcd cluster over SSL.
#### Create DNS SRV records
```
$ dig +noall +answer SRV _etcd-server._tcp.example.com
_etcd-server._tcp.example.com. 300 IN SRV 0 0 2380 infra0.example.com.
_etcd-server._tcp.example.com. 300 IN SRV 0 0 2380 infra1.example.com.
_etcd-server._tcp.example.com. 300 IN SRV 0 0 2380 infra2.example.com.
```
```
$ dig +noall +answer SRV _etcd-client._tcp.example.com
_etcd-client._tcp.example.com. 300 IN SRV 0 0 2379 infra0.example.com.
_etcd-client._tcp.example.com. 300 IN SRV 0 0 2379 infra1.example.com.
_etcd-client._tcp.example.com. 300 IN SRV 0 0 2379 infra2.example.com.
_etcd-server._tcp.example.com. 300 IN SRV 0 0 2380 infra0.example.com.
_etcd-server._tcp.example.com. 300 IN SRV 0 0 2380 infra1.example.com.
_etcd-server._tcp.example.com. 300 IN SRV 0 0 2380 infra2.example.com.
```
```
$ dig +noall +answer infra0.example.com infra1.example.com infra2.example.com
infra0.example.com. 300 IN A 10.0.1.10
infra1.example.com. 300 IN A 10.0.1.11
infra2.example.com. 300 IN A 10.0.1.12
infra0.example.com. 300 IN A 10.0.1.10
infra1.example.com. 300 IN A 10.0.1.11
infra2.example.com. 300 IN A 10.0.1.12
```
#### Bootstrap the etcd cluster using DNS
etcd cluster members can listen on domain names or IP address, the bootstrap process will resolve DNS A records.
The resolved address in `--initial-advertise-peer-urls` *must match* one of the resolved addresses in the SRV targets. The etcd member reads the resolved address to find out if it belongs to the cluster defined in the SRV records.
The resolved address in `-initial-advertise-peer-urls` *must match* one of the resolved addresses in the SRV targets. The etcd member reads the resolved address to find out if it belongs to the cluster defined in the SRV records.
```
$ etcd --name infra0 \
--discovery-srv example.com \
--initial-advertise-peer-urls http://infra0.example.com:2380 \
--initial-cluster-token etcd-cluster-1 \
--initial-cluster-state new \
--advertise-client-urls http://infra0.example.com:2379 \
--listen-client-urls http://infra0.example.com:2379 \
--listen-peer-urls http://infra0.example.com:2380
$ etcd -name infra0 \
-discovery-srv example.com \
-initial-advertise-peer-urls http://infra0.example.com:2380 \
-initial-cluster-token etcd-cluster-1 \
-initial-cluster-state new \
-advertise-client-urls http://infra0.example.com:2379 \
-listen-client-urls http://infra0.example.com:2379 \
-listen-peer-urls http://infra0.example.com:2380
```
```
$ etcd --name infra1 \
--discovery-srv example.com \
--initial-advertise-peer-urls http://infra1.example.com:2380 \
--initial-cluster-token etcd-cluster-1 \
--initial-cluster-state new \
--advertise-client-urls http://infra1.example.com:2379 \
--listen-client-urls http://infra1.example.com:2379 \
--listen-peer-urls http://infra1.example.com:2380
$ etcd -name infra1 \
-discovery-srv example.com \
-initial-advertise-peer-urls http://infra1.example.com:2380 \
-initial-cluster-token etcd-cluster-1 \
-initial-cluster-state new \
-advertise-client-urls http://infra1.example.com:2379 \
-listen-client-urls http://infra1.example.com:2379 \
-listen-peer-urls http://infra1.example.com:2380
```
```
$ etcd --name infra2 \
--discovery-srv example.com \
--initial-advertise-peer-urls http://infra2.example.com:2380 \
--initial-cluster-token etcd-cluster-1 \
--initial-cluster-state new \
--advertise-client-urls http://infra2.example.com:2379 \
--listen-client-urls http://infra2.example.com:2379 \
--listen-peer-urls http://infra2.example.com:2380
$ etcd -name infra2 \
-discovery-srv example.com \
-initial-advertise-peer-urls http://infra2.example.com:2380 \
-initial-cluster-token etcd-cluster-1 \
-initial-cluster-state new \
-advertise-client-urls http://infra2.example.com:2379 \
-listen-client-urls http://infra2.example.com:2379 \
-listen-peer-urls http://infra2.example.com:2380
```
You can also bootstrap the cluster using IP addresses instead of domain names:
```
$ etcd --name infra0 \
--discovery-srv example.com \
--initial-advertise-peer-urls http://10.0.1.10:2380 \
--initial-cluster-token etcd-cluster-1 \
--initial-cluster-state new \
--advertise-client-urls http://10.0.1.10:2379 \
--listen-client-urls http://10.0.1.10:2379 \
--listen-peer-urls http://10.0.1.10:2380
$ etcd -name infra0 \
-discovery-srv example.com \
-initial-advertise-peer-urls http://10.0.1.10:2380 \
-initial-cluster-token etcd-cluster-1 \
-initial-cluster-state new \
-advertise-client-urls http://10.0.1.10:2379 \
-listen-client-urls http://10.0.1.10:2379 \
-listen-peer-urls http://10.0.1.10:2380
```
```
$ etcd --name infra1 \
--discovery-srv example.com \
--initial-advertise-peer-urls http://10.0.1.11:2380 \
--initial-cluster-token etcd-cluster-1 \
--initial-cluster-state new \
--advertise-client-urls http://10.0.1.11:2379 \
--listen-client-urls http://10.0.1.11:2379 \
--listen-peer-urls http://10.0.1.11:2380
$ etcd -name infra1 \
-discovery-srv example.com \
-initial-advertise-peer-urls http://10.0.1.11:2380 \
-initial-cluster-token etcd-cluster-1 \
-initial-cluster-state new \
-advertise-client-urls http://10.0.1.11:2379 \
-listen-client-urls http://10.0.1.11:2379 \
-listen-peer-urls http://10.0.1.11:2380
```
```
$ etcd --name infra2 \
--discovery-srv example.com \
--initial-advertise-peer-urls http://10.0.1.12:2380 \
--initial-cluster-token etcd-cluster-1 \
--initial-cluster-state new \
--advertise-client-urls http://10.0.1.12:2379 \
--listen-client-urls http://10.0.1.12:2379 \
--listen-peer-urls http://10.0.1.12:2380
$ etcd -name infra2 \
-discovery-srv example.com \
-initial-advertise-peer-urls http://10.0.1.12:2380 \
-initial-cluster-token etcd-cluster-1 \
-initial-cluster-state new \
-advertise-client-urls http://10.0.1.12:2379 \
-listen-client-urls http://10.0.1.12:2379 \
-listen-peer-urls http://10.0.1.12:2380
```
#### etcd proxy configuration
@ -388,24 +379,12 @@ $ etcd --name infra2 \
DNS SRV records can also be used to configure the list of peers for an etcd server running in proxy mode:
```
$ etcd --proxy on --discovery-srv example.com
```
#### etcd client configuration
DNS SRV records can also be used to help clients discover the etcd cluster.
The official [etcd/client][client] supports [DNS Discovery][client-discoverer].
`etcdctl` also supports DNS Discovery by specifying the `--discovery-srv` option.
```
$ etcdctl --discovery-srv example.com set foo bar
$ etcd --proxy on -discovery-srv example.com
```
#### Error Cases
You might see an error like `cannot find local etcd $name from SRV records.`. That means the etcd member fails to find itself from the cluster defined in SRV records. The resolved address in `--initial-advertise-peer-urls` *must match* one of the resolved addresses in the SRV targets.
You might see the an error like `cannot find local etcd $name from SRV records.`. That means the etcd member fails to find itself from the cluster defined in SRV records. The resolved address in `-initial-advertise-peer-urls` *must match* one of the resolved addresses in the SRV targets.
# 0.4 to 2.0+ Migration Guide
@ -413,22 +392,11 @@ In etcd 2.0 we introduced the ability to listen on more than one address and to
To make understanding this feature easier, we changed the naming of some flags, but we support the old flags to make the migration from the old to new version easier.
|Old Flag |New Flag |Migration Behavior |
|Old Flag |New Flag |Migration Behavior |
|-----------------------|-----------------------|---------------------------------------------------------------------------------------|
|-peer-addr |--initial-advertise-peer-urls |If specified, peer-addr will be used as the only peer URL. Error if both flags specified.|
|-addr |--advertise-client-urls |If specified, addr will be used as the only client URL. Error if both flags specified.|
|-peer-bind-addr |--listen-peer-urls |If specified, peer-bind-addr will be used as the only peer bind URL. Error if both flags specified.|
|-bind-addr |--listen-client-urls |If specified, bind-addr will be used as the only client bind URL. Error if both flags specified.|
|-peers |none |Deprecated. The --initial-cluster flag provides a similar concept with different semantics. Please read this guide on cluster startup.|
|-peers-file |none |Deprecated. The --initial-cluster flag provides a similar concept with different semantics. Please read this guide on cluster startup.|
[client]: /client
[client-discoverer]: https://godoc.org/github.com/coreos/etcd/client#Discoverer
[conf-adv-client]: configuration.md#-advertise-client-urls
[conf-listen-client]: configuration.md#-listen-client-urls
[discovery-proto]: discovery_protocol.md
[fall-back]: proxy.md#fallback-to-proxy-mode-with-discovery-service
[proxy]: proxy.md
[rfc-srv]: http://www.ietf.org/rfc/rfc2052.txt
[runtime-conf]: runtime-configuration.md
[runtime-reconf-design]: runtime-reconf-design.md
|-peer-addr |-initial-advertise-peer-urls |If specified, peer-addr will be used as the only peer URL. Error if both flags specified.|
|-addr |-advertise-client-urls |If specified, addr will be used as the only client URL. Error if both flags specified.|
|-peer-bind-addr |-listen-peer-urls |If specified, peer-bind-addr will be used as the only peer bind URL. Error if both flags specified.|
|-bind-addr |-listen-client-urls |If specified, bind-addr will be used as the only client bind URL. Error if both flags specified.|
|-peers |none |Deprecated. The -initial-cluster flag provides a similar concept with different semantics. Please read this guide on cluster startup.|
|-peers-file |none |Deprecated. The -initial-cluster flag provides a similar concept with different semantics. Please read this guide on cluster startup.|

View File

@ -1,282 +1,264 @@
# Configuration Flags
## Configuration Flags
etcd is configurable through command-line flags and environment variables. Options set on the command line take precedence over those from the environment.
The format of environment variable for flag `--my-flag` is `ETCD_MY_FLAG`. It applies to all flags.
The [official etcd ports][iana-ports] are 2379 for client requests, and 2380 for peer communication. Some legacy code and documentation still references ports 4001 and 7001, but all new etcd use and discussion should adopt the assigned ports.
The format of environment variable for flag `-my-flag` is `ETCD_MY_FLAG`. It applies to all flags.
To start etcd automatically using custom settings at startup in Linux, using a [systemd][systemd-intro] unit is highly recommended.
[systemd-intro]: http://freedesktop.org/wiki/Software/systemd/
## Member Flags
### Member Flags
### --name
##### -name
+ Human-readable name for this member.
+ default: "default"
+ env variable: ETCD_NAME
+ This value is referenced as this node's own entries listed in the `--initial-cluster` flag (Ex: `default=http://localhost:2380` or `default=http://localhost:2380,default=http://localhost:7001`). This needs to match the key used in the flag if you're using [static bootstrapping][build-cluster]. When using discovery, each member must have a unique name. `Hostname` or `machine-id` can be a good choice.
+ This value is referenced as this node's own entries listed in the `-initial-cluster` flag (Ex: `default=http://localhost:2380` or `default=http://localhost:2380,default=http://localhost:7001`). This needs to match the key used in the flag if you're using [static boostrapping](clustering.md#static).
### --data-dir
##### -data-dir
+ Path to the data directory.
+ default: "${name}.etcd"
+ env variable: ETCD_DATA_DIR
### --wal-dir
##### -wal-dir
+ Path to the dedicated wal directory. If this flag is set, etcd will write the WAL files to the walDir rather than the dataDir. This allows a dedicated disk to be used, and helps avoid io competition between logging and other IO operations.
+ default: ""
+ env variable: ETCD_WAL_DIR
### --snapshot-count
##### -snapshot-count
+ Number of committed transactions to trigger a snapshot to disk.
+ default: "10000"
+ env variable: ETCD_SNAPSHOT_COUNT
### --heartbeat-interval
##### -heartbeat-interval
+ Time (in milliseconds) of a heartbeat interval.
+ default: "100"
+ env variable: ETCD_HEARTBEAT_INTERVAL
### --election-timeout
##### -election-timeout
+ Time (in milliseconds) for an election to timeout. See [Documentation/tuning.md](tuning.md#time-parameters) for details.
+ default: "1000"
+ env variable: ETCD_ELECTION_TIMEOUT
### --listen-peer-urls
##### -listen-peer-urls
+ List of URLs to listen on for peer traffic. This flag tells the etcd to accept incoming requests from its peers on the specified scheme://IP:port combinations. Scheme can be either http or https.If 0.0.0.0 is specified as the IP, etcd listens to the given port on all interfaces. If an IP address is given as well as a port, etcd will listen on the given port and interface. Multiple URLs may be used to specify a number of addresses and ports to listen on. The etcd will respond to requests from any of the listed addresses and ports.
+ default: "http://localhost:2380,http://localhost:7001"
+ env variable: ETCD_LISTEN_PEER_URLS
+ example: "http://10.0.0.1:2380"
+ invalid example: "http://example.com:2380" (domain name is invalid for binding)
### --listen-client-urls
##### -listen-client-urls
+ List of URLs to listen on for client traffic. This flag tells the etcd to accept incoming requests from the clients on the specified scheme://IP:port combinations. Scheme can be either http or https. If 0.0.0.0 is specified as the IP, etcd listens to the given port on all interfaces. If an IP address is given as well as a port, etcd will listen on the given port and interface. Multiple URLs may be used to specify a number of addresses and ports to listen on. The etcd will respond to requests from any of the listed addresses and ports.
+ default: "http://localhost:2379,http://localhost:4001"
+ env variable: ETCD_LISTEN_CLIENT_URLS
+ example: "http://10.0.0.1:2379"
+ invalid example: "http://example.com:2379" (domain name is invalid for binding)
### --max-snapshots
##### -max-snapshots
+ Maximum number of snapshot files to retain (0 is unlimited)
+ default: 5
+ env variable: ETCD_MAX_SNAPSHOTS
+ The default for users on Windows is unlimited, and manual purging down to 5 (or your preference for safety) is recommended.
### --max-wals
##### -max-wals
+ Maximum number of wal files to retain (0 is unlimited)
+ default: 5
+ env variable: ETCD_MAX_WALS
+ The default for users on Windows is unlimited, and manual purging down to 5 (or your preference for safety) is recommended.
### --cors
##### -cors
+ Comma-separated white list of origins for CORS (cross-origin resource sharing).
+ default: none
+ env variable: ETCD_CORS
## Clustering Flags
### Clustering Flags
`--initial` prefix flags are used in bootstrapping ([static bootstrap][build-cluster], [discovery-service bootstrap][discovery] or [runtime reconfiguration][reconfig]) a new member, and ignored when restarting an existing member.
`-initial` prefix flags are used in bootstrapping ([static bootstrap][build-cluster], [discovery-service bootstrap][discovery] or [runtime reconfiguration][reconfig]) a new member, and ignored when restarting an existing member.
`--discovery` prefix flags need to be set when using [discovery service][discovery].
`-discovery` prefix flags need to be set when using [discovery service][discovery].
### --initial-advertise-peer-urls
##### -initial-advertise-peer-urls
+ List of this member's peer URLs to advertise to the rest of the cluster. These addresses are used for communicating etcd data around the cluster. At least one must be routable to all cluster members. These URLs can contain domain names.
+ default: "http://localhost:2380,http://localhost:7001"
+ env variable: ETCD_INITIAL_ADVERTISE_PEER_URLS
+ example: "http://example.com:2380, http://10.0.0.1:2380"
### --initial-cluster
##### -initial-cluster
+ Initial cluster configuration for bootstrapping.
+ default: "default=http://localhost:2380,default=http://localhost:7001"
+ env variable: ETCD_INITIAL_CLUSTER
+ The key is the value of the `--name` flag for each node provided. The default uses `default` for the key because this is the default for the `--name` flag.
+ The key is the value of the `-name` flag for each node provided. The default uses `default` for the key because this is the default for the `-name` flag.
### --initial-cluster-state
##### -initial-cluster-state
+ Initial cluster state ("new" or "existing"). Set to `new` for all members present during initial static or DNS bootstrapping. If this option is set to `existing`, etcd will attempt to join the existing cluster. If the wrong value is set, etcd will attempt to start but fail safely.
+ default: "new"
+ env variable: ETCD_INITIAL_CLUSTER_STATE
[static bootstrap]: clustering.md#static
### --initial-cluster-token
##### -initial-cluster-token
+ Initial cluster token for the etcd cluster during bootstrap.
+ default: "etcd-cluster"
+ env variable: ETCD_INITIAL_CLUSTER_TOKEN
### --advertise-client-urls
##### -advertise-client-urls
+ List of this member's client URLs to advertise to the rest of the cluster. These URLs can contain domain names.
+ default: "http://localhost:2379,http://localhost:4001"
+ env variable: ETCD_ADVERTISE_CLIENT_URLS
+ example: "http://example.com:2379, http://10.0.0.1:2379"
+ Be careful if you are advertising URLs such as http://localhost:2379 from a cluster member and are using the proxy feature of etcd. This will cause loops, because the proxy will be forwarding requests to itself until its resources (memory, file descriptors) are eventually depleted.
### --discovery
##### -discovery
+ Discovery URL used to bootstrap the cluster.
+ default: none
+ env variable: ETCD_DISCOVERY
### --discovery-srv
##### -discovery-srv
+ DNS srv domain used to bootstrap the cluster.
+ default: none
+ env variable: ETCD_DISCOVERY_SRV
### --discovery-fallback
##### -discovery-fallback
+ Expected behavior ("exit" or "proxy") when discovery services fails.
+ default: "proxy"
+ env variable: ETCD_DISCOVERY_FALLBACK
### --discovery-proxy
##### -discovery-proxy
+ HTTP proxy to use for traffic to discovery service.
+ default: none
+ env variable: ETCD_DISCOVERY_PROXY
### --strict-reconfig-check
+ Reject reconfiguration requests that would cause quorum loss.
+ default: false
+ env variable: ETCD_STRICT_RECONFIG_CHECK
### Proxy Flags
## Proxy Flags
`-proxy` prefix flags configures etcd to run in [proxy mode][proxy].
`--proxy` prefix flags configures etcd to run in [proxy mode][proxy].
### --proxy
##### -proxy
+ Proxy mode setting ("off", "readonly" or "on").
+ default: "off"
+ env variable: ETCD_PROXY
### --proxy-failure-wait
##### -proxy-failure-wait
+ Time (in milliseconds) an endpoint will be held in a failed state before being reconsidered for proxied requests.
+ default: 5000
+ env variable: ETCD_PROXY_FAILURE_WAIT
### --proxy-refresh-interval
##### -proxy-refresh-interval
+ Time (in milliseconds) of the endpoints refresh interval.
+ default: 30000
+ env variable: ETCD_PROXY_REFRESH_INTERVAL
### --proxy-dial-timeout
##### -proxy-dial-timeout
+ Time (in milliseconds) for a dial to timeout or 0 to disable the timeout
+ default: 1000
+ env variable: ETCD_PROXY_DIAL_TIMEOUT
### --proxy-write-timeout
##### -proxy-write-timeout
+ Time (in milliseconds) for a write to timeout or 0 to disable the timeout.
+ default: 5000
+ env variable: ETCD_PROXY_WRITE_TIMEOUT
### --proxy-read-timeout
##### -proxy-read-timeout
+ Time (in milliseconds) for a read to timeout or 0 to disable the timeout.
+ Don't change this value if you use watches because they are using long polling requests.
+ default: 0
+ env variable: ETCD_PROXY_READ_TIMEOUT
## Security Flags
### Security Flags
The security flags help to [build a secure etcd cluster][security].
### --ca-file [DEPRECATED]
+ Path to the client server TLS CA file. `--ca-file ca.crt` could be replaced by `--trusted-ca-file ca.crt --client-cert-auth` and etcd will perform the same.
##### -ca-file [DEPRECATED]
+ Path to the client server TLS CA file. `-ca-file ca.crt` could be replaced by `-trusted-ca-file ca.crt -client-cert-auth` and etcd will perform the same.
+ default: none
+ env variable: ETCD_CA_FILE
### --cert-file
##### -cert-file
+ Path to the client server TLS cert file.
+ default: none
+ env variable: ETCD_CERT_FILE
### --key-file
##### -key-file
+ Path to the client server TLS key file.
+ default: none
+ env variable: ETCD_KEY_FILE
### --client-cert-auth
##### -client-cert-auth
+ Enable client cert authentication.
+ default: false
+ env variable: ETCD_CLIENT_CERT_AUTH
### --trusted-ca-file
##### -trusted-ca-file
+ Path to the client server TLS trusted CA key file.
+ default: none
+ env variable: ETCD_TRUSTED_CA_FILE
### --peer-ca-file [DEPRECATED]
+ Path to the peer server TLS CA file. `--peer-ca-file ca.crt` could be replaced by `--peer-trusted-ca-file ca.crt --peer-client-cert-auth` and etcd will perform the same.
##### -peer-ca-file [DEPRECATED]
+ Path to the peer server TLS CA file. `-peer-ca-file ca.crt` could be replaced by `-peer-trusted-ca-file ca.crt -peer-client-cert-auth` and etcd will perform the same.
+ default: none
+ env variable: ETCD_PEER_CA_FILE
### --peer-cert-file
##### -peer-cert-file
+ Path to the peer server TLS cert file.
+ default: none
+ env variable: ETCD_PEER_CERT_FILE
### --peer-key-file
##### -peer-key-file
+ Path to the peer server TLS key file.
+ default: none
+ env variable: ETCD_PEER_KEY_FILE
### --peer-client-cert-auth
##### -peer-client-cert-auth
+ Enable peer client cert authentication.
+ default: false
+ env variable: ETCD_PEER_CLIENT_CERT_AUTH
### --peer-trusted-ca-file
##### -peer-trusted-ca-file
+ Path to the peer server TLS trusted CA file.
+ default: none
+ env variable: ETCD_PEER_TRUSTED_CA_FILE
## Logging Flags
### Logging Flags
### --debug
##### -debug
+ Drop the default log level to DEBUG for all subpackages.
+ default: false (INFO for all packages)
+ env variable: ETCD_DEBUG
### --log-package-levels
##### -log-package-levels
+ Set individual etcd subpackages to specific log levels. An example being `etcdserver=WARNING,security=DEBUG`
+ default: none (INFO for all packages)
+ env variable: ETCD_LOG_PACKAGE_LEVELS
## Unsafe Flags
### Unsafe Flags
Please be CAUTIOUS when using unsafe flags because it will break the guarantees given by the consensus protocol.
For example, it may panic if other members in the cluster are still alive.
Follow the instructions when using these flags.
### --force-new-cluster
+ Force to create a new one-member cluster. It commits configuration changes forcing to remove all existing members in the cluster and add itself. It needs to be set to [restore a backup][restore].
##### -force-new-cluster
+ Force to create a new one-member cluster. It commits configuration changes in force to remove all existing members in the cluster and add itself. It needs to be set to [restore a backup][restore].
+ default: false
+ env variable: ETCD_FORCE_NEW_CLUSTER
## Experimental Flags
### Experimental Flags
### --experimental-v3demo
+ Enable experimental [v3 demo API][rfc-v3].
##### -experimental-v3demo
+ Enable experimental [v3 demo API](rfc/v3api.proto).
+ default: false
+ env variable: ETCD_EXPERIMENTAL_V3DEMO
## Miscellaneous Flags
### Miscellaneous Flags
### --version
##### -version
+ Print the version and exit.
+ default: false
## Profiling flags
### --enable-pprof
+ Enable runtime profiling data via HTTP server. Address is at client URL + "/debug/pprof"
+ default: false
[build-cluster]: clustering.md#static
[reconfig]: runtime-configuration.md
[discovery]: clustering.md#discovery
[iana-ports]: https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml?search=etcd
[proxy]: proxy.md
[reconfig]: runtime-configuration.md
[restore]: admin_guide.md#restoring-a-backup
[rfc-v3]: rfc/v3api.md
[security]: security.md
[systemd-intro]: http://freedesktop.org/wiki/Software/systemd/
[tuning]: tuning.md#time-parameters
[restore]: admin_guide.md#restoring-a-backup

View File

@ -6,11 +6,11 @@ The procedure includes some manual steps for sanity checking but it can probably
## Prepare Release
Set desired version as environment variable for following steps. Here is an example to release 2.3.0:
Set desired version as environment variable for following steps. Here is an example to release 2.1.3:
```
export VERSION=v2.3.0
export PREV_VERSION=v2.2.5
export VERSION=v2.1.3
export PREV_VERSION=v2.1.2
```
All releases version numbers follow the format of [semantic versioning 2.0.0](http://semver.org/).
@ -30,6 +30,7 @@ All releases version numbers follow the format of [semantic versioning 2.0.0](ht
## Write Release Note
- Write introduction for the new release. For example, what major bug we fix, what new features we introduce or what performance improvement we make.
- Write changelog for the last release. ChangeLog should be straightforward and easy to understand for the end-user.
- Put `[GH XXXX]` at the head of change line to reference Pull Request that introduces the change. Moreover, add a link on it to jump to the Pull Request.
@ -60,14 +61,16 @@ It generates all release binaries and images under directory ./release.
## Sign Binaries and Images
etcd project key must be used to sign the generated binaries and images.`$SUBKEYID` is the key ID of etcd project Yubikey. Connect the key and run `gpg2 --card-status` to get the ID.
Choose appropriate private key to sign the generated binaries and images.
The following commands are used for public release sign:
```
cd release
for i in etcd-*{.zip,.tar.gz}; do gpg2 --default-key $SUBKEYID --output ${i}.asc --detach-sign ${i}; done
for i in etcd-*{.zip,.tar.gz}; do gpg2 --verify ${i}.asc ${i}; done
# personal GPG is okay for now
for i in etcd-*{.zip,.tar.gz}; do gpg --sign ${i}; done
# use `CoreOS ACI Builder <release@coreos.com>` secret key
gpg -u 88182190 -a --output etcd-${VERSION}-linux-amd64.aci.asc --detach-sig etcd-${VERSION}-linux-amd64.aci
```
## Publish Release Page in GitHub

View File

@ -4,7 +4,7 @@ Discovery service protocol helps new etcd member to discover all other members i
Discovery service protocol is _only_ used in cluster bootstrap phase, and cannot be used for runtime reconfiguration or cluster monitoring.
The protocol uses a new discovery token to bootstrap one _unique_ etcd cluster. Remember that one discovery token can represent only one etcd cluster. As long as discovery protocol on this token starts, even if it fails halfway, it must not be used to bootstrap another etcd cluster.
The protocol uses a new discovery token to bootstrap one _unique_ etcd cluster. Remember that one discovery token can represent only one etcd cluster. As long as discovery protocol on this token starts, even if fails halfway, it must not be used to bootstrap another etcd cluster.
The rest of this article will walk through the discovery process with examples that correspond to a self-hosted discovery cluster. The public discovery service, discovery.etcd.io, functions the same way, but with a layer of polish to abstract away ugly URLs, generate UUIDs automatically, and provide some protections against excessive requests. At its core, the public discovery service still uses an etcd cluster as the data store as described in this document.
@ -14,7 +14,7 @@ The idea of discovery protocol is to use an internal etcd cluster to coordinate
In the following example workflow, we will list each step of protocol in curl format for ease of understanding.
By convention the etcd discovery protocol uses the key prefix `_etcd/registry`. If `http://example.com` hosts an etcd cluster for discovery service, a full URL to discovery keyspace will be `http://example.com/v2/keys/_etcd/registry`. We will use this as the URL prefix in the example.
By convention the etcd discovery protocol uses the key prefix `_etcd/registry`. If `http://example.com` hosts a etcd cluster for discovery service, a full URL to discovery keyspace will be `http://example.com/v2/keys/_etcd/registry`. We will use this as the URL prefix in the example.
### Creating a New Discovery Token
@ -32,7 +32,7 @@ You need to specify the expected cluster size for this discovery token. The size
curl -X PUT http://example.com/v2/keys/_etcd/registry/${UUID}/_config/size -d value=${cluster_size}
```
Usually the cluster size is 3, 5 or 7. Check [optimal cluster size][cluster-size] for more details.
Usually the cluster size is 3, 5 or 7. Check [optimal cluster size](admin_guide.md#optimal-cluster-size) for more details.
### Bringing up etcd Processes
@ -64,7 +64,7 @@ In etcd implementation, the member may check the cluster status even before regi
### Waiting for All Members
The wait process is described in detail in the [etcd API documentation][api].
The wait process is described in details [here](https://github.com/coreos/etcd/blob/master/Documentation/api.md#waiting-for-a-change).
```
curl -X GET http://example.com/v2/keys/_etcd/registry/${UUID}?wait=true&waitIndex=${current_etcd_index}
@ -94,7 +94,7 @@ Possible status codes:
generated discovery url
```
The generation process in the service follows the steps from [Creating a New Discovery Token][new-discovery-token] to [Specifying the Expected Cluster Size][expected-cluster-size].
The generation process in the service follows the step from [Creating a New Discovery Token](#creating-a-new-discovery-token) to [Specifying the Expected Cluster Size](#specifying-the-expected-cluster-size).
### Check Discovery Status
@ -107,8 +107,3 @@ You can check the status for this discovery token, including the machines that h
### Open-source repository
The repository is located at https://github.com/coreos/discovery.etcd.io. You could use it to build your own public discovery service.
[api]: api.md#waiting-for-a-change
[cluster-size]: admin_guide.md#optimal-cluster-size
[expected-cluster-size]: #specifying-the-expected-cluster-size
[new-discovery-token]: #creating-a-new-discovery-token

View File

@ -12,11 +12,9 @@ export HostIP="192.168.12.50"
The following `docker run` command will expose the etcd client API over ports 4001 and 2379, and expose the peer port over 2380.
This will run the latest release version of etcd. You can specify version if needed (e.g. `quay.io/coreos/etcd:v2.2.0`).
```
docker run -d -v /usr/share/ca-certificates/:/etc/ssl/certs -p 4001:4001 -p 2380:2380 -p 2379:2379 \
--name etcd quay.io/coreos/etcd \
--name etcd quay.io/coreos/etcd:v2.0.8 \
-name etcd0 \
-advertise-client-urls http://${HostIP}:2379,http://${HostIP}:4001 \
-listen-client-urls http://0.0.0.0:2379,http://0.0.0.0:4001 \
@ -46,7 +44,7 @@ The main difference being the value used for the `-initial-cluster` flag, which
```
docker run -d -v /usr/share/ca-certificates/:/etc/ssl/certs -p 4001:4001 -p 2380:2380 -p 2379:2379 \
--name etcd quay.io/coreos/etcd \
--name etcd quay.io/coreos/etcd:v2.0.8 \
-name etcd0 \
-advertise-client-urls http://192.168.12.50:2379,http://192.168.12.50:4001 \
-listen-client-urls http://0.0.0.0:2379,http://0.0.0.0:4001 \
@ -61,7 +59,7 @@ docker run -d -v /usr/share/ca-certificates/:/etc/ssl/certs -p 4001:4001 -p 2380
```
docker run -d -v /usr/share/ca-certificates/:/etc/ssl/certs -p 4001:4001 -p 2380:2380 -p 2379:2379 \
--name etcd quay.io/coreos/etcd \
--name etcd quay.io/coreos/etcd:v2.0.8 \
-name etcd1 \
-advertise-client-urls http://192.168.12.51:2379,http://192.168.12.51:4001 \
-listen-client-urls http://0.0.0.0:2379,http://0.0.0.0:4001 \
@ -76,7 +74,7 @@ docker run -d -v /usr/share/ca-certificates/:/etc/ssl/certs -p 4001:4001 -p 2380
```
docker run -d -v /usr/share/ca-certificates/:/etc/ssl/certs -p 4001:4001 -p 2380:2380 -p 2379:2379 \
--name etcd quay.io/coreos/etcd \
--name etcd quay.io/coreos/etcd:v2.0.8 \
-name etcd2 \
-advertise-client-urls http://192.168.12.52:2379,http://192.168.12.52:4001 \
-listen-client-urls http://0.0.0.0:2379,http://0.0.0.0:4001 \

View File

@ -1,4 +1,4 @@
# Error Code
Error Code
======
This document describes the error code used in key space '/v2/keys'. Feel free to import 'github.com/coreos/etcd/error' to use.

View File

@ -37,7 +37,7 @@ timeout.
A proxy is a redirection server to the etcd cluster. The proxy handles the
redirection of a client to the current configuration of the etcd cluster. A
typical use case is to start a proxy on a machine, and on first boot up of the
typical usecase is to start a proxy on a machine, and on first boot up of the
proxy specify both the `--proxy` flag and the `--initial-cluster` flag.
From there, any etcdctl client that starts up automatically speaks to the local
@ -57,27 +57,24 @@ and their integration with the reconfiguration API.
Thus, a member that is down, even infinitely, will never be automatically
removed from the etcd cluster member list.
This makes sense because it's usually an application level / administrative
This makes sense because its usually an application level / administrative
action to determine whether a reconfiguration should happen based on health.
For more information, refer to the [runtime reconfiguration design document][runtime-reconf-design].
For more information, refer to [Documentation/runtime-reconfiguration.md].
## 6) how does --endpoint work with etcdctl?
## 6) how does --peers work with etcdctl?
The `--endpoint` flag can specify any number of etcd cluster members in a comma
The `--peers` flag can specify any number of etcd cluster members in a comma
separated list. This list might be a subset, equal to, or more than the actual
etcd cluster member list itself.
If only one peer is specified via the `--endpoint` flag, the etcdctl discovers the
If only one peer is specified via the `--peers` flag, the etcdctl discovers the
rest of the cluster via the member list of that one peer, and then it randomly
chooses a member to use. Again, the client can use the `quorum=true` flag on
reads, which will always fail when using a member in the minority.
If peers from multiple clusters are specified via the `--endpoint` flag, etcdctl
If peers from multiple clusters are specified via the `--peers` flag, etcdctl
will randomly choose a peer, and the request will simply get routed to one of
the clusters. This is probably not what you want.
Note: --peers flag is now deprecated and --endpoint should be used instead,
as it might confuse users to give etcdctl a peerURL.
[runtime-reconf-design]: runtime-reconf-design.md

View File

@ -1,35 +1,35 @@
# Glossary
## Glossary
This document defines the various terms used in etcd documentation, command line and source code.
## Node
### Node
Node is an instance of raft state machine.
It has a unique identification, and records other nodes' progress internally when it is the leader.
## Member
### Member
Member is an instance of etcd. It hosts a node, and provides service to clients.
## Cluster
### Cluster
Cluster consists of several members.
The node in each member follows raft consensus protocol to replicate logs. Cluster receives proposals from members, commits them and apply to local store.
## Peer
### Peer
Peer is another member of the same cluster.
## Proposal
### Proposal
A proposal is a request (for example a write request, a configuration change request) that needs to go through raft protocol.
## Client
### Client
Client is a caller of the cluster's HTTP API.
## Machine (deprecated)
### Machine (deprecated)
The alternative of Member in etcd before 2.0

View File

@ -46,7 +46,7 @@ ExecStart=/usr/bin/etcd
There are several error cases:
0) Init has already run and the data directory is already configured
0) Init has already ran and the data directory is already configured
1) Discovery fails because of network timeout, etc
2) Discovery fails because the cluster is already full and etcd needs to fall back to proxy
3) Static cluster configuration fails because of conflict, misconfiguration or timeout

View File

@ -1,24 +1,19 @@
# Libraries and Tools
## Libraries and Tools
**Tools**
- [etcdctl](https://github.com/coreos/etcd/tree/master/etcdctl) - A command line client for etcd
- [etcdctl](https://github.com/coreos/etcdctl) - A command line client for etcd
- [etcd-backup](https://github.com/fanhattan/etcd-backup) - A powerful command line utility for dumping/restoring etcd - Supports v2
- [etcd-dump](https://npmjs.org/package/etcd-dump) - Command line utility for dumping/restoring etcd.
- [etcd-fs](https://github.com/xetorthio/etcd-fs) - FUSE filesystem for etcd
- [etcddir](https://github.com/rekby/etcddir) - Realtime sync etcd and local directory. Work with windows and linux.
- [etcd-browser](https://github.com/henszey/etcd-browser) - A web-based key/value editor for etcd using AngularJS
- [etcd-lock](https://github.com/datawisesystems/etcd-lock) - Master election & distributed r/w lock implementation using etcd - Supports v2
- [etcd-console](https://github.com/matishsiao/etcd-console) - A web-base key/value editor for etcd using PHP
- [etcd-viewer](https://github.com/nikfoundas/etcd-viewer) - An etcd key-value store editor/viewer written in Java
- [etcdtool](https://github.com/mickep76/etcdtool) - Export/Import/Edit etcd directory as JSON/YAML/TOML and Validate directory using JSON schema
- [etcd-rest](https://github.com/mickep76/etcd-rest) - Create generic REST API in Go using etcd as a backend with validation using JSON schema
- [etcdsh](https://github.com/kamilhark/etcdsh) - A command line client with support of command history and tab completion. Supports v2
**Go libraries**
- [etcd/client](https://github.com/coreos/etcd/blob/master/client) - the officially maintained Go client
- [go-etcd](https://github.com/coreos/go-etcd) - the deprecated official client. May be useful for older (<2.0.0) versions of etcd.
- [go-etcd](https://github.com/coreos/go-etcd) - Supports v2
**Java libraries**
@ -54,7 +49,6 @@
**C++ libraries**
- [edwardcapriolo/etcdcpp](https://github.com/edwardcapriolo/etcdcpp) - Supports v2
- [suryanathan/etcdcpp](https://github.com/suryanathan/etcdcpp) - Supports v2 (with waits)
**Clojure libraries**
@ -68,7 +62,6 @@
**.Net Libraries**
- [wangjia184/etcdnet](https://github.com/wangjia184/etcdnet) - Supports v2
- [drusellers/etcetera](https://github.com/drusellers/etcetera)
**PHP Libraries**
@ -87,6 +80,10 @@
- [efrecon/etcd-tcl](https://github.com/efrecon/etcd-tcl) - Supports v2, except wait.
A detailed recap of client functionalities can be found in the [clients compatibility matrix][clients-matrix.md].
[clients-matrix.md]: https://github.com/coreos/etcd/blob/master/Documentation/clients-matrix.md
**Chef Integration**
- [coderanger/etcd-chef](https://github.com/coderanger/etcd-chef)
@ -114,7 +111,7 @@
- [configdb](https://git.autistici.org/ai/configdb/tree/master) - A REST relational abstraction on top of arbitrary database backends, aimed at storing configs and inventories.
- [scrz](https://github.com/scrz/scrz) - Container manager, stores configuration in etcd.
- [fleet](https://github.com/coreos/fleet) - Distributed init system
- [kubernetes/kubernetes](https://github.com/kubernetes/kubernetes) - Container cluster manager introduced by Google.
- [GoogleCloudPlatform/kubernetes](https://github.com/GoogleCloudPlatform/kubernetes) - Container cluster manager.
- [mailgun/vulcand](https://github.com/mailgun/vulcand) - HTTP proxy that uses etcd as a configuration backend.
- [duedil-ltd/discodns](https://github.com/duedil-ltd/discodns) - Simple DNS nameserver using etcd as a database for names and records.
- [skynetservices/skydns](https://github.com/skynetservices/skydns) - RFC compliant DNS server

View File

@ -1,120 +0,0 @@
# Members API
* [List members](#list-members)
* [Add a member](#add-a-member)
* [Delete a member](#delete-a-member)
* [Change the peer urls of a member](#change-the-peer-urls-of-a-member)
## List members
Return an HTTP 200 OK response code and a representation of all members in the etcd cluster.
### Request
```
GET /v2/members HTTP/1.1
```
### Example
```sh
curl http://10.0.0.10:2379/v2/members
```
```json
{
"members": [
{
"id": "272e204152",
"name": "infra1",
"peerURLs": [
"http://10.0.0.10:2380"
],
"clientURLs": [
"http://10.0.0.10:2379"
]
},
{
"id": "2225373f43",
"name": "infra2",
"peerURLs": [
"http://10.0.0.11:2380"
],
"clientURLs": [
"http://10.0.0.11:2379"
]
},
]
}
```
## Add a member
Returns an HTTP 201 response code and the representation of added member with a newly generated a memberID when successful. Returns a string describing the failure condition when unsuccessful.
If the POST body is malformed an HTTP 400 will be returned. If the member exists in the cluster or existed in the cluster at some point in the past an HTTP 409 will be returned. If any of the given peerURLs exists in the cluster an HTTP 409 will be returned. If the cluster fails to process the request within timeout an HTTP 500 will be returned, though the request may be processed later.
### Request
```
POST /v2/members HTTP/1.1
{"peerURLs": ["http://10.0.0.10:2380"]}
```
### Example
```sh
curl http://10.0.0.10:2379/v2/members -XPOST \
-H "Content-Type: application/json" -d '{"peerURLs":["http://10.0.0.10:2380"]}'
```
```json
{
"id": "3777296169",
"peerURLs": [
"http://10.0.0.10:2380"
]
}
```
## Delete a member
Remove a member from the cluster. The member ID must be a hex-encoded uint64.
Returns 204 with empty content when successful. Returns a string describing the failure condition when unsuccessful.
If the member does not exist in the cluster an HTTP 500(TODO: fix this) will be returned. If the cluster fails to process the request within timeout an HTTP 500 will be returned, though the request may be processed later.
### Request
```
DELETE /v2/members/<id> HTTP/1.1
```
### Example
```sh
curl http://10.0.0.10:2379/v2/members/272e204152 -XDELETE
```
## Change the peer urls of a member
Change the peer urls of a given member. The member ID must be a hex-encoded uint64. Returns 204 with empty content when successful. Returns a string describing the failure condition when unsuccessful.
If the POST body is malformed an HTTP 400 will be returned. If the member does not exist in the cluster an HTTP 404 will be returned. If any of the given peerURLs exists in the cluster an HTTP 409 will be returned. If the cluster fails to process the request within timeout an HTTP 500 will be returned, though the request may be processed later.
### Request
```
PUT /v2/members/<id> HTTP/1.1
{"peerURLs": ["http://10.0.0.10:2380"]}
```
### Example
```sh
curl http://10.0.0.10:2379/v2/members/272e204152 -XPUT \
-H "Content-Type: application/json" -d '{"peerURLs":["http://10.0.0.10:2380"]}'
```

View File

@ -1,94 +1,101 @@
# Metrics
## Metrics
**NOTE: The metrics feature is considered experimental. We may add/change/remove metrics without warning in future releases.**
**NOTE: The metrics feature is considered as an experimental. We might add/change/remove metrics without warning in the future releases.**
etcd uses [Prometheus][prometheus] for metrics reporting in the server. The metrics can be used for real-time monitoring and debugging.
etcd only stores these data in memory. If a member restarts, metrics will reset.
etcd uses [Prometheus](http://prometheus.io/) for metrics reporting in the server. The metrics can be used for real-time monitoring and debugging.
The simplest way to see the available metrics is to cURL the metrics endpoint `/metrics` of etcd. The format is described [here](http://prometheus.io/docs/instrumenting/exposition_formats/).
Follow the [Prometheus getting started doc][prometheus-getting-started] to spin up a Prometheus server to collect etcd metrics.
The naming of metrics follows the suggested [best practice of Prometheus][prometheus-naming]. A metric name has an `etcd` prefix as its namespace and a subsystem prefix (for example `wal` and `etcdserver`).
You can also follow the doc [here](http://prometheus.io/docs/introduction/getting_started/) to start a Promethus server and monitor etcd metrics.
The naming of metrics follows the suggested [best practice of Promethus](http://prometheus.io/docs/practices/naming/). A metric name has an `etcd` prefix as its namespace and a subsystem prefix (for example `wal` and `etcdserver`).
etcd now exposes the following metrics:
## etcdserver
### etcdserver
| Name | Description | Type |
|-----------------------------------------|--------------------------------------------------|-----------|
| file_descriptors_used_total | The total number of file descriptors used | Gauge |
| proposal_durations_seconds | The latency distributions of committing proposal | Histogram |
| pending_proposal_total | The total number of pending proposals | Gauge |
| proposal_failed_total | The total number of failed proposals | Counter |
| Name | Description | Type |
|-----------------------------------------|--------------------------------------------------|---------|
| file_descriptors_used_total | The total number of file descriptors used | Gauge |
| proposal_durations_milliseconds | The latency distributions of committing proposal | Summary |
| pending_proposal_total | The total number of pending proposals | Gauge |
| proposal_failed_total | The total number of failed proposals | Counter |
High file descriptors (`file_descriptors_used_total`) usage (near the file descriptors limitation of the process) indicates a potential out of file descriptors issue. That might cause etcd fails to create new WAL files and panics.
[Proposal][glossary-proposal] durations (`proposal_durations_seconds`) provides a histogram about the proposal commit latency. Latency can be introduced into this process by network and disk IO.
[Proposal](glossary.md#proposal) durations (`proposal_durations_milliseconds`) give you an summary about the proposal commit latency. Latency can be introduced into this process by network and disk IO.
Pending proposal (`pending_proposal_total`) gives you an idea about how many proposal are in the queue and waiting for commit. An increasing pending number indicates a high client load or an unstable cluster.
Failed proposals (`proposal_failed_total`) are normally related to two issues: temporary failures related to a leader election or longer duration downtime caused by a loss of quorum in the cluster.
## wal
| Name | Description | Type |
|------------------------------------|--------------------------------------------------|-----------|
| fsync_durations_seconds | The latency distributions of fsync called by wal | Histogram |
| last_index_saved | The index of the last entry saved by wal | Gauge |
### store
Abnormally high fsync duration (`fsync_durations_seconds`) indicates disk issues and might cause the cluster to be unstable.
These metrics describe the accesses into the data store of etcd members that exist in the cluster. They
are useful to count what kind of actions are taken by users. It is also useful to see and whether all etcd members
"see" the same set of data mutations, and whether reads and watches (which are local) are equally distributed.
All these metrics are prefixed with `etcd_store_`.
## http requests
These metrics describe the serving of requests (non-watch events) served by etcd members in non-proxy mode: total
incoming requests, request failures and processing latency (inc. raft rounds for storage). They are useful for tracking
user-generated traffic hitting the etcd cluster .
All these metrics are prefixed with `etcd_http_`
| Name | Description | Type |
|--------------------------------|-----------------------------------------------------------------------------------------|--------------------|
| received_total | Total number of events after parsing and auth. | Counter(method) |
| failed_total | Total number of failed events.   | Counter(method,error) |
| successful_duration_second | Bucketed handling times of the requests, including raft rounds for writes. | Histogram(method) |
| Name | Description | Type |
|---------------------------|------------------------------------------------------------------------------------------|--------------------|
| reads_total | Total number of reads from store, should differ among etcd members (local reads). | Counter(action) |
| writes_total | Total number of writes to store, should be same among all etcd members. | Counter(action) |
| reads_failed_total | Number of failed reads from store (e.g. key missing) on local reads. | Counter(action) |
| writes_failed_total | Number of failed writes to store (e.g. failed compare and swap). | Counter(action) |
| expires_total | Total number of expired keys (due to TTL).   | Counter |
| watch_requests_totals | Total number of incoming watch requests to this etcd member (local watches). | Counter |
| watchers | Current count of active watchers on this etcd member. | Gauge |
Both `reads_total` and `writes_total` count both successful and failed requests. `reads_failed_total` and
`writes_failed_total` count failed requests. A lot of failed writes indicate possible contentions on keys (e.g. when
doing `compareAndSet`), and read failures indicate that some clients try to access keys that don't exist.
Example Prometheus queries that may be useful from these metrics (across all etcd members):
* `sum(rate(etcd_http_failed_total{job="etcd"}[1m]) by (method) / sum(rate(etcd_http_events_received_total{job="etcd"})[1m]) by (method)`
* `sum(rate(etcd_store_reads_total{job="etcd"}[1m])) by (action)`
`max(rate(etcd_store_writes_total{job="etcd"}[1m])) by (action)`
Shows the fraction of events that failed by HTTP method across all members, across a time window of `1m`.
* `sum(rate(etcd_http_received_total{job="etcd",method="GET})[1m]) by (method)`
`sum(rate(etcd_http_received_total{job="etcd",method~="GET})[1m]) by (method)`
Rate of reads and writes by action, across all servers across a time window of `1m`. The reason why `max` is used
for writes as opposed to `sum` for reads is because all of etcd nodes in the cluster apply all writes to their stores.
Shows the rate of successfull readonly/write queries across all servers, across a time window of `1m`.
* `sum(rate(etcd_store_watch_requests_total{job="etcd"}[1m]))`
Shows the rate of successful readonly/write queries across all servers, across a time window of `1m`.
Shows rate of new watch requests per second. Likely driven by how often watched keys change.
* `sum(etcd_store_watchers{job="etcd"})`
* `histogram_quantile(0.9, sum(increase(etcd_http_successful_processing_seconds{job="etcd",method="GET"}[5m]) ) by (le))`
`histogram_quantile(0.9, sum(increase(etcd_http_successful_processing_seconds{job="etcd",method!="GET"}[5m]) ) by (le))`
Show the 0.90-tile latency (in seconds) of read/write (respectively) event handling across all members, with a window of `5m`.
## snapshot
| Name | Description | Type |
|--------------------------------------------|------------------------------------------------------------|-----------|
| snapshot_save_total_durations_seconds | The total latency distributions of save called by snapshot | Histogram |
Abnormally high snapshot duration (`snapshot_save_total_durations_seconds`) indicates disk issues and might cause the cluster to be unstable.
Number of active watchers across all etcd servers.
## rafthttp
### wal
| Name | Description | Type | Labels |
|-----------------------------------|--------------------------------------------|--------------|--------------------------------|
| message_sent_latency_seconds | The latency distributions of messages sent | HistogramVec | sendingType, msgType, remoteID |
| message_sent_failed_total | The total number of failed messages sent | Summary | sendingType, msgType, remoteID |
| Name | Description | Type |
|------------------------------------|--------------------------------------------------|---------|
| fsync_durations_microseconds | The latency distributions of fsync called by wal | Summary |
| last_index_saved | The index of the last entry saved by wal | Gauge |
Abnormally high fsync duration (`fsync_durations_microseconds`) indicates disk issues and might cause the cluster to be unstable.
### snapshot
| Name | Description | Type |
|--------------------------------------------|------------------------------------------------------------|---------|
| snapshot_save_total_durations_microseconds | The total latency distributions of save called by snapshot | Summary |
Abnormally high snapshot duration (`snapshot_save_total_durations_microseconds`) indicates disk issues and might cause the cluster to be unstable.
Abnormally high message duration (`message_sent_latency_seconds`) indicates network issues and might cause the cluster to be unstable.
### rafthttp
| Name | Description | Type | Labels |
|-----------------------------------|--------------------------------------------|---------|--------------------------------|
| message_sent_latency_microseconds | The latency distributions of messages sent | Summary | sendingType, msgType, remoteID |
| message_sent_failed_total | The total number of failed messages sent | Summary | sendingType, msgType, remoteID |
Abnormally high message duration (`message_sent_latency_microseconds`) indicates network issues and might cause the cluster to be unstable.
An increase in message failures (`message_sent_failed_total`) indicates more severe network issues and might cause the cluster to be unstable.
@ -99,7 +106,7 @@ Label `msgType` is the type of raft message. `MsgApp` is log replication message
Label `remoteID` is the member ID of the message destination.
## proxy
### proxy
etcd members operating in proxy mode do not do store operations. They forward all requests
to cluster instances.
@ -123,12 +130,8 @@ Example Prometheus queries that may be useful from these metrics (across all etc
* `histogram_quantile(0.9, sum(increase(etcd_proxy_events_handling_time_seconds_bucket{job="etcd",method="GET"}[5m])) by (le))`
`histogram_quantile(0.9, sum(increase(etcd_proxy_events_handling_time_seconds_bucket{job="etcd",method!="GET"}[5m])) by (le))`
Show the 0.90-tile latency (in seconds) of handling of user requests across all proxy machines, with a window of `5m`.
Show the 0.90-tile latency (in seconds) of handling of user requestsacross all proxy machines, with a window of `5m`.
* `sum(rate(etcd_proxy_dropped_total{job="etcd"}[1m])) by (proxying_error)`
Number of failed request on the proxy. This should be 0, spikes here indicate connectivity issues to etcd cluster.
[glossary-proposal]: glossary.md#proposal
[prometheus]: http://prometheus.io/
[prometheus-getting-started](http://prometheus.io/docs/introduction/getting_started/)
[prometheus-naming]: http://prometheus.io/docs/practices/naming/

View File

@ -1,28 +1,119 @@
# Miscellaneous APIs
## Members API
* [Getting the etcd version](#getting-the-etcd-version)
* [Checking health of an etcd member node](#checking-health-of-an-etcd-member-node)
* [List members](#list-members)
* [Add a member](#add-a-member)
* [Delete a member](#delete-a-member)
* [Change the peer urls of a member](#change-the-peer-urls-of-a-member)
## Getting the etcd version
## List members
The etcd version of a specific instance can be obtained from the `/version` endpoint.
Return an HTTP 200 OK response code and a representation of all members in the etcd cluster.
### Request
```
GET /v2/members HTTP/1.1
```
### Example
```sh
curl -L http://127.0.0.1:2379/version
```
```
etcd 2.0.12
```
## Checking health of an etcd member node
etcd provides a `/health` endpoint to verify the health of a particular member.
```sh
curl http://10.0.0.10:2379/health
curl http://10.0.0.10:2379/v2/members
```
```json
{"health": "true"}
{
"members": [
{
"id": "272e204152",
"name": "infra1",
"peerURLs": [
"http://10.0.0.10:2380"
],
"clientURLs": [
"http://10.0.0.10:2379"
]
},
{
"id": "2225373f43",
"name": "infra2",
"peerURLs": [
"http://10.0.0.11:2380"
],
"clientURLs": [
"http://10.0.0.11:2379"
]
},
]
}
```
## Add a member
Returns an HTTP 201 response code and the representation of added member with a newly generated a memberID when successful. Returns a string describing the failure condition when unsuccessful.
If the POST body is malformed an HTTP 400 will be returned. If the member exists in the cluster or existed in the cluster at some point in the past an HTTP 409 will be returned. If any of the given peerURLs exists in the cluster an HTTP 409 will be returned. If the cluster fails to process the request within timeout an HTTP 500 will be returned, though the request may be processed later.
### Request
```
POST /v2/members HTTP/1.1
{"peerURLs": ["http://10.0.0.10:2380"]}
```
### Example
```sh
curl http://10.0.0.10:2379/v2/members -XPOST \
-H "Content-Type: application/json" -d '{"peerURLs":["http://10.0.0.10:2380"]}'
```
```json
{
"id": "3777296169",
"peerURLs": [
"http://10.0.0.10:2380"
]
}
```
## Delete a member
Remove a member from the cluster. The member ID must be a hex-encoded uint64.
Returns 204 with empty content when successful. Returns a string describing the failure condition when unsuccessful.
If the member does not exist in the cluster an HTTP 500(TODO: fix this) will be returned. If the cluster fails to process the request within timeout an HTTP 500 will be returned, though the request may be processed later.
### Request
```
DELETE /v2/members/<id> HTTP/1.1
```
### Example
```sh
curl http://10.0.0.10:2379/v2/members/272e204152 -XDELETE
```
## Change the peer urls of a member
Change the peer urls of a given member. The member ID must be a hex-encoded uint64. Returns 204 with empty content when successful. Returns a string describing the failure condition when unsuccessful.
If the POST body is malformed an HTTP 400 will be returned. If the member does not exist in the cluster an HTTP 404 will be returned. If any of the given peerURLs exists in the cluster an HTTP 409 will be returned. If the cluster fails to process the request within timeout an HTTP 500 will be returned, though the request may be processed later.
#### Request
```
PUT /v2/members/<id> HTTP/1.1
{"peerURLs": ["http://10.0.0.10:2380"]}
```
#### Example
```sh
curl http://10.0.0.10:2379/v2/members/272e204152 -XPUT \
-H "Content-Type: application/json" -d '{"peerURLs":["http://10.0.0.10:2380"]}'
```

View File

@ -15,7 +15,7 @@ when asked
2. Update your repository data with `pkg update`
3. Install etcd with `pkg install coreos-etcd coreos-etcdctl`
3. Install etcd with `pkg install coreos­etcd coreos­etcdctl`
4. Verify successful installation with `pkg info | grep etcd` and you should get:

View File

@ -0,0 +1,4 @@
etcd is being used successfully by many companies in production. It is,
however, under active development and systems like etcd are difficult to get
correct. If you are comfortable with bleeding-edge software please use etcd and
provide us with the feedback and testing young software needs.

View File

@ -1,51 +0,0 @@
# Production Users
This document tracks people and use cases for etcd in production. By creating a list of production use cases we hope to build a community of advisors that we can reach out to with experience using various etcd applications, operation environments, and cluster sizes. The etcd development team may reach out periodically to check-in on your experience and update this list.
## discovery.etcd.io
- *Application*: https://github.com/coreos/discovery.etcd.io
- *Launched*: Feb. 2014
- *Cluster Size*: 5 members, 5 discovery proxies
- *Order of Data Size*: 100s of Megabytes
- *Operator*: CoreOS, brandon.philips@coreos.com
- *Environment*: AWS
- *Backups*: Periodic async to S3
discovery.etcd.io is the longest continuously running etcd backed service that we know about. It is the basis of automatic cluster bootstrap and was launched in Feb. 2014: https://coreos.com/blog/etcd-0.3.0-released/.
## OpenTable
- *Application*: OpenTable internal service discovery and cluster configuration management
- *Launched*: May 2014
- *Cluster Size*: 3 members each in 6 independent clusters; approximately 50 nodes reading / writing
- *Order of Data Size*: 10s of MB
- *Operator*: OpenTable, Inc; sschlansker@opentable.com
- *Environment*: AWS, VMWare
- *Backups*: None, all data can be re-created if necessary.
## cycoresys.com
- *Application*: multiple
- *Launched*: Jul. 2014
- *Cluster Size*: 3 members, _n_ proxies
- *Order of Data Size*: 100s of kilobytes
- *Operator*: CyCore Systems, Inc, sys@cycoresys.com
- *Environment*: Baremetal
- *Backups*: Periodic sync to Ceph RadosGW and DigitalOcean VM
CyCore Systems provides architecture and engineering for computing systems. This cluster provides microservices, virtual machines, databases, storage clusters to a number of clients. It is built on CoreOS machines, with each machine in the cluster running etcd as a peer or proxy.
## Radius Intelligence
- *Application*: multiple internal tools, Kubernetes clusters, bootstrappable system configs
- *Launched*: June 2015
- *Cluster Size*: 2 clusters of 5 and 3 members; approximately a dozen nodes read/write
- *Order of Data Size*: 100s of kilobytes
- *Operator*: Radius Intelligence; jcderr@radius.com
- *Environment*: AWS, CoreOS, Kubernetes
- *Backups*: None, all data can be recreated if necessary.
Radius Intelligence uses Kubernetes running CoreOS to containerize and scale internal toolsets. Examples include running [JetBrains TeamCity][teamcity] and internal AWS security and cost reporting tools. etcd clusters back these clusters as well as provide some basic environment bootstrapping configuration keys.
[teamcity]: https://www.jetbrains.com/teamcity/

View File

@ -1,153 +1,37 @@
# Proxy
## Proxy
etcd can run as a transparent proxy. Doing so allows for easy discovery of etcd within your infrastructure, since it can run on each machine as a local service. In this mode, etcd acts as a reverse proxy and forwards client requests to an active etcd cluster. The etcd proxy does not participate in the consensus replication of the etcd cluster, thus it neither increases the resilience nor decreases the write performance of the etcd cluster.
etcd can now run as a transparent proxy. Running etcd as a proxy allows for easily discovery of etcd within your infrastructure, since it can run on each machine as a local service. In this mode, etcd acts as a reverse proxy and forwards client requests to an active etcd cluster. The etcd proxy does not participate in the consensus replication of the etcd cluster, thus it neither increases the resilience nor decreases the write performance of the etcd cluster.
etcd currently supports two proxy modes: `readwrite` and `readonly`. The default mode is `readwrite`, which forwards both read and write requests to the etcd cluster. A `readonly` etcd proxy only forwards read requests to the etcd cluster, and returns `HTTP 501` to all write requests.
etcd currently supports two proxy modes: `readwrite` and `readonly`. The default mode is `readwrite`, which forwards both read and write requests to the etcd cluster. A `readonly` etcd proxy only forwards read requests to the etcd cluster, and returns `HTTP 501` to all write requests.
The proxy will shuffle the list of cluster members periodically to avoid sending all connections to a single member.
The member list used by an etcd proxy consists of all client URLs advertised in the cluster. These client URLs are specified in each etcd cluster member's `advertise-client-urls` option.
The member list used by proxy consists of all client URLs advertised within the cluster, as specified in each members' `-advertise-client-urls` flag. If this flag is set incorrectly, requests sent to the proxy are forwarded to wrong addresses and then fail. Including URLs in the `-advertise-client-urls` flag that point to the proxy itself, e.g. http://localhost:2379, is even more problematic as it will cause loops, because the proxy keeps trying to forward requests to itself until its resources (memory, file descriptors) are eventually depleted. The fix for this problem is to restart etcd member with correct `-advertise-client-urls` flag. After client URLs list in proxy is recalculated, which happens every 30 seconds, requests will be forwarded correctly.
An etcd proxy examines several command-line options to discover its peer URLs. In order of precedence, these options are `discovery`, `discovery-srv`, and `initial-cluster`. The `initial-cluster` option is set to a comma-separated list of one or more etcd peer URLs used temporarily in order to discover the permanent cluster.
After establishing a list of peer URLs in this manner, the proxy retrieves the list of client URLs from the first reachable peer. These client URLs are specified by the `advertise-client-urls` option to etcd peers. The proxy then continues to connect to the first reachable etcd cluster member every thirty seconds to refresh the list of client URLs.
While etcd proxies therefore do not need to be given the `advertise-client-urls` option, as they retrieve this configuration from the cluster, this implies that `initial-cluster` must be set correctly for every proxy, and the `advertise-client-urls` option must be set correctly for every non-proxy, first-order cluster peer. Otherwise, requests to any etcd proxy would be forwarded improperly. Take special care not to set the `advertise-client-urls` option to URLs that point to the proxy itself, as such a configuration will cause the proxy to enter a loop, forwarding requests to itself until resources are exhausted. To correct either case, stop etcd and restart it with the correct URLs.
[This example Procfile][procfile] illustrates the difference in the etcd peer and proxy command lines used to configure and start a cluster with one proxy under the [goreman process management utility][goreman].
To summarize etcd proxy startup and peer discovery:
1. etcd proxies execute the following steps in order until the cluster *peer-urls* are known:
1. If `discovery` is set for the proxy, ask the given discovery service for
the *peer-urls*. The *peer-urls* will be the combined
`initial-advertise-peer-urls` of all first-order, non-proxy cluster
members.
2. If `discovery-srv` is set for the proxy, the *peer-urls* are discovered
from DNS.
3. If `initial-cluster` is set for the proxy, that will become the value of
*peer-urls*.
4. Otherwise use the default value of
`http://localhost:2380,http://localhost:7001`.
2. These *peer-urls* are used to contact the (non-proxy) members of the cluster
to find their *client-urls*. The *client-urls* will thus be the combined
`advertise-client-urls` of all cluster members (i.e. non-proxies).
3. Request of clients of the proxy will be forwarded (proxied) to these
*client-urls*.
Always start the first-order etcd cluster members first, then any proxies. A proxy must be able to reach the cluster members to retrieve its configuration, and will attempt connections somewhat aggressively in the absence of such a channel. Starting the members before any proxy ensures the proxy can discover the client URLs when it later starts.
## Using an etcd proxy
To start etcd in proxy mode, you need to provide three flags: `proxy`, `listen-client-urls`, and `initial-cluster` (or `discovery`).
### Using an etcd proxy
To start etcd in proxy mode, you need to provide three flags: `proxy`, `listen-client-urls`, and `initial-cluster` (or `discovery`).
To start a readwrite proxy, set `-proxy on`; To start a readonly proxy, set `-proxy readonly`.
The proxy will be listening on `listen-client-urls` and forward requests to the etcd cluster discovered from in `initial-cluster` or `discovery` url.
The proxy will be listening on `listen-client-urls` and forward requests to the etcd cluster discovered from in `initial-cluster` or `discovery` url.
### Start an etcd proxy with a static configuration
#### Start an etcd proxy with a static configuration
To start a proxy that will connect to a statically defined etcd cluster, specify the `initial-cluster` flag:
```
etcd --proxy on \
--listen-client-urls http://127.0.0.1:8080 \
--initial-cluster infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380
etcd -proxy on -listen-client-urls http://127.0.0.1:8080 -initial-cluster infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380
```
### Start an etcd proxy with the discovery service
If you bootstrap an etcd cluster using the [discovery service][discovery-service], you can also start the proxy with the same `discovery`.
#### Start an etcd proxy with the discovery service
If you bootstrap an etcd cluster using the [discovery service][discovery-service], you can also start the proxy with the same `discovery`.
To start a proxy using the discovery service, specify the `discovery` flag. The proxy will wait until the etcd cluster defined at the `discovery` url finishes bootstrapping, and then start to forward the requests.
To start a proxy using the discovery service, specify the `discovery` flag. The proxy will wait until the etcd cluster defined at the `discovery` url finishes bootstrapping, and then start to forward the requests.
```
etcd --proxy on \
--listen-client-urls http://127.0.0.1:8080 \
--discovery https://discovery.etcd.io/3e86b59982e49066c5d813af1c2e2579cbf573de \
etcd -proxy on -listen-client-urls http://127.0.0.1:8080 -discovery https://discovery.etcd.io/3e86b59982e49066c5d813af1c2e2579cbf573de
```
## Fallback to proxy mode with discovery service
If you bootstrap an etcd cluster using [discovery service][discovery-service] with more than the expected number of etcd members, the extra etcd processes will fall back to being `readwrite` proxies by default. They will forward the requests to the cluster as described above. For example, if you create a discovery url with `size=5`, and start ten etcd processes using that same discovery url, the result will be a cluster with five etcd members and five proxies. Note that this behaviour can be disabled with the `discovery-fallback='exit'` flag.
## Promote a proxy to a member of etcd cluster
A Proxy is in the part of etcd cluster that does not participate in consensus. A proxy will not promote itself to an etcd member that participates in consensus automatically in any case.
If you want to promote a proxy to an etcd member, there are four steps you need to follow:
- use etcdctl to add the proxy node as an etcd member into the existing cluster
- stop the etcd proxy process or service
- remove the existing proxy data directory
- restart the etcd process with new member configuration
## Example
We assume you have a one member etcd cluster with one proxy. The cluster information is listed below:
|Name|Address|
|------|---------|
|infra0|10.0.1.10|
|proxy0|10.0.1.11|
This example walks you through a case that you promote one proxy to an etcd member. The cluster will become a two member cluster after finishing the four steps.
### Add a new member into the existing cluster
First, use etcdctl to add the member to the cluster, which will output the environment variables need to correctly configure the new member:
``` bash
$ etcdctl -endpoint http://10.0.1.10:2379 member add infra1 http://10.0.1.11:2380
added member 9bf1b35fc7761a23 to cluster
ETCD_NAME="infra1"
ETCD_INITIAL_CLUSTER="infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380"
ETCD_INITIAL_CLUSTER_STATE=existing
```
### Stop the proxy process
Stop the existing proxy so we can wipe it's state on disk and reload it with the new configuration:
``` bash
px aux | grep etcd
kill %etcd_proxy_pid%
```
or (if you are running etcd proxy as etcd service under systemd)
``` bash
sudo systemctl stop etcd
```
### Remove the existing proxy data dir
``` bash
rm -rf %data_dir%/proxy
```
### Start etcd as a new member
Finally, start the reconfigured member and make sure it joins the cluster correctly:
``` bash
$ export ETCD_NAME="infra1"
$ export ETCD_INITIAL_CLUSTER="infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380"
$ export ETCD_INITIAL_CLUSTER_STATE=existing
$ etcd --listen-client-urls http://10.0.1.11:2379 \
--advertise-client-urls http://10.0.1.11:2379 \
--listen-peer-urls http://10.0.1.11:2380 \
--initial-advertise-peer-urls http://10.0.1.11:2380 \
--data-dir %data_dir%
```
If you are running etcd under systemd, you should modify the service file with correct configuration and restart the service:
``` bash
sudo systemd restart etcd
```
If an error occurs, check the [add member troubleshooting doc][runtime-configuration].
#### Fallback to proxy mode with discovery service
If you bootstrap a etcd cluster using [discovery service][discovery-service] with more than the expected number of etcd members, the extra etcd processes will fall back to being `readwrite` proxies by default. They will forward the requests to the cluster as described above. For example, if you create a discovery url with `size=5`, and start ten etcd processes using that same discovery url, the result will be a cluster with five etcd members and five proxies. Note that this behaviour can be disabled with the `proxy-fallback` flag.
[discovery-service]: clustering.md#discovery
[goreman]: https://github.com/mattn/goreman
[procfile]: /Procfile
[runtime-configuration]: runtime-configuration.md#error-cases-when-adding-members

View File

@ -1,6 +1,6 @@
# Reporting Bugs
## Reporting Bugs
If you find bugs or documentation mistakes in the etcd project, please let us know by [opening an issue][issue]. We treat bugs and mistakes very seriously and believe no issue is too small. Before creating a bug report, please check that an issue reporting the same problem does not already exist.
If you find bugs or documentation mistakes in etcd project, please let us know by [opening an issue](https://github.com/coreos/etcd/issues/new). We treat bugs and mistakes very seriously and believe no issue is too small. Before creating a bug report, please check there that one does not already exist.
To make your bug report accurate and easy to understand, please try to create bug reports that are:
@ -14,13 +14,13 @@ To make your bug report accurate and easy to understand, please try to create bu
- Scoped. One bug per report. Do not follow up with another bug inside one report.
You might also want to read [Elika Etemads article on filing good bug reports][filing-good-bugs] before creating a bug report.
You might also want to read [Elika Etemads article on filing good bug reports](http://fantasai.inkedblade.net/style/talks/filing-good-bugs/) before creating a bug report.
We might ask you for further information to locate a bug. A duplicated bug report will be closed.
## Frequently Asked Questions
### How to get a stack trace
### How to get stack trace
``` bash
$ kill -QUIT $PID
@ -41,5 +41,3 @@ $ sudo journalctl -u etcd2
Due to an upstream systemd bug, journald may miss the last few log lines when its process exit. If journalctl tells you that etcd stops without fatal or panic message, you could try `sudo journalctl -f -t etcd2` to get full log.
[etcd-issue]: https://github.com/coreos/etcd/issues/new
[filing-good-bugs]: http://fantasai.inkedblade.net/style/talks/filing-good-bugs/

View File

@ -1,10 +1,4 @@
# Overview
The etcd v3 API is designed to give users a more efficient and cleaner abstraction compared to etcd v2. There are a number of semantic and protocol changes in this new API. For an overview [see Xiang Li's video](https://youtu.be/J5AioGtEPeQ?t=211).
To prove out the design of the v3 API the team has also built [a number of example recipes](https://github.com/coreos/etcd/tree/master/contrib/recipes), there is a [video discussing these recipes too](https://www.youtube.com/watch?v=fj-2RY-3yVU&feature=youtu.be&t=590).
# Design
## Design
1. Flatten binary key-value space
@ -34,24 +28,13 @@ To prove out the design of the v3 API the team has also built [a number of examp
- easy for people to write simple etcd application
## Notes
### Request Size Limitation
The max request size is around 1MB. Since etcd replicates requests in a streaming fashion, a very large
request might block other requests for a long time. The use case for etcd is to store small configuration
values, so we prevent user from submitting large requests. This also applies to Txn requests. We might loosen
the size in the future a little bit or make it configurable.
## Protobuf Defined API
[api protobuf][api-protobuf]
[protobuf](./v3api.proto)
[kv protobuf][kv-protobuf]
### Examples
## Examples
### Put a key (foo=bar)
#### Put a key (foo=bar)
```
// A put is always successful
Put( PutRequest { key = foo, value = bar } )
@ -64,7 +47,7 @@ PutResponse {
}
```
### Get a key (assume we have foo=bar)
#### Get a key (assume we have foo=bar)
```
Get ( RangeRequest { key = foo } )
@ -85,7 +68,7 @@ RangeResponse {
}
```
### Range over a key space (assume we have foo0=bar0… foo100=bar100)
#### Range over a key space (assume we have foo0=bar0… foo100=bar100)
```
Range ( RangeRequest { key = foo, end_key = foo80, limit = 30 } )
@ -114,7 +97,7 @@ RangeResponse {
}
```
### Finish a txn (assume we have foo0=bar0, foo1=bar1)
#### Finish a txn (assume we have foo0=bar0, foo1=bar1)
```
Txn(TxnRequest {
// mod_revision of foo0 is equal to 1, mod_revision of foo1 is greater than 1
@ -146,7 +129,7 @@ TxnResponse {
}
```
### Watch on a key/range
#### Watch on a key/range
```
Watch( WatchRequest{
@ -206,6 +189,3 @@ WatchResponse {
```
[api-protobuf]: https://github.com/coreos/etcd/blob/master/etcdserver/etcdserverpb/rpc.proto
[kv-protobuf]: https://github.com/coreos/etcd/blob/master/storage/storagepb/kv.proto

View File

@ -0,0 +1,285 @@
syntax = "proto3";
// Interface exported by the server.
service etcd {
// Range gets the keys in the range from the store.
rpc Range(RangeRequest) returns (RangeResponse) {}
// Put puts the given key into the store.
// A put request increases the revision of the store,
// and generates one event in the event history.
rpc Put(PutRequest) returns (PutResponse) {}
// Delete deletes the given range from the store.
// A delete request increase the revision of the store,
// and generates one event in the event history.
rpc DeleteRange(DeleteRangeRequest) returns (DeleteRangeResponse) {}
// Txn processes all the requests in one transaction.
// A txn request increases the revision of the store,
// and generates events with the same revision in the event history.
rpc Txn(TxnRequest) returns (TxnResponse) {}
// Watch watches the events happening or happened in etcd. Both input and output
// are stream. One watch rpc can watch for multiple ranges and get a stream of
// events. The whole events history can be watched unless compacted.
rpc WatchRange(stream WatchRangeRequest) returns (stream WatchRangeResponse) {}
// Compact compacts the event history in etcd. User should compact the
// event history periodically, or it will grow infinitely.
rpc Compact(CompactionRequest) returns (CompactionResponse) {}
// LeaseCreate creates a lease. A lease has a TTL. The lease will expire if the
// server does not receive a keepAlive within TTL from the lease holder.
// All keys attached to the lease will be expired and deleted if the lease expires.
// The key expiration generates an event in event history.
rpc LeaseCreate(LeaseCreateRequest) returns (LeaseCreateResponse) {}
// LeaseRevoke revokes a lease. All the key attached to the lease will be expired and deleted.
rpc LeaseRevoke(LeaseRevokeRequest) returns (LeaseRevokeResponse) {}
// LeaseAttach attaches keys with a lease.
rpc LeaseAttach(LeaseAttachRequest) returns (LeaseAttachResponse) {}
// LeaseTxn likes Txn. It has two addition success and failure LeaseAttachRequest list.
// If the Txn is successful, then the success list will be executed. Or the failure list
// will be executed.
rpc LeaseTxn(LeaseTxnRequest) returns (LeaseTxnResponse) {}
// KeepAlive keeps the lease alive.
rpc LeaseKeepAlive(stream LeaseKeepAliveRequest) returns (stream LeaseKeepAliveResponse) {}
}
message ResponseHeader {
// an error type message?
string error = 1;
uint64 cluster_id = 2;
uint64 member_id = 3;
// revision of the store when the request was applied.
int64 revision = 4;
// term of raft when the request was applied.
uint64 raft_term = 5;
}
message RangeRequest {
// if the range_end is not given, the request returns the key.
bytes key = 1;
// if the range_end is given, it gets the keys in range [key, range_end).
bytes range_end = 2;
// limit the number of keys returned.
int64 limit = 3;
// range over the store at the given revision.
// if revision is less or equal to zero, range over the newest store.
// if the revision has been compacted, ErrCompaction will be returned in
// response.
int64 revision = 4;
}
message RangeResponse {
ResponseHeader header = 1;
repeated storagepb.KeyValue kvs = 2;
// more indicates if there are more keys to return in the requested range.
bool more = 3;
}
message PutRequest {
bytes key = 1;
bytes value = 2;
}
message PutResponse {
ResponseHeader header = 1;
}
message DeleteRangeRequest {
// if the range_end is not given, the request deletes the key.
bytes key = 1;
// if the range_end is given, it deletes the keys in range [key, range_end).
bytes range_end = 2;
}
message DeleteRangeResponse {
ResponseHeader header = 1;
}
message RequestUnion {
oneof request {
RangeRequest request_range = 1;
PutRequest request_put = 2;
DeleteRangeRequest request_delete_range = 3;
}
}
message ResponseUnion {
oneof response {
RangeResponse response_range = 1;
PutResponse response_put = 2;
DeleteRangeResponse response_delete_range = 3;
}
}
message Compare {
enum CompareResult {
EQUAL = 0;
GREATER = 1;
LESS = 2;
}
enum CompareTarget {
VERSION = 0;
CREATE = 1;
MOD = 2;
VALUE= 3;
}
CompareResult result = 1;
CompareTarget target = 2;
// key path
bytes key = 3;
oneof target_union {
// version of the given key
int64 version = 4;
// create revision of the given key
int64 create_revision = 5;
// last modified revision of the given key
int64 mod_revision = 6;
// value of the given key
bytes value = 7;
}
}
// If the comparisons succeed, then the success requests will be processed in order,
// and the response will contain their respective responses in order.
// If the comparisons fail, then the failure requests will be processed in order,
// and the response will contain their respective responses in order.
// From google paxosdb paper:
// Our implementation hinges around a powerful primitive which we call MultiOp. All other database
// operations except for iteration are implemented as a single call to MultiOp. A MultiOp is applied atomically
// and consists of three components:
// 1. A list of tests called guard. Each test in guard checks a single entry in the database. It may check
// for the absence or presence of a value, or compare with a given value. Two different tests in the guard
// may apply to the same or different entries in the database. All tests in the guard are applied and
// MultiOp returns the results. If all tests are true, MultiOp executes t op (see item 2 below), otherwise
// it executes f op (see item 3 below).
// 2. A list of database operations called t op. Each operation in the list is either an insert, delete, or
// lookup operation, and applies to a single database entry. Two different operations in the list may apply
// to the same or different entries in the database. These operations are executed
// if guard evaluates to
// true.
// 3. A list of database operations called f op. Like t op, but executed if guard evaluates to false.
message TxnRequest {
repeated Compare compare = 1;
repeated RequestUnion success = 2;
repeated RequestUnion failure = 3;
}
message TxnResponse {
ResponseHeader header = 1;
bool succeeded = 2;
repeated ResponseUnion responses = 3;
}
message KeyValue {
bytes key = 1;
int64 create_revision = 2;
// mod_revision is the last modified revision of the key.
int64 mod_revision = 3;
// version is the version of the key. A deletion resets
// the version to zero and any modification of the key
// increases its version.
int64 version = 4;
bytes value = 5;
}
message WatchRangeRequest {
// if the range_end is not given, the request returns the key.
bytes key = 1;
// if the range_end is given, it gets the keys in range [key, range_end).
bytes range_end = 2;
// start_revision is an optional revision (including) to watch from. No start_revision is "now".
int64 start_revision = 3;
// end_revision is an optional revision (excluding) to end watch. No end_revision is "forever".
int64 end_revision = 4;
bool progress_notification = 5;
}
message WatchRangeResponse {
ResponseHeader header = 1;
repeated Event events = 2;
}
message Event {
enum EventType {
PUT = 0;
DELETE = 1;
EXPIRE = 2;
}
EventType event_type = 1;
// a put event contains the current key-value
// a delete/expire event contains the previous
// key-value
KeyValue kv = 2;
}
// Compaction compacts the kv store upto the given revision (including).
// It removes the old versions of a key. It keeps the newest version of
// the key even if its latest modification revision is smaller than the given
// revision.
message CompactionRequest {
int64 revision = 1;
}
message CompactionResponse {
ResponseHeader header = 1;
}
message LeaseCreateRequest {
// advisory ttl in seconds
int64 ttl = 1;
}
message LeaseCreateResponse {
ResponseHeader header = 1;
int64 lease_id = 2;
// server decided ttl in second
int64 ttl = 3;
string error = 4;
}
message LeaseRevokeRequest {
int64 lease_id = 1;
}
message LeaseRevokeResponse {
ResponseHeader header = 1;
}
message LeaseTxnRequest {
TxnRequest request = 1;
repeated LeaseAttachRequest success = 2;
repeated LeaseAttachRequest failure = 3;
}
message LeaseTxnResponse {
ResponseHeader header = 1;
TxnResponse response = 2;
repeated LeaseAttachResponse attach_responses = 3;
}
message LeaseAttachRequest {
int64 lease_id = 1;
bytes key = 2;
}
message LeaseAttachResponse {
ResponseHeader header = 1;
}
message LeaseKeepAliveRequest {
int64 lease_id = 1;
}
message LeaseKeepAliveResponse {
ResponseHeader header = 1;
int64 lease_id = 2;
int64 ttl = 3;
}

View File

@ -1,14 +1,16 @@
# Runtime Reconfiguration
## Runtime Reconfiguration
etcd comes with support for incremental runtime reconfiguration, which allows users to update the membership of the cluster at run time.
Reconfiguration requests can only be processed when the majority of the cluster members are functioning. It is **highly recommended** to always have a cluster size greater than two in production. It is unsafe to remove a member from a two member cluster. The majority of a two member cluster is also two. If there is a failure during the removal process, the cluster might not able to make progress and need to [restart from majority failure][majority failure].
Reconfiguration requests can only be processed when the the majority of the cluster members are functioning. It is **highly recommended** to always have a cluster size greater than two in production. It is unsafe to remove a member from a two member cluster. The majority of a two member cluster is also two. If there is a failure during the removal process, the cluster might not able to make progress and need to [restart from majority failure][majority failure].
To better understand the design behind runtime reconfiguration, we suggest you read [the runtime reconfiguration document][runtime-reconf].
To better understand the design behind runtime reconfiguration, we suggest you read [this](runtime-reconf-design.md).
[majority failure]: #restart-cluster-from-majority-failure
## Reconfiguration Use Cases
Let's walk through some common reasons for reconfiguring a cluster. Most of these just involve combinations of adding or removing a member, which are explained below under [Cluster Reconfiguration Operations][cluster-reconf].
Let us walk through some common reasons for reconfiguring a cluster. Most of these just involve combinations of adding or removing a member, which are explained below under [Cluster Reconfiguration Operations](#cluster-reconfiguration-operations).
### Cycle or Upgrade Multiple Machines
@ -16,23 +18,33 @@ If you need to move multiple members of your cluster due to planned maintenance
It is safe to remove the leader, however there is a brief period of downtime while the election process takes place. If your cluster holds more than 50MB, it is recommended to [migrate the member's data directory][member migration].
[member migration]: admin_guide.md#member-migration
### Change the Cluster Size
Increasing the cluster size can enhance [failure tolerance][fault tolerance table] and provide better read performance. Since clients can read from any member, increasing the number of members increases the overall read throughput.
Decreasing the cluster size can improve the write performance of a cluster, with a trade-off of decreased resilience. Writes into the cluster are replicated to a majority of members of the cluster before considered committed. Decreasing the cluster size lowers the majority, and each write is committed more quickly.
[fault tolerance table]: admin_guide.md#fault-tolerance-table
### Replace A Failed Machine
If a machine fails due to hardware failure, data directory corruption, or some other fatal situation, it should be replaced as soon as possible. Machines that have failed but haven't been removed adversely affect your quorum and reduce the tolerance for an additional failure.
To replace the machine, follow the instructions for [removing the member][remove member] from the cluster, and then [add a new member][add member] in its place. If your cluster holds more than 50MB, it is recommended to [migrate the failed member's data directory][member migration] if you can still access it.
[remove member]: #remove-a-member
[add member]: #add-a-new-member
### Restart Cluster from Majority Failure
If the majority of your cluster is lost or all of your nodes have changed IP addresses, then you need to take manual action in order to recover safely.
The basic steps in the recovery process include [creating a new cluster using the old data][disaster recovery], forcing a single member to act as the leader, and finally using runtime configuration to [add new members][add member] to this new cluster one at a time.
[add member]: #add-a-new-member
[disaster recovery]: admin_guide.md#disaster-recovery
## Cluster Reconfiguration Operations
Now that we have the use cases in mind, let us lay out the operations involved in each.
@ -48,24 +60,11 @@ All changes to the cluster are done one at a time:
* To decrease from 5 to 3 you will make two remove operations
All of these examples will use the `etcdctl` command line tool that ships with etcd.
If you want to use the members API directly you can find the documentation [here][member-api].
If you want to use the member API directly you can find the documentation [here](other_apis.md).
### Update a Member
#### Update advertise client URLs
If you would like to update the advertise client URLs of a member, you can simply restart
that member with updated client urls flag (`--advertise-client-urls`) or environment variable
(`ETCD_ADVERTISE_CLIENT_URLS`). The restarted member will self publish the updated URLs.
A wrongly updated client URL will not affect the health of the etcd cluster.
#### Update advertise peer URLs
If you would like to update the advertise peer URLs of a member, you have to first update
it explicitly via member command and then restart the member. The additional action is required
since updating peer URLs changes the cluster wide configuration and can affect the health of the etcd cluster.
To update the peer URLs, first, we need to find the target member's ID. You can list all members with `etcdctl`:
If you would like to update a member IP address (peerURLs), first, we need to find the target member's ID. You can list all members with `etcdctl`:
```sh
$ etcdctl member list
@ -103,10 +102,10 @@ It is safe to remove the leader, however the cluster will be inactive while a ne
Adding a member is a two step process:
* Add the new member to the cluster via the [members API][member-api] or the `etcdctl member add` command.
* Add the new member to the cluster via the [members API](other_apis.md#post-v2members) or the `etcdctl member add` command.
* Start the new member with the new cluster configuration, including a list of the updated members (existing members + the new member).
Using `etcdctl` let's add the new member to the cluster by specifying its [name][conf-name] and [advertised peer URLs][conf-adv-peer]:
Using `etcdctl` let's add the new member to the cluster by specifying its [name](configuration.md#-name) and [advertised peer URLs](configuration.md#-initial-advertise-peer-urls):
```sh
$ etcdctl member add infra3 http://10.0.1.13:2380
@ -132,7 +131,7 @@ The new member will run as a part of the cluster and immediately begin catching
If you are adding multiple members the best practice is to configure a single member at a time and verify it starts correctly before adding more new members.
If you add a new member to a 1-node cluster, the cluster cannot make progress before the new member starts because it needs two members as majority to agree on the consensus. You will only see this behavior between the time `etcdctl member add` informs the cluster about the new member and the new member successfully establishing a connection to the existing one.
#### Error Cases When Adding Members
#### Error Cases
In the following case we have not included our new host in the list of enumerated nodes.
If this is a new cluster, the node must be added to the list of initial cluster members.
@ -155,30 +154,10 @@ etcdserver: assign ids error: unmatched member while checking PeerURLs
exit 1
```
When we start etcd using the data directory of a removed member, etcd will exit automatically if it connects to any active member in the cluster:
When we start etcd using the data directory of a removed member, etcd will exit automatically if it connects to any alive member in the cluster:
```sh
$ etcd
etcd: this member has been permanently removed from the cluster. Exiting.
exit 1
```
### Strict Reconfiguration Check Mode (`-strict-reconfig-check`)
As described in the above, the best practice of adding new members is to configure a single member at a time and verify it starts correctly before adding more new members. This step by step approach is very important because if newly added members is not configured correctly (for example the peer URLs are incorrect), the cluster can lose quorum. The quorum loss happens since the newly added member are counted in the quorum even if that member is not reachable from other existing members. Also quorum loss might happen if there is a connectivity issue or there are operational issues.
For avoiding this problem, etcd provides an option `-strict-reconfig-check`. If this option is passed to etcd, etcd rejects reconfiguration requests if the number of started members will be less than a quorum of the reconfigured cluster.
It is recommended to enable this option. However, it is disabled by default because of keeping compatibility.
[add member]: #add-a-new-member
[cluster-reconf]: #cluster-reconfiguration-operations
[conf-adv-peer]: configuration.md#-initial-advertise-peer-urls
[conf-name]: configuration.md#-name
[disaster recovery]: admin_guide.md#disaster-recovery
[fault tolerance table]: admin_guide.md#fault-tolerance-table
[majority failure]: #restart-cluster-from-majority-failure
[member-api]: members_api.md
[member migration]: admin_guide.md#member-migration
[remove member]: #remove-a-member
[runtime-reconf]: runtime-reconf-design.md

View File

@ -1,12 +1,12 @@
# Design of Runtime Reconfiguration
### Design of Runtime Reconfiguration
Runtime reconfiguration is one of the hardest and most error prone features in a distributed system, especially in a consensus based system like etcd.
Read on to learn about the design of etcd's runtime reconfiguration commands and how we tackled these problems.
## Two Phase Config Changes Keep you Safe
### Two Phase Config Changes Keep you Safe
In etcd, every runtime reconfiguration has to go through [two phases][add-member] for safety reasons. For example, to add a member you need to first inform cluster of new configuration and then start the new member.
In etcd, every runtime reconfiguration has to go through [two phases](Documentation/runtime-configuration.md#add-a-new-member) for safety reasons. For example, to add a member you need to first inform cluster of new configuration and then start the new member.
Phase 1 - Inform cluster of new configuration
@ -22,15 +22,15 @@ Without the explicit workflow around cluster membership etcd would be vulnerable
We think runtime reconfiguration should be a low frequent operation. We made the decision to keep it explicit and user-driven to ensure configuration safety and keep your cluster always running smoothly under your control.
## Permanent Loss of Quorum Requires New Cluster
### Permanent Loss of Quorum Requires New Cluster
If a cluster permanently loses a majority of its members, a new cluster will need to be started from an old data directory to recover the previous state.
It is entirely possible to force removing the failed members from the existing cluster to recover. However, we decided not to support this method since it bypasses the normal consensus committing phase, which is unsafe. If the member to remove is not actually dead or you force to remove different members through different members in the same cluster, you will end up with diverged cluster with same clusterID. This is very dangerous and hard to debug/fix afterwards.
If you have a correct deployment, the possibility of permanent majority lose is very low. But it is a severe enough problem that worth special care. We strongly suggest you to read the [disaster recovery documentation][disaster-recovery] and prepare for permanent majority lose before you put etcd into production.
If you have a correct deployment, the possibility of permanent majority lose is very low. But it is a severe enough problem that worth special care. We strongly suggest you to read the [disaster recovery documentation](admin_guide.md#disaster-recovery) and prepare for permanent majority lose before you put etcd into production.
## Do Not Use Public Discovery Service For Runtime Reconfiguration
### Do Not Use Public Discovery Service For Runtime Reconfiguration
The public discovery service should only be used for bootstrapping a cluster. To join member into an existing cluster, you should use runtime reconfiguration API.
@ -38,13 +38,10 @@ Discovery service is designed for bootstrapping an etcd cluster in the cloud env
It seems that using public discovery service is a convenient way to do runtime reconfiguration, after all discovery service already has all the cluster configuration information. However relying on public discovery service brings troubles:
1. it introduces external dependencies for the entire life-cycle of your cluster, not just bootstrap time. If there is a network issue between your cluster and public discovery service, your cluster will suffer from it.
1. it introduces a external dependencies for the entire life-cycle of your cluster, not just bootstrap time. If there is a network issue between your cluster and public discover service, your cluster will suffer from it.
2. public discovery service must reflect correct runtime configuration of your cluster during it life-cycle. It has to provide security mechanism to avoid bad actions, and it is hard.
3. public discovery service has to keep tens of thousands of cluster configurations. Our public discovery service backend is not ready for that workload.
If you want to have a discovery service that supports runtime reconfiguration, the best choice is to build your private one.
[add-member]: runtime-configuration.md#add-a-new-member
[disaster-recovery]: admin_guide.md#disaster-recovery
If you want to have a discovery service that supports runtime reconfiguration, the best choice is to build your private one.

View File

@ -1,10 +1,12 @@
# Security Model
# security model
etcd supports SSL/TLS as well as authentication through client certificates, both for clients to server as well as peer (server to server / cluster) communication.
To get up and running you first need to have a CA certificate and a signed key pair for one member. It is recommended to create and sign a new key pair for every member in a cluster.
For convenience, the [cfssl] tool provides an easy interface to certificate generation, and we provide an example using the tool [here][tls-setup]. You can also examine this [alternative guide to generating self-signed key pairs][tls-guide].
For convenience the [cfssl](https://github.com/cloudflare/cfssl) tool provides an easy interface to certificate generation, and we provide a full example using the tool at [here](../hack/tls-setup). Alternatively this site provides a good reference on how to generate self-signed key pairs:
http://www.g-loaded.eu/2005/11/10/be-your-own-ca/
## Basic setup
@ -95,7 +97,7 @@ $ curl --cacert /path/to/ca.crt --cert /path/to/client.crt --key /path/to/client
-L https://127.0.0.1:2379/v2/keys/foo -XPUT -d value=bar -v
```
You should be able to see:
You should able to see:
```
...
@ -136,21 +138,13 @@ $ etcd -name infra1 -data-dir infra1 \
# member2
$ etcd -name infra2 -data-dir infra2 \
-peer-client-cert-auth -peer-trusted-ca-file=/path/to/ca.crt -peer-cert-file=/path/to/member2.crt -peer-key-file=/path/to/member2.key \
-peer-client-cert-atuh -peer-trusted-ca-file=/path/to/ca.crt -peer-cert-file=/path/to/member2.crt -peer-key-file=/path/to/member2.key \
-initial-advertise-peer-urls=https://10.0.1.11:2380 -listen-peer-urls=https://10.0.1.11:2380 \
-discovery ${DISCOVERY_URL}
```
The etcd members will form a cluster and all communication between members in the cluster will be encrypted and authenticated using the client certificates. You will see in the output of etcd that the addresses it connects to use HTTPS.
## Notes For etcd Proxy
etcd proxy terminates the TLS from its client if the connection is secure, and uses proxy's own key/cert specified in `--peer-key-file` and `--peer-cert-file` to communicate with etcd members.
The proxy communicates with etcd members through both the `--advertise-client-urls` and `--advertise-peer-urls` of a given member. It forwards client requests to etcd members advertised client urls, and it syncs the initial cluster configuration through etcd members advertised peer urls.
When client authentication is enabled for an etcd member, the administrator must ensure that the peer certificate specified in the proxy's `--peer-cert-file` option is valid for that authentication. The proxy's peer certificate must also be valid for peer authentication if peer authentication is enabled.
## Frequently Asked Questions
### My cluster is not working with peer tls configuration?
@ -158,7 +152,7 @@ When client authentication is enabled for an etcd member, the administrator must
The internal protocol of etcd v2.0.x uses a lot of short-lived HTTP connections.
So, when enabling TLS you may need to increase the heartbeat interval and election timeouts to reduce internal cluster connection churn.
A reasonable place to start are these values: ` --heartbeat-interval 500 --election-timeout 2500`.
These issues are resolved in the etcd v2.1.x series of releases which uses fewer connections.
This issues is resolved in the etcd v2.1.x series of releases which uses fewer connections.
### I'm seeing a SSLv3 alert handshake failure when using SSL client authentication?
@ -185,9 +179,4 @@ $ openssl ca -config openssl.cnf -policy policy_anything -extensions ssl_client
### With peer certificate authentication I receive "certificate is valid for 127.0.0.1, not $MY_IP"
Make sure that you sign your certificates with a Subject Name your member's public IP address. The `etcd-ca` tool for example provides an `--ip=` option for its `new-cert` command.
If you need your certificate to be signed for your member's FQDN in its Subject Name then you could use Subject Alternative Names (short IP SANs) to add your IP address. The `etcd-ca` tool provides `--domain=` option for its `new-cert` command, and openssl can make [it][alt-name] too.
[cfssl]: https://github.com/cloudflare/cfssl
[tls-setup]: /hack/tls-setup
[tls-guide]: https://github.com/coreos/docs/blob/master/os/generate-self-signed-certificates.md
[alt-name]: http://wiki.cacert.org/FAQ/subjectAltName
If you need your certificate to be signed for your member's FQDN in its Subject Name then you could use Subject Alternative Names (short IP SANs) to add your IP address. The `etcd-ca` tool provides `--domain=` option for its `new-cert` command, and openssl can make [it](http://wiki.cacert.org/FAQ/subjectAltName) too.

View File

@ -1,16 +1,16 @@
# Tuning
## Tuning
The default settings in etcd should work well for installations on a local network where the average network latency is low.
However, when using etcd across multiple data centers or over networks with high latency you may need to tweak the heartbeat interval and election timeout settings.
The network isn't the only source of latency. Each request and response may be impacted by slow disks on both the leader and follower. Each of these timeouts represents the total time from request to successful response from the other machine.
## Time Parameters
### Time Parameters
The underlying distributed consensus protocol relies on two separate time parameters to ensure that nodes can handoff leadership if one stalls or goes offline.
The first parameter is called the *Heartbeat Interval*.
This is the frequency with which the leader will notify followers that it is still the leader.
For best practices, the parameter should be set around round-trip time between members.
For best pratices, the parameter should be set around round-trip time between members.
By default, etcd uses a `100ms` heartbeat interval.
The second parameter is the *Election Timeout*.
@ -21,20 +21,17 @@ Adjusting these values is a trade off.
The value of heartbeat interval is recommended to be around the maximum of average round-trip time (RTT) between members, normally around 0.5-1.5x the round-trip time.
If heartbeat interval is too low, etcd will send unnecessary messages that increase the usage of CPU and network resources.
On the other side, a too high heartbeat interval leads to high election timeout. Higher election timeout takes longer time to detect a leader failure.
The easiest way to measure round-trip time (RTT) is to use [PING utility][ping].
The easiest way to measure round-trip time (RTT) is to use [PING utility](https://en.wikipedia.org/wiki/Ping_(networking_utility)).
The election timeout should be set based on the heartbeat interval and average round-trip time between members.
Election timeouts must be at least 10 times the round-trip time so it can account for variance in your network.
For example, if the round-trip time between your members is 10ms then you should have at least a 100ms election timeout.
The upper limit of election timeout is 50000ms, which should only be used when deploying global etcd cluster. First, 5s is the upper limit of average global round-trip time. A reasonable round-trip time for the continental united states is 130ms, and the time between US and japan is around 350-400ms. Because package gets delayed a lot, and network situation may be terrible, 5s is a safe value for it. Then, because election timeout should be an order of magnitude bigger than broadcast time, 50s becomes its maximum.
You should also set your election timeout to at least 5 to 10 times your heartbeat interval to account for variance in leader replication.
For a heartbeat interval of 50ms you should set your election timeout to at least 250ms - 500ms.
The upper limit of election timeout is 50000ms (50s), which should only be used when deploying a globally-distributed etcd cluster.
A reasonable round-trip time for the continental United States is 130ms, and the time between US and Japan is around 350-400ms.
If your network has uneven performance or regular packet delays/loss then it is possible that a couple of retries may be necessary to successfully send a packet. So 5s is a safe upper limit of global round-trip time.
As the election timeout should be an order of magnitude bigger than broadcast time, in the case of ~5s for a globally distributed cluster, then 50 seconds becomes a reasonable maximum.
The heartbeat interval and election timeout value should be the same for all members in one cluster. Setting different values for etcd members may disrupt cluster stability.
You can override the default values on the command line:
@ -49,7 +46,7 @@ $ ETCD_HEARTBEAT_INTERVAL=100 ETCD_ELECTION_TIMEOUT=500 etcd
The values are specified in milliseconds.
## Snapshots
### Snapshots
etcd appends all key changes to a log file.
This log grows forever and is a complete linear history of every change made to the keys.
@ -71,5 +68,3 @@ $ etcd -snapshot-count=5000
# Environment variables:
$ ETCD_SNAPSHOT_COUNT=5000 etcd
```
[ping]: https://en.wikipedia.org/wiki/Ping_(networking_utility)

View File

@ -1,4 +1,4 @@
# Upgrade etcd to 2.1
## Upgrade etcd to 2.1
In the general case, upgrading from etcd 2.0 to 2.1 can be a zero-downtime, rolling upgrade:
- one by one, stop the etcd v2.0 processes and replace them with etcd v2.1 processes
@ -6,29 +6,29 @@ In the general case, upgrading from etcd 2.0 to 2.1 can be a zero-downtime, roll
Before [starting an upgrade](#upgrade-procedure), read through the rest of this guide to prepare.
## Upgrade Checklists
### Upgrade Checklists
### Upgrade Requirements
#### Upgrade Requirement
To upgrade an existing etcd deployment to 2.1, you must be running 2.0. If youre running a version of etcd before 2.0, you must upgrade to [2.0][v2.0] before upgrading to 2.1.
To upgrade an existing etcd deployment to 2.1, you must be running 2.0. If youre running a version of etcd before 2.0, you must upgrade to [2.0](https://github.com/coreos/etcd/releases/tag/v2.0.13) before upgrading to 2.1.
Also, to ensure a smooth rolling upgrade, your running cluster must be healthy. You can check the health of the cluster by using `etcdctl cluster-health` command.
### Preparedness
#### Preparedness
Before upgrading etcd, always test the services relying on etcd in a staging environment before deploying the upgrade to the production environment.
You might also want to [backup your data directory][backup-datastore] for a potential [downgrade](#downgrade).
You might also want to [backup your data directory](admin_guide.md#backing-up-the-datastore) for a potential [downgrade](#downgrade).
etcd 2.1 introduces a new [authentication][auth] feature, which is disabled by default. If your deployment depends on these, you may want to test the auth features before enabling them in production.
etcd 2.1 introduces a new [authentication](auth_api.md) feature, which is disabled by default. If your deployment depends on these, you may want to test the auth features before enabling them in production.
### Mixed Versions
#### Mixed Versions
While upgrading, an etcd cluster supports mixed versions of etcd members. The cluster is only considered upgraded once all its members are upgraded to 2.1.
Internally, etcd members negotiate with each other to determine the overall etcd cluster version, which controls the reported cluster version and the supported features. For example, if you are mid-upgrade, any 2.1 features (such as the the authentication feature mentioned above) wont be available.
### Limitations
#### Limitations
If you encounter any issues during the upgrade, you can attempt to restart the etcd process in trouble using a newer v2.1 binary to solve the problem. One known issue is that etcd v2.0.0 and v2.0.2 may panic during rolling upgrades due to an existing bug, which has been fixed since etcd v2.0.3.
@ -36,11 +36,11 @@ It might take up to 2 minutes for the newly upgraded member to catch up with the
If you have even more data, this might take more time. If you have a data size larger than 100MB you should contact us before upgrading, so we can make sure the upgrades work smoothly.
### Downgrade
#### Downgrade
If all members have been upgraded to v2.1, the cluster will be upgraded to v2.1, and downgrade is **not possible**. If any member is still v2.0, the cluster will remain in v2.0, and you can go back to use v2.0 binary.
Please [backup your data directory][backup-datastore] of all etcd members if you want to downgrade the cluster, even if it is upgraded.
Please [backup your data directory](admin_guide.md#backing-up-the-datastore) of all etcd members if you want to downgrade the cluster, even if it is upgraded.
### Upgrade Procedure
@ -70,7 +70,7 @@ You will see similar error logging from other etcd processes in your cluster. Th
2015/06/23 15:45:11 stream: stopping the stream server...
```
You could [backup your data directory][backup-datastore] for data safety.
You could [backup your data directory](https://github.com/coreos/etcd/blob/7f7e2cc79d9c5c342a6eb1e48c386b0223cf934e/Documentation/admin_guide.md#backing-up-the-datastore) for data safety.
```
$ etcdctl backup \
@ -110,7 +110,3 @@ When all members are upgraded, you will see the cluster is upgraded to 2.1 succe
$ curl http://127.0.0.1:4001/version
{"etcdserver":"2.1.x","etcdcluster":"2.1.0"}
```
[auth]: auth_api.md
[backup-datastore]: admin_guide.md#backing-up-the-datastore
[v2.0]: https://github.com/coreos/etcd/releases/tag/v2.0.13

View File

@ -1,4 +1,4 @@
# Upgrade etcd from 2.1 to 2.2
## Upgrade etcd from 2.1 to 2.2
In the general case, upgrading from etcd 2.1 to 2.2 can be a zero-downtime, rolling upgrade:
@ -7,37 +7,37 @@ In the general case, upgrading from etcd 2.1 to 2.2 can be a zero-downtime, roll
Before [starting an upgrade](#upgrade-procedure), read through the rest of this guide to prepare.
## Upgrade Checklists
### Upgrade Checklists
### Upgrade Requirement
#### Upgrade Requirement
To upgrade an existing etcd deployment to 2.2, you must be running 2.1. If youre running a version of etcd before 2.1, you must upgrade to [2.1][v2.1] before upgrading to 2.2.
To upgrade an existing etcd deployment to 2.2, you must be running 2.1. If youre running a version of etcd before 2.1, you must upgrade to [2.1](https://github.com/coreos/etcd/releases/tag/v2.1.2) before upgrading to 2.2.
Also, to ensure a smooth rolling upgrade, your running cluster must be healthy. You can check the health of the cluster by using `etcdctl cluster-health` command.
### Preparedness
#### Preparedness
Before upgrading etcd, always test the services relying on etcd in a staging environment before deploying the upgrade to the production environment.
You might also want to [backup the data directory][backup-datastore] for a potential [downgrade].
You might also want to [backup your data directory](admin_guide.md#backing-up-the-datastore) for a potential [downgrade](#downgrade).
### Mixed Versions
#### Mixed Versions
While upgrading, an etcd cluster supports mixed versions of etcd members. The cluster is only considered upgraded once all its members are upgraded to 2.2.
Internally, etcd members negotiate with each other to determine the overall etcd cluster version, which controls the reported cluster version and the supported features.
### Limitations
#### Limitations
If you have a data size larger than 100MB you should contact us before upgrading, so we can make sure the upgrades work smoothly.
Every etcd 2.2 member will do health checking across the cluster periodically. etcd 2.1 member does not support health checking. During the upgrade, etcd 2.2 member will log warning about the unhealthy state of etcd 2.1 member. You can ignore the warning.
### Downgrade
#### Downgrade
If all members have been upgraded to v2.2, the cluster will be upgraded to v2.2, and downgrade is **not possible**. If any member is still v2.1, the cluster will remain in v2.1, and you can go back to use v2.1 binary.
Please [backup the data directory][backup-datastore] of all etcd members if you want to downgrade the cluster, even if it is upgraded.
Please [backup your data directory](admin_guide.md#backing-up-the-datastore) of all etcd members if you want to downgrade the cluster, even if it is upgraded.
### Upgrade Procedure
@ -85,7 +85,7 @@ You will also see logging output like this from the newly upgraded member, since
```
[Backup your data directory][backup-datastore] for data safety.
You could [backup your data directory](https://github.com/coreos/etcd/blob/7f7e2cc79d9c5c342a6eb1e48c386b0223cf934e/Documentation/admin_guide.md#backing-up-the-datastore) for data safety.
```
$ etcdctl backup \
@ -126,7 +126,3 @@ When all members are upgraded, you will see the cluster is upgraded to 2.2 succe
$ curl http://127.0.0.1:4001/version
{"etcdserver":"2.2.x","etcdcluster":"2.2.0"}
```
[backup-datastore]: admin_guide.md#backing-up-the-datastore
[downgrade]: #downgrade
[v2.1]: https://github.com/coreos/etcd/releases/tag/v2.1.2

View File

@ -1,121 +0,0 @@
## Upgrade etcd from 2.2 to 2.3
In the general case, upgrading from etcd 2.2 to 2.3 can be a zero-downtime, rolling upgrade:
- one by one, stop the etcd v2.2 processes and replace them with etcd v2.3 processes
- after running all v2.3 processes, new features in v2.3 are available to the cluster
Before [starting an upgrade](#upgrade-procedure), read through the rest of this guide to prepare.
### Upgrade Checklists
#### Upgrade Requirements
To upgrade an existing etcd deployment to 2.3, the running cluster must be 2.2 or greater. If it's before 2.2, please upgrade to [2.2](https://github.com/coreos/etcd/releases/tag/v2.2.0) before upgrading to 2.3.
Also, to ensure a smooth rolling upgrade, the running cluster must be healthy. You can check the health of the cluster by using the `etcdctl cluster-health` command.
#### Preparation
Before upgrading etcd, always test the services relying on etcd in a staging environment before deploying the upgrade to the production environment.
Before beginning, [backup the etcd data directory](admin_guide.md#backing-up-the-datastore). Should something go wrong with the upgrade, it is possible to use this backup to[downgrade](#downgrade) back to existing etcd version.
#### Mixed Versions
While upgrading, an etcd cluster supports mixed versions of etcd members, and operates with the protocol of the lowest common version. The cluster is only considered upgraded once all of its members are upgraded to version 2.3. Internally, etcd members negotiate with each other to determine the overall cluster version, which controls the reported version and the supported features.
#### Limitations
It might take up to 2 minutes for the newly upgraded member to catch up with the existing cluster when the total data size is larger than 50MB. Check the size of a recent snapshot to estimate the total data size. In other words, it is safest to wait for 2 minutes between upgrading each member.
For a much larger total data size, 100MB or more , this one-time process might take even more time. Administrators of very large etcd clusters of this magnitude can feel free to contact the [etcd team][etcd-contact] before upgrading, and well be happy to provide advice on the procedure.
#### Downgrade
If all members have been upgraded to v2.3, the cluster will be upgraded to v2.3, and downgrade from this completed state is **not possible**. If any single member is still v2.2, however, the cluster and its operations remains “v2.2”, and it is possible from this mixed cluster state to return to using a v2.2 etcd binary on all members.
Please [backup the data directory](admin_guide.md#backing-up-the-datastore) of all etcd members to make downgrading the cluster possible even after it has been completely upgraded.
### Upgrade Procedure
This example details the upgrade of a three-member v2.2 ectd cluster running on a local machine.
#### 1. Check upgrade requirements.
Is the the cluster healthy and running v.2.2.x?
```
$ etcdctl cluster-health
member 6e3bd23ae5f1eae0 is healthy: got healthy result from http://localhost:22379
member 924e2e83e93f2560 is healthy: got healthy result from http://localhost:32379
member a8266ecf031671f3 is healthy: got healthy result from http://localhost:12379
cluster is healthy
$ curl http://localhost:4001/version
{"etcdserver":"2.2.x","etcdcluster":"2.2.0"}
```
#### 2. Stop the existing etcd process
When each etcd process is stopped, expected errors will be logged by other cluster members. This is normal since a cluster member connection has been (temporarily) broken:
```
2016-03-11 09:50:49.860319 E | rafthttp: failed to read 8211f1d0f64f3269 on stream Message (unexpected EOF)
2016-03-11 09:50:49.860335 I | rafthttp: the connection with 8211f1d0f64f3269 became inactive
2016-03-11 09:50:51.023804 W | etcdserver: failed to reach the peerURL(http://127.0.0.1:12380) of member 8211f1d0f64f3269 (Get http://127.0.0.1:12380/version: dial tcp 127.0.0.1:12380: getsockopt: connection refused)
2016-03-11 09:50:51.023821 W | etcdserver: cannot get the version of member 8211f1d0f64f3269 (Get http://127.0.0.1:12380/version: dial tcp 127.0.0.1:12380: getsockopt: connection refused)
```
Its a good idea at this point to [backup the etcd data directory](https://github.com/coreos/etcd/blob/7f7e2cc79d9c5c342a6eb1e48c386b0223cf934e/Documentation/admin_guide.md#backing-up-the-datastore) to provide a downgrade path should any problems occur:
```
$ etcdctl backup \
--data-dir /var/lib/etcd \
--backup-dir /tmp/etcd_backup
```
#### 3. Drop-in etcd v2.3 binary and start the new etcd process
The new v2.3 etcd will publish its information to the cluster:
```
09:58:25.938673 I | etcdserver: published {Name:infra1 ClientURLs:[http://localhost:12379]} to cluster 524400597fb1d5f6
```
Verify that each member, and then the entire cluster, becomes healthy with the new v2.3 etcd binary:
```
$ etcdctl cluster-health
member 6e3bd23ae5f1eae0 is healthy: got healthy result from http://localhost:22379
member 924e2e83e93f2560 is healthy: got healthy result from http://localhost:32379
member a8266ecf031671f3 is healthy: got healthy result from http://localhost:12379
cluster is healthy
```
Upgraded members will log warnings like the following until the entire cluster is upgraded. This is expected and will cease after all etcd cluster members are upgraded to v2.3:
```
2016-03-11 09:58:26.851837 W | etcdserver: the local etcd version 2.2.0 is not up-to-date
2016-03-11 09:58:26.851854 W | etcdserver: member c02c70ede158499f has a higher version 2.3.0
```
#### 4. Repeat step 2 to step 3 for all other members
#### 5. Finish
When all members are upgraded, the cluster will report upgrading to 2.3 successfully:
```
2016-03-11 10:03:01.583392 N | etcdserver: updated the cluster version from 2.2 to 2.3
```
```
$ curl http://127.0.0.1:4001/version
{"etcdserver":"2.3.x","etcdcluster":"2.3.0"}
```
[etcd-contact]: https://coreos.com/etcd/?

103
Godeps/Godeps.json generated
View File

@ -1,6 +1,6 @@
{
"ImportPath": "github.com/coreos/etcd",
"GoVersion": "go1.5.1",
"GoVersion": "go1.4.2",
"Packages": [
"./..."
],
@ -10,10 +10,6 @@
"Comment": "null-5",
"Rev": "75cd24fc2f2c2a2088577d12123ddee5f54e0675"
},
{
"ImportPath": "github.com/akrennmair/gopcap",
"Rev": "00e11033259acb75598ba416495bb708d864a010"
},
{
"ImportPath": "github.com/beorn7/perks/quantile",
"Rev": "b965b613227fddccbfffe13eae360ed3fa822f8d"
@ -24,21 +20,17 @@
},
{
"ImportPath": "github.com/boltdb/bolt",
"Comment": "v1.1.0-81-g0fd4c05",
"Rev": "0fd4c0547d204c7b1cad6db6f3adad5f2cf453e5"
"Comment": "v1.0-119-g90fef38",
"Rev": "90fef389f98027ca55594edd7dbd6e7f3926fdad"
},
{
"ImportPath": "github.com/cheggaaa/pb",
"Rev": "da1f27ad1d9509b16f65f52fd9d8138b0f2dc7b2"
"ImportPath": "github.com/bradfitz/http2",
"Rev": "3e36af6d3af0e56fa3da71099f864933dea3d9fb"
},
{
"ImportPath": "github.com/codegangsta/cli",
"Comment": "1.2.0-183-gb5232bb",
"Rev": "b5232bb2934f606f9f27a1305f1eea224e8e8b88"
},
{
"ImportPath": "github.com/coreos/gexpect",
"Rev": "5173270e159f5aa8fbc999dc7e3dcb50f4098a69"
"Comment": "1.2.0-26-gf7ebb76",
"Rev": "f7ebb761e83e21225d1d8954fde853bf8edd46c4"
},
{
"ImportPath": "github.com/coreos/go-semver/semver",
@ -63,15 +55,9 @@
"ImportPath": "github.com/coreos/pkg/capnslog",
"Rev": "2c77715c4df99b5420ffcae14ead08f52104065d"
},
{
"ImportPath": "github.com/cpuguy83/go-md2man/md2man",
"Comment": "v1.0.4",
"Rev": "71acacd42f85e5e82f70a55327789582a5200a90"
},
{
"ImportPath": "github.com/gogo/protobuf/proto",
"Comment": "v0.1-118-ge8904f5",
"Rev": "e8904f58e872a473a5b91bc9bf3377d223555263"
"Rev": "64f27bf06efee53589314a6e5a4af34cdd85adf6"
},
{
"ImportPath": "github.com/golang/glog",
@ -79,46 +65,20 @@
},
{
"ImportPath": "github.com/golang/protobuf/proto",
"Rev": "6aaa8d47701fa6cf07e914ec01fde3d4a1fe79c3"
"Rev": "5677a0e3d5e89854c9974e1256839ee23f8233ca"
},
{
"ImportPath": "github.com/google/btree",
"Rev": "cc6329d4279e3f025a53a83c397d2339b5705c45"
},
{
"ImportPath": "github.com/inconshreveable/mousetrap",
"Rev": "76626ae9c91c4f2a10f34cad8ce83ea42c93bb75"
},
{
"ImportPath": "github.com/jonboulle/clockwork",
"Rev": "72f9bd7c4e0c2a40055ab3d0f09654f730cce982"
},
{
"ImportPath": "github.com/kballard/go-shellquote",
"Rev": "d8ec1a69a250a17bb0e419c386eac1f3711dc142"
},
{
"ImportPath": "github.com/kr/pty",
"Comment": "release.r56-29-gf7ee69f",
"Rev": "f7ee69f31298ecbe5d2b349c711e2547a617d398"
},
{
"ImportPath": "github.com/mattn/go-runewidth",
"Comment": "travisish-46-gd6bea18",
"Rev": "d6bea18f789704b5f83375793155289da36a3c7f"
},
{
"ImportPath": "github.com/matttproud/golang_protobuf_extensions/pbutil",
"Rev": "fc2b8d3a73c4867e51861bbdd5ae3c1f0869dd6a"
},
{
"ImportPath": "github.com/olekukonko/tablewriter",
"Rev": "cca8bbc0798408af109aaaa239cbd2634846b340"
},
{
"ImportPath": "github.com/olekukonko/ts",
"Rev": "ecf753e7c962639ab5a1fb46f7da627d4c0a04b8"
},
{
"ImportPath": "github.com/prometheus/client_golang/prometheus",
"Comment": "0.7.0-52-ge51041b",
@ -142,25 +102,8 @@
"Rev": "454a56f35412459b5e684fd5ec0f9211b94f002a"
},
{
"ImportPath": "github.com/russross/blackfriday",
"Comment": "v1.4-2-g300106c",
"Rev": "300106c228d52c8941d4b3de6054a6062a86dda3"
},
{
"ImportPath": "github.com/shurcooL/sanitized_anchor_name",
"Rev": "10ef21a441db47d8b13ebcc5fd2310f636973c77"
},
{
"ImportPath": "github.com/spacejam/loghisto",
"Rev": "323309774dec8b7430187e46cd0793974ccca04a"
},
{
"ImportPath": "github.com/spf13/cobra",
"Rev": "1c44ec8d3f1552cac48999f9306da23c4d8a288b"
},
{
"ImportPath": "github.com/spf13/pflag",
"Rev": "08b1a584251b5b62f458943640fc8ebd4d50aaa5"
"ImportPath": "github.com/rakyll/pb",
"Rev": "dc507ad06b7462501281bb4691ee43f0b1d1ec37"
},
{
"ImportPath": "github.com/stretchr/testify/assert",
@ -184,27 +127,31 @@
},
{
"ImportPath": "golang.org/x/net/context",
"Rev": "6acef71eb69611914f7a30939ea9f6e194c78172"
"Rev": "7dbad50ab5b31073856416cdcfeb2796d682f844"
},
{
"ImportPath": "golang.org/x/net/http2",
"Rev": "6acef71eb69611914f7a30939ea9f6e194c78172"
"ImportPath": "golang.org/x/net/netutil",
"Rev": "7dbad50ab5b31073856416cdcfeb2796d682f844"
},
{
"ImportPath": "golang.org/x/net/internal/timeseries",
"Rev": "6acef71eb69611914f7a30939ea9f6e194c78172"
},
{
"ImportPath": "golang.org/x/net/trace",
"Rev": "6acef71eb69611914f7a30939ea9f6e194c78172"
"ImportPath": "golang.org/x/oauth2",
"Rev": "3046bc76d6dfd7d3707f6640f85e42d9c4050f50"
},
{
"ImportPath": "golang.org/x/sys/unix",
"Rev": "9c60d1c508f5134d1ca726b4641db998f2523357"
},
{
"ImportPath": "google.golang.org/cloud/compute/metadata",
"Rev": "f20d6dcccb44ed49de45ae3703312cb46e627db1"
},
{
"ImportPath": "google.golang.org/cloud/internal",
"Rev": "f20d6dcccb44ed49de45ae3703312cb46e627db1"
},
{
"ImportPath": "google.golang.org/grpc",
"Rev": "b88c12e7caf74af3928de99a864aaa9916fa5aad"
"Rev": "f5ebd86be717593ab029545492c93ddf8914832b"
}
]
}

View File

@ -1,5 +0,0 @@
#*
*~
/tools/pass/pass
/tools/pcaptest/pcaptest
/tools/tcpdump/tcpdump

View File

@ -1,27 +0,0 @@
Copyright (c) 2009-2011 Andreas Krennmair. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following disclaimer
in the documentation and/or other materials provided with the
distribution.
* Neither the name of Andreas Krennmair nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

View File

@ -1,11 +0,0 @@
# PCAP
This is a simple wrapper around libpcap for Go. Originally written by Andreas
Krennmair <ak@synflood.at> and only minorly touched up by Mark Smith <mark@qq.is>.
Please see the included pcaptest.go and tcpdump.go programs for instructions on
how to use this library.
Miek Gieben <miek@miek.nl> has created a more Go-like package and replaced functionality
with standard functions from the standard library. The package has also been renamed to
pcap.

View File

@ -1,527 +0,0 @@
package pcap
import (
"encoding/binary"
"fmt"
"net"
"reflect"
"strings"
)
const (
TYPE_IP = 0x0800
TYPE_ARP = 0x0806
TYPE_IP6 = 0x86DD
TYPE_VLAN = 0x8100
IP_ICMP = 1
IP_INIP = 4
IP_TCP = 6
IP_UDP = 17
)
const (
ERRBUF_SIZE = 256
// According to pcap-linktype(7).
LINKTYPE_NULL = 0
LINKTYPE_ETHERNET = 1
LINKTYPE_TOKEN_RING = 6
LINKTYPE_ARCNET = 7
LINKTYPE_SLIP = 8
LINKTYPE_PPP = 9
LINKTYPE_FDDI = 10
LINKTYPE_ATM_RFC1483 = 100
LINKTYPE_RAW = 101
LINKTYPE_PPP_HDLC = 50
LINKTYPE_PPP_ETHER = 51
LINKTYPE_C_HDLC = 104
LINKTYPE_IEEE802_11 = 105
LINKTYPE_FRELAY = 107
LINKTYPE_LOOP = 108
LINKTYPE_LINUX_SLL = 113
LINKTYPE_LTALK = 104
LINKTYPE_PFLOG = 117
LINKTYPE_PRISM_HEADER = 119
LINKTYPE_IP_OVER_FC = 122
LINKTYPE_SUNATM = 123
LINKTYPE_IEEE802_11_RADIO = 127
LINKTYPE_ARCNET_LINUX = 129
LINKTYPE_LINUX_IRDA = 144
LINKTYPE_LINUX_LAPD = 177
)
type addrHdr interface {
SrcAddr() string
DestAddr() string
Len() int
}
type addrStringer interface {
String(addr addrHdr) string
}
func decodemac(pkt []byte) uint64 {
mac := uint64(0)
for i := uint(0); i < 6; i++ {
mac = (mac << 8) + uint64(pkt[i])
}
return mac
}
// Decode decodes the headers of a Packet.
func (p *Packet) Decode() {
if len(p.Data) <= 14 {
return
}
p.Type = int(binary.BigEndian.Uint16(p.Data[12:14]))
p.DestMac = decodemac(p.Data[0:6])
p.SrcMac = decodemac(p.Data[6:12])
if len(p.Data) >= 15 {
p.Payload = p.Data[14:]
}
switch p.Type {
case TYPE_IP:
p.decodeIp()
case TYPE_IP6:
p.decodeIp6()
case TYPE_ARP:
p.decodeArp()
case TYPE_VLAN:
p.decodeVlan()
}
}
func (p *Packet) headerString(headers []interface{}) string {
// If there's just one header, return that.
if len(headers) == 1 {
if hdr, ok := headers[0].(fmt.Stringer); ok {
return hdr.String()
}
}
// If there are two headers (IPv4/IPv6 -> TCP/UDP/IP..)
if len(headers) == 2 {
// Commonly the first header is an address.
if addr, ok := p.Headers[0].(addrHdr); ok {
if hdr, ok := p.Headers[1].(addrStringer); ok {
return fmt.Sprintf("%s %s", p.Time, hdr.String(addr))
}
}
}
// For IP in IP, we do a recursive call.
if len(headers) >= 2 {
if addr, ok := headers[0].(addrHdr); ok {
if _, ok := headers[1].(addrHdr); ok {
return fmt.Sprintf("%s > %s IP in IP: ",
addr.SrcAddr(), addr.DestAddr(), p.headerString(headers[1:]))
}
}
}
var typeNames []string
for _, hdr := range headers {
typeNames = append(typeNames, reflect.TypeOf(hdr).String())
}
return fmt.Sprintf("unknown [%s]", strings.Join(typeNames, ","))
}
// String prints a one-line representation of the packet header.
// The output is suitable for use in a tcpdump program.
func (p *Packet) String() string {
// If there are no headers, print "unsupported protocol".
if len(p.Headers) == 0 {
return fmt.Sprintf("%s unsupported protocol %d", p.Time, int(p.Type))
}
return fmt.Sprintf("%s %s", p.Time, p.headerString(p.Headers))
}
// Arphdr is a ARP packet header.
type Arphdr struct {
Addrtype uint16
Protocol uint16
HwAddressSize uint8
ProtAddressSize uint8
Operation uint16
SourceHwAddress []byte
SourceProtAddress []byte
DestHwAddress []byte
DestProtAddress []byte
}
func (arp *Arphdr) String() (s string) {
switch arp.Operation {
case 1:
s = "ARP request"
case 2:
s = "ARP Reply"
}
if arp.Addrtype == LINKTYPE_ETHERNET && arp.Protocol == TYPE_IP {
s = fmt.Sprintf("%012x (%s) > %012x (%s)",
decodemac(arp.SourceHwAddress), arp.SourceProtAddress,
decodemac(arp.DestHwAddress), arp.DestProtAddress)
} else {
s = fmt.Sprintf("addrtype = %d protocol = %d", arp.Addrtype, arp.Protocol)
}
return
}
func (p *Packet) decodeArp() {
if len(p.Payload) < 8 {
return
}
pkt := p.Payload
arp := new(Arphdr)
arp.Addrtype = binary.BigEndian.Uint16(pkt[0:2])
arp.Protocol = binary.BigEndian.Uint16(pkt[2:4])
arp.HwAddressSize = pkt[4]
arp.ProtAddressSize = pkt[5]
arp.Operation = binary.BigEndian.Uint16(pkt[6:8])
if len(pkt) < int(8+2*arp.HwAddressSize+2*arp.ProtAddressSize) {
return
}
arp.SourceHwAddress = pkt[8 : 8+arp.HwAddressSize]
arp.SourceProtAddress = pkt[8+arp.HwAddressSize : 8+arp.HwAddressSize+arp.ProtAddressSize]
arp.DestHwAddress = pkt[8+arp.HwAddressSize+arp.ProtAddressSize : 8+2*arp.HwAddressSize+arp.ProtAddressSize]
arp.DestProtAddress = pkt[8+2*arp.HwAddressSize+arp.ProtAddressSize : 8+2*arp.HwAddressSize+2*arp.ProtAddressSize]
p.Headers = append(p.Headers, arp)
if len(pkt) >= int(8+2*arp.HwAddressSize+2*arp.ProtAddressSize) {
p.Payload = p.Payload[8+2*arp.HwAddressSize+2*arp.ProtAddressSize:]
}
}
// IPadr is the header of an IP packet.
type Iphdr struct {
Version uint8
Ihl uint8
Tos uint8
Length uint16
Id uint16
Flags uint8
FragOffset uint16
Ttl uint8
Protocol uint8
Checksum uint16
SrcIp []byte
DestIp []byte
}
func (p *Packet) decodeIp() {
if len(p.Payload) < 20 {
return
}
pkt := p.Payload
ip := new(Iphdr)
ip.Version = uint8(pkt[0]) >> 4
ip.Ihl = uint8(pkt[0]) & 0x0F
ip.Tos = pkt[1]
ip.Length = binary.BigEndian.Uint16(pkt[2:4])
ip.Id = binary.BigEndian.Uint16(pkt[4:6])
flagsfrags := binary.BigEndian.Uint16(pkt[6:8])
ip.Flags = uint8(flagsfrags >> 13)
ip.FragOffset = flagsfrags & 0x1FFF
ip.Ttl = pkt[8]
ip.Protocol = pkt[9]
ip.Checksum = binary.BigEndian.Uint16(pkt[10:12])
ip.SrcIp = pkt[12:16]
ip.DestIp = pkt[16:20]
pEnd := int(ip.Length)
if pEnd > len(pkt) {
pEnd = len(pkt)
}
if len(pkt) >= pEnd && int(ip.Ihl*4) < pEnd {
p.Payload = pkt[ip.Ihl*4 : pEnd]
} else {
p.Payload = []byte{}
}
p.Headers = append(p.Headers, ip)
p.IP = ip
switch ip.Protocol {
case IP_TCP:
p.decodeTcp()
case IP_UDP:
p.decodeUdp()
case IP_ICMP:
p.decodeIcmp()
case IP_INIP:
p.decodeIp()
}
}
func (ip *Iphdr) SrcAddr() string { return net.IP(ip.SrcIp).String() }
func (ip *Iphdr) DestAddr() string { return net.IP(ip.DestIp).String() }
func (ip *Iphdr) Len() int { return int(ip.Length) }
type Vlanhdr struct {
Priority byte
DropEligible bool
VlanIdentifier int
Type int // Not actually part of the vlan header, but the type of the actual packet
}
func (v *Vlanhdr) String() {
fmt.Sprintf("VLAN Priority:%d Drop:%v Tag:%d", v.Priority, v.DropEligible, v.VlanIdentifier)
}
func (p *Packet) decodeVlan() {
pkt := p.Payload
vlan := new(Vlanhdr)
if len(pkt) < 4 {
return
}
vlan.Priority = (pkt[2] & 0xE0) >> 13
vlan.DropEligible = pkt[2]&0x10 != 0
vlan.VlanIdentifier = int(binary.BigEndian.Uint16(pkt[:2])) & 0x0FFF
vlan.Type = int(binary.BigEndian.Uint16(p.Payload[2:4]))
p.Headers = append(p.Headers, vlan)
if len(pkt) >= 5 {
p.Payload = p.Payload[4:]
}
switch vlan.Type {
case TYPE_IP:
p.decodeIp()
case TYPE_IP6:
p.decodeIp6()
case TYPE_ARP:
p.decodeArp()
}
}
type Tcphdr struct {
SrcPort uint16
DestPort uint16
Seq uint32
Ack uint32
DataOffset uint8
Flags uint16
Window uint16
Checksum uint16
Urgent uint16
Data []byte
}
const (
TCP_FIN = 1 << iota
TCP_SYN
TCP_RST
TCP_PSH
TCP_ACK
TCP_URG
TCP_ECE
TCP_CWR
TCP_NS
)
func (p *Packet) decodeTcp() {
if len(p.Payload) < 20 {
return
}
pkt := p.Payload
tcp := new(Tcphdr)
tcp.SrcPort = binary.BigEndian.Uint16(pkt[0:2])
tcp.DestPort = binary.BigEndian.Uint16(pkt[2:4])
tcp.Seq = binary.BigEndian.Uint32(pkt[4:8])
tcp.Ack = binary.BigEndian.Uint32(pkt[8:12])
tcp.DataOffset = (pkt[12] & 0xF0) >> 4
tcp.Flags = binary.BigEndian.Uint16(pkt[12:14]) & 0x1FF
tcp.Window = binary.BigEndian.Uint16(pkt[14:16])
tcp.Checksum = binary.BigEndian.Uint16(pkt[16:18])
tcp.Urgent = binary.BigEndian.Uint16(pkt[18:20])
if len(pkt) >= int(tcp.DataOffset*4) {
p.Payload = pkt[tcp.DataOffset*4:]
}
p.Headers = append(p.Headers, tcp)
p.TCP = tcp
}
func (tcp *Tcphdr) String(hdr addrHdr) string {
return fmt.Sprintf("TCP %s:%d > %s:%d %s SEQ=%d ACK=%d LEN=%d",
hdr.SrcAddr(), int(tcp.SrcPort), hdr.DestAddr(), int(tcp.DestPort),
tcp.FlagsString(), int64(tcp.Seq), int64(tcp.Ack), hdr.Len())
}
func (tcp *Tcphdr) FlagsString() string {
var sflags []string
if 0 != (tcp.Flags & TCP_SYN) {
sflags = append(sflags, "syn")
}
if 0 != (tcp.Flags & TCP_FIN) {
sflags = append(sflags, "fin")
}
if 0 != (tcp.Flags & TCP_ACK) {
sflags = append(sflags, "ack")
}
if 0 != (tcp.Flags & TCP_PSH) {
sflags = append(sflags, "psh")
}
if 0 != (tcp.Flags & TCP_RST) {
sflags = append(sflags, "rst")
}
if 0 != (tcp.Flags & TCP_URG) {
sflags = append(sflags, "urg")
}
if 0 != (tcp.Flags & TCP_NS) {
sflags = append(sflags, "ns")
}
if 0 != (tcp.Flags & TCP_CWR) {
sflags = append(sflags, "cwr")
}
if 0 != (tcp.Flags & TCP_ECE) {
sflags = append(sflags, "ece")
}
return fmt.Sprintf("[%s]", strings.Join(sflags, " "))
}
type Udphdr struct {
SrcPort uint16
DestPort uint16
Length uint16
Checksum uint16
}
func (p *Packet) decodeUdp() {
if len(p.Payload) < 8 {
return
}
pkt := p.Payload
udp := new(Udphdr)
udp.SrcPort = binary.BigEndian.Uint16(pkt[0:2])
udp.DestPort = binary.BigEndian.Uint16(pkt[2:4])
udp.Length = binary.BigEndian.Uint16(pkt[4:6])
udp.Checksum = binary.BigEndian.Uint16(pkt[6:8])
p.Headers = append(p.Headers, udp)
p.UDP = udp
if len(p.Payload) >= 8 {
p.Payload = pkt[8:]
}
}
func (udp *Udphdr) String(hdr addrHdr) string {
return fmt.Sprintf("UDP %s:%d > %s:%d LEN=%d CHKSUM=%d",
hdr.SrcAddr(), int(udp.SrcPort), hdr.DestAddr(), int(udp.DestPort),
int(udp.Length), int(udp.Checksum))
}
type Icmphdr struct {
Type uint8
Code uint8
Checksum uint16
Id uint16
Seq uint16
Data []byte
}
func (p *Packet) decodeIcmp() *Icmphdr {
if len(p.Payload) < 8 {
return nil
}
pkt := p.Payload
icmp := new(Icmphdr)
icmp.Type = pkt[0]
icmp.Code = pkt[1]
icmp.Checksum = binary.BigEndian.Uint16(pkt[2:4])
icmp.Id = binary.BigEndian.Uint16(pkt[4:6])
icmp.Seq = binary.BigEndian.Uint16(pkt[6:8])
p.Payload = pkt[8:]
p.Headers = append(p.Headers, icmp)
return icmp
}
func (icmp *Icmphdr) String(hdr addrHdr) string {
return fmt.Sprintf("ICMP %s > %s Type = %d Code = %d ",
hdr.SrcAddr(), hdr.DestAddr(), icmp.Type, icmp.Code)
}
func (icmp *Icmphdr) TypeString() (result string) {
switch icmp.Type {
case 0:
result = fmt.Sprintf("Echo reply seq=%d", icmp.Seq)
case 3:
switch icmp.Code {
case 0:
result = "Network unreachable"
case 1:
result = "Host unreachable"
case 2:
result = "Protocol unreachable"
case 3:
result = "Port unreachable"
default:
result = "Destination unreachable"
}
case 8:
result = fmt.Sprintf("Echo request seq=%d", icmp.Seq)
case 30:
result = "Traceroute"
}
return
}
type Ip6hdr struct {
// http://www.networksorcery.com/enp/protocol/ipv6.htm
Version uint8 // 4 bits
TrafficClass uint8 // 8 bits
FlowLabel uint32 // 20 bits
Length uint16 // 16 bits
NextHeader uint8 // 8 bits, same as Protocol in Iphdr
HopLimit uint8 // 8 bits
SrcIp []byte // 16 bytes
DestIp []byte // 16 bytes
}
func (p *Packet) decodeIp6() {
if len(p.Payload) < 40 {
return
}
pkt := p.Payload
ip6 := new(Ip6hdr)
ip6.Version = uint8(pkt[0]) >> 4
ip6.TrafficClass = uint8((binary.BigEndian.Uint16(pkt[0:2]) >> 4) & 0x00FF)
ip6.FlowLabel = binary.BigEndian.Uint32(pkt[0:4]) & 0x000FFFFF
ip6.Length = binary.BigEndian.Uint16(pkt[4:6])
ip6.NextHeader = pkt[6]
ip6.HopLimit = pkt[7]
ip6.SrcIp = pkt[8:24]
ip6.DestIp = pkt[24:40]
if len(p.Payload) >= 40 {
p.Payload = pkt[40:]
}
p.Headers = append(p.Headers, ip6)
switch ip6.NextHeader {
case IP_TCP:
p.decodeTcp()
case IP_UDP:
p.decodeUdp()
case IP_ICMP:
p.decodeIcmp()
case IP_INIP:
p.decodeIp()
}
}
func (ip6 *Ip6hdr) SrcAddr() string { return net.IP(ip6.SrcIp).String() }
func (ip6 *Ip6hdr) DestAddr() string { return net.IP(ip6.DestIp).String() }
func (ip6 *Ip6hdr) Len() int { return int(ip6.Length) }

View File

@ -1,247 +0,0 @@
package pcap
import (
"bytes"
"testing"
"time"
)
var testSimpleTcpPacket *Packet = &Packet{
Data: []byte{
0x00, 0x00, 0x0c, 0x9f, 0xf0, 0x20, 0xbc, 0x30, 0x5b, 0xe8, 0xd3, 0x49,
0x08, 0x00, 0x45, 0x00, 0x01, 0xa4, 0x39, 0xdf, 0x40, 0x00, 0x40, 0x06,
0x55, 0x5a, 0xac, 0x11, 0x51, 0x49, 0xad, 0xde, 0xfe, 0xe1, 0xc5, 0xf7,
0x00, 0x50, 0xc5, 0x7e, 0x0e, 0x48, 0x49, 0x07, 0x42, 0x32, 0x80, 0x18,
0x00, 0x73, 0xab, 0xb1, 0x00, 0x00, 0x01, 0x01, 0x08, 0x0a, 0x03, 0x77,
0x37, 0x9c, 0x42, 0x77, 0x5e, 0x3a, 0x47, 0x45, 0x54, 0x20, 0x2f, 0x20,
0x48, 0x54, 0x54, 0x50, 0x2f, 0x31, 0x2e, 0x31, 0x0d, 0x0a, 0x48, 0x6f,
0x73, 0x74, 0x3a, 0x20, 0x77, 0x77, 0x77, 0x2e, 0x66, 0x69, 0x73, 0x68,
0x2e, 0x63, 0x6f, 0x6d, 0x0d, 0x0a, 0x43, 0x6f, 0x6e, 0x6e, 0x65, 0x63,
0x74, 0x69, 0x6f, 0x6e, 0x3a, 0x20, 0x6b, 0x65, 0x65, 0x70, 0x2d, 0x61,
0x6c, 0x69, 0x76, 0x65, 0x0d, 0x0a, 0x55, 0x73, 0x65, 0x72, 0x2d, 0x41,
0x67, 0x65, 0x6e, 0x74, 0x3a, 0x20, 0x4d, 0x6f, 0x7a, 0x69, 0x6c, 0x6c,
0x61, 0x2f, 0x35, 0x2e, 0x30, 0x20, 0x28, 0x58, 0x31, 0x31, 0x3b, 0x20,
0x4c, 0x69, 0x6e, 0x75, 0x78, 0x20, 0x78, 0x38, 0x36, 0x5f, 0x36, 0x34,
0x29, 0x20, 0x41, 0x70, 0x70, 0x6c, 0x65, 0x57, 0x65, 0x62, 0x4b, 0x69,
0x74, 0x2f, 0x35, 0x33, 0x35, 0x2e, 0x32, 0x20, 0x28, 0x4b, 0x48, 0x54,
0x4d, 0x4c, 0x2c, 0x20, 0x6c, 0x69, 0x6b, 0x65, 0x20, 0x47, 0x65, 0x63,
0x6b, 0x6f, 0x29, 0x20, 0x43, 0x68, 0x72, 0x6f, 0x6d, 0x65, 0x2f, 0x31,
0x35, 0x2e, 0x30, 0x2e, 0x38, 0x37, 0x34, 0x2e, 0x31, 0x32, 0x31, 0x20,
0x53, 0x61, 0x66, 0x61, 0x72, 0x69, 0x2f, 0x35, 0x33, 0x35, 0x2e, 0x32,
0x0d, 0x0a, 0x41, 0x63, 0x63, 0x65, 0x70, 0x74, 0x3a, 0x20, 0x74, 0x65,
0x78, 0x74, 0x2f, 0x68, 0x74, 0x6d, 0x6c, 0x2c, 0x61, 0x70, 0x70, 0x6c,
0x69, 0x63, 0x61, 0x74, 0x69, 0x6f, 0x6e, 0x2f, 0x78, 0x68, 0x74, 0x6d,
0x6c, 0x2b, 0x78, 0x6d, 0x6c, 0x2c, 0x61, 0x70, 0x70, 0x6c, 0x69, 0x63,
0x61, 0x74, 0x69, 0x6f, 0x6e, 0x2f, 0x78, 0x6d, 0x6c, 0x3b, 0x71, 0x3d,
0x30, 0x2e, 0x39, 0x2c, 0x2a, 0x2f, 0x2a, 0x3b, 0x71, 0x3d, 0x30, 0x2e,
0x38, 0x0d, 0x0a, 0x41, 0x63, 0x63, 0x65, 0x70, 0x74, 0x2d, 0x45, 0x6e,
0x63, 0x6f, 0x64, 0x69, 0x6e, 0x67, 0x3a, 0x20, 0x67, 0x7a, 0x69, 0x70,
0x2c, 0x64, 0x65, 0x66, 0x6c, 0x61, 0x74, 0x65, 0x2c, 0x73, 0x64, 0x63,
0x68, 0x0d, 0x0a, 0x41, 0x63, 0x63, 0x65, 0x70, 0x74, 0x2d, 0x4c, 0x61,
0x6e, 0x67, 0x75, 0x61, 0x67, 0x65, 0x3a, 0x20, 0x65, 0x6e, 0x2d, 0x55,
0x53, 0x2c, 0x65, 0x6e, 0x3b, 0x71, 0x3d, 0x30, 0x2e, 0x38, 0x0d, 0x0a,
0x41, 0x63, 0x63, 0x65, 0x70, 0x74, 0x2d, 0x43, 0x68, 0x61, 0x72, 0x73,
0x65, 0x74, 0x3a, 0x20, 0x49, 0x53, 0x4f, 0x2d, 0x38, 0x38, 0x35, 0x39,
0x2d, 0x31, 0x2c, 0x75, 0x74, 0x66, 0x2d, 0x38, 0x3b, 0x71, 0x3d, 0x30,
0x2e, 0x37, 0x2c, 0x2a, 0x3b, 0x71, 0x3d, 0x30, 0x2e, 0x33, 0x0d, 0x0a,
0x0d, 0x0a,
}}
func BenchmarkDecodeSimpleTcpPacket(b *testing.B) {
for i := 0; i < b.N; i++ {
testSimpleTcpPacket.Decode()
}
}
func TestDecodeSimpleTcpPacket(t *testing.T) {
p := testSimpleTcpPacket
p.Decode()
if p.DestMac != 0x00000c9ff020 {
t.Error("Dest mac", p.DestMac)
}
if p.SrcMac != 0xbc305be8d349 {
t.Error("Src mac", p.SrcMac)
}
if len(p.Headers) != 2 {
t.Error("Incorrect number of headers", len(p.Headers))
return
}
if ip, ipOk := p.Headers[0].(*Iphdr); ipOk {
if ip.Version != 4 {
t.Error("ip Version", ip.Version)
}
if ip.Ihl != 5 {
t.Error("ip header length", ip.Ihl)
}
if ip.Tos != 0 {
t.Error("ip TOS", ip.Tos)
}
if ip.Length != 420 {
t.Error("ip Length", ip.Length)
}
if ip.Id != 14815 {
t.Error("ip ID", ip.Id)
}
if ip.Flags != 0x02 {
t.Error("ip Flags", ip.Flags)
}
if ip.FragOffset != 0 {
t.Error("ip Fragoffset", ip.FragOffset)
}
if ip.Ttl != 64 {
t.Error("ip TTL", ip.Ttl)
}
if ip.Protocol != 6 {
t.Error("ip Protocol", ip.Protocol)
}
if ip.Checksum != 0x555A {
t.Error("ip Checksum", ip.Checksum)
}
if !bytes.Equal(ip.SrcIp, []byte{172, 17, 81, 73}) {
t.Error("ip Src", ip.SrcIp)
}
if !bytes.Equal(ip.DestIp, []byte{173, 222, 254, 225}) {
t.Error("ip Dest", ip.DestIp)
}
if tcp, tcpOk := p.Headers[1].(*Tcphdr); tcpOk {
if tcp.SrcPort != 50679 {
t.Error("tcp srcport", tcp.SrcPort)
}
if tcp.DestPort != 80 {
t.Error("tcp destport", tcp.DestPort)
}
if tcp.Seq != 0xc57e0e48 {
t.Error("tcp seq", tcp.Seq)
}
if tcp.Ack != 0x49074232 {
t.Error("tcp ack", tcp.Ack)
}
if tcp.DataOffset != 8 {
t.Error("tcp dataoffset", tcp.DataOffset)
}
if tcp.Flags != 0x18 {
t.Error("tcp flags", tcp.Flags)
}
if tcp.Window != 0x73 {
t.Error("tcp window", tcp.Window)
}
if tcp.Checksum != 0xabb1 {
t.Error("tcp checksum", tcp.Checksum)
}
if tcp.Urgent != 0 {
t.Error("tcp urgent", tcp.Urgent)
}
} else {
t.Error("Second header is not TCP header")
}
} else {
t.Error("First header is not IP header")
}
if string(p.Payload) != "GET / HTTP/1.1\r\nHost: www.fish.com\r\nConnection: keep-alive\r\nUser-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\nAccept-Encoding: gzip,deflate,sdch\r\nAccept-Language: en-US,en;q=0.8\r\nAccept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n\r\n" {
t.Error("--- PAYLOAD STRING ---\n", string(p.Payload), "\n--- PAYLOAD BYTES ---\n", p.Payload)
}
}
// Makes sure packet payload doesn't display the 6 trailing null of this packet
// as part of the payload. They're actually the ethernet trailer.
func TestDecodeSmallTcpPacketHasEmptyPayload(t *testing.T) {
p := &Packet{
// This packet is only 54 bits (an empty TCP RST), thus 6 trailing null
// bytes are added by the ethernet layer to make it the minimum packet size.
Data: []byte{
0xbc, 0x30, 0x5b, 0xe8, 0xd3, 0x49, 0xb8, 0xac, 0x6f, 0x92, 0xd5, 0xbf,
0x08, 0x00, 0x45, 0x00, 0x00, 0x28, 0x00, 0x00, 0x40, 0x00, 0x40, 0x06,
0x3f, 0x9f, 0xac, 0x11, 0x51, 0xc5, 0xac, 0x11, 0x51, 0x49, 0x00, 0x63,
0x9a, 0xef, 0x00, 0x00, 0x00, 0x00, 0x2e, 0xc1, 0x27, 0x83, 0x50, 0x14,
0x00, 0x00, 0xc3, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
}}
p.Decode()
if p.Payload == nil {
t.Error("Nil payload")
}
if len(p.Payload) != 0 {
t.Error("Non-empty payload:", p.Payload)
}
}
func TestDecodeVlanPacket(t *testing.T) {
p := &Packet{
Data: []byte{
0x00, 0x10, 0xdb, 0xff, 0x10, 0x00, 0x00, 0x15, 0x2c, 0x9d, 0xcc, 0x00, 0x81, 0x00, 0x01, 0xf7,
0x08, 0x00, 0x45, 0x00, 0x00, 0x28, 0x29, 0x8d, 0x40, 0x00, 0x7d, 0x06, 0x83, 0xa0, 0xac, 0x1b,
0xca, 0x8e, 0x45, 0x16, 0x94, 0xe2, 0xd4, 0x0a, 0x00, 0x50, 0xdf, 0xab, 0x9c, 0xc6, 0xcd, 0x1e,
0xe5, 0xd1, 0x50, 0x10, 0x01, 0x00, 0x5a, 0x74, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
}}
p.Decode()
if p.Type != TYPE_VLAN {
t.Error("Didn't detect vlan")
}
if len(p.Headers) != 3 {
t.Error("Incorrect number of headers:", len(p.Headers))
for i, h := range p.Headers {
t.Errorf("Header %d: %#v", i, h)
}
t.FailNow()
}
if _, ok := p.Headers[0].(*Vlanhdr); !ok {
t.Errorf("First header isn't vlan: %q", p.Headers[0])
}
if _, ok := p.Headers[1].(*Iphdr); !ok {
t.Errorf("Second header isn't IP: %q", p.Headers[1])
}
if _, ok := p.Headers[2].(*Tcphdr); !ok {
t.Errorf("Third header isn't TCP: %q", p.Headers[2])
}
}
func TestDecodeFuzzFallout(t *testing.T) {
testData := []struct {
Data []byte
}{
{[]byte("000000000000\x81\x000")},
{[]byte("000000000000\x81\x00000")},
{[]byte("000000000000\x86\xdd0")},
{[]byte("000000000000\b\x000")},
{[]byte("000000000000\b\x060")},
{[]byte{}},
{[]byte("000000000000\b\x0600000000")},
{[]byte("000000000000\x86\xdd000000\x01000000000000000000000000000000000")},
{[]byte("000000000000\x81\x0000\b\x0600000000")},
{[]byte("000000000000\b\x00n0000000000000000000")},
{[]byte("000000000000\x86\xdd000000\x0100000000000000000000000000000000000")},
{[]byte("000000000000\x81\x0000\b\x00g0000000000000000000")},
//{[]byte()},
{[]byte("000000000000\b\x00400000000\x110000000000")},
{[]byte("0nMء\xfe\x13\x13\x81\x00gr\b\x00&x\xc9\xe5b'\x1e0\x00\x04\x00\x0020596224")},
{[]byte("000000000000\x81\x0000\b\x00400000000\x110000000000")},
{[]byte("000000000000\b\x00000000000\x0600\xff0000000")},
{[]byte("000000000000\x86\xdd000000\x06000000000000000000000000000000000")},
{[]byte("000000000000\x81\x0000\b\x00000000000\x0600b0000000")},
{[]byte("000000000000\x81\x0000\b\x00400000000\x060000000000")},
{[]byte("000000000000\x86\xdd000000\x11000000000000000000000000000000000")},
{[]byte("000000000000\x86\xdd000000\x0600000000000000000000000000000000000000000000M")},
{[]byte("000000000000\b\x00500000000\x0600000000000")},
{[]byte("0nM\xd80\xfe\x13\x13\x81\x00gr\b\x00&x\xc9\xe5b'\x1e0\x00\x04\x00\x0020596224")},
}
for _, entry := range testData {
pkt := &Packet{
Time: time.Now(),
Caplen: uint32(len(entry.Data)),
Len: uint32(len(entry.Data)),
Data: entry.Data,
}
pkt.Decode()
/*
func() {
defer func() {
if err := recover(); err != nil {
t.Fatalf("%d. %q failed: %v", idx, string(entry.Data), err)
}
}()
pkt.Decode()
}()
*/
}
}

View File

@ -1,206 +0,0 @@
package pcap
import (
"encoding/binary"
"fmt"
"io"
"time"
)
// FileHeader is the parsed header of a pcap file.
// http://wiki.wireshark.org/Development/LibpcapFileFormat
type FileHeader struct {
MagicNumber uint32
VersionMajor uint16
VersionMinor uint16
TimeZone int32
SigFigs uint32
SnapLen uint32
Network uint32
}
type PacketTime struct {
Sec int32
Usec int32
}
// Convert the PacketTime to a go Time struct.
func (p *PacketTime) Time() time.Time {
return time.Unix(int64(p.Sec), int64(p.Usec)*1000)
}
// Packet is a single packet parsed from a pcap file.
//
// Convenient access to IP, TCP, and UDP headers is provided after Decode()
// is called if the packet is of the appropriate type.
type Packet struct {
Time time.Time // packet send/receive time
Caplen uint32 // bytes stored in the file (caplen <= len)
Len uint32 // bytes sent/received
Data []byte // packet data
Type int // protocol type, see LINKTYPE_*
DestMac uint64
SrcMac uint64
Headers []interface{} // decoded headers, in order
Payload []byte // remaining non-header bytes
IP *Iphdr // IP header (for IP packets, after decoding)
TCP *Tcphdr // TCP header (for TCP packets, after decoding)
UDP *Udphdr // UDP header (for UDP packets after decoding)
}
// Reader parses pcap files.
type Reader struct {
flip bool
buf io.Reader
err error
fourBytes []byte
twoBytes []byte
sixteenBytes []byte
Header FileHeader
}
// NewReader reads pcap data from an io.Reader.
func NewReader(reader io.Reader) (*Reader, error) {
r := &Reader{
buf: reader,
fourBytes: make([]byte, 4),
twoBytes: make([]byte, 2),
sixteenBytes: make([]byte, 16),
}
switch magic := r.readUint32(); magic {
case 0xa1b2c3d4:
r.flip = false
case 0xd4c3b2a1:
r.flip = true
default:
return nil, fmt.Errorf("pcap: bad magic number: %0x", magic)
}
r.Header = FileHeader{
MagicNumber: 0xa1b2c3d4,
VersionMajor: r.readUint16(),
VersionMinor: r.readUint16(),
TimeZone: r.readInt32(),
SigFigs: r.readUint32(),
SnapLen: r.readUint32(),
Network: r.readUint32(),
}
return r, nil
}
// Next returns the next packet or nil if no more packets can be read.
func (r *Reader) Next() *Packet {
d := r.sixteenBytes
r.err = r.read(d)
if r.err != nil {
return nil
}
timeSec := asUint32(d[0:4], r.flip)
timeUsec := asUint32(d[4:8], r.flip)
capLen := asUint32(d[8:12], r.flip)
origLen := asUint32(d[12:16], r.flip)
data := make([]byte, capLen)
if r.err = r.read(data); r.err != nil {
return nil
}
return &Packet{
Time: time.Unix(int64(timeSec), int64(timeUsec)),
Caplen: capLen,
Len: origLen,
Data: data,
}
}
func (r *Reader) read(data []byte) error {
var err error
n, err := r.buf.Read(data)
for err == nil && n != len(data) {
var chunk int
chunk, err = r.buf.Read(data[n:])
n += chunk
}
if len(data) == n {
return nil
}
return err
}
func (r *Reader) readUint32() uint32 {
data := r.fourBytes
if r.err = r.read(data); r.err != nil {
return 0
}
return asUint32(data, r.flip)
}
func (r *Reader) readInt32() int32 {
data := r.fourBytes
if r.err = r.read(data); r.err != nil {
return 0
}
return int32(asUint32(data, r.flip))
}
func (r *Reader) readUint16() uint16 {
data := r.twoBytes
if r.err = r.read(data); r.err != nil {
return 0
}
return asUint16(data, r.flip)
}
// Writer writes a pcap file.
type Writer struct {
writer io.Writer
buf []byte
}
// NewWriter creates a Writer that stores output in an io.Writer.
// The FileHeader is written immediately.
func NewWriter(writer io.Writer, header *FileHeader) (*Writer, error) {
w := &Writer{
writer: writer,
buf: make([]byte, 24),
}
binary.LittleEndian.PutUint32(w.buf, header.MagicNumber)
binary.LittleEndian.PutUint16(w.buf[4:], header.VersionMajor)
binary.LittleEndian.PutUint16(w.buf[6:], header.VersionMinor)
binary.LittleEndian.PutUint32(w.buf[8:], uint32(header.TimeZone))
binary.LittleEndian.PutUint32(w.buf[12:], header.SigFigs)
binary.LittleEndian.PutUint32(w.buf[16:], header.SnapLen)
binary.LittleEndian.PutUint32(w.buf[20:], header.Network)
if _, err := writer.Write(w.buf); err != nil {
return nil, err
}
return w, nil
}
// Writer writes a packet to the underlying writer.
func (w *Writer) Write(pkt *Packet) error {
binary.LittleEndian.PutUint32(w.buf, uint32(pkt.Time.Unix()))
binary.LittleEndian.PutUint32(w.buf[4:], uint32(pkt.Time.Nanosecond()))
binary.LittleEndian.PutUint32(w.buf[8:], uint32(pkt.Time.Unix()))
binary.LittleEndian.PutUint32(w.buf[12:], pkt.Len)
if _, err := w.writer.Write(w.buf[:16]); err != nil {
return err
}
_, err := w.writer.Write(pkt.Data)
return err
}
func asUint32(data []byte, flip bool) uint32 {
if flip {
return binary.BigEndian.Uint32(data)
}
return binary.LittleEndian.Uint32(data)
}
func asUint16(data []byte, flip bool) uint16 {
if flip {
return binary.BigEndian.Uint16(data)
}
return binary.LittleEndian.Uint16(data)
}

View File

@ -1,266 +0,0 @@
// Interface to both live and offline pcap parsing.
package pcap
/*
#cgo linux LDFLAGS: -lpcap
#cgo freebsd LDFLAGS: -lpcap
#cgo darwin LDFLAGS: -lpcap
#cgo windows CFLAGS: -I C:/WpdPack/Include
#cgo windows,386 LDFLAGS: -L C:/WpdPack/Lib -lwpcap
#cgo windows,amd64 LDFLAGS: -L C:/WpdPack/Lib/x64 -lwpcap
#include <stdlib.h>
#include <pcap.h>
// Workaround for not knowing how to cast to const u_char**
int hack_pcap_next_ex(pcap_t *p, struct pcap_pkthdr **pkt_header,
u_char **pkt_data) {
return pcap_next_ex(p, pkt_header, (const u_char **)pkt_data);
}
*/
import "C"
import (
"errors"
"net"
"syscall"
"time"
"unsafe"
)
type Pcap struct {
cptr *C.pcap_t
}
type Stat struct {
PacketsReceived uint32
PacketsDropped uint32
PacketsIfDropped uint32
}
type Interface struct {
Name string
Description string
Addresses []IFAddress
// TODO: add more elements
}
type IFAddress struct {
IP net.IP
Netmask net.IPMask
// TODO: add broadcast + PtP dst ?
}
func (p *Pcap) Next() (pkt *Packet) {
rv, _ := p.NextEx()
return rv
}
// Openlive opens a device and returns a *Pcap handler
func Openlive(device string, snaplen int32, promisc bool, timeout_ms int32) (handle *Pcap, err error) {
var buf *C.char
buf = (*C.char)(C.calloc(ERRBUF_SIZE, 1))
h := new(Pcap)
var pro int32
if promisc {
pro = 1
}
dev := C.CString(device)
defer C.free(unsafe.Pointer(dev))
h.cptr = C.pcap_open_live(dev, C.int(snaplen), C.int(pro), C.int(timeout_ms), buf)
if nil == h.cptr {
handle = nil
err = errors.New(C.GoString(buf))
} else {
handle = h
}
C.free(unsafe.Pointer(buf))
return
}
func Openoffline(file string) (handle *Pcap, err error) {
var buf *C.char
buf = (*C.char)(C.calloc(ERRBUF_SIZE, 1))
h := new(Pcap)
cf := C.CString(file)
defer C.free(unsafe.Pointer(cf))
h.cptr = C.pcap_open_offline(cf, buf)
if nil == h.cptr {
handle = nil
err = errors.New(C.GoString(buf))
} else {
handle = h
}
C.free(unsafe.Pointer(buf))
return
}
func (p *Pcap) NextEx() (pkt *Packet, result int32) {
var pkthdr *C.struct_pcap_pkthdr
var buf_ptr *C.u_char
var buf unsafe.Pointer
result = int32(C.hack_pcap_next_ex(p.cptr, &pkthdr, &buf_ptr))
buf = unsafe.Pointer(buf_ptr)
if nil == buf {
return
}
pkt = new(Packet)
pkt.Time = time.Unix(int64(pkthdr.ts.tv_sec), int64(pkthdr.ts.tv_usec)*1000)
pkt.Caplen = uint32(pkthdr.caplen)
pkt.Len = uint32(pkthdr.len)
pkt.Data = C.GoBytes(buf, C.int(pkthdr.caplen))
return
}
func (p *Pcap) Close() {
C.pcap_close(p.cptr)
}
func (p *Pcap) Geterror() error {
return errors.New(C.GoString(C.pcap_geterr(p.cptr)))
}
func (p *Pcap) Getstats() (stat *Stat, err error) {
var cstats _Ctype_struct_pcap_stat
if -1 == C.pcap_stats(p.cptr, &cstats) {
return nil, p.Geterror()
}
stats := new(Stat)
stats.PacketsReceived = uint32(cstats.ps_recv)
stats.PacketsDropped = uint32(cstats.ps_drop)
stats.PacketsIfDropped = uint32(cstats.ps_ifdrop)
return stats, nil
}
func (p *Pcap) Setfilter(expr string) (err error) {
var bpf _Ctype_struct_bpf_program
cexpr := C.CString(expr)
defer C.free(unsafe.Pointer(cexpr))
if -1 == C.pcap_compile(p.cptr, &bpf, cexpr, 1, 0) {
return p.Geterror()
}
if -1 == C.pcap_setfilter(p.cptr, &bpf) {
C.pcap_freecode(&bpf)
return p.Geterror()
}
C.pcap_freecode(&bpf)
return nil
}
func Version() string {
return C.GoString(C.pcap_lib_version())
}
func (p *Pcap) Datalink() int {
return int(C.pcap_datalink(p.cptr))
}
func (p *Pcap) Setdatalink(dlt int) error {
if -1 == C.pcap_set_datalink(p.cptr, C.int(dlt)) {
return p.Geterror()
}
return nil
}
func DatalinkValueToName(dlt int) string {
if name := C.pcap_datalink_val_to_name(C.int(dlt)); name != nil {
return C.GoString(name)
}
return ""
}
func DatalinkValueToDescription(dlt int) string {
if desc := C.pcap_datalink_val_to_description(C.int(dlt)); desc != nil {
return C.GoString(desc)
}
return ""
}
func Findalldevs() (ifs []Interface, err error) {
var buf *C.char
buf = (*C.char)(C.calloc(ERRBUF_SIZE, 1))
defer C.free(unsafe.Pointer(buf))
var alldevsp *C.pcap_if_t
if -1 == C.pcap_findalldevs((**C.pcap_if_t)(&alldevsp), buf) {
return nil, errors.New(C.GoString(buf))
}
defer C.pcap_freealldevs((*C.pcap_if_t)(alldevsp))
dev := alldevsp
var i uint32
for i = 0; dev != nil; dev = (*C.pcap_if_t)(dev.next) {
i++
}
ifs = make([]Interface, i)
dev = alldevsp
for j := uint32(0); dev != nil; dev = (*C.pcap_if_t)(dev.next) {
var iface Interface
iface.Name = C.GoString(dev.name)
iface.Description = C.GoString(dev.description)
iface.Addresses = findalladdresses(dev.addresses)
// TODO: add more elements
ifs[j] = iface
j++
}
return
}
func findalladdresses(addresses *_Ctype_struct_pcap_addr) (retval []IFAddress) {
// TODO - make it support more than IPv4 and IPv6?
retval = make([]IFAddress, 0, 1)
for curaddr := addresses; curaddr != nil; curaddr = (*_Ctype_struct_pcap_addr)(curaddr.next) {
var a IFAddress
var err error
if a.IP, err = sockaddr_to_IP((*syscall.RawSockaddr)(unsafe.Pointer(curaddr.addr))); err != nil {
continue
}
if a.Netmask, err = sockaddr_to_IP((*syscall.RawSockaddr)(unsafe.Pointer(curaddr.addr))); err != nil {
continue
}
retval = append(retval, a)
}
return
}
func sockaddr_to_IP(rsa *syscall.RawSockaddr) (IP []byte, err error) {
switch rsa.Family {
case syscall.AF_INET:
pp := (*syscall.RawSockaddrInet4)(unsafe.Pointer(rsa))
IP = make([]byte, 4)
for i := 0; i < len(IP); i++ {
IP[i] = pp.Addr[i]
}
return
case syscall.AF_INET6:
pp := (*syscall.RawSockaddrInet6)(unsafe.Pointer(rsa))
IP = make([]byte, 16)
for i := 0; i < len(IP); i++ {
IP[i] = pp.Addr[i]
}
return
}
err = errors.New("Unsupported address type")
return
}
func (p *Pcap) Inject(data []byte) (err error) {
buf := (*C.char)(C.malloc((C.size_t)(len(data))))
for i := 0; i < len(data); i++ {
*(*byte)(unsafe.Pointer(uintptr(unsafe.Pointer(buf)) + uintptr(i))) = data[i]
}
if -1 == C.pcap_sendpacket(p.cptr, (*C.u_char)(unsafe.Pointer(buf)), (C.int)(len(data))) {
err = p.Geterror()
}
C.free(unsafe.Pointer(buf))
return
}

View File

@ -1,49 +0,0 @@
package main
import (
"flag"
"fmt"
"os"
"runtime/pprof"
"time"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/akrennmair/gopcap"
)
func main() {
var filename *string = flag.String("file", "", "filename")
var decode *bool = flag.Bool("d", false, "If true, decode each packet")
var cpuprofile *string = flag.String("cpuprofile", "", "filename")
flag.Parse()
h, err := pcap.Openoffline(*filename)
if err != nil {
fmt.Printf("Couldn't create pcap reader: %v", err)
}
if *cpuprofile != "" {
if out, err := os.Create(*cpuprofile); err == nil {
pprof.StartCPUProfile(out)
defer func() {
pprof.StopCPUProfile()
out.Close()
}()
} else {
panic(err)
}
}
i, nilPackets := 0, 0
start := time.Now()
for pkt, code := h.NextEx(); code != -2; pkt, code = h.NextEx() {
if pkt == nil {
nilPackets++
} else if *decode {
pkt.Decode()
}
i++
}
duration := time.Since(start)
fmt.Printf("Took %v to process %v packets, %v per packet, %d nil packets\n", duration, i, duration/time.Duration(i), nilPackets)
}

View File

@ -1,96 +0,0 @@
package main
// Parses a pcap file, writes it back to disk, then verifies the files
// are the same.
import (
"bufio"
"flag"
"fmt"
"io"
"os"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/akrennmair/gopcap"
)
var input *string = flag.String("input", "", "input file")
var output *string = flag.String("output", "", "output file")
var decode *bool = flag.Bool("decode", false, "print decoded packets")
func copyPcap(dest, src string) {
f, err := os.Open(src)
if err != nil {
fmt.Printf("couldn't open %q: %v\n", src, err)
return
}
defer f.Close()
reader, err := pcap.NewReader(bufio.NewReader(f))
if err != nil {
fmt.Printf("couldn't create reader: %v\n", err)
return
}
w, err := os.Create(dest)
if err != nil {
fmt.Printf("couldn't open %q: %v\n", dest, err)
return
}
defer w.Close()
buf := bufio.NewWriter(w)
writer, err := pcap.NewWriter(buf, &reader.Header)
if err != nil {
fmt.Printf("couldn't create writer: %v\n", err)
return
}
for {
pkt := reader.Next()
if pkt == nil {
break
}
if *decode {
pkt.Decode()
fmt.Println(pkt.String())
}
writer.Write(pkt)
}
buf.Flush()
}
func check(dest, src string) {
f, err := os.Open(src)
if err != nil {
fmt.Printf("couldn't open %q: %v\n", src, err)
return
}
defer f.Close()
freader := bufio.NewReader(f)
g, err := os.Open(dest)
if err != nil {
fmt.Printf("couldn't open %q: %v\n", src, err)
return
}
defer g.Close()
greader := bufio.NewReader(g)
for {
fb, ferr := freader.ReadByte()
gb, gerr := greader.ReadByte()
if ferr == io.EOF && gerr == io.EOF {
break
}
if fb == gb {
continue
}
fmt.Println("FAIL")
return
}
fmt.Println("PASS")
}
func main() {
flag.Parse()
copyPcap(*output, *input)
check(*output, *input)
}

View File

@ -1,82 +0,0 @@
package main
import (
"flag"
"fmt"
"time"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/akrennmair/gopcap"
)
func min(x uint32, y uint32) uint32 {
if x < y {
return x
}
return y
}
func main() {
var device *string = flag.String("d", "", "device")
var file *string = flag.String("r", "", "file")
var expr *string = flag.String("e", "", "filter expression")
flag.Parse()
var h *pcap.Pcap
var err error
ifs, err := pcap.Findalldevs()
if len(ifs) == 0 {
fmt.Printf("Warning: no devices found : %s\n", err)
} else {
for i := 0; i < len(ifs); i++ {
fmt.Printf("dev %d: %s (%s)\n", i+1, ifs[i].Name, ifs[i].Description)
}
}
if *device != "" {
h, err = pcap.Openlive(*device, 65535, true, 0)
if h == nil {
fmt.Printf("Openlive(%s) failed: %s\n", *device, err)
return
}
} else if *file != "" {
h, err = pcap.Openoffline(*file)
if h == nil {
fmt.Printf("Openoffline(%s) failed: %s\n", *file, err)
return
}
} else {
fmt.Printf("usage: pcaptest [-d <device> | -r <file>]\n")
return
}
defer h.Close()
fmt.Printf("pcap version: %s\n", pcap.Version())
if *expr != "" {
fmt.Printf("Setting filter: %s\n", *expr)
err := h.Setfilter(*expr)
if err != nil {
fmt.Printf("Warning: setting filter failed: %s\n", err)
}
}
for pkt := h.Next(); pkt != nil; pkt = h.Next() {
fmt.Printf("time: %d.%06d (%s) caplen: %d len: %d\nData:",
int64(pkt.Time.Second()), int64(pkt.Time.Nanosecond()),
time.Unix(int64(pkt.Time.Second()), 0).String(), int64(pkt.Caplen), int64(pkt.Len))
for i := uint32(0); i < pkt.Caplen; i++ {
if i%32 == 0 {
fmt.Printf("\n")
}
if 32 <= pkt.Data[i] && pkt.Data[i] <= 126 {
fmt.Printf("%c", pkt.Data[i])
} else {
fmt.Printf(".")
}
}
fmt.Printf("\n\n")
}
}

View File

@ -1,121 +0,0 @@
package main
import (
"bufio"
"flag"
"fmt"
"os"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/akrennmair/gopcap"
)
const (
TYPE_IP = 0x0800
TYPE_ARP = 0x0806
TYPE_IP6 = 0x86DD
IP_ICMP = 1
IP_INIP = 4
IP_TCP = 6
IP_UDP = 17
)
var out *bufio.Writer
var errout *bufio.Writer
func main() {
var device *string = flag.String("i", "", "interface")
var snaplen *int = flag.Int("s", 65535, "snaplen")
var hexdump *bool = flag.Bool("X", false, "hexdump")
expr := ""
out = bufio.NewWriter(os.Stdout)
errout = bufio.NewWriter(os.Stderr)
flag.Usage = func() {
fmt.Fprintf(errout, "usage: %s [ -i interface ] [ -s snaplen ] [ -X ] [ expression ]\n", os.Args[0])
errout.Flush()
os.Exit(1)
}
flag.Parse()
if len(flag.Args()) > 0 {
expr = flag.Arg(0)
}
if *device == "" {
devs, err := pcap.Findalldevs()
if err != nil {
fmt.Fprintf(errout, "tcpdump: couldn't find any devices: %s\n", err)
}
if 0 == len(devs) {
flag.Usage()
}
*device = devs[0].Name
}
h, err := pcap.Openlive(*device, int32(*snaplen), true, 0)
if h == nil {
fmt.Fprintf(errout, "tcpdump: %s\n", err)
errout.Flush()
return
}
defer h.Close()
if expr != "" {
ferr := h.Setfilter(expr)
if ferr != nil {
fmt.Fprintf(out, "tcpdump: %s\n", ferr)
out.Flush()
}
}
for pkt := h.Next(); pkt != nil; pkt = h.Next() {
pkt.Decode()
fmt.Fprintf(out, "%s\n", pkt.String())
if *hexdump {
Hexdump(pkt)
}
out.Flush()
}
}
func min(a, b int) int {
if a < b {
return a
}
return b
}
func Hexdump(pkt *pcap.Packet) {
for i := 0; i < len(pkt.Data); i += 16 {
Dumpline(uint32(i), pkt.Data[i:min(i+16, len(pkt.Data))])
}
}
func Dumpline(addr uint32, line []byte) {
fmt.Fprintf(out, "\t0x%04x: ", int32(addr))
var i uint16
for i = 0; i < 16 && i < uint16(len(line)); i++ {
if i%2 == 0 {
out.WriteString(" ")
}
fmt.Fprintf(out, "%02x", line[i])
}
for j := i; j <= 16; j++ {
if j%2 == 0 {
out.WriteString(" ")
}
out.WriteString(" ")
}
out.WriteString(" ")
for i = 0; i < 16 && i < uint16(len(line)); i++ {
if line[i] >= 32 && line[i] <= 126 {
fmt.Fprintf(out, "%c", line[i])
} else {
out.WriteString(".")
}
}
out.WriteString("\n")
}

View File

@ -1,18 +1,54 @@
TEST=.
BENCH=.
COVERPROFILE=/tmp/c.out
BRANCH=`git rev-parse --abbrev-ref HEAD`
COMMIT=`git rev-parse --short HEAD`
GOLDFLAGS="-X main.branch $(BRANCH) -X main.commit $(COMMIT)"
default: build
race:
@go test -v -race -test.run="TestSimulate_(100op|1000op)"
bench:
go test -v -test.run=NOTHINCONTAINSTHIS -test.bench=$(BENCH)
# http://cloc.sourceforge.net/
cloc:
@cloc --not-match-f='Makefile|_test.go' .
cover: fmt
go test -coverprofile=$(COVERPROFILE) -test.run=$(TEST) $(COVERFLAG) .
go tool cover -html=$(COVERPROFILE)
rm $(COVERPROFILE)
cpuprofile: fmt
@go test -c
@./bolt.test -test.v -test.run=$(TEST) -test.cpuprofile cpu.prof
# go get github.com/kisielk/errcheck
errcheck:
@errcheck -ignorepkg=bytes -ignore=os:Remove github.com/boltdb/bolt
@echo "=== errcheck ==="
@errcheck github.com/boltdb/bolt
test:
@go test -v -cover .
@go test -v ./cmd/bolt
fmt:
@go fmt ./...
.PHONY: fmt test
get:
@go get -d ./...
build: get
@mkdir -p bin
@go build -ldflags=$(GOLDFLAGS) -a -o bin/bolt ./cmd/bolt
test: fmt
@go get github.com/stretchr/testify/assert
@echo "=== TESTS ==="
@go test -v -cover -test.run=$(TEST)
@echo ""
@echo ""
@echo "=== CLI ==="
@go test -v -test.run=$(TEST) ./cmd/bolt
@echo ""
@echo ""
@echo "=== RACE DETECTOR ==="
@go test -v -race -test.run="TestSimulate_(100op|1000op)"
.PHONY: bench cloc cover cpuprofile fmt memprofile test

View File

@ -1,8 +1,8 @@
Bolt [![Build Status](https://drone.io/github.com/boltdb/bolt/status.png)](https://drone.io/github.com/boltdb/bolt/latest) [![Coverage Status](https://coveralls.io/repos/boltdb/bolt/badge.svg?branch=master)](https://coveralls.io/r/boltdb/bolt?branch=master) [![GoDoc](https://godoc.org/github.com/boltdb/bolt?status.svg)](https://godoc.org/github.com/boltdb/bolt) ![Version](https://img.shields.io/badge/version-1.0-green.svg)
Bolt [![Build Status](https://drone.io/github.com/boltdb/bolt/status.png)](https://drone.io/github.com/boltdb/bolt/latest) [![Coverage Status](https://coveralls.io/repos/boltdb/bolt/badge.png?branch=master)](https://coveralls.io/r/boltdb/bolt?branch=master) [![GoDoc](https://godoc.org/github.com/boltdb/bolt?status.png)](https://godoc.org/github.com/boltdb/bolt) ![Version](http://img.shields.io/badge/version-1.0-green.png)
====
Bolt is a pure Go key/value store inspired by [Howard Chu's][hyc_symas]
[LMDB project][lmdb]. The goal of the project is to provide a simple,
Bolt is a pure Go key/value store inspired by [Howard Chu's][hyc_symas] and
the [LMDB project][lmdb]. The goal of the project is to provide a simple,
fast, and reliable database for projects that don't require a full database
server such as Postgres or MySQL.
@ -13,6 +13,7 @@ and setting values. That's it.
[hyc_symas]: https://twitter.com/hyc_symas
[lmdb]: http://symas.com/mdb/
## Project Status
Bolt is stable and the API is fixed. Full unit test coverage and randomized
@ -21,36 +22,6 @@ Bolt is currently in high-load production environments serving databases as
large as 1TB. Many companies such as Shopify and Heroku use Bolt-backed
services every day.
## Table of Contents
- [Getting Started](#getting-started)
- [Installing](#installing)
- [Opening a database](#opening-a-database)
- [Transactions](#transactions)
- [Read-write transactions](#read-write-transactions)
- [Read-only transactions](#read-only-transactions)
- [Batch read-write transactions](#batch-read-write-transactions)
- [Managing transactions manually](#managing-transactions-manually)
- [Using buckets](#using-buckets)
- [Using key/value pairs](#using-keyvalue-pairs)
- [Autoincrementing integer for the bucket](#autoincrementing-integer-for-the-bucket)
- [Iterating over keys](#iterating-over-keys)
- [Prefix scans](#prefix-scans)
- [Range scans](#range-scans)
- [ForEach()](#foreach)
- [Nested buckets](#nested-buckets)
- [Database backups](#database-backups)
- [Statistics](#statistics)
- [Read-Only Mode](#read-only-mode)
- [Mobile Use (iOS/Android)](#mobile-use-iosandroid)
- [Resources](#resources)
- [Comparison with other databases](#comparison-with-other-databases)
- [Postgres, MySQL, & other relational databases](#postgres-mysql--other-relational-databases)
- [LevelDB, RocksDB](#leveldb-rocksdb)
- [LMDB](#lmdb)
- [Caveats & Limitations](#caveats--limitations)
- [Reading the Source](#reading-the-source)
- [Other Projects Using Bolt](#other-projects-using-bolt)
## Getting Started
@ -209,8 +180,8 @@ and then safely close your transaction if an error is returned. This is the
recommended way to use Bolt transactions.
However, sometimes you may want to manually start and end your transactions.
You can use the `Tx.Begin()` function directly but **please** be sure to close
the transaction.
You can use the `Tx.Begin()` function directly but _please_ be sure to close the
transaction.
```go
// Start a writable transaction.
@ -285,7 +256,7 @@ db.View(func(tx *bolt.Tx) error {
```
The `Get()` function does not return an error because its operation is
guaranteed to work (unless there is some kind of system failure). If the key
guarenteed to work (unless there is some kind of system failure). If the key
exists then it will return its byte slice value. If it doesn't exist then it
will return `nil`. It's important to note that you can have a zero-length value
set to a key which is different than the key not existing.
@ -297,49 +268,6 @@ transaction is open. If you need to use a value outside of the transaction
then you must use `copy()` to copy it to another byte slice.
### Autoincrementing integer for the bucket
By using the `NextSequence()` function, you can let Bolt determine a sequence
which can be used as the unique identifier for your key/value pairs. See the
example below.
```go
// CreateUser saves u to the store. The new user ID is set on u once the data is persisted.
func (s *Store) CreateUser(u *User) error {
return s.db.Update(func(tx *bolt.Tx) error {
// Retrieve the users bucket.
// This should be created when the DB is first opened.
b := tx.Bucket([]byte("users"))
// Generate ID for the user.
// This returns an error only if the Tx is closed or not writeable.
// That can't happen in an Update() call so I ignore the error check.
id, _ = b.NextSequence()
u.ID = int(id)
// Marshal user data into bytes.
buf, err := json.Marshal(u)
if err != nil {
return err
}
// Persist bytes to users bucket.
return b.Put(itob(u.ID), buf)
})
}
// itob returns an 8-byte big endian representation of v.
func itob(v int) []byte {
b := make([]byte, 8)
binary.BigEndian.PutUint64(b, uint64(v))
return b
}
type User struct {
ID int
...
}
```
### Iterating over keys
Bolt stores its keys in byte-sorted order within a bucket. This makes sequential
@ -348,9 +276,7 @@ iteration over these keys extremely fast. To iterate over keys we'll use a
```go
db.View(func(tx *bolt.Tx) error {
// Assume bucket exists and has keys
b := tx.Bucket([]byte("MyBucket"))
c := b.Cursor()
for k, v := c.First(); k != nil; k, v = c.Next() {
@ -374,15 +300,10 @@ Next() Move to the next key.
Prev() Move to the previous key.
```
Each of those functions has a return signature of `(key []byte, value []byte)`.
When you have iterated to the end of the cursor then `Next()` will return a
`nil` key. You must seek to a position using `First()`, `Last()`, or `Seek()`
before calling `Next()` or `Prev()`. If you do not seek to a position then
these functions will return a `nil` key.
During iteration, if the key is non-`nil` but the value is `nil`, that means
the key refers to a bucket rather than a value. Use `Bucket.Bucket()` to
access the sub-bucket.
When you have iterated to the end of the cursor then `Next()` will return `nil`.
You must seek to a position using `First()`, `Last()`, or `Seek()` before
calling `Next()` or `Prev()`. If you do not seek to a position then these
functions will return `nil`.
#### Prefix scans
@ -391,7 +312,6 @@ To iterate over a key prefix, you can combine `Seek()` and `bytes.HasPrefix()`:
```go
db.View(func(tx *bolt.Tx) error {
// Assume bucket exists and has keys
c := tx.Bucket([]byte("MyBucket")).Cursor()
prefix := []byte("1234")
@ -411,7 +331,7 @@ date range like this:
```go
db.View(func(tx *bolt.Tx) error {
// Assume our events bucket exists and has RFC3339 encoded time keys.
// Assume our events bucket has RFC3339 encoded time keys.
c := tx.Bucket([]byte("Events")).Cursor()
// Our time range spans the 90's decade.
@ -435,9 +355,7 @@ all the keys in a bucket:
```go
db.View(func(tx *bolt.Tx) error {
// Assume bucket exists and has keys
b := tx.Bucket([]byte("MyBucket"))
b.ForEach(func(k, v []byte) error {
fmt.Printf("key=%s, value=%s\n", k, v)
return nil
@ -464,11 +382,8 @@ func (*Bucket) DeleteBucket(key []byte) error
Bolt is a single file so it's easy to backup. You can use the `Tx.WriteTo()`
function to write a consistent view of the database to a writer. If you call
this from a read-only transaction, it will perform a hot backup and not block
your other database reads and writes.
By default, it will use a regular file handle which will utilize the operating
system's page cache. See the [`Tx`](https://godoc.org/github.com/boltdb/bolt#Tx)
documentation for information about optimizing for larger-than-RAM datasets.
your other database reads and writes. It will also use `O_DIRECT` when available
to prevent page cache trashing.
One common use case is to backup over HTTP so you can use tools like `cURL` to
do database backups:
@ -550,84 +465,6 @@ if err != nil {
}
```
### Mobile Use (iOS/Android)
Bolt is able to run on mobile devices by leveraging the binding feature of the
[gomobile](https://github.com/golang/mobile) tool. Create a struct that will
contain your database logic and a reference to a `*bolt.DB` with a initializing
contstructor that takes in a filepath where the database file will be stored.
Neither Android nor iOS require extra permissions or cleanup from using this method.
```go
func NewBoltDB(filepath string) *BoltDB {
db, err := bolt.Open(filepath+"/demo.db", 0600, nil)
if err != nil {
log.Fatal(err)
}
return &BoltDB{db}
}
type BoltDB struct {
db *bolt.DB
...
}
func (b *BoltDB) Path() string {
return b.db.Path()
}
func (b *BoltDB) Close() {
b.db.Close()
}
```
Database logic should be defined as methods on this wrapper struct.
To initialize this struct from the native language (both platforms now sync
their local storage to the cloud. These snippets disable that functionality for the
database file):
#### Android
```java
String path;
if (android.os.Build.VERSION.SDK_INT >=android.os.Build.VERSION_CODES.LOLLIPOP){
path = getNoBackupFilesDir().getAbsolutePath();
} else{
path = getFilesDir().getAbsolutePath();
}
Boltmobiledemo.BoltDB boltDB = Boltmobiledemo.NewBoltDB(path)
```
#### iOS
```objc
- (void)demo {
NSString* path = [NSSearchPathForDirectoriesInDomains(NSLibraryDirectory,
NSUserDomainMask,
YES) objectAtIndex:0];
GoBoltmobiledemoBoltDB * demo = GoBoltmobiledemoNewBoltDB(path);
[self addSkipBackupAttributeToItemAtPath:demo.path];
//Some DB Logic would go here
[demo close];
}
- (BOOL)addSkipBackupAttributeToItemAtPath:(NSString *) filePathString
{
NSURL* URL= [NSURL fileURLWithPath: filePathString];
assert([[NSFileManager defaultManager] fileExistsAtPath: [URL path]]);
NSError *error = nil;
BOOL success = [URL setResourceValue: [NSNumber numberWithBool: YES]
forKey: NSURLIsExcludedFromBackupKey error: &error];
if(!success){
NSLog(@"Error excluding %@ from backup %@", [URL lastPathComponent], error);
}
return success;
}
```
## Resources
@ -663,7 +500,7 @@ they are libraries bundled into the application, however, their underlying
structure is a log-structured merge-tree (LSM tree). An LSM tree optimizes
random writes by using a write ahead log and multi-tiered, sorted files called
SSTables. Bolt uses a B+tree internally and only a single file. Both approaches
have trade-offs.
have trade offs.
If you require a high random write throughput (>10,000 w/sec) or you need to use
spinning disks then LevelDB could be a good choice. If your application is
@ -699,8 +536,9 @@ It's important to pick the right tool for the job and Bolt is no exception.
Here are a few things to note when evaluating and using Bolt:
* Bolt is good for read intensive workloads. Sequential write performance is
also fast but random writes can be slow. You can use `DB.Batch()` or add a
write-ahead log to help mitigate this issue.
also fast but random writes can be slow. You can add a write-ahead log or
[transaction coalescer](https://github.com/boltdb/coalescer) in front of Bolt
to mitigate this issue.
* Bolt uses a B+tree internally so there can be a lot of random page access.
SSDs provide a significant performance boost over spinning disks.
@ -730,13 +568,11 @@ Here are a few things to note when evaluating and using Bolt:
can in memory and will release memory as needed to other processes. This means
that Bolt can show very high memory usage when working with large databases.
However, this is expected and the OS will release memory as needed. Bolt can
handle databases much larger than the available physical RAM, provided its
memory-map fits in the process virtual address space. It may be problematic
on 32-bits systems.
handle databases much larger than the available physical RAM.
* The data structures in the Bolt database are memory mapped so the data file
will be endian specific. This means that you cannot copy a Bolt file from a
little endian machine to a big endian machine and have it work. For most
little endian machine to a big endian machine and have it work. For most
users this is not a concern since most modern CPUs are little endian.
* Because of the way pages are laid out on disk, Bolt cannot truncate data files
@ -751,56 +587,6 @@ Here are a few things to note when evaluating and using Bolt:
[page-allocation]: https://github.com/boltdb/bolt/issues/308#issuecomment-74811638
## Reading the Source
Bolt is a relatively small code base (<3KLOC) for an embedded, serializable,
transactional key/value database so it can be a good starting point for people
interested in how databases work.
The best places to start are the main entry points into Bolt:
- `Open()` - Initializes the reference to the database. It's responsible for
creating the database if it doesn't exist, obtaining an exclusive lock on the
file, reading the meta pages, & memory-mapping the file.
- `DB.Begin()` - Starts a read-only or read-write transaction depending on the
value of the `writable` argument. This requires briefly obtaining the "meta"
lock to keep track of open transactions. Only one read-write transaction can
exist at a time so the "rwlock" is acquired during the life of a read-write
transaction.
- `Bucket.Put()` - Writes a key/value pair into a bucket. After validating the
arguments, a cursor is used to traverse the B+tree to the page and position
where they key & value will be written. Once the position is found, the bucket
materializes the underlying page and the page's parent pages into memory as
"nodes". These nodes are where mutations occur during read-write transactions.
These changes get flushed to disk during commit.
- `Bucket.Get()` - Retrieves a key/value pair from a bucket. This uses a cursor
to move to the page & position of a key/value pair. During a read-only
transaction, the key and value data is returned as a direct reference to the
underlying mmap file so there's no allocation overhead. For read-write
transactions, this data may reference the mmap file or one of the in-memory
node values.
- `Cursor` - This object is simply for traversing the B+tree of on-disk pages
or in-memory nodes. It can seek to a specific key, move to the first or last
value, or it can move forward or backward. The cursor handles the movement up
and down the B+tree transparently to the end user.
- `Tx.Commit()` - Converts the in-memory dirty nodes and the list of free pages
into pages to be written to disk. Writing to disk then occurs in two phases.
First, the dirty pages are written to disk and an `fsync()` occurs. Second, a
new meta page with an incremented transaction ID is written and another
`fsync()` occurs. This two phase write ensures that partially written data
pages are ignored in the event of a crash since the meta page pointing to them
is never written. Partially written meta pages are invalidated because they
are written with a checksum.
If you have additional notes that could be helpful for others, please submit
them via pull request.
## Other Projects Using Bolt
Below is a list of public, open source projects that use Bolt:
@ -811,33 +597,25 @@ Below is a list of public, open source projects that use Bolt:
* [Skybox Analytics](https://github.com/skybox/skybox) - A standalone funnel analysis tool for web analytics.
* [Scuttlebutt](https://github.com/benbjohnson/scuttlebutt) - Uses Bolt to store and process all Twitter mentions of GitHub projects.
* [Wiki](https://github.com/peterhellberg/wiki) - A tiny wiki using Goji, BoltDB and Blackfriday.
* [ChainStore](https://github.com/pressly/chainstore) - Simple key-value interface to a variety of storage engines organized as a chain of operations.
* [ChainStore](https://github.com/nulayer/chainstore) - Simple key-value interface to a variety of storage engines organized as a chain of operations.
* [MetricBase](https://github.com/msiebuhr/MetricBase) - Single-binary version of Graphite.
* [Gitchain](https://github.com/gitchain/gitchain) - Decentralized, peer-to-peer Git repositories aka "Git meets Bitcoin".
* [event-shuttle](https://github.com/sclasen/event-shuttle) - A Unix system service to collect and reliably deliver messages to Kafka.
* [ipxed](https://github.com/kelseyhightower/ipxed) - Web interface and api for ipxed.
* [BoltStore](https://github.com/yosssi/boltstore) - Session store using Bolt.
* [photosite/session](https://godoc.org/bitbucket.org/kardianos/photosite/session) - Sessions for a photo viewing site.
* [photosite/session](http://godoc.org/bitbucket.org/kardianos/photosite/session) - Sessions for a photo viewing site.
* [LedisDB](https://github.com/siddontang/ledisdb) - A high performance NoSQL, using Bolt as optional storage.
* [ipLocator](https://github.com/AndreasBriese/ipLocator) - A fast ip-geo-location-server using bolt with bloom filters.
* [cayley](https://github.com/google/cayley) - Cayley is an open-source graph database using Bolt as optional backend.
* [bleve](http://www.blevesearch.com/) - A pure Go search engine similar to ElasticSearch that uses Bolt as the default storage backend.
* [tentacool](https://github.com/optiflows/tentacool) - REST api server to manage system stuff (IP, DNS, Gateway...) on a linux server.
* [SkyDB](https://github.com/skydb/sky) - Behavioral analytics database.
* [Seaweed File System](https://github.com/chrislusf/seaweedfs) - Highly scalable distributed key~file system with O(1) disk read.
* [InfluxDB](https://influxdata.com) - Scalable datastore for metrics, events, and real-time analytics.
* [Seaweed File System](https://github.com/chrislusf/weed-fs) - Highly scalable distributed key~file system with O(1) disk read.
* [InfluxDB](http://influxdb.com) - Scalable datastore for metrics, events, and real-time analytics.
* [Freehold](http://tshannon.bitbucket.org/freehold/) - An open, secure, and lightweight platform for your files and data.
* [Prometheus Annotation Server](https://github.com/oliver006/prom_annotation_server) - Annotation server for PromDash & Prometheus service monitoring system.
* [Consul](https://github.com/hashicorp/consul) - Consul is service discovery and configuration made easy. Distributed, highly available, and datacenter-aware.
* [Kala](https://github.com/ajvb/kala) - Kala is a modern job scheduler optimized to run on a single node. It is persistent, JSON over HTTP API, ISO 8601 duration notation, and dependent jobs.
* [Kala](https://github.com/ajvb/kala) - Kala is a modern job scheduler optimized to run on a single node. It is persistant, JSON over HTTP API, ISO 8601 duration notation, and dependent jobs.
* [drive](https://github.com/odeke-em/drive) - drive is an unofficial Google Drive command line client for \*NIX operating systems.
* [stow](https://github.com/djherbis/stow) - a persistence manager for objects
backed by boltdb.
* [buckets](https://github.com/joyrexus/buckets) - a bolt wrapper streamlining
simple tx and key scans.
* [mbuckets](https://github.com/abhigupta912/mbuckets) - A Bolt wrapper that allows easy operations on multi level (nested) buckets.
* [Request Baskets](https://github.com/darklynx/request-baskets) - A web service to collect arbitrary HTTP requests and inspect them via REST API or simple web UI, similar to [RequestBin](http://requestb.in/) service
* [Go Report Card](https://goreportcard.com/) - Go code quality report cards as a (free and open source) service.
* [Boltdb Boilerplate](https://github.com/bobintornado/boltdb-boilerplate) - Boilerplate wrapper around bolt aiming to make simple calls one-liners.
If you are using Bolt in a project please send a pull request to add it to the list.

138
Godeps/_workspace/src/github.com/boltdb/bolt/batch.go generated vendored Normal file
View File

@ -0,0 +1,138 @@
package bolt
import (
"errors"
"fmt"
"sync"
"time"
)
// Batch calls fn as part of a batch. It behaves similar to Update,
// except:
//
// 1. concurrent Batch calls can be combined into a single Bolt
// transaction.
//
// 2. the function passed to Batch may be called multiple times,
// regardless of whether it returns error or not.
//
// This means that Batch function side effects must be idempotent and
// take permanent effect only after a successful return is seen in
// caller.
//
// The maximum batch size and delay can be adjusted with DB.MaxBatchSize
// and DB.MaxBatchDelay, respectively.
//
// Batch is only useful when there are multiple goroutines calling it.
func (db *DB) Batch(fn func(*Tx) error) error {
errCh := make(chan error, 1)
db.batchMu.Lock()
if (db.batch == nil) || (db.batch != nil && len(db.batch.calls) >= db.MaxBatchSize) {
// There is no existing batch, or the existing batch is full; start a new one.
db.batch = &batch{
db: db,
}
db.batch.timer = time.AfterFunc(db.MaxBatchDelay, db.batch.trigger)
}
db.batch.calls = append(db.batch.calls, call{fn: fn, err: errCh})
if len(db.batch.calls) >= db.MaxBatchSize {
// wake up batch, it's ready to run
go db.batch.trigger()
}
db.batchMu.Unlock()
err := <-errCh
if err == trySolo {
err = db.Update(fn)
}
return err
}
type call struct {
fn func(*Tx) error
err chan<- error
}
type batch struct {
db *DB
timer *time.Timer
start sync.Once
calls []call
}
// trigger runs the batch if it hasn't already been run.
func (b *batch) trigger() {
b.start.Do(b.run)
}
// run performs the transactions in the batch and communicates results
// back to DB.Batch.
func (b *batch) run() {
b.db.batchMu.Lock()
b.timer.Stop()
// Make sure no new work is added to this batch, but don't break
// other batches.
if b.db.batch == b {
b.db.batch = nil
}
b.db.batchMu.Unlock()
retry:
for len(b.calls) > 0 {
var failIdx = -1
err := b.db.Update(func(tx *Tx) error {
for i, c := range b.calls {
if err := safelyCall(c.fn, tx); err != nil {
failIdx = i
return err
}
}
return nil
})
if failIdx >= 0 {
// take the failing transaction out of the batch. it's
// safe to shorten b.calls here because db.batch no longer
// points to us, and we hold the mutex anyway.
c := b.calls[failIdx]
b.calls[failIdx], b.calls = b.calls[len(b.calls)-1], b.calls[:len(b.calls)-1]
// tell the submitter re-run it solo, continue with the rest of the batch
c.err <- trySolo
continue retry
}
// pass success, or bolt internal errors, to all callers
for _, c := range b.calls {
if c.err != nil {
c.err <- err
}
}
break retry
}
}
// trySolo is a special sentinel error value used for signaling that a
// transaction function should be re-run. It should never be seen by
// callers.
var trySolo = errors.New("batch function returned an error and should be re-run solo")
type panicked struct {
reason interface{}
}
func (p panicked) Error() string {
if err, ok := p.reason.(error); ok {
return err.Error()
}
return fmt.Sprintf("panic: %v", p.reason)
}
func safelyCall(fn func(*Tx) error, tx *Tx) (err error) {
defer func() {
if p := recover(); p != nil {
err = panicked{p}
}
}()
return fn(tx)
}

View File

@ -0,0 +1,170 @@
package bolt_test
import (
"bytes"
"encoding/binary"
"errors"
"hash/fnv"
"sync"
"testing"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt"
)
func validateBatchBench(b *testing.B, db *TestDB) {
var rollback = errors.New("sentinel error to cause rollback")
validate := func(tx *bolt.Tx) error {
bucket := tx.Bucket([]byte("bench"))
h := fnv.New32a()
buf := make([]byte, 4)
for id := uint32(0); id < 1000; id++ {
binary.LittleEndian.PutUint32(buf, id)
h.Reset()
h.Write(buf[:])
k := h.Sum(nil)
v := bucket.Get(k)
if v == nil {
b.Errorf("not found id=%d key=%x", id, k)
continue
}
if g, e := v, []byte("filler"); !bytes.Equal(g, e) {
b.Errorf("bad value for id=%d key=%x: %s != %q", id, k, g, e)
}
if err := bucket.Delete(k); err != nil {
return err
}
}
// should be empty now
c := bucket.Cursor()
for k, v := c.First(); k != nil; k, v = c.Next() {
b.Errorf("unexpected key: %x = %q", k, v)
}
return rollback
}
if err := db.Update(validate); err != nil && err != rollback {
b.Error(err)
}
}
func BenchmarkDBBatchAutomatic(b *testing.B) {
db := NewTestDB()
defer db.Close()
db.MustCreateBucket([]byte("bench"))
b.ResetTimer()
for i := 0; i < b.N; i++ {
start := make(chan struct{})
var wg sync.WaitGroup
for round := 0; round < 1000; round++ {
wg.Add(1)
go func(id uint32) {
defer wg.Done()
<-start
h := fnv.New32a()
buf := make([]byte, 4)
binary.LittleEndian.PutUint32(buf, id)
h.Write(buf[:])
k := h.Sum(nil)
insert := func(tx *bolt.Tx) error {
b := tx.Bucket([]byte("bench"))
return b.Put(k, []byte("filler"))
}
if err := db.Batch(insert); err != nil {
b.Error(err)
return
}
}(uint32(round))
}
close(start)
wg.Wait()
}
b.StopTimer()
validateBatchBench(b, db)
}
func BenchmarkDBBatchSingle(b *testing.B) {
db := NewTestDB()
defer db.Close()
db.MustCreateBucket([]byte("bench"))
b.ResetTimer()
for i := 0; i < b.N; i++ {
start := make(chan struct{})
var wg sync.WaitGroup
for round := 0; round < 1000; round++ {
wg.Add(1)
go func(id uint32) {
defer wg.Done()
<-start
h := fnv.New32a()
buf := make([]byte, 4)
binary.LittleEndian.PutUint32(buf, id)
h.Write(buf[:])
k := h.Sum(nil)
insert := func(tx *bolt.Tx) error {
b := tx.Bucket([]byte("bench"))
return b.Put(k, []byte("filler"))
}
if err := db.Update(insert); err != nil {
b.Error(err)
return
}
}(uint32(round))
}
close(start)
wg.Wait()
}
b.StopTimer()
validateBatchBench(b, db)
}
func BenchmarkDBBatchManual10x100(b *testing.B) {
db := NewTestDB()
defer db.Close()
db.MustCreateBucket([]byte("bench"))
b.ResetTimer()
for i := 0; i < b.N; i++ {
start := make(chan struct{})
var wg sync.WaitGroup
for major := 0; major < 10; major++ {
wg.Add(1)
go func(id uint32) {
defer wg.Done()
<-start
insert100 := func(tx *bolt.Tx) error {
h := fnv.New32a()
buf := make([]byte, 4)
for minor := uint32(0); minor < 100; minor++ {
binary.LittleEndian.PutUint32(buf, uint32(id*100+minor))
h.Reset()
h.Write(buf[:])
k := h.Sum(nil)
b := tx.Bucket([]byte("bench"))
if err := b.Put(k, []byte("filler")); err != nil {
return err
}
}
return nil
}
if err := db.Update(insert100); err != nil {
b.Fatal(err)
}
}(uint32(major))
}
close(start)
wg.Wait()
}
b.StopTimer()
validateBatchBench(b, db)
}

View File

@ -0,0 +1,148 @@
package bolt_test
import (
"encoding/binary"
"fmt"
"io/ioutil"
"log"
"math/rand"
"net/http"
"net/http/httptest"
"os"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt"
)
// Set this to see how the counts are actually updated.
const verbose = false
// Counter updates a counter in Bolt for every URL path requested.
type counter struct {
db *bolt.DB
}
func (c counter) ServeHTTP(rw http.ResponseWriter, req *http.Request) {
// Communicates the new count from a successful database
// transaction.
var result uint64
increment := func(tx *bolt.Tx) error {
b, err := tx.CreateBucketIfNotExists([]byte("hits"))
if err != nil {
return err
}
key := []byte(req.URL.String())
// Decode handles key not found for us.
count := decode(b.Get(key)) + 1
b.Put(key, encode(count))
// All good, communicate new count.
result = count
return nil
}
if err := c.db.Batch(increment); err != nil {
http.Error(rw, err.Error(), 500)
return
}
if verbose {
log.Printf("server: %s: %d", req.URL.String(), result)
}
rw.Header().Set("Content-Type", "application/octet-stream")
fmt.Fprintf(rw, "%d\n", result)
}
func client(id int, base string, paths []string) error {
// Process paths in random order.
rng := rand.New(rand.NewSource(int64(id)))
permutation := rng.Perm(len(paths))
for i := range paths {
path := paths[permutation[i]]
resp, err := http.Get(base + path)
if err != nil {
return err
}
defer resp.Body.Close()
buf, err := ioutil.ReadAll(resp.Body)
if err != nil {
return err
}
if verbose {
log.Printf("client: %s: %s", path, buf)
}
}
return nil
}
func ExampleDB_Batch() {
// Open the database.
db, _ := bolt.Open(tempfile(), 0666, nil)
defer os.Remove(db.Path())
defer db.Close()
// Start our web server
count := counter{db}
srv := httptest.NewServer(count)
defer srv.Close()
// Decrease the batch size to make things more interesting.
db.MaxBatchSize = 3
// Get every path multiple times concurrently.
const clients = 10
paths := []string{
"/foo",
"/bar",
"/baz",
"/quux",
"/thud",
"/xyzzy",
}
errors := make(chan error, clients)
for i := 0; i < clients; i++ {
go func(id int) {
errors <- client(id, srv.URL, paths)
}(i)
}
// Check all responses to make sure there's no error.
for i := 0; i < clients; i++ {
if err := <-errors; err != nil {
fmt.Printf("client error: %v", err)
return
}
}
// Check the final result
db.View(func(tx *bolt.Tx) error {
b := tx.Bucket([]byte("hits"))
c := b.Cursor()
for k, v := c.First(); k != nil; k, v = c.Next() {
fmt.Printf("hits to %s: %d\n", k, decode(v))
}
return nil
})
// Output:
// hits to /bar: 10
// hits to /baz: 10
// hits to /foo: 10
// hits to /quux: 10
// hits to /thud: 10
// hits to /xyzzy: 10
}
// encode marshals a counter.
func encode(n uint64) []byte {
buf := make([]byte, 8)
binary.BigEndian.PutUint64(buf, n)
return buf
}
// decode unmarshals a counter. Nil buffers are decoded as 0.
func decode(buf []byte) uint64 {
if buf == nil {
return 0
}
return binary.BigEndian.Uint64(buf)
}

View File

@ -0,0 +1,167 @@
package bolt_test
import (
"testing"
"time"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt"
)
// Ensure two functions can perform updates in a single batch.
func TestDB_Batch(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.MustCreateBucket([]byte("widgets"))
// Iterate over multiple updates in separate goroutines.
n := 2
ch := make(chan error)
for i := 0; i < n; i++ {
go func(i int) {
ch <- db.Batch(func(tx *bolt.Tx) error {
return tx.Bucket([]byte("widgets")).Put(u64tob(uint64(i)), []byte{})
})
}(i)
}
// Check all responses to make sure there's no error.
for i := 0; i < n; i++ {
if err := <-ch; err != nil {
t.Fatal(err)
}
}
// Ensure data is correct.
db.MustView(func(tx *bolt.Tx) error {
b := tx.Bucket([]byte("widgets"))
for i := 0; i < n; i++ {
if v := b.Get(u64tob(uint64(i))); v == nil {
t.Errorf("key not found: %d", i)
}
}
return nil
})
}
func TestDB_Batch_Panic(t *testing.T) {
db := NewTestDB()
defer db.Close()
var sentinel int
var bork = &sentinel
var problem interface{}
var err error
// Execute a function inside a batch that panics.
func() {
defer func() {
if p := recover(); p != nil {
problem = p
}
}()
err = db.Batch(func(tx *bolt.Tx) error {
panic(bork)
})
}()
// Verify there is no error.
if g, e := err, error(nil); g != e {
t.Fatalf("wrong error: %v != %v", g, e)
}
// Verify the panic was captured.
if g, e := problem, bork; g != e {
t.Fatalf("wrong error: %v != %v", g, e)
}
}
func TestDB_BatchFull(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.MustCreateBucket([]byte("widgets"))
const size = 3
// buffered so we never leak goroutines
ch := make(chan error, size)
put := func(i int) {
ch <- db.Batch(func(tx *bolt.Tx) error {
return tx.Bucket([]byte("widgets")).Put(u64tob(uint64(i)), []byte{})
})
}
db.MaxBatchSize = size
// high enough to never trigger here
db.MaxBatchDelay = 1 * time.Hour
go put(1)
go put(2)
// Give the batch a chance to exhibit bugs.
time.Sleep(10 * time.Millisecond)
// not triggered yet
select {
case <-ch:
t.Fatalf("batch triggered too early")
default:
}
go put(3)
// Check all responses to make sure there's no error.
for i := 0; i < size; i++ {
if err := <-ch; err != nil {
t.Fatal(err)
}
}
// Ensure data is correct.
db.MustView(func(tx *bolt.Tx) error {
b := tx.Bucket([]byte("widgets"))
for i := 1; i <= size; i++ {
if v := b.Get(u64tob(uint64(i))); v == nil {
t.Errorf("key not found: %d", i)
}
}
return nil
})
}
func TestDB_BatchTime(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.MustCreateBucket([]byte("widgets"))
const size = 1
// buffered so we never leak goroutines
ch := make(chan error, size)
put := func(i int) {
ch <- db.Batch(func(tx *bolt.Tx) error {
return tx.Bucket([]byte("widgets")).Put(u64tob(uint64(i)), []byte{})
})
}
db.MaxBatchSize = 1000
db.MaxBatchDelay = 0
go put(1)
// Batch must trigger by time alone.
// Check all responses to make sure there's no error.
for i := 0; i < size; i++ {
if err := <-ch; err != nil {
t.Fatal(err)
}
}
// Ensure data is correct.
db.MustView(func(tx *bolt.Tx) error {
b := tx.Bucket([]byte("widgets"))
for i := 1; i <= size; i++ {
if v := b.Get(u64tob(uint64(i))); v == nil {
t.Errorf("key not found: %d", i)
}
}
return nil
})
}

View File

@ -1,9 +0,0 @@
// +build arm64
package bolt
// maxMapSize represents the largest mmap size supported by Bolt.
const maxMapSize = 0xFFFFFFFFFFFF // 256TB
// maxAllocSize is the size used when creating array pointers.
const maxAllocSize = 0x7FFFFFFF

View File

@ -4,6 +4,8 @@ import (
"syscall"
)
var odirect = syscall.O_DIRECT
// fdatasync flushes written data to a file descriptor.
func fdatasync(db *DB) error {
return syscall.Fdatasync(int(db.file.Fd()))

View File

@ -11,6 +11,8 @@ const (
msInvalidate // invalidate cached data
)
var odirect int
func msync(db *DB) error {
_, _, errno := syscall.Syscall(syscall.SYS_MSYNC, uintptr(unsafe.Pointer(db.data)), uintptr(db.datasz), msInvalidate)
if errno != 0 {

View File

@ -1,9 +0,0 @@
// +build ppc64le
package bolt
// maxMapSize represents the largest mmap size supported by Bolt.
const maxMapSize = 0xFFFFFFFFFFFF // 256TB
// maxAllocSize is the size used when creating array pointers.
const maxAllocSize = 0x7FFFFFFF

View File

@ -1,9 +0,0 @@
// +build s390x
package bolt
// maxMapSize represents the largest mmap size supported by Bolt.
const maxMapSize = 0xFFFFFFFFFFFF // 256TB
// maxAllocSize is the size used when creating array pointers.
const maxAllocSize = 0x7FFFFFFF

View File

@ -0,0 +1,36 @@
package bolt_test
import (
"fmt"
"path/filepath"
"reflect"
"runtime"
"testing"
)
// assert fails the test if the condition is false.
func assert(tb testing.TB, condition bool, msg string, v ...interface{}) {
if !condition {
_, file, line, _ := runtime.Caller(1)
fmt.Printf("\033[31m%s:%d: "+msg+"\033[39m\n\n", append([]interface{}{filepath.Base(file), line}, v...)...)
tb.FailNow()
}
}
// ok fails the test if an err is not nil.
func ok(tb testing.TB, err error) {
if err != nil {
_, file, line, _ := runtime.Caller(1)
fmt.Printf("\033[31m%s:%d: unexpected error: %s\033[39m\n\n", filepath.Base(file), line, err.Error())
tb.FailNow()
}
}
// equals fails the test if exp is not equal to act.
func equals(tb testing.TB, exp, act interface{}) {
if !reflect.DeepEqual(exp, act) {
_, file, line, _ := runtime.Caller(1)
fmt.Printf("\033[31m%s:%d:\n\n\texp: %#v\n\n\tgot: %#v\033[39m\n\n", filepath.Base(file), line, exp, act)
tb.FailNow()
}
}

View File

@ -11,7 +11,7 @@ import (
)
// flock acquires an advisory lock on a file descriptor.
func flock(db *DB, mode os.FileMode, exclusive bool, timeout time.Duration) error {
func flock(f *os.File, exclusive bool, timeout time.Duration) error {
var t time.Time
for {
// If we're beyond our timeout then return an error.
@ -27,7 +27,7 @@ func flock(db *DB, mode os.FileMode, exclusive bool, timeout time.Duration) erro
}
// Otherwise attempt to obtain an exclusive lock.
err := syscall.Flock(int(db.file.Fd()), flag|syscall.LOCK_NB)
err := syscall.Flock(int(f.Fd()), flag|syscall.LOCK_NB)
if err == nil {
return nil
} else if err != syscall.EWOULDBLOCK {
@ -40,14 +40,25 @@ func flock(db *DB, mode os.FileMode, exclusive bool, timeout time.Duration) erro
}
// funlock releases an advisory lock on a file descriptor.
func funlock(db *DB) error {
return syscall.Flock(int(db.file.Fd()), syscall.LOCK_UN)
func funlock(f *os.File) error {
return syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
}
// mmap memory maps a DB's data file.
func mmap(db *DB, sz int) error {
// Truncate and fsync to ensure file size metadata is flushed.
// https://github.com/boltdb/bolt/issues/284
if !db.NoGrowSync && !db.readOnly {
if err := db.file.Truncate(int64(sz)); err != nil {
return fmt.Errorf("file resize error: %s", err)
}
if err := db.file.Sync(); err != nil {
return fmt.Errorf("file sync error: %s", err)
}
}
// Map the data file to memory.
b, err := syscall.Mmap(int(db.file.Fd()), 0, sz, syscall.PROT_READ, syscall.MAP_SHARED|db.MmapFlags)
b, err := syscall.Mmap(int(db.file.Fd()), 0, sz, syscall.PROT_READ, syscall.MAP_SHARED)
if err != nil {
return err
}

View File

@ -2,16 +2,15 @@ package bolt
import (
"fmt"
"github.com/coreos/etcd/Godeps/_workspace/src/golang.org/x/sys/unix"
"os"
"syscall"
"time"
"unsafe"
"github.com/coreos/etcd/Godeps/_workspace/src/golang.org/x/sys/unix"
)
// flock acquires an advisory lock on a file descriptor.
func flock(db *DB, mode os.FileMode, exclusive bool, timeout time.Duration) error {
func flock(f *os.File, exclusive bool, timeout time.Duration) error {
var t time.Time
for {
// If we're beyond our timeout then return an error.
@ -32,7 +31,7 @@ func flock(db *DB, mode os.FileMode, exclusive bool, timeout time.Duration) erro
} else {
lock.Type = syscall.F_RDLCK
}
err := syscall.FcntlFlock(db.file.Fd(), syscall.F_SETLK, &lock)
err := syscall.FcntlFlock(f.Fd(), syscall.F_SETLK, &lock)
if err == nil {
return nil
} else if err != syscall.EAGAIN {
@ -45,19 +44,30 @@ func flock(db *DB, mode os.FileMode, exclusive bool, timeout time.Duration) erro
}
// funlock releases an advisory lock on a file descriptor.
func funlock(db *DB) error {
func funlock(f *os.File) error {
var lock syscall.Flock_t
lock.Start = 0
lock.Len = 0
lock.Type = syscall.F_UNLCK
lock.Whence = 0
return syscall.FcntlFlock(uintptr(db.file.Fd()), syscall.F_SETLK, &lock)
return syscall.FcntlFlock(uintptr(f.Fd()), syscall.F_SETLK, &lock)
}
// mmap memory maps a DB's data file.
func mmap(db *DB, sz int) error {
// Truncate and fsync to ensure file size metadata is flushed.
// https://github.com/boltdb/bolt/issues/284
if !db.NoGrowSync && !db.readOnly {
if err := db.file.Truncate(int64(sz)); err != nil {
return fmt.Errorf("file resize error: %s", err)
}
if err := db.file.Sync(); err != nil {
return fmt.Errorf("file sync error: %s", err)
}
}
// Map the data file to memory.
b, err := unix.Mmap(int(db.file.Fd()), 0, sz, syscall.PROT_READ, syscall.MAP_SHARED|db.MmapFlags)
b, err := unix.Mmap(int(db.file.Fd()), 0, sz, syscall.PROT_READ, syscall.MAP_SHARED)
if err != nil {
return err
}

View File

@ -8,39 +8,7 @@ import (
"unsafe"
)
// LockFileEx code derived from golang build filemutex_windows.go @ v1.5.1
var (
modkernel32 = syscall.NewLazyDLL("kernel32.dll")
procLockFileEx = modkernel32.NewProc("LockFileEx")
procUnlockFileEx = modkernel32.NewProc("UnlockFileEx")
)
const (
lockExt = ".lock"
// see https://msdn.microsoft.com/en-us/library/windows/desktop/aa365203(v=vs.85).aspx
flagLockExclusive = 2
flagLockFailImmediately = 1
// see https://msdn.microsoft.com/en-us/library/windows/desktop/ms681382(v=vs.85).aspx
errLockViolation syscall.Errno = 0x21
)
func lockFileEx(h syscall.Handle, flags, reserved, locklow, lockhigh uint32, ol *syscall.Overlapped) (err error) {
r, _, err := procLockFileEx.Call(uintptr(h), uintptr(flags), uintptr(reserved), uintptr(locklow), uintptr(lockhigh), uintptr(unsafe.Pointer(ol)))
if r == 0 {
return err
}
return nil
}
func unlockFileEx(h syscall.Handle, reserved, locklow, lockhigh uint32, ol *syscall.Overlapped) (err error) {
r, _, err := procUnlockFileEx.Call(uintptr(h), uintptr(reserved), uintptr(locklow), uintptr(lockhigh), uintptr(unsafe.Pointer(ol)), 0)
if r == 0 {
return err
}
return nil
}
var odirect int
// fdatasync flushes written data to a file descriptor.
func fdatasync(db *DB) error {
@ -48,49 +16,13 @@ func fdatasync(db *DB) error {
}
// flock acquires an advisory lock on a file descriptor.
func flock(db *DB, mode os.FileMode, exclusive bool, timeout time.Duration) error {
// Create a separate lock file on windows because a process
// cannot share an exclusive lock on the same file. This is
// needed during Tx.WriteTo().
f, err := os.OpenFile(db.path+lockExt, os.O_CREATE, mode)
if err != nil {
return err
}
db.lockfile = f
var t time.Time
for {
// If we're beyond our timeout then return an error.
// This can only occur after we've attempted a flock once.
if t.IsZero() {
t = time.Now()
} else if timeout > 0 && time.Since(t) > timeout {
return ErrTimeout
}
var flag uint32 = flagLockFailImmediately
if exclusive {
flag |= flagLockExclusive
}
err := lockFileEx(syscall.Handle(db.lockfile.Fd()), flag, 0, 1, 0, &syscall.Overlapped{})
if err == nil {
return nil
} else if err != errLockViolation {
return err
}
// Wait for a bit and try again.
time.Sleep(50 * time.Millisecond)
}
func flock(f *os.File, _ bool, _ time.Duration) error {
return nil
}
// funlock releases an advisory lock on a file descriptor.
func funlock(db *DB) error {
err := unlockFileEx(syscall.Handle(db.lockfile.Fd()), 0, 1, 0, &syscall.Overlapped{})
db.lockfile.Close()
os.Remove(db.path+lockExt)
return err
func funlock(f *os.File) error {
return nil
}
// mmap memory maps a DB's data file.

View File

@ -2,6 +2,8 @@
package bolt
var odirect int
// fdatasync flushes written data to a file descriptor.
func fdatasync(db *DB) error {
return db.file.Sync()

View File

@ -11,7 +11,7 @@ const (
MaxKeySize = 32768
// MaxValueSize is the maximum length of a value, in bytes.
MaxValueSize = (1 << 31) - 2
MaxValueSize = 4294967295
)
const (
@ -99,7 +99,6 @@ func (b *Bucket) Cursor() *Cursor {
// Bucket retrieves a nested bucket by name.
// Returns nil if the bucket does not exist.
// The bucket instance is only valid for the lifetime of the transaction.
func (b *Bucket) Bucket(name []byte) *Bucket {
if b.buckets != nil {
if child := b.buckets[string(name)]; child != nil {
@ -149,7 +148,6 @@ func (b *Bucket) openBucket(value []byte) *Bucket {
// CreateBucket creates a new bucket at the given key and returns the new bucket.
// Returns an error if the key already exists, if the bucket name is blank, or if the bucket name is too long.
// The bucket instance is only valid for the lifetime of the transaction.
func (b *Bucket) CreateBucket(key []byte) (*Bucket, error) {
if b.tx.db == nil {
return nil, ErrTxClosed
@ -194,7 +192,6 @@ func (b *Bucket) CreateBucket(key []byte) (*Bucket, error) {
// CreateBucketIfNotExists creates a new bucket if it doesn't already exist and returns a reference to it.
// Returns an error if the bucket name is blank, or if the bucket name is too long.
// The bucket instance is only valid for the lifetime of the transaction.
func (b *Bucket) CreateBucketIfNotExists(key []byte) (*Bucket, error) {
child, err := b.CreateBucket(key)
if err == ErrBucketExists {
@ -273,7 +270,6 @@ func (b *Bucket) Get(key []byte) []byte {
// Put sets the value for a key in the bucket.
// If the key exist then its previous value will be overwritten.
// Supplied value must remain valid for the life of the transaction.
// Returns an error if the bucket was created from a read-only transaction, if the key is blank, if the key is too large, or if the value is too large.
func (b *Bucket) Put(key []byte, value []byte) error {
if b.tx.db == nil {
@ -350,8 +346,7 @@ func (b *Bucket) NextSequence() (uint64, error) {
// ForEach executes a function for each key/value pair in a bucket.
// If the provided function returns an error then the iteration is stopped and
// the error is returned to the caller. The provided function must not modify
// the bucket; this will result in undefined behavior.
// the error is returned to the caller.
func (b *Bucket) ForEach(fn func(k, v []byte) error) error {
if b.tx.db == nil {
return ErrTxClosed

File diff suppressed because it is too large Load Diff

View File

@ -825,10 +825,7 @@ func (cmd *StatsCommand) Run(args ...string) error {
fmt.Fprintln(cmd.Stdout, "Bucket statistics")
fmt.Fprintf(cmd.Stdout, "\tTotal number of buckets: %d\n", s.BucketN)
percentage = 0
if s.BucketN != 0 {
percentage = int(float32(s.InlineBucketN) * 100.0 / float32(s.BucketN))
}
percentage = int(float32(s.InlineBucketN) * 100.0 / float32(s.BucketN))
fmt.Fprintf(cmd.Stdout, "\tTotal number on inlined buckets: %d (%d%%)\n", s.InlineBucketN, percentage)
percentage = 0
if s.LeafInuse != 0 {

View File

@ -0,0 +1,145 @@
package main_test
import (
"bytes"
"io/ioutil"
"os"
"strconv"
"testing"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt/cmd/bolt"
)
// Ensure the "info" command can print information about a database.
func TestInfoCommand_Run(t *testing.T) {
db := MustOpen(0666, nil)
db.DB.Close()
defer db.Close()
// Run the info command.
m := NewMain()
if err := m.Run("info", db.Path); err != nil {
t.Fatal(err)
}
}
// Ensure the "stats" command can execute correctly.
func TestStatsCommand_Run(t *testing.T) {
// Ignore
if os.Getpagesize() != 4096 {
t.Skip("system does not use 4KB page size")
}
db := MustOpen(0666, nil)
defer db.Close()
if err := db.Update(func(tx *bolt.Tx) error {
// Create "foo" bucket.
b, err := tx.CreateBucket([]byte("foo"))
if err != nil {
return err
}
for i := 0; i < 10; i++ {
if err := b.Put([]byte(strconv.Itoa(i)), []byte(strconv.Itoa(i))); err != nil {
return err
}
}
// Create "bar" bucket.
b, err = tx.CreateBucket([]byte("bar"))
if err != nil {
return err
}
for i := 0; i < 100; i++ {
if err := b.Put([]byte(strconv.Itoa(i)), []byte(strconv.Itoa(i))); err != nil {
return err
}
}
// Create "baz" bucket.
b, err = tx.CreateBucket([]byte("baz"))
if err != nil {
return err
}
if err := b.Put([]byte("key"), []byte("value")); err != nil {
return err
}
return nil
}); err != nil {
t.Fatal(err)
}
db.DB.Close()
// Generate expected result.
exp := "Aggregate statistics for 3 buckets\n\n" +
"Page count statistics\n" +
"\tNumber of logical branch pages: 0\n" +
"\tNumber of physical branch overflow pages: 0\n" +
"\tNumber of logical leaf pages: 1\n" +
"\tNumber of physical leaf overflow pages: 0\n" +
"Tree statistics\n" +
"\tNumber of keys/value pairs: 111\n" +
"\tNumber of levels in B+tree: 1\n" +
"Page size utilization\n" +
"\tBytes allocated for physical branch pages: 0\n" +
"\tBytes actually used for branch data: 0 (0%)\n" +
"\tBytes allocated for physical leaf pages: 4096\n" +
"\tBytes actually used for leaf data: 1996 (48%)\n" +
"Bucket statistics\n" +
"\tTotal number of buckets: 3\n" +
"\tTotal number on inlined buckets: 2 (66%)\n" +
"\tBytes used for inlined buckets: 236 (11%)\n"
// Run the command.
m := NewMain()
if err := m.Run("stats", db.Path); err != nil {
t.Fatal(err)
} else if m.Stdout.String() != exp {
t.Fatalf("unexpected stdout:\n\n%s", m.Stdout.String())
}
}
// Main represents a test wrapper for main.Main that records output.
type Main struct {
*main.Main
Stdin bytes.Buffer
Stdout bytes.Buffer
Stderr bytes.Buffer
}
// NewMain returns a new instance of Main.
func NewMain() *Main {
m := &Main{Main: main.NewMain()}
m.Main.Stdin = &m.Stdin
m.Main.Stdout = &m.Stdout
m.Main.Stderr = &m.Stderr
return m
}
// MustOpen creates a Bolt database in a temporary location.
func MustOpen(mode os.FileMode, options *bolt.Options) *DB {
// Create temporary path.
f, _ := ioutil.TempFile("", "bolt-")
f.Close()
os.Remove(f.Name())
db, err := bolt.Open(f.Name(), mode, options)
if err != nil {
panic(err.Error())
}
return &DB{DB: db, Path: f.Name()}
}
// DB is a test wrapper for bolt.DB.
type DB struct {
*bolt.DB
Path string
}
// Close closes and removes the database.
func (db *DB) Close() error {
defer os.Remove(db.Path)
return db.DB.Close()
}

View File

@ -34,13 +34,6 @@ func (c *Cursor) First() (key []byte, value []byte) {
p, n := c.bucket.pageNode(c.bucket.root)
c.stack = append(c.stack, elemRef{page: p, node: n, index: 0})
c.first()
// If we land on an empty page then move to the next value.
// https://github.com/boltdb/bolt/issues/450
if c.stack[len(c.stack)-1].count() == 0 {
c.next()
}
k, v, flags := c.keyValue()
if (flags & uint32(bucketLeafFlag)) != 0 {
return k, nil
@ -216,37 +209,28 @@ func (c *Cursor) last() {
// next moves to the next leaf element and returns the key and value.
// If the cursor is at the last leaf element then it stays there and returns nil.
func (c *Cursor) next() (key []byte, value []byte, flags uint32) {
for {
// Attempt to move over one element until we're successful.
// Move up the stack as we hit the end of each page in our stack.
var i int
for i = len(c.stack) - 1; i >= 0; i-- {
elem := &c.stack[i]
if elem.index < elem.count()-1 {
elem.index++
break
}
// Attempt to move over one element until we're successful.
// Move up the stack as we hit the end of each page in our stack.
var i int
for i = len(c.stack) - 1; i >= 0; i-- {
elem := &c.stack[i]
if elem.index < elem.count()-1 {
elem.index++
break
}
// If we've hit the root page then stop and return. This will leave the
// cursor on the last element of the last page.
if i == -1 {
return nil, nil, 0
}
// Otherwise start from where we left off in the stack and find the
// first element of the first leaf page.
c.stack = c.stack[:i+1]
c.first()
// If this is an empty page then restart and move back up the stack.
// https://github.com/boltdb/bolt/issues/450
if c.stack[len(c.stack)-1].count() == 0 {
continue
}
return c.keyValue()
}
// If we've hit the root page then stop and return. This will leave the
// cursor on the last element of the last page.
if i == -1 {
return nil, nil, 0
}
// Otherwise start from where we left off in the stack and find the
// first element of the first leaf page.
c.stack = c.stack[:i+1]
c.first()
return c.keyValue()
}
// search recursively performs a binary search against a given page/node until it finds a given key.

View File

@ -0,0 +1,511 @@
package bolt_test
import (
"bytes"
"encoding/binary"
"fmt"
"os"
"sort"
"testing"
"testing/quick"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt"
)
// Ensure that a cursor can return a reference to the bucket that created it.
func TestCursor_Bucket(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
b, _ := tx.CreateBucket([]byte("widgets"))
c := b.Cursor()
equals(t, b, c.Bucket())
return nil
})
}
// Ensure that a Tx cursor can seek to the appropriate keys.
func TestCursor_Seek(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
b, err := tx.CreateBucket([]byte("widgets"))
ok(t, err)
ok(t, b.Put([]byte("foo"), []byte("0001")))
ok(t, b.Put([]byte("bar"), []byte("0002")))
ok(t, b.Put([]byte("baz"), []byte("0003")))
_, err = b.CreateBucket([]byte("bkt"))
ok(t, err)
return nil
})
db.View(func(tx *bolt.Tx) error {
c := tx.Bucket([]byte("widgets")).Cursor()
// Exact match should go to the key.
k, v := c.Seek([]byte("bar"))
equals(t, []byte("bar"), k)
equals(t, []byte("0002"), v)
// Inexact match should go to the next key.
k, v = c.Seek([]byte("bas"))
equals(t, []byte("baz"), k)
equals(t, []byte("0003"), v)
// Low key should go to the first key.
k, v = c.Seek([]byte(""))
equals(t, []byte("bar"), k)
equals(t, []byte("0002"), v)
// High key should return no key.
k, v = c.Seek([]byte("zzz"))
assert(t, k == nil, "")
assert(t, v == nil, "")
// Buckets should return their key but no value.
k, v = c.Seek([]byte("bkt"))
equals(t, []byte("bkt"), k)
assert(t, v == nil, "")
return nil
})
}
func TestCursor_Delete(t *testing.T) {
db := NewTestDB()
defer db.Close()
var count = 1000
// Insert every other key between 0 and $count.
db.Update(func(tx *bolt.Tx) error {
b, _ := tx.CreateBucket([]byte("widgets"))
for i := 0; i < count; i += 1 {
k := make([]byte, 8)
binary.BigEndian.PutUint64(k, uint64(i))
b.Put(k, make([]byte, 100))
}
b.CreateBucket([]byte("sub"))
return nil
})
db.Update(func(tx *bolt.Tx) error {
c := tx.Bucket([]byte("widgets")).Cursor()
bound := make([]byte, 8)
binary.BigEndian.PutUint64(bound, uint64(count/2))
for key, _ := c.First(); bytes.Compare(key, bound) < 0; key, _ = c.Next() {
if err := c.Delete(); err != nil {
return err
}
}
c.Seek([]byte("sub"))
err := c.Delete()
equals(t, err, bolt.ErrIncompatibleValue)
return nil
})
db.View(func(tx *bolt.Tx) error {
b := tx.Bucket([]byte("widgets"))
equals(t, b.Stats().KeyN, count/2+1)
return nil
})
}
// Ensure that a Tx cursor can seek to the appropriate keys when there are a
// large number of keys. This test also checks that seek will always move
// forward to the next key.
//
// Related: https://github.com/boltdb/bolt/pull/187
func TestCursor_Seek_Large(t *testing.T) {
db := NewTestDB()
defer db.Close()
var count = 10000
// Insert every other key between 0 and $count.
db.Update(func(tx *bolt.Tx) error {
b, _ := tx.CreateBucket([]byte("widgets"))
for i := 0; i < count; i += 100 {
for j := i; j < i+100; j += 2 {
k := make([]byte, 8)
binary.BigEndian.PutUint64(k, uint64(j))
b.Put(k, make([]byte, 100))
}
}
return nil
})
db.View(func(tx *bolt.Tx) error {
c := tx.Bucket([]byte("widgets")).Cursor()
for i := 0; i < count; i++ {
seek := make([]byte, 8)
binary.BigEndian.PutUint64(seek, uint64(i))
k, _ := c.Seek(seek)
// The last seek is beyond the end of the the range so
// it should return nil.
if i == count-1 {
assert(t, k == nil, "")
continue
}
// Otherwise we should seek to the exact key or the next key.
num := binary.BigEndian.Uint64(k)
if i%2 == 0 {
equals(t, uint64(i), num)
} else {
equals(t, uint64(i+1), num)
}
}
return nil
})
}
// Ensure that a cursor can iterate over an empty bucket without error.
func TestCursor_EmptyBucket(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
_, err := tx.CreateBucket([]byte("widgets"))
return err
})
db.View(func(tx *bolt.Tx) error {
c := tx.Bucket([]byte("widgets")).Cursor()
k, v := c.First()
assert(t, k == nil, "")
assert(t, v == nil, "")
return nil
})
}
// Ensure that a Tx cursor can reverse iterate over an empty bucket without error.
func TestCursor_EmptyBucketReverse(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
_, err := tx.CreateBucket([]byte("widgets"))
return err
})
db.View(func(tx *bolt.Tx) error {
c := tx.Bucket([]byte("widgets")).Cursor()
k, v := c.Last()
assert(t, k == nil, "")
assert(t, v == nil, "")
return nil
})
}
// Ensure that a Tx cursor can iterate over a single root with a couple elements.
func TestCursor_Iterate_Leaf(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
tx.Bucket([]byte("widgets")).Put([]byte("baz"), []byte{})
tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte{0})
tx.Bucket([]byte("widgets")).Put([]byte("bar"), []byte{1})
return nil
})
tx, _ := db.Begin(false)
c := tx.Bucket([]byte("widgets")).Cursor()
k, v := c.First()
equals(t, string(k), "bar")
equals(t, v, []byte{1})
k, v = c.Next()
equals(t, string(k), "baz")
equals(t, v, []byte{})
k, v = c.Next()
equals(t, string(k), "foo")
equals(t, v, []byte{0})
k, v = c.Next()
assert(t, k == nil, "")
assert(t, v == nil, "")
k, v = c.Next()
assert(t, k == nil, "")
assert(t, v == nil, "")
tx.Rollback()
}
// Ensure that a Tx cursor can iterate in reverse over a single root with a couple elements.
func TestCursor_LeafRootReverse(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
tx.Bucket([]byte("widgets")).Put([]byte("baz"), []byte{})
tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte{0})
tx.Bucket([]byte("widgets")).Put([]byte("bar"), []byte{1})
return nil
})
tx, _ := db.Begin(false)
c := tx.Bucket([]byte("widgets")).Cursor()
k, v := c.Last()
equals(t, string(k), "foo")
equals(t, v, []byte{0})
k, v = c.Prev()
equals(t, string(k), "baz")
equals(t, v, []byte{})
k, v = c.Prev()
equals(t, string(k), "bar")
equals(t, v, []byte{1})
k, v = c.Prev()
assert(t, k == nil, "")
assert(t, v == nil, "")
k, v = c.Prev()
assert(t, k == nil, "")
assert(t, v == nil, "")
tx.Rollback()
}
// Ensure that a Tx cursor can restart from the beginning.
func TestCursor_Restart(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
tx.Bucket([]byte("widgets")).Put([]byte("bar"), []byte{})
tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte{})
return nil
})
tx, _ := db.Begin(false)
c := tx.Bucket([]byte("widgets")).Cursor()
k, _ := c.First()
equals(t, string(k), "bar")
k, _ = c.Next()
equals(t, string(k), "foo")
k, _ = c.First()
equals(t, string(k), "bar")
k, _ = c.Next()
equals(t, string(k), "foo")
tx.Rollback()
}
// Ensure that a Tx can iterate over all elements in a bucket.
func TestCursor_QuickCheck(t *testing.T) {
f := func(items testdata) bool {
db := NewTestDB()
defer db.Close()
// Bulk insert all values.
tx, _ := db.Begin(true)
tx.CreateBucket([]byte("widgets"))
b := tx.Bucket([]byte("widgets"))
for _, item := range items {
ok(t, b.Put(item.Key, item.Value))
}
ok(t, tx.Commit())
// Sort test data.
sort.Sort(items)
// Iterate over all items and check consistency.
var index = 0
tx, _ = db.Begin(false)
c := tx.Bucket([]byte("widgets")).Cursor()
for k, v := c.First(); k != nil && index < len(items); k, v = c.Next() {
equals(t, k, items[index].Key)
equals(t, v, items[index].Value)
index++
}
equals(t, len(items), index)
tx.Rollback()
return true
}
if err := quick.Check(f, qconfig()); err != nil {
t.Error(err)
}
}
// Ensure that a transaction can iterate over all elements in a bucket in reverse.
func TestCursor_QuickCheck_Reverse(t *testing.T) {
f := func(items testdata) bool {
db := NewTestDB()
defer db.Close()
// Bulk insert all values.
tx, _ := db.Begin(true)
tx.CreateBucket([]byte("widgets"))
b := tx.Bucket([]byte("widgets"))
for _, item := range items {
ok(t, b.Put(item.Key, item.Value))
}
ok(t, tx.Commit())
// Sort test data.
sort.Sort(revtestdata(items))
// Iterate over all items and check consistency.
var index = 0
tx, _ = db.Begin(false)
c := tx.Bucket([]byte("widgets")).Cursor()
for k, v := c.Last(); k != nil && index < len(items); k, v = c.Prev() {
equals(t, k, items[index].Key)
equals(t, v, items[index].Value)
index++
}
equals(t, len(items), index)
tx.Rollback()
return true
}
if err := quick.Check(f, qconfig()); err != nil {
t.Error(err)
}
}
// Ensure that a Tx cursor can iterate over subbuckets.
func TestCursor_QuickCheck_BucketsOnly(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
b, err := tx.CreateBucket([]byte("widgets"))
ok(t, err)
_, err = b.CreateBucket([]byte("foo"))
ok(t, err)
_, err = b.CreateBucket([]byte("bar"))
ok(t, err)
_, err = b.CreateBucket([]byte("baz"))
ok(t, err)
return nil
})
db.View(func(tx *bolt.Tx) error {
var names []string
c := tx.Bucket([]byte("widgets")).Cursor()
for k, v := c.First(); k != nil; k, v = c.Next() {
names = append(names, string(k))
assert(t, v == nil, "")
}
equals(t, names, []string{"bar", "baz", "foo"})
return nil
})
}
// Ensure that a Tx cursor can reverse iterate over subbuckets.
func TestCursor_QuickCheck_BucketsOnly_Reverse(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
b, err := tx.CreateBucket([]byte("widgets"))
ok(t, err)
_, err = b.CreateBucket([]byte("foo"))
ok(t, err)
_, err = b.CreateBucket([]byte("bar"))
ok(t, err)
_, err = b.CreateBucket([]byte("baz"))
ok(t, err)
return nil
})
db.View(func(tx *bolt.Tx) error {
var names []string
c := tx.Bucket([]byte("widgets")).Cursor()
for k, v := c.Last(); k != nil; k, v = c.Prev() {
names = append(names, string(k))
assert(t, v == nil, "")
}
equals(t, names, []string{"foo", "baz", "bar"})
return nil
})
}
func ExampleCursor() {
// Open the database.
db, _ := bolt.Open(tempfile(), 0666, nil)
defer os.Remove(db.Path())
defer db.Close()
// Start a read-write transaction.
db.Update(func(tx *bolt.Tx) error {
// Create a new bucket.
tx.CreateBucket([]byte("animals"))
// Insert data into a bucket.
b := tx.Bucket([]byte("animals"))
b.Put([]byte("dog"), []byte("fun"))
b.Put([]byte("cat"), []byte("lame"))
b.Put([]byte("liger"), []byte("awesome"))
// Create a cursor for iteration.
c := b.Cursor()
// Iterate over items in sorted key order. This starts from the
// first key/value pair and updates the k/v variables to the
// next key/value on each iteration.
//
// The loop finishes at the end of the cursor when a nil key is returned.
for k, v := c.First(); k != nil; k, v = c.Next() {
fmt.Printf("A %s is %s.\n", k, v)
}
return nil
})
// Output:
// A cat is lame.
// A dog is fun.
// A liger is awesome.
}
func ExampleCursor_reverse() {
// Open the database.
db, _ := bolt.Open(tempfile(), 0666, nil)
defer os.Remove(db.Path())
defer db.Close()
// Start a read-write transaction.
db.Update(func(tx *bolt.Tx) error {
// Create a new bucket.
tx.CreateBucket([]byte("animals"))
// Insert data into a bucket.
b := tx.Bucket([]byte("animals"))
b.Put([]byte("dog"), []byte("fun"))
b.Put([]byte("cat"), []byte("lame"))
b.Put([]byte("liger"), []byte("awesome"))
// Create a cursor for iteration.
c := b.Cursor()
// Iterate over items in reverse sorted key order. This starts
// from the last key/value pair and updates the k/v variables to
// the previous key/value on each iteration.
//
// The loop finishes at the beginning of the cursor when a nil key
// is returned.
for k, v := c.Last(); k != nil; k, v = c.Prev() {
fmt.Printf("A %s is %s.\n", k, v)
}
return nil
})
// Output:
// A liger is awesome.
// A dog is fun.
// A cat is lame.
}

View File

@ -1,10 +1,8 @@
package bolt
import (
"errors"
"fmt"
"hash/fnv"
"log"
"os"
"runtime"
"runtime/debug"
@ -26,14 +24,13 @@ const magic uint32 = 0xED0CDAED
// IgnoreNoSync specifies whether the NoSync field of a DB is ignored when
// syncing changes to a file. This is required as some operating systems,
// such as OpenBSD, do not have a unified buffer cache (UBC) and writes
// must be synchronized using the msync(2) syscall.
// must be synchronzied using the msync(2) syscall.
const IgnoreNoSync = runtime.GOOS == "openbsd"
// Default values if not set in a DB instance.
const (
DefaultMaxBatchSize int = 1000
DefaultMaxBatchDelay = 10 * time.Millisecond
DefaultAllocSize = 16 * 1024 * 1024
)
// DB represents a collection of buckets persisted to a file on disk.
@ -66,10 +63,6 @@ type DB struct {
// https://github.com/boltdb/bolt/issues/284
NoGrowSync bool
// If you want to read the entire database fast, you can set MmapFlag to
// syscall.MAP_POPULATE on Linux 2.6.23+ for sequential read-ahead.
MmapFlags int
// MaxBatchSize is the maximum size of a batch. Default value is
// copied from DefaultMaxBatchSize in Open.
//
@ -86,18 +79,11 @@ type DB struct {
// Do not change concurrently with calls to Batch.
MaxBatchDelay time.Duration
// AllocSize is the amount of space allocated when the database
// needs to create new pages. This is done to amortize the cost
// of truncate() and fsync() when growing the data file.
AllocSize int
path string
file *os.File
lockfile *os.File // windows only
dataref []byte // mmap'ed readonly, write throws SEGV
data *[maxMapSize]byte
datasz int
filesz int // current on disk file size
meta0 *meta
meta1 *meta
pageSize int
@ -150,12 +136,10 @@ func Open(path string, mode os.FileMode, options *Options) (*DB, error) {
options = DefaultOptions
}
db.NoGrowSync = options.NoGrowSync
db.MmapFlags = options.MmapFlags
// Set default values for later DB operations.
db.MaxBatchSize = DefaultMaxBatchSize
db.MaxBatchDelay = DefaultMaxBatchDelay
db.AllocSize = DefaultAllocSize
flag := os.O_RDWR
if options.ReadOnly {
@ -178,7 +162,7 @@ func Open(path string, mode os.FileMode, options *Options) (*DB, error) {
// if !options.ReadOnly.
// The database file is locked using the shared lock (more than one process may
// hold a lock at the same time) otherwise (options.ReadOnly is set).
if err := flock(db, mode, !db.readOnly, options.Timeout); err != nil {
if err := flock(db.file, !db.readOnly, options.Timeout); err != nil {
_ = db.close()
return nil, err
}
@ -188,7 +172,7 @@ func Open(path string, mode os.FileMode, options *Options) (*DB, error) {
// Initialize the database if it doesn't exist.
if info, err := db.file.Stat(); err != nil {
return nil, err
return nil, fmt.Errorf("stat error: %s", err)
} else if info.Size() == 0 {
// Initialize new files with meta pages.
if err := db.init(); err != nil {
@ -200,14 +184,14 @@ func Open(path string, mode os.FileMode, options *Options) (*DB, error) {
if _, err := db.file.ReadAt(buf[:], 0); err == nil {
m := db.pageInBuffer(buf[:], 0).meta()
if err := m.validate(); err != nil {
return nil, err
return nil, fmt.Errorf("meta0 error: %s", err)
}
db.pageSize = int(m.pageSize)
}
}
// Memory map the data file.
if err := db.mmap(options.InitialMmapSize); err != nil {
if err := db.mmap(0); err != nil {
_ = db.close()
return nil, err
}
@ -264,10 +248,10 @@ func (db *DB) mmap(minsz int) error {
// Validate the meta pages.
if err := db.meta0.validate(); err != nil {
return err
return fmt.Errorf("meta0 error: %s", err)
}
if err := db.meta1.validate(); err != nil {
return err
return fmt.Errorf("meta1 error: %s", err)
}
return nil
@ -282,7 +266,7 @@ func (db *DB) munmap() error {
}
// mmapSize determines the appropriate size for the mmap given the current size
// of the database. The minimum size is 32KB and doubles until it reaches 1GB.
// of the database. The minimum size is 1MB and doubles until it reaches 1GB.
// Returns an error if the new mmap size is greater than the max allowed.
func (db *DB) mmapSize(size int) (int, error) {
// Double the size from 32KB until 1GB.
@ -380,10 +364,6 @@ func (db *DB) Close() error {
}
func (db *DB) close() error {
if !db.opened {
return nil
}
db.opened = false
db.freelist = nil
@ -402,9 +382,7 @@ func (db *DB) close() error {
// No need to unlock read-only file.
if !db.readOnly {
// Unlock the file.
if err := funlock(db); err != nil {
log.Printf("bolt.Close(): funlock error: %s", err)
}
_ = funlock(db.file)
}
// Close the file descriptor.
@ -423,15 +401,11 @@ func (db *DB) close() error {
// will cause the calls to block and be serialized until the current write
// transaction finishes.
//
// Transactions should not be dependent on one another. Opening a read
// Transactions should not be depedent on one another. Opening a read
// transaction and a write transaction in the same goroutine can cause the
// writer to deadlock because the database periodically needs to re-mmap itself
// as it grows and it cannot do that while a read transaction is open.
//
// If a long running read transaction (for example, a snapshot transaction) is
// needed, you might want to set DB.InitialMmapSize to a large enough value
// to avoid potential blocking of write transaction.
//
// IMPORTANT: You must close read-only transactions after you are finished or
// else the database will not reclaim old pages.
func (db *DB) Begin(writable bool) (*Tx, error) {
@ -615,136 +589,6 @@ func (db *DB) View(fn func(*Tx) error) error {
return nil
}
// Batch calls fn as part of a batch. It behaves similar to Update,
// except:
//
// 1. concurrent Batch calls can be combined into a single Bolt
// transaction.
//
// 2. the function passed to Batch may be called multiple times,
// regardless of whether it returns error or not.
//
// This means that Batch function side effects must be idempotent and
// take permanent effect only after a successful return is seen in
// caller.
//
// The maximum batch size and delay can be adjusted with DB.MaxBatchSize
// and DB.MaxBatchDelay, respectively.
//
// Batch is only useful when there are multiple goroutines calling it.
func (db *DB) Batch(fn func(*Tx) error) error {
errCh := make(chan error, 1)
db.batchMu.Lock()
if (db.batch == nil) || (db.batch != nil && len(db.batch.calls) >= db.MaxBatchSize) {
// There is no existing batch, or the existing batch is full; start a new one.
db.batch = &batch{
db: db,
}
db.batch.timer = time.AfterFunc(db.MaxBatchDelay, db.batch.trigger)
}
db.batch.calls = append(db.batch.calls, call{fn: fn, err: errCh})
if len(db.batch.calls) >= db.MaxBatchSize {
// wake up batch, it's ready to run
go db.batch.trigger()
}
db.batchMu.Unlock()
err := <-errCh
if err == trySolo {
err = db.Update(fn)
}
return err
}
type call struct {
fn func(*Tx) error
err chan<- error
}
type batch struct {
db *DB
timer *time.Timer
start sync.Once
calls []call
}
// trigger runs the batch if it hasn't already been run.
func (b *batch) trigger() {
b.start.Do(b.run)
}
// run performs the transactions in the batch and communicates results
// back to DB.Batch.
func (b *batch) run() {
b.db.batchMu.Lock()
b.timer.Stop()
// Make sure no new work is added to this batch, but don't break
// other batches.
if b.db.batch == b {
b.db.batch = nil
}
b.db.batchMu.Unlock()
retry:
for len(b.calls) > 0 {
var failIdx = -1
err := b.db.Update(func(tx *Tx) error {
for i, c := range b.calls {
if err := safelyCall(c.fn, tx); err != nil {
failIdx = i
return err
}
}
return nil
})
if failIdx >= 0 {
// take the failing transaction out of the batch. it's
// safe to shorten b.calls here because db.batch no longer
// points to us, and we hold the mutex anyway.
c := b.calls[failIdx]
b.calls[failIdx], b.calls = b.calls[len(b.calls)-1], b.calls[:len(b.calls)-1]
// tell the submitter re-run it solo, continue with the rest of the batch
c.err <- trySolo
continue retry
}
// pass success, or bolt internal errors, to all callers
for _, c := range b.calls {
if c.err != nil {
c.err <- err
}
}
break retry
}
}
// trySolo is a special sentinel error value used for signaling that a
// transaction function should be re-run. It should never be seen by
// callers.
var trySolo = errors.New("batch function returned an error and should be re-run solo")
type panicked struct {
reason interface{}
}
func (p panicked) Error() string {
if err, ok := p.reason.(error); ok {
return err.Error()
}
return fmt.Sprintf("panic: %v", p.reason)
}
func safelyCall(fn func(*Tx) error, tx *Tx) (err error) {
defer func() {
if p := recover(); p != nil {
err = panicked{p}
}
}()
return fn(tx)
}
// Sync executes fdatasync() against the database file handle.
//
// This is not necessary under normal operation, however, if you use NoSync
@ -811,38 +655,6 @@ func (db *DB) allocate(count int) (*page, error) {
return p, nil
}
// grow grows the size of the database to the given sz.
func (db *DB) grow(sz int) error {
// Ignore if the new size is less than available file size.
if sz <= db.filesz {
return nil
}
// If the data is smaller than the alloc size then only allocate what's needed.
// Once it goes over the allocation size then allocate in chunks.
if db.datasz < db.AllocSize {
sz = db.datasz
} else {
sz += db.AllocSize
}
// Truncate and fsync to ensure file size metadata is flushed.
// https://github.com/boltdb/bolt/issues/284
if !db.NoGrowSync && !db.readOnly {
if runtime.GOOS != "windows" {
if err := db.file.Truncate(int64(sz)); err != nil {
return fmt.Errorf("file resize error: %s", err)
}
}
if err := db.file.Sync(); err != nil {
return fmt.Errorf("file sync error: %s", err)
}
}
db.filesz = sz
return nil
}
func (db *DB) IsReadOnly() bool {
return db.readOnly
}
@ -860,19 +672,6 @@ type Options struct {
// Open database in read-only mode. Uses flock(..., LOCK_SH |LOCK_NB) to
// grab a shared lock (UNIX).
ReadOnly bool
// Sets the DB.MmapFlags flag before memory mapping the file.
MmapFlags int
// InitialMmapSize is the initial mmap size of the database
// in bytes. Read transactions won't block write transaction
// if the InitialMmapSize is large enough to hold database mmap
// size. (See DB.Begin for more information)
//
// If <=0, the initial map size is 0.
// If initialMmapSize is smaller than the previous database size,
// it takes no effect.
InitialMmapSize int
}
// DefaultOptions represent the options used if nil options are passed into Open().

913
Godeps/_workspace/src/github.com/boltdb/bolt/db_test.go generated vendored Normal file
View File

@ -0,0 +1,913 @@
package bolt_test
import (
"encoding/binary"
"errors"
"flag"
"fmt"
"io/ioutil"
"os"
"regexp"
"runtime"
"sort"
"strings"
"testing"
"time"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt"
)
var statsFlag = flag.Bool("stats", false, "show performance stats")
// Ensure that opening a database with a bad path returns an error.
func TestOpen_BadPath(t *testing.T) {
db, err := bolt.Open("", 0666, nil)
assert(t, err != nil, "err: %s", err)
assert(t, db == nil, "")
}
// Ensure that a database can be opened without error.
func TestOpen(t *testing.T) {
path := tempfile()
defer os.Remove(path)
db, err := bolt.Open(path, 0666, nil)
assert(t, db != nil, "")
ok(t, err)
equals(t, db.Path(), path)
ok(t, db.Close())
}
// Ensure that opening an already open database file will timeout.
func TestOpen_Timeout(t *testing.T) {
if runtime.GOOS == "windows" {
t.Skip("timeout not supported on windows")
}
if runtime.GOOS == "solaris" {
t.Skip("solaris fcntl locks don't support intra-process locking")
}
path := tempfile()
defer os.Remove(path)
// Open a data file.
db0, err := bolt.Open(path, 0666, nil)
assert(t, db0 != nil, "")
ok(t, err)
// Attempt to open the database again.
start := time.Now()
db1, err := bolt.Open(path, 0666, &bolt.Options{Timeout: 100 * time.Millisecond})
assert(t, db1 == nil, "")
equals(t, bolt.ErrTimeout, err)
assert(t, time.Since(start) > 100*time.Millisecond, "")
db0.Close()
}
// Ensure that opening an already open database file will wait until its closed.
func TestOpen_Wait(t *testing.T) {
if runtime.GOOS == "windows" {
t.Skip("timeout not supported on windows")
}
if runtime.GOOS == "solaris" {
t.Skip("solaris fcntl locks don't support intra-process locking")
}
path := tempfile()
defer os.Remove(path)
// Open a data file.
db0, err := bolt.Open(path, 0666, nil)
assert(t, db0 != nil, "")
ok(t, err)
// Close it in just a bit.
time.AfterFunc(100*time.Millisecond, func() { db0.Close() })
// Attempt to open the database again.
start := time.Now()
db1, err := bolt.Open(path, 0666, &bolt.Options{Timeout: 200 * time.Millisecond})
assert(t, db1 != nil, "")
ok(t, err)
assert(t, time.Since(start) > 100*time.Millisecond, "")
}
// Ensure that opening a database does not increase its size.
// https://github.com/boltdb/bolt/issues/291
func TestOpen_Size(t *testing.T) {
// Open a data file.
db := NewTestDB()
path := db.Path()
defer db.Close()
// Insert until we get above the minimum 4MB size.
ok(t, db.Update(func(tx *bolt.Tx) error {
b, _ := tx.CreateBucketIfNotExists([]byte("data"))
for i := 0; i < 10000; i++ {
ok(t, b.Put([]byte(fmt.Sprintf("%04d", i)), make([]byte, 1000)))
}
return nil
}))
// Close database and grab the size.
db.DB.Close()
sz := fileSize(path)
if sz == 0 {
t.Fatalf("unexpected new file size: %d", sz)
}
// Reopen database, update, and check size again.
db0, err := bolt.Open(path, 0666, nil)
ok(t, err)
ok(t, db0.Update(func(tx *bolt.Tx) error { return tx.Bucket([]byte("data")).Put([]byte{0}, []byte{0}) }))
ok(t, db0.Close())
newSz := fileSize(path)
if newSz == 0 {
t.Fatalf("unexpected new file size: %d", newSz)
}
// Compare the original size with the new size.
if sz != newSz {
t.Fatalf("unexpected file growth: %d => %d", sz, newSz)
}
}
// Ensure that opening a database beyond the max step size does not increase its size.
// https://github.com/boltdb/bolt/issues/303
func TestOpen_Size_Large(t *testing.T) {
if testing.Short() {
t.Skip("short mode")
}
// Open a data file.
db := NewTestDB()
path := db.Path()
defer db.Close()
// Insert until we get above the minimum 4MB size.
var index uint64
for i := 0; i < 10000; i++ {
ok(t, db.Update(func(tx *bolt.Tx) error {
b, _ := tx.CreateBucketIfNotExists([]byte("data"))
for j := 0; j < 1000; j++ {
ok(t, b.Put(u64tob(index), make([]byte, 50)))
index++
}
return nil
}))
}
// Close database and grab the size.
db.DB.Close()
sz := fileSize(path)
if sz == 0 {
t.Fatalf("unexpected new file size: %d", sz)
} else if sz < (1 << 30) {
t.Fatalf("expected larger initial size: %d", sz)
}
// Reopen database, update, and check size again.
db0, err := bolt.Open(path, 0666, nil)
ok(t, err)
ok(t, db0.Update(func(tx *bolt.Tx) error { return tx.Bucket([]byte("data")).Put([]byte{0}, []byte{0}) }))
ok(t, db0.Close())
newSz := fileSize(path)
if newSz == 0 {
t.Fatalf("unexpected new file size: %d", newSz)
}
// Compare the original size with the new size.
if sz != newSz {
t.Fatalf("unexpected file growth: %d => %d", sz, newSz)
}
}
// Ensure that a re-opened database is consistent.
func TestOpen_Check(t *testing.T) {
path := tempfile()
defer os.Remove(path)
db, err := bolt.Open(path, 0666, nil)
ok(t, err)
ok(t, db.View(func(tx *bolt.Tx) error { return <-tx.Check() }))
db.Close()
db, err = bolt.Open(path, 0666, nil)
ok(t, err)
ok(t, db.View(func(tx *bolt.Tx) error { return <-tx.Check() }))
db.Close()
}
// Ensure that the database returns an error if the file handle cannot be open.
func TestDB_Open_FileError(t *testing.T) {
path := tempfile()
defer os.Remove(path)
_, err := bolt.Open(path+"/youre-not-my-real-parent", 0666, nil)
assert(t, err.(*os.PathError) != nil, "")
equals(t, path+"/youre-not-my-real-parent", err.(*os.PathError).Path)
equals(t, "open", err.(*os.PathError).Op)
}
// Ensure that write errors to the meta file handler during initialization are returned.
func TestDB_Open_MetaInitWriteError(t *testing.T) {
t.Skip("pending")
}
// Ensure that a database that is too small returns an error.
func TestDB_Open_FileTooSmall(t *testing.T) {
path := tempfile()
defer os.Remove(path)
db, err := bolt.Open(path, 0666, nil)
ok(t, err)
db.Close()
// corrupt the database
ok(t, os.Truncate(path, int64(os.Getpagesize())))
db, err = bolt.Open(path, 0666, nil)
equals(t, errors.New("file size too small"), err)
}
// Ensure that a database can be opened in read-only mode by multiple processes
// and that a database can not be opened in read-write mode and in read-only
// mode at the same time.
func TestOpen_ReadOnly(t *testing.T) {
if runtime.GOOS == "solaris" {
t.Skip("solaris fcntl locks don't support intra-process locking")
}
bucket, key, value := []byte(`bucket`), []byte(`key`), []byte(`value`)
path := tempfile()
defer os.Remove(path)
// Open in read-write mode.
db, err := bolt.Open(path, 0666, nil)
ok(t, db.Update(func(tx *bolt.Tx) error {
b, err := tx.CreateBucket(bucket)
if err != nil {
return err
}
return b.Put(key, value)
}))
assert(t, db != nil, "")
assert(t, !db.IsReadOnly(), "")
ok(t, err)
ok(t, db.Close())
// Open in read-only mode.
db0, err := bolt.Open(path, 0666, &bolt.Options{ReadOnly: true})
ok(t, err)
defer db0.Close()
// Opening in read-write mode should return an error.
_, err = bolt.Open(path, 0666, &bolt.Options{Timeout: time.Millisecond * 100})
assert(t, err != nil, "")
// And again (in read-only mode).
db1, err := bolt.Open(path, 0666, &bolt.Options{ReadOnly: true})
ok(t, err)
defer db1.Close()
// Verify both read-only databases are accessible.
for _, db := range []*bolt.DB{db0, db1} {
// Verify is is in read only mode indeed.
assert(t, db.IsReadOnly(), "")
// Read-only databases should not allow updates.
assert(t,
bolt.ErrDatabaseReadOnly == db.Update(func(*bolt.Tx) error {
panic(`should never get here`)
}),
"")
// Read-only databases should not allow beginning writable txns.
_, err = db.Begin(true)
assert(t, bolt.ErrDatabaseReadOnly == err, "")
// Verify the data.
ok(t, db.View(func(tx *bolt.Tx) error {
b := tx.Bucket(bucket)
if b == nil {
return fmt.Errorf("expected bucket `%s`", string(bucket))
}
got := string(b.Get(key))
expected := string(value)
if got != expected {
return fmt.Errorf("expected `%s`, got `%s`", expected, got)
}
return nil
}))
}
}
// TODO(benbjohnson): Test corruption at every byte of the first two pages.
// Ensure that a database cannot open a transaction when it's not open.
func TestDB_Begin_DatabaseNotOpen(t *testing.T) {
var db bolt.DB
tx, err := db.Begin(false)
assert(t, tx == nil, "")
equals(t, err, bolt.ErrDatabaseNotOpen)
}
// Ensure that a read-write transaction can be retrieved.
func TestDB_BeginRW(t *testing.T) {
db := NewTestDB()
defer db.Close()
tx, err := db.Begin(true)
assert(t, tx != nil, "")
ok(t, err)
assert(t, tx.DB() == db.DB, "")
equals(t, tx.Writable(), true)
ok(t, tx.Commit())
}
// Ensure that opening a transaction while the DB is closed returns an error.
func TestDB_BeginRW_Closed(t *testing.T) {
var db bolt.DB
tx, err := db.Begin(true)
equals(t, err, bolt.ErrDatabaseNotOpen)
assert(t, tx == nil, "")
}
func TestDB_Close_PendingTx_RW(t *testing.T) { testDB_Close_PendingTx(t, true) }
func TestDB_Close_PendingTx_RO(t *testing.T) { testDB_Close_PendingTx(t, false) }
// Ensure that a database cannot close while transactions are open.
func testDB_Close_PendingTx(t *testing.T, writable bool) {
db := NewTestDB()
defer db.Close()
// Start transaction.
tx, err := db.Begin(true)
if err != nil {
t.Fatal(err)
}
// Open update in separate goroutine.
done := make(chan struct{})
go func() {
db.Close()
close(done)
}()
// Ensure database hasn't closed.
time.Sleep(100 * time.Millisecond)
select {
case <-done:
t.Fatal("database closed too early")
default:
}
// Commit transaction.
if err := tx.Commit(); err != nil {
t.Fatal(err)
}
// Ensure database closed now.
time.Sleep(100 * time.Millisecond)
select {
case <-done:
default:
t.Fatal("database did not close")
}
}
// Ensure a database can provide a transactional block.
func TestDB_Update(t *testing.T) {
db := NewTestDB()
defer db.Close()
err := db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
b := tx.Bucket([]byte("widgets"))
b.Put([]byte("foo"), []byte("bar"))
b.Put([]byte("baz"), []byte("bat"))
b.Delete([]byte("foo"))
return nil
})
ok(t, err)
err = db.View(func(tx *bolt.Tx) error {
assert(t, tx.Bucket([]byte("widgets")).Get([]byte("foo")) == nil, "")
equals(t, []byte("bat"), tx.Bucket([]byte("widgets")).Get([]byte("baz")))
return nil
})
ok(t, err)
}
// Ensure a closed database returns an error while running a transaction block
func TestDB_Update_Closed(t *testing.T) {
var db bolt.DB
err := db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
return nil
})
equals(t, err, bolt.ErrDatabaseNotOpen)
}
// Ensure a panic occurs while trying to commit a managed transaction.
func TestDB_Update_ManualCommit(t *testing.T) {
db := NewTestDB()
defer db.Close()
var ok bool
db.Update(func(tx *bolt.Tx) error {
func() {
defer func() {
if r := recover(); r != nil {
ok = true
}
}()
tx.Commit()
}()
return nil
})
assert(t, ok, "expected panic")
}
// Ensure a panic occurs while trying to rollback a managed transaction.
func TestDB_Update_ManualRollback(t *testing.T) {
db := NewTestDB()
defer db.Close()
var ok bool
db.Update(func(tx *bolt.Tx) error {
func() {
defer func() {
if r := recover(); r != nil {
ok = true
}
}()
tx.Rollback()
}()
return nil
})
assert(t, ok, "expected panic")
}
// Ensure a panic occurs while trying to commit a managed transaction.
func TestDB_View_ManualCommit(t *testing.T) {
db := NewTestDB()
defer db.Close()
var ok bool
db.Update(func(tx *bolt.Tx) error {
func() {
defer func() {
if r := recover(); r != nil {
ok = true
}
}()
tx.Commit()
}()
return nil
})
assert(t, ok, "expected panic")
}
// Ensure a panic occurs while trying to rollback a managed transaction.
func TestDB_View_ManualRollback(t *testing.T) {
db := NewTestDB()
defer db.Close()
var ok bool
db.Update(func(tx *bolt.Tx) error {
func() {
defer func() {
if r := recover(); r != nil {
ok = true
}
}()
tx.Rollback()
}()
return nil
})
assert(t, ok, "expected panic")
}
// Ensure a write transaction that panics does not hold open locks.
func TestDB_Update_Panic(t *testing.T) {
db := NewTestDB()
defer db.Close()
func() {
defer func() {
if r := recover(); r != nil {
t.Log("recover: update", r)
}
}()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
panic("omg")
})
}()
// Verify we can update again.
err := db.Update(func(tx *bolt.Tx) error {
_, err := tx.CreateBucket([]byte("widgets"))
return err
})
ok(t, err)
// Verify that our change persisted.
err = db.Update(func(tx *bolt.Tx) error {
assert(t, tx.Bucket([]byte("widgets")) != nil, "")
return nil
})
}
// Ensure a database can return an error through a read-only transactional block.
func TestDB_View_Error(t *testing.T) {
db := NewTestDB()
defer db.Close()
err := db.View(func(tx *bolt.Tx) error {
return errors.New("xxx")
})
equals(t, errors.New("xxx"), err)
}
// Ensure a read transaction that panics does not hold open locks.
func TestDB_View_Panic(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
return nil
})
func() {
defer func() {
if r := recover(); r != nil {
t.Log("recover: view", r)
}
}()
db.View(func(tx *bolt.Tx) error {
assert(t, tx.Bucket([]byte("widgets")) != nil, "")
panic("omg")
})
}()
// Verify that we can still use read transactions.
db.View(func(tx *bolt.Tx) error {
assert(t, tx.Bucket([]byte("widgets")) != nil, "")
return nil
})
}
// Ensure that an error is returned when a database write fails.
func TestDB_Commit_WriteFail(t *testing.T) {
t.Skip("pending") // TODO(benbjohnson)
}
// Ensure that DB stats can be returned.
func TestDB_Stats(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
_, err := tx.CreateBucket([]byte("widgets"))
return err
})
stats := db.Stats()
equals(t, 2, stats.TxStats.PageCount)
equals(t, 0, stats.FreePageN)
equals(t, 2, stats.PendingPageN)
}
// Ensure that database pages are in expected order and type.
func TestDB_Consistency(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
_, err := tx.CreateBucket([]byte("widgets"))
return err
})
for i := 0; i < 10; i++ {
db.Update(func(tx *bolt.Tx) error {
ok(t, tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte("bar")))
return nil
})
}
db.Update(func(tx *bolt.Tx) error {
p, _ := tx.Page(0)
assert(t, p != nil, "")
equals(t, "meta", p.Type)
p, _ = tx.Page(1)
assert(t, p != nil, "")
equals(t, "meta", p.Type)
p, _ = tx.Page(2)
assert(t, p != nil, "")
equals(t, "free", p.Type)
p, _ = tx.Page(3)
assert(t, p != nil, "")
equals(t, "free", p.Type)
p, _ = tx.Page(4)
assert(t, p != nil, "")
equals(t, "leaf", p.Type)
p, _ = tx.Page(5)
assert(t, p != nil, "")
equals(t, "freelist", p.Type)
p, _ = tx.Page(6)
assert(t, p == nil, "")
return nil
})
}
// Ensure that DB stats can be substracted from one another.
func TestDBStats_Sub(t *testing.T) {
var a, b bolt.Stats
a.TxStats.PageCount = 3
a.FreePageN = 4
b.TxStats.PageCount = 10
b.FreePageN = 14
diff := b.Sub(&a)
equals(t, 7, diff.TxStats.PageCount)
// free page stats are copied from the receiver and not subtracted
equals(t, 14, diff.FreePageN)
}
func ExampleDB_Update() {
// Open the database.
db, _ := bolt.Open(tempfile(), 0666, nil)
defer os.Remove(db.Path())
defer db.Close()
// Execute several commands within a write transaction.
err := db.Update(func(tx *bolt.Tx) error {
b, err := tx.CreateBucket([]byte("widgets"))
if err != nil {
return err
}
if err := b.Put([]byte("foo"), []byte("bar")); err != nil {
return err
}
return nil
})
// If our transactional block didn't return an error then our data is saved.
if err == nil {
db.View(func(tx *bolt.Tx) error {
value := tx.Bucket([]byte("widgets")).Get([]byte("foo"))
fmt.Printf("The value of 'foo' is: %s\n", value)
return nil
})
}
// Output:
// The value of 'foo' is: bar
}
func ExampleDB_View() {
// Open the database.
db, _ := bolt.Open(tempfile(), 0666, nil)
defer os.Remove(db.Path())
defer db.Close()
// Insert data into a bucket.
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("people"))
b := tx.Bucket([]byte("people"))
b.Put([]byte("john"), []byte("doe"))
b.Put([]byte("susy"), []byte("que"))
return nil
})
// Access data from within a read-only transactional block.
db.View(func(tx *bolt.Tx) error {
v := tx.Bucket([]byte("people")).Get([]byte("john"))
fmt.Printf("John's last name is %s.\n", v)
return nil
})
// Output:
// John's last name is doe.
}
func ExampleDB_Begin_ReadOnly() {
// Open the database.
db, _ := bolt.Open(tempfile(), 0666, nil)
defer os.Remove(db.Path())
defer db.Close()
// Create a bucket.
db.Update(func(tx *bolt.Tx) error {
_, err := tx.CreateBucket([]byte("widgets"))
return err
})
// Create several keys in a transaction.
tx, _ := db.Begin(true)
b := tx.Bucket([]byte("widgets"))
b.Put([]byte("john"), []byte("blue"))
b.Put([]byte("abby"), []byte("red"))
b.Put([]byte("zephyr"), []byte("purple"))
tx.Commit()
// Iterate over the values in sorted key order.
tx, _ = db.Begin(false)
c := tx.Bucket([]byte("widgets")).Cursor()
for k, v := c.First(); k != nil; k, v = c.Next() {
fmt.Printf("%s likes %s\n", k, v)
}
tx.Rollback()
// Output:
// abby likes red
// john likes blue
// zephyr likes purple
}
// TestDB represents a wrapper around a Bolt DB to handle temporary file
// creation and automatic cleanup on close.
type TestDB struct {
*bolt.DB
}
// NewTestDB returns a new instance of TestDB.
func NewTestDB() *TestDB {
db, err := bolt.Open(tempfile(), 0666, nil)
if err != nil {
panic("cannot open db: " + err.Error())
}
return &TestDB{db}
}
// MustView executes a read-only function. Panic on error.
func (db *TestDB) MustView(fn func(tx *bolt.Tx) error) {
if err := db.DB.View(func(tx *bolt.Tx) error {
return fn(tx)
}); err != nil {
panic(err.Error())
}
}
// MustUpdate executes a read-write function. Panic on error.
func (db *TestDB) MustUpdate(fn func(tx *bolt.Tx) error) {
if err := db.DB.View(func(tx *bolt.Tx) error {
return fn(tx)
}); err != nil {
panic(err.Error())
}
}
// MustCreateBucket creates a new bucket. Panic on error.
func (db *TestDB) MustCreateBucket(name []byte) {
if err := db.Update(func(tx *bolt.Tx) error {
_, err := tx.CreateBucket([]byte(name))
return err
}); err != nil {
panic(err.Error())
}
}
// Close closes the database and deletes the underlying file.
func (db *TestDB) Close() {
// Log statistics.
if *statsFlag {
db.PrintStats()
}
// Check database consistency after every test.
db.MustCheck()
// Close database and remove file.
defer os.Remove(db.Path())
db.DB.Close()
}
// PrintStats prints the database stats
func (db *TestDB) PrintStats() {
var stats = db.Stats()
fmt.Printf("[db] %-20s %-20s %-20s\n",
fmt.Sprintf("pg(%d/%d)", stats.TxStats.PageCount, stats.TxStats.PageAlloc),
fmt.Sprintf("cur(%d)", stats.TxStats.CursorCount),
fmt.Sprintf("node(%d/%d)", stats.TxStats.NodeCount, stats.TxStats.NodeDeref),
)
fmt.Printf(" %-20s %-20s %-20s\n",
fmt.Sprintf("rebal(%d/%v)", stats.TxStats.Rebalance, truncDuration(stats.TxStats.RebalanceTime)),
fmt.Sprintf("spill(%d/%v)", stats.TxStats.Spill, truncDuration(stats.TxStats.SpillTime)),
fmt.Sprintf("w(%d/%v)", stats.TxStats.Write, truncDuration(stats.TxStats.WriteTime)),
)
}
// MustCheck runs a consistency check on the database and panics if any errors are found.
func (db *TestDB) MustCheck() {
db.Update(func(tx *bolt.Tx) error {
// Collect all the errors.
var errors []error
for err := range tx.Check() {
errors = append(errors, err)
if len(errors) > 10 {
break
}
}
// If errors occurred, copy the DB and print the errors.
if len(errors) > 0 {
var path = tempfile()
tx.CopyFile(path, 0600)
// Print errors.
fmt.Print("\n\n")
fmt.Printf("consistency check failed (%d errors)\n", len(errors))
for _, err := range errors {
fmt.Println(err)
}
fmt.Println("")
fmt.Println("db saved to:")
fmt.Println(path)
fmt.Print("\n\n")
os.Exit(-1)
}
return nil
})
}
// CopyTempFile copies a database to a temporary file.
func (db *TestDB) CopyTempFile() {
path := tempfile()
db.View(func(tx *bolt.Tx) error { return tx.CopyFile(path, 0600) })
fmt.Println("db copied to: ", path)
}
// tempfile returns a temporary file path.
func tempfile() string {
f, _ := ioutil.TempFile("", "bolt-")
f.Close()
os.Remove(f.Name())
return f.Name()
}
// mustContainKeys checks that a bucket contains a given set of keys.
func mustContainKeys(b *bolt.Bucket, m map[string]string) {
found := make(map[string]string)
b.ForEach(func(k, _ []byte) error {
found[string(k)] = ""
return nil
})
// Check for keys found in bucket that shouldn't be there.
var keys []string
for k, _ := range found {
if _, ok := m[string(k)]; !ok {
keys = append(keys, k)
}
}
if len(keys) > 0 {
sort.Strings(keys)
panic(fmt.Sprintf("keys found(%d): %s", len(keys), strings.Join(keys, ",")))
}
// Check for keys not found in bucket that should be there.
for k, _ := range m {
if _, ok := found[string(k)]; !ok {
keys = append(keys, k)
}
}
if len(keys) > 0 {
sort.Strings(keys)
panic(fmt.Sprintf("keys not found(%d): %s", len(keys), strings.Join(keys, ",")))
}
}
func trunc(b []byte, length int) []byte {
if length < len(b) {
return b[:length]
}
return b
}
func truncDuration(d time.Duration) string {
return regexp.MustCompile(`^(\d+)(\.\d+)`).ReplaceAllString(d.String(), "$1")
}
func fileSize(path string) int64 {
fi, err := os.Stat(path)
if err != nil {
return 0
}
return fi.Size()
}
func warn(v ...interface{}) { fmt.Fprintln(os.Stderr, v...) }
func warnf(msg string, v ...interface{}) { fmt.Fprintf(os.Stderr, msg+"\n", v...) }
// u64tob converts a uint64 into an 8-byte slice.
func u64tob(v uint64) []byte {
b := make([]byte, 8)
binary.BigEndian.PutUint64(b, v)
return b
}
// btou64 converts an 8-byte slice into an uint64.
func btou64(b []byte) uint64 { return binary.BigEndian.Uint64(b) }

View File

@ -0,0 +1,156 @@
package bolt
import (
"math/rand"
"reflect"
"sort"
"testing"
"unsafe"
)
// Ensure that a page is added to a transaction's freelist.
func TestFreelist_free(t *testing.T) {
f := newFreelist()
f.free(100, &page{id: 12})
if !reflect.DeepEqual([]pgid{12}, f.pending[100]) {
t.Fatalf("exp=%v; got=%v", []pgid{12}, f.pending[100])
}
}
// Ensure that a page and its overflow is added to a transaction's freelist.
func TestFreelist_free_overflow(t *testing.T) {
f := newFreelist()
f.free(100, &page{id: 12, overflow: 3})
if exp := []pgid{12, 13, 14, 15}; !reflect.DeepEqual(exp, f.pending[100]) {
t.Fatalf("exp=%v; got=%v", exp, f.pending[100])
}
}
// Ensure that a transaction's free pages can be released.
func TestFreelist_release(t *testing.T) {
f := newFreelist()
f.free(100, &page{id: 12, overflow: 1})
f.free(100, &page{id: 9})
f.free(102, &page{id: 39})
f.release(100)
f.release(101)
if exp := []pgid{9, 12, 13}; !reflect.DeepEqual(exp, f.ids) {
t.Fatalf("exp=%v; got=%v", exp, f.ids)
}
f.release(102)
if exp := []pgid{9, 12, 13, 39}; !reflect.DeepEqual(exp, f.ids) {
t.Fatalf("exp=%v; got=%v", exp, f.ids)
}
}
// Ensure that a freelist can find contiguous blocks of pages.
func TestFreelist_allocate(t *testing.T) {
f := &freelist{ids: []pgid{3, 4, 5, 6, 7, 9, 12, 13, 18}}
if id := int(f.allocate(3)); id != 3 {
t.Fatalf("exp=3; got=%v", id)
}
if id := int(f.allocate(1)); id != 6 {
t.Fatalf("exp=6; got=%v", id)
}
if id := int(f.allocate(3)); id != 0 {
t.Fatalf("exp=0; got=%v", id)
}
if id := int(f.allocate(2)); id != 12 {
t.Fatalf("exp=12; got=%v", id)
}
if id := int(f.allocate(1)); id != 7 {
t.Fatalf("exp=7; got=%v", id)
}
if id := int(f.allocate(0)); id != 0 {
t.Fatalf("exp=0; got=%v", id)
}
if id := int(f.allocate(0)); id != 0 {
t.Fatalf("exp=0; got=%v", id)
}
if exp := []pgid{9, 18}; !reflect.DeepEqual(exp, f.ids) {
t.Fatalf("exp=%v; got=%v", exp, f.ids)
}
if id := int(f.allocate(1)); id != 9 {
t.Fatalf("exp=9; got=%v", id)
}
if id := int(f.allocate(1)); id != 18 {
t.Fatalf("exp=18; got=%v", id)
}
if id := int(f.allocate(1)); id != 0 {
t.Fatalf("exp=0; got=%v", id)
}
if exp := []pgid{}; !reflect.DeepEqual(exp, f.ids) {
t.Fatalf("exp=%v; got=%v", exp, f.ids)
}
}
// Ensure that a freelist can deserialize from a freelist page.
func TestFreelist_read(t *testing.T) {
// Create a page.
var buf [4096]byte
page := (*page)(unsafe.Pointer(&buf[0]))
page.flags = freelistPageFlag
page.count = 2
// Insert 2 page ids.
ids := (*[3]pgid)(unsafe.Pointer(&page.ptr))
ids[0] = 23
ids[1] = 50
// Deserialize page into a freelist.
f := newFreelist()
f.read(page)
// Ensure that there are two page ids in the freelist.
if exp := []pgid{23, 50}; !reflect.DeepEqual(exp, f.ids) {
t.Fatalf("exp=%v; got=%v", exp, f.ids)
}
}
// Ensure that a freelist can serialize into a freelist page.
func TestFreelist_write(t *testing.T) {
// Create a freelist and write it to a page.
var buf [4096]byte
f := &freelist{ids: []pgid{12, 39}, pending: make(map[txid][]pgid)}
f.pending[100] = []pgid{28, 11}
f.pending[101] = []pgid{3}
p := (*page)(unsafe.Pointer(&buf[0]))
f.write(p)
// Read the page back out.
f2 := newFreelist()
f2.read(p)
// Ensure that the freelist is correct.
// All pages should be present and in reverse order.
if exp := []pgid{3, 11, 12, 28, 39}; !reflect.DeepEqual(exp, f2.ids) {
t.Fatalf("exp=%v; got=%v", exp, f2.ids)
}
}
func Benchmark_FreelistRelease10K(b *testing.B) { benchmark_FreelistRelease(b, 10000) }
func Benchmark_FreelistRelease100K(b *testing.B) { benchmark_FreelistRelease(b, 100000) }
func Benchmark_FreelistRelease1000K(b *testing.B) { benchmark_FreelistRelease(b, 1000000) }
func Benchmark_FreelistRelease10000K(b *testing.B) { benchmark_FreelistRelease(b, 10000000) }
func benchmark_FreelistRelease(b *testing.B, size int) {
ids := randomPgids(size)
pending := randomPgids(len(ids) / 400)
b.ResetTimer()
for i := 0; i < b.N; i++ {
f := &freelist{ids: ids, pending: map[txid][]pgid{1: pending}}
f.release(1)
}
}
func randomPgids(n int) []pgid {
rand.Seed(42)
pgids := make(pgids, n)
for i := range pgids {
pgids[i] = pgid(rand.Int63())
}
sort.Sort(pgids)
return pgids
}

View File

@ -0,0 +1,156 @@
package bolt
import (
"testing"
"unsafe"
)
// Ensure that a node can insert a key/value.
func TestNode_put(t *testing.T) {
n := &node{inodes: make(inodes, 0), bucket: &Bucket{tx: &Tx{meta: &meta{pgid: 1}}}}
n.put([]byte("baz"), []byte("baz"), []byte("2"), 0, 0)
n.put([]byte("foo"), []byte("foo"), []byte("0"), 0, 0)
n.put([]byte("bar"), []byte("bar"), []byte("1"), 0, 0)
n.put([]byte("foo"), []byte("foo"), []byte("3"), 0, leafPageFlag)
if len(n.inodes) != 3 {
t.Fatalf("exp=3; got=%d", len(n.inodes))
}
if k, v := n.inodes[0].key, n.inodes[0].value; string(k) != "bar" || string(v) != "1" {
t.Fatalf("exp=<bar,1>; got=<%s,%s>", k, v)
}
if k, v := n.inodes[1].key, n.inodes[1].value; string(k) != "baz" || string(v) != "2" {
t.Fatalf("exp=<baz,2>; got=<%s,%s>", k, v)
}
if k, v := n.inodes[2].key, n.inodes[2].value; string(k) != "foo" || string(v) != "3" {
t.Fatalf("exp=<foo,3>; got=<%s,%s>", k, v)
}
if n.inodes[2].flags != uint32(leafPageFlag) {
t.Fatalf("not a leaf: %d", n.inodes[2].flags)
}
}
// Ensure that a node can deserialize from a leaf page.
func TestNode_read_LeafPage(t *testing.T) {
// Create a page.
var buf [4096]byte
page := (*page)(unsafe.Pointer(&buf[0]))
page.flags = leafPageFlag
page.count = 2
// Insert 2 elements at the beginning. sizeof(leafPageElement) == 16
nodes := (*[3]leafPageElement)(unsafe.Pointer(&page.ptr))
nodes[0] = leafPageElement{flags: 0, pos: 32, ksize: 3, vsize: 4} // pos = sizeof(leafPageElement) * 2
nodes[1] = leafPageElement{flags: 0, pos: 23, ksize: 10, vsize: 3} // pos = sizeof(leafPageElement) + 3 + 4
// Write data for the nodes at the end.
data := (*[4096]byte)(unsafe.Pointer(&nodes[2]))
copy(data[:], []byte("barfooz"))
copy(data[7:], []byte("helloworldbye"))
// Deserialize page into a leaf.
n := &node{}
n.read(page)
// Check that there are two inodes with correct data.
if !n.isLeaf {
t.Fatal("expected leaf")
}
if len(n.inodes) != 2 {
t.Fatalf("exp=2; got=%d", len(n.inodes))
}
if k, v := n.inodes[0].key, n.inodes[0].value; string(k) != "bar" || string(v) != "fooz" {
t.Fatalf("exp=<bar,fooz>; got=<%s,%s>", k, v)
}
if k, v := n.inodes[1].key, n.inodes[1].value; string(k) != "helloworld" || string(v) != "bye" {
t.Fatalf("exp=<helloworld,bye>; got=<%s,%s>", k, v)
}
}
// Ensure that a node can serialize into a leaf page.
func TestNode_write_LeafPage(t *testing.T) {
// Create a node.
n := &node{isLeaf: true, inodes: make(inodes, 0), bucket: &Bucket{tx: &Tx{db: &DB{}, meta: &meta{pgid: 1}}}}
n.put([]byte("susy"), []byte("susy"), []byte("que"), 0, 0)
n.put([]byte("ricki"), []byte("ricki"), []byte("lake"), 0, 0)
n.put([]byte("john"), []byte("john"), []byte("johnson"), 0, 0)
// Write it to a page.
var buf [4096]byte
p := (*page)(unsafe.Pointer(&buf[0]))
n.write(p)
// Read the page back in.
n2 := &node{}
n2.read(p)
// Check that the two pages are the same.
if len(n2.inodes) != 3 {
t.Fatalf("exp=3; got=%d", len(n2.inodes))
}
if k, v := n2.inodes[0].key, n2.inodes[0].value; string(k) != "john" || string(v) != "johnson" {
t.Fatalf("exp=<john,johnson>; got=<%s,%s>", k, v)
}
if k, v := n2.inodes[1].key, n2.inodes[1].value; string(k) != "ricki" || string(v) != "lake" {
t.Fatalf("exp=<ricki,lake>; got=<%s,%s>", k, v)
}
if k, v := n2.inodes[2].key, n2.inodes[2].value; string(k) != "susy" || string(v) != "que" {
t.Fatalf("exp=<susy,que>; got=<%s,%s>", k, v)
}
}
// Ensure that a node can split into appropriate subgroups.
func TestNode_split(t *testing.T) {
// Create a node.
n := &node{inodes: make(inodes, 0), bucket: &Bucket{tx: &Tx{db: &DB{}, meta: &meta{pgid: 1}}}}
n.put([]byte("00000001"), []byte("00000001"), []byte("0123456701234567"), 0, 0)
n.put([]byte("00000002"), []byte("00000002"), []byte("0123456701234567"), 0, 0)
n.put([]byte("00000003"), []byte("00000003"), []byte("0123456701234567"), 0, 0)
n.put([]byte("00000004"), []byte("00000004"), []byte("0123456701234567"), 0, 0)
n.put([]byte("00000005"), []byte("00000005"), []byte("0123456701234567"), 0, 0)
// Split between 2 & 3.
n.split(100)
var parent = n.parent
if len(parent.children) != 2 {
t.Fatalf("exp=2; got=%d", len(parent.children))
}
if len(parent.children[0].inodes) != 2 {
t.Fatalf("exp=2; got=%d", len(parent.children[0].inodes))
}
if len(parent.children[1].inodes) != 3 {
t.Fatalf("exp=3; got=%d", len(parent.children[1].inodes))
}
}
// Ensure that a page with the minimum number of inodes just returns a single node.
func TestNode_split_MinKeys(t *testing.T) {
// Create a node.
n := &node{inodes: make(inodes, 0), bucket: &Bucket{tx: &Tx{db: &DB{}, meta: &meta{pgid: 1}}}}
n.put([]byte("00000001"), []byte("00000001"), []byte("0123456701234567"), 0, 0)
n.put([]byte("00000002"), []byte("00000002"), []byte("0123456701234567"), 0, 0)
// Split.
n.split(20)
if n.parent != nil {
t.Fatalf("expected nil parent")
}
}
// Ensure that a node that has keys that all fit on a page just returns one leaf.
func TestNode_split_SinglePage(t *testing.T) {
// Create a node.
n := &node{inodes: make(inodes, 0), bucket: &Bucket{tx: &Tx{db: &DB{}, meta: &meta{pgid: 1}}}}
n.put([]byte("00000001"), []byte("00000001"), []byte("0123456701234567"), 0, 0)
n.put([]byte("00000002"), []byte("00000002"), []byte("0123456701234567"), 0, 0)
n.put([]byte("00000003"), []byte("00000003"), []byte("0123456701234567"), 0, 0)
n.put([]byte("00000004"), []byte("00000004"), []byte("0123456701234567"), 0, 0)
n.put([]byte("00000005"), []byte("00000005"), []byte("0123456701234567"), 0, 0)
// Split.
n.split(4096)
if n.parent != nil {
t.Fatalf("expected nil parent")
}
}

View File

@ -0,0 +1,72 @@
package bolt
import (
"reflect"
"sort"
"testing"
"testing/quick"
)
// Ensure that the page type can be returned in human readable format.
func TestPage_typ(t *testing.T) {
if typ := (&page{flags: branchPageFlag}).typ(); typ != "branch" {
t.Fatalf("exp=branch; got=%v", typ)
}
if typ := (&page{flags: leafPageFlag}).typ(); typ != "leaf" {
t.Fatalf("exp=leaf; got=%v", typ)
}
if typ := (&page{flags: metaPageFlag}).typ(); typ != "meta" {
t.Fatalf("exp=meta; got=%v", typ)
}
if typ := (&page{flags: freelistPageFlag}).typ(); typ != "freelist" {
t.Fatalf("exp=freelist; got=%v", typ)
}
if typ := (&page{flags: 20000}).typ(); typ != "unknown<4e20>" {
t.Fatalf("exp=unknown<4e20>; got=%v", typ)
}
}
// Ensure that the hexdump debugging function doesn't blow up.
func TestPage_dump(t *testing.T) {
(&page{id: 256}).hexdump(16)
}
func TestPgids_merge(t *testing.T) {
a := pgids{4, 5, 6, 10, 11, 12, 13, 27}
b := pgids{1, 3, 8, 9, 25, 30}
c := a.merge(b)
if !reflect.DeepEqual(c, pgids{1, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 25, 27, 30}) {
t.Errorf("mismatch: %v", c)
}
a = pgids{4, 5, 6, 10, 11, 12, 13, 27, 35, 36}
b = pgids{8, 9, 25, 30}
c = a.merge(b)
if !reflect.DeepEqual(c, pgids{4, 5, 6, 8, 9, 10, 11, 12, 13, 25, 27, 30, 35, 36}) {
t.Errorf("mismatch: %v", c)
}
}
func TestPgids_merge_quick(t *testing.T) {
if err := quick.Check(func(a, b pgids) bool {
// Sort incoming lists.
sort.Sort(a)
sort.Sort(b)
// Merge the two lists together.
got := a.merge(b)
// The expected value should be the two lists combined and sorted.
exp := append(a, b...)
sort.Sort(exp)
if !reflect.DeepEqual(exp, got) {
t.Errorf("\nexp=%+v\ngot=%+v\n", exp, got)
return false
}
return true
}, nil); err != nil {
t.Fatal(err)
}
}

View File

@ -0,0 +1,79 @@
package bolt_test
import (
"bytes"
"flag"
"fmt"
"math/rand"
"os"
"reflect"
"testing/quick"
"time"
)
// testing/quick defaults to 5 iterations and a random seed.
// You can override these settings from the command line:
//
// -quick.count The number of iterations to perform.
// -quick.seed The seed to use for randomizing.
// -quick.maxitems The maximum number of items to insert into a DB.
// -quick.maxksize The maximum size of a key.
// -quick.maxvsize The maximum size of a value.
//
var qcount, qseed, qmaxitems, qmaxksize, qmaxvsize int
func init() {
flag.IntVar(&qcount, "quick.count", 5, "")
flag.IntVar(&qseed, "quick.seed", int(time.Now().UnixNano())%100000, "")
flag.IntVar(&qmaxitems, "quick.maxitems", 1000, "")
flag.IntVar(&qmaxksize, "quick.maxksize", 1024, "")
flag.IntVar(&qmaxvsize, "quick.maxvsize", 1024, "")
flag.Parse()
fmt.Fprintln(os.Stderr, "seed:", qseed)
fmt.Fprintf(os.Stderr, "quick settings: count=%v, items=%v, ksize=%v, vsize=%v\n", qcount, qmaxitems, qmaxksize, qmaxvsize)
}
func qconfig() *quick.Config {
return &quick.Config{
MaxCount: qcount,
Rand: rand.New(rand.NewSource(int64(qseed))),
}
}
type testdata []testdataitem
func (t testdata) Len() int { return len(t) }
func (t testdata) Swap(i, j int) { t[i], t[j] = t[j], t[i] }
func (t testdata) Less(i, j int) bool { return bytes.Compare(t[i].Key, t[j].Key) == -1 }
func (t testdata) Generate(rand *rand.Rand, size int) reflect.Value {
n := rand.Intn(qmaxitems-1) + 1
items := make(testdata, n)
for i := 0; i < n; i++ {
item := &items[i]
item.Key = randByteSlice(rand, 1, qmaxksize)
item.Value = randByteSlice(rand, 0, qmaxvsize)
}
return reflect.ValueOf(items)
}
type revtestdata []testdataitem
func (t revtestdata) Len() int { return len(t) }
func (t revtestdata) Swap(i, j int) { t[i], t[j] = t[j], t[i] }
func (t revtestdata) Less(i, j int) bool { return bytes.Compare(t[i].Key, t[j].Key) == 1 }
type testdataitem struct {
Key []byte
Value []byte
}
func randByteSlice(rand *rand.Rand, minSize, maxSize int) []byte {
n := rand.Intn(maxSize-minSize) + minSize
b := make([]byte, n)
for i := 0; i < n; i++ {
b[i] = byte(rand.Intn(255))
}
return b
}

View File

@ -0,0 +1,327 @@
package bolt_test
import (
"bytes"
"fmt"
"math/rand"
"sync"
"testing"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt"
)
func TestSimulate_1op_1p(t *testing.T) { testSimulate(t, 100, 1) }
func TestSimulate_10op_1p(t *testing.T) { testSimulate(t, 10, 1) }
func TestSimulate_100op_1p(t *testing.T) { testSimulate(t, 100, 1) }
func TestSimulate_1000op_1p(t *testing.T) { testSimulate(t, 1000, 1) }
func TestSimulate_10000op_1p(t *testing.T) { testSimulate(t, 10000, 1) }
func TestSimulate_10op_10p(t *testing.T) { testSimulate(t, 10, 10) }
func TestSimulate_100op_10p(t *testing.T) { testSimulate(t, 100, 10) }
func TestSimulate_1000op_10p(t *testing.T) { testSimulate(t, 1000, 10) }
func TestSimulate_10000op_10p(t *testing.T) { testSimulate(t, 10000, 10) }
func TestSimulate_100op_100p(t *testing.T) { testSimulate(t, 100, 100) }
func TestSimulate_1000op_100p(t *testing.T) { testSimulate(t, 1000, 100) }
func TestSimulate_10000op_100p(t *testing.T) { testSimulate(t, 10000, 100) }
func TestSimulate_10000op_1000p(t *testing.T) { testSimulate(t, 10000, 1000) }
// Randomly generate operations on a given database with multiple clients to ensure consistency and thread safety.
func testSimulate(t *testing.T, threadCount, parallelism int) {
if testing.Short() {
t.Skip("skipping test in short mode.")
}
rand.Seed(int64(qseed))
// A list of operations that readers and writers can perform.
var readerHandlers = []simulateHandler{simulateGetHandler}
var writerHandlers = []simulateHandler{simulateGetHandler, simulatePutHandler}
var versions = make(map[int]*QuickDB)
versions[1] = NewQuickDB()
db := NewTestDB()
defer db.Close()
var mutex sync.Mutex
// Run n threads in parallel, each with their own operation.
var wg sync.WaitGroup
var threads = make(chan bool, parallelism)
var i int
for {
threads <- true
wg.Add(1)
writable := ((rand.Int() % 100) < 20) // 20% writers
// Choose an operation to execute.
var handler simulateHandler
if writable {
handler = writerHandlers[rand.Intn(len(writerHandlers))]
} else {
handler = readerHandlers[rand.Intn(len(readerHandlers))]
}
// Execute a thread for the given operation.
go func(writable bool, handler simulateHandler) {
defer wg.Done()
// Start transaction.
tx, err := db.Begin(writable)
if err != nil {
t.Fatal("tx begin: ", err)
}
// Obtain current state of the dataset.
mutex.Lock()
var qdb = versions[tx.ID()]
if writable {
qdb = versions[tx.ID()-1].Copy()
}
mutex.Unlock()
// Make sure we commit/rollback the tx at the end and update the state.
if writable {
defer func() {
mutex.Lock()
versions[tx.ID()] = qdb
mutex.Unlock()
ok(t, tx.Commit())
}()
} else {
defer tx.Rollback()
}
// Ignore operation if we don't have data yet.
if qdb == nil {
return
}
// Execute handler.
handler(tx, qdb)
// Release a thread back to the scheduling loop.
<-threads
}(writable, handler)
i++
if i > threadCount {
break
}
}
// Wait until all threads are done.
wg.Wait()
}
type simulateHandler func(tx *bolt.Tx, qdb *QuickDB)
// Retrieves a key from the database and verifies that it is what is expected.
func simulateGetHandler(tx *bolt.Tx, qdb *QuickDB) {
// Randomly retrieve an existing exist.
keys := qdb.Rand()
if len(keys) == 0 {
return
}
// Retrieve root bucket.
b := tx.Bucket(keys[0])
if b == nil {
panic(fmt.Sprintf("bucket[0] expected: %08x\n", trunc(keys[0], 4)))
}
// Drill into nested buckets.
for _, key := range keys[1 : len(keys)-1] {
b = b.Bucket(key)
if b == nil {
panic(fmt.Sprintf("bucket[n] expected: %v -> %v\n", keys, key))
}
}
// Verify key/value on the final bucket.
expected := qdb.Get(keys)
actual := b.Get(keys[len(keys)-1])
if !bytes.Equal(actual, expected) {
fmt.Println("=== EXPECTED ===")
fmt.Println(expected)
fmt.Println("=== ACTUAL ===")
fmt.Println(actual)
fmt.Println("=== END ===")
panic("value mismatch")
}
}
// Inserts a key into the database.
func simulatePutHandler(tx *bolt.Tx, qdb *QuickDB) {
var err error
keys, value := randKeys(), randValue()
// Retrieve root bucket.
b := tx.Bucket(keys[0])
if b == nil {
b, err = tx.CreateBucket(keys[0])
if err != nil {
panic("create bucket: " + err.Error())
}
}
// Create nested buckets, if necessary.
for _, key := range keys[1 : len(keys)-1] {
child := b.Bucket(key)
if child != nil {
b = child
} else {
b, err = b.CreateBucket(key)
if err != nil {
panic("create bucket: " + err.Error())
}
}
}
// Insert into database.
if err := b.Put(keys[len(keys)-1], value); err != nil {
panic("put: " + err.Error())
}
// Insert into in-memory database.
qdb.Put(keys, value)
}
// QuickDB is an in-memory database that replicates the functionality of the
// Bolt DB type except that it is entirely in-memory. It is meant for testing
// that the Bolt database is consistent.
type QuickDB struct {
sync.RWMutex
m map[string]interface{}
}
// NewQuickDB returns an instance of QuickDB.
func NewQuickDB() *QuickDB {
return &QuickDB{m: make(map[string]interface{})}
}
// Get retrieves the value at a key path.
func (db *QuickDB) Get(keys [][]byte) []byte {
db.RLock()
defer db.RUnlock()
m := db.m
for _, key := range keys[:len(keys)-1] {
value := m[string(key)]
if value == nil {
return nil
}
switch value := value.(type) {
case map[string]interface{}:
m = value
case []byte:
return nil
}
}
// Only return if it's a simple value.
if value, ok := m[string(keys[len(keys)-1])].([]byte); ok {
return value
}
return nil
}
// Put inserts a value into a key path.
func (db *QuickDB) Put(keys [][]byte, value []byte) {
db.Lock()
defer db.Unlock()
// Build buckets all the way down the key path.
m := db.m
for _, key := range keys[:len(keys)-1] {
if _, ok := m[string(key)].([]byte); ok {
return // Keypath intersects with a simple value. Do nothing.
}
if m[string(key)] == nil {
m[string(key)] = make(map[string]interface{})
}
m = m[string(key)].(map[string]interface{})
}
// Insert value into the last key.
m[string(keys[len(keys)-1])] = value
}
// Rand returns a random key path that points to a simple value.
func (db *QuickDB) Rand() [][]byte {
db.RLock()
defer db.RUnlock()
if len(db.m) == 0 {
return nil
}
var keys [][]byte
db.rand(db.m, &keys)
return keys
}
func (db *QuickDB) rand(m map[string]interface{}, keys *[][]byte) {
i, index := 0, rand.Intn(len(m))
for k, v := range m {
if i == index {
*keys = append(*keys, []byte(k))
if v, ok := v.(map[string]interface{}); ok {
db.rand(v, keys)
}
return
}
i++
}
panic("quickdb rand: out-of-range")
}
// Copy copies the entire database.
func (db *QuickDB) Copy() *QuickDB {
db.RLock()
defer db.RUnlock()
return &QuickDB{m: db.copy(db.m)}
}
func (db *QuickDB) copy(m map[string]interface{}) map[string]interface{} {
clone := make(map[string]interface{}, len(m))
for k, v := range m {
switch v := v.(type) {
case map[string]interface{}:
clone[k] = db.copy(v)
default:
clone[k] = v
}
}
return clone
}
func randKey() []byte {
var min, max = 1, 1024
n := rand.Intn(max-min) + min
b := make([]byte, n)
for i := 0; i < n; i++ {
b[i] = byte(rand.Intn(255))
}
return b
}
func randKeys() [][]byte {
var keys [][]byte
var count = rand.Intn(2) + 2
for i := 0; i < count; i++ {
keys = append(keys, randKey())
}
return keys
}
func randValue() []byte {
n := rand.Intn(8192)
b := make([]byte, n)
for i := 0; i < n; i++ {
b[i] = byte(rand.Intn(255))
}
return b
}

View File

@ -29,14 +29,6 @@ type Tx struct {
pages map[pgid]*page
stats TxStats
commitHandlers []func()
// WriteFlag specifies the flag for write-related methods like WriteTo().
// Tx opens the database file with the specified flag to copy the data.
//
// By default, the flag is unset, which works well for mostly in-memory
// workloads. For databases that are much larger than available RAM,
// set the flag to syscall.O_DIRECT to avoid trashing the page cache.
WriteFlag int
}
// init initializes the transaction.
@ -95,21 +87,18 @@ func (tx *Tx) Stats() TxStats {
// Bucket retrieves a bucket by name.
// Returns nil if the bucket does not exist.
// The bucket instance is only valid for the lifetime of the transaction.
func (tx *Tx) Bucket(name []byte) *Bucket {
return tx.root.Bucket(name)
}
// CreateBucket creates a new bucket.
// Returns an error if the bucket already exists, if the bucket name is blank, or if the bucket name is too long.
// The bucket instance is only valid for the lifetime of the transaction.
func (tx *Tx) CreateBucket(name []byte) (*Bucket, error) {
return tx.root.CreateBucket(name)
}
// CreateBucketIfNotExists creates a new bucket if it doesn't already exist.
// Returns an error if the bucket name is blank, or if the bucket name is too long.
// The bucket instance is only valid for the lifetime of the transaction.
func (tx *Tx) CreateBucketIfNotExists(name []byte) (*Bucket, error) {
return tx.root.CreateBucketIfNotExists(name)
}
@ -168,8 +157,6 @@ func (tx *Tx) Commit() error {
// Free the old root bucket.
tx.meta.root.root = tx.root.root
opgid := tx.meta.pgid
// Free the freelist and allocate new pages for it. This will overestimate
// the size of the freelist but not underestimate the size (which would be bad).
tx.db.freelist.free(tx.meta.txid, tx.db.page(tx.meta.freelist))
@ -184,14 +171,6 @@ func (tx *Tx) Commit() error {
}
tx.meta.freelist = p.id
// If the high water mark has moved up then attempt to grow the database.
if tx.meta.pgid > opgid {
if err := tx.db.grow(int(tx.meta.pgid+1) * tx.db.pageSize); err != nil {
tx.rollback()
return err
}
}
// Write dirty pages to disk.
startTime = time.Now()
if err := tx.write(); err != nil {
@ -257,8 +236,7 @@ func (tx *Tx) close() {
var freelistPendingN = tx.db.freelist.pending_count()
var freelistAlloc = tx.db.freelist.size()
// Remove transaction ref & writer lock.
tx.db.rwtx = nil
// Remove writer lock.
tx.db.rwlock.Unlock()
// Merge statistics.
@ -272,16 +250,11 @@ func (tx *Tx) close() {
} else {
tx.db.removeTx(tx)
}
// Clear all references.
tx.db = nil
tx.meta = nil
tx.root = Bucket{tx: tx}
tx.pages = nil
}
// Copy writes the entire database to a writer.
// This function exists for backwards compatibility. Use WriteTo() instead.
// This function exists for backwards compatibility. Use WriteTo() in
func (tx *Tx) Copy(w io.Writer) error {
_, err := tx.WriteTo(w)
return err
@ -290,47 +263,29 @@ func (tx *Tx) Copy(w io.Writer) error {
// WriteTo writes the entire database to a writer.
// If err == nil then exactly tx.Size() bytes will be written into the writer.
func (tx *Tx) WriteTo(w io.Writer) (n int64, err error) {
// Attempt to open reader with WriteFlag
f, err := os.OpenFile(tx.db.path, os.O_RDONLY|tx.WriteFlag, 0)
if err != nil {
return 0, err
}
defer func() { _ = f.Close() }()
// Generate a meta page. We use the same page data for both meta pages.
buf := make([]byte, tx.db.pageSize)
page := (*page)(unsafe.Pointer(&buf[0]))
page.flags = metaPageFlag
*page.meta() = *tx.meta
// Write meta 0.
page.id = 0
page.meta().checksum = page.meta().sum64()
nn, err := w.Write(buf)
n += int64(nn)
if err != nil {
return n, fmt.Errorf("meta 0 copy: %s", err)
// Attempt to open reader directly.
var f *os.File
if f, err = os.OpenFile(tx.db.path, os.O_RDONLY|odirect, 0); err != nil {
// Fallback to a regular open if that doesn't work.
if f, err = os.OpenFile(tx.db.path, os.O_RDONLY, 0); err != nil {
return 0, err
}
}
// Write meta 1 with a lower transaction id.
page.id = 1
page.meta().txid -= 1
page.meta().checksum = page.meta().sum64()
nn, err = w.Write(buf)
n += int64(nn)
// Copy the meta pages.
tx.db.metalock.Lock()
n, err = io.CopyN(w, f, int64(tx.db.pageSize*2))
tx.db.metalock.Unlock()
if err != nil {
return n, fmt.Errorf("meta 1 copy: %s", err)
}
// Move past the meta pages in the file.
if _, err := f.Seek(int64(tx.db.pageSize*2), os.SEEK_SET); err != nil {
return n, fmt.Errorf("seek: %s", err)
_ = f.Close()
return n, fmt.Errorf("meta copy: %s", err)
}
// Copy data pages.
wn, err := io.CopyN(w, f, tx.Size()-int64(tx.db.pageSize*2))
n += wn
if err != nil {
_ = f.Close()
return n, err
}
@ -537,7 +492,7 @@ func (tx *Tx) writeMeta() error {
}
// page returns a reference to the page with a given id.
// If page has been written to then a temporary buffered page is returned.
// If page has been written to then a temporary bufferred page is returned.
func (tx *Tx) page(id pgid) *page {
// Check the dirty pages first.
if tx.pages != nil {

456
Godeps/_workspace/src/github.com/boltdb/bolt/tx_test.go generated vendored Normal file
View File

@ -0,0 +1,456 @@
package bolt_test
import (
"errors"
"fmt"
"os"
"testing"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt"
)
// Ensure that committing a closed transaction returns an error.
func TestTx_Commit_Closed(t *testing.T) {
db := NewTestDB()
defer db.Close()
tx, _ := db.Begin(true)
tx.CreateBucket([]byte("foo"))
ok(t, tx.Commit())
equals(t, tx.Commit(), bolt.ErrTxClosed)
}
// Ensure that rolling back a closed transaction returns an error.
func TestTx_Rollback_Closed(t *testing.T) {
db := NewTestDB()
defer db.Close()
tx, _ := db.Begin(true)
ok(t, tx.Rollback())
equals(t, tx.Rollback(), bolt.ErrTxClosed)
}
// Ensure that committing a read-only transaction returns an error.
func TestTx_Commit_ReadOnly(t *testing.T) {
db := NewTestDB()
defer db.Close()
tx, _ := db.Begin(false)
equals(t, tx.Commit(), bolt.ErrTxNotWritable)
}
// Ensure that a transaction can retrieve a cursor on the root bucket.
func TestTx_Cursor(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
tx.CreateBucket([]byte("woojits"))
c := tx.Cursor()
k, v := c.First()
equals(t, "widgets", string(k))
assert(t, v == nil, "")
k, v = c.Next()
equals(t, "woojits", string(k))
assert(t, v == nil, "")
k, v = c.Next()
assert(t, k == nil, "")
assert(t, v == nil, "")
return nil
})
}
// Ensure that creating a bucket with a read-only transaction returns an error.
func TestTx_CreateBucket_ReadOnly(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.View(func(tx *bolt.Tx) error {
b, err := tx.CreateBucket([]byte("foo"))
assert(t, b == nil, "")
equals(t, bolt.ErrTxNotWritable, err)
return nil
})
}
// Ensure that creating a bucket on a closed transaction returns an error.
func TestTx_CreateBucket_Closed(t *testing.T) {
db := NewTestDB()
defer db.Close()
tx, _ := db.Begin(true)
tx.Commit()
b, err := tx.CreateBucket([]byte("foo"))
assert(t, b == nil, "")
equals(t, bolt.ErrTxClosed, err)
}
// Ensure that a Tx can retrieve a bucket.
func TestTx_Bucket(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
b := tx.Bucket([]byte("widgets"))
assert(t, b != nil, "")
return nil
})
}
// Ensure that a Tx retrieving a non-existent key returns nil.
func TestTx_Get_Missing(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte("bar"))
value := tx.Bucket([]byte("widgets")).Get([]byte("no_such_key"))
assert(t, value == nil, "")
return nil
})
}
// Ensure that a bucket can be created and retrieved.
func TestTx_CreateBucket(t *testing.T) {
db := NewTestDB()
defer db.Close()
// Create a bucket.
db.Update(func(tx *bolt.Tx) error {
b, err := tx.CreateBucket([]byte("widgets"))
assert(t, b != nil, "")
ok(t, err)
return nil
})
// Read the bucket through a separate transaction.
db.View(func(tx *bolt.Tx) error {
b := tx.Bucket([]byte("widgets"))
assert(t, b != nil, "")
return nil
})
}
// Ensure that a bucket can be created if it doesn't already exist.
func TestTx_CreateBucketIfNotExists(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
b, err := tx.CreateBucketIfNotExists([]byte("widgets"))
assert(t, b != nil, "")
ok(t, err)
b, err = tx.CreateBucketIfNotExists([]byte("widgets"))
assert(t, b != nil, "")
ok(t, err)
b, err = tx.CreateBucketIfNotExists([]byte{})
assert(t, b == nil, "")
equals(t, bolt.ErrBucketNameRequired, err)
b, err = tx.CreateBucketIfNotExists(nil)
assert(t, b == nil, "")
equals(t, bolt.ErrBucketNameRequired, err)
return nil
})
// Read the bucket through a separate transaction.
db.View(func(tx *bolt.Tx) error {
b := tx.Bucket([]byte("widgets"))
assert(t, b != nil, "")
return nil
})
}
// Ensure that a bucket cannot be created twice.
func TestTx_CreateBucket_Exists(t *testing.T) {
db := NewTestDB()
defer db.Close()
// Create a bucket.
db.Update(func(tx *bolt.Tx) error {
b, err := tx.CreateBucket([]byte("widgets"))
assert(t, b != nil, "")
ok(t, err)
return nil
})
// Create the same bucket again.
db.Update(func(tx *bolt.Tx) error {
b, err := tx.CreateBucket([]byte("widgets"))
assert(t, b == nil, "")
equals(t, bolt.ErrBucketExists, err)
return nil
})
}
// Ensure that a bucket is created with a non-blank name.
func TestTx_CreateBucket_NameRequired(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
b, err := tx.CreateBucket(nil)
assert(t, b == nil, "")
equals(t, bolt.ErrBucketNameRequired, err)
return nil
})
}
// Ensure that a bucket can be deleted.
func TestTx_DeleteBucket(t *testing.T) {
db := NewTestDB()
defer db.Close()
// Create a bucket and add a value.
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte("bar"))
return nil
})
// Delete the bucket and make sure we can't get the value.
db.Update(func(tx *bolt.Tx) error {
ok(t, tx.DeleteBucket([]byte("widgets")))
assert(t, tx.Bucket([]byte("widgets")) == nil, "")
return nil
})
db.Update(func(tx *bolt.Tx) error {
// Create the bucket again and make sure there's not a phantom value.
b, err := tx.CreateBucket([]byte("widgets"))
assert(t, b != nil, "")
ok(t, err)
assert(t, tx.Bucket([]byte("widgets")).Get([]byte("foo")) == nil, "")
return nil
})
}
// Ensure that deleting a bucket on a closed transaction returns an error.
func TestTx_DeleteBucket_Closed(t *testing.T) {
db := NewTestDB()
defer db.Close()
tx, _ := db.Begin(true)
tx.Commit()
equals(t, tx.DeleteBucket([]byte("foo")), bolt.ErrTxClosed)
}
// Ensure that deleting a bucket with a read-only transaction returns an error.
func TestTx_DeleteBucket_ReadOnly(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.View(func(tx *bolt.Tx) error {
equals(t, tx.DeleteBucket([]byte("foo")), bolt.ErrTxNotWritable)
return nil
})
}
// Ensure that nothing happens when deleting a bucket that doesn't exist.
func TestTx_DeleteBucket_NotFound(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
equals(t, bolt.ErrBucketNotFound, tx.DeleteBucket([]byte("widgets")))
return nil
})
}
// Ensure that no error is returned when a tx.ForEach function does not return
// an error.
func TestTx_ForEach_NoError(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte("bar"))
equals(t, nil, tx.ForEach(func(name []byte, b *bolt.Bucket) error {
return nil
}))
return nil
})
}
// Ensure that an error is returned when a tx.ForEach function returns an error.
func TestTx_ForEach_WithError(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte("bar"))
err := errors.New("foo")
equals(t, err, tx.ForEach(func(name []byte, b *bolt.Bucket) error {
return err
}))
return nil
})
}
// Ensure that Tx commit handlers are called after a transaction successfully commits.
func TestTx_OnCommit(t *testing.T) {
var x int
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.OnCommit(func() { x += 1 })
tx.OnCommit(func() { x += 2 })
_, err := tx.CreateBucket([]byte("widgets"))
return err
})
equals(t, 3, x)
}
// Ensure that Tx commit handlers are NOT called after a transaction rolls back.
func TestTx_OnCommit_Rollback(t *testing.T) {
var x int
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.OnCommit(func() { x += 1 })
tx.OnCommit(func() { x += 2 })
tx.CreateBucket([]byte("widgets"))
return errors.New("rollback this commit")
})
equals(t, 0, x)
}
// Ensure that the database can be copied to a file path.
func TestTx_CopyFile(t *testing.T) {
db := NewTestDB()
defer db.Close()
var dest = tempfile()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte("bar"))
tx.Bucket([]byte("widgets")).Put([]byte("baz"), []byte("bat"))
return nil
})
ok(t, db.View(func(tx *bolt.Tx) error { return tx.CopyFile(dest, 0600) }))
db2, err := bolt.Open(dest, 0600, nil)
ok(t, err)
defer db2.Close()
db2.View(func(tx *bolt.Tx) error {
equals(t, []byte("bar"), tx.Bucket([]byte("widgets")).Get([]byte("foo")))
equals(t, []byte("bat"), tx.Bucket([]byte("widgets")).Get([]byte("baz")))
return nil
})
}
type failWriterError struct{}
func (failWriterError) Error() string {
return "error injected for tests"
}
type failWriter struct {
// fail after this many bytes
After int
}
func (f *failWriter) Write(p []byte) (n int, err error) {
n = len(p)
if n > f.After {
n = f.After
err = failWriterError{}
}
f.After -= n
return n, err
}
// Ensure that Copy handles write errors right.
func TestTx_CopyFile_Error_Meta(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte("bar"))
tx.Bucket([]byte("widgets")).Put([]byte("baz"), []byte("bat"))
return nil
})
err := db.View(func(tx *bolt.Tx) error { return tx.Copy(&failWriter{}) })
equals(t, err.Error(), "meta copy: error injected for tests")
}
// Ensure that Copy handles write errors right.
func TestTx_CopyFile_Error_Normal(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte("bar"))
tx.Bucket([]byte("widgets")).Put([]byte("baz"), []byte("bat"))
return nil
})
err := db.View(func(tx *bolt.Tx) error { return tx.Copy(&failWriter{3 * db.Info().PageSize}) })
equals(t, err.Error(), "error injected for tests")
}
func ExampleTx_Rollback() {
// Open the database.
db, _ := bolt.Open(tempfile(), 0666, nil)
defer os.Remove(db.Path())
defer db.Close()
// Create a bucket.
db.Update(func(tx *bolt.Tx) error {
_, err := tx.CreateBucket([]byte("widgets"))
return err
})
// Set a value for a key.
db.Update(func(tx *bolt.Tx) error {
return tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte("bar"))
})
// Update the key but rollback the transaction so it never saves.
tx, _ := db.Begin(true)
b := tx.Bucket([]byte("widgets"))
b.Put([]byte("foo"), []byte("baz"))
tx.Rollback()
// Ensure that our original value is still set.
db.View(func(tx *bolt.Tx) error {
value := tx.Bucket([]byte("widgets")).Get([]byte("foo"))
fmt.Printf("The value for 'foo' is still: %s\n", value)
return nil
})
// Output:
// The value for 'foo' is still: bar
}
func ExampleTx_CopyFile() {
// Open the database.
db, _ := bolt.Open(tempfile(), 0666, nil)
defer os.Remove(db.Path())
defer db.Close()
// Create a bucket and a key.
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte("bar"))
return nil
})
// Copy the database to another file.
toFile := tempfile()
db.View(func(tx *bolt.Tx) error { return tx.CopyFile(toFile, 0666) })
defer os.Remove(toFile)
// Open the cloned database.
db2, _ := bolt.Open(toFile, 0666, nil)
defer db2.Close()
// Ensure that the key exists in the copy.
db2.View(func(tx *bolt.Tx) error {
value := tx.Bucket([]byte("widgets")).Get([]byte("foo"))
fmt.Printf("The value for 'foo' in the clone is: %s\n", value)
return nil
})
// Output:
// The value for 'foo' in the clone is: bar
}

View File

@ -0,0 +1 @@
*~

View File

@ -0,0 +1,19 @@
# This file is like Go's AUTHORS file: it lists Copyright holders.
# The list of humans who have contributd is in the CONTRIBUTORS file.
#
# To contribute to this project, because it will eventually be folded
# back in to Go itself, you need to submit a CLA:
#
# http://golang.org/doc/contribute.html#copyright
#
# Then you get added to CONTRIBUTORS and you or your company get added
# to the AUTHORS file.
Blake Mizerany <blake.mizerany@gmail.com> github=bmizerany
Daniel Morsing <daniel.morsing@gmail.com> github=DanielMorsing
Gabriel Aszalos <gabriel.aszalos@gmail.com> github=gbbr
Google, Inc.
Keith Rarick <kr@xph.us> github=kr
Matthew Keenan <tank.en.mate@gmail.com> <github@mattkeenan.net> github=mattkeenan
Matt Layher <mdlayher@gmail.com> github=mdlayher
Tatsuhiro Tsujikawa <tatsuhiro.t@gmail.com> github=tatsuhiro-t

View File

@ -0,0 +1,19 @@
# This file is like Go's CONTRIBUTORS file: it lists humans.
# The list of copyright holders (which may be companies) are in the AUTHORS file.
#
# To contribute to this project, because it will eventually be folded
# back in to Go itself, you need to submit a CLA:
#
# http://golang.org/doc/contribute.html#copyright
#
# Then you get added to CONTRIBUTORS and you or your company get added
# to the AUTHORS file.
Blake Mizerany <blake.mizerany@gmail.com> github=bmizerany
Brad Fitzpatrick <bradfitz@golang.org> github=bradfitz
Daniel Morsing <daniel.morsing@gmail.com> github=DanielMorsing
Gabriel Aszalos <gabriel.aszalos@gmail.com> github=gbbr
Keith Rarick <kr@xph.us> github=kr
Matthew Keenan <tank.en.mate@gmail.com> <github@mattkeenan.net> github=mattkeenan
Matt Layher <mdlayher@gmail.com> github=mdlayher
Tatsuhiro Tsujikawa <tatsuhiro.t@gmail.com> github=tatsuhiro-t

View File

@ -17,15 +17,8 @@ RUN apt-get install -y --no-install-recommends \
libcunit1-dev libssl-dev libxml2-dev libevent-dev \
automake autoconf
# The list of packages nghttp2 recommends for h2load:
RUN apt-get install -y --no-install-recommends make binutils \
autoconf automake autotools-dev \
libtool pkg-config zlib1g-dev libcunit1-dev libssl-dev libxml2-dev \
libev-dev libevent-dev libjansson-dev libjemalloc-dev \
cython python3.4-dev python-setuptools
# Note: setting NGHTTP2_VER before the git clone, so an old git clone isn't cached:
ENV NGHTTP2_VER 895da9a
ENV NGHTTP2_VER af24f8394e43f4
RUN cd /root && git clone https://github.com/tatsuhiro-t/nghttp2.git
WORKDIR /root/nghttp2
@ -38,9 +31,9 @@ RUN make
RUN make install
WORKDIR /root
RUN wget http://curl.haxx.se/download/curl-7.45.0.tar.gz
RUN tar -zxvf curl-7.45.0.tar.gz
WORKDIR /root/curl-7.45.0
RUN wget http://curl.haxx.se/download/curl-7.40.0.tar.gz
RUN tar -zxvf curl-7.40.0.tar.gz
WORKDIR /root/curl-7.40.0
RUN ./configure --with-ssl --with-nghttp2=/usr/local
RUN make
RUN make install

View File

@ -0,0 +1,5 @@
We only accept contributions from users who have gone through Go's
contribution process (signed a CLA).
Please acknowledge whether you have (and use the same email) if
sending a pull request.

View File

@ -0,0 +1,7 @@
Copyright 2014 Google & the Go AUTHORS
Go AUTHORS are:
See https://code.google.com/p/go/source/browse/AUTHORS
Licensed under the terms of Go itself:
https://code.google.com/p/go/source/browse/LICENSE

View File

@ -10,11 +10,8 @@ Status:
* The client work has just started but shares a lot of code
is coming along much quicker.
Docs are at https://godoc.org/golang.org/x/net/http2
Docs are at https://godoc.org/github.com/bradfitz/http2
Demo test server at https://http2.golang.org/
Help & bug reports welcome!
Contributing: https://golang.org/doc/contribute.html
Bugs: https://golang.org/issue/new?title=x/net/http2:+
Help & bug reports welcome.

View File

@ -0,0 +1,75 @@
// Copyright 2014 The Go Authors.
// See https://code.google.com/p/go/source/browse/CONTRIBUTORS
// Licensed under the same terms as Go itself:
// https://code.google.com/p/go/source/browse/LICENSE
package http2
import (
"errors"
)
// buffer is an io.ReadWriteCloser backed by a fixed size buffer.
// It never allocates, but moves old data as new data is written.
type buffer struct {
buf []byte
r, w int
closed bool
err error // err to return to reader
}
var (
errReadEmpty = errors.New("read from empty buffer")
errWriteFull = errors.New("write on full buffer")
)
// Read copies bytes from the buffer into p.
// It is an error to read when no data is available.
func (b *buffer) Read(p []byte) (n int, err error) {
n = copy(p, b.buf[b.r:b.w])
b.r += n
if b.closed && b.r == b.w {
err = b.err
} else if b.r == b.w && n == 0 {
err = errReadEmpty
}
return n, err
}
// Len returns the number of bytes of the unread portion of the buffer.
func (b *buffer) Len() int {
return b.w - b.r
}
// Write copies bytes from p into the buffer.
// It is an error to write more data than the buffer can hold.
func (b *buffer) Write(p []byte) (n int, err error) {
if b.closed {
return 0, errors.New("closed")
}
// Slide existing data to beginning.
if b.r > 0 && len(p) > len(b.buf)-b.w {
copy(b.buf, b.buf[b.r:b.w])
b.w -= b.r
b.r = 0
}
// Write new data.
n = copy(b.buf[b.w:], p)
b.w += n
if n < len(p) {
err = errWriteFull
}
return n, err
}
// Close marks the buffer as closed. Future calls to Write will
// return an error. Future calls to Read, once the buffer is
// empty, will return err.
func (b *buffer) Close(err error) {
if !b.closed {
b.closed = true
b.err = err
}
}

View File

@ -0,0 +1,73 @@
// Copyright 2014 The Go Authors.
// See https://code.google.com/p/go/source/browse/CONTRIBUTORS
// Licensed under the same terms as Go itself:
// https://code.google.com/p/go/source/browse/LICENSE
package http2
import (
"io"
"reflect"
"testing"
)
var bufferReadTests = []struct {
buf buffer
read, wn int
werr error
wp []byte
wbuf buffer
}{
{
buffer{[]byte{'a', 0}, 0, 1, false, nil},
5, 1, nil, []byte{'a'},
buffer{[]byte{'a', 0}, 1, 1, false, nil},
},
{
buffer{[]byte{'a', 0}, 0, 1, true, io.EOF},
5, 1, io.EOF, []byte{'a'},
buffer{[]byte{'a', 0}, 1, 1, true, io.EOF},
},
{
buffer{[]byte{0, 'a'}, 1, 2, false, nil},
5, 1, nil, []byte{'a'},
buffer{[]byte{0, 'a'}, 2, 2, false, nil},
},
{
buffer{[]byte{0, 'a'}, 1, 2, true, io.EOF},
5, 1, io.EOF, []byte{'a'},
buffer{[]byte{0, 'a'}, 2, 2, true, io.EOF},
},
{
buffer{[]byte{}, 0, 0, false, nil},
5, 0, errReadEmpty, []byte{},
buffer{[]byte{}, 0, 0, false, nil},
},
{
buffer{[]byte{}, 0, 0, true, io.EOF},
5, 0, io.EOF, []byte{},
buffer{[]byte{}, 0, 0, true, io.EOF},
},
}
func TestBufferRead(t *testing.T) {
for i, tt := range bufferReadTests {
read := make([]byte, tt.read)
n, err := tt.buf.Read(read)
if n != tt.wn {
t.Errorf("#%d: wn = %d want %d", i, n, tt.wn)
continue
}
if err != tt.werr {
t.Errorf("#%d: werr = %v want %v", i, err, tt.werr)
continue
}
read = read[:n]
if !reflect.DeepEqual(read, tt.wp) {
t.Errorf("#%d: read = %+v want %+v", i, read, tt.wp)
}
if !reflect.DeepEqual(tt.buf, tt.wbuf) {
t.Errorf("#%d: buf = %+v want %+v", i, tt.buf, tt.wbuf)
}
}
}

View File

@ -1,13 +1,11 @@
// Copyright 2014 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
// Copyright 2014 The Go Authors.
// See https://code.google.com/p/go/source/browse/CONTRIBUTORS
// Licensed under the same terms as Go itself:
// https://code.google.com/p/go/source/browse/LICENSE
package http2
import (
"errors"
"fmt"
)
import "fmt"
// An ErrCode is an unsigned 32-bit error code as defined in the HTTP/2 spec.
type ErrCode uint32
@ -78,45 +76,3 @@ func (e StreamError) Error() string {
type goAwayFlowError struct{}
func (goAwayFlowError) Error() string { return "connection exceeded flow control window size" }
// connErrorReason wraps a ConnectionError with an informative error about why it occurs.
// Errors of this type are only returned by the frame parser functions
// and converted into ConnectionError(ErrCodeProtocol).
type connError struct {
Code ErrCode
Reason string
}
func (e connError) Error() string {
return fmt.Sprintf("http2: connection error: %v: %v", e.Code, e.Reason)
}
type pseudoHeaderError string
func (e pseudoHeaderError) Error() string {
return fmt.Sprintf("invalid pseudo-header %q", string(e))
}
type duplicatePseudoHeaderError string
func (e duplicatePseudoHeaderError) Error() string {
return fmt.Sprintf("duplicate pseudo-header %q", string(e))
}
type headerFieldNameError string
func (e headerFieldNameError) Error() string {
return fmt.Sprintf("invalid header field name %q", string(e))
}
type headerFieldValueError string
func (e headerFieldValueError) Error() string {
return fmt.Sprintf("invalid header field value %q", string(e))
}
var (
errMixPseudoHeaderTypes = errors.New("mix of request and response pseudo headers")
errPseudoAfterRegular = errors.New("pseudo header field after regular")
)

Some files were not shown because too many files have changed in this diff Show More