Elasticsearch Notes
Elasticsearch is made up of two components:
- Elasticsearch: the clustering engine and REST API
- Lucene: the search backend (indexes are always raw Lucene indexes)
You need to understand how both work.
Lucene: #
Index Merges #
This video shows how index merges occur:
Basically, when you have enough segments of similar size that they can be grouped, they are merged into a larger segment and the old segments are vacuumed away.
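Merging normally happens automatically in the background, but you can also trigger it by hand on an index that no longer receives writes. A minimal sketch, assuming an index named `my_index` and ES 2.1+ (older releases expose the same thing as `_optimize`):

```sh
# force-merge an index down to a single segment (only do this on indexes that are no longer written to)
curl -XPOST 'localhost:9200/my_index/_forcemerge?max_num_segments=1'
```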
Memory Pressure/Heap: #
If you monitor the total memory used on the JVM you will typically see a sawtooth pattern where the memory usage steadily increases and then drops suddenly.
(Figure: sawtooth pattern of JVM heap usage)
The reason for this sawtooth pattern is that the JVM continuously needs to allocate memory on the heap as new objects are created as a part of the normal program execution. Most of these objects are, however, short lived and quickly become available for collection by the garbage collector. When the garbage collector finishes you’ll see a drop on the memory usage graph. This constant state of flux makes the current total memory usage a poor indicator of memory pressure.
The JVM garbage collector is designed such that it draws on the fact that most objects are short lived. There are separate pools in the heap for new objects and old objects, and these pools are garbage collected separately. After the collection of the new objects pool, surviving objects are moved to the old objects pool, which is garbage collected less frequently, because there is less likely to be any substantial amount of garbage there. If you monitor each of these pools separately, you will see the same sawtooth pattern, but the old pool is fairly steady while the new pool frequently moves between full and empty. This is why the memory pressure indicator is based on the fill rate of the old pool.
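One way to watch the old pool directly is the node stats API. A minimal sketch, assuming a node reachable on `localhost:9200` (`filter_path` is a real query parameter; the exact JSON layout varies slightly between versions):

```sh
# print each node's old-generation heap pool; its fill rate is the memory-pressure signal described above
curl -XGET 'localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.mem.pools.old&pretty'
```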
A heap that is too large is bad:
If the heap is too large, the application will be prone to infrequent long latency spikes from full-heap garbage collections. Infrequent long pauses impact end-user experience as these pauses increase the tail of the latency distribution; user requests will sometimes see unacceptably-long response times. Long pauses are especially detrimental to a distributed system like Elasticsearch because a long pause is indistinguishable from a node that is unreachable because it is hung, or otherwise isolated from the cluster. During a stop-the-world pause, no Elasticsearch server code is executing: it doesn’t call, it doesn’t write, and it doesn’t send flowers. In the case of an elected master, a long garbage collection pause can cause other nodes to stop following the master and elect a new one. In the case of a data node, a long garbage collection pause can lead to the master removing the node from the cluster and reallocating the paused node’s assigned shards. This increases network traffic and disk I/O across the cluster, which hampers normal load. Long garbage collection pauses are a top issue for cluster instability.
A heap that is too small is also bad:
If the heap is too small, applications will be prone to the danger of out of memory errors. While that is the most serious risk from an undersized heap, there are additional problems that can arise from a heap that is too small. A heap that is too small relative to the application’s allocation rate leads to frequent small latency spikes and reduced throughput from constant garbage collection pauses. Frequent short pauses impact end-user experience as these pauses effectively shift the latency distribution and reduce the number of operations the application can handle. For Elasticsearch, constant short pauses reduce the number of indexing operations and queries per second that can be handled. A small heap also reduces the memory available for indexing buffers, caches, and memory-hungry features like aggregations and suggesters.
Further reading:
- Memory Pressure
- Understanding HEAP
- Sizing ES
Elasticsearch #
Node Types: #
- Client Node: `node.data: false` and `node.master: false`
- Master Node: `node.data: false` and `node.master: true`
- Data Node: `node.data: true` and `node.master: false`
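To confirm which role each node actually took, the `_cat/nodes` endpoint can print it. A minimal sketch (the `node.role` and `master` column names may vary slightly between versions):

```sh
# list every node with its role flags and whether it is the elected master
curl -XGET 'localhost:9200/_cat/nodes?v&h=name,node.role,master'
```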
Sizing of ES #
- Get the queries which will be run.
- Acquire the SLA.
- Sample the incoming data (JMeter can generate data).
- Never allow more than 50GB per shard.
- 25GB-30GB per shard is about right (take your index size and divide it into 25GB chunks; that's your shard count).
- If you must increase the shard count, you have to reindex, because the shard count is fixed at index creation (see the sketch after this list).
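A minimal sketch of fixing the shard count up front at index creation (the index name `logs_2016_03` and the counts are illustrative; on 6.x+ you also need `-H 'Content-Type: application/json'`):

```sh
# create a 4-shard, 1-replica index; 4 shards of ~25GB covers an index expected to reach ~100GB
curl -XPUT 'localhost:9200/logs_2016_03' -d '{
  "settings": {
    "number_of_shards": 4,
    "number_of_replicas": 1
  }
}'
```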
Reindexing: #
- While reindexing, turn off replicas (set `number_of_replicas: 0`).
- Set `refresh_interval: -1` to disable the periodic refresh while you bulk load; both can be flipped back afterwards (see the sketch below).
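A minimal sketch of toggling those settings around a reindex, assuming an index named `logs_new` (add `-H 'Content-Type: application/json'` on 6.x+):

```sh
# before the bulk load: no replicas, no periodic refresh
curl -XPUT 'localhost:9200/logs_new/_settings' -d '{
  "index": { "number_of_replicas": 0, "refresh_interval": "-1" }
}'

# ... run the reindex / bulk load ...

# afterwards: restore replicas and the default 1s refresh
curl -XPUT 'localhost:9200/logs_new/_settings' -d '{
  "index": { "number_of_replicas": 1, "refresh_interval": "1s" }
}'
```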
Performance: #
- You can manually specify the indexes to query: `localhost:9200/index1,index2/_search`
- Use an index per timeframe, and move an alias over the indexes as time advances:

```sh
curl -XPOST 'localhost:9200/_aliases' -d '{
  "actions": [
    { "add":    { "index": "2016_03", "alias": "last_2_months" } },
    { "remove": { "index": "2016_01", "alias": "last_2_months" } }
  ]
}'
```

- Hitting multiple shards decreases read performance, even if they return no data.
- Deleting documents is expensive; delete whole indexes instead (`DELETE` vs. `TRUNCATE`, basically; see the sketch after this list).
- A large number of medium machines is preferred over fewer, tankier machines.
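A minimal sketch of dropping an expired time-based index instead of deleting its documents (index name illustrative). Deleting documents only marks them deleted and the space is reclaimed later during merges, while dropping the index simply removes its files:

```sh
# dropping a whole index is cheap compared to deleting its documents one by one
curl -XDELETE 'localhost:9200/2016_01'
```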
Monitoring: #
```sh
curl -XGET localhost:9200/_cat
```

Explore! You can get a subset of the data by listing the headings you want with `h=`:

```sh
curl -XGET 'localhost:9200/_cat/nodeattrs?h=host,ip,attr'
```

Add `?v` to also print the headings:

```sh
curl -XGET 'localhost:9200/_cat/nodeattrs?v'
```

Of course there is `?help` too.
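A couple of the `_cat` endpoints that are most useful for day-to-day monitoring, as a sketch (the columns below exist, but the exact set varies a little between versions):

```sh
# one-line cluster status: green/yellow/red, node count, shard counts
curl -XGET 'localhost:9200/_cat/health?v'

# per-index health, document count, and on-disk size
curl -XGET 'localhost:9200/_cat/indices?v&h=index,health,docs.count,store.size'
```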
Recommended software:
- Monitoring:
- IDE:
Best Practices: #
- Use “Client Nodes” for load balancing.
- Have three “Master Nodes”. No more, no less.
- Avoid `DELETE`; an update is just a `DELETE` followed by an `INSERT`.
- Never link nodes over a WAN.
- Try to separate client-to-server (C2S) HTTP and server-to-server (S2S) Transport traffic onto different NICs.
- Use long-lived HTTP connections.
- Increase open file descriptors (32k, 64k, or even unlimited).
- Never allow servers to swap; the advice is to turn off swap entirely, and you can also set `bootstrap.mlockall: true`.
- ES heap size should be 50% of memory, and never more than `30.5G` (above roughly `31G` Java stops using compressed object pointers and falls back to 64-bit pointers; see the heap sketch after this list).
- Ingestion nodes should have SSDs.
- Ensure the noop or deadline I/O scheduler on SSD-backed nodes.
- Archive nodes can use HDDs, but ES is not configured by default to handle HDDs.
- Ensure `index.merge.scheduler.max_thread_count: 1` on HDD-backed nodes.
- RAID0 is OK, even good.
- Thread exhaustion does not mean “increase threads”; increase data nodes instead. (Error: 429)
- Minor versions are fine with a rolling restart. Major versions require cluster downtime.
- Track the `took` time of common queries (see the sketch after this list).
- Round-Trip-Time (“RTT”) should scale the same as `took` time; a disparity means network issues (client-to-server, or server-to-server).
- DO NOT USE AN ALTERNATE GARBAGE COLLECTOR; data corruption will ensue.
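A minimal sketch of the heap and swap advice above, assuming 2.x-era packaging where the heap is set with the `ES_HEAP_SIZE` environment variable (on 5.x+ the same values go in `jvm.options` as `-Xms`/`-Xmx`, and `bootstrap.mlockall` was renamed `bootstrap.memory_lock`); the 30g figure assumes a machine with around 64GB of RAM, and the config path assumes a deb/rpm install:

```sh
# 50% of RAM, capped below the compressed-pointer threshold
export ES_HEAP_SIZE=30g

# keep the heap locked in RAM so it can never be swapped out
echo 'bootstrap.mlockall: true' >> /etc/elasticsearch/elasticsearch.yml
```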
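And a sketch of tracking `took` against wall-clock RTT for a common query (the index name and query are illustrative; `took` is the server-side time in milliseconds reported in every search response):

```sh
# server-side time is the "took" field; wall-clock RTT comes from `time` around the curl call
time curl -s -XGET 'localhost:9200/last_2_months/_search?q=status:error&size=0' | grep -o '"took":[0-9]*'
```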