{"id":11311,"date":"2013-03-21T23:56:00","date_gmt":"2013-03-21T22:56:00","guid":{"rendered":"http:\/\/zygm0nt.github.com\/blog\/2013\/03\/21\/zookeeper-tips"},"modified":"2022-08-01T09:15:29","modified_gmt":"2022-08-01T07:15:29","slug":"operational-problems-with-zookeeper","status":"publish","type":"post","link":"https:\/\/touk.pl\/blog\/2013\/03\/21\/operational-problems-with-zookeeper\/","title":{"rendered":"Operational problems with ZooKeeper"},"content":{"rendered":"<p>This post summarizes the talk given by Kathleen Ting at the StrangeLoop conference. You can watch the original here:<br \/>\n<a href=\"http:\/\/www.infoq.com\/presentations\/Misconfiguration-ZooKeeper\">http:\/\/www.infoq.com\/presentations\/Misconfiguration-ZooKeeper<\/a><\/p>\n<p>I&#8217;ve decided to put this selection here for quick reference.<\/p>\n<h2 id=\"connection-mismanagement\">Connection mismanagement<\/h2>\n<ul>\n<li>too many connections\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">WARN [NIOServerCxn.Factory: 0.0.0.0\/0.0.0.0:2181:NIOServerCnxn$Factory@247] - Too many connections from \/xx.x.xx.xxx - max is 60<\/pre>\n<\/li>\n<li>running out of ZK connections?\n<ul>\n<li>set <code>maxClientCnxns=200<\/code> in <code>zoo.cfg<\/code><\/li>\n<\/ul>\n<\/li>\n<li>HBase client leaking connections?\n<ul>\n<li>fixed in HBASE-3777, HBASE-4773, HBASE-5466<\/li>\n<li>manually close connections<\/li>\n<\/ul>\n<\/li>\n<li>connection closes prematurely\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">ERROR: org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is able to connect to ZooKeeper but the connection closes immediately.<\/pre>\n<ul>\n<li>in <code>hbase-site.xml<\/code> set <code>hbase.zookeeper.recoverable.waittime=30000ms<\/code><\/li>\n<\/ul>\n<\/li>\n<li>Pig hangs connecting to HBase\n<p><strong>CAUSE:<\/strong> location of ZK quorum is not known to Pig<\/p>\n<ul>\n<li>use Pig 0.10, which includes PIG-2115<\/li>\n<li>if there 
is an overlap between TaskTrackers and ZK quorum nodes\n<ul>\n<li>mark <code>hbase.zookeeper.quorum<\/code> as final in <code>hbase-site.xml<\/code><\/li>\n<li>otherwise add <code>hbase.zookeeper.quorum=hadoophbasemaster.lan:2181<\/code> in <code>pig.properties<\/code><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused!<\/pre>\n<h2 id=\"time-mismanagement\">Time mismanagement<\/h2>\n<ul>\n<li>client session timed out\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session &lt;id&gt;, timeout of 40000ms exceeded<\/pre>\n<ul>\n<li>ZK and HBase need the same session timeout values\n<ul>\n<li><code>zoo.cfg<\/code>: <code>maxSessionTimeout=180000<\/code><\/li>\n<li><code>hbase-site.xml<\/code>: <code>zookeeper.session.timeout=180000<\/code><\/li>\n<\/ul>\n<\/li>\n<li>don&#8217;t co-locate ZK with an IO-intense DataNode or RegionServer<\/li>\n<li>specify the right amount of heap and tune GC flags\n<ul>\n<li>turn on parallel\/CMS\/incremental GC<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>clients lose connections\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">WARN org.apache.zookeeper.ClientCnxn - Session &lt;id&gt; for server &lt;name&gt;, unexpected error, closing socket connection and attempting reconnect java.io.IOException: Broken pipe<\/pre>\n<ul>\n<li>don&#8217;t use an SSD drive for the ZK transaction log<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2 id=\"disk-management\">Disk management<\/h2>\n<ul>\n<li>unable to load database &#8211; unable to run quorum server\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">FATAL Unable to load database on disk !  
java.io.IOException: Failed to process transaction type: 2 error: KeeperErrorCode = NoNode for &lt;file&gt; at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:152)!<\/pre>\n<ul>\n<li>archive and wipe <code>\/var\/zookeeper\/version-2<\/code> if the other two ZK servers are running<\/li>\n<\/ul>\n<\/li>\n<li>unable to load database &#8211; unreasonable length exception\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">FATAL Unable to load database on disk java.io.IOException: Unreasonable length = 1048583 at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:100)<\/pre>\n<ul>\n<li>the server allows a client to set data larger than the server can read back from disk<\/li>\n<li>if a znode is not readable, increase <code>jute.maxbuffer<\/code>\n<ul>\n<li>look for <code>\"Packet len &lt;xx&gt; is out of range\"<\/code> in the client log<\/li>\n<li>increase it by 20%<\/li>\n<li>set it via <code>JVMFLAGS=\"-Djute.maxbuffer=yy\" bin\/zkCli.sh<\/code><\/li>\n<li>fixed in ZOOKEEPER-151<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>failure to follow leader\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">WARN org.apache.zookeeper.server.quorum.Learner: Exception when following the leader java.net.SocketTimeoutException: Read timed out<\/pre>\n<p><strong>CAUSE:<\/strong><\/p>\n<ul>\n<li>disk IO contention, network issues<\/li>\n<li>ZK snapshot is too large (lots of znodes)<\/li>\n<\/ul>\n<p><strong>SOLUTION:<\/strong><\/p>\n<ul>\n<li>reduce IO contention by putting dataDir on a dedicated spindle<\/li>\n<li>increase initLimit on all ZK servers and restart, see ZOOKEEPER-1521<\/li>\n<li>monitor the network<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2 id=\"best-practices\">Best Practices<\/h2>\n<p><strong>DOs<\/strong><\/p>\n<ul>\n<li>separate spindles for dataDir &amp; dataLogDir<\/li>\n<li>allocate 3 or 5 ZK servers<\/li>\n<li>tune garbage collection<\/li>\n<li>run the zkCleanup.sh script via 
cron<\/li>\n<\/ul>\n<p><strong>DON&#8217;Ts<\/strong><\/p>\n<ul>\n<li>don&#8217;t co-locate ZK with an I\/O-intense DataNode or RegionServer<\/li>\n<li>don&#8217;t use an SSD drive for the ZK transaction log<\/li>\n<\/ul>\n<p>You may run a ZooKeeper server as an observer &#8211; a non-voting member:<\/p>\n<ul>\n<li>in <code>zoo.cfg<\/code>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">peerType=observer<\/pre>\n<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"This post summarizes the talk given by Kathleen Ting at the StrangeLoop conference. You can watch the original here:\nhttp:\/\/www.infoq.com\/presentations\/Misconfiguration-ZooKeeper\nI&#8217;ve decided to put this selection here for quick reference.\n&#8230;\n","protected":false},"author":11,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11],"tags":[],"class_list":{"0":"post-11311","1":"post","2":"type-post","3":"status-publish","4":"format-standard","6":"category-development-design"},"_links":{"self":[{"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/posts\/11311","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/users\/11"}],"replies":[{"embeddable":true,"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/comments?post=11311"}],"version-history":[{"count":3,"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/posts\/11311\/revisions"}],"predecessor-version":[{"id":14731,"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/pos
ts\/11311\/revisions\/14731"}],"wp:attachment":[{"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/media?parent=11311"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/categories?post=11311"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/tags?post=11311"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}