[:en] Operational problems with Zookeeper

This post is a summary of what has been presented by Kathleen Ting on StrangeLoop conference. You can watch the original here: http://www.infoq.com/presentations/Misconfiguration-ZooKeeper I’ve decided to put this selection here for quick reference. …This post is a summary of what has been presented by Kathleen Ting on StrangeLoop conference. You can watch the original here: http://www.infoq.com/presentations/Misconfiguration-ZooKeeper I’ve decided to put this selection here for quick reference. …

This post is a summary of what has been presented by Kathleen Ting on
StrangeLoop conference. You can watch the original here:
http://www.infoq.com/presentations/Misconfiguration-ZooKeeper

I’ve decided to put this selection here for quick reference.

Connection mismanagement

  • too many connections
    WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@247] - Too many connections from /xx.x.xx.xxx - max is 60
    
  • running out of ZK connections?
    • set maxClientCnxns=200 in zoo.cfg
  • HBase client leaking connections?
    • fixed in HBASE-3777, HBASE-4773, HBASE-5466
    • manually close connections
  • connection closes prematurely
    ERROR: org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is able to connect to ZooKeeper but the connection closes immediately.
  • in hbase-site.xml set hbase.zookeeper.recoverable.waittime=30000ms
  • pig hangs connecting to HBase

    CAUSE: location of ZK quorum is not known to Pig

    • use Pig 10, which includes PIG-2115
    • if there is an overlap between TaskTrackers and ZK quorum nodes
      • set hbase.zookeeper.quorum to final in hbase-site.xml
      • otherwise add hbaze.zoopeeker.quorum=hadoophbasemaster.lan:2181 in pig.properties
WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectionException: Connection refused!

Time mismanagement

  • client session timed out
INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session <id>, timeout of 40000ms exceeded
    • ZK and HBase need the same session timeout values
      • zoo.cfg: maxSession=Timeout=180000
      • hbase-site.xml: zookeeper.session.timeout=180000
    • don’t co-locate ZK with IO-intense DataNode or RegionServer
    • specify right amount of heap and tune GC flags
      • turn on parallel/CMS/incremental GC
  •  
  • clients lose connections
    WARN org.apache.zookeeper.ClientCnxn - Session <id> for server <name>, unexpected error, closing socket connection and attempting reconnect java.io.IOException: Broken pipe
    
    • don’t use SSD drive for ZK transaction log

Disk management

  • unable to load database – unable to run quorum server
    FATAL Unable to load database on disk !  java.io.IOException: Failed to process transaction type: 2 error: KeeperErrorCode = NoNode for <file> at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:152)!
    
    • archive and wipe /var/zookeeper/version-2 if other two ZK servers
      are running
  • unable to load database – unreasonable length exception
    FATAL Unable to load database on disk java.io.IOException: Unreasonable length = 1048583 at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:100)
    • server allows a client to set data larger than the server can read from disk
    • if a znode is not readable, increase jute.maxbuffer
      • look for "Packet len <xx> is out of range" in the client log
      • increase it by 20%
      • set in JVMFLAGS="-Djute.maxbuffer=yy" bin/zkCli.sh
      • fixed in ZOOKEEPER-151
  • failure to follow leader
    WARN org.apache.zookeeper.server.quorum.Learner: Exception when following the leader java.net.SocketTimeoutException: Read timed out

    CAUSE:

    • disk IO contention, network issues
    • ZK snapshot is too large (lots of ZK nodes)

    SOLVE:

    • reduce IO contention by putting dataDir on dedicated spindle
    • increase initLimit on all ZK servers and restart, see
      ZOOKEEPER-1521
    • monitor network

Best Practices

DOs

  • separate spindles for dataDir & dataLogDir
  • allocate 3 or 5 ZK servers
  • tune garbage collection
  • run zkCleanup.sh script via cron

DON’Ts

  • dont’ co-locate ZK with I/O intense DataNode or RegionServer
  • don’t use SSD drive for ZK transaction log

You may use Zookeeper as an observer – a non-voting member:

  • in zoo.cfg
    peerType=observer
You May Also Like

EhCache config with BeanUtils

BeanUtils allows you to set Bean properties.If you have configuration stored in a Map it's tempting to use BeanUtils to automagically setup EhCache configuration.Sadly this class has mixed types in setters and getter and thus BeanUtils that use Introsp...

CasperJS for Java developers

Why CasperJS

Being a Java developer is kinda hard these days. Java may not be dead yet, but when keeping in sync with all the hipster JavaScript frameworks could make us feel a bit outside the playground. It’s even hard to list JavaScript frameworks with latest releases on one website.

In my current project, we are using AngularJS. It’a a nice abstraction of MV* pattern in frontend layer of any web application (we use Grails underneath). Here is a nice article with an 8-point Win List of Angular way of handling AJAX calls and updating the view. So it’s not only a funny new framework but a truly helper of keeping your code clean and neat.

But there is also another area when you can put helpful JS framework in place of plan-old-java one - functional tests. Especially when you are dealing with one page app with lots of asynchronous REST/JSON communication.

Selenium and Geb

In Java/JVM project the typical is to use Selenium with some wrapper like Geb. So you start your project, setup your CI-functional testing pipeline and… after 1 month of coding your tests stop working and being maintainable. The frameworks itselves are not bad, but the typical setup is so heavy and has so many points of failure that keeping it working in a real life project is really hard.

Here is my list of common myths about Selenium: * It allows you to record test scripts via handy GUI - maybe some static request/response sites. In modern web applications with asynchronous REST/JSON communication your tests must contain a lot of “waitFor” statements and you cannot automate where these should be included. * It allows you to test your web app against many browsers - don’t try to automate IE tests! You have to manually open your app in IE to see how it actually bahaves! * It integrates well with continuous integration servers like Jenkins - you have to setup Selenium Grid on server with X installed to run tests on Chrome or Firefox and a Windows server for IE. And the headless HtmlUnit driver lacks a lot of JS support.

So I decided to try something different and introduce a bit of JavaScript tooling in our project by using CasperJS.

Introduction

CasperJS is simple but powerful navigation scripting & testing utility for PhantomJS - scritable headless WebKit (which is an rendering engine used by Safari and Chrome). In short - CasperJS allows you to navigate and make assertions about web pages as they’d been rendered in Google Chrome. It is enough for me to automate the functional tests of my application.

If you want a gentle introduction to the world of CasperJS I suggest you to read: * Official website, especially installation guide and API * Introductionary article from CasperJS creator Nicolas Perriault * Highlevel testing with CasperJS by Kevin van Zonneveld * grails-angular-scaffolding plugin by Rob Fletcher with some working CasperJS tests

Full example

I run my test suite via following script:

casperjs test --direct --log-level=debug --testhost=localhost:8080 --includes=test/casper/includes/casper-angular.coffee,test/casper/includes/pages.coffee test/casper/specs/

casper-angular.coffe

casper.test.on "fail", (failure) ->
    casper.capture(screenshot)

testhost   = casper.cli.get "testhost"
screenshot = 'test-fail.png'

casper
    .log("Using testhost: #{testhost}", "info")
    .log("Using screenshot: #{screenshot}", "info")

casper.waitUntilVisible = (selector, message, callback) ->
    @waitFor ->
        @visible selector
    , callback, (timeout) ->
        @log("Selector [#{selector}] not visible, failing")
        withParentSelector selector, (parent) ->
            casper.log("Output of parent selector [#{parent}]")
            casper.debugHTML(parent)
        @echo message, "RED_BAR"
        @capture(screenshot)
        @test.fail(f("Wait timeout occured (%dms)", timeout))

withParentSelector = (selector, callback) ->
    if selector.lastIndexOf(" ") > 0
       parent = selector[0..selector.lastIndexOf(" ")-1]
       callback(parent)

Sample pages.coffee:

x = require('casper').selectXPath

class EditDocumentPage

    assertAt: ->
        casper.test.assertSelectorExists("div.customerAccountInfo", 'at EditDocumentPage')

    templatesTreeFirstCategory: 'ul.tree li label'
    templatesTreeFirstTemplate: 'ul.tree li a'
    closePreview: '.closePreview a'
    smallPreview: '.smallPreviewContent img'
    bigPreview: 'img.previewImage'
    confirmDelete: x("//div[@class='modal-footer']/a[1]")

casper.editDocument = new EditDocumentPage()

End a test script:

testhost = casper.cli.get "testhost" or 'localhost:8080'

casper.start "http://#{testhost}/app", ->
    @test.assertHttpStatus 302
    @test.assertUrlMatch /\/fakeLogin/, 'auto login'
    @test.assert @visible('input#Create'), 'mock login button'
    @click 'input#Create'

casper.then ->
    @test.assertUrlMatch /document#\/edit/, 'new document'
    @editDocument.assertAt()
    @waitUntilVisible @editDocument.templatesTreeFirstCategory, 'template categories not visible', ->
        @click @editDocument.templatesTreeFirstCategory
        @waitUntilVisible @editDocument.templatesTreeFirstTemplate, 'template not visible', ->
            @click @editDocument.templatesTreeFirstTemplate

casper.then ->
    @waitUntilVisible @editDocument.smallPreview, 'small preview not visible', ->
        # could be dblclick / whatever
        @mouseEvent('click', @editDocument.smallPreview)

casper.then ->
    @waitUntilVisible @editDocument.bigPreview, 'big preview should be visible', ->
        @test.assertEvalEquals ->
            $('.pageCounter').text()
        , '1/1', 'page counter should be visible'
        @click @editDocument.closePreview

casper.then ->
    @click 'button.cancel'
    @waitUntilVisible '.modal-footer', 'delete confirmation not visible', ->
        @click @editDocument.confirmDelete

casper.run ->
    @test.done()

Here is a list of CasperJS features/caveats used here:

  • Using CoffeeScript is a huge win for your test code to look neat
  • When using casper test command, beware of different (than above articles) logging setup. You can pass --direct --log-level=debug from commandline for best results. Logging is essential here since Phantom often exists without any error and you do want to know what just happened.
  • Extract your helper code into separate files and include them by using --includes switch.
  • When passing server URL as a commandline switch remember that in CoffeeScript variables are not visible between multiple source files (unless getting them via window object)
  • It’s good to override standard waitUntilVisible with capting a screenshot and making a proper log statement. In my version I also look for a parent selector and debugHTML the content of it - great for debugging what is actually rendered by the browser.
  • Selenium and Geb have a nice concept of Page Objects - an abstract models of pages rendered by your application. Using CoffeeScript you can write your own classes, bind selectors to properties and use then in your code script. Assigning the objects to casper instance will end up with quite nice syntax like @editDocument.assertAt().
  • There is some issue with CSS :first and :last selectors. I cannot get them working (but maybe I’m doing something wrong?). But in CasperJS you can also use XPath selectors which are fine for matching n-th child of some element (x("//div[@class='modal-footer']/a[1]")).
    Update: :first and :last are not CSS3 selectors, but JQuery ones. Here is a list of CSS3 selectors, all of these are supported by CasperJS. So you can use nth-child(1) is this case. Thanks Andy and Nicolas for the comments!

Working with CasperJS can lead you to a few hour stall, but after getting things working you have a new, cool tool in your box!