Recently at storm-users

I’ve been reading through storm-users Google Group recently. This resolution was heavily inspired by Adam Kawa’s post “Football zero, Apache Pig hero”. Since I’ve encountered a lot of insightful and very interesting information I’ve decided to describe some of those in this post. nimbus will work in HA mode – There’s a pull request open for it already… but some recent work (distributing topology files via Bittorrent) will greatly simplify the implementation. Once the Bittorrent work is done we’ll look at reworking the HA pull request. (storm’s pull request) pig on storm – Pig on Trident would be a cool and welcome project. Join and groupBy have very clear semantics there, as those concepts exist directly in Trident. The extensions needed to Pig are the concept of incremental, persistent state across batches (mirroring those concepts in Trident). You can read a complete proposal. implementing topologies in pure python with petrel looks like this: class Bolt(storm.BasicBolt): def initialize(self, conf, context): ''' This method executed only once ''' storm.log('initializing bolt') def process(self, tup): ''' This method executed every time a new tuple arrived ''' msg = tup.values[0] storm.log('Got tuple %s' %msg) if __name__ == "__main__": Bolt().run() Fliptop is happy with storm – see their presentation here topology metrics in 0.9.0: The new metrics feature allows you to collect arbitrarily custom metrics over fixed windows. Those metrics are exported to a metrics stream that you can consume by implementing IMetricsConsumer and configure with Config.java#L473. Use TopologyContext#registerMetric to register new metrics. storm vs flume – some users’ point of view: I use Storm and Flume and find that they are better at different things – it really depends on your use case as to which one is better suited. First and foremost, they were originally designed to do different things: Flume is a reliable service for collecting, aggregating, and moving large amounts of data from source to destination (e.g. log data from many web servers to HDFS). Storm is more for real-time computation (e.g. streaming analytics) where you analyse data in flight and don’t necessarily land it anywhere. Having said that, Storm is also fault-tolerant and can write to external data stores (e.g. HBase) and you can do real-time computation in Flume (using interceptors) That’s all for this day – however, I’ll keep on reading through storm-users, so watch this space for more info on storm development. I’ve been reading through storm-users Google Group recently. This resolution was heavily inspired by Adam Kawa’s post “Football zero, Apache Pig hero”. Since I’ve encountered a lot of insightful and very interesting information I’ve decided to describe some of those in this post. nimbus will work in HA mode – There’s a pull request open for it already… but some recent work (distributing topology files via Bittorrent) will greatly simplify the implementation. Once the Bittorrent work is done we’ll look at reworking the HA pull request. (storm’s pull request) pig on storm – Pig on Trident would be a cool and welcome project. Join and groupBy have very clear semantics there, as those concepts exist directly in Trident. The extensions needed to Pig are the concept of incremental, persistent state across batches (mirroring those concepts in Trident). You can read a complete proposal. implementing topologies in pure python with petrel looks like this: class Bolt(storm.BasicBolt): def initialize(self, conf, context): ''' This method executed only once ''' storm.log('initializing bolt') def process(self, tup): ''' This method executed every time a new tuple arrived ''' msg = tup.values[0] storm.log('Got tuple %s' %msg) if __name__ == "__main__": Bolt().run() Fliptop is happy with storm – see their presentation here topology metrics in 0.9.0: The new metrics feature allows you to collect arbitrarily custom metrics over fixed windows. Those metrics are exported to a metrics stream that you can consume by implementing IMetricsConsumer and configure with Config.java#L473. Use TopologyContext#registerMetric to register new metrics. storm vs flume – some users’ point of view: I use Storm and Flume and find that they are better at different things – it really depends on your use case as to which one is better suited. First and foremost, they were originally designed to do different things: Flume is a reliable service for collecting, aggregating, and moving large amounts of data from source to destination (e.g. log data from many web servers to HDFS). Storm is more for real-time computation (e.g. streaming analytics) where you analyse data in flight and don’t necessarily land it anywhere. Having said that, Storm is also fault-tolerant and can write to external data stores (e.g. HBase) and you can do real-time computation in Flume (using interceptors) That’s all for this day – however, I’ll keep on reading through storm-users, so watch this space for more info on storm development.

I’ve been reading through storm-users Google Group recently. This
resolution was heavily inspired by Adam Kawa’s post “Football zero, Apache Pig hero”. Since I’ve encountered a lot of insightful and very interesting information I’ve decided to describe some of those in this post.

  • nimbus will work in HA mode – There’s a pull request open for it already… but some
    recent work (distributing topology files via Bittorrent) will greatly
    simplify the implementation. Once the Bittorrent work is done we’ll look
    at reworking the HA pull request. (storm’s pull request)

  • pig on storm – Pig on Trident would be a cool and welcome project. Join
    and groupBy have very clear semantics there, as those concepts exist
    directly in Trident. The extensions needed to Pig are the concept of
    incremental, persistent state across batches (mirroring those concepts
    in Trident). You can read a complete proposal.

  • implementing topologies in pure python with petrel looks like this:

class Bolt(storm.BasicBolt):
    def initialize(self, conf, context):
       ''' This method executed only once '''
        storm.log('initializing bolt')

    def process(self, tup):
       ''' This method executed every time a new tuple arrived '''       
       msg = tup.values[0]
       storm.log('Got tuple %s' %msg)

if __name__ == "__main__":
    Bolt().run()
  • Fliptop is happy with storm – see their presentation here

  • topology metrics in 0.9.0: The new metrics feature allows you to collect
    arbitrarily custom metrics over fixed windows. Those metrics are
    exported to a metrics stream that you can consume by implementing
    IMetricsConsumer and configure with Config.java#L473.
    Use TopologyContext#registerMetric to register new metrics.

  • storm vs flume – some users’ point of view: I use Storm and Flume and find that they are better at
    different things – it really depends on your use case as to which one is
    better suited. First and foremost, they were originally designed to do
    different things: Flume is a reliable service for collecting,
    aggregating, and moving large amounts of data from source to destination
    (e.g. log data from many web servers to HDFS). Storm is more for
    real-time computation (e.g. streaming analytics) where you analyse data
    in flight and don’t necessarily land it anywhere. Having said that,
    Storm is also fault-tolerant and can write to external data stores (e.g.
    HBase) and you can do real-time computation in Flume (using
    interceptors)

That’s all for this day – however, I’ll keep on reading through storm-users, so watch this space for more info on storm development.

You May Also Like

mount.ntfs high cpu ubuntu

My computer suffers from sudden and continous hard drive load strokes. Sometimes it lasts for a few minutes and hence work is impossible because everything goes very slow.I'm trying to locate the cause because it makes me nervous :)Today I found one of...

Spock, Java and Maven

Few months ago I've came across Groovy - powerful language for JVM platform which combines the power of Java with abilities typical for scripting languages (dynamic typing, metaprogramming).

Together with Groovy I've discovered spock framework (https://code.google.com/p/spock/) - specification framework for Groovy (of course you can test Java classes too!). But spock is not only test/specification framework - it also contains powerful mocking tools.

Even though spock is dedicated for Groovy there is no problem with using it for Java classes tests. In this post I'm going to describe how to configure Maven project to build and run spock specifications together with traditional JUnit tests.


Firstly, we need to prepare pom.xml and add necessary dependencies and plugins.

Two obligatory libraries are:
<dependency>
<groupid>org.spockframework</groupId>
<artifactid>spock-core</artifactId>
<version>0.7-groovy-2.0</version>
<scope>test</scope>
</dependency>
<dependency>
<groupid>org.codehaus.groovy</groupId>
<artifactid>groovy-all</artifactId>
<version>${groovy.version}</version>
<scope>test</scope>
</dependency>
Where groovy.version is property defined in pom.xml for more convenient use and easy version change, just like this:
<properties>
<gmaven-plugin.version>1.4</gmaven-plugin.version>
<groovy.version>2.1.5</groovy.version>
</properties>

I've added property for gmaven-plugin version for the same reason ;)

Besides these two dependencies, we can use few additional ones providing extra functionality:
  • cglib - for class mocking
  • objenesis - enables mocking classes without default constructor
To add them to the project put these lines in <dependencies> section of pom.xml:
<dependency>
<groupid>cglib</groupId>
<artifactid>cglib-nodep</artifactId>
<version>3.0</version>
<scope>test</scope>
</dependency>
<dependency>
<groupid>org.objenesis</groupId>
<artifactid>objenesis</artifactId>
<version>1.3</version>
<scope>test</scope>
</dependency>

And that's all for dependencies section. Now we will focus on plugins necessary to compile Groovy classes. We need to add gmaven-plugin with gmaven-runtime-2.0 dependency in plugins section:
<plugin>
<groupid>org.codehaus.gmaven</groupId>
<artifactid>gmaven-plugin</artifactId>
<version>${gmaven-plugin.version}</version>
<configuration>
<providerselection>2.0</providerSelection>
</configuration>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<dependencies>
<dependency>
<groupid>org.codehaus.gmaven.runtime</groupId>
<artifactid>gmaven-runtime-2.0</artifactId>
<version>${gmaven-plugin.version}</version>
<exclusions>
<exclusion>
<groupid>org.codehaus.groovy</groupId>
<artifactid>groovy-all</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupid>org.codehaus.groovy</groupId>
<artifactid>groovy-all</artifactId>
<version>${groovy.version}</version>
</dependency>
</dependencies>
</plugin>

With these configuration we can use spock and write our first specifications. But there is one issue: default settings for maven-surefire plugin demand that test classes must end with "..Test" postfix, which is ok when we want to use such naming scheme for our spock tests. But if we want to name them like CommentSpec.groovy or whatever with "..Spec" ending (what in my opinion is much more readable) we need to make little change in surefire plugin configuration:
<plugin>
<groupid>org.apache.maven.plugins</groupId>
<artifactid>maven-surefire-plugin</artifactId>
<version>2.15</version>
<configuration>
<includes>
<include>**/*Test.java</include>
<include>**/*Spec.java</include>
</includes>
</configuration>
</plugin>

As you can see there is a little trick ;) We add include directive for standard Java JUnit test ending with "..Test" postfix, but there is also an entry for spock test ending with "..Spec". And there is a trick: we must write "**/*Spec.java", not "**/*Spec.groovy", otherwise Maven will not run spock tests (which is strange and I've spent some time to figure out why Maven can't run my specs).

Little update: instead of "*.java" postfix for both types of tests we can write "*.class" what is in my opinion more readable and clean:
<include>**/*Test.class</include>
<include>**/*Spec.class</include>
(thanks to Tomek Pęksa for pointing this out!)

With such configuration, we can write either traditional JUnit test and put them in src/test/java directory or groovy spock specifications and place them in src/test/groovy. And both will work together just fine :) In one of my next posts I'll write something about using spock and its mocking abilities in practice, so stay in tune.