Simple HBase ORM

When dealing with data stored in HBase, you quickly come to the
conclusion that it is extremely inconvenient to reach it via the
native HBase API. It is very verbose, and you always need to convert
between bytes and simple types – a pain.

While I was working on a project of mine, I thought: why not ease
those pains and fetch real objects from HBase?

And that’s how this simplistic, hackish ORM came to life. It is no match
for projects like Kundera
(a JPA-compliant solution) or n-orm. Nevertheless, it just suits my needs :)

Project sources are hosted on GitHub: https://github.com/zygm0nt/hbase-annotations

To make use of this, you need to have an entity class with annotations:

  • @Column – with an argument specifying the column family and column
    name, e.g. @Column("cf:column-name")
  • @Id – will store the row key in this property,
  • and optionally @Value – for a Spring Expression Language expression,
    in case you need to perform some extraction on the value.

Annotations should be on setter methods.
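
For illustration, a minimal annotated entity might look like this (the
Book class and its column names are made up for this sketch, not taken
from the project):

public class Book {

    private String id;
    private String title;
    private String author;

    // the row key will be stored in this property
    @Id
    public void setId(String id) { this.id = id; }

    // column "title" from column family "cf"
    @Column("cf:title")
    public void setTitle(String title) { this.title = title; }

    // column "author" from the same column family
    @Column("cf:author")
    public void setAuthor(String author) { this.author = author; }
}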

Now you have your model annotated and ready to be fetched from HBase.

The actual work is done by a service class that should extend
BaseHadoopInteraction, just as SimpleHBaseClient does.

Then it is possible to just call:
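
Something along these lines – the client construction and the method
name below are my assumptions for illustration, not necessarily the
actual API:

SimpleHBaseClient client = new SimpleHBaseClient();
// fetch a single entity by row key (hypothetical method name)
Book book = client.fetchById("some-row-key", Book.class);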

Note that there are more methods you can use on BaseHadoopInteraction.
You can also do the following (sketched after the list):

  • scan
  • scan with key ranges
  • delete
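
Again a rough sketch, with assumed method signatures:

// scan all rows mapped to Book (hypothetical signature)
List<Book> books = client.scan(Book.class);

// scan a range of row keys
List<Book> someBooks = client.scan("key-from", "key-to", Book.class);

// delete a row by key
client.delete("some-row-key", Book.class);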

What you won’t get from this simple ORM is:

  • automatic object updating,
  • nested objects,
  • saving to HBase – ’cause I didn’t have a need for that.

Hope you’ll find this piece of code useful. If you see room for
improvement while staying within the project’s scope – please drop me a
message.

And if you are searching for a full-fledged ORM solution for interacting with HBase, just head
straight to the Kundera project website :)

You May Also Like

Recently at storm-users

I've been reading through the storm-users Google Group recently. This resolution was heavily inspired by Adam Kawa's post "Football zero, Apache Pig hero". Since I've encountered a lot of insightful and very interesting information, I've decided to describe some of it in this post.

  • nimbus will work in HA mode - There's a pull request open for it already... but some recent work (distributing topology files via BitTorrent) will greatly simplify the implementation. Once the BitTorrent work is done we'll look at reworking the HA pull request. (storm’s pull request)

  • pig on storm - Pig on Trident would be a cool and welcome project. Join and groupBy have very clear semantics there, as those concepts exist directly in Trident. The extensions needed to Pig are the concept of incremental, persistent state across batches (mirroring those concepts in Trident). You can read a complete proposal.

  • implementing topologies in pure python with petrel looks like this:

class Bolt(storm.BasicBolt):
    def initialize(self, conf, context):
        ''' This method is executed only once '''
        storm.log('initializing bolt')

    def process(self, tup):
        ''' This method is executed every time a new tuple arrives '''
        msg = tup.values[0]
        storm.log('Got tuple %s' % msg)

if __name__ == "__main__":
    Bolt().run()

  • Fliptop is happy with storm - see their presentation here

  • topology metrics in 0.9.0: The new metrics feature allows you to collect arbitrary custom metrics over fixed windows. Those metrics are exported to a metrics stream that you can consume by implementing IMetricsConsumer and configure with Config.java#L473. Use TopologyContext#registerMetric to register new metrics (see the sketch after this list).

  • storm vs flume - some users' point of view: I use Storm and Flume and find that they are better at different things - it really depends on your use case as to which one is better suited. First and foremost, they were originally designed to do different things: Flume is a reliable service for collecting, aggregating, and moving large amounts of data from source to destination (e.g. log data from many web servers to HDFS). Storm is more for real-time computation (e.g. streaming analytics) where you analyse data in flight and don't necessarily land it anywhere. Having said that, Storm is also fault-tolerant and can write to external data stores (e.g. HBase) and you can do real-time computation in Flume (using interceptors).
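
For illustration, registering and updating a counter metric in a bolt
might look like this (based on the Storm 0.9.x metrics API; the bolt
itself is made up):

import java.util.Map;

import backtype.storm.metric.api.CountMetric;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class CountingBolt extends BaseRichBolt {

    private transient CountMetric tupleCount;
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // report the counter to the metrics stream every 60 seconds
        tupleCount = context.registerMetric("tuple_count", new CountMetric(), 60);
    }

    @Override
    public void execute(Tuple tuple) {
        tupleCount.incr();
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // this bolt emits no tuples
    }
}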

That's all for this day - however, I'll keep on reading through storm-users, so watch this space for more info on storm development.
