Super Confitura Man

How Super Confitura Man came to be :)

Recently at TouK we had a one-day hackathon. There was no main theme; you could just post a project idea, gather people around it and hack on that idea for a whole day - drinks and pizza included.

My main idea was to create something that would be fun to build and somehow useful to others. I figured that since Confitura was just around the corner, I could make a game that would be playable at TouK’s booth at the conference venue. This idea seemed good enough to attract Rafał Nowak @RNowak3 and Marcin Jasion @marcinjasion - two TouK employees who, together with me, formed a team for the hackathon.

Confitura 01

The initial plan was to develop a simple Mario-style game, with procedurally generated levels, random collectible items and enemies. One of the ideas was to introduce Confitura Man as the main character, but due to time constraints this fell through. We decided to just pick a random available sprite for the character - hence the onion man :)

Confitura 02

How is the game played?

Since we wanted to have a scoreboard and unique users, we printed out QR codes. A person who wanted to play the game could pick up a QR code and show it to a camera attached to the play booth. The start page scanned the QR code and launched the game with the username read from the paper code.

The rest of the game was playable with a gamepad or keyboard.

Confitura game screen

Technicalities

Writing a game takes a lot of time and effort. We wanted to deliver, so we decided to spend some time in the days before the hackathon just to bootstrap the technology stack of our enterprise.

We decided that the game would be written with some JavaScript-based engine, with Google Chrome as the web platform. There are a lot of HTML5 game engines - list of html5 game engines - and you could easily create a game with each and every one of them. We chose Phaser IO, which handles a lot of the difficult, game-related stuff on its own. So we didn’t have to worry about physics, loading and storing assets, animations, object collisions, or controls input/output. Go see for yourself, it is really nice and easy to use.

The scoreboard would be a rip-off of JIRA Survivor, with stats served from some web server app. To make things harder, the backend server was written in Clojure. With no experience with that language on the team it was a bit risky, but the server’s tasks were trivial, so if all that Clojure effort failed, it could be rewritten in something we knew.

Statistics

During the whole Confitura day there were 69 unique players (69 QR codes were used) and 1237 games were played. The final scores looked like this:

  1. Barister Lingerie 158 - 1450 points
  2. Boilerdang Custardbath 386 - 1060 points
  3. Benadryl Clarytin 306 - 870 points

And the obligatory scoreboard screenshot:

Confitura 03

Obstacles

The game, being created in just one day, was bound to have problems :) It wasn’t playtested enough and there were some rough edges. During the day we had to make a few fixes:

  • the server did not keep the highest score for a given user; it just overwrote a user’s score with the latest one,
  • one feature available on the gamepad - the turbo button - was not supported on the keyboard,
  • the server opened a new database connection for every request, so after around 5 minutes it would exhaust the open file limit of MongoDB (the backend database); this was easily fixed - though the fix is a bit hackish :)

These were easily identified and fixed. Unfortunately, there were also issues that we were unable to fix while the event was on:

  • Google Chrome kept asking for permission to use the webcam - this was very annoying, and none of the advice found on the web worked - StackOverflow thread
  • it was hard to start the game with a QR code - either the codes were too small or the lighting around that area was inadequate - I think this issue could be fixed by printing larger codes,

Technology evaluation

All in all, we were pretty happy with the chosen stack. Phaser was easy to use and left us with just the fun parts of the game creation process. Finding the right graphics with appropriate licensing was rather hard, and we didn’t have enough time to polish all the visual aspects of the game before Confitura.

Writing the server in Clojure was the most challenging part, with all the new syntax and new libraries. There were tasks that are trivial in Java/Scala but hard in Clojure - at least for wimpy beginners :) Nevertheless, Clojure seems like a really handy tool and I’d like to dive deeper into its ecosystem.

Source code

All of the sources for the game can be found here: TouK/confitura-man.

The repository is split into two parts:

  • game - the HTML5 game
  • server - the Clojure-based backend server

To run the server you need a local MongoDB installation. Then, in the server’s directory, run: $ lein ring server-headless. This will start a server on http://localhost:3000

To run the game you need to install dependencies with bower and then run $ grunt from the game’s directory.

To launch the QR-reading part of the game, open http://localhost:9000/start.html. After scanning the code you’ll be redirected to http://localhost:9000/index.html - and the game starts.

Conclusion

Summing up, creating the game was a great experience, and it was fun to watch people playing it. Even with all those glitches and silly graphics, people played it vigorously, which was awesome.

Thanks to Rafał and Michał for a great coding experience, and thanks to all the players of our silly little game. If you’d like to ask me about anything, feel free to contact me by mail or Twitter @zygm0nt

Distributed scans with HBase

HBase is by design a columnar store optimized for random reads. You just ask for a row using its rowId as an identifier and you get your data instantaneously.

Performing a scan over part of a table, or the whole table, is a completely different thing. First of all, it is sequential, which means it is rather slow, because it doesn't use all the RegionServers at the same time. It is implemented that way to fulfil the contract of the Scan command, which has to return results sorted by key.

So, how to do this efficiently?

The usual way of getting data from HBase is through its API, mainly Scan objects, and that is exactly what I'll use to accomplish the task. I'll specify startRow and stopRow, so that each Scan request looks through only part of the key space.

It is crucial to note that this whole solution works because of HBase's key-sorting properties: HBase scans a table in ascending key order. Since keys are of String type, the key "1" is smaller than "2", because keys are sorted lexicographically; likewise the key "12345" is smaller than "2". I've leveraged this property and partitioned my whole key space according to the first character of the key. In my case keys contain only digits, so I have 10 ranges:

  • null-1
  • 1-2
  • 2-3
  • 3-4
  • 4-5
  • 5-6
  • 6-7
  • 7-8
  • 8-9
  • 9-null

The speedup comes from the fact that each range resides in its own partition. That's right, I've pre-split the table into 10 partitions. This corresponds rather nicely with my cluster's setup, because I have more than 10 RegionServers, so every partition should end up on a different RegionServer. This allows the code to perform the requested scan operations in parallel - giving exactly the performance boost I was after.

How I've created the input table:

$ create 'tariff_changes', { NAME => 'cf', SPLITS_FILE => 'splits.txt', VERSIONS => 50, MAX_FILESIZE => 1073741824 }

$ alter 'tariff_changes', { NAME => 'cf', SPLITS_FILE => 'splits.txt', VERSIONS => 50, MAX_FILESIZE => 1073741824 }

The split file is just something along these lines:

1
2
3
4
5
6
7
8
9
0

This tells HBase which row keys start and end each of the partitions.

OK, so after this rather lengthy introduction, what does the actual code do? It just spins off a few threads - one for each partition - and runs a Scan request tailored to that partition's key space. This way I got a 10x speedup for this particular scan: the execution time went down from 30 minutes to 3 for my use case.
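For illustration, here is a minimal sketch of the idea in Java, using the classic HTable client API. It assumes the pre-split tariff_changes table described above and simply counts rows per range; the implementation in the repository linked below may differ in details.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class DistributedScan {

    // split points matching the pre-split table: null-1, 1-2, ..., 9-null
    private static final String[] SPLITS = {"1", "2", "3", "4", "5", "6", "7", "8", "9"};

    public static void main(String[] args) throws Exception {
        final Configuration conf = HBaseConfiguration.create();
        ExecutorService pool = Executors.newFixedThreadPool(SPLITS.length + 1);
        List<Future<Long>> futures = new ArrayList<Future<Long>>();

        // one Scan per key range; an empty byte[] means an open start/stop row
        for (int i = 0; i <= SPLITS.length; i++) {
            final byte[] start = (i == 0) ? new byte[0] : Bytes.toBytes(SPLITS[i - 1]);
            final byte[] stop = (i == SPLITS.length) ? new byte[0] : Bytes.toBytes(SPLITS[i]);
            futures.add(pool.submit(new Callable<Long>() {
                public Long call() throws Exception {
                    HTable table = new HTable(conf, "tariff_changes");
                    try {
                        Scan scan = new Scan(start, stop);
                        scan.setCaching(1000); // fetch rows in larger batches
                        ResultScanner scanner = table.getScanner(scan);
                        long rows = 0;
                        for (Result r : scanner) {
                            rows++; // process the row here
                        }
                        scanner.close();
                        return rows;
                    } finally {
                        table.close();
                    }
                }
            }));
        }

        long total = 0;
        for (Future<Long> f : futures) {
            total += f.get();
        }
        pool.shutdown();
        System.out.println("Scanned rows: " + total);
    }
}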

I've created an example implementation of this idea. You can find it on GitHub: https://github.com/zygm0nt/hbase-distributed-search.

Any ideas on how to speed things up even more with HBase?

Simple HBase ORM

When dealing with data stored in HBase, you quickly come to the conclusion that it is extremely inconvenient to reach it via the native HBase API. It is very verbose, and you always need to convert between bytes and simple types - a pain.
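Just to illustrate the pain, a plain read of a single column with the native client API looks more or less like this (the table, row key and column names are made up for the example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class NativeApiExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "tariff_changes"); // example table name
        try {
            // every row key, family and qualifier has to be converted to bytes by hand
            Get get = new Get(Bytes.toBytes("12345"));
            Result result = table.get(get);
            byte[] raw = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("tariff-name"));
            String tariffName = Bytes.toString(raw);
            System.out.println(tariffName);
        } finally {
            table.close();
        }
    }
}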

While I was working on a project of mine, I thought: why not ease those pains and fetch real objects from HBase?

And that's how this simplistic, hackish ORM came to life. It is no match for projects like Kundera (a JPA compliant solution), or n-orm. Nevertheless, it just suits my needs :)

Project sources are hosted on GitHub: https://github.com/zygm0nt/hbase-annotations

To make use of this, you need to have an entity class with annotations:

  • @Column - with an argument specifying the column family and column name, e.g. @Column("cf:column-name"),
  • @Id - will store the row key in this property,
  • and optionally @Value - for Spring Expression Language, in case you need to perform some extraction on the value.

Annotations should be on setter methods.
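
A minimal, hypothetical entity could look like this - the annotation names come from the list above, while the class, property and package names are made up for the example (check the repository for the real imports):

import pl.touk.hbaseannotations.Column; // hypothetical package - see the repository for the real one
import pl.touk.hbaseannotations.Id;

// hypothetical entity mapped onto the 'cf' column family
public class TariffChange {

    private String rowKey;
    private String tariffName;

    @Id
    public void setRowKey(String rowKey) {
        this.rowKey = rowKey;
    }

    @Column("cf:tariff-name")
    public void setTariffName(String tariffName) {
        this.tariffName = tariffName;
    }

    public String getRowKey() { return rowKey; }
    public String getTariffName() { return tariffName; }
}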

Now you have your model annotated and ready to be fetched from HBase.

The actual work is done by a service class that should extend BaseHadoopInteraction, just as the SimpleHBaseClient class does.

Then it is possible to just call something along these lines (the exact method names may differ - check the repository):
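
// hypothetical usage - SimpleHBaseClient extends BaseHadoopInteraction as described above;
// the fetch method name and signature are assumptions made for this sketch
SimpleHBaseClient client = new SimpleHBaseClient();

// fetch a single annotated object by its row key
TariffChange change = client.fetch("12345", TariffChange.class);
System.out.println(change.getTariffName());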

Note that there are more methods you can use on BaseHadoopInteraction. You can also do:

  • scan
  • scan with key ranges
  • delete

What you won't get from this simple ORM is:

  • automatic object updating,
  • nested objects,
  • saving to HBase - 'cause I didn't have a need for that,

Hope you'll find this piece of code useful. If you see room for improvements while staying in project's scope - please drop me a message.

And if you are searching for a full-fledged ORM solution for interacting with HBase, just head straight to the Kundera project website :)

Recently at storm-users

I've been reading through the storm-users Google Group recently - a resolution heavily inspired by Adam Kawa's post "Football zero, Apache Pig hero". Since I've encountered a lot of insightful and very interesting information, I've decided to describe some of it in this post.

  • nimbus will work in HA mode - There's a pull request open for it already... but some recent work (distributing topology files via Bittorrent) will greatly simplify the implementation. Once the Bittorrent work is done we'll look at reworking the HA pull request. (storm’s pull request)

  • pig on storm - Pig on Trident would be a cool and welcome project. Join and groupBy have very clear semantics there, as those concepts exist directly in Trident. The extensions needed to Pig are the concept of incremental, persistent state across batches (mirroring those concepts in Trident). You can read a complete proposal.

  • implementing topologies in pure python with petrel looks like this:

import storm  # Storm's Python multilang helper module (storm.py), used by petrel

class Bolt(storm.BasicBolt):
    def initialize(self, conf, context):
        ''' This method is executed only once '''
        storm.log('initializing bolt')

    def process(self, tup):
        ''' This method is executed every time a new tuple arrives '''
        msg = tup.values[0]
        storm.log('Got tuple %s' % msg)

if __name__ == "__main__":
    Bolt().run()
  • Fliptop is happy with storm - see their presentation here

  • topology metrics in 0.9.0: The new metrics feature allows you to collect arbitrary custom metrics over fixed windows. Those metrics are exported to a metrics stream that you can consume by implementing IMetricsConsumer and configure with Config.java#L473. Use TopologyContext#registerMetric to register new metrics (see the sketch after this list).

  • storm vs flume - some users' point of view: I use Storm and Flume and find that they are better at different things - it really depends on your use case as to which one is better suited. First and foremost, they were originally designed to do different things: Flume is a reliable service for collecting, aggregating, and moving large amounts of data from source to destination (e.g. log data from many web servers to HDFS). Storm is more for real-time computation (e.g. streaming analytics) where you analyse data in flight and don't necessarily land it anywhere. Having said that, Storm is also fault-tolerant and can write to external data stores (e.g. HBase) and you can do real-time computation in Flume (using interceptors)
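
A minimal sketch of registering such a metric inside a bolt could look like this (Storm 0.9.x Java API; the metric name and the 60-second window are arbitrary choices for this example):

import java.util.Map;

import backtype.storm.metric.api.CountMetric;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class CountingBolt extends BaseRichBolt {

    private transient CountMetric tupleCounter;
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        tupleCounter = new CountMetric();
        // publish the counter to the metrics stream every 60 seconds
        context.registerMetric("processed-tuples", tupleCounter, 60);
    }

    @Override
    public void execute(Tuple tuple) {
        tupleCounter.incr();
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // this bolt emits nothing
    }
}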

That's all for today - but I'll keep on reading through storm-users, so watch this space for more info on Storm development.

Zookeeper + Curator = Distributed sync

An application developed for one of my recent projects at TouK involved multiple servers, and there was a requirement to ensure failover for the system’s components. Since I already had a few separate components I didn’t want to add more, and since there was already a Zookeeper ensemble running - required by one of the services - I decided to go that way with my solution.
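
As an illustration of the general idea, a Curator leader-election recipe - one way to get failover on top of an existing Zookeeper ensemble - might be sketched like this; the connection string, latch path and instance id are made up for the example:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class FailoverExample {
    public static void main(String[] args) throws Exception {
        // connection string of the existing Zookeeper ensemble (example values)
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // every instance of the component creates a latch on the same path;
        // only one of them is the leader at any given time
        LeaderLatch latch = new LeaderLatch(client, "/my-app/leader", "instance-1");
        latch.start();
        latch.await(); // blocks until this instance becomes the leader

        try {
            // do the work reserved for the active instance here
            System.out.println("I am the leader now");
        } finally {
            latch.close();
            client.close();
        }
    }
}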

Operational problems with Zookeeper

This post is a summary of what was presented by Kathleen Ting at the Strange Loop conference. You can watch the original here: http://www.infoq.com/presentations/Misconfiguration-ZooKeeper

I've decided to put this selection here for quick reference.

Connection mismanagement

  • too many connections

      WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@247] - Too many connections from /xx.x.xx.xxx - max is 60
    
  • running out of ZK connections?

    • set maxClientCnxns=200 in zoo.cfg
  • HBase client leaking connections?

    • fixed in HBASE-3777, HBASE-4773, HBASE-5466
    • manually close connections
  • connection closes prematurely

      ERROR: org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is able to connect to ZooKeeper but the connection closes immediately.
    
  • in hbase-site.xml set hbase.zookeeper.recoverable.waittime=30000ms

  • pig hangs connecting to HBase

      WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectionException: Connection refused!
    

    CAUSE: location of ZK quorum is not known to Pig

    • use Pig 10, which includes PIG-2115
    • if there is an overlap between TaskTrackers and ZK quorum nodes
      • set hbase.zookeeper.quorum to final in hbase-site.xml
      • otherwise add hbase.zookeeper.quorum=hadoophbasemaster.lan:2181 in pig.properties

Time mismanagement

  • client session timed out

      INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session <id>, timeout of 40000ms exceeded
    
    • ZK and HBase need the same session timeout values
      • zoo.cfg: maxSessionTimeout=180000
      • hbase-site.xml: zookeeper.session.timeout=180000
    • don't co-locate ZK with IO-intense DataNode or RegionServer
    • specify right amount of heap and tune GC flags
      • turn on parallel/CMS/incremental GC
  • clients lose connections

      WARN org.apache.zookeeper.ClientCnxn - Session <id> for server <name>, unexpected error, closing socket connection and attempting reconnect java.io.IOException: Broken pipe
    
    • don't use SSD drive for ZK transaction log

Disk management

  • unable to load database - unable to run quorum server

      FATAL Unable to load database on disk !  java.io.IOException: Failed to process transaction type: 2 error: KeeperErrorCode = NoNode for <file> at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:152)!
    
    • archive and wipe /var/zookeeper/version-2 if the other two ZK servers are running
  • unable to load database - unreasonable length exception

      FATAL Unable to load database on disk java.io.IOException: Unreasonable length = 1048583 at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:100)
    
    • server allows a client to set data larger than the server can read from disk
    • if a znode is not readable, increase jute.maxbuffer
      • look for "Packet len <xx> is out of range" in the client log
      • increase it by 20%
      • set in JVMFLAGS="-Djute.maxbuffer=yy" bin/zkCli.sh
      • fixed in ZOOKEEPER-1513
  • failure to follow leader

      WARN org.apache.zookeeper.server.quorum.Learner: Exception when following the leader java.net.SocketTimeoutException: Read timed out 
    
    CAUSE:
    • disk IO contention, network issues
    • ZK snapshot is too large (lots of ZK nodes)

    SOLVE:

    • reduce IO contention by putting dataDir on dedicated spindle
    • increase initLimit on all ZK servers and restart, see ZOOKEEPER-1521
    • monitor network

Best Practices

DOs

  • separate spindles for dataDir & dataLogDir
  • allocate 3 or 5 ZK servers
  • tune garbage collection
  • run zkCleanup.sh script via cron

DON'Ts

  • don't co-locate ZK with I/O-intense DataNode or RegionServer
  • don't use SSD drive for ZK transaction log

You may use Zookeeper as an observer - a non-voting member:

  • in zoo.cfg

      peerType=observer
    

WHUG 8. Beyond Hadoop – checking other options

This Thursday - 29.11.2012 - I will be giving a talk at the Warsaw Hadoop User Group. You can RSVP here: http://www.meetup.com/warsaw-hug/

And what will I be talking about? A copy-paste from the WHUG page:

Marcin will focus on how the Hadoop ecosystem cooperates with other tools. He will show how to process graphs simply and conveniently, and how to apply the Big Data approach in real time. He will also touch on easier ways of writing Map-Reduce algorithms.

It will be a slightly less technical (but still practical) tour around the outskirts of the topics usually discussed in connection with Hadoop.

The presentation will cover tools such as Cascading, Storm and Titan.

See you there!

Hadoop HA setup

With the advent of Hadoop's 2.x releases there finally is a working High Availability solution - two of them, in fact. These solutions are now really easy to configure and use, and no longer require external components like DRBD. It is all neatly packed into the Cloudera Hadoop distribution - the precursor of this solution.

Read on to find out how to use it.

The most important weakness of previous Hadoop releases was the single point of failure, which happened to be the NameNode. The NameNode, as a key component of every Hadoop cluster, is responsible for managing filesystem namespace information and block locations. Losing its data results in losing all the data stored on the DataNodes: HDFS is no longer able to reach specific files or their blocks. This renders your cluster inoperable.

So it is crucial to be able to detect and counter problems with the NameNode. The most desirable behaviour is to have a hot backup that ensures no-downtime cluster operation. To achieve this, the second NameNode needs to have up-to-date information on filesystem metadata and it also needs to be up and running - starting a NameNode with an existing set of data may easily take many minutes, as it has to parse the actual filesystem state.

The previously used solution - deploying a SecondaryNameNode - was somewhat flawed. It took a long time to recover after a failure, and it was not a hot-backup solution, which only added to the problem. Some other solution was required.

So, what needed to be made redundant were the contents of the edits dir, and the block location maps had to be sent from each of the DataNodes to both NameNodes in an HA deployment. This was accomplished in two steps: first, with the release of CDH 4 beta, came a solution based on a shared edits directory; then, with CDH 4.1, came a quorum-based solution.

Find out how to configure those on your cluster.

Shared edits directory solution

Hadoop HA - NFS based edits share

This kind of setup assumes that a shared storage directory exists in the cluster. It should be deployed on some kind of network-based filesystem - you could try NFS or GlusterFS.

<property>
  <name>fs.default.name</name>
  <value>hdfs://example-cluster</value>
</property>

<!-- common server name -->
<property>
  <name>dfs.nameservices</name>
  <value>example-cluster</value>
</property>

<!-- HA configuration -->
<property>
  <name>dfs.ha.namenodes.example-cluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.example-cluster.nn1</name>
  <value>master1:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.example-cluster.nn2</name>
  <value>master2:8020</value>
</property>
<property>
  <name>dfs.namenode.http-address.example-cluster.nn1</name>
  <value>0.0.0.0:50070</value>
</property>
<property>
  <name>dfs.namenode.http-address.example-cluster.nn2</name>
  <value>0.0.0.0:50070</value>
</property>

<!-- Storage for edits' files -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>file:///mnt/filer1/dfs/ha-name-dir-shared</value>
</property>

<!-- Client failover -->
<property>
  <name>dfs.client.failover.proxy.provider.example-cluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

<!-- Fencing configuration -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
 <property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/user/.ssh/id_dsa</value>
</property>


<!-- Automatic failover configuration -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>

This setup is quite OK, as long as you're comfortable with maintaining a separate service (network storage) for handling the HA state. It seems error-prone to me, because it adds another service whose high availability must also be ensured. NFS seems to be a bad choice here, because AFAIK it does not offer HA out of the box.

On the other hand, we have GlusterFS, which is a distributed filesystem that you can deploy on multiple bricks with an increased replication level.

Nevertheless, it still brings the additional burden of another service to maintain.

Quorum based solution

Hadoop HA - Quorum based edits share

With the release of CDH 4.1.0 we are now able to use a much better integrated solution called JournalNode. All the updates are now synchronized through JournalNodes. Each JournalNode has the same data, and all the NameNodes are able to receive filesystem state updates from those daemons.

This solution is much more consistent with Hadoop ecosystem.

Please note that the config is almost identical to the one needed for the shared edits directory solution. The only difference is the value of dfs.namenode.shared.edits.dir, which now points to all the journal nodes deployed in our cluster.

<property>
  <name>fs.default.name</name>
  <value>hdfs://example-cluster</value>
</property>

<!-- common server name -->
<property>
  <name>dfs.nameservices</name>
  <value>example-cluster</value>
</property>

<!-- HA configuration -->
<property>
  <name>dfs.ha.namenodes.example-cluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.example-cluster.nn1</name>
  <value>master1:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.example-cluster.nn2</name>
  <value>master2:8020</value>
</property>
<property>
  <name>dfs.namenode.http-address.example-cluster.nn1</name>
  <value>0.0.0.0:50070</value>
</property>
<property>
  <name>dfs.namenode.http-address.example-cluster.nn2</name>
  <value>0.0.0.0:50070</value>
</property>

<!-- Storage for edits' files -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://node1:8485;node2:8485;node3:8485/example-cluster</value>
</property>

<!-- Client failover -->
<property>
  <name>dfs.client.failover.proxy.provider.example-cluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

<!-- Fencing configuration -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
 <property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/user/.ssh/id_dsa</value>
</property>


<!-- Automatic failover configuration -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>

Infrastructure

In both cases you need to run the Zookeeper-based Failover Controller (hadoop-hdfs-zkfc). This daemon negotiates which NameNode should become active and which should be standby.

But that's not all. Depending on the way you've chosen to deploy HA, you need to do a few other things:

Shared edits dir

With a shared edits dir you need to deploy a networked filesystem and mount it on your NameNodes. After that you can run your cluster and be happy with your new HA.

Quorum based

For QJournal to operate you need to install one new package, hadoop-hdfs-journalnode, which provides startup scripts for the JournalNode daemons. Choose at least three nodes that will be responsible for handling the edits state and deploy journal nodes on them.

Conclusion

Thanks to the guys from Cloudera we can now use enterprise-grade High Availability features for Hadoop. Eliminating the single point of failure in your cluster is essential for easy maintainability of your infrastructure.

Given the above choices, I'd suggest using the QJournal setup because of its relatively small impact on the overall cluster architecture. Its good performance and fairly simple setup make it easy to start using Hadoop in an HA configuration.

Are you using Hadoop with HA? What are your impressions?

Hadoop for Enterprises

Hadoop's usage as a big data processing framework has gained a lot of attention lately. Now not only the big players see that they can embrace the data their sites or products are generating and develop their businesses on it. Two things are needed for that to happen: the data itself, and the means of processing really big amounts of it. Gathering data is relatively easy. It doesn't have to be structured, and you don't need to plan its usage up front. Just start collecting it, and then you can experiment with its potential uses. If it turns out to be useless rubbish, deleting it won't be hard. But imagine the value it may contribute to your business:

  • faster services - working on optimized data
  • more clients - because of more relevant search results
  • happy clients - your service can "read their minds"
  • etc.

There are many companies that utilize the Hadoop ecosystem for their own needs. You can read about some of them here: http://wiki.apache.org/hadoop/PoweredBy But since that page lacks insight into specific applications of Hadoop, I've tried to delve into the details of how Hadoop helped tame some companies' big data sets.

Facebook

Being a widely used social network provider, they require no introduction. However, if you've lived under a rock for the last couple of years, just visit their website: http://facebook.com. Their main usage is data warehousing. Since they need to access the data fast and reliably, they require real-time querying of their huge, ever-growing data set. Their switch away from MySQL databases was forced by the increasing workloads they experienced with standard databases. What they got "out of the box" with Hadoop were all the benefits of a distributed file system (HDFS features). They expanded the ideas behind it even further and implemented a truly highly available file system without a Single Point of Failure. Facebook has 3 interesting usage scenarios in which Hadoop plays a major role:
  • Titan - Facebook's messaging system. It processes messages exchanged between users and ensures that this happens fast and without glitches. Here Hadoop is used mainly as huge, unlimited storage.
  • Puma - Facebook Insights - a tool providing page statistics for advanced Facebook users. Based on streams of data (clicks, likes, shares, comments and impressions), it graphs this data and makes it available almost instantly.
  • ODS - Operational Data Store - which stores Facebook's internal metrics - collections of OS and cluster health metrics - and facilitates multiple accounting solutions.

Twitter

This popular micro-blogging platform, where you can register an account and follow friends and celebrities for their micro-messages, does some pretty interesting things with its Hadoop cluster. One of their motivations is to speed up their web page's functionality; that is why they compute users' friendships in Twitter's social graph with Hadoop. Using connections between users, they calculate their relationships to each other and estimate groups of users. Since this service's users generate lots of content, the company conducts research based on natural language processing - they probe what can be told about a user from his tweets. They use tweets' contents for advertising purposes, trend analysis and much more. From tweets and users' behaviour they characterise usage scenarios. They also gather usage statistics, like the number of searches and tweets per day. Based on this seemingly irrelevant data they run comparisons of different types of users; Twitter analyzes the data to determine whether mobile users, users of third-party clients or power users use Twitter differently from average users. These seem like really specific applications, but nevertheless they are very original and based on the data that Twitter has been gathering for some time now.

EBay

Being the biggest auctioning site on the Internet, EBay uses Hadoop processing to increase search relevance based on click-stream and user data. This seems pretty obvious, considering their area of operation. However, they also do one other interesting thing - they try hard to automatically fill in auctioned objects' metadata based on the descriptions and other data provided by users. They employ a data-mining approach for this task and, judging from their constant growth, it seems to work.

LinkedIn

A social network for professionals, though a lot smaller than Facebook. Based on click-streams they discover relations between users. All the data concerning the latest visits to your profile or people you may know from other places comes from Hadoop-based analysis of the clicks people make all the time on their sites. There's also a very neat feature called InMaps (http://inmaps.linkedinlabs.com/), which analyses declared schools and companies and generates data for a graph with your friends clustered.

Last.fm

This online radio site, praised by many for its invaluable recommendation system, seems like a rather small and simple service. But behind the facade of a simple web page there is a lot of data being processed, so that their services can reach a certain level of perfection. The large volume of data comes from scrobbles: each user of the service who listens to a song generates a note about this fact - called a scrobble. Based on that and user profiles they calculate global band popularity charts, maps of bands' popularity and many more usage statistics and timeline charts.

Conclusion

All these companies try to detect and trace new patterns in seemingly chaotic data sets. Perhaps you could do the same - analyze your data and expand your business value?
