Hadoop for Enterprises

Hadoop’s usage as a big data processing framework has been gaining a lot of attention lately. It is no longer only the big players who see that they can embrace the data their sites or products generate and grow their businesses on it. For that to happen, two things are needed: the data itself and the means to process really big amounts of it.

Gathering data is relatively easy. It doesn’t need to be structured, and you don’t need to plan its usage up front. Just start collecting it, and then you can experiment with its potential uses. If it turns out to be useless rubbish, deleting it won’t be hard. But imagine the value it may contribute to your business:

  • faster services – working on optimized data
  • more clients – because of more relevant search results
  • happy clients – your service can “read their minds”
  • etc.

There are many companies that use the Hadoop ecosystem for their own needs. You can read about some of them here: http://wiki.apache.org/hadoop/PoweredBy But since that page lacks insight into specific applications of Hadoop, I’ve tried to delve into the details of how Hadoop helped tame some companies’ big data sets.

Facebook

Being a widely used social network provider, they require no introduction. However, if you’ve lived under a rock for the last couple of years, just visit their website: http://facebook.com

Their main usage is data warehousing. Since they need to access the data fast and reliably, they required real-time querying of their huge and constantly growing data set. Their switch away from MySQL databases was forced by the increasing workloads standard databases could not handle. What they got “out of the box” with Hadoop were all the benefits of a distributed file system (HDFS). They expanded the ideas behind it even further and implemented a truly highly available file system without a single point of failure.

Facebook has 3 interesting usage scenarios in which Hadoop plays a major role:

  • Titan – Facebook’s messaging system. It processes messages exchanged between users and ensures that this happens fast and without glitches. Here Hadoop is used mainly as huge, virtually unlimited storage.
  • Puma – Facebook Insights – a tool providing page statistics for advanced Facebook users. Based on streams of data (clicks, likes, shares, comments and impressions), it graphs those data and makes them available near-instantly.
  • ODS – Operational Data Store – which stores Facebook’s internal metrics – collections of OS and cluster health metrics – and facilitates multiple accounting solutions.

Twitter

This popular micro-blogging platform, where you can register an account and follow friends and celebrities for their micro-messages, does some pretty interesting things with its Hadoop cluster.

One of their motivations is to speed up their web page’s functionality. That is why they compute users’ friendships in Twitter’s social graph with Hadoop. Using the connections between users, they calculate their relationships to each other and identify groups of users.

Since this service’s users generate lots of content, the company conducts research based on natural language processing. They probe what can be inferred about a user from their tweets. They use tweets’ contents for advertising purposes, trend analysis and much more.

From tweets and users’ behaviour they characterise usage scenarios. They also gather usage statistics, like the number of searches and tweets per day. Based on this seemingly irrelevant data they run comparisons of different types of users: Twitter analyzes the data to determine whether mobile users, users of third-party clients or power users use Twitter differently from average users. Of course these seem like really specific applications, but they are nevertheless very original and based on the data that Twitter has been gathering for some time now.

eBay

Being the biggest auction site on the Internet, eBay uses Hadoop to increase search relevance based on click-stream and user data. This seems pretty obvious, considering their area of operation.

However, they also do one other interesting thing: they try hard to automatically fill in auctioned objects’ metadata, based on the descriptions and other data provided by users. They employ a data mining approach for this task, and judging from their constant growth it seems to work.

LinkedIn

A social network for professionals, though a lot smaller than Facebook. Based on click-streams they discover relations between users. All the data concerning the latest visits to your profile or people you may know from other places comes from Hadoop-based analysis of the clicks people make all the time on their sites.

There is also a very neat feature called InMaps (http://inmaps.linkedinlabs.com/), which analyses your declared schools and companies and generates data for a graph of your connections, clustered accordingly.

Last.fm

This on-line radio site, praised by many for its invaluable recommendation system, seems like a rather small and simple service. But behind the facade of a simple web page, lots of data are being processed, so that their services can reach a certain level of perfection.

The large volume of their data comes from scrobbles: each user of the service listening to a song generates a note about this fact, called a scrobble. Based on scrobbles and user profiles, they calculate global band popularity charts, maps of bands’ popularity, and many more usage statistics and timeline charts.
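Just to make the computation concrete, counting plays per artist maps naturally onto MapReduce. Below is a minimal sketch in Groovy, purely illustrative – Last.fm’s actual jobs aren’t public, and the tab-separated log format “user TAB artist TAB track” is my assumption:

import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Mapper
import org.apache.hadoop.mapreduce.Reducer

// Emits (artist, 1) for every scrobble line; assumes "user\tartist\ttrack".
class ScrobbleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1)
    private final Text artist = new Text()

    void map(LongWritable key, Text line, Context context) {
        String[] fields = line.toString().split('\t')
        if (fields.length >= 2) {
            artist.set(fields[1])
            context.write(artist, ONE)   // one 'play' per scrobble
        }
    }
}

// Sums the plays per artist; sorted output would feed a popularity chart.
class ScrobbleReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    void reduce(Text artist, Iterable<IntWritable> counts, Context context) {
        int sum = 0
        counts.each { sum += it.get() }
        context.write(artist, new IntWritable(sum))
    }
}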

Conclusion

All these companies try to detect and trace new patterns in seemingly chaotic data sets. Perhaps you could do the same? Analyze your data and expand your business value?



Multi module Gradle project with IDE support

This article is a short how-to on multi-module project setup with the Gradle build automation tool.

Here's how Rich Seller, a StackOverflow user, describes Gradle:
Gradle promises to hit the sweet spot between Ant and Maven. It uses Ivy's approach for dependency resolution. It allows for convention over configuration but also includes Ant tasks as first class citizens. It also wisely allows you to use existing Maven/Ivy repositories.
So why would one use yet another JVM build tool such as Gradle? The answer is simple: to avoid the frustration caused by Ant or Maven.

Short story

I was fooling around with a fresh proof of concept and needed a build tool. I’m pretty familiar with Maven, so I created a project from an archetype and opened the build file, pom.xml, for further tuning.
I had been using Grails, with its own build system (similar to Gradle, btw), for some time up until then, so after quite a while without Maven I looked at the pom.xml and found it really repulsive.

Once again I felt clearly: XML is not for humans.

After some quick googling I found Gradle. It was still in beta (version 0.8) back then, but it’s configured with a Groovy DSL, and that’s what a human likes :)

Where are we

In a time when Ant is met only among IT guerrillas, Maven is still on top, and a couple of others, like Ivy, compete for the best position, Gradle has smoothly entered its mature age. It’s now available in version 1.3, released on the 20th of November 2012. I’m glad to recommend it to anyone looking for relief from XML-configured tools, or to anyone just looking for a simple, elastic and powerful build tool.

Let’s build

I have already written about the basic project structure, so I’ll skip the details here, reminding you only of the layout:
<project root>
├── build.gradle
└── src
    ├── main
    │   ├── java
    │   └── groovy
    └── test
        ├── java
        └── groovy
Have I just referred to myself for the first time? Achievement unlocked! ;)

Gradle, like most build tools, is run from the command line with parameters. The main parameter for Gradle is a ‘task name’; for example, we can run the command gradle build.
There is no ‘create project’ task, so the directory structure has to be created by hand. This isn’t a hassle, though.
The java and groovy sub-folders aren’t always mandatory; they depend on which compile plugin is used.
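A couple of typical invocations look like this (the tasks task is built into Gradle; clean and build are contributed by the standard plugins used later in this article):

gradle tasks
gradle clean build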

Parent project

Consider an example project, ‘the-app’, consisting of three modules, let’s say:
  1. database communication layer
  2. domain model and services layer
  3. web presentation layer
Our project directory tree will look like:
the-app
├── dao-layer
│   └── src
├── domain-model
│   └── src
├── web-frontend
│   └── src
├── build.gradle
└── settings.gradle
the-app itself has no src sub-folder, as its purpose is only to contain sub-projects and build configuration. If needed, it could have been given its own src too.

To glue the modules together, we need to fill the settings.gradle file under the the-app directory with a single line of content specifying the module names:
include 'dao-layer', 'domain-model', 'web-frontend'
Now the gradle projects command can be executed, with the following result:
:projects

------------------------------------------------------------
Root project
------------------------------------------------------------

Root project 'the-app'
+--- Project ':dao-layer'
+--- Project ':domain-model'
\--- Project ':web-frontend'
...so we know that Gradle noticed the modules. However, the gradle build command won’t run successfully yet, because the build.gradle file is still empty.
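As a side note, settings.gradle can hold more build-wide settings; for instance, the root project’s name can be set explicitly instead of being derived from the directory name:

rootProject.name = 'the-app'
include 'dao-layer', 'domain-model', 'web-frontend'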

Sub project

As in Maven, we can create a separate build config file for each module. Let’s say we start with the DAO layer.
We create a new file, the-app/dao-layer/build.gradle, with a single line of basic build info (notice that the new build.gradle is created under the sub-project directory):
apply plugin: 'java'
This single line of config in any of the modules is enough to execute the gradle build command under the the-app directory, with the following result:
:dao-layer:compileJava
:dao-layer:processResources UP-TO-DATE
:dao-layer:classes
:dao-layer:jar
:dao-layer:assemble
:dao-layer:compileTestJava UP-TO-DATE
:dao-layer:processTestResources UP-TO-DATE
:dao-layer:testClasses UP-TO-DATE
:dao-layer:test
:dao-layer:check
:dao-layer:build

BUILD SUCCESSFUL

Total time: 3.256 secs
To use the Groovy plugin, slightly more configuration is needed:
apply plugin: 'groovy'

repositories {
    mavenLocal()
    mavenCentral()
}

dependencies {
    groovy 'org.codehaus.groovy:groovy-all:2.0.5'
}
In lines 3 to 6, the Maven repositories are set. In line 9, the dependency on a specific Groovy version is specified. Of course, plugins such as ‘java’ and ‘groovy’, and many more, can be mixed with each other.

If we have a settings.gradle file and a build.gradle file for each module, there is no need for a parent the-app/build.gradle file at all. Sure, that’s true, but we can go another, better way.

One file to rule them all

Instead of creating many build.gradle config files, one per module, we can use only the parent’s one and make it a bit more juicy. So let us move the-app/dao-layer/build.gradle a level up to the-app/build.gradle and fill it with new statements to achieve the full project configuration:
def langLevel = 1.7

allprojects {

    apply plugin: 'idea'

    group = 'com.tamashumi'
    version = '0.1'
}

subprojects {

    apply plugin: 'groovy'

    sourceCompatibility = langLevel
    targetCompatibility = langLevel

    repositories {
        mavenLocal()
        mavenCentral()
    }

    dependencies {
        groovy 'org.codehaus.groovy:groovy-all:2.0.5'
        testCompile 'org.spockframework:spock-core:0.7-groovy-2.0'
    }
}

project(':dao-layer') {

    dependencies {
        compile 'org.hibernate:hibernate-core:4.1.7.Final'
    }
}

project(':domain-model') {

    dependencies {
        compile project(':dao-layer')
    }
}

project(':web-frontend') {

    apply plugin: 'war'

    dependencies {
        compile project(':domain-model')
        compile 'org.springframework:spring-webmvc:3.1.2.RELEASE'
    }
}

idea {
    project {
        jdkName = langLevel
        languageLevel = langLevel
    }
}
At the beginning, the simple variable langLevel is declared. It’s worth knowing that we can use almost any Groovy code inside a build.gradle file – statements like if conditions, for/while loops, closures, switch-case, etc. Quite an advantage over inflexible XML, isn’t it?
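For instance (a contrived sketch, not something the build above actually needs), an ordinary Groovy loop can configure several modules at once:

// plain Groovy: iterate over module names and configure each project
['dao-layer', 'domain-model'].each { moduleName ->
    project(":$moduleName") {
        apply plugin: 'groovy'
    }
}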

Next comes the allprojects block. Any configuration placed in it will influence – what a surprise – all projects, so the parent itself and the sub-projects (modules). Inside the block we apply the IDE (IntelliJ Idea) plugin, which I wrote more about in a previous article (look under the “IDE Integration” heading). Suffice it to say that with this plugin applied here, the command gradle idea will generate Idea’s project files with the module structure and dependencies. This works really well, and plugins for other IDEs are available too.
The remaining two lines in this block define the group and version for the project, similar to how this is done in Maven.

After that, the subprojects block appears. It relates to all modules but not the parent project. Here the Groovy language plugin is applied, as all modules are assumed to be written in Groovy.
Below that, the source and target language levels are set.
After that come references to the standard Maven repositories.
At the end of the block, dependencies on the Groovy version and the test library – the Spock framework – are declared.
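As a quick illustration of what that Spock dependency enables, a minimal (hypothetical) specification placed in any module could look like this:

package com.tamashumi

import spock.lang.Specification

// placed under src/test/groovy, executed by the standard 'gradle test' task
class SanitySpec extends Specification {

    def 'addition works as expected'() {
        expect:
        1 + 1 == 2
    }
}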

The following blocks, project(':module-name'), are responsible for the configuration of each module. They may be omitted as long as allprojects or subprojects already configure what’s necessary for a specific module. In the example, the per-module configuration goes as follows:
  • The dao-layer module has a dependency on an ORM library – Hibernate.
  • The domain-model module relies on dao-layer as a dependency. The project keyword is used here again to reference another module.
  • The web-frontend module applies the ‘war’ plugin, which builds this module into a Java web archive. Besides that, it refers to the domain-model module and also uses the Spring MVC framework dependency.
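With the parent build file in place, a single module can still be built on its own from the root directory by prefixing the task with the module’s path:

gradle :web-frontend:build

Gradle will then build the module’s project dependencies (domain-model and dao-layer) first.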

At the end, the idea block holds basic info for the IDE plugin. These are parameters corresponding to Idea’s general project settings.

jdkName should match the IDE’s SDK name; otherwise it has to be set manually in the IDE after each (re)generation of Idea’s project files with the gradle idea command.
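The plugin allows further fine-tuning too. For example (an optional addition, not required by the build above), the generated module files can be told to download the sources and javadocs of dependencies; placed in the subprojects block, this affects every module:

idea {
    module {
        downloadSources = true
        downloadJavadoc = true
    }
}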

Is that it?

As far as simplicity is concerned – yes. That’s enough to automate a modular application build with custom configuration per module. Not rocket science, huh? Think about Maven’s XML: it would take more effort to set up the same thing, and you’d still end up with a less expressive configuration, quite far from user-friendly.

Check the online user guide for the many configuration possibilities, or better, download Gradle and have a look at the sample projects.
As a tasty bait, take a look at this short selection of available plugins:
  • java
  • groovy
  • scala
  • cpp
  • eclipse
  • netbeans
  • idea
  • maven
  • osgi
  • war
  • ear
  • sonar
  • project-report
  • signing
...and more, plus third-party plugins.