{"id":11730,"date":"2013-12-10T21:26:00","date_gmt":"2013-12-10T20:26:00","guid":{"rendered":"http:\/\/zygm0nt.github.com\/blog\/2013\/12\/10\/distributed-scans-with-hbase"},"modified":"2022-08-02T10:37:54","modified_gmt":"2022-08-02T08:37:54","slug":"distributed-scans-with-hbase","status":"publish","type":"post","link":"https:\/\/touk.pl\/blog\/2013\/12\/10\/distributed-scans-with-hbase\/","title":{"rendered":"Distributed scans with HBase"},"content":{"rendered":"<p>HBase is by design a columnar store, that is optimized for random reads.<br \/>\nYou just ask for a row using rowId as an identifier and you get your<br \/>\ndata instantaneously.<\/p>\n<p>Performing a scan on part or whole table is a completely different thing.<br \/>\nFirst of all, it is sequential. Meaning it is rather slow, because it<br \/>\ndoesn&#8217;t use all the RegionServers at the same time. It is implemented<br \/>\nthat way to realize the contract of Scan command &#8211; which has to return<br \/>\nresults sorted by key.<\/p>\n<p>So, how to do this efficiently?<\/p>\n<p><!-- more --><\/p>\n<p>The usual way of getting data from HBase is with the help of its API,<br \/>\nmainly Scan objects. To accomplish the task I&#8217;ll use just them. I&#8217;ll<br \/>\nspecify startRow and stopRow, so that each Scan request will be looking<br \/>\nthrough only part of the key space.<\/p>\n<p>It is crucial to note, that this whole solution works because of key<br \/>\nsorting properties in HBase. So, HBase scans a table according to ascending key<br \/>\nvalues. Since keys are of String type, key with value &#8220;1&#8221; is smaller<br \/>\nthan &#8220;2&#8221;, because they are sorted lexicographicly. So, also key with value &#8220;12345&#8221; is smaller than &#8220;2&#8221;. I&#8217;ve<br \/>\nleveraged this property so that I&#8217;ve partitioned my whole key space according to<br \/>\nthe first character of the key. In my case keys contain only digits. So I<br \/>\nhave 10 ranges:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">null-1\r\n1-2\r\n2-3\r\n3-4\r\n4-5\r\n5-6\r\n6-7\r\n7-8\r\n8-9\r\n9-null<\/pre>\n<p>The speedup comes from the fact, that each range resides in its own<br \/>\npartition. That&#8217;s right, I&#8217;ve presplit the table to have 10 partitions.<br \/>\nThis corresponds rather nicely with my cluster&#8217;s setup, because I have<br \/>\nmore than 10 RegionServers. So every partition should be on different<br \/>\nRegionServer. It will allow the code to do the requested scan operations<br \/>\nin parallel &#8211; giving me this exact performance boost.<\/p>\n<p>How I&#8217;ve created the input table:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">$ create 'tariff_changes', { NAME =&amp;gt; 'cf', SPLITS_FILE =&amp;gt; 'splits.txt', VERSIONS =&amp;gt; 50, MAX_FILESIZE =&amp;gt; 1073741824 }\r\n\r\n$ alter 'tariff_changes', { NAME =&amp;gt; 'cf', SPLITS_FILE =&amp;gt; 'splits.txt', VERSIONS =&amp;gt; 50, MAX_FILESIZE =&amp;gt; 1073741824 }<\/pre>\n<p>Split file is just something along this lines:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">1\r\n2\r\n3\r\n4\r\n5\r\n6\r\n7\r\n8\r\n9\r\n0\r\n<\/pre>\n<p>This tells HBase what are the rowKeys starting and ending each of the<br \/>\npartitions.<\/p>\n<p>Ok, so after this rather lengthy introduction, what the actual code<br \/>\ndoes? It just spins of a few threads &#8211; one for each partition &#8211; and runs<br \/>\na Scan request tailored to that partitions key space. This way, I got a<br \/>\n10x speedup for this particular scan. The execution time went down from<br \/>\n30 minutes to 3 for my use case.<\/p>\n<p>I&#8217;ve created an example implementation of this idea. You can find it on<br \/>\nGitHub:<br \/>\n<a href=\"https:\/\/github.com\/zygm0nt\/hbase-distributed-search\">https:\/\/github.com\/zygm0nt\/hbase-distributed-search<\/a>.<\/p>\n<p>Any ideas on how to speed things up even more with HBase?<\/p>\n","protected":false},"excerpt":{"rendered":"HBase is by design a columnar store, that is optimized for random reads. You just ask for a row using rowId as an identifier and you get your data instantaneously.\nPerforming a scan on part or whole table is a completely different thing. First of all, it is sequential. Meaning it is rather slow, because it doesn&#8217;t use all the RegionServers at the same time. It is implemented that way to realize the contract of Scan command &#8211; which has to return results sorted by key.\nSo, how to do this efficiently?HBase is by design a columnar store, that is optimized for random reads. You just ask for a row using rowId as an identifier and you get your data instantaneously.\nPerforming a scan on part or whole table is a completely different thing. First of all, it is sequential. Meaning it is rather slow, because it doesn&#8217;t use all the RegionServers at the same time. It is implemented that way to realize the contract of Scan command &#8211; which has to return results sorted by key.\nSo, how to do this efficiently?\n","protected":false},"author":11,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11],"tags":[687],"class_list":{"0":"post-11730","1":"post","2":"type-post","3":"status-publish","4":"format-standard","6":"category-development-design","7":"tag-db"},"_links":{"self":[{"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/posts\/11730","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/users\/11"}],"replies":[{"embeddable":true,"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/comments?post=11730"}],"version-history":[{"count":8,"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/posts\/11730\/revisions"}],"predecessor-version":[{"id":14804,"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/posts\/11730\/revisions\/14804"}],"wp:attachment":[{"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/media?parent=11730"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/categories?post=11730"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/tags?post=11730"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}