{"id":11730,"date":"2013-12-10T21:26:00","date_gmt":"2013-12-10T20:26:00","guid":{"rendered":"http:\/\/zygm0nt.github.com\/blog\/2013\/12\/10\/distributed-scans-with-hbase"},"modified":"2022-08-02T10:37:54","modified_gmt":"2022-08-02T08:37:54","slug":"distributed-scans-with-hbase","status":"publish","type":"post","link":"https:\/\/touk.pl\/blog\/2013\/12\/10\/distributed-scans-with-hbase\/","title":{"rendered":"Distributed scans with HBase"},"content":{"rendered":"<p>HBase is by design a columnar store, that is optimized for random reads.<br \/>\nYou just ask for a row using rowId as an identifier and you get your<br \/>\ndata instantaneously.<\/p>\n<p>Performing a scan on part or whole table is a completely different thing.<br \/>\nFirst of all, it is sequential. Meaning it is rather slow, because it<br \/>\ndoesn\u2019t use all the RegionServers at the same time. It is implemented<br \/>\nthat way to realize the contract of Scan command \u2013 which has to return<br \/>\nresults sorted by key.<\/p>\n<p>So, how to do this efficiently?<\/p>\n<p><!-- more --><\/p>\n<p>The usual way of getting data from HBase is with the help of its API,<br \/>\nmainly Scan objects. To accomplish the task I\u2019ll use just them. I\u2019ll<br \/>\nspecify startRow and stopRow, so that each Scan request will be looking<br \/>\nthrough only part of the key space.<\/p>\n<p>It is crucial to note, that this whole solution works because of key<br \/>\nsorting properties in HBase. So, HBase scans a table according to ascending key<br \/>\nvalues. Since keys are of String type, key with value \u201c1\u201d is smaller<br \/>\nthan \u201c2\u201d, because they are sorted lexicographicly. So, also key with value \u201c12345\u201d is smaller than \u201c2\u201d. I\u2019ve<br \/>\nleveraged this property so that I\u2019ve partitioned my whole key space according to<br \/>\nthe first character of the key. In my case keys contain only digits. So I<br \/>\nhave 10 ranges:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">null-1\r\n1-2\r\n2-3\r\n3-4\r\n4-5\r\n5-6\r\n6-7\r\n7-8\r\n8-9\r\n9-null<\/pre>\n<p>The speedup comes from the fact, that each range resides in its own<br \/>\npartition. That\u2019s right, I\u2019ve presplit the table to have 10 partitions.<br \/>\nThis corresponds rather nicely with my cluster\u2019s setup, because I have<br \/>\nmore than 10 RegionServers. So every partition should be on different<br \/>\nRegionServer. It will allow the code to do the requested scan operations<br \/>\nin parallel \u2013 giving me this exact performance boost.<\/p>\n<p>How I\u2019ve created the input table:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">$ create 'tariff_changes', { NAME =&gt; 'cf', SPLITS_FILE =&gt; 'splits.txt', VERSIONS =&gt; 50, MAX_FILESIZE =&gt; 1073741824 }\r\n\r\n$ alter 'tariff_changes', { NAME =&gt; 'cf', SPLITS_FILE =&gt; 'splits.txt', VERSIONS =&gt; 50, MAX_FILESIZE =&gt; 1073741824 }<\/pre>\n<p>Split file is just something along this lines:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">1\r\n2\r\n3\r\n4\r\n5\r\n6\r\n7\r\n8\r\n9\r\n0\r\n<\/pre>\n<p>This tells HBase what are the rowKeys starting and ending each of the<br \/>\npartitions.<\/p>\n<p>Ok, so after this rather lengthy introduction, what the actual code<br \/>\ndoes? It just spins of a few threads \u2013 one for each partition \u2013 and runs<br \/>\na Scan request tailored to that partitions key space. This way, I got a<br \/>\n10x speedup for this particular scan. The execution time went down from<br \/>\n30 minutes to 3 for my use case.<\/p>\n<p>I\u2019ve created an example implementation of this idea. You can find it on<br \/>\nGitHub:<br \/>\n<a href=\"https:\/\/github.com\/zygm0nt\/hbase-distributed-search\">https:\/\/github.com\/zygm0nt\/hbase-distributed-search<\/a>.<\/p>\n<p>Any ideas on how to speed things up even more with HBase?<\/p>\n","protected":false},"excerpt":{"rendered":"HBase is by design a columnar store, that is optimized for random reads. You just ask for a row using rowId as an identifier and you get your data instantaneously.\nPerforming a scan on part or whole table is a completely different thing. First of all, it is sequential. Meaning it is rather slow, because it doesn&#8217;t use all the RegionServers at the same time. It is implemented that way to realize the contract of Scan command &#8211; which has to return results sorted by key.\nSo, how to do this efficiently?HBase is by design a columnar store, that is optimized for random reads. You just ask for a row using rowId as an identifier and you get your data instantaneously.\nPerforming a scan on part or whole table is a completely different thing. First of all, it is sequential. Meaning it is rather slow, because it doesn&#8217;t use all the RegionServers at the same time. It is implemented that way to realize the contract of Scan command &#8211; which has to return results sorted by key.\nSo, how to do this efficiently?\n","protected":false},"author":11,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11],"tags":[687],"class_list":["post-11730","post","type-post","status-publish","format-standard","category-development-design","tag-db"],"_links":{"self":[{"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/posts\/11730","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/users\/11"}],"replies":[{"embeddable":true,"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/comments?post=11730"}],"version-history":[{"count":8,"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/posts\/11730\/revisions"}],"predecessor-version":[{"id":14804,"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/posts\/11730\/revisions\/14804"}],"wp:attachment":[{"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/media?parent=11730"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/categories?post=11730"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/touk.pl\/blog\/wp-json\/wp\/v2\/tags?post=11730"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}