From 958f7ceda6adef8c8403722abbf74bdbc8e09304 Mon Sep 17 00:00:00 2001 From: Vinoth Chandar Date: Wed, 4 Jan 2017 20:50:44 -0800 Subject: [PATCH] Adding Documentation for Getting Started Section - Overview, Use Cases, Powered By are very detailed - Cleaned up QuickStart - Redistribute the content from README to correct pages to be improved upon - Switch to blue theme --- docs/_includes/head.html | 2 +- docs/admin_guide.md | 43 ++++++- docs/code_structure.md | 9 +- docs/community.md | 7 +- docs/concepts.md | 6 +- docs/images/hoodie_intro_1.png | Bin 0 -> 23478 bytes docs/index.md | 217 ++------------------------------- docs/powered_by.md | 4 +- docs/quickstart.md | 146 +++++++++++++++++++++- docs/roadmap.md | 6 +- docs/use_cases.md | 72 ++++++++++- 11 files changed, 299 insertions(+), 213 deletions(-) create mode 100644 docs/images/hoodie_intro_1.png diff --git a/docs/_includes/head.html b/docs/_includes/head.html index a5700e6d6..25be6c718 100644 --- a/docs/_includes/head.html +++ b/docs/_includes/head.html @@ -12,7 +12,7 @@ - + diff --git a/docs/admin_guide.md b/docs/admin_guide.md index 9e3f18870..5143aa227 100644 --- a/docs/admin_guide.md +++ b/docs/admin_guide.md @@ -5,6 +5,47 @@ sidebar: mydoc_sidebar permalink: admin_guide.html --- -Work In Progress +## Hoodie Admin CLI +### Launching Command Line + + + +* mvn clean install in hoodie-cli +* ./hoodie-cli + +If all is good you should get a command prompt similar to this one +``` +prasanna@:~/hoodie/hoodie-cli$ ./hoodie-cli.sh +16/07/13 21:27:47 INFO xml.XmlBeanDefinitionReader: Loading XML bean definitions from URL [jar:file:/home/prasanna/hoodie/hoodie-cli/target/hoodie-cli-0.1-SNAPSHOT.jar!/META-INF/spring/spring-shell-plugin.xml] +16/07/13 21:27:47 INFO support.GenericApplicationContext: Refreshing org.springframework.context.support.GenericApplicationContext@372688e8: startup date [Wed Jul 13 21:27:47 UTC 2016]; root of context hierarchy +16/07/13 21:27:47 INFO 
annotation.AutowiredAnnotationBeanPostProcessor: JSR-330 'javax.inject.Inject' annotation found and supported for autowiring +============================================ +* \* +* _ _ _ _ \* +* | | | | | (_) * +* | |__| | ___ ___ __| |_ ___ * +* | __ |/ _ \ / _ \ / _` | |/ _ \ * +* | | | | (_) | (_) | (_| | | __/ * +* |_| |_|\___/ \___/ \__,_|_|\___| * +* * +============================================ + +Welcome to Hoodie CLI. Please type help if you are looking for help. +hoodie-> +``` + +### Commands + + * connect --path [dataset_path] : Connect to the specific dataset by its path + * commits show : Show all details about the commits + * commits refresh : Refresh the commits from HDFS + * commit rollback --commit [commitTime] : Roll back a commit + * commit showfiles --commit [commitTime] : Show details of a commit (lists all the files modified along with other metrics) + * commit showpartitions --commit [commitTime] : Show details of a commit (lists statistics aggregated at partition level) + + * commits compare --path [otherBasePath] : Compares the current dataset commits with the path provided and tells you how many commits you are behind or ahead + * stats wa : Calculate the commit-level and overall write amplification factor (total records written / total records upserted) + * help + diff --git a/docs/code_structure.md b/docs/code_structure.md index b0ad81939..5f160fbe7 100644 --- a/docs/code_structure.md +++ b/docs/code_structure.md @@ -5,6 +5,13 @@ sidebar: mydoc_sidebar permalink: code_structure.html --- -Work In Progress +## Code & Project Structure + + * hoodie-client : Spark client library to take a bunch of inserts + updates and apply them to a Hoodie table + * hoodie-common : Common code shared between different artifacts of Hoodie + + + We have embraced the [Google Java code style](https://google.github.io/styleguide/javaguide.html). 
Please set up your IDE accordingly with the style files from [here](https://github.com/google/styleguide) + diff --git a/docs/community.md b/docs/community.md index 533dade7c..83502964c 100644 --- a/docs/community.md +++ b/docs/community.md @@ -5,6 +5,11 @@ sidebar: mydoc_sidebar permalink: community.html --- -Work In Progress +## Contributing +We :heart: contributions. If you find a bug in the library or would like to add new features, go ahead and open +issues or pull requests against this repo. Before you do so, please sign the +[Uber CLA](https://docs.google.com/a/uber.com/forms/d/1pAwS_-dA1KhPlfxzYLBqK6rsSWwRwH95OCCZrcsY5rk/viewform). +Also, be sure to write unit tests for your bug fix or feature to show that it works as expected. + diff --git a/docs/concepts.md b/docs/concepts.md index 5d35dead2..389380b0e 100644 --- a/docs/concepts.md +++ b/docs/concepts.md @@ -5,6 +5,10 @@ sidebar: mydoc_sidebar permalink: concepts.html --- -Work In Progress +Hoodie provides the following primitives to build & access datasets on HDFS + + * Upsert (how do I change the table efficiently?) + * Incremental consumption (how do I obtain records that changed?) 
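Purely as an illustration of these two primitives (this is not Hoodie's storage layout or API — the record keys and commit times below are invented), they can be sketched as a toy in-memory table:

```python
# Toy sketch of Hoodie's two primitives -- NOT the real storage layout or API.
# Each record is stored under its key, tagged with the commit that last wrote it.

def upsert(table, records, commit_time):
    """Insert new records, or update existing ones in place, keyed by record key."""
    for rec in records:
        table[rec["key"]] = {**rec, "commit_time": commit_time}

def incremental(table, since_commit):
    """Return only the records written after the given commit time."""
    return [r for r in table.values() if r["commit_time"] > since_commit]

table = {}
upsert(table, [{"key": "r1", "fare": 10.0}, {"key": "r2", "fare": 20.0}], "001")
upsert(table, [{"key": "r2", "fare": 25.0}], "002")  # an update, not a duplicate

print(len(table))                 # -> 2 (r2 was updated in place, not appended)
print(incremental(table, "001"))  # only r2, the sole record changed after commit 001
```

A real Hoodie dataset implements the same contract against Parquet files on HDFS with commit metadata: upserts never create duplicate records, and incremental pulls return only what changed.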
+ diff --git a/docs/images/hoodie_intro_1.png b/docs/images/hoodie_intro_1.png new file mode 100644 index 0000000000000000000000000000000000000000..dd8977a190f682aa269e441cb35de727d9ac638e GIT binary patch literal 23478 zcmc$_c|6 z(ds;Kr|r65=E0kgJ7ESP616|o#%#+#QS`UE3!WZPCmZJ-o;~Z|CYWkHcE(3gday<2GOo5BvF1F7JmP6J?_x5Ta1E78G34WZ@q~!N+!yoS6Qm zTo8w_LLYj+LjNL(e__watmyZ7Jp+~U@jbBA9OTntMzPQeDC;Ya8YMse4KVSo(Qj-N zcG3axIk%zhZoj*6eLdg(`#^J9lkkO|%$8Q^Z8DTBNI>>V4ah&Ju|rwULToy$;AghI zZU3=AK%{`v=q871TW(r_%&XI0yS?Ph+=8c;Mevf*s$_2JCm8QIMl zt#GoWwPpG0NkePcV&sAz8NQ#?a2a<8)pKQ?e1Mc?S+7F6_n{+RVb^V`+gm<0O;b~R zO$k$);nShM7(MDf(zmIt+Q@9K0wAi+(GKVG_UrXLanT{H-c!6CZ_6TNRmUXat$E??1*NbiVtg0O~x#F-da^n2m`EZ9>^l19C?vSUYK0_Chic;J<2*tO27Ny{N$|4wd>q6thX z9kIcMIw8z)b_PLjJ~eLLG?BS|ldGPS)X*wh;~S1f#~6NIn%pC4T?<#N;U3FcC_r{d z&jkb}RKJ~KNiE+ujhiy$T^^0>Ss<|XGcqOel@w4A{R)IfqJGThuIYA1ApXJ9OMANj z{nW*i`$=PA=`QUT3J;CFN@Qw<-g78^r`&TT%j)%=l~B+w7heuATl(1Ql#z3JsqVw+ zsF7D~^kYiDI();4FBL4{lv3I_aC*(l4@Cd>pz!OIr!E5HBcXE*Rn$q*;S_(?JI^lt zH%kI69D&L%5zGX77zEw=S@OF~TPNXcDi^G}y`$pG+bWK~UX0;&T?xY*b`OY`bwg6K z7ckrO&3!hyE;}qhtJ?z*#`XJ1@jW^Ktgg=3QslxV+l8fAWZZN+C6xSN=@1+0qF-ev zhDS9XZp8KLF||%R@a_xVYAuuFV&SKYbo0(eX>GR(o{xAw=7qX-1%ey&#Dk1qevQ98 zeFxKCUsMpvZZLdJ@pQZqlVadS2kY#~%Q)sOO>a9`gX;NcohbZ}m9E7tGWhl_y zA5w0m$J%upT_R25#el{(CUc~WoTfldnc8IQ^tU^|O9=WdA-%Cwd3O(IR-a32xTg;c z$NfIJTF~$S-WVPh)7>j|l`bn|oY}^Dnw`=Mp6WOCQ8 z*e*PXMS=o}`lf_0Wxyb78owPa6^p!a3Z7a`c&`Vt3%cPJsY$31A7d54u~!YU zcKlHww;c@bM2z~xCw9`T^^-47GcgS!}X@^@x!+xpj@u7WA4U z;Z%Lpm-TK=l=W^(pkgh!u*Y09d!#_g!S79UQrTezcet@gigk`AvJvAyb&99BQlnu7U;(cKGQ8a z0Co(k=YQNqbk=}wU-34 zqE_TT$97$r4mNc@HsViR=!uqyL!11zSgR27dZorC6tghx^b!FisO8(plN8IggXvqd z>-&x@j!{}4KiIAmtxjb0*A?eNCCdgZ#M6dT(uzcVsT+NaP@v|NZeR1(De5Am;U><7 z4~z14L=`|ZaqC_?F3_1hU(WtK_bG>7q=*$$@mKAdT%_GF%H!4hA3Rx?^)2wp0-@`E z*zZONnyFRxRVYgTUa!t}{2dQU4!t;Fzs_sjJfDEfVzfRFWN2>-i5FT6aS6as? 
zt>0pvu~xY*hv0(@$GI8cPW z=%+XCmec7Emwyn}SQ~NEw!f1Z%E0@}zJC?x6R~Fo$jHnSi%sj%*)73YsAOc$+_HsY zJ~fkerZ{98;{n}j#W`8|(&`mgqO=6z$FYp6I^wY|Inf%h3XX zG$$$9q_9vL+~cWGFh-Ns=Z{%ZD^5lSZt!r|aMbD&b^W!>K4}vCX~o3=s%%!-3D-RB z$@#mX9g6}~QDT7#1Q>NK zn|CaApkb+RS~Ly`S4L9BPjP;LX_vw!ASJAFhj0=PrpyYM|&45fuCm+MIy|0w-k8rgz89QsAf1pCmCWe zhQuuO`yZ=ncM4nf(hMCByr8CSlw%b!tN$%UG`s3PhImHy%tk>BEx(i7N96|QmMTN` zko;s5Q(Zswc1H}XjE{uwBl(Gk7QtLO1?|5%>0bl5I#7pM{3v`1ict*Uy)^R`!iV)dl4q4Aua}wfN z+5(|pA`7zh|7F@)L7;bqUBjGHku2RCI(fS_TEZ36uU3vNG+52Z3K>qJ0VusTkyh=G zkG-=_FtRa-YJGO<2;`kz;Y<>ypTGL{35{CCKSgar@5|=AVwU@PAqIyGcF??hLOnZ% zIt7E^c7x$&;smC@pqee5$yfjjj%@;=hYh|ZC3BV04v{tZyMXLJnGxPHo# zhz>fLZBnA0uImu8B`kVoC{~cEvzSqxVWz=~aWbQ|Fcek%^VCy=aJB)k+;SHKi3woX z3%;(0|2`7Pk}q#q8S~eO*P>O`c%o02d31ToA5|pntRrdA*UTAqPgH{`-gL%C_}M~F zM%AlNH%ASzQi*x|(-FE+PH%XA^l%HRu`(fLF8NAfl8Wu_!H_9`$pU>pQ_c{n44zd;nd#U~AatXI~UYlQ_cJH!gAsUa0#^udBP6EN0;)V9(!{t^CNT>mh zv~xVy)XEo@G|R(hM1#`TUqdg4ULKw)X@2BTk;Eo!FQSc%OBvDM1s>@AW>9c0#ji8~}8}gyPo;X^W*~}k$#lRyQ)w{3CwZ2AqjuF)v>5|~V0BGoDa$OMeI&G5b6?Z!N z+PbqrZ&>M*o_E!Qg07I#8p&aT7*HaARwc#so#XT?K-YMPrZ*y>F9;$o+=*hIFKg}A z%C13-b6Tvz9SJeuO3MN!Yk4xvgtsR%`0`-v3agJX!KhVQggw-ZB3KnxbO^p#!N(Ne zmku#baMkha;USEcuD%VB6%FwCGdjmJATnf3-isde^{PA~p>ws*GQYuv4mdNw zLl|l<86f3hjFRIT67HX#oF43Qa9=b=+l}5sNrD3kIZdeY61YmCAo0@O3MDVjwy9VB zLns@TeYYL$8I{;UYTEzJ3jCXNAt!ly?2&qR=^rA=nh5bANzqCGS(-lu>74Uj zDd#S@1vV_4K8eqc&H^J{sm+h;0_(rIAneT%nhw(9{;l3~)DNck#zrb{ejP@R8!r|I zK0MA^vyR=rH!~)(SZ9Y%k$y6yKyI?28NSKC8n-W;M9g$ zT1r;R9q+0`Me_Kx%s!QDW4EmL0pdXXT(2;;xe^R#Z?@`v?hx&%DKIr^7&%>}m(jTK zb>^jJkN%N%L-}eo?!tiqYUce_gz?Jz1fyNxf@yxnk+M8}&`|A(GWBH9T#%_b_83Tp zG-6G)f%qocd+Ec1>$ESL*;W;Z8l$5dX38DHl03zq)kKf&TI%vxyGf5A_WcYg*8r7H zeJY&h51oxLxV{`o7caoEi|xZ1gDZN>$O11X(#nE5rTjyxzuY^gFx$495tI{)`)jQH ze_4qDS0UUnkw@DmWd%dX#sWQ5W78E3IgP-sQY|W`CE)=3$RpP>1X5r#*zL~d*6A0t zbmnba2c}!I8me--PtPs744TVUP34b;W1U|{8%CM1yeX@QYs`XrLM6-QRRku`$^wYObLBuX1+y%_@79Ao=d(cdIa1OvgDK< z0j+-7ljT2@T`Gu_mJ=^1!|P=?ZpeBE+}Ze?L;t#CxZEl3_MKZMOP`a6d{(1>>2>k0 
z%vjaDmNx4N*NHOW!;e~mlUUBrPh*KIzs4e289}d4&9JtE%@~Gxr^t?KMY!Q$MSB)= zvir0820#28#|7W%Uccb8(uL0l{7-sPTBsWY>s&M%?Z|E3Blfp))rnY1o75NNb^bu^GtKD{p9GCpDIIBe?3U^w~wXf0< zM?4pcT*^*+@yj?@447AGskAX_os{V~a4?BpvpQdQS7H*FCB^i1-aEd!g;8R`FGcMz zCcjvIn}C-B7Zf*%kW34qwZEfoJBMx@yZUm>Um{7|lnBZj(5Ed}ANHqyq@K_U!9BZj zF~ZvxdToiN9-CR{G?wxZeBB2gP}O%c4H&oLFpjf zplV=TliG2VI{u|VGrGqW?5|HN%}b?q!80~`bJmqpRuTGwa9ZLGoiBynC|f6cw<-_Y z2yeP)Tw@NR4R{LpGq5&HI*{FYMb0O&Lz1*4#RX; z!tQf1W1&qq+KBxN!g5C9Y>r)pRHK3-M=uU3EN2-mE3)|IRg=NG&}t7X|sNXkcC1|}=acvP;*qnotcsiX@NVulU$w~)^ z0s9vPa4^2I_Mlq>TWAos{AxV4yF7{kTn&GyW}UFuLJv4;jAzGJ^*AH~5bR53O=Q$6 z0`qeHy^W;L?4|G^-2|ECC&f))quKic3RM8fx<}Li_<fM&PMVe(wvb9@YjxIn`) zf1OT-y=`T@wK^H}jQP2De5FF1-5XTM?tp>xE!o3!i-N;e6w*rE>2Z#jLuQY!e2ue6 z8=B($JP>!fsJQA1B|T8no9v4iWf|CrM_HjF;1qdDbU>vI=kj!1^Nw$mk1K2IjoA;M zG*hP!C8$sik!lpFA#EVFIOv-y>y`S?Zt=mV6Cm`Hbruj8!+NWox*76b1GMabQ?s)u zQ4Bc>ZyRHCjpQ00*GXCbN41% zDtGbbh0$v|+%hnC@eZ5gr8r~DHV z^7t_fi?q@%5x%*ts0crIbbwJ8-%80>))*bf8QoR|e%m_Fy%!XmCP&g@t1JYml1$3^qX&5h;n6t-`h)64|(Mq;o4$MC9As zVgGmJ6EAoDKO7JntYO*|Wq)QSp`dZgs%fS_des!2Pgve9_0Z}=ZsU+2^I(oj_0Tq; zcIF(_1(yD%%f-o7lzH&x$PP$>%9Ytu>*qe)eQ+*Hlq>aVcWjP${6W>r?>3s_QDb682Ui*Bn8pB$ z^;K3yy~)Pf0t*wehND(QtZBC+J~H~p9&RpN37Dg@%~a~Z#`H-v=-!d0wRsB-TUe7| zi85MfgZ4a^ZnC~intUQm%GZS{HW>lPqk)A&f3g!@HC>F0RY*aEr0?E)tNqB8M>%jmsBy1JkZ#vIOxY8*;G>q=}z=d+ki)S zQzy}{PPU#%O~3lw>vKy}$>)mu3jR&hm<@qzQ_@C~{7JoCk2X}C>-G_!#XTQez@V|@ zbXTZU@;rwkTTuvWScWuUmeQAD(FuCtj|FaFR(^O#(|^rScT|a)_H+lKQNIhH?*9U*Mkng(F-f0gvA$~FAJ4%|}2&}TQO(w@YI&*m9J_5n#o=x^o zB2T<2lxvL7#=>x^KWuX+ehw|Ya4`-U$ty6F$rl)Iobbf7_+`nGCf?olw@p5mQrmJQ zkt}%){zR1bKDDOZm#OrcImdkeksWZ|CZY=>^)OLE2`JnHK~8W|aFR4syQm}WTx-6G zRsD?5G1$e|{4yyp{K-1qA_2a|5;Zjclt#D@Tsd8#Xvv1^{#vw!D|K@AFMH#imo`0g z{ZUs`aA2=2IpK^u62}hX;2Ajuu;Z+;Re-(M9-3nMgjjsx+o#`#X;<~qJ2m6rl6$a@ z`WjJ}smgmF(;a=Mine6E`H!RY2($K|GCH?5a|15zX&}jKLrtxWObN|4h7e9T`_#(d z&|K{*5^LmgPd&)z3PNR%^(1dhR;@?!aC`KMv@PBOuccVg)L8R&uSezRuo1v>w9_pm z6<)&k34G+dRxh4m*S4V45sd0I1w-0yq_99s?i;;sV|l#m=z4@Fq 
z>skHoR`r9+oZ4L2$}MSW^Ryk-abit~VMI4kDc8J018ZBzi^K;Wd5G%o-a3l!I$&;U z!Ql?pZB#MK6BSO*w(Ka$5ua36qQ`k|FV%b2!RN|;W0%<_tJpW)BC1_pX#q%_W#86D z|CVUI4JuL<(hwe#mt3oTuJD1$%Dwr+P0z@IGuBaETP*L10;_!{j}wbJnlBj|pKY=G zxtDJ+S(gszeL`zGAwL@pyxv_s5D%apn(q#3H?(o_nMqj>-51@8}_?cfzY4Gl$P?hMm8>#PmaDAy@t3Uf|Tn zEtjOWVs?O;`RD9$B}3j(NchMYlx}gCE-Ty9Y<}+$zQzsCsefH{n?5T-=KCRBPiXq>T^X6LNhiy{r%4Dqn*+;@GHvfLvg5UFEsNAdkZ-wePD6jNVl%3`< zlV2{W%+rv_l&phmK8HU!K^BW;vHVS~qIN^ksjK2s5-I^9I|Ayj;Cc1lU3o_36r_C8 zo=@iJUW!ik@y6{R)2b_+(@IC}LKYRib|p@kf5ox#7gr+kO#Fy=5~JFJWlqchtPq`y z0%ow)zegKtv~2+hJX*^ofOC#*UW#bfwJ3_0;SHBt1S+l4<8A7bBNm~~A8x>yzK%{y z!w4VN(Vo#V=;AUCAQ(mkBit*NM&>fwjoCneQ4HUSAZX+{l4xzdnNAgzqp0e!EA7iR z4|iBA!m1k+WCet@^IZM1-tMP8)V%t@0M__?9RMm?zj}iepFTF}C`66StNkwaF0_LC zs=y*gH(zF4sn)o0gs+v9hG@(OgaAe(k}>P3_)h3nZLF43hd}96p3+J(Gcww<^ zT$m$R27aymz*XOE=tNUia>4KO%FNJR;XIQ7>v>O&w$+aflp1rqPbv~CtA7q_Jbw%} zkGEB&))a@9M%bpG%k8Z*h=@Xt_zdQ#Dx(d);)sWv$|H43bsK?T5n4|nM-Ez_O__}0 zV`H46f*m=fpA9;?)ha>@roA)ybdw1gG6oV2$&*>hPadn|fPQ~!V-0PpU2J5g9mgec z#EfH+DowYYjwRy^0RkiD4Y+UT?S>cmKOgC&UCt9+~7r!;Pl z)D;Ti53;j_NS;jW5`2LVL0?78j*Yu+ri7F(HmArraVb22ezv-!*9^O4z$@YYb~NVM zCk;t?)Ku%BMoD`8tC!hXHFGyOp%{Zd%$i6Brv7O|7)R4QnfGJefAq@juXHsuUo#$q z821KLGkI#no9-~<5M!)k-AU`0J+s7w)t`SuFau8-m>5SZCzYWilq!~t`$x1V z$W*3XduxG;L0;;{GT+O?zzcV2y+7ht+<0FLc?bl2Eu#v0- zWOq7+P4Bdgd*+f~qz21_95;hOd=h*hkDbGRj_4TR;eetoi*ZCb(nKimC90fNiyLH%UJ=U zI`DR;ynHCAdhx=1(TsZ3jseNc;T3m4(G-4m4OhL;G`9K(su^tNZ)R*QcwV6s)}2(w z@Lx#}U9xdj$kFLKqP(!rHS6F8zEZkM?hdI93yXL^^*qk*xl?lYs+3t2$S6ob>?H1M zVnNAZay?F_eANXGtLad~&aL4Vm{C=qScVJHc~4bse9K@WoeAeF@@9Z%h(ozM!OA)% zo)i*3^~+fNi;+hY!PM+u$z^t^=};znF7)+G->R0LcPT0wq7*yWxH#B69Xg)}ae zfYEvW6EwEuJa}Begy?^blLgm`1`1D20RuVbb0K0UBzEAp8s31u4EqiDJU-xIW>Wnk zs0`}e&zwKyTXecE>!?G~;L&y#sBR34z2g%dgO}IC?de2~P6=uz8kE(V=xbRPQtl|Q z(BokeMgIEW?8dq)Fmus-%xV-Mt@|e4#HyoSjOF_bC##p7Icb6^Q z(D|u+iib@^9fD)j!Z_HTg{a6Q>bO7Muo-F3`^sv+;C#Id>rjR$Bs#I+x5Bb{KH=VC=1K$v$SS~`7?l?kY+MVJ56#tl}BL=j^{RJuT|A)w;+q7=T_&j zGI@|L)*}X}-Iy$rc+5?ZHH?rSJsnQS*XVzre 
z-OJWIJPf9ts%pWI(rZomoU-Q6PW5@Pd5~Ar38NT1316-!Q{nHU{(KG2@XY2llJ=1n zH)Qb{zpsF9-3T(z1!}%!QY)fPVfRTdY4~?CeuJ`hTS>xVnPMq<7R~T4t|@ase^Vz4 zG;#+*K=!?1VnwpWsxD35L&p6XDekTB{eOZxm;i6K&sr@JRK|0;6lT zzZMA(*l%o1t>V$(W!pHU!r6FFZ)*?jU}H~%-ZEBIwIedZC@Y_g*K+Zv(J%v!bf_9V zw7ZV3an;w_(A>44+^AWO)EyHRiqFe$ax#WZr*?#0cn8YxV;sVj4!pgpVsL|#-y?#) zHge_VU{_>72FnG4;enz$s&3L}a|WCw!ABRtuCD+CY5AG{mXc}r<%OqqIpI|k`TPRt zeIW6pPeM4LXc}mZ&nZ-fmgiL9#sW#VlHMZ~Uw23Kp9FuLhc1eXSlI7_-n@43Zf|w} z*i2Tw!8RY=ZNTkdkude54q8bsk#!xaHQPn&2$PS3oXE7r`7Q z?LChBX>cU09`@DIi#K6m1D+yfdMu`>p)2%?d|*pRNdtIHQl#`7Rl|snX^VKEI93Ww zA3WO42+{jgoVGncllE%tME(>bG`)*6Y=E%xG0Jo~pe=WK@bpmUrTTjTvkhL;;zxXAk#<=~=Q z799HVGMBO)Z{e{0{!jLj7ly~wnJfhOiv_2P?{a-%pRa-Z20mB13hUAd@`&&3b$aFF zOQ``i)V`uz{tEY7YC)1!O1OVFci{LUW22k)?U7oYcbJ=jsTf;%*Y|+6BSg11r?JjkS!YXxu zRHJ!4UWnYjsmNYoQjS|DoGmG34>zrGd^~r8%Wqp?)qImwM$7eK2lTW$SPd<#j#TW` zK7E5Dn5hwFcgnS1vji}I)VukMorp|oP)Nu18LHE+0ROx(i#vz}OQ|$^cK;(9OA`FD zn@$0iD;Zu8Ohp~+E8h{&e~B{s{AELrlp^I$;OxPK$g;U2`vfu?ZGc&0U*bP_qy8IN z_9uwutnz9@tiromC@&7F{Q*8L`8uC+oeXFk97FjqLFK1;{_B0d2~?g&Xa=ij*FF0W$rX&fCga{O z?SvpAUC;&Oy8}IL%#lm*lyNNN53Z8-ozLu0s{jdf0m4TzzU%p`1gtfF_N$R1-y z1>wSE-%m+4N4_+YBUKY$^Ni!1eAXPi8!^0EHcviiA&KdKSaJ6D~5Z)=5OlGvqg^B~A)a>{GB`QCx zB6T1?r$EQugYeY{N_{%fJT zNS+E6D4)^Jg<-(k;Tgt~A=}U+ph2PCb>YVFz@D-bJ!0l%vlS5#Jeh1vTRQB1@Zr)6 zq&MT5|A`S7+SCdy6x%fF4%n}?j&N7)&lpZ~^;(Z|vMJP#wljtMj#lq~&${SUk?YSfdiaZ;DY1l3hUm>B zApDxPo1`LalWF<*3UXjD9U}nOlB~auX7jn?K;Ria)O)SkR8MEbYq@vT7HIgEanMvI zYjC8_CaAP1&rStA)EycTl5x3y(MO9_!5gYJ^*H#A^IqL7R9FZ8cFR?JcHwYo!`PpN zcbj#oHqgr<3k0Hg{K(N6F;z;#gUBqL*6C=h)s3M>j{{1+vk`EU63x42)D%)X08j6& zWcs2FdfesS>?O^VyIi;P>gLXZ24h=%Jdc5mm%MF3ajCs0D#t>YE~tQ8Hko^LcnPG6P&!q`jP{ovL`D1yli}RIc_66o?w-aXz zDm~2zNXAvFNW8%w(w#XK1!*lsCMEfRy5Z+~=y{Z2{&SA1tHQwh99+Hf1cP&5HIM5v zc<&^XtOhWuSQWoH@@P}qKUszB^MI$1hSVY|TeS}!uK1;Y&=breEeeY|lzVgo$(Fg{ zSQr%TA!RjhnE{u;;#y^~JHT)vt4~dYX1y5KY19_ut@|mk2m%mZCq>tYsBTU2p$Ay+ z-T{6Qkc?N=>e$jX;+NGf^aWx`iRX z%YZxfr^D+`ED$Gb*uvBb<){UB;Yz*RaeR|@IrpJ=#SX=v1_=gFVn(JL*l 
zVVPe*f4Eb1udS6wQK%Cl1;R5cx&|^171_6;(%!9>$bEHksl)Vnj#(5!!wwVd_#~%N z4qScuhF3xLs;w95zOJ>QXGn&v;_++)a>X^Bg_-JQM^dXoR*SLT#3M#YXqos)u2D@_ zIXu*lz3>_$VtxHj3ryU1ZvrwS)cV-uVNSnw+q@2k{i+&v)QmHst^z>c!JH|AU9R&~ zk8hxxA$C_nq*0qD3w4A(8a;B-FYQQzvgYdO2<^q}pcaoG4uCef@)h93dQNU}qQa-$ zU+-PDjT2G@3yzarzYUk(sk64s4($&hQ!)k50sS8}cQJSP;SFE&79MT`uB;%cWQ2ns zEc&NfIpPS#N|wX^$msL1j<8CMbZptId``D$)oNNn=snt|R`|8z^5VRIlVtRs&M)Sf zg1ukNAnZRPb<)>1pj#}v!P8|BJDx8QOHc|h8%8Qaj+P2luqu84_7{0qaYowg#bD6% zsFbSbyT#a)&#-FG3sMgYLL_Np&OsOl%Ii`XG1mbb8PA|#Oa|bYxI~w zg;8=0I_}x>MuRJ$$mgz2v=MX*@v^NwFNVz;_%~hW@N>mcnOm2(srsv4!ru?<4$SCe zhgKOLR6U!Sp4(q<6b2}IB$y{IY;6@*rFEn0_lMW#SxHBaWIO$fb~I22j5%Rlu0IE; zy-vF{X>e{pT{#VsW|VL7s$l<<-Q=7CV;G*~bK|!}c?WKK?NT zJayc&dV8P@KVj(++Wcex?2;C;b#ME}DmORkc5k?*57<@@?~$(zpESUwiW>vHG0>6y$v8D*i=${t0C{dE%Qy>z~zl1tAYh zvlfo>7t9_MvPFWQw}anp7K39IxKe+!QMq5vY-(lGUjHkNyV-sBpCs*H?Bky`^oOet zR>gBv&YKjCxUaQ~9+b}Xx6$SjJng9>f*b zy9{>UC6x0UsNCKpoJ~7}lL-8}oq14XO;MshiF!|E`MT}wmMq=aRdEpjryBYm&U82Z zPoIVcR?m)-t~3AiD21&&{NdT8xubgr>rs-#r1HB*ziv(wtl?#S9S~)BgePLHXc{GY zC#^C=&(WXK)_yN|_jt=5Ar~DUt#;h{y0duf-!HbL{g)S>z8V{W^4mL&?#P%6)jBSw z2nl)*pP`-o&&mwZPt=9;M1uaFhXJi`-N7nLlKguqk%{_FED1zYm5aoG1`?8*oklrd zM7UBy$~Q5+Rc=(o86+NnA5mL$Y7)(&V1M+luD?0oW!R< zq3;UqMW_rZi67d9x)fSBPC`rZhxYFuZJSj76e#jf`wt)e)BeLpKeYez(GTtaeDvQ8 zDk3HEPy62={Yw&N=BoI2ss7dX&$0enslN;G_gMeSQVX}@AFcdHRsWLY&Z_WYg^+k6DufE>}_z(It%wk$R z>nZ(jCEH>|HjU+u1Np^gAXZbGkIZa~!D2jRTn|2&>`^C!!%N6X+8%zox7tbQc)+;X@NfxXy(5^}S z(JHqrmne{?&?K2JHbyUg(Es>=1n|^B&wUMR^IZUw8I@lyL8vxpW5FAUw9l~Y7b_V} zZ!@COFFnQ&fdA}uEYiF~)9LHpe8o2;ix&l2ERX5n*9{Qz{e>UFJ@3@YBA*RG+&S?`f}GE>LU~UdgG=D%M%Zzi$+jwH$>285Y2Xt~zC^ zKGfOs^A7O$K=~b+#;yoYoTNn!i^msY7=u+EUo22o{hqCsC?TI4o@Y_;gWVX;EL08o zArwf9E-})OVZXGoRjo=_$zWuq4v_tf~v6a%uo zDqmPhKZ0W==6}&meI^k~!@h zGJGaG=EdKUUg%G@3tGa=pLQ8fvmb35Z8LDC?@M|lqE_LU41j`vv6g}cfBW?x^FL=R z3OQZ$s9tCC)2q?SV)5hs-Ogw@EckVr7Ete9hPY?4W&ZQe3M zj^rQbV;+oE_=$8i5VHs}L?G6jamRW+db2?3-$We}(%7 zCPCSOv%HU6Dq{@G*%U-iFKyuz^F@+lj2y8y 
zT1cUOhoU0L=5h@!e%IW2HCSX}UBDmumM>m%PJ7RiEZd(xLN>E6?+AKYLe;TcU7#Z(Wpf zoOfS1MdKz_cNK$k?7bvf`7E*D_ec*gI@%*2H6INHrczOVo6<_U3WaiCWP3P<2D2XZn)V?n8Iqf4n$3*tyP-UpkAD{FV<-Kq*_MtJVSQ3GQFt+E zjocmg3i*Cj|99Ty0RC>f5HGIY!V-el6_E|R|+GZu(J5zn9=)N91bfT+f@SZhWZ(d*1zK&+^{1XmKoeCTOuaZUFlre1?LB%s)e(?Q( z>~+rlieoMeh(#k?iuYlnVpeXt1(-%qrg$Dr@KFQoqd1dZ=l z^gT)V3wXY#41YQPcJSECnDrILY?AOo@E68hRnz)v2|s=FUhyogJqd<3iE%Mp7sl); zp3v(n`0X1>OOL2n^~8e}zKd{LINrU5HXr2H)Eo5I{C`^C1JuHo4-YG1$coYDf0r@g zPo}vxN;(``TxMgb^E;U2sOSv=FJQY&^Tjb842&}-_-WUGJBBUb^Gvp>&n+)I8#0jY zEqs~p$&H+f=%1VZq320pzQEaC{F2NDHCnmK3tS{btheb2n!`F6#hDg&bN&U9F^C9f*fP1Z^p zKEEq_St$MIL$`fQe5<_XSp-dwc~$~BS>$d?7Eo_qkc1V5fNKe)6;l>rO2x^0wRpwopqz+hcC#!uz822+d zzomZ2#N`w?7wm6O5PYQ)YjWPy2?OZgm>m8rNA=-}h~_b2y(~6GvA_~_A#UO9JH1aC zea=}-Ka}D`bqu2`eV*lS0uf#=cg=RzOa+Ghpf8HXs);LRgXh6-68<|}w1-rECXW!< zk!C#Em3&sptDw0K_zU2Op|lVeZ&RQ*nKSf zR|%uJ;a?jfdc^Jz2$O~H>AeV-E&K&f!)WBtIxmeKBlN%r`o8)1fflCbfWV0E#<@s5 z`7ND`L7VVKvw!eviiHFanwyYp|5wWY-J4C07IjcIgt^FH9m6f89uhpO*sbE$pK3%g z7lmO-;f$@jFhkr7@t?N=g<2K~w*sxUR6V*7M|<%ZP*cRgxHW$XkMfmO51W?Jh}vo@ zCGxIZ3@9rTefUtiH?}Feuk7uDKMrw-^asVc{TgRvANErs#jTwK9(u%h+`i;uZm1fq zf}NOMS)A@CN6in``(BwL2gqcju}V4ty+*v%7i0Cp>_|!oNg`pt6a^t>9gdt$=%ucG zc({AC{rtG_X9B_$SrIx}BzNz8pgpgIf4=s#zKgW)p>Og=ux-EN8BDg?rfJ41i2R>! 
zjx?&tBpDGMyci>n2p)h6=s1wo7)0)C7*LUnI4UZaJOcy}4Jr_hARx%0FoFca5gj3d z%b^nvl>``6FbZU3!X+Dm7!1l05V=f9w)-Qv`}W`Nd$aHD=2xe``l`CCs;jH2yS`{3 z51AP7t#s4ieM@T6RDHhtHy^~7cdtF^JNsu668sJBt&k~$GDES;n{=Yc%?a8jgKF@>~&wLbr{En(( zPNx1a5cvZY4^9OV^ua}IrJO19mWbk}Yo0smMpe$g>v)zOJ_i>!FtMnaY*~}{S0=;L zDYT7bVg0spjX#vqXM%dgp!PW`KPxajHnXHPL*oi2jNz1X1zTM=&z6XlsQSOc`Ph5y zcTZrX#qEZB5U-LQot4!+=*xgbep z94pH#R=44r3Wl`$#^MyeH&q?riY=Re={eJz+%rfjj7W6J=lpTKLqSDeWt^j6?} zNue&)RmUGdx==vMxZ#rCyc@3@YVY8;Vp=ya&oLyE#CO;E1oobN862HMs&TKq-zjCW zRL;2_e8?_SxkLWw|GZjT9qvS6VMO1Gr|ZSsco?vqaFb_M*jiic-%??Uo9wvd_lDzJ zwLfm_WJsV>wq{ucy-BvqR8NWK{;N#3a zj`QKcV%W^h1*z%|e4#f2b^QzbR&XC}!TgN;BT-pm7MmgqCnxR*|m?Tf&iXGv>CQ|b$hilLR-jintj zlvU1wsNG|6w!1@Hk7X5bRfcLJ-wMv>PDun&l93wJcBPD0-Xa^?tBLB;C4UJMkfkt@ zFvJaxUYnXse1UhVDq#$oNR;zP|V7Vk2OQ-0|tD^v=>^Y~*WA^I0EU3a2i0dXu=W z>^uxA)h~X72>YFD3{3??^!=HVS#5TJ(aP}ivUeYzE(VUj8MxY@Ju}nk-|YYX;F%HH z##8Mhx`a;S*~N%nFYOJ*r&Brz&7V5%CeD3Gp_xev+(>&s6z}lyu(gvLoncOyb8XIg z-|{z3ss^9)K_eC%dQ@p`0BZu+k*}$-Nyn3Qxtlp5Rzyjqsm--@s4}MWmWS`%KJ(Ts zi5hfE&jjVexwGcNNk4M(8t1$vS0+C1K_8uaiX#RX0hbk766Qr3c>#9X>FRItmQqay z>h(HevP~~@ZgY5#*cX*|_=^iFz>UJ|be|=YXD9hJF1*3fAH(S+Eo%mM$0yK^|jc0K?QzM$VGCV%U6ijQ3SY*Nv4@Sm ztie(>HThY8mfh7?hL`X4hv0;X+Pe8$1v`p2VF-nGz7uKRlY4BQu{+V?ogSI(Q(R_H z|M_vTrcwJe9W#;SU4a+rzPwxp7`B>Mzd?oAl}fth56CD?0Sc^cV1l4K1|jZGpAP)_ zacvSbNhnnPR(R5-2aOVh3owEa3V1g(?g^$RF!%RHgADH)%SUXsoNuuPnoJbRv(A=Q zW#6(>#UHvT0JckOh`ya5q8-tx}pu$Isr8`SzL0CJA)ZvQOanW4dGC%lk)YxgSYiaO9t)AV?1l&CM zWWGTOx(z-K5Obbr>VUO`s()hcdVFgEWo7$^s=r<#n=3N*op~ z`!uBn%nvdumd3s`&gBDAiX^Q09QcN7paf((YFn>znl=QUXBoxn$tY(9D(4jZ3soa& z1H%vQ0=da}gnl7j*U~V-c-KySLaR_?_xQx)pysEuZ&NL6H|v9~u2(Tt?V?8>#~9Tf zdDd#!AX2niojCD#M*Kb-jd^L%_8NGQc?9OuVX7a2ad0@j^G>m~lLg`AwBA&*JP1p) z_iq!P+wPt)1~O?-=5L&q;_evO&^DIvYRm@?WSwRll!dTaI|w~b-%>kR*(E;$&jRiN z&@Rp0(i2mkrBGR7icCZ)GjXp8xL19sMcC?$hzXb{+Iys)+yq?u%!$3p8f_jNAJ6J_ zqxXvf^Kv4(k8}{aD-0~QjEO;rq&Wh6B5KZDQHKCz%nWKD_c^s*kHw$xjgT0pu1#O z-l=W5+-+S#{;cX{`yw>dqBu<}gx^Haf+Uu%p6>-M6bd|2^ezA?dKEwmy??^1e+v0N 
s>)Va%y=9)ryuDKp27vbe#_wUoqUCz5Ov#Y?G#s$w_lNAtY&>KB4yqZZIRF3v literal 0 HcmV?d00001 diff --git a/docs/index.md b/docs/index.md index 32a03bd3e..f7fc045f3 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,220 +1,29 @@ --- -title: Hoodie - Upserts & Incrementals On Hadoop +title: Hoodie Overview keywords: homepage tags: [getting_started] sidebar: mydoc_sidebar permalink: index.html +summary: "Hoodie lowers data latency across the board, while simultaenously achieving orders of magnitude of efficiency over traditional batch processing." --- -Hoodie - Spark Library For Upserts & Incremental Consumption -============================================================= - -- - - - - -# Core Functionality # - -Hoodie provides the following abilities on a Hive table - - * Upsert (how do I change the table efficiently?) - * Incremental consumption (how do I obtain records that changed?) -Ultimately, make the built Hive table, queryable via Spark & Presto as well. +Hoodie manages storage of large analytical datasets on [HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) and serve them out via two types of tables + + * **Read Optimized Table** - Provides excellent query performance via purely columnar storage (e.g. [Parquet](https://parquet.apache.org/)) + * **Near-Real time Table** - Provides queries on real-time data, using a combination of columnar & row based storage (e.g Parquet + [Avro](http://avro.apache.org/docs/current/mr.html)) -# Code & Project Structure # +{% include image.html file="hoodie_intro_1.png" alt="hoodie_intro_1.png" %} - * hoodie-client : Spark client library to take a bunch of inserts + updates and apply them to a Hoodie table - * hoodie-common : Common code shared between different artifacts of Hoodie +By carefully managing how data is laid out on storage & how its exposed to queries, Hoodie is able to power a rich data ecosystem where external sources can be ingested into Hadoop in near-real time. 
+The ingested data is then available for interactive SQL engines like [Presto](https://prestodb.io) & [Spark](https://spark.apache.org/sql/), +while at the same time capable of being consumed incrementally from processing/ETL frameworks like [Hive](https://hive.apache.org/) & [Spark](https://spark.apache.org/docs/latest/) to build derived (hoodie) datasets. + +Hoodie broadly consists of a self-contained Spark library to build datasets and integrations with existing query engines for data access. - We have embraced the [Google Java code style](https://google.github.io/styleguide/javaguide.html). Please setup your IDE accordingly with style files from [here] (https://github.com/google/styleguide) +{% include callout.html content="Hoodie is a young project. Near-Real time Table implementation is currently underway. Get involved [here](https://github.com/uber/hoodie/projects/1)" type="info" %} - -# Quickstart # - -Check out code and pull it into Intellij as a normal maven project. -> You might want to add your spark assembly jar to project dependencies under "Module Setttings", to be able to run Spark from IDE - -Setup your local hadoop/hive test environment. See [this](http://www.bytearray.io/2016/05/setting-up-hadoopyarnsparkhive-on-mac.html) for reference - -## Run the Hoodie Test Job ## - -Create the output folder on your local HDFS -``` -hdfs dfs -mkdir -p /tmp/hoodie/sample-table -``` - -You can run the __HoodieClientExample__ class, to place a set of inserts + updates onto your HDFS at /tmp/hoodie/sample-table - -## Access via Hive ## - -Add in the hoodie-mr jar so, Hive can pick up the right files to hit, to answer the query. 
- -``` -hive> add jar file:///tmp/hoodie-mr-0.1.jar; -Added [file:///tmp/hoodie-mr-0.1.jar] to class path -Added resources: [file:///tmp/hoodie-mr-0.1.jar] -``` - -Then, you need to create a table and register the sample partitions - - -``` -drop table hoodie_test; -CREATE EXTERNAL TABLE hoodie_test(`_row_key` string, -`_hoodie_commit_time` string, -`_hoodie_commit_seqno` string, - rider string, - driver string, - begin_lat double, - begin_lon double, - end_lat double, - end_lon double, - fare double) -PARTITIONED BY (`datestr` string) -ROW FORMAT SERDE - 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' -STORED AS INPUTFORMAT - 'com.uber.hoodie.hadoop.HoodieInputFormat' -OUTPUTFORMAT - 'com.uber.hoodie.hadoop.HoodieOutputFormat' -LOCATION - 'hdfs:///tmp/hoodie/sample-table'; - -ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2016-03-15') LOCATION 'hdfs:///tmp/hoodie/sample-table/2016/03/15'; -ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2015-03-16') LOCATION 'hdfs:///tmp/hoodie/sample-table/2015/03/16'; -ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2015-03-17') LOCATION 'hdfs:///tmp/hoodie/sample-table/2015/03/17'; -``` - -Let's first perform a query on the latest committed snapshot of the table - -``` -hive> select count(*) from hoodie_test; -... 
-OK -100 -Time taken: 18.05 seconds, Fetched: 1 row(s) -hive> -``` - - -Let's now perform a query, to obtain the changed rows since a commit in the past - -``` -hive> set hoodie.scan.mode=INCREMENTAL; -hive> set hoodie.last.commitTs=001; -hive> select `_hoodie_commit_time`, rider, driver from hoodie_test limit 10; -OK -All commits :[001, 002] -002 rider-001 driver-001 -002 rider-001 driver-001 -002 rider-002 driver-002 -002 rider-001 driver-001 -002 rider-001 driver-001 -002 rider-002 driver-002 -002 rider-001 driver-001 -002 rider-002 driver-002 -002 rider-002 driver-002 -002 rider-001 driver-001 -Time taken: 0.056 seconds, Fetched: 10 row(s) -hive> -hive> -``` - - -## Access via Spark ## - -Spark is super easy, once you get Hive working as above. Just spin up a Spark Shell as below - -``` -$ cd $SPARK_INSTALL -$ export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop -$ spark-shell --jars /tmp/hoodie-mr-0.1.jar --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false - - -scala> sqlContext.sql("show tables").show(10000) -scala> sqlContext.sql("describe hoodie_test").show(10000) -scala> sqlContext.sql("select count(*) from hoodie_test").show(10000) -``` - - - -## Access via Presto ## - -Checkout the 'hoodie-integration' branch, build off it, and place your installation somewhere. 
- -* Copy the hoodie-mr jar into $PRESTO_INSTALL/plugin/hive-hadoop2/ - -* Change your catalog config, to make presto respect the __HoodieInputFormat__ - -``` -$ cat etc/catalog/hive.properties -connector.name=hive-hadoop2 -hive.metastore.uri=thrift://localhost:10000 -hive.respect-input-format-splits=true -``` - -startup your server and you should be able to query the same Hive table via Presto - -``` -show columns from hive.default.hoodie_test; -select count(*) from hive.default.hoodie_test -``` - -> NOTE: As of now, Presto has trouble accessing HDFS locally, hence create a new table as above, backed on local filesystem file:// as a workaround - -# Planned # -* Support for Self Joins - As of now, you cannot incrementally consume the same table more than once, since the InputFormat does not understand the QueryPlan. -* Hoodie Spark Datasource - Allows for reading and writing data back using Apache Spark natively (without falling back to InputFormat), which can be more performant -* Hoodie Presto Connector - Allows for querying data managed by Hoodie using Presto natively, which can again boost [performance](https://prestodb.io/docs/current/release/release-0.138.html) - - -# Hoodie Admin CLI -# Launching Command Line # - - - -* mvn clean install in hoodie-cli -* ./hoodie-cli - -If all is good you should get a command prompt similar to this one -``` -prasanna@:~/hoodie/hoodie-cli$ ./hoodie-cli.sh -16/07/13 21:27:47 INFO xml.XmlBeanDefinitionReader: Loading XML bean definitions from URL [jar:file:/home/prasanna/hoodie/hoodie-cli/target/hoodie-cli-0.1-SNAPSHOT.jar!/META-INF/spring/spring-shell-plugin.xml] -16/07/13 21:27:47 INFO support.GenericApplicationContext: Refreshing org.springframework.context.support.GenericApplicationContext@372688e8: startup date [Wed Jul 13 21:27:47 UTC 2016]; root of context hierarchy -16/07/13 21:27:47 INFO annotation.AutowiredAnnotationBeanPostProcessor: JSR-330 'javax.inject.Inject' annotation found and supported for autowiring 
-============================================ -* * -* _ _ _ _ * -* | | | | | (_) * -* | |__| | ___ ___ __| |_ ___ * -* | __ |/ _ \ / _ \ / _` | |/ _ \ * -* | | | | (_) | (_) | (_| | | __/ * -* |_| |_|\___/ \___/ \__,_|_|\___| * -* * -============================================ - -Welcome to Hoodie CLI. Please type help if you are looking for help. -hoodie-> -``` - -# Commands # - - * connect --path [dataset_path] : Connect to the specific dataset by its path - * commits show : Show all details about the commits - * commits refresh : Refresh the commits from HDFS - * commit rollback --commit [commitTime] : Rollback a commit - * commit showfiles --commit [commitTime] : Show details of a commit (lists all the files modified along with other metrics) - * commit showpartitions --commit [commitTime] : Show details of a commit (lists statistics aggregated at partition level) - - * commits compare --path [otherBasePath] : Compares the current dataset commits with the path provided and tells you how many commits behind or ahead - * stats wa : Calculate commit level and overall write amplification factor (total records written / total records upserted) - * help - -## Contributing -We :heart: contributions. If you find a bug in the library or would like to add new features, go ahead and open -issues or pull requests against this repo. Before you do so, please sign the -[Uber CLA](https://docs.google.com/a/uber.com/forms/d/1pAwS_-dA1KhPlfxzYLBqK6rsSWwRwH95OCCZrcsY5rk/viewform). -Also, be sure to write unit tests for your bug fix or feature to show that it works as expected. 
diff --git a/docs/powered_by.md b/docs/powered_by.md index 9153034c8..f4e8e7d2d 100644 --- a/docs/powered_by.md +++ b/docs/powered_by.md @@ -5,5 +5,7 @@ sidebar: mydoc_sidebar permalink: powered_by.html --- -Work In Progress +## Uber +Hoodie was originally developed at [Uber](https://uber.com) to achieve [low-latency database ingestion with high efficiency](http://www.slideshare.net/vinothchandar/hadoop-strata-talk-uber-your-hadoop-has-arrived/32). +It has been in production since Aug 2016, powering highly business-critical tables on Hadoop (7 of the 10 most used, including trips, riders, and partners, totalling 100s of TBs). It also powers several incremental Hive ETL pipelines and is currently being integrated into Uber's data dispersal system. diff --git a/docs/quickstart.md b/docs/quickstart.md index 95c2ef25a..6642accd5 100644 --- a/docs/quickstart.md +++ b/docs/quickstart.md @@ -6,5 +6,149 @@ sidebar: mydoc_sidebar permalink: quickstart.html --- -Work In Progress + + + +## Download Hoodie + +Check out the code and pull it into IntelliJ as a normal maven project. + +Build the maven project normally, from the command line +``` +$ mvn clean install -DskipTests +``` + +{% include callout.html content="You might want to add your spark assembly jar to project dependencies under 'Module Settings', to be able to run Spark from the IDE" type="info" %} + +{% include note.html content="Set up your local hadoop/hive test environment, so you can play with the entire ecosystem. 
See [this](http://www.bytearray.io/2016/05/setting-up-hadoopyarnsparkhive-on-mac.html) for reference" %} + + + +## Generate a Hoodie Dataset + +Create the output folder on your local HDFS +``` +hdfs dfs -mkdir -p /tmp/hoodie/sample-table +``` + +You can run the __HoodieClientExample__ class to place two commits (commit 1 => 100 inserts, commit 2 => 100 updates to the previously inserted 100 records) onto your HDFS at /tmp/hoodie/sample-table + + +## Register Dataset to Hive Metastore + +Add in the hoodie-hadoop-mr jar so Hive can read the Hoodie dataset and answer the query. + +``` +hive> add jar file:///tmp/hoodie-hadoop-mr-0.2.7.jar; +Added [file:///tmp/hoodie-hadoop-mr-0.2.7.jar] to class path +Added resources: [file:///tmp/hoodie-hadoop-mr-0.2.7.jar] +``` + +Then, you need to create a ReadOptimized table as below (the only type supported as of now) and register the sample partitions + + +``` +drop table hoodie_test; +CREATE EXTERNAL TABLE hoodie_test(`_row_key` string, +`_hoodie_commit_time` string, +`_hoodie_commit_seqno` string, + rider string, + driver string, + begin_lat double, + begin_lon double, + end_lat double, + end_lon double, + fare double) +PARTITIONED BY (`datestr` string) +ROW FORMAT SERDE + 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' +STORED AS INPUTFORMAT + 'com.uber.hoodie.hadoop.HoodieInputFormat' +OUTPUTFORMAT + 'com.uber.hoodie.hadoop.HoodieOutputFormat' +LOCATION + 'hdfs:///tmp/hoodie/sample-table'; + +ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2016-03-15') LOCATION 'hdfs:///tmp/hoodie/sample-table/2016/03/15'; +ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2015-03-16') LOCATION 'hdfs:///tmp/hoodie/sample-table/2015/03/16'; +ALTER TABLE `hoodie_test` ADD IF NOT EXISTS PARTITION (datestr='2015-03-17') LOCATION 'hdfs:///tmp/hoodie/sample-table/2015/03/17'; +``` + +## Querying The Dataset + +Now, we can proceed to query the dataset, as we would normally do across all three query engines 
supported.
+
+### HiveQL
+
+Let's first perform a query on the latest committed snapshot of the table
+
+```
+hive> select count(*) from hoodie_test;
+...
+OK
+100
+Time taken: 18.05 seconds, Fetched: 1 row(s)
+hive>
+```
+
+### SparkSQL
+
+Spark is super easy once you get Hive working as above. Just spin up a Spark shell as below
+
+```
+$ cd $SPARK_INSTALL
+$ export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
+$ spark-shell --jars /tmp/hoodie-hadoop-mr-0.2.7.jar --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false
+
+
+scala> sqlContext.sql("show tables").show(10000)
+scala> sqlContext.sql("describe hoodie_test").show(10000)
+scala> sqlContext.sql("select count(*) from hoodie_test").show(10000)
+```
+
+
+### Presto
+
+Check out the 'master' branch of OSS Presto, build it, and place your installation somewhere.
+
+* Copy the hoodie-hadoop-mr-0.2.7 jar into $PRESTO_INSTALL/plugin/hive-hadoop2/
+* Start up your server and you should be able to query the same Hive table via Presto
+
+```
+show columns from hive.default.hoodie_test;
+select count(*) from hive.default.hoodie_test;
+```
+
+
+
+## Incremental Queries
+
+Let's now perform a query to obtain __only__ the rows changed since a commit in the past.
+
+```
+hive> set hoodie.scan.mode=INCREMENTAL;
+hive> set hoodie.last.commitTs=001;
+hive> select `_hoodie_commit_time`, rider, driver from hoodie_test limit 10;
+OK
+All commits :[001, 002]
+002 rider-001 driver-001
+002 rider-001 driver-001
+002 rider-002 driver-002
+002 rider-001 driver-001
+002 rider-001 driver-001
+002 rider-002 driver-002
+002 rider-001 driver-001
+002 rider-002 driver-002
+002 rider-002 driver-002
+002 rider-001 driver-001
+Time taken: 0.056 seconds, Fetched: 10 row(s)
+hive>
+```
+
+
+
+
+
+
diff --git a/docs/roadmap.md b/docs/roadmap.md
index 8c376e544..692240c92 100644
--- a/docs/roadmap.md
+++ b/docs/roadmap.md
@@ -5,6 +5,10 @@ sidebar: mydoc_sidebar
 permalink: roadmap.html
 ---
 
-Work In Progress
+## Planned Features
+
+* Support for Self Joins - As of now, you cannot incrementally consume the same table more than once, since the InputFormat does not understand the QueryPlan.
+* Hoodie Spark Datasource - Allows for reading and writing data back using Apache Spark natively (without falling back to the InputFormat), which can be more performant
+* Hoodie Presto Connector - Allows for querying data managed by Hoodie using Presto natively, which can again boost [performance](https://prestodb.io/docs/current/release/release-0.138.html)
 
diff --git a/docs/use_cases.md b/docs/use_cases.md
index 0cf54607b..b0a16ab9e 100644
--- a/docs/use_cases.md
+++ b/docs/use_cases.md
@@ -3,7 +3,77 @@ title: Use Cases
 keywords: usecases
 sidebar: mydoc_sidebar
 permalink: use_cases.html
+toc: false
 ---
 
-Work In Progress
+Following are some sample use-cases for Hoodie.
+
+
+## Near Real-Time Ingestion
+
+Ingesting data from external sources (event logs, databases, etc.) into a [Hadoop Data Lake](http://martinfowler.com/bliki/DataLake.html) is a well known problem.
+In most (if not all) Hadoop deployments, it is unfortunately solved in a piecemeal fashion, using a medley of ingestion tools,
+even though this data is arguably the most valuable for the entire organization.
+
+
+For RDBMS ingestion, Hoodie provides __faster loads via Upserts__, as opposed to costly & inefficient bulk loads. For example, you can read the MySQL BIN log or a [Sqoop Incremental Import](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports) and apply the changes to an
+equivalent Hoodie table on HDFS. This would be much faster and more efficient than a [bulk merge job](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)
+or [complicated handcrafted merge workflows](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/).
+
+
+For NoSQL datastores like [Cassandra](http://cassandra.apache.org/) / [Voldemort](http://www.project-voldemort.com/voldemort/) / [HBase](https://hbase.apache.org/), even moderately big installations store billions of rows.
+It goes without saying that __full bulk loads are simply infeasible__ and more efficient approaches are needed if ingestion is to keep up with the typically high update volumes.
+
+
+Even for immutable data sources like [Kafka](http://kafka.apache.org), Hoodie helps __enforce a minimum file size on HDFS__, which improves [NameNode health](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/).
+This is all the more important in such a use-case, since event data is typically high volume (e.g. click streams) and, if not managed well, can cause serious damage to your Hadoop cluster.
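+The upsert-style ingestion described above can be pictured with a tiny, self-contained sketch. This is purely illustrative (a conceptual model, not the Hoodie API): a dataset keyed by record key, where each incremental batch touches only the changed keys instead of rewriting the whole table.
+
+```python
+# Minimal sketch of upsert-based ingestion (hypothetical model, not the Hoodie API).
+# The dataset is a dict keyed by record key; applying a change batch touches only
+# the changed keys, instead of re-merging the entire table.
+
+def upsert(dataset, change_batch):
+    """Apply inserts/updates from a change log (e.g. a DB BIN log) in place."""
+    for record in change_batch:
+        dataset[record["key"]] = record  # insert a new key or overwrite the old value
+    return dataset
+
+# Initial bulk load.
+table = {}
+upsert(table, [{"key": "trip-1", "fare": 10.0}, {"key": "trip-2", "fare": 7.5}])
+
+# Incremental batch: one update, one insert. Only 2 records are touched,
+# not the full table, which is why upserts beat periodic bulk merges.
+upsert(table, [{"key": "trip-1", "fare": 12.0}, {"key": "trip-3", "fare": 5.0}])
+
+print(len(table))               # 3 records total
+print(table["trip-1"]["fare"])  # updated to 12.0
+```
+
+With real Hoodie tables the same idea plays out at the file/partition level on HDFS, which is what makes incremental loads so much cheaper than repeated bulk merges.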
+
+Across all sources, Hoodie adds the much needed ability to atomically publish new data to consumers via the notion of commits, shielding them from partial ingestion failures.
+
+
+## Near Real-time Analytics
+
+Typically, real-time [datamarts](https://en.wikipedia.org/wiki/Data_mart) are powered by specialized analytical stores such as [Druid](http://druid.io/) or [Memsql](http://www.memsql.com/) or [even OpenTSDB](http://opentsdb.net/).
+This is absolutely perfect for lower scale ([relative to Hadoop installations like this](https://blog.twitter.com/2015/hadoop-filesystem-at-twitter)) data
+that needs sub-second query responses, such as system monitoring or interactive real-time analysis.
+But, these systems typically end up getting abused for less interactive queries as well, since data on Hadoop is intolerably stale. This leads to under-utilization & wasteful hardware/license costs.
+
+
+On the other hand, interactive SQL solutions on Hadoop such as Presto & SparkSQL excel in __queries that finish within a few seconds__.
+By bringing __data freshness to a few minutes__, Hoodie can provide a much more efficient alternative, as well as unlock real-time analytics on __datasets several orders of magnitude larger__ stored in HDFS.
+Also, Hoodie has no external dependencies (like a dedicated HBase cluster, purely used for real-time analytics) and thus enables faster analytics on much fresher data, without increasing the operational overhead.
+
+
+## Incremental Processing Pipelines
+
+One fundamental ability Hadoop provides is to build a chain of datasets derived from each other via DAGs expressed as workflows.
+Workflows often depend on new data being output by multiple upstream workflows, and traditionally, availability of new data is indicated by a new HDFS Folder/Hive Partition.
+Let's take a concrete example to illustrate this.
An upstream workflow `U` can create a Hive partition for every hour, with data for that hour (event_time) at the end of each hour (processing_time), providing an effective freshness of 1 hour.
+Then, a downstream workflow `D` kicks off immediately after `U` finishes, and does its own processing for the next hour, increasing the effective latency to 2 hours.
+
+The above paradigm simply ignores late arriving data, i.e. when `processing_time` and `event_time` drift apart.
+Unfortunately, in today's post-mobile & pre-IoT world, __late data from intermittently connected mobile devices & sensors is the norm, not an anomaly__.
+In such cases, the only remedy to guarantee correctness is to [reprocess the last few hours](https://falcon.apache.org/FalconDocumentation.html#Handling_late_input_data) worth of data,
+over and over again each hour, which can significantly hurt the efficiency across the entire ecosystem. For example, imagine reprocessing TBs worth of data every hour across hundreds of workflows.
+
+
+Hoodie comes to the rescue again, by providing a way to consume new data (including late data) from an upstream Hoodie dataset `HU` at record granularity (not folders/partitions),
+apply the processing logic, and efficiently update/reconcile late data with a downstream Hoodie dataset `HD`. Here, `HU` and `HD` can be continuously scheduled at a much more frequent schedule
+like every 15 minutes, providing an end-to-end latency of 30 minutes at `HD`.
+
+
+{% include callout.html content="To achieve this, Hoodie borrows concepts from stream processing frameworks like [Spark Streaming](https://spark.apache.org/docs/latest/streaming-programming-guide.html#join-operations), Pub/Sub systems like [Kafka](http://kafka.apache.org/documentation/#theconsumer)
+or database replication technologies like [Oracle XStream](https://docs.oracle.com/cd/E11882_01/server.112/e16545/xstrm_cncpt.htm#XSTRM187).
+
+For the more curious, a more detailed explanation of the benefits of Incremental Processing (compared to Stream Processing & Batch Processing) can be found [here](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop)" type="info" %}
+
+
+## Data Dispersal From Hadoop
+
+A popular use-case for Hadoop is to crunch data and then disperse it back into an online serving store, to be used by an application.
+For example, a Spark pipeline can [determine hard braking events on Hadoop](https://eng.uber.com/telematics/) and load them into a serving store like ElasticSearch,
+to be used by the Uber application to increase safe driving. Typical architectures for this employ a `queue` between Hadoop and the serving store, to prevent overwhelming the target serving store.
+A popular choice for this queue is Kafka, and this model often results in __redundant storage of the same data on HDFS (for offline analysis on computed results) and Kafka (for dispersal)__.
+
+Once again, Hoodie can solve this problem efficiently. Using the same example, the Spark pipeline can keep upserting the output from
+each run into a Hoodie dataset, which can then be incrementally tailed (just like a Kafka topic) for new data to be written into the serving store.
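+To make the incremental tailing idea concrete, here is a minimal sketch (a hypothetical model, not the actual Hoodie API) where every record carries a commit time, and each dispersal run pulls only the records committed after its last checkpoint:
+
+```python
+# Hypothetical sketch of incrementally tailing a commit-timestamped dataset
+# (illustrative only, not the Hoodie API). Each record carries a _commit_time;
+# the dispersal job remembers the last commit it consumed and reads only newer rows.
+
+def incremental_pull(records, last_commit_ts):
+    """Return records committed strictly after the given checkpoint."""
+    return [r for r in records if r["_commit_time"] > last_commit_ts]
+
+dataset = [
+    {"_commit_time": "001", "key": "driver-a", "hard_brakes": 2},
+    {"_commit_time": "001", "key": "driver-b", "hard_brakes": 0},
+    {"_commit_time": "002", "key": "driver-a", "hard_brakes": 3},  # updated by a later run
+]
+
+# The first dispersal run starts from the beginning and sees everything.
+first_batch = incremental_pull(dataset, "000")
+
+# The next run picks up only what changed since commit 001, much like tailing
+# a Kafka topic, but against the same HDFS copy used for offline analysis.
+second_batch = incremental_pull(dataset, "001")
+print(len(first_batch), len(second_batch))  # 3 1
+```
+
+Each run writes just its batch into the serving store and advances the checkpoint, filling the role the Kafka queue played, without keeping a second copy of the data.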