Apache Drill quick & dirty guide using LinkedIn contacts CSV export

TL;DR – this simple guide should take you around 5 to 10 mins tops to get hands on with Apache Drill

1. First you need to download the Apache Drill installer from the project homepage

Apache Drill project website => http://drill.apache.org

direct link to download => http://www.apache.org/dyn/closer.lua?filename=drill/drill-1.4.0/apache-drill-1.4.0.tar.gz&action=download

On a Unix / OS X / BSD / Linux system you can use one of the following CLI commands:

$ wget http://getdrill.org/drill/download/apache-drill-1.4.0.tar.gz
$ curl -o apache-drill-1.4.0.tar.gz http://getdrill.org/drill/download/apache-drill-1.4.0.tar.gz
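A download that gets cut off half way tends to fail later with a confusing tar error, so it can be worth a quick sanity check first. Here is a minimal sketch, assuming gzip is installed; the function name check_archive is my own invention and not part of Apache Drill. It only tests that the file exists and is a readable gzip stream, it does not verify a checksum or signature.

```shell
# check_archive FILE (hypothetical helper, not part of Drill):
# succeed only if FILE exists and passes gzip's built-in integrity test.
check_archive() {
    archive="$1"
    if [ ! -f "$archive" ]; then
        echo "missing: $archive" >&2
        return 1
    fi
    # gzip -t reads the whole stream and checks its CRC without writing anything
    if ! gzip -t "$archive" 2>/dev/null; then
        echo "corrupt: $archive" >&2
        return 1
    fi
    echo "ok: $archive"
}
```

Once the function is defined, running check_archive apache-drill-1.4.0.tar.gz in your download folder should print an "ok:" line if the archive arrived intact.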

2. We need a Java Development Kit ( JDK ) or Runtime Environment ( JRE ) installed

If you don’t already have Java installed I recommend you use the Oracle Java SE Development Kit ( JDK ) 7

direct link to download => http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html

When you visit that URL, scroll down until you find a series of tables listing operating systems and releases of Java for them.

Note you have to click on the check box labeled “Accept License Agreement” before it will allow you to click on the download link for your selected operating system.

The installer is around 200 MB so on an average ADSL / 4G link it will take around 3 to 4 minutes. Once it’s downloaded it should appear in your browser’s default “downloads” folder ( unless you specifically told it to save somewhere else of course ).

3. Install and check the version of Java you have ( my example uses a JRE )

I won’t cover the details of how to install or upgrade your version of Java. It’s all pretty straightforward: just follow the prompts in the installer packages, and there are good guides on the Oracle website if you get stuck.

Once you have installed or upgraded your Java Development Kit ( or Runtime Environment if you decided to go that route ), quickly check that it works and that the correct version is in your PATH, so that Apache Drill picks up the right Java when you start to play with it.

Run this command on the CLI to confirm Java runs and to see which version is in your PATH:

$ java -version
java version "1.7.0_40"
Java(TM) SE Runtime Environment (build 1.7.0_40-b43)
Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode)
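If you would rather have a script do the squinting for you, the version string can be pulled out and reduced to a single major number. This is only a sketch: the helper names java_major and installed_java_version are made up for illustration, and the only thing it relies on is that java -version prints a quoted version string to stderr, as in the output above.

```shell
# java_major VERSION_STRING (hypothetical helper): print the major Java version.
# Legacy strings such as "1.7.0_40" carry the major number in the second field;
# newer strings such as "11.0.2" carry it in the first.
java_major() {
    case "$1" in
        1.*) echo "$1" | cut -d. -f2 ;;
        *)   echo "$1" | cut -d. -f1 ;;
    esac
}

# installed_java_version (hypothetical helper): extract the quoted version
# string from `java -version`, which writes to stderr rather than stdout.
installed_java_version() {
    java -version 2>&1 | awk -F'"' '/version/ { print $2; exit }'
}
```

With the Java install shown above, java_major "$(installed_java_version)" would print 7, which matches the Java 7 recommended in step 2.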

4. Download and install Apache Drill

Change directory to where you downloaded Apache Drill to ( most likely ~/Downloads ), and extract the TAR GZ archive:

$ tar zxvf apache-drill-1.4.0.tar.gz

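Alternatively, if you have an older version of “tar” which doesn’t support the “z” flag, you can decompress with gunzip and pipe the result into tar ( on some systems, OS X included, plain “zcat” expects a “.Z” file and will refuse a “.gz” ):

$ gunzip -c apache-drill-1.4.0.tar.gz | tar xvf -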

Now you should see the full TAR archive expand into files and folders, something like this:

$ tar zxvf apache-drill-1.4.0.tar.gz
x apache-drill-1.4.0/KEYS
x apache-drill-1.4.0/LICENSE
x apache-drill-1.4.0/README.md
x apache-drill-1.4.0/NOTICE
x apache-drill-1.4.0/git.properties
x apache-drill-1.4.0/bin/runbit
x apache-drill-1.4.0/bin/hadoop-excludes.txt
x apache-drill-1.4.0/bin/drillbit.sh
.. truncated ..
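If you like to look before you leap, tar can list an archive without extracting anything. A small sketch follows; the function name list_top_level is hypothetical, and it assumes your tar understands the “z” flag.

```shell
# list_top_level ARCHIVE (hypothetical helper): print the distinct top-level
# entries inside a .tar.gz without extracting it.
list_top_level() {
    # t = list contents, z = gunzip on the fly, f = read from the named file
    tar tzf "$1" | cut -d/ -f1 | sort -u
}
```

Going by the listing above, list_top_level apache-drill-1.4.0.tar.gz should print the single folder name apache-drill-1.4.0, which tells you the archive unpacks into its own directory rather than spraying files into your current one.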

5. Now change directory into your extracted Apache Drill folder

$ cd apache-drill-1.4.0

Here’s what you should see once you’re in the Apache Drill folder

$ ls -l
total 192
-rw-r--r--@  1 dez  staff  18235  8 Dec 18:47 KEYS
-rw-r--r--@  1 dez  staff  63245  8 Dec 18:47 LICENSE
-rw-r--r--@  1 dez  staff    238  8 Dec 18:47 NOTICE
-rw-r--r--@  1 dez  staff   1297  8 Dec 18:47 README.md
drwxr-xr-x  13 dez  staff    442 18 Jan 15:09 bin
drwxr-xr-x   7 dez  staff    238 18 Jan 15:09 conf
-rw-r--r--@  1 dez  staff    693  8 Dec 20:12 git.properties
drwxr-xr-x  22 dez  staff    748 18 Jan 15:09 jars
drwxr-xr-x@  8 dez  staff    272  8 Dec 18:48 sample-data
drwxr-xr-x   3 dez  staff    102 18 Jan 15:09 winutils

6. Now we need a simple clean CSV dataset to have fun with

I recommend you use something fun, like the CSV export of your LinkedIn contacts.

Here are the ridiculously convoluted steps LinkedIn makes you take to do that:

+ login to http://linkedin.com
+ under “My Network” in the main menu select “Connections”
+ click on the tiny “sprocket” on the right hand side ( mouse over label is curiously “Settings” )
+ click on “Export LinkedIn Connections” on the right hand side under “Advanced Settings”
+ now click on the blue button labeled “Export”
+ you will now be asked to perform a droll “Security Verification” ( CAPTCHA ), enter text & click “Continue”
+ you will see a green bar appear announcing “Your connections were successfully exported.”
+ then about 5 seconds later a download popup will appear
+ click on “OK” to start the download ( it should end up in your Downloads folder )
+ you’ve now downloaded a file called “linkedin_connections_export_microsoft_outlook.csv”
+ this is the full CSV export of your entire LinkedIn network of connections ( yay! )

Now you have a dataset to play with once Java is installed ( if you don’t already have it ).

For easy access move it into the Apache Drill install folder:

$ mv ~/Downloads/linkedin_connections_export_microsoft_outlook.csv ~/Downloads/apache-drill-1.4.0
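When Drill reads a plain CSV file like this one it hands you each row as an array called columns, numbered from 0, so it helps to know which number goes with which field before you write a query. Here is a rough sketch that numbers the header row for you. The function name number_header_fields is my own, and the split on commas is deliberately naive: it assumes no header field contains a comma inside quotes.

```shell
# number_header_fields FILE (hypothetical helper): print every field of the
# first line with a 0-based index, matching Drill's columns[n] numbering.
# Assumption: header fields contain no embedded commas.
number_header_fields() {
    head -n 1 "$1" |
        tr -d '\r"' |      # drop Windows carriage returns and surrounding quotes
        tr ',' '\n' |      # one field per line
        awk '{ printf "%d: %s\n", NR - 1, $0 }'
}
```

Run number_header_fields linkedin_connections_export_microsoft_outlook.csv from the Apache Drill folder and read off the index of each field you care about.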

7. Right let’s get started – we’ll use Apache Drill in “embedded” mode

RTFM for the -u options ;-)

$ bin/drill-embedded -u jdbc:drill:zk=local
Jan 18, 2016 3:27:59 PM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.4.0
"drill baby drill"
0: jdbc:drill:zk=local>

Congratulations, you’re now successfully running Apache Drill in local “embedded” mode!

8. We’re ready to submit queries, here’s a simple example using my LinkedIn contacts

Let’s find all the MapR folk in my network ( note: Apache Drill needs the full path to find a data file ):

0: jdbc:drill:zk=local> select columns[1] as Firstname, columns[3] as Lastname, columns[29] as Company_Name, columns[31] as Job_Title from dfs.`/Users/dez/Downloads/apache-drill-1.4.0/linkedin_connections_export_microsoft_outlook.csv` where columns[29] like '%MapR%';
+-------------+-----------+--------------------+----------------------------------------+
| Firstname   | Lastname  | Company_Name       | Job_Title                              |
+-------------+-----------+--------------------+----------------------------------------+
| Justin      | Bock      | MapR Technologies  | Regional Sales Director ANZ            |
| Rajkumar    | Singh     | MapR Technologies  | Sr Product Specialist                  |
| Thong Hsin  | Sheng     | MapR Technologies  | Marketing Manager, Asia Pacific/Japan  |
+-------------+-----------+--------------------+----------------------------------------+
3 rows selected (0.244 seconds)
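Typing that whole query by hand gets old quickly, so here is a sketch of a shell function that prints the same query for any file path and company name. The function name build_company_query is hypothetical, the column positions ( 1, 3, 29, 31 ) are copied straight from the query above, and it does no escaping, so a company name containing a single quote or a path containing a backtick will produce broken SQL.

```shell
# build_company_query CSV_PATH COMPANY (hypothetical helper): print a Drill
# SQL query listing name, company and job title for every row whose company
# column contains COMPANY. No escaping is done on either argument.
build_company_query() {
    # %% prints a literal % for the SQL LIKE wildcard; '\'' embeds a single quote
    printf 'select columns[1] as Firstname, columns[3] as Lastname, columns[29] as Company_Name, columns[31] as Job_Title from dfs.`%s` where columns[29] like '\''%%%s%%'\'';\n' "$1" "$2"
}
```

For example, build_company_query "$PWD/linkedin_connections_export_microsoft_outlook.csv" MapR prints the query used above with your own path filled in, ready to paste at the Drill prompt. SQLLine, the shell Drill runs in, also documents a way to run a script file, but treat the exact option as something to check against your version rather than take from me.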

Just in case your browser doesn’t render the HTML PRE ( fixed width font ) for the example text above, this is a screenshot of what it should look like:

[ screenshot: apache-drill-1.4.0-guide ]

Bingo! You just ran your first SQL query of a CSV file using Apache Drill – your life is now complete ;-)

Give yourself a pat on the back, and now go have some fun playing with your SQL skills on a raw CSV file, and then RTFM and play with the other storage and format plugins for JSON, HDFS and Parquet data files / sources ( and many more ).

9. To quit Apache Drill just enter the “!quit” command

It will look something like this:

0: jdbc:drill:zk=local> !quit
Closing: org.apache.drill.jdbc.impl.DrillConnectionImpl

10. That’s it folks, hope you had fun.

Remember, you just experienced what it’s like to play with “live ammo” as it were, and you probably did it on a laptop or personal computer. It’s important to note that Apache Drill scales from a single laptop to tens of thousands of servers in a cluster. It is indeed a very powerful and flexible tool / platform, and used correctly it can render amazing results.

But, and I say but slowly and clearly, Apache Drill is not the answer to everything, and it is not a panacea for all things big data & analytics. It is merely one of many powerful open source tools at your beck and call, so use it well, use it wisely, and you will get amazing results and have fun doing it.

If you get stuck or have any issues, ping me on Twitter and I’ll be happy to help as much as I can to get you started. I’d love to hear about and see your results and get any feedback you might want to offer, so post screenshots on Twitter and “tag” me @dez_blanchfield and @ApacheDrill, and use the hashtag #ApacheDrill to share the love.

And if or when you get the opportunity, “pay it forward”, make time to show 2 or 3 others how to do this, and then in turn ask them to show 2 or 3 others, and watch what happens – you can help change the world for good ;-)

Cheers,
Dez

em: dez@gara.guru
mo: +61 414 464 356
ph: +61 2 8006 4700
sk: skype://dez_blanchfield
cv: http://j.mp/dez-cv-20151213
tw: http://twitter.com/dez_blanchfield
li: http://linkedin.com/in/dezblanchfield
pl: https://plus.google.com/+DezBlanchfield

 

Dez Blanchfield

Dez Blanchfield is a strategic leader in business & digital transformation, with three decades of global experience in Business and the Information Technology & Telecommunications industry, developing strategy and implementing business initiatives. He works with key industry sectors such as Federal & State Government, Defence, Banking & Finance, Airports & Aviation, Health, Transport, Telecommunications, Energy and Utilities, Mobile Digital Media and Advertising, and Cyber Security. His focus is driving outcomes for organisations by leveraging Digital Disruption, Digital Transformation, Cloud Computing, Big Data & Analytics, Machine Intelligence, Internet of Things, DevOps Integration, Automation & Orchestration, App Containerisation & Micro Services, Webscale Infrastructure, and High Performance Computing. Be sure to follow Dez on LinkedIn ( http://linkedin.com/in/dezblanchfield ) and Twitter ( http://twitter.com/dez_blanchfield ).
