Step-by-step
1. Download and install the HDP Sandbox
Download the latest HDP Sandbox (2.3 as of this writing) here. Import it into VMware or VirtualBox, start the instance, and add a hosts entry on your host machine that points to the new instance's IP.
On Mac, edit /etc/hosts; on Windows, edit %systemroot%\system32\drivers\etc\hosts as administrator. In either case, add a line similar to the one below:
- 192.168.56.102 sandbox sandbox.hortonworks.com
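To confirm the entry works, ping the new hostname from your host machine (the IP above is only an example; use whatever address your VM was assigned):
- ping sandbox.hortonworks.com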
2. Download and install the latest NiFi release
Follow the directions here. These are the steps I executed for 0.4.1:
- cd /tmp
- wget http://apache.cs.utah.edu/nifi/0.4.1/nifi-0.4.1-bin.zip
- cd /opt/
- unzip /tmp/nifi-0.4.1-bin.zip
- useradd nifi
- chown -R nifi:nifi /opt/nifi-0.4.1/
- perl -pe 's/run.as=.*/run.as=nifi/' -i /opt/nifi-0.4.1/conf/bootstrap.conf
- perl -pe 's/nifi.web.http.port=8080/nifi.web.http.port=9090/' -i /opt/nifi-0.4.1/conf/nifi.properties
- /opt/nifi-0.4.1/bin/nifi.sh start
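The web port is moved from 8080 to 9090 above because Ambari already uses 8080 on the sandbox. After the start script returns, NiFi can take a minute or two to come up; an optional sanity check (the log path assumes the default 0.4.1 layout):
- tail -f /opt/nifi-0.4.1/logs/nifi-app.log
- curl -I http://sandbox.hortonworks.com:9090/nifi/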
3. Create a Solr dashboard to visualize the results
Download a new Solr dashboard, start the service, and create a new collection to store stock price changes:
- export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64
- wget https://raw.githubusercontent.com/vzlatkin/Stocks2HBaseAndSolr/master/Solr%20Dashboard.json -O /opt/lucidworks-hdpsearch/solr/server/solr-webapp/webapp/banana/app/dashboards/default.json
- /opt/lucidworks-hdpsearch/solr/bin/solr start -c -z localhost:2181
- /opt/lucidworks-hdpsearch/solr/bin/solr create -c stocks -d data_driven_schema_configs -s 1 -rf 1
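To double-check that Solr is up and the collection was created, the Collections API can list what exists; the Banana dashboard should also load at http://sandbox.hortonworks.com:8983/solr/banana/index.html (the port and paths assume the sandbox defaults):
- curl "http://sandbox.hortonworks.com:8983/solr/admin/collections?action=LIST"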
4. Create a new NiFi flow to pull from Google Finance API, transform, and store in HBase and Solr
Solr is used for indexing the data, the Banana UI is used for visualization, and HBase is used for future-proofing: it can be used to further analyze the data from Storm/Spark or to build a custom UI. To get the data into these tools, follow the steps below:
- Start HBase via Ambari
- Create a new table (a quick verification is sketched just after this list):
- hbase shell
- hbase(main):001:0> create 'stocks', 'cf'
- Then download this NiFi template to your host machine.
- To import the template, open the NiFi UI
- Open the Templates manager:
- Find the template on your local machine and import it:
- Drag and drop to instantiate a new template:
- Double click the new process group:
- You'll need to enable the HBase shared controller service. To do so, right-click the "Send to HBase" processor, click "Configure", then "Properties", and then the "Go to" arrow to access the controller service. Finally, click the "Enable" button.
- Now start all of the processors: hold down the Shift key, select all of the processors on the screen, and then click the Start button:
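As noted above, a quick way to verify the 'stocks' table, and, once the flow has been running for a minute, to see rows arriving, is from the HBase shell; describe and count are built-in shell commands:
- hbase shell
- hbase(main):001:0> describe 'stocks'
- hbase(main):002:0> count 'stocks'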
You should see a flow that looks like the screenshot below.
The reason for so many processors is that the response from the Google Finance API needs to be transformed. First, we remove the comment characters '//' from the response. Second, we split the JSON array into individual JSON objects. Third, we extract the relevant attributes. Fourth, the timestamp is formatted as if it were UTC but actually represents EST, so we correct the timezone. Finally, we send the information to HBase, Solr, and the NiFi bulletin board for logging.
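For reference, the first few of those transform steps can be approximated from the command line. This is only a sketch: the endpoint and the 't'/'e'/'l'/'lt' field names are assumptions based on the (since-retired) Google Finance quote API, jq stands in for the flow's split-and-extract steps, and the timezone correction is not shown:
- curl -s 'http://www.google.com/finance/info?q=NASDAQ:GOOG' | sed 's|^// ||' | jq -c '.[] | {symbol: .t, exchange: .e, price: .l, last_trade: .lt}'   # strip the '//' comment, split the array, keep the relevant attributes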