HADOOP practice for beginners with illustration
1. Prerequisite: Environment setup
2. HDFS system
2.1 Hadoop ecosystem overview
2.2 Current versions of each component (as of July 2016)
2.3 Download and install HDFS
Privilege settings
2.4 Verify the Hadoop version
2.5 Start HDFS
3. Eclipse
4. MySQL
4.1 Download and installation:
4.2 Download and install sample database
5. Hive
5.1 Download Hive
5.2 Configuring the Metastore Database
5.3 Run schematool
5.4 Cleanup
5.5 SchemaTool
5.6 Start Hive
5.7 Error troubleshooting
5.8 Configure Hive
5.9 Hive operations
6. ZooKeeper
7. HBase
7.1 Download
7.2 Extract and move to the Hadoop directory
7.3 Start the region server
7.4 Start the HBase shell
7.5 Run the status command
7.6 Create a table
7.7 Insert rows into an HBase table
7.8 Retrieve rows from an HBase table
7.9 Scan an HBase table for matching results with a filter
7.10 Access HBase via the GUI
7.11 Pseudo-distributed mode [1]
8. PIG
8.1 Download
8.2 Installation
8.3 Configuration
8.4 Start Pig
8.5 Using the Grunt shell
8.6 Loading data demo
9. Sqoop
9.1 Set up Sqoop
9.2 Sqoop and Hive
10. Hive Performance Tuning
10.1 Leveraging time-based partitioning
10.2 Set a custom schema
10.3 DISTRIBUTE BY … SORT BY vs. ORDER BY
10.4 Avoid “SELECT count(DISTINCT field) FROM tbl”
10.5 Considering the cardinality within GROUP BY
10.6 Partitioning
11. Flume
11.1 Download
11.2 Installation
12. Using Flume to load Twitter data into Hadoop
12.1 Prepare the Twitter account
12.2 Create a Twitter application
12.3 Create a conf file for the Flume job
12.4 Download the file
12.5 flume.conf sample for the Twitter agent
12.6 Set up the path in the conf file
12.7 The full conf file (the agent is named TwitterAgent)
12.8 Configuring Flume (Cloudera Manager path)
12.9 Start the Flume agent
13. Spark and Scala
13.1 Download
13.2 Installation
13.3 Run Spark
Create a simple text file
Create a simple RDD
14.1 Download
14.2 Installation
15. IPython
ELK
Appendix 1. Configure network for multiple nodes in hadoop cluster
1. Collect info from the Windows server
2. Collect info from a VM (pick any one, as they are cloned)
3. Set the VM to use a customized network definition: VMnet8 (NAT mode)
4. Configure the name server
5. Configure the hosts file
6. Configure the network interface
7. Restart the network service
8. Set the MAC address
9. Set the Linux security type
10. Set the fastest mirror
11. chkconfig NetworkManager off
12. service network stop
13. yum -y install perl openssl
14. ssh-keygen[2]
15. Set the hostname
Appendix 2. How to install a DNS server on a Hadoop cluster in CentOS 7
Our Goal[3]
Default domain
Install BIND on DNS Servers
Configure Primary DNS Server
Configure Bind
Configure Local File to specify DNS zones
Create Forward Zone File
Edit our forward zone file
Create Reverse Zone File(s)
Check
Start BIND
Test
Terminology & Reading[4]
Appendix 3. How to batch-generate SSH keys and send them to multiple servers
Method 1: Using Expect
Method 2: Using Python
Appendix 4. Load data from the file system to HDFS
Method 1: Using the hdfs command
Method 2: Using the hive command
Method 3: Using a cron job
CDH5
OOZIE
Appendix 5. Configuring Hadoop Security
Appendix 6. Move Data from MySQL to HDFS
Appendix 7. Load data into a table in Hive
Non-Python
Using Python
Appendix 8. Move Data (using Sqoop) from MySQL to Hive
Findings
Appendix 9. HDFS upgrade instructions
Before you install the new Hadoop version
Install the new Hadoop version
After you have installed the new Hadoop version
Finishing the HDFS upgrade process
How to finalize an HDFS upgrade
Appendix 10. Tune Hadoop Cluster to Get Maximum Performance
How does OS tuning improve Hadoop performance?
1. Turn off the power-saving option in the BIOS:
2. Limits on open file handles and files:
3. FileSystem Type & Reserved Space:
4. Network Parameters Tuning:
5. Transparent Huge Page Compaction:
6. Memory Swapping:
Appendix 11. The Hadoop Ecosystem in a nutshell
Appendix 12. Common Linux Knowledge
Kill a job
Kill a jps process
Soft links
Change a password to null
hdfs command not found
jps command not found
Check HDFS health
Check applications on HDFS
Visualize near-real-time stock price changes using Solr and Banana UI
Summary of steps
Step-by-step
Conclusion
Flume Near Real-Time Indexing Reference
Cloudera documentation reference
Regular-Expression Examples
Project 1: Bible Statistics
Loading data
Step-by-step:
Project 2: weblog analysis
Processing & Analytical goals:
Solution:
Disclaimer