Hadoop: Data Processing and Modelling (eBook)


Author: Garry Turkington, Tanmay Deshpande, Sandeep Karanth

Publisher: Packt Publishing

Publication date: 2016-08-01

Word count: 5,378,000


Description
Unlock the power of your data with the Hadoop 2.X ecosystem and its data warehousing techniques across large data sets.

About This Book

  • Conquer the mountain of data using Hadoop 2.X tools
  • The authors create a clear context for Hadoop and its ecosystem
  • Hands-on examples and recipes give the bigger picture and help you master the Hadoop 2.X data processing platform
  • Overcome challenging data processing problems using this exhaustive Hadoop 2.X course

Who This Book Is For

This course is for Java developers who know scripting and want to shift their careers to the Hadoop and Big Data segment of the IT industry. Whether you are a Hadoop novice or an expert, this course will take you to the most advanced level of Hadoop 2.X.

What You Will Learn

  • Best practices for setting up and configuring Hadoop clusters, tailoring the system to the problem at hand
  • Integration with relational databases, using Hive for SQL queries and Sqoop for data transfer
  • Installing and maintaining a Hadoop 2.X cluster and its ecosystem
  • Advanced data analysis using Hive, Pig, and MapReduce programs
  • Machine learning principles with libraries such as Mahout, and batch and stream data processing using Apache Spark
  • The changes involved in the move from Hadoop 1.0 to Hadoop 2.0
  • YARN and Storm, and how to use YARN to integrate Storm with Hadoop
  • Deploying Hadoop on Amazon Elastic MapReduce, discovering HDFS replacements, and learning about HDFS Federation

In Detail

In today's age of Big Data it is often said that "data is eating the world": businesses produce data in huge volumes every day, and this rising tide of data needs to be organized and analyzed in a secure way. With proper and effective use of Hadoop you can build new, improved models and, based on them, make the right decisions.

The first module, Hadoop Beginner's Guide, walks you through understanding and using Hadoop with very detailed instructions. Commands are explained in sections called "What just happened?" for extra clarity.

The second module, Hadoop Real World Solutions Cookbook, 2nd edition, is an essential tutorial for implementing a big data warehouse in your business, with detailed, hands-on practice of the latest technologies such as YARN and Spark. Big data has become a key basis of competition and a new wave of productivity growth.

Once you are familiar with the basics and have implemented end-to-end big data use cases, you move on to the third module, Mastering Hadoop, which broadens your Hadoop skill set beyond the basic and advanced concepts. When you finish this course, you will be able to tackle real-world scenarios and become a big data expert, drawing on the tools and knowledge built up through the step-by-step tutorials and recipes.

Style and approach

This course covers everything from the basic concepts of Hadoop to the advanced mechanisms you need to master to become a big data expert. The goal is to help you learn the essentials through step-by-step tutorials and then move on to recipes that address real-world problems. It covers all the important aspects of Hadoop, from system design and Hadoop configuration to machine learning principles with various libraries, in chapters illustrated with code fragments and schematic diagrams.

This is a compendious course that explores Hadoop from the basics to the most advanced techniques available in Hadoop 2.X.
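The first module introduces the Hadoop Java API through WordCount, "the Hello World of MapReduce", and later revisits it with a combiner. As a rough taste of what that looks like, here is a minimal WordCount sketch written against the org.apache.hadoop.mapreduce API; it is not a listing from the book, and the class name, argument handling, and paths are illustrative assumptions only.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal WordCount sketch (illustrative, not the book's listing).
public class WordCount {

  // Mapper: emits (word, 1) for every token in each input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word; because it only adds,
  // it can also be reused as a combiner, as the first module discusses.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // combiner reuse
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input path
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a JAR, a job like this would typically be submitted with hadoop jar wordcount.jar WordCount <input> <output>, with both paths pointing at HDFS directories on your cluster.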
Contents

Cover

Table of Contents

Hadoop: Data Processing and Modelling

Hadoop: Data Processing and Modelling

Hadoop: Data Processing and Modelling

Credits

Preface

What this learning path covers

Hadoop Beginner's Guide

Hadoop Real World Solutions Cookbook, 2nd edition

Mastering Hadoop

What you need for this learning path

Who this learning path is for

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

Part 1. Module 1

Chapter 1. What It's All About

Big data processing

The value of data

Historically for the few and not the many

A different approach

Hadoop

Cloud computing with Amazon Web Services

Too many clouds

A third way

Different types of costs

AWS – infrastructure on demand from Amazon

What this book covers

Summary

Chapter 2. Getting Hadoop Up and Running

Hadoop on a local Ubuntu host

Other operating systems

Time for action – checking the prerequisites

What just happened?

Setting up Hadoop

Time for action – downloading Hadoop

What just happened?

Time for action – setting up SSH

What just happened?

Configuring and running Hadoop

Time for action – using Hadoop to calculate Pi

What just happened?

Three modes

Time for action – configuring the pseudo-distributed mode

What just happened?

Configuring the base directory and formatting the filesystem

Time for action – changing the base HDFS directory

What just happened?

Time for action – formatting the NameNode

What just happened?

Starting and using Hadoop

Time for action – starting Hadoop

What just happened?

Time for action – using HDFS

What just happened?

Time for action – WordCount, the Hello World of MapReduce

What just happened?

Have a go hero – WordCount on a larger body of text

Monitoring Hadoop from the browser

Using Elastic MapReduce

Setting up an account in Amazon Web Services

Time for action – WordCount on EMR using the management console

What just happened?

Have a go hero – other EMR sample applications

Other ways of using EMR

The AWS ecosystem

Comparison of local versus EMR Hadoop

Summary

Chapter 3. Understanding MapReduce

Key/value pairs

What it means

Why key/value data?

MapReduce as a series of key/value transformations

Pop quiz – key/value pairs

The Hadoop Java API for MapReduce

The 0.20 MapReduce Java API

Writing MapReduce programs

Time for action – setting up the classpath

What just happened?

Time for action – implementing WordCount

What just happened?

Time for action – building a JAR file

What just happened?

Time for action – running WordCount on a local Hadoop cluster

What just happened?

Time for action – running WordCount on EMR

What just happened?

The pre-0.20 Java MapReduce API

Hadoop-provided mapper and reducer implementations

Time for action – WordCount the easy way

What just happened?

Walking through a run of WordCount

Startup

Splitting the input

Task assignment

Task startup

Ongoing JobTracker monitoring

Mapper input

Mapper execution

Mapper output and reduce input

Partitioning

The optional partition function

Reducer input

Reducer execution

Reducer output

Shutdown

That's all there is to it!

Apart from the combiner…maybe

Time for action – WordCount with a combiner

What just happened?

Time for action – fixing WordCount to work with a combiner

What just happened?

Reuse is your friend

Pop quiz – MapReduce mechanics

Hadoop-specific data types

The Writable and WritableComparable interfaces

Introducing the wrapper classes

Time for action – using the Writable wrapper classes

What just happened?

Have a go hero – playing with Writables

Input/output

Files, splits, and records

InputFormat and RecordReader

Hadoop-provided InputFormat

Hadoop-provided RecordReader

OutputFormat and RecordWriter

Hadoop-provided OutputFormat

Don't forget Sequence files

Summary

Chapter 4. Developing MapReduce Programs

Using languages other than Java with Hadoop

How Hadoop Streaming works

Why to use Hadoop Streaming

Time for action – implementing WordCount using Streaming

What just happened?

Differences in jobs when using Streaming

Analyzing a large dataset

Getting the UFO sighting dataset

Getting a feel for the dataset

Time for action – summarizing the UFO data

What just happened?

Time for action – summarizing the shape data

What just happened?

Time for action – correlating sighting duration to UFO shape

What just happened?

Time for action – performing the shape/time analysis from the command line

What just happened?

Java shape and location analysis

Time for action – using ChainMapper for field validation/analysis

What just happened?

Have a go hero

Time for action – using the Distributed Cache to improve location output

What just happened?

Counters, status, and other output

Time for action – creating counters, task states, and writing log output

What just happened?

Too much information!

Summary

Chapter 5. Advanced MapReduce Techniques

Simple, advanced, and in-between

Joins

When this is a bad idea

Map-side versus reduce-side joins

Matching account and sales information

Time for action – reduce-side join using MultipleInputs

What just happened?

Implementing map-side joins

Have a go hero - Implementing map-side joins

To join or not to join...

Graph algorithms

Graph 101

Graphs and MapReduce – a match made somewhere

Representing a graph

Time for action – representing the graph

What just happened?

Overview of the algorithm

Time for action – creating the source code

What just happened?

Time for action – the first run

What just happened?

Time for action – the second run

What just happened?

Time for action – the third run

What just happened?

Time for action – the fourth and last run

What just happened?

Running multiple jobs

Final thoughts on graphs

Using language-independent data structures

Candidate technologies

Introducing Avro

Time for action – getting and installing Avro

What just happened?

Avro and schemas

Time for action – defining the schema

What just happened?

Time for action – creating the source Avro data with Ruby

What just happened?

Time for action – consuming the Avro data with Java

What just happened?

Using Avro within MapReduce

Time for action – generating shape summaries in MapReduce

What just happened?

Time for action – examining the output data with Ruby

What just happened?

Time for action – examining the output data with Java

What just happened?

Have a go hero – graphs in Avro

Going forward with Avro

Summary

Chapter 6. When Things Break

Failure

Embrace failure

Or at least don't fear it

Don't try this at home

Types of failure

Hadoop node failure

Time for action – killing a DataNode process

What just happened?

Have a go hero – NameNode log delving

Time for action – the replication factor in action

What just happened?

Time for action – intentionally causing missing blocks

What just happened?

Time for action – killing a TaskTracker process

What just happened?

Killing the cluster masters

Time for action – killing the JobTracker

What just happened?

Have a go hero – moving the JobTracker to a new host

Time for action – killing the NameNode process

What just happened?

Task failure due to software

Time for action – causing task failure

What just happened?

Have a go hero – HDFS programmatic access

Have a go hero – causing tasks to fail

Task failure due to data

Time for action – handling dirty data by using skip mode

What just happened?

Summary

Chapter 7. Keeping Things Running

A note on EMR

Hadoop configuration properties

Default values

Time for action – browsing default properties

What just happened?

Additional property elements

Default storage location

Where to set properties

Setting up a cluster

How many hosts?

Special node requirements

Storage types

Hadoop networking configuration

Time for action – examining the default rack configuration

What just happened?

Time for action – adding a rack awareness script

What just happened?

What is commodity hardware anyway?

Pop quiz – setting up a cluster

Cluster access control

The Hadoop security model

Time for action – demonstrating the default security

What just happened?

Working around the security model via physical access control

Managing the NameNode

Configuring multiple locations for the fsimage class

Time for action – adding an additional fsimage location

What just happened?

Swapping to another NameNode host

Time for action – swapping to a new NameNode host

What just happened?

Have a go hero – swapping to a new NameNode host

Managing HDFS

Where to write data

Using balancer

MapReduce management

Command line job management

Have a go hero – command line job management

Job priorities and scheduling

Time for action – changing job priorities and killing a job

What just happened?

Alternative schedulers

Scaling

Adding capacity to a local Hadoop cluster

Have a go hero – adding a node and running balancer

Adding capacity to an EMR job flow

Summary

Chapter 8. A Relational View on Data with Hive

Overview of Hive

Why use Hive?

Thanks, Facebook!

Setting up Hive

Prerequisites

Getting Hive

Time for action – installing Hive

What just happened?

Using Hive

Time for action – creating a table for the UFO data

What just happened?

Time for action – inserting the UFO data

What just happened?

Validating the data

Time for action – validating the table

What just happened?

Time for action – redefining the table with the correct column separator

What just happened?

Hive tables – real or not?

Time for action – creating a table from an existing file

What just happened?

Time for action – performing a join

What just happened?

Have a go hero – improve the join to use regular expressions

Hive and SQL views

Time for action – using views

What just happened?

Handling dirty data in Hive

Have a go hero – do it!

Time for action – exporting query output

What just happened?

Partitioning the table

Time for action – making a partitioned UFO sighting table

What just happened?

Bucketing, clustering, and sorting... oh my!

User-Defined Function

Time for action – adding a new User Defined Function (UDF)

What just happened?

To preprocess or not to preprocess...

Hive versus Pig

What we didn't cover

Hive on Amazon Web Services

Time for action – running UFO analysis on EMR

What just happened?

Using interactive job flows for development

Have a go hero – using an interactive EMR cluster

Integration with other AWS products

Summary

Chapter 9. Working with Relational Databases

Common data paths

Hadoop as an archive store

Hadoop as a preprocessing step

Hadoop as a data input tool

The serpent eats its own tail

Setting up MySQL

Time for action – installing and setting up MySQL

What just happened?

Did it have to be so hard?

Time for action – configuring MySQL to allow remote connections

What just happened?

Don't do this in production!

Time for action – setting up the employee database

What just happened?

Be careful with data file access rights

Getting data into Hadoop

Using MySQL tools and manual import

Have a go hero – exporting the employee table into HDFS

Accessing the database from the mapper

A better way – introducing Sqoop

Time for action – downloading and configuring Sqoop

What just happened?

Time for action – exporting data from MySQL to HDFS

What just happened?

Importing data into Hive using Sqoop

Time for action – exporting data from MySQL into Hive

What just happened?

Time for action – a more selective import

What just happened?

Time for action – using a type mapping

What just happened?

Time for action – importing data from a raw query

What just happened?

Have a go hero

Getting data out of Hadoop

Writing data from within the reducer

Writing SQL import files from the reducer

A better way – Sqoop again

Time for action – importing data from Hadoop into MySQL

What just happened?

Have a go hero

Time for action – importing Hive data into MySQL

What just happened?

Time for action – fixing the mapping and re-running the export

What just happened?

AWS considerations

Considering RDS

Summary

Chapter 10. Data Collection with Flume

A note about AWS

Data data everywhere...

Types of data

Getting network traffic into Hadoop

Time for action – getting web server data into Hadoop

What just happened?

Have a go hero

Getting files into Hadoop

Hidden issues

Introducing Apache Flume

A note on versioning

Time for action – installing and configuring Flume

What just happened?

Using Flume to capture network data

Time for action – capturing network traffic in a log file

What just happened?

Time for action – logging to the console

What just happened?

Writing network data to log files

Time for action – capturing the output of a command to a flat file

What just happened?

Time for action – capturing a remote file in a local flat file

What just happened?

Sources, sinks, and channels

Understanding the Flume configuration files

Have a go hero

It's all about events

Time for action – writing network traffic onto HDFS

What just happened?

Time for action – adding timestamps

What just happened?

To Sqoop or to Flume...

Time for action – multi level Flume networks

What just happened?

Time for action – writing to multiple sinks

What just happened?

Selectors replicating and multiplexing

Handling sink failure

Have a go hero - Handling sink failure

Next, the world

Have a go hero - Next, the world

The bigger picture

Data lifecycle

Staging data

Scheduling

Summary

Chapter 11. Where to Go Next

What we did and didn't cover in this book

Upcoming Hadoop changes

Alternative distributions

Why alternative distributions?

Other Apache projects

HBase

Oozie

Whir

Mahout

MRUnit

Other programming abstractions

Pig

Cascading

AWS resources

HBase on EMR

SimpleDB

DynamoDB

Sources of information

Source code

Mailing lists and forums

LinkedIn groups

HUGs

Conferences

Summary

Appendix A. Pop Quiz Answers

Chapter 3, Understanding MapReduce

Pop quiz – key/value pairs

Pop quiz – walking through a run of WordCount

Chapter 7, Keeping Things Running

Pop quiz – setting up a cluster

Part 2. Module 2

Chapter 1. Getting Started with Hadoop 2.X

Introduction

Installing a single-node Hadoop Cluster

Getting ready

How to do it...

How it works...

There's more

Installing a multi-node Hadoop cluster

Getting ready

How to do it...

How it works...

Adding new nodes to existing Hadoop clusters

Getting ready

How to do it...

How it works...

Executing the balancer command for uniform data distribution

Getting ready

How to do it...

How it works...

There's more...

Entering and exiting from the safe mode in a Hadoop cluster

How to do it...

How it works...

Decommissioning DataNodes

Getting ready

How to do it...

How it works...

Performing benchmarking on a Hadoop cluster

Getting ready

How to do it...

How it works...

Chapter 2. Exploring HDFS

Introduction

Loading data from a local machine to HDFS

Getting ready

How to do it...

How it works...

Exporting HDFS data to a local machine

Getting ready

How to do it...

How it works...

Changing the replication factor of an existing file in HDFS

Getting ready

How to do it...

How it works...

Setting the HDFS block size for all the files in a cluster

Getting ready

How to do it...

How it works...

Setting the HDFS block size for a specific file in a cluster

Getting ready

How to do it...

How it works...

Enabling transparent encryption for HDFS

Getting ready

How to do it...

How it works...

Importing data from another Hadoop cluster

Getting ready

How to do it...

How it works...

Recycling deleted data from trash to HDFS

Getting ready

How to do it...

How it works...

Saving compressed data in HDFS

Getting ready

How to do it...

How it works...

Chapter 3. Mastering Map Reduce Programs

Introduction

Writing the Map Reduce program in Java to analyze web log data

Getting ready

How to do it...

How it works...

Executing the Map Reduce program in a Hadoop cluster

Getting ready

How to do it

How it works...

Adding support for a new writable data type in Hadoop

Getting ready

How to do it...

How it works...

Implementing a user-defined counter in a Map Reduce program

Getting ready

How to do it...

How it works...

Map Reduce program to find the top X

Getting ready

How to do it...

How it works

Map Reduce program to find distinct values

Getting ready

How to do it

How it works...

Map Reduce program to partition data using a custom partitioner

Getting ready

How to do it...

How it works...

Writing Map Reduce results to multiple output files

Getting ready

How to do it...

How it works...

Performing Reduce side Joins using Map Reduce

Getting ready

How to do it

How it works...

Unit testing the Map Reduce code using MRUnit

Getting ready

How to do it...

How it works...

Chapter 4. Data Analysis Using Hive, Pig, and HBase

Introduction

Storing and processing Hive data in a sequential file format

Getting ready

How to do it...

How it works...

Storing and processing Hive data in the RC file format

Getting ready

How to do it...

How it works...

Storing and processing Hive data in the ORC file format

Getting ready

How to do it...

How it works...

Storing and processing Hive data in the Parquet file format

Getting ready

How to do it...

How it works...

Performing FILTER By queries in Pig

Getting ready

How to do it...

How it works...

Performing Group By queries in Pig

Getting ready

How to do it...

How it works...

Performing Order By queries in Pig

Getting ready

How to do it...

How it works...

Performing JOINS in Pig

Getting ready

How to do it...

How it works

Writing a user-defined function in Pig

Getting ready

How to do it...

How it works...

There's more...

Analyzing web log data using Pig

Getting ready

How to do it...

How it works...

Performing the HBase operation in CLI

Getting ready

How to do it

How it works...

Performing HBase operations in Java

Getting ready

How to do it

How it works...

Executing MapReduce programming with an HBase table

Getting ready

How to do it

How it works

Chapter 5. Advanced Data Analysis Using Hive

Introduction

Processing JSON data in Hive using JSON SerDe

Getting ready

How to do it...

How it works...

Processing XML data in Hive using XML SerDe

Getting ready

How to do it...

How it works

Processing Hive data in the Avro format

Getting ready

How to do it...

How it works...

Writing a user-defined function in Hive

Getting ready

How to do it

How it works...

Performing table joins in Hive

Getting ready

How to do it...

How it works...

Executing map side joins in Hive

Getting ready

How to do it...

How it works...

Performing context Ngram in Hive

Getting ready

How to do it...

How it works...

Call Data Record Analytics using Hive

Getting ready

How to do it...

How it works...

Twitter sentiment analysis using Hive

Getting ready

How to do it...

How it works

Implementing Change Data Capture using Hive

Getting ready

How to do it

How it works

Multiple table inserting using Hive

Getting ready

How to do it

How it works

Chapter 6. Data Import/Export Using Sqoop and Flume

Introduction

Importing data from RDBMS to HDFS using Sqoop

Getting ready

How to do it...

How it works...

Exporting data from HDFS to RDBMS

Getting ready

How to do it...

How it works...

Using query operator in Sqoop import

Getting ready

How to do it...

How it works...

Importing data using Sqoop in compressed format

Getting ready

How to do it...

How it works...

Performing Atomic export using Sqoop

Getting ready

How to do it...

How it works...

Importing data into Hive tables using Sqoop

Getting ready

How to do it...

How it works...

Importing data into HDFS from Mainframes

Getting ready

How to do it...

How it works...

Incremental import using Sqoop

Getting ready

How to do it...

How it works...

Creating and executing Sqoop job

Getting ready

How to do it...

How it works...

Importing data from RDBMS to HBase using Sqoop

Getting ready

How to do it...

How it works...

Importing Twitter data into HDFS using Flume

Getting ready

How to do it...

How it works

Importing data from Kafka into HDFS using Flume

Getting ready

How to do it...

How it works

Importing web logs data into HDFS using Flume

Getting ready

How to do it...

How it works...

Chapter 7. Automation of Hadoop Tasks Using Oozie

Introduction

Implementing a Sqoop action job using Oozie

Getting ready

How to do it...

How it works

Implementing a Map Reduce action job using Oozie

Getting ready

How to do it...

How it works...

Implementing a Java action job using Oozie

Getting ready

How to do it

How it works

Implementing a Hive action job using Oozie

Getting ready

How to do it...

How it works...

Implementing a Pig action job using Oozie

Getting ready

How to do it...

How it works

Implementing an e-mail action job using Oozie

Getting ready

How to do it...

How it works...

Executing parallel jobs using Oozie (fork)

Getting ready

How to do it...

How it works...

Scheduling a job in Oozie

Getting ready

How to do it...

How it works...

Chapter 8. Machine Learning and Predictive Analytics Using Mahout and R

Introduction

Setting up the Mahout development environment

Getting ready

How to do it...

How it works...

Creating an item-based recommendation engine using Mahout

Getting ready

How to do it...

How it works...

Creating a user-based recommendation engine using Mahout

Getting ready

How to do it...

How it works...

Using Predictive analytics on Bank Data using Mahout

Getting ready

How to do it...

How it works...

Clustering text data using K-Means

Getting ready

How to do it...

How it works...

Performing Population Data Analytics using R

Getting ready

How to do it...

How it works...

Performing Twitter Sentiment Analytics using R

Getting ready

How to do it...

How it works...

Performing Predictive Analytics using R

Getting ready

How to do it...

How it works...

Chapter 9. Integration with Apache Spark

Introduction

Running Spark standalone

Getting ready

How to do it...

How it works...

Running Spark on YARN

Getting ready

How to do it...

How it works...

Olympics Athletes analytics using the Spark Shell

Getting ready

How to do it...

How it works...

Creating Twitter trending topics using Spark Streaming

Getting ready

How to do it...

How it works...

Twitter trending topics using Spark Streaming

Getting ready

How to do it...

How it works...

Analyzing Parquet files using Spark

Getting ready

How to do it...

How it works...

Analyzing JSON data using Spark

Getting ready

How to do it...

How it works...

Processing graphs using GraphX

Getting ready

How to do it...

How it works...

Conducting predictive analytics using Spark MLlib

Getting ready

How to do it...

How it works...

Chapter 10. Hadoop Use Cases

Introduction

Call Data Record analytics

Getting ready

How to do it...

How it works...

Web log analytics

Getting ready

How to do it...

How it works...

Sensitive data masking and encryption using Hadoop

Getting ready

How to do it...

How it works...

Part 3. Module 3

Chapter 1. Hadoop 2.X

The inception of Hadoop

The evolution of Hadoop

Hadoop's genealogy

Hadoop 2.X

Yet Another Resource Negotiator (YARN)

Storage layer enhancements

Support enhancements

Hadoop distributions

Which Hadoop distribution?

Available distributions

Summary

Chapter 2. Advanced MapReduce

MapReduce input

The InputFormat class

The InputSplit class

The RecordReader class

Hadoop's "small files" problem

Filtering inputs

The Map task

The dfs.blocksize attribute

Sort and spill of intermediate outputs

Node-local Reducers or Combiners

Fetching intermediate outputs – Map-side

The Reduce task

Fetching intermediate outputs – Reduce-side

Merge and spill of intermediate outputs

MapReduce output

Speculative execution of tasks

MapReduce job counters

Handling data joins

Reduce-side joins

Map-side joins

Summary

Chapter 3. Advanced Pig

Pig versus SQL

Different modes of execution

Complex data types in Pig

Compiling Pig scripts

The logical plan

The physical plan

The MapReduce plan

Development and debugging aids

The DESCRIBE command

The EXPLAIN command

The ILLUSTRATE command

The advanced Pig operators

The advanced FOREACH operator

Specialized joins in Pig

User-defined functions

The evaluation functions

The load functions

The store functions

Pig performance optimizations

The optimization rules

Measurement of Pig script performance

Combiners in Pig

Memory for the Bag data type

Number of reducers in Pig

The multiquery mode in Pig

Best practices

The explicit usage of types

Early and frequent projection

Early and frequent filtering

The usage of the LIMIT operator

The usage of the DISTINCT operator

The reduction of operations

The usage of Algebraic UDFs

The usage of Accumulator UDFs

Eliminating nulls in the data

The usage of specialized joins

Compressing intermediate results

Combining smaller files

Summary

Chapter 4. Advanced Hive

The Hive architecture

The Hive metastore

The Hive compiler

The Hive execution engine

The supporting components of Hive

Data types

File formats

Compressed files

ORC files

The Parquet files

The data model

Dynamic partitions

Indexes on Hive tables

Hive query optimizers

Advanced DML

The GROUP BY operation

ORDER BY versus SORT BY clauses

The JOIN operator and its types

Advanced aggregation support

Other advanced clauses

UDF, UDAF, and UDTF

Summary

Chapter 5. Serialization and Hadoop I/O

Data serialization in Hadoop

Writable and WritableComparable

Hadoop versus Java serialization

Avro serialization

Avro and MapReduce

Avro and Pig

Avro and Hive

Comparison – Avro versus Protocol Buffers / Thrift

File formats

The Sequence file format

The MapFile format

Other data structures

Compression

Splits and compressions

Scope for compression

Summary

Chapter 6. YARN – Bringing Other Paradigms to Hadoop

The YARN architecture

Resource Manager (RM)

Application Master (AM)

Node Manager (NM)

YARN clients

Developing YARN applications

Writing YARN clients

Writing the Application Master entity

Monitoring YARN

Job scheduling in YARN

CapacityScheduler

FairScheduler

YARN commands

User commands

Administration commands

Summary

Chapter 7. Storm on YARN – Low Latency Processing in Hadoop

Batch processing versus streaming

Apache Storm

Architecture of an Apache Storm cluster

Computation and data modeling in Apache Storm

Use cases for Apache Storm

Developing with Apache Storm

Apache Storm 0.9.1

Storm on YARN

Installing Apache Storm-on-YARN

Installation procedure

Summary

Chapter 8. Hadoop on the Cloud

Cloud computing characteristics

Hadoop on the cloud

Amazon Elastic MapReduce (EMR)

Provisioning a Hadoop cluster on EMR

Summary

Chapter 9. HDFS Replacements

HDFS – advantages and drawbacks

Amazon AWS S3

Hadoop support for S3

Implementing a filesystem in Hadoop

Implementing an S3 native filesystem in Hadoop

Summary

Chapter 10. HDFS Federation

Limitations of the older HDFS architecture

Architecture of HDFS Federation

Benefits of HDFS Federation

Deploying federated NameNodes

HDFS high availability

Secondary NameNode, Checkpoint Node, and Backup Node

High availability – edits sharing

Useful HDFS tools

Three-layer versus four-layer network topology

HDFS block placement

Pluggable block placement policy

Summary

Chapter 11. Hadoop Security

The security pillars

Authentication in Hadoop

Kerberos authentication

The Kerberos architecture and workflow

Kerberos authentication and Hadoop

Authentication via HTTP interfaces

Authorization in Hadoop

Authorization in HDFS

Limiting HDFS usage

Service-level authorization in Hadoop

Data confidentiality in Hadoop

HTTPS and encrypted shuffle

Audit logging in Hadoop

Summary

Chapter 12. Analytics Using Hadoop

Data analytics workflow

Machine learning

Apache Mahout

Document analysis using Hadoop and Mahout

Term frequency

Document frequency

Term frequency – inverse document frequency

Tf-Idf in Pig

Cosine similarity distance measures

Clustering using k-means

K-means clustering using Apache Mahout

RHadoop

Summary

Chapter 13. Hadoop for Microsoft Windows

Deploying Hadoop on Microsoft Windows

Prerequisites

Building Hadoop

Configuring Hadoop

Deploying Hadoop

Summary

Appendix A. Bibliography

Index

A

B

C

D

E

F

G

H

I

J

K

L

M

N

O

P

Q

R

S

T

U

V

W

X

Y
