Big Data on a Shoestring

Authors: Nicholas Bessmer

© 2013 Nicholas Bessmer

All Rights Reserved

 

 

Table of Contents

Introduction

NOTE: This is a technical subject. You may need to do some research on your own to learn some basic Linux commands. This is an introductory deep dive for business generalists who want to learn more about Hadoop. For a gentler introduction, see the companion volume Big Data for Small Business.

1 – Cassandra

2 – Hadoop

3 - Our Big Data Analytics Example Using Pig Latin Sample Script

Getting Our Tools Running on Our New Big Data Server

Getting The Linux Environment Set Up – Basic Steps

Editing Our Hadoop Configuration Files

Edit /conf/core-site.xml. I have used localhost in the value of fs.default.name

Edit /conf/mapred-site.xml.

Edit /conf/hdfs-site.xml. Since this test cluster has a single node, replication factor should be set to 1.

Format the name node (one per install). $ bin/hadoop namenode -format

Start all Hadoop components $ bin/hadoop-daemon.sh start namenode

Use the hadoop command-line tool to test the file system: $ hadoop dfs -ls /

Let’s Use PIG

Change to Pig Directory and Run Sample Script From The Tutorial

Conclusion

Appendix – Pig Script

Introduction

 

The companion volume Big Data for Small Business discusses how businesses can gain a competitive advantage by using Big Data techniques to filter out noise and determine trends in very large, unstructured data sets. Big Data is a toolbox to perform analysis on large (petabytes) sets of unstructured (Twitter, chats, web logs) data that change in near real-time.

 

But … there are two forks in the road in terms of how businesses can use Big Data:

 

» To process Operational Data that changes in real-time.

» To analyze trends in massive volumes of structured and unstructured data that are set aside by using batch jobs.

 

A business may benefit from both flavors of Big Data tools. We want to avoid getting too immersed in buzz words and stay focused on how to realize the greatest Big Data benefits for the least cost.

NOTE: This is a technical subject. You may need to do some research on your own to learn some basic Linux commands. This is an introductory deep dive for business generalists who want to learn more about Hadoop. For a gentler introduction, see the companion volume Big Data for Small Business.

1 – Cassandra

 

“In Greek mythology, Cassandra (Greek Κασσάνδρα, also Κασάνδρα) was the daughter of King Priam and Queen Hecuba of Troy. Her beauty caused Apollo to grant her the gift of prophecy.”

 

-Wikipedia

 

We are all familiar with ATMs, where each check that is deposited is considered a transaction: a discrete set of steps with a beginning and an end. Transactional systems in the database world have ways to make sure changes are saved properly, including discarding partial information. Cassandra is a distributed database that is not transactional in this sense; it is much more fluid and suited to operational data.

 

Imagine an airplane with thousands of measurements occurring in real-time. Everything from speed, height, thrust and navigation to the health of the airplane systems needs to be checked almost instantaneously. It is wasted effort to spend a lot of time making sure each data point is saved somewhere. Rather, the operational data of the plane needs to be fed to the command center (the pilots) in real-time with as little overhead as possible. Cassandra's strengths and typical uses include:

 


        
» Fault-tolerant peer-to-peer architecture

» Performance that can be easily tuned

» Session storage (imagine sites like Netflix with millions of people streaming videos)

» User data storage

» Scalable, low-latency storage for mobile apps

» Critical data storage

2 – Hadoop

 

As discussed in the companion volume Big Data for Small Business, Hadoop is really good at the following:

 


        
» Reporting on large amounts of unstructured data

» Ability to sort and perform simple calculations on large amounts of unstructured and structured data:

o Counting words: this is the standard MapReduce example

o High-volume analysis: gathering and analyzing large-scale ad network data

o Recommendation engines: analyzing browsing and purchasing patterns to recommend a product

o Social graphs: determining relationships between individuals
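
To make the word-counting example concrete, here is a minimal sketch of the map, shuffle and reduce steps in plain Python. It runs on one machine and does not use Hadoop at all; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration. On a real cluster, Hadoop would run the map and reduce steps in parallel across many machines and handle the grouping in between.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted counts by word."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce step: add up the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Run map, shuffle and reduce over an iterable of text lines."""
    pairs = []
    for line in lines:
        pairs.extend(map_phase(line))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    # Two illustrative lines of input; any text file would work the same way.
    sample = ["big data on a shoestring", "big data for small business"]
    print(word_count(sample))
```

The same three-step shape applies to the other items in the list: only the key that is emitted (a word, an ad, a product, a pair of people) and the calculation in the reduce step change.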

3 - Our Big Data Analytics Example Using Pig Latin Sample Script

 

For the purpose of this guide, we will work through setting up a Hadoop Big Data Analytics example and run a simple Pig Latin example script from the Pig tutorial. This will perform some analysis on search query logs from the Excite search engine.
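
To give a feel for the kind of question such an analysis answers, here is a rough sketch in plain Python that counts how many searches were made in each hour of the day. It is not the Pig tutorial script. It assumes each log line holds three tab-separated fields (user, a timestamp in YYMMDDHHMMSS form, and the query text), and the sample rows are invented for illustration.

```python
from collections import Counter


def hour_of(timestamp):
    """Return the two-digit hour from a YYMMDDHHMMSS timestamp (assumed format)."""
    return timestamp[6:8]


def queries_per_hour(lines):
    """Count non-empty queries per hour of day from tab-separated log lines."""
    counts = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed layout
        user, timestamp, query = parts
        if not query.strip():
            continue  # skip empty queries
        counts[hour_of(timestamp)] += 1
    return counts


if __name__ == "__main__":
    # Invented sample rows, not real Excite data.
    sample = [
        "u1\t970916072134\thadoop tutorial",
        "u2\t970916073000\tpig latin",
        "u1\t970916105959\tcassandra",
    ]
    for hour, total in sorted(queries_per_hour(sample).items()):
        print(hour, total)
```

A Pig Latin script expresses the same load, filter, group and count steps in a few lines and lets Hadoop spread the work over a cluster.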

 

For future reference, you can find huge data sets to test Big Data with at the following site:

http://aws.amazon.com/publicdatasets/

 

Some examples that can be useful to businesses:

 

» US and foreign census data

» Labor statistics

» Federal Reserve data

» Federal contracts

 

Here are examples that are useful to scientists:

 

» Daily global weather measurements

» Genome databases

 

We may want to use census data from our local metropolitan area to identify trends such as disposable income or demographics like where elderly or young people reside. This type of marketing savvy requires not only computer power but also the framework that Hadoop provides. Think of Hadoop as a toolbox that allows people to approach managing huge volumes of unstructured and structured data.

 

In Amazon’s example, these sample big data sets are accessible by signing up for their EC2 service. This is a metered service that allows businesses and institutions to run applications and services in the cloud. Amazon acts as a central utility, like the electric company, from which customers rent services, in this case computing power and data storage. EC2 is Amazon’s Elastic Compute Cloud, and what follows are the steps to set up an EC2 account through Amazon.

 

 

Here is the sign up screen to “rent” Amazon Web Services to run your application and database in the cloud. This is a metered service that fluctuates based on your demand. It will not break the bank.

 

 

Better yet, let’s sign up for the micro version.

 

Free Tier*

 

As part of AWS’s Free Usage Tier, new AWS customers can get started with Amazon EC2 for free. Upon sign-up, new AWS customers receive the following EC2 services each month for one year:

750 hours of EC2 running Linux/Unix Micro instance usage

750 hours of EC2 running Microsoft Windows Server Micro instance usage

750 hours of Elastic Load Balancing plus 15 GB data processing

30 GB of Amazon EBS Standard volume storage plus 2 million IOs and 1 GB snapshot storage

15 GB of bandwidth out aggregated across all AWS services

1 GB of Regional Data Transfer
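
A quick piece of arithmetic shows what the 750-hour figure means in practice. The sketch below, with made-up function names, compares the allowance with the number of hours in a month; it assumes the instance is left running around the clock.

```python
FREE_TIER_HOURS = 750  # monthly micro-instance allowance quoted above


def hours_in_month(days):
    """Total hours in a month with the given number of days."""
    return days * 24


def instances_covered(days, free_hours=FREE_TIER_HOURS):
    """How many always-on micro instances the allowance covers for that month."""
    return free_hours // hours_in_month(days)


if __name__ == "__main__":
    for days in (28, 30, 31):
        print(days, "days:", hours_in_month(days), "hours,",
              instances_covered(days), "always-on instance(s) covered")
```

Even the longest month has 744 hours, so the allowance is enough to leave one micro instance running all month, but not two.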

 

Not bad for a test drive of this service. You need to provide your credit card number and complete a phone validation (to make sure you are a real person). Remember: we want the micro instance to start off with. You will receive a confirmation email, and you should select MANAGE YOUR ACCOUNT. Sign in with the credentials that you created and:

 
