Hadoop Hive

Hive is a data warehouse system from Apache. It provides a mechanism to project structure onto data stored in Hadoop and to query that data using a SQL-like language called HiveQL. The language also lets traditional MapReduce programmers plug in their custom mappers and reducers when it is inconvenient or inefficient to express that logic in HiveQL. You may refer to the PDF guides on Hive at the end of this section.

The main features of Hive are,

  1. Easy data summarization
  2. Ad-hoc querying
  3. Analysis of large datasets stored in Hadoop-compatible file systems
  4. A query language, HiveQL (Hive Query Language)

Hive is open-source software that provides a command-line interface (CLI) for writing Hive queries in the Hive Query Language (HQL). The syntax of the Hive Query Language is similar to that of the Structured Query Language. The major differences between HiveQL and SQL are,

  1. An HQL query executes on a Hadoop cluster of commodity machines rather than on a platform that would require expensive hardware for large data sets.
  2. Hive handles huge data sets.
  3. Internally, a Hive query executes as a series of automatically generated MapReduce jobs.
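As an illustration (the table and column names here are hypothetical), a simple aggregation like the following is typically compiled by Hive into a MapReduce job, with the map phase emitting one record per URL and the reduce phase summing the counts:

```sql
-- Hypothetical table: page_views(user_id STRING, url STRING)
-- Hive compiles this GROUP BY into a map phase (emit url, 1)
-- and a reduce phase (sum the counts for each url).
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;
```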

The three main Hive components are,

1)      Hadoop cluster

2)      Metadata store

3)      Warehouse Directory

Hadoop cluster

The cluster of inexpensive commodity computers on which the large data set is stored and all processing is performed.

Metadata Store

The location in which the description of the structure of the large data set is kept. The important point is that a standard database is used to store the metadata; it does not store the large data set itself. The database can be either local (existing on the same computer on which Karmasphere Analyst is installed) or remote (accessible over the network to multiple users).

Warehouse directory

This is a scratch-pad storage location that Hive uses to store and cache working files. It includes,

1)      Newly created tables

2)      Temporary results from user queries.

For processing and communication efficiency, it is typically located on the Hadoop Distributed File System (HDFS) of the Hadoop cluster.

Hive does not own the format in which data is stored in the Hadoop file system (HDFS). Users can write files to HDFS with whatever tools or mechanisms they prefer and configure Hive to correctly "parse" that file format, including commonly used formats such as comma-delimited text files, even when the files are compressed with Gzip or Bzip2. Karmasphere Analyst isolates the user from having to configure how Hive reads and writes data. Hive's powerful and flexible mechanism for "parsing" a data file is called a SerDe (Serializer/Deserializer).
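As a sketch (the table name, columns, and delimiter are assumptions, not from the original), a table over comma-delimited text files can be declared using Hive's built-in delimited row format:

```sql
-- Hypothetical table over comma-delimited text files in HDFS.
-- ROW FORMAT DELIMITED selects Hive's built-in delimited SerDe.
CREATE TABLE logs (
  ip      STRING,
  ts      STRING,
  request STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```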

In Hive, tables are stored as files: a table is basically a directory containing the data file(s). Defining a table in Hive involves two main items, which are stored in the Metadata store. They are,

1)      Where the folder that includes the data files is located.

2)      How to "parse" the data when reading from and writing to the file(s).

The Karmasphere Analyst "New Table Wizard" makes it easy to complete these steps. Once a table is defined, the Load Table Wizard helps to load data into the table and, if needed, to redefine where the table's folder is located.

Depending on your Hive configuration, simply adding more files to the table's folder in the file system adds that data to the table. In the special case that the table is partitioned, each partition is a sub-folder within the table's folder.
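For example (the table and partition column names are hypothetical), a partitioned table maps each partition value to a sub-folder under the table's directory:

```sql
-- Each dt value becomes a sub-folder under the table's folder,
-- e.g. .../sales/dt=2015-01-01/
CREATE TABLE sales (
  item   STRING,
  amount DOUBLE
)
PARTITIONED BY (dt STRING);
```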

Hive Query Language

The Hive query language (HiveQL) comprises a subset of SQL plus some extensions that have proved useful in practice. Traditional SQL features such as FROM-clause sub-queries, various types of joins (inner, left outer, right outer and full outer), Cartesian products, GROUP BY and aggregations, UNION ALL, CREATE TABLE AS SELECT, and many useful functions on primitive and complex types make the language very similar to SQL. This helps anyone familiar with SQL start a Hive CLI (command-line interface) and begin querying the system right away. There are some limitations.
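A sketch combining several of the features listed above (a FROM-clause sub-query, aggregation, and CREATE TABLE AS SELECT); the table and column names are hypothetical:

```sql
-- FROM-clause sub-query plus GROUP BY, materialized with CTAS.
CREATE TABLE top_users AS
SELECT user_id, total
FROM (
  SELECT user_id, SUM(amount) AS total
  FROM orders
  GROUP BY user_id
) t
WHERE total > 100;
```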

1)      Only equality predicates are supported in a join condition, and joins have to be specified using the ANSI join syntax.

Syntax: – SELECT t1.a1 AS c1, t2.b1 AS c2 FROM t1 JOIN t2 ON (t1.a2 = t2.b2);

2)      How do inserts work? Hive does not support inserting into an existing table or data partition; all inserts overwrite the existing data.

Syntax: – INSERT OVERWRITE TABLE t1
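A fuller (hypothetical) form of the statement, replacing the contents of t1 with the result of a query:

```sql
-- All existing data in t1 is replaced by the query result.
INSERT OVERWRITE TABLE t1
SELECT a1, a2 FROM t2;
```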