Tutorial on Types Of Hive tables

Nixon Data Tutorial on Types Of Hive tables

Tutorial on Types Of Hive tables

Types Of Hive tables

Hive is an open-source data warehousing and management system for Hadoop. It provides a way to store, process, and analyze large amounts of data in a scalable and efficient manner. Hive supports several types of tables that allow users to organize and manage their data in different ways.

In this article, we will discuss the different types of Hive tables and how to create them:

1. Managed Tables:

  • Managed tables are the standard type of table in Hive. When you create a managed table, Hive stores the data and metadata (schema, table properties, etc.) in a specific location in the HDFS file system. The data in managed tables is managed by Hive and the user cannot perform any file-level operations on the data. When you drop a managed table, both the data and metadata are deleted.

To create a managed table, use the following syntax:

CREATE TABLE table_name(
  column1 data_type,
  column2 data_type,
  ...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Example:

CREATE TABLE employee(
  id INT,
  name STRING,
  department STRING,
  salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Use case: Managed tables are ideal for storing structured data that you want to be managed by Hive. They are typically used for data warehousing and business intelligence applications.

2. External Tables:

  • External tables are similar to managed tables, but with some important differences. When you create an external table, Hive only stores the metadata in its metastore. The actual data remains in the HDFS file system and is not managed by Hive. When you drop an external table, only the metadata is deleted and the data remains in the HDFS file system.

To create an external table, use the following syntax:

CREATE EXTERNAL TABLE table_name(
  column1 data_type,
  column2 data_type,
  ...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs_path';

Example

CREATE EXTERNAL TABLE employee(
  id INT,
  name STRING,
  department STRING,
  salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/data/employee';

Use case:

External tables are ideal for scenarios where you want to keep the data outside of Hive’s control. This might be the case when you want to access the data using other tools or when you want to keep the data in a specific location for security reasons.

3. Partitioned Tables

  • Partitioned tables allow you to store large amounts of data in a more efficient manner. Partitioning divides the data into smaller, more manageable pieces based on the values of one or more columns. This makes queries faster by allowing Hive to skip over partitions that do not contain the desired data.

To create a partitioned table, use the following syntax:

CREATE TABLE table_name(
  column1 data_type,
  column2 data_type,
...
)
PARTITIONED BY (partition_column data_type)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Example:

CREATE TABLE sales(
date STRING,
product STRING,
quantity INT,
price FLOAT
)
PARTITIONED BY (region STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Use case:

Partitioned tables are ideal for large datasets that can be divided into smaller, more manageable pieces based on some criteria.

For example, you could partition a sales table by region or by date. Queries on the sales table will be faster because Hive will only need to scan the partitions that contain the desired data.

4. Bucketed Tables:

Bucketed tables allow you to store data in a more efficient manner by dividing the data into smaller pieces called buckets. Bucketing helps to improve query performance by reducing the amount of data that needs to be processed. To create a bucketed table, use the following syntax:

CREATE TABLE table_name(
column1 data_type,
column2 data_type,
...
)
CLUSTERED BY (bucketing_column) INTO num_buckets BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Example

CREATE TABLE sales(
date STRING,
product STRING,
quantity INT,
price FLOAT
)
CLUSTERED BY (product) INTO 10 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Use case:

Bucketed tables are ideal for scenarios where you want to improve query performance by reducing the amount of data that needs to be processed. For example, you could bucket a sales table by product to improve performance when running queries on sales data by product. In conclusion, Hive supports several types of tables that allow you to manage and organize your data in different ways. Whether you need a standard managed table, an external table, a partitioned table, or a bucketed table, Hive has you covered. By choosing the right type of table for your data, you can ensure that your data is stored and processed efficiently, making your queries faster and more reliable.