Data Types
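- Table: Tables in Hive are conceptually very similar to tables in a relational database. Each table has a corresponding directory in HDFS that stores its data. The location of this directory is set by the hive.metastore.warehouse.dir property in the ${HIVE_HOME}/conf/hive-site.xml configuration file. The default value is /user/hive/warehouse (a directory on HDFS), and it can be changed to suit your environment. For example, a table named wyp gets the directory /user/hive/warehouse/wyp in HDFS (assuming hive.metastore.warehouse.dir is set to /user/hive/warehouse), and all data of the wyp table is stored in this directory. The exception is external tables.
- External table: An external table in Hive is very similar to a table, but its data is not stored in the table's own directory; it is stored elsewhere. The benefit is that dropping an external table removes only its metadata and leaves the data it points to untouched. Dropping a regular table, by contrast, deletes all of the table's data together with its metadata.
- Partition: In Hive, each partition of a table corresponds to a subdirectory under the table's directory, and all data of a partition is stored in that subdirectory. For example, if the wyp table has two partition columns, dt and city, then the partition dt=20131218, city=BJ corresponds to the directory /user/hive/warehouse/wyp/dt=20131218/city=BJ, and all data belonging to this partition is stored there.
- Bucket: The hash of a specified column is computed and the data is split according to the hash value, with the goal of enabling parallelism. Each bucket corresponds to one file (note the difference from partitions, which are directories). For example, to spread the id column of the wyp table across 16 buckets, the hash of each id value is computed first and then taken modulo 16. Rows whose hash is 0 or 16 are stored in the HDFS file /user/hive/warehouse/wyp/part-00000, while rows whose hash is 2 are stored in /user/hive/warehouse/wyp/part-00002.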
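The layout rules above can be sketched in a few lines of Python. This is only an illustration of the path arithmetic, not Hive's implementation: the helper names are hypothetical, the part-NNNNN file naming simply follows the example in the text, and masking the hash to keep it non-negative is an assumption.

```python
import posixpath

# Assumed value of hive.metastore.warehouse.dir, as in the text.
WAREHOUSE_DIR = "/user/hive/warehouse"


def table_dir(table, warehouse_dir=WAREHOUSE_DIR):
    """Directory that holds the data of a regular (non-external) table."""
    return posixpath.join(warehouse_dir, table)


def partition_dir(table, partition_spec, warehouse_dir=WAREHOUSE_DIR):
    """Directory of one partition.

    partition_spec is an ordered list of (column, value) pairs,
    one nested directory level per partition column.
    """
    levels = [f"{column}={value}" for column, value in partition_spec]
    return posixpath.join(table_dir(table, warehouse_dir), *levels)


def bucket_index(hash_value, num_buckets):
    """Bucket for a row: non-negative hash modulo the bucket count."""
    return (hash_value & 0x7FFFFFFF) % num_buckets


def bucket_file(table, hash_value, num_buckets, warehouse_dir=WAREHOUSE_DIR):
    """File that stores rows with the given hash (naming as in the text)."""
    index = bucket_index(hash_value, num_buckets)
    return posixpath.join(table_dir(table, warehouse_dir), f"part-{index:05d}")


if __name__ == "__main__":
    print(partition_dir("wyp", [("dt", "20131218"), ("city", "BJ")]))
    print(bucket_file("wyp", 16, 16))
    print(bucket_file("wyp", 2, 16))
```

Partitions add directory levels, while buckets only select a file inside the directory, which is why hashes 0 and 16 land in the same file when there are 16 buckets.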

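In summary, tables live under a database, and a table in turn contains partitions, buckets, skewed data, normal data, and so on; buckets can also be created inside a partition.

The following hive-site.xml properties configure the JDBC connection to a MySQL-backed metastore: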
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive_hdp?characterEncoding=UTF-8&amp;createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
<description>password to use against metastore database</description>
</property>
Of course, you also need to copy the driver for the corresponding database to
Column Types
Column types are the data types that can be assigned to the columns of a Hive table. They are as follows:
Integral Types
Integer data is specified with the integral data types. The default choice is INT. When values exceed the range of INT, use BIGINT; when they fit in a smaller range than INT, use SMALLINT. TINYINT is smaller still than SMALLINT.
The following table lists the integral data types, with the postfix that marks a literal of each type:
Type | Postfix | Example |
---|---|---|
TINYINT | Y | 10Y |
SMALLINT | S | 10S |
INT | - | 10 |
BIGINT | L | 10L |
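To make "exceeds the range" concrete, the sketch below picks the smallest integral type that can hold a value and formats a literal with the postfix from the table. The storage sizes (1, 2, 4, and 8 bytes, signed) are not stated in the text and are supplied here as an assumption; the function names are hypothetical.

```python
# (type name, size in bytes, literal postfix), smallest first.
# Sizes are assumed signed two's-complement widths.
INTEGRAL_TYPES = [
    ("TINYINT", 1, "Y"),
    ("SMALLINT", 2, "S"),
    ("INT", 4, ""),
    ("BIGINT", 8, "L"),
]


def type_range(num_bytes):
    """Inclusive (min, max) of a signed integer stored in num_bytes bytes."""
    bits = num_bytes * 8
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def smallest_integral_type(value):
    """Name of the narrowest integral type whose range contains value."""
    for name, size, _postfix in INTEGRAL_TYPES:
        low, high = type_range(size)
        if low <= value <= high:
            return name
    raise ValueError(f"{value} does not fit in BIGINT")


def literal(value, type_name):
    """Format value as a literal of the given type, e.g. 10 -> '10Y'."""
    for name, size, postfix in INTEGRAL_TYPES:
        if name == type_name:
            low, high = type_range(size)
            if not low <= value <= high:
                raise ValueError(f"{value} is out of range for {type_name}")
            return f"{value}{postfix}"
    raise ValueError(f"unknown integral type: {type_name}")


if __name__ == "__main__":
    for sample in (10, 128, 32768, 2**31):
        print(sample, smallest_integral_type(sample))
    print(literal(10, "BIGINT"))
```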
String Types
String values can be written with single quotes (' ') or double quotes (" "). Two length-bounded string data types are available: VARCHAR and CHAR. Hive follows C-style escape characters inside strings.
The following table lists the length limits of these string data types:
Data Type | Length |
---|---|
VARCHAR | 1 to 65535 |
CHAR | 255 |
Timestamp
It supports the traditional UNIX timestamp with optional nanosecond precision. Timestamps written as text use the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" (also written "yyyy-mm-dd hh:mm:ss.fffffffff"), with up to nine fractional digits.
Dates
DATE values are described in year/month/day format, in the form YYYY-MM-DD.
Decimals
The DECIMAL type in Hive is the same as the BigDecimal format of Java. It is used to represent immutable arbitrary-precision decimal numbers. The syntax and an example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
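The meaning of precision (total digits) and scale (digits after the decimal point) can be illustrated with Python's decimal module. This is a sketch of the arithmetic only: the function name is hypothetical, and both the half-up rounding and the choice to return None for values that do not fit are assumptions made for the example, not a statement of Hive's exact behavior.

```python
from decimal import Decimal, ROUND_HALF_UP


def fit_decimal(text, precision, scale):
    """Round text to `scale` fractional digits and check it fits
    DECIMAL(precision, scale); return None when it does not fit."""
    if not 0 <= scale <= precision:
        raise ValueError("scale must be between 0 and precision")
    step = Decimal(1).scaleb(-scale)  # e.g. scale=2 -> Decimal('0.01')
    rounded = Decimal(text).quantize(step, rounding=ROUND_HALF_UP)
    parts = rounded.as_tuple()
    # Digits left of the decimal point (0 for values such as 0.05).
    integer_digits = max(len(parts.digits) + parts.exponent, 0)
    if integer_digits > precision - scale:
        return None
    return rounded


if __name__ == "__main__":
    print(fit_decimal("123.456", 5, 2))
    print(fit_decimal("1234.5", 5, 2))
    print(fit_decimal("12345678901", 10, 0))
```

With decimal(10,0), as in the example above, a value keeps no fractional digits and may have at most ten integer digits.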
Union Types
A union is a collection of heterogeneous data types; a union value holds exactly one of the listed types at a time. You can create an instance using the create_union UDF. The syntax and an example are as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
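In the sample values above, the number before the colon is a tag: the zero-based position of the active type in the UNIONTYPE declaration. The Python sketch below models that reading of the example; the Python-side type checks are a simplified stand-in for Hive's types and the names are hypothetical.

```python
# Declared member types, in the order given in the UNIONTYPE example.
UNION_MEMBERS = ["int", "double", "array<string>", "struct<a:int,b:string>"]

# One simplified check per member type, indexed by tag.
CHECKS = [
    lambda v: isinstance(v, int) and not isinstance(v, bool),
    lambda v: isinstance(v, float),
    lambda v: isinstance(v, list) and all(isinstance(x, str) for x in v),
    lambda v: (isinstance(v, dict) and set(v) == {"a", "b"}
               and isinstance(v["a"], int) and isinstance(v["b"], str)),
]


def member_type(tag):
    """Declared type selected by a tag."""
    return UNION_MEMBERS[tag]


def is_valid(tag, value):
    """True when value matches the member type the tag selects."""
    return 0 <= tag < len(CHECKS) and CHECKS[tag](value)


# The rows from the example, written as (tag, value) pairs.
ROWS = [
    (0, 1),
    (1, 2.0),
    (2, ["three", "four"]),
    (3, {"a": 5, "b": "five"}),
    (2, ["six", "seven"]),
    (3, {"a": 8, "b": "eight"}),
    (0, 9),
    (1, 10.0),
]

if __name__ == "__main__":
    for tag, value in ROWS:
        print(tag, member_type(tag), value, is_valid(tag, value))
```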
Literals
The following literals are used in Hive:
Floating Point Types
Floating point literals are numbers with a decimal point. Generally, such values are treated as the DOUBLE data type.
Decimal Type
Decimal type data is a floating point value with a higher range than the DOUBLE data type. The range of the decimal type is approximately -10^-308 to 10^308.
Null Value
Missing values are represented by the special value NULL.
Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
Structs
Structs in Hive group named fields of possibly different types, and each field can carry an optional comment.
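Complex types nest: the element, value, or field type of one can itself be another complex type. The small sketch below builds such type declarations as strings. The helper names are hypothetical, and the STRUCT field form (name:type with an optional COMMENT) is an assumption, since the text above does not give the struct syntax.

```python
def array_of(element_type):
    """ARRAY declaration, following the syntax ARRAY<data_type>."""
    return f"ARRAY<{element_type}>"


def map_of(key_type, value_type):
    """MAP declaration, following MAP<primitive_type, data_type>."""
    return f"MAP<{key_type}, {value_type}>"


def struct_of(fields):
    """STRUCT declaration from (name, type) or (name, type, comment) tuples.

    Assumed field form: name:type [COMMENT 'text'].
    """
    parts = []
    for field in fields:
        name, data_type, *comment = field
        text = f"{name}:{data_type}"
        if comment:
            text += f" COMMENT '{comment[0]}'"
        parts.append(text)
    return "STRUCT<" + ", ".join(parts) + ">"


if __name__ == "__main__":
    address = struct_of([("street", "string"), ("zip", "int", "postal code")])
    print(array_of("string"))
    print(map_of("string", array_of("int")))
    print(array_of(address))
```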