Scenario: common Hive storage formats include TextFile, SequenceFile, ORC, Parquet, and RCFile.
Example: create five tables, one per storage format, write 10 million rows into each, and compare how much storage space each one consumes.
create table test1 (
id string
,name string
)
row format delimited
fields terminated by '|'
lines terminated by '\n'
stored as textfile
;
create table test2 (
id string
,name string
)
row format delimited
fields terminated by '|'
lines terminated by '\n'
stored as SequenceFile
;
-- ROW FORMAT DELIMITED applies to text-backed SerDes (TextFile, SequenceFile);
-- ORC, Parquet, and RCFile bring their own binary SerDes, so the delimiter
-- clauses are omitted for the remaining three tables.
create table test3 (
id string
,name string
)
stored as orc
;
create table test4 (
id string
,name string
)
stored as parquet
;
create table test5 (
id string
,name string
)
stored as rcfile
;
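To double-check which format Hive actually recorded for a table, describe formatted prints its InputFormat, OutputFormat, and SerDe. A sketch via beeline (the JDBC URL and database name test_db are assumptions about this test setup):

```
beeline -u jdbc:hive2://localhost:10000 \
  -e "describe formatted test_db.test3" \
  | grep -iE 'InputFormat|OutputFormat|SerDe Library'
```

For test3 the output should list the ORC input/output format and SerDe classes.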
Scala code used to generate the 10 million rows of test data:
package com.tpiods.myself.work0510

import java.io.{File, PrintWriter}

object GeneData {
  def main(args: Array[String]): Unit = {
    val output = "ods_etl/src/main/resources/work0515_hive/test.txt"
    val writer = new PrintWriter(new File(output))
    try {
      // Write each row directly instead of buffering all ~100 MB in a StringBuilder.
      for (i <- 1 to 10000000) {
        writer.write(s"$i|test\n")
      }
    } finally {
      writer.close()
    }
  }
}
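If a JVM project is overkill for this, the same file can be produced in one line with standard shell tools; this is a sketch that assumes seq and awk are available on the machine doing the upload:

```shell
# Emit "1|test" .. "10000000|test", one row per line, straight to disk.
seq 1 10000000 | awk '{print $1 "|test"}' > test.txt
```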
-- A sample of the test data:
test.txt
1|test
2|test
3|test
....
....
-- Create the staging table and load the data:
create table test (
id string
,name string
)
row format delimited
fields terminated by '|'
lines terminated by '\n'
stored as textfile
location '/user/hive/warehouse/test/test'
;
hadoop fs -put test.txt /user/hive/warehouse/test/test
-- No msck repair needed here: msck only discovers missing partitions, and this
-- table is unpartitioned; the file is queryable as soon as it lands in the
-- table's location.
select * from test limit 3;
+----------+------------+
| test.id | test.name |
+----------+------------+
| 1 | test |
| 2 | test |
| 3 | test |
+----------+------------+
3 rows selected (0.319 seconds)
select count(1) from test;
+-----------+
| _c0 |
+-----------+
| 10000000 |
+-----------+
1 row selected (35.452 seconds)
Elapsed time loading the test data into each Hive table:
nohup beeline -e "insert into test_db.test1 select * from test_db.test" > test1.log &
Time taken: 30.157 seconds
nohup beeline -e "insert into test_db.test2 select * from test_db.test" > test2.log &
Time taken: 35.352 seconds
nohup beeline -e "insert into test_db.test3 select * from test_db.test" > test3.log &
Time taken: 28.066 seconds
nohup beeline -e "insert into test_db.test4 select * from test_db.test" > test4.log &
Time taken: 32.137 seconds
nohup beeline -e "insert into test_db.test5 select * from test_db.test" > test5.log &
Time taken: 28.835 seconds
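Note that if the five nohup jobs above are left to run concurrently they compete for cluster resources and skew the timing comparison; running the loads sequentially in one loop keeps the runs comparable. A sketch, reusing the table names above:

```
for i in 1 2 3 4 5; do
  beeline -e "insert into test_db.test$i select * from test_db.test" > "test$i.log" 2>&1
done
```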
Storage consumed by each table (first column: file size; second: space consumed including 3x HDFS replication):
122.9 M  368.8 M  /user/hive/warehouse/test_db.db/test1  (TextFile)
237.4 M  712.2 M  /user/hive/warehouse/test_db.db/test2  (SequenceFile)
20.4 M   61.1 M   /user/hive/warehouse/test_db.db/test3  (ORC)
103.9 M  311.6 M  /user/hive/warehouse/test_db.db/test4  (Parquet)
103.8 M  311.5 M  /user/hive/warehouse/test_db.db/test5  (RCFile)
ORC is by far the most compact, about 6x smaller than TextFile (its default ZLIB compression does most of the work); SequenceFile is the largest, likely due to its per-record header and sync-marker overhead; Parquet and RCFile fall in between. Load times differ by only a few seconds across formats.
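The two size columns match the output of hadoop fs -du -s -h. To reproduce the table, assuming the same warehouse path:

```
for i in 1 2 3 4 5; do
  hadoop fs -du -s -h "/user/hive/warehouse/test_db.db/test$i"
done
```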
Relevant beeline parameters:
beeline --help
Usage: java org.apache.hive.cli.beeline.BeeLine
-u <database url> the JDBC URL to connect to
-c <named url> the named JDBC URL to connect to,
which should be present in beeline-site.xml
as the value of beeline.hs2.jdbc.url.<namedUrl>
-r reconnect to last saved connect url (in conjunction with !save)
-n <username> the username to connect as
-p <password> the password to connect as
-d <driver class> the driver class to use
-i <init file> script file for initialization
-e <query> query that should be executed
-f <exec file> script file that should be executed
-w (or) --password-file <password file> the password file to read password from
--hiveconf property=value Use value for given property
--hivevar name=value hive variable name and value
This is Hive specific settings in which variables
can be set at session level and referenced in Hive
commands or queries.
--property-file=<property-file> the file to read connection properties (url, driver, user, password) from
--color=[true/false] control whether color is used for display
--showHeader=[true/false] show column names in query results
--escapeCRLF=[true/false] show carriage return and line feeds in query results as escaped \r and \n
--headerInterval=ROWS; the interval between which headers are displayed
--fastConnect=[true/false] skip building table/column list for tab-completion
--autoCommit=[true/false] enable/disable automatic transaction commit
--verbose=[true/false] show verbose error messages and debug info
--showWarnings=[true/false] display connection warnings
--showDbInPrompt=[true/false] display the current database name in the prompt
--showNestedErrs=[true/false] display nested errors
--numberFormat=[pattern] format numbers using DecimalFormat pattern
--force=[true/false] continue running script even after errors
--maxWidth=MAXWIDTH the maximum width of the terminal
--maxColumnWidth=MAXCOLWIDTH the maximum width to use when displaying columns
--silent=[true/false] be more silent
--autosave=[true/false] automatically save preferences
--outputformat=[table/vertical/csv2/tsv2/dsv/csv/tsv] format mode for result display
Note that csv and tsv are deprecated - use csv2, tsv2 instead
--incremental=[true/false] Defaults to false. When set to false, the entire result set
is fetched and buffered before being displayed, yielding optimal
display column sizing. When set to true, result rows are displayed
immediately as they are fetched, yielding lower latency and
memory usage at the price of extra display column padding.
Setting --incremental=true is recommended if you encounter an OutOfMemory
on the client side (due to the fetched result set size being large).
Only applicable if --outputformat=table.
--incrementalBufferRows=NUMROWS the number of rows to buffer when printing rows on stdout,
defaults to 1000; only applicable if --incremental=true
and --outputformat=table
--truncateTable=[true/false] truncate table column when it exceeds length
--delimiterForDSV=DELIMITER specify the delimiter for delimiter-separated values output format (default: |)
--isolation=LEVEL set the transaction isolation level
--nullemptystring=[true/false] set to true to get historic behavior of printing null as empty string
--maxHistoryRows=MAXHISTORYROWS The maximum number of rows to store beeline history.
--delimiter=DELIMITER set the query delimiter; multi-char delimiters are allowed, but quotation
marks, slashes, and -- are not allowed; defaults to ;
--convertBinaryArrayToString=[true/false] display binary column data as string or as byte array
--help display this message
Example:
1. Connect using simple authentication to HiveServer2 on localhost:10000
$ beeline -u jdbc:hive2://localhost:10000 username password
2. Connect using simple authentication to HiveServer2 on hs.local:10000 using -n for username and -p for password
$ beeline -n username -p password -u jdbc:hive2://hs2.local:10012
3. Connect using Kerberos authentication with hive/localhost@mydomain.com as HiveServer2 principal
$ beeline -u "jdbc:hive2://hs2.local:10013/default;principal=hive/localhost@mydomain.com"
4. Connect using SSL connection to HiveServer2 on localhost at 10000
$ beeline "jdbc:hive2://localhost:10000/default;ssl=true;sslTrustStore=/usr/local/truststore;trustStorePassword=mytruststorepassword"
5. Connect using LDAP authentication
$ beeline -u jdbc:hive2://hs2.local:10013/default <ldap-username> <ldap-password>