Scenario: common Hive storage formats include TextFile, SequenceFile, ORC, Parquet, and RCFile.
Example: create five tables, one per storage format, write 10 million rows into each, and compare how much storage space each one consumes.
create table test1 (
id string
,name string
)
row format delimited
fields terminated by '|'
lines terminated by '\n'
stored as textfile
;
create table test2 (
id string
,name string
)
row format delimited
fields terminated by '|'
lines terminated by '\n'
stored as SequenceFile
;
-- ROW FORMAT DELIMITED applies to text-backed SerDes (TextFile, SequenceFile);
-- ORC, Parquet, and RCFile bring their own binary SerDes, so the delimiter
-- clauses are omitted for the remaining three tables.
create table test3 (
id string
,name string
)
stored as orc
;
create table test4 (
id string
,name string
)
stored as parquet
;
create table test5 (
id string
,name string
)
stored as rcfile
;
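To double-check which format Hive actually recorded for a table, describe formatted prints its InputFormat, OutputFormat, and SerDe. A sketch via beeline (the JDBC URL and database name test_db are assumptions about this test setup):

```
beeline -u jdbc:hive2://localhost:10000 \
  -e "describe formatted test_db.test3" \
  | grep -iE 'InputFormat|OutputFormat|SerDe Library'
```

For test3 the output should list the ORC input/output format and SerDe classes.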
Scala code used to generate the 10 million rows of test data:
package com.tpiods.myself.work0510

import java.io.{File, PrintWriter}

object GeneData {
  def main(args: Array[String]): Unit = {
    val output = "ods_etl/src/main/resources/work0515_hive/test.txt"
    val writer = new PrintWriter(new File(output))
    try {
      // Write each row directly instead of buffering all ~100 MB in a StringBuilder.
      for (i <- 1 to 10000000) {
        writer.write(s"$i|test\n")
      }
    } finally {
      writer.close()
    }
  }
}
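If a JVM project is overkill for this, the same file can be produced in one line with standard shell tools; this is a sketch that assumes seq and awk are available on the machine doing the upload:

```shell
# Emit "1|test" .. "10000000|test", one row per line, straight to disk.
seq 1 10000000 | awk '{print $1 "|test"}' > test.txt
```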
-- A sample of the test data:
test.txt
1|test
2|test
3|test
....
....
-- Create the staging table and load the data:
create table test (
id string
,name string
)
row format delimited
fields terminated by '|'
lines terminated by '\n'
stored as textfile
location '/user/hive/warehouse/test/test'
;
hadoop fs -put test.txt /user/hive/warehouse/test/test
-- No msck repair needed here: msck only discovers missing partitions, and this
-- table is unpartitioned; the file is queryable as soon as it lands in the
-- table's location.
select * from test limit 3;
+----------+------------+
| test.id | test.name |
+----------+------------+
| 1 | test |
| 2 | test |
| 3 | test |
+----------+------------+
3 rows selected (0.319 seconds)
select count(1) from test;
+-----------+
| _c0 |
+-----------+
| 10000000 |
+-----------+
1 row selected (35.452 seconds)
Elapsed time loading the test data into each Hive table:
nohup beeline -e "insert into test_db.test1 select * from test_db.test" > test1.log &
Time taken: 30.157 seconds
nohup beeline -e "insert into test_db.test2 select * from test_db.test" > test2.log &
Time taken: 35.352 seconds
nohup beeline -e "insert into test_db.test3 select * from test_db.test" > test3.log &
Time taken: 28.066 seconds
nohup beeline -e "insert into test_db.test4 select * from test_db.test" > test4.log &
Time taken: 32.137 seconds
nohup beeline -e "insert into test_db.test5 select * from test_db.test" > test5.log &
Time taken: 28.835 seconds
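Note that if the five nohup jobs above are left to run concurrently they compete for cluster resources and skew the timing comparison; running the loads sequentially in one loop keeps the runs comparable. A sketch, reusing the table names above:

```
for i in 1 2 3 4 5; do
  beeline -e "insert into test_db.test$i select * from test_db.test" > "test$i.log" 2>&1
done
```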
Storage consumed by each table (first column: file size; second: space consumed including 3x HDFS replication):
122.9 M  368.8 M  /user/hive/warehouse/test_db.db/test1  (TextFile)
237.4 M  712.2 M  /user/hive/warehouse/test_db.db/test2  (SequenceFile)
20.4 M   61.1 M   /user/hive/warehouse/test_db.db/test3  (ORC)
103.9 M  311.6 M  /user/hive/warehouse/test_db.db/test4  (Parquet)
103.8 M  311.5 M  /user/hive/warehouse/test_db.db/test5  (RCFile)
ORC is by far the most compact, about 6x smaller than TextFile (its default ZLIB compression does most of the work); SequenceFile is the largest, likely due to its per-record header and sync-marker overhead; Parquet and RCFile fall in between. Load times differ by only a few seconds across formats.
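The two size columns match the output of hadoop fs -du -s -h. To reproduce the table, assuming the same warehouse path:

```
for i in 1 2 3 4 5; do
  hadoop fs -du -s -h "/user/hive/warehouse/test_db.db/test$i"
done
```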
Relevant beeline parameters:
beeline --help
Usage: java org.apache.hive.cli.beeline.BeeLine
-u <database url> the JDBC URL to connect to
-c <named url> the named JDBC URL to connect to,
which should be present in beeline-site.xml
as the value of beeline.hs2.jdbc.url.<namedUrl>
-r reconnect to last saved connect url (in conjunction with !save)
-n <username> the username to connect as
-p <password> the password to connect as
-d <driver class> the driver class to use
-i <init file> script file for initialization
-e <query> query that should be executed
-f <exec file> script file that should be executed
-w (or) --password-file <password file> the password file to read password from
--hiveconf property=value Use value for given property
--hivevar name=value hive variable name and value
This is Hive specific settings in which variables
can be set at session level and referenced in Hive
commands or queries.
--property-file=<property-file> the file to read connection properties (url, driver, user, password) from
--color=[true/false] control whether color is used for display
--showHeader=[true/false] show column names in query results
--escapeCRLF=[true/false] show carriage return and line feeds in query results as escaped \r and \n
--headerInterval=ROWS; the interval between which headers are displayed
--fastConnect=[true/false] skip building table/column list for tab-completion
--autoCommit=[true/false] enable/disable automatic transaction commit
--verbose=[true/false] show verbose error messages and debug info
--showWarnings=[true/false] display connection warnings
--showDbInPrompt=[true/false] display the current database name in the prompt
--showNestedErrs=[true/false] display nested errors
--numberFormat=[pattern] format numbers using DecimalFormat pattern
--force=[true/false] continue running script even after errors
--maxWidth=MAXWIDTH the maximum width of the terminal
--maxColumnWidth=MAXCOLWIDTH the maximum width to use when displaying columns
--silent=[true/false] be more silent
--autosave=[true/false] automatically save preferences
--outputformat=[table/vertical/csv2/tsv2/dsv/csv/tsv] format mode for result display
Note that csv and tsv are deprecated - use csv2, tsv2 instead
--incremental=[true/false] Defaults to false. When set to false, the entire result set
is fetched and buffered before being displayed, yielding optimal
display column sizing. When set to true, result rows are displayed
immediately as they are fetched, yielding lower latency and
memory usage at the price of extra display column padding.
Setting --incremental=true is recommended if you encounter an OutOfMemory
on the client side (due to the fetched result set size being large).
Only applicable if --outputformat=table.
--incrementalBufferRows=NUMROWS the number of rows to buffer when printing rows on stdout,
defaults to 1000; only applicable if --incremental=true
and --outputformat=table
--truncateTable=[true/false] truncate table column when it exceeds length
--delimiterForDSV=DELIMITER specify the delimiter for delimiter-separated values output format (default: |)
--isolation=LEVEL set the transaction isolation level
--nullemptystring=[true/false] set to true to get historic behavior of printing null as empty string
--maxHistoryRows=MAXHISTORYROWS The maximum number of rows to store beeline history.
--delimiter=DELIMITER set the query delimiter; multi-char delimiters are allowed, but quotation
marks, slashes, and -- are not allowed; defaults to ;
--convertBinaryArrayToString=[true/false] display binary column data as string or as byte array
--help display this message
Example:
1. Connect using simple authentication to HiveServer2 on localhost:10000
$ beeline -u jdbc:hive2://localhost:10000 username password
2. Connect using simple authentication to HiveServer2 on hs.local:10000 using -n for username and -p for password
$ beeline -n username -p password -u jdbc:hive2://hs2.local:10012
3. Connect using Kerberos authentication with hive/localhost@mydomain.com as HiveServer2 principal
$ beeline -u "jdbc:hive2://hs2.local:10013/default;principal=hive/localhost@mydomain.com"
4. Connect using SSL connection to HiveServer2 on localhost at 10000
$ beeline "jdbc:hive2://localhost:10000/default;ssl=true;sslTrustStore=/usr/local/truststore;trustStorePassword=mytruststorepassword"
5. Connect using LDAP authentication
$ beeline -u jdbc:hive2://hs2.local:10013/default <ldap-username> <ldap-password>