Questions to consider:
- HDFS file system layout, with data management and user access control in mind (see the sketch after this list)
- For relational data, how to model the schema in Hive
- Storage format? ORC is a good default (a conversion example appears at the end of this section)
- Choose a suitable compression codec
- Avoid storing large numbers of small files in HDFS
- Pick a loading tool that matches the ETL approach
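A layout with per-dataset directories, explicit ownership, and restricted permissions can be set up with standard HDFS commands, for example (a minimal sketch; the paths, user, and group are hypothetical):

$ hdfs dfs -mkdir -p /data/adventureworks/raw
$ hdfs dfs -chown -R etl:analysts /data/adventureworks
$ hdfs dfs -chmod -R 750 /data/adventureworks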
Methods for loading data into HDFS
- Ambari Files View
- Hadoop command line (example below)
- HDFS NFS gateway
- Sqoop
- Apache NiFi
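With the Hadoop command line, a load is simply a copy from the local file system into HDFS (the file and target path here are illustrative):

$ hdfs dfs -put ./address.csv /data/adventureworks/raw/
$ hdfs dfs -ls /data/adventureworks/raw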
Sqoop example
$ sqoop help
Available commands:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
import-mainframe Import datasets from a mainframe server to HDFS
job Work with saved jobs
list-databases List available databases on a server
list-tables List available tables in a database
merge Merge results of incremental imports
metastore Run a standalone Sqoop metastore
version Display version information
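Before running an import, list-databases and list-tables are handy for verifying connectivity and table names. Against the PostgreSQL server used below, this would look like (-P prompts for the password instead of exposing it on the command line):

$ sqoop list-tables \
    --connect jdbc:postgresql://10.52.28.72/adventureworks1 \
    --username postgres -P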
The following example imports the adventureworks.address data.
First, obtain the PostgreSQL JDBC driver and copy it to `/usr/share/java` on the sandbox.
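For example (the driver version and download URL are assumptions, not part of the original session):

$ curl -L -o /usr/share/java/postgresql-42.2.27.jar \
    https://jdbc.postgresql.org/download/postgresql-42.2.27.jar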
$ sqoop import --connect jdbc:postgresql://10.52.28.72/adventureworks1 --table address --username farland --m 1 --password yourpass
Warning: /usr/hdp/3.0.1.0-187/accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/3.0.1.0-187/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/3.0.1.0-187/hive/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
23/01/11 04:08:35 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7.3.0.1.0-187
23/01/11 04:08:35 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
23/01/11 04:08:35 INFO manager.SqlManager: Using default fetchSize of 1000
23/01/11 04:08:35 INFO tool.CodeGenTool: Beginning code generation
23/01/11 04:08:35 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "address" AS t LIMIT 1
23/01/11 04:08:35 ERROR orm.ClassWriter: Cannot resolve SQL type 1111
23/01/11 04:08:35 ERROR orm.ClassWriter: Cannot resolve SQL type 1111
23/01/11 04:08:35 ERROR orm.ClassWriter: Cannot resolve SQL type 1111
23/01/11 04:08:35 ERROR orm.ClassWriter: Cannot resolve SQL type 1111
23/01/11 04:08:35 ERROR orm.ClassWriter: No Java type for SQL type 1111 for column rowguid
......
The import fails: Sqoop cannot map PostgreSQL's uuid type (JDBC reports it as java.sql.Types.OTHER, code 1111). A workaround is to map the uuid column to a Java String with `--map-column-java rowguid=String`:
$ sqoop import --connect jdbc:postgresql://10.52.28.72/adventureworks1 \
--table address --username postgres --m 1 --password yourpass \
--map-column-java rowguid=String
......
23/01/11 04:11:32 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7.3.0.1.0-187
23/01/11 04:11:32 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
23/01/11 04:11:32 INFO manager.SqlManager: Using default fetchSize of 1000
23/01/11 04:11:32 INFO tool.CodeGenTool: Beginning code generation
23/01/11 04:11:33 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "address" AS t LIMIT 1
23/01/11 04:11:33 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/3.0.1.0-187/hadoop-mapreduce
23/01/11 04:11:35 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hive/compile/e8d73337304e2559bf780b19245358fa/address.jar
23/01/11 04:11:35 WARN manager.PostgresqlManager: It looks like you are importing from postgresql.
23/01/11 04:11:35 WARN manager.PostgresqlManager: This transfer can be faster! Use the --direct
23/01/11 04:11:35 WARN manager.PostgresqlManager: option to exercise a postgresql-specific fast path.
23/01/11 04:11:35 INFO mapreduce.ImportJobBase: Beginning import of address
23/01/11 04:11:36 INFO client.RMProxy: Connecting to ResourceManager at sandbox-hdp.hortonworks.com/172.18.0.2:8050
23/01/11 04:11:36 INFO client.AHSProxy: Connecting to Application History server at sandbox-hdp.hortonworks.com/172.18.0.2:10200
23/01/11 04:11:37 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/hive/.staging/job_1672217206510_0021
23/01/11 04:11:43 INFO db.DBInputFormat: Using read commited transaction isolation
23/01/11 04:11:43 INFO mapreduce.JobSubmitter: number of splits:1
23/01/11 04:11:44 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1672217206510_0021
23/01/11 04:11:44 INFO mapreduce.JobSubmitter: Executing with tokens: []
23/01/11 04:11:44 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.0.1.0-187/0/resource-types.xml
23/01/11 04:11:44 INFO impl.YarnClientImpl: Submitted application application_1672217206510_0021
23/01/11 04:11:44 INFO mapreduce.Job: The url to track the job: http://sandbox-hdp.hortonworks.com:8088/proxy/application_1672217206510_0021/
23/01/11 04:11:44 INFO mapreduce.Job: Running job: job_1672217206510_0021
23/01/11 04:11:53 INFO mapreduce.Job: Job job_1672217206510_0021 running in uber mode : false
23/01/11 04:11:53 INFO mapreduce.Job: map 0% reduce 0%
23/01/11 04:12:00 INFO mapreduce.Job: map 100% reduce 0%
23/01/11 04:12:01 INFO mapreduce.Job: Job job_1672217206510_0021 completed successfully
23/01/11 04:12:01 INFO mapreduce.Job: Counters: 32
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=244786
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=85
HDFS: Number of bytes written=2984000
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=14420
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=3605
Total vcore-milliseconds taken by all map tasks=3605
Total megabyte-milliseconds taken by all map tasks=3691520
Map-Reduce Framework
Map input records=19614
Map output records=19614
Input split bytes=85
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=98
CPU time spent (ms)=2010
Physical memory (bytes) snapshot=253136896
Virtual memory (bytes) snapshot=2846994432
Total committed heap usage (bytes)=155713536
Peak Map Physical memory (bytes)=253136896
Peak Map Virtual memory (bytes)=2846994432
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=2984000
23/01/11 04:12:01 INFO mapreduce.ImportJobBase: Transferred 2.8458 MB in 24.9017 seconds (117.0227 KB/sec)
23/01/11 04:12:01 INFO mapreduce.ImportJobBase: Retrieved 19614 records.
[hive@sandbox-hdp ~]$ hdfs dfs -ls /user/hive/address
Found 2 items
-rw-r--r-- 1 hive hdfs 0 2023-01-11 04:11 /user/hive/address/_SUCCESS
-rw-r--r-- 1 hive hdfs 2984000 2023-01-11 04:11 /user/hive/address/part-m-00000
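A quick way to spot-check the imported rows is to read the head of the part file (standard HDFS commands):

$ hdfs dfs -cat /user/hive/address/part-m-00000 | head -n 3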
The table can also be imported straight into Hive with `--hive-import`; the uuid column again needs an explicit mapping, this time on both the Java and the Hive side:
$ sqoop import --connect jdbc:postgresql://10.52.28.72/adventureworks1 \
--table address --username postgres --m 1 --password pass \
--map-column-java rowguid=String \
--hive-import --map-column-hive rowguid=String
......
23/01/11 04:34:23 INFO hive.HiveImport: Connecting to jdbc:hive2://sandbox-hdp.hortonworks.com:2181/default;password=hive;serviceDiscoveryMode=zooKeeper;user=hive;zooKeeperNamespace=hiveserver2
23/01/11 04:34:23 INFO hive.HiveImport: 23/01/11 04:34:23 [main]: INFO jdbc.HiveConnection: Connected to sandbox-hdp.hortonworks.com:10000
23/01/11 04:34:23 INFO hive.HiveImport: Connected to: Apache Hive (version 3.1.0.3.0.1.0-187)
23/01/11 04:34:23 INFO hive.HiveImport: Driver: Hive JDBC (version 3.1.0.3.0.1.0-187)
23/01/11 04:34:23 INFO hive.HiveImport: Transaction isolation: TRANSACTION_REPEATABLE_READ
23/01/11 04:34:23 INFO hive.HiveImport: 0: jdbc:hive2://sandbox-hdp.hortonworks.com:2> CREATE TABLE IF NOT EXISTS `address` ( `addressid` IN
23/01/11 04:34:23 INFO hive.HiveImport: T, `addressline1` STRING, `addressline2` STRING, `city` STRING, `stateprovinceid` INT, `postalcode`
23/01/11 04:34:23 INFO hive.HiveImport: STRING, `spatiallocation` STRING, `rowguid` String, `modifieddate` STRING) COMMENT 'Imported by sqoo
23/01/11 04:34:23 INFO hive.HiveImport: p on 2023/01/11 04:34:19' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\012
23/01/11 04:34:23 INFO hive.HiveImport: ' STORED AS TEXTFILE;
23/01/11 04:34:24 INFO hive.HiveImport: INFO : Compiling command(queryId=hive_20230111043423_f29bb581-d140-4b51-a9b7-305468cc0108): CREATE TABLE IF NOT EXISTS `address` ( `addressid` INT, `addressline1` STRING, `addressline2` STRING, `city` STRING, `stateprovinceid` INT, `postalcode` STRING, `spatiallocation` STRING, `rowguid` String, `modifieddate` STRING) COMMENT 'Imported by sqoop on 2023/01/11 04:34:19' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\012' STORED AS TEXTFILE
23/01/11 04:34:24 INFO hive.HiveImport: INFO : Semantic Analysis Completed (retrial = false)
23/01/11 04:34:24 INFO hive.HiveImport: INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
23/01/11 04:34:24 INFO hive.HiveImport: INFO : Completed compiling command(queryId=hive_20230111043423_f29bb581-d140-4b51-a9b7-305468cc0108); Time taken: 0.04 seconds
23/01/11 04:34:24 INFO hive.HiveImport: INFO : Executing command(queryId=hive_20230111043423_f29bb581-d140-4b51-a9b7-305468cc0108): CREATE TABLE IF NOT EXISTS `address` ( `addressid` INT, `addressline1` STRING, `addressline2` STRING, `city` STRING, `stateprovinceid` INT, `postalcode` STRING, `spatiallocation` STRING, `rowguid` String, `modifieddate` STRING) COMMENT 'Imported by sqoop on 2023/01/11 04:34:19' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\012' STORED AS TEXTFILE
23/01/11 04:34:24 INFO hive.HiveImport: INFO : Starting task [Stage-0:DDL] in serial mode
23/01/11 04:34:24 INFO hive.HiveImport: INFO : Completed executing command(queryId=hive_20230111043423_f29bb581-d140-4b51-a9b7-305468cc0108); Time taken: 0.08 seconds
23/01/11 04:34:24 INFO hive.HiveImport: INFO : OK
23/01/11 04:34:24 INFO hive.HiveImport: No rows affected (0.262 seconds)
23/01/11 04:34:24 INFO hive.HiveImport: 0: jdbc:hive2://sandbox-hdp.hortonworks.com:2> LOAD DATA INPATH 'hdfs://sandbox-hdp.hortonworks.com:
23/01/11 04:34:24 INFO hive.HiveImport: 8020/user/hive/address' INTO TABLE `address`;
23/01/11 04:34:24 INFO hive.HiveImport: INFO : Compiling command(queryId=hive_20230111043424_7ffad493-5238-444d-be2e-d2d4e8062c05): LOAD DATA INPATH 'hdfs://sandbox-hdp.hortonworks.com:8020/user/hive/address' INTO TABLE `address`
23/01/11 04:34:24 INFO hive.HiveImport: INFO : Semantic Analysis Completed (retrial = false)
23/01/11 04:34:24 INFO hive.HiveImport: INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
23/01/11 04:34:24 INFO hive.HiveImport: INFO : Completed compiling command(queryId=hive_20230111043424_7ffad493-5238-444d-be2e-d2d4e8062c05); Time taken: 0.079 seconds
23/01/11 04:34:24 INFO hive.HiveImport: INFO : Executing command(queryId=hive_20230111043424_7ffad493-5238-444d-be2e-d2d4e8062c05): LOAD DATA INPATH 'hdfs://sandbox-hdp.hortonworks.com:8020/user/hive/address' INTO TABLE `address`
23/01/11 04:34:24 INFO hive.HiveImport: INFO : Starting task [Stage-0:MOVE] in serial mode
23/01/11 04:34:24 INFO hive.HiveImport: INFO : Loading data to table default.address from hdfs://sandbox-hdp.hortonworks.com:8020/user/hive/address
23/01/11 04:34:24 INFO hive.HiveImport: INFO : Starting task [Stage-1:STATS] in serial mode
23/01/11 04:34:24 INFO hive.HiveImport: INFO : Completed executing command(queryId=hive_20230111043424_7ffad493-5238-444d-be2e-d2d4e8062c05); Time taken: 0.487 seconds
23/01/11 04:34:24 INFO hive.HiveImport: INFO : OK
23/01/11 04:34:24 INFO hive.HiveImport: No rows affected (0.602 seconds)
23/01/11 04:34:24 INFO hive.HiveImport: 0: jdbc:hive2://sandbox-hdp.hortonworks.com:2>
23/01/11 04:34:24 INFO hive.HiveImport: 0: jdbc:hive2://sandbox-hdp.hortonworks.com:2> Closing: 0: jdbc:hive2://sandbox-hdp.hortonworks.com:2181/default;password=hive;serviceDiscoveryMode=zooKeeper;user=hive;zooKeeperNamespace=hiveserver2
23/01/11 04:34:24 INFO hive.HiveImport: Hive import complete.
23/01/11 04:34:24 INFO hive.HiveImport: Export directory is contains the _SUCCESS file only, removing the directory.
23/01/11 04:34:24 INFO tool.ImportTool: Publishing Hive/Hcat import job data to Listeners for table address
23/01/11 04:34:25 INFO atlas.ApplicationProperties: Looking for atlas-application.properties in classpath
23/01/11 04:34:25 INFO atlas.ApplicationProperties: Loading atlas-application.properties from file:/etc/sqoop/3.0.1.0-187/0/atlas-application.properties
23/01/11 04:34:25 INFO atlas.ApplicationProperties: No graphdb backend specified. Will use 'janus'
23/01/11 04:34:25 INFO atlas.ApplicationProperties: Using storage backend 'hbase2'
23/01/11 04:34:25 INFO atlas.ApplicationProperties: Using index backend 'solr'
23/01/11 04:34:25 INFO atlas.ApplicationProperties: Setting solr-wait-searcher property 'true'
23/01/11 04:34:25 INFO atlas.ApplicationProperties: Setting index.search.map-name property 'false'
23/01/11 04:34:25 INFO atlas.ApplicationProperties: Property (set to default) atlas.graph.cache.db-cache = true
23/01/11 04:34:25 INFO atlas.ApplicationProperties: Property (set to default) atlas.graph.cache.db-cache-clean-wait = 20
23/01/11 04:34:25 INFO atlas.ApplicationProperties: Property (set to default) atlas.graph.cache.db-cache-size = 0.5
23/01/11 04:34:25 INFO atlas.ApplicationProperties: Property (set to default) atlas.graph.cache.tx-cache.size = 15000
23/01/11 04:34:25 INFO atlas.ApplicationProperties: Property (set to default) atlas.graph.cache.tx-dirty-size = 120
23/01/11 04:34:25 ERROR security.InMemoryJAASConfiguration: Unable to add JAAS configuration for client [KafkaClient] as it is missing param [atlas.jaas.KafkaClient.loginModuleName]. Skipping JAAS config for [KafkaClient]
23/01/11 04:34:25 INFO kafka.KafkaNotification: ==> KafkaNotification()
23/01/11 04:34:25 INFO kafka.KafkaNotification: <== KafkaNotification()
23/01/11 04:34:25 INFO hook.AtlasHook: Created Atlas Hook
23/01/11 04:34:25 INFO mapreduce.PublishJobData: Published data is Operation=import, Url=jdbc:postgresql://10.52.28.72/adventureworks1, User=hive, StoreType=postgresql, StoreTable=address, StoreQuery=null, HiveDB=default, HiveTable=address, StartTime=1673411637645, EndTime=1673411664953, CmdLineArgs={reset.onemapper=false, codegen.output.delimiters.enclose=0, sqlconnection.metadata.transaction.isolation.level=2, codegen.input.delimiters.escape=0, codegen.auto.compile.dir=true, accumulo.batch.size=10240000, codegen.input.delimiters.field=0, accumulo.create.table=false, map.column.hive.rowguid=String, mainframe.input.dataset.type=p, enable.compression=false, skip.dist.cache=false, hive.compute.stats.table=false, accumulo.max.latency=5000, db.username=farland, sqoop.throwOnError=false, db.clear.staging.table=false, codegen.input.delimiters.enclose=0, hdfs.append.dir=false, map.column.java.rowguid=String, import.direct.split.size=0, hcatalog.drop.and.create.table=false, codegen.output.delimiters.record=10, codegen.output.delimiters.field=1, hbase.bulk.load.enabled=false, mapreduce.num.mappers=1, export.new.update=UpdateOnly, db.require.password=true, hive.import=true, customtool.options.jsonmap={}, hdfs.delete-target.dir=false, codegen.output.delimiters.enclose.required=false, direct.import=false, codegen.output.dir=., hdfs.file.format=TextFile, hive.drop.delims=false, codegen.input.delimiters.record=0, db.batch=false, codegen.delete.compile.dir=false, split.limit=null, hcatalog.create.table=false, hive.fail.table.exists=false, hive.overwrite.table=false, incremental.mode=None, temporary.dirRoot=_sqoop, verbose=false, hbase.null.incremental.mode=Ignore, import.max.inline.lob.size=16777216, import.fetch.size=1000, codegen.input.delimiters.enclose.required=false, relaxed.isolation=false, sqoop.oracle.escaping.disabled=true, db.table=address, hbase.create.table=false, codegen.compile.dir=/tmp/sqoop-hive/compile/7c6dc91d93db1d6b608b978c395ea338, codegen.output.delimiters.escape=0, db.connect.string=jdbc:postgresql://10.52.28.72/adventureworks1}
23/01/11 04:34:25 INFO hook.AtlasHook: ==> Shutdown of Atlas Hook
23/01/11 04:34:25 INFO kafka.KafkaNotification: ==> KafkaNotification.createProducer()
23/01/11 04:34:25 INFO producer.ProducerConfig: ProducerConfig values:
......
23/01/11 04:34:25 WARN producer.ProducerConfig: The configuration 'key.deserializer' was supplied but isn't a known config.
23/01/11 04:34:25 WARN producer.ProducerConfig: The configuration 'value.deserializer' was supplied but isn't a known config.
23/01/11 04:34:25 WARN producer.ProducerConfig: The configuration 'hook.group.id' was supplied but isn't a known config.
23/01/11 04:34:25 WARN producer.ProducerConfig: The configuration 'zookeeper.connection.timeout.ms' was supplied but isn't a known config.
23/01/11 04:34:25 WARN producer.ProducerConfig: The configuration 'zookeeper.session.timeout.ms' was supplied but isn't a known config.
23/01/11 04:34:25 WARN producer.ProducerConfig: The configuration 'enable.auto.commit' was supplied but isn't a known config.
23/01/11 04:34:25 WARN producer.ProducerConfig: The configuration 'zookeeper.connect' was supplied but isn't a known config.
23/01/11 04:34:25 WARN producer.ProducerConfig: The configuration 'zookeeper.sync.time.ms' was supplied but isn't a known config.
23/01/11 04:34:25 WARN producer.ProducerConfig: The configuration 'session.timeout.ms' was supplied but isn't a known config.
23/01/11 04:34:25 WARN producer.ProducerConfig: The configuration 'auto.offset.reset' was supplied but isn't a known config.
23/01/11 04:34:25 INFO utils.AppInfoParser: Kafka version : 1.1.1.3.0.1.0-187
23/01/11 04:34:25 INFO utils.AppInfoParser: Kafka commitId : bbbf85928afedf80
23/01/11 04:34:25 INFO kafka.KafkaNotification: <== KafkaNotification.createProducer()
23/01/11 04:34:25 INFO clients.Metadata: Cluster ID: DuV2mwAuTgGO5-CxpMrQeg
23/01/11 04:34:25 INFO hook.AtlasHook: <== Shutdown of Atlas Hook
Finally, verify the row count in Hive:
$ hive
> select count(*) from address;
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 3.60 s
----------------------------------------------------------------------------------------------
INFO : OK
+--------+
| _c0 |
+--------+
| 19614 |
+--------+
1 row selected (9.405 seconds)
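The imported table is stored as plain text (STORED AS TEXTFILE in the generated DDL above). To follow the ORC recommendation from the checklist, one option is to copy it into an ORC table with CTAS (a sketch; the table name address_orc and the ZLIB codec are assumptions):

$ beeline -u "jdbc:hive2://sandbox-hdp.hortonworks.com:10000/default" \
    -e "CREATE TABLE address_orc STORED AS ORC TBLPROPERTIES ('orc.compress'='ZLIB') AS SELECT * FROM address"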