本文共 6167 字,大约阅读时间需要 20 分钟。
hadoop版本:2.6.5spark版本:2.3.0hive版本:1.2.2master主机:192.168.11.170slave1主机:192.168.11.171
针对Hive表的sql语句会转化为MR程序,一般执行起来会比较耗时,spark sql也提供了对Hive表的支持,同时还可以降低运行时间。
pom.xml依赖如下:
4.0.0 com.tongfang.learn learn 1.0-SNAPSHOT learn http://www.example.com UTF-8 1.8 1.8 2.3.0 junit junit 4.11 test org.apache.spark spark-core_2.11 ${spark.core.version} org.apache.spark spark-sql_2.11 ${spark.core.version} mysql mysql-connector-java 5.1.38 org.apache.spark spark-hive_2.11 2.3.0
同时将hive-site.xml配置文件放到工程resources目录下,hive-site.xml配置如下:
hive.metastore.uris thrift://192.168.11.170:9083 hive.server2.thrift.port 10000 javax.jdo.option.ConnectionURL jdbc:mysql://192.168.11.170:3306/hive?createDatabaseIfNoExist=true&characterEncoding=utf8&useSSL=true&useUnicode=true&serverTimezone=UTC javax.jdo.option.ConnectionDriverName com.mysql.jdbc.Driver javax.jdo.option.ConnectionUserName root javax.jdo.option.ConnectionPassword chenliabc hive.metastore.warehouse.dir /user/hive/warehouse fs.defaultFS hdfs://192.168.11.170:9000 hive.metastore.schema.verification false datanucleus.autoCreateSchema true datanucleus.autoStartMechanism checked
实例代码:
import org.apache.spark.sql.SparkSession;public class HiveTest { public static void main(String[] args) { SparkSession spark = SparkSession .builder() .appName("Java Spark Hive Example") .enableHiveSupport() .getOrCreate(); spark.sql("create table if not exists person(id int,name string, address string) row format delimited fields terminated by '|' stored as textfile"); spark.sql("show tables").show(); spark.sql("load data local inpath '/home/hadoop/software/person.txt' overwrite into table person"); spark.sql("select * from person").show(); }}
person.txt如下:
1|tom|beijing2|allen|shanghai3|lucy|chengdu
在运行前需要确保hadoop集群正确启动,同时需要启动hive metastore服务。
./bin/hive --service metastore
提交spark任务:
spark-submit --class com.tongfang.learn.spark.hive.HiveTest --master yarn learn.jar
运行结果:
当然也可以直接在idea中直接运行,代码需要细微调整:
public class HiveTest { public static void main(String[] args) { SparkSession spark = SparkSession .builder() .master("local[*]") .appName("Java Spark Hive Example") .enableHiveSupport() .getOrCreate(); spark.sql("create table if not exists person(id int,name string, address string) row format delimited fields terminated by '|' stored as textfile"); spark.sql("show tables").show(); spark.sql("load data local inpath 'src/main/resources/person.txt' overwrite into table person"); spark.sql("select * from person").show(); }}
在运行中可能报以下错:
Exception in thread "main" org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.io.IOException: (null) entry in command string: null chmod 0700 C:\Users\dell\AppData\Local\Temp\c530fb25-b267-4dd2-b24d-741727a6fbf3_resources; at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194) at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114) at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102) at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52) at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1.(HiveSessionStateBuilder.scala:69) at org.apache.spark.sql.hive.HiveSessionStateBuilder.analyzer(HiveSessionStateBuilder.scala:69) at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293) at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293) at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:79) at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:79) at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57) at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638) at com.tongfang.learn.spark.hive.HiveTest.main(HiveTest.java:15)
解决方案:
1.下载hadoop windows binary包,点击。 2.在启动类的运行参数中设置环境变量,HADOOP_HOME=D:\winutils\hadoop-2.6.4,后面是hadoop windows 二进制包的目录。运行结果:
本文讲解了spark-sql访问Hive表的代码实现与两种运行方式。
转载地址:http://ozjmb.baihongyu.com/