Spark ML - KMeans - org.apache.spark.sql.AnalysisException: cannot resolve 'features' given input columns

I am trying to analyze and cluster the Chicago crime dataset with Spark ML KMeans. Here is a snippet:
case class ChicCase(ID: Long, Case_Number: String, Date: String, Block: String, IUCR: String, Primary_Type: String, Description: String, Location_description: String, Arrest: Boolean, Domestic: Boolean, Beat: Int, District: Int, Ward: Int, Community_Area: Int, FBI_Code: String, X_Coordinate: Int, Y_Coordinate: Int, Year: Int, Updated_On: String, Latitude: Double, Longitude: Double, Location: String)
val city = spark.read.option("header", true).option("inferSchema", true).csv("/chicago_city/Crimes_2001_to_present_2").as[ChicCase]
val data = city.drop("ID", "Case_Number", "Date", "Block", "IUCR", "Primary_Type", "Description", "Location_description", "Arrest", "Domestic", "FBI_Code", "Year", "Location", "Updated_On")
val kmeans = new KMeans
kmeans.setK(10).setSeed(1L)
val model = kmeans.fit(data)
However, even though every remaining column is of type Int or Double, it throws the following exception:
org.apache.spark.sql.AnalysisException: cannot resolve '`features`' given input columns: [Ward, Longitude, X_Coordinate, Beat, Latitude, District, Y_Coordinate, Community_Area];
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:190)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:200)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:204)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:204)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:209)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:209)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2589)
  at org.apache.spark.sql.Dataset.select(Dataset.scala:969)
  at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:307) ... 90 elided
What could be the problem?
Do **not** include columns such as "Ward", "Beat", or "District" — they look numeric, but they are ID codes. You may get something meaningful via **visualization**. Spark is garbage for clustering (it lacks all the good algorithms); ELKI will be much faster. Ward boundaries: https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Wards-2015-/sp34-6z76 — don't treat ward numbers as numbers. –
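The exception itself points at the likely cause: Spark ML's `KMeans` does not consume raw numeric columns; it expects a single vector column, named `features` by default, and none of the remaining columns has that name. The usual fix is to run the columns through a `VectorAssembler` first. A minimal sketch, using the column names listed in the error message (the `data` value is assumed to be the Dataset from the question):

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

// Combine the remaining numeric columns into one vector column
// named "features", which is what KMeans looks for by default.
val assembler = new VectorAssembler()
  .setInputCols(Array("Beat", "District", "Ward", "Community_Area",
    "X_Coordinate", "Y_Coordinate", "Latitude", "Longitude"))
  .setOutputCol("features")

val featurized = assembler.transform(data)

// Now fit KMeans on the assembled features column.
val kmeans = new KMeans().setK(10).setSeed(1L)
val model = kmeans.fit(featurized)
```

Alternatively, `kmeans.setFeaturesCol(...)` can point the estimator at a differently named vector column. Note the commenter's caveat still applies: assembling ID-like codes such as `Ward` or `Beat` into the vector is syntactically valid but statistically questionable.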