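Scala: extracting random forest feature importances with names (labels)

Is there an easy way to extract the feature importances from the model and attach the featureCols names, so the result is easier to analyze?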

I have something like this:

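import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val featureCols = Array("a", "b", "c" /* .......... like 67 more */)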

val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features") 
val df2 = assembler.transform(modeling_db) 
val labelIndexer = new StringIndexer().setInputCol("def").setOutputCol("label") 
val df3 = labelIndexer.fit(df2).transform(df2) 
val splitSeed = 5043 
val Array(trainingData, testDataCE) = df3.randomSplit(Array(0.7, 0.3), splitSeed) 
val classifier = new RandomForestClassifier().setImpurity("gini").setMaxDepth(19).setNumTrees(57).setFeatureSubsetStrategy("auto").setSeed(5043) 
val model = classifier.fit(trainingData) 

After that, we try to extract the importances with:

model.featureImportances 

The output is really hard to analyze:

res14: org.apache.spark.mllib.linalg.Vector = (71,[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,20,23,25,27,33,34,35,38,39,41,42,45,47,48,49,50,51,52,53,54,55,56,57,58,60,61,62,63,64,65,66,67,68,69,70],[0.22362951804309808,0.1830148359365108,0.10246542303449771,0.1699399958851977,0.06486419413350401,0.05187244974385025,0.02627047699833213,0.014498050071723645,0.026182513062665076,0.007126662761055224,0.,0.004354513006816487,0.004361008357237427,0.008435852744278544,0.003195472326415685,0.0023071401643885753,0.004602370417578224,0.0030394399903992345,6.92408316823549E-4,0.011207695216651398,7.609910745572573E-4,8.316382113306638E-4,0.0021506289318167916,0.0013468620354363688,0.006968754359778437,0.018796331618729723,0.0024516591941419444,0.005980997035580654,0.0027983... 

Is there a way to "unpack" this output and attach the original label names to it?

Answer

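The original column names are in featureCols, and the importance vector does not seem to carry them. The vector is indexed in the same order as the input columns, though, so you can zip the two arrays together. For input data like this: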

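// The question's output shows an mllib vector; on Spark 2.x+ use org.apache.spark.ml.linalg.Vectors instead
import org.apache.spark.mllib.linalg.Vectors

val featureCols = Array("a", "b", "c", "d", "e")
val featureImportance = Vectors.dense(Array(0.15, 0.25, 0.1, 0.35, 0.15)).toSparse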

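you can simply do

// Pair each name with its importance, then sort by importance in descending order
val res = featureCols.zip(featureImportance.toArray).sortBy(-_._2)
res.foreach(println)

which prints

(d,0.35)
(b,0.25)
(a,0.15)
(e,0.15)
(c,0.1)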
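If you want to reuse the zip-and-sort idea, it can be wrapped in a small helper. This is only a minimal sketch in plain Scala with no Spark dependency: the object name `FeatureImportanceReport` and the method `rank` are made up for illustration, and the code assumes the importances have already been converted to an `Array[Double]` (for example with `toArray` on the vector).

```scala
// Hypothetical helper (not part of Spark): pairs feature names with importances.
object FeatureImportanceReport {

  // Returns (name, importance) pairs sorted by importance, highest first.
  // zip silently truncates to the shorter input, so mismatched lengths are
  // rejected up front instead of producing misaligned or missing labels.
  def rank(names: Array[String], importances: Array[Double]): Array[(String, Double)] = {
    require(
      names.length == importances.length,
      s"Got ${names.length} names but ${importances.length} importances"
    )
    names.zip(importances).sortBy { case (_, score) => -score }
  }

  def main(args: Array[String]): Unit = {
    // Same toy data as in the answer above
    val featureCols = Array("a", "b", "c", "d", "e")
    val importances = Array(0.15, 0.25, 0.1, 0.35, 0.15)

    // Print one "name  score" row per feature
    rank(featureCols, importances).foreach { case (name, score) =>
      println(f"$name%-10s $score%.4f")
    }
  }
}
```

With the model from the question, the call would presumably look like `FeatureImportanceReport.rank(featureCols, model.featureImportances.toArray)`. The length check is the main reason to prefer this over a bare zip: if the assembled vector has a different number of slots than featureCols, you get an error rather than a quietly shortened list.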