2017-04-20 6 views
0

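SVM ValueError in text classification

I read the scikit-learn SVM tutorial and wrote training and testing code. However, at prediction time I run into an error saying that the input shape should be equal to the training shape. My code is below.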

EDIT1: I also found a similar question on SO. Data:

ERROR_DESC CLASSIFICATION_LABEL 
ERROR manager.SqlManager: Error executing statement: java.sql.SQLException: ORA-01017: invalid username/password; logon denied at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:447) at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:389) at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:382) at oracle.jdbc.driver.T4CTTIfun.processError(T4CTTIfun.java:675) at oracle.jdbc.driver.T4CTTIoauthenticate.processError(T4CTTIoauthenticate.java:448) at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:513) -- ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: No columns to generate for ClassWriter at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1095),INCORRECT_CREDENTIALS-Database-RAISE_SERVICENOW_DB_CREDENTIALS 
A client error (ThrottlingException) occurred when calling the DescribeCluster operation: Rate exceeded fetching DNS name -- ERROR manager.SqlManager: Error executing statement: java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:489) -- ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: No columns to generate for ClassWriter at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1095), NETWORK_ERROR-Database-RAISE_SERVICENOW_DB_CONNECTION 

Sample: Link. I tried using transform, but I get a different error.

import pandas as pd

# data paths
data_in = '../data/input/file.csv'

df_data = pd.read_csv(data_in)

# lower case all columns for uniformity
df_data.columns = df_data.columns.str.lower()
# lower case all data for uniformity
df_data = df_data.apply(lambda x: x.astype(str).str.lower())

labels = df_data['classification_label'].unique()

# map each distinct label to an integer id, starting at 1
label_map = {label: i for i, label in enumerate(labels, start=1)}

# apply map to classification_label column
# df_data['classification_label'] = df_data['classification_label'].map(lambda s: label_map.get(s) if s in label_map else s)

# select features and labels 
df_final = df_data[['error_desc', 'classification_label']] 


from sklearn.feature_extraction.text import TfidfVectorizer 
v = TfidfVectorizer() 
X = v.fit_transform(df_final['error_desc']) 
y = df_final['classification_label'] 


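# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)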


from sklearn.svm import SVC

def train_svm(X, y):
    """ 
    Create and train the Support Vector Machine. 
    """ 
    svm = SVC(C=1000000.0, gamma='auto', kernel='rbf') 
    svm.fit(X, y) 
    return svm 



svm = train_svm(X_train, y_train) 



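from sklearn.metrics import confusion_matrix

# Make an array of predictions on the test set
pred = svm.predict(X_test)

# Output the hit-rate and the confusion matrix for the model
print(svm.score(X_test, y_test))
# Note: the signature is confusion_matrix(y_true, y_pred); with the arguments
# in this order, rows are predicted labels and columns are true labels
print(confusion_matrix(pred, y_test))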



0.777777777778 
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0] 
[0 2 0 0 0 0 0 0 0 0 0 0 0 0] 
[0 0 2 0 0 0 0 0 0 0 0 0 0 0] 
[0 0 0 0 0 0 0 0 0 0 1 0 0 0] 
[0 0 0 0 3 0 0 0 0 0 0 0 0 0] 
[0 0 0 0 0 0 0 0 0 0 0 0 0 0] 
[0 0 0 0 0 0 0 0 0 0 0 0 0 0] 
[0 0 0 0 0 0 1 0 0 0 0 0 0 0] 
[1 0 0 0 0 1 0 0 1 0 0 0 0 0] 
[0 0 0 0 0 0 0 0 0 1 0 0 0 0] 
[0 0 0 0 0 0 0 0 0 0 0 0 0 0] 
[0 0 0 0 0 0 0 0 0 0 0 3 0 0] 
[0 0 0 0 0 0 0 0 0 0 0 0 1 0] 
[0 0 0 0 0 0 0 0 0 0 0 0 0 1]] 



pred_x = """ERROR manager.SqlManager: Error executing statement: java.sql.SQLException: ORA-01017: invalid username/password; logon denied at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:447) at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:389) at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:382) at oracle.jdbc.driver.T4CTTIfun.processError(T4CTTIfun.java:675) at oracle.jdbc.driver.T4CTTIoauthenticate.processError(T4CTTIoauthenticate.java:448) at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:513) -- ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: No columns to generate for ClassWriter at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1095)"""



pred_x_vector = TfidfVectorizer().fit_transform([pred_x]) 


svm.predict(pred_x_vector) 



--------------------------------------------------------------------------- 
ValueError        Traceback (most recent call last) 
<ipython-input-86-130bf7f79131> in <module>() 
----> 1 svm.predict(pred_x_vector) 

/Users/userOne/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in predict(self, X) 
    571    Class labels for samples in X. 
    572   """ 
--> 573   y = super(BaseSVC, self).predict(X) 
    574   return self.classes_.take(np.asarray(y, dtype=np.intp)) 
    575 

/Users/userOne/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in predict(self, X) 
    308   y_pred : array, shape (n_samples,) 
    309   """ 
--> 310   X = self._validate_for_predict(X) 
    311   predict = self._sparse_predict if self._sparse else self._dense_predict 
    312   return predict(X) 

/Users/userOne/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in _validate_for_predict(self, X) 
    477    raise ValueError("X.shape[1] = %d should be equal to %d, " 
    478        "the number of features at training time" % 
--> 479        (n_features, self.shape_fit_[1])) 
    480   return X 
    481 

ValueError: X.shape[1] = 49 should be equal to 554, the number of features at training time 
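A TfidfVectorizer builds vectors whose size follows the vocabulary it was fit on, so a vectorizer fit only on the new text produces vectors of a different size. To predict with the trained model you need vectors of the same size. In this case you can use the vocabulary parameter, but you have to save the original vocabulary first. Let me know if you need further help. –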

@EzerK I'm a beginner, so please bear with me. Do you have any code you can share, or can you point me to a better approach? – user6083088

If you post sample data, I can modify your code and try it. –

Answer

import pandas as pd 

df_data = pd.DataFrame([['ERROR manager.SqlManager: Error executing statement: java.sql.SQLException: ORA-01017: invalid username/password; logon denied at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:447) at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:389) at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:382) at oracle.jdbc.driver.T4CTTIfun.processError(T4CTTIfun.java:675) at oracle.jdbc.driver.T4CTTIoauthenticate.processError(T4CTTIoauthenticate.java:448) at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:513) -- ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: No columns to generate for ClassWriter at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1095)','INCORRECT_CREDENTIALS-Database-RAISE_SERVICENOW_DB_CREDENTIALS'],\ 
['A client error (ThrottlingException) occurred when calling the DescribeCluster operation: Rate exceeded fetching DNS name -- ERROR manager.SqlManager: Error executing statement: java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:489) -- ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: No columns to generate for ClassWriter at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1095)', 'NETWORK_ERROR-Database-RAISE_SERVICENOW_DB_CONNECTION']]) 

df_data.columns = ['ERROR_DESC' , 'CLASSIFICATION_LABEL'] 

# lower case all columns for uniformity 
df_data.columns = map(str.lower, df_data.columns) 

# select features and labels 
df_final = df_data[['error_desc', 'classification_label']] 

from sklearn.feature_extraction.text import TfidfVectorizer 
v = TfidfVectorizer() 
X = v.fit_transform(df_final['error_desc']) 
y = df_final['classification_label'] 
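# save the original vocabulary (term -> column index);
# get_feature_names() was removed in scikit-learn 1.2, vocabulary_ works across versions
orig_vocab = v.vocabulary_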

from sklearn.svm import SVC 

def train_svm(X, y): 
    """ 
    Create and train the Support Vector Machine. 
    """ 
    svm = SVC(C=1000000.0, gamma='auto', kernel='rbf') 
    svm.fit(X, y.values) 
    return svm 

svm = train_svm(X, y) 

pred_x = """ERROR manager.SqlManager: Error executing statement: java.sql.SQLException: ORA-01017: invalid username/password; logon denied at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:447) at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:389) at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:382) at oracle.jdbc.driver.T4CTTIfun.processError(T4CTTIfun.java:675) at oracle.jdbc.driver.T4CTTIoauthenticate.processError(T4CTTIoauthenticate.java:448) at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:513) -- ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: No columns to generate for ClassWriter at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1095)""" 
pred_x_vector = TfidfVectorizer(vocabulary=orig_vocab).fit_transform([pred_x]) #vectorize by original vocabulary 

svm.predict(pred_x_vector) 

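Explanation:

A trained model can only predict on vectors of the same size as the vectors it was trained on. So if you vectorize text with a bag-of-words approach, you need to keep the original vocabulary from the training samples in order to build new vectors against that same vocabulary.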
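The size mismatch can be reproduced on a toy corpus. The sketch below is not the code from the question: the two training sentences and the new sentence are made up for illustration. It compares a freshly fit vectorizer with reusing the already fitted one through `transform`, which is an alternative to passing `vocabulary=` and also keeps the IDF weights learned at training time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the training error messages (made up for illustration).
train_docs = [
    "invalid username password logon denied",
    "network adapter could not establish the connection",
]
new_doc = ["logon denied for user"]

v = TfidfVectorizer()
X_train = v.fit_transform(train_docs)

# A fresh vectorizer fit on the new text learns its own, smaller vocabulary,
# so the number of columns no longer matches the training matrix.
X_fresh = TfidfVectorizer().fit_transform(new_doc)

# Reusing the fitted vectorizer keeps the training vocabulary (and IDF weights),
# so the column count matches what the model saw during fit.
X_reused = v.transform(new_doc)

print("train:", X_train.shape)
print("fresh:", X_fresh.shape)
print("reused:", X_reused.shape)
```

Only `X_reused` has the same number of columns as `X_train`, so it is the only one of the two that a model trained on `X_train` would accept.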

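Remarks:

  1. Because I used only two samples, there is no train/test split and no cross-validation (the model is simply trained on all the data).

  2. There is no need to lowercase the data; sklearn's vectorizer does that for you.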
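A quick way to check the lowercasing remark, assuming the default `TfidfVectorizer` settings (`lowercase=True`); the input strings are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# lowercase=True is the default, so differently cased tokens collapse into one term
v = TfidfVectorizer()
v.fit(["ERROR Error error", "Network NETWORK"])

vocab = sorted(v.vocabulary_)
print(vocab)  # ['error', 'network']
```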

OK, but the prediction comes out wrong. Any ideas? – user6083088

'INCORRECT_CREDENTIALS-Database-RAISE_SERVICENOW_DB_CREDENTIALS' is wrong? –

I am getting something different. But I have 83 rows; could that be the reason? Is there a way to print the prediction score? In any case, I will accept your answer since you helped me. But if you can offer any guidance that would help: – user6083088
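On printing a prediction score: one option that `SVC` offers is `decision_function`, which returns a per-class score for each sample. The sketch below is only an illustration under assumptions: the documents, the three short label names, and the linear kernel are made up here and are not the data or settings from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical toy data: three classes, two short documents each.
docs = [
    "invalid username password logon denied",
    "logon denied wrong password",
    "network adapter could not establish the connection",
    "connection timed out network unreachable",
    "rate exceeded throttling exception",
    "throttling exception rate limit exceeded",
]
labels = ["credentials", "credentials", "network", "network",
          "throttling", "throttling"]

v = TfidfVectorizer()
X = v.fit_transform(docs)

# decision_function_shape="ovr" gives one score column per class.
clf = SVC(kernel="linear", decision_function_shape="ovr")
clf.fit(X, labels)

# Vectorize the new text with the SAME fitted vectorizer.
new = v.transform(["logon denied"])
scores = clf.decision_function(new)

# Print the score next to each class label; a higher score means a stronger vote.
for label, score in zip(clf.classes_, scores[0]):
    print(label, round(float(score), 3))
print("predicted:", clf.predict(new)[0])
```

These scores are not probabilities. `SVC(probability=True)` additionally exposes `predict_proba`, but it is calibrated with internal cross-validation and may be unreliable on a dataset as small as 83 rows.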