Python sklearn logistic regression K-fold cross-validation: how to create a DataFrame for coef_ (Python 3.5)

I have a dataset stored in a variable, file, and I am trying to apply 10-fold cross-validation with logistic regression. What I'm looking for is a way to list the averages of clf.coef_. Here is the data:

print(file.head()) 

   Result  Interest  Limit  Service  Convenience  Trust  Speed
0       0         1      1        1            1      1      1
1       0         1      1        1            1      1      1
2       0         1      1        1            1      1      1
3       0         4      4        3            4      2      3
4       1         4      4        4            4      4      4

Here is the simple logistic regression code I wrote to show the list of coef_ values.

[In]

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in 0.20
from sklearn.linear_model import LogisticRegression

# Features are every column except the target, Result
X = file.drop(columns=['Result'])
y = file['Result']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
# The L1 penalty needs a solver that supports it, such as liblinear
clf = LogisticRegression(penalty='l1', solver='liblinear')
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(accuracy)

# Pair each feature name with its fitted coefficient
coeff_df = pd.DataFrame([X.columns, clf.coef_[0]]).T
print(coeff_df)

[Out]

0.823061630219 

             0          1
0     Interest   0.163577
1        Limit  -0.161104
2      Service   0.323073
3  Convenience   0.121573
4        Trust   0.370012
5        Speed   0.089934
6        Major   0.183002
7          Ads  0.0137151

Next, I tried to apply 10-fold cross-validation to the same dataset. I have the code below, but I could not build a DataFrame of the coef_ values, like coeff_df in the analysis above. Can someone provide a solution?

[In]

from sklearn.model_selection import cross_val_score  # formerly sklearn.cross_validation

scores = cross_val_score(clf, X, y, cv=10)
print(scores)
print(np.average(scores))

[Out]

[ 0.82178218 0.7970297 0.84158416 0.80693069 0.84158416 0.80693069 
    0.825  0.825  0.815  0.76  ] 
0.814084158416

Source

2017-03-05 Ryo

cross_val_score is a helper function that wraps scikit-learn's various cross-validation objects (e.g. KFold, StratifiedKFold). It returns a list of scores based on the scoring parameter used (for classification problems, this defaults to accuracy).
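To illustrate that wrapping, here is a minimal sketch on synthetic data (the names X_demo, y_demo and clf_demo, and the 200-row dataset, are made up for illustration and are not the asker's data). It assumes a reasonably recent scikit-learn, where an integer cv with a classifier and a binary target is turned into a StratifiedKFold internally, so passing the splitter explicitly should give the same scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real data (hypothetical: 200 rows, 6 features).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# random_state fixes liblinear's internal randomness so both runs are comparable.
clf_demo = LogisticRegression(penalty="l1", solver="liblinear", random_state=0)

# An integer cv lets cross_val_score build the splitter itself.
scores_int = cross_val_score(clf_demo, X_demo, y_demo, cv=10, scoring="accuracy")

# Passing the splitter explicitly uses the same unshuffled stratified folds.
scores_skf = cross_val_score(
    clf_demo, X_demo, y_demo, cv=StratifiedKFold(n_splits=10), scoring="accuracy"
)

print(scores_int.shape)  # (10,): one accuracy value per fold
print(np.allclose(scores_int, scores_skf))
```

Either way, only the per-fold scores come back; the fitted models themselves are discarded.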

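The object returned by cross_val_score gives you no access to the underlying folds/models used during cross-validation, so you cannot retrieve each model's coefficients from it.

To get the coefficients for each fold of the cross-validation, I'd suggest using KFold (or StratifiedKFold if your classes are imbalanced).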

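import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

df = pd.read_clipboard()
# Repeat the sample so each class has enough rows to stratify;
# drop=True keeps the old index from becoming a feature column
file = pd.concat([df, df, df]).reset_index(drop=True)

X = file.drop(columns=['Result'])
y = file['Result']

# No shuffling, so random_state is omitted (recent scikit-learn rejects it when shuffle=False)
skf = StratifiedKFold(n_splits=2)

models, coefs = [], []  # in case you want to inspect the models later, too
for train, test in skf.split(X, y):
    print(train, test)
    # liblinear is a solver that supports the L1 penalty
    clf = LogisticRegression(penalty='l1', solver='liblinear')
    # split() yields positional indices, hence iloc
    clf.fit(X.iloc[train], y.iloc[train])
    models.append(clf)
    coefs.append(clf.coef_[0])

# One row per fold, one column per feature; mean() averages over the folds
pd.DataFrame(coefs, columns=X.columns).mean()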

This gives us:

Interest       0.000000
Limit          0.000000
Service        0.000000
Convenience    0.000000
Trust          0.530811
Speed          0.000000
dtype: float64

I made up the data from your example (which contains only a single instance of the positive class). I suspect these numbers will be non-zero in your case.
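As a possible alternative to the manual loop, newer scikit-learn releases (0.20 and later, to my knowledge) offer cross_validate, which can hand back the fitted model of every fold via return_estimator=True. The sketch below runs on synthetic data; X_syn, y_syn and the reuse of the question's column names are assumptions for illustration only, not the asker's dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic data labelled with the column names from the question (illustrative only).
feature_names = ["Interest", "Limit", "Service", "Convenience", "Trust", "Speed"]
X_arr, y_arr = make_classification(n_samples=200, n_features=6, random_state=0)
X_syn = pd.DataFrame(X_arr, columns=feature_names)
y_syn = pd.Series(y_arr, name="Result")

# return_estimator=True keeps the model fitted on each training fold.
cv_out = cross_validate(
    LogisticRegression(penalty="l1", solver="liblinear", random_state=0),
    X_syn,
    y_syn,
    cv=10,
    return_estimator=True,
)

# One row per fold, one column per feature.
coef_per_fold = pd.DataFrame(
    [est.coef_[0] for est in cv_out["estimator"]],
    columns=feature_names,
)
mean_coef = coef_per_fold.mean()

print(coef_per_fold.shape)    # (10, 6)
print(mean_coef)              # average coefficient per feature
print(cv_out["test_score"])   # per-fold accuracy, as cross_val_score would return
```

The explicit loop is still handy when you want full control over what happens inside each fold.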

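Edit: Since StratifiedKFold (or KFold) gives us the cross-validation splits of the dataset, you can still compute the cross-validation scores using the model's score method.

The version below is slightly modified from the one above to capture the cross-validation score of each fold.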

models, scores, coefs = [], [], []  # in case you want to inspect the models later, too
for train, test in skf.split(X, y):
    print(train, test)
    # liblinear is a solver that supports the L1 penalty
    clf = LogisticRegression(penalty='l1', solver='liblinear')
    # split() yields positional indices, hence iloc
    clf.fit(X.iloc[train], y.iloc[train])
    # Accuracy on the held-out fold
    score = clf.score(X.iloc[test], y.iloc[test])
    models.append(clf)
    scores.append(score)
    coefs.append(clf.coef_[0])
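Once the loop has run, the per-fold scores and coefficients can be gathered into a single table. Here is a self-contained sketch of that idea on synthetic data; the names cols, X_syn, y_syn, skf10 and summary are hypothetical, and the dataset is generated rather than taken from the question.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic data labelled with the column names from the question (illustrative only).
cols = ["Interest", "Limit", "Service", "Convenience", "Trust", "Speed"]
X_arr, y_arr = make_classification(n_samples=200, n_features=6, random_state=0)
X_syn = pd.DataFrame(X_arr, columns=cols)
y_syn = pd.Series(y_arr, name="Result")

skf10 = StratifiedKFold(n_splits=10)

rows = []
for train, test in skf10.split(X_syn, y_syn):
    model = LogisticRegression(penalty="l1", solver="liblinear", random_state=0)
    model.fit(X_syn.iloc[train], y_syn.iloc[train])
    # Store this fold's coefficients together with its held-out accuracy.
    row = dict(zip(cols, model.coef_[0]))
    row["score"] = model.score(X_syn.iloc[test], y_syn.iloc[test])
    rows.append(row)

# One row per fold: six coefficient columns plus the score column.
summary = pd.DataFrame(rows)
summary.index.name = "fold"

print(summary)
print(summary.mean())  # mean coefficient per feature and mean accuracy
```

Here summary["score"] plays the role of the array that cross_val_score would return, and summary[cols].mean() is the averaged coef_ table.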

Source

2017-03-05 21:55:43

Thank you! The code works! One more question: based on your code, is there a way to produce the list of scores? I set the 'L1 penalty', and cross_val_score won't let me do that. – Ryo

I've updated my answer to address this. –
