2017-11-14

I'm new to the wonderful caret package and am trying to reproduce some of the objects in the train() output for an lm model whose resampling method is 'timeslice'.

  1. Why do $results$RMSE and $results$Rsquared differ from the output of defaultSummary() applied to $pred$pred and $pred$obs in my example?
  2. What data is used to compute RMSE, Rsquared, and MAE in $resample?

    require(caret) 
    require(doParallel) 
    
    no_cores <- detectCores() - 1 
    cls = makeCluster(no_cores) 
    registerDoParallel(cls) 
    
    data(economics) 
    #str(economics) 
    ec.data <- as.data.frame(economics[,-1]) #drop 'date' column 
    #head(ec.data) 
    
    #trainControl() with parallel processing and 1 step forecasts by TimeSlices------------------------ 
    set.seed(123) 
    samplesCount = nrow(ec.data) 
    initialWindow = 10 
    h = 1 
    s = 0 
    M = 1 # no of models that are evaluated during each resample (tuning parameters) 
    
    #seeds 
    resamplesCount = length(createTimeSlices(1:samplesCount, initialWindow, horizon = h, fixedWindow = TRUE, skip = s)$test) 
    seeds <- vector(mode = "list", length = resamplesCount + 1) # length = B+1, B = number of resamples 
    for(i in 1:resamplesCount) seeds[[i]] <- sample.int(1000, M) # The first B elements of the list should be vectors of integers of >= length M where M is the number of models being evaluated for each resample. 
    seeds[[(resamplesCount+1)]] <- sample.int(1000, 1) # The last element of the list only needs to be a single integer (for the final model) 
    
    
    trainCtrl.ec <- trainControl(
        method = "timeslice", initialWindow = initialWindow, horizon = h, skip = s, # data splitting 
        returnResamp = "all", 
        savePredictions = "all", 
        seeds = seeds, 
        allowParallel = TRUE) 
    
    
    lm.fit.ec <- train(unemploy ~ ., data = ec.data, 
            method = "lm", 
            trControl = trainCtrl.ec) 
    
    lm.fit.ec 
    head(lm.fit.ec$resample) 
    

Output:

> lm.fit.ec 
Linear Regression 

574 samples 
    4 predictor 

No pre-processing 
Resampling: Rolling Forecasting Origin Resampling (1 held-out with a fixed window) 
Summary of sample sizes: 10, 10, 10, 10, 10, 10, ... 
Resampling results: 

    RMSE     Rsquared  MAE
    250.072  NaN       250.072

Tuning parameter 'intercept' was held constant at a value of TRUE 

Why are these not the same RMSE and Rsquared that defaultSummary() returns?

dat <- as.data.frame(cbind(lm.fit.ec$pred$pred, lm.fit.ec$pred$obs)) 
colnames(dat) <- c("pred", "obs") 
defaultSummary(dat) 

> defaultSummary(dat)
      RMSE   Rsquared        MAE
394.440680   0.978365 250.072031

How can I reproduce the results in $resample?

> head(lm.fit.ec$resample)
       RMSE Rsquared       MAE intercept    Resample
1  16.33273       NA  16.33273      TRUE Training010
2 232.16184       NA 232.16184      TRUE Training011
3 197.65143       NA 197.65143      TRUE Training012
4 393.29469       NA 393.29469      TRUE Training013
5 129.99157       NA 129.99157      TRUE Training014
6  60.95649       NA  60.95649      TRUE Training015

Related thread: https://stats.stackexchange.com/questions/114168/how-to-get-sub-training-and-sub-test-from-cross-validation-in-caret

Session info:

> sessionInfo() 
R version 3.4.2 (2017-09-28) 
Platform: x86_64-w64-mingw32/x64 (64-bit) 
Running under: Windows >= 8 x64 (build 9200) 

Matrix products: default 

locale: 
[1] LC_COLLATE=Swedish_Sweden.1252 LC_CTYPE=Swedish_Sweden.1252 LC_MONETARY=Swedish_Sweden.1252 
[4] LC_NUMERIC=C     LC_TIME=Swedish_Sweden.1252  

attached base packages: 
[1] parallel stats  graphics grDevices utils  datasets methods base  

other attached packages: 
[1] fpp_0.5    tseries_0.10-42  lmtest_0.9-35  zoo_1.8-0   
[5] expsmooth_2.3  fma_2.3    forecast_8.2  mlbench_2.1-1  
[9] spikeslab_1.1.5  randomForest_4.6-12 lars_1.2   doParallel_1.0.11 
[13] iterators_1.0.8  foreach_1.4.3  caret_6.0-77.9000 ggplot2_2.2.1  
[17] lattice_0.20-35 

Answer


I found the answer to my own question in the Cross Validated thread linked above.

Q1: Why do $results$RMSE and $results$Rsquared differ from the output of defaultSummary() applied to $pred$pred and $pred$obs in my example?

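A: The metrics reported by train() are computed as the mean of the per-holdout metrics, not from all held-out predictions pooled together. In my example: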

# The output is the mean of $resample
mean(lm.fit.ec$resample$RMSE) # = 250.072
mean(lm.fit.ec$resample$MAE)  # = 250.072
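
The gap between the two RMSE values can be illustrated without caret. The following is a minimal base-R sketch, assuming five made-up one-row holdout errors (the err vector and the rmse helper are hypothetical and not taken from the economics fit). It compares the mean of per-holdout RMSE values with a single RMSE computed over the pooled errors, which is what happens when defaultSummary() is given all saved predictions at once.

```r
# Hypothetical prediction errors (pred - obs) for five one-row holdouts.
# These numbers are illustrative only and do not come from the original model.
err <- c(16, -232, 198, -393, 130)

# Root mean squared error of a vector of errors.
rmse <- function(e) sqrt(mean(e^2))

# Per-holdout RMSE: with a single row per holdout this is just |error|.
per_holdout_rmse <- sapply(err, rmse)

# Averaging the per-holdout values mirrors how the resampled summary is formed.
mean_of_rmse <- mean(per_holdout_rmse)

# Pooling all errors first gives a different (here larger) number.
pooled_rmse <- rmse(err)

print(c(mean_of_rmse = mean_of_rmse, pooled_rmse = pooled_rmse))
```

With one-row holdouts the averaged RMSE equals the pooled MAE, which matches the identical RMSE and MAE columns in the train() output.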

Q2: What data is used to compute RMSE, Rsquared, and MAE in $resample?

> head(lm.fit.ec$resample)
       RMSE Rsquared       MAE intercept    Resample
1  16.33273       NA  16.33273      TRUE Training010
2 232.16184       NA 232.16184      TRUE Training011
3 197.65143       NA 197.65143      TRUE Training012
4 393.29469       NA 393.29469      TRUE Training013
5 129.99157       NA 129.99157      TRUE Training014
6  60.95649       NA  60.95649      TRUE Training015


first_holdout <- subset(lm.fit.ec$pred, Resample == "Training010") 
first_holdout 

> first_holdout
      pred  obs rowIndex intercept    Resample
1 2756.333 2740       11      TRUE Training010 # only 1 row since 1 step forecast horizon
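
The first holdout is row 11 because the first training window covers rows 1 to 10. Below is a minimal base-R sketch of fixed-window, one-step-ahead slicing. rolling_slices is a hypothetical helper written for illustration, not caret's createTimeSlices() implementation; its arguments mirror the settings in the question (574 rows, initialWindow = 10, horizon = 1, skip = 0).

```r
# Hypothetical helper: fixed-window rolling-origin slices with no skipping.
# This is an illustrative sketch, not the code caret uses internally.
rolling_slices <- function(n, initial_window, horizon = 1) {
  # Each slice starts one row later than the previous one.
  starts <- seq_len(n - initial_window - horizon + 1)
  lapply(starts, function(s) {
    list(
      train = s:(s + initial_window - 1),
      test  = (s + initial_window):(s + initial_window + horizon - 1)
    )
  })
}

slices <- rolling_slices(n = 574, initial_window = 10, horizon = 1)

print(length(slices))     # number of resamples
print(slices[[1]]$train)  # rows used to fit the first model
print(slices[[1]]$test)   # the single held-out row for the first model
```

Under these assumptions every test set contains exactly one row, so each row of $resample is computed from a single prediction.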


# Calculate RMSE, Rsquared and MAE for the holdout set 
postResample(first_holdout$pred, first_holdout$obs) 

> postResample(first_holdout$pred, first_holdout$obs)
    RMSE Rsquared      MAE
16.33273       NA 16.33273

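My confusion was mainly due to Rsquared being NA. But because the forecast horizon was one step, every holdout sample contains only a single row, so Rsquared cannot be computed. For the same reason, RMSE and MAE are identical within each resample: with one observation, both reduce to the absolute error.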
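
The one-row behaviour can be checked in base R. This is a small sketch under stated assumptions: holdout_metrics is a hypothetical helper, and the guard that returns NA when either vector has fewer than two distinct values is my own way of expressing that a correlation-based R-squared is undefined there; it is not copied from caret. The one-row inputs are the pred and obs values of the first holdout shown above.

```r
# Hypothetical helper computing the three metrics for one holdout set.
holdout_metrics <- function(pred, obs) {
  err <- pred - obs
  # A squared correlation needs at least two distinct values in each vector.
  r2 <- if (length(unique(pred)) < 2 || length(unique(obs)) < 2) {
    NA_real_
  } else {
    cor(pred, obs)^2
  }
  c(RMSE = sqrt(mean(err^2)), Rsquared = r2, MAE = mean(abs(err)))
}

# One-row holdout, using the first holdout's values from the answer.
one_row <- holdout_metrics(2756.333, 2740)

# Made-up three-row holdout, for contrast: Rsquared is now defined.
three_rows <- holdout_metrics(c(1, 2, 4), c(2, 4, 7))

print(one_row)
print(three_rows)
```

If a per-resample Rsquared is wanted, a longer horizon would give each holdout more than one row; whether that suits the forecasting task is a separate design decision.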