Factors and dummies in R regression analysis

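OK, I have two problems, and they may be related. I will use an example that closely resembles my database. I have a column with 20 distinct names, like the presidents of a country (e.g. "George W.", "Bill C.", etc.). I also have 25 strategies (e.g. 'str_1', 'str2', etc.). They are all in the same database, called "dat", together with other variables such as y and x.

For example: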

==============================
y   x  presidents  strategies
==============================
20  2  Bill.C      3_A
10  1  George.W    2_B
10  1  Tom_C       3_C
3   2  Tom_C       2_D
4   4  John.C      3_A
4   3  Bill.C      2_A

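I want to regress y ~ x + dummies for presidents + dummies for strategies + interactions between presidents and strategies.

I have already created dummies for each of the 20 presidents and each of the 25 strategies, but I don't know how to create the interactions between each president and each strategy (that is the first part of my problem). Assuming this can be done easily, is there another way to specify the regression without writing out the 20 * 25 interactions one by one? (Stata has a command for this same problem.)

These may be separate questions, but I'm not sure.

Thanks in advance.

Source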

2017-10-17 RandomWalker

It would help if you could provide a (small) example data frame. –

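"Is there another way to specify the regression without writing out the 20 * 25 interactions one by one?" Yes, there is. 'lm' automatically converts factor variables into the corresponding dummies (leaving one level out as the reference category). So it is enough to write 'lm(y ~ x + presidents + strategies + presidents:strategies, data = dat)'. You can also write 'lm(y ~ x + presidents * strategies, data = dat)', which is the same specification. – useR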
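To see the dummy expansion described in this comment, one option is to inspect the design matrix directly with model.matrix(). The sketch below uses a small made-up data frame called toy (its values are illustrative assumptions, not the asker's "dat") with three presidents and two strategies.

```r
# Toy data (hypothetical values, not the asker's "dat")
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  presidents = factor(c("Bill.C", "Bill.C", "George.W", "George.W", "Tom_C", "Tom_C")),
  strategies = factor(c("2_A", "3_A", "2_A", "3_A", "2_A", "3_A"))
)

# model.matrix() builds the design matrix that lm() would construct internally
mm_short <- model.matrix(~ x + presidents * strategies, data = toy)
mm_long  <- model.matrix(~ x + presidents + strategies + presidents:strategies,
                         data = toy)

# One dummy per non-reference level, plus one column per pair of those dummies
colnames(mm_short)

# Both ways of writing the formula expand to the same columns
identical(colnames(mm_short), colnames(mm_long))
```

The first level of each factor ("Bill.C" and "2_A" here) gets no column of its own; it is absorbed into the intercept as the reference category.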

OLS cannot handle a data set that has more variables (including the dummies and interactions) than observations, so you need to provide a larger data set. – useR
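A rough illustration of this point, assuming the six example rows from the question are the whole data set: the interaction model asks for more coefficients than there are observations, so lm() cannot estimate most of them and reports them as NA. The object names small, n_coef, and n_unest are illustrative.

```r
# The six example rows from the question, read in as factors
small <- read.table(text = "y x presidents strategies
20 2 Bill.C 3_A
10 1 George.W 2_B
10 1 Tom_C 3_C
3 2 Tom_C 2_D
4 4 John.C 3_A
4 3 Bill.C 2_A", header = TRUE, stringsAsFactors = TRUE)

fit <- lm(y ~ x + presidents * strategies, data = small)

n_obs   <- nrow(small)             # number of observations
n_coef  <- length(coef(fit))       # coefficients the formula asks for
n_unest <- sum(is.na(coef(fit)))   # coefficients lm() could not estimate

c(observations = n_obs, coefficients = n_coef, not_estimated = n_unest)
```

With 4 presidents and 5 strategies the formula requests 1 + 1 + 3 + 4 + 12 = 21 coefficients. By the same count, 20 presidents and 25 strategies would give 1 + 1 + 19 + 24 + 456 = 501, so the full data set would need many more rows than that.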

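lm and glm automatically convert factor variables into the corresponding dummies (leaving one level out as the reference category). So it is enough to do the following:

# df1 is the simulated data frame defined under "Data" below
# mod1/mod3 spell out main effects plus interaction; mod2/mod4 use the * shorthand
mod1 = lm(y ~ x + presidents + strategies + presidents:strategies, data = df1)
mod2 = lm(y ~ x + presidents*strategies, data = df1)
# glm() defaults to the gaussian family, so it fits the same linear model
mod3 = glm(y ~ x + presidents + strategies + presidents:strategies, data = df1)
mod4 = glm(y ~ x + presidents*strategies, data = df1)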

summary(mod1) 
summary(mod2) 
summary(mod3) 
summary(mod4)

Result:

> summary(mod1) 

Call: 
lm(formula = y ~ x + presidents + strategies + presidents:strategies, 
    data = df1) 

Residuals: 
    Min  1Q Median  3Q  Max 
-17.3690 -6.1273 -0.1699 6.4295 17.4156 

Coefficients: 
           Estimate Std. Error t value Pr(>|t|)  
(Intercept)      14.4782  3.0799 4.701 5.15e-06 *** 
x         -0.1692  0.2141 -0.790 0.431  
presidentsGeorge.W    11.1984  8.8283 1.268 0.206  
presidentsJohn.C     4.1281  4.2305 0.976 0.330  
presidentsTom_C     4.9604  3.6271 1.368 0.173  
strategies2_B      1.6203  3.5736 0.453 0.651  
strategies2_D      -1.7246  3.6550 -0.472 0.638  
strategies3_A      1.7663  3.2966 0.536 0.593  
strategies3_C      -0.5787  3.8440 -0.151 0.881  
presidentsGeorge.W:strategies2_B -9.9934 10.0125 -0.998 0.320  
presidentsJohn.C:strategies2_B -1.5192  5.8696 -0.259 0.796  
presidentsTom_C:strategies2_B  -0.8962  5.0202 -0.179 0.859  
presidentsGeorge.W:strategies2_D -7.5266  9.7414 -0.773 0.441  
presidentsJohn.C:strategies2_D  1.7179  6.4375 0.267 0.790  
presidentsTom_C:strategies2_D  -1.1020  5.0551 -0.218 0.828  
presidentsGeorge.W:strategies3_A -11.9783  9.3115 -1.286 0.200  
presidentsJohn.C:strategies3_A -2.8849  5.0866 -0.567 0.571  
presidentsTom_C:strategies3_A  -5.0305  4.4068 -1.142 0.255  
presidentsGeorge.W:strategies3_C -6.5116  9.7387 -0.669 0.505  
presidentsJohn.C:strategies3_C -4.3792  6.0389 -0.725 0.469  
presidentsTom_C:strategies3_C  -1.3257  5.3821 -0.246 0.806  
--- 
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 8.364 on 179 degrees of freedom 
Multiple R-squared: 0.064, Adjusted R-squared: -0.04058 
F-statistic: 0.612 on 20 and 179 DF, p-value: 0.9007 

> summary(mod2) 

Call: 
lm(formula = y ~ x + presidents * strategies, data = df1) 

Residuals: 
    Min  1Q Median  3Q  Max 
-17.3690 -6.1273 -0.1699 6.4295 17.4156 

Coefficients: 
           Estimate Std. Error t value Pr(>|t|)  
(Intercept)      14.4782  3.0799 4.701 5.15e-06 *** 
x         -0.1692  0.2141 -0.790 0.431  
presidentsGeorge.W    11.1984  8.8283 1.268 0.206  
presidentsJohn.C     4.1281  4.2305 0.976 0.330  
presidentsTom_C     4.9604  3.6271 1.368 0.173  
strategies2_B      1.6203  3.5736 0.453 0.651  
strategies2_D      -1.7246  3.6550 -0.472 0.638  
strategies3_A      1.7663  3.2966 0.536 0.593  
strategies3_C      -0.5787  3.8440 -0.151 0.881  
presidentsGeorge.W:strategies2_B -9.9934 10.0125 -0.998 0.320  
presidentsJohn.C:strategies2_B -1.5192  5.8696 -0.259 0.796  
presidentsTom_C:strategies2_B  -0.8962  5.0202 -0.179 0.859  
presidentsGeorge.W:strategies2_D -7.5266  9.7414 -0.773 0.441  
presidentsJohn.C:strategies2_D  1.7179  6.4375 0.267 0.790  
presidentsTom_C:strategies2_D  -1.1020  5.0551 -0.218 0.828  
presidentsGeorge.W:strategies3_A -11.9783  9.3115 -1.286 0.200  
presidentsJohn.C:strategies3_A -2.8849  5.0866 -0.567 0.571  
presidentsTom_C:strategies3_A  -5.0305  4.4068 -1.142 0.255  
presidentsGeorge.W:strategies3_C -6.5116  9.7387 -0.669 0.505  
presidentsJohn.C:strategies3_C -4.3792  6.0389 -0.725 0.469  
presidentsTom_C:strategies3_C  -1.3257  5.3821 -0.246 0.806  
--- 
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 8.364 on 179 degrees of freedom 
Multiple R-squared: 0.064, Adjusted R-squared: -0.04058 
F-statistic: 0.612 on 20 and 179 DF, p-value: 0.9007 

> summary(mod3) 

Call: 
glm(formula = y ~ x + presidents + strategies + presidents:strategies, 
    data = df1) 

Deviance Residuals: 
    Min  1Q Median  3Q  Max 
-17.3690 -6.1273 -0.1699 6.4295 17.4156 

Coefficients: 
           Estimate Std. Error t value Pr(>|t|)  
(Intercept)      14.4782  3.0799 4.701 5.15e-06 *** 
x         -0.1692  0.2141 -0.790 0.431  
presidentsGeorge.W    11.1984  8.8283 1.268 0.206  
presidentsJohn.C     4.1281  4.2305 0.976 0.330  
presidentsTom_C     4.9604  3.6271 1.368 0.173  
strategies2_B      1.6203  3.5736 0.453 0.651  
strategies2_D      -1.7246  3.6550 -0.472 0.638  
strategies3_A      1.7663  3.2966 0.536 0.593  
strategies3_C      -0.5787  3.8440 -0.151 0.881  
presidentsGeorge.W:strategies2_B -9.9934 10.0125 -0.998 0.320  
presidentsJohn.C:strategies2_B -1.5192  5.8696 -0.259 0.796  
presidentsTom_C:strategies2_B  -0.8962  5.0202 -0.179 0.859  
presidentsGeorge.W:strategies2_D -7.5266  9.7414 -0.773 0.441  
presidentsJohn.C:strategies2_D  1.7179  6.4375 0.267 0.790  
presidentsTom_C:strategies2_D  -1.1020  5.0551 -0.218 0.828  
presidentsGeorge.W:strategies3_A -11.9783  9.3115 -1.286 0.200  
presidentsJohn.C:strategies3_A -2.8849  5.0866 -0.567 0.571  
presidentsTom_C:strategies3_A  -5.0305  4.4068 -1.142 0.255  
presidentsGeorge.W:strategies3_C -6.5116  9.7387 -0.669 0.505  
presidentsJohn.C:strategies3_C -4.3792  6.0389 -0.725 0.469  
presidentsTom_C:strategies3_C  -1.3257  5.3821 -0.246 0.806  
--- 
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for gaussian family taken to be 69.96038) 

    Null deviance: 13379 on 199 degrees of freedom 
Residual deviance: 12523 on 179 degrees of freedom 
AIC: 1439 

Number of Fisher Scoring iterations: 2 

> summary(mod4) 

Call: 
glm(formula = y ~ x + presidents * strategies, data = df1) 

Deviance Residuals: 
    Min  1Q Median  3Q  Max 
-17.3690 -6.1273 -0.1699 6.4295 17.4156 

Coefficients: 
           Estimate Std. Error t value Pr(>|t|)  
(Intercept)      14.4782  3.0799 4.701 5.15e-06 *** 
x         -0.1692  0.2141 -0.790 0.431  
presidentsGeorge.W    11.1984  8.8283 1.268 0.206  
presidentsJohn.C     4.1281  4.2305 0.976 0.330  
presidentsTom_C     4.9604  3.6271 1.368 0.173  
strategies2_B      1.6203  3.5736 0.453 0.651  
strategies2_D      -1.7246  3.6550 -0.472 0.638  
strategies3_A      1.7663  3.2966 0.536 0.593  
strategies3_C      -0.5787  3.8440 -0.151 0.881  
presidentsGeorge.W:strategies2_B -9.9934 10.0125 -0.998 0.320  
presidentsJohn.C:strategies2_B -1.5192  5.8696 -0.259 0.796  
presidentsTom_C:strategies2_B  -0.8962  5.0202 -0.179 0.859  
presidentsGeorge.W:strategies2_D -7.5266  9.7414 -0.773 0.441  
presidentsJohn.C:strategies2_D  1.7179  6.4375 0.267 0.790  
presidentsTom_C:strategies2_D  -1.1020  5.0551 -0.218 0.828  
presidentsGeorge.W:strategies3_A -11.9783  9.3115 -1.286 0.200  
presidentsJohn.C:strategies3_A -2.8849  5.0866 -0.567 0.571  
presidentsTom_C:strategies3_A  -5.0305  4.4068 -1.142 0.255  
presidentsGeorge.W:strategies3_C -6.5116  9.7387 -0.669 0.505  
presidentsJohn.C:strategies3_C -4.3792  6.0389 -0.725 0.469  
presidentsTom_C:strategies3_C  -1.3257  5.3821 -0.246 0.806  
--- 
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for gaussian family taken to be 69.96038) 

    Null deviance: 13379 on 199 degrees of freedom 
Residual deviance: 12523 on 179 degrees of freedom 
AIC: 1439 

Number of Fisher Scoring iterations: 2

As you can see, the estimates are exactly the same.

Data:
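Rather than comparing the printed summaries by eye, the equality can also be checked programmatically. This is a minimal sketch on simulated data of the same shape as the answer's df1; the names sim, fit_long, fit_short, and fit_glm are illustrative, and the random draws depend on the R version's sampling algorithm, so the numbers need not match the output above.

```r
# Simulated data of the same shape as the answer's df1
set.seed(123)
pres  <- c("Bill.C", "George.W", "Tom_C", "Tom_C", "John.C", "Bill.C")
strat <- c("3_A", "2_B", "3_C", "2_D", "3_A", "2_A")
sim <- data.frame(
  y = sample(1:30, 200, replace = TRUE),
  x = sample(1:10, 200, replace = TRUE),
  presidents = factor(sample(pres, 200, replace = TRUE)),
  strategies = factor(sample(strat, 200, replace = TRUE))
)

fit_long  <- lm(y ~ x + presidents + strategies + presidents:strategies, data = sim)
fit_short <- lm(y ~ x + presidents * strategies, data = sim)
fit_glm   <- glm(y ~ x + presidents * strategies, data = sim)  # default family: gaussian

# all.equal() compares names and values up to a small numerical tolerance
same_lm  <- all.equal(coef(fit_long), coef(fit_short))
same_glm <- all.equal(coef(fit_long), coef(fit_glm))
same_lm
same_glm
```

all.equal() is used instead of identical() because glm() reaches the least-squares solution iteratively, so tiny floating-point differences from lm() are possible.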

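# The six example rows from the question; read strings as factors
# (stringsAsFactors = TRUE was the default before R 4.0)
df = read.table(text = "y x presidents strategies 
       20 2 Bill.C  3_A 
       10 1 George.W 2_B 
       10 1 Tom_C  3_C 
       3 2 Tom_C  2_D 
       4 4 John.C  3_A 
       4 3 Bill.C  2_A", header = TRUE, stringsAsFactors = TRUE) 

# Simulate 200 rows by resampling the president and strategy labels
# (exact draws depend on the R version's sampling algorithm)
set.seed(123) 
df1 = data.frame(y = sample(1:30, 200, replace = TRUE), 
       x = sample(1:10, 200, replace = TRUE), 
       presidents = sample(df$presidents, 200, replace = TRUE), 
       strategies = sample(df$strategies, 200, replace = TRUE))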
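The answer notes that one level of each factor is left out as the reference category. By default that is the first level in sorted order; a sketch of choosing a different one with relevel() follows. The data frame toy and the fit names are illustrative assumptions, with made-up values.

```r
set.seed(1)  # arbitrary seed for the illustrative data
toy <- data.frame(
  y = rnorm(60),
  presidents = factor(rep(c("Bill.C", "George.W", "Tom_C"), times = 20)),
  strategies = factor(rep(c("2_A", "3_A"), each = 30))
)

# Default: the first level ("Bill.C") is the reference category
fit_default <- lm(y ~ presidents * strategies, data = toy)

# relevel() moves the chosen level to the front, making it the reference
toy$presidents <- relevel(toy$presidents, ref = "Tom_C")
fit_releveled <- lm(y ~ presidents * strategies, data = toy)

names(coef(fit_default))
names(coef(fit_releveled))

# Same model, different parameterization: the fitted values agree
all.equal(fitted(fit_default), fitted(fit_releveled))
```

Changing the reference level changes what each coefficient is compared against, but not the fit itself.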

Source

2017-10-17 16:49:22 useR

Thank you very much. – RandomWalker

@RandomWalker If this answer helped, consider accepting it by clicking the gray check mark below the downvote button. – useR

Sure. I have voted for it, if I'm not mistaken. :) – RandomWalker
