【R.統計】多重迴歸中,每個項的重要性(Partial R-square)
在 lm 或 glm 中,我們會用R-square代表模型「解釋」了多少的整體變異,例如 我們想知道四個變量對於每週減重成效的影響,利用 R 的模型可得到下面結果:
> test0 <- glm(Weightloss_Perc2_week2 ~ Time+Time*MET_normal + Time*Amp24 + Time*TotalEnrIn_normal + Time*PeakT24, weights = circ_ultradian, data = data)
> summary(test0)
Call:
glm(formula = Weightloss_Perc2_week2 ~ Time + Time * MET_normal +
Time * Amp24 + Time * TotalEnrIn_normal + Time * PeakT24,
data = data, weights = circ_ultradian)
Deviance Residuals:
Min 1Q Median 3Q Max
-8.823 -1.190 0.000 1.164 10.157
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.5568053 0.1747433 -26.077 < 2e-16 ***
Time 0.4569993 0.0157595 28.998 < 2e-16 ***
MET_normal -0.0142031 0.0066411 -2.139 0.03285 *
Amp24 -4.9617615 5.5785681 -0.889 0.37412
TotalEnrIn_normal -0.0047543 0.0197913 -0.240 0.81024
PeakT24 0.0135482 0.0407856 0.332 0.73987
Time:MET_normal 0.0028966 0.0006074 4.768 2.32e-06 ***
Time:Amp24 1.4545885 0.4624128 3.146 0.00174 **
Time:TotalEnrIn_normal -0.0053339 0.0019630 -2.717 0.00677 **
Time:PeakT24 0.0027033 0.0027874 0.970 0.33251
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 5.759965)
Null deviance: 10528.7 on 627 degrees of freedom
Residual deviance: 3559.7 on 618 degrees of freedom
(567 observations deleted due to missingness)
AIC: Inf
Number of Fisher Scoring iterations: 2
雖然我們知道 MET, Total energy, Peak time, Amplitude 對於模型都有顯著的影響,而且整個模型的解釋度是 但是各別的重要性是多少呢?
上網 Google 的結果大概有二個,一個是用 ppcor 這個 package 裡的 spcor() 這個套件,我試了一下,它需要懂一些條件機率的概念在已知 X1, X2 之下去推 X3的貢獻,太麻煩了。另一個是用 relaimpo 這個套件,由德國的統計學家 Prof. Dr. Ulrike Grömping 開發。
上述的 model 丟進 calc.relimp()這個函數後,得到的結果如下:
> calc.relimp(test0, type=c("lmg"),rela=TRUE)
Response variable: Weightloss_Perc2_week2
Total response variance: 16.79224
Analysis based on 673 observations
9 Regressors:
Time MET_normal Amp24 TotalEnrIn_normal PeakT24 Time:MET_normal Time:Amp24 Time:TotalEnrIn_normal Time:PeakT24
Proportion of variance explained by model: 66.19%
Metrics are normalized to sum to 100% (rela=TRUE).
Relative importance metrics:
lmg
Time 0.915321497
MET_normal 0.006743841
Amp24 0.005752103
TotalEnrIn_normal 0.024451951
PeakT24 0.010963340
Time:MET_normal 0.016396956
Time:Amp24 0.011228629
Time:TotalEnrIn_normal 0.003251115
Time:PeakT24 0.005890567
Average coefficients for different model sizes:
1X 2Xs 3Xs 4Xs 5Xs 6Xs 7Xs
Time 0.42775671 0.429422473 0.433774932 0.4374087502 0.4415789775 0.445187269 0.449273335
MET_normal -0.01082523 -0.005096398 -0.002092200 0.0003017041 -0.0008177312 -0.001402809 -0.004029325
Amp24 -1.37623328 0.189845689 0.760871915 2.5016207321 1.7187778915 1.347069469 0.199870297
TotalEnrIn_normal -0.08453656 -0.071530940 -0.055123358 -0.0406576254 -0.0355749464 -0.034863774 -0.029541116
PeakT24 0.06143983 0.079340115 0.075254809 0.0560990881 0.0363142134 0.022308638 0.009986747
Time:MET_normal NaN NaN 0.002581282 0.0024446650 0.0024815029 0.002461414 0.002568757
Time:Amp24 NaN NaN 1.920939379 1.7905122656 1.6700297367 1.536277161 1.456754683
Time:TotalEnrIn_normal NaN NaN -0.001786519 -0.0016394771 -0.0023086992 -0.002658085 -0.003441314
Time:PeakT24 NaN NaN 0.008438375 0.0084367348 0.0075782090 0.007056867 0.005963264
8Xs 9Xs
Time 0.4526609705 0.456999317
MET_normal -0.0055692785 -0.014203070
Amp24 -0.0959529453 -4.961761505
TotalEnrIn_normal -0.0252180584 -0.004754349
PeakT24 0.0004990347 0.013548212
Time:MET_normal 0.0026184787 0.002896551
Time:Amp24 1.3493823328 1.454588467
Time:TotalEnrIn_normal -0.0039749003 -0.005333920
Time:PeakT24 0.0053578184 0.002703350
這是一般的 glm 模型,但注意我們應該要把 subject 的 random effect 放進模型裡,但是 relaimpo 不吃 lmer 模型的東西,會出現 error
> test2 <- lmer(Weightloss_Perc2_week2 ~ Time+Time*MET_normal + Time*Amp24 + Time*TotalEnrIn_normal + Time*PeakT24+ (1|SubjID), weights = circ_ultradian, data = data)
> summary(test2)
> calc.relimp(test2, type=c("lmg"),rela=TRUE)
Error in calc.relimp.default.intern(object = new("lmerModLmerTest", vcov_varpar = c(0.00316776617198365, :
If x is NULL, then object must be a data frame or a matrix.
那麼,能不能把 subject 當成 fix effect 呢?結果還是不行
> test1 <- glm(Weightloss_Perc2_week2 ~ SubjID + Time+Time*MET_normal + Time*Amp24 + Time*TotalEnrIn_normal + Time*PeakT24, weights = circ_ultradian, data = data)
> summary(test1)
> calc.relimp(test1, type=c("lmg"),rela=TRUE)
Error in calc.relimp.default.intern(c(-7.06190879760306, -7.06190879760306, :
covg must be
a positive definite covariance matrix
or a data matrix / data frame with linearly independent columns.
後來就想到一個方法,先用 mixed model 把 subject 的 intercept 估出來以後,扣掉,再把剩下的東西去做 glm
# Remove subject intercept
data2 = data
subjpred = coef(test2)$SubjID # subject intercept from lmer model
for (i in 1:37){
subjidx = which(data2$SubjID==row.names(subjpred)[i])
data2$Weightloss_Perc2_week2[subjidx] = data$Weightloss_Perc2_week2[subjidx] - subjpred[i,1]
}
這個 for 迴圈是用來把每個 subject 減掉一個估出來的 intercept.
然後再做一次 glm,但subject就不放進 model 了。
> test3 <- glm(Weightloss_Perc2_week2 ~ Time+Time*MET_normal + Time*Amp24 + Time*TotalEnrIn_normal + Time*PeakT24, weights = circ_ultradian, data = data2)
> summary(test3)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.588e-14 1.240e-01 0.000 1.00000
Time 4.812e-01 1.119e-02 43.018 < 2e-16 ***
MET_normal -1.062e-02 4.714e-03 -2.253 0.02460 *
Amp24 -5.414e+00 3.960e+00 -1.367 0.17207
TotalEnrIn_normal -2.475e-02 1.405e-02 -1.761 0.07867 .
PeakT24 -1.343e-02 2.895e-02 -0.464 0.64281
Time:MET_normal 2.730e-03 4.312e-04 6.332 4.65e-10 ***
Time:Amp24 1.965e+00 3.282e-01 5.987 3.63e-09 ***
Time:TotalEnrIn_normal -3.001e-03 1.393e-03 -2.153 0.03167 *
Time:PeakT24 5.508e-03 1.979e-03 2.784 0.00554 **
可以比較一下, test3 是先把subject intercept去掉後做 glm,和直接用 subject intercept做 random effect的 test2, 結果是樣的
> summary(test2)
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) -4.646e+00 2.305e-01 1.372e+02 -20.160 < 2e-16 ***
Time 4.812e-01 1.257e-02 6.359e+02 38.279 < 2e-16 ***
MET_normal -1.062e-02 8.654e-03 1.396e+02 -1.227 0.2217
Amp24 -5.414e+00 7.509e+00 1.325e+02 -0.721 0.4722
TotalEnrIn_normal -2.475e-02 2.543e-02 1.457e+02 -0.973 0.3322
PeakT24 -1.343e-02 5.766e-02 1.221e+02 -0.233 0.8162
Time:MET_normal 2.730e-03 4.711e-04 6.273e+02 5.796 1.08e-08 ***
Time:Amp24 1.965e+00 3.868e-01 6.460e+02 5.080 4.94e-07 ***
Time:TotalEnrIn_normal -3.001e-03 1.653e-03 6.479e+02 -1.815 0.0700 .
Time:PeakT24 5.508e-03 2.205e-03 6.327e+02 2.498 0.0128 *
最後再把 test3 這個模型丟進 relaimpo 裡面算 partial R-square。在下面指令中,type是估算 R-sqaure 的方法,預設是 lmg ,說明文件是說:
lmg is the R2
contribution averaged over orderings among regressors, cf. e.g. Lindeman, Merenda
and Gold 1980, p.119ff or Chevan and Sutherland (1991)
記得另一個方法的 spcor() 要自己指定條件機率,這裡應該就是試過所有的條件機率,取平均(我不確定對不對,歡迎指正)。而 rela=TRUE 表示出來的是佔全體 R-square 的比例,加總起來=1。
> calc.relimp(test3, type=c("lmg"),rela=TRUE)
Response variable: Weightloss_Perc2_week2
Total response variance: 15.03069
Analysis based on 673 observations
9 Regressors:
Time MET_normal Amp24 TotalEnrIn_normal PeakT24 Time:MET_normal Time:Amp24 Time:TotalEnrIn_normal Time:PeakT24
Proportion of variance explained by model: 80.97%
Metrics are normalized to sum to 100% (rela=TRUE).
Relative importance metrics:
lmg
Time 0.8925406850
MET_normal 0.0074157245
Amp24 0.0094917949
TotalEnrIn_normal 0.0236040025
PeakT24 0.0164884432
Time:MET_normal 0.0147108370
Time:Amp24 0.0218828460
Time:TotalEnrIn_normal 0.0008287056
Time:PeakT24 0.0130369613
Average coefficients for different model sizes:
1X 2Xs 3Xs 4Xs 5Xs 6Xs 7Xs
Time 0.440694983 0.443403347 0.4496149070 0.4548941249 0.4607458236 0.4658320951 0.471331282
MET_normal -0.009862837 -0.003761972 -0.0005885058 0.0018287036 0.0008132627 0.0005022829 -0.001776877
Amp24 1.990300612 3.445463124 2.9448299689 3.7359763850 2.0284921362 1.4099965410 -0.114745414
TotalEnrIn_normal -0.085282449 -0.072882050 -0.0592355184 -0.0470806642 -0.0447264708 -0.0458199725 -0.042986433
PeakT24 0.087608872 0.103903249 0.0907189923 0.0610918832 0.0318606109 0.0117704376 -0.006748718
Time:MET_normal NaN NaN 0.0027769227 0.0025953887 0.0025742539 0.0024967131 0.002544580
Time:Amp24 NaN NaN 2.7685416116 2.6227088515 2.4541552858 2.2811335982 2.140242933
Time:TotalEnrIn_normal NaN NaN 0.0003984801 0.0006150023 -0.0000832632 -0.0003991128 -0.001181843
Time:PeakT24 NaN NaN 0.0127067481 0.0127505036 0.0115795887 0.0108609312 0.009403546
8Xs 9Xs
Time 0.475931905 0.481246518
MET_normal -0.002838234 -0.010621835
Amp24 -0.367323484 -5.413933881
TotalEnrIn_normal -0.040713890 -0.024745591
PeakT24 -0.020914736 -0.013433636
Time:MET_normal 0.002536895 0.002730404
Time:Amp24 1.978185277 1.965080499
Time:TotalEnrIn_normal -0.001674073 -0.003000777
Time:PeakT24 0.008569905 0.005508089
解釋 relaimpo 的教學網頁:
Relaimpo的官方文件
Prof. Dr. Ulrike Grömping 的網頁
留言
張貼留言