class: center, middle, inverse, title-slide .title[ # Análisis de Datos Categóricos (SOC3070) ] .author[ ###
Mauricio Bucca
Profesor Asistente, Sociología UC ] --- class: inverse, center, middle # Análisis de datos categóricos ## visión panorámica --- ## Tipos de datos categóricos Distintas miradas a los datos de infidelidad: .center[ <!-- --> ] --- class: inverse, center, middle ## Datos dicotómicos ### Linear Probability Model y Regresión Logística --- ## Linear Probability Model ``` r lpm_1 <- lm(affairs_binary ~ ym, data=affairsdata) lpm_2 <- lm(affairs_binary ~ ym + rate, data=affairsdata) ``` <br> -- .pull-left[ <table style="border-collapse:collapse; border:none;"> <tr> <th style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; text-align:left; "> </th> <th colspan="1" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">affairs binary</th> <th colspan="1" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">affairs binary</th> </tr> <tr> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; text-align:left; ">Predictors</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Estimates</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Estimates</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">(Intercept)</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.16 <sup>***</sup></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.56 <sup>***</sup></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">ym</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.01 <sup>***</sup></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.01 <sup>*</sup></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">rate</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.09 <sup>***</sup></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm; border-top:1px solid;">Observations</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="1">601</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="1">601</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">R<sup>2</sup> / R<sup>2</sup> adjusted</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="1">0.020 / 0.018</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="1">0.071 / 0.068</td> </tr> <tr> <td colspan="3" style="font-style:italic; border-top:double black; text-align:right;">* p<0.05 ** p<0.01 *** p<0.001</td> </tr> </table> ] -- .pull-right[ - En un modelo lineal estándar al agregar una variable: - cambio en coeficientes reflejan correlación entre predictores (ej. "confounding") - reducción de varianza residual ] --- ## Regresión Logística ``` r logistic_1 <- glm(affairs_binary ~ ym, family="binomial", data=affairsdata) logistic_2 <- glm(affairs_binary ~ ym + rate, family="binomial", data=affairsdata) ``` <br> -- .pull-left[ <table style="border-collapse:collapse; border:none;"> <tr> <th style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; text-align:left; "> </th> <th colspan="1" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">affairs binary</th> <th colspan="1" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">affairs binary</th> </tr> <tr> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; text-align:left; ">Predictors</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Log-Odds</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Log-Odds</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">(Intercept)</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-1.61 <sup>***</sup></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.35 <sup></sup></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">ym</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.06 <sup>***</sup></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.04 <sup>*</sup></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">rate</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.47 <sup>***</sup></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm; border-top:1px solid;">Observations</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="1">601</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="1">601</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">R<sup>2</sup> Tjur</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="1">0.019</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="1">0.071</td> </tr> <tr> <td colspan="3" style="font-style:italic; border-top:double black; text-align:right;">* p<0.05 ** p<0.01 *** p<0.001</td> </tr> </table> ] -- .pull-right[ En un modelo de regresión logística .bold[NO ES POSIBLE] sacar conclusiones sustantivas de cambios en coeficientes en modelos anidados! ] <br> -- ¿Por que? --- ## Regresión Logística: varianza no identificables <br> Existe una formulación alternariva de la regresión logística en términos de variable latente, donde la variable dependiente `\(y\)` es una .bold[manifestación dicotómica] de una .bold[variable latente] (inobservada) continua, `\(y^{*}\)`. .pull-left[ `$$\begin{align} y_{i} = \begin{cases} 1 \quad \text{si} \quad y^{*}_{i} > 0 \\ 0 \quad \text{si} \quad y^{*}_{i} < 0 \end{cases} \end{align}$$` ] .pull-right[ `\(y^{*}_{i} = \beta_{0} + \beta_{1}x_{1i} + \dots + \beta_{k}x_{ki} + \epsilon_{i}\)` ] <br> - donde `\(e_{i}\)` sigue una distribución logística. -- - En este modelo no es posible estimar los coeficientes y la varianza conjuntamente. Es necesario fijar uno para estima el otro. -- - Para estimar los coeficientes del modelo es necesario fijar el valor de la varianza. -- - Usualmente: `\(\epsilon_{i} \sim \text{logistic}(\mu=0,\sigma=\pi/\sqrt{3})\)` --- ## Regresión Logística: efectos no comparables entre modelos anidados En consequencia .... <br> -- | | Varianza variable dependiente | Varianza residual | |---------------------|:---------------------------------:|:-----------------------------------------:| | Regresión lineal | `\(\sigma^{2}_{y}\)` es observado | `\(\sigma^{2}_{\epsilon}\)` dependende del modelo | | Regresión logística | `\(\sigma^{2}_{y*}\)` depende del modelo | `\(\sigma^{2}_{\epsilon}\)` es fijo | <br> -- .bold[Efecto de agregar un predictor adicional:] | | Varianza variable dependiente | Varianza residual | |---------------------|:---------------------------------:|:-----------------------------------------:| | Regresión lineal | `\(\sigma^{2}_{y}\)` no cambia | `\(\sigma^{2}_{\epsilon}\)` disminuye | | Regresión logística | `\(\sigma^{2}_{y*}\)` aumenta | `\(\sigma^{2}_{\epsilon}\)` no cambia | <br> -- Esencialmente, `\(y^{*}\)` es .bold[re-escalado], y por tanto también los coeficientes son re-escalados. --- ## Regresión Logística: efectos no comparables entre modelos anidados .pull-left[ <table style="border-collapse:collapse; border:none;"> <tr> <th style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; text-align:left; "> </th> <th colspan="1" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">affairs binary</th> <th colspan="1" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">affairs binary</th> </tr> <tr> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; text-align:left; ">Predictors</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Log-Odds</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Log-Odds</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">(Intercept)</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-1.61 <sup>***</sup></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.35 <sup></sup></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">ym</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.06 <sup>***</sup></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.04 <sup>*</sup></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">rate</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.47 <sup>***</sup></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm; border-top:1px solid;">Observations</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="1">601</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="1">601</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">R<sup>2</sup> Tjur</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="1">0.019</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="1">0.071</td> </tr> <tr> <td colspan="3" style="font-style:italic; border-top:double black; text-align:right;">* p<0.05 ** p<0.01 *** p<0.001</td> </tr> </table> ] .pull-right[ En un modelo de regresión logística cambios en los coeficientes al agregar una variable pueden deberse a: - correlación entre predictores (ej. "confounding") - re-escalamiento de la variable dependiente - .bold[ambos!] ] <br> -- - Este problema es usualmente desconocido o ignorado. -- - Cientos de artículos científicos extren interpretaciones erróneas de cambio en coeficientes a través de modelos. -- - Seriedad del problema varía caso a caso --- ## Regresión Logística: efectos no comparables entre modelos anidados -- Una posible solución: .bold[KHB method] -- .pull-left[  - Paquete disponible en `Stata` y `R` ] .pull-right[  ] --- class: inverse, center, middle ## Datos politómicos ### Regresión Logística Multinomial y Ordenada --- ## Regresión Logística Multinomial ``` r mlogistic_1 <- multinom(affairs_order ~ ym, trace=F, data=affairsdata) mlogistic_2 <- multinom(affairs_order ~ ym + rate, trace=F, data=affairsdata) ``` .pull-left[ <table style="border-collapse:collapse; border:none;"> <tr> <th style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; text-align:left; "> </th> <th colspan="2" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">affairs order</th> </tr> <tr> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; text-align:left; ">Predictors</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Log-Odds</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Response</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">(Intercept)</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-2.01 <sup>***</sup></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">ocasional</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">ym</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.02 <sup></sup></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">ocasional</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">(Intercept)</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-2.62 <sup>***</sup></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">compulsivo</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">ym</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.10 <sup>***</sup></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">compulsivo</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm; border-top:1px solid;">Observations</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="2">601</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">R<sup>2</sup> / R<sup>2</sup> adjusted</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="2">0.021 / 0.019</td> </tr> <tr> <td colspan="3" style="font-style:italic; border-top:double black; text-align:right;">* p<0.05 ** p<0.01 *** p<0.001</td> </tr> </table> ] .pull-right[ <table style="border-collapse:collapse; border:none;"> <tr> <th style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; text-align:left; "> </th> <th colspan="2" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">affairs order</th> </tr> <tr> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; text-align:left; ">Predictors</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Log-Odds</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Response</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">(Intercept)</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.54 <sup></sup></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">ocasional</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">ym</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.00 <sup></sup></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">ocasional</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">rate</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.35 <sup>**</sup></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">ocasional</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">(Intercept)</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.27 <sup></sup></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">compulsivo</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">ym</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.07 <sup>**</sup></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">compulsivo</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">rate</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.57 <sup>***</sup></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">compulsivo</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm; border-top:1px solid;">Observations</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="2">601</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">R<sup>2</sup> / R<sup>2</sup> adjusted</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="2">0.058 / 0.055</td> </tr> <tr> <td colspan="3" style="font-style:italic; border-top:double black; text-align:right;">* p<0.05 ** p<0.01 *** p<0.001</td> </tr> </table> ] --- ## Regresión Logística Ordenada ``` r ologistic_1 <- polr(affairs_order ~ ym, data=affairsdata) ologistic_2 <- polr(affairs_order ~ ym + rate, data=affairsdata) ``` <br> .pull-left[ <table style="border-collapse:collapse; border:none;"> <tr> <th style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; text-align:left; "> </th> <th colspan="1" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">affairs order</th> <th colspan="1" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">affairs order</th> </tr> <tr> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; text-align:left; ">Predictors</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Log-Odds</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Log-Odds</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">fiel|ocasional</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">1.65 <sup>***</sup></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.35 <sup></sup></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">ocasional|compulsivo</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">2.43 <sup>***</sup></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.48 <sup></sup></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">ym</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.06 <sup>***</sup></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.04 <sup>*</sup></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">rate</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.48 <sup>***</sup></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm; border-top:1px solid;">Observations</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="1">601</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="1">601</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">R<sup>2</sup> Nagelkerke</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="1">0.030</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="1">0.098</td> </tr> <tr> <td colspan="3" style="font-style:italic; border-top:double black; text-align:right;">* p<0.05 ** p<0.01 *** p<0.001</td> </tr> </table> ] -- .pull-right[ - Menos parámetros pero más supuestos. - "Parallel regression assumption" (o proportional odds): predictores tienen los mismos efectos para cualquier "transición" entre una categoría y otra. ] --- ## Regresión Logística Ordenada Podemos testear este supuesto usando el "Brant test". ``` r library("brant") ``` .pull-left[ ``` r brant(ologistic_1) ``` ``` ## -------------------------------------------- ## Test for X2 df probability ## -------------------------------------------- ## Omnibus 3.99 1 0.05 ## ym 3.99 1 0.05 ## -------------------------------------------- ## ## H0: Parallel Regression Assumption holds ``` ] .pull-right[ ``` r brant(ologistic_2) ``` ``` ## -------------------------------------------- ## Test for X2 df probability ## -------------------------------------------- ## Omnibus 3.89 2 0.14 ## ym 3.16 1 0.08 ## rate 0.3 1 0.58 ## -------------------------------------------- ## ## H0: Parallel Regression Assumption holds ``` ] -- - Evalua si los coeficientes son iguales al estimarlos en regresiones separadas. - p-values grandes indican que la "parallel regression assumption" se cumple. --- class: inverse, center, middle ## Datos de recuento ### Regresión Poisson y Quasi-Poisson --- ## Regresión Poisson ``` r poisson_1 <- glm(affairs_count ~ ym, family=poisson(link="log"), data=affairsdata) poisson_2 <- glm(affairs_count ~ ym + rate, family=poisson(link="log"), data=affairsdata) ``` <br> <table style="border-collapse:collapse; border:none;"> <tr> <th style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; text-align:left; "> </th> <th colspan="1" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">affairs count</th> <th colspan="1" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">affairs count</th> </tr> <tr> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; text-align:left; ">Predictors</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Log-Mean</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Log-Mean</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">(Intercept)</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.36 <sup>***</sup></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">1.37 <sup>***</sup></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">ym</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.08 <sup>***</sup></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.06 <sup>***</sup></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">rate</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.42 <sup>***</sup></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm; border-top:1px solid;">Observations</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="1">601</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="1">601</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">R<sup>2</sup> Nagelkerke</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="1">0.234</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="1">0.474</td> </tr> <tr> <td colspan="3" style="font-style:italic; border-top:double black; text-align:right;">* p<0.05 ** p<0.01 *** p<0.001</td> </tr> </table> --- ## Regressión Quasi-Poisson La regressión quasi-Poisson incorpora explícitamente el factor de dispersión y corrige la inferencia del modelo. -- `$$y_{i} \sim \text{quasi-Poisson}(\mu_{i} = e^{X_{i}\beta}, \sigma= \sqrt{\omega \cdot e^{X_{i}\beta}})$$` <br> donde `\(\omega\)` es el factor de dispersion. -- - Regressión Poisson es un caso especial de quasi-Poisson ( `\(\omega=1\)` ). <br> -- Implementación en `R`: ``` r qpoisson_2 <- glm(affairs_count ~ ym + rate, family=quasipoisson(link="log"), data=affairsdata) ``` ``` ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.36592517 0.34532278 3.955503 8.553030e-05 ## ym 0.05575781 0.01757849 3.171935 1.591856e-03 ## rate -0.42036507 0.07268757 -5.783177 1.181990e-08 ``` ``` ## dispersion = 6.977591 ``` --- ## Quasi-Poisson ``` r qpoisson_2 <- glm(affairs_count ~ ym + rate, family=quasipoisson(link="log"), data=affairsdata) ``` <br> <table style="border-collapse:collapse; border:none;"> <tr> <th style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; text-align:left; "> </th> <th colspan="1" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">affairs count</th> <th colspan="1" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">affairs count</th> </tr> <tr> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; text-align:left; ">Predictors</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Log-Mean</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Log-Mean</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">(Intercept)</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">1.37 <sup>***</sup></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">1.37 <sup>***</sup></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">ym</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.06 <sup>***</sup></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.06 <sup>**</sup></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">rate</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.42 <sup>***</sup></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.42 <sup>***</sup></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm; border-top:1px solid;">Observations</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="1">601</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="1">601</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">R<sup>2</sup> Nagelkerke</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="1">0.474</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="1">0.474</td> </tr> <tr> <td colspan="3" style="font-style:italic; border-top:double black; text-align:right;">* p<0.05 ** p<0.01 *** p<0.001</td> </tr> </table> --- class: inverse, center, middle ## Inferencia --- ## Inferencia para los coeficientes - Los coeficientes de los modelos cubiertos son estimados via MLE. -- - `\(\hat{\beta} \sim \mathcal{N}(\beta, \frac{\sigma_{\beta}}{\sqrt{n}})\)`, donde `\(\frac{\sigma_{\beta}}{\sqrt{n}}\)` es el "standard error" (SE) de `\(\beta\)`. -- - Es posible construir un intervalo de confianza para los coeficientes. Por ejemplo, un IC para `\(\beta\)` al 95% de confianza está dado por: `$$95\% \text{ CI}_{\hat{\beta}} = \hat{\beta} \pm 1.96 \cdot \text{SE}$$` -- Por ejemplo, en el modelo Poisson más complejo: .pull-left[ ``` ## Estimate Std. Error ## (Intercept) 1.36592517 0.130729157 ## ym 0.05575781 0.006654703 ## rate -0.42036507 0.027517398 ``` ] .pull-right[ ``` r ic = -0.42 + c(-1.96,1.96)*0.028 cat("95% IC rate: (",ic[1],",",ic[2],")") ``` ``` ## 95% IC rate: ( -0.47488 , -0.36512 ) ``` ] --- ## Inferencia para funciones de los coeficientes de un modelo En muchas situaciones nos interesa realizar inferencia para cantidad de interés derivadas de un modelo. -- .bold[Ejemplo]: predicciones sobre "probabilidad de ser infiel" arrojadas por diferentes modelos .pull-left[ ``` r grid <- affairsdata %>% data_grid(ym=median(ym),rate) grid <- grid %>% mutate(lpm = predict(lpm_2, newdata=grid), logistic = predict(logistic_2, newdata=grid, type="response"), mlogistic = 1- predict(mlogistic_2,newdata=grid, type="probs")[,1], ologistic = 1- predict(ologistic_2,newdata=grid, type="probs")[,1] ) ``` ] .pull-right[ <!-- --> ] --- ## Inferencia para funciones de los coeficientes de un modelo: Bootstrap Cuando cantidades de interés son funciones complejas de los coeficientes los métodos de simulación y re-sampling pueden ser de gran utilidad. -- ``` r bs <- function(x) { grid <- affairsdata %>% data_grid(ym=median(ym),rate) #data_b <- sample_n(affairsdata,size=nrow(affairsdata),replace=TRUE) lpm_2_b <- lm(affairs_binary ~ ym + rate, data=x) logistic_2_b <- glm(affairs_binary ~ ym + rate, family="binomial", data=x) mlogistic_2_b <- multinom(affairs_order ~ ym + rate, trace=F, data=x) ologistic_2_b <- polr(affairs_order ~ ym + rate, data=x) grid <- grid %>% mutate(lpm = predict(lpm_2_b, newdata=grid), logistic = predict(logistic_2_b, newdata=grid, type="response"), mlogistic = 1- predict(mlogistic_2_b,newdata=grid, type="probs")[,1], ologistic = 1- predict(ologistic_2_b,newdata=grid, type="probs")[,1] ) } ``` --- ## Inferencia para funciones de los coeficientes de un modelo: Bootstrap ``` r # aplica función bootstrap grid <- affairsdata %>% bootstrap(500) %>% mutate(pred = map(strap, ~ bs(.x))) %>% dplyr::select(.id,pred) %>% unnest() ``` <!-- --> --- class: inverse, center, middle ## Comparación entre modelos ### Bondad de ajuste y predicción --- ## Bondad de ajuste Medidas de bondad de ajusto para modelos no anidados y con penalización por cantidad de parámetros: .pull-left[ ``` r AIC(lpm_1, lpm_2, logistic_1,logistic_2, mlogistic_1,mlogistic_2, ologistic_1,ologistic_2, poisson_1, poisson_2) ``` ``` ## df AIC ## lpm_1 3 692.8884 ## lpm_2 4 662.5836 ## logistic_1 2 667.5002 ## logistic_2 3 639.8786 ## mlogistic_1 4 872.0740 ## mlogistic_2 6 843.8824 ## ologistic_1 3 874.3844 ## ologistic_2 4 843.6135 ## poisson_1 2 3264.8188 ## poisson_2 3 3043.9881 ``` ] .pull-right[ ``` r BIC(lpm_1, lpm_2, logistic_1,logistic_2, mlogistic_1,mlogistic_2, ologistic_1,ologistic_2, poisson_1, poisson_2) ``` ``` ## df BIC ## lpm_1 3 706.0841 ## lpm_2 4 680.1780 ## logistic_1 2 676.2974 ## logistic_2 3 653.0744 ## mlogistic_1 4 889.6684 ## mlogistic_2 6 870.2740 ## ologistic_1 3 887.5802 ## ologistic_2 4 861.2079 ## poisson_1 2 3273.6160 ## poisson_2 3 3057.1839 ``` ] --- ## Capacidad predictiva "out-of-sample" -- - Principio básico: estimar el modelo en un dataset (training set) y evalúarlo en un dataset distinto (test set) -- - .bold[K-fold Cross-validation:] .img-right-top[  ] <br> <br> <br> -- ``` r library("caret") # especifica tipo de cross-validation ctrl <- trainControl(method = "repeatedcv", number = 10, savePredictions = TRUE) cv_mlogistic2 <- train(affairs_order ~ ym + rate, data=affairsdata, method=c("multinom"), trControl = ctrl) ``` ``` ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 373.809255 ## iter 20 value 373.260025 ## final value 373.260020 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 373.848519 ## iter 20 value 373.327081 ## final value 373.327048 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 373.809284 ## iter 20 value 373.260093 ## final value 373.260088 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 376.050437 ## iter 20 value 375.592643 ## final value 375.592632 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 376.072097 ## iter 20 value 375.663208 ## final value 375.663184 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 376.050460 ## iter 20 value 375.592715 ## final value 375.592704 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 374.311516 ## iter 20 value 373.812769 ## iter 20 value 373.812765 ## iter 20 value 373.812764 ## final value 373.812764 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 374.406382 ## iter 20 value 373.870603 ## iter 20 value 373.870603 ## iter 20 value 373.870603 ## final value 373.870603 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 374.311394 ## iter 20 value 373.812827 ## iter 20 value 373.812823 ## iter 20 value 373.812822 ## final value 373.812822 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 372.508433 ## iter 20 value 371.306366 ## final value 371.306153 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 372.066242 ## iter 20 value 371.384491 ## final value 371.384460 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 372.509011 ## iter 20 value 371.306446 ## final value 371.306233 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 377.385078 ## final value 376.955325 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 377.478601 ## iter 20 value 377.075754 ## final value 377.075726 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 377.385168 ## final value 376.955450 ## converged ## # weights: 12 (6 variable) ## initial value 593.250636 ## iter 10 value 378.865673 ## final value 375.292083 ## converged ## # weights: 12 (6 variable) ## initial value 593.250636 ## iter 10 value 378.674891 ## iter 20 value 375.403685 ## final value 375.403663 ## converged ## # weights: 12 (6 variable) ## initial value 593.250636 ## iter 10 value 378.865587 ## final value 375.292199 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 373.190686 ## final value 372.689438 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 373.425412 ## final value 372.804534 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 373.190898 ## final value 372.689557 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 376.732973 ## final value 376.304225 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 376.794790 ## iter 20 value 376.394531 ## final value 376.394365 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 376.733030 ## final value 376.304318 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 373.727461 ## iter 20 value 373.196335 ## final value 373.196304 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 373.728043 ## final value 373.263120 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 373.727470 ## iter 20 value 373.196403 ## final value 373.196372 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 373.886537 ## iter 20 value 373.600259 ## final value 373.600252 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 373.914101 ## final value 373.671972 ## converged ## # weights: 12 (6 variable) ## initial value 594.349248 ## iter 10 value 373.886570 ## iter 20 value 373.600331 ## final value 373.600325 ## converged ## # weights: 12 (6 variable) ## initial value 660.265985 ## iter 10 value 417.057097 ## iter 20 value 416.021385 ## final value 416.021290 ## converged ``` ``` r cv_ologistic2 <- train(affairs_order ~ ym + rate, data=affairsdata, method=c("polr"), trControl = ctrl) ``` --- ## Capacidad predictiva "out-of-sample" .pull-left[ ``` r confusionMatrix(cv_mlogistic2) ``` ``` ## Cross-Validated (10 fold, repeated 1 times) Confusion Matrix ## ## (entries are percentual average cell counts across resamples) ## ## Reference ## Prediction fiel ocasional compulsivo ## fiel 74.0 11.3 12.6 ## ocasional 0.0 0.0 0.0 ## compulsivo 1.0 0.3 0.7 ## ## Accuracy (average) : 0.7471 ``` ] -- .pull-right[ ``` r confusionMatrix(cv_ologistic2) ``` ``` ## Cross-Validated (10 fold, repeated 1 times) Confusion Matrix ## ## (entries are percentual average cell counts across resamples) ## ## Reference ## Prediction fiel ocasional compulsivo ## fiel 74.0 11.3 12.6 ## ocasional 0.0 0.0 0.0 ## compulsivo 1.0 0.3 0.7 ## ## Accuracy (average) : 0.7471 ``` ] --- ## En el tintero .. .pull-left[  ] -- .pull-right[ - Probit - Choice models: e.g., conditional logit - Zero-inflated Poisson - Log-linear models para tablas de contingencia - etc ] --- class: inverse, center, middle .huge[ **FIN. Gracias!** ] <br> Mauricio Bucca <br> https://mebucca.github.io/ <br> github.com/mebucca