8.1 Introduction to MLR with Categorical Variables¶
- MLR Review: A model that predicts a dependent variable y from several independent variables (x1, x2, ..., xk).
- Extension: In this module, categorical (qualitative) variables are introduced as independent variables alongside quantitative variables.
- Example: Flat prices might depend not only on area (quantitative) but also on whether the location is premium or not (categorical).
Why Include Categorical Variables?¶
- Improved Predictions: Categorical variables often provide crucial context not captured by quantitative data.
- Example: In Mysuru flat price data:
- Quantitative variables: Area, bathrooms, bedrooms.
- Missing categorical variables: Amenities, apartment type, premium location.
Handling Categorical Variables¶
- Key Concept: Categorical variables need to be converted into numerical form (dummy variables) for regression.
- Dummy Variables: Binary variables (0 or 1) representing categories.
- Example: Brand variable in Basavaraja's dataset: Lenovo = 1, Dell = 0.
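A minimal sketch of dummy coding in Python, assuming the brand column holds the strings "Lenovo" and "Dell" and using the Lenovo = 1, Dell = 0 coding from these notes. The function name `to_dummy` and the sample labels are illustrative and are not taken from Basavaraja's dataset.

```python
def to_dummy(values, one_category):
    """Return 1 where a value equals one_category, else 0."""
    return [1 if v == one_category else 0 for v in values]

# Illustrative brand labels (not the actual survey data)
brands = ["Lenovo", "Dell", "Dell", "Lenovo"]
brand_dummy = to_dummy(brands, one_category="Lenovo")
print(brand_dummy)  # [1, 0, 0, 1]
```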
Steps in MLR with Categorical Variables¶
- Model Formulation:
- Basic regression equation: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
- x1: Price (quantitative).
- x2: Brand (categorical, represented as dummy variable).
- Data Analysis:
- Include both x1 and x2 in the regression model.
- Use tools like Excel's Analysis ToolPak to obtain the coefficients, R², standard error, and ANOVA table.
- Example Calculation:
- Basavaraja's dataset: Satisfaction score depends on:
- Price (x1): Quantitative.
- Brand (x2): Lenovo (1) or Dell (0).
- Regression Equation: $\hat{y} = 46.83 + 1.29 x_1 + 3.93 x_2$
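The notes use Excel's Analysis ToolPak for the fit. As a rough Python equivalent, the sketch below fits the same kind of model with NumPy's least-squares solver. The data are synthetic and noise-free, generated from the coefficients quoted in these notes (46.83, 1.29, 3.93), so the fit recovers them exactly and R² comes out as 1; the real survey data would give a lower R². All variable names are assumptions made for illustration.

```python
import numpy as np

# Synthetic, noise-free data built from the coefficients quoted in the notes.
# These are NOT the survey observations.
price = np.array([5.0, 6.5, 8.0, 9.5, 11.0, 12.5])  # x1, quantitative
brand = np.array([1, 0, 1, 0, 0, 1])                # x2, Lenovo = 1, Dell = 0
score = 46.83 + 1.29 * price + 3.93 * brand         # y

# Design matrix: a column of ones for the intercept, then x1 and x2
X = np.column_stack([np.ones_like(price), price, brand])

# Ordinary least squares estimate of (b0, b1, b2)
coef, *_ = np.linalg.lstsq(X, score, rcond=None)
b0, b1, b2 = coef

# R^2 = 1 - SSE / SST
fitted = X @ coef
sse = np.sum((score - fitted) ** 2)
sst = np.sum((score - score.mean()) ** 2)
r_squared = 1 - sse / sst

print(round(float(b0), 2), round(float(b1), 2), round(float(b2), 2))
```

With real, noisy data the same steps apply; only the inputs change.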
Results and Interpretations¶
- Performance Improvement:
- R² improved from 0.31 (price only) to 0.45 (price + brand).
- Standard error reduced from 5.19 to 4.85.
- Significance Tests:
- F-test (overall significance):
- Null hypothesis: No linear relationship.
- F-statistic: 13.75 (p-value close to 0). Reject null hypothesis, indicating the model is significant.
- T-tests (individual significance):
- Price significantly influences satisfaction score.
- Brand also significantly impacts satisfaction.
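For reference, the overall F statistic can be computed from R² with the standard formula F = (R²/k) / ((1 − R²)/(n − k − 1)), where k is the number of predictors and n the number of observations. The sketch below applies it; the sample size is not stated in these notes, so `n=40` is a hypothetical value and the result is not expected to match the reported 13.75.

```python
def f_statistic(r_squared, n, k):
    """Overall F statistic for a regression with k predictors and n observations."""
    explained = r_squared / k                     # explained variation per predictor
    unexplained = (1 - r_squared) / (n - k - 1)   # unexplained variation per residual df
    return explained / unexplained

# n = 40 is a hypothetical sample size, used only to show the arithmetic
f_value = f_statistic(r_squared=0.45, n=40, k=2)
print(round(f_value, 2))
```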
Key Takeaways¶
- Adding categorical variables (e.g., brand) enhances the model's explanatory power and predictive accuracy.
- Interpretation of coefficients:
- β1: Satisfaction score increases by 1.29 units for every 1-lakh increase in price, holding brand fixed.
- β2: At the same price, Lenovo users score 3.93 units higher than Dell users, on average.
Simplified Explanation¶
- What is Multiple Linear Regression (MLR)?
- It's like finding a formula to predict something (e.g., satisfaction score) using several factors (e.g., price, brand).
- What’s New?
- Previously, we only used numbers (e.g., price).
- Now, we also include labels like brand (Lenovo, Dell) by turning them into numbers (dummy variables: 1 or 0).
- Why Do This?
- Labels (categories) can also influence predictions. Including them improves accuracy.
- How Do We Use It?
- Example:
- Satisfaction = 46.83 + (1.29 × price) + (3.93 × brand).
- If it’s Lenovo, brand = 1. If Dell, brand = 0.
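The formula above can be turned into a small prediction helper. This is a sketch using the coefficients quoted in the notes; the function name and the example price of 10 are illustrative assumptions.

```python
B0, B1, B2 = 46.83, 1.29, 3.93       # coefficients quoted in the notes
BRAND_DUMMY = {"Lenovo": 1, "Dell": 0}

def predict_satisfaction(price, brand):
    """Predicted satisfaction score for a given price and brand."""
    return B0 + B1 * price + B2 * BRAND_DUMMY[brand]

# price = 10 is an arbitrary illustrative value, in the notes' price units
lenovo = predict_satisfaction(10, "Lenovo")
dell = predict_satisfaction(10, "Dell")
print(round(lenovo, 2), round(dell, 2), round(lenovo - dell, 2))  # 63.66 59.73 3.93
```

At any fixed price the gap between the two predictions equals the brand coefficient, 3.93.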
8.2 Understanding the Interpretation of Coefficients¶
- Regression Equation: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$, where:
- β0: Intercept.
- β1: Coefficient for x1 (quantitative variable).
- β2: Coefficient for x2 (categorical dummy variable).
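- Interpreting Coefficients with a Dummy Variable:
- When x2 = 0: $y = \beta_0 + \beta_1 x_1$.
- When x2 = 1: $y = (\beta_0 + \beta_2) + \beta_1 x_1$.
- Impact of β2: It shifts the intercept by β2 and leaves the slope on x1 unchanged, so the two categories follow parallel lines.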
- Example with Basavaraja's Dataset:
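- Regression Equation: $\hat{y} = 46.83 + 1.29 x_1 + 3.93 x_2$
- x1: Price of the computer.
- x2 = 0 (Dell), x2 = 1 (Lenovo).
- Predictions:
- Dell (x2 = 0): $\hat{y} = 46.83 + 1.29 x_1$
- Lenovo (x2 = 1): $\hat{y} = 50.76 + 1.29 x_1$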
- Interpretation: Lenovo scores 3.93 points higher on average than Dell.
Residual Analysis¶
- Residuals vs. x1 (price): Appear evenly distributed around the x-axis.
- Residuals vs. x2 (dummy variable): Clustered at the two dummy values (0 or 1).
- No major outliers: Only a few standardized residuals fall outside ±2.
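The ±2 check can be sketched as follows, assuming a simple standardization in which each residual is divided by the sample standard deviation of all residuals (statistical packages may use a slightly different denominator). The residual values are hypothetical.

```python
from statistics import stdev

def standardized_residuals(residuals):
    """Divide each residual by the sample standard deviation of the residuals."""
    s = stdev(residuals)
    return [r / s for r in residuals]

# Hypothetical residuals (observed minus predicted), for illustration only
residuals = [1.0, -2.0, 0.5, 9.0, -3.5, -1.0, -4.0]
z = standardized_residuals(residuals)

# Positions of observations whose standardized residual lies outside +/-2
flagged = [i for i, value in enumerate(z) if abs(value) > 2]
print(flagged)  # [3]
```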
When There are More Categories¶
- Scenario: Three brands (Lenovo, Dell, Asus).
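- Use k - 1 dummy variables for k categories:
- Lenovo: x1 = 1, x2 = 0.
- Dell: x1 = 0, x2 = 1.
- Asus (baseline): x1 = 0, x2 = 0.
- Regression Equation: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$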
- Interpretation of Coefficients:
- β0: Average satisfaction score for Asus.
- β1: Difference between Lenovo and Asus.
- β2: Difference between Dell and Asus.
- Alternative Baseline: Change the reference category.
- Example: Use Lenovo as the baseline.
- β0: Average score for Lenovo.
- β1: Difference between Asus and Lenovo.
- β2: Difference between Dell and Lenovo.
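A small sketch of k - 1 dummy coding with a selectable baseline. The helper name `dummy_code` is hypothetical; the baseline category gets all zeros, and every other category gets a 1 in its own dummy.

```python
def dummy_code(value, categories, baseline):
    """Return k-1 dummies for value, omitting the baseline category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value}")
    return {c: int(value == c) for c in categories if c != baseline}

brands = ["Lenovo", "Dell", "Asus"]
print(dummy_code("Lenovo", brands, baseline="Asus"))  # {'Lenovo': 1, 'Dell': 0}
print(dummy_code("Asus", brands, baseline="Asus"))    # {'Lenovo': 0, 'Dell': 0}
print(dummy_code("Asus", brands, baseline="Lenovo"))  # {'Dell': 0, 'Asus': 1}
```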
Simplified Explanation¶
What is Happening?¶
- Why Use Dummy Variables?
- Categorical data (e.g., brands) must be converted into numbers (0s and 1s) to fit into the regression model.
- How Do We Interpret Results?
- Each category (e.g., Dell or Lenovo) changes the regression equation slightly.
- Example: Lenovo gives an average satisfaction score 3.93 points higher than Dell.
- What Happens with 3+ Categories?
- Use multiple dummy variables. For 3 brands:
- Assign 1s and 0s to represent two of the brands, while the third acts as a baseline for comparison.
Key Takeaways¶
- Dummy variables allow you to analyze categorical factors like brands.
- The choice of baseline category (e.g., Asus or Lenovo) changes how the coefficients are interpreted but not the model's predictions or overall fit.
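To see why the baseline does not change predictions, consider a model with brand dummies only. In that case the least-squares intercept is the baseline group's mean and each dummy coefficient is the difference between that group's mean and the baseline mean. The sketch below uses hypothetical scores and illustrative function names to show that both baselines predict the same value for every brand.

```python
from statistics import mean

# Hypothetical satisfaction scores per brand (illustrative only)
scores = {"Lenovo": [70, 74], "Dell": [65, 67], "Asus": [60, 62]}

def fit_dummy_only(scores, baseline):
    """Coefficients of a dummies-only regression, which reduce to group means."""
    b0 = mean(scores[baseline])  # intercept = baseline group mean
    diffs = {b: mean(v) - b0 for b, v in scores.items() if b != baseline}
    return b0, diffs

def predict(brand, b0, diffs):
    """Intercept plus the brand's dummy coefficient (0 for the baseline)."""
    return b0 + diffs.get(brand, 0.0)

asus_fit = fit_dummy_only(scores, "Asus")
lenovo_fit = fit_dummy_only(scores, "Lenovo")
same = all(predict(b, *asus_fit) == predict(b, *lenovo_fit) for b in scores)
print(same)  # True
```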
8.3 Examples¶
Baseline Model: Manjula Nayak's Household Survey¶
- Regression Results:
- β0 = −140,378
Adding Ownership as a Categorical Variable¶
Adding Location as a Categorical Variable¶
Baseline Model¶
- Initial Model:
- Dependent variable (y): Selling price (in lakhs).
- Independent variable (x1): Area (in square feet).
- Limitations:
- Adding quantitative variables (e.g., bedrooms, bathrooms) led to multicollinearity issues, as these variables were highly correlated with area.
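One simple way to spot this kind of multicollinearity is to check the correlation between predictors before adding them. The sketch below computes the Pearson correlation in plain Python; the area and bedroom values are hypothetical and are not the Mysuru data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical flats: area in square feet and number of bedrooms
area = [600, 850, 1100, 1400, 1800]
bedrooms = [1, 2, 2, 3, 4]

r = pearson_r(area, bedrooms)
print(round(r, 2))  # a value near 1 means the two predictors carry overlapping information
```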
Step 1: Adding a Categorical Variable (Premium Location)¶
- New Model:
- Dependent variable (y): Selling price.
- Independent variables:
- x1: Area (quantitative).
- x2: Premium location (categorical; x2=1 for premium, x2=0 otherwise).
- Regression equation:
- Results:
Step 2: Adding Multiple Categorical Variables¶
- Additional Variables:
- x3: Amenities (1 if present, 0 otherwise).
- x4: Gated community (1 if gated, 0 otherwise).
- New Model:
- Regression equation:
- Results:
- Significance Tests:
Residual Analysis¶
- Standardized Residuals:
- No values exceeding ±3.
- Adding categorical variables improved the fit.
- Plots of Residuals:
- Residuals vs. independent variables: Randomly distributed with no visible pattern, suggesting no systematic bias.
- Residuals vs. predicted values: Evenly distributed above and below the x-axis.
8.4 Interaction Variables¶
What are Interaction Variables?¶
- Definition:
- Interaction variables are the product of two independent variables (e.g., a continuous variable and a categorical dummy variable).
- They capture the conditional relationship between a dependent variable (y) and one independent variable (x2), moderated by another variable (x1).
- Purpose:
- To model situations where the effect of one independent variable on the dependent variable depends on the level of another independent variable.
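In symbols, a common way to write such a model (a sketch using the x1 and x2 labels above, with β3 introduced here as the coefficient of the product term) is:

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 x_2)

\frac{\partial y}{\partial x_2} = \beta_2 + \beta_3 x_1

x_1 = 0:\ \text{slope of } x_2 = \beta_2
\qquad
x_1 = 1:\ \text{slope of } x_2 = \beta_2 + \beta_3
```

When x1 is a dummy variable, β3 is therefore the difference between the two groups' slopes on x2.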
Example: Salary Analysis¶
Analysis and Results¶
Interpretation of Interaction Effects¶
- Insights:
- The rate of salary increase (effect of x2) is significantly moderated by gender (x1).
- Men experience a much higher salary increase with work experience compared to women.
- Implications:
- Highlights the importance of considering interaction effects in salary studies.
- Reflects systemic gender-based salary disparities when work experience is accounted for.
Simplified Explanation¶
What Are Interaction Variables?¶
- Interaction happens when one factor (e.g., work experience) affects outcomes differently depending on another factor (e.g., gender).
How Does It Work?¶
- Multiply two variables (e.g., gender × experience) to create an interaction term.
- This term helps identify if the relationship changes depending on conditions (e.g., different effects for men vs. women).
Example Insights¶
- Men’s salaries increase faster than women’s as work experience grows.
- Women: Salary increases by $52.2 per month of experience.
- Men: Salary increases by $293.8 per month of experience.
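The two slopes above can be linked back to an interaction model. The sketch below assumes the coding x1 = 1 for men and x1 = 0 for women, which these notes do not state, and derives the implied experience and interaction coefficients from the quoted slopes. The function names are illustrative.

```python
# Slopes quoted in the notes (salary change per month of experience)
SLOPE_WOMEN = 52.2
SLOPE_MEN = 293.8

# Assumed coding: x1 = 1 for men, x1 = 0 for women (not stated in the notes)
b2 = SLOPE_WOMEN              # experience slope when x1 = 0
b3 = SLOPE_MEN - SLOPE_WOMEN  # interaction coefficient implied by the two slopes

def interaction_term(x1, x2):
    """Product of the gender dummy and months of experience."""
    return x1 * x2

def experience_slope(x1):
    """Effect of one extra month of experience for a given gender dummy."""
    return b2 + b3 * x1

print(round(experience_slope(0), 1), round(experience_slope(1), 1))  # 52.2 293.8
```

Under this coding the interaction coefficient is about 241.6, the gap between the two slopes.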
Why Use Interaction Variables?¶
- To capture complex relationships and uncover hidden patterns in data.
- Helps ensure models reflect real-world dynamics.
Key Takeaways¶
- Power of Interaction Variables:
- They provide tailored insights for different groups (e.g., men vs. women).
- Make regression models more versatile and insightful.
- Practical Use:
- Applicable in salary analysis, customer segmentation, and more.
- Ensures better, data-driven decisions.