
8.1 Introduction to MLR with Categorical Variables

  • MLR Review: A model predicting a dependent variable y using multiple independent variables.
  • Extension: In this module, categorical (qualitative) variables are introduced as independent variables alongside quantitative variables.
  • Example: Flat prices might depend not only on area (quantitative) but also on whether the location is premium or not (categorical).

Why Include Categorical Variables?

  • Improved Predictions: Categorical variables often provide crucial context not captured by quantitative data.
  • Example: In Mysuru flat price data:
    • Quantitative variables: Area, bathrooms, bedrooms.
    • Missing categorical variables: Amenities, apartment type, premium location.

Handling Categorical Variables

  • Key Concept: Categorical variables need to be converted into numerical form (dummy variables) for regression.
  • Dummy Variables: Binary variables (0 or 1) representing categories.
    • Example: the Brand variable in Basavaraja's dataset, coded Lenovo = 1, Dell = 0.

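The dummy coding above can be sketched in a few lines; this assumes the coding used in the notes (Lenovo = 1, Dell = 0):

```python
# Minimal sketch of dummy coding for a binary category.
# Coding assumed from the notes: Lenovo = 1, Dell = 0.
def to_dummy(brand):
    """Convert the Brand label into a 0/1 dummy variable."""
    return 1 if brand == "Lenovo" else 0

brands = ["Lenovo", "Dell", "Dell", "Lenovo"]
dummies = [to_dummy(b) for b in brands]
print(dummies)  # [1, 0, 0, 1]
```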
Steps in MLR with Categorical Variables

  1. Model Formulation:
    • Basic regression equation: ŷ = β0 + β1x1 + β2x2
    • x1: Price (quantitative).
    • x2: Brand (categorical, represented as a dummy variable).
  2. Data Analysis:
    • Include both x1 and x2 in the regression model.
    • Use tools like Excel's Analysis ToolPak to calculate coefficients, R², standard error, and the ANOVA table.
  3. Example Calculation:
    • Basavaraja's dataset: Satisfaction score depends on:
      • Price (x1): Quantitative.
      • Brand (x2): Lenovo (1) or Dell (0).
    • Fitted regression equation: ŷ = 46.83 + 1.29x1 + 3.93x2

Results and Interpretations

  • Performance Improvement:
    • R² improved from 0.31 (price only) to 0.45 (price + brand).
    • Standard error reduced from 5.19 to 4.85.
  • Significance Tests:
    • F-test (overall significance):
      • Null hypothesis: No linear relationship.
      • F-statistic: 13.75 (p-value close to 0). Reject the null hypothesis, indicating the model is significant.
    • t-tests (individual significance):
      • Price significantly influences the satisfaction score.
      • Brand also significantly impacts satisfaction.

  • Adding categorical variables (e.g., brand) enhances the model's explanatory power and predictive accuracy.
  • Interpretation of coefficients:
    • β1: Satisfaction score increases by 1.29 units for every 1-lakh increase in price.
    • β2: Lenovo users score 3.93 units higher than Dell users, on average.

Simplified Explanation

  1. What is Multiple Linear Regression (MLR)?
    • It's like finding a formula to predict something (e.g., satisfaction score) using several factors (e.g., price, brand).
  2. What's New?
    • Previously, we only used numbers (e.g., price).
    • Now, we also include labels like brand (Lenovo, Dell) by turning them into numbers (dummy variables: 1 or 0).
  3. Why Do This?
    • Labels (categories) can also influence predictions. Including them improves accuracy.
  4. How Do We Use It?
    • Example:
      • Satisfaction = 46.83 + (1.29 × price) + (3.93 × brand).
      • If it's Lenovo, brand = 1. If Dell, brand = 0.
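The fitted equation from the notes can be wrapped in a small prediction function. At the same price, the gap between a Lenovo and a Dell prediction is exactly the brand coefficient:

```python
# Sketch of the fitted equation from the notes:
# satisfaction = 46.83 + 1.29 * price + 3.93 * brand (Lenovo = 1, Dell = 0).
def predict_satisfaction(price, brand):
    """Predicted satisfaction score; price in lakhs, brand is the 0/1 dummy."""
    return 46.83 + 1.29 * price + 3.93 * brand

# Same price, different brand: the gap equals the brand coefficient.
lenovo = predict_satisfaction(1.0, 1)
dell = predict_satisfaction(1.0, 0)
print(round(lenovo - dell, 2))  # 3.93
```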

8.2 Understanding the Interpretation of Coefficients

  1. Regression Equation: ŷ = β0 + β1x1 + β2x2, where:
    • β0: Intercept.
    • β1: Coefficient for x1 (quantitative variable).
    • β2: Coefficient for x2 (categorical dummy variable).
  2. Interpreting Coefficients with a Dummy Variable:
    • Impact of β2: it shifts the intercept for the category coded 1, so β2 is the average difference between the two categories at any fixed x1.
  3. Example with Basavaraja's Dataset:
    • Fitted regression equation: ŷ = 46.83 + 1.29x1 + 3.93x2
      • x1: Price of the computer.
      • x2 = 0 (Dell), x2 = 1 (Lenovo).
    • Predictions: for Dell, ŷ = 46.83 + 1.29x1; for Lenovo, ŷ = 50.76 + 1.29x1.
    • Interpretation: Lenovo scores 3.93 points higher on average than Dell.

Residual Analysis

  • Residuals vs. x1​ (price): Appear evenly distributed around the x-axis.
  • Residuals vs. x2 (dummy variable): Clustered at the two dummy values (0 or 1).
  • No major outliers: only a few residuals lie above +2 or below −2.

When There are More Categories

  • Scenario: Three brands (Lenovo, Dell, Asus).
  • Use k − 1 dummy variables for k categories; here, two dummies:
    • x1 = 1 for Lenovo, 0 otherwise.
    • x2 = 1 for Dell, 0 otherwise.
    • Asus (both dummies 0) serves as the baseline.
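The k − 1 coding above can be sketched as a function returning the two dummies, with Asus as the baseline (both zero), matching the coding described in the notes:

```python
# Sketch: k - 1 = 2 dummy variables for k = 3 brands, Asus as baseline.
def encode_brand(brand):
    """Return (x1, x2): x1 = 1 for Lenovo, x2 = 1 for Dell, Asus = (0, 0)."""
    return (1 if brand == "Lenovo" else 0,
            1 if brand == "Dell" else 0)

print(encode_brand("Lenovo"))  # (1, 0)
print(encode_brand("Dell"))    # (0, 1)
print(encode_brand("Asus"))    # (0, 0)
```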

Regression Equation: ŷ = β0 + β1x1 + β2x2

  • Interpretation of Coefficients:
    • β0: Average satisfaction score for Asus.
    • β1: Difference between Lenovo and Asus.
    • β2: Difference between Dell and Asus.
  • Alternative Baseline: Change the reference category.
    • Example: Use Lenovo as the baseline.
      • β0: Average score for Lenovo.
      • β1: Difference between Asus and Lenovo.
      • β2: Difference between Dell and Lenovo.
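A small numeric check illustrates the point about baselines: the coefficients change when the reference category changes, but the fitted values do not. The group means below are made-up numbers for illustration only:

```python
# Hypothetical group means (assumed, not from the dataset) showing that
# changing the baseline category changes coefficients but not predictions.
means = {"Asus": 40.0, "Lenovo": 44.0, "Dell": 41.0}

def coefficients(baseline):
    """Intercept = baseline mean; each slope = that group's mean minus the baseline's."""
    b0 = means[baseline]
    return b0, {b: means[b] - b0 for b in means if b != baseline}

def predict(brand, baseline):
    b0, slopes = coefficients(baseline)
    return b0 + slopes.get(brand, 0.0)

# Same fitted value for every brand under either baseline.
for brand in means:
    assert predict(brand, "Asus") == predict(brand, "Lenovo") == means[brand]
print("predictions agree")
```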

Simplified Explanation

What is Happening?

  1. Why Use Dummy Variables?
    • Categorical data (e.g., brands) must be converted into numbers (0s and 1s) to fit into the regression model.
  2. How Do We Interpret Results?
    • Each category (e.g., Dell or Lenovo) shifts the regression equation slightly.
    • Example: Lenovo gives an average satisfaction score 3.93 points higher than Dell.
  3. What Happens with 3+ Categories?
    • Use multiple dummy variables. For 3 brands, assign 1s and 0s to represent two of the brands, while the third acts as a baseline for comparison.

Key Takeaways

  • Dummy variables allow you to analyze categorical factors like brands.
  • The choice of baseline category (e.g., Asus or Lenovo) affects how results are interpreted but not the overall conclusion.

8.3 Examples

Baseline Model: Manjula Nayak's Household Survey

  • Regression Results:
    • β0 = −140,378

Adding Ownership as a Categorical Variable

Adding Location as a Categorical Variable

Baseline Model

  1. Initial Model:
    • Dependent variable (y): Selling price (in lakhs).
    • Independent variable (x1): Area (in square feet).
  2. Limitations:
    • Adding quantitative variables (e.g., bedrooms, bathrooms) led to multicollinearity issues, as these variables were highly correlated with area.
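The multicollinearity concern above can be checked with a simple correlation. The area and bedroom figures below are made-up illustrative values, not from the survey:

```python
# Sketch of a multicollinearity check: if bedrooms track area closely,
# adding both predictors adds little new information.
# Data below are assumed illustrative values.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

area = [600, 850, 1100, 1400, 1800]  # sq ft
bedrooms = [1, 2, 2, 3, 4]           # roughly tracks area
r = pearson(area, bedrooms)
print(round(r, 2))  # 0.98 -> strong multicollinearity risk
```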

Step 1: Adding a Categorical Variable (Premium Location)

  1. New Model:
    • Dependent variable (y): Selling price.
    • Independent variables:
      • x1: Area (quantitative).
      • x2: Premium location (categorical; x2 = 1 for premium, x2 = 0 otherwise).
    • Regression equation: ŷ = β0 + β1x1 + β2x2
  2. Results:

Step 2: Adding Multiple Categorical Variables

  1. Additional Variables:
    • x3: Amenities (1 if present, 0 otherwise).
    • x4: Gated community (1 if gated, 0 otherwise).
  2. New Model:
    • Regression equation: ŷ = β0 + β1x1 + β2x2 + β3x3 + β4x4
  3. Results:
  4. Significance Tests:

Residual Analysis

  1. Standardized Residuals:
    • No values exceeding ±3.
    • Adding categorical variables improved the fit.
  2. Plots of Residuals:
    • Residuals vs. independent variables: random distribution with no patterns, indicating no bias.
    • Residuals vs. predicted values: equally distributed above and below the x-axis.
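The standardized-residual check described above can be sketched as follows; the residual values are illustrative, and dividing by the residuals' sample standard deviation is a simple approximation of standardization:

```python
# Sketch: standardize residuals and flag anything beyond +/-3.
# The residual values below are assumed for illustration.
import math

residuals = [2.1, -1.4, 0.5, -0.8, 1.9, -2.3, 0.0, -0.1]

std = math.sqrt(sum(r * r for r in residuals) / (len(residuals) - 1))
standardized = [r / std for r in residuals]
outliers = [z for z in standardized if abs(z) > 3]
print(outliers)  # [] -> no values exceed +/-3, consistent with a good fit
```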

8.4 Interaction Variables

What are Interaction Variables?

  1. Definition:
    • Interaction variables are the product of two independent variables (e.g., a continuous and a categorical variable).
    • They capture the conditional relationship between the dependent variable (y) and one independent variable (x2), moderated by another variable (x1).
  2. Purpose:
    • To model situations where the effect of one independent variable on the dependent variable depends on the level of another independent variable.

Example: Salary Analysis

Analysis and Results

Interpretation of Interaction Effects

  1. Insights:
    • The rate of salary increase (effect of x2) is significantly moderated by gender (x1).
    • Men experience a much higher salary increase with work experience compared to women.
  2. Implications:
    • Highlights the importance of considering interaction effects in salary studies.
    • Reflects systemic gender-based salary disparities when work experience is accounted for.

Simplified Explanation

What Are Interaction Variables?

  • Interaction happens when one factor (e.g., work experience) affects outcomes differently depending on another factor (e.g., gender).

How Does It Work?

  • Multiply two variables (e.g., gender × experience) to create an interaction term.
  • This term helps identify if the relationship changes depending on conditions (e.g., different effects for men vs. women).

Example Insights

  • Men’s salaries increase faster than women’s as work experience grows.
  • Women: Salary increases by $52.2 per month of experience.
  • Men: Salary increases by $293.8 per month of experience.
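The two slopes above are exactly what an interaction term produces: the base experience coefficient applies to the group coded 0, and the interaction coefficient is the difference between the groups. The gender coding (men = 1, women = 0) is an assumption for illustration:

```python
# Sketch of the interaction effect from the notes.
# Coding assumed: gender = 1 for men, 0 for women.
WOMEN_SLOPE = 52.2   # $ per month of experience (from the notes)
MEN_SLOPE = 293.8    # $ per month of experience (from the notes)
interaction_coef = MEN_SLOPE - WOMEN_SLOPE  # coefficient on gender x experience

def experience_slope(gender):
    """Marginal salary gain per month of experience; gender: 1 = man, 0 = woman."""
    return WOMEN_SLOPE + interaction_coef * gender

print(round(experience_slope(0), 1))  # 52.2
print(round(experience_slope(1), 1))  # 293.8
```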

Why Use Interaction Variables?

  • To capture complex relationships and uncover hidden patterns in data.
  • Helps ensure models reflect real-world dynamics.

Key Takeaways

  1. Power of Interaction Variables:
    • They provide tailored insights for different groups (e.g., men vs. women).
    • They make regression models more versatile and insightful.
  2. Practical Use:
    • Applicable in salary analysis, customer segmentation, and more.
    • Ensures better, data-driven decisions.