8.1.1 MLR: Categorical Variables

MLR Review

  • Definition: A model that predicts a dependent variable y from multiple independent variables x1, x2, …, xk.

  • Extension: In this module, categorical (qualitative) variables are introduced as independent variables alongside quantitative variables.

  • Example: Flat prices might depend not only on area (quantitative) but also on whether the location is premium or not (categorical).

Why Include Categorical Variables?

  • Improved Predictions: Categorical variables often provide crucial context not captured by quantitative data.
  • Example: In Mysuru flat price data:
    • Quantitative variables: Area, bathrooms, bedrooms.
    • Categorical variables not yet in the model: Amenities, apartment type, premium location.

Handling Categorical Variables

  • Key Concept: Categorical variables need to be converted into numerical form (dummy variables) for regression.
  • Dummy Variables: Binary variables (0 or 1) representing categories.
  • Example: The Brand variable in Basavaraja's dataset is coded as 1 for Lenovo and 0 for Dell.
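The dummy coding described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the source: the helper name `encode_dummy` and the four sample labels are hypothetical, and Dell is taken as the reference level (coded 0) to match the coding used later in these notes.

```python
def encode_dummy(values, reference):
    """Map each label to 0 if it is the reference level, otherwise 1."""
    levels = set(values)
    if len(levels) != 2 or reference not in levels:
        raise ValueError("expected exactly two levels, one of them the reference")
    return [0 if v == reference else 1 for v in values]


# Hypothetical brand labels (not rows from Basavaraja's dataset).
brands = ["Lenovo", "Dell", "Dell", "Lenovo"]
brand_dummy = encode_dummy(brands, reference="Dell")
print(brand_dummy)  # [1, 0, 0, 1]
```

Only one dummy column is needed for a two-level category, because the reference level is already represented by the value 0.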

Steps in MLR with Categorical Variables

Model Formulation:

  • Basic regression equation: y = β0 + β1x1 + β2x2 + ε

  • x1: Price (quantitative).

  • x2: Brand (categorical, represented as a dummy variable).

Data Analysis:

  • Include both x1 and x2 in the regression model.
  • Use tools like Excel's Analysis ToolPak to calculate the coefficients, R², standard error, and ANOVA table.
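For readers working outside Excel, the coefficient calculation can be sketched in plain Python by solving the normal equations. This is an illustrative sketch under stated assumptions: `fit_ols` is a hypothetical helper, and the six observations are synthetic, generated without noise from the coefficients reported in these notes (46.83, 1.29, 3.93) so that the fit can be checked. They are not Basavaraja's data.

```python
def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(xi[i] * xi[j] for xi in X) for j in range(p)]
         + [sum(xi[i] * yi for xi, yi in zip(X, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Synthetic, noise-free observations (assumption, for illustration only).
prices = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]   # x1, in lakhs
brands = [1, 0, 1, 0, 0, 1]               # x2: Lenovo = 1, Dell = 0
scores = [46.83 + 1.29 * p + 3.93 * b for p, b in zip(prices, brands)]

b0, b1, b2 = fit_ols(list(zip(prices, brands)), scores)
print(round(b0, 2), round(b1, 2), round(b2, 2))
```

Because the synthetic scores contain no noise, the fit recovers the generating coefficients. Real data would also need the standard errors and ANOVA table that the ToolPak reports.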

Example Calculation:

  • Basavaraja's dataset: Satisfaction score depends on:
    • Price (x1): Quantitative.
    • Brand (x2): Lenovo (1) or Dell (0).
  • Regression Equation: Satisfaction score = 46.83 + 1.29x1 + 3.93x2

Results and Interpretations

Performance Improvement:

  • R² improved from 0.31 (price only) to 0.45 (price + brand).
  • Standard error reduced from 5.19 to 4.85.
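The two measures above follow from the residuals: R² = 1 − SSE/SST, and the standard error of the regression is √(SSE/(n − k − 1)), where k is the number of independent variables. A minimal sketch, with hypothetical function names and made-up numbers chosen so that the arithmetic is easy to follow:

```python
import math


def r_squared(y, y_hat):
    """Share of the variation in y explained by the fitted values."""
    mean_y = sum(y) / len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # unexplained variation
    sst = sum((a - mean_y) ** 2 for a in y)            # total variation
    return 1 - sse / sst


def standard_error(y, y_hat, k):
    """Standard error of the regression with k independent variables."""
    n = len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    return math.sqrt(sse / (n - k - 1))


# Hypothetical observed and fitted values (not from the source dataset).
y = [2, 4, 6, 8, 10]
y_hat = [2.5, 3.5, 6, 8.5, 9.5]
print(r_squared(y, y_hat))            # SSE = 1, SST = 40, so R² = 0.975
print(standard_error(y, y_hat, k=2))  # sqrt(1 / 2), about 0.707
```

A higher R² together with a lower standard error, as reported above, means the fitted values sit closer to the observed scores once brand is included.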

Significance Tests:

  • F-test (overall significance):
    • Null hypothesis: No linear relationship between satisfaction score and the independent variables (β1 = β2 = 0).
    • F-statistic: 13.75 (p-value close to 0). Reject the null hypothesis; the model as a whole is significant.
  • T-tests (individual significance), each testing the null hypothesis that one coefficient is zero:
    • Price (β1): Price significantly influences satisfaction score.
    • Brand (β2): Brand also significantly impacts satisfaction.
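The overall F-statistic is tied to R² by F = (R²/k) / ((1 − R²)/(n − k − 1)), where n is the sample size and k the number of independent variables. The sketch below only illustrates that relationship. The function name is hypothetical, and because the sample size of Basavaraja's dataset is not stated in these notes, the example uses made-up inputs.

```python
def f_statistic(r2, n, k):
    """Overall F-statistic implied by R², sample size n, and k predictors."""
    explained_per_df = r2 / k
    unexplained_per_df = (1 - r2) / (n - k - 1)
    return explained_per_df / unexplained_per_df


# Hypothetical inputs: R² = 0.5 from 23 observations and 2 predictors.
f_value = f_statistic(0.5, n=23, k=2)
print(f_value)  # (0.5 / 2) / (0.5 / 20), about 10
```

A large F value means the explained variation per degree of freedom is large relative to the unexplained variation, which is what leads to rejecting the null hypothesis.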

Adding categorical variables (e.g., brand):

  • Enhances the model's explanatory power and predictive accuracy.
  • Interpretation of coefficients:
    • β1: Satisfaction score increases by 1.29 units for every 1-lakh increase in price, holding brand fixed.
    • β2: At the same price, Lenovo users score 3.93 units higher than Dell users, on average.

Simplified Explanation

What is Multiple Linear Regression (MLR)?

  • It's like finding a formula to predict something (e.g., satisfaction score) using several factors (e.g., price, brand).

What’s New?

  • Previously, we only used numbers (e.g., price).
  • Now, we also include labels like brand (Lenovo, Dell) by turning them into numbers (dummy variables: 1 or 0).

Why Do This?

  • Labels (categories) can also influence predictions. Including them improves accuracy.

How Do We Use It?

  • Example:
    • Satisfaction = 46.83 + (1.29 × price) + (3.93 × brand).
    • If it's Lenovo, brand = 1.
    • If it's Dell, brand = 0.
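As a worked example, the equation above can be wrapped in a small function. The coefficients come from these notes. The function name and the price of 0.5 lakh are hypothetical, chosen only to show the calculation.

```python
def predict_satisfaction(price_lakh, brand):
    """Predicted satisfaction score; brand is 1 for Lenovo and 0 for Dell."""
    return 46.83 + 1.29 * price_lakh + 3.93 * brand


# Hypothetical price of 0.5 lakh for both brands.
lenovo = predict_satisfaction(0.5, brand=1)
dell = predict_satisfaction(0.5, brand=0)
print(f"Lenovo: {lenovo:.3f}")            # 46.83 + 0.645 + 3.93 = 51.405
print(f"Dell:   {dell:.3f}")              # 46.83 + 0.645 = 47.475
print(f"Gap:    {lenovo - dell:.2f}")     # 3.93, the brand coefficient
```

At any given price the gap between the two predictions equals the brand coefficient, which is why that coefficient is read as the average Lenovo-versus-Dell difference.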