8.1.1 MLR: Categorical Variables¶
MLR Review¶
-
Definition: A model predicting a dependent variable y using multiple independent variables
-
Extension: In this module, categorical (qualitative) variables are introduced as independent variables alongside quantitative variables.
- Example: Flat prices might depend not only on area (quantitative) but also on whether the location is premium or not (categorical).
Why Include Categorical Variables?¶
- Improved Predictions: Categorical variables often provide crucial context not captured by quantitative data.
- Example: In Mysuru flat price data:
- Quantitative variables: Area, bathrooms, bedrooms.
- Missing categorical variables: Amenities, apartment type, premium location.
Handling Categorical Variables¶
- Key Concept: Categorical variables need to be converted into numerical form (dummy variables) for regression.
- Dummy Variables: Binary variables (0 or 1) representing categories.
- Example: Brand variable in Basavaraja's dataset.
Steps in MLR with Categorical Variables¶
Model Formulation:¶
-
Basic regression equation:
-
x1: Price (quantitative).
- x2: Brand (categorical, represented as dummy variable).
Data Analysis:¶
- Include both x1 and x2 in the regression model.
- Use tools like Excel's Analysis ToolPak to calculate coefficients, , standard error, and ANOVA table.
Example Calculation:¶
- Basavaraja's dataset: Satisfaction score depends on:
- Price (x1): Quantitative.
- Brand (x2): Lenovo (1) or Dell (0).
- Regression Equation:
Results and Interpretations¶
Performance Improvement:¶
- improved from 0.31 (price only) to 0.45 (price + brand).
- Standard error reduced from 5.19 to 4.85.
Significance Tests:¶
- F-test (overall significance):
- Null hypothesis: No linear relationship.
- F-statistic: 13.75 (p-value close to 0). Reject null hypothesis, indicating the model is significant.
- T-tests (individual significance):
-
-
Price significantly influences satisfaction score.
-
Brand also significantly impacts satisfaction.
Adding categorical variables (e.g., brand):¶
- Enhances the model's explanatory power and predictive accuracy.
- Interpretation of coefficients:
- β1: Satisfaction score increases by 1.29 units for every 1-lakh increase in price.
- β2: Lenovo users score 3.93 units higher than Dell users, on average.
Simplified Explanation¶
What is Multiple Linear Regression (MLR)?¶
- It's like finding a formula to predict something (e.g., satisfaction score) using several factors (e.g., price, brand).
What’s New?¶
- Previously, we only used numbers (e.g., price).
- Now, we also include labels like brand (Lenovo, Dell) by turning them into numbers (dummy variables: 1 or 0).
Why Do This?¶
- Labels (categories) can also influence predictions. Including them improves accuracy.
How Do We Use It?¶
- Example:
- Satisfaction = 46.83 + (1.29 x price) + (3.93 x brand).
- If it’s Lenovo, brand = 1.
- If Dell, brand = 0.
Ask Hive Chat
Hive Chat
Hi, I'm Hive Chat, an AI assistant created by CollegeHive.
How can I help you today?
How can I help you today?