10  Linear Regression - Machine Learning Models

Author

Thieu Nguyen

10.0.1 Introduction

Machine learning is a powerful tool for identifying patterns and making predictions from data, especially in healthcare. Among its many techniques, linear regression is one of the most fundamental, particularly for predicting continuous outcomes.

In this chapter, we’ll apply linear regression to a real-world healthcare dataset, heart.csv, which includes clinical and demographic data related to cholesterol levels. We’ll cover data preparation, model building, performance evaluation, and result interpretation in a medical context.

This hands-on approach reinforces core machine learning concepts and shows how even simple models can yield valuable insights. We’ll also address challenges like missing data, multicollinearity, and outliers that can affect model accuracy.

10.0.2 Load necessary Python packages and data

# Python packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv("/Users/nnthieu/Downloads/heart.csv")
print(df.shape)
print(df.columns)
df.head()
(5209, 17)
Index(['Status', 'DeathCause', 'AgeCHDdiag', 'Sex', 'AgeAtStart', 'Height',
       'Weight', 'Diastolic', 'Systolic', 'MRW', 'Smoking', 'AgeAtDeath',
       'Cholesterol', 'Chol_Status', 'BP_Status', 'Weight_Status',
       'Smoking_Status'],
      dtype='object')
Status DeathCause AgeCHDdiag Sex AgeAtStart Height Weight Diastolic Systolic MRW Smoking AgeAtDeath Cholesterol Chol_Status BP_Status Weight_Status Smoking_Status
0 Dead Other NaN Female 29 62.50 140.0 78 124 121.0 0.0 55.0 NaN NaN Normal Overweight Non-smoker
1 Dead Cancer NaN Female 41 59.75 194.0 92 144 183.0 0.0 57.0 181.0 Desirable High Overweight Non-smoker
2 Alive NaN NaN Female 57 62.25 132.0 90 170 114.0 10.0 NaN 250.0 High High Overweight Moderate (6-15)
3 Alive NaN NaN Female 39 65.75 158.0 80 128 123.0 0.0 NaN 242.0 High Normal Overweight Non-smoker
4 Alive NaN NaN Male 42 66.00 156.0 76 110 116.0 20.0 NaN 281.0 High Optimal Overweight Heavy (16-25)

The heart.csv file has 5,209 rows and 17 columns, where each row represents a patient and each column a feature or attribute related to heart disease.
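
Before evaluating the data in detail, it can help to confirm the column types and non-null counts. A minimal check (output omitted here), assuming df is loaded as above:

# Quick structural overview: dtypes, non-null counts, and memory usage
df.info()

# Summary statistics for the numeric columns
print(df.describe().T)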

10.0.3 Evaluate the dataset

10.0.3.1 Check for missing data

The dataset contains some missing values, which we will need to handle before applying machine learning algorithms. We can check for missing values using the following code:

df.isna().sum()
Status               0
DeathCause        3218
AgeCHDdiag        3760
Sex                  0
AgeAtStart           0
Height               6
Weight               6
Diastolic            0
Systolic             0
MRW                  6
Smoking             36
AgeAtDeath        3218
Cholesterol        152
Chol_Status        152
BP_Status            0
Weight_Status        6
Smoking_Status      36
dtype: int64

10.0.3.2 Look at the distribution of the numeric variables

Check the histogram to see if there are any outliers in the numeric variables.

def plot_histograms(df, bins=10, alpha=0.5, colors=None):
    """
    Plot histograms for all numeric variables in the DataFrame.
    Parameters:
        df (DataFrame): The DataFrame containing numeric variables.
        bins (int): Number of bins for the histograms. Default is 10.
        alpha (float): Transparency level of the histograms. Default is 0.5.
        colors (list): List of colors for the histograms. If None, default colors will be used.
    Returns:
        None
    """
    if colors is None:
        colors = plt.cm.tab10.colors  # Default color palette

    num_variables = df.select_dtypes(include='number').shape[1]
    num_rows = (num_variables + 1) // 2
    num_cols = 2
    
    plt.figure(figsize=(12, 6 * num_rows))

    for i, col in enumerate(df.select_dtypes(include='number'), start=1):
        plt.subplot(num_rows, num_cols, i)
        # Drop NaNs so the histogram range stays finite even for columns with missing values
        plt.hist(df[col].dropna(), bins=bins, alpha=alpha, color=colors[i % len(colors)])
        plt.title(f'Histogram of {col}')
        plt.xlabel('Value')
        plt.ylabel('Frequency')
        plt.grid(True)

    plt.tight_layout()
    plt.show()

# Example usage:
# Assuming df is your DataFrame containing numeric variables
plot_histograms(df)

10.0.3.3 Check correlation between variables using heatmap

To visualize the correlation between variables in the dataset, we will create a heatmap. This will help us identify relationships between features and understand how they might influence cholesterol levels.

# Drop 'Cholesterol'
df_corr = df.drop(columns=['Cholesterol'])

# Encode categorical variables (required for correlation matrix)
categorical_cols = df_corr.select_dtypes(include=['object', 'category']).columns.tolist()
df_corr_encoded = pd.get_dummies(df_corr, columns=categorical_cols, drop_first=True)

# Compute correlation matrix
corr_matrix = df_corr_encoded.corr()

# Plot heatmap
plt.figure(figsize=(14, 12))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))  # mask upper triangle

sns.heatmap(
    corr_matrix,
    mask=mask,
    cmap='coolwarm',
    vmin=-1, vmax=1,
    center=0,
    square=True,
    linewidths=0.5,
    annot=True,
    fmt='.2f',
    cbar_kws={"shrink": 0.8}
)

plt.title("Feature Correlation Heatmap (Excluding 'Cholesterol')", fontsize=16)
plt.tight_layout()
plt.show()

Highly correlated features can lead to multicollinearity, which can affect the performance of linear regression models. To identify pairs of features with high correlation, we will extract the upper triangle of the correlation matrix and filter for pairs with an absolute correlation greater than 0.65.

# Step 1: Drop 'Cholesterol' and one-hot encode categorical features
df_corr = df.drop(columns=['Cholesterol'])
categorical_cols = df_corr.select_dtypes(include=['object', 'category']).columns
df_corr_encoded = pd.get_dummies(df_corr, columns=categorical_cols, drop_first=True)

# Step 2: Compute correlation matrix
corr_matrix = df_corr_encoded.corr().abs()

# Step 3: Extract upper triangle (to avoid duplicate pairs)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Step 4: Find feature pairs with correlation > 0.65
high_corr_pairs = (
    upper.stack()
    .reset_index()
    .rename(columns={0: 'correlation', 'level_0': 'feature_1', 'level_1': 'feature_2'})
    .query('correlation > 0.65')
    .sort_values(by='correlation', ascending=False)
)

print("Highly correlated feature pairs (|correlation| > 0.65):")
print(high_corr_pairs)
Highly correlated feature pairs (|correlation| > 0.65):
      feature_1                         feature_2  correlation
90    Diastolic                          Systolic     0.796061
71       Weight                               MRW     0.767171
162     Smoking         Smoking_Status_Non-smoker     0.753253
7    AgeCHDdiag                        AgeAtDeath     0.748112
163     Smoking  Smoking_Status_Very Heavy (> 25)     0.708725
58       Height                          Sex_Male     0.693441
30   AgeAtStart                        AgeAtDeath     0.688605
141         MRW          Weight_Status_Overweight     0.683167

10.0.3.4 ‘Cholesterol’ and ‘DeathCause’

# Calculate mean cholesterol for each DeathCause
mean_cholesterol = df.groupby('DeathCause')['Cholesterol'].mean()

# Create bar plot
plt.figure(figsize=(10, 6))
mean_cholesterol.plot(kind='bar', color='skyblue')
plt.title('Mean Cholesterol by Death Cause')
plt.xlabel('Death Cause')
plt.ylabel('Mean Cholesterol')
plt.xticks(rotation=45)  # Rotate x-axis labels for better visibility

# Add mean values to the bars
for index, value in enumerate(mean_cholesterol):
    plt.text(index, value, str(round(value, 2)), ha='center', va='bottom')

plt.tight_layout()
plt.show()

‘DeathCause’ is a consequence, not an exposure, so it should be removed from the regression models.

Highly correlated features will be removed from the dataset to avoid multicollinearity issues. Therefore, we will drop variables such as ‘MRW’, ‘Weight_Status_Overweight’, and ‘Height’.

10.0.4 Prepare data

10.0.4.1 Remove unnecessary variables

df = df.drop(columns=[
    'DeathCause', 'MRW', 'Smoking_Status', 
    'Weight_Status', 'Height', 'Systolic', 'AgeAtDeath', 'BP_Status'
], errors='ignore')

df.head(3)
Status AgeCHDdiag Sex AgeAtStart Weight Diastolic Smoking Cholesterol Chol_Status
0 Dead NaN Female 29 140.0 78 0.0 NaN NaN
1 Dead NaN Female 41 194.0 92 0.0 181.0 Desirable
2 Alive NaN Female 57 132.0 90 10.0 250.0 High

Data have to be prepared before applying machine learning algorithms. This includes handling missing values, recoding categorical variables, and detecting outliers.

10.0.4.2 Replace NA values

# Clean values (strip spaces, normalize case)
df['Chol_Status'] = df['Chol_Status'].astype(str).str.strip().str.title()
df['Chol_Status'] = df['Chol_Status'].replace('Nan', np.nan)  # convert literal "nan" string to real NaN

# Impute missing values using distribution
proportions = df['Chol_Status'].value_counts(normalize=True, dropna=True)
missing_count = df['Chol_Status'].isna().sum()

if missing_count > 0 and not proportions.empty:
    missing_samples = np.random.choice(
        proportions.index,
        size=missing_count,
        p=proportions.values
    )
    df.loc[df['Chol_Status'].isna(), 'Chol_Status'] = missing_samples
# Calculate means for numeric columns
means = df.mean(numeric_only=True)

# Impute missing values in numeric columns using their respective means
for col in df.select_dtypes(include='number'):
    df[col] = df[col].fillna(means[col])

df.describe()
AgeCHDdiag AgeAtStart Weight Diastolic Smoking Cholesterol
count 5209.000000 5209.000000 5209.000000 5209.000000 5209.000000 5209.000000
mean 63.302968 44.068727 153.086681 85.358610 9.366518 227.417441
std 5.282496 8.574954 28.898765 12.973091 11.989796 44.274927
min 32.000000 28.000000 67.000000 50.000000 0.000000 96.000000
25% 63.302968 37.000000 132.000000 76.000000 0.000000 197.000000
50% 63.302968 43.000000 150.000000 84.000000 1.000000 225.000000
75% 63.302968 51.000000 172.000000 92.000000 20.000000 251.000000
max 90.000000 62.000000 300.000000 160.000000 60.000000 568.000000

Check NA again

df.isna().sum()
Status         0
AgeCHDdiag     0
Sex            0
AgeAtStart     0
Weight         0
Diastolic      0
Smoking        0
Cholesterol    0
Chol_Status    0
dtype: int64

10.0.4.3 Recoding data

To prepare the dataset for machine learning, we need to recode categorical variables into numerical values. This is essential because most machine learning algorithms require numerical input.

# Encode 'Status': Dead -> 1, Alive -> 0
status_mapping = {'Dead': 1, 'Alive': 0}
df['Status'] = df['Status'].map(status_mapping)

# Encode 'Sex': Male -> 1, Female -> 0
status_mappingSex = {'Male': 1, 'Female': 0}
df['Sex'] = df['Sex'].map(status_mappingSex)

# Encode 'Chol_Status': Desirable -> 0, Borderline -> 1, High -> 2
status_mappingC = {'Borderline': 1, 'Desirable': 0, 'High': 2}
df['Chol_Status'] = df['Chol_Status'].apply(
    lambda x: status_mappingC.get(x, np.nan)
)

Check NA again

df.isna().sum()
Status         0
AgeCHDdiag     0
Sex            0
AgeAtStart     0
Weight         0
Diastolic      0
Smoking        0
Cholesterol    0
Chol_Status    0
dtype: int64
print(df.columns)
df.head()
Index(['Status', 'AgeCHDdiag', 'Sex', 'AgeAtStart', 'Weight', 'Diastolic',
       'Smoking', 'Cholesterol', 'Chol_Status'],
      dtype='object')
Status AgeCHDdiag Sex AgeAtStart Weight Diastolic Smoking Cholesterol Chol_Status
0 1 63.302968 0 29 140.0 78 0.0 227.417441 1
1 1 63.302968 0 41 194.0 92 0.0 181.000000 0
2 0 63.302968 0 57 132.0 90 10.0 250.000000 2
3 0 63.302968 0 39 158.0 80 0.0 242.000000 2
4 0 63.302968 1 42 156.0 76 20.0 281.000000 2

10.0.4.4 Detect and drop the outliers in numeric variables

# Function to detect outliers using IQR method
def detect_outliers(df, variable):
    Q1 = df[variable].quantile(0.25)
    Q3 = df[variable].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[variable] < lower_bound) | (df[variable] > upper_bound)]
    return outliers

# Define numeric variables to detect outliers
numeric_variables = ['Cholesterol', 'AgeCHDdiag', 'AgeAtStart', 'Weight', 'Smoking']

# Create a dictionary to store outliers for each variable
outliers_dict = {}

# Detect outliers in each numeric variable
for col in numeric_variables:
    outliers_dict[col] = detect_outliers(df, col)

# Print outliers for each variable
for col, outliers in outliers_dict.items():
    print(f"Outliers in {col}:")
    print(outliers.head(10))
Outliers in Cholesterol:
     Status  AgeCHDdiag  Sex  AgeAtStart  Weight  Diastolic  Smoking  \
89        0   63.302968    0          52   135.0         82      0.0   
123       1   73.000000    0          47   124.0         80      0.0   
143       0   63.302968    1          45   160.0         86      0.0   
187       0   76.000000    0          54   146.0         96      0.0   
197       1   61.000000    1          59   164.0        100      0.0   
217       0   63.302968    0          59   124.0         88      0.0   
247       0   58.000000    1          40   197.0         86     15.0   
277       1   66.000000    1          58   166.0         82      5.0   
382       1   73.000000    0          55   130.0         74      0.0   
419       0   65.000000    0          45   162.0         94     25.0   

     Cholesterol  Chol_Status  
89         339.0            2  
123        347.0            2  
143        347.0            2  
187        418.0            2  
197        334.0            2  
217        353.0            2  
247        333.0            2  
277        334.0            2  
382        334.0            2  
419        335.0            2  
Outliers in AgeCHDdiag:
    Status  AgeCHDdiag  Sex  AgeAtStart  Weight  Diastolic  Smoking  \
11       0        57.0    1          33   151.0         68      0.0   
12       0        55.0    1          33   174.0         90      0.0   
13       0        79.0    1          57   165.0         76     15.0   
14       0        66.0    1          44   155.0         90     30.0   
17       1        56.0    1          56   122.0         72     15.0   
19       1        74.0    1          46   157.0         84     30.0   
27       1        71.0    0          49   153.0        110      5.0   
28       0        68.0    1          40   189.0         78      0.0   
31       0        68.0    1          40   195.0         76     20.0   
34       0        43.0    1          33   172.0        106      0.0   

    Cholesterol  Chol_Status  
11   221.000000            1  
12   188.000000            0  
13   227.417441            0  
14   292.000000            2  
17   194.000000            0  
19   233.000000            1  
27   221.000000            1  
28   319.000000            2  
31   205.000000            1  
34   247.000000            2  
Outliers in AgeAtStart:
Empty DataFrame
Columns: [Status, AgeCHDdiag, Sex, AgeAtStart, Weight, Diastolic, Smoking, Cholesterol, Chol_Status]
Index: []
Outliers in Weight:
      Status  AgeCHDdiag  Sex  AgeAtStart  Weight  Diastolic  Smoking  \
154        0   63.000000    0          37   236.0         96      0.0   
436        1   63.302968    0          50   241.0        150      0.0   
491        1   63.302968    1          52   247.0        104     20.0   
671        0   63.302968    1          29   243.0         90     20.0   
765        1   63.302968    0          33    71.0         90      0.0   
836        1   63.302968    1          48   250.0         96      0.0   
1236       1   59.000000    1          39   244.0        100     20.0   
1623       0   63.302968    0          51   239.0         88      1.0   
1647       0   32.000000    1          32   239.0         92      0.0   
1664       1   60.000000    1          34   245.0         90      0.0   

      Cholesterol  Chol_Status  
154    227.417441            0  
436    213.000000            1  
491    188.000000            0  
671    163.000000            0  
765    192.000000            0  
836    180.000000            0  
1236   276.000000            2  
1623   175.000000            0  
1647   171.000000            0  
1664   271.000000            2  
Outliers in Smoking:
      Status  AgeCHDdiag  Sex  AgeAtStart      Weight  Diastolic  Smoking  \
498        0   63.302968    1          34  137.000000         80     60.0   
914        0   66.000000    1          38  174.000000         88     60.0   
1093       0   63.302968    1          32  173.000000         80     60.0   
1263       1   74.000000    1          54  153.086681         90     60.0   
1699       1   57.000000    1          53  150.000000         86     60.0   
2043       0   63.302968    1          32  203.000000         90     60.0   
2903       0   55.000000    1          35  158.000000         80     60.0   
2910       0   63.302968    1          34  164.000000         90     60.0   
3468       1   40.000000    1          36  190.000000        100     60.0   
4110       1   42.000000    1          32  222.000000        105     55.0   

      Cholesterol  Chol_Status  
498    227.417441            2  
914    305.000000            2  
1093   234.000000            1  
1263   307.000000            2  
1699   209.000000            1  
2043   146.000000            0  
2903   177.000000            0  
2910   159.000000            0  
3468   362.000000            2  
4110   220.000000            1  

10.0.4.5 Drop outliers

# Drop outliers from df
for col, outliers in outliers_dict.items():
    try:
        df.drop(outliers.index, inplace=True)
    except KeyError:
        # Handle KeyError if the column doesn't exist in the DataFrame
        pass

# Reset index after dropping outliers
df.reset_index(drop=True, inplace=True)

# Verify the DataFrame after dropping outliers
print(df.head(10))
   Status  AgeCHDdiag  Sex  AgeAtStart  Weight  Diastolic  Smoking  \
0       1   63.302968    0          29   140.0         78      0.0   
1       1   63.302968    0          41   194.0         92      0.0   
2       0   63.302968    0          57   132.0         90     10.0   
3       0   63.302968    0          39   158.0         80      0.0   
4       0   63.302968    1          42   156.0         76     20.0   
5       0   63.302968    0          58   131.0         92      0.0   
6       0   63.302968    0          36   136.0         80     15.0   
7       1   63.302968    1          53   130.0         80      0.0   
8       0   63.302968    1          35   194.0         68      0.0   
9       1   63.302968    1          52   129.0         78      5.0   

   Cholesterol  Chol_Status  
0   227.417441            1  
1   181.000000            0  
2   250.000000            2  
3   242.000000            2  
4   281.000000            2  
5   196.000000            0  
6   196.000000            0  
7   276.000000            2  
8   211.000000            1  
9   284.000000            2  

10.0.4.6 Check NA again

df.isna().sum()
Status         0
AgeCHDdiag     0
Sex            0
AgeAtStart     0
Weight         0
Diastolic      0
Smoking        0
Cholesterol    0
Chol_Status    0
dtype: int64

10.0.5 Linear regression model

10.0.5.1 Building the model

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Encode categorical features
# Identify categorical columns (object or category types)
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()

# Optional: drop target-related columns before encoding
categorical_cols = [col for col in categorical_cols if col not in ['Cholesterol', 'Chol_Status']]

# One-hot encode
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Split data 
X = df_encoded.drop(['Cholesterol', 'Chol_Status'], axis=1)
y = df_encoded['Cholesterol']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate 
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R-squared (R²) score:", r2)
Mean Squared Error: 1480.870677084777
R-squared (R²) score: 0.06813581733633056
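
The mean squared error is easier to interpret as a root mean squared error, which is in the same units as the Cholesterol variable. A quick sketch, assuming mse from the cell above:

# RMSE expresses the typical prediction error in the original cholesterol units
rmse = np.sqrt(mse)
print("Root Mean Squared Error:", rmse)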

10.0.5.2 Select variables contributing most to the model

# Extract coefficients
coefficients = model.coef_

# Match coefficients with feature names
feature_names = X.columns
coefficients_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})

# Sort coefficients by absolute value
coefficients_df['Absolute Coefficient'] = coefficients_df['Coefficient'].abs()
sorted_coefficients_df = coefficients_df.sort_values(by='Absolute Coefficient', ascending=False).reset_index(drop=True)

# Print the top contributing features
print("Top contributing features:")
print(sorted_coefficients_df)
Top contributing features:
      Feature  Coefficient  Absolute Coefficient
0         Sex    -6.222432              6.222432
1      Status     2.006297              2.006297
2  AgeAtStart     1.238053              1.238053
3  AgeCHDdiag    -0.434733              0.434733
4   Diastolic     0.284810              0.284810
5     Smoking     0.198287              0.198287
6      Weight     0.108792              0.108792
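
Raw coefficients depend on each feature's scale, so they are only a rough guide to importance. One way to make the magnitudes more comparable is to standardize the features before fitting; the following is a minimal sketch (not part of the original analysis), assuming X_train and y_train from the split above:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize features so that coefficient magnitudes are on a comparable scale
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)

std_coefs = pd.Series(
    pipe.named_steps['linearregression'].coef_,
    index=X_train.columns
).sort_values(key=abs, ascending=False)
print(std_coefs)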

10.0.5.3 Rebuild the model with the 7 variables that contribute most

# Encode categorical features
# Identify categorical columns (object or category types)
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()

# Optional: drop target-related columns before encoding
categorical_cols = [col for col in categorical_cols if col not in ['Cholesterol', 'Chol_Status']]

# One-hot encode
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Split data 
X = df_encoded.drop(['Cholesterol', 'Chol_Status'], axis=1)
y = df_encoded['Cholesterol']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize linear regression model
model = LinearRegression()

# Fit the model on the training data
model.fit(X_train, y_train)

# Extract coefficients
coefficients = model.coef_

# Match coefficients with feature names
feature_names = X.columns
coefficients_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})

# Sort coefficients by absolute value
coefficients_df['Absolute Coefficient'] = coefficients_df['Coefficient'].abs()
sorted_coefficients_df = coefficients_df.sort_values(by='Absolute Coefficient', ascending=False)

# Select the top 7 features
top_features = sorted_coefficients_df.iloc[:7]['Feature'].tolist()

# Refit the model on the training data using only the top 7 features
X_train_top = X_train[top_features]
model.fit(X_train_top, y_train)

# Evaluate the model on testing data
X_test_top = X_test[top_features]
score = model.score(X_test_top, y_test)
print("R-squared (R2) score using top 7 features:", score)
R-squared (R2) score using top 7 features: 0.06813581733633045

Because the full model only contains seven predictors, selecting the top 7 features keeps every feature, so the R² is unchanged from the earlier fit.

10.0.5.4 Visualization

import seaborn as sns
import matplotlib.pyplot as plt

#Plot actual vs. predicted cholesterol values
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, color='blue')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='--')
plt.xlabel('Actual Cholesterol')
plt.ylabel('Predicted Cholesterol')
plt.title('Actual vs. Predicted Cholesterol')
plt.grid(True)
plt.show()

This is a relatively low R² score, suggesting that while the model captures some relationship between the features and cholesterol levels, there are likely other factors not included in the model that significantly influence cholesterol levels. Further feature engineering or inclusion of additional relevant variables may improve the model’s performance.
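
As one illustration of such follow-up work, cross-validation together with simple degree-2 interaction terms can show whether basic feature engineering moves the R² at all. This is only a sketch under the assumption that X and y are the encoded features and target defined above:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import cross_val_score

# Baseline: plain linear regression, 5-fold cross-validated R²
baseline = make_pipeline(StandardScaler(), LinearRegression())
print("Baseline CV R²:", cross_val_score(baseline, X, y, cv=5, scoring='r2').mean())

# Degree-2 polynomial/interaction features as a cheap feature-engineering experiment
poly = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                     StandardScaler(), LinearRegression())
print("Degree-2 CV R²:", cross_val_score(poly, X, y, cv=5, scoring='r2').mean())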

10.0.6 Running linear regression using statsmodels

import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# Encode categorical features (skip target columns)
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
categorical_cols = [col for col in categorical_cols if col not in ['Cholesterol', 'Chol_Status']]
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Define features and target
X = df_encoded.drop(['Cholesterol', 'Chol_Status'], axis=1)
y = df_encoded['Cholesterol']

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Align test columns to train
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)

# Ensure all data is numeric and float (important for statsmodels)
X_train = X_train.astype(float)
X_test = X_test.astype(float)
y_train = y_train.astype(float)

# Add constant for intercept
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

# Fit the model
lm2 = sm.OLS(y_train, X_train).fit()

# Show summary
print(lm2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            Cholesterol   R-squared:                       0.106
Model:                            OLS   Adj. R-squared:                  0.105
Method:                 Least Squares   F-statistic:                     68.41
Date:                Sat, 28 Jun 2025   Prob (F-statistic):           1.12e-93
Time:                        10:20:16   Log-Likelihood:                -20370.
No. Observations:                4042   AIC:                         4.076e+04
Df Residuals:                    4034   BIC:                         4.081e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        157.0432      8.898     17.649      0.000     139.598     174.489
Status         2.0063      1.403      1.430      0.153      -0.743       4.756
AgeCHDdiag    -0.4347      0.121     -3.596      0.000      -0.672      -0.198
Sex           -6.2224      1.444     -4.310      0.000      -9.053      -3.392
AgeAtStart     1.2381      0.084     14.799      0.000       1.074       1.402
Weight         0.1088      0.026      4.239      0.000       0.058       0.159
Diastolic      0.2848      0.051      5.596      0.000       0.185       0.385
Smoking        0.1983      0.055      3.634      0.000       0.091       0.305
==============================================================================
Omnibus:                       44.661   Durbin-Watson:                   2.043
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               45.040
Skew:                           0.245   Prob(JB):                     1.66e-10
Kurtosis:                       2.835   Cond. No.                     2.91e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.91e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

The statsmodels fit reports an in-sample R-squared of 0.106, somewhat higher than the scikit-learn test-set R² of about 0.07; note that the two are computed on different data (training vs. held-out test), so they are not directly comparable.

Next, we identify influential observations that might affect model performance, using externally studentized residuals to flag outliers.

influence = lm2.get_influence()
resid_student = influence.resid_studentized_external

# Note: resid_student is a plain NumPy array, so the Series below gets a fresh
# 0..n-1 index, while X_train keeps its original row labels; the concat therefore
# produces NaN where the two indexes do not overlap (visible in the output below).
resid = pd.concat([X_train, pd.Series(resid_student, name="Studentized Residuals")], axis=1)
resid.head()
const Status AgeCHDdiag Sex AgeAtStart Weight Diastolic Smoking Studentized Residuals
4942 1.0 1.0 59.000000 0.0 49.0 137.0 90.0 0.000000 NaN
2833 1.0 0.0 66.000000 1.0 36.0 158.0 80.0 30.000000 -0.539324
1041 1.0 0.0 63.302968 0.0 52.0 110.0 70.0 0.000000 -0.691736
2423 1.0 0.0 63.302968 0.0 39.0 107.0 95.0 20.000000 -0.838312
1135 1.0 1.0 63.302968 1.0 32.0 124.0 78.0 9.366518 -0.801665
resid.loc[np.absolute(resid["Studentized Residuals"]) > 3,:]
const Status AgeCHDdiag Sex AgeAtStart Weight Diastolic Smoking Studentized Residuals
1311 1.0 1.0 63.302968 1.0 58.0 164.0 80.0 0.0 -3.309271
196 NaN NaN NaN NaN NaN NaN NaN NaN 3.170889
2473 NaN NaN NaN NaN NaN NaN NaN NaN 3.000359
# Flag rows whose absolute studentized residual exceeds 3
outliers = np.abs(resid["Studentized Residuals"]) > 3
ind = resid[outliers].index

# Because of the index misalignment noted above, some flagged labels may not
# exist in X_train, so only drop the ones that do
valid_ind = [i for i in ind if i in X_train.index]

X_train.drop(index=valid_ind, inplace=True)
y_train.drop(index=valid_ind, inplace=True)

10.0.6.1 Detecting and removing multicollinearity

Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, which can lead to unreliable coefficient estimates. We can detect multicollinearity using the Variance Inflation Factor (VIF).

from statsmodels.stats.outliers_influence import variance_inflation_factor
[variance_inflation_factor(X_train.values, j) for j in range(X_train.shape[1])]
[228.80071850018703,
 1.3394416883894844,
 1.1289095898066266,
 1.491372090282602,
 1.4737894670530807,
 1.421536015030276,
 1.247273518820798,
 1.2456096632455371]
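
The bare list is hard to read because the values are not paired with column names. The following sketch labels each VIF with its feature (the first, very large value belongs to the constant column):

# Label each VIF with its column name so the table is easier to read
vif_table = pd.Series(
    [variance_inflation_factor(X_train.values, j) for j in range(X_train.shape[1])],
    index=X_train.columns,
    name="VIF"
).sort_values(ascending=False)
print(vif_table)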

We create a function that iteratively removes collinear variables. We choose a threshold of 5, meaning that if a variable's VIF exceeds 5, that variable is dropped and the VIFs are recomputed. Note that in this run the constant column added by add_constant has by far the largest VIF and is removed first, so the intercept disappears from the design matrix; this matters for how R-squared is reported later.

def calculate_vif(x):
    thresh = 5.0
    output = x.copy()  # start from the full design matrix so an early break returns it intact
    k = x.shape[1]
    vif = [variance_inflation_factor(x.values, j) for j in range(x.shape[1])]
    for i in range(1, k):
        print("Iteration no.")
        print(i)
        print(vif)
        a = np.argmax(vif)
        print("Max VIF is for variable no.:")
        print(a)
        if vif[a] <= thresh:
            break
        # Drop the column with the largest VIF and recompute the VIFs
        output = output.drop(output.columns[a], axis=1)
        vif = [variance_inflation_factor(output.values, j) for j in range(output.shape[1])]
    return output
train_out = calculate_vif(X_train)
train_out.head(3)
Iteration no.
1
[228.80071850018703, 1.3394416883894844, 1.1289095898066266, 1.491372090282602, 1.4737894670530807, 1.421536015030276, 1.247273518820798, 1.2456096632455371]
Max VIF is for variable no.:
0
Iteration no.
2
[2.1222882889369488, 62.16034485702503, 2.701296037094943, 40.41113912383457, 42.09374206836035, 49.06828510997345, 1.920379036703248]
Max VIF is for variable no.:
1
Iteration no.
3
[1.9248923624254433, 2.6206724232173713, 28.190316957166747, 37.075574349814076, 42.868450858747245, 1.8715880307447514]
Max VIF is for variable no.:
4
Iteration no.
4
[1.9248781994016138, 2.5134621648001754, 20.368687501825814, 22.04762828608679, 1.8584351804464543]
Max VIF is for variable no.:
3
Iteration no.
5
[1.8435708145893481, 2.136544517299741, 2.5600136791453187, 1.8168711072464279]
Max VIF is for variable no.:
2
Status Sex AgeAtStart Smoking
4942 1.0 0.0 49.0 0.0
2833 0.0 1.0 36.0 30.0
1041 0.0 0.0 52.0 0.0

10.0.6.2 Running linear regression again on our new training set (without multicollinearity)

import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# Encode categorical features (skip target columns)
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
categorical_cols = [col for col in categorical_cols if col not in ['Cholesterol', 'Chol_Status']]
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Define features and target
X = df_encoded.drop(['Cholesterol', 'Chol_Status'], axis=1)
y = df_encoded['Cholesterol']

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Align test columns to train
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)

# Ensure all data is numeric and float (important for statsmodels)
X_train = X_train.astype(float)
X_test = X_test.astype(float)
y_train = y_train.astype(float)

# Add constant for intercept
train_out = sm.add_constant(train_out)
X_test = sm.add_constant(X_test)

# Fit the model
# Note: this call refits on the freshly split X_train (which has no constant),
# not on the VIF-filtered train_out, so the summary below has no intercept and
# still contains all seven predictors.
lm2 = sm.OLS(y_train, X_train).fit()

# Show summary
print(lm2.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:            Cholesterol   R-squared (uncentered):                   0.971
Model:                            OLS   Adj. R-squared (uncentered):              0.971
Method:                 Least Squares   F-statistic:                          1.935e+04
Date:                Sat, 28 Jun 2025   Prob (F-statistic):                        0.00
Time:                        10:20:20   Log-Likelihood:                         -20520.
No. Observations:                4042   AIC:                                  4.105e+04
Df Residuals:                    4035   BIC:                                  4.110e+04
Df Model:                           7                                                  
Covariance Type:            nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Status        -1.2945      1.442     -0.897      0.370      -4.123       1.534
AgeCHDdiag     1.2664      0.076     16.726      0.000       1.118       1.415
Sex           -8.0096      1.494     -5.360      0.000     -10.940      -5.080
AgeAtStart     1.3199      0.087     15.227      0.000       1.150       1.490
Weight         0.2334      0.026      9.114      0.000       0.183       0.284
Diastolic      0.5905      0.050     11.891      0.000       0.493       0.688
Smoking        0.4139      0.055      7.500      0.000       0.306       0.522
==============================================================================
Omnibus:                       37.541   Durbin-Watson:                   2.037
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               38.476
Skew:                           0.238   Prob(JB):                     4.42e-09
Kurtosis:                       2.949   Cond. No.                         486.
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The summary now reports an R-squared of 0.971, but this is the uncentered R-squared of a model fit without an intercept (the constant column was the first variable the VIF procedure removed), so it is not comparable to the centered R-squared of roughly 0.11 from the earlier fit and does not by itself mean the model improved. The condition-number warning from the earlier fit no longer appears, however.
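
For a centered R-squared that is comparable to the earlier fits, one option is to refit with an explicit intercept on the VIF-filtered columns. This is only a sketch, assuming train_out still holds the VIF-filtered training features and X_test and y_test the held-out split from above:

from sklearn.metrics import r2_score

# Refit on the VIF-filtered features, keeping an explicit intercept so the
# reported R-squared is centered and comparable to the earlier models
X_train_vif = sm.add_constant(train_out)        # no-op if a 'const' column is already present
y_train_vif = y_train.loc[X_train_vif.index]    # align the target to the filtered rows

lm3 = sm.OLS(y_train_vif, X_train_vif).fit()
print(lm3.summary())

# Out-of-sample check on the same columns of the test set
X_test_vif = sm.add_constant(X_test)[X_train_vif.columns]
print("Test R²:", r2_score(y_test, lm3.predict(X_test_vif)))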

10.0.7 Conclusion

In this chapter, we explored the application of linear regression to a healthcare dataset, focusing on predicting cholesterol levels. We demonstrated the importance of data preparation, including handling missing values, encoding categorical variables, and detecting outliers. We also examined multicollinearity by calculating the Variance Inflation Factor (VIF) and removing collinear variables.

We built linear regression models with both scikit-learn and statsmodels, the latter providing detailed insight into the significance of each feature. The predictive performance remained modest (a test-set R² of about 0.07 and an in-sample R² of about 0.11), and the seemingly high R-squared of the final fit is an uncentered statistic from a model without an intercept rather than evidence of a much better model. Further feature engineering or additional clinical variables would be needed to explain cholesterol levels more fully.