10  Linear Regression - Machine Learning Models

Author

Thieu Nguyen

10.0.1 Introduction

Machine learning is a powerful tool for identifying patterns and making predictions from data, especially in healthcare. Among its many techniques, linear regression is one of the most fundamental, particularly for predicting continuous outcomes.

In this chapter, we’ll apply linear regression to a real-world healthcare dataset, heart.csv, which includes clinical and demographic data related to cholesterol levels. We’ll cover data preparation, model building, performance evaluation, and result interpretation in a medical context.

This hands-on approach reinforces core machine learning concepts and shows how even simple models can yield valuable insights. We’ll also address challenges like missing data, multicollinearity, and outliers that can affect model accuracy.

10.0.2 Load necessary Python packages and data

# Python packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv("/Users/nnthieu/Downloads/heart.csv")
print(df.shape)
print(df.columns)
df.head()
(5209, 17)
Index(['Status', 'DeathCause', 'AgeCHDdiag', 'Sex', 'AgeAtStart', 'Height',
       'Weight', 'Diastolic', 'Systolic', 'MRW', 'Smoking', 'AgeAtDeath',
       'Cholesterol', 'Chol_Status', 'BP_Status', 'Weight_Status',
       'Smoking_Status'],
      dtype='object')
Status DeathCause AgeCHDdiag Sex AgeAtStart Height Weight Diastolic Systolic MRW Smoking AgeAtDeath Cholesterol Chol_Status BP_Status Weight_Status Smoking_Status
0 Dead Other NaN Female 29 62.50 140.0 78 124 121.0 0.0 55.0 NaN NaN Normal Overweight Non-smoker
1 Dead Cancer NaN Female 41 59.75 194.0 92 144 183.0 0.0 57.0 181.0 Desirable High Overweight Non-smoker
2 Alive NaN NaN Female 57 62.25 132.0 90 170 114.0 10.0 NaN 250.0 High High Overweight Moderate (6-15)
3 Alive NaN NaN Female 39 65.75 158.0 80 128 123.0 0.0 NaN 242.0 High Normal Overweight Non-smoker
4 Alive NaN NaN Male 42 66.00 156.0 76 110 116.0 20.0 NaN 281.0 High Optimal Overweight Heavy (16-25)

The heart.csv file has 5,209 rows and 17 columns, where each row represents a patient and each column a feature or attribute related to heart disease.
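
Before evaluating the data in detail, it can help to confirm the column types and non-null counts. A minimal check (output omitted here), assuming df is loaded as above:

# Quick structural overview: dtypes, non-null counts, and memory usage
df.info()

# Summary statistics for the numeric columns
print(df.describe().T)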

10.0.3 Evaluate the dataset

10.0.3.1 Check for missing data

The dataset contains some missing values, which we will need to handle before applying machine learning algorithms. We can check for missing values using the following code:

df.isna().sum()
Status               0
DeathCause        3218
AgeCHDdiag        3760
Sex                  0
AgeAtStart           0
Height               6
Weight               6
Diastolic            0
Systolic             0
MRW                  6
Smoking             36
AgeAtDeath        3218
Cholesterol        152
Chol_Status        152
BP_Status            0
Weight_Status        6
Smoking_Status      36
dtype: int64

10.0.3.2 Look at the distribution of the numeric variables

Check the histogram to see if there are any outliers in the numeric variables.

def plot_histograms(df, bins=10, alpha=0.5, colors=None):
    """
    Plot histograms for all numeric variables in the DataFrame.
    Parameters:
        df (DataFrame): The DataFrame containing numeric variables.
        bins (int): Number of bins for the histograms. Default is 10.
        alpha (float): Transparency level of the histograms. Default is 0.5.
        colors (list): List of colors for the histograms. If None, default colors will be used.
    Returns:
        None
    """
    if colors is None:
        colors = plt.cm.tab10.colors  # Default color palette

    num_variables = df.select_dtypes(include='number').shape[1]
    num_rows = (num_variables + 1) // 2
    num_cols = 2
    
    plt.figure(figsize=(12, 6 * num_rows))

    for i, col in enumerate(df.select_dtypes(include='number'), start=1):
        plt.subplot(num_rows, num_cols, i)
        # Drop NaNs so the histogram range stays finite even for columns with missing values
        plt.hist(df[col].dropna(), bins=bins, alpha=alpha, color=colors[i % len(colors)])
        plt.title(f'Histogram of {col}')
        plt.xlabel('Value')
        plt.ylabel('Frequency')
        plt.grid(True)

    plt.tight_layout()
    plt.show()

# Example usage:
# Assuming df is your DataFrame containing numeric variables
plot_histograms(df)

10.0.3.3 Check correlation between variables using heatmap

To visualize the correlation between variables in the dataset, we will create a heatmap. This will help us identify relationships between features and understand how they might influence cholesterol levels.

# Drop 'Cholesterol'
df_corr = df.drop(columns=['Cholesterol'])

# Encode categorical variables (required for correlation matrix)
categorical_cols = df_corr.select_dtypes(include=['object', 'category']).columns.tolist()
df_corr_encoded = pd.get_dummies(df_corr, columns=categorical_cols, drop_first=True)

# Compute correlation matrix
corr_matrix = df_corr_encoded.corr()

# Plot heatmap
plt.figure(figsize=(14, 12))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))  # mask upper triangle

sns.heatmap(
    corr_matrix,
    mask=mask,
    cmap='coolwarm',
    vmin=-1, vmax=1,
    center=0,
    square=True,
    linewidths=0.5,
    annot=True,
    fmt='.2f',
    cbar_kws={"shrink": 0.8}
)

plt.title("Feature Correlation Heatmap (Excluding 'Cholesterol')", fontsize=16)
plt.tight_layout()
plt.show()

Highly correlated features can lead to multicollinearity, which can affect the performance of linear regression models. To identify pairs of features with high correlation, we will extract the upper triangle of the correlation matrix and filter for pairs with an absolute correlation greater than 0.65.

# Step 1: Drop 'Cholesterol' and one-hot encode categorical features
df_corr = df.drop(columns=['Cholesterol'])
categorical_cols = df_corr.select_dtypes(include=['object', 'category']).columns
df_corr_encoded = pd.get_dummies(df_corr, columns=categorical_cols, drop_first=True)

# Step 2: Compute correlation matrix
corr_matrix = df_corr_encoded.corr().abs()

# Step 3: Extract upper triangle (to avoid duplicate pairs)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Step 4: Find feature pairs with correlation > 0.65
high_corr_pairs = (
    upper.stack()
    .reset_index()
    .rename(columns={0: 'correlation', 'level_0': 'feature_1', 'level_1': 'feature_2'})
    .query('correlation > 0.65')
    .sort_values(by='correlation', ascending=False)
)

print("Highly correlated feature pairs (|correlation| > 0.65):")
print(high_corr_pairs)
Highly correlated feature pairs (|correlation| > 0.65):
      feature_1                         feature_2  correlation
90    Diastolic                          Systolic     0.796061
71       Weight                               MRW     0.767171
162     Smoking         Smoking_Status_Non-smoker     0.753253
7    AgeCHDdiag                        AgeAtDeath     0.748112
163     Smoking  Smoking_Status_Very Heavy (> 25)     0.708725
58       Height                          Sex_Male     0.693441
30   AgeAtStart                        AgeAtDeath     0.688605
141         MRW          Weight_Status_Overweight     0.683167

10.0.3.4 ‘Cholesterol’ and ‘DeathCause’

# Calculate mean cholesterol for each DeathCause
mean_cholesterol = df.groupby('DeathCause')['Cholesterol'].mean()

# Create bar plot
plt.figure(figsize=(10, 6))
mean_cholesterol.plot(kind='bar', color='skyblue')
plt.title('Mean Cholesterol by Death Cause')
plt.xlabel('Death Cause')
plt.ylabel('Mean Cholesterol')
plt.xticks(rotation=45)  # Rotate x-axis labels for better visibility

# Add mean values to the bars
for index, value in enumerate(mean_cholesterol):
    plt.text(index, value, str(round(value, 2)), ha='center', va='bottom')

plt.tight_layout()
plt.show()

‘DeathCause’ is a consequence, not an exposure, so it should be removed from the regression models.

Highly correlated features will be removed from the dataset to avoid multicollinearity issues. Therefore, we will drop variables such as ‘MRW’, ‘Weight_Status_Overweight’, and ‘Height’.

10.0.4 Prepare data

10.0.4.1 Remove unnecessary variables

df = df.drop(columns=[
    'DeathCause', 'MRW', 'Smoking_Status', 
    'Weight_Status', 'Height', 'Systolic', 'AgeAtDeath', 'BP_Status'
], errors='ignore')

df.head(3)
Status AgeCHDdiag Sex AgeAtStart Weight Diastolic Smoking Cholesterol Chol_Status
0 Dead NaN Female 29 140.0 78 0.0 NaN NaN
1 Dead NaN Female 41 194.0 92 0.0 181.0 Desirable
2 Alive NaN Female 57 132.0 90 10.0 250.0 High

Data have to be prepared before applying machine learning algorithms. This includes handling missing values, recoding categorical variables, and detecting outliers.

10.0.4.2 Replace NA values

# Clean values (strip spaces, normalize case)
df['Chol_Status'] = df['Chol_Status'].astype(str).str.strip().str.title()
df['Chol_Status'] = df['Chol_Status'].replace('Nan', np.nan)  # convert literal "nan" string to real NaN

# Impute missing values using distribution
proportions = df['Chol_Status'].value_counts(normalize=True, dropna=True)
missing_count = df['Chol_Status'].isna().sum()

if missing_count > 0 and not proportions.empty:
    missing_samples = np.random.choice(
        proportions.index,
        size=missing_count,
        p=proportions.values
    )
    df.loc[df['Chol_Status'].isna(), 'Chol_Status'] = missing_samples
# Calculate means for numeric columns
means = df.mean(numeric_only=True)

# Impute missing values in numeric columns using their respective means
for col in df.select_dtypes(include='number'):
    df[col] = df[col].fillna(means[col])

df.describe()
AgeCHDdiag AgeAtStart Weight Diastolic Smoking Cholesterol
count 5209.000000 5209.000000 5209.000000 5209.000000 5209.000000 5209.000000
mean 63.302968 44.068727 153.086681 85.358610 9.366518 227.417441
std 5.282496 8.574954 28.898765 12.973091 11.989796 44.274927
min 32.000000 28.000000 67.000000 50.000000 0.000000 96.000000
25% 63.302968 37.000000 132.000000 76.000000 0.000000 197.000000
50% 63.302968 43.000000 150.000000 84.000000 1.000000 225.000000
75% 63.302968 51.000000 172.000000 92.000000 20.000000 251.000000
max 90.000000 62.000000 300.000000 160.000000 60.000000 568.000000

Check NA again

df.isna().sum()
Status         0
AgeCHDdiag     0
Sex            0
AgeAtStart     0
Weight         0
Diastolic      0
Smoking        0
Cholesterol    0
Chol_Status    0
dtype: int64

10.0.4.3 Recoding data

To prepare the dataset for machine learning, we need to recode categorical variables into numerical values. This is essential because most machine learning algorithms require numerical input.

# Encode 'Status': Dead -> 1, Alive -> 0
status_mapping = {'Dead': 1, 'Alive': 0}
df['Status'] = df['Status'].map(status_mapping)

# Encode 'Sex': Male -> 1, Female -> 0
status_mappingSex = {'Male': 1, 'Female': 0}
df['Sex'] = df['Sex'].map(status_mappingSex)

# Encode 'Chol_Status': Desirable -> 0, Borderline -> 1, High -> 2
status_mappingC = {'Borderline': 1, 'Desirable': 0, 'High': 2}
df['Chol_Status'] = df['Chol_Status'].apply(
    lambda x: status_mappingC.get(x, np.nan)
)

Check NA again

df.isna().sum()
Status         0
AgeCHDdiag     0
Sex            0
AgeAtStart     0
Weight         0
Diastolic      0
Smoking        0
Cholesterol    0
Chol_Status    0
dtype: int64
print(df.columns)
df.head()
Index(['Status', 'AgeCHDdiag', 'Sex', 'AgeAtStart', 'Weight', 'Diastolic',
       'Smoking', 'Cholesterol', 'Chol_Status'],
      dtype='object')
Status AgeCHDdiag Sex AgeAtStart Weight Diastolic Smoking Cholesterol Chol_Status
0 1 63.302968 0 29 140.0 78 0.0 227.417441 1
1 1 63.302968 0 41 194.0 92 0.0 181.000000 0
2 0 63.302968 0 57 132.0 90 10.0 250.000000 2
3 0 63.302968 0 39 158.0 80 0.0 242.000000 2
4 0 63.302968 1 42 156.0 76 20.0 281.000000 2

10.0.4.4 Detect and drop the outliers in numeric variables

# Function to detect outliers using IQR method
def detect_outliers(df, variable):
    Q1 = df[variable].quantile(0.25)
    Q3 = df[variable].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[variable] < lower_bound) | (df[variable] > upper_bound)]
    return outliers

# Define numeric variables to detect outliers
numeric_variables = ['Cholesterol', 'AgeCHDdiag', 'AgeAtStart', 'Weight', 'Smoking']

# Create a dictionary to store outliers for each variable
outliers_dict = {}

# Detect outliers in each numeric variable
for col in numeric_variables:
    outliers_dict[col] = detect_outliers(df, col)

# Print outliers for each variable
for col, outliers in outliers_dict.items():
    print(f"Outliers in {col}:")
    print(outliers.head(10))
Outliers in Cholesterol:
     Status  AgeCHDdiag  Sex  AgeAtStart  Weight  Diastolic  Smoking  \
89        0   63.302968    0          52   135.0         82      0.0   
123       1   73.000000    0          47   124.0         80      0.0   
143       0   63.302968    1          45   160.0         86      0.0   
187       0   76.000000    0          54   146.0         96      0.0   
197       1   61.000000    1          59   164.0        100      0.0   
217       0   63.302968    0          59   124.0         88      0.0   
247       0   58.000000    1          40   197.0         86     15.0   
277       1   66.000000    1          58   166.0         82      5.0   
382       1   73.000000    0          55   130.0         74      0.0   
419       0   65.000000    0          45   162.0         94     25.0   

     Cholesterol  Chol_Status  
89         339.0            2  
123        347.0            2  
143        347.0            2  
187        418.0            2  
197        334.0            2  
217        353.0            2  
247        333.0            2  
277        334.0            2  
382        334.0            2  
419        335.0            2  
Outliers in AgeCHDdiag:
    Status  AgeCHDdiag  Sex  AgeAtStart  Weight  Diastolic  Smoking  \
11       0        57.0    1          33   151.0         68      0.0   
12       0        55.0    1          33   174.0         90      0.0   
13       0        79.0    1          57   165.0         76     15.0   
14       0        66.0    1          44   155.0         90     30.0   
17       1        56.0    1          56   122.0         72     15.0   
19       1        74.0    1          46   157.0         84     30.0   
27       1        71.0    0          49   153.0        110      5.0   
28       0        68.0    1          40   189.0         78      0.0   
31       0        68.0    1          40   195.0         76     20.0   
34       0        43.0    1          33   172.0        106      0.0   

    Cholesterol  Chol_Status  
11   221.000000            1  
12   188.000000            0  
13   227.417441            0  
14   292.000000            2  
17   194.000000            0  
19   233.000000            1  
27   221.000000            1  
28   319.000000            2  
31   205.000000            1  
34   247.000000            2  
Outliers in AgeAtStart:
Empty DataFrame
Columns: [Status, AgeCHDdiag, Sex, AgeAtStart, Weight, Diastolic, Smoking, Cholesterol, Chol_Status]
Index: []
Outliers in Weight:
      Status  AgeCHDdiag  Sex  AgeAtStart  Weight  Diastolic  Smoking  \
154        0   63.000000    0          37   236.0         96      0.0   
436        1   63.302968    0          50   241.0        150      0.0   
491        1   63.302968    1          52   247.0        104     20.0   
671        0   63.302968    1          29   243.0         90     20.0   
765        1   63.302968    0          33    71.0         90      0.0   
836        1   63.302968    1          48   250.0         96      0.0   
1236       1   59.000000    1          39   244.0        100     20.0   
1623       0   63.302968    0          51   239.0         88      1.0   
1647       0   32.000000    1          32   239.0         92      0.0   
1664       1   60.000000    1          34   245.0         90      0.0   

      Cholesterol  Chol_Status  
154    227.417441            0  
436    213.000000            1  
491    188.000000            0  
671    163.000000            0  
765    192.000000            0  
836    180.000000            0  
1236   276.000000            2  
1623   175.000000            0  
1647   171.000000            0  
1664   271.000000            2  
Outliers in Smoking:
      Status  AgeCHDdiag  Sex  AgeAtStart      Weight  Diastolic  Smoking  \
498        0   63.302968    1          34  137.000000         80     60.0   
914        0   66.000000    1          38  174.000000         88     60.0   
1093       0   63.302968    1          32  173.000000         80     60.0   
1263       1   74.000000    1          54  153.086681         90     60.0   
1699       1   57.000000    1          53  150.000000         86     60.0   
2043       0   63.302968    1          32  203.000000         90     60.0   
2903       0   55.000000    1          35  158.000000         80     60.0   
2910       0   63.302968    1          34  164.000000         90     60.0   
3468       1   40.000000    1          36  190.000000        100     60.0   
4110       1   42.000000    1          32  222.000000        105     55.0   

      Cholesterol  Chol_Status  
498    227.417441            2  
914    305.000000            2  
1093   234.000000            1  
1263   307.000000            2  
1699   209.000000            1  
2043   146.000000            0  
2903   177.000000            0  
2910   159.000000            0  
3468   362.000000            2  
4110   220.000000            1  

10.0.4.5 Drop outliers

# Drop outliers from df
for col, outliers in outliers_dict.items():
    try:
        df.drop(outliers.index, inplace=True)
    except KeyError:
        # Handle KeyError if the column doesn't exist in the DataFrame
        pass

# Reset index after dropping outliers
df.reset_index(drop=True, inplace=True)

# Verify the DataFrame after dropping outliers
print(df.head(10))
   Status  AgeCHDdiag  Sex  AgeAtStart  Weight  Diastolic  Smoking  \
0       1   63.302968    0          29   140.0         78      0.0   
1       1   63.302968    0          41   194.0         92      0.0   
2       0   63.302968    0          57   132.0         90     10.0   
3       0   63.302968    0          39   158.0         80      0.0   
4       0   63.302968    1          42   156.0         76     20.0   
5       0   63.302968    0          58   131.0         92      0.0   
6       0   63.302968    0          36   136.0         80     15.0   
7       1   63.302968    1          53   130.0         80      0.0   
8       0   63.302968    1          35   194.0         68      0.0   
9       1   63.302968    1          52   129.0         78      5.0   

   Cholesterol  Chol_Status  
0   227.417441            1  
1   181.000000            0  
2   250.000000            2  
3   242.000000            2  
4   281.000000            2  
5   196.000000            0  
6   196.000000            0  
7   276.000000            2  
8   211.000000            1  
9   284.000000            2  

10.0.4.6 Check NA again

df.isna().sum()
Status         0
AgeCHDdiag     0
Sex            0
AgeAtStart     0
Weight         0
Diastolic      0
Smoking        0
Cholesterol    0
Chol_Status    0
dtype: int64

10.0.5 Linear regression model

10.0.5.1 Building the model

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Encode categorical features
# Identify categorical columns (object or category types)
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()

# Optional: drop target-related columns before encoding
categorical_cols = [col for col in categorical_cols if col not in ['Cholesterol', 'Chol_Status']]

# One-hot encode
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Split data 
X = df_encoded.drop(['Cholesterol', 'Chol_Status'], axis=1)
y = df_encoded['Cholesterol']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate 
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R-squared (R²) score:", r2)
Mean Squared Error: 1480.870677084777
R-squared (R²) score: 0.06813581733633056
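
The mean squared error is easier to interpret as a root mean squared error, which is in the same units as the Cholesterol variable. A quick sketch, assuming mse from the cell above:

# RMSE expresses the typical prediction error in the original cholesterol units
rmse = np.sqrt(mse)
print("Root Mean Squared Error:", rmse)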

10.0.5.2 Select variables contributing most to the model

# Extract coefficients
coefficients = model.coef_

# Match coefficients with feature names
feature_names = X.columns
coefficients_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})

# Sort coefficients by absolute value
coefficients_df['Absolute Coefficient'] = coefficients_df['Coefficient'].abs()
sorted_coefficients_df = coefficients_df.sort_values(by='Absolute Coefficient', ascending=False).reset_index(drop=True)

# Print the top contributing features
print("Top contributing features:")
print(sorted_coefficients_df)
Top contributing features:
      Feature  Coefficient  Absolute Coefficient
0         Sex    -6.222432              6.222432
1      Status     2.006297              2.006297
2  AgeAtStart     1.238053              1.238053
3  AgeCHDdiag    -0.434733              0.434733
4   Diastolic     0.284810              0.284810
5     Smoking     0.198287              0.198287
6      Weight     0.108792              0.108792
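
Raw coefficients depend on each feature's scale, so they are only a rough guide to importance. One way to make the magnitudes more comparable is to standardize the features before fitting; the following is a minimal sketch (not part of the original analysis), assuming X_train and y_train from the split above:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize features so that coefficient magnitudes are on a comparable scale
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)

std_coefs = pd.Series(
    pipe.named_steps['linearregression'].coef_,
    index=X_train.columns
).sort_values(key=abs, ascending=False)
print(std_coefs)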

10.0.5.3 Rebuild the model with the 7 variables that contribute most

# Encode categorical features
# Identify categorical columns (object or category types)
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()

# Optional: drop target-related columns before encoding
categorical_cols = [col for col in categorical_cols if col not in ['Cholesterol', 'Chol_Status']]

# One-hot encode
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Split data 
X = df_encoded.drop(['Cholesterol', 'Chol_Status'], axis=1)
y = df_encoded['Cholesterol']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize linear regression model
model = LinearRegression()

# Fit the model on the training data
model.fit(X_train, y_train)

# Extract coefficients
coefficients = model.coef_

# Match coefficients with feature names
feature_names = X.columns
coefficients_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})

# Sort coefficients by absolute value
coefficients_df['Absolute Coefficient'] = coefficients_df['Coefficient'].abs()
sorted_coefficients_df = coefficients_df.sort_values(by='Absolute Coefficient', ascending=False)

# Select the top 7 features
top_features = sorted_coefficients_df.iloc[:7]['Feature'].tolist()

# Refit the model on the training data using only the top 7 features
X_train_top = X_train[top_features]
model.fit(X_train_top, y_train)

# Evaluate the model on testing data
X_test_top = X_test[top_features]
score = model.score(X_test_top, y_test)
print("R-squared (R2) score using top 7 features:", score)
R-squared (R2) score using top 7 features: 0.06813581733633045

Because the full model only contains seven predictors, selecting the top 7 features keeps every feature, so the R² is unchanged from the earlier fit.

10.0.5.4 Visualization

import seaborn as sns
import matplotlib.pyplot as plt

#Plot actual vs. predicted cholesterol values
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, color='blue')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='--')
plt.xlabel('Actual Cholesterol')
plt.ylabel('Predicted Cholesterol')
plt.title('Actual vs. Predicted Cholesterol')
plt.grid(True)
plt.show()

This is a relatively low R² score, suggesting that while the model captures some relationship between the features and cholesterol levels, there are likely other factors not included in the model that significantly influence cholesterol levels. Further feature engineering or inclusion of additional relevant variables may improve the model’s performance.
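
As one illustration of such follow-up work, cross-validation together with simple degree-2 interaction terms can show whether basic feature engineering moves the R² at all. This is only a sketch under the assumption that X and y are the encoded features and target defined above:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import cross_val_score

# Baseline: plain linear regression, 5-fold cross-validated R²
baseline = make_pipeline(StandardScaler(), LinearRegression())
print("Baseline CV R²:", cross_val_score(baseline, X, y, cv=5, scoring='r2').mean())

# Degree-2 polynomial/interaction features as a cheap feature-engineering experiment
poly = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                     StandardScaler(), LinearRegression())
print("Degree-2 CV R²:", cross_val_score(poly, X, y, cv=5, scoring='r2').mean())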

10.0.6 Running linear regression using statsmodels

import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# Encode categorical features (skip target columns)
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
categorical_cols = [col for col in categorical_cols if col not in ['Cholesterol', 'Chol_Status']]
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Define features and target
X = df_encoded.drop(['Cholesterol', 'Chol_Status'], axis=1)
y = df_encoded['Cholesterol']

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Align test columns to train
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)

# Ensure all data is numeric and float (important for statsmodels)
X_train = X_train.astype(float)
X_test = X_test.astype(float)
y_train = y_train.astype(float)

# Add constant for intercept
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

# Fit the model
lm2 = sm.OLS(y_train, X_train).fit()

# Show summary
print(lm2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            Cholesterol   R-squared:                       0.106
Model:                            OLS   Adj. R-squared:                  0.105
Method:                 Least Squares   F-statistic:                     68.41
Date:                Sat, 28 Jun 2025   Prob (F-statistic):           1.12e-93
Time:                        10:20:16   Log-Likelihood:                -20370.
No. Observations:                4042   AIC:                         4.076e+04
Df Residuals:                    4034   BIC:                         4.081e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        157.0432      8.898     17.649      0.000     139.598     174.489
Status         2.0063      1.403      1.430      0.153      -0.743       4.756
AgeCHDdiag    -0.4347      0.121     -3.596      0.000      -0.672      -0.198
Sex           -6.2224      1.444     -4.310      0.000      -9.053      -3.392
AgeAtStart     1.2381      0.084     14.799      0.000       1.074       1.402
Weight         0.1088      0.026      4.239      0.000       0.058       0.159
Diastolic      0.2848      0.051      5.596      0.000       0.185       0.385
Smoking        0.1983      0.055      3.634      0.000       0.091       0.305
==============================================================================
Omnibus:                       44.661   Durbin-Watson:                   2.043
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               45.040
Skew:                           0.245   Prob(JB):                     1.66e-10
Kurtosis:                       2.835   Cond. No.                     2.91e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.91e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

The statsmodels fit reports an in-sample R-squared of 0.106, somewhat higher than the scikit-learn test-set R² of about 0.07; note that the two are computed on different data (training vs. held-out test), so they are not directly comparable.

Next, we identify influential observations that might affect model performance, using externally studentized residuals to flag outliers.

influence = lm2.get_influence()
resid_student = influence.resid_studentized_external

# Note: resid_student is a plain NumPy array, so the Series below gets a fresh
# 0..n-1 index, while X_train keeps its original row labels; the concat therefore
# produces NaN where the two indexes do not overlap (visible in the output below).
resid = pd.concat([X_train, pd.Series(resid_student, name="Studentized Residuals")], axis=1)
resid.head()
const Status AgeCHDdiag Sex AgeAtStart Weight Diastolic Smoking Studentized Residuals
4942 1.0 1.0 59.000000 0.0 49.0 137.0 90.0 0.000000 NaN
2833 1.0 0.0 66.000000 1.0 36.0 158.0 80.0 30.000000 -0.539324
1041 1.0 0.0 63.302968 0.0 52.0 110.0 70.0 0.000000 -0.691736
2423 1.0 0.0 63.302968 0.0 39.0 107.0 95.0 20.000000 -0.838312
1135 1.0 1.0 63.302968 1.0 32.0 124.0 78.0 9.366518 -0.801665
resid.loc[np.absolute(resid["Studentized Residuals"]) > 3,:]
const Status AgeCHDdiag Sex AgeAtStart Weight Diastolic Smoking Studentized Residuals
1311 1.0 1.0 63.302968 1.0 58.0 164.0 80.0 0.0 -3.309271
196 NaN NaN NaN NaN NaN NaN NaN NaN 3.170889
2473 NaN NaN NaN NaN NaN NaN NaN NaN 3.000359
# Flag rows whose absolute studentized residual exceeds 3
outliers = np.abs(resid["Studentized Residuals"]) > 3
ind = resid[outliers].index

# Because of the index misalignment noted above, some flagged labels may not
# exist in X_train, so only drop the ones that do
valid_ind = [i for i in ind if i in X_train.index]

X_train.drop(index=valid_ind, inplace=True)
y_train.drop(index=valid_ind, inplace=True)

10.0.6.1 Detecting and removing multicollinearity

Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, which can lead to unreliable coefficient estimates. We can detect multicollinearity using the Variance Inflation Factor (VIF).

from statsmodels.stats.outliers_influence import variance_inflation_factor
[variance_inflation_factor(X_train.values, j) for j in range(X_train.shape[1])]
[228.80071850018703,
 1.3394416883894844,
 1.1289095898066266,
 1.491372090282602,
 1.4737894670530807,
 1.421536015030276,
 1.247273518820798,
 1.2456096632455371]
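
The bare list is hard to read because the values are not paired with column names. The following sketch labels each VIF with its feature (the first, very large value belongs to the constant column):

# Label each VIF with its column name so the table is easier to read
vif_table = pd.Series(
    [variance_inflation_factor(X_train.values, j) for j in range(X_train.shape[1])],
    index=X_train.columns,
    name="VIF"
).sort_values(ascending=False)
print(vif_table)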

We create a function that iteratively removes collinear variables. We choose a threshold of 5, meaning that if a variable's VIF exceeds 5, that variable is dropped and the VIFs are recomputed. Note that in this run the constant column added by add_constant has by far the largest VIF and is removed first, so the intercept disappears from the design matrix; this matters for how R-squared is reported later.

def calculate_vif(x):
    thresh = 5.0
    output = x.copy()  # start from the full design matrix so an early break returns it intact
    k = x.shape[1]
    vif = [variance_inflation_factor(x.values, j) for j in range(x.shape[1])]
    for i in range(1, k):
        print("Iteration no.")
        print(i)
        print(vif)
        a = np.argmax(vif)
        print("Max VIF is for variable no.:")
        print(a)
        if vif[a] <= thresh:
            break
        # Drop the column with the largest VIF and recompute the VIFs
        output = output.drop(output.columns[a], axis=1)
        vif = [variance_inflation_factor(output.values, j) for j in range(output.shape[1])]
    return output
train_out = calculate_vif(X_train)
train_out.head(3)
Iteration no.
1
[228.80071850018703, 1.3394416883894844, 1.1289095898066266, 1.491372090282602, 1.4737894670530807, 1.421536015030276, 1.247273518820798, 1.2456096632455371]
Max VIF is for variable no.:
0
Iteration no.
2
[2.1222882889369488, 62.16034485702503, 2.701296037094943, 40.41113912383457, 42.09374206836035, 49.06828510997345, 1.920379036703248]
Max VIF is for variable no.:
1
Iteration no.
3
[1.9248923624254433, 2.6206724232173713, 28.190316957166747, 37.075574349814076, 42.868450858747245, 1.8715880307447514]
Max VIF is for variable no.:
4
Iteration no.
4
[1.9248781994016138, 2.5134621648001754, 20.368687501825814, 22.04762828608679, 1.8584351804464543]
Max VIF is for variable no.:
3
Iteration no.
5
[1.8435708145893481, 2.136544517299741, 2.5600136791453187, 1.8168711072464279]
Max VIF is for variable no.:
2
Status Sex AgeAtStart Smoking
4942 1.0 0.0 49.0 0.0
2833 0.0 1.0 36.0 30.0
1041 0.0 0.0 52.0 0.0

10.0.6.2 Running linear regression again on our new training set (without multicollinearity)

import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# Encode categorical features (skip target columns)
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
categorical_cols = [col for col in categorical_cols if col not in ['Cholesterol', 'Chol_Status']]
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Define features and target
X = df_encoded.drop(['Cholesterol', 'Chol_Status'], axis=1)
y = df_encoded['Cholesterol']

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Align test columns to train
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)

# Ensure all data is numeric and float (important for statsmodels)
X_train = X_train.astype(float)
X_test = X_test.astype(float)
y_train = y_train.astype(float)

# Add constant for intercept
train_out = sm.add_constant(train_out)
X_test = sm.add_constant(X_test)

# Fit the model
# Note: this call refits on the freshly split X_train (which has no constant),
# not on the VIF-filtered train_out, so the summary below has no intercept and
# still contains all seven predictors.
lm2 = sm.OLS(y_train, X_train).fit()

# Show summary
print(lm2.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:            Cholesterol   R-squared (uncentered):                   0.971
Model:                            OLS   Adj. R-squared (uncentered):              0.971
Method:                 Least Squares   F-statistic:                          1.935e+04
Date:                Sat, 28 Jun 2025   Prob (F-statistic):                        0.00
Time:                        10:20:20   Log-Likelihood:                         -20520.
No. Observations:                4042   AIC:                                  4.105e+04
Df Residuals:                    4035   BIC:                                  4.110e+04
Df Model:                           7                                                  
Covariance Type:            nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Status        -1.2945      1.442     -0.897      0.370      -4.123       1.534
AgeCHDdiag     1.2664      0.076     16.726      0.000       1.118       1.415
Sex           -8.0096      1.494     -5.360      0.000     -10.940      -5.080
AgeAtStart     1.3199      0.087     15.227      0.000       1.150       1.490
Weight         0.2334      0.026      9.114      0.000       0.183       0.284
Diastolic      0.5905      0.050     11.891      0.000       0.493       0.688
Smoking        0.4139      0.055      7.500      0.000       0.306       0.522
==============================================================================
Omnibus:                       37.541   Durbin-Watson:                   2.037
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               38.476
Skew:                           0.238   Prob(JB):                     4.42e-09
Kurtosis:                       2.949   Cond. No.                         486.
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The summary now reports an R-squared of 0.971, but this is the uncentered R-squared of a model fit without an intercept (the constant column was the first variable the VIF procedure removed), so it is not comparable to the centered R-squared of roughly 0.11 from the earlier fit and does not by itself mean the model improved. The condition-number warning from the earlier fit no longer appears, however.
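
For a centered R-squared that is comparable to the earlier fits, one option is to refit with an explicit intercept on the VIF-filtered columns. This is only a sketch, assuming train_out still holds the VIF-filtered training features and X_test and y_test the held-out split from above:

from sklearn.metrics import r2_score

# Refit on the VIF-filtered features, keeping an explicit intercept so the
# reported R-squared is centered and comparable to the earlier models
X_train_vif = sm.add_constant(train_out)        # no-op if a 'const' column is already present
y_train_vif = y_train.loc[X_train_vif.index]    # align the target to the filtered rows

lm3 = sm.OLS(y_train_vif, X_train_vif).fit()
print(lm3.summary())

# Out-of-sample check on the same columns of the test set
X_test_vif = sm.add_constant(X_test)[X_train_vif.columns]
print("Test R²:", r2_score(y_test, lm3.predict(X_test_vif)))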

10.0.7 Conclusion

In this chapter, we explored the application of linear regression to a healthcare dataset, focusing on predicting cholesterol levels. We demonstrated the importance of data preparation, including handling missing values, encoding categorical variables, and detecting outliers. We also examined multicollinearity by calculating the Variance Inflation Factor (VIF) and removing collinear variables.

We built linear regression models with both scikit-learn and statsmodels, the latter providing detailed insight into the significance of each feature. The predictive performance remained modest (a test-set R² of about 0.07 and an in-sample R² of about 0.11), and the seemingly high R-squared of the final fit is an uncentered statistic from a model without an intercept rather than evidence of a much better model. Further feature engineering or additional clinical variables would be needed to explain cholesterol levels more fully.