11  Readmission Classification - Machine Learning Models

Author

Thieu Nguyen

11.0.1 Introduction

In healthcare, machine learning (ML) classification models are widely used to support clinical decision-making, especially in predicting binary outcomes such as patient readmission. Readmission prediction involves identifying whether a patient is likely to be readmitted to a hospital within a specific period (e.g., 30 days) after discharge. Accurate prediction helps hospitals improve patient care, reduce costs, and avoid penalties under policies like Medicare’s Hospital Readmissions Reduction Program (HRRP).

Why Classification?

Because readmission is a yes/no outcome (binary), it’s ideal for classification algorithms, which learn from historical patient data — such as demographics, diagnoses, lab results, medications, length of stay, and discharge summaries — to predict future outcomes.

The common models used for readmission classification include:

  • Logistic Regression: A statistical method that models the probability of a binary outcome based on one or more predictor variables (a minimal sketch appears at the end of this introduction).

  • Decision Trees: A flowchart-like structure that splits data into branches based on feature values, leading to a decision about the outcome.

  • Random Forest: An ensemble method that builds multiple decision trees and combines their predictions to improve accuracy and reduce overfitting.

  • Support Vector Machines (SVM): A method that finds the hyperplane that best separates different classes in the feature space.

  • Gradient Boosting Machines (GBM): An ensemble technique that builds models sequentially, where each new model corrects errors made by the previous ones.

  • Neural Networks: A computational model inspired by the human brain, consisting of interconnected nodes (neurons) that can learn complex patterns in data.

Classification models in healthcare are critical for predictive tasks like hospital readmission. They support preventive care by flagging high-risk patients and enabling early interventions. Choosing the right model depends on data size, feature complexity, interpretability needs, and model performance.
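Logistic regression is the only model listed above that is not built later in this chapter. For completeness, here is a minimal scikit-learn sketch, assuming the X_train, X_test, y_train, y_test split prepared in Section 11.0.5; it is hypothetical and not part of the original analysis.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical baseline: logistic regression on the chapter's train/test split
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train, y_train)
print("Logistic Regression accuracy:", round(accuracy_score(y_test, lr_model.predict(X_test)), 3))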

11.0.2 Python packages and Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
# Ensure all columns are shown
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

Load data

df = pd.read_csv("/Users/nnthieu/Healthcare Data Analysis/readmission_ml.csv")
print(df.columns)
df.info()
Index(['id', 'start', 'stop', 'patient', 'organization', 'provider', 'payer',
       'encounterclass', 'code', 'description', 'base_encounter_cost',
       'total_claim_cost', 'payer_coverage', 'reasoncode', 'reasondescription',
       'id-2', 'birthdate', 'deathdate', 'ssn', 'drivers', 'passport',
       'prefix', 'first', 'middle', 'last', 'suffix', 'maiden', 'marital',
       'race', 'ethnicity', 'gender', 'birthplace', 'address', 'city', 'state',
       'county', 'fips', 'zip', 'lat', 'lon', 'healthcare_expenses',
       'healthcare_coverage', 'income'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 176 entries, 0 to 175
Data columns (total 43 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   176 non-null    object 
 1   start                176 non-null    object 
 2   stop                 176 non-null    object 
 3   patient              176 non-null    object 
 4   organization         176 non-null    object 
 5   provider             176 non-null    object 
 6   payer                176 non-null    object 
 7   encounterclass       176 non-null    object 
 8   code                 176 non-null    int64  
 9   description          176 non-null    object 
 10  base_encounter_cost  176 non-null    float64
 11  total_claim_cost     176 non-null    float64
 12  payer_coverage       176 non-null    float64
 13  reasoncode           176 non-null    int64  
 14  reasondescription    176 non-null    object 
 15  id-2                 176 non-null    object 
 16  birthdate            176 non-null    object 
 17  deathdate            130 non-null    object 
 18  ssn                  176 non-null    object 
 19  drivers              176 non-null    object 
 20  passport             176 non-null    object 
 21  prefix               176 non-null    object 
 22  first                176 non-null    object 
 23  middle               59 non-null     object 
 24  last                 176 non-null    object 
 25  suffix               1 non-null      object 
 26  maiden               30 non-null     object 
 27  marital              176 non-null    object 
 28  race                 176 non-null    object 
 29  ethnicity            176 non-null    object 
 30  gender               176 non-null    object 
 31  birthplace           176 non-null    object 
 32  address              176 non-null    object 
 33  city                 176 non-null    object 
 34  state                176 non-null    object 
 35  county               176 non-null    object 
 36  fips                 32 non-null     float64
 37  zip                  176 non-null    int64  
 38  lat                  176 non-null    float64
 39  lon                  176 non-null    float64
 40  healthcare_expenses  176 non-null    float64
 41  healthcare_coverage  176 non-null    float64
 42  income               176 non-null    int64  
dtypes: float64(8), int64(4), object(31)
memory usage: 59.3+ KB

11.0.3 Prepare data

# Filter for inpatients and explicitly make a copy
inpatients = df[df.encounterclass == 'inpatient'].copy()

# Convert date columns
inpatients['start'] = pd.to_datetime(inpatients['start'])
inpatients['stop'] = pd.to_datetime(inpatients['stop'])

# Sort by PATIENT and START date
inpatients = inpatients.sort_values(['patient', 'start'])

# Get the previous STOP date per patient
inpatients['PREV_STOP'] = inpatients.groupby('patient')['stop'].shift(1)

# Calculate the gap in days since the last discharge
inpatients['DAYS'] = (inpatients['start'] - inpatients['PREV_STOP']).dt.days

# Identify readmissions within 30 days of the previous discharge
# (the flag marks the readmission encounter itself, not the index admission)
inpatients['readmitted'] = (
    (inpatients['DAYS'] > 0) &
    (inpatients['DAYS'] <= 30)
)

inpatients.drop(columns=['PREV_STOP', 'DAYS'], inplace=True)
inpatients.head(2)
id start stop patient organization provider payer encounterclass code description base_encounter_cost total_claim_cost payer_coverage reasoncode reasondescription id-2 birthdate deathdate ssn drivers passport prefix first middle last suffix maiden marital race ethnicity gender birthplace address city state county fips zip lat lon healthcare_expenses healthcare_coverage income readmitted
161 6721b19e-3133-cc71-9090-2adfb1a69f3c 1997-05-10 06:28:23 1997-05-11 06:28:23 008fd5fa-9f43-829d-967b-c0c6732d36ab b8421363-9807-3b16-a146-95336eea5cfb a7a2c654-aaea-3d59-bbc6-ba6e560c13f4 df166300-5a78-3502-a46a-832842197811 inpatient 185347001 Encounter for problem (procedure) 87.71 1012.63 962.63 39898005 Sleep disorder (disorder) 008fd5fa-9f43-829d-967b-c0c6732d36ab 1963-05-05 NaN 999-95-3099 S99977796 X35041393X Mrs. Allyn942 Carlie972 Johnston597 NaN Bernhard322 D white nonhispanic F Sagamore Massachusetts US 413 Reinger Trailer Amherst Massachusetts Hampshire County NaN 0 42.332599 -72.480026 391307.16 787672.07 77504 False
162 68427c27-7e3f-797d-ee7a-25c9b1f2f466 2012-09-11 06:06:08 2012-09-15 11:23:03 008fd5fa-9f43-829d-967b-c0c6732d36ab b8421363-9807-3b16-a146-95336eea5cfb a7a2c654-aaea-3d59-bbc6-ba6e560c13f4 a735bf55-83e9-331a-899d-a82a60b9f60c inpatient 305432006 Admission to surgical transplant department (p... 146.18 3480.38 2784.26 698306007 Awaiting transplantation of kidney (situation) 008fd5fa-9f43-829d-967b-c0c6732d36ab 1963-05-05 NaN 999-95-3099 S99977796 X35041393X Mrs. Allyn942 Carlie972 Johnston597 NaN Bernhard322 D white nonhispanic F Sagamore Massachusetts US 413 Reinger Trailer Amherst Massachusetts Hampshire County NaN 0 42.332599 -72.480026 391307.16 787672.07 77504 False
# Approximate age at admission: year difference only (ignores birth month/day);
# 'start' is already a datetime, so no further conversion is needed
inpatients['age'] = inpatients['start'].dt.year - pd.to_datetime(inpatients['birthdate']).dt.year
inpatients['age'] = inpatients['age'].astype(int)

11.0.3.1 Select specific columns for building models

# Select specific columns from the 'inpatients' DataFrame
df_f = inpatients[
    ['id', 'patient', 'age', 'organization', 'provider', 'payer',
     'code', 'base_encounter_cost', 'total_claim_cost', 'payer_coverage',
     'marital', 'race', 'ethnicity', 'gender',
     'healthcare_expenses', 'healthcare_coverage', 'income', 'readmitted']
].copy()

# Print the selected column names
print(df_f.columns)
Index(['id', 'patient', 'age', 'organization', 'provider', 'payer', 'code',
       'base_encounter_cost', 'total_claim_cost', 'payer_coverage', 'marital',
       'race', 'ethnicity', 'gender', 'healthcare_expenses',
       'healthcare_coverage', 'income', 'readmitted'],
      dtype='object')

11.0.3.2 Check data for missing values

df_f.isna().sum()
id                     0
patient                0
age                    0
organization           0
provider               0
payer                  0
code                   0
base_encounter_cost    0
total_claim_cost       0
payer_coverage         0
marital                0
race                   0
ethnicity              0
gender                 0
healthcare_expenses    0
healthcare_coverage    0
income                 0
readmitted             0
dtype: int64

11.0.4 Data Preprocessing

11.0.4.1 Convert numerical variables to categorical

# Convert 'code' to categorical (use direct column assignment; assigning through
# .loc into an existing int64 column would silently keep the numeric dtype)
df_f['code'] = df_f['code'].astype('category')
df_f['code'].value_counts()
code
185347001    109
56876005      17
305408004     15
305432006     13
32485007       7
397821002      3
305342007      3
183495009      3
305351004      3
305411003      2
185389009      1
Name: count, dtype: int64

11.0.4.2 Convert categorical variables to numerical

df_f['code'] = df_f['code'].astype(str).str.strip()

# Define mapping dictionary for 'code'
code_mapping = {
    '185347001': 5,
    '56876005': 4,
    '305408004': 3,
    '305432006': 2,
    '32485007': 1,
    # Map multiple codes to 0
    **{code: 0 for code in [
        '305342007', '397821002', '305351004', 
        '183495009', '305411003', '185389009'
    ]}
}

# Apply mapping to 'code' column
df_f['code'] = df_f['code'].map(code_mapping)
# Check the mapping
print(df_f.head())
                                       id  \
161  6721b19e-3133-cc71-9090-2adfb1a69f3c   
162  68427c27-7e3f-797d-ee7a-25c9b1f2f466   
151  ca174272-5427-701b-257e-dce2728f698f   
24   549661f6-d3c2-ad25-2f81-d633450d977c   
25   8d5b8d81-18a1-1057-4032-889605e3c7e8   

                                  patient  age  \
161  008fd5fa-9f43-829d-967b-c0c6732d36ab   34   
162  008fd5fa-9f43-829d-967b-c0c6732d36ab   49   
151  0289d313-6d3a-1c72-5950-61347c15c02f   63   
24   047e20eb-bef0-a481-6bb5-210c3b6e07ea   26   
25   047e20eb-bef0-a481-6bb5-210c3b6e07ea   48   

                             organization  \
161  b8421363-9807-3b16-a146-95336eea5cfb   
162  b8421363-9807-3b16-a146-95336eea5cfb   
151  352f2e3b-0708-3eb4-9f7e-e73a685bf379   
24   845fbd9b-2d1c-39a8-8261-28ae40e4fab2   
25   845fbd9b-2d1c-39a8-8261-28ae40e4fab2   

                                 provider  \
161  a7a2c654-aaea-3d59-bbc6-ba6e560c13f4   
162  a7a2c654-aaea-3d59-bbc6-ba6e560c13f4   
151  284331cb-03a3-32e0-a574-1381eb5889d6   
24   4eadc3de-a8cc-3d18-9b00-ad3622513cfd   
25   4eadc3de-a8cc-3d18-9b00-ad3622513cfd   

                                    payer  code  base_encounter_cost  \
161  df166300-5a78-3502-a46a-832842197811     5                87.71   
162  a735bf55-83e9-331a-899d-a82a60b9f60c     2               146.18   
151  a735bf55-83e9-331a-899d-a82a60b9f60c     0               146.18   
24   e03e23c9-4df1-3eb6-a62d-f70f02301496     3               146.18   
25   a735bf55-83e9-331a-899d-a82a60b9f60c     0               146.18   

     total_claim_cost  payer_coverage marital   race    ethnicity gender  \
161           1012.63          962.63       D  white  nonhispanic      F   
162           3480.38         2784.26       D  white  nonhispanic      F   
151         199646.60       159530.88       M  white  nonhispanic      M   
24            5750.55            0.00       W  white  nonhispanic      F   
25           83463.56        66770.84       W  white  nonhispanic      F   

     healthcare_expenses  healthcare_coverage  income  readmitted  
161            391307.16            787672.07   77504       False  
162            391307.16            787672.07   77504       False  
151             94091.07            772282.67   29497       False  
24             173223.68            149763.69   47200       False  
25             173223.68            149763.69   47200       False  
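The mapping above imposes an assumed ordinal scale on the SNOMED codes, roughly ordered by frequency. When no order is intended, one-hot encoding is a common alternative; a one-line sketch (not what this chapter does for 'code'):

# Hypothetical alternative: one-hot encode 'code' instead of ordinal mapping
df_onehot = pd.get_dummies(df_f, columns=['code'], prefix='code', drop_first=True)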
# Strip whitespace and map each categorical column to integer codes
category_mappings = {
    'gender':    {'M': 1, 'F': 0},
    'race':      {'white': 1, 'black': 2, 'asian': 0},
    'marital':   {'M': 3, 'S': 2, 'D': 1, 'W': 0},
    'ethnicity': {'nonhispanic': 1, 'hispanic': 0},
}
for col, mapping in category_mappings.items():
    df_f[col] = df_f[col].astype(str).str.strip().map(mapping)
df_f.head()
id patient age organization provider payer code base_encounter_cost total_claim_cost payer_coverage marital race ethnicity gender healthcare_expenses healthcare_coverage income readmitted
161 6721b19e-3133-cc71-9090-2adfb1a69f3c 008fd5fa-9f43-829d-967b-c0c6732d36ab 34 b8421363-9807-3b16-a146-95336eea5cfb a7a2c654-aaea-3d59-bbc6-ba6e560c13f4 df166300-5a78-3502-a46a-832842197811 5 87.71 1012.63 962.63 1 1 1 0 391307.16 787672.07 77504 False
162 68427c27-7e3f-797d-ee7a-25c9b1f2f466 008fd5fa-9f43-829d-967b-c0c6732d36ab 49 b8421363-9807-3b16-a146-95336eea5cfb a7a2c654-aaea-3d59-bbc6-ba6e560c13f4 a735bf55-83e9-331a-899d-a82a60b9f60c 2 146.18 3480.38 2784.26 1 1 1 0 391307.16 787672.07 77504 False
151 ca174272-5427-701b-257e-dce2728f698f 0289d313-6d3a-1c72-5950-61347c15c02f 63 352f2e3b-0708-3eb4-9f7e-e73a685bf379 284331cb-03a3-32e0-a574-1381eb5889d6 a735bf55-83e9-331a-899d-a82a60b9f60c 0 146.18 199646.60 159530.88 3 1 1 1 94091.07 772282.67 29497 False
24 549661f6-d3c2-ad25-2f81-d633450d977c 047e20eb-bef0-a481-6bb5-210c3b6e07ea 26 845fbd9b-2d1c-39a8-8261-28ae40e4fab2 4eadc3de-a8cc-3d18-9b00-ad3622513cfd e03e23c9-4df1-3eb6-a62d-f70f02301496 3 146.18 5750.55 0.00 0 1 1 0 173223.68 149763.69 47200 False
25 8d5b8d81-18a1-1057-4032-889605e3c7e8 047e20eb-bef0-a481-6bb5-210c3b6e07ea 48 845fbd9b-2d1c-39a8-8261-28ae40e4fab2 4eadc3de-a8cc-3d18-9b00-ad3622513cfd a735bf55-83e9-331a-899d-a82a60b9f60c 0 146.18 83463.56 66770.84 0 1 1 0 173223.68 149763.69 47200 False
df_f.describe()
age code base_encounter_cost total_claim_cost payer_coverage healthcare_expenses healthcare_coverage income
count 176.000000 176.000000 176.000000 176.000000 176.000000 1.760000e+02 1.760000e+02 176.000000
mean 66.551136 3.926136 109.636250 43854.455568 40155.515739 4.214176e+05 2.677835e+06 61609.301136
std 19.320979 1.652772 28.387428 40324.546443 40249.005949 2.414169e+05 1.991460e+06 94813.993532
min 7.000000 0.000000 87.710000 146.180000 0.000000 8.028420e+03 1.695971e+04 694.000000
25% 52.750000 3.000000 87.710000 6977.317500 322.790000 2.843511e+05 3.279196e+05 39120.000000
50% 74.000000 5.000000 87.710000 40845.845000 39509.080000 3.028783e+05 3.970905e+06 47765.000000
75% 83.000000 5.000000 146.180000 69000.822500 66277.720000 6.291407e+05 4.670788e+06 47765.000000
max 87.000000 5.000000 146.180000 203663.010000 203663.010000 1.421336e+06 4.670788e+06 755479.000000

11.0.4.3 Visualize data distribution

import matplotlib.pyplot as plt

# List of columns to plot
columns_to_plot = [
    'age', 'base_encounter_cost', 'total_claim_cost', 'payer_coverage',
    'healthcare_expenses', 'healthcare_coverage', 'income'
]

# Filter only columns that exist in df_f
columns_to_plot = [col for col in columns_to_plot if col in df_f.columns]

# Set plot size and layout
num_cols = 2
num_rows = (len(columns_to_plot) + 1) // num_cols
plt.figure(figsize=(12, 5 * num_rows))

# Loop through and plot each histogram
for i, col in enumerate(columns_to_plot, start=1):
    plt.subplot(num_rows, num_cols, i)
    plt.hist(df_f[col].dropna(), bins=20, color='skyblue', edgecolor='black', alpha=0.7)
    plt.title(f'Histogram of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.grid(True)

plt.tight_layout()
plt.show()

11.0.4.4 Check for class imbalance

# Check the distribution of the 'readmitted' column
readmitted_counts = df_f['readmitted'].value_counts()
print("Readmitted Counts:\n", readmitted_counts)
# Plot the distribution (hue is set to avoid seaborn's palette deprecation warning)
plt.figure(figsize=(8, 5))
sns.countplot(x='readmitted', hue='readmitted', data=df_f, palette='Set2', legend=False)
plt.title('Distribution of Readmission')
plt.xlabel('Readmitted')
plt.ylabel('Count')
plt.xticks(ticks=[0, 1], labels=['Not Readmitted', 'Readmitted'])
plt.show()
Readmitted Counts:
 readmitted
False    126
True      50
Name: count, dtype: int64

11.0.4.5 Handle class imbalance

from sklearn.utils import resample
# Separate majority and minority classes
df_majority = df_f[df_f['readmitted'] == 0]
df_minority = df_f[df_f['readmitted'] == 1]
# Upsample minority class
df_minority_upsampled = resample(df_minority, 
                                  replace=True,     # sample with replacement
                                  n_samples=len(df_majority),    # to match majority class
                                  random_state=42) # reproducible results
# Combine majority class with upsampled minority class
df_f = pd.concat([df_majority, df_minority_upsampled])
# Shuffle the dataset
df_f = df_f.sample(frac=1, random_state=42).reset_index(drop=True)
# Check the new class distribution
print("New Readmitted Counts:\n", df_f['readmitted'].value_counts())
New Readmitted Counts:
 readmitted
True     126
False    126
Name: count, dtype: int64
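A caution: because the upsampling happens before the train/test split, duplicated minority rows can land in both the training and test sets, which may inflate the test metrics reported in the sections below. A leakage-free variant splits first and resamples only the training portion; the sketch below assumes it runs on the pre-upsampling df_f and is not what this chapter does.

from sklearn.model_selection import train_test_split
from sklearn.utils import resample
import pandas as pd

# Sketch: split the ORIGINAL (imbalanced) data first, then upsample the training set only
train_df, test_df = train_test_split(df_f, test_size=0.3, random_state=42,
                                     stratify=df_f['readmitted'])
train_majority = train_df[train_df['readmitted'] == 0]
train_minority = train_df[train_df['readmitted'] == 1]
train_minority_up = resample(train_minority, replace=True,
                             n_samples=len(train_majority), random_state=42)
train_balanced = pd.concat([train_majority, train_minority_up])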

11.0.5 Build Decision Tree model

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import pandas as pd

# Prepare the data
X = df_f.drop(columns=['readmitted'])
X = pd.get_dummies(X, drop_first=True)  # Encode categorical variables
y = df_f['readmitted']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a Decision Tree Classifier
dt_model = DecisionTreeClassifier(max_depth=5, random_state=42)  # You can tune max_depth
dt_model.fit(X_train, y_train)

# Make predictions
y_pred = dt_model.predict(X_test)

# Evaluate the model
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
Confusion Matrix:
 [[37  5]
 [ 5 29]]

Classification Report:
               precision    recall  f1-score   support

       False       0.88      0.88      0.88        42
        True       0.85      0.85      0.85        34

    accuracy                           0.87        76
   macro avg       0.87      0.87      0.87        76
weighted avg       0.87      0.87      0.87        76

Accuracy Score: 0.868421052631579
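One advantage of a shallow tree is that it can be inspected directly. A short visualization sketch using scikit-learn's plot_tree (not part of the original run; max_depth=2 merely keeps the figure legible):

from sklearn import tree
import matplotlib.pyplot as plt

# Draw the top two levels of the fitted decision tree
plt.figure(figsize=(16, 8))
tree.plot_tree(dt_model, max_depth=2, feature_names=X.columns,
               class_names=['Not Readmitted', 'Readmitted'], filled=True, fontsize=8)
plt.show()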

11.0.5.1 Hyperparameter Tuning for Decision Tree Classifier

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Prepare the data
X = df_f.drop(columns=['readmitted'])
X = pd.get_dummies(X, drop_first=True)
y = df_f['readmitted']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define parameter grid
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

# Grid search with cross-validation
grid_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

# Train with best estimator
best_dt_model = grid_search.best_estimator_
y_pred = best_dt_model.predict(X_test)

# Evaluate
print("Best Parameters:", grid_search.best_params_)
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
Best Parameters: {'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 2}

Confusion Matrix:
 [[37  5]
 [ 5 29]]

Classification Report:
               precision    recall  f1-score   support

       False       0.88      0.88      0.88        42
        True       0.85      0.85      0.85        34

    accuracy                           0.87        76
   macro avg       0.87      0.87      0.87        76
weighted avg       0.87      0.87      0.87        76

Accuracy Score: 0.868421052631579
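GridSearchCV also records the mean cross-validated score of the winning parameter combination. Printing it (a one-line addition, not in the original run) lets you compare tuning performance against the held-out test accuracy above:

# Mean cross-validated accuracy of the best parameter combination (training folds only)
print("Best CV Accuracy:", round(grid_search.best_score_, 3))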

11.0.6 Building a Random Forest Classifier model

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Ensure all features are numeric
X = df_f.drop(columns=['readmitted'])
X = pd.get_dummies(X, drop_first=True)  # Encode any categorical variables
y = df_f['readmitted']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", round(accuracy_score(y_test, y_pred), 3))
Confusion Matrix:
 [[36  6]
 [ 3 31]]

Classification Report:
               precision    recall  f1-score   support

       False       0.92      0.86      0.89        42
        True       0.84      0.91      0.87        34

    accuracy                           0.88        76
   macro avg       0.88      0.88      0.88        76
weighted avg       0.88      0.88      0.88        76

Accuracy Score: 0.882

11.0.6.1 Evaluate performance of Random Forest Classifier

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=[False, True])

# Plot
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Readmitted', 'Readmitted'], yticklabels=['Not Readmitted', 'Readmitted'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()

from sklearn.metrics import classification_report
import pandas as pd

# Convert classification report to DataFrame
report = classification_report(y_test, y_pred, output_dict=True)
report_df = pd.DataFrame(report).transpose()

# Drop 'accuracy' row (optional)
report_df = report_df.drop(['accuracy'], errors='ignore')

# Plot heatmap
plt.figure(figsize=(8, 5))
sns.heatmap(report_df.iloc[:2, :3], annot=True, cmap='Greens', fmt=".2f")
plt.title('Classification Report (Precision, Recall, F1-score)')
plt.show()
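Random Forests also expose impurity-based feature importances, which hint at which predictors drive the classification. A short sketch using the fitted model and the encoded feature matrix X from above (not in the original run):

import pandas as pd
import matplotlib.pyplot as plt

# Rank features by the Random Forest's impurity-based importances
importances = pd.Series(model.feature_importances_, index=X.columns)
importances.nlargest(10).sort_values().plot(kind='barh', color='skyblue', edgecolor='black')
plt.title('Top 10 Feature Importances (Random Forest)')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()

Note that one-hot columns derived from 'id' and 'patient' may appear in this ranking, a hint that identifier columns might be better dropped before modeling.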

11.0.7 Building a Support Vector Machine (SVM) Classifier model

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Prepare the data
X = df_f.drop(columns=['readmitted'])
X = pd.get_dummies(X, drop_first=True)  # Encode categorical variables
y = df_f['readmitted']
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Train SVM model
svm_model = SVC(kernel='linear', random_state=42)  # You can also try 'rbf' or 'poly'
svm_model.fit(X_train, y_train)
# Make predictions
y_pred = svm_model.predict(X_test)
# Evaluate the model
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
Confusion Matrix:
 [[35  7]
 [ 5 29]]

Classification Report:
               precision    recall  f1-score   support

       False       0.88      0.83      0.85        42
        True       0.81      0.85      0.83        34

    accuracy                           0.84        76
   macro avg       0.84      0.84      0.84        76
weighted avg       0.84      0.84      0.84        76

Accuracy Score: 0.8421052631578947
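SVMs are sensitive to feature scale, and several of these features (costs, income) span very different ranges. A common refinement is to standardize features inside a Pipeline; the sketch below assumes the same split as above and is not what this chapter runs, so results may differ:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Sketch: scale features before the SVM; scaling often changes SVM results noticeably
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel='linear', random_state=42))
scaled_svm.fit(X_train, y_train)
print("Scaled SVM accuracy:", round(scaled_svm.score(X_test, y_test), 3))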

11.0.7.1 Evaluate performance of SVM Classifier

# Evaluate performance of SVM Classifier
import matplotlib.pyplot as plt
import seaborn as sns
# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=[False, True])
# Plot confusion matrix
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Readmitted', 'Readmitted'], yticklabels=['Not Readmitted', 'Readmitted'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()

from sklearn.metrics import classification_report
# Convert classification report to DataFrame
report = classification_report(y_test, y_pred, output_dict=True)
report_df = pd.DataFrame(report).transpose()
# Drop 'accuracy' row (optional)
report_df = report_df.drop(['accuracy'], errors='ignore')
# Plot heatmap
plt.figure(figsize=(8, 5))
sns.heatmap(report_df.iloc[:2, :3], annot=True, cmap='Greens', fmt=".2f")
plt.title('Classification Report (Precision, Recall, F1-score)')
plt.show()

11.0.8 Building a Gradient Boosting Classifier model

from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Prepare the data
X = df_f.drop(columns=['readmitted'])
X = pd.get_dummies(X, drop_first=True)  # Encode categorical variables
y = df_f['readmitted']
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Train Gradient Boosting model
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_model.fit(X_train, y_train)
# Make predictions
y_pred = gb_model.predict(X_test)
# Evaluate the model
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
Confusion Matrix:
 [[37  5]
 [ 5 29]]

Classification Report:
               precision    recall  f1-score   support

       False       0.88      0.88      0.88        42
        True       0.85      0.85      0.85        34

    accuracy                           0.87        76
   macro avg       0.87      0.87      0.87        76
weighted avg       0.87      0.87      0.87        76

Accuracy Score: 0.868421052631579

11.0.8.1 Evaluate performance of Gradient Boosting Classifier

# Evaluate performance of Gradient Boosting Classifier
import matplotlib.pyplot as plt
import seaborn as sns
# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=[False, True])
# Plot confusion matrix
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Readmitted', 'Readmitted'], yticklabels=['Not Readmitted', 'Readmitted'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()
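Beyond the confusion matrix, Gradient Boosting's predicted probabilities support threshold-free evaluation. A short ROC AUC sketch using the fitted gb_model (not part of the original run):

from sklearn.metrics import roc_auc_score

# ROC AUC from the predicted probability of the positive (readmitted) class
y_proba = gb_model.predict_proba(X_test)[:, 1]
print("ROC AUC:", round(roc_auc_score(y_test, y_proba), 3))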

11.0.9 Conclusion

In this chapter, we explored various machine learning classification models to predict patient readmission. We started with data preprocessing, including checking for missing values, encoding categorical variables, and upsampling the minority class to address class imbalance. We then built and evaluated several models: Decision Tree, Random Forest, Support Vector Machine (SVM), and Gradient Boosting.

Each model was assessed with its confusion matrix, classification report, and accuracy score. The Random Forest model performed best on the held-out test set (accuracy 0.882) with balanced precision and recall. Gradient Boosting and the tuned Decision Tree both reached 0.868, while the SVM trailed at 0.842; the Decision Tree remains the easiest to interpret.

The choice of model depends on the specific requirements of the healthcare setting, such as interpretability, computational resources, and the need for real-time predictions. Future work could involve further hyperparameter tuning across all models, feature selection (for example, dropping identifier columns such as 'id' and 'patient' before encoding), and resampling within the training set only, as cautioned in Section 11.0.4.5, to obtain more trustworthy performance estimates.