import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
# Ensure all columns are shown
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
11 Readmission Classifications - Machine Learning Models
11.0.1 Introduction
In healthcare, machine learning (ML) classification models are widely used to support clinical decision-making, especially in predicting binary outcomes such as patient readmission. Readmission prediction involves identifying whether a patient is likely to be readmitted to a hospital within a specific period (e.g., 30 days) after discharge. Accurate prediction helps hospitals improve patient care, reduce costs, and avoid penalties under policies like Medicare’s Hospital Readmissions Reduction Program (HRRP).
Why Classification?
Because readmission is a yes/no (binary) outcome, it is well suited to classification algorithms, which learn from historical patient data, such as demographics, diagnoses, lab results, medications, length of stay, and discharge summaries, to predict future outcomes.
The common models used for readmission classification include:
- Logistic Regression: a statistical method that models the probability of a binary outcome based on one or more predictor variables.
- Decision Trees: a flowchart-like structure that splits data into branches based on feature values, leading to a decision about the outcome.
- Random Forest: an ensemble method that builds multiple decision trees and combines their predictions to improve accuracy and reduce overfitting.
- Support Vector Machines (SVM): a method that finds the hyperplane that best separates the classes in the feature space.
- Gradient Boosting Machines (GBM): an ensemble technique that builds models sequentially, with each new model correcting the errors of the previous ones.
- Neural Networks: a computational model inspired by the human brain, consisting of interconnected nodes (neurons) that can learn complex patterns in data.
Classification models in healthcare are critical for predictive tasks like hospital readmission. They support preventive care by flagging high-risk patients and enabling early interventions. Choosing the right model depends on data size, feature complexity, interpretability needs, and model performance.
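Logistic regression heads the list above but is not revisited later in this chapter, so here is a minimal self-contained sketch of what a readmission classifier of that kind might look like with scikit-learn. The data here is synthetic (make_classification), purely for illustration; it is not the readmission dataset used below.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Synthetic stand-in for patient features and a binary readmission label
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Logistic regression models P(readmitted = 1) via the logistic function
logit = LogisticRegression(max_iter=1000)
logit.fit(X_train, y_train)
print(classification_report(y_test, logit.predict(X_test)))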
11.0.2 Python packages and Data
Load data
df = pd.read_csv("/Users/nnthieu/Healthcare Data Analysis/readmission_ml.csv")
print(df.columns)
df.info()
Index(['id', 'start', 'stop', 'patient', 'organization', 'provider', 'payer',
'encounterclass', 'code', 'description', 'base_encounter_cost',
'total_claim_cost', 'payer_coverage', 'reasoncode', 'reasondescription',
'id-2', 'birthdate', 'deathdate', 'ssn', 'drivers', 'passport',
'prefix', 'first', 'middle', 'last', 'suffix', 'maiden', 'marital',
'race', 'ethnicity', 'gender', 'birthplace', 'address', 'city', 'state',
'county', 'fips', 'zip', 'lat', 'lon', 'healthcare_expenses',
'healthcare_coverage', 'income'],
dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 176 entries, 0 to 175
Data columns (total 43 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 176 non-null object
1 start 176 non-null object
2 stop 176 non-null object
3 patient 176 non-null object
4 organization 176 non-null object
5 provider 176 non-null object
6 payer 176 non-null object
7 encounterclass 176 non-null object
8 code 176 non-null int64
9 description 176 non-null object
10 base_encounter_cost 176 non-null float64
11 total_claim_cost 176 non-null float64
12 payer_coverage 176 non-null float64
13 reasoncode 176 non-null int64
14 reasondescription 176 non-null object
15 id-2 176 non-null object
16 birthdate 176 non-null object
17 deathdate 130 non-null object
18 ssn 176 non-null object
19 drivers 176 non-null object
20 passport 176 non-null object
21 prefix 176 non-null object
22 first 176 non-null object
23 middle 59 non-null object
24 last 176 non-null object
25 suffix 1 non-null object
26 maiden 30 non-null object
27 marital 176 non-null object
28 race 176 non-null object
29 ethnicity 176 non-null object
30 gender 176 non-null object
31 birthplace 176 non-null object
32 address 176 non-null object
33 city 176 non-null object
34 state 176 non-null object
35 county 176 non-null object
36 fips 32 non-null float64
37 zip 176 non-null int64
38 lat 176 non-null float64
39 lon 176 non-null float64
40 healthcare_expenses 176 non-null float64
41 healthcare_coverage 176 non-null float64
42 income 176 non-null int64
dtypes: float64(8), int64(4), object(31)
memory usage: 59.3+ KB
11.0.3 Prepare data
# Filter for inpatients and explicitly make a copy
inpatients = df[df.encounterclass == 'inpatient'].copy()
# Convert date columns
inpatients['start'] = pd.to_datetime(inpatients['start'])
inpatients['stop'] = pd.to_datetime(inpatients['stop'])
# Sort by PATIENT and START date
inpatients = inpatients.sort_values(['patient', 'start'])
# Get the previous STOP date per patient
inpatients['PREV_STOP'] = inpatients.groupby('patient')['stop'].shift(1)
# Calculate the gap in days since the last discharge
inpatients['DAYS'] = (inpatients['start'] - inpatients['PREV_STOP']).dt.days
# Identify readmissions within 30 days
inpatients['readmitted'] = (
(inpatients['DAYS'] > 0) &
(inpatients['DAYS'] <= 30)
)
inpatients.drop(columns=['PREV_STOP', 'DAYS'], inplace=True)
inpatients.head(2)
| | id | start | stop | patient | organization | provider | payer | encounterclass | code | description | base_encounter_cost | total_claim_cost | payer_coverage | reasoncode | reasondescription | id-2 | birthdate | deathdate | ssn | drivers | passport | prefix | first | middle | last | suffix | maiden | marital | race | ethnicity | gender | birthplace | address | city | state | county | fips | zip | lat | lon | healthcare_expenses | healthcare_coverage | income | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 161 | 6721b19e-3133-cc71-9090-2adfb1a69f3c | 1997-05-10 06:28:23 | 1997-05-11 06:28:23 | 008fd5fa-9f43-829d-967b-c0c6732d36ab | b8421363-9807-3b16-a146-95336eea5cfb | a7a2c654-aaea-3d59-bbc6-ba6e560c13f4 | df166300-5a78-3502-a46a-832842197811 | inpatient | 185347001 | Encounter for problem (procedure) | 87.71 | 1012.63 | 962.63 | 39898005 | Sleep disorder (disorder) | 008fd5fa-9f43-829d-967b-c0c6732d36ab | 1963-05-05 | NaN | 999-95-3099 | S99977796 | X35041393X | Mrs. | Allyn942 | Carlie972 | Johnston597 | NaN | Bernhard322 | D | white | nonhispanic | F | Sagamore Massachusetts US | 413 Reinger Trailer | Amherst | Massachusetts | Hampshire County | NaN | 0 | 42.332599 | -72.480026 | 391307.16 | 787672.07 | 77504 | False |
| 162 | 68427c27-7e3f-797d-ee7a-25c9b1f2f466 | 2012-09-11 06:06:08 | 2012-09-15 11:23:03 | 008fd5fa-9f43-829d-967b-c0c6732d36ab | b8421363-9807-3b16-a146-95336eea5cfb | a7a2c654-aaea-3d59-bbc6-ba6e560c13f4 | a735bf55-83e9-331a-899d-a82a60b9f60c | inpatient | 305432006 | Admission to surgical transplant department (p... | 146.18 | 3480.38 | 2784.26 | 698306007 | Awaiting transplantation of kidney (situation) | 008fd5fa-9f43-829d-967b-c0c6732d36ab | 1963-05-05 | NaN | 999-95-3099 | S99977796 | X35041393X | Mrs. | Allyn942 | Carlie972 | Johnston597 | NaN | Bernhard322 | D | white | nonhispanic | F | Sagamore Massachusetts US | 413 Reinger Trailer | Amherst | Massachusetts | Hampshire County | NaN | 0 | 42.332599 | -72.480026 | 391307.16 | 787672.07 | 77504 | False |
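To see the shift-based readmission logic in isolation, here is a tiny self-contained example with made-up dates for a hypothetical patient 'p1':

import pandas as pd
toy = pd.DataFrame({
    'patient': ['p1', 'p1', 'p1'],
    'start': pd.to_datetime(['2020-01-01', '2020-01-20', '2020-06-01']),
    'stop':  pd.to_datetime(['2020-01-05', '2020-01-25', '2020-06-03']),
})
toy = toy.sort_values(['patient', 'start'])
# Previous discharge date per patient, then the gap in days to the next admission
toy['prev_stop'] = toy.groupby('patient')['stop'].shift(1)
toy['days'] = (toy['start'] - toy['prev_stop']).dt.days
toy['readmitted'] = (toy['days'] > 0) & (toy['days'] <= 30)
print(toy[['start', 'stop', 'days', 'readmitted']])
# Second stay starts 15 days after the first discharge -> readmitted == True
# Third stay starts 128 days after the second discharge -> readmitted == False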
# Age at encounter, approximated by the calendar-year difference
inpatients['age'] = pd.to_datetime(inpatients['start']).dt.year - pd.to_datetime(inpatients['birthdate']).dt.year
inpatients['age'] = inpatients['age'].astype(int)
11.0.3.1 Select specific columns for building models
# Select specific columns from the 'inpatients' DataFrame
df_f = inpatients[
['id', 'patient', 'age', 'organization', 'provider', 'payer',
'code', 'base_encounter_cost', 'total_claim_cost', 'payer_coverage',
'marital', 'race', 'ethnicity', 'gender',
'healthcare_expenses', 'healthcare_coverage', 'income', 'readmitted']
].copy()
# Print the selected column names
print(df_f.columns)
Index(['id', 'patient', 'age', 'organization', 'provider', 'payer', 'code',
'base_encounter_cost', 'total_claim_cost', 'payer_coverage', 'marital',
'race', 'ethnicity', 'gender', 'healthcare_expenses',
'healthcare_coverage', 'income', 'readmitted'],
dtype='object')
11.0.3.2 Check data for missing
df_f.isna().sum()
id 0
patient 0
age 0
organization 0
provider 0
payer 0
code 0
base_encounter_cost 0
total_claim_cost 0
payer_coverage 0
marital 0
race 0
ethnicity 0
gender 0
healthcare_expenses 0
healthcare_coverage 0
income 0
readmitted 0
dtype: int64
11.0.4 Data Preprocessing
11.0.4.1 Convert numerical variables to categorical
# Convert 'code' to categorical
df_f.loc[:, 'code'] = df_f['code'].astype('category')
df_f['code'].value_counts()
code
185347001 109
56876005 17
305408004 15
305432006 13
32485007 7
397821002 3
305342007 3
183495009 3
305351004 3
305411003 2
185389009 1
Name: count, dtype: int64
11.0.4.2 Convert categorical variables to numerical
df_f['code'] = df_f['code'].astype(str).str.strip()
# Define mapping dictionary for 'code'
code_mapping = {
'185347001': 5,
'56876005': 4,
'305408004': 3,
'305432006': 2,
'32485007': 1,
# Map multiple codes to 0
**{code: 0 for code in [
'305342007', '397821002', '305351004',
'183495009', '305411003', '185389009'
]}
}
# Apply mapping to 'code' column
df_f['code'] = df_f['code'].map(code_mapping)
# Check the mapping
print(df_f.head())
id \
161 6721b19e-3133-cc71-9090-2adfb1a69f3c
162 68427c27-7e3f-797d-ee7a-25c9b1f2f466
151 ca174272-5427-701b-257e-dce2728f698f
24 549661f6-d3c2-ad25-2f81-d633450d977c
25 8d5b8d81-18a1-1057-4032-889605e3c7e8
patient age \
161 008fd5fa-9f43-829d-967b-c0c6732d36ab 34
162 008fd5fa-9f43-829d-967b-c0c6732d36ab 49
151 0289d313-6d3a-1c72-5950-61347c15c02f 63
24 047e20eb-bef0-a481-6bb5-210c3b6e07ea 26
25 047e20eb-bef0-a481-6bb5-210c3b6e07ea 48
organization \
161 b8421363-9807-3b16-a146-95336eea5cfb
162 b8421363-9807-3b16-a146-95336eea5cfb
151 352f2e3b-0708-3eb4-9f7e-e73a685bf379
24 845fbd9b-2d1c-39a8-8261-28ae40e4fab2
25 845fbd9b-2d1c-39a8-8261-28ae40e4fab2
provider \
161 a7a2c654-aaea-3d59-bbc6-ba6e560c13f4
162 a7a2c654-aaea-3d59-bbc6-ba6e560c13f4
151 284331cb-03a3-32e0-a574-1381eb5889d6
24 4eadc3de-a8cc-3d18-9b00-ad3622513cfd
25 4eadc3de-a8cc-3d18-9b00-ad3622513cfd
payer code base_encounter_cost \
161 df166300-5a78-3502-a46a-832842197811 5 87.71
162 a735bf55-83e9-331a-899d-a82a60b9f60c 2 146.18
151 a735bf55-83e9-331a-899d-a82a60b9f60c 0 146.18
24 e03e23c9-4df1-3eb6-a62d-f70f02301496 3 146.18
25 a735bf55-83e9-331a-899d-a82a60b9f60c 0 146.18
total_claim_cost payer_coverage marital race ethnicity gender \
161 1012.63 962.63 D white nonhispanic F
162 3480.38 2784.26 D white nonhispanic F
151 199646.60 159530.88 M white nonhispanic M
24 5750.55 0.00 W white nonhispanic F
25 83463.56 66770.84 W white nonhispanic F
healthcare_expenses healthcare_coverage income readmitted
161 391307.16 787672.07 77504 False
162 391307.16 787672.07 77504 False
151 94091.07 772282.67 29497 False
24 173223.68 149763.69 47200 False
25 173223.68 149763.69 47200 False
# Map gender to numeric codes
df_f.loc[:, 'gender'] = df_f['gender'].astype(str).str.strip()
status_mappingSex = {'M': 1, 'F': 0}
df_f.loc[:, 'gender'] = df_f['gender'].map(status_mappingSex)
# Map race to numeric codes
df_f.loc[:, 'race'] = df_f['race'].astype(str).str.strip()
status_mappingRace = {'white': 1, 'black': 2, 'asian': 0}
df_f.loc[:, 'race'] = df_f['race'].map(status_mappingRace)
# Map marital status to numeric codes
df_f.loc[:, 'marital'] = df_f['marital'].astype(str).str.strip()
status_mappingMarital = {'M': 3, 'S': 2, 'D': 1, 'W': 0}
df_f.loc[:, 'marital'] = df_f['marital'].map(status_mappingMarital)
# Map ethnicity to numeric codes
df_f.loc[:, 'ethnicity'] = df_f['ethnicity'].astype(str).str.strip()
status_mappingEthnicity = {'nonhispanic': 1, 'hispanic': 0}
df_f.loc[:, 'ethnicity'] = df_f['ethnicity'].map(status_mappingEthnicity)
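As an aside, the same recodings could be applied in one pass with DataFrame.replace; a sketch, reusing the mapping dictionaries defined above on a not-yet-mapped frame:

# One-pass alternative to the per-column map() calls above
recodings = {
    'gender': status_mappingSex,
    'race': status_mappingRace,
    'marital': status_mappingMarital,
    'ethnicity': status_mappingEthnicity,
}
df_f = df_f.replace(recodings)

One difference worth knowing: replace leaves unmapped values unchanged, whereas map turns them into NaN, so the two are equivalent only when the dictionaries cover every value.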
df_f.head()
| | id | patient | age | organization | provider | payer | code | base_encounter_cost | total_claim_cost | payer_coverage | marital | race | ethnicity | gender | healthcare_expenses | healthcare_coverage | income | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 161 | 6721b19e-3133-cc71-9090-2adfb1a69f3c | 008fd5fa-9f43-829d-967b-c0c6732d36ab | 34 | b8421363-9807-3b16-a146-95336eea5cfb | a7a2c654-aaea-3d59-bbc6-ba6e560c13f4 | df166300-5a78-3502-a46a-832842197811 | 5 | 87.71 | 1012.63 | 962.63 | 1 | 1 | 1 | 0 | 391307.16 | 787672.07 | 77504 | False |
| 162 | 68427c27-7e3f-797d-ee7a-25c9b1f2f466 | 008fd5fa-9f43-829d-967b-c0c6732d36ab | 49 | b8421363-9807-3b16-a146-95336eea5cfb | a7a2c654-aaea-3d59-bbc6-ba6e560c13f4 | a735bf55-83e9-331a-899d-a82a60b9f60c | 2 | 146.18 | 3480.38 | 2784.26 | 1 | 1 | 1 | 0 | 391307.16 | 787672.07 | 77504 | False |
| 151 | ca174272-5427-701b-257e-dce2728f698f | 0289d313-6d3a-1c72-5950-61347c15c02f | 63 | 352f2e3b-0708-3eb4-9f7e-e73a685bf379 | 284331cb-03a3-32e0-a574-1381eb5889d6 | a735bf55-83e9-331a-899d-a82a60b9f60c | 0 | 146.18 | 199646.60 | 159530.88 | 3 | 1 | 1 | 1 | 94091.07 | 772282.67 | 29497 | False |
| 24 | 549661f6-d3c2-ad25-2f81-d633450d977c | 047e20eb-bef0-a481-6bb5-210c3b6e07ea | 26 | 845fbd9b-2d1c-39a8-8261-28ae40e4fab2 | 4eadc3de-a8cc-3d18-9b00-ad3622513cfd | e03e23c9-4df1-3eb6-a62d-f70f02301496 | 3 | 146.18 | 5750.55 | 0.00 | 0 | 1 | 1 | 0 | 173223.68 | 149763.69 | 47200 | False |
| 25 | 8d5b8d81-18a1-1057-4032-889605e3c7e8 | 047e20eb-bef0-a481-6bb5-210c3b6e07ea | 48 | 845fbd9b-2d1c-39a8-8261-28ae40e4fab2 | 4eadc3de-a8cc-3d18-9b00-ad3622513cfd | a735bf55-83e9-331a-899d-a82a60b9f60c | 0 | 146.18 | 83463.56 | 66770.84 | 0 | 1 | 1 | 0 | 173223.68 | 149763.69 | 47200 | False |
df_f.describe()
| | age | code | base_encounter_cost | total_claim_cost | payer_coverage | healthcare_expenses | healthcare_coverage | income |
|---|---|---|---|---|---|---|---|---|
| count | 176.000000 | 176.000000 | 176.000000 | 176.000000 | 176.000000 | 1.760000e+02 | 1.760000e+02 | 176.000000 |
| mean | 66.551136 | 3.926136 | 109.636250 | 43854.455568 | 40155.515739 | 4.214176e+05 | 2.677835e+06 | 61609.301136 |
| std | 19.320979 | 1.652772 | 28.387428 | 40324.546443 | 40249.005949 | 2.414169e+05 | 1.991460e+06 | 94813.993532 |
| min | 7.000000 | 0.000000 | 87.710000 | 146.180000 | 0.000000 | 8.028420e+03 | 1.695971e+04 | 694.000000 |
| 25% | 52.750000 | 3.000000 | 87.710000 | 6977.317500 | 322.790000 | 2.843511e+05 | 3.279196e+05 | 39120.000000 |
| 50% | 74.000000 | 5.000000 | 87.710000 | 40845.845000 | 39509.080000 | 3.028783e+05 | 3.970905e+06 | 47765.000000 |
| 75% | 83.000000 | 5.000000 | 146.180000 | 69000.822500 | 66277.720000 | 6.291407e+05 | 4.670788e+06 | 47765.000000 |
| max | 87.000000 | 5.000000 | 146.180000 | 203663.010000 | 203663.010000 | 1.421336e+06 | 4.670788e+06 | 755479.000000 |
11.0.4.3 Visualize data distribution
import matplotlib.pyplot as plt
# List of columns to plot
columns_to_plot = [
'age', 'base_encounter_cost', 'total_claim_cost', 'payer_coverage',
'healthcare_expenses', 'healthcare_coverage', 'income'
]
# Filter only columns that exist in df_f
columns_to_plot = [col for col in columns_to_plot if col in df_f.columns]
# Set plot size and layout
num_cols = 2
num_rows = (len(columns_to_plot) + 1) // num_cols
plt.figure(figsize=(12, 5 * num_rows))
# Loop through and plot each histogram
for i, col in enumerate(columns_to_plot, start=1):
plt.subplot(num_rows, num_cols, i)
plt.hist(df_f[col].dropna(), bins=20, color='skyblue', edgecolor='black', alpha=0.7)
plt.title(f'Histogram of {col}')
plt.xlabel(col)
plt.ylabel('Frequency')
plt.grid(True)
plt.tight_layout()
plt.show()
11.0.4.4 Check for class imbalance
# Check the distribution of the 'readmitted' column
readmitted_counts = df_f['readmitted'].value_counts()
print("Readmitted Counts:\n", readmitted_counts)
# Plot the distribution
plt.figure(figsize=(8, 5))
sns.countplot(x='readmitted', hue='readmitted', data=df_f, palette='Set2', legend=False)
plt.title('Distribution of Readmission')
plt.xlabel('Readmitted')
plt.ylabel('Count')
plt.xticks(ticks=[0, 1], labels=['Not Readmitted', 'Readmitted'])
plt.show()
Readmitted Counts:
readmitted
False 126
True 50
Name: count, dtype: int64
11.0.4.5 Handle class imbalance
from sklearn.utils import resample
# Separate majority and minority classes
df_majority = df_f[df_f['readmitted'] == 0]
df_minority = df_f[df_f['readmitted'] == 1]
# Upsample minority class
df_minority_upsampled = resample(df_minority,
replace=True, # sample with replacement
n_samples=len(df_majority), # to match majority class
random_state=42) # reproducible results
# Combine majority class with upsampled minority class
df_f = pd.concat([df_majority, df_minority_upsampled])
# Shuffle the dataset
df_f = df_f.sample(frac=1, random_state=42).reset_index(drop=True)
# Check the new class distribution
print("New Readmitted Counts:\n", df_f['readmitted'].value_counts())New Readmitted Counts:
readmitted
True 126
False 126
Name: count, dtype: int64
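One caveat worth flagging: upsampling before the train/test split lets duplicated minority rows land in both sets, which can inflate the test scores reported below. A safer pattern, sketched here under the assumption that df_f is as it stood before the upsampling above, is to split first and resample only the training portion:

from sklearn.model_selection import train_test_split
from sklearn.utils import resample
import pandas as pd
# Split first, stratifying on the (still imbalanced) label
train_df, test_df = train_test_split(
    df_f, test_size=0.3, random_state=42, stratify=df_f['readmitted']
)
# Upsample the minority class within the training portion only
train_majority = train_df[train_df['readmitted'] == 0]
train_minority = train_df[train_df['readmitted'] == 1]
train_minority_up = resample(train_minority, replace=True,
                             n_samples=len(train_majority), random_state=42)
train_df = pd.concat([train_majority, train_minority_up])
# The test set keeps its natural class balance and contains no duplicated rows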
11.0.5 Building a Decision Tree model
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import pandas as pd
# Prepare the data
X = df_f.drop(columns=['readmitted'])
X = pd.get_dummies(X, drop_first=True) # Encode categorical variables
y = df_f['readmitted']
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Train a Decision Tree Classifier
dt_model = DecisionTreeClassifier(max_depth=5, random_state=42) # You can tune max_depth
dt_model.fit(X_train, y_train)
# Make predictions
y_pred = dt_model.predict(X_test)
# Evaluate the model
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))Confusion Matrix:
[[37 5]
[ 5 29]]
Classification Report:
precision recall f1-score support
False 0.88 0.88 0.88 42
True 0.85 0.85 0.85 34
accuracy 0.87 76
macro avg 0.87 0.87 0.87 76
weighted avg 0.87 0.87 0.87 76
Accuracy Score: 0.868421052631579
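Decision trees are also easy to inspect visually. A short sketch using sklearn.tree.plot_tree on the dt_model fitted above, truncated to the top levels to keep the plot readable:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
# Draw the first two levels of the fitted tree
plt.figure(figsize=(16, 8))
plot_tree(dt_model, feature_names=list(X.columns),
          class_names=['Not Readmitted', 'Readmitted'],
          filled=True, max_depth=2, fontsize=8)
plt.show()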
11.0.5.1 Hyperparameter Tuning for Decision Tree Classifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Prepare the data
X = df_f.drop(columns=['readmitted'])
X = pd.get_dummies(X, drop_first=True)
y = df_f['readmitted']
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Define parameter grid
param_grid = {
'max_depth': [3, 5, 10, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'criterion': ['gini', 'entropy']
}
# Grid search with cross-validation
grid_search = GridSearchCV(
estimator=DecisionTreeClassifier(random_state=42),
param_grid=param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1
)
grid_search.fit(X_train, y_train)
# Train with best estimator
best_dt_model = grid_search.best_estimator_
y_pred = best_dt_model.predict(X_test)
# Evaluate
print("Best Parameters:", grid_search.best_params_)
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))Best Parameters: {'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 2}
Confusion Matrix:
[[37 5]
[ 5 29]]
Classification Report:
precision recall f1-score support
False 0.88 0.88 0.88 42
True 0.85 0.85 0.85 34
accuracy 0.87 76
macro avg 0.87 0.87 0.87 76
weighted avg 0.87 0.87 0.87 76
Accuracy Score: 0.868421052631579
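Since the tuned tree lands on the same parameters and the same test score as the untuned one, a quick cross-validated check can confirm that the estimate is stable across folds; a sketch using cross_val_score on the training split:

from sklearn.model_selection import cross_val_score
# 5-fold cross-validated accuracy of the tuned tree
scores = cross_val_score(best_dt_model, X_train, y_train, cv=5, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.3f} (std {scores.std():.3f})")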
11.0.6 Building a Random Forest Classifier model
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Ensure all features are numeric
X = df_f.drop(columns=['readmitted'])
X = pd.get_dummies(X, drop_first=True) # Encode any categorical variables
y = df_f['readmitted']
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Train Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", round(accuracy_score(y_test, y_pred), 3))Confusion Matrix:
[[36 6]
[ 3 31]]
Classification Report:
precision recall f1-score support
False 0.92 0.86 0.89 42
True 0.84 0.91 0.87 34
accuracy 0.88 76
macro avg 0.88 0.88 0.88 76
weighted avg 0.88 0.88 0.88 76
Accuracy Score: 0.882
11.0.6.1 Evaluate performance of Random Forest Classifier
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=[False, True])
# Plot
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Readmitted', 'Readmitted'], yticklabels=['Not Readmitted', 'Readmitted'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()
from sklearn.metrics import classification_report
import pandas as pd
# Convert classification report to DataFrame
report = classification_report(y_test, y_pred, output_dict=True)
report_df = pd.DataFrame(report).transpose()
# Drop 'accuracy' row (optional)
report_df = report_df.drop(['accuracy'], errors='ignore')
# Plot heatmap
plt.figure(figsize=(8, 5))
sns.heatmap(report_df.iloc[:2, :3], annot=True, cmap='Greens', fmt=".2f")
plt.title('Classification Report (Precision, Recall, F1-score)')
plt.show()
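Random forests also expose feature_importances_, which can hint at which variables drive the predictions. A short sketch for the model fitted above; note that identifier columns such as id and patient are one-hot encoded into the feature matrix here, so high importances may partly reflect memorized identities rather than clinical signal:

import pandas as pd
import matplotlib.pyplot as plt
# Rank features by the fitted forest's impurity-based importances
importances = pd.Series(model.feature_importances_, index=X.columns)
importances.sort_values().tail(10).plot(kind='barh', figsize=(8, 5))
plt.title('Top 10 Random Forest Feature Importances')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()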
11.0.7 Building a Support Vector Machine (SVM) Classifier model
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Prepare the data
X = df_f.drop(columns=['readmitted'])
X = pd.get_dummies(X, drop_first=True) # Encode categorical variables
y = df_f['readmitted']
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Train SVM model
svm_model = SVC(kernel='linear', random_state=42) # You can also try 'rbf' or 'poly'
svm_model.fit(X_train, y_train)
# Make predictions
y_pred = svm_model.predict(X_test)
# Evaluate the model
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))Confusion Matrix:
[[35 7]
[ 5 29]]
Classification Report:
precision recall f1-score support
False 0.88 0.83 0.85 42
True 0.81 0.85 0.83 34
accuracy 0.84 76
macro avg 0.84 0.84 0.84 76
weighted avg 0.84 0.84 0.84 76
Accuracy Score: 0.8421052631578947
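SVMs are sensitive to feature scale, and the cost and income columns here span very different ranges, which may explain the slightly lower score. Standardizing inside a Pipeline is the usual remedy; a sketch, not run here:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
# The pipeline fits the scaler on the training data only, then applies
# the same transformation to the test data, avoiding leakage
svm_scaled = make_pipeline(StandardScaler(), SVC(kernel='linear', random_state=42))
svm_scaled.fit(X_train, y_train)
print("Scaled SVM accuracy:", svm_scaled.score(X_test, y_test))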
11.0.7.1 Evaluate performance of SVM Classifier
# Evaluate performance of SVM Classifier
import matplotlib.pyplot as plt
import seaborn as sns
# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=[False, True])
# Plot confusion matrix
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Readmitted', 'Readmitted'], yticklabels=['Not Readmitted', 'Readmitted'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()
from sklearn.metrics import classification_report
# Convert classification report to DataFrame
report = classification_report(y_test, y_pred, output_dict=True)
report_df = pd.DataFrame(report).transpose()
# Drop 'accuracy' row (optional)
report_df = report_df.drop(['accuracy'], errors='ignore')
# Plot heatmap
plt.figure(figsize=(8, 5))
sns.heatmap(report_df.iloc[:2, :3], annot=True, cmap='Greens', fmt=".2f")
plt.title('Classification Report (Precision, Recall, F1-score)')
plt.show()
11.0.8 Building a Gradient Boosting Classifier model
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Prepare the data
X = df_f.drop(columns=['readmitted'])
X = pd.get_dummies(X, drop_first=True) # Encode categorical variables
y = df_f['readmitted']
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Train Gradient Boosting model
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_model.fit(X_train, y_train)
# Make predictions
y_pred = gb_model.predict(X_test)
# Evaluate the model
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))Confusion Matrix:
[[37 5]
[ 5 29]]
Classification Report:
precision recall f1-score support
False 0.88 0.88 0.88 42
True 0.85 0.85 0.85 34
accuracy 0.87 76
macro avg 0.87 0.87 0.87 76
weighted avg 0.87 0.87 0.87 76
Accuracy Score: 0.868421052631579
11.0.8.1 Evaluate performance of Gradient Boosting Classifier
# Evaluate performance of Gradient Boosting Classifier
import matplotlib.pyplot as plt
import seaborn as sns
# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=[False, True])
# Plot confusion matrix
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Readmitted', 'Readmitted'], yticklabels=['Not Readmitted', 'Readmitted'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()
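To compare the four models on more than accuracy, ROC AUC is a natural common metric. A sketch, assuming the fitted dt_model, model (the random forest), gb_model, and svm_model from the sections above and the shared test split:

from sklearn.metrics import roc_auc_score
# Tree-based models expose class probabilities directly
models = {
    'Decision Tree': dt_model,
    'Random Forest': model,
    'Gradient Boosting': gb_model,
}
for name, m in models.items():
    auc = roc_auc_score(y_test, m.predict_proba(X_test)[:, 1])
    print(f"{name}: ROC AUC = {auc:.3f}")
# SVC without probability=True has no predict_proba; use its decision function
print(f"SVM: ROC AUC = {roc_auc_score(y_test, svm_model.decision_function(X_test)):.3f}")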
11.0.9 Conclusion
In this chapter, we explored several machine learning classification models for predicting patient readmission. We began with data preparation: deriving a 30-day readmission label, checking for missing values, encoding categorical variables, and correcting class imbalance by upsampling. We then built and evaluated four models: Decision Tree, Random Forest, Support Vector Machine (SVM), and Gradient Boosting.
Each model was assessed on its confusion matrix, classification report, and accuracy score. The Random Forest performed best here (accuracy about 0.88, with balanced precision and recall); the Decision Tree and Gradient Boosting models tied at about 0.87, and the SVM trailed at about 0.84, likely reflecting its sensitivity to unscaled features. These scores should be read with the earlier caveat in mind: upsampling before the train/test split can make test estimates optimistic.
The choice of model depends on the specific requirements of the healthcare setting, such as interpretability, computational resources, and the need for real-time predictions. Future work could include further hyperparameter tuning, feature selection, and additional ensemble methods to improve model performance.