糖尿病:Diabetes Health Indicators Dataset : 基本分析01の続きを行なっていきます。今回は主成分分析と主成分分析の結果を用いたクラスタリングです。
1. 主成分分析の実施
まず主成分分析を行なっていきますが、今回のように要素が多い場合、主成分分析で次元削減を行い、新たな軸で分析することは有効です。今回は累積寄与率80%で分析を行います。主成分分析を行い、各主成分の項目ごとのウエイトをデータフレームにして算出します。今回は目的変数に相当する「Diabetes_binary」は除いた21項目で実施しました。80%の累積寄与率でやると14主成分となっています。
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
import numpy as np | |
from sklearn.preprocessing import StandardScaler | |
from sklearn.decomposition import PCA | |
# Excluding the target variable 'Diabetes_binary' for PCA | |
features = data.drop('Diabetes_binary', axis=1) | |
# Scaling the data | |
scaler = StandardScaler() | |
scaled_features = scaler.fit_transform(features) | |
# Performing PCA | |
pca = PCA() | |
principal_components = pca.fit_transform(scaled_features) | |
# Explained variance ratio for each principal component | |
explained_variance_ratio = pca.explained_variance_ratio_ | |
# Calculating the cumulative variance | |
cumulative_variance = explained_variance_ratio.cumsum() | |
# Finding the number of components to reach 80% cumulative variance | |
n_components_80 = (cumulative_variance >= 0.80).argmax() + 1 | |
# Performing PCA with the desired number of components | |
pca_80 = PCA(n_components=n_components_80) | |
principal_components_80 = pca_80.fit_transform(scaled_features) | |
# Creating a DataFrame for the principal components | |
principal_df_80 = pd.DataFrame(data=principal_components_80, columns=[f'PC{i}' for i in range(n_components_80)]) | |
# Calculating loadings (correlation between original features and the selected principal components) | |
loadings_80 = pca_80.components_.T * np.sqrt(pca_80.explained_variance_) | |
# Creating a DataFrame for the loadings | |
loadings_df_80 = pd.DataFrame(data=loadings_80, columns=[f'PC{i}' for i in range(n_components_80)], index=features.columns) | |
# Transposing the loadings DataFrame to swap rows and columns | |
loadings_df_80_transposed = loadings_df_80.transpose() | |
# Display the transposed loadings DataFrame | |
loadings_df_80_transposed |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
import matplotlib.pyplot as plt | |
import seaborn as sns | |
# Creating a heatmap for the loadings | |
plt.figure(figsize=(14, 8)) | |
sns.heatmap(loadings_df_80_transposed, cmap='coolwarm', annot=True) | |
plt.title('Heatmap of PCA Loadings') | |
plt.ylabel('Principal Components') | |
plt.xlabel('Features') | |
plt.show() |
- PC0: この主成分は「GenHlth」(一般的な健康状態)や「PhysHlth」(身体的健康)などの特徴量に高い負荷量を持っています。したがって、この主成分は「全体的健康状態」を反映していると考えることができます。
- PC1: 「Age」(年齢)に高い負荷量を持っており、「Education」(教育)や「Income」(収入)にも影響を与えています。この主成分は「年齢と社会経済的地位」を反映していると言えるでしょう。
- PC2: 「Fruits」(果物の摂取)や「Veggies」(野菜の摂取)に高い負荷量を持っています。この主成分は「食生活の健康性」を表している可能性があります。
- PC3: この主成分は「BMI」(体格指数)に高い負荷量を持っており、「CholCheck」(コレステロール検査)や「GenHlth」(一般的な健康状態)にも影響を与えています。この主成分は「身体的健康とライフスタイル」を反映していると考えられます。
- PC4: 「Smoker」(喫煙)に高い負荷量を持っていることが見られます。また、他の健康関連の特徴量にも影響を与えています。この主成分は「喫煙と関連する健康リスク」を表している可能性があります。
累積寄与率のグラフを書いてみます。
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# Creating a plot for the cumulative variance with a horizontal line at 80% | |
plt.figure(figsize=(10, 6)) | |
plt.plot(cumulative_variance, marker='o') | |
# Adding annotations for each point | |
for i, v in enumerate(cumulative_variance): | |
plt.text(i, v + 0.01, f"{v:.2f}", ha='center', va='bottom', fontsize=8) | |
# Adding a horizontal line at 80% cumulative variance | |
plt.axhline(y=0.80, color='r', linestyle='--') | |
plt.title('Cumulative Variance Explained by Each Principal Component') | |
plt.xlabel('Number of Components') | |
plt.ylabel('Cumulative Explained Variance') | |
plt.grid(True) | |
plt.show() |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
def biplot_diabetes_smaller_arrows(score, coeff, diabetes_data): | |
xs = score[:,0] | |
ys = score[:,1] | |
n = coeff.shape[0] | |
scalex = 1.0/(xs.max() - xs.min()) | |
scaley = 1.0/(ys.max() - ys.min()) | |
# Plotting points for diabetes cases only | |
diabetes_points = score[diabetes_data == 1] | |
plt.scatter(diabetes_points[:, 0] * scalex, diabetes_points[:, 1] * scaley, color='red', s=5, label='Diabetes') | |
# Using smaller arrows for better visibility | |
for i in range(n): | |
plt.arrow(0, 0, coeff[i,0], coeff[i,1], color='blue', alpha=0.7, linewidth=1.2, head_width=0.03) | |
plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, features.columns[i], color='green', ha='center', va='center') | |
plt.xlabel("PC{}".format(0)) | |
plt.ylabel("PC{}".format(1)) | |
plt.legend() | |
plt.grid() | |
# Calling the biplot function for diabetes cases only with smaller arrows | |
plt.figure(figsize=(12, 8)) | |
biplot_diabetes_smaller_arrows(principal_components[:,0:2], np.transpose(pca.components_[0:2, :]), data['Diabetes_binary']) | |
plt.show() |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
def biplot_non_diabetes_only(score, coeff, diabetes_data): | |
xs = score[:,0] | |
ys = score[:,1] | |
n = coeff.shape[0] | |
scalex = 1.0/(xs.max() - xs.min()) | |
scaley = 1.0/(ys.max() - ys.min()) | |
# Plotting points for non-diabetes cases only | |
non_diabetes_points = score[diabetes_data == 0] | |
plt.scatter(non_diabetes_points[:, 0] * scalex, non_diabetes_points[:, 1] * scaley, color='blue', s=5, label='Non-Diabetes') | |
# Using smaller arrows for better visibility | |
for i in range(n): | |
plt.arrow(0, 0, coeff[i,0], coeff[i,1], color='red', alpha=0.7, linewidth=1.2, head_width=0.03) | |
plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, features.columns[i], color='green', ha='center', va='center') | |
plt.xlabel("PC{}".format(0)) | |
plt.ylabel("PC{}".format(1)) | |
plt.legend() | |
plt.grid() | |
# Calling the biplot function for non-diabetes cases only | |
plt.figure(figsize=(12, 8)) | |
biplot_non_diabetes_only(principal_components[:,0:2], np.transpose(pca.components_[0:2, :]), data['Diabetes_binary']) | |
plt.show() |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
from mpl_toolkits.mplot3d import Axes3D | |
def biplot_3d_with_arrows(score, coeff, diabetes_data, ax): | |
xs = score[:,0] | |
ys = score[:,1] | |
zs = score[:,2] | |
n = coeff.shape[0] | |
# Plotting points for diabetes and non-diabetes cases | |
ax.scatter(score[diabetes_data == 1, 0], score[diabetes_data == 1, 1], score[diabetes_data == 1, 2], color='red', s=5, label='Diabetes') | |
ax.scatter(score[diabetes_data == 0, 0], score[diabetes_data == 0, 1], score[diabetes_data == 0, 2], color='blue', s=5, label='Non-Diabetes') | |
# Plotting the feature vectors as arrows | |
for i in range(n): | |
ax.quiver(0, 0, 0, coeff[i,0], coeff[i,1], coeff[i,2], color='black', arrow_length_ratio=0.05) | |
ax.set_xlabel("PC0") | |
ax.set_ylabel("PC1") | |
ax.set_zlabel("PC2") | |
ax.legend() | |
# Correcting the error by using the correct variable for diabetes data | |
diabetes_data = data['Diabetes_binary'] | |
# Re-creating two 3D plots for better visibility | |
fig = plt.figure(figsize=(16, 8)) | |
# Plot for Diabetes cases | |
ax1 = fig.add_subplot(121, projection='3d') | |
biplot_3d_with_arrows(principal_components[diabetes_data == 1, 0:3], np.transpose(pca.components_[0:3, :]), diabetes_data[diabetes_data == 1], ax1) | |
ax1.set_title('Diabetes Cases') | |
# Plot for Non-Diabetes cases | |
ax2 = fig.add_subplot(122, projection='3d') | |
biplot_3d_with_arrows(principal_components[diabetes_data == 0, 0:3], np.transpose(pca.components_[0:3, :]), diabetes_data[diabetes_data == 0], ax2) | |
ax2.set_title('Non-Diabetes Cases') | |
plt.show() |
演習:biplotの結果からどのようなことが言えるでしょうか?
2. 主成分分析の結果からクラスタリング実施
次にクラスタリングを実施してみます。元データからではなく、主成分分析で求まる射影行列を使ってのクラスタリングを行います。今回はK-means法で5個のクラスタに分類してみます。
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# Full Python code to perform clustering and calculate statistics | |
import pandas as pd | |
from sklearn.preprocessing import StandardScaler | |
from sklearn.decomposition import PCA | |
from sklearn.cluster import KMeans | |
import matplotlib.pyplot as plt | |
import seaborn as sns | |
# Assuming 'data' is the original DataFrame and 'features' are the feature columns | |
# Scaling the data | |
scaler = StandardScaler() | |
scaled_features = scaler.fit_transform(features) | |
# Performing PCA | |
pca = PCA() | |
principal_components = pca.fit_transform(scaled_features) | |
# Performing KMeans clustering with 5 clusters | |
kmeans = KMeans(n_clusters=5, random_state=0).fit(principal_components) | |
labels = kmeans.labels_ | |
# Adding the cluster labels to the original DataFrame | |
data_with_clusters = data.copy() | |
data_with_clusters['Cluster'] = labels | |
# Calculating mean values of each feature for each cluster | |
cluster_means = data_with_clusters.groupby('Cluster').mean() | |
# Calculating the percentage of diabetes cases in each cluster | |
cluster_diabetes_percentage = data_with_clusters.groupby('Cluster')['Diabetes_binary'].mean() * 100 | |
# Adding the diabetes rate to the cluster means DataFrame | |
cluster_means_with_diabetes_rate = cluster_means.copy() | |
cluster_means_with_diabetes_rate['Diabetes Rate'] = cluster_diabetes_percentage | |
# Renaming the index to 'Cluster' and resetting index | |
cluster_means_with_diabetes_rate.index.name = 'Cluster' | |
cluster_means_with_diabetes_rate.reset_index(inplace=True) | |
# The final DataFrame | |
cluster_means_with_diabetes_rate |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
from sklearn.preprocessing import MinMaxScaler | |
# Dropping the 'Cluster' column for scaling | |
cluster_data_for_scaling = cluster_means_with_diabetes_rate.drop('Cluster', axis=1) | |
# Applying Min-Max scaling | |
scaler = MinMaxScaler() | |
cluster_scaled = scaler.fit_transform(cluster_data_for_scaling) | |
# Creating a DataFrame from the scaled data | |
cluster_scaled_df = pd.DataFrame(cluster_scaled, columns=cluster_data_for_scaling.columns) | |
# Creating a heatmap of the scaled cluster means DataFrame | |
plt.figure(figsize=(15, 10)) | |
sns.heatmap(cluster_scaled_df, annot=True, cmap='coolwarm', fmt=".2f") | |
plt.title('Heatmap of Scaled Cluster Means with Diabetes Rate') | |
plt.xlabel('Features') | |
plt.ylabel('Clusters') | |
plt.show() |
クラスタ0 (「低リスク・健康志向」): このクラスタは全体的に低い糖尿病率と低い健康リスク指標(BMI、高血圧など)を示しており、身体活動が高いことが特徴です。
クラスタ1 (「中高年・リスクあり」): 高い年齢層に属し、高血圧や高コレステロールの割合が高い。中程度の糖尿病率を持つ。
クラスタ2 (「高リスク・健康課題」): 最も高い糖尿病率を持ち、BMIや心臓病のリスクも高い。身体活動が低く、健康上の課題が多い。
クラスタ3 (「アクセス限定・中リスク」): 一般的な健康状態は中程度で、医療へのアクセスが限られている(NoDocbcCostが高い)ことが特徴。中程度の糖尿病率。
クラスタ4 (「若年・喫煙者」): 比較的若い年齢層で喫煙率が高い。糖尿病の割合は比較的低いが、長期的な健康リスクを持つ可能性がある。
このような方法で、基本的な分析からクラスタリングをすることで、データの内容がわかりやすくなったと思います。
0 件のコメント:
コメントを投稿