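In this post we run a basic data analysis on trip data from Divvy Bikes, the bike-sharing service operated in the city of Chicago.
Download the data from the following link.
https://divvybikes.com/system-data
From "Download Divvy trip history data", go to the download page and get "202311-divvy-tripdata.zip", the latest file at the time of writing.
Extracting the archive gives "202311-divvy-tripdata.csv", which is the file we analyze.
As in earlier posts, the code used here was obtained from GPT.
[Importing the data, computing basic statistics, and visualizing]
First, load the CSV.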
import pandas as pd

# Load the CSV file
file_path_tripdata = '202311-divvy-tripdata.csv'
divvy_tripdata = pd.read_csv(file_path_tripdata)

# Display the first few rows of the dataframe to understand its structure and content
divvy_tripdata.head()
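The section heading mentions basic statistics, so here is a minimal sketch of how ride durations could be summarized. It runs on a tiny made-up sample that only borrows the column names from the Divvy CSV; the `duration_min` column and the sample values are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Tiny hypothetical sample that reuses the Divvy column names.
sample = pd.DataFrame({
    "ride_id": ["A1", "B2", "C3"],
    "started_at": ["2023-11-01 08:00:00", "2023-11-01 09:15:00", "2023-11-02 17:30:00"],
    "ended_at": ["2023-11-01 08:12:00", "2023-11-01 09:45:00", "2023-11-02 17:36:00"],
    "member_casual": ["member", "casual", "member"],
})

# Convert the timestamp strings to datetimes so they can be subtracted.
sample["started_at"] = pd.to_datetime(sample["started_at"])
sample["ended_at"] = pd.to_datetime(sample["ended_at"])

# duration_min is a hypothetical helper column: ride length in minutes.
sample["duration_min"] = (sample["ended_at"] - sample["started_at"]).dt.total_seconds() / 60

# Count, mean, standard deviation, min, quartiles, and max of the durations.
duration_stats = sample["duration_min"].describe()
print(duration_stats)

# Number of missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

On the real file, the same steps would apply to `divvy_tripdata` once `started_at` and `ended_at` are converted with `pd.to_datetime`.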
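ride_id: Unique identifier assigned to each ride.
rideable_type: Type of bike used.
started_at: Date and time the ride started.
ended_at: Date and time the ride ended.
start_station_name: Name of the station where the ride started.
start_station_id: ID of the station where the ride started.
end_station_name: Name of the station where the ride ended.
end_station_id: ID of the station where the ride ended.
start_lat: Latitude where the ride started.
start_lng: Longitude where the ride started.
end_lat: Latitude where the ride ended.
end_lng: Longitude where the ride ended.
member_casual: Whether the rider is a member or a casual (non-member) user.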
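We plot histograms of latitude and longitude to see where rides are concentrated. The histograms show that most rides fall within a narrow geographic range. The mean start and end coordinates (the dashed red lines) are also very close to each other, which suggests that many rides end in the same general area where they start.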
import matplotlib.pyplot as plt

# Creating separate histograms for each of the latitude and longitude columns
df = divvy_tripdata
plt.figure(figsize=(12, 8))

# Start Latitude
plt.subplot(2, 2, 1)
plt.hist(df['start_lat'], bins=30, color='skyblue', edgecolor='black')
plt.axvline(df['start_lat'].mean(), color='red', linestyle='dashed', linewidth=1)
plt.title('Start Latitude')
plt.xlabel('Start Latitude')
plt.ylabel('Frequency')
plt.grid(True)

# Start Longitude
plt.subplot(2, 2, 2)
plt.hist(df['start_lng'], bins=30, color='green', edgecolor='black')
plt.axvline(df['start_lng'].mean(), color='red', linestyle='dashed', linewidth=1)
plt.title('Start Longitude')
plt.xlabel('Start Longitude')
plt.grid(True)

# End Latitude
plt.subplot(2, 2, 3)
plt.hist(df['end_lat'].dropna(), bins=30, color='blue', edgecolor='black')  # Drop NA values for end_lat
plt.axvline(df['end_lat'].mean(), color='red', linestyle='dashed', linewidth=1)
plt.title('End Latitude')
plt.xlabel('End Latitude')
plt.ylabel('Frequency')
plt.grid(True)

# End Longitude
plt.subplot(2, 2, 4)
plt.hist(df['end_lng'].dropna(), bins=30, color='purple', edgecolor='black')  # Drop NA values for end_lng
plt.axvline(df['end_lng'].mean(), color='red', linestyle='dashed', linewidth=1)
plt.title('End Longitude')
plt.xlabel('End Longitude')
plt.grid(True)

plt.tight_layout()
plt.show()
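Close averages alone do not show that each individual ride stays local. One way to check this per ride would be to compute the start-to-end distance. The sketch below uses the standard haversine formula on made-up coordinates; the `haversine_km` helper and the `distance_km` column are hypothetical names, not part of the original code.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between points given in degrees."""
    earth_radius_km = 6371.0  # mean Earth radius
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 2 * earth_radius_km * np.arcsin(np.sqrt(a))

# Made-up rides using the Divvy coordinate column names.
rides = pd.DataFrame({
    "start_lat": [41.88, 41.90, 41.95],
    "start_lng": [-87.63, -87.65, -87.70],
    "end_lat": [41.88, 41.91, np.nan],   # the last ride has no recorded end point
    "end_lng": [-87.63, -87.64, np.nan],
})

# Straight-line distance per ride; rows with missing coordinates stay NaN.
rides["distance_km"] = haversine_km(
    rides["start_lat"], rides["start_lng"], rides["end_lat"], rides["end_lng"]
)
print(rides["distance_km"].describe())
```

A histogram of `distance_km` on the real data would show directly how far rides travel. A distance of zero means the ride ended at its starting coordinates, for example a round trip to the same station.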
Next, we check how often each value appears in the categorical columns ("rideable_type" and "member_casual").
# Creating bar charts of value counts for each categorical column
plt.figure(figsize=(12, 6))

# Rideable Type
plt.subplot(1, 2, 1)
df['rideable_type'].value_counts().plot(kind='bar', color='skyblue')
plt.title('Rideable Type Frequency')
plt.xlabel('Rideable Type')
plt.ylabel('Frequency')
plt.xticks(rotation=0)
plt.grid(axis='y')

# Member Casual
plt.subplot(1, 2, 2)
df['member_casual'].value_counts().plot(kind='bar', color='green')
plt.title('Member vs Casual Frequency')
plt.xlabel('Member Type')
plt.ylabel('Frequency')
plt.xticks(rotation=0)
plt.grid(axis='y')

plt.tight_layout()
plt.show()
Rideable Type: This chart shows the distribution of bike types used. There are two bike types, and one is used noticeably more often than the other.
Member vs Casual: This chart shows the split between members and casual (non-member) riders. One group clearly accounts for more rides than the other.
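To put numbers on "used more often", the counts can be converted to shares. The following is a minimal sketch on made-up rows; the values are illustrative only and do not reflect the actual Divvy proportions.

```python
import pandas as pd

# Made-up rows using the two categorical column names from the dataset.
rides = pd.DataFrame({
    "rideable_type": ["electric_bike", "classic_bike", "electric_bike", "electric_bike"],
    "member_casual": ["member", "member", "casual", "member"],
})

# normalize=True returns the fraction of rows per category instead of raw counts.
type_share = rides["rideable_type"].value_counts(normalize=True)
member_share = rides["member_casual"].value_counts(normalize=True)

# Print as percentages rounded to one decimal place.
print((type_share * 100).round(1))
print((member_share * 100).round(1))
```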
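Next, we look at the stations. The charts show the busiest values of "start_station_name" and "end_station_name": stations are sorted by ride count in descending order, and the ones at the top that together account for the first 20% of all rides are kept.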
# Selecting the busiest stations that together account for the first 20% of all rides
# Sorting the station counts
sorted_start_stations = df['start_station_name'].value_counts().sort_values(ascending=False)
sorted_end_stations = df['end_station_name'].value_counts().sort_values(ascending=False)

# Calculating the cumulative share of rides
cumulative_percentage_start = sorted_start_stations.cumsum() / sorted_start_stations.sum()
cumulative_percentage_end = sorted_end_stations.cumsum() / sorted_end_stations.sum()

# Keeping stations while the cumulative share stays at or below 20%
top_20_start_stations = sorted_start_stations[cumulative_percentage_start <= 0.20]
top_20_end_stations = sorted_end_stations[cumulative_percentage_end <= 0.20]

plt.figure(figsize=(15, 10))

# Plotting for start stations
plt.subplot(2, 1, 1)
top_20_start_stations.plot(kind='bar', color='skyblue')
plt.title('Top 20% of Start Stations by Count')
plt.xlabel('Station Name')
plt.ylabel('Frequency')
plt.xticks(rotation=90)
plt.grid(axis='y')

# Plotting for end stations
plt.subplot(2, 1, 2)
top_20_end_stations.plot(kind='bar', color='green')
plt.title('Top 20% of End Stations by Count')
plt.xlabel('Station Name')
plt.ylabel('Frequency')
plt.xticks(rotation=90)
plt.grid(axis='y')

plt.tight_layout()
plt.show()
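A cumulative-share cutoff and a "top N% of stations by rank" cutoff select different sets, so it is worth being explicit about which one is meant. The sketch below contrasts the two on made-up station counts; the station names and the 75% threshold are assumptions chosen so the toy example returns something.

```python
import pandas as pd

# Made-up rides: station A has 50 starts, B 20, C 15, D 10, E 5.
starts = pd.Series(
    ["A"] * 50 + ["B"] * 20 + ["C"] * 15 + ["D"] * 10 + ["E"] * 5,
    name="start_station_name",
)

# value_counts() already returns counts sorted in descending order.
counts = starts.value_counts()

# Approach 1: keep stations while the cumulative share of rides is <= threshold.
cumulative_share = counts.cumsum() / counts.sum()
by_cumulative_share = counts[cumulative_share <= 0.75]

# Approach 2: keep the top 20% of stations by rank, regardless of ride share.
n_top = max(1, int(len(counts) * 0.20))
by_station_rank = counts.head(n_top)

print(list(by_cumulative_share.index))
print(list(by_station_rank.index))
```

Note that with the cumulative approach, if the single busiest station already exceeds the threshold, the result is empty. That is unlikely with many stations, but it is a property of the method.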
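This chart shows the top stations by total ride count (starts plus ends), keeping the stations that together account for the first 20% of that total. For each station, rides that started there (sky blue) and rides that ended there (green) are stacked.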
# Calculating total rides for each station (both start and end)
start_counts = df['start_station_name'].value_counts()
end_counts = df['end_station_name'].value_counts()
# fill_value=0 keeps stations that appear only as a start or only as an end
total_rides_per_station = start_counts.add(end_counts, fill_value=0)

# Sorting stations by total rides and keeping those within the first 20% of cumulative rides
sorted_total_rides = total_rides_per_station.sort_values(ascending=False)
cumulative_percentage_total = sorted_total_rides.cumsum() / sorted_total_rides.sum()
top_20_total_stations = sorted_total_rides[cumulative_percentage_total <= 0.20]

# Separating the counts for start and end stations for the selected stations
top_20_start_counts = start_counts.reindex(top_20_total_stations.index, fill_value=0)
top_20_end_counts = end_counts.reindex(top_20_total_stations.index, fill_value=0)

# Plotting stacked bar chart
plt.figure(figsize=(15, 10))
top_20_start_counts.plot(kind='bar', color='skyblue', label='Start Station')
top_20_end_counts.plot(kind='bar', color='green', label='End Station', bottom=top_20_start_counts)
plt.title('Top 20% Stations by Total Rides (Start and End)')
plt.xlabel('Station Name')
plt.ylabel('Total Rides')
plt.xticks(rotation=90)
plt.legend()
plt.grid(axis='y')
plt.tight_layout()
plt.show()
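When start counts and end counts are combined, index alignment matters: a station that appears in only one of the two series becomes NaN under plain `+`. The sketch below shows the difference on made-up rides; the station names are illustrative.

```python
import pandas as pd

# Made-up rides: station C is only ever a start, never an end.
rides = pd.DataFrame({
    "start_station_name": ["A", "A", "B", "C"],
    "end_station_name": ["A", "B", "B", "B"],
})

start_counts = rides["start_station_name"].value_counts()
end_counts = rides["end_station_name"].value_counts()

# Plain addition aligns on the index; a label missing on one side yields NaN.
naive_total = start_counts + end_counts

# Series.add with fill_value=0 treats the missing side as zero instead.
safe_total = start_counts.add(end_counts, fill_value=0)

print(naive_total)
print(safe_total)
```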
This chart shows, for the top start stations ranked by total ride count, how many rides used electric bikes (sky blue) and how many used classic bikes (green).
# Calculating the total number of rides for each station and bike type combination
electric_bike_counts = df[df['rideable_type'] == 'electric_bike'].groupby('start_station_name')['ride_id'].count()
classic_bike_counts = df[df['rideable_type'] == 'classic_bike'].groupby('start_station_name')['ride_id'].count()

# Combining the counts for each station
combined_bike_counts = electric_bike_counts.add(classic_bike_counts, fill_value=0)

# Sorting stations by total rides and keeping those within the first 20% of cumulative rides
sorted_combined_counts = combined_bike_counts.sort_values(ascending=False)
cumulative_percentage_bikes = sorted_combined_counts.cumsum() / sorted_combined_counts.sum()
top_stations_bikes = sorted_combined_counts[cumulative_percentage_bikes <= 0.20]

# Separating the counts for electric and classic bikes for the top stations
# reindex avoids a KeyError when a station has no rides of one bike type
top_electric_counts = electric_bike_counts.reindex(top_stations_bikes.index, fill_value=0)
top_classic_counts = classic_bike_counts.reindex(top_stations_bikes.index, fill_value=0)

# Plotting stacked bar chart
plt.figure(figsize=(15, 10))
top_electric_counts.plot(kind='bar', color='skyblue', label='Electric Bike')
top_classic_counts.plot(kind='bar', color='green', label='Classic Bike', bottom=top_electric_counts)
plt.title('Top Stations by Bike Type Usage (Electric and Classic)')
plt.xlabel('Station Name')
plt.ylabel('Total Rides')
plt.xticks(rotation=90)
plt.legend()
plt.grid(axis='y')
plt.tight_layout()
plt.show()
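As an alternative to filtering once per bike type, the per-station breakdown can be built in one step with `groupby` and `unstack`. This is a sketch on made-up rows, not the code used for the chart above; `type_by_station` is a hypothetical name.

```python
import pandas as pd

# Made-up rides using the dataset's column names.
rides = pd.DataFrame({
    "start_station_name": ["A", "A", "A", "B", "B", "C"],
    "rideable_type": [
        "electric_bike", "classic_bike", "electric_bike",
        "classic_bike", "classic_bike", "electric_bike",
    ],
})

# One row per station, one column per bike type; missing combinations become 0.
type_by_station = (
    rides.groupby(["start_station_name", "rideable_type"])
    .size()
    .unstack(fill_value=0)
)

# Order stations by total rides, busiest first.
station_order = type_by_station.sum(axis=1).sort_values(ascending=False).index
type_by_station = type_by_station.loc[station_order]

print(type_by_station)
```

Calling `type_by_station.plot(kind='bar', stacked=True)` on a table like this draws the stacked chart without passing `bottom` by hand.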
The next chart shows, for the top start stations ranked by total ride count, how many rides were taken by members (sky blue) and how many by casual riders (green).
# Calculating the total number of rides for each station and member type combination
member_counts = df[df['member_casual'] == 'member'].groupby('start_station_name')['ride_id'].count()
casual_counts = df[df['member_casual'] == 'casual'].groupby('start_station_name')['ride_id'].count()

# Combining the counts for each station
combined_member_counts = member_counts.add(casual_counts, fill_value=0)

# Sorting stations by total rides and keeping those within the first 20% of cumulative rides
sorted_combined_member_counts = combined_member_counts.sort_values(ascending=False)
cumulative_percentage_members = sorted_combined_member_counts.cumsum() / sorted_combined_member_counts.sum()
top_stations_members = sorted_combined_member_counts[cumulative_percentage_members <= 0.20]

# Separating the counts for member and casual for the top stations
# reindex avoids a KeyError when a station has no rides from one rider type
top_member_counts = member_counts.reindex(top_stations_members.index, fill_value=0)
top_casual_counts = casual_counts.reindex(top_stations_members.index, fill_value=0)

# Plotting stacked bar chart
plt.figure(figsize=(15, 10))
top_member_counts.plot(kind='bar', color='skyblue', label='Member')
top_casual_counts.plot(kind='bar', color='green', label='Casual', bottom=top_member_counts)
plt.title('Top Stations by Member Type Usage (Member and Casual)')
plt.xlabel('Station Name')
plt.ylabel('Total Rides')
plt.xticks(rotation=90)
plt.legend()
plt.grid(axis='y')
plt.tight_layout()
plt.show()
Some stations also show a high share of rides by casual (non-member) users.
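To find such stations systematically rather than by eye, the casual share per station can be computed with `pd.crosstab`. The sketch below uses made-up rows; the `min_rides` threshold is a hypothetical parameter for ignoring stations with too few rides to give a meaningful ratio.

```python
import pandas as pd

# Made-up rides using the dataset's column names.
rides = pd.DataFrame({
    "start_station_name": ["A", "A", "A", "A", "B", "B", "C", "C"],
    "member_casual": [
        "member", "member", "member", "casual",
        "casual", "casual", "member", "casual",
    ],
})

# normalize="index" turns each station's row into fractions that sum to 1.
share = pd.crosstab(rides["start_station_name"], rides["member_casual"], normalize="index")

# Hypothetical threshold: only consider stations with at least this many rides.
min_rides = 2
ride_counts = rides["start_station_name"].value_counts()
busy_stations = ride_counts[ride_counts >= min_rides].index

# Casual share for the remaining stations, highest first.
casual_share = share.loc[busy_stations, "casual"].sort_values(ascending=False)
print(casual_share)
```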