Python Data Analysis

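The data analysis process, from question to result:

1. Define the problem (requirement): every analysis starts with a problem to solve or a need to meet.
2. Data collection: gather the relevant data. Data comes in three main forms:

Structured data: every record has fixed fields, format, and order, so programs can read and analyze it easily. Examples: CSV, Excel, database tables.
Semi-structured data: sits between structured and unstructured data. Records have fields that can be looked up, but a consistent format is not guaranteed. Examples: XML, JSON.
Unstructured data: has no fixed format and must be organized before it can be accessed. Examples: plain text, web pages, images, audio, video.

3. Data processing / data cleaning: collected data usually needs processing first, such as type conversion and cleaning out dirty data (invalid or empty values).
4. Build a predictive model: build a predictive model from the processed data.

Predictive model: the four kinds of questions analysis can answer:
Description: what happened? (Example: how many customers did we lose last year?)
Diagnosis: why did it happen? (Example: why did the churn rate rise last year?)
Prediction: what will happen? (Example: which customers are most likely to leave?)
Prescription: what should we do? (Example: how can we prevent customer churn?)

5. Validate the model: check whether the predictive model is accurate.
6. Release and use: if the model holds up, publish it and put it to real use.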

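Reading and writing files in Python

Reading, writing, and processing common data formats

CSV files (Comma-Separated Values):
CSV is one of the most common exchange formats between spreadsheets and databases. It is plain text with fields separated by commas.

Sample CSV:
time, speed, height
0.1, 0, 10
0.2, 7.535, 20

First, import csv.

Open the CSV file with open(), usually inside a with statement so that the file is closed properly:
with open(filename) as csvFile:
open() modes:
r: read mode (default).
w: write mode; truncates an existing file before writing, or creates the file if it does not exist.
x: exclusive creation; creates and writes the file only if it does not already exist, otherwise raises FileExistsError.
a: append mode; adds data at the end of the file.
b: binary mode, for non-text data such as images, audio, and video.
t: text mode (default).
+: update mode; opens the file for both reading and writing.
Without a with statement, you must call f.close() yourself once you finish writing.

csv.reader(): creates a reader object that you can convert to a list or iterate over row by row with a for loop.
csv.DictReader(): creates a DictReader object that returns each row as a dictionary (a plain dict since Python 3.8, an OrderedDict in Python 3.6 and 3.7), so values can be looked up by column name, which is more readable.

Writing CSV files in Python:
Using the standard csv module:
Create a writer object: outWriter = csv.writer(csvFile).
Use outWriter.writerow() to write a list as one row. Pass newline='' to open() (not to writerow()) to avoid a blank line between rows in the output.
csv.DictWriter(): writes dictionaries. First set fieldnames to a list of the dictionary keys, then call writeheader() to write the header row and writerow() to write each dictionary.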
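The reader and writer classes described above can be sketched end to end as follows. This is a minimal illustration, not the only way to do it: the file name flight.csv, the temporary directory, and the sample rows are assumptions made up for the example, modeled on the sample CSV shown earlier.

```python
import csv
import os
import tempfile

# Hypothetical sample rows; every CSV value is handled as a string
rows = [
    {"time": "0.1", "speed": "0", "height": "10"},
    {"time": "0.2", "speed": "7.535", "height": "20"},
]

# Write into a throwaway directory so the example leaves nothing behind
path = os.path.join(tempfile.mkdtemp(), "flight.csv")

# newline="" is an argument of open(); it prevents blank lines between rows
with open(path, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=["time", "speed", "height"])
    writer.writeheader()    # header row built from fieldnames
    writer.writerows(rows)  # one CSV row per dictionary

# csv.reader yields each row as a list of strings (header included)
with open(path, newline="", encoding="utf-8") as csv_file:
    plain_rows = list(csv.reader(csv_file))

# csv.DictReader uses the header row as keys
with open(path, newline="", encoding="utf-8") as csv_file:
    dict_rows = list(csv.DictReader(csv_file))

print(plain_rows[0])
print(dict_rows[1]["speed"])
```

Note that both readers return strings; converting "7.535" to a float is left to the caller.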

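JSON files (JavaScript Object Notation)
JSON has two main structures:
Object: written with braces { } and stored as key:value pairs. Keys must be strings in double quotes; values can be numbers, strings, booleans, arrays, objects, or null. JSON documents cannot contain comments.
Array: written with brackets [ ] and made up of a sequence of values separated by commas.

How Python types map to JSON:
json.dumps(): converts Python data to a JSON-formatted string.
Python dict becomes a JSON object.
Python list and tuple become a JSON array.
Python str becomes a JSON string (Python 3 has no separate unicode type).
Python int and float become a JSON number.
Python True, False, and None become JSON true, false, and null.

json.loads(): converts a JSON-formatted string to Python data types. The mapping is the reverse of dumps(), except that a JSON array always becomes a list.
json.dump(): writes Python data to a file in JSON format (the extension is usually .json).
json.load(): reads a JSON file and converts it to Python data.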

# Work with JSON strings using Python's built-in json module
import json

dict_obj = {'a': 25, 'b': 80, 'c': 60}
json_string = json.dumps(dict_obj, sort_keys=True, indent=4)
print("\nDictionary converted to a JSON string with json.dumps (indented):")
print(json_string)
Dictionary converted to a JSON string with json.dumps (indented):
{
    "a": 25,
    "b": 80,
    "c": 60
}

# Convert the JSON string back to a Python dictionary
parsed_dict = json.loads(json_string)
print("\nJSON string converted back to a dictionary with json.loads:")
print(parsed_dict)
print(f"Type: {type(parsed_dict)}")
JSON string converted back to a dictionary with json.loads:
{'a': 25, 'b': 80, 'c': 60}
Type: <class 'dict'>
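The examples above cover the string functions dumps() and loads(). A minimal sketch of the file-based pair, json.dump() and json.load(), might look like this. The file name scores.json and the sample dictionary are assumptions for illustration only.

```python
import json
import os
import tempfile

# Hypothetical data mixing the types from the mapping table
scores = {"a": 25, "b": 80, "tags": ["x", "y"], "ok": True, "note": None}
path = os.path.join(tempfile.mkdtemp(), "scores.json")

# json.dump writes straight to an open text file
with open(path, "w", encoding="utf-8") as f:
    json.dump(scores, f, ensure_ascii=False, indent=4)

# json.load reads the file back into Python objects
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == scores)
# A tuple is written as a JSON array, so it comes back as a list
print(json.loads(json.dumps((1, 2))))
```

The round trip preserves dict, list, str, numbers, booleans, and None, but not the tuple type.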

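Pandas (Python Data Analysis Library)

Pandas (Python Data Analysis Library) is built on top of NumPy and provides higher-level data structures, mainly Series (a one-dimensional labeled array) and DataFrame (a two-dimensional tabular data structure).
Features provided by Pandas include:

Input and output for many data formats (CSV, Excel, JSON, and others)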

import pandas as pd
import json

# Build a simple DataFrame
df_sample = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Paris']
})
csv_filename = 'sample_data.csv'

CSV files
Reading CSV with Pandas: use the read_csv() function.
Writing CSV with Pandas: use the to_csv() method.

# Write the CSV file first so that there is something to read
df_sample.to_csv(csv_filename, index=False) # index=False leaves out the row index
print(f"DataFrame written to {csv_filename}")

# Read the CSV file
df_read_csv = pd.read_csv(csv_filename)
print("\nDataFrame read from CSV:")
print(df_read_csv)
DataFrame read from CSV:
      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   35     Paris

JSON files
Reading JSON with Pandas: use the read_json() function.
Writing JSON with Pandas: use the to_json() method.

# Write the JSON file
json_filename = 'sample_data.json'
# orient='records' stores the rows as a list of records
df_sample.to_json(json_filename, orient='records', indent=4)
print(f"\nDataFrame written to {json_filename}")

# Read the JSON file
df_read_json = pd.read_json(json_filename)
print("\nDataFrame read from JSON:")
print(df_read_json)
DataFrame read from JSON:
Name Age City
0 Alice 25 New York
1 Bob 30 London
2 Charlie 35 Paris

Excel files
Reading Excel with Pandas: use the read_excel() function.
Writing Excel with Pandas: use the to_excel() method.

Extracting and combining data

import pandas as pd

data = {'col1': ['1', '2', '3', 'abc'], 'col2': [1.1, 2.2, 3.3, 4.4]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print(df.dtypes)

Original DataFrame:
col1 col2
0 1 1.1
1 2 2.2
2 3 3.3
3 abc 4.4
col1 object
col2 float64
dtype: object

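Type conversion (.astype(), pd.to_numeric())
.astype(): the general-purpose conversion method. It converts a Series or DataFrame to the given type (for example 'int', 'float', or 'str') and raises an error if the conversion fails.

pd.to_numeric(): converts values specifically to numeric types. Its errors parameter controls how non-numeric strings are handled; for example, errors='coerce' replaces values that cannot be converted with NaN, which is especially useful for messy numeric strings.

# Try to convert 'col1' to integers with .astype() (fails because of 'abc')
try:
    df['col1_int'] = df['col1'].astype(int)
except ValueError as e:
    print(f"\nConverting 'col1' with .astype(int) failed: {e}")
Converting 'col1' with .astype(int) failed: invalid literal for int() with base 10: 'abc'

# Use pd.to_numeric() and handle the errors
df['col1_numeric'] = pd.to_numeric(df['col1'], errors='coerce')
print("\nConverting 'col1' with pd.to_numeric (errors='coerce'):")
print(df)
print(df.dtypes)
Converting 'col1' with pd.to_numeric (errors='coerce'):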
col1 col2 col1_numeric
0 1 1.1 1.0
1 2 2.2 2.0
2 3 3.3 3.0
3 abc 4.4 NaN
col1 object
col2 float64
col1_numeric float64
dtype: object

# Convert 'col2' to integers (astype(int) truncates the fractional part)
df['col2_int'] = df['col2'].astype(int)
print("\n'col2' converted to integers:")
print(df)
print(df.dtypes)
'col2' converted to integers:
col1 col2 col1_numeric col2_int
0 1 1.1 1.0 1
1 2 2.2 2.0 2
2 3 3.3 3.0 3
3 abc 4.4 NaN 4
col1 object
col2 float64
col1_numeric float64
col2_int int64
dtype: object

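Series

A data structure for one-dimensional data. The data can be integers, strings, floating-point numbers, Python objects, and so on.
Every element has an associated label, called the index. By default a RangeIndex starting at 0 is created.
Accessing elements: by position or by label. The iloc attribute selects by integer position, and loc selects by label.
Attributes: values (the data), index (the row labels), dtype (the data type), name (the name of the Series; the index has its own name attribute).
Filtering and arithmetic: as in NumPy, a Series supports filtering and arithmetic directly.
import pandas as pd

# Create a Series from a list
s1 = pd.Series([5, 6, 7, 8, 9, 10])
print("Series created from a list (automatic index):")
print(s1)
print(f"Index: {s1.index}")
print(f"Values: {s1.values}\n")
Series created from a list (automatic index):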
0 5
1 6
2 7
3 8
4 9
5 10
dtype: int64
Index: RangeIndex(start=0, stop=6, step=1)
Values: [ 5  6  7  8  9 10]

# Create a Series from a list with an explicit index
s2 = pd.Series([5, 6, 7, 8, 9, 10], index=['a', 'b', 'c', 'd', 'e', 'f'])
print("Series with an explicit index:")
print(s2)
print(f"Value at index 'c': {s2['c']}\n")
Series with an explicit index:
a 5
b 6
c 7
d 8
e 9
f 10
dtype: int64
Value at index 'c': 7

# Create a Series from a dictionary
s3 = pd.Series({'apple': 100, 'banana': 150, 'orange': 120})
print("Series created from a dictionary:")
print(s3)
Series created from a dictionary:
apple 100
banana 150
orange 120
dtype: int64
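The notes above mention filtering, arithmetic, and position-based access with iloc without showing them. A short sketch, reusing the same six values under an assumed variable name s:

```python
import pandas as pd

s = pd.Series([5, 6, 7, 8, 9, 10], index=list("abcdef"))

# Filtering: a comparison yields a boolean Series used as a mask
over_seven = s[s > 7]

# Arithmetic is applied element by element and keeps the index
doubled = s * 2

# iloc selects by integer position, loc selects by label
third_by_position = s.iloc[2]
third_by_label = s.loc["c"]

print(over_seven)
print(doubled["a"], third_by_position, third_by_label)
```

Both third_by_position and third_by_label refer to the same element here, because label "c" sits at position 2.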

DataFrame

A DataFrame has row labels (index) and column labels (columns), and each column is essentially a Series object.
If a column has no corresponding data when the DataFrame is created, it is filled with NaN.

import pandas as pd

# Create a DataFrame from a dictionary
data = {
    'city': ['Taipei', 'London', 'Paris', 'Tokyo', 'New York'],
    'population': [2.7, 8.9, 2.1, 13.9, 8.6],
    'country': ['Taiwan', 'UK', 'France', 'Japan', 'USA']
}
df = pd.DataFrame(data)
print("DataFrame created from a dictionary:")
print(df)
print(f"\nColumn labels (columns): {df.columns}")
print(f"Row labels (index): {df.index}\n")
DataFrame created from a dictionary:
city population country
0 Taipei 2.7 Taiwan
1 London 8.9 UK
2 Paris 2.1 France
3 Tokyo 13.9 Japan
4 New York 8.6 USA

Column labels (columns): Index(['city', 'population', 'country'], dtype='object')
Row labels (index): RangeIndex(start=0, stop=5, step=1)

# Specify the row index
df_indexed = pd.DataFrame(data, index=['TP', 'LD', 'PR', 'TK', 'NY'])
print("DataFrame with an explicit row index:")
print(df_indexed)
print(f"\nNew row labels (index): {df_indexed.index}")
print(f"Column labels (columns): {df_indexed.columns}\n")
DataFrame with an explicit row index:
city population country
TP Taipei 2.7 Taiwan
LD London 8.9 UK
PR Paris 2.1 France
TK Tokyo 13.9 Japan
NY New York 8.6 USA

New row labels (index): Index(['TP', 'LD', 'PR', 'TK', 'NY'], dtype='object')
Column labels (columns): Index(['city', 'population', 'country'], dtype='object')
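The statement that a column without data is filled with NaN can be seen with a small sketch. The column name area is a made-up example of a column that the source dictionary does not contain.

```python
import pandas as pd

data = {"city": ["Taipei", "London"], "population": [2.7, 8.9]}

# "area" is requested as a column but has no data in the dictionary,
# so pandas creates it and fills it with NaN
df_extra = pd.DataFrame(data, columns=["city", "population", "area"])

print(df_extra)
# Each column pulled out of a DataFrame is a Series
print(type(df_extra["city"]))
```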

Selecting data from a DataFrame (.loc and .iloc)
Pandas provides powerful selection tools, of which .loc and .iloc are the most commonly used.
df.loc[] (label-based indexer): selects by label. You can pick specific row labels and column labels, and a slice includes its end label.
df.iloc[] (integer-position-based indexer): selects by integer position. You pick rows and columns by position, and a slice excludes its end position.

import pandas as pd

data = {
    'city': ['Taipei', 'London', 'Paris', 'Tokyo', 'New York'],
    'population': [2.7, 8.9, 2.1, 13.9, 8.6], # in millions
    'country': ['Taiwan', 'UK', 'France', 'Japan', 'USA']
}
df = pd.DataFrame(data, index=['TP', 'LD', 'PR', 'TK', 'NY'])
print("Original DataFrame:")
print(df)
Original DataFrame:
        city  population country
TP    Taipei         2.7  Taiwan
LD    London         8.9      UK
PR     Paris         2.1  France
TK     Tokyo        13.9   Japan
NY  New York         8.6     USA

# Select a single row with .loc (by label)
print("\nRow 'LD' selected with .loc:")
print(df.loc['LD'])
Row 'LD' selected with .loc:
city          London
population       8.9
country           UK
Name: LD, dtype: object

# Select several rows with .loc (by a list of labels)
print("\nRows 'TP' and 'TK' selected with .loc:")
print(df.loc[['TP', 'TK']])
Rows 'TP' and 'TK' selected with .loc:
      city  population country
TP  Taipei         2.7  Taiwan
TK   Tokyo        13.9   Japan

# Select specific rows and specific columns with .loc
print("\nColumns 'city' and 'population' of rows 'LD' and 'PR' selected with .loc:")
print(df.loc[['LD', 'PR'], ['city', 'population']])
Columns 'city' and 'population' of rows 'LD' and 'PR' selected with .loc:
      city  population
LD  London         8.9
PR   Paris         2.1

# Slice with .loc (the end label is included)
print("\nRow slice from 'TP' to 'TK' with .loc (inclusive):")
print(df.loc['TP':'TK'])
Row slice from 'TP' to 'TK' with .loc (inclusive):
      city  population country
TP  Taipei         2.7  Taiwan
LD  London         8.9      UK
PR   Paris         2.1  France
TK   Tokyo        13.9   Japan

# Conditional selection with a boolean array
print("\nCities with population above 5 (million) selected with .loc:")
print(df.loc[df['population'] > 5])
Cities with population above 5 (million) selected with .loc:
        city  population country
LD    London         8.9      UK
TK     Tokyo        13.9   Japan
NY  New York         8.6     USA

# Select a single row with .iloc (by position)
print("\nRow 0 (the first row) selected with .iloc:")
print(df.iloc[0])
Row 0 (the first row) selected with .iloc:
city          Taipei
population       2.7
country       Taiwan
Name: TP, dtype: object

# Select several rows with .iloc (by a list of positions)
print("\nRows 1 and 3 selected with .iloc:")
print(df.iloc[[1, 3]])
Rows 1 and 3 selected with .iloc:
      city  population country
LD  London         8.9      UK
TK   Tokyo        13.9   Japan

# Select specific rows and specific columns with .iloc (by position)
print("\nRows 0 and 2, columns 0 and 1 selected with .iloc:")
print(df.iloc[[0, 2], [0, 1]])
Rows 0 and 2, columns 0 and 1 selected with .iloc:
      city  population
TP  Taipei         2.7
PR   Paris         2.1

# Slice with .iloc (the end position is excluded)
print("\nRow slice from position 1 up to 3 with .iloc (3 excluded):")
print(df.iloc[1:3])
Row slice from position 1 up to 3 with .iloc (3 excluded):
      city  population country
LD  London         8.9      UK
PR   Paris         2.1  France

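Missing values (NaN, Not a Number) and basic arithmetic

Basic arithmetic: DataFrames support the +, -, *, and / operators directly, aligning on row and column labels. Positions that exist in only one operand become NaN. The method forms add(), sub(), mul(), and div() accept fill_value, which substitutes a value for the missing side before computing.

Adding and dropping columns:
Add: assign to a new column name.
Drop: use the df.drop() method.

Handling NaN: Pandas provides convenient methods for missing values:
dropna(): drops the rows or columns that contain NaN, depending on the options. By default it does not modify the original DataFrame; use inplace=True to modify it in place.
fillna(): fills NaN with a given value.
Fill with a fixed value: df.fillna(value)
Forward fill: df.ffill() fills with the previous valid value.
Backward fill: df.bfill() fills with the next valid value.
Limit the fill: df.ffill(limit=n) fills at most n consecutive NaN values.
The older spelling df.fillna(method='ffill') is deprecated in recent pandas versions (since 2.1).
isnull(): tests each element for NaN and returns a boolean result of the same shape.
notnull(): the opposite of isnull().

import pandas as pd
import numpy as np

# Build sample DataFrames
df1 = pd.DataFrame(np.arange(6).reshape(2, 3), columns=list('xyz'))
df2 = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list('wxyz'))

print("df1:")
print(df1)
print("\ndf2:")
print(df2)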
df1:
x y z
0 0 1 2
1 3 4 5

df2:
w x y z
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11

# Element-wise addition (positions without a match in both operands become NaN)
print("\ndf1 + df2:")
print(df1 + df2) # unmatched rows and columns produce NaN
df1 + df2:
    w    x     y     z
0 NaN  1.0   3.0   5.0
1 NaN  8.0  10.0  12.0
2 NaN  NaN   NaN   NaN

# Handling missing values (NaN)
df_with_nan = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, 6, 7, 8],
    'C': [9, 10, 11, np.nan]
})
print("\nDataFrame containing NaN:")
print(df_with_nan)
DataFrame containing NaN:
     A    B     C
0  1.0  NaN   9.0
1  2.0  6.0  10.0
2  NaN  7.0  11.0
3  4.0  8.0   NaN

# Detect NaN
print("\nDetecting NaN with isnull():")
print(df_with_nan.isnull())
Detecting NaN with isnull():
       A      B      C
0  False   True  False
1  False  False  False
2   True  False  False
3  False  False   True

# Fill NaN
# Fill every NaN with 0
print("\nFilling NaN with 0:")
print(df_with_nan.fillna(0))
Filling NaN with 0:
     A    B     C
0  1.0  0.0   9.0
1  2.0  6.0  10.0
2  0.0  7.0  11.0
3  4.0  8.0   0.0

# Forward fill: use the previous valid value (B in row 0 stays NaN because nothing precedes it)
print("\nFilling NaN with ffill (forward fill):")
print(df_with_nan.ffill())
Filling NaN with ffill (forward fill):
     A    B     C
0  1.0  NaN   9.0
1  2.0  6.0  10.0
2  2.0  7.0  11.0
3  4.0  8.0  11.0

# Drop rows that contain NaN
print("\nDropping every row that contains NaN (dropna()):")
print(df_with_nan.dropna())
Dropping every row that contains NaN (dropna()):
     A    B     C
1  2.0  6.0  10.0
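Two points from the notes above have no example yet: arithmetic that substitutes a value for the missing side, and adding or dropping a column. A minimal sketch, rebuilding the same df1 and df2 so that it runs on its own; the column name sum is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(2, 3), columns=list("xyz"))
df2 = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("wxyz"))

# add() with fill_value treats a value missing on ONE side as 0,
# so no NaN appears here (unlike df1 + df2)
total = df1.add(df2, fill_value=0)

# Add a column by assigning to a new name
df1["sum"] = df1["x"] + df1["y"] + df1["z"]

# drop() returns a new DataFrame; df1 itself keeps the column
without_sum = df1.drop(columns=["sum"])

print(total)
print(df1)
print(without_sum.columns.tolist())
```

If a position is missing in both operands, the result is still NaN even with fill_value.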

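Aggregating data: groupby and pivot tables

Aggregating and grouping DataFrame data (groupby() and pivot_table())

Grouping and aggregation with groupby(): the core idea is "split-apply-combine".
Use case: imagine you have a school's student roster and want to count the boys and girls in each class.
Split: first split the roster into groups by class, then split each class into groups by sex.
Apply: to each group (for example, "boys in Year 2 Class 3"), apply the counting step or another function (such as mean, sum, or count).
Combine: finally, combine all of the results ("Year 2 Class 3 has 20 boys") into a new table or Series.

Pivot tables with pivot_table(): reshape data from "long format" to "wide format", reorganizing it into a two-dimensional table based on the chosen columns.
Parameters: index (the row index of the new table), columns (the column labels of the new table), values (the numeric column to aggregate), aggfunc (the aggregation function, such as 'mean', 'sum', or 'count').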

Grouping and aggregation: groupby()

import pandas as pd

titanic_data = {
    'Sex': ['female', 'male', 'female', 'male', 'female', 'male', 'female', 'male'],
    'PClass': ['1st', '1st', '2nd', '2nd', '3rd', '3rd', '1st', '3rd'],
    'Survived': [1, 0, 1, 0, 1, 0, 1, 0]
}
titanic_df = pd.DataFrame(titanic_data)
print("Titanic DataFrame:")
print(titanic_df)
Titanic DataFrame:
Sex PClass Survived
0 female 1st 1
1 male 1st 0
2 female 2nd 1
3 male 2nd 0
4 female 3rd 1
5 male 3rd 0
6 female 1st 1
7 male 3rd 0

# Use groupby to count passengers by sex and survival
print("\nCounts by sex and survival using groupby:")
# size() returns the number of rows in each (Sex, Survived) group, that is, the head count
survival_counts = titanic_df.groupby(['Sex', 'Survived']).size().reset_index(name='Count')
print(survival_counts)
Counts by sex and survival using groupby:
      Sex  Survived  Count
0  female         1      4
1    male         0      4

print("\nCounts by passenger class and survival using groupby:")
pclass_survival = titanic_df.groupby(['PClass', 'Survived']).size().reset_index(name='Count')
print(pclass_survival)
Counts by passenger class and survival using groupby:
PClass Survived Count
0 1st 0 1
1 1st 1 2
2 2nd 0 1
3 2nd 1 1
4 3rd 0 2
5 3rd 1 1

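# The following snippets assume df is an air-quality DataFrame with 'County', 'PM2.5', and 'PM10' columns (not the Titanic data above)
# Compute the mean PM2.5 for each county ('County')
avg_pm25 = df.groupby('County')['PM2.5'].mean()
print("Mean PM2.5 by county:")
print(avg_pm25)
Notes:
df.groupby('County'): group the data by the 'County' column.
['PM2.5']: select the 'PM2.5' column.
.mean(): compute the mean of 'PM2.5' for each group.

# For each county, compute several statistics for PM2.5 and PM10
county_stats = df.groupby('County')[['PM2.5', 'PM10']].agg(['mean', 'max', 'min'])
print("PM2.5 and PM10 statistics by county:")
print(county_stats)
Notes:
[['PM2.5', 'PM10']]: select several columns of interest at once.
.agg(['mean', 'max', 'min']): agg() runs several aggregations on each group in one call.

Using groupby().filter(): filtering whole groups
filter() keeps or drops entire groups that meet a condition, rather than individual rows.
# lambda x: len(x) > 2 is the filter condition
# x is each group as a DataFrame and len(x) is its row count, so only groups with more than 2 rows are kept
filtered_df = df.groupby('County').filter(lambda x: len(x) > 2)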
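The county snippets above do not show their input data. The sketch below runs the same three calls on a small, entirely made-up air-quality table; the county names and readings are assumptions for illustration, not real measurements.

```python
import pandas as pd

# Hypothetical readings: three for Taipei, two for Tainan, one for Hualien
air = pd.DataFrame({
    "County": ["Taipei", "Taipei", "Taipei", "Tainan", "Tainan", "Hualien"],
    "PM2.5": [12.0, 18.0, 15.0, 30.0, 20.0, 8.0],
    "PM10": [25.0, 35.0, 30.0, 50.0, 40.0, 15.0],
})

# Split by county, apply mean to one column, combine into a Series
avg_pm25 = air.groupby("County")["PM2.5"].mean()

# Several aggregations at once; the result has two-level column labels
county_stats = air.groupby("County")[["PM2.5", "PM10"]].agg(["mean", "max", "min"])

# filter() keeps every row of the groups that pass the test
busy = air.groupby("County").filter(lambda g: len(g) > 2)

print(avg_pm25)
print(county_stats)
print(busy)
```

Only the Taipei rows survive the filter, because it is the only county with more than two readings.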

Pivot tables: pivot_table()

pivot_table(index='...', columns='...', values='...', aggfunc='...')
index: sets the row index of the new table.
Example: to give each region its own row, set index to '地區' (region).

columns: sets the column labels of the new table.
Example: to give each product its own column, set columns to '產品' (product).

values: sets the numeric column to aggregate.
Example: to compute total sales, set values to '銷售額' (sales amount).

aggfunc: sets the aggregation function applied to the values column.
Example: you can pass a string such as 'mean', 'sum', or 'count', a function such as np.sum or np.mean, or a list of functions to run several aggregations at once. Recent pandas versions may emit a warning suggesting the string names instead of the NumPy functions.

import pandas as pd

# Build a sample DataFrame (columns: 地區 = region, 產品 = product, 數量 = quantity, 價格 = price)
data = {
    '地區': ['臺北', '臺北', '臺中', '臺南', '臺北', '臺中'],
    '產品': ['A', 'B', 'A', 'C', 'A', 'B'],
    '數量': [10, 5, 8, 12, 15, 6],
    '價格': [100, 200, 150, 80, 120, 180]
}
df = pd.DataFrame(data)

print("Original data:")
print(df)
Original data:
   地區 產品  數量   價格
0  臺北  A  10  100
1  臺北  B   5  200
2  臺中  A   8  150
3  臺南  C  12   80
4  臺北  A  15  120
5  臺中  B   6  180

# Example 1: total quantity sold for each region and product
pivot_table_1 = pd.pivot_table(
    df,
    index='地區',
    columns='產品',
    values='數量',
    aggfunc='sum'
)
print("Example 1: total quantity sold by region and product")
print(pivot_table_1)
Example 1: total quantity sold by region and product
產品     A    B     C
地區
臺中   8.0  6.0   NaN
臺北  25.0  5.0   NaN
臺南   NaN  NaN  12.0

# Example 2: total quantity and mean price for each region and product
pivot_table_2 = pd.pivot_table(
    df,
    index='地區',
    columns='產品',
    values=['數量', '價格'],
    aggfunc={'數量': 'sum', '價格': 'mean'}
)
print("Example 2: total quantity and mean price by region and product")
print(pivot_table_2)
Example 2: total quantity and mean price by region and product
       價格                  數量
產品      A      B     C     A    B     C
地區
臺中  150.0  180.0   NaN   8.0  6.0   NaN
臺北  110.0  200.0   NaN  25.0  5.0   NaN
臺南    NaN    NaN  80.0   NaN  NaN  12.0

Time series analysis

import pandas as pd
import numpy as np

# Load apple.csv if it is available; otherwise fall back to simulated data
try:
    df_time_series = pd.read_csv('apple.csv', index_col='Date', parse_dates=True)
    df_time_series = df_time_series.sort_index()
except FileNotFoundError:
    print("apple.csv not found; using simulated data.")
    dates = pd.date_range(start='2020-01-01', periods=365, freq='D')
    prices = 100 + np.cumsum(np.random.randn(365)) * 0.5
    df_time_series = pd.DataFrame({'Close': prices}, index=dates)

print("Original time series DataFrame (first rows):")
print(df_time_series.head()) # head() returns the first 5 rows by default
# The simulated prices are random, so the numbers below differ on every run
Original time series DataFrame (first rows):
                 Close
2020-01-01  100.548512
2020-01-02  100.316740
2020-01-03  100.395088
2020-01-04   99.710764
2020-01-05  100.378229

# Weekly mean of the closing price (resampled to weekly 'W' frequency)
weekly_mean = df_time_series['Close'].resample('W').mean()
print("\nWeekly mean closing price (weekly frequency):")
print(weekly_mean.head())
Weekly mean closing price (weekly frequency):
2020-01-05 100.269867
2020-01-12 102.036074
2020-01-19 104.054523
2020-01-26 104.221784
2020-02-02 105.928135

Data visualization (Matplotlib, Seaborn, Bokeh)

Matplotlib

Matplotlib is one of the most important and widely used plotting tools in Python. It is primarily a 2D plotting package.
import matplotlib.pyplot as plt
import numpy as np

# Use a font with Chinese glyphs so the Chinese labels below render instead of showing as boxes
# 'Microsoft JhengHei' ships with Windows
# 'PingFang TC' or 'Heiti TC' on macOS
# 'Noto Sans CJK TC' on Linux
plt.rcParams['font.sans-serif'] = ['Microsoft JhengHei']
# By default Matplotlib draws the minus sign with the Unicode character U+2212, which some CJK fonts lack;
# False switches to the ASCII hyphen-minus so negative numbers do not show as boxes
plt.rcParams['axes.unicode_minus'] = False

# Simple line plot
x = np.arange(0, 5, 0.1)
y = np.square(x)

plt.plot(x, y)
plt.title("X 的平方關係圖") # chart title
plt.xlabel("X 軸數值") # X-axis label
plt.ylabel("Y 軸數值") # Y-axis label
plt.grid(True) # show grid lines
plt.show()

# Several series on one plot
x = np.arange(0, 5, 0.1)
plt.plot(x, x, "r--", label="y = x") # red dashed line
plt.plot(x, x**2, "bs", label="y = x^2") # blue square markers
plt.plot(x, x**3, "g^", label="y = x^3") # green triangle markers

plt.title("多項式關係圖")
plt.xlabel("X 值")
plt.ylabel("Y 值")
plt.legend() # show the legend
plt.show()

Seaborn

Seaborn is a high-level plotting package built on top of Matplotlib. Its higher-level visualization API makes charts more convenient to draw and better looking.
It makes statistical charts easier to create and can be seen as a complement and enhancement to Matplotlib.

import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Font with Chinese glyphs for the labels; ASCII minus sign to avoid missing-glyph boxes
plt.rcParams['font.sans-serif'] = ['Microsoft JhengHei']
plt.rcParams['axes.unicode_minus'] = False

# Generate samples from the standard normal distribution
normal_samples = np.random.normal(size=10000)

# Histogram: sns.histplot() (the replacement for the deprecated distplot()); kde=True overlays a kernel density estimate (KDE) curve
sns.histplot(normal_samples, kde=True, bins=30) # bins sets the number of bars
plt.title("標準常態分佈直方圖")
plt.xlabel("數值")
plt.ylabel("頻率")
plt.show()

# Scatter plot: sns.jointplot() adds histograms of the X and Y variables on the margins by default
speed = [4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13,
         13, 13, 14, 14, 14, 14, 15, 15, 15, 16, 16, 17, 17, 17, 18, 18,
         18, 18, 19, 19, 19, 20, 20, 20, 20, 20, 22, 23, 24, 24, 24, 24, 25]
dist = [2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34,
        34, 46, 26, 36, 60, 80, 20, 26, 54, 32, 40, 32, 40, 50, 42, 56,
        76, 84, 36, 46, 68, 32, 48, 52, 56, 64, 66, 54, 70, 92, 93, 120, 85]
cars_df = pd.DataFrame({"speed": speed, "dist": dist})
sns.jointplot(x="speed", y="dist", data=cars_df, kind='scatter') # kind sets the plot type
plt.suptitle("車速與煞車距離散佈圖", y=1.05) # move the title up so it does not overlap the subplots
plt.show()

Bokeh

Bokeh is a library for building interactive visualizations. Its charts are displayed in a web browser and support interactions such as zooming, panning, and tooltips.
Bokeh workflow: import Bokeh -> set the width and height of the plot -> define the data -> choose the chart type and its settings -> put the data on the chart and show it.

from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource
from bokeh.transform import factor_cmap
from bokeh.palettes import Spectral6

# Create the data
fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']
counts = [5, 3, 4, 2, 4, 6]

source = ColumnDataSource(data=dict(fruits=fruits, counts=counts))

# Create the figure
p = figure(x_range=fruits, height=350, title="Fruit sales counts",
           toolbar_location="below", tools="pan,wheel_zoom,box_zoom,reset,save")

# Draw the bar chart
p.vbar(x='fruits', top='counts', width=0.9, source=source,
       legend_field="fruits", line_color='white',
       fill_color=factor_cmap('fruits', palette=Spectral6, factors=fruits))

# Set figure properties
p.xgrid.grid_line_color = None
p.y_range.start = 0
p.y_range.end = max(counts) + 1 # adjust the y-axis range
p.legend.orientation = "horizontal"
p.legend.location = "top_center"

show(p)

# Draw an interactive line chart
from bokeh.plotting import figure, show

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
y = [6, 3, 4, 2, 5, 2, 5, 1, 3, 5, 4]

# Bokeh 3 uses width and height; the older plot_width and plot_height arguments were removed
p = figure(width=600, height=400, title="Simple line chart",
           tools="pan,wheel_zoom,box_zoom,reset,save")

p.line(x, y, line_width=3, line_color="blue", legend_label="Trend")
# scatter() draws circle markers by default; circle() with size is deprecated in newer Bokeh releases
p.scatter(x, y, size=10, color="red", alpha=0.6, legend_label="Data points")

p.xaxis.axis_label = "X axis"
p.yaxis.axis_label = "Y axis"
p.legend.location = "top_left"
p.grid.grid_line_alpha = 0.5

show(p)

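NumPy (Numerical Python)

NumPy (Numerical Python) provides Python with a high-performance multidimensional array object (ndarray). All elements share the same type, and elements are indexed by a tuple of non-negative integers.

Reading, writing, and creating ndarrays

CSV files
Reading text data with NumPy: functions such as loadtxt() or genfromtxt().
Writing text data with NumPy: functions such as savetxt().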
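A minimal sketch of the three functions named above, writing a small table to CSV and reading it back. The file name flight.csv, the column names, and the values are assumptions for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical numeric table: time, speed, height
table = np.array([[0.1, 0.0, 10.0],
                  [0.2, 7.535, 20.0]])
path = os.path.join(tempfile.mkdtemp(), "flight.csv")

# comments="" stops savetxt from prefixing the header line with "# "
np.savetxt(path, table, delimiter=",", header="time,speed,height", comments="")

# loadtxt returns a plain float array; skiprows=1 skips the header line
loaded = np.loadtxt(path, delimiter=",", skiprows=1)

# genfromtxt with names=True reads the header and returns a structured array
with_names = np.genfromtxt(path, delimiter=",", names=True)

print(loaded)
print(with_names["speed"])
```

loadtxt() is the simpler choice for clean numeric files, while genfromtxt() can also cope with missing values and named columns.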

Creating an ndarray:
From a Python list or tuple: use np.array().
Example: creating ndarrays

import numpy as np

# Create a one-dimensional array from a list
arr1 = np.array([1, 2, 3, 4, 5])
print("One-dimensional array (from list):")
print(arr1)
print(f"Type: {type(arr1)}")
print(f"Dimensions: {arr1.ndim}")
print(f"Shape: {arr1.shape}\n")

One-dimensional array (from list):
[1 2 3 4 5]
Type: <class 'numpy.ndarray'>
Dimensions: 1
Shape: (5,)

# Create a two-dimensional array from a nested list
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Two-dimensional array (from nested list):")
print(arr2d)
print(f"Dimensions: {arr2d.ndim}")
print(f"Shape: {arr2d.shape}\n")

Two-dimensional array (from nested list):
[[1 2 3]
 [4 5 6]
 [7 8 9]]
Dimensions: 2
Shape: (3, 3)

# Create an array filled with zeros
zeros_arr = np.zeros((3, 4)) # 3 rows, 4 columns
print("3×4 array of zeros:")
print(zeros_arr)
print(f"Data type: {zeros_arr.dtype}\n")

3×4 array of zeros:
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
Data type: float64

# Generate a sequence with arange
# arange(start, stop, step): stop is excluded
seq_arr = np.arange(0, 10, 2)
print("Sequence generated with arange:")
print(seq_arr)
print(f"Data type: {seq_arr.dtype}\n")

Sequence generated with arange:
[0 2 4 6 8]
Data type: int64

# Generate a floating-point sequence with arange
float_seq_arr = np.arange(0.1, 0.5, 0.1)
print("Floating-point sequence generated with arange:")
print(float_seq_arr)
print(f"Data type: {float_seq_arr.dtype}\n")

Floating-point sequence generated with arange:
[0.1 0.2 0.3 0.4]
Data type: float64

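ndarray attributes:
ndarray.ndim: the number of axes (dimensions) of the array.
ndarray.shape: the dimensions of the array, a tuple of integers giving the size along each axis.
ndarray.size: the total number of elements in the array.
ndarray.dtype: an object describing the type of the elements.
ndarray.itemsize: the size of each element in bytes.
ndarray.nbytes: the total size in bytes, equal to size × itemsize.

arr = np.array([[1, 2, 3], [4, 5, 6]], dtype='int32')

print(f"Array: \n{arr}\n")
Array:
[[1 2 3]
 [4 5 6]]

print(f"Dimensions (ndim): {arr.ndim}")
Dimensions (ndim): 2

print(f"Shape (shape): {arr.shape}")
Shape (shape): (2, 3)

print(f"Total elements (size): {arr.size}")
Total elements (size): 6

print(f"Data type (dtype): {arr.dtype}")
Data type (dtype): int32

print(f"Bytes per element (itemsize): {arr.itemsize}")
Bytes per element (itemsize): 4

print(f"Total bytes (nbytes): {arr.nbytes}")
Total bytes (nbytes): 24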

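ndarray indexing (slicing and np.ix_)

ndarray indexing:
Direct indexing: X[row_index][col_index] or X[row_index, col_index]
Slicing: X[start:end:step]. For example, X[1:, 2:] selects every row from the second row onward and every column from the third column onward.
import numpy as np

# Create a 3×4 two-dimensional array
x2d = np.arange(12).reshape(3, 4)
print("Original array x2d:")
print(x2d)
Original array x2d:
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

# Basic indexing: take every element of the first row (returns a view)
view_slice = x2d[0, :]
print("\nBasic indexing: x2d[0, :] (view):")
print(view_slice)
view_slice[0] = 99 # modify the view
print("Original array x2d after modifying the view:")
print(x2d) # the original array is modified too
Basic indexing: x2d[0, :] (view):
[0 1 2 3]
Original array x2d after modifying the view:
[[99  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

# Reset x2d
x2d = np.arange(12).reshape(3, 4)
print("\nOriginal array x2d after the reset:")
print(x2d)
Original array x2d after the reset:
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

# Advanced indexing: index with integer arrays (returns a copy)
# Selects the elements at (0,2), (1,1), (2,0)
copy_fancy = x2d[[0, 1, 2], [2, 1, 0]]
print("\nAdvanced indexing: x2d[[0,1,2], [2,1,0]] (copy):")
print(copy_fancy)
copy_fancy[0] = 88 # modify the copy
print("Original array x2d after modifying the copy:")
print(x2d) # the original array is not modified
Advanced indexing: x2d[[0,1,2], [2,1,0]] (copy):
[2 5 8]
Original array x2d after modifying the copy:
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

# Cross-product indexing with np.ix_
# Selects the intersections of rows 0 and 2 with columns 1 and 3
cross_indexed = x2d[np.ix_([0, 2], [1, 3])]
print("\nCross-product indexing with np.ix_: x2d[np.ix_([0,2], [1,3])]:")
print(cross_indexed)
Cross-product indexing with np.ix_: x2d[np.ix_([0,2], [1,3])]:
[[ 1  3]
 [ 9 11]]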

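ndarray arithmetic and sorting

arr1 = np.array([1, 2, 3, 4])
arr2 = np.array([10, 20, 30, 40])

# Element-wise addition
print(f"arr1 + arr2: {arr1 + arr2}")
arr1 + arr2: [11 22 33 44]

# Element-wise multiplication
print(f"arr1 * arr2: {arr1 * arr2}\n")
arr1 * arr2: [ 10  40  90 160]

# Statistical functions
data = np.array([1, 4, 3, 8, 9, 2, 3], dtype=float)
print(f"Data: {data}")
Data: [1. 4. 3. 8. 9. 2. 3.]

print(f"Mean (mean): {np.mean(data):.2f}")
Mean (mean): 4.29

print(f"Median (median): {np.median(data)}")
Median (median): 3.0

print(f"Standard deviation (std): {np.std(data):.2f}")
Standard deviation (std): 2.81

print(f"Sum (sum): {np.sum(data)}")
Sum (sum): 30.0

print(f"Maximum (max): {np.max(data)}")
Maximum (max): 9.0

print(f"Minimum (min): {np.min(data)}")
Minimum (min): 1.0

Boolean operations:
arr.any(axis=0) explained:
For a two-dimensional array, it checks each column (moving in the vertical direction) for at least one True value. If a column contains any True, the result for that column is True; if every value in the column is False, the result is False.
axis=0 is the vertical direction: the operation runs down the rows and gives one result per column.
axis=1 is the horizontal direction: the operation runs across the columns and gives one result per row.
arr.all(): tests whether every value in the array is True.
arr.sum(): on a boolean array, sum() counts the True values, because True counts as 1 and False as 0 in arithmetic.

Sorting:
arr.sort(): sorts the array in place.
np.sort(arr): returns a sorted copy and leaves the original array unchanged.
arr.argsort(): returns the indices that would sort the array.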
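The boolean reductions and the three sorting calls described above can be checked with a short sketch. The score table and the pass mark of 60 are made-up values for illustration.

```python
import numpy as np

# Hypothetical scores: 2 students (rows) x 3 subjects (columns)
scores = np.array([[55, 80, 40],
                   [90, 65, 30]])
passed = scores >= 60  # boolean array of the same shape

any_per_column = passed.any(axis=0)  # did anyone pass each subject?
all_per_row = passed.all(axis=1)     # did each student pass everything?
pass_count = passed.sum()            # True counts as 1

values = np.array([3, 1, 2])
order = values.argsort()       # indices that would sort the array
sorted_copy = np.sort(values)  # new array; values is unchanged so far
values_before = values.copy()
values.sort()                  # in-place sort changes values itself

print(any_per_column, all_per_row, pass_count)
print(order, sorted_copy, values_before, values)
```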

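xml.etree.ElementTree (ET)

xml.etree.ElementTree is the tool in Python's standard library for working with XML data.

import xml.etree.ElementTree as ET
tree = ET.parse("file_name.xml")
This call parses the XML document and builds a tree structure in memory.
It converts all the tags, attributes, and content of the XML file from text into a tree data structure that Python can work with.

root = tree.getroot()
This call returns the top-level root node of the parsed tree.

ET.parse(): read the file and build the tree
tree.getroot(): find the starting point of the tree
for row in root: iterate over the direct children of the root; nest further loops (or use root.iter()) to go deeper
Together these three steps form the standard XML parsing workflow.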
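The three steps above can be sketched on a small in-memory document. The XML content and its tag names (employees, employee, name, city) are assumptions invented for the example; ET.fromstring() is used in place of ET.parse() so that no file is needed.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document kept in a string for the example
xml_text = """<employees>
  <employee id="1"><name>Alice</name><city>Taipei</city></employee>
  <employee id="2"><name>Bob</name><city>London</city></employee>
</employees>"""

# fromstring() returns the root element directly;
# for a file, ET.parse(path).getroot() gives the same kind of object
root = ET.fromstring(xml_text)

records = []
for row in root:  # iterates over the direct children of the root
    records.append({
        "id": row.get("id"),            # attribute value
        "name": row.find("name").text,  # text of a child element
        "city": row.findtext("city"),   # shortcut for find(...).text
    })

print(root.tag)
print(records)
```

Attribute values and element text always come back as strings, so numeric fields need explicit conversion.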

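YAML (YAML Ain't Markup Language)

YAML (YAML Ain't Markup Language) is a human-readable data serialization format.

yaml.dump(data, file, default_flow_style=False, allow_unicode=True)
default_flow_style=False: writes block style, with one item per line and indentation for nesting. With default_flow_style=True the output is single-line flow style that looks much like JSON, for example {key: value, list: [1, 2, 3]}.
allow_unicode=True: ensures that Unicode characters (such as Chinese or Japanese) are written to the YAML file as-is rather than as escape sequences.

Main characteristics:
Indentation: it uses whitespace to express the hierarchy of the data, which keeps it clean and intuitive.
Conciseness: it does not need repeated tags as XML does, nor the many brackets and braces of JSON.
Data structures: it can represent the data structures common in Python, such as dictionaries (mapping), lists (sequence), and basic types (string, number, boolean).

Widely used for:
Configuration files: this is the main use of YAML. For example, Docker Compose, Kubernetes, and many CI/CD tools (such as GitHub Actions) use YAML for their configuration files, which makes them easier for developers to edit and maintain by hand.
Data exchange: JSON is more common for data exchange, but when readability matters more than transfer efficiency (for example, sample responses in API documentation), YAML is also a good choice.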
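The effect of default_flow_style and allow_unicode can be compared with a short sketch. It assumes the third-party PyYAML package is installed (pip install pyyaml); the config dictionary is made up for illustration.

```python
import yaml  # PyYAML, a third-party package (assumed to be installed)

# Hypothetical configuration data
config = {"name": "demo", "ports": [80, 443], "owner": "小明"}

# Block style: one item per line, nesting shown by indentation
block_text = yaml.dump(config, default_flow_style=False, allow_unicode=True)

# Flow style: a single JSON-like line
flow_text = yaml.dump(config, default_flow_style=True, allow_unicode=True)

# safe_load parses YAML text back into Python objects
restored = yaml.safe_load(block_text)

print(block_text)
print(flow_text)
print(restored == config)
```

When no file object is passed as the second argument, yaml.dump() returns the YAML text as a string, which is convenient for quick checks like this one.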

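.db files and sqlite3

What is a .db file? A .db file is usually a SQLite database file.

Its main characteristics:
Serverless: it does not need a separately running database server process. The whole database is a single file.
Lightweight and fast: well suited to small applications or local data storage.

To read one, first import sqlite3; the sqlite3 module lets you work with SQLite databases directly.

Connecting to the database (sqlite3.connect)
conn = sqlite3.connect("read.db")
Purpose: creates a database connection to the .db file you name. If the file does not exist, it is created automatically.

Creating a cursor (conn.cursor)
cursor = conn.cursor()
Purpose: creates a cursor object. The cursor works like a remote control for the database: you execute SQL statements and fetch query results through it.

Executing SQL (cursor.execute)
cursor.execute("SELECT * FROM Employee")
Purpose: executes one SQL statement.

The statement "SELECT * FROM Employee" asks the database to return every row of the Employee table.

Iterating over the cursor (for row in cursor)
for row in cursor:
Purpose: the cursor object is itself iterable. The loop reads the results of execute() one at a time and assigns each record (one row, as a tuple) to the variable row until every row has been read.
cursor.fetchall(): reads all remaining results into memory at once and returns a list.
cursor.fetchone(): reads only the next result and returns a tuple, or None when no rows remain.

Closing the connection (conn.close)
conn.close()
Purpose: closes the database connection and releases its resources. This is an important habit that avoids file locks and other resource problems. Note that close() does not commit pending changes.

Committing changes (conn.commit()):
If you run SQL that changes the database (for example INSERT, UPDATE, or DELETE), you must call conn.commit() to commit the changes; otherwise they are not written permanently to the .db file.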
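The steps above fit together as in the following sketch. It uses an in-memory database (":memory:") instead of read.db so that it runs anywhere, and the Employee columns and sample rows are assumptions made up for the example.

```python
import sqlite3

# ":memory:" creates a temporary database that lives only in RAM
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# Hypothetical table layout for the Employee table
cursor.execute(
    "CREATE TABLE Employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)"
)

# The ? placeholders let sqlite3 bind the values safely
cursor.executemany(
    "INSERT INTO Employee (name, dept) VALUES (?, ?)",
    [("Alice", "R&D"), ("Bob", "Sales")],
)
conn.commit()  # make the inserted rows permanent

cursor.execute("SELECT id, name, dept FROM Employee ORDER BY id")
first = cursor.fetchone()  # next row as a tuple
rest = cursor.fetchall()   # all remaining rows as a list of tuples

# execute() returns the cursor, so it can be iterated directly
names = [row[0] for row in cursor.execute("SELECT name FROM Employee ORDER BY name")]

conn.close()

print(first)
print(rest)
print(names)
```

Using ? placeholders rather than building the SQL string by hand also protects against SQL injection.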
