Data Analytics note

Business analyst—analyzes data to help businesses improve processes, products, or services

業務分析師-分析數據以幫助企業改善流程、產品或服務

Data analytics consultant—analyzes the systems and models for using data

數據分析顧問——分析使用數據的系統和模型

Data engineer—prepares and integrates data from different sources for analytical use

資料工程師-準備並整合來自不同來源的資料以供分析使用

Data scientist—uses expert skills in technology and social science to find trends through data analysis

資料科學家-利用科技和社會科學的專業技能,透過數據分析發現趨勢

Data specialist—organizes or converts data for use in databases or software systems

資料專家-組織或轉換資料以供資料庫或軟體系統使用

Operations analyst—analyzes data to assess the performance of business operations and workflows

營運分析師-分析數據以評估業務營運和工作流程的績效

Marketing analyst—analyzes market conditions to assess the potential sales of products and services 

行銷分析師-分析市場狀況以評估產品和服務的潛在銷售

HR/payroll analyst—analyzes payroll data for inefficiencies and errors

人力資源/薪資分析師-分析薪資資料是否有效率低落和錯誤

Financial analyst—analyzes financial status by collecting, monitoring, and reviewing data

財務分析師-透過收集、監控和審查數據來分析財務狀況

Risk analyst—analyzes financial documents, economic conditions, and client data to help companies determine the level of risk involved in making a particular business decision

風險分析師-分析財務文件、經濟狀況和客戶數據,幫助公司確定特定業務決策所涉及的風險水平

Healthcare analyst—analyzes medical data to improve the business aspect of hospitals and medical facilities

醫療保健分析師-分析醫療數據以改善醫院和醫療機構的業務

Ask(提問)

In the ask phase, you’ll work to understand the challenge to be solved or the question to be answered. It will likely be assigned to you by stakeholders. As this is the ask phase, you’ll ask many questions to help you along the way.

在提問階段 ,你需要努力理解需要解決的挑戰或需要回答的問題。這些問題很可能是由利害關係人分配給你的。由於這是提問階段,你需要提出許多問題來幫助你完成工作。

Prepare(準備)

Next, in the prepare phase, you’ll find and collect the data you’ll need to answer your questions. You’ll identify data sources, gather data, and verify that it is accurate and useful for answering your questions. 

接下來,在準備階段 ,你將尋找並收集解答問題所需的資料。你將識別資料來源,收集數據,並驗證其準確性和實用性,以解答你的問題。

Process(處理)

The process phase is when you will clean and organize your data. Tasks you perform here include removing any inconsistencies; filling in missing values; and, in many cases, changing the data to a format that’s easier to work with. Essentially, you’re ensuring the data is ready before you begin analysis.

處理階段是清理和整理資料的關鍵階段。您在此階段執行的任務包括:消除所有不一致之處;填充缺失值;以及在許多情況下,將資料轉換為更易於處理的格式。本質上,您需要確保在開始分析之前資料已準備就緒。
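As a rough illustration of the process phase, the sketch below cleans a few made-up records with the Python standard library. The field names (`region`, `sales`), the sample values, and the choice to fill missing values with the median are all assumptions for the example; the right fill strategy depends on the data and the business question.

```python
from statistics import median

# Hypothetical raw records; field names and values are made up for illustration.
raw_rows = [
    {"region": " North ", "sales": "120"},
    {"region": "north", "sales": ""},
    {"region": "SOUTH", "sales": "80"},
    {"region": "South", "sales": "100"},
]


def clean_rows(rows):
    """Standardize region labels and fill missing sales with the median."""
    cleaned = []
    for row in rows:
        region = row["region"].strip().title()  # remove stray spaces, unify case
        sales = float(row["sales"]) if row["sales"].strip() else None
        cleaned.append({"region": region, "sales": sales})

    # Assumption: the median of the known values is an acceptable fill.
    known = [r["sales"] for r in cleaned if r["sales"] is not None]
    fill_value = median(known)
    for r in cleaned:
        if r["sales"] is None:
            r["sales"] = fill_value
    return cleaned


cleaned_rows = clean_rows(raw_rows)
```

After this step every record has a consistent region label and a numeric sales value, so the data is ready for analysis.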

Analyze(分析)

The analyze phase is when you do the necessary data analysis to uncover answers and solutions. Depending on the situation and the data, this could involve tasks such as calculating averages or counting items in categories so you can examine trends and patterns.

分析階段是指進行必要的數據分析,以發現答案和解決方案。根據具體情況和數據,分析階段可能涉及計算平均值或按類別計數等任務,以便分析趨勢和模式。
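The two tasks named above, counting items in categories and calculating averages, could look like the following minimal sketch. The `orders` data is hypothetical and only serves to show the mechanics.

```python
from collections import Counter
from statistics import mean

# Hypothetical order records as (category, amount); values are made up.
orders = [
    ("books", 12.0),
    ("toys", 30.0),
    ("books", 18.0),
    ("games", 45.0),
    ("toys", 10.0),
]

# Count how many orders fall into each category.
counts = Counter(category for category, _ in orders)

# Group the amounts by category, then average each group.
amounts = {}
for category, amount in orders:
    amounts.setdefault(category, []).append(amount)
averages = {category: mean(values) for category, values in amounts.items()}
```

Comparing `counts` and `averages` side by side is one simple way to start looking for trends and patterns.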

Share(分享)

Next comes the share phase, when you present your findings to decision-makers through a report, presentation, or data visualizations. As part of the share phase, you decide which medium you want to use to share your findings and select the data to include. Tools for presenting data visually include charts made in Google Sheets, Tableau, and R. 

接下來是分享階段 ,您將透過報告、簡報或資料視覺化的形式向決策者展示您的研究成果。在分享階段,您需要決定使用哪種媒介來分享您的研究成果,並選擇要包含的資料。視覺化資料呈現工具包括使用 Google 試算表、Tableau 和 R 製作的圖表。

Act(行動)

Last is the act phase, in which you and others in the company put the data insights into action. This could mean implementing a new business strategy, making changes to a website, or any other action that solves the initial problem. 

最後是行動階段 ,你和公司其他成員將數據洞察付諸行動。這可能意味著實施新的業務策略、改進網站,或任何其他旨在解決初始問題的行動。

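1. State the problem: Clearly define the problem you are facing.

2. Ask the first "why": Ask yourself, "Why did this problem happen?"

3. Keep asking: Based on the answer to the previous "why," ask "Why did that happen?" again. Repeat the process; by about the fifth "why," you will usually have reached the root cause.

4. Find the root: When you reach an answer that cannot be broken down further, or one that points to an organizational process or a human factor, you have probably found the root of the problem.

"Five" is only a rule of thumb. You may need to ask three times or ten; the point is to keep asking until you find the real cause.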

Specific: Is the question specific? Does it address the problem? Does it have context? Will it uncover a lot of the information you need?

具體的:這個問題具體嗎?它能解決問題嗎?它有背景嗎?它能揭示很多你需要的資訊嗎?

Measurable: Will the question give you answers that you can measure?

可衡量: 這個問題是否會提供你一個可以衡量的答案?

Action-oriented: Will the answers provide information that helps you devise some type of plan?

行動導向: 答案是否會提供有助於您制定某種計劃的資訊?

Relevant: Is the question about the particular problem you are trying to solve?

相關性: 這個問題是否與您要解決的特定問題有關?

Time-bound: Are the answers relevant to the specific time being studied?

有時限: 答案是否與研究的特定時間有關?

1.Plan: Decide what kind of data is needed, how it will be managed, and who will be responsible for it.

計劃: 決定需要什麼樣的資料、如何管理資料、以及誰負責資料。

2.Capture: Collect or bring in data from a variety of different sources.

抓取: 從各種不同的來源收集或引入數據。

3.Manage: Care for and maintain the data. This includes determining how and where it is stored and the tools used to do so.

管理: 照管和維護資料。這包括確定資料的儲存方式和位置以及儲存資料的工具。

4.Analyze: Use the data to solve problems, make decisions, and support business goals.

分析: 使用數據解決問題、做出決策並支援業務目標。

5.Archive: Keep relevant data stored for long-term and future reference.

存檔: 保存相關數據以供長期和將來參考。

6.Destroy: Remove data from storage and delete any shared copies of the data.

銷毀: 從儲存中移除資料並刪除資料的任何共用副本。

Other lifecycle examples:

U.S. Fish and Wildlife Service

Plan, Acquire, Maintain, Access, Evaluate, Archive

U.S. Geological Survey (USGS)

Plan, Acquire, Process, Analyze, Preserve, Publish/share

Financial institutions

Capture, Qualify, Transform, Utilize, Report, Archive, Purge

Harvard Business School (HBS)

Generation, Collection, Processing, Storage, Management, Analysis, Visualization, Interpretation

Small data 小數據

Describes a dataset made up of specific metrics over a short, well-defined time period

描述由特定指標在短且明確定義的時間段內組成的資料集

Usually organized and analyzed in spreadsheets

通常以電子表格形式進行組織和分析

Likely to be used by small and midsize businesses

可能被中小型企業使用

Simple to collect, store, manage, sort, and visually represent 

易於收集、儲存、管理、分類和直覺呈現

Usually already a manageable size for analysis

通常已經達到可管理的分析規模

Big data 大數據

Describes large, less-specific datasets that cover a long time period

描述涵蓋較長時間段的大型、較不具體的資料集

Usually kept in a database and queried

通常會保存在資料庫中並進行查詢

Likely to be used by large organizations

可能被大型組織使用

Takes a lot of effort to collect, store, manage, sort, and visually represent

需要花費大量精力來收集、儲存、管理、分類和視覺呈現

Usually needs to be broken into smaller pieces in order to be organized and analyzed effectively for decision-making

通常需要分解成較小的部分,以便有效地組織和分析,從而做出決策

The three (or four) V words for big data

大數據的三個(或四個)V 字

Volume  體積

The amount of data  數據量

Variety  種類

The different kinds of data 不同類型的數據

Velocity  速度

How fast the data can be processed 數據處理速度有多快

Veracity  真實性 

The quality and reliability of the data 數據的品質和可靠性

Context can turn raw data into meaningful information. It is very important for data analysts to contextualize their data. This means giving the data perspective by defining it. To do this, you need to identify:

上下文可以將原始資料轉化為有意義的資訊。 對於數據分析師來說,將數據置於上下文中非常重要。這意味著透過定義數據來提供視角。為此,您需要確定:

Who: The person or organization that created, collected, and/or funded the data collection

誰:創建、收集和/或資助資料收集的個人或組織

What: The things in the world that data could have an impact on

什麼:數據可能對世界上的事物產生影響

Where: The origin of the data

哪裡:資料的來源

When: The time when the data was created or collected

時間:資料創建或收集的時間

Why: The motivation behind the creation or collection

為什麼:創作或收藏背後的動機

How: The method used to create or collect it

How:創造或收集它的方法

Data issue 1: no data

數據問題 1:沒有數據

Gather the data on a small scale to perform a preliminary analysis and then request additional time to complete the analysis after you have collected more data. 

小規模收集資料進行初步分析,然後在收集更多資料後請求更多時間來完成分析。

If there isn’t time to collect data, perform the analysis using proxy data from other datasets. 

This is the most common workaround.

如果沒有時間收集數據,請使用其他資料集的代理資料進行分析。

這是最常見的解決方法。

Data issue 2: too little data

數據問題2:數據太少

Do the analysis using proxy data along with actual data.

使用代理數據和實際數據進行分析。

Adjust your analysis to align with the data you already have.

調整您的分析以與您現有的數據保持一致

Data issue 3: wrong data, including data with errors*

數據問題 3:錯誤數據,包括有錯誤的數據*

If you have the wrong data because requirements were misunderstood, communicate the requirements again.

如果由於誤解了要求而導致數據錯誤,請再次傳達要求。

Identify errors in the data and, if possible, correct them at the source by looking for a pattern in the errors.

識別資料中的錯誤,如果可能的話,透過尋找錯誤中的模式從來源上修正錯誤。

If you can’t correct data errors yourself, you can ignore the wrong data and go ahead with the analysis if your sample size is still large enough and ignoring the data won’t cause systematic bias.

如果您無法自行修正資料錯誤,則可以忽略錯誤的數據,如果樣本量仍然足夠大並且忽略數據不會導致系統偏差,則繼續進行分析。

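Data errors:

First, assess whether you can fix the data or request a corrected dataset. If so, run the analysis after the data has been corrected.

>If the data cannot be fixed, assess whether you have enough data to leave out the erroneous records. If so, run the analysis on the correct data you already have.

Not enough data:

>If the erroneous data cannot be fixed and there is too much of it, you are left with too little data. Assess whether proxy data can stand in for it. If so, run the analysis with the proxy data.

>If no proxy data is available, assess whether you can collect more data. If so, run the analysis once you have collected enough.

>>>If none of the above is feasible, the only option left is to revise the business objective.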
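The decision flow above can be sketched as a small function. The function name, the four yes/no parameters, and the wording of the returned actions are hypothetical; they simply mirror the order of the questions in the notes.

```python
def next_step(can_fix, enough_without_errors, proxy_available, can_collect_more):
    """Return a suggested action for a dataset that contains errors.

    Each argument is a boolean answering one question from the flow,
    checked in the same order as in the notes.
    """
    if can_fix:
        return "fix the data (or request a corrected dataset), then analyze"
    if enough_without_errors:
        return "ignore the bad records and analyze the remaining data"
    if proxy_available:
        return "analyze using proxy data"
    if can_collect_more:
        return "collect more data, then analyze"
    return "revise the business objective"


# Example: the errors cannot be fixed and too few good records remain,
# but a comparable proxy dataset exists.
suggestion = next_step(False, False, True, False)
```

The early returns make the priority explicit: each later option is considered only when every earlier one has been ruled out.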

Sources of errors: Did you use the right tools and functions to find the source of the errors in your dataset?

錯誤來源 :您是否使用了正確的工具和功能來尋找資料集中的錯誤來源?

Null data: Did you search for NULLs using conditional formatting and filters?

空資料 :您是否使用條件格式和篩選器搜尋了 NULL?

Misspelled words: Did you locate all misspellings?

拼字錯誤的單字 :您找到所有拼字錯誤了嗎?

Mistyped numbers: Did you double-check that your numeric data has been entered correctly?

數字輸入錯誤 :您是否仔細檢查過您的數字資料是否輸入正確?

Extra spaces and characters: Did you remove any extra spaces or characters using the TRIM function?

多餘的空格和字元 :您是否使用 TRIM 函數刪除了任何多餘的空格或字元?

Duplicates: Did you remove duplicates in spreadsheets using the Remove Duplicates function or DISTINCT in SQL?

重複項 :您是否使用 SQL 中的 「刪除重複項」 功能或 DISTINCT 刪除了電子表格中的重複項?

Mismatched data types: Did you check that numeric, date, and string data are typecast correctly?

資料類型不符 :您是否檢查過數字、日期和字串資料的類型轉換是否正確?

Messy (inconsistent) strings: Did you make sure that all of your strings are consistent and meaningful?

混亂(不一致)的字串 :您是否確保所有字串都是一致且有意義的?

Messy (inconsistent) date formats: Did you format the dates consistently throughout your dataset?

混亂(不一致)的日期格式 :您是否在整個資料集中一致地格式化日期?

Misleading variable labels (columns): Did you name your columns meaningfully?

誤導性的變數標籤(列) :您的列名稱是否有意義?

Truncated data: Did you check for truncated or missing data that needs correction?

截斷資料: 您是否檢查過需要更正的截斷或缺失資料?

Business Logic: Did you check that the data makes sense given your knowledge of the business? 

業務邏輯 :根據您對業務的了解,您是否檢查過資料是否合理?
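Several items from the checklist above (extra spaces, mismatched data types, inconsistent date formats, NULLs, and duplicates) can be handled in a single pass. The following is a minimal sketch that assumes hypothetical column names and assumes only two date formats appear in the input; real data usually needs more cases.

```python
from datetime import datetime

# Hypothetical rows as they might arrive from a spreadsheet export.
rows = [
    {"name": "  Ana ", "signup": "2024-01-05", "age": "34"},
    {"name": "Ana", "signup": "05/01/2024", "age": "34"},
    {"name": "Ben", "signup": "2024-02-10", "age": ""},
]

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed input formats


def parse_date(text):
    """Try each assumed format; return a date, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def clean(rows):
    """Trim, typecast, and de-duplicate rows; also count missing ages."""
    seen = set()
    result = []
    missing_ages = 0
    for row in rows:
        name = row["name"].strip()                    # similar to TRIM
        signup = parse_date(row["signup"].strip())    # one date representation
        age = int(row["age"]) if row["age"].strip() else None  # typecast
        key = (name, signup, age)
        if key in seen:                               # similar to DISTINCT
            continue
        seen.add(key)
        if age is None:
            missing_ages += 1                         # NULLs to follow up on
        result.append({"name": name, "signup": signup, "age": age})
    return result, missing_ages


cleaned, missing = clean(rows)
```

Note that the first two rows only become recognizable as duplicates after trimming and date normalization, which is why de-duplication runs last.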

When figuring out a sample size, here are things to keep in mind:

在決定樣本量時,需要牢記以下幾點:

Don’t use a sample size less than 30. This is a widely used rule of thumb rather than a proven threshold: at around 30 observations, the average of a sample usually starts to approximate the average of the population.

As sample size increases, the distribution of sample averages more closely resembles the normal (bell-shaped) distribution, which is what the central limit theorem (CLT) describes. A sample of 30 is commonly treated as the smallest size at which the CLT approximation is reliable, although strongly skewed data may need more.

The confidence level most commonly used is 95%, but 90% can work in some cases. 

最常用的置信度是 95%,但在某些情況下 90% 也可以。

Sample size calculator: https://www.surveymonkey.com/mp/sample-size-calculator/
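For a sense of how confidence level feeds into sample size, here is a sketch of one common textbook formula for estimating a proportion (Cochran's formula with a finite population correction). Whether any particular online calculator uses exactly this formula is an assumption. The margin of error and the expected proportion are parameters not discussed in these notes; the defaults of 5% and 0.5 are illustrative.

```python
import math
from statistics import NormalDist


def sample_size(population=None, confidence=0.95, margin=0.05, proportion=0.5):
    """Estimate the sample size needed to measure a proportion.

    population: size of the whole group, or None to treat it as very large.
    confidence: confidence level, e.g. 0.95 for 95%.
    margin: acceptable margin of error, e.g. 0.05 for plus or minus 5%.
    proportion: expected proportion; 0.5 is the most conservative choice.
    """
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for a very large population.
    n0 = (z ** 2) * proportion * (1 - proportion) / (margin ** 2)
    # Finite population correction shrinks the requirement for small groups.
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


needed_for_500 = sample_size(500)
needed_for_large = sample_size()
```

With these defaults, `sample_size(500)` returns 218 and `sample_size()` returns 385. Lowering the confidence level to 90% reduces the required sample.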

Duplicate data  重複數據

Description: Any data record that shows up more than once

任何出現多次的資料記錄

Possible causes: Manual data entry, batch data imports, or data migration

手動資料輸入、大量資料匯入或資料遷移

Potential harm: Skewed metrics or analyses, inflated or inaccurate counts or predictions, or confusion during data retrieval

指標或分析有偏差、計數或預測誇大或不準確,或資料檢索過程中出現混亂

Outdated data  過時的數據

Description: Any data that is old and should be replaced with newer, more accurate information

任何舊的數據都應該用更新、更準確的資訊來替換

Possible causes: People changing roles or companies, or software and systems becoming obsolete

人們更換角色或公司,或者軟體和系統變得過時

Potential harm: Inaccurate insights, decision-making, and analytics

不準確的見解、決策和分析

Incomplete data  數據不完整

Description: Any data that is missing important fields

缺少重要字段的任何數據

Possible causes: Improper data collection or incorrect data entry

資料收集不當或資料輸入不正確

Potential harm: Decreased productivity, inaccurate insights, or inability to complete essential services

生產力下降、洞見不準確或無法完成基本服務

Incorrect/inaccurate data 不正確/不準確的數據

Description: Any data that is complete but inaccurate

任何完整但不準確的數據

Possible causes: Human error inserted during data input, fake information, or mock data

數據輸入過程中插入的人為錯誤、虛假資訊或模擬數據

Potential harm: Inaccurate insights or decision-making based on bad information resulting in revenue loss

基於錯誤資訊的不準確見解或決策導致收入損失

Inconsistent data  數據不一致

Description: Any data that uses different formats to represent the same thing

任何使用不同格式來表示同一件事的數據

Possible causes: Data stored incorrectly or errors inserted during data transfer

資料儲存不正確或資料傳輸過程中出現錯誤

Potential harm: Contradictory data points leading to confusion or inability to classify or segment customers

相互矛盾的數據點導致混亂或無法對客戶進行分類或細分

Common data-cleaning pitfalls 常見的資料清理陷阱

Not checking for spelling errors 不檢查拼字錯誤

Forgetting to document errors 忘記記錄錯誤

Not checking for misfielded values 未檢查錯誤欄位值

Overlooking missing values 忽略缺失值

Only looking at a subset of the data 只關注資料子集

Losing track of business objectives 迷失業務目標

Not fixing the source of the error 不修復錯誤根源

Not analyzing the system prior to data cleaning 資料清理前未進行系統分析

Not backing up your data prior to data cleaning 資料清理前未備份資料

Not accounting for data cleaning in your deadlines/process 未在截止日期/流程中考慮資料清理

Consider all of the available data

考慮所有可用數據

Identify surrounding factors

確定周圍因素(其他外在因素)

Include self-reported data

包括自我報告的數據(刻板印象)

Use oversampling effectively

有效利用過採樣

# Oversampling is the process of increasing the sample size of nondominant groups in a population.
# 過採樣是增加群體中非主導群體樣本的過程。

Think about fairness from beginning to end

始終思考公平
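Oversampling, as defined in the list above, can be done in several ways. The sketch below uses the simplest one, random oversampling with replacement, so that every group reaches the size of the largest group. The `oversample` helper, the `group` field, and the sample records are hypothetical.

```python
import random
from collections import Counter


def oversample(records, group_key, seed=0):
    """Randomly duplicate records from smaller groups until all groups match
    the size of the largest group."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    groups = {}
    for rec in records:
        groups.setdefault(rec[group_key], []).append(rec)

    target = max(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(members)
        # Draw extra records with replacement from the same group.
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


# Hypothetical survey responses: group A dominates, group B is underrepresented.
responses = [{"group": "A", "score": s} for s in (3, 4, 5, 4, 3, 5)]
responses += [{"group": "B", "score": s} for s in (2, 4)]

balanced = oversample(responses, "group")
group_sizes = Counter(r["group"] for r in balanced)
```

Because the added records are copies, oversampling balances group sizes without adding new information, so results for the smaller group still rest on the original few observations.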

Nine basic principles of design 設計的九個基本原則

Balance: The design of a data visualization is balanced when the key visual elements, like color and shape, are distributed evenly.

平衡 :當關鍵視覺元素(例如顏色和形狀)均勻分佈時,資料視覺化的設計才是平衡的。

Emphasis: Your data visualization should have a focal point, so that your audience knows where to concentrate. 

重點突出: 你的資料視覺化應該有一個焦點,這樣你的受眾就知道應該把注意力集中在哪裡。

Movement: Movement can refer to the path the viewer’s eye travels as they look at a data visualization, or literal movement created by animations.

動畫: 動畫可以指觀看者在觀看資料視覺化時視線的移動路徑,也可以指動畫營造的文字運動。

Pattern: You can use similar shapes and colors to create patterns in your data visualization. 

圖案: 您可以使用相似的形狀和顏色在資料視覺化中建立圖案。這在很多方面都很有用。

Repetition: Repeating chart types, shapes, or colors adds to the effectiveness of your visualization. 

重複: 重複圖表類型、形狀或顏色可以提升視覺化效果。

Proportion: Proportion is another way that you can demonstrate the importance of certain data. 

比例: 比例是展示特定資料重要性的另一種方式。

Rhythm: This refers to creating a sense of movement or flow in your visualization. 

節奏: 這指的是在你的視覺呈現中營造一種運動感或流暢感。

Variety: Your visualizations should have some variety in the chart types, lines, shapes, colors, and values you use. 

多樣性: 視覺化作品應在圖表類型、線條、形狀、顏色和數值方面保持一定的多樣性。

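Unity: The last principle is unity. The final data visualization should be cohesive; if the visual is disjointed or poorly organized, it will be confusing and overwhelming.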

統一性: 最後一個原則是統一性。

Include a title, subtitle, and date 包含標題、副標題和日期

Use a logical sequence of slides 使用符合邏輯順序的投影片

Provide an agenda with a timeline 提供帶有時間表的議程

Limit the amount of text on slides so the audience can scan each one within 5 seconds 限制投影片的文字量,目標讓觀眾在 5 秒內瀏覽完。

Start with the business task  從業務任務開始

Establish the initial hypothesis  建立初步假設

Show what business metrics you used 顯示您使用的業務指標

Use visualizations 使用視覺化

Introduce the graphic by name 透過名稱介紹圖形

Provide a title for each graph 為每個圖表提供標題

Go from the general to the specific 從一般到具體

Use speaker notes to help you remember talking points 使用演講者筆記來幫助您記住談話要點

Include key takeaways 包括關鍵要點

Who is my audience?  我的聽眾是誰?

What is the purpose of my presentation? 我的演講的目的是什麼?

Here is an example of a 30-minute agenda:

以下是一個30分鐘議程的範例:

Introductions (4 minutes) 介紹(4分鐘)

Project overview and goals (5 minutes) 項目概述與目標(5分鐘)

Data and analysis (10 minutes) 數據與分析(10分鐘)

Recommendations (3 minutes) 建議(3分鐘)

Actionable steps (3 minutes) 可操作步驟(3分鐘)

Questions (5 minutes)   提問(5分鐘)
