Business analyst: analyzes data to help businesses improve processes, products, or services
Data analytics consultant: analyzes the systems and models for using data
Data engineer: prepares and integrates data from different sources for analytical use
Data scientist: uses expert skills in technology and social science to find trends through data analysis
Data specialist: organizes or converts data for use in databases or software systems
Operations analyst: analyzes data to assess the performance of business operations and workflows
Marketing analyst: analyzes market conditions to assess the potential sales of products and services
HR/payroll analyst: analyzes payroll data for inefficiencies and errors
Financial analyst: analyzes financial status by collecting, monitoring, and reviewing data
Risk analyst: analyzes financial documents, economic conditions, and client data to help companies determine the level of risk involved in making a particular business decision
Healthcare analyst: analyzes medical data to improve the business aspect of hospitals and medical facilities
Ask
In the ask phase, you’ll work to understand the challenge to be solved or the question to be answered. Stakeholders will likely assign it to you. Because this is the ask phase, you’ll ask many questions to help you along the way.
Prepare
Next, in the prepare phase, you’ll find and collect the data you need to answer your questions. You’ll identify data sources, gather data, and verify that it is accurate and useful for answering your questions.
Process
The process phase is when you clean and organize your data. Tasks here include removing inconsistencies, filling in missing values, and, in many cases, converting the data to a format that’s easier to work with. Essentially, you’re ensuring the data is ready before you begin analysis.
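As a minimal sketch of the process phase, the Python below standardizes inconsistent text and fills in a missing value. The records, the field names `region` and `units`, and the choice to fill the gap with the mean of the known values are all assumptions made for illustration; other fill strategies may suit a real dataset better.

```python
from statistics import mean

# Hypothetical raw records; field names and values are invented for illustration.
raw_records = [
    {"region": " North ", "units": "12"},
    {"region": "north", "units": None},  # missing value
    {"region": "SOUTH", "units": "8"},
]


def process(records):
    """Return cleaned copies: trimmed, lower-cased regions and integer units."""
    known = [int(r["units"]) for r in records if r["units"] is not None]
    # Assumption: impute missing units with the rounded mean of the known values.
    fill_value = round(mean(known))
    cleaned = []
    for r in records:
        units = int(r["units"]) if r["units"] is not None else fill_value
        cleaned.append({"region": r["region"].strip().lower(), "units": units})
    return cleaned


cleaned_records = process(raw_records)
```

After this step every record has a consistently formatted `region` and a numeric `units` value, so the analysis can treat the columns uniformly.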
Analyze
The analyze phase is when you do the data analysis needed to uncover answers and solutions. Depending on the situation and the data, this could involve tasks such as calculating averages or counting items in categories so you can examine trends and patterns.
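The two tasks named above, counting items in categories and calculating averages, can be sketched in a few lines of Python. The `orders` data and its field names are hypothetical and exist only to make the example runnable.

```python
from collections import Counter
from statistics import mean

# Hypothetical order data; categories and amounts are invented for illustration.
orders = [
    {"category": "books", "amount": 20.0},
    {"category": "games", "amount": 60.0},
    {"category": "books", "amount": 10.0},
]


def summarize(rows):
    """Return (count per category, average amount per category)."""
    counts = Counter(row["category"] for row in rows)
    amounts_by_category = {}
    for row in rows:
        amounts_by_category.setdefault(row["category"], []).append(row["amount"])
    averages = {cat: mean(vals) for cat, vals in amounts_by_category.items()}
    return counts, averages


counts, averages = summarize(orders)
```

In a spreadsheet the same summaries would typically come from a pivot table or from functions such as COUNTIF and AVERAGEIF.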
Share
Next comes the share phase, when you present your findings to decision-makers through a report, presentation, or data visualizations. As part of the share phase, you decide which medium to use to share your findings and select the data to include. Tools for presenting data visually include Google Sheets, Tableau, and R.
Act
Last is the act phase, in which you and others in the company put the data insights into action. This could mean implementing a new business strategy, making changes to a website, or any other action that solves the initial problem.
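1. State the problem: Clearly define the problem you are facing.
2. Ask the first “why”: Ask yourself, “Why did this problem happen?”
3. Keep asking: Based on the answer to the previous “why,” ask again, “Why did that happen?” Repeat the process; by about the fifth “why,” you will usually have reached the root cause.
4. Find the root: When you reach an answer that cannot be broken down further, or one that points to an organizational process or a human factor, you have probably found the root of the problem.
“Five” is only a rule of thumb. You may need to ask only three times, or as many as ten. The point is to keep asking until you find the real cause.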
Specific: Is the question specific? Does it address the problem? Does it have context? Will it uncover a lot of the information you need?
Measurable: Will the question give you answers that you can measure?
Action-oriented: Will the answers provide information that helps you devise some type of plan?
Relevant: Is the question about the particular problem you are trying to solve?
Time-bound: Are the answers relevant to the specific time being studied?
1. Plan: Decide what kind of data is needed, how it will be managed, and who will be responsible for it.
2. Capture: Collect or bring in data from a variety of different sources.
3. Manage: Care for and maintain the data. This includes determining how and where it is stored and the tools used to do so.
4. Analyze: Use the data to solve problems, make decisions, and support business goals.
5. Archive: Keep relevant data stored for long-term and future reference.
6. Destroy: Remove data from storage and delete any shared copies of the data.
Other data life cycle examples:
U.S. Fish and Wildlife Service
Plan, Acquire, Maintain, Access, Evaluate, Archive
U.S. Geological Survey (USGS)
Plan, Acquire, Process, Analyze, Preserve, Publish/share
Financial institutions
Capture, Qualify, Transform, Utilize, Report, Archive, Purge
Harvard Business School (HBS)
Generation, Collection, Processing, Storage, Management, Analysis, Visualization, Interpretation
Small data
Describes a dataset made up of specific metrics over a short, well-defined time period
Usually organized and analyzed in spreadsheets
Likely to be used by small and midsize businesses
Simple to collect, store, manage, sort, and visually represent
Usually already a manageable size for analysis
Big data
Describes large, less-specific datasets that cover a long time period
Usually kept in a database and queried
Likely to be used by large organizations
Takes a lot of effort to collect, store, manage, sort, and visually represent
Usually needs to be broken into smaller pieces in order to be organized and analyzed effectively for decision-making
The three (or four) V words for big data
Volume: The amount of data
Variety: The different kinds of data
Velocity: How fast the data can be processed
Veracity: The quality and reliability of the data
Context can turn raw data into meaningful information. It is very important for data analysts to contextualize their data. This means giving the data perspective by defining it. To do this, you need to identify:
Who: The person or organization that created, collected, and/or funded the data collection
What: The things in the world that data could have an impact on
Where: The origin of the data
When: The time when the data was created or collected
Why: The motivation behind the creation or collection
How: The method used to create or collect it
Data issue 1: no data
Gather the data on a small scale to perform a preliminary analysis and then request additional time to complete the analysis after you have collected more data.
If there isn’t time to collect data, perform the analysis using proxy data from other datasets.
This is the most common workaround.
Data issue 2: too little data
Do the analysis using proxy data along with actual data.
Adjust your analysis to align with the data you already have.
Data issue 3: wrong data, including data with errors*
If you have the wrong data because requirements were misunderstood, communicate the requirements again.
Identify errors in the data and, if possible, correct them at the source by looking for a pattern in the errors.
If you can’t correct data errors yourself, you can ignore the wrong data and go ahead with the analysis if your sample size is still large enough and ignoring the data won’t cause systematic bias.
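Data errors:
First, assess whether you can fix the data or request a corrected dataset. If you can, run the analysis after the data has been corrected.
> If the data cannot be fixed, assess whether you have enough data to ignore the erroneous records. If you do, run the analysis on the correct data you already have.
Not enough data:
> If the erroneous data cannot be fixed and there is too much of it to ignore, you face a not-enough-data problem. In that case, assess whether proxy data can stand in. If it can, run the analysis with the proxy data.
> If no proxy data is available, assess whether you can collect more data. If you can, run the analysis once you have collected enough.
>>> If none of the above is feasible, the only remaining option is to modify the business objective.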
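The decision flow above can be written as a short chain of checks. This is only a sketch of the order in which the options are tried; the function name, the four yes/no parameters, and the returned wording are hypothetical, and in practice each answer comes from the analyst's judgment, not from code.

```python
def choose_next_step(can_fix, enough_clean_data, proxy_available, can_collect_more):
    """Return the first workable option, tried in the order the notes describe.

    All four parameters are hypothetical yes/no answers supplied by the analyst.
    """
    if can_fix:
        return "fix the data or request a corrected dataset, then analyze"
    if enough_clean_data:
        return "ignore the erroneous records and analyze the remaining data"
    if proxy_available:
        return "analyze using proxy data"
    if can_collect_more:
        return "collect more data, then analyze"
    return "modify the business objective"


# Example: errors cannot be fixed and too little clean data remains,
# but a comparable dataset can serve as a proxy.
step = choose_next_step(
    can_fix=False,
    enough_clean_data=False,
    proxy_available=True,
    can_collect_more=True,
)
```

The early returns mirror the flow: later options are considered only when every earlier one has been ruled out.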
Sources of errors: Did you use the right tools and functions to find the source of the errors in your dataset?
Null data: Did you search for NULLs using conditional formatting and filters?
Misspelled words: Did you locate all misspellings?
Mistyped numbers: Did you double-check that your numeric data has been entered correctly?
Extra spaces and characters: Did you remove any extra spaces or characters using the TRIM function?
Duplicates: Did you remove duplicates in spreadsheets using the Remove Duplicates function or DISTINCT in SQL?
Mismatched data types: Did you check that numeric, date, and string data are typecast correctly?
Messy (inconsistent) strings: Did you make sure that all of your strings are consistent and meaningful?
Messy (inconsistent) date formats: Did you format the dates consistently throughout your dataset?
Misleading variable labels (columns): Did you name your columns meaningfully?
Truncated data: Did you check for truncated or missing data that needs correction?
Business logic: Did you check that the data makes sense given your knowledge of the business?
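Several items on this checklist (extra spaces, nulls, data types, date formats, and duplicates) can be illustrated together. The sketch below uses only the Python standard library in place of spreadsheet functions or SQL; the sample rows, the field names, and the assumption that dates arrive in exactly two formats are all invented for the example.

```python
from datetime import datetime

# Hypothetical rows as they might arrive from a CSV export (all values are strings).
rows = [
    {"name": "  Ada Lovelace ", "signup": "2023-01-05", "visits": "3"},
    {"name": "Ada Lovelace", "signup": "01/05/2023", "visits": "3"},  # duplicate once cleaned
    {"name": "Alan  Turing", "signup": "2023-02-10", "visits": ""},   # null visits
]

# Assumption: only these two date formats occur in the data.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Try each known format and return a date; fail loudly on anything else."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def clean(records):
    """Trim spaces, typecast fields, unify dates, and drop exact duplicates."""
    seen = set()
    cleaned = []
    for row in records:
        name = " ".join(row["name"].split())  # trim ends, collapse repeated spaces
        signup = parse_date(row["signup"].strip())
        visits = int(row["visits"]) if row["visits"].strip() else None  # empty -> null
        key = (name, signup, visits)
        if key in seen:  # same record already kept
            continue
        seen.add(key)
        cleaned.append({"name": name, "signup": signup, "visits": visits})
    return cleaned


cleaned = clean(rows)
null_rows = [r for r in cleaned if r["visits"] is None]  # rows that still need attention
```

Duplicates are checked only after trimming and date normalization, because the two "Ada Lovelace" rows differ as raw text and match only once they are cleaned.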
When figuring out a sample size, here are things to keep in mind:
As a rule of thumb, don’t use a sample size less than 30. Thirty is the conventional minimum at which the average of a sample starts to approximate the average of the population; it is a guideline, not a strict statistical threshold.
As sample size increases, the distribution of sample averages more closely resembles the normal (bell-shaped) distribution. This is the central limit theorem (CLT), and 30 is conventionally treated as the smallest sample size at which it applies well.
The confidence level most commonly used is 95%, but 90% can work in some cases.
Sample size calculator: https://www.surveymonkey.com/mp/sample-size-calculator/
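For a sense of what a sample size calculator does, here is a minimal sketch using Cochran's formula with a finite population correction. This is a common textbook approach and an assumption on my part, not necessarily the exact formula behind the calculator linked above, so results may differ slightly. The function name and parameters are hypothetical; `proportion=0.5` is the conservative default that yields the largest sample.

```python
import math
from statistics import NormalDist


def sample_size(confidence, margin_of_error, population=None, proportion=0.5):
    """Estimate the sample size needed to estimate a proportion.

    confidence       e.g. 0.95 for a 95% confidence level
    margin_of_error  e.g. 0.05 for plus or minus 5 percentage points
    population       total population size, or None to treat it as very large
    """
    # Two-sided z-score for the confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z**2 * proportion * (1 - proportion) / margin_of_error**2
    if population is not None:
        # Finite population correction: small populations need fewer responses.
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


large_population = sample_size(0.95, 0.05)
population_of_1000 = sample_size(0.95, 0.05, population=1000)
```

Lowering the confidence level to 90% or widening the margin of error reduces the required sample size, which is the trade-off the notes above refer to.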
Duplicate data
Description: Any data record that shows up more than once
Possible causes: Manual data entry, batch data imports, or data migration
Potential harm: Skewed metrics or analyses, inflated or inaccurate counts or predictions, or confusion during data retrieval
Outdated data
Description: Any data that is old and should be replaced with newer and more accurate information
Possible causes: People changing roles or companies, or software and systems becoming obsolete
Potential harm: Inaccurate insights, decision-making, and analytics
Incomplete data
Description: Any data that is missing important fields
Possible causes: Improper data collection or incorrect data entry
Potential harm: Decreased productivity, inaccurate insights, or inability to complete essential services
Incorrect/inaccurate data
Description: Any data that is complete but inaccurate
Possible causes: Human error inserted during data input, fake information, or mock data
Potential harm: Inaccurate insights or decision-making based on bad information, resulting in revenue loss
Inconsistent data
Description: Any data that uses different formats to represent the same thing
Possible causes: Data stored incorrectly or errors inserted during data transfer
Potential harm: Contradictory data points leading to confusion or inability to classify or segment customers
Common data-cleaning pitfalls
Not checking for spelling errors
Forgetting to document errors
Not checking for misfielded values
Overlooking missing values
Only looking at a subset of the data
Losing track of business objectives
Not fixing the source of the error
Not analyzing the system prior to data cleaning
Not backing up your data prior to data cleaning
Not accounting for data cleaning in your deadlines/process
Consider all of the available data
Identify surrounding factors (other external factors)
Include self-reported data (stereotypes)
Use oversampling effectively
# Oversampling is the process of increasing the sample size of nondominant groups in a population.
Think about fairness from beginning to end
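One simple way to illustrate oversampling is to duplicate randomly chosen records from the smaller groups until every group is as large as the largest one. This is a sketch of that one technique only; collecting additional responses from nondominant groups is another way to oversample. The function name, the `group` field, and the survey data are hypothetical.

```python
import random
from collections import Counter


def oversample(records, group_key, seed=0):
    """Return records plus random duplicates so every group matches the largest."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    groups = {}
    for rec in records:
        groups.setdefault(rec[group_key], []).append(rec)
    target = max(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(members)
        # Sample with replacement to fill the gap; k is 0 for the largest group.
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


# Hypothetical survey with a dominant group "A" and a nondominant group "B".
survey = [{"group": "A", "id": i} for i in range(8)] + [
    {"group": "B", "id": i} for i in range(8, 10)
]
balanced = oversample(survey, "group")
group_sizes = Counter(rec["group"] for rec in balanced)
```

Duplicating records balances group sizes but adds no new information, so results for the smaller group still rest on the original few respondents.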
Nine basic principles of design
Balance: The design of a data visualization is balanced when the key visual elements, like color and shape, are distributed evenly.
Emphasis: Your data visualization should have a focal point, so that your audience knows where to concentrate.
Movement: Movement can refer to the path the viewer’s eye travels as they look at a data visualization, or literal movement created by animations.
Pattern: You can use similar shapes and colors to create patterns in your data visualization. This is useful in many ways.
Repetition: Repeating chart types, shapes, or colors adds to the effectiveness of your visualization.
Proportion: Proportion is another way that you can demonstrate the importance of certain data.
Rhythm: This refers to creating a sense of movement or flow in your visualization.
Variety: Your visualizations should have some variety in the chart types, lines, shapes, colors, and values you use.
Unity: The last principle is unity: the finished visualization should read as one cohesive whole.
Include a title, subtitle, and date
Use a logical sequence of slides
Provide an agenda with a timeline
Limit the amount of text on slides; aim for text the audience can scan in about 5 seconds
Start with the business task
Establish the initial hypothesis
Show what business metrics you used
Use visualizations
Introduce the graphic by name
Provide a title for each graph
Go from the general to the specific
Use speaker notes to help you remember talking points
Include key takeaways
Who is my audience?
What is the purpose of my presentation?
Here is an example of a 30-minute agenda:
Introductions (4 minutes)
Project overview and goals (5 minutes)
Data and analysis (10 minutes)
Recommendations (3 minutes)
Actionable steps (3 minutes)
Questions (5 minutes)