Learning Journal

A journal of habit-forming shenanigans


66 Days of Data Science


Resources

  πŸ”–   Coursera
  πŸ”–   Datacamp
  πŸ”–   Kaggle Learn
  πŸ”–   LinkedIn Learning
  πŸ“Ό   YouTube Playlist



The Journey


  πŸ“–   Day 1 - Revision of Statistics and Basic SQL
πŸ—“οΈ Date: 2023-02-15

Resources :

Course


Summary :

While taking the course Introduction to Statistics as part of the track Data Analyst in SQL, I had the chance to review probability, distributions, the central limit theorem, correlation, and hypothesis testing. While revising dependence and conditional probability, I was also able to recall the normal and Poisson distributions (λ = n × p when approximating a binomial).

I also took Introduction to SQL as part of the same curriculum, which helped me revise the basic SQL queries for reading and viewing data from tables. Because of this revision, I learned about "VIEW," a concept I was never aware of before. To summarize, views are virtual tables whose contents are defined by queries. A view mainly lets you restrict access to the underlying tables and improve readability; it does not significantly increase the performance of SQL queries. Nonetheless, it was a useful trick to have in my SQL toolbox.
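
To cement the idea, here is a minimal sketch of creating and querying a view (the employees table and its columns are hypothetical, not from the course):

-- Create a virtual table that exposes only selected columns
CREATE VIEW employee_contacts AS
SELECT first_name, last_name, email
FROM employees;

-- Query the view exactly like a regular table
SELECT *
FROM employee_contacts;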





  πŸ“–   Day 2 - Revision of Intermediate SQL Queries
πŸ—“οΈ  Date: 2023-02-16

Resources :

Course


Summary :

Continuing on from Day 1, I chose the Intermediate SQL course from the same track, which covered queries for selecting, filtering, aggregating, sorting, and grouping. Unlike the previous day, I did not learn a new concept, but it was a good refresher on all these principles, particularly the conventions for writing SQL that promote readability, as I had become a little sloppy in that regard.
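
As a reminder of those conventions, a small sketch of a readable query (the films table and its columns are hypothetical):

-- Capitalized keywords, one clause per line, meaningful aliases
SELECT release_year,
    COUNT(*) AS film_count
FROM films
WHERE release_year >= 2000
GROUP BY release_year
ORDER BY film_count DESC;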





  πŸ“–   Day 3 - Revision of Joins, Set theory & Subqueries in SQL
πŸ—“οΈ  Date: 2023-02-17

Resources :

Course


Summary :

I took the course Joining Data in SQL, the fifth course under the track Data Analyst in SQL. It included an introduction to the various types of joins (inner, outer, cross & self) as well as the set theory operations (union, intersect & except). The cross joins and set theory sections were incredibly beneficial, as they expanded my perspective on producing result tables with minimal, readable queries. While I recall reading about these in my undergrad curriculum, putting them into practice has helped me comprehend them much better. In addition, subqueries in the "WHERE", "FROM" and "SELECT" clauses were covered in the course. I had never used subqueries in the "SELECT" & "FROM" clauses before, so I picked up some cool tricks. I have added some of the syntax that I learned below:


Notes :

  Cross Join Query
--- Creates all possible combinations
SELECT column_name(s)
FROM table1
CROSS JOIN table2;
  Operators
--- UNION Operator : shows unique rows
SELECT column_name(s) FROM table1
UNION
SELECT column_name(s) FROM table2;

--- UNION ALL Operator : shows duplicate rows
SELECT column_name(s) FROM table1
UNION ALL
SELECT column_name(s) FROM table2;

--- EXCEPT Operator : shows rows from the first query that are not present in the second
SELECT column_name(s) FROM table1
EXCEPT
SELECT column_name(s) FROM table2;
  Subquery
--- Example 1: Subquery within WHERE
SELECT name, country_code
FROM cities
WHERE name IN (
    SELECT capital
    FROM countries
);

--- Example 2: Subquery within SELECT
SELECT countries.name AS country_name, (
    SELECT COUNT(*)
    FROM cities
    WHERE cities.country_code = countries.code
) AS cities_num
FROM countries;

--- Example 3: Subquery within FROM
SELECT countries.name AS country_name, lang_num
FROM countries,
    (SELECT code, COUNT(*) AS lang_num
     FROM languages
     GROUP BY code) AS sub
WHERE countries.code = sub.code
ORDER BY lang_num DESC;




  πŸ“–   Day 4 - Exploring Data Manipulation in SQL
πŸ—“οΈ  Date: 2023-02-20

Resources :

Course


Summary :

Machine learning, the trendiest topic of today's generation, is often described as nothing more than a series of if and else statements. A similar scenario occurs in SQL when you use the CASE statement to derive new values from existing records. To be more specific, the first module of Data Manipulation in SQL that I took, 'We'll Take the CASE', focused on using CASE statements to generate labels, probabilities, and percentages based on supplied criteria. While accounting for only one quarter of the course, this subject proved useful in a variety of ways. The following are some examples of the statement:


Notes :

  CASE Statement
--- Example 1 : Basic
SELECT title,
    length,
    CASE
        WHEN length > 0 AND length <= 50
            THEN 'Short'
        WHEN length > 50 AND length <= 120
            THEN 'Medium'
        WHEN length > 120
            THEN 'Long'
        ELSE
            'Outlier'
    END AS duration
FROM film
ORDER BY title;


--- Example 2 : Count
SELECT c.name AS country,
    -- Count games from the 2012/2013 season
    COUNT(CASE WHEN m.season = '2012/2013' THEN m.id ELSE NULL END) AS matches_2012_2013
FROM country AS c
LEFT JOIN match AS m
ON c.id = m.country_id
-- Group by country name alias
GROUP BY country;

--- Example 3 : Percentage
SELECT c.name AS country,
    -- Round the percentage of tied games to 2 decimal points
    ROUND(AVG(CASE WHEN m.season = '2013/2014' AND m.home_goal = m.away_goal THEN 1
                   WHEN m.season = '2013/2014' AND m.home_goal != m.away_goal THEN 0
              END), 2) AS pct_ties_2013_2014,
    ROUND(AVG(CASE WHEN m.season = '2014/2015' AND m.home_goal = m.away_goal THEN 1
                   WHEN m.season = '2014/2015' AND m.home_goal != m.away_goal THEN 0
              END), 2) AS pct_ties_2014_2015
FROM country AS c
LEFT JOIN matches AS m
ON c.id = m.country_id
GROUP BY country;




  πŸ“–   Day 5 - (Continued) Data Manipulation in SQL
πŸ—“οΈ  Date: 2023-02-21

Resources :

Course


Summary :

Continuing with the remaining modules of the Data Manipulation in SQL course, I was able to gain insights on simple subqueries and joins, correlated subqueries (which take higher processing time), multiple/nested subqueries, and Common Table Expressions (CTEs). These concepts were handy in allowing me to perform complex actions within SQL and derive data points that I once thought were only possible through pandas (a Python library).

More significantly, I learned about window functions and their various pieces, such as OVER, RANK, PARTITION BY, and sliding frames, throughout this course. While I had seen them before, I had never utilized them in practice, and I am pleased that this course allowed me to do so. Aggregating over columns that aren't in the grouping columns is likely the most useful skill to have, especially when doing comparative analysis.


Notes :

  Correlated subquery with multiple conditions
SELECT
    -- Select country ID, date, home, and away goals from match
    main.country_id,
    main.date,
    main.home_goal,
    main.away_goal
FROM match AS main
WHERE
    -- Filter for matches with the highest number of goals scored
    (home_goal + away_goal) >
        (SELECT MAX(sub.home_goal + sub.away_goal)
        FROM match AS sub
        WHERE main.country_id = sub.country_id
            AND main.season = sub.season);
  Common Table Expressions
WITH match_list AS (
    SELECT
        country_id,
        id
    FROM match
)
-- Select league and count of matches from the CTE
SELECT
    l.name AS league,
    COUNT(match_list.id) AS matches
FROM league AS l
-- Join the CTE to the league table
LEFT JOIN match_list ON l.id = match_list.country_id
GROUP BY l.name;
  Window Function
-- Example 1 : Over function
SELECT
    m.id,
    c.name AS country,
    m.season,
    m.home_goal,
    m.away_goal,
    -- Use a window to include the aggregate average in each row
    AVG(m.home_goal + m.away_goal) OVER() AS overall_avg
FROM match AS m
LEFT JOIN country AS c ON m.country_id = c.id;

-- Example 2 : Rank function
SELECT
    l.name AS league,
    AVG(m.home_goal + m.away_goal) AS avg_goals,
    -- Rank each league according to the average goals
    RANK() OVER(ORDER BY AVG(m.home_goal + m.away_goal) DESC) AS league_rank
FROM league AS l
LEFT JOIN match AS m
ON l.id = m.country_id
WHERE m.season = '2011/2012'
GROUP BY l.name
ORDER BY league_rank;

-- Example 3 : Partition function

SELECT
    c.name,
    m.season,
    (home_goal + away_goal) AS goals,
    AVG(home_goal + away_goal)
        OVER(PARTITION BY m.season, c.name) AS season_country_avg
FROM country AS c
LEFT JOIN match AS m
ON c.id = m.country_id;

-- Example 4 : Sliding Function

SELECT
    date,
    home_goal,
    away_goal,
    -- Create a running total and running average of home goals
    SUM(home_goal) OVER(ORDER BY date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
    AVG(home_goal) OVER(ORDER BY date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_avg
FROM match
WHERE
    hometeam_id = 9908
    AND season = '2011/2012';




  πŸ“–   Day 6 - Understanding PostgreSQL Summary Stats
πŸ—“οΈ  Date: 2023-02-22

Resources :

Course

Articles


Summary :

Continuing with window functions, I became more familiar with the different types of window functions, particularly the fetching, framing, and ranking functions, which I practiced today. While these functions seemed intimidating at first, they turned out to be considerably easier than I had anticipated.

Besides this, I attempted to put my knowledge into practice by answering the practice questions in the "Advanced SQL" section of Kaggle Learn. It was a valuable experience since I was able to correctly utilize window functions and also learn about the 'UNNEST' function to query nested and repeated data from tables.


Notes :

  Fetching functions
| Operator | Description |
| --- | --- |
| LAG(column, n) | Returns the column's value at the row n rows before the current row |
| LEAD(column, n) | Returns the column's value at the row n rows after the current row |
| FIRST_VALUE(column) | Returns the first value in the table or partition |
| LAST_VALUE(column) | Returns the last value in the table or partition |
  Framing functions
| Operator | Description |
| --- | --- |
| ROWS/RANGE | Uses the given rows or range as a frame |
| PRECEDING | Rows before the current row |
| UNBOUNDED PRECEDING | All rows before the current row |
| UNBOUNDED FOLLOWING | All rows after the current row |
| CURRENT ROW | The current row of query execution |
  Ranking Functions
| Operator | Description |
| --- | --- |
| ROW_NUMBER | Unique sequential number for each row in the specified partition |
| RANK | Unique rank for each distinct row within the specified partition; equal values share the same rank and the following ranks are skipped |
| DENSE_RANK | Unique rank for each distinct row within the specified partition, without skipping any rank values |
| NTILE | Distributes the rows into a specified number (n) of groups |
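
To make the ranking functions above concrete, a minimal sketch assuming a hypothetical employees table with a salary column:

SELECT name,
    salary,
    ROW_NUMBER() OVER(ORDER BY salary DESC) AS row_num,
    RANK() OVER(ORDER BY salary DESC) AS salary_rank,
    DENSE_RANK() OVER(ORDER BY salary DESC) AS dense_salary_rank,
    NTILE(4) OVER(ORDER BY salary DESC) AS salary_quartile
FROM employees;

And a rough sketch of the 'UNNEST' idea from the Kaggle exercises, using BigQuery-style syntax and a hypothetical orders table whose items column is a repeated (array) field:

-- Flatten the repeated items column into one row per item
SELECT o.order_id,
    item
FROM orders AS o,
    UNNEST(o.items) AS item;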




  πŸ“–   Day 7 - Understanding Data Scientist Fundamentals
πŸ—“οΈ  Date: 2023-02-23

Resources :

Course


Summary :

Taking a break from the regular SQL courses, I delved into the everyday life of a data scientist, complete with current data science issues and how data scientists manage themselves and the organizations for which they operate. I was also able to take the following course on the non-technical abilities of a successful data scientist, which addressed not just the attributes that a person should have but also the role diplomacy plays while working in a professional setting. In addition, to polish my pandas abilities, I completed a Kaggle Learn course that served as a refresher on the techniques I use on a daily basis.





  πŸ“–   Day 8 - Exploring PostgreSQL Window Functions
πŸ—“οΈ  Date: 2023-02-24

Resources :

Course


Summary :

Leveraging the same elements in different ways has always lit up the neurons in my brain, allowing me to perceive the world in new ways. This occurred when learning how to use aggregate functions within window functions to obtain new results. In fact, using the same `SUM` and `AVG` functions to produce moving totals and averages within SQL itself, with the assistance of frames, made me leap on top of my bed. There are so many things SQL can do that I had always assumed only pandas could accomplish. While writing sophisticated queries in pandas may be quicker, the execution time is often much faster when the same queries are implemented directly in SQL, without loading the dataset into memory.

Continuing this discovery, I learned that pivoting tables in SQL is also possible with `CROSSTAB`, along with other beneficial functions like `ROLLUP`, `CUBE`, `COALESCE`, and `STRING_AGG`, which will come in handy when relying only on SQL.


Notes :

  ROWS BETWEEN

Syntax : ROWS BETWEEN [start] AND [finish]

  • n PRECEDING : n rows before the current row
  • CURRENT ROW : the current row
  • n FOLLOWING : n rows after the current row

Examples

  • ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
  • ROWS BETWEEN 4 PRECEDING AND 4 FOLLOWING
  • ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING
  CROSSTAB
Before using CROSSTAB, create the tablefunc extension:
    CREATE EXTENSION IF NOT EXISTS tablefunc;

    SELECT * FROM CROSSTAB($$
        source_sql TEXT
    $$) AS ct(
        column_1 DATA_TYPE_1,
        column_2 DATA_TYPE_2,
        ...,
        column_n DATA_TYPE_N
    );

  ROLLUP and CUBE
The `ROLLUP` option allows you to include extra rows that represent subtotals (commonly referred to as super-aggregate rows), along with the grand total row.
SELECT
    country, warehouse, SUM(quantity)
FROM
    inventory
GROUP BY ROLLUP (country, warehouse);
`ROLLUP` is hierarchical, de-aggregating from the leftmost provided column to the rightmost.
ROLLUP (country, warehouse) -- includes country level totals
ROLLUP (warehouse, country) -- includes warehouse level totals
However, when we need all possible group-level aggregations, we use `CUBE`, which shares similar properties with `ROLLUP`.
CUBE (country, warehouse) -- country level and warehouse level, and grand total
  Useful Functions
- COALESCE : `COALESCE()` takes a list of values and returns the first non-null value, going from left to right.
COALESCE(null, null, 1, null, 2) -- returns 1
- STRING_AGG : `STRING_AGG(column, separator)` takes all the values of a column and concatenates them, with `separator` in between each value.
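
A minimal sketch of `STRING_AGG`, assuming a hypothetical film table with rating and title columns:

-- One row per rating, with all titles concatenated into a single string
SELECT rating,
    STRING_AGG(title, ', ') AS titles
FROM film
GROUP BY rating;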




  πŸ“–   Day 9 - Exploring Functions for Data manipulation in SQL
πŸ—“οΈ  Date: 2023-02-25

Resources :

Course


Summary :

The focus of today's course was data manipulation in PostgreSQL using both built-in and user-defined functions. The built-in functions covered included common data types and their casts, date/time functions and operators, and string parsing and manipulation functions. While most of the operators were familiar, I learned about several new ones, such as `INTERVAL` and `INITCAP`. Nevertheless, the PostgreSQL extensions and full-text search capabilities were entirely new subjects, particularly `tsvector` (text search vector) for executing a full-text search beyond the scope of the 'LIKE' operator. Knowing that PostgreSQL offers built-in extensions such as fuzzy string matching through 'levenshtein' and 'similarity' blew my mind, as I had previously only done this in Python. Learning the syntax to develop my own functions was also quite instructive. Overall, it was a productive weekend spent learning more about PostgreSQL.


Notes :

  INFORMATION_SCHEMA
`INFORMATION_SCHEMA` provides access to database metadata: information about the database server such as the name of a database or table, the data type of a column, or access privileges.
-- Example 1 : Extracting all table names from system database
SELECT table_name, table_type
FROM INFORMATION_SCHEMA.TABLES
WHERE table_schema = 'public';

-- Example 2 : Extracting column data types from table
SELECT
column_name,
data_type
FROM INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'actor';
  INTERVAL
The `INTERVAL` data type allows you to store and manipulate a period of time in years, months, days, hours, minutes, seconds, etc.
INTERVAL '3 days' -- goes forward in time
INTERVAL '2 months ago'; -- goes back in time due to the keyword 'ago'
INTERVAL '3 hours 20 minutes';

-- Example 1 : Addition of timeframe
SELECT rental_date + INTERVAL '2 days' as expected_return
FROM rental;

-- Example 2: Conversion of column to interval
SELECT INTERVAL '1' day * rental_duration
FROM rental;
  DATETIME Operators
| Operator | Description |
| --- | --- |
| AGE() | Subtracts from current_date (at midnight) when given a single argument, or subtracts the second argument from the first when two values are provided |
| NOW() | Gets the current timestamp with microsecond precision |
| CURRENT_TIMESTAMP(precision) | Similar to NOW(), but allows a precision parameter to round off the seconds |
| CURRENT_DATE / CURRENT_TIME | Get the current date and time |
| EXTRACT(field FROM source) | Gets a subfield |
| DATE_PART('field', source) | Gets a subfield (equivalent to EXTRACT) |
| DATE_TRUNC('field', source) | Truncates timestamp or interval data types to the given precision |
| ISFINITE() | Tests for a finite date, time or interval (not +/- infinity) |
  STRING Operators
| Operator | Description |
| --- | --- |
| UPPER(source) / LOWER(source) | Converts the string to upper or lower case |
| INITCAP(source) | Converts the string to title case |
| REPLACE(source, 'find_string', 'replace_string') | Replaces occurrences of the find string with the replacement string |
| REVERSE(source) | Reverses the string |
| LENGTH(source) | Returns the length of the string |
| POSITION('char' IN source) | Returns the first position of a character in the string |
| LEFT(source, n) | Extracts n characters from the left side of the string |
| RIGHT(source, n) | Extracts n characters from the right side of the string |
| SUBSTRING(source, start, length) | Extracts a substring of a given length starting from a particular position of the string |
| TRIM([leading\|trailing\|both] [characters] FROM source) | Removes characters from the source |
| LPAD(source, n, char) | Left-pads a string with another string, to a certain length |
| RPAD(source, n, char) | Right-pads a string with another string, to a certain length |
  FULL TEXT Search
- Basic Search
`to_tsvector(text)` : performs normalization and creates a list of tokens
`to_tsquery(string)` : accepts a list of words that will be checked against the normalized vector
`@@` : checks whether the `tsquery` matches the `tsvector`
-- Example 1 : Check if the title contains 'elf'
SELECT title, description
FROM film
WHERE to_tsvector(title) @@ to_tsquery('elf');
- Fuzzy string matching
-- Enable the fuzzystrmatch extension
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
-- Confirm that fuzzystrmatch has been enabled
SELECT extname FROM pg_extension;

SELECT levenshtein('hello', 'jelly'); -- number of edits required to be a perfect match
SELECT similarity('hello', 'jelly'); -- similarity between two strings from 0 to 1
  User Defined Data Types
- Enumerated Data Types : allow you to create a list of values that will not change
CREATE TYPE dayofweek AS
ENUM('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday');

-- Check
SELECT typname, typcategory
FROM pg_type
WHERE typname='dayofweek';
  User Defined Functions
CREATE FUNCTION squared(i integer) RETURNS integer AS $$
        BEGIN
            RETURN i * i;
        END;
    $$ LANGUAGE plpgsql;
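
Once defined, the function can be called like any built-in one (my own usage note):

SELECT squared(4); -- returns 16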




  πŸ“–   Day 10 - Exploratory Data Analysis in SQL
πŸ—“οΈ  Date: 2023-02-27

Resources :

Course


Summary :

Breaking from the usual heavy dosage of study sessions, this particular course covered the usage of relationship diagrams, constraints (primary key, foreign key, unique and not null), and data types for columns. The most significant functions from this course are 'corr' and 'percentile_disc', which give you the correlation between two columns and the discrete value at a given percentile. Moreover, temporary tables were a notion I had heard of but had never used in practice, and this course was a huge help in reinforcing the concept of breaking large queries into smaller chunks.


Notes :

  CAST Function
-- Cast Function syntax
SELECT CAST (value AS value_type);

-- Alternate Cast Function with :: notation
SELECT value::new_type;

-- Example 1 : Casting float to integer
SELECT CAST (3.7 AS integer);
  Series
-- Example 1 : Basic series
SELECT generate_series(1, 10, 2);

-- Example 2 : Float series
SELECT generate_series(0, 1, 0.1);
  Summary functions
| Function | Description |
| --- | --- |
| CORR(source1, source2) | Returns the correlation between two columns |
| percentile_disc(percentile) WITHIN GROUP (ORDER BY column_name) | Returns the value of the column at the given percentile, using the discrete method |
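
A small sketch combining both functions, assuming a hypothetical stackoverflow table with question_count and unanswered_count columns:

SELECT corr(question_count, unanswered_count) AS correlation,
    percentile_disc(0.5) WITHIN GROUP (ORDER BY question_count) AS median_questions
FROM stackoverflow;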
  Temporary Tables
-- Dropping the table
DROP TABLE IF EXISTS table_name

-- Create a temporary table
CREATE TEMP TABLE table_name AS
SELECT column1, column2
FROM table;




  πŸ“–   Day 11 - (Continued) Exploratory Data Analysis in SQL
πŸ—“οΈ  Date: 2023-02-28

Resources :

Course


Summary :

The remaining modules of the course delved into the topic of character types in PostgreSQL, specifically character, varchar, and text. It also covered common challenges that arise when grouping categorical variables and dealing with unstructured text data. The modules included exercises on data cleaning such as dealing with cases and white spaces, as well as data manipulation techniques such as splitting strings using delimiters and concatenating multiple strings. Additionally, the course covered working with date and timestamps to create complex queries through series.


Notes :

  Series Generation
-- Syntax
SELECT generate_series(from, to, interval);

-- Example 1
SELECT generate_series('2018-01-01', '2018-01-15', '2 days'::interval)
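
A quick sketch of the cleaning and string-manipulation functions mentioned above (the service_requests table and its columns are hypothetical):

SELECT
    TRIM(LOWER(category)) AS clean_category,         -- normalize case and strip surrounding spaces
    split_part(address, ' ', 1) AS house_number,     -- split on a delimiter and keep the first part
    concat(first_name, ' ', last_name) AS full_name  -- concatenate multiple strings
FROM service_requests;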




  πŸ“–   Day 12 - Data-Driven Decision Making using SQL
πŸ—“οΈ  Date: 2023-03-01

Resources :

Course

Project

Summary :

With all the skills I have accumulated so far, it was only a matter of implementing them. While a proper implementation is yet to come, I could still practice within a real environment through the course "Data-Driven Decision Making in SQL" and the project "When Was the Golden Age of Video Games?". These allowed me to use all of the concepts from data cleaning and manipulation to aggregation, and concentrated on using groupings, joins and pivots to create complex tables. Today marks the end of the career track, and I'm over the moon with all the knowledge I've gained in these 12 days. Yay for learning!





  πŸ“–   Day 13 - Understanding Data Science & Analytics Career Paths
πŸ—“οΈ  Date: 2023-03-02

Resources :

Course


Summary :

Before diving into the world of mathematics, I needed to grasp the foundations that I would need to build on as a Data Analyst. Attending the LinkedIn Learning career course "Data Science & Analytics Career Pathways & Certifications" was quite beneficial in this regard. It began by discussing the applications of data science, such as fraud detection, social media analytics, disease control, dating services, simulations, climate research, and network security. It also discussed the abilities required to stay relevant in the sector; data mining, machine learning, natural language processing, statistics, and visualization were among the crucial skills mentioned. It also discussed certificates that can help advance one's career and establish one as a specialist in a particular subject. Overall, the course was beneficial in aiding my comprehension of what it takes to stay relevant in the ever-changing world of data science.





  πŸ“–   Day 14 - Exploring Data Visualization through Storytelling
πŸ—“οΈ  Date: 2023-03-03

Resources :

Course


Summary :

Visualizing data through narrative storytelling is one of the most crucial skills for a data analyst to have, and it is one that sets them apart from their colleagues. So I took a data visualization course that covered story structure and its components (beginning, middle, end, plot, protagonist, problem and transformation). It also demonstrated the use of flow diagrams to represent a linear data flow for effective storytelling. Most notably, the course taught how to demonstrate your analytic abilities utilizing the 4x4 progressive depth model:

  • The watercooler moment

    • The initial attention grabber determines whether or not individuals are interested in learning more.
    • Example: Image or headline.
  • The cafe content

    • Example : Blog post or short article
  • The research library

    • Research portion, such as a PDF document.
  • The Lab Experience

    • Interactive dashboard where data aficionados can examine the content and tinker to answer their in-depth questions





  πŸ“–   Day 15 - Setup for IBM Watson and SpaceX Rest API
πŸ—“οΈ  Date: 2023-03-05

Resources :

Course


Summary :

I took a break from learning today to prepare for the journey ahead! I made my own IBM account and configured Watson Studio to publish notebooks directly to my GitHub repository. I also explored the world of SpaceX's REST API in order to extract useful data for upcoming projects. We can get so enthused about learning new things that we forget to take a deep breath and get organized. However, not today.





  πŸ“–   Day 16 - Analyzing SpaceX Data and AI ethics
πŸ—“οΈ  Date: 2023-03-06

Resources :

Course


Summary :

Building on yesterday's exploration, today was all about extracting launch data from SpaceX using requests and Beautiful Soup. The objective was to determine how fruitful it would be for a hypothetical company, SpaceY, to enter the business. During the course, I delved into the concepts of Exploratory Data Analysis and Feature Engineering, utilizing both Python and SQL to analyze the data. Wrapping up the analysis, I visually represented the findings using scatterplots and barplots to identify factors such as landing site, booster, and payload mass that can contribute to a higher success rate.

Aside from that, I took an AI ethics course and was introduced to Human-Centered Design for AI and its significance. It not only helped me assess whether a project is worth transitioning to AI, but it also helped me grasp that AI systems are more effective when they work alongside people rather than independently. I also learned about the numerous types of bias and fairness criteria that become relevant when an ML model is built on biased data: garbage in, garbage out.


Notes :

  Six Types of Bias
  • Historical Bias
    • Occurs when the state of the world in which the data was generated is flawed.
  • Representation bias
    • Occurs when building datasets for training a model, if those datasets poorly represent the people that the model will serve.
    • Example : if the dataset used to train the models exclude darker skin tones.
  • Measurement bias
    • Occurs when the accuracy of the data varies across groups.
    • This can happen when working with proxy variables (variables that take the place of a variable that cannot be directly measured), if the quality of the proxy varies in different groups.
    • Example : if the measurement apparatus shows reduced performance with dark skin tones.
  • Aggregation bias
    • Occurs when groups are inappropriately combined, resulting in a model that does not perform well for any group or only performs well for the majority group.
    • This is often not an issue, but most commonly arises in medical applications.
  • Evaluation bias
    • Occurs when evaluating a model.
    • If the benchmark data (used to compare the model to other models that perform similar tasks) does not represent the population that the model will serve.
    • Example : if the dataset used to benchmark the model excludes darker skin tones.
  • Deployment bias
    • Occurs when the problem the model is intended to solve is different from the way it is actually used.
    • If the end users don’t use the model in the way it is intended, there is no guarantee that the model will perform well.
types of bias
  Four fairness criteria
  • Demographic parity / statistical parity
    • It says the model is fair if the composition of people who are selected by the model matches the group membership percentages of the applicants.
  • Equal opportunity fairness
    • It ensures that the proportion of people who should be selected by the model ("positives") that are correctly selected by the model is the same for each group.
    • We refer to this proportion as the true positive rate (TPR) or sensitivity of the model.
  • Equal accuracy
    • It is the percentage of correct classifications (people who should be denied and are denied, and people who should be approved who are approved) should be the same for each group.
    • If the model is 98% accurate for individuals in one group, it should be 98% accurate for other groups.
  • Group unaware / Fairness through unawareness
    • Group unaware fairness removes all group membership information from the dataset.
    • For instance, we can remove gender data to try to make the model fair to different gender groups.
    • Similarly, we can remove information about race or age.




  πŸ“–   Day 17 - Interactive Dashboard, Predictive Analysis & Reporting
πŸ—“οΈ  Date: 2023-03-07

Resources :

Course


Summary :

After completing the exploratory data analysis, I delved into creating an interactive dashboard with Plotly Dash and Folium to facilitate real-time data analysis. It was a good refresher on the concept of Dash callbacks, which translate user inputs and update existing charts based on those inputs. In addition, as part of the course, I touched on predictive analysis to determine the optimum model and hyperparameters needed to predict the launch's success rate. To do this, I used preprocessing, GridSearchCV, LogisticRegression, DecisionTreeClassifier, and KNeighborsClassifier to help automate model selection, as well as a confusion matrix to evaluate the accuracy much more clearly.

With plenty of time left in the day, I investigated the creation of an effective data analysis report and its components. While data reports vary depending on the use and data included, I was able to get a general idea of how a data report should look through the course.


Notes :

  Elements of Data Finding Report
  • Cover Page
    • Contains: Title, Date and Name of the presenter
  • Executive Summary
    • Briefly explain the details
    • Considered a stand-alone document
    • No new information should be presented except from the main points
  • Table of Contents
  • Introduction
    • Nature of the analysis
    • States the problem
    • States questions for analysis
  • Methodology
    • Explains the data sources
    • Outlines the plan for the collected data
  • Results
    • Details of data collection
    • How the data was organized
    • How the data was analyzed
    • Charts and graphs to show crucial finding
  • Discussion
    • Engage the audience
  • Conclusion
    • Conclusion of the report findings, reiterating the problem given in the introduction
    • Overall summary of the findings
    • Outcome of the analysis
    • Any steps taken in future
  • Appendix
    • Information that didn't fit in the report
    • Resources and references




  πŸ“–   Day 18 - Using Powerpoint to Communicate Finding from Space X Launch Data
πŸ—“οΈ  Date: 2023-03-08

Resources :

Course


Summary :

After a thorough analysis of SpaceX's launches, it was time to predict whether the first stage would land successfully, so as to give the likes of SpaceX some competition with the assistance of Company Y. Armed with a lengthy 50-page presentation, a combination of online resources and a dash of personal passion was instrumental in completing the task, and in the process I honed valuable presentation-creation skills. In addition, the power of context cannot be overstated, as it helped convey the insights more easily, with an executive summary for those uninterested in the details. All in all, it was a remarkable learning experience that showcased the importance of a compelling narrative and a comprehensive overview for maximum impact.





  πŸ“–   Day 19 - Understanding Data Governance
πŸ—“οΈ  Date: 2023-03-09

Resources :

Course


Summary :

In the data governance course, I gained insights into the significance of efficient data management in organizations. The course taught me that data governance involves creating and enforcing policies, procedures, and standards to manage data assets of an organization, which includes data privacy, quality, security, and access. A crucial lesson that I learned was how data governance plays a critical role in ensuring the trustworthiness and correctness of an organization's data. It enables high-quality data that can be relied upon to drive decision-making processes. Moreover, data governance can also aid organizations in complying with regulatory obligations related to data privacy and security.


Notes :

  Principles of Data Governance
  • Transparency
    • All data governance processes implemented throughout your organisation should exhibit the utmost transparency.
  • Accountability
    • Ownership and accountability have to be applied across the organisation for the data collected and stored by individuals.
  • Standardization
    • Any successful data governance process will need to define and abide by standardised rules and regulations to protect their data and to ensure it is used in accordance with all relevant external regulations (such as the GDPR).




  πŸ“–   Day 20 - Lessons from Data Scientists and exploring side hustles
πŸ—“οΈ  Date: 2023-03-10

Resources :

Course


Summary :

Six individuals, each with their unique experiences in data, shared their stories through the course. These narratives covered their journeys starting out in data analytics, the inspiration behind their work, the impact of their contributions on the organization, their current endeavors, and practical advice based on their experiences. One of the prominent discussions was about the ethical considerations that data scientists face while conducting an analysis, where certain data points may conflict with their personal values. However, what had a significant impact on my outlook towards the data science field was gaining insights about the industry and the people involved in it during the course.

In addition to my current pursuits, I have become interested in side hustle strategies for data science. The monotony of having only one job motivated me to seek out new opportunities to expand my abilities and skills. I discovered a range of options such as writing, training, consulting, attending conferences, and engaging with academics. These activities may include co-authoring a book, writing a chapter in a second edition, or providing training in R or Python in one's free time through in-person or online classes. These endeavors not only benefit the individual but can also contribute to growing the data science industry. As one gains expertise, opportunities such as speaking at conferences and consulting with organizations can lead to expanding networks and discovering new possibilities. It's also worth noting that if one is currently involved in academia as a student or teacher, there are resources beyond the classroom that can be taken advantage of, such as university libraries and websites like GitHub Education. By making the most of what is available and staying informed about the latest tools and patterns in data analytics, one can continue to expand their knowledge and skills in this exciting field.





  πŸ“–   Day 21 - Introduction to Tableau
πŸ—“οΈ  Date: 2023-03-19

Resources :

Course


Summary :

Today marks the first day of my journey into the world of Tableau, following a 7-day break from my usual learning streak. I began my day by loading workbooks and familiarizing myself with the navigation of Tableau through the menu pane and toolbar. Next, I dove into the world of sorting and the various types of filters (extract filter, data source filter, context filter, dimension filter, and measure filter) in Tableau. I then proceeded to use aggregations and create custom columns using calculated fields that leverage the inbuilt functions within Tableau. Moving on, I explored the topic of creating visualizations on geo maps using geocoding. I was thrilled to see how easy it was to map data and extract meaningful insights based on the location of the data points. I also learned how to work with dates and create reference lines, trend lines, and forecasts using Tableau. Finally, I learned how to convey my findings with dashboards and stories in Tableau.

Overall, it was an interesting and exciting day of learning to comprehend a tool that can significantly decrease my workload by providing rapid visualizations that would take hours to produce in Python. I'm looking forward to applying these skills to real-world scenarios and deepening my knowledge in the days to come.


Notes :

  Data roles in Tableau
  • Discrete dimension
    • Common, colored in blue
    • Finite amount of values
    • Can't be aggregated
    • Example: Eye color, Gender, etc.
  • Discrete measure
    • Not common, colored in blue
    • Finite amount of values
    • Can be aggregated
    • Example: Shoe size, Age, etc.
  • Continuous dimension
    • Not common, colored in green
    • Infinite amount of values
    • Can't be aggregated
    • Example: Date, etc.
  • Continuous measure
    • Common, colored in green
    • Infinite amount of values
    • Can be aggregated
    • Example: Height, Weight, etc.




  πŸ“–   Day 22 - Analyzing Data in Tableau
πŸ—“οΈ  Date: 2023-03-21

Resources :

Course


Summary :

Today's focus was on preparing for analysis using Tableau. I began by familiarizing myself with various slicing and dicing functions and visualizations that are readily available in the software. It was exciting to get the opportunity to make complex visualizations and help analyze a hypothetical scenario. The scenario contained a "Two by two discount" promotional campaign, which seeks to increase customer purchases on the two overall lowest riding days, as well as the two lowest windows of time, excluding nighttime. Some of my favorite visualizations included a heat map to show sales volume by product and a scatter plot to show the correlation between time of day and sales. It was a challenging yet rewarding experience that allowed me to build on my existing Tableau skills and gain new insights into data analysis.


trend by day and gender
weekday_trends_by_gender_and_time_segment

Furthermore, I explored more of Tableau's capabilities with user data. I added filters and created KPI for dashboards, which helped to better understand user behavior and trends. Additionally, I created a histogram using bins and compared it with a traditional line chart to understand the distribution of a specific variable. I found that the histogram provided a more granular view of the data and revealed insights that were not visible through the line chart.





  πŸ“–   Day 23 - (Continued) Analyzing Data in Tableau
πŸ—“οΈ  Date: 2023-03-27

Resources :

Course


Summary :

Today, I learnt how to use Tableau to map customer activity by analyzing popular bike locations and looking at each station's user base. This involved using various techniques such as color, size, layers, tooltips, pages, and quick table calculations to create maps that reveal trends and insights in the data. I also learned the importance of adding other charts inside tooltips to view data more closely and provide additional context.


charts_in_tooltip
concentration_of_activity

Similarly, I was able to utilize the various features Tableau provides to group, filter, and parameterize data in order to gain insights and create more meaningful visualizations. To explain further, groups can be used to combine similar data points into a single category. Sets are similar to groups, but instead of combining data points into a category, they allow you to create subsets of data based on specific criteria. Parameters allow you to create user-defined inputs that can be used to adjust various aspects of a visualization.


region_grouping



  πŸ“–   Day 24 - Creating Dashboards in Tableau
πŸ—“οΈ  Date: 2023-03-28

Resources :

Course


Summary :

Today, I learned how to create interactive dashboards in Tableau that can be used to share data insights with others. By combining different worksheets and allowing self-serve data exploration, I was able to create dashboards that provide users with the ability to analyze data in a more interactive and meaningful way. Through the course, I learned best practices for creating dashboards, including alignment and adding worksheets inside tooltips to allow for better analysis. Using the Divy dataset provided in DataCamp, I was able to create various dashboards with appropriate worksheets, parameters, objects, and actions to increase the readability and interactivity of the dashboard. Overall, I found the process of creating interactive dashboards in Tableau to be both challenging and rewarding, and I look forward to applying these skills in future data science projects.


dashboard_fundamental
time_analysis_dashboard



  πŸ“–   Day 26 - (Continued) Creating Dashboards in Tableau
πŸ—“οΈ  Date: 2023-03-30

Resources :

Course


Summary :

Today, I learned about sharing insights from data using different techniques, with a focus on the importance of effective presentation. Specifically, I gained knowledge on the ways to utilize the power of Tableau to create visually appealing and informative presentations through the use of stories, annotations, and highlights. This will enable me to effectively convey key points to my audience and keep their attention focused on the most important parts of the visualization. Additionally, I discovered tips for optimizing dashboards to ensure they are responsive and user-friendly on mobile devices.


Notes :

  Essential elements of an analytical presentation
  • Big picture question or hypothesis
  • Data description
  • Insights
  • Observations and implications
  • Recommendations and next steps




  πŸ“–   Day 27 - Case Study: Analyzing Customer Churn in Tableau
πŸ—“οΈ  Date: 2023-04-01

Resources :

Course


Summary :

Today's learning focused on understanding the data analysis flow in Tableau, which involves data check, data exploration, analyzing and visualizing data, dashboarding, and communicating insights. To effectively learn these concepts, I decided to download Tableau Public and perform the analysis. This decision proved to be helpful, as it eliminated any lag present while using the virtual machine provided by Datacamp, allowing me to work faster and more efficiently. However, using the public version meant that I was restricted and couldn't save my progress. During the analysis process, I utilized a range of visualizations, such as dual-axis graphs, scatter plots, and maps, to better understand the data. Additionally, I made use of calculated fields, bins, and bin sizes, which helped me analyze the data more effectively. To take advantage of Tableau's interactivity feature, I leveraged the power of filters, visualizations as filters, and parameters. By understanding the data analysis flow in Tableau and utilizing its features, I was able to effectively find customer churn for a hypothetical company.


churned_customers_by_state
churn_analysis_monthly_charge
churn_analysis_contract_payment



  πŸ“–   Day 28 - Connecting Data in Tableau
πŸ—“οΈ  Date: 2023-04-02

Resources :

Course


Summary :

Today's course focused on the crucial task of combining multiple datasets into one. As data rarely comes in one single file, this process is essential in the real world. Fortunately, Tableau has built-in features to combine multiple datasets using unions, joins, and relationships, which made the course comparatively straightforward. Along with learning about combining datasets, I was also introduced to the different types of Tableau file formats, including packaged workbooks, workbooks, data sources, and extracts. These formats enable users to share their worksheets with their peers easily. Although the course was relatively short compared to others I had taken, it was crucial as it laid the foundation for the data import process, which is arguably the most important part of data analysis as analysis can't start without data.





  πŸ“–   Day 29 - Data Visualization in Tableau
πŸ—“οΈ  Date: 2023-04-03

Resources :

Course


Summary :

Today's course was focused on the importance of and need for data visualization in businesses, as well as the process of creating effective data visualizations using Tableau. In addition to this, the course emphasized the use of various visualization tools such as boxplots, waterfall/bridge charts, heat maps, and scatter plots to depict complex data analysis, using IMDb movie reviews as a case study. Despite being mentally exhausted, I was able to learn a great deal from today's class.


chart_viz_treemap
chart_viz_scatterplot_quadrant



  πŸ“–   Day 30 - (Continued) Data Visualization in Tableau
πŸ—“οΈ  Date: 2023-04-06

Resources :

Course


Summary :

Despite feeling exhausted from work and physically sick, I remained committed to learning and decided to focus on one manageable module. Today's course centered on the best practices for creating effective data visualizations, highlighting common mistakes such as using the wrong type of visualization, misleading design elements, and mishandling missing data. What really hit home for me, though, was the importance of designing dashboards thoughtfully. Instead of cramming all the charts onto one page, it's crucial to consider the design, formatting, and interactivity of the dashboard so that it stays clear, concise, and easy to use.


popular_genre_wordcloud



  πŸ“–   Day 31 - Data Fluency: Exploring and Describing Data
πŸ—“οΈ  Date: 2023-04-07

Resources :

Course


Summary :

Taking a break from Tableau, I jumped back into the LinkedIn course for becoming a data scientist. As expected, the course started with a focus on preparing data using in-house, open, and third-party data sources, all while adhering to data ethics and staying within the scope of the analysis. Next, the course covered sorting and filtering data to perform exploratory data analysis, and emphasized the usage of ratios to gain a better understanding of the overall picture. But what really caught my attention was when the course revisited the various charts that can be used to visualize data. The instructor highlighted the limitations of pie charts and warned against using them, as they can often mislead the reader. Instead, the course introduced me to some new concepts such as dot plots, sparklines, and data maps, and explained their unique use cases. I found it fascinating to learn about these techniques and how they can help convey complex data in a more meaningful way.





  πŸ“–   Day 32 - (Continued) Data Fluency: Exploring and Describing Data
πŸ—“οΈ  Date: 2023-04-08

Resources :

Course


Summary :

Today, I continued with the remaining modules from the last course and came across a section that was all about describing data. This involved looking at various methods to assess the center, variability, rescaling, and associations within a dataset. When it comes to finding the center of the data, data analysts usually refer to the mode (most common score), mean (average), and median (middle score / 50th percentile). On the other hand, variability, or how spread out the data is, can be measured using techniques such as the range (difference between maximum and minimum), quartiles (splitting the data into 4 equal parts), IQR (Q3 - Q1), variance, and standard deviation (average distance from the mean). In order to rescale data for better interpretation, comparison, and probability analysis, the course introduced the use of the z-score. The course concluded with a summary of basic and conditional probability, along with sampling variation.
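
For my own quick reference (not from the course material): the z-score used for rescaling is simply the number of standard deviations a value lies from the mean, z = (x - mean) / standard deviation, so a score of 80 with a mean of 70 and a standard deviation of 5 gives z = (80 - 70) / 5 = 2.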





  πŸ“–   Day 33 - (Continued) Data Visualization in Tableau
πŸ—“οΈ  Date: 2023-04-10

Resources :

Course


Summary :

Today, as I returned to Tableau, I delved into the world of digital cartography by learning how to create maps using geo-coordinates. Although it was a short module, it covered some essential topics, from converting a measure to a geo-coordinate to creating your own maps using x and y coordinates. However, the most exciting part of the module was the visualization of paths for a hypothetical museum, where the size of the line represented the number of visitors by hour. This technique provides a fascinating and intuitive way to represent data, and I look forward to using it in my future visualizations.


beautified_map
animating_hourly_traffic
animated_museum_paths



  πŸ“–   Day 34 - (Continued) Data Visualization in Tableau
πŸ—“οΈ  Date: 2023-04-11

Resources :

Course


Summary :

In today's course, I encountered advanced visualization techniques such as the waffle chart, DNA chart, sparklines, and the famous Sankey chart. It was fascinating to see how these visualizations could be used to communicate complex information in an intuitive way. The waffle chart, for example, allowed me to represent the percentage of a whole using a grid of squares, while the DNA chart provided a unique way to represent hierarchical data. Additionally, the sparkline chart enabled me to visualize trends over time using a small line chart within a single cell. Although the Sankey chart wasn't a part of the module, the remaining charts challenged me to find creative ways to create them within the limitations of Tableau.


waffle_chart
dna_chart
sparklines



  πŸ“–   Day 35 - Calculations in Tableau
πŸ—“οΈ  Date: 2023-04-12

Resources :

Course


Summary :

Today's course in Tableau was focused on performing calculations within the software to create new columns and rows using calculated fields. Throughout the lesson, I encountered various types of errors, both those detected by Tableau and those that resulted from the equations themselves, and the course provided exercises to sharpen my skills in finding and tackling them. Following this, the course delved into data granularity and how it increases with the number of dimensions exposed in Rows and Columns or in the Marks card, such as detail, color, shape, and so on. While "group by" is used to tackle data granularity in pandas and SQL, in Tableau, LOD expressions are used to compute such aggregations easily. There are three types of LOD expressions, INCLUDE, EXCLUDE, and FIXED, each controlling the level of detail at which a calculation is performed; FIXED in particular does not depend on the dimensions used in the visualization.


cohort_analysis



  πŸ“–   Day 36 - (Continued) Calculations in Tableau
πŸ—“οΈ  Date: 2023-04-16

Resources :

Course


Summary :

Today's module focused on table calculations, covering both inbuilt and custom expressions based on single and multiple dimensions. Various examples of table calculations were explored, including running average, running table, window sum, and more. Following this, the module also revisited parameters, which allow for custom interactions between the user and the worksheet.





  πŸ“–   Day 37 - (Continued) Calculations in Tableau
πŸ—“οΈ  Date: 2023-04-17

Resources :

Course


Summary :

Today's course module was centered on time series analysis, including the common calculations used in time series, along with a summary of how to use individual rows of data to create date and time. After covering these topics, I performed analysis on the data source to find patterns and growth by analyzing seasonality and year-to-year trends.


seasonality