Doing Data Analysis with SQL (Part Two)
Shaping Data
Shaping data refers to manipulating how data is represented in columns and rows. Each table has a shape, and the result set of each query has a shape.
An important concept in shaping data is figuring out the granularity of data you need. Another important concept is flattening data, which refers to reducing the number of rows that represent an entity. Joining multiple tables together is one way to flatten data.
How you shape your data depends on what you want to do with it afterward. When outputting data for dashboards and reports, it’s important to understand the use case: data sets may need to be detailed enough to enable exploration, yet small enough to keep loading and response times fast.
Pivot
A pivot table summarizes a data set by arranging the values of one attribute along the rows and another along the columns. At each intersection, a summary statistic such as sum, count, or avg is calculated. Pivot tables are a good way to summarize data for business audiences.
A pivot table can be created using a CASE statement along with one or more aggregation functions. To do this, we must flatten the data using GROUP BY,
SELECT customer_id,
SUM(order_amount) AS total
FROM orders
GROUP BY 1
To create the pivot, we will create columns for each of the values of an attribute,
SELECT order_date,
SUM(CASE WHEN product = 'jean' THEN order_amount ELSE 0 END) AS jean_amount,
SUM(CASE WHEN product = 'cap' THEN order_amount ELSE 0 END) AS cap_amount,
SUM(CASE WHEN product = 'tshirt' THEN order_amount ELSE 0 END) AS tshirt_amount
FROM orders
GROUP BY 1
With sum aggregation, ‘ELSE 0’ avoids nulls in the output. With count or count distinct, however, you should omit the ELSE clause: counting the 0 returned for nonmatching rows would inflate the results.
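For example, a count pivot over the same orders table omits the ELSE, so nonmatching rows return null and are not counted,

SELECT order_date,
COUNT(CASE WHEN product = 'jean' THEN customer_id END) AS jean_orders,
COUNT(CASE WHEN product = 'cap' THEN customer_id END) AS cap_orders
FROM orders
GROUP BY 1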
Unpivoting
Sometimes we may need to move data stored in columns into rows to create tidy data. This is called unpivoting.
UNION is a way to combine data sets from multiple queries into a single result set. There are two forms, UNION and UNION ALL. With either form, the number of columns and their data types in each component query must match,
SELECT date,
'1970' AS year,
year_1970 AS population
FROM country
UNION ALL
SELECT date,
'1980' AS year,
year_1980 AS population
FROM country
UNION ALL
SELECT date,
'1990' AS year,
year_1990 AS population
FROM country
The difference between UNION and UNION ALL is that UNION removes duplicate rows from the results, while UNION ALL retains all records, including duplicates.
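A minimal illustration of the difference,

SELECT 1 AS val
UNION
SELECT 1 AS val

returns a single row, while

SELECT 1 AS val
UNION ALL
SELECT 1 AS val

returns two rows. Because UNION ALL skips the duplicate check, it is typically faster and is the better choice when duplicates are impossible or acceptable.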
Time
A time series is a sequence of measurements recorded in time order, often at regularly spaced intervals. Examples include the temperature, the closing value of the S&P 500 stock index, or the number of daily steps recorded on your phone.
Forecasting is a common goal of time series analysis.
Time Zone
Time zones allow different parts of the world to keep similar clock times for daytime and nighttime, so that, roughly speaking, the sun is overhead at noon wherever you are in the world. The zones follow irregular boundaries, and while most are one hour apart, some are offset by only 30 or 45 minutes, so there are more than 30 time zones spanning the globe.
Many databases are set to Coordinated Universal Time (UTC), the global standard. UTC doesn’t observe daylight saving time, so it stays consistent all year long, which is useful for analysis.
All local time zones have a UTC offset. For example, the offset for PDT is UTC - 7 hours, while the offset for PST is UTC - 8 hours. Timestamps in databases are stored in the format YYYY-MM-DD hh:mi:ss.
We can convert a timestamp in UTC to PST,
SELECT '2022-04-14 00:00:00 -0' at time zone 'pst'
Some databases have a convert_timezone or convert_tz function that works similarly,
SELECT convert_timezone('pst','2022-04-14 00:00:00 -0')
Date and Timestamp
Returning the current date or time is a common analysis task. The current date and time are referred to as system time.
To return the current date, some databases have a current_date function,
SELECT current_date
There is a wider variety of functions to return the current date and time,
current_timestamp
localtimestamp
getdate()
now()
There are also functions that return only the time portion of the current system time,
current_time
localtime
timeofday()
SQL provides a few functions for returning just part of a date or timestamp. The date_part function takes a text value for the part to be returned,
SELECT date_part('day', current_timestamp);
SELECT date_part('month', current_timestamp);
SELECT date_part('hour', current_timestamp);
Another function that works similarly is extract, which returns a FLOAT value,
SELECT extract('day' FROM current_timestamp);
To return text values of the date parts, use to_char, which takes the input value and the output format as arguments,
SELECT to_char(current_timestamp, 'Day');
SELECT to_char(current_timestamp, 'Month');
A simple way to assemble a timestamp from separate date and time components is to add them together with a plus sign,
SELECT date '2022-04-14' + time '04:00:00' as timestamp;
A date can be assembled using the make_date, makedate, date_from_parts, or datefromparts function, depending on the database. In Postgres, make_date takes arguments for the year, month, and day parts and returns a value with a date format,
SELECT make_date(2022, 4, 14);
Date Math
SQL allows us to do various mathematical operations on dates. Date math involves two types of data: the dates themselves and intervals.
Let’s find the days elapsed between two dates. First, we can use the minus sign,
SELECT date('2022-04-14') - date('2022-03-14') AS days;
We can also find the interval of time between two dates with the age function,
SELECT age(date('2022-04-14'),date('2022-01-01'));
We can extract the months component of the resulting interval with date_part,
SELECT date_part('month', age('2022-04-14','2022-01-01')) AS months;
To do addition with dates,
SELECT date('2022-04-14') + interval '7 days' AS new_date
Some databases don’t require the interval syntax and instead convert the provided number to days,
SELECT date('2022-04-14') + 7 AS new_date
Time Math
This is similar to date math. We can add time intervals to times,
SELECT time '01:00' + interval '2 hours' AS new_time
We can subtract intervals,
SELECT time '01:00' - interval '2 hours' AS new_time
We can subtract times, resulting in an interval,
SELECT time '01:00' - time '12:00' AS time_diff
Times can be multiplied,
SELECT time '01:00' * 2 AS time_multiplied
Intervals can also be multiplied, resulting in an interval value,
SELECT interval '1 day' * 35 AS interval_multiplied
Cohort Analysis
A cohort is a group of individuals who share some characteristic of interest. They are often people, but can also be companies, products, or phenomena.
Cohort analysis is a useful way to compare groups of entities over time. It provides a framework for detecting correlations between characteristics which can lead to hypotheses.
Cohort analyses have three components: the group, a time series of data over which the cohort is observed, and an aggregate metric that measures an action done by cohort individuals.
There are four types of cohort analysis: retention, survivorship, returnship, and cumulative behavior:
Retention is concerned with whether the cohort member has a record in the time series on a particular date. This is useful for any kind of organization in which repeated actions are expected.
Survivorship is concerned with how many entities remained in the data set for a certain length of time, regardless of the number or frequency of actions up to that time. It’s useful for answering questions about the proportion of the population that can be expected to remain.
Returnship is concerned with whether an action has happened more than some minimum threshold of times during a fixed window of time. This is useful in situations in which behavior is unpredictable.
Cumulative is concerned with the total number measured at one or more fixed time windows.
Retention
To retain is to keep something. Businesses usually want their customers to keep purchasing their products or using their services since retaining customers is more profitable than acquiring new customers.
The key question in retention is whether the starting size of the cohort (the number of subscribers, customers, and so on) will remain constant, decay, or increase over time.
Retention analysis counts the entities (or sums the money) present in the data set for each period from the starting date, normalized by dividing by the count or sum in the first time period. The result is typically displayed as a table or a graph, which is referred to as a retention curve.
The first characteristic to pay attention to is the shape of the curve in the initial few periods. For many apps, losing half of the cohort in the first few months is common. A second characteristic to look for is whether the curve flattens after some number of periods.
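As a sketch, assuming hypothetical subscriptions (subscriber_id, start_date) and activity (subscriber_id, activity_date) tables, the count of subscribers retained in each monthly period is,

SELECT date_part('month', age(a.activity_date, s.start_date)) AS period,
COUNT(DISTINCT a.subscriber_id) AS retained
FROM subscriptions s
JOIN activity a ON a.subscriber_id = s.subscriber_id
GROUP BY 1
ORDER BY 1

This simplified form works for windows under one year, since age splits its result into years and months. Dividing each period’s retained count by the period-0 count normalizes the result into the retention curve.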
Survivorship
Survivorship is concerned with how long something lasts. It can help answer questions about the share of the population that is likely to remain past a certain amount of time. It can help identify which characteristics increase or decrease the survival likelihood.
This is similar to retention analysis, but we would calculate whether the entity is present in that period or later in the time series. Then the share of the total cohort is calculated.
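A sketch, assuming a hypothetical players table with first_date and last_date columns per player: the share of each yearly cohort still present one year or more after its first date is,

SELECT date_part('year', first_date) AS cohort_year,
COUNT(DISTINCT player_id) AS cohort_size,
COUNT(DISTINCT CASE WHEN age(last_date, first_date) >= interval '1 year'
THEN player_id END) * 1.0
/ COUNT(DISTINCT player_id) AS pct_survived_1_year
FROM players
GROUP BY 1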
Returnship
Returnship seeks to understand whether a cohort member can be expected to return within a given window of time and the intensity of activity during that window.
For example, an e-commerce site might want to know how many new customers were acquired because of a marketing campaign and also whether those buyers are repeat buyers. One way we can figure this out is to calculate total purchases per customer. However, comparing customers acquired three years ago to those acquired a month ago isn’t fair. The older cohort would appear more valuable than the new ones.
To make a fair comparison, we need to create an analysis based on a fixed window of time from the first date and consider whether cohort members returned within that window.
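A sketch, assuming a hypothetical customers table with a first_order_date column alongside the orders table: the share of each yearly cohort that placed another order within a fixed 90-day window is,

SELECT date_part('year', c.first_order_date) AS cohort_year,
COUNT(DISTINCT c.customer_id) AS cohort_size,
COUNT(DISTINCT CASE WHEN o.order_date > c.first_order_date
AND o.order_date <= c.first_order_date + interval '90 days'
THEN c.customer_id END) * 1.0
/ COUNT(DISTINCT c.customer_id) AS pct_returned_90_days
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
GROUP BY 1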
Cumulative
Cumulative analysis can be used to establish cumulative lifetime value. With cumulative calculations, we’re less concerned with whether an entity did an action on a particular date and more with the total accumulated as of that date. The calculations used are counts or sums.
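A sketch, again assuming hypothetical customers and orders tables: cumulative revenue per customer within each cohort’s first 365 days is,

SELECT date_part('year', c.first_order_date) AS cohort_year,
COUNT(DISTINCT c.customer_id) AS customers,
SUM(o.order_amount) AS total_amount,
SUM(o.order_amount) / COUNT(DISTINCT c.customer_id) AS amount_per_customer
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
AND o.order_date <= c.first_order_date + interval '365 days'
GROUP BY 1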
Text Analysis
Text in databases ranges from structured to semistructured or mostly unstructured. SQL has a number of useful functions that can accomplish a range of text-structuring tasks.
Text analysis is the process of deriving insight from text data. There are two broad categories of text analysis: qualitative and quantitative. Qualitative analysis seeks to understand the meaning of a single text or a set of texts, while quantitative analysis synthesizes information from text data and produces quantitative output. SQL is better suited to quantitative analysis.
Text analysis has several goals. The first is text extraction, where pieces of data are pulled from the text. Another is categorization, where information is extracted from text data in order to assign tags to rows. A third is sentiment analysis, where the goal is to understand the intent of the writer on a scale from positive to negative.
Text Characteristics
Text is typically stored in VARCHAR, the most flexible data type. The first step is to profile and characterize the data.
One way we can get to know the text data is to find the number of characters in each value, which is done with the length function (or len in some databases):
SELECT length('Sample');
We can create a distribution of field lengths to get a sense of the typical length and whether there are any extreme outliers:
SELECT length(reports), COUNT(*) AS records
FROM gov
GROUP BY 1
ORDER BY 1
The next step is to make this field more usable. The steps are:
Plan the field desired as output
Apply parsing functions
Apply transformations
Check results when applied to the entire data set
Repeat these steps until the data is in the desired format
Text Parsing
Parsing data is the process of extracting pieces of a text value to make them more useful. Parsing splits the data into two parts: the part we want and everything else.
The simplest parsing function is to return a fixed number of characters. The left function returns characters from the left side, while the right function returns characters from the right side:
SELECT left('data',3) as left_data,
right('data',4) as right_data
The left and right functions are useful for extracting fixed-length parts of a string, but for more complex patterns we would use split_part. The idea behind this function is to split a string into parts based on a delimiter and then return a specific part. A delimiter is one or more characters used to mark the boundary between sections of text. The form is:
split_part(string or field name, delimiter, index)
The index is the position of the text to be returned, relative to the delimiter.
SELECT split_part('example of a string', 'example', 1)
SELECT split_part('example of a string', 'example', 2)
Text Transformations
Transformations change string values in some way. The most common transformations involve capitalization. The upper function converts all letters to uppercase, while the lower function converts them to lowercase. Example:
SELECT upper('Sample')
SELECT lower('Sample')
Some databases, like Postgres, have an initcap function that capitalizes the first letter of each word in a string:
SELECT initcap('new york')
Another useful function is trim which removes blank spaces at the beginning and end of a string. Example:
SELECT trim(' New York ')
The last transformation is the replace function. Sometimes we want to change a string into another string or remove it entirely. The replace function takes three arguments: the original text, the string to find, and the string to substitute.
replace(string, string to find, string to substitute)
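For example, standardizing a hypothetical address string,

SELECT replace('123 Main Rd', 'Rd', 'Road')

returns '123 Main Road'. To remove a string entirely, substitute the empty string ('').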
[End of Part Two]