Doing Data Analysis with SQL (Part Three End)
Anomaly Detection
An anomaly is something that differs markedly from other members of the same group. Anomalies have one of two sources: real events that are extreme, or errors introduced during data collection or processing. Understanding which of these two causes is at the root of an anomaly is important, because it determines how the anomaly should be handled.
Outliers
A data point that is very different from the rest seems easy to spot, but there are some challenges. The first is knowing whether a given value is common or rare, and the second is setting the threshold that divides ordinary values from anomalous ones.
Generally, the larger or more complete a data set is, the easier it is to judge what is anomalous.
There are several tools for detecting outliers. The first is to sort the values with ORDER BY, which can be combined with GROUP BY to find outliers by frequency. Another is to use SQL’s statistical functions to find extreme values. The last is to inspect the data visually with a graph.
Sorting
The most basic tool for finding outliers is sorting. The default behavior of ORDER BY is to sort ascending; to sort in descending order, add DESC after the column.
For example, we can sort volcanoes by size:
SELECT size
FROM volcanoes
ORDER BY 1 DESC
Another way to judge whether values are anomalous is to calculate their frequency. We can count the id field and GROUP BY size to find the number of volcanoes of each size. To express each count as a share of the whole, divide it by the total number of volcanoes, computed with the sum function as a window aggregate, which requires an OVER clause:
SELECT size
,count(id) as volcanoes
,round(count(id) * 100.0 / sum(count(id)) over (),8)
as pct_volcanoes
FROM volcanoes
GROUP BY 1
ORDER BY 1 DESC
Graphing to Find Anomalies
There are three graphs that are useful for anomaly detection: bar graphs, scatter plots, and box plots. Any BI tool or spreadsheet software, or programming language such as Python or R will be able to produce these graph types.
The bar graph can be used to plot a histogram of the values in a field and is useful for characterizing the data and spotting outliers. The full range of values is plotted along one axis, and the number of occurrences of each value is plotted along the other.
The scatter plot is also used to characterize data and spot outliers. A scatter plot is useful when the data set contains at least two numeric values of interest. The x-axis displays the range of values of the first data field, the y-axis displays the range of values of the second data field, and a dot is graphed for every pair of x and y values.
The box plot, also known as the box-and-whisker plot, is also useful for finding and analyzing outliers. The box plot summarizes the middle of the range of values while retaining the outliers. The line that forms the bottom of the rectangle is located at the 25th percentile value, the line at the top is at the 75th percentile, and the line in the middle is the 50th percentile, or median, value. Box plots can also be compared across groupings of the data to further identify where outliers occur. They are, however, more difficult to explain to people who don’t have a statistics background or who rarely look at visualizations.
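The percentile values behind a box plot can be computed directly in SQL. As a sketch, assuming the same volcanoes table as the earlier examples and a database that supports percentile_cont (such as PostgreSQL):

```sql
SELECT percentile_cont(0.25) within group (order by size) as pct_25
,percentile_cont(0.50) within group (order by size) as pct_50
,percentile_cont(0.75) within group (order by size) as pct_75
FROM volcanoes
```

These three values give the bottom, middle, and top lines of the box; the graphing tool adds the whiskers and marks the outlier points.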
Handling Anomalies
After detecting anomalies, the next step is to handle them.
Finding the cause of an anomaly is the first step in deciding what to do with it. This usually involves a series of queries that search for patterns and inspect specific examples. A true outlier is easy to spot. In cases like this, it’s best to query for the entire row that contains the outlier, along with any other attributes that are available, for clues as to the source. Next, check the records that share those attributes to see whether they also have unusual values.
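For example, assuming the volcanoes table from the earlier examples and treating sizes of 7 and above as extreme (a hypothetical threshold), we can pull the full rows for inspection:

```sql
SELECT *
FROM volcanoes
WHERE size >= 7
```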
After finding the source of anomalies, it’s best to get in touch with the owners of the data. Sometimes the anomaly is caused by a bug, but often there is an issue in the system itself that needs to be addressed, and the information we’ve gathered will be useful. For public data sets, you may not be able to find the root cause.
Removing
One option for dealing with anomalies is to remove them from the data set. Removal is a good option when the data set is large, since dropping a few records is unlikely to affect the conclusions.
Removing records with numbers is pretty easy using the WHERE clause:
SELECT size
FROM volcanoes
WHERE size not in (6,7,8,9)
Before removing records, we may want to determine whether including the outlier makes a difference to the output. For example, we may want to know whether the outlier affects the average, since the average is easily skewed by outliers. We can do this by calculating the average across the full data set, as well as the average excluding the extremely high values, using a CASE statement:
SELECT avg(size) as avg_size
,avg(case when size < 6 then size end) as avg_size_adjust
FROM volcanoes
Replacing
Anomalous values can be handled by replacing them with other values rather than removing the records entirely. The replacement can be a default value, a substitute value, the nearest numerical value within an acceptable range, or the average or median value.
Another option is to replace extreme values with the nearest high or low value that is not extreme. This maintains much of the range of values while preventing the misleading averages that extreme outliers can cause. Winsorization is a technique for this, in which outliers are set to a specific percentile of the data. For example, values above the 95th percentile are set to the 95th percentile value, while values below the 5th percentile are set to the 5th percentile value:
SELECT percentile_cont(0.95) within group (order by size)
as percentile_95
,percentile_cont(0.05) within group (order by size)
as percentile_05
FROM volcanoes
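The percentile query only reports the thresholds. A sketch of applying the Winsorization itself, by joining each row to those thresholds and clamping with a CASE statement (same volcanoes table, PostgreSQL syntax assumed):

```sql
SELECT a.size
,case when a.size > b.percentile_95 then b.percentile_95
      when a.size < b.percentile_05 then b.percentile_05
      else a.size
      end as size_winsorized
FROM volcanoes a
JOIN
(
    SELECT percentile_cont(0.95) within group (order by size) as percentile_95
    ,percentile_cont(0.05) within group (order by size) as percentile_05
    FROM volcanoes
) b on 1 = 1
```

The subquery is needed because percentile_cont is an ordered-set aggregate and cannot be used as a window function in PostgreSQL.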
Experimentation
Experimentation, also known as A/B testing, is the standard for establishing causality. All experiments begin with a hypothesis: a guess about the behavioral change that will result from some alteration. The change might be to a user interface, a new user onboarding flow, an algorithm, marketing, and so on. If the organization built it or has control over it, it can be experimented on. Hypotheses are often driven by other data analysis work.
The second element is a success metric. It should be easy to measure and sensitive enough that a change can be detected. Click-throughs are a good example of a success metric.
The third element is a system that randomly assigns entities to a control or variant group and alters their experience accordingly. To perform experiment analysis with SQL, the entity-level assignment data must flow into a table in the database that can be joined to behavioral data.
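The table names here are hypothetical, but the shape of such a query is a join between the assignment table and the behavioral data:

```sql
SELECT a.variant
,count(distinct a.entity_id) as total_entities
,count(distinct o.entity_id) as entities_with_action
FROM exp_assignment a
LEFT JOIN orders o on a.entity_id = o.entity_id
GROUP BY 1
```

The LEFT JOIN keeps entities that took no action at all, so the denominator reflects everyone who was exposed.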
Value Relationships
It’s easier to prove that two values are correlated than it is to prove that one causes the other. There are five ways in which X and Y can relate to each other:
X causes Y: This is what you’re trying to find (Y results from X)
Y causes X: The direction is reversed, but a relationship is still there
X and Y have a common cause: The values are related because a third variable explains both of them
A feedback loop exists between X and Y: When Y increases, X increases to compensate, which leads to Y increasing in turn, and so on
There is no relationship; the correlation is random: You may find metrics that are correlated even though there is no real relationship between X and Y
Types of Experiments
There are two types of experiments: those with binary outcomes and those with continuous outcomes.
Binary
A binary experiment has two possible outcomes: an action is taken or it isn’t. A student graduates or they don’t. A consumer clicks on an ad or they don’t. For experiments like these, we calculate the proportion of each variant that completes the action. The numerator is the number of completers, while the denominator is all units that were exposed.
To determine whether the rates are statistically different, we can use the chi-squared test which is a statistical test for categorical variables. Data for a chi-squared test is shown in the form of a contingency table, which shows the frequency of observations at the intersection of two attributes.
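As a sketch, assuming a hypothetical exp_summary table with one row per entity, its variant, and a completed flag of 1 or 0, the contingency-table counts and completion rate per variant can be computed as:

```sql
SELECT variant
,count(case when completed = 1 then entity_id end) as completed
,count(case when completed = 0 then entity_id end) as not_completed
,count(case when completed = 1 then entity_id end) * 100.0
 / count(entity_id) as pct_completed
FROM exp_summary
GROUP BY 1
```

The completed and not_completed columns are exactly the cells of the contingency table that a chi-squared test takes as input.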
Continuous
Continuous metrics can take on a range of values. For example, the amount spent by customers, the time spent on a page, and the days an app is used. E-commerce sites may want to experiment with product pages. Content sites might test layout, navigation, and headlines to increase the number of stories read.
For these, the goal is to figure out whether the average value in each variant differs from the others in a statistically significant way. The appropriate test is the two-sample t-test, which determines whether we can reject the null hypothesis that the averages are equal, at a defined confidence level. It has three inputs: the mean, the standard deviation, and the count of observations.
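All three inputs can be gathered in a single query. A sketch, assuming a hypothetical exp_spend table with one row per entity, its variant, and the amount it spent:

```sql
SELECT variant
,avg(amount) as mean_amount
,stddev(amount) as stddev_amount
,count(entity_id) as observations
FROM exp_spend
GROUP BY 1
```

These per-variant statistics can then be fed into a t-test calculator or a statistics library.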
Challenges
There are a number of ways experiments can go wrong. If the entire premise is flawed, there isn’t anything SQL can do to fix it. Running experiments requires time from engineers and designers to create variants. There are also opportunity costs: conversions that could have been gained by sending all customers down the already-optimal path.
Random assignment is one of the key elements of experimentation. However, sometimes errors happen, whether through a flaw in the experiment specification, a technical failure, or a limitation in the software. As a result, the control and variant groups may be unequal in size, assignment may not have been truly random, or fewer entities may have been cohorted than expected.
Outliers can pose another challenge. Averages are sensitive to extremely high or low values, so a single outlier can give one variant a significant edge over the others.
Time can also be a challenge. Experiments usually run over the course of several weeks, which means that individuals who enter the experiment early on have a longer window in which to complete actions associated with the success metric. To fix this, we can apply time boxing: imposing a fixed length of time relative to the experiment entry date and considering actions only during that window. The window can be as short as one hour when measuring an action that typically has an immediate response. For purchase conversion, a window of 1 to 7 days usually works well. The best window depends on the dynamics of the customers: if they typically convert within a few days, consider a 7-day window; if conversion takes up to 20 days, consider a 30-day window.
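Time boxing can be implemented by comparing the action timestamp to the entry timestamp. A sketch, assuming hypothetical exp_assignment (entity_id, variant, assigned_at) and orders (entity_id, event_time) tables and PostgreSQL interval syntax:

```sql
SELECT a.variant
,count(distinct a.entity_id) as total_entities
,count(distinct case when o.event_time <= a.assigned_at + interval '7 days'
       then o.entity_id end) as converted_7_days
FROM exp_assignment a
LEFT JOIN orders o on a.entity_id = o.entity_id
GROUP BY 1
```

Only actions within 7 days of each entity's own entry date count as conversions, so early and late entrants are measured on equal footing.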
[End]