A General Overview of Statistics Used in Data Science (Part One)
Statistics is the backbone of data science
I’m going to give you a general idea of statistics used in Data Science. This is a general overview (like a dictionary almost) and doesn’t go in-depth.
Exploring Data
Data can come from many sources, such as events, images, videos, and text. Much of this data is unstructured, so we have to convert the unstructured raw data to structured data. The most common form of structured data is a table with columns and rows.
Two types of structured data:
Numeric: data that are expressed in numbers.
Categorical: data that can take a specific set of values representing a set of categories.
You also need to know these more specific types:
Continuous: data that can take any value in an interval (float, numeric).
Discrete: data that can only take integer values (integer, count).
Binary: data that have two categories of values such as zeros and ones, or true and false (boolean, logical).
Ordinal: categorical data that has an explicit ordering (ordered factor).
Data types are important since they help determine the type of statistical model, the visual, and the analysis of data.
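To make these types a bit more concrete, here is a minimal sketch that guesses the type of a column from its values. The function name infer_kind and the sample values are my own assumptions for illustration, and the rules are a rough heuristic, not a standard algorithm. Ordinal data can't be guessed from the values alone, so the ordering is supplied by hand.

```python
def infer_kind(values):
    """Guess the data type of a column from its values (rough heuristic)."""
    distinct = set(values)
    # Binary: only booleans, or only the values 0 and 1.
    if all(isinstance(v, bool) for v in values) or distinct <= {0, 1}:
        return "binary"
    # Discrete: every value is an integer (counts, for example).
    if all(isinstance(v, int) for v in values):
        return "discrete"
    # Continuous: numeric values that are not all integers.
    if all(isinstance(v, (int, float)) for v in values):
        return "continuous"
    # Anything else is treated as a set of categories.
    return "categorical"


# Ordinal data needs its ordering stated explicitly (hypothetical example).
SIZE_ORDER = ["small", "medium", "large"]
sizes = ["medium", "small", "large"]
sorted_sizes = sorted(sizes, key=SIZE_ORDER.index)

print(infer_kind([1.5, 2.0, 3.25]))  # continuous
print(infer_kind([3, 7, 2]))         # discrete
print(infer_kind([True, False]))     # binary
print(infer_kind(["red", "blue"]))   # categorical
print(sorted_sizes)                  # ['small', 'medium', 'large']
```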
Rectangular Data
Rectangular data is like a database table. It contains rows and columns. Data usually doesn’t begin as a database table; it has to be processed and manipulated so that it can be represented in a data table.
Some terms for rectangular data:
Data frame: the basic data structure for statistical and machine learning models
Feature: a column within a table
Outcome: the value being predicted from the features, often a yes/no result
Record: a row within a table
The most basic data structure is rectangular data.
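As a small sketch of turning raw input into rectangular data, the example below parses a few made-up text lines into records and then pulls out a feature and an outcome. The raw lines and column names are hypothetical, and plain Python lists and dicts stand in for a real data frame.

```python
# Hypothetical raw events, as they might arrive from a log: not yet a table.
raw_events = [
    "alice,34,yes",
    "bob,27,no",
    "carol,45,yes",
]

# Column names: two features and one yes/no outcome.
columns = ["name", "age", "purchased"]

# Each record (row) becomes a dict keyed by column name.
records = []
for line in raw_events:
    name, age, purchased = line.split(",")
    records.append(dict(zip(columns, (name, int(age), purchased == "yes"))))

# A feature is a column: take one value from every record.
ages = [record["age"] for record in records]

# The outcome is the column we would try to predict.
outcome = [record["purchased"] for record in records]

print(records[0])  # {'name': 'alice', 'age': 34, 'purchased': True}
print(ages)        # [34, 27, 45]
print(outcome)     # [True, False, True]
```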
Nonrectangular Data
There are other data structures such as:
Time series data structures are used for forecasting.
Spatial data structures are used for location and mapping analytics.
Graph data structures are used to represent social, abstract, and physical relationships.
Estimates
A basic step in exploring data is to get an estimate of where most data is located.
Some terms:
Mean: Sum of all values divided by the number of values (average)
Weighted mean: Sum of each value multiplied by its weight, divided by the sum of the weights (weighted average)
Median: Value that is in the middle (50th percentile)
Percentile: Value such that P percent of the data lies below it (quantile)
Weighted median: Value such that one-half of the sum of the weights lies above and one-half lies below it in the sorted data
Trimmed mean: average of values after dropping extreme values (truncated mean)
Robust: not sensitive to extreme values (resistant)
Outlier: data value that is extreme or very different from other data (extreme value)
The basic metric is the mean, but it can be sensitive to outliers.
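Here is a minimal sketch of these estimates using Python's built-in statistics module. The numbers and weights are made up, and the helper functions weighted_mean, trimmed_mean, and weighted_median are my own illustrative names. The weighted median shown uses one simple convention (the first sorted value at which the running weight reaches half the total); other conventions exist.

```python
import statistics

values = [2, 3, 3, 4, 5, 6, 7, 50]    # 50 is an outlier
weights = [1, 1, 1, 1, 1, 1, 1, 0.1]  # hypothetical: down-weight the outlier


def weighted_mean(xs, ws):
    """Sum of value * weight, divided by the sum of the weights."""
    return sum(x * w for x, w in zip(xs, ws)) / sum(ws)


def trimmed_mean(xs, k):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(xs)
    return statistics.mean(ordered[k:len(ordered) - k])


def weighted_median(xs, ws):
    """First sorted value where the running weight reaches half the total."""
    half = sum(ws) / 2
    running = 0
    for x, w in sorted(zip(xs, ws)):
        running += w
        if running >= half:
            return x


mean = statistics.mean(values)      # pulled up to 10 by the outlier
median = statistics.median(values)  # 4.5, barely affected by the outlier

print(mean, median)
print(weighted_mean(values, weights))
print(trimmed_mean(values, 1))
print(weighted_median(values, weights))
```

The mean lands at 10 even though seven of the eight values are 7 or less, while the median, trimmed mean, and weighted estimates all stay close to where most of the data sits.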
Variability
Variability measures whether the data values are spread out or tightly clustered.
Some terms:
Deviations: Difference between the estimate of location and observed values (errors, residuals)
Variance: Sum of squared deviations from the mean divided by n - 1. n is the number of values (mean-squared-error)
Standard deviation: Square root of the variance
Mean absolute deviation: Mean of the absolute values of the deviations from the mean
Range: Difference between the smallest and largest value
Order statistics: Metrics based on the data values sorted from smallest to largest (ranks)
Interquartile range: Difference between the 75th percentile and the 25th percentile (IQR)
Standard deviation and variance are the most commonly reported statistics of variability. The issue is that both are sensitive to outliers.
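The sketch below computes these variability measures with Python's built-in statistics module on a made-up list of values. Quartiles can be calculated in several slightly different ways; I'm assuming the "inclusive" method of statistics.quantiles here, so other tools may report a slightly different IQR.

```python
import statistics

values = [2, 3, 3, 4, 5, 6, 7, 50]  # hypothetical data; 50 is an outlier

mean = statistics.mean(values)

# Deviations: observed value minus the estimate of location (the mean).
deviations = [x - mean for x in values]

# Sample variance: sum of squared deviations divided by n - 1.
variance = statistics.variance(values)

# Standard deviation: square root of the variance.
std_dev = statistics.stdev(values)

# Mean absolute deviation: mean of the absolute deviations.
mean_abs_dev = sum(abs(d) for d in deviations) / len(values)

# Range: largest value minus smallest value.
value_range = max(values) - min(values)

# Interquartile range: 75th percentile minus 25th percentile.
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

print(variance, std_dev, mean_abs_dev, value_range, iqr)
```

On this data the single outlier drives the variance, standard deviation, and range, while the IQR ignores it because it only looks at the middle half of the sorted values.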
Data Distribution
It is also important to find out how the data is distributed.
Some key terms:
Boxplot: Quick way to visualize the distribution of data (box-and-whiskers plot). It looks similar to the candlestick charts often used for stocks.
Frequency table: Tally of how many values fall into each of a set of intervals (bins)
Histogram: A plot of the frequency table. It has the bins on the x-axis and the frequency count on the y-axis.
Density plot: Smoothed version of the histogram, drawn as a continuous curve that looks like a hill
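To show how a frequency table and a histogram relate, here is a small sketch that bins some made-up values and prints a text histogram. The bin width of 10 is an arbitrary choice for illustration, and the text output is turned sideways compared with a usual histogram: bins run down the page and counts run across.

```python
from collections import Counter

values = [2, 3, 3, 4, 5, 6, 7, 50]  # hypothetical data
bin_width = 10                      # arbitrary bin width for this sketch

# Frequency table: map each value to the lower edge of its bin, then tally.
frequency = Counter((x // bin_width) * bin_width for x in values)

# Text histogram: one line per bin, one '#' per value in that bin.
# Empty bins are kept so that gaps in the data stay visible.
lines = []
for lower in range(0, max(values) + bin_width, bin_width):
    count = frequency.get(lower, 0)
    lines.append(f"{lower:>2}-{lower + bin_width - 1:<2} | {'#' * count}")

histogram = "\n".join(lines)
print(histogram)
```

Keeping the empty bins between 10 and 49 makes it obvious that the value 50 sits far away from the rest of the data.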
Categorical and Binary
Some terms:
Mode: Most commonly occurring value.
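Expected value: Long-run average value, found by weighting each outcome by its probability of occurrence. It adds the idea of future expectations.
Bar charts: Present the frequency or proportion of each category as rectangles. A bar chart resembles a histogram, but its x-axis shows categories rather than values on a numeric scale.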
Pie charts: Present categorical data in a pie shape with slices. An alternative to bar charts.
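Here is a minimal sketch of the mode and the expected value for categorical data. The plan names and revenue figures are invented for illustration, and the probabilities are simply estimated from how often each category appears in the sample.

```python
from collections import Counter

# Hypothetical categorical data: the plan chosen by each customer.
plans = ["free", "basic", "free", "premium", "free", "basic"]

# Tally each category; these counts are what a bar chart would display.
counts = Counter(plans)

# Mode: the most commonly occurring value.
mode = counts.most_common(1)[0][0]

# Hypothetical monthly revenue attached to each category.
revenue = {"free": 0, "basic": 10, "premium": 30}

# Expected value: each outcome's value times its probability, summed.
total = len(plans)
expected_revenue = sum(
    revenue[plan] * count / total for plan, count in counts.items()
)

print(mode)              # free
print(expected_revenue)  # about 8.33 per customer
```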
[End of Part One]