kpitsimpl: Statistics


Graphical Integrity

"The representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the quantities represented." -Edward Tufte

It is amazing how easy it is to find highly inaccurate and misleading data graphics and charts, even in 2019. These inaccuracies, and sometimes outright perversions of the truth, are of particular concern for an insta-culture that gets its news in headlines, memes, charts and other bite-sized generalizations via social media, and that rarely looks for the evidence beyond the headlines or the source data behind the charts.

The “Lie Factor”, a term coined by American statistician Edward Tufte, is "a value to describe the relation between the size of effect shown in a graphic and the size of effect shown in the data." A Lie Factor of 1 means the graphic represents the data faithfully; the further the value is from 1, the higher the level of deception or "inaccurate scaling/weighting".

Lie Factor in Action:

The numbers do not equate to the scale of the bars and money bags... not quite as "strong" as projected.

This example mixes 2 different scales and data sets and only serves to confuse the reader...

This is a propaganda data graphic displaying a series of 5 increases using a totally nonsensical scale.

This graphic shows Last Year, Last Week, and Current Week as having the same temporal scale.... O'Lie Factor.

Lie Factor Breakdown:

The Lie Factor is the change shown in the graphic (say, 100%) divided by the change reported in the data (say, 50%): 100 / 50 = a Lie Factor of 2.
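As a rough sketch, the calculation can be written in a few lines of R. The function names (pct_change, lie_factor) and the bar heights and data values are hypothetical, chosen only to reproduce the 100% versus 50% example:

```r
# Percent change from a first value to a second value
pct_change <- function(first, second) {
  (second - first) / abs(first) * 100
}

# Lie Factor = size of effect shown in the graphic / size of effect in the data
lie_factor <- function(graphic_first, graphic_second, data_first, data_second) {
  pct_change(graphic_first, graphic_second) / pct_change(data_first, data_second)
}

# Hypothetical chart: a bar grows from 2 cm to 4 cm (a 100% increase)
# while the underlying number grows from 200 to 300 (a 50% increase)
lf <- lie_factor(2, 4, 200, 300)
print(lf)  # 2
```

A result near 1 would suggest the graphic is scaled faithfully; here the graphic overstates the change by a factor of 2.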

There are reasons for misleading graphics that go beyond propaganda and sensationalist news articles:

Lack of quantitative skills on the part of the graphic creator and publication editor
Doctrine that statistics are boring and therefore need to be "jazzed up"
Doctrine that graphics are only for the unsophisticated and so don't need "accuracy constraints"
Failure to treat graphics with the same fidelity to the truth as the written word they accompany

Other ways that graphical information displays are corrupted include cherry-picking data, making small changes appear large by showing only a narrow interval of the scale and, when all else fails for information manipulators, using fake data.
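To see how a narrow scale interval inflates a small change, here is a minimal sketch in R. The two values and the axis baseline of 49 are made up for illustration, and the helper names are hypothetical:

```r
# Hypothetical data: a value moves from 50 to 52 (a 4% increase)
values <- c(50, 52)

# Drawn height of each bar when the axis starts at a given baseline
bar_heights <- function(values, baseline) values - baseline

# Percent change from the first element to the second
pct_change <- function(x) (x[2] - x[1]) / x[1] * 100

data_change      <- pct_change(values)                   # change in the data
honest_change    <- pct_change(bar_heights(values, 0))   # axis starts at 0
truncated_change <- pct_change(bar_heights(values, 49))  # axis starts at 49

lie_factor_honest    <- honest_change / data_change
lie_factor_truncated <- truncated_change / data_change

print(c(honest = lie_factor_honest, truncated = lie_factor_truncated))
```

With the axis starting at zero the bars grow by the same 4% as the data. With the axis starting at 49 the second bar is drawn three times as tall as the first, a 200% visual increase for a 4% change in the data.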

It is important not to jump to conclusions when assessing graphical information displays, even if they come from a reputable publisher. As you can see, it is not always obvious whether the information being communicated graphically is accurate. Wherever possible, get a look at the source data.

"When we see a chart or diagram, we generally interpret its appearance as a sincere desire on the part of the author to inform. In the face of this sincerity, the misuse of graphical material is a perversion of communication, equivalent to putting up a detour sign that leads to an abyss" - Wainer

References:

https://viz.wtf/

https://infovis-wiki.net/wiki/Lie_Factor

Quality Control

You should know at least the surface topics surrounding TQM (Total Quality Management), because many modern businesses practice TQM strategies and tactics to reduce costs and ensure top quality.

But first, check out this old video clip of America discovering something that, ironically, an American (W. Edwards Deming) had exported to Japan with great success years before:

1980 NBC News Report: "If Japan Can, Why Can't We?"

So big-Q "Quality" became a big hit and has been embedded in process management around the globe ever since.

I think he has a point here.

Here are some Quality buzzwords that you have surely heard before:

ASQ - American Society for Quality

"Black Belt" - Ooo. Ahh. It does mean something. It means a person has passed a series of very difficult exams on statistics and statistical process control for quality, based on the quantitative techniques and measures that W. Edwards Deming popularized in Japan.

ISO 9001 - the international standard for quality management systems, used to certify that business processes follow standard process and product guidelines.

Kaizen - a long-term approach to work that systematically seeks to achieve small, incremental changes in processes in order to improve efficiency and quality.

Kanban - a visual system for managing work as it moves through a process.

Lean - an approach to continuous improvement that seeks efficiency gains by eliminating waste from a process.

Example of statistical process control using UCL and LCL boundaries and a process (Fall Rate) improving.

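LCL* - Lower Control Limit - The lower threshold (commonly three standard deviations below the process mean) below which a process is considered statistically unstable. It is not necessarily a negative number.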

MAIC - Measure, Analyze, Improve, Control. Often extended to DMAIC, which adds a Define step at the start.

Service Level Agreements (SLA) - A contract between a service provider and end user that defines the expected level of service to the end user.

UCL* - Upper Control Limit - The upper threshold (commonly three standard deviations above the process mean) above which a process is considered statistically unstable.

Uptime - a measure of the time a service is working and available; the opposite of Downtime.

Six Sigma - a statistical approach to process improvement and quality control; sometimes defined as +/-3 standard deviations from the mean (a total spread of "6"), sometimes as +/-6 standard deviations from the mean.

The table above gives you an idea of realistic process improvement numbers (66,800 == a lot of defective items)
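Figures like the 66,800 above can be approximated in R. This is a sketch under an assumption the post does not state: the widely used Six Sigma convention of a one-sided normal tail with a 1.5 standard deviation long-term shift. The function name dpmo is hypothetical.

```r
# Defects per million opportunities (DPMO) for a given sigma level.
# ASSUMPTION: one-sided normal tail with the conventional 1.5-sigma shift.
dpmo <- function(sigma_level, shift = 1.5) {
  pnorm(sigma_level - shift, lower.tail = FALSE) * 1e6
}

sigma_levels <- 1:6
sigma_table <- data.frame(
  sigma = sigma_levels,
  dpmo  = round(dpmo(sigma_levels), 1)
)
print(sigma_table)
```

Under that convention a 3 sigma process comes out at roughly 66,807 defects per million and a 6 sigma process at roughly 3.4, which is why moving up even one sigma level is a large improvement.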

History and W. Edwards Deming
Quality Management is a permanent organizational approach to continuous process improvement. It was successfully applied by W. Edwards Deming in post-WWII Japan. Deming's work began in August 1950 at the Hakone Convention Center in Tokyo, when he delivered a speech on what he called "Statistical Product Quality Administration".

He is credited with helping hasten Japanese recovery after the war and then later helping American companies embrace TQM and realize significant efficiency and quality gains.

Deming's 14 Points for Total Quality Management

*Measures such as the standard deviation and other distribution-based statistics determine the LCL and UCL for a process (any process: temperature of a factory floor, time to assemble a component, download/upload speed, defects per million, etc.).
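As a simplified illustration of that footnote, the sketch below places the control limits three standard deviations either side of the mean. The measurements are invented, and real control charts often estimate the process spread differently (for example from moving ranges), so treat this as an approximation rather than a full SPC implementation.

```r
# Hypothetical weekly measurements of some process (e.g. a fall rate)
fall_rate <- c(3.1, 2.8, 3.4, 3.0, 2.9, 3.3, 3.2, 2.7, 3.0, 3.1)

center <- mean(fall_rate)  # center line
sigma  <- sd(fall_rate)    # sample standard deviation

ucl <- center + 3 * sigma  # Upper Control Limit
lcl <- center - 3 * sigma  # Lower Control Limit

# Points outside the limits would signal an unstable process
out_of_control <- fall_rate[fall_rate > ucl | fall_rate < lcl]

print(c(LCL = lcl, center = center, UCL = ucl))
print(length(out_of_control))
```

In practice the limits are usually computed from a baseline period and then applied to new measurements, rather than being recomputed from the same points they are judging.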

References:

http://asq.org/learn-about-quality/total-quality-management/overview/deming-points.html

https://www.quora.com/How-did-W-Edwards-Deming-influence-Japanese-manufacturing

Statistical Variance and Standard Deviation

Variance and standard deviation are measures of how spread out a distribution of values is. In other words, they are measures of variability within a population (all possible values) or within a sample (a subset of the population of sufficient size to warrant statistical analysis).

Standard deviation and variance, in conjunction with other statistical methods (ANOVA, F-test, T-test, etc.), are used in data analysis models to determine whether, within a sample or population, differences between values or relationships between two or more variables are statistically significant.

The important distinction: a standard deviation is expressed in the same units as the mean, whereas the variance is expressed in squared units.
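A short R sketch makes the distinction concrete. The data are hypothetical assembly times in minutes. Note that R's built-in var() and sd() use the sample (n - 1) formulas, so the population versions are computed by hand here.

```r
# Hypothetical assembly times, in minutes
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Population variance and standard deviation (divide by n)
pop_var <- mean((x - mean(x))^2)  # in minutes squared
pop_sd  <- sqrt(pop_var)          # back in minutes

# Sample variance and standard deviation (divide by n - 1)
samp_var <- var(x)
samp_sd  <- sd(x)

print(c(pop_var = pop_var, pop_sd = pop_sd))
print(c(samp_var = samp_var, samp_sd = samp_sd))
```

For these numbers the mean is 5 minutes, the population variance is 4 minutes squared and the population standard deviation is 2 minutes, so only the standard deviation can be read directly against the mean.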

Reference: https://stats.stackexchange.com/questions/35123/whats-the-difference-between-variance-and-standard-deviation

ANOVA

ANOVA, short for "Analysis of Variance", is a set of mathematical models that can be used to analyze the differences among group means (i.e., how the averages are dispersed, or "distributed", across groups).

You can easily calculate ANOVA in R as follows. When file.choose() prompts you, open a .csv dataset; in this case I used a file of crime statistics on the C:\ drive:

crime = read.csv(file.choose()) # select your .csv when prompted
# the formula is response ~ grouping variable; data= avoids the need for attach()
crime.aov = aov(district ~ crimedescr, data = crime)
summary(crime.aov) # ANOVA summary stats
plot(crime.aov) # diagnostic plots for the fitted model

For a deeper (and important) understanding of the numbers and formulas used to calculate an ANOVA summary, walk through one or all of the ANOVA reference links below.

It is relatively simple math: all you need is (1) the hypothesis or question and (2) the data. Then, using tools like R, Minitab, Mathematica, etc., you can analyze the test results, draw conclusions, and infer meaning from the data.
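To show how simple the math is, here is a sketch of a one-way ANOVA computed by hand and then checked against aov(). It uses PlantGrowth, a dataset that ships with R, instead of the crime file above, so the question being tested (does mean plant weight differ between treatment groups?) is only an illustration.

```r
# Built-in dataset: plant weight under a control and two treatments
data(PlantGrowth)
y <- PlantGrowth$weight
g <- PlantGrowth$group

grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)
group_sizes <- tapply(y, g, length)

# Between-group and within-group sums of squares
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within  <- sum((y - ave(y, g))^2)  # ave() gives each row its group mean

# Degrees of freedom
df_between <- nlevels(g) - 1
df_within  <- length(y) - nlevels(g)

# F statistic = between-group mean square / within-group mean square
f_stat  <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

# The same test using aov()
fit <- aov(weight ~ group, data = PlantGrowth)
f_from_aov <- summary(fit)[[1]][["F value"]][1]

print(c(by_hand = f_stat, from_aov = f_from_aov, p_value = p_value))
```

The two F values should agree, which confirms that the summary table is just these sums of squares and degrees of freedom arranged in rows.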

References:

http://www.graziano-raulin.com/tutorials/stat_comp/man1way.htm

http://www.mathandstatistics.com/learn-stats/hypothesis-testing/one-way-anova-by-hand

https://www.statmethods.net/stats/anova.html