Graphical Integrity

"The representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the quantities represented." -Edward Tufte

It is amazing how easy it is to find highly inaccurate and misleading data graphics and charts, even in 2019. These inaccuracies, and sometimes outright perversions of the truth, are of particular concern to an insta-culture that gets its news in headlines, memes, charts, and other bite-sized generalizations via social media, and rarely looks beyond the headlines for the evidence or behind the charts for the source data.

The “Lie Factor”, coined by American statistician Edward Tufte, is "a value to describe the relation between the size of effect shown in a graphic and the size of effect shown in the data." A larger Lie Factor indicates a greater degree of deception or inaccurate scaling/weighting.


Lie Factor in Action:

The numbers do not match the scale of the bars and money bags... not quite as "strong" as projected.




This example mixes 2 different scales and data sets and only serves to confuse the reader...




This is a propaganda data graphic displaying a series of 5 increases using a totally nonsensical scale



This graphic shows Last Year, Last Week, and Current Week as having the same temporal scale.... O'Lie Factor.




Lie Factor Breakdown:

The Lie Factor is the size of the change shown in the graphic (say 100%) divided by the size of the change reported in the data (say 50%): 100 / 50 = a Lie Factor of 2.
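
To make the arithmetic concrete, here is a tiny sketch (written in Java purely for illustration; the method name and values are made up) that computes a Lie Factor from the two effect sizes:

 public class LieFactor {

     // Lie Factor = (size of effect shown in the graphic) / (size of effect in the data)
     static double lieFactor(double effectShownInGraphic, double effectInData) {
         return effectShownInGraphic / effectInData;
     }

     public static void main(String[] args) {
         // A graphic exaggerates a 50% change so that it looks like a 100% change:
         System.out.println(lieFactor(100.0, 50.0)); // 2.0 -> the chart overstates the effect by 2x
     }
 }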


There are reasons for misleading graphics that go beyond propaganda and sensationalist news articles:

  • Lack of quantitative skills on the part of the graphic creator and publication editor
  • Doctrine that statistics are boring and therefore need to be "jazzed up"
  • Doctrine that graphics are only for the unsophisticated and so don't need "accuracy constraints"
  • Failure to treat graphics with the same fidelity to the truth as the written word they accompany

Other ways that graphical information displays are corrupted include cherry-picking data, making small changes appear large by using a narrow scale interval, and, when all else fails for the information manipulator, using fake data.

It is important not to jump to conclusions when assessing graphical information displays, even those coming from a reputable publisher. As you can see, it is not always obvious whether the information being communicated graphically is accurate. Wherever possible, get a look at the source data.

"When we see a chart or diagram, we generally interpret its appearance as a sincere desire on the part of the author to inform. In the face of this sincerity, the misuse of graphical material is a perversion of communication, equivalent to putting up a detour sign that leads to an abyss" - Wainer


References:

https://viz.wtf/

https://infovis-wiki.net/wiki/Lie_Factor


Google Maps API

The Google Maps API is a very powerful tool that is relatively easy to use if you have some JavaScript background. The screen below was created with the code that follows (Google API key obfuscated).


Basic Maps API example with hard-coded lat/long custom markers


Get an API key here: https://developers.google.com/maps/documentation/javascript/get-api-key

Maps API exposed operations reference: https://developers.google.com/maps/documentation/javascript/tutorial

 <!DOCTYPE html>  
 <html>  
  <head>  
   <title>Custom Markers</title>  
   <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=no">  
   <meta charset="utf-8">  
   <style>  
    /* Always set the map height explicitly to define the size of the div  
     * element that contains the map. */  
    #map {  
     height: 100%;  
    }  
    /* Optional: Makes the sample page fill the window. */  
    html, body {  
     height: 100%;  
     margin: 0;  
     padding: 0;  
    }  
   </style>  
  </head>  
  <body>  
  <div>  
  Debug/Inspect...  
  </div>  
   <div id="map"></div>  
   <script>  
    var map;  
    function initMap() {  
     map = new google.maps.Map(  
       document.getElementById('map'),  
       {center: new google.maps.LatLng(43.0589, -88.0988), zoom: 11, mapTypeId: 'hybrid'});  
     var icons = {  
      bucks: {  
       icon: 'bucks.png'  
      },  
      brewers: {  
       icon: 'brewers.png'  
      },  
      panthers: {  
       icon: 'panthers.jpg'  
      }  
     };  
     var features = [  
      {  
       position: new google.maps.LatLng(43.0280, -87.9712),  
       type: 'brewers'  
      }, {  
       position: new google.maps.LatLng(42.9930, -87.9210),  
       type: 'panthers'  
      }, {  
       position: new google.maps.LatLng(43.0000, -87.8379),  
       type: 'bucks'  
      }  
     ];  
     // Create markers.  
     for (var i = 0; i < features.length; i++) {  
      var marker = new google.maps.Marker({  
       position: features[i].position,  
       icon: icons[features[i].type].icon,  
       map: map  
      });  
     }  
           //Must have API with access to the Places API to get marker click details without much manual code  
    }  
   </script>  
   <script async defer src="https://maps.googleapis.com/maps/api/js?key=XXXXXXXXXXXXXXXXXXXX&callback=initMap">  
   </script>  
  </body>  
 </html>  

Value vs. Reference Type Memory Allocation

In the not-too-distant past I had a developer interview go comically off the rails when I (the interviewee) was posed this simple software development question:

"Describe the difference between a value type variable and a reference type variable"

This is an elementary Comp Sci 101 question that every developer should be able to understand and answer with unflinching confidence.

This is what value type (int a, left) and reference type (object a, right) memory allocation looks like graphically


The correct answer would point out that value type variables ('primitives' in Java) are stored directly in memory on the Stack, while reference type variables hold a pointer (reference) to another memory location on the Heap, which may in turn contain additional references (pointers) that compose the entire object.
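
To see the practical consequence of that difference, here is a minimal sketch in Java (the Box class and field names are invented for illustration): copying a value copies the data itself, while copying a reference copies only the pointer to the same heap object.

 public class ValueVsReference {

     // A simple reference type; instances of Box live on the heap.
     static class Box {
         int value;
     }

     public static void main(String[] args) {
         // Value (primitive) type: 'a' holds the number 5 itself.
         int a = 5;
         int b = a;      // copies the value
         b = 10;
         System.out.println(a); // still 5 - 'a' and 'b' are independent copies

         // Reference type: 'x' holds a reference (pointer) to a Box object on the heap.
         Box x = new Box();
         x.value = 5;
         Box y = x;      // copies the reference, NOT the object
         y.value = 10;
         System.out.println(x.value); // 10 - 'x' and 'y' point at the same heap object
     }
 }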

And (no, I swear, really) at one point I knew exactly what the difference was. However, with time, and as development languages and tools became more and more abstracted away from the underlying behavior of memory at application run time, my knowledge of this simple-yet-important concept went blank.

I stammered through a meandering non-answer about how value types contain "simple" data types known as primitives and how reference types refer to more complex objects. Suffice it to say that it was an embarrassing lack of clarity about how application memory works, and it likely torpedoed any chance of an offer.


On value and reference types, the Stack and the Heap, and Boxing/Unboxing (converting between value and reference types)
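
As a companion to the above, here is a minimal boxing/unboxing sketch in Java (the .NET mechanics differ in detail, but the idea of wrapping a value in a heap object and copying it back out is the same; the variable names are just illustrative):

 public class BoxingDemo {
     public static void main(String[] args) {
         int primitive = 42;          // value type, no heap object
         Integer boxed = primitive;   // boxing: the int is wrapped in an Integer object on the heap
         int unboxed = boxed;         // unboxing: the value is copied back out of the wrapper

         // Boxing has a cost: each boxed value can be a separate heap object,
         // so identity comparison of boxed values can be misleading.
         Integer x = 1000;
         Integer y = 1000;
         System.out.println(x == y);      // typically false: two distinct objects outside the small Integer cache
         System.out.println(x.equals(y)); // true: same underlying value
         System.out.println(unboxed);     // 42
     }
 }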


This is something we should all know as developers. I hope this helped refresh and/or clarify the concept of variable memory allocation for at least a few people out there.

ETL and EDI Using SSIS

ETL is the process by which you can take (Extract) data from various (usually related) data sources, Transform that data to meet your destination system's needs, and finally Load that transformed data into the destination system data store.

Your table structure will be something along the lines of this basic template:

In a real-world database environment the Staging, OLAP, OLTP and other data repositories may be on different database servers; everything is on the same server here for demonstration

We will use SQL Server Integration Services (SSIS) and develop the SSIS package within the Visual Studio 2017 IDE.


The first step of the SSIS package loads (INSERTs) the data into a STAGING area database. This allows us to:
  • Store off the intermediate data from all sources into analysis-friendly OLAP datastores
  • Perform data integrity checks
  • Keep extraction and transformation as two strictly separated steps

We load the data from the various source files (.csv, .xls, .xlsx) into SQL Server database Staging table(s) using SSIS Source and Destination Data Flow Tasks connected with the movable data flow arrows. Once you have connected a source and destination you can go into the Destination Data Flow Task and edit the mappings of which source columns should be written to which destination columns.

Next we perform some transformations. This can be anything from a simple ranking or status/flagging/business prioritization algorithm to data cleansing to data partitioning based on certain criteria; the key is that this Transform step is where we apply T-SQL UPDATEs to transform the data once it has all been aggregated in Staging.

Then we refresh the OLAP destination tables using the same kind of Source and Destination Data Flow Tasks and mappings as used for Staging. The OLAP data is used for data analysis.

Finally, we load the cleansed Staging data into our destination system's OLTP database and email or text message the system owner upon successful completion of the SSIS ETL job (or deliver an error if anything fails). The OLTP data stores live transactions.
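
SSIS wires these steps together graphically, but the underlying flow is easy to express in code. The following is a minimal, hypothetical sketch of the same Extract/Transform/Load sequence in plain Java/JDBC; the connection string, table names, and UPDATE rule are all invented for illustration, and this is not the SSIS package itself:

 import java.sql.Connection;
 import java.sql.DriverManager;
 import java.sql.Statement;

 public class EtlSketch {
     public static void main(String[] args) throws Exception {
         // All three stores are on the same server here, mirroring the demo setup.
         try (Connection conn = DriverManager.getConnection(
                 "jdbc:sqlserver://localhost;databaseName=EtlDemo;integratedSecurity=true");
              Statement stmt = conn.createStatement()) {

             // EXTRACT: pull source rows into the Staging table
             // (SSIS would do this with Source/Destination Data Flow Tasks and column mappings).
             stmt.executeUpdate(
                 "INSERT INTO Staging.Orders (OrderId, Amount, Region) " +
                 "SELECT OrderId, Amount, Region FROM SourceImport.Orders");

             // TRANSFORM: apply T-SQL UPDATEs against the aggregated Staging data,
             // e.g. a simple flagging/prioritization rule.
             stmt.executeUpdate(
                 "UPDATE Staging.Orders SET Priority = 'HIGH' WHERE Amount > 10000");

             // LOAD: refresh OLAP for analysis and push the cleansed rows into OLTP.
             stmt.executeUpdate(
                 "INSERT INTO Olap.OrdersFact (OrderId, Amount, Region, Priority) " +
                 "SELECT OrderId, Amount, Region, Priority FROM Staging.Orders");
             stmt.executeUpdate(
                 "INSERT INTO Oltp.Orders (OrderId, Amount, Region, Priority) " +
                 "SELECT OrderId, Amount, Region, Priority FROM Staging.Orders");
         }
         // In the real package, SSIS sends the success/failure notification email at this point.
     }
 }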

Bear in mind that most ETL data-flow step mappings are not a 1:1 match; this is just an end-to-end demo of SSIS ETL in its most basic form


Happy ETL'ing, and be sure to watch out for mysterious symbols/characters in data copied from other programs or system environments that used a language setting (codepage) incompatible with your ETL software. Bad data happens more often than you think, and as we say: GIGO (garbage in, garbage out).

Your end result looks like this (all green checkmarks indicate everything was successful). I recommend using PaperCut for SMTP testing - a super cool and useful product.

I would attach or GitHub the source code (and will do so upon request), but SSIS project code has a lot of dependencies and can get quite messy for someone else to re-use, even on just a 'slightly different' machine.

Having used SSIS and its now-deprecated predecessor DTS (SQL Server Data Transformation Services) for many years, I can attest that the best way to learn this product is to dive right in on your own and begin creating sources and destinations, source/destination connection managers, control flow events, .NET integrations, exception event handlers, and so on.

You will likely run into some ambiguous and poorly documented errors when developing in SSIS, but persist in your efforts and you can create a very powerful EDI system with the many capabilities of SSIS and the robust scheduled ETL jobs it can create.