kpitsimpl: Data

Showing posts with label Data. Show all posts

Storytelling with data

Each of these rules will help guide you to the delivery of a compelling and effective visual story

U.S. Presidential Elections 1900-2024

A picture really can depict 1000s of words

From the Industrial Revolution to WWI to the Great Depression to the New Deal to WWII to the Red Scare to the Civil Rights Movement to Neoconservatism to Neoliberalism to 9/11 to the Great Recession to MAGA, these small multiples charts beautifully encapsulate U.S. presidential elections and the attendant American political eras that fueled them in these past 124 years.

Major U.S. Economic Events

1929-1932 Great Depression

The gridlines are distracting and unnecessary but this chart, and the informational callouts, explain the stock market crash that caused the Great Depression

Post-WWII Boom

The United States was the biggest gainer among the countries whose GDP per capita spiked after WWII

Oil Prices and Inflation

The U.S. experienced huge inflation spikes in the 1970s due to oil price "shocks" and again in 2022 due to Russian oil sanctions and COVID stimulus

1980s Savings and Loan Crisis

Well over 1000 banks failed as a result of the money market and junk-bond deregulation that led to the S&L crisis; this hasn't happened before or since

Millenium Dot-Com Bubble

Between 1998 and 2002 the NASDAQ rose from ~2000 to over 5000 only to come crashing back to just over ~1000 by the end of the bubble burst

2007-2009 Great Recession

Due to internationally linked financialization, the Great Recession impacted not just the U.S. economy, but economies across the world

2020 COVID and U.S. Unemployment

This graph illustrates the impact that COVID had on U.S. unemployment, which subsided once vaccines were developed, re-opening economies worldwide

Music Sales

Fascinating composition bar chart showing how music sales have shifted from vinyl to cassettes to CDs to mp3s to streaming

Whenever you present a story about data you will invariably be displaying some values (along a time series axis, or in static/isolation) within a categorization-based visual. The four primary chart types are Comparison, Relationship, Distribution and Composition. We will briefly pose the questions that illustrate what each of these chart types aims to answer.

Comparison: How much of each subgroup exists in relation to the other subgroups?

Relationship: How did a value change over time? (or change in relation to some other non-temporal metric like "how are various foreign currency exchange rates impacted by the movement of the U.S. federal funds (interest) rate?")

Distribution: What is the concentration of values within different percentiles if you chart the data along a linear scale?

Composition: What are the sizes of each of the constituent parts that comprise the whole of the thing you are trying to depict or explain?

Other key data visualization concepts to know and always be considering...:

Avoid Chartjunk: The goal should be to encode as much information as possible using near-exclusively "data ink", and as low a level of "non-data ink" as possible. Charts and graphs that contain excessive non-data ink (also called "chartjunk"), which is any chart content that does not communicate information relevant to your data visualization, are only going to confuse the consumer of your data visualization and hinder the expression of the message your visualization is meant to convey.

Chartjunk should be ruthlessly excised wherever it is found. Only include the data ink that serves the communication of your data visualization. All extra clutter will detract from and degrade your dataviz and your message- which is the entire purpose of graphs and charts.

Know Your Audience: It is important that you know your audience. A chart presented in a scientific or academic journal is often expected to contain values derived from complex ratios and formulas and labeling using esoteric language; a chart presented for general consumers in a newspaper or magazine is not. You should have an idea of what the baseline expectations are for the data you are presenting and ensure that you communicate your visualization in a way that is easy for your target audience to understand.

Data Integrity: Many charts have been used as propaganda and to otherwise mislead people. This is done by using outright fake data or trying to elicit specious insights from thin data sets that do not provide a complete picture of what a particular data point or set of data points means within the context of other data it is a part of.

References:

https://www.270towin.com/historical-presidential-elections

https://ar.inspiredpencil.com/pictures-2023/great-depression-stock-market-graph

https://www.researchgate.net/figure/Oil-prices-and-inflation-in-the-OECD-area-and-the-USA-from-1960-to-2021-C-te-Data-World_fig1_363709177

https://www.daytrading.com/savings-loan-crisis-1980s

https://ar.inspiredpencil.com/pictures-2023/economic-and-financial-crisis-2008

https://www.bbc.com/news/business-52137727

https://audioexchanges.com/2019/11/18/10-years-of-the-vinyl-renaissance

https://www.amazon.com/Show-Me-Numbers-Designing-Enlighten/dp/0970601972

https://www.amazon.com/Information-Dashboard-Design-At-Glance/dp/1938377001

Charts Suggestions - "A Chart Chooser" (edited)

Just because you can do things with "non-data ink"* does not mean you should do things with "non data ink". Below is a useful guide to charts for presenting different perspectives of data. I have crossed out the ones that are unanimously decried in the data visualization community for having confusing elements that do not effectively communicate the meaning of data.

Bar charts. Line charts. Dot or scatter plots. And bullet charts for really communicating a lot of information in a small space. That's really all you need.

The "Thought-Starter" above, edited to cross out (in red) all useless and confusing visuals to AVOID

Bullet charts are highly effective at displaying sub-ranges within a spectrum; the sole max line communicates threshold or "target" values

Focus on ensuring that every piece of ink in your chart is conveying some kind of useful information. If not, delete it. Blank space is better than distracting ink.

If you need an example of "distracting, non-data ink" then look no further than the following almost headache-inducing example:

I imagine the author of this chart was more interested in the art than the meaning of the data; this is a really bad data visualization

Another simple yet very powerful data visualization technique is to show the same type of chart repeated for contiguous time intervals or for different groups at the same point in time. An example is the following small multiples chart on alcohol consumption in different countries:

Small multiples charts really highlight the outliers (S. Korea?!!)

For more on the use of small multiples for effective data visualizations, I show more examples here: kpitsimpl: Small Multiples (are awesome) a while back. KEEP IT SIMPL.

*https://infovis-wiki.net/wiki/Data-Ink_Ratio#:~:text=Non%2DData%2DInk%20is%20accordingly,non%2DData%2DInk

extRSAuth for Custom SSRS Authentication (works w/newest SSRS version 16.0.8)

Fortunately, ExtRSAuth code doesn't require any updates to work with MSSQL SSRS 2022

ExtRSAuth for custom SSRS security

This assembly, forked from the Microsoft Custom Security Sample extends and improves custom authentication to allow for mechanisms other than user/pwd credential check and to offer a seamless pass-thru of the Login page if something present in the HttpRequest verifies that user is already authenticated. For instance, the user already has an app token from an app that communicates with the report server, and you require the communications with the report server to not involve any intermediate screen or login UI. The user just wants to auth as fast as possible and get to their report, right?

What does ExtRSAuth do to authenticate SSRS user connections?

For direct URL report server access, the default here is to allow local connections, which grants Admin rights for any local requests. If the SSRS request is external, a fallback option accepts an AES 128-bit encrypted querystring from the calling app, and the application, if decryption works, is authenticated and allowed to communicate using a read-only SSRS user; any exception thrown indicates the request is neither from a local connection nor a secure request from the external app.

ExtRSAuth gives SSRS environments the freedom from MS Active Directory that they deserve

To secure the built-in SSRS REST API v2.0 access, you can simply customize the LogonUser() and VerifyPassword() methods in AutneticationExtension.cs and AuthenticationUtilities.cs, respectively:

If you don't perform a security check here, the SSRS REST API will be open to anyone who knows the SSRS admin username

As you can see, this is but one of many approaches we can take with ExtRSAuth in an SSRS-connected application or business environment. Any type and granularity of custom authentication and level of authorization is possible. The only ingredient needed is a .NET developer or developers willing to customize a pretty basic .NET security model.

Real-world applications

This Custom Auth assembly has been tested with (1) several .NET Framework 4.8 and .NET 5, 6, 7 and 8 web and mobile applications, (2) with the SSRS API and all its operations, (3) with the SSRS /ReportServer and the /Reports management web interface as well as (4) Visual Studio 2022 Reporting Services projects (report designers can seamlessly deploy Report Server projects from VS to the Report Server with ExtRSAuth).

After running InitalizeExtRSAuth.ps1, a successful installation will output the above

Demonstration

This YouTube explainer video describes the SSRS external user authentication problem that ExtRSAuth addresses.

Requirements

This plug-in relies on SSRS (2016 or later), and a report server configuration as described in Microsoft's Reporting Services Custom Security Sample

-Replace [your_sym_encr_key] with your symmetric encryption key. Clients can encrypt SSRS URL access querystring with Sonrai.ExtRSAuth.Excryption.Encrypt() or a similar 128-bit AES encryption implementation, or modify Encrypt() with any encryption algorithm and key and block sizes.

Clone it, customize it further (or not) and get started today: https://github.com/sonrai-LLC/ExtRSAuth

EDI, RPC, SOAP, MQ, REST and Interoperability

All of these concepts help to address the same concern: how do we move data from System A to System B when these systems have no direct linkage (no common data store)? The following are a few of the technologies that have served as answers to this question.

There was a different kind of web back in the way day

EDI (Electronic Data Interchange): An exchange of data usually large in volume in comparison to other remote data transfer methods (batched records of 1000s vs. 1 record of JSON or 1 row of an RDBMS table), and usually done in conjunction with some kind of an ETL and/or Data Warehousing process. EDI is typically used for large, domain-specific transactions and the data transfer itself is performed over SFTP or another secure file transfer protocol and utilizes XSLT for data formatting. EDI files must adhere to strict ISO formatting specifications. This is helpful (and coincidentally adds a layer of complexity for hackers) when trying to ensure that a large number of disparate parties reporting data are all sending data in the right format as, if an EDI file's data format is wrong in any way, it won't be accepted at the destination.

This is an example of an EDI "EDIFACT" formatted file

RPC (Remote Procedure Call): Highly-coupled abstraction (if you can call it abstraction- it's really more of a video game accessory that only works on certain consoles) that essentially requires the client and server to be running the same program which, while once upon a time was feasible (and in some cases may be desirable for channel security), is not typically the ideal way to communicate openly. However, for closed, secure communications, RPC is still very much a part of the many technologies that facilitate secure messaging in applications like Telegram, Signal and the like.

As stated, RPC implies client-server sharing code (see "RPC thread" spanning above)

SOAP (Simple Object Access Protocol): This had been the standard for web services (indeed it is why Microsoft created WCF) until HTTP-based/RESTful APIs replaced them as the standard choice among developers of newer projects around the early 2010's. It is self-describing (.wsdl) and allows for communication over virtually any point-to-point communications protocol. SOAP is however quite prescriptive in the way it dictates how SOAP message "objects" are defined, leading to a lot of (interface) metadata inside the envelope that may have little to do with the task at hand, but which is needed so that the client can understand the message and deserialize the object if necessary.

An example of a faulting SOAP call's SOAP response

MQ (Message Queuing): the primary concept of utilizing message queues and exchanges is the asynchronous nature in which the messages are pulled and pushed vs. a REST or SOAP service call which are request/response synchronous by design.

This architectural data model also supports highly decoupled design whereby many applications, all written in different languages and under disparate frameworks can utilize the same MQ Exchange and share communication across queues.

Frameworks like RabbitMQ facilitate event sourcing design with queues; an app is often both a Producer and Consumer

REST (REpresentational State Transfer) APIs: Operating completely (and solely) over HTTP(S) and via (primarily) GET/POST actions which have already undergone some 35 years of incremental improvement, for as long as the web lives on, REST APIs will be at its foundation. They aren't self-describing though descriptive metadata can be embedded in the naming of the API resources to achieve similar reflection. Additionally, there are usually descriptive, interactive specifications for large publicly hosted APIs like the ones from Google Maps and Twitter. RESTful APIs are not highly prescriptive in the API structure/operations. It just has to be an HTTP action method that any HTTP client would understand. Most APIs default to passing JSON around when objects are involved in POST arguments or GET return values; but there is no reason you cannot return XML. Or a file. Or a streaming video. Or whatever floats your software ship. People create RESTful API wrappers for SOAP services all the time.

Just leaving the REST for last.. 😉

Although many in the software development community prefer the use of RESTful APIs, Message Queuing or some combination of the 2 for new projects, we must be mindful of the fairly recent past which has littered the landscape with SOAP, EDI, ETL and an assortment of proprietary and highly customized RPC (still active) communication channels (for example SOAP streaming over UDP).

There was a time before the web as we know it today when machines like ATMs and TicketMaster were still interconnected just as ever. However, these connections were not regular TCP HTTP packets traveling to and from port 80 or 443 but rather fixed length TCP frames of an earlier file transfer protocol. And many of those ATM and TicketMaster connections still exist, even if upgraded for modern times via something like WCF (.NET) or JAX-WS (Java).

There are certain things only SOAP can do. There are certain legacy systems which will not be updated any time soon (because "if it ain't broke") that still need to interface with SOAP clients. As technologists we have to deal with this and understand the tradeoffs of using different frameworks for different jobs.

In the same way that there is no perfect language for every scenario, no one way of electronically transferring data and interacting with remote systems is always the "best way" (although REST APIs come pretty close as so much our connected world is now HTTP-based).

The best choice for sending remote communications just like any choice of framework, language or design paradigm is never fixed. The answer requires careful, domain-centric, thorough analysis of the problem and the resources available to resolve that problem. In software development, the answer to "which way is the best way?" is invariably- "it depends".

Reference: https://researchportal.port.ac.uk/portal/files/681058/ITIT_13_1035_1.pdf

"Next Big" Software Religiosity and The Go-nowhere Rush

There is far too much religious extremism in information technology these days. And there have always been camps (extreme anti-Microsoft sentiment or its sad corporate counterpart: disdain, fear and suspicion of all things open-source)- but these days it has gotten to the point where sensible, cheap, reliable, proven solutions that everyone on the team understands- are thrown out in favor of chasing the next big thing that some bigshot at some big conference declared was going to be the next, next, next "big thing".

This image does have its merits..

Amid all the continuous rush to be cutting edge despite understanding what that edge can do for you and having a strong data foundation to build upon with that new cutting edge thing- it doesn't matter what tools are out. You are still stuck with ideas and not programs.

Design and develop with what works for your particular team and project and within the context of the environments of your stakeholders (if all but 2% of your customers use Android then the iPhone version of your app may not be as important as you think). Above all else, make sure you understand the domain knowledge behind the data your application will be persisting and passing around. That (the data understanding) is the heart of every program that stores, processes, transmits or even simply reads/prints/paints- any kind of communication.

Data sense-making and software development is hard work. And it's not done in a void. I suggest reading Stephen Few's "Big Data, Big Dupe" which is a little paperback containing 90 some pages of important wisdom for this modern rapid-fire information age that pre-empts knowledge of data in favor of slogans and metrics about data.

In short, the essence of this book is that if you have say 10TB of crap data that is always causing ETL failures that your personnel spend countless hours trying to correct... you may indeed have "big data" per some misguided tech journalist's definition... but you still have crap data-- understand your data before you try understanding how best to fit it inside of the newest shiny box.

Take also for example message queues and their usage in modern web application development. There seems to be a lot of misunderstanding about what MQ is and even some who claim this is a new technology (MSMQ has been around since Windows '95; IBM MQ has been in use since 1993). Basic email has operated on a publisher/subscriber (ICMP or SMTP) messaging queue paradigm that works in much the same way as modern MQ implementations (minus some bells and whistles)- since the early 70's.

These things aren't as complicated as they seem but they are complicated. And it's perilous to keep jumping from new trick to new trick whilst ignoring foundational, timeless software principles.

I would go so far as to say it is injurious to current and future generations of software developers to keep focusing on buzzwords, zooming out and away from the hard-but-necessary work of understanding the data, and then wondering why the tool or framework flavor of the year did not save the day.

Small Multiples (are awesome)

To keep it short and sweet let's go with the definition:

"A small multiple (sometimes called trellis chart, lattice chart, grid chart, or panel chart) is a series of similar graphs or charts using the same scale and axes, allowing them to be easily compared. It uses multiple views to show different partitions of a dataset."

Read any serious visual communication guide and it will invariably highlight this powerful tool we have at our disposal when we have the data (we almost always have the data).

A pair of Small Multiples example quite pertinent to the current times followed by some other good ones:

This CNN.com graphic captures a running snapshot of the "new case/spread" curve trajectory of individual states

This clearly communicates how each state unemployment picture fared from 1976-2009

This SM visual shows population change over time by country (look at Mexico's growth since 1960)

Reference: https://www.edwardtufte.com/tufte/

Dapper for .NET Data Access

Jeff Atwood described the phrase coined by Ted Neward that "Object-Relational Mapping is the Vietnam of Computer Science".

I agree with everything except Atwood's (huh? 😨- keep in mind this is 2006) conclusion that we should do one or the other: objects or relational data records. Develop apps as a series of SQL data access statements assigning values to arbitrary pieces of monolith application code.. or exclusively object-oriented with everything saved to blob storage... Or something awkward like that.

That, he says (in the 2006 article*), removes the O (object) - R (relational data) mapping problem entirely. It sure does; but how can we develop apps like that?

(fast-forward 6yrs, and.... Dapper to the rescue!)

Dapper is an awesome (IMO) alternative that allows developers to retain SOLID reuse and extensibility in their .NET data access code while still accessing complex relational data- and fast.

Dapper has the best of both worlds in terms of what you look for in a data access framework - speed and clean, easy SQL-to-typed object mapping facilitation

I highly recommend the brief peruse; it is a very interesting article. It essentially describes the pitfalls that ADO.NET, Hibernate and Entity SQL (EF for MSSQL) and so many of the other approaches to modeling relational data as .NET objects that have, if not failed completely- severely been lacking especially in terms of speed and control over the actual SQL that you instruct the SQL engine to execute.

Dapper aims to bridge the eternal gap between application and relational database code in a pretty elegant way for .NET development. So long as your database records (whether from a complex JOIN'd SP or wherever in your db)- can return data with types and field/alias names that match your "query-return-target-type" class' properties' names and data types, you are set for all the kinds of data access you like and are off and running without all the headaches normally associated with ORMs (magic config strings, mappings in separate files out of sync with class or db changes, etc.).

"there is no good solution to the object/relational mapping problem. There are solutions, sure, but they all involve serious, painful tradeoffs. And the worst part is that you can't usually see the consequences of these tradeoffs until much later in the development cycle." -Jeff Atwood on ORMs

I guess you could say that the SQL itself in the queries you tell Dapper to issue to MSSQL are "magic strings" insofar as VS doesn't compile them.. But if you don't use SSMS to parse and execute tests of your queries before using them in application code then you aren't really doing real data development- you are just shooting in the dark.

You should have unit tests for this very purpose. Unit tests of your Dapper calls will catch any db changes in the tests ("hey why did nobody tell me about this schema change in the Archives table?"); regardless- if your SQL field names don't match the class prop names of the object you are trying to "Dapperize"- you will find out at run-time. The exception messages are very "straight to the point of exactly what is off".

Dapper works the same in all versions of .NET; it is currently based on .NET Standard for that very reason, but you will need to bring in more dependency depending on what type of data source you are trying to access (SQL Server, MySQL, Oracle, DB2, Terradata, etc.).

Consider giving Dapper a try - it is very useful and illuminating, and it really shines in the very areas where EF falls short.

Dapper accessing 'UserReport' records from SQL db and returning the dynamic, typed object:

     SqlConnection db = new SqlConnection(WebConfigurationManager.AppSettings["DefaultSQLConnection"]);  
     public List<UserReport> ReadAllSavedUserReports()  
     {  
       using (db)  
       {  
         return db.Query<Report>("SELECT * FROM CLARO.dbo.UserReport").ToList();  
       }  
     }  
     public UserReport FindSavedUserReport(int id)  
     {  
       using (db)  
       {  
         return db.Query<Report>("SELECT * FROM CLARO.dbo.UserReport WHERE Id = @Id", new { id }).SingleOrDefault();  
       }  
     }

Forgive the "SELECT *.... this is just a demonstration..

These methods can then easily be called in controller or other code like so:

     public ViewResult Index()  
     {  
       string nowTime = DateTime.Now.ToShortDateString();  
       ReportDAL dal = new ReportDAL();  
       Demo model = BuildModel(BuildSQLStatement(nowTime, ReportDrafts.BaseballDemo), nowTime);  
       model.Reports = dal.ReadAllSavedUserReports();  
       return View(model);  
     }

Dapper is not a company trying to sell anything- it is just a really useful micro-ORM for those who prefer to work more hands-on with the SQL in data access code (and like to be able to more granularily control optimization for speedier queries).

*Atwood helped contribute (with SO) to the development of Dapper, so... I think he and that team kinda nailed the removal and easing of the very same limitations he bemoaned in the article I reference at the beginning: https://stackoverflow.blog/2012/02/18/stack-exchange-open-source-projects/

References:

https://elanderson.net/2019/02/asp-net-core-with-dapper/

https://dapper-tutorial.net/

.NET XML Serialization with System.Xml.Serialization

From SOAP, to app configuration, model templating, EDI/ETL, and beyond, XML is a mainstay in software and will continue to be (see- HTML)

Source Code:

 using System;  
 using Microsoft.VisualStudio.TestTools.UnitTesting;  
 using System.Xml.Serialization;  
 using System.IO;  
 using System.Xml.Linq;  
 using System.Xml;  
 namespace DemoTests  
 {  
   [TestClass]  
   public class TextXmlSerialization  
   {  
     public const string xml =  
       "<XMLEntityCollection>" +  
         "<XMLEntities>" +  
           "<XMLEntity>" +   
             "<ID>1</ID>" +   
             "<Name>TestName1</Name>" +   
           "</XMLEntity>" +  
           "<XMLEntity>" +  
             "<ID>2</ID>" +  
             "<Name>TestName2</Name>" +  
           "</XMLEntity>" +  
         "</XMLEntities>" +  
       "</XMLEntityCollection>";  
     [TestMethod]  
     public void TestDeserialize()  
     {  
       XmlSerializer serializer = new XmlSerializer(typeof(XMLEntityCollection));  
       using (StringReader reader = new StringReader(xml))  
       {  
         XMLEntityCollection ents = (XMLEntityCollection)serializer.Deserialize(reader);  
         TestSerialize(ents);  
       }  
     }  
     [TestMethod]  
     public void TestSerialize(XMLEntityCollection ents)  
     {  
       XmlSerializer xmlSerializer = new XmlSerializer(ents.GetType());  
       using (StringWriter stringWriter = new StringWriter())  
       {  
         xmlSerializer.Serialize(stringWriter, ents);  
         Assert.IsNotNull(stringWriter);  
       }  
     }  
   }  
   [Serializable()]  
   public class XMLEntity  
   {  
     [XmlElement("ID")]  
     public string Id { get; set; }  
     [XmlElement("Name")]  
     public string Name { get; set; }  
   }  
   [Serializable()]  
   [XmlRoot("XMLEntityCollection")]  
   public class XMLEntityCollection  
   {  
     [XmlArray("XMLEntities")]  
     [XmlArrayItem("XMLEntity", typeof(XMLEntity))]  
     public XMLEntity[] Ents { get; set; }  
   }  
 }

.NET XML Deserialization with System.Xml.Serialization:

Instantiated (ents) object containing XML from the deserialized XML string

.NET XML Serialization with System.Xml.Serialization:

XML genertaed by the stringWriter when serializing ents object into XML using the NET XMLSerializer type

Reference: https://stackoverflow.com/questions/364253/how-to-deserialize-xml-document