Showing posts with label ETL. Show all posts
Showing posts with label ETL. Show all posts

ETL and EDI Using SSIS

ETL is the process by which you can take (Extract) data from various (usually related) data sources, Transform that data to meet your destination system's needs, and finally Load that transformed data into the destination system data store.

Your table structure will be something along the lines of this basic template:

In a real-world db environment Staging, OLAP, OLTP and other data repos may be on different database servers, this is same db server for demonstration

We will use SQL Server Integration Services (SSIS) and develop the SSIS package within the Visual Studio 2017 IDE.


The first step of the SSIS package load (INSERT) the data into a STAGING area database. This allows us to:
  • Store off the intermediate data from all sources into analysis-friendly OLAP datastores
  • Perform data integrity checks
  • Keep extraction and transformation as two strictly separated steps

We load the data from the various source files (.csv, .xls, .xlsx) into SQL Server database Staging table(s) using SSIS Source and Destination Data Flow Tasks connected with the movable data flow arrows. Once you have connected a source and destination you can go into the Destination Data Flow Task and edit the mappings of which source columns should be written to which destination columns.

Next we perform some transformations. This can be anything from a simple ranking or status/flagging/business prioritization algorithm to data cleansing to data partitioning based on certain criteria; the key is that this Transform step is where we apply T-SQL UPDATEs to transform the data once it has all been aggregated in Staging.

Then we refresh the OLAP destination tables using the same kind of Source and Destination Data Flow Tasks and mappings as used for Staging. The OLAP data is used for data analysis.

Finally, we load the cleansed Staging data into our destination system's OLTP database and email or text message the system owner upon successful completion of the SSIS ETL job (or deliver an error if anything fails). The OLTP data stores live transactions.

Bear in mind that most ETL data-flow step mappings are not a 1:1 match; this is just an e2e demo of SSIS ETL in most basic form


Happy ETL'ing, and be sure to watch out for cases of mysterious symbols/characters from miscellaneous data copied from other programs or from other system environments that were using a Language setting (codepage) which is incompatible with your ETL software. Bad data happens more than you think and as we say, GIGO.

Your end result looks like this (all Green Checkmarks indicates all was successful; I recommend using PaperCut for SMTP testing- super cool and useful product

I would attach or GitHub the source code (and will do so upon request) but SSIS project code has a lot of dependencies and can get quite messy for another to re-use project on even just a 'slightly different' machine.

Having used SSIS' now-deprecated predecessor "DTS" (SQL Server Data Transformation Services) and SSIS for many years I can attest to the fact that the best way to learn this product is by diving right in on your own and begin the creation of sources and destinations and source/destination connection managers and control flow events, and .NET integration, and exception event handlers, etc.

You will likely run into some ambiguous and not well-documented errors when developing in SSIS; but persist in your efforts and you can create a very powerful EDI system with the many capabilities of SSIS and the robust Scheduled ETL jobs that it can create.

T-SQL and .NET for XSLT

Steps to perform transformation from SQL to XML to XSLT Output data (csv, htm, tab-delim, fixed-width, etc.):

1). Get your SQL data (assuming you are using SQL Server) into XML format via:

 SELECT [OrderId], [PersonId], [OrderDateTime] FROM [Order] FOR XML AUTO, ELEMENTS, ROOT ('Orders');  


XML Result of "FOR XML AUTO, ELEMENTS, ROOT"..

2). Load XSL transform file (OrderXform.xsl) into a new XslCompiledTransform object (transform):

       XslCompiledTransform transform = new XslCompiledTransform();  
       transform.Load("C:\\Xform\\OrderXform.xsl");  


XSL file used to transform the XML to the desired file output (EDI 834 standard, etc)


3). Using a StringWriter (writer), StringReader (inputReader) and XmlReader (inputXmlReader), read the inputReader XML string into inputXmlReader and apply the XSL transformation to the transform object via: transform.Transform(inputXmlReader, args, writer); //(or another override of Transform())

  using (StringWriter writer = new StringWriter())   
     {   
      // Make an XML reader (inputReader) out of the string (input)  
      XmlReader inputXmlReader;   
      using (var inputReader = new StringReader(input))   
      {   
       inputXmlReader = XmlReader.Create(inputReader);   
       // Apply the transformation to XmlReader (inputXmlReader) and write it to StringWriter (writer)   
       transform.Transform(inputXmlReader, null, writer);   
      }      
      // Retrieve the output as string from writer object  
      return writer.GetStringBuilder().ToString();   
     }   

4). Return the result and write the bits (to memory as Label Text for a demo like this, to file, to domain network location, to non-domain customer ftp, etc.).

The result normally gets written to remote SFTP location or posted to API endpoint for processing

Nothing incredibly complex about this, but it seems to frustrate a lot of people in .NET land so I have documented a quick and easy example.

GitHub Source: https://github.com/Radagast27/Simpl_XSLT_dot_net

Tool for XSLT validation: https://www.freeformatter.com/xsl-transformer.html

References:

https://www.red-gate.com/simple-talk/sql/learn-sql-server/using-the-for-xml-clause-to-return-query-results-as-xml/

https://www.w3schools.com/xml/xsl_examples.asp

https://www.c-sharpcorner.com/article/transform-xml-output-of-sql-query-using-for-xml-auto-statement-to-html-using-xsl/

Invisible characters from pasted data



If you find yourself debugging an ETL task error where the problem at hand is data type mismatch or some other data error that is not visible to you in the source (culprit) data, don't fret.

What has likely happened is invisible formatting characters got into the data.

To remedy things, all you'll need to do is open up Notepad++, paste your data into the text window, and from the View menu dropdown, select "Show Symbol" >> "Show All Characters".

Simply edit the data in Windows Notepad (remove the special characters), save, and try to import again. Your data should be sufficiently scrubbed clean and your ETL task will return to normal.


There is also a shortcut button in Notepad++:


PS: handy tip for Windows users- to remove all formatting from copied data, paste the text into Notepad. All formatting will be removed from the data. Re-select it and go paste where needed.

PureText is another application that handles removing all formatting from text: https://stevemiller.net/puretext/