ETL and EDI Using SSIS

ETL is the process by which you take (Extract) data from various (usually related) data sources, Transform that data to meet your destination system's needs, and finally Load the transformed data into the destination system's data store.

Your table structure will be something along the lines of this basic template. (In a real-world environment, the Staging, OLAP, OLTP, and other data repositories may live on different database servers; everything shares one server here for demonstration.)
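A minimal T-SQL sketch of that layout (the table and column names here are hypothetical, imagining a simple Orders feed):

   CREATE TABLE Staging.dbo.OrderStaging (
       OrderId       INT,
       PersonId      INT,
       OrderDateTime DATETIME,
       OrderTotal    DECIMAL(12, 2),
       Priority      VARCHAR(10),    -- populated during the Transform step
       SourceFile    VARCHAR(255)    -- provenance, useful for integrity checks
   );

   CREATE TABLE OLAP.dbo.FactOrder (
       OrderId       INT,
       PersonId      INT,
       OrderDateTime DATETIME,
       OrderTotal    DECIMAL(12, 2),
       Priority      VARCHAR(10)
   );

   CREATE TABLE OLTP.dbo.[Order] (
       OrderId       INT PRIMARY KEY,
       PersonId      INT NOT NULL,
       OrderDateTime DATETIME NOT NULL
   );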

We will use SQL Server Integration Services (SSIS) and develop the SSIS package within the Visual Studio 2017 IDE.


The first step of the SSIS package loads (INSERTs) the data into a STAGING area database. This allows us to:
  • Store off the intermediate data from all sources into analysis-friendly OLAP datastores
  • Perform data integrity checks
  • Keep extraction and transformation as two strictly separated steps

We load the data from the various source files (.csv, .xls, .xlsx) into SQL Server database Staging table(s) using SSIS Source and Destination Data Flow Tasks connected with the movable data flow arrows. Once you have connected a source and destination you can go into the Destination Data Flow Task and edit the mappings of which source columns should be written to which destination columns.

Next we perform some transformations. This can be anything from a simple ranking or status/flagging/business-prioritization algorithm, to data cleansing, to data partitioning based on certain criteria; the key is that this Transform step is where we apply T-SQL UPDATEs to transform the data once it has all been aggregated in Staging, as sketched below.
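For example, a cleansing-and-flagging pass over the hypothetical staging table above might look like:

   -- Normalize a sentinel value and set a business-prioritization flag
   UPDATE s
   SET    s.PersonId = NULLIF(s.PersonId, 0),   -- treat 0 as 'unknown'
          s.Priority = CASE WHEN s.OrderTotal >= 1000.00
                            THEN 'HIGH'
                            ELSE 'STANDARD'
                       END
   FROM   Staging.dbo.OrderStaging AS s;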

Then we refresh the OLAP destination tables using the same kind of Source and Destination Data Flow Tasks and mappings as used for Staging. The OLAP data is used for data analysis.

Finally, we load the cleansed Staging data into our destination system's OLTP database and email or text message the system owner upon successful completion of the SSIS ETL job (or deliver an error if anything fails). The OLTP data stores live transactions.
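Continuing the hypothetical template above, the OLAP refresh and final OLTP load boil down to INSERT ... SELECT statements along these lines:

   -- Refresh the OLAP fact table from Staging
   INSERT INTO OLAP.dbo.FactOrder (OrderId, PersonId, OrderDateTime, OrderTotal, Priority)
   SELECT OrderId, PersonId, OrderDateTime, OrderTotal, Priority
   FROM   Staging.dbo.OrderStaging;

   -- Load only new rows into the live OLTP table
   INSERT INTO OLTP.dbo.[Order] (OrderId, PersonId, OrderDateTime)
   SELECT s.OrderId, s.PersonId, s.OrderDateTime
   FROM   Staging.dbo.OrderStaging AS s
   WHERE  NOT EXISTS (SELECT 1 FROM OLTP.dbo.[Order] AS o
                      WHERE  o.OrderId = s.OrderId);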

Bear in mind that most ETL data-flow step mappings are not a 1:1 match; this is just an end-to-end demo of SSIS ETL in its most basic form.


Happy ETL'ing, and be sure to watch out for mysterious symbols/characters in data copied from other programs or from system environments that were using a Language setting (codepage) incompatible with your ETL software. Bad data happens more often than you think, and as we say: GIGO (garbage in, garbage out).

Your end result looks like this (all green checkmarks indicate that everything succeeded). I recommend using Papercut for SMTP testing; it's a super cool and useful product.

I would attach or GitHub the source code (and will do so upon request), but SSIS project code has a lot of dependencies and can get quite messy for someone else to reuse on even a 'slightly different' machine.

Having used SSIS and its now-deprecated predecessor DTS (SQL Server Data Transformation Services) for many years, I can attest that the best way to learn this product is to dive right in on your own and begin creating sources, destinations, source/destination connection managers, control flow events, .NET integrations, exception event handlers, and so on.

You will likely run into some ambiguous, poorly documented errors when developing in SSIS; but persist in your efforts and you can create a very powerful EDI system from the many capabilities of SSIS and the robust scheduled ETL jobs it can create.

ETL and the Case Against a Single System for All Business Processes

Nearly every business wants to centralize their business data so that they can get an at-a-glance overview of what is going on at any given point in the day. Suppose, for instance, a company has its business data on 5 separate database servers, in various databases therein, across several different tables. And much of the data will need to be slightly modified (transformed) in order to make data types consistent and optimize it for quick and meaningful data analysis.

Assume the database servers are:
  • CMS, Orders (Dynamics 365 data)
  • Accounting (Sage 50 data)
  • Warehouse, Logistics (SAP data)
  • Field Service, Product Support (SalesForce data)
  • Call Center (TantaComm data)
Furthermore, the business wants to apply some custom logic to various fields so that certain calculations don't have to be made by the OLAP end users and applications.
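As a hypothetical example (the table and column names are made up), precomputing a margin percentage during the Transform step spares every analyst and downstream application from re-deriving it at query time:

   UPDATE Warehouse.dbo.OrderSummary
   SET    MarginPct = CASE WHEN OrderTotal > 0
                           THEN 100.0 * (OrderTotal - OrderCost) / OrderTotal
                      END;   -- NULL when OrderTotal is 0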

ETL is just moving data from a Source to a Destination, often making some changes (transformations) along the way

Well, it all seems so complicated, right? How can all that disparate data be combined and standardized? It's not as complicated as allowing an ERP solution to define how your various departments do their jobs. Some ERP products truly do have the functionality to satisfy 80-85% of a company's business requirements. And when that is the case, by all means, companies should go that route.

But where running every business function through an ERP is not working, you can use products like SSIS, Informatica, and Oracle Data Integrator to map ETL data visually and apply code operations to the data in a variety of programming languages.

But of course, there is no "one-size-fits-all 100% of our business needs" ERP solution; different business productivity software works better or worse for different industries and different departmental needs, for all manner of reasons. Why second-guess the choice of business tool that your business's management and best minds know, and have proven, is most effective for their (often very specific) job?

Allowing a business to integrate the "best apps for each particular type of job" is nothing more than a sort of microservices architecture. But if we are working with custom code in these various LOB applications (at least some of the LOB apps will need custom API integrations so they can talk to each other), then with each new block of custom code comes more decentralized code bases and decentralized test scripts...

All considered, I would argue that trying to put square pegs into round holes a la "ERP-everything" is a much, much, MUCH bigger effort than a well-developed and tested scheduled ETL Data Warehouse refresh process to centralize your company's data.

It's always a balance with software. Use the most effective apps for your organization and departmental requirements, but try to avoid the dependency hell that can come along with a poorly managed distributed application environment.

Here are the ETL steps in order, from the resource link below:

Extract the data from the external data sources such as line-of-business systems, CRM systems, relational databases, web services, and anywhere you have business data you want to summarize.

Transform the data. This includes cleansing the data and converting it to an OLAP-friendly data model. The OLAP-friendly data model traditionally consists of dimension and fact tables in a star or snowflake schema.

Load the data so that it can be quickly accessed by querying tools such as reports, dashboards, etc.
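As a concrete illustration of the dimension/fact model mentioned in the Transform step, here is a minimal star-schema sketch (hypothetical names; one fact table keyed to two dimensions):

   CREATE TABLE dbo.DimCustomer (
       CustomerKey  INT IDENTITY PRIMARY KEY,
       CustomerName VARCHAR(100)
   );

   CREATE TABLE dbo.DimDate (
       DateKey INT PRIMARY KEY,   -- e.g. 20181231
       [Date]  DATE,
       [Month] TINYINT,
       [Year]  SMALLINT
   );

   CREATE TABLE dbo.FactSales (
       CustomerKey INT REFERENCES dbo.DimCustomer (CustomerKey),
       DateKey     INT REFERENCES dbo.DimDate (DateKey),
       SalesAmount DECIMAL(12, 2)
   );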

"90% of the requirements now is better than 100% of the requirements never" -Unknown

Reference: https://blogs.msdn.microsoft.com/andreasderuiter/2012/12/05/designing-an-etl-process-with-ssis-two-approaches-to-extracting-and-transforming-data/

T-SQL and .NET for XSLT

Steps to transform SQL data to XML and then, via XSLT, to output data (csv, htm, tab-delimited, fixed-width, etc.):

1). Get your SQL data (assuming you are using SQL Server) into XML format via:

 SELECT [OrderId], [PersonId], [OrderDateTime] FROM [Order] FOR XML AUTO, ELEMENTS, ROOT ('Orders');  


XML result of "FOR XML AUTO, ELEMENTS, ROOT":
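With a couple of rows in [Order] (the values here are made up), the query returns XML shaped like this:

   <Orders>
     <Order>
       <OrderId>1</OrderId>
       <PersonId>42</PersonId>
       <OrderDateTime>2018-06-01T09:30:00</OrderDateTime>
     </Order>
     <Order>
       <OrderId>2</OrderId>
       <PersonId>7</PersonId>
       <OrderDateTime>2018-06-01T10:15:00</OrderDateTime>
     </Order>
   </Orders>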

2). Load XSL transform file (OrderXform.xsl) into a new XslCompiledTransform object (transform):

       XslCompiledTransform transform = new XslCompiledTransform();  
       transform.Load("C:\\Xform\\OrderXform.xsl");  


XSL file used to transform the XML to the desired file output (EDI 834 standard, etc.):
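The actual OrderXform.xsl isn't reproduced here; as a stand-in, a minimal XSLT 1.0 stylesheet (the version XslCompiledTransform compiles) that flattens the Orders XML above into CSV looks like:

   <?xml version="1.0" encoding="utf-8"?>
   <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
     <xsl:output method="text"/>
     <xsl:template match="/Orders">
       <xsl:text>OrderId,PersonId,OrderDateTime&#10;</xsl:text>
       <xsl:for-each select="Order">
         <xsl:value-of select="OrderId"/>
         <xsl:text>,</xsl:text>
         <xsl:value-of select="PersonId"/>
         <xsl:text>,</xsl:text>
         <xsl:value-of select="OrderDateTime"/>
         <xsl:text>&#10;</xsl:text>
       </xsl:for-each>
     </xsl:template>
   </xsl:stylesheet>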


3). Using a StringWriter (writer), a StringReader (inputReader), and an XmlReader (inputXmlReader), read the input XML string into inputXmlReader and apply the XSL transformation via transform.Transform(inputXmlReader, args, writer) (or another overload of Transform()):

  using (StringWriter writer = new StringWriter())
  {
      // Make an XML reader (inputXmlReader) out of the string (input)
      XmlReader inputXmlReader;
      using (var inputReader = new StringReader(input))
      {
          inputXmlReader = XmlReader.Create(inputReader);

          // Apply the transformation to the XmlReader and write the result to the StringWriter
          transform.Transform(inputXmlReader, null, writer);
      }

      // Retrieve the output as a string from the writer object
      return writer.GetStringBuilder().ToString();
  }

4). Return the result and write the bits (to memory as Label Text for a demo like this, to file, to domain network location, to non-domain customer ftp, etc.).

The result normally gets written to a remote SFTP location or posted to an API endpoint for processing.

Nothing incredibly complex about this, but it seems to frustrate a lot of people in .NET land so I have documented a quick and easy example.
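Pulled together, steps 2 through 4 amount to a small helper along these lines (a minimal sketch; the XSL path is the one from step 2, and input is your FOR XML string):

   using System.IO;
   using System.Xml;
   using System.Xml.Xsl;

   static class OrderTransformer
   {
       // Applies the XSL transform to an XML string and returns the output text
       public static string TransformOrders(string input)
       {
           var transform = new XslCompiledTransform();
           transform.Load("C:\\Xform\\OrderXform.xsl");

           using (var writer = new StringWriter())
           using (var inputReader = new StringReader(input))
           using (var inputXmlReader = XmlReader.Create(inputReader))
           {
               transform.Transform(inputXmlReader, null, writer);
               return writer.ToString();
           }
       }
   }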

GitHub Source: https://github.com/Radagast27/Simpl_XSLT_dot_net

Tool for XSLT validation: https://www.freeformatter.com/xsl-transformer.html

References:

https://www.red-gate.com/simple-talk/sql/learn-sql-server/using-the-for-xml-clause-to-return-query-results-as-xml/

https://www.w3schools.com/xml/xsl_examples.asp

https://www.c-sharpcorner.com/article/transform-xml-output-of-sql-query-using-for-xml-auto-statement-to-html-using-xsl/

Invisible characters from pasted data



If you find yourself debugging an ETL task error where the problem is a data type mismatch or some other data error that is not visible in the source (culprit) data, don't fret.

What has likely happened is invisible formatting characters got into the data.

To remedy things, all you'll need to do is open up Notepad++, paste your data into the text window, and from the View menu dropdown, select "Show Symbol" >> "Show All Characters".

Simply edit the data (remove the special characters), save, and try the import again. Your data should be sufficiently scrubbed clean, and your ETL task will return to normal.
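If the bad characters keep coming back (because the feed itself is dirty), you can also scrub them programmatically. A minimal C# sketch (the character ranges below are the usual suspects; adjust for your data):

   using System.Text.RegularExpressions;

   static class DataScrubber
   {
       // Replace non-breaking spaces with plain spaces, then strip control
       // and zero-width characters that commonly sneak in via copy/paste
       public static string ScrubInvisible(string s)
       {
           s = s.Replace('\u00A0', ' ');
           return Regex.Replace(
               s, @"[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F\u200B-\u200D\uFEFF]", "");
       }
   }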


There is also a shortcut button for this on the Notepad++ toolbar.


PS: a handy tip for Windows users: to remove all formatting from copied data, paste the text into Notepad; all formatting will be stripped. Then re-select it and paste it where needed.

PureText is another application that handles removing all formatting from text: https://stevemiller.net/puretext/