ITMD 526-Week10-Blog

Extract, Transform and Load

ETL stands for Extract, Transform and Load which is a process in database particularly in data warehousing. Data is extracted from homogeneous/heterogeneous data sources in the data extraction phase. The data is transformed into required format in the data transformation phase and the transformed data is loaded into to the new or destined database in the data load phase. Since the data extraction requires significant investment, it is basic to execute the three stages in parallel. While the data is being removed, another change procedure executes while handling the data effectively got and sets it up for stacking while the data stacking starts without sitting tight for the fulfillment of the past stages. ETL frameworks usually incorporate data from various applications (frameworks), commonly created and upheld by various sellers or facilitated on particular PC equipment. The divergent frameworks containing the first data are every now and again oversaw and worked by various representatives. For instance, a cost bookkeeping framework may join data from finance, deals, and obtaining.

Extract:
Extract is the first and the most important step in the ETL process. The data is extracted from the source database system. Extracting data is considered vital as data extraction sets stage for the next processes. Mostly, different data source systems are used by the warehouse projects to combine data and extract them. Some of the common source formats are relational databases, XML, flat files, IMS, VSAM, ISAM, web spidering or screen-scraping. The data is converted into single format in the extraction phase which will be helpful for transformation. Data validation also takes place which helps in assuring that right values are extracted for transformation.

Transformation:
A set of rules are applied to the extracted data that need to be transformed in the data transformation phase. Data that does not require any transformation are called as “pass through” data. One of the key functionalities of transformation phase is data cleaning which targets to send only required data to the target database. Some of the requirements that should be met as per business needs are:
  • Selecting only certain columns to load: (or selecting null columns not to load). For example, if the source data has three columns (aka "attributes"), roll_no, age, and salary, then the selection may take only roll_no and salary. Or, the selection mechanism may ignore all those records where salary is not present (salary = null).
  • Translating coded values: (e.g., if the source system codes male as "1" and female as "2", but the warehouse codes male as "M" and female as "F")
  • Encoding free-form values: (e.g., mapping "Male" to "M")
  • Deriving a new calculated value: (e.g., sale_amount = qty * unit_price)
  • Sorting or ordering the data based on a list of columns to improve search performance
  • Joining data from multiple sources (e.g., lookup, merge) and deduplicating the data
  • Aggregating (for example, rollup — summarizing multiple rows of data — total sales for each store, and for each region, etc.)
  • Generating surrogate-key values
  • Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
  • Splitting a column into multiple columns (e.g., converting a comma-separated list, specified as a string in one column, into individual values in different columns)
  • Disaggregating repeating columns
  • Looking up and validating the relevant data from tables or referential files


Load:

In this file, the extracted data (both transformed and pass through) is loaded into the target database. The data can be either in simple flat file or a data warehouse. The process of the load phase varies depending on the business requirement. Certain data warehouses have a onetime load phase, while others have ETL phases on regular intervals. It depends on the business needs and type of data.


Comments

Popular posts from this blog

ITMD 526-Week 6-Blog

ITMD 526-Week 7-Blog

ITMD526-Week1-Blog