ITMD 526-Week10-Blog
Extract, Transform and Load
ETL stands
for Extract, Transform and Load which is a process in database particularly in
data warehousing. Data is extracted from homogeneous/heterogeneous data sources
in the data extraction phase. The data is transformed into required format in
the data transformation phase and the transformed data is loaded into to the
new or destined database in the data load phase. Since the data extraction
requires significant investment, it is basic to execute the three stages in
parallel. While the data is being removed, another change procedure executes
while handling the data effectively got and sets it up for stacking while the
data stacking starts without sitting tight for the fulfillment of the past
stages. ETL frameworks usually incorporate data from various applications
(frameworks), commonly created and upheld by various sellers or facilitated on
particular PC equipment. The divergent frameworks containing the first data are
every now and again oversaw and worked by various representatives. For
instance, a cost bookkeeping framework may join data from finance, deals, and
obtaining.
Extract:
Extract is
the first and the most important step in the ETL process. The data is extracted
from the source database system. Extracting data is considered vital as data
extraction sets stage for the next processes. Mostly, different data source
systems are used by the warehouse projects to combine data and extract them.
Some of the common source formats are relational databases, XML, flat files,
IMS, VSAM, ISAM, web spidering or screen-scraping. The data is converted into
single format in the extraction phase which will be helpful for transformation.
Data validation also takes place which helps in assuring that right values are
extracted for transformation.
Transformation:
A set of
rules are applied to the extracted data that need to be transformed in the data
transformation phase. Data that does not require any transformation are called
as “pass through” data. One of the key functionalities of transformation phase
is data cleaning which targets to send only required data to the target
database. Some of the requirements that should be met as per business needs
are:
- Selecting only certain columns to load: (or selecting null columns not to load). For example, if the source data has three columns (aka "attributes"), roll_no, age, and salary, then the selection may take only roll_no and salary. Or, the selection mechanism may ignore all those records where salary is not present (salary = null).
- Translating coded values: (e.g., if the source system codes male as "1" and female as "2", but the warehouse codes male as "M" and female as "F")
- Encoding free-form values: (e.g., mapping "Male" to "M")
- Deriving a new calculated value: (e.g., sale_amount = qty * unit_price)
- Sorting or ordering the data based on a list of columns to improve search performance
- Joining data from multiple sources (e.g., lookup, merge) and deduplicating the data
- Aggregating (for example, rollup — summarizing multiple rows of data — total sales for each store, and for each region, etc.)
- Generating surrogate-key values
- Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
- Splitting a column into multiple columns (e.g., converting a comma-separated list, specified as a string in one column, into individual values in different columns)
- Disaggregating repeating columns
- Looking up and validating the relevant data from tables or referential files
Load:
In this file,
the extracted data (both transformed and pass through) is loaded into the
target database. The data can be either in simple flat file or a data
warehouse. The process of the load phase varies depending on the business
requirement. Certain data warehouses have a onetime load phase, while others
have ETL phases on regular intervals. It depends on the business needs and type
of data.

Comments
Post a Comment