ETL Testing
Data Warehousing and its Concepts:
What is a Data Warehouse?
A data warehouse is a centrally managed and integrated database containing data from the operational sources in an organization (such as SAP, CRM or ERP systems). It may also gather manual inputs from users determining criteria and parameters for grouping or classifying records.
The data warehouse database contains structured data for query and analysis and can be accessed by users. The warehouse can be created or updated at any time, with minimal disruption to operational systems; this is ensured by the strategy implemented in the ETL process.
The source for the data warehouse is data extracted from operational databases. This data is validated, cleansed, transformed and finally aggregated, at which point it is ready to be loaded into the data warehouse.
A data warehouse is a dedicated database containing detailed, stable, non-volatile and consistent data that can be analyzed over time (it is time-variant).
Sometimes, where only a portion of detailed data is required, it may be worth considering using a data mart.
A data mart is generated from the data warehouse and contains data focused on a given subject, and data that is frequently accessed or summarized.
Advantages of a Data Warehouse:
• A data warehouse provides a common data model for all data of interest regardless of the data's source. This makes it easier to report and analyze information than it would be if multiple data models were used to retrieve information such as sales invoices, order receipts, general ledger charges, etc.
• Inconsistencies are identified and resolved prior to loading of data in the data warehouse. This greatly simplifies reporting and analysis.
• Information in the data warehouse is under the control of data warehouse users so that, even if the source system data is purged over time, the information in the warehouse can be stored safely for extended periods of time.
• Because they are separate from operational systems, data warehouses provide retrieval of data without slowing down operational systems.
• Data warehouses enhance the value of operational business applications, notably customer relationship management (CRM) systems.
• Data warehouses facilitate decision support system applications such as trend reports (e.g., the items with the most sales in a particular area within the last two years), exception reports, and reports that show actual performance versus goals.
Disadvantages of a Data Warehouse:
• Data warehouses are not the optimal environment for unstructured data.
• Because data must be extracted, transformed and loaded into the warehouse, there is an element of latency in data warehouse data.
• Over their life, data warehouses can have high costs. Maintenance costs are high.
• Data warehouses can get outdated relatively quickly. There is a cost of delivering suboptimal information to the organization.
• There is often a fine line between data warehouses and operational systems. Duplicate, expensive functionality may be developed. Or, functionality may be developed in the data warehouse that, in retrospect, should have been developed in the operational systems, and vice versa.
ETL Concept:
ETL is the automated and auditable data acquisition process from source systems that involves one or more sub-processes of data extraction, data transportation, data transformation, data consolidation, data integration, data loading and data cleaning.
E - Extracting data from source operational or archive systems, which are the primary sources of data for the data warehouse.
T - Transforming the data, which may involve cleaning, filtering, validating and applying business rules.
L - Loading the data into the data warehouse or any other database or application that houses the data.
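To make the three steps concrete, here is a minimal, hypothetical Python sketch of an ETL flow; the file name, field names and the business rule (positive amounts only) are illustrative assumptions, not any particular tool's API:

```python
import csv
import sqlite3

def extract(path):
    """E - read raw records from a source file (here: a CSV extract)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """T - clean, filter and apply a business rule (amounts must be positive)."""
    out = []
    for row in rows:
        amount = float(row["amount"].strip() or 0)
        if amount > 0:                       # filter invalid records
            out.append((row["order_id"].strip(), amount))
    return out

def load(rows, conn):
    """L - write the transformed records into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (order_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")   # stand-in for the warehouse database
    load(transform(extract("orders.csv")), conn)
```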
ETL Process:
The ETL process consists of three stages: extraction, transformation and loading.
Extraction:
The first part of an ETL process involves extracting the data from the source systems. Most data warehousing projects consolidate data from different source systems, and each separate system may also use a different data format. Common data source formats are relational databases and flat files, but sources may also include non-relational database structures such as Information Management System (IMS), other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even data fetched from outside sources through web spidering or screen scraping. Extraction converts the data into a format suitable for transformation processing. An intrinsic part of extraction is the parsing of extracted data, which checks whether the data meets an expected pattern or structure; if not, the data may be rejected entirely or in part.
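The parsing-and-rejection step described above might look like the following sketch; the record pattern (numeric id, ISO date, decimal amount) and the sample lines are assumptions for illustration:

```python
import re

# Hypothetical expected pattern for an extracted record: numeric id,
# ISO date, decimal amount (e.g. "1001,2023-05-14,99.95").
RECORD_PATTERN = re.compile(r"^\d+,\d{4}-\d{2}-\d{2},\d+(\.\d+)?$")

def parse_extract(lines):
    """Split extracted lines into accepted and rejected records."""
    accepted, rejected = [], []
    for line in lines:
        line = line.strip()
        if RECORD_PATTERN.match(line):
            accepted.append(line.split(","))
        else:
            rejected.append(line)          # rejected entirely, as described above
    return accepted, rejected

good, bad = parse_extract(["1001,2023-05-14,99.95", "oops,not-a-date,x"])
print(len(good), "accepted;", len(bad), "rejected")
```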
Transformation:
Transformation is the series of tasks that prepares the data for loading into the warehouse. Once the data is extracted, you have to worry about its format and structure, because it will not be in the format needed for the target; for example, the grain level or the data types might be different. Data cannot be used as it is, so rules and functions need to be applied to transform it.
One of the purposes of ETL is to consolidate the data in a central repository, or to bring it to one logical or physical place. Data can be consolidated from similar systems, different subject areas, etc.
ETL must also support data integration for data coming from multiple sources and arriving at different times. This has to be a seamless operation, to avoid overwriting existing data, creating duplicate data or, even worse, being unable to load the data into the target at all.
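The sketch below illustrates consolidation and integration, assuming two hypothetical source feeds (a CRM and an ERP) that represent dates differently; merging by key brings the data to one place without creating duplicates:

```python
from datetime import datetime

# Hypothetical feeds from two source systems with different date formats.
crm_rows = [{"id": "A-1", "signup": "14/05/2023"}]
erp_rows = [{"id": "A-1", "signup": "2023-05-14"}, {"id": "B-2", "signup": "2023-06-01"}]

def normalize(row, date_format):
    """Apply a transformation rule: bring dates to one canonical type."""
    return (row["id"], datetime.strptime(row["signup"], date_format).date())

def consolidate(*feeds):
    """Merge feeds by key so loading does not create duplicates."""
    merged = {}
    for feed in feeds:
        for key, signup in feed:
            merged[key] = signup           # last feed wins; a real rule may differ
    return merged

result = consolidate(
    (normalize(r, "%d/%m/%Y") for r in crm_rows),
    (normalize(r, "%Y-%m-%d") for r in erp_rows),
)
print(result)   # one row per key, one canonical date type
```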
Loading:
The loading process is critical to integration and consolidation. It decides the modality of how data is added to the warehouse, or simply rejected; operations such as insertion, updating and deletion are executed at this step. What happens to the existing data? Should old data be deleted because of new information? Should the data be archived? Should the data be treated as additional to the existing data? Data therefore has to be loaded into the warehouse with utmost care, and only a data auditing process can establish the confidence level. This auditing normally happens after the data has been loaded.
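A minimal sketch of such a load decision, assuming a hypothetical dim_customer table: each incoming record either updates existing data or is inserted as an addition, and the resulting counts serve as simple post-load audit figures:

```python
import sqlite3

conn = sqlite3.connect(":memory:")         # stand-in for the warehouse
conn.execute("CREATE TABLE dim_customer (id TEXT PRIMARY KEY, name TEXT)")

def load_batch(rows):
    """Decide per record whether to insert new data or update existing data."""
    inserted = updated = 0
    for cid, name in rows:
        cur = conn.execute("UPDATE dim_customer SET name = ? WHERE id = ?", (name, cid))
        if cur.rowcount:                   # existing row: treated as an update
            updated += 1
        else:                              # new row: treated as an addition
            conn.execute("INSERT INTO dim_customer VALUES (?, ?)", (cid, name))
            inserted += 1
    conn.commit()
    return inserted, updated

ins, upd = load_batch([("C1", "Ada"), ("C1", "Ada L."), ("C2", "Grace")])
# Post-load audit: counts establish confidence in what the load actually did.
print(f"inserted={ins} updated={upd}")
```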
List of ETL tools:
Below is a list of ETL tools available in the market:

| List of ETL Tools | ETL Vendors |
|---|---|
| Oracle Warehouse Builder (OWB) | Oracle |
| Data Integrator & Data Services | SAP Business Objects |
| IBM Information Server (Datastage) | IBM |
| SAS Data Integration Studio | SAS Institute |
| PowerCenter | Informatica |
| Elixir Repertoire | Elixir |
| Data Migrator | Information Builders |
| SQL Server Integration Services | Microsoft |
| Talend Open Studio | Talend |
| DataFlow Manager | Pitney Bowes Business Insight |
| Data Integrator | Pervasive |
|  | Open Text |
| Transformation Manager | ETL Solutions Ltd. |
| Data Manager/Decision Stream | IBM (Cognos) |
| Clover ETL | Javlin |
| ETL4ALL | IKAN |
| DB2 Warehouse Edition | IBM |
| Pentaho Data Integration | Pentaho |
| Adeptia Integration Server | Adeptia |
ETL Testing:
Following are some common goals for testing an ETL application:
• Data completeness - to ensure that all expected data is loaded.
• Data quality - to ensure that the ETL application correctly rejects, substitutes default values for, corrects and reports invalid data.
• Data transformation - to ensure that all data is correctly transformed according to business rules and design specifications.
• Performance and scalability - to ensure that data loads and queries perform within expected time frames and that the technical architecture is scalable.
• Integration testing - to ensure that the ETL process functions well with other upstream and downstream applications.
• User-acceptance testing - to ensure that the solution fulfills users' current expectations and anticipates their future expectations.
• Regression testing - to keep existing functionality intact each time a new release of code is completed.
Basically, data warehouse testing is divided into two categories: back-end testing and front-end testing. The former, which is ETL testing, compares the source systems' data to the end-result data in the loaded area. The latter refers to the user checking the data by comparing their MIS with the data displayed by the end-user tools.
Data Validation:
Data completeness is one of the basic forms of data validation. It verifies that all expected data is loaded into the data warehouse. This includes validating all records and fields and ensuring that the full contents of each field are loaded.
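As a minimal illustration, a completeness check can start with comparing source and target row counts; the table name and the in-memory databases below are stand-ins for real source and warehouse connections:

```python
import sqlite3

def completeness_check(src_conn, tgt_conn, src_table, tgt_table):
    """Compare row counts between source and target - the basic completeness test."""
    src = src_conn.execute(f"SELECT COUNT(*) FROM {src_table}").fetchone()[0]
    tgt = tgt_conn.execute(f"SELECT COUNT(*) FROM {tgt_table}").fetchone()[0]
    assert src == tgt, f"count mismatch: source={src}, target={tgt}"
    return src

# Minimal demonstration with in-memory stand-ins for the two databases.
src, tgt = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER)")
tgt.execute("CREATE TABLE orders (id INTEGER)")
src.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,)])
tgt.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,)])
print("rows verified:", completeness_check(src, tgt, "orders", "orders"))
```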
Data Transformation:
Validating that data is transformed correctly based on business rules can be one of the most complex parts of testing an ETL application with significant transformation logic. One approach is to pick some sample records and manually compare them to validate the data transformation, but this method requires manual testing steps and testers with a good amount of experience and understanding of the ETL logic.
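A sketch of that sampling approach, assuming a hypothetical business rule (net price = gross price * (1 - discount)); the rule and the sampled rows are illustrative only:

```python
# Hypothetical business rule: target.net_price = source.gross_price * (1 - discount).
def expected_net_price(source_row):
    return round(source_row["gross_price"] * (1 - source_row["discount"]), 2)

def validate_samples(samples):
    """Re-apply the documented rule to sampled source rows and diff the target."""
    failures = []
    for source_row, target_value in samples:
        expected = expected_net_price(source_row)
        if expected != target_value:
            failures.append((source_row, expected, target_value))
    return failures

samples = [({"gross_price": 100.0, "discount": 0.10}, 90.0),
           ({"gross_price": 50.0, "discount": 0.20}, 40.0)]
print("failures:", validate_samples(samples))   # empty list means samples pass
```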
Data Warehouse Testing Life Cycle:
Like any other piece of software, a DW implementation undergoes the natural cycle of unit testing, system testing, regression testing, integration testing and acceptance testing.
Unit testing: Traditionally, this has been the task of the developer. It is white-box testing to ensure the module or component is coded as per the agreed-upon design specifications. The developer should focus on the following (a minimal test sketch follows the checklists):
a) All inbound and outbound directory structures are created properly, with appropriate permissions and sufficient disk space. All tables used during the ETL are present with the necessary privileges.
b) The ETL routines give expected results:
i. All transformation logic works as designed from source to target
ii. Boundary conditions are satisfied, e.g. checks for date fields with leap-year dates
iii. Surrogate keys have been generated properly
iv. NULL values have been populated where expected
v. Rejects have occurred where expected, and the reject log is created with sufficient detail
vi. Error recovery methods work as intended
vii. Auditing is done properly
c) The data loaded into the target is complete:
i. All source data that is expected to be loaded into the target actually gets loaded: compare counts between source and target, and use data-profiling tools
ii. All fields are loaded with full contents, i.e. no data field is truncated while transforming
iii. No duplicates are loaded
iv. Aggregations take place in the target properly
v. Data integrity constraints are properly taken care of
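A minimal sketch of such developer-level checks using Python's unittest, against a hypothetical fact_sales target table; the table, columns and the expected source count are illustrative assumptions:

```python
import sqlite3
import unittest

class EtlUnitTests(unittest.TestCase):
    """Developer-level checks on a loaded target table (names are hypothetical)."""

    @classmethod
    def setUpClass(cls):
        cls.conn = sqlite3.connect(":memory:")
        cls.conn.execute("CREATE TABLE fact_sales (sk INTEGER, order_id TEXT, amount REAL)")
        cls.conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                             [(1, "O1", 10.0), (2, "O2", 20.0)])
        cls.source_count = 2               # count taken from the source extract

    def test_counts_match_source(self):
        tgt = self.conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
        self.assertEqual(tgt, self.source_count)

    def test_no_duplicate_surrogate_keys(self):
        dupes = self.conn.execute(
            "SELECT sk FROM fact_sales GROUP BY sk HAVING COUNT(*) > 1").fetchall()
        self.assertEqual(dupes, [])

    def test_no_unexpected_nulls(self):
        nulls = self.conn.execute(
            "SELECT COUNT(*) FROM fact_sales WHERE order_id IS NULL").fetchone()[0]
        self.assertEqual(nulls, 0)

if __name__ == "__main__":
    unittest.main()
```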
System testing: Generally, the QA team owns this responsibility. For them, the design document is the bible, and the entire set of test cases is directly based upon it. Here we test the functionality of the application, and the testing is mostly black-box. The major challenge is the preparation of test data: an intelligently designed input dataset can bring out the flaws in the application more quickly. Wherever possible, use production-like data; you may also use data-generation tools or customized tools of your own to create test data (a small generator sketch follows the list below). We must test for all possible combinations of input and specifically check for errors and exceptions. An unbiased approach is required to ensure maximum efficiency. Knowledge of the business process is an added advantage, since we must be able to interpret the results functionally and not just code-wise.
The QA team must test for:
i. Data completeness: match source-to-target counts in terms of business. Also, the load windows, refresh period for the DW and the views created should be signed off by users.
ii. Data aggregations: match aggregated data against staging tables.
iii. Granularity of data is as per specifications.
iv. Error logs and audit tables are generated and populated properly.
v. Notifications to IT and/or business are generated in the proper format.
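Where production-like data is unavailable, a small generator can produce input that deliberately includes boundary and error cases, as mentioned above. This is a hypothetical sketch; the file name and field names are assumptions:

```python
import csv
import random

def generate_test_rows(n):
    """Yield mostly valid rows plus deliberate boundary and error cases."""
    for i in range(n):
        yield [f"O{i}", "2024-02-29", round(random.uniform(1, 500), 2)]  # leap-year boundary date
    yield ["O_neg", "2024-01-15", -10.0]   # invalid amount: should be rejected
    yield ["", "2024-01-15", 25.0]         # missing key: should be rejected

with open("test_orders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "order_date", "amount"])
    writer.writerows(generate_test_rows(100))
```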
Regression testing: A DW application is not a one-time solution; it is possibly the best example of an incremental design, where requirements are enhanced and refined quite often based on business needs and feedback. In such a situation, it is critical to verify that the existing functionality of a DW application is not broken whenever an enhancement is made. Generally this is done by running all functional tests for existing code whenever a new piece of code is introduced. However, a better strategy can be to preserve earlier test input data and result sets and run them again, so that the new results can be compared against the older ones to ensure proper functionality.
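A sketch of that baseline strategy in Python, assuming the earlier run's result set was preserved as a JSON file; run_etl, the file names and the row format are hypothetical placeholders:

```python
import json

def compare_to_baseline(current_rows, baseline_path):
    """Diff the current result set against a preserved baseline file."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    missing = [r for r in baseline if r not in current_rows]
    unexpected = [r for r in current_rows if r not in baseline]
    return missing, unexpected

# After a new release, rerun the preserved inputs and diff against the stored output:
#   missing, unexpected = compare_to_baseline(run_etl("old_inputs.csv"), "baseline.json")
# Both lists should be empty if existing functionality is intact.
```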
Integration testing: This is done to ensure that the application works from an end-to-end perspective. Here we must consider the compatibility of the DW application with upstream and downstream flows, and we need to ensure data integrity across the flow. Our test strategy should include testing for:
i. The sequence of jobs to be executed, with job dependencies and scheduling
ii. Re-startability of jobs in case of failures (see the runner sketch below)
iii. Generation of error logs
iv. Cleanup scripts for the environment, including the database
This activity is a combined responsibility, and participation of experts from all related applications is a must in order to avoid misinterpretation of results.
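The sketch below illustrates points i and ii above: a runner that executes jobs in dependency order and records progress in a state file so that a failed run restarts from the last completed job. The job names and the state-file mechanism are illustrative assumptions, not any specific scheduler's API:

```python
import json
import os

JOBS = ["extract_orders", "transform_orders", "load_orders"]   # dependency order
STATE_FILE = "run_state.json"

def run_pipeline(job_funcs):
    """Run jobs in sequence, recording progress so a failed run can restart."""
    done = json.load(open(STATE_FILE)) if os.path.exists(STATE_FILE) else []
    for name in JOBS:
        if name in done:
            continue                        # already completed in a previous run
        job_funcs[name]()                   # may raise; state keeps finished jobs
        done.append(name)
        json.dump(done, open(STATE_FILE, "w"))
    os.remove(STATE_FILE)                   # clean run: reset state for next cycle

run_pipeline({name: (lambda n=name: print("running", n)) for name in JOBS})
```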
Acceptance testing: This is the most critical part, because here the actual users validate your output datasets. They are the best judges to ensure that the application works as they expect. However, business users may not have proper ETL knowledge, so the development and test teams should be ready to answer questions about the ETL process as it relates to data population. The test team must have sufficient business knowledge to translate the results in business terms. Also, the load windows, refresh period for the DW and the views created should be signed off by users.
Performance testing: In addition to the above tests, a DW must also go through a phase called performance testing. Any DW application is designed to be scalable and robust, so when it goes into the production environment it should not cause performance problems. Here, we must test the system with huge volumes of data and ensure that the load window is met even under such volumes. This phase should involve the DBA team, ETL experts and others who can review and validate your code for optimization.
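As a rough illustration of a volume test, this sketch loads a large synthetic batch and asserts that it finishes within an agreed load window; the row count, the 60-second window and the in-memory database are stand-ins for production-scale figures:

```python
import sqlite3
import time

def timed_load(n_rows, window_seconds):
    """Load a large synthetic batch and check it fits the agreed load window."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE fact (id INTEGER, amount REAL)")
    start = time.perf_counter()
    conn.executemany("INSERT INTO fact VALUES (?, ?)",
                     ((i, i * 0.5) for i in range(n_rows)))
    conn.commit()
    elapsed = time.perf_counter() - start
    assert elapsed < window_seconds, f"load window missed: {elapsed:.1f}s"
    return elapsed

print(f"loaded 1M rows in {timed_load(1_000_000, 60):.2f}s")
```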
Summary:
Testing a DW application should be done with a sense of utmost responsibility. A bug in a DW traced at a later stage results in unpredictable losses, and the task is even more difficult in the absence of any single end-to-end testing tool. So the strategies for testing should be methodically developed, refined and streamlined. This is all the more true since the requirements of a DW often change dynamically. Under such circumstances, repeated discussions with the development team and users are of utmost importance to the test team. Another area of concern is test coverage, which has to be reviewed multiple times to ensure completeness of testing. Always remember: a DW tester must go the extra mile to ensure near defect-free solutions.