
Whitepaper: A Primer on Big Data Testing

Within software development and verification processes, testing teams may not yet fully grasp how Big Data affects the design, configuration and operation of systems and databases. Testers need a clear plan to execute their tests, but many new unknowns arise as Big Data systems are layered on top of enterprise systems already struggling with data quality.

Added to those struggles are the challenges of replicating and porting that information into the Big Data analytics and predictive software suite. How do you measure the quality of data, particularly when it is unstructured or generated through statistical processes? How do you confirm that highly concurrent systems do not have deadlock or race conditions? What tools should be used?
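The report itself proposes tooling for these questions; as a minimal illustration (not taken from the whitepaper), the snippet below shows the classic "lost update" race condition that concurrency tests must be able to surface, first played out deterministically step by step, then the lock-based remedy. The variable names and iteration counts are illustrative only.

```python
import threading

# Deterministic replay of a "lost update" race: two workers each read a
# shared counter, then write back their increment. Because both read
# before either writes, one increment disappears.
counter = 0
read_a = counter          # worker A reads 0
read_b = counter          # worker B reads 0, before A has written
counter = read_a + 1      # A writes 1
counter = read_b + 1      # B also writes 1 -- A's update is lost
assert counter == 1       # would be 2 if the increments were atomic

# The standard remedy: guard the read-modify-write with a lock so the
# increments cannot interleave.
safe_counter = 0
lock = threading.Lock()

def increment(times):
    global safe_counter
    for _ in range(times):
        with lock:        # critical section: atomic read-modify-write
            safe_counter += 1

workers = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(safe_counter)  # 40000: no updates lost
```

A real test suite would drive such interleavings systematically (e.g. via fault-injection tools like Jepsen, covered in section 8.3.1) rather than relying on a single manual scenario.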

It is imperative that software testers understand that Big Data is about far more than data volume. For example, a two-petabyte Oracle database doesn’t necessarily constitute a true Big Data situation – just a high-load one. Big Data management involves fundamentally different methods for storing and processing data, and the outputs may also be of a quite different nature. With the increased likelihood that Bad Data is embedded in the mix, the challenges facing quality assurance departments increase dramatically. This primer on Big Data testing provides guidelines and methodologies for approaching these data quality problems.

Primer on Big Data Report

Our Primer on Big Data report is the output of a research project by QA Consultants – the North American leader in onshore software testing. This white paper focuses on the primary challenges of testing Big Data systems and proposes a methodology to overcome those challenges. Because of the complex nature of both Big Data and the highly distributed, asynchronous systems that process it, organizations have been struggling to define testing strategies and to set up optimal testing environments.

The focus of this report is on important aspects of methods, tools and processes for Big Data testing. It was completed with the support of the National Research Council of Canada.


1. Big Data and Bad Data
2. Characteristics of Big Data
2.1 Volume: The quantity of data
2.2 Velocity: Streaming data
2.3 Variety: Different types of data

3. Testing Big Data Systems

4. Testing Methods, Tools and Reporting for Validation of Pre-Hadoop Processing
4.1 Tools for validating pre-Hadoop processing

5. Testing Methods, Tools and Reporting for Hadoop MapReduce Processes
5.1 Methods and tools

6. Testing Methods, Tools and Reporting for Data Extract and EDW Loading
6.1 Methods
6.2 Different methods for ETL testing
6.3 Areas of ETL testing
6.4 Tools

7. Testing Methods, Tools and Reporting on Analytics
7.1 Four Big Data reporting strategies
7.2 Methodology for report testing
7.3 Apache Falcon

8. Testing Methods, Tools and Reporting on Performance and Failover Testing
8.1 Performance testing
8.2 Failover testing
8.3 Methods and tools
8.3.1 Jepsen

9. Infrastructure Setup, Design and Implementation
9.1 Hardware selection for master nodes (NameNode, JobTracker, HBase Master)
9.2 Hardware selection for slave nodes (DataNodes, TaskTrackers, RegionServers)
9.3 Infrastructure setup key points

Download Our Primer on Big Data

Download our 24-page white paper primer on Big Data below: