Modern data midterm
| Question | Answer |
|---|---|
| What are the 5Vs of Big Data? | Volume, Velocity, Variety, Veracity, Value. |
| What do Volume, Velocity, and Variety describe? | Technical challenges of data size, speed, and format diversity. |
| What do Veracity and Value emphasize? | Data quality and business relevance. |
| What technology marked the start of Big Data? | Hadoop (HDFS for storage, MapReduce for processing). |
| What are the stages of the Data Value Chain? | Ingestion → Storage → Processing → Serving. |
| What does OSBDET stand for? | Open Source Big Data Educational Toolkit. |
| What is the purpose of OSBDET? | To provide a local cloud-like environment for labs using open-source Big Data tools. |
| What tools does OSBDET include? | NiFi, Hadoop, Kafka, MinIO, Jupyter, etc. |
| How is OSBDET accessed? | Through its WebUI in a web browser. |
| What are the three dimensions used to classify Data Sources? | Type (Operational/Informational), Location (Internal/External), Nature (Batch/Streaming). |
| Define Data Ingestion. | Capturing data from sources and moving it to centralized storage. |
| Name key open-source ingestion tools. | Sqoop (batch), Flume (streaming), NiFi (streaming), Kafka Connect (streaming). |
| What's the difference between Batch and Streaming ingestion? | Batch handles large historical data in bulk; Streaming processes data continuously in small events. |
| What is Apache NiFi? | A system to automate and manage data flows between systems. |
| What is a FlowFile? | The unit of data in NiFi, composed of content and attributes. |
| What is a Processor in NiFi? | A component that performs work on FlowFiles (e.g., extraction, filtering, loading). |
| What does a Connection do? | Links processors and acts as a queue for FlowFiles. |
| What is the Flow Controller? | The “brain” that manages scheduling and movement of FlowFiles. |
| What is a Process Group? | A container grouping multiple processors and connections for organization and reuse. |
| What are the two main types of Data Storage in Big Data? | Batch and Streaming storage. |
| What are the two families of Batch Storage? | Distributed File Systems (DFS) and Object Stores. |
| Give examples of each storage type. | DFS: HDFS, DBFS. Object Stores: MinIO, Amazon S3, Azure Blob Storage. |
| What's the difference between DFS and Object Stores? | DFS uses hierarchical folders and paths; Object Stores use flat buckets and object keys (see the MinIO sketch after this table). |
| What's the main limitation of Batch Data Storage? | High latency and poor performance for small files or random updates. |
| What is a Data Lake? | A centralized repository for storing all types of data at scale. |
| Name the main zones of a Data Lake. | Landing/Staging, Raw/Bronze, Standardized/Silver, Curated/Gold, Work, Sensitive. |
| What's stored in the Raw Zone? | Unprocessed, immutable data in its original format. |
| What characterizes the Standardized Zone? | Cleaned, deduplicated, standardized data in optimized formats (e.g., Parquet). |
| What is the Gold Zone used for? | Analytical and reporting-ready data for business users. |
| What is the Sensitive Zone? | Restricted, encrypted data requiring high security. |
| What is a Message in Kafka? | A single event or unit of data. |
| What is a Topic? | A category or feed name to which messages are published. |
| What is a Partition? | A subdivision of a topic that allows horizontal scaling; message order is preserved only within a partition. |
| What are Producers and Consumers? | Producers send messages to topics; Consumers read messages from them (see the Kafka sketch after this table). |
| What is a Broker? | A Kafka server that stores and serves messages. |
| What is a Consumer Group? | A group of consumers that share a topic's partitions for parallel reading. |
| What ensures fault tolerance in Kafka? | Message replication across brokers. |
| What are the two main types of Data Processing? | Batch Processing (large volumes, high latency) and Streaming Processing (small events, low latency). |
| What is MapReduce? | A parallel processing model that splits a job into map, shuffle/sort, and reduce phases (see the word-count sketch below). |
| What is Apache Spark? | A fast, unified analytics engine supporting batch and streaming data (see the PySpark sketch below). |
| What does YARN do in Hadoop? | Manages cluster resources (CPU, memory) for data processing tasks. |
| What are the key processing frameworks for each type? | Batch: Hadoop, Spark, Flink. Streaming: Storm, Spark Streaming, Kafka Streams. |
| What is a Workflow Scheduler? | A tool that automates and orchestrates data processing tasks, e.g., Airflow, Oozie, Luigi (see the Airflow sketch below). |
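
To make the DFS vs. Object Store distinction concrete, here is a minimal sketch using the `minio` Python client. The endpoint and `minioadmin` credentials match MinIO's local development defaults, and the bucket and object names are made up for illustration.

```python
from minio import Minio

# Assumed local MinIO endpoint with default dev credentials
client = Minio("localhost:9000",
               access_key="minioadmin",
               secret_key="minioadmin",
               secure=False)

# Buckets are flat containers, not folder trees
if not client.bucket_exists("raw"):
    client.make_bucket("raw")

# The key "sales/2024/orders.csv" is one flat string; the slashes are
# only a naming convention, unlike real directories in a DFS like HDFS
client.fput_object("raw", "sales/2024/orders.csv", "orders.csv")

# "Listing a folder" is really a prefix scan over object keys
for obj in client.list_objects("raw", prefix="sales/", recursive=True):
    print(obj.object_name)
```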
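The Kafka terms above fit together in a few lines of `kafka-python`. This is a minimal sketch assuming a broker on `localhost:9092`; the `orders` topic, key, and payload are made up for illustration.

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: publishes messages to a topic; messages with the same key
# always land in the same partition, which preserves their order
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("orders", key=b"customer-42", value=b'{"total": 19.99}')
producer.flush()

# Consumer: all consumers with the same group_id split the topic's
# partitions between them for parallel reading
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="billing",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.partition, message.offset, message.value)
```

Replication, the fault-tolerance mechanism from the flashcards, is configured per topic on the broker side; it does not appear in client code like this.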
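The map → shuffle/sort → reduce flow can be imitated in-process with plain Python. This toy word count only illustrates the three phases; Hadoop runs each phase distributed across a cluster.

```python
from itertools import groupby
from operator import itemgetter

lines = ["big data big value", "data velocity"]

# Map phase: emit a (word, 1) pair for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort phase: bring identical keys together, which the
# framework does between the map and reduce phases
mapped.sort(key=itemgetter(0))

# Reduce phase: aggregate the values for each key
for word, pairs in groupby(mapped, key=itemgetter(0)):
    print(word, sum(count for _, count in pairs))
```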
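The same word count in Spark, which handles the distribution for you. A minimal PySpark sketch; the input path `data/books/` is an assumption.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("wordcount").getOrCreate()

# Batch read: every line of every file becomes a row in column "value"
lines = spark.read.text("data/books/")  # hypothetical input path

# Split lines into words, one word per row, then count occurrences
words = lines.select(explode(split(lines.value, r"\s+")).alias("word"))
words.groupBy("word").count().show()

spark.stop()
```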
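Finally, a workflow scheduler expresses "run this, then that, every day" as code. A minimal Airflow 2.x sketch; the DAG id, task names, and echo commands are placeholders standing in for real ingestion and processing jobs.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# One DAG run per day; Airflow tracks, retries, and backfills runs
with DAG(dag_id="daily_sales",
         start_date=datetime(2024, 1, 1),
         schedule="@daily") as dag:  # "schedule" is the Airflow 2.4+ name
    ingest = BashOperator(task_id="ingest",
                          bash_command="echo pull from sources")
    process = BashOperator(task_id="process",
                           bash_command="echo run spark job")

    ingest >> process  # process runs only after ingest succeeds
```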