Modern data midterm
| Question | Answer |
|---|---|
| What are the 5Vs of Big Data? | Volume, Velocity, Variety, Veracity, Value. |
| What do Volume, Velocity, and Variety describe? | Technical challenges of data size, speed, and format diversity. |
| What do Veracity and Value emphasize? | Data quality and business relevance. |
| What technology marked the start of Big Data? | Hadoop (HDFS for storage, MapReduce for processing). |
| What are the stages of the Data Value Chain? | Ingestion → Storage → Processing → Serving. |
| What does OSBDET stand for? | Open Source Big Data Educational Toolkit. |
| What is the purpose of OSBDET? | To provide a local cloud-like environment for labs using open-source Big Data tools. |
| What tools does OSBDET include? | NiFi, Hadoop, Kafka, MinIO, Jupyter, etc. |
| How is OSBDET accessed? | Through its WebUI in a web browser. |
| What are the three dimensions used to classify Data Sources? | Type (Operational/Informational), Location (Internal/External), Nature (Batch/Streaming). |
| Define Data Ingestion. | Capturing data from sources and moving it to centralized storage. |
| Name key open-source ingestion tools. | Sqoop (batch), Flume (streaming), NiFi (streaming), Kafka Connect (streaming). |
| What's the difference between Batch and Streaming ingestion? | Batch handles large historical data in bulk; Streaming processes data continuously in small events. |
| What is Apache NiFi? | A system to automate and manage data flows between systems. |
| What is a FlowFile? | The unit of data in NiFi, composed of content and attributes. |
| What is a Processor in NiFi? | A component that performs work on FlowFiles (e.g., extraction, filtering, loading). |
| What does a Connection do? | Links processors and acts as a queue for FlowFiles. |
| What is the Flow Controller? | The “brain” that manages scheduling and movement of FlowFiles. |
| What is a Process Group? | A container grouping multiple processors and connections for organization and reuse. |
| What are the two main types of Data Storage in Big Data? | Batch and Streaming storage. |
| What are the two families of Batch Storage? | Distributed File Systems (DFS) and Object Stores. |
| Give examples of each storage type. | DFS: HDFS, DBFS. Object Stores: MinIO, Amazon S3, Azure Blob Storage. |
| What's the difference between DFS and Object Stores? | DFS uses hierarchical folders and paths; Object Stores use flat buckets and object keys (see the MinIO sketch after this table). |
| What's the main limitation of Batch Data Storage? | High latency and poor performance for small files or random updates. |
| What is a Data Lake? | A centralized repository for storing all types of data at scale. |
| Name the main zones of a Data Lake. | Landing/Staging, Raw/Bronze, Standardized/Silver, Curated/Gold, Work, Sensitive. |
| What's stored in the Raw Zone? | Unprocessed, immutable data in its original format. |
| What characterizes the Standardized Zone? | Cleaned, deduplicated, standardized data in optimized formats (e.g., Parquet). |
| What is the Gold Zone used for? | Analytical and reporting-ready data for business users. |
| What is the Sensitive Zone? | Restricted, encrypted data requiring high security. |
| What is a Message in Kafka? | A single event or unit of data. |
| What is a Topic? | A category or feed name to which messages are published. |
| What is a Partition? | A subdivision of a topic that allows horizontal scaling; message order is preserved only within a partition. |
| What are Producers and Consumers? | Producers send messages to topics; Consumers read messages from them (see the Kafka sketch after this table). |
| What is a Broker? | A Kafka server that stores and serves messages. |
| What is a Consumer Group? | A group of consumers that share a topic's partitions for parallel reading. |
| What ensures fault tolerance in Kafka? | Message replication across brokers. |
| What are the two main types of Data Processing? | Batch Processing (large volumes, high latency) and Streaming Processing (small events, low latency). |
| What is MapReduce? | A parallel processing model that splits a job into map, shuffle/sort, and reduce phases (see the word-count sketch below). |
| What is Apache Spark? | A fast, unified analytics engine supporting batch and streaming data (see the PySpark sketch below). |
| What does YARN do in Hadoop? | Manages cluster resources (CPU, memory) for data processing tasks. |
| What are the key processing frameworks for each type? | Batch: Hadoop, Spark, Flink. Streaming: Storm, Spark Streaming, Kafka Streams. |
| What is a Workflow Scheduler? | A tool that automates and orchestrates data processing tasks, e.g., Airflow, Oozie, Luigi (see the Airflow sketch below). |
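
To make the DFS vs. Object Store distinction concrete, here is a minimal sketch using the `minio` Python client. The endpoint and `minioadmin` credentials match MinIO's local development defaults, and the bucket and object names are made up for illustration.

```python
from minio import Minio

# Assumed local MinIO endpoint with default dev credentials
client = Minio("localhost:9000",
               access_key="minioadmin",
               secret_key="minioadmin",
               secure=False)

# Buckets are flat containers, not folder trees
if not client.bucket_exists("raw"):
    client.make_bucket("raw")

# The key "sales/2024/orders.csv" is one flat string; the slashes are
# only a naming convention, unlike real directories in a DFS like HDFS
client.fput_object("raw", "sales/2024/orders.csv", "orders.csv")

# "Listing a folder" is really a prefix scan over object keys
for obj in client.list_objects("raw", prefix="sales/", recursive=True):
    print(obj.object_name)
```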
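The Kafka terms above fit together in a few lines of `kafka-python`. This is a minimal sketch assuming a broker on `localhost:9092`; the `orders` topic, key, and payload are made up for illustration.

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: publishes messages to a topic; messages with the same key
# always land in the same partition, which preserves their order
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("orders", key=b"customer-42", value=b'{"total": 19.99}')
producer.flush()

# Consumer: all consumers with the same group_id split the topic's
# partitions between them for parallel reading
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="billing",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.partition, message.offset, message.value)
```

Replication, the fault-tolerance mechanism from the flashcards, is configured per topic on the broker side; it does not appear in client code like this.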
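The map → shuffle/sort → reduce flow can be imitated in-process with plain Python. This toy word count only illustrates the three phases; Hadoop runs each phase distributed across a cluster.

```python
from itertools import groupby
from operator import itemgetter

lines = ["big data big value", "data velocity"]

# Map phase: emit a (word, 1) pair for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort phase: bring identical keys together, which the
# framework does between the map and reduce phases
mapped.sort(key=itemgetter(0))

# Reduce phase: aggregate the values for each key
for word, pairs in groupby(mapped, key=itemgetter(0)):
    print(word, sum(count for _, count in pairs))
```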
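The same word count in Spark, which handles the distribution for you. A minimal PySpark sketch; the input path `data/books/` is an assumption.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("wordcount").getOrCreate()

# Batch read: every line of every file becomes a row in column "value"
lines = spark.read.text("data/books/")  # hypothetical input path

# Split lines into words, one word per row, then count occurrences
words = lines.select(explode(split(lines.value, r"\s+")).alias("word"))
words.groupBy("word").count().show()

spark.stop()
```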
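Finally, a workflow scheduler expresses "run this, then that, every day" as code. A minimal Airflow 2.x sketch; the DAG id, task names, and echo commands are placeholders standing in for real ingestion and processing jobs.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# One DAG run per day; Airflow tracks, retries, and backfills runs
with DAG(dag_id="daily_sales",
         start_date=datetime(2024, 1, 1),
         schedule="@daily") as dag:  # "schedule" is the Airflow 2.4+ name
    ingest = BashOperator(task_id="ingest",
                          bash_command="echo pull from sources")
    process = BashOperator(task_id="process",
                           bash_command="echo run spark job")

    ingest >> process  # process runs only after ingest succeeds
```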