Modern data midterm

Question / Answer
What are the 5Vs of Big Data? Volume, Velocity, Variety, Veracity, Value.
What do Volume, Velocity, and Variety describe? The technical challenges of data size, speed, and format diversity.
What do Veracity and Value emphasize? Data quality and business relevance.
What technology marked the start of Big Data? Hadoop (HDFS for storage, MapReduce for processing).
What are the stages of the Data Value Chain? Ingestion → Storage → Processing → Serving.
What does OSBDET stand for? Open Source Big Data Educational Toolkit.
What is the purpose of OSBDET? To provide a local, cloud-like environment for labs using open-source Big Data tools.
What tools does OSBDET include? NiFi, Hadoop, Kafka, MinIO, Jupyter, etc.
How is OSBDET accessed? Through the WebUI at
What are the three dimensions used to classify data sources? Type (Operational/Informational), Location (Internal/External), Nature (Batch/Streaming).
Define Data Ingestion. Capturing data from sources and moving it to centralized storage.
Name key open-source ingestion tools. Sqoop (batch), Flume (streaming), NiFi (streaming), Kafka Connect (streaming).
What's the difference between Batch and Streaming ingestion? Batch handles large historical data in bulk; Streaming processes data continuously in small events.
What is Apache NiFi? A system to automate and manage data flows between systems.
What is a FlowFile? The unit of data in NiFi, composed of content and attributes.
What is a Processor in NiFi? A component that performs work on FlowFiles (e.g., extraction, filtering, loading).
What does a Connection do? It links processors and acts as a queue for FlowFiles.
What is the Flow Controller? The “brain” that manages scheduling and movement of FlowFiles.
What is a Process Group? A container grouping multiple processors and connections for organization and reuse.
What are the two main types of Data Storage in Big Data? Batch and Streaming storage.
What are the two families of Batch Storage? Distributed File Systems (DFS) and Object Stores.
Give examples of each storage type. DFS: HDFS, DBFS. Object Stores: MinIO, Amazon S3, Azure Blob Storage.
What's the difference between DFS and Object Stores? DFS uses hierarchical folders and paths; Object Stores use flat buckets and object keys (see the object-store sketch after this list).
What's the main limitation of Batch Data Storage? High latency and poor performance for small files or random updates.
What is a Data Lake? A centralized repository for storing all types of data at scale.
Name the main zones of a Data Lake. Landing/Staging, Raw/Bronze, Standardized/Silver, Curated/Gold, Work, Sensitive.
What's stored in the Raw Zone? Unprocessed, immutable data in its original format.
What characterizes the Standardized Zone? Cleaned, deduplicated, standardized data in optimized formats (e.g., Parquet).
What is the Gold Zone used for? Analytical, reporting-ready data for business users.
What is the Sensitive Zone? Restricted, encrypted data requiring high security.
What is a Message in Kafka? A single event or unit of data.
What is a Topic? A category or feed name to which messages are published.
What is a Partition? A subdivision of a topic used to scale horizontally and preserve message order.
What are Producers and Consumers? Producers send messages to topics; Consumers read messages from them (see the Kafka sketch after this list).
What is a Broker? A Kafka server that stores and serves messages.
What is a Consumer Group? A group of consumers sharing a topic's partitions for parallel reading.
What ensures fault tolerance in Kafka? Message replication across brokers.
What are the two main types of Data Processing? Batch Processing (large, slow) and Streaming Processing (small, fast).
What is MapReduce? A parallel processing model that divides work into map, sort, and reduce phases (see the word-count sketch after this list).
What is Apache Spark? A fast, unified analytics engine supporting batch and streaming data (see the Spark sketch after this list).
What does YARN do in Hadoop? It manages resources (CPU, memory) for data processing tasks.
What are the key processing frameworks for each type? Batch: Hadoop, Spark, Flink. Streaming: Storm, Spark Streaming, Kafka Streams.
What is a Workflow Scheduler? A tool that automates and manages data processing tasks (e.g., Airflow, Oozie, Luigi); see the Airflow sketch after this list.
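
The DFS vs. object store cards are easier to remember with a concrete call. The sketch below writes into a MinIO bucket through its S3-compatible API using boto3; the endpoint, credentials, bucket name, and object key are illustrative assumptions (MinIO's common local defaults), not values from the deck. Note that the "path" is really a flat object key, whereas an HDFS path is a true folder hierarchy.

```python
# Minimal sketch: writing to an S3-compatible object store (e.g., MinIO) with boto3.
# Endpoint, credentials, bucket, and key are illustrative assumptions, not deck values.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",      # assumed local MinIO endpoint
    aws_access_key_id="minioadmin",            # assumed default MinIO credentials
    aws_secret_access_key="minioadmin",
)

# Object stores use flat buckets + keys; "sales/2024/orders.csv" is one key,
# not a real folder hierarchy (unlike an HDFS path such as /raw/sales/2024/).
s3.put_object(
    Bucket="datalake-raw",
    Key="sales/2024/orders.csv",
    Body=b"order_id,amount\n1,9.99\n2,19.90\n",
)

# Listing by key prefix mimics browsing a "folder" in the Raw/Bronze zone.
for obj in s3.list_objects_v2(Bucket="datalake-raw", Prefix="sales/")["Contents"]:
    print(obj["Key"], obj["Size"])
```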
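
The Kafka cards (message, topic, partition, producer/consumer, consumer group) map directly onto a few lines of client code. Below is a minimal sketch using the kafka-python client; the broker address, topic name, and group id are assumptions made for illustration.

```python
# Minimal sketch of a Kafka producer and consumer using the kafka-python client.
# Broker address, topic name, and group id are illustrative assumptions.
from kafka import KafkaProducer, KafkaConsumer

# Producer: publishes messages (events) to a topic; the key determines the partition,
# which is what preserves per-key ordering.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clicks", key=b"user-42", value=b'{"page": "/home"}')
producer.flush()

# Consumer: reads the topic as part of a consumer group; consumers in the same
# group split the topic's partitions between them for parallel reading.
consumer = KafkaConsumer(
    "clicks",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    auto_offset_reset="earliest",
)
for msg in consumer:
    print(msg.partition, msg.offset, msg.key, msg.value)
```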
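
The MapReduce card names three phases: map, sort (shuffle), and reduce. A toy word count in plain Python makes those phases visible; this is a conceptual sketch, not the Hadoop MapReduce API.

```python
# Conceptual word-count sketch showing MapReduce's map -> sort/shuffle -> reduce phases.
# Plain Python for illustration; not the Hadoop MapReduce API.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit (key, value) pairs, here (word, 1)
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Sort/shuffle: bring identical keys together; Reduce: sum each key's values
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

lines = ["big data needs big storage", "big data moves fast"]
print(dict(reduce_phase(map_phase(lines))))
# {'big': 3, 'data': 2, 'fast': 1, 'moves': 1, 'needs': 1, 'storage': 1}
```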
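
The Spark card describes one engine for both batch and streaming. A minimal PySpark sketch of that idea is below; the input paths, Kafka address, and topic name are placeholders I have assumed, and the streaming half additionally requires the spark-sql-kafka connector package on the classpath.

```python
# Minimal PySpark sketch: the same engine and DataFrame API handle batch and streaming.
# Paths, Kafka address, and topic name are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("modern-data-sketch").getOrCreate()

# Batch: read raw JSON, aggregate, write a curated (Gold-zone style) Parquet table.
events = spark.read.json("datalake/raw/events/")
events.groupBy("page").count().write.mode("overwrite").parquet("datalake/gold/page_counts/")

# Streaming: the same DataFrame API over a Kafka topic, processed continuously.
# (Requires the spark-sql-kafka connector package.)
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "clicks")
    .load()
)
query = (
    stream.selectExpr("CAST(value AS STRING) AS value")
    .writeStream.format("console")
    .start()
)
query.awaitTermination()
```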
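
For the workflow scheduler card, a tiny Airflow DAG shows the idea: ordered tasks run on a schedule. The DAG id, schedule, and echo commands are made up for illustration, and the sketch assumes Airflow 2.4+ where the `schedule` argument is available.

```python
# Minimal Airflow DAG sketch: a scheduler automates ordered data-processing tasks.
# DAG id, schedule, and commands are illustrative assumptions (Airflow 2.4+ syntax).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_ingest_and_transform",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",       # run once per day
    catchup=False,
) as dag:
    ingest = BashOperator(task_id="ingest", bash_command="echo 'ingest raw data'")
    transform = BashOperator(task_id="transform", bash_command="echo 'build silver tables'")
    serve = BashOperator(task_id="serve", bash_command="echo 'refresh gold tables'")

    # Dependencies mirror the Data Value Chain: ingest -> transform -> serve
    ingest >> transform >> serve
```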
Created by: user-2005475