K10 - Combining Data
Evaluates the benefits and risks inherent in combining data
| Term | Definition |
|---|---|
| What is meant by combining data? | It is the practice of combining data from more than one source or data set into a single usable data set, for example for analysis or hypothesis testing. |
| What other terms are used to describe combining data? | Data merging, joining, and integration are other terms used to describe processes for combining data. |
| What is meant by ETL? | Extract, Transform, Load. It is a process used to combine data from different sources. Extract focuses on getting the data out of the source systems, Transform focuses on cleaning and standardising the data, and Load focuses on loading the data into the target location. |
| What is middleware software? | Software that sits between systems and automatically moves and transforms data between them. |
| What is Fuzzy Matching? | This is the process of finding records that are similar but not exactly identical. The Levenshtein Distance is a helpful metric for measuring this. |
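| Levenshtein Distance. | This is a mathematical way to measure how different two words are. It counts the minimum number of single-character changes (insertions, deletions, or substitutions) needed to turn one word into another. For example, "kitten" to "sitting" has a distance of 3. |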
| Probabilistic Matching. | Instead of just counting edits, this uses statistics to decide whether two records are the same. It calculates the probability that two records belong to the same entity based on multiple fields. It suits messy real-world data. |
| What are some of the benefits of combining data? | Improved decision making; Better data quality and accuracy; Enhanced insights and analysis; Elimination of data silos; Efficiency and automation; Real-time monitoring and reporting; Support for advanced analytics. |
| What are the risks in combining data? | Data quality issues; Privacy and compliance risks (e.g. GDPR); Data integration complexity; Inaccurate matches; Security risks; Increased storage and processing costs; Data governance challenges. |
| What are some of the best practices for combining data? | Data profiling; Standardisation; Metadata management; Data governance framework; Iterative testing; Monitoring and auditing. |
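
The Levenshtein Distance and Fuzzy Matching cards above can be illustrated with a short sketch. This is a minimal example only, not a production matcher: the function names `levenshtein` and `best_match`, the sample names, and the `max_distance=2` threshold are all assumptions made for illustration. It standardises text first (trimming and lower-casing), then picks the closest candidate by edit distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the first i-1 chars of a and first j chars of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning i chars of a into an empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def best_match(name, candidates, max_distance=2):
    """Hypothetical helper: return the candidate closest to name,
    or None if nothing is within max_distance edits (assumed threshold)."""
    if not candidates:
        return None
    key = name.strip().lower()  # standardise before comparing
    scored = [(levenshtein(key, c.strip().lower()), c) for c in candidates]
    distance, candidate = min(scored)
    return candidate if distance <= max_distance else None


# Example usage with made-up records:
# levenshtein("kitten", "sitting")                      -> 3
# best_match("Jon Smith", ["John Smith", "Bob Jones"])  -> "John Smith"
```

A fixed edit-distance threshold is the simplest possible rule and can produce inaccurate matches on short strings; probabilistic matching across several fields is usually the safer choice for messy real-world data.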