Data Science
Data Science 435
| Term | Definition |
|---|---|
| Clustering | attempts to group individuals in a population by their similarity, without being driven by any specific purpose, e.g. “Do our customers form natural groups or segments?” |
| Classification and class probability estimation | attempt to predict, for each individual in a population, which of a (small) set of classes that individual belongs to, e.g. classifying emails as Spam or Legitimate. |
| Scoring or class probability estimation | A scoring model applied to an individual produces, instead of a class prediction, a score representing the probability (or some other quantification of likelihood) that the individual belongs to each class. |
| Regression (“value estimation”) | attempts to estimate or predict, for each individual, the numerical value of some variable for that individual; used to find a function that models the data with the least error |
| Similarity matching | attempts to identify similar individuals based on data known about them. Similarity matching can be used directly to find similar entities. |
| Co-occurrence grouping (aka frequent itemset mining, association rule learning, and market-basket analysis) | attempts to find associations between entities based on transactions involving them. An example co-occurrence question would be: What items are commonly purchased together? |
| Profiling (aka behavior description) | attempts to characterize the typical behavior of an individual, group, or population. An example profiling question: “What is the typical cell phone usage of this customer segment?” Often used to establish behavioral norms for anomaly detection. |
| Link prediction | attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly also estimating the strength of the link |
| Data reduction | attempts to take a large set of data and replace it with a smaller set of data that contains much of the important information in the larger set |
| Causal modeling | attempts to help us understand what events or actions actually influence others |
| Data Science | Set of fundamental principles that provide guidelines for extracting knowledge from data |
| Data Mining | The extracting of knowledge from data using different technologies, processes, and algorithms |
| Data set | A file that contains data arranged in a meaningful format |
| Database | A repository of data that is arranged in a meaningful structure |
| DBMS | A database management system is a system that provides the ability to perform different database operations |
| Data Warehouse | A DB system that is equipped for performing analytical tasks; it stores historical data and contains current and master data |
| CRISP-DM | Cross-Industry Standard Process for Data Mining |
| CRISP-DM Phases | Business understanding, Data understanding, Data preparation, Modeling, Evaluation, Deployment |
| Business understanding | Understanding the end business goal that the data mining techniques should support |
| Data understanding | Identifying the source of the data as well as any information necessary to interpret the results |
| Data preparation | The data should be prepared for data mining by ensuring that it is high quality (entered properly, missing values handled strategically, etc.) and that it can be processed by the desired data mining algorithm |
| Modeling | Employs data mining algorithms to glean insights from the data; for example, classification models output the expected class of an object (e.g. responder or nonresponder) |
| Evaluation | The output of the model is evaluated to see if the model is sound or if it might be improved |
| Deployment | After the results have been validated, they can be safely deployed by returning to the goal set in the business understanding phase |
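
To make the co-occurrence grouping card concrete, here is a minimal sketch of the question “What items are commonly purchased together?” It simply counts how often each pair of items shares a transaction. The function name `pair_counts` and the sample baskets are made up for illustration; real market-basket analysis usually also computes measures such as support and confidence, which this sketch leaves out.

```python
from collections import Counter
from itertools import combinations


def pair_counts(transactions):
    """Count how many transactions contain each unordered pair of items."""
    counts = Counter()
    for basket in transactions:
        # Deduplicate and sort so each pair has one canonical ordering.
        items = sorted(set(basket))
        for pair in combinations(items, 2):
            counts[pair] += 1
    return counts


# Hypothetical transactions, invented for this example.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "eggs"],
    ["bread", "milk", "eggs"],
]

counts = pair_counts(transactions)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)  # ('bread', 'milk') 3
```

Counting pairs is the simplest form of the idea; the same loop extends to larger itemsets by changing the `2` passed to `combinations`, at the cost of many more candidate groups.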