4315 Exam 1
| Question | Answer |
|---|---|
| ________ comprises facts, observations, or perceptions | Data |
| A ________ is a statement of some element of truth about a subject matter or a domain | Fact |
| Alone, _______ represents raw numbers or assertions, and may therefore be devoid of context, meaning, or intent. However, it can easily be captured, stored, and communicated using electronic or other media. | Data |
| ________ is processed data that is in a form that is useful for making decisions | Information |
| _______ typically involves the manipulation of raw _______ to obtain a more meaningful indication of trends or patterns in the data | Information/data |
| Whether certain facts are considered ________ or only _______ depends on the individual who is using those facts | information/data |
| The problem with too much ______ is that it offers no judgement and no basis for action | data |
| The practical pursuit of computerized information retrieval began in the late _______ | 1940s |
| A great increase in the production of specific literature coupled with the availability of computers led to interest in automatic ___________ | document retrieval |
| ____________ is finding material (usually documents) of an unstructured nature (usually text) that satisfy an information need from within large collections (usually stored on computers) | Information Retrieval (IR) |
| The term "___________" refers to data which does not have clear, semantically overt, easy-for-a-computer structure. It is the opposite of ___________, the canonical example of which is a relational database | unstructured data/structured data |
| IR is also used to facilitate "__________" searches such as finding a document where the Title contains Java and the Body contains threading | semi-structured |
| In the traditional model used in the field of information retrieval, information is organized into __________, and it is assumed that there is a large number of them | documents |
| Data contained in documents is __________ (without any associated schema) | unstructured |
| Traditional examples of _________ systems are online library catalogs and online document-management systems, such as those that store newspaper articles | information-retrieval |
| By documents, we mean whatever __________ we have decided to build a retrieval system over. They might be individual memos or chapters of a book | units |
| We will refer to the group of documents over which we perform retrieval as the ________. It is sometimes also referred to as a corpus | collection |
| Information retrieval has also played a critical role in making the ________ a productive and useful tool | Web |
| In the context of the Web, each _________ page is considered to be a document | HTML |
| Documents are associated with a set of _____________ | keywords |
| _________-based information retrieval can be used not only for retrieving textual data, but for retrieving other types of data (such as video and audio data) | Keyword |
| In ______ ______ retrieval, all the words in each document are considered to be keywords | full text |
| _________ is the task of coming up with a good grouping of the documents based on their contents | Clustering |
| Does clustering or classification have an unknown number of final groupings? | Clustering |
| Does clustering or classification have a known number of classes? | Classification |
| _________ is the task of deciding to which class(es), if any, each of a set of documents belongs | Classification |
| Solutions for text information retrieval are generally not effective for _____, ______, or ______ information retrieval, unless the media object is associated with a textual description | image, audio, or video |
| The dominant mode of text search is by its _________ in order to satisfy an information need | content |
| In the most common mode of searching for text, the information need is represented by a ______, and the user may issue several _______ in pursuit of one information need | query/queries |
| Our goal is to develop a system to address the ________ task (the most standard IR task) | ad hoc retrieval |
| In the ___________ task, a system aims to provide documents from within the collection that are relevant to an arbitrary user information need, communicated to the system by means of a one-off, user initiated query | ad hoc retrieval |
| ________: Topic which the user desires to find | Information need |
| ______: What the user conveys to the computer in an attempt to communicate the information need. May be incrementally developed until a user obtains desired results | Query |
| Query vs. information need. __________: Is drinking red wine effective at reducing the risk of a heart attack? __________: "red" and "wine" and "heart" and "attack" | Information need; query |
| The primary challenge in information retrieval is the ________ between the language of the user and the language of the author | difference/mismatch |
| A(n) _________ is a system that ingests information, transforms it into a searchable format, and provides an interface for a user to search and retrieve information. This includes both hardware and software | IR system |
| The overall goal of a(n) _____________ is to provide the information needed to satisfy the user's question while minimizing the user overhead in locating the information value | information retrieval system |
| Information retrieval system architecture can be segmented into four major processing subsystems: ______, ______, ______, and ______ | ingest, index, search, and display |
| Which of the four IR processing subsystems is concerned with the acquisition and initial normalization and processing of the source items? | Ingest |
| Which of the four IR processing subsystems is concerned with taking the normalized item's processing tokens and metadata and creating the searchable index from it? | Index |
| Which of the four IR processing subsystems is concerned with mapping the user search information need into a form that can be processed as defined by the searchable index and determining which items are to be returned to the user? | Search |
| Which of the four IR processing subsystems is concerned with how the user can locate the items of interest in all of the possible results returned? | Display |
| IR systems have much in common with _______ systems: documents are stored in a repository, and an index is maintained; queries are evaluated using the index to identify matches, which are then returned to the user | database |
| The ______ in the field of information retrieval is different from that in database systems | emphasis |
| A document matches an information need if the user perceives it to be ______ | relevant |
| _________ return all matching records, while _________ return a fixed number of matches, which are ranked by their statistical similarity | Database systems/search engines |
| Updates are not as common in traditional _____ systems as in traditional ____ systems | IR/DB |
| The ____________ (also known as exact-match retrieval) was used by the earliest search engines | Boolean retrieval model |
| The Boolean retrieval model assumes that relevance is ______ (either relevant or not relevant) | binary |
| In the Boolean Retrieval Model, we can pose any query which is in the form of a Boolean expression of terms: terms which are combined with the operators ______, ______, ______ | AND, OR, NOT |
| The simplest form of document retrieval is for a computer to scan linearly through all the text of each document (commonly referred to as _______ through text) | grepping |
| Avoid linearly scanning texts for each query by indexing the documents in advance by building an ______ matrix | incidence |
| The structure of an incidence matrix includes: ______, ______, and a _______ | rows, columns, and a matrix element (i,j) |
| In an incidence matrix, do rows or columns correspond to words that appear in the collection (sorted alphabetically)? | Rows |
| In an incidence matrix, do rows or columns correspond to documents that appear in the collection? | Columns |
| In an incidence matrix, the matrix element (i,j) is ______ if the document in column j contains the word in row i, and is ______ otherwise | true/false |
| __________ (or inverted file): Central concept in information retrieval | Inverted index |
| A(n) ____________ consists of two major components: The search structure (the dictionary) and a set of inverted lists | inverted index |
| The idea of a(n) __________ is that the lists contain the IDs of the documents that contain the corresponding vocabulary term | inverted index |
| Each item in an inverted list is conventionally called a _______; each one contains a document ID and the number of times the term appears in that document | posting |
| Steps in building a(n) ____________: 1) collect the documents to be indexed 2) tokenize the text, turning each document into a list of tokens 3) linguistic preprocessing 4) Index the documents | Inverted Index |
| ______________: Produces, for each document, a list of normalized tokens (e.g., "jumps" and "jump" are normalized to the same form), which are the indexing terms | Linguistic preprocessing |
| When building inverted indices, sort the list so that the terms are ________ | alphabetical |
| To build an index, it is necessary to determine what the document ______ for indexing is | unit |
| Given a character sequence and a defined document unit: ______ is the job of chopping it up into pieces, called tokens | tokenization |
| ______ Processing: Deals with building equivalence classes of tokens | Linguistic |
| A ______ is an instance of a character sequence in some particular document | token |
| A ______ is the class of all tokens containing the same character sequence | type |
| A ______ is a type that is indexed in the IR system's dictionary | term |
| Linguistic processing example: "A rose is a rose is a rose". How many tokens: ____; how many types: ____; how many terms: ____ | Tokens: 8 (a, rose, is, a, rose, is, a, rose); types: 3 (a, rose, is); terms: 2 (rose, be: the term "a" is too common to be indexed, and the base form of "is" is stored in the index) |
| When using the ______ approach during tokenization: chop on whitespace and throw away punctuation characters | simple |
| Issues of tokenization are ______ specific | language |
| ______ ______: Identifying language of a document by analyzing character subsequences | Language Identification |
| A common strategy is to do _____ _____ by reducing all letters to lower case during tokenization | case folding |
| The two most frequent words in English (the, of) account for about _____% of all word occurrences | 10 |
| The most frequent six words account for ____% of word occurrences | 20 |
| The most frequent fifty words account for ____% of all text | 40 |
| Some extremely common and semantically non-selective words are excluded from the dictionary entirely; these are called ______ words (the, is, at, a, of,...) | stop |
| The general strategy for determining a list of stop words (stop list) is to sort the terms by their _______, and then to take the most frequent terms as a stop list | frequency |
| The first step in identification of a processing token consists of determining a ______ | word |
| Systems determine words by dividing input symbols into three classes: __________ symbols (alphabetic characters and numbers), __________ symbols (blanks, commas, and semicolons), and ____________ symbols (set of rules are needed to determine action) | valid word, inter-word, special processing |
| A _____ is defined as a continuous set of word symbols bounded by inter-word symbols | word |
| The second step in defining processing tokens is identification of any specific word _______ (determine if upper case letters should be preserved, numbers, etc.) | characteristics |
| Once the potential list of processing tokens has been defined, some can be removed by a _____ List | Stop |
| ______: Crude heuristic process that chops off the ends of words in the hope of generating the root form correctly most of the time | Stemming |
| ______: Using a dictionary and morphological analysis of words aiming to return the base or dictionary form of the word | Lemmatization |
| ____________: The most common algorithm for stemming English | Porter's Algorithm |
| ______ require less knowledge than a ______, which needs a complete vocabulary and morphological analysis to accurately lemmatize words | stemmers, lemmatizer |
| If the token is "saw", _____ may just return "s" or "saw", whereas ______ would try to identify whether the token is a noun or a verb and return "see" if it is the latter | stemming, lemmatization |
| If we would like to search for "Stanford University" so that it does not match a sentence such as "The inventor Stanford Ovshinsky never went to university", we would need to use a _____ ______ | phrase query |
| One approach to handling phrases is to consider every pair of consecutive terms in a document as a dictionary term. This approach relies on (______) ______ ______ | (extended) biword indexes |
| Using ______ indexes, the postings for each term in the dictionary are stored in the form docID: <position1, position2, ...> | positional |
| ______ index: A special index for general wildcard queries that adds '$' to the end of a term and then constructs the index in which various rotations of each term augmented with '$' all link to the original vocabulary term | Permuterm |
| The disadvantage of using the ______ index: Dictionary becomes quite large since it must include all rotations of each term | Permuterm |
| Two basic principles underlying most spelling correction algorithms: 1) Of various alternative correct spellings for a misspelled query, choose the ______ one 2) When two correctly spelled queries are tied, select the one that is more _____ | nearest, common |
| Given two character strings, s1 and s2, the ____ distance between them is the minimum number of editing operations required to transform s1 into s2 | edit |
| Edit operations (Levenshtein distance): 1) ______ a character into a string 2) ______ a character from a string 3) ______ a character in a string with another character | insert, delete, replace |
| A _______ is a sequence of k characters | k-gram |
| In a _______ index, the vocabulary terms that have enough k-grams in common with the query are retrieved | k-gram |
| _______ correction: corrections of misspellings that arise because the user types a query that sounds like the target term | Phonetic |
| _________ algorithm: Converts terms into a reduced 4-character form | Soundex |
| Improvements to the Soundex algorithm: 1) eliminate _______ at the beginning of a term 2) treat _____ and _____ differently 3) add rules for ______ letters | duplicates; H and W; silent |
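
Several of the cards above (simple tokenization, case folding, stop words, postings, the inverted index, and Boolean AND queries) fit together into one small pipeline. The following is a minimal sketch of that pipeline, not a reference implementation: the toy collection `DOCS`, the stop list `STOP_WORDS`, and all function names are hypothetical and chosen only for illustration. No stemming or lemmatization is applied.

```python
import string
from collections import Counter, defaultdict

# Hypothetical toy collection and stop list, used only for illustration.
DOCS = {
    1: "A rose is a rose is a rose.",
    2: "The rose garden is at Stanford University.",
    3: "Stanford Ovshinsky never went to university.",
}
STOP_WORDS = {"a", "at", "is", "of", "the", "to"}


def tokenize(text):
    """Simple approach: chop on whitespace, strip punctuation, case-fold."""
    tokens = []
    for raw in text.split():
        token = raw.strip(string.punctuation).lower()
        if token:
            tokens.append(token)
    return tokens


def build_inverted_index(docs):
    """Map each term to a postings list of (doc_id, term_frequency).

    Documents are processed in increasing doc_id order, so every
    postings list ends up sorted by doc_id.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        counts = Counter(
            t for t in tokenize(docs[doc_id]) if t not in STOP_WORDS
        )
        for term, tf in counts.items():
            index[term].append((doc_id, tf))
    # Keep the dictionary in alphabetical order of terms.
    return dict(sorted(index.items()))


def boolean_and(index, term1, term2):
    """Return doc IDs containing both terms by merging two sorted lists."""
    p1 = [doc_id for doc_id, _ in index.get(term1.lower(), [])]
    p2 = [doc_id for doc_id, _ in index.get(term2.lower(), [])]
    result = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


INDEX = build_inverted_index(DOCS)

if __name__ == "__main__":
    print(INDEX["rose"])
    print(boolean_and(INDEX, "Stanford", "University"))
```

In this toy collection the query Stanford AND University matches documents 2 and 3, even though document 3 only mentions Stanford Ovshinsky. That is the situation in which a phrase query (via a biword or positional index) is needed rather than a plain Boolean AND.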