4315 Exam 1

Question Answer
________ comprises facts, observations, or perceptions Data
A ________ is a statement of some element of truth about a subject matter or a domain Fact
Alone, _______ represents raw numbers or assertions, and may therefore be devoid of context, meaning, or intent. However, it can easily be captured, stored, and communicated using electronic or other media. Data
________ is processed data that is in a form that is useful for making decisions Information
_______ typically involves the manipulation of raw _______ to obtain a more meaningful indication of trends or patterns in the data Information/data
Whether certain facts are considered ________ or only _______ depends on the individual who is using those facts information/data
The problem with too much ______ is that it offers no judgement and no basis for action data
The practical pursuit of computerized information retrieval began in the late _______ 1940s
A great increase in the production of specific literature coupled with the availability of computers led to interest in automatic ___________ document retrieval
____________ is finding material (usually documents) of an unstructured nature (usually text) that satisfy an information need from within large collections (usually stored on computers) Information Retrieval (IR)
The term "___________" refers to data which does not have clear, semantically overt, easy-for-a-computer structure. It is the opposite of ___________, the canonical example of which is a relational database unstructured data/structured data
IR is also used to facilitate "__________" searches such as finding a document where the Title contains Java and the Body contains threading semi-structured
In the traditional model used in the field of information retrieval, information is organized into __________, and it is assumed that there is a large number of them documents
Data contained in documents is __________ (without any associated schema) unstructured
Traditional examples of _________ systems are online library catalogs and online document-management systems, such as those that store newspaper articles information-retrieval
By documents, we mean whatever __________ we have decided to build a retrieval system over. They might be individual memos or chapters of a book units
We will refer to the group of documents over which we perform retrieval as the ________. It is sometimes also referred to as a corpus collection
Information retrieval has also played a critical role in making the ________ a productive and useful tool Web
In the context of the Web, each _________ page is considered to be a document HTML
Documents are associated with a set of _____________ keywords
_________-based information retrieval can be used not only for retrieving textual data, but for retrieving other types of data (such as video and audio data) Keyword
In ______ ______ retrieval, all the words in each document are considered to be keywords full text
_________ is the task of coming up with a good grouping of the documents based on their contents Clustering
Does clustering or classification have an unknown number of final groupings? Clustering
Does clustering or classification have a known number of classes? Classification
_________ is the task of deciding to which class(es), if any, each of a set of documents belongs Classification
Solutions for text information retrieval are generally not effective for _____, ______, or ______ information retrieval, unless the media object is associated with a textual description image, audio, or video
The dominant mode of text search is by its _________ in order to satisfy an information need content
In the most common mode of searching for text, the information need is represented by a ______, and the user may issue several _______ in pursuit of one information need query/queries
Our goal is to develop a system to address the ________ task (the most standard IR task) ad hoc retrieval
In the ___________ task, a system aims to provide documents from within the collection that are relevant to an arbitrary user information need, communicated to the system by means of a one-off, user initiated query ad hoc retrieval
________: Topic which the user desires to find Information need
______: What the user conveys to the computer in an attempt to communicate the information need. May be incrementally developed until a user obtains desired results Query
Query vs. information need: __________: "Is drinking red wine effective at reducing the risk of a heart attack?" __________: "red" and "wine" and "heart" and "attack" Information need/query
The primary challenge in information retrieval is the ________ between the language of the user and the language of the author difference/mismatch
A(n) _________ is a system that ingests information, transforms it into a searchable format, and provides an interface for a user to search and retrieve information. This includes both hardware and software IR system
The overall goal of a(n) _____________ is to provide the information needed to satisfy the user's question while minimizing the user overhead in locating information of value information retrieval system
Information retrieval system architecture can be segmented into four major processing subsystems: ______, ______, ______, and ______ ingest, index, search, and display
Which of the four IR processing subsystems is concerned with the acquisition and initial normalization and processing of the source items? Ingest
Which of the four IR processing subsystems is concerned with taking the normalized item's processing tokens and metadata and creating the searchable index from it? Index
Which of the four IR processing subsystems is concerned with mapping the user search information need into a form that can be processed as defined by the searchable index and determining which items are to be returned to the user? Search
Which of the four IR processing subsystems is concerned with how the user can locate the items of interest in all of the possible results returned? Display
IR systems have much in common with _______ systems: documents are stored in a repository, and an index is maintained; queries are evaluated using the index to identify matches, which are then returned to the user database
The ______ in the field of information retrieval is different from that in database systems emphasis
A document matches an information need if the user perceives it to be ______ relevant
_________ return all matching records, while _________ return a fixed number of matches, which are ranked by their statistical similarity Database systems/search engines
Updates are not as common in traditional _____ systems as in traditional ____ systems IR/DB
The ____________ (also known as exact-match retrieval) was used by the earliest search engines Boolean retrieval model
The Boolean retrieval model assumes that relevance is ______ (either relevant or not relevant) binary
In the Boolean Retrieval Model, we can pose any query which is in the form of a Boolean expression of terms: terms which are combined with the operators ______, ______, ______ AND, OR, NOT
The simplest form of document retrieval is for a computer to scan linearly through all the text of each document (commonly referred to as _______ through text) grepping
Avoid linearly scanning texts for each query by indexing the documents in advance by building an ______ matrix incidence
The structure of an incidence matrix includes: ______, ______, and a _______ rows, columns, and a matrix element (i,j)
In an incidence matrix, do rows or columns correspond to words that appear in the collection (sorted alphabetically)? Rows
In an incidence matrix, do rows or columns correspond to documents that appear in the collection? Columns
In an incidence matrix, the matrix element (i,j) is ______ if the document in column j contains the word in row i, and is ______ otherwise true/false
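The incidence-matrix cards above can be made concrete with a small sketch. The three toy documents, the `boolean_and_not` helper, and the whitespace tokenization are assumptions made for illustration; they are not taken from the course material.

```python
# Toy collection (hypothetical documents, used only for illustration).
docs = {
    "d1": "new home sales top forecasts",
    "d2": "home sales rise in july",
    "d3": "increase in home sales in july",
}

doc_ids = sorted(docs)
vocab = sorted({w for text in docs.values() for w in text.split()})

# Rows are terms (sorted alphabetically); columns are documents.
# Element (i, j) is True if the document in column j contains the word in row i.
matrix = {
    term: [term in docs[d].split() for d in doc_ids]
    for term in vocab
}

def boolean_and_not(term_a, term_b):
    """Return the doc IDs that match `term_a AND NOT term_b`."""
    row_a, row_b = matrix[term_a], matrix[term_b]
    return [d for d, a, b in zip(doc_ids, row_a, row_b) if a and not b]

print(boolean_and_not("sales", "july"))
```

Answering a Boolean query then reduces to combining rows element by element, so no document text has to be rescanned at query time.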
__________ (or inverted file): Central concept in information retrieval Inverted index
A(n) ____________ consists of two major components: The search structure (the dictionary) and a set of inverted lists inverted index
The idea of a(n) __________ is that the lists contain the IDs of the documents that contain the corresponding vocabulary term inverted index
Each item in an inverted list is conventionally called a _______; each one contains a document ID and the number of times the term appears in that document posting
Steps in building a(n) ____________: 1) collect the documents to be indexed 2) tokenize the text, turning each document into a list of tokens 3) linguistic preprocessing 4) Index the documents Inverted Index
______________ - Turns each document into a list of normalized tokens (e.g., "jumps" and "jump" are treated as the same token), which are the indexing terms Linguistic preprocessing
When building inverted indices, sort the list so that the terms are ________ alphabetical
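The four building steps and the alphabetical sort can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and lower-casing as the only preprocessing; the function names `build_inverted_index` and `intersect` and the toy documents are hypothetical.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each term to a postings list of (doc_id, term_frequency)."""
    index = defaultdict(list)
    for doc_id in sorted(docs):                 # step 1: the collected documents
        tokens = docs[doc_id].lower().split()   # steps 2-3: tokenize, then case-fold
        for term, tf in Counter(tokens).items():
            index[term].append((doc_id, tf))    # step 4: add one posting per document
    # Keep the dictionary sorted alphabetically by term.
    return dict(sorted(index.items()))

def intersect(index, term_a, term_b):
    """Return the doc IDs that contain both terms (Boolean AND)."""
    ids_a = {doc_id for doc_id, _ in index.get(term_a, [])}
    return [doc_id for doc_id, _ in index.get(term_b, []) if doc_id in ids_a]

docs = {1: "Jump jump run", 2: "run fast", 3: "jump fast fast"}
index = build_inverted_index(docs)
print(index)
print(intersect(index, "jump", "fast"))
```

Because documents are visited in increasing ID order, every postings list comes out sorted by document ID, which is what makes merging two lists cheap in a fuller implementation.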
To build an index, it is necessary to determine what the document ______ for indexing is unit
Given a character sequence and a defined document unit: ______ is the job of chopping it up into pieces, called tokens tokenization
______ Processing: Deals with building equivalence classes of tokens Linguistic
A ______ is an instance of a character sequence in some particular document token
A ______ is the class of all tokens containing the same character sequence type
A ______ is a type that is indexed in the IR system's dictionary term
Linguistic processing example: "A rose is a rose is a rose". How many tokens: ____ How many types: ____ How many terms: ____ Tokens: 8 (a, rose, is, a, rose, is, a, rose); types: 3 (a, rose, is); terms: 2 (rose, be; the term "a" is too common to be indexed, and the base form of "is" is stored in the index)
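The counts in the rose example can be reproduced with a few lines. The stop list `STOP_WORDS` and the lookup table `BASE_FORMS` are assumptions standing in for a real stop list and lemmatizer; case folding is applied first, since otherwise "A" and "a" would count as different types.

```python
text = "A rose is a rose is a rose"

# Tokens: every occurrence counts (case folding + whitespace tokenization).
tokens = text.lower().split()

# Types: distinct character sequences among the tokens.
types = set(tokens)

STOP_WORDS = {"a"}           # assumed stop list
BASE_FORMS = {"is": "be"}    # assumed lookup standing in for a lemmatizer

# Terms: the types that survive stop-word removal, stored in base form.
terms = {BASE_FORMS.get(t, t) for t in types if t not in STOP_WORDS}

print(len(tokens), len(types), len(terms))
```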
When using the ______ approach during tokenization: chop on whitespace and throw away punctuation characters simple
Issues of tokenization are ______ specific language
______ ______: Identifying language of a document by analyzing character subsequences Language Identification
A common strategy is to do _____ _____ by reducing all letters to lower case during tokenization case folding
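The simple tokenization approach combined with case folding might look like the sketch below. The helper name `simple_tokenize` is hypothetical, and stripping every ASCII punctuation character is an assumption about what "throw away punctuation" means.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)

def simple_tokenize(text):
    """Throw away punctuation, fold case, and chop on whitespace."""
    return text.translate(_STRIP_PUNCTUATION).lower().split()

print(simple_tokenize("Friends, Romans, Countrymen, lend me your ears!"))
print(simple_tokenize("O'Neill aren't"))
```

The second call shows why tokenization is language specific: dropping the apostrophe merges "O'Neill" into "oneill" and "aren't" into "arent", which may or may not be the desired tokens.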
The two most frequent words in English (the, of) account for about _____% of all word occurrences 10
The most frequent six words account for ____% of word occurrences 20
The most frequent fifty words account for ____% of all text 40
Some extremely common and semantically non-selective words are excluded from the dictionary entirely; these are called ______ words (the, is, at, a, of,...) stop
The general strategy for determining a list of stop words (stop list) is to sort the terms by their _______, and then to take the most frequent terms as a stop list frequency
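The frequency-based strategy for choosing a stop list can be sketched as follows. The function name `build_stop_list`, the toy documents, and the choice to count total occurrences across the collection are assumptions; real stop lists are usually reviewed by hand afterwards.

```python
from collections import Counter

def build_stop_list(documents, size):
    """Return the `size` most frequent terms across all documents."""
    counts = Counter(
        token for text in documents for token in text.lower().split()
    )
    return [term for term, _ in counts.most_common(size)]

documents = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "a cat and a dog",
]
print(build_stop_list(documents, 1))
```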
The first step in identification of a processing token consists of determining a ______ word
Systems determine words by dividing input symbols into three classes: __________ symbols (alphabetic characters and numbers), __________ symbols (blanks, commas, and semicolons), and ____________ symbols (a set of rules is needed to determine the action) valid word, inter-word, special processing
A _____ is defined as a continuous set of word symbols bounded by inter-word symbols word
The second step in defining processing tokens is identification of any specific word _______ (determine if upper case letters should be preserved, numbers, etc.) characteristics
Once the potential list of processing tokens has been defined, some can be removed by a _____ List Stop
______: Crude heuristic process that chops off the ends of words in the hope of generating the root form correctly most of the time Stemming
______: Using a dictionary and morphological analysis of words aiming to return the base or dictionary form of the word Lemmatization
____________: The most common algorithm for stemming English Porter's Algorithm
______ require less knowledge than a ______, which needs a complete vocabulary and morphological analysis to accurately lemmatize words stemmers, lemmatizer
If the token is "saw", _____ may return just "s" or "saw", whereas ______ would try to determine whether the token is a noun or a verb and return "see" if it is a verb stemming/lemmatization
If we would like to search for "Stanford University" so that it does not match a sentence such as "The inventor Stanford Ovshinsky never went to university", we would need to use a _____ ______ phrase query
One approach to handling phrases is to consider every pair of consecutive terms in a document as a dictionary term. This approach uses (______) ______ ______ (extended) biword indexes
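A biword index can be sketched by indexing every pair of consecutive tokens. The function name `build_biword_index`, the two toy documents, and the whitespace tokenization are assumptions made for illustration.

```python
from collections import defaultdict

def build_biword_index(docs):
    """Index every pair of consecutive tokens as a single dictionary term."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for first, second in zip(tokens, tokens[1:]):
            index[f"{first} {second}"].add(doc_id)
    return index

docs = {
    1: "Stanford University is in California",
    2: "The inventor Stanford Ovshinsky never went to university",
}
biwords = build_biword_index(docs)
print(biwords.get("stanford university"))
```

Document 2 contains both words but never next to each other, so it is not stored under the biword "stanford university".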
Using ______ indexes, the postings for each term in the dictionary are stored in the form docID: <position1, position2, ...> positional
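A positional index and a two-word phrase query over it might be sketched like this. The names `build_positional_index` and `phrase_query` are hypothetical, positions are zero-based token offsets, and only adjacent two-word phrases are handled.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions at which the term occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index

def phrase_query(index, first, second):
    """Return doc IDs where `second` occurs immediately after `first`."""
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return sorted(hits)

docs = {
    1: "Stanford University is in California",
    2: "The inventor Stanford Ovshinsky never went to university",
}
index = build_positional_index(docs)
print(phrase_query(index, "stanford", "university"))
```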
______ index: A special index for general wildcard queries that adds '$' to the end of a term and then constructs the index in which various rotations of each term augmented with '$' all link to the original vocabulary term Permuterm
The disadvantage of using the ______ index: Dictionary becomes quite large since it must include all rotations of each term Permuterm
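The rotation idea behind the permuterm index can be sketched as follows. The function names and the toy vocabulary are assumptions; the lookup scans all keys linearly for brevity, whereas a real system would run a prefix search over a sorted structure. Only queries with exactly one `*` are handled.

```python
def rotations(term):
    """All rotations of the term augmented with the end marker '$'."""
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

def build_permuterm_index(vocabulary):
    """Map every rotation back to its original vocabulary term."""
    return {rot: term for term in vocabulary for rot in rotations(term)}

def wildcard_lookup(index, query):
    """Answer a query with exactly one '*', such as 'm*n'."""
    prefix, suffix = query.split("*")
    key = suffix + "$" + prefix   # rotate the query so the '*' falls at the end
    return sorted({term for rot, term in index.items() if rot.startswith(key)})

vocabulary = ["man", "moon", "men", "mat", "noon"]
index = build_permuterm_index(vocabulary)
print(rotations("hello"))
print(wildcard_lookup(index, "m*n"))
```

A term of length n contributes n + 1 keys, which is the dictionary blow-up named as the disadvantage above.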
Two basic principles underlying most spelling correction algorithms: 1) Of various alternative correct spellings for a misspelled query, choose the ______ one 2) When two correctly spelled queries are tied, select the one that is more _____ nearest, common
Given two character strings, s1 and s2, the ____ distance between them is the minimum number of editing operations required to transform s1 into s2 edit
Edit operations (Levenshtein distance): 1) ______ a character into a string 2) ______ a character from a string 3) ______ a character in a string with another character insert, delete, replace
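The edit distance defined above is commonly computed with dynamic programming. The sketch below is one standard formulation in which every insert, delete, and replace costs 1; the function name `edit_distance` is an assumption.

```python
def edit_distance(s1, s2):
    """Minimum number of inserts, deletes, and replaces turning s1 into s2."""
    # prev[j] holds the distance between the current prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # replace c1 with c2 (free if equal)
            ))
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))
```

Only two rows of the table are kept at a time, so memory use grows with the length of `s2` rather than with the product of both lengths.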
A _______ is a sequence of k characters k-gram
In a _______ index, the vocabulary terms that have enough k-grams in common with the query are retrieved k-gram
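Candidate retrieval with a k-gram index can be sketched with bigrams. The function names, the toy vocabulary, and the threshold of two shared k-grams are assumptions; no boundary markers are added to the terms in this version.

```python
from collections import defaultdict

def kgrams(term, k=2):
    """The set of all length-k character sequences in the term."""
    return {term[i:i + k] for i in range(len(term) - k + 1)}

def build_kgram_index(vocabulary, k=2):
    """Map each k-gram to the vocabulary terms that contain it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return index

def candidates(index, query, k=2, min_shared=2):
    """Vocabulary terms sharing at least `min_shared` k-grams with the query."""
    counts = defaultdict(int)
    for gram in kgrams(query, k):
        for term in index.get(gram, ()):
            counts[term] += 1
    return sorted(t for t, n in counts.items() if n >= min_shared)

vocabulary = ["aboard", "border", "lord", "morbid", "ardent"]
index = build_kgram_index(vocabulary)
print(candidates(index, "bord"))
```

The surviving candidates would normally be ranked afterwards, for example by edit distance to the query.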
_______ correction: corrections of misspellings that arise because the user types a query that sounds like the target term Phonetic
_________ algorithm: Converts terms into a reduced 4-character form Soundex
Improvements to the Soundex algorithm: 1) Eliminate _______ at beginning of term 2) Treat _____ and _____ differently 3) Add rules for ______ letters duplicates, H/W, silent
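A basic Soundex encoder, without the improvements listed above, might be sketched as follows. It assumes the simple textbook variant in which vowels, H, W, and Y all map to 0; several published variants differ in how they treat H and W, so codes for some names may not match other implementations. The name `soundex` and the table `CODES` are hypothetical.

```python
# Letter groups and their digits; vowels plus H, W, and Y map to "0".
CODES = {}
for letters, digit in [
    ("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
    ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6"),
]:
    for ch in letters:
        CODES[ch] = digit

def soundex(term):
    """Reduce a term to its first letter plus three digits."""
    letters = [c for c in term.upper() if c in CODES]
    if not letters:
        return ""
    digits = [CODES[c] for c in letters]
    # Collapse runs of identical consecutive digits into one.
    collapsed = [digits[0]]
    for d in digits[1:]:
        if d != collapsed[-1]:
            collapsed.append(d)
    # Keep the first letter itself, drop zeros, pad with zeros to length 4.
    tail = [d for d in collapsed[1:] if d != "0"]
    return (letters[0] + "".join(tail) + "000")[:4]

print(soundex("Hermann"), soundex("Robert"), soundex("Rupert"))
```

Names that sound alike, such as "Robert" and "Rupert", end up with the same 4-character code, which is what makes the scheme usable for phonetic correction.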
Created by: aowens14
 

 


