LLM
| Question | Answer |
|---|---|
| 1. What is the main limitation of context-free word embeddings like Word2Vec and GloVe? A) Too computationally expensive B) Assign the same embedding to a word regardless of its context C) Require labeled data D) Can only embed rare words | B Assign the same embedding to a word regardless of its context |
| 2. In transformer models, the purpose of positional encoding is to: A) Reduce model size B) Encode the sequence order of tokens C) Improve attention span D) Remove redundant information | B Encode the sequence order of tokens (see the positional-encoding sketch after the table) |
| 3. Which of the following is true about self-attention? A) Compares queries from one sequence with keys from a different sequence B) Compares elements within the same sequence C) Ignores position information D) Used only in decoder blocks | B Compares elements within the same sequence (see the self-attention sketch after the table) |
| 4. Which task is BERT primarily designed for? A) Text generation B) Next-word prediction C) Masked Language Modeling D) Image captioning | C Masked Language Modeling |
| 5. What is the role of the query in a cross-attention mechanism? A) To retrieve tokens sequentially B) To query information from keys and values from another modality C) To predict the output labels D) To ignore modality differences | B To query information from keys and values from another modality |
| 6. In CLIP, the Image Encoder is most commonly based on which architecture? A) RNN B) ResNet or Vision Transformer (ViT) C) BERT D) CNN-LSTM hybrid | B ResNet or Vision Transformer (ViT) |
| 7. Which architecture uses masked self-attention to prevent tokens from seeing future tokens during training? A) Encoder Transformers B) Decoder Transformers C) CNNs D) Vision Transformers | B Decoder Transformers |
| 8. In multimodal retrieval, cosine similarity is used to: A) Translate text to image directly B) Compute similarity between image and text embeddings C) Generate embeddings from raw inputs D) Tokenize text efficiently | B Compute similarity between image and text embeddings (see the retrieval sketch after the table) |
| 9. Which type of tokenization is typically used in modern LLMs to handle rare words efficiently? A) Word-based tokenization B) Character-based tokenization C) Subword tokenization D) Punctuation-based tokenization | C Subword tokenization |
| 10. In Vision Transformers (ViT), the input image is first: A) Flattened into a 1D vector B) Split into patches and embedded C) Directly fed into a CNN D) Tokenized using text tokenizers | B Split into patches and embedded (see the patch-embedding sketch after the table) |
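
The sinusoidal scheme from the original Transformer paper is one common way to realize the positional encoding asked about in question 2. The sketch below is a minimal NumPy version; the function name, sequence length, and model dimension are illustrative choices, not taken from any particular library.

```python
# Minimal sketch of sinusoidal positional encoding (question 2).
# Shapes and names are illustrative assumptions.
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix encoding each token's position."""
    positions = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                        # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                          # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                     # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                     # odd dimensions: cosine
    return pe

# The encoding is added to the token embeddings so the model can tell positions apart.
token_embeddings = np.random.randn(8, 16)                     # 8 tokens, d_model = 16
inputs = token_embeddings + sinusoidal_positional_encoding(8, 16)
```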
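
For question 3, the defining property of self-attention is that queries, keys, and values are all projections of the same sequence, so every token is compared with every other token in that sequence. Below is a minimal single-head NumPy sketch; the weight matrices and shapes are assumptions made for illustration.

```python
# Minimal sketch of scaled dot-product self-attention (question 3):
# Q, K and V all come from the same input sequence.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); the same sequence produces queries, keys and values."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)        # every token compared with every other token
    weights = softmax(scores, axis=-1)     # attention weights over the same sequence
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))               # 5 tokens, d_model = 16
w_q, w_k, w_v = (rng.normal(size=(16, 16)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)     # (5, 16)
```

In cross-attention (question 5) the only structural change is that `q` comes from one sequence while `k` and `v` come from another, for example text queries attending to image features.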
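
Questions 6 and 8 come together in CLIP-style retrieval: an image encoder and a text encoder map their inputs into a shared embedding space, and cosine similarity ranks the candidates. In the sketch below the embeddings are random placeholders standing in for encoder outputs; the dimensions and names are assumptions.

```python
# Minimal sketch of retrieval with cosine similarity (questions 6 and 8).
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between two sets of embeddings."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

rng = np.random.default_rng(0)
image_embeddings = rng.normal(size=(3, 512))    # 3 images in an assumed 512-d shared space
text_embeddings = rng.normal(size=(5, 512))     # 5 candidate captions in the same space

sims = cosine_similarity(image_embeddings, text_embeddings)   # (3, 5) similarity matrix
best_caption_per_image = sims.argmax(axis=1)                  # retrieval: pick the closest text
```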
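
For question 10, a Vision Transformer first cuts the image into fixed-size patches, flattens each one, and linearly projects it to the model dimension before positional information is added. A minimal sketch, assuming a 224x224 RGB input and 16x16 patches:

```python
# Minimal sketch of the ViT input pipeline (question 10): split into patches, flatten, embed.
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """image: (H, W, C) -> (num_patches, patch*patch*C) flattened patches."""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    patches = image[:rows * patch, :cols * patch].reshape(rows, patch, cols, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4)            # (rows, cols, patch, patch, C)
    return patches.reshape(rows * cols, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))                         # a single RGB image
patches = patchify(image, patch=16)                       # (196, 768) for a 14x14 grid
projection = rng.normal(size=(768, 512))                  # stand-in for the learned patch embedding
patch_embeddings = patches @ projection                   # (196, 512) tokens fed to the transformer
```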