Web Search and Text Mining
Lecture 5

Outline
Review of VSM
More on LSI through SVD
Term relatedness
Probabilistic LSI

Vector Space Model: Pros
Automatic selection of index terms
Partial matching of queries and documents (dealing with the case where no document contains all search terms)
Ranking according to similarity score (dealing with large result sets)
Term weighting schemes (improves retrieval performance)
Various extensions: document clustering, relevance feedback (modifying query vector)
Geometric foundation

Problems with Lexical Semantics
Ambiguity and association in natural language
Polysemy: Words often have a multitude of meanings and different types of usage (more severe in very heterogeneous collections). The vector space model is unable to discriminate between different meanings of the same word.

Problems with Lexical Semantics
Synonymy: Different terms may have identical or a similar meaning (weaker: words indicating the same topic). No associations between words are made in the vector space representation.
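
The two problems above can be seen in a minimal sketch of vector space scoring. The four toy documents, the raw term-frequency weighting, and the helper names `tf_vector`, `cosine`, and `score` are all assumptions made for illustration; they are not part of the lecture.

```python
import math
from collections import Counter

def tf_vector(text):
    """Raw term-frequency vector (no stemming, no idf weighting)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy collection.
docs = {
    "d1": "car engine repair",
    "d2": "automobile engine repair",  # same topic as d1, different word
    "d3": "bank river water",          # "bank" as a river bank
    "d4": "bank loan money",           # "bank" as a financial institution
}
doc_vectors = {name: tf_vector(text) for name, text in docs.items()}

def score(query_text):
    """Score every document against the query by cosine similarity."""
    q = tf_vector(query_text)
    return {name: cosine(q, vec) for name, vec in doc_vectors.items()}

# Synonymy: d2 is about cars but shares no term with the query, so it scores 0.
scores_car = score("car")

# Polysemy: d3 and d4 use "bank" in different senses but score identically.
scores_bank = score("bank")

print(scores_car)
print(scores_bank)
```

In this sketch the query "car" cannot reach d2 at all, and the query "bank" cannot separate d3 from d4, which is the motivation for moving to a latent semantic space.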
Latent Semantic Indexing (LSI)
Perform a low-rank approximation of the document-term matrix (typical rank 100-300)
General idea:
Map documents (and terms) to a low-dimensional representation.
Design a mapping such that the low-dimensional space reflects semantic associations (latent semantic space).
Compute document similarity based on the inner product in this latent semantic space.

Goals of LSI
Similar terms map to similar locations in the low-dimensional space
Noise reduction by dimension reduction

Latent Semantic Analysis
Latent semantic space: illustrating example
(courtesy of Susan Dumais)

Performing the maps
Each row and column of A gets mapped into the k-dimensional LSI space, by the SVD.
Claim: this is not only the mapping with the best (Frobenius error) approximation to A, but in fact improves retrieval.
A query q is also mapped into this space, by $q_k = q^T U_k \Sigma_k^{-1}$.
Query NOT a sparse vector.

Empirical evidence
Experiments on TREC 1/2/3 (Dumais)
Lanczos SVD code (available on netlib) due to Berry used in these experiments
Dimensions: various values in the range 250-350 reported (under 200 reported unsatisfactory)
Running times of about one day on tens of thousands of docs (old data)
Generally expect recall to improve; what about precision?

Empirical evidence
Precision at or above median TREC precision
Top scorer on almost 20% of TREC topics
Slightly better on average than straight vector spaces
Effect of dimensionality:

Dimensions    Precision
250           0.367
300           0.371
346           0.374

What does LSI compute?
The dimensionality of a corpus is the number of distinct topics represented in it.
Under some modeling assumptions: if A has a rank-k approximation of low Frobenius error, then there are no more than k distinct topics in the corpus. ("Latent semantic indexing: A probabilistic analysis")

LSI has many other applications
In many settings in pattern recognition and retrieval, we have a feature-object matrix.
For text, the terms are features and the docs are objects.
Could be opinions and users
This matrix may be redundant in dimensionality.
Can work with a low-rank approximation.
If entries are missing (e.g., users' opinions), can recover them if the dimensionality is low.
Powerful general analytical technique
Close, principled analog to clustering methods.
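
To tie the pieces of the lecture together, here is a minimal NumPy sketch of LSI on a tiny feature-object matrix: a rank-k approximation via the SVD, its Frobenius error, and the query mapping $q_k = q^T U_k \Sigma_k^{-1}$. The 5x5 toy matrix, the choice k = 2, the orientation (rows are terms, columns are documents), and the helper names `fold_in` and `cosine` are assumptions for illustration. The lecture compares by inner product; cosine (the normalized inner product) is used here only to make the numbers easier to read.

```python
import numpy as np

# Hypothetical toy matrix A: rows = terms (features), columns = documents (objects).
terms = ["car", "automobile", "engine", "bank", "money"]
A = np.array([
    [1, 0, 1, 0, 0],  # car
    [0, 1, 1, 0, 0],  # automobile
    [1, 1, 1, 0, 0],  # engine
    [0, 0, 0, 1, 1],  # bank
    [0, 0, 0, 1, 1],  # money
], dtype=float)

k = 2  # number of latent dimensions kept (assumed; real collections use hundreds)

# Thin SVD: A = U diag(s) Vt, singular values in s sorted in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Rank-k approximation and its Frobenius error.
A_k = U_k @ np.diag(s_k) @ Vt_k
frob_error = np.linalg.norm(A - A_k, "fro")
# For the truncated SVD the error is the root of the sum of squared dropped singular values.
expected_error = np.sqrt(np.sum(s[k:] ** 2))

# Latent coordinates of the documents: row j of V_k represents document j.
doc_coords = Vt_k.T  # shape (n_docs, k)

def fold_in(q):
    """Map a term-space vector into the latent space: q^T U_k Sigma_k^{-1}."""
    return q @ U_k / s_k

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Query containing only the term "car".
q = np.zeros(len(terms))
q[terms.index("car")] = 1.0
q_k = fold_in(q)

# Document 1 contains "automobile" and "engine" but not "car".
sims = [cosine(q_k, d) for d in doc_coords]
print("Frobenius error:", frob_error)
print("latent-space similarities to query 'car':", np.round(sims, 3))
```

In this toy example document 1 never mentions "car", yet it ends up close to the query in the latent space because "car" and "automobile" both co-occur with "engine"; the two "bank"/"money" documents stay orthogonal to it. Folding an existing document column through `fold_in` reproduces its row in `doc_coords`, which is a quick sanity check on the mapping.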