Search and the Net at 2004 Trends, Challenges and Cutting-Edge Developments in Internet Search Services Michael Hunter Reference Librarian Hobart and William Smith Colleges for Rochester Regional Library Council Member Libraries Staff Sponsored by the Rochester Regional Library Council Supported by Library Services and Technology Act (LSTA) and/or
Regional Bibliographic Databases and Resources Sharing (RBDB) funds granted by the New York State Library 2003 For Today . State of the Net and its Users Search Industry Overview Recent Developments in Established Services New Services The Deep Web at 2004 Tracking the Living Web: Weblogs and RSS Cutting-edge Developments Trends and Challenges to Todays Search
Services The Internet and its Users at 2004 How large is the Web? What do you mean by the Web? The totality of all Web sites Sounds simple . BUT IS IT? UC Berkeleys How Much Information Project
http://www.sims.berkeley.edu/research/projects/how-much-info-2003/ internet.htm NOTE: 10 terabytes = total print collections of the Library of Congress Internet Use Worldwide Internet Use in the US http://www.pewinternet.org Internet Use in the US http://www.pewinternet.org Top Ten things our users do online
http://www.pewinternet.org ACTIVITY % E-mail 92 Use search eng. For specific question
88 Consumer info. 83 Get a map 79 Hobby info.
76 Top Ten things our users do online http://www.pewinternet.org Leisure info. 73 Weather 75
Get news 69 Instant message 67 Health/medical info. 66
Undergraduates and Search Engines Colaric, S. Instruction for Web Searching: An Empirical Study College and Research Libraries 64 (2) March 2003 p. 111-116 QUESTION %YES %NO %Dont Know
All work the same way 67 10 23 Engines look at all sites 64 18
18 Term(s) need to match index Gathers sites using a crawler 58 19 23
15 9 96 OR retrieves more than AND 62 18
20 The Internet Search Industry: Consolidation Performance Measures Popularity The Shrinking Search Industry Editorial control of search is shared among few Yahoo owns
AlltheWeb, Altavista, Inktomi, Overture (paid listings) Google MSN AskJeeves owns Teoma LookSmart owns Wisenut Gigablast NOTE: Ownership is different from database affiliation Google Database Affiliates
Google AOL Netscape Yahoo Openfind Database Freshness http://www.searchengineshowdown.com/stats/ freshness.shtml
Based on a series of 6 current topic searches Pages that are updated daily AND report that date on the page Queries submitted May 17, 2003 Database Freshness http://www.searchengineshowdown.com/stats/ freshness.shtml Most have some results indexed in the last few days The bulk of most of the databases is
about 1 month old Some pages may not have been reindexed for much longer Popularity: Searches per day self-reported data, as of 2/28/03 http://searchenginewatch.com/reports/article.php/2156461 SERVICE Google Overture Inktomi LookSmart FindWhat
AskJeeves AltaVista AlltheWeb Searches, in Millions 250 167 80 45 33 20 18
12 Recent Developments among Established Services Google Froogle Phonebook Wildcard Words Info: Synonym feature
Supplemental Index Search by location News Advanced Search and News Alerts ??? Froogle Locates information about products for sale online Gives URLs of sites offering the item Provides links to exact page in the site where you can make the purchase Froogle
Ranking follows normal Google ranking processes Paid placements always clearly marked Price range limits available Access at http://froogle.google.com or via Google Advanced Search Phonebook Command Search Searches US residential (rphonebook:) and business (bphonebook:) listings of Yahoo, MapQuest and other services rphonebook:
MUST INCLUDE Last name City and/or State MAY INCLUDE First name bphonebook: MUST INCLUDE Business name (min. 1 word) City and/or State
MAY INCLUDE Full Business name Wildcard Words Google offers a word-sized asterisk to function as a wildcard Stands for a whole word Cannot be used for part of a word
three * mice = 22,000 three bl* mice = 0 Wildcard Words Several * can be used together milosevic International * * Hague Retrieves military tribunal OR military court OR war tribunal OR military tribunal info: Not exactly hidden, but not well-known Searches for any information Google has
about a site Convenient way to monitor linkage Typing a URL in the search box will give the same results Synonym Feature Place a tilde ~ immediately before a term to retrieve synonyms or related terms from the Google Index Eliminate the original term by placing a minus sign before it. ~hiking -hiking
Googles Supplemental Index For obscure or unusual searches Queried when Google fails to find good matches within its main web index. Live 9/9/03 Sample queries: St. Andrews United Methodist Church Homewood IL
nalanda residential junior college alumni illegal access error jdk 1.2b4 supercilious supernovas Search by Location (beta) http://labs.google.com/location U.S. only Keyword(s) combined with address, city, state or zip Search results appear on a map News Advanced Search and News Alerts
Advanced News Search added this Fall News Alerts Requires a (free) account One query per alert; limit of 50 alerts per email address
Alerts contain links to news containing your alert keywords Cannot edit a query; delete and create a new one instead Alerts sent once a day or as it happens More about Google. Google World http://indicateur.com Maintained by a French Search Engine Site and listed under Guides. Use Google translator (see Language Tools) to translate
the site) Google Lab http://labs.google.com Place for cutting edge developments, many in beta awaiting user feedback and testing. Beyond Google: AskJeeves Simpler, cleaner interface Teoma crawler-based results blended with AJ answers Improved image database
Smart Answers Popular queries mapped to news, image and other sources appropriate to the query ATW (FAST) http://alltheweb.com Continued commitment to a large database (2 nd to Google) Powerful, new advanced search capabilities Extensive page customization options Results clustered by topic (Folders)
Both HTML and Multimedia given, when available NOTE: Folders located at the BOTTOM of each results screen Altavista Simpler interface More language options Expanded image and multimedia collections Results labeledRefreshed in last 48 hours Includes PDF files US and Local search options Prisma query refinement
Altavista Prisma Query Refinement Offers a maximum of 12 terms having the strongest associations with the original query term(s) Selected from the top 50 results of the original query NOTE: Clicking on a Prisma term adds it to your original query, creating a new set of Prisma terms. Similar to Refine (1997) but less graphic
Teoma Ranking Includes a sites relationship to other sites with similar content Results Ranked database results, with Related Pages Refine Clustering of your results and other related sites based on term relationships and web community
linkages derived from your original results Resources Link Collections from experts and enthusiasts (Subject metasites) Hotbot Searches Hotbot (Inktomi) OR Google OR Lycos OR AskJeeves Not a true metaengine Advanced features operable only if
supported by source engines Metacrawler Along with Dogpile and Webcrawler, owned by Infospace Simpler interface Offers the following customizations: Selection of sources searched Total number of results retrieved Length of search (time-out period) Offers a wide range of vertical searches: Images, MP3, Shopping, Subject Directory, Multimedia, News, Message Boards
New Services Attracting Attention Gigablast Launched April, 2002 Smaller database than others Over 200 million on 10/4/03 pope canterbury Google:83,200
Gigablast:24,919 Created and maintained by Matt Wells (alone) Only search engine continuously updated with index refreshed in real time (Site submissions are immediately searchable) Ranking depends less on linkage than Googles ranking, to avoid penalizing newer pages. No advertising (to date) Gigablast Search Features
Basic search Full Boolean Advanced Search: Full Boolean and 2 (!) phrase boxes Limit by site Limit by domain (URL) Links to a page available Most generic html metatags indexed, searched and made available for display Unique to Gigablast!!! Gigablast Search Features
Field searches include title, IP address and non-html filetypes: PDF, Word, Excel, PPT, PostScript, Ascii Text Results from one site clustered Cached version available Results include date indexed and last modified (!!) Linking to Gigablast improves ranking there
KillerInfo http://www.killerinfo.com Metaengine searching Google, AOL, Lycos, Gigablast, MSN, Altavista, LookSmart and Open Directory 9 topical Deep Web channels offered Boolean and phrase search No other Advanced Search features Results clustering (a la Vivisimo) Number of results not given Adult content filter Surfwax
http://surfwax.com Demo site for federated search software Simultaneous search of Deep Web, Intranets, Web and more Metaengine searches Wisenut, AOL, MSN, Yahoo, Incarta, CNN, LookSmart FOCUS search refinement feature Online thesaurus of related terms and
definitions Surfwax http://surfwax.com Site SNAP of a result offers Author summary (from metatags) Related sites Sites FOCUS words
Key Points (query-related sections) Results ranking options: Relevance, Alpha and Source Preferences and Advanced Features require a (free) account; more options available to fee-based accounts Nutch http://nutch.org Project to implement an open source web search engine Why open source?
With open source, search results processing is transparent, not hidden. Bias (if any) can be examined by anyone. Open source applications are free and available for use, modification or for-profit use. Users are asked to contribute their innovations back to the code base Nutch is seeking volunteer developers and
donations The Deep Web at 2004 The Topography of the Internet or The Layers of the Web Mapping the web is challenging
Unregulated in nature Influences from all over the globe Fulfills many purposes, from personal to commercial Changes rapidly and unexpectedly Divisions and terminology are inherently ambiguous eg. Deep vs Invisible Web May I suggest a biological, nautical metaphor, perhaps the ocean?
SURFACE WEB SHALLOW WEB OPAQUE WEB DEEP WEB DARK WEB Surface Web Static html documents Crawler-accessible Shallow Web Static html documents loaded on servers that use ColdFusion or Lotus Domino or other
similar software A different URL for the same page is created each time it is served. Crawlers skip these to avoid multiple copies of the same page in their database Technically human accessible via directories, Deep Web gateways or links from other sites Opaque Web Static html documents Technically crawler accessible 2 types:
Downloaded and indexed by crawler Not downloaded or indexed by crawler Opaque Web Downloaded and indexed by crawler Buried in search results you never look at A casualty of relevance ranking
Not downloaded or indexed by crawler due to programmed download limits Document buried deep in the site Part of a large document that did not get downloaded (Typical crawl per page is 110 K or less) Document added since last crawler visit (Even
the best revisit on an average of every 2 weeks, depending on amount of change at a site) Opaque Web Access to the Opaque Web Specialized search engines General and specialized directories Subject metasites
These services typically index more thoroughly and more often than large, general search engines Deep Web Technically inaccessible to crawlers Dynamically created pages
Databases Non-textual files Password protected sites Sites prohibiting crawlers Technically accessible to crawlers Textual files in non-html formats Dark Web http://research.arbornetwords.com Up to 5% of the web is completely
unreachable due to Misconfigured routers Contractual disputes between ISPs Broadband users with personal or corporate firewalls US Military sites
UC Berkeleys How Much Information Project http://www.sims.berkeley.edu/research/projects/how-much-info-2003/ internet.htm NOTE: 10 terabytes = total print collections of the Library of Congress http://www.sims.berkeley.edu/research/projects/how-much-info-2003/ internet.htm Reducing the Deep Web:mod_rewrite Making dynamic pages available to crawlers Mod_rewrite software loaded onto a web server containing dynamic pages (databases, etc) Crawler follows a link to a stable URL on the server
www.mydomain.com/dvdplayers.html Mod_rewrite searches all the servers dynamic pages containing dvdplayers and creates temporary pages with stable URLs. These pages are linked to each other, creating a stream of virtual pages that can be crawled by any of the search engines Search engines often check the stream for spam or duplicate pages Mining the Deep Web:Directed Query Engines or Intelligent Agents Designed to access distributed Deep
Web resources Some can be configured to search specific URLs Databases Subject metasites report collections dynamic pages
online newsletters Directed Query Engines for purchase Simultaneous search of Deep Web and other resources with many additional features Lexibot http://www.lexibot.com If you complete survey: $189 upgrades $15 If you dont: $289 upgrades $50
BullsEye http://info.intelliseek.com BullsEye Pro: months $199 with free upgrades for 6 Hunters Maxim for the Deep Web Plan to first locate the category of information you want, then browse.
Dont be too specific in your searches. Cast a wide net. TRACKING THE LIVING WEB: WEBLOGS AND RSS FEEDS Blogs: What are they? Online diaries or journals, usually by one person, though many invite comments First developed in 1997 Within the same blog tone can range from
personal musings to discussion of recent issues in technology and research High link-to-word ratio Often link to other weblogs of similar content Blogs: What are they? Can contain rumor, inside information, speculation, blatant errors as well as
Breaking news: political and technical/research Commentary on new software or websites Consumer reaction to products or services Blog authoring tools are basic content management software, useful in ways other than online diaries Typify the spirit of information sharing that has fueled the Internet since its beginnings How large is the blogosphere? 2.4 to 2.9 million active blogs (est.)
Whos blogging? Jupiter Research 2% of Internet users have created a blog About 50% women, 50% men Over 50% are in English; remaining language, in order of prevalence: Portuguese, Polish, Farsi, French, Spanish, German, Italian, Dutch and Icelandic More
About 4% of Internet users read blogs, 60% men, 40% women On average, blogs are updated every 3 days About 4% of online Americans have gone to blogs for information about the Iraq War LiveJournal (large blog host) was the 650th most popular site on the Internet (May, 2003) 184,000 readers every 10 days Spend average of 22 minutes at the site
Creating a Blog Blogger http://new.blogger.com Free, automated Web publishing tool Requires no new software Send posts to an existing website or create a free blog at Blogger Provide a site template and where you want the postings to appear To update, create posting, submit permission form and Blogger will sent FTP Advanced options available Locating Blogs
Blog Hosting Sites www.livejournal.com diaryland.com radio.userland.com ($39.95 with added features) Blog metasites
Blogs and General Search Engines Blog-rich sites are increasingly visited by major crawler-based search services HOWEVER ANY rapidly-changing content can easily be missed by crawlers Obstacles to Crawling and Indexing Blog Content Only the most recent postings appear on the blog homepage (older are archived, and inaccessible to crawlers)
Many bloggers post dozens of times a day Frequent postings may contain critical information to time-sensitive topics Even a daily crawl would miss these postings (typical crawl is about once every 3 weeks) Obstacles to Crawling and Indexing Blog Content Page Design Several postings usually appear on the blog homepage Postings are NOT indexed separately, as crawler indexes the page as a whole
Retrieval of an individual posting on a topic is unreliable Blogs and Libraries Blogs can offer an opportunity to post content on the Web quicklyno delay of FTP uploading or submission to a webmaster
Whats New Favorite Books Recent Acquisitions Program Changes due to the Weather Blogs and Libraries Get more people involved in posting content on the Library (or library-sponsored) website No knowledge of html, RSS or XML needed Log onto the blog hosting website, create content, and update the page Current awareness without the annoyance of unwanted e-mails Choose when YOU want that information by
visiting your blogs of choice Blogs and Libraries: Metasites Blogs and Libraries: A Bibliography (online) http://www.etches-johnson.com/nolibrary/bib.html Library Weblog Directory http://www.libdex.com/weblogs.html
Blogs at the University of Minnesota Libraries http://www.lib.umn.edu/san/mt/ Fichter, D. (2003). Why and how to use blogs to promote your library's s ervices . Marketing Library Services 17(6). http://www.infotoday.com/mls/nov03/fichter.shtml
RSS Rich Site Summaries Really Simple Syndication Really Stops Spam Before RSS: Tracking latest news and site updates Software packages that monitored and reported changes at sites of your choosing News alert services, free and fee Manual checking of your bookmarks Hit or miss Listserv and Usenet postings
RSS: What is it? XML filetype with content that is Structured (tags, standard and/or authordefined) Re-useable (can be integrated into web, e-mail, multimedia and many other formats Originally developed by Netscape as a content management tool for personalizing
home pages My News My Sports My Weather RSS in detail http://blogs.law.harvard.edu/tech/rss RSS: What can it do? Creates a broadcast version of frequently updated content from a website, blog,
news page or other source Authors can Summarize new content Broadcast new content eg. online newsletters Can be used as a way to distribute content to subscribers (syndication) independent of e-mail. Subscribers logon or access via aggregators.
How do I access them? As RSS is in XML, may require downloading reader software (older versions of browsers cannot read XML). Sources for reader software include www.lights.com blogspace.com Sites with RSS feeds display a small icon (usually orange) labeled RSS or XML
General search engines (limited, but worth a try) filetype:xml keyword(s) RSS Directories and Search Engines Syndic8 syndic8.com Directory of available syndicated news feeds Provides no reading area Uses Open Directory classification
Feedster www.feedster.com The best search engine for blogs and RSS feeds Yahoo news.yahoo.com/rss Canadian Government tinyurl.com/vrh7 Often found in Blog Directories and Engines RSS aggregators Receive general or topical RSS feeds and blog postings
Many are focused on news only Present content in compact form Combine multiple sources in one interface Provide links to full content In personal desktop versions or online Personal desktop aggregators Lets you specify any feeds you want access to Ampheta Desk www.disobey.com/amphetadesk/ Radiouserland radio.userland.com ($$) Feedreader feedreader.com
Feedreader.com Online aggregators Selection of feeds may be limited NewsIsFree NewsIsFree.com 7379 sources grouped into 16 channels Create custom pages
$$ offers more Premium options Many RSS sites include links to other aggregators Authoring and Producing RSS Lockergnome rss.lockergnome.com Documents, tools, developers, aggregators, free feed generator for you site
RSS Primer for Publishers www.eevl.ac.uk/rss_primer/ Producing RSS feeds Technical information Feed promotion Feedster www.feedster.com Blogs and RSS
Blogs may offer some or all of their content as RSS feeds, or not Blogs can exist as pure html documents, updated frequently Making content available in RSS increases a blogs access and exposure via aggregators and other RSS-based search services The Living Web What can blogs and RSS feeds tell us about an authors point of view? Which ones does an author list on their
blog/homepage? Which ones does an author visit/subscribe to? Sometimes I want to know what the world thinks GOOGLE Sometimes I want to know what I think MY WEBLOG
Sometimes I want to know what those I respect think BLOGS AND FEEDS I READ Beyond todays (free) search engines: Cutting edge developments Including Context in System Design Context matters (!!??!)
Textual context Query context Who is asking and why? Traditional approaches to retrieval have been deductive Data organized and mapped to anticipated
query terms (controlled vocabularies, taxonomies) Human created and maintained Too slow for rapid data streams Bayesian approaches Uses statistical inference based on Bayes Theorem of Probability (Thomas Bayes, 1702-1761) Inductive approach (adaptive processing)
Take the users information environment Infer structures, relationships, likely queries Inferred structures and relationships can then be mapped to a human-created classification scheme Currently used in corporate intranet and feebased content management software Will be used more in general information systems of the future Adaptive Processing Learning the searchers interests What term(s) did you search? What did you select?
How long did you look at it? What is its source? How old was it? Direct input from searcher Rank the sources Rate individual results Eliminate certain sources, sites Inquirus
http://inquirus.nj.nec.com Query interface research project Attempts to improve precision of results Monitors users search behavior to infer intent of queries Re-formulates queries to increase likelihood of desired answers Inquirus http://inquirus.nj.nec.com USER: How do you make salsa? SYSTEM: salsa and (recipe or ingredients or food)
Eliminates pages on salsa dancing Ranking relies heavily on proximity of query terms and system-provided cognates to each other in the document Vector-Space Model 3-dimensional retrieval A way of ordering documents by word frequency/context in a term spaceand matching them to queries Documents are assigned coordinates One document may be in many term spacesor vectors
Queries that fall within a given vector are likely to be answered by documents located in that vector A Multi-dimensional Boolean Boolean limited to term matches terrier female puppy Vector-space model
More complex relationships can be mapped Degrees of relatedness of document to query Query and document weights based on length and direction of their vectors Documents in Vector Space What do you have on movie stars diets? STAR
Doc about movie stars Doc about astronomy Doc about mammal behavior DIET Phibot http://phibot.org Project of the Univ. of Mainz and German Institute of Artificial Intelligence Crawls science, medicine and news web sites
`200 million general science sites 70 million medical sites Traditional: Google-like processing Vector-Space Optimization: greater vector-space processing Digital Video Search Searches actual visual content Project of Dublin City University
http://www.cdvp.dcu.ie Determine structure of the video by identifying shots with the greatest degree of change (keyframes) Use these to create a structure, and allow user to refine query based on these Needed by journalists, governments and airport security Current Trends in and
Challenges to Todays Search Industry User Interface Trends Toolbars, Toolbars, everywhere Review site: searchenginewatch.com/links/article.php/2156381 Search by Location Major engines with local search options and local specialized ones
Makes the haystack smaller; important in e-commerce P2P networks (Peer-to-peer) File-sharing networks, a la Napster KaZaA - most popular download EVER! Shares any filetype 90% of files shared are audio-visual in nature User Interface Trends
Application Program Interface (API) Published set of programming hooksthat lets you interact directly with a companys open servers You can mine the companys databases for free WHY? To attract more traffic to the site
Example http://www.googlerace.com Enter 1 or 2 terms/phrases and see how Bush and Democratic candidates stack up! Created by Tara Calishain Search in Corporate Settings Drive Search Engine R&D Uniform, seamless access to all information:
Internal & external, data & content XML More natural language processing Hybrid systems to search structured AND unstructured data Adaptive processing (Bayesian) Use of intelligent agent software
Easier user interfaces Personalization Industry-wide Trends Distributed Crawling Volunteer your PC when not in use Grub.com, Looksmart Search continues to be driven by
advertising and revenue Fewer services maintain their own crawlercreated database Increased crawling of non-html filetypes Challenges to the Industry Revenue E-content providers have cut into search software sales with their proprietary engines Fighting fraud
Cloaking, ranking manipulation Scalability Size of surface Web increases Over 300 million queries a day to all Web S.E.s Challenges to the Industry Freshness
Competitive edge demands recent crawls Deep Web Embedded databases Non-html filetypes Real-time information Growing importance of the Living Web
Challenges to the Industry Ambiguous query refinement Not very hopeful among general search engines User group too large User profiling difficult
Indexing the smaller, newer sites Googles link-based PageRank penalizes these sites The Biggest Challenge: Just what are you looking for? A known needle in a known haystack A known needle in an unknown haystack An unknown needle in an unknown haystack Any needle in a haystack The sharpest needle in a haystack
Most of the sharpest needles in a haystack All the needles in a haystack The Biggest Challenge: Just what are you looking for? Affirmation of no needles in the haystack Things like needles in any haystack Let me know if any new needles show up Where are the haystacks? Needles, haystacks, .whatever Thank You and Happy Holidays!
Michael Hunter Reference Librarian Hobart and William Smith Colleges Geneva, NY 14456 (315) 781-3552 [email protected]
Chapter 18: Nine Days . Morning of Darnay and Manette wedding . Charles and Dr. Manette are behind closed doors (remember the promise from Ch. 10) Mr. Lorry and Miss Pross (who still wishes her brother Solomon was the groom)...
Tactical FM radios, such as SINCGARS, must be in sight of each other electronically to communicate Frequency Range From 1.6 To 60 Mhz In SSB And FM Modes Ability to interface with the Army's SINCGARS radios Multi-waveform High Speed Data...
Periglacial regions occupy over 20% of Earth's land surface (See Figure 17-18). The areas are either near permanent ice or are at high elevation, and have ground that is seasonally snow free. Under these conditions, a unique set of periglacial...
Big picture - school size, campus location, campus life offerings. School's ability to meet your financial need (through scholarship or financial aid - or both) College reputation. A feeling like you belong. Note: never make your decision based on the...
Verdana Arial Wingdings Calibri Eclipse 1_Eclipse Comparative Politics (CP) and major questions in the field This course's approach to CP This course's approach to CP My approach to course content Comparative Politics as a Social Science Introducing key concepts Inductive...
Per-Process Virtual Address Space. Each process has its own virtual address space. Process . X: text editor. Process . Y: video player. X. writing to its virtual address 0 does not affect the data stored in . Y 's virtual...
The Role of Risk Communication in Communicating with Veterans "Risk Communication after a deployment is a crucial component of the appropriate care and support for the service member upon his or her return"Persian Gulf Veterans Coordinating Board (1999) "Efforts at...
Ready to download the document? Go ahead and hit continue!