Search and the Net at 2004 Trends, Challenges

Search and the Net at 2004: Trends, Challenges and Cutting-Edge Developments in Internet Search Services
Michael Hunter, Reference Librarian, Hobart and William Smith Colleges
For Rochester Regional Library Council Member Libraries Staff
Sponsored by the Rochester Regional Library Council
Supported by Library Services and Technology Act (LSTA) and/or Regional Bibliographic Databases and Resources Sharing (RBDB) funds granted by the New York State Library, 2003

For Today...
  State of the Net and its Users
  Search Industry Overview
  Recent Developments in Established Services
  New Services
  The Deep Web at 2004
  Tracking the Living Web: Weblogs and RSS
  Cutting-edge Developments
  Trends and Challenges to Today's Search Services

The Internet and its Users at 2004

How large is the Web?
  What do you mean by "the Web"?
  The totality of all Web sites
  Sounds simple... BUT IS IT?

UC Berkeley's How Much Information Project (internet.htm)
  NOTE: 10 terabytes = total print collections of the Library of Congress

Internet Use Worldwide

Internet Use in the US

Top Ten things our users do online
  ACTIVITY                                    %
  E-mail                                      92
  Use search engine for specific question     88
  Consumer info.                              83
  Get a map                                   79
  Hobby info.                                 76
  Leisure info.                               73
  Weather                                     75
  Get news                                    69
  Instant message                             67
  Health/medical info.                        66

Undergraduates and Search Engines
Colaric, S. Instruction for Web Searching: An Empirical Study. College and Research Libraries 64 (2), March 2003, p. 111-116
  QUESTION                          %YES   %NO   %Don't Know
  All work the same way              67     10     23
  Engines look at all sites          64     18     18
  Term(s) need to match index        58     19     23
  Gathers sites using a crawler      15      9     96
  OR retrieves more than AND         62     18     20

The Internet Search Industry: Consolidation, Performance Measures, Popularity

The Shrinking Search Industry
  Editorial control of search is shared among few
  Yahoo owns AlltheWeb, Altavista, Inktomi, Overture (paid listings)
  Google
  MSN
  AskJeeves owns Teoma
  LookSmart owns Wisenut
  Gigablast
  NOTE: Ownership is different from database affiliation

Google Database Affiliates
  Google
  AOL
  Netscape
  Yahoo
  Openfind

Database Freshness (freshness.shtml)

  Based on a series of 6 current topic searches
  Pages that are updated daily AND report that date on the page
  Queries submitted May 17, 2003
  Most have some results indexed in the last few days
  The bulk of most of the databases is about 1 month old
  Some pages may not have been reindexed for much longer

Popularity: Searches per day (self-reported data, as of 2/28/03)
  SERVICE       Searches, in Millions
  Google        250
  Overture      167
  Inktomi        80
  LookSmart      45
  FindWhat       33
  AskJeeves      20
  AltaVista      18
  AlltheWeb      12

Recent Developments among Established Services

Google
  Froogle
  Phonebook
  Wildcard Words
  Info:
  Synonym feature
  Supplemental Index
  Search by location
  News Advanced Search and News Alerts
  ???

Froogle
  Locates information about products for sale online
  Gives URLs of sites offering the item
  Provides links to exact page in the site where you can make the purchase

  Ranking follows normal Google ranking processes
  Paid placements always clearly marked
  Price range limits available
  Access at or via Google Advanced Search

Phonebook Command Search
  Searches US residential (rphonebook:) and business (bphonebook:) listings of Yahoo, MapQuest and other services
  rphonebook:
    MUST INCLUDE: Last name; City and/or State
    MAY INCLUDE: First name
  bphonebook:
    MUST INCLUDE: Business name (min. 1 word); City and/or State
    MAY INCLUDE: Full Business name

Wildcard Words
  Google offers a word-sized asterisk to function as a wildcard
  Stands for a whole word
  Cannot be used for part of a word
    three * mice = 22,000
    three bl* mice = 0
  Several * can be used together
    milosevic International * * Hague
    Retrieves military tribunal OR military court OR war tribunal OR military tribunal

info:
  Not exactly hidden, but not well-known
  Searches for any information Google has about a site
  Convenient way to monitor linkage
  Typing a URL in the search box will give the same results

Synonym Feature
  Place a tilde ~ immediately before a term to retrieve synonyms or related terms from the Google Index
  Eliminate the original term by placing a minus sign before it: ~hiking -hiking

Google's Supplemental Index
  For obscure or unusual searches
  Queried when Google fails to find good matches within its main web index
  Live 9/9/03
  Sample queries:
    St. Andrews United Methodist Church Homewood IL
    nalanda residential junior college alumni
    illegal access error jdk 1.2b4
    supercilious supernovas

Search by Location (beta)
  U.S. only
  Keyword(s) combined with address, city, state or zip
  Search results appear on a map

News Advanced Search and News Alerts

  Advanced News Search added this Fall
  News Alerts
    Requires a (free) account
    One query per alert; limit of 50 alerts per email address
    Alerts contain links to news containing your alert keywords
    Cannot edit a query; delete and create a new one instead
    Alerts sent once a day or "as it happens"

More about Google...
  Google World: Maintained by a French Search Engine Site and listed under Guides. (Use Google translator (see Language Tools) to translate the site)
  Google Lab: Place for cutting edge developments, many in beta awaiting user feedback and testing

Beyond Google:

AskJeeves
  Simpler, cleaner interface
  Teoma crawler-based results blended with AJ answers
  Improved image database
  Smart Answers: Popular queries mapped to news, image and other sources appropriate to the query

ATW (FAST)
  Continued commitment to a large database (2nd to Google)
  Powerful, new advanced search capabilities
  Extensive page customization options
  Results clustered by topic (Folders)
  Both HTML and Multimedia given, when available
  NOTE: Folders located at the BOTTOM of each results screen

Altavista
  Simpler interface
  More language options
  Expanded image and multimedia collections
  Results labeled "Refreshed in last 48 hours"
  Includes PDF files
  US and Local search options
  Prisma query refinement

Altavista Prisma Query Refinement
  Offers a maximum of 12 terms having the strongest associations with the original query term(s)
  Selected from the top 50 results of the original query
  NOTE: Clicking on a Prisma term adds it to your original query, creating a new set of Prisma terms
  Similar to Refine (1997) but less graphic
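
The Prisma behavior described above (at most 12 suggested terms, drawn from the top 50 results, each click extending the query) can be illustrated with a rough sketch. The slides do not say how AltaVista measures "strongest associations"; the version below simply assumes that a term appearing in more of the top results is more strongly associated. The function names, the stopword list and the sample results are all hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not AltaVista's).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "is", "with"}


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def suggest_refinements(query, results, max_terms=12, top_n=50):
    """Return up to max_terms terms that occur in the most top-ranked results."""
    query_terms = set(tokenize(query))
    doc_freq = Counter()
    for text in results[:top_n]:
        # Count each term once per result (document frequency).
        terms = set(tokenize(text)) - query_terms - STOPWORDS
        doc_freq.update(terms)
    # Most frequent first; ties broken alphabetically for a stable order.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:max_terms]]


def refine(query, term):
    """Clicking a suggested term adds it to the original query."""
    return f"{query} {term}"


results = [
    "Jaguar cars for sale and jaguar dealers",
    "The jaguar is a big cat of the rainforest",
    "Jaguar cars: classic models and dealers",
]
suggestions = suggest_refinements("jaguar", results)
print(suggestions)
print(refine("jaguar", suggestions[0]))
```

Running the refined query and calling suggest_refinements again on its results would give the "new set of Prisma terms" the slide mentions.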

Teoma
  Ranking: Includes a site's relationship to other sites with similar content
  Results: Ranked database results, with Related Pages
  Refine: Clustering of your results and other related sites based on term relationships and web community linkages derived from your original results
  Resources: Link Collections from experts and enthusiasts (Subject metasites)

Hotbot
  Searches Hotbot (Inktomi) OR Google OR Lycos OR AskJeeves
  Not a true metaengine
  Advanced features operable only if supported by source engines

Metacrawler
  Along with Dogpile and Webcrawler, owned by Infospace
  Simpler interface
  Offers the following customizations:
    Selection of sources searched
    Total number of results retrieved
    Length of search (time-out period)
  Offers a wide range of vertical searches: Images, MP3, Shopping, Subject Directory, Multimedia, News, Message Boards
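
The three customizations listed for Metacrawler (which sources to search, how many results to return, how long to wait) map naturally onto the parameters of a metasearch routine. The sketch below is only an illustration under assumptions: the engine names and their stubbed result lists are hypothetical, and the round-robin merge is one simple choice, not a description of how Metacrawler actually combines results.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def metasearch(query, sources, selected, max_results=10, timeout_s=2.0):
    """Query the selected sources in parallel and merge their ranked URL lists."""
    chosen = [name for name in selected if name in sources]
    pool = ThreadPoolExecutor(max_workers=max(1, len(chosen)))
    futures = {name: pool.submit(sources[name], query) for name in chosen}
    deadline = time.monotonic() + timeout_s
    ranked_lists = []
    for name in chosen:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            ranked_lists.append(futures[name].result(timeout=remaining))
        except FutureTimeout:
            pass  # Source did not answer within the time-out period: skip it.
    pool.shutdown(wait=False)

    # Round-robin merge: take rank 1 from every source, then rank 2, and so on,
    # dropping URLs that an earlier source already contributed.
    merged, seen = [], set()
    depth = max((len(urls) for urls in ranked_lists), default=0)
    for rank in range(depth):
        for urls in ranked_lists:
            if rank < len(urls) and urls[rank] not in seen:
                seen.add(urls[rank])
                merged.append(urls[rank])
    return merged[:max_results]


# Hypothetical stand-ins for real engines; each returns a ranked list of URLs.
sources = {
    "engine_a": lambda q: ["u1", "u2", "u3"],
    "engine_b": lambda q: ["u2", "u4"],
    "engine_c": lambda q: ["u5"],
}
top = metasearch("deep web", sources, ["engine_a", "engine_b"], max_results=3)
print(top)
```

A source that raises an error would propagate here; a fuller version would probably catch that and treat it like a time-out.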

New Services Attracting Attention

Gigablast
  Launched April, 2002
  Smaller database than others: over 200 million on 10/4/03
  pope canterbury: Google: 83,200; Gigablast: 24,919
  Created and maintained by Matt Wells (alone)
  Only search engine continuously updated with index refreshed in real time (site submissions are immediately searchable)
  Ranking depends less on linkage than Google's ranking, to avoid penalizing newer pages
  No advertising (to date)

Gigablast Search Features
  Basic search: Full Boolean
  Advanced Search: Full Boolean and 2 (!) phrase boxes
  Limit by site
  Limit by domain (URL)
  Links to a page available
  Most generic html metatags indexed, searched and made available for display. Unique to Gigablast!!!
  Field searches include title, IP address and non-html filetypes: PDF, Word, Excel, PPT, PostScript, Ascii Text
  Results from one site clustered
  Cached version available
  Results include date indexed and last modified (!!)
  Linking to Gigablast improves ranking there

KillerInfo
  Metaengine searching Google, AOL, Lycos, Gigablast, MSN, Altavista, LookSmart and Open Directory
  9 topical Deep Web channels offered
  Boolean and phrase search
  No other Advanced Search features
  Results clustering (a la Vivisimo)
  Number of results not given
  Adult content filter

Surfwax
  Demo site for federated search software
  Simultaneous search of Deep Web, Intranets, Web and more
  Metaengine searches Wisenut, AOL, MSN, Yahoo, Incarta, CNN, LookSmart
  FOCUS search refinement feature: Online thesaurus of related terms and definitions
  Site SNAP of a result offers
    Author summary (from metatags)
    Related sites
    Site's FOCUS words
    Key Points (query-related sections)
  Results ranking options: Relevance, Alpha and Source
  Preferences and Advanced Features require a (free) account; more options available to fee-based accounts

Nutch
  Project to implement an open source web search engine
  Why open source?
    With open source, search results processing is transparent, not hidden. Bias (if any) can be examined by anyone.
    Open source applications are free and available for use, modification or for-profit use.
    Users are asked to contribute their innovations back to the code base
  Nutch is seeking volunteer developers and donations

The Deep Web at 2004

The Topography of the Internet, or The Layers of the Web

Mapping the web is challenging
  Unregulated in nature
  Influences from all over the globe
  Fulfills many purposes, from personal to commercial
  Changes rapidly and unexpectedly
  Divisions and terminology are inherently ambiguous, e.g. Deep vs Invisible Web

May I suggest a biological, nautical metaphor, perhaps the ocean?

SURFACE WEB
SHALLOW WEB
OPAQUE WEB
DEEP WEB
DARK WEB

Surface Web
  Static html documents
  Crawler-accessible

Shallow Web
  Static html documents loaded on servers that use ColdFusion or Lotus Domino or other similar software
  A different URL for the same page is created each time it is served
  Crawlers skip these to avoid multiple copies of the same page in their database
  Technically human accessible via directories, Deep Web gateways or links from other sites

Opaque Web
  Static html documents
  Technically crawler accessible
  2 types:
    Downloaded and indexed by crawler
    Not downloaded or indexed by crawler

Opaque Web: Downloaded and indexed by crawler
  Buried in search results you never look at
  A casualty of relevance ranking

Opaque Web: Not downloaded or indexed by crawler due to programmed download limits
  Document buried deep in the site
  Part of a large document that did not get downloaded (typical crawl per page is 110 K or less)
  Document added since last crawler visit (even the best revisit on an average of every 2 weeks, depending on amount of change at a site)

Access to the Opaque Web
  Specialized search engines
  General and specialized directories
  Subject metasites
  These services typically index more thoroughly and more often than large, general search engines

Deep Web
  Technically inaccessible to crawlers
    Dynamically created pages
    Databases
    Non-textual files
    Password protected sites
    Sites prohibiting crawlers
  Technically accessible to crawlers
    Textual files in non-html formats

Dark Web
  Up to 5% of the web is completely unreachable due to
    Misconfigured routers
    Contractual disputes between ISPs
    Broadband users with personal or corporate firewalls
    US Military sites
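
The Shallow Web problem described earlier in this section (the same page served under a different URL each time) can be made concrete with a small sketch. The slides say crawlers simply skip such pages; the code below shows an alternative idea, recognizing duplicates by stripping session-style parameters before comparing URLs. The parameter list is an assumption for illustration (CFID and CFTOKEN are ColdFusion session parameters; the others are generic guesses), and the URLs are made up.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters assumed to carry only per-visit session state (illustrative).
SESSION_PARAMS = {"cfid", "cftoken", "sessionid", "sid"}


def canonical_url(url):
    """Drop session parameters and normalize case so repeat visits compare equal."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    kept.sort()  # Parameter order should not create a "new" page.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def dedupe(urls):
    """Keep one canonical URL per page, in first-seen order."""
    seen, unique = set(), []
    for url in urls:
        key = canonical_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


visits = [
    "http://Example.org/page.cfm?id=7&CFID=123&CFTOKEN=abc",
    "http://example.org/page.cfm?CFID=456&CFTOKEN=xyz&id=7",
    "http://example.org/page.cfm?id=8",
]
print(dedupe(visits))
```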

UC Berkeley's How Much Information Project (internet.htm)
  NOTE: 10 terabytes = total print collections of the Library of Congress

Reducing the Deep Web: mod_rewrite
  Making dynamic pages available to crawlers
  Mod_rewrite software loaded onto a web server containing dynamic pages (databases, etc)
  Crawler follows a link to a stable URL on the server
  Mod_rewrite searches all the server's dynamic pages containing "dvdplayers" and creates temporary pages with stable URLs
  These pages are linked to each other, creating a stream of virtual pages that can be crawled by any of the search engines
  Search engines often check the stream for spam or duplicate pages

Mining the Deep Web: Directed Query Engines or Intelligent Agents
  Designed to access distributed Deep Web resources
  Some can be configured to search specific URLs
    Databases
    Subject metasites
    Report collections
    Dynamic pages
    Online newsletters

Directed Query Engines for purchase
  Simultaneous search of Deep Web and other resources with many additional features
  Lexibot
    If you complete survey: $189, upgrades $15
    If you don't: $289, upgrades $50
  BullsEye
    BullsEye Pro: $199 with free upgrades for 6 months

Hunter's Maxim for the Deep Web
  Plan to first locate the category of information you want, then browse.

  Don't be too specific in your searches. Cast a wide net.

TRACKING THE LIVING WEB: WEBLOGS AND RSS FEEDS

Blogs: What are they?
  Online diaries or journals, usually by one person, though many invite comments
  First developed in 1997
  Within the same blog, tone can range from personal musings to discussion of recent issues in technology and research
  High link-to-word ratio
  Often link to other weblogs of similar content
  Can contain rumor, inside information, speculation, blatant errors as well as
    Breaking news: political and technical/research
    Commentary on new software or websites
    Consumer reaction to products or services
  Blog authoring tools are basic content management software, useful in ways other than online diaries
  Typify the spirit of information sharing that has fueled the Internet since its beginnings

How large is the blogosphere?
  2.4 to 2.9 million active blogs (est.)

Who's blogging? (Jupiter Research)
  2% of Internet users have created a blog
  About 50% women, 50% men
  Over 50% are in English; remaining languages, in order of prevalence: Portuguese, Polish, Farsi, French, Spanish, German, Italian, Dutch and Icelandic

More
  About 4% of Internet users read blogs, 60% men, 40% women
  On average, blogs are updated every 3 days
  About 4% of online Americans have gone to blogs for information about the Iraq War
  LiveJournal (large blog host) was the 650th most popular site on the Internet (May, 2003)
    184,000 readers every 10 days
    Spend average of 22 minutes at the site

Creating a Blog
  Blogger
    Free, automated Web publishing tool
    Requires no new software
    Send posts to an existing website or create a free blog at Blogger
    Provide a site template and where you want the postings to appear
    To update, create posting, submit permission form and Blogger will send it by FTP
    Advanced options available

Locating Blogs
  Blog Hosting Sites ($39.95 with added features)
  Blog metasites (library-related, world-wide)
  Subject Directories
  General Search Engines: Blog keyword(s) or URL(bloghost) keyword(s)
  Professional Association homepages
  Subject Metasites
  Use Resources

Searching Blog Content
  Blog hosting sites
  Blog Search Engines (includes RSS feeds also) (current events)
  Topical Blog Search Engines
    Detod: Exclusively legal weblogs

Blogs and General Search Engines
  Blog-rich sites are increasingly visited by major crawler-based search services
  HOWEVER, ANY rapidly-changing content can easily be missed by crawlers

Obstacles to Crawling and Indexing Blog Content
  Only the most recent postings appear on the blog homepage (older are archived, and inaccessible to crawlers)
  Many bloggers post dozens of times a day
  Frequent postings may contain critical information on time-sensitive topics
  Even a daily crawl would miss these postings (typical crawl is about once every 3 weeks)
  Page Design
    Several postings usually appear on the blog homepage
    Postings are NOT indexed separately, as crawler indexes the page as a whole
    Retrieval of an individual posting on a topic is unreliable

Blogs and Libraries
  Blogs can offer an opportunity to post content on the Web quickly: no delay of FTP uploading or submission to a webmaster
    What's New
    Favorite Books
    Recent Acquisitions
    Program Changes due to the Weather
  Get more people involved in posting content on the Library (or library-sponsored) website
  No knowledge of html, RSS or XML needed
  Log onto the blog hosting website, create content, and update the page
  Current awareness without the annoyance of unwanted e-mails
  Choose when YOU want that information by visiting your blogs of choice

Blogs and Libraries: Metasites
  Blogs and Libraries: A Bibliography (online)
  Library Weblog Directory
  Blogs at the University of Minnesota Libraries
  Fichter, D. (2003). Why and how to use blogs to promote your library's services. Marketing Library Services 17(6).

RSS
  Rich Site Summaries
  Really Simple Syndication
  Really Stops Spam

Before RSS: Tracking latest news and site updates
  Software packages that monitored and reported changes at sites of your choosing
  News alert services, free and fee
  Manual checking of your bookmarks
  Hit or miss
  Listserv and Usenet postings

RSS: What is it?
  XML filetype with content that is
    Structured (tags, standard and/or author-defined)
    Re-useable (can be integrated into web, e-mail, multimedia and many other formats)
  Originally developed by Netscape as a content management tool for personalizing home pages
    My News
    My Sports
    My Weather

RSS in detail

RSS: What can it do?
  Creates a broadcast version of frequently updated content from a website, blog, news page or other source
  Authors can
    Summarize new content
    Broadcast new content, e.g. online newsletters
  Can be used as a way to distribute content to subscribers (syndication) independent of e-mail
  Subscribers log on or access via aggregators
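
To make "structured, re-useable XML" concrete, here is a minimal sketch of what an aggregator does with a feed: read the channel, then pull a title, link and summary out of each item. The sample feed and its library.example.org URLs are invented for illustration; the element names (channel, item, title, link, description) are the standard RSS 2.0 ones, and the parsing uses Python's standard-library xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed; a real aggregator would download this from a site.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Library News</title>
    <link>http://library.example.org/</link>
    <description>What's new at the library</description>
    <item>
      <title>Recent Acquisitions</title>
      <link>http://library.example.org/new-books</link>
      <description>New titles added this week.</description>
    </item>
    <item>
      <title>Program Changes due to the Weather</title>
      <link>http://library.example.org/closings</link>
      <description>Story hour is cancelled today.</description>
    </item>
  </channel>
</rss>"""


def read_feed(xml_text):
    """Return the channel title and a list of item dicts from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    items = [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "summary": item.findtext("description", default=""),
        }
        for item in channel.findall("item")
    ]
    return channel.findtext("title", default=""), items


feed_title, items = read_feed(SAMPLE_FEED)
print(feed_title)
for entry in items:
    # Compact presentation with a link to the full content.
    print(f"- {entry['title']}: {entry['link']}")
```

Because every feed uses the same tags, the same few lines can combine many sources in one interface.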

How do I access them?
  As RSS is in XML, may require downloading reader software (older versions of browsers cannot read XML)
  Sources for reader software include
  Sites with RSS feeds display a small icon (usually orange) labeled RSS or XML
  General search engines (limited, but worth a try): filetype:xml keyword(s)

RSS Directories and Search Engines
  Syndic8
    Directory of available syndicated news feeds
    Provides no reading area
    Uses Open Directory classification
  Feedster
    The best search engine for blogs and RSS feeds
  Yahoo
  Canadian Government
  Often found in Blog Directories and Engines

RSS aggregators
  Receive general or topical RSS feeds and blog postings
  Many are focused on news only
  Present content in compact form
  Combine multiple sources in one interface
  Provide links to full content
  In personal desktop versions or online

Personal desktop aggregators
  Let you specify any feeds you want access to
  Ampheta Desk
  Radiouserland ($$)
  Feedreader

Online aggregators
  Selection of feeds may be limited
  NewsIsFree
    7379 sources grouped into 16 channels
    Create custom pages
    $$ offers more Premium options
  Many RSS sites include links to other aggregators

Authoring and Producing RSS
  Lockergnome
    Documents, tools, developers, aggregators, free feed generator for your site
  RSS Primer for Publishers
    Producing RSS feeds
    Technical information
    Feed promotion
  Feedster

Blogs and RSS
  Blogs may offer some or all of their content as RSS feeds, or not
  Blogs can exist as pure html documents, updated frequently
  Making content available in RSS increases a blog's access and exposure via aggregators and other RSS-based search services

The Living Web
  What can blogs and RSS feeds tell us about an author's point of view?
    Which ones does an author list on their blog/homepage?
    Which ones does an author visit/subscribe to?
  Sometimes I want to know what the world thinks: GOOGLE
  Sometimes I want to know what I think: MY WEBLOG
  Sometimes I want to know what those I respect think: BLOGS AND FEEDS I READ

Beyond today's (free) search engines: Cutting edge developments

Including Context in System Design
  Context matters (!!??!)

  Textual context
  Query context
  Who is asking and why?
  Traditional approaches to retrieval have been deductive
    Data organized and mapped to anticipated query terms (controlled vocabularies, taxonomies)
    Human created and maintained
    Too slow for rapid data streams

Bayesian approaches
  Uses statistical inference based on Bayes' Theorem of Probability (Thomas Bayes, 1702-1761)
  Inductive approach (adaptive processing)
    Take the user's information environment
    Infer structures, relationships, likely queries
    Inferred structures and relationships can then be mapped to a human-created classification scheme
  Currently used in corporate intranet and fee-based content management software
  Will be used more in general information systems of the future

Adaptive Processing
  Learning the searcher's interests
    What term(s) did you search?
    What did you select?
    How long did you look at it?
    What is its source?
    How old was it?
  Direct input from searcher
    Rank the sources
    Rate individual results
    Eliminate certain sources, sites

Inquirus
  Query interface research project
  Attempts to improve precision of results
  Monitors user's search behavior to infer intent of queries
  Re-formulates queries to increase likelihood of desired answers
  USER: How do you make salsa?
  SYSTEM: salsa and (recipe or ingredients or food)

  Eliminates pages on salsa dancing
  Ranking relies heavily on proximity of query terms and system-provided cognates to each other in the document

Vector-Space Model
  3-dimensional retrieval
  A way of ordering documents by word frequency/context in a "term space" and matching them to queries
  Documents are assigned coordinates
  One document may be in many "term spaces" or vectors
  Queries that fall within a given vector are likely to be answered by documents located in that vector

A Multi-dimensional Boolean
  Boolean
    Limited to term matches: terrier female puppy
  Vector-space model
    More complex relationships can be mapped
    Degrees of relatedness of document to query
    Query and document weights based on length and direction of their vectors

Documents in Vector Space
  What do you have on movie stars' diets?
  [Figure: documents plotted against STAR and DIET axes: Doc about movie stars, Doc about astronomy, Doc about mammal behavior]

Phibot
  Project of the Univ. of Mainz and German Institute of Artificial Intelligence
  Crawls science, medicine and news web sites
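
The vector-space idea from the preceding slides (documents and queries as points in a term space, matched by the direction of their vectors) can be sketched in a few lines. This is a minimal illustration using raw word counts and cosine similarity; real systems weight terms more carefully, and the three tiny documents below are invented to mirror the STAR/DIET figure.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent a text as term counts: one coordinate per distinct word."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two term vectors (1.0 = same direction)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, documents):
    """Order document names by similarity to the query, most similar first."""
    query_vector = vectorize(query)
    scored = [
        (cosine(query_vector, vectorize(text)), name)
        for name, text in documents.items()
    ]
    return [name for score, name in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


documents = {
    "movie stars": "movie star diet secrets star diet plan",
    "astronomy": "star cluster telescope star galaxy",
    "mammal behavior": "mammal diet foraging behavior",
}
print(rank("star diet", documents))
```

Unlike a Boolean AND, which would return only the first document, every document gets a degree of relatedness: the astronomy and mammal pages still score above zero because each shares one term with the query.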

  200 million general science sites
  70 million medical sites
  Traditional: Google-like processing
  Vector-Space Optimization: greater vector-space processing

Digital Video Search
  Searches actual visual content
  Project of Dublin City University
  Determine structure of the video by identifying shots with the greatest degree of change (keyframes)
  Use these to create a structure, and allow user to refine query based on these
  Needed by journalists, governments and airport security

Current Trends in and Challenges to Today's Search Industry

User Interface Trends
  Toolbars, Toolbars, everywhere
    Review site:
  Search by Location
    Major engines with local search options and local specialized ones
    Makes the haystack smaller; important in e-commerce
  P2P networks (Peer-to-peer)
    File-sharing networks, a la Napster
    KaZaA - most popular download EVER!
    Shares any filetype
    90% of files shared are audio-visual in nature
  Application Program Interface (API)
    Published set of programming "hooks" that lets you interact directly with a company's open servers
    You can mine the company's databases for free
    WHY? To attract more traffic to the site
    Example: Enter 1 or 2 terms/phrases and see how Bush and Democratic candidates stack up! Created by Tara Calishain

Search in Corporate Settings Drive Search Engine R&D
  Uniform, seamless access to all information: internal & external, data & content
  XML
  More natural language processing
  Hybrid systems to search structured AND unstructured data
  Adaptive processing (Bayesian)
  Use of intelligent agent software

  Easier user interfaces
  Personalization

Industry-wide Trends
  Distributed Crawling: Volunteer your PC when not in use, Looksmart
  Search continues to be driven by advertising and revenue
  Fewer services maintain their own crawler-created database
  Increased crawling of non-html filetypes

Challenges to the Industry
  Revenue
    E-content providers have cut into search software sales with their proprietary engines
  Fighting fraud
    Cloaking, ranking manipulation
  Scalability
    Size of surface Web increases
    Over 300 million queries a day to all Web S.E.s
  Freshness
    Competitive edge demands recent crawls
  Deep Web
    Embedded databases
    Non-html filetypes
  Real-time information
    Growing importance of the Living Web
  Ambiguous query refinement
    Not very hopeful among general search engines
    User group too large
    User profiling difficult
  Indexing the smaller, newer sites
    Google's link-based PageRank penalizes these sites

The Biggest Challenge: Just what are you looking for?
  A known needle in a known haystack
  A known needle in an unknown haystack
  An unknown needle in an unknown haystack
  Any needle in a haystack
  The sharpest needle in a haystack
  Most of the sharpest needles in a haystack
  All the needles in a haystack
  Affirmation of no needles in the haystack
  Things like needles in any haystack
  Let me know if any new needles show up
  Where are the haystacks?
  Needles, haystacks, ...whatever

Thank You and Happy Holidays!

Michael Hunter
Reference Librarian
Hobart and William Smith Colleges
Geneva, NY 14456
(315) 781-3552
[email protected]
