It's My World

Searching for articles on INFORMATION RETRIEVAL via ProQuest. Here are the search results:

1. Improving e-book access via a library-developed full-text search tool
Jill E Foust, Phillip Bergen, Gretchen L Maxeiner, Peter N Pawlowski. Journal of the Medical Library Association. Chicago: Jan 2007. Vol. 95, Iss. 1; pg. 40, 6 pgs

Abstract (Summary)

This paper reports on the development of a tool for searching the contents of licensed full-text electronic book (e-book) collections. The Health Sciences Library System (HSLS) provides services to the University of Pittsburgh's medical programs and large academic health system. The HSLS has developed an innovative tool for federated searching of its e-book collections. Built using the XML-based Vivísimo development environment, the tool enables a user to perform a full-text search of over 2,500 titles from the library's seven most highly used e-book collections. From a single "Google-style" query, results are returned as an integrated set of links pointing directly to relevant sections of the full text. Results are also grouped into categories that enable more precise retrieval without reformulation of the search. A heuristic evaluation demonstrated the usability of the tool and a web server log analysis indicated an acceptable level of usage. Based on its success, there are plans to increase the number of online book collections searched. This library's first foray into federated searching has produced an effective tool for searching across large collections of full-text e-books and has provided a good foundation for the development of other library-based federated searching products.

Indexing (document details)

Subjects:Search engines, Information retrieval, Digital divide, E-books, Collections, Information literacy, Data bases, Full text, Libraries
Author(s):Jill E Foust, Phillip Bergen, Gretchen L Maxeiner, Peter N Pawlowski
Document types:Feature
Document features:Illustrations, References
Publication title:Journal of the Medical Library Association. Chicago: Jan 2007. Vol. 95, Iss. 1; pg. 40, 6 pgs
Source type:Periodical
ISSN:15365050
ProQuest document ID:1215769501

Full Text

[Headnote]
Purpose: This paper reports on the development of a tool for searching the contents of licensed full-text electronic book (e-book) collections.
Setting: The Health Sciences Library System (HSLS) provides services to the University of Pittsburgh's medical programs and large academic health system.
Brief Description: The HSLS has developed an innovative tool for federated searching of its e-book collections. Built using the XML-based Vivísimo development environment, the tool enables a user to perform a full-text search of over 2,500 titles from the library's seven most highly used e-book collections. From a single "Google-style" query, results are returned as an integrated set of links pointing directly to relevant sections of the full text. Results are also grouped into categories that enable more precise retrieval without reformulation of the search.
Results/Evaluation: A heuristic evaluation demonstrated the usability of the tool and a web server log analysis indicated an acceptable level of usage. Based on its success, there are plans to increase the number of online book collections searched.
Conclusion: This library's first foray into federated searching has produced an effective tool for searching across large collections of full-text e-books and has provided a good foundation for the development of other library-based federated searching products.

INTRODUCTION

The emergence of the Internet has introduced new ways for users to access library resources and has shaped user behavior and expectations [1]. Users now expect instant and constant access to information, often from distant locations, and as a result, remote access to online library resources has become an increasingly significant part of library service.

Remote access has become especially important at the University of Pittsburgh's Health Sciences Library System (HSLS). In addition to supporting the university's schools of the health sciences, HSLS also supports the 17 hospitals of the University of Pittsburgh Medical Center (UPMC). Medical staff members at many of these outlying facilities lack access to a physical library, so online resources, particularly online reference works, are of especial value. HSLS offers remote access to a vast collection of electronic materials, including over 3,000 licensed electronic books (e-books) that are represented in the library's online catalog and also in a Web-based alphabetical title list. These materials should be especially useful for those seeking quick information in the course of patient care; however, the necessity of identifying relevant online titles and then searching within each title separately slows the information retrieval process and potentially limits e-book use [2].

To address this problem, HSLS chose to create a federated search tool called Electronic Book Search. Built using Vivísimo's Velocity software package, this tool facilitates use of the library's e-book collection by enabling users to search the full text of a large set of e-books from across the collection in one easy step. This paper describes the process of developing and implementing Electronic Book Search and the challenges and successes encountered along the way.

BACKGROUND

The federated search tool has emerged as a successful means of meeting the information needs of users despite several limitations. From a streamlined interface with a single search, users can pull results from a variety of sources, both public and subscription-based [3-7]. There is the additional benefit that some of these results may be from resources the user might not have otherwise found [3]. The speed and ease of federated search tools make them appealing to searchers, particularly novice searchers. They do not need to learn how to formulate searches in each of the databases, nor must they learn how to interpret the different results sets; federated search tools typically display results in a common format [4]. An efficient federated search engine allows searching to be done in a timely manner and eliminates the need for a user to learn multiple search interfaces [5].

There are trade-offs for this simplicity, however. The search capabilities of the federated search are limited by those of the individual databases or sources. Boolean searching is not usually possible because a federated search engine can only process what the target provider allows [6]. Advanced search commands such as field searching and truncation are also not usually possible. As a result, information literacy may suffer due to the abandonment of traditional search skills in favor of a "Google-style, keyword approach" [5].

Two issues also arise in the integrated results. First, federated searches commonly retrieve duplicate results from different sources. Because sources return results in small sets (of typically ten to twenty), the federated search tool's deduplication process has only a small portion of the results to work with at a time [7]. A complete deduplication would require that all search results be compared. With a large retrieval, this would be extremely time-consuming, thus not feasible for most searchers. Second, federated search engines do not perform relevancy ranking well. While content providers can utilize the full article and its indexing, federated search engines have only the citation with which to work [7]. Thus, federated search engines have only limited data available on which to base relevancy, and this data may be insufficient.

Although federated searching may seem simple to the user, the set-up of a tool can be complicated. Individual sources store and present data differently, requiring behind-the-scenes efforts to make and keep results compatible. The mapping of data between the sources and the search tool can be quite complex and challenging to maintain, because sources can change their data format and output at any time [5]. Subscription-based sources also present difficulties, since licenses with the library define permissible user groups [4]. Authentication processes need to be established to allow valid users access to the resources while unauthorized users are excluded.

While the literature does not contain a discussion of federated searching used with full-text e-books specifically, the more general literature confirmed the library's decision to apply federated search technology to this situation and offered insight as to where future problems might occur.

ELECTRONIC BOOK SEARCH DEVELOPMENT

Setup

Though there are a number of federated search products on the market, HSLS did not explore these products because the organization had recently licensed Vivisimo's Velocity, an XML-based development environment that consists of a set of software tools for building information retrieval applications. It was decided that this project presented a good opportunity for the first locally developed application using the software. Velocity comprises three interrelated but distinct tools: (1) the Enterprise Search Engine, which allows for automated indexing and searching of document collections (not utilized in this project); (2) the Content Integrator, which transmits queries to the Web-based search engines of individual data sources and integrates the results from each into a single result set; and (3) the Clustering Engine, which takes a result set and dynamically groups it into meaningful categories, placing results in hierarchical folders. Clustering can be a useful way of managing large results sets, and may be familiar to readers from Clusty, Vivisimo's Web-based metasearch engine, or ClusterMed, its subscription-based product that searches and clusters PubMed [8, 9].

The development team, consisting of the Information Architecture Librarian, the Web Manager, who is also a Reference Librarian, and the Cataloging Librarian, selected packages of licensed e-books from 7 different vendors for inclusion: AccessMedicine, Books@Ovid, ebrary, Elsevier ScienceDirect, MD Consult, STAT!Ref, and Wiley InterScience. These were selected for the quality and number of titles in each, and also given the expectation that HSLS would continue to license them over time. Overall, these packages present over 2,500 e-books, a vast majority of the HSLS e-book collection, and include all of the most popular e-book titles.

A detailed profile identifying functionalities and specifications was prepared for each package, or target product, to allow for the proper configuration of the Vivisimo Content Integrator. All of the products include a Web-based search engine for accessing the content of their titles, which is required for interaction with the Content Integrator; however, each of the packages features a unique set of characteristics requiring special configuration. Some of the characteristics that had to be examined include: (1) Searching: What are the basic and advanced search features? Does it search the full collection or only subscribed titles? Does it search full text, book indexes, or other data? (2) Results: Does it sort results by relevancy? What information does it provide about each result? How many results does it return and in what grouping? (3) Authentication: How does the e-book provider recognize valid users? How does it maintain a user's session? (4) Data transmission: Since the Content Integrator does not use the Z39.50 interface, what is the HTML format of the results data? What are the parameters of its common gateway interface (CGI) program?

Utilizing these identified characteristics, the initial stage of programming focused on configuring the Content Integrator to interact with each of the targeted products. For each provider, a "source" file was created in Vivisimo to store the information necessary for the Content Integrator to translate queries into the appropriate format for the provider's CGI program, send the query to the CGI program, and interpret the returned results. This information includes the CGI program's URL and parameters, the syntax the search engine supports, and the format of the HTML that is returned. The setup for each source was tested for errors and unexpected results, and after fine-tuning, the sources were linked together for federated searching.
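The kind of information a "source" file holds can be sketched in a few lines of Python. This is only an illustration of the idea described above, not the Vivisimo source-file format: the `Source` class, its field names, the example URL, and the parameter names are all hypothetical.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode


@dataclass
class Source:
    """Hypothetical description of one provider's Web-based search engine."""
    name: str
    cgi_url: str        # address of the provider's CGI search program
    query_param: str    # name of the parameter that carries the user's query
    fixed_params: dict = field(default_factory=dict)  # parameters sent with every search

    def build_url(self, query: str) -> str:
        """Translate a user query into a request URL for this provider."""
        params = {**self.fixed_params, self.query_param: query}
        return f"{self.cgi_url}?{urlencode(params)}"


# Made-up provider used only for illustration.
example_source = Source(
    name="ExampleBooks",
    cgi_url="https://books.example.org/cgi-bin/search",
    query_param="q",
    fixed_params={"scope": "fulltext"},
)
example_url = example_source.build_url("caffeine sleep")
print(example_url)
```

A real configuration would also need to describe how to parse the HTML that each provider returns, which is the part most likely to break when a provider redesigns its interface.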

Display

Attention then turned to the integrated results display. Ideally, results, regardless of their source, should be presented in a consistent format throughout the final set, but the reality is that each target product returns different information about results and uses different formats. In order to make providers' results more compatible, the team reassessed which data elements from each would be displayed and defined the field labels. The maximum number of results was also determined by trial and error, set to be large enough to allow for effective clustering but modest enough to not unduly slow retrieval time. An overall maximum set of 450 results was chosen, divided amongst the providers. The results are displayed based on general relevance: each provider offers its results ranked by relevancy, and these are interfiled in the default Vivisimo display.
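One way to picture the interfiling of ranked lists is a round-robin merge. The sketch below is an assumption made for illustration: the article does not say how Vivisimo interfiles results or how the 450-result maximum is divided, so the even split per provider and the function name `interfile` are hypothetical.

```python
from itertools import zip_longest


def interfile(ranked_by_source, overall_max=450):
    """Merge per-provider ranked lists by taking one result per provider in turn.

    Assumes an even share of overall_max per provider (not stated in the article).
    """
    if not ranked_by_source:
        return []
    per_source = overall_max // len(ranked_by_source)
    capped = [results[:per_source] for results in ranked_by_source.values()]
    merged = []
    # Each "tier" holds the n-th ranked result from every provider that has one.
    for tier in zip_longest(*capped):
        merged.extend(result for result in tier if result is not None)
    return merged[:overall_max]


demo = interfile(
    {"A": ["a1", "a2", "a3"], "B": ["b1"], "C": ["c1", "c2"]},
    overall_max=6,
)
print(demo)
```

With this approach the top-ranked hit from every provider appears before any provider's second hit, which gives a relative rather than an absolute relevance order.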

The basic results screen contains this list of resulting citations and also the clusters provided by the Clustering Engine (Figure 1). Without any special configuration, this component of the Velocity package uses the information in the results set to establish dynamic categories and sorts the results into hierarchical folders. Although customization is possible, the default settings were considered to be successful in initial testing and so customized capabilities were not explored. Only a short set of clustering stop words was defined. For example, words such as "Chapter" and "Introduction" are common in the results but do not provide useful categories for clustering; thus, these were excluded from the clusters. Clustering by topic and by "source" are included; this latter feature presents results on a provider-by-provider basis.
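The effect of clustering stop words can be shown with a toy example. This is not the Clustering Engine's algorithm, which the article does not describe; it simply groups result titles by the words they contain and skips the stop words named above. The function names and sample titles are made up.

```python
from collections import defaultdict

# Stop words named in the text; the real list is not given in the article.
CLUSTER_STOP_WORDS = {"chapter", "introduction"}


def candidate_labels(title):
    """Return lowercase alphabetic words from a title, minus clustering stop words."""
    words = [word.strip(".,:;()").lower() for word in title.split()]
    return [w for w in words if w.isalpha() and w not in CLUSTER_STOP_WORDS]


def group_by_label(titles):
    """Group titles under every candidate label they contain."""
    groups = defaultdict(list)
    for title in titles:
        for label in sorted(set(candidate_labels(title))):
            groups[label].append(title)
    return dict(groups)


sample_titles = [
    "Chapter 12: Pediatric Headache",
    "Introduction: Caffeine",
    "Chapter 3: Caffeine Withdrawal Headache",
]
groups = group_by_label(sample_titles)
print(sorted(groups))
```

Without the stop-word filter, "chapter" would become the largest group even though it says nothing about the topic of the results.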

The development team was initially unsure how much explanation and guidance would be required by users of the Electronic Book Search. Clustering in particular would likely be a new concept to most. However, following Vivisimo's lead in their ClusterMed product, a minimalist approach was adopted. Instruction is provided in the form of a diagram on the opening screen (Figure 2; online only) that explains the different components of the results screen. A simple Help section also offers general information about the tool, tips on search syntax, a list of the included e-book titles, and an email link for requesting assistance.

Testing

Electronic Book Search was subjected to two types of testing prior to its release. First, all HSLS librarians were asked to test it. Only a minimal amount of information about the product was given so as to simulate the typical user situation. These testers were asked to evaluate the product from the perspective of both the librarian and the patron, and the resulting feedback was positive. Testers found the tool to be a timesaver because they could search numerous titles simultaneously and because they could explore multiple e-books in situations in which they would not know which titles to search. The clustering was useful in pulling out different aspects of the search topic, for example, pediatric coverage of a particular disorder. Finally, most felt that there was appropriate guidance and help available for patrons to use the product successfully.

Second, HSLS employed a heuristic evaluation in which evaluators compared aspects of the site against a set of known usability principles [10, 11]. Heuristic evaluation does not test how well a product will meet users' needs; rather, it identifies design problems that would adversely impact a user's interaction with the product. Five HSLS librarians and one external information professional served as evaluators. Working individually to assess the product, they identified a total of twenty-five different problems and then ranked them for their level of severity. Only one of the twenty-five was rated by the evaluators as a major usability problem, that is, "important to fix": inadequate feedback that a search is in progress. Because searches were slow and nothing was happening on the screen, users might not have realized the system was running. This problem was immediately addressed by adding brightly colored text above the search box indicating that a search is in progress. The remaining usability problems identified in the evaluation, all ranked as minor or cosmetic, were then dealt with in order of severity and feasibility. HSLS has not yet undertaken any testing with the product's primary users, the library's patrons. Such testing, however, will likely become part of the evaluation process for future iterations now that basic usability has been addressed.

DISCUSSION

Success

Overall, the development of Electronic Book Search has been a success. Informal user feedback offered during reference service interactions and through departmental liaisons has been favorable, frequently indicating that the tool allowed users to quickly find the answer or resource they were seeking.

Usage of the Electronic Book Search was charted, employing WebTrends software to track visits over an 8-month period, from March 2005 through October 2005 (Figure 3; online only). In that time, a total of 2,008 visits were made to the system. After the initial release to the public in early 2005, there was low use of the product in March with 145 visits. This increased dramatically in April to 407 visits after it was publicized on the HSLS home page and in the library's print and online newsletter [12]. Usage decreased after this peak and has been inconsistent in the months following. However, with an average of 251 monthly visits during the test period, HSLS is satisfied with the overall level of use, which is expected to increase following a current Website redesign project, which will display the tool more prominently on the site.
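The reported average can be checked directly from the figures in this paragraph. The short calculation below uses only the numbers given in the text; the variable names are illustrative.

```python
# Figures reported in the text above.
total_visits = 2008
months = 8  # March 2005 through October 2005
march_visits = 145
april_visits = 407

average_monthly_visits = total_visits / months
april_growth = april_visits / march_visits

print(f"Average monthly visits: {average_monthly_visits:.0f}")
print(f"April visits relative to March: {april_growth:.1f}x")
```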

In addition to numerical data, WebTrends also tracks the exact search terms used in queries. Electronic Book Search is designed for "Google-style" queries, that is, simple words or phrases. During the development phase, there was a concern that this type of searching would not suffice for users in the medical community, who may expect advanced search capabilities. The web server log analysis suggests that this is not the case. Of 159 unique search terms entered in March 2005, shortly after the public release of the tool, 90% present the expected keyword or phrase format. Only 10% of the searches contain complex Boolean strings, structured search phrases, or attempts to search for book titles or authors rather than content. This would suggest that most users did not expect advanced search capabilities and easily grasped the type of search that Electronic Book Search expected.

Although the project focused on the federated searching capabilities of the Vivisimo software, its clustering capabilities proved to be an added benefit. Keyword searching can yield large numbers of results, some of which are likely to be irrelevant to the user. Instead of reformulating additional searches that are more restrictive, the user can take advantage of Electronic Book Search's clusters to focus on the most appropriate hits. Thus, as seen in Figure 1, a user searching on caffeine sleep who is daunted by the 177 results can look to the clusters in order to hone in on results with a pediatric context or those discussing headaches. In both instances the tabs can be expanded to further narrow the results set. This capability has rated well in the informal feedback HSLS received about Electronic Book Search and has contributed to its success.

Challenges

Several of the ongoing challenges for federated search engines that are identified in the literature are likewise present for this project. For example, the system does not readily accept advanced search commands. This includes field-specific searching, truncation, and complicated Boolean search strings. However, because all of the selected sources accommodate at least simple Boolean search, the user does have this option in Electronic Book Search.

Programming can also be complicated for those e-book providers that do not automatically map a search to the full set of subscribed titles. A source may require the manual selection of each title to be searched; at the other end of the spectrum, if the source automatically searches its full suite of resources (even if the user will not have access to all results in full text), one may have to limit the search to particular titles. In both cases, programming is done on a title-by-title basis and requires monitoring in case the subscription contents change. At HSLS, the cataloger responsible for e-book cataloging reports these changes as part of her regular workflow.

The system will also require regular monitoring to ensure that no changes to the target products' CGI programs have occurred that would block communication between Electronic Book Search and the sources. Although the system only includes seven products, two were redesigned in the first six months that Electronic Book Search was available. Since both revisions were advertised, preparations could be made for reprogramming the corresponding Content Integrator source file, although in fact little could be done before the interface redesigns went live. Not all changes are advertised, however. One provider, for example, added a temporary welcome screen advertising upcoming enhancements; while this seems innocuous enough, it effectively blocked access to this provider's resources via Electronic Book Search until the Electronic Book Search setup was reprogrammed.

Working with licensed resources presents a variety of challenges. For example, it is possible to exceed the maximum number of concurrent users as allowed by the provider license. In the case of Electronic Book Search, each search opens a session on each source's search engine, even if the user does not choose to view that source's full-text results. This has caused an increase in the number of open sessions for the included providers. So far, this has not been an issue, but it is a situation that must be monitored as use of Electronic Book Search increases. There are also situations in which special arrangements with the provider must be made. In one case, a provider added an individual license agreement screen to which a user must respond before accessing the provider's resources; because this interfered with Electronic Book Search, HSLS contacted the provider for permission to bypass the page, which was granted.

An issue that had not been anticipated from the literature review is that of slow search speed due to the large number of e-books that are searched concurrently. While a federated search is certainly faster than searching each of the component sources separately, Electronic Book Search runs more slowly than some users might expect. This is addressed in part with onscreen feedback to indicate that a search is in progress. However, latency remains a concern and may be a factor whenever the addition of further e-book collections is considered, since those additional collections will further slow search time.

Fortunately, two of the challenges frequently cited in the literature proved to be non-issues for Electronic Book Search. First, the duplication that tends to be a problem in federated searching occurs when the target products overlap in their search coverage and potentially retrieve the same results; for example, if they are searching the web and return the same URL. Deduplication is not a worry in this case since each provider is searching only its own set of resources. Potentially there could be redundancy if the same work is included in multiple e-book packages, since federated search engines identify duplicates based on URL, not on the content. However, this is not a concern for this project because HSLS has virtually no overlap of titles in the different providers' collections. Second, relevancy among the integrated search results does not present a problem. All of the sources used in Electronic Book Search rank their own results sets, and Vivisimo's Content Integrator simply interfiles these so that a relative relevancy is achieved. This approach seems to be satisfactory.
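The limitation of URL-based deduplication mentioned above can be illustrated with a minimal sketch. The function name, the record layout, and the example URLs are hypothetical; the only point taken from the text is that duplicates are recognized by URL rather than by content.

```python
def dedupe_by_url(results):
    """Keep the first result for each URL, preserving the original order."""
    seen_urls = set()
    unique = []
    for result in results:
        if result["url"] not in seen_urls:
            seen_urls.add(result["url"])
            unique.append(result)
    return unique


# Made-up records: the same work offered by two providers under different URLs.
results = [
    {"title": "Textbook of Sleep Medicine", "url": "https://provider-a.example.org/book/1"},
    {"title": "Textbook of Sleep Medicine", "url": "https://provider-b.example.org/item/9"},
    {"title": "Textbook of Sleep Medicine", "url": "https://provider-a.example.org/book/1"},
]
unique_results = dedupe_by_url(results)
print(len(unique_results))
```

The exact repeat from provider A is removed, but the copy held by provider B survives because its URL differs, which is the redundancy the paragraph describes.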

CONCLUSION

This paper has described one medical library's foray into federated searching, in this case for the development of a tool that allows users to search a large collection of e-books simultaneously across providers and at a full-text level. The simple interface and ability to search across multiple resources at one time make Electronic Book Search appealing to users. Additionally, feedback and evaluations demonstrated a sufficient level of satisfaction and use to continue maintaining this tool despite its inherent ongoing challenges.

Electronic Book Search could potentially impact users from outside the HSLS community as well. Although Electronic Book Search was not deliberately designed for external use, the setup allows anyone with Internet access to conduct searches on the contents of the HSLS e-book collection and to view the results lists. Licensing restrictions prevent users without recognized subscriptions for the individual resources from viewing the full text of the results. However, if users outside the HSLS community have subscriptions to the resources either individually or through their home institutions, they will be able to follow the links in the results list to the full text of those e-books. Although other medical libraries may not subscribe to all of the same titles, it is likely that their e-book collections will closely mirror that of HSLS, thus allowing their patrons to benefit as well from the improved access to these electronic resources.

In fact, in response to the successes of Electronic Book Search, HSLS has since explored other library applications for Vivisimo's Velocity software. One such application, available through the HSLS Molecular Biology and Genetics Web site, provides access to over 900 online bioinformatics databases and software tools. With this, users can locate appropriate resources more efficiently than with the standard popular Web search engines. Other future applications are also being considered as HSLS continues to find innovative ways of addressing the information needs of its users in the online environment.
[Sidebar]
Highlights
* Electronic Book Search searches the content of over 2,500 e-books contained in the 7 most popular packages in the HSLS collection.
* A good federated search tool is founded on a thorough understanding of the target products, careful configuration, and regular monitoring.
* Clustering technology is a useful means of narrowing the large results sets often associated with keyword searching.
Implications
* Users need more efficient ways to access the content of a library's e-book collection.
* Federated search engines targeting specific collections or user populations in a library can be effective tools.

[Footnote]
* Based on a presentation at MLA '05, the 105th annual meeting of the Medical Library Association; San Antonio, Texas; May 16, 2005.
* Supplemental figures are available with the online version of this journal.

[Reference]
REFERENCES
1. Covey DT. The need to improve remote access to online library resources: filling the gap between commercial vendor and academic user practice. Portal Libr Acad 2003;3(4):577-99.
2. Coiera E, Walther M, Nguyen K, Lovell NH. Architecture for knowledge-based and federated search of online clinical evidence. J Med Internet Res [serial online]. 2005;7(5):e52. [cited 21 Jun 2006].
3. Stewart VD. Federated search engines. MLA News 2006 Jan: 17.
4. Fryer D. Federated search engines. Online 2004 Mar/Apr; 28(2):16-9.
5. Curtis AM, Dorner DG. Why federated search? Knowl Quest 2005 Jan/Feb;33(3):35-7.
6. Wadham RL. Federated searching. Libr Mosaics 2004 Jan/Feb;15(1):20.
7. Hane PJ. The truth about federated searching. Inf Today 2003 Oct;20(9):24.
8. Markoff J. New company starts up a challenge to Google. NY Times 2004 Sep 30; Sect. C:6 (col. 6).
9. Price G. Reducing information overkill. SearchDay. [Web document]. 2004 Sep 30. [cited 29 Mar 2006].
10. Nielsen J. How to conduct a heuristic evaluation. [Web document]. [cited 29 Mar 2006].
11. Nielsen J. Ten usability heuristics. [Web document]. [cited 19 Sep 2006].
12. Bergen P, Maxeiner G. Electronic Book Search simplifies full-text searching. HSLS Update 2005 Feb;10(1):1,3. [cited 21 Jun 2006].

[Author Affiliation]
Jill E. Foust, MLS; Phillip Bergen, MA, MS; Gretchen L. Maxeiner, MA, MS; Peter N. Pawlowski, BS, BA

[Author Affiliation]
AUTHORS' AFFILIATIONS
Jill E. Foust, MLS, jef2@pitt.edu, Web Manager/Reference Librarian; Phillip Bergen, MA, MS, bergen@pitt.edu, Information Architecture Librarian; Gretchen L. Maxeiner, MA, MS, maxeiner@pitt.edu, Cataloging Librarian, Health Sciences Library System, Falk Library of the Health Sciences, University of Pittsburgh, Pittsburgh, PA 15261; Peter N. Pawlowski, BS, BA, pawlowski@vivisimo.com, Software Engineer and Lead Linguist, Vivisimo, Inc., 1710 Murray Avenue Suite 300, Pittsburgh, PA 15217
Received March 2006; accepted September 2006

2. A machine learning information retrieval approach to protein fold recognition
Jianlin Cheng, Pierre Baldi. Bioinformatics. Oxford: Jun 15, 2006. Vol. 22, Iss. 12; pg. 1456

Abstract (Summary)

Motivation: Recognizing proteins that have similar tertiary structure is the key step of template-based protein structure prediction methods. Traditionally, a variety of alignment methods are used to identify similar folds, based on sequence similarity and sequence-structure compatibility. Although these methods are complementary, their integration has not been thoroughly exploited. Statistical machine learning methods provide tools for integrating multiple features, but so far these methods have been used primarily for protein and fold classification, rather than addressing the retrieval problem of fold recognition: finding a proper template for a given query protein.
Results: Here we present a two-stage machine learning, information retrieval, approach to fold recognition. First, we use alignment methods to derive pairwise similarity features for query-template protein pairs. We also use global profile-profile alignments in combination with predicted secondary structure, relative solvent accessibility, contact map and beta-strand pairing to extract pairwise structural compatibility features. Second, we apply support vector machines to these features to predict the structural relevance (i.e. in the same fold or not) of the query-template pairs. For each query, the continuous relevance scores are used to rank the templates. The FOLDpro approach is modular, scalable and effective. Compared with 11 other fold recognition methods, FOLDpro yields the best results in almost all standard categories on a comprehensive benchmark dataset. Using predictions of the top-ranked template, the sensitivity is ∼85, 56, and 27% at the family, superfamily and fold levels respectively. Using the 5 top-ranked templates, the sensitivity increases to 90, 70, and 48%.
Availability: The FOLDpro server is available with the SCRATCH suite through http://www.igb.uci.edu/servers/psss.html.
Contact: pfbaldi@ics.uci.edu
Supplementary information: Supplementary data are available at http://mine5.ics.uci.edu:1026/gain.html
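The retrieval step described in the abstract, ranking templates by a continuous relevance score and measuring how often a correct template appears among the top-ranked ones, can be sketched as follows. This is not the FOLDpro implementation: the scores are made-up numbers standing in for support vector machine outputs, the function names are hypothetical, and sensitivity is assumed here to mean the fraction of queries with at least one correct template among the top k.

```python
def rank_templates(scores):
    """Return template IDs ordered from highest to lowest relevance score."""
    return sorted(scores, key=scores.get, reverse=True)


def top_k_sensitivity(rankings, correct_templates, k):
    """Fraction of queries with at least one correct template in the top k."""
    hits = sum(
        1
        for query, ranked in rankings.items()
        if any(template in correct_templates[query] for template in ranked[:k])
    )
    return hits / len(rankings)


# Made-up relevance scores for two query proteins against three templates.
scores_by_query = {
    "q1": {"t1": 0.9, "t2": 0.1, "t3": 0.5},
    "q2": {"t1": 0.2, "t2": 0.8, "t3": 0.3},
}
# Made-up ground truth: templates in the same fold as each query.
correct_templates = {"q1": {"t3"}, "q2": {"t2"}}

rankings = {query: rank_templates(scores) for query, scores in scores_by_query.items()}
print(top_k_sensitivity(rankings, correct_templates, k=1))
print(top_k_sensitivity(rankings, correct_templates, k=2))
```

Raising k can only keep or increase the value, which matches the pattern in the abstract where the top-5 figures are higher than the top-1 figures.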

Indexing (document details)

Author(s):Jianlin Cheng, Pierre Baldi
Document types:Journal Article
Publication title:Bioinformatics. Oxford: Jun 15, 2006. Vol. 22, Iss. 12; pg. 1456
Source type:Periodical
ISSN:13674803
ProQuest document ID:1069165781 Text Word Count 6574

3. Information retrieval and knowledge discovery utilising a biomedical Semantic Web
Sougata Mukherjea. Briefings in Bioinformatics. London: Sep 2005. Vol. 6, Iss. 3; pg. 252, 11 pgs

Abstract (Summary)

Although various ontologies and knowledge sources have been developed in recent years to facilitate biomedical research, it is difficult to assimilate information from multiple knowledge sources. To enable researchers to easily gain understanding of a biomedical concept, a biomedical Semantic Web that seamlessly integrates knowledge from biomedical ontologies, publications and patents would be very helpful. In this paper, current research efforts in representing biomedical knowledge in Semantic Web languages are surveyed. Techniques are presented for information retrieval and knowledge discovery from the Semantic Web that extend traditional keyword search and database querying techniques. Finally, some of the challenges that have to be addressed to make the vision of a biomedical Semantic Web a reality are discussed. [PUBLICATION ABSTRACT]

Indexing (document details)

Subjects:Information retrieval, Biomedical research, Ontology, Semantics, Web services
MeSH subjects:Abstracting & Indexing -- methods, Artificial Intelligence, Biology, Database Management Systems, Databases, Bibliographic, Documentation -- methods, Information Storage & Retrieval -- methods, Internet, Medicine, Natural Language Processing, Pattern Recognition, Automated -- methods, Periodicals, Semantics, Terminology, Unified Medical Language System
Author(s):Sougata Mukherjea
Document types:Feature
Document features:Diagrams, Photographs, References
Publication title:Briefings in Bioinformatics. London: Sep 2005. Vol. 6, Iss. 3; pg. 252, 11 pgs
Source type:Periodical
ISSN:14675463
ProQuest document ID:926575161 Text Word Count 5631

Full Text

[Headnote]
Keywords: Semantic Web, ontologies, RDF, OWL, semantic search, semantic association

INTRODUCTION

Currently the World-Wide Web has a huge amount of data and is obviously a reliable source of information for many topics. However, since there is not much semantics associated with the data in the WWW, the information cannot be processed by autonomous computer agents and is only understandable to humans. The Semantic Web1 is a vision of the next generation World-Wide Web in which data from multiple sources described with rich semantics are integrated to enable processing by humans as well as software agents. One of the goals of Semantic Web research is to incorporate most of the knowledge of a domain in an ontology that can be shared by many applications. Ontologies organise information of a domain into taxonomies of concepts, each with their attributes, and describe relationships between concepts.

At present the field of biology also faces the problem of the presence of a large amount of data without any associated semantics. Therefore, biologists currently waste a lot of time and effort in searching for all of the available information about each small area of research. This is hampered further by the wide variations in terminology that may be in common usage at any given time, and that inhibit effective searching by computers as well as people.

In recent years, to facilitate biomedical research, various ontologies and knowledge bases have been developed. For example, the Gene Ontology (GO)2 project is a collaborative effort to address the need for consistent descriptions of gene products in different databases. Another widely used system, developed by the United States National Library of Medicine, is the Unified Medical Language System (UMLS),3 which is a consolidated repository of medical terms and their relationships, spread across multiple languages and disciplines (chemistry, biology, etc). Moreover, several specialised databases for various aspects of biology have been developed. For example, the UniProt/Swiss-Prot Knowledge base4 is an annotated protein sequence database.

Biomedical information is growing explosively and new and useful results are appearing every day in research publications. The unstructured nature of the biomedical publications makes it difficult to utilise automated techniques to extract knowledge from these sources. Therefore the ontologies have to be augmented manually. However, because of the very large amount of data being generated, it is difficult to have human curators extract all this information and keep the ontologies up to date.

If a researcher has to gain an understanding of a biological concept, they have to determine all the relevant ontologies and databases and utilise them to find all the semantic relationships and synonyms for the concept. They also have to search the research literature to understand the latest research on the topic. Patent databases also need to be searched to determine relevant patents. Obviously, this is a time-consuming task.

To alleviate the problems of biomedical information retrieval and knowledge discovery, a Semantic Web that integrates knowledge from various biomedical ontologies as well as relevant publications and patents will be very useful. Novel techniques can be utilised for effectively retrieving information and discovering hidden and implicit knowledge from the Semantic Web. Our vision is that distributed web servers would store the 'meaning' of biological concepts as well as relationships between them. This will enable researchers to easily gain an understanding about any biological concept and also potentially discover hidden knowledge.

This paper first introduces the Semantic Web languages and discusses current efforts to represent biomedical knowledge in these languages. The techniques that have been developed to effectively retrieve information from the Semantic Web are then explained. Finally, a discussion is given on some of the main research challenges that need to be addressed to develop a Semantic Web storing all the biomedical information, as well as to effectively retrieve information and discover knowledge from this web.

SEMANTIC WEB LANGUAGES

RDF and RDFS

Various Semantic Web languages have been developed for specifying the meaning of concepts, relating them with custom ontologies for different domains and reasoning about the concepts. The most well-known languages are Resource Description Framework (RDF)5 and RDF Schema (RDFS)6 which together provide a unique format for the description and exchange of the semantics of web content.

RDF provides a simple data model for describing relationships between resources in terms of named properties and their values. A resource can be used to represent anything, from a physical entity to an abstract concept. For example a disease or a gene as well as a patent or the inventor of the patent can be a resource. A resource is uniquely identified by a Uniform Resource Identifier (URI).

RDF describes a Semantic Web using Statements, which are triples of the form (subject, property, object). Subjects are resources. Objects can be resources or literals. A literal is a string which can optionally have a type (such as integer, float). Properties are first class objects in the model that define binary relations between two resources or between a resource and a literal. To represent RDF statements in a machine-processable way, RDF uses the Extensible Mark-up Language (XML).

RDF Schema (RDFS) makes the model more powerful by enabling ontological views over RDF statements. It allows new resources to be specialisations of already defined resources. Thus RDFS Classes are resources denoting a set of resources, by means of the property rdf:type (instances have property rdf:type valued by the class). All classes have by definition the property rdf:type valued by rdfs:Class. All properties have rdf:type valued by rdf:Property. If rdf:type is not defined in the model for a resource, it is by default of type rdf:Resource.

Two important properties defined in RDFS are subClassOf and subPropertyOf. Two other important concepts are domain and range; these apply to properties and must be valued by classes. They restrict the set of resources that may have a given property (the property's domain) and the set of valid values for a property (its range). A property may have as many values for domain as needed, but no more than one value for range. For a triple to be valid, the type of the object must be the range class and the type of the subject must be one of the domain classes.
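The domain and range check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the schema, the type assignments and the helper name is_valid_triple are hypothetical, and the sketch ignores subclass inference.

```python
# Hypothetical schema: each property lists its allowed domain classes
# and its single range class, as in the description above.
SCHEMA = {
    "causes": {"domain": {"Virus", "Bacterium"}, "range": "Disease"},
}

# Hypothetical rdf:type assignments for a few resources.
TYPES = {
    "hiv": "Virus",
    "aids": "Disease",
    "aspirin": "Drug",
}


def is_valid_triple(subject, prop, obj, schema=SCHEMA, types=TYPES):
    """Return True if the subject's type is one of the property's domain
    classes and the object's type is the property's range class."""
    constraint = schema.get(prop)
    if constraint is None:
        return False  # unknown property: treated as invalid in this sketch
    return (types.get(subject) in constraint["domain"]
            and types.get(obj) == constraint["range"])
```

With these assumed data, the triple (hiv, causes, aids) passes the check, while (aspirin, causes, aids) fails because Drug is not among the domain classes of causes.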

RDFS allows inference of new triples based on several simple rules. Some of the important rules are:

* $\forall s, p_1, o, p_2:\ (s, p_1, o) \wedge (p_1, \text{rdfs:subPropertyOf}, p_2) \Rightarrow (s, p_2, o)$. That is, if a property is a subPropertyOf another property and a triple exists for the first property, then one can also infer a triple for the second property with the same subject and object.

* $\forall r, c_1, c_2:\ (r, \text{rdf:type}, c_1) \wedge (c_1, \text{rdfs:subClassOf}, c_2) \Rightarrow (r, \text{rdf:type}, c_2)$. That is, if a resource is of type $c_1$ and $c_1$ is a subClassOf $c_2$, then one can also infer that the resource is of type $c_2$.

* $\forall c_1, c_2, c_3:\ (c_1, \text{rdfs:subClassOf}, c_2) \wedge (c_2, \text{rdfs:subClassOf}, c_3) \Rightarrow (c_1, \text{rdfs:subClassOf}, c_3)$. That is, if a class $c_1$ is a subClassOf $c_2$ and $c_2$ is a subClassOf $c_3$, then one can also infer that $c_1$ is a subClassOf $c_3$.

Figure 1: A section of an example RDF file describing a biomedical semantic web
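The three inference rules listed above can be illustrated with a small forward-chaining sketch in Python. It is a naive fixed-point loop over a set of triples, not an efficient reasoner; the function name rdfs_closure and the example resources are assumptions made for illustration.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"


def rdfs_closure(triples):
    """Apply the three rules repeatedly until no new triple appears."""
    inferred = set(triples)
    while True:
        new = set()
        for (s, p, o) in inferred:
            for (s2, p2, o2) in inferred:
                # Rule 1: (s, p1, o) and (p1, subPropertyOf, p2) => (s, p2, o)
                if p2 == SUBPROP and s2 == p:
                    new.add((s, o2, o))
                # Rule 2: (r, type, c1) and (c1, subClassOf, c2) => (r, type, c2)
                if p == RDF_TYPE and p2 == SUBCLASS and s2 == o:
                    new.add((s, RDF_TYPE, o2))
                # Rule 3: (c1, subClassOf, c2) and (c2, subClassOf, c3)
                #         => (c1, subClassOf, c3)
                if p == SUBCLASS and p2 == SUBCLASS and s2 == o:
                    new.add((s, SUBCLASS, o2))
        if new <= inferred:
            return inferred
        inferred |= new


# Hypothetical example data.
EXAMPLE = {
    ("Virus", SUBCLASS, "Organism"),
    ("Organism", SUBCLASS, "Entity"),
    ("hiv", RDF_TYPE, "Virus"),
    ("part_of", SUBPROP, "physically_related_to"),
    ("finger", "part_of", "hand"),
}
CLOSURE = rdfs_closure(EXAMPLE)
```

For the assumed data, the closure contains the original five triples plus the inferred facts that hiv is an Organism and an Entity, that Virus is a subclass of Entity, and that finger is physically_related_to hand.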

Figure 1 shows a section of an example RDF file describing a biomedical Semantic Web. The first line of the RDF document is the XML declaration. The XML declaration is followed by the root element of RDF documents: rdf:RDF. The xmlns:rdf namespace specifies that elements with the rdf prefix are from the namespace http://www.w3.org/1999/02/22-rdf-syntax-ns# which defines RDF. Similarly the xmlns:rdfs namespace and a local namespace for the biomedical Semantic Web (xmlns:bioMed) are defined.

The example then shows the specification of a Virus with URI http://www.biomed.org/semWeb#Virus. The rdf:Description element contains the description of the resource identified by the rdf:about attribute. The rdf:type property is used to specify that the resource is a rdfs:Class. Then the rdfs:label specifies the literal 'Virus' as the label of the class. The Virus is defined to be a subClassOf Organism. Next a property is defined with URI http://www.biomed.org/semWeb#causes. The type is rdf:Property and it has the label 'causes'. Finally a resource with URI http://www.biomed.org/semWeb/hiv and label HIV is defined. It is an instance of the class Virus. By RDFS rules one can also infer that this resource is of class Organism. An RDF statement is also used to specify that HIV causes AIDS.

OWL

To efficiently represent ontologies in the Semantic Web several ontology representation languages have been proposed including DARPA Agent Mark-up Language (DAML) and Ontology Inference Layer (OIL). DAML and OIL were merged to create DAML+OIL,7 which evolved into OWL (Web Ontology Language).8

OWL is built on top of RDF schema and allows more details to be added to resources. While RDFS provides properties, such as subClassOf and subPropertyOf, that define relationships between two classes or properties, OWL can add additional characteristics that are not defined within RDFS. Thus, OWL can define a class to be the unionOf two other classes or the complementOf another class. For example, one can define an Animal class to be the unionOf the Vertebrate and Invertebrate classes. Similarly, OWL can specify that a property is the inverseOf another property or that a property is a TransitiveProperty. Moreover, while RDFS imposes fairly loose constraints on the data model, OWL adds additional constraints that increase the accuracy of implementations of a given model. For example, in OWL it is possible to add existence or cardinality constraints. Thus, it is possible to specify that all instances of vertebrates have a vertebra, or that hearts have exactly four valves.
Figure 2: Representation of a Gene Ontology term in OWL

TOWARDS A BIOMEDICAL SEMANTIC WEB

Representing biomedical ontologies in Semantic Web languages

In recent years researchers have endeavoured to represent existing biomedical knowledge bases in Semantic Web languages. For example, the Gene Ontology has been represented using DAML+OIL9 as well as OWL.10 Figure 2 shows how a Gene Ontology term is represented as an OWL class in reference 10. The Gene Ontology Id is used as the rdf:ID while the name, synonym and definition of the term are represented by properties. The parent of the term (specified by the GO isa relation) is specified using the rdfs:subClassOf property. The GO part_of relation is specified using the part_of property along with additional constraints provided by owl:Restriction. The restriction states that at least one of the part_of properties of an instance of GO_0001303 must point to an instance of GO_0001302. That is, an instance of GO_0001303 must be a part of an instance of GO_0001302. (It may be part of other entities also.)

Kashyap and Borgida11 describe the representation of the UMLS semantic network in OWL. The semantic network has 135 biomedical semantic classes such as Gene or Genome and Amino Acid, Peptide or Protein. The semantic classes are linked by a set of 54 semantic relationships (such as prevents, causes). One important property is the isa property that creates an ISA hierarchy for both classes and properties. In Kashyap and Borgida11 the classes are represented as OWL classes and the properties (except the isa property) as OWL properties. A statement is created to represent each relationship among the classes. The isa relationship is represented by a subClassOf relationship if it is between classes and a subPropertyOf relationship if it is between properties. Thus the class Virus is a subClassOf Organism and the property part_of is a subPropertyOf physically_related_to.

Kashyap and Borgida11 discovered that representing the semantic network in OWL was not trivial. Problems arose owing to the inability to express the semantic network as OWL axioms that would provide the desired inferences as well as the difficulty of making choices between multiple possible representations. Ambiguities of the semantic network notation were also a problem. For example, the following are two scenarios where it was difficult to represent the semantic network knowledge in OWL:

* Multiple interpretations of a link. The semantic network will have a triple stating that bacteria cause infection. There are several possible interpretations of this relation, including 'All bacteria cause some infection' or 'Some bacteria cause all infections'. One has to determine the correct interpretation and represent that formally using OWL.

* Inheritance blocking. The semantic network will have the triple . Based on OWL inheritance rules, we have to infer the triple . This is not correct and this inheritance-based inference has to be blocked.

These problems indicate that the process of representing ontologies formally using the Semantic Web languages is not easy and it must be ensured that wrong knowledge is not inferred due to incorrect representation. Although for the UMLS semantic network the problems are multiplied because it does not have a formal semantics, even for ontologies with equivalent translations to OWL, questions related to expressibility and intended modelling semantics, among others, still remain.

Biomedical information integration

Various ontologies have been developed in recent years focusing on different aspects of biomedicine. There are overlaps between these ontologies and the same concept may be expressed using different terminologies in two different ontologies. To avoid redundant information in the Semantic Web and prevent problems due to differences in the naming convention, efficient techniques for merging ontologies need to be developed. Several ontology merging methods have been proposed. For example, Sarkar et al.12 attempt to link Gene Ontology with UMLS using techniques such as exact and normalised string match. On a small curated data set the system had precision values ranging from 30 per cent (at 100 per cent recall) to 95 per cent (at 74 per cent recall). (If there are a matching concepts between the two ontologies and a technique finds b of them, then recall is b/a. If the technique also incorrectly identifies an additional c concepts as similar, precision is b/(b + c). Generally if a technique tries to increase precision, recall is reduced and vice versa.)
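The precision and recall definitions in the parenthetical above translate directly into code. The following Python sketch uses the same symbols a, b and c; the function name precision_recall and the sample sets are hypothetical and are not taken from Sarkar et al.

```python
def precision_recall(true_matches, found_matches):
    """Compute (precision, recall) for an ontology matching technique.

    a = number of truly matching concepts
    b = number of true matches the technique found
    c = number of incorrect matches the technique reported
    """
    true_matches = set(true_matches)
    found_matches = set(found_matches)
    a = len(true_matches)
    b = len(true_matches & found_matches)
    c = len(found_matches - true_matches)
    recall = b / a if a else 0.0
    precision = b / (b + c) if (b + c) else 0.0
    return precision, recall
```

For instance, if four concept pairs truly match and a technique reports four pairs of which three are correct, both precision and recall are 3/4.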

Various ontology merging tools have also been developed and have been utilised to merge biomedical ontologies. For example, in Lambrix and Edberg13 two popular ontology merging tools, Protégé14 and Chimaera,15 were evaluated for merging two biomedical ontologies. The evaluation showed that although these tools are useful, automatically resolving conflicts between the source ontologies is a challenging research problem that has not yet been solved.

The Semantic Web languages have also been used for integrating knowledge from ontologies, research publications and patents. For example, in Knublauch et al.16 the Protégé ontology modelling environment is used to develop a biomedical Semantic Web in OWL. The Semantic Web integrates an ontology of brain cortex anatomy with biomedical articles as well as images of the brain.

In Mukherjea et al.,17 a Semantic Web is created by integrating UMLS with information about pharmaceutical patents. The resources of the Semantic Web are the patents, their inventors and assignees as well as all UMLS biomedical concepts. Besides statements for the UMLS semantic network, the Semantic Web has triples such as:

* (patentA refers to patentB);

* (inventorC has invented patentD);

* (patentF is assigned to assigneeE);

* (patentG has the UMLS concept bioTermH; the UMLS concepts in the patents are determined using information extraction techniques);

* (the UMLS concept is of type UMLS semantic network class bioClassI).

INFORMATION RETRIEVAL FROM THE SEMANTIC WEB

Depending on the user requirements, various techniques can be utilised for information retrieval from the Semantic Web based on different views of the underlying information space. Some of these techniques are discussed in this section.

Keyword and semantic search

As on the WWW, keyword search can be utilised to retrieve information from the Semantic Web. All Semantic Web resources having the query keywords in any of their triples can be retrieved. Since the Semantic Web languages are represented in XML, an XML search engine can also be utilised to enable searching on the XML tags.

One key advantage of the Semantic Web is that it will enable semantic search. For example, Guha et al.18 showed how a Semantic Web can be used to augment a traditional WWW keyword search. Similarly, a biomedical Semantic Web can be used to augment a search on PubMed. Thus, if we search PubMed with the keywords 'Nucleic acid', the Semantic Web can be used to determine that the keywords are a biomedical class and retrieve documents not only containing the query keywords but also documents that contain biological terms that belong to the class 'Nucleic acid'. This will, for example, retrieve documents with mRNA, which is a nucleic acid. Similarly, a query to retrieve publications about tumours of the frontal lobe can also return papers about glioma located in the precentral gyrus, exploiting the Semantic Web to understand that glioma is a kind of tumour and precentral gyrus is a part of the frontal lobe. On the other hand, for ambiguous query terms such as cold (disease or temperature), one can determine the correct sense utilising the Semantic Web as well as the query context. Thus, based on the Semantic Web the search engine can widen or narrow the query into concepts that are substantially related to the terms that the user has asked for.
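The query-widening idea described above can be sketched as a simple term expansion over narrower concepts. This Python sketch is illustrative only: the NARROWER table, the function names and the substring matching are assumptions, and a real system would use the ontology's own relations and a proper text index.

```python
# Hypothetical fragment of a biomedical Semantic Web: each term maps to
# its narrower terms (is-a and part-of links are mixed for brevity).
NARROWER = {
    "nucleic acid": ["mRNA", "DNA"],
    "tumour": ["glioma"],
    "frontal lobe": ["precentral gyrus"],
}


def expand_term(term, narrower=NARROWER):
    """Return the term plus every narrower term reachable from it."""
    expanded, seen, stack = [], set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        expanded.append(current)
        stack.extend(narrower.get(current, []))
    return expanded


def semantic_search(query_terms, documents, narrower=NARROWER):
    """Return ids of documents that mention, for every query term,
    either the term itself or one of its narrower terms.
    Matching is a naive case-insensitive substring test."""
    hits = []
    for doc_id, text in documents.items():
        lowered = text.lower()
        if all(any(t.lower() in lowered for t in expand_term(q, narrower))
               for q in query_terms):
            hits.append(doc_id)
    return hits


# Hypothetical document collection.
DOCS = {
    "d1": "Glioma located in the precentral gyrus",
    "d2": "Tumour of the occipital lobe",
    "d3": "mRNA splicing",
}
```

Under these assumptions, a query for tumour and frontal lobe returns only d1, even though d1 contains neither query phrase literally.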

The Semantic Web can also be used to augment the results of a keyword search. For example, for a search with the keyword Cephalosporin, besides showing the relevant patents, all the information about cephalosporin available in the Semantic Web (for example, information from the ontologies about the antibiotic, companies assigned patents on the antibiotic, other antibiotics that are similar, etc) can be shown to the user.

Semantic web query languages

Various languages have been proposed in recent years for querying the Semantic Web RDF data. Examples include RQL,19 SquishQL,20 TRIPLE21 and RDQL.22 Most of these query languages use a SQL-like declarative syntax to query a Semantic Web as a set of RDF triples. As an example, let us assume that in a biomedical patent Semantic Web we want to find inventor and assignee pairs who have a patent which has a term belonging to the UMLS class Molecular_Function. The query will be expressed in RDQL as follows:

All inventors and assignees that match the query criteria will be returned.

A big advantage of these query languages is that they incorporate inference as part of query answering. Thus during querying additional triples are created on-demand using inference. For example, if the triples (c1 rdfs:subClassOf c2) and (r1 rdf:type c1) are present, it can be automatically inferred that (r1 rdf:type c2) also exists (based on an RDFS rule). Thus for the above query all patents that have terms of the class Genetic_Function will also be retrieved since Genetic_Function is a subclass of Molecular_Function.
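The behaviour of such a query, including the subclass inference, can be approximated in plain Python over a list of triples. This is not RDQL and does not reproduce the query from the paper; the property names hasTerm, invented and assignedTo and all resource names are hypothetical.

```python
def subclasses_of(cls, triples):
    """Return cls together with all of its direct and indirect subclasses."""
    result, frontier = {cls}, [cls]
    while frontier:
        parent = frontier.pop()
        for s, p, o in triples:
            if p == "rdfs:subClassOf" and o == parent and s not in result:
                result.add(s)
                frontier.append(s)
    return result


def inventor_assignee_pairs(triples, umls_class):
    """Find (inventor, assignee) pairs sharing a patent that has a term
    whose type is umls_class or one of its subclasses."""
    classes = subclasses_of(umls_class, triples)
    terms = {s for s, p, o in triples if p == "rdf:type" and o in classes}
    patents = {s for s, p, o in triples if p == "hasTerm" and o in terms}
    pairs = set()
    for inventor, p, patent in triples:
        if p == "invented" and patent in patents:
            for patent2, p2, assignee in triples:
                if p2 == "assignedTo" and patent2 == patent:
                    pairs.add((inventor, assignee))
    return pairs


# Hypothetical patent Semantic Web.
TRIPLES = [
    ("Genetic_Function", "rdfs:subClassOf", "Molecular_Function"),
    ("termX", "rdf:type", "Genetic_Function"),
    ("patent1", "hasTerm", "termX"),
    ("inventorA", "invented", "patent1"),
    ("patent1", "assignedTo", "assigneeB"),
    ("termY", "rdf:type", "Organism"),
    ("patent2", "hasTerm", "termY"),
    ("inventorC", "invented", "patent2"),
    ("patent2", "assignedTo", "assigneeD"),
]
```

Here patent1 is matched only because Genetic_Function is a subclass of Molecular_Function, which mirrors the inference step described above.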

Semantic associations

The RDF query languages allow the discovery of all resources that are linked to a particular resource by an ordered set of specific relationships. For example, one can query a Semantic Web to find all resources that are linked to resource $r_1$ by the properties $p_1$ followed by $p_2$. Another option is to determine all the paths between resources $r_1$ and $r_2$ that are of length $n$. However, none of the query languages allows queries such as 'How are resources $r_1$ and $r_2$ related?' without any specification of the type of the properties or the length of the path. It is also not possible to determine relationships specified by undirected paths between two resources.

In the biomedical Semantic Web, discovering arbitrary relationships between resources is essential to discover hidden non-obvious knowledge. For example, one may wish to discover whether there is any association between a gene and a disease. In order to determine any arbitrary relationships among resources, Anyanwu and Sheth introduced the notion of semantic associations based on ρ-queries.23

Several types of semantic associations were defined. For explaining these associations, let Figure 3 represent a Semantic Web graph. The resources of the Semantic Web are the nodes of this graph and properties between two resources are represented by edges between the corresponding nodes. In the figure several resources are shown with the dashed arrows representing paths between the resources and solid arrows representing edges between the resources.
Figure 3: An example semantic web graph

* Two resources $r_1$ and $r_2$ are ρ-path-associated if there is a directed path from $r_1$ to $r_2$ or from $r_2$ to $r_1$ in the Semantic Web graph. For example, in the graph shown in Figure 3, resources $(r_4, r_9)$ and $(r_5, r_8)$ are ρ-path-associated.

* Two directed paths in the Semantic Web graph are said to be joined if they have at least one vertex in common. The common vertex is the join node. For example, the directed paths from $r_4$ to $r_9$ and from $r_8$ to $r_5$ are joined with the common vertex $r_6$. Two resources $r_1$ and $r_2$ are ρ-join-associated if there are joined paths $p_1$ and $p_2$ and either of these two conditions is satisfied: (i) $r_1$ is the origin of $p_1$ and $r_2$ is the origin of $p_2$ and (ii) $r_1$ is the terminus of $p_1$ and $r_2$ is the terminus of $p_2$. Thus in Figure 3 $(r_4, r_8)$ and $(r_5, r_9)$ are sets of ρ-join-associated resources.

* Two resources $r_1$ and $r_2$ are ρ-cp-associated if they belong to the same class or classes that have a common ancestor. To prevent meaningless associations (such as all resources belong to rdf:Resource), one can specify a strong ρ-cp-associated relation which is true if either of these two conditions is also satisfied: (i) the maximum path length from the resources to the common ancestor is below a threshold and (ii) the common ancestor is a subclass of a set of user-specified general classes called the ceiling.

* Two directed paths of length $n$ in the Semantic Web graph, $P$ and $Q$, are isomorphic if: (i) they represent the properties $p_1, p_2, \ldots, p_n$ and $q_1, q_2, \ldots, q_n$ respectively; and (ii) $\forall i,\ 1 \le i \le n:\ (p_i = q_i) \vee (p_i \subset q_i) \vee (q_i \subset p_i)$. Here $\subset$ represents the subPropertyOf relation. Two resources are ρ-iso-associated if they are the origins of isomorphic paths. For example, in Figure 3 if $p_1 \subset q_1 \wedge p_2 \subset q_2$, $r_1$ and $r_{10}$ are ρ-iso-associated.

Figure 4: Join associations between two companies in a biomedical patent semantic web. Paths from the assignees to the join nodes are shown

Two resources are said to be semantically associated if they are either ρ-path-associated or ρ-join-associated or ρ-cp-associated or ρ-iso-associated. Determining semantic associations between entities may lead to the discovery of non-obvious and unexpected relationships between the entities and thus enable the researchers to gain new insights about the knowledge space.
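The path and join associations defined above reduce to reachability tests on a directed graph. The Python sketch below is one possible reading of those two definitions, not the algorithm of Anyanwu and Sheth; the function names are hypothetical, and the small graph is an assumption that only mimics the part of Figure 3 mentioned in the text (paths from r4 to r9 and from r8 to r5 meeting at r6).

```python
def reachable(graph, start):
    """Nodes reachable from start along directed edges (start included)."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def reverse(graph):
    """Return the graph with every edge direction flipped."""
    rev = {}
    for src, targets in graph.items():
        for dst in targets:
            rev.setdefault(dst, []).append(src)
    return rev


def path_associated(graph, r1, r2):
    """True if a directed path runs from r1 to r2 or from r2 to r1."""
    return r1 != r2 and (r2 in reachable(graph, r1)
                         or r1 in reachable(graph, r2))


def _nodes_on_paths_from(graph, start):
    """Vertices on some path of at least one edge that starts at start."""
    return reachable(graph, start) if graph.get(start) else set()


def join_associated(graph, r1, r2):
    """True if paths originating at r1 and r2 share a vertex, or if
    paths terminating at r1 and r2 share a vertex."""
    rev = reverse(graph)
    same_origin = _nodes_on_paths_from(graph, r1) & _nodes_on_paths_from(graph, r2)
    same_terminus = _nodes_on_paths_from(rev, r1) & _nodes_on_paths_from(rev, r2)
    return r1 != r2 and bool(same_origin or same_terminus)


# Assumed graph: r4 -> r6 -> r9 and r8 -> r6 -> r5.
GRAPH = {"r4": ["r6"], "r8": ["r6"], "r6": ["r9", "r5"]}
```

On this assumed graph the sketch agrees with the examples in the text: (r4, r9) and (r5, r8) are path-associated, while (r4, r8) and (r5, r9) are join-associated through r6.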

In Mukherjea et al.17 semantic associations were utilised to determine arbitrary relationships between resources in a biomedical patent Semantic Web. As an example, Figure 4 shows the join associations between two companies, Pfizer and Ranbaxy. The paths from the companies to the join nodes in the Semantic Web graph are displayed. It shows that the companies are related based on patents assigned to them as well as biomedical concepts in those patents. For example, Ranbaxy is assigned a patent 6673369 which refers to the patent 6068859 of Pfizer. Similarly, the UMLS concept high blood cholesterol level is present in patents of both companies. This kind of information may be useful for the companies for discovering potential patent infringements.

Determining similarity in the Semantic Web

Several techniques have been developed to determine the similarity between terms in ontologies. Information-theoretic approaches for determining the similarity have been found to be very effective. For example, Resnik24 proposed a method to determine the similarity between two terms in a taxonomy based on the amount of information they share in common. Let $p_t$ be the probability of encountering a term $t$ or a child of the term in the taxonomy. Although Resnik considered a child term by only considering is-a links, it can be extended to links of all types. $p_t$ is monotonic as one moves up the taxonomy and will approach 1 for the root. The principle of information theory defines the information content of a term as $-\ln(p_t)$.
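The information content defined above can be computed directly once term probabilities are available. The Python sketch below assumes a tiny hypothetical is-a taxonomy with made-up annotation counts, and it scores two terms by the highest information content among their common ancestors, which is the usual statement of Resnik's measure; the paragraph above only says that the measure is based on shared information, so treat that step as an assumption.

```python
import math

# Hypothetical taxonomy (child -> parent, is-a links only) and
# hypothetical counts of how often each term is encountered.
PARENT = {
    "glioma": "tumour",
    "melanoma": "tumour",
    "tumour": "disease",
    "influenza": "disease",
}
COUNTS = {"glioma": 2, "melanoma": 2, "tumour": 1, "influenza": 3, "disease": 0}


def ancestors(term):
    """Return the term followed by all of its ancestors up to the root."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def probability(term):
    """p_t: probability of encountering the term or one of its descendants."""
    total = sum(COUNTS.values())
    covered = sum(n for t, n in COUNTS.items() if term in ancestors(t))
    return covered / total


def information_content(term):
    """Information content of a term, -ln(p_t)."""
    return -math.log(probability(term))


def resnik_similarity(t1, t2):
    """Highest information content among the common ancestors of t1 and t2."""
    common = set(ancestors(t1)) & set(ancestors(t2))
    return max(information_content(t) for t in common)
```

With these assumed counts the root has probability 1 and information content 0, so two terms whose only common ancestor is the root, such as glioma and influenza, get similarity 0, while glioma and melanoma share the more informative ancestor tumour.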

These measures have been utilised to find similarity among proteins defined in UniProt/Swiss-Prot26 and genes in Saccharomyces Genome Database.27 The Gene Ontology terms associated with the genes and the proteins are determined and the similarity between those terms is used to calculate the similarity between the corresponding genes or proteins.

While Resnik's approach of determining similarity is based on how ontology nodes are generally populated, a different approach is utilised in the Gene Ontology Categoriser.28 In this system the bio-ontologies are viewed more as combinatorially structured databases than facilities for logical inference and the discrete mathematics of finite partially ordered sets (posets) is utilised to develop data representation and algorithms appropriate for such ontologies. The objective of this system is to determine the best nodes of the Gene Ontology that summarise or categorise a given list of genes. The system determines how the genes are organised with respect to the ontology. It discovers whether they are centralised, dispersed or grouped in one or more clusters. With respect to the biological functions which make up the GO, the system tries to determine whether the genes represent a collection of more general or more specific functions, a coherent collection of functions or distinct functions.

Although identifying resources similar to a given Semantic Web resource is an unexplored research area, some of the above techniques may be modified to determine the similarity between Semantic Web resources. This would be useful, for example, to determine biomedical entities that are similar in the Semantic Web.

CONCLUSION AND FUTURE WORK

This paper has given an introduction to the exciting possibilities of a biomedical Semantic Web. It has briefly described some of the current efforts in representing biomedical ontologies and knowledge bases in Semantic Web languages and integrating multiple biomedical knowledge sources to create Semantic Webs. Information retrieval techniques for the Semantic Web that extend traditional keyword search and database querying techniques have also been explained.

However, although the vision of the biomedical Semantic Web is really grandiose, at present mostly 'toy' Semantic Webs have been developed which have limited usefulness in the real world. There are various research issues that need to be addressed in the future to make the vision of a biomedical Semantic Web a reality.

The first challenge is to develop the biomedical Semantic Web that stores most if not all of the biomedical knowledge. Since the biomedical ontologies are not up to date, data from the research publications have to be integrated into the biomedical Semantic Web. Obviously it is not possible to manually enrich the Semantic Web from the research literature. However, automatic extraction of useful information from online biomedical literature is a challenging problem because these documents are expressed in a natural language form. The first task is to recognise and classify the biological entities in the scientific text. After the biological entities are recognised, the next task is to identify the relations between these entities. Hirschman et al.29 give a good overview of the accomplishments and challenges in text mining of biomedical research publications.

Another area of concern is that although the information retrieval techniques for the Semantic Web seem to have a lot of potential, most of them are basically research prototypes and there is not much evidence of their success. Therefore the techniques need to be tested on a real-world Semantic Web. Based on experiments and user studies, the techniques may need to be modified or even new techniques discovered that enable users to effectively retrieve information and discover hidden knowledge from the Semantic Web.

The other major challenge is scalability. A biomedical Semantic Web that is commercially useful will need to store a large amount of information from multiple ontologies as well as research publications and patents. Information retrieval techniques such as the inference engine used during Semantic Web querying, as well as graph-theoretic algorithms to determine semantic associations, may not scale to such a large amount of data. Moreover, the data may not all be located centrally but may be distributed over a number of agencies and their databases. Therefore distributed algorithms may need to be developed.
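
To make the scalability concern concrete, here is a minimal sketch of one simple graph-theoretic reading of a "semantic association": a shortest chain of resources connecting two nodes in a set of triples. The triples and the function names (`build_graph`, `find_association`) are hypothetical and chosen only for illustration; this is not the algorithm of the systems cited in the paper.

```python
from collections import deque

# Hypothetical toy triples (subject, predicate, object); not from the paper.
TRIPLES = [
    ("geneA", "encodes", "proteinA"),
    ("proteinA", "participates_in", "pathwayX"),
    ("drugB", "targets", "proteinA"),
    ("pathwayX", "implicated_in", "diseaseY"),
]


def build_graph(triples):
    """Index triples as an undirected adjacency list of (predicate, neighbour)."""
    graph = {}
    for s, p, o in triples:
        graph.setdefault(s, []).append((p, o))
        graph.setdefault(o, []).append((p, s))
    return graph


def find_association(triples, start, goal):
    """Breadth-first search for a shortest chain of resources from start to goal."""
    graph = build_graph(triples)
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for _, neighbour in graph[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


print(find_association(TRIPLES, "drugB", "diseaseY"))
```

In the worst case the search visits every node and edge and holds the whole graph in memory on one machine, which is exactly what becomes impractical when the data are very large or spread across several databases.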

Hopefully, with effective solutions to these research challenges, the vision of a biomedical Semantic Web will be realised in the near future.
[Sidebar]
Date received (in revised form): 14th June 2005

[Sidebar]
Semantic Web

[Sidebar]
Ontologies

[Sidebar]
Information retrieval

[Sidebar]
Knowledge discovery

[Reference]
References
1. Berners-Lee, T., Hendler, J. and Lassila, O. (2001), 'The semantic web', Scientific American, May.
2. GeneOntology (URL: http://www.geneontology.org/).
3. UMLS (URL: http://umlsks.nlm.nih.gov).
4. Swiss-Prot (URL: http://www.ebi.ac.uk/swissprot/).
5. Resource Description Framework (URL: http://www.w3.org/1999/02/22-rdf-syntax-ns).
6. Resource Description Framework Schema (URL: http://www.w3.org/2000/01/rdf-schema).
7. Horrocks, I. (2002), 'DAML+OIL: A description logic for the semantic web', IEEE Bull. Technical Committee Data Eng., Vol. 25(1), pp. 4-9.
8. OWL Web Ontology Language (URL: http://www.w3.org/TR/owl-guide/).
9. Wroe, C., Stevens, R., Goble, C. and Ashburner, M. (2003), 'A methodology to migrate the Gene Ontology to a description logic environment using DAML+OIL', in 'Proceedings of the 8th Pacific Symposium on Biocomputing', 3rd-7th January, Hawaii.
10. Gene Ontology in OWL (URL: http://bioinfo.unice.fr/equipe/Claude.Pasquier/goterms.owl).
11. Kashyap, V. and Borgida, A. (2003), 'Representing the UMLS semantic network using OWL (or "What's in a semantic web link?")', in 'The Proceedings of the Second International Semantic Web Conference', Sanibel Island, Florida.
12. Sarkar, I., Cantor, M., Gelman, R. et al. (2003), 'Linking biomedical language information and knowledge sources: GO and UMLS', in 'Proceedings of the 8th Pacific Symposium on Biocomputing', 3rd-7th January, Hawaii, pp. 439-450.
13. Lambrix, P. and Edberg, A. (2003), 'Evaluation of ontology merging tools in bioinformatics', in 'The Proceedings of the 8th Pacific Symposium on Biocomputing', 3rd-7th January, Hawaii.
14. Protégé (URL: http://protege.stanford.edu/).
15. Chimaera (URL: http://www.ksl.stanford.edu/software/chimaera/).
16. Knublauch, H., Dameron, O. and Musen, M. (2004), 'Weaving the biomedical semantic web with the Protégé OWL plug-in', in 'The Proceedings of the Workshop on Formal Biomedical Knowledge Representation (KR-MED)', Whistler, CA.
17. Mukherjea, S., Bamba, B. and Kankar, P. (2005), 'Information retrieval and knowledge discovery utilizing a biomedical patent semantic web', IEEE Trans. Knowledge Data Eng., Vol. 17(8), pp. 1099-1110.
18. Guha, R., McCool, R. and Miller, E. (2003), 'Semantic search', in 'Proceedings of the Twelfth International World-Wide Web Conference', May, Budapest, Hungary.
19. Karvounarakis, G., Alexaki, S., Christophides, V. et al. (2002), 'RQL: A declarative query language for RDF', in 'Proceedings of the Eleventh International World-Wide Web Conference', May, Honolulu, Hawaii.
20. Miller, L., Seaborne, A. and Reggiori, A. (2002), 'Three implementations of SquishQL, a simple RDF query language', in 'The Proceedings of the 1st Semantic Web Conference', Sardinia, Italy, June.
21. Sintek, M. and Decker, S. (2002), 'TRIPLE: A query, inference and transformation language for the semantic web', in 'Proceedings of the 1st Semantic Web Conference', June, Sardinia, Italy.
22. Seaborne, A., 'RDQL: A data oriented query language for RDF models' (URL: http://www.hpl.hp.com/semweb/rdql-grammar.html).
23. Anyanwu, K. and Sheth, A. (2003), 'ρ-Queries: Enabling querying for semantic associations on the semantic web', in 'Proceedings of the Twelfth International World-Wide Web Conference', May, Budapest, Hungary.
24. Resnik, P. (1995), 'Using information content to evaluate semantic similarity in a taxonomy', in 'Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI)', pp. 448-453.
25. Lin, D. (1998), 'An information-theoretic definition of similarity', in 'The Proceedings of the International Conference on Machine Learning', San Francisco, CA, pp. 296-304.
26. Lord, C., Stevens, R., Brass, A. and Goble, C. (2003), 'Semantic similarity measures as tools for exploring the Gene Ontology', in 'The Proceedings of the 8th Pacific Symposium on Biocomputing', 3rd-7th January, Hawaii.
27. Azuaje, F. and Bodenreider, O. (2004), 'Incorporating ontology-driven similarity knowledge into functional genomics: An exploratory study', in 'The Proceedings of the IEEE Fourth Symposium on Bioinformatics and Bioengineering', Taichung, Taiwan, pp. 317-324.
28. Joslyn, C., Mniszewski, S., Fulmer, A. and Heaton, G. (2004), 'The Gene Ontology Categorizer', Bioinformatics, Vol. 20 (Suppl 1), pp. 169-177.
29. Hirschman, L., Park, J., Tsujii, J. et al. (2002), 'Accomplishments and challenges in literature data mining for biology', Bioinformatics Rev., Vol. 18(12), pp. 1553-1561.

[Author Affiliation]
Sougata Mukherjea is a Research Staff Member in IBM India Research Lab. Before joining IBM, he held research and software architect positions in companies in Silicon Valley (California) including NEC USA, BEA Systems and Verity. His research interests include information visualisation and retrieval and applications of text mining in areas such as web search and bioinformatics.
Sougata Mukherjea, IBM India Research Laboratory, Block I, Indian Institute of Technology, Hauz Khas, New Delhi - 110016, India
Tel: +91 11 5129 2138
Fax: +91 11 2686 1555
E-mail: smukherj@in.ibm.com

4. Naming very familiar people: When retrieving names is faster than retrieving semantic biographical information
Serge Brédart, Tim Brennen, Marie Delchambre, Allan McNeill, A Mike Burton. British Journal of Psychology. London: May 2005. Vol. 96 Part 2. pg. 205, 10 pgs

Abstract (Summary)

One of the most reliable findings in the literature on person identification is that semantic categorization of a face occurs more quickly than naming a face. Here we present two experiments in which participants are shown the faces of their colleagues, i.e., personally familiar people, encountered with high frequency. In each experiment, naming was faster than making a semantic classification, despite the fact that the semantic classifications were highly salient to the participants (Experiment 1: highest degree obtained; Experiment 2: nationality). The finding is consistent with models that allow for parallel access from faces to semantic information and to names, and demonstrates the need for the frequency of exposure to names to be taken into account in models of proper name processing, e.g. Burke, MacKay, Worthley and Wade (1991). [PUBLICATION ABSTRACT]

Indexing (document details)

Subjects: Names, Semantics, Information retrieval, Experiments, Identification
MeSH subjects: Adult, Analysis of Variance, Face, Female, Humans, Male, Mental Recall, Names, Reaction Time, Semantics
Document features: Tables, References
Publication title: British Journal of Psychology. London: May 2005. Vol. 96 Part 2. pg. 205, 10 pgs
Source type: Periodical
ISSN: 00071269
ProQuest document ID: 860817281
Text Word Count 5072

Assignment I, Online Searching course: searching for articles via ProQuest on INFORMATION RETRIEVAL.

5. STRATEGIC INFORMATION SEARCH STRATEGIES
J Patrick, S Spencer, B Stafford. The Gerontologist. Washington: Oct 2004. Vol. 44, Iss. 1; pg. 518, 2 pgs

Abstract (Summary)

An abstract of Patrick et al.'s study is presented. The study examines information search strategy and introduces a new measure that captures change in search strategies as information is accumulated during information search. Results indicated that search strategies were dynamic, changing as participants gained information.

Indexing (document details)

Subjects: Information retrieval, Decision analysis, Adults, Comparative studies
Author(s): J Patrick, S Spencer, B Stafford
Document types: Feature
Publication title: The Gerontologist. Washington: Oct 2004. Vol. 44, Iss. 1; pg. 518, 2 pgs
Supplement: PROGRAM ABSTRACTS: 57th Annual Scientific Meeting...
Source type: Periodical
ISSN: 00169013
ProQuest document ID: 924474541
Text Word Count 267

Full Text

J. Patrick, S. Spencer, West Virginia University Department of Psychology, Morgantown, WV; B. Stafford, West Virginia University, Morgantown, WV.

Research regarding the effect of age on decision-making processes has focused on differences in strategic information search through alternative-by-feature information matrices (e.g., Johnson, 1990, 1993) using global indexes of search strategy. We examine search strategy and introduce a new measure that captures change in search strategies as information is accumulated during information search.

Adults in three age groups (56 younger, M = 22.8 yrs; 56 middle-aged, M = 53.0 yrs; 54 older, M = 76.2 yrs) conducted matrix-based information search and made consumer decisions using computers. We found that search strategies were dynamic, changing as participants gained information. As information search progressed, participants of all ages showed decreasing Ratio of Repetition (RR) values for features and increasing values for alternatives. They typically began information search by looking at only a few features, but on many alternatives. In the last third of information search, however, participants tended to narrow their search, investigating many features but only on the most promising alternatives. Changes in search patterns as information search progressed were generally similar in younger and older adults. To examine whether this within-trial switch was related to decision quality, we conducted exploratory regression analyses using the RR measures for each of the six vincentized intervals as predictors. Only the final portion of the search exerted direct influences on decision quality. Low-quality decision makers, especially older adults, focused too quickly on a few alternatives. Thus, using measures of within-trial strategy change allows us to examine age differences and similarities in decision-making processes more precisely.
