TheDataGirl

A little blog about big data and other things

Apache Lucene for big data searches

In order to efficiently search for relevant documents, one can make use of Apache Lucene, a text search engine library. One of the many advantages of using Apache Lucene for search is that it is highly scalable, which is important when working with big data. Another advantage is that it can be used in both commercial and open source projects. By using Lucene, we can get a score indicating how similar a document is to a query. (Sonawane, 2009)
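To make the idea of a query-to-document score more concrete, here is a minimal sketch in plain Python. It is not Lucene's API and not Lucene's actual scoring formula; it is a toy TF-IDF and cosine-similarity ranking, and the names `tokenize`, `tf_idf_vectors`, `cosine` and `search` are my own assumptions for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per document, plus the IDF table."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency, so no weight is ever zero.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(tokens).items()}
               for tokens in tokenized]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, docs):
    """Return (score, document index) pairs, best match first."""
    vectors, idf = tf_idf_vectors(docs)
    # Query terms that never occur in the collection are ignored.
    q_vec = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    scored = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scored, reverse=True)
```

Calling `search("lucene search", docs)` on a small list of strings returns the documents ordered by score, with documents that share no term with the query scoring 0.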

Apache Lucene has been used in a range of proposed systems whose authors have seen potential in this well-known text search engine. Apache Lucene works similarly to other search engines such as Google. In (Sanchez & Azpilicueta, 2006), the researchers attempt to find an easier way to plow through the massive stores of data found on the web and in other repositories. The problem is that search is currently based on keywords, which limits the quality of its results.

The researchers noted that there are two main approaches to handling semantic search: Information Retrieval (IR) and semantic-based knowledge technologies (SW). Information Retrieval is more suited to sparse data which may or may not be unstructured. Its disadvantage is that it often yields shallow results where semantic relations are concerned. IR approaches to semantic search often use a lexical database such as WordNet to obtain the categories and any shallow semantic relations between terms. (Sanchez & Azpilicueta, 2006)

SW requires data to be well-structured beforehand and yields higher-level concepts. It makes use of ontologies, which apply well-defined rules to extract deeper semantic relations. Semantic retrieval techniques cannot simply be applied to the entire Web due to problems with scalability, heterogeneity, and usability. Heterogeneity means that a massive number of ontologies would have to be set up to cover a set of categories for every topic on the web. Scalability means that a very large amount of data (big data) must be handled, and that this data must first be transferred into a structured format. Usability issues can be solved if the interface is easy to use, and the authors propose that natural language queries be supported. (Sanchez & Azpilicueta, 2006)

Since Apache Lucene takes a keyword-based search approach, its results were significantly worse than those of semantic searches. To improve Apache Lucene's search functionality, a semantic lexical database such as WordNet must therefore be added to provide at least shallow semantic relations. (Sanchez & Azpilicueta, 2006)
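One simple way to picture how shallow semantic relations help a keyword search is query expansion with synonyms. The sketch below is only an illustration under my own assumptions: `SYNONYMS` is a tiny hand-written table standing in for a lexical database such as WordNet, and `expand_query` and `matches` are hypothetical helper names, not part of Lucene or of the cited paper.

```python
# Hypothetical, hand-written synonym table standing in for a lexical
# database such as WordNet (an assumption for illustration only).
SYNONYMS = {
    "car": {"automobile", "vehicle"},
    "fast": {"quick", "rapid"},
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the query terms, each followed by its known synonyms."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        # Sort the synonyms so the output order is deterministic.
        expanded.extend(sorted(synonyms.get(term, ())))
    return expanded


def matches(doc, terms):
    """Plain keyword match: true if the document contains any of the terms."""
    doc_terms = set(doc.lower().split())
    return any(term in doc_terms for term in terms)
```

With this table, the document "a quick automobile" is missed by the plain query "fast car" but found once the query is expanded, which is the kind of gain a lexical database is meant to provide.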

The limits of keyword-based search have been highlighted by many other people in the field as well, such as in (Tran, Cimiano, Rudolph, & Studer). Their implementation, designed as a possible solution to these problems, was named the XXploreKnow! system. The authors present an ontology-based IR process which is composed of four main models: the mental model, the user question model, the system resource model, and the system query model. The mental model is the information which the user is trying to find. The user question model is the user query and can be represented as a set of keywords $\{k_1, \ldots, k_n\}$, where $k_i$ is each individual keyword in the user query. The system resource model consists of the formal elements or entities of the ontology which is being used to extract the knowledge that satisfies the user request. Finally, the system query model is the question processed and returned in its formal form, using formal semantics. (Tran, Cimiano, Rudolph, & Studer)

As can be seen from the description above, the user question model must be converted to the system query model, taking a question or query made in natural language to a query expressed in a formal semantic language. To complete this task, the authors relied heavily on mapping keywords to ontology elements. In fact, the first stage of their proposed system takes each keyword and maps it to its corresponding ontology entities and concepts. To do this mapping, Apache Lucene was the search engine chosen to handle the search and indexing tasks required. The authors chose Lucene mainly for its ability to index and search, and also for its ability to handle syntactic and spelling variations. The URIs and entity labels are indexed, and Lucene's fuzzy search functionality is used to run a query for each of the keywords in the user question model. From this index, we receive the relevant ontology entities which correspond to each generated query. (Tran, Cimiano, Rudolph, & Studer)
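The keyword-to-entity mapping step can be sketched with a plain edit-distance match over entity labels. This is a simplified stand-in written in Python, not the authors' implementation and not Lucene's fuzzy search itself. The `ex:` URIs used with it, the `max_edits` threshold and the function names are all assumptions of mine.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def map_keywords(keywords, labelled_entities, max_edits=2):
    """Map each keyword to the entity URIs whose label is within max_edits.

    labelled_entities is a dict of URI -> label (hypothetical example data).
    Closest labels are listed first.
    """
    mapping = {}
    for keyword in keywords:
        hits = [(edit_distance(keyword.lower(), label.lower()), uri)
                for uri, label in labelled_entities.items()]
        mapping[keyword] = [uri for dist, uri in sorted(hits) if dist <= max_edits]
    return mapping
```

For example, with labels such as "publication" and "person", the misspelled keyword "publcation" would still map to the publication entity, which is the spelling tolerance the authors were after.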

The next stage is to analyze all of the connections between the ontology entities. A subgraph is generated based on the elements retrieved, and this graph is then traversed recursively to obtain the neighbouring elements. All of the retrieved elements within the designated range (referred to as width d) are selected and used to retrieve likely connections that satisfy the user query. The user can see a visualization of this graph and traverse it as required. This leads the user to possibly relevant documents according to how the graph was traversed. (Tran, Cimiano, Rudolph, & Studer)
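The neighbourhood exploration can be pictured as a bounded graph traversal. The paper describes a recursive traversal; the sketch below swaps in an iterative breadth-first search, which reaches the same set of elements within d hops. The adjacency-dict representation and the name `neighbourhood` are my own assumptions.

```python
from collections import deque


def neighbourhood(graph, start_entities, d):
    """Return every element within d hops of the start entities.

    graph is an adjacency dict: element -> list of connected elements.
    The result maps each reached element to its hop count from the start set.
    """
    distance = {entity: 0 for entity in start_entities}
    queue = deque(start_entities)
    while queue:
        node = queue.popleft()
        if distance[node] == d:
            continue  # range limit reached, do not expand further
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance
```

Starting from the entities matched in the previous step, a small d keeps the subgraph shown to the user manageable, while a larger d surfaces more distant connections.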

The evaluation measures used were precision, recall, and F-measure, which came to 85%, 52%, and 64% respectively when the user traverses the graph's concepts to obtain the desired relevant documents. When the system was trusted with automatically returning the relevant documents, the results were lower: 69% for precision, 43% for recall, and 53% for F-measure. The authors suggest that adding lexical knowledge would improve recall by helping to assign appropriate ontology elements to the individual keywords. (Tran, Cimiano, Rudolph, & Studer)
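As a quick check on how these three figures relate, the F-measure (in its usual balanced F1 form, which I am assuming is the one used) is the harmonic mean of precision and recall. A small Python sketch:

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported above, expressed as fractions.
user_guided = f_measure(0.85, 0.52)
automatic = f_measure(0.69, 0.43)
```

Using the rounded percentages quoted above, this gives roughly 0.645 and 0.530, which is in line with the reported 64% and 53%.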

(Rocha, Schwabe, & Poggi de Aragao, 2004) present a hybrid approach in which several traditional search methods are applied to a closed domain environment in order to retrieve better results. In their work they made use of the Lucene search engine, since it provided them with full support for the development of a semantic searcher. It also helped them integrate an ontology and knowledge base, assign the associated weights to each of the connections that make up the neighbourhood of entities, and integrate the search with the rest of their system.

Interestingly, the system also catered for integrating ontologies in different languages, since it converts each ontology to a common, internal ontology format upon receiving the ontology structure. The ideal environment for their proposed system is one in which the individual concepts of the ontologies provide detailed textual information, and it is believed that having such ontologies would enable the system to achieve great results in all respective domains. Although they obtained great results when a detailed ontology was provided, the authors have designed the framework in such a way that it is easily extensible, and they hope to experiment with different weight formulae and other configurations for better results. (Rocha, Schwabe, & Poggi de Aragao, 2004)

An interesting point raised as a shortcoming of the system was its limited support for hybrid queries. The authors wish to extend their system by allowing the user to input a query which contains both query keywords and concept keywords. Taking a recipe search as an example, a hybrid query would contain concept keywords such as Ingredient and Method alongside ordinary words. The Ingredient and Method parts of the query are the concept keywords, while the other words define the actual text that we are looking for.

Apache Lucene has a lot of potential in the field of big data and I will definitely be revisiting this topic in the future to learn more about its use for Big Data searches.

 

References

Rocha, C., Schwabe, D., & Poggi de Aragao, M. (2004). A Hybrid Approach for Searching in the Semantic Web. Rua Marquês de São Vicente, Dept. of Informatics, PUC-Rio, New York.

Sanchez, M. F., & Azpilicueta, P. C. (2006). Semantically enhanced Information Retrieval: an ontology-based approach. Retrieved February 22, 2014, from mavir2006: http://mavir2006.mavir.net/docs/MFernandez-semanticIR.pdf

Sonawane, A. (2009, August 18). Using Apache Lucene to search text. Retrieved November 22, 2013, from IBM developerWorks: http://www.ibm.com/developerworks/library/os-apache-lucenesearch/

Tran, T., Cimiano, P., Rudolph, S., & Studer, R. (n.d.). Ontology-based Interpretation of Keywords for Semantic Search. Institute AIFB. Karlsruhe: Institute AIFB.

 

 

 
