Browsing by Author "Pembe, Fatma Canan."
A linguistically motivated information retrieval system for Turkish
(Thesis (M.S.), Bogazici University, Institute for Graduate Studies in Science and Engineering, 2004)
Pembe, Fatma Canan; Say, Ahmet Celal Cem.
Information retrieval (IR) has become an important application in today's computer world because of the great increase in the amount of web-based documents and the widespread use of the Internet. However, the classical "bag of words" approach no longer meets user expectations adequately. In this context, the use of natural language processing (NLP) techniques comes to mind. In this thesis, we investigate whether NLP techniques can improve the effectiveness of information retrieval in Turkish. We implemented a linguistically motivated information retrieval system called TURNA (TUrkish information Retrieval engine based on Natural language Analysis). The system uses knowledge from three levels of natural language processing in document and query processing: the morphological, syntactic and lexico-semantic levels. Different combinations of these NLP techniques are tested on a set of Turkish documents and queries. The results are evaluated in terms of precision and recall. It is shown that natural language processing techniques, especially stemming and the use of syntactic head-modifier pairs, can improve information retrieval effectiveness in Turkish.

Automated query-biased and structure-preserving document summarization for web search tasks
(Thesis (Ph.D.), Bogazici University, Institute for Graduate Studies in Science and Engineering, 2010)
Pembe, Fatma Canan; Güngör, Tunga.
With the drastic increase of available information sources on the Internet, people with different backgrounds around the world share the same problem: locating useful information for their actual needs. Search engines provide a means for users to locate documents on the Web via queries.
However, users still have to perform the sifting process themselves; i.e., they must decide the relevance of each document with respect to their actual information needs. At this point, automatic summarization techniques can complement the task of search engines. Currently available search engines, such as Google and AltaVista, show only a limited capability in summarizing Web documents; e.g., they display only two or three lines of text fragments, consisting of the query words and their surrounding text, as the summary. In the literature, most of the research in automatic summarization has focused on creating general-purpose summaries without considering user needs. Also, summarization approaches have mostly treated a document as a flat sequence of sentences and ignored the structure within the document. In the summarization literature, the effects of query-biased techniques and of document structure have been considered in only a few studies, and they have been investigated separately. This research is distinguished from previous work by combining these two aspects in a coherent framework. In this thesis, we propose a novel summarization approach for Web search, i.e., query-biased and structure-preserving document summarization. The proposed system consists of two main stages. The first stage is the structural processing of Web documents in order to extract their section and subsection hierarchy together with the corresponding headings and subheadings. A document in the system is represented as an ordered tree of headings, subheadings and other text units. First, we developed a rule-based approach based on heuristics and HTML Document Object Model tree processing. Then, we developed a machine learning approach based on the tree representation using support vector machine (SVM) and perceptron algorithms. The methods were evaluated based on the accuracy of heading extraction and hierarchy extraction.
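As a rough illustration of the ordered-tree representation described above, the following Python sketch builds a heading hierarchy from explicit h1 to h6 tags only. This is a simplifying assumption: the thesis relies on heuristics and DOM tree processing, which are not detailed in the abstract, and the names Node, HeadingTreeBuilder and build_heading_tree are hypothetical.

```python
from html.parser import HTMLParser


class Node:
    """One section in the ordered tree (hypothetical structure)."""

    def __init__(self, level, title):
        self.level = level      # 0 for the root, 1..6 for h1..h6
        self.title = title
        self.text = []          # text units directly under this heading
        self.children = []      # subsections, in document order


class HeadingTreeBuilder(HTMLParser):
    # Assumption: headings are marked with explicit h1..h6 tags.
    HEADINGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}

    def __init__(self):
        super().__init__()
        self.root = Node(0, "ROOT")
        self.stack = [self.root]   # path from the root to the current section
        self.in_heading = None     # level of the heading being read, if any
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.in_heading = self.HEADINGS[tag]
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self.in_heading is not None:
            node = Node(self.in_heading, "".join(self.buffer).strip())
            # Close every open section at the same or a deeper level.
            while self.stack[-1].level >= node.level:
                self.stack.pop()
            self.stack[-1].children.append(node)
            self.stack.append(node)
            self.in_heading = None

    def handle_data(self, data):
        if self.in_heading is not None:
            self.buffer.append(data)
        elif data.strip():
            # Non-heading text is attached to the innermost open section.
            self.stack[-1].text.append(data.strip())


def build_heading_tree(html):
    builder = HeadingTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Real Web pages often mark headings with fonts or layout rather than heading tags, which is presumably why the thesis combines heuristics with a learned model; this sketch does not attempt that.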
The second stage of the research is to develop automatic summarization methods that utilize the document structures obtained in the first stage. In the proposed method, the summary sentences are extracted in a query-biased way based on two levels of scoring: sentence scoring and section scoring. Document structure is utilized both in the summarization process and in the output summaries. The performance of the proposed system has been determined using several task-based evaluations. These include information retrieval tasks in which the summaries are actually used. The results of the experiments on Turkish and English documents show that the summaries of the proposed system are superior to Google extracts and to unstructured query-biased summaries of the same size in terms of accuracy, with reasonable judgment times. User ratings confirm that users also find query-biased and structure-preserving summaries more useful.
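The two-level scoring idea can be sketched as follows. The abstract does not give the actual scoring formulas, so everything here is an assumption made for illustration: the term-overlap score, the section_weight parameter, and the function names tokenize, overlap and summarize are all hypothetical.

```python
import re


def tokenize(text):
    # Lowercased word tokens; no stemming (an assumption of this sketch).
    return re.findall(r"\w+", text.lower())


def overlap(query_terms, text):
    # Fraction of tokens in the text that are query terms (hypothetical score).
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in query_terms) / len(tokens)


def summarize(sections, query, max_sentences=3, section_weight=0.5):
    """sections: list of (heading, [sentence, ...]) in document order."""
    query_terms = set(tokenize(query))
    scored = []
    for sec_idx, (heading, sentences) in enumerate(sections):
        # Section-level score: query overlap of the heading plus its text.
        section_score = overlap(query_terms, heading + " " + " ".join(sentences))
        for sent_idx, sentence in enumerate(sentences):
            # Sentence-level score, boosted by the score of its section.
            score = overlap(query_terms, sentence) + section_weight * section_score
            scored.append((score, sec_idx, sent_idx))

    # Keep the best sentences, then restore document order.
    top = sorted(scored, key=lambda item: -item[0])[:max_sentences]
    top.sort(key=lambda item: (item[1], item[2]))

    # Group the selected sentences under their headings so that the
    # output keeps the document structure.
    summary = []
    last_section = None
    for _, sec_idx, sent_idx in top:
        heading, sentences = sections[sec_idx]
        if sec_idx != last_section:
            summary.append((heading, []))
            last_section = sec_idx
        summary[-1][1].append(sentences[sent_idx])
    return summary
```

The output is a list of (heading, sentences) pairs rather than a flat list of sentences, which mirrors the structure-preserving aspect described in the abstract.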