Development of a web service for searching files by keywords
Abstract
Incoming article date: 02.12.2020
The subject of this research is the development of a service that searches user files for a given set of keywords with parameters. The available approaches to this problem were studied, and the most relevant one was selected. The service searches inside files with text content in order to automate the selection of the desired files from the entire set. It is based on Porter's algorithm and uses text stemming to obtain more accurate results: the search is performed on the stem of a word, taking morphology into account. During morphological parsing, suffixes and endings are cut off to find the base common to all grammatical forms of the word. As a result, the service searches not only for the given keywords but also for their word forms. It can also search for several sets of keywords at once, with each set analyzed separately. In addition, ranges of numeric values can be specified as search criteria. A distinctive feature of the service is that sets of keywords are searched for together in nearby paragraphs, within a range of -20 to +20 words of each other, so that the context in which they appear in the text is taken into account. The service ranks the found documents by how well they match the search criteria. Files in the basic formats doc, xls, pdf, and txt are processed. The service runs on a Linux platform under the Apache web server. Free software tools were used for development.
Keywords: search engine, document analysis, stemming, Porter's algorithm, word forms, morphology, arithmetic mean of percent, web service
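The search idea described in the abstract can be illustrated with a minimal sketch. It is not the service's actual code. Several points are assumptions made only for illustration: the stemmer is a crude suffix stripper standing in for Porter's algorithm, the ±20-word window is applied to the keywords within one set, and a document's score is taken as the arithmetic mean of the per-set match shares. All function names (crude_stem, set_score, document_score, rank, has_number_in_range) are hypothetical.

```python
import re

# Hypothetical suffix list: a crude stand-in, NOT Porter's algorithm.
SUFFIXES = ("ingly", "edly", "ing", "ed", "es", "ly", "s")


def crude_stem(word):
    """Lowercase a word and strip at most one common English suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Split text into alphabetic words and numbers."""
    return re.findall(r"[A-Za-z]+|\d+(?:\.\d+)?", text)


def set_score(stems, keyword_set, window=20):
    """Best share of the set's keywords found within `window` words of one anchor."""
    wanted = {crude_stem(k) for k in keyword_set}
    positions = {w: [i for i, s in enumerate(stems) if s == w] for w in wanted}
    best = 0.0
    for anchor_word in wanted:
        for anchor in positions[anchor_word]:
            near = sum(
                1
                for w in wanted
                if any(abs(p - anchor) <= window for p in positions[w])
            )
            best = max(best, near / len(wanted))
    return best


def document_score(text, keyword_sets, window=20):
    """Assumed ranking measure: arithmetic mean of the per-set scores."""
    stems = [crude_stem(t) for t in tokenize(text)]
    scores = [set_score(stems, ks, window) for ks in keyword_sets]
    return sum(scores) / len(scores) if scores else 0.0


def rank(documents, keyword_sets, window=20):
    """Return (name, score) pairs sorted from best to worst match."""
    scored = [
        (name, document_score(text, keyword_sets, window))
        for name, text in documents.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def has_number_in_range(text, low, high):
    """Check whether any number in the text lies in the closed range [low, high]."""
    numbers = re.findall(r"\d+(?:\.\d+)?", text)
    return any(low <= float(n) <= high for n in numbers)


if __name__ == "__main__":
    docs = {
        "a.txt": "The contract was signed and the payments were delayed by 30 days.",
        "c.txt": "Nothing relevant in this file.",
    }
    for name, score in rank(docs, [["contract", "payment"]]):
        print(name, score, has_number_in_range(docs[name], 10, 50))
```

In this sketch, "payments" and "contract" in a.txt match the keyword set because both reduce to the same stems as the query words and lie within 20 words of each other. A real implementation would replace crude_stem with a proper Porter (or Snowball) stemmer for the target language and would first extract text from doc, xls, and pdf files.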