DEVELOPMENT OF AN LSA-BASED ALGORITHM FOR RELATED DATA SEARCH AND ITS SOFTWARE IMPLEMENTATION

This article considers the theoretical basis of data search in large ordered data arrays based on the context of the search request and the tracking of semantic relationships. The first steps towards a practical implementation of this task are also proposed. A simple program has been developed to check the author's ideas. All the research was carried out on the VK (VKontakte) social network (http://vk.com), and the internal VK API was used as the data retrieval tool. The final results indicate that VK content offers many opportunities to be made more useful and searchable. This property can be used to create a new, more user-friendly way to search for and obtain important data, first of all buying and selling information, from many kinds of data sources (official pages, users' profiles, etc.). Such a feature has never been presented (and probably will not be) in other social networks like Facebook or Instagram.


Introduction
Search engines that search by keywords provide access to billions of indexed web pages for thousands of users. Such phenomena as polysemy (one word has several meanings) and synonymy (several words have one meaning) increase the number of irrelevant results returned by any search engine.
With the ever-increasing number of sites, there is also an increasing need for careful analysis of the content of Internet documents in order to minimize the chance of producing irrelevant results. Semantic Web technologies provide a way to solve this problem [1].
The aim of this research is to examine one of the existing semantic search technologies and to discuss the possibility of adapting it for use in social media as a large array of disordered data.
In this section, the author describes the main papers and articles that had a major impact on the formation of the topic of this article.
First of all, in the article [2] the author presents the idea of contextual search, defines the basic concepts and provides a simplified guide to action. The author also provides an initial analysis of information resources for the search of linked data. Below, the idea from that article is called "the original algorithm". At the end of that article there is a postscript: "Some of the algorithm improvement ways are to improve the working with social networks...". The author of the present article was interested in this proposal to expand the application area of "the original algorithm" and, at the same time, to improve it.
The authors of the article [3] investigated the problem that social networks have two mutually exclusive characteristics, which prevents search algorithms from supporting both of them simultaneously. To solve this problem, the authors developed a framework that provides K-query in real time. This system is based on a rating function that includes relevance, social significance and the similarity of the text (i.e. whether the same words are written in two different requests).
The authors of the article [4] propose an approach that addresses the issue of how the members of an enclosed space (group) can find the shortest path to the desired object using only their closest contacts. The authors modeled their studies within the real internal network of university students. It seems to us that the algorithm described in that article may be useful here, for example, if its contacts (people) are replaced with words (lexical search components).
The developed algorithm is based on a combination of latent semantic analysis, or LSA (described in [5,6]), and frequency analysis (described in [7], improved in [8]). These methods are the most commonly used for web data mining (according to [9]): they identify and find dependencies by 'understanding' the obtained data [10].
This is a simpler way to complete the task in the shortest time than, for example, the construction of a single semantic network based on the analysis of graphs of dependencies between the lexemes of the text [11], or the creation of such a network [12] using the Dice coefficient [13].
Moreover, something similar to our project was developed and described in [14]: a recommender system that can automatically provide annotations to help the user. The system can identify the topics discussed within an article, which are worked out by semantic approaches with Latent Semantic Analysis (LSA) and WordNet [15].

Materials and Methods
The algorithm is aimed at searching for relevant data in the social network VK (VKontakte) based on context relationships between words.
One of the possible options for a semantic representation is a structure consisting of "text facts" [16]. So, the idea of the algorithm is to track where a specific result comes from. It may be, for example, a post of an ordinary user, a post of a profile public page (for example, a post on the official IKEA page that has information for the word "bed"), or a post of a non-profile page (an entry from a page about apartment repair where this word occurs).
In the literature there are two methods of analyzing big texts: the holistic (h) method and the component, or analytic (componential) (c), method [17,18]. Each method has its own pros and cons [17,19], with the holistic one having more, so its idea is used in the project to some extent.
Tracking is done by "reading" the name of the post source and then checking whether all the keywords in the title are in a predetermined word dictionary. If they are, the entry is left for further processing. Next, a number n is defined; let n = 20. The next step is "reading" the whole text of the entry and checking whether all the keywords are in it. If they are, 20 words to the right and to the left of each keyword are taken. Thus a neighborhood (i-n, i+n) is formed, where i is the serial number of any search word from the array of all keywords S, and the words in it are checked against the table of contexts. If the following keyword is closer than 20 words to the previous one, the "halves" are combined.
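The neighborhood step described above can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names are hypothetical, the text is assumed to be already tokenized, and neighborhoods are merged whenever they overlap, which is a slightly broader rule than "closer than 20 words".

```python
def keyword_windows(words, keywords, n=20):
    """Return merged (start, end) index ranges around every keyword occurrence.

    words    -- tokenized post text (list of strings)
    keywords -- set of query words
    n        -- half-width of the neighborhood (the paper uses n = 20)
    """
    spans = []
    for i, word in enumerate(words):
        if word in keywords:
            start, end = max(0, i - n), min(len(words) - 1, i + n)
            # Assumption: overlapping neighborhoods are joined into one span.
            if spans and start <= spans[-1][1]:
                spans[-1] = (spans[-1][0], end)
            else:
                spans.append((start, end))
    return spans


def context_words(words, keywords, n=20):
    """Collect the non-keyword words that fall inside the merged neighborhoods."""
    found = []
    for start, end in keyword_windows(words, keywords, n):
        found.extend(w for w in words[start:end + 1] if w not in keywords)
    return found
```

The words returned by context_words are the candidates that would then be looked up in the table of contexts.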
Finally, the mechanism checks which words, how many times and for which keyword are included in the table of contexts. On the basis of this calculation, the user concludes what kind of record this is (an announcement of the sale of goods, a review or an ordinary journalistic post).
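The counting against the table of contexts might look like the sketch below. The data layout (a dictionary from keyword to a set of associated words) and the function name are assumptions made for illustration; the paper does not specify how the table is stored.

```python
from collections import Counter


def context_hits(neighbours, context_table):
    """Count how often each context-table word occurs near each keyword.

    neighbours    -- dict: keyword -> list of words found in its neighborhood
    context_table -- dict: keyword -> set of words associated with it (assumed layout)
    """
    hits = Counter()
    for keyword, words in neighbours.items():
        known = context_table.get(keyword, set())
        for word in words:
            if word in known:
                hits[(keyword, word)] += 1
    return hits
```

The resulting counts per (keyword, word) pair are what the user would inspect to decide whether a post is a sale announcement, a review or an ordinary post.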

1. The description and the explanation of the algorithm
The following is the basic idea of the algorithm for searching dependencies between data, taken from [2]. The description is given in pseudocode because the concrete implementation (programming language) will depend on the available resources.
MainFunction ()
    Query = enterTheData (keywords);
    normalizeQuery (Query); (1)
    remember the number of gotten results;
    ==> loop for each result from the array:
        parseTheResult (results); (8)
    displayResult();
    if necessary, repeat what is written above for the same set, but with a different set of keywords; (9)
    displayUpdatedResult(); (10)
--- end of MainFunction()
Explanation of certain subfunctions:
1) normalizeQuery verifies and corrects any request entered by a user, for example by eliminating insignificant words and punctuation marks, as well as by applying certain specific restrictions for VKontakte (e.g. "Safe Search").
2) dataSearch performs the standard VKontakte search on request, but the results are not output to the user; instead they are passed "inside" the program, where they are further processed (e.g. leaving only the text of the post and cutting off the creation date, the name of the source, etc.), including by means of application-specific VK API requests (VK API is a set of proprietary tools and functions for working with the VKontakte social network from third-party software).
3) parseTheResult. The text of the particular post (the search result) selected in the previous step is processed. By analyzing the words in the text, the software tries to understand what it is about.
4) separateWordsFromTheQuery: all request keywords are selected from the text and returned as a list.
5) entryIndexesList returns a table of pairs, that is, the list of entry indexes of every word from the request (4).
6) createTheNWL: for each index in (5), the number and the variety of keywords that are likely to be in the given neighborhood of the index are determined.
7) the union: steps (2)-(6) are repeated for each search result, and the findings are merged into a single list, from which duplicate lines are removed.
8) sort: the records in the list are sorted in alphabetical order for easier perception by the user.
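Steps 7) and 8) above can be illustrated with a short sketch. It assumes that each search result has already been reduced to a list of neighborhood words; the function name is hypothetical, and keeping a count for each merged duplicate is an illustrative choice.

```python
from collections import Counter


def union_neighbourhoods(per_result_words):
    """Merge per-result word lists into one sorted list of (word, count) pairs.

    per_result_words -- iterable of word lists, one list per search result
    """
    counts = Counter()
    for words in per_result_words:
        counts.update(words)
    # Duplicate lines collapse into a single entry; entries are sorted alphabetically.
    return sorted(counts.items())
```

Sorting the (word, count) pairs orders them by word first, which gives the alphabetical listing described in step 8).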

Using the results of the algorithm work
The processing results are displayed to the user visually, as in the "pure" VKontakte search, but they are divided into groups: advertising separately and reviews separately (posts of social network users in which they describe goods).
The algorithm saves the list of results for the word in the form of a table: "the word (the term that is looked for), the number, where it is found (the list of the results containing the word, in the form of links)". The next time, when the user searches for another product, this table is compared with the table for the new product. If there are identical lines, there is a potentially useful link. The algorithm will output the link to the "linkholder" product (i.e. the product that was sought earlier) next to the results of the new search (similar to the mechanism of contextual advertising). As a result, the user does not need to look for the original product again.
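The comparison of the saved table with the table for a new product can be sketched as a set intersection. The storage format (a dictionary from an earlier search term to a set of result links) and all names here are assumptions for illustration only.

```python
def shared_links(saved_tables, new_links):
    """Find earlier search terms whose result links also occur in a new search.

    saved_tables -- dict: earlier term -> set of result links (assumed layout)
    new_links    -- set of result links for the new term
    Returns a dict: earlier term ("linkholder") -> sorted list of common links.
    """
    suggestions = {}
    for term, links in saved_tables.items():
        common = links & new_links
        if common:
            suggestions[term] = sorted(common)
    return suggestions
```

Every key of the returned dictionary is a candidate "linkholder" product that could be shown next to the new search results.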
To simplify the work, the author will create the search mechanism only for nouns, as nouns uniquely characterize the object and require fewer grammatical and semantic relationships.

Computer Sciences
Accordingly, the thesaurus will also consist only of nouns and will be much shorter than one for verbs. For example, since only a finite number of actions can be performed with a particular item, the noun "bed" has a relatively small set of verbs ("break", "buy", "cover"...), while the verb "tinker", on the contrary, has hundreds of nouns: the names of objects that can be tinkered with.

Experimental procedures
1. The way of the algorithm realization
The algorithm is applied to the VKontakte (VK) social network, which contains a lot of randomly scattered data that are also often duplicated. It appears to be an "ideal" random model, which allows research objects to be chosen freely and provides their uniqueness (relative originality) on a large scale.
As for the dictionary of synonyms (thesaurus), at the beginning of the work on the thesis the author is going to take the one used in Microsoft Word, select from it about 1000 random word-concepts (nouns) with pairs of matches and interpretations, and use this "piece" to emulate the search process.
A context table is a list of words (not all of them, but the most common ones, which is enough to test the model) that are mentally associated with a specific noun. Above all, this table is needed to analyze whether a suitable search result has been found for the request (relevance). The author is going to build it by combining the resources of several common dictionaries (e.g. from the company ABBYY). To speed up the work, it is limited to the 100-150 most popular nouns in the Russian language (the list is available in Wikipedia).
It is planned to receive data via the VK API. Because millions of new records are created every day and a quarter of them are removed, the information eventually becomes obsolete and the algorithm starts to give false results. Therefore, the author is going to update the list of output data once a month.

2. Practical implementation of the algorithm
For the practical implementation of the developed algorithm, a software application has been written. To date (September 2017), the first, most significant and labor-intensive part of the work has been completed. This includes the search in the VK social network, obtaining the data and their primary processing, and finding numerous links between words in the search results.
At this stage of development, the program can connect to VK servers, execute a search query, obtain results using the VK API and carry out their further processing (by means of the program itself).
After the program is launched, the user is prompted to log in by entering their credentials in the top left corner. After a successful login, the login form changes to the VK home page; this can be ignored. Also, the value of the VK access token obtained from the VK API is displayed in the box on top. It was used by the author of the program during the debugging stage, and it was decided not to remove it (Fig. 1).

Fig. 1. Our VK token is displayed
Now the user can work with the program. To the right of the login form, there are 4 buttons for receiving a list of categories of VK communities and a list of groups in each category, as well as for saving them to an external text file.
Just below, there are controls for working with a search query: a text box for entering the search query, a button for getting the initial results of the query and a button for saving them to a text file.
Also on the right, in the second "column", there are buttons for processing the results of the query according to the algorithm outlined in this paper. The labels on the buttons match the English names of the algorithm stages.
On top, there is a text box for debugging (monitoring the connection with VK and the progress of the algorithm), where both intermediate and final results are displayed.
The rest of the program window is occupied by lists in which the results of queries to the VK API are displayed in tabular form (Fig. 2).

Fig. 2. Program interface at the starting point
To start, let's obtain a list of groups: click the "GetCategList" button and, after waiting a few seconds, get a list of community categories (Fig. 3).

Fig. 3. VK groups categories list
Then it is possible to obtain a list of specific communities in each category. This is done with the "GetGroupsList" button (Fig. 4). Data generation is complete. Now the user can start working directly with the algorithm itself. The first step is entering a search query, «жареные грибы» ("fried mushrooms"). Let's click on the "NormalizeQuery" button to normalize and stem the entered request (Fig. 5).

Fig. 5. Query normalization
After normalizing and stemming the query, it is possible to begin searching and getting the results (in fact, the query does not have to be normalized, but then the search results will be less accurate). The search is carried out with the "GetQueryData" button. The results are listed on the top right.
The received texts of the search results should also be normalized (but not stemmed). This is done with the "CleanQueryData" button, which outputs the normalized texts to the list at the bottom right (Fig. 6). It is essentially a copy of the top list, but with part of the data changed.
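A minimal sketch of text normalization without stemming is given below. It is not the program's code (the program is written in C# and uses StemmersNet for stemming the query); the stop-word list and the function name are illustrative assumptions.

```python
import re

# Illustrative stop-word list only; a real list would be much longer.
STOP_WORDS = {"и", "в", "на", "с", "the", "a", "of"}


def normalize(text, stop_words=STOP_WORDS):
    """Lowercase the text, drop punctuation and remove insignificant words."""
    # \w+ matches runs of letters, digits and underscores, including Cyrillic letters.
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stop_words]
```

For example, normalize("Жареные грибы, с луком!") keeps the three content words and drops the preposition and the punctuation.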

Fig. 6. The search results have been normalized now
Let's use the "BuildTermSet" button to build a list of terms (individual words) from the text of each search result (the "post"). The result of this step (one of the intermediate stages of the algorithm) is displayed in the text box on the top. This opportunity to view intermediate results makes it possible to monitor the progress of the algorithm and, if something goes wrong, to stop and start over. Thanks to this feature, the user can also notice posts that are intrusive (Fig. 7).

Fig. 7. Building a term set
The next step is the creation of a list of indexes of the occurrences of each query word in the text of each post. This is done with the "QueryIndexList" button, and the results are displayed in the same text box, under the results of the previous step (Fig. 8).
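The index list can be sketched as below. Since the query is stemmed while the post texts are only normalized, the sketch assumes that a query stem matches any token that starts with it; this matching rule, the function name and the example stems are assumptions, not details given in the paper.

```python
def query_index_list(post_tokens, query_stems):
    """Return a dict: query stem -> list of positions where it occurs in the post.

    post_tokens -- normalized (not stemmed) words of one post
    query_stems -- stemmed query words
    """
    positions = {stem: [] for stem in query_stems}
    for i, token in enumerate(post_tokens):
        for stem in query_stems:
            # Assumption: a stem matches every token that begins with it.
            if token.startswith(stem):
                positions[stem].append(i)
    return positions
```

The positions returned here are the indexes around which the neighborhoods are then built.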

Fig. 8. Creating a list of the terms indexes
The construction of a neighborhood of each index and the retrieval of the words included in each neighborhood are performed with the "BuildAroundSet" button. The results are also displayed in the text box below the results of the previous step (Fig. 9).

Fig. 9. Building an around set of each term
The union of intermediate neighborhoods and the counting of the number of entries are carried out with the "Union" button. The results are also displayed in the text box below the results of the previous step.
The results of this operation are the final results of the algorithm (at the current stage of development) (Fig. 10).

Fig. 10. Counting the final results
In addition, the data obtained from the VK API, that is, the content of the lists with the categories of groups, the communities in each category and the initial results of the search, can be written to a text file and saved to disk. These actions are performed by clicking the "MakeCatList", "MakeGroupsList" and "MakePostsList" buttons, respectively. A message is displayed if each file is saved successfully.
It should be noted that if the user's computer does not have an Internet connection and/or the ability to connect to the VK server, then all the steps described here will be impossible. Instead, a message will be displayed at the very beginning of the work.

Results
The program is written in C#. The connection to the social network VK and the primary processing of the search query are carried out using the Nemiro.OAuth and StemmersNet libraries, respectively.
During testing at this stage, the program showed good results in general: about 80 % of the results were fully relevant.
The results were determined as follows. First, the number x was set: the maximum number of obtained results. Then the text of each result was read by the researcher (the author of this article), and on the basis of what was written in the text a conclusion was made as to whether this result is a priori (that is, before the start of the algorithm's work) relevant, partially relevant or irrelevant. For the example "fried mushrooms" taken above, a source containing a recipe for cooking fried mushrooms is considered relevant; a source containing a recipe for something fried but not mushrooms, and/or for some dish of mushrooms that is not fried, is considered partially relevant; and a result that contains neither recipes for anything fried nor recipes with mushrooms (or in which the search words from the query are used in another context) is considered irrelevant (for example, a text containing the phrases "smelled of something fried" and "mushrooms grew in the autumn forest").
It should be borne in mind that the source text may contain, for example, two recipes, one of which meets the criteria and the other does not. Or both recipes may meet the criteria only partially (not all the words of the query are present). In this case, the source is considered partially relevant.
Let the number of results that are considered adequate after being viewed (by a human, not by a computer) be denoted as $x_0$. The percentage of such results is calculated as $\frac{x_0}{x} \cdot 100\,\%$.
Next, the algorithm is executed; it determines more precisely which words from the query occur in each of the results, as well as which words are contained in the vicinity of each of the query words. According to the data obtained, all search results are sorted into four groups: results containing all the words of the query, results containing several words of the query, results containing only one query word, and results that do not contain words from the query (let's call such results "noise"; their appearance is due to the peculiarities of the search in VK). Let's denote the number of results in each group as a-d, respectively. The percentage of successful results of the algorithm is then calculated from these values, and the result is rounded to hundredths.
(2018), «EUREKA: Physics and Engineering» Number Computer Sciences
In total, ten experiments were conducted during the development of the program, with a different number of results obtained from the search algorithm. The obtained data are shown in Table 1. In parallel with the work of the algorithm, a search was carried out directly on the VK site in order to compare the success of the author's work with the original search. The proportion of relevant results of the internal VK search was estimated by the author manually (by reading the text of the results), and the value obtained is given in the last column of Table 1.
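The manual relevance estimate and the sorting of results into the four groups a-d described above can be sketched as follows. This is an illustration under stated assumptions: the percentage is taken as the share of adequate results x0 in x, results are assumed to be lists of normalized words that are compared with the query words exactly, and the function names are hypothetical. The formula for the algorithm's success percentage is not reproduced here.

```python
def manual_relevance_percent(x0, x):
    """Share of manually confirmed results, in percent, rounded to hundredths."""
    return round(100.0 * x0 / x, 2)


def group_by_query_coverage(results, query_terms):
    """Count results containing all (a), several (b), one (c) or no (d) query words.

    results     -- list of token lists, one per search result
    query_terms -- words of the query
    """
    terms = set(query_terms)
    groups = {"a": 0, "b": 0, "c": 0, "d": 0}
    for tokens in results:
        present = len(terms & set(tokens))
        if present == len(terms):
            groups["a"] += 1      # all query words are present
        elif present == 0:
            groups["d"] += 1      # "noise": no query words at all
        elif present == 1:
            groups["c"] += 1      # exactly one query word
        else:
            groups["b"] += 1      # several, but not all, query words
    return groups
```

For a query such as "wooden carved bed", a post containing only "wooden bed" would fall into group b, and a post containing only "bed" into group c.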
Further development of the program is seen as the implementation of the second part of the algorithm: applying tables with ontologies, synonyms, etc. to the found results, tracking the degree of dependence between data, and cross-community search in VK (based on the same algorithm, but with changed search domains and reduced sampling). The author is also looking for ways to increase the relevance of the results.

Discussion
As mentioned above, the program demonstrated a percentage of relevant results sufficient for initial routine work.
However, it should be taken into account that there are search results (texts of VK entries) in which not all the words from the search query are present, but only a few or only one. For example, for the request "wooden carved bed", the results also included texts containing only the words "wooden bed" or "carved bed". If such "incomplete" texts are ignored, the relevance of the results decreases to 65-70 %.
Such a low percentage is due to the fact that the social network VK contains a large number of objects (posts, communities, persons...) that were created, firstly, artificially and, secondly, solely for spam, for artificially inflating attendance figures, etc. The texts of such objects usually contain a large, meaningless set of words that are partly completely unrelated, among which may be the ones specified in the search query.

Table 1
Results of the algorithm