Position Papers of the 2014 Federated Conference on Computer Science and Information Systems

Abstract: This article presents the main results of a pilot study of approaches to subject information search based on automated semantic processing of mass scientific and technical data. The authors focus on the technology of building and qualifying search queries, followed by filtering and ranking of the search data. The software architecture, specific features of subject search, and application of the research results are also described.


I. INTRODUCTION
New efficient methods of scientific knowledge search and synthesis (in particular, for breakthrough technologies and innovative ideas in economics, science, and education) are among the top research and development targets in the field of information technology. The project "Intelligent Distributed Information Management System for Innovations in Science and Education", supported by the Russian Foundation of Basic Research, aims to solve this problem. This article presents the main results of a pilot study of approaches to subject information search based on automated semantic processing of mass scientific and technical data.

II. SPECIFIC FEATURES OF SUBJECT SEARCH
The major features of subject search tasks which determine the approaches are:
• The required information is often located at the junction of adjacent areas; hence, it is difficult to word the search query exactly.
• Along with the information on the innovation proper, it is desirable to obtain information on applications, risks, specific features, users, authors, and producers.
• Alternatives must be available, and different criteria must be combined, to select the most effective practices.
• The information on innovations is fragmented and heterogeneous, and is primarily sector-specific.
In contrast to the search for specific information (facts) on particular aspects of the required content, searching for coordinated information on a target subject is a rather difficult problem. For example, suppose it is required to find the economic performance of the Raspadskaya JSCo mine for the first half of 2013. If we use this phrase as a search query, it is possible to get a relevant answer in the first ten search results of Google. But how can one find the information needed to analyze the scientific, technical, economic and social factors affecting the innovative technical, technological, or financial mechanisms of coal mining in the eastern regions of Russia?
This work was supported by the Russian Foundation of Basic Research (contract No NK13-07-00342).
To solve such search problems, users have to employ many combinations of key concepts and refine them in the course of searching the Web or specialized stores such as patent databases (DB). It is not certain that any systematic method would be followed consistently in doing so. Eventually, the user has a large number of search results at their disposal (tens or hundreds of documents), with the found information being more or less relevant to the queries. Commonly, there is no opportunity to examine all the result data in detail. So, the following questions arise:
• How can one simultaneously assess the relevance of documents found by different queries? Is the relevance of documents determined correctly?
• Is the data ranking in a certain search system correct from the perspective of a user? Do all the results available for direct assessment meet the user's expectations?
• Are all the results that meet the user's expectations available for direct assessment? Are all the required data (e.g. innovative solutions) found at all?
• How can one filter out documents extraneous to the searched subject?
• Is it possible to find effective solutions that are established in other application fields but could be successfully used as an innovation in this domain?
• Is it possible to give a visual assessment of the many innovative solutions found, together with linked objects?
There are no clear-cut ways of solving these problems with trivial solutions. Clearly, we need efficient methods of creating and populating computer-assisted collections of advanced technologies and ideas which would contain not only their descriptions but also selected, classified and associated data. These data can be used to analyze the history and prospects of specific innovations and to search for current and likely trends. The project in question is an attempt to offer a number of such approaches.

III. OUR APPROACH AND OTHER STUDIES
Currently, support for R&D management is of great importance (see, for example, the US National Trends and International Linkages in [1]). In this context, automated semantic processing of large arrays of scientific and technical information is used to search for breakthrough technologies and other innovative concepts, as is done in the prior art solutions illumin8 [2], NetBase [3], Orbit [4], Kalypso [5], as well as in large data stores such as CORDIS [6]. The intensive application of these and similar tools focuses our attention on IT-related issues [7], [8].
A brief overview of the publications that were instrumental in pinning down the goal of the project under discussion is given below. Our current development solutions mainly deal with the problem of data filtration [9] in terms of a content-based approach. In this respect it is important to note some interesting and topical visions of users and developers of RDF [10], semantic web-service control [11], as well as the development of the document vector space model, which is fundamental to most information retrieval problems, including rating of documents, data filtering, classification and clustering of documents [12].
The taxonomy of web searches is also of great interest to us [13]. It is worth mentioning pattern-searching procedures such as Information Filtering by Multiple Examples (IFME), recently described in [14], which allows users to specify their information needs as a set of relevant documents rather than keywords. The use of additional relevance assessment sources can help when working with many short texts (for example, Twitter). The use of alternative algorithms focused on solving the top-N recommendation task [15] seems to be useful too.
Another important field of our work is associated with effective query formulation. It depends on the optimal use of a combination of information sources in order to create an extended search set [16], [17], [18]. The next essential study to be mentioned is [19], which offers models and infrastructure for complex searches.
Note that this paper significantly expands and details [20].

IV. PROBLEM DEFINITION
Thus, the project goal can be summarized as the exploration of new approaches to methods of searching for innovative solutions in the database of a data center, and to populating that database with Internet data mining results adapted to visual assessment of selected, classified and associated data. We see three key tasks to attain the goal:
• To develop the technology of building and qualifying search queries, followed by filtering and ranking of search data.
• To set up methods of cluster analysis of text documents and multimedia objects in order to use them for tagging the links between search results.
• To create a store of innovative solutions for educational and scientific purposes.

V. SOFTWARE ARCHITECTURE
When developing a general software architecture based on mechanisms of direct automated search for innovative solutions, the authors defined the view, service, business logic and data access layers as well as crosscutting concerns (UML notation and artifacts were applied). In the behavioral model of the system (Fig. 1), two periods of user activity can be distinguished within a session: query formulation (first step) and visualization of the results, including the requested and innovative solutions and linked objects (final step). The interim steps are hidden, run offline, and implement the algorithm of interaction between the system components without the active participation of the user.
The main functional components are:
• Search module. It involves executing a search query in Internet search systems and in the custom directory of innovative solutions; basic search (query by attributes and full texts), location, data retrieval and summarizing.
• Query qualification module. Selection and ranking of search results: filtration, subject control, qualification of the search query.
• Classification module. Classification of search results: selection of a method, cluster analysis of text documents and multimedia objects, data qualification. As a result, we obtain a subset of semantically linked data.
• Link identification module. Establishing links: qualitative assessment of the classification, selection of the best results, interpretation of results; generating descriptions of solutions with innovative potential in a given subject segment or for a specified object (article, technology, product).
• Visualization module. It involves mapping of search results, data processing procedures, and classification results, including semantic links between objects.
• Data warehouse (DW) management module. It involves storage and updating of data search and processing results, parameters, and intermediate data; a registry of innovative scientific, technological and educational problems. The DW is built on the basis of a vector space model and includes document database access libraries and a data indexer.
• Service module. It involves monitoring and analysis of user access to information resources.
Notably, the original object model developed is designed to work with any text objects related to the subject of processing: queries, search results, text documents. Over 30 entity classes specify a document processing environment, a set of documents, methods of calculating similarity measures for a document package, search functions within a document package, types of reports, a collection of document words, lemmatization, and a document structure and its specific parts.
The detailed architectural solutions are described in [21].

VI. GENERAL SEARCH ALGORITHM
One of the elements of the architecture presented above is a generalized heuristic algorithm for filtering and rating search results, which is based on available search engines; the algorithm is intended to provide a basis for the search and query qualification modules, as well as for the retrieval schedule and search procedures in general.
The algorithm uses the search results of existing search engines and is invariant to them; it allows various degrees of automation; it uses the rating results of the search engine.
The algorithm implements a multistep process of sequential filtering of search results and analysis of the semantic similarity of the content of found objects to adaptively generated reference texts (k-patterns). The ranking quality of the filtered search results produced by the algorithm was estimated with the DCG metric (see below). Ways of generating effective k-patterns were investigated as well.
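The similarity step can be illustrated with a minimal sketch. It assumes raw term-frequency vectors and cosine similarity as the measure; the function names, the sample texts and the absence of lemmatization are illustrative assumptions, not the project's implementation.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (no lemmatization, for brevity)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_pattern(documents, k_pattern):
    """Return (score, document) pairs sorted by similarity to the k-pattern."""
    pattern_vec = term_vector(k_pattern)
    scored = [(cosine_similarity(term_vector(d), pattern_vec), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical reference text (k-pattern) and found documents.
k_pattern = "innovative coal mining technology in eastern regions"
documents = [
    "online shop for mining souvenirs",
    "innovative technology for coal mining in eastern regions",
    "weather forecast for eastern regions",
]
ranking = rank_by_pattern(documents, k_pattern)
for score, doc in ranking:
    print(f"{score:.2f}  {doc}")
```

Documents whose score falls below a chosen cut-off could then be dropped at the next filtering step; the cut-off itself would be a tuning parameter.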
Let us briefly run through the operation of the algorithm (Fig. 2). The description of a generalized request Q_0 includes the initial set of key concepts of the target document subject.
The generation of the set Q of search queries, q ∈ Q and |Q| = N, is automated with an adaptive genetic algorithm that searches for an effective total pertinence of the resulting document sample under given constraints on the depth of the evolutionary process (see below).
The execution of queries q_ij is accompanied by filtering of the search results R_qs rated by a search engine and by generation of the total results R. Filtering excludes documents whose subject area is formally pertinent but which, for some reason, should not be the subject of the search. It is done by hand or with a classifier whose learning set is updated during the analysis of the found texts.
Examples of documents to be filtered out are tutorials, students' papers, training programs, tests and notes, site promotion materials, company sites, shopping sites, social networking sites, blogs, advertisements, virus-infected resources, and nonexistent resources.
The k-patterns, or reference texts, are generated simultaneously. They are used for calculating document similarity measures (P_ka is a text combination based on the first positions of the rated search results, P_kc is the most pertinent result, P_kb is a text constructed from authority dictionary entries, and P_kd is a text constructed from Q_0). Further, the document vector space model is used. The algorithm and some results of its use in patent searching are described in detail in [22].
To assess the quality of the algorithm, the DCG metric [23] was used. For documents arranged by the semantic similarity Similarity(d_1, d_2), the values were calculated for every k-pattern:
$$\mathrm{DCG} = \sum_{p} \frac{\mathrm{gr}(p)}{\log_2(2 + p)},$$
where gr(p) is the mean expert relevance assessment given to the document located at position p in the list of results, gr ∈ [0, 3], with 3 standing for "relevant", 0 for "irrelevant", and 1 and 2 for "partially relevant" ("relevant (+)" or "relevant"); 1/log_2(2 + p) is the document position discount (the documents at the head of the list are of greater importance).
Fig. 3 shows the ratio of the DCG metrics for the ideal ranking and for various k-patterns. Good agreement of the metric values is observed across patterns. At the same time, there is room for more exact labeling of documents by relevance group gr. The algorithm under consideration assigns 10-15% of documents to a group whose gr value differs from that of the ideal ranking (see the peaks at the breakpoints). Then the normalized values NDCG = DCG/Z were calculated for every k-pattern, with Z equal to the greatest possible DCG value under the ideal ranking according to the expert assessment. The NDCG indicator takes values from 0 to 1. The ratio of NDCG values for the generalized query and various k-patterns is presented in Fig. 4. It is evident that the algorithm shows the best results with k-pattern P_ka (a combination of texts from the first positions of the ranked search results).
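The DCG discount 1/log_2(2 + p) and the normalization NDCG = DCG/Z described above can be computed as in the following sketch. Counting positions p from 0 and the sample grades are assumptions made for illustration; the source does not state the starting index.

```python
import math


def dcg(grades):
    """DCG with the position discount 1 / log2(2 + p).

    Assumption: positions p are counted from 0, so the top document
    is not discounted (log2(2) = 1).
    """
    return sum(g / math.log2(2 + p) for p, g in enumerate(grades))


def ndcg(grades):
    """NDCG = DCG / Z, where Z is the DCG of the ideal (descending) ordering."""
    z = dcg(sorted(grades, reverse=True))
    return dcg(grades) / z if z > 0 else 0.0


# Hypothetical expert grades gr in [0, 3] for a ranked list of six documents.
ranked_grades = [3, 2, 3, 0, 1, 2]
print(round(dcg(ranked_grades), 3), round(ndcg(ranked_grades), 3))
```

An ideally ordered list yields NDCG = 1, and any misplaced document lowers the value, which is what makes the indicator comparable across k-patterns.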
Note that the project relies on the results produced by Internet search engines. The proposed search algorithms will complement the established solutions, that is, the classical approaches to search result ranking (HITS, PageRank, BrowseRank, MatrixNet), which are based on the combination of document semantic pertinence and authority as well as the user's behaviour and experience.

VII. GENERATION OF SEARCH QUERIES
The project proposes and investigates the approach to search result generation based on a genetic algorithm (Fig. 5). The approach is used to specify the semantic kernel of a desired set of documents and to generate sets of effective queries. The problem definition provides for an evolutionary process that generates a stable and effective query population forming a relative search image of a document. A target set of search results is to be formed by document addresses which are (a) in the first positions
The original population of N search queries may be a set Q = {q_i}, |Q| = N, N < |Q_0|/2, q_i = (k_1, k_2, ..., k_m), where (k_1, k_2, ..., k_m) is a random combination of key concepts of the search image Q_0. The value of an objective function must determine the query quality (the fitness of a population individual). For each i-th query result the value may be calculated as w_i(f, p, s, a), where f is determined by the position of the result in the ranked result list produced by a search engine; p is determined by the occurrence of the result in the result lists of most queries; s is determined by the semantic similarity to k-patterns formed adaptively during the algorithm execution; a is determined by a user profile as an environment factor (the values f, p, s, a are normalized to the range from 0 to 1). The value of the objective function for each query is calculated as the averaged weight of the query results,
$$\hat{w} = \frac{1}{P} \sum_{i=1}^{P} w_i,$$
where w_i is the weight of each result calculated after executing all queries, and P is the number of document addresses seen as the query result. The value of the objective function is interpreted as the capability of a search query to generate results that will be in the next population generation.
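A minimal sketch of the initial population and the query objective follows. The source does not give the form of w_i(f, p, s, a); the weighted sum with equal coefficients used here, as well as all function names and sample concepts, are assumptions for illustration only.

```python
import random


def initial_population(key_concepts, n_queries, m, rng):
    """N random queries, each a combination of m key concepts from Q_0."""
    if not n_queries < len(key_concepts) / 2:
        raise ValueError("the text requires N < |Q_0| / 2")
    return [tuple(rng.sample(key_concepts, m)) for _ in range(n_queries)]


def result_weight(f, p, s, a, coeffs=(0.25, 0.25, 0.25, 0.25)):
    """w_i(f, p, s, a) as a weighted sum of components normalized to [0, 1].

    Assumption: the linear form and the equal coefficients are illustrative.
    """
    return sum(c * x for c, x in zip(coeffs, (f, p, s, a)))


def query_fitness(result_weights):
    """Objective value of a query: the average weight of its P results."""
    return sum(result_weights) / len(result_weights) if result_weights else 0.0


rng = random.Random(0)
# Hypothetical search image Q_0.
q0 = ["coal", "mining", "innovation", "technology",
      "economics", "safety", "region", "finance"]
population = initial_population(q0, n_queries=3, m=3, rng=rng)
weights = [result_weight(0.9, 0.5, 0.7, 1.0), result_weight(0.4, 0.5, 0.2, 1.0)]
print(population, query_fitness(weights))
```

Passing the random generator explicitly keeps runs reproducible, which helps when comparing populations across parameter settings.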
To choose parent couples, the method of genotype outbreeding is proposed. It provides for the most complete participation of all current queries in generating the next query population (the first parent individual is chosen randomly and the second individual is the "farthest" from the first one; the distance can be calculated as the difference of their objective function values, ŵ_1 − ŵ_2). The evolutionary crossover operator is implemented as discrete recombination, which corresponds to the exchange of key words (genes) between queries. The peculiarity of the proposed implementation is that a key word of a parent query is replaced not by the key word of the other parent query itself but by its synonym. This allows considerably more child queries to be generated, with the properties (semantics) of the parent queries being preserved.
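Parent selection by outbreeding and crossover with synonym substitution might look as follows. The synonym dictionary, the 0.5 exchange probability and the use of the absolute fitness difference as the distance are assumptions of this sketch.

```python
import random

# Hypothetical synonym dictionary; a real system would use a thesaurus.
SYNONYMS = {
    "coal": ["anthracite"],
    "mining": ["extraction"],
    "innovation": ["novelty"],
}


def pick_parents(population, fitness, rng):
    """Outbreeding: random first parent, second is the farthest by fitness."""
    first = rng.randrange(len(population))
    second = max(
        (i for i in range(len(population)) if i != first),
        key=lambda i: abs(fitness[i] - fitness[first]),
    )
    return population[first], population[second]


def crossover(parent_a, parent_b, rng):
    """Discrete recombination: each gene comes from either parent.

    A gene taken from parent_b is replaced by one of its synonyms when the
    dictionary has one; otherwise the key word itself is used.
    """
    child = []
    for gene_a, gene_b in zip(parent_a, parent_b):
        if rng.random() < 0.5:
            child.append(gene_a)
        else:
            child.append(rng.choice(SYNONYMS.get(gene_b, [gene_b])))
    return tuple(child)


rng = random.Random(1)
population = [
    ("coal", "mining", "safety"),
    ("innovation", "coal", "finance"),
    ("mining", "region", "coal"),
]
fitness = [0.2, 0.5, 0.9]
a, b = pick_parents(population, fitness, rng)
child = crossover(a, b, rng)
print(a, b, child)
```

The sketch does not remove duplicate key words that may appear in a child; a fuller version would need a rule for that case.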
The most adequate mutation operation for the approach under study is the probabilistic change of a randomly chosen key query word (gene). If the number of key words in a query q_i = (k_1, k_2, ..., k_m) is fixed, then mutation operators such as new gene addition, new gene insertion, or gene deletion cannot be used. Otherwise, they can be applied. Besides, there is no sense in exchanging gene places in the context of executing search queries.
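For a fixed query length m, the mutation described above reduces to replacing one gene. The sketch below assumes that the replacement is drawn from key concepts not already present in the query; the mutation rate parameter and the function name are illustrative.

```python
import random


def mutate(query, key_concepts, rate, rng):
    """With probability `rate`, replace one randomly chosen gene.

    The query length m stays fixed, so genes are never added, inserted,
    or deleted. The replacement is drawn from concepts not already present
    in the query (an assumption made to avoid duplicate key words).
    """
    if rng.random() >= rate:
        return query
    candidates = [k for k in key_concepts if k not in query]
    if not candidates:
        return query
    position = rng.randrange(len(query))
    mutated = list(query)
    mutated[position] = rng.choice(candidates)
    return tuple(mutated)


rng = random.Random(2)
concepts = ["coal", "mining", "innovation", "technology", "safety"]
query = ("coal", "mining", "safety")
print(mutate(query, concepts, rate=1.0, rng=rng))
```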
To generate a new population, elite selection, which prevents the loss of the best solutions, is used. An intermediate population is generated. It includes both the parents and their children. The N members with the best values of the objective function ŵ are chosen from all the population members; they go into the next population. Generally, the condition for terminating the algorithm is considered to be population stability, for example, when the mean-square deviation of the fitness function W reaches some threshold specified by an algorithm parameter. The genetic algorithm is described in detail in [24].
Some results of preliminary experimental studies of the algorithm are briefly described below. The developers used original software, the search engine Bing and the following initial values of key parameters: N=15, m=3, P=10|50; the number of search results returned after ranking all the results was 50. The weights of the arguments f, p and s in search ranking were taken as equal to each other. The fitness function for groups of results is calculated as the average value; the algorithm exit strategy is a given number of passes. Terms from the document corpus (students' papers) were used in the original collection.
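The elite selection and the stability-based stopping rule described above can be sketched as follows. Taking the deviation over the fitness values passed by the caller (for example, those of the current population) is an assumption; the source does not specify over which values the deviation of W is computed.

```python
import statistics


def elite_selection(parents, children, fitness, n):
    """Keep the n best individuals from the merged parent and child pool.

    `fitness` maps each query to its objective value; duplicates are merged.
    """
    pool = list(dict.fromkeys(parents + children))  # order-preserving de-dup
    return sorted(pool, key=lambda q: fitness[q], reverse=True)[:n]


def is_stable(fitness_values, threshold):
    """Stop when the standard deviation of the fitness values is small."""
    return statistics.pstdev(fitness_values) <= threshold


# Hypothetical queries and objective values.
parents = [("coal", "mining"), ("coal", "safety")]
children = [("mining", "safety"), ("coal", "mining")]
fitness = {
    ("coal", "mining"): 0.4,
    ("coal", "safety"): 0.7,
    ("mining", "safety"): 0.9,
}
next_population = elite_selection(parents, children, fitness, n=2)
print(next_population, is_stable([fitness[q] for q in next_population], 0.05))
```

Because parents compete with their children, the best objective value in the population cannot decrease from one generation to the next.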
Fig. 6 shows the plots of W against population number, with P=10 and |Q_0|=50. Local maxima of W and points of relative stabilization of W (the 6-7th population) can be observed. Fig. 7 shows plots of W against the number of keywords m of every generated query (shown as numbers beside the plots). It is seen that the increase of m leads, on the whole, to an improvement in population quality. Fig. 8 shows the influence of the fitness function arguments on its value. The greatest influence belongs to f, the lowest one to p.
VLADIMIR IVANOV, BORIS PALYUKH, ALEXANDER SOTNIKOV: APPROACHES TO THE INTELLIGENT SUBJECT SEARCH

VIII. DATA WAREHOUSE
The project investigates the possibilities of generating a Data Warehouse (DW) that implements a document vector space model [25] to serve as the basis of the data centre's information support. A software platform, Document Text Analyzer (DTA), for semantic document analysis (computation of their metric similarity) is developed within the DW.
The prototype of the DW was tested successfully with the associated technologies of integral quality assessment of electronic documents and analysis of document pertinence in different contexts [26]. In particular, the software shell and interface of the TSTU specialized electronic teaching pack database, one of the data centre warehouse components, were debugged (Fig. 9). The database is used to test and apply the project research results. A pioneering technology for assessing the uniqueness of students' work (course and design-graphic papers, semester tasks, reports, essays, tests) is put to use.
Methods of semantic text comparison are used here: the computation of key concept weights and the construction of document vectors, rather than the known approaches (e.g. shingling) based on detecting direct borrowings in the text.
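The difference between the two families of methods can be illustrated with a small sketch. It compares a word-shingle Jaccard overlap with a cosine similarity of term vectors on a reworded sentence; raw term counts stand in for key concept weights, and the sample sentences are invented for illustration, so this is not the DTA implementation.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens (no lemmatization, for brevity)."""
    return re.findall(r"[a-z]+", text.lower())


def shingle_jaccard(a, b, k=3):
    """Jaccard overlap of word k-shingles; high only for verbatim overlap."""
    def shingles(text):
        words = tokens(text)
        return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def vector_cosine(a, b):
    """Cosine similarity of term-weight vectors (raw counts as weights)."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


original = "the seam thickness determines the mining method"
reworded = "the mining method is determined by the seam thickness"
print(round(shingle_jaccard(original, reworded), 2),
      round(vector_cosine(original, reworded), 2))
```

On reordered text the shingle overlap drops sharply while the vector similarity stays high, which is why vector-based comparison is less sensitive to rewording.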
Research into approaches to systematizing the diverse information of the data centre should also be noted. As a result, a multistep algorithm of alternative search in an information catalogue, with a target number of steps, was developed as a basis for selecting a desired solution [27].

IX. APPLICATION AREAS
The application areas of the approaches, research results and technologies discussed in the paper are listed below:
• Competitive analysis and competitive intelligence. A survey of commercial, scientific, technical and social information sources in a target field. A search for business-valuable information. Acquisition of client information (in CRM systems). Characterization of new fields and directions in business planning. A search for descriptions of sector innovation solutions.
• Educational technologies. An analysis of students' papers (graduation and course papers) and theses. Selection and expert examination of teaching materials (books, articles, papers, essays, surveys, etc., including web resources). Scientometric analytical services.
• The work of competition committees and sponsoring agencies. Expert examination in venture and other investment funds, the work of councils and groups of experts. An analysis of applications, information cards, competition documentation, expert examination rules and conditions. Normalizing and metrological control of technical documentation. An analysis of project design documentation, standards, norms, rules, regulations, manuals.
• Patenting, expert examination of novelty. Selection of materials for patent investigations. A documentation analysis of intellectual property objects and license contracts. Technological development forecasting.
• A content analysis of document texts in sociological surveys.
• Staff recruitment at enterprises and in organisations. An analysis of applicants' resumes and vacancy descriptions.
• Rubrication of personal digital documents. Classification and grouping of text documents (files) on a PC.
It should be noted that patent research was carried out within the project, aiming to find analogs of the designed system and to establish its novelty. At the time this report was prepared, no data on direct analogs of the project or of its implemented components had been found. A search of the document database of the Federal Institute of Industrial Property did not show any matches between the project results and the technologies recorded in official publications of titles of protection.

X. CONCLUSION
One of the R&D management reference models includes competitive analysis and technological development forecasting based on scientometric analytical services and semantic systems for searching business-valuable information. A relatively new world trend is evident: the effective use of the global knowledge dataflow. For all the differences, the major search pattern is selecting materials on demand, highlighting key concepts in the desired area and grouping materials accordingly, filtering and semantic processing of results, and generating analytical reports. In this sense, the project tasks whose results were discussed in the article are timely and urgent, and are addressed at an appropriate level of problem interpretation.

Fig 1. Behaviour of software components

Fig 2. General pattern of a generalized heuristic algorithm for filtering and rating the search results

Fig 8. The influence of fitness function arguments on W value.