Summary
Observing internet users engaged in information seeking has
highlighted a general need for immediate exchange. This immediacy can take
different forms, in particular the ability of a user, at a given moment, to
benefit from the searches of other users through dynamic recommendation.
Following the principle of social networks, a community is a set of internet
users who can take advantage of links, predefined or not, based on common
interests, common practices, and so on. Identifying these links dynamically
and bringing users together appeared to be a real challenge.
The goal is therefore to dynamically create communities of internet
users from their ongoing searches via search engines (using log files, for
example). The dynamic generation of communities relies largely on extracting
the search themes (centers of interest) of the internet users present on the
network at a given moment (or during a given period). The search themes that
connect internet users form the core of the community dynamics. The community
is then represented as a large complex-network graph of words (extracted from
the themes) in which the links represent co-occurrences.
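As an illustration of the graph described above, the following minimal sketch
builds a weighted co-occurrence graph from a list of search queries. The query
list, the stop-word list and the function name cooccurrence_graph are
assumptions made for this example; the thesis does not specify how the log
files are parsed or how terms are extracted.

```python
from collections import Counter
from itertools import combinations

# Illustrative stop-word list (an assumption, not taken from the thesis).
STOP_WORDS = {"the", "of", "a", "in", "for"}


def cooccurrence_graph(queries):
    """Return (node_weights, edge_weights) for a list of query strings.

    A node is a term; an edge links two terms that appear in the same query.
    Weights count the number of queries in which the term or pair appears.
    """
    node_weights = Counter()
    edge_weights = Counter()
    for query in queries:
        # Distinct terms of the query, stop words removed, in sorted order
        # so that each unordered pair always gets the same key.
        terms = sorted({t for t in query.lower().split() if t not in STOP_WORDS})
        node_weights.update(terms)
        edge_weights.update(combinations(terms, 2))
    return node_weights, edge_weights


# Hypothetical queries standing in for a search-engine log.
queries = [
    "jaguar car price",
    "jaguar animal habitat",
    "car price comparison",
]
nodes, edges = cooccurrence_graph(queries)
```

Here each query is treated as one co-occurrence window; a real log would
probably also need session grouping and a time window, which this sketch
leaves out.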
In this thesis, we propose an approach for creating and
validating the community graph. This approach aggregates the nodes of the
graph so that each aggregate has the highest possible semantic consistency.
The following issues must be resolved:
- creating clusters of words that may overlap (a word form may belong to
several themes);
- choosing or defining a grouping technique that guarantees a
high degree of semantic consistency;
- characterizing the aggregates to understand the differences in semantic
consistency;
- proposing techniques to validate the semantic consistency of
aggregates.
The first part, a state of the art, reviews many methods for
creating communities in graphs. However, none of them fulfills all of the
necessary criteria.
The second part presents our contribution, which consists of
several aggregation methods and several semantic validation methods.
We propose four aggregation methods: Clique Detection
(agglomeration of cliques), Simple Rarefaction (search for breakpoints in
the graph), Regulated Rarefaction (search for breakpoints based on the study
of specific populations, stop words and monosemous words) and Enrichment of
Aggregates by Gravity (the method determines a coefficient of attraction of
each word toward each aggregate).
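To make two of these ideas concrete, the sketch below finds maximal cliques in
a small weighted co-occurrence graph and then computes an attraction
coefficient of a word toward each resulting aggregate. It is only an
illustration under stated assumptions: the edge data are invented, the clique
search is the basic Bron-Kerbosch algorithm rather than the thesis's own
agglomeration procedure, and the attraction formula (sum of link weights
toward the aggregate divided by the aggregate size) is hypothetical, since the
coefficient used in the thesis is not given here.

```python
from collections import defaultdict

# Toy weighted co-occurrence edges (illustrative data, not from the thesis).
EDGES = {
    ("jaguar", "car"): 3,
    ("jaguar", "price"): 2,
    ("car", "price"): 4,
    ("jaguar", "animal"): 3,
    ("jaguar", "habitat"): 2,
    ("animal", "habitat"): 5,
    ("price", "comparison"): 1,
}

adj = defaultdict(set)   # undirected adjacency sets
weight = {}              # edge weight keyed by the unordered pair
for (u, v), w in EDGES.items():
    adj[u].add(v)
    adj[v].add(u)
    weight[frozenset((u, v))] = w


def maximal_cliques(adj):
    """Yield every maximal clique (basic Bron-Kerbosch, no pivoting)."""
    def expand(r, p, x):
        if not p and not x:
            yield frozenset(r)
            return
        for v in list(p):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    yield from expand(set(), set(adj), set())


def attraction(word, aggregate):
    """Hypothetical coefficient: link weight toward the aggregate, per member."""
    total = sum(weight.get(frozenset((word, m)), 0) for m in aggregate)
    return total / len(aggregate)


# Keep cliques of at least three words as aggregates; they may overlap.
aggregates = [c for c in maximal_cliques(adj) if len(c) >= 3]

# The word "comparison" is attached to the aggregate that attracts it most.
best = max(aggregates, key=lambda a: attraction("comparison", a))
```

In this toy graph the word "jaguar" belongs to both aggregates, which
illustrates the overlap requirement: one word form can take part in several
themes.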
We then propose three methods to validate the semantic
consistency of aggregates: the Compared Coefficient of Semantic Validation
method (the semantic value of aggregates is estimated by comparing the
behavior of an internet search engine on different test sets and on the
aggregates), a Trec-Eval method for query enrichment (the aggregates are used
to refine user queries) and a method comparing the consistency of returned
documents (the semantic consistency of documents returned by queries built
from specific test sets is compared with that of documents returned by queries
built from aggregates). We also use manual validation by experts in the
semantic domains handled, including comparison with other methods.
The various proposals and experimental methods provide
evidence of the importance of weighting nodes and links, and of using
directed graphs. Limiting the size of the word aggregates is also a major
factor in semantic consistency. The different clustering methods can still
evolve. Combining several types of links in a graph, for example, would
refine the content of the aggregates.
Keywords
Graphs, Term aggregates, Communities and user communities,
Complex networks, Small worlds.