Comparison of Distance Functions Using Multidimensional Classification Methods in Sociological Research

Student: Chujko Anna

Supervisor: Yuliana N. Tolstova

Faculty: Faculty of Sociology

Educational Programme: Bachelor

Year of Graduation: 2014

<p>One of the most common ways to obtain new knowledge in science is to classify the objects under investigation. One way to do this is to apply one of the many computer-based multivariate classification methods. This paper discusses methods of so-called automatic classification (taxonomy, cluster analysis). In particular, it is assumed that each object is defined as a point in a multidimensional feature space, that there is no information about the number of clusters, their shape, or the type of boundaries between them, and that no training sample is given. The classification methods are assumed to solve the problem of constructing a sociological typology of objects, understood in a substantive sense. The main methodological problem is to turn a formal classification into a meaningful typology.</p><p>Each classification algorithm contains a number of formal elements. This study examines one of the most important formal elements of any clustering algorithm: the choice of distance function (the metric of the space). Widely used software packages offer many different distance functions (e.g., the SPSS package includes seven). The choice of distance function strongly influences the final clustering result. 
The general purpose of this study is to establish a link between the choice of metric and the type of objects in the population, in order to create a classification that matches the researcher&rsquo;s concept of the object types sought.</p><p>The object of this study is the distance function in clustering algorithms; the subject is the correspondence between the distance function and the researcher&rsquo;s concept of the object types sought.</p><p>A review of the literature showed that little progress has been made, either in measurement theory or in data analysis techniques, toward precise recommendations on how researchers should match the distance function to the aims and objectives of a study.</p><p>An examination of a number of contemporary studies in which a sociological problem is solved by classifying a set of objects revealed the following tendency: researchers frequently build the classification automatically, choosing the squared Euclidean distance function &laquo;by default&raquo;, although this function is inappropriate for many sociological typology-building tasks.</p><p>In this paper we propose and implement a methodology for comparing the two most dissimilar metrics: the Euclidean and the &quot;cosine&quot; (the latter measures the angle between the vectors that define the objects under investigation or, what amounts to the same, the correlation coefficient between the coordinate sequences of these vectors). Using generated data, it was shown that these functions lead to different cluster shapes and, consequently, to different classifications. It is shown that each distance function corresponds to a particular understanding of the types sought. 
The equivalence of several approaches to classification has been proved: (1) the use of the &quot;cosine&quot; as a distance function; (2) the projection of all points onto the unit sphere and the use of the squared Euclidean metric between the projections as the distance between the original points (this technique was developed within the Spherical K-means algorithm, which is used for clustering text documents); (3) the transition from the original coordinates of the vectors to the proportions of their coordinates. On the basis of this proof, precise recommendations have been developed that specify which types of objects can be identified with the &ldquo;cosine&rdquo; metric.</p><p>The provisions mentioned above were illustrated on real datasets. It was demonstrated that using the squared Euclidean distance function &laquo;by default&raquo; in similar situations does not allow one to build a well-de
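
The relation between approaches (1) and (2) in the summary can be checked numerically. The sketch below is only an illustration and is not taken from the thesis: the function names and the three sample vectors are hypothetical. It relies on the standard identity that, for the unit-sphere projections of x and y, the squared Euclidean distance equals 2(1 - cos θ), where θ is the angle between x and y.

```python
import math


def norm(x):
    # Euclidean length of a vector given as a list of floats.
    return math.sqrt(sum(a * a for a in x))


def cosine_similarity(x, y):
    # Cosine of the angle between x and y.
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (norm(x) * norm(y))


def project_to_unit_sphere(x):
    # Rescale x to length 1, keeping its direction (its "profile").
    n = norm(x)
    return [a / n for a in x]


def squared_euclidean(x, y):
    # Squared Euclidean distance between two vectors.
    return sum((a - b) ** 2 for a, b in zip(x, y))


# Hypothetical objects: y has the same profile as x at ten times the level,
# z has the same overall level as x but the opposite profile.
x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
z = [3.0, 2.0, 1.0]

# Approach (2): squared Euclidean distance between unit-sphere projections.
sphere_xz = squared_euclidean(project_to_unit_sphere(x), project_to_unit_sphere(z))
# Approach (1): a cosine-based dissimilarity, scaled by 2 to match the identity.
cosine_xz = 2.0 * (1.0 - cosine_similarity(x, z))

# Raw squared Euclidean distance separates x and y strongly,
# while the cosine-based dissimilarity treats them as the same type.
raw_xy = squared_euclidean(x, y)
cosine_xy = 2.0 * (1.0 - cosine_similarity(x, y))
```

Here `sphere_xz` and `cosine_xz` agree up to floating-point error, while `raw_xy` is large and `cosine_xy` is essentially zero. This matches the summary's point that the two metrics correspond to different understandings of what makes objects belong to one type: similar level versus similar profile.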

Student Theses at HSE must be completed in accordance with the University Rules and regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.
