
Section

"MATHEMATICAL METHODS OF MODELING, CONTROL AND DATA ANALYSIS"

UDC 519.6


A MODEL FOR CLUSTERING NEWS HEADLINES USING HAMMING DISTANCE

Gao Mingyu*

Scientific supervisor - L.A. Kazakovtsev

Reshetnev Siberian State University of Science and Technology 31, Krasnoyarskii rabochii prospekt, Krasnoyarsk, 660037, Russian Federation

*E-mail: levk@bk.ru

With the progress of society, the pace of life is getting faster and faster, and people seek convenience in reading the news as well. Checking the news quickly and accurately is a matter of everyday convenience. This article uses the Python language: starting from news headlines, it applies the K-Medoids algorithm with the Hamming distance to divide news into categories, clustering the news and making it easier to read.

Keywords: Python; K-Medoids; Hamming distance; news clustering

1. Data preprocessing

1.1 Title acquisition

In order to build a Hamming distance news clustering model, we should first obtain a list of news headlines. I used the webpage "https://www.hbfu.edu.cn/newsList?type=1" as an example for this model. Observing that each page of this site returns 20 news items, we use the POST method to request the data and a for loop to extract the information.

Request Method: POST

Fig. 1. The request method

def" getE>ata( self", data):

self".html = requests.request('POST',seLEViirl„data :

icLData = []

for xi in range(20):

idD ata. append(s elf. html.j s onQ ['rows' ] [11] ['id' ])

Fig. 2. Heading

Every time a page is turned, the value of start increases by 20, so we write a loop to fetch all pages. Each title has its own id, and the full information is retrieved through that id.

0: {id: 7046, title: ...}

Fig. 3. Id of the title

Form Data: start: 20, limit: 20, type: 1

for m in range(0, 4000, 20):
    data = {'start': m,
            'limit': 20,
            'type': 1}
    hebei.getData(data)

for m in idData:
    dataPage = {'id': m}
    self.html2 = requests.request('POST', self.url2, data=dataPage)
    title = self.html2.json()['title']

Fig. 4. Extracting the headers
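For reference, the whole fetching step from Figs. 2-4 can be assembled into one self-contained sketch. This is only an illustration: the list endpoint is the URL named above, while the detail endpoint (self.url2 in the figures) is never shown in the text, so DETAIL_URL below is a hypothetical placeholder; the field names 'rows', 'id' and 'title' are taken from the figures.

import requests

LIST_URL = 'https://www.hbfu.edu.cn/newsList?type=1'  # list endpoint from the text
DETAIL_URL = 'https://www.hbfu.edu.cn/newsDetail'     # hypothetical: the real detail endpoint is not shown

def fetch_page_ids(start, limit=20):
    # request one page of the news list and collect the id of every item on it
    resp = requests.request('POST', LIST_URL, data={'start': start, 'limit': limit, 'type': 1})
    return [row['id'] for row in resp.json()['rows']]

def fetch_title(news_id):
    # request the detail record of one news item and return its title
    resp = requests.request('POST', DETAIL_URL, data={'id': news_id})
    return resp.json()['title']

titles = []
for start in range(0, 4000, 20):  # 'start' grows by 20 per page, as in Fig. 4
    for news_id in fetch_page_ids(start):
        titles.append(fetch_title(news_id))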

Finally, we store the obtained data in an Excel table.

Fig. 5. News headlines stored in the Excel table
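The article does not show the code of this saving step; a minimal sketch using pandas (an assumption on my part, since the library is not named in the source) could be:

import pandas as pd

# write the collected headlines into an Excel file, one headline per row
df = pd.DataFrame({'title': titles})
df.to_excel('news_headlines.xlsx', index=False)

Note that DataFrame.to_excel() needs an Excel writer backend such as openpyxl installed.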

1.2 Data selection and word segmentation. Since the data are used only for a model demonstration, I randomly selected 33 records. Text cannot be used for clustering directly; we need to vectorize it. Therefore, I first segmented the data. For Chinese word segmentation we use the "jieba" library [1].

words = []
for i, row in df.iterrows():  # row corresponds to the content of each line
    word = jieba.cut(row['title'])  # use the cut function to segment the headline column (the original Chinese column name is garbled in the source)
    result = ' '.join(word)
    words.append(result)  # use the append() function to add the segmented headline to the words list

Fig. 6. Adding segmentation


Fig. 7. Word segmentation result
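For illustration, jieba.cut() returns a generator of tokens that can be joined with spaces; a tiny, hypothetical example of the call used in Fig. 6:

import jieba

# segment one Chinese phrase into space-separated words
print(' '.join(jieba.cut('我爱自然语言处理')))  # e.g.: 我 爱 自然语言 处理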

1.3 Text vectorization. The segmented data are then vectorized. We use the CountVectorizer() class from scikit-learn, which easily converts text into numeric values.

vect = CountVectorizer()
X = vect.fit_transform(words)
X = X.toarray()

Fig. 8. Converting text to numeric values

Then we get the following result:

Fig. 9. Text vectorization (the resulting matrix of word counts)
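To make this step reproducible on its own, here is a minimal sketch of the same transformation on a toy English corpus (the real input is the words list built in Fig. 6):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['news about sport', 'news about science']  # toy stand-in for the segmented headlines
vect = CountVectorizer()
X = vect.fit_transform(corpus).toarray()
print(vect.get_feature_names_out())  # the learned vocabulary
print(X)                             # one row of word counts per document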

2. Using the Hamming distance for clustering.

2.1 Defining the Hamming distance. Next, we use the Hamming distance for clustering. The Hamming distance between the strings "1111" and "1001" is 2: it compares the similarity of two strings by counting the number of positions at which their characters differ [2]. First we define the Hamming distance calculation:

def hamming_distance(self, a, b):
    r = (1 << np.arange(8))[:, None]
    return np.count_nonzero((a & r) != (b & r))

Fig. 10. Hamming distance calculation
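As a quick check, the prose example above ("1111" vs "1001") corresponds to the integers 0b1111 and 0b1001; a standalone version of the method from Fig. 10 gives the expected answer of 2:

import numpy as np

def hamming_distance(a, b):
    # compare the lowest 8 bits of a and b and count the differing bit positions
    r = (1 << np.arange(8))[:, None]
    return int(np.count_nonzero((a & r) != (b & r)))

print(hamming_distance(0b1111, 0b1001))  # -> 2 (bits 1 and 2 differ)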

2.2 K value selection. Because the Hamming distance can only be computed between integers, K-Means, whose centers are coordinate means, is not applicable. Here I use the K-Medoids algorithm [3]. First we determine the value of K; here I choose 10:

test_one = KMediod(data, k_num_center=10)

Fig. 11. Choosing K equal to 10

2.3 Calculation of centroids. We use the random.shuffle() function to shuffle the list of point indices and thereby select the initial centroids at random.

indexs = list(range(len(self.data)))
random.shuffle(indexs)
init_centroids_index = indexs[:self.k_num_center]
centroids = self.data[init_centroids_index, :]  # Initial center points

Fig. 12. Random selection of centroids

2.4 Iteration. Then we calculate the distances and iterate. First, we compute the distance from each point to every center and assign the point to the category of the nearest center [4]:

distances = [func_of_dis(sample, centroid) for centroid in centroids]
cur_level = np.argmin(distances)
sample_target.append(cur_level)

Fig. 13. Calculating the distance to the nearest center

Next, we need to recalculate the centroids, finding the best point within each category:

for i in range(self.k_num_center):
    distances = [func_of_dis(point_1, centroids[i]) for point_1 in classify_points[i]]

Fig. 14. Recalculation of the centroid

First calculate the sum of the distances between the center point and all other points:

for point in classify_points[i]:
    distances = [func_of_dis(point_1, point) for point_1 in classify_points[i]]
    new_distance = sum(distances)

Fig. 15. Calculating the sum of distances

For each point in the cluster, calculate the sum of its distances to all other points. If this sum is less than the sum of distances from the current center point, replace the center point with this point [5].

if new_distance < now_distances:
    now_distances = new_distance
    centroids[i] = point

Fig. 16. Center point replacement condition
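Putting Figs. 11-16 together, a compact self-contained K-Medoids sketch might look as follows. It mirrors the structure of the fragments above, but the stopping rule (a fixed number of passes), the coordinate-wise Hamming distance on the count vectors, and some variable names are my assumptions, since the source shows the algorithm only in pieces.

import random
import numpy as np

class KMediod:  # the class name is spelled as in Fig. 11
    def __init__(self, data, k_num_center):
        self.data = np.asarray(data)
        self.k_num_center = k_num_center

    def func_of_dis(self, a, b):
        # Hamming distance between two integer vectors: the number of differing coordinates
        return np.count_nonzero(a != b)

    def fit(self, max_iter=10):
        # random initial medoids (Fig. 12)
        indexs = list(range(len(self.data)))
        random.shuffle(indexs)
        centroids = self.data[indexs[:self.k_num_center], :]
        sample_target = []
        for _ in range(max_iter):
            # assign every sample to the nearest medoid (Fig. 13)
            sample_target = []
            classify_points = [[] for _ in range(self.k_num_center)]
            for sample in self.data:
                distances = [self.func_of_dis(sample, c) for c in centroids]
                cur_level = int(np.argmin(distances))
                sample_target.append(cur_level)
                classify_points[cur_level].append(sample)
            # move each medoid to the cluster point with the smallest summed distance (Figs. 14-16)
            for i in range(self.k_num_center):
                if not classify_points[i]:
                    continue
                now_distances = sum(self.func_of_dis(p, centroids[i]) for p in classify_points[i])
                for point in classify_points[i]:
                    new_distance = sum(self.func_of_dis(p, point) for p in classify_points[i])
                    if new_distance < now_distances:
                        now_distances = new_distance
                        centroids[i] = point
        return sample_target

With the vectorized headlines X from Fig. 8, calling test_one = KMediod(X, k_num_center=10) and labels = test_one.fit() would produce a label list like the one in Fig. 17.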

2.5 Final result. We get the following results; items with the same number are classified into the same category:

[2, 1, 1, 3, 1, 6, 9, 9, 8, 2, 7, 2, 6, 2, 2, 2, 2, 2, 6, 8, 5, 9, 2, 2, 2, 2, 8, 2, 5, 2, 6, 4, 2]

Fig. 17. Result

Conclusion. The more accurate the clustering of online news, the more helpful it is in our daily lives: it lets us obtain the information we want more conveniently and quickly. Therefore, choosing a clustering method that is better and more accurate has practical significance. It is hoped that further research will make this functionality more complete, which will benefit people's everyday lives and further promote the progress and development of network culture.

References

1. Wang Yutao, Qian Yanzhu. Basics of Text Vectorization: Building a Word Frequency Matrix // Big Data Analysis and Machine Learning. January 2021.


2. deephub. Summary of 9 common distance measures in data science and overview of their advantages and disadvantages. https://deephub.blog.csdn.net/article/details/113539354, February 2021.

3. Giuseppe Bonaccorso. K-Medoids // Hands-On Unsupervised Learning with Python. September 2020.

4. Trident_lin. K-Medoids clustering algorithm Python implementation. https://blog.csdn.net/weixin_39220714/article/details/84867035, December 2018.

5. Do not forget Lawlite. Machine Learning (6): K-Means Clustering - Python. https://blog.csdn.net/u013082989/article/details/53219831, November 2016.

© Gao Mingyu, 2021
