
https://doi.org/10.29013/EJEMS-21-3-7-13

Han Eric, Great Valley High School, Phoenixville PA, United States E-mail: eric1han23@gmail.com

PREDICTING APP RATINGS WITH LINEAR REGRESSION

Abstract. App ratings are one of the most important criteria when it comes to app development and app marketing, as they are indicators of whether an app will provide any benefit to the user. Therefore, it is very crucial to find out what factors are the most important if an app wants to get high ratings. In this project, I used a dataset that contained information regarding apps in the Google Play Store. I constructed a multiple linear regression model that can predict app rating scores according to the basic data information of the apps. I will first introduce the basic concepts and methodology details of the multiple regression. Then, I will construct a data analysis and data model based off the dataset I used. Finally, I will use Python to conduct the model and provide insight for the target companies and developers.

Keywords: predicting app ratings, machine learning, multiple linear regression, least squares regression, data cleaning.

Background and Introduction

For this project, I wanted to identify and examine the patterns and factors that go into the rating score of apps on the app store. I want to be able to determine why some apps receive better ratings than other apps, and I want to discuss the implementations of this. The main target is to find the relationship between the app rating scores and the factors, with linear regression models as the main analysis tool. Moreover, I will provide insights and results interpretation according to the model outcomes, verifying the efficiency and robustness of the model construction. For example, I could be looking at the effects of the number of installs of an app or the size of an app.

Methodology Overview

Linear Regression Basis [1]:

I am concerned with whether the relationship between two variables can be described as a straight line, which is the simplest and most commonly used form:

Y = a + bX

where Y is the dependent variable, measured in units of the dependent variable, X is the independent variable, measured in units of the independent variable, and a and b are constants defining the nature of the relationship between the variables X and Y.

• The "a" or Y-intercept (aka Y-int) is the value of Y when X = 0.

• The "b" is the slope of the line and is known as the regression coefficient and is the change in Y associated with a one-unit change in X.

The greater the slope or regression coefficient, the more influence the independent variable has on the dependent variable, and the more change in Y associated with a change in X.

The regression coefficient is typically more important than the intercept from a policy researcher perspective as we are usually interested in the effect of one variable on another.

Coming back to the equation, I also have a term to capture the error in our estimating equation, denoted ε or e. Also known as the residual, it reflects the unexplained variation in Y, and its magnitude reflects the goodness of fit of the regression line. The smaller the error, the closer the points are to our line. So, our general equation describing a line is: Y = a + bX + e.

Regression Analysis:

Simple regression is a procedure to find specific values for the slope and the intercept.

If the line that is drawn to describe the data has a positive slope, the data suggests a positive relationship. If the line has a negative slope, the data suggests a negative relationship. If the line is horizontal, that is, if the slope is zero, there is no relationship in the data.

In drawing a linear regression line, one wants to minimize the distance between the points and the line. Distance is measured vertically from an observed point to our estimated line. Since I cannot draw a line that minimizes the distance to every point at the same time, I needed a way to average the distances to get a best-fitting line. In the most common form of regression analysis, the technique is to take the sum of the squared vertical distances [2]:

Σ(Yi - Ŷi)²

That form of regression is called Ordinary Least Squares, or Least Squares, and it has two key properties:

The sum of all actual values minus expected values equals zero

The sum of all (actual - expected) squared is the minimum value possible.

In equation form:

1. Σ(Yi - Ŷi) = 0

2. Σ(Yi - Ŷi)² = minimum
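As a small illustration of these two properties, the slope and intercept can be computed in closed form. The following is a minimal sketch in Python using numpy and made-up toy data (both the library choice and the numbers are assumptions, not taken from the paper):

    import numpy as np

    # Toy data: x is the independent variable, y the dependent variable.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.0, 9.9])

    # Closed-form least-squares estimates: b = cov(x, y) / var(x), a = mean(y) - b * mean(x).
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()

    residuals = y - (a + b * x)
    print(a, b)
    print(residuals.sum())          # property 1: zero, up to floating-point error
    print((residuals ** 2).sum())   # property 2: the minimized sum of squared residuals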

Data

Data Source [3]:

I am using an Excel sheet with information regarding apps. It includes the app names ("App" column), the category of the apps ("Category" column), the rating scores of the apps ("Rating" column), the number of reviews the apps have ("Reviews" column), the size of the apps ("Size" column), the approximate number of installs of the apps ("Installs" column), whether the app is free or paid ("Type" column), the price if it is paid ("Price" column), the content rating ("Content Rating" column), the genre ("Genres" column), the apps' current version ("Current Ver" column), and the apps' Android version ("Android Ver" column).

Table 1. - First Preview of the Raw Dataset

[Screenshot of the first rows of the raw dataset, showing the App, Category, Rating, Reviews, Size, Installs, Type, Price, Content Rating, Genres, Current Ver and Android Ver columns.]

Table 2.- Second Preview of the Raw Dataset

[Screenshot of further raw dataset rows, showing categories, review counts, sizes, install counts, prices and content ratings.]

Table 3.- Third Preview of the Raw Dataset

[Screenshot of raw dataset rows illustrating the Price column, including a paid app with a "$" price.]

All of the data is read into the variable "data" using pandas, which obtains all the information from the dataset. This variable is fundamental to the function of the code, because "data" is used differently for each column in the dataset to meet the needs of that specific column.

For example, I used the data variable to create different variables only applicable to certain columns, or I used the data variable to apply functions only applicable to certain columns. This allowed me to focus on the columns individually when cleaning the data.
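The loading step might look like the minimal sketch below; the file name googleplaystore.csv is an assumption, since the paper describes the dataset but not the loading code:

    import pandas as pd

    # Read the raw Google Play Store export into a DataFrame called "data".
    # The file name is assumed; substitute the actual path to the dataset.
    data = pd.read_csv("googleplaystore.csv")

    # Quick check of the columns described above.
    print(data.columns.tolist())
    print(data.head())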

Data Cleaning:

Table 4.- Preview of the Cleaned Dataset

[Screenshot of the cleaned dataset, with numeric values in the Category Cleaned, Size, Installs, Type, Price, Content Rating and Genres Cleaned columns.]

In general, when examining data, one should make sure that there are readable numerical values within each column. In order to create a multiple linear regression model, data cleaning needs to be done first. Therefore, to begin, I turned the different category types within the category column into their respective numbers. For example, in the "Category" column of my dataset (shown in Table 1), which contained the categories the apps were in, "ART_AND_DESIGN" was assigned to 0, "AUTO_AND_VEHICLES" was assigned to 1, and so on. I did this by creating a dict which mapped the original column into a new column with the new assigned values, as shown in the sketch below. This was useful as it tied a specific number to each app instead of a string. Now, if any app was tied to the number 0, the program would know that it was in the "ART_AND_DESIGN" category. Table 4 shows the original column mapped into the new column "Category Cleaned." The new column is used for the eventual regression model, and the old one is ignored.

Next, the same concept applies to size. Each size value had an "M" or a "k" in it (shown in Table 1), which meant megabyte or kilobyte, respectively. To make this readable to my program, I wanted to remove these letters and translate the values into plain numbers, creating a more uniform column. In this case, I converted each value into the number of kilobytes the app took up. So, I wrote a function that removed the letter "M" and then multiplied the number by 1024, since 1 M is equal to 1024 k. For those with a "k", I just removed the letter "k" and returned the number. Then, I mapped the column and applied the function to it, as sketched below. Table 4 shows the sizes as readable floats.
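A sketch of the size conversion; the helper name clean_size and the handling of values that carry neither letter (such as "Varies with device") are assumptions:

    def clean_size(size):
        # "19M" -> 19 * 1024 kilobytes, "512k" -> 512 kilobytes.
        if size.endswith("M"):
            return float(size[:-1]) * 1024
        if size.endswith("k"):
            return float(size[:-1])
        return None  # anything else cannot be converted and becomes missing

    data["Size"] = data["Size"].map(clean_size)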

Then, for the Installs column, the values were formatted as the raw number with a plus sign at the end, for example "50,000+" (shown in Tables 1 and 2). Therefore, I wanted to remove the "+" and convert the remaining values into floats so that they would be readable. I stripped the "+" off each value and then converted the column to float using .astype. However, there was a problem: I was unable to convert the entire column because one of the cells held an anomalous value. So, I found its location and dropped that row so that the column could be converted to floats, roughly as in the sketch below. Again, Table 4 shows the cleaned column.
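A sketch of this step; stripping the thousands separators and the exact way the anomalous row is located are assumed details:

    # Remove the trailing "+" and the thousands separators.
    installs = (data["Installs"]
                .str.replace("+", "", regex=False)
                .str.replace(",", "", regex=False))

    # One cell held a non-numeric value; find that row and drop it entirely.
    bad = installs[~installs.str.match(r"^\d+$", na=False)].index
    data = data.drop(index=bad)

    # Now the remaining values can be converted to float with .astype.
    data["Installs"] = installs.drop(index=bad).astype(float)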

After that, for the "Type" column, there were two values, "Free" and "Paid" (shown in Tables 1 and 2). To convert these into readable values, I turned them into binary: 0 and 1. I wrote a function which returned 0 if the value was "Free" and 1 if the value was "Paid." Then, I mapped the column and applied the written function, as sketched below. The cleaned dataset in Table 4 shows the new mapped values.
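A sketch of the free/paid conversion; the helper name is an assumption:

    def clean_type(app_type):
        # Return 0 for free apps and 1 for paid apps.
        return 0 if app_type == "Free" else 1

    data["Type"] = data["Type"].map(clean_type)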

Nearing the end, I used the same concept as for the category column on the content rating column. The "Content Rating" column contained "Everyone," "Everyone 10+," "Teen," and "Mature 17+." I used a dict and assigned a respective numerical value to each distinct value in content rating, so that the values could be read by the program. In the small preview shown in Table 4, the respective numerical values appear in the Content Rating column instead of the string values.

I then had to clean up the "Genres" column too, which, although quite similar to the "Category" column, was not identical. Even so, the same concept of data cleaning applies. To clean this column, I created a separate dict which mapped numerical values in place of the original string values in the column. As you can see in Table 4, the "Genres Cleaned" column shows numerical values. This is just a map of the original "Genres" column, so when building the regression model, similar to the "Category" column described above, I use the "Genres Cleaned" column instead of the "Genres" column.

Finally, I had to clean the price column and drop any unneeded columns. Regarding the price column, for the Free apps the price was already equal to 0 (shown in Table 3). I wrote a function that left the value 0 unchanged and removed the "$" sign from the prices of the apps that did cost money, as sketched below. I then dropped the "Current Ver" and "Android Ver" columns, as they did not help much in my model.
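A sketch of the price cleaning and the column drop; the helper name and overwriting the "Price" column in place are assumptions:

    def clean_price(price):
        # Free apps already have a price of 0; for paid apps, strip the leading "$".
        if str(price) == "0":
            return 0.0
        return float(str(price).lstrip("$"))

    data["Price"] = data["Price"].map(clean_price)

    # Drop the two columns that did not help the model.
    data = data.drop(columns=["Current Ver", "Android Ver"])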

Results and Insights

Linear Regression:

In order to put this code to use, I applied linear regression at the end. Using linear regression, I created linear models that demonstrate relationships between different variables of the data [5]. By compiling many different points based on the data set, and by drawing a line that captures the relationships, I was able to visualize the patterns within the data set. Doing this allowed me to diagnose the factors that go into the overall rating of an app. A sketch of how such a model can be fit is given below.
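The paper does not reproduce the fitting code, so the following is a minimal sketch of how such a multiple regression can be fit in Python; the use of scikit-learn, the exact feature list, and the train/test split are assumptions (it also assumes "Reviews" has been converted to a numeric type during cleaning):

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Cleaned numeric features and the target (the rating score).
    features = ["Category Cleaned", "Reviews", "Size", "Installs", "Type",
                "Price", "Content Rating", "Genres Cleaned"]
    cleaned = data.dropna(subset=features + ["Rating"])
    X = cleaned[features]
    y = cleaned["Rating"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    model = LinearRegression()
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)

    # Regression coefficients per feature (these come out very small, as noted below).
    print(dict(zip(features, model.coef_)))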

Correlation:

The colors seen in Figure 5 indicate the magnitude of correlation. The darker (or bluer) the color, the higher the correlation; the lighter (or redder) the color, the lower the correlation. When first looking at this heat map, the eye gravitates towards the bottom right and the center, which are marked by the variable pairs "Category Cleaned" and "Genres Cleaned" and "Type" and "Price," respectively. This means that those variables are correlated with each other. Looking at the rest of the heat map, you are able to tell which variables are correlated and which are not. For example, "Content Rating" and "Category Cleaned" have a low correlation. A sketch of how such a heat map can be drawn is given below.
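Continuing the sketches above, a heat map like Figure 5 can be drawn from the correlation matrix of the cleaned columns; the use of seaborn and the colour scheme are assumptions:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Correlation matrix of the cleaned numeric columns, drawn as a heat map.
    corr = cleaned[features + ["Rating"]].corr()
    sns.heatmap(corr, cmap="coolwarm_r")  # bluer = higher correlation, redder = lower
    plt.title("Correlation Heat Map")
    plt.show()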

Regression Coefficients Validation:

When examining the coefficients of the regression, I found them to be extremely low. This seems reasonable, as the rate of change of the regression line was not very high, meaning that the low values of the coefficients are plausible [6].

Residuals Analysis:

Figure 5. Correlation Heat Map

R2: 0.010048884330930763

Figure 6. Residuals Plot

Here, I have the residuals plot, as well as my calculated R2 value. The R2 value is, in short, a statistic that gives basic information regarding the accuracy, or robustness, of a model. One is usually looking for a higher value than the one I calculated: an R2 value of 1 is the most optimal. However, there are reasons for the low R2 value. First, the data points are very spread out; they do not concentrate in one area but rather in many areas (shown in Figure 6). This affects the R2 value. Second, there are differences in the magnitude of the variables used. For example, the column of reviews may only be in the thousands, whereas the column of installs may be well into the millions. With such a huge difference in value, it is hard to calculate an accurate R2 value that adequately fits the graph. Therefore, although the raw calculation of the R2 value was done, it is not exactly applicable in this situation, due to these external factors.
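Continuing the sketches above, the R2 value and a residuals plot like Figure 6 can be produced as follows (the use of scikit-learn's r2_score and matplotlib is an assumption):

    import matplotlib.pyplot as plt
    from sklearn.metrics import r2_score

    # R2 on the held-out apps.
    print("R2:", r2_score(y_test, predicted))

    # Residuals plot: residual = actual rating - predicted rating.
    residuals = y_test - predicted
    plt.scatter(predicted, residuals, s=5)
    plt.axhline(0, color="grey")
    plt.xlabel("Predicted Rating")
    plt.ylabel("Residual")
    plt.show()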

Business Insight/Developer Perspective:

So, our results are linear models that have "Actual Ratings" on the y-axis and "Predicted Ratings" on the x-axis. The models represent predictions of app ratings made from the data provided.

Figure 7. Linear Model Excluding Genres Column

Figure 8. Linear Model Including Genres Column

The graphs in Figures 7 and 8 are, of course, quite similar; one column of genres is not going to make a big difference. There is just a small difference in standard deviation and average value, with the standard deviation being 0.0569 and 0.064 and the average value being 4.191 and 4.200, respectively, for excluding and including genres. The linear model without the genres column had a lower average value and a lower standard deviation, meaning its values varied from the average less than those of the linear model with the genres column. Including the genres column tended to create the less accurate model [6].

This model is able to create a predicted value of a rating and then graph it against the actual rating. As seen from the graph, the result is not terrible: most of the data points are clustered near the center line, which tends to be the most accurate part.
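Continuing the sketches above, a predicted-versus-actual plot in the spirit of Figures 7 and 8 could look like this; the styling and the reference line are assumptions:

    import matplotlib.pyplot as plt

    plt.scatter(predicted, y_test, s=5)
    plt.plot([1, 5], [1, 5], color="red")  # line where the prediction equals the actual rating
    plt.xlabel("Predicted Ratings")
    plt.ylabel("Actual Ratings")
    plt.show()

    # Summary statistics of the predictions, comparable to the averages quoted above.
    print("mean predicted rating:", predicted.mean(), "std:", predicted.std())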

Businesses will be able to use this to create a competitive advantage when releasing apps. Knowing they have a program that can predict the ratings of an app given certain variables, they can create apps that are centered around the variables that seem to pull the highest predicted rating. For example, variables such as size, price, and the number of installs can all be informed by this model, giving businesses some assistance when creating apps and planning for the future of those apps.

Model Improvement

I found that the R2 value was not very ideal as a prediction criterion. Within the linear regression framework itself, one can improve the model by standardizing the features, so that all the features enter the model with the same magnitude. This would also help with dealing with the extremely large values or, conversely, the extremely low values.
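A minimal sketch of this standardization idea, continuing the sketches above; the use of scikit-learn's StandardScaler inside a pipeline is an assumption:

    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Standardize the features so that columns measured in millions (Installs) and
    # in thousands (Reviews) enter the model on the same scale.
    scaled_model = make_pipeline(StandardScaler(), LinearRegression())
    scaled_model.fit(X_train, y_train)
    print("R2 with standardized features:", scaled_model.score(X_test, y_test))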

Another aspect that could help me improve this project would be increasing the model complexity. Instead of a linear model, I could try a non-linear model, such as non-parametric regression or, even more advanced, machine learning models like a tree model or a neural network. I could also have introduced more complex variables. This would add specificity and let the model zero in on the apps more.
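As one concrete example of such a model, the sketch below swaps the linear model for a tree-based ensemble; the choice of a random forest and its hyper-parameters are illustrative, not taken from the original project:

    from sklearn.ensemble import RandomForestRegressor

    # A non-linear, tree-based model fit on the same cleaned features.
    forest = RandomForestRegressor(n_estimators=200, random_state=0)
    forest.fit(X_train, y_train)
    print("R2 (random forest):", forest.score(X_test, y_test))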

Conclusion

Ultimately, in this paper I was able to reach my goal. I was not only able to clean and scan data, but I was also able to apply it to a linear regression model, which then has applications to the real world. I was able to learn and experiment with many things, and reach a conclusion that could be useful to businesses who are involved with app development and marketing. Among the many variables that made up the data set I used, such as "App," "Category," "Rating," "Reviews," and "Size," I found that some were more influential than others. This was supported by my correlation heat map. For example, looking at the "Type" and "Price" columns, which indicate whether an app costs money or not, you are able to tell that they are very influential variables. Looking at "Content Rating" and "Installs," there is also a decent correlation. Looked at logically, it makes sense that these variables would be more influential; after all, people want to get bang for their buck. So, after completing this project, my suggestions for companies and developers who want to improve their ratings would be to focus on reasonable and balanced prices for apps (if not free) and to try to maximize the number of installs. The more installs an app has, the more accurate the ratings get, as they start to average out. With a greater number of installs comes a greater number of reviews; deviations are less frequent, and therefore an app will have more accurate ratings.

References:

1. Weisberg S. Applied linear regression (Vol. 528). John Wiley & Sons, 2005.

2. Groß J. Linear regression (Vol. 175). Springer Science & Business Media, 2012.

3. Maredia R. Analysis of Google Play Store data set and predict the popularity of an app on Google Play Store.

4. Taylor R. Interpretation of the correlation coefficient: a basic review. Journal of Diagnostic Medical Sonography, 6(1), 1990, P. 35-39.

5. Uyanik G. K. and Güler N. A study on multiple linear regression analysis. Procedia - Social and Behavioral Sciences, 106, 2013, P. 234-240.

6. Aiken L. S., West S. G., Pitts S. C., Baraldi A. N. and Wurpts I. C. Multiple linear regression. Handbook of Psychology, Second Edition, 2, 2012.
