Rating Products & Sorting Reviews on Amazon

In this blog, we’ll be focusing on two primary goals:
- Calculating a product rating that gives more weight to recent reviews, and comparing it with the plain average rating.
- Sorting reviews using several methods and comparing the results.
Variables:
#reviewerID: User ID
#asin: Product ID
#reviewerName: User Name
#helpful: Helpful rating degree
#reviewText: Review
#overall: Product rating
#summary: Review summary
#unixReviewTime: Review time
#reviewTime: Raw review time
#day_diff: Number of days since the review
#helpful_yes: Number of times the review was found helpful
#total_vote: Total votes given to the review
import matplotlib.pyplot as plt
import pandas as pd
import math
import scipy.stats as st
pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', 10)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.float_format', lambda x: '%.5f' % x)
df = pd.read_csv("amazon_review.csv")
df.head()
reviewerID asin reviewerName helpful reviewText overall summary unixReviewTime reviewTime day_diff helpful_yes total_vote
0 A3SBTW3WS4IQSN B007WTAJTO NaN [0, 0] No issues. 4.00000 Four Stars 1406073600 2014-07-23 138 0 0
1 A18K1ODH1I2MVB B007WTAJTO 0mie [0, 0] Purchased this for my device, it worked as adv... 5.00000 MOAR SPACE!!! 1382659200 2013-10-25 409 0 0
2 A2FII3I2MBMUIA B007WTAJTO 1K3 [0, 0] it works as expected. I should have sprung for... 4.00000 nothing to really say.... 1356220800 2012-12-23 715 0 0
3 A3H99DFEG68SR B007WTAJTO 1m2 [0, 0] This think has worked out great.Had a diff. br... 5.00000 Great buy at this price!!! *** UPDATE 1384992000 2013-11-21 382 0 0
4 A375ZM4U047O79 B007WTAJTO 2&1/2Men [0, 0] Bought it with Retail Packaging, arrived legit... 5.00000 best deal around 1373673600 2013-07-13 513 0 0
Rating Products
First, we calculate the simple average rating of the product.
df["overall"].mean()
Out[21]: 4.587589013224822
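To give recent reviews more weight, we define the Time-Based Weighted Average function. It splits the reviews into four groups at the quartiles of day_diff and weights the mean rating of each group: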
def time_based_weighted_average(dataframe, w1=50, w2=25, w3=15, w4=10):
    # Weights are percentages; w1 applies to the most recent quartile (smallest day_diff).
    days = dataframe["day_diff"]
    q1, q2, q3 = days.quantile([0.25, 0.50, 0.75])
    return (dataframe.loc[days <= q1, "overall"].mean() * w1 / 100
            + dataframe.loc[(days > q1) & (days <= q2), "overall"].mean() * w2 / 100
            + dataframe.loc[(days > q2) & (days <= q3), "overall"].mean() * w3 / 100
            + dataframe.loc[days > q3, "overall"].mean() * w4 / 100)
Giving more weight to recently written reviews raises the rating to 4.637306192407316, compared with the simple average of 4.587589013224822.
So the time-based weighted average is higher, which means recent reviews rate the product slightly better than older ones.
time_based_weighted_average(df, w1=50, w2=25, w3=15, w4=10)
Out[48]: 4.637306192407316
We can change the weights. For example:
time_based_weighted_average(df, w1=60, w2=30, w3=8, w4=2)
Out[7]: 4.662975899944154
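To make the quartile weighting easier to follow, here is a minimal sketch on a made-up data set. The function name quartile_weighted_mean, the toy values, and the use of pd.qcut are assumptions for illustration only; it is not the function used above, although it should group reviews the same way when the quartile edges are distinct.

```python
import pandas as pd

def quartile_weighted_mean(frame, weights=(50, 25, 15, 10)):
    """Weighted mean of 'overall' across day_diff quartiles (weights in percent)."""
    if sum(weights) != 100:
        raise ValueError("weights must sum to 100")
    # qcut labels each review with its recency quartile: 0 = newest, 3 = oldest
    quartile = pd.qcut(frame["day_diff"], q=4, labels=False)
    means = frame.groupby(quartile)["overall"].mean()
    return sum(means[i] * w / 100 for i, w in enumerate(weights))

# Hypothetical reviews: the newer ones (small day_diff) have higher ratings
toy = pd.DataFrame({
    "day_diff": [10, 20, 30, 40, 50, 60, 70, 80],
    "overall":  [5, 5, 4, 4, 3, 3, 2, 2],
})

plain_mean = toy["overall"].mean()      # 3.5
result = quartile_weighted_mean(toy)    # 5*0.50 + 4*0.25 + 3*0.15 + 2*0.10 = 4.15
print(plain_mean, result)
```

Because the newest quartile has the highest ratings in this toy data, the weighted value ends up above the plain mean, which is the same effect we saw on the real data.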
Sorting Reviews
Our goal is to choose the 20 reviews that will be displayed on the product detail page.
There is no ‘helpful_no’ variable in the data set, so we need to create it. Here "up" means a helpful vote and "down" means an unhelpful vote. We create the helpful_no variable and then build a new df with only the columns we will use:
df["helpful_no"] = df["total_vote"] - df["helpful_yes"]
df = df[["reviewerName", "overall", "summary", "helpful_yes", "helpful_no", "total_vote", "reviewTime"]].copy()
df.head()
reviewerName overall summary helpful_yes helpful_no total_vote reviewTime
0 NaN 4.00000 Four Stars 0 0 0 2014-07-23
1 0mie 5.00000 MOAR SPACE!!! 0 0 0 2013-10-25
2 1K3 4.00000 nothing to really say.... 0 0 0 2012-12-23
3 1m2 5.00000 Great buy at this price!!! *** UPDATE 0 0 0 2013-11-21
4 2&1/2Men 5.00000 best deal around 0 0 0 2013-07-13
Next, we calculate the score_pos_neg_diff, score_average_rating, and wilson_lower_bound scores and add them to the dataset.
Up-Down Diff Score = (up ratings) − (down ratings)
def score_up_down_diff(up, down):
    return up - down
df["score_pos_neg_diff"] = df.apply(lambda x: score_up_down_diff(x["helpful_yes"], x["helpful_no"]), axis=1)
df.sort_values("score_pos_neg_diff", ascending=False).head(10)
reviewerName overall summary helpful_yes helpful_no total_vote reviewTime score_pos_neg_diff
2031 Hyoun Kim "Faluzure" 5.00000 UPDATED - Great w/ Galaxy S4 & Galaxy Tab 4 10... 1952 68 2020 2013-01-05 1884
4212 SkincareCEO 1.00000 1 Star reviews - Micro SDXC card unmounts itse... 1568 126 1694 2013-05-08 1442
3449 NLee the Engineer 5.00000 Top of the class among all (budget-priced) mic... 1428 77 1505 2012-09-26 1351
317 Amazon Customer "Kelly" 1.00000 Warning, read this! 422 73 495 2012-02-09 349
3981 R. Sutton, Jr. "RWSynergy" 5.00000 Resolving confusion between "Mobile Ultra" and... 112 27 139 2012-10-22 85
4596 Tom Henriksen "Doggy Diner" 1.00000 Designed incompatibility/Don't support SanDisk 82 27 109 2012-09-22 55
1835 goconfigure 5.00000 I own it 60 8 68 2014-02-28 52
4672 Twister 5.00000 Super high capacity!!! Excellent price (on Am... 45 4 49 2014-07-03 41
4306 Stellar Eller 5.00000 Awesome Card! 51 14 65 2012-09-06 37
315 Amazon Customer "johncrea" 5.00000 Samsung Galaxy Tab2 works with this card if re... 38 10 48 2012-08-13 28
In the previous output, we identified 10 reviews that we will display based on the metric (up-down).
The “up-down” metric is not preferred for sorting reviews because it depends only on raw vote counts and ignores the proportion of helpful votes. A review with many votes can get a large difference even when a sizeable share of readers found it unhelpful, so a high score does not necessarily mean the review is genuinely helpful or insightful. Additionally, some negative reviews could contain valuable information.
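A small sketch of this weakness, using two hypothetical reviews whose vote counts are made up for illustration and do not come from the dataset:

```python
def score_up_down_diff(up, down):
    return up - down

# Hypothetical (helpful_yes, helpful_no) pairs, not taken from the data
review_a = (600, 400)   # heavily voted, but only 60% of votes are helpful
review_b = (150, 0)     # fewer votes, all of them helpful

diff_a = score_up_down_diff(*review_a)          # 200
diff_b = score_up_down_diff(*review_b)          # 150
share_a = review_a[0] / sum(review_a)           # 0.6
share_b = review_b[0] / sum(review_b)           # 1.0

print(diff_a > diff_b, share_a < share_b)
```

Review A ranks first by difference even though a much smaller share of its voters found it helpful.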
Average rating
def score_average_rating(up, down):
    if up + down == 0:
        return 0
    return up / (up + down)
df["score_average_rating"] = df.apply(lambda x: score_average_rating(x["helpful_yes"], x["helpful_no"]), axis=1)
This function (score_average_rating(up, down)) calculates the share of helpful votes among all votes, that is up / (up + down). It returns 0 when a review has no votes.
df.sort_values("score_average_rating", ascending=False).head(10)
reviewerName overall summary helpful_yes helpful_no total_vote reviewTime score_pos_neg_diff score_average_rating
4277 S. Q. 5.00000 Perfect!! 1 0 1 2012-12-19 1 1.00000
2881 Lou Thomas 5.00000 Nexus One Loves This Card! 1 0 1 2012-01-10 1 1.00000
1073 C. Sanchez 5.00000 Tons of space for phone 1 0 1 2013-08-13 1 1.00000
445 Apache "Elizabeth" 4.00000 Amazon Great Prices 1 0 1 2013-12-18 1 1.00000
3923 Rock Your Roots 5.00000 What more to say? 1 0 1 2013-12-30 1 1.00000
435 Anthony L cate 5.00000 Love the extra storage 1 0 1 2012-07-24 1 1.00000
2901 luis 5.00000 Awesome and fast card :) 1 0 1 2013-05-13 1 1.00000
2204 jbwam "jbwam" 2.00000 Sandisk will replace failures due to bad batch... 1 0 1 2013-06-14 1 1.00000
2206 JCBiker 5.00000 Great card 1 0 1 2013-10-31 1 1.00000
3408 Neng Vang "Neng2012" 5.00000 working no problem 1 0 1 2013-07-25 1 1.00000
In the previous output, we identified 10 reviews that we will display based on the up-down ratio.
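Like the Up-Down Diff Score, the average rating is not preferred either. It ignores how many votes a review has received, so a review with a single helpful vote scores a perfect 1.0 and ranks above reviews with thousands of votes.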
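The effect is easy to reproduce with two vote counts that appear in the tables above (1 helpful vote out of 1, and 1952 helpful votes out of 2020). This is only an illustrative sketch that repeats the function so it can run on its own:

```python
def score_average_rating(up, down):
    total = up + down
    return 0 if total == 0 else up / total

single_vote = score_average_rating(1, 0)       # one helpful vote out of one
many_votes = score_average_rating(1952, 68)    # 1952 helpful votes out of 2020

# The single-vote review wins, although it carries far less evidence
print(single_vote, round(many_votes, 5), single_vote > many_votes)
```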
Wilson Lower Bound Score
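We define our WLB (Wilson Lower Bound) function. It returns the lower bound of the confidence interval for the proportion of helpful votes, so reviews with only a few votes are scored conservatively.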
def wilson_lower_bound(up, down, confidence=0.95):
    n = up + down
    if n == 0:
        return 0
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    phat = 1.0 * up / n
    return (phat + z * z / (2 * n) - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)
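Written as a formula, the return value of the function is the lower bound of the Wilson score interval. The notation below is a transcription of the code, where n = up + down, \hat{p} = up / n, and z is the standard normal quantile for the chosen confidence level:

```latex
\mathrm{WLB} =
\frac{\hat{p} + \dfrac{z^{2}}{2n}
      - z \sqrt{\dfrac{\hat{p}\,(1-\hat{p}) + \dfrac{z^{2}}{4n}}{n}}}
     {1 + \dfrac{z^{2}}{n}},
\qquad
z = \Phi^{-1}\!\left(1 - \frac{1 - \text{confidence}}{2}\right)
```

When every vote is helpful (\hat{p} = 1) the expression simplifies to n / (n + z^2), which shows directly how the score grows with the number of votes.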
We’re applying the function to our dataset.
df["wilson_lower_bound"] = df.apply(lambda x: wilson_lower_bound(x["helpful_yes"], x["helpful_no"]), axis=1)
We’re sorting by wilson_lower_bound in descending order and taking the first 20 results.
df.sort_values("wilson_lower_bound", ascending=False).head(20)
reviewerName overall summary helpful_yes helpful_no total_vote reviewTime score_pos_neg_diff wilson_lower_bound
2031 Hyoun Kim "Faluzure" 5.00000 UPDATED - Great w/ Galaxy S4 & Galaxy Tab 4 10... 1952 68 2020 2013-01-05 1884 0.95754
3449 NLee the Engineer 5.00000 Top of the class among all (budget-priced) mic... 1428 77 1505 2012-09-26 1351 0.93652
4212 SkincareCEO 1.00000 1 Star reviews - Micro SDXC card unmounts itse... 1568 126 1694 2013-05-08 1442 0.91214
317 Amazon Customer "Kelly" 1.00000 Warning, read this! 422 73 495 2012-02-09 349 0.81858
4672 Twister 5.00000 Super high capacity!!! Excellent price (on Am... 45 4 49 2014-07-03 41 0.80811
1835 goconfigure 5.00000 I own it 60 8 68 2014-02-28 52 0.78465
3981 R. Sutton, Jr. "RWSynergy" 5.00000 Resolving confusion between "Mobile Ultra" and... 112 27 139 2012-10-22 85 0.73214
3807 R. Heisler 3.00000 Good buy for the money but wait, I had an issue! 22 3 25 2013-02-27 19 0.70044
4306 Stellar Eller 5.00000 Awesome Card! 51 14 65 2012-09-06 37 0.67033
4596 Tom Henriksen "Doggy Diner" 1.00000 Designed incompatibility/Don't support SanDisk 82 27 109 2012-09-22 55 0.66359
315 Amazon Customer "johncrea" 5.00000 Samsung Galaxy Tab2 works with this card if re... 38 10 48 2012-08-13 28 0.65741
1465 D. Stein 4.00000 Finally. 7 0 7 2014-04-14 7 0.64567
1609 Eskimo 5.00000 Bet you wish you had one of these 7 0 7 2014-03-26 7 0.64567
4302 Stayeraug 5.00000 Perfect with GoPro Black 3+ 14 2 16 2014-03-21 12 0.63977
4072 sb21 "sb21" 5.00000 Used for my Samsung Galaxy Tab 2 7.0 6 0 6 2012-11-09 6 0.60967
1072 Crysis Complex 5.00000 Works wonders for the Galaxy Note 2! 5 0 5 2012-05-10 5 0.56552
2583 J. Wong 5.00000 Works Great with a GoPro 3 Black! 5 0 5 2013-08-06 5 0.56552
121 A. Lee 5.00000 ready for use on the Galaxy S3 5 0 5 2012-05-09 5 0.56552
1142 Daniel Pham(Danpham_X @ yahoo. com) 5.00000 Great large capacity card 5 0 5 2014-02-04 5 0.56552
1753 G. Becker 5.00000 Use Nothing Other Than the Best 5 0 5 2012-10-22 5 0.56552
We have identified comments to highlight according to the WLB method. Above, you can see the first 20 of them.
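To see why WLB behaves better than the two earlier scores, the sketch below recomputes it for four vote counts taken from the tables above. It is an assumption-laden illustration rather than the code used in this post: it swaps scipy's st.norm.ppf for NormalDist().inv_cdf from the standard library so that it runs without extra packages.

```python
import math
from statistics import NormalDist

def wilson_lower_bound(up, down, confidence=0.95):
    n = up + down
    if n == 0:
        return 0.0
    # Standard normal quantile (stdlib stand-in for st.norm.ppf)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    phat = up / n
    return (phat + z * z / (2 * n)
            - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)

# (helpful_yes, helpful_no) pairs copied from the output tables in this post
votes = {
    "S. Q.": (1, 0),
    "A. Lee": (5, 0),
    "Hyoun Kim": (1952, 68),
    "SkincareCEO": (1568, 126),
}
scores = {name: wilson_lower_bound(up, down) for name, (up, down) in votes.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name:12s} {scores[name]:.5f}")
```

The single-vote review that topped the average rating list now lands at the bottom, while the heavily voted reviews keep their high scores.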
Big thanks to Vahit Keskin and Miuul
Contact me on Linkedin :) yaseminderyadilli