Amazon Review Data (2018)
Jianmo Ni, UCSD

Description
- The total number of reviews is 233.1 million (142.8 million in 2014).
- Current data includes reviews in the range May 1996 - Oct 2018.
- Product information, e.g. color (white or black), size (large or small), package type (hardcover or electronics), etc.
- Product images that are taken after the user received the product.
- Bullet-point descriptions under product title.
- Technical details table (attribute-value pairs).
- Similar products table.
- Includes 5 new product categories.
You can also download the review data from our previous datasets.
Amazon review (2014)
Amazon review (2013)
Please cite the following paper if you use the data in any way:
Justifying recommendations using distantly-labeled reviews and fine-grained aspects. Jianmo Ni, Jiacheng Li, Julian McAuley. Empirical Methods in Natural Language Processing (EMNLP), 2019.
05/2021 We added high-resolution image URLs to the metadata!
08/2020 We have updated the metadata and now it includes much less HTML/CSS code. Feel free to download the updated data!
- Load the metadata (e.g. as JSON or DataFrame)
- Check if title has HTML contents and filter them
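The two steps above can be sketched as follows. This is a minimal illustration, assuming the metadata is one JSON object per line with a "title" field; the regular expression and the helper names (has_html, filter_clean) are hypothetical and not the authors' own filtering rule.

```python
import json
import re

# Heuristic (an assumption, not the authors' rule): a title "has HTML contents"
# if it contains something that looks like a tag or a character entity.
HTML_PATTERN = re.compile(r"</?[a-zA-Z][^>]*>|&[a-zA-Z]+;|&#\d+;")

def has_html(title):
    """Return True if the title string appears to contain HTML markup."""
    return bool(HTML_PATTERN.search(title or ""))

def filter_clean(records):
    """Keep only the metadata records whose title looks free of HTML."""
    return [r for r in records if not has_html(r.get("title"))]

# Tiny in-memory stand-in for lines of a metadata file (hypothetical data).
lines = [
    '{"asin": "0000031852", "title": "Girls Ballet Tutu Zebra Hot Pink"}',
    '{"asin": "X000000001", "title": "<span class=\\"a-size\\">Broken title</span>"}',
]
records = [json.loads(line) for line in lines]
clean = filter_clean(records)
print([r["asin"] for r in clean])
```

A heuristic like this can misfire on unusual titles, so it is worth inspecting a sample of the rejected records before discarding them.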
We provide a colab notebook that helps you find target products and obtain their reviews!
- Unparsed HTML contents
- Duplicate items which have same reviews
Complete review data
Please only download these (large!) files if you really need them. We recommend using the smaller datasets (i.e. k-core and CSV files) as shown in the next section.
raw review data (34gb) - all 233.1 million reviews
user review data (18gb) - duplicate items removed (83.68 million reviews), sorted by user
product review data (18gb) - duplicate items removed, sorted by product
5-core (14.3gb) - subset of the data in which all users and items have at least 5 reviews (75.26 million reviews)
Finally, the following file removes duplicates more aggressively, removing duplicates even if they are written by different users. This accounts for users with multiple accounts or plagiarized reviews. Such duplicates account for less than 1 percent of reviews, though this dataset is probably preferable for sentiment analysis type tasks:
aggressively deduplicated data (18gb) - no duplicates whatsoever (82.83 million reviews)
Per-category data - the review and product metadata for each category.
To download the complete review data and the per-category files, the following links will direct you to enter a form. Please contact me if you can't get access to the form.
"Small" subsets for experimentation
If you're using this data for a class project (or similar) please consider using one of these smaller datasets below before requesting the larger files.
K-cores (i.e., dense subsets): These data have been reduced to extract the k-core, such that each of the remaining users and items has at least k reviews.
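The k-core reduction described above can be sketched as an iterative pruning loop. This is an illustrative sketch rather than the script used to build the released files; the function name k_core and the (user, item) pair representation are assumptions.

```python
from collections import Counter

def k_core(interactions, k):
    """Iteratively drop users and items with fewer than k reviews.

    interactions: iterable of (user, item) pairs, one pair per review.
    Returns the surviving pairs as a list.
    """
    pairs = list(interactions)
    while True:
        user_counts = Counter(u for u, _ in pairs)
        item_counts = Counter(i for _, i in pairs)
        kept = [(u, i) for u, i in pairs
                if user_counts[u] >= k and item_counts[i] >= k]
        if len(kept) == len(pairs):  # nothing removed: fixed point reached
            return kept
        pairs = kept

# Toy example: u3 has a single review, so it is pruned for k = 2.
toy = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"), ("u3", "a")]
core = k_core(toy, 2)
print(core)
```

The loop is needed because removing a sparse user can push an item below k reviews (and vice versa), so a single filtering pass is generally not enough.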
Ratings only: These datasets include no metadata or reviews, but only (item,user,rating,timestamp) tuples. Thus they are suitable for use with mymedialite (or similar) packages.
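A ratings-only file can be read with the standard csv module. The sketch below assumes the file has no header row and uses the column order given above (item, user, rating, timestamp); read_ratings is a hypothetical helper name, and the sample row reuses values from the sample review on this page.

```python
import csv
import io

def read_ratings(handle):
    """Yield (item, user, rating, timestamp) tuples from a ratings-only CSV."""
    for item, user, rating, timestamp in csv.reader(handle):
        yield item, user, float(rating), int(timestamp)

# In-memory stand-in for an open ratings file.
sample = io.StringIO("0000013714,A2SUAM1J3GNN3B,5.0,1252800000\n")
rows = list(read_ratings(sample))
print(rows)
```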
You can directly download the following smaller per-category datasets.
Data format
Format is one-review-per-line in json. See examples below for further help reading the data.
Sample review:
{
  "image": ["https://images-na.ssl-images-amazon.com/images/I/71eG75FTJJL._SY88.jpg"],
  "overall": 5.0,
  "vote": "2",
  "verified": True,
  "reviewTime": "01 1, 2018",
  "reviewerID": "AUI6WTTT0QZYS",
  "asin": "5120053084",
  "style": {
    "Size:": "Large",
    "Color:": "Charcoal"
  },
  "reviewerName": "Abbey",
  "reviewText": "I now have 4 of the 5 available colors of this shirt... ",
  "summary": "Comfy, flattering, discreet--highly recommended!",
  "unixReviewTime": 1514764800
}
{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "vote": 5,
  "style": {
    "Format:": "Hardcover"
  },
  "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}
- reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
- asin - ID of the product, e.g. 0000013714
- reviewerName - name of the reviewer
- vote - helpful votes of the review
- style - a dictionary of the product metadata, e.g., "Format" is "Hardcover"
- reviewText - text of the review
- overall - rating of the product
- summary - summary of the review
- unixReviewTime - time of the review (unix time)
- reviewTime - time of the review (raw)
- image - images that users post after they have received the product
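In the sample reviews above, "vote" appears both as a string ("2") and as a number (5), and some reviews have no "vote" field at all, so a small normalization step can help before analysis. The following is a minimal sketch; the helper names are hypothetical, and the handling of thousands separators (e.g. "1,234") is an assumption about how large counts may be written.

```python
from datetime import datetime, timezone

def parse_vote(vote):
    """Convert the 'vote' field to an int; a missing vote counts as 0."""
    if vote is None:
        return 0
    # Assumption: large counts may contain thousands separators.
    return int(str(vote).replace(",", ""))

def review_date(review):
    """Return the review date (UTC) derived from 'unixReviewTime'."""
    return datetime.fromtimestamp(review["unixReviewTime"], tz=timezone.utc).date()

review = {"vote": "2", "reviewTime": "01 1, 2018", "unixReviewTime": 1514764800}
print(parse_vote(review.get("vote")), review_date(review))
```

Using unixReviewTime avoids parsing the raw reviewTime string, whose day field is not zero-padded.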
Metadata includes descriptions, price, sales-rank, brand info, and co-purchasing links:
metadata (24gb) - metadata for 15.5 million products
Sample metadata:
{ "asin": "0000031852", "title": "Girls Ballet Tutu Zebra Hot Pink", "feature": ["Botiquecutie Trademark exclusive Brand", "Hot Pink Layered Zebra Print Tutu", "Fits girls up to a size 4T", "Hand wash / Line Dry", "Includes a Botiquecutie TM Exclusive hair flower bow"], "description": "This tutu is great for dress up play for your little ballerina. Botiquecute Trade Mark exclusive brand. Hot Pink Zebra print tutu.", "price": 3.17, "imageURL": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg", "imageURLHighRes": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL.jpg", "also_buy": ["B00JHONN1S", "B002BZX8Z6", "B00D2K1M3O", "0000031909", "B00613WDTQ", "B00D0WDS9A", "B00D0GCI8S", "0000031895", "B003AVKOP2", "B003AVEU6G", "B003IEDM9Q", "B002R0FA24", "B00D23MC6W", "B00D2K0PA0", "B00538F5OK", "B00CEV86I6", "B002R0FABA", "B00D10CLVW", "B003AVNY6I", "B002GZGI4E", "B001T9NUFS", "B002R0F7FE", "B00E1YRI4C", "B008UBQZKU", "B00D103F8U", "B007R2RM8W"], "also_viewed": ["B002BZX8Z6", "B00JHONN1S", "B008F0SU0Y", "B00D23MC6W", "B00AFDOPDA", "B00E1YRI4C", "B002GZGI4E", "B003AVKOP2", "B00D9C1WBM", "B00CEV8366", "B00CEUX0D8", "B0079ME3KU", "B00CEUWY8K", "B004FOEEHC", "0000031895", "B00BC4GY9Y", "B003XRKA7A", "B00K18LKX2", "B00EM7KAG6", "B00AMQ17JA", "B00D9C32NI", "B002C3Y6WG", "B00JLL4L5Y", "B003AVNY6I", "B008UBQZKU", "B00D0WDS9A", "B00613WDTQ", "B00538F5OK", "B005C4Y4F6", "B004LHZ1NY", "B00CPHX76U", "B00CEUWUZC", "B00IJVASUE", "B00GOR07RE", "B00J2GTM0W", "B00JHNSNSM", "B003IEDM9Q", "B00CYBU84G", "B008VV8NSQ", "B00CYBULSO", "B00I2UHSZA", "B005F50FXC", "B007LCQI3S", "B00DP68AVW", "B009RXWNSI", "B003AVEU6G", "B00HSOJB9M", "B00EHAGZNA", "B0046W9T8C", "B00E79VW6Q", "B00D10CLVW", "B00B0AVO54", "B00E95LC8Q", "B00GOR92SO", "B007ZN5Y56", "B00AL2569W", "B00B608000", "B008F0SMUC", "B00BFXLZ8M"], "salesRank": {"Toys & Games": 211836}, "brand": "Coxlures", "categories": [["Sports & Outdoors", "Other Sports", "Dance"]] }
- asin - ID of the product, e.g. 0000031852
- title - name of the product
- feature - bullet-point format features of the product
- description - description of the product
- price - price in US dollars (at time of crawl)
- imageURL - url of the product image
- imageURLHighRes - url of the high resolution product image
- related - related products (also bought, also viewed, bought together, buy after viewing)
- salesRank - sales rank information
- brand - brand name
- categories - list of categories the product belongs to
- tech1 - the first technical detail table of the product
- tech2 - the second technical detail table of the product
- similar - similar product table
Visual Features
We extracted visual features from each product image using a deep CNN (see citation below). Image features are stored in a binary format, which consists of 10 characters (the product ID), followed by 4096 floats (repeated for every product). See files below for further help reading the data.
visual features (141gb) - visual features for all products
The images themselves can be extracted from the image field in the metadata files.
Below are files for individual product categories, which have already had duplicate item reviews removed.
Reading the data
Data can be treated as python dictionary objects. A simple script to read any of the above data is as follows:
import gzip
import json

def parse(path):
    # Each line of the gzipped file holds one json object
    with gzip.open(path, 'rb') as g:
        for l in g:
            yield json.loads(l)
Convert to 'strict' json
The above data can be read with python 'eval', but is not strict json. If you'd like to use some language other than python, you can convert the data to strict json as follows:
import json
import gzip

def parse(path):
    g = gzip.open(path, 'r')
    for l in g:
        # eval reads python-style literals that strict json rejects
        yield json.dumps(eval(l))

f = open("output.strict", 'w')
for l in parse("reviews_Video_Games.json.gz"):
    f.write(l + '\n')
Pandas data frame
This code reads the data into a pandas data frame:
import pandas as pd
import gzip
import json

def parse(path):
    g = gzip.open(path, 'rb')
    for l in g:
        yield json.loads(l)

def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')

df = getDF('reviews_Video_Games.json.gz')
Convert to CSV
This code converts (a selection of fields from) the above files to CSV format:
import csv
import gzip

fields = ["asin", "description", "brand"]
# csv.writer needs a text-mode file in python 3, hence 'wt'; parse() is defined above
with gzip.open("meta_Video_Games.csv.gz", 'wt', newline='') as csvOut:
    writer = csv.writer(csvOut)
    for product in parse("meta_Video_Games.json.gz"):
        # dict.has_key() no longer exists in python 3; .get() supplies the default
        writer.writerow([product.get(f, "") for f in fields])
Read image features
import array

def readImageFeatures(path):
    f = open(path, 'rb')
    while True:
        asin = f.read(10)
        if asin == b'':  # end of file (reads return bytes in python 3)
            break
        a = array.array('f')
        a.fromfile(f, 4096)
        yield asin.decode('ascii'), a.tolist()
Example: compute average rating
ratings = []
for review in parse("reviews_Video_Games.json.gz"):
    ratings.append(review['overall'])
print(sum(ratings) / len(ratings))
Example: latent-factor model in mymedialite
Predicts ratings from a rating-only CSV file
./rating_prediction --recommender=BiasedMatrixFactorization --training-file=ratings_Video_Games.csv --test-ratio=0.1
Goodreads Book Graph Datasets Overview
- 2,360,655 books (1,521,962 works, 400,390 book series, 829,529 authors)
- 876,145 users; 228,648,342 user-book interactions in users' shelves (include 112,131,203 reads and 104,551,549 ratings)
Latest News
- [May 2023] Our datasets have been moved! Please refer to this webpage on how to download the datasets. The previous Google drive links will be deprecated soon.
Code Samples
- Download datasets without GUI: download.ipynb
- Display sample records: samples.ipynb
- Calculate basic statistics: statistics.ipynb
- Explore the interaction data: distributions.ipynb
- Explore the review data: reviews.ipynb
- Mengting Wan, Julian McAuley, "Item Recommendation on Monotonic Behavior Chains", in RecSys'18.
- Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley, "Fine-Grained Spoiler Detection from Large-Scale Review Corpora", in ACL'19.
Meta-Data of Books
- Detailed book graph (~2gb, about 2.3m books): goodreads_books.json.gz
- Detailed information of authors: goodreads_book_authors.json.gz
- Detailed information of works (i.e., the abstract version of a book, regardless of any particular edition): goodreads_book_works.json.gz
- Detailed information of book series (Note: Unfortunately, the series id included here cannot be used for URL hack): goodreads_book_series.json.gz
- Extracted fuzzy book genres (genre tags are extracted from users' popular shelves by a simple keyword matching process): goodreads_book_genres_initial.json.gz
Book Shelves
- Complete user-book interactions in 'csv' format (~4.1gb): goodreads_interactions.csv. User IDs and book IDs in this file can be reconstructed by joining on the following two files: book_id_map.csv, user_id_map.csv.
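The join against the two ID map files can be sketched with plain dictionaries. The column names used here (user_id_csv, user_id, book_id_csv, book_id, and the interaction columns) are assumptions for illustration and should be checked against the headers of the downloaded files; the rows are made-up placeholders.

```python
import csv
import io

def load_id_map(handle, csv_col, original_col):
    """Build {id used in the csv -> original id} from one of the map files."""
    return {row[csv_col]: row[original_col] for row in csv.DictReader(handle)}

def restore_ids(interactions, user_map, book_map):
    """Yield interaction rows with the original user and book ids restored."""
    for row in csv.DictReader(interactions):
        row["user_id"] = user_map[row["user_id"]]
        row["book_id"] = book_map[row["book_id"]]
        yield row

# In-memory stand-ins for the three files (hypothetical contents).
user_file = io.StringIO("user_id_csv,user_id\n0,user-abc\n")
book_file = io.StringIO("book_id_csv,book_id\n7,12345\n")
inter_file = io.StringIO("user_id,book_id,is_read,rating\n0,7,1,5\n")

user_map = load_id_map(user_file, "user_id_csv", "user_id")
book_map = load_id_map(book_file, "book_id_csv", "book_id")
restored = list(restore_ids(inter_file, user_map, book_map))
print(restored)
```

For the full 4.1gb file, streaming rows as above keeps memory use bounded by the size of the two maps rather than the interaction file.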
- Detailed information of the complete user-book interactions (~11gb, ~229m records): goodreads_interactions_dedup.json.gz
- User- Book Club mapping information: book_clubs.json
Book Reviews
- Complete book reviews (~15m multilingual reviews about ~2m books and 465k users): goodreads_reviews_dedup.json.gz
- English review subset for spoiler detection (~1.3m book reviews about ~25k books and ~19k users, parsed at sentence-level ): goodreads_reviews_spoiler.json.gz
- English review subset for spoiler detection (~1.3m book reviews about ~25k books and ~19k users, raw texts ): goodreads_reviews_spoiler_raw.json.gz
- Books may overlap across different genres (i.e., one book may belong to multiple genres);
- The subgraph for each genre may not be self-contained. Those are subsets of the nodes on the complete book graph. Detailed information about authors, works, book series etc. can be found in the meta-data section.
- goodreads_books_children.json.gz
- goodreads_interactions_children.json.gz
- goodreads_reviews_children.json.gz
- goodreads_books_comics_graphic.json.gz
- goodreads_interactions_comics_graphic.json.gz
- goodreads_reviews_comics_graphic.json.gz
- goodreads_books_fantasy_paranormal.json.gz
- goodreads_interactions_fantasy_paranormal.json.gz
- goodreads_reviews_fantasy_paranormal.json.gz
- goodreads_books_history_biography.json.gz
- goodreads_interactions_history_biography.json.gz
- goodreads_reviews_history_biography.json.gz
- goodreads_books_mystery_thriller_crime.json.gz
- goodreads_interactions_mystery_thriller_crime.json.gz
- goodreads_reviews_mystery_thriller_crime.json.gz
- goodreads_books_poetry.json.gz
- goodreads_interactions_poetry.json.gz
- goodreads_reviews_poetry.json.gz
- goodreads_books_romance.json.gz
- goodreads_interactions_romance.json.gz
- goodreads_reviews_romance.json.gz
- goodreads_books_young_adult.json.gz
- goodreads_interactions_young_adult.json.gz
- goodreads_reviews_young_adult.json.gz
- mengting.wan at microsoft.com
- If you have any questions regarding these datasets, please create issues at our dataset Github repository.
Recommender Systems and Personalization Datasets
Julian McAuley , UCSD
Description
This page contains a collection of datasets that have been collected for research by our lab. Datasets contain the following features:
- user/item interactions
- star ratings
- product reviews
- social networks
- item-to-item relationships (e.g. copurchases, compatibility)
- product images
- price, brand, and category information
- heart-rate sequences
- other metadata
Please cite the appropriate reference if you use any of the datasets below.
Datasets are in (loose) json format unless specified otherwise, meaning they can be treated as python dictionary objects. A simple script to read json-formatted data is as follows:
import gzip

def parse(path):
    g = gzip.open(path, 'r')
    for l in g:
        yield eval(l)
Directory by Dataset
Twitch live-streaming interactions
NPR interview dialog data
This American Life podcast transcripts
Recipes and interactions from food.com
Paired Recipes from food.com
EndoMondo fitness tracking data
Amazon product reviews and metadata
Amazon question/answer data
Amazon marketing bias data
Google Local business reviews and metadata
Google Restaurants restaurant reviews and metadata
Steam video game reviews and bundles
Goodreads book reviews
Goodreads spoilers
Pinterest fashion compatibility data
ModCloth clothing fit feedback
ModCloth marketing bias data
RentTheRunway clothing fit feedback
Tradesy bartering data
RateBeer bartering data
Gameswap bartering data
Behance community art reviews and image features
Librarything reviews and social data
Epinions reviews and social data
Cant understanding data
Dance Dance Revolution step charts
NES song data
BeerAdvocate multi-aspect beer reviews
RateBeer multi-aspect beer reviews
Facebook social circles data
Twitter social circles data
Google+ social circles data
Reddit submission popularity and metadata
Directory by Metadata Type
The datasets below can be roughly organized in terms of the types of metadata they contain:
Review text: see Amazon , BeerAdvocate, RateBeer , Google Local , Google Restaurants
Image data: Amazon , Behance , Pinterest , Google Restaurants
Item-to-item relationships: Amazon
Q/A data: Amazon Q/A
Geographical data: Google Local , Google Restaurants , EndoMondo
Heart-Rate data: EndoMondo
Bundle data: Steam
Peer-to-peer trades: Tradesy, RateBeer, Gameswap
Social connections: Librarything, Epinions
Fit feedback: Modcloth, Renttherunway
Multiple aspects: BeerAdvocate, RateBeer
This is a dataset of users consuming streaming content on Twitch. We retrieved all streamers, and all users connected in their respective chats, every 10 minutes over a period of 43 days.
Basic statistics
- User ID (anonymized)
- Streamer username
1,34347669376,grimnax,5415,5419
1,34391109664,jtgtv,5869,5870
1,34395247264,towshun,5898,5899
1,34405646144,mithrain,6024,6025
2,33848559952,chfhdtpgus1,206,207
2,33881429664,sal_gu,519,524
2,33921292016,chfhdtpgus1,922,924
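Rows like the sample above can be parsed into named records as sketched below. Only the user ID and streamer username columns are described on this page; the remaining column names and the reading of the last two integers as start and stop times in 10-minute units (matching the 10-minute crawl interval) are assumptions made for this sketch.

```python
import csv
import io

# Assumed column names; only user ID and streamer username are documented here.
FIELDS = ["user_id", "stream_id", "streamer", "start", "stop"]

def read_sessions(handle):
    """Yield one dict per row, adding an estimated duration in minutes."""
    for row in csv.reader(handle, skipinitialspace=True):
        rec = dict(zip(FIELDS, row))
        rec["start"], rec["stop"] = int(rec["start"]), int(rec["stop"])
        # Assumption: start/stop count 10-minute crawl rounds.
        rec["minutes"] = (rec["stop"] - rec["start"]) * 10
        yield rec

sample = io.StringIO(
    "1,34347669376,grimnax,5415,5419\n"
    "2,33881429664,sal_gu,519,524\n"
)
sessions = list(read_sessions(sample))
print([(s["streamer"], s["minutes"]) for s in sessions])
```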
Download link
See our data folder containing all Twitch files. The file full_a.csv.gz contains the full dataset while 100k.csv is a subset of 100k users for benchmark purposes. The code is available in our Github repository .
Please cite the following if you use the data:
Recommendation on Live-Streaming Platforms: Dynamic Availability and Repeat Consumption Jérémie Rappaz, Julian McAuley and Karl Aberer RecSys , 2021
Interview: NPR Media Dialog Data
This dataset contains interview transcripts from National Public Radio (NPR) . Data includes full interview transcripts and news article headlines.
- Episode Date and Title
- Speaker Names
- Speaker Utterances
- News Article Headlines
episode: 79679
program: Talk of the Nation
title: Forecasting the Future of the Internet
date: 2006-05-26
episode_order: 48
speaker: Professor LARRY PETERSON (Princeton University)
utterance: And this is almost like the neutrality aspect of the issue, that there are places you just can't get to and the universal connectivity of the original Internet is deteriorating. Because of a lack of security built into the Internet your only recourse is to throw up all sorts of protections that are extremely suspicious of every bit of traffic that happens to fly by.
See the Interview Dataset Page for download information.
Interview: Large-scale Modeling of Media Dialog with Discourse Patterns and Knowledge Grounding Bodhisattwa Prasad Majumder*, Shuyang Li*, Jianmo Ni, Julian McAuley EMNLP , 2020 pdf
This American Life Podcast Transcripts
This dataset contains program transcripts from This American Life . Data includes full program transcripts and associated audio.
- Episode Act
- Utterance Lengths
- Episode Audio
episode: ep-1
act: prologue
utterance_start: 39.96
utterance_end: 54.89
duration: 14.93
speaker: ira glass
utterance: Well, one great thing about starting a new show is utter anonymity. Nobody really knows what to expect from you. This interviewee did not know us from Adam.
See the This American Life Dataset Page for download information.
Speech Recognition and Multi-Speaker Diarization of Long Conversations Huanru Henry Mao, Shuyang Li, Julian McAuley, Garrison W. Cottrell INTERSPEECH , 2020 pdf
Food.com Recipe & Review Data
These datasets contain recipe details and reviews from Food.com (formerly GeniusKitchen). Data includes cooking recipes and review texts.
- Ratings and Reviews
- Recipe Name, Description, Ingredients, and Directions
- Recipe Categories (Tags)
- Recipe Nutrition Information
name: beer mac n cheese soup
id: 499490
minutes: 45
contributor_id: 560491
submitted: 2013-04-27
tags: 60-minutes-or-less time-to-make preparation
nutrition: 678.8 70.0 20.0 46.0 61.0 134.0 11.0
n_steps: 7
steps: cook the bacon in a pan over medium heat and set aside on paper towels to drain , reserving 2 tablespoons of the grease in the pan add the onion , carrot , celery and jalapeno and cook until tender , about 10-15 minutes add the garlic and cook until fragrant , about a minute mix in the flour and let it cook for 2-3 minutes add the broth , beer , nutmeg , bacon and macaroni and let cook until the macaroni is al-dente , about 7-8 minutes add the cream , mustard , worcestershire sauce and cheese and cook until the cheese has melted without bringing it back to a boil season with cayenne , salt and pepper to taste
description: all of the flavors of mac n' cheese in the form of a hot bowl of soup! submitted by kevin lynch
ingredients: bacon onion carrots celery jalapeno pepper garlic cloves flour chicken broth beer nutmeg elbow macaroni heavy cream dijon mustard worcestershire sauce cheddar cheese cayenne salt and pepper
n_ingredients: 17

user_id: 8937
recipe_id: 44394
date: 2002-12-01
rating: 4
review: This worked very well and is EASY. I used not quite a whole package (10oz) of white chips. Great!
See the Food.com Dataset Page for download information.
Generating Personalized Recipes from Historical User Preferences Bodhisattwa Prasad Majumder*, Shuyang Li*, Jianmo Ni, Julian McAuley EMNLP , 2019 pdf
Recipe Pairs data
This is a collection of recipes paired with variants, e.g. a recipe matched with a vegan version of the same recipe.
See the Recipe Pairs Dataset Page for download information.
SHARE: a System for Hierarchical Assistive Recipe Editing Shuyang Li, Yufei Li, Jianmo Ni, Julian McAuley EMNLP , 2022 pdf
EndoMondo Fitness Tracking Data
This is a collection of workout logs from users of EndoMondo . Data includes multiple sources of sequential sensor data such as heart rate logs, speed, GPS, as well as sport type, gender and weather conditions.
- User Identifier
- Latitude/Longitude/Altitude sequences (with timestamps)
- Heart rates
- Various derived sequences
userId: 10921915
gender: male
sport: bike
id: 396826535
longitude: [24.64977040886879, 24.65014273300767, 24.650910682976246, 24.650668865069747, 24.649145286530256, ...]
latitude: [60.173348765820265, 60.173239801079035, 60.17298021353781, 60.172477969899774, 60.17186114564538, ...]
altitude: [-1.8044666444624418, -1.8190453555595787, -1.8190453555595787, -1.8511185199732794, -1.871528715509271, ...]
timestamp: [1408898746, 1408898754, 1408898765, 1408898778, 1408898794, ...]
time_elapsed: [-0.12256752559145224, -0.12221090169596584, -0.12172054383967204, -0.12114103000950663, -0.12042778221853381, ...]
heart_rate: [-8.197369036801112, -5.867841701016304, -3.961864789919643, -4.173640002263717, -3.961864789919643, ...]
derived_speed: [-7.0829444390064396, -2.8061928357004815, -0.3976286593020398, -0.7571073884764162, 2.6415189187026646, ...]
distance: [-4.372303649217691, -2.374952819539426, -0.07926348591212737, 0.4284751220389811, 4.710835498111755, ...]
tar_heart_rate: [100, 111, 120, 119, 120, ...]
tar_derived_speed: [0, 10.751376415573548, 16.806294372816662, 15.902596545765366, 24.446443398153843, ...]
since_begin: [1378478.8892184314, 1378478.8892184314, 1378478.8892184314, 1378478.8892184314, 1378478.8892184314, ...]
since_last: [2158.84607810351, 2158.84607810351, 2158.84607810351, 2158.84607810351, 2158.84607810351, ...]
See the FitRec Dataset Page for download information.
Modeling heart rate and activity data for personalized fitness recommendation Jianmo Ni, Larry Muhlstein, Julian McAuley WWW , 2019 pdf
Amazon Product Reviews
This is a large crawl of product reviews from Amazon. This dataset contains 82.83 million unique reviews, from around 20 million users.
- reviews and ratings
- item-to-item relationships (e.g. "people who bought X also bought Y")
- helpfulness votes
- product image (and CNN features)
{ "reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", "reviewerName": "J. McDonald", "helpful": [2, 3], "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!", "overall": 5.0, "summary": "Heavenly Highway Hymns", "unixReviewTime": 1252800000, "reviewTime": "09 13, 2009" }
See the Amazon Dataset Page for download information.
The 2014 version of this dataset is also available .
Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering R. He, J. McAuley WWW , 2016 pdf
Image-based recommendations on styles and substitutes J. McAuley, C. Targett, J. Shi, A. van den Hengel SIGIR , 2015 pdf
Amazon Question and Answer Data
These datasets contain questions and answers about products from the Amazon dataset above.
- question and answer text
- is the question binary (yes/no), and if so does it have a yes/no answer?
- product ID (to reference the review dataset)
{ "asin": "B000050B6Z", "questionType": "yes/no", "answerType": "Y", "answerTime": "Aug 8, 2014", "unixTime": 1407481200, "question": "Can you use this unit with GEL shaving cans?", "answer": "Yes. If the can fits in the machine it will despense hot gel lather. I've been using my machine for both , gel and traditional lather for over 10 years." }
See the Amazon Q/A Page for download information.
Modeling ambiguity, subjectivity, and diverging viewpoints in opinion question answering systems Mengting Wan, Julian McAuley International Conference on Data Mining (ICDM) , 2016 pdf
Addressing complex and subjective product-related queries with customer reviews Julian McAuley, Alex Yang World Wide Web (WWW) , 2016 pdf
Marketing Bias data
These datasets contain attributes about products sold on ModCloth and Amazon which may be sources of bias in recommendations (in particular, attributes about how the products are marketed). Data also includes user/item interactions for recommendation.
- user identities
- item sizes, user genders
Example (ModCloth)
item_id,user_id,rating,timestamp,size,fit,user_attr,model_attr,c...
7443,Alex,4,2010-01-21 08:00:00+00:00,,,Small,Small,Dresses,,2012,0
7443,carolyn.agan,3,2010-01-27 08:00:00+00:00,,,,Small,Dresses,,...
7443,Robyn,4,2010-01-29 08:00:00+00:00,,,Small,Small,Dresses,,20...
7443,De,4,2010-02-13 08:00:00+00:00,,,,Small,Dresses,,2012,0
7443,tasha,4,2010-02-18 08:00:00+00:00,,,Small,Small,Dresses,,20...
7443,gina.chihos,5,2010-02-25 08:00:00+00:00,,,,Small,Dresses,,2...
7443,Kim,2,2010-02-26 08:00:00+00:00,,,Small,Small,Dresses,,2012,0
7443,jess.betcher,5,2010-03-26 07:00:00+00:00,,,,Small,Dresses,,...
Download links
See our project page for download links.
Addressing Marketing Bias in Product Recommendations Mengting Wan, Jianmo Ni, Rishabh Misra, Julian McAuley WSDM , 2020 pdf
Google Local Reviews (2021)
This dataset contains review information from Google Maps (ratings, text, images, etc.), business metadata (address, geographic info, descriptions, category information, price, open hours, etc.), and links (related businesses) up to Sep 2021 in the United States.
{
  'user_id': '101463350189962023774',
  'name': 'Jordan Adams',
  'time': 1627750414677,
  'rating': 5,
  'text': 'Cool place, great people, awesome dentist!',
  'pics': [
    {
      'url': ['https://lh5.googleusercontent.com/p/AF1QipNq2nZC5TH4_M7h5xRAd61hoTgvY1o9lozABguI=w150-h150-k-no-p']
    }
  ],
  'resp': {
    'time': 1628455067818,
    'text': 'Thank you for your five-star review! -Dr. Blake'
  },
  'gmap_id': '0x87ec2394c2cd9d2d:0xd1119cfbee0da6f3'
}
- user_id - ID of the reviewer
- name - name of the reviewer
- time - time of the review (unix time)
- rating - rating of the business
- text - text of the review
- pics - pictures of the review
- resp - business response to the review including unix time and text of the response
- gmap_id - ID of the business
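Note that the 'time' values in the sample review above have 13 digits, which suggests they are Unix timestamps in milliseconds rather than seconds. Under that assumption, they can be converted as sketched below; the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def ms_to_datetime(ms):
    """Convert a millisecond Unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

def response_delay(review):
    """Return how long the business took to respond, or None if no response."""
    resp = review.get("resp")
    if not resp:
        return None
    return timedelta(milliseconds=resp["time"] - review["time"])

# Values taken from the sample review above.
review = {"time": 1627750414677, "resp": {"time": 1628455067818}}
posted = ms_to_datetime(review["time"])
delay = response_delay(review)
print(posted.date(), delay.days)
```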
{ 'name': 'Walgreens Pharmacy', 'address': 'Walgreens Pharmacy, 124 E North St, Kendallville, IN 46755', 'gmap_id': '0x881614ce7c13acbb:0x5c7b18bbf6ec4f7e', 'description': 'Department of the Walgreens chain providing prescription medications & other health-related items.', 'latitude': 41.451859999999996, 'longitude': -85.2666757, 'category': ['Pharmacy'], 'avg_rating': 4.2, 'num_of_reviews': 5, 'price': '$$', 'hours': [['Thursday', '8AM–1:30PM'], ['Friday', '8AM–1:30PM'], ['Saturday', '9AM–1:30PM'], ['Sunday', '10AM–1:30PM'], ['Monday', '8AM–1:30PM'], ['Tuesday', '8AM–1:30PM'], ['Wednesday', '8AM–1:30PM']], 'MISC': { 'Service options': ['Curbside pickup', 'Drive-through', 'In-store pickup', 'In-store shopping'], 'Health & safety': ['Mask required', 'Staff wear masks', 'Staff get temperature checks'], 'Accessibility': ['Wheelchair accessible entrance', 'Wheelchair accessible parking lot'], 'Planning': ['Quick visit'], 'Payments': ['Checks', 'Debit cards'] }, 'state': 'Closes soon ⋅ 1:30PM ⋅ Reopens 2PM', 'relative_results': ['0x881614cd49e4fa33:0x2d507c24ff4f1c74', '0x8816145bf5141c89:0x535c1d605109f94b', '0x881614cda24cc591:0xca426e3a9b826432', '0x88162894d98b91ef:0xd139b34de70d3e03', '0x881615400b5e57f9:0xc56d17dbe420a67f'], 'url': 'https://www.google.com/maps/place//data=!4m2!3m1!1s0x881614ce7c13acb b:0x5c7b18bbf6ec4f7e?authuser=-1&hl=en&gl=us' }
- name - name of the business
- address - address of the business
- description - description of the business
- latitude - latitude of the business
- longitude - longitude of the business
- category - category of the business
- avg_rating - average rating of the business
- num_of_reviews - number of reviews
- price - price of the business
- hours - open hours
- MISC - MISC information
- state - the current status of the business (e.g., permanently closed)
- relative_results - related businesses recommended by Google
- url - URL of the business
See the Google Local Dataset Page for download information.
UCTopic: Unsupervised Contrastive Learning for Phrase Representations and Topic Mining Jiacheng Li, Jingbo Shang, Julian McAuley Annual Meeting of the Association for Computational Linguistics (ACL) , 2022 pdf
Personalized Showcases: Generating Multi-Modal Explanations for Recommendations An Yan, Zhankui He, Jiacheng Li, Tianyang Zhang, Julian McAuley arXiv:2207.00422 , 2022 pdf
Google Local Reviews (2018)
These datasets contain reviews about businesses from Google Local (Google Maps). Data includes geographic information for each business as well as reviews.
- GPS coordinates and address
- User information (places lived, jobs)
- business category, opening hours, etc.
Example (review)
{ 'rating': 3.0, 'reviewerName': u'an lam', 'reviewText': u'Ch\u1ea5t l\u01b0\u1ee3ng t\u1ea1m \u1ed5n', 'categories': [u'Gi\u1ea3i Tr\xed - Caf\xe9'], 'gPlusPlaceId': u'108103314380004200232', 'unixReviewTime': 1372686659, 'reviewTime': u'Jul 1, 2013', 'gPlusUserId': u'100000010817154263736' }
Example (business)
{ 'name': u'Diamond Valley Lake Marina', 'price': None, 'address': [u'2615 Angler Ave', u'Hemet, CA 92545'], 'hours': [[u'Monday', [[u'6:30 am--4:15 pm']]], [u'Tuesday', [[u'6:30 am--4:15 pm']]], [u'Wednesday', [[u'6:30 am--4:15 pm']], 1], [u'Thursday', [[u'6:30 am--4:15 pm']]], [u'Friday', [[u'6:30 am--4:15 pm']]], [u'Saturday', [[u'6:30 am--4:15 pm']]], [u'Sunday', [[u'6:30 am--4:15 pm']]]], 'phone': u'(951) 926-7201', 'closed': False, 'gPlusPlaceId': '104699454385822125632', 'gps': [33.703804, -117.003209] }
Places Data (276mb)
User Data (178mb)
Review Data (1.4gb)
Translation-based factorization machines for sequential recommendation Rajiv Pasricha, Julian McAuley RecSys , 2018 pdf
Translation-based recommendation Ruining He, Wang-Cheng Kang, Julian McAuley RecSys , 2017 pdf
Google Restaurants
This is a multi-modal dataset of restaurants from Google Local (Google Maps). Data includes images and reviews posted by users, as well as other metadata for each restaurant.
- Geographical location and address
- Reviews, ratings and images
- Business category, opening status, price, etc.
"name":"The Fish Spot", "address":"5101 W Pico Blvd, Los Angeles, CA 90019", "Description":null, "Latitude":34.0481627, "Longitude":-118.3494339, "category":["Seafood restaurant"], "gmap_url":"https://www.google.com/maps/place/The+Fish+Spot/", "Avg_rating":4.3, "Num_of_reviews":80, "price":"$$", "Reviews": [ {"user_id":"111210125124533240892", "time":"3 years ago", "Rating":5, "text":"Absolutely love this place.", "pics":[ {"id":"AF1QipO1ejvRhkVBlg-v52UczxYMD7uebcZIhKC9uGud", "url":["https://lh5.googleusercontent.com/p/"]}, ], "link":"https://www.google.com/maps/reviews/"}, ...,]
See our data folder containing all related files. The file image_review_all.json contains the full dataset, while filter_all_t.json is a subset with filtered review sentences that have higher correlation with images. Code is available in our Github repository .
Steam Video Game and Bundle Data
These datasets contain reviews from the Steam video game platform, and information about which games were bundled together.
- purchases, plays, recommends ("likes")
- product bundles
- pricing information
Example (bundle)
{ 'bundle_final_price': '$29.66', 'bundle_url': 'http://store.steampowered.com/bundle/1482/?utm_source=SteamDB...', 'bundle_price': '$32.96', 'bundle_name': 'Two Tribes Complete Pack!', 'bundle_id': '1482', 'items': [{'genre': 'Casual, Indie', 'item_id': '38700', 'discounted_price': '$4.99', 'item_url': 'http://store.steampowered.com/app/38700', 'item_name': 'Toki Tori'}, {'genre': 'Adventure, Casual, Indie', 'item_id': '201420', 'discounted_price': '$14.99', 'item_url': 'http://store.steampowered.com/app/201420', 'item_name': 'Toki Tori 2+'}, {'genre': 'Strategy, Indie, Casual', 'item_id': '38720', 'discounted_price': '$4.99', 'item_url': 'http://store.steampowered.com/app/38720', 'item_name': 'RUSH'}, {'genre': 'Action, Indie', 'item_id': '38740', 'discounted_price': '$7.99', 'item_url': 'http://store.steampowered.com/app/38740', 'item_name': 'EDGE'}], 'bundle_discount': '10%' }
Version 1: Review Data (6.7mb)
Version 1: User and Item Data (71mb)
Version 2: Review Data (1.3gb)
Version 2: Item metadata (2.7mb)
Bundle Data (92kb)
Self-attentive sequential recommendation Wang-Cheng Kang, Julian McAuley ICDM , 2018 pdf
Item recommendation on monotonic behavior chains Mengting Wan, Julian McAuley RecSys , 2018 pdf
Generating and personalizing bundle recommendations on Steam Apurva Pathak, Kshitiz Gupta, Julian McAuley SIGIR , 2017 pdf
Goodreads Book Reviews
These datasets contain reviews from the Goodreads book review website, and a variety of attributes describing the items. Critically, these datasets have multiple levels of user interaction, ranging from adding a book to a "shelf", to rating it, to reading it.
- add-to-shelf, read, review actions
- book attributes: title, isbn
- graph of similar books
Example (interaction data)
{ "user_id": "8842281e1d1347389f2ab93d60773d4d", "book_id": "130580", "review_id": "330f9c153c8d3347eb914c06b89c94da", "isRead": true, "rating": 4, "date_added": "Mon Aug 01 13:41:57 -0700 2011", "date_updated": "Mon Aug 01 13:42:41 -0700 2011", "read_at": "Fri Jan 01 00:00:00 -0800 1988", "started_at": "" }
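The date fields in the interaction example are strings such as "Mon Aug 01 13:41:57 -0700 2011", and some (like `started_at`) can be empty. A minimal sketch of turning them into timezone-aware datetimes, assuming every non-empty value follows the pattern seen in the example; the format string and the helper `parse_goodreads_time` are inferred from that one record, not documented by the dataset.

```python
from datetime import datetime

# Format inferred from the example record (an assumption).
# Note: %a and %b expect English day/month names (default C locale).
GOODREADS_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def parse_goodreads_time(value):
    """Return a timezone-aware datetime, or None for an empty field."""
    if not value:
        return None
    return datetime.strptime(value, GOODREADS_TIME_FORMAT)

added = parse_goodreads_time("Mon Aug 01 13:41:57 -0700 2011")
updated = parse_goodreads_time("Mon Aug 01 13:42:41 -0700 2011")
started = parse_goodreads_time("")  # empty field, as in "started_at"

# Seconds between adding the book and the last update.
elapsed_seconds = (updated - added).total_seconds()
print(elapsed_seconds, started)
```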
Goodreads Spoilers
These datasets contain reviews from the Goodreads book review website, along with annotated "spoiler" information from each review.
- see also metadata from the complete Goodreads dataset
Example (spoiler data)
Sentences are annotated as "1" if the sentence contains a spoiler, "0" otherwise.
{ 'user_id': '01ec1a320ffded6b2dd47833f2c8e4fb', 'timestamp': '2013-12-28', 'review_sentences': [[0, 'First, be aware that this book is not for the faint of heart.'], [0, 'Human trafficking, drugs, kidnapping, abuse in all forms - this story contains all of this and more.'], ..., [0, '(ARC provided by the author in return for an honest review.)']], 'rating': 5, 'has_spoiler': False, 'book_id': '18398089', 'review_id': '4b3ffeaf14310ac6854f140188e191cd' }
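Since `review_sentences` is a list of `[label, sentence]` pairs, sentence-level spoiler statistics can be computed directly from a parsed record. The sketch below assumes the record has already been loaded into a Python dict; the sample review and both helper names are made up for illustration.

```python
def spoiler_sentences(review):
    """Return the text of every sentence labeled 1 (contains a spoiler)."""
    return [text for label, text in review['review_sentences'] if label == 1]

def spoiler_fraction(review):
    """Return the share of sentences labeled as spoilers (0.0 if none exist)."""
    sentences = review['review_sentences']
    if not sentences:
        return 0.0
    return sum(label for label, _ in sentences) / len(sentences)

# Illustrative record (not taken from the dataset).
example_review = {
    'has_spoiler': True,
    'review_sentences': [
        [0, 'I read this in one sitting.'],
        [1, 'The narrator turns out to be the culprit.'],
        [0, 'Recommended for mystery fans.'],
        [0, 'Four stars.'],
    ],
}

flagged = spoiler_sentences(example_review)
fraction = spoiler_fraction(example_review)
print(flagged, fraction)
```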
Fine-grained spoiler detection from large-scale review corpora Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley ACL , 2019 pdf
Pinterest Fashion Compatibility
This dataset contains images (scenes) containing fashion products, which are labeled with bounding boxes and links to the corresponding products.
- product IDs
- bounding boxes
Example (fashion.json)
{ "product": "0027e30879ce3d87f82f699f148bff7e", "scene": "cdab9160072dd1800038227960ff6467", "bbox": [ 0.434097, 0.859363, 0.560254, 1.0 ] }
See our project page for download links, and for instructions as to how the product images can be collected from Pinterest.
Complete the Look: Scene-based complementary product recommendation Wang-Cheng Kang, Eric Kim, Jure Leskovec, Charles Rosenberg, Julian McAuley CVPR , 2019 pdf
Clothing Fit Data
These datasets contain measurements of clothing fit from ModCloth and RentTheRunway .
- ratings and reviews
- fit feedback (small/fit/large etc.)
- user/item measurements
- category information
Example (RentTheRunway)
{ "fit": "fit", "user_id": "420272", "bust size": "34d", "item_id": "2260466", "weight": "137lbs", "rating": "10", "rented for": "vacation", "review_text": "An adorable romper! Belt and zipper were a little hard to navigate in a full day of wear/bathroom use, but that's to be expected. Wish it had pockets, but other than that-- absolutely perfect! I got a million compliments.", "body type": "hourglass", "review_summary": "So many compliments!", "category": "romper", "height": "5' 8\"", "size": 14, "age": "28", "review_date": "April 20, 2016" }
Modcloth (8.5mb)
Renttherunway (31mb)
Decomposing fit semantics for product size recommendation in metric spaces Rishabh Misra, Mengting Wan, Julian McAuley RecSys , 2018 pdf
Product Exchange/Bartering Data
These datasets contain peer-to-peer trades from various recommendation platforms.
- peer-to-peer trades
- "have" and "want" lists
- image data (tradesy)
Example (tradesy)
{ 'lists': { 'bought': ['466', '459', '457', '449'], 'selling': [], 'want': [], 'sold': ['104', '103', '102'] }, 'uid': '2' }
Tradesy (3.8mb)
See the project page for ratebeer, gameswap (and other) datasets
Bartering books to beers: A recommender system for exchange platforms Jérémie Rappaz, Maria-Luiza Vladarean, Julian McAuley, Michele Catasta WSDM , 2017 pdf
VBPR: Visual bayesian personalized ranking from implicit feedback Ruining He, Julian McAuley AAAI , 2016 pdf
Behance Community Art Data
Likes and image data from the community art website Behance. This is a small, anonymized, version of a larger proprietary dataset.
- appreciates (likes)
- extracted image features
Example ("appreciate" data)
Each entry is a user, item, timestamp triple:
276633 01588231 1307583271
1238354 01529213 1307583273
165550 00485000 1307583337
2173258 00776972 1307583340
165550 00158226 1307583406
1238354 01540285 1307583495
2459267 01578261 1307583509
165550 00264669 1307583518
165550 00171501 1307583536
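The triples above can be grouped into one chronological history per user. This is a minimal sketch, assuming one whitespace-separated triple per line; item IDs are kept as strings because they carry leading zeros. The function name `parse_appreciates` is illustrative.

```python
from collections import defaultdict

def parse_appreciates(lines):
    """Group (user, item, timestamp) triples into per-user histories."""
    per_user = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        user, item, timestamp = parts
        # Keep the item ID as a string so leading zeros are preserved.
        per_user[user].append((int(timestamp), item))
    for events in per_user.values():
        events.sort()  # order each history by timestamp
    return dict(per_user)

sample = """276633 01588231 1307583271
1238354 01529213 1307583273
165550 00485000 1307583337
2173258 00776972 1307583340
165550 00158226 1307583406
1238354 01540285 1307583495
2459267 01578261 1307583509
165550 00264669 1307583518
165550 00171501 1307583536"""

history = parse_appreciates(sample.splitlines())
print(len(history), history['165550'])
```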
Code to read image features
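import struct

def readImageFeatures(path):
    # Each record is an 8-byte item ID followed by 4096 4-byte floats.
    with open(path, 'rb') as f:
        while True:
            itemId = f.read(8)
            if itemId == b'':  # end of file; compare against bytes in Python 3
                break
            feature = struct.unpack('f' * 4096, f.read(4 * 4096))
            yield itemId, feature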
See our data folder containing all Behance files. The folder also contains additional documentation.
Vista: A visually, socially, and temporally-aware model for artistic recommendation Ruining He, Chen Fang, Zhaowen Wang, Julian McAuley RecSys , 2016 pdf
Social Recommendation Data
These datasets include ratings as well as social (or trust) relationships between users. Data are from LibraryThing (a book review website) and epinions (general consumer reviews).
- price paid (epinions)
- helpfulness votes (librarything)
- flags (librarything)
Example (LibraryThing reviews)
{ 'work': '3067', 'flags': [], 'unixtime': 1160265600, 'stars': 4.5, 'nhelpful': 0, 'time': 'Oct 8, 2006', 'comment': 'great storytelling in this novel about a couple crossed by a time travelling disorder ', 'user': 'justine' }
Example (LibraryThing social network)
Rodo anehan
Rodo sevilemar
Rodo dingsi
Rodo slash
RelaxedReader AnnRig
RelaxedReader bookbroke
RelaxedReader Bumpersmom
RelaxedReader DivaColumbus
RelaxedReader AnnRig
RelaxedReader bookbroke
RelaxedReader BookWorm2729
RelaxedReader Bumpersmom
LibraryThing (594mb)
epinions (66mb)
SPMC: Socially-aware personalized Markov chains for sparse sequential recommendation Chenwei Cai, Ruining He, Julian McAuley IJCAI , 2017 pdf
Improving latent factor models via personalized feature projection for one-class recommendation Tong Zhao, Julian McAuley, Irwin King Conference on Information and Knowledge Management (CIKM) , 2015 pdf
Other Non-Recommender-Systems Datasets
Below are various datasets collected by my lab that are not related to recommender systems specifically. Formats of these datasets vary, so their respective project pages should be consulted for further details.
DogWhistle: Cant Understanding Data
DogWhistle is a Chinese dataset collected from the historical records for an online game. It provides hidden words and the cant for them, with human answers. The dataset is suitable for semantic similarity evaluation for large language models.
- cant and the hidden words
- cant history
- human answers
Example (insider subtask)
0 高铁,周末,无情,条纹 冷漠,休息,斑马 冷漠 2
1 高铁,周末,无情,条纹 冷漠,休息,斑马 休息 1
2 高铁,周末,无情,条纹 冷漠,休息,斑马 斑马 3
Please refer to our leaderboard page for download instructions.
Blow the Dog Whistle: A Chinese Dataset for Cant Understanding with Common Sense and World Knowledge Canwen Xu, Wangchunshu Zhou, Tao Ge, Ke Xu, Julian McAuley, Furu Wei NAACL , 2021 pdf
Video Game Data
Step charts from the video game Dance Dance Revolution , and audio files from the NES platform.
See the project pages for Dance Dance Convolution and NES MDB for further details and links to the data
Dance Dance Convolution Chris Donahue, Zachary Lipton, Julian McAuley ICML , 2017 pdf
The NES Music Database: A symbolic music dataset with expressive performance attributes Chris Donahue, Henry Mao, Julian McAuley International Society for Music Information Retrieval Conference (ISMIR) , 2018 pdf
Multi-aspect Reviews
These datasets include reviews with multiple rated dimensions. The most comprehensive of these are beer review datasets from Ratebeer and Beeradvocate, which include sensory aspects such as taste, look, feel, and smell.
- aspect-specific ratings (taste, look, feel, smell, overall impression)
- product category
Example (ratebeer)
beer/name: John Harvards Simcoe IPA
beer/beerId: 63836
beer/brewerId: 8481
beer/ABV: 5.4
beer/style: India Pale Ale (IPA)
review/appearance: 4/5
review/aroma: 6/10
review/palate: 3/5
review/taste: 6/10
review/overall: 13/20
review/time: 1157587200
review/profileName: hopdog
review/text: On tap at the Springfield, PA location. Poured a deep and cloudy orange (almost a copper) color with a small sized off white head. Aromas or oranges and all around citric. Tastes of oranges, light caramel and a very light grapefruit finish. I too would not believe the 80+ IBUs - I found this one to have a very light bitterness with a medium sweetness to it. Light lacing left on the glass.
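In the RateBeer example the aspect ratings use different scales (out of 5, 10 and 20), so they need rescaling before they can be compared. A minimal sketch that parses the `key: value` fields and maps each aspect rating to the range 0 to 1, assuming one field per line in the file; the helper names and the `ASPECT_KEYS` tuple are illustrative.

```python
ASPECT_KEYS = ('review/appearance', 'review/aroma', 'review/palate',
               'review/taste', 'review/overall')

def parse_fields(lines):
    """Turn 'key: value' lines into a dict (split on the first ': ')."""
    entry = {}
    for line in lines:
        key, sep, value = line.partition(': ')
        if sep:
            entry[key.strip()] = value.strip()
    return entry

def normalized_ratings(entry):
    """Convert ratings such as '13/20' to fractions between 0 and 1."""
    scores = {}
    for key in ASPECT_KEYS:
        if key in entry:
            points, _, scale = entry[key].partition('/')
            scores[key] = float(points) / float(scale)
    return scores

sample = [
    'beer/name: John Harvards Simcoe IPA',
    'review/appearance: 4/5',
    'review/aroma: 6/10',
    'review/palate: 3/5',
    'review/taste: 6/10',
    'review/overall: 13/20',
    'review/text: On tap at the Springfield, PA location.',
]

entry = parse_fields(sample)
scores = normalized_ratings(entry)
print(scores)
```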
BeerAdvocate (433mb)
RateBeer (388mb)
Sentences with aspect labels (annotator 1) (758kb)
Sentences with aspect labels (annotator 2) (759kb)
Learning attitudes and attributes from multi-aspect reviews Julian McAuley, Jure Leskovec, Dan Jurafsky International Conference on Data Mining (ICDM) , 2012 pdf
From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews Julian McAuley, Jure Leskovec WWW , 2013 pdf
Social Circles
These datasets contain social connections and "circles" from Facebook, Twitter, and Google Plus.
- social connections
- circles (sets of friends sharing a common property)
- user metadata
Example (Kaggle egonet data)
UserId: Friends
1: 4 6 12 2 208
2: 5 3 17 90 7
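The egonet listing above can be loaded into a mapping from each user to a list of friends. A minimal sketch, assuming one `user: friend friend ...` row per line after the header; `parse_egonets` is an illustrative name.

```python
def parse_egonets(lines):
    """Map each user ID to the list of friend IDs on the same row."""
    friends = {}
    for line in lines:
        user, sep, rest = line.partition(':')
        user = user.strip()
        if not sep or not user.isdigit():
            continue  # skips the 'UserId: Friends' header and blank lines
        friends[int(user)] = [int(token) for token in rest.split()]
    return friends

sample = ['UserId: Friends', '1: 4 6 12 2 208', '2: 5 3 17 90 7']
friends = parse_egonets(sample)
print(friends)
```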
See SNAP facebook , twitter , and Google Plus data, as well as the Kaggle competition based on the same data.
Learning to Discover Social Circles in Ego Networks Julian McAuley, Jure Leskovec Neural Information Processing Systems (NIPS) , 2012 pdf
Reddit Submissions
Submissions of reddit posts (and in particular resubmissions of the same content) along with metadata.
- upvotes/downvotes
- post title, subreddit, etc.
#image_id,unixtime,rawtime,title,total_votes,reddit_id,... number_of_downvotes,localtime,score,number_of_comments,username
1005,1335861624,2012-05-01T15:40:24.968266-07:00,I immediately regret this decision,27,t296r,20,pics,7,1335886824,13,0,ninjaroflmaster
1005,1336470481,2012-05-08T16:48:01.418140-07:00,"Pushing your friend into the water,Level: 99",18,tds4i,16,funny,2,1336495681,14,0,hme4
1005,1339566752,2012-06-13T12:52:32.371941-07:00,I told him. He Didn't Listen,6,v0cma,4,funny,2,1339591952,2,0,HeyPatWhatsUp
1005,1342200476,2012-07-14T00:27:56.857805-07:00,Don't end up as this guy.,16,wjivx,7,funny,9,1342225676,-2,2,catalyst24
resubmissions data (7.3mb)
raw html of resubmissions (1.8gb)
See also the SNAP project page .
Understanding the interplay between titles, content, and communities in social media Himabindu Lakkaraju, Julian McAuley, Jure Leskovec ICWSM , 2013 pdf
Questions and comments to Julian McAuley
amazon_us_reviews
- Description :
Amazon Customer Reviews (a.k.a. Product Reviews) is one of Amazon's iconic products. In a period of over two decades since the first review in 1995, millions of Amazon customers have contributed over a hundred million reviews to express opinions and describe their experiences regarding products on the Amazon.com website. This makes Amazon Customer Reviews a rich source of information for academic researchers in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and Machine Learning (ML), amongst others. Accordingly, we are releasing this data to further research in multiple disciplines related to understanding customer product experiences. Specifically, this dataset was constructed to represent a sample of customer evaluations and opinions, variation in the perception of a product across geographical regions, and promotional intent or bias in reviews.
Over 130 million customer reviews are available to researchers as part of this release. The data is available in TSV files in the amazon-reviews-pds S3 bucket in AWS US East Region. Each line in the data files corresponds to an individual review (tab delimited, with no quote and escape characters).
Each dataset contains the following columns:
- marketplace - 2 letter country code of the marketplace where the review was written.
- customer_id - Random identifier that can be used to aggregate reviews written by a single author.
- review_id - The unique ID of the review.
- product_id - The unique Product ID the review pertains to. In the multilingual dataset the reviews for the same product in different countries can be grouped by the same product_id.
- product_parent - Random identifier that can be used to aggregate reviews for the same product.
- product_title - Title of the product.
- product_category - Broad product category that can be used to group reviews (also used to group the dataset into coherent parts).
- star_rating - The 1-5 star rating of the review.
- helpful_votes - Number of helpful votes.
- total_votes - Number of total votes the review received.
- vine - Review was written as part of the Vine program.
- verified_purchase - The review is on a verified purchase.
- review_headline - The title of the review.
- review_body - The review text.
- review_date - The date the review was written.
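Because the files are tab delimited with no quote or escape characters, a reader should switch off quote handling so that a stray double quote in a review body is kept as ordinary text. A minimal sketch using the standard `csv` module, assuming the first line of each file is a header row with the column names; the two sample rows and the helper `read_reviews` are invented for illustration.

```python
import csv
import io

def read_reviews(handle):
    """Yield one dict per review from a tab-delimited file with a header row."""
    # QUOTE_NONE: treat double quotes as ordinary characters, matching the
    # "no quote and escape characters" description of the files.
    reader = csv.DictReader(handle, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row

# Illustrative sample with a subset of the columns (not real data).
sample = (
    'review_id\tstar_rating\treview_body\n'
    'R1\t5\t5" screen protector fits well\n'
    'R2\t2\tStopped working after a week\n'
)

rows = list(read_reviews(io.StringIO(sample)))
ratings = [int(row['star_rating']) for row in rows]
print(rows[0]['review_body'], ratings)
```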
Homepage : https://s3.amazonaws.com/amazon-reviews-pds/readme.html
Source code : tfds.datasets.amazon_us_reviews.Builder
- 0.1.0 (default): No release notes.
Supervised keys (See as_supervised doc ): None
Figure ( tfds.show_examples ): Not supported.
amazon_us_reviews/Wireless_v1_00 (default config)
Config description : A dataset consisting of reviews of Amazon Wireless_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 1.59 GiB
Dataset size : 7.21 GiB
Auto-cached ( documentation ): No
amazon_us_reviews/Watches_v1_00
Config description : A dataset consisting of reviews of Amazon Watches_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 155.42 MiB
Dataset size : 753.08 MiB
amazon_us_reviews/Video_Games_v1_00
Config description : A dataset consisting of reviews of Amazon Video_Games_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 453.19 MiB
Dataset size : 1.78 GiB
amazon_us_reviews/Video_DVD_v1_00
Config description : A dataset consisting of reviews of Amazon Video_DVD_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 1.41 GiB
Dataset size : 5.31 GiB

amazon_us_reviews/Video_v1_00
Config description : A dataset consisting of reviews of Amazon Video_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 132.49 MiB
Dataset size : 465.08 MiB
amazon_us_reviews/Toys_v1_00
Config description : A dataset consisting of reviews of Amazon Toys_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 799.61 MiB
Dataset size : 3.61 GiB
amazon_us_reviews/Tools_v1_00
Config description : A dataset consisting of reviews of Amazon Tools_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 318.32 MiB
Dataset size : 1.37 GiB
amazon_us_reviews/Sports_v1_00
Config description : A dataset consisting of reviews of Amazon Sports_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 832.06 MiB
Dataset size : 3.64 GiB
amazon_us_reviews/Software_v1_00
Config description : A dataset consisting of reviews of Amazon Software_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 89.66 MiB
Dataset size : 366.16 MiB
amazon_us_reviews/Shoes_v1_00
Config description : A dataset consisting of reviews of Amazon Shoes_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 612.50 MiB
Dataset size : 3.06 GiB
amazon_us_reviews/Pet_Products_v1_00
Config description : A dataset consisting of reviews of Amazon Pet_Products_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 491.92 MiB
Dataset size : 2.11 GiB
amazon_us_reviews/Personal_Care_Appliances_v1_00
Config description : A dataset consisting of reviews of Amazon Personal_Care_Appliances_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 16.82 MiB
Dataset size : 75.03 MiB
Auto-cached ( documentation ): Yes
amazon_us_reviews/PC_v1_00
Config description : A dataset consisting of reviews of Amazon PC_v1_00 products in US marketplace. Each product has its own version as specified with it.
Dataset size : 5.93 GiB
amazon_us_reviews/Outdoors_v1_00
Config description : A dataset consisting of reviews of Amazon Outdoors_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 428.16 MiB
Dataset size : 1.83 GiB
amazon_us_reviews/Office_Products_v1_00
Config description : A dataset consisting of reviews of Amazon Office_Products_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 488.59 MiB
Dataset size : 2.12 GiB
amazon_us_reviews/Musical_Instruments_v1_00
Config description : A dataset consisting of reviews of Amazon Musical_Instruments_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 184.43 MiB
Dataset size : 792.16 MiB
amazon_us_reviews/Music_v1_00
Config description : A dataset consisting of reviews of Amazon Music_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 1.42 GiB
Dataset size : 5.16 GiB
amazon_us_reviews/Mobile_Electronics_v1_00
Config description : A dataset consisting of reviews of Amazon Mobile_Electronics_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 21.81 MiB
Dataset size : 94.97 MiB
amazon_us_reviews/Mobile_Apps_v1_00
Config description : A dataset consisting of reviews of Amazon Mobile_Apps_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 532.11 MiB
Dataset size : 3.13 GiB
amazon_us_reviews/Major_Appliances_v1_00
Config description : A dataset consisting of reviews of Amazon Major_Appliances_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 23.23 MiB
Dataset size : 96.36 MiB
amazon_us_reviews/Luggage_v1_00
Config description : A dataset consisting of reviews of Amazon Luggage_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 57.53 MiB
Dataset size : 274.07 MiB
amazon_us_reviews/Lawn_and_Garden_v1_00
Config description : A dataset consisting of reviews of Amazon Lawn_and_Garden_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 464.22 MiB
Dataset size : 2.00 GiB
amazon_us_reviews/Kitchen_v1_00
Config description : A dataset consisting of reviews of Amazon Kitchen_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 887.63 MiB
Dataset size : 3.85 GiB
amazon_us_reviews/Jewelry_v1_00
Config description : A dataset consisting of reviews of Amazon Jewelry_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 235.58 MiB
Dataset size : 1.22 GiB
amazon_us_reviews/Home_Improvement_v1_00
Config description : A dataset consisting of reviews of Amazon Home_Improvement_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 480.02 MiB
Dataset size : 2.08 GiB
amazon_us_reviews/Home_Entertainment_v1_00
Config description : A dataset consisting of reviews of Amazon Home_Entertainment_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 184.22 MiB
Dataset size : 741.78 MiB
amazon_us_reviews/Home_v1_00
Config description : A dataset consisting of reviews of Amazon Home_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 1.01 GiB
Dataset size : 4.60 GiB
amazon_us_reviews/Health_Personal_Care_v1_00
Config description : A dataset consisting of reviews of Amazon Health_Personal_Care_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 964.34 MiB
Dataset size : 4.21 GiB
amazon_us_reviews/Grocery_v1_00
Config description : A dataset consisting of reviews of Amazon Grocery_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 382.74 MiB
Dataset size : 1.77 GiB
amazon_us_reviews/Gift_Card_v1_00
Config description : A dataset consisting of reviews of Amazon Gift_Card_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 11.57 MiB
Dataset size : 93.82 MiB
amazon_us_reviews/Furniture_v1_00
Config description : A dataset consisting of reviews of Amazon Furniture_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 142.08 MiB
Dataset size : 646.69 MiB
amazon_us_reviews/Electronics_v1_00
Config description : A dataset consisting of reviews of Amazon Electronics_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 666.45 MiB
Dataset size : 2.74 GiB
amazon_us_reviews/Digital_Video_Games_v1_00
Config description : A dataset consisting of reviews of Amazon Digital_Video_Games_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 26.17 MiB
Dataset size : 124.19 MiB
amazon_us_reviews/Digital_Video_Download_v1_00
Config description : A dataset consisting of reviews of Amazon Digital_Video_Download_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 483.49 MiB
Dataset size : 2.68 GiB
amazon_us_reviews/Digital_Software_v1_00
Config description : A dataset consisting of reviews of Amazon Digital_Software_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 18.12 MiB
Dataset size : 89.59 MiB
amazon_us_reviews/Digital_Music_Purchase_v1_00
Config description : A dataset consisting of reviews of Amazon Digital_Music_Purchase_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 241.82 MiB
Dataset size : 1.20 GiB
amazon_us_reviews/Digital_Ebook_Purchase_v1_00
Config description : A dataset consisting of reviews of Amazon Digital_Ebook_Purchase_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 2.51 GiB
Dataset size : 10.82 GiB
amazon_us_reviews/Camera_v1_00
Config description : A dataset consisting of reviews of Amazon Camera_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 422.15 MiB
Dataset size : 1.69 GiB
amazon_us_reviews/Books_v1_00
Config description : A dataset consisting of reviews of Amazon Books_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 2.55 GiB
Dataset size : 10.01 GiB
amazon_us_reviews/Beauty_v1_00
Config description : A dataset consisting of reviews of Amazon Beauty_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 871.73 MiB
Dataset size : 3.88 GiB
amazon_us_reviews/Baby_v1_00
Config description : A dataset consisting of reviews of Amazon Baby_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 340.84 MiB
Dataset size : 1.45 GiB
amazon_us_reviews/Automotive_v1_00
Config description : A dataset consisting of reviews of Amazon Automotive_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 555.18 MiB
Dataset size : 2.54 GiB
amazon_us_reviews/Apparel_v1_00
Config description : A dataset consisting of reviews of Amazon Apparel_v1_00 products in US marketplace. Each product has its own version as specified with it.
Download size : 618.59 MiB
Dataset size : 3.99 GiB
amazon_us_reviews/Digital_Ebook_Purchase_v1_01
Config description : A dataset consisting of reviews of Amazon Digital_Ebook_Purchase_v1_01 products in US marketplace. Each product has its own version as specified with it.
Download size : 1.21 GiB
Dataset size : 4.87 GiB
amazon_us_reviews/Books_v1_01
Config description : A dataset consisting of reviews of Amazon Books_v1_01 products in US marketplace. Each product has its own version as specified with it.
Dataset size : 8.48 GiB
amazon_us_reviews/Books_v1_02
Config description : A dataset consisting of reviews of Amazon Books_v1_02 products in US marketplace. Each product has its own version as specified with it.
Download size : 1.24 GiB
Dataset size : 4.15 GiB
Last updated 2022-12-06 UTC.

Web data: Amazon reviews
Dataset information.
This dataset consists of reviews from Amazon. The data span a period of 18 years, including ~35 million reviews up to March 2013. Reviews include product and user information, ratings, and a plaintext review. Note: this dataset contains potential duplicates, due to products whose reviews Amazon merges. A file has been added below (possible_dupes.txt.gz) to help identify products that are potentially duplicates of each other.
Note: A new-and-improved Amazon dataset is available here , which corrects the above duplication issues, and also contains more complete data/metadata.
Source (citation)
- J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text . RecSys, 2013.
Data format
- product/productId : asin , e.g. amazon.com/dp/B00006HAXW
- product/title : title of the product
- product/price : price of the product
- review/userId : id of the user, e.g. A1RSDE90N6RSZF
- review/profileName : name of the user
- review/helpfulness : fraction of users who found the review helpful
- review/score : rating of the product
- review/time : time of the review (unix time)
- review/summary : review summary
How to parse (in Python)
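The original parsing snippet is not reproduced here. As a stand-in, the following is a minimal sketch of a parser for the field layout listed above, assuming each review is stored as consecutive `field: value` lines and that reviews are separated by blank lines (an assumption about the file layout). The sample values other than the two IDs quoted above are invented.

```python
def parse(lines):
    """Yield one dict per review from 'field: value' lines.

    Assumption: a blank line marks the end of a review.
    """
    entry = {}
    for line in lines:
        line = line.strip()
        if not line:
            if entry:
                yield entry
                entry = {}
            continue
        key, sep, value = line.partition(':')
        if sep:
            entry[key.strip()] = value.strip()
    if entry:
        yield entry  # last review when the file has no trailing blank line

# Illustrative input; for a gzipped file one could pass
# gzip.open(path, 'rt', encoding='utf-8', errors='replace') instead.
sample = """product/productId: B00006HAXW
review/userId: A1RSDE90N6RSZF
review/score: 5.0

product/productId: B00006HAXW
review/userId: A2EXAMPLEUSER
review/score: 3.0
"""

reviews = list(parse(sample.splitlines()))
print(len(reviews), reviews[0]['review/userId'])
```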
Cross-Market Recommendation (XMRec)
Cross-Market Recommendation Competition @ WSDM 2022
Cross-Market Product Recommendation @ CIKM 2021
Workshop of Cross-Market Recommendation @ RecSys 2021
XMarket Dataset
Description.
Here, we release XMarket, a large dataset covering 18 local markets on 16 different product categories, featuring 52.5 million user-item interactions. For more information on the XMarket dataset, please refer to our CIKM’21 paper .
Below is the list of our markets and their data. For every market below, you can click and see the list of categories as well as #user, #item, and #ratings we collected. For every market, you can download the ratings , reviews , and metadata associated with asins in this category. Please see below for how to read each type of these files and their file formats. Please reach out if you encounter any problem with these files provided below. The total size of gzipped data is ~7.5 GB.
- United Arab Emirates (ae)
- Australia (au)
- Brazil (br)
- Canada (ca)
- Germany (de)
- France (fr)
- Mexico (mx)
- Netherlands (nl)
- Saudi Arabia (sa)
- Singapore (sg)
- Turkey (tr)
- United Kingdom (uk)
- United States (us)
Data Samples (and python reading code examples)
Ratings files use a simple format that lists userId itemId rating date for each rating, so they are easy to read into a dataframe.
After reading, you can see a dataframe similar to below (taken from uk market).
- userId - ID of the reviewer
- itemId - asin or the ID of the product, e.g. B014RDFCFI
- rate - rating of the product in the range of 1-5
- date - date of the review in the format of YYYY-MM-DD
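The ratings format described above can be read with the standard library alone. This is a minimal sketch, assuming the four fields are separated by whitespace and that an optional header line starts with `userId`; the actual delimiter is not stated here, so adjust the split accordingly. The user IDs in the sample are made up.

```python
from collections import namedtuple
from datetime import date

Rating = namedtuple('Rating', ['userId', 'itemId', 'rating', 'date'])

def read_ratings(lines):
    """Yield Rating tuples from 'userId itemId rating date' lines."""
    for line in lines:
        parts = line.split()  # assumption: whitespace-delimited fields
        if len(parts) != 4 or parts[0] == 'userId':
            continue  # skip blank lines and an optional header
        user_id, item_id, rating, day = parts
        yield Rating(user_id, item_id, float(rating), date.fromisoformat(day))

# Illustrative sample (user IDs are invented).
sample = [
    'userId itemId rating date',
    'U1 B014RDFCFI 5.0 2019-03-14',
    'U2 B014RDFCFI 3.0 2020-01-02',
]

ratings = list(read_ratings(sample))
print(ratings[0])
```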
Review files provide a list of JSON objects, each holding one customer review for a given product. To read these files, go through them line by line and parse each line as the JSON dictionary of a single review.
Below is the output of the sample line of the review file we read above.
- reviewerID - ID of the reviewer equivalent to userId of ratings
- asin - ID of the product equivalent to the itemId of ratings, e.g. B014RDFCFI
- reviewerName - name of the reviewer
- reviewText - text of the review
- overall - rating of the product in the range of 1-5
- summary - summary of the review
- cleanReviewTime - date of the review in the format of YYYY-MM-DD
- reviewTime - original review time posted along with the review in the local market calendar
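Reading the review files line by line, as described above, might look like the sketch below. It assumes one JSON object per line and uses the `asin` and `overall` fields from the list above to compute a mean rating per product; the sample reviews and both helper names are illustrative.

```python
import json
from collections import defaultdict

def read_json_lines(lines):
    """Yield one dict per non-empty line, each line being a JSON object."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

def mean_rating_per_product(reviews):
    """Average the 'overall' rating for each 'asin'."""
    totals = defaultdict(lambda: [0.0, 0])
    for review in reviews:
        bucket = totals[review['asin']]
        bucket[0] += float(review['overall'])
        bucket[1] += 1
    return {asin: total / count for asin, (total, count) in totals.items()}

# Illustrative sample lines (reviewer IDs and texts are invented).
sample = [
    json.dumps({'reviewerID': 'U1', 'asin': 'B014RDFCFI', 'overall': 5,
                'reviewText': 'Works well', 'cleanReviewTime': '2019-03-14'}),
    json.dumps({'reviewerID': 'U2', 'asin': 'B014RDFCFI', 'overall': 4,
                'reviewText': 'Good value', 'cleanReviewTime': '2020-01-02'}),
]

parsed_reviews = list(read_json_lines(sample))
means = mean_rating_per_product(parsed_reviews)
print(means)
```

The same line-by-line reader should also apply to the metadata files if they follow the same one-object-per-line layout.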
Metadata includes product descriptions, price, sales-rank, brand info, and co-purchasing links. Metadata files can be read line by line in the same way as the review files.
- asin - ID of the product, e.g. see B000023VW2
- title - name of the product
- averageRating - the average rating of the product at the time the data was collected, a float in [1-5]
- ratingCount - number of users who rated this product
- amazon_badge - if there is any badge associated with this product
- ratingDist - the distribution of each rating value
- related - related products (also bought, also viewed, bought together, compared, and sponsored) are listed
- productDetails - a variety of information related to the product, including brand, category, descriptions, features, etc. are all provided in this dictionary. See the product page for further information on each of these fields provided with our data.
If you use this dataset, please refer to our CIKM’21 paper :
FOREC's data cleaning and code are provided in this repository .
Goodreads Datasets
NOTE: Our datasets have been moved!!
Please see our new webpage about how to download these datasets. This Google site along with the download links in our previous Google Drive will be deprecated soon.
====================================
The datasets were collected in late 2017 from goodreads.com , where we only scraped users' public shelves, i.e. shelves that anyone can see on the web without logging in. User IDs and review IDs are anonymized.
We collected these datasets for academic use only. Please do not redistribute them or use for commercial purposes.
If you are using our datasets, please cite the following papers:
Mengting Wan, Julian McAuley, " Item Recommendation on Monotonic Behavior Chains ", in RecSys'18 . [ bibtex ]
Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley, " Fine-Grained Spoiler Detection from Large-Scale Review Corpora ", in ACL'19 . [ bibtex ]
If you have any questions or find any bugs regarding these datasets, feel free to contact Mengting Wan ( [email protected] ).
Latest Updates
We've updated several files in May 2019. We really appreciate those who helped us to identify duplicates and bugs in the previous version!
A github repo is created, which includes a few jupyter notebooks showing how to load the datasets and some basic data explorations.
[May 2019] Review files are uploaded.
[May 2019] Interaction files are updated: duplicates and mismatches are removed.
[May 2019] Meta-data of books are updated: text descriptions are normalized; popular shelf names with negative counts are removed.
We collected three groups of datasets: (1) meta-data of the books, (2) user-book interactions (users' public shelves) and (3) users' detailed book reviews. These datasets can be merged together by matching book/user/review ids.
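As a small illustration of merging by id, the sketch below attaches book titles to review records. The field names `book_id` and `title` are assumptions about the record layout and should be checked against the sample records; `join_reviews_with_books` is a hypothetical helper.

```python
def join_reviews_with_books(reviews, books, key="book_id"):
    """Yield each review extended with the title of the matching book.

    `reviews` and `books` are iterables of dictionaries. Reviews whose
    book id has no match in `books` are skipped.
    """
    books_by_id = {book[key]: book for book in books}
    for review in reviews:
        book = books_by_id.get(review[key])
        if book is not None:
            yield {**review, "title": book.get("title")}
```

The same pattern applies to joining on user or review ids by passing a different `key`.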
Basic Statistics of the Complete Book Graph:
2,360,655 books ( 1,521,962 works, 400,390 book series, 829,529 authors)
876,145 users; 228,648,342 user-book interactions in users' shelves (including 112,131,203 reads and 104,551,549 ratings)
Earlier counts, before the interaction files were updated and duplicates removed in May 2019: 229,154,523 user-book interactions (including 112,310,716 reads and 104,713,520 ratings).
Note the complete interaction dataset is very large! We extracted several medium-size subsets by genre, and recommend using these subsets for experimentation first (see " By Genre " for details).
(Meta-Data of Books)
We collected detailed meta-data about 2.36M books. Please see " Books " page for dataset details and sample records.
Quick links:
Complete book graph: goodreads_books.json.gz
Author information: goodreads_book_authors.json.gz
Work information: goodreads_book_works.json.gz
Book series: goodreads_book_series.json.gz
Fuzzy book genres: gooreads_book_genres_initial.json.gz
(User-Book Interactions)
We collected more than 229M user-book interactions. Please see " Shelves " page for dataset details and sample records.
Quick links ( These files could be very large! Consider using genre-wise datasets if your resources are limited. ) :
Complete 229m interactions in 'csv' format (~4.1g): goodreads_interactions.csv
User IDs: user_id_map.csv
Book IDs: book_id_map.csv
Contact Mengting Wan ( [email protected] ) if you need a detailed version
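Because the complete interaction file is very large, it is usually better to stream it than to load it whole. The sketch below counts interactions, reads and ratings in a single pass. It assumes the csv has a header row with columns named `is_read` and `rating`, and that a rating of 0 means "not rated"; both are assumptions to verify against the sample records.

```python
import csv

def count_interactions(path):
    """Stream an interactions csv and count rows, reads and ratings."""
    totals = {"interactions": 0, "reads": 0, "ratings": 0}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            totals["interactions"] += 1
            if int(row["is_read"]) > 0:   # assumed: 1 means the book was read
                totals["reads"] += 1
            if int(row["rating"]) > 0:    # assumed: 0 means no rating given
                totals["ratings"] += 1
    return totals
```

Memory use stays constant regardless of file size, since only one row is held at a time.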
(Book Review Texts)
We further re-scraped more than 15M records with detailed review text. Please see " Reviews " page for details and sample records.
Complete 15.7m reviews (~5g): goodread_reviews_dedup.json.gz
Review subset (~1.38m reviews) with parsed spoiler tags: goodreads_reviews_spoiler.json.gz
Spoiler subset with original review text: goodreads_reviews_spoiler_raw.json.gz
Code Samples
(Operate the Datasets)
We created several jupyter notebooks to illustrate how to download/read these datasets, and provide some basic explorations of the data.
Download datasets without GUI: download.ipynb
Display sample records: samples.ipynb
Calculate basic statistics: statistics.ipynb
Explore the interaction data: distributions.ipynb
Explore the review data: reviews.ipynb
By Genre
We notice different interaction densities in different subsets.
Books can be overlapped across different genres (i.e., one book may belong to multiple genres).
The (similar) book graph for each genre may not be self-contained. Those are just subsets of the nodes on the complete book graph (see the meta-data section).
Detailed information about authors, works, book series etc. can be found in the meta-data section.
Download Links:
Children
goodreads_books_children.json.gz (124,082 books)
goodreads_interactions_children.json.gz (10,059,349 interactions)
goodreads_reviews_children.json.gz (734,640 detailed reviews)
Comics & Graphic
goodreads_books_comics_graphic.json.gz (89,411 books)
goodreads_interactions_comics_graphic.json.gz (7,347,630 interactions)
goodreads_reviews_comics_graphic.json.gz (542,338 detailed reviews)
Fantasy & Paranormal
goodreads_books_fantasy_paranormal.json.gz (258,585 books)
goodreads_interactions_fantasy_paranormal.json.gz (55,397,550 interactions)
goodreads_reviews_fantasy_paranormal.json.gz (3,424,641 detailed reviews)
History & Biography
goodreads_books_history_biography.json.gz (302,935 books)
goodreads_interactions_history_biography.json.gz (31,479,229 interactions)
goodreads_reviews_history_biography.json.gz (2,066,193 detailed reviews)
Mystery, Thriller & Crime
goodreads_books_mystery_thriller_crime.json.gz (219,235 books)
goodreads_interactions_mystery_thriller_crime.json.gz (24,799,896 interactions)
goodreads_reviews_mystery_thriller_crime.json.gz (1,849,236 detailed reviews)
Poetry
goodreads_books_poetry.json.gz (36,514 books)
goodreads_interactions_poetry.json.gz (2,734,350 interactions)
goodreads_reviews_poetry.json.gz (154,555 detailed reviews)
Romance
goodreads_books_romance.json.gz (335,449 books)
goodreads_interactions_romance.json.gz (42,792,856 interactions)
goodreads_reviews_romance.json.gz (3,565,378 detailed reviews)
Young Adult
goodreads_books_young_adult.json.gz (93,398 books)
goodreads_interactions_young_adult.json.gz (34,919,254 interactions)
goodreads_reviews_young_adult.json.gz (2,389,900 detailed reviews)
Amazon product data
Julian McAuley , UCSD
New!: See our updated (2018) version of the Amazon data here
New: repository of recommender systems datasets.
See a variety of other datasets for recommender systems research on our lab's dataset webpage
Description
This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.
This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).
"Small" subsets for experimentation
If you're using this data for a class project (or similar) please consider using one of these smaller datasets below before requesting the larger files. To obtain the larger files you will need to contact me to obtain access.
K-cores (i.e., dense subsets): These data have been reduced to extract the k-core , such that each of the remaining users and items has at least k reviews.
Ratings only: These datasets include no metadata or reviews, but only (user,item,rating,timestamp) tuples. Thus they are suitable for use with mymedialite (or similar) packages.
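To make the k-core reduction concrete, the sketch below repeatedly drops reviews whose user or item has fewer than k reviews, until nothing more can be removed. It is an illustration of the general idea, not the exact procedure used to build the files above; `k_core` is a hypothetical helper operating on (user, item) pairs.

```python
from collections import Counter

def k_core(pairs, k):
    """Return the (user, item) pairs that survive k-core pruning.

    Pruning must be repeated: removing a sparse user can push an item
    below k reviews, and vice versa.
    """
    pairs = list(pairs)
    while True:
        user_counts = Counter(user for user, _ in pairs)
        item_counts = Counter(item for _, item in pairs)
        kept = [
            (user, item)
            for user, item in pairs
            if user_counts[user] >= k and item_counts[item] >= k
        ]
        if len(kept) == len(pairs):
            return kept  # fixed point reached
        pairs = kept
```

Each pass recounts from scratch, which is simple but quadratic in the worst case; for the full dataset an incremental queue-based version would be more appropriate.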
Complete review data
Please see the per-category files below, and only download these (large!) files if you really need them:
raw review data (20gb) - all 142.8 million reviews
The above file contains some duplicate reviews, mainly due to near-identical products whose reviews Amazon merges, e.g. VHS and DVD versions of the same movie. These duplicates have been removed in the files below:
user review data (18gb) - duplicate items removed (83.68 million reviews), sorted by user
product review data (18gb) - duplicate items removed, sorted by product
5-core (9.9gb) - subset of the data in which all users and items have at least 5 reviews (41.13 million reviews)
Finally, the following file removes duplicates more aggressively, removing duplicates even if they are written by different users. This accounts for users with multiple accounts or plagiarized reviews. Such duplicates account for less than 1 percent of reviews, though this dataset is probably preferable for sentiment analysis type tasks:
aggressively deduplicated data (18gb) - no duplicates whatsoever (82.83 million reviews)
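The two levels of deduplication described above can be pictured as a choice of key. The sketch below is only an illustration under that assumption: the exact matching criteria used to build the files are not stated here, and `deduplicate` is a hypothetical helper.

```python
def deduplicate(reviews, aggressive=False):
    """Yield reviews with duplicates dropped, keeping the first occurrence.

    Default: a duplicate is the same reviewer posting the same text
    (e.g. on merged product listings). With aggressive=True, identical
    text counts as a duplicate even across different reviewers.
    """
    seen = set()
    for review in reviews:
        text = review.get("reviewText", "")
        key = text if aggressive else (review.get("reviewerID"), text)
        if key in seen:
            continue
        seen.add(key)
        yield review
```

Keeping every key in memory is a simplification; at this scale hashing the text first would reduce the footprint.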
Format is one-review-per-line in (loose) json. See examples below for further help reading the data.
Sample review:
{ "reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", "reviewerName": "J. McDonald", "helpful": [2, 3], "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!", "overall": 5.0, "summary": "Heavenly Highway Hymns", "unixReviewTime": 1252800000, "reviewTime": "09 13, 2009" }
- reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
- asin - ID of the product, e.g. 0000013714
- reviewerName - name of the reviewer
- helpful - helpfulness rating of the review, e.g. 2/3
- reviewText - text of the review
- overall - rating of the product
- summary - summary of the review
- unixReviewTime - time of the review (unix time)
- reviewTime - time of the review (raw)
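As a brief example of working with these fields, the sketch below turns the helpful pair into a ratio and the unix time into an ISO date. `summarize_review` is a hypothetical helper, and interpreting the timestamp in UTC is an assumption.

```python
from datetime import datetime, timezone

def summarize_review(review):
    """Extract a few derived values from one review dictionary."""
    helpful_votes, total_votes = review["helpful"]
    review_date = datetime.fromtimestamp(review["unixReviewTime"], tz=timezone.utc)
    return {
        "asin": review["asin"],
        "rating": review["overall"],
        # None when nobody voted, to avoid dividing by zero
        "helpful_ratio": helpful_votes / total_votes if total_votes else None,
        "date": review_date.date().isoformat(),
    }
```

For the sample review above, the helpful pair [2, 3] gives a ratio of 2/3 and the unix time 1252800000 maps to 2009-09-13, matching the raw reviewTime.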
Metadata includes descriptions, price, sales-rank, brand info, and co-purchasing links:
metadata (3.1gb) - metadata for 9.4 million products
Sample metadata:
{ "asin": "0000031852", "title": "Girls Ballet Tutu Zebra Hot Pink", "price": 3.17, "imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg", "related": { "also_bought": ["B00JHONN1S", "B002BZX8Z6", "B00D2K1M3O", "0000031909", "B00613WDTQ", "B00D0WDS9A", "B00D0GCI8S", "0000031895", "B003AVKOP2", "B003AVEU6G", "B003IEDM9Q", "B002R0FA24", "B00D23MC6W", "B00D2K0PA0", "B00538F5OK", "B00CEV86I6", "B002R0FABA", "B00D10CLVW", "B003AVNY6I", "B002GZGI4E", "B001T9NUFS", "B002R0F7FE", "B00E1YRI4C", "B008UBQZKU", "B00D103F8U", "B007R2RM8W"], "also_viewed": ["B002BZX8Z6", "B00JHONN1S", "B008F0SU0Y", "B00D23MC6W", "B00AFDOPDA", "B00E1YRI4C", "B002GZGI4E", "B003AVKOP2", "B00D9C1WBM", "B00CEV8366", "B00CEUX0D8", "B0079ME3KU", "B00CEUWY8K", "B004FOEEHC", "0000031895", "B00BC4GY9Y", "B003XRKA7A", "B00K18LKX2", "B00EM7KAG6", "B00AMQ17JA", "B00D9C32NI", "B002C3Y6WG", "B00JLL4L5Y", "B003AVNY6I", "B008UBQZKU", "B00D0WDS9A", "B00613WDTQ", "B00538F5OK", "B005C4Y4F6", "B004LHZ1NY", "B00CPHX76U", "B00CEUWUZC", "B00IJVASUE", "B00GOR07RE", "B00J2GTM0W", "B00JHNSNSM", "B003IEDM9Q", "B00CYBU84G", "B008VV8NSQ", "B00CYBULSO", "B00I2UHSZA", "B005F50FXC", "B007LCQI3S", "B00DP68AVW", "B009RXWNSI", "B003AVEU6G", "B00HSOJB9M", "B00EHAGZNA", "B0046W9T8C", "B00E79VW6Q", "B00D10CLVW", "B00B0AVO54", "B00E95LC8Q", "B00GOR92SO", "B007ZN5Y56", "B00AL2569W", "B00B608000", "B008F0SMUC", "B00BFXLZ8M"], "bought_together": ["B002BZX8Z6"] }, "salesRank": {"Toys & Games": 211836}, "brand": "Coxlures", "categories": [["Sports & Outdoors", "Other Sports", "Dance"]] }
- asin - ID of the product, e.g. 0000031852
- title - name of the product
- price - price in US dollars (at time of crawl)
- imUrl - url of the product image
- related - related products (also bought, also viewed, bought together, buy after viewing)
- salesRank - sales rank information
- brand - brand name
- categories - list of categories the product belongs to
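The related field can be flattened into a list of directed edges, which is one way to obtain the also viewed / also bought graphs mentioned in the description. The sketch below assumes each product is a dictionary shaped like the sample metadata; `related_edges` is a hypothetical helper.

```python
def related_edges(products, relation="also_bought"):
    """Yield (source_asin, target_asin) pairs for one relation type.

    Products without a 'related' field, or without the requested
    relation, contribute no edges.
    """
    for product in products:
        for other in product.get("related", {}).get(relation, []):
            yield (product["asin"], other)
```

Passing `relation="also_viewed"` or `relation="bought_together"` gives the other graphs in the same way.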
Visual Features
We extracted visual features from each product image using a deep CNN (see citation below). Image features are stored in a binary format, which consists of 10 characters (the product ID), followed by 4096 floats (repeated for every product). See files below for further help reading the data.
visual features (141gb) - visual features for all products
The images themselves can be extracted from the imUrl field in the metadata files.
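Given the binary layout of the visual features file described above (a 10-character product ID followed by 4096 floats per product), each record has a fixed size, so a single product can be read without scanning the whole file. The sketch below assumes 4-byte little-endian floats, which is an assumption about how the file was written; `count_records` and `read_record` are hypothetical helpers.

```python
import os
import struct

ASIN_BYTES = 10
N_FEATURES = 4096
# Assumes 4-byte floats: 10 + 4 * 4096 = 16394 bytes per product.
RECORD_BYTES = ASIN_BYTES + 4 * N_FEATURES

def count_records(path):
    """Number of complete product records in the file."""
    return os.path.getsize(path) // RECORD_BYTES

def read_record(path, index):
    """Return (asin, features) for the record at position `index`."""
    with open(path, "rb") as f:
        f.seek(index * RECORD_BYTES)
        raw = f.read(RECORD_BYTES)
    if len(raw) < RECORD_BYTES:
        raise IndexError("record index out of range")
    asin = raw[:ASIN_BYTES].decode("ascii")
    # '<' = little-endian (assumed), 'f' = 4-byte float
    features = struct.unpack("<%df" % N_FEATURES, raw[ASIN_BYTES:])
    return asin, list(features)
```

Under the same assumption, a 141gb file would hold on the order of 141e9 / 16394, or roughly 8.6 million, product records.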
Below are files for individual product categories, which have already had duplicate item reviews removed.
Please cite one or both of the following if you use the data in any way:
Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering R. He, J. McAuley WWW , 2016 pdf
Image-based recommendations on styles and substitutes J. McAuley, C. Targett, J. Shi, A. van den Hengel SIGIR , 2015 pdf
Inferring networks of substitutable and complementary products J. McAuley, R. Pandey, J. Leskovec Knowledge Discovery and Data Mining , 2015 pdf
Hidden factors and hidden topics: understanding rating dimensions with review text J. McAuley, J. Leskovec RecSys , 2013 pdf
Reading the data
Data can be treated as python dictionary objects. A simple script to read any of the above data is as follows:
import gzip

def parse(path):
    # eval runs arbitrary Python, so only use this on files you trust
    with gzip.open(path, 'rb') as g:
        for l in g:
            yield eval(l)
Convert to 'strict' json
The above data can be read with python 'eval', but is not strict json. If you'd like to use some language other than python, you can convert the data to strict json as follows:
import json
import gzip

def parse(path):
    # re-serialize each loosely formatted record as strict json
    with gzip.open(path, 'rb') as g:
        for l in g:
            yield json.dumps(eval(l))

with open("output.strict", 'w') as f:
    for l in parse("reviews_Video_Games.json.gz"):
        f.write(l + '\n')
Pandas data frame
This code reads the data into a pandas data frame:
import pandas as pd
import gzip

def parse(path):
    with gzip.open(path, 'rb') as g:
        for l in g:
            yield eval(l)

def getDF(path):
    # one dictionary per review, keyed by row index
    df = {}
    for i, d in enumerate(parse(path)):
        df[i] = d
    return pd.DataFrame.from_dict(df, orient='index')

df = getDF('reviews_Video_Games.json.gz')
Convert to CSV
This code converts (a selection of fields from) the above files to CSV format:
import csv
import gzip

fields = ["asin", "description", "brand"]

# uses parse() from above; text mode ('wt') lets csv.writer write strings
with gzip.open("meta_Video_Games.csv.gz", 'wt', newline='') as csvOut:
    writer = csv.writer(csvOut)
    for product in parse("meta_Video_Games.json.gz"):
        # missing fields become empty cells
        writer.writerow([product.get(f, "") for f in fields])
Read image features
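import array

def readImageFeatures(path):
    with open(path, 'rb') as f:
        while True:
            asin = f.read(10)  # 10-character product ID
            if not asin:  # empty read means end of file
                break
            a = array.array('f')
            a.fromfile(f, 4096)  # 4096 floats per product
            yield asin.decode('ascii'), a.tolist()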
Example: compute average rating
ratings = []
for review in parse("reviews_Video_Games.json.gz"):
    ratings.append(review['overall'])
print(sum(ratings) / len(ratings))
Example: latent-factor model in mymedialite
Predicts ratings from a rating-only CSV file
./rating_prediction --recommender=BiasedMatrixFactorization --training-file=ratings_Video_Games.csv --test-ratio=0.1
