We built a data processing pipeline that extracts product characteristics, such as quality or performance, from user reviews. Our approach uses part-of-speech tagging to retrieve basic impressions of products and extracts qualities by identifying the best bigram collocations according to pointwise mutual information and likelihood-ratio scores. The characteristics are then grouped into positivity classes by analyzing user sentiment. The pipeline is applicable to any type of product and has direct real-world applications.
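To make the collocation step more concrete, here is a minimal sketch of ranking adjacent word pairs by pointwise mutual information, PMI(x, y) = log2(p(x, y) / (p(x) p(y))). It is an illustration only, not the pipeline's actual code: the function name `pmi_bigrams`, the `min_count` threshold, and the whitespace tokenization are assumptions made for this example, and the likelihood-ratio score is not covered.

```python
import math
from collections import Counter


def pmi_bigrams(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    Hypothetical helper for illustration. Pairs seen fewer than
    `min_count` times are skipped, because PMI overrates rare pairs.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_first = unigrams[first] / n_unigrams
        p_second = unigrams[second] / n_unigrams
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))

    # Highest PMI first: these are the strongest collocation candidates.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


review = "battery life is great and battery life is long but the screen is dim"
ranked = pmi_bigrams(review.split())
for pair, score in ranked:
    print(pair, round(score, 2))
```

In this toy review, "battery life" ranks above "life is" because "is" also appears in other contexts, which lowers the pair's PMI.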
We used a collection of user reviews extracted from Amazon, spanning 1996 to 2014, which is freely available for research purposes here. More specifically, we used the dense 5-core subset of electronics reviews and the full metadata file, from which we extracted the entries related to electronics. The rationale for this choice is that we need a minimum number of reviews per product to extract meaningful results. In addition, we built a dictionary of compound words based on the set of all Wiktionary page titles. This set can be found here.
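For readers unfamiliar with the term, a 5-core subset keeps only reviews whose user and product each have at least five reviews within the subset. The dataset already ships this subset precomputed, so the sketch below is not a step of our pipeline; it only illustrates the idea. The function name `k_core` and the `(user, product)` pair representation are assumptions made for this example.

```python
from collections import Counter


def k_core(reviews, k=5):
    """Keep (user, product) pairs where both sides have at least k reviews.

    Hypothetical helper for illustration. Filtering is repeated because
    dropping one review can push another user or product below k.
    """
    reviews = list(reviews)
    while True:
        per_user = Counter(user for user, _ in reviews)
        per_product = Counter(product for _, product in reviews)
        kept = [
            (user, product)
            for user, product in reviews
            if per_user[user] >= k and per_product[product] >= k
        ]
        if len(kept) == len(reviews):
            return kept
        reviews = kept


sample = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"), ("u3", "a")]
dense = k_core(sample, k=2)
print(dense)
```

With `k=2`, the single review by `u3` is removed, while the remaining users and products all keep two reviews each.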
Read the report
Browse the IPython Notebook
Browse the source code