
Tuesday, December 23, 2014

Busted! The campaign against counterfeit reviews

Fake reviews are poisoning the Internet. Here's how machine learning is attempting to nail the counterfeiters.

Customer sentiment is a type of soft currency. Good reviews are monetizable data, especially when they come from influential, reputable sources and are broadcast far and wide.

In other words, it's best when your fans and their fond feelings are earned.

This marketing principle works in reverse, of course -- negative sentiments and nasty reviews can be showstoppers. Reputations lost cannot easily be reclaimed. And when bad raps persist in public forums -- such as social media, e-commerce sites, or review sites -- you can't count on people forgetting the mud that was slung at you last year or the year before. It will stain your brand in perpetuity, even if the charges were baseless and you've effectively addressed those that weren't.

What's shocking about online reviews is how easy this "currency" is to counterfeit. Cyberspace is rife with fake reviews, both positive and negative. We can interpret "fake" in several ways:
  • The reviewer may use their own name but conceal the fact that they've been "put up to it" (they may have been paid, have a vested interest, or anticipate other material benefits to flow if they say nice things -- or nasty things to trash the competition).
  • The reviewer may use a pseudonym or otherwise post anonymously, in order to shield himself or herself from being fingered as the perpetrator.
  • The reviewer may be an automated program that posts legitimate-looking reviews in bulk, thereby overwhelming whatever authentic reviews have been posted manually.
Due to the levels of deception that may be involved, detecting fake online reviews requires that we confirm the following:
  • The authenticity of the source
  • The source's impartiality on the matters being reviewed
  • The originality of the actual reviews posted by the source (i.e., that they are not automated bulk spam)
These are tough nuts to crack, especially in an automated fashion that can weed out the bogus reviews before they're posted and do their damage. In that regard, I recently came across an interesting article about an effort at the University of Kansas to develop machine learning algorithms to detect fake reviews. The researchers cite the need for a "more trustworthy social media experience" as driving their initiative.

What the article describes is one part semantic analysis of the posts (to look for verbal signatures of fake reviews), one part graph analysis (to assess the status of each reviewer's relationship with the site on which they post), one part outlier analysis (to determine whether the posts are far outside the average in terms of the sentiments expressed and the frequency of posting), and one part behavioral analysis (to determine whether bogus reviewers are changing their strategy over time and across sites to avoid detection). Underlying the researchers' efforts is an attempt to model fake-review attacks as a graph of "interactions between sociological, psychological, and technological factors."
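To make that combination of signals concrete, here is a minimal Python sketch of how semantic, graph, and outlier scores might be blended into a single suspicion score. Every feature, field name, weight, and threshold below is my own illustrative assumption, not the researchers' actual model, and the behavioral (strategy-drift-over-time) component is omitted for brevity:

```python
# A minimal sketch of blending multiple detection signals into one
# "suspicion" score. All feature definitions, field names, and weights
# are illustrative assumptions, not the researchers' actual model.
import math
from dataclasses import dataclass

@dataclass
class Review:
    text: str
    rating: int               # 1-5 stars
    reviewer_age_days: int    # account age when the review was posted
    reviewer_post_count: int  # total reviews this account has posted

def semantic_score(review: Review) -> float:
    """Crude verbal-signature check: fakes often lean on superlatives
    and exclamation marks."""
    superlatives = {"best", "worst", "amazing", "terrible", "perfect"}
    words = review.text.lower().split()
    if not words:
        return 1.0
    hits = sum(w.strip("!.,") in superlatives for w in words)
    bangs = review.text.count("!")
    return min(1.0, 5 * (hits + bangs) / len(words))

def graph_score(review: Review) -> float:
    """Proxy for the reviewer-site relationship: brand-new accounts
    with almost no history are more suspicious."""
    if review.reviewer_age_days < 7 and review.reviewer_post_count <= 1:
        return 1.0
    return 1.0 / (1.0 + math.log1p(review.reviewer_age_days))

def outlier_score(review: Review, site_mean_rating: float) -> float:
    """How far this rating sits from the site's average sentiment,
    normalized to [0, 1] (max possible deviation on a 1-5 scale is 4)."""
    return abs(review.rating - site_mean_rating) / 4.0

def suspicion(review: Review, site_mean_rating: float) -> float:
    # Hand-picked weights; a real system would learn these from labeled data.
    return (0.4 * semantic_score(review)
            + 0.3 * graph_score(review)
            + 0.3 * outlier_score(review, site_mean_rating))

r = Review("Best food EVER!!! Perfect! Amazing!", 5, 2, 1)
print(f"suspicion = {suspicion(r, site_mean_rating=3.2):.2f}")  # high score
```

A production system would learn the weights from labeled examples rather than hand-tuning them, and would recompute the graph and behavioral features as each reviewer's history evolves.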

People might trust online reviews more if they have some confidence that bogus postings are being detected promptly and accurately. Like any content-filtering technology, anti-fake-review algorithms will need to minimize both false positives (real reviews misclassified as fake) and false negatives (fake reviews misclassified as real).
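As a back-of-the-envelope illustration, here is how those two error rates might be computed on a hypothetical labeled sample (the data and function below are my own illustrative assumptions, not anything from the researchers' work):

```python
# Hypothetical evaluation of a fake-review classifier on a labeled sample.
# True means "fake"; the lists below are invented purely for illustration.
def error_rates(truth, predicted):
    fp = sum(p and not t for t, p in zip(truth, predicted))  # real flagged as fake
    fn = sum(t and not p for t, p in zip(truth, predicted))  # fakes slipping through
    return fp / truth.count(False), fn / truth.count(True)

truth     = [True, True, False, False, False, True, False, False]
predicted = [True, False, False, True, False, True, False, False]
fpr, fnr = error_rates(truth, predicted)
print(f"false-positive rate = {fpr:.0%}, false-negative rate = {fnr:.0%}")
# -> false-positive rate = 20%, false-negative rate = 33%
```

The two rates trade off against each other: lowering the flagging threshold to catch more fakes inevitably sweeps up more legitimate reviews as collateral damage.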

The stakeholders are obviously the businesses and other online entities whose reputations are on the line, as well as the public at large, which relies on these opinions when deciding whether a given site, community, or business is worth associating with. If the researchers succeed in bringing machine learning algorithms to bear on the problem, their work could aid online sites in their efforts to self-police fake reviews. It could also help flag possible abusers so that they can be investigated further, blocked from accessing sites, and even referred to the relevant authorities for punitive action.

If the researchers want to produce an algorithm of practical value, they will need to make it fast, efficient, parallelizable, and automatable to the max. It will need the cloud scalability of today's state-of-the-art antispam, antiphish, and antimalware technologies. Just as no sane human wants to manually filter an ocean of Nigerian scam emails, no one in his or her right mind will want to adjudicate whether the next "this restaurant's food stinks" review is the authentic voice of a real customer -- or the malicious posting of its archrival across the street.

It all comes down to detecting the fine line between sincerity and its opposite. It's the same fine line that anti-sarcasm algorithms also attempt to identify, with mixed results.
