Thursday 30 August 2012

You Can Write, But You Can't Hide: Big Data Knows Your Writing Quirks


As I wrote recently, data scientists have been able to decode unstructured data to accurately predict where violence will occur in Afghanistan. Now, they can also mine unstructured data to determine the identity of a document’s writer. All of us, it seems, have a “write-print” as unique as our fingerprint.
Forensic linguists, the experts who investigate who originated a text, say that if they have an individual’s known writings, they can determine with up to 95% accuracy whether that person wrote any other document. Forensic linguists have been called as witnesses in the high-profile lawsuit brought by Paul Ceglia, who has sued Mark Zuckerberg, claiming to own half of Facebook. They’ve also been expert witnesses in murder trials.
While the field of forensic linguistics predates the advent of big data, the sheer volume of data being generated on the Internet is opening new business opportunities for automating the analysis. A company pursuing these opportunities claims it can pinpoint a document’s author and determine everything from the gender, age, and education of a writer to the veracity of the document’s content.
But some analysts don’t even need to have access to known writings of a person to determine a document’s authorship. Using hundreds of thousands of publicly available emails from Enron employees, a group of computer scientists from Concordia University in Montreal tested their approach of clustering documents of unknown origin to identify those written by the same person. While they note that more research is needed, these scientists believe their clustering technique can be used by investigators of cyber crimes where all they have as evidence is a massive amount of suspicious e-mails, text messages, or other written material.
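To make the idea of clustering documents by writing style concrete, here is a minimal sketch in Python. It is not the Concordia team's method. It assumes a much simpler stand-in: each document is reduced to the relative frequencies of a few common function words, and documents whose frequency profiles are similar enough are grouped together. The word list, the `threshold` value, and the function names are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical feature set: a few common English function words.
# Real stylometric systems use far richer features.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "i", "it", "but"]


def style_vector(text):
    """Return the relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_by_style(texts, threshold=0.8):
    """Group document indices whose style vectors are similar.

    Any two documents with cosine similarity >= threshold end up in the
    same group (single-linkage grouping via union-find). The threshold
    is an assumed value, not a recommended one.
    """
    vectors = [style_vector(t) for t in texts]
    parent = list(range(len(texts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(texts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Calling `cluster_by_style(list_of_messages)` returns lists of document indices, one list per suspected author. On real e-mail, a handful of function words would be far too crude; the sketch only shows the shape of the approach, which is to turn each text into a style profile and then group similar profiles without knowing any author in advance.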
Although no forensic linguist would claim that their analysis is equal to comparing a person’s DNA for identification purposes, they are confident they can find the stylistic quirks in our prose that make us all individuals, even if the writer is intentionally trying to obscure his or her identity by, say, pretending to be illiterate when, in fact, the writer is a college graduate.
Big data is boosting forensic linguistics as a tool for criminal investigations and for courts seeking justice. It’s one more way analytics is improving modern life.
