Text analytics, the Enron Data Set … and Andrew Fastow … get a new role

Guest blogger:

Eric De Grasse
Chief Technology Officer
Project Counsel Media

 

4 April 2019 (Paris, France) – As I wrote last week from San Francisco while attending EmTech Digital, the annual “all-things-AI” event sponsored by MIT Technology Review, one of the new techniques “out there” is to deliberately mess with machine-learning systems. It is called “adversarial machine learning”: one method involves experimentally feeding input into an algorithm to reveal the information it has been trained on, while others involve distorting the input in a way that causes the system to “misbehave”. The presenter used the Enron e-mail data set, which is being used more and more for this type of research because it is a “real-world” data set on which many different machine-learning models can be tested. You can read my earlier post by clicking here.
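To make the extraction side of adversarial machine learning concrete, here is a toy sketch – not the presenter’s actual method, and the “secret” number is invented – of how probing a model with candidate inputs can surface memorized training data. A tiny character-bigram model is “trained” on a corpus that happens to contain a sensitive string; an attacker who can only query the model ranks guesses by how familiar the model finds them, and the memorized secret floats to the top:

```python
from collections import defaultdict

def train_bigram(corpus):
    """Count character bigrams to build a toy language model."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1
    return counts

def score(model, text):
    """Average bigram count along a candidate string --
    sequences seen in training score higher than unseen ones."""
    pairs = list(zip(text, text[1:]))
    return sum(model[a][b] for a, b in pairs) / len(pairs)

# A training set that accidentally contains a (fake) secret number,
# standing in for sensitive data in a real email corpus.
corpus = "meeting at noon " * 20 + "card 4929-1111 " * 3
model = train_bigram(corpus)

# The probe: query the model with candidate strings and rank them.
candidates = ["4929-1111", "8305-2247", "1234-5678"]
ranked = sorted(candidates, key=lambda c: score(model, c), reverse=True)
print(ranked[0])  # the memorized secret ranks first
```

Real attacks against systems like Smart Compose use far more sophisticated models and query strategies, but the underlying logic – rank guesses by the model’s familiarity with them – is the same.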

For the e-discovery community the Enron Data Set has remained the “go-to” data set for years as the best source of high-volume data for demos and e-discovery software testing. But recently many in the e-discovery community have said it is no longer a representative test data set for e-discovery software testing – at least not as currently constituted – and it hasn’t been for a good while.

But to the artificial intelligence and machine learning communities it is a gold mine:

– Google has used it to trick machine-learning algorithms that auto-generate email responses into spitting out sensitive data such as credit card numbers, then used those findings to prevent Smart Compose (the tool that auto-generates text in Gmail) from being exploited.

– Scattertest used it for its text-analytics program, which helps companies monitor employee email (more on that in a moment), by finding “uncommonly common” words – comparing one company category against another (sales vs. marketing) to see relationships between people through their “uncommonly common” words – or by extracting named entities from each email and then linking the top entities/users together in a network graph (used in anomaly detection of rogue trading).

– The Computer Science Department at the University of Montana used it in its Python classes to create data visualization networks to show clusters of emails and how they can relate to a company’s communication channels

– Ubisend used it to train its algorithms to create data visualizations of communication networks
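The “uncommonly common” comparison described above can be sketched in a few lines of Python. This is an illustrative toy, not any vendor’s actual scoring: it ranks words by how much more frequent they are in one category’s emails than in another’s, using a smoothed relative-frequency ratio (the sample sales/marketing messages are invented):

```python
from collections import Counter

def uncommonly_common(docs_a, docs_b, top=3):
    """Rank words by how much more frequent they are in corpus A
    than in corpus B (relative-frequency ratio with +1 smoothing)."""
    freq_a = Counter(w for d in docs_a for w in d.lower().split())
    freq_b = Counter(w for d in docs_b for w in d.lower().split())
    total_a = sum(freq_a.values()) or 1
    total_b = sum(freq_b.values()) or 1
    scores = {
        w: (freq_a[w] / total_a) / ((freq_b[w] + 1) / total_b)
        for w in freq_a
    }
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top]]

sales = ["quota pipeline demo quota", "pipeline close quota"]
marketing = ["campaign brand launch", "brand campaign audience"]
print(uncommonly_common(sales, marketing))  # "quota" ranks first
```

Words like “quota” are common within sales mail but rare in marketing mail, so they surface as that department’s signature vocabulary – the same signal that, aggregated per person, can drive the network graphs mentioned above.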

But the best might be KeenCorp, a data-analytics firm. Companies hire KeenCorp to analyze their employees’ emails. KeenCorp doesn’t read the emails … exactly. Its software focuses on word patterns and their context. The software then assigns the body of messages a numerical index that purports to measure the level of employee “engagement.” When workers are feeling positive and engaged, the number is high; when they are disengaged or expressing negative emotions like tension, the number is low. Two of the algorithm designers for KeenCorp met Andrew Fastow in Amsterdam after he had finished a public-speaking gig.
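KeenCorp’s actual index is proprietary, so the following is only a crude sketch of the general idea – score a batch of messages on a single engagement scale from word patterns, with the lexicons invented here for illustration:

```python
# Illustrative lexicons only -- KeenCorp's real features are not public.
POSITIVE = {"great", "progress", "thanks", "excited", "win"}
NEGATIVE = {"worried", "problem", "pressure", "wrong", "risk"}

def engagement_index(messages):
    """Score a batch of messages on a -1..1 scale: positive wording
    pushes the index up, tension words pull it down."""
    pos = neg = 0
    for msg in messages:
        for raw in msg.lower().split():
            word = raw.strip(".,!?;:")
            if word in POSITIVE:
                pos += 1
            elif word in NEGATIVE:
                neg += 1
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

q1 = ["great progress on the deal", "thanks team, excited for the win"]
q2 = ["worried about the risk here", "this feels wrong, too much pressure"]
print(engagement_index(q1), engagement_index(q2))  # high vs. low quarter
```

A real system would track this index over time per team or office; a sudden, sustained dip – like the one KeenCorp’s software found in mid-1999 – is what prompts a closer look.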

Short version: Andrew Fastow was the chief financial officer of Enron Corporation, an energy trading company based in Houston, Texas, until he was fired shortly before the company declared bankruptcy. Fastow was one of the key figures behind the complex web of off-balance-sheet special purpose entities (limited partnerships which Enron controlled) used to conceal Enron’s massive losses in its quarterly financial statements. By unlawfully maintaining personal stakes in these ostensibly independent ghost entities, he was able to defraud Enron out of tens of millions of dollars. He was convicted and served a six-year prison sentence for charges related to those acts. These days he spends most of his time speaking about ethics and anti-fraud procedures at universities and events like the Annual Global Anti-Fraud Conference.

A copy of the Enron email database was subsequently made public (long history; this is the short version) and it provided a trove of data that has been used for studies on social networking and computer analysis of language, as well as by e-discovery vendors, and it is now a new source for the artificial intelligence/machine learning crowd. The corpus is unique in that it is one of the only mass collections of real emails publicly available for study, as such collections are typically bound by numerous privacy and legal restrictions that render them prohibitively difficult to access.

The two chaps from KeenCorp told Fastow that they had tested their software using several years’ worth of emails from the Enron Data Set, focusing on Enron’s top 150 executives. They were checking to see how key moments in the company’s tumultuous collapse would register on the KeenCorp index. But something appeared to have gone wrong.

The software had returned the lowest index score at the end of 2001, when Enron filed for bankruptcy. That made sense: Enron executives would have been growing more agitated as the company neared insolvency. But the index had also plummeted more than two years earlier. The two men had scoured various books and reports on Enron’s downfall, but it wasn’t clear what made this earlier date important. Pointing to the sudden dip on the left side of the laminated chart, they told Fastow they had one specific question: “Do you remember anything unusual happening at Enron on June 28, 1999?”

Fastow spent a fair amount of time with KeenCorp, learning the software and understanding their use of the Enron Data Set. KeenCorp didn’t realize it, but its algorithms had, in fact, spotted one of the most important inflection points in Enron’s history. Fastow would later say that on that date in 1999, the company’s board had spent hours discussing a novel proposal called “LJM,” which involved a series of complex and dubious transactions that would hide some of Enron’s poorly performing assets and bolster its financial statements. Ultimately, when discovered, LJM contributed to the firm’s undoing. According to Fastow, nobody formally challenged LJM. Nobody went to the board to say “this is wrong”. But the KeenCorp algorithm detected tension at the company starting with that first mention of LJM.

KeenCorp has half a dozen major corporate clients and several consultants and advisers, including Fastow, who has been quoted as saying he was so impressed with the algorithm’s ability to spot employees’ concerns about LJM that he decided to become an investor. Yes, Fastow knows he’s stuck with a legacy of unethical and illegal behavior from his time at Enron. But he thinks that by making companies aware of KeenCorp’s software he can help prevent similar situations from occurring in the future.

KeenCorp also points to its “heat maps” of employee engagement that its software creates. KeenCorp says the maps have helped companies identify potential problems in the workplace, including audit-related concerns that accountants failed to flag. The software merely provides a warning, of course – it isn’t trained in the Sarbanes-Oxley Act. But a warning could be enough to help uncover serious problems.

Such early tips might also become an important tool to help companies ensure that they are complying with government rules – a Herculean task for firms in highly regulated fields like finance, health care, insurance, and pharmaceuticals. An early-warning system, though, is only as good as the people using it. Someone at the company, high or low, has to be willing to say something when the heat map turns red – and others have to listen. It is hard to imagine Enron’s directors heeding any warning about the use of complex financial transactions in 1999 – the bad actors included the CEO, and we know that whistle-blowers at the company were ignored.

The potential benefits of analyzing employee correspondence must also be weighed against the costs: in some industries, like finance, the rank and file are acutely aware that everything they say in an email can be read by a higher-up, but in other industries the scanning of emails, however anonymized, will be viewed as intrusive if not downright Big Brotherly.

But it is managers who might have the most to fear from text-analysis tools. KeenCorp software can chart how employees react when a leader is hired or promoted. And one KeenCorp client investigated a branch office after its heat map suddenly started glowing and found that the head of the office had begun an affair with a subordinate.

Privacy concerns? Well, KeenCorp does not collect, store, or report any information at the individual level. According to KeenCorp, all messages are “stripped and treated so that the privacy of individual employees is fully protected”. Nevertheless, let’s face it: many companies do want to obtain information about individuals. Those seeking that information might turn to other software, or build their own data-mining system.

A short note on text analytics

The so-called text-analytics industry is booming. The technology has been around for years – go to Legaltech and you’d think the e-discovery industry invented it – but its earliest use was in the early 1960s, when it was incorporated into the management information systems then being developed.

Its first internet-era use came in the early 1990s, when email really took off: it was used to build spam filters because people were finding their inboxes unmanageable. The tools have obviously grown in sophistication, as have their uses.
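Those early spam filters were, at heart, word-frequency classifiers. A minimal naive Bayes sketch of the idea (the sample messages are invented; real filters use much larger training sets and feature engineering):

```python
import math
from collections import Counter

def train(spam, ham):
    """Count word frequencies per class for a naive Bayes filter."""
    return (Counter(w for m in spam for w in m.split()),
            Counter(w for m in ham for w in m.split()))

def is_spam(msg, spam_counts, ham_counts):
    """Compare log-likelihoods under each class (Laplace smoothing)."""
    vocab = set(spam_counts) | set(ham_counts)
    s_total, h_total = sum(spam_counts.values()), sum(ham_counts.values())
    s_score = h_score = 0.0
    for w in msg.split():
        s_score += math.log((spam_counts[w] + 1) / (s_total + len(vocab)))
        h_score += math.log((ham_counts[w] + 1) / (h_total + len(vocab)))
    return s_score > h_score

spam = ["win free money now", "free offer win big"]
ham = ["lunch meeting moved to noon", "see notes from the meeting"]
s, h = train(spam, ham)
print(is_spam("free money offer", s, h))   # True
print(is_spam("meeting notes", s, h))      # False
```

The same machinery – count words per category, compare likelihoods – underpins much of what the text-analytics industry now sells under fancier names.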

Text analytics really became popular in finance. Investment banks and hedge funds use it today to scour public filings, corporate press releases, and statements by executives to find slight changes in language that might indicate whether a company’s stock price is likely to go up or down. Goldman Sachs was the first to develop a natural-language-processing text analytics tool for finance. Specialty-research firms use artificial intelligence algorithms to derive insights from earnings-call transcripts, broker research, and news stories.

Does this finance text analytics work? Absolutely. The most well-known case is NetApp, a data-management firm in Silicon Valley. NetApp’s 2010 annual report stated: “The failure to comply with U.S. government regulatory requirements could subject us to fines and other penalties.” Addressing the same concern in the 2011 report, the company clarified that “failure to comply” applied “to us or our reseller partners.” Even the 288 savvy “human” stock analysts who followed NetApp missed that phrase, but the researchers’ algorithms set off an alarm. Embedded in that small edit was an early warning. Six months after the 2011 report appeared, news broke that the Syrian government had secretly purchased NetApp equipment through an Italian reseller and used that equipment to spy on its citizens. By then, NetApp’s stock price had already dropped 20 percent.
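The kind of year-over-year language change those algorithms caught can be demonstrated with a simple word-level diff. The two risk-factor sentences below are reconstructed from the phrases quoted above, so treat them as illustrative rather than verbatim filings:

```python
import difflib

risk_2010 = ("The failure to comply with U.S. government regulatory "
             "requirements could subject us to fines and other penalties.")
risk_2011 = ("The failure to comply with U.S. government regulatory "
             "requirements could subject us or our reseller partners "
             "to fines and other penalties.")

# Word-level diff: surface only what was added year over year.
diff = difflib.ndiff(risk_2010.split(), risk_2011.split())
added = [tok[2:] for tok in diff if tok.startswith("+ ")]
print(" ".join(added))  # or our reseller partners
```

Production systems run this kind of comparison across thousands of filings at once and weight the flagged changes, but the core operation – diff this year’s boilerplate against last year’s and alert on what moved – is exactly this simple.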

Its scariest use is the one we see in the media: the language analysis tools used in human-resources departments. HR teams have their own, old-fashioned ways of keeping tabs on employee morale, but people aren’t necessarily honest when asked about their work, even in anonymous surveys. Our grammar, syntax, and word choices might betray more about how we really feel.

So, you have something like Vibe, a program that searches through keywords and emoji in messages sent on Slack, the workplace-communication app. The algorithm reports in real time on whether a team is feeling disappointed, disapproving, happy, irritated, or stressed.
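Vibe’s internals aren’t public, but the basic counting idea looks something like this toy sketch – the keyword/emoji-to-mood mappings here are invented for illustration, not Vibe’s actual lexicon:

```python
from collections import Counter

# Illustrative mood lexicon only -- a real product's lists are proprietary.
MOOD_SIGNALS = {
    "😀": "happy", "🎉": "happy", "great": "happy",
    "😡": "irritated", "annoying": "irritated",
    "😞": "disappointed", "missed": "disappointed",
    "deadline": "stressed", "swamped": "stressed",
}

def team_mood(messages):
    """Tally mood signals across a channel and report the dominant one."""
    tally = Counter()
    for msg in messages:
        for token in msg.lower().split():
            if token in MOOD_SIGNALS:
                tally[MOOD_SIGNALS[token]] += 1
    return tally.most_common(1)[0][0] if tally else "neutral"

channel = ["swamped again 😞", "another deadline moved up", "so swamped this week"]
print(team_mood(channel))  # stressed
```

Run continuously over a Slack channel, a tally like this is what produces the real-time “disappointed / irritated / stressed” readouts described above.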

OK, keeping tabs on employee happiness is crucial to running a successful business, but I think counting emoji is unlikely to prevent the next Enron. Then again … KeenCorp did manage to uncover malfeasance through text analysis.
