How 160,000 intercepted communications led to our latest NSA story

http://www.washingtonpost.com/world/national-security/your-questions-answered-about-the-posts-recent-investigation-of-nsa-surveillance/2014/07/11/43d743e6-0908-11e4-8a6a-19355c7e870a_story.html?wprss=rss_national-security

Version 0 of 1.

Last weekend, The Washington Post published a story I wrote with Julie Tate and Ashkan Soltani about National Security Agency surveillance that sweeps in the conversations of people who are not foreign “targets.” The story, based on 160,000 intercepted communications I received from former NSA contractor Edward Snowden, has provoked a lot of questions, objections, and, I think, misunderstandings.

Some readers and commentators have described the story as an overheated statement of the obvious: that surveillance of one person includes the content of people who talk to him. Others have said The Washington Post, not the government, invaded the privacy of innocents because we published their conversations and the NSA did not. In the view of some critics, we displayed ignorance of NSA systems or knowingly chose to distort the way they work.

(Transcript: Q&A with Barton Gellman)

NSA surveillance is a complex subject — legally, technically and operationally. We drafted the story carefully and stand by all of it. I want to unpack some of the main points and controversies, sprinkling in new material for context. In this format, I can offer more technical detail about the data set that Snowden provided and the methods we used to analyze it. I will also address some ethical and national security issues we faced. Along the way, I will explain why our story actually understated its findings, clear up speculation about spying on President Obama and fact-check a recent CIA tweet about lost passwords.

Let’s begin with a close look at our lead:

Ordinary Internet users, American and non-American alike, far outnumber legally targeted foreigners in the communications intercepted by the National Security Agency from U.S. digital networks, according to a four-month investigation by The Washington Post.

Picture a big pile of conversations intercepted by the NSA. In it are the text of chats and e-mails along with photos and other kinds of files that somebody has sent to somebody else. We counted all the people who took part in those communications (or more precisely, the number of unique online accounts) and compared that figure with the number the NSA was aiming for.

Most of the accounts we found in the pile were not NSA targets and would not have qualified legally as such. Some commentators have said that this is unsurprising and unremarkable. I will come back to that.

Next we put a number on it:

Nine of 10 account holders found in a large cache of intercepted conversations, which former NSA contractor Edward Snowden provided in full to The Post, were not the intended surveillance targets but were caught in a net the agency had cast for somebody else.

That figure is actually far too low, but it was the only one we could measure with any precision. A graphic by Todd Lindeman broke it down. We found about 11,400 unique online accounts. Among them, about 1,200 were designated by the NSA as foreign targets. The remaining 10,000-plus were akin to digital bystanders. Some of them knew the NSA targets and conversed with them. Others fell into the pile by joining a chat room, regardless of subject, or using an online service hosted on a server that a target used for something else entirely.

We did not have an official NSA list of targets. We had to find them in the pile ourselves. Soltani, an independent researcher, did most of the heavy lifting on that. Because the information was not laid out in rows and columns, the way it might be in a spreadsheet, Soltani wrote computer code to extract what we were looking for from something like a quarter-million pages of unstructured text.

Some of our questions could not be answered with the data we had. For that reason, our story did not say what some commentators have imputed to it.

These are fine distinctions, but they are important because we reported only what we could count. We did not say that the NSA intercepted a larger number of conversations or a higher volume of content belonging to bystanders than targets. We said there were more participants (unique online accounts) in those conversations who were not targets than participants who were.

We also did not say that there are more Americans than foreign targets in the pile. We suspect that proposition may be true, but we could not establish it reliably.

Here, from the third paragraph, are some of the things we could count:

Nearly half of the surveillance files, a strikingly high proportion, contained names, e-mail addresses or other details that the NSA marked as belonging to U.S. citizens or residents. NSA analysts masked, or “minimized,” more than 65,000 such references to protect Americans’ privacy, but The Post found nearly 900 additional e-mail addresses, unmasked in the files, that could be strongly linked to U.S. citizens or U.S. residents.

Those are three separate and meaningful measurements.

1. Americans — talking, being talked to or talked about — were identifiable in close to half the files that held intercepted conversations. That was a result we did not expect from surveillance directed at foreigners located overseas.

2. The NSA ingested so much content as it spied on 1,250 foreigners that it had to black out 65,000 references to U.S. citizens and green-card holders. That figure does not include U.S. companies, which are also “U.S. persons” under surveillance law.

3. NSA analysts left a substantial number of U.S. e-mail addresses unmasked. By digging into public and commercially available data, Soltani and Washington Post researchers Julie Tate and Jennifer Jenkins linked about 900 of the captured accounts to U.S. identities. Their sources drew upon standard Internet searches, account registration records, U.S. postal address changes, product marketing databases, court filings and voter registration rolls. The quality of that data is imperfect, but it is likely to be accurate in most cases.

The terms and rules of minimization are opaque, and they have often been used to misdirect public debate. The NSA is forbidden to “target” American citizens, green-card holders or companies for surveillance without an individual warrant from a judge. If it does target Americans “inadvertently” — believing them to be foreign, then discovering otherwise — the NSA normally discards their conversations.

All that is good for privacy, but it has little to do with the way Americans are actually captured by NSA collection systems. U.S. intelligence services routinely use collection methods against foreigners that foreseeably — with certainty — ingest high volumes of U.S. communications as well.

That is called “incidental collection.” The NSA does not discard those U.S. conversations. It stores them, with names uncensored, in a repository called PINWALE and other central databases. No law forbids the NSA to search within that content for U.S. names and other identifiers, and it does so. The CIA does so as well, and the FBI reported recently that it searches the data so routinely that it cannot provide a count. Minimization rules place conditions on those searches and limit, but do not forbid, the distribution of U.S. identities in reports to other agencies.

There is no way to prevent incidental collection, but policy choices decide how much of it will happen and what the NSA and other agencies are allowed to do with its fruits.

In a little-noticed passage of its report, the president’s Review Group on Intelligence and Communications Technologies urged late last year (Recommendation 12, p. 28) that incidentally acquired information about Americans “should be purged upon detection” unless it supplies valuable foreign intelligence or warns of “serious harm to others.” Most of what the NSA keeps now would probably have to be discarded under that standard. The president and his staff set it aside without public comment.

Until now it has not been possible to debate incidental collection in concrete terms. We did not know how much of it happened or the nature of the private content collected. The NSA answers no questions in public about those things. The Office of the Director of National Intelligence asserts that it is unable even to estimate how many Americans are affected. And no outside watchdog — including Congress, the courts, the Privacy and Civil Liberties Oversight Board or the Review Group on Intelligence and Communications Technologies — has had access to enough intercepted content to judge for itself.

Some intelligence veterans argued this week that our story hyped unsurprising facts. Former NSA general counsel Stewart Baker wrote (on The Post Web site) that surveillance of a target obviously acquires the communications of other people. (“Social network researchers everywhere: Well duh,” computer scientist Robert Olson tweeted.)

If that is all The Post were saying, according to Baker:

. . . the inherent bias in the measure is such that it demands an acknowledgment . (After all, it allows you to say ‘half of all account holders in the database weren’t the target’ if the agency stores just a single message sent to the target.) This is something that any halfway sentient editor should have recognized.

As I noted above, we agreed that incidental collection, in the abstract, was not news. Near the top of our story we said it “is inevitable in many forms of surveillance.”

The scale of that collection and the intimate secrets it reveals may not surprise intelligence savants, who understand the collateral effects of surveillance and take the intrusiveness for granted. It is, however, surprising — and, based on reader reactions, disturbing — to a lot of people who relied on public assurances that the NSA focuses tightly on foreign targets and cannot read U.S. e-mails without a warrant.

Here is the way we framed that question:

The surveillance files highlight a policy dilemma that has been aired only abstractly in public. There are discoveries of considerable intelligence value in the intercepted messages — and collateral harm to privacy on a scale that the Obama administration has not been willing to address.

Marc Ambinder, a journalist who has written a good deal about surveillance, offered a more detailed critique. It deserves a somewhat longer reply because it has been widely cited. Ambinder based his conclusion that our story was “a bust” on incorrect assumptions about our data set and erroneous descriptions of the systems the NSA uses to intercept and process communications.

Under Section 702 of the amended Foreign Intelligence Surveillance Act, Ambinder writes, NSA domestic operations begin with a court-certified “class of targets — like ‘Russian government officials living in Utah.’ ” Actually, the target classes certified by the FISA court are far broader (Russia, as a whole, is one of 193 certified countries of interest) and the court is not informed of the specific targets the NSA selects from a certified class. This gives the agency far more latitude for surveillance than Ambinder suggests.

Next, Ambinder writes, the “NSA tries to eliminate as much [as possible] of the targets’ e-mails and chats to people inside the United States automatically.” That is incorrect. There are systems in place that attempt to “defeat,” or filter out, conversations that are solely domestic or solely among Americans. But the NSA is under no legal obligation, and in practice it does not attempt, to filter out U.S. citizens or residents who communicate with a foreign target.

These two errors bring Ambinder to his principal argument, which is that the high proportion of incidental collection and the unmasked U.S. identities we found result from technical limits of the “automated minimization system.” But that is not a problem, he writes, because the defects are cured by hand later in the process. NSA analysts are required only “to minimize every U.S.-person communication that they see,” he writes, and our story was based on intercepted content that analysts had not yet examined.

The communication simply wasn’t looked at. No human being saw it. The Post’s reporters looked at every single line of 160,000 intercepts. The NSA analysts don’t do that/can’t do that because the SIGINT system would not function for a second if they did.

That, too, is wrong. Everything in the sample we analyzed had been evaluated by NSA analysts in Hawaii, pulled from the agency’s central repositories and minimized by hand after automated efforts to screen out U.S. identities. I describe the data more fully near the end of this post.

Had our sample not been evaluated, far more than 90 percent of the people in it would have been non-targets. Had it not been minimized, we would have found far more Americans than we identified on our own.

In the figures we reported, we included every unmasked online account. We did not include the “minimized” accounts because we had no way to know how many were unique.

For example, we could count 2,721 occurrences of the term “minimized U.S. person,” 5,060 of “minimized U.S. user name” and 57,331 of “minimized US IP address.” (There are just over 1,000 additional categories of minimized content.) But in theory, we cannot rule out that all those terms correspond to a single person — a Zelig-like figure whose conversations somehow spanned a universe of 11,000 accounts. In reality, it is likely that the masked U.S. identities number in the hundreds or thousands.

We included none of them in our statistics, because we chose not to impute a number that we could not count. Among the accounts we could identify with confidence, 900 belonged to Americans and 1,250 to foreign targets. If only 400 of the tens of thousands of masked U.S. identities are unique, then the database contains more Americans than lawful foreign targets.

A lot of close readers misunderstood a passage, deep in our story, that referred to President Obama. They thought it meant the NSA was intercepting his e-mail. It did not. (Spying on the president is the kind of news you can probably count on The Post to put at the top.) If I had anticipated that reading, I would have written the following paragraphs differently:

More than 1,000 distinct “minimization” terms appear in the files, attempting to mask the identities of “possible,” “potential” and “probable” U.S. persons, along with the names of U.S. beverage companies, universities, fast-food chains and Web-mail hosts.

Some of them border on the absurd, using titles that could apply to only one man. A “minimized U.S. president-elect” begins to appear in the files in early 2009, and references to the current “minimized U.S. president’ appear 1,227 times in the following four years.

None of these were conversations in which Obama took part. We checked carefully. The statistics refer, instead, to conversations in which someone else mentioned the president’s name. None of them involved inside information.

In one intercepted conversation, somebody tells a joke that begins: [MINIMIZED US PERSON] & [MINIMIZED US PRESIDENT] walk into a bar. The punch line finds its way to genocide. It is not a friendly joke. In another exchange, somebody makes fun of an acquaintance by saying his advice about women is like advice about Islam from [MINIMIZED FORMER US PRESIDENT].

Some misunderstandings are hard to cure. I noted on Twitter on Sunday and Monday that Obama’s conversations were not intercepted. Several of those who replied were not inclined to believe it.

Many people have asked, since the story was published, whether we found conversations intercepted from other elected officials, judges, journalists or nongovernmental organizations. We did not. The files include minimized references to one senator, one member of Congress, three judges, three U.S. “broadcasters” and several NGOs. In all those cases, the subjects were mentioned by other people in conversations about public events.

Our reference to Obama was meant to make another point. We contrasted the NSA’s “scrupulous care” on minimization, in many contexts, with policies that allow an analyst to rely on dubious evidence as the basis for judging a target to be ineligible for that privacy protection. We found many cases in which analysts based a “reasonable belief of foreignness” on the fact that the target was speaking a foreign language or logging on from an IP address that appeared to be overseas. Those criteria would apply to tens of millions of Americans.

The CIA opened a Twitter account last month and has used cheeky humor to win a large following in a short time. On Monday, the account sent out this announcement: “No, we don’t know your password, so we can’t send it to you.” It went viral, with more than 12,000 retweets.

As it happens, the NSA files we examined included 1,152 “minimized U.S. passwords,” meaning passwords to American e-mail and chat accounts intercepted from U.S. data links. Don’t expect tech support from Langley, but the CIA does have access to that raw traffic.

Stewart Baker’s critique of our story made a second point I did not mention above:

The story is built around the implied claim that 90% of NSA intercept data is about innocent people. I think the statistic is a phony.

That is not what the story said or what it meant. We did not try to measure guilt or virtue. For large volumes of intercepted content, the defining quality is intimacy, not innocence.

Baker made his own inbox sound rather boring, filled with routine business and “one-off messages that I can handle with a short reply (or by ignoring the message).” As it happens, e-mail does not make up the bulk of what the NSA intercepts. Far more of the content comes from live chat, a young person’s medium that is filled with the preoccupations of the young.

Among the large majority of people who are not NSA targets, many of the conversations in our sample are exceedingly private. Often they are very far from publishable, without editing.

Him: “How about you [verb, possessive adjective, noun]

Her: “I [verb] if you [another verb].”

Him: “That can be arranged.”

Her: “I really need punishment.”

Another young woman, also not a target, responds to a suitor who proposes to pay a visit.

Her: “don’t think that would b fair on the guy im seeing”

Him: “you can be a bit naughty at times lol”

Her: “Yeah lol”

The conversation proceeds from there. Does it matter to the woman or her boyfriend that the NSA recorded her slide toward infidelity if neither of them knows it? (She is an Australian citizen, whose identity is supposed to be minimized with the same care due to an American, but her name and photographs are unmasked.)

Does it matter to a son that his father’s medical records, or to a mother that her baby’s bath pictures, are in NSA stores?

Early in the Snowden debate, House Intelligence Committee Chairman Mike Rogers said in a hearing that “the fact that we haven’t had any complaints come forward with any specificity arguing that their privacy has been violated, clearly indicates” the system is working.

“But who would be complaining?” asked the witness, American University law professor Stephen Vladeck.

“Somebody who’s privacy was violated,” Rogers replied. “You can’t have your privacy violated if you don’t know your privacy is violated.”

Vladeck disagreed sharply with that statement. NSA rules and procedures, he said, cannot be judged without an objective look at what it does with its authority. That is the debate our story was meant to inform.

In framing our story, we faced a paradox: How do we report on harms to privacy without compounding them? Some readers were disturbed by our quotation of private correspondence — and even of our decision to read it.

Ben Wittes, writing on Lawfare, describes Snowden’s transfer of NSA content to me this way:

The contractor gives a cache of 160,000 such conversations — some of them very lengthy — to a third party. He does so apparently indiscriminately, and he leaves to nothing more than trust that the recipient will use the material responsibly. The third party then proceeds to publish passages . . . from the correspondence of a private individual, written to a boyfriend about their apparent affair — a private individual who has been accused of no wrongdoing. . . . If the contractor in question were anyone other than Edward Snowden, we would immediately recognize this disclosure for what it is: a massive civil liberties violation of precisely the type we put intelligence under the rule of law to try to prevent.

We recognize a dilemma here, but we do not think the answer is obvious. There was an important story to tell about surveillance and privacy. We did not believe we could tell it with broad allusions to unspecified personal content in the NSA’s intercepted files. We also believed we had to give weight to the privacy and national security implications of quoting them.

Wittes writes, in reference to the woman we quoted, that although we “delicately kept her name out of the story, her whole social world will know who she is.” That is speculation. The woman tells me otherwise.

We decided from the start that we would not quote from any conversation without the speaker’s consent. The Australian woman gave us that, provided that we left out her name and other details she specified. Afterward, she wrote to praise a “fantastic article” and said her employer and friends, other than those who knew the story already, had not connected it to her.

“Thanks very much,” she wrote. “I appreciate your efforts for anonymity.”

The one example aside, Wittes makes a broader attack on “Snowden — in the unfettered exercise of his unlimited discretion — choosing Gellman as the sole check and balance on the disclosure of personal data — Gellman who, unlike NSA, has no statutory standard to live up to and no oversight from Congress or the courts.”

It is true that, with a few exceptions such as libel, the government does not lay down standards of publication or compel me to follow them. That is a fairly basic feature of our constitutional system. The way I make use of that freedom, and the choices The Post made for this story, are fair game for anyone to judge. We are comfortable with our choices and the way we made them.

Asking consent before quotation was not our only, or even our first, consideration. We recognized early on that there were national security risks in the mere act of alerting someone that her conversations had been intercepted. We did independent reporting to establish, before I called her, that the Australian woman’s ex-boyfriend was no longer under surveillance and no longer regarded by U.S. intelligence as a threat.

Even when we left out names, we did not feel free to quote intercepted conversations without careful thought. Distinctive language might be recognized by a surveillance target and, likewise, allusions to embarrassing secrets when read by someone close to the person quoted.

As our story stated, we saw for ourselves in the Snowden sample that surveillance under Section 702 has produced a great deal of valuable intelligence. If we told a target directly or indirectly that he was under the NSA’s microscope, we would jeopardize that.

When we looked for examples we could cite, we began by checking whether a surveillance target was still alive and at large. By independent reporting, we identified four who were in custody. We brought those names to the NSA and CIA. Intelligence officials gave us concrete and persuasive reasons, off the record, why any mention of two of them would derail ongoing operations. We left them out and cited the other two — Muhammad Tahir Shahzad, a Pakistan-based bomb builder, and Umar Patek, a suspect in a 2002 terrorist bombing on the Indonesian island of Bali — in our story.

There are risks to privacy, as some critics have noted, in keeping copies of the intercepted files. There are comparable national security risks if someone steals the archive. We have taken significant measures, with advice from leading experts, to keep the material as safe as we can from outsiders. No Washington Post employee has unchecked access, and very few have any access at all. Destroying the files now would be the surest way to ensure they are not breached. It would raise legal questions and halt our work on a story of ongoing global import. We have made no decision for the long term.

There were 22,000 electronic files in the data set we analyzed, containing content intercepted by the NSA between 2009 and 2012. They came from a repository hosted at the NSA’s Kunia regional facility in Hawaii, which was shared by a group of analysts who specialize in Southeast Asian threats and targets.

That Hawaii database was, in essence, curated by members of the group. They drew on a much larger store of “raw,” or unprocessed, content hosted at NSA headquarters and imported selections from it into templates for evaluated material. Special access controls protected the files in both locations because the communications were obtained from network switches and computer servers in the United States. Until 2008, that kind of collection required an individual warrant from a judge. FISA Section 702 allowed the NSA to select tens of thousands of targets on its own under rules and procedures reviewed by the court once a year.

Because our sample had been hand-selected by analysts for the Hawaii database, there was a lot less irrelevant content and “incidentally collected” U.S. communications than an auditor would find in the central PINWALE database from which it was drawn.

About 16,000 of the data files contained the text of intercepted conversations. The rest were photographs or documents such as medical records, travel vouchers, school transcripts and marriage contracts. We converted any text inside the image files to machine-readable form.

Some files had only a single e-mail or instant message exchange. Others included many separate conversations, with many participants. Still others had long, unbroken chat transcripts that extended over several days and hundreds of pages.

In order to analyze the files, Soltani ingested them all into a database. We could then search for quantifiable information with geek tools such as Unix regular expressions and SQL, or structured query language.

We wanted to know, for example, how many distinct conversations there were in the files. Soltani tried several methods to find the boundaries in each document file. He described the data as “dirty,” with typographical errors and inconsistencies in the use of formatting and official templates. Soltani corrected for those errors by using multiple criteria in his searches, such as the first-occurring PINWALE identifier in a header. Comparing them brought us to the published figure of 160,000 conversations.

We also wanted to count how many participants there were in all those conversations. Only a minority of them were identified by name, so we searched by online account identifiers. Those included e-mail addresses, user names on instant messaging services, “handles” on Internet Relay Chat and Facebook identification numbers. (The Washington Post Facebook ID is 6250307292. You can find yours here by putting your user name into the URL.)

Soltani did most of the analysis, but he taught me to make my own queries. E-mail addresses, to take a very simple example, are always composed of a permissible range of characters before and after the @ sign, with a dot in the second half. That query found 12,310 hits. After cleaning up false positives and adding chat handles and Facebook IDs, we reached the published figure of about 11,400 unique accounts.

We had to use more complex methods to identify which of those accounts were NSA targets. We compared several approaches, which produced similar but not identical results. After investigating why they differed, we judged that a count of unique “case notations,” or CASNs, was most reliable.

A case notation looks like this: P2BSQC090008441. A year ago, we published a handy slide to decode it.

The characters SQC stand for the PRISM program, which collects the contents of online accounts from nine large U.S. Internet companies. P2 identifies the target as a Yahoo account, B says it is a chat account and the rest identifies the year surveillance began (2009) and the target’s unique serial number.

Collection from network switches, which the NSA calls Upstream, use case notations that begin with XX.SQF. Those are also called “FBI FISA” collection, managed by the bureau and shared with NSA. Upstream is most commonly used for more ephemeral forms of chat that are not easily obtainable from Internet company servers.

The total number of targets, counting by CASN, came out to 1,257. We made a gut-check of the number — did it make sense? — by reading the contents of a large sample of their conversations.

Julie Tate and Jennifer Jenkins expended prodigious labor determining the names of the account-holders and researching their public records. In nearly every case, the reasons for the NSA’s interest was apparent. Among more than 10,000 non-targeted accounts, the communications reflected a normal range of human interaction.

Because of the changes that Congress made in Section 702, the Privacy and Civil Liberties Oversight Board reported that the volume of untargeted collection — and incidental U.S. content within it — have grown exponentially.

The board split on whether the government should be obliged to obtain a warrant to search and make use of those intercepted U.S. conversations. (No warrant is needed now.) The president’s Review Group went further, recommending that the NSA discard the U.S. content in most circumstances.

The Obama administration has addressed neither of those recommendations. Our story added information that could be found nowhere else about the competing interests at stake.