Yet another blog for spewing. This one may end up with a lot of religious and social content.

2013-06-08

NSA, Web Companies and the NY Times

I'll put it bluntly: The whole flap about the NSA tapping all the data on the web does not pass the bullshit meter. It's not fucking possible.

Why do I say this? I've worked for companies that handle large, and I mean large, amounts of user data, metadata, click-throughs, etc. It's not an easy thing to capture, route or store all those lines of logging. Most companies do it by sampling - they don't log every click, because they can't write it out fast enough!

One set of systems I worked on would fill up each machine's log, even at a 3% sampling rate, in less than 12 hours. Each box only kept 10 logs back - a maximum of about 5 days - and there were hundreds of just this one type of machine. During slow periods, a process would come along and collect those logs, because they recorded ads shown and ads clicked - but again, only 3% of them. That collection was slow, and the place it wrote to usually ran out of storage until the logs were processed (reduced to abstractions and summaries) by a big Hadoop cluster. That alone took hours.
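To give a feel for what that sampling and rotation look like, here's a minimal sketch in Python - hypothetical paths, rates and sizes, not the actual system I worked on:

```python
import logging
import random
from logging.handlers import RotatingFileHandler

SAMPLE_RATE = 0.03  # only 3% of ad events ever get written

# Roughly the setup described above: a handful of rotated files per box,
# overwritten within days. Path and sizes are made up for illustration.
handler = RotatingFileHandler(
    "/var/log/adserver/events.log",
    maxBytes=512 * 1024 * 1024,  # rotate at ~512 MB
    backupCount=10,              # keep 10 files back, then overwrite
)
logger = logging.getLogger("ad_events")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def log_ad_event(event_line: str) -> None:
    """Write the event only if it falls inside the 3% sample."""
    if random.random() < SAMPLE_RATE:
        logger.info(event_line)
```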

This was just ad data - not browser metadata - sampled at only 3%, and it still required a lot of processing. Companies do it because that's how they get paid by advertisers. Once the data is processed, the raw logs are discarded to make room for more.

The NYT and Guardian act like the NSA just comes in, hooks up a cable to a server, and starts sucking data for free. It doesn't work that way. Phone logs are different: they're just the to and from phone numbers plus simple cellular routing, and the system is designed so each *phone's* activity gets recorded and the home office grabs it as needed for billing.

Web providers can't do that - each page view is like a minute of a phone call, except every one is different. Sure, they can capture the incoming IP, the referer, and what page got rendered. But even the large companies can't store that as raw data for long. There's too much. Add in mail and chat, and it's an avalanche.
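To put rough numbers on "there's too much", here's a back-of-envelope sketch, using assumed round figures rather than any company's real traffic:

```python
# Back-of-envelope: raw pageview logs for one large web property.
# All inputs are assumed round numbers, not anyone's real traffic.
pageviews_per_day = 10_000_000_000   # 10 billion pageviews/day (assumed)
bytes_per_record = 1_000             # IP, referer, URL, user agent, timestamp (assumed)

raw_bytes_per_day = pageviews_per_day * bytes_per_record
raw_tb_per_day = raw_bytes_per_day / 1e12

print(f"~{raw_tb_per_day:.0f} TB of raw pageview logs per day")                  # ~10 TB/day
print(f"~{raw_tb_per_day * 365 / 1000:.1f} PB per year, before mail and chat")   # ~3.7 PB/year
```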

That's actually why they spend so much processing and compute power on personalization - they want to tailor things to you. But they can't keep the specifics for long; they have to distill them down into a bunch of numbers and keywords.
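Here's a hedged sketch of what that distillation might look like - the event shape and field names are hypothetical, just to show raw specifics collapsing into a small profile:

```python
from collections import Counter

def distill_profile(raw_events, top_n=20):
    """Boil a pile of raw pageview events down to a small keyword profile.

    raw_events: iterable of dicts with a "keywords" list (hypothetical shape).
    Returns the top-N keyword counts - a few hundred bytes that can be kept
    around long after the raw events themselves are thrown away.
    """
    counts = Counter()
    for event in raw_events:
        counts.update(event.get("keywords", []))
    return dict(counts.most_common(top_n))

# Example: two raw events collapse into {"cycling": 2, "cameras": 1}
events = [{"keywords": ["cycling", "cameras"]}, {"keywords": ["cycling"]}]
print(distill_profile(events))
```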

As far as the big web companies just *giving* this raw big data to the government? No, for a lot of reasons:
  1. It is proprietary - it is what differentiates them from their competition.
  2. It's expensive to store, process and transport the mass of raw data the NSA is supposedly getting - expensive enough to be a significant line item in a company's budget, and waaaaay more than $20 million a year for even one company.
  3. You can't gather and shuffle this much stuff around without the rank and file knowing. Even "backdoors" are obvious to any sysadmin or programmer who has to deal with the code. 
  4. There are tens of thousands of servers all over the world at companies like Google and Facebook.  You can't just connect up to them and start pulling all the user metadata without screwing up the software or the network. Too much load, too much bandwidth used.
Grabbing the data for a specific FISA request is a little easier - it's small enough to sift out of the river of data flowing by without stopping up the pipeline or impacting the sites. Companies scrutinize these requests and interpret them fairly narrowly: if their customers didn't trust them not to hand out even metadata indiscriminately, they wouldn't stay. Customer trust is a big thing in the web business - ask Facebook, which loses customers over how much it shares with advertisers without consent.
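To make the "sift it out of the river" point concrete: a narrow request is basically a stream filter. A toy sketch, with hypothetical field names, of why filtering for one identifier is cheap compared to keeping everything:

```python
import json

def matching_events(log_lines, target_user_id):
    """Yield only the events that match one specific identifier.

    The stream flows past once and nothing else is retained, so the cost
    is a string comparison per line - not petabytes of stored raw logs.
    """
    for line in log_lines:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # skip malformed lines instead of stalling the pipeline
        if event.get("user_id") == target_user_id:
            yield event
```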

These theoretical backdoors would be a security risk for any web site, and sites have multiple layers of security, so it probably wouldn't work anyway. A small security hole in one application isn't enough to give the NSA access to the logs, which is where the user-identifiable metadata lives. You would need separate access to each server, and believe me, I'd know if there were log-sucking processes feeding outside the network anywhere I've worked.

Even tapping the user input stream and duplicating it for internal load testing is a non-trivial problem, and that's only turned on for a few machines, not thousands.

Yes, in theory, it could be done, if the NSA had a dedicated, high capacity storage server bank and a high bandwidth, dedicated network pipe to each and every data center of every company. Such a thing would cost billions, yes, with a "B", and would not be even close to a secret.
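For a sense of scale, a back-of-envelope sketch of the sustained bandwidth just one large property would need to mirror its raw logs off-site, reusing the assumed ~10 TB/day figure from earlier:

```python
# Back-of-envelope: sustained bandwidth to forward raw logs off-site,
# reusing the assumed ~10 TB/day per property from the earlier sketch.
raw_bytes_per_day = 10e12     # ~10 TB/day (assumed)
seconds_per_day = 86_400

sustained_gbps = raw_bytes_per_day * 8 / seconds_per_day / 1e9
print(f"~{sustained_gbps:.1f} Gbps sustained, per property")  # roughly 1 Gbps

# Multiply that by dozens of data centers and dozens of companies, plus the
# storage to land it all, and "billions, with a B" stops sounding far-fetched.
```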

Now, some people might say, "What about a man in the middle?" Well, that might be possible for unencrypted connections, but it would have to be done at the ISP level, or at the edge of each web company's network, and it would still be too expensive and too obvious.

Again, there are companies that independently analyze web traffic. They need an extra application layer to do it, and even then they still run into the big-data and sampling problems. They also aren't cheap.

So no, the whole "The NSA is spying on the web!!!111!!!!!" thing just doesn't pass my bullshit detector.

I work with these huge server farms. Even with high-speed networks and huge filers, it is time-consuming, IO-intensive and expensive to shuffle that kind of log data around. Yes, some companies do it, with a percentage of their logs, and they spend a big wad of cash to do it - and then they discard the raw data so they can process more. They don't store and forward it all to the government; the bandwidth, personnel, and storage space just aren't there to do that for free, and the NSA's budget isn't big enough to get even the big companies set up.

FISA requests about specific stuff? Yeah, that's easier to pull out - but only going forward, in real time, not for data from weeks ago. It still costs overhead and bandwidth, and if not done carefully it could cause service outages. So companies deliver only the minimum required, to minimize the impact on their business. Plus, they really, really hate the gag order that comes with FISA requests, so they're not inclined to hand over any more than the absolute minimum.

The NYT and Guardian claims are outrageous, and don't pass the practicality test.  Sorry to piss on your outrage. The secret FISA courts and Patriot Act crap are bad enough, and you shouldn't let the fantastic dilute your anger at the real stuff that goes on all the time. Pay attention to the real issues - secret courts and fishing expeditions - and ignore the sensationalists who are trying to make fools of you.

2 comments:

Anonymous said...

I just got done reading your post and Karl Fogel's 'epic botch' post on rants.org. I get that you have an intrinsic feel for how very much data this represents and how difficult it would be just to collect this amount of data. This has made you skeptical that such a thing CAN be done at all. Now I want to ask you to set aside that disbelief. IF such a thing were in fact possible, what would that mean for your estimation of the NSA's capabilities and their competence to proceed with these investigations as they appear to be documented? Would this revelation change your view of the constitutionally limited searches within which these agencies are intended to function? This is kind of the 'lottery' question - what would you do if money were no object? Any thoughts?

Ravan Asteris said...

If it could be done, it would be horribly expensive, a waste of money, and a huge burden - either on the government budget, or a company-killing expense for the companies. Companies like Google, Amazon, Yahoo, etc. have to make money, and literally doubling the bandwidth they buy would cripple their profits.
The fact is, the more raw data you ingest, the harder it is to process. I didn't really look at what it would take to *analyze* that much data, just what it would take to collect it.
Putting on a "data analyst hat", I would rather process selected slices than the whole dump of internet logs, even if they could be collected and shuffled around.
Finding a needle in a haystack, even by the patterns around it, is easier if you don't have too big a haystack.
Does the NSA overreach? Probably. Do the phone companies give them too much? Hell yes - they dump their (pre-processed) billing data to the NSA. But the Skynet-type data density people are accusing them of receiving just isn't feasible, from either a money or a data-analysis perspective. What they get with legal (not constitutional, IMO) FISA requests is bad enough.