Why do I say this? I've worked for companies that handle large, and I mean large, amounts of user data, metadata, click-throughs, etc. It's not an easy thing to capture, route, or store all those lines of logging. Most companies do it by sampling - they don't log every click, because they can't write it out fast enough!
One set of systems I worked on would fill up each machine's log - and that was with a 3% sampling rate - in less than 12 hours. The box only kept 10 logs back, a maximum of 5 days. There were hundreds of machines of just this one type. During slow periods, a process would come along and collect these logs - because they recorded ads shown and ads clicked - but again, only 3% of the events. That collection was slow, and the storage it wrote to usually filled up until a big Hadoop cluster processed the logs (reducing them to abstractions and summaries). That took hours.
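For a sense of scale, here's a back-of-envelope sketch of that fill rate. Every number below except the 3% rate and the 10-log retention is an assumption I've made up for illustration - the event rate, line size, and rotation size are not the real figures:

```python
# Back-of-envelope sketch of why even 3% sampling fills disks fast.
# EVENTS_PER_SEC, BYTES_PER_LINE, and LOG_ROTATE_GB are assumptions.

SAMPLE_RATE = 0.03        # log 3% of events (from the setup above)
EVENTS_PER_SEC = 3200     # assumed per-machine event rate
BYTES_PER_LINE = 500      # assumed average log line size
LOG_ROTATE_GB = 2         # assumed size at which a log file rotates
LOGS_KEPT = 10            # machine keeps 10 rotated logs back

bytes_per_hour = EVENTS_PER_SEC * SAMPLE_RATE * BYTES_PER_LINE * 3600
hours_per_log = LOG_ROTATE_GB * 1e9 / bytes_per_hour
retention_hours = hours_per_log * LOGS_KEPT

print(f"one {LOG_ROTATE_GB} GB log fills in {hours_per_log:.1f} hours")
print(f"all {LOGS_KEPT} logs cycle in {retention_hours / 24:.1f} days")
```

With those made-up inputs you land right around the numbers above: a log filling in under 12 hours, and the whole retention window gone in about 5 days.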
This was just ad data - not browser metadata - sampled at only 3%, and it still required a lot of processing. Companies do it because that's how they get paid by advertisers. Once the data is processed, it's discarded to make room for more.
The NYT and Guardian act like the NSA just comes in, hooks a cable up to a server, and starts sucking data for free. It doesn't work that way. Phone logs are different: they're just the to and from phone numbers and simple cellular routing. The system is designed so the *phone network* records it, and the home office grabs it as needed for billing.
Web providers can't do that - each page view is like a minute of a phone call, but all different. Sure, they can capture the incoming IP, referer, and what page got rendered. But the large companies can't store that as raw data for long. There's too much. Add in mail and chat, and it's an avalanche.
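To put "too much" in rough numbers, here's a hedged sketch. The page-view count and entry size are assumptions of mine, not any real company's figures:

```python
# Assumed figures for how fast raw access logs pile up at a big
# web property. Purely illustrative -- not any real company's data.

PAGE_VIEWS_PER_DAY = 5e9   # assumed daily page views
BYTES_PER_ENTRY = 1000     # assumed: IP, referer, URL, user agent...

raw_tb_per_day = PAGE_VIEWS_PER_DAY * BYTES_PER_ENTRY / 1e12
print(f"~{raw_tb_per_day:.0f} TB/day of raw access logs")
print(f"~{raw_tb_per_day * 365 / 1000:.1f} PB/year, before mail and chat")
```

Petabytes a year of raw request data, under generous assumptions, and that's before you add the mail and chat avalanche.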
That's actually why they spend so much processing and compute power on personalization - they want to tailor things to you. But they can't keep the specifics for long; they have to distill them down into a bunch of numbers and keywords.
As far as the big web companies just *giving* this raw big data to the government? No, for a lot of reasons:
- It is proprietary - it is what differentiates them from their competition.
- It's expensive to store, process, and transport the mass of raw data the NSA is supposedly getting. Expensive enough to be a significant line item on a company's balance sheet. Waaaaay more expensive, for even one company, than $20 million a year.
- You can't gather and shuffle this much stuff around without the rank and file knowing. Even "backdoors" are obvious to any sysadmin or programmer who has to deal with the code.
- There are tens of thousands of servers all over the world at companies like Google and Facebook. You can't just connect up to them and start pulling all the user metadata without screwing up the software or the network. Too much load, too much bandwidth used.
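The "expensive" bullet above can be ballparked. This is a hedged sketch: the daily volume, the retention window, and the per-TB price are all my assumptions (rough 2013-era replicated-storage pricing), not figures from any article:

```python
# All inputs are assumptions -- illustrative only.
RAW_TB_PER_DAY = 50        # assumed: logs plus mail, chat, metadata
RETENTION_DAYS = 365       # assume a year of raw data kept
COST_PER_TB_MONTH = 200    # assumed $/TB-month, replicated + operated

stored_tb = RAW_TB_PER_DAY * RETENTION_DAYS
storage_only = stored_tb * COST_PER_TB_MONTH * 12
print(f"{stored_tb:,} TB held -> ${storage_only:,.0f}/year, storage alone")
# ...and that's before bandwidth and personnel, for ONE company.
```

Even with generous rounding, storage alone for one company blows past a $20-million-a-year budget line.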
These theoretical backdoors are a security risk for any web site. Sites have multiple layers of security, so a backdoor probably wouldn't work anyway. A small security hole in one application isn't enough to give the NSA access to the logs, which is where the user-identifiable metadata lives. You'd need separate access to each server, and believe me, I'd know if log-sucking processes were feeding data outside the network anywhere I've worked.
Even tapping the user input stream and duplicating it for internal load testing is a non-trivial problem, and that's only turned on for a few machines, not thousands.
Yes, in theory, it could be done, if the NSA had a dedicated, high capacity storage server bank and a high bandwidth, dedicated network pipe to each and every data center of every company. Such a thing would cost billions, yes, with a "B", and would not be even close to a secret.
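A quick sanity check on the pipe such a scheme would need. Again, the per-company volume and the provider count are assumptions of mine, purely for illustration:

```python
# Assumed inputs -- illustrative only.
RAW_TB_PER_DAY = 50    # assumed raw data per large company
PROVIDERS = 10         # assumed number of big web companies tapped

bits_per_sec = RAW_TB_PER_DAY * 1e12 * 8 / 86400
print(f"~{bits_per_sec / 1e9:.1f} Gbps sustained, per company")
print(f"~{bits_per_sec * PROVIDERS / 1e9:.0f} Gbps across all of them")
```

In 2013 terms that's multiple dedicated long-haul links running flat out, 24/7, per company - not the kind of thing that stays invisible to network engineers.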
Now, some people might say, "what about man in the middle?" Well, that might be possible for unencrypted connections, but it would have to be done at the ISP level or at the edge of each web company's network, and it would still be too expensive and too obvious.
Again, there are companies that independently analyze web traffic. They need an extra application layer to do it, and even then they still run into the same big-data and sampling problems. They aren't cheap, either.
So no, the whole "The NSA is spying on the web!!!111!!!!!" thing just doesn't pass my bullshit detector.
I work with these huge server farms. Even with high-speed networks and huge filers, it is time-consuming, IO-intensive, and expensive to shuffle that kind of log data around. Yes, some companies do it, with a percentage of their logs, and spend a big wad of cash to do it, and then discard the raw data so they can process more. They don't store and forward it all to the government - the bandwidth, personnel, and storage space just aren't there to do that for free, and the NSA's budget isn't big enough to get even the big companies set up.
FISA requests about specific stuff? Yeah, that's easier to pull out - but only going forward, not for weeks in the past. It still costs overhead and bandwidth, and if not done carefully it could cause service outages. So companies deliver only the minimum required, to minimize the impact on their business. Plus, they really, really hate the gag order that comes with FISA requests, so they're not inclined to give any more than the absolute minimum.
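The difference between a targeted pull and a full mirror is easy to see in code. A toy sketch - the log format and the `u=` field are invented for illustration, not any real system's format:

```python
# Toy illustration: a targeted, concurrent tap filters a live stream
# for one identifier and ships only the matching sliver of traffic.
# The log lines and the 'u=alice' field are made up.

def targeted_tap(log_stream, needle):
    """Yield only the lines mentioning one specific user."""
    for line in log_stream:
        if needle in line:      # tiny fraction of lines match
            yield line

sample = [
    "10.0.0.1 u=alice GET /inbox",
    "10.0.0.2 u=bob   GET /home",
    "10.0.0.3 u=alice POST /send",
]
print(list(targeted_tap(iter(sample), "u=alice")))  # 2 of 3 lines match
```

Going forward, that's cheap. Going backward is impossible once the raw logs have already been discarded.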
The NYT and Guardian claims are outrageous, and don't pass the practicality test. Sorry to piss on your outrage. The secret FISA courts and Patriot Act crap are bad enough, and you shouldn't let the fantastic dilute your anger at the real stuff that goes on all the time. Pay attention to the real issues - secret courts and fishing expeditions - and ignore the sensationalists who are trying to make fools of you.