someproteinguy wrote:
You start with a list of known leads. We were tracking terrorists before they built the database, so there should have never been a point they didn't have working intelligence to start querying from.
The assumed reason for using a database system is to not have to rely on known leads.
Quote:
It's only worthwhile to check more data points if they increase your true-positives at a rate that's acceptably faster than your false-positives.
Sure. But there's no reason to believe that adding more data to the set will increase false positives. It will, however, surface true positives that you would have missed in a smaller set.
Here's an easy way to look at it. Let's say my search criteria is "left-handed males". One dataset includes 5% of the US population, the other includes 100%. There's no reason to expect that I'm going to accidentally match people who are not male or not left-handed simply because I increased the total size of the dataset. I'm still searching for the same criteria. I'm just not missing 95% of the hits.
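A quick sketch of that point, using a made-up toy population (the field names and proportions are invented for illustration, not real data): the predicate doesn't change when the dataset grows, so every hit still satisfies the same criteria.

```python
import random

random.seed(0)

# Hypothetical toy population; sex ratio and handedness rate are made up.
population = [
    {"id": i,
     "sex": random.choice(["male", "female"]),
     "left_handed": random.random() < 0.10}
    for i in range(100_000)
]

def matches(person):
    """The search criteria: left-handed males. Unchanged by dataset size."""
    return person["sex"] == "male" and person["left_handed"]

sample = population[:5_000]   # the 5% dataset
full = population             # the 100% dataset

hits_sample = [p for p in sample if matches(p)]
hits_full = [p for p in full if matches(p)]

# Growing the dataset adds true matches we were missing; it does not
# cause non-matching records to start matching.
assert all(matches(p) for p in hits_full)
assert len(hits_full) >= len(hits_sample)
```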
Quote:
And if you include everything you risk adding in a bunch of misleading data. If you start with known "terrorists", not only do you have a better set of high-quality data, you also have a set of data points (known positives) you can use to refine the information you're looking at. You can't even hope to build a decent search in the first place unless you know what the correct answer looks like.
The same can be said of whatever method you'd use to trim your dataset down in the first place though. And in your example, it's the "unknown terrorists" we're presumably looking for, right?
Again, I'm not seeing how adding a ton of extra data to the set that doesn't match the search criteria in any way affects the accuracy of the search. If I add a million bits of information about people who are clearly not terrorists, they're still clearly not terrorists and will not match whatever search parameters I'm using to search for terrorists. However, it's quite possible that a few of those million bits of information that I would have excluded at first blush might work in concert with some other bits of information in the dataset that would show me someone I thought was clearly not a terrorist, but who actually is.
And yes, you're right that this could be a false positive, but assuming that these sorts of searches are used to initiate additional human surveillance actions, then the harm from a false positive in this search is relatively low, while the harm from failing to detect a true positive is very high. Let me point out that I'm not making any kind of moral judgement here. I'm just pointing out how using the largest dataset possible is a better way to go here. The restriction is really just about how much data you can either store yourself, or index against other available sources.
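The "work in concert" idea can be sketched as a toy scoring search. Every indicator name, weight, and threshold here is hypothetical; the point is just that records that are innocuous on any single signal can still be surfaced by the combination.

```python
# Hypothetical weak indicators: none alone flags anyone, but several
# together can cross the threshold. All names and weights are made up.
indicators = {
    "bought_chemicals": 0.4,
    "rented_van": 0.3,
    "visited_target_area": 0.4,
}

FLAG_THRESHOLD = 0.9  # require multiple signals, not just one

def score(record):
    """Sum the weights of every indicator present on the record."""
    return sum(w for name, w in indicators.items() if record.get(name))

suspect = {"bought_chemicals": True, "rented_van": True,
           "visited_target_area": True}
innocent = {"rented_van": True}  # one signal alone is not enough

assert score(suspect) >= FLAG_THRESHOLD   # flagged for human follow-up
assert score(innocent) < FLAG_THRESHOLD   # not flagged
```

Note the design choice this implies: the output is a shortlist for human follow-up, which is why a tolerable false-positive rate can be traded for fewer missed true positives.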
Which is a totally valid point, and I'm not disputing it. I was mainly responding to the idea that this would not be a good way to do searches at all. I heartily disagree. The more data, the better (well, from a "we want to find as many bad guys as possible" point of view, and not necessarily from a "we want to protect people's privacy" point of view).
Quote:
Quote:
How do you know which are the relevant parts of the data if you don't first collect it and search through it?
How do I know how to drive a car without searching a giant database? How do I know what a banana looks like without searching a giant database? How do I know my cousin Jimmy is going to Western Baptist University without checking a giant database? How do I know some guy ran over a lot of people in Nice without looking at a giant database?
I'm not sure how any of that is relevant to pattern searches for behavior that may indicate some kind of terrorist or organized criminal activity. I think that once you even begin talking about searching a database we're past the point of "things we know directly". If we knew who all the terrorists were, we wouldn't need a database in the first place, right? Seems axiomatic that the very discussion assumes that we don't know what we don't know, and are trying to search through large amounts of data for patterns that a computer search can spot, but that humans will miss.
Quote:
There's plenty of ways to gain knowledge without searching a giant database.
Again though, we're specifically not talking about those things. We're talking about things that can only be found by searching a giant database.
Quote:
Again, remember we have actionable intelligence that predates the giant database, we don't need to start over from scratch.
If said database existed solely to assist us in tracking the activities and behavior of already known "bad guys", you'd have a point. But that's not the extent of what these things are constructed for today, and certainly not the limit of what they could do tomorrow. The government wants to use these sorts of things to search through large amounts of seemingly unrelated data for patterns at a speed and accuracy that humans could not come remotely close to. By definition, we're looking for things that we can't already figure out, or else we wouldn't need the system in the first place.
Quote:
No, you just start by looking at the social media accounts of known terrorists and people they've had contact with. Again, we did know stuff prior to facebook existing, there's no reason to start over again.
Again though, you're going to miss the unknown terrorists that way. Also, as I've stated a couple times already, if that's all you're doing, you don't need a searchable database at all. You're just collecting evidence at this point, which, as you stated, we could do by hand before and can do by hand now. I just can't get past the point that the assumption behind this whole thing is to add an additional tool that could be used to spot terrorist plots that we can't detect just by following the trail of known terrorists. And to do that, you want to cast your net as wide as possible.
Quote:
Quote:
Which means you're doing the same amount of work. More really. You want to leverage the technological capability of the tools you have to the maximum degree possible. Why insist on doing a step that avoids using those tools?
Because it's already been done. Why redo work that's already been done, especially if you already have more actionable intelligence than you can follow up on?
Because the search for "known terrorists" didn't stop at some date in the past. It's ongoing. So it's not "already been done". It's being done today. It will be done tomorrow. It'll still be being done a year from now. So given that you have to spend the time and effort generating that list in the first place, why not use the best tools to do so?
Quote:
So what's the point of using a search tool if it's giving poor results even with a "complete" dataset? I mean, even when google knows what you're looking for (and it is, of course, pretty good at this) you're probably only ending up with a few good top hits. As you go down the list, page 5, page 10, page 50, you increasingly run into problems with it returning things that are in no way relevant to what you're looking for. No one is trying to find the top hit, the most obvious terrorist in the world, the idiot who spams anti-American rhetoric from his Google+ page while searching for and purchasing bomb-making materials with a credit card over the internet. Everyone already knows about that guy, and it doesn't matter much what you use to look for him; you'll find him regardless. Your top 1000 hits might well be 90% garbage, so why would anyone use that method to try and find candidates?
Exactly. But by limiting the dataset, you'd be effectively limiting the results to just the top 1000 "known terrorists and obvious associates". You basically just made my point for me. We want a tool that does allow us to look at the 100,000th guy who matches our search criteria, and then narrow our search to see if that guy still matches those parameters. Then narrow it some more. And if he's still there after we've eliminated numbers 1,001 through 99,999, we might realize we need to take a closer look at this guy. The same guy we would have completely missed doing it the old way.
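That narrowing loop is easy to sketch. The field names and filter conditions below are purely illustrative; the point is that successive filters can surface a candidate far down the original ranking, one a top-1000-only search would never reach.

```python
# A minimal sketch of iterative narrowing: start with a broad candidate
# set, then tighten the criteria until a short list remains.
def narrow(candidates, filters):
    """Apply each filter in turn, keeping only records that survive."""
    for keep in filters:
        candidates = [c for c in candidates if keep(c)]
    return candidates

# Toy candidate pool; "score" and "recent_travel" are invented fields.
candidates = [
    {"id": n, "score": n % 7, "recent_travel": n % 3 == 0}
    for n in range(1_000_000)
]

filters = [
    lambda c: c["score"] >= 6,     # strong match on the base criteria
    lambda c: c["recent_travel"],  # an additional narrowing criterion
]

shortlist = narrow(candidates, filters)

# The shortlist includes candidates far past the 1000th record.
assert any(c["id"] > 100_000 for c in shortlist)
```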
Quote:
Again, we're assuming a terrorist even has a discernible online presence that would flag him to authorities. These people aren't dumb; they know social media, phone records, etc. are all searchable by the authorities. They'll know how to stay off a "top 1,000 terrorist candidates of 2016" list. Really, how many recent terrorists can we say we were actively tracking online before they killed people? We're more likely to hear "suspect was unknown to authorities" than "we were already watching him closely."
Exactly. Because we're only looking at the obvious signs that the dumb guys (or the guys who don't care) give us. But the idea behind the kinds of mass data systems I'm talking about (again, I'm not making a moral judgement here) is that if you could include a lot more data in your set, you could find patterns of behavior that could accurately flag those "unknown subjects" before they commit their acts of terrorism or whatever. Patterns that you would otherwise miss because you're simply not collecting the data that you need to make the correlation.
Edited, Jul 20th 2016 8:08pm by gbaji