Google vs. the scientific method

Dave

Guest
The End of Theory: The Data Deluge Makes the Scientific Method Obsolete

"All models are wrong, but some are useful."

So proclaimed statistician George Box 30 years ago, and he was right. But what choice did we have? Only models, from cosmological equations to theories of human behavior, seemed to be able to consistently, if imperfectly, explain the world around us. Until now. Today companies like Google, which have grown up in an era of massively abundant data, don't have to settle for wrong models. Indeed, they don't have to settle for models at all.

Sixty years ago, digital computers made information readable. Twenty years ago, the Internet made it reachable. Ten years ago, the first search engine crawlers made it a single database. Now Google and like-minded companies are sifting through the most measured age in history, treating this massive corpus as a laboratory of the human condition. They are the children of the Petabyte Age.

The Petabyte Age is different because more is different. Kilobytes were stored on floppy disks. Megabytes were stored on hard disks. Terabytes were stored in disk arrays. Petabytes are stored in the cloud. As we moved along that progression, we went from the folder analogy to the file cabinet analogy to the library analogy to — well, at petabytes we ran out of organizational analogies.

At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order but of dimensionally agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. It forces us to view data mathematically first and establish a context for it later. For instance, Google conquered the advertising world with nothing more than applied mathematics. It didn't pretend to know anything about the culture and conventions of advertising — it just assumed that better data, with better analytical tools, would win the day. And Google was right.

Google's founding philosophy is that we don't know why this page is better than that one: If the statistics of incoming links say it is, that's good enough. No semantic or causal analysis is required. That's why Google can translate languages without actually "knowing" them (given equal corpus data, Google can translate Klingon into Farsi as easily as it can translate French into German). And why it can match ads to content without any knowledge or assumptions about the ads or the content.
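
The article leaves the mechanics implicit, but the link-statistics idea it describes is essentially PageRank, which fits in a few lines of Python. The four-page web and the 0.85 damping factor below are invented for illustration; this is a sketch of the published algorithm, not Google's production system:

```python
# A minimal sketch of link-based ranking in the spirit of PageRank.
# The tiny link graph and the parameters are illustrative, not Google's.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:  # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outgoing:
                    new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
for page, score in sorted(pagerank(web).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Nothing in the loop knows what any page is about; rank emerges purely from the structure of incoming links, which is the point.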

Speaking at the O'Reilly Emerging Technology Conference this past March, Peter Norvig, Google's research director, offered an update to George Box's maxim: "All models are wrong, and increasingly you can succeed without them."

This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.

The big target here isn't advertising, though. It's science. The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.

Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.

But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. Consider physics: Newtonian models were crude approximations of the truth (wrong at the atomic level, but still useful). A hundred years ago, statistically based quantum mechanics offered a better picture — but quantum mechanics is yet another model, and as such it, too, is flawed, no doubt a caricature of a more complex underlying reality. The reason physics has drifted into theoretical speculation about n-dimensional grand unified models over the past few decades (the "beautiful story" phase of a discipline starved of data) is that we don't know how to run the experiments that would falsify the hypotheses — the energies are too high, the accelerators too expensive, and so on.

Now biology is heading in the same direction. The models we were taught in school about "dominant" and "recessive" genes steering a strictly Mendelian process have turned out to be an even greater simplification of reality than Newton's laws. The discovery of gene-protein interactions and other aspects of epigenetics has challenged the view of DNA as destiny and even introduced evidence that environment can influence inheritable traits, something once considered a genetic impossibility.

In short, the more we learn about biology, the further we find ourselves from a model that can explain it.

There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
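
Mechanically, "letting the numbers speak" amounts to scanning for strong correlations with no prior hypothesis about which variables should relate. A toy version of that scan, on invented random data with one planted signal:

```python
# Hypothesis-free pattern hunting in miniature: test every pair of
# variables for correlation, with no model of why any pair should relate.
# The random dataset and the planted "var_echo" signal are stand-ins.

import itertools
import random
import statistics

random.seed(0)
n = 200
data = {f"var{i}": [random.gauss(0, 1) for _ in range(n)] for i in range(10)}
data["var_echo"] = [x + random.gauss(0, 0.1) for x in data["var0"]]

def pearson(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

for a, b in itertools.combinations(data, 2):
    r = pearson(data[a], data[b])
    if abs(r) > 0.5:  # a pattern, but not an explanation
        print(f"{a} ~ {b}: r = {r:.2f}")
```

At petabyte scale the same scan will also surface coincidences, which is exactly the objection the traditional method raises.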

The best practical example of this is the shotgun gene sequencing by J. Craig Venter. Enabled by high-speed sequencers and supercomputers that statistically analyze the data they produce, Venter went from sequencing individual organisms to sequencing entire ecosystems. In 2003, he started sequencing much of the ocean, retracing the voyage of Captain Cook. And in 2005 he started sequencing the air. In the process, he discovered thousands of previously unknown species of bacteria and other life-forms.

If the words "discover a new species" call to mind Darwin and drawings of finches, you may be stuck in the old way of doing science. Venter can tell you almost nothing about the species he found. He doesn't know what they look like, how they live, or much of anything else about their morphology. He doesn't even have their entire genome. All he has is a statistical blip — a unique sequence that, being unlike any other sequence in the database, must represent a new species.

This sequence may correlate with other sequences that resemble those of species we do know more about. In that case, Venter can make some guesses about the animals — that they convert sunlight into energy in a particular way, or that they descended from a common ancestor. But besides that, he has no better model of this species than Google has of your MySpace page. It's just data. By analyzing it with Google-quality computing resources, though, Venter has advanced biology more than anyone else of his generation.
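
Venter's actual pipeline is far more sophisticated, but the logic of "a unique sequence unlike any other sequence in the database" can be sketched as k-mer overlap. The sequences below are made up for illustration:

```python
# A toy illustration of flagging a "statistical blip": compare a new DNA
# read against known sequences by shared k-mers (substrings of length k).
# Real metagenomic pipelines are far more elaborate; these sequences are
# invented.

def kmers(seq, k=8):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(a, b, k=8):
    """Jaccard overlap of k-mer sets: 1.0 = same content, 0.0 = none shared."""
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / len(ka | kb)

database = {
    "known_microbe_1": "ATGGCGTACGTTAGCCGTATCGGATCCATGGCAATGCGT",
    "known_microbe_2": "TTGACCGTAGGCATCGATCGGCTAAGTCCGATGCACTGA",
}
new_read = "CCGTTAGACTGGCATTAAGCGCGTATTCCAGGTTCAAGC"

best = max(similarity(new_read, seq) for seq in database.values())
if best < 0.1:
    print("no close match in the database: candidate new species")
```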

This kind of thinking is poised to go mainstream. In February, the National Science Foundation announced the Cluster Exploratory, a program that funds research designed to run on a large-scale distributed computing platform developed by Google and IBM in conjunction with six pilot universities. The cluster will consist of 1,600 processors, several terabytes of memory, and hundreds of terabytes of storage, along with the software, including the Google File System, IBM's Tivoli Workload Scheduler, and an open source version of Google's MapReduce. Early CluE projects will include simulations of the brain and the nervous system and other biological research that lies somewhere between wetware and software.
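
The programming model at the heart of that software stack, MapReduce, is simple enough to sketch once the distributed machinery is stripped away. This single-process toy illustrates the idea only; it is not the cluster software itself:

```python
# The MapReduce model in miniature: a mapper emits key/value pairs, the
# framework groups them by key, and a reducer folds each group. A real
# deployment runs each phase across thousands of machines.

from collections import defaultdict

def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    return {key: sum(values) for key, values in grouped.items()}

docs = ["more is different", "more data beats better models"]
print(reduce_phase(shuffle(map_phase(docs))))
```

The model scales because the map and reduce phases share no state: each piece of input can be processed independently, then regrouped by key.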

Learning to use a "computer" of this scale may be challenging. But the opportunity is great: The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.

There's no reason to cling to our old ways. It's time to ask: What can science learn from Google?
 
I think that this article is saying that we have too much data to process now, and that we are no longer able to indicate or prove causation in some areas. At the same time, we are able to observe much more data and create a series of "facts" which seem useful, without having the ability to understand why.

Someone literate needs to explain that "beautiful story" reference. I know that it means that we may be reaching the limits of verifiable scientific knowledge in some fields, and that we are going back to mythology and soft "science", but I don't know where the reference comes from.
 
Interesting story, but completely appalling.

Google does not track and organize an objective reality called the Internet. Google helps give shape to the Internet, too-- it might even be one of the biggest forces in cyberspace, in fact. Removing "causation" and focusing on "correlation"-- the precedent is Machiavelli, by the way, which says a lot about this "innovation"-- elegantly removes Google itself from the equation, which is insane and irresponsible to say the least.

As a more concrete example of what I'm talking about, one application (among many, I assume) of this new "anti-model" theorizing is that Google will be able to create a "smart" web page (that is, a more advanced version of Google Desktop, which already exists). A phrase I've come across a few times now is, "We'll tell you how you're going to spend your day before you've even had a chance to plan it". This is meant in a benign, Utopian way, as in a cyberspace Jeeves faithfully serving his clients, but surely the darker side of such an application is readily apparent.

The end result of what is proposed in this article is nothing less than the institutionalized splintering of individual identity while at the same time creating easily-controlled masses of people. It's so horrifying that even calling it Orwellian is a laughable understatement.

Welcome to the future!
 
"Beautiful story" is the stage a theory reaches just after it coalesces into a lovely, unbroken, unmolested hypothesis, and just before it is blown apart by the rigors of the scientific method.
 
would a belated 4,000 posts thread cheer you up? :D

I'm going to take the other side here. Now, of course I'm as paranoid as the next guy, provided the next guy isn't wearing a tin foil hat, but...

Is it possible that there is no dark side to google?

I have read the rants, and yes, it's true that they have access to more information on many of us than even our closest loved ones. Certain people know certain things about us, but google has access to almost all of it if you have their accounts and use their services. I used to resist them, but they have such neat free stuff, and I figured that, as an individual, what I have to hide is so unimportant, and the only use I serve to them is as a collection of data that they can use to market and sell ads.

The internet itself is spooky when you look at the way some people use it. Besides banking, credit, and shopping habits, people also give enough information to create a profile that is quite Orwellian. I think that your credit card information might be the least of it really. So google is just taking what is already there and using it wisely, and I hope ethically, as they claim.

Let's forget about their deals with China, blocking certain segments of the internet from Chinese people, and the whole question of saving search histories for a minute.

As they say, "if you've nothing to hide, you've nothing to fear".

On the other hand this article and the ideas in it are new to me, and I haven't had a chance to process it yet. I have a certain affinity for the "what works is good" philosophy, and again, I feel that this may not be google's doing, but google taking a natural dilemma or fated point of progress and helping us realize it, simply because they got there first. On the other hand (the third hand), this sounds like people are going to be encouraged to ACCEPT WITHOUT QUESTION certain new thoughts, ideas, and "facts" because google, or someone in a similar position of authority, said so.

is that the scary part? because really, when google figures out what kinds of things you like, by scanning your gmails, and then offers you super discounts on it, it's really sort of cool. ;)
 
I'm not sure who coined the phrases, but have you ever heard of "repressive tolerance" or "repressive permissiveness"? It basically means you get what you want, but getting what you want is bad for you.

If Google scans your emails for, let's say, music you like, in order to suggest some stuff you may like from bands you already listen to or bands in the same genres you like, that's cool, right? Well, what about bands who aren't signed to major labels and don't have download distribution on the web? Because Google will monetize its home page so that it works in conjunction with iTunes, Amazon, and the other sites, you won't have access to any music left outside those corporations' catalogs. You will be sucking from Google's teat, as it were, and the more dependent you become, the less able you'll be to discover new things on your own. If you think that's a trivial point, ask yourself how jazzed you would've been if an adult had come to you at age 15, or whenever it was for you, and said, "You look like a thoughtful, bookish, slightly odd but highly sensitive teenager trying to make sense of a crazy world-- here's a CD, listen to Morrissey, Nine Inch Nails, and The Clash".

And the point about "data correlation" rather than causation is an example of how the living world is made abstract in a very troubling way. Let's say Google is pretty smart (it is already) and you're using your homepage to plan your weekend. You log in on Saturday morning. Google is so bright it knows that Morrissey is playing a concert near your home town. It knows there's a meeting of a Morrissey fan club at a nearby bar beforehand. It gathers a review of the last Morrissey gig and shows you a link to a new Australian interview. All of that can probably be done now.

However, the miraculous leap will be that the "smarter" Google will also be able to tell you that a new, unsigned indie band that sounds like early Smiths is playing a club nearby, or that a local professor who has written a dissertation on gender roles in Morrissey's music is giving a poetry reading at a local Starbucks, or that "East of Eden" is playing at a local revival house. The social networking function will give you the names and email addresses of eleven other Morrissey fans in your vicinity with whom you share various affinities. The mapper function will give you a list of nearby vegetarian restaurants, all highlighted on (what else) a satellite-imaged Google Map. Oh, and since Google knows that the book on Postmodern Architecture in Finland you borrowed is now overdue it will plan a route allowing you to stop by the library, pay your fine by stored credit card, and create a suggested reading list which, by clicking "OK", it can have waiting for you when you swing by the librarian's counter.

Google will also, by remote networking, link your mobile device to those of other fans at the Morrissey venue so yours will play back images and/or sounds from the gig so you know exactly when the support band stops and when Morrissey is about to come on, and during the gig you can watch from multiple "user" perspectives (not to mention chat with fans). Google will collect all this and store it for you to watch when you get home as a kind of jigsaw puzzle live bootleg. Finally, since you watched the videos at 2:57 am on Sunday morning, Google will (since it knows your sleeping patterns by your log-in timestamps) open a new email addressed to your friend (GMail), who you're supposed to meet for coffee at 10:00 am (Google Calendar), to tell him you're going to be late since it knows you'll wake up only after 11:00 am. Hit "Send" and nighty-night.

Not bad, right? Well, how does Google do all this? As the article states, it doesn't think. It doesn't evaluate. It doesn't actually have "taste" as we think of the term. The algorithm correlates data patterns, and data patterns are made up of zillions of "information tags". That indie band that sounded like early Smiths? Google "knew" about them because of their own Google page, which archived key words and essentially cross-referenced them with key words from your page-- whether a "tag cloud" made from your own Internet wanderings or from specific lists you've made, as for example MySpace "Favorite Bands" lists. Maybe Google also searched through reviews of the band's shows on blogs for key words and found "early Smiths" a few times, or even figured it out because on the Internet "Johnny Marr", "Rickenbacker", and Hatful of Hollow, for instance, all form a unique data sequence that means "early Smiths".
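
That "unique data sequence that means 'early Smiths'" is easy to mock up: a label is just a cluster of co-occurring tags, and a new page matches whichever cluster it overlaps most. The tag sets below are invented for illustration:

```python
# A toy of tag-cloud matching: no taste, no listening, just set overlap
# between a page's tags and hand-built label signatures (invented here).

label_signatures = {
    "early Smiths": {"johnny marr", "rickenbacker", "hatful of hollow", "jangle"},
    "industrial":   {"nine inch nails", "trent reznor", "synth", "distortion"},
}

def best_label(page_tags):
    page = {t.lower() for t in page_tags}
    scores = {label: len(sig & page) / len(sig)
              for label, sig in label_signatures.items()}
    label = max(scores, key=scores.get)
    return (label, scores[label]) if scores[label] > 0 else (None, 0.0)

blog_review = ["Johnny Marr", "Rickenbacker", "jangle", "Manchester"]
print(best_label(blog_review))  # ('early Smiths', 0.75)
```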

Thus is Google made smart: everything is made intelligible by being broken down into data bits; the primitive form of this is the META tag section of web pages. Future Google home pages will just be advanced forms of the same basic data collection and sequencing, but the important point to consider is whether or not this agglomeration of data is, first, the same thing as knowledge, and, second, whether it gradually takes on a prescriptive, rather than descriptive, role in one's life.

Does a picture start to form in your head of how your entire existence is going to be obliterated into tiny pieces and fed into-- dare I say it-- a matrix consisting of trillions of other tiny pieces that connect you with everything and everyone? How your own will slowly vanishes in place of "tendencies" and "probabilities", and how this abstraction of the world denigrates your own humanity until you are merely a lengthy mathematical equation (or, shall I dare again? a clockwork orange)? And do you see what this might mean to a child born in five years who will never know anything but this world?

Yours Truly,

Debbie Downer :guitar:
 
Could you guys post something a little shorter, please?
 


If you've done nothing wrong
You've got nothing to fear
If you've something to hide
You shouldn't even be here

Long live us
The persuaded we
Integral
Collectively
To the whole project
It's brand new
Conceived solely
To protect you

One world
One reason
Unchanging
One season

If you've done nothing wrong
You've got nothing to fear
If you've something to hide
You shouldn't even be here
You've had your chance
Now we've got the mandate
If you've changed your mind
I'm afraid it's too late
We're concerned
You're a threat
You're not integral
To the project

Sterile
Immaculate
Rational
Perfect

Everyone has
Their own number
In the system that
We operate under
We're moving to
A situation
Where your lives exist
As information

One world
One life
One chance
One reason
All under
One sky
Unchanging
One season

If you've done nothing wrong
You've got nothing to fear
If you've something to hide
You shouldn't even be here
You've had your chance
Now we've got the mandate
If you've changed your mind
I'm afraid it's too late
We're concerned
You're a threat
You're not integral
To the project

Sterile
Immaculate
Rational
Perfect

that's all I have right now. I have to run. I appreciate the response.
 
I'll write short if you post frequent. :)

I'll have to contemplate that a bit.

In the meantime, I'm thinking of posting some advice to No Gods Dude about how to be a more effective advocate.
 
Believe it or not, Worm, I had a very lengthy response to the original article drafted, then canceled it because I just don't have the stamina or mental capacity to give it justice. I did want to make a couple points, because the tiny subniche of computer science/artificial intelligence that I work in has some significant overlaps with Google's sphere of influence; in fact, right now I'm working with Google on some cool, secret stuff. If I do well, maybe they'll hire me away, even though I'd rather move back to Iowa than to Sillycon Valley.

* Google's success is based on one important technological innovation (its PageRank algorithm) and many overlooked, but just as important, business innovations--AdSense, streamlined user experience, etc. But even more than these factors, the big thing in its favor is that it's amassed more data than anything in history. It's the Library of Alexandria x eleventy. The world's top AI researchers are falling over themselves to either work for or work with Google, because its data stores are so massive.

* Sure, massive amounts of data can show how inadequate existing models can be, but those models will catch up. They always have.

* Google's search is text-based. There's absolutely no context or summary of the results that you get. In other words, if you want to search for Clinton (the town in Iowa) vs. Clinton (the president), there's no way to specify that you want the town. Yes, you can refine the search endlessly with text-based modifiers (e.g. "Clinton -president -Bill -William -Hillary," etc.), but you'll still get a lot of noise. The Next Big Thing should be structured search, like "find all the non-governmental organizations that So-and-So is affiliated with." It's right around the corner. There will be several clumsy, for-programmers-only attempts at this, but eventually somebody (and not necessarily Google) will get it right. Just wait. (A toy contrast between the two kinds of search is sketched after this list.)

* The language translation feature is NOT based on the same PageRank algorithm that users see in action whenever they do a search, contrary to what the article implies.
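
A rough sketch of the contrast drawn in the third point, with an invented toy index standing in for a real engine:

```python
# Text search vs. structured search over a toy index. Keyword matching
# returns every "Clinton"; a query over typed records can ask for the
# town directly. The records are invented.

records = [
    {"name": "Clinton", "type": "town",   "state": "Iowa"},
    {"name": "Clinton", "type": "person", "role": "president"},
    {"name": "Clinton", "type": "town",   "state": "Mississippi"},
]

def text_search(query):
    # No notion of what kind of thing "Clinton" is.
    return [r for r in records if query.lower() in r["name"].lower()]

def structured_search(**constraints):
    # Field-level constraints: "the town named Clinton in Iowa".
    return [r for r in records
            if all(r.get(k) == v for k, v in constraints.items())]

print(len(text_search("clinton")))  # 3 hits: lots of noise
print(structured_search(name="Clinton", type="town", state="Iowa"))
```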
 
I think we might consider google's function as an advertising tool vs. its function as an education tool. I don't think that it could be one without the other.

I remembered that last year a lot of bloggers were mad at google. It was something to do with the bloggers using artificial means to inflate their position in page rank, and it seems that it had something to do with advertising. I googled for it and got this article.
 
*rubbing eyes*

IS NRITH TAKING PART IN ONE OF OUR PATENTED LONGWINDED DEBATES? MY GOD, THIS IS LIKE A YETI SIGHTING, I'VE GOT TO GET THIS ON TAPE!
 