Journal Article

The Unreasonable Effectiveness of Data

Alon Halevy, +2 more
01 Mar 2009 - Vol. 24, Iss. 2, pp. 8-12
TLDR
A trillion-word corpus - along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions - captures even very rare aspects of human behavior.
Abstract
At Brown University, there was excitement at having access to the Brown Corpus, containing one million English words. Since then, we have seen several notable corpora that are about 100 times larger, and in 2006, Google released a trillion-word corpus with frequency counts for all sequences up to five words long. In some ways this corpus is a step backwards from the Brown Corpus: it's taken from unfiltered Web pages and thus contains incomplete sentences, spelling errors, grammatical errors, and all sorts of other errors. It's not annotated with carefully hand-corrected part-of-speech tags. But the fact that it's a million times larger than the Brown Corpus outweighs these drawbacks. A trillion-word corpus - along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions - captures even very rare aspects of human behavior. So, this corpus could serve as the basis of a complete model for certain tasks - if only we knew how to extract the model from the data.
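
As a rough illustration of what "frequency counts for all sequences up to five words long" means, the sketch below tallies every word sequence of length one to five in a toy sentence. The function name `ngram_counts`, the lowercasing, and the whitespace tokenization are assumptions made for illustration; this is not a description of how the Google corpus was built.

```python
from collections import Counter


def ngram_counts(text, max_n=5):
    """Count every word sequence of length 1..max_n in `text`.

    Hypothetical helper: tokenization is a naive lowercase whitespace
    split, which is an assumption and far simpler than real corpus tooling.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


counts = ngram_counts("the cat sat on the mat and the cat slept")
print(counts[("the",)])        # frequency of the unigram "the"
print(counts[("the", "cat")])  # frequency of the bigram "the cat"
```

Even in this toy form, the table grows with every added n-gram length, which hints at why a trillion-word source yields such a large resource.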


Citations
Book

Distributed Optimization and Statistical Learning Via the Alternating Direction Method of Multipliers

TL;DR: It is argued that the alternating direction method of multipliers is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas.
Book

Machine Learning : A Probabilistic Perspective

TL;DR: This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach, and is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.
Journal Article

A survey on Image Data Augmentation for Deep Learning

TL;DR: This survey presents existing methods for Data Augmentation, promising developments, and meta-level decisions for implementing Data Augmentation, a data-space solution to the problem of limited data.
Book

Linked Data: Evolving the Web into a Global Data Space

TL;DR: This Synthesis lecture provides readers with a detailed technical introduction to Linked Data, including coverage of relevant aspects of Web architecture, as the basis for application development, research or further study.
Proceedings Article

Revisiting Unreasonable Effectiveness of Data in Deep Learning Era

TL;DR: In this paper, the authors investigated how the performance of current vision tasks would change if this data was used for representation learning and found that performance on vision tasks increases logarithmically with the volume of training data.
References
Book

A Comprehensive Grammar of the English Language

TL;DR: A comprehensive grammar of the English language.
Book

Introduction to statistical relational learning

Lise Getoor, +1 more
TL;DR: In Introduction to Statistical Relational Learning, leading researchers in this emerging area of machine learning describe current formalisms, models, and algorithms that enable effective and robust reasoning about richly structured systems and data.
Related Papers (5)