Data Mining the Web: $100 Worth of Priceless

Summary
I have always been a fan of what I call audacity in engineering, and today’s toolsets make work at Web scale not only possible but economically feasible, even for lone engineers toiling in tiny startups. We can not only think big but, with the right tools, execute on a scale never before imagined outside of large corporations or universities.
A few weeks ago, while working on prototype search technology for Lucky Oyster, we were able to leverage a few simple components—data from Common Crawl, Spot Instances from AWS, a few hundred lines of Ruby, and assorted Open Source software—to data mine 3.4 billion Web pages, extracting close to a terabyte of structured data, and building a searchable index of close to 400 million entities. The cost? About $100 US. And all work completed, thanks to several hundred worker nodes, in about 14 hours.
This exercise has its roots in a previous experiment, from earlier this summer (see here), in which we examined just over a billion Web pages to assess the extent of hardcoded references to Facebook and the use of Open Graph tags. In that work, we showed the far-reaching influence of Facebook on the broader Web: in the sample set, 22% of pages had hardcoded references to Facebook, and another 8% incorporated Open Graph tags.
Beyond our initial interest in the proliferation of structured data, this work also got us thinking about something even more important: the coming war between Web search (Google and Bing) and social networking (Facebook primarily), in which what’s at stake is the visibility and ownership of content itself. My mental model for the Facebook graph used to be that Web pages occasionally had pointers to additional social data; but in reality, that model is increasingly inverted, with the social graph having far more (potentially) useful content and metadata, with occasional (and often superfluous or outdated) pointers out to the broader Web. Moreover, the kinds of highly-structured, socially-inflected content that are growing the fastest—both on the Web and in the social graph—are greatly underserved by existing search models.
All of this work has a specific vector within Lucky Oyster: to build products that help people identify, share, learn about, and discover great things in the world (entities like runs, recipes, hikes, places, etc.) with relevance adjusted to who they are and their specific social fabric….

Mechanics
The architecture of the Lucky Miner is simple: some n workers start up, consume tasks from a master queue, and then report both their extracted data and statistics to a master data collection service. The components are as follows:
- Common Crawl Data. The current data set from Common Crawl, based on a 2012 Web crawl, consists of over 700,000 Web Archive (ARC) files, each of which holds roughly 100 MB of compressed crawled Web pages, for a total of ~3.4 billion records. I personally can’t say enough about how important Common Crawl’s mission is: it’s the largest and most thorough Web crawl available for public, free use by anyone, anywhere. The organization does a great deal to promote the use of the data, and I believe that’s a key aspect of inspiring engineers everywhere to innovate in new ways, on an ever grander scale.
- AWS S3. Amazon Web Services has graciously hosted the Common Crawl material as a public data set on S3. Provided the ARC files are accessed from within the fabric of AWS, there’s no cost for data transfer; we ended up paying only for the compute resources required to consume and analyze the data.
- AWS Spot Instances. During the first experiment with Common Crawl data, processing just over 1 billion URLs cost about $450. This time around, using Spot Instances, our average per-hour cost for a High-CPU Medium Instance (c1.medium) was about $0.018, just under one tenth of the on-demand rate. To leverage Spot Instances, we needed to (a) make sure our architecture was resilient to interruption (for when the Spot price rises above the maximum bid and machines go offline), and (b) get a bit savvier about bidding tactics. Thankfully, with some help from the AWS Spot Instance product team, the transition to Spot was trivial and well worth the effort.
- Master Queue. While future versions of this system will likely leverage AWS Simple Queue Service (SQS), we elected to start with beanstalkd for simplicity. An in-memory queue with performance characteristics that greatly outstrip the needs of even 10x the number of workers we run, beanstalkd installs and can be integrated in about 30 seconds, with client libraries in most languages.
- Worker Nodes. On startup, each worker node, a High-CPU Medium Instance (c1.medium), runs two identical processes. Each process pulls AWS S3 paths from the master queue, downloads and streams the files through gunzip, and then parses the ARC files and crawl records. A simple extraction class takes a single Web page and its metadata as input; its output (extracted content and statistics) is posted to the master data collection service, along with throughput stats and detailed error information. All code is written in Ruby, as the primary bottleneck is CPU time spent decompressing ARC files, not the speed of extraction (at least in this exercise).
- Master Data Collection Service. A very simple REST service accepts GET and POST requests with key/value pairs. Worker processes check in so that we can hit a status URL and keep an eye on zombies, errors, and the number and performance of active workers. Posted data are appended to key-specific local files for later analysis and work (sums and averages, reducing by key and/or value, indexing, etc.). The code, also written in Ruby, is de minimis and runs under the Sinatra framework; we front the service with Phusion Passenger and Apache httpd. This service deserves careful attention, as it’s the likeliest bottleneck in the whole architecture.
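To give a feel for what each worker does once an ARC file is decompressed, here is a minimal sketch of walking ARC-style records. The sample data and the `each_arc_record` helper are hypothetical and heavily simplified (real ARC files are gzipped and open with a `filedesc://` header record), but the basic framing is the same: a one-line space-separated header — URL, IP, date, content type, length — followed by `length` bytes of payload.

```ruby
# Two fake records in the simplified header-then-payload framing.
SAMPLE = [
  "http://example.com/a 1.2.3.4 20120101000000 text/html 11\nhello world",
  "http://example.com/b 1.2.3.4 20120101000001 text/html 3\nhi!"
].join("\n")

# Yield each record as a hash; advance by the declared payload length,
# plus the newlines that terminate the header and the payload.
def each_arc_record(data)
  offset = 0
  while offset < data.bytesize
    header_end = data.index("\n", offset)
    break unless header_end
    url, ip, date, mime, length = data[offset...header_end].split(" ")
    length = length.to_i
    payload = data[header_end + 1, length]
    yield({ url: url, ip: ip, date: date, mime: mime, body: payload })
    offset = header_end + 1 + length + 1
  end
end

records = []
each_arc_record(SAMPLE) { |r| records << r }
puts records.length # => 2
```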
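The heart of the collection service can be sketched in a few lines of plain Ruby: posted values get appended to a file per key, ready for later reduction. The `Collector` class, its file layout, and the sample keys below are illustrative assumptions, not the actual service code; the real version sits behind Sinatra routes fronted by Passenger and httpd.

```ruby
require "tmpdir"

# Core of a key/value collection service: append each posted value
# to a key-specific local file, one record per line.
class Collector
  def initialize(dir)
    @dir = dir
  end

  def post(key, value)
    File.open(File.join(@dir, "#{key}.log"), "a") { |f| f.puts(value) }
  end

  # Read back everything recorded under `key` (for status checks, reduces).
  def values(key)
    path = File.join(@dir, "#{key}.log")
    File.exist?(path) ? File.readlines(path, chomp: true) : []
  end
end

dir = Dir.mktmpdir
c = Collector.new(dir)
c.post("og_tags", "http://example.com/a\t3")
c.post("og_tags", "http://example.com/b\t1")
puts c.values("og_tags").length # => 2
```

Appending to per-key files keeps the hot path to a single `open`/`write`, which matters given that this service is the likeliest bottleneck in the pipeline.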
What’s described above was just the first step for Lucky Oyster; after the extraction run, we run sets of cleansing, analysis and indexing operations in parallel, and then feed the results to a cluster running Solr/Lucene. This then forms the foundation for deeper research on entity structure and interlinking, with a focus on triangulating people and their relationships to entities. The work is early….
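The queue/worker/collector flow described above can be sketched end to end with in-memory stand-ins: Ruby’s `Thread::Queue` in place of beanstalkd, and a mutex-guarded Hash in place of the REST collection service. The S3 paths and worker count here are hypothetical, and real workers would of course download, gunzip, and extract rather than just count.

```ruby
require "thread"

tasks     = Queue.new           # stand-in for the beanstalkd master queue
collector = { pages: 0 }        # stand-in for the master data collection service
lock      = Mutex.new

# Seed the queue with (hypothetical) S3 paths to ARC files.
10.times { |i| tasks << "common-crawl/segment/#{i}/file.arc.gz" }

# Start n workers; each drains tasks, then reports its totals once.
workers = 4.times.map do
  Thread.new do
    processed = 0
    loop do
      path = begin
        tasks.pop(true)         # non-blocking pop; raises ThreadError when empty
      rescue ThreadError
        break
      end
      # ... download path, gunzip, parse ARC records, extract data here ...
      processed += 1
    end
    lock.synchronize { collector[:pages] += processed }
  end
end
workers.each(&:join)

puts collector[:pages] # => 10
```

Because workers only ever pull from the queue and post results, any of them can vanish mid-run (as Spot Instances do) and the remaining tasks are simply picked up by the survivors.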
We have discussed a few follow-on steps for the broader community with the folks at AWS and Common Crawl:
- Releasing sample code to GitHub
- Building an open framework to let people execute code against a subset of the crawl data, for free
- Presenting this work with Common Crawl at the upcoming re:Invent conference this November