Wibi on Whales

At WibiData, we’ve been working hard to bring the Wibi experience to as many platforms as possible. Wibi’s Data Access Server uses Apache Thrift to convey data between the Wibi storage layer and end-user applications built in a variety of languages.

One such supported language is Ruby. I’ve been playing with Ruby a lot recently, and have been really impressed with Ruby on Rails. The simplicity offered by this programming platform blew me away the first time I used it. Our entire customer portal is a Rails app. I built it in about a week, and I had never written a line of Ruby code before attempting that. This is a great example of a framework actually making a language easier to learn, rather than adding to the learning curve, as so many others do. Its combination of auto-generated code and convention over configuration makes it easy to see examples of what to modify, and hides the complexity that other ORM solutions introduce.

Working with Wibi’s Data Access Server in Java is reasonably straightforward, and we are hard at work making this easier from within other languages. We provide the Thrift IDL file for the protocol which you use to connect to the Data Access Server, but you need to understand Thrift’s API, as well as manage the connection and error handling yourself, in addition to knowing which methods of the Data Access Server’s protocol do what.

Integration with frontend systems is a key goal for Wibi; providing real-time recommendations and analysis requires that frontend systems, be they web sites, mobile apps or hardware devices, are all able to post new data to Wibi without using a bulk-load, and retrieve stored information and analysis results on the fly. Unfortunately, many of these systems aren’t written in Java, where use of the Data Access Server is gracefully hidden behind a Wibi-specific API.

So we were left with the question: How do we expand the universe of languages where Wibi integrates well? And how do we do so in a way that feels “native” to the language? Ruby was the next language on my list that I wanted to integrate with, but how can we provide the smoothness of the “Rails experience” to Wibi customers who want to work in Ruby?

The Future of Personalization is Data-Driven

Last week saw the WibiData team in New York for the GigaOM Structure:Data conference. I gave a presentation about my thoughts on developing technology products and the applicability of data to personalizing the user experience. A recording of the talk is available:

There are accompanying slides as well:

Finally, GigaOM posted a summary of the content in a post on their own blog.

Writing Machine Learning Algorithms in WibiData

In a previous blog post, we shared some detail on how WibiData’s architecture and APIs enable large-scale storage, serving, and analysis of user-centric data. One of the primary use cases for this platform is online recommendations: programmatically determining the most relevant content, advertisements, or products to display to users on a web site.

To perform this operational analysis, we’ve introduced new operators called producers and gatherers that are a more natural fit for expressing these analytic tasks. This post is a deeper dive into how WibiData’s producer/gatherer framework works. We’ll motivate this with the specific example of a naive variant of an item-based collaborative filtering algorithm. This is one of many methods that can be used for generating recommendations.

Another perspective on WibiData

Recently, I’ve been speaking about WibiData at Hadoop User Groups and other meetups. Garrett Wu is our Director of Engineering and designed many of the fundamental concepts of WibiData’s computation and data model. In this presentation to the Bay Area Software Engineering meetup, Garrett presents WibiData in his own words. For those who couldn’t make it, both video and slides are available.

Speaking at Structure:Data in New York

Over the past few weeks, I’ve had the privilege of sharing lessons learned about working with Hadoop and HBase with a lot of big data enthusiasts in the Bay Area and LA. Next month I’ll be on the road again, to give a presentation at GigaOM Structure:Data ’12 in NYC.

The conference runs March 21-22. Come see my session, “Analyzing Large-Scale User Data with Hadoop and HBase.” And please introduce yourself to me; I’m always interested in meeting more big data practitioners and learning more about how you wrangle big data to get results.

How WibiData Works

Over the past year, we’ve had the privilege of helping a lot of great customers get more value out of their data by using WibiData to store, analyze and serve information about their users. What makes WibiData such a powerful tool? Why is it more effective than just cobbling together your own solution on top of Hadoop and HBase? In this post, we’ll peel back the covers and look at how some key components of WibiData work to give you more leverage over your data.

A closer look at WibiData

Last night I had the pleasure of speaking at the Bay Area Search meetup group at eBay, in San Jose. I expanded on the material in our Hadoop World presentation, going deeper into the technical underpinnings of our software and explaining how some example content recommendation algorithms could be implemented on top of WibiData.

For those who missed it, my slides are available here:

I’ll also be speaking at the Los Angeles Hadoop User Group meetup on February 7th – if you’re in town, come stop by and say hi!

Open Source at WibiData

This is the season for thinking about how fortunate we are, and thinking harder about how we can give back to those around us.

When building WibiData, we had the good fortune to be able to build on top of open source projects like Hadoop, HBase, and Avro. This was a tremendous head start toward our goal of easy user data storage, serving and analysis. We’re excited to “pay it forward” by sharing some of the useful tools we built along the way. Check out the collection of repositories we’ve started on GitHub to host code that might be useful to others. We’re kicking this off by releasing three Java projects:

hbase-maven-plugin

A Maven plugin that starts a mini HBase cluster in a separate process. WibiData uses this plugin to execute integration tests against a running HBase “cluster” contained within a single process.

odiago-avro

Extensions for Apache Avro for tight integration with Apache Hadoop’s new MapReduce API. With this library, you can read and write Avro container files of key-value pairs using either Hadoop Writable types (IntWritable, Text) or native Avro Java types (Integer, CharSequence).

odiago-common-flags

WibiData needed a simple command-line flag parsing library that was easy to use. The existing libraries we found were very fully-featured. This made for complex APIs and cumbersome integration with our code, and we didn’t need all the flexibility they offered. We wrote odiago-common-flags with a strong focus on a simple annotation-based syntax.

We hope you find these libraries useful, and we welcome contributions. All code is made available under the terms of the Apache 2.0 license. Follow us on GitHub to stay updated on future project releases. Happy hacking!

FoneDoktor, A WibiData Application

This is a guest blog post by Alex Loddengaard, the author of FoneDoktor, a WibiData application. In this post he’ll explain FoneDoktor and go over its implementation, showing you the benefits of using WibiData to store, access, and analyze user data. Jump to the bottom to see conclusions about why WibiData was the right solution for FoneDoktor. Alex was the third employee at Cloudera and stayed there for two years, working on some of the largest Hadoop implementations during his time there.

I built an Android app called FoneDoktor, which uses WibiData as its primary data storage, access, and analysis system. Having used Hadoop for over four years now, I was insanely impressed with the simplicity that WibiData brings to apps that need to store, access, and analyze massive amounts of user data. I’ll explain FoneDoktor and talk about its implementation, bringing me to the conclusion that WibiData stands for simplicity more than any other alternative in the big data space.

What is FoneDoktor?

FoneDoktor is an Android app that monitors phone usage and recommends usage improvements to improve Android phone performance and battery life. Usage information such as average screen brightness, average signal strength, wifi connectivity, power cycles, and more is collected throughout the day and sent to a WibiData cluster. Data is only sent when the phone is connected to a power source, to avoid using battery to send data upstream.

Once FoneDoktor has been running for a few weeks on a phone, it starts analyzing the usage data and makes recommendations in the form of Android push notifications. A notification might suggest that you should turn on auto screen brightness, or start using wifi when your signal is low, if available of course. FoneDoktor has several more notification types, too.

How Does FoneDoktor Work?

A typical usage record, stored as an Apache Avro record, might look something like this:

{
    "datetime": "2011-11-12 03:24:08.111",
    "seconds_on": 477,
    "avg_brightness": 255,
    "is_on_power": true,
    "is_on_wifi": true,
    "is_on_3g": true,
    "signal_strength": 7,
    "device_id": "ABCD1234",
    "is_auto_brightness": true
}

This record says that the screen was at full brightness (255) for 477 seconds while the phone was connected to power, with wifi and 3g on and a signal strength of 7. The record carries a unique device ID, which is used as the WibiData key.

On any given day FoneDoktor will collect about 100 records from each phone. These records are stored in a WibiData column. Each WibiData column stores a specific type of record. For example, there exists one column for wifi-specific records, another column for screen brightness records, etc. Because WibiData timestamps each record as it’s stored, every record is accessible by its key (device ID), column (record type), and timestamp (when the record was created). WibiData also makes it easy to scan both rows and values (by timestamp) in a particular column.
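The addressing scheme described above (key, column, timestamp) can be pictured with a small in-memory model. This is only a sketch in Python: `ToyTable`, `put`, `get_latest` and `scan_column` are hypothetical names invented for illustration and are not part of WibiData's API, and the integer timestamps are made up.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a table addressed by (key, column, timestamp).

    Hypothetical illustration only; this is not WibiData's API.
    """

    def __init__(self):
        # {device_id: {column: [(timestamp, record), ...]}}, oldest first
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, device_id, column, timestamp, record):
        versions = self._rows[device_id][column]
        versions.append((timestamp, record))
        # Keep the versions of a cell ordered by timestamp.
        versions.sort(key=lambda pair: pair[0])

    def get_latest(self, device_id, column):
        versions = self._rows.get(device_id, {}).get(column, [])
        return versions[-1][1] if versions else None

    def scan_column(self, column, start=None, end=None):
        """Yield (device_id, timestamp, record) across all rows, in key order,
        keeping only values with start <= timestamp < end when bounds are set."""
        for device_id in sorted(self._rows):
            for timestamp, record in self._rows[device_id].get(column, []):
                if start is not None and timestamp < start:
                    continue
                if end is not None and timestamp >= end:
                    continue
                yield device_id, timestamp, record


table = ToyTable()
table.put("ABCD1234", "brightness", 100, {"avg_brightness": 255})
table.put("ABCD1234", "brightness", 200, {"avg_brightness": 120})
table.put("WXYZ9876", "brightness", 150, {"avg_brightness": 80})

latest = table.get_latest("ABCD1234", "brightness")
window = list(table.scan_column("brightness", start=100, end=200))
```

Here `latest` is the most recent brightness record for one device, and `window` holds every brightness record in the chosen time range across all devices.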

Data Storage and Access with WibiData

The write path (outlined in an architecture diagram in the conclusion section below) for a record starts at the phone. The record is cached on the phone if it’s not connected to power. Then, once the phone is connected to power, the record is sent upstream as JSON to a web server implemented in Python/Django. The web server creates an Avro record and sends a Thrift RPC to the WibiData access server, which writes the record into WibiData.
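The first hop of this write path, holding records on the phone until it is on power and then sending them as JSON, might be sketched as follows. FoneDoktor itself is an Android app, so this Python version is purely illustrative; `UsageBuffer` and the `send` callable are assumptions, not code from the app.

```python
import json


class UsageBuffer:
    """Holds usage records until the device is on power (hypothetical sketch)."""

    def __init__(self, send):
        self._send = send      # callable that uploads one JSON string
        self._pending = []     # records waiting to go upstream

    def record(self, usage, is_on_power):
        self._pending.append(usage)
        if is_on_power:
            self.flush()

    def flush(self):
        while self._pending:
            self._send(json.dumps(self._pending[0]))
            # Drop a record only after the send call returned without raising.
            self._pending.pop(0)


sent = []
buffer = UsageBuffer(sent.append)
buffer.record({"device_id": "ABCD1234", "seconds_on": 477}, is_on_power=False)
buffer.record({"device_id": "ABCD1234", "seconds_on": 60}, is_on_power=True)
```

The first record stays cached because the phone is on battery; the second call finds the phone on power and sends both, oldest first.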

FoneDoktor’s read path is as straightforward as its write path. The phone periodically queries the web server to see if any new notifications or summary data are available. The web server fires off a Thrift RPC to the WibiData access server, which queries WibiData and returns an Avro record, which is serialized into JSON in Python/Django before being sent to FoneDoktor.

WibiData is implemented on top of HBase, which means clients don’t need to worry about caches or indexes when reading and writing data. WibiData scales out of the box.

User Analysis in WibiData

Without WibiData, FoneDoktor’s data analysis would be powered by MapReduce over a real-time storage system such as HBase. I spent two years working at Cloudera and I still get lost in MapReduce API changes and oddities with reading and writing data to and from sources other than HDFS.

With WibiData, however, the analysis APIs are very obvious and far simpler than MapReduce. FoneDoktor has two different types of analysis. First, some analysis only looks at a single phone, for example to do battery calculations or summarize usage. This type of analysis is done in WibiData with producers. All other forms of analysis are done on the entire data set, looking at how all phones are used and creating correlations between usage and performance. This type of analysis is done by gatherers.

Producers

The producer API is dead simple. You specify which tables and columns your data comes from, where output will be written to in WibiData, and a method for processing a single row at a time. WibiData handles buffering, reading from HBase, writing back to HBase, and everything else. No MapReduce. No input/output complexity.

In FoneDoktor’s case, a producer might look at the set of screen brightness records for a given phone and output an aggregate screen brightness average, which can be used for further analysis later.
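To make the shape of a producer concrete, here is a rough Python sketch of the brightness example. The real producer API is not shown in this post, so everything here is an assumption: the function names, the `derived:avg_brightness` output column, the dictionary layout of a row, and the choice to weight the average by `seconds_on`.

```python
def average_brightness_producer(row):
    """Process one row (one phone) and return the value to write back.

    `row` maps a column name to a list of (timestamp, record) pairs.
    Hypothetical sketch; not WibiData's producer API.
    """
    records = [record for _, record in row.get("brightness", [])]
    total_seconds = sum(r["seconds_on"] for r in records)
    if total_seconds == 0:
        return None  # nothing to average for this phone
    # Assumption: weight each record by how long the screen was on.
    weighted = sum(r["avg_brightness"] * r["seconds_on"] for r in records)
    return weighted / total_seconds


def run_producer(producer, table, output_column):
    """Toy driver standing in for the framework: one producer call per row."""
    for row in table.values():
        value = producer(row)
        if value is not None:
            row[output_column] = value


table = {
    "ABCD1234": {
        "brightness": [
            (1, {"avg_brightness": 255, "seconds_on": 100}),
            (2, {"avg_brightness": 55, "seconds_on": 300}),
        ]
    },
}
run_producer(average_brightness_producer, table, "derived:avg_brightness")
```

The point of the sketch is the division of labor: the author writes only the per-row function, while the driver (here a trivial loop) takes care of reading rows and writing the output.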

Gatherers

The gatherer API is slightly more complex than the producer API, but it’s still far simpler than a traditional MapReduce job. Just like the producer API, you specify which data you want to read in, and where output should go. You then write a method for processing individual rows, where the output data of this method is used as input data in a reducer. The reducer is not a traditional MapReduce reducer, but it works very similarly to one. It takes aggregated keys and their respective lists of values and outputs data to a WibiData cell. Again, no need for ETL and complex input/output strategies.

In FoneDoktor’s case, a gatherer might look at which devices perform worse than others, and dig into the usage data to learn why. For example, it may find that two users with the same device have drastically different battery life. It will then look at the usage differences and make suggestions to each respective user for improving their battery life.
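The gather-then-reduce flow can be sketched in a few lines of Python. As with the producer sketch, this is not WibiData's gatherer API: the function names are invented, and the `model` and `battery_hours` fields are hypothetical inputs chosen to mirror the battery-life comparison described above.

```python
from collections import defaultdict
from statistics import mean


def gather(row):
    """Per-row step: emit (key, value) pairs, here (device model, battery hours)."""
    yield row["model"], row["battery_hours"]


def reduce_values(key, values):
    """Reducer step: summarize all values gathered for one key."""
    return {
        "mean_hours": mean(values),
        "spread_hours": max(values) - min(values),
    }


def run_gatherer(rows, gather_fn, reduce_fn):
    """Toy driver standing in for the framework: group by key, then reduce."""
    grouped = defaultdict(list)
    for row in rows:
        for key, value in gather_fn(row):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


rows = [
    {"model": "phone-a", "battery_hours": 10},
    {"model": "phone-a", "battery_hours": 20},
    {"model": "phone-b", "battery_hours": 12},
]
summary = run_gatherer(rows, gather, reduce_values)
```

A large `spread_hours` for one model is the kind of signal that would prompt a closer look at how the usage of those phones differs.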

Conclusions – Why WibiData

Almost every use of Hadoop requires a sibling real-time data storage and access solution (OLTP – online transaction processing) for serving a website, OLAP dashboard, mobile app, or any other real-time piece of software. In practice Hadoop works alongside data solutions such as MySQL, Vertica, Oracle, Teradata, HBase, and lots of other alternatives to power these real-time applications. Hadoop does batch, background processing and these other systems serve and store the real-time data. Without WibiData, FoneDoktor’s architecture would look something like this:

Architecture Without WibiData

Hadoop, during the four years I’ve worked with it, has started to play much more nicely with sibling real-time data storage solutions. But the process of using Hadoop alongside a real-time solution is still cumbersome — one needs to write lots of ETL to move data back and forth. Ops will need to maintain far more daemon processes. Ultimately, the system becomes far more complex and hard to maintain.

WibiData solves all of this by providing one solution for serving, storing, and analyzing data. Analysis, just like in Hadoop, is done in batch, in the background, whereas serving and storage, as in HBase, are done in real time, at massive scale. The difference is that WibiData comes with great abstractions that make it much simpler to use than the alternative. Furthermore, WibiData comes with lots of built-in libraries that do most of the analysis work for you in complex machine learning and data mining tasks. With WibiData, FoneDoktor’s architecture looks like this:

Architecture With WibiData

Building FoneDoktor was lots of fun — it was my first Android app and my first usage of WibiData. I’m very impressed with WibiData for the reasons I’ve stated here. Android was great, too.

Use WibiData if you’re tired of moving user data from system to system, to either analyze it or store and access it. WibiData just works, it scales, and most importantly, it’s simple in an otherwise complex technology stack.