2014 Innovation Days
Recently, WibiData held our second semi-annual Innovation Days! In the spirit of creative entrepreneurship, WibiData engineers took a break from their normal sprints to hack on Kiji, WibiData’s open-source framework for Big Data Application development.
Innovation Days kicked off on Wednesday, January 22nd with pitch presentations. The following two days, Wibis split up into groups to hack on the projects of their choice. By Monday, the Wibis were ready to present their projects.
Take a look at what engineers at WibiData hacked on:
Kiji Roaring. Currently within Kiji, ScoreFunctions for applying a model to an entity must be written in either Java or Scala. Data scientists tend to be more familiar with R than with either language, so prototyping model scorers in Java or Scala takes them longer. Kiji Roaring (a play on KijiScoring) adds the ability to run score functions that use R code on data within a Kiji row. As part of this, a ScoreFunction that predicts linear regression fitted values can be implemented with the following R code:
fit = lm(primitive.values ~ primitive.ts)
predict(fit, data.frame(primitive.ts = c(currentTime)))
Lee built a demo using KijiRest to demonstrate that this linear regression would produce continually updating results when freshening is used. The result was a proof of concept demonstrating the data flow from a conventional ScoreFunction to R and back out to Kiji for the results.
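For readers less familiar with R, here is a rough Python sketch of what those two lines compute: fit a straight line to a cell's values against their timestamps, then evaluate that line at the current time. This is not part of Kiji Roaring; the function names and sample data are made up for illustration, and it uses plain ordinary least squares rather than calling into R.

```python
def fit_line(timestamps, values):
    """Ordinary least squares for values ~ timestamps.

    Returns (intercept, slope). Hypothetical helper, not a Kiji API.
    """
    n = len(timestamps)
    if n < 2 or n != len(values):
        raise ValueError("need two or more (timestamp, value) pairs")
    mean_t = sum(timestamps) / n
    mean_v = sum(values) / n
    # Sum of squared deviations of the timestamps.
    sxx = sum((t - mean_t) ** 2 for t in timestamps)
    if sxx == 0:
        raise ValueError("timestamps must not all be equal")
    # Sum of cross deviations between timestamps and values.
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(timestamps, values))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    return intercept, slope


def predict(model, t):
    """Evaluate the fitted line at time t."""
    intercept, slope = model
    return intercept + slope * t


# Made-up sample data standing in for the versions of one Kiji cell.
timestamps = [1.0, 2.0, 3.0, 4.0]
values = [3.0, 5.0, 7.0, 9.0]

model = fit_line(timestamps, values)
score = predict(model, 5.0)  # 5.0 plays the role of currentTime
print(score)
```

Because the fit is recomputed from whatever values the row currently holds, rerunning it as new data arrives yields the continually updating prediction the demo showed.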
Flow GUI. Many things in the world can be modeled with graphs. This includes things like source code (ASTs are a special case of graphs), Scalding flows (Cascading produces a FlowDef graph before turning it into a sequence of MapReduce jobs), batch model training workflows (workflow engines essentially just describe a graph of dependent tasks), etc. The Flow GUI is a project that aims to provide an editor for working with graph-style data. Robert built a prototype editor to demonstrate concepts for an easy-to-use yet powerful interface for manipulating graphs.
Personalized Search with Kiji and Solr. Kiji is a powerful platform on top of which people can build Big Data Applications, with a specific focus on personalization and recommendations. Apache Solr is an open-source search engine that powers search at many companies. Combining the two to build a personalized search experience is a natural fit. Amit built a prototype of this integration using the MovieLens data and a custom Solr QueryComponent that reads data from a Kiji table and constructs a series of category boosts, based on the user’s most recent ratings, that are specific to that individual.
Kiji + Spark. Apache Spark is a compute engine designed for large-scale data processing. Spark maintains the scalability and fault-tolerance of MapReduce while allowing applications to keep working sets of data in memory across different phases of computation. Spark is therefore able to offer 10-100x speedups over MapReduce for iterative computations, including many machine learning algorithms. Cloudera recently added the Spark framework to CDH. For this project, Clint and Sebastian created a proof-of-concept example project that creates a Spark RDD from the data in a Kiji table. Spark already provides a Hadoop RDD interface, so we only had to specify the proper input format and Hadoop configuration to read data from a Kiji table. We could easily extend this project in the future to provide an experience somewhat similar to KijiExpress, but on Spark.
AsyncKiji. KijiSchema is built upon the standard synchronous HBase client, and many of its APIs have been shaped by the strengths and limitations of that client. Aaron and Dan aimed to replace the standard HBase client with the AsyncHBase client provided by OpenTSDB and to provide new APIs for reading from tables suited to the strengths of the asynchronous client. To accomplish this we leveraged two libraries developed by Dan:
- Carthorse for creating a view of an HBase table through the AsyncHBase client
- Continuum for defining intervals over totally ordered domains
These libraries allowed us to build an API through which the user defines an interval along each dimension of a KijiTable to retrieve a stream of cells which fall into those intervals.
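To make the interval idea concrete, here is a small Python sketch. It is not the AsyncKiji, Carthorse, or Continuum API: the Interval, Cell, and scan names are hypothetical, the dimensions (row, column, timestamp) are an assumption about what "each dimension of a KijiTable" means, and it filters an in-memory list instead of issuing asynchronous HBase reads.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Iterator, Optional


@dataclass(frozen=True)
class Interval:
    """Half-open interval [lower, upper) over any totally ordered domain.

    A bound of None means unbounded on that side. Hypothetical type.
    """
    lower: Optional[Any] = None
    upper: Optional[Any] = None

    def contains(self, x: Any) -> bool:
        if self.lower is not None and x < self.lower:
            return False
        if self.upper is not None and x >= self.upper:
            return False
        return True


@dataclass(frozen=True)
class Cell:
    """One cell of a table: a value addressed by row, column and timestamp."""
    row: str
    column: str
    timestamp: int
    value: Any


def scan(cells: Iterable[Cell],
         rows: Interval = Interval(),
         columns: Interval = Interval(),
         timestamps: Interval = Interval()) -> Iterator[Cell]:
    """Lazily yield the cells that fall inside every given interval."""
    for cell in cells:
        if (rows.contains(cell.row)
                and columns.contains(cell.column)
                and timestamps.contains(cell.timestamp)):
            yield cell


# Made-up table contents for illustration.
table = [
    Cell("a", "info:name", 10, "x"),
    Cell("b", "info:name", 20, "y"),
    Cell("b", "info:age", 30, "z"),
    Cell("c", "info:name", 40, "w"),
]

# Rows in ["b", "c") with timestamps strictly below 30; columns unbounded.
result = list(scan(table, rows=Interval("b", "c"), timestamps=Interval(upper=30)))
print(result)
```

The generator yields matching cells one at a time, which loosely mirrors the stream-of-cells shape described above; a real asynchronous client would deliver them through callbacks or futures rather than a blocking loop.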
Overall, the WibiData engineers experimented with parts of Kiji in order to create new plugins, develop best practices, and optimize existing tools. Thanks to all of our engineers who participated, Innovation Days was once again a success!
Stay tuned: you may see some of these features pop up in future Kiji releases!