I'm a second-year Computer Science PhD student at UC Berkeley, where I am advised by Ion Stoica in the AMPLab. I received a BS in EECS from Berkeley in May 2013.

I am spending the summer at Databricks working on GraphX. In past summers I've interned at Amazon, Google, Facebook, and Microsoft Research.

Contact

ankurdave@gmail.com

(650) 701-7705

Projects

GraphX August 2013 to present

GraphX is a distributed graph computation library built on top of Apache Spark. It aims to be as fast as the fastest specialized graph systems while providing much more flexibility. GraphX comes included with Spark; check out the programming guide and the technical report on the system [4]. A paper on GraphX will appear at OSDI 2014 [1].

As an undergrad I wrote a Pregel-like graph processing framework for Spark called Bagel. Bagel is now superseded by GraphX.

Color Identifiers Mode January 2014 to present

Inspired by an article about syntax highlighting for variables instead of keywords, I wrote a demo implementation for Emacs. It became surprisingly popular, reaching the 68th percentile for downloads on MELPA, the primary Emacs package archive. It automatically picks optimally distinct colors and attempts to detect identifiers accurately across a variety of languages.

Arthur 2011 to 2013

As an undergrad I wrote a replay debugger for Spark programs called Arthur. Arthur enabled some interesting program analysis techniques, including forward and backward record tracing: if a distributed computation yielded a strange output record (one that was unexpectedly null, for example), Arthur could trace the record back through the computation graph to find which input records it came from and how it came to be.

We wrote a technical report on Arthur [5].
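To make backward record tracing concrete, here is a minimal Python sketch. It is not Arthur's code and does not use Arthur's replay-based approach. It swaps in a simpler technique, eager lineage tagging, where every record carries the set of input positions it was derived from. The helper names (`traced_map`, `traced_reduce_by_key`) and the toy input are hypothetical.

```python
def traced_map(fn, records):
    """Apply fn to each value, keeping the record's source set unchanged."""
    return [(fn(value), sources) for value, sources in records]

def traced_reduce_by_key(fn, records):
    """Combine values per key and union the source sets of merged records."""
    merged = {}
    for (key, value), sources in records:
        if key in merged:
            prev_value, prev_sources = merged[key]
            merged[key] = (fn(prev_value, value), prev_sources | sources)
        else:
            merged[key] = (value, sources)
    return [((key, value), sources) for key, (value, sources) in merged.items()]

def parse(line):
    """Parse 'key number'; a missing number becomes None (the bug to trace)."""
    parts = line.split()
    return (parts[0], int(parts[1]) if len(parts) > 1 else None)

def add(x, y):
    """Sum two values, letting None propagate like a null would."""
    return None if x is None or y is None else x + y

# Hypothetical input: the third line is malformed.
inputs = ["a 1", "b 2", "a", "b 3"]

# Tag each input record with its own position before running the pipeline.
records = [(line, frozenset([i])) for i, line in enumerate(inputs)]
records = traced_map(parse, records)
records = traced_reduce_by_key(add, records)

# Backward trace: for every null output, list the inputs that fed into it.
suspects = {key: sorted(src) for (key, value), src in records if value is None}
print(suspects)
```

Tagging eagerly costs memory on every run, which is one reason a replay debugger that recomputes lineage only when asked can be preferable for production jobs.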

CloudClustering June to August 2010

I interned at Microsoft Research's eXtreme Computing Group the summer after I graduated high school. My project was to explore how to design scalable iterative programs on top of certain cloud storage abstractions, and in the process I built a prototype called CloudClustering. This led to a workshop paper [3] at DataCloud 2011.

DistBoggle 2008 to 2010

In 10th grade I was an occasional Boggle player, and I became curious what the densest Boggle board (the one with the most words packed into it) would look like. I wrote a package called DistBoggle that included a fast Java Boggle solver and two parallel optimizers: a hill climbing algorithm and a coarse-grained distributed genetic algorithm. I later wrote my IB Extended Essay [6] about this.
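The hill climbing idea can be sketched in a few lines of Python. This is an illustration only, not the DistBoggle code (which was written in Java): it is single-threaded, uses a tiny hypothetical word list, scores a board by the number of distinct words found, and ignores the "Qu" tile.

```python
import random
import string

# Hypothetical word list; a real run would load a full dictionary.
WORDS = {"cat", "cats", "act", "acts", "cast", "sat", "tas", "at"}
# Every prefix of every word, used to prune the search early.
PREFIXES = {w[:i] for w in WORDS for i in range(1, len(w) + 1)}

def neighbors(i, size):
    """Yield the indices of the cells adjacent to cell i, diagonals included."""
    r, c = divmod(i, size)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            nr, nc = r + dr, c + dc
            if (dr or dc) and 0 <= nr < size and 0 <= nc < size:
                yield nr * size + nc

def find_words(board, size):
    """Return the set of dictionary words that can be traced on the board."""
    found = set()

    def dfs(i, prefix, used):
        prefix += board[i]
        if prefix not in PREFIXES:
            return  # no word starts this way, so stop exploring
        if prefix in WORDS:
            found.add(prefix)
        for j in neighbors(i, size):
            if j not in used:  # each cell may be used once per word
                dfs(j, prefix, used | {j})

    for start in range(size * size):
        dfs(start, "", {start})
    return found

def hill_climb(size, steps, rng):
    """Mutate one letter at a time, keeping changes that do not lower the score."""
    board = [rng.choice(string.ascii_lowercase) for _ in range(size * size)]
    score = len(find_words(board, size))
    for _ in range(steps):
        candidate = board[:]
        candidate[rng.randrange(len(candidate))] = rng.choice(string.ascii_lowercase)
        cand_score = len(find_words(candidate, size))
        if cand_score >= score:  # accept improvements and sideways moves
            board, score = candidate, cand_score
    return board, score

best_board, best_score = hill_climb(size=3, steps=200, rng=random.Random(0))
print("".join(best_board), best_score)
```

Hill climbing of this kind can stall at a local optimum, which is one motivation for also trying a population-based method such as a genetic algorithm.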

Publications

Conference and Workshop Papers

[1]
Joseph E. Gonzalez, Daniel Crankshaw, Ankur Dave, Reynold S. Xin, Michael J. Franklin, and Ion Stoica. GraphX: Graph Processing in a Distributed Dataflow Framework. OSDI 2014, October 2014 (to appear).
[2]
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, and Ion Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012, April 2012. Best Paper Award and Honorable Mention for Community Award.
[3]
Ankur Dave, Wei Lu, Jared Jackson, and Roger Barga. CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud. DataCloud 2011, May 2011.

Technical Reports

[4]
Reynold S. Xin, Daniel Crankshaw, Ankur Dave, Joseph E. Gonzalez, Michael J. Franklin, and Ion Stoica. GraphX: Unifying Data-Parallel and Graph-Parallel Analytics. February 2014.
[5]
Ankur Dave, Matei Zaharia, Scott Shenker, and Ion Stoica. Arthur: Rich Post-Facto Debugging for Production Analytics Applications. January 2013.
[6]
Ankur Dave. Optimizing Boggle Boards: An Evaluation of Parallelizable Techniques. IB Extended Essay, January 2009.