I'm a fourth-year Computer Science PhD student at UC Berkeley, where I am advised by Ion Stoica in the AMPLab. I received a BS in EECS from Berkeley in May 2013.

In past summers I've interned at Databricks, Amazon, Google, Facebook, and Microsoft Research.



(650) 701-7705


GraphX is a distributed graph computation library built on top of Apache Spark. It aims to be as fast as the fastest specialized graph systems while providing much more flexibility. GraphX comes included with Spark; check out the programming guide and the OSDI 2014 paper [2].

As an undergrad I wrote a Pregel-like graph processing framework for Spark called Bagel. Bagel has since been superseded by GraphX.

Inspired by an article about syntax highlighting for variables instead of keywords, I wrote a demo implementation for Emacs. It became surprisingly popular, reaching the 77th percentile for downloads on MELPA, the primary Emacs package archive. It automatically picks optimally distinct colors and attempts to detect identifiers accurately across a variety of languages.

2011 to 2013

As an undergrad I wrote a replay debugger for Spark programs called Arthur. Arthur enabled some interesting program analysis techniques, including forward and backward record tracing: if a distributed computation yielded a strange output record (one that was unexpectedly null, for example), Arthur could trace the record back through the computation graph to find which input records it came from and how it came to be.

We wrote a technical report on Arthur [5].

June to August 2010

I interned at Microsoft Research's eXtreme Computing Group the summer after I graduated from high school. My project was to explore how to design scalable iterative programs on top of certain cloud storage abstractions, and in the process I built a prototype called CloudClustering. This led to a workshop paper [4] at DataCloud 2011.

2008 to 2010

In 10th grade I was an occasional Boggle player, and I became curious what the densest Boggle board (the one with the most words packed into it) would look like. I wrote a package called DistBoggle that included a fast Java Boggle solver and two parallel optimizers: a hill-climbing algorithm and a coarse-grained distributed genetic algorithm. I later wrote my IB Extended Essay [6] about this.


Conference and Workshop Papers

[1] Ankur Dave, Alekh Jindal, Li Erran Li, Reynold S. Xin, Joseph E. Gonzalez, Matei Zaharia. GraphFrames: An Integrated API for Mixing Graph and Relational Queries, GRADES 2016, June 2016.
[2] Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, Ion Stoica. GraphX: Graph Processing in a Distributed Dataflow Framework, OSDI 2014, October 2014.
[3] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI 2012, April 2012. Best Paper Award and Honorable Mention for Community Award.
[4] Ankur Dave, Wei Lu, Jared Jackson, Roger Barga. CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud, DataCloud 2011, May 2011.

Technical Reports

[5] Ankur Dave, Matei Zaharia, Scott Shenker, Ion Stoica. Arthur: Rich Post-Facto Debugging for Production Analytics Applications, January 2013.
[6] Ankur Dave. Optimizing Boggle Boards: An Evaluation of Parallelizable Techniques, IB Extended Essay, January 2009.