
I'm a software engineer at Databricks, where I work on Photon, a highly efficient query processing engine for Apache Spark SQL.
I completed my PhD in Computer Science at UC Berkeley in August 2020. I studied problems in large-scale analytics and was advised by Ion Stoica and Raluca Ada Popa in the RISELab. I received a BS in EECS from Berkeley in May 2013.
Opaque is a package for Apache Spark SQL that enables encryption for DataFrames using Intel SGX trusted hardware. It is designed to enable analytics on sensitive data in an untrusted cloud. We ported a subset of Spark SQL's relational operators to C++ so they could run within SGX enclaves. See the NSDI 2017 paper.
GraphX is a distributed graph computation library built on top of Apache Spark. It aims to be as fast as the fastest specialized graph systems while providing much more flexibility. GraphX comes included with Spark; check out the programming guide and the OSDI 2014 paper.
As an undergrad I wrote a replay debugger for Spark programs called Arthur. Arthur enabled some interesting program analysis techniques, including forward and backward record tracing: if a distributed computation yielded a strange output record (one that was unexpectedly null, for example), Arthur could trace the record back through the computation graph to find which input records it came from and how it came to be.
We wrote a technical report on Arthur.
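The backward tracing idea can be sketched in a few lines of Python. This is an illustration only, not Arthur's implementation: `LineageLog`, the record tuples, and the toy two-stage pipeline are all hypothetical, and the sketch assumes each stage can log which input records produced each output record.

```python
class LineageLog:
    """Maps each derived record to the records it was computed from."""

    def __init__(self):
        self.parents = {}

    def record(self, output, inputs):
        # Remember that `output` was derived from every record in `inputs`.
        self.parents.setdefault(output, set()).update(inputs)

    def trace_back(self, record):
        # Walk the lineage graph backward until reaching records with no
        # logged parents; those are the original inputs.
        sources, seen, stack = set(), set(), [record]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current in self.parents:
                stack.extend(self.parents[current])
            else:
                sources.add(current)
        return sources


log = LineageLog()
lines = ["a,1", "b,", "a,3"]  # the second line is missing its value

# Stage 1: parse each line into a (key, value) record.
parsed = []
for i, line in enumerate(lines):
    key, raw = line.split(",")
    out = ("parsed", i, key, int(raw) if raw else None)
    log.record(out, [("input", i, line)])
    parsed.append(out)

# Stage 2: sum values per key; a null value makes the whole sum null.
groups = {}
for rec in parsed:
    groups.setdefault(rec[2], []).append(rec)

summed = {}
for key, recs in groups.items():
    values = [r[3] for r in recs]
    total = None if None in values else sum(values)
    summed[key] = ("summed", key, total)
    log.record(summed[key], recs)

# Which input lines are responsible for the unexpectedly null output?
culprits = log.trace_back(summed["b"])
```

Here `culprits` holds only the malformed line `"b,"`, while tracing `summed["a"]` returns both lines that contributed to key `a`. Forward tracing would follow the same edges in the opposite direction.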
Inspired by an article about syntax highlighting for variables instead of keywords, I wrote a demo implementation for Emacs. It became surprisingly popular, reaching the 86th percentile for downloads on MELPA, the primary Emacs package archive. It automatically picks optimally distinct colors and attempts to detect identifiers accurately across a variety of languages.
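The general idea of coloring identifiers can be sketched in Python. This is not how the Emacs package chooses its colors: evenly spaced hues are a simpler stand-in for a properly optimized palette, and `distinct_colors` and `color_for` are hypothetical names used only for this illustration.

```python
import colorsys
import zlib


def distinct_colors(n, lightness=0.6, saturation=0.6):
    """Return n hex colors whose hues are evenly spaced around the color wheel."""
    colors = []
    for i in range(n):
        r, g, b = colorsys.hls_to_rgb(i / n, lightness, saturation)
        colors.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return colors


def color_for(identifier, palette):
    """Pick a palette entry from a stable hash, so a name always gets the same color."""
    index = zlib.crc32(identifier.encode("utf-8")) % len(palette)
    return palette[index]


palette = distinct_colors(8)
coloring = {name: color_for(name, palette) for name in ["count", "total", "index"]}
```

Using a stable hash rather than assignment order means an identifier keeps its color across files and editing sessions, at the cost of occasional collisions between different names.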