Changing the Tide: Efficient Summarization Techniques for Massive Data


                Computer Science Colloquium

                 Thursday, February 14, 2013
                          4:00 p.m.
                      Hardymon Theater
                  Davis Marksbury Building

                       Dr. Jeffrey Jestes
                       University of Utah


"Changing the Tide: Efficient Summarization Techniques for Massive Data"

Given the explosion of data we have witnessed within the past few years, it has become apparent that many existing algorithms simply will not scale to the huge amounts of data being generated. Due to the massive nature of modern data, it is often infeasible for computers to manage and query it exactly and efficiently. An attractive path for the future is data summarization, which can make data analytic tasks orders of magnitude more efficient while still providing approximation guarantees on results. We argue that, to be useful, a summary must be efficient to construct and highly parallelizable. We present two techniques that efficiently construct summaries over massive data: the first constructs wavelet histograms and the second constructs k-nearest neighbor graphs. Our techniques are demonstrated and evaluated in MapReduce, but are applicable to any parallel and distributed computing platform. Beyond efficient summary construction, we also motivate a vision for interactive and mergeable summaries as a direction for future research.


About the speaker:
Jeffrey Jestes received the BS degree in computer science from Florida State University in 2008. He was a PhD student in the Computer Science Department, Florida State University, between August 2008 and July 2011. He has been a PhD student at the School of Computing, University of Utah, since August 2011. He has interned twice with Microsoft Research in the eXtreme Computing Group during his PhD, studying solutions to problems in massive data. His current research interests include summarizing massive data in distributed and parallel frameworks; ranking, monitoring, and tracking big data; and scalable query processing in large databases.