# 42nd International Conference on Very Large Data Bases (VLDB) 2016: Full Program
## DAY 1: MONDAY [September 5]

| Time | Track 1 (PEARL 1): Tutorials | Track 2 (MAPLE): Tutorials | Track 3 (ROYAL 1): Workshops | Track 4 (ROYAL 2): Workshops | Track 5 (Boardroom): Workshops | Co-located Event (PEARL 2): Digital India |
|---|---|---|---|---|---|---|
| 9:00-10:30 | T1: Differential Privacy in the Wild | T2: Operational Analytics Data Management Systems | W1: TPCTC: Performance Evaluation and Benchmarking | W2: BIRTE: Business Intelligence for the Real Time Enterprise | W3: QDB: Quality in Databases | Session 1: India Stack |
| 10:30-11:00 | Tea Break (30 min) | | | | | |
| 11:00-12:30 | T1 (contd) | T2 (contd) | W1 (contd) | W2 (contd) | W3 (contd) | Session 2: Data and India Scale Challenges |
| 12:30-2:00 | Lunch Break (90 min) | | | | | |
| 2:00-3:30 | T3: Modern Main-Memory DBMS | T4: Machine Learning in the Real World | W1 (contd) | W2 (contd) | W3 (contd) | Hackathon Final Round (1:30-5:00) |
| 3:30-4:00 | Tea Break (30 min) | | | | | |
| 4:00-5:30 | T3 (contd) | T4 (contd) | W1 (contd) | W2 (contd) | W3 (contd) | |
| 5:30-6:30 | High Tea Break (60 min) | | | | | |
| 6:30-8:30 | Digital India Event: Keynote and Panel Discussion (Location: PEARL) | | | | | |
| 8:30 onwards | Mini Dinner | | | | | |
## DAY 2: TUESDAY [September 6]

| Time | Track 1 (PEARL 1): Research | Track 2 (PEARL 2): Research | Track 3 (ROYAL 1): Research | Track 4 (ROYAL 2): Industry | Track 5 (MAPLE): Demos and Posters |
|---|---|---|---|---|---|
| 8:30-9:15 | Conference Inauguration (Location: PEARL) | | | | |
| 9:15-10:30 | Keynote 1: Ion Stoica: Trends and Challenges in Big Data Processing | | | | |
| 10:30-11:15 | Tea Break (45 min) | | | | |
| 11:15-12:45 | R1: Transaction Processing | R2: RDF Data Systems | R3: Distributed and Cloud Systems -1 | I1: Distributed Data Analytics Platforms -1 | Demo 1a: Data Engines and Analytics; Research Poster 1 (R10, R11, R12, R13, R14) |
| 12:45-2:00 | Lunch Break (75 min) | | | | |
| 2:00-3:30 | R4: Memory Management | R5: Data Cleaning -1 | R6: Graph Processing -1 | I2: Distributed Data Analytics Platforms -2 | Demo 2a: Interactive and Exploratory Systems; Research Poster 2 (R15, R16, R17, R18, R19, R20) |
| 3:30-4:00 | Tea Break (30 min) | | | | |
| 4:00-5:30 | R7: Query Execution -1 | R8: Data Security and Privacy | R9: Ranking Queries | I3: Data Engine Architectures -1 | Demo 3a: Graph and Semistructured Data; Research Poster 3 (R21, R22, R23, R24, R25) |
| 5:30-6:30 | High Tea Break (60 min) | | | | |
| 6:30-8:30 | VLDB Conference Reception and Quiz Program (Location: PEARL) | | | | |
## DAY 3: WEDNESDAY [September 7]

| Time | Track 1 (PEARL 1): Research | Track 2 (PEARL 2): Research | Track 3 (ROYAL 1): Research | Track 4 (ROYAL 2): Industry and Research | Track 5 (MAPLE): Demos and Posters |
|---|---|---|---|---|---|
| 9:00-9:15 | Conference Announcements | | | | |
| 9:15-10:30 | Keynote 2: Anand Rajaraman: Data-Driven Disruption: The View from Silicon Valley | | | | |
| 10:30-11:15 | Tea Break (45 min) | | | | |
| 11:15-12:45 | R10: Query Optimization -1 | R11: Spatial Data and Queries -1 | R12: Distributed and Cloud Systems -2 | I4: Data Engine Architectures -2 | Demo 3b: Graph and Semistructured Data; Research Poster 4 (R1, R2, R3, R4, R5) |
| 12:45-2:00 | Lunch Break (75 min) | | | | |
| 2:00-3:30 | R13: Query Execution -2 | Panel: Will AI Eat Us All? | R14: Graph Processing -2 | R15: Data Cleaning -2 | Demo 1b: Data Engines and Analytics; Research Poster 5 (R6, R7, R8, R9, VLDB Journal) |
| 3:30-4:00 | Tea Break (30 min) | | | | |
| 4:00-5:30 | R16: Data and Query Models -1 | R17: Scalable Analytics -1 | R18: Database Hardware-Software Codesign | I5: Graph Systems and Analytics | Demo 2b: Interactive and Exploratory Systems; Research Poster 6 (R26, R27, R28, R29, R30) |
| 5:45-6:00 | Buses depart to Banquet Venue | | | | |
| 7:30-10:00 | VLDB Conference Banquet (Venue: Kingdom of Dreams) | | | | |
## DAY 4: THURSDAY [September 8]

| Time | Track 1 (PEARL 1): Research | Track 2 (PEARL 2): Research | Track 3 (ROYAL 1): Research | Track 4 (ROYAL 2): Tutorials | Track 5 (MAPLE): Research |
|---|---|---|---|---|---|
| 9:00-10:30 | VLDB Awards (VLDB 2016: Best Paper, Best Demo; Endowment: 10 Year, Early Career, Women Achiever); Endowment Award Talk 1: Xin Luna Dong; Endowment Award Talk 2: Mohamed F. Mokbel, Chi-Yin Chow, Walid G. Aref | | | | |
| 10:30-11:15 | Tea Break (45 min) | | | | |
| 11:15-12:45 | R19: Query Optimization -2 | R20: Spatial Data and Queries -2 | R21: Distributed and Cloud Systems -3 | T5: Human Factors in Crowdsourcing | R22: Entity Matching -1 |
| 12:45-2:00 | Lunch Break (75 min) | | | | |
| 2:00-3:30 | R23: Query Execution -3 | R24: Social Networks and Crowdsourcing | R25: Graph Processing -3 | T6: Qualitative Data Cleaning | R26: Data and Query Models -2 |
| 3:30-4:00 | Tea Break (30 min) | | | | |
| 4:00-5:30 | R27: Data and Query Models -3 | R28: Entity Matching -2 | R29: Community Search and Mining | | R30: Scalable Analytics -2 |
| 6:30-9:30 | Entertainment Program and Gala Dinner (Location: PEARL) | | | | |
## DAY 5: FRIDAY [September 9]

| Time | Track 1 (PEARL 1): Workshops | Track 2 (PEARL 2): Workshops | Track 3 (ROYAL 1): Workshops | Track 4 (ROYAL 2): Workshops | Track 5 (MAPLE): Workshops |
|---|---|---|---|---|---|
| 9:00-10:30 | W4: PhD Workshop | W5: ADMS/IMDM: Accelerating Data Management Systems / In-Memory Data Management and Analytics | W6: DMAH: Data Management and Analytics for Medicine and Healthcare | W7: SoDAM: Social Data Analytics and Management | W8: BOSS/Polyglot: Big Data Open-Source Systems (BOSS) / Polyglot |
| 10:30-11:00 | Tea Break (30 min) | | | | |
| 11:00-12:30 | W4 (contd) | W5 (contd) | W6 (contd) | W7 (contd) | W8 (contd) |
| 12:30-2:00 | Lunch Break (90 min) | | | | |
| 2:00-3:30 | W4 (contd) | W5 (contd) | W6 (contd) | | W8 (contd) |
| 3:30-4:00 | Tea Break (30 min) | | | | |
| 4:00-5:30 | W4 (contd) | W5 (contd) | W6 (contd) | | W8 (contd) |

# Monday Sep 5th, 9:00 am - 10:30 am

## Tutorial T1

### Location: Pearl 1

Differential Privacy in the Wild

Ashwin Machanavajjhala (Duke University), Xi He (Duke University), Michael Hay (Colgate University)

Abstract: Differential privacy has emerged as an important standard for privacy-preserving computation over databases containing sensitive information about individuals. Over the last decade, research on differential privacy spanning a number of areas, including theory, security, databases, networks, machine learning, and statistics, has produced a variety of privacy-preserving algorithms for many analysis tasks. Despite these maturing research efforts, adoption of differential privacy by practitioners in industry, academia, or government agencies has so far been rare. Hence, in this tutorial, we will first describe the foundations of differentially private algorithm design, covering the state of the art in private computation on tabular data. In the second half of the tutorial we will highlight real-world applications on complex data types, and identify research challenges in applying differential privacy to real-world applications.
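As a generic flavor of the field's most basic primitive (this sketch is illustrative only and not drawn from the tutorial materials; the data and function names are invented), a differentially private count can be released with the Laplace mechanism:

```python
import numpy as np

def laplace_count(data, predicate, epsilon, rng=None):
    """Release a counting-query answer with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one individual
    changes the count by at most 1), so adding noise drawn from
    Laplace(scale = 1/epsilon) satisfies epsilon-DP.
    """
    rng = rng or np.random.default_rng()
    true_count = sum(1 for row in data if predicate(row))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Example: how many records have age >= 40? (true answer: 2)
records = [{"age": 35}, {"age": 52}, {"age": 41}, {"age": 29}]
noisy = laplace_count(records, lambda r: r["age"] >= 40, epsilon=0.5)
```

Smaller `epsilon` means stronger privacy but noisier answers; practical deployments spend a fixed "privacy budget" across many such queries.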

## Tutorial T2

### Location: Maple

Operational Analytics Data Management

Alexander Boehm (SAP), Jens Dittrich (University of Saarland), Niloy Mukherjee (LinkedIn), Ippokratis Pandis (Amazon), Rajkumar Sen (MemSQL)

Abstract: As enterprise businesses become more agile and responsive to trends, sentiments, and surges, business analytics applications can no longer rely on the classic data-warehousing model of deriving insights from data at rest. The data available for analytics is only as current as the last ETL (Extract, Transform, and Load) job that moved it from the operational system to the data warehouse, and businesses can no longer afford to miss real-time insights on data that is fresh in the operational system. For example, a retail store could run analytics on transactional data to track sales and use the information to offer discounts. There is an increasing demand for database management systems that can perform real-time analytics on data that is being ingested and modified in live mainstream operational databases. In response, many commercial vendors as well as academic groups have attempted to solve the problem by combining transactional and analytical processing capabilities in the same database system; we refer to these as operational analytics systems. In this tutorial, we shall present an in-depth overview of operational analytics systems. We shall start with a discussion of the various aspects of designing such a system, ranging from data storage and indexing to query optimization and processing. We shall then present a set of representative systems in detail, highlight their individual architecture and design characteristics, and discuss several key research problems they address. This tutorial is intended for both researchers and practitioners in industry.

## Workshop W1

### Location: Royal 1

TPC Technology Conference on Performance Evaluation and Benchmarking (TPCTC)

Raghunath Nambiar (Cisco, USA), Meikel Poess (Oracle, USA)

## Workshop W2

### Location: Royal 2

Business Intelligence for the Real Time Enterprise (BIRTE)

Meichun Hsu (Hewlett-Packard), Malu Castellanos (Hewlett-Packard), Panos K Chrysanthis (University of Pittsburgh)

## Workshop W3

### Location: Boardroom

Quality in Databases (QDB)

Christoph Quix (Fraunhofer FIT, Germany), Rihan Hai (RWTH Aachen University, Germany), Hongzhi Wang (Harbin Inst. of Tech., China), Venkat N. Gudivada (East Carolina University, USA), Laure Berti (QCRI, Qatar)

# Monday Sep 5th, 2:00 pm - 3:30 pm

## Tutorial T3

### Location: Pearl 1

Modern Main-Memory Database Systems

Paul Larson (Microsoft Research), Justin Levandoski (Microsoft Research)

Abstract: This tutorial provides an overview of recent developments in main-memory database systems. With growing memory sizes and memory prices dropping by a factor of 10 every 5 years, data having a “primary home” in memory is now a reality. Main-memory databases eschew many of the traditional architectural tenets of relational database systems that were optimized for disk-resident data. Innovative approaches to fundamental issues such as concurrency control and query processing are required to unleash the full performance potential of main-memory databases. The tutorial focuses on the design issues and architectural choices that must be made when building a high-performance database system optimized for main memory: data storage and indexing, concurrency control, durability and recovery techniques, query processing and compilation, support for high availability, and the ability to support hybrid transactional and analytical workloads. These issues are illustrated with example solutions drawn from four state-of-the-art systems: H-Store/VoltDB, Hekaton, HyPer, and SAP HANA. The tutorial also covers current and future research trends.

## Tutorial T4

### Location: Maple

Machine Learning in the Real World

Vineet Chaoji (Amazon, India), Rajeev Rastogi (Amazon, India), Gourav Roy (Amazon, India)

Abstract: Machine Learning (ML) has become a mature technology that is being applied to a wide range of business problems such as web search, online advertising, product recommendations, and object recognition. As a result, it has become imperative for researchers and practitioners to have a fundamental understanding of ML concepts and practical knowledge of end-to-end modeling. This tutorial takes a hands-on approach to introducing the audience to machine learning. The first part gives a broad overview and discusses some of the key concepts within machine learning. The second part takes the audience through the end-to-end modeling pipeline for a real-world income prediction problem. The tutorial includes some hands-on exercises. If you want to follow along, you will need a laptop with at least 2 GB of RAM and Firefox or Google Chrome installed. Note that your laptop must be able to connect to the internet via Wi-Fi or your mobile data connection. We will be using Docker containers, so no specific software needs to be installed on laptops.

# Tuesday Sep 6th, 8:30 am - 9:15 am

## Conference Inauguration

### Location: Pearl

Anand Deshpande (Persistent Systems); T. M. Vijayaraman (Persistent Systems); Surajit Chaudhuri (Microsoft Research); Jayant Haritsa (Indian Institute of Science); Mukesh Mohania (IBM Research)

# Tuesday Sep 6th, 9:15 am - 10:30 am

## Keynote 1

### Chair: Surajit Chaudhuri, Microsoft Research

Trends and Challenges in Big Data Processing

Prof. Ion Stoica, University of California, Berkeley

Abstract: Almost six years ago we started the Spark project at UC Berkeley. Spark is a cluster computing engine that is optimized for in-memory processing, and unifies support for a variety of workloads, including batch, interactive querying, streaming, and iterative computations. Spark is now the most active big data project in the open-source community, and is already being used by over one thousand organizations. One of the reasons behind Spark’s success has been our early bet on the continuous increase in memory capacity and the feasibility of fitting many realistic workloads in the aggregate memory of typical production clusters. Today, we are witnessing new trends, such as Moore’s law slowing down and the emergence of a variety of computation and storage technologies, such as GPUs, FPGAs, and 3D XPoint. In this talk, I’ll discuss some of the lessons we learned in developing Spark as a unified computation platform, and the implications of today’s hardware and software trends for the development of the next generation of big data processing systems.

Bio: Ion Stoica is a Professor in the EECS Department at the University of California, Berkeley. He received his PhD from Carnegie Mellon University in 2000. He does research on cloud computing and networked computer systems. Past work includes Dynamic Packet State (DPS), the Chord DHT, the Internet Indirection Infrastructure (i3), declarative networks, replay debugging, and multi-layer tracing in distributed systems. His current research focuses on resource management and scheduling for data centers, cluster computing frameworks, and network architectures. He is an ACM Fellow and has received numerous awards, including the SIGCOMM Test of Time Award (2011) and the ACM Doctoral Dissertation Award (2001). In 2006, he co-founded Conviva, a startup commercializing technologies for large-scale video distribution, and in 2013, he co-founded Databricks, a startup commercializing technologies for big data processing.

# Tuesday Sep 6th, 11:15 am - 12:45 pm

## Research R1: Transaction Processing

### Chair: Daniel Abadi, Yale Univ.

Leveraging Lock Contention to Improve OLTP Application Performance

Cong Yan (University of Washington); Alvin Cheung (University of Washington)

Abstract: Locking is one of the predominant costs in transaction processing. While much work has focused on designing efficient concurrency control mechanisms, little has been done on understanding how transactional applications issue queries and on leveraging application semantics to improve application performance. This paper presents QURO, a query-aware compiler that automatically reorders queries in transaction code to improve performance. Observing that certain queries within a transaction are more contentious than others, as they require locking the same tuples as other concurrently executing transactions, QURO automatically changes the application such that contentious queries are issued as late as possible. We have evaluated QURO on various transaction benchmarks, and our results show that QURO-generated implementations can increase transaction throughput by up to 6.53x, while reducing transaction latency by up to 85%.
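The reordering idea can be sketched in miniature (the transaction, statement texts, and contention scores below are invented for illustration; the actual QURO compiler analyzes data dependencies in the application code rather than sorting by a score):

```python
# Hypothetical transaction: statements annotated with an estimated
# contention score (1.0 = a hot row touched by many concurrent
# transactions, 0.0 = uncontended).
txn = [
    ("UPDATE warehouse SET ytd = ytd + :amt WHERE w_id = :w", 0.9),   # hot row
    ("UPDATE customer SET balance = balance - :amt WHERE c_id = :c", 0.2),
    ("INSERT INTO history VALUES (:c, :w, :amt)", 0.0),
]

# QURO-style reordering: issue the most contentious statements as late
# as dependencies allow, so locks on hot tuples are held for a shorter
# window before commit.
reordered = sorted(txn, key=lambda stmt: stmt[1])
for sql, _score in reordered:
    print(sql)
```

With the hot-row update last, the lock on the contended tuple is acquired just before commit instead of at the start of the transaction, which is where the throughput gain comes from.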
BCC: Reducing False Aborts in Optimistic Concurrency Control with Low Cost for In-Memory Databases

Yuan Yuan (The Ohio State University); Kaibo Wang (The Ohio State University); Rubao Lee (The Ohio State University); Xiaoning Ding (New Jersey Institute of Technology); Jing Xing (Institute of Computing Technology, Chinese Academy of Sciences); Spyros Blanas (The Ohio State University); Xiaodong Zhang (The Ohio State University)

Abstract: The Optimistic Concurrency Control (OCC) method has been commonly used in in-memory databases to ensure transaction serializability: a transaction is aborted if its read set has changed during execution. This simple abort criterion causes a large proportion of false positives, leading to excessive transaction aborts. Transactions aborted false-positively (i.e., false aborts) waste system resources and can significantly degrade system throughput (by as much as 3.68x in our experiments) when data contention is intensive. Modern in-memory databases run on systems with increasingly parallel hardware and handle workloads with growing concurrency. They must efficiently deal with data contention in the presence of greater concurrency by minimizing false aborts. This paper presents a new concurrency control method named Balanced Concurrency Control (BCC), which aborts transactions more carefully than OCC does. BCC detects data-dependency patterns that can more reliably indicate unserializable transactions than the criterion used in OCC. The paper studies the design options and implementation techniques that can effectively detect data contention by identifying dependency patterns with low overhead. To test the performance of BCC, we have implemented it in Silo and compared its performance against that of the vanilla Silo system with OCC and two-phase locking (2PL). Our extensive experiments with TPC-W-like, TPC-C-like and YCSB workloads demonstrate that when data contention is intensive, BCC can increase transaction throughput by more than 3x versus OCC and more than 2x versus 2PL; meanwhile, BCC has performance comparable to OCC for workloads with low data contention.
Multi-Version Range Concurrency Control in Deuteronomy

Justin Levandoski (Microsoft Research); David Lomet (Microsoft Research); Sudipta Sengupta (Microsoft Research); Ryan Stutsman (Microsoft Research); Rui Wang (Microsoft Research)

Abstract: The Deuteronomy transactional key-value store executes millions of serializable transactions per second by exploiting multi-version timestamp-order concurrency control. However, it has not supported range operations, only individual record operations (e.g., create, read, update, delete). In this paper, we enhance our multi-version timestamp-order technique to handle range concurrency and prevent phantoms. Importantly, we maintain high performance while respecting the clean separation of duties required by Deuteronomy, where a transaction component performs purely logical concurrency control (including range support), while a data component performs data storage and management duties. Like the rest of the Deuteronomy stack, our range technique manages concurrency information in a latch-free manner. With our range enhancement, Deuteronomy can reach scan speeds of nearly 250 million records/s (more than 27 GB/s) on modern hardware, while providing serializable isolation complete with phantom prevention.
S-Store: Streaming Meets Transaction Processing

John Meehan (Brown University); Nesime Tatbul (Intel Labs and MIT); Stan Zdonik (Brown University); Cansu Aslantas (Brown University); Ugur Cetintemel (Brown University); Jiang Du (University of Toronto); Tim Kraska (Brown University); Samuel Madden (MIT); David Maier (Portland State University); Andrew Pavlo (CMU); Michael Stonebraker (MIT); Kristin Tufte (Portland State University); Hao Wang (MIT)

Abstract: Stream processing addresses the needs of real-time applications. Transaction processing addresses the coordination and safety of short atomic computations. Heretofore, these two modes of operation existed in separate, stove-piped systems. In this work, we attempt to fuse the two computational paradigms in a single system called S-Store. In this way, S-Store can simultaneously accommodate OLTP and streaming applications. We present a simple transaction model for streams that integrates seamlessly with a traditional OLTP system and provides both ACID and stream-oriented guarantees. We chose to build S-Store as an extension of H-Store, an open-source, in-memory, distributed OLTP database system. By implementing S-Store in this way, we can make use of the transaction processing facilities that H-Store already provides, and we can concentrate on the additional features that are needed to support streaming. Similar implementations could be done using other main-memory OLTP platforms. We show that we can actually achieve higher throughput for streaming workloads in S-Store than in an equivalent deployment on H-Store alone. We also show how this can be achieved within H-Store with the addition of a modest amount of new functionality. Furthermore, we compare S-Store to two state-of-the-art streaming systems, Esper and Apache Storm, and show how S-Store can sometimes exceed their performance while at the same time providing stronger correctness guarantees.

## Research R2: RDF Data Systems

### Chair: Atsuyuki Morishima, Univ. of Tsukuba

S2RDF: RDF Querying with SPARQL on Spark

Alexander Schätzle (University of Freiburg); Martin Przyjaciel-Zablocki (University of Freiburg); Simon Skilevic (University of Freiburg); Georg Lausen (University of Freiburg)

Abstract: RDF has become very popular for semantic data publishing due to its flexible and universal graph-like data model. The ever-increasing size of RDF data collections thus raises the need for scalable distributed approaches. We endorse the use of existing infrastructures for Big Data processing, like Hadoop, for this purpose. Yet SPARQL query performance is a major challenge, as Hadoop was not intentionally designed for RDF processing. Existing approaches often favor certain query pattern shapes while performance drops significantly for other shapes. In this paper, we introduce a novel relational partitioning schema for RDF data called ExtVP that uses semi-join-based preprocessing, akin to the concept of Join Indices in relational databases, to efficiently minimize query input size regardless of the query's pattern shape and diameter. Our prototype system S2RDF is built on top of Spark and uses SQL to execute SPARQL queries over ExtVP. We demonstrate its superior performance in comparison to state-of-the-art SPARQL-on-Hadoop approaches.
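A toy sketch of the semi-join reduction behind ExtVP (triples and predicate names invented; the actual system precomputes such reductions as Spark SQL tables for several join positions, not just subject-subject):

```python
# Toy vertical-partitioning (VP) tables: one (subject, object) list
# per predicate.
vp = {
    "follows": [("alice", "bob"), ("bob", "carol"), ("dave", "erin")],
    "likes":   [("alice", "post1"), ("bob", "post2")],
}

def extvp_ss(vp, p1, p2):
    """Subject-subject ExtVP table for (p1, p2): the rows of VP[p1]
    whose subject also appears as a subject in VP[p2], i.e. a semi-join
    on subject.  A query with patterns `?x p1 ?y . ?x p2 ?z` can scan
    this smaller precomputed table instead of the full VP[p1]."""
    subjects_p2 = {s for s, _ in vp[p2]}
    return [(s, o) for s, o in vp[p1] if s in subjects_p2]

print(extvp_ss(vp, "follows", "likes"))
# "dave" is dropped: he never appears as a subject of "likes"
```

The payoff is that the join input shrinks before the query runs, independent of whether the query pattern is star-, path-, or snowflake-shaped.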
Inferray: Fast In-Memory RDF Inference

Julien Subercaze (Laboratoire Hubert Curien); Christophe Gravier (Laboratoire Hubert Curien); Jules Chevalier (Laboratoire Hubert Curien); Frederique Laforest (Laboratoire Hubert Curien)

Abstract: The advent of semantic data on the Web requires efficient reasoning systems to infer RDF and OWL data. The linked nature and huge volume of the data entail efficiency and scalability challenges when designing productive inference systems. This paper presents Inferray, an implementation of RDFS, ρdf, and RDFS-Plus inference with improved performance over existing solutions. The main features of Inferray are: (1) a storage layout based on vertical partitioning that guarantees sequential access and efficient sort-merge-join inference; (2) efficient sorting of pairs of 64-bit integers using ad-hoc optimizations on MSD radix sort and a custom counting sort; (3) a dedicated temporary storage to perform efficient graph closure computation. Our measurements on synthetic and real-world datasets show improvements over competitors on RDFS-Plus, and up to several orders of magnitude for transitive closure.
RDF Graph Alignment with Bisimulation

Peter Buneman (University of Edinburgh); Sławek Staworko (University of Edinburgh, University of Lille, and LINKS, INRIA NordEurope)

Abstract: We investigate approaches to the alignment of two RDF triple stores inspired by the classical notion of graph bisimulation, and relate this to metrics that describe the accuracy of the alignment. Alignment of RDF stores is essential to understanding evolving, curated ontologies. One needs to align nodes in successive versions that are given “blank” names; moreover, a new version may switch terminology from one external ontology to another, which means that some further alignment of the identifiers of the two external ontologies is useful. We first describe a form of bisimulation based on node colorings, in which the colors define equivalence classes. We then go on to describe an estimate of the accuracy of a coloring based on (a) the similarity of literal nodes and (b) the “structural” similarity of internal nodes (URIs and blank nodes). The effectiveness of these methods is tested on two evolving data sets: an ontology described in OWL and a triple store derived from an evolving relational database. Both of these are curated resources for biologists.
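The coloring idea can be sketched with classical color refinement (a generic illustration, not the paper's alignment procedure, which additionally weighs literal and structural similarity): nodes start with one color and are repeatedly split by the multiset of their neighbors' colors until a fixpoint; nodes with equal final colors are structurally indistinguishable.

```python
def color_refine(adj):
    """adj: node -> list of out-neighbors.  Returns node -> color id at
    the fixpoint of color refinement."""
    color = {v: 0 for v in adj}
    while True:
        # New signature = old color plus sorted multiset of neighbor colors.
        sig = {v: (color[v], tuple(sorted(color[u] for u in adj[v])))
               for v in adj}
        palette = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        new_color = {v: palette[sig[v]] for v in adj}
        if new_color == color:
            return color
        color = new_color

# Two chains a->b->c and x->y->z align position by position.
g = {"a": ["b"], "b": ["c"], "c": [],
     "x": ["y"], "y": ["z"], "z": []}
colors = color_refine(g)
assert colors["a"] == colors["x"] and colors["c"] == colors["z"]
```

In the RDF setting, the equivalence classes produced this way are candidates for matching blank nodes across successive versions of a store.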
Semantic SPARQL Similarity Search Over RDF Knowledge Graphs

Weiguo Zheng (Peking University); Lei Zou (Peking University); Wei Peng (Peking University); Xifeng Yan (University of California at Santa Barbara); Shaoxu Song (Tsinghua University); Dongyan Zhao (Peking University)

Abstract: RDF knowledge graphs have attracted increasing attention in recent years. However, due to the schema-free nature of RDF data, it is very difficult for users to have full knowledge of the underlying schema. Furthermore, the same kind of information can be represented in diverse graph fragments, so it is a huge challenge to formulate complex SPARQL expressions that take the union of all the possible structures. In this paper, we propose an effective framework for accessing an RDF repository even if users have no full knowledge of the underlying schema. Specifically, given a SPARQL query, the system returns as many answers as possible that match the query under semantic similarity. To this end, we propose a systematic method to mine diverse semantically equivalent structure patterns. More importantly, incorporating both structural and semantic similarity, we are the first to propose the semantic graph edit distance, a novel similarity measure. To improve efficiency, we summarize the knowledge graph into a semantic summary graph, which supports both high-level pruning and drill-down pruning, and we devise an effective lower bound based on TA-style access to each of the candidate sets. Extensive experiments over real datasets confirm the effectiveness and efficiency of our approach.

## Research R3: Distributed and Cloud Database Systems -1

### Chair: Ashraf Aboulnaga, QCRI

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

Juwei Shi (Renmin University of China); Yunjie Qiu (IBM Research – China); Umar Farooq Minhas (IBM Research – Almaden); Limei Jiao (IBM Research – China); Chen Wang (Tsinghua University); Berthold Reinwald (IBM Research – Almaden); Fatma Ozcan (IBM Research – Almaden)

Abstract: MapReduce and Spark are two very popular open-source cluster computing frameworks for large-scale data analytics. These frameworks hide the complexity of task parallelism and fault tolerance by exposing a simple programming API to users. In this paper, we evaluate the major architectural components of the MapReduce and Spark frameworks, including the shuffle, the execution model, and caching, using a set of important analytic workloads. To conduct a detailed analysis, we developed two profiling tools: (1) we correlate the task execution plan with the resource utilization for both MapReduce and Spark, and visually present this correlation; (2) we provide a breakdown of the task execution time for in-depth analysis. Through detailed experiments, we quantify the performance differences between MapReduce and Spark, and we attribute these differences to components that are architected differently in the two frameworks. We further expose the source of these performance differences using a set of micro-benchmark experiments. Overall, our experiments show that Spark is about 2.5x, 5x, and 5x faster than MapReduce for Word Count, k-means, and PageRank, respectively. The main causes of these speedups are the efficiency of the hash-based aggregation component for combine, as well as reduced CPU and disk overheads due to RDD caching in Spark. An exception is the Sort workload, for which MapReduce is 2x faster than Spark: we show that MapReduce’s execution model is more efficient at shuffling data than Spark’s, making Sort run faster on MapReduce.
Compressed Linear Algebra for Large-Scale Machine Learning [Best Paper Award Winner]

Matthias Boehm (IBM Research – Almaden); Ahmed Elgohary (University of Maryland); Peter Haas (IBM Research – Almaden); Fred Reiss (IBM Research – Almaden); Berthold Reinwald (IBM Research – Almaden)

Abstract: Large-scale machine learning (ML) algorithms are often iterative, with repeated read-only data access and I/O-bound matrix-vector multiplications needed to converge to an optimal model. It is crucial for performance to fit the data into single-node or distributed main memory. General-purpose heavyweight and lightweight compression techniques struggle to achieve both good compression ratios and decompression speeds fast enough to enable block-wise uncompressed operations. Hence, we initiate work on compressed linear algebra (CLA), in which lightweight database compression techniques are applied to matrices, and linear algebra operations such as matrix-vector multiplication are then executed directly on the compressed representations. We contribute effective column compression schemes, cache-conscious operations, and an efficient sampling-based compression algorithm. Our experiments show that CLA achieves in-memory operation performance close to the uncompressed case and good compression ratios that allow us to fit larger datasets into available memory. We thereby obtain significant end-to-end performance improvements of up to 26x, or reduced memory requirements.
PIXIDA: Optimizing Data Parallel Jobs in Wide-Area Data Analytics

Konstantinos Kloudas (INESC-ID); Rodrigo Rodrigues (INESC-ID/IST at University of Lisbon); Nuno Preguica (NOVA LINCS / NOVA at University of Lisbon); Margarida Mamede (NOVA LINCS / NOVA at University of Lisbon)

Abstract: In the era of global-scale services, big data analytical queries are often required to process datasets that span multiple data centers (DCs). In this setting, cross-DC bandwidth is often the scarcest, most volatile, and/or most expensive resource. However, currently deployed big data analytics frameworks make no attempt to minimize the traffic traversing these links. In this paper, we present PIXIDA, a scheduler that aims to minimize data movement across resource-constrained links. To achieve this, we introduce a new abstraction called SILO, which is key to modeling PIXIDA’s scheduling goals as a graph partitioning problem. Furthermore, we show that existing graph partitioning problem formulations do not map to how big data jobs work, causing their solutions to miss opportunities for avoiding data movement. To address this, we formulate a new graph partitioning problem and propose a novel algorithm to solve it. We integrated PIXIDA into Spark, and our experiments show that, compared to existing schedulers, PIXIDA achieves a significant traffic reduction of up to ~9x on the aforementioned links.
Cümülön: Matrix-Based Data Analytics in the Cloud with Spot Instances

Botong Huang (Duke University); Nicholas Jarrett (Duke University); Shivnath Babu (Duke University); Sayan Mukherjee (Duke University); Jun Yang (Duke University)

Abstract: We describe Cümülön, a system aimed at helping users develop and deploy matrix-based data analysis programs in a public cloud. A key feature of Cümülön is its end-to-end support for so-called spot instances: machines whose market price fluctuates over time but is usually much lower than the regular fixed price. A user sets a bid price when acquiring spot instances, and loses them as soon as the market price exceeds the bid price. While spot instances can potentially save cost, they are difficult to use effectively, and run the risk of not finishing work while costing more. Cümülön provides a highly elastic computation and storage engine on top of spot instances, and offers automatic cost-based optimization of execution, deployment, and bidding strategies. Cümülön further quantifies how the uncertainty in the market price translates into the cost uncertainty of its recommendations, and allows users to specify their risk tolerance as an optimization constraint.
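The spot-instance tradeoff described above can be sketched with a toy cost model (prices and policy invented for illustration; Cümülön's optimizer reasons about price distributions and bidding strategies far more carefully):

```python
def simulate_job(bid, market_prices, work_needed, checkpoint=False):
    """Toy model of running on one spot instance.  Each hour we pay the
    market price while it stays at or below our bid; if the price
    exceeds the bid we are evicted, and without checkpointing all
    unfinished progress is lost.  Returns (hours_of_work_done, cost)."""
    done, cost = 0, 0.0
    for price in market_prices:
        if price > bid:
            if not checkpoint:
                done = 0          # eviction wipes unfinished work
            break
        done += 1
        cost += price
        if done >= work_needed:
            break                 # job finished
    return done, cost

# A low bid can mean paying for hours of work that is then lost:
done, cost = simulate_job(bid=0.20,
                          market_prices=[0.10, 0.10, 0.30, 0.10],
                          work_needed=5)
```

In this run the instance is evicted in hour three: without checkpointing the job pays for two hours yet completes nothing, which is exactly the "not finishing work while costing more" risk the abstract mentions.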

## Industrial I1: Distributed Data Analytics Platforms - 1

### Chair: Gustavo Alonso, ETH Zurich

Data Infrastructure at Flipkart (Invited Talk)

Sharad Agarwal (Flipkart)

Abstract:Data infrastructure lies at the core of the Flipkart technology stack. The e-commerce business not only deals with huge traffic, but the data it handles is also widely varied. Whether it is the behavioural or transactional data of millions of customers, a catalog with millions of products offered by thousands of sellers, or a logistics network that covers the entire nation, the challenge of effectively deriving value and building data-enabled products spans all aspects of the technology stack. Data infrastructure at Flipkart is a hosted, central platform-as-a-service that enables realtime systemic decision making and business intelligence across all product lines at Flipkart. In this session, I will talk about the technical challenges in building this highly scalable and self-serve data platform, which manages more than a petabyte of data with thousands of realtime business-critical data streams. I will also talk about the key technologies and platform abstractions that enable self-serve realtime ETL and querying capabilities on Hadoop as well as on traditional data warehouses in a seamless fashion.
Database System Support of Simulation Data

Hermano Lustosa (LNCC); Fabio Porto (LNCC); Pablo Blanco (LNCC); Patrick Valduriez (INRIA)

Abstract:Supported by increasingly efficient HPC infrastructure, numerical simulations are rapidly expanding to fields such as oil and gas, medicine and meteorology. As simulations become more precise and cover longer periods of time, they may produce files with terabytes of data that need to be efficiently analyzed. In this paper, we investigate techniques for managing such data using an array DBMS. We take advantage of multidimensional arrays, which nicely model the dimensions and variables used in numerical simulations. However, a naive approach to mapping simulation data files may lead to sparse arrays, impacting query response time, in particular when the simulation uses irregular meshes to model its physical domain. We propose efficient techniques to map coordinate values in numerical simulations to evenly distributed cells in array chunks with the use of equi-depth histograms and space-filling curves. We implemented our techniques in SciDB and, through experiments over real-world data, compared them with two other approaches: row-store and column-store DBMSs. The results indicate that multidimensional arrays and column-stores are much faster than a traditional row-store system for queries over a larger amount of simulation data. The results also help identify the scenarios where array DBMSs are most efficient, and those where they are outperformed by column-stores.
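A minimal sketch of the two ingredients named above, equi-depth histograms and a space-filling curve (here a Z-order/Morton interleaving), in illustrative Python; this assumes 2-D integer bucket indices and is not the SciDB implementation:

```python
import bisect

def equi_depth_edges(values, n_buckets):
    """Bucket boundaries such that each bucket holds roughly equally
    many of the observed coordinate values."""
    s = sorted(values)
    return [s[(i * len(s)) // n_buckets] for i in range(1, n_buckets)]

def bucket(v, edges):
    """Map a coordinate value to its equi-depth bucket index."""
    return bisect.bisect_right(edges, v)

def morton(x, y, bits=16):
    """Interleave the bits of two bucket indices into a Z-order cell id,
    so nearby (x, y) cells tend to land in the same array chunk."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z
```

Composing the two, `morton(bucket(x, ex), bucket(y, ey))`, spreads irregular mesh coordinates evenly over cells, which is the effect the paper is after.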
Hybrid Row-Column Partitioning in Teradata

Mohammed Al-Kateb (Teradata Labs); Paul Sinclair (Teradata Labs); Grace Au (Teradata Labs); Carrie Ballinger (Teradata Labs)

Abstract:Data partitioning is an indispensable ingredient of database systems due to the performance improvement it can bring to any given mixed workload. Data can be partitioned horizontally or vertically. While some commercial proprietary and open source database systems have one flavor or mixed flavors of these partitioning forms, Teradata Database offers a unique hybrid row-column store solution that seamlessly combines both of these partitioning schemes. The key feature of this hybrid solution is that either row, column, or combined partitions are all stored and handled in the same way internally by the underlying file system storage layer. In this paper, we present the main characteristics and explain the implementation approach of Teradata’s row-column store. We also discuss query optimization techniques applicable specifically to partitioned tables. Furthermore, we present a performance study that demonstrates how different partitioning options impact the performance of various queries.
Consistent Regions: Guaranteed Tuple Processing in IBM Streams

Gabriela Jacques da Silva (IBM Research – T.J. Watson); Fang Zheng (IBM Research – T.J. Watson); Daniel Debrunner (IBM); Kun-Lung Wu (IBM Research – T.J. Watson); Victor Dogaru (IBM); Eric Johnson (IBM); Michael Spicer (IBM); Ahmet Erdem Sariyuce (Sandia National Labs)

Abstract:Guaranteed tuple processing has become critically important for many streaming applications. This paper describes how we enabled IBM Streams, an enterprise-grade stream processing system, to provide data processing guarantees. Our solution goes from language-level abstractions to a runtime protocol. As a result, with a couple of simple annotations at the source code level, IBM Streams developers can define consistent regions, allowing any subgraph of their streaming application to achieve guaranteed tuple processing. At runtime, a consistent region periodically executes a variation of the Chandy-Lamport snapshot algorithm to establish a consistent global state for that region. The coupling of consistent states with data replay enables guaranteed tuple processing.
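The consistent-region idea, periodically flowing a marker through the dataflow so each operator snapshots its local state at the same logical point, can be caricatured as follows. This is a single-chain toy in Python; real Chandy-Lamport-style snapshots must also align markers arriving on multiple input channels, which this sketch omits:

```python
MARKER = object()   # the snapshot marker flowing with the data

class Op:
    """A stateful streaming operator that counts tuples."""
    def __init__(self, name):
        self.name, self.count, self.snapshot = name, 0, None

    def process(self, item):
        if item is MARKER:
            self.snapshot = self.count   # record local state at the marker
        else:
            self.count += 1
        return item                      # forward downstream

def run(stream, ops):
    for item in stream:
        for op in ops:
            item = op.process(item)

ops = [Op("source"), Op("sink")]
run(["t1", "t2", "t3", MARKER, "t4"], ops)
```

After the run, every operator's snapshot reflects the same prefix of the stream (three tuples), which is exactly the consistent global state that replay-from-snapshot relies on.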

## Demo 1a: Data Engines and Analytics

### Location: Maple

Vita: A Versatile Toolkit for Generating Indoor Mobility Data for Real-World Buildings

Huan Li (Zhejiang University); Hua Lu (Aalborg University); Xin Chen (Zhejiang University); Gang Chen (Zhejiang University); Ke Chen (Zhejiang University); Lidan Shou (Zhejiang University)

JexLog: A Sonar for the Abyss

Tobias Scheuer (SAP SE); Norman May (SAP SE); Alexander Böhm (SAP SE); Daniel Scheibli (SAP SE)

SI2P: A Restaurant Recommendation System Using Preference Queries over Incomplete Information

Xiaoye Miao (Zhejiang University); Yunjun Gao (Zhejiang University); Gang Chen (Zhejiang University); Huiyong Cui (Zhejiang University); Chong Guo (Zhejiang University); Weida Pan (Zhejiang University)

Mixed-instance querying: a lightweight integration architecture for data journalism

Raphaël Bonaque (INRIA); Tien Cao (INRIA); Bogdan Cautis (Université Paris-Sud); François Goasdoué (Université de Rennes 1); Javier Letellier (INRIA); Ioana Manolescu (INRIA); Oscar Mendoza (INRIA); Swen Ribeiro (Université Paris-Sud); Xavier Tannier (Université Paris-Sud); Michaël Thomazo (INRIA)

Precision Performance Surgery for PostgreSQL — LLVM-based Expression Compilation, Just in Time

Dennis Butterstein (University of Tuebingen); Torsten Grust (Universität Tübingen)

Partial Marking for Automated Grading of SQL Queries

Bikash Chandra (IIT Bombay); Mathew Joseph (IIT Bombay); Bharath Radhakrishnan (IIT Bombay); Shreevidhya Acharya (IIT Bombay); S. Sudarshan (IIT Bombay)

Squall: Scalable Real-time Analytics

Aleksandar Vitorovic (EPFL); Mohammed Elseidy (EPFL); Khayyam Guliyev (EPFL); Khue Vu Minh (EPFL); Daniel Espino Timón (EPFL); Mohammad Dashti (EPFL); Yannis Klonatos (EPFL); Christoph Koch (EPFL)

LocationSpark: A Distributed In-Memory Data Management System for Big Spatial Data

Mingjie Tang (Purdue University); Yongyang Yu (Purdue University); Qutaibah Malluhi (Qatar University); Mourad Ouzzani (Qatar Computing Research Institute); Walid Aref (Purdue University)

Amoeba: A Shape changing Storage System for Big Data

Anil Shanbhag (MIT); Samuel Madden (MIT); Alekh Jindal (Microsoft); Yi Lu (MIT)

F: Regression Models over Factorized Views

Dan Olteanu (University of Oxford); Maximilian Schleich (University of Oxford)

Magellan: Toward Building Entity Matching Management Systems over Data Science Stacks

Pradap Konda (University of Wisconsin Madison); Sanjib Das (University of Wisconsin Madison); Paul Suganthan G. C. (University of Wisconsin Madison); Anhai Doan (University of Wisconsin Madison); Adel Ardalan (University of Wisconsin Madison); Jeffrey Ballard (University of Wisconsin Madison); Han Li (University of Wisconsin Madison); Fatemah Panahi (University of Wisconsin Madison); Haojun Zhang (University of Wisconsin Madison); Jeffrey Naughton (University of Wisconsin Madison); Shishir Prasad (University of Wisconsin Madison); Ganesh Krishnan (WalmartLabs); Rohit Deep (WalmartLabs); Vijay Raghavendra (WalmartLabs)

Schema Independent and Scalable Relational Learning By Castor

Jose Picado (Oregon State University); Parisa Ataei (Oregon State University); Arash Termehchy (Oregon State University); Alan Fern (Oregon State University)

AD-WIRE: Add-on for Web Item Reviewing System

Rajeshkumar Kannapalli (University of Texas at Arlington); Azade Nazi (University of Texas at Arlington); Mahashweta Das (Hewlett Packard Labs); Gautam Das (University of Texas at Arlington)

# Tuesday Sep 6th, 2:00 pm - 3:30 pm

## Research R4: Memory Management

### Chair: Anastasia Ailamaki, EPFL Lausanne

RUMA has it: Rewired User-space Memory Access is Possible!

Felix Martin Schuhknecht (Saarland University); Jens Dittrich (Saarland University); Ankur Sharma (Saarland University)

Abstract:Memory management is one of the most boring topics in database research. It plays a minor role in tasks like free-space management or efficient space usage. Here and there we also realize its impact on database performance when worrying about NUMA-aware memory allocation, data compacting, snapshotting, and defragmentation. But, overall, let’s face it: the entire topic sounds as exciting as ‘garbage collection’ or ‘debugging a program for memory leaks’. What if there were a technique that would promote memory management from a third class helper thingie to a first class citizen in algorithm and systems design? What if that technique turned the role of memory management in a database system (and any other data processing system) upside-down? What if that technique could be identified as a key for re-designing various core algorithms with the effect of outperforming existing state-of-the-art methods considerably? Then we would write this paper. We introduce RUMA: Rewired User-space Memory Access. It allows for physiological data management, i.e. we allow developers to freely rewire the mappings from virtual to physical memory (in user space) while at the same time exploiting the virtual memory support offered by hardware and operating system. We show that fundamental database building blocks such as array operations, partitioning, sorting, and snapshotting benefit strongly from RUMA.
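The core trick RUMA exploits, several virtual addresses wired to the same physical memory, can be observed even from Python by mapping the same file page twice. This only demonstrates the aliasing effect; the paper's technique rewires virtual-to-physical mappings in user space via OS virtual-memory facilities:

```python
import mmap, tempfile

PAGE = mmap.PAGESIZE
f = tempfile.TemporaryFile()
f.truncate(PAGE)

# Two virtual-memory views backed by the same physical file page:
view_a = mmap.mmap(f.fileno(), PAGE, offset=0)
view_b = mmap.mmap(f.fileno(), PAGE, offset=0)

view_a[:5] = b"hello"          # a write through one mapping...
assert view_b[:5] == b"hello"  # ...is visible through the other
```

Because both views resolve to the same page, "moving" data can become remapping pointers rather than copying bytes, which is what makes rewiring attractive for partitioning and snapshotting.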
Asynchronous Memory Access Chaining

Onur Kocberber (EPFL); Babak Falsafi (EPFL); Boris Grot (University of Edinburgh)

Abstract:In-memory databases rely on pointer-intensive data structures to quickly locate data in memory. A single lookup operation in such data structures often exhibits long-latency memory stalls due to dependent pointer dereferences. Hiding the memory latency by launching additional memory accesses for other lookups is an effective way of improving performance of pointer-chasing codes (e.g., hash table probes, tree traversals). The ability to exploit such inter-lookup parallelism is beyond the reach of modern out-of-order cores due to the limited size of their instruction window. Instead, recent work has proposed software prefetching techniques that exploit inter-lookup parallelism by arranging a set of independent lookups into a group or a pipeline, and navigating their respective pointer chains in a synchronized fashion. While these techniques work well for highly regular access patterns, they break down in the face of irregularity across lookups. Such irregularity includes variable-length pointer chains, early exit, and read/write dependencies. This work introduces Asynchronous Memory Access Chaining (AMAC), a new approach for exploiting inter-lookup parallelism to hide the memory access latency. AMAC achieves high dynamism in dealing with irregularity across lookups by maintaining the state of each lookup separately from that of other lookups. This feature enables AMAC to initiate a new lookup as soon as any of the in-flight lookups complete. In contrast, the static arrangement of lookups into a group or pipeline in existing techniques precludes such adaptivity. Our results show that AMAC matches or outperforms state-of-the-art prefetching techniques on regular access patterns, while delivering up to 2.3x higher performance under irregular data structure lookups. AMAC fully utilizes the available micro-architectural resources, generating the maximum number of memory accesses allowed by hardware in both single- and multi-threaded execution modes.
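AMAC's key idea, keeping per-lookup state so a finished lookup immediately frees a slot for a new one, can be mimicked with Python generators, where each `yield` stands in for a dependent memory access. This is a conceptual sketch, not the paper's implementation:

```python
class Node:
    def __init__(self, key, value, nxt=None):
        self.key, self.value, self.next = key, value, nxt

def probe(buckets, key):
    """One chained-hash lookup; each `yield` marks a dependent pointer
    dereference that would stall a synchronous traversal."""
    node = buckets[hash(key) % len(buckets)]
    yield
    while node is not None:
        if node.key == key:
            return node.value
        node = node.next
        yield
    return None

def amac_probe(buckets, keys, window=4):
    """Keep up to `window` lookups in flight, each with its own state;
    a lookup that completes immediately frees its slot for a new one."""
    results, pending, inflight = {}, list(reversed(keys)), []
    while pending or inflight:
        while pending and len(inflight) < window:
            k = pending.pop()
            inflight.append((k, probe(buckets, k)))
        still = []
        for k, g in inflight:
            try:
                next(g)                 # advance this lookup by one access
                still.append((k, g))
            except StopIteration as done:
                results[k] = done.value
        inflight = still
    return results

# A small chained hash table for demonstration.
buckets = [None] * 4
for k in range(10):
    i = hash(k) % 4
    buckets[i] = Node(k, k * k, buckets[i])
```

Note how lookups with short chains finish early and are replaced at once, which is exactly the adaptivity that static group/pipeline prefetching lacks.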
WarpLDA: a Cache Efficient O(1) Algorithm for Latent Dirichlet Allocation

Jianfei Chen (Tsinghua University); Kaiwei Li (Tsinghua University); Jun Zhu (Tsinghua University); Wenguang Chen (Tsinghua University)

Abstract:Developing efficient and scalable algorithms for Latent Dirichlet Allocation (LDA) is of wide interest for many applications. Previous work has developed an O(1) Metropolis-Hastings (MH) sampling method for each token. However, its performance is far from being optimal due to frequent cache misses caused by random accesses to the parameter matrices. In this paper, we first carefully analyze the memory access behavior of existing algorithms for LDA by cache locality at document level. We then develop WarpLDA, which achieves O(1) time complexity per-token and fits the randomly accessed memory per-document in the L3 cache. Our empirical results in a wide range of testing conditions demonstrate that WarpLDA is consistently 5-15x faster than the state-of-the-art MH-based LightLDA, and is faster than the state-of-the-art sparsity aware F+LDA in most settings. Our WarpLDA learns a million topics from 639 million documents in only five hours at an unprecedented throughput of 11 billion tokens per second.
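O(1)-per-token samplers in this line of work commonly build on alias tables for constant-time proposal draws. The sketch below is Vose's alias-method construction in illustrative Python (not WarpLDA's code): after O(n) setup, each sample costs one uniform draw plus one coin flip:

```python
import random

def build_alias(probs):
    """Vose's alias method: O(n) construction, O(1) sampling."""
    n = len(probs)
    scaled = [p * n for p in probs]
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    prob, alias = [1.0] * n, list(range(n))
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l   # cell s: keep s w.p. prob[s], else l
        scaled[l] -= 1.0 - scaled[s]       # l donated the remainder of cell s
        (small if scaled[l] < 1.0 else large).append(l)
    return prob, alias

def draw(prob, alias, rng=random):
    i = rng.randrange(len(prob))                       # one uniform cell...
    return i if rng.random() < prob[i] else alias[i]   # ...one coin flip
```

Each table cell holds at most two outcomes, so a draw touches O(1) memory, which is why such structures pair well with the cache-conscious layouts this paper studies.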
Lifetime-Based Memory Management for Distributed Data Processing Systems

Lu Lu (Huazhong University of Science and Technology); Xuanhua Shi (Huazhong University of Science and Technology); Yongluan Zhou (University of Southern Denmark); Xiong Zhang (Huazhong University of Science and Technology); Hai Jin (Huazhong University of Science and Technology); Cheng Pei (Huazhong University of Science and Technology); Ligang He (University of Warwick); Yuanzhen Geng (Huazhong University of Science and Technology)

Abstract:In-memory caching of intermediate data and eager combining of data in shuffle buffers have been shown to be very effective in minimizing the re-computation and I/O cost in distributed data processing systems like Spark and Flink. However, it has also been widely reported that these techniques would create a large amount of long-living data objects in the heap, which may quickly saturate the garbage collector, especially when handling a large dataset, and hence would limit the scalability of the system. To eliminate this problem, we propose a lifetime-based memory management framework, which, by automatically analyzing the user-defined functions and data types, obtains the expected lifetime of the data objects, and then allocates and releases memory space accordingly to minimize the garbage collection overhead. In particular, we present Deca, a concrete implementation of our proposal on top of Spark, which transparently decomposes and groups objects with similar lifetimes into byte arrays and releases their space altogether when their lifetimes come to an end. An extensive experimental study using both synthetic and real datasets shows that, compared to Spark, Deca is able to 1) reduce the garbage collection time by up to 99.9%, 2) achieve up to a 22.7x speedup in execution time in cases without data spilling and a 41.6x speedup in cases with data spilling, and 3) consume up to 46.6% less memory.

## Research R5: Data Cleaning - 1

### Chair: Ihab Ilyas, Univ. of Waterloo

Cleaning Timestamps with Temporal Constraints

Shaoxu Song (Tsinghua University); Yue Cao (Tsinghua University); Jianmin Wang (Tsinghua University)

Abstract:Timestamps are often found to be dirty in various scenarios, e.g., in distributed systems with clock synchronization problems or unreliable RFID readers. Without cleaning the imprecise timestamps, temporal-related applications such as provenance analysis or pattern queries are not reliable. To evaluate the correctness of timestamps, temporal constraints could be employed, which declare the distance restrictions between timestamps. Guided by such constraints on timestamps, in this paper, we study a novel problem of repairing inconsistent timestamps that do not conform to the required temporal constraints. Following the same line of data repairing, the timestamp repairing problem is to minimally modify the timestamps towards satisfaction of temporal constraints. This problem is practically challenging, given the huge space of possible timestamps. We tackle the problem by identifying a concise set of promising candidates, where an optimal repair solution can always be found. Repair algorithms with efficient pruning are then devised over the identified candidates. Experiments on real datasets demonstrate the superiority of our proposal compared to the state-of-the-art approaches.
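A greedy baseline for the repair problem above, pushing each timestamp just far enough forward to satisfy a minimum-distance constraint with its predecessor, looks like this. It is not the paper's candidate-set algorithm and does not guarantee a minimal repair (an optimal repair may instead move earlier points):

```python
def repair_min_gap(timestamps, min_gap):
    """Greedy forward repair: enforce t[i] >= t[i-1] + min_gap by
    shifting violating timestamps forward as little as possible."""
    repaired = []
    for t in timestamps:
        if repaired and t < repaired[-1] + min_gap:
            t = repaired[-1] + min_gap
        repaired.append(t)
    return repaired
```

For example, `[0, 1, 10]` under a minimum gap of 5 becomes `[0, 5, 10]`, while an already-consistent sequence is left untouched.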
Repairing Data through Regular Expressions

Zeyu Li (Harbin Institute of Technology); Hongzhi Wang (Harbin Institute of Technology); Wei Shao (Harbin Institute of Technology); Jianzhong Li (Harbin Institute of Technology); Hong Gao (Harbin Institute of Technology)

Abstract:Since regular expressions are often used to detect errors in sequences such as strings or dates, it is natural to use them for data repair. Motivated by this, we propose a data repair method based on regular expressions to make the input sequence data obey the given regular expression with minimal revisions. The proposed method contains two steps, sequence repair and token value repair. For sequence repair, we propose the Regular-expression-based Structural Repair algorithm (RSR in short). RSR is a dynamic programming algorithm that utilizes Nondeterministic Finite Automata (NFA) to calculate the edit distance between the prefix of the input string and the partial pattern regular expression with time complexity $O(mn^2)$ and space complexity $O(mn)$, where $m$ is the number of the NFA's edges and $n$ is the input string length. We also give an optimization to achieve high performance. For token value repair, we combine an edit-distance-based method and association rules, with a unified argument for the selection between the two methods. Experimental results on both real and synthetic data show that the proposed method can repair the data effectively and efficiently.
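The structural-repair step can be phrased as a shortest-path search over (input position, automaton state) pairs. The sketch below uses a hand-built DFA for the toy pattern `a+b+` (an assumption for illustration; the paper's RSR is a dynamic program over an NFA):

```python
import heapq

# Hand-built DFA for the toy pattern a+b+ :
# 0 --a--> 1, 1 --a--> 1, 1 --b--> 2, 2 --b--> 2; accepting state: 2
DELTA = {(0, "a"): 1, (1, "a"): 1, (1, "b"): 2, (2, "b"): 2}
ACCEPT = {2}
ALPHABET = "ab"

def repair_cost(s):
    """Minimum number of insert/delete/substitute edits to make s match
    the DFA, computed as a shortest path over (position, state) nodes."""
    dist = {(0, 0): 0}
    pq = [(0, 0, 0)]                          # (cost, position, state)
    while pq:
        c, i, q = heapq.heappop(pq)
        if c > dist.get((i, q), float("inf")):
            continue
        if i == len(s) and q in ACCEPT:
            return c                          # consumed all input in an accept state
        moves = []
        if i < len(s):
            moves.append((i + 1, q, 1))       # delete s[i]
        for a in ALPHABET:
            if (q, a) in DELTA:
                q2 = DELTA[(q, a)]
                moves.append((i, q2, 1))      # insert a
                if i < len(s):                # keep s[i] (free) or substitute (1)
                    moves.append((i + 1, q2, 0 if s[i] == a else 1))
        for i2, q2, w in moves:
            if c + w < dist.get((i2, q2), float("inf")):
                dist[(i2, q2)] = c + w
                heapq.heappush(pq, (c + w, i2, q2))
    return None
```

For instance, `"ab"` already matches (cost 0), while the empty string needs two insertions and `"ba"` needs two edits to reach the language of `a+b+`.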
Combining Quantitative and Logical Data Cleaning

Nataliya Prokoshyna (University of Toronto); Jaroslaw Szlichta (University of Ontario Institute of Technology); Fei Chiang (McMaster University); Renée Miller (University of Toronto); Divesh Srivastava (AT&T Labs Research)

Abstract:Quantitative data cleaning relies on the use of statistical methods to identify and repair data quality problems while logical data cleaning tackles the same problems using various forms of logical reasoning over declarative dependencies. Each of these approaches has its strengths: the logical approach is able to capture subtle data quality problems using sophisticated dependencies, while the quantitative approach excels at ensuring that the repaired data has desired statistical properties. We propose a novel framework within which these two approaches can be used synergistically to combine their respective strengths. We instantiate our framework using (i) metric functional dependencies, a type of dependency that generalizes functional dependencies (FDs) to identify inconsistencies in domains where only large differences in metric data are considered to be a data quality problem, and (ii) repairs that modify the inconsistent data so as to minimize statistical distortion, measured using the Earth Mover’s Distance. We show that the problem of computing a statistical distortion minimal repair is NP-hard. Given this complexity, we present an efficient algorithm for finding a minimal repair that has a small statistical distortion using EMD computation over semantically related attributes. To identify semantically related attributes, we present a sound and complete axiomatization and an efficient algorithm for testing implication of metric FDs. While the complexity of inference for some other FD extensions is co-NP complete, we show that the inference problem for metric FDs remains linear, as in traditional FDs. We prove that every instance that can be generated by our repair algorithm is set-minimal (with no unnecessary changes). Our experimental evaluation demonstrates that our techniques obtain a considerably lower statistical distortion than existing repair techniques, while achieving similar levels of efficiency.
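Detecting metric-FD violations, groups that agree on the dependency's left-hand side but whose right-hand-side values differ by more than a threshold under some metric, is easy to sketch. Illustrative Python with hypothetical field names; the paper's repair step (minimizing Earth Mover's Distance distortion) is not shown:

```python
def metric_fd_violations(rows, lhs, rhs, delta, dist=lambda a, b: abs(a - b)):
    """Return the LHS groups whose RHS values differ by more than delta
    under the metric `dist` (a metric FD tolerates differences <= delta)."""
    groups = {}
    for row in rows:
        groups.setdefault(tuple(row[c] for c in lhs), []).append(row[rhs])
    return sorted(key for key, vals in groups.items()
                  if max(dist(a, b) for a in vals for b in vals) > delta)

rows = [{"movie": "Up", "len": 96}, {"movie": "Up", "len": 97},
        {"movie": "Up", "len": 200}, {"movie": "Cars", "len": 117}]
```

Here a classical FD movie -> len would flag 96 vs 97 as inconsistent; the metric version with delta = 5 only flags the 200-minute outlier group.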
Temporal Rules Discovery for Web Data Cleaning

Ziawasch Abedjan (MIT); Cuneyt Akcora (Qatar Computing Research Institute); Mourad Ouzzani (Qatar Computing Research Institute); Paolo Papotti (Qatar Computing Research Institute); Michael Stonebraker (MIT)

Abstract:Declarative rules, such as functional dependencies, are widely used for cleaning data. Several systems take them as input for detecting errors and computing a “clean” version of the data. To support domain experts in specifying these rules, several tools have been proposed to profile the data and mine rules. However, existing discovery techniques have traditionally ignored the time dimension. Recurrent events, such as persons reported in locations, have a duration in which they are valid, and this duration should be part of the rules or the cleaning process would simply fail. In this work, we study the rule discovery problem for temporal web data. Such a discovery process is challenging because of the nature of web data; extracted facts are (i) sparse over time, (ii) reported with delays, and (iii) often reported with errors over the values because of inaccurate sources or non-robust extractors. We handle these challenges with a new discovery approach that is more robust to noise. Our solution uses machine learning methods, such as association measures and outlier detection, for the discovery of the rules, together with an aggressive repair of the data in the mining step itself. Our experimental evaluation over real-world data from Recorded Future, an intelligence company that monitors over 700K Web sources, shows that temporal rules improve the quality of the data with an increase of the average precision in the cleaning process from 0.37 to 0.84, and a 40% relative increase in the average F-measure.

## Research R6: Graph Processing - 1

### Chair: Bin Cui, Peking Univ.

I/O Efficient ECC Graph Decomposition via Graph Reduction

Long Yuan (The University of New South Wales); Lu Qin (University of Technology, Sydney); Xuemin Lin (The University of New South Wales); Lijun Chang (The University of New South Wales); Wenjie Zhang (The University of New South Wales)

Abstract:The problem of computing k-edge connected components (k-ECCs) of a graph G for a specific k is a fundamental graph problem and has been investigated recently. In this paper, we study the problem of ECC decomposition, which computes the k-ECCs of a graph G for all k values. ECC decomposition can be widely applied in a variety of applications such as graph-topology analysis, community detection, Steiner component search, and graph visualization. A straightforward solution for ECC decomposition is to apply the existing k-ECC computation algorithm to compute the k-ECCs for all k values. However, this solution is not applicable to large graphs for two challenging reasons. First, all existing k-ECC computation algorithms are highly memory intensive due to the complex data structures used in the algorithms. Second, the number of possible k values can be very large, resulting in a high computational cost when each k value is independently considered. In this paper, we address the above challenges, and study I/O efficient ECC decomposition via graph reduction. We introduce two elegant graph reduction operators which aim to reduce the size of the graph loaded in memory while preserving the connectivity information of a certain set of edges to be computed for a specific k. We also propose three novel I/O efficient algorithms, Bottom-Up, Top-Down, and Hybrid, that explore the k values in different orders to reduce the redundant computations between different k values. We analyze the I/O and memory costs for all proposed algorithms. In our experiments, we evaluate our algorithms using seven real large datasets with various graph properties, one of which contains 1.95 billion edges. The experimental results show that our proposed algorithms are scalable and efficient.
LEOPARD: Lightweight Edge-Oriented Partitioning and Replication for Dynamic Graphs

Jiewen Huang (Yale University); Daniel Abadi (Yale University)

Abstract:This paper introduces a dynamic graph partitioning algorithm, designed for large, constantly changing graphs. We propose a partitioning framework that adjusts on the fly as the graph structure changes. We also introduce a replication algorithm that is tightly integrated with the partitioning algorithm, which further reduces the number of edges cut by the partitioning algorithm. Even though the proposed approach is handicapped by only taking into consideration local parts of the graph when reassigning vertices, extensive evaluation shows that the proposed approach maintains a quality partitioning over time, which is comparable at any point in time to performing a full partitioning from scratch using a state-of-the-art static graph partitioning algorithm such as METIS. Furthermore, when vertex replication is turned on, edge-cut can improve by an order of magnitude.
Towards Maximum Independent Sets on Massive Graphs

Yu Liu (Renmin University of China); Jiaheng Lu (University of Helsinki); Hua Yang (Renmin University of China); Xiaokui Xiao (Nanyang Technological University); Zhewei Wei (Renmin University of China)

Abstract:Maximum independent set (MIS) is a fundamental problem in graph theory and it has important applications in many areas such as social network analysis, graphical information systems and coding theory. The problem is NP-hard, and there have been numerous studies on its approximate solutions. While successful to a certain degree, the existing methods require memory space at least linear in the size of the input graph. This has become a serious concern in view of the massive volume of today’s fast-growing graphs. In this paper, we study the MIS problem under the semi-external setting, which assumes that the main memory can accommodate all vertices of the graph but not all edges. We present a greedy algorithm and a general vertex-swap framework, which swaps vertices to incrementally increase the size of independent sets. Our solutions require only a few sequential scans of the graph on disk, thus enabling in-memory computation without costly random disk accesses. Experiments on large-scale datasets show that our solutions are able to compute a large independent set for a massive graph with 59 million vertices and 151 million edges using a commodity machine, with a memory cost of 469MB and a time cost of three minutes, while yielding an approximation ratio that is around 99% of the theoretical optimum.
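A single semi-external greedy pass, one status flag per vertex in memory and the edge list streamed from disk, already yields an independent (though not necessarily maximal) set; the paper's vertex-swap framework then improves it over further scans. A minimal sketch:

```python
def one_pass_independent_set(n, edge_stream):
    """Semi-external greedy pass: keep one boolean per vertex in memory
    and stream edges once; whenever an edge has both endpoints still in
    the set, evict one of them. The result is always independent."""
    status = [True] * n          # fits in memory: one flag per vertex
    for u, v in edge_stream:     # edges arrive sequentially from disk
        if status[u] and status[v]:
            status[v] = False    # evict one endpoint (here: the second)
    return [v for v in range(n) if status[v]]
```

Independence holds because every edge was inspected with at least one endpoint subsequently excluded; choosing *which* endpoint to evict, and swapping vertices back in later passes, is where the quality gains come from.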
Weaver: A High-Performance, Transactional Graph Database Based on Refinable Timestamps

Ayush Dubey (Cornell University); Greg Hill (Stanford University); Robert Escriva (Cornell University); Emin Sirer (Cornell University)

Abstract:Graph databases have become an increasingly common infrastructure component. Yet existing systems either operate on offline snapshots, provide weak consistency guarantees, or use expensive concurrency control techniques that limit performance. In this paper, we introduce a new distributed graph database, called Weaver, which enables efficient, transactional graph analyses as well as strictly serializable ACID transactions on dynamic graphs. The key insight that allows Weaver to combine strict serializability with horizontal scalability and high performance is a novel request ordering mechanism called refinable timestamps. This technique couples coarse-grained vector timestamps with a fine-grained timeline oracle to pay the overhead of strong consistency only when needed. Experiments show that Weaver enables a Bitcoin blockchain explorer that is 8x faster than Blockchain.info, and achieves 12x higher throughput than the Titan graph database on social network workloads and 4x lower latency than GraphLab on offline graph traversal workloads.
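Refinable timestamps pair coarse-grained vector timestamps with a fine-grained oracle that is consulted only when the vectors cannot order two transactions. The vector comparison itself is simple; illustrative Python, not Weaver's code:

```python
def vc_compare(a, b):
    """Compare two vector timestamps componentwise.

    Returns 'before', 'after', 'equal', or 'concurrent'; only the
    'concurrent' case would need the fine-grained timeline oracle."""
    le = all(x <= y for x, y in zip(a, b))
    ge = all(x >= y for x, y in zip(a, b))
    if le and ge:
        return "equal"
    if le:
        return "before"
    if ge:
        return "after"
    return "concurrent"
```

The design pays the oracle's coordination cost only on the (hopefully rare) concurrent case, which is the abstract's point about paying for strong consistency only when needed.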

## Industrial I2: Distributed Data Analytics Platforms - 2

### Chair: Justin Levandoski, Microsoft

Big Data Analysis as a Service on Public Clouds (Invited talk)

Rajat Venkatesh (Qubole)

Abstract:Public clouds pose challenges (reliability, IO latency etc.) as well as provide opportunities (elasticity, pricing models) for large-volume data processing. In this talk, I will present the lessons learnt and the features (auto-scaling, hybrid clusters with spot nodes, caching etc.) implemented in Qubole’s data service to provide an efficient (cost & performance) and reliable data processing service in the top public clouds. QDS allows users to run batch or ETL, interactive and streaming workloads. The technologies used for each of these workloads are different. Apache Hive, Apache Hadoop M/R and Apache Pig are popular for batch workloads. Presto & SparkSQL are used for interactive analytic queries. Apache Spark is used for machine learning. Spark Streaming is used for streaming workloads. Apache HBase is the K-V store. As such, the design and architecture of each of these technologies is very different. We’ve focused on each of these technologies separately and augmented them to run better on the cloud as well as a service. This talk will iterate through each of these technologies, list areas where the open source distribution has gaps, and describe how QDS has closed these gaps. Since these technologies are representative in their class, the features implemented in QDS are broadly applicable. I will also provide real-world metrics for performance gains and cost benefits where applicable.
dmapply: A functional primitive to express distributed machine learning algorithms in R

Edward Ma (Hewlett-Packard Enterprise Vertica); Vishrut Gupta (Hewlett-Packard Enterprise Vertica); Meichun Hsu (Hewlett-Packard Enterprise Vertica); Indrajit Roy (Hewlett Packard Labs)

Abstract:Due to R’s popularity as a data-mining tool, many distributed systems expose an R-based API to users who need to build a distributed application in R. As a result, data scientists have to learn to use different interfaces such as RHadoop, SparkR, Revolution R’s ScaleR, and HPE’s Distributed R. Unfortunately, these interfaces are custom, non-standard, and difficult to learn. Not surprisingly, R applications written in one framework do not work in another, and each backend infrastructure has spent redundant effort in implementing distributed machine learning algorithms. Working with the members of R-core, we have created ddR (Distributed Data structures in R), a unified system that works across different distributed frameworks. In ddR, we introduce a novel programming primitive called dmapply that executes functions on distributed data structures. The dmapply primitive encapsulates different computation patterns: from function and data broadcast to pair-wise communication. We show that dmapply is powerful enough to express algorithms that fit the statistical query model, which includes many popular machine learning algorithms, as well as applications written in MapReduce. We have integrated ddR with many backends, such as R’s single-node parallel framework, multi-node SNOW framework, Spark, and HPE Distributed R, with few or no modifications to any of these systems. We have also implemented multiple machine learning algorithms which are not only portable across different distributed systems, but also have performance comparable to the “native” implementations on the backends. We believe that ddR will standardize distributed computing in R, just like the SQL interface has standardized how relational data is manipulated.
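The dmapply idea, one primitive that applies a function partition-wise across distributed objects while broadcasting extra arguments, can be sketched over plain lists of partitions. `dlist` and `dmapply` below are toy Python analogues for illustration, not the ddR API:

```python
def dlist(data, nparts):
    """Split data into `nparts` partitions (a toy distributed list)."""
    k, m = divmod(len(data), nparts)
    parts, i = [], 0
    for p in range(nparts):
        j = i + k + (1 if p < m else 0)
        parts.append(data[i:j])
        i = j
    return parts

def dmapply(func, *dobjs, more_args=()):
    """Apply func partition-wise across aligned distributed objects,
    broadcasting `more_args` to every invocation; in a real backend the
    per-partition calls would run on different workers."""
    return [func(*parts, *more_args) for parts in zip(*dobjs)]
```

For example, a partition-wise elementwise sum with a broadcast constant: `dmapply(lambda x, y, c: [a + b + c for a, b in zip(x, y)], dlist(xs, 2), dlist(ys, 2), more_args=(1,))`.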
SystemML: Declarative Machine Learning on Spark

Matthias Boehm (IBM Research – Almaden); Michael Dusenberry (IBM Spark Technology Center); Deron Eriksson (IBM Spark Technology Center); Alexandre Evfimievski (IBM Research – Almaden); Faraz Makari Manshadi (IBM Research – Almaden); Niketan Pansare (IBM Research – Almaden); Berthold Reinwald (IBM Research – Almaden); Frederick Reiss (IBM Research – Almaden); Prithviraj Sen (IBM Research – Almaden); Arvind Surve (IBM Spark Technology Center); Shirish Tatikonda (IBM Research – Almaden)

Abstract:The rising need for custom machine learning (ML) algorithms and the growing data sizes that require the exploitation of distributed, data-parallel frameworks such as MapReduce or Spark, pose significant productivity challenges to data scientists. Apache SystemML addresses these challenges through declarative ML by (1) increasing the productivity of data scientists as they are able to express custom algorithms in a familiar domain-specific language covering linear algebra primitives and statistical functions, and (2) transparently running these ML algorithms on distributed, data-parallel frameworks by applying cost-based compilation techniques to generate efficient, low-level execution plans with in-memory single-node and large-scale distributed operations. This paper describes SystemML on Apache Spark, end to end, including insights into various optimizer and runtime techniques as well as performance characteristics. We also share lessons learned from porting SystemML to Spark and declarative ML in general. Finally, SystemML is open-source, which allows the database community to leverage it as a testbed for further research.
Not for the Timid: On the Impact of Aggressive Over-booking in the Cloud

Willis Lang (Microsoft); Karthik Ramachandra (Microsoft); David DeWitt (Microsoft); Shize Xu (Microsoft); Qun Guo (Microsoft); Ajay Kalhan (Microsoft); Peter Carlin (Microsoft)

Abstract:To lower hosting costs and service prices, database-as-a-service (DBaaS) providers strive to maximize cluster utilization without negatively affecting their users’ service experience. Some of the most effective approaches for increasing service efficiency result in the over-booking of the cluster with user databases. For instance, one approach is to reclaim cluster capacity from a database when it is idle, temporarily re-using the capacity for some other purpose, and over-booking the cluster’s resources. Such approaches are largely driven by policies that determine when it is prudent to temporarily reclaim capacity from an idle database. In this paper, we examine policies that inherently tune the system’s idle sensitivity. Increased sensitivity to idleness leads to aggressive over-booking while the converse leads to conservative reclamation and lower utilization levels. Aggressive over-booking also incurs a “reserve” capacity cost (for when we suddenly “owe” capacity to previously idle databases.) We answer these key questions in this paper: (1) how to find a “good” resource reclamation policy for a given DBaaS cluster of users; and (2) how to forecast the needed near-term reserve capacity. To help us answer these questions, we used production user activity traces from Azure SQL DB and built models of an over-booking mechanism. We show that choosing the right policy can substantially boost the efficiency of the service, facilitating lower service prices via lower amortized infrastructure costs.

## Demo 2a: Interactive and Exploratory Systems

### Location: Maple

ArchimedesOne: Query Processing over Probabilistic Knowledge Bases

Xiaofeng Zhou (University of Florida); Yang Chen (University of Florida); Daisy Zhe Wang (University of Florida)

Rudolf: Interactive Rule Refinement System for Fraud Detection

Tova Milo (Tel-Aviv University); Slava Novgorodov (Tel-Aviv University); Wang-Chiew Tan (University of California at Santa Cruz)

Ziggy: Characterizing Query Results for Data Explorers

Thibault Sellam (CWI); Martin Kersten (CWI)

Blaeu: Mapping and Navigating Large Tables with Cluster Analysis

Thibault Sellam (CWI); Robin Cijvat (MonetDB Solutions); Richard Koopmanschap (MonetDB Solutions); Martin Kersten (CWI)

A Demonstration of VisDPT: Visual Exploration of Differentially Private Trajectories

Xi He (Duke University); Nisarg Raval (Duke University); Ashwin Machanavajjhala (Duke University)

YASK: A Why-Not Question Answering Engine for Spatial Keyword Query Services

Lei Chen (Hong Kong Baptist University); Jianliang Xu (Hong Kong Baptist University); Christian Jensen (Aalborg University); Yafei Li (Hong Kong Baptist University)

Exploring Databases via Reverse Engineering Ranking Queries with PALEO

Kiril Panev (TU Kaiserslautern); Sebastian Michel (TU Kaiserslautern); Evica Milchevski (TU Kaiserslautern); Koninika Pal (TU Kaiserslautern)

ExRank: An Exploratory Ranking Interface

Ramon Bespinyowong (National University of Singapore); Wei Chen (Zhejiang University); H. V. Jagadish (University of Michigan); Yuxin Ma (Zhejiang University)

NLProv: Natural Language Provenance

Daniel Deutch (Tel Aviv University); Nave Frost (Tel Aviv University); Amir Gilad (Tel Aviv University)

Towards Personalized Maps: Mining User Preferences from Geo-textual Data

Kaiqi Zhao (Nanyang Technological University); Yiding Liu (Nanyang Technological University); Quan Yuan (Nanyang Technological University); Lisi Chen (Hong Kong Baptist University); Zhida Chen (Nanyang Technological University); Gao Cong (Nanyang Technological University)

A System for Region Search and Exploration

Kaiyu Feng (Nanyang Technological University); Kaiqi Zhao (Nanyang Technological University); Yiding Liu (Nanyang Technological University); Gao Cong (Nanyang Technological University)

SigmaKB: Multiple Probabilistic Knowledge Base Fusion

Miguel Rodriguez (University of Florida); Sean Goldberg (University of Florida); Daisy Zhe Wang (University of Florida)

Collaborative Crowdsourcing with Crowd4U

Kosetsu Ikeda (University of Tsukuba); Atsuyuki Morishima (University of Tsukuba); Habibur Rahman (UT Arlington); Senjuti Basu Roy (NJIT); Saravanan Thirumuruganathan (QCRI, HBKU); Sihem Amer-Yahia (CNRS LIG); Gautam Das (UT Arlington)

# Tuesday Sep 6th, 4:00 pm - 5:30 pm

## Research R7: Query Execution -1

### Chair: Alfons Kemper, TU Munich

Parallel Evaluation of Multi-Semi-Joins

Jonny Daenen (Hasselt University); Frank Neven (Hasselt University); Tony Tan (National Taiwan University); Stijn Vansummeren (Université Libre de Bruxelles)

Abstract:While services such as Amazon AWS make computing power abundantly available, adding more computing nodes can incur high costs in, for instance, pay-as-you-go plans while not always significantly improving the net running time (aka wall-clock time) of queries. In this work, we provide algorithms for parallel evaluation of SGF queries in MapReduce that optimize total time, while retaining low net time. Not only can SGF queries specify all semi-join reducers, but also more expressive queries involving disjunction and negation. Since SGF queries can be seen as Boolean combinations of (potentially nested) semi-joins, we introduce a novel multi-semi-join (MSJ) MapReduce operator that enables the evaluation of a set of semi-joins in one job. We use this operator to obtain parallel query plans for SGF queries that outvalue sequential plans w.r.t. net time and provide additional optimizations aimed at minimizing total time without severely affecting net time. Even though the latter optimizations are NP-hard, we present effective greedy algorithms. Our experiments, conducted using our own implementation Gumbo on top of Hadoop, confirm the usefulness of parallel query plans, and the effectiveness and scalability of our optimizations, all with a significant improvement over Pig and Hive.
Lightning Fast and Space Efficient Inequality Joins

Zuhair Khayyat (King Abdullah University of Science and Technology); William Lucia (Qatar Computing Research Institute); Meghna Singh (Qatar Computing Research Institute); Mourad Ouzzani (Qatar Computing Research Institute); Paolo Papotti (Qatar Computing Research Institute); Jorge-Arnulfo Quiane-Ruiz (Qatar Computing Research Institute); Nan Tang (Qatar Computing Research Institute); Panos Kalnis (King Abdullah University of Science and Technology)

Abstract:Inequality joins, which join relational tables on inequality conditions, are used in various applications. While there has been a wide range of optimization methods for joins in database systems, from algorithms such as sort-merge join and band join, to various indices such as B+-tree, R-tree, and Bitmap, inequality joins have received little attention and queries containing such joins are usually very slow. In this paper, we introduce fast inequality join algorithms. We put columns to be joined in sorted arrays and we use permutation arrays to encode positions of tuples in one sorted array w.r.t. the other sorted array. In contrast to sort-merge join, we use space efficient bit-arrays that enable optimizations, such as Bloom filter indices, for fast computation of the join results. We have implemented a centralized version of these algorithms on top of PostgreSQL, and a distributed version on top of Spark SQL. We have compared against well known optimization techniques for inequality joins and show that our solution is more scalable and several orders of magnitude faster.
MQJoin: Efficient Shared Execution of Main-Memory Joins

Darko Makreshanski (ETH Zurich); Georgios Giannikis (Oracle Labs); Gustavo Alonso (ETH Zurich); Donald Kossmann (Microsoft Research)

Abstract:Database architectures typically process queries one-at-a-time, executing concurrent queries in independent execution contexts. Often, such a design leads to unpredictable performance and poor scalability. One approach to circumvent the problem is to take advantage of sharing opportunities across concurrently running queries. In this paper we propose Many-Query Join (MQJoin), a novel method for sharing the execution of a join that can efficiently deal with hundreds of concurrent queries. This is achieved by minimizing redundant work and making efficient use of main-memory bandwidth and multi-core architectures. Compared to existing proposals, MQJoin is able to efficiently handle larger workloads regardless of the schema by exploiting more sharing opportunities. We also compared MQJoin to two commercial main-memory column-store databases. For a TPC-H based workload, we show that MQJoin provides 2-5x higher throughput with significantly more stable response times.
Incremental Computation of Common Windowed Holistic Aggregates

Richard Wesley (Tableau Software); Fei Xu (Tableau Software)

Abstract:Windowed aggregates are a SQL 2003 feature for computing aggregates in moving windows. Common examples include cumulative sums, local maxima and moving quantiles. With the advent over the last few years of easy-to-use data analytics tools, these functions are becoming widely used by more and more analysts, but some aggregates (such as local maxima) are much easier to compute than others (such as moving quantiles). Nevertheless, aggregates that are more difficult to compute, like quantile and mode (or “most frequent”) provide more appropriate statistical summaries in the common situation when a distribution is not Gaussian and are an essential part of a data analysis toolkit. Recent work has described highly efficient windowed implementations of the most common aggregate function categories, including distributive aggregates such as cumulative sums and algebraic aggregates such as moving averages. But little has been published on either the implementation or the performance of the more complex holistic windowed aggregates such as moving quantiles. This paper provides the first in-depth study of how to efficiently implement the three most common holistic windowed aggregates (count distinct, mode and quantile) by reusing the aggregate state between consecutive frames. Our measurements show that these incremental algorithms generally achieve improvements of about 10× over naive implementations, and that they can effectively detect when to reset the internal state during extreme frame variation.

## Research R8: Data Security and Privacy

### Chair: Lei Chen, HKUST

Oblivious RAM: A Dissection and Experimental Evaluation [Experiments and Analyses]

Zhao Chang (University of Utah); Dong Xie (University of Utah); Feifei Li (University of Utah)

Abstract:Many companies choose the cloud as their data and IT infrastructure platform. The remote access of the data brings the issue of trust. Despite the use of strong encryption schemes, adversaries can still learn valuable information regarding encrypted data by observing the data access patterns. To that end, one can hide the access patterns, which may leak sensitive information, using Oblivious RAMs (ORAMs). Numerous works have proposed different ORAM constructions, but they have never been thoroughly compared against and tested on large databases. There are also no open-source implementations of these schemes. These limitations make it difficult for researchers and practitioners to choose and adopt a suitable ORAM for their applications. To address this issue, we provide a thorough study over several practical ORAM constructions, and implement them under the same library. We perform extensive experiments to provide insights into their performance characteristics with respect to efficiency, scalability, and communication cost.
Design of Policy-Aware Differentially Private Algorithms

Samuel Haney (Duke University); Ashwin Machanavajjhala (Duke University); Bolin Ding (Microsoft Research)

Abstract:The problem of designing error optimal differentially private algorithms is well studied. Recent work applying differential privacy to real world settings have used variants of differential privacy that appropriately modify the notion of neighboring databases. The problem of designing error optimal algorithms for such variants of differential privacy is open. In this paper, we show a novel transformational equivalence result that can turn the problem of query answering under differential privacy with a modified notion of neighbors to one of query answering under standard differential privacy, for a large class of neighbor definitions. We utilize the Blowfish privacy framework that generalizes differential privacy. Blowfish uses a policy graph to instantiate different notions of neighboring databases. We show that the error incurred when answering a workload W on a database x under a Blowfish policy graph G is identical to the error required to answer a transformed workload f_G(W) on database g_G(x) under standard differential privacy, where f_G and g_G are linear transformations based on G. Using this result, we develop error efficient algorithms for releasing histograms and multidimensional range queries under different Blowfish policies. We believe the tools we develop will be useful for finding mechanisms to answer many other classes of queries with low error under other policy graphs.
Behavior Query Discovery in System-Generated Temporal Graphs

Bo Zong (University of California at Santa Barbara); Xusheng Xiao (NEC Labs America, Inc); Zhichun Li (NEC Labs America, Inc.); Zhenyu Wu (NEC Labs America, Inc.); Zhiyun Qian (University of California at Riverside); Xifeng Yan (University of California at Santa Barbara); Ambuj Singh (University of California at Santa Barbara); Guofei Jiang (NEC Labs America, Inc.)

Abstract:Computer system monitoring generates huge amounts of logs that record the interaction of system entities. How to query such data to better understand system behaviors and identify potential system risks and malicious behaviors becomes a challenging task for system administrators due to the dynamics and heterogeneity of the data. System monitoring data are essentially heterogeneous temporal graphs with nodes being system entities and edges being their interactions over time. Given the complexity of such graphs, it becomes time-consuming for system administrators to manually formulate useful queries in order to examine abnormal activities, attacks, and vulnerabilities in computer systems. In this work, we investigate how to query temporal graphs and treat query formulation as a discriminative temporal graph pattern mining problem. We introduce TGMiner to mine discriminative patterns from system logs, and these patterns can be taken as templates for building more complex queries. TGMiner leverages temporal information in graphs to prune graph patterns that share similar growth trend without compromising pattern quality. Experimental results on real system data show that TGMiner is 6-32 times faster than baseline methods. The discovered patterns were verified by system experts; they achieved high precision (97%) and recall (91%).

## Research R9: Ranking Queries

### Chair: Renee Miller, Univ. of Toronto

Query Reranking As A Service

Abolfazl Asudeh (UT Arlington); Nan Zhang (George Washington University); Gautam Das (UT Arlington)

Abstract:The ranked retrieval model has rapidly become the de facto way for search query processing in client-server databases, especially those on the web. Despite the extensive efforts in the database community on designing better ranking functions/mechanisms, many such databases in practice still fail to address the diverse and sometimes contradicting preferences of users on tuple ranking, perhaps (at least partially) due to the lack of expertise and/or motivation for the database owner to design truly effective ranking functions. This paper takes a different route on addressing the issue by defining a novel *query reranking problem*, i.e., we aim to design a third-party service that uses nothing but the public search interface of a client-server database to enable the on-the-fly processing of queries with any user-specified ranking functions (with or without selection conditions), no matter if the ranking function is supported by the database or not. We analyze the worst-case complexity of the problem and introduce a number of ideas, e.g., on-the-fly indexing, domination detection and virtual tuple pruning, to reduce the average-case cost of the query reranking algorithm. We also present extensive experimental results on real-world datasets, in both offline and live online systems, that demonstrate the effectiveness of our proposed techniques.
Finding Pareto Optimal Groups: Group-based Skyline

Jinfei Liu (Emory University); Li Xiong (Emory University); Jian Pei (Simon Fraser University); Jun Luo (Lenovo); Haoyu Zhang (Emory University)

Abstract:Skyline computation, aiming at identifying a set of skyline points that are not dominated by any other point, is particularly useful for multi-criteria data analysis and decision making. Traditional skyline computation, however, is inadequate to answer queries that need to analyze not only individual points but also groups of points. To address this gap, we generalize the original skyline definition to the novel group-based skyline (G-Skyline), which represents Pareto-optimal groups that are not dominated by other groups. In order to compute G-Skyline groups consisting of k points efficiently, we present a novel structure that represents the points in a directed skyline graph and captures the dominance relationships among the points based on the first k skyline layers. We propose efficient algorithms to compute the first k skyline layers. We then present two heuristic algorithms to efficiently compute the G-Skyline groups: the point-wise algorithm and the unit group-wise algorithm, using various pruning strategies. The experimental results on the real NBA dataset and the synthetic datasets show that G-Skyline is interesting and useful, and our algorithms are efficient and scalable.
k-Regret Queries with Nonlinear Utilities

Taylor Kessler Faulkner (Denison University); Will Brackenbury (Denison University); Ashwin Lall (Denison University)

Abstract:In exploring representative databases, a primary issue has been finding accurate models of user preferences. Given this, our work generalizes the method of regret minimization as proposed by Nanongkai et al. to include nonlinear utility functions. Regret minimization is an approach for selecting k representative points from a database such that every user’s ideal point in the entire database is similar to one of the k points. This approach combines benefits of the methods top-k and skyline; it controls the size of the output but does not require knowledge of users’ preferences. Prior work with k-regret queries assumes users’ preferences to be modeled by linear utility functions. In this paper, we derive upper and lower bounds for non-linear utility functions, as these functions can better fit occurrences such as diminishing marginal returns, propensity for risk, and substitutability of preferences. To model these phenomena, we analyze a broad subset of convex, concave, and constant elasticity of substitution functions. We also run simulations on real and synthetic data to prove the efficacy of our bounds in practice.
Discovering the Skyline of Web Databases

Abolfazl Asudeh (UT Arlington); Saravanan Thirumuruganathan (UT Arlington); Nan Zhang (George Washington University); Gautam Das (UT Arlington)

Abstract:Many web databases are “hidden” behind proprietary search interfaces that enforce the top-k output constraint, i.e., each query returns at most k of all matching tuples, preferentially selected and returned according to a proprietary ranking function. In this paper, we initiate research into the novel problem of skyline discovery over top-k hidden web databases. Since skyline tuples provide critical insights into the database and include the top-ranked tuple for every possible ranking function following the monotonic order of attribute values, skyline discovery from a hidden web database can enable a wide variety of innovative third-party applications over one or multiple web databases. Our research in the paper shows that the critical factor affecting the cost of skyline discovery is the type of search interface controls provided by the website. As such, we develop efficient algorithms for the three most popular types, i.e., one-ended range, free range and point predicates, and then combine them to support web databases that feature a mixture of these types. Rigorous theoretical analysis and extensive real-world online and offline experiments demonstrate the effectiveness of our proposed techniques and their superiority over baseline solutions.

## Industrial I3: Data Engine Architectures - 1

### Chair: Wolfgang Lehner, TU Dresden

Hardware-Software Co-design for Data Management (Invited Talk)

Garret Swart (Oracle); Shasank Chavan (Oracle)

Abstract:Big is better: Large scale computing systems give access to huge amounts of data without the costs of moving data between systems. This allows for larger tables, bigger sorts, fatter graphs, and more cloud tenants sharing the same resource pool on SPARC systems that scale linearly in cost and performance from 8 to 512 cores, 64 to 4096 threads. Secure is better: Cache-line level memory access checking allows our instrumented memory allocators to manage memory at production speed while detecting bugs and reporting attacks in real time. Information Density is better: With hardware designed for scanning n-gram compressed, bit packed, dictionary and run-length encoded columnar data at full memory bandwidth, we make maximal use of every bit stored and every cache line transferred over the memory channels with no impact on performance. Fast is better: With hardware support for running database operators on specialized streaming processors, we drive the memory channels at maximum rate, freeing up power and cores for running user computations on the result of these operators. Connected is better: Integrating EDR InfiniBand on-chip and on-board with low-latency, high-throughput, one-sided networking. Portable is better: By supporting platform independent acceleration APIs inside the database we support a wide variety of acceleration techniques and give applications and query planners the information to make the best use of the available hardware. Integrated is better: By supporting and accelerating multiple storage types (In-memory, NVMe, Exadata, NFS, HDFS, Fibre Channel), data formats (row major, column major, graph, JSON, spatial, MIME, Hive), algorithms, query languages, network protocols, and hardware platforms in a single product, we optimize, share resources, increase usability and reduce the cost and the cognitive load in acquiring, storing, securing and understanding data.
In this talk, I will describe the experience that drives our microprocessor acceleration priorities, the constraints and joys of the hardware-software co-design process, the HW features that resulted, and how software engineers have exploited these features in ways we expected and ways we didn’t. The industry has accelerated linear algebra and computer graphics, and in this talk we’ll see how we apply it to data processing but with changes to match the lower compute density of the problem space.
Nitro: A Fast, Scalable In-Memory Storage Engine for NoSQL Global Secondary Index

Sarath Lakshman (Couchbase Inc.); Sriram Melkote (Couchbase Inc.); John Liang (Couchbase Inc.); Ravi Mayuram (Couchbase Inc.)

Abstract:We present Nitro, a high-performance in-memory key-value storage engine used in Couchbase 4.5 Global Secondary Indexes. The Nitro storage engine is well suited for the recent hardware trends like large amounts of memory and many CPU cores. The storage engine leverages latch-free data structures and tries to achieve linear scalability for the index read-write operations. The Nitro storage engine offers concurrent readers and writers, lightweight database snapshots, stable scan, backup and recovery operations. We integrated Nitro into the Couchbase Global Secondary Indexes (GSI) and observed significant improvement in performance compared to our disk oriented storage engine configured with the same amount of memory for buffer cache. On a 32 core machine, we observed an end-to-end GSI server insertion throughput of 1,650,000 entries/sec and index update throughput of 822,000 entries/sec. A single instance of Nitro data structure running on a 40 core machine achieved a peak insertion throughput of 4 million index entries/sec and entry lookup throughput of 10 million lookups/sec.
Aerospike: Architecture of a Real-Time Operational DBMS

V. Srinivasan (Aerospike Inc.); Brian Bulkowski (Aerospike Inc.); Wei-Ling Chu (Aerospike Inc.); Sunil Sayyaparaju (Aerospike Inc.); Andrew Gooding (Aerospike Inc.); Rajkumar Iyer (Aerospike Inc.); Ashish Shinde (Aerospike Inc.); Thomas Lopatic (Aerospike Inc.)

Abstract:In this paper, we describe the solutions developed to address key technical challenges encountered while building a distributed database system that can smoothly handle demanding real-time workloads and provide a high level of fault tolerance. Specifically, we describe schemes for the efficient clustering and data partitioning for the automatic scale out of processing across multiple nodes and for optimizing the usage of CPUs, DRAM, SSDs and networks to efficiently scale up performance on one node. The techniques described here were used to develop Aerospike (formerly Citrusleaf), a high performance distributed database system built to handle the needs of today’s interactive online services. Most real-time decision systems that use Aerospike require very high scale and need to make decisions within a strict SLA by reading from, and writing to, a database containing billions of data items at a rate of millions of operations per second with sub-millisecond latency. For over five years, Aerospike has been continuously used in over a hundred successful production deployments, as many enterprises have discovered that it can substantially enhance their user experience.
Comdb2 - Bloomberg's Highly Available Relational Database System

Alex Scotti (Bloomberg LP); Mark Hannum (Bloomberg LP); Michael Ponomarenko (Bloomberg LP); Dorin Hogea (Bloomberg LP); Akshat Sikarwar (Bloomberg LP); Mohit Khullar (Bloomberg LP); Adi Zaimi (Bloomberg LP); James Leddy (Bloomberg LP); Fabio Angius (Bloomberg LP); Rivers Zhang (Bloomberg LP); Lingzhi Deng (Bloomberg LP)

Abstract:Comdb2 is a distributed database designed for geographical replication and high availability. In contrast with the latest trends in this field, Comdb2 offers full transactional support, a standard relational model, and the expressivity of SQL. Moreover, the system allows for rich stored procedures using a dialect of Lua. Comdb2 implements a serializable system in which reads from any node always return current values. Comdb2 provides transparent High Availability through built-in service discovery and sophisticated retry logic embedded in the standard API. In addition to the relational data model, Comdb2 implements queues for publisher-to-subscriber message delivery. Queues can be combined with table triggers for time-consistent log distribution, two functionalities that are commonly needed in modern OLTP. In this paper we give an overview of our last twelve years of work. We focus on the design choices that have made Comdb2 the primary database solution within our company, Bloomberg LP (BLP).

## Demo 3a: Graph and Semistructured Data

### Location: Maple

GARUDA: A System for Large-Scale Mining of Statistically Significant Connected Subgraphs

Satyajit Bhadange (IIT Kanpur); Akhil Arora (Xerox Research Centre India); Arnab Bhattacharya (IIT Kanpur)

Generating Flexible Workloads for Graph Databases

Guillaume Bagan (CNRS LIRIS); Angela Bonifati (University of Lyon 1 & CNRS LIRIS); Radu Ciucanu (University of Oxford); George Fletcher (TU Eindhoven); Aurélien Lemay (University of Lille 3 & INRIA); Nicky Advokaat (TU Eindhoven)

Graph databases in the browser: using LevelGraph to explore New Delhi

Antonio Maccioni (Roma Tre University); Matteo Collina (NearForm)

Sapphire: Querying RDF Data Made Simple

Ahmed El-Roby (University of Waterloo); Khaled Ammar (University of Waterloo); Ashraf Aboulnaga (Qatar Computing Research Institute); Jimmy Lin (University of Waterloo)

December: A Declarative Tool for Crowd Member Selection

Yael Amsterdamer (Bar Ilan University); Tova Milo (Tel-Aviv University); Amit Somech (Tel-Aviv University); Brit Youngmann (Tel-Aviv University)

AutoG: A Visual Query Autocompletion Framework for Graph Databases

Peipei Yi (Hong Kong Baptist University); Byron Choi (Hong Kong Baptist University); Sourav Bhowmick (Nanyang Technological University); Jianliang Xu (Hong Kong Baptist University)

Exploratory Querying of Extended Knowledge Graphs

Mohamed Yahya (Max Planck Institute for Informatics); Klaus Berberich (Max Planck Institute for Informatics); Maya Ramanath (IIT Delhi); Gerhard Weikum (Max Planck Institute for Informatics)

SPARQLByE: Querying RDF data by example

Gonzalo Diaz (University of Oxford); Marcelo Arenas (PUC Chile); Michael Benedikt (University of Oxford)

Graph-based Exploration of Non-graph Datasets

Udayan Khurana (IBM Research – T.J. Watson); Srinivasan Parthasarathy (IBM Research – T.J. Watson); Deepak Turaga (IBM Research – T.J. Watson)

Rogas: A Declarative Framework for Network Analytics

Minjian Liu (The Australian National University); Qing Wang (The Australian National University)

Large-scale Complex Analytics on Semi-structured Datasets using AsterixDB and Spark

Wail Alkowaileet (CCEE at KACST and MIT); Sattam Alsubaiee (CCEE at KACST and MIT); Michael Carey (University of California at Irvine); Till Westmann (Couchbase Inc.); Yingyi Bu (Couchbase Inc.)

# Wednesday Sep 7th, 9:15 am - 10:30 am

## Keynote 2

### Chair: Jayant Haritsa, IISc Bangalore

Data-Driven Disruption: The View from Silicon Valley

Anand Rajaraman, Founding Partner, Rocketship.vc

Abstract:We live in an era where software is transforming industries, the sciences, and society as a whole. This exciting phenomenon has been described by the phrase “software is eating the world.” It is becoming increasingly apparent that data is the fuel powering software’s conquests. Data is the new disruptor. It’s hard to believe that the first decade of the Big Data era is already behind us. Silicon Valley has been at the forefront of developing and applying data-driven approaches to create disruption at many levels: infrastructure (e.g., Hadoop and Spark), capabilities (e.g., image recognition and machine translation), and killer apps (e.g., self-driving cars and messaging bots). In this talk, we first look back on the past decade and share learnings from the frontlines of data-driven disruption. Looking ahead, we then describe challenges and opportunities for the next decade. Since this has also been a personal journey, we will use examples drawn from personal experience to illustrate each point.

Bio:Anand is a Founding Partner of two Silicon Valley venture capital funds focused on early-stage technology companies: Milliways Ventures and RocketshipVC. He was the co-founder of two successful startups: Junglee (acquired by Amazon.com) and Kosmix (acquired by Walmart). At Walmart, he created and led WalMartLabs (as its Senior Vice President). As an academic, Anand’s research has focused on the intersection of database systems, the World-Wide Web, and social media. His research publications have won several awards at prestigious academic conferences, including three retrospective 10-year Best Paper awards at ACM SIGMOD, VLDB, and ICDT. He co-authored the textbook “Mining of Massive Datasets” with Jeff Ullman and Jure Leskovec.

# Wednesday Sep 7th, 11:15 am - 12:45 pm

## Research R10: Query Optimization - 1

### Chair: S Sudarshan, IIT Bombay

Faster Plan Generation through Consideration of Functional Dependencies and Keys

Marius Eich (University of Mannheim); Pit Fender (Oracle Labs); Guido Moerkotte (University of Mannheim)

Abstract:It has been a recognized fact for many years that query execution can benefit from pushing group-by operators down in the operator tree and applying them before a join. This so-called eager aggregation reduces the size(s) of the join argument(s), making join evaluation faster. Lately, the idea enjoyed a revival when it was applied to outer joins for the first time and incorporated in a state-of-the-art plan generator. However, this recent approach is highly dependent on the use of heuristics because of the exponential growth of the search space that goes along with eager aggregation. Finding an optimal solution for larger queries calls for effective optimality preserving pruning mechanisms to reduce the search space size as far as possible. By a more thorough investigation of functional dependencies and keys, we provide a set of new pruning criteria and evaluate their effectiveness with respect to the runtime and memory consumption of the resulting plan generator.
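
The eager-aggregation rewrite the abstract describes can be illustrated with a toy example (plain Python standing in for the relational operators; the tables and totals below are invented for illustration, not taken from the paper):

```python
from collections import defaultdict

# Invented toy tables: sales(cust_id, amount), customers(cust_id, region).
sales = [(1, 10), (1, 20), (2, 5), (2, 5), (3, 7)]
customers = [(1, "EU"), (2, "US")]  # cust_id 3 has no match and drops out

def lazy_plan(sales, customers):
    # Classic plan: join first, then group-by over the join result.
    cust = dict(customers)
    totals = defaultdict(int)
    for cid, amount in sales:
        if cid in cust:            # the join
            totals[cid] += amount  # the aggregation, after the join
    return dict(totals)

def eager_plan(sales, customers):
    # Eager aggregation: the group-by is pushed below the join, so the
    # join sees 3 pre-aggregated rows instead of 5 raw ones.
    pre = defaultdict(int)
    for cid, amount in sales:
        pre[cid] += amount
    cust = dict(customers)
    return {cid: total for cid, total in pre.items() if cid in cust}

assert lazy_plan(sales, customers) == eager_plan(sales, customers) == {1: 30, 2: 10}
```

The rewrite is only valid under conditions involving keys and functional dependencies, which is exactly the structure the paper exploits for its pruning criteria.
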

Exploiting Soft and Hard Correlations in Big Data Query Optimization

Hai Liu (Worcester Polytechnic Institute); Dongqing Xiao (Worcester Polytechnic Institute); Pankaj Didwania (Worcester Polytechnic Institute); Mohamed Eltabakh (Worcester Polytechnic Institute)


How Good Are Query Optimizers, Really?

Viktor Leis (Technische Universität München); Andrey Gubichev (Technische Universität München); Atanas Mirchev (Technische Universität München); Peter Boncz (CWI); Alfons Kemper (Technische Universität München); Thomas Neumann (Technische Universität München)

Abstract:Finding a good join order is crucial for query performance. In this paper, we introduce the Join Order Benchmark (JOB) and experimentally revisit the main components in the classic query optimizer architecture using a complex, real-world data set and realistic multi-join queries. We investigate the quality of industrial-strength cardinality estimators and find that all estimators routinely produce large errors. We further show that while estimates are essential for finding a good join order, query performance is unsatisfactory if the query engine relies too heavily on these estimates. Using another set of experiments that measure the impact of the cost model, we find that it has much less influence on query performance than the cardinality estimates. Finally, we investigate plan enumeration techniques comparing exhaustive dynamic programming with heuristic algorithms and find that exhaustive enumeration improves performance despite the sub-optimal cardinality estimates.
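
The cardinality-estimation errors the paper measures stem largely from the independence assumption. A small self-contained simulation (the correlated data here is invented for illustration) shows how multiplying per-predicate selectivities goes wrong:

```python
import random

random.seed(42)
# Invented, correlated columns: b copies a 90% of the time.
rows = []
for _ in range(10_000):
    a = random.randint(0, 9)
    b = a if random.random() < 0.9 else random.randint(0, 9)
    rows.append((a, b))

n = len(rows)
sel_a = sum(1 for a, b in rows if a == 3) / n
sel_b = sum(1 for a, b in rows if b == 3) / n
true_sel = sum(1 for a, b in rows if a == 3 and b == 3) / n

independent_est = sel_a * sel_b  # what a textbook optimizer assumes
print(f"true {true_sel:.3f} vs independent {independent_est:.3f} "
      f"({true_sel / independent_est:.1f}x off)")  # roughly 9x here
```

Errors of this kind compound multiplicatively across joins, which is why the paper finds that all industrial estimators routinely produce large errors on correlated real-world data.
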

Teaching an RDBMS about ontological constraints

Damian Bursztyn (INRIA & LIX); François Goasdoué (University of Rennes 1 and INRIA); Ioana Manolescu (INRIA & LIX)

Abstract:In the presence of an ontology, query answers must reflect not only data explicitly present in the database, but also implicit data, which holds due to the ontology, even though it is not present in the database. A large and useful set of ontologies enjoys FOL reducibility of query answering, that is: answering a query q can be reduced to evaluating a certain first-order logic (FOL) formula (obtained from the query and ontology) against only the explicit facts. We present a novel query optimization framework for ontology-based data access settings enjoying FOL reducibility. Our framework is based on searching within a set of alternative equivalent FOL queries, i.e., FOL reformulations, one with minimal evaluation cost when evaluated through a relational database system. We apply this framework to the DL-LiteR Description Logic underpinning the W3C’s OWL2 QL ontology language, and demonstrate through experiments its performance benefits when two leading SQL systems, one open-source and one commercial, are used for evaluating the FOL query reformulations.

## Research R11: Spatial Data and Queries - 1

### Chair: Amr El Abbadi, UC Santa Barbara

Processing and Optimizing Main Memory Spatial-Keyword Queries

Taesung Lee (Yonsei University); Jin-Woo Park (POSTECH); Sanghoon Lee (POSTECH); Seung-won Hwang (Yonsei University); Sameh Elnikety (Microsoft Research); Yuxiong He (Microsoft Research)

Abstract:Important cloud services rely on spatial-keyword queries, containing a spatial predicate and arbitrary boolean keyword queries. In particular, we study the processing of such queries in main memory to support short response times. In contrast, current state-of-the-art spatial-keyword indexes and relational engines are designed for different assumptions. Rather than building a new spatial-keyword index, we employ a cost-based optimizer to process these queries using a spatial index and a keyword index. We address several technical challenges to achieve this goal. We introduce three operators as the building blocks to construct plans for main memory query processing. We then develop a cost model for the operators and query plans. We introduce five optimization techniques that efficiently reduce the search space and produce a query plan with low cost. The optimization techniques are computationally efficient, and they identify a query plan with a formal approximation guarantee under the common independence assumption. Furthermore, we extend the framework to exploit interesting orders. We implement the query optimizer to empirically validate our proposed approach using real-life datasets. The evaluation shows that the optimizations provide significant reduction in the average and tail latency of query processing: 7- to 11-fold reduction over using a single index in terms of 99th percentile response time. In addition, this approach outperforms existing spatial-keyword indexes and DBMS query optimizers for both average and high-percentile response times.

Spatial Online Sampling and Aggregation

Lu Wang (Hong Kong University of Science and Technology); Robert Christensen (University of Utah); Feifei Li (University of Utah); Ke Yi (Hong Kong University of Science and Technology)

Abstract:The massive adoption of smart phones and other mobile devices has generated humongous amount of spatial and spatio-temporal data. The importance of spatial analytics and aggregation is ever-increasing. An important challenge is to support interactive exploration over such data. However, spatial analytics and aggregation using all data points that satisfy a query condition is expensive, especially over large data sets, and could not meet the needs of interactive exploration. To that end, we present novel indexing structures that support spatial online sampling and aggregation on large spatial and spatio-temporal data sets. In spatial online sampling, random samples from the set of spatial (or spatio-temporal) points that satisfy a query condition are generated incrementally in an online fashion. With more and more samples, various spatial analytics and aggregations can be performed in an online, interactive fashion, with estimators that have better accuracy over time. Our design works well for both memory-based and disk-resident data sets, and scales well towards different query and sample sizes. More importantly, our structures are dynamic, hence, they are able to deal with insertions and deletions efficiently. Extensive experiments on large real data sets demonstrate the improvements achieved by our indexing structures compared to other baseline methods.
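
The paper's contribution is index support for such sampling; the baseline idea of drawing a uniform sample online, in one pass over the qualifying points, can be sketched with classic reservoir sampling (this is a generic textbook baseline, not the paper's structure):

```python
import random

def reservoir_sample(stream, k, rng=None):
    """One-pass uniform sample of size k over a stream, O(k) memory:
    keep the first k items, then replace a random slot with probability
    k/i for the i-th arrival."""
    rng = rng or random.Random(0)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# Invented stream: points that satisfied some spatial predicate.
qualifying = ((x, x % 7) for x in range(100_000))
sample = reservoir_sample(qualifying, 100)
print(len(sample))  # 100
```

The catch, and the reason the paper builds dedicated indexes, is that this baseline must scan every qualifying point per query, whereas index-assisted sampling can emit samples incrementally without a full scan.
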

Online Minimum Matching in Real-Time Spatial Data: Experiments and Analysis [Experiments and Analyses]

Yongxin Tong (Beihang University); Jieying She (Hong Kong University of Science and Technology); Bolin Ding (Microsoft Research); Lei Chen (Hong Kong University of Science and Technology); Tianyu Wo (Beihang University); Ke Xu (Beihang University)

Abstract:Recently, with the development of mobile Internet and smartphones, the online minimum bipartite matching in real-time spatial data (OMBM) problem becomes popular. Specifically, given a set of service providers with specific locations and a set of users who dynamically appear one by one, the OMBM problem is to find a maximum-cardinality matching with minimum total distance following that once a user appears, s/he must be immediately matched to an unmatched service provider, which cannot be revoked, before subsequent users arrive. To address this problem, existing studies mainly focus on analyzing the worst-case competitive ratios of the proposed online algorithms, but study on the performance of the algorithms in practice is absent. In this paper, we present a comprehensive experimental comparison of the representative algorithms of the OMBM problem. Particularly, we observe a surprising result that the simple and efficient greedy algorithm, which has been considered as the worst due to its exponential worst-case competitive ratio, is significantly more effective than other algorithms. We investigate the results and further show that the competitive ratio of the worst case of the greedy algorithm is actually just a constant, 3.195, in the average-case analysis. We try to clarify a 24-year misunderstanding towards the greedy algorithm and justify that the greedy algorithm is not bad at all. Finally, we provide a uniform implementation for all the algorithms of the OMBM problem and clarify their strengths and weaknesses, which can guide practitioners to select appropriate algorithms for various scenarios.
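
The greedy algorithm whose practical behavior the paper vindicates is simple to state: match each arriving user to the nearest still-unmatched provider, irrevocably. A minimal sketch (the coordinates below are invented):

```python
import math

def greedy_online_matching(providers, users):
    """providers: list of (x, y) locations; users arrive one at a time.
    Each user is irrevocably matched to the nearest free provider.
    Returns (user_index, provider_index) pairs."""
    free = set(range(len(providers)))
    matching = []
    for u, loc in enumerate(users):
        if not free:
            break
        best = min(free, key=lambda p: math.dist(loc, providers[p]))
        free.discard(best)          # irrevocable: provider leaves the pool
        matching.append((u, best))
    return matching

# Invented instance: three providers on a line, two arriving users.
providers = [(0, 0), (5, 0), (10, 0)]
users = [(4, 0), (1, 0)]
print(greedy_online_matching(providers, users))  # [(0, 1), (1, 0)]
```

Note the online flavor in the example: the first user grabs the provider at (5, 0) even though, in hindsight, an offline optimum would pair the users differently; the paper's point is that on average such losses are small.
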

AQWA: Adaptive Query-Workload-Aware Partitioning of Big Spatial Data

Ahmed M. Aly (Purdue University); Ahmed R. Mahmood (Purdue University); Mohamed S. Hassan (Purdue University); Walid G. Aref (Purdue University); Mourad Ouzzani (Qatar Computing Research Institute); Hazem Elmeleegy (Turn Inc.); Thamir Qadah (Purdue University)

Abstract:The unprecedented spread of location-aware devices has resulted in a plethora of location-based services in which huge amounts of spatial data need to be efficiently processed by large-scale computing clusters. Existing cluster-based systems for processing spatial data employ static data-partitioning structures that cannot adapt to data changes, and that are insensitive to the query workload. Hence, these systems are incapable of consistently providing good performance. To close this gap, we present AQWA, an adaptive and query-workload-aware mechanism for partitioning large-scale spatial data. AQWA does not assume prior knowledge of the data distribution or the query workload. Instead, as data is consumed and queries are processed, the data partitions are incrementally updated. With extensive experiments using real spatial data from Twitter, and various workloads of range and k-nearest-neighbor queries, we demonstrate that AQWA can achieve an order of magnitude enhancement in query performance compared to the state-of-the-art systems.

## Research R12: Distributed and Cloud Systems - 1

### Chair: Krithi Ramamritham, IIT Bombay

High-Speed Query Processing over High-Speed Networks

Wolf Roediger (Technische Universität München); Tobias Muehlbauer (Technische Universität München); Alfons Kemper (Technische Universität München); Thomas Neumann (Technische Universität München)

Abstract:Modern database clusters entail two levels of networks: connecting CPUs and NUMA regions inside a single server in the small and multiple servers in the large. The huge performance gap between these two types of networks used to slow down distributed query processing to such an extent that a cluster of machines actually performed worse than a single many-core server. The increased main-memory capacity of the cluster remained the sole benefit of such a scale-out. The economic viability of high-speed interconnects such as InfiniBand has narrowed this performance gap considerably. However, InfiniBand’s higher network bandwidth alone does not improve query performance as expected when the distributed query engine is left unchanged. The scalability of distributed query processing is impaired by TCP overheads, switch contention due to uncoordinated communication, and load imbalances resulting from the inflexibility of the classic exchange operator model. This paper presents the blueprint for a distributed query engine that addresses these problems by considering both levels of networks holistically. It consists of two parts: First, hybrid parallelism that distinguishes local and distributed parallelism for better scalability in both the number of cores as well as servers. Second, a novel communication multiplexer tailored for analytical database workloads using remote direct memory access (RDMA) and low-latency network scheduling for high-speed communication with almost no CPU overhead. An extensive evaluation within the HyPer database system using the TPC-H benchmark shows that our holistic approach indeed enables high-speed query processing over high-speed networks.

The End of Slow Networks: It’s Time for a Redesign

Carsten Binnig (Brown University); Andrew Crotty (Brown University); Alex Galakatos (Brown University); Tim Kraska (Brown University); Erfan Zamanian (Brown University)

Abstract:The next generation of high-performance networks with remote direct memory access (RDMA) capabilities requires a fundamental rethinking of the design of distributed in-memory DBMSs. These systems are commonly built under the assumption that the network is the primary bottleneck and should be avoided at all costs, but this assumption no longer holds. For instance, with InfiniBand FDR 4x, the bandwidth available to transfer data across the network is in the same ballpark as the bandwidth of one memory channel. Moreover, RDMA transfer latencies continue to rapidly improve as well. In this paper, we first argue that traditional distributed DBMS architectures cannot take full advantage of high-performance networks and suggest a new architecture to address this problem. Then, we discuss initial results from a prototype implementation of our proposed architecture for OLTP and OLAP, showing remarkable performance improvements over existing designs.

Tempo: Robust and Self-Tuning Resource Management in Multi-tenant Parallel Databases

Zilong Tan (Duke University); Shivnath Babu (Duke University)

Abstract:Multi-tenant database systems have a component called the Resource Manager, or RM, that is responsible for allocating resources to tenants. RMs today do not provide direct support for performance objectives such as: “Average job response time of tenant A must be less than two minutes”, or “No more than 5% of tenant B’s jobs can miss the deadline of 1 hour.” Thus, DBAs have to tinker with the RM’s low-level configuration settings to meet such objectives. We propose a framework called Tempo that brings simplicity, self-tuning, and robustness to existing RMs. Tempo provides a simple interface for DBAs to specify performance objectives declaratively, and optimizes the RM configuration settings to meet these objectives. Tempo has a solid theoretical foundation which gives key robustness guarantees. We report experiments done on Tempo using production traces of data-processing workloads from companies such as Facebook and Cloudera. These experiments demonstrate significant improvements in meeting desired performance objectives over RM configuration settings specified by human experts.

Measuring and Optimizing Distributed Array Programs

Mingxing Zhang (Tsinghua University); Yongwei Wu (Tsinghua University); Kang Chen (Tsinghua University); Teng Ma (Tsinghua University); Weimin Zheng (Tsinghua University)

Abstract:Nowadays, there is a rising trend of building array-based distributed computing frameworks. By taking advantage of highly optimized array/matrix operations, these frameworks have achieved significant improvement on performance for many important algorithms. However, due to the absence of a comprehensive optimizer, most of these frameworks execute each primitive in an isolated manner and in the exact order defined by programmers, which implies a huge space for optimization. In this paper, we propose a novel array-based programming model, named as KASEN, which distinguishes itself from models in the existing literature by defining a strict computation and communication model. This model makes it easy to analyze programs’ behavior and measure their performance, with which we design a corresponding optimizer that can automatically apply many high-level optimizations to the original programs written by programmers. According to our evaluation, KASEN is sufficient for implementing many important machine learning and graph algorithms, and it can be extended very easily. More importantly, the optimizer of KASEN can achieve significant reduction on memory read/write, buffer allocation and network traffic. The speedup achieved by our optimizer is up to 5.82X for real-world and synthetic datasets.

## Industrial I4: Data Engine Architectures - 2

### Chair: Spyros Blanas, Ohio State Univ.

Kodiak: Leveraging Materialized Views For Very Low-Latency Analytics Over High-Dimensional Web-Scale Data

Shaosu Liu (Turn Inc.); Bin Song (Turn Inc.); Sriharsha Gangam (Turn Inc.); Lawrence Lo (Turn Inc.); Khaled Elmeleegy (Turn Inc.)

Abstract:Turn’s online advertising campaigns produce petabytes of data. This data is composed of trillions of events, e.g. impressions, clicks, etc., spanning multiple years. In addition to a timestamp, each event includes hundreds of fields describing the user’s attributes, campaign’s attributes, attributes of where the ad was served, etc. Advertisers need advanced analytics to monitor their running campaigns’ performance, as well as to optimize future campaigns. This involves slicing and dicing the data over tens of dimensions over arbitrary time ranges. Many of these queries need to power the web portal to provide reports and dashboards. For an interactive response time, they have to have tens of milliseconds latency. At Turn’s scale of operations, no existing system was able to deliver this performance in a cost effective manner. Kodiak, a distributed analytical data platform for web-scale high-dimensional data, was built to serve this need. It relies on precomputations to materialize thousands of views to serve these advanced queries. These views are partitioned and replicated across Kodiak’s storage nodes for scalability and reliability. They are system maintained as new events arrive. At query time, the system auto-selects the most suitable view to serve each query. Kodiak has been used in production for over a year. It hosts 2490 views for over three petabytes of raw data serving over 200K queries daily. It has median and 99% query latencies of 8 ms and 252 ms respectively. Our experiments show that its query latency is 3 orders of magnitude faster than leading big data platforms on head-to-head comparisons using Turn’s query workload. Moreover, Kodiak uses 4 orders of magnitude less resources to run the same workload.

Accelerating Analytics with Dynamic In-Memory Expressions

Aurosish Mishra (Oracle America); Shasank Chavan (Oracle America); Allison Holloway (Oracle America); Tirthankar Lahiri (Oracle America); Zhen Hua Liu (Oracle America); Sunil Chakkappen (Oracle America); Dennis Lui (Oracle America); Vinita Subramanian (Oracle America); Ramesh Kumar (Oracle America); Maria Colgan (Oracle America); Jesse Kamp (Oracle America); Niloy Mukherjee (LinkedIn America); Vineet Marwah (Oracle America)

Abstract:Oracle Database In-Memory (DBIM) accelerates analytic workload performance by orders of magnitude through an in-memory columnar format utilizing techniques such as SIMD vector processing, in-memory storage indexes, and optimized predicate evaluation and aggregation. With Oracle Database 12.2, Database In-Memory is further enhanced to accelerate analytic processing through a novel lightweight mechanism known as Dynamic In-Memory Expressions (DIMEs). The DIME mechanism automatically detects frequently occurring expressions in a query workload, and then creates highly optimized, transactionally consistent, in-memory columnar representations of these expression results. At runtime, queries can directly access these DIMEs, thus avoiding costly expression evaluations. Furthermore, all the optimizations introduced in DBIM can apply directly to DIMEs. Since DIMEs are purely in-memory structures, no changes are required to the underlying tables. We show that DIMEs can reduce query elapsed times by several orders of magnitude without the need for costly pre-computed structures such as computed columns or materialized views or cubes.

TrafficDB: HERE's High Performance Shared-Memory Data Store

Ricardo Fernandes (HERE Global B.V.); Piotr Zaczkowski (HERE Global B.V.); Bernd Göttler (HERE Global B.V.); Conor Ettinoffe (HERE Global B.V.); Anis Moussa (HERE Global B.V.)

Abstract:HERE’s traffic-aware services enable route planning and traffic visualisation on web, mobile and connected car applications. These services process thousands of requests per second and require efficient ways to access the information needed to provide a timely response to end-users. The characteristics of road traffic information and these traffic-aware services require storage solutions with specific performance features. A route planning application utilising traffic congestion information to calculate the optimal route from an origin to a destination might hit a database with millions of queries per second. However, existing storage solutions are not prepared to handle such volumes of concurrent read operations, as well as to provide the desired vertical scalability. This paper presents TrafficDB, a shared-memory data store, designed to provide high rates of read operations, enabling applications to directly access the data from memory. Our evaluation demonstrates that TrafficDB handles millions of read operations and provides near-linear scalability on multi-core machines, where additional processes can be spawned to increase the systems’ throughput without a noticeable impact on the latency of querying the data store. The paper concludes with a description of how TrafficDB improved the performance of our traffic-aware services running in production.

The MemSQL Query Optimizer: A modern optimizer for real-time analytics in a distributed database

Jack Chen (MemSQL Inc.); Samir Jindel (MemSQL Inc.); Robert Walzer (MemSQL Inc.); Rajkumar Sen (MemSQL Inc.); Nika Jimsheleishvilli (MemSQL Inc.); Michael Andrews (MemSQL Inc.)

Abstract:Real-time analytics on massive datasets has become a very common need in many enterprises. These applications require not only rapid data ingest, but also quick answers to analytical queries operating on the latest data. MemSQL is a distributed SQL database designed to exploit memory-optimized, scale-out architecture to enable real-time transactional and analytical workloads which are fast, highly concurrent, and extremely scalable. Many analytical queries in MemSQL’s customer workloads are complex queries involving joins, aggregations, sub-queries, etc. over star and snowflake schemas, often ad-hoc or produced interactively by business intelligence tools. These queries often require latencies of seconds or less, and therefore require the optimizer to not only produce a high quality distributed execution plan, but also produce it fast enough so that optimization time does not become a bottleneck. In this paper, we describe the architecture of the MemSQL Query Optimizer and the design choices and innovations which enable it to quickly produce highly efficient execution plans for complex distributed queries. We discuss how query rewrite decisions oblivious of distribution cost can lead to poor distributed execution plans, and argue that to choose high-quality plans in a distributed database, the optimizer needs to be distribution-aware in choosing join plans, applying query rewrites, and costing plans. We discuss methods to make join enumeration faster and more effective, such as a rewrite-based approach to exploit bushy joins in queries involving multiple star schemas without sacrificing optimization time. We demonstrate the effectiveness of the MemSQL optimizer over queries from the TPC-H benchmark and a real customer workload.

## Demo 3b: Graph and Semistructured Data

### Location: Maple

(Same as in Demo 3a)

# Wednesday Sep 7th, 2:00 pm - 3:30 pm

## Research R13: Query Execution - 2

### Chair: Wolfgang Lehner, TU Dresden

A Seven-Dimensional Analysis of Hashing Methods and its Implications on Query Processing

Stefan Richter (Saarland University); Victor Alvarez (Saarland University); Jens Dittrich (Saarland University)

Abstract:Hashing is a solved problem. It allows us to get constant time access for lookups. Hashing is also simple. It is safe to use an arbitrary method as a black box and expect good performance, and optimizations to hashing can only improve it by a negligible delta. Why are all of the previous statements plain wrong? That is what this paper is about. In this paper we thoroughly study hashing for integer keys and carefully analyze the most common hashing methods in a five-dimensional requirements space: (1) data-distribution, (2) load factor, (3) dataset size, (4) read/write-ratio, and (5) un/successful-ratio. Each point in that design space may potentially suggest a different hashing scheme, and additionally also a different hash function. We show that a right or wrong decision in picking the right hashing scheme and hash function combination may lead to significant difference in performance. To substantiate this claim, we carefully analyze two additional dimensions: (6) five representative hashing schemes (which includes an improved variant of Robin Hood hashing), (7) four important classes of hash functions widely used today. That is, we consider 20 different combinations in total. Finally, we also provide a glimpse about the effect of table memory layout and the use of SIMD instructions. Our study clearly indicates that picking the right combination may have considerable impact on insert and lookup performance, as well as memory footprint. A major conclusion of our work is that hashing should be considered a white box before blindly using it in applications, such as query processing. Finally, we also provide a strong guideline about when to use which hashing method.
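
As a flavor of the design space being compared, here is one of the cheap hash-function classes such studies typically include, multiplicative hashing for integer keys (the constant and parameters below are standard textbook choices, not taken from the paper):

```python
# Knuth-style multiplicative hashing: multiply by a 64-bit odd constant
# derived from the golden ratio and keep the top p bits of the product.
GOLDEN = 0x9E3779B97F4A7C15  # floor(2**64 / golden ratio), a common choice

def mult_hash(key, p):
    """Hash a 64-bit integer key into one of 2**p buckets."""
    return ((key * GOLDEN) & 0xFFFFFFFFFFFFFFFF) >> (64 - p)

# Sequential keys -- which cluster badly under naive modulo hashing on a
# power-of-two table -- spread across the bucket space here:
buckets = [mult_hash(k, 4) for k in range(8)]
print(buckets)
assert len(set(buckets)) == len(buckets)  # all 8 keys land in distinct buckets
```

The paper's point is that no single such choice wins everywhere: the best (scheme, function) pair shifts with data distribution, load factor, size, and read/write mix.
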

Query-Aware Locality-Sensitive Hashing for Approximate Nearest Neighbor Search

Qiang Huang (Sun Yat-sen University); Jianlin Feng (Sun Yat-sen University); Yikai Zhang (Sun Yat-sen University); Qiong Fang (South China University of Technology); Wilfred Ng (Hong Kong University of Science and Technology)

Abstract:Locality-Sensitive Hashing (LSH) and its variants are the well-known indexing schemes for the problem of c-Approximate Nearest Neighbor (c-ANN) search in high-dimensional Euclidean space. Traditionally, LSH functions are constructed in a query-oblivious manner in the sense that buckets are partitioned before any query arrives. However, objects closer to a query may be partitioned into different buckets, which is undesirable. Due to the use of query-oblivious bucket partition, the state-of-the-art LSH schemes for external memory, namely C2LSH and LSB-Forest, only work with approximation ratio of integer c >= 2. In this paper, we introduce a novel concept of query-aware bucket partition which uses a given query as the “anchor” for bucket partition. Accordingly, a query-aware LSH function is a random projection coupled with query-aware bucket partition, which removes random shift required by traditional query-oblivious LSH functions. Notably, query-aware bucket partition can be easily implemented so that query performance is guaranteed. We propose a novel query-aware LSH scheme named QALSH for c-ANN search over external memory. Our theoretical studies show that QALSH enjoys guarantee on query quality. The use of query-aware LSH function enables QALSH to work with any approximation ratio c > 1. Extensive experiments on four real datasets show that QALSH outperforms C2LSH and LSB-Forest. Specifically, by using a ratio c < 2, QALSH can achieve much better query quality.
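
The core idea of query-aware bucket partition can be sketched as follows: project points onto a random line ahead of time, and only at query time carve out the bucket, centered on the query's own projection, so no random shift is baked into the hash function. A simplified single-projection illustration (dimensions, widths, and data are invented; QALSH itself uses many projections and collision counting):

```python
import random

random.seed(1)
dim, w = 8, 4.0
a = [random.gauss(0, 1) for _ in range(dim)]  # the random projection line

def project(v):
    return sum(ai * vi for ai, vi in zip(a, v))

# Invented data: 1000 random points and one query in [0, 10]^8.
points = [[random.uniform(0, 10) for _ in range(dim)] for _ in range(1000)]
query = [5.0] * dim

q = project(query)
# Query-aware bucket: the interval of width w is anchored at the query's
# own projection, so near neighbors cannot be split across a fixed
# bucket boundary the way they can with query-oblivious partitioning.
candidates = [p for p in points if abs(project(p) - q) <= w / 2]
print(f"{len(candidates)} of {len(points)} points share the query's bucket")
```

The contrast with a query-oblivious scheme is that there the interval boundaries are fixed (with a random shift) before the query arrives, so a true neighbor just across a boundary is missed by that projection.
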

Neighbor-Sensitive Hashing

Yongjoo Park (University of Michigan); Michael Cafarella (University of Michigan); Barzan Mozafari (University of Michigan)

Abstract:Approximate kNN (k-nearest neighbor) techniques using binary hash functions are among the most commonly used approaches for overcoming the prohibitive cost of performing exact kNN queries. However, the success of these techniques largely depends on their hash functions’ ability to distinguish kNN items; that is, the kNN items retrieved based on data items’ hashcodes, should include as many true kNN items as possible. A widely-adopted principle for this process is to ensure that similar items are assigned to the same hashcode so that the items with the hashcodes similar to a query’s hashcode are likely to be true neighbors. In this work, we abandon this heavily-utilized principle and pursue the opposite direction for generating more effective hash functions for kNN tasks. That is, we aim to increase the distance between similar items in the hashcode space, instead of reducing it. Our contribution begins by providing theoretical analysis on why this revolutionary and seemingly counter-intuitive approach leads to a more accurate identification of kNN items. Our analysis is followed by a proposal for a hashing algorithm that embeds this novel principle. Our empirical studies confirm that a hashing algorithm based on this counter-intuitive idea significantly improves the efficiency and accuracy of state-of-the-art techniques.
LSH Ensemble: Internet Scale Domain Search

Erkang Zhu (University of Toronto); Fatemeh Nargesian (University of Toronto); Ken Pu (UOIT); Renée Miller (University of Toronto)

Abstract:We study {\it the problem of domain search} where a domain is a set of distinct values from an unspecified universe. We use Jaccard set containment score, defined as $|Q \cap X|/|Q|$, as the measure of relevance of a domain $X$ to a query domain $Q$. Our choice of Jaccard set containment over Jaccard similarity as a measure of relevance makes our work particularly suitable for searching Open Data and data on the web, as Jaccard similarity is known to have poor performance over sets with large differences in their domain sizes. We demonstrate that the domains found in several real-life Open Data and web data repositories show a power-law distribution over their domain sizes. We present a new index structure, Locality Sensitive Hashing (LSH) Ensemble, that solves the domain search problem using set containment at Internet scale. Our index structure and search algorithm cope with the data volume and skew by means of data sketches using Minwise Hashing and domain partitioning. Our index structure does not assume a prescribed set of data values. We construct a cost model that describes the accuracy of LSH Ensemble with any given partitioning. This allows us to formulate the data partitioning for LSH Ensemble as an optimization problem. We prove that there exists an {\it optimal} partitioning for any data distribution. Furthermore, for datasets following a power-law distribution, as observed in Open Data and Web data corpora, we show that the optimal partitioning can be approximated using equi-depth, making it particularly efficient to use in practice. We evaluate our algorithm using real data (Canadian Open Data and WDC Web Tables) containing over 262 million domains. The experiments demonstrate that our index consistently outperforms other leading alternatives in accuracy and performance. The improvements are most dramatic for data with large skew in the domain sizes. Even at 262 million domains, our index sustains query performance with under 3 seconds response time.
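The containment measure itself is a one-liner; a minimal sketch contrasting it with plain Jaccard similarity (function names are ours, not the paper's):

```python
def containment(query, domain):
    # Jaccard set containment |Q ∩ X| / |Q|: the fraction of the
    # query domain Q that is covered by the candidate domain X.
    q, x = set(query), set(domain)
    return len(q & x) / len(q) if q else 0.0

def jaccard(query, domain):
    # Plain Jaccard similarity |Q ∩ X| / |Q ∪ X|, for contrast: it
    # penalizes candidates much larger than the query, which
    # containment deliberately does not.
    q, x = set(query), set(domain)
    return len(q & x) / len(q | x) if q | x else 0.0
```

A small query domain fully covered by a huge candidate domain scores 1.0 under containment but near 0 under Jaccard similarity, which is why the paper prefers containment for skewed, power-law-sized Open Data domains.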

## Panel

### Location: Pearl 2

Will AI Eat Us All?

Ihab F. Ilyas (Professor, Cheriton School of Computer Science, University of Waterloo), H. V. Jagadish (University of Michigan), Sihem Amer-Yahia (CNRS), Aditya Parameswaran (University of Illinois (UIUC)), Sunita Sarawagi (IIT Bombay)

## Research R14: Graph Processing - 2

### Chair: Kyuseok Shim, Seoul National Univ.

G-SQL: Fast Query Processing via Graph Exploration

Hongbin Ma (eBay); Bin Shao (Microsoft Research Asia); Yanghua Xiao (Fudan University); Liang Jeff Chen (Microsoft Research Asia); Haixun Wang (Facebook)

Abstract:A lot of real-life data is of graph nature. However, it is not until recently that businesses have begun to exploit data’s connectedness for business insights. On the other hand, RDBMSs are a mature technology for data management, but they are not designed for graph processing. Take graph traversal, a common graph operation, for example: it heavily relies on a graph primitive that accesses a given node’s neighborhood. If an RDBMS is used to manage graph data, we need to join tables following foreign keys to access the nodes in a neighborhood. Graph exploration is a fundamental building block of many graph algorithms, but this simple operation is costly due to the large volume of I/O caused by the massive number of table joins. In this paper, we present G-SQL, our first effort towards the integration of an RDBMS and a native graph processing engine (Trinity). G-SQL leverages fast graph exploration provided by Trinity’s memory cloud to answer multi-way join queries, and it uses RDBMSs to provide mature data management functionalities, including reliable data storage, additional data access methods, etc. Specifically, G-SQL is a SQL dialect augmented with graph exploration functionalities, and it dispatches query tasks to its underlying RDBMS and the Trinity graph engine. The G-SQL runtime coordinates the two query processors and uses a unified cost model to ensure that the entire query is processed efficiently. Experimental results show that our approach greatly expands the capabilities of RDBMSs and delivers exceptional performance for SQL-graph hybrid queries.
Fully Dynamic Betweenness Centrality Maintenance on Massive Networks

Takanori Hayashi (The University of Tokyo); Takuya Akiba (National Institute of Informatics); Yuichi Yoshida (National Institute of Informatics)

Abstract:Measuring the relative importance of each vertex in a network is one of the most fundamental building blocks in network analysis. Among several importance measures, betweenness centrality, in particular, plays key roles in many real applications. Considerable effort has been made for developing algorithms for static settings. However, real networks today are highly dynamic and are evolving rapidly, and scalable dynamic methods that can instantly reflect graph changes into centrality values are required. In this paper, we present the first fully dynamic method for managing betweenness centrality of all vertices in a large dynamic network. Its main data structure is the weighted hyperedge representation of shortest paths called hypergraph sketch. We carefully design dynamic update procedure with theoretical accuracy guarantee. To accelerate updates, we further propose two auxiliary data structures called two-ball index and special-purpose reachability index. Experimental results using real networks demonstrate its high scalability and efficiency. In particular, it can reflect a graph change in less than a millisecond on average for a large-scale web graph with 106M vertices and 3.7B edges, which is several orders of magnitude larger than the limits of previous dynamic methods.
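For reference, the betweenness centrality of a vertex $v$ is $C_B(v) = \sum_{s \neq v \neq t} \sigma_{st}(v)/\sigma_{st}$, where $\sigma_{st}$ counts shortest $s$-$t$ paths and $\sigma_{st}(v)$ those passing through $v$. A brute-force sketch of that definition follows (nothing like the paper's dynamic hypergraph sketches, which are what make billion-edge graphs feasible; names are ours):

```python
from collections import deque
from itertools import permutations

def shortest_path_counts(graph, s):
    # BFS from s on an unweighted graph: distances and the number of
    # shortest paths sigma[v] from s to every reachable v.
    dist, sigma = {s: 0}, {s: 1}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                q.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma

def betweenness(graph, v):
    # sigma_st(v) = sigma_s(v) * sigma_v(t) whenever v lies on a
    # shortest s-t path, i.e. dist(s,v) + dist(v,t) == dist(s,t).
    score = 0.0
    for s, t in permutations(graph, 2):
        if v in (s, t):
            continue
        dist_s, sig_s = shortest_path_counts(graph, s)
        dist_v, sig_v = shortest_path_counts(graph, v)
        if t not in dist_s or v not in dist_s or t not in dist_v:
            continue
        if dist_s[v] + dist_v[t] == dist_s[t]:
            score += sig_s[v] * sig_v[t] / sig_s[t]
    return score
```

On the path a-b-c, vertex b lies on both ordered pairs (a,c) and (c,a), giving it score 2, while the endpoints score 0.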
Parallel Local Graph Clustering

Julian Shun (University of California at Berkeley); Farbod Roosta-Khorasani (University of California at Berkeley); Kimon Fountoulakis (University of California at Berkeley); Michael Mahoney (University of California at Berkeley)

Abstract:Graph clustering has many important applications in computing, but due to the growing sizes of graphs, even traditionally fast clustering methods such as spectral partitioning can be computationally expensive for real-world graphs of interest. Motivated partly by this, so-called local algorithms for graph clustering have received significant interest due to the fact that they can find good clusters in a graph with work proportional to the size of the cluster rather than that of the entire graph. This feature has proven to be crucial in making such graph clustering and many of its downstream applications efficient in practice. While local clustering algorithms are already faster than traditional algorithms that touch the entire graph, there is an opportunity to make them even more efficient via parallelization. Existing local clustering algorithms are sequential. In this paper, we show how to parallelize many of these algorithms in the shared-memory multicore setting, and we analyze the parallel complexity of these algorithms. We present comprehensive experiments on large-scale graphs showing that the parallel algorithms achieve good parallel speedups on a modern multicore machine, thus significantly speeding up the analysis of local graph clusters in the very large-scale setting.
KCore Decomposition for Large Networks on a Single PC

Wissam Khaouid (University of Victoria); Marina Barsky (University of Victoria); Venkatesh Srinivasan (University of Victoria); Alex Thomo (University of Victoria)

Abstract:Studying the topology of a network is critical to inferring underlying dynamics such as tolerance to failure, group behavior and spreading patterns. k-core decomposition is a well-established metric which partitions a graph into layers from external to more central vertices. In this paper we aim to explore whether k-core decomposition of large networks can be computed using a consumer-grade PC. We feature implementations of the “vertex-centric” distributed protocol introduced by Montresor, De Pellegrini and Miorandi on GraphChi and Webgraph. Also, we present an accurate implementation of the Batagelj and Zaversnik algorithm for k-core decomposition in Webgraph. With our implementations, we show that we can efficiently handle networks of billions of edges using a single consumer-level machine within reasonable time and can produce excellent approximations in only a fraction of the execution time. To the best of our knowledge, our biggest graphs are considerably larger than the graphs considered in the literature. Next, we present an optimized implementation of an external-memory algorithm (EMcore) by Cheng, Ke, Chu, and Özsu. We show that this algorithm also performs well for large datasets; however, it cannot predict whether a given memory budget is sufficient for a new dataset. We present a thorough analysis of all algorithms, concluding that it is viable to compute k-core decomposition for large networks on a consumer-grade PC.
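The peeling idea behind the Batagelj and Zaversnik algorithm fits in a few lines. The sketch below is a quadratic toy version for an adjacency-dict graph; their actual algorithm bucket-sorts vertices by degree to run in linear time:

```python
def core_numbers(graph):
    # Peeling: repeatedly remove a minimum-degree vertex. The core
    # number of v is the largest minimum degree seen up to the moment
    # v is removed, i.e. the largest k such that v survives in the k-core.
    deg = {v: len(nbrs) for v, nbrs in graph.items()}
    alive = set(graph)
    core, k = {}, 0
    while alive:
        v = min(alive, key=lambda u: deg[u])
        k = max(k, deg[v])
        core[v] = k
        alive.remove(v)
        for u in graph[v]:
            if u in alive:
                deg[u] -= 1
    return core
```

For a triangle with one pendant vertex attached, the pendant has core number 1 and the triangle vertices core number 2.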

## Research R15: Data Cleaning - 2

### Chair: Panos Chrysanthis, Univ. of Pittsburgh

Messing Up with BART: Error Generation for Evaluating Data-Cleaning Algorithms

Patricia Arocena (University of Toronto); Boris Glavic (Illinois Institute of Technology); Giansalvatore Mecca (Università della Basilicata); Renée Miller (University of Toronto); Paolo Papotti (Qatar Computing Research Institute); Donatello Santoro (Università della Basilicata)

Abstract:We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.
Error Detection: Where are we and what needs to be done? [Experiments and Analyses]

Ziawasch Abedjan (MIT); Xu Chu (University of Waterloo); Dong Deng (Tsinghua University); Raul Castro Fernandez (MIT); Ihab Ilyas (University of Waterloo); Mourad Ouzzani (Qatar Computing Research Institute); Paolo Papotti (Arizona State University); Michael Stonebraker (MIT); Nan Tang (Qatar Computing Research Institute)

Abstract:Data cleaning has played a critical role in ensuring data quality for enterprise applications. Naturally, there has been extensive research in this area, and many data cleaning algorithms have been translated into tools to detect and to possibly repair certain classes of errors such as outliers, duplicates, missing values, and violations of integrity constraints. Since different types of errors may coexist in the same dataset, we often need to run more than one kind of tool. In this paper, we investigate two pragmatic questions: (1) are these tools robust enough to capture most errors in real-world data sets? and (2) what is the best strategy to holistically run multiple tools to optimize the detection effort? To answer these two questions, we obtained multiple data cleaning tools that utilize a variety of error detection techniques. We also collected five real-world data sets, for which we could obtain both the raw data and the ground truth on existing errors. In this paper, we report our experimental findings on the errors detected by the tools we tested. First, we show that the coverage of each tool is well below 100%. Second, we show that the order in which multiple tools are run makes a big difference. Hence, we propose a holistic multi-tool strategy that orders the invocations of the available tools to maximize their benefit, while minimizing human effort in verifying results. Third, since this holistic approach still does not lead to acceptable error coverage, we discuss two simple strategies that have the potential to improve the situation, namely domain specific tools and data enrichment. We close this paper by reasoning about the errors that are not detectable by any of the tools we tested.
ActiveClean: Interactive Data Cleaning For Statistical Modeling

Sanjay Krishnan (University of California at Berkeley); Jiannan Wang (Simon Fraser University); Eugene Wu (Columbia University); Michael Franklin (University of California at Berkeley); Ken Goldberg (University of California at Berkeley)

Abstract:Analysts often clean dirty data iteratively: cleaning some data, executing the analysis, and then cleaning more data based on the results. We explore the iterative cleaning process in the context of statistical model training, which is an increasingly popular form of data analytics. The challenge is that training high-dimensional statistical models on partially clean data lacks guarantees, and the iterative cleaning process may actually diverge. We propose ActiveClean, which allows for progressive and iterative cleaning in statistical modeling problems while preserving convergence guarantees. ActiveClean supports a popular class of models called convex loss models (e.g., linear regression and SVMs), and prioritizes cleaning those records likely to affect the results. We evaluate ActiveClean on four real-world datasets (UCI Adult, UCI EEG, MNIST, and Dollars for Docs) with both real and synthetic errors. Our results suggest that our proposed optimizations can improve model accuracy by up to 2.5x for the same amount of data cleaned. Furthermore, for a fixed cleaning budget and on all real dirty datasets, ActiveClean returns more accurate models than uniform sampling and Active Learning.
META: An Efficient Matching-Based Method for Error-Tolerant Autocompletion

Dong Deng (Tsinghua University); Guoliang Li (Tsinghua University); He Wen (Tsinghua University); H. V. Jagadish (University of Michigan); Jianhua Feng (Tsinghua University)

Abstract:Autocompletion has been widely adopted in many computing systems because it can instantly provide users with results as users type in queries. Since the typing task is tedious and prone to error, especially on mobile devices, a recent trend is to tolerate errors in autocompletion. Existing error-tolerant autocompletion methods build a trie to index the data, utilize the trie index to compute the trie nodes that are similar to the query, called active nodes, and identify the leaf descendants of active nodes as the results. However these methods have two limitations. First, they involve many redundant computations to identify the active nodes. Second, they do not support top-k queries. To address these problems, we propose a matching-based framework, which computes the answers based on matching characters between queries and data. We design a compact tree index to maintain active nodes in order to avoid the redundant computations. We devise an incremental method to efficiently answer top-k queries. Experimental results on real datasets show that our method outperforms state-of-the-art approaches by 1-2 orders of magnitude.

Same as Demo 1a

## Research Posters 5

### Location: Maple

Decorating the Cloud: Enabling Annotation Management in MapReduce Infrastructure

Yue Lu (Worcester Polytechnic Institute); Yuguan Li (Worcester Polytechnic Institute); Mohamed Eltabakh (Worcester Polytechnic Institute)

S3-TM: scalable streaming short text matching

Fuat Basik (Bilkent University); Bugra Gedik (Bilkent University); Hakan Ferhatosmanoglu (Bilkent University); Mert Emin Kalender (Bilkent University)

Elite: an Elastic Infrastructure for Big Spatiotemporal Trajectories

Xike Xie (Aalborg University); Benjin Mei (Renmin University of China); Jinchuan Chen (Renmin University of China); Xiaoyong Du (Renmin University of China); Christian Jensen (Aalborg University)

Accelerating SPARQL Queries by Exploiting Hash-based Locality and Adaptive Partitioning

Razen Harbi (King Abdullah University of Science and Technology); Ibrahim Abdelaziz (King Abdullah University of Science and Technology); Panos Kalnis (King Abdullah University of Science and Technology); Nikos Mamoulis (University of Ioannina); Yasser Ebrahim (King Abdullah University of Science and Technology); Majed Sahli (King Abdullah University of Science and Technology)

Know Your Customer: Computing k-Most Promising Products for Targeted Marketing

Md. Saiful Islam (Swinburne University of Technology); Chengfei Liu (Swinburne University of Technology)

Bitwise Dimensional Co-Clustering for Analytical Workloads

Stephan Baumann (Technische Universität Ilmenau); Peter Boncz (CWI); Kai-Uwe Sattler (Technische Universität Ilmenau)

# Wednesday Sep 7th, 4:00 pm - 5:30 pm

## Research R16: Data and Query Models - 1

### Chair: Yanlei Diao, Ecole Polytechnique

SlimShot: In-Database Probabilistic Inference for Knowledge Bases

Eric Gribkoff (University of Washington); Dan Suciu (University of Washington)

Abstract:Increasingly large Knowledge Bases are being created, by crawling the Web or other corpora of documents, and by extracting facts and relations using machine learning techniques. To manage the uncertainty in the data, these KBs rely on probabilistic engines based on Markov Logic Networks (MLN), for which probabilistic inference remains a major challenge. Today’s state of the art systems use variants of MCMC, which have no theoretical error guarantees, and, as we show, suffer from poor performance in practice. In this paper we describe SlimShot (Scalable Lifted Inference and Monte Carlo Sampling Hybrid Optimization Technique), a probabilistic inference engine for knowledge bases. SlimShot converts the MLN to a tuple-independent probabilistic database, then uses a simple Monte Carlo-based inference, with three key enhancements: (1) it combines sampling with safe query evaluation, (2) it estimates a conditional probability by jointly computing the numerator and denominator, and (3) it adjusts the proposal distribution based on the sample cardinality. In combination, these three techniques allow us to give formal error guarantees, and we demonstrate empirically that SlimShot outperforms today’s state of the art probabilistic inference engines used in knowledge bases.
Scalable Package Queries in Relational Database Systems

Matteo Brucato (University of Massachusetts Amherst); Juan Beltran (New York University Abu Dhabi); Azza Abouzied (New York University Abu Dhabi); Alexandra Meliou (University of Massachusetts Amherst)

Abstract:Traditional database queries follow a simple model: they define constraints that each tuple in the result must satisfy. This model is computationally efficient, as the database system can evaluate the query conditions on each tuple individually. However, many practical, real-world problems require a collection of result tuples to satisfy constraints collectively, rather than individually. In this paper, we present package queries, a new query model that extends traditional database queries to handle complex constraints and preferences over answer sets. We develop a full-fledged package query system, implemented on top of a traditional database engine. Our work makes several contributions. First, we design PaQL, a SQL-based query language that supports the declarative specification of package queries. We prove that PaQL is at least as expressive as integer linear programming, and therefore, evaluation of package queries is in general NP-hard. Second, we present a fundamental evaluation strategy that combines the capabilities of databases and constraint optimization solvers to derive solutions to package queries. The core of our approach is a set of translation rules that transform a package query to an integer linear program. Third, we introduce an offline data partitioning strategy allowing query evaluation to scale to large data sizes. Fourth, we introduce SketchRefine, a scalable algorithm for package evaluation, with strong approximation guarantees (a $(1 \pm \epsilon)^6$-factor approximation). Finally, we present extensive experiments over real-world and benchmark data. The results demonstrate that SketchRefine is effective at deriving high-quality package results, and achieves runtime performance that is an order of magnitude faster than directly using ILP solvers over large datasets.
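To make the package semantics concrete, here is a brute-force evaluator for one global constraint (maximize total value subject to a cost budget over the whole answer set, not per tuple). A real PaQL query would instead be translated to an integer linear program and handed to a solver; all names below are illustrative, not the paper's API:

```python
from itertools import combinations

def best_package(tuples, budget, value, cost):
    # Exhaustively search subsets of the input tuples for the package
    # with maximum total value whose total cost fits the budget.
    # Exponential on purpose: this is the semantics, not the method.
    best, best_val = (), float('-inf')
    for n in range(len(tuples) + 1):
        for combo in combinations(tuples, n):
            c = sum(cost(t) for t in combo)
            v = sum(value(t) for t in combo)
            if c <= budget and v > best_val:
                best, best_val = combo, v
    return list(best)
```

With tuples ('a', cost 3, value 4), ('b', 2, 3), ('c', 4, 5) and budget 5, the best package is {a, b} with value 7, even though no single tuple constraint could express "total cost at most 5".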
Explaining Query Answers with Explanation-Ready Databases

Sudeepa Roy (Duke University); Laurel Orr (University of Washington); Dan Suciu (University of Washington)

Abstract:With the increased generation and availability of big data in different domains, there is an imminent requirement for data analysis tools that are able to 'explain' the trends and anomalies obtained from this data to a range of users with different backgrounds. Wu-Madden (PVLDB 2013) and Roy-Suciu (SIGMOD 2014) recently proposed solutions that can explain interesting or unexpected answers to simple aggregate queries in terms of predicates on attributes. In this paper, we propose a generic framework that can support much richer, insightful explanations by preparing the database offline, so that top explanations can be found interactively at query time. The main idea in such 'explanation-ready databases' is to pre-compute the effects of potential explanations (called 'interventions'), and efficiently re-evaluate the original query taking into account these effects. We formalize this notion and define an 'explanation-query' that can evaluate all possible explanations simultaneously without having to run an iterative process, develop algorithms and optimizations, and evaluate our approach with experiments on real data.
A framework for annotating CSV-like data

Marcelo Arenas (PUC Chile); Francisco Maturana (PUC Chile); Cristian Riveros (PUC Chile); Domagoj Vrgoc (PUC Chile)

Abstract:In this paper, we propose a simple and expressive framework for adding metadata to CSV documents and their noisy variants. The framework is based on annotating parts of the document that can be later used to read, query, or exchange the data. The core of our framework is a language based on extended regular expressions that are used for selecting data. These expressions are then combined using a set of rules in order to annotate the data. We study the computational complexity of implementing our framework and present an efficient evaluation algorithm that runs in time proportional to its output and linear in its input. As a proof of concept, we test an implementation of our framework against a large number of real world datasets and show that it can be efficiently used in practice.

## Research R17: Scalable Analytics - 1

### Chair: Matthias Boehm, IBM Research

A General-Purpose Query-Centric Framework for Querying Big Graphs [Innovative Systems and Applications]

Da Yan (The Chinese University of Hong Kong); James Cheng (The Chinese University of Hong Kong); Tamer Özsu (University of Waterloo); Fan Yang (The Chinese University of Hong Kong); Yi Lu (The Chinese University of Hong Kong); John C.S. Lui (The Chinese University of Hong Kong); Qizhen Zhang (The Chinese University of Hong Kong); Wilfred Ng (The Hong Kong University of Science and Technology)

Abstract:Pioneered by Google’s Pregel, many distributed systems have been developed for large-scale graph analytics. These systems expose the user-friendly “think like a vertex” programming interface to users, and exhibit good horizontal scalability. However, these systems are designed for tasks where the majority of graph vertices participate in computation, but are not suitable for processing light-workload graph queries where only a small fraction of vertices need to be accessed. The programming paradigm adopted by these systems can seriously under-utilize the resources in a cluster for graph query processing. In this work, we develop a new open-source system, called Quegel, for querying big graphs, which treats queries as first-class citizens in the design of its computing model. Users only need to specify the Pregel-like algorithm for a generic query, and Quegel processes light-workload graph queries on demand using a novel superstep-sharing execution model to effectively utilize the cluster resources. Quegel further provides a convenient interface for constructing graph indexes, which significantly improve query performance but are not supported by existing graph-parallel systems. Our experiments verified that Quegel is highly efficient in answering various types of graph queries and is up to orders of magnitude faster than existing systems.
Distance-based Outlier Detection in Data Streams [Experiments and Analyses]

Luan Tran (University of Southern California); Liyue Fan (University of Southern California); Cyrus Shahabi (University of Southern California)

Abstract:Continuous outlier detection in data streams has important applications in fraud detection, network security, and public health. The arrival and departure of data objects in a streaming manner impose new challenges for outlier detection algorithms, especially in time and space efficiency. In the past decade, several studies have been performed to address the problem of distance-based outlier detection in data streams (DODDS), which adopts an unsupervised definition and does not have any distributional assumption on data values. Our work is motivated by the lack of comparative evaluation among the state-of-the-art algorithms using the same dataset selections and on the same platform. We systematically evaluate the five most recent exact algorithms for DODDS under various stream settings and outlier rates. Our extensive results show that in most settings, the MCOD algorithm provides superior performance among all algorithms, including the most recent algorithm Thresh_LEAP.
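The underlying distance-based definition is simple: a point is an outlier in the current window if it has fewer than k neighbors within radius r. A naive per-point check follows (ignoring the sliding-window indexing that the surveyed algorithms are actually about; the function name is ours):

```python
def is_outlier(point, window, r, k):
    # Distance-based outlier test: count neighbors of `point` within
    # Euclidean distance r among the other points in the window, and
    # flag it as an outlier if fewer than k are found.
    neighbors = 0
    for other in window:
        if other is point:
            continue
        d = sum((a - b) ** 2 for a, b in zip(point, other)) ** 0.5
        if d <= r:
            neighbors += 1
            if neighbors >= k:
                return False  # enough neighbors: an inlier
    return True
```

The streaming difficulty is that every arrival and expiration can flip this verdict for many points at once, which is what structures like MCOD's micro-clusters are designed to handle efficiently.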
SEEDB: Efficient Data-Driven Visualization Recommendations to Support Visual Analytics

Manasi Vartak (MIT); Sajjadur Rahman (University of Illinois); Samuel Madden (MIT); Aditya Parameswaran (University of Illinois at Urbana-Champaign); Neoklis Polyzotis (Google)

Abstract:Data analysts often build visualizations as the first step in their analytical workflow. However, when working with high-dimensional datasets, identifying visualizations that show relevant or desired trends in data can be laborious. We propose SEEDB, a visualization recommendation engine to facilitate fast visual analysis: given a subset of data to be studied, SEEDB intelligently explores the space of visualizations, evaluates promising visualizations for trends, and recommends those it deems most “useful” or “interesting”. The two major obstacles in recommending interesting visualizations are (a) scale: evaluating a large number of candidate visualizations while responding within interactive time scales, and (b) utility: identifying an appropriate metric for assessing interestingness of visualizations. For the former, SEEDB introduces pruning optimizations to quickly identify high-utility visualizations and sharing optimizations to maximize sharing of computation across visualizations. For the latter, as a first step, we adopt a deviation-based metric for visualization utility, while indicating how we may be able to generalize it to other factors influencing utility. We implement SEEDB as a middleware layer that can run on top of any DBMS. Our experiments show that our framework can identify interesting visualizations with high accuracy. Our optimizations lead to multiple orders of magnitude speedup on relational row and column stores and provide recommendations at interactive time scales. Finally, we demonstrate via a user study the effectiveness of our deviation-based utility metric and the value of recommendations in supporting visual analytics.
Explaining Outputs in Modern Data Analytics

Zaheer Chothia (ETH Zurich); John Liagouris (ETH Zurich); Frank McSherry (ETH Zurich); Timothy Roscoe (ETH Zurich)

Abstract:We report on the design and implementation of a general framework for interactively explaining the outputs of modern data-parallel computations, including iterative data analytics. To produce explanations, existing works adopt a naive backward tracing approach which runs into known issues; naive backward tracing may identify: (i) too much information that is difficult to process, and (ii) not enough information to reproduce the output, which hinders the logical debugging of the program. The contribution of this work is twofold. First, we provide methods to effectively reduce the size of explanations based on the first occurrence of a record in an iterative computation. Second, we provide a general method for identifying explanations that are sufficient to reproduce the target output in arbitrary computations, a problem for which no viable solution existed until now. We implement our approach on differential dataflow, a modern high-throughput, low-latency dataflow platform. We add a small (but extensible) set of rules to explain each of its data-parallel operators, and we implement these rules as differential dataflow operators themselves. This choice allows our implementation to inherit the performance characteristics of differential dataflow, and results in a system that efficiently computes and updates explanatory inputs even as the inputs of the reference computation change. We evaluate our system with various analytic tasks on real datasets, and we show that it produces concise explanations in tens of milliseconds, while remaining up to two orders of magnitude faster than even the best implementations that do not support explanations.

## Research R18: Database Hardware-Software Codesign

### Chair: Ryan Stutsman, Univ. of Utah

YourSQL: A High-Performance Database System Leveraging In-Storage Computing [Innovative Systems and Applications]

Insoon Jo (Samsung Electronics Co.); Duck-Ho Bae (Samsung Electronics Co.); Andre Yoon (Samsung Electronics Co.); Jeong-Uk Kang (Samsung Electronics Co.); Sangyeun Cho (Samsung Electronics Co.); Daniel Lee (Samsung Electronics Co.); Jaeheon Jeong (Samsung Electronics Co.)

Abstract:This paper presents YourSQL, a database system that accelerates data-intensive queries with the help of additional in-storage computing capabilities. YourSQL realizes very early filtering of data by offloading data scanning of a query to user-programmable solid-state drives. We implement our system on a recent branch of MariaDB (a variant of MySQL). In order to quantify the performance gains of YourSQL, we evaluate SQL queries with varying complexities. Our result shows that YourSQL reduces the execution time of the whole TPC-H queries by 3.6x, compared to a vanilla system. Moreover, the average speed-up of the five TPC-H queries with the largest performance gains reaches over 15x. Thanks to this significant reduction of execution time, we observe sizable energy savings. Our study demonstrates that the YourSQL approach, combining the power of early filtering with end-to-end datapath optimization, can accelerate large-scale analytic queries with lower energy consumption.
Cheap Data Analytics using Cold Storage Devices

Renata Borovica-Gajic (EPFL); Raja Appuswamy (EPFL); Anastasia Ailamaki (EPFL)

# Friday Sep 9th, 9:00 am - 10:30 am

## Joint Workshop W5

### Location: Pearl 2

Accelerating Data Management Systems (ADMS) and In-memory Data Management & Analytics (IMDM)

ADMS: Rajesh Bordawekar (IBM Research, T.J. Watson); Tirthankar Lahiri (Oracle)

IMDM: Spyros Blanas (The Ohio State University); Justin Levandoski (Microsoft Research); Andy Pavlo (Carnegie Mellon University)

## Workshop W6

### Location: Royal 1

Data Management and Analytics for Medicine and Healthcare (DMAH)

Fusheng Wang (Stony Brook University); Lixia Yao (University of North Carolina at Charlotte); Gang Luo (University of Utah)

## Workshop W7

### Location: Royal 2

Social Data Analytics and Management (SoDAM)

Mahashweta Das (Hewlett Packard Labs); Gautam Das (University of Texas at Arlington)

## Workshop W8

### Location: Maple

Big Data Open-Source Systems (BOSS) / Polyglot

Tilmann Rabl (TU Berlin); Sebastian Schelter (Amazon)