42nd International Conference on Very Large Data Bases (VLDB) 2016
Full Program
DAY 1: MONDAY [September 5]
Time Track 1 (PEARL 1)
Track 2 (MAPLE)
Track 3 (ROYAL 1)
Track 4 (ROYAL 2)
Track 5 (Boardroom)
Co-located Event (PEARL 2): Digital India
9.00-10.30 T1: Differential Privacy in the Wild | T2: Operational Analytics Data Management Systems | W1: TPCTC: Performance Evaluation and Benchmarking | W2: BIRTE: Business Intelligence for the Real Time Enterprise | W3: QDB: Quality in Databases | Session 1: India Stack
10:30-11:00 Tea Break (30 mts)
11.00-12.30 T1 (contd): Differential Privacy in the Wild | T2 (contd): Operational Analytics Data Management Systems | W1 (contd): TPCTC | W2 (contd): BIRTE | W3 (contd): QDB | Session 2: Data and India Scale Challenges
12:30-2:00 Lunch Break (90 mts)
2.00-3.30 T3: Modern Main-memory DBMS | T4: Machine Learning in the Real World | W1 (contd): TPCTC | W2 (contd): BIRTE | W3 (contd): QDB | Hackathon Final Round (1.30-5.00)
3:30-4:00 Tea Break (30 mts)
4.00-5.30 T3 (contd): Modern Main-memory DBMS | T4 (contd): Machine Learning in the Real World | W1 (contd): TPCTC | W2 (contd): BIRTE | W3 (contd): QDB
5:30-6:30 High Tea Break (60 mts)
6.30-8.30 Digital India Event - Keynote and Panel Discussion (Location: PEARL)
8.30 onwards Mini Dinner
DAY 2: TUESDAY [September 6]
Time Track 1 (PEARL 1)
Track 2 (PEARL 2)
Track 3 (ROYAL 1)
Track 4 (ROYAL 2)
Track 5 (MAPLE)
Demos and Posters
8.30-9.15 Conference Inauguration (Location: PEARL)
9.15-10.30 Keynote 1: Ion Stoica: Trends and Challenges in Big Data Processing
10.30-11.15 Tea Break (45 mts)
11.15-12.45 R1: Transaction Processing | R2: RDF Data Systems | R3: Distributed and Cloud Systems -1 | I1: Distributed Data Analytics Platforms -1 | Demo 1a: Data Engines and Analytics
Research Poster 1 (R10, R11, R12, R13, R14)
12:45-2:00 Lunch Break (75 mts)
2.00-3.30 R4: Memory Management | R5: Data Cleaning -1 | R6: Graph Processing -1 | I2: Distributed Data Analytics Platforms -2 | Demo 2a: Interactive and Exploratory Systems
Research Poster 2 (R15, R16, R17, R18, R19, R20)
3:30-4:00 Tea Break (30 mts)
4.00-5.30 R7: Query Execution -1 | R8: Data Security and Privacy | R9: Ranking Queries | I3: Data Engine Architectures -1 | Demo 3a: Graph and Semistructured Data
Research Poster 3 (R21, R22, R23, R24, R25)
5:30-6:30 High Tea Break (60 mts)
DAY 3: WEDNESDAY [September 7]
Time Track 1 (PEARL 1)
Track 2 (PEARL 2)
Track 3 (ROYAL 1)
Track 4 (ROYAL 2): Industry and Research
Track 5 (MAPLE)
Demos and Posters
9.00-9.15 Conference Announcements
9.15-10.30 Keynote 2: Anand Rajaraman: Data-Driven Disruption: The View from Silicon Valley
10:30-11:15 Tea Break (45 mts)
11.15-12.45 R10: Query Optimization -1 | R11: Spatial Data and Queries -1 | R12: Distributed and Cloud Systems -2 | I4: Data Engine Architectures -2 | Demo 3b: Graph and Semistructured Data
Research Poster 4 (R1, R2, R3, R4, R5)
12:45-2:00 Lunch Break (75 mts)
2.00-3.30 R13: Query Execution -2 | Panel: Will AI Eat Us All? | R14: Graph Processing -2 | R15: Data Cleaning -2 | Demo 1b: Data Engines and Analytics
Research Poster 5 (R6, R7, R8, R9, VLDB Journal)
3:30-4:00 Tea Break (30 mts)
4.00-5.30 R16: Data and Query Models -1 | R17: Scalable Analytics -1 | R18: Database Hardware-Software Codesign | I5: Graph Systems and Analytics | Demo 2b: Interactive and Exploratory Systems
Research Poster 6 (R26, R27, R28, R29, R30)
5:45-6:00 Buses depart to Banquet Venue
7.30-10.00 VLDB CONFERENCE BANQUET (Venue: Kingdom of Dreams)
DAY 4: THURSDAY [September 8]
Time Track 1 (PEARL 1)
Track 2 (PEARL 2)
Track 3 (ROYAL 1)
Track 4 (ROYAL 2)
Track 5 (MAPLE)
9.00-10.30 VLDB Awards (VLDB 2016: Best Paper, Best Demo; Endowment: 10 Year, Early Career, Women Achiever)
Endowment Award Talk 1: Xin Luna Dong
Endowment Award Talk 2: Mohamed F. Mokbel, Chi-Yin Chow, Walid G. Aref
10:30-11:15 Tea Break (45 mts)
11.15-12.45 R19: Query Optimization -2 | R20: Spatial Data and Queries -2 | R21: Distributed and Cloud Systems -3 | T5: Human Factors in Crowdsourcing | R22: Entity Matching -1
12:45-2:00 Lunch Break (75 mts)
2.00-3.30 R23: Query Execution -3 | R24: Social Networks and Crowdsourcing | R25: Graph Processing -3 | T6: Qualitative Data Cleaning | R26: Data and Query Models -2
3:30-4:00 Tea Break (30 mts)
4.00-5.30 R27: Data and Query Models -3 | R28: Entity Matching -2 | R29: Community Search and Mining | R30: Scalable Analytics -2
DAY 5: FRIDAY [September 9]
Time Track 1 (PEARL 1)
Track 2 (PEARL 2)
Track 3 (ROYAL 1)
Track 4 (ROYAL 2)
Track 5 (MAPLE)
9.00-10.30 W4: PhD Workshop | W5: ADMS/IMDM: Accelerating Data Management Systems / In-Memory Data Management and Analytics | W6: DMAH: Data Management and Analytics for Medicine and Healthcare | W7: SoDAM: Social Data Analytics and Management | W8: BOSS/Polyglot: Big Data Open-Source Systems (BOSS) / Polyglot
10:30-11:00 Tea Break (30 mts)
11.00-12.30 W4 (contd): PhD Workshop | W5 (contd): ADMS/IMDM | W6 (contd): DMAH | W7 (contd): SoDAM | W8 (contd): BOSS/Polyglot
12:30-2:00 Lunch Break (90 mts)
2.00-3.30 W4 (contd): PhD Workshop | W5 (contd): ADMS/IMDM | W6 (contd): DMAH | W8 (contd): BOSS/Polyglot
3:30-4:00 Tea Break (30 mts)
4.00-5.30 W4 (contd): PhD Workshop | W5 (contd): ADMS/IMDM | W6 (contd): DMAH | W8 (contd): BOSS/Polyglot

Monday Sep 5th, 9:00 am - 10:30 am

Tutorial T1

Location: Pearl 1

Differential Privacy in the Wild

Ashwin Machanavajjhala (Duke University), Xi He (Duke University), Michael Hay (Colgate University)

Abstract: Differential privacy has emerged as an important standard for privacy-preserving computation over databases containing sensitive information about individuals. Over the last decade, research on differential privacy spanning a number of areas, including theory, security, databases, networking, machine learning, and statistics, has resulted in a variety of privacy-preserving algorithms for many analysis tasks. Despite maturing research efforts, adoption of differential privacy by practitioners in industry, academia, or government agencies has so far been rare. Hence, in this tutorial we will first describe the foundations of differentially private algorithm design, covering the state of the art in private computation on tabular data. In the second half of the tutorial we will highlight real-world applications on complex data types, and identify research challenges in applying differential privacy to real-world applications.
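As a flavor of the foundations the tutorial covers, the standard building block is the Laplace mechanism. The following minimal sketch (not from the tutorial materials; function names are our own) applies it to a count query:

```python
import random

def laplace_noise(scale):
    # The difference of two i.i.d. exponential draws is Laplace(0, scale).
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def dp_count(records, predicate, epsilon):
    # A counting query has sensitivity 1: adding or removing one individual
    # changes the result by at most 1, so Laplace noise with scale 1/epsilon
    # yields epsilon-differential privacy for this single query.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)
```

Answering many such queries consumes a correspondingly larger privacy budget under composition, which is one of the practical hurdles the tutorial discusses.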

Tutorial T2

Location: Maple

Operational Analytics Data Management

Alexander Boehm (SAP), Jens Dittrich (University of Saarland), Niloy Mukherjee (LinkedIn), Ippokratis Pandis (Amazon), Rajkumar Sen (MemSQL)

Abstract: As enterprise businesses become more agile and responsive to trends, sentiments, and surges, business analytics applications can no longer rely on the classic data-warehousing model of deriving insights from data at rest. There, the data available for analytics is only as current as the last ETL (Extract, Transform, and Load) job that moved it from the operational system to the data warehouse, and businesses can no longer afford to miss out on real-time insights from data that is fresh in the operational system. For example, a retail store could run analytics on live transactional data to track sales and use the information to offer discounts. There is thus an increasing demand for database management systems that can perform real-time analytics on data that is ingested and modified in live mainstream operational databases. In response, many commercial vendors as well as academic groups have attempted to solve the problem by combining transactional and analytical processing capabilities in the same database system; we refer to these as operational analytics systems. In this tutorial, we shall present an in-depth overview of operational analytics systems. We shall start with a discussion of the various aspects of designing such a system, ranging from data storage and indexing to query optimization and processing. We shall then present a set of representative systems in detail, highlight their individual architecture and design characteristics, and discuss several key research problems they address. This tutorial is intended for both researchers and practitioners in industry.

Workshop W1

Location: Royal 1

TPC Technology Conference on Performance Evaluation and Benchmarking (TPCTC)

Raghunath Nambiar (Cisco, USA), Meikel Poess (Oracle, USA)

Workshop W2

Location: Royal 2

Business Intelligence for the Real Time Enterprise (BIRTE)

Meichun Hsu (Hewlett-Packard), Malu Castellanos (Hewlett-Packard), Panos K Chrysanthis (University of Pittsburgh)

Workshop W3

Location: Boardroom

Quality in Databases

Christoph Quix (Fraunhofer FIT, Germany), Rihan Hai (RWTH Aachen University, Germany), Hongzhi Wang (Harbin Inst. of Tech., China), Venkat N. Gudivada (East Carolina University, USA), Laure Berti (QCRI, Qatar)

Monday Sep 5th, 11:00 am - 12:30 pm

Tutorials and Workshops Contd.

Monday Sep 5th, 2:00 pm - 3:30 pm

Tutorial T3

Location: Pearl 1

Modern Main-Memory Database Systems

Paul Larson (Microsoft Research), Justin Levandoski (Microsoft Research)

Abstract: This tutorial provides an overview of recent developments in main-memory database systems. With growing memory sizes and memory prices dropping by a factor of 10 every 5 years, data having a “primary home” in memory is now a reality. Main-memory databases eschew many of the traditional architectural tenets of relational database systems that were optimized for disk-resident data. Innovative approaches to fundamental issues such as concurrency control and query processing are required to unleash the full performance potential of main-memory databases. The tutorial focuses on the design issues and architectural choices that must be made when building a high-performance database system optimized for main memory: data storage and indexing, concurrency control, durability and recovery techniques, query processing and compilation, support for high availability, and the ability to support hybrid transactional and analytical workloads. These are illustrated with example solutions drawn from four state-of-the-art systems: H-Store/VoltDB, Hekaton, HyPer, and SAP HANA. The tutorial also covers current and future research trends.

Tutorial T4

Location: Maple

Machine Learning in the Real World

Vineet Chaoji (Amazon, India), Rajeev Rastogi (Amazon, India), Gourav Roy (Amazon, India)

Abstract: Machine Learning (ML) has become a mature technology that is being applied to a wide range of business problems such as web search, online advertising, product recommendations, object recognition, and so on. As a result, it has become imperative for researchers and practitioners to have a fundamental understanding of ML concepts and practical knowledge of end-to-end modeling. This tutorial takes a hands-on approach to introducing the audience to machine learning. The first part gives a broad overview and discusses some of the key concepts in machine learning. The second part takes the audience through the end-to-end modeling pipeline for a real-world income prediction problem. The tutorial includes some hands-on exercises. If you want to follow along, you will need a laptop with at least 2 GB of RAM and the Firefox or Google Chrome browser installed. Note that your laptop must be able to connect to the internet via Wi-Fi or a mobile data connection. We will be using Docker containers, so no specific software needs to be installed on laptops.
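To give a rough flavor of the kind of end-to-end modeling exercise described above, here is a minimal, library-free sketch of training a binary classifier by gradient descent. It is a stand-in for the tutorial's actual income-prediction exercise (which uses Docker-packaged tooling), and the toy feature and labels are our own:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    # Plain batch gradient descent on the logistic loss; w[0] is the bias.
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            xb = [1.0] + xi
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xb))) - yi
            for j, xj in enumerate(xb):
                grad[j] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def predict(w, xi):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, [1.0] + xi)))
```

A real pipeline adds the steps the tutorial emphasizes around this core: feature extraction, train/test splits, and evaluation.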

Workshops Contd.

Monday Sep 5th, 4:00 pm - 5:30 pm

Tutorials and Workshops Contd.

Tuesday Sep 6th, 8:30 am - 9:15 am

Conference Inauguration

Location: Pearl

Anand Deshpande (Persistent Systems); T. M. Vijayaraman (Persistent Systems); Surajit Chaudhuri (Microsoft Research); Jayant Haritsa (Indian Institute of Science); Mukesh Mohania (IBM Research)

Tuesday Sep 6th, 9:15 am - 10:30 am

Keynote 1

Location: Pearl

Chair: Surajit Chaudhuri, Microsoft Research

Trends and Challenges in Big Data Processing

Prof. Ion Stoica, University of California, Berkeley

Abstract: Almost six years ago we started the Spark project at UC Berkeley. Spark is a cluster computing engine that is optimized for in-memory processing and unifies support for a variety of workloads, including batch, interactive querying, streaming, and iterative computations. Spark is now the most active big data project in the open source community and is already being used by over one thousand organizations. One of the reasons behind Spark’s success has been our early bet on the continued increase in memory capacity and the feasibility of fitting many realistic workloads in the aggregate memory of typical production clusters. Today, we are witnessing new trends, such as Moore’s law slowing down and the emergence of a variety of computation and storage technologies, such as GPUs, FPGAs, and 3D XPoint. In this talk, I’ll discuss some of the lessons we learned in developing Spark as a unified computation platform, and the implications of today’s hardware and software trends for the development of the next generation of big data processing systems.

Bio: Ion Stoica is a Professor in the EECS Department at the University of California, Berkeley. He received his PhD from Carnegie Mellon University in 2000. He does research on cloud computing and networked computer systems. Past work includes Dynamic Packet State (DPS), the Chord DHT, the Internet Indirection Infrastructure (i3), declarative networks, replay debugging, and multi-layer tracing in distributed systems. His current research focuses on resource management and scheduling for data centers, cluster computing frameworks, and network architectures. He is an ACM Fellow and has received numerous awards, including the SIGCOMM Test of Time Award (2011) and the ACM Doctoral Dissertation Award (2001). In 2006, he co-founded Conviva, a startup to commercialize technologies for large-scale video distribution, and in 2013, he co-founded Databricks, a startup to commercialize technologies for big data processing.

Tuesday Sep 6th, 11:15 am - 12:45 pm

Research R1: Transaction Processing

Location: Pearl 1

Chair: Daniel Abadi, Yale Univ.

Leveraging Lock Contention to Improve OLTP Application Performance

Cong Yan (University of Washington); Alvin Cheung (University of Washington)

Abstract: Locking is one of the predominant costs in transaction processing. While much work has focused on designing efficient concurrency control mechanisms, not much has been done on understanding how transactional applications issue queries and on leveraging application semantics to improve application performance. This paper presents QURO, a query-aware compiler that automatically reorders queries in transaction code to improve performance. Observing that certain queries within a transaction are more contentious than others, as they require locking the same tuples as other concurrently executing transactions, QURO automatically changes the application such that contentious queries are issued as late as possible. We have evaluated QURO on various transaction benchmarks, and our results show that QURO-generated implementations can increase transaction throughput by up to 6.53x, while reducing transaction latency by up to 85%.
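QURO itself is a compiler that operates on application source code; the core reordering idea can be sketched roughly as a dependency-respecting topological sort that defers contentious queries. The query names, dependency sets, and contention scores below are hypothetical:

```python
def reorder_queries(queries, deps, contention):
    # Greedy topological sort over an acyclic dependency relation: among
    # the queries whose prerequisites have already been emitted, always
    # pick the least contentious one, so highly contentious queries
    # (whose locks conflict most) are issued as late as the data
    # dependencies allow, shrinking the lock hold time.
    order, done = [], set()
    remaining = list(queries)
    while remaining:
        ready = [q for q in remaining if deps.get(q, set()) <= done]
        nxt = min(ready, key=lambda q: contention.get(q, 0))
        order.append(nxt)
        done.add(nxt)
        remaining.remove(nxt)
    return order
```

For example, an order-processing transaction would issue its hot inventory update last, after its reads and logging.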
BCC: Reducing False Aborts in Optimistic Concurrency Control with Low Cost for In-Memory Databases

Yuan Yuan (The Ohio State University); Kaibo Wang (The Ohio State University); Rubao Lee (The Ohio State University); Xiaoning Ding (New Jersey Institute of Technology); Jing Xing (Institute of Computing Technology, Chinese Academy of Sciences); Spyros Blanas (The Ohio State University); Xiaodong Zhang (The Ohio State University)

Abstract: The Optimistic Concurrency Control (OCC) method has been commonly used in in-memory databases to ensure transaction serializability: a transaction is aborted if its read set has changed during execution. This simple abort criterion causes a large proportion of false positives, leading to excessive transaction aborts. Transactions aborted false-positively (i.e., false aborts) waste system resources and can significantly degrade system throughput (by as much as 3.68x in our experiments) when data contention is intensive. Modern in-memory databases run on systems with increasingly parallel hardware and handle workloads with growing concurrency. They must efficiently deal with data contention in the presence of greater concurrency by minimizing false aborts. This paper presents a new concurrency control method named Balanced Concurrency Control (BCC), which aborts transactions more carefully than OCC does. BCC detects data dependency patterns that can more reliably indicate unserializable transactions than the criterion used in OCC. The paper studies the design options and implementation techniques that can effectively detect data contention by identifying dependency patterns with low overhead. To test the performance of BCC, we have implemented it in Silo and compared its performance against that of the vanilla Silo system with OCC and two-phase locking (2PL). Our extensive experiments with TPC-W-like, TPC-C-like, and YCSB workloads demonstrate that when data contention is intensive, BCC can increase transaction throughput by more than 3x versus OCC and more than 2x versus 2PL; meanwhile, BCC has comparable performance with OCC for workloads with low data contention.
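The OCC validation criterion that BCC refines can be sketched as a toy single-threaded model (our own illustration, not the Silo implementation): any version change in the read set triggers an abort, which is exactly the coarse test that produces false aborts.

```python
class OCCTransaction:
    # store maps key -> (value, version). The transaction records the
    # version of every item it reads and aborts at commit time if any
    # of those versions has changed, whether or not serializability
    # was actually violated.
    def __init__(self, store):
        self.store, self.reads, self.writes = store, {}, {}

    def read(self, key):
        value, version = self.store[key]
        self.reads[key] = version
        return value

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        for key, seen in self.reads.items():
            if self.store[key][1] != seen:
                return False  # read set changed: abort (possibly falsely)
        for key, value in self.writes.items():
            self.store[key] = (value, self.store[key][1] + 1)
        return True
```

BCC replaces the blanket version check with a test for the specific dependency patterns that indicate unserializable executions.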
Multi-Version Range Concurrency Control in Deuteronomy

Justin Levandoski (Microsoft Research); David Lomet (Microsoft Research); Sudipta Sengupta (Microsoft Research); Ryan Stutsman (Microsoft Research); Rui Wang (Microsoft Research)

Abstract: The Deuteronomy transactional key-value store executes millions of serializable transactions per second by exploiting multi-version timestamp-order concurrency control. However, it has not supported range operations, only individual record operations (e.g., create, read, update, delete). In this paper, we enhance our multi-version timestamp-order technique to handle range concurrency and prevent phantoms. Importantly, we maintain high performance while respecting the clean separation of duties required by Deuteronomy, where a transaction component performs purely logical concurrency control (including range support), while a data component performs data storage and management duties. Like the rest of the Deuteronomy stack, our range technique manages concurrency information in a latch-free manner. With our range enhancement, Deuteronomy can reach scan speeds of nearly 250 million records/s (more than 27 GB/s) on modern hardware, while providing serializable isolation complete with phantom prevention.
S-Store: Streaming Meets Transaction Processing

John Meehan (Brown University); Nesime Tatbul (Intel Labs and MIT); Stan Zdonik (Brown University); Cansu Aslantas (Brown University); Ugur Cetintemel (Brown University); Jiang Du (University of Toronto); Tim Kraska (Brown University); Samuel Madden (MIT); David Maier (Portland State University); Andrew Pavlo (CMU); Michael Stonebraker (MIT); Kristin Tufte (Portland State University); Hao Wang (MIT)

Abstract: Stream processing addresses the needs of real-time applications. Transaction processing addresses the coordination and safety of short atomic computations. Heretofore, these two modes of operation existed in separate, stove-piped systems. In this work, we attempt to fuse the two computational paradigms in a single system called S-Store. In this way, S-Store can simultaneously accommodate OLTP and streaming applications. We present a simple transaction model for streams that integrates seamlessly with a traditional OLTP system and provides both ACID and stream-oriented guarantees. We chose to build S-Store as an extension of H-Store, an open-source, in-memory, distributed OLTP database system. By implementing S-Store in this way, we can make use of the transaction processing facilities that H-Store already provides, and we can concentrate on the additional features that are needed to support streaming. Similar implementations could be done using other main-memory OLTP platforms. We show that we can actually achieve higher throughput for streaming workloads in S-Store than in an equivalent deployment in H-Store alone. We also show how this can be achieved within H-Store with the addition of a modest amount of new functionality. Furthermore, we compare S-Store to two state-of-the-art streaming systems, Esper and Apache Storm, and show how S-Store can sometimes exceed their performance while at the same time providing stronger correctness guarantees.

Research R2: RDF Data Systems

Location: Pearl 2

Chair: Atsuyuki Morishima, Univ. of Tsukuba

S2RDF: RDF Querying with SPARQL on Spark

Alexander Schätzle (University of Freiburg); Martin Przyjaciel-Zablocki (University of Freiburg); Simon Skilevic (University of Freiburg); Georg Lausen (University of Freiburg)

Abstract: RDF has become very popular for semantic data publishing due to its flexible and universal graph-like data model. The ever-increasing size of RDF data collections thus raises the need for scalable distributed approaches. We endorse the usage of existing infrastructures for Big Data processing, like Hadoop, for this purpose. Yet SPARQL query performance is a major challenge, as Hadoop is not intentionally designed for RDF processing. Existing approaches often favor certain query pattern shapes while performance drops significantly for other shapes. In this paper, we introduce a novel relational partitioning schema for RDF data called ExtVP that uses semi-join-based preprocessing, akin to the concept of Join Indices in relational databases, to efficiently minimize query input size regardless of pattern shape and diameter. Our prototype system S2RDF is built on top of Spark and uses SQL to execute SPARQL queries over ExtVP. We demonstrate its superior performance in comparison to state-of-the-art SPARQL-on-Hadoop approaches.
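The ExtVP idea can be sketched in a few lines (a simplification of the paper's relational formulation; S2RDF materializes such reductions as Spark SQL tables, and supports more join-position variants than the subject-subject case shown here):

```python
def vertical_partition(triples):
    # VP layout: one (subject, object) table per predicate.
    tables = {}
    for s, p, o in triples:
        tables.setdefault(p, []).append((s, o))
    return tables

def extvp_ss(vp, p1, p2):
    # ExtVP subject-subject reduction of VP[p1] by VP[p2]: a semi-join
    # that keeps only rows whose subject also occurs as a subject of p2.
    # A query with the pattern { ?s p1 ?o . ?s p2 ?o2 } can then scan
    # this (often much smaller) table instead of the full VP[p1].
    subjects = {s for s, _ in vp.get(p2, [])}
    return [(s, o) for s, o in vp.get(p1, []) if s in subjects]
```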
Inferray: Fast In-Memory RDF Inference

Julien Subercaze (Laboratoire Hubert Curien); Christophe Gravier (Laboratoire Hubert Curien); Jules Chevalier (Laboratoire Hubert Curien); Frederique Laforest (Laboratoire Hubert Curien)

Abstract: The advent of semantic data on the Web requires efficient reasoning systems to infer RDF and OWL data. The linked nature and huge volume of the data entail efficiency and scalability challenges when designing productive inference systems. This paper presents Inferray, an implementation of RDFS, ρdf, and RDFS-Plus inference with improved performance over existing solutions. The main features of Inferray are: 1) a storage layout based on vertical partitioning that guarantees sequential access and efficient sort-merge join inference; 2) efficient sorting of pairs of 64-bit integers using ad hoc optimizations on MSD radix sort and a custom counting sort; 3) a dedicated temporary storage to perform efficient graph closure computation. Our measurements on synthetic and real-world datasets show improvements over competitors on RDFS-Plus, and up to several orders of magnitude for transitive closure.
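The kind of fixpoint Inferray computes for a transitive rule such as rdfs:subClassOf can be sketched as follows (a set-based simplification; the paper's contribution is performing this with sorted 64-bit integer pairs and sort-merge joins rather than hash sets):

```python
def transitive_closure(pairs):
    # Fixpoint for a transitive rule: (a, b) and (b, c) entail (a, c).
    # Iterate until a round derives nothing new.
    closed = set(pairs)
    while True:
        derived = {(a, d) for a, b in closed for c, d in closed if b == c}
        if derived <= closed:
            return closed
        closed |= derived
```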
RDF Graph Alignment with Bisimulation

Peter Buneman (University of Edinburgh); Sławek Staworko (University of Edinburgh; University of Lille; LINKS, INRIA Nord Europe)

Abstract: We investigate approaches to the alignment of two RDF triple stores, inspired by the classical notion of graph bisimulation, and relate this to metrics that describe the accuracy of the alignment. Alignment of RDF stores is essential in understanding evolving, curated ontologies. One needs to align nodes in successive versions that are given “blank” names; moreover, a new version may switch terminology from one external ontology to another, which means that some further alignment of the identifiers of the two external ontologies is useful. We first describe a form of bisimulation based on node colorings, in which the colors define equivalence classes; we then go on to describe an estimate of the accuracy of a coloring based on (a) the similarity of literal nodes and (b) the “structural” similarity of internal nodes (URIs and blank nodes). The effectiveness of these methods is tested on two evolving datasets: an ontology described in OWL and a triple store derived from an evolving relational database. Both of these are curated resources for biologists.
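A rough sketch of the color-refinement step underlying such a bisimulation (our own minimal formulation over outgoing edges, not the paper's algorithm):

```python
def color_refinement(nodes, edges, init, rounds=3):
    # edges is a list of (src, label, dst). Each round replaces a node's
    # color with (its old color, the sorted multiset of (edge label,
    # neighbor color) pairs), so two nodes stay in the same equivalence
    # class only while their neighborhoods keep looking the same.
    color = dict(init)
    for _ in range(rounds):
        color = {n: (color[n],
                     tuple(sorted((lbl, color[dst])
                                  for src, lbl, dst in edges if src == n)))
                 for n in nodes}
    return color
```

Aligning two stores then amounts to matching nodes that end up with the same color, with literal similarity refining the initial coloring.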
Semantic SPARQL Similarity Search Over RDF Knowledge Graphs

Weiguo Zheng (Peking University); Lei Zou (Peking University); Wei Peng (Peking University); Xifeng Yan (University of California at Santa Barbara); Shaoxu Song (Tsinghua University); Dongyan Zhao (Peking University)

Abstract: RDF knowledge graphs have attracted increasing attention in recent years. However, due to the schema-free nature of RDF data, it is very difficult for users to have full knowledge of the underlying schema. Furthermore, the same kind of information can be represented in diverse graph fragments. Hence, it is a huge challenge to formulate complex SPARQL expressions by taking the union of all the possible structures. In this paper, we propose an effective framework for accessing an RDF repository even if users have no full knowledge of the underlying schema. Specifically, given a SPARQL query, the system returns as many answers as possible that match the query based on semantic similarity. We propose a systematic method to mine diverse semantically equivalent structure patterns. More importantly, incorporating both structural and semantic similarities, we are the first to propose a novel similarity measure, the semantic graph edit distance. To improve efficiency, we utilize a semantic summary graph to summarize the knowledge graph, which supports both high-level pruning and drill-down pruning. We also devise an effective lower bound based on TA-style access to each of the candidate sets. Extensive experiments over real datasets confirm the effectiveness and efficiency of our approach.

Research R3: Distributed and Cloud Database Systems -1

Location: Royal 1

Chair: Ashraf Aboulnaga, QCRI

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

Juwei Shi (Renmin University of China); Yunjie Qiu (IBM Research – China); Umar Farooq Minhas (IBM Research - Almaden); Limei Jiao (IBM Research – China); Chen Wang (Tsinghua University); Berthold Reinwald (IBM Research - Almaden); Fatma Ozcan (IBM Research - Almaden)

Abstract: MapReduce and Spark are two very popular open-source cluster computing frameworks for large-scale data analytics. These frameworks hide the complexity of task parallelism and fault tolerance by exposing a simple programming API to users. In this paper, we evaluate the major architectural components in the MapReduce and Spark frameworks, including shuffle, execution model, and caching, using a set of important analytic workloads. To conduct a detailed analysis, we developed two profiling tools: (1) we correlate the task execution plan with the resource utilization for both MapReduce and Spark, and visually present this correlation; (2) we provide a breakdown of the task execution time for in-depth analysis. Through detailed experiments, we quantify the performance differences between MapReduce and Spark. Furthermore, we attribute these performance differences to components which are architected differently in the two frameworks, and expose the source of these differences using a set of micro-benchmark experiments. Overall, our experiments show that Spark is about 2.5x, 5x, and 5x faster than MapReduce for Word Count, k-means, and PageRank, respectively. The main causes of these speedups are the efficiency of the hash-based aggregation component for combine, as well as reduced CPU and disk overheads due to RDD caching in Spark. An exception to this is the Sort workload, for which MapReduce is 2x faster than Spark. We show that MapReduce’s execution model is more efficient for shuffling data than Spark’s, thus making Sort run faster on MapReduce.
Compressed Linear Algebra for Large-Scale Machine Learning [Best Paper Award Winner]

Matthias Boehm (IBM Research – Almaden); Ahmed Elgohary (University of Maryland); Peter Haas (IBM Research – Almaden); Fred Reiss (IBM Research – Almaden); Berthold Reinwald (IBM Research – Almaden)

Abstract: Large-scale machine learning (ML) algorithms are often iterative, with repeated read-only data access and I/O-bound matrix-vector multiplications needed to converge to an optimal model. It is crucial for performance to fit the data into single-node or distributed main memory. General-purpose heavyweight and lightweight compression techniques struggle to achieve both good compression ratios and fast decompression speeds that enable block-wise uncompressed operations. Hence, we initiate work on compressed linear algebra (CLA), in which lightweight database compression techniques are applied to matrices, and linear algebra operations such as matrix-vector multiplication are then executed directly on the compressed representations. We contribute effective column compression schemes, cache-conscious operations, and an efficient sampling-based compression algorithm. Our experiments show that CLA achieves in-memory operation performance close to the uncompressed case and good compression ratios that allow us to fit larger datasets into available memory. We thereby obtain significant end-to-end performance improvements of up to 26x, or reduced memory requirements.
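The core CLA idea of executing an operation directly on compressed columns can be sketched with a simple offset-list encoding (our own toy encoding for illustration; the paper's column compression schemes are more elaborate):

```python
def compress_columns(X):
    # Offset-list encoding: per column, map each distinct value to the
    # list of row indices where it occurs. This pays off for the
    # low-cardinality columns that CLA targets.
    comp = []
    for j in range(len(X[0])):
        groups = {}
        for i, row in enumerate(X):
            groups.setdefault(row[j], []).append(i)
        comp.append(groups)
    return comp, len(X)

def matvec_compressed(comp, nrows, v):
    # y = X v computed without decompressing: each distinct value in
    # column j is multiplied by v[j] once and scattered to its rows.
    y = [0.0] * nrows
    for j, groups in enumerate(comp):
        for value, rows in groups.items():
            contrib = value * v[j]
            for i in rows:
                y[i] += contrib
    return y
```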
PIXIDA: Optimizing Data Parallel Jobs in Wide-Area Data Analytics

Konstantinos Kloudas (INESC-ID); Rodrigo Rodrigues (INESC-ID/IST at University of Lisbon); Nuno Preguica (NOVA LINCS / NOVA at University of Lisbon); Margarida Mamede (NOVA LINCS / NOVA at University of Lisbon)

Abstract: In the era of global-scale services, big data analytical queries are often required to process datasets that span multiple data centers (DCs). In this setting, cross-DC bandwidth is often the scarcest, most volatile, and/or most expensive resource. However, current widely deployed big data analytics frameworks make no attempt to minimize the traffic traversing these links. In this paper, we present PIXIDA, a scheduler that aims to minimize data movement across resource constrained links. To achieve this, we introduce a new abstraction called SILO, which is key to modeling PIXIDA’s scheduling goals as a graph partitioning problem. Furthermore, we show that existing graph partitioning problem formulations do not map to how big data jobs work, causing their solutions to miss opportunities for avoiding data movement. To address this, we formulate a new graph partitioning problem and propose a novel algorithm to solve it. We integrated PIXIDA in Spark and our experiments show that, when compared to existing schedulers, PIXIDA achieves a significant traffic reduction of up to ~9× on the aforementioned links.
Cümülön: Matrix-Based Data Analytics in the Cloud with Spot Instances

Botong Huang (Duke University); Nicholas Jarrett (Duke University); Shivnath Babu (Duke University); Sayan Mukherjee (Duke University); Jun Yang (Duke University)

Abstract: We describe Cumulon, a system aimed at helping users develop and deploy matrix-based data analysis programs in a public cloud. A key feature of Cumulon is its end-to-end support for so-called spot instances: machines whose market price fluctuates over time but is usually much lower than the regular fixed price. A user sets a bid price when acquiring spot instances, and loses them as soon as the market price exceeds the bid price. While spot instances can potentially save cost, they are difficult to use effectively and run the risk of not finishing work while costing more. Cumulon provides a highly elastic computation and storage engine on top of spot instances, and offers automatic cost-based optimization of execution, deployment, and bidding strategies. Cumulon further quantifies how the uncertainty in the market price translates into the cost uncertainty of its recommendations, and allows users to specify their risk tolerance as an optimization constraint.
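The cost model being optimized can be illustrated with a toy hour-by-hour replay (a deliberate simplification: real clouds have their own billing rules for revoked instances, and Cumulon additionally models price uncertainty and recovery of partial work):

```python
def spot_run_cost(price_trace, bid, hours_needed):
    # Replay a market price trace: we pay the market price while it
    # stays at or below our bid, and lose the instance (with the job
    # unfinished, in this simple model) the first hour it exceeds the bid.
    cost, done = 0.0, 0
    for price in price_trace:
        if price > bid:
            return cost, False  # instance revoked before finishing
        cost += price
        done += 1
        if done == hours_needed:
            return cost, True
    return cost, False
```

A higher bid raises the chance of finishing but caps exposure at a higher hourly price, which is exactly the trade-off a bidding strategy must balance.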

Industrial I1: Distributed Data Analytics Platforms - 1

Location: Royal 2

Chair: Gustavo Alonso, ETH Zurich

Data Infrastructure at Flipkart (Invited Talk)

Sharad Agarwal (Flipkart)

Abstract:Data infrastructure lies at the core of Flipkart's technology stack. The e-commerce business not only deals with huge traffic, but the data it handles is also widely varied: behavioural and transactional data for millions of customers, a catalog with millions of products offered by thousands of sellers, and a logistics network that covers the entire nation. The challenge of effectively deriving value and building data-enabled products cuts across all aspects of the technology stack. Data infrastructure at Flipkart is a hosted, central platform-as-a-service that enables real-time systemic decision making and business intelligence across all product lines at Flipkart. In this session, I will talk about the technical challenges in building this highly scalable, self-serve data platform, which manages more than a petabyte of data with thousands of real-time, business-critical data streams. I will also talk about the key technologies and platform abstractions that enable self-serve real-time ETL and querying capabilities on Hadoop as well as on traditional data warehouses in a seamless fashion.
Database System Support of Simulation Data

Hermano Lustosa (LNCC); Fabio Porto (LNCC); Pablo Blanco (LNCC); Patrick Valduriez (INRIA)

Abstract:Supported by increasingly efficient HPC infrastructure, numerical simulations are rapidly expanding to fields such as oil and gas, medicine and meteorology. As simulations become more precise and cover longer periods of time, they may produce files with terabytes of data that need to be efficiently analyzed. In this paper, we investigate techniques for managing such data using an array DBMS. We take advantage of multidimensional arrays, which nicely model the dimensions and variables used in numerical simulations. However, a naive approach to mapping simulation data files may lead to sparse arrays, impacting query response time, in particular when the simulation uses irregular meshes to model its physical domain. We propose efficient techniques to map coordinate values in numerical simulations to evenly distributed cells in array chunks with the use of equi-depth histograms and space-filling curves. We implemented our techniques in SciDB and, through experiments over real-world data, compared them with two other approaches: row-store and column-store DBMSs. The results indicate that multidimensional arrays and column-stores are much faster than a traditional row-store system for queries over a larger amount of simulation data. They also help identify the scenarios where array DBMSs are most efficient, and those where they are outperformed by column-stores.
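The space-filling-curve ingredient can be illustrated with a Z-order (Morton) key, which interleaves the bits of two coordinates so that nearby mesh points tend to land in nearby array cells. This small sketch omits the equi-depth histogram step the paper combines it with:

```python
def morton2d(x, y, bits=8):
    """Interleave the low `bits` bits of x and y into a Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return key

# (3, 5) = (binary 011, 101) interleaves to binary 100111 = 39
print(morton2d(3, 5))
```

Dividing such keys into equal-sized ranges gives chunks whose cells are evenly filled, which is the load-balancing effect the paper seeks for irregular meshes.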
Hybrid Row-Column Partitioning in Teradata

Mohammed Al-Kateb (Teradata Labs); Paul Sinclair (Teradata Labs); Grace Au (Teradata Labs); Carrie Ballinger (Teradata Labs)

Abstract:Data partitioning is an indispensable ingredient of database systems due to the performance improvement it can bring to any given mixed workload. Data can be partitioned horizontally or vertically. While some commercial proprietary and open source database systems have one flavor or mixed flavors of these partitioning forms, Teradata Database offers a unique hybrid row-column store solution that seamlessly combines both of these partitioning schemes. The key feature of this hybrid solution is that either row, column, or combined partitions are all stored and handled in the same way internally by the underlying file system storage layer. In this paper, we present the main characteristics and explain the implementation approach of Teradata’s row-column store. We also discuss query optimization techniques applicable specifically to partitioned tables. Furthermore, we present a performance study that demonstrates how different partitioning options impact the performance of various queries.
Consistent Regions: Guaranteed Tuple Processing in IBM Streams

Gabriela Jacques da Silva (IBM Research – T.J. Watson); Fang Zheng (IBM Research – T.J. Watson); Daniel Debrunner (IBM); Kun-Lung Wu (IBM Research – T.J. Watson); Victor Dogaru (IBM); Eric Johnson (IBM); Michael Spicer (IBM); Ahmet Erdem Sariyuce (Sandia National Labs)

Abstract:Guaranteed tuple processing has become critically important for many streaming applications. This paper describes how we enabled IBM Streams, an enterprise-grade stream processing system, to provide data processing guarantees. Our solution goes from language-level abstractions to a runtime protocol. As a result, with a couple of simple annotations at the source code level, IBM Streams developers can define consistent regions, allowing any subgraph of their streaming application to achieve guaranteed tuple processing. At runtime, a consistent region periodically executes a variation of the Chandy-Lamport snapshot algorithm to establish a consistent global state for that region. The coupling of consistent states with data replay enables guaranteed tuple processing.
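A much-simplified sketch of the snapshot idea: a marker flows through the operator graph, and each operator records its local state when the marker arrives. Real Chandy-Lamport also records in-flight channel state, and IBM Streams couples the snapshot with data replay; the operator names below are illustrative:

```python
MARKER = object()   # sentinel punctuation flowing through the pipeline

class Operator:
    """Toy streaming operator that counts tuples and snapshots on a marker."""
    def __init__(self, name, downstream=None):
        self.name, self.count, self.snapshot = name, 0, None
        self.downstream = downstream

    def process(self, item):
        if item is MARKER:
            self.snapshot = self.count   # record local state at the cut
        else:
            self.count += 1
        if self.downstream:
            self.downstream.process(item)  # forward tuple or marker

sink = Operator("sink")
source = Operator("source", downstream=sink)
for t in ["a", "b", "c"]:
    source.process(t)
source.process(MARKER)   # establishes a consistent cut at count == 3
source.process("d")      # post-snapshot tuple, not part of the cut
```

Both operators record the same count (3), i.e. a consistent global state, even though processing continues past the marker.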

Demo 1a: Data Engines and Analytics

Location: Maple

Vita: A Versatile Toolkit for Generating Indoor Mobility Data for Real-World Buildings

Huan Li (Zhejiang University); Hua Lu (Aalborg University); Xin Chen (Zhejiang University); Gang Chen (Zhejiang University) ; Ke Chen (Zhejiang University); Lidan Shou (Zhejiang University)

JexLog: A Sonar for the Abyss

Tobias Scheuer (SAP SE); Norman May (SAP SE); Alexander Böhm (SAP SE); Daniel Scheibli (SAP SE)

SI2P: A Restaurant Recommendation System Using Preference Queries over Incomplete Information

Xiaoye Miao (Zhejiang University); Yunjun Gao (Zhejiang University); Gang Chen (Zhejiang University); Huiyong Cui (Zhejiang University) ; Chong Guo (Zhejiang University); Weida Pan (Zhejiang University)

Mixed-instance querying: a lightweight integration architecture for data journalism

Raphaël Bonaque (INRIA); Tien Cao (INRIA); Bogdan Cautis (Université Paris-Sud); François Goasdoué (Université de Rennes 1); Javier Letellier (INRIA); Ioana Manolescu (INRIA); Oscar Mendoza (INRIA); Swen Ribeiro (Université Paris-Sud); Xavier Tannier (Université Paris-Sud); Michaël Thomazo (INRIA)

Precision Performance Surgery for PostgreSQL — LLVM-based Expression Compilation, Just in Time

Dennis Butterstein (University of Tuebingen); Torsten Grust (Universität Tübingen)

Partial Marking for Automated Grading of SQL Queries

Bikash Chandra (IIT Bombay); Mathew Joseph (IIT Bombay); Bharath Radhakrishnan (IIT Bombay); Shreevidhya Acharya (IIT Bombay); S. Sudarshan (IIT Bombay)

Squall: Scalable Real-time Analytics

Aleksandar Vitorovic (EPFL); Mohammed Elseidy (EPFL); Khayyam Guliyev (EPFL); Khue Vu Minh (EPFL); Daniel Espino Timón (EPFL); Mohammad Dashti (EPFL); Yannis Klonatos (EPFL); Christoph Koch (EPFL)

LocationSpark: A Distributed In-Memory Data Management System for Big Spatial Data

Mingjie Tang (Purdue University); Yongyang Yu (Purdue University); Qutaibah Malluhi (Qatar University); Mourad Ouzzani (Qatar Computing Research Institute); Walid Aref (Purdue University)

Amoeba: A Shape changing Storage System for Big Data

Anil Shanbhag (MIT); Samuel Madden (MIT); Alekh Jindal (Microsoft); Yi Lu (MIT)

F: Regression Models over Factorized Views

Dan Olteanu (University of Oxford); Maximilian Schleich (University of Oxford)

Magellan: Toward Building Entity Matching Management Systems over Data Science Stacks

Pradap Konda (University of Wisconsin Madison); Sanjib Das (University of Wisconsin Madison); Paul Suganthan G. C. (University of Wisconsin Madison); Anhai Doan (University of Wisconsin Madison); Adel Ardalan (University of Wisconsin Madison); Jeffrey Ballard (University of Wisconsin Madison); Han Li (University of Wisconsin Madison); Fatemah Panahi (University of Wisconsin Madison); Haojun Zhang (University of Wisconsin Madison); Jeffrey Naughton (University of Wisconsin Madison); Shishir Prasad (University of Wisconsin Madison); Ganesh Krishnan (WalmartLabs); Rohit Deep (WalmartLabs); Vijay Raghavendra (WalmartLabs)

Schema Independent and Scalable Relational Learning By Castor

Jose Picado (Oregon State University); Parisa Ataei (Oregon State University); Arash Termehchy (Oregon State University); Alan Fern (Oregon State University)

AD-WIRE: Add-on for Web Item Reviewing System

Rajeshkumar Kannapalli (University of Texas at Arlington); Azade Nazi (University of Texas at Arlington); Mahashweta Das (Hewlett Packard Labs); Gautam Das (University of Texas at Arlington)

Research Posters 1

Location: Maple

Tuesday Sep 6th, 2:00 pm - 3:30 pm

Research R4: Memory Management

Location: Pearl 1

Chair: Anastasia Ailamaki, EPFL Lausanne

RUMA has it: Rewired User-space Memory Access is Possible!

Felix Martin Schuhknecht (Saarland University); Jens Dittrich (Saarland University); Ankur Sharma (Saarland University)

Abstract:Memory management is one of the most boring topics in database research. It plays a minor role in tasks like free-space management or efficient space usage. Here and there we also realize its impact on database performance when worrying about NUMA-aware memory allocation, data compacting, snapshotting, and defragmentation. But, overall, let’s face it: the entire topic sounds as exciting as ‘garbage collection’ or ‘debugging a program for memory leaks’. What if there were a technique that would promote memory management from a third class helper thingie to a first class citizen in algorithm and systems design? What if that technique turned the role of memory management in a database system (and any other data processing system) upside-down? What if that technique could be identified as a key for re-designing various core algorithms with the effect of outperforming existing state-of-the-art methods considerably? Then we would write this paper. We introduce RUMA: Rewired User-space Memory Access. It allows for physiological data management, i.e. we allow developers to freely rewire the mappings from virtual to physical memory (in user space) while at the same time exploiting the virtual memory support offered by hardware and operating system. We show that fundamental database building blocks such as array operations, partitioning, sorting, and snapshotting benefit strongly from RUMA.
Asynchronous Memory Access Chaining

Onur Kocberber (EPFL); Babak Falsafi (EPFL); Boris Grot (University of Edinburgh)

Abstract:In-memory databases rely on pointer-intensive data structures to quickly locate data in memory. A single lookup operation in such data structures often exhibits long-latency memory stalls due to dependent pointer dereferences. Hiding the memory latency by launching additional memory accesses for other lookups is an effective way of improving performance of pointer-chasing codes (e.g., hash table probes, tree traversals). The ability to exploit such inter-lookup parallelism is beyond the reach of modern out-of-order cores due to the limited size of their instruction window. Instead, recent work has proposed software prefetching techniques that exploit inter-lookup parallelism by arranging a set of independent lookups into a group or a pipeline, and navigate their respective pointer chains in a synchronized fashion. While these techniques work well for highly regular access patterns, they break down in the face of irregularity across lookups. Such irregularity includes variable-length pointer chains, early exit, and read/write dependencies. This work introduces Asynchronous Memory Access Chaining (AMAC), a new approach for exploiting inter-lookup parallelism to hide the memory access latency. AMAC achieves high dynamism in dealing with irregularity across lookups by maintaining the state of each lookup separately from that of other lookups. This feature enables AMAC to initiate a new lookup as soon as any of the in-flight lookups complete. In contrast, the static arrangement of lookups into a group or pipeline in existing techniques precludes such adaptivity. Our results show that AMAC matches or outperforms state-of-the-art prefetching techniques on regular access patterns, while delivering up to 2.3x higher performance under irregular data structure lookups. AMAC fully utilizes the available micro-architectural resources, generating the maximum number of memory accesses allowed by hardware in both single- and multi-threaded execution modes.
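The state-separation idea can be caricatured with Python generators: each lookup is its own resumable state machine, and a scheduler round-robins among in-flight lookups so that no single pointer chase blocks the rest. The linked-chain table below is an invented stand-in for a real hash index, and Python generators only model, not achieve, memory-level parallelism:

```python
def probe(chains, key):
    """Walk one bucket chain one hop at a time; each yield models one
    dependent memory access."""
    node = chains.get(key[0])          # bucket chosen by first character
    while node is not None:
        k, v, node = node
        if k == key:
            yield ("hit", v)
            return
        yield ("hop", None)
    yield ("miss", None)

def amac_lookup(chains, keys):
    """Interleave many probes, keeping each lookup's state separate."""
    inflight = {k: probe(chains, k) for k in keys}
    results = {}
    while inflight:
        for k in list(inflight):
            status, v = next(inflight[k])   # advance one "memory access"
            if status != "hop":
                results[k] = v
                del inflight[k]             # slot frees up immediately
    return results

# bucket 'a' holds a two-node chain: ("ax" -> 1) then ("ay" -> 2)
chains = {"a": ("ax", 1, ("ay", 2, None))}
```

Because each lookup completes and vacates its slot independently, variable-length chains and early exits do not stall the others, which is the irregularity AMAC targets.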
WarpLDA: a Cache Efficient O(1) Algorithm for Latent Dirichlet Allocation

Jianfei Chen (Tsinghua University); Kaiwei Li (Tsinghua University); Jun Zhu (Tsinghua University); Wenguang Chen (Tsinghua University)

Abstract:Developing efficient and scalable algorithms for Latent Dirichlet Allocation (LDA) is of wide interest for many applications. Previous work has developed an O(1) Metropolis-Hastings (MH) sampling method for each token. However, its performance is far from being optimal due to frequent cache misses caused by random accesses to the parameter matrices. In this paper, we first carefully analyze the memory access behavior of existing algorithms for LDA by cache locality at document level. We then develop WarpLDA, which achieves O(1) time complexity per-token and fits the randomly accessed memory per-document in the L3 cache. Our empirical results in a wide range of testing conditions demonstrate that WarpLDA is consistently 5-15x faster than the state-of-the-art MH-based LightLDA, and is faster than the state-of-the-art sparsity-aware F+LDA in most settings. Our WarpLDA learns a million topics from 639 million documents in only five hours at an unprecedented throughput of 11 billion tokens per second.
Lifetime-Based Memory Management for Distributed Data Processing Systems

Lu Lu (Huazhong University of Science and Technology); Xuanhua Shi (Huazhong University of Science and Technology); Yongluan Zhou (University of Southern Denmark); Xiong Zhang (Huazhong University of Science and Technology); Hai Jin (Huazhong University of Science and Technology); Cheng Pei (Huazhong University of Science and Technology); Ligang He (University of Warwick); Yuanzhen Geng (Huazhong University of Science and Technology)

Abstract:In-memory caching of intermediate data and eager combining of data in shuffle buffers have been shown to be very effective in minimizing the re-computation and I/O cost in distributed data processing systems like Spark and Flink. However, it has also been widely reported that these techniques would create a large amount of long-living data objects in the heap, which may quickly saturate the garbage collector, especially when handling a large dataset, and hence would limit the scalability of the system. To eliminate this problem, we propose a lifetime-based memory management framework, which, by automatically analyzing the user-defined functions and data types, obtains the expected lifetime of the data objects, and then allocates and releases memory space accordingly to minimize the garbage collection overhead. In particular, we present Deca, a concrete implementation of our proposal on top of Spark, which transparently decomposes and groups objects with similar lifetimes into byte arrays and releases their space altogether when their lifetimes come to an end. An extensive experimental study using both synthetic and real datasets shows that, compared to Spark, Deca is able to 1) reduce garbage collection time by up to 99.9%, 2) achieve up to 22.7x speedup in execution time in cases without data spilling and 41.6x speedup in cases with data spilling, and 3) consume up to 46.6% less memory.

Research R5: Data Cleaning - 1

Location: Pearl 2

Chair: Ihab Ilyas, Univ. of Waterloo

Cleaning Timestamps with Temporal Constraints

Shaoxu Song (Tsinghua University); Yue Cao (Tsinghua University); Jianmin Wang (Tsinghua University)

Abstract:Timestamps are often found to be dirty in various scenarios, e.g., in distributed systems with clock synchronization problems or unreliable RFID readers. Without cleaning the imprecise timestamps, temporal-related applications such as provenance analysis or pattern queries are not reliable. To evaluate the correctness of timestamps, temporal constraints could be employed, which declare the distance restrictions between timestamps. Guided by such constraints on timestamps, in this paper, we study a novel problem of repairing inconsistent timestamps that do not conform to the required temporal constraints. Following the same line of data repairing, the timestamp repairing problem is to minimally modify the timestamps towards satisfaction of temporal constraints. This problem is practically challenging, given the huge space of possible timestamps. We tackle the problem by identifying a concise set of promising candidates, where an optimal repair solution can always be found. Repair algorithms with efficient pruning are then devised over the identified candidates. Experiments on real datasets demonstrate the superiority of our proposal compared to the state-of-the-art approaches.
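A deliberately naive illustration of constraint-driven timestamp repair: a single minimum-gap temporal constraint, repaired by a greedy forward pass that pushes each violating timestamp forward as little as possible. The paper's candidate-based algorithm handles general distance constraints and aims at globally minimal repairs, which a greedy pass does not guarantee:

```python
def repair_min_gap(ts, gap):
    """Repair a timestamp sequence so consecutive events are at least
    `gap` apart, nudging each violating timestamp forward minimally."""
    repaired = [ts[0]]
    for t in ts[1:]:
        repaired.append(max(t, repaired[-1] + gap))
    return repaired

# Clock skew made the third event appear too early (gap of 5 required):
print(repair_min_gap([0, 10, 12, 30], gap=5))   # [0, 10, 15, 30]
```

Only the inconsistent timestamp (12) is modified, mirroring the minimal-change principle of data repairing.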
Repairing Data through Regular Expressions

Zeyu Li (Harbin Institute of Technology); Hongzhi Wang (Harbin Institute of Technology); Wei Shao (Harbin Institute of Technology); Jianzhong Li (Harbin Institute of Technology); Hong Gao (Harbin Institute of Technology)

Abstract:Since regular expressions are often used to detect errors in sequences such as strings or dates, it is natural to use them for data repair. Motivated by this, we propose a data repair method based on regular expressions to make the input sequence data obey the given regular expression with minimal revisions. The proposed method contains two steps, sequence repair and token value repair. For sequence repair, we propose the Regular-expression-based Structural Repair algorithm (RSR in short). RSR is a dynamic programming algorithm that utilizes Nondeterministic Finite Automata (NFA) to calculate the edit distance between the prefix of the input string and the partial pattern regular expression with time complexity $O(mn^2)$ and space complexity $O(mn)$, where $m$ is the number of the NFA’s edges and $n$ is the input string length. We also give an optimization to achieve high performance. For token value repair, we combine the edit-distance-based method and association rules by a unified argument for the selection between the two methods. Experimental results on both real and synthetic data show that the proposed method can repair the data effectively and efficiently.
Combining Quantitative and Logical Data Cleaning

Nataliya Prokoshyna (University of Toronto); Jaroslaw Szlichta (University of Ontario Institute of Technology); Fei Chiang (McMaster University); Renée Miller (University of Toronto); Divesh Srivastava (AT&T Labs Research)

Abstract:Quantitative data cleaning relies on the use of statistical methods to identify and repair data quality problems while logical data cleaning tackles the same problems using various forms of logical reasoning over declarative dependencies. Each of these approaches has its strengths: the logical approach is able to capture subtle data quality problems using sophisticated dependencies, while the quantitative approach excels at ensuring that the repaired data has desired statistical properties. We propose a novel framework within which these two approaches can be used synergistically to combine their respective strengths. We instantiate our framework using (i) metric functional dependencies, a type of dependency that generalizes functional dependencies (FDs) to identify inconsistencies in domains where only large differences in metric data are considered to be a data quality problem, and (ii) repairs that modify the inconsistent data so as to minimize statistical distortion, measured using the Earth Mover’s Distance. We show that the problem of computing a statistical distortion minimal repair is NP-hard. Given this complexity, we present an efficient algorithm for finding a minimal repair that has a small statistical distortion using EMD computation over semantically related attributes. To identify semantically related attributes, we present a sound and complete axiomatization and an efficient algorithm for testing implication of metric FDs. While the complexity of inference for some other FD extensions is co-NP complete, we show that the inference problem for metric FDs remains linear, as in traditional FDs. We prove that every instance that can be generated by our repair algorithm is set-minimal (with no unnecessary changes). Our experimental evaluation demonstrates that our techniques obtain a considerably lower statistical distortion than existing repair techniques, while achieving similar levels of efficiency.
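The metric-FD relaxation is easy to illustrate: instead of flagging any two rows that agree on the left-hand side but differ on the right (as a classic FD would), flag only groups whose right-hand values differ by more than a tolerance. The schema, values, and threshold below are invented for illustration:

```python
def metric_fd_violations(rows, lhs, rhs, tol):
    """Return the lhs groups whose rhs values spread wider than `tol`,
    i.e. the violations of the metric FD  lhs -> rhs (within tol)."""
    groups = {}
    for r in rows:
        groups.setdefault(r[lhs], []).append(r[rhs])
    return [k for k, vals in groups.items() if max(vals) - min(vals) > tol]

rows = [{"zip": "10001", "price": 100},
        {"zip": "10001", "price": 103},   # small difference: tolerated
        {"zip": "10002", "price": 100},
        {"zip": "10002", "price": 150}]   # large difference: flagged
print(metric_fd_violations(rows, "zip", "price", tol=10))   # ["10002"]
```

Repairing the flagged group while minimizing statistical distortion (via the Earth Mover's Distance) is where the paper's quantitative machinery takes over.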
Temporal Rules Discovery for Web Data Cleaning

Ziawasch Abedjan (MIT); Cuneyt Akcora (Qatar Computing Research Institute); Mourad Ouzzani (Qatar Computing Research Institute); Paolo Papotti (Qatar Computing Research Institute); Michael Stonebraker (MIT)

Abstract:Declarative rules, such as functional dependencies, are widely used for cleaning data. Several systems take them as input for detecting errors and computing a “clean” version of the data. To support domain experts in specifying these rules, several tools have been proposed to profile the data and mine rules. However, existing discovery techniques have traditionally ignored the time dimension. Recurrent events, such as persons reported in locations, have a duration in which they are valid, and this duration should be part of the rules or the cleaning process would simply fail. In this work, we study the rule discovery problem for temporal web data. Such a discovery process is challenging because of the nature of web data; extracted facts are (i) sparse over time, (ii) reported with delays, and (iii) often reported with errors over the values because of inaccurate sources or non-robust extractors. We handle these challenges with a new discovery approach that is more robust to noise. Our solution uses machine learning methods, such as association measures and outlier detection, for the discovery of the rules, together with an aggressive repair of the data in the mining step itself. Our experimental evaluation over real-world data from Recorded Future, an intelligence company that monitors over 700K Web sources, shows that temporal rules improve the quality of the data with an increase of the average precision in the cleaning process from 0.37 to 0.84, and a 40% relative increase in the average F-measure.

Research R6: Graph Processing - 1

Location: Royal 1

Chair: Bin Cui, Peking Univ.

I/O Efficient ECC Graph Decomposition via Graph Reduction

Long Yuan (The University of New South Wales); Lu Qin (University of Technology, Sydney); Xuemin Lin (The University of New South Wales); Lijun Chang (The University of New South Wales); Wenjie Zhang (The University of New South Wales)

Abstract:The problem of computing k-edge connected components (k-ECCs) of a graph G for a specific k is a fundamental graph problem and has been investigated recently. In this paper, we study the problem of ECC decomposition, which computes the k-ECCs of a graph G for all k values. ECC decomposition can be widely applied in a variety of applications such as graph-topology analysis, community detection, Steiner component search, and graph visualization. A straightforward solution for ECC decomposition is to apply the existing k-ECC computation algorithm to compute the k-ECCs for all k values. However, this solution is not applicable to large graphs for two challenging reasons. First, all existing k-ECC computation algorithms are highly memory intensive due to the complex data structures used in the algorithms. Second, the number of possible k values can be very large, resulting in a high computational cost when each k value is independently considered. In this paper, we address the above challenges, and study I/O efficient ECC decomposition via graph reduction. We introduce two elegant graph reduction operators which aim to reduce the size of the graph loaded in memory while preserving the connectivity information of a certain set of edges to be computed for a specific k. We also propose three novel I/O efficient algorithms, Bottom-Up, Top-Down, and Hybrid, that explore the k values in different orders to reduce the redundant computations between different k values. We analyze the I/O and memory costs for all proposed algorithms. In our experiments, we evaluate our algorithms using seven real large datasets with various graph properties, one of which contains 1.95 billion edges. The experimental results show that our proposed algorithms are scalable and efficient.
LEOPARD: Lightweight Edge-Oriented Partitioning and Replication for Dynamic Graphs

Jiewen Huang (Yale University); Daniel Abadi (Yale University)

Abstract:This paper introduces a dynamic graph partitioning algorithm, designed for large, constantly changing graphs. We propose a partitioning framework that adjusts on the fly as the graph structure changes. We also introduce a replication algorithm that is tightly integrated with the partitioning algorithm, which further reduces the number of edges cut by the partitioning algorithm. Even though the proposed approach is handicapped by only taking into consideration local parts of the graph when reassigning vertices, extensive evaluation shows that the proposed approach maintains a quality partitioning over time, which is comparable at any point in time to performing a full partitioning from scratch using a state-of-the-art static graph partitioning algorithm such as METIS. Furthermore, when vertex replication is turned on, edge-cut can improve by an order of magnitude.
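The local-information flavor of such reassignment can be sketched as follows. This toy rule (move a vertex to the partition holding most of its neighbors whenever an incident edge arrives) is only in the spirit of LEOPARD and ignores its replication and load-balancing machinery:

```python
def best_partition(v, adj, part, n_parts):
    """Score partitions by how many of v's neighbors they hold."""
    score = [0] * n_parts
    for u in adj.get(v, ()):
        score[part[u]] += 1
    return max(range(n_parts), key=lambda p: score[p])

def add_edge(u, v, adj, part, n_parts):
    """On edge arrival, let both endpoints reconsider their placement
    using only local (neighborhood) information."""
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
    for w in (u, v):
        part[w] = best_partition(w, adj, part, n_parts)

adj, part = {}, {"a": 0, "b": 1, "c": 1}
add_edge("a", "b", adj, part, 2)   # "a" migrates toward its neighbors
add_edge("a", "c", adj, part, 2)
```

After the two edge arrivals all three vertices co-locate in partition 1, leaving an edge cut of zero, without any global repartitioning step.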
Towards Maximum Independent Sets on Massive Graphs

Yu Liu (Renmin University of China); Jiaheng Lu (University of Helsinki); Hua Yang (Renmin University of China); Xiaokui Xiao (Nanyang Technological University); Zhewei Wei (Renmin University of China)

Abstract:Maximum independent set (MIS) is a fundamental problem in graph theory and it has important applications in many areas such as social network analysis, graphical information systems and coding theory. The problem is NP-hard, and there have been numerous studies on its approximate solutions. While successful to a certain degree, the existing methods require memory space at least linear in the size of the input graph. This has become a serious concern in view of the massive volume of today’s fast-growing graphs. In this paper, we study the MIS problem under the semi-external setting, which assumes that the main memory can accommodate all vertices of the graph but not all edges. We present a greedy algorithm and a general vertex-swap framework, which swaps vertices to incrementally increase the size of independent sets. Our solutions require only a few sequential scans of the graph on disk, thus enabling in-memory computation without costly random disk accesses. Experiments on large-scale datasets show that our solutions are able to compute a large independent set for a massive graph with 59 million vertices and 151 million edges using a commodity machine, with a memory cost of 469MB and a time cost of three minutes, while yielding an approximation ratio that is around 99% of the theoretical optimum.
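The greedy half of the approach fits the semi-external setting naturally: per-vertex states fit in memory while edges are only streamed. A minimal sketch (evicting the second endpoint is an arbitrary choice, and a single pass yields a maximal, not maximum, independent set, which the paper's vertex-swap framework then improves):

```python
def greedy_mis(num_vertices, edge_stream):
    """One sequential pass over the edge stream: start with every vertex
    IN, and evict one endpoint whenever an edge has both ends still IN.
    Uses O(V) memory; edges are never stored."""
    in_set = [True] * num_vertices
    for u, v in edge_stream:
        if in_set[u] and in_set[v]:
            in_set[v] = False   # evict the second endpoint
    return {v for v in range(num_vertices) if in_set[v]}

# Path 0-1-2-3: one scan of its edges leaves the independent set {0, 2}
print(greedy_mis(4, [(0, 1), (1, 2), (2, 3)]))
```

No edge of the input survives with both endpoints retained, so the result is independent by construction; swaps then grow it toward the optimum.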
Weaver: A High-Performance, Transactional Graph Database Based on Refinable Timestamps

Ayush Dubey (Cornell University); Greg Hill (Stanford University); Robert Escriva (Cornell University); Emin Sirer (Cornell University)

Abstract:Graph databases have become an increasingly common infrastructure component. Yet existing systems either operate on offline snapshots, provide weak consistency guarantees, or use expensive concurrency control techniques that limit performance. In this paper, we introduce a new distributed graph database, called Weaver, which enables efficient, transactional graph analyses as well as strictly serializable ACID transactions on dynamic graphs. The key insight that allows Weaver to combine strict serializability with horizontal scalability and high performance is a novel request ordering mechanism called refinable timestamps. This technique couples coarse-grained vector timestamps with a fine-grained timeline oracle to pay the overhead of strong consistency only when needed. Experiments show that Weaver enables a Bitcoin blockchain explorer that is 8x faster than Blockchain.info, and achieves 12x higher throughput than the Titan graph database on social network workloads and 4x lower latency than GraphLab on offline graph traversal workloads.
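The two-level ordering idea can be sketched with plain vector timestamps plus a fallback comparator standing in for the fine-grained timeline oracle. This illustrates the concept only, not Weaver's actual protocol:

```python
def vector_compare(a, b):
    """Return -1/1 if a happens before/after b, 0 if equal,
    or None if the vector timestamps are concurrent."""
    le = all(x <= y for x, y in zip(a, b))
    ge = all(x >= y for x, y in zip(a, b))
    if le and not ge:
        return -1
    if ge and not le:
        return 1
    return 0 if (le and ge) else None

def order(a, b, oracle):
    """Cheap vector comparison first; consult the (expensive) oracle
    only when the timestamps are concurrent."""
    cmp = vector_compare(a, b)
    return cmp if cmp is not None else oracle(a, b)

# (3,1) vs (3,2): vector order suffices, the oracle is never consulted
print(order((3, 1), (3, 2), oracle=None))   # -1
```

Paying the oracle's cost only for concurrent requests is the sense in which refinable timestamps charge for strong consistency "only when needed".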

Industrial I2: Distributed Data Analytics Platforms - 2

Location: Royal 2

Chair: Justin Levandoski, Microsoft

Big Data Analysis as a Service on Public Clouds (Invited talk)

Rajat Venkatesh (Qubole)

Abstract:Public clouds pose challenges (reliability, IO latency etc) as well as provide opportunities (elasticity, pricing models) for large volume data processing. In this talk, I will present the lessons learnt and features (auto-scaling, hybrid clusters with spot nodes, caching etc) implemented in Qubole’s data service to provide an efficient (cost & performance) and a reliable data processing service in the top public clouds. QDS allows users to run batch or ETL, interactive and streaming workloads. The technologies used for each of these workloads are different. Apache Hive, Apache Hadoop M/R and Apache Pig are popular for batch workloads. Presto & SparkSQL are used for interactive analytic queries. Apache Spark is used for machine learning. Spark Streaming is used for streaming workloads. Apache HBase is the K-V store. As such the design and architecture of each of these technologies is very different. We have focused on each of these technologies separately and augmented them to run better on the cloud as well as a service. This talk will iterate through each of these technologies, list areas where the open source distribution has gaps and how QDS has closed these gaps. Since these technologies are representative in their class, the features implemented in QDS are broadly applicable. I will also provide real-world metrics for performance gains and cost benefits where applicable.
dmapply: A functional primitive to express distributed machine learning algorithms in R

Edward Ma (Hewlett-Packard Enterprise Vertica); Vishrut Gupta (Hewlett-Packard Enterprise Vertica); Meichun Hsu (Hewlett-Packard Enterprise Vertica); Indrajit Roy (Hewlett Packard Labs)

Abstract:Due to R’s popularity as a data-mining tool, many distributed systems expose an R-based API to users who need to build a distributed application in R. As a result, data scientists have to learn to use different interfaces such as RHadoop, SparkR, Revolution R’s ScaleR, and HPE’s Distributed R. Unfortunately, these interfaces are custom, non-standard, and difficult to learn. Not surprisingly, R applications written in one framework do not work in another, and each backend infrastructure has spent redundant effort in implementing distributed machine learning algorithms. Working with the members of R-core, we have created ddR (Distributed Data structures in R), a unified system that works across different distributed frameworks. In ddR, we introduce a novel programming primitive called dmapply that executes functions on distributed data structures. The dmapply primitive encapsulates different computation patterns: from function and data broadcast to pair-wise communication. We show that dmapply is powerful enough to express algorithms that fit the statistical query model, which includes many popular machine learning algorithms, as well as applications written in MapReduce. We have integrated ddR with many backends, such as R’s single-node parallel framework, multi-node SNOW framework, Spark, and HPE Distributed R, with few or no modifications to any of these systems. We have also implemented multiple machine learning algorithms which are not only portable across different distributed systems, but also have performance comparable to the “native” implementations on the backends. We believe that ddR will standardize distributed computing in R, just like the SQL interface has standardized how relational data is manipulated.
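A single-machine caricature of the dmapply pattern: apply a function partition-wise, then combine at the driver. Real ddR dispatches partitions to a distributed backend and is an R interface; this sketch is in Python purely for illustration:

```python
def dmapply(fn, dlist, more_args=()):
    """Apply `fn` independently to each partition of a 'distributed' list;
    a real backend could run these in parallel across workers."""
    return [fn(part, *more_args) for part in dlist]

# Distributed sum in the statistical-query style: per-partition partial
# sums, followed by a cheap driver-side combine.
parts = [[1, 2, 3], [4, 5], [6]]
partials = dmapply(sum, parts)
print(sum(partials))   # 21
```

Many statistical-query-model algorithms (means, gradients, counts) reduce to exactly this partial-compute-then-combine shape, which is why one primitive can cover them.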
SystemML: Declarative Machine Learning on Spark

Matthias Boehm (IBM Research – Almaden); Michael Dusenberry (IBM Spark Technology Center); Deron Eriksson (IBM Spark Technology Center); Alexandre Evfimievski (IBM Research – Almaden); Faraz Makari Manshadi (IBM Research – Almaden); Niketan Pansare (IBM Research – Almaden); Berthold Reinwald (IBM Research – Almaden); Frederick Reiss (IBM Research – Almaden); Prithviraj Sen (IBM Research – Almaden); Arvind Surve (IBM Spark Technology Center); Shirish Tatikonda (IBM Research – Almaden)

Abstract:The rising need for custom machine learning (ML) algorithms and the growing data sizes that require the exploitation of distributed, data-parallel frameworks such as MapReduce or Spark, pose significant productivity challenges to data scientists. Apache SystemML addresses these challenges through declarative ML by (1) increasing the productivity of data scientists as they are able to express custom algorithms in a familiar domain-specific language covering linear algebra primitives and statistical functions, and (2) transparently running these ML algorithms on distributed, data-parallel frameworks by applying cost-based compilation techniques to generate efficient, low-level execution plans with in-memory single-node and large-scale distributed operations. This paper describes SystemML on Apache Spark, end to end, including insights into various optimizer and runtime techniques as well as performance characteristics. We also share lessons learned from porting SystemML to Spark and declarative ML in general. Finally, SystemML is open-source, which allows the database community to leverage it as a testbed for further research.
Not for the Timid: On the Impact of Aggressive Over-booking in the Cloud

Willis Lang (Microsoft); Karthik Ramachandra (Microsoft); David DeWitt (Microsoft); Shize Xu (Microsoft); Qun Guo (Microsoft); Ajay Kalhan (Microsoft); Peter Carlin (Microsoft)

Abstract:To lower hosting costs and service prices, database-as-a-service (DBaaS) providers strive to maximize cluster utilization without negatively affecting their users’ service experience. Some of the most effective approaches for increasing service efficiency result in the over-booking of the cluster with user databases. For instance, one approach is to reclaim cluster capacity from a database when it is idle, temporarily re-using the capacity for some other purpose, and over-booking the cluster’s resources. Such approaches are largely driven by policies that determine when it is prudent to temporarily reclaim capacity from an idle database. In this paper, we examine policies that inherently tune the system’s idle sensitivity. Increased sensitivity to idleness leads to aggressive over-booking while the converse leads to conservative reclamation and lower utilization levels. Aggressive over-booking also incurs a “reserve” capacity cost (for when we suddenly “owe” capacity to previously idle databases.) We answer these key questions in this paper: (1) how to find a “good” resource reclamation policy for a given DBaaS cluster of users; and (2) how to forecast the needed near-term reserve capacity. To help us answer these questions, we used production user activity traces from Azure SQL DB and built models of an over-booking mechanism. We show that choosing the right policy can substantially boost the efficiency of the service, facilitating lower service prices via lower amortized infrastructure costs.

Demo 2a: Interactive and Exploratory Systems

Location: Maple

ArchimedesOne: Query Processing over Probabilistic Knowledge Bases

Xiaofeng Zhou (University of Florida); Yang Chen (University of Florida); Daisy Zhe Wang (University of Florida)

Rudolf: Interactive Rule Refinement System for Fraud Detection

Tova Milo (Tel-Aviv University); Slava Novgorodov (Tel-Aviv University); Wang-Chiew Tan (University of California at Santa Cruz)

Ziggy: Characterizing Query Results for Data Explorers

Thibault Sellam (CWI); Martin Kersten (CWI)

Blaeu: Mapping and Navigating Large Tables with Cluster Analysis

Thibault Sellam (CWI); Robin Cijvat (MonetDB Solutions); Richard Koopmanschap (MonetDB Solutions); Martin Kersten (CWI)

A Demonstration of VisDPT: Visual Exploration of Differentially Private Trajectories

Xi He (Duke University); Nisarg Raval (Duke University); Ashwin Machanavajjhala (Duke University)

YASK: A Why-Not Question Answering Engine for Spatial Keyword Query Services

Lei Chen (Hong Kong Baptist University); Jianliang Xu (Hong Kong Baptist University); Christian Jensen (Aalborg University); Yafei Li (Hong Kong Baptist University)

Exploring Databases via Reverse Engineering Ranking Queries with PALEO

Kiril Panev (TU Kaiserslautern); Sebastian Michel (TU Kaiserslautern); Evica Milchevski (TU Kaiserslautern); Koninika Pal (TU Kaiserslautern)

ExRank: An Exploratory Ranking Interface

Ramon Bespinyowong (National University of Singapore); Wei Chen (Zhejiang University); H. V. Jagadish (University of Michigan); Yuxin Ma (Zhejiang University)

NLProv: Natural Language Provenance

Daniel Deutch (Tel Aviv University); Nave Frost (Tel Aviv University); Amir Gilad (Tel Aviv University)

Towards Personalized Maps: Mining User Preferences from Geo-textual Data

Kaiqi Zhao (Nanyang Technological University); Yiding Liu (Nanyang Technological University); Quan Yuan (Nanyang Technological University); Lisi Chen (Hong Kong Baptist University); Zhida Chen (Nanyang Technological University); Gao Cong (Nanyang Technological University)

A System for Region Search and Exploration

Kaiyu Feng (Nanyang Technological University); Kaiqi Zhao (Nanyang Technological University); Yiding Liu (Nanyang Technological University); Gao Cong (Nanyang Technological University)

SigmaKB: Multiple Probabilistic Knowledge Base Fusion

Miguel Rodriguez (University of Florida); Sean Goldberg (University of Florida); Daisy Zhe Wang (University of Florida)

Collaborative Crowdsourcing with Crowd4U

Kosetsu Ikeda (University of Tsukuba); Atsuyuki Morishima (University of Tsukuba); Habibur Rahman (UT Arlington); Senjuti Basu Roy (NJIT); Saravanan Thirumuruganathan (QCRI, HBKU); Sihem Amer-Yahia (CNRS LIG); Gautam Das (UT Arlington)

Research Posters 2

Location: Maple

Tuesday Sep 6th, 4:00 pm - 5:30 pm

Research R7: Query Execution -1

Location: Pearl 1

Chair: Alfons Kemper, TU Munich

Parallel Evaluation of Multi-Semi-Joins

Jonny Daenen (Hasselt University); Frank Neven (Hasselt University); Tony Tan (National Taiwan University); Stijn Vansummeren (Université Libre de Bruxelles)

Abstract:While services such as Amazon AWS make computing power abundantly available, adding more computing nodes can incur high costs in, for instance, pay-as-you-go plans, while not always significantly improving the net running time (also known as wall-clock time) of queries. In this work, we provide algorithms for the parallel evaluation of SGF queries in MapReduce that optimize total time while retaining low net time. SGF queries can express not only all semi-join reducers but also more expressive queries involving disjunction and negation. Since SGF queries can be seen as Boolean combinations of (potentially nested) semi-joins, we introduce a novel multi-semi-join (MSJ) MapReduce operator that enables the evaluation of a set of semi-joins in one job. We use this operator to obtain parallel query plans for SGF queries that outperform sequential plans w.r.t. net time, and provide additional optimizations aimed at minimizing total time without severely affecting net time. Even though the latter optimizations are NP-hard, we present effective greedy algorithms. Our experiments, conducted using our own implementation Gumbo on top of Hadoop, confirm the usefulness of parallel query plans and the effectiveness and scalability of our optimizations, all with a significant improvement over Pig and Hive.
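To make the semi-join building block concrete, here is a toy single-machine sketch of evaluating a Boolean combination of semi-joins over one shared pass per inner relation, in the spirit of the MSJ operator described above. All names and the relation encoding (lists of dicts) are illustrative, not Gumbo's actual MapReduce implementation.

```python
# Toy sketch: evaluate a Boolean combination of semi-joins against a guard
# relation, scanning each inner relation only once (shared evaluation).

def semijoin_keys(rel, key):
    """Set of join-key values present in a relation."""
    return {row[key] for row in rel}

def multi_semi_join(guard, inner_rels, key, combine):
    """Keep guard rows according to a Boolean combination of semi-join tests.

    combine takes one boolean per inner relation ("does this guard row
    find a partner there?") and returns whether the row is kept.
    """
    key_sets = [semijoin_keys(r, key) for r in inner_rels]  # one pass each
    return [row for row in guard
            if combine(*[row[key] in ks for ks in key_sets])]

R = [{"k": 1}, {"k": 2}, {"k": 3}]
S = [{"k": 1}, {"k": 3}]
T = [{"k": 3}]

# rows of R with a partner in S but not in T (semi-join AND anti-semi-join)
out = multi_semi_join(R, [S, T], "k", lambda in_s, in_t: in_s and not in_t)
# out == [{"k": 1}]
```

Disjunction and negation, which SGF queries support, correspond to different `combine` functions over the same shared semi-join tests.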
Lightning Fast and Space Efficient Inequality Joins

Zuhair Khayyat (King Abdullah University of Science and Technology); William Lucia (Qatar Computing Research Institute); Meghna Singh (Qatar Computing Research Institute); Mourad Ouzzani (Qatar Computing Research Institute); Paolo Papotti (Qatar Computing Research Institute); Jorge-Arnulfo Quiane-Ruiz (Qatar Computing Research Institute); Nan Tang (Qatar Computing Research Institute); Panos Kalnis (King Abdullah University of Science and Technology)

Abstract:Inequality joins, which join relational tables on inequality conditions, are used in various applications. While there has been a wide range of optimization methods for joins in database systems, from algorithms such as sort-merge join and band join to indices such as B+-tree, R-tree and Bitmap, inequality joins have received little attention, and queries containing such joins are usually very slow. In this paper, we introduce fast inequality join algorithms. We put the columns to be joined in sorted arrays and use permutation arrays to encode the positions of tuples in one sorted array w.r.t. the other sorted array. In contrast to sort-merge join, we use space-efficient bit-arrays that enable optimizations, such as Bloom filter indices, for fast computation of the join results. We have implemented a centralized version of these algorithms on top of PostgreSQL, and a distributed version on top of Spark SQL. We have compared against well-known optimization techniques for inequality joins and show that our solution is more scalable and several orders of magnitude faster.
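The sort/permutation/bit-array idea can be illustrated with a small self-join sketch: find pairs with x_i < x_j and y_i > y_j. This is a simplified single-machine illustration of the technique (assuming distinct attribute values), not the paper's full algorithm.

```python
# Simplified sketch of an inequality self-join via one sort order, a
# permutation between the two sort orders, and a bit array.

def inequality_self_join(rows):
    """Pairs (i, j) with rows[i].x < rows[j].x and rows[i].y > rows[j].y."""
    n = len(rows)
    l1 = sorted(range(n), key=lambda i: rows[i][0])   # row ids sorted by x asc
    pos_in_l1 = {i: p for p, i in enumerate(l1)}      # permutation: id -> x-rank
    bits = [False] * n                                # marks over L1 positions
    result = []
    # Visit rows by y descending: every already-visited row has a larger y,
    # so it joins the current row iff its x is smaller (L1 position < p).
    for j in sorted(range(n), key=lambda i: -rows[i][1]):
        p = pos_in_l1[j]
        result.extend((l1[q], j) for q in range(p) if bits[q])
        bits[p] = True
    return result

rows = [(1, 9), (2, 5), (3, 7), (4, 1)]   # (x, y) tuples
pairs = inequality_self_join(rows)
# set(pairs) == {(0, 1), (0, 2), (0, 3), (1, 3), (2, 3)}
```

Replacing the plain Python list of booleans with a packed bit-array (as the paper does) is what makes the scan cache- and space-efficient at scale.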
MQJoin: Efficient Shared Execution of Main-Memory Joins

Darko Makreshanski (ETH Zurich); Georgios Giannikis (Oracle Labs); Gustavo Alonso (ETH Zurich); Donald Kossmann (Microsoft Research)

Abstract:Database architectures typically process queries one-at-a-time, executing concurrent queries in independent execution contexts. Often, such a design leads to unpredictable performance and poor scalability. One approach to circumvent the problem is to take advantage of sharing opportunities across concurrently running queries. In this paper we propose Many-Query Join (MQJoin), a novel method for sharing the execution of a join that can efficiently deal with hundreds of concurrent queries. This is achieved by minimizing redundant work and making efficient use of main-memory bandwidth and multi-core architectures. Compared to existing proposals, MQJoin is able to efficiently handle larger workloads regardless of the schema by exploiting more sharing opportunities. We also compared MQJoin to two commercial main-memory column-store databases. For a TPC-H based workload, we show that MQJoin provides 2-5x higher throughput with significantly more stable response times.
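One common way to share a join across concurrent queries, in the spirit of what the abstract describes, is to tag each tuple with a bitmask of the queries it is relevant to and run the join once. The following is an illustrative toy sketch (names and data layout are assumptions, not MQJoin's actual design).

```python
# Toy sketch of shared join execution: one hash join serves many queries,
# with per-tuple query bitmasks intersected on output.

def shared_hash_join(build_rows, probe_rows, key):
    """build_rows/probe_rows: lists of (row_dict, query_bitmask)."""
    table = {}
    for row, mask in build_rows:                 # build once for all queries
        table.setdefault(row[key], []).append((row, mask))
    out = []
    for row, mask in probe_rows:                 # probe once for all queries
        for brow, bmask in table.get(row[key], []):
            shared = mask & bmask                # queries both sides satisfy
            if shared:
                out.append(({**brow, **row}, shared))
    return out

# Query 1 (bit 0) and query 2 (bit 1) share the same join predicate but
# select different customers; the join itself runs only once.
cust = [({"cid": 1, "name": "a"}, 0b01), ({"cid": 2, "name": "b"}, 0b11)]
orders = [({"cid": 2, "item": "x"}, 0b11), ({"cid": 1, "item": "y"}, 0b10)]
joined = shared_hash_join(cust, orders, "cid")
# one output row, relevant to both queries (mask 0b11)
```

The redundant work avoided here (building the hash table and probing it once per query) is exactly what grows with hundreds of concurrent queries.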
Incremental Computation of Common Windowed Holistic Aggregates

Richard Wesley (Tableau Software); Fei Xu (Tableau Software)

Abstract:Windowed aggregates are a SQL 2003 feature for computing aggregates in moving windows. Common examples include cumulative sums, local maxima and moving quantiles. With the advent over the last few years of easy-to-use data analytics tools, these functions are becoming widely used by more and more analysts, but some aggregates (such as local maxima) are much easier to compute than others (such as moving quantiles). Nevertheless, aggregates that are more difficult to compute, like quantile and mode (or “most frequent”) provide more appropriate statistical summaries in the common situation when a distribution is not Gaussian and are an essential part of a data analysis toolkit. Recent work has described highly efficient windowed implementations of the most common aggregate function categories, including distributive aggregates such as cumulative sums and algebraic aggregates such as moving averages. But little has been published on either the implementation or the performance of the more complex holistic windowed aggregates such as moving quantiles. This paper provides the first in-depth study of how to efficiently implement the three most common holistic windowed aggregates (count distinct, mode and quantile) by reusing the aggregate state between consecutive frames. Our measurements show that these incremental algorithms generally achieve improvements of about 10× over naive implementations, and that they can effectively detect when to reset the internal state during extreme frame variation.
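The state-reuse idea for holistic windowed aggregates can be shown with a minimal sketch: a moving COUNT(DISTINCT) that updates a multiset between consecutive frames instead of recomputing each frame from scratch. This is an illustration of the general technique, not the paper's implementation.

```python
# Incremental moving COUNT(DISTINCT): reuse the aggregate state (a
# multiset of frame contents) between consecutive sliding frames.
from collections import Counter

def moving_count_distinct(values, frame):
    """COUNT(DISTINCT) over a trailing frame of `frame` rows."""
    state = Counter()
    out = []
    for i, v in enumerate(values):
        state[v] += 1                      # row entering the frame
        if i >= frame:
            old = values[i - frame]        # row leaving the frame
            state[old] -= 1
            if state[old] == 0:
                del state[old]
        out.append(len(state))             # distinct values now in frame
    return out

# frame of 3 trailing rows (including the current one)
print(moving_count_distinct([1, 2, 2, 3, 1, 1], 3))  # [1, 2, 2, 2, 3, 2]
```

Each step does O(1) multiset updates, versus rescanning the whole frame per row in a naive implementation; the same pattern extends to mode and quantile with richer state.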

Research R8: Data Security and Privacy

Location: Pearl 2

Chair: Lei Chen, HKUST

Oblivious RAM: A Dissection and Experimental Evaluation [Experiments and Analyses]

Zhao Chang (University of Utah); Dong Xie (University of Utah); Feifei Li (University of Utah)

Abstract:Many companies choose the cloud as their data and IT infrastructure platform. The remote access of the data brings the issue of trust. Despite the use of strong encryption schemes, adversaries can still learn valuable information about encrypted data by observing the data access patterns. To that end, one can hide the access patterns, which may leak sensitive information, using Oblivious RAMs (ORAMs). Numerous works have proposed different ORAM constructions, but they have never been thoroughly compared against each other and tested on large databases. There are also no open-source implementations of these schemes. These limitations make it difficult for researchers and practitioners to choose and adopt a suitable ORAM for their applications. To address this issue, we provide a thorough study of several practical ORAM constructions and implement them under the same library. We perform extensive experiments to provide insights into their performance characteristics with respect to efficiency, scalability, and communication cost.
Design of Policy-Aware Differentially Private Algorithms

Samuel Haney (Duke University); Ashwin Machanavajjhala (Duke University); Bolin Ding (Microsoft Research)

Abstract:The problem of designing error optimal differentially private algorithms is well studied. Recent work applying differential privacy to real world settings have used variants of differential privacy that appropriately modify the notion of neighboring databases. The problem of designing error optimal algorithms for such variants of differential privacy is open. In this paper, we show a novel transformational equivalence result that can turn the problem of query answering under differential privacy with a modified notion of neighbors to one of query answering under standard differential privacy, for a large class of neighbor definitions. We utilize the Blowfish privacy framework that generalizes differential privacy. Blowfish uses a policy graph to instantiate different notions of neighboring databases. We show that the error incurred when answering a workload W on a database x under a Blowfish policy graph G is identical to the error required to answer a transformed workload f_G(W) on database g_G(x) under standard differential privacy, where f_G and g_G are linear transformations based on G. Using this result, we develop error efficient algorithms for releasing histograms and multidimensional range queries under different Blowfish policies. We believe the tools we develop will be useful for finding mechanisms to answer many other classes of queries with low error under other policy graphs.
Behavior Query Discovery in System-Generated Temporal Graphs

Bo Zong (University of California at Santa Barbara); Xusheng Xiao (NEC Labs America, Inc.); Zhichun Li (NEC Labs America, Inc.); Zhenyu Wu (NEC Labs America, Inc.); Zhiyun Qian (University of California at Riverside); Xifeng Yan (University of California at Santa Barbara); Ambuj Singh (University of California at Santa Barbara); Guofei Jiang (NEC Labs America, Inc.)

Abstract:Computer system monitoring generates huge amounts of logs that record the interaction of system entities. How to query such data to better understand system behaviors and identify potential system risks and malicious behaviors becomes a challenging task for system administrators due to the dynamics and heterogeneity of the data. System monitoring data are essentially heterogeneous temporal graphs with nodes being system entities and edges being their interactions over time. Given the complexity of such graphs, it becomes time-consuming for system administrators to manually formulate useful queries in order to examine abnormal activities, attacks, and vulnerabilities in computer systems. In this work, we investigate how to query temporal graphs and treat query formulation as a discriminative temporal graph pattern mining problem. We introduce TGMiner to mine discriminative patterns from system logs, and these patterns can be taken as templates for building more complex queries. TGMiner leverages temporal information in graphs to prune graph patterns that share similar growth trend without compromising pattern quality. Experimental results on real system data show that TGMiner is 6-32 times faster than baseline methods. The discovered patterns were verified by system experts; they achieved high precision (97%) and recall (91%).

Research R9: Ranking Queries

Location: Royal 1

Chair: Renee Miller, Univ. of Toronto

Query Reranking As A Service

Abolfazl Asudeh (UT Arlington); Nan Zhang (George Washington University); Gautam Das (UT Arlington)

Abstract:The ranked retrieval model has rapidly become the de facto way for search query processing in client-server databases, especially those on the web. Despite the extensive efforts in the database community on designing better ranking functions/mechanisms, many such databases in practice still fail to address the diverse and sometimes contradicting preferences of users on tuple ranking, perhaps (at least partially) due to the lack of expertise and/or motivation for the database owner to design truly effective ranking functions. This paper takes a different route on addressing the issue by defining a novel query reranking problem, i.e., we aim to design a third-party service that uses nothing but the public search interface of a client-server database to enable the on-the-fly processing of queries with any user-specified ranking functions (with or without selection conditions), no matter whether the ranking function is supported by the database or not. We analyze the worst-case complexity of the problem and introduce a number of ideas, e.g., on-the-fly indexing, domination detection, and virtual tuple pruning, to reduce the average-case cost of the query reranking algorithm. We also present extensive experimental results on real-world datasets, in both offline and live online systems, that demonstrate the effectiveness of our proposed techniques.
Finding Pareto Optimal Groups: Group-based Skyline

Jinfei Liu (Emory University); Li Xiong (Emory University); Jian Pei (Simon Fraser University); Jun Luo (Lenovo); Haoyu Zhang (Emory University)

Abstract:Skyline computation, aiming at identifying a set of skyline points that are not dominated by any other point, is particularly useful for multi-criteria data analysis and decision making. Traditional skyline computation, however, is inadequate to answer queries that need to analyze not only individual points but also groups of points. To address this gap, we generalize the original skyline definition to the novel group-based skyline (G-Skyline), which represents Pareto-optimal groups that are not dominated by other groups. In order to compute G-Skyline groups consisting of k points efficiently, we present a novel structure that represents the points in a directed skyline graph and captures the dominance relationships among the points based on the first k skyline layers. We propose efficient algorithms to compute the first k skyline layers. We then present two heuristic algorithms to efficiently compute the G-Skyline groups: the point-wise algorithm and the unit group-wise algorithm, using various pruning strategies. The experimental results on the real NBA dataset and the synthetic datasets show that G-Skyline is interesting and useful, and our algorithms are efficient and scalable.
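The "skyline layers" notion used above can be sketched by repeated skyline peeling: layer 1 is the skyline, and layer i+1 is the skyline of what remains after removing layers 1..i. A minimal illustrative sketch (smaller is better in every dimension here; this is the naive O(n²)-per-layer version, not the paper's efficient algorithms):

```python
# Toy sketch of computing the first k skyline layers by peeling.

def dominates(p, q):
    """p dominates q: p <= q in every dimension and < in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    return [p for p in points if not any(dominates(q, p) for q in points)]

def skyline_layers(points, k):
    layers, rest = [], list(points)
    while rest and len(layers) < k:
        layer = skyline(rest)
        layers.append(layer)
        rest = [p for p in rest if p not in layer]
    return layers

pts = [(1, 5), (2, 2), (5, 1), (3, 3), (4, 4)]
print(skyline_layers(pts, 2))  # [[(1, 5), (2, 2), (5, 1)], [(3, 3)]]
```

Restricting attention to the first k skyline layers is what lets the paper prune candidate groups: a G-Skyline group of size k can only draw its points from those layers.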
k-Regret Queries with Nonlinear Utilities

Taylor Kessler Faulkner (Denison University); Will Brackenbury (Denison University); Ashwin Lall (Denison University)

Abstract:In exploring representative databases, a primary issue has been finding accurate models of user preferences. Given this, our work generalizes the method of regret minimization as proposed by Nanongkai et al. to include nonlinear utility functions. Regret minimization is an approach for selecting k representative points from a database such that every user’s ideal point in the entire database is similar to one of the k points. This approach combines benefits of the top-k and skyline methods; it controls the size of the output but does not require knowledge of users’ preferences. Prior work with k-regret queries assumes users’ preferences to be modeled by linear utility functions. In this paper, we derive upper and lower bounds for nonlinear utility functions, as these functions can better fit phenomena such as diminishing marginal returns, propensity for risk, and substitutability of preferences. To model these phenomena, we analyze a broad subset of convex, concave, and constant elasticity of substitution functions. We also run simulations on real and synthetic data to demonstrate the efficacy of our bounds in practice.
Discovering the Skyline of Web Databases

Abolfazl Asudeh (UT Arlington); Saravanan Thirumuruganathan (UT Arlington); Nan Zhang (George Washington University); Gautam Das (UT Arlington)

Abstract:Many web databases are “hidden” behind proprietary search interfaces that enforce the top-k output constraint, i.e., each query returns at most k of all matching tuples, preferentially selected and returned according to a proprietary ranking function. In this paper, we initiate research into the novel problem of skyline discovery over top-k hidden web databases. Since skyline tuples provide critical insights into the database and include the top-ranked tuple for every possible ranking function following the monotonic order of attribute values, skyline discovery from a hidden web database can enable a wide variety of innovative third-party applications over one or multiple web databases. Our research in the paper shows that the critical factor affecting the cost of skyline discovery is the type of search interface controls provided by the website. As such, we develop efficient algorithms for the three most popular types, i.e., one-ended range, free range, and point predicates, and then combine them to support web databases that feature a mixture of these types. Rigorous theoretical analysis and extensive real-world online and offline experiments demonstrate the effectiveness of our proposed techniques and their superiority over baseline solutions.

Industrial I3: Data Engine Architectures - 1

Location: Royal 2

Chair: Wolfgang Lehner, TU Dresden

Hardware-Software Co-design for Data Management (Invited Talk)

Garret Swart (Oracle); Shasank Chavan (Oracle)

Abstract:Big is better: Large scale computing systems give access to huge amounts of data without the costs of moving data between systems. This allows for larger tables, bigger sorts, fatter graphs, and more cloud tenants sharing the same resource pool on SPARC systems that scale linearly in cost and performance from 8 to 512 cores, 64 to 4096 threads. Secure is better: Cache-line level memory access checking allows our instrumented memory allocators to manage memory at production speed while detecting bugs and reporting attacks in real time. Information Density is better: With hardware designed for scanning n-gram compressed, bit packed, dictionary and run-length encoded columnar data at full memory bandwidth, we make maximal use of every bit stored and every cache line transferred over the memory channels with no impact on performance. Fast is better: With hardware support for running database operators on specialized streaming processors, we drive the memory channels at maximum rate, freeing up power and cores for running user computations on the result of these operators. Connected is better: We integrate EDR InfiniBand on-chip and on-board with low-latency, high-throughput, one-sided networking. Portable is better: By supporting platform-independent acceleration APIs inside the database, we support a wide variety of acceleration techniques and give applications and query planners the information to make the best use of the available hardware. Integrated is better: By supporting and accelerating multiple storage types (in-memory, NVMe, Exadata, NFS, HDFS, Fibre Channel), data formats (row major, column major, graph, JSON, spatial, MIME, Hive), algorithms, query languages, network protocols, and hardware platforms in a single product, we optimize, share resources, increase usability, and reduce the cost and the cognitive load in acquiring, storing, securing, and understanding data.
In this talk, I will describe the experience that drives our microprocessor acceleration priorities, the constraints and joys of the hardware-software co-design process, the HW features that resulted, and how software engineers have exploited these features in ways we expected and ways we didn’t. The industry has accelerated linear algebra and computer graphics, and in this talk we’ll see how we apply the same approach to data processing, with changes to match the lower compute density of the problem space.
Nitro: A Fast, Scalable In-Memory Storage Engine for NoSQL Global Secondary Index

Sarath Lakshman (Couchbase Inc.); Sriram Melkote (Couchbase Inc.); John Liang (Couchbase Inc.); Ravi Mayuram (Couchbase Inc.)

Abstract:We present Nitro, a high-performance in-memory key-value storage engine used in Couchbase 4.5 Global Secondary Indexes. The Nitro storage engine is well suited for recent hardware trends such as large amounts of memory and many CPU cores. The storage engine leverages latch-free data structures and aims to achieve linear scalability for index read-write operations. The Nitro storage engine offers concurrent readers and writers, lightweight database snapshots, and stable scan, backup, and recovery operations. We integrated Nitro into the Couchbase Global Secondary Indexes (GSI) and observed a significant improvement in performance compared to our disk-oriented storage engine configured with the same amount of memory for the buffer cache. On a 32-core machine, we observed an end-to-end GSI server insertion throughput of 1,650,000 entries/sec and an index update throughput of 822,000 entries/sec. A single instance of the Nitro data structure running on a 40-core machine achieved a peak insertion throughput of 4 million index entries/sec and an entry lookup throughput of 10 million lookups/sec.
Aerospike: Architecture of a Real-Time Operational DBMS

V. Srinivasan (Aerospike Inc.); Brian Bulkowski (Aerospike Inc.); Wei-Ling Chu (Aerospike Inc.); Sunil Sayyaparaju (Aerospike Inc.); Andrew Gooding (Aerospike Inc.); Rajkumar Iyer (Aerospike Inc.); Ashish Shinde (Aerospike Inc.); Thomas Lopatic (Aerospike Inc.)

Abstract:In this paper, we describe the solutions developed to address key technical challenges encountered while building a distributed database system that can smoothly handle demanding real-time workloads and provide a high level of fault tolerance. Specifically, we describe schemes for efficient clustering and data partitioning to automatically scale out processing across multiple nodes, and for optimizing the usage of CPUs, DRAM, SSDs, and networks to efficiently scale up performance on a single node. The techniques described here were used to develop Aerospike (formerly Citrusleaf), a high-performance distributed database system built to handle the needs of today’s interactive online services. Most real-time decision systems that use Aerospike require very high scale and need to make decisions within a strict SLA by reading from, and writing to, a database containing billions of data items at a rate of millions of operations per second with sub-millisecond latency. For over five years, Aerospike has been continuously used in over a hundred successful production deployments, as many enterprises have discovered that it can substantially enhance their user experience.
Comdb2 - Bloomberg's Highly Available Relational Database System

Alex Scotti (Bloomberg LP); Mark Hannum (Bloomberg LP); Michael Ponomarenko (Bloomberg LP); Dorin Hogea (Bloomberg LP); Akshat Sikarwar (Bloomberg LP); Mohit Khullar (Bloomberg LP); Adi Zaimi (Bloomberg LP); James Leddy (Bloomberg LP); Fabio Angius (Bloomberg LP); Rivers Zhang (Bloomberg LP); Lingzhi Deng (Bloomberg LP)

Abstract:Comdb2 is a distributed database designed for geographical replication and high availability. In contrast with the latest trends in this field, Comdb2 offers full transactional support, a standard relational model, and the expressivity of SQL. Moreover, the system allows for rich stored procedures using a dialect of Lua. Comdb2 implements a serializable system in which reads from any node always return current values. Comdb2 provides transparent high availability through built-in service discovery and sophisticated retry logic embedded in the standard API. In addition to the relational data model, Comdb2 implements queues for publisher-to-subscriber message delivery. Queues can be combined with table triggers for time-consistent log distribution, two functionalities that are commonly needed in modern OLTP. In this paper we give an overview of our last twelve years of work. We focus on the design choices that have made Comdb2 the primary database solution within our company, Bloomberg LP (BLP).

Demo 3a: Graph and Semistructured Data

Location: Maple

GARUDA: A System for Large-Scale Mining of Statistically Significant Connected Subgraphs

Satyajit Bhadange (IIT Kanpur); Akhil Arora (Xerox Research Centre India); Arnab Bhattacharya (IIT Kanpur)

Generating Flexible Workloads for Graph Databases

Guillaume Bagan (CNRS LIRIS); Angela Bonifati (University of Lyon 1 & CNRS LIRIS); Radu Ciucanu (University of Oxford); George Fletcher (TU Eindhoven); Aurélien Lemay (University of Lille 3 & INRIA); Nicky Advokaat (TU Eindhoven)

Graph databases in the browser: using LevelGraph to explore New Delhi

Antonio Maccioni (Roma Tre University); Matteo Collina (NearForm)

Sapphire: Querying RDF Data Made Simple

Ahmed El-Roby (University of Waterloo); Khaled Ammar (University of Waterloo); Ashraf Aboulnaga (Qatar Computing Research Institute); Jimmy Lin (University of Waterloo)

December: A Declarative Tool for Crowd Member Selection

Yael Amsterdamer (Bar Ilan University); Tova Milo (Tel-Aviv University); Amit Somech (Tel-Aviv University); Brit Youngmann (Tel-Aviv University)

AutoG: A Visual Query Autocompletion Framework for Graph Databases

Peipei Yi (Hong Kong Baptist University); Byron Choi (Hong Kong Baptist University); Sourav Bhowmick (Nanyang Technological University); Jianliang Xu (Hong Kong Baptist University)

Exploratory Querying of Extended Knowledge Graphs

Mohamed Yahya (Max Planck Institute for Informatics); Klaus Berberich (Max Planck Institute for Informatics); Maya Ramanath (IIT Delhi); Gerhard Weikum (Max Planck Institute for Informatics)

SPARQLByE: Querying RDF data by example

Gonzalo Diaz (University of Oxford); Marcelo Arenas (PUC Chile); Michael Benedikt (University of Oxford)

Graph-based Exploration of Non-graph Datasets

Udayan Khurana (IBM Research – T.J. Watson); Srinivasan Parthasarathy (IBM Research – T.J. Watson); Deepak Turaga (IBM Research – T.J. Watson)

Rogas: A Declarative Framework for Network Analytics

Minjian Liu (The Australian National University); Qing Wang (The Australian National University)

Large-scale Complex Analytics on Semi-structured Datasets using AsterixDB and Spark

Wail Alkowaileet (CCEE at KACST and MIT); Sattam Alsubaiee (CCEE at KACST and MIT); Michael Carey (University of California at Irvine); Till Westmann (Couchbase Inc.); Yingyi Bu (Couchbase Inc.)

Research Posters 3

Location: Maple

Tuesday Sep 6th, 6:30 pm - 8:30 pm

VLDB Conference Reception & Quiz Program

Location: Pearl

Wednesday Sep 7th, 9:15 am - 10:30 am

Keynote 2

Location: Pearl

Chair: Jayant Haritsa, IISc Bangalore

Data-Driven Disruption: The View from Silicon Valley

Anand Rajaraman, Founding Partner, Rocketship.vc

Abstract:We live in an era where software is transforming industries, the sciences, and society as a whole. This exciting phenomenon has been described by the phrase “software is eating the world.” It is becoming increasingly apparent that data is the fuel powering software’s conquests. Data is the new disruptor. It’s hard to believe that the first decade of the Big Data era is already behind us. Silicon Valley has been at the forefront of developing and applying data-driven approaches to create disruption at many levels: infrastructure (e.g., Hadoop and Spark), capabilities (e.g., image recognition and machine translation), and killer apps (e.g., self-driving cars and messaging bots). In this talk, we first look back on the past decade and share learnings from the frontlines of data-driven disruption. Looking ahead, we then describe challenges and opportunities for the next decade. Since this has also been a personal journey, we will use examples drawn from personal experience to illustrate each point.

Bio:Anand is a Founding Partner of two Silicon Valley venture capital funds focused on early-stage technology companies: Milliways Ventures and RocketshipVC. He was the co-founder of two successful startups: Junglee (acquired by Amazon.com) and Kosmix (acquired by Walmart). At Walmart, he created and led WalMartLabs as its Senior Vice President. As an academic, Anand’s research has focused on the intersection of database systems, the World-Wide Web, and social media. His research publications have won several awards at prestigious academic conferences, including three retrospective 10-year Best Paper awards at ACM SIGMOD, VLDB, and ICDT. He is a co-author, with Jeff Ullman and Jure Leskovec, of the textbook “Mining of Massive Datasets”.

Wednesday Sep 7th, 11:15 am - 12:45 pm

Research R10: Query Optimization - 1

Location: Pearl 1

Chair: S Sudarshan, IIT Bombay

Faster Plan Generation through Consideration of Functional Dependencies and Keys

Marius Eich (University of Mannheim); Pit Fender (Oracle Labs); Guido Moerkotte (University of Mannheim)

Abstract:It has been a recognized fact for many years that query execution can benefit from pushing group-by operators down in the operator tree and applying them before a join. This so-called eager aggregation reduces the size(s) of the join argument(s), making join evaluation faster. Lately, the idea enjoyed a revival when it was applied to outer joins for the first time and incorporated in a state-of-the-art plan generator. However, this recent approach is highly dependent on the use of heuristics because of the exponential growth of the search space that goes along with eager aggregation. Finding an optimal solution for larger queries calls for effective optimality preserving pruning mechanisms to reduce the search space size as far as possible. By a more thorough investigation of functional dependencies and keys, we provide a set of new pruning criteria and evaluate their effectiveness with respect to the runtime and memory consumption of the resulting plan generator.
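The eager aggregation described above can be illustrated with a toy sketch (not from the paper; the tables and values are invented): pre-aggregating one join input before the join yields the same result as grouping after the join, while shrinking the join's input.

```python
# Toy illustration of eager aggregation: push the group-by below the join.
from collections import defaultdict

orders = [("c1", 10), ("c1", 20), ("c2", 5)]   # (cust_id, amount)
customers = [("c1", "IN"), ("c2", "US")]       # (cust_id, country)

# Lazy plan: join first, then group total amount by country.
lazy = defaultdict(int)
for cust, amount in orders:
    for cid, country in customers:
        if cid == cust:
            lazy[country] += amount

# Eager plan: aggregate orders per customer first, then join the
# (much smaller) aggregate result with customers.
per_cust = defaultdict(int)
for cust, amount in orders:
    per_cust[cust] += amount
eager = defaultdict(int)
for cid, country in customers:
    eager[country] += per_cust.get(cid, 0)

assert dict(lazy) == dict(eager) == {"IN": 30, "US": 5}
```

The equivalence holds here because the join key is a key of the customers side; the paper's contribution is pruning the search space that arises when a plan generator considers such push-downs exhaustively.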
Exploiting Soft and Hard Correlations in Big Data Query Optimization

Hai Liu (Worcester Polytechnic Institute); Dongqing Xiao (Worcester Polytechnic Institute); Pankaj Didwania (Worcester Polytechnic Institute); Mohamed Eltabakh (Worcester Polytechnic Institute)

How Good Are Query Optimizers, Really?

Viktor Leis (Technische Universität München); Andrey Gubichev (Technische Universität München); Atanas Mirchev (Technische Universität München); Peter Boncz (CWI); Alfons Kemper (Technische Universität München); Thomas Neumann (Technische Universität München)

Abstract:Finding a good join order is crucial for query performance. In this paper, we introduce the Join Order Benchmark (JOB) and experimentally revisit the main components in the classic query optimizer architecture using a complex, real-world data set and realistic multi-join queries. We investigate the quality of industrial-strength cardinality estimators and find that all estimators routinely produce large errors. We further show that while estimates are essential for finding a good join order, query performance is unsatisfactory if the query engine relies too heavily on these estimates. Using another set of experiments that measure the impact of the cost model, we find that it has much less influence on query performance than the cardinality estimates. Finally, we investigate plan enumeration techniques comparing exhaustive dynamic programming with heuristic algorithms and find that exhaustive enumeration improves performance despite the sub-optimal cardinality estimates.
Teaching an RDBMS about ontological constraints

Damian Bursztyn (INRIA & LIX); François Goasdoué (University of Rennes 1 and INRIA); Ioana Manolescu (INRIA & LIX)

Abstract:In the presence of an ontology, query answers must reflect not only data explicitly present in the database, but also implicit data, which holds due to the ontology, even though it is not present in the database. A large and useful set of ontologies enjoys FOL reducibility of query answering, that is: answering a query q can be reduced to evaluating a certain first-order logic (FOL) formula (obtained from the query and ontology) against only the explicit facts. We present a novel query optimization framework for ontology-based data access settings enjoying FOL reducibility. Our framework is based on searching within a set of alternative equivalent FOL queries, i.e., FOL reformulations, one with minimal evaluation cost when evaluated through a relational database system. We apply this framework to the DL-LiteR Description Logic underpinning the W3C’s OWL2 QL ontology language, and demonstrate through experiments its performance benefits when two leading SQL systems, one open-source and one commercial, are used for evaluating the FOL query reformulations.

Research R11: Spatial Data and Queries - 1

Location: Pearl 2

Chair: Amr El Abbadi, UC Santa Barbara

Processing and Optimizing Main Memory Spatial-Keyword Queries

Taesung Lee (Yonsei University); Jin-Woo Park (POSTECH); Sanghoon Lee (POSTECH); Seung-won Hwang (Yonsei University); Sameh Elnikety (Microsoft Research); Yuxiong He (Microsoft Research)

Abstract:Important cloud services rely on spatial-keyword queries, containing a spatial predicate and arbitrary boolean keyword queries. In particular, we study the processing of such queries in main memory to support short response times. In contrast, current state-of-the-art spatial-keyword indexes and relational engines are designed for different assumptions. Rather than building a new spatial-keyword index, we employ a cost-based optimizer to process these queries using a spatial index and a keyword index. We address several technical challenges to achieve this goal. We introduce three operators as the building blocks to construct plans for main memory query processing. We then develop a cost model for the operators and query plans. We introduce five optimization techniques that efficiently reduce the search space and produce a query plan with low cost. The optimization techniques are computationally efficient, and they identify a query plan with a formal approximation guarantee under the common independence assumption. Furthermore, we extend the framework to exploit interesting orders. We implement the query optimizer to empirically validate our proposed approach using real-life datasets. The evaluation shows that the optimizations provide significant reduction in the average and tail latency of query processing: 7- to 11-fold reduction over using a single index in terms of 99th percentile response time. In addition, this approach outperforms existing spatial-keyword indexes and DBMS query optimizers for both average and high-percentile response times.
Spatial Online Sampling and Aggregation

Lu Wang (Hong Kong University of Science and Technology); Robert Christensen (University of Utah); Feifei Li (University of Utah); Ke Yi (Hong Kong University of Science and Technology)

Abstract:The massive adoption of smart phones and other mobile devices has generated a humongous amount of spatial and spatio-temporal data. The importance of spatial analytics and aggregation is ever-increasing. An important challenge is to support interactive exploration over such data. However, spatial analytics and aggregation using all data points that satisfy a query condition is expensive, especially over large data sets, and cannot meet the needs of interactive exploration. To that end, we present novel indexing structures that support spatial online sampling and aggregation on large spatial and spatio-temporal data sets. In spatial online sampling, random samples from the set of spatial (or spatio-temporal) points that satisfy a query condition are generated incrementally in an online fashion. With more and more samples, various spatial analytics and aggregations can be performed in an online, interactive fashion, with estimators that have better accuracy over time. Our design works well for both memory-based and disk-resident data sets, and scales well towards different query and sample sizes. More importantly, our structures are dynamic, hence they are able to deal with insertions and deletions efficiently. Extensive experiments on large real data sets demonstrate the improvements achieved by our indexing structures compared to other baseline methods.
Online Minimum Matching in Real-Time Spatial Data: Experiments and Analysis [Experiments and Analyses]

Yongxin Tong (Beihang University); Jieying She (Hong Kong University of Science and Technology); Bolin Ding (Microsoft Research); Lei Chen (Hong Kong University of Science and Technology); Tianyu Wo (Beihang University); Ke Xu (Beihang University)

Abstract:Recently, with the development of the mobile Internet and smartphones, the online minimum bipartite matching in real-time spatial data (OMBM) problem has become popular. Specifically, given a set of service providers with specific locations and a set of users who dynamically appear one by one, the OMBM problem is to find a maximum-cardinality matching with minimum total distance, subject to the constraint that once a user appears, s/he must be immediately matched to an unmatched service provider, which cannot be revoked, before subsequent users arrive. To address this problem, existing studies mainly focus on analyzing the worst-case competitive ratios of the proposed online algorithms, but studies of the algorithms' performance in practice are absent. In this paper, we present a comprehensive experimental comparison of the representative algorithms for the OMBM problem. In particular, we observe a surprising result: the simple and efficient greedy algorithm, which has been considered the worst due to its exponential worst-case competitive ratio, is significantly more effective than the other algorithms. We investigate the results and further show that the competitive ratio of the greedy algorithm is actually just a constant, 3.195, under average-case analysis. We try to clarify a 24-year misunderstanding of the greedy algorithm and justify that the greedy algorithm is not bad at all. Finally, we provide a uniform implementation for all the algorithms for the OMBM problem and clarify their strengths and weaknesses, which can guide practitioners to select appropriate algorithms for various scenarios.
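The greedy heuristic the abstract rehabilitates is easy to sketch. The following is an illustrative toy (assumptions: one-dimensional locations and absolute-distance cost, both invented here; the paper studies general spatial data): each arriving user is irrevocably matched to the nearest still-unmatched provider.

```python
# Toy sketch of the greedy algorithm for online minimum bipartite matching:
# match each arriving user to the nearest unmatched service provider.
def greedy_ombm(providers, users):
    free = dict(enumerate(providers))      # provider index -> location
    matching, total = [], 0.0
    for u in users:                        # users arrive one by one
        j = min(free, key=lambda i: abs(free[i] - u))
        total += abs(free[j] - u)
        matching.append((u, j))
        del free[j]                        # a match cannot be revoked
    return matching, total

m, cost = greedy_ombm([0.0, 10.0], [9.0, 1.0])
# The first user (at 9.0) takes the provider at 10.0, so the second user
# (at 1.0) is left with the provider at 0.0: total distance 2.0.
assert cost == 2.0
```

Adversarial arrival orders like this one are what drive greedy's exponential worst-case competitive ratio; the paper's point is that such orders are rare in practice.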
AQWA: Adaptive Query-Workload-Aware Partitioning of Big Spatial Data

Ahmed M. Aly (Purdue University); Ahmed R. Mahmood (Purdue University); Mohamed S. Hassan (Purdue University); Walid G. Aref (Purdue University); Mourad Ouzzani (Qatar Computing Research Institute); Hazem Elmeleegy (Turn Inc.); Thamir Qadah (Purdue University)

Abstract:The unprecedented spread of location-aware devices has resulted in a plethora of location-based services in which huge amounts of spatial data need to be efficiently processed by large-scale computing clusters. Existing cluster-based systems for processing spatial data employ static data-partitioning structures that cannot adapt to data changes, and that are insensitive to the query workload. Hence, these systems are incapable of consistently providing good performance. To close this gap, we present AQWA, an adaptive and query-workload-aware mechanism for partitioning large-scale spatial data. AQWA does not assume prior knowledge of the data distribution or the query workload. Instead, as data is consumed and queries are processed, the data partitions are incrementally updated. With extensive experiments using real spatial data from Twitter, and various workloads of range and k-nearest-neighbor queries, we demonstrate that AQWA can achieve an order of magnitude enhancement in query performance compared to the state-of-the-art systems.

Research R12: Distributed and Cloud Systems - 1

Location: Royal 1

Chair: Krithi Ramamritham, IIT Bombay

High-Speed Query Processing over High-Speed Networks

Wolf Roediger (Technische Universität München); Tobias Muehlbauer (Technische Universität München); Alfons Kemper (Technische Universität München); Thomas Neumann (Technische Universität München)

Abstract:Modern database clusters entail two levels of networks: connecting CPUs and NUMA regions inside a single server in the small and multiple servers in the large. The huge performance gap between these two types of networks used to slow down distributed query processing to such an extent that a cluster of machines actually performed worse than a single many-core server. The increased main-memory capacity of the cluster remained the sole benefit of such a scale-out. The economic viability of high-speed interconnects such as InfiniBand has narrowed this performance gap considerably. However, InfiniBand’s higher network bandwidth alone does not improve query performance as expected when the distributed query engine is left unchanged. The scalability of distributed query processing is impaired by TCP overheads, switch contention due to uncoordinated communication, and load imbalances resulting from the inflexibility of the classic exchange operator model. This paper presents the blueprint for a distributed query engine that addresses these problems by considering both levels of networks holistically. It consists of two parts: First, hybrid parallelism that distinguishes local and distributed parallelism for better scalability in both the number of cores as well as servers. Second, a novel communication multiplexer tailored for analytical database workloads using remote direct memory access (RDMA) and low-latency network scheduling for high-speed communication with almost no CPU overhead. An extensive evaluation within the HyPer database system using the TPC-H benchmark shows that our holistic approach indeed enables high-speed query processing over high-speed networks.
The End of Slow Networks: It’s Time for a Redesign

Carsten Binnig (Brown University); Andrew Crotty (Brown University); Alex Galakatos (Brown University); Tim Kraska (Brown University); Erfan Zamanian (Brown University)

Abstract:The next generation of high-performance networks with remote direct memory access (RDMA) capabilities requires a fundamental rethinking of the design of distributed in-memory DBMSs. These systems are commonly built under the assumption that the network is the primary bottleneck and should be avoided at all costs, but this assumption no longer holds. For instance, with InfiniBand FDR 4x, the bandwidth available to transfer data across the network is in the same ballpark as the bandwidth of one memory channel. Moreover, RDMA transfer latencies continue to rapidly improve as well. In this paper, we first argue that traditional distributed DBMS architectures cannot take full advantage of high-performance networks and suggest a new architecture to address this problem. Then, we discuss initial results from a prototype implementation of our proposed architecture for OLTP and OLAP, showing remarkable performance improvements over existing designs.
Tempo: Robust and Self-Tuning Resource Management in Multi-tenant Parallel Databases

Zilong Tan (Duke University); Shivnath Babu (Duke University)

Abstract:Multi-tenant database systems have a component called the Resource Manager, or RM, that is responsible for allocating resources to tenants. RMs today do not provide direct support for performance objectives such as: “Average job response time of tenant A must be less than two minutes”, or “No more than 5% of tenant B’s jobs can miss the deadline of 1 hour.” Thus, DBAs have to tinker with the RM’s low-level configuration settings to meet such objectives. We propose a framework called Tempo that brings simplicity, self-tuning, and robustness to existing RMs. Tempo provides a simple interface for DBAs to specify performance objectives declaratively, and optimizes the RM configuration settings to meet these objectives. Tempo has a solid theoretical foundation which gives key robustness guarantees. We report experiments done on Tempo using production traces of data-processing workloads from companies such as Facebook and Cloudera. These experiments demonstrate significant improvements in meeting desired performance objectives over RM configuration settings specified by human experts.
Measuring and Optimizing Distributed Array Programs

Mingxing Zhang (Tsinghua University); Yongwei Wu (Tsinghua University); Kang Chen (Tsinghua University); Teng Ma (Tsinghua University); Weimin Zheng (Tsinghua University)

Abstract:Nowadays, there is a rising trend of building array-based distributed computing frameworks. By taking advantage of highly optimized array/matrix operations, these frameworks have achieved significant performance improvements for many important algorithms. However, due to the absence of a comprehensive optimizer, most of these frameworks execute each primitive in an isolated manner and in the exact order defined by programmers, which implies a huge space for optimization. In this paper, we propose a novel array-based programming model, named KASEN, which distinguishes itself from models in the existing literature by defining a strict computation and communication model. This model makes it easy to analyze programs’ behavior and measure their performance, with which we design a corresponding optimizer that can automatically apply many high-level optimizations to the original programs written by programmers. According to our evaluation, KASEN is sufficient for implementing many important machine learning and graph algorithms, and it can be extended very easily. More importantly, the optimizer of KASEN can achieve significant reductions in memory reads/writes, buffer allocations, and network traffic. The speedup achieved by our optimizer is up to 5.82X for real-world and synthetic datasets.

Industrial I4: Data Engine Architectures - 2

Location: Royal 2

Chair: Spyros Blanas, Ohio State Univ.

Kodiak: Leveraging Materialized Views For Very Low-Latency Analytics Over High-Dimensional Web-Scale Data

Shaosu Liu (Turn Inc.); Bin Song (Turn Inc.); Sriharsha Gangam (Turn Inc.); Lawrence Lo (Turn Inc.); Khaled Elmeleegy (Turn Inc.)

Abstract:Turn’s online advertising campaigns produce petabytes of data. This data is composed of trillions of events, e.g. impressions, clicks, etc., spanning multiple years. In addition to a timestamp, each event includes hundreds of fields describing the user’s attributes, campaign’s attributes, attributes of where the ad was served, etc. Advertisers need advanced analytics to monitor their running campaigns’ performance, as well as to optimize future campaigns. This involves slicing and dicing the data over tens of dimensions over arbitrary time ranges. Many of these queries need to power the web portal to provide reports and dashboards. For an interactive response time, they must have latencies of tens of milliseconds. At Turn’s scale of operations, no existing system was able to deliver this performance in a cost-effective manner. Kodiak, a distributed analytical data platform for web-scale high-dimensional data, was built to serve this need. It relies on precomputations to materialize thousands of views to serve these advanced queries. These views are partitioned and replicated across Kodiak’s storage nodes for scalability and reliability. They are system-maintained as new events arrive. At query time, the system auto-selects the most suitable view to serve each query. Kodiak has been used in production for over a year. It hosts 2490 views for over three petabytes of raw data, serving over 200K queries daily. It has median and 99th-percentile query latencies of 8 ms and 252 ms, respectively. Our experiments show that its query latency is 3 orders of magnitude faster than leading big data platforms in head-to-head comparisons using Turn’s query workload. Moreover, Kodiak uses 4 orders of magnitude less resources to run the same workload.
Accelerating Analytics with Dynamic In-Memory Expressions

Aurosish Mishra (Oracle America); Shasank Chavan (Oracle America); Allison Holloway (Oracle America); Tirthankar Lahiri (Oracle America); Zhen Hua Liu (Oracle America); Sunil Chakkappen (Oracle America); Dennis Lui (Oracle America); Vinita Subramanian (Oracle America); Ramesh Kumar (Oracle America); Maria Colgan (Oracle America); Jesse Kamp (Oracle America); Niloy Mukherjee (LinkedIn America); Vineet Marwah (Oracle America)

Abstract:Oracle Database In-Memory (DBIM) accelerates analytic workload performance by orders of magnitude through an in-memory columnar format utilizing techniques such as SIMD vector processing, in-memory storage indexes, and optimized predicate evaluation and aggregation. With Oracle Database 12.2, Database In-Memory is further enhanced to accelerate analytic processing through a novel lightweight mechanism known as Dynamic In-Memory Expressions (DIMEs). The DIME mechanism automatically detects frequently occurring expressions in a query workload, and then creates highly optimized, transactionally consistent, in-memory columnar representations of these expression results. At runtime, queries can directly access these DIMEs, thus avoiding costly expression evaluations. Furthermore, all the optimizations introduced in DBIM can apply directly to DIMEs. Since DIMEs are purely in-memory structures, no changes are required to the underlying tables. We show that DIMEs can reduce query elapsed times by several orders of magnitude without the need for costly pre-computed structures such as computed columns or materialized views or cubes.
TrafficDB: HERE's High Performance Shared-Memory Data Store

Ricardo Fernandes (HERE Global B.V.); Piotr Zaczkowski (HERE Global B.V.); Bernd Göttler (HERE Global B.V.); Conor Ettinoffe (HERE Global B.V.); Anis Moussa (HERE Global B.V.)

Abstract:HERE’s traffic-aware services enable route planning and traffic visualisation on web, mobile and connected car applications. These services process thousands of requests per second and require efficient ways to access the information needed to provide a timely response to end-users. The characteristics of road traffic information and these traffic-aware services require storage solutions with specific performance features. A route planning application utilising traffic congestion information to calculate the optimal route from an origin to a destination might hit a database with millions of queries per second. However, existing storage solutions are not prepared to handle such volumes of concurrent read operations, as well as to provide the desired vertical scalability. This paper presents TrafficDB, a shared-memory data store, designed to provide high rates of read operations, enabling applications to directly access the data from memory. Our evaluation demonstrates that TrafficDB handles millions of read operations and provides near-linear scalability on multi-core machines, where additional processes can be spawned to increase the systems’ throughput without a noticeable impact on the latency of querying the data store. The paper concludes with a description of how TrafficDB improved the performance of our traffic-aware services running in production.
The MemSQL Query Optimizer: A modern optimizer for real-time analytics in a distributed database

Jack Chen (MemSQL Inc.); Samir Jindel (MemSQL Inc.); Robert Walzer (MemSQL Inc.); Rajkumar Sen (MemSQL Inc.); Nika Jimsheleishvilli (MemSQL Inc.); Michael Andrews (MemSQL Inc.)

Abstract:Real-time analytics on massive datasets has become a very common need in many enterprises. These applications require not only rapid data ingest, but also quick answers to analytical queries operating on the latest data. MemSQL is a distributed SQL database designed to exploit a memory-optimized, scale-out architecture to enable real-time transactional and analytical workloads which are fast, highly concurrent, and extremely scalable. Many analytical queries in MemSQL’s customer workloads are complex queries involving joins, aggregations, subqueries, etc. over star and snowflake schemas, often ad-hoc or produced interactively by business intelligence tools. These queries often require latencies of seconds or less, and therefore require the optimizer to not only produce a high-quality distributed execution plan, but also produce it fast enough that optimization time does not become a bottleneck. In this paper, we describe the architecture of the MemSQL Query Optimizer and the design choices and innovations which enable it to quickly produce highly efficient execution plans for complex distributed queries. We discuss how query rewrite decisions oblivious of distribution cost can lead to poor distributed execution plans, and argue that to choose high-quality plans in a distributed database, the optimizer needs to be distribution-aware in choosing join plans, applying query rewrites, and costing plans. We discuss methods to make join enumeration faster and more effective, such as a rewrite-based approach to exploit bushy joins in queries involving multiple star schemas without sacrificing optimization time. We demonstrate the effectiveness of the MemSQL optimizer over queries from the TPC-H benchmark and a real customer workload.

Demo 3b: Graph and Semistructured Data

Location: Maple

(Same as in Demo 3a)

Research Posters 4

Location: Maple

Wednesday Sep 7th, 2:00 pm - 3:30 pm

Research R13: Query Execution - 2

Location: Pearl 1

Chair: Wolfgang Lehner, TU Dresden

A Seven-Dimensional Analysis of Hashing Methods and its Implications on Query Processing

Stefan Richter (Saarland University); Victor Alvarez (Saarland University); Jens Dittrich (Saarland University)

Abstract:Hashing is a solved problem. It allows us to get constant time access for lookups. Hashing is also simple. It is safe to use an arbitrary method as a black box and expect good performance, and optimizations to hashing can only improve it by a negligible delta. Why are all of the previous statements plain wrong? That is what this paper is about. In this paper we thoroughly study hashing for integer keys and carefully analyze the most common hashing methods in a five-dimensional requirements space: (1) data-distribution, (2) load factor, (3) dataset size, (4) read/write-ratio, and (5) un/successful-ratio. Each point in that design space may potentially suggest a different hashing scheme, and additionally also a different hash function. We show that a right or wrong decision in picking the right hashing scheme and hash function combination may lead to significant difference in performance. To substantiate this claim, we carefully analyze two additional dimensions: (6) five representative hashing schemes (which includes an improved variant of Robin Hood hashing), (7) four important classes of hash functions widely used today. That is, we consider 20 different combinations in total. Finally, we also provide a glimpse about the effect of table memory layout and the use of SIMD instructions. Our study clearly indicates that picking the right combination may have considerable impact on insert and lookup performance, as well as memory footprint. A major conclusion of our work is that hashing should be considered a white box before blindly using it in applications, such as query processing. Finally, we also provide a strong guideline about when to use which hashing method.
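One of the schemes the abstract names, Robin Hood hashing, can be sketched in a few lines. This toy (not the paper's improved variant; table size, keys, and the modulo hash are invented here) shows the classic insertion rule: on a collision, the entry that is further from its home slot keeps the slot, which evens out probe lengths.

```python
# Toy Robin Hood hashing insert: each slot stores (key, probe_distance);
# on a collision, the "poorer" entry (larger distance from home) wins.
def rh_insert(table, key, cap):
    idx, dist = key % cap, 0
    while True:
        slot = table[idx]
        if slot is None:
            table[idx] = (key, dist)
            return
        _, slot_dist = slot
        if slot_dist < dist:                  # incumbent is richer: swap
            table[idx], (key, dist) = (key, dist), slot
        idx, dist = (idx + 1) % cap, dist + 1

cap = 8
table = [None] * cap
for k in (0, 8, 16, 1):                       # 0, 8, 16 all hash to slot 0
    rh_insert(table, k, cap)
probe_lengths = {k: d for k, d in (s for s in table if s)}
assert max(probe_lengths.values()) == 2       # displacement stays small
```

Even with three keys colliding on one home slot, no entry ends up more than two probes from home; bounding the variance of probe lengths is what makes Robin Hood hashing attractive for lookup-heavy workloads.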
Query-Aware Locality-Sensitive Hashing for Approximate Nearest Neighbor Search

Qiang Huang (Sun Yat-sen University); Jianlin Feng (Sun Yat-sen University); Yikai Zhang (Sun Yat-sen University); Qiong Fang (South China University of Technology); Wilfred Ng (Hong Kong University of Science and Technology)

Abstract:Locality-Sensitive Hashing (LSH) and its variants are the well-known indexing schemes for the problem of c-Approximate Nearest Neighbor (c-ANN) search in high-dimensional Euclidean space. Traditionally, LSH functions are constructed in a query-oblivious manner in the sense that buckets are partitioned before any query arrives. However, objects closer to a query may be partitioned into different buckets, which is undesirable. Due to the use of query-oblivious bucket partition, the state-of-the-art LSH schemes for external memory, namely C2LSH and LSB-Forest, only work with approximation ratio of integer c >= 2. In this paper, we introduce a novel concept of query-aware bucket partition which uses a given query as the “anchor” for bucket partition. Accordingly, a query-aware LSH function is a random projection coupled with query-aware bucket partition, which removes the random shift required by traditional query-oblivious LSH functions. Notably, query-aware bucket partition can be easily implemented so that query performance is guaranteed. We propose a novel query-aware LSH scheme named QALSH for c-ANN search over external memory. Our theoretical studies show that QALSH enjoys guarantee on query quality. The use of query-aware LSH function enables QALSH to work with any approximation ratio c > 1. Extensive experiments on four real datasets show that QALSH outperforms C2LSH and LSB-Forest. Specifically, by using a ratio c < 2, QALSH can achieve much better query quality.
Neighbor-Sensitive Hashing

Yongjoo Park (University of Michigan); Michael Cafarella (University of Michigan); Barzan Mozafari (University of Michigan)

Abstract:Approximate kNN (k-nearest neighbor) techniques using binary hash functions are among the most commonly used approaches for overcoming the prohibitive cost of performing exact kNN queries. However, the success of these techniques largely depends on their hash functions’ ability to distinguish kNN items; that is, the kNN items retrieved based on data items’ hashcodes, should include as many true kNN items as possible. A widely-adopted principle for this process is to ensure that similar items are assigned to the same hashcode so that the items with the hashcodes similar to a query’s hashcode are likely to be true neighbors. In this work, we abandon this heavily-utilized principle and pursue the opposite direction for generating more effective hash functions for kNN tasks. That is, we aim to increase the distance between similar items in the hashcode space, instead of reducing it. Our contribution begins by providing theoretical analysis on why this revolutionary and seemingly counter-intuitive approach leads to a more accurate identification of kNN items. Our analysis is followed by a proposal for a hashing algorithm that embeds this novel principle. Our empirical studies confirm that a hashing algorithm based on this counter-intuitive idea significantly improves the efficiency and accuracy of state-of-the-art techniques.
LSH Ensemble: Internet Scale Domain Search

Erkang Zhu (University of Toronto); Fatemeh Nargesian (University of Toronto); Ken Pu (UOIT); Renée Miller (University of Toronto)

Abstract:We study the problem of domain search, where a domain is a set of distinct values from an unspecified universe. We use the Jaccard set containment score, defined as |Q ∩ X|/|Q|, as the measure of relevance of a domain X to a query domain Q. Our choice of Jaccard set containment over Jaccard similarity as a measure of relevance makes our work particularly suitable for searching Open Data and data on the web, as Jaccard similarity is known to perform poorly over sets with large differences in their sizes. We demonstrate that the domains found in several real-life Open Data and web data repositories show a power-law distribution over their domain sizes. We present a new index structure, Locality Sensitive Hashing (LSH) Ensemble, that solves the domain search problem using set containment at Internet scale. Our index structure and search algorithm cope with the data volume and skew by means of data sketches using Minwise Hashing and domain partitioning. Our index structure does not assume a prescribed set of data values. We construct a cost model that describes the accuracy of LSH Ensemble with any given partitioning. This allows us to formulate the data partitioning for LSH Ensemble as an optimization problem. We prove that there exists an optimal partitioning for any data distribution. Furthermore, for datasets following a power-law distribution, as observed in Open Data and Web data corpora, we show that the optimal partitioning can be approximated using equi-depth partitioning, making it particularly efficient to use in practice. We evaluate our algorithm using real data (Canadian Open Data and WDC Web Tables) containing over 262 million domains. The experiments demonstrate that our index consistently outperforms other leading alternatives in accuracy and performance. The improvements are most dramatic for data with large skew in the domain sizes. Even at 262 million domains, our index sustains query performance with under 3 seconds of response time.
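The containment score and the Minwise Hashing sketches mentioned in the abstract can be illustrated with a generic bottom-k MinHash sketch; this toy (function names are ours) is not the partitioned LSH Ensemble index itself:

```python
import hashlib

def containment(query, domain):
    """Jaccard set containment |Q ∩ X| / |Q| of a domain X w.r.t. a query Q."""
    return len(query & domain) / len(query)

def minhash(values, k=64):
    """Bottom-k MinHash sketch: the k smallest hash values of the set."""
    hashes = sorted(int(hashlib.md5(str(v).encode()).hexdigest(), 16)
                    for v in set(values))
    return hashes[:k]

def jaccard_estimate(sketch_a, sketch_b, k=64):
    """Estimate Jaccard similarity from two bottom-k sketches:
    the fraction of the k smallest union hashes present in both."""
    union_k = set(sorted(set(sketch_a) | set(sketch_b))[:k])
    return len(union_k & set(sketch_a) & set(sketch_b)) / len(union_k)
```

A containment score can then be recovered from an estimated Jaccard similarity and the (sketched) set sizes, which is roughly how a containment-aware LSH index reuses similarity-based machinery.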


Location: Pearl 2

Will AI Eat Us All?

Ihab F. Ilyas (Professor, Cheriton School of Computer Science, University of Waterloo), H. V. Jagadish (University of Michigan), Sihem Amer-Yahia (CNRS), Aditya Parameswaran (University of Illinois (UIUC)), Sunita Sarawagi (IIT Bombay)

Research R14: Graph Processing - 2

Location: Royal 1

Chair: Kyuseok Shim, Seoul National Univ.

G-SQL: Fast Query Processing via Graph Exploration

Hongbin Ma (eBay); Bin Shao (Microsoft Research Asia); Yanghua Xiao (Fudan University); Liang Jeff Chen (Microsoft Research Asia); Haixun Wang (Facebook)

Abstract:A lot of real-life data is graph-structured. However, it is only recently that businesses have begun to exploit data’s connectedness for business insights. RDBMSs, on the other hand, are a mature technology for data management, but they are not designed for graph processing. Take graph traversal, a common graph operation, for example: it heavily relies on a graph primitive that accesses a given node’s neighborhood. If an RDBMS is used to manage graph data, we need to join tables following foreign keys to access the nodes in the neighborhood. Graph exploration is a fundamental building block of many graph algorithms, but this simple operation is costly due to the large volume of I/O caused by the massive number of table joins. In this paper, we present G-SQL, our first effort towards the integration of an RDBMS and a native graph processing engine (Trinity). G-SQL leverages the fast graph exploration provided by Trinity’s memory cloud to answer multi-way join queries, and it uses RDBMSs to provide mature data management functionalities, including reliable data storage, additional data access methods, etc. Specifically, G-SQL is a SQL dialect augmented with graph exploration functionalities, and it dispatches query tasks to its underlying RDBMS and the Trinity graph engine. The G-SQL runtime coordinates the two query processors and uses a unified cost model to ensure that the entire query is processed efficiently. Experimental results show that our approach greatly expands the capabilities of RDBMSs and delivers exceptional performance for SQL-graph hybrid queries.
Fully Dynamic Betweenness Centrality Maintenance on Massive Networks

Takanori Hayashi (The University of Tokyo); Takuya Akiba (National Institute of Informatics); Yuichi Yoshida (National Institute of Informatics)

Abstract:Measuring the relative importance of each vertex in a network is one of the most fundamental building blocks in network analysis. Among several importance measures, betweenness centrality in particular plays key roles in many real applications. Considerable effort has been made to develop algorithms for static settings. However, real networks today are highly dynamic and evolving rapidly, and scalable dynamic methods that can instantly reflect graph changes in centrality values are required. In this paper, we present the first fully dynamic method for managing the betweenness centrality of all vertices in a large dynamic network. Its main data structure is a weighted hyperedge representation of shortest paths called the hypergraph sketch. We carefully design a dynamic update procedure with a theoretical accuracy guarantee. To accelerate updates, we further propose two auxiliary data structures called the two-ball index and the special-purpose reachability index. Experimental results using real networks demonstrate its high scalability and efficiency. In particular, it can reflect a graph change in less than a millisecond on average for a large-scale web graph with 106M vertices and 3.7B edges, which is several orders of magnitude larger than the limits of previous dynamic methods.
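For context, the static quantity that this dynamic method maintains can be computed exactly with Brandes' classic algorithm. A minimal single-threaded sketch (the static baseline, not the paper's hypergraph-sketch approach):

```python
from collections import deque

def betweenness(adj):
    """Exact betweenness centrality of every vertex in an unweighted graph
    (Brandes' algorithm). adj maps each vertex to its neighbor list. Each
    ordered pair of endpoints is counted, so halve the values for the
    usual undirected convention."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        # Phase 1: BFS from s, counting shortest paths (sigma)
        sigma = dict.fromkeys(adj, 0); sigma[s] = 1
        dist = dict.fromkeys(adj, -1); dist[s] = 0
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft(); order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1; queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]; preds[w].append(v)
        # Phase 2: accumulate pair dependencies in reverse BFS order
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

This costs O(nm) time per full recomputation, which is exactly why scalable dynamic maintenance, as in the paper, matters for evolving graphs.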
Parallel Local Graph Clustering

Julian Shun (University of California at Berkeley); Farbod Roosta-Khorasani (University of California at Berkeley); Kimon Fountoulakis (University of California at Berkeley); Michael Mahoney (University of California at Berkeley)

Abstract:Graph clustering has many important applications in computing, but due to the growing sizes of graphs, even traditionally fast clustering methods such as spectral partitioning can be computationally expensive for real-world graphs of interest. Motivated partly by this, so-called local algorithms for graph clustering have received significant interest because they can find good clusters in a graph with work proportional to the size of the cluster rather than that of the entire graph. This feature has proven crucial in making such graph clustering, and many of its downstream applications, efficient in practice. While local clustering algorithms are already faster than traditional algorithms that touch the entire graph, there is an opportunity to make them even more efficient via parallelization. Existing local clustering algorithms are sequential. In this paper, we show how to parallelize many of these algorithms in the shared-memory multicore setting, and we analyze their parallel complexity. We present comprehensive experiments on large-scale graphs showing that the parallel algorithms achieve good parallel speedups on a modern multicore machine, thus significantly speeding up the analysis of local graph clusters in the very large-scale setting.
KCore Decomposition for Large Networks on a Single PC

Wissam Khaouid (University of Victoria); Marina Barsky (University of Victoria); Venkatesh Srinivasan (University of Victoria); Alex Thomo (University of Victoria)

Abstract:Studying the topology of a network is critical to inferring underlying dynamics such as tolerance to failure, group behavior, and spreading patterns. k-core decomposition is a well-established metric which partitions a graph into layers from external to more central vertices. In this paper we explore whether the k-core decomposition of large networks can be computed using a consumer-grade PC. We feature implementations of the “vertex-centric” distributed protocol introduced by Montresor, De Pellegrini and Miorandi on GraphChi and WebGraph. We also present an accurate implementation of the Batagelj and Zaversnik algorithm for k-core decomposition in WebGraph. With our implementations, we show that we can efficiently handle networks of billions of edges using a single consumer-level machine within reasonable time, and can produce excellent approximations in only a fraction of the execution time. To the best of our knowledge, our biggest graphs are considerably larger than the graphs considered in the literature. Next, we present an optimized implementation of an external-memory algorithm (EMcore) by Cheng, Ke, Chu, and Ozsu. We show that this algorithm also performs well for large datasets; however, it cannot predict whether a given memory budget is sufficient for a new dataset. We present a thorough analysis of all algorithms, concluding that it is viable to compute the k-core decomposition of large networks on a consumer-grade PC.
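The Batagelj and Zaversnik peeling algorithm referenced in the abstract admits a compact in-memory sketch; the paper's contribution is making it work at billion-edge scale, which this toy version (using a heap with lazy deletion rather than their bucket array) does not attempt:

```python
import heapq

def core_decomposition(adj):
    """Peeling: repeatedly remove a minimum-degree vertex; the largest
    degree seen at removal time so far is that vertex's core number."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    core, removed, k = {}, set(), 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue                      # stale heap entry; skip it
        k = max(k, d)                     # core number is monotone during peeling
        core[v] = k
        removed.add(v)
        for w in adj[v]:                  # peeling v lowers its neighbors' degrees
            if w not in removed:
                deg[w] -= 1
                heapq.heappush(heap, (deg[w], w))
    return core
```

On a triangle with a pendant vertex, the pendant lands in the 1-core and the triangle vertices in the 2-core, as expected.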

Research R15: Data Cleaning - 2

Location: Pearl 2

Chair: Panos Chrysanthis, Univ. of Pittsburgh

Messing Up with BART: Error Generation for Evaluating Data-Cleaning Algorithms

Patricia Arocena (University of Toronto); Boris Glavic (Illinois Institute of Technology); Giansalvatore Mecca (Università della Basilicata); Renée Miller (University of Toronto); Paolo Papotti (Qatar Computing Research Institute); Donatello Santoro (Università della Basilicata)

Abstract:We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.
Error Detection: Where are we and what needs to be done? [Experiments and Analyses]

Ziawasch Abedjan (MIT); Xu Chu (University of Waterloo); Dong Deng (Tsinghua University); Raul Castro Fernandez (MIT); Ihab Ilyas (University of Waterloo); Mourad Ouzzani (Qatar Computing Research Institute); Paolo Papotti (Arizona State University); Michael Stonebraker (MIT); Nan Tang (Qatar Computing Research Institute)

Abstract:Data cleaning has played a critical role in ensuring data quality for enterprise applications. Naturally, there has been extensive research in this area, and many data cleaning algorithms have been translated into tools to detect and to possibly repair certain classes of errors such as outliers, duplicates, missing values, and violations of integrity constraints. Since different types of errors may coexist in the same dataset, we often need to run more than one kind of tool. In this paper, we investigate two pragmatic questions: (1) are these tools robust enough to capture most errors in real-world data sets? and (2) what is the best strategy to holistically run multiple tools to optimize the detection effort? To answer these two questions, we obtained multiple data cleaning tools that utilize a variety of error detection techniques. We also collected five real-world data sets, for which we could obtain both the raw data and the ground truth on existing errors. In this paper, we report our experimental findings on the errors detected by the tools we tested. First, we show that the coverage of each tool is well below 100%. Second, we show that the order in which multiple tools are run makes a big difference. Hence, we propose a holistic multi-tool strategy that orders the invocations of the available tools to maximize their benefit, while minimizing human effort in verifying results. Third, since this holistic approach still does not lead to acceptable error coverage, we discuss two simple strategies that have the potential to improve the situation, namely domain specific tools and data enrichment. We close this paper by reasoning about the errors that are not detectable by any of the tools we tested.
ActiveClean: Interactive Data Cleaning For Statistical Modeling

Sanjay Krishnan (University of California at Berkeley); Jiannan Wang (Simon Fraser University); Eugene Wu (Columbia University); Michael Franklin (University of California at Berkeley); Ken Goldberg (University of California at Berkeley)

Abstract:Analysts often clean dirty data iteratively: cleaning some data, executing the analysis, and then cleaning more data based on the results. We explore the iterative cleaning process in the context of statistical model training, which is an increasingly popular form of data analytics. The challenge is that training high-dimensional statistical models on partially clean data lacks guarantees, and the iterative cleaning process may actually diverge. We propose ActiveClean, which allows for progressive and iterative cleaning in statistical modeling problems while preserving convergence guarantees. ActiveClean supports a popular class of models called convex loss models (e.g., linear regression and SVMs), and prioritizes cleaning those records likely to affect the results. We evaluate ActiveClean on four real-world datasets (UCI Adult, UCI EEG, MNIST, and Dollars for Docs) with both real and synthetic errors. Our results suggest that our proposed optimizations can improve model accuracy by up to 2.5x for the same amount of data cleaned. Furthermore, for a fixed cleaning budget and on all real dirty datasets, ActiveClean returns more accurate models than uniform sampling and Active Learning.
META: An Efficient Matching-Based Method for Error-Tolerant Autocompletion

Dong Deng (Tsinghua University); Guoliang Li (Tsinghua University); He Wen (Tsinghua University); H. V. Jagadish (University of Michigan); Jianhua Feng (Tsinghua University)

Abstract:Autocompletion has been widely adopted in many computing systems because it can instantly provide users with results as users type in queries. Since the typing task is tedious and prone to error, especially on mobile devices, a recent trend is to tolerate errors in autocompletion. Existing error-tolerant autocompletion methods build a trie to index the data, utilize the trie index to compute the trie nodes that are similar to the query, called active nodes, and identify the leaf descendants of active nodes as the results. However, these methods have two limitations. First, they involve many redundant computations to identify the active nodes. Second, they do not support top-k queries. To address these problems, we propose a matching-based framework, which computes the answers based on matching characters between queries and data. We design a compact tree index to maintain active nodes in order to avoid the redundant computations. We devise an incremental method to efficiently answer top-k queries. Experimental results on real datasets show that our method outperforms state-of-the-art approaches by 1-2 orders of magnitude.
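For readers unfamiliar with the semantics: error-tolerant autocompletion returns the strings that have some prefix within edit distance tau of the query. A naive per-string dynamic program (not META's matching-based index, which avoids this per-string work) makes the definition concrete:

```python
def fuzzy_complete(query, strings, tau=1):
    """Return every string with some prefix within edit distance tau
    of the query. Runs one edit-distance DP per string; real systems
    index a trie so shared prefixes are computed once."""
    results = []
    for s in strings:
        # prev[i] = edit distance between query[:i] and the current prefix of s
        prev = list(range(len(query) + 1))
        best = prev[-1]                     # distance to the empty prefix
        for ch in s:
            cur = [prev[0] + 1]
            for i, qc in enumerate(query, 1):
                cur.append(min(prev[i] + 1,                 # ch unmatched
                               cur[i - 1] + 1,              # qc unmatched
                               prev[i - 1] + (qc != ch)))   # match / substitute
            prev = cur
            best = min(best, cur[-1])       # best over all prefixes of s
        if best <= tau:
            results.append(s)
    return results
```

For example, the misspelled query "tst" still completes to "test" with tau = 1, while "toast" is correctly rejected.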

Demo 1b: Data Engines and Analytics

Location: Maple

Same as Demo 1a

Research Posters 5

Location: Maple

Decorating the Cloud: Enabling Annotation Management in MapReduce Infrastructure

Yue Lu (Worcester Polytechnic Institute); Yuguan Li (Worcester Polytechnic Institute); Mohamed Eltabakh (Worcester Polytechnic Institute)

S3-TM: scalable streaming short text matching

Fuat Basik (Bilkent University); Bugra Gedik (Bilkent University); Hakan Ferhatosmanoglu (Bilkent University); Mert Emin Kalender (Bilkent University)

Elite: an Elastic Infrastructure for Big Spatiotemporal Trajectories

Xike Xie (Aalborg University); Benjin Mei (Renmin University of China); Jinchuan Chen (Renmin University of China); Xiaoyong Du (Renmin University of China); Christian Jensen (Aalborg University)

Accelerating SPARQL Queries by Exploiting Hash-based Locality and Adaptive Partitioning

Razen Harbi (King Abdullah University of Science and Technology); Ibrahim Abdelaziz (King Abdullah University of Science and Technology); Panos Kalnis (King Abdullah University of Science and Technology); Nikos Mamoulis (University of Ioannina); Yasser Ebrahim (King Abdullah University of Science and Technology); Majed Sahli (King Abdullah University of Science and Technology)

Know Your Customer: Computing k-Most Promising Products for Targeted Marketing

Md. Saiful Islam (Swinburne University of Technology); Chengfei Liu (Swinburne University of Technology)

Bitwise Dimensional Co-Clustering for Analytical Workloads

Stephan Baumann (Technische Universität Ilmenau); Peter Boncz (CWI); Kai-Uwe Sattler (Technische Universität Ilmenau)

Wednesday Sep 7th, 4:00 pm - 5:30 pm

Research R16: Data and Query Models - 1

Location: Pearl 1

Chair: Yanlei Diao, Ecole Polytechnique

SlimShot: In-Database Probabilistic Inference for Knowledge Bases

Eric Gribkoff (University of Washington); Dan Suciu (University of Washington)

Abstract:Increasingly large Knowledge Bases are being created, by crawling the Web or other corpora of documents, and by extracting facts and relations using machine learning techniques. To manage the uncertainty in the data, these KBs rely on probabilistic engines based on Markov Logic Networks (MLN), for which probabilistic inference remains a major challenge. Today’s state-of-the-art systems use variants of MCMC, which have no theoretical error guarantees, and, as we show, suffer from poor performance in practice. In this paper we describe SlimShot (Scalable Lifted Inference and Monte Carlo Sampling Hybrid Optimization Technique), a probabilistic inference engine for knowledge bases. SlimShot converts the MLN to a tuple-independent probabilistic database, then uses a simple Monte Carlo-based inference, with three key enhancements: (1) it combines sampling with safe query evaluation, (2) it estimates a conditional probability by jointly computing the numerator and denominator, and (3) it adjusts the proposal distribution based on the sample cardinality. In combination, these three techniques allow us to give formal error guarantees, and we demonstrate empirically that SlimShot outperforms today’s state-of-the-art probabilistic inference engines used in knowledge bases.
Scalable Package Queries in Relational Database Systems

Matteo Brucato (University of Massachusetts Amherst); Juan Beltran (New York University Abu Dhabi); Azza Abouzied (New York University Abu Dhabi); Alexandra Meliou (University of Massachusetts Amherst)

Abstract:Traditional database queries follow a simple model: they define constraints that each tuple in the result must satisfy. This model is computationally efficient, as the database system can evaluate the query conditions on each tuple individually. However, many practical, real-world problems require a collection of result tuples to satisfy constraints collectively, rather than individually. In this paper, we present package queries, a new query model that extends traditional database queries to handle complex constraints and preferences over answer sets. We develop a full-fledged package query system, implemented on top of a traditional database engine. Our work makes several contributions. First, we design PaQL, a SQL-based query language that supports the declarative specification of package queries. We prove that PaQL is at least as expressive as integer linear programming, and therefore evaluation of package queries is in general NP-hard. Second, we present a fundamental evaluation strategy that combines the capabilities of databases and constraint optimization solvers to derive solutions to package queries. The core of our approach is a set of translation rules that transform a package query into an integer linear program. Third, we introduce an offline data partitioning strategy allowing query evaluation to scale to large data sizes. Fourth, we introduce SketchRefine, a scalable algorithm for package evaluation with strong approximation guarantees ((1 +/- e)^6-factor approximation). Finally, we present extensive experiments over real-world and benchmark data. The results demonstrate that SketchRefine is effective at deriving high-quality package results, and achieves runtime performance that is an order of magnitude faster than directly using ILP solvers over large datasets.
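To make the "collective constraint" idea concrete, here is a toy brute-force package evaluator: pick a fixed number of tuples maximizing total value under a shared cost budget. This is the kind of combinatorial search that PaQL instead translates to an ILP; the function and field names are illustrative, not PaQL's:

```python
from itertools import combinations

def best_package(tuples, budget, size):
    """Exhaustively pick `size` tuples maximizing total value subject to
    a *collective* cost budget; no per-tuple predicate could express this.
    Exponential in `size`, hence the need for ILP solvers / SketchRefine."""
    best, best_val = None, float("-inf")
    for combo in combinations(tuples, size):
        cost = sum(t["cost"] for t in combo)
        value = sum(t["value"] for t in combo)
        if cost <= budget and value > best_val:
            best, best_val = combo, value
    return best
```

Even this three-line objective (a cardinality constraint plus a SUM bound) already encodes a knapsack-style problem, which is why package evaluation is NP-hard in general.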
Explaining Query Answers with Explanation-Ready Databases

Sudeepa Roy (Duke University); Laurel Orr (University of Washington); Dan Suciu (University of Washington)

Abstract:With the increased generation and availability of big data in different domains, there is an imminent requirement for data analysis tools that are able to 'explain' the trends and anomalies obtained from this data to a range of users with different backgrounds. Wu-Madden (PVLDB 2013) and Roy-Suciu (SIGMOD 2014) recently proposed solutions that can explain interesting or unexpected answers to simple aggregate queries in terms of predicates on attributes. In this paper, we propose a generic framework that can support much richer, insightful explanations by preparing the database offline, so that top explanations can be found interactively at query time. The main idea in such 'explanation-ready databases' is to pre-compute the effects of potential explanations (called 'interventions'), and to efficiently re-evaluate the original query taking into account these effects. We formalize this notion and define an 'explanation-query' that can evaluate all possible explanations simultaneously without having to run an iterative process, develop algorithms and optimizations, and evaluate our approach with experiments on real data.
A framework for annotating CSV-like data

Marcelo Arenas (PUC Chile); Francisco Maturana (PUC Chile); Cristian Riveros (PUC Chile); Domagoj Vrgoc (PUC Chile)

Abstract:In this paper, we propose a simple and expressive framework for adding metadata to CSV documents and their noisy variants. The framework is based on annotating parts of the document that can be later used to read, query, or exchange the data. The core of our framework is a language based on extended regular expressions that are used for selecting data. These expressions are then combined using a set of rules in order to annotate the data. We study the computational complexity of implementing our framework and present an efficient evaluation algorithm that runs in time proportional to its output and linear in its input. As a proof of concept, we test an implementation of our framework against a large number of real world datasets and show that it can be efficiently used in practice.

Research R17: Scalable Analytics - 1

Location: Pearl 2

Chair: Matthias Boehm, IBM Research

A General-Purpose Query-Centric Framework for Querying Big Graphs [Innovative Systems and Applications]

Da Yan (The Chinese University of Hong Kong); James Cheng (The Chinese University of Hong Kong); Tamer Özsu (University of Waterloo); Fan Yang (The Chinese University of Hong Kong); Yi Lu (The Chinese University of Hong Kong); John C.S. Lui (The Chinese University of Hong Kong); Qizhen Zhang (The Chinese University of Hong Kong); Wilfred Ng (The Hong Kong University of Science and Technology)

Abstract:Pioneered by Google’s Pregel, many distributed systems have been developed for large-scale graph analytics. These systems expose the user-friendly “think like a vertex” programming interface to users, and exhibit good horizontal scalability. However, these systems are designed for tasks where the majority of graph vertices participate in computation, but are not suitable for processing light-workload graph queries where only a small fraction of vertices need to be accessed. The programming paradigm adopted by these systems can seriously under-utilize the resources in a cluster for graph query processing. In this work, we develop a new open-source system, called Quegel, for querying big graphs, which treats queries as first-class citizens in the design of its computing model. Users only need to specify the Pregel-like algorithm for a generic query, and Quegel processes light-workload graph queries on demand using a novel superstep-sharing execution model to effectively utilize the cluster resources. Quegel further provides a convenient interface for constructing graph indexes, which significantly improve query performance but are not supported by existing graph-parallel systems. Our experiments verified that Quegel is highly efficient in answering various types of graph queries and is up to orders of magnitude faster than existing systems.
Distance-based Outlier Detection in Data Streams [Experiments and Analyses]

Luan Tran (University of Southern California); Liyue Fan (University of Southern California); Cyrus Shahabi (University of Southern California)

Abstract:Continuous outlier detection in data streams has important applications in fraud detection, network security, and public health. The arrival and departure of data objects in a streaming manner impose new challenges for outlier detection algorithms, especially in time and space efficiency. In the past decade, several studies have addressed the problem of distance-based outlier detection in data streams (DODDS), which adopts an unsupervised definition and makes no distributional assumption on data values. Our work is motivated by the lack of comparative evaluation among the state-of-the-art algorithms using the same dataset selections and on the same platform. We systematically evaluate the five most recent exact algorithms for DODDS under various stream settings and outlier rates. Our extensive results show that in most settings, the MCOD algorithm provides superior performance among all algorithms, including the most recent algorithm, Thresh_LEAP.
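The DODDS outlier definition itself is simple: a point is an outlier if it has fewer than k neighbors within distance R in the current window. A quadratic-time recomputation over the window, which algorithms such as MCOD maintain incrementally instead of recomputing on every slide, can be sketched as (names and the 1-D distance are our simplifications):

```python
def window_outliers(window, R, k):
    """Distance-based outliers in the current window: a point is an
    outlier if fewer than k other points lie within distance R of it.
    Exact but O(n^2) per window; streaming DODDS algorithms update
    neighbor counts incrementally as points arrive and expire."""
    outliers = []
    for i, p in enumerate(window):
        neighbors = sum(1 for j, q in enumerate(window)
                        if i != j and abs(p - q) <= R)
        if neighbors < k:
            outliers.append(p)
    return outliers
```

The streaming difficulty is visible even here: when a point expires from the window, the neighbor counts of every surviving point may change, which is what the surveyed algorithms handle cleverly.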
SEEDB: Efficient Data-Driven Visualization Recommendations to Support Visual Analytics

Manasi Vartak (MIT); Sajjadur Rahman (University of Illinois); Samuel Madden (MIT); Aditya Parameswaran (University of Illinois at Urbana-Champaign); Neoklis Polyzotis (Google)

Abstract:Data analysts often build visualizations as the first step in their analytical workflow. However, when working with high-dimensional datasets, identifying visualizations that show relevant or desired trends in data can be laborious. We propose SEEDB, a visualization recommendation engine to facilitate fast visual analysis: given a subset of data to be studied, SEEDB intelligently explores the space of visualizations, evaluates promising visualizations for trends, and recommends those it deems most “useful” or “interesting”. The two major obstacles in recommending interesting visualizations are (a) scale: evaluating a large number of candidate visualizations while responding within interactive time scales, and (b) utility: identifying an appropriate metric for assessing the interestingness of visualizations. For the former, SEEDB introduces pruning optimizations to quickly identify high-utility visualizations and sharing optimizations to maximize sharing of computation across visualizations. For the latter, as a first step, we adopt a deviation-based metric for visualization utility, while indicating how we may be able to generalize it to other factors influencing utility. We implement SEEDB as a middleware layer that can run on top of any DBMS. Our experiments show that our framework can identify interesting visualizations with high accuracy. Our optimizations lead to multiple orders of magnitude speedup on relational row and column stores and provide recommendations at interactive time scales. Finally, we demonstrate via a user study the effectiveness of our deviation-based utility metric and the value of recommendations in supporting visual analytics.
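The deviation-based utility idea can be illustrated with a small sketch: normalize the aggregate vectors of the target view and the reference view into distributions and measure their distance. The L2 distance and the function name below are our choices for illustration, not necessarily the exact metric SEEDB uses:

```python
def deviation_utility(target_agg, reference_agg):
    """Deviation-based utility: a visualization is 'interesting' if the
    aggregate distribution over the studied subset (target) deviates
    strongly from the same aggregate over the whole dataset (reference)."""
    def normalize(v):
        total = sum(v)
        return [x / total for x in v]
    p, q = normalize(target_agg), normalize(reference_agg)
    # L2 distance between the two normalized aggregate vectors
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
```

A candidate visualization whose grouped aggregates mirror the whole dataset scores 0 and is pruned early; large deviations bubble up as recommendations.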
Explaining Outputs in Modern Data Analytics

Zaheer Chothia (ETH Zurich); John Liagouris (ETH Zurich); Frank McSherry (ETH Zurich); Timothy Roscoe (ETH Zurich)

Abstract:We report on the design and implementation of a general framework for interactively explaining the outputs of modern data-parallel computations, including iterative data analytics. To produce explanations, existing works adopt a naive backward tracing approach which runs into known issues; naive backward tracing may identify (i) too much information to process easily, and (ii) not enough information to reproduce the output, which hinders the logical debugging of the program. The contribution of this work is twofold. First, we provide methods to effectively reduce the size of explanations based on the first occurrence of a record in an iterative computation. Second, we provide a general method for identifying explanations that are sufficient to reproduce the target output in arbitrary computations, a problem for which no viable solution existed until now. We implement our approach on differential dataflow, a modern high-throughput, low-latency dataflow platform. We add a small (but extensible) set of rules to explain each of its data-parallel operators, and we implement these rules as differential dataflow operators themselves. This choice allows our implementation to inherit the performance characteristics of differential dataflow, and results in a system that efficiently computes and updates explanatory inputs even as the inputs of the reference computation change. We evaluate our system with various analytic tasks on real datasets, and we show that it produces concise explanations in tens of milliseconds, while remaining up to two orders of magnitude faster than even the best implementations that do not support explanations.

Research R18: Database Hardware-Software Codesign

Location: Royal 1

Chair: Ryan Stutsman, Univ. of Utah

YourSQL: A High-Performance Database System Leveraging In-Storage Computing [Innovative Systems and Applications]

Insoon Jo (Samsung Electronics Co.); Duck-Ho Bae (Samsung Electronics Co.); Andre Yoon (Samsung Electronics Co.); Jeong-Uk Kang (Samsung Electronics Co.); Sangyeun Cho (Samsung Electronics Co.); Daniel Lee (Samsung Electronics Co.); Jaeheon Jeong (Samsung Electronics Co.)

Abstract:This paper presents YourSQL, a database system that accelerates data-intensive queries with the help of additional in-storage computing capabilities. YourSQL realizes very early filtering of data by offloading the data scanning of a query to user-programmable solid-state drives. We implement our system on a recent branch of MariaDB (a variant of MySQL). In order to quantify the performance gains of YourSQL, we evaluate SQL queries with varying complexities. Our result shows that YourSQL reduces the execution time of the whole set of TPC-H queries by 3.6x compared to a vanilla system. Moreover, the average speed-up of the five TPC-H queries with the largest performance gains reaches over 15x. Thanks to this significant reduction in execution time, we observe sizable energy savings. Our study demonstrates that the YourSQL approach, combining the power of early filtering with end-to-end datapath optimization, can accelerate large-scale analytic queries with lower energy consumption.
Cheap Data Analytics using Cold Storage Devices

Renata Borovica-Gajic (EPFL); Raja Appuswamy (EPFL); Anastasia Ailamaki (EPFL)

Abstract:Enterprise databases use storage tiering to lower capital and operational expenses. In such a setting, data waterfalls from an SSD-based high-performance tier when it is “hot” (frequently accessed) to a disk-based capacity tier and finally to a tape-based archival tier when “cold” (rarely accessed). To address the unprecedented growth in the amount of cold data, hardware vendors have introduced new devices named Cold Storage Devices (CSD) explicitly targeted at cold data workloads. With access latencies in the tens of seconds and cost as low as $0.01/GB/month, CSD provide a middle ground between the low-latency (ms), high-cost, HDD-based capacity tier and the high-latency (min to h), low-cost, tape-based archival tier. Driven by the price/performance aspects of CSD, this paper makes a case for using CSD as a replacement for both the capacity and archival tiers of enterprise databases. Although CSD offer major cost savings, we show that current database systems can suffer a severe performance drop when CSD are used as a replacement for HDD, due to the mismatch between the design assumptions made by the query execution engine and the actual storage characteristics of the CSD. We then build a CSD-driven query execution framework, called Skipper, that modifies both the database execution engine and CSD scheduling algorithms to be aware of each other. Using results from our implementation of the architecture based on PostgreSQL and OpenStack Swift, we show that Skipper is capable of completely masking the high latency overheads of CSD, thereby opening up CSD for wider adoption as a storage tier for cheap data analytics over cold data.
An Experimental Evaluation of Datacenter Workloads On Low-Power Embedded Micro Servers

Yiran Zhao (University of Illinois at Urbana-Champaign); Shen Li (University of Illinois at Urbana-Champaign); Shaohan Hu (University of Illinois at Urbana-Champaign); Hongwei Wang (University of Illinois at Urbana-Champaign); Shuochao Yao (University of Illinois at Urbana-Champaign); Huajie Shao (University of Illinois at Urbana-Champaign); Tarek Abdelzaher (University of Illinois at Urbana-Champaign)

Abstract:This paper presents a comprehensive evaluation of an ultra-low-power cluster, built upon Intel Edison based micro servers. The improved performance and high energy efficiency of micro servers have driven both academia and industry to explore the possibility of replacing conventional brawny servers with a larger swarm of embedded micro servers. Existing attempts mostly focus on mobile-class micro servers, whose capacities are similar to mobile phones. We, on the other hand, target sensor-class micro servers, which are originally intended for use in wearable technologies, sensor networks, and the Internet-of-Things. Although sensor-class micro servers have much less capacity, they are touted for minimal power consumption (< 1 Watt), which opens new possibilities for achieving higher energy efficiency in datacenter workloads. Our systematic evaluation of the Edison cluster and comparisons to conventional brawny clusters involve careful workload selection and laborious parameter tuning, which ensures maximum server utilization and thus fair comparisons. Results show that the Edison cluster achieves up to a 3.5x improvement in work-done-per-joule for web service applications and data-intensive MapReduce jobs. In terms of scalability, the Edison cluster scales linearly in the throughput of web service workloads, and also shows satisfactory scalability for MapReduce workloads despite coordination overhead.

Industrial I5: Graph Systems and Analytics

Location: Royal 2

Chair: Prasad Deshpande, Consultant

LDBC Graphalytics: A Benchmark for Large-Scale Graph Analysis on Parallel and Distributed Platforms

Alexandru Iosup (Delft University of Technology); Tim Hegeman (Delft University of Technology); Wing Lung Ngai (Delft University of Technology); Stijn Heldens (Delft University of Technology); Arnau Prat-Pérez (Universitat Politècnica de Catalunya); Thomas Manhardt (Oracle Labs); Hassan Chafi (Oracle Labs); Mihai Capotă (Intel Labs); Narayanan Sundaram (Intel Labs); Michael Anderson (Intel Labs); Ilie Gabriel Tanase (IBM Research); Yinglong Xia (Huawei Research America); Lifeng Nai (Georgia Tech); Peter Boncz (CWI)

Abstract:In this paper we introduce LDBC Graphalytics, a new industrial-grade benchmark for graph analysis platforms. It consists of six deterministic algorithms, standard datasets, synthetic dataset generators, and reference outputs that enable the objective comparison of graph analysis platforms. Its test harness produces deep metrics that quantify multiple kinds of system scalability, such as horizontal/vertical and weak/strong, and of robustness, such as failures and performance variability. The benchmark comes with open-source software for generating data and monitoring performance. We describe and analyze six implementations of the benchmark (three from the community, three from industry), providing insights into the strengths and weaknesses of the platforms. Key to our contribution, vendors perform the tuning and benchmarking of their platforms.
GraphJet: Real-Time Content Recommendations at Twitter

Aneesh Sharma (Twitter Inc.); Jerry Jiang (Twitter Inc.); Praveen Bommannavar (Twitter Inc.); Brian Larson (Twitter Inc.); Jimmy Lin (University of Waterloo)

Abstract:This paper presents GraphJet, a new graph-based system for generating content recommendations at Twitter. As motivation, we trace the evolution of our formulation and approach to the graph recommendation problem, embodied in successive generations of systems. Two trends can be identified: supplementing batch with real-time processing and a broadening of the scope of recommendations from users to content. Both of these trends come together in GraphJet, an in-memory graph processing engine that maintains a real-time bipartite interaction graph between users and tweets. The storage engine implements a simple API, but one that is sufficiently expressive to support a range of recommendation algorithms based on random walks that we have refined over the years. Similar to Cassovary, a previous graph recommendation engine developed at Twitter, GraphJet assumes that the entire graph can be held in memory on a single server. The system organizes the interaction graph into temporally-partitioned index segments that hold adjacency lists. GraphJet is able to support rapid ingestion of edges while concurrently serving lookup queries through a combination of compact edge encoding and a dynamic memory allocation scheme that exploits power-law characteristics of the graph. Each GraphJet server ingests up to one million graph edges per second, and in steady state, computes up to 500 recommendations per second, which translates into several million edge read operations per second.
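The temporally-partitioned index organization described in the abstract can be sketched roughly as follows. This is a hypothetical miniature (segment capacity, class name, and API are invented here); GraphJet's actual edge encoding and memory allocation are far more sophisticated.

```python
SEGMENT_CAPACITY = 4  # invented; real segments hold millions of edges

class BipartiteIndex:
    """Toy temporally-partitioned bipartite index: user -> tweet adjacency lists."""

    def __init__(self):
        self.segments = [{}]   # each segment maps user -> list of tweet ids
        self.count = 0         # edges in the current (newest) segment

    def ingest(self, user, tweet):
        if self.count == SEGMENT_CAPACITY:   # seal the segment, start a new one
            self.segments.append({})
            self.count = 0
        self.segments[-1].setdefault(user, []).append(tweet)
        self.count += 1

    def neighbors(self, user):
        # Lookups read across all segments, oldest first.
        return [t for seg in self.segments for t in seg.get(user, [])]

idx = BipartiteIndex()
for u, t in [("u1", "t1"), ("u2", "t2"), ("u1", "t3"), ("u1", "t4"), ("u1", "t5")]:
    idx.ingest(u, t)
print(len(idx.segments), idx.neighbors("u1"))  # 2 segments; ['t1', 't3', 't4', 't5']
```

Sealed segments are immutable, which is what lets real systems of this shape serve concurrent reads while only the newest segment takes writes.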
Using Domain-Specific Languages For Analytic Graph Databases

Martin Sevenich (Oracle Labs); Sungpack Hong (Oracle Labs); Oskar Van Rest (Oracle Labs); Zhe Wu (Oracle); Jayanta Banerjee (Oracle); Hassan Chafi (Oracle Labs)

Abstract:Recently, graphs have been drawing a lot of attention, both as a natural data model that captures fine-grained relationships between data entities and as a tool for powerful data analysis that considers such relationships. In this paper, we present a new graph database system that integrates a robust graph storage with an efficient graph analytics engine. Primarily, our system adopts two domain-specific languages (DSLs), one for describing graph analysis algorithms and the other for graph pattern matching queries. Compared to the API-based approaches in conventional graph processing systems, the DSL-based approach provides users with more flexible and intuitive ways of expressing algorithms and queries. Moreover, the DSL-based approach has significant performance benefits as well, (1) by skipping (remote) API invocation overhead and (2) by applying high-level optimizations in the compiler.
Cubrick: Indexing Millions of Records per Second for Interactive Analytics

Pedro Pedreira (Facebook Inc); Chris Croswhite (Facebook Inc.); Luis Bona (Federal University of Parana)

Abstract:This paper describes the architecture and design of Cubrick, a distributed multidimensional in-memory database suited for interactive analytics over highly dynamic datasets. Cubrick has a strictly multidimensional data model composed of cubes, dimensions and metrics, supporting sub-second OLAP operations such as slice and dice, roll-up and drill-down over terabytes of data. All data stored in Cubrick is range partitioned by every dimension and stored within containers called bricks in an unordered and sparse fashion, providing high data ingestion rates and indexed access through any combination of dimensions. In this paper, we describe details about Cubrick’s internal data structures, distributed model, query execution engine and a few details about the current implementation. Finally, we present results from a thorough experimental evaluation that leveraged datasets and queries collected from a few internal Cubrick deployments at Facebook.
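The "range partitioned by every dimension" idea in the Cubrick abstract can be illustrated with a toy brick-id computation. The dimension names, bucket boundaries, and encoding below are invented for illustration; Cubrick's actual layout is more involved.

```python
# Hypothetical dimensions: a categorical one (one bucket per value) and a
# numeric one (range boundaries). Every row maps to exactly one brick.
DIMS = [("region", ["APAC", "EMEA", "AMER"]),
        ("day",    [0, 10, 20, 30])]   # buckets [0,10), [10,20), [20,30), [30,...)

def bucket(dim_values, value):
    """Bucket index of a value; numeric values assumed >= the first boundary."""
    if isinstance(value, str):
        return dim_values.index(value)
    return max(i for i, b in enumerate(dim_values) if value >= b)

def brick_id(row):
    """Mixed-radix combination of the per-dimension bucket indexes."""
    bid = 0
    for name, dim_values in DIMS:
        bid = bid * len(dim_values) + bucket(dim_values, row[name])
    return bid

print(brick_id({"region": "EMEA", "day": 15}))  # 1*4 + 1 = 5
```

A filter on any dimension then translates into a set of candidate brick ids, which is how such a layout gives indexed access through any combination of dimensions.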

Demo 2b: Interactive and Exploratory Systems

Location: Maple

Same as Demo 2a

Research Posters 6

Location: Maple

Wednesday Sep 7th, 5:45 pm - 6:00 pm

Buses depart for banquet venue

Go to Top

Wednesday Sep 7th, 7:30 pm - 10:00 pm

VLDB Conference Banquet

Location: Kingdom of Dreams

Website: http://www.kingdomofdreams.in/
Go to Top

Thursday Sep 8th, 9:00 am - 10:30 am

VLDB Awards Session

Location: Pearl

Chair: Sunita Sarawagi, IIT Bombay

VLDB 2016 Awards: Best Paper and Best Demonstration

Best Paper Award

Compressed Linear Algebra for Large-Scale Machine Learning
Ahmed Elgohary (University of Maryland), Matthias Boehm (IBM Research - Almaden), Peter Haas (IBM Research), Fred Reiss (IBM Research - Almaden), and Berthold Reinwald (IBM Research - Almaden) for “Proposing a comprehensive solution to the challenge of compressing matrices for linear algebra operations in the context of machine learning tasks”.

VLDB Endowment Awards

VLDB 10-year Best Paper Award

The New Casper: Query Processing for Location Services without Compromising Privacy
Mohamed F. Mokbel, Chi-Yin Chow, and Walid G. Aref (VLDB 2006), for inspiring research to support privacy in spatial query processing.

VLDB Early Career Research Contribution Award

Xin Luna Dong for advancing the state of the art of knowledge fusion

VLDB Women in Database Research Award

Magdalena Balazinska for her inspirational research record on scalable distributed data systems.

Endowment Award Talk 1

Location: Pearl

Leave No Valuable Data Behind: The Crazy Ideas and the Business

Xin Luna Dong

Abstract:With the mission “leave no valuable data behind”, we developed techniques for knowledge fusion to guarantee the correctness of the knowledge. This talk starts with describing a few crazy ideas we have tested. The first, known as “Knowledge Vault”, used 15 extractors to automatically extract knowledge from 1B+ Webpages, obtaining 3B+ distinct (subject, predicate, object) knowledge triples and predicting well-calibrated probabilities for extracted triples. The second, known as “Knowledge-Based Trust”, estimated the trustworthiness of 119M webpages and 5.6M websites based on the correctness of their factual information. We then present how we bring the ideas to business in filling the gap between the knowledge at Google Knowledge Graph and the knowledge in the world.

Bio:Xin Luna Dong is a Principal Scientist at Amazon, leading the efforts of constructing the Amazon Product Graph. She was one of the major contributors to the Knowledge Vault project, and has led the Knowledge-Based Trust project, which was called the “Google Truth Machine” by the Washington Post. She has co-authored the book “Big Data Integration”, published 65+ papers in top conferences and journals, given 20+ keynotes/invited talks/tutorials, and received the Best Demo award at SIGMOD 2005. She is the PC co-chair for SIGMOD 2018 and WAIM 2015, and serves as an area chair for SIGMOD 2017, SIGMOD 2015, ICDE 2013, and CIKM 2011.

Endowment Award Talk 2

Location: Pearl

Location Data Management: A Tale of Two Systems and the “Next Destination”!

Mohamed Mokbel, Chi-Yin Chow, and Walid Aref

Abstract:In early 2000, we had the vision of ubiquitous location services, where each object is aware of its location, and continuously sends its location to a designated database server. This flood of location data opened the door for a myriad of location-based services that were considered visionary at that time, yet today they are a reality and have become ubiquitous. To realize our early vision, we identified two main challenges that needed to be addressed, namely, scalability and privacy. We have addressed these challenges through two main systems, PLACE and Casper. PLACE, developed at Purdue University from 2000 to 2005, set up the environment for built-in database support of scalable and continuous location-based services. The Casper system, developed at University of Minnesota from 2005 to 2010, was built inside the PLACE server allowing it to provide its high quality scalable service, while maintaining the privacy of its users’ locations. This talk will take you through a time journey of location services from 2000 until today, and beyond, highlighting the development efforts of the PLACE and Casper systems, along with their impact on current and future research initiatives in both academia and industry.

Bio 1:Mohamed F. Mokbel (Ph.D., Purdue University, MS, B.Sc., Alexandria University) is an Associate Professor in the Department of Computer Science and Engineering, University of Minnesota. His research interests include the interaction of GIS and location-based services with database systems and cloud computing. His research work has been recognized by five Best Paper Awards and by the NSF CAREER award. Mohamed is/was the program co-chair of SIGMOD 2018, ACM SIGSPATIAL from 2008 to 2010, and IEEE MDM 2011 and 2014. He is an Associate Editor for ACM TODS, ACM TSAS, VLDB journal, and GeoInformatica. Mohamed is currently serving as the elected Chair of ACM SIGSPATIAL 2014-2017. For more information, please visit: www.cs.umn.edu/~mokbel.

Bio 2:Chi-Yin Chow received the B.A. and M.Phil. degrees from The Hong Kong Polytechnic University, Hong Kong in 2002 and 2005, respectively. He also received the M.S. and Ph.D. degrees from the University of Minnesota-Twin Cities, USA in 2008 and 2010, respectively. He is currently an assistant professor in the Department of Computer Science, City University of Hong Kong. His research interests include big data analytics, data management, GIS, mobile computing, location-based services, and data privacy. He is the co-founder and co-chair of the ACM SIGSPATIAL MobiGIS workshop from 2012 to 2016, and the editor of the ACM SIGSPATIAL Newsletter. Dr. Chow received the best paper awards at ICA3PP 2015 and IEEE MDM 2009.

Bio 3:Walid G. Aref is a professor of computer science at Purdue. His research interests are in extending the functionality of database systems in support of emerging applications, e.g., spatial, spatio-temporal, multimedia, biological, and sensor databases. He is also interested in query processing, indexing, data mining, and geographic information systems (GIS). Professor Aref’s research has been supported by the National Science Foundation, the National Institutes of Health, Purdue Research Foundation, Qatar National Research Foundation, CERIAS, Panasonic, and Microsoft Corp. In 2001, he received the CAREER Award from the National Science Foundation and in 2004, he received a Purdue University Faculty Scholar award. Professor Aref is a member of Purdue’s CERIAS. He is an associate editor of the ACM Transactions on Database Systems (ACM TODS) and the ACM Transactions on Spatial Algorithms and Systems (TSAS), and has been an editor of the VLDB Journal. He is a senior member of the IEEE, and a member of the ACM. Professor Aref is an executive committee member, co-founder, and the past chair of the ACM Special Interest Group on Spatial Information (SIGSPATIAL).

Thursday Sep 8th, 11:15 am - 12:45 pm

Research R19: Query Optimization - 2

Location: Pearl 1

Chair: Johann-Christoph Freytag, Humboldt Univ.

Optimization of Conjunctive Predicates for Main Memory Column Stores

Fisnik Kastrati (University of Mannheim); Guido Moerkotte (University of Mannheim)

Abstract:Optimization of queries with conjunctive predicates for main memory databases remains a challenging task. The traditional way of optimizing this class of queries relies on predicate ordering based on selectivities or ranks. However, the optimization of queries with conjunctive predicates is a much more challenging task, requiring a holistic approach in view of (1) an accurate cost model that is aware of CPU architectural characteristics such as branch (mis)prediction, (2) a storage layer, allowing for a streamlined query execution, (3) a common subexpression elimination technique, minimizing column access costs, and (4) an optimization algorithm able to pick the optimal plan even in presence of a small (bounded) estimation error. In this work, we embrace the holistic approach, and show its superiority experimentally. Current approaches typically base their optimization algorithms on at least one of two assumptions: (1) the predicate selectivities are assumed to be independent, (2) the predicate costs are assumed to be constant. Our approach is not based on these assumptions, as they in general do not hold.
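The "traditional way" the abstract contrasts against orders predicates by rank. As a minimal illustration (the selectivities and costs below are made-up numbers, not from the paper), the classical rank of a predicate is (selectivity − 1) / cost, and predicates run in increasing rank order:

```python
# Classical predicate ordering baseline: rank = (selectivity - 1) / cost.
# Cheap, highly selective predicates get the most negative rank and run first.

def order_by_rank(predicates):
    return sorted(predicates, key=lambda p: (p["sel"] - 1.0) / p["cost"])

preds = [
    {"name": "p1", "sel": 0.5, "cost": 4.0},   # rank = -0.125
    {"name": "p2", "sel": 0.1, "cost": 1.0},   # rank = -0.9
    {"name": "p3", "sel": 0.9, "cost": 0.5},   # rank = -0.2
]

print([p["name"] for p in order_by_rank(preds)])  # ['p2', 'p3', 'p1']
```

The paper's point is that this baseline implicitly assumes independent selectivities and constant per-tuple costs, neither of which holds once branch misprediction and column access costs enter the picture.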
Parallelizing Query Optimization on Shared-Nothing Architectures

Immanuel Trummer (EPFL); Christoph Koch (EPFL)

Abstract:Data processing systems offer an ever increasing degree of parallelism on the levels of cores, CPUs, and processing nodes. Query optimization must exploit high degrees of parallelism in order not to gradually become the bottleneck of query evaluation. We show how to parallelize query optimization at a massive scale. We present algorithms for parallel query optimization in left-deep and bushy plan spaces. At optimization start, we divide the plan space for a given query into partitions of equal size that are explored in parallel by worker nodes. At the end of optimization, each worker returns the optimal plan in its partition to the master which determines the globally optimal plan from the partition-optimal plans. No synchronization or data exchange is required during the actual optimization phase. The amount of data sent over the network, at the start and at the end of optimization, as well as the complexity of serial steps within our algorithms increase only linearly in the number of workers and in the query size. The time and space complexity of optimization within one partition decreases uniformly in the number of workers. We parallelize single- and multi-objective query optimization over a cluster with 100 nodes in our experiments, using more than 250 concurrent worker threads (Spark executors). Despite high network latency and task assignment overheads, parallelization yields speedups of up to one order of magnitude for large queries whose optimization takes minutes on a single node.
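The partition-then-merge scheme the abstract describes can be sketched in a few lines. This is a toy classical illustration, not the paper's algorithm: the left-deep plan space (join orders) is split round-robin into equal partitions, each simulated "worker" returns its partition-optimal plan, and the master keeps the cheapest. The cardinalities and the cost model are invented.

```python
from itertools import permutations, islice

CARD = {"A": 1000, "B": 50, "C": 200, "D": 10}   # invented relation cardinalities

def cost(order):
    """Sum of intermediate-result sizes, assuming a fixed 0.01 join selectivity."""
    total, inter = 0, CARD[order[0]]
    for rel in order[1:]:
        inter = inter * CARD[rel] * 0.01
        total += inter
    return total

def worker(worker_id, num_workers):
    # Round-robin partition of the enumerated join orders; no coordination needed.
    plans = islice(permutations(CARD), worker_id, None, num_workers)
    return min(plans, key=cost)

# Master step: globally optimal plan = best of the partition-optimal plans.
best = min((worker(w, 4) for w in range(4)), key=cost)
print(best, cost(best))
```

Since the only data exchanged is one plan per worker at the end, communication grows linearly in the number of workers, matching the property claimed in the abstract.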
Multiple Query Optimization on the D-Wave 2X Adiabatic Quantum Computer

Immanuel Trummer (EPFL); Christoph Koch (EPFL)

Abstract:The D-Wave adiabatic quantum annealer solves hard combinatorial optimization problems leveraging quantum physics. The newest version features over 1000 qubits and was released in August 2015. We were given access to one of the first exemplars, currently hosted at NASA Ames research center in California, to explore the potential for hard optimization problems that arise in the context of databases. In this paper, we tackle the problem of multiple query optimization (MQO). We show how a MQO problem instance can be transformed into a mathematical formula that complies with the restrictive input format accepted by the quantum annealer. This formula is translated into ferromagnetic weights on and between qubits such that the configuration minimizing the input formula can be found via a process called adiabatic quantum annealing. We analyze the asymptotic growth rate of the number of required qubits in the MQO problem dimensions as the number of qubits is currently the main factor restricting applicability. We experimentally compare the performance of the quantum annealer against other MQO algorithms executed on a traditional computer. While the problem sizes that can be treated with 1000 qubits are still limited, we already find a narrow class of problem instances where the quantum annealer is three orders of magnitude faster than other approaches.
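The annealer minimizes a quadratic function over binary variables (a QUBO). Below is a hypothetical miniature MQO encoding in that spirit, brute-forced classically as a stand-in for annealing: one bit per (query, plan), a quadratic penalty enforcing one plan per query, and a negative coupling rewarding plan pairs that share work. All plans, costs, and savings are invented; the paper's actual transformation also handles qubit-connectivity constraints.

```python
from itertools import product

plans = {"q1": [("p1a", 10), ("p1b", 6)], "q2": [("p2a", 8), ("p2b", 7)]}
savings = {("p1b", "p2b"): 4}   # shared subexpression between these two plans
PENALTY = 100.0                 # weight of the one-plan-per-query constraint

bits = [(q, p, c) for q, ps in plans.items() for p, c in ps]

def energy(x):
    e = sum(c for (q, p, c), xi in zip(bits, x) if xi)             # plan costs
    for q in plans:                                                # constraint term
        chosen = sum(xi for (qq, _, _), xi in zip(bits, x) if qq == q)
        e += PENALTY * (chosen - 1) ** 2
    names = {p for (_, p, _), xi in zip(bits, x) if xi}
    e -= sum(s for pair, s in savings.items() if set(pair) <= names)
    return e

# Exhaustive minimization stands in for the quantum annealing step.
best = min(product([0, 1], repeat=len(bits)), key=energy)
print([bits[i][1] for i, xi in enumerate(best) if xi])  # ['p1b', 'p2b']
```

Note how the shared-work bonus flips the answer: without the 4-unit saving, (p1b, p2a) at total cost 14 would beat (p1b, p2b) at 13, but the coupling makes the sharing pair cheapest at 9.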

Research R20: Spatial Data and Queries - 2

Location: Pearl 2

Chair: Walid G. Aref, Purdue Univ.

k-Nearest Neighbors on Road Networks: A Journey in Experimentation and In-Memory Implementation [Experiments and Analysis Track]

Tenindra Abeywickrama (Monash University); Muhammad Cheema (Monash University); David Taniar (Monash University)

Abstract:A k nearest neighbor (kNN) query on road networks retrieves the k closest points of interest (POIs) by their network distances from a given location. Today, in the era of ubiquitous mobile computing, this is a highly pertinent query. While Euclidean distance has been used as a heuristic to search for the closest POIs by their road network distance, its efficacy has not been thoroughly investigated. The most recent methods have shown significant improvement in query performance. Earlier studies, which proposed disk-based indexes, were compared to the current state-of-the-art in main memory. However, recent studies have shown that main-memory comparisons can be challenging and require careful adaptation. This paper presents an extensive experimental investigation in main memory to settle these and several other issues. We use efficient and fair memory-resident implementations of each method to reproduce past experiments and conduct additional comparisons for several overlooked evaluations. Notably, we revisit a previously discarded technique (IER) showing that, through a simple improvement, it is often the best performing technique.
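The IER (Incremental Euclidean Restriction) idea the abstract revisits uses Euclidean distance as a lower bound on network distance: examine POIs in increasing Euclidean order and stop once k network distances beat the next Euclidean bound. A minimal sketch, with an invented toy graph and coordinates (real IER retrieves Euclidean neighbors incrementally from a spatial index rather than pre-sorting):

```python
import heapq, math

coords = {"q": (0, 0), "a": (1, 0), "b": (0, 2), "c": (3, 3)}
graph = {"q": {"a": 1.5, "b": 2.0}, "a": {"q": 1.5, "c": 4.0},
         "b": {"q": 2.0, "c": 3.5}, "c": {"a": 4.0, "b": 3.5}}

def euclid(u, v):
    (x1, y1), (x2, y2) = coords[u], coords[v]
    return math.hypot(x1 - x2, y1 - y2)

def network_dist(src, dst):
    """Plain Dijkstra; a real system would use a fast shortest-path index."""
    dist, pq = {src: 0.0}, [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return d
        if d > dist.get(u, math.inf):
            continue
        for v, w in graph[u].items():
            if d + w < dist.get(v, math.inf):
                dist[v] = d + w
                heapq.heappush(pq, (d + w, v))
    return math.inf

def knn_ier(query, pois, k):
    by_euclid = sorted(pois, key=lambda p: euclid(query, p))
    results = []
    for i, p in enumerate(by_euclid):
        results.append((network_dist(query, p), p))
        results.sort()
        nxt = euclid(query, by_euclid[i + 1]) if i + 1 < len(by_euclid) else math.inf
        # Safe to stop: no unseen POI can have network distance below its
        # Euclidean lower bound, which already exceeds the current k-th result.
        if len(results) >= k and results[k - 1][0] <= nxt:
            return results[:k]
    return results[:k]

print(knn_ier("q", ["a", "b", "c"], 2))  # [(1.5, 'a'), (2.0, 'b')]
```

Here the far-away POI "c" is never expanded at all, which is the whole point of the restriction.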
SKYPE: Top-k Spatial-keyword Publish/Subscribe Over Sliding Window

Xiang Wang (University of New South Wales); Ying Zhang (University of Technology Sydney); Wenjie Zhang (University of New South Wales); Xuemin Lin (University of New South Wales); Zengfeng Huang (University of New South Wales)

Abstract:With the prevalence of social media and GPS-enabled devices, a massive amount of geo-textual data is being generated in a stream fashion, leading to a variety of applications such as location-based recommendation and information dissemination. In this paper, we investigate a novel real-time top-k monitoring problem over a sliding window of streaming data; that is, we continuously maintain the top-k most relevant geo-textual messages (e.g., geo-tagged tweets) for a large number of spatial-keyword subscriptions (e.g., registered users interested in local events) simultaneously. To provide the most recent information under controllable memory cost, a sliding window model is employed on the streaming geo-textual data. To the best of our knowledge, this is the first work to study top-k spatial-keyword publish/subscribe over a sliding window. A novel system, called Skype (Top-k Spatial-keyword Publish/Subscribe), is proposed in this paper. In Skype, to continuously maintain top-k results for massive subscriptions, we devise a novel indexing structure upon subscriptions such that each incoming message can be immediately delivered on its arrival. Moreover, to reduce the expensive top-k re-evaluation cost triggered by message expiration, we develop a novel cost-based k-skyband technique to reduce the number of re-evaluations in a cost-effective way. Extensive experiments verify the efficiency and effectiveness of our proposed techniques.
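The k-skyband intuition behind the expiration-handling technique can be sketched simply: in a sliding window, a message can never re-enter any future top-k if at least k newer messages already score at least as high, so it is safe to discard. The scores and k below are illustrative; this is the pruning idea only, not Skype's cost-based index.

```python
def skyband_insert(candidates, msg, k):
    """candidates: list of (timestamp, score), oldest first. Prune dominated ones."""
    candidates = candidates + [msg]
    kept = []
    for i, (ts, score) in enumerate(candidates):
        # A message is dominated if >= k strictly newer messages score >= it.
        newer_dominators = sum(1 for ts2, s2 in candidates[i + 1:] if s2 >= score)
        if newer_dominators < k:               # still a potential top-k answer
            kept.append((ts, score))
    return kept

cands = []
for msg in [(1, 0.9), (2, 0.3), (3, 0.5), (4, 0.7)]:
    cands = skyband_insert(cands, msg, k=2)
print(cands)  # [(1, 0.9), (3, 0.5), (4, 0.7)] -- (2, 0.3) is pruned
```

When the current top-k message expires, only the retained skyband candidates need to be re-evaluated, rather than the whole window.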
Maximizing Bichromatic Reverse Spatial and Textual k Nearest Neighbor Queries

Farhana Choudhury (RMIT University); Shane Culpepper (RMIT University); Timos Sellis (RMIT University); Xin Cao (Queen’s University, Belfast)

Abstract:The problem of maximizing bichromatic reverse k nearest neighbor queries (BRkNN) has been extensively studied in spatial databases. In this work, we present a related query for spatial-textual databases that finds an optimal location, and a set of keywords that maximizes the size of bichromatic reverse spatial textual k nearest neighbors (MaxBRSTkNN). Such a query has many practical applications including social media advertisements where a limited number of relevant advertisements are displayed to each user. The problem is to find the location and the text contents to include in an advertisement so that it will be displayed to the maximum number of users. The increasing availability of spatial-textual collections allows us to answer these queries for both spatial proximity and textual similarity. This paper is the first to consider the MaxBRSTkNN query. We show that the problem is NP-hard and present both approximate and exact solutions.
Cache locality is not enough: High-Performance Nearest Neighbor Search with Product Quantization Fast Scan

Fabien André (Technicolor); Anne-Marie Kermarrec (INRIA); Nicolas Le Scouarnec (Technicolor)

Abstract:Nearest Neighbor (NN) search in high dimension is an important feature in many applications (e.g., image retrieval, multimedia databases). Product Quantization (PQ) is a widely used solution which offers high performance, i.e., low response time, while preserving a high accuracy. PQ represents high-dimensional vectors (e.g., image descriptors) by compact codes. Hence, very large databases can be stored in memory, allowing NN queries without resorting to slow I/O operations. PQ computes distances to neighbors using cache-resident lookup tables, thus its performance remains limited by (i) the many cache accesses that the algorithm requires, and (ii) the inability to leverage SIMD instructions available on modern CPUs. In this paper, we advocate that cache locality is not sufficient for efficiency. To address these issues, we design a novel algorithm, PQ Fast Scan, that transforms the cache-resident lookup tables into small tables, sized to fit SIMD registers. This transformation allows (i) in-register lookups in place of cache accesses and (ii) an efficient SIMD implementation. PQ Fast Scan has the exact same accuracy as PQ, while having 4 to 5 times lower response time (e.g., for 25 million vectors, scan time is reduced from 74ms to 13ms).
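The lookup-table mechanism that PQ Fast Scan accelerates works as follows: each database vector is stored as a tuple of centroid ids, and a query builds one table of query-to-centroid distances per sub-space, so scoring a code is just a few table lookups. A plain-Python sketch with invented toy codebooks (Fast Scan's contribution, shrinking these tables into SIMD registers, is not modeled here):

```python
# Two 1-D sub-spaces, each with a tiny codebook of centroids (invented).
codebooks = [[0.0, 1.0, 2.0], [0.0, 5.0, 10.0]]
database = [(0, 2), (1, 1), (2, 0)]   # each vector = one centroid id per sub-space

def build_tables(query):
    """Per-sub-space table of squared distances from the query to each centroid."""
    return [[(q - c) ** 2 for c in cb] for q, cb in zip(query, codebooks)]

def adc(tables, code):
    """Asymmetric distance computation: sum of table lookups, no vector math."""
    return sum(t[c] for t, c in zip(tables, code))

tables = build_tables((1.2, 4.0))
dists = sorted((adc(tables, code), code) for code in database)
print(dists[0])   # nearest code and its approximate squared distance
```

Because the tables are built once per query and then reused for every database code, the scan's cost is dominated by exactly the table lookups the paper moves into SIMD registers.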

Research R21: Distributed and Cloud Systems -2

Location: Royal 1

Chair: Mike Carey, UC Irvine

WiSeDB: A Learning-based Workload Management Advisor for Cloud Databases

Ryan Marcus (Brandeis University); Olga Papaemmanouil (Brandeis University)

Abstract:Workload management for cloud databases deals with the tasks of resource provisioning, query placement, and query scheduling in a manner that meets the application’s performance goals while minimizing the cost of using cloud resources. Existing solutions have approached these three challenges in isolation while aiming to optimize a single performance metric. In this paper, we introduce WiSeDB, a learning-based framework for generating holistic workload management solutions customized to application-defined performance goals and workload characteristics. Our approach relies on supervised learning to train cost-effective decision tree models for guiding query placement, scheduling, and resource provisioning decisions. Applications can use these models for both batch and online scheduling of incoming workloads. A unique feature of our system is that it can adapt its offline model to stricter/looser performance goals with minimal re-training. This allows us to present to the application alternative workload management strategies that address the typical performance vs. cost trade-off of cloud services. Experimental results show that our approach has very low training overhead while offering low cost strategies for a variety of performance metrics and workload characteristics.
Husky: Towards a More Efficient and Expressive Distributed Computing Framework

Fan Yang (The Chinese University of Hong Kong); Jinfeng Li (The Chinese University of Hong Kong); James Cheng (The Chinese University of Hong Kong)

Abstract:Finding efficient, expressive and yet intuitive programming models for data-parallel computing systems is an important and open problem. Systems like Hadoop and Spark have been widely adopted for massive data processing, as coarse-grained primitives like map and reduce are succinct and easy to master. However, an over-simplified API sometimes hinders programmers from more fine-grained control and designing more efficient algorithms. Developers may have to resort to sophisticated domain-specific languages (DSLs), or even low-level layers like MPI, but this raises development cost---learning many mutually exclusive systems prolongs the development schedule, and the use of low-level tools may result in bug-prone programming. This motivated us to start the Husky open-source project, which is an attempt to strike a better balance between high performance and low development cost. Husky is developed mainly for in-memory large-scale data mining, and also serves as a general research platform for designing efficient distributed algorithms. We show that many existing frameworks can be easily implemented and bridged together inside Husky, and Husky is able to achieve similar or even better performance compared with domain-specific systems.
Titian: Data Provenance Support in Spark

Matteo Interlandi (University of California at Los Angeles); Kshitij Shah (University of California at Los Angeles); Sai Tetali (University of California at Los Angeles); Muhammad Gulzar (University of California at Los Angeles); Seunghyun Yoo (University of California at Los Angeles); Miryung Kim (University of California at Los Angeles); Todd Millstein (University of California at Los Angeles); Tyson Condie (University of California at Los Angeles)

Abstract:Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time-consuming effort. Today’s DISC systems offer very little tooling for debugging programs, and as a result programmers spend countless hours collecting evidence (e.g., from log files) and performing trial-and-error debugging. To aid this effort, we built Titian, a library that enables data provenance (tracking data through transformations) in Apache Spark. Data scientists using the Titian Spark extension are able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds, orders of magnitude faster than alternative solutions, while minimally impacting Spark job performance; the observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time.
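The record-level provenance idea can be illustrated with a toy pipeline (this is a hand-rolled sketch, not Titian's Spark-internal implementation): each record carries the set of input ids that produced it, transformations propagate those sets, and aggregation unions them, so any output traces back to its root-cause inputs.

```python
def tmap(fn, data):
    """Map over (value, source_ids) pairs, preserving lineage."""
    return [(fn(v), src) for v, src in data]

def tfilter(pred, data):
    return [(v, src) for v, src in data if pred(v)]

def treduce_by_key(data):
    """Sum values per key; the output's lineage is the union of input lineages."""
    acc = {}
    for (k, v), src in data:
        total, srcs = acc.get(k, (0, set()))
        acc[k] = (total + v, srcs | src)
    return [((k, t), s) for k, (t, s) in acc.items()]

raw = ["a,1", "b,2", "a,3", "b,oops"]   # input id = line number
data = [(tuple(line.split(",")), {i}) for i, line in enumerate(raw)]
data = tfilter(lambda kv: kv[1].isdigit(), data)        # drops the malformed line
data = tmap(lambda kv: (kv[0], int(kv[1])), data)
result = treduce_by_key(data)
print(result)   # each aggregate is annotated with the input lines behind it
```

Given a suspicious aggregate for key "a", reading its lineage set immediately names input lines 0 and 2 as the records to inspect.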

Tutorial T5

Location: Royal 2

Human Factors in Crowdsourcing

Sihem Amer-Yahia (CNRS); Senjuti Basu Roy (New Jersey Institute of Technology)

Abstract:Today, crowdsourcing is used to “taskify” any job, ranging from simple receipt transcription to collaborative editing, fan-subbing, citizen science, and citizen journalism. The crowd is typically volatile, its arrivals and departures asynchronous, and its levels of attention and accuracy diverse. Tasks vary in complexity and may necessitate the participation of workers with varying degrees of expertise. Sometimes, workers need to collaborate explicitly and build on each other’s contributions to complete a single task. For example, in disaster reporting, CrowdMap allows geographically close people with diverse and complementary skills to work together to report details about the course of a typhoon or the aftermath of an earthquake. This uber-ization of human labor requires understanding workers’ motivation in completing a task and their ability to work together on collaborative tasks, as well as helping workers find relevant tasks. For over 40 years, organization studies have thoroughly examined the human factors that affect workers in physical workplaces. More recently, computer scientists have developed algorithms that verify and leverage those findings in a virtual marketplace, in this case a crowdsourcing platform. The goal of this tutorial is to review those two areas and discuss how their combination may improve workers’ experience, task throughput, and outcome quality for both microtasks and collaborative tasks. We will start with a coverage of motivation theory, team formation, and learning worker profiles. We will then address open research questions that result from this review.

Research R22: Entity Matching - 1

Location: Maple

Chair: Ioana Manolescu, INRIA

Online Entity Resolution Using an Oracle

Donatella Firmani (University of Rome “Tor Vergata”); Barna Saha (University of Massachusetts Amherst); Divesh Srivastava (AT&T Labs Research)

Abstract:Entity resolution (ER) is the task of identifying all records in a database that refer to the same underlying entity. This is an expensive task, and can take a significant amount of money and time; the end-user may want to take decisions during the process, rather than waiting for the task to be completed. We formalize an online version of the entity resolution task, and use an oracle which correctly labels matching and non-matching pairs through queries. In this setting, we design algorithms that seek to maximize progressive recall, and develop a novel analysis framework for prior proposals on entity resolution with an oracle, beyond their worst case guarantees. Finally, we provide both theoretical and experimental analysis of the proposed algorithms.
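The oracle setting can be made concrete with a small sketch (not the paper's algorithms): query pairs in decreasing estimated match probability, merge clusters on "yes" answers, and exploit transitivity to skip pairs that are already connected, so recall grows as early as possible. The probabilities and the ground truth the simulated oracle consults are invented.

```python
parent = {}

def find(x):
    """Union-find with path halving."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

# Simulated oracle: knows the true entity of each record (invented ground truth).
cluster = {"r1": 0, "r2": 0, "r3": 0, "r4": 1}
def oracle(a, b):
    return cluster[a] == cluster[b]

# Candidate pairs with estimated match probabilities (also invented).
candidates = [("r1", "r2", 0.9), ("r2", "r3", 0.8), ("r1", "r3", 0.7), ("r3", "r4", 0.4)]
queries = 0
for a, b, _ in sorted(candidates, key=lambda t: -t[2]):
    if find(a) == find(b):     # already implied by transitivity: no query needed
        continue
    queries += 1
    if oracle(a, b):
        union(a, b)

print(queries)  # 3 oracle queries instead of 4: (r1, r3) came for free
```

Ordering by match probability front-loads the "yes" answers, which is what maximizing progressive recall rewards: most matching pairs are resolved within the first few (expensive) oracle calls.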
BLAST: a Loosely Schema-aware Meta-blocking Approach for Entity Resolution

Giovanni Simonini (Università di Modena e Reggio Emilia); Sonia Bergamaschi (Università di Modena e Reggio Emilia); H. V. Jagadish (University of Michigan)

Abstract:To identify records that refer to the same entity is a fundamental step for data integration. Blocking techniques are typically employed to reduce the complexity of this task by partitioning records into blocks and limiting the comparison to those records co-occurring in them. Generally, to deal with highly heterogeneous and noisy data (e.g., semi-structured data of the Web), these techniques rely on redundancy to reduce the chance of missing matches. Meta-blocking is the task of restructuring blocks generated by redundancy-based blocking techniques, removing superfluous comparisons. Existing meta-blocking approaches rely exclusively on schema-agnostic features. In this paper, we demonstrate how “loose” schema information (i.e., statistics collected directly from the data) can be exploited to enhance the quality of the blocks in a holistic loosely schema-aware (meta-)blocking approach. We call it Blast (Blocking with Loosely-Aware Schema Techniques). We show how Blast can automatically extract this loose information by adopting an LSH-based step for efficiently scaling to large datasets. We experimentally demonstrate, on five real-world datasets, how Blast outperforms the state-of-the-art unsupervised meta-blocking approaches, and, in many cases, also the supervised one.
QuERy: A Framework for Integrating Entity Resolution with Query Processing

Hotham Altwaijry (University of California at Irvine); Sharad Mehrotra (University of California at Irvine); Dmitri Kalashnikov (AT&T Labs Research)

Abstract:This paper explores an analysis-aware data cleaning architecture for a large class of SPJ SQL queries. In particular, we propose QuERy, a novel framework for integrating entity resolution (ER) with query processing. The aim of QuERy is to correctly and efficiently answer complex queries issued on top of dirty data. The comprehensive empirical evaluation of the proposed solution demonstrates its significant advantage in terms of efficiency over the traditional techniques for the given problem settings.
Comparative Analysis of Approximate Blocking Techniques for Entity Resolution [Experiments and Analyses Track]

George Papadakis (University of Athens); Jonathan Svirsky (Technion); Avigdor Gal (Technion); Themis Palpanas (Paris Descartes University)

Abstract:Entity Resolution is a core task for merging data collections. Due to its quadratic complexity, it typically scales to large volumes of data through blocking: similar entities are clustered into blocks and pair-wise comparisons are executed only between co-occurring entities, at the cost of some missed matches. There are numerous blocking methods, and the aim of this work is to offer a comprehensive empirical survey, extending the dimensions of comparison beyond what is commonly available in the literature. We consider 17 state-of-the-art blocking methods and use 6 popular real datasets to examine the robustness of their internal configurations and their relative balance between effectiveness and time efficiency. We also investigate their scalability over a corpus of 7 established synthetic datasets that range from 10,000 to 2 million entities.

Thursday Sep 8th, 2:00 pm - 3:30 pm

Research R23: Query Execution - 2

Location: Pearl 1

Chair: Peter Boncz, CWI

An Empirical Evaluation of Set Similarity Join Techniques

Willi Mann (University of Salzburg); Nikolaus Augsten (University of Salzburg); Panagiotis Bouros (Aarhus University)

Abstract:Set similarity joins compute all pairs of similar sets from two collections of sets. We conduct extensive experiments on seven state-of-the-art algorithms for set similarity joins. These algorithms adopt a filter-verification approach. Our analysis shows that verification has not received enough attention in previous works. In practice, efficient verification inspects only a small, constant number of set elements and is faster than some of the more sophisticated filter techniques. Although we can identify three winners, we find that most algorithms show very similar performance. The key technique is the prefix filter, and AllPairs, the first algorithm to adopt this technique, is still a relevant competitor. We repeat experiments from previous work and discuss diverging results. All our claims are supported by a detailed analysis of the factors that determine the overall runtime.
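The prefix filter the abstract singles out is easy to illustrate. The sketch below is a minimal AllPairs-style join we wrote for illustration (names and structure are ours, not taken from any of the surveyed implementations); it indexes only each set's prefix and then verifies the surviving candidate pairs:

```python
from collections import defaultdict
from math import ceil

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def prefix_filter_join(sets, t):
    """All-pairs Jaccard join with threshold t using the prefix filter."""
    # Sort each set under a global token order (frequency order in real systems).
    sets = [sorted(set(s)) for s in sets]
    index = defaultdict(list)   # token -> ids of sets indexed under it
    results = []
    for i, s in enumerate(sets):
        # Two sets can reach threshold t only if they share a token within
        # the first |s| - ceil(t * |s|) + 1 elements (the "prefix").
        plen = len(s) - ceil(t * len(s)) + 1
        candidates = set()
        for tok in s[:plen]:
            candidates.update(index[tok])
            index[tok].append(i)
        for j in sorted(candidates):      # verification step
            if jaccard(s, sets[j]) >= t:
                results.append((j, i))
    return results
```

Verification here is a plain Jaccard computation; the paper's point is that making this step cheap matters as much as sophisticated filtering.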
Streaming Similarity Self-Join

Gianmarco De Francisci Morales (Qatar Computing Research Institute); Aristides Gionis (Aalto University)

Abstract:We introduce and study the problem of computing the similarity self-join in a streaming context (SSSJ), where the input is an unbounded stream of items arriving continuously. The goal is to find all pairs of items in the stream whose similarity is greater than a given threshold. The simplest formulation of the problem requires unbounded memory, and thus, it is intractable. To make the problem feasible, we introduce the notion of time-dependent similarity: the similarity of two items decreases with the difference in their arrival time. By leveraging the properties of this time-dependent similarity function, we design two algorithmic frameworks to solve the SSSJ problem. The first one, MiniBatch (MB), uses existing index-based filtering techniques for the static version of the problem, and combines them in a pipeline. The second framework, Streaming (STR), adds time filtering to the existing indexes, and integrates new time-based bounds deeply in the working of the algorithms. We also introduce a new indexing technique (L2), which is based on an existing state-of-the-art indexing technique (L2AP), but is optimized for the streaming case. Extensive experiments show that the STR algorithm, when instantiated with the L2 index, is the most scalable option across a wide array of datasets and parameters.
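The time-dependent similarity idea above can be sketched in a few lines; the exponential half-life decay used here is our illustrative choice, not necessarily the decay function from the paper:

```python
def time_dependent_sim(a, b, ta, tb, half_life=60.0):
    """Jaccard similarity discounted by the arrival-time gap of the two items.
    The 0.5 ** (gap / half_life) decay is an illustrative choice of function."""
    a, b = set(a), set(b)
    jac = len(a & b) / len(a | b)
    return jac * 0.5 ** (abs(ta - tb) / half_life)
```

Because the decayed similarity of any pair eventually drops below the threshold, an index only needs to retain recent items, which is what makes the streaming problem tractable with bounded memory.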
An Efficient Partition Based Method for Exact Set Similarity Joins

Dong Deng (Tsinghua University); Guoliang Li (Tsinghua University); He Wen (Tsinghua University); Jianhua Feng (Tsinghua University)

Abstract:We study the exact set similarity join problem, which, given two collections of sets, finds all the similar set pairs from the collections. Existing methods generally utilize the prefix filter based framework. They generate a prefix for each set and prune all the pairs whose prefixes are disjoint. However, the pruning power is limited, because if two dissimilar sets share a common element in their prefixes, they cannot be pruned. To address this problem, we propose a partition-based framework. We design a partition scheme to partition the sets into several subsets and guarantee that two sets are similar only if they share a common subset. To improve the pruning power, we propose a mixture of the subsets and their 1-deletion neighborhoods (the subsets obtained by eliminating one element). As there are multiple allocation strategies to generate the mixture, we evaluate different allocations and design a dynamic-programming algorithm to select the optimal one. However, the time complexity of generating the optimal one is O(s^3) for a set with size s. To speed up the allocation selection, we develop a greedy algorithm with an approximation ratio of 2. To further reduce the complexity, we design an adaptive grouping mechanism, and the two techniques can reduce the complexity to O(s log s). Experimental results on three real-world datasets show our method achieves high performance and outperforms state-of-the-art methods by 2-5 times.
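The 1-deletion neighborhood used in the partition scheme is simple to write down. A minimal sketch (helper names are ours) of generating the neighborhoods and of the signature match they enable:

```python
def one_deletion_neighborhood(subset):
    """All variants of a subset with exactly one element removed."""
    subset = tuple(subset)
    return [subset[:i] + subset[i + 1:] for i in range(len(subset))]

def subsets_match(a, b):
    """Pruning signal: the two subsets are equal, or one is a 1-deletion
    variant of the other (i.e., they differ by one missing element)."""
    a, b = tuple(sorted(a)), tuple(sorted(b))
    if a == b:
        return True
    return a in one_deletion_neighborhood(b) or b in one_deletion_neighborhood(a)
```

Indexing both a subset and its 1-deletion variants lets two sets that nearly share a partition still collide on a common signature, which is how the mixture tightens the pruning guarantee.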
Fast Queries Over Heterogeneous Data Formats Through Engine Customization

Manos Karpathiotakis (EPFL); Ioannis Alagiannis (EPFL); Anastasia Ailamaki (EPFL)

Abstract:Industry and academia are continuously becoming more data-driven and data-intensive, relying on the analysis of a wide variety of heterogeneous datasets to gain insights. The different data models and formats pose a significant challenge on performing analysis over a combination of diverse datasets. Serving all queries using a single, general-purpose query engine is slow. On the other hand, using a specialized engine for each heterogeneous dataset increases complexity: queries touching a combination of datasets require an integration layer over the different engines. This paper presents a system design that natively supports heterogeneous data formats and also minimizes query execution times. For multi-format support, the design uses an expressive query algebra which enables operations over various data models. For minimal execution times, it uses a code generation mechanism to mimic the system and storage most appropriate to answer a query fast. We validate our design by building Proteus, a query engine which natively supports queries over CSV, JSON, and relational binary data, and which specializes itself to each query, dataset, and workload via code generation. Proteus outperforms state-of-the-art open-source and commercial systems on both synthetic and real-world workloads without being tied to a single data model or format, all while exposing users to a single query interface.

Research R24: Social Networks and Crowdsourcing

Location: Pearl 2

Chair: Sihem Amer-Yahia, CNRS

Crowdsourced Top-k Algorithms: An Experimental Evaluation

Xiaohang Zhang (Tsinghua University); Guoliang Li (Tsinghua University); Jianhua Feng (Tsinghua University)

Abstract:Crowdsourced top-k computation has attracted significant attention recently, thanks to emerging crowdsourcing platforms, e.g., Amazon Mechanical Turk and CrowdFlower. Crowdsourced top-k algorithms ask the crowd to compare the objects and infer the top-k objects based on the crowdsourced comparison results. The crowd may return incorrect answers, but traditional top-k algorithms cannot tolerate the errors from the crowd. To address this problem, the database and machine-learning communities have independently studied the crowdsourced top-k problem. The database community proposes heuristic-based solutions, while the machine-learning community proposes learning-based methods (e.g., maximum likelihood estimation). However, these two types of techniques have not been compared systematically under the same experimental framework, so it is rather difficult for a practitioner to decide which algorithm should be adopted. Furthermore, the experimental evaluation of existing studies has several weaknesses: some methods assume the crowd returns high-quality results, and some algorithms are tested only in simulation. To alleviate these limitations, in this paper we present a comprehensive comparison of crowdsourced top-k algorithms. Using various synthetic and real datasets, we evaluate each algorithm in terms of result quality and efficiency on real crowdsourcing platforms. We reveal the characteristics of different techniques and provide guidelines on selecting appropriate algorithms for various scenarios.
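As a flavor of the heuristic family the comparison covers (this Borda-style vote count is our own illustration, not one of the surveyed algorithms), noisy pairwise answers can be aggregated by net wins:

```python
from collections import Counter

def topk_from_comparisons(comparisons, k):
    """Rank items by net wins over noisy crowd answers given as (winner, loser)
    pairs and return the top-k. With enough redundant votes, individual
    crowd errors tend to average out."""
    wins = Counter(w for w, _ in comparisons)
    losses = Counter(l for _, l in comparisons)
    items = set(wins) | set(losses)
    return sorted(items, key=lambda x: wins[x] - losses[x], reverse=True)[:k]
```

Learning-based methods instead fit a latent score per item (e.g., by maximum likelihood under a comparison noise model) and rank by the fitted scores.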
CLAMShell: Speeding up Crowds for Low-latency Data Labeling

Daniel Haas (University of California at Berkeley); Jiannan Wang (Simon Fraser University); Eugene Wu (Columbia University); Michael Franklin (University of California at Berkeley)

Abstract:Data labeling is a necessary but often slow process that impedes the development of interactive systems for modern data analysis. Despite rising demand for manual data labeling, there is a surprising lack of work addressing its high and unpredictable latency. In this paper, we introduce CLAMShell, a system that speeds up crowds in order to achieve consistently low-latency data labeling. We offer a taxonomy of the sources of labeling latency and study several large crowdsourced labeling deployments to understand their empirical latency profiles. Driven by these insights, we comprehensively tackle each source of latency, both by developing novel techniques such as straggler mitigation and pool maintenance and by optimizing existing methods such as crowd retainer pools and active learning. We evaluate CLAMShell in simulation and on live workers on Amazon’s Mechanical Turk, demonstrating that our techniques can provide an order of magnitude speedup and variance reduction over existing crowdsourced labeling strategies.
Dynamic Influence Analysis in Evolving Networks

Naoto Ohsaka (The University of Tokyo); Takuya Akiba (National Institute of Informatics); Yuichi Yoshida (National Institute of Informatics); Ken-ichi Kawarabayashi (National Institute of Informatics)

Abstract:We propose the first real-time fully-dynamic index data structure designed for influence analysis on evolving networks. With this aim, we carefully redesign the data structure of the state-of-the-art sketching method introduced by Borgs et al., and construct corresponding update algorithms. Using this index, we present algorithms for two kinds of queries, influence estimation and influence maximization, which are strongly motivated by practical applications, such as viral marketing. We provide a thorough theoretical analysis, which guarantees the non-degeneracy of the solution accuracy after an arbitrary number of updates. Furthermore, we introduce a reachability-tree-based technique and a skipping method, which greatly reduce the time consumption required for edge/vertex deletions and vertex additions, respectively, and counter-based random number generators, which improve the space efficiency. Experimental evaluations using real dynamic networks with tens of millions of edges demonstrate the efficiency, scalability, and accuracy of our proposed indexing scheme. Specifically, it can reflect a graph modification within a time of several orders of magnitude smaller than that required to reconstruct an index from scratch, estimate the influence spread of a vertex set accurately within a millisecond, and select highly influential vertices at least ten times faster than state-of-the-art static algorithms.
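The static baseline that the paper's dynamic index accelerates, greedy influence maximization over reverse-reachable (RR) sets under the Independent Cascade model, can be sketched as follows. This is a simplified illustration with our own function names; real implementations choose the sample count adaptively:

```python
import random

def random_rr_set(nodes, in_edges, rng):
    """One reverse-reachable set: walk backwards from a random root, keeping
    each incoming edge with its propagation probability. in_edges maps a node
    to a list of (source, probability) pairs."""
    root = rng.choice(nodes)
    seen, stack = {root}, [root]
    while stack:
        v = stack.pop()
        for u, p in in_edges.get(v, []):
            if u not in seen and rng.random() < p:
                seen.add(u)
                stack.append(u)
    return seen

def greedy_im(nodes, in_edges, k, n_samples=1000, seed=0):
    """Greedy max-coverage over sampled RR sets: repeatedly pick the node
    appearing in the most not-yet-covered samples."""
    rng = random.Random(seed)
    rr = [random_rr_set(nodes, in_edges, rng) for _ in range(n_samples)]
    chosen, covered = [], set()
    for _ in range(k):
        best = max(nodes, key=lambda v: sum(1 for i, s in enumerate(rr)
                                            if i not in covered and v in s))
        chosen.append(best)
        covered |= {i for i, s in enumerate(rr) if best in s}
    return chosen
```

The fraction of RR sets a seed set covers estimates its influence spread; the paper's contribution is keeping such a sketch consistent under edge and vertex updates without rebuilding it.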
From Competition to Complementarity: Comparative Influence Diffusion and Maximization

Wei Lu (University of British Columbia); Wei Chen (Microsoft Research); Laks Lakshmanan (University of British Columbia)

Abstract:Influence maximization is a well-studied problem that asks for a small set of influential users from a social network, such that by targeting them as early adopters, the expected total adoption through influence cascades over the network is maximized. However, almost all prior work focuses on cascades of a single propagating entity or purely-competitive entities. In this work, we propose the Comparative Independent Cascade (Com-IC) model that covers the full spectrum of entity interactions from competition to complementarity. In Com-IC, users’ adoption decisions depend not only on edge-level information propagation, but also on a node-level automaton whose behavior is governed by a set of model parameters, enabling our model to capture not only competition, but also complementarity, to any possible degree. We study two natural optimization problems, Self Influence Maximization and Complementary Influence Maximization, in a novel setting with complementary entities. Both problems are NP-hard, and we devise efficient and effective approximation algorithms via non-trivial techniques based on reverse-reachable sets and a novel “sandwich approximation” strategy. The applicability of both techniques extends beyond our model and problems. Our experiments show that the proposed algorithms consistently outperform intuitive baselines in four real-world social networks, often by a significant margin. In addition, we learn model parameters from real user action logs.

Research R25: Graph Processing - 3

Location: Royal 1

Chair: Mohamed Mokbel, Univ. of Minnesota

Data-driven Visual Graph Query Interface Construction and Maintenance: Challenges and Opportunities [Vision]

Sourav Bhowmick (Nanyang Technological University); Byron Choi (Hong Kong Baptist University); Curtis Dyreson (Utah State University)

Abstract:Visual query interfaces make it easy for scientists and other non-expert users to query a data collection. Heretofore, visual query interfaces have been statically-constructed, independent of the data. In this paper we outline a vision of a different kind of interface, one that is built (in part) from the data. In our data-driven approach, the visual interface is dynamically constructed and maintained. A data-driven approach has many benefits such as reducing the cost in constructing and maintaining an interface, superior support for query formulation, and increased portability of the interface. We focus on graph databases, but our approach is applicable to several other kinds of databases such as JSON and XML.
The shortest path is not always a straight line

Vasiliki Kalavri (KTH Royal Institute of Technology); Tiago Simas (Telefonica Research); Dionysios Logothetis (Facebook)

Abstract:In this paper, we leverage the concept of the metric backbone to improve the efficiency of large-scale graph analytics. The metric backbone is the minimum subgraph that preserves the shortest paths of a weighted graph. We use the metric backbone in place of the original graph to compute various graph metrics exactly or with good approximation. By computing on a smaller graph, we improve the performance of graph analytics applications on two different systems, a batch graph processing system and a graph database. Further, we provide an algorithm for the computation of the metric backbone on large graphs. While one can compute the metric backbone by solving the all-pairs-shortest-paths problem, this approach incurs prohibitive time and space complexity for big graphs. Instead, we propose a heuristic that makes computing the metric backbone practical even for large graphs. Additionally, we analyze several real datasets of different sizes and domains and we show that we can approximate the metric backbone by removing only first-order semi-metric edges: edges for which a shorter two-hop path exists. We provide a distributed implementation of our algorithm and apply it in large-scale scenarios. We evaluate our algorithm using a variety of real graphs, including a Facebook social network subgraph of 50 billion edges. We measure the impact of using the metric backbone on runtime performance in two graph management systems. We achieve query speedups of up to 6.7x in the Neo4j commercial graph database and job speedups of up to 6x in the Giraph graph processing system.
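Removing first-order semi-metric edges, as described above, amounts to dropping every edge that has a strictly shorter two-hop detour. A minimal single-machine sketch (the paper's implementation is distributed; the function name is ours):

```python
def approx_metric_backbone(adj):
    """Drop first-order semi-metric edges: edges with a strictly shorter
    two-hop detour. adj maps u -> {v: weight}, with both directions present
    for an undirected graph."""
    backbone = {u: dict(nbrs) for u, nbrs in adj.items()}
    for u, nbrs in adj.items():
        for v, w in nbrs.items():
            # (u, v) is semi-metric if some neighbor x of u gives a shorter path
            if any(x != v and v in adj[x] and wx + adj[x][v] < w
                   for x, wx in nbrs.items()):
                del backbone[u][v]
    return backbone
```

Each edge is checked only against two-hop paths, so the pass is cheap compared to the exact backbone, which would require all-pairs shortest paths.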
New Lower and Upper Bounds for Shortest Distance Queries on Terrains

Manohar Kaul (Technische Universität Berlin); Raymond Chi-Wing Wong (The Hong Kong University of Science and Technology); Christian Jensen (Aalborg University)

Abstract:The increasing availability of massive and accurate laser data enables the processing of spatial queries on terrains. As shortest-path computation, an integral element of query processing, is inherently expensive on terrains, a key approach to enabling efficient query processing is to reduce the need for exact shortest-path computation. We develop new lower and upper bounds on terrain shortest distances that are provably tighter than any existing bounds. Unlike existing bounds, the new bounds do not rely on the quality of the triangulation. We show how use of the new bounds speeds up query processing by reducing the need for exact distance computations. Speedups of nearly an order of magnitude are demonstrated empirically for well-known spatial queries.
On Measuring the Lattice of Commonalities Among Several Linked Datasets

Michalis Mountantonakis (FORTH); Yannis Tzitzikas (FORTH)

Abstract:A large number of datasets have been published according to the principles of Linked Data, and this number keeps increasing. Although the ultimate objective is linking and integration, it is not currently evident how connected the current LOD cloud is. The classical measurements and visualizations of the LOD cloud count and show only the number of links that exist between pairs of datasets. Measurements (and indexes) that involve more than two datasets are not available, although they would be useful in several tasks, e.g.: (a) for obtaining complete, enriched information about one particular URI (or set of URIs), (b) for aiding dataset discovery and dataset selection, (c) for assessing the connectivity between any set of datasets for quality checking and for monitoring their evolution over time, and (d) for constructing visualizations that provide more informative overviews. Since it would be prohibitively expensive to perform all these measurements in a naive way, in this paper we introduce various indexes (and their construction algorithms) that can speed up such tasks. In brief, we introduce (i) a namespace-based prefix index, (ii) a sameAs catalog for computing the symmetric and transitive closure of the owl:sameAs relationships encountered in the datasets, (iii) a semantics-aware element index (that exploits the aforementioned indexes), and (iv) finally two lattice-based incremental algorithms for speeding up the computation of the intersection URIs of any set of datasets. We discuss the speedup obtained by the introduced indexes and algorithms through comparative results, and finally we report measurements about the connectivity of the LOD cloud that have never been carried out in the past.

Tutorial T6

Location: Royal 2

Qualitative Data Cleaning

Xu Chu (University of Waterloo); Ihab Ilyas (University of Waterloo)

Abstract:Data quality is one of the most important problems in data management, since dirty data often leads to inaccurate data analytics results and wrong business decisions. Data cleaning exercises often consist of two phases: error detection and error repairing. Error detection techniques can be either quantitative or qualitative, and error repairing is performed by applying data transformation scripts or by involving human experts, and sometimes both. In this tutorial, we discuss the main facets and directions in designing qualitative data cleaning techniques. We present a taxonomy of current qualitative error detection techniques, as well as a taxonomy of current data repairing techniques. We will also discuss proposals for tackling the challenges of cleaning “big data” in terms of scale and distribution.

Research R26: Data and Query Models -2

Location: Maple

Chair: Tim Kraska, Brown Univ.

Interactive Browsing and Navigation in Relational Databases

Minsuk Kahng (Georgia Institute of Technology); Shamkant Navathe (Georgia Institute of Technology); John Stasko (Georgia Institute of Technology); Duen Horng Chau (Georgia Institute of Technology)

Abstract:Although database researchers have devoted considerable effort to helping database users formulate queries, many users still find it challenging to specify queries that involve joining tables. To help users construct join queries for exploring relational databases, we propose Etable, a novel presentation data model that provides users with a presentation-level interactive view. This view compactly presents one-to-many and many-to-many relationships within a single enriched table by allowing a cell to contain a set of entity references. Users can directly interact with the table and the entity references to incrementally construct complex queries. To enable users to explore data on a conceptual entity-relationship level, we also introduce a graph-based model, called typed graph model that provides an abstraction of relational databases. In a user study, participants performed a range of database querying tasks faster with Etable than with a commercial graphical query builder. Subjective feedback about Etable was also positive. All participants found that Etable was easier to learn and helpful for exploring databases.
ATHENA: An Ontology-Driven System for Natural Language Querying over Relational Data Stores

Diptikalyan Saha (IBM Research - India); Avrilia Floratou (IBM Research - Almaden); Karthik Sankaranarayanan (IBM Research - India); Umar Farooq Minhas (IBM Research - Almaden); Ashish Mittal (IBM Research - India); Fatma Ozcan (IBM Research - Almaden)

Abstract:In this paper, we present ATHENA, an ontology-driven system for natural language querying of complex relational databases. Natural language interfaces to databases enable users easy access to data, without the need to learn a complex query language, such as SQL. ATHENA uses domain specific ontologies, which describe the semantic entities, and their relationships in a domain. We propose a unique two-stage approach, where the input natural language query (NLQ) is first translated into an intermediate query language over the ontology, called OQL, and subsequently translated into SQL. Our two-stage approach allows us to decouple the physical layout of the data in the relational store from the semantics of the query, providing physical independence. Moreover, ontologies provide richer semantic information, such as inheritance and membership relations, that are lost in a relational schema. By reasoning over the ontologies, our NLQ engine is able to accurately capture the user intent. We study the effectiveness of our approach using three different workloads on top of geographical (GEO), academic (MAS) and financial (FIN) data. ATHENA achieves 100% precision on the GEO and MAS workloads, and 99% precision over the FIN workload which operates on a complex financial ontology. Moreover, ATHENA attains 87.2%, 88.3%, and 88.9% recall on the GEO, MAS, and FIN workloads, respectively.
BlinkFill: Semi-supervised Programming By Example for Syntactic String Transformations

Rishabh Singh (Microsoft Research)

Abstract:The recent Programming By Example (PBE) techniques such as FlashFill have shown great promise for enabling end-users to perform data transformation tasks using input-output examples. Since examples are inherently an under-specification, there are typically a large number of hypotheses conforming to the examples, and the PBE techniques suffer from scalability issues for finding the intended program amongst the large space. We present a semi-supervised learning technique to significantly reduce this ambiguity by using the logical information present in the input data to guide the synthesis algorithm. We develop a data structure InputDataGraph to succinctly represent a large set of logical patterns that are shared across the input data, and use this graph to efficiently learn substring expressions in a new PBE system BlinkFill. We evaluate BlinkFill on 207 real-world benchmarks and show that BlinkFill is significantly faster (on average 41x) and requires fewer input-output examples (1.27 vs 1.53) to learn the desired transformations in comparison to FlashFill.
Query From Examples: An Iterative, Data-Driven Approach to Query Construction

Hao Li (National University of Singapore); Chee-Yong Chan (National University of Singapore); David Maier (Portland State University)

Abstract:In this paper, we propose a new approach, called Query from Examples (QFE), to help non-expert database users construct SQL queries. Our approach, which is designed for users who might be unfamiliar with SQL, only requires that the user is able to determine whether a given output table is the result of his or her intended query on a given input database. To kick-start the construction of a target query Q, the user first provides a pair of inputs: a sample database D and an output table R which is the result of Q on D. As there will be many candidate queries that transform D to R, QFE winnows this collection by presenting the user with new database-result pairs that distinguish these candidates. Unlike previous approaches that use synthetic data for such pairs, QFE strives to make these distinguishing pairs as close to the original (D, R) pair as possible. By doing so, it seeks to minimize the effort needed by a user to determine if a new database-result pair is consistent with his or her desired query. We demonstrate the effectiveness and efficiency of our approach using real datasets from SQLShare, a cloud-based platform designed to help scientists utilize RDBMS technology for data analysis.

Thursday Sep 8th, 4:00 pm - 5:30 pm

Research R27: Data and Query Models -3

Location: Pearl 1

Chair: Sudeepa Roy, Duke Univ.

Exploiting Equality Generating Dependencies in Checking Chase Termination

Marco Calautti (University of Calabria); Sergio Greco (University of Calabria); Cristian Molinaro (University of Calabria); Irina Trubitsyna (University of Calabria)

Abstract:The chase is a well-known algorithm with a wide range of applications in data exchange, data cleaning, data integration, query optimization, and ontological reasoning. Since the chase evaluation might not terminate and it is undecidable whether it terminates, the problem of defining (decidable) sufficient conditions ensuring termination has received a great deal of interest in recent years. In this regard, several termination criteria have been proposed. One of the main weaknesses of current approaches is the limited analysis they perform on equality generating dependencies (EGDs). In this paper, we propose sufficient conditions ensuring that a set of dependencies has at least one terminating chase sequence. We propose novel criteria which are able to perform a more accurate analysis of EGDs. Specifically, we propose a new stratification criterion and an adornment algorithm. The latter can both be used as a termination criterion and be combined with current techniques to make them more effective, in that strictly more sets of dependencies are identified. Our techniques identify sets of dependencies that are not recognized by any of the current criteria.
The Complexity of Resilience and Responsibility for Self-Join-Free Conjunctive Queries

Cibele Freire (University of Massachusetts Amherst); Wolfgang Gatterbauer (Carnegie Mellon University); Neil Immerman (University of Massachusetts Amherst); Alexandra Meliou (University of Massachusetts Amherst)

Abstract:Several research thrusts in the area of data management have focused on understanding how changes in the data affect the output of a view or standing query. Example applications are explaining query results, propagating updates through views, and anonymizing datasets. An important aspect of this analysis is the problem of deleting a minimum number of tuples from the input tables to make a given Boolean query false, which we refer to as “the resilience of a query”. In this paper, we study the complexity of resilience for self-join-free conjunctive queries with arbitrary functional dependencies. The cornerstone of our work is the novel concept of triads, a simple structural property of a query that leads to the several dichotomy results we show in this paper. The concepts of triads and resilience bridge the connections between the problems of deletion propagation and causal responsibility, and allow us to substantially advance the known complexity results in these topics. Specifically, we show a dichotomy for the complexity of resilience, which identifies previously unknown tractable families for deletion propagation with source side-effects, and we extend this result to account for functional dependencies. Further, we identify a mistake in a previous dichotomy for causal responsibility, and offer a revised characterization based purely on the structural form of the query (presence or absence of triads). Finally, we extend the dichotomy for causal responsibility in two ways: (a) we account for functional dependencies in the input tables, and (b) we compute responsibility for sets of tuples specified via wildcards.
A Shifting Bloom Filter Framework for Set Queries

Tong Yang (Peking University); Alex X. Liu (Michigan State University); Muhammad Shahzad (North Carolina State University); Yuankun Zhong (Nanjing University); Qiaobin Fu (Boston University); Zi Li (Nanjing University); Gaogang Xie (ICT, CAS); Xiaoming Li (Peking University)

Abstract: Set queries are fundamental operations in computer systems and applications. This paper addresses the fundamental problem of designing a probabilistic data structure that can quickly process set queries using a small amount of memory. We propose a Shifting Bloom Filter (ShBF) framework for representing and querying sets. We demonstrate the effectiveness of ShBF using three types of popular set queries: membership, association, and multiplicity queries. The key novelty of ShBF is in encoding the auxiliary information of a set element in a location offset. In contrast, prior BF-based set data structures allocate additional memory to store auxiliary information. To evaluate ShBF in comparison with prior art, we conducted experiments using real-world network traces. Results show that ShBF significantly advances the state of the art on all three types of set queries.
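The location-offset idea can be illustrated with a toy sketch (hypothetical class and parameter names; this is an illustration of the idea, not the paper's exact construction): instead of storing an element's auxiliary value in extra memory, shift each of its hash positions by that value.

```python
import hashlib

class ShiftingBloomFilter:
    """Toy illustration of the ShBF idea: an element's auxiliary value
    (a small integer in [0, max_aux)) is encoded as a location offset
    rather than stored in separate memory."""

    def __init__(self, m=1 << 16, k=4, max_aux=4):
        self.m, self.k, self.max_aux = m, k, max_aux
        self.bits = bytearray(m)

    def _positions(self, item):
        # k independent hash positions, leaving headroom for the offset
        for i in range(self.k):
            h = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8)
            yield int.from_bytes(h.digest(), "big") % (self.m - self.max_aux)

    def add(self, item, aux):
        assert 0 <= aux < self.max_aux
        for p in self._positions(item):
            self.bits[p + aux] = 1  # the shift encodes the auxiliary value

    def query(self, item):
        """Return the set of candidate aux values for item."""
        candidates = set(range(self.max_aux))
        for p in self._positions(item):
            candidates = {a for a in candidates if self.bits[p + a]}
        return candidates
```

Querying returns every offset consistent with the filter's bits, so, as with any Bloom filter, false positives are possible but false negatives are not.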
Decibel: The Relational Dataset Branching System

Michael Maddox (MIT); David Goehring (MIT); Aaron Elmore (University of Chicago); Samuel Madden (MIT); Aditya Parameswaran (University of Illinois at Urbana-Champaign); Amol Deshpande (University of Maryland)

Abstract: As scientific endeavors and data analysis become increasingly collaborative, there is a need for data management systems that natively support the versioning or branching of datasets to enable concurrent analysis, cleaning, integration, manipulation, or curation of data across teams of individuals. Common practice for sharing and collaborating on datasets involves creating or storing multiple copies of the dataset, one for each stage of analysis, with no provenance information tracking the relationships between these datasets. This results not only in wasted storage, but also makes it challenging to track and integrate modifications made by different users to the same dataset. In this paper, we introduce the Relational Dataset Branching System, Decibel, a new relational storage system with built-in version control designed to address these shortcomings. We present our initial design for Decibel and provide a thorough evaluation of three versioned storage engine designs that focus on efficient query processing with minimal storage overhead. We also develop an exhaustive benchmark to enable the rigorous testing of these and future versioned storage engine designs.

Research R28: Entity Matching - 2

Location: Pearl 2

Chair: Themis Palpanas, Paris Descartes Univ.

Magellan: Toward Building Entity Matching Management Systems [Innovative Systems and Applications]

Pradap Konda (University of Wisconsin Madison); Sanjib Das (University of Wisconsin Madison); Paul Suganthan G. C. (University of Wisconsin Madison); AnHai Doan (University of Wisconsin Madison); Adel Ardalan (University of Wisconsin Madison); Jeff Ballard (University of Wisconsin Madison); Han Li (University of Wisconsin Madison); Fatemah Panahi (University of Wisconsin Madison); Haojun Zhang (University of Wisconsin Madison); Jeffrey Naughton (University of Wisconsin Madison); Shishir Prasad (Instacart); Ganesh Krishnan (WalmartLabs); Rohit Deep (WalmartLabs); Vijay Raghavendra (WalmartLabs)

Abstract: Entity matching (EM) has been an important data integration problem and will become even more important in the age of Big Data. However, most EM works have focused only on developing matching algorithms. Going forward, we argue that significantly more attention should be devoted to building EM systems. We discuss four major limitations of current EM systems, then describe Magellan, a new kind of EM system. Compared to current EM systems, Magellan is novel in several important aspects. (1) It provides a how-to guide that tells users what to do in each EM scenario, step by step. (2) It provides tools to help users do these steps. The tools seek to cover the entire EM pipeline, not just matching and blocking as current EM systems do. (3) Magellan is built on top of the data analysis and Big Data stacks in Python, allowing it to borrow a rich set of capabilities in data cleaning, information extraction, visualization, learning, etc. (4) In contrast to current “closed world” EM systems, Magellan is a new kind of system that we call an “open world” system, but building such systems raises significant challenges. We present extensive experiments with 44 students and many real users at various organizations that show the promise of Magellan.
Distributed Data Deduplication

Xu Chu (University of Waterloo); Ihab Ilyas (University of Waterloo); Paraschos Koutris (University of Wisconsin Madison)

Abstract: Data deduplication refers to the process of identifying tuples in a relation that refer to the same real-world entity. The complexity of the problem is inherently quadratic with respect to the number of tuples, since a similarity value must be computed for every pair of tuples. In order to avoid comparing tuple pairs that are obviously non-duplicates, matching algorithms use blocking techniques that divide the tuples into blocks and compare only tuples within the same block. However, even with the use of blocking, data deduplication remains a costly problem for large datasets. In this paper, we show how to further speed up data deduplication by leveraging parallelism in a shared-nothing computing environment. Our main contribution is a distribution strategy, called Dis-Dedup, that minimizes the maximum workload across all worker nodes and provides strong theoretical guarantees. We demonstrate the effectiveness of our proposed strategy by performing extensive experiments on both synthetic datasets with varying block size distributions, as well as real-world datasets.
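For readers unfamiliar with blocking, a minimal single-machine sketch follows (illustrative names and toy similarity only; the paper's contribution is how to distribute and balance these blocks across workers, which is not modeled here):

```python
from collections import defaultdict
from itertools import combinations

def blocked_duplicates(records, block_key, similar):
    """Group records by a blocking key, then compare pairs only within
    each block, avoiding the full quadratic pairwise comparison."""
    blocks = defaultdict(list)
    for r in records:
        blocks[block_key(r)].append(r)
    dupes = []
    for group in blocks.values():
        # pairwise comparison is confined to each (hopefully small) block
        for a, b in combinations(group, 2):
            if similar(a, b):
                dupes.append((a, b))
    return dupes
```

With a skewed block-size distribution, the largest block dominates the cost, which is exactly the load-balancing problem the paper's distribution strategy targets.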
The iBench Integration Metadata Generator

Patricia Arocena (University of Toronto); Boris Glavic (Illinois Institute of Technology); Radu Ciucanu (University of Oxford); Renée Miller (University of Toronto)

Abstract: Given the maturity of the data integration field it is surprising that rigorous empirical evaluations of research ideas are so scarce. We identify a major roadblock for empirical work: the lack of comprehensive metadata generators that can be used to create benchmarks for different integration tasks. This makes it difficult to compare integration solutions, understand their generality, and understand their performance. We present iBench, the first metadata generator that can be used to evaluate a wide range of integration tasks (data exchange, mapping creation, mapping composition, schema evolution, among many others). iBench permits control over the size and characteristics of the metadata it generates (schemas, constraints, and mappings). Our evaluation demonstrates that iBench can efficiently generate very large, complex, yet realistic scenarios with different characteristics. We also present an evaluation of three mapping creation systems using iBench and show that the intricate control that iBench provides over metadata scenarios can reveal new and important empirical insights. iBench is an open-source, extensible tool that we are providing to the community. We believe it will raise the bar for empirical evaluation and comparison of data integration systems.
Schema-agnostic vs Schema-based Configurations for Blocking Methods on Homogeneous Data

George Papadakis (University of Athens); George Alexiou (IMIS, “Athena” Research Center); George Papastefanatos (IMIS, “Athena” Research Center); Georgia Koutrika (Hewlett-Packard Labs)

Abstract: Entity Resolution constitutes a core task for data integration that, due to its quadratic complexity, typically scales to large datasets through blocking methods. These can be configured in two ways. The schema-based configuration relies on schema information in order to select signatures of high distinctiveness and low noise, while the schema-agnostic one treats every token from all attribute values as a signature. The latter approach has significant potential, as it requires no fine-tuning by human experts and it applies to heterogeneous data. Yet, there is no systematic study on its relative performance with respect to the schema-based configuration. This work fills this gap by comparing analytically the two configurations in terms of effectiveness, time efficiency and scalability. We apply them to 9 established blocking methods and to 11 benchmarks of structured data. We provide valuable insights into the internal functionality of the blocking methods with the help of a novel taxonomy. Our study reveals that the schema-agnostic configuration offers an unsupervised and robust definition of blocking keys under versatile settings, trading a higher computational cost for a consistently higher recall than the schema-based one. It also enables the use of state-of-the-art blocking methods without schema knowledge.

Research R29: Community Search and Mining

Location: Royal 1

Chair: Tamer Özsu, Univ. of Waterloo

Effective Community Search for Large Attributed Graphs

Yixiang Fang (The University of Hong Kong); Reynold C.K. Cheng (The University of Hong Kong); Siqiang Luo (The University of Hong Kong); Jiafeng Hu (The University of Hong Kong)

Abstract: Given a graph G and a vertex q in G, the community search query returns a subgraph of G that contains vertices related to q. Communities, which are prevalent in attributed graphs such as social networks and knowledge bases, can be used in emerging applications such as product advertisement and setting up of social events. In this paper, we investigate the attributed community query (or ACQ), which returns an attributed community (AC) for an attributed graph. The AC is a subgraph of G, which satisfies both structure cohesiveness (i.e., its vertices are tightly connected) and keyword cohesiveness (i.e., its vertices share common keywords). The AC enables a better understanding of how and why a community is formed (e.g., members of an AC have a common interest in music, because they all have the same keyword “music”). An AC can be “personalized”; for example, an ACQ user may specify that an AC returned should be related to keywords like “sports”. To enable efficient AC search, we develop the CL-tree structure and three algorithms based on it. We evaluate our solutions on four large graphs, namely Flickr, DBLP, Tencent, and DBpedia. Our results show that ACs are more effective and efficient than existing community retrieval approaches. Moreover, an AC contains more precise and personalized information than existing community search and detection methods.
Approximate Closest Community Search in Networks

Xin Huang (University of British Columbia); Laks Lakshmanan (University of British Columbia); Jeffrey Xu Yu (The Chinese University of Hong Kong); Hong Cheng (The Chinese University of Hong Kong)

Abstract: Recently, there has been significant interest in the study of the community search problem in social and information networks: given one or more query nodes, find densely connected communities containing the query nodes. However, most existing studies do not address the “free rider” issue, that is, nodes far away from query nodes and irrelevant to them are included in the detected community. Some state-of-the-art models have attempted to address this issue, but not only are their formulated problems NP-hard, they do not admit any approximations without restrictive assumptions, which may not always hold in practice. In this paper, given an undirected graph G and a set of query nodes Q, we study community search using the k-truss based community model. We formulate our problem of finding a closest truss community (CTC) as finding a connected k-truss subgraph with the largest k that contains Q and has the minimum diameter among such subgraphs. We prove this problem is NP-hard. Furthermore, it is NP-hard to approximate the problem within a factor (2 − ε), for any ε > 0. However, we develop a greedy algorithmic framework, which first finds a connected k-truss subgraph with the largest k that contains Q, and then iteratively removes the nodes furthest from Q. The method achieves a 2-approximation to the optimal solution. To further improve the efficiency, we make use of a compact truss index and develop efficient algorithms for k-truss identification and maintenance as nodes get eliminated. In addition, using bulk deletion optimization and local exploration strategies, we propose two more efficient algorithms. One of them trades some approximation quality for efficiency while the other is a very efficient heuristic. Extensive experiments on 6 real-world networks show the effectiveness and efficiency of our community model and search algorithms.
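As background, a k-truss is a subgraph in which every edge participates in at least k−2 triangles. A minimal peeling sketch of k-truss extraction (an illustration of the definition, not the paper's indexed CTC algorithms) is:

```python
def k_truss(adj, k):
    """Repeatedly delete edges supported by fewer than k-2 triangles.
    adj: dict mapping node -> set of neighbors (undirected graph);
    node labels must be orderable (e.g., ints)."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    changed = True
    while changed:
        changed = False
        for u in list(adj):
            for v in list(adj[u]):
                # |adj[u] & adj[v]| = number of triangles on edge (u, v)
                if u < v and len(adj[u] & adj[v]) < k - 2:
                    adj[u].discard(v)
                    adj[v].discard(u)
                    changed = True
    return {u: vs for u, vs in adj.items() if vs}
```

For example, a 4-clique with one pendant edge: every clique edge lies on two triangles and survives k = 4 peeling, while the pendant edge (zero triangles) is removed.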
Ego-net Community Mining Applied to Friend Suggestion

Alessandro Epasto (Brown University); Silvio Lattanzi (Google); Vahab Mirrokni (Google); Ismail Sebe (Google); Ahmed Taei (Google); Sunita Verma (Google)

Abstract: In this paper, we present a study of the community structure of ego-networks --- the graphs representing the connections among the neighbors of a node --- for several online social networks. Toward this goal, we design a new technique to efficiently build and cluster all the ego-nets of a graph in parallel (note that even just building the ego-nets efficiently is challenging on large networks). Our experimental findings are quite compelling: at a microscopic level it is easy to detect high-quality communities. Leveraging this fact, we then develop new features for friend suggestion based on co-occurrences of two nodes in different ego-nets’ communities. Our new features can be computed efficiently on very large scale graphs by just analyzing the neighborhood of each node. Furthermore, we prove formally on a stylized model, and show by experimental analysis, that this new similarity measure outperforms the classic local features employed for friend suggestions.
Tracking the Conductance of Rapidly Evolving Topic-Subgraphs

Sainyam Galhotra (Xerox Research Centre India); Amitabha Bagchi (IIT Delhi); Srikanta Bedathur (IBM Research – India); Maya Ramanath (IIT Delhi); Vidit Jain (American Express Big Data Labs)

Abstract: Monitoring the formation and evolution of communities in large online social networks such as Twitter is an important problem that has generated considerable interest in both industry and academia. Fundamentally, the problem can be cast as studying evolving subgraphs (each subgraph corresponding to a topical community) on an underlying social graph, with users as nodes and the connections between them as edges. A key metric of interest in this setting is tracking the changes to the conductance of subgraphs induced by edge activations. This metric quantifies how well or poorly connected a subgraph is to the rest of the graph relative to its internal connections. Conductance has been demonstrated to be of great use in many applications, such as identifying bursty topics, tracking the spread of rumors, and so on. However, tracking this simple metric presents a considerable scalability challenge: the underlying social network is large, the number of communities that are active at any moment is large, the rate at which these communities evolve is high, and moreover, we need to track conductance in real-time. We address these challenges in this paper. We propose an in-memory approximation called BloomGraphs to store and update these (possibly overlapping) evolving subgraphs. As the name suggests, we use Bloom filters to represent an approximation of the underlying graph. This representation is compact and computationally efficient to maintain in the presence of updates. This is especially important when we need to simultaneously maintain thousands of evolving subgraphs. BloomGraphs are used in computing and tracking conductance of these subgraphs as edge activations arrive. BloomGraphs have several desirable properties in the context of this application, including a small memory footprint and efficient updateability.
We also demonstrate mathematically that the error incurred in computing conductance is one-sided and that in the case of evolving subgraphs the change in approximate conductance has the same sign as the change in exact conductance in most cases. We validate the effectiveness of BloomGraphs through extensive experimentation on large Twitter graphs and other social networks.
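For reference, the conductance being tracked is the cut between S and the rest of the graph divided by the smaller of the two edge volumes. An exact, non-streaming computation looks like the following sketch (the paper's BloomGraphs approximate this quantity under high-rate updates):

```python
def conductance(adj, S):
    """Conductance of node set S in an undirected graph:
    cut(S, V\\S) / min(vol(S), vol(V\\S)), where vol(X) is the total
    degree of the nodes in X. adj: dict node -> set of neighbors."""
    S = set(S)
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    vol_s = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(adj[u]) for u in adj if u not in S)
    denom = min(vol_s, vol_rest)
    return cut / denom if denom else 0.0
```

For two triangles joined by a single edge, the set holding one triangle has one cut edge against volume 7, giving conductance 1/7.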

Research R30: Scalable Analytics - 2

Location: Maple

Chair: Jonathan Goldstein, Microsoft Research

Streaming Anomaly Detection Using Randomized Matrix Sketching

Hao Huang (General Electric Global Research); Shiva Kasiviswanathan (Samsung Research America)

Abstract: Data is continuously being generated from sources such as machines, network traffic, application logs, etc. Timely and accurate detection of anomalies in massive data streams has important applications, such as preventing machine failures, intrusion detection, and dynamic load balancing. In this paper, we introduce a novel (unsupervised) anomaly detection framework which can be used to detect anomalies in a streaming fashion by making only one pass over the data while utilizing limited storage. We adapt ideas from matrix sketching to maintain, in a streaming model, a set of few orthogonal vectors that form a good approximate basis for all the observed data. Using this constructed orthogonal basis, anomalies in new incoming data are detected based on a simple reconstruction error test. We theoretically prove that our algorithm compares favorably with an offline approach based on expensive global singular value decomposition (SVD) updates. Additionally, we apply ideas from randomized low-rank matrix approximations to further speed up the algorithm. The experimental results show the effectiveness and efficiency of our approach over other popular scalable anomaly detection approaches.
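A minimal one-pass sketch of the general idea, assuming a frequent-directions-style matrix sketch and a simple relative reconstruction-error test (illustrative parameter choices; not the authors' exact algorithm):

```python
import numpy as np

def streaming_anomalies(rows, ell=4, threshold=0.5):
    """One pass over the rows: maintain an ell-row sketch B whose top
    singular vectors approximate a basis for the data seen so far; flag
    a row when its reconstruction error under that basis is large.
    Assumes the data dimension d >= ell."""
    d = len(rows[0])
    assert d >= ell
    B = np.zeros((ell, d))
    flags = []
    for x in rows:
        x = np.asarray(x, dtype=float)
        # current approximate basis from the sketch
        _, s, Vt = np.linalg.svd(B, full_matrices=False)
        basis = Vt[s > 1e-12]
        resid = x - basis.T @ (basis @ x) if len(basis) else x
        err = np.linalg.norm(resid) / (np.linalg.norm(x) + 1e-12)
        flags.append(bool(err > threshold))
        # frequent-directions update: insert the row, then shrink
        B[-1] = x
        _, s, Vt = np.linalg.svd(B, full_matrices=False)
        s2 = np.maximum(s ** 2 - s[ell // 2] ** 2, 0.0)
        B = np.sqrt(s2)[:, None] * Vt
    return flags
```

On a stream of near-collinear rows, the sketch quickly captures the dominant direction, so later inliers have near-zero residual while an orthogonal outlier is flagged.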
Index-Assisted Hierarchical Computations in Main-Memory RDBMS

Robert Brunel (Technische Universität München); Norman May (SAP SE); Alfons Kemper (Technische Universität München)

Abstract: We address the problem of expressing and evaluating computations on hierarchies represented as database tables. Engine support for such computations is very limited today, and so they are usually outsourced into stored procedures or client code. Recently, data model and SQL language extensions were proposed to conveniently represent and work with hierarchies. On that basis, we introduce a concept of structural grouping to relational algebra, provide concise syntax to express a class of useful computations, and discuss algorithms to evaluate them efficiently by exploiting available indexing schemes. This extends the versatility of RDBMS towards a great many use cases dealing with hierarchical data.
DEXTER: Large-Scale Discovery and Extraction of Product Specifications on the Web

Disheng Qiu (Università degli Studi Roma Tre); Luciano Barbosa (IBM Research – Brazil); Xin Luna Dong (Google); Yanyan Shen (National University of Singapore); Divesh Srivastava (AT&T Labs – Research)

Abstract: The web is a rich resource of structured data. There has been an increasing interest in using web structured data for many applications such as data integration, web search and question answering. In this paper, we present DEXTER, a system to find product sites on the web, and detect and extract product specifications from them. Since product specifications exist in multiple product sites, our focused crawler relies on search queries and backlinks to discover product sites. To perform the detection, and to handle the high diversity of specifications in terms of content, size and format, our system uses supervised learning to classify HTML fragments (e.g., tables and lists) present in web pages as specifications or not. To perform large-scale extraction of attribute-value pairs from the HTML fragments identified by the specification detector, DEXTER adopts two lightweight strategies: a domain-independent and unsupervised wrapper method, which relies on the observation that these HTML fragments have very similar structure; and a combination of this strategy with a previous approach, which infers extraction patterns from annotations generated by automatic but noisy annotators. The results show that our crawler strategy to locate product specification pages is effective: (1) it discovered 146M product specification pages from 3,005 sites and 9 different categories; (2) the specification detector obtains high values of F-measure (close to 0.9) over a heterogeneous set of product specifications; and (3) our efficient wrapper methods for attribute-value extraction achieve very high precision (0.92) and recall (0.95) and obtain better results than a state-of-the-art, supervised rule-based wrapper.
Walking in the Cloud: Parallel SimRank at Scale

Zhenguo Li (Huawei); Yixiang Fang (The University of Hong Kong); Qin Liu (The Chinese University of Hong Kong); Jiefeng Cheng (Huawei); Reynold Cheng (University of Hong Kong); John Lui (The Chinese University of Hong Kong)

Abstract: Despite its popularity, SimRank is computationally costly, in both time and space. In particular, its recursive nature poses a great challenge in using modern distributed computing power, and also prevents querying similarities individually. Existing solutions suffer greatly from these practical issues. In this paper, we break this dependency to achieve the maximum efficiency possible. Our method consists of offline and online phases. In the offline phase, a length-n indexing vector is derived by solving a linear system in parallel. At online query time, similarities are computed instantly from the index vector. Throughout, the Monte Carlo method is used to maximally reduce time and space. Our algorithm, called CloudWalker, is highly parallelizable, with only linear time and space. Remarkably, it responds to both single-pair and single-source queries in constant time. CloudWalker is orders of magnitude more efficient and scalable than existing solutions for large-scale problems. Implemented on Spark with 10 machines and tested on the web-scale ClueWeb graph with 1 billion nodes and 43 billion edges, it takes 110 hours for offline indexing, 64 seconds for a single-pair query, and 188 seconds for a single-source query. To the best of our knowledge, our work is the first to report results on ClueWeb, which is 10x larger than the largest graph ever reported for SimRank computation.
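The Monte Carlo view of SimRank that such methods build on can be sketched as follows (a generic estimator with hypothetical parameter names, not CloudWalker's indexed scheme): s(a, b) is the expectation of c^tau, where tau is the first step at which two reverse random walks from a and b meet.

```python
import random

def mc_simrank(in_nbrs, a, b, c=0.6, walks=2000, max_len=10, seed=7):
    """Estimate SimRank s(a, b) as E[c^tau], with tau the first meeting
    step of two independent reverse random walks starting at a and b.
    in_nbrs: dict mapping node -> list of in-neighbors."""
    if a == b:
        return 1.0  # s(a, a) = 1 by definition
    rng = random.Random(seed)
    total = 0.0
    for _ in range(walks):
        u, v = a, b
        for step in range(1, max_len + 1):
            if not in_nbrs.get(u) or not in_nbrs.get(v):
                break  # a walk got stuck at a node with no in-neighbors
            u = rng.choice(in_nbrs[u])
            v = rng.choice(in_nbrs[v])
            if u == v:
                total += c ** step  # walks met after `step` reverse steps
                break
    return total / walks
```

For two nodes sharing a single in-neighbor, the walks always meet in one step, so the estimate is exactly c.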

Thursday Sep 8th, 6:30 pm - 9:30 pm

Entertainment Program and Gala Dinner

Location: Pearl

Friday Sep 9th, 9:00 am - 10:30 am

Workshop W4

Location: Pearl 1

Joint Workshop W5

Location: Pearl 2

Accelerating Data Management Systems (ADMS) and In-memory Data Management & Analytics (IMDM)

ADMS: Rajesh Bordawekar (IBM Research – T.J. Watson); Tirthankar Lahiri (Oracle)
IMDM: Spyros Blanas (The Ohio State University); Justin Levandoski (Microsoft Research); Andy Pavlo (Carnegie Mellon University)

Workshop W6

Location: Royal 1

Data Management and Analytics for Medicine and Healthcare (DMAH)

Fusheng Wang (Stony Brook University); Lixia Yao (University of North Carolina at Charlotte); Gang Luo (University of Utah)

Workshop W7

Location: Royal 2

Social Data Analytics and Management (SoDAM)

Mahashweta Das (Hewlett Packard Labs); Gautam Das (University of Texas at Arlington)

Workshop W8

Location: Maple

Big Data Open-Source Systems (BOSS) / Polyglot

Tilmann Rabl (TU Berlin); Sebastian Schelter (Amazon)

Friday Sep 9th, 11:00 am - 12:30 pm

Friday Sep 9th, 2:00 pm - 3:30 pm

Friday Sep 9th, 4:00 pm - 5:30 pm