Toward Scalable Semantic Big Data




Julian Dolby

IBM Thomas J. Watson Research Center





SPLASH-I, Vancouver, October 2017

Collaborative Work

  • Bishwaranjan Bhattacharjee
  • Mihaela Bornea
  • James Cimino
  • Patrick Dantressangle
  • Achille Fokoue
  • Aditya Kalyanpur
  • Anastasios Kementsietsidis
  • Aaron Kershenbaum
  • Li Ma
  • Chintan Patel
  • Edith Schonberg
  • Kavitha Srinivas
  • Octavian Udrea

Outline

  • Running Example
  • Scalable expressive reasoning
  • Storing RDF data in a database
  • Integration of Web data

Running Example

Objects in Our Universe

Example OWL Universe

  • Individuals
    $\begin{array}{l}Museum(Athens), Museum(Heraklion), \\ Museum(MOMA), Minoan(LaParisienne),\\ Mycenaean(DeathMask),VanGogh(StarryNight)\end{array}$
  • Roles
    $\begin{array}{l} has(Athens,DeathMask),has(MOMA,StarryNight)\\ has(Heraklion,LaParisienne) \end{array}$
  • Axioms (TGDs)
    $\begin{array}{l} Minoan \sqsubseteq \exists{creationSite.Crete}\\ Mycenaean \sqsubseteq \exists{creationSite.Mycenae} \end{array}$

Example ABox $A$

The Summary ABox


  • Map ABox $A$ to $A'$ for scalability using $f$
    $\begin{array}{l} C(a) \in A \implies C(f(a)) \in A'\\ R(a,b) \in A \implies R(f(a), f(b)) \in A' \end{array}$
  • We choose $f$ to map each individual to its set of concepts
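A minimal Python sketch of the summarization step, assuming the example ABox is held as plain tuples. The names `summarize`, `concepts`, and `roles` are illustrative only and are not taken from the SHER implementation.

```python
# Concept assertions C(a) and role assertions R(a, b) from the example ABox.
concepts = [
    ("Museum", "Athens"), ("Museum", "Heraklion"), ("Museum", "MOMA"),
    ("Minoan", "LaParisienne"), ("Mycenaean", "DeathMask"),
    ("VanGogh", "StarryNight"),
]
roles = [
    ("has", "Athens", "DeathMask"),
    ("has", "MOMA", "StarryNight"),
    ("has", "Heraklion", "LaParisienne"),
]

def summarize(concepts, roles):
    """Build the summary ABox A' using f(a) = the set of concepts asserted for a."""
    concept_sets = {}
    for c, a in concepts:
        concept_sets.setdefault(a, set()).add(c)
    # Individuals that only occur in role assertions get the empty concept set.
    for _, a, b in roles:
        concept_sets.setdefault(a, set())
        concept_sets.setdefault(b, set())
    f = {a: frozenset(cs) for a, cs in concept_sets.items()}
    # C(a) in A implies C(f(a)) in A'; R(a, b) in A implies R(f(a), f(b)) in A'.
    summary_concepts = {(c, f[a]) for c, a in concepts}
    summary_roles = {(r, f[a], f[b]) for r, a, b in roles}
    return f, summary_concepts, summary_roles

f, summary_concepts, summary_roles = summarize(concepts, roles)
```

In this sketch the three museums collapse into one summary node, so six individuals become four summary nodes while all three `has` edges survive.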



J. Dolby, A. Fokoue, A. Kalyanpur, A. Kershenbaum, E. Schonberg, K. Srinivas, L. Ma
Scalable Semantic Retrieval through Summarization and Refinement.
AAAI 2007

Example Summary ABox $A'$

Pointer Analysis Analogy

class Cell {
 private Object datum;
 Cell(Object d) { datum = d; }
 
 static void catInTheHat() {
  Cell x = new Cell("Thing 1");
  Cell y = new Cell("Thing 2");
 }
}

Heap

Type-Based Abstraction

Example Query

  • "Museums that have works from Crete"
  • Answer is Heraklion, since it has the Minoan LaParisienne
  • DL
    • $Museum \sqcap \exists{has.\exists{creationSite.Crete}}$
    • Negate query at each node, find contradictions
    • Entities and edges involved in a contradiction form the justification
  • Needs DL reasoning: creationSite edge is implicit

Initial Query Answer

Refinement

  • $J$ is the justification, i.e. the conflict
  • Partition summary nodes by edges in justification
    $key(a) \equiv \left\{ R(s, t) \left| \begin{array}{l} f(a) = s \wedge\\ R(s,t) \in J \wedge\\ \exists b \; R(a,b) \in A \wedge f(b) = t \end{array} \right. \right\}$
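The partitioning by $key(a)$ can be sketched in Python as follows. This is an illustration under simplifying assumptions: summary nodes are named by plain strings, and `refine`, `f`, `roles`, and `J` are hypothetical names rather than the authors' code.

```python
# f maps each individual to its summary node (named by a string for brevity).
f = {
    "Athens": "Museum", "Heraklion": "Museum", "MOMA": "Museum",
    "LaParisienne": "Minoan", "DeathMask": "Mycenaean",
    "StarryNight": "VanGogh",
}
roles = [
    ("has", "Athens", "DeathMask"),
    ("has", "MOMA", "StarryNight"),
    ("has", "Heraklion", "LaParisienne"),
]
# Justification J: the summary edges involved in the conflict.
J = {("has", "Museum", "Minoan")}

def refine(f, roles, justification):
    """Split each summary node by the justification edges its members realize."""
    keys = {a: set() for a in f}
    for r, a, b in roles:
        edge = (r, f[a], f[b])
        if edge in justification:
            keys[a].add(edge)  # a has a concrete edge behind this summary edge
    partition = {}
    for a in f:
        partition.setdefault((f[a], frozenset(keys[a])), set()).add(a)
    return partition

partition = refine(f, roles, J)
```

Here the Museum node splits into Heraklion, which actually has a Minoan work, and the remaining museums, whose key is empty.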

Query Refinement

Refined Query Answer

Pointer Analysis Analogy

class Cell {
 private Object datum;
 Cell(Object d) { datum = d; }
 
 static void catInTheHat() {
  Cell x = new Cell("Thing 1");
  Cell y = new Cell("Thing 2");
 }
}
  • Pointer analysis query: “what points to ‘Thing 1’?”
    • refinement in the style of Plevyak or Sridharan

Reasoning Results

$${\scriptsize \begin{array}{|l|l|l|l|l|} \hline Reasoner & Dataset & Avg. Time & St.Dev & Range \\ \hline KAON2 & UOBM1 & 20.7 & 1.2 & 18-37\\ \hline KAON2 & UOBM10 & 447.6 & 23.3 & 414.8-530\\ \hline SHER & UOBM1 & 4.2 & 3.8 & 2.4-23.8\\ \hline SHER & UOBM10 & 15.4 & 25.6 & 6.4-191.1 \\ \hline SHER & UOBM30 & 34.7 & 63.5 & 11.6-391.1 \\ \hline \end{array}}$$

RDF in a Relational Store

  • Numerous large RDF data sources
    • DBpedia (>300M triples)
    • Web data (>3B triples from BTC)
  • Exploit scalable RDBMS technology
    • query optimization
    • transaction support
    • concurrency
  • Quetzal

M. Bornea et al., Building an efficient RDF store over a relational database. SIGMOD 2013

Museums with Locations

Quetzal Schema

  • Entity-oriented schema
    • properties for subject on single row
    • rows for predicates and values
    • secondary table for multi-valued predicates
  • Fit entities onto limited database rows
    • generally more predicates than available columns
    • graph coloring to maximize density
    • spill onto multiple rows only when necessary
  • Analogous tables for reverse direction
  • Reduces joins for "star" queries
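A rough Python sketch of the column-assignment idea: predicates that occur together on some subject interfere and must get different columns, while non-interfering predicates may share one. This uses a generic greedy coloring heuristic for illustration; it is not Quetzal's actual algorithm, and `color_predicates` is a hypothetical name.

```python
from itertools import combinations

triples = [
    ("Heraklion", "type", "Museum"),
    ("Heraklion", "has", "LaParisienne"),
    ("Heraklion", "at", "l1"),
    ("l1", "type", "Location"),
    ("l1", "lat", "35.3"),
    ("l1", "long", "25.1"),
    ("LaParisienne", "type", "Minoan"),
]

def color_predicates(triples):
    """Assign each predicate a column index so co-occurring predicates differ."""
    by_subject = {}
    for s, p, _ in triples:
        by_subject.setdefault(s, set()).add(p)
    # Interference graph: an edge joins two predicates used by the same subject.
    interferes = {}
    for preds in by_subject.values():
        for p in preds:
            interferes.setdefault(p, set())
        for p, q in combinations(sorted(preds), 2):
            interferes[p].add(q)
            interferes[q].add(p)
    column = {}
    # Greedy coloring, highest-degree predicates first.
    for p in sorted(interferes, key=lambda p: (-len(interferes[p]), p)):
        used = {column[q] for q in interferes[p] if q in column}
        column[p] = next(c for c in range(len(interferes)) if c not in used)
    return column

columns = color_predicates(triples)
```

On this data five predicates fit into three columns, because the museum predicates and the location predicates never appear on the same subject.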

Example Entity-Oriented Table

${\tiny \begin{array}{|r|l|l|l|l|l|l|}\hline {\rm{subject}} & {\rm{p1}} & {\rm{v1}} & {\rm{p2}} & {\rm{v2}} & {\rm{p3}} & {\rm{v3}}\\ \hline LaParisienne & type & Minoan & & & &\\ DeathMask & type & Mycenaean & & & &\\ StarryNight & type & VanGogh & & & &\\ Heraklion & type & Museum & has & La\dots & at & l1\\ Athens & type & Museum & has & De\dots & at & l2\\ MOMA & type & Museum & has & St\dots & at & l3\\ l1 & type & Location & lat & 35.3 & long & 25.1\\ l2 & type & Location & lat & 38.0 & long & 23.7\\ l3 & type & Location & lat & 40.7 & long & -74.0\\ \hline \end{array}}$

Register Allocation Analogy


  • code fragment
var x = 7
var y = x + 7     
var z = 3
y = y + z
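${\tiny \begin{array}{|l|l|l|}\hline {\rm{code}} & {\rm{reg\ 1}} & {\rm{reg\ 2}}\\ \hline x = 7 & x & \\ y = x + 7 & x & y\\ z = 3 & z & y\\ y = y + z & z & y\\ \hline \end{array}}$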

Quetzal Results on LUBM

Data Everywhere

  • Numerous structured data sources available
    • medical (Drugbank, Uniprot); general (DBpedia)
    • much data in RDF, queried with SPARQL
  • But data increasingly diverse
    • RDF, XML, JSON, CSV formats
    • accessible as dumps, query endpoints and APIs
  • Powerful if integrated and queried effectively
    • reuse and extend existing declarative SPARQL


J. Dolby et al., Extending SPARQL for Data Analytic Tasks. ISWC 2016

Modularize SPARQL with Functions

  • "museums with some type of exhibit"
  • function museumsWith(?type ->
                         ?museum ?lat ?long) {
      ?museum has ?art .
      ?art type ?type .
      ?museum at ?loc .
      ?loc geo:lat ?lat .
      ?loc geo:long ?long .
    }
    
  • Functions called with bind
  • bind ?museum ?lat ?long
      as museumsWith(Minoan) .
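To make the function's meaning concrete, here is a small Python sketch that evaluates the same graph pattern over an in-memory triple list. The data comes from the running example; `museums_with` and `objects` are illustrative helpers, not part of the SPARQL extension.

```python
TRIPLES = [
    ("Heraklion", "has", "LaParisienne"), ("LaParisienne", "type", "Minoan"),
    ("Heraklion", "at", "l1"), ("l1", "geo:lat", 35.3), ("l1", "geo:long", 25.1),
    ("MOMA", "has", "StarryNight"), ("StarryNight", "type", "VanGogh"),
    ("MOMA", "at", "l3"), ("l3", "geo:lat", 40.7), ("l3", "geo:long", -74.0),
]

def objects(subject, predicate):
    """All values o such that (subject, predicate, o) is a triple."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]

def museums_with(type_):
    """Return (museum, lat, long) bindings, mirroring museumsWith(?type -> ...)."""
    rows = []
    for museum, predicate, art in TRIPLES:
        # ?museum has ?art . ?art type ?type .
        if predicate != "has" or type_ not in objects(art, "type"):
            continue
        # ?museum at ?loc . ?loc geo:lat ?lat . ?loc geo:long ?long .
        for loc in objects(museum, "at"):
            for lat in objects(loc, "geo:lat"):
                for lng in objects(loc, "geo:long"):
                    rows.append((museum, lat, lng))
    return rows

minoan = museums_with("Minoan")
```

The input parameter plays the role of `?type`, and each returned tuple corresponds to one binding of `?museum ?lat ?long`.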
    

Web Service http://ip-api.com/

  • Web service returns IP-based information
  • <query>
     <status><![CDATA[success]]></status>
     <country><![CDATA[United States]]></country>
     <countryCode><![CDATA[US]]></countryCode>
     <region><![CDATA[IL]]></region>
     <regionName><![CDATA[Illinois]]></regionName>
     <city><![CDATA[Chicago]]></city>
     <zip><![CDATA[60605]]></zip>
     <lat><![CDATA[41.8632]]></lat>
     <lon><![CDATA[-87.6198]]></lon>
     <timezone><![CDATA[America/Chicago]]></timezone>
     <isp><![CDATA[AT&T Services]]></isp>
     <org><![CDATA[Hilton Hotels Corporation]]></org>
     <as><![CDATA[AS7018 AT&T Services, Inc.]]></as>
     <query><![CDATA[12.218.232.8]]></query>
    </query>
    

Web Service Example

  • Web service to return latitude and longitude
  • function geo:getPosition( -> ?lat ?long)
    service get http://ip-api.com/xml [] -> xml
    "/query": "./lat" "./lon"
    
  • getPosition used with bind
  • select ?lat ?long where {
      bind ?lat ?long as geo:getPosition()
    }
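What the service declaration does can be approximated in Python with the standard library's ElementTree: select the `query` root as the row, then read the latitude and longitude children. This sketch parses an abridged copy of the sample response instead of calling the service, and `get_position` is a hypothetical name. The element names `lat` and `lon` follow the response shown earlier.

```python
import xml.etree.ElementTree as ET

# Abridged sample of the XML returned by the service.
SAMPLE = """<query>
 <status><![CDATA[success]]></status>
 <city><![CDATA[Chicago]]></city>
 <lat><![CDATA[41.8632]]></lat>
 <lon><![CDATA[-87.6198]]></lon>
</query>"""

def get_position(xml_text):
    """Return a list of (lat, long) rows extracted from the response."""
    root = ET.fromstring(xml_text)
    if root.tag != "query":  # the row path "/query" must match the root
        return []
    lat = root.findtext("./lat")
    lon = root.findtext("./lon")
    if lat is None or lon is None:
        return []
    return [(float(lat), float(lon))]

position = get_position(SAMPLE)
```

Returning a list of rows mirrors how a function call yields zero or more bindings for `?lat ?long`.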
    

Combining Disparate Data

  • "Nearby Minoan exhibits"
  • RDF museums, http://ip-api.com locations
  • select ?museum where {
      bind ?museum ?lat1 ?long1
        as museumsWith(Minoan) .
      bind ?lat2 ?long2 as geo:getPosition() .
      FILTER(abs(?lat2-?lat1) < .1 &&
             abs(?long2-?long1) < .1)
    }
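The join and filter can be sketched in Python as below. The sketch compares absolute differences, since a signed difference would also admit museums arbitrarily far to the north or east. The function name `nearby`, the tolerance parameter, and the test position near Heraklion are assumptions made for illustration.

```python
def nearby(bindings, here, tol=0.1):
    """Keep museums whose latitude and longitude are both within tol degrees."""
    lat2, long2 = here
    return [
        museum
        for museum, lat1, long1 in bindings
        if abs(lat2 - lat1) < tol and abs(long2 - long1) < tol
    ]

# Bindings as museumsWith(Minoan) would produce them in the running example.
minoan_museums = [("Heraklion", 35.3, 25.1)]

# Hypothetical caller position close to Heraklion.
close = nearby(minoan_museums, (35.34, 25.13))
# Position from the sample web service response (Chicago).
far = nearby(minoan_museums, (41.8632, -87.6198))
```

A box test in degrees is only a coarse notion of distance; a real deployment would more likely use a great-circle distance.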
    

The Future

  • Further extensions to SPARQL
    • further modularization, e.g. module systems
    • potential for first-class functions
  • Semantic big data means reasoning
    • put summarization, refinement into Quetzal
    • exploit entity-oriented schema for summarization
  • Would love others to get involved