Toward Scalable Semantic Big Data




Julian Dolby

IBM Thomas J. Watson Research Center





Semantic Big Data, SIGMOD, Chicago, May 2017

Collaborative Work

  • Bishwaranjan Bhattacharjee
  • Mihaela Bornea
  • James Cimino
  • Patrick Dantressangle
  • Achille Fokoue
  • Aditya Kalyanpur
  • Anastasios Kementsietsidis
  • Aaron Kershenbaum
  • Li Ma
  • Chintan Patel
  • Edith Schonberg
  • Kavitha Srinivas
  • Octavian Udrea

Outline

  • Running Example
  • Scalable expressive reasoning
  • Clinical trials use-case
  • Storing RDF data in a database
  • Integration of Web data

Running Example

Objects in Our Universe

Example OWL Universe

  • Individuals
    $\begin{array}{l}Museum(Athens), Museum(Heraklion), \\ Museum(MOMA), Minoan(LaParisienne),\\ Mycenaean(DeathMask),VanGogh(StarryNight)\end{array}$
  • Roles
    $\begin{array}{l} has(Athens,DeathMask),has(MOMA,StarryNight)\\ has(Heraklion,LaParisienne) \end{array}$
  • Axioms (TGDs)
    $\begin{array}{l} Minoan \sqsubseteq \exists{creationSite.Crete}\\ Mycenaean \sqsubseteq \exists{creationSite.Mycenae} \end{array}$

Example ABox $A$

The Summary ABox


  • Map ABox $A$ to $A'$ for scalability using $f$
    $\begin{array}{l} C(a) \in A \implies C(f(a)) \in A'\\ R(a,b) \in A \implies R(f(a), f(b)) \in A' \end{array}$
  • We choose $f$ to map each individual to its set of concepts



J. Dolby, A. Fokoue, A. Kalyanpur, A. Kershenbaum, E. Schonberg, K. Srinivas, L. Ma
Scalable Semantic Retrieval through Summarization and Refinement.
AAAI 2007
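
The mapping above can be sketched in a few lines of Python. This is a minimal illustration on the running example, not the SHER implementation; the tuple encoding of assertions and the names `concept_sets` and `summarize` are assumptions made for this sketch.

```python
from collections import defaultdict

# Concept assertions C(a) and role assertions R(a, b) from the running example.
concepts = {
    ("Museum", "Athens"), ("Museum", "Heraklion"), ("Museum", "MOMA"),
    ("Minoan", "LaParisienne"), ("Mycenaean", "DeathMask"),
    ("VanGogh", "StarryNight"),
}
roles = {
    ("has", "Athens", "DeathMask"),
    ("has", "MOMA", "StarryNight"),
    ("has", "Heraklion", "LaParisienne"),
}

def concept_sets(concepts):
    """Return f: individual -> frozenset of its concepts (its summary node)."""
    by_individual = defaultdict(set)
    for concept, individual in concepts:
        by_individual[individual].add(concept)
    return {a: frozenset(cs) for a, cs in by_individual.items()}

def summarize(concepts, roles):
    """Build A' by applying f to every assertion of A."""
    f = concept_sets(concepts)
    node = lambda a: f.get(a, frozenset())  # individuals with no concept share one node
    summary_concepts = {(c, node(a)) for c, a in concepts}
    summary_roles = {(r, node(a), node(b)) for r, a, b in roles}
    return f, summary_concepts, summary_roles

f, summary_concepts, summary_roles = summarize(concepts, roles)
```

On this data the three museums collapse into one summary node, so six individuals become four nodes while the three `has` edges remain distinct because their targets differ.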

Example Summary ABox $A'$

Example Query

  • "Museums that have works from Crete"
  • Answer is Heraklion, since it has the Minoan LaParisienne
  • DL
    • $Museum \sqcap \exists{has.\exists{creationSite.Crete}}$
    • Negate query at each node, find contradictions
    • Entities and edges in a contradiction are called its justification
  • Needs DL reasoning: creationSite edge is implicit
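
As a worked illustration of why the creationSite edge is implicit, the inference for Heraklion can be spelled out step by step using only the assertions and axioms of the running example:

$$\begin{array}{l}
Minoan(LaParisienne),\; Minoan \sqsubseteq \exists{creationSite.Crete} \implies (\exists{creationSite.Crete})(LaParisienne)\\
has(Heraklion, LaParisienne) \implies (\exists{has.\exists{creationSite.Crete}})(Heraklion)\\
Museum(Heraklion) \implies Heraklion \text{ satisfies both conjuncts of the query}
\end{array}$$

No creationSite assertion appears in the ABox; the first step exists only because of the axiom.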

Initial Query Answer

Refinement

  • $J$ is the justification, i.e. the conflict
  • Partition summary nodes by edges in justification
    $key(a) \equiv \left\{ R(s, t) \left| \begin{array}{l} f(a) = s \wedge\\ R(s,t) \in J \wedge\\ \exists b \; R(a,b) \in A \wedge f(b) = t \end{array} \right. \right\}$
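
The partitioning step can be sketched in Python as follows. The justification used here, the single summary edge has({Museum}, {Minoan}), is an assumption chosen for illustration, as are the tuple encoding and the function names `key` and `refine`.

```python
from collections import defaultdict

MUSEUM, MINOAN = frozenset({"Museum"}), frozenset({"Minoan"})
MYCENAEAN, VANGOGH = frozenset({"Mycenaean"}), frozenset({"VanGogh"})

# f maps each individual to its summary node (its concept set).
f = {"Athens": MUSEUM, "Heraklion": MUSEUM, "MOMA": MUSEUM,
     "LaParisienne": MINOAN, "DeathMask": MYCENAEAN, "StarryNight": VANGOGH}

# Role assertions R(a, b) of the original ABox A.
roles = {("has", "Athens", "DeathMask"), ("has", "MOMA", "StarryNight"),
         ("has", "Heraklion", "LaParisienne")}

# Assumed justification J: one summary edge R(s, t).
J = {("has", MUSEUM, MINOAN)}

def key(a):
    """Edges R(s, t) of J with f(a) = s that a really has in A, via some b with f(b) = t."""
    s = f[a]
    return frozenset(
        (r, s_j, t) for (r, s_j, t) in J
        if s_j == s
        and any(r2 == r and a2 == a and f[b] == t for (r2, a2, b) in roles)
    )

def refine(node):
    """Split the individuals mapped to `node` into groups that share a key."""
    parts = defaultdict(set)
    for a, s in f.items():
        if s == node:
            parts[key(a)].add(a)
    return list(parts.values())

parts = refine(MUSEUM)
```

Under these assumptions the {Museum} node splits into {Heraklion}, which has the edge in the justification, and {Athens, MOMA}, which do not.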

Query Refinement

Refined Query Answer

Reasoning Results

$${\scriptsize \begin{array}{|l|l|l|l|l|} \hline Reasoner & Dataset & Avg. Time & St.Dev & Range \\ \hline KAON2 & UOBM1 & 20.7 & 1.2 & 18-37\\ \hline KAON2 & UOBM10 & 447.6 & 23.3 & 414.8-530\\ \hline SHER & UOBM1 & 4.2 & 3.8 & 2.4-23.8\\ \hline SHER & UOBM10 & 15.4 & 25.6 & 6.4-191.1 \\ \hline SHER & UOBM30 & 34.7 & 63.5 & 11.6-391.1 \\ \hline \end{array}}$$

Clinical Trials Use-Case

  • Clinical trials critical for drug development
    • Show effectiveness and safety of new drugs
    • Finding patients often a manual process
    • Result can be low participation rates
  • Reasoning should be able to help
    • Trial criteria online, in semi-structured form
    • Medical knowledge formalized, e.g. SNOMED-CT
    • Criteria as queries against medical knowledge

C. Patel et al., Matching Patient Records to Clinical Trials Using Ontologies. ISWC/ASWC 2007

Challenges

  • Knowledge engineering
    • Must manually connect hospital with SNOMED
    • Hospital format often a taxonomy
  • Scalability
    • High expressivity required e.g. negation
    • Large amounts of patient data (59M assertions)
  • Noisy, incomplete data
    • E.g. contradictory results from tests
    • Summarization efficiently handles clashes
    • Cleanse data before reasoning

Clinical Trial 00419068

  • "Patient on corticosteroid or cytotoxic agent"
  • DL query for potential trial member:
    $\begin{array}{l} Patient00419068 \sqsubseteq \exists{associatedObservation}.\\ \; \exists{roleGroup}.\\ \;\; \exists{administeredSubstance}.\\ \;\;\; \exists{roleGroup}.\\ \;\;\;\; \exists{hasActiveIngredient}.\\ \;\;\;\;\; \left(corticosteroid \sqcup cytotoxicAgent\right) \end{array}$
  • roleGroup expresses n-ary relations

Clinical Trials Results

$${\scriptsize \begin{array}{|l|r|r|l| } \hline Query & Matches & Time (m) & Weakened\\ \hline NCT00084266 & 1018 & 68.9 & yes\\ \hline NCT00288808 & 3127 & 63.8 & no \\ \hline NCT00393341 & 74 & 26.4 & yes \\ \hline NCT00419978 & 164 & 31.8 & yes\\ \hline NCT00304382 & 107 & 56.4 & yes \\ \hline NCT00304889 & 2 & 61.4 & no \\ \hline NCT00001162 & 1357 & 370.8 & no \\ \hline NCT00298870 & 5555 & 145.5 & no \\ \hline NCT00419068 & 4794 & 78.8 & no \\ \hline \end{array}}$$

RDF in a Relational Store

  • Numerous large RDF data sources
    • DBpedia (>300M triples)
    • Web data (>3B triples from BTC)
  • Exploit scalable RDBMS technology
    • query optimization
    • transaction support
    • concurrency
  • Quetzal

M. Bornea et al., Building an efficient RDF store over a relational database. SIGMOD 2013

RDF Challenges for DBs

  • Dynamic schema
    • Set of properties depends on dataset
    • RDBMS require fixed schema
    • Quetzal tailors schema for each RDF dataset
  • SPARQL queries
    • Declarative graph query language
    • Quetzal translates SPARQL to SQL
    • Retain benefits from DB technology

Museums with Locations

Quetzal Schema

  • Entity-oriented schema
    • properties for subject on single row
    • paired columns for predicates and values
    • secondary table for multi-valued predicates
  • Fit entities onto limited database rows
    • generally more predicates than available columns
    • graph coloring to maximize density
    • spill onto multiple rows only when necessary
  • Analogous tables for reverse direction
  • Reduces joins for "star" queries
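
One way to picture the graph-coloring step is a greedy coloring of a predicate interference graph, sketched below. This is an assumption-laden illustration rather than Quetzal's actual algorithm: it treats two predicates as interfering when they occur on the same subject, and the names `interference_graph` and `color_predicates` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

triples = [
    ("LaParisienne", "type", "Minoan"),
    ("Heraklion", "type", "Museum"),
    ("Heraklion", "has", "LaParisienne"),
    ("Heraklion", "at", "l1"),
    ("l1", "type", "Location"),
    ("l1", "lat", "35.3"),
    ("l1", "long", "25.1"),
]

def interference_graph(triples):
    """Predicates that co-occur on one subject must not share a column."""
    preds_by_subject = defaultdict(set)
    for s, p, _ in triples:
        preds_by_subject[s].add(p)
    graph = defaultdict(set)
    for preds in preds_by_subject.values():
        for p in preds:
            graph[p]  # touching the key registers predicates with no neighbours
        for p, q in combinations(sorted(preds), 2):
            graph[p].add(q)
            graph[q].add(p)
    return graph

def color_predicates(graph):
    """Greedy coloring, most-constrained predicates first; color = column index."""
    column = {}
    for p in sorted(graph, key=lambda p: (-len(graph[p]), p)):
        used = {column[q] for q in graph[p] if q in column}
        column[p] = next(c for c in range(len(graph)) if c not in used)
    return column

graph = interference_graph(triples)
column = color_predicates(graph)
```

Here five predicates fit into three column pairs, because `has`/`at` never appear on the same subject as `lat`/`long` and can therefore share columns with them.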

Example Entity-Oriented Table

${\tiny \begin{array}{|r|l|l|l|l|l|l|}\hline {\rm{subject}} & {\rm{p1}} & {\rm{v1}} & {\rm{p2}} & {\rm{v2}} & {\rm{p3}} & {\rm{v3}}\\ \hline LaParisienne & type & Minoan & & & &\\ DeathMask & type & Mycenaean & & & &\\ StarryNight & type & VanGogh & & & &\\ Heraklion & type & Museum & has & La\dots & at & l1\\ Athens & type & Museum & has & De\dots & at & l2\\ MOMA & type & Museum & has & St\dots & at & l3\\ l1 & type & Location & lat & 35.3 & long & 25.1\\ l2 & type & Location & lat & 38.0 & long & 23.7\\ l3 & type & Location & lat & 40.7 & long & -74.0\\ \hline \end{array}}$
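
To make the join savings concrete, the table above can be loaded into an in-memory SQLite database and queried from Python. The table name `entity` and the hand-written SQL are assumptions for this sketch; they are not the SQL that Quetzal generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entity (subject TEXT PRIMARY KEY, "
    "p1 TEXT, v1 TEXT, p2 TEXT, v2 TEXT, p3 TEXT, v3 TEXT)"
)
rows = [
    ("LaParisienne", "type", "Minoan", None, None, None, None),
    ("DeathMask", "type", "Mycenaean", None, None, None, None),
    ("StarryNight", "type", "VanGogh", None, None, None, None),
    ("Heraklion", "type", "Museum", "has", "LaParisienne", "at", "l1"),
    ("Athens", "type", "Museum", "has", "DeathMask", "at", "l2"),
    ("MOMA", "type", "Museum", "has", "StarryNight", "at", "l3"),
    ("l1", "type", "Location", "lat", "35.3", "long", "25.1"),
    ("l2", "type", "Location", "lat", "38.0", "long", "23.7"),
    ("l3", "type", "Location", "lat", "40.7", "long", "-74.0"),
]
conn.executemany("INSERT INTO entity VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Star query on one subject: type, has and at all come from a single row, no join.
star = conn.execute(
    "SELECT subject, v2 AS art, v3 AS loc FROM entity "
    "WHERE p1 = 'type' AND v1 = 'Museum' AND p2 = 'has' AND p3 = 'at' "
    "ORDER BY subject"
).fetchall()

# Following an edge to another entity still needs one self-join per hop.
minoan_museums = conn.execute(
    "SELECT m.subject FROM entity AS m "
    "JOIN entity AS a ON a.subject = m.v2 "
    "WHERE m.p2 = 'has' AND a.p1 = 'type' AND a.v1 = 'Minoan'"
).fetchall()
```

A triple-per-row layout would need two self-joins for the first query; here the three predicates of each museum are read from one row.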

Quetzal Results on LUBM

Data Everywhere

  • Numerous structured data sources available
    • medical (DrugBank, UniProt); general (DBpedia)
    • much data in RDF, queried with SPARQL
  • But data increasingly diverse
    • RDF, XML, JSON, CSV formats
    • accessible as dumps, query endpoints and APIs
  • Powerful if integrated and queried effectively
    • reuse and extend existing declarative SPARQL


J. Dolby et al., Extending SPARQL for Data Analytic Tasks. ISWC 2016

Modularize SPARQL with Functions

  • "museums with some type of exhibit"
  • function museumsWith(?type ->
                         ?museum ?lat ?long) {
      ?museum has ?art .
      ?art type ?type .
      ?museum at ?loc .
      ?loc geo:lat ?lat .
      ?loc geo:long ?long .
    }
    
  • Functions called with bind
  • bind ?museum ?lat ?long
      as museumsWith(Minoan) .
    

Web Service http://ip-api.com/

  • Web service returns IP-based information
  • <query>
     <status><![CDATA[success]]></status>
     <country><![CDATA[United States]]></country>
     <countryCode><![CDATA[US]]></countryCode>
     <region><![CDATA[IL]]></region>
     <regionName><![CDATA[Illinois]]></regionName>
     <city><![CDATA[Chicago]]></city>
     <zip><![CDATA[60605]]></zip>
     <lat><![CDATA[41.8632]]></lat>
     <lon><![CDATA[-87.6198]]></lon>
     <timezone><![CDATA[America/Chicago]]></timezone>
     <isp><![CDATA[AT&T Services]]></isp>
     <org><![CDATA[Hilton Hotels Corporation]]></org>
     <as><![CDATA[AS7018 AT&T Services, Inc.]]></as>
     <query><![CDATA[12.218.232.8]]></query>
    </query>
    

Web Service Example

  • Web service to return latitude and longitude
  • function geo:getPosition( -> ?lat ?long)
    service get http://ip-api.com/xml [] -> xml
    "/query": "./lat" "./lon"
    
  • getPosition used with bind
  • select ?lat ?long where {
      bind ?lat ?long as geo:getPosition()
    }
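
For comparison, the same extraction can be sketched in plain Python with the standard library. To stay self-contained the sketch parses an abbreviated copy of the sample response instead of calling the service; the name `get_position` is an assumption, and the element names `lat` and `lon` are taken from the sample response.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample response; no network access is needed.
SAMPLE = """<query>
 <status><![CDATA[success]]></status>
 <city><![CDATA[Chicago]]></city>
 <lat><![CDATA[41.8632]]></lat>
 <lon><![CDATA[-87.6198]]></lon>
</query>"""

def get_position(xml_text):
    """Return (latitude, longitude) read from the <query> root element."""
    root = ET.fromstring(xml_text)  # root is the <query> element
    return float(root.findtext("./lat")), float(root.findtext("./lon"))

lat, long = get_position(SAMPLE)
```

The two relative paths play the role of the per-row expressions in the function declaration, with the root element standing in for the "/query" row selector.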
    

Combining Disparate Data

  • "Nearby Minoan exhibits"
  • RDF museums, http://ip-api.com locations
  • select ?museum where {
      bind ?museum ?lat1 ?long1
        as museumsWith(Minoan) .
      bind ?lat2 ?long2 as geo:getPosition()
      FILTER(abs(?lat2-?lat1) < .1 &&
             abs(?long2-?long1) < .1)
    }
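
The combined query can be mimicked in Python over a small in-memory triple list. This is a sketch under stated assumptions: the triples reuse the museum data from the running example, the current position is passed in as a parameter instead of being fetched from the web service, and "nearby" is read as an absolute difference below 0.1 degrees in both coordinates.

```python
triples = [
    ("Heraklion", "has", "LaParisienne"), ("LaParisienne", "type", "Minoan"),
    ("Heraklion", "at", "l1"), ("l1", "lat", 35.3), ("l1", "long", 25.1),
    ("Athens", "has", "DeathMask"), ("DeathMask", "type", "Mycenaean"),
    ("Athens", "at", "l2"), ("l2", "lat", 38.0), ("l2", "long", 23.7),
    ("MOMA", "has", "StarryNight"), ("StarryNight", "type", "VanGogh"),
    ("MOMA", "at", "l3"), ("l3", "lat", 40.7), ("l3", "long", -74.0),
]

def objects(subject, predicate):
    """All o such that (subject, predicate, o) is a triple."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def museums_with(art_type):
    """Analogue of museumsWith: (museum, lat, long) for museums holding that type."""
    results = []
    for museum, p, art in triples:
        if p != "has" or art_type not in objects(art, "type"):
            continue
        for loc in objects(museum, "at"):
            for lat in objects(loc, "lat"):
                for lon in objects(loc, "long"):
                    results.append((museum, lat, lon))
    return results

def nearby(art_type, position, radius=0.1):
    """Museums with art_type within `radius` degrees of position in both coordinates."""
    lat2, long2 = position
    return [m for m, lat1, long1 in museums_with(art_type)
            if abs(lat2 - lat1) < radius and abs(long2 - long1) < radius]

# Hypothetical position close to Heraklion, then the Chicago position from the sample.
near_crete = nearby("Minoan", (35.34, 25.13))
near_chicago = nearby("Minoan", (41.8632, -87.6198))
```

With the hypothetical position near Heraklion the result is that one museum; from the Chicago position in the sample response no Minoan exhibit is nearby.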
    

Web Service Use Case

Ongoing Work: BigQuery

  • Adapt entity-oriented schema to column store
    • Use column per predicate when possible
    • Use repeated columns instead of secondary table
    • No reverse tables
  • Currently, Google BigQuery schema implemented
    • Loader using Apache Beam pipeline
    • Parts of Quetzal functional

The Future

  • Further extensions to SPARQL
    • edge annotations for 'property graph' uses
    • language, schema extensions for annotations
  • Semantic big data means reasoning
    • put summarization, refinement into Quetzal
    • exploit entity-oriented schema for summarization
  • Would love others to get involved