transactional storage for geo-replicated systems

transactional storage for geo-replicated systems

transactional storage for geo-replicated systems marcos k. aguilera microsoft research silicon valley joint work with yair sovran russell power jinyang li one slide summary walter is a key-value store for web apps, featuring transactions data replication across distant sites (geo-replication)

transactional semantics: parallel snapshot isolation (PSI) property is strong, yet it has efficient implementation implementation of PSI uses two techniques preferred sites counting sets using walter, we built two geo-replicated apps facebook-like app, twitter-like app motivation web applications growing in size social networks, web mail, messaging

beyond a single data center or site more capacity: world-wide user base access locality: web users get local service disaster tolerance: if site fails others can take over goal: infrastructure for developing apps deployed on many sites architecture of a web application shared persistent state only at bottom tier higher tiers have only private session state or no state to deploy at many sites, replicate picture at each site hard part, which our work concerns: storage tier

state possibly shared among users existing solutions for storage tier replicated database systems single-master: updates must be done at one server multi-master: application must resolve conflicts hard and expensive to scale highly-scalable storage (Google, Amazon, HP, etc) either single site systems [SDDS, BigTable, GFS, Sinfonia, Percolator] or limited or no support for transactions [PNUTS, Cassandra]

or poor multi-site performance [Megastore] walter: multi-site storage system geo-replication = replication across sites transactions (next slide) non-structural choice interface based on key-value pairs write(key,value) read(key) terminology: object = key-value pair why transactions in storage system?

help dealing with hard problems arising from concurrency+failures transfer hard problems from application to storage infrastructure fewer storage systems than applications infrastructure developers have better expertise why transactions, part two: life without transactions issue of integrity of the storage state dangling references orphan objects unexpected reference cycles garbage

resulting in code complexity lack of productivity loss of software quality our goal in providing transactions: facilitate job of web developer without sacrificing performance transaction coordination and anomalies less coordination, more anomalies parallel

snapshot isolation (PSI) snapshot isolation serializability eventual consistency snapshot isolation supported by commercial DB systems properties reads performed on a consistent snapshot

writes concentrated at commit time no write-write conflicts issue with snapshot isolation it orders the commit time of all transactions even those that do not conflict with each other forbids some scenarios we want to allow for efficiency issue with snapshot isolation (contd) scenario forbidden by snapshot isolation (it requires total ordering of update transactions) Alice

Bob initial state initial state A1. update upload photo transaction B1. update upload

transaction photo A2. read see own transaction photo (see own update) B2. read see own transaction photo (see own update)

A3. read transaction B3. read transaction parallel snapshot isolation (PSI) snapshot isolation one commit time PSI a commit time per site

one timeline a timeline per site read from snapshot read from snapshot at site no write-write conflicts ditto causality property parallel snapshot isolation (PSI)

features (1) commit time per site, (2) timeline per site, (3) read from snapshot at site, (4) no write-write conflicts (5) causality: if T1 commits at site S before T2 starts at site S then T2 does not commit before T1 at any site implementing PSI efficiently PSI prohibits write-write conflicts (even across sites) issue: how to enforce it efficiently? (without coordination across sites)

two techniques preferred sites: optimize updates at certain sites counting sets: data structure that eliminates conflicts technique #1: preferred sites each object assigned a preferred site at that site, object can be written efficiently transactions that modify objects at their preferred site can use fast commit (without cross-site communication) inspired by notion of a primary site but less restrictive, because objects modifiable at any site example of usage: web application

preferred site of objects of a user = site close to where user usually logs in from potential performance issue what if many sites often modify an object? no good way to assign a preferred site to object bad idea: keep changing preferred sites of object requires cross site communication defeats efficiency goal technique #2: counting sets goal: eliminate conflicts when clients update the same object concurrently, most likely they do not want to overwrite it

otherwise we have a race condition cast concurrent updates as operations on a set example: users B and C adds themselves to As friends list instead of writes, treat updates as set additions set addition commutes, so conflict eliminated counting sets (contd) beyond opaque key-value pairs value in key-value pair treated as a set primitive operations in transactions not just reads, writes, but also set operations

transaction manager understands some updates are commutative so they do not conflict want: all set operations to commute with each other so no need to check for conflicts at all but set operations are not always commutative counting sets: anti-elements example of non-commutative operations Alice: add x to set S Bob: remove x from set S different execution order = different outcome goal: make all operations commutative

counting set = map from id to counter (negative values ok) add: increment counter remove: decrement counter removing from empty set: set with anti-element increment and decrement always commute, so no conflicts putting both techniques together certain transactions: fast execution and commit = without cross site communication if preferred site of written objects is local or updates done on counting sets objects

other transactions: slow commit = with cross site communication walter API start(): void commit(): int abort(): int read(oid, buffer): int write(oid, buffer, length): int setAdd(csetoid, id): int setDel(csetoid, id): int setRead(csetoid, iterator): int C++ example

Tx x; x.start(); len = x.read(o1, &buf); err = x.write(o2, buf, len); res = x.commit(); PHP example $x = waStartTx(); $buf = waRead($x, $o1); $err = waWrite($x, $o2, $buf); $res = waCommit($x); walter architecture

scalability walter server can be scalability bottleneck on site solutions 1. smaller sites, many per data center data center 1 data center 2 2. distribute walter server replace commit with distributed commit partial replication objects need not be replicated at all sites

replica set chosen by administrator application can access any object even if not locally replicated system gives illusion all objects replicated at all sites system retrieves appropriate version from remote site useful for large state (e.g., users emails) using walter: applications 1. built waltSocial social networking application runs on many sites

2. ported reTwis to run on many sites existing twitter clone written in PHP original uses redis key-value store (single site) replaced redis with walter waltSocial simple social networking application supports common operations, such as befriend post message on wall read info uses counting sets friends list, message list

preferred site site close to user where user logs into /* befriend operation */ Tx x; x.start(); x.read(oidA, &profileA); x.read(oidB, &profileB); x.setAdd(profileA.friendlist, oidB); x.setAdd(profileB.friendlist, oidA); success = x.commit();

reTwis twitter clone uses redis key-value store supports atomic operations on lists, counters works on a single site (*) (*) redis supports replication across sites, but it is limited to a master-slave scheme we replaced redis with walter to support multi-site operation atomic operations replaced with transactions uses counting set to store timeline port done in less than a day

evaluation used Amazon EC2 with four sites Virginia (VA), California (CA), Ireland (IE), Singapore (SG) round-trip latency from 82ms (VA-CA) to 277ms (IE-SG) some experiments on local cluster, because EC2 performance was poor with many threads EC2 has write caching at the disks experiments evaluate performance of walter performance of applications of walter

walter performance, one site base performance on one site done on local cluster not EC2 read transactions write transactions walter 72 Ktps 34 Ktps berkeley db

80 Ktps 32 Ktps in EC2, performance is halved walter performance, many sites write workload on EC2, full replication walter performance, many sites read workload on EC2, full replication waltSocial performance

on 4 sites in EC2, full replication operation # objects accessed throughput read info 3 40 Kops/s befriend

4 20 Kops/s post message 6 17 Kops/s reTwis performance in EC2

related work commutative replicated types inspired counting sets, but proposed types different cloud storage systems single-site only: Bigtable, Sinfonia, Percolator, distributed b-tree no or limited transactions: Dynamo, PNUTS, COPS multi-site, synchronous replication: Megastore database systems literature replication: provides serializability or snapshot isolation=slow escrow transactions: for numeric data, slow eventual consistency

requires conflict resolution logic conclusion web apps increasingly deployed over many sites developers need appropriate storage infrastructure transactions are a powerful abstraction can provide transactions with reasonable performance semantics: parallel snapshot isolation

techniques preferred sites counting sets one slide summary walter is a key-value store for web apps, featuring transactions data replication across distant sites (geo-replication) transactional semantics: parallel snapshot isolation (PSI) property is strong, yet it has efficient implementation implementation of PSI uses two techniques preferred sites

counting sets using walter, we built two geo-replicated apps facebook-like app, twitter-like app backup slides containers an object has metadata replica set: where it is replicated preferred site too wasteful to keep metadata per object objects are grouped into containers

objects in a container share metadata preferred versus primary sites definition-wise: primary is more restrictive preferred allows writes at any site context-wise: also more restrictive use many primary-backup databases transactions cannot cross databases or semantics unclear anomalies Anomaly

short fork long fork conflicting fork dirty read non-repeatable read lost update short fork (write skew) Serializ Snapshot -ability Isolation No Yes No

No No No No No No No No No PSI Yes Yes No

No No No long fork Eventual Consistency Yes Yes Yes Yes Yes Yes

using PSI multi-object atomic updates all writes in transaction happen together snapshots reads come from snapshot of data read-modify-write operations transactions read object and writes updated version conditional writes transaction reads object, checks condition, and modify object only if condition holds

can check and write many objects at once others?

Recently Viewed Presentations

  • Plate Tectonics - Biology, Earth Science, Environmental ...

    Plate Tectonics - Biology, Earth Science, Environmental ...

    Plate Tectonics Chapter 9.2 - 9.3 Plate Tectonics Proposed in 1965 by Tuzo Wilson = combination of Wegener & Hess's ideas. Convection Currents move the lithospheric plates causing geologic activity - (mountain building, volcanic eruptions and earthquakes) Convection Currents Upper...
  • Thanksgiving on Thursday by Mary Pope Osbourne

    Thanksgiving on Thursday by Mary Pope Osbourne

    Thanksgiving on ThursdaybyMary Pope OsborneVocabularyChapters One and TwoChapters Three and FourChapters Five and SixChapters Seven through Ten
  • Women In WWII

    Women In WWII

    The Navy Nurse Corps. At the time of Pearl Harbor, the Navy had just under active 1000 navy nurses, with another 1,000 on reserves. Nurses in the Navy were considered officers, however, full commission status wasn't granted until 1944. During...
  • 1st Generation Pencil and Tablet 2nd Generation Excel

    1st Generation Pencil and Tablet 2nd Generation Excel

    MBB traced back to the inventory as being in use - until is received in the Utilization Log. ERIC Utilization Log - demonstrates how the MBB. HCE3928 is assigned to JBC C04FEC. ERIC sends Precinct or SRD information to the...
  • Standard & Transmission-based Precautions

    Standard & Transmission-based Precautions

    Procedure or isolation masks, the common choice in long-term care, are not regulated and may vary in quality and performance. ... Insulin pens/fingerstick devices - one resident use only. Sterile needle/syringe use dedicated to one patient/one use.
  • Buenos Trabajodas! del timbre: Repasen para la prueba

    Buenos Trabajodas! del timbre: Repasen para la prueba

    Use a variety of verbs: tener + que, creer + que deber, comer, beber, compartir, gustar, encantar, ser, estar, hacer La Tarea Continue your role as a nutritionist Read Marcos' description of his lifestyle habits Give him written advice using:...
  • Biochemistry

    Biochemistry

    Biochemical Reactions. Biochemical Reactions: reactions that occur inside the cells of living things in order to produce energy necessary for life. All cells require energy to carry out the functions necessary for life. The energy that all cells use is...
  • The Vietnam war - Weebly

    The Vietnam war - Weebly

    Aug. 25- HCM convince Emperor to abdicate & turn rule over to Viet Minh in Hanoi. Sept. 2- declaration of Vietnamese independence from France . Fr Influence always greater in S than in N. Franco-British troops move in from S,...