
News

Posted over 11 years ago by [email protected] (@topher Topher)
Topher wrote: If you want to read only our blog and filter out the rest of the forum, you can visit https://forum.treode.com/category/blog. If you want an RSS feed of just the blog, you can use https://forum.treode.com/category/blog.rss.
Posted over 11 years ago by [email protected] (@topher Topher)
Topher wrote: Previously we discussed single-decree Paxos, and then we looked at problems that arise when it's extended into multi-Paxos. We concluded with the encouraging suggestion that we could use single-decree Paxos in a different way: the minitransaction. TreodeDB provides atomic writes and yet has no single points of failure, no scaling bottlenecks and no masters to fail over. Does this sound unbelievable? The minitransaction makes it possible.

Background

The Apache Hadoop file system depends on the NameNode to direct clients to the location of file data. A Hadoop cell has only one NameNode, so when it goes offline clients cannot read or write files. The NameNode is a single point of failure. Furthermore, once the number of clients saturates the capacity of the NameNode server, there is no way to divide its work between two or more servers, so the NameNode is also a scaling bottleneck. Services that build on Hadoop, like Apache HBase, inherit these weaknesses. As a result, many enterprises that use Hadoop and HBase do so only in non-critical services, such as supporting the business analytics group. Although outages frustrate users in those departments, the company continues to serve its customers and make its revenue.

The Apache Zookeeper service replicates data across multiple nodes. Any node may answer a read request from its local replica. The service processes writes by passing them to a leader, who then coordinates the replicas through an agreement protocol. Zookeeper can handle faults, since both readers and writers can make progress when nodes fail or disconnect, as long as more than half remain usable. Zookeeper can also scale to handle more readers by adding more replicas, although that adversely impacts writes, which must then establish agreement across a larger majority. Zookeeper resolves the single point of failure found in Hadoop; however, it introduces new problems that we explained in an earlier post. Notably, Zookeeper cannot scale beyond the capacity of the smallest machine, and the process of handling failures can cause delays.

Traditional databases coordinate multiple machines using two-phase commit. A client contacts nodes to read and write data, and when it has completed its work, it attempts to commit the changes in two steps. In the first step, the prepare phase, the client checks that each node is able to commit. If every node can do so, then in the second step, the commit phase, it tells each node to in fact commit. The approach can scale: one can add more nodes, and those nodes can share the load of reads and writes. Early implementations centered on a distributed lock manager, which is very difficult to get right, something I learned in 1994 when I was part of a project that tried it on a cluster of 36 nodes.

Resolution

Can we get the best of all of these? Can we have a scalable service that offers atomic writes, but that does not suffer from scaling bottlenecks and failover delays? Yes, we can. The minitransaction shows us how. This concept was first tried in Sinfonia (ACM, PDF) and later revisited in Scalaris (ACM, website). To understand what makes it “mini”, we can view traditional two-phase commit as actually having three phases; there's effectively a zeroth phase of reads and writes preceding the prepare request. The minitransaction, on the other hand, piggybacks the writes on the prepare request, and it makes those writes conditional upon a timestamp of the previous values.
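To make the shape of that request concrete, here is a minimal sketch in Python. It is not TreodeDB's actual wire format; the names (ConditionalWrite, PrepareRequest, store.timestamp_of, store.hold) are hypothetical. The point is that each write carries the timestamp the client observed when it read the row, so a replica prepares the write only if the row is still unchanged.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class ConditionalWrite:
        key: bytes             # the row to change
        read_timestamp: int    # timestamp of the value the client read
        value: bytes           # the new value to install

    @dataclass
    class PrepareRequest:
        transaction_id: bytes
        writes: List[ConditionalWrite]

    def prepare(request: PrepareRequest, store) -> str:
        # A replica votes to prepare only if every row is unchanged since the
        # timestamp at which the client read it.
        for w in request.writes:
            if store.timestamp_of(w.key) > w.read_timestamp:
                return "abort"       # someone wrote the row after the client read it
        store.hold(request)          # hold the rows until commit or abort arrives
        return "prepared"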
Once a simple majority of replicas has prepared the writes, single-decree Paxos ensures that either all nodes commit or all nodes abort, even in the most devilish failure scenarios.

This example shows the minitransaction can complete with two network round trips. The server, acting on behalf of the client, sends prepare messages to several replicas simultaneously and awaits a response. Then it sends accept messages to several acceptors simultaneously and awaits a response. A traditional transaction would require many round trips for locking, reading and writing before the prepare message. The minitransaction eliminates these steps by including a conditional write in the prepare message, which makes it simple and fast.

There are two other interesting optimizations visible above. First, the server can assume the role of principal proposer. Second, the web server can return “200 OK” to the client before sending commit to the replicas. Because exactly one web server handles the client's request, it can use the special ballot number 0, which is implicitly granted, a shortcut we examined in our post about single-decree Paxos. Furthermore, once a simple majority of acceptors has accepted the proposal to commit, the transaction is effectively committed even though the replicas haven't yet been notified.

What happens if the web server crashes after sending “200 OK” and before sending commit to any replica? The replicas time out and learn the outcome of the transaction from the acceptors. What happens if the web server crashes before sending a commit proposal to the acceptors? The replicas time out, the first one discovers from the acceptors that there is no proposal, and so it proposes to abort. Also, the original web client can reconnect to a different web server, and then it too can learn the outcome of its write. Finally, what if there's some unfortunate case in which the web server attempts to propose commit, and a replica attempts to propose abort at about the same time? The single-decree Paxos protocol ensures that all participants agree on exactly one outcome.

This protocol addresses reliability concerns. As long as the web server can interact with a simple majority of replicas and acceptors, it can complete its work. In the terminology of the CAP theorem, the technique favors C and P over A. To make the system scalable, we spread replicas and acceptors across a large cluster. The web server uses a hash of the row key to locate its replicas, and it uses a hash of a transaction id to locate its acceptors; a short sketch after the conclusion illustrates one way to do this.

Conclusion

With the minitransaction, we can spread reads and writes across many nodes, so the system can scale to handle large datasets and workloads. Needing only a simple majority of replicas, we can tolerate software crashes, server failures and network disconnects. Finally, since we use single-decree Paxos, there is no need for a leader and thus no hiccups from leader fail-over. This is the approach we adopted in TreodeDB, so that it has no single points of failure, no scaling bottlenecks and no masters to fail over. The minitransaction makes TreodeDB reliable and scalable.
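Here is the placement sketch promised above, in Python. The cohort size, node list and hashing scheme are illustrative assumptions, not TreodeDB's implementation; the point is that placement is a pure function of the key (or the transaction id) and the cluster membership, so any web server can compute it locally without asking a master.

    import hashlib

    def stable_hash(data: bytes) -> int:
        # A stable hash, so every server computes the same placement.
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def cohort_for(data: bytes, nodes: list, cohort_size: int = 3) -> list:
        # Choose cohort_size consecutive nodes starting at the hashed position.
        start = stable_hash(data) % len(nodes)
        return [nodes[(start + i) % len(nodes)] for i in range(cohort_size)]

    nodes = ["node-%d" % i for i in range(12)]       # the cluster
    replicas = cohort_for(b"row-key-42", nodes)      # who stores the row
    acceptors = cohort_for(b"txn-7f3a", nodes)       # who arbitrates the transaction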
Posted over 11 years ago by [email protected] (@topher Topher)
Topher wrote: Multi-Paxos is a distributed algorithm that coordinates changes to a replicated dataset. Several machines host the dataset and respond to requests to read or modify the data, and the multi-Paxos protocol ensures that those replicas remain synchronized even when machines crash or the network fails. It gave rise to Google's Chubby lock service and Apache Zookeeper, and ultimately led to the development of the Raft protocol. In designing and coding TreodeDB, we avoided multi-Paxos and Raft for a number of reasons. Each of these problems can be addressed with some care, but we ask “Could it be done differently?”

Multi-Paxos is difficult to implement

Multi-Paxos developed its notoriety as a confusing algorithm early on. It was first described in the second half of Mr. Lamport's papers The Part-Time Parliament and Paxos Made Simple. That explanation merely outlined the algorithm and left details to the reader. Mike Burrows described the subtleties engineers at Google encountered when they used multi-Paxos in the Chubby lock service. Kirsch and Amir attempted to elucidate Paxos, but their paper remains impenetrable like the others. Finally, Ongaro and Ousterhout modified multi-Paxos to produce a more tractable protocol called Raft. Although engineers may find it easier to comprehend, it remains challenging to understand and implement.

Capacity is limited by one machine

The family of multi-Paxos and Raft implementations (including Chubby, Apache Zookeeper, etcd and more) replicate a dataset across multiple machines. Each machine holds the entire dataset, so its size cannot exceed the capacity of the smallest machine. Each of these implementations structures its API around a directory hierarchy, although neither multi-Paxos nor Raft requires that. Also, Zookeeper offers a multi operation to perform several updates atomically. If we are willing to give up the ability to list directories and the multi capability, then we can use a consistent hash of the path to determine a cohort of Paxos acceptors or Raft coordinators. The Spinnaker project explored this idea. To handle loads that are too large for a single machine, we must give up some useful features.

Leader fail-over causes delays

Multi-Paxos and Raft identify one of the machines as the primary. All requests to modify data must be coordinated by the primary, which may sometimes fail. The protocols eventually detect the failure and elect a new leader, and they can do so while keeping the replicas consistent. This works even if the old leader did not actually fail, but merely lost connectivity for a brief period: a quorum of replicas elects the new leader and then ignores the old one. The process of leader fail-over can be reasonably fast. Nonetheless, it does cause delays, and if they are not careful, clients may in turn pass those delays further up the call stack to their own clients.

Disconnects can create semantic difficulties

If a client issues an update and then loses its connection to the server before receiving the acknowledgement, the client needs to take some action. The specific nature of this action depends on the kind of update and the meaning of the data, so the concern must be addressed case by case. For example, if a client attempts to increment a counter and fails to receive an acknowledgment, the client can reconnect. When it does, it may find the counter has in fact been incremented, but the client cannot know whether that was its own increment request or another client's.
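The increment stays ambiguous because the written value carries no trace of who wrote it. One remedy, discussed next, is to use compare-and-set with values that identify the writer. Here is a hedged sketch in Python; the read, compare_and_set and reconnect client methods are hypothetical, not part of Zookeeper's or any particular library's API.

    def acquire_lease(client, key, my_id):
        # Claim the lease by writing our own id into an empty slot.  Because the
        # written value names the writer, a reconnecting client can always tell
        # whether its earlier attempt succeeded.
        while True:
            try:
                value, version = client.read(key)
                if value == my_id:
                    return True        # our earlier write landed before the disconnect
                if value is not None:
                    return False       # someone else holds the lease
                if client.compare_and_set(key, my_id, expected_version=version):
                    return True
                # compare_and_set lost a race; loop to re-read and decide again
            except ConnectionError:
                client = client.reconnect()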
The paper on Chubby (Section 4) and the documentation for Zookeeper discuss these usage traps. To avoid such problems, clients can use operations such as compare-and-set, or they can use data structures whose operations are commutative and idempotent. Engineers must think carefully about how to handle disconnects.

So we're good, right?

We think not. Although careful use of multi-Paxos and Raft does allow us to avoid problems, we must give up atomic update. Scalable implementations like Spinnaker do not provide an equivalent to Zookeeper's multi operation that assures us all of the items are modified or none of them are. Attempting to work around this limitation, it is easy to make mistakes such that flaky servers and networks leave some rows changed and not others, and thus leave the database in an inconsistent state; the short sketch at the end of this post makes the hazard concrete. When engineers use a system that does provide an atomic operation for multiple rows, they find that it greatly simplifies building the application.

Could it be done differently?

Yes. Multi-Paxos is just one way to extend single-decree Paxos. There are other ways to use single-decree Paxos effectively, and there is a wealth of literature exploring interesting techniques derived from it. One such distributed algorithm, the minitransaction, uses single-decree Paxos to coordinate atomic updates to multiple replicas of multiple items. TreodeDB manages database rows this way, and we will present the minitransaction in our next post.
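For the hazard mentioned above, here is a deliberately naive sketch in Python: two rows updated with two independent writes, where the store's get and put methods are hypothetical. Any failure in the gap leaves one row changed and the other not, which is exactly the inconsistency an atomic multi-row write such as the minitransaction is designed to rule out.

    def transfer(store, src, dst, amount):
        # Two independent writes with no atomicity between them.
        store.put(src, store.get(src) - amount)
        # A crash or disconnect here debits src without ever crediting dst,
        # and nothing in the system will clean it up.
        store.put(dst, store.get(dst) + amount)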
Posted over 11 years ago by [email protected] (@topher Topher)
Topher wrote: TreodeDB uses Paxos, more specifically the simplest version, which we will call single-decree Paxos. Google’s Chubby lock service and Apache Zookeeper both implement an extension called multi-Paxos. That variation presents a number of issues, which we will explore in another post. In this post we review the basic Paxos protocol.

The proposer runs two phases

We assume that machines and software may crash. We assume the network may go down, partition machines, drop messages or mangle them in a detectable way. Otherwise, though, we assume that actors can be trusted. We do not assume the components behave in a byzantine way; that is, we assume they do not deliberately fake messages which appear to be valid.

Single-decree Paxos coordinates multiple actors, called proposers, that need to agree on the outcome of a single decision. It does so in two phases. In the first phase, which we will call the ask-grant phase, a proposer contacts several other actors called acceptors, and asks “May I propose a value at ballot np?” The proposer does not yet include a value to be accepted; it merely asks if it may make a proposal. It includes a ballot number, which it draws from an ordered domain, and each proposer draws from a distinct subset of that domain. The ballot number can be implemented as an integer together with the host name or some other identifier unique to each proposer. An acceptor may then respond “nothing has been accepted,” or “value v has been accepted at ballot na.” Or an acceptor may not respond at all. The first phase completes when the proposer has received responses from a simple majority of acceptors. If the proposer gets the first response, “nothing has been accepted,” from a simple majority of acceptors, then it is free to propose its own value at ballot np. If it receives the second response, “value v has been accepted at ballot na,” from any acceptor after hearing from a simple majority of them, then the proposer is obligated to propose the value v of the highest numbered ballot na. The proposer may time out while waiting for the responses, in which case it can restart the ask-grant phase with a higher ballot number.

The proposer begins the second phase, which we will call the propose-accept phase, by telling each acceptor “I propose value v for ballot np.” An acceptor may then respond “accepted.” Or it may ignore the proposal. If the proposer times out while waiting for acceptances, it can restart the protocol at phase one with a higher numbered ballot. The second phase closes when the proposer hears acceptances from a simple majority of acceptors, at which point the proposer knows v is the final decision of the protocol.

procedure propose (x):
    choose np
    v ≔ x
    n ≔ 0
    // phase 1: ask-grant
    send ask (np) to acceptors
    await response from a quorum:
        on grant (np, Some (va, na)):
            count response towards quorum
            if na ≥ n then
                v ≔ va
                n ≔ na
        on grant (np, None):
            count response towards quorum
        on timeout:
            restart procedure choosing a larger np
    // phase 2: propose-accept
    send propose (np, v) to acceptors
    await response from a quorum:
        on accept (np):
            count response towards quorum
        on timeout:
            restart procedure choosing a larger np
    // v has been chosen
    return v

The acceptors run independently

So how are the acceptors supposed to behave? In single-decree Paxos, each acceptor runs independently, tracking the highest numbered ballot it has granted or accepted. Also, it keeps the value for the highest numbered ballot it has accepted.
An acceptor should do this in a way that it can recover the state if it crashes and restarts. Each grant and accept constitutes a promise from the acceptor, and the protocol fails if the acceptor crashes, restarts with a clean state and then reneges on its earlier promises.

When an acceptor receives an ask with a ballot number n, it grants the request if the new n is greater than any ballot the acceptor has already granted or accepted. The grant constitutes a promise to not accept any ballot number less than n. When the acceptor receives a proposal with a ballot number n, it accepts the proposal if the new n is greater than any ballot the acceptor has already granted or accepted.

ng ≔ 0
p ≔ None

on ask (np):
    if np ≥ ng then
        // promise to never accept a proposal less than np
        ng ≔ np
        respond grant (np, p)

on propose (np, v):
    if np ≥ ng then
        ng ≔ np
        p ≔ Some (v, np)
        respond accept (np)

That's it. We can explain single-decree Paxos in less than 1,000 words. Speaking of that, let's look at some pictures of single-decree Paxos in different scenarios.

When all goes smoothly

When one proposer flows through the protocol without dropped messages or crashes, it completes in two round trips. Let's look at the interaction between the proposer and three acceptors. Once the proposer receives an acceptance from a simple majority of acceptors, it can consider the value to be the final decision. Any future proposer will discover this same value when it begins its ask.

When we can identify a principal proposer

Let's suppose that we limit the range of ballot numbers, specifically that we forbid negative ballot numbers. A grant constitutes a promise to not grant or accept a lower ballot number. Since 0 is the lowest ballot number, it's implicitly the case that an acceptor cannot grant or accept a lower one. If we can identify a single proposer as the principal, that actor may jump immediately to the second phase. The principal actor can complete the protocol with one round trip. What if the principal proposer makes a proposal that conflicts with those from other proposers? The protocol ensures that all proposers come to agreement on one decision. We will see examples below. The multi-Paxos protocol maintains a leader who enjoys this privilege of one round trip for many proposals; the complexities of multi-Paxos arise when the leader fails and another proposer must establish itself as the new leader. We will discuss that in another post.

When prior proposers fail

Perhaps one or more proposers began the protocol but did not finish. Maybe they crashed, or the power was cut, or the network went down. Or maybe they are delayed and will return later to try to complete the decision themselves. Let's see what another proposer does when it discovers the incomplete progress. In this case, the proposer continues with the highest numbered value it found. Notice that no simple majority of acceptors supplied the same proposal. Thus no earlier proposer completed the protocol, so no value had yet been finalized. Although no value had been finalized, this proposer is obliged to use the value of the highest numbered proposal.

When two proposers overlap their ask

Perhaps two proposers run at about the same time, each trying to propose a different value. For example, maybe they both begin phase one simultaneously. In this case the acceptors ignore the second ask, since its ballot number is lower than the one granted. The second proposer eventually times out and then discovers the value from the first proposer.
When two proposers overlap their proposal

It's also possible that two proposers begin phase two simultaneously. Similar to before, the acceptors ignore the proposal with the lower numbered ballot. The losing proposer eventually times out and discovers the value from the winning proposer.

Where to learn more

When considering how multiple proposers may interleave their messages or fail, we can construct adversarial scenarios ad nauseam. We would prefer a convincing proof of correctness. Mr. Lamport provides the necessary arguments in his papers, The Part-Time Parliament and Paxos Made Simple. You only need to read the first half. Stop when Mr. Lamport mentions a state machine; that's the point at which he extends single-decree Paxos into multi-Paxos. In the future we will discuss issues with that approach, and we will present another way to use single-decree Paxos for coordinating state changes.
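To complement the pseudocode above, here is a minimal in-memory sketch of the two phases in Python. It is a toy: calls are sequential, nothing times out, and the acceptor state is not persisted, so it illustrates the message flow rather than a usable implementation. Ballots are modeled as (round, proposer-id) pairs, one possible realization of the ordered, per-proposer ballot domain described above.

    class Acceptor:
        # In-memory acceptor state; a real acceptor must persist this across restarts.
        def __init__(self):
            self.granted = (0, "")     # highest ballot granted or accepted
            self.accepted = None       # (value, ballot) of the highest accepted proposal

        def ask(self, ballot):
            if ballot >= self.granted:
                self.granted = ballot
                return ("grant", self.accepted)
            return None                # ignore lower-numbered asks

        def propose(self, ballot, value):
            if ballot >= self.granted:
                self.granted = ballot
                self.accepted = (value, ballot)
                return ("accept", ballot)
            return None                # ignore lower-numbered proposals

    def propose(acceptors, proposer_id, round_number, value):
        ballot = (round_number, proposer_id)
        # Phase 1: ask-grant.
        grants = [r for r in (a.ask(ballot) for a in acceptors) if r]
        if len(grants) <= len(acceptors) // 2:
            return None                # no quorum; retry with a larger round number
        prior = [accepted for _, accepted in grants if accepted]
        if prior:
            # Obliged to adopt the value of the highest numbered prior proposal.
            value = max(prior, key=lambda va: va[1])[0]
        # Phase 2: propose-accept.
        accepts = [r for r in (a.propose(ballot, value) for a in acceptors) if r]
        if len(accepts) <= len(acceptors) // 2:
            return None
        return value                   # v has been chosen

    acceptors = [Acceptor() for _ in range(3)]
    print(propose(acceptors, "proposer-a", 1, "red"))    # "red"
    print(propose(acceptors, "proposer-b", 2, "blue"))   # still "red": the decision holds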