Forums : Suggestions for Ohloh 2.0


The Ohloh collection APIs seem to have inherent race conditions. If my code asks for the first 25 items of some type, someone can create items in the first 25 and cause my code to see some items twice, or (in some cases) delete items in the first 25 and cause my code to miss some items.

Ohloh collections need some way to iterate over them more safely than that. It doesn't matter if code fails to see newly added items; however, code should not miss existing items (not added since the start of iteration) or see the same items multiple times.
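
For concreteness, here's a tiny self-contained simulation of the race (plain Python standing in for the paged API; the item names and the 25-per-page size are just for illustration):

    # Offset-based paging over a collection that changes between requests.
    items = ["item%02d" % i for i in range(30)]   # 30 items, 25 per page

    page1 = items[0:25]          # client fetches page 1: item00 .. item24
    items.insert(0, "aaa_new")   # someone adds an item that sorts first
    page2 = items[25:50]         # client fetches page 2: starts at item24 again

    seen = page1 + page2
    print(len(seen), len(set(seen)))   # 31 fetched, only 30 unique: a duplicate

    items = ["item%02d" % i for i in range(30)]
    page1 = items[0:25]          # item00 .. item24
    del items[0]                 # someone deletes item00
    page2 = items[25:50]         # now starts at item26: item25 is never seen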

Josh Triplett over 16 years ago
 

Hmm, good point.

I don't know how other sites deal with this issue. Ideas?

Robin Luckey over 16 years ago
 

Looking around at other web services, I couldn't find any good examples of ways to solve this. However, I think I have a solution which will work: ditch page numbers, and instead select based on the field you sort on.

Suppose the client does a search for items containing "thingy", sorted by item name in ascending order. Ohloh does the appropriate select, along the lines of SELECT * FROM items WHERE name LIKE '%thingy%' ORDER BY name LIMIT 25. The database returns a full 25 results; suppose they run from "1 thingy" to "abracathingy". Ohloh sends back the first set of results, and since it received as many results as it asked for, it includes a flag saying that more may exist. (This avoids having clients hard-code the use of 25 results per page; however, that represents an unrelated problem, and Ohloh could just let clients apply the same logic and hard-code 25.)

Now, the client decides it wants to ask for more results. Rather than asking for page 2 of the items containing "thingy", with all the associated race conditions, the client instead asks for the items containing "thingy" starting after "abracathingy". Ohloh can now do a select along the lines of SELECT * FROM items WHERE name LIKE '%thingy%' AND name > 'abracathingy' ORDER BY name LIMIT 25. This will return the next logical set of results for the client, regardless of what changes have occurred since the client requested the first set of results. If new items have appeared before "abracathingy", the client will not see previously seen items again, even though they now fall after the first 25. If items before "abracathingy" have disappeared, the client will not miss the items between "abracathingy" and the new 26th item. Even if "abracathingy" itself disappears, the client will still see the next item after it.
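
A minimal runnable sketch of that scheme, with an in-memory SQLite table standing in for Ohloh's database (the table layout, names, and page size are illustrative only):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (name TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO items VALUES (?)",
                     [("thingy%03d" % i,) for i in range(60)])

    def page_after(last_name, limit=25):
        # Keyset pagination: resume strictly after the last name already seen,
        # instead of asking for "page N".
        if last_name is None:
            sql = ("SELECT name FROM items WHERE name LIKE '%thingy%' "
                   "ORDER BY name LIMIT ?")
            args = (limit,)
        else:
            sql = ("SELECT name FROM items WHERE name LIKE '%thingy%' "
                   "AND name > ? ORDER BY name LIMIT ?")
            args = (last_name, limit)
        return [row[0] for row in conn.execute(sql, args)]

    seen, last = [], None
    while True:
        page = page_after(last)
        if not page:
            break
        seen.extend(page)
        last = page[-1]
        # Simulate a concurrent insert that sorts before the cursor; it no
        # longer shifts anything the client has yet to fetch.
        conn.execute("INSERT OR IGNORE INTO items VALUES ('aaathingy')")

    assert len(seen) == 60 and len(seen) == len(set(seen))   # no dupes, no misses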

This approach also suggests some other changes to the collection APIs. The current API returns page 1 if you ask for a page that does not exist; with the new API, asking for results with names after the last name should give an empty result set, which tells the client that no more results exist. Also, with the new API, you can go to the previous page by asking for the last item before the first one in your result set. Ohloh (or alternatively the client, but that seems inelegant) can implement that by reversing the ORDER BY and the pagination part of the WHERE condition, then reversing the results before returning them so they fall in the original order.
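
Continuing the sketch above, the previous page falls out of the same idea, reversing both the comparison and the sort and then flipping the rows back:

    def page_before(first_name, limit=25):
        # Previous page: items strictly before the first name currently shown,
        # fetched in descending order and then flipped back to ascending.
        sql = ("SELECT name FROM items WHERE name LIKE '%thingy%' "
               "AND name < ? ORDER BY name DESC LIMIT ?")
        rows = [row[0] for row in conn.execute(sql, (first_name, limit))]
        return list(reversed(rows))

An empty result from either helper is the "no more results" signal in both directions.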

Josh Triplett over 16 years ago
 

That solution removes dupes (which are easy to detect) but doesn't solve the missing-items problem, which can occur in either case. Replication is an ugly problem, and aside from simple date-range-in-the-past queries, a stateless HTTP API is not a great fit for a solution.

bombguy over 16 years ago
 

Yes, stateless and paged results don't mix. Somebody has to maintain some state if you want stable paging. Here are two reasonable and common approaches to stable iteration of a collection over HTTP:

1) If the server is willing to manage a result set for the client and dole it out in pieces, have a means for creating a stable collection view (e.g. POST a query form to /getview), which responds with a 201 Created status code and a Location header supplying a URI handle to the collection (e.g. /views/fb07a1). Further operations based at that URI (e.g. /views/fb07a1/p1, /views/fb07a1/p2) are understood to apply to the created collection view.

For very large or active collections, one simply might not be able to guarantee that the collection view was stable over time; to signal the client to start over with a new result set based on the same query, HTTP semantics like 301 Moved Permanently can be useful. (This is kind of like a Java collection throwing a ConcurrentModificationException). Collection views accessed after the server is no longer willing to persist them can be met with 410 Gone.
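
A rough client-side sketch of that choreography (every URL and parameter below is made up, purely to make the status codes concrete):

    import requests

    BASE = "https://api.example.org"         # hypothetical service

    # Ask the server to create a stable view of the query; it answers
    # 201 Created with a Location header naming the new collection view.
    resp = requests.post(BASE + "/getview",
                         data={"query": "thingy", "sort": "name"},
                         allow_redirects=False)
    assert resp.status_code == 201
    view = resp.headers["Location"]          # e.g. /views/fb07a1

    page = 1
    while True:
        r = requests.get(BASE + view + "/p%d" % page, allow_redirects=False)
        if r.status_code == 301:
            # The server gave up keeping the old view stable; start over on
            # the replacement view it points at (the
            # ConcurrentModificationException analogue).
            view, page = r.headers["Location"], 1
            continue
        if r.status_code == 410:
            raise RuntimeError("view expired; re-POST the query")
        r.raise_for_status()
        items = r.json()
        if not items:
            break
        for item in items:
            print(item)
        page += 1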

2) If the client is smart enough and you don't want to create server resources at all, you can make the client responsible for all paging of collections. The initial request from client to server fetches a complete list of matching object or row IDs comprising a snapshot of the collection, and each paged request then explicitly asks for a sequence of those IDs. Addition of an item to the collection by a third party is unknown to the client unless the client re-requests the list; deletion of an item from the collection is equally unknown; deletion of an item altogether results in a failure to retrieve that item (404, null in a list, etc). Of course, you need to be working with collections small enough that their entire ID set can be downloaded in reasonable time.
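
And a corresponding sketch of the client-held snapshot (again, the URLs and parameters are purely illustrative):

    import requests

    BASE = "https://api.example.org"         # hypothetical service

    # One request snapshots the matching IDs; the client pages itself from then on.
    ids = requests.get(BASE + "/items/ids", params={"query": "thingy"}).json()

    PAGE = 25
    for start in range(0, len(ids), PAGE):
        chunk = ids[start:start + PAGE]
        items = requests.get(BASE + "/items",
                             params={"ids": ",".join(str(i) for i in chunk)}).json()
        for item in items:
            if item is None:
                continue                      # deleted since the snapshot was taken
            print(item)

    # Items added after the snapshot simply aren't in the ID list, and a
    # deleted item shows up as a failed lookup, exactly as described above.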

But to be honest, I'm not sure that stable collection iteration matters in the Ohloh API at this point ... we're not dealing with bank transactions here ... if I see something show up twice in a paged Ohloh widget, I doubt I'll flip out about it. Opening up the code parsers and recognizing more code and comment types would get my vote as a better place to spend development energy.

Rob Heittman over 16 years ago
 

bombguy:
With the solution I proposed, the only items you can miss are ones added, after you begin iterating, at positions before your current position in the results. You will never miss existing items, even if an item before your current position disappears or changes its name to one that sorts after your current position. With the current Ohloh collection API, you could miss existing items or see items twice.

This represents an important distinction. The same kind of guarantee applies to most forms of non-blocking synchronization: a non-blocking reader might not see new changes, but it will not miss items that existed for a reasonable time before it started reading. That matches my concern when using the Ohloh API. I don't care about missing new changes, because the change might just as well have happened after I finished reading, in which case I would have missed it regardless; furthermore, any attempt to make me re-examine the query whenever something changes might well have me re-examining it constantly.

However, I do care about missing existing items in a query, because it violates user expectations. For example, suppose I built a site that uses Ohloh to grab metadata about a user or project; if someone searches for themselves as a contributor or for their own project, and my site reports that they or their project does not exist even though Ohloh says they do, they will blame my site, not Ohloh. :) From Ohloh's perspective, the same thing applies to Ohloh's own search, which uses the same paging mechanism. If someone searches for something that they know exists and does not find it, that represents a serious problem.

Rob Heittman:
1) Having the server maintain state could theoretically work, but it imposes an ongoing burden on the server to maintain a substantial amount of state for every recent client. Furthermore, detecting if a client's view has become invalid requires a significant amount of work on the server, unless you suggest offloading this work to the SQL server in the form of a read transaction; what you propose seems to amount to extending read transactions over the network. This does not seem like a reasonable expectation from the server. Also, telling the client to restart their query might lead to a livelock; what keeps them from needing to restart their query continuously, given sufficient activity on Ohloh during each query?

2) I agree that it makes sense for the client to handle the state. However, fetching a complete list of row IDs defeats the original purpose of the paging, namely to decrease the load on Ohloh's servers in the common case of not needing results beyond the ones the searcher wanted. (And, for that matter, to ensure that larger collections count more requests against the quota for a given API key.) Furthermore, if you provide only the ID, not the actual data for the item, you impose an extra request for every item, which causes significant overhead for both client and server. Thus, as you said, it only works for small collections; right now the API already behaves consistently for collections of fewer than 25 items, so you've only made an argument for increasing the page size, which does not solve the entire problem.

The solution I proposed works on collections of any size, poses no extra burden on the server, does not pose an unreasonable burden on the client, and provides sufficient consistency to satisfy user expectations about finding items known to exist.

Josh Triplett over 16 years ago
 

Perhaps I should have clarified that the idioms I described illustrate exactly why it is rather heavy to guarantee stable collection iteration, and why I don't think it's needed here.

What you proposed, Josh, is an inexpensive compromise with good properties. I do think it only has the stated properties if the field you sort on is an identity for the items, yes? But that is probably a reasonable limitation to impose.

Rob Heittman over 16 years ago
 

If the field in question isn't a unique key, or itself changes during the iteration, you can still miss rows, obviously. Once you start playing cute with the data access layer, it becomes less an API and more "give me a web call to feed my widget". Nothing wrong with that, of course -- but I think it's best either to design for the real replication case or just leave it as is.

bombguy over 16 years ago
 

Rob:

> Perhaps I should have clarified that the idioms I described illustrate exactly why it is rather heavy to guarantee stable collection iteration, and why I don't think it's needed here.

Ah, I did not realize that. In that case, I agree entirely.

> What you proposed, Josh, is an inexpensive compromise with good properties. I do think it only has the stated properties if the field you sort on is an identity for the items, yes? But that is probably a reasonable limitation to impose.

Good point. If you sorted on a field that does not uniquely identify an item, this would not identify your position in the set of items with the same field value. On the other hand, I rather like the idea of making all sorts stable by having secondary sort criteria that include a uniquely identifying field.
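
A sketch of that variant, in the same SQLite terms as the earlier sketch but assuming the items table also carries a unique integer id column: sort on (name, id) and resume strictly after the last (name, id) pair of the previous page, so even a run of identical names gives an unambiguous cursor.

    def page_after(last_name, last_id, limit=25):
        # The id tiebreaker makes the cursor unique even when names repeat.
        sql = ("SELECT name, id FROM items WHERE name LIKE '%thingy%' "
               "AND (name > ? OR (name = ? AND id > ?)) "
               "ORDER BY name, id LIMIT ?")
        return conn.execute(sql, (last_name, last_name, last_id, limit)).fetchall()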

One note, though: I did not mean to imply my approach would prevent seeing an item multiple times or missing an item if the item itself changes while querying. For example, if I search for projects sorted by name, and the name of a project changes, I might well see it twice or not at all.

Josh Triplett over 16 years ago