CVS source code commit activity missing

Hello, this is a follow-up to the previous feedback topic, but limited to a single topic this time (thanks for implementing a few of the features mentioned, we noticed). In particular, the thread uncovered what seems to be a failure to process a huge portion of BRL-CAD's commit metrics, so I went about determining just how much might be missing through some data-mining of my own. The results were rather interesting, to say the least, and do seem to concur with the previous assertion that much of our commit traffic is not being accounted for, causing a cascade of incorrect values.

I extracted all of our project's commit log messages and processed them with some simple scripting (also provided for your reference), with the results being visible at http://ftp.brlcad.org/statcvs/cvs.html. One of the metrics, shown in the second column, is that the number of unique commit messages, regardless of author, was 28409. Ohloh's commit statistics currently show that same metric as being just 14760.

Sorting by time on the ohloh commits page, you can see that in 1983 Ohloh found 2 of the 3 apparently unique commits; and in 1984 it found only 1 of the 137 unique commits. Similarly, my data-mining of the log messages unveiled 48 usernames that have committed, whereas Ohloh lists that only 39 were found. Similar issues seem to be prevalent in the year's contribution being computed for the various authors, although I suspect it's probably all caused by what seems to be roughly 50% of the commits not being accounted for.
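For readers who want to try this kind of log mining themselves, here is a minimal sketch in Python. It is not the script linked above; it assumes the usual plain-text layout of `cvs log` output (a line of 28 dashes between revisions, a line of 77 equals signs after each file), and the sample log, file names, and authors are made up for illustration. A log message that itself contains a line of 28 dashes would confuse this simple splitter.

```python
import re

REV_SEP = "-" * 28   # separator line between revisions in `cvs log` output
FILE_SEP = "=" * 77  # separator line that ends each file's log


def parse_cvs_log(text):
    """Yield (author, message) for every revision entry in `cvs log` text."""
    for file_block in text.split(FILE_SEP):
        # The first chunk of each file block is the file header, so skip it.
        for entry in file_block.split("\n" + REV_SEP + "\n")[1:]:
            lines = entry.strip("\n").split("\n")
            if len(lines) < 2 or not lines[0].startswith("revision "):
                continue
            match = re.search(r"author: ([^;]+);", lines[1])
            if not match:
                continue
            body = lines[2:]
            # An optional "branches:" line may precede the message.
            if body and body[0].startswith("branches:"):
                body = body[1:]
            yield match.group(1), "\n".join(body).strip()


def summarize(text):
    """Return (revision entries, unique messages, unique authors)."""
    entries = list(parse_cvs_log(text))
    messages = {message for _, message in entries}
    authors = {author for author, _ in entries}
    return len(entries), len(messages), len(authors)


# Hypothetical two-file log, built from the separators defined above.
SAMPLE_LOG = "\n".join([
    "RCS file: /cvsroot/example/foo.c,v",
    "head: 1.2",
    "description:",
    REV_SEP,
    "revision 1.2",
    "date: 2004/01/02 10:00:00;  author: alice;  state: Exp;  lines: +1 -1",
    "fix typo",
    REV_SEP,
    "revision 1.1",
    "date: 2004/01/01 10:00:00;  author: bob;  state: Exp;",
    "initial import",
    FILE_SEP,
    "",
    "RCS file: /cvsroot/example/bar.c,v",
    "head: 1.1",
    "description:",
    REV_SEP,
    "revision 1.1",
    "date: 2004/01/02 10:00:00;  author: alice;  state: Exp;",
    "fix typo",
    FILE_SEP,
    "",
])

print(summarize(SAMPLE_LOG))  # (3, 2, 2)
```

Because CVS records one revision entry per file, a commit touching several files appears several times in the log; counting unique messages, as above, is one rough way to collapse those entries back into commits.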

My hope is that the direct data-mining report at http://ftp.brlcad.org/statcvs/cvs.html might help track down what the cause is. Numbers in parentheses on the per-person columns are where I collapsed two usernames that belong to the same person, but I also show the original individual values, which Ohloh should have been able to match. Thanks again for the hard work and great site!

Cheers!
Sean

sean about 19 years ago

Hi Sean,

I apologize for not responding to your earlier post in depth; I have been very busy putting out fires in our database lately. However, I have been very keen to solve this discrepancy.

Thank you very much for publishing your data-mining experiments; that's incredibly helpful to me.

I repeated some of your experiments today and was able to reproduce both your results and our own results with a simple change: you've measured the entire history of all branches, while Ohloh measures only the history of the HEAD branch.

This is something we touched upon in our last discussion. When we fetch the log, we pass the -b option to restrict the log to the head. If you repeat your experiments using a log obtained with -b, you will match the Ohloh numbers.

In other words, I think our measurements are correct, but I don't think we've measured what you wanted us to measure.

For now, we maintain a single report for the direct history of the current head. Files which no longer exist, or work done on side branches, will be totally ignored by our parser. This may change in the future, but we had to start somewhere.

Most people who come to Ohloh expect to see a current and exact number of lines of code on the head. That's the metric which most developers know and use to judge our accuracy. What, then, do we do with all of the activity that doesn't contribute lines to the head? We'd like to represent it on a report somewhere, but for storage reasons we can't keep hundreds of separate reports going for the hundreds of different branches. On the other hand, building a database model which allows realtime reports on any branch seemed extremely complicated, at least for the first version of our service. We made a simplifying decision to ignore activity that doesn't relate to the head. For the great majority of projects, this is a reasonable thing to do.

I'm hearing clearly that you want all activity represented, no matter in which branch that work occurred. I'm in agreement with you here: I think that the development effort expended anywhere in the source is a much more important metric than the length of the resulting code.

Where this really impacts the BRL-CAD report is for files that were deleted and moved somewhere else. The CVS log doesn't record this action, so the renamed files look to us like brand new files.

One of the things that we do as we measure lines of code is generate a unique hash for each source file. This opens up the possibility in the future that we can quickly recognize a renamed or copied file even if the source control does not tell us about it. By comparing these hashes, we could recognize that /brlcad/jove/jove.h revision 11.8 is in fact the same file as /brlcad/src/other/jove/jove.h revision 1.1, and we could connect the history timelines together. The bottom line is that we could still restrict our report to the current head, but by exploring all the branches along the way we could identify history that the CVS log doesn't reveal.
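A rough sketch of the hash-comparison idea, in Python. The post does not say which hash Ohloh computes or how it is stored, so SHA-1 over the raw file bytes is purely an assumption here, and the file contents below are invented; only the two jove.h paths and revision numbers come from the example above.

```python
import hashlib
from collections import defaultdict


def content_hash(data: bytes) -> str:
    """Hash the raw file contents (SHA-1 is an assumed choice)."""
    return hashlib.sha1(data).hexdigest()


def find_identical_files(revisions):
    """Group (path, revision) pairs whose contents hash identically.

    `revisions` is an iterable of (path, revision, content_bytes).
    Only groups spanning more than one path are returned, since those
    are the candidates for an unrecorded move or copy.
    """
    by_hash = defaultdict(list)
    for path, rev, data in revisions:
        by_hash[content_hash(data)].append((path, rev))
    return [
        group for group in by_hash.values()
        if len({path for path, _ in group}) > 1
    ]


# Invented contents; the paths and revisions are the ones named in the post.
REVISIONS = [
    ("/brlcad/jove/jove.h", "11.8", b"#define JOVE 1\n"),
    ("/brlcad/src/other/jove/jove.h", "1.1", b"#define JOVE 1\n"),
    ("/brlcad/src/other/jove/jove.h", "1.2", b"#define JOVE 2\n"),
]

print(find_identical_files(REVISIONS))
```

An exact hash only matches byte-identical contents, so a file that was edited as part of the move would not be caught by this approach alone.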

I'm not leading to any particular answers here, I'm just describing how our system currently simplifies the history, and why we have the numbers we do. I'm also tossing out some ideas for the future and hoping to invite some discussion. It's a long post, but it's a complicated topic, and you seem up for it :-).

Thanks again for your detailed work and interest in our service.

Robin

Robin Luckey about 19 years ago

Robin,

AHA!.. that explains a lot .. and no need to apologize. I'm sure there are constant fires that need to be put out on top of all of this trouble-shooting, support, all while actually trying to still provide new and improved features. The detailed response is, however, incredibly appreciated and I think will actually be rather enlightening to both of us for several reasons. I should have thought to check if you were using cvs log -b myself -- t'was an oversight as we actually rarely use branches for active development. What it did unveil, however, is what could be construed as a bug with cvs log -b or at least a somewhat unexpected and probably undesirable behavior. To put it in other terms I don't think that means what you think it means. .. to explain ..

The -b option provides information about the revisions on the default branch, which is normally the highest branch on the trunk. Most projects only have one major revision (numbered 1) and the default branch is of course usually HEAD. More to the point, if you have a project that utilizes RCS version numbers, cvs log -b will not report previous revisions even for HEAD development. I'd call it a bug myself in the cvs log command, but it's more one of those detail behaviors of CVS that is carried over from its RCS heritage/limitation days.

While in general I do agree that I probably would want even branch efforts to be represented somewhere in the statistics, and there's certainly questions and design complications on how to go about that, I can appreciate limiting the statistics to HEAD development for now. That's not the case here, though, as the predominance of activity is HEAD activity.

It's noticeable with BRL-CAD since our sources were in RCS for years and then later imported into CVS so that history would be retained. That import process preserves RCS revision numbers too and with RCS, it was somewhat more common practice to actually use the RCS number for various purposes. For example, you could have it imply API compatibility, simply serve as a means to denote new development phases, or even just to reset the minor revision number. You can actually even still do this with CVS, though it's done via flags during commit that will bump the major revision or via dreaded cvs admin commands. All of this mucking with revision numbers actually has really nothing to do with branching as it just increments the major revision number.

To see an example of this in BRL-CAD, you see in the same log you mentioned for jove, i.e. cvs log jove/jove.h, that the head revision is currently set to 11 (with the latest version before it was moved showing revision 11.8), but that there have also been revisions 10, 2, and 1 as well for that particular (bad example) file. Those are all on HEAD. This can be seen on just about every file in our repository where the RCS major revision number has been incremented over time and other numbers can be seen. If you were to check out HEAD using timestamps going back in time, you will get all of those other revisions as they existed.

To truly get all activity on HEAD, the *cvs log -b* command isn't really sufficient as it not only ignores branches, but it also ignores previous revisions. Instead of -b or at least in addition to it, you can get all HEAD activity by specifying that the revision range should start from RCS revision 1.

cvs log -b -r1:

I threw in the -b just to make it (doubly) explicit that the logs should be limited to the default branch, though it's actually redundant when you specify a major revision range. The 1: basically means from major revision 1 to whatever the latest major revision number there is, which limits results to those found on the default branch. Perhaps this -r1: could be added? :-) I bet you'll find at least a handful of (older) projects where this will make a significant difference in the statistics being reported. I updated my script tables at http://ftp.brlcad.org/statcvs/cvs.html using cvs log -b -r1: and you can see there is a difference, but not nearly as substantial (which is what I would expect). In any regard, thanks again for the insight into the processing being used as well as for the productive discussion. Hope to see the revision numbering taken into account... :-)
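To make the distinction between major revision bumps and real branches concrete, here is a small Python sketch. It relies on the CVS numbering convention that trunk revisions have two dot-separated fields (such as 11.8) while branch revisions have four or more (such as 1.1.1.1). The `highest_major_only` helper is only a model of the -b behavior as described in this post, not a reimplementation of cvs, and apart from 11.8 and 1.1 the revision numbers are hypothetical.

```python
def is_trunk_revision(rev: str) -> bool:
    """True for two-field revisions such as '11.8', i.e. trunk (HEAD) history."""
    parts = rev.split(".")
    return len(parts) == 2 and all(part.isdigit() for part in parts)


def trunk_revisions(revs):
    """All trunk revisions regardless of major number (the intent of -r1:)."""
    return [rev for rev in revs if is_trunk_revision(rev)]


def highest_major_only(revs):
    """Trunk revisions under the highest major number only.

    This models the -b behavior described above; it is an assumption
    for illustration, not taken from the cvs source.
    """
    trunk = trunk_revisions(revs)
    if not trunk:
        return []
    top = max(int(rev.split(".")[0]) for rev in trunk)
    return [rev for rev in trunk if int(rev.split(".")[0]) == top]


# Hypothetical history of one file: four major numbers plus two branch revisions.
REVS = ["11.8", "11.1", "10.3", "2.5", "1.1", "1.1.1.1", "10.3.2.1"]

print(trunk_revisions(REVS))     # ['11.8', '11.1', '10.3', '2.5', '1.1']
print(highest_major_only(REVS))  # ['11.8', '11.1']
```

The gap between the two lists is the HEAD history that would go uncounted for a file whose major revision number was bumped over time.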

Cheers!
Sean

sean about 19 years ago

As for your other comments, I think the hash idea is pretty innovative, but perhaps also rather problematic/limited as that only captures pure movements and not necessarily renames (since a file rename might cause a couple lines in the file to be updated with the new name or path). A thought -- you could probably do some sort of time-sensitive diff match, e.g. if the contents of a file that was deleted match within some %% (e.g. 95%) of those in another file being added within the same ## hours (e.g. 24 hours), then treat them as being probably a moved and/or renamed file. Then it'd just be a matter of fine tuning the %% and ## to not have false positives. Might need the files to also have some minimum line count so you don't match pretty empty files being moved around too, but seems easy enough to tweak into something useful... Cheers!
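A minimal Python sketch of that time-sensitive match, using the standard library's difflib for the similarity score. The 95% and 24 hour values are the examples from the paragraph above; the minimum line count of 10, the function name, and the sample file contents are assumptions made up for illustration and would all need tuning against real repositories.

```python
import difflib
from datetime import datetime, timedelta


def probably_moved(deleted, added,
                   min_ratio=0.95,
                   max_gap=timedelta(hours=24),
                   min_lines=10):
    """Guess whether an added file is a moved/renamed copy of a deleted one.

    `deleted` and `added` are (timestamp, list_of_lines) pairs.
    """
    del_time, del_lines = deleted
    add_time, add_lines = added
    # Skip nearly empty files, which would match each other too easily.
    if min(len(del_lines), len(add_lines)) < min_lines:
        return False
    # The delete and the add must happen close together in time.
    if abs(add_time - del_time) > max_gap:
        return False
    # Line-based similarity in [0, 1]; 1.0 means identical contents.
    matcher = difflib.SequenceMatcher(None, del_lines, add_lines, autojunk=False)
    return matcher.ratio() >= min_ratio


# Invented example: a 40-line file moved with one line edited for the new path.
old_lines = [f"line {i}" for i in range(40)]
new_lines = ['#include "src/other/jove/jove.h"'] + old_lines[1:]
removed_at = datetime(2004, 5, 1, 12, 0)

print(probably_moved((removed_at, old_lines),
                     (removed_at + timedelta(hours=1), new_lines)))  # True
print(probably_moved((removed_at, old_lines),
                     (removed_at + timedelta(days=2), new_lines)))   # False
```

Comparing every deleted file against every added file is quadratic, so in practice the time window would probably be applied first to narrow the candidate pairs.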

sean about 19 years ago

Sorry for bumping this old thread, but the search led me here, and I was trying to figure out why ohloh doesn't know about all commits.

I tend to keep HEAD with just a README, and then all commits go to the proper branch. And well, much other work out there is done only on separate branches, so ohloh stats are pretty inaccurate.

This thread is over 2 years old, so I'm wondering if you have plans to crawl non-head commits. Probably something that would be nice to find in the FAQ. If it is there, sorry, I missed it.

Cheers

markus_petrux almost 17 years ago