12
I Use This!
Activity Not Available

News

Posted almost 15 years ago by matt
One thing that makes me think I'm back on the right track with the CDB format and cursor stuff is that I'm ripping out quite a bit of hard-to-follow code in my experimental branch. Feels good :)
Posted almost 15 years ago by matt
OK, so it's not as bad as I thought. First of all, I discovered the CDB database library and tried translating its format into Python. My home-grown on-disk associative array format is basically an ordered linear list, but chunked up and written ... [More] out as small pickled blocks containing a range of key-values, with the key ranges of each block kept in an index -- basically taking advantage of the speed of cPickle, which is much faster at reading data off of disk than looping through it with pure Python. CDB uses a two-stage hashing algorithm, where a key is hashed, the hash is assigned to one of 256 buckets, and then within that bucket it's stored in a hashtable. Turns out my CDB implementation is 2x faster than my homegrown format when the keys are close together (the homegrown format does OK there because it's inherently caching), but 100x faster for random keys. That's pretty exciting. (Writing out the CDB format turns out to be roughly 30% slower, but I can live with that for those kinds of lookup improvements.) Whoosh uses the fact that keys in the on-disk tables are ordered lexicographically (like in a B-tree) all over the place. Luckily the CDB format stores its keys and values in insertion order, so all I have to do is insert them in lexicographic order like I already am doing. One thing the current format can do is (again like a B-tree) let me say "go to this key in the table", and if the key doesn't exist it returns the closest key following the one I asked for. This turns out to be very useful in a search engine (e.g. for prefix queries). But since the CDB format is a hash, trying to get the position of a non-existent key just gives you an error. I solved this in a simplistic way: I subclassed the CDB format so that as I insert key/value pairs, every Nth key is added to an index list. So now I can simulate the old "position at or after this key" behavior by looking up the closest key <= the target key in the index list, and then reading linearly forward from there (since CDB stores keys and values sequentially) to the matching or following key. The trick will be choosing N, which you would want to be small for small indexes (so you don't have to read linearly through large numbers of key-values), but big for big indexes (so the list of index keys doesn't get huge and take up a lot of RAM). Second, I did some more synthetic benchmarking and discovered that, even though using a cursor object is much slower than using a generator (probably because of all the dot-name lookups), it seems like that's more than compensated for by not having to read and process so much when you can use the cursor object's skip_to() method in an AND query to skip over non-matching postings. I've also got some momentum going adding documentation to Whoosh, even though none of it has been checked in yet. I'm using Sphinx, and while I still hate reStructuredText, the convenience of the rest of Sphinx makes it worth putting up with rST. I need to write scripts so I can generate the HTML docs, include them with source uploads to PyPI, and upload them to the Whoosh website when I make a release. [Less]
Posted almost 15 years ago by matt
I'm beginning to feel like striving for performance in a Python library is counter-productive. Every thing you do to eek out performance warps the code, making it less clear. And it seems like every time you try to make something faster, it actually ... [More] gets slower. The issue that's currently chafing my butt is the performance cost of attributes, as it pertains to generators vs. custom iterators. Consider the following two ways to implement a counter iterator... # Method returning a generator class A(object): def __iter__(self): counter = 0 while True: counter += 1 yield counter # # Custom iterator class class B(object): def __init__(self): self.counter = 0 def __iter__(self): return self def next(self): self.counter += 1 return self.counter Because it uses local variables, the generator will is roughly ten times faster than the custom iterator. Currently, when reading through posting lists (the list of documents a certain word appears in), I use simple generators that yield every document in the posting list. I wanted to implement a useful and long-known optimization where you skip through the posting lists to avoid reading document information you don't need. For example, if you're doing an AND between two terms, and the first document in which term A appears is document 10, then in the posting list of term B, you can skip any documents lower than 10. The problem is that to do this I need to switch from generators to a custom "cursor" type iterator with a method to allow skipping through postings lists. By analogy with the earlier code examples: class B2(object): def __init__(self): self.counter = 0 def __iter__(self): return self def next(self): self.counter += 1 return self.counter def skip_to(self, n): self.counter = n But since this involves a attribute getting/setting and explicit looping, in order to make reading through posting lists faster and more efficient, I first have to make it ten times slower. ARG!!! Now, I could try to simulate "skip to" functionality in a generator using a "yield expression" and send(), but send() is not supported in Python 2.4 (or earlier). I find that sending information into a generator and receiving it from a generator at the same time is counter-intuitive and leads to hard-to-follow code. I'd have to call next() and send() in a Python loop instead of using for...in, so the loop would still be roughly four times slower than a straight generator. I'm not sure the speed benefits of skipping through the posting list would make up for the much slower looping. Another thing makes me wonder about whether trying for performance is worthwhile: hopefully Unladen Swallow will come along some day and reduce or eliminate a lot of performance bottlenecks in the current CPython. It's possible that by crafting my code in weird ways to find the most efficient code paths in the current CPython, I'm writing something that will be slower on a future CPython than a sane implementation. [Less]
Posted about 15 years ago by matt
The latest PyPI release of Whoosh (0.1.11) contains changes to the index format (I changed to a sane storage format for the per-document field lengths) that are not backward compatible. You will need to recrease any existing indexes with the new version. Sorry for the inconvenience.
Posted about 15 years ago by matt
As a followup to my last post, I've been thinking about how speeding up Whoosh using array.tofile() and fromfile() raises a conflict between two of my goals for Whoosh. Whoosh is supposed to be a fast (for Python) search library. But I also ... [More] envisioned it as a useful bit of source code to hobbyists and maybe even serious researches, who could take advantage of the dynamic nature of Python to do quick experiments with it. Using the array methods will speed up Whoosh by quite a bit (at the macro level, not 200x -- I meant to put a ;) after that bit -- but quite a bit). But while it gets Whoosh closer to being as fast as Python can go, it will also warp the implementation into something that doesn't make any sense outside of the Python interpreter. That is, it would be a horrible way to write a search library except for the fact that it's also the fastest way given the nature of the Python interpreter, where increasing the percentage of your program that touches C code is more important than being clever or "doing it right". I still think the speed improvements are worth it, because above all else I'd like Whoosh to be of practical use, for example to projects like Trac and MoinMoin? that have search functions but can't require native libraries. And the best way to do that is to be fast. But I'll be sorry to remove some of the "clever" bits. [Less]
Posted about 15 years ago by matt
The great perversion of trying to write high-performance code in Python is that it has almost nothing to do with being clever. It really comes down to writing your program in such as way that it touches as much C code as possible. Some things you ... [More] would do for performance in another languag would be a disaster in Python. For example, it would be foolish to implement a perfect hash function in Python for better performance, because a Python implementation could never be as fast as the C implementation of the built-in dict class. When I began implementing Whoosh, I quite shamelessly used Lucene as a template, and then elaborated on it, sometimes implementing things -- completely customizable posting formats, for example -- beyond what Lucene offers. Lucene uses varints (variable-length integers) and deltas to compress postings lists, so instead of four bytes per number for a 32-bit integer, most numbers are stored in one byte (using deltas keeps the numbers small enough to fit in one byte). This not only saves space, but in low-level languages like C and Java, it's faster, because it helps the disk cache. My implementation of this method in Whoosh (along with the fact that I use zlib to compress stored document metadata by default) is one of the reasons Whoosh indexes are typically extremely small compared to those created by most search libraries. The downside of using varints for Whoosh is that reading and writing varints is (obviously) not a built-in function of Python. So in some of the tightest loops in the search code, I'm using relatively slow Python constructs, like explicit for loops. Recent tests I've run suggest using array.tofile() and array.fromfile() is about 200 times faster than writing and reading postings "manually" (that is, explicitly in Python). Which makes sense, because these array methods do both the looping and IO in C, where I'm currently doing it in the interpreter. Of course the downside of using array is that it doesn't do any compression -- every number, no matter how small, takes up four bytes on disk. So you'd expect the on-disk size of a Whoosh index to get much larger, probably doubling. For my use-case (a medium-size online help system of about 6000 documents), I'd guess the index would probably go from approx. 4MB to somewhere around 8MB. However, the uncompressed nature of on-disk arrays has a potential advantage: because the numbers are fixed-size, I could use offset file.seek()s to shortcut certain operations. On balance, considering the price of disk space these days, 100-200x better searching performance probably justifies doubling the size of the index. Combined with ideas for chunked posting formats I've cribbed from the Google presentation that recently showed up on the search blogs, I think searching could potentially be much faster in a future version of Whoosh. [Less]
Posted about 15 years ago by matt
Well, that was interesting. Whoosh got noticed in a few places, and a few people are playing with it. Now I need to work on three things. Documentation Finishing features Exploring performance Documentation is what I'm going to be focusing on ... [More] first, and right away. I think it's awesome that people are trying Whoosh out just based on docstrings and the GettingStarted page, but I really need to write a user guide, get down some design notes, and write about how to accomplish things in Whoosh. It would be cool to do a screencast or something too. But writing documentation is my day job; I need to find motivation to do it in my spare time too. There are quite a few things that are coded and more-or-less-working-probably (e.g. highlighting search results excerpts) but not perfectly integrated, or wanting a better API. There are also some code I wrote just to get things to a releasable state that I need to revisit. That's what I'll probably work on after getting the documentation going. Increasing performance is going to be tricky. Given the design parameters, I feel like I'm running up against the speed of the Python interpreter. Finding more performance may mean being extremely clever about doing less work, and exploring parallel processing. I can also try working with very large indexes and see what kind of tunable parameters might help with that. [Less]
Posted about 15 years ago by matt
Yay! Anonymous access to the whoosh repository is fixed.
Posted about 15 years ago by matt
To summarize the initial release of Whoosh, Day Two: I screwed up setup.py and uploaded incomplete distributions. Twice. Anonymous access isn't working for the Subversion repository. I've got a question in to my service provider about it, but until ... [More] I get a reply there, try using username=guest and password=guest, with my apologies. Oh, and I need to write more unit tests, user docs, and figure out a way to integrate generated API docs with Trac. Did I mention that I'm not a professional developer? I should probably put that on the wiki somewhere ;) [Less]
Posted about 15 years ago by matt
The trials, tribulations, and triumphs of one man's journey to create a kick-ass Python search engine library.