SUNDANCE Bug Fixes & Design Decisions




Optimization

Optimization is what distinguishes version 1.2 of Sundance from version 1.1. Version 1.2 has been rewritten to clean up a number of naively implemented code sections, including: changing the hash table from linear probing to a secondary hashing function; moving virtually all calls to a list object's length() function out of the test of for loops (there were several hundred of these); and clearing out the reference counting hash table at sentence::reset() even if some leaking objects were still around. The results were dramatic -- with version 1.1 the 1700 terrorism texts took over 10 hours to process on ursa25. With version 1.2 (using preprocessed texts) that number dropped to 1:44, and ursa19 accomplished the same task in 41 minutes.
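
To make the length() change concrete, here's roughly what the hoisting looks like; the List class below is just a stand-in for Sundance's actual list type, whose length() presumably has to walk the list each time it is called:

    #include <cstdio>

    // Sketch only: a stand-in for Sundance's list class.
    struct List {
        int items[8] = {10, 20, 30, 40, 50, 60, 70, 80};
        int count = 8;
        int length() const { return count; }   // imagine this walking a linked list
        int get(int i) const { return items[i]; }
    };

    void process(int x) { std::printf("%d\n", x); }

    int main() {
        List words;

        // Before: length() is re-evaluated on every pass through the loop test.
        //   for (int i = 0; i < words.length(); i++) process(words.get(i));

        // After: the length is cached once, which is safe as long as the loop
        // body doesn't add or remove elements.
        int n = words.length();
        for (int i = 0; i < n; i++)
            process(words.get(i));
        return 0;
    }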

Still left to do, in terms of optimization: revisit the primary hash function, which is inherently slow due to its inner loop, and at some point rewrite the segmenters to get rid of the reference-counting-based system.
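
For reference, the kind of double-hashing probe that replaced linear probing looks roughly like this; the table size, layout, and hash functions below are illustrative assumptions, not the actual Sundance hash table code:

    #include <cstring>

    const int TABLE_SIZE = 211;                       // assumed prime table size

    // Primary hash: the per-character loop here is the "inner loop" that makes
    // hashing comparatively expensive.
    unsigned hash_string(const char* key) {
        unsigned h = 0;
        for (const char* p = key; *p; ++p)
            h = h * 31 + (unsigned char)(*p);
        return h;
    }

    // Double hashing: instead of stepping to the next slot on a collision
    // (linear probing), each probe advances by a second, key-dependent step.
    int find_slot(const char* keys[], const char* key) {
        unsigned h = hash_string(key);
        unsigned start = h % TABLE_SIZE;
        unsigned step = 1 + (h % (TABLE_SIZE - 1));   // secondary hash; never zero
        for (int i = 0; i < TABLE_SIZE; ++i) {
            unsigned slot = (start + (unsigned)i * step) % TABLE_SIZE;
            if (keys[slot] == nullptr || std::strcmp(keys[slot], key) == 0)
                return (int)slot;
        }
        return -1;                                    // table is full
    }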

THE Memory Leak

Well, after two days of debugging, I found it. Yep, 'it' would be the memory leak. I just ran Sundance over the 374 relevant trec-125 texts without a crash and without any memory growth. (I had turned off dictionary growth for this test.) It took 1 hour 17 minutes on ursa19, and literally half of that time was spent on 3 mega-texts that are Federal Register files. So, you're probably wondering what the leak was... Here are the nasty details:

First of all -- it's leaks, not leak.

1) The code that set the 'attach' pointer for a constituent (when a PP gets attached, this pointer is set to point back to the constituent it attaches to) wasn't very smart. At certain points, when the 'attach' pointer was nil, the system would still record in the reference count class that something pointed to nil. This didn't keep dynamically allocated objects from being destroyed, but it did let the reference count class, essentially a hash table, grow unnecessarily.
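
Roughly, the fix amounts to something like the following; RefCount and the method names are stand-ins for the real classes, the point is just the nil check before recording:

    #include <set>

    struct Constituent;

    // Stand-in for the reference count class (essentially a hash table).
    struct RefCount {
        std::multiset<const Constituent*> refs;
        void Add(const Constituent* c) { refs.insert(c); }
    };

    struct Constituent {
        Constituent* attach = nullptr;

        void SetAttach(Constituent* target, RefCount& table) {
            attach = target;
            // The old code recorded the reference even when target was nil,
            // which let the reference count table grow for no reason.
            if (target != nullptr)
                table.Add(target);
        }
    };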

2) The hash table code had a bug in the [] operator. So, if you were scanning the reference count class using that method, you'd occasionally get bogus values. This made debugging the damn reference count thing just a little bit frustrating. It's fixed now.

-- Now for the biggies --

3) The sentence class has a function 'reset' which is supposed to go through the entire tree and delete all the constituents that can be deleted, i.e. those that are not referenced more than once. The implementation of this function was not recursive in any way. So, if you had a sentence that had an NP and VP at the top, and then a bunch of other levels below, only the NP and VP would be removed, leaving the rest of the tree sitting in memory. (***AAAAAAAAAAA!!!!***)

I rewrote this function to do a depth-first traversal of the tree while removing constituents. But, that wasn't quite good enough.

When a PP is attached to, say, an NP, the attach pointer is set so that the NP has a reference count of 2 (one for itself, and one for the PP that attaches to it). Since the NP comes before the PP in the depth-first search, the NP never gets deleted: the reference it picked up from the PP isn't decremented until after its one chance for deletion has come and gone. I fixed this by adding a quick depth-first pass that clears the 'parent' and 'attach' pointers before the deletion traversal. Since the only time 'reset' is called is when we really want to blow away the entire sentence structure, removing these pointers does no harm, and it lets the constituents be cleanly removed from memory.

So, Sentence::reset now simply calls two functions:
Constituent::ClearChildrenAttachPointers()
Constituent::DeleteChildren()
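
For concreteness, here's a rough sketch of how those two passes fit together; the data members (children, parent, attach, and a reference count) and the Sentence layout are assumptions -- only the two function names above come from the actual code:

    #include <vector>

    struct Constituent {
        std::vector<Constituent*> children;
        Constituent* parent = nullptr;
        Constituent* attach = nullptr;   // set when a PP is attached to this node
        int ref_count = 1;               // one reference for the node itself

        // Pass 1: depth-first walk that clears the parent/attach pointers so
        // no constituent is kept alive by a cross-reference.
        void ClearChildrenAttachPointers() {
            for (Constituent* child : children) {
                child->ClearChildrenAttachPointers();
                if (child->attach != nullptr) {
                    child->attach->ref_count--;   // drop the reference held via 'attach'
                    child->attach = nullptr;
                }
                child->parent = nullptr;
            }
        }

        // Pass 2: depth-first (post-order) deletion of everything that is no
        // longer referenced by anything else.
        void DeleteChildren() {
            for (Constituent* child : children) {
                child->DeleteChildren();
                child->ref_count--;               // drop the tree's own reference
                if (child->ref_count <= 0)
                    delete child;
            }
            children.clear();
        }
    };

    struct Sentence {
        Constituent root;

        // reset() now simply runs the two passes over the whole tree.
        void reset() {
            root.ClearChildrenAttachPointers();
            root.DeleteChildren();
        }
    };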

4) I thought that was it, but I noticed that the system would still leak when sentences with PP's were handled. It turns out that this symptom points to an approach that I'm not particularly happy with, but it's widespread in Sundance's parsing and I found a way to work around it. Here's the deal:

When Sundance does its segmenting, it builds new constituents or constituent trees, throws away the original trees, and replaces them with the new ones. The upshot is that lots of constituents get created, copied, and later destroyed, rather than keeping one set of constituents for, say, the words of a sentence and just manipulating pointers to build the tree on top of them. (The constituent objects that represent the leaf nodes when parsing is finished are NOT the same constituent objects that represented the words when the sentence was first read in.)

The problem is that this approach is applied inconsistently. We like to think we can just create and copy a bunch of constituents, but PP's carry an attachment pointer to some other constituent, and that pointer doesn't get updated properly. So these PP constituents keep referring back to older constituents that are left in the dust as segmentation happens.

The only solution to this, short of rewriting ALL the segmentation code, is to make sure PP attachment is done last. So, I just moved the PP attach function call to AFTER all of the np, vp, parens, and clause segmentation calls. It seems to work alright.
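
The reordering amounts to a driver along these lines; np_segment() and vp_segment() are real function names (see the gerund section below), while the other names are just placeholders for the corresponding steps:

    struct Sentence;
    void np_segment(Sentence& s);
    void vp_segment(Sentence& s);
    void parens_segment(Sentence& s);     // placeholder name
    void clause_segment(Sentence& s);     // placeholder name
    void pp_attach(Sentence& s);          // placeholder name

    void segment_sentence(Sentence& s) {
        np_segment(s);
        vp_segment(s);
        parens_segment(s);
        clause_segment(s);

        // PP attachment now runs last, so the attach pointers it sets refer
        // to the final constituents instead of copies that later segmentation
        // passes would have thrown away.
        pp_attach(s);
    }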

Gerund Handling

Gerund handling is done in several steps. First, the morphological rules must be turned on so that GERUND and NOUN tags can be generated during the tagging stage. Second, the segmentation functions (np_segment() and vp_segment()) must take GERUNDs into account. This actually happens in the lower-level heuristics (suggest_new_np() and heuristics_suggest_noun()); see these functions for more specifics.
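
Purely as an illustration of the idea (not the actual heuristics, which live in suggest_new_np() and heuristics_suggest_noun()), the check amounts to something like this:

    #include <set>
    #include <string>

    // A guess at the kind of test the lower-level heuristics make once the
    // morphological rules can emit GERUND tags: a GERUND is treated as a noun
    // candidate alongside NOUN, so np_segment() can start an NP at a gerund.
    bool could_be_noun(const std::set<std::string>& tags) {
        return tags.count("NOUN") > 0 || tags.count("GERUND") > 0;
    }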

Initial tests have demonstrated that gerund handling is working well, but it remains to be seen if this will continue in the face of exhaustive testing...