XProgramming > XP Magazine > XST: Complex Scope
COLLECTED TOPICS: Kate Oneal | Adventures in C# | Documentation in XP | Book Reviews
XST: Complex Scope
Ron Jeffries
11/21/2005
In which we discover that our implementation is "totally wrong" and we have to rewrite everything. Or do we?

A Clarifying Test

The tests we have so far satisfy me that I have a working restrict operation, but they may not satisfy you. Let's write a test that may make it more clear what's going on.

    def test_name_restrict
      nameData = 
        "Jeffries    Ron Hendrickson ChetAnderson    Ann Johnson     Lee "
      input = XSet.new(16, nameData)
      selectData = "HendricksonJeffries   "
      select = XSet.new(11, selectData)
      expected = "Jeffries    Ron Hendrickson Chet"
      result = input.restrict(select)
      assert_equal(expected, result.contents)
    end

This test mimics the example discussion in our first article. We have had to rearrange things a bit, and that's the focus of this article. We have four records, containing name information, last name first, first name last. Last name starts in byte 0, and first name starts in byte 12. We restrict that data with two last names, starting in byte 0. Our restriction works, selecting, in this case, the first two records.

Limitations -- A Test That Cannot Work

Now we can see some limitations of our implementation so far. The sets themselves are hard to create; we have to count the byte positions; we have to compare the bulk string contents instead of something easier. Most important (to me, at the moment) is that our restrict can only restrict things starting from the beginning of the record.

Suppose that we wanted to find all the name records that had first name Ron or Lee. We can't express it this way, at least not if we want it to work:

    def test_firstname_restrict
      nameData = 
        "Jeffries    Ron Hendrickson ChetAnderson    Ann Johnson     Lee "
      input = XSet.new(16, nameData)
      selectData = "Ron Lee "
      select = XSet.new(4, selectData)
      expected = "Jeffries    Ron Johnson     Lee "
      result = input.restrict(select)
      assert_equal(expected, result.contents)
    end

The problem is that our current restrict operation compares starting at byte 0 of the records of the two sets. Set-theoretically, the restricting set we have in mind is something like:
{{R12, o13, n14}, {L12, e13, e14}} but the one we have is
{{R0, o1, n2}, {L0, e1, e2}}.

Our test_firstname_restrict therefore fails. The restrict returns no values, because no last names match "Ron" or "Lee" in the first three letters.

What shall we do? We can make the test run by giving the restrict an additional parameter, an integer by which to offset the comparison. That will work in this case, and in any case where we want to do a selection against contiguous bytes. But there are at least two other interesting cases, increasingly difficult:

  1. Selection against two fields that are not contiguous in the input data set. What if we wanted people named "Ron" from "Omaha"?
  2. Selection against different fields in different selecting records. What if we wanted people with last name "Jeffries" and people with first name "Chet"?

Ultimately, we'd like the restrict operation not to take additional parameters. We'd like the oddness of these queries to be contained in the sets. In the case of the firstname test above, we would like our select set to know that "Ron" and "Lee" occupy bytes 12-14 of two otherwise empty records! For that case, all the records would be alike, and the field contiguous. In the "Ron from Omaha" case, the records would be alike, but the fields not contiguous. In the "Jeffries or Chet" case, however, each record of the restricting set may be different from the others.

We would like all this information to be included in the input sets. And, therefore, in principle, at least, the restricted set might also consist of records that are offset, noncontiguous, and all different. This is a knotty problem!

Some Possible Solutions

We have a broken test. We could "just" make it run. Then we could write a harder test, one that would lead us in the direction we need to go, which is pretty wel sketched above. Interestingly, I believe that we'll wind up going in a direction I rejected earlier.

Remember that I was pushing toward comparing the bytes one at a time, and backed off from that solution? My issue then was performance, and I guessed that performance would not be improved by going to single bytes. It appears to me now, looking forward, that we'll get to a byte-by-byte solution. Let me explain. No, it is too much. Let me sum up.

If the selecting record can be offset and discontiguous, perhaps like this:
{R12, o13, n14, O30, m31, a32, h33, a34}, then we won't be able to pull out the whole selecting record and compare it, as a slice, with the beginning of the input record. We will have to compare each byte based on its logical position -- its superscript -- not on its physical position in the byte stream. It's true that in a system with a compare that could scan several bytes at once, we might optimize this to compare little substrings, but basically we've got to have the ability to match down to a single byte at a time, skipping around in one record relative to the other.

YAGNI???

Am I violating the YAGNI principle here? I'm thinking about code for which I do not have a broken test. My test can be made to run by providing an offset, which could easily enough be embedded in the XSet for the selecting set, and pulled out and used in the restrict code. But I'm looking ahead to a more complex series of solutions:

  1. The restrict we have now is zero-based on both records.
  2. We could solve the first name problem with a single offset in the selecting set.
  3. We could solve the discontiguous selection problem with a single vector of superscripts in the selecting set.
  4. We could solve the discontiguous selection with different records with a vector of superscripts for each record.

It's that last one that calls for skipping around one byte at a time. By the time we get there, we might have an entirely different solution. Is it a violation of YAGNI to be thinking about where we might be going?

Well, no. I believe that I invented YAGNI, and it is a rule intended to keep us from building code before its time. YAGNI prohibits coding a feature we do not need. It doesn't prohibit us from coding features we do need, and it certainly does not prohibit thinking. Thinking is good. Essential, in my view.

Where were we? Oh, yes ...

... what to do about these more convoluted sets. It appears that, ultimately, each record of an XSet will have to know, not just the elements it contains, but the scope (superscript) of each element -- in other words, each byte.

But wait, there's more. Recall that my input and selecting sets are actually sets of records. And records are sets. Remember this definition of restrict, from an earlier article:

Restrict(R, S) = R restrict S = { r | r elt R & exists s elt S & r intersect s = s }

Note in that definition that r and s, records of the input and selecting set, are being subjected to the intersect operator. Those records are sets! But our code implementation of restrict is ignoring the set nature of the records. It has implicitly reduced them to strings of bytes. It might be that we should fix that, making restrict aware that its records are sets, and using intersect on them.

Here are some issues, pro and con, with that notion:

  • CON: I'm pretty sure that I want the base representation of sets to be flat arrays or fixed-length records, because I know how to do that and I am thinking about performance. (Premature optimization?)
  • CON: Putting superscript information on every byte would more than double the size of every data set.
  • PRO: Sticking closer to the set theory might pay off in generality and in learning.
  • PRO: One way or another, we have to compare byte by byte. Building that notion into the data (the XSets) will be better than building it into the code.

After Lunch ...

Chet and I met at the Seattle's Best Coffee in Borders in Brighton and I brought him up to date on the last article, which he had only skimmed, and this one. In describing the sets involved in the complex queries above, the path became more clear. If this thing is to adhere to the kind of mathematical integrity that Dave Childs talks to me about, then the current implementation has gotten to the bytes too quickly.

The current implementation is not taking into account that the records are also sets, and that the real definition of restrict is that a master record is selected if there is a record in the selecting set that is a subset of the master record. What I've implemented does answer that question in terms of arrays of bytes, but the subset notion is not explicit, nor is the fact that each record is itself a set. That needs to be fixed.

There is also a bug, though it would take an odd test to exhibit it right now. If more than one record in the selecting set matches the same master record, that record will be sent to the output once for each match. The intention is that it will be sent just once. The code checks all the selecting records, doesn't break on the first one that matches. I'll write a test for that and then fix it. And I'll comment out that first name test for now, since it is just there to show that the current scheme can't work.

    def test_single_selection
      nameData = "Jeffries    Ron Hendrickson ChetAnderson    Ann Johnson     Lee "
      input = XSet.new(16, nameData)
      selectData = "Jeffries   Jeffries   "
      select = XSet.new(11, selectData)
      expected = "Jeffries    Ron "
      result = input.restrict(select)
      assert_equal(expected, result.contents)
    end

The test has two selectors for Jeffries. We want the Ron Jeffries record to be selected or not selected, not selected twice. The test fails:

  1) Failure:
test_single_selection(TC_MyTest) [./restricttest.rb:49]:
<"Jeffries    Ron "> expected but was
<"Jeffries    Ron Jeffries    Ron ">.

The fix is to put a break statement in restrict:

     def restrict(aSet)
       resultContents = ""
       for myRecordIndex in recordRange
         for hisRecordIndex in aSet.recordRange
           if match(myRecordIndex, aSet, hisRecordIndex)
             resultContents << record(myRecordIndex)
             break
           end
         end
       end
       XSet.new(@recordLength, resultContents)
     end

All better now. That alligator is dead. We can get back to draining the swamp.

Totally Wrong!

There has been a discussion going on on the TDD list, regarding what happens when a test shows you that your current implementation is totally wrong, and you will have to completely rewrite all your code. I've been saying that it seldom if ever happens to me that I have to rewrite the whole program, and people have been accusing me of things like being more experienced or incredibly lucky. Both are, of course true. One fellow there even accused me of being a couple of years older than I actually am. Can you imagine??

And now, here we are. Our implementation is about as wrong as it can get, isn't it? We're processing records and comparing bytes, and we need to be processing sets and asking whether one is a subset of another. Surely we have to start over?

Well, it's tempting. But if I were to start over, it would make everyone happy, and I can't be doing that. Let's see whether we can refactor the code to do what we need. Take a look at restrict, at match, and at record:

     def restrict(aSet)
       resultContents = ""
       for myRecordIndex in recordRange
         for hisRecordIndex in aSet.recordRange
           if match(myRecordIndex, aSet, hisRecordIndex)
             resultContents << record(myRecordIndex)
             break
           end
         end
       end
       XSet.new(@recordLength, resultContents)
     end
     
     def match(myRecordIndex, aSet, hisRecordIndex)
       record1 = self.record(myRecordIndex)
       record2 = aSet.record(hisRecordIndex)
       record1[0...record2.length]==record2
     end
     
     def record(recordIndex)
       @contents[recordIndex*@recordLength...recordIndex*@recordLength+@recordLength]
     end

Can we change match so that it processes XSets rather than byte arrays? Sure, why not? Maybe we can make the record method return an XSet and then ask whether the one XSet is a subset of the other. There is an issue to be concerned about, however. An XSet, as presently defined, is known to be an array of records, of some record length. A record, on the other hand, is not (right now at least) an array of records, it is an array of bytes. Let's see whether maybe it won't matter. An idea I have is to make the record set be a set of record length 1, and see whether that will keep us moving forward. Let's recode thusly, making record() return an XSet, and changing match to ask the records (sets) whether one is a subset of the other:

     def match(myRecordIndex, aSet, hisRecordIndex)
       record1 = self.record(myRecordIndex)
       record2 = aSet.record(hisRecordIndex)
       record2.subset?(record1)
     end
     
     def record(recordIndex)
       recordContents = @contents[recordIndex*@recordLength...recordIndex*@recordLength+@recordLength]
       XSet.new(1,recordContents)
     end

Now it's just a Small Matter Of Programming to implement subset?. Let's try ...

Ha! Works. The main issue was that when record returns an XSet, a lot of people had to look at record.contents. It's a little bit shaky right now but all the tests are working. I'll paste in the code and explain it when I get home from T'ai Chi class. Or in the middle of the night ...

Not So Totally Wrong After All

Here's the code as it stands now, passing all the tests (except for the commented-out first-name one, which cannot pass until we do something about it). I'll highlight the interesting changes:

  class TC_MyTest < Test::Unit::TestCase
  
    def test_cardinality
       input = XSet.new(4, "123 234 132 342 abc ")
       assert_equal(5, input.cardinality)
    end
    
    def test_record_bytes
      input = XSet.new(1,"abcdef")
      assert_equal("b", input.record(1).contents)
    end
    
    def test_recordRange
       input = XSet.new(4, "123 234 132 342 abc ")
       assert_equal(0...5, input.recordRange)
    end
    
    def test_record
       input = XSet.new(4, "123 234 132 342 abc ")
       assert_equal("132 ", input.record(2).contents)
    end
    
    def test_match
       input = XSet.new(4, "123 234 132 342 ")
       select = XSet.new(1,"1")
       assert(input.match(2, select, 0))    
    end
    
    def test_restrict
      input = XSet.new(4, "123 234 132 342 ")
      select = XSet.new(1,"1")
      expected = "123 132 "
      result = input.restrict(select)
      assert_equal(expected,result.contents)
    end
     
    def test_name_restrict
      nameData = "Jeffries    Ron Hendrickson ChetAnderson    Ann Johnson     Lee "
      input = XSet.new(16, nameData)
      selectData = "HendricksonJeffries   "
      select = XSet.new(11, selectData)
      expected = "Jeffries    Ron Hendrickson Chet"
      result = input.restrict(select)
      assert_equal(expected, result.contents)
    end
     
    def test_single_selection
      nameData = "Jeffries    Ron Hendrickson ChetAnderson    Ann Johnson     Lee "
      input = XSet.new(16, nameData)
      selectData = "Jeffries   Jeffries   "
      select = XSet.new(11, selectData)
      expected = "Jeffries    Ron "
      result = input.restrict(select)
      assert_equal(expected, result.contents)
    end
     
#    def test_firstname_restrict
#      nameData = "Jeffries    Ron Hendrickson ChetAnderson    Ann Johnson     Lee "
#      input = XSet.new(16, nameData)
#      selectData = "Ron Lee "
#      select = XSet.new(4, selectData)
#      expected = "Jeffries    Ron Johnson     Lee "
#      result = input.restrict(select)
#      assert_equal(expected, result.contents)
#    end
  end
   
   class XSet
     attr_reader :contents
   
     def initialize(recordLength, contents)
       @recordLength = recordLength
       @contents = contents
     end
     
     def restrict(aSet)
       resultContents = ""
       for myRecordIndex in recordRange
         for hisRecordIndex in aSet.recordRange
           if match(myRecordIndex, aSet, hisRecordIndex)
             resultContents << record(myRecordIndex).contents
             break
           end
         end
       end
       XSet.new(@recordLength, resultContents)
     end
     
     def match(myRecordIndex, aSet, hisRecordIndex)
       record1 = self.record(myRecordIndex)
       record2 = aSet.record(hisRecordIndex)
       record2.subset?(record1)
     end
     
     def record(recordIndex)
       recordContents = 
      @contents[recordIndex*@recordLength...recordIndex*@recordLength+@recordLength]
       XSet.new(1,recordContents)
     end
     
     def subset?(aLargerSet)
       for recordIndex in recordRange
         unless aLargerSet.contains?(record(recordIndex), recordIndex)
           return false
         end
       end
       return true
     end
     
     def contains?(aRecord, anIndex)
       record(anIndex).contents == aRecord.contents
     end
     
     def recordRange
       0...cardinality
     end
     
     def cardinality
       @contents.length / @recordLength
     end
   end

So, what happened? We've made a fairly major design mistake, namely that the records of a set have to be sets themselves. (This is clear from the theory, but my focus on byte manipulation caused me to close my eyes to it.) And we're surely not done with the ramifications that will follow from that mistake. But we chose not to do a major rewrite all in a bang, but instead to make the necessary changes (record returns an XSet, and restrict does subset?) and then make the tests run. That turned out not to be difficult.

Was it luck? Skill? The results of advancing age? I'm not sure. I think, though that it comes from keeping the design pretty clean, and from having a firm commitment toward small changes rather than radical rewrites. If we decide something is impossible, it will be. I try not to decide that small steps are impossible, and often, they turn out to be possible.

Sure, there's skill involved. Programming, in my opinion, is all about skill, and I expect that it always will be. So, practice, man, practice.

What's Wrong Now?

The code still isn't bad, except for a little duplication here and there. However, I'm concerned that the changes just made are not robust. There are issues about the subset operation -- such as whether it will work on a set that is a relation. And I'm not entirely comfortable with the way that XSet containing the record is created -- we're using the record method to get its elements out of it. The elements are definitely not records -- they are elements, which may or may not be records.

So there are issues of expressiveness here -- elements vs records, for example. And there are probably some tests missing for the subset? operation, which will either demonstrate that it's robust, or, more likely, show that it isn't. Finally, I'm concerned that, though I'm trying to make this code reflect the mathematics accurately, it may not as yet do so.

Does that mean that I should have rewritten everything? I don't think so. When we do TDD, we expect that our code will, along the way, not be entirely robust, and not reflect the final needs. It's time for some more tests ... which we'll work on next time.

XProgramming > XP Magazine > XST: Complex Scope
COLLECTED TOPICS: Kate Oneal | Adventures in C# | Documentation in XP | Book Reviews