The tangler

Now we get to the part where the real work of tangling source code begins. An important basic step is determining the names of the elements that require processing, and this is done by finding the name of the element within a lookup table which converts the name into a number. The numbers that represent each element are here, in the file elements.h as are the strings that represent the pertinent attributes. These numeric values are used below in the table elements in the file tangle.c.

      
--Code fragment from file: elements.h--


#ifndef _ELEMENTS_H

#define _ELEMENTS_H

#define ELEM_UNKNOWN  0
#define ELEM_CODE     1
#define ELEM_FRAGMAP  2
#define ELEM_FRAGMENT 3

#define ATTR_FILENAME "filename"
#define ATTR_NAME "name"

typedef struct {
  char *name;
  int tval;
} elem;

extern elem elements[];

#endif

      
    

Here begins the file tangle.c. It begins with the inclusion of the usual files.

      
--Code fragment from file: tangle.c--


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <expat.h>
#include "xml-lit.h"
#include "elements.h"
#include "hash.h"

      
    

The fragment data structure

We need to first build up a representation of each tangle result file in memory. Each tangle result file consists of one or more fragments. Each fragment can contain either literal code or another fragment, so the most amenable data structure for use with this is a tree of unions. This is given in the structure fragment, whose fields are as follows:

The fields of a fragment structure

frag_t type

The type of fragment. This can have the value FRAG_ROOT, FRAG_FILE, FRAG_NAMED, or FRAG_TEXT, which represent either the root of the fragment tree, top-level file fragments, named fragments, or text fragments.

union metadata

This contains the metadata associated with the fragment. If the fragment is of type FRAG_FILE or FRAG_NAMED, the metadata that applies is name, and gives the name of the file into which the fragment's children should go in the former case, or the name of the fragment in the latter case. If the fragment is of type FRAG_TEXT, what applies is line, and is the line number in the XML file where the text of the fragment begins. There is no metadata associated with FRAG_ROOT.

union frag_data

This contains the actual data of the fragment. If we are a FRAG_ROOT, FRAG_FILE, or FRAG_NAMED, it contains a pointer to another fragment given by the child which points to the fragment's children. FRAG_TEXT fragments are leaves in the tree, so text is the field in the union which applies.

int visited

A mark saying this node has been visited.

struct fragment_t *next

A pointer to a sibling of this fragment

        
--Code fragment from file: tangle.c--


typedef enum {
  FRAG_ROOT, FRAG_FILE, FRAG_NAMED, FRAG_TEXT
} frag_t;

typedef struct fragment_t {
  frag_t type;
  union {
    char *name;              /* for named or file fragments */
    int line;                /* for text fragments */
  } metadata;
  union {
    XML_Char *text;          /* if it is a text fragment */
    struct {
      struct fragment_t *head;
      struct fragment_t *tail;
    } frag;                  /* if it is a file or named fragment */
  } frag_data;
  int visited;
  struct fragment_t *next;   /* next fragment */
} fragment;

        
      

The Parser's State

The tangle parser can be described mathematically as a deterministic finite automaton (DFA) whose input symbols are the XML tags in our namespace. The state of our finite automaton not only consists of the code fragment tree described above but also a context data structure, which we declare here:

        
--Code fragment from file: tangle.c--


typedef struct {
  int element_stack[4];    /* stack */
  int esp;                 /* stack pointer */
  fragment *head;          /* root of the fragment tree */
  fragment *cf;            /* context fragment */
  hashtable ht;            /* hash table */
  int startfrag;
} p_context;

static p_context pctxt;


        
      

This data structure is used as the user data within the parser, and is modified in the course of the construction of the tree. Elements that are recognized by our parser are pushed onto the stack. Since the maximum legal depth of nesting of elements in the XML-Lit namespace is only three (code->fragment->fragmap is the deepest), the stack doesn't have to be very deep. It is a semantic error to have nesting of XML-Lit elements more than three levels deep. Such constructions are junk, and will be noted as such. Every time an XML-Lit element start tag is found its tag type is pushed. Every time an end tag is found the tag type is popped. The element_stack field of this structure can hold one entry more than the maximum so we can determine whether an error has occurred. The head is the root of the fragment tree, and the cf member is the current context fragment, the fragment to which additional child nodes are added. The rules governing what this becomes are described in the following section.

The fragment data structure in action

The receipt of one of the elements in xml-lit's namespace or character data causes changes to occur within the fragment data structure within our program:

The sequence of events on receiving an element or character data

  1. If a code element is received:

    1. If the filename given in the element has not yet been seen, create a new FRAG_FILE fragment node describing that file and set the context fragment to point to this node.

    2. If the filename has already been seen, just set the context fragment to point to that node.

  2. If a fragmap element is found, a new FRAG_NAMED fragment node is created which is the child of the current context node, unless a fragment node with the same name already exists. That case is considered an error, but the error will only be detected when the time comes to output the fragments to a file. The context fragment becomes this new fragment.

  3. If a fragment element is found, find the FRAG_NAMED node that matches the fragment name given. It is an error if such cannot be found. The found element becomes the context fragment.

  4. If character data is found within one of the elements we are concerned with:

    1. If the character data is within a fragmap, it is simply ignored. Character data there is only useful during the weaving process.

    2. Otherwise, a new FRAG_TEXT child is added to the context fragment, containing the text we were given.

The handlers modify their behavior based on these rules. Note that what we are doing essentially is creating a limited DOM tree of all the elements in the xml-lit namespace.

Utility functions for new-style parsing

Before we get to the handler code proper, we have a couple of useful utility functions. This first one, verify_localname() is given a universal name the way Expat qualifies it, a namespace URI, and a local name, and returns true if the universal name is the same as the one constructed from the URI and local name.

        
--Code fragment from file: tangle.c--


static int
verify_localname(const char *u_name, const char *namespace_uri,
                 const char *localname)
{
  char *u_name_local;

  /* false if not in namespace */
  if (!CHECKNS(u_name, namespace_uri, strlen(namespace_uri)))
    return(0);
  u_name_local = strrchr(u_name, DELIMCHAR) + 1;
  return(strcmp(u_name_local, localname) == 0);
}

        
      

This next one, get_elemtval() attempts to convert an element universal name given to it into its token value. It returns ELEM_UNKNOWN if it is not recognized, else one of the symbolic names defined in elements.h.

This works by using a lookup table of the element local names which are recognized by the program in new-style processing.

      
--Code fragment from file: tangle.c--


elem elements[] = {
  { "code", ELEM_CODE },
  { "fragmap", ELEM_FRAGMAP },
  { "fragment", ELEM_FRAGMENT },
  { NULL, 0 }
};

      
    
      
--Code fragment from file: tangle.c--


static int
get_elemtval(const XML_Char *el)
{
  int i;

        
      

The procedure followed by get_elemtval()

  1. It first checks to see if the input element belongs to the namespace it is supposed to recognize. If it does not belong to that namespace, we return ELEM_UNKNOWN.

                
    --Code fragment from file: tangle.c--
    
    
    
      /* ignore elements not in the namespace we know */
      if (!CHECKNS(el, NAMESPACE_URI, NAMESPACE_URI_LENGTH))
        return(ELEM_UNKNOWN);
    
              
            
  2. Now, we check to see whether the tag is one of those we are supposed to process. If not, we just ignore it and go back to the parser. Maybe we should flag that as an error too, but since this is a development version we just ignore it for now.

    This is done by looping through the elements array till we find (or not find, as the case may be) the element passed to us. If we do find it we return the token value immediately.

                
    --Code fragment from file: tangle.c--
    
    
      for (i=0; elements[i].name; i++) {
        if (verify_localname(el, NAMESPACE_URI, elements[i].name))
          return(elements[i].tval);
      }
      return(ELEM_UNKNOWN);
    }
    
    
              
            

This function, get_attribval() will get the value of the named attribute, or NULL if the attribute is not present.

        
--Code fragment from file: tangle.c--


static const XML_Char *
get_attribval(const XML_Char **attr, const XML_Char *attr2find)
{
  int i;

  for (i=0; attr[i]; i+=2) {
    if (verify_localname(attr[i], NAMESPACE_URI, attr2find))
      return(attr[i+1]);
  }
  return(NULL);
}

        
      

This last function, add_fragment will add a fragment child to the fragment passed to it.

        
--Code fragment from file: tangle.c--


static fragment *
add_fragment(fragment *parent)
{
  fragment *new_frag;

        
      

Obviously, it's not valid to add a child to a FRAG_TEXT node, so we flag an error if we try. That should never happen if we've written the program properly, but as they say, program defensively. After that preliminary check, we allocate the new fragment and initialize its contents with zero:

        
--Code fragment from file: tangle.c--


  assert(parent->type != FRAG_TEXT);
  new_frag = (fragment *)malloc(sizeof(fragment));
  if (new_frag == NULL)
    error_exit("Memory exhausted\n");
  memset(new_frag, 0, sizeof(fragment));

        
      

We should now attach this new fragment to the parent. If the parent has no children yet, this becomes both the head and the tail and we are done.

        
--Code fragment from file: tangle.c--


  new_frag->next = NULL;
  if (parent->frag_data.frag.head == NULL) {
    parent->frag_data.frag.head = parent->frag_data.frag.tail = new_frag;
    return(new_frag);
  }

        
      

Otherwise, the new element is linked to the tail and the new element becomes the tail, like so:

        
--Code fragment from file: tangle.c--


  parent->frag_data.frag.tail->next = new_frag;
  parent->frag_data.frag.tail = new_frag;
  return(new_frag);
}

        
      

The character data handler

This handler is the simplest of them all. It's active only when the current element is eithercode or fragment, and it only does one thing: add a new FRAG_TEXT fragment child to the context node.

        
--Code fragment from file: tangle.c--


void
ns_char_handler(XML_Parser p, const XML_Char *s, int len)
{
  p_context *ctxt;
  fragment *txtfrag;

        
      

First thing it needs to do is determine the current context fragment. Once it knows this it creates a new FRAG_TEXT node that is a child of this.

        
--Code fragment from file: tangle.c--


  ctxt = XML_GetUserData(p);
  txtfrag = add_fragment(ctxt->cf);

        
      

Last thing it needs to do is fill out the fields of the newly created fragment structure. Nothing really complicated here. A line number entry is added if and only if it happens to be the start of an element. The line number is only stored if it is the start of an element.

        
--Code fragment from file: tangle.c--


  txtfrag->type = FRAG_TEXT;
  txtfrag->metadata.line = (ctxt->startfrag) ? XML_GetCurrentLineNumber(p) : -1;
  if (ctxt->startfrag)
    ctxt->startfrag = 0;
  txtfrag->frag_data.text = (XML_Char *)malloc(sizeof(XML_Char)*(len+1));
  if (txtfrag->frag_data.text == NULL)
    error_exit("Memory exhausted\n");
  strncpy(txtfrag->frag_data.text, s, len);
  txtfrag->frag_data.text[len] = 0;
}

        
      

Finding stuff in the tree

Before we actually get into the business of actually recognizing and processing start elements, we need to be able to find (1) ELEM_CODE fragments associated with a given filename, and (2) ELEM_FRAGMAP elements with a given name. The first is relatively easy, and it is aided by a consequence of the constraints involved in constructing our tree: the first level of the tree from the root is just a linked list of all the ELEM_CODE fragments, so we can do a simple linear search. This is done by the find_codefrag() function.

        
--Code fragment from file: tangle.c--


fragment *
find_codefrag(p_context *ctxt, const XML_Char *name)
{
  fragment *f;

  for (f=ctxt->head->frag_data.frag.head; f; f=f->next) {
    if (strcmp(f->metadata.name, name) == 0)
      return(f);
  }
  return(NULL);
}

        
      

The second type of search is facilitated by the hash module described in the section called “Hash Tables and Hash Functions”. A hash table is part of the parser state, and the user data associated with it are the map fragments.

The start element handler

This is quite probably the most complicated part of the program, and where the heart of the processing occurs. Here we process the start tags for all the elements in our namespace.

      
--Code fragment from file: tangle.c--


static void
ns_elem_start(XML_Parser p, const XML_Char *el, const XML_Char **attr)
{
  int tval;
  p_context *ctxt;

        
      

Steps followed by the start element handler

  1. Determine whether the element sent is even one of the elements we are concerned with. This is done by the get_elemtval() function we defined above. Do nothing and return to the parser if it's not one we are concerned with.

                
    --Code fragment from file: tangle.c--
    
    
      tval = get_elemtval(el);
      if (tval == ELEM_UNKNOWN)
        return;
    
                
              
  2. Get the current context and then determine whether this new element just received is valid based on it. We do that by looking at the stack of elements in the context.

    1. If the stack is empty and the element received is not a code element, it is invalid.

    2. If the stack top is a fragment map, any element received is invalid.

    3. If the stack top is a fragment and the element received is not a fragment map, it is invalid.

    These simple tests are encoded in this long if statement. If the tests do not show the new element as being invalid, we add it to the stack.

                
    --Code fragment from file: tangle.c--
    
    
      ctxt = XML_GetUserData(p);
      if ((ctxt->esp == 0 && tval != ELEM_CODE) ||
          (ctxt->element_stack[ctxt->esp-1] == ELEM_FRAGMAP) ||
          (ctxt->element_stack[ctxt->esp-1] == ELEM_FRAGMENT &&
           tval != ELEM_FRAGMAP))
        error_exit("parse error at line %d (unexpected element found)\n",
          XML_GetCurrentLineNumber(p));
       ctxt->element_stack[ctxt->esp++] = tval;
    
                
              
  3. If the element found satisfies our validity constraints, we then proceed to perform the actions outlined in the section called “The fragment data structure in action”. First thing we do is perform a switch based on the input element's token value:

                
    --Code fragment from file: tangle.c--
    
    
      switch (tval) {
    
                
              
    1. If the element is a ELEM_CODE, we first try to see if there is already a code fragment which refers to the file given by the filename attribute. If such a fragment does not exist, we create it. If not, we do nothing. In either case we point the context fragment in our state to this fragment, and set the character data handler to point to ns_char_handler().

                      
      --Code fragment from file: tangle.c--
      
      
          case ELEM_CODE:
          {
            const char *aval;
            fragment *codefrag;
      
            ctxt->startfrag = 1;
            aval = get_attribval(attr, ATTR_FILENAME);
            if (aval == NULL)
              return;
            codefrag = find_codefrag(ctxt, aval);
            if (!codefrag) {
              codefrag = add_fragment(ctxt->head);
              codefrag->type = FRAG_FILE;
              codefrag->metadata.name = strdup(aval);
              codefrag->frag_data.frag.head = codefrag->frag_data.frag.head = NULL;
            }
            ctxt->cf = codefrag;
            XML_SetCharacterDataHandler(p, ns_char_handler);
          }
          break;
      
                      
                    
    2. If the element is a ELEM_FRAGMAP, we first attempt to find the fragment map name in the named fragment hash table (i.e., it's already been defined). If the name of the fragment is already there, that is considered an error. If not, a new FRAG_NAMED fragment is added as child of the context fragment, with the name of the fragment in question, and installed into the hash table, and the character data handler unset.

                      
      --Code fragment from file: tangle.c--
      
      
          case ELEM_FRAGMAP:
          {
            const char *aval;
            fragment *mapfrag;
      
            ctxt->startfrag = 1;
            aval = get_attribval(attr, ATTR_NAME);
            if (aval == NULL)
              return;
            mapfrag = hash_lookup(ctxt->ht, aval);
            if (mapfrag)
              error_exit("map fragment named %s redefined\n", aval);
            mapfrag = add_fragment(ctxt->cf);
            mapfrag->type = FRAG_NAMED;
            mapfrag->metadata.name = strdup(aval);
            mapfrag->frag_data.frag.head = mapfrag->frag_data.frag.tail = NULL;
            if (hash_install(ctxt->ht, mapfrag->metadata.name, mapfrag))
              error_exit("hash table space exhausted (use -T to increase)\n");
      
            XML_SetCharacterDataHandler(p, NULL);
          }
          break;
      
                      
                    
    3. If the element is a ELEM_FRAGMENT, we try to find the fragment map node referred to. If it is not found, we have an error. If it was found, that fragment map node becomes the context fragment.

                      
      --Code fragment from file: tangle.c--
      
      
          case ELEM_FRAGMENT:
          {
            const char *aval;
            fragment *mapfrag;
      
            ctxt->startfrag = 1;
            aval = get_attribval(attr, ATTR_NAME);
            if (aval == NULL)
              return;
            mapfrag = hash_lookup(ctxt->ht, aval);
            if (mapfrag == NULL)
              error_exit("named fragment %s referred to but not previously defined\n", aval);
            ctxt->cf = mapfrag;
          }
          break;
        }
      }
      
                      
                    

The end element handler

This handler is relatively simple. All it does is pop the context stack and remove the character handler if the closing tag was a code tag, or add the character handler if the closing tag was a fragmap.

        
--Code fragment from file: tangle.c--


static void
ns_elem_end(XML_Parser p, const char *el) {
  int tval;
  p_context *ctxt;

  tval = get_elemtval(el);
  if (tval == ELEM_UNKNOWN)
    return;
  ctxt = XML_GetUserData(p);
  ctxt->esp--;
  assert(ctxt->element_stack[ctxt->esp] == tval);
  if (tval == ELEM_CODE)
    XML_SetCharacterDataHandler(p, NULL);
  if (tval == ELEM_FRAGMAP)
    XML_SetCharacterDataHandler(p, ns_char_handler);
}

        
      

Old-Style Parsing

We want compatibility with xmltangle, so we have a completely different set of handlers to use if such processing is desired. We use a much more simple data structure for old-style parsing, a simple static array list of files.

        
--Code fragment from file: tangle.c--


#define MAXFNAMES 100
static char *fnames[MAXFNAMES];
static int nfiles;

        
      

Here we have the old-style handler for characters. It works by first getting the user data, which in this case is the file name, from the XML_Parser given to it when it is called by Expat. If there is none, that is considered an error, so the program prints a message and exits.

        
--Code fragment from file: tangle.c--


static void
os_char_handler(XML_Parser p, const XML_Char *s, int len)
{
  char *fn;
  FILE *fp;

  fn = XML_GetUserData(p);
  if (!fn)
    error_exit("No user data found");

        
      

Now the handler has a (hopefully valid) file name on which to spew the character data it has been given. In any case, the file should have at least been opened once by the start tag handler (see below), so it should be safe to open the file in append mode.

        
--Code fragment from file: tangle.c--

  fp = fopen(fn, "a");
  if (!fp) {
    perror(progname);
    error_exit("Error opening file %s\n", fn);
  }

        
      

Provided that the file could be opened, by the time we get here fp should be a valid FILE pointer, which points to the proper file.

        
--Code fragment from file: tangle.c--

  fwrite(s, sizeof(XML_Char), len, fp);
  fclose(fp);
}
        
      

Of course we have the old-style start tag handler, which looks for <programlisting/> tags:

      
--Code fragment from file: tangle.c--


static void
os_elem_start(XML_Parser p, const char *el, const char **attr)
{
  int i, linenum;
  char *filename = NULL;
  void *ud;
  FILE *fp;

        
      

If old-style processing is selected, we look for <programlisting/> elements and find their role attributes. The value associated with these attributes become the value for filename.

        
--Code fragment from file: tangle.c--


  if (strcmp(el, OS_ELEM))
    return;
  for (i=0; attr[i]; i+=2) {
    if (strcmp(attr[i], OS_ATTR) == 0) {
      filename = strdup(attr[i+1]);
      break;
    }
  }


        
      

After this filename should have been set to some value. If no value was set, we just return.

        
--Code fragment from file: tangle.c--



  if (filename == NULL)
    return;


        
      

Now, we make the newly-found filename the user data kept by the parser state, freeing any old user data that may have been left over from processing a previous tag. We also set os_char_handler() as our character handler, so that subsequent characters get handled by it.

      
--Code fragment from file: tangle.c--


  if ((ud = XML_GetUserData(p)) != NULL) {
    XML_SetUserData(p, NULL);
    free(ud);
  }
  XML_SetUserData(p, filename);
  XML_SetCharacterDataHandler(p, os_char_handler);

      
    

But before we return back to the Expat parse loop, we need to put the file name and the line of the file into the source file if so requested. To do this, we need to first check whether the file has already been seen and recognized. To do this, we loop through the list of files that have been found and see if it's there already.

      
--Code fragment from file: tangle.c--


  fp = NULL;
  /* find filename in the list of fnames */
  for (i=0; i<nfiles; i++) {
    if (strcmp(fnames[i], filename) == 0) {
      fp = fopen(filename, "a");
      if (!fp) {
        perror(progname);
        error_exit("Unable to open file %s\n", filename);
      }
      break;
    }
  }

      
    

If on the other hand, the filename was not found in our list, the new file name is added to the list (checks are made for the boundaries), and the file opened in write mode.

      
--Code fragment from file: tangle.c--


  if (fp == NULL) {		/* file has not yet been found */
    if (nfiles >= MAXFNAMES)
      error_exit("Maximum number of files (%d) exceeded.\n", MAXFNAMES);

    fp = fopen(filename, "w");
    fnames[nfiles++] = strdup(filename);
  }

      
    

Provided that the file could be opened, by the time we get here fp should be a valid FILE pointer, which points to the proper file. We can now generate a #line preprocessor directive if we were asked to do so. If we were not, we do nothing.

      
--Code fragment from file: tangle.c--


  if (cmode) {
    linenum = XML_GetCurrentLineNumber(p);
    fprintf(fp, "\n# line %d \"%s\"\n", linenum, infilename);
  }
  fclose(fp);
}

      
    

Now finally, we get to the elem_end() function. Its job is simply to remove any character data handler if one had been previously set, because subsequent characters after a <programlisting/> tag closes are not to be handled by os_char_handler().

      
--Code fragment from file: tangle.c--


static void
os_elem_end(XML_Parser p, const char *el) {
  /* if it was the end of a programlisting, remove any character data handler */
  if (strcmp(el, "programlisting") == 0)
    XML_SetCharacterDataHandler(p, NULL);
}


      
    

Handler setting

This final function sets these handlers and makes the parser return the parser object as the userdata argument. It also sets up the context data structure.

        
--Code fragment from file: tangle.c--


void
setup_tangle(XML_Parser p, int processmode, int hash_size) {
  XML_UseParserAsHandlerArg(p);
  (processmode) ?
    XML_SetElementHandler(p, ns_elem_start, ns_elem_end) :
    XML_SetElementHandler(p, os_elem_start, os_elem_end);
  if (processmode) {
    XML_SetUserData(p, &pctxt);
    pctxt.esp = 0;
    pctxt.head = (fragment *)malloc(sizeof(fragment));
    if (pctxt.head == NULL)
      error_exit("Memory exhausted\n");
    memset(pctxt.head, 0, sizeof(fragment));
    pctxt.head->type = FRAG_ROOT;
    pctxt.cf = NULL;
    pctxt.ht = hash_init(hash_size);
  }
}

        
      

Tangle the tree

After the tree of fragments has been constructed, we then need to traverse the tree to actually write the files. We do not traverse the tree straight, although doing this will also work. What we do is treat the first level of the tree as a linked list, and each element of the linked list as the root of a tree. Such trees are traversed preorder and the function that does this is called (predictably enough) preorder_trav(), and it is a recursive traversal. This ought to be good enough, as the trees generated by the creation of a typical literate program are not generally very deep.

        
--Code fragment from file: tangle.c--


static void
preorder_trav(FILE *fp, fragment *p)
{
  static int lastlineno = 0;

  if (p == NULL)
    return;
  if (p->type == FRAG_TEXT) {
    if (cmode & (p->metadata.line > 0)) {
      fprintf(fp, "\n# line %d \"%s\"\n", p->metadata.line, infilename);
      lastlineno = p->metadata.line;
    }
    fputs(p->frag_data.text, fp);
    preorder_trav(fp, p->next);    /* no sons, so get next only */
  } else {
    preorder_trav(fp, p->frag_data.frag.head); /* son fragment */
    preorder_trav(fp, p->next);    /* brother fragment */
  }
}

        
      

The actual treetangle() function uses preorder_trav to tangle each FRAG_FILE one at a time.

        
--Code fragment from file: tangle.c--


void
treetangle(void)
{
  FILE *fp;
  fragment *f;

  for (f=pctxt.head->frag_data.frag.head; f; f=f->next) {
    fp = fopen(f->metadata.name, "w");
    fprintf(stderr, "File %s being written\n", f->metadata.name);
    preorder_trav(fp, f->frag_data.frag.head);
    fclose(fp);
  }
}