GuessDTD

I notice that the wmlbrowser extension for Firefox has a problem; some WML sites render better if wmlbrowser has access to the WML DTD, but wmlbrowser can’t ship it for licensing reasons.

That got me thinking: surely it’s possible, particularly for XML-based languages where conformance to the schema is a requirement, to reverse engineer the contents of the schema if you have enough documents which conform to it? Or, at least, you could make a good guess.

For example, if the root element is always <wml>, you could guess that as the root. Likewise, if an element only ever contained children from a given list, or a particular child only ever appeared once, you could infer those rules too. Is this feasible? If so, has anyone already written “guess-schema”?
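
To make the idea concrete, here is a minimal sketch in Python of what such a guesser might do. It is not an existing tool: the function name infer_schema and the two sample documents are made up for illustration. It only guesses the root element and, for each parent, whether a child looks required, optional or repeatable, using DTD-style indicators ("" for exactly one, "?", "+", "*"). It ignores child ordering, attributes and text content, and a guess is only as good as the documents it has seen.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def infer_schema(documents):
    """Guess root elements and per-parent child occurrence rules from samples.

    Hypothetical helper for illustration; returns (roots, models), where
    models[parent][child] is one of "", "?", "+", "*".
    """
    roots = set()
    # counts[parent][child] holds one number per parent instance that
    # contained at least one such child.
    counts = defaultdict(lambda: defaultdict(list))
    instances = Counter()  # how many times each element was seen at all

    for text in documents:
        root = ET.fromstring(text)
        roots.add(root.tag)
        for parent in root.iter():
            instances[parent.tag] += 1
            seen = Counter(child.tag for child in parent)
            for tag, n in seen.items():
                counts[parent.tag][tag].append(n)

    indicator = {
        (False, False): "",   # always present, never repeated
        (True, False): "?",   # sometimes missing, never repeated
        (False, True): "+",   # always present, sometimes repeated
        (True, True): "*",    # sometimes missing, sometimes repeated
    }
    models = {}
    for parent, total in instances.items():
        model = {}
        for child, per_parent in counts[parent].items():
            optional = len(per_parent) < total  # absent from some parents
            repeated = max(per_parent) > 1
            model[child] = indicator[(optional, repeated)]
        models[parent] = model
    return roots, models


# Made-up sample documents, not taken from any real WML site.
samples = [
    "<wml><card><p>Hello</p></card></wml>",
    "<wml><head/><card><p>One</p><p>Two</p></card><card/></wml>",
]
roots, models = infer_schema(samples)
print("roots:", sorted(roots))
for parent in sorted(models):
    rules = ", ".join(c + models[parent][c] for c in sorted(models[parent]))
    print(parent, "->", rules or "(no child elements seen)")
```

With more sample documents the guesses tighten or loosen accordingly; a child marked as required here would become optional as soon as one document omits it.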

9 thoughts on “GuessDTD”

  1. The XML editor in Eclipse’s WTP project has an option to infer the schema from the current document to provide content assist. I have only briefly used it but it seems to work.

  2. bsmedberg: I believe it’s the other side of a click-to-accept licence agreement. So wmlbrowser has an option to take you to the site to agree to the terms. But it’s all a bit obnoxious.

    You’d need to see the site for exact details.

  3. The extension can load the DTD off the web. I made it so that you have to tick a box saying that you accept the terms and conditions, which is a pain (you have to open the options window first), but I think that covers the bases legally.

    Technically I only need the DTDs for the entity declarations (&nbsp; etc.); it’s not the schema I care about at all. So perhaps I should just ship with a “fake” DTD containing only the entities.

    The other obstacle is that DTDs have to be stored in browser chrome, not in user profiles. I guess I should raise a bug on this (and maybe even try to fix it).

    Matthew (wmlbrowser author)

  4. “The extension can load the DTD off the web. I made it so that you have to tick a box saying that you accept the terms and conditions, which is a pain (you have to open the options window first), but I think that covers the bases legally.”

    That is strange: if a DTD reference is provided, any validating XML parser will automatically retrieve it! So how do they suppose that would work?

    ~Grauw

  5. Why does the WML browser need the DTD? As far as I can tell, the only things in the DTD whose absence could cause a problem are the entity definitions for nbsp and shy. Creating a pseudo-DTD for two entity definitions is not difficult. (However, I consider DTD-based entities harmful in the Web context. When Mozilla gave in to XHTML entities, other browsers had to follow, too, which runs against the idea that interactive browsers with non-validating parsers were supposed to be relieved from processing DTDs. If Mozilla had firmly rejected the entities, perhaps the XML DTD-based character entities could have been eradicated on the Web.)