CML - a potted history

CML has been developed by Peter Murray-Rust and Henry Rzepa since 1995. It is the de facto XML for chemistry, is accepted by publishers and has more than 1 million lines of Open Source code supporting it. CML can be validated and built into authoring tools (for example the Chemistry Add-in for Microsoft Word).

CML was inspired and informed by the work of Tim Berners-Lee at CERN, and its first 'public outing' was in August 1995, at the ACS meeting in Chicago. The primary purpose of CML is (and has always been) to allow humans and machines to communicate chemical concepts without loss of semantic information. The vision is that scientific publications should be represented semantically such that both humans and machines can consume them, again without loss. The only technology capable of managing this at present (and probably for some time to come) is XML. XML is widely used in publishing, and CML can therefore be adopted for any document-like material such as journals, theses, supplier catalogues, textbooks, regulatory information and the input/output of programs.

CML under active development in a conducive atmosphere - Peter Murray-Rust and Henry Rzepa hard at work...

Throughout its evolution, CML has been driven by code-design - developing the code in parallel with formal specifications and recommendations (many emanating from the W3C). There is a particular need to validate combinations of elements and attributes, which is impossible to hand-code. Automated methods of validation and code-generation based on the full W3C DOM model proved unwieldy and led to the adoption of the simpler XOM model, which has had a strong influence on the subsequent design. The increasing availability of XPath-based technology has had a major positive effect, allowing much more flexibility in how elements and attributes are organised.

However, with increasing deployment to different areas of chemistry, it became clear that it was going to be impossible to find universal content models for most elements. This is due not only to the diversity of chemistry but also to the different ways that chemists might wish to organise information. This led to the proposal of an extremely flexible content model, constrained by the use of conventions rather than XSD technology. Conventions therefore provide the 'grammar' of CML, with the 'vocabulary' supplied by dictionaries containing well-defined entries with specific associated semantics (a concept modelled on the crystallographic CIF architecture).
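The split between 'grammar' and 'vocabulary' can be illustrated with a small sketch. It assumes that a convention is declared through a `convention` attribute and that dictionary entries are referenced through `dictRef` attributes, both holding prefixed names. The `conv:` and `dict:` prefixes, their `example.org` URIs, the entry names and the `describe` helper are hypothetical, and the CML namespace URI shown is an assumption rather than something stated in this text.

```python
import xml.etree.ElementTree as ET

# Hypothetical CML fragment: the namespace URIs, the convention name and the
# dictionary entry are illustrative assumptions, not taken from a real
# published convention or dictionary.
DOC = """
<cml:molecule xmlns:cml="http://www.xml-cml.org/schema"
              xmlns:conv="http://example.org/convention/"
              xmlns:dict="http://example.org/dictionary/"
              id="m1" convention="conv:simple-molecule">
  <cml:atomArray>
    <cml:atom id="a1" elementType="O"/>
    <cml:atom id="a2" elementType="H"/>
    <cml:atom id="a3" elementType="H"/>
  </cml:atomArray>
  <cml:property dictRef="dict:meltingPoint">
    <cml:scalar>273.15</cml:scalar>
  </cml:property>
</cml:molecule>
"""


def describe(xml_text):
    """Report which convention (grammar) and dictionary entries (vocabulary)
    a document claims to use."""
    root = ET.fromstring(xml_text.strip())
    # Collect every dictRef in document order, including one on the root.
    dict_refs = [el.get("dictRef") for el in root.iter()
                 if el.get("dictRef") is not None]
    return {"convention": root.get("convention"), "dict_refs": dict_refs}


if __name__ == "__main__":
    print(describe(DOC))
```

In this sketch the convention name tells a consumer which structural rules to expect, while each `dictRef` points at an entry that defines what the enclosed value means.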

The idea of validation lies at the heart of CML. Without validation, the author of a program cannot easily write conformant software if the input is variable; similarly the author cannot know whether an input is fit for purpose. A major purpose of CML is to make sure that all chemical information is validatable and that the rules for this validation are openly visible - this website hosts a validation service and lists all the information (schema specification, dictionaries and conventions) required to write clean compliant CML code. Extension of the CML architecture by sub-communities who wish to develop CML in their own specific area of chemistry (or related science) is now readily achieved through the dictionary/convention mechanism.
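As a rough illustration of what convention-based validation means in practice, the sketch below checks a document against three made-up rules: every atom has an `id`, atom ids are unique, and every atom has an `elementType`; in addition every property must carry a `dictRef`. These rules, the `validate` function and the sample documents are hypothetical and much simpler than any published convention; the namespace URI is again an assumption. It is not the validation service described above.

```python
import xml.etree.ElementTree as ET

# Assumed CML namespace URI, used for the path queries below.
NS = {"cml": "http://www.xml-cml.org/schema"}


def validate(xml_text):
    """Return a list of rule violations (an empty list means the toy rules pass)."""
    errors = []
    root = ET.fromstring(xml_text.strip())
    seen_ids = set()
    for atom in root.findall(".//cml:atom", NS):
        atom_id = atom.get("id")
        if atom_id is None:
            errors.append("atom without id")
        elif atom_id in seen_ids:
            errors.append(f"duplicate atom id: {atom_id}")
        else:
            seen_ids.add(atom_id)
        if atom.get("elementType") is None:
            errors.append(f"atom {atom_id} has no elementType")
    for prop in root.findall(".//cml:property", NS):
        if prop.get("dictRef") is None:
            errors.append("property without dictRef")
    return errors


GOOD = """
<cml:molecule xmlns:cml="http://www.xml-cml.org/schema" id="m1">
  <cml:atomArray>
    <cml:atom id="a1" elementType="O"/>
    <cml:atom id="a2" elementType="H"/>
  </cml:atomArray>
</cml:molecule>
"""

BAD = """
<cml:molecule xmlns:cml="http://www.xml-cml.org/schema" id="m2">
  <cml:atomArray>
    <cml:atom id="a1" elementType="O"/>
    <cml:atom id="a1"/>
  </cml:atomArray>
  <cml:property><cml:scalar>1.0</cml:scalar></cml:property>
</cml:molecule>
"""

if __name__ == "__main__":
    print(validate(GOOD))
    print(validate(BAD))
```

Because the rules are stated openly and checked mechanically, a program that receives a document passing them can rely on the structure without defensive special cases, which is the point the paragraph above makes about conformant software.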

Those interested in a more detailed review of the design and evolution of CML should read the article recently (October 2011) published in the Journal of Cheminformatics as part of the Visions of a Semantic Molecular Future thematic issue.