CML - Frequently Asked Questions

The FAQ is separated into various sections: Overview, What's special about CML?, Chemistry, Software.

Sections covering Community, Resources & Possible use cases of CML are currently in production...

Overview

  1. Q. What is CML?
  2. A. CML (Chemical Markup Language) is an XML language designed to hold most of the central concepts in chemistry. It was the first language to be developed and plays the same role for chemistry as MathML for mathematics and GML for geographical systems.

  3. Q. What does CML cover?
  4. A. CML covers most mainstream chemistry, especially molecules, reactions, solid-state, computation and spectroscopy. Because it takes a flexible approach to numeric science, it also covers a very wide range of chemical properties, parameters and experimental observations. It is particularly concerned with communication between machines and humans, and between machines. It has been heavily informed by the current chemical scholarly literature and chemical databases.

  5. Q. What is a CML 'element'?
  6. A. XML defines a vocabulary through 'elements'. This unfortunately clashes with the concept of a chemical element, so we use the term CMLElement to represent the XML vocabulary. There are about 120 CMLElements and most of them have straightforward names representing well-understood concepts (e.g. product, plane3). Most users will only need the subset of CMLElements relevant to their discipline.

  7. Q. How long has CML been going?
  8. A. Since 1994. It has had several versions which have expanded its content and increased the rigour and semantic power.

  9. Q. Is CML 'stable'?
  10. A. Yes. The last major content revision was about 5 years ago, and the software design has been stable for approximately the same period.

  11. Q. Why do I need CML?
  12. A. CML is becoming the lingua franca for the formal representation of chemistry in modern information systems. Most people will not interact directly with CML but will increasingly use systems which rely on it for their implementation.

  13. Q. How do I use CML?
  14. A. Most people will use software without realising that it is based on CML. If you are a developer or manage a data resource then you will almost certainly use existing libraries rather than having to build software from scratch.

  15. Q. Who uses CML?
  16. A. Effectively all Open Source chemistry projects (such as in the Blue Obelisk) use CML. It is not yet universal in closed commercial products although we know anecdotally that many systems can use CML, and that many companies use it internally. As an example, we have worked with Microsoft Research to develop a CML plug-in for Word and are working with a number of organisations such as chemical companies and national laboratories to develop systems based on this.

  17. Q. Where can I see examples of CML?
  18. A. Because CML is so flexible, it's similar to asking 'where can I see examples of mathematical equations?' Our publications normally include significant snippets of CML with descriptions of their purpose. There are a large number of CML resources mentioned in later sections. We also ensure that all concepts in CML are supported by examples co-published with the schema.

What's special about CML?

  19. Q. Why can't we use existing systems?
  20. A. CML is a very powerful language for representing complex concepts in a completely precise and validatable manner. Most other systems rely on implicit semantics (e.g. units are not given and the meaning of quantities can only be determined from manuals or reading the source code). CML provides self-describing documents and data, and means that groups who have never worked together before can understand each other's information. Because it's in XML, it is straightforward to integrate CML with other XML languages such as MathML, SVG (graphics), WordML/OOXML, and DocBook/NLM-XML (from the NIH).

  21. Q. What does 'extensible' mean for CML?
  22. A. CML provides an abstraction of numerical science and so rather than having special markup for a melting point, it describes it as a scalar property defined by a dictionary entry. This means that users can add more properties to CML simply by creating and using their own dictionaries.
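    As a minimal sketch of the idea above, the Python fragment below reads a melting point held as a scalar property that points at a dictionary entry. The CML namespace URI is the published one, but the `mydict` prefix, its URI, the entry name `mpt`, the `units` URI and the numeric value are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

CML = "{http://www.xml-cml.org/schema}"

# Illustrative document: the dictionary and units namespaces, the entry
# name "mpt" and the value 80.2 are made up for this sketch.
DOC = """
<molecule xmlns="http://www.xml-cml.org/schema"
          xmlns:xsd="http://www.w3.org/2001/XMLSchema"
          xmlns:mydict="http://example.org/dictionary/"
          xmlns:units="http://example.org/units/"
          id="m1">
  <property dictRef="mydict:mpt" title="melting point">
    <scalar dataType="xsd:double" units="units:celsius">80.2</scalar>
  </property>
</molecule>
"""


def read_scalar_properties(xml_text):
    """Return {dictRef: (value, units)} for each property holding a scalar."""
    root = ET.fromstring(xml_text)
    found = {}
    for prop in root.iter(CML + "property"):
        scalar = prop.find(CML + "scalar")
        if scalar is not None:
            found[prop.get("dictRef")] = (float(scalar.text), scalar.get("units"))
    return found


properties = read_scalar_properties(DOC)
print(properties)
```

    The reading code never mentions melting points: a new kind of property only needs a new dictionary entry, not new markup or new parsing logic.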

  23. Q. How does CML control the spread of dictionaries?
  24. A. It doesn't. Anyone can create a dictionary but they have to use a unique namespace (probably based on their domain name). This means that there is no ambiguity since every item is precisely defined. Dictionaries are a major part of the CML activity and we have software to support their creation and validation.

    The CMLElements have evolved and been designed to provide a toolkit for describing a very wide range of chemical concepts. It is therefore likely that you will be able to express a new concept using the existing CMLElements rather than inventing new ones. There will always be a small demand for new elements, but remember that each new element requires support by software to be useful.

  25. Q. How does CML manage numeric quantities?
  26. A. A very wide range of numerical quantities can be broken down into the data type and the dimensionality. CML supports 5 data types at present (string, integer, real number, Boolean and date), and uses the W3C definitions and software to support these. CML provides three dimensionalities: zero (scalar), one (homogeneous array) and two (rectangular matrix). This covers a very wide range of concepts found in chemical documents and databases. Some concepts, such as boiling point, require two primitive elements: parameter and property. The parameter describes the constraining pressure while the property supports the temperature. Again, we have found that a wide range of chemical concepts can be created from primitive components.
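    The combination of data type and dimensionality can be sketched in a few lines of Python. The example assumes the `dataType` values are written as `xsd:`-prefixed names and that array and matrix content is whitespace-delimited; the `id` values and numbers are illustrative, and dates are left out for brevity.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.xml-cml.org/schema}"

# Converters for four of the five data types (dates omitted in this sketch).
CONVERTERS = {
    "xsd:string": str,
    "xsd:integer": int,
    "xsd:double": float,
    "xsd:boolean": lambda text: text in ("true", "1"),
}

DOC = """
<cml xmlns="http://www.xml-cml.org/schema"
     xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <scalar id="count" dataType="xsd:integer">3</scalar>
  <array id="masses" dataType="xsd:double" size="3">1.5 2.5 4.0</array>
  <matrix id="identity" dataType="xsd:double" rows="2" columns="2">1 0 0 1</matrix>
</cml>
"""


def parse_value(element):
    """Convert a scalar, array or matrix element into a Python value."""
    convert = CONVERTERS[element.get("dataType", "xsd:string")]
    tag = element.tag[len(NS):]
    if tag == "scalar":
        return convert(element.text.strip())
    tokens = [convert(token) for token in element.text.split()]
    if tag == "array":
        return tokens
    if tag == "matrix":
        rows, columns = int(element.get("rows")), int(element.get("columns"))
        if rows * columns != len(tokens):
            raise ValueError("matrix content does not match rows x columns")
        return [tokens[r * columns:(r + 1) * columns] for r in range(rows)]
    raise ValueError("unsupported element: " + tag)


values = {child.get("id"): parse_value(child) for child in ET.fromstring(DOC)}
print(values)
```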

  27. Q. If everyone makes up their own concepts, don't we get chaos?
  28. A. CML manages concepts and documents through the deployment of conventions.

  29. Q. What's a convention?
  30. A. A convention is an agreement by a sub-community to use particular structures and constraints for documents, CMLElements and attributes.

  31. Q. How does CML manage conventions?
  32. A. A convention consists of a formal specification, understandable by humans and precisely processable by machines. Each convention has a namespace so that it is always clear which convention is enforced. Namespaces can be combined, for example by nesting, so that simple conventions can be used by more complex ones.

  33. Q. What's validation?
  34. A. CML encourages and increasingly requires documents to be valid against a convention. This means that if you use software which reads and writes a CML document, the software knows what's in the document from the convention. In this way, software designers can create tools which can be guaranteed to perform correctly on valid documents. If a document has an unknown convention, or an unsupported one, the software can ignore part or all of the document without crashing.
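    The "ignore what you do not support" behaviour can be sketched as a lookup on the `convention` attribute. This is a simplification under stated assumptions: it does not validate anything, it matches the attribute as a literal string rather than resolving the prefix to its namespace, and the handler table and the `other:unknown` convention are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.xml-cml.org/schema}"

DOC = """
<cml xmlns="http://www.xml-cml.org/schema"
     xmlns:convention="http://www.xml-cml.org/convention/"
     xmlns:other="http://example.org/convention/">
  <molecule id="m1" convention="convention:molecular">
    <atomArray>
      <atom id="a1" elementType="O"/>
      <atom id="a2" elementType="H"/>
      <atom id="a3" elementType="H"/>
    </atomArray>
  </molecule>
  <module id="job1" convention="other:unknown"/>
</cml>
"""


def count_atoms(molecule):
    """Handler for a supported convention: return (id, number of atoms)."""
    return molecule.get("id"), sum(1 for _ in molecule.iter(NS + "atom"))


# Hypothetical table of the conventions this processor understands.
HANDLERS = {"convention:molecular": count_atoms}


def process(xml_text):
    """Run a handler on supported conventions and record the rest as skipped."""
    results, skipped = [], []
    for element in ET.fromstring(xml_text).iter():
        convention = element.get("convention")
        if convention is None:
            continue
        handler = HANDLERS.get(convention)
        if handler is None:
            skipped.append(convention)  # unknown convention: ignore, do not crash
        else:
            results.append(handler(element))
    return results, skipped


results, skipped = process(DOC)
print(results, skipped)
```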

  35. Q. What's a 'datument'?
  36. A. A 'datument' is the combination of all the components of a piece of scientific information into a single flexible CML object, possibly using hyperlinks. It challenges the traditional separation of documents and data, and can be used to display information to both humans and machines in the most effective manner.

Chemistry

  37. Q. What types of chemistry does CML support?
  38. A. CML can support much of the current scientific literature and most undergraduate chemistry. Much of this is already possible but some will require additional software and conventions to be built.

  39. Q. How does CML support molecules?
  40. A. A molecule is a very general term and does not imply any particular use of chemical bonds or representation. At its simplest, a molecule contains atoms and may optionally contain bonds between those atoms. Molecules can also contain other molecules, so that salts or liganded proteins can be described by nested molecules. Molecules often contain property elements and can also be associated with, for example, spectra. The molecule is re-used by other concepts in CML such as reaction.

    A molecule can be represented at many levels of detail. At its simplest, it may have just an id, then perhaps a title or inline formula, and then be extended to provide atoms, the bonds between them and the coordinates and other properties. Chem4Word allows a wide range of different representations for the molecule concept.
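    A small Python sketch of nested molecules follows. It describes a salt as a parent molecule holding two child molecules, one with a single atom and one with atoms and a bond. The element and attribute names (`atomArray`, `atom`, `elementType`, `formalCharge`, `bondArray`, `bond`, `atomRefs2`, `order`) are used as we understand the schema; the ids and the choice of sodium hydroxide are illustrative.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.xml-cml.org/schema}"

DOC = """
<molecule xmlns="http://www.xml-cml.org/schema" id="salt" title="sodium hydroxide">
  <molecule id="cation">
    <atomArray>
      <atom id="a1" elementType="Na" formalCharge="1"/>
    </atomArray>
  </molecule>
  <molecule id="anion">
    <atomArray>
      <atom id="a2" elementType="O" formalCharge="-1"/>
      <atom id="a3" elementType="H"/>
    </atomArray>
    <bondArray>
      <bond atomRefs2="a2 a3" order="1"/>
    </bondArray>
  </molecule>
</molecule>
"""


def describe(molecule):
    """Summarise a molecule and, recursively, the molecules nested inside it."""
    atoms = molecule.findall(NS + "atomArray/" + NS + "atom")
    bonds = molecule.findall(NS + "bondArray/" + NS + "bond")
    return {
        "id": molecule.get("id"),
        "elements": [atom.get("elementType") for atom in atoms],
        "bonds": [tuple(bond.get("atomRefs2").split()) for bond in bonds],
        "children": [describe(child) for child in molecule.findall(NS + "molecule")],
    }


summary = describe(ET.fromstring(DOC))
print(summary)
```

    The parent molecule here carries only an id and a title, while its children carry atoms and bonds, which mirrors the different levels of detail described above.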

  41. Q. How does CML support reactions?
  42. A. CML uses the traditional concepts of product and reactant, and can associate a wide range of properties with either. Concepts such as reaction list and reaction step allow complex sequences of reactions to be described. For example, CML can represent a synthetic scheme or the precise reaction mechanism of a single reaction. Reactions can have properties such as temperature and observations such as the length of time for which a reaction took place.
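    As a hedged illustration of the reactant and product concepts, the Python sketch below turns a small reaction into a readable equation. It assumes the `reactantList`/`reactant` and `productList`/`product` nesting, a `count` attribute for stoichiometry and a `title` on each molecule; the ids and the chosen reaction are illustrative.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.xml-cml.org/schema}"

DOC = """
<reaction xmlns="http://www.xml-cml.org/schema" id="r1">
  <reactantList>
    <reactant><molecule id="m1" title="hydrogen"/></reactant>
    <reactant><molecule id="m2" title="chlorine"/></reactant>
  </reactantList>
  <productList>
    <product count="2"><molecule id="m3" title="hydrogen chloride"/></product>
  </productList>
</reaction>
"""


def side(reaction, list_tag, item_tag):
    """Format one side of the reaction from its reactant or product list."""
    terms = []
    for item in reaction.findall(NS + list_tag + "/" + NS + item_tag):
        label = item.find(NS + "molecule").get("title")
        count = item.get("count", "1")
        terms.append(label if count == "1" else count + " " + label)
    return " + ".join(terms)


def equation(reaction):
    """Return a plain-text equation such as 'A + B -> 2 C'."""
    left = side(reaction, "reactantList", "reactant")
    right = side(reaction, "productList", "product")
    return left + " -> " + right


text = equation(ET.fromstring(DOC))
print(text)  # hydrogen + chlorine -> 2 hydrogen chloride
```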

  43. Q. How does CML support crystal structures and solid state chemistry?
  44. A. CML supports crystal structures with specific elements such as crystal and symmetry, which support most of the common crystallographic data and operations. This is also enough to support modelling of the solid state, both completely periodic and partially periodic (1-D and 2-D). There is a large collection of crystal structures (CrystalEye) where the original CIF files have been converted to CML.
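    One common crystallographic operation, computing the unit cell volume, can be sketched in Python from a crystal element. The layout below is an assumption: it identifies the six cell parameters by a `title` attribute on scalar children, whereas real documents may identify them through dictionary references instead. The cell dimensions are illustrative. The volume formula is the standard one for a general (triclinic) cell.

```python
import math
import xml.etree.ElementTree as ET

NS = "{http://www.xml-cml.org/schema}"

# Assumed layout: six scalar children labelled by title (lengths, then angles
# in degrees). The numbers describe a made-up orthorhombic cell.
DOC = """
<crystal xmlns="http://www.xml-cml.org/schema">
  <scalar title="a">2.0</scalar>
  <scalar title="b">3.0</scalar>
  <scalar title="c">4.0</scalar>
  <scalar title="alpha">90.0</scalar>
  <scalar title="beta">90.0</scalar>
  <scalar title="gamma">90.0</scalar>
  <symmetry spaceGroup="P 21 21 21"/>
</crystal>
"""


def cell_volume(crystal):
    """V = abc * sqrt(1 - cos^2(alpha) - cos^2(beta) - cos^2(gamma)
                      + 2 cos(alpha) cos(beta) cos(gamma))"""
    cell = {s.get("title"): float(s.text) for s in crystal.findall(NS + "scalar")}
    ca, cb, cg = (math.cos(math.radians(cell[k])) for k in ("alpha", "beta", "gamma"))
    factor = math.sqrt(1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg)
    return cell["a"] * cell["b"] * cell["c"] * factor


volume = cell_volume(ET.fromstring(DOC))
print(round(volume, 6))
```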

  45. Q. How does CML support spectroscopy?
  46. A. CML has specific elements (spectrum and peak) which are used to describe a wide range of 1-D spectra (e.g. IR, UV, CD). It is straightforward to devise conventions for 2-D spectra. There are published collections of peaks (such as NMRShiftDB) but few Open collections of continuous 1-D spectra. However, a number of national laboratories are now planning to use CML to manage their analytical spectroscopy.
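    A peak list lends itself to simple queries. The Python sketch below selects the peaks of a spectrum that fall within a given range; it assumes peaks sit in a `peakList` and carry `xValue` and `xUnits` attributes, and the spectrum type, units prefix and peak positions are illustrative values, not measured data.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.xml-cml.org/schema}"

# Illustrative peak positions; the units namespace URI is hypothetical.
DOC = """
<spectrum xmlns="http://www.xml-cml.org/schema"
          xmlns:units="http://example.org/units/"
          id="s1" type="infrared">
  <peakList>
    <peak id="p1" xValue="2950" xUnits="units:cm-1"/>
    <peak id="p2" xValue="1715" xUnits="units:cm-1"/>
    <peak id="p3" xValue="1210" xUnits="units:cm-1"/>
  </peakList>
</spectrum>
"""


def peaks_in_range(spectrum, low, high):
    """Return (id, xValue) for peaks with low <= xValue <= high, sorted by xValue."""
    hits = []
    for peak in spectrum.iter(NS + "peak"):
        x = float(peak.get("xValue"))
        if low <= x <= high:
            hits.append((peak.get("id"), x))
    return sorted(hits, key=lambda hit: hit[1])


spectrum = ET.fromstring(DOC)
carbonyl_region = peaks_in_range(spectrum, 1650, 1800)
print(carbonyl_region)
```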

  47. Q. How does CML support computational chemistry?
  48. A. CML provides a number of tools (module, list) to structure the output of computational processes, both QM and empirical. We have written converters (parsers) for several well-known codes and this means that the information in the logfile becomes immediately much more valuable. We are also working with program authors (e.g. NWChem, DL_POLY) to embed CML routines within the code so that they automatically generate CML.

  49. Q. Can CML be used with other disciplines, such as bioscience?
  50. A. Yes. Because CML has its own namespace, it can be included anywhere in other XML documents, and the processors may choose either to include it as part of the information or to ignore it. CML is part of the BioPAX standard and we have also been involved in the MIABE activity.
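    The namespace mechanism can be shown with a short Python sketch in which a host document from another vocabulary embeds a CML molecule. The host namespace, its element names and the molecule are hypothetical; only the CML namespace URI is real. A processor that understands CML picks out the CML elements, while one that does not can simply work with the rest.

```python
import xml.etree.ElementTree as ET

CML = "{http://www.xml-cml.org/schema}"

# Hypothetical host vocabulary with one embedded CML element.
HOST = """
<report xmlns="http://example.org/bio-report"
        xmlns:cml="http://www.xml-cml.org/schema">
  <title>Ligand summary</title>
  <ligand>
    <cml:molecule id="lig1" title="caffeine"/>
  </ligand>
</report>
"""


def split_by_namespace(xml_text):
    """Return (local names of CML elements, local names of all other elements)."""
    cml_names, other_names = [], []
    for element in ET.fromstring(xml_text).iter():
        local_name = element.tag.split("}", 1)[1]
        if element.tag.startswith(CML):
            cml_names.append(local_name)
        else:
            other_names.append(local_name)
    return cml_names, other_names


cml_names, other_names = split_by_namespace(HOST)
print(cml_names, other_names)
```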

  51. Q. What's missing from CML?
  52. A. We are currently working on the representation of forcefields in CML and mapping chemical reactions onto kinetic equations. These involve more complex semantics but it should be possible to use the existing range of CMLElements. There is no immediate apparent need for new elements, although in some cases (e.g. concentrations) it could simplify the representation.

  53. Q. Are there alternatives to CML?
  54. A. CML deliberately does not duplicate other established efforts, so we use basis sets from the EMSL basis set exchange rather than using our own markup. We continue to use our own definition for scientific units until UnitsML becomes widely deployed and supported by software. We note the existence of the Analytical Information Markup Language (AnIML), which takes a somewhat different approach and which is yet to be deployed in significant volume. We also note the successful ThermoML, used for describing thermochemical data published in several journals.

    It is theoretically possible to describe semantics in a more explicit form by using RDF and ontologies (e.g. ChemAxiom). Our current view is that this is too complex for the chemical community to take onboard, and provides little effective added value over our hard-coded and rule-based semantics.

Software

  55. Q. Where do I download CML?
  56. A. You don't. CML does not exist as a binary package. The uses of CML are very varied, so people and applications will require their own selection of tools. JUMBO and JUMBO-Converters together implement the whole of CML, and an important part of their purpose is to act as a reference, defining the semantics where they are not trivially interpretable from the specification.

    The formal schema specifications are available for download.

  57. Q. What software do I need to use CML?
  58. A. The commonest operations in CML will be creation, transmission, reading, transformation, editing, computation, searching and display. Most applications need only a small subset of these. For example, a Quixote repository is based on the Chempound software, which provides validation, ingest, indexing, searching and display. Chem4Word provides the creation of molecules, the integration of documents and CML, and a rich set of display options in the widely-distributed Word environment. CrystalEye trawls the public scientific literature for crystal structures, which it converts into CML and displays on browsable pages. Many programs such as DL_POLY now emit CML as a primary representation of computational results. The MaCiE system annotated enzyme reaction mechanisms and provided animated graphical representation of the reaction. There are a vast number of different ways of combining the tools in CML systems to provide a very large range of chemical applications.

  59. Q. How do I write CML software?
  60. A. Wherever possible, you should build on an existing library and use existing web services. The libraries ensure not only syntactic validity of the CML but also increasingly support conventions and other chemical operations. Not all functions are available in all languages so you may need to refer to some existing implementations and port parts of the code. Because XML is designed as the 'digital dialtone of the Web' (Jon Bosak) it is the means of choice for supplying information from web services. Frequently it is preferable to use existing web services rather than try to write functionality from scratch. For example, our OPSIN server will supply chemical name to structure conversion on demand.

    If you do have to write CML software then we provide a range of examples and tools to help with the validation.

  61. Q. What tools and groups include CML in their applications?
  62. A. In 2011 there were over 20 independent Open Source chemistry systems and most of these have facilities for CML. The most central is probably OpenBabel, while CDK probably has the richest chemical functionality. Jmol and Avogadro are very high-quality display and analysis tools.

  63. Q. How do I create documents containing CML?
  64. A. The Chem4Word system is an example of an integrated chemistry and text editor, where the result is held completely in XML. The code is all Open and it will be technically possible to port most of the functionality to OpenOffice. In many cases the document will be created by automated aggregation and editing of components, of which CML will be one. Thus, for example, XML stylesheets could be used to include CML into documents conforming to the NLM-DTD.

Resources

  65. Q. Where can I find molecules in CML?
  66. A. The CrystalEye1, CrystalEye2 and Quixote servers have tens or hundreds of thousands of molecular structures. It is straightforward to convert other legacy formats (e.g. MDL molfiles or ChemDraw files) to CML using OpenBabel, CDK or JUMBO-Converters.

  67. Q. Where can I find reactions in CML?
  68. A. Reactions are not yet widely represented in chemistry because there are very few examples of collections of Open reactions that we can process without legal constraints. We have, however, extracted a large number of reactions from the patent literature (the Green Chain Reaction) and the MaCiE database contains enzyme reactions and mechanisms.

  69. Q. Where can I find spectra in CML?
  70. A. There are none (yet).

  71. Q. Where can I find crystal structures in CML?
  72. A. The original CrystalEye project contains approx. 250,000 crystal structures converted into CML.

  73. Q. Where can I find computational chemistry in CML?
  74. A. The aim of the Quixote project is to transform and aggregate compchem results into a browsable and searchable Chempound repository.

Community

  75. Q. How is the community organised?
  76. A.

  77. Q. Where can I ask questions about CML?
  78. A.

  79. Q. How do I contribute software or resources?
  80. A.

  81. Q. Where do I report problems?
  82. A.

Possible use cases of CML

  83. Q. I'm a synthetic chemist - how can CML help me?
  84. A.

  85. Q. I'm a materials scientist - how can CML help me?
  86. A.

  87. Q. I'm a computational chemist - how can CML help me?
  88. A.

  89. Q. I'm a crystallographer - how can CML help me?
  90. A.

  91. Q. I'm a teacher - how can CML help me?
  92. A.