Clover 3 - new clover generation

Introduction

This development phase has several goals. First of all, a new subgraph hierarchy has to be introduced. The other changes described in this document are either direct implications of establishing a tree hierarchy among graph elements (subgraphs) or maintenance updates that have been planned for a long time; now is a convenient opportunity to introduce them with minimal harm.

Motivation

What is a subgraph? A subgraph can be seen as a new graph element (something that can be placed into a graph) which has attributes, input ports and output ports, and which an ETL developer can define as a set of ordinary components (or other subgraphs) joined by edges. In fact, no new graph element will be established. Instead, the current transformation graph will be extended with the missing graph attributes and input and output ports, and this upgraded graph also becomes a graph element that can be inserted into another graph without any limitations. Ultimately, that should make it possible to develop 'components' based on a set of other graph elements.

This generalization will go far. Each current graph element, for instance a lookup table, connection or component, can contain its own graph elements. For example, our simple lookup table whose data content is defined in a flat file must also have metadata assigned. For an internal lookup table, this metadata can be chosen from the list of graph metadata. Obviously, this approach cannot be used for an external lookup table, because an external lookup table is shared among different graphs, so the lookup table has to carry its own metadata definition with it. The generalization resolves a similar issue for DBLookupTable, which could have its own database connection assigned.

Probably the most interesting impact of this generalization is the ability to assign a metadata object to a component. That allows us to define fixed metadata on a particular port of a component. A nice example is the Universal data reader, whose second output port is dedicated to logging errors that occur during data parsing. The reader expects a precise metadata format (int, int, string, string) on this port. Unfortunately, clover currently has no way to communicate this information to the user (except through documentation). As soon as we introduce this generalization, clover.gui will be able to present this information to the end user, and each edge connected to this port will automatically have the right metadata assigned. This metadata object is not editable; nonetheless, it can be used for any other edge in the graph.

New class hierarchy

We have to introduce a new class hierarchy for all graph elements. All graph elements (metadata, connections, lookup tables, sequences, components and graphs) have to implement IGraphElement. This is almost done in the current clover engine, except for the metadata element. Moreover, the IGraphElement interface has to be substantially extended.

[already present methods]

getId()
getName()
checkConfig()
init()
reset()
free()

[new set of methods]

getMetadata()
getLookupTables()
getConnections()
getSequences()
getParentElement()
getParentGraph()
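
Put together, the extended interface could look roughly like the following Java sketch. The method names come from the two lists above; the return types, the simplified `void` signatures and the `SimpleElement` helper class are assumptions made for illustration, not the engine's actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: method names are from the lists above; return types and
// signatures are assumptions (the real engine methods may differ).
interface IGraphElement {
    // already present methods
    String getId();
    String getName();
    void checkConfig();
    void init();
    void reset();
    void free();

    // new methods: every element may own embedded elements
    List<IGraphElement> getMetadata();
    List<IGraphElement> getLookupTables();
    List<IGraphElement> getConnections();
    List<IGraphElement> getSequences();
    IGraphElement getParentElement();
    IGraphElement getParentGraph();
}

// Hypothetical minimal implementation used to show the tree structure.
class SimpleElement implements IGraphElement {
    private final String id;
    private final IGraphElement parent;
    private final List<IGraphElement> metadata = new ArrayList<>();
    private final List<IGraphElement> connections = new ArrayList<>();

    SimpleElement(String id, IGraphElement parent) {
        this.id = id;
        this.parent = parent;
    }

    void addMetadata(IGraphElement m) { metadata.add(m); }
    void addConnection(IGraphElement c) { connections.add(c); }

    public String getId() { return id; }
    public String getName() { return id; }
    public void checkConfig() {}
    public void init() {}
    public void reset() {}
    public void free() {}
    public List<IGraphElement> getMetadata() { return metadata; }
    public List<IGraphElement> getLookupTables() { return new ArrayList<>(); }
    public List<IGraphElement> getConnections() { return connections; }
    public List<IGraphElement> getSequences() { return new ArrayList<>(); }
    public IGraphElement getParentElement() { return parent; }

    // Assumption: the parent graph is taken to be the root of the element
    // tree; a root element has no parent graph.
    public IGraphElement getParentGraph() {
        IGraphElement e = this;
        while (e.getParentElement() != null) {
            e = e.getParentElement();
        }
        return e == this ? null : e;
    }
}
```

With nested subgraphs, getParentGraph() would more likely return the nearest enclosing graph rather than the root; the sketch ignores that distinction.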

Its direct descendant will be the IGraphComponent interface. This interface should be implemented by all runnable parts of a graph (components and (sub)graphs). The fundamental method of this interface is getWatchDog(). A watch dog is whatever processes the runnable graph element. For a component, for instance, it only prepares a thread and launches the component's dedicated algorithm. The watch dog for a graph will be more complex, because a graph contains several phases and many other runnable elements, so it has to recursively process all embedded IGraphComponent instances, of course via their watch dogs.

[IGraphComponent]

getWatchDog()
getInputPorts()
getOutputPorts()
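
The recursive processing described above can be sketched in Java as follows. Only getWatchDog() is taken from the text; the `IWatchDog` interface, its `run()` method and the `SketchComponent` and `SketchGraph` classes are hypothetical. Phases and ports are left out, and children are run one after another for simplicity, whereas a real watch dog would presumably run the components of a phase concurrently.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: IWatchDog and run() are assumptions; the text above names
// only getWatchDog(), getInputPorts() and getOutputPorts().
interface IWatchDog {
    void run();
}

interface IGraphComponent {
    IWatchDog getWatchDog();
}

// A component's watch dog prepares a thread and launches the algorithm.
class SketchComponent implements IGraphComponent {
    private final Runnable algorithm;

    SketchComponent(Runnable algorithm) {
        this.algorithm = algorithm;
    }

    public IWatchDog getWatchDog() {
        return () -> {
            Thread t = new Thread(algorithm);
            t.start();
            try {
                t.join(); // wait until the component finishes
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        };
    }
}

// A graph's watch dog processes every embedded runnable element through
// that element's own watch dog, so nested subgraphs recurse naturally.
class SketchGraph implements IGraphComponent {
    private final List<IGraphComponent> children = new ArrayList<>();

    void add(IGraphComponent child) {
        children.add(child);
    }

    public IWatchDog getWatchDog() {
        return () -> {
            for (IGraphComponent child : children) {
                child.getWatchDog().run();
            }
        };
    }
}
```

Because a subgraph is just another IGraphComponent, the parent graph does not need to know whether a child is a single component or a whole graph.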

The next and last item in this class hierarchy is the IGraph interface. It may seem that there are few differences between IGraphComponent and IGraph and that this interface is unnecessary. However, we do need an authority that manages the edges between embedded IGraphComponent instances. Given that, we can also admit the other natural methods.

[IGraph]

getEdges()
getPhases()
getComponents()

There are many unmentioned issues and other consequences which will certainly arise during implementation.

Changes in TransformationGraphXMLReaderWriter

Each graph element (not only graphs) can contain other embedded graph elements. We have to introduce new XML syntax for this purpose. The example below shows typical nesting of graph elements.

[new syntax]

<LookupTableLink id="LookupTable0">
	<LookupTable type="dbLookup" name="name" 
			dbConnection="this/Connection0" 
			metadata="this/Metadata1">
		<attr name="sqlQuery"><![CDATA[select * from employee where employee_id=?]]></attr>
		<Global>
			<ConnectionLink id="Connection0" type="JDBC" dbConfig="${CONN_DIR}/postgre.cfg"/>
			<Metadata id="Metadata1" fileURL="${META_DIR}/delimited/employee.fmt"/>
		</Global>
	</LookupTable>
</LookupTableLink>

[former syntax]

<LookupTable id="LookupTable0" type="dbLookup" name="name" 
		dbConnection="Connection0" 
		metadata="Metadata1">
	<attr name="sqlQuery"><![CDATA[select * from employee where employee_id=?]]></attr>
</LookupTable>
<Connection id="Connection0" type="JDBC" dbConfig="${CONN_DIR}/postgre.cfg"/>
<Metadata id="Metadata1" fileURL="${META_DIR}/delimited/employee.fmt"/>

The next graph code snippets show cross references in the hierarchy.

<Edge id="INEDGE1" metadata="LookupTable0/Metadata1" fromNode="INPUT1:0" toNode="JOIN:0"/>

or

<Node id="DB_EXECUTE1" type="DB_EXECUTE" dbConnection="LookupTable0/Connection0" errorActions="CONTINUE">
<attr name="sqlQuery"><![CDATA[create table proc_table (
	id INTEGER,
	string VARCHAR(80),
	date DATETIME
);]]></attr>
</Node>

As you can see, we have to start using a new naming convention for cross references between graph elements. If a graph element wants to refer to its local subelement, it should use the keyword 'this', because the namespace for all id attributes starts at the graph level, not at the graph element level.
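
A minimal Java sketch of how such references might be resolved is shown below. The `RefElement` class and its `resolve()` method are hypothetical names; the sketch only assumes that a reference is a '/'-separated path whose first segment is either 'this' or an id in the graph-level namespace.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: RefElement and resolve() are hypothetical names used to
// illustrate the 'this/...' and 'OwnerId/...' reference forms.
class RefElement {
    final String id;
    final Map<String, RefElement> children = new HashMap<>();

    RefElement(String id) {
        this.id = id;
    }

    RefElement add(RefElement child) {
        children.put(child.id, child);
        return this;
    }

    // Resolves a reference written inside 'current', which lives in 'graph'.
    static RefElement resolve(String ref, RefElement current, RefElement graph) {
        String[] parts = ref.split("/");
        // 'this' starts at the referencing element; anything else starts
        // at the graph level, where the id namespace begins.
        RefElement scope = parts[0].equals("this") ? current : graph.children.get(parts[0]);
        for (int i = 1; i < parts.length && scope != null; i++) {
            scope = scope.children.get(parts[i]);
        }
        if (scope == null) {
            throw new IllegalArgumentException("Unresolved reference: " + ref);
        }
        return scope;
    }
}
```

Under these assumptions, "this/Connection0" written inside LookupTable0 and "LookupTable0/Connection0" written elsewhere in the graph resolve to the same element.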

And now probably the most important example in this chapter: the XML syntax of (sub)graphs. The following simple graph contains four components in series: data generator, filter, sorter and trash.

[former syntax]

<Graph name="My graph">
	<Global>
		<Metadata id="Metadata0">
			<Record fieldDelimiter="|" name="defaultName" recordDelimiter="\n" type="delimited">
				<Field name="field1" type="integer"/>
			</Record>
		</Metadata>
	</Global>
	<Phase number="0">
		<Node id="DATA_GENERATOR0" type="DATA_GENERATOR" randomFields="field1=random(&quot;0&quot;,&quot;200&quot;)" recordsNumber="100" />
		<Node id="EXT_FILTER0" type="EXT_FILTER" filterExpression="$field1 &gt; 100"/>
		<Node id="EXT_SORT0" type="EXT_SORT" sortKey="field1(a)"/>
		<Node id="TRASH0" type="TRASH"/>
		<Edge id="Edge1" metadata="Metadata0" fromNode="DATA_GENERATOR0:0" toNode="EXT_FILTER0:0"/>
		<Edge id="Edge3" metadata="Metadata0" fromNode="EXT_FILTER0:0" toNode="EXT_SORT0:0"/>
		<Edge id="Edge2" metadata="Metadata0" fromNode="EXT_SORT0:0" toNode="TRASH0:0"/>
	</Phase>
</Graph>

Let's say the middle combination of components, filter and sorter, is intended as a reusable piece of ETL code. Then we should be able to use the new syntax to define a subgraph for it:

[new syntax]

<Graph name="My Graph">
	<Global>
		<Metadata id="Metadata0">
			<Record fieldDelimiter="|" name="defaultName" recordDelimiter="\n" type="delimited">
				<Field name="field1" type="integer"/>
			</Record>
		</Metadata>
		<GraphLink id="GraphLink0">
			<Graph name="Useful subgraph">
				<Global>
				</Global>
				<Phase number="0">
					<Node id="EXT_FILTER0" type="EXT_FILTER" filterExpression="$field1 &gt; 100"/>
					<Node id="EXT_SORT0" type="EXT_SORT" sortKey="field1(a)"/>
				</Phase>
				<Edge id="Edge1" fromNode="super:0" toNode="EXT_FILTER0:0"/>
				<Edge id="Edge3" fromNode="EXT_FILTER0:0" toNode="EXT_SORT0:0"/>
				<Edge id="Edge2" fromNode="EXT_SORT0:0" toNode="super:0"/>
			</Graph>
		</GraphLink>
		<!--GraphLink id="GraphLink1" fileURL="${GRAPH_DIR}/usefulSubgraph.grf"/-->
	</Global>
	<Phase number="0">
		<Node id="DATA_GENERATOR0" type="DATA_GENERATOR" randomFields="field1=random(&quot;0&quot;,&quot;200&quot;)" recordsNumber="100" />
		<Graph id="GRAPH0" type="GraphLink0"/>
		<Node id="TRASH0" type="TRASH"/>
	</Phase>
	<Edge id="Edge1" metadata="Metadata0" fromNode="DATA_GENERATOR0:0" toNode="GRAPH0:0"/>
	<Edge id="Edge2" metadata="Metadata0" fromNode="GRAPH0:0" toNode="TRASH0:0"/>
</Graph>

A few important things deserve comment. The GraphLink XML element represents the entry point for graph import. Both variants, internal and external linking, are shown (the external one is commented out). Edges inside the linked graph have no metadata assigned. This is explained more closely in the next chapter; for now you can consider it wildcard metadata. The new keyword 'super' has to be introduced to refer to the parent graph and its input and output ports. Everything else is pretty straightforward. A graph can be used any number of times via the 'Graph' XML element in any phase and connected by edges like any other component.

Metadata wildcards

Because subgraphs are universal and able to process arbitrary data structures, we are forced to introduce something that can be called a metadata wildcard. If no metadata is specified on an edge, the metadata object is automatically taken from the neighboring components. This feature substantially reduces the time a user needs to distribute metadata across the graph.
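
One possible way to fill in wildcard edges is a simple fixed-point propagation, sketched below in Java. The `WildEdge` class and `propagate()` method are hypothetical, and the sketch makes a strong simplifying assumption: every component is pass-through, so all edges attached to the same component carry the same metadata. Components that change the record structure would need their own rules.

```java
import java.util.List;

// Sketch only: WildEdge and propagate() are hypothetical. It assumes every
// component is pass-through, i.e. edges chained through it share metadata.
class WildEdge {
    final String fromNode;
    final String toNode;
    String metadata; // null means "wildcard"

    WildEdge(String fromNode, String toNode, String metadata) {
        this.fromNode = fromNode;
        this.toNode = toNode;
        this.metadata = metadata;
    }

    // True if the two edges are chained through a common component.
    boolean touches(WildEdge other) {
        return fromNode.equals(other.toNode) || toNode.equals(other.fromNode);
    }

    // Repeats until no wildcard edge can be filled from a neighbor.
    static void propagate(List<WildEdge> edges) {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (WildEdge e : edges) {
                if (e.metadata != null) {
                    continue; // already has explicit or inferred metadata
                }
                for (WildEdge n : edges) {
                    if (n.metadata != null && e.touches(n)) {
                        e.metadata = n.metadata;
                        changed = true;
                        break;
                    }
                }
            }
        }
    }
}
```

Edges that have no neighbor with known metadata stay wildcards; a real implementation would have to report them as a configuration error or resolve them when the subgraph is placed into a parent graph.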

(Sub)graph parametrization

Consider a situation where you need to use our 'useful subgraph' twice, each time with a different filter threshold. You can prepare two different graphs or use standard clover parametrization (based on strings). Unfortunately, neither lets clover.gui provide an appropriate dialog. So we also have to introduce a new general graph parametrization, which in fact means that some of the components inside a graph have so-called exported attributes. These attributes are taken over by the parent graph. Check the example:

<Graph name="Useful subgraph">
	<properties>
		<property category="basic" displayName="Filter expression" 
			modifiable="true" name="filterExpression" nullable="true" required="true"
			target="EXT_FILTER0.filterExpression"/>
	</properties>
	<Global>
	</Global>
	<Phase number="0">
		<Node id="EXT_FILTER0" type="EXT_FILTER" filterExpression="this attribute could be omitted"/>
		<Node id="EXT_SORT0" type="EXT_SORT" sortKey="field1(a)"/>
	</Phase>
	<Edge id="Edge1" fromNode="super:0" toNode="EXT_FILTER0:0"/>
	<Edge id="Edge3" fromNode="EXT_FILTER0:0" toNode="EXT_SORT0:0"/>
	<Edge id="Edge2" fromNode="EXT_SORT0:0" toNode="super:0"/>
</Graph>

and multiple uses of a graph defined this way can look like this:

<Phase number="0">
	...
	<Graph id="GRAPH0" type="GraphLink0" filterExpression="$field1 &gt; 100"/>
	...
	<Graph id="GRAPH1" type="GraphLink0" filterExpression="$field1 &gt; 90"/>
	...
</Phase>
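
The mapping from an exported attribute to its target inside the subgraph can be sketched in Java as follows. `ExportedAttributes` is a hypothetical helper; it assumes that the `target` value always has the form "nodeId.attributeName" and models node attributes as plain string maps.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: ExportedAttributes is a hypothetical helper; node attributes
// are modeled as plain string maps keyed by node id.
class ExportedAttributes {
    // exported property name -> "nodeId.attributeName" target
    private final Map<String, String> targets = new HashMap<>();

    void export(String name, String target) {
        if (target.indexOf('.') < 0) {
            throw new IllegalArgumentException("Target must be nodeId.attribute: " + target);
        }
        targets.put(name, target);
    }

    // Routes attribute values given on a <Graph .../> usage to inner nodes.
    Map<String, Map<String, String>> apply(Map<String, String> usageAttributes) {
        Map<String, Map<String, String>> nodes = new HashMap<>();
        for (Map.Entry<String, String> e : usageAttributes.entrySet()) {
            String target = targets.get(e.getKey());
            if (target == null) {
                continue; // not an exported attribute (e.g. id or type)
            }
            int dot = target.indexOf('.');
            String nodeId = target.substring(0, dot);
            String attribute = target.substring(dot + 1);
            nodes.computeIfAbsent(nodeId, k -> new HashMap<>()).put(attribute, e.getValue());
        }
        return nodes;
    }
}
```

Each usage of the subgraph gets its own result map, so GRAPH0 and GRAPH1 can set different filter expressions on their own copy of EXT_FILTER0.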

Input/output ports specification for (sub)graphs

We also need to specify the set of input and output ports of (sub)graphs. This specification is necessary, for example, for clover.gui to offer the user the proper number of available ports. The suggested solution follows:

<Graph name="Useful subgraph">
	<inputPorts>
		<singlePort name="input" required="true"/>
	</inputPorts>
	<outputPorts>
		<singlePort name="output" required="true"/>
		<!--multiplePort required="true"/-->
	</outputPorts>

	<properties>
		...
	</properties>
	<Global>
		...
	</Global>
	<Phase number="0">
		...
	</Phase>
	...
</Graph>

New components factory - no more fromXML() and toXML() methods

Clover 3, with all these upgrades, is also a great opportunity to unify the component description in clover.engine

<extension point-id="component">
	<parameter id="className" value="org.jetel.component.RoundRobinGather"/>
	<parameter id="type" value="ROUND_ROBIN_GATHER"/>
</extension>

and component description in clover.gui

  <ETLComponent category="transformers" className="org.jetel.component.ExtFilter" iconPath="icons/transformers/ExtFilter" name="ExtFilter"  type="EXT_FILTER" passThrough="true">
    <shortDescription>Filters incoming data.</shortDescription>
    <description>Receives data records through connected input port, removes some of them depending on defined filter expression and sends the rest to the connected first output port. Rejected records are sent to the optional second output port if connected.</description>
    <inputPorts>
      <singlePort name="0" required="true"/>
    </inputPorts>
    <outputPorts>
      <singlePort name="0" required="true" label="accepted"/>
      <singlePort name="1" label="rejected"/>
    </outputPorts>
    <properties>
      <property category="basic" displayName="Filter expression" modifiable="true" name="filterExpression" nullable="true" required="true">
        <singleType name="filter"/>
      </property>
    </properties>
  </ETLComponent>

Having all this information on the clover.engine side makes it possible to automate component instantiation. For each registered property, the component developer just prepares a suitable setter, and the component factory handles conversion to the appropriate data type and passes the value to the component. Conversely, getters can be used to replace hand-written toXML() methods. Another useful outcome of this unification is an unambiguous definition of default values. Each property declaration can specify its default value directly in the XML description, and clover.gui can easily present this information to the end user. Last but not least, the engine can finally provide a GUI client with a list of all supported components and their attributes. This could be used mainly in clover.gui support for clover.server: the server will be able to inform the Eclipse plugin about all available components.
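
A setter-driven factory of this kind could be built on standard Java reflection, roughly as in the sketch below. `SetterFactory` and `DemoFilter` are hypothetical names, the attribute map stands in for values read from a graph XML file, and the type conversion covers only strings, integers and booleans.

```java
import java.lang.reflect.Method;
import java.util.Map;

// Sketch only: DemoFilter stands in for a real component class.
class DemoFilter {
    private String filterExpression;
    private int recordsNumber;

    public void setFilterExpression(String value) { filterExpression = value; }
    public void setRecordsNumber(int value) { recordsNumber = value; }
    public String getFilterExpression() { return filterExpression; }
    public int getRecordsNumber() { return recordsNumber; }
}

// Sketch only: a hypothetical factory replacing hand-written fromXML().
class SetterFactory {
    // Instantiates 'type' and pushes each string attribute through a setter.
    static <T> T create(Class<T> type, Map<String, String> attributes) {
        try {
            T component = type.getDeclaredConstructor().newInstance();
            for (Map.Entry<String, String> attr : attributes.entrySet()) {
                String name = attr.getKey();
                String setter = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
                for (Method m : type.getMethods()) {
                    if (m.getName().equals(setter) && m.getParameterCount() == 1) {
                        m.invoke(component, convert(attr.getValue(), m.getParameterTypes()[0]));
                        break;
                    }
                }
                // Attributes without a matching setter are silently skipped here.
            }
            return component;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create " + type.getName(), e);
        }
    }

    // Converts the XML string value to the setter's parameter type.
    static Object convert(String value, Class<?> target) {
        if (target == int.class || target == Integer.class) {
            return Integer.valueOf(value);
        }
        if (target == boolean.class || target == Boolean.class) {
            return Boolean.valueOf(value);
        }
        return value; // assumption: everything else is taken as a string
    }
}
```

The reverse direction, replacing toXML(), would walk the registered properties and call the matching getters in the same way.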

FIXME: parameters on graph element level
FIXME: do we need a more complex structure for ports - for instance a group of ports

roadmap/clover3.txt · Last modified: 2009/09/16 12:27 (external edit)