Readers

Readers are the initial components of a graph; they read data from an input source. The source can be, for example, a file on a local disk, an FTP server, an LDAP directory, a JMS destination or database tables. A graph must contain at least one reader.


Common attributes



Attribute | Description | Exp.
id | component identification | string
type | component type; this attribute is generated automatically by the GUI | string


File readers

Attribute | Description | Exp.
charset | character encoding of the input file | see locales encoding
dataPolicy | specifies how to handle misformatted or incorrect data | Strict, Controlled or Lenient
fileURL | path to the data input file | one of the forms listed below this table
numRecords | specifies how many records/rows should be read from the source file | number
skipFirstLine | specifies whether the first record/line should be skipped. If a record delimiter is specified, one record is skipped; otherwise the first line of the flat file is skipped. | false or true
skipRows | specifies how many records/rows should be skipped from the source file; useful for files whose first rows are a header rather than real data | number
trim | specifies whether to trim strings before setting them to data fields. When not set, strings are trimmed depending on the “trim” attribute of metadata. | false or true

Accepted forms of fileURL:

  ( [zip: | gzip: | tar:] [path/] filename ) |
  ( http[s]://[user:password@]server [/path] [/filename] ) |
  ( [s]ftp://user:password@server [/path] /filename ) | - | ..


Database readers

Attribute | Description | Exp.
sqlQuery | query to be sent to the database | see wikipedia sql
fetchSize | how many records should be fetched from the database at once | number


CloverDataReader

Reads data saved in the Clover internal format and sends the records to the output ports.

Input ports: none

Output ports:

  • at least one output port defined/connected.


Xml attributes:

Attribute | Mandatory | Description | Default | ETL version since
id | yes | component identification | |
type | yes | component type (CLOVER_READER) | |
fileURL | yes | path to the data file: either an archive storing data, data indexes and metadata description, or a binary file with data saved in the Clover internal format | |
indexFileURL | no | path to the index file; needed if the index file is not in the same directory as the data file or does not have the expected name (fileURL.idx) | |
skipRows | no | specifies how many records/rows should be skipped from the source file; useful for files whose first rows are a header rather than real data | 0 |
numRecords | no | specifies how many records/rows should be read from the source | |
startRecord | no | index of the first parsed record | 0 |
finalRecord | no | index of the final parsed record | |


Both startRecord and finalRecord attributes are deprecated and should not be used.

Example:

  <Node id="CLOVER_READER0" type="CLOVER_READER" fileURL="zip:customers.clv.zip"/>
 
  <Node id="CLOVER_READER0"
        type="CLOVER_READER"
        fileURL="customers.clv"
        finalRecord="2"
        startRecord="1"
  />



DataGenerator

Generates data according to a pattern. Record fields can be filled with constants, random or sequence values, lookup table values or results of CTL functions. The user can choose either the enhanced generator (generate/generateClass/generateURL) or the simple generator (pattern, randomFields, sequenceFields).

Input ports: none

Output ports:

  • at least one output port defined/connected.


Xml attributes:

Attribute | Mandatory | Description | Default | ETL version since
id | yes | component identification | |
type | yes | component type (DATA_GENERATOR) | |
generate | if no generateClass or generateURL | definition of the generator code; the attribute does not resolve escape characters (this resolving will be in the newest 2.8.x version) | | 2.7
generateClass | if no generate or generateURL | name of the class to be used for data generation | | 2.7
generateURL | if no generateClass or generate | path to the file with the generator code | | 2.7
charset | no | character encoding of the external generator file (generateURL) | ISO-8859-1 | 2.9
pattern | no | pattern for filling new records. It is a string containing values for all fields that are not set by random or sequence values. Field values in this string must have a format consistent with the metadata (appropriate length, or delimited by the appropriate delimiter). | |
randomFields | no | names of fields to be set by random values (optionally with ranges), separated by semicolons. When a range (or one of its bounds) is not given, the minimum and maximum possible values for the given data field are used (e.g. for LongDataField the minimum is Long.MIN_VALUE and the maximum is Long.MAX_VALUE). Random strings are generated from the characters 'a' to 'z'. For numeric fields the range is the minimum value (inclusive) and the maximum value (exclusive); for byte or string fields the range is the minimum and maximum length of the field (if it is not fixed). E.g. field1=random(0,51) gives a numeric field a random value from the range [0,51) and a string field a random string of length 0 to 51 characters; field2=random(10) is allowed only for string or byte fields and sets the length of the field. Since version 2.5 the standard mapping syntax is prescribed: fields are preceded by $, mappings are separated by a colon, semicolon or pipe, and the assignment sign is :=, e.g. $field1:=random(0,51);$field2:=random(10) | |
randomSeed | no | seed of the random number generator, given as a single long value | |
sequenceFields | no | names of fields to be set by values from a sequence (optionally with a sequence name: fieldName=sequenceName), separated by semicolons. Since version 2.5 the standard mapping syntax is prescribed: fields are preceded by $, mappings are separated by a colon, semicolon or pipe, and the assignment sign is :=, e.g. $field1:=sequenceName | |
recordsNumber | yes | number of records to generate | |
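
The range rules described for randomFields can be illustrated with a minimal, stand-alone Java sketch. It is not the generator's actual code: the class and method names are hypothetical, and it assumes that the maximum string length is inclusive, which the description above suggests but does not state precisely.

```java
import java.util.Random;

public class RandomFieldSketch {
    // Numeric field: min is inclusive, max is exclusive (as described for randomFields).
    static long randomNumber(Random rnd, long min, long max) {
        return min + (long) (rnd.nextDouble() * (max - min));
    }

    // String or byte field: the range gives the minimum and maximum length
    // (maximum assumed inclusive); characters are taken from 'a' to 'z'.
    static String randomString(Random rnd, int minLen, int maxLen) {
        int len = minLen + rnd.nextInt(maxLen - minLen + 1);
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) {
            sb.append((char) ('a' + rnd.nextInt(26)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Random rnd = new Random(42L); // fixed seed, comparable to setting randomSeed
        System.out.println(randomNumber(rnd, 0, 51)); // some value from 0 to 50
        System.out.println(randomString(rnd, 0, 51)); // some string of 0 to 51 characters
    }
}
```

With a fixed seed the sketch produces the same values on every run, which is the practical reason for setting randomSeed in a test graph.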

Example:

 
  <Node id="DATA_GENERATOR0" type="DATA_GENERATOR" recordsNumber="10">
     <attr name="generate"><![CDATA[//#TL
int i;

function generate() {
    i = 2; // a key
    $0.RandomName := random_string(0,5)+random_string(5,5);
    $0.RandomDate := random_date("2009.01.01","2009.12.31","yyyy.MM.dd");
    $0.Random := random();
    $0.RandomInt := random()*100;
    $0.Composite := random_string(3,5)+" - " + round(random()*100);
    $0.Sequence := sequence(Sequence0).next;
    $0.LookupTableV1 := lookup(LookupTable0,i).field2;
    $0.LookupTableV2 := lookup(LookupTable0,i).field1;
}

function init() {
    lookup_admin(LookupTable0, init);
}

function finished() {
}
]]></attr>
  </Node>
 
  <Node id="DATA_GENERATOR0" type="DATA_GENERATOR">
     <attr name="randomFields">$ShipAddress :=random(1,777);$EmployeeID:=random( 1,${EMPLOYEE_NUMBER});$Freight:=random(1,51);$ShippedDate:=random(20.10.2005,30.10.2005)</attr>
     <attr name="recordsNumber">10000</attr>
     <attr name="sequenceFields">OrderID</attr>
     <attr name="pattern">agata|20.10.2005|30.10.2005|1|test|Prague|EU|000000|CZ
     </attr>
  </Node>



DataReader

Parses the specified input data file and sends the records to the first output port. The embedded parser covers both fixlen and delimited data formats.

The logging port must define the following metadata structure:

Field | Type | Description
0 | integer | record number
1 | integer | field number (1 means the first field, i.e. the field whose index is 0)
2 | string | the erroneous record in raw form
3 | string | error message with detailed information about the error

The metadata for logging port:

  <Metadata id="Metadata0">
    <Record name="errorPort" type="delimited">
      <Field delimiter=";" name="RecNumber" nullable="true" type="integer"/>
      <Field delimiter=";" name="FieldNumber" nullable="true" type="integer"/>
      <Field delimiter=";" name="RawRecord" nullable="true" type="string"/>
      <Field delimiter="\r\n" name="ErrorMessage" nullable="true" type="string"/>
    </Record>
  </Metadata>

Note: the logging port is used only if the Controlled data policy is set.

Input ports:

  • one optional input port defined/connected (port protocol see fileURL).

Output ports:

  • one mandatory output port defined/connected.
  • one optional logging port defined/connected.


Xml attributes:

Attribute | Mandatory | Description | Default | ETL version since
id | yes | component identification | |
type | yes | component type (DATA_READER) | |
fileURL | yes | path to the data input file | |
charset | no | character encoding of the input file | ISO-8859-1 |
dataPolicy | no | specifies how to handle misformatted or incorrect data. 'Strict' aborts processing, 'Controlled' logs the entire record while processing continues, and 'Lenient' attempts to set incorrect data to default values while processing continues. | Strict |
skipLeadingBlanks | no | specifies whether to skip leading blanks before setting strings to data fields. When not set, the value of the “trim” attribute of metadata is used. | |
skipTrailingBlanks | no | specifies whether to skip trailing blanks before setting strings to data fields. When not set, the value of the “trim” attribute of metadata is used. | | 2.6
trim | no | specifies whether to trim strings before setting them to data fields. When not set, strings are trimmed depending on the “trim” attribute of metadata. Note: if this option is on (true), a field composed only of blanks/spaces is transformed to null (zero-length string). | |
skipFirstLine | no | deprecated, replaced by skipSourceRows. Specifies whether the first record/line should be skipped. If a record delimiter is specified, one record is skipped; otherwise the first line of the flat file is skipped. | false |
skipRows | no | specifies how many records/rows should be skipped from the source file; useful for files whose first rows are a header rather than real data | 0 |
numRecords | no | specifies how many records/rows should be read from the source | |
skipSourceRows | no | specifies how many records/rows should be skipped from every source file; useful for files whose first rows are a header rather than real data | 0 | 2.7
numSourceRecords | no | specifies how many records/rows should be read from every source | | 2.7
maxErrorCount | no | number of tolerated error records in the input file (applicable only to the Controlled data policy) | 0 |
quotedStrings | no | fields can be quoted with ' ' or " " | false |
treatMultipleDelimitersAsOne | no | if this option is true, multiple consecutive delimiters are recognized as one delimiter | false |
incrementalFile | if incrementalKey is set | property file used for incremental reading | |
incrementalKey | if incrementalFile is set | name of the property in the property file that carries the last reading position | |
verbose | no | provides more comprehensive error notification at the cost of slightly worse performance (a few percent) | true | 2.8
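
The difference between the per-source attributes (skipSourceRows, numSourceRecords) and the overall ones (skipRows, numRecords) can be sketched in plain Java. This is only an illustration under an assumption: the per-source limits are applied first and the overall limits afterwards, an order the table above does not specify. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipLimitSketch {
    // Applies the per-source skip/limit to every source, then the overall skip/limit.
    // Integer.MAX_VALUE stands for "attribute not set".
    static List<String> read(List<List<String>> sources, int skipSourceRows, int numSourceRecords,
                             int skipRows, int numRecords) {
        List<String> merged = new ArrayList<>();
        for (List<String> source : sources) {
            int from = Math.min(skipSourceRows, source.size());
            int to = (int) Math.min((long) from + numSourceRecords, source.size());
            merged.addAll(source.subList(from, to));
        }
        int from = Math.min(skipRows, merged.size());
        int to = (int) Math.min((long) from + numRecords, merged.size());
        return new ArrayList<>(merged.subList(from, to));
    }

    public static void main(String[] args) {
        List<List<String>> sources = Arrays.asList(
                Arrays.asList("header", "a", "b"),
                Arrays.asList("header", "c", "d"));
        // Skip one header row in every file, then read at most three records overall.
        System.out.println(read(sources, 1, Integer.MAX_VALUE, 0, 3)); // [a, b, c]
    }
}
```

In this reading, skipSourceRows="1" removes the header of every input file, whereas skipRows="1" would remove only the first row of the whole input.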

Example:

  <Node id="InputFile" type="DATA_READER" fileURL="data.txt"/>
 
  <Node id="InputFile" 
        type="DATA_READER" 
        fileURL="zip:http://www.store.com/data.zip#data.txt" 
        charset="ISO-8859-15"
        dataPolicy="Controlled"
        skipLeadingBlanks="false"
        trim="false"
        skipFirstLine="true"
        skipRows="1"
        numRecords="100"
        maxErrorCount="0"
        quotedStrings="false"
        treatMultipleDelimitersAsOne="false"
  />



DBInputTable

This component reads data from a database. It first executes the specified query on the database and then extracts all the rows returned.
The sqlQuery and url attributes are mutually exclusive. url takes precedence: if it is found, sqlQuery is not used.

When connecting to MS SQL Server, it is convenient to use the jTDS driver (http://jtds.sourceforge.net). It is an open source, 100% pure Java JDBC driver for Microsoft SQL Server and Sybase. Its speed is higher than that of the Microsoft driver.

Input ports: none

Output ports:

  • at least one output port defined/connected.


Xml attributes:

Attribute | Mandatory | Description | Default | ETL version since
id | yes | component identification | |
type | yes | component type (DB_INPUT_TABLE) | |
dbConnection | yes | id of the Database Connection object to be used to access the database | |
sqlQuery | if no url | query to be sent to the database. Since version 2.4 the query can contain a mapping between Clover and database fields; e.g. the query "select $field1:=dbField1, $field2:=dbField2 from mytable" is interpreted as "select dbField1, dbField2 from mytable", and the output field field1 is filled with the value of dbField1 and field2 with the value of dbField2. The query can also be written without a mapping; the output fields are then filled, in order, with the values returned from the database. For incremental reading, a where clause defining the new records must be present (see the incrementalKey and incrementalFile attributes); e.g. "select $f1:=db1, $f2:=db2, … from myTable where dbX > #myKey1 and dbY <= #myKey2", where myKey1 and myKey2 must be defined in the incrementalKey attribute. Either sqlQuery or url must be defined. | |
url | if no sqlQuery | location of the query. The query is loaded from the file referenced by url. The syntax of the query must be as described for sqlQuery. | |
fetchSize | no | how many records should be fetched from the database at once; see JDBC's java.sql.Statement.setFetchSize(). The constant MIN_INT is supported and resolves to Integer.MIN_VALUE (useful with the MySQL JDBC driver). | 20 |
SQLCode | no | XML tag that allows a large SQL statement to be embedded directly into the graph | |
dataPolicy | no | specifies how to handle misformatted or incorrect data. 'Strict' aborts processing, 'Controlled' logs the entire record while processing continues, and 'Lenient' attempts to set incorrect data to default values while processing continues. | 'Strict' |
incrementalFile | if incrementalKey is set | URL of the file where the key values are stored. The values must be set by the user for the first reading; afterwards they are set to the requested values automatically (see the sqlQuery and incrementalKey attributes). Example content: myKey1=0 and myKey2=1990-01-01, each on its own line. Dates, times and timestamps have to be written in the formats defined by Defaults.DEFAULT_DATE_FORMAT, Defaults.DEFAULT_TIME_FORMAT and Defaults.DEFAULT_DATETIME_FORMAT. | |
incrementalKey | if incrementalFile is set | defines on which database fields the incremental values are defined and which record from the result set is stored (last, first, min or max). Key parts are separated by a colon, semicolon or pipe, e.g. myKey1=first(dbX);myKey2=min(dbY) (see the sqlQuery attribute) | |
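
The way a mapped query is interpreted can be pictured with a small regular-expression sketch in Java. It only mirrors the rewrite described for sqlQuery; it is not the component's implementation, the class and method names are hypothetical, and it assumes simple identifiers on both sides of :=.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QueryMappingSketch {
    // Matches "$cloverField:=dbField" pairs inside a query.
    private static final Pattern MAPPING = Pattern.compile("\\$(\\w+)\\s*:=\\s*(\\w+)");

    // Returns the SQL text with every "$cloverField:=" prefix removed.
    static String toPlainSql(String query) {
        return MAPPING.matcher(query).replaceAll("$2");
    }

    // Collects Clover field name -> database column name, in query order.
    static Map<String, String> fieldMapping(String query) {
        Map<String, String> mapping = new LinkedHashMap<>();
        Matcher m = MAPPING.matcher(query);
        while (m.find()) {
            mapping.put(m.group(1), m.group(2));
        }
        return mapping;
    }

    public static void main(String[] args) {
        String query = "select $field1:=dbField1, $field2:=dbField2 from mytable";
        System.out.println(toPlainSql(query));   // select dbField1, dbField2 from mytable
        System.out.println(fieldMapping(query)); // {field1=dbField1, field2=dbField2}
    }
}
```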

Examples:

  <Node id="INPUT" type="DB_INPUT_TABLE" dbConnection="NorthwindDB" sqlQuery="select * from employee_z"/>
 
  <Node id="INPUT" type="DB_INPUT_TABLE" dbConnection="NorthwindDB" url="c:/temp/test.sql"/>
 
  <Node id="INPUT" type="DB_INPUT_TABLE" dbConnection="NorthwindDB" dataPolicy="Strict" fetchSize="1000">
        <attr name="SQLCode">
            select * from employee_z
        </attr>
  </Node>
 
  <Property id="GraphParameter0" name="param1" value="A%"/>
  <Node id="INPUT" type="DB_INPUT_TABLE" dbConnection="NorthwindDB" dataPolicy="Strict" fetchSize="1000">
        <attr name="SQLCode">
            select * from employee_z where last_name like '${param1}'
        </attr>
  </Node>
 
<Node dbConnection="DBConnection0" id="INPUT" 
  sqlQuery="select $last_name:=last_name,$full_name:=full_name from employee" type="DB_INPUT_TABLE"/>
 
Example for incremental reading:
 
<Node dbConnection="DBConnection0" id="INPUT" 
  incrementalFile="dbInc.txt" 
  incrementalKey="key1=last(id);key2=max(last_update)" 
  sqlQuery="select * from employee where id > #key1 or last_update>#key2" type="DB_INPUT_TABLE"/>
 
  Starting content of dbInc.txt:
  	key1=0
  	key2=1999-12-31



DBFDataReader

Reads records from the specified dBase data file and broadcasts them to all connected output ports. This component needs metadata specified as fixed-length (type="fixed"). In addition, the first field in the metadata must be a String field of length 1, which is used as the indicator of deleted records in DBF. Such metadata can be generated automatically by Clover's DBFAnalyzer utility. Its main class can be executed as 'java -cp "clover.engine.jar" org.jetel.database.dbf.DBFAnalyzer'.

Note: DBFAnalyzer generates additional information from the DBF file (dataOffset and recordSize), but these are not necessary.

Input ports:

  • one optional input port defined/connected (port protocol see fileURL).

Output ports:

  • at least one output port defined/connected.


Xml attributes:

Attribute | Mandatory | Description | Default | ETL version since
id | yes | component identification | |
type | yes | component type (DBF_DATA_READER) | |
fileURL | yes | path to the input files | |
dataPolicy | no | specifies how to handle misformatted or incorrect data. 'Strict' aborts processing, 'Controlled' logs the entire record while processing continues, and 'Lenient' attempts to set incorrect data to default values while processing continues. | 'Strict' |
charset | no | character set used for decoding field data. The default value is deduced from the DBF table header. If it is specified as part of the metadata at record level, it is taken from there. | |
skipRows | no | specifies how many records/rows should be skipped from the source file | 0 |
numRecords | no | specifies how many records/rows should be read from the source | |
skipSourceRows | no | specifies how many records/rows should be skipped from every source file | 0 | 2.7
numSourceRecords | no | specifies how many records/rows should be read from every source | | 2.7
incrementalFile | if incrementalKey is set | property file used for incremental reading | |
incrementalKey | if incrementalFile is set | name of the property in the property file that carries the last reading position | |

Example:

  <Node id="InputFile" type="DBF_DATA_READER" fileURL="/tmp/customers.dbf"/>
 
  <Node id="InputFile" 
        type="DBF_DATA_READER"
        fileURL="/tmp/customers.dbf"
        dataPolicy="Strict"
        charset="UTF-8"
  />



JmsReader

Receives JMS messages and transforms them into data records using a user-specified transformation class (the so-called processor). The processor implements the JmsMsg2DataRecord interface or inherits from the JmsMsg2DataRecordBase class. The processor may be specified by class name, by inline Java code or by a URL of a Java source file.

The default implementation of the processor, org.jetel.component.jms.JmsMsg2DataRecordProperties, is sufficient in most cases. The body of the incoming message is stored in the field specified by the bodyField component attribute. Properties of the message are stored in fields with the same names (if they exist in the output record metadata). It can process javax.jms.TextMessage as well as javax.jms.BytesMessage (since 2.8).

Input ports: none

Output ports:

  • at least one output port defined/connected.


Xml attributes:

Attribute | Mandatory | Description | Default | ETL version since
id | yes | component identification | |
type | yes | component type (JMS_READER) | |
connection | yes | JMS connection ID | |
processorClass | no | name of the processor class. The default value is applied only if the processorCode and processorURL attributes aren't specified. | org.jetel.component.jms.JmsMsg2DataRecordProperties |
processorCode | no | inline Java code defining the processor class. It is applied only if processorClass isn't specified. | |
processorURL | no | URL of the file which contains the Java source of the processor class. It is applied only if processorClass and processorCode aren't specified. | |
charset | no | charset of the processor code, if the code is specified by the processorURL attribute | taken from CloverETL engine defaults |
selector | no | JMS selector specifying the messages to be processed | |
maxMsgCount | no | maximum number of messages to be processed; 0 means there is no limit on the number of messages | 0 |
timeout | no | maximum time (in milliseconds) to wait for the next message; 0 means forever | 0 |
bodyField | no | name of the field in the output record metadata which should be filled with the body of the incoming JMS message. This attribute is used by the default processor implementation (JmsMsg2DataRecordProperties). If a value of the “bodyField” attribute is specified, such a field must exist in the metadata. If no value is specified, the processor tries to set a field named “bodyField”, but this is silently ignored if no such field exists in the output record metadata. | bodyField (since 2.8; older versions don't have any default) |
msgCharset | no | charset of the message content; used only for javax.jms.BytesMessage. This attribute is used by the default processor implementation (JmsMsg2DataRecordProperties). | taken from CloverETL engine defaults | 2.8


Constraints of reading messages:

maxMsgCount | timeout | Description
0 | 0 | The node keeps waiting for new messages. The phase in which this node is embedded never stops.
greater than 0 | 0 | The node reads new messages until their count reaches maxMsgCount, no matter how long it takes.
0 | greater than 0 | The node reads new messages for the specified number of milliseconds, no matter how many messages it reads.
greater than 0 | greater than 0 | JmsReader stops when the count of read messages reaches maxMsgCount or the timeout occurs.
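
The four combinations in the table above reduce to one stop condition, sketched here in plain Java. The sketch does not use the JMS API; the class, the method and the waitedMillis parameter are hypothetical, and how the component actually measures the waiting time is not specified in this section.

```java
public class JmsStopConditionSketch {
    // Returns true when reading should stop; 0 means "no limit" for either attribute.
    static boolean shouldStop(long messagesRead, long waitedMillis, long maxMsgCount, long timeout) {
        boolean countReached = maxMsgCount > 0 && messagesRead >= maxMsgCount;
        boolean timedOut = timeout > 0 && waitedMillis >= timeout;
        return countReached || timedOut;
    }

    public static void main(String[] args) {
        // maxMsgCount=0, timeout=0: never stops on its own
        System.out.println(shouldStop(1_000_000, 3_600_000, 0, 0)); // false
        // maxMsgCount=10, timeout=0: stops after the tenth message
        System.out.println(shouldStop(10, 3_600_000, 10, 0));       // true
        // maxMsgCount=0, timeout=4000: stops once 4000 ms have passed
        System.out.println(shouldStop(3, 4000, 0, 4000));           // true
        // both set: whichever limit is reached first
        System.out.println(shouldStop(3, 4000, 10, 4000));          // true
    }
}
```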

Example:

  <Node id="JmsReader" type="JMS_READER" connection="dest" />
 
  <Node id="JmsReader" 
        type="JMS_READER"
        connection="dest"
        timeout="4000"
        maxMsgCount="0"
  />



LdapReader

This component provides a means of reading information from an LDAP directory. It contains the logic to extract the results of an LDAP search and transform them into Jetel data records. The metadata provided through the output port/edge must precisely describe the structure of the objects read.
All results of the search must have the same objectClass.

Input ports: none

Output ports:

  • at least one output port defined/connected.

NOTE: only string and byte Clover data fields are supported. String is compatible with most of the usual LDAP types; byte is necessary, for example, for reading the userPassword LDAP type.


Xml attributes:

Attribute | Mandatory | Description | Default | ETL version since
id | yes | component identification | |
type | yes | component type (LDAP_READER) | |
ldapUrl | yes | LDAP URL of the directory, in the form "ldap://host:port/" | |
base | yes | base DN used for the LDAP search | |
filter | yes | filter used for the LDAP search | |
scope | no | scope of the search request; must be one of object, onelevel or subtree | 'OBJECT' |
user | no | the user DN to be used when connecting to the directory | |
password | no | the password to be used when connecting to the directory | |
multiValueSeparator | no | LDAP can handle keys with multiple values. These values are delimited by this string/character. __none__ is a special escape value that turns this functionality off; only the first value is then read. | the pipe character |

Example:

  <Node id="INPUT1" 
        type="LDAP_READER"
        ldapUrl="ldap://ldap.uninett.no:389/"
        base="ou=people,dc=uninett,dc=no"
        filter="uid=*"
        scope="SUBTREE">
  </Node>
 
  <Node id="INPUT1" 
        type="LDAP_READER"
        ldapUrl="ldap://foobar.com:389/"
        base="ou=people,dc=foo,dc=bar"
        filter="uid=*"
        scope="subtree"
        user="uid=Manager,dc=foo,dc=bar"
        password="manager_pass">
  </Node>



MultiLevelReader

since 2.7.0

This is a universal reader frame used to read flat files with a heterogeneous structure. Such files can contain a mix of both fixed-length and delimited data records along with other non-record data.

Input ports:

  • one optional input port defined/connected (port protocol see fileURL).

Output ports:

  • multiple ports with assigned metadata for each data record type


The logic itself which parses the file into records is out of scope of this reader and is delegated to user implementation of a MultiLevelSelector (“selector” in further reading) interface which is the key part of a working multi level reader. There is no default mode of operation since the underlying files can have virtually any structure. Selectors are plugged into the reader via selectorClass or selectorCode properties. There will be a set of built-in implementations of common file formats, like COBOL Copybook, etc.

MultiLevelReader uses the selector to identify data of potentially various types (different metadata), then parses the particular record and sends it to one of the connected output ports, namely the one with the corresponding metadata attached. It works in a character-based loop. First it allows the selector to “take a look ahead” at the data at (or after) the current position and decide which type (metadata) the next record will be. Then it uses a standard DataParser to parse the record from the file using the metadata proposed by the selector. Finally it sends the record to the corresponding output port. The loop runs until the end of the file is reached or no further records can be identified.

Each particular implementation of MultiLevelSelector must be implemented with all possible formats and caveats of the files it is supposed to operate on in mind. Without properly working selector the whole component is likely to fail its job. Selectors often work as state machines driven by input characters.

Xml attributes:

Attribute | Mandatory | Description | Default | ETL version since
id | yes | component identification | |
type | yes | component type (MULTI_LEVEL_READER) | |
fileURL | yes | path to the input files | |
charset | no | character encoding of the input file | ISO-8859-1 |
dataPolicy | no | specifies how to handle misformatted or incorrect data. 'Strict' aborts processing and 'Lenient' attempts to skip incorrect data and continue. | 'Strict' |
skipRows | no | specifies how many records from the beginning will be skipped | |
numRecords | no | specifies how many records/rows should be read from the source | |
skipSourceRows | no | specifies how many records from the beginning of each source will be skipped | |
numSourceRecords | no | specifies how many records/rows should be read from each source | |
selectorCode | no | inline Java code of a class implementing the MultiLevelSelector interface | |
selectorURL | no | URL of a Java class implementing the MultiLevelSelector interface | |
selectorClass | no | full name of a Java class implementing the MultiLevelSelector interface. The class must be on the classpath of the running JVM. | PrefixMultiLevelSelector |
selectorProperties | no | properties (key-value pairs) for the particular selector, if applicable | |

Note:

Default selector is PrefixMultiLevelSelector specified as default value for selectorClass property. If you need to specify your own custom selector, you can use one of the three attributes: selectorCode, selectorURL or selectorClass. You must specify only one at a time.

MultiLevelSelector - under the hood

Here is a brief overview of the MultiLevelSelector interface and how it should behave. It is a basic interface, yet very powerful.

Methods of MultiLevelSelector

Method | Description
void init(DataRecordMetadata[], Properties) | Initializes the selector with the pool of metadata available on the output ports. Metadata are selected from this set, so each selector must store them.
int choose(CharBuffer) | Main method. It looks into the CharBuffer and reads until it can decide where the next record begins and what type it is. It returns an index into the “metadata pool” (see the init() method above).
int nextRecordOffset() | Always called in relation to the previous call to choose(). It must report the number of characters to skip before the start of the next identified record.
int lookAheadCharacters() | Reports how many characters the selector needs to identify the next record. The value is only informative; the method does not have to return a meaningful number (it may return 0 or a negative number).
void reset() | Resets the internal state of the selector (if there is any).

An implementation often works on the principle of a parser state machine - it reads one character after another and advances its state until it comes to conclusion. Each time a new record is to be identified, the reset() method is called. Multiple calls to choose() without reset() are possible in case of buffer underruns.

PrefixMultiLevelSelector

Default implementation of MultiLevelSelector determines records by their prefixes (in character-wise sense). Any number of prefixes can be specified and each prefix defines the output port to send the record to. The prefix-to-portnumber table can be specified in selectorProperties attribute.

Example of a simple mixed flat file:

1,a,b,c,10,20,30
2,a,apple,30,20
2,a,orange,34,56
2,b,carrot,129
3,1,2,3,4,5,6,7

In the previous example the records and their types are determined by the first two fields. PrefixMultiLevelSelector can be used to parse this file with the following selectorProperties table:

Key | Value
1 | 0
2,a | 1
2,b | 2
3 | 3

The keys on the left are the prefix strings; the values are the numbers of the output ports.
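
The prefix-to-port lookup for the simple file above can be sketched as follows. This is a stand-alone illustration, not the PrefixMultiLevelSelector source: it works on whole lines instead of a CharBuffer, the class and method names are hypothetical, and the choice of the longest matching prefix is an assumption made for the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PrefixSelectorSketch {
    // Returns the output port of the longest matching prefix, or -1 if no prefix matches.
    static int choosePort(Map<String, Integer> prefixToPort, String record) {
        int port = -1;
        int bestLength = -1;
        for (Map.Entry<String, Integer> entry : prefixToPort.entrySet()) {
            String prefix = entry.getKey();
            if (record.startsWith(prefix) && prefix.length() > bestLength) {
                bestLength = prefix.length();
                port = entry.getValue();
            }
        }
        return port;
    }

    // The selectorProperties table from the example above.
    static Map<String, Integer> sampleTable() {
        Map<String, Integer> table = new LinkedHashMap<>();
        table.put("1", 0);
        table.put("2,a", 1);
        table.put("2,b", 2);
        table.put("3", 3);
        return table;
    }

    public static void main(String[] args) {
        String[] lines = {
            "1,a,b,c,10,20,30", "2,a,apple,30,20", "2,a,orange,34,56",
            "2,b,carrot,129", "3,1,2,3,4,5,6,7"
        };
        for (String line : lines) {
            System.out.println(line + " -> port " + choosePort(sampleTable(), line));
        }
    }
}
```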

More advanced example of a mixed flat file:

The following file cannot be parsed using the default PrefixMultiLevelSelector. A custom selector is needed, but it is quite easy to implement.

# This is an example file for MultiLevelReader.
# This is a flat file that contains mix of delimited and fixed-length
# data records along with comments and blank spaces.
#
# An example implementation of MultiLevelSelector interface is responsible for all the logic here
# e.g. all comments and blank spaces are ignored by this implementation
# In this example, data types are determined by first character on a line

# next line is fixed-length data
H1953JOHN  

# these lines are mixed types delimited data
1,a,b,c,10,20,30
2,a,b,30,20

/* another form of comment */

# next two lines are again fixed-length data
CMARY  SMITH 1992F
CJANE  SMITH 1990F # note that previous newline is technically an error in fixed-length data
CJACOB SMITH 1993MCPETER SMITH 1996M # but newlines, as these comments, are skipped by the selector

# yet more delimited data
2,x,z,34,56
2,z,y,129,345
3,1,2,3,4,5,6,7

/*
  * This is a multiline
  * comment
  /* Even nested comments can be allowed */
  But must be nested properly.
  */
 
      # rest of data follow right after this indented comment
2,john,smith,3,1954
F2008SMITH 

Output from example above

The advanced example above has total of 5 output ports with following data on each of them:

Port 0

1,a,b,c,10,20,30

Port 1

2,a,b,30,20
2,x,z,34,56
2,z,y,129,345
2,john,smith,3,1954

Port 2

3,1,2,3,4,5,6,7

Port 3

H1953JOHN  F2008SMITH 

Port 4

CMARY  SMITH 1992FCJANE  SMITH 1990FCJACOB SMITH 1993MCPETER SMITH 1996M



ParallelReader

since 2.8.1

Parses the specified input data file and sends the records to the output port. The embedded parser currently covers only the delimited data format (fixlen data will be supported in a future release).

The goal of this component is very similar to that of Universal Data Reader: reading CSV files. The component was developed to maximize reading performance, and the improvement was achieved on several levels. First, the reading of the file is parallelized by a set of reading threads. The input file is divided into a set of chunks and each reading thread parses only the records from its part of the input file. This algorithm exploits very fast hard drives, which are now commonly available. The number of readers is determined by the component parameter levelOfParallelism. A further performance improvement was achieved by using a simplistic data parser. This parser is as simple as possible (limited validation, error handling and functionality) but very fast.
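
The division of the input file into chunks can be pictured with a minimal Java sketch. It only computes contiguous byte ranges, one per reading thread; the names are hypothetical, and the sketch deliberately ignores the alignment of chunk boundaries to record delimiters, which a real reader has to handle.

```java
public class ChunkSketch {
    // Splits [0, fileLength) into `workers` contiguous ranges; each row is {start, end}, end exclusive.
    static long[][] chunks(long fileLength, int workers) {
        long[][] result = new long[workers][2];
        long base = fileLength / workers;
        long remainder = fileLength % workers;
        long start = 0;
        for (int i = 0; i < workers; i++) {
            long size = base + (i < remainder ? 1 : 0); // spread the remainder over the first chunks
            result[i][0] = start;
            result[i][1] = start + size;
            start += size;
        }
        return result;
    }

    public static void main(String[] args) {
        // A 10-byte file read with levelOfParallelism="3"
        for (long[] chunk : chunks(10, 3)) {
            System.out.println(chunk[0] + ".." + chunk[1]); // 0..4, 4..7, 7..10
        }
    }
}
```

Because every thread works on its own range, records from different chunks reach the output in no fixed order, which is why the order of output records is non-deterministic for more than one worker.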

List of limitations:

  • the order of output records is non-deterministic (for levelOfParallelism > 1)
  • trim, skipLeadingBlanks and skipTrailingBlanks are ignored
  • only delimited output metadata are accepted; fixlen metadata should follow in a future release
  • skipRows and numRecords are not currently available; they may be added in a future release
  • output metadata must have a two-byte record delimiter
  • all output metadata fields must have delimiters of at most two characters
  • incremental reading is not supported
  • auto-filling is not supported



Input ports:

  • none

Output ports:

  • one mandatory output port defined/connected.


Xml attributes:

Attribute Mandatory Description Default ETL Version since
id yes component identification
type yes component type PARALLEL_READER
fileURL yes path to the data input file, the given URL is limited to real files on a local harddrive - no other protocols are supported
charset no character encoding of the input file ISO-8859-1
dataPolicy no specifies how to handle misformatted or incorrect data. 'Strict' aborts processing, 'Controlled' logs the entire record while processing continues, and 'Lenient' attempts to set incorrect data to default values while processing continues. Strict
levelOfParallelism no number of parallel running workers 2
quotedStrings no fields can be enclosed in single (' ') or double (" ") quotes false
segmentReading no when the graph runs in a Clover Server environment, the parallel reader can process only its own part of the file; the whole data file is divided into segments by Clover Server and each cluster worker processes only its own segment false

Example:

<Node id="PARALLEL_READER0" 
    type="PARALLEL_READER"
    fileURL="${DATAIN_DIR}/data.txt"
    levelOfParallelism="3"
    quotedStrings="true"/>



XMLExtract

XMLExtract parses an XML data file and sends the records to one or more outputs. The component reads a single input and has 1..n outputs.

Description:

The input must be an .xml file or a text file with XML structure. The elements and their child elements are parsed as follows:

Outputs depend on the mapping definition. Only one nesting level of elements can be sent to one output port. If an element contains nested elements, a new output port must be created for that element and its children. The same applies if a child contains further nested elements.

For example, if you have a file with this structure:

<?xml version="1.0" encoding="ISO-8859-1"?>
<BOOK>
 <ID>11</ID>
 <NAME>Western</NAME>
 <AUTHOR>John Wayne</AUTHOR>
 <CHAPTER>
   <CHNAME>In desert</CHNAME>
   <SECTION> 
      <PARA>paragraph1</PARA>
   </SECTION>
   <SECTION> 
      <PARA>paragraph2</PARA>
   </SECTION>
 </CHAPTER>
 <CHAPTER>
  <CHNAME>Back in the pub</CHNAME>
  <SECTION> 
      <PARA>paragraph3</PARA>
  </SECTION>
 </CHAPTER>
</BOOK>

For this file, you need three output ports (and three data writers). The first is for the element BOOK and its child elements that have no children (ID, NAME, AUTHOR), the second for the element CHAPTER (with CHNAME), and the last for the element SECTION (with the element PARA). The nesting depth of this document (root element BOOK) is three.

The mapping used in the XMLExtract component is:

<Mapping element="BOOK" outPort="0">
 <Mapping element="CHAPTER" outPort="1" parentKey="ID" generatedKey="ID">
   <Mapping element="SECTION" outPort="2" parentKey="CHNAME" generatedKey="CHNAME"/>
 </Mapping>
</Mapping>

The format of each flow's metadata depends on the elements that are to be written to the output file or database. In our example, there are three types of metadata. The first has the delimited structure ID;NAME;AUTHOR\r\n, the second has the structure ID;CHNAME\r\n, where ID is the identifying element from the parent tag (BOOK), and the last has the structure CHNAME;PARA\r\n, where CHNAME is the identifying element from the parent CHAPTER.

The outputs contain the following values:

Output 1:  11;Western;John Wayne
Output 2:  11;In desert                  -- "11" is the parent element identifier
           11;Back in the pub
Output 3:  In desert;paragraph1          -- "In desert" is the parent element identifier
           In desert;paragraph2
           Back in the pub;paragraph3    -- "Back in the pub" is the parent element identifier
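
To make the relationship between the mapping and the three outputs concrete, here is a minimal Python sketch that emulates it with the standard `xml.etree.ElementTree` module. It is an illustration only: the component itself is SAX-based, and the dictionary form of the mapping and the `extract` function are hypothetical names introduced for this example.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<BOOK>
 <ID>11</ID><NAME>Western</NAME><AUTHOR>John Wayne</AUTHOR>
 <CHAPTER>
   <CHNAME>In desert</CHNAME>
   <SECTION><PARA>paragraph1</PARA></SECTION>
   <SECTION><PARA>paragraph2</PARA></SECTION>
 </CHAPTER>
 <CHAPTER>
   <CHNAME>Back in the pub</CHNAME>
   <SECTION><PARA>paragraph3</PARA></SECTION>
 </CHAPTER>
</BOOK>"""

# Hypothetical dictionary form of the <Mapping> hierarchy shown above.
MAPPING = {"element": "BOOK", "port": 0, "children": [
    {"element": "CHAPTER", "port": 1, "parent_key": "ID", "generated_key": "ID", "children": [
        {"element": "SECTION", "port": 2, "parent_key": "CHNAME", "generated_key": "CHNAME",
         "children": []}]}]}


def extract(node, mapping, parent_record, ports):
    """Emit one record per mapped element, copying the parent key into the child record."""
    mapped_children = {sub["element"] for sub in mapping["children"]}
    record = {}
    if "parent_key" in mapping:
        record[mapping["generated_key"]] = parent_record[mapping["parent_key"]]
    for child in node:
        # Only leaf elements become fields; elements with their own Mapping get their own port.
        if child.tag not in mapped_children and len(child) == 0:
            record[child.tag] = (child.text or "").strip()
    ports.setdefault(mapping["port"], []).append(record)
    for sub in mapping["children"]:
        for child in node.findall(sub["element"]):
            extract(child, sub, record, ports)


ports = {}
extract(ET.fromstring(BOOK_XML), MAPPING, {}, ports)
```

After the call, `ports[0]`, `ports[1]` and `ports[2]` hold the records listed as Output 1, Output 2 and Output 3.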

All nested XML elements are recognized as record fields and mapped by name (except elements handled by other nested Mapping elements). If you prefer a mapping between XML fields and Clover fields other than 'by name', use the xmlFields and cloverFields attributes to set up a custom field mapping. The 'useNestedNodes' component attribute defines whether children of nested XML elements are also mapped to the current Clover record. A record from a nested Mapping element can be connected via key fields to the parent record produced by the parent Mapping element (see the parentKey and generatedKey attribute notes). If the fields are unsuitable for composing a key, the extractor can fill one or more fields with values coming from a sequence (see the sequenceField and sequenceId attributes).

Mapping attribute contains mapping hierarchy in XML form. DTD of mapping:

 <!ELEMENT Mappings (Mapping*)>
  <!ELEMENT Mapping (Mapping*)>
  <!ATTLIST Mapping
   element NMTOKEN #REQUIRED
     //name of the bound XML element
   outPort NMTOKEN #IMPLIED
     //name of the output port for this mapped XML element
   parentKey NMTOKEN #IMPLIED
     //field name of the parent record, which is copied into the field of the current record
     //given in the generatedKey attribute
   generatedKey NMTOKEN #IMPLIED
     //see parentKey comment
   sequenceField NMTOKEN #IMPLIED
     //field name, which will be filled with a value from a sequence
     //(can be used to generate a new key field for related records)
   sequenceId NMTOKEN #IMPLIED
     //id of the sequence used to fill the field defined in the sequenceField attribute
     //(if this attribute is omitted, a non-persistent PrimitiveSequence will be used)
   xmlFields NMTOKEN #IMPLIED
     //comma-separated XML element names, which will be mapped to the corresponding record fields
     //defined in the cloverFields attribute
   cloverFields NMTOKEN #IMPLIED
     //see xmlFields comment
   skipRows NMTOKEN #IMPLIED
     //number of elements skipped for a mapping
   numRecords NMTOKEN #IMPLIED
     //number of elements processed for a mapping
  >

Input ports: none

Output ports:

  • at least one output port defined/connected - depends on mapping definition.


Xml attributes:

Attribute Mandatory Description Default ETL Version since
id yes component identification
type yes component type XML_EXTRACT
mappingURL if mapping is not set file containing a mapping between xml elements or attributes and clover fields 2.8
mapping if mappingURL is not set mapping between xml elements or attributes and clover fields
sourceUri yes location of source XML data to process
useNestedNodes no true if nested unmapped XML elements are used as a data source; false if they are ignored true
xmlFeatures no defines how to handle input xml file. See features 2.7
skipRows no specifies how many records/rows should be skipped from the source file. Useful for files whose first rows are a header rather than real data. 0
numRecords no specifies how many records/rows should be read from the source.

Example:

  <myXML>  
   <phrase>  
    <text>hello</text>  
    <localization>  
     <chinese>how allo yee dew ying</chinese>  
     <german>wie gehts</german>  
    </localization>  
   </phrase>  
   <locations>  
    <location>  
     <name>Stormwind</name>  
     <description>Beautiful European architecture with a scenic canal system.</description>  
    </location>  
    <location>  
     <name>Ironforge</name>  
     <description>Economic capital of the region with a high population density.</description>  
    </location>  
   </locations>  
   <someUselessElement>...</someUselessElement>  
   <someOtherUselessElement/>  
   <phrase>  
    <text>bye</text>  
    <localization>  
     <chinese>she yee lai ta</chinese>  
     <german>aufwiedersehen</german>
    </localization>  
   </phrase>  
  </myXML>
 
  Suppose we want to pull out "phrase" as one datarecord, "localization" as another 
  datarecord, and "location" as the final datarecord and ignore the useless elements. 
  First we define the metadata for the records.
  Then create the following mapping in the graph:
 
  <node id="myId" type="com.lrn.etl.job.component.XMLExtract">
        <attr name="mapping"><![CDATA[
          <Mapping element="phrase" outPort="0" sequenceField="id">
              <Mapping element="localization" outPort="1" parentKey="id" generatedKey="parent_id"/>
          </Mapping>
          <Mapping element="location" outPort="2"/>
          ]]>
        </attr>
  </node>
 
  Port 0 will get the DataRecords:
  1) id=1, text=hello
  2) id=2, text=bye
 
  Port 1 will get:
  1) parent_id=1, chinese=how allo yee dew ying, german=wie gehts
  2) parent_id=2, chinese=she yee lai ta, german=aufwiedersehen
 
  Port 2 will get:
  1) name=Stormwind, description=Beautiful European architecture with a scenic canal system.
  2) name=Ironforge, description=Economic capital of the region with a high population density.
 
  Example 2:
  <x>
     <y>z</y>
     xValue
  </x>
 
  There will be no column x with the value xValue.
  Issue: namespaces are not considered.
 
  <ns1:x>xValue</ns1:x>
  <ns2:x>xValue2</ns2:x>
 
  Both elements will be treated as the same x.

Technical information:

The component is based on SAX technology and uses the standard JRE SAX parser, which loads nodes from the XML as processing proceeds.
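
The event-driven style of SAX parsing can be sketched in Python with the standard `xml.sax` module. This is only an analogy to the Java SAX parser the component uses; the `LocationHandler` class is a hypothetical name. The handler keeps just the record being built, which is why memory use stays flat as the input grows.

```python
import xml.sax


class LocationHandler(xml.sax.ContentHandler):
    """Collects one record per <location> element without building a document tree."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._record = None   # record currently being filled, if any
        self._field = None    # name of the field currently being read
        self._text = []

    def startElement(self, name, attrs):
        if name == "location":
            self._record = {}
        elif self._record is not None:
            self._field, self._text = name, []

    def characters(self, content):
        if self._field is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "location":
            self.records.append(self._record)
            self._record = None
        elif self._field == name:
            self._record[name] = "".join(self._text).strip()
            self._field = None


SAMPLE = b"""<myXML><locations>
  <location><name>Stormwind</name><description>Scenic canal system.</description></location>
  <location><name>Ironforge</name><description>Economic capital.</description></location>
</locations></myXML>"""

handler = LocationHandler()
xml.sax.parseString(SAMPLE, handler)
```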

Performance test for the component:

xml size memory allocation working time
100kB 0.2MB 1s
1MB 0.2MB 1.6s
10MB 0.2MB 3.7s

HW: AMD Athlon™ 64 Processor 3200+, SW: Suse 10.2. x86_64 Architecture, test graph: Xml extract example



XmlXPathReader

Parses an XML input data file based on XPath queries and sends the records to the specified connected output ports.

Description:

Each Context element in the context hierarchy of this component's mapping attribute iterates over all matched XML nodes (the results of its XPath query). The query of a nested Context element is evaluated on each result of the parent context. The translation of XML nodes to Clover data records is defined by the Mapping elements of the corresponding context. Every xpath or nodeName defined in a Mapping element binds its result to a Clover field. XML elements and Clover fields with the same names are mapped to each other automatically. The xpath attribute can map an arbitrary node value, whereas nodeName can map only an element from the query result. Mapping via nodeName is faster, so prefer nodeName to xpath where possible.

A record from a nested Context element can be connected via key fields to the parent record produced by the parent Context element (see the parentKey and generatedKey attribute notes). If the retrieved values are not suitable for composing a unique key, the extractor can fill one or more fields with values coming from a sequence (see the sequenceField and sequenceId attributes).

If the XML document being read defines XML namespaces, you have to specify the namespacePaths attribute in the mapping. See the description of namespacePaths in the DTD of the mapping below. Child elements of the mapping inherit the namespacePaths definition from their parent element.

Mapping attribute contains mapping hierarchy in XML form. DTD of mapping:

  <!ELEMENT Context (Context* | Mapping*)>
  <!ATTLIST Context
    xpath NMTOKEN #REQUIRED          //xpath query to the xml node
    outPort NMTOKEN #IMPLIED         //name of the output port for this mapped XML element
    parentKey NMTOKEN #IMPLIED       //field name of the parent record, which is copied into the field of
                                     //the current record given in the generatedKey attribute
    generatedKey NMTOKEN #IMPLIED    //see parentKey comment
    sequenceField NMTOKEN #IMPLIED   //field name, which will be filled with a value from a sequence
                                     //(can be used to generate a new key field for related records)
    sequenceId NMTOKEN #IMPLIED      //id of the sequence used to fill the field
                                     //defined in the sequenceField attribute (if this attribute is omitted,
                                     //a non-persistent PrimitiveSequence will be used)
    namespacePaths NMTOKEN #IMPLIED  //list of namespaces delimited by ';' used for the xpath attribute
                                     //example: namespacePaths='n1="http://www.w3.org/TR/html4/";n2="http://ops.com/"'
                                     //example for default namespace: namespacePaths='"http://www.w3.org/TR/html4/";n2="http://ops.com/"'
  >
 
  <!ELEMENT Mapping>
  <!ATTLIST Mapping
    cloverField NMTOKEN #REQUIRED           //name of the metadata field
    xpath NMTOKEN #REQUIRED if no nodeName  //xpath query to the xml value
    nodeName NMTOKEN #REQUIRED if no xpath  //xml node from which the text is taken directly; it is quicker than xpath
    trim NMTOKEN #IMPLIED                   //trims leading and trailing spaces (true by default)
    namespacePaths NMTOKEN #IMPLIED         //list of namespaces delimited by ';' used for the xpath attribute
                                            //example: namespacePaths='n1="http://www.w3.org/TR/html4/";n2="http://ops.com/"'
                                            //example for default namespace: namespacePaths='"http://www.w3.org/TR/html4/"'
  >

XPath language: http://www.w3.org/TR/xpath

XPath hint: conditional expression

  • /node1/nodeX[id/@value=1] - select all nodes where attribute “value” in the subnode “id” is 1
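
The effect of the conditional expression above can be checked with a small Python sketch. Python's standard `xml.etree.ElementTree` implements only a subset of XPath and has no support for a nested path inside a predicate, so the condition is applied in plain Python here; the sample document and the helper names are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<node1>
  <nodeX><id value="1"/><name>first</name></nodeX>
  <nodeX><id value="2"/><name>second</name></nodeX>
  <nodeX><name>no id</name></nodeX>
</node1>"""


def to_number(text):
    """Mimic XPath number(): anything that is not numeric becomes NaN."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return float("nan")


def select_node_x(root):
    """Equivalent of /node1/nodeX[id/@value=1], evaluated from the <node1> root element."""
    return [
        node for node in root.findall("nodeX")
        if any(to_number(i.get("value")) == 1 for i in node.findall("id"))
    ]


selected = [node.findtext("name") for node in select_node_x(ET.fromstring(DOC))]
```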


Input ports:

  • one optional input port defined/connected (for the port protocol, see fileURL).

Output ports:

  • at least one output port defined/connected - depends on mapping definition.


Xml attributes:

Attribute Mandatory Description Default ETL Version since
id yes component identification
type yes component type XML_XPATH_READER
fileURL yes location of source XML data to process
mappingURL if mapping is not set file containing a mapping between xml elements or attributes and clover fields 2.8
mapping yes mapping between xml elements or attributes and clover fields
dataPolicy no specifies how to handle misformatted or incorrect data. 'Strict' aborts processing, 'Controlled' logs the entire record while processing continues, and 'Lenient' attempts to set incorrect data to default values while processing continues. Strict
xmlFeatures no defines how to handle input xml file. See features 2.7
skipRows no specifies how many records/rows should be skipped from the source file. 0
numRecords no max number of parsed records

Example:

  <myXML xmlns:n2="http://ops.com/" xmlns:n1="http://www.w3.org/TR/html4/">  
	<phrase>  
		<text>hello</text>  
		<localization aid="100">  
			<chinese>how allo yee dew ying</chinese>  
			<german>wie gehts</german>  
		</localization>  
	</phrase>  
	<locations>
		<n1:location>  
			<name>Stormwind</name>  
			<description>Beautiful European architecture with a scenic canal system.</description>  
		</n1:location>  
		<n2:location>  
			<name>Ironforge</name>  
			<description>Economic capital of the region with a high population density.</description>  
		</n2:location>  
	</locations>  
	<someUselessElement>...</someUselessElement>  
	<someOtherUselessElement/>  
	<phrase>  
		<text>bye</text>  
		<localization aid="101">  
			<chinese>she yee lai ta</chinese>  
			<german>aufwiedersehen</german>
		</localization>  
	</phrase>  
  </myXML>
 
  Suppose we want to pull out "phrase" as one datarecord, "localization" as another 
  datarecord, and "location" as the final datarecord and ignore the useless elements. 
  First we define the metadata for the records.
  Then create the following mapping in the graph:
 
  <node id="myId" type="XML_XPATH_READER">
        <attr name="mapping"><![CDATA[
          <Context xpath="/myXML" >
              <Context xpath="phrase" outPort="0" sequenceField="id">
                  <Context xpath="localization" outPort="1" parentKey="id" generatedKey="parent_id">
                      <Mapping xpath="./@aid" cloverField="aid"/>
                  </Context>
              </Context>
              <Context xpath="locations/n2:location" outPort="2" namespacePaths="n1='http://www.w3.org/TR/html4/';n2='http://ops.com/'"/>
          </Context>
          ]]>
        </attr>
  </node>
 
  // alternative: the same mapping with the optional Mapping elements written out explicitly
  <node id="myId" type="XML_XPATH_READER">
        <attr name="mapping"><![CDATA[
          <Context xpath="/myXML" >
              <Context xpath="phrase" outPort="0" sequenceField="id">
                  <Mapping nodeName="text" cloverField="text"/>
                  <Context xpath="localization" outPort="1" parentKey="id" generatedKey="parent_id">
                      <Mapping nodeName="chinese" cloverField="chinese"/>
                      <Mapping nodeName="german" cloverField="german"/>
                      <Mapping xpath="./@aid" cloverField="aid"/>
                  </Context>
              </Context>
              <Context xpath="locations/n2:location" outPort="2" namespacePaths="n1='http://www.w3.org/TR/html4/';n2='http://ops.com/'">
                  <Mapping xpath="name/text()" cloverField="name"/>
                  <Mapping xpath="description" cloverField="description"/>
              </Context>
          </Context>
          ]]>
        </attr>
  </node>
 
 
  Port 0 will get the DataRecords:
  1) id=1, text=hello
  2) id=2, text=bye
 
  Port 1 will get:
  1) parent_id=1, chinese=how allo yee dew ying, german=wie gehts, aid=100
  2) parent_id=2, chinese=she yee lai ta, german=aufwiedersehen, aid=101
 
  Port 2 will get:
  1) name=Ironforge, description=Economic capital of the region with a high population density.
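
The same context hierarchy can be walked in a few lines of Python with `xml.etree.ElementTree`, which accepts a prefix-to-URI dictionary for namespaced queries. This is a sketch under stated assumptions, not the component's Saxon-based implementation: the `read_contexts` function and the shortened sample document are hypothetical, and a simple counter stands in for the sequence that fills the id field.

```python
import xml.etree.ElementTree as ET

DOC = """<myXML xmlns:n2="http://ops.com/" xmlns:n1="http://www.w3.org/TR/html4/">
  <phrase><text>hello</text>
    <localization aid="100"><chinese>how allo yee dew ying</chinese><german>wie gehts</german></localization>
  </phrase>
  <locations>
    <n1:location><name>Stormwind</name><description>Scenic canal system.</description></n1:location>
    <n2:location><name>Ironforge</name><description>Economic capital.</description></n2:location>
  </locations>
  <phrase><text>bye</text>
    <localization aid="101"><chinese>she yee lai ta</chinese><german>aufwiedersehen</german></localization>
  </phrase>
</myXML>"""

# Plays the role of the namespacePaths attribute.
NAMESPACES = {"n1": "http://www.w3.org/TR/html4/", "n2": "http://ops.com/"}


def read_contexts(doc):
    root = ET.fromstring(doc)           # Context xpath="/myXML"
    ports = {0: [], 1: [], 2: []}
    for seq, phrase in enumerate(root.findall("phrase"), start=1):
        # sequenceField="id": the counter value becomes the key of the phrase record.
        ports[0].append({"id": seq, "text": phrase.findtext("text")})
        for loc in phrase.findall("localization"):
            # parentKey="id" generatedKey="parent_id": copy the parent key into the child.
            ports[1].append({"parent_id": seq,
                             "chinese": loc.findtext("chinese"),
                             "german": loc.findtext("german"),
                             "aid": loc.get("aid")})
    # Only locations in the n2 namespace match, so Stormwind (n1) is not emitted.
    for location in root.findall("locations/n2:location", NAMESPACES):
        ports[2].append({"name": location.findtext("name"),
                         "description": location.findtext("description")})
    return ports


ports = read_contexts(DOC)
```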

Technical information:

The component is based on DOM technology and uses the Saxon-B DOM parser, which loads the whole XML document into memory and then prepares nodes for XPath queries.
DOM parsers need more memory than SAX parsers, and XPath evaluation is slower because XPath is a complex query language. If the file is too big and Java throws an OutOfMemoryError, increase the Java heap size with a JVM parameter such as -Xmx512m, where 512 is the heap size in MB.

Performance test for the component:

xml size memory allocation working time
100kB 0.45MB 1.5s
1MB 2.8MB 2.9s
10MB 24.6MB 6.5s

HW: AMD Athlon™ 64 Processor 3200+, SW: Suse 10.2. x86_64 Architecture, test graph: Xml xpath reader example
The data and output are the same as in the XMLExtract performance test.

XLSReader

Parses data from an xls file and sends the records to the output ports. JExcel can handle files of up to ~8.1MB in xls format (~4.9MB as a flat file); for larger data, give the JVM more memory.

Input ports:

  • one optional input port defined/connected (port protocol see fileURL).

Output ports:

  • at least one output port defined/connected.


Xml attributes:

Attribute Mandatory Description Default ETL Version since
id yes component identification
type yes component type XLS_READER
parser no The type of a XLS(X) parser. Possible values: 'auto' for automatic selection of a parser based on a file extension, 'XLS' for a classic XLS parser, 'XLSX' for a XLSX parser. auto
fileURL yes path to the input file
dataPolicy no specifies how to handle misformatted or incorrect data. 'Strict' aborts processing, 'Controlled' logs the entire record while processing continues, and 'Lenient' attempts to set incorrect data to default values while processing continues. 'Strict'
maxErrorCount no count of tolerated error records in input file (applicable only for controlled data policy) 0
sheetName no name of sheet for reading data. Can be used with wild cards as '?' and '*'
sheetNumber no number of the sheet to read data from (starting from 0). Can be set as a mask: {number; minNumber-maxNumber; *-maxNumber; minNumber-*; or a combination of these separated by commas, e.g. 1,3,5-7,9-*}. This attribute has higher priority than sheetName. One of these attributes has to be set.
metadataRow no number of the row that contains the column names 0
fieldMap no Pairs of clover fields and xls columns (cloverField=xlsColumn) separated by :;| {colon, semicolon, pipe}. Can be used for mapping clover fields to xls fields or for defining the order in which columns are read from the xls sheet. Xls columns can be written as the names given in the row specified by the metadataRow attribute or as column codes preceded by $. Xls fields may be omitted; columns are then read in the order they appear in the xls sheet and assigned to the corresponding metadata fields. Since version 2.5, the standard mapping syntax should be used: clover fields are preceded by $, xls cell codes by #, mappings are separated by :;| {colon, semicolon, pipe} and the assignment sign is :=, e.g. $Freight:=FREIGHT or $Freight:=#C. When metadataRow>0, the default mapping is by column name (if metadataRow>0 and fieldMap is not set, clover fields whose names differ from the names of xls columns will be empty or filled with 0); when metadataRow=0, the default mapping is by column index (if metadataRow=0, fieldMap is not set, and a clover field has a different data type than the xls column at the same index, the graph will fail).
charset no character encoding of the input file. Do not set it if XLSReader uses the POI library (it recognizes the encoding automatically); it applies when XLSReader uses JExcelAPI. ISO-8859-1
skipRows no specifies how many records/rows should be skipped from the source file; useful for files whose first rows are a header rather than real data. It also depends on the metadataRow number. 0
numRecords no specifies how many records/rows should be read from the source. It also depends on the metadataRow number.
skipSourceRows no specifies how many records/rows should be skipped from every source file; useful for files whose first rows are a header rather than real data. 0 2.8
numSourceRecords no specifies how many records/rows should be read from every source. 2.8
startRow no index of first parsed record 0
finalRow no index of final parsed record
incrementalFile if incrementalKey is set property file used for incremental reading
incrementalKey if incrementalFile is set name of the property in the property file that stores the last reading position


Both startRow and finalRow are deprecated and should not be used.
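
The sheetNumber mask syntax can be illustrated with a short Python sketch that expands a mask into concrete sheet indexes. The function name `expand_sheet_mask` and the clipping of out-of-range numbers to the last sheet are assumptions made for this illustration, not documented component behaviour.

```python
def expand_sheet_mask(mask: str, sheet_count: int) -> list:
    """Expand a mask such as '1,3,5-7,9-*' into a sorted list of zero-based sheet indexes."""
    selected = set()
    for part in mask.split(","):
        part = part.strip()
        if part == "*":
            first, last = 0, sheet_count - 1
        elif "-" in part:
            low, high = part.split("-", 1)
            first = 0 if low == "*" else int(low)
            last = sheet_count - 1 if high == "*" else int(high)
        else:
            first = last = int(part)
        # Assumption: indexes beyond the last sheet are silently ignored.
        selected.update(range(first, min(last, sheet_count - 1) + 1))
    return sorted(selected)
```

For a workbook with 12 sheets, `expand_sheet_mask("1,3,5-7,9-*", 12)` selects sheets 1, 3, 5, 6, 7, 9, 10 and 11.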

Example:

  <Node id="XLS_READER1" type="XLS_READER" fileURL="ORDERS.xls"/>
 
 
  <Node id="XLS_READER1"
        type="XLS_READER"
        fieldMap="ORDER=ORDERID,N,20,5;CUSTOMERID=CUSTOMERID,C,5; 
                  EMPLOYEEID=EMPLOYEEID,N,20,5;ORDERDATE=ORDERDATE,D;REQUIREDDA=REQUIREDDA,
                  D;SHIPCOUNTR=SHIPCOUNTR,C,15"
        fileURL="ORDERS.xls"
        metadataRow="1" 
        startRow="2">
  </Node>
 
  <Node id="XLS_READER1"
        type="XLS_READER"
        fieldMap="ORDER=$a;CUSTOMERID=$b;EMPLOYEEID=$c;ORDERDATE=$d;
                  REQUIREDDA=$d;SHIPPEDDAT=$f;SHIPVIA=$g;FREIGHT=$h;SHIPNAME=$i;SHIPADDRES=$j;
                  SHIPCITY=$k;SHIPREGION=$l;SHIPPOSTAL=$n;SHIPCOUNTR=$m"
        fileURL="ORDERS.xls"
        metadataRow="1">
  </Node>
 
  <Node id="XLS_READER1"
        type="XLS_READER"
        fieldMap="ORDER;CUSTOMERID;EMPLOYEEID;ORDERDATE;SHIPCOUNTR" 
        fileURL="*.xls"
        sheetNumber="*">
  </Node>
 
  <Node id="XLS_READER0" 
        type="XLS_READER" 
        dataPolicy="strict"
        fileURL="example.xls"
        metadataRow="1" 
        startRow="2"
        sheetName="Sheet?">
  </Node>
 
  <Node id="XLS_READER1" 
        type="XLS_READER"
        fileURL="${DATAIN_DIR}/other/O*.xls" 
        metadataRow="1" 
        sheetNumber="*"
        fieldMap="$OrderDate:=ORDERDATE,D;$EmployeeID:=EMPLOYEEID,N,20,5;$Freight:=FREIGHT,N,20,5;$ShipCountry:=SHIPCOUNTR,C,15;">
   </Node>
 
   <Node id="XLS_READER1" 
        type="XLS_READER"
        fileURL="${DATAIN_DIR}/other/O*.xls" 
        metadataRow="1" 
        sheetNumber="*"
        fieldMap="$OrderDate:=#D;$ShipAddress:=#J;$ShipPostalCode:=#M;$ShipName:=#I;$CustomerID:=#B;$ShipCity:=#K;">
   </Node>
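
The standard fieldMap syntax used in the last two examples ($cloverField:=columnName or $cloverField:=#columnCode) can be parsed as in the following Python sketch. The function names are hypothetical, and the rule that a colon acts as a separator only when it is not part of the := sign is an assumption made to resolve the overlap between the two.

```python
import re


def column_index(code: str) -> int:
    """Convert a spreadsheet column code such as 'C' or 'AA' to a zero-based index."""
    index = 0
    for ch in code.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


def parse_field_map(field_map: str) -> dict:
    """Return {cloverField: ("column", index)} or {cloverField: ("name", columnName)}."""
    mapping = {}
    # Separators are ; and |, plus a colon that does not start the := assignment sign.
    for pair in re.split(r"[;|]|:(?!=)", field_map):
        pair = pair.strip()
        if not pair:
            continue
        field, _, source = pair.partition(":=")
        field, source = field.strip().lstrip("$"), source.strip()
        if source.startswith("#"):
            mapping[field] = ("column", column_index(source[1:]))
        else:
            mapping[field] = ("name", source)
    return mapping


parsed = parse_field_map("$OrderDate:=#D;$Freight:=FREIGHT|$ShipName:=#I")
```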



Appendix 1 - Attributes

dataPolicy

- specifies what should happen when a BadDataFormatException is thrown. This can happen, for example, when:

  • a parsed value can't be assigned to a data field (for example, when the value is null and the field is not nullable)
  • the value has the wrong format
  • the parser finds an unexpected record/field delimiter or end of file.


There are three different data policies defined:

Value Description
Strict any BadDataFormatException aborts processing of the graph. This is the default value for specific readers.
Controlled every BadDataFormatException is only logged, together with the entire record, and processing continues with the next record.
Lenient every BadDataFormatException is skipped and processing continues with the next record.
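
The three policies in the table can be sketched in Python as follows. This is an illustration of the behaviour described in the table only; the `BadDataFormatError` class, the two-field record layout and the function names are hypothetical stand-ins, not the engine's code.

```python
import logging

logger = logging.getLogger("reader")


class BadDataFormatError(ValueError):
    """Stand-in for BadDataFormatException in this sketch."""


def parse_record(line: str):
    """Parse 'id;name' where id must be an integer."""
    try:
        id_text, name = line.split(";")
        return int(id_text), name
    except ValueError as exc:
        raise BadDataFormatError(f"cannot parse record {line!r}") from exc


def read_records(lines, data_policy: str = "Strict"):
    records = []
    for line in lines:
        try:
            records.append(parse_record(line))
        except BadDataFormatError as exc:
            if data_policy == "Strict":
                raise                      # abort processing
            if data_policy == "Controlled":
                logger.warning("%s", exc)  # log the record, continue with the next one
            # Lenient: skip silently and continue with the next record
    return records


LINES = ["1;a", "x;b", "3;c"]
```

With these lines, `read_records(LINES, "Controlled")` and `read_records(LINES, "Lenient")` both return the two valid records, and only the first logs the bad one; `read_records(LINES)` raises on the second line.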

fileURL

Value Description
/path/filename.txt path to a local data input file.
/path/filename1.txt;/path/filename2.txt paths to two local data input files.
/path/* path to local data input files. The component reads all files in the directory.
/path/file00?.txt path to local data input files. The component reads all files matching the wildcard.
/path/file.txt;/path/file2.txt paths to local data input files. The component reads all files in the semicolon-delimited list.
zip:/path/filename.zip path to the data zip input file. Component reads first file in zip file.
zip:/path/filename.zip#name.txt path to the data zip input file. Component reads one file marked after '#'.
gzip:/path/filename.gz path to the data gzip input file.
ftp://user:password@server/path/name.txt ftp address to the data input file.
ftp://user:password@server/path/name*.txt ftp address to the data input file with a wild card.
sftp://user:password@server/path/name.txt sftp address to the data input file.
sftp://user:password@server/path/name*.txt sftp address to the data input file with a wild card.
http://server/path/name.txt http address to the data input file.
https://server/path/name.txt https address to the data input file.
zip:(http://server/path/name.zip)#filename.txt path to the data zip input file via http.
zip:(ftp://user:password@server/path/name.zip)#filename.txt path to the data zip input file via ftp.
zip:(zip:(http://server/path/name.zip)#filename.zip)#name.txt path to the data zip inner input file via http.
zip:(/path/filename?.zip)#name.* path to the data zip inner input file with wild cards. since 2.7
zip:/path/filename?.zip#name.* path to the data zip inner input file with wild cards. since 2.7
gzip:(http://server/path/name.gz) path to the data gzip input file via http.
gzip:(ftp://user:password@server/path/name.gz) path to the data gzip input file via ftp.
tar:(path/name.tar)#path/filename.txt path to the data tar input file.
tar:(gzip:path/name.tar.gz)#path/filename.txt path to the data tar input file that is gzipped.
tar:(ftp://user:password@server/path/name.tar)#filename.txt path to the data tar input file via ftp.
tar:((gzip:/path/name?.gz)#filename?.tar)#name.??? path to the data tar/gzip inner input file with wild cards. since 2.7
port:$0.fieldName:source each data record field from the input port represents a URL to be loaded and parsed. *1)
port:$0.fieldName:discrete each data record field from the input port represents one particular data source. *1)
port:$0.fieldName:stream all data fields from the input port are concatenated (since version 2.10, until a field containing a null value) and represent one particular data source. *1)
dict:keyName:discrete reads data from a dictionary. *2)
dict:keyName:source reads data from a dictionary like the discrete type, but expects the value to be an input URL/file; the data from that input are passed to the reader. *2)
- stdin (console) is the data input.

  • *1) protocol for a field mapping. You only need the port and the field you want to read from, plus a processing type that defines whether the data are plain data or URL addresses. The protocol has this syntax: port:$port.field[:processingType]. The processingType is optional and can be “source”, “discrete” or “stream”; “discrete” is the default. The input field from which data are read can only be of type string, byte or cbyte.
  • *1) the protocol can be used for DataReader, DBFDataReader, DelimitedDataReader, FixLenDataReader and XmlXPathReader
  • *1) this protocol can also be used for DBExecute when the SQL command is received through the input port. The Query URL is then port:$0.fieldName:discrete. The SQL command can also be read from a file whose name, including the path, is passed to DBExecute through the input port; in that case the Query URL attribute should be port:$0.fieldName:source.

  • *2) protocol for a dictionary mapping. You only need the dictionary and the keyName you want to read from. The protocol has this syntax: dict:keyName[:processingType], where processingType is 'discrete' or 'source'; 'discrete' is the default when the type is not given. The reader determines the type of the source value from the dictionary and creates a readable channel for the parser. The reader supports the following source types: InputStream, byte[], ReadableByteChannel, CharSequence, CharSequence[], List<CharSequence>, List<byte[]>, ByteArrayOutputStream.
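
The nesting of archive prefixes in the fileURL table (for example zip:(zip:(http://...)#inner.zip)#name.txt) can be made explicit with a small recursive parser. The Python sketch below is a simplified illustration with a hypothetical function name: it only decomposes the URL, does not open anything, and does not cover every form in the table (wildcards are left untouched, and the double-parenthesis tar/gzip form is not handled).

```python
import re

ARCHIVE_PREFIX = re.compile(r"^(zip|gzip|tar):(.*)$")


def parse_file_url(url: str):
    """Return the URL itself for a plain source, or a nested dict for archive prefixes."""
    match = ARCHIVE_PREFIX.match(url)
    if not match:
        return url
    kind, rest = match.groups()
    if rest.startswith("("):
        # Find the parenthesis that closes the inner URL.
        depth = 0
        for pos, ch in enumerate(rest):
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0:
                    break
        else:
            raise ValueError(f"unbalanced parentheses in {url!r}")
        inner, tail = rest[1:pos], rest[pos + 1:]
    else:
        inner, sep, entry = rest.partition("#")
        tail = sep + entry
    return {
        "archive": kind,
        "source": parse_file_url(inner),
        "entry": tail[1:] if tail.startswith("#") else None,
    }


nested = parse_file_url("zip:(zip:(http://server/path/name.zip)#filename.zip)#name.txt")
```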



Proxy specification for a URL in the fileURL attribute. The URL can have three proxy protocols:

  • direct - represents a direct connection, or the absence of a proxy.
  • proxy - represents proxy for high level protocols such as HTTP or FTP.
  • proxysocks - represents a SOCKS (V4 or V5) proxy.


Value Description
http:(direct:)//seznam.cz/ no proxy used
http:(proxy://user:password@212.93.193.82:443)//seznam.cz/ proxy for http protocol
ftp:(proxy://user:password@proxyserver:1234)//seznam.cz/ proxy for ftp protocol
sftp:(proxy://66.11.122.193:443)//user:password@server/path/file.dat proxy for sftp protocol


Appendix 2 - Summary

Component                bytes/records  skip/num  charset  zip  gzip  tar *1)  ftp  sftp  http  https  stdin  autofilling
                         per file       records
CloverDataReader         -              yes       -        yes  yes   yes      yes  yes   yes   yes    yes    yes
DataGenerator            -              -         -        -    -     -        -    -     -     -      -      yes
DataReader               yes            yes       yes      yes  yes   yes      yes  yes   yes   yes    yes    yes
DBFDataReader            yes            yes       yes      yes  yes   yes      yes  yes   yes   yes    yes    yes
DBInputTable             -              -         -        -    -     -        -    -     -     -      -      yes
DelimitedDataReader      yes            yes       yes      yes  yes   yes      yes  yes   yes   yes    yes    yes
FixLenDataReader         yes            yes       yes      yes  yes   yes      yes  yes   yes   yes    yes    yes
JmsReader                -              -         -        -    -     -        -    -     -     -      -      yes
LdapReader               -              -         -        -    -     -        -    -     -     -      -      yes
LookupTableReaderWriter  -              -         -        -    -     -        -    -     -     -      -      no
MultiLevelReader         -              yes       yes      yes  yes   yes      yes  yes   yes   yes    yes    yes
XLSReader                yes            yes       yes      yes  yes   yes      yes  yes   yes   yes    yes    yes
XMLExtract               yes            yes       -        yes  yes   yes      yes  yes   yes   yes    yes    yes
XMLXPathReader           yes            yes       -        yes  yes   yes      yes  yes   yes   yes    yes    yes

  • *1) since clover etl 2.7


components/readers.txt · Last modified: 2010/06/10 15:55 by twaller