Lookup tables contain data records that are searched.
There are five variants of lookup tables currently supported:
Since 2.6 version all lookup tables can be internal - all attributes and theirs values are stored directly in graph, or external (shared) - attributes and theirs values are stored in separate file and in graph there is only link to the file eg.:
in graph:
<LookupTable id="LookupTable0" lookupConfig="${LOOKUP_DIR}/MyRangeLookup.cfg"/>
in MyRangeLookup.cfg file:
endFields=dateEnd;timeEnd;nameEnd
endInclude=true;true;true
metadata=${META_DIR}/delimited/schedule.fmt
name=Tour schedule
startFields=dateStart;timeStart;nameStart
startInclude=true;true;true
type=rangeLookup
useI18N=true
All data records stored in the simple lookup table are kept in memory. Therefore, sufficient memory has to be available in order to accommodate all data records to be stored. If the data records are loaded from a data file, the size of available memory should be approximately 6× bigger than the size of the data file. However, this multiplier may vary for different types of data records stored in the data file.
Xml attributes:
| Attribute | Mandatory | Description | Default | Version |
|---|---|---|---|---|
id | yes | component identification | ||
type | yes | component type | simpleLookup | |
metadata | yes | metadata associated with the DataRecord stored in this lookup table. For internal table this is id of requested metadata, for external table this is file with metadata definition | ||
fileURL | yes | path to the data file. | ||
key | yes | names of fields which comprise indexKey to lookup table. Specifies what object will be used for looking up data. | ||
dataType | yes | type of file data - delimited or fixed | till 2.4 | |
charset | no | character encoding of the lookup data | ISO-8859-1 | |
initialSize | no | initial size of lookup table | “Lookup.LOOKUP_INITIAL_CAPACITY” from properties or 512 | |
keyDuplicates | no | indicates if there can be more records stored with one key | false | since 2.6 |
Xml definition:
Transformation for ver. 2.5:
<Global> ... <LookupTable id="LookupTable0" type="simpleLookup" metadata="Metadata0" fileURL="data/delimited/employees.dat" dataType="delimited" charset="ISO-8859-1" key="EmployeeID" name="name" /> </Global> <Node id="JOIN" type="REFORMAT"> <attr name="transform"> import org.jetel.component.DataRecordTransform; import org.jetel.data.DataRecord; import org.jetel.data.RecordKey; import org.jetel.data.lookup.LookupTable; import org.jetel.exception.*; import org.jetel.graph.TransformationGraph; public class reformatTest extends DataRecordTransform { LookupTable mylookup; public boolean init(){ try { mylookup = graph.getLookupTable("LookupTable0"); if (!mylookup.isInitialized()) { mylookup.init(); } } catch (ComponentNotReadyException e) { e.printStackTrace(); return false; } // create key String[] lookupKeyStr={"EmployeeID"}; RecordKey key=new RecordKey(lookupKeyStr,sourceMetadata[0]); key.init(); mylookup.setLookupKey(key); return true; } public boolean transform(DataRecord[] source, DataRecord[] target){ DataRecord employee; employee=mylookup.get(source[0]); if (employee==null) return false; // skip this one target[0].getField(0).setValue(source[0].getField(0).getValue()); target[0].getField(1).setValue(source[0].getField(1).getValue()); target[0].getField(2).setValue(source[0].getField(2).getValue().toString()); target[0].getField(3).setValue(employee.getField(0).getValue().toString()); target[0].getField(4).setValue(employee.getField(1).getValue()); return true; } public void finished(){ mylookup.free(); } } </attr> </Node>Transformation since ver. 2.6:
<Global> ... <LookupTable id="LookupTable0" type="simpleLookup" metadata="Metadata0" fileURL="data/delimited/employees.dat" dataType="delimited" charset="ISO-8859-1" key="EmployeeID" name="name" /> </Global> <Node id="JOIN" type="REFORMAT"> <attr name="transform"> import org.jetel.component.DataRecordTransform; import org.jetel.data.DataRecord; import org.jetel.data.RecordKey; import org.jetel.data.lookup.Lookup; import org.jetel.data.lookup.LookupTable; import org.jetel.exception.ComponentNotReadyException; public class reformatTest extends DataRecordTransform{ Lookup mylookup; public boolean init() throws ComponentNotReadyException{ // create key String[] lookupKeyStr={"EmployeeID"}; RecordKey key=new RecordKey(lookupKeyStr,sourceMetadata[0]); key.init(); LookupTable lt = graph.getLookupTable("LookupTable0"); lt.init(); mylookup = lt.createLookup(key); return true; } public int transform(DataRecord[] source, DataRecord[] target){ mylookup.seek(source[0]); if (!mylookup.hasNext()) {// skip this one errorMessage = "Order: " + source[0].getField(0).getValue() + " - no corresponding record in lookup table"; return SKIP; } DataRecord employee = (DataRecord)mylookup.next(); target[0].getField(0).setValue(source[0].getField(0).getValue()); target[0].getField(1).setValue(source[0].getField(1).getValue()); target[0].getField(2).setValue(source[0].getField(2).getValue().toString()); target[0].getField(3).setValue(employee.getField(0).getValue().toString()); target[0].getField(4).setValue(employee.getField(1).getValue()); return ALL; } } </attr> </Node>
Lookup definition dialog:
| Attribute | Mandatory | Description | Default |
|---|---|---|---|
id | yes | component identification | |
type | yes | component type | dbLookup |
metadata | yes | metadata associated with the DataRecord stored in this lookup table. For internal table this is id of requested metadata, for external table this is file with metadata definition | |
sqlQuery | yes | db query for getting data from database in accordance to key value. | |
dbConnection | yes | db connection to database. For internal table this is id of requested connection, for external table this is file with connection properties | |
maxCached | no | max record cached in lookup table | 0 |
storeNulls | no | indicates if to remember records for which there are no corresponding records in database | false |
Xml definition:
<Global> ... <LookupTable id="LookupTable0" type="dbLookup" metadata="Metadata0" dbConnection="DBConnection0" name="name" > <attr name="sqlQuery">select * from employee where employee_id=?</attr> </LookupTable> </Global> <Node id="JOIN" type="REFORMAT"> <attr name="transform"> import org.jetel.component.DataRecordTransform; import org.jetel.data.DataRecord; import org.jetel.data.RecordKey; import org.jetel.data.lookup.LookupTable; import org.jetel.exception.*; import org.jetel.graph.TransformationGraph; public class reformatTest extends DataRecordTransform { LookupTable mylookup; public boolean init(){ try { mylookup = graph.getLookupTable("LookupTable0"); if (!mylookup.isInitialized()) { mylookup.init(); } } catch (ComponentNotReadyException e) { e.printStackTrace(); return false; } // create key String[] lookupKeyStr={"EmployeeID"}; RecordKey key=new RecordKey(lookupKeyStr,sourceMetadata[0]); key.init(); mylookup.setLookupKey(key); return true; } public boolean transform(DataRecord[] source, DataRecord[] target){ DataRecord employee; employee=mylookup.get(source[0]); if (employee==null) return false; // skip this one target[0].getField(0).setValue(source[0].getField(0).getValue()); target[0].getField(1).setValue(source[0].getField(1).getValue()); target[0].getField(2).setValue(source[0].getField(2).getValue().toString()); target[0].getField(3).setValue(employee.getField(0).getValue().toString()); target[0].getField(4).setValue(employee.getField(1).getValue()); return true; } public void finished(){ mylookup.free(); } } </attr> </Node>
Range lookup table contains records, which defines intervals. It means that they must have special structure: first field is the name of the interval, odd fields marks starts of intervals, even fields (from 2) means corresponding ends of intervals, eg: Lookup table defined as follows:
has 4 intervals with 2 searching parameters: first from interval 0-10, and second from interval 0-100.
Intervals can overlap.
It is possible to use “unlimited” intervals by setting null value to start or end field, eg. interval null,0 “contains” elements smaller then 0.
Xml attributes:
| Attribute | Mandatory | Description | Default | Version |
|---|---|---|---|---|
id | yes | component identification | ||
type | yes | component type | simpleLookup | |
metadata | yes | metadata associated with the DataRecord stored in this lookup table. For internal table this is id of requested metadata, for external table this is file with metadata definition | ||
fileURL | yes | path to the data file. | ||
dataType | yes | type of file data - delimited or fixed | till 2.4 | |
charset | no | character encoding of the lookup data | ISO-8859-1 | |
useI18N | no | true/false perform comparing according to national rules - e.g. Czech or German handling of characters like “i”,”í”. | false | |
locale | no | locale to be used when sorting using I18N rules. | system default | |
startInclude | no | indicates if start points are included to intervals. It has to have such number of booleans as there is number of intervals | true | |
endInclude | no | indicates if end points are included to intervals. It has to have such number of booleans as there is number of intervals | false |
Xml definition:
<Global> ... <LookupTable type="rangeLookup" id="LookupTable0" metadata="Metadata0" charset="ISO-8859-1" dataType="delimited"> <attr name="endInclude">true;true;true</attr> <attr name="startInclude">true;true;true</attr> <attr name="useI18N">true</attr> <attr name="name">Tour schedule</attr> </LookupTable> </Global> <Node id="LOOKUP_JOIN0" joinKey="date;time;name" lookupTable="LookupTable0" type="LOOKUP_JOIN"> <attr name="transform">${out.0.name} = ${in.0.name}; ${out.0.date} = ${in.0.date}; ${out.0.time} = ${in.0.time}; ${out.0.activity} = ${in.1.name}; </attr> </Node>
ETL Version since: 2.6
Complete lookup table stored in external file. It uses jdbm.
Xml attributes:
| Attribute | Mandatory | Description | Default |
|---|---|---|---|
id | yes | component identification | |
type | yes | component type | persistentLookup |
fileURL | yes | path to the data file where lookup is stored | |
metadata | yes | metadata associated with the DataRecord stored in this lookup table. For internal table this is id of requested metadata, for external table this is file with metadata definition | |
key | yes | names of fields which comprise indexKey to lookup table. Specifies what object will be used for looking up data. | |
replace | no | option to replace existing entry (persistent lookup cannot store duplicate entries) | true |
commitInterval | no | commit interval in number of records (too high interval can invoke java.nio.BufferUnderflowException) | 100 |
disableTransactions | no | option to disable transaction (to increase performance at the cost of potential data loss). | false |
pageSize | no | number of entries per page (must be even) | 16 |
cacheSize | no | given maximum number of objects in cache | 1000 |
Xml definition:
<Global> ... <LookupTable fileURL="${PROJECT}/data-out/persistentLookup" id="LookupTable1" key="EmployeeID" metadata="Metadata0" name="lookupName" type="persistentLookup"/> </Global> <Node enabled="enabled" id="JOIN" type="REFORMAT"> <attr name="transform"><![CDATA[ import org.jetel.component.DataRecordTransform; import org.jetel.data.DataRecord; import org.jetel.data.RecordKey; import org.jetel.data.lookup.LookupTable; import org.jetel.exception.*; import org.jetel.graph.TransformationGraph; public class reformatTest extends DataRecordTransform{ LookupTable mylookup; public boolean init(){ try { mylookup = graph.getLookupTable("LookupTable1"); if (!mylookup.isInitialized()) { mylookup.init(); } } catch (ComponentNotReadyException e) { e.printStackTrace(); return false; } // create key String[] lookupKeyStr={"EmployeeID"}; RecordKey key=new RecordKey(lookupKeyStr,sourceMetadata[0]); key.init(); mylookup.setLookupKey(key); return true; } public int transform(DataRecord[] source, DataRecord[] target){ DataRecord employee; employee=mylookup.get(source[0]); if (employee==null) {// skip this one errorMessage = "Input:\n" + source[0] + " - There is no record with corresponding key in lookup table\n"; return SKIP; } target[0].getField(0).setValue(source[0].getField(0).getValue()); target[0].getField(1).setValue(source[0].getField(1).getValue()); target[0].getField(2).setValue(source[0].getField(2).getValue().toString()); target[0].getField(3).setValue(employee.getField(0).getValue().toString()); target[0].getField(4).setValue(employee.getField(1).getValue()); return ALL; } public void finished(){ mylookup.free(); } } ]]></attr> </Node>
COMMERCIAL, since version 2.6
An Aspell-like lookup table that enables you to look up data records with similar lookup key. The similarity of two words (keys) is measured in terms of edit operations, their cost and sum of these costs (see below). Such approach may be useful e. g. when looking for a street the name of which is misspelled to a certain extent.
All data records stored in the Aspell lookup table are kept in memory. Therefore, sufficient memory has to be available in order to accommodate all data records to be stored. If the data records are loaded from a data file, the size of available memory should be approximately 7× bigger than the size of the data file. However, this multiplier may vary for different types of data records stored in the data file.
Similarity of Words:
The edit operations and their default costs are as follows:
| Operation | Description | Cost |
|---|---|---|
| CASE | change the case of a single character | 10 |
| TRANSPOSE | transpose two adjacent characters | 90 |
| DELETE | delete a single character | 95 |
| INSERT | insert a single character | 95 |
| REPLACE | replace a single character | 100 |
The edit distance is defined as the minimum sum of costs of edit operations required to transform a misspelled word into a correctly spelled one. The edit distance between two words is actually the similarity of these two words.
Let's assume that the Aspell lookup table contains only the word “Hello”. Then the edit distance of the following queries would be as follows:
| Query | Edit Operations | Edit Distance |
|---|---|---|
| HELLO | CASE (E) + CASE (L) + CASE (L) + CASE (O) | 40 |
| helo | CASE (h) + INSERT (l) | 105 |
| Halllo | REPLACE (a → e) + REMOVE (l) | 195 |
| haLol | CASE (h) + REPLACE (a → e) + CASE (L) + TRANSPOSE (ol) | 210 |
XML Attributes:
| Attribute | Mandatory | Short Description | Default | ETL Version |
|---|---|---|---|---|
id | yes | The ID of the lookup table. | N/A | |
type | yes | The type of the lookup table (has to be aspellLookup). | N/A | |
name | no | The name of the lookup table. | null | |
metadata | yes | The ID of the meta data associated with the lookup table. | N/A | |
lookupKeyField | yes | The name of a string data field used as a key for lookup. | N/A | |
editDistanceField | no | The name of a numeric data field used to store edit distance. | null | since 2.8.1 |
spellingThreshold | no | The maximum allowed edit distance from a correctly spelled word. | 230 | |
editCosts | no | User-defined costs of edit operations. | null | since 2.9.0 |
removeDiacriticalMarks | no | Should the diacritical marks be removed from lookup keys? | false | |
includeBestGuesses | no | Should the “best guesses” be included in the result? | false | since 2.8.0 |
dataFileUrl | no | The URL of the data file containing the lookup table data. | null | |
dataFileCharset | no | The character set of the data file. | ISO-8859-1 |
The editDistanceField attribute may be set if the edit distance should be stored within the looked up data records. In such a case, performance of the lookup table may be reduced because data records have to be duplicated in order to store the edit distance. The editDistanceField has to be present within metadata referenced by the metadata attribute. It is advised to enable autofilling (with default_value) on this field in order to preserve data layout of data records stored in the lookup table data file.
The spellingThreshold attribute defines the maximum allowed edit distance from a correctly spelled word. The higher the threshold, the more mistakes can be present in incorrectly spelled words and vice versa. Words with the edit distance above the spellingThreshold are NOT contained in the query result, unless the includeBestGuesses attribute is set to true.
The editCosts attribute may be used to override the default costs of individual edit operations. User-defined edit costs are specified in the form <edit_operation_1> = <edit_cost_1>; …; <edit_operation_n> = <edit_cost_n> where <edit_operation_x> is the name of an edit operation and <edit_cost_x> is its non-negative cost.
The removeDiacriticalMarks attribute specifies whether or not the diacritical marks should be removed from the lookup key before any put/get operation. Setting this property to true should solve any potential problems you might encounter when using words with diacritical marks. However, this may lead to a problem because similar (but different) words might be indexed using a same key. This should not be an issue when similarity searching is performed.
The includeBestGuesses attribute specifies whether or not the “best guesses” (the closest matches when no suggestions were found) should be included in the result. Setting this flag to true results in returning more suggestions if the spelling threshold is too strict.
XML Definition:
<Global> ... <LookupTable dataFileCharset="UTF-8" dataFileUrl="${PROJECT}/lookup/streets.dat" id="LookupTable0" lookupKeyField="name" metadata="Metadata0" name="streets" type="aspellLookup"/> </Global> <Node charset="UTF-8" id="LOOKUP_JOIN0" joinKey="street" lookupTable="LookupTable0" type="LOOKUP_JOIN"> <attr name="transform"><![CDATA[//#TL // Transforms input record into output record. function transform() { $0.street := $0.street; $0.name := $1.name; $0.city := $1.city; } ]]></attr> </Node>