CF_REextract v1.4 July 2004 |
|---|
| DESCRIPTION |
|---|
If several occurrences are to be extracted, the process must be included inside a loop, and this makes the whole thing even a little more complex.
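As a rough illustration of the manual approach described above, here is a minimal sketch that uses ColdFusion's built-in REFind function inside a conditional loop. The sample text, the two expressions and every variable name are assumptions made for this illustration only; this is not the tag's own code.

```cfm
<!--- Hypothetical sample text and delimiters, for illustration only --->
<cfset sampleText = "a (one) b (two) c">
<cfset firstRE = "\(">
<cfset secondRE = "\)">
<cfset results = "">
<cfset searchFrom = 1>

<cfloop condition="searchFrom LTE Len(sampleText)">
    <!--- Look for the opening match, starting at the current position --->
    <cfset m1 = REFind(firstRE, sampleText, searchFrom, true)>
    <cfif m1.pos[1] EQ 0><cfbreak></cfif>
    <cfset contentStart = m1.pos[1] + m1.len[1]>

    <!--- Look for the closing match after the opening one --->
    <cfset m2 = REFind(secondRE, sampleText, contentStart, true)>
    <cfif m2.pos[1] EQ 0><cfbreak></cfif>

    <!--- Keep the non-empty text found between the two matches --->
    <cfif m2.pos[1] GT contentStart>
        <cfset results = ListAppend(results, Mid(sampleText, contentStart, m2.pos[1] - contentStart))>
    </cfif>

    <!--- Resume the search just after the closing match --->
    <cfset searchFrom = m2.pos[1] + m2.len[1]>
</cfloop>

<cfoutput>#results#</cfoutput>
```

With the sample text above, the loop should leave `one,two` in `results`. Even this small case needs two searches, position arithmetic and a loop guard.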
This tag can make things much easier for at least three main reasons. First, it accepts two separate strings to match and returns the string in between. Second, it is able to search the whole input text and return the results in a list or, better, a query. Finally, if the text to be searched is in a file on the server's disk, or in a page on the Internet, the tag will get it for you.
Note that a plain string is just a particular case of a regular expression. This tag can therefore also be used in simple cases where one just wants to retrieve a string embedded between two others. However, care must be taken to escape all regular expression control characters properly (see example 1 below).
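The following sketch shows the plain-string case with delimiters that contain regular expression control characters. The input text, the `[b]` and `[/b]` markers and the list name `boldList` are hypothetical; only the attributes documented in this page are used.

```cfm
<!--- Extract what sits between the literal strings "[b]" and "[/b]".
      The square brackets are control characters in a regular expression,
      so each one is escaped with a backslash. --->
<CF_REextract
    INPUTMODE  = "string"
    INPUT      = "Plain text, [b]bold text[/b], plain text again"
    RE1        = "\[b\]"
    RE2        = "\[/b\]"
    OUTPUTMODE = "list"
    OUTPUT     = "boldList"
>
<CFOUTPUT>#boldList#</CFOUTPUT>
```

Without the backslashes, `[b]` would be read as a character class matching a single "b", and the extraction would start in the wrong place.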
| NOTE |
|---|
| INSTALLATION |
|---|
The other files are this documentation (REextractDoc.cfm); just store them in any convenient place in the HTTP area on your development server.
| SYNTAX & TABLE OF CONTENTS |
|---|
<CF_REextract
[INPUTMODE] = "http|file|variable|string|embed"
INPUT = ""
RE1 = "<first regular expression>"
RE2 = "<second regular expression>"
[OUTPUTMODE] = "output|query|queryappend|list"
OUTPUT = "<query or list name>"
[DELIMITER] = "<delimiter for the list>"
[EXTRACT] = "first|last|all"
[CASESENSITIVE] = "yes|no"
[TIMEOUT] = "<timeout in seconds>"
[INCLUDE1] = "yes|no"
[INCLUDE2] = "yes|no"
[RECYCLE] = "yes|no"
[NODUPLICATES] = "yes|no"
[OFFSET] = "<offset to add to positions>"
[CATEGORY] = "optional text"
[RESOLVEURL] = "yes|no" (New!)
[EMPTYELEMENT] = "optional text" (New!)
>
Examples (New!)
|
| TAG | ATTRIBUTE | CONTENT | REQUIRED | DEFAULT |
|---|---|---|---|---|
| <CF_REextract | INPUTMODE= | Specifies how the text to analyse is transmitted to the tag. This attribute may take five values:<br>STRING: the text is passed directly in the INPUT attribute.<br>VARIABLE: the tag gets the text from a caller's variable. The name of the variable must be given in the INPUT attribute.<br>FILE: the tag gets the text from a file on the server. The INPUT attribute gives the full path of the file.<br>HTTP: CF_REextract gets the text from the Internet. Just give the HTTP address of the page in the INPUT attribute.<br>EMBED: CF_REextract gets the text found between its opening and closing tags. This is the only case in which a closing tag should be used. | No | "string" |
| | INPUT | Information for getting the text to search, depending on the option in INPUTMODE. | Yes | |
| | RE1= | Regular expression used to find the beginning of the string to extract. | Yes | |
| | RE2= | Regular expression used to find the end of the string to extract. | Yes | |
| | OUTPUTMODE= | Specifies what to do with the extracted strings:<br>OUTPUT: the tag just outputs the extracted text; the OUTPUT attribute may contain a string that will be prepended to the output string.<br>QUERY: all data about the results is returned in a query with nine columns.<br>QUERYAPPEND: same as above, except that new rows are added to a query created by a previous execution of the tag. This option is particularly helpful for developing parsers. It makes it possible to gather in the same query different parts of text extracted with different criteria and distinct regular expressions.<br>LIST: the tag returns the extracted strings in a list. The DELIMITER attribute may be used to specify something other than the default comma. To use this option, one must be certain that the delimiter cannot occur in the extracted strings. If unsure, use the query output mode instead. | No | "output" |
| | OUTPUT= | Additional information, depending on the OUTPUTMODE selected. | No | "" |
| | DELIMITER= | This attribute may be used only with OUTPUTMODE="list", in case a delimiter other than the default comma should be used. | No | "," |
| | EXTRACT= | Specifies the type of extraction to be made (first, last or all). | No | "all" |
| | CASESENSITIVE= | If this attribute is present, the searches will be case sensitive. | No | "no" |
| | TIMEOUT= | Use this parameter only with INPUTMODE="http", to limit the time the tag will wait for the page to come in. The value is in seconds. | No | "60" |
| | INCLUDE1 | Specifies that the string matched by RE1 should be added at the beginning of the returned string. Useful if the easiest way to define the first match involves the beginning of the desired string (e.g. "http://" when searching for addresses; but you would not use this option when looking for email addresses: "mailto:"). | No | "no" |
| | INCLUDE2 | Specifies that the string matched by RE2 should be added at the end of the returned string. | No | "no" |
| | RECYCLE (New in version 1.2) | Use this attribute to have the tag continue the analysis at the beginning of the second match (RE2) instead of after it. This can be useful to parse texts where no "end of field" is used. | No | "no" |
| | NODUPLICATES (New in version 1.3) | Prevents duplicates in the results. | No | "no" |
| | OFFSET (New in version 1.3) | This attribute may be used when the same text is parsed in several passes. The numerical value given in the offset is added to the three position values returned (pos1, pos2, pos3). This way, the positions can be kept relative to the original text, even when the tag is called for a second pass on a substring.<br>Suppose for instance that a first pass returns some string2 at position pos2. The parser then submits string2 for some more detailed analysis. All positions returned will be relative to string2. If you supply the value of pos2 in the OFFSET attribute in the second call, then all positions will be relative to the beginning of the original string. | No | "0" |
| | CATEGORY (New in version 1.3) | This attribute may contain text for an extra column CATEGORY in the query returned. This feature may be used in conjunction with OUTPUTMODE="QueryAppend". Each time the tag is called, one may supply some information such as the type of data retrieved in this particular pass. This information will then be available when looping on the query. | No | "" |
| | RESOLVEURL (New in version 1.4) | This attribute is passed to the CFHTTP tag when the HTTP input mode is selected. In previous versions this value was always "yes"; it is still "yes" by default, but it can now be set to "no". | No | "yes" |
| | EMPTYELEMENT (New in version 1.4) | When the output mode is set to "list", the value of this attribute is appended to the list for every empty element found by REextract. Otherwise an empty element is appended, but ColdFusion ignores empty list elements. If every element counts in an application, one may have a single space appended, for instance, instead of nothing. | No | "" |
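The two-pass use of OFFSET described in the table can be sketched as follows. The variable `pageText`, its content and the query names `firstPass` and `secondPass` are hypothetical; the columns `string2` and `pos2` are the ones named in this documentation. Whether the reported positions need a further adjustment of one character has not been checked here.

```cfm
<!--- Hypothetical text to parse --->
<CFSET pageText = '<p>See <a href="one.cfm">one</a> and <a href="two.cfm">two</a>.</p>'>

<!--- First pass: collect the inside of every <a ...> tag in a query --->
<CF_REextract
    INPUTMODE  = "variable"
    INPUT      = "pageText"
    RE1        = "<a[[:space:]]+"
    RE2        = ">"
    OUTPUTMODE = "query"
    OUTPUT     = "firstPass"
>

<!--- Second pass: analyse each string2 in more detail.
      OFFSET receives pos2 from the first pass, so that the positions
      returned stay relative to pageText rather than to string2. --->
<CFLOOP query="firstPass">
    <CF_REextract
        INPUT      = "#firstPass.string2#"
        RE1        = '"'
        RE2        = '"'
        OUTPUTMODE = "query"
        OUTPUT     = "secondPass"
        OFFSET     = "#firstPass.pos2#"
    >
    <CFOUTPUT query="secondPass">#secondPass.pos2#: #secondPass.string2#<BR></CFOUTPUT>
</CFLOOP>
```

Without OFFSET, the positions in `secondPass` would be counted from the start of each `string2`, which makes them hard to map back to the original text.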
| EXAMPLES |
|---|
Output all text between parentheses in some text:
<CF_REextract INPUTMODE="embed"
RE1="\("
RE2="\)"
OUTPUT="<BR>"
INCLUDE1
INCLUDE2>
Telephone support is available Monday through Friday,
8 A.M. to 8 P.M. Eastern time (except holidays)
Toll Free: 888.939.2545 (U.S. and Canada)
Tel: 617.219.2100 (outside U.S. and Canada)
</CF_REextract>
Result:
(except holidays)
(U.S. and Canada)
(outside U.S. and Canada)
Extract a list of all words of at least 4 characters from a text, with no duplicates:
<CF_REextract INPUTMODE="embed"
OUTPUTMODE="list"
OUTPUT="wordList"
RE1="[a-zA-Z]{4,4}" INCLUDE1
RE2="[^a-zA-Z]+"
NODUPLICATES>
Telephone support is available Monday through Friday,
8 A.M. to 8 P.M. Eastern time (except holidays)
Toll Free: 888.939.2545 (U.S. and Canada)
Tel: 617.219.2100 (outside U.S. and Canada)
</CF_REextract>
Extract all headers from this document:
<CFSET thisFile=GetCurrentTemplatePath()>
<CF_REextract INPUTMODE="file"
INPUT="#thisFile#"
RE1="<h[[:digit:]]>"
RE2="</h[[:digit:]]>"
OUTPUT="<BR>"
>
Result:
CF_REextract v1.4 July 2004 Examples New!
Suppose we want to get all addresses of pages accessible from the Macromedia home page (or any other page).
We first get all <A, <AREA, <FRAME, <IFRAME and <FORM tags in a query:
<CF_REextract INPUTMODE="HTTP"
INPUT="#address#"
OUTPUTMODE="query"
RE1="<a|<area|<frame|<iframe|<form[[:space:]]*"
RE2=">"
>
Then we loop on it and analyse what is in column string2:
<CFOUTPUT>records found: #REextract.recordcount#</CFOUTPUT>
<CFLOOP query="REextract">
<CF_REextract INPUT="#string2#"
OUTPUTMODE="output"
OUTPUT="<BR>"
RE1='(href|src|action)[[:space:]]*=[[:space:]]*"'
RE2='"'
>
</CFLOOP>
SQL Parser example
The new option OUTPUTMODE="queryappend" is particularly handy for developing parsers. Here is a simple example which constitutes a first step towards compiling an SQL query. This part recognizes a SELECT query and returns its elements in the same query (note that this SQL command does not really make sense; its purpose is only to serve as a syntactic example).
<CFSET sql = "
SELECT DISTINCT ClientName,
Address1,
Address2,
City,
State,
ZipCode,
Phone,
Fax,
Browser,
Food,
NumberOfTables,
DescriptionShort,
DescriptionLong,
Image
FROM Clients
WHERE Client_ID < 1000
GROUP BY Food
HAVING NumberOfTables > 0
ORDER BY State,zipcode
">
<!--- Try for SELECT --->
<CF_REextract
INPUT = "#sql#"
RE1 = "SELECT[[:space:]]+((DISTINCT|ALL)[[:space:]]+){0,1}"
RE2 = "$|FROM"
OUTPUTMODE = "query"
OUTPUT = "SQLparser"
CATEGORY = "SELECT"
>
<CFIF SQLparser.recordCount GT 0>
<!--- SELECT Query found --->
<CF_REextract
INPUT = "#sql#"
RE1 = "FROM[[:space:]]+"
RE2 = "$|WHERE|GROUP|HAVING|ORDER[[:space:]]+BY"
OUTPUTMODE = "queryappend"
OUTPUT = "SQLparser"
CATEGORY = "FROM"
>
<CF_REextract
INPUT = "#sql#"
RE1 = "WHERE[[:space:]]+"
RE2 = "$|GROUP|HAVING|ORDER[[:space:]]+BY"
OUTPUTMODE = "queryappend"
OUTPUT = "SQLparser"
CATEGORY = "WHERE"
>
<CF_REextract
INPUT = "#sql#"
RE1 = "GROUP[[:space:]]+BY[[:space:]]+"
RE2 = "$|HAVING|ORDER[[:space:]]+BY"
OUTPUTMODE = "queryappend"
OUTPUT = "SQLparser"
CATEGORY = "GROUP"
>
<CF_REextract
INPUT = "#sql#"
RE1 = "HAVING[[:space:]]+"
RE2 = "$|ORDER[[:space:]]+BY"
OUTPUTMODE = "queryappend"
OUTPUT = "SQLparser"
CATEGORY = "HAVING"
>
<CF_REextract
INPUT = "#sql#"
RE1 = "ORDER[[:space:]]+BY[[:space:]]+"
RE2 = "$|ASC|DESC"
OUTPUTMODE = "queryappend"
OUTPUT = "SQLparser"
CATEGORY = "ORDER"
>
<CFELSE>
......
Result:
|