CF_REextract

v1.4

July 2004

© Claude Schnéegans

DESCRIPTION

Regular expressions are at least as difficult to use as they are powerful. Most of the time, they are used to extract something from a string, a text file, or a page somewhere on the Internet. This tag will make everything much easier for you. Regular expressions are able to return parts of the text searched, but a single expression must then define, all at once, the pattern that recognizes the beginning of the targeted part of the text, the part itself that the expression should pick up, and the pattern that recognizes its end. This is sometimes a very difficult task, and fine-tuning it may be very time consuming.

If several occurrences are to be extracted, the process must be included inside a loop, and this makes the whole thing even a little more complex.

This tag can make things much easier for at least three main reasons. First, it accepts two separate strings to match and returns the string in between. Secondly, it is able to search the whole input text and return the results in a list or, better, a query. Finally, if the text to be searched is in some file on the server's disk, or in some page on the Internet, the tag will get it for you.

Note that a plain string is just a particular case of a regular expression. This tag can therefore also be used in simple cases where one just wants to retrieve a string embedded between two others. However, care must be taken to escape all regular expression control characters properly (see example 1 below).
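As an illustration of the plain-string case, here is a minimal sketch that relies only on the attributes documented below. The variable name "template", the list name "placeholders" and the [[ ]] placeholder convention are assumptions made up for this example.

```cfm
<!--- Hypothetical input: a text with placeholders wrapped in [[ and ]] --->
<CFSET template = "Dear [[FirstName]], your order [[OrderID]] has shipped.">

<!--- "[" and "]" are regular expression control characters,
      so each one is escaped with a backslash --->
<CF_REextract
	INPUTMODE="variable"
	INPUT="template"
	RE1="\[\["
	RE2="\]\]"
	OUTPUTMODE="list"
	OUTPUT="placeholders">

<CFOUTPUT>#placeholders#</CFOUTPUT>
```

If the tag behaves as documented, the list should hold the two placeholder names. Without the backslashes, the brackets would be read as the start of a character class rather than as literal text.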

NOTE

If you find this custom tag useful, or if you have any suggestion to make it even more useful, just let me know, I'll be glad to enhance it.

INSTALLATION

In order to use this Custom Tag, just store the file REextract.cfm found in the zip file in the special Custom Tags directory of your ColdFusion server. This directory is generally named \Cfusion\CustomTags. You may also store the file in a directory defined as a path for custom tags in the ColdFusion Administrator (CF 5.0+) or in the same directory as the calling template.

The other files make up this documentation (REextractDoc.cfm); just store them in any convenient place in the HTTP area of your development server.

SYNTAX & TABLE OF CONTENTS
<CF_REextract 
	[INPUTMODE] 	= "http|file|variable|string|embed"
	INPUT 		= ""
	RE1 		= "<first regular expression>"
	RE2 		= "<second regular expression>"
	[OUTPUTMODE] 	= "output|query|queryappend|list"
	OUTPUT 		= "<query or list name>"
	[DELIMITER]	= "<delimiter for the list>"
	[EXTRACT] 	= "first|last|all"
	[CASESENSITIVE] 	= "yes|no"
	[TIMEOUT] 	= "<timeout in seconds>"
	[INCLUDE1] 	= "yes|no"
	[INCLUDE2] 	= "yes|no"
	[RECYCLE] 	= "yes|no"
	[NODUPLICATES] 	= "yes|no"
	[OFFSET] 	= "<Offset to add to positions>"
	[CATEGORY] 	= "optional text"
	[RESOLVEURL] 	= "yes|no" (New!)
	[EMPTYELEMENT] 	= "optional text" (New!)
	
	>

Examples

New!    your own text on line

TAG ATTRIBUTES
Each entry gives the attribute, its content, whether it is required, and its default value.

INPUTMODE
Specifies how the text to analyse is transmitted to the tag. This attribute may take five values:
  • STRING: the text is passed directly in the INPUT attribute.
  • VARIABLE: the tag gets the text from a variable of the caller. The name of the variable must be given in the INPUT attribute.
  • FILE: the tag gets the text from a file on the server. The INPUT attribute gives the full path of the file.
  • HTTP: CF_REextract gets the text from the Internet. Just give the HTTP address of the page in the INPUT attribute.
  • EMBED: CF_REextract gets the text found between its opening and closing tags. This is the only case where a closing tag should be used.
Required: No. Default: "string"

INPUT
Information for getting the text to search, depending on the option in INPUTMODE.
Required: Yes.

RE1
Regular expression used to find the beginning of the string to extract.
Required: Yes.

RE2
Regular expression used to find the end of the string to extract.
Required: Yes.

OUTPUTMODE
Specifies what to do with the strings extracted:
  • OUTPUT: the tag just outputs the text extracted; the attribute OUTPUT may contain a string that will be prepended to each output string.
  • QUERY: all data about the results is returned in a query with nine columns:
      • string1: the string matched by RE1;
      • string2: the string extracted between the occurrences of RE1 and RE2;
      • string3: the string matched by RE2;
      • pos1: start position of string1 in the input text;
      • pos2: start position of string2 in the input text;
      • pos3: start position of string3 in the input text;
      • len1: length of string1;
      • len2: length of string2;
      • len3: length of string3.
    This option is the most powerful way of searching text, since it returns all information about the results in a convenient query.
  • QUERYAPPEND: same as QUERY, except that new rows are added to a query created by a previous execution of the tag. This option is particularly helpful for developing parsers: it gathers in a single query different parts of text extracted with different criteria and distinct regular expressions.
  • LIST: the tag returns the extracted strings in a list. The attribute DELIMITER may be used to specify something other than the default comma. To use this option, one must be certain that the delimiter cannot be found in the extracted strings. If not sure, better use the query output mode.
Required: No. Default: "output"

OUTPUT
Additional information, depending on the OUTPUTMODE defined:
  • For OUTPUTMODE="output", the OUTPUT attribute may contain a string (e.g. "<BR>") that will precede every string output;
  • For OUTPUTMODE="query", the OUTPUT attribute may contain a name for the query. The default value is "REextract".
  • For OUTPUTMODE="list", the OUTPUT attribute may contain a name for the list. The default value is "REextract".
Required: No. Default: ""

DELIMITER
This attribute may be used only with OUTPUTMODE="list", in case a delimiter other than the default comma should be used.
Required: No. Default: ","

EXTRACT
Specifies the type of extraction to be made:
  • first: the tag stops after the first extraction;
  • last: only the string between the last occurrences is extracted;
  • all: the tag returns all strings extracted.
Required: No. Default: "all"

CASESENSITIVE
If this attribute is present, searches are case sensitive.
Required: No. Default: "no"

TIMEOUT
Use this parameter only with INPUTMODE="http", to limit the time the tag will wait for the page to come in. The value is in seconds.
Required: No. Default: "60"

INCLUDE1
Specifies that the string matched by RE1 should be added at the beginning of the returned string. Useful when the easiest way to define the first match involves the beginning of the desired string (e.g. "http://" when searching for addresses; you would not use this option with "mailto:" when looking for email addresses).
Required: No. Default: "no"

INCLUDE2
Specifies that the string matched by RE2 should be added at the end of the returned string.
Required: No. Default: "no"

RECYCLE
(New in version 1.2)
Use this attribute to have the tag continue its analysis at the beginning of the string matched by RE2 instead of after it. This can be useful to parse texts where no "end of field" marker is used.
Required: No. Default: "no"

NODUPLICATES
(New in version 1.3)
Prevents duplicates in results.
Required: No. Default: "no"

OFFSET
(New in version 1.3)
This attribute may be used when the same text is parsed in several passes. The numerical value given in the offset is added to the three position values returned (pos1, pos2, pos3). This way, the positions can be kept relative to the original text, even when the tag is called for a second pass on a substring.
Suppose for instance that a first pass returns some string2 at position pos2. Then the parser submits string2 for some more detailed analysis. All positions returned will be relative to string2. If you supply the value of pos2 in the attribute OFFSET in the second call, then all positions will be relative to the beginning of the original string.
Required: No. Default: "0"

CATEGORY
(New in version 1.3)
This attribute may contain text for an extra column CATEGORY in the query returned. This feature may be used in conjunction with OUTPUTMODE="QueryAppend". Each time the tag is called, one may supply some information such as the type of data retrieved by this particular pass. This information is then available when looping on the query.
Required: No. Default: ""

RESOLVEURL
(New in version 1.4)
This attribute is transmitted to the CFHTTP tag when the HTTP input mode is selected. In previous versions this value was always "yes"; it is still "yes" by default, but it can now be set to "no".
Required: No. Default: "yes"

EMPTYELEMENT
(New in version 1.4)
When the output mode is "list", the value of this attribute is appended to the list for every empty element found by REextract. Otherwise an empty element is appended, but we all know that empty elements are ignored by ColdFusion. So, if every element counts in an application, one may have at least one space appended, for instance, instead of nothing.
Required: No. Default: ""
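The two-pass use of OFFSET described above can be sketched as follows. This is only an illustration built from the documented attributes: the sample text and the query names "pass1" and "pass2" are assumptions, and whether a one-position adjustment of the offset is needed has not been checked here.

```cfm
<!--- Hypothetical sample text --->
<CFSET text = "Contact us (phone: 555-0100; fax: 555-0199) for details.">

<!--- First pass: locate the parenthesised part;
      pos2 is its start position in the original text --->
<CF_REextract
	INPUTMODE="variable"
	INPUT="text"
	RE1="\("
	RE2="\)"
	EXTRACT="first"
	OUTPUTMODE="query"
	OUTPUT="pass1">

<CFIF pass1.recordCount GT 0>
	<!--- Second pass on string2 only; OFFSET is added to pos1, pos2 and pos3
	      so that they refer to the original text rather than to string2 --->
	<CF_REextract
		INPUTMODE="string"
		INPUT="#pass1.string2#"
		RE1="fax:[[:space:]]*"
		RE2=";|$"
		OFFSET="#pass1.pos2#"
		OUTPUTMODE="query"
		OUTPUT="pass2">

	<CFOUTPUT query="pass2">
		#string2# found at position #pos2#<BR>
	</CFOUTPUT>
</CFIF>
```

Passing the first-pass pos2 rather than a hard-coded number keeps the second call valid whatever precedes the parenthesis in the input.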

EXAMPLES

Output all text between parentheses in some text:
<CF_REextract INPUTMODE="embed"
 RE1="\(" 
 RE2="\)"
 OUTPUT="<BR>"
 INCLUDE1
 INCLUDE2>
Telephone support is available Monday through Friday, 
8 A.M. to 8 P.M. Eastern time (except holidays) 
Toll Free: 888.939.2545 (U.S. and Canada) 
Tel: 617.219.2100 (outside U.S. and Canada)
</CF_REextract>
Result:
(except holidays)
(U.S. and Canada)
(outside U.S. and Canada)

Extract a list of all words of at least 4 characters from a text, with no duplicates:

<CF_REextract INPUTMODE="embed"
	OUTPUTMODE="list"
	OUTPUT="wordList"
	RE1="[a-zA-Z]{4,4}" INCLUDE1
	RE2="[^a-zA-Z]+"
	NODUPLICATES>
Telephone support is available Monday through Friday, 
8 A.M. to 8 P.M. Eastern time (except holidays) 
Toll Free: 888.939.2545 (U.S. and Canada) 
Tel: 617.219.2100 (outside U.S. and Canada)
</CF_REextract>
Result: Telephone,support,available,Monday,through,Friday,Eastern,time,except,holidays,Toll,Free,Canada,outside
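
A related sketch for list output, in case the extracted strings may contain commas or may be empty. It combines the documented DELIMITER and EMPTYELEMENT attributes; the input line and the names "record" and "fieldValues" are assumptions made up for this illustration.

```cfm
<!--- Hypothetical input: key=value pairs, one of them with an empty value --->
<CFSET record = "name=Smith, Ann;city=;zip=02139;">

<!--- A pipe is used as delimiter because values may contain commas;
      a single space stands in for empty values so that they are not
      dropped when the list is processed --->
<CF_REextract
	INPUTMODE="variable"
	INPUT="record"
	RE1="="
	RE2=";"
	OUTPUTMODE="list"
	OUTPUT="fieldValues"
	DELIMITER="|"
	EMPTYELEMENT=" ">

<CFOUTPUT>
	#ListLen(fieldValues, "|")# values found<BR>
	<CFLOOP list="#fieldValues#" index="value" delimiters="|">
		[#value#]<BR>
	</CFLOOP>
</CFOUTPUT>
```

The same delimiter has to be passed to ListLen and CFLOOP, since ColdFusion list functions default to the comma.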

Extract all headers from this document:

<CFSET thisFile=GetCurrentTemplatePath()>
<CF_REextract INPUTMODE="file"
	INPUT="#thisFile#"
	RE1="<h[[:digit:]]>"
	RE2="</h[[:digit:]]>"
	OUTPUT="<BR>"
	>
Result:
CF_REextract
v1.4
July 2004
Examples
New!    your own text on line

Suppose we want to get the addresses of all pages accessible from the Macromedia home page (or any other page).
Doing this with just one regular expression would be rather complicated, maybe even impossible. It is relatively easy to recognize <A and <AREA tags, but some may not contain any HREF attribute, and for the others we don't know where in the tag the HREF attribute is. So we had better do it in two steps:
1. get all <A, <AREA, <FRAME, <IFRAME and <FORM tags;
2. analyse their content and extract the value of their HREF, SRC or ACTION attribute.
Doing this directly with the regular expression functions is not a five-minute job, but with CF_REextract the task is fairly simple.

We first get all <A, <AREA, <FRAME, <IFRAME and <FORM tags in a query:

<CF_REextract
	INPUTMODE="HTTP"
	INPUT="#address#"
	OUTPUTMODE="query"
	RE1="<(a|area|frame|iframe|form)[[:space:]]+"
	RE2=">"
	>

Then we loop on it and analyse what is in column string2:

<CFOUTPUT>record found: #REextract.recordcount#</CFOUTPUT>
<CFLOOP query="REextract">
<CF_REextract
	INPUT="#string2#"
	OUTPUTMODE="output"
	OUTPUT="<BR>"
	RE1='(href|src|action)[[:space:]]*=[[:space:]]*"'
	RE2='"'
	>
</CFLOOP>
(Allow for the time to download the page; your server must be able to go on the Internet to run this example.)
Note: CF_REextract calls the CFHTTP tag with the attribute RESOLVEURL set to "yes". However, it appears that this functionality does not work on HREF found in AREAs, so these addresses are not fully resolved in our results.

SQL Parser example

The new option OUTPUTMODE="queryappend" is particularly handy for developing parsers. Here is a simple example which constitutes a first step for compiling an SQL query. This part recognizes a SELECT query and returns its elements in a single query (note that this SQL command does not really make sense; its purpose is only to serve as a syntactic example).

<CFSET sql = "
  SELECT   DISTINCT     ClientName,
    Address1,
    Address2,
    City,
    State,
    ZipCode,
    Phone,
    Fax,
    Browser,
    Food,
    NumberOfTables,
    DescriptionShort,
    DescriptionLong,
    Image
  FROM  Clients
  WHERE Client_ID < 1000
	GROUP   BY  Food
	HAVING NumberOfTables > 0
	ORDER BY State,zipcode
	">
<!--- Try for SELECT --->
<CF_REextract 
	INPUT 		= "#sql#"
	RE1 		= "SELECT[[:space:]]+((DISTINCT|ALL)[[:space:]]+){0,1}"
	RE2 		= "$|FROM"
	OUTPUTMODE 	= "query"
	OUTPUT = "SQLparser"
	CATEGORY = "SELECT"
	>
<CFIF SQLparser.recordCount GT 0>
	<!--- SELECT Query found --->
	<CF_REextract 
		INPUT 		= "#sql#"
		RE1 		= "FROM[[:space:]]+"
		RE2 		= "$|WHERE|GROUP|HAVING|ORDER[[:space:]]+BY"
		OUTPUTMODE 	= "queryappend"
		OUTPUT = "SQLparser"
	CATEGORY = "FROM"
		>
	<CF_REextract 
		INPUT 		= "#sql#"
		RE1 		= "WHERE[[:space:]]+"
		RE2 		= "$|GROUP|HAVING|ORDER[[:space:]]+BY"
		OUTPUTMODE 	= "queryappend"
		OUTPUT = "SQLparser"
		CATEGORY = "WHERE"
		>
	<CF_REextract 
		INPUT 		= "#sql#"
		RE1 		= "GROUP[[:space:]]+BY[[:space:]]+"
		RE2 		= "$|HAVING|ORDER[[:space:]]+BY"
		OUTPUTMODE 	= "queryappend"
		OUTPUT = "SQLparser"
		CATEGORY = "GROUP"
		>
	<CF_REextract 
		INPUT 		= "#sql#"
		RE1 		= "HAVING[[:space:]]+"
		RE2 		= "$|ORDER[[:space:]]+BY"
		OUTPUTMODE 	= "queryappend"
		OUTPUT = "SQLparser"
		CATEGORY = "HAVING"
		>
	<CF_REextract 
		INPUT 		= "#sql#"
		RE1 		= "ORDER[[:space:]]+BY[[:space:]]+"
		RE2 		= "$|ASC|DESC"
		OUTPUTMODE 	= "queryappend"
		OUTPUT = "SQLparser"
		CATEGORY = "ORDER"
		>
<CFELSE>
......

Result:
string1 | string2 | string3 | pos1 | pos2 | pos3 | len1 | len2 | len3 | Category
SELECT DISTINCT | ClientName, Address1, Address2, City, State, ZipCode, Phone, Fax, Browser, Food, NumberOfTables, DescriptionShort, DescriptionLong, Image | FROM | 4 | 26 | 218 | 22 | 192 | 4 | SELECT
FROM | Clients | WHERE | 218 | 224 | 234 | 6 | 10 | 5 | FROM
WHERE | Client_ID < 1000 | GROUP | 234 | 240 | 258 | 6 | 18 | 5 | WHERE
GROUP BY | Food | HAVING | 258 | 270 | 276 | 12 | 6 | 6 | GROUP
HAVING | NumberOfTables > 0 | ORDER BY | 276 | 283 | 303 | 7 | 20 | 8 | HAVING
ORDER BY | State,zipcode |  | 303 | 312 | 327 | 9 | 15 | 0 | ORDER
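
Once the passes above have filled the query, the clauses can be read back by looping on it. The following is a minimal sketch that assumes the query name "SQLparser" used above and the documented columns Category and string2; Trim and HTMLEditFormat are standard ColdFusion functions.

```cfm
<!--- One row per clause found; Category holds the text given in CATEGORY --->
<CFOUTPUT query="SQLparser">
	<!--- Trim removes the line breaks and tabs kept from the original SQL text;
	      HTMLEditFormat keeps characters such as "<" from being read as HTML --->
	<B>#Category#</B>: #HTMLEditFormat(Trim(string2))#<BR>
</CFOUTPUT>
```

A parser built this way could branch on the Category column to decide how each clause should be analysed further.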
