Monday, October 15, 2007

Unzip with UTF Support in ColdFusion Courtesy of Ant via cfant

Update: Function was updated to v0.2 (see Unzip with UTF Support in ColdFusion - Function Updates for v0.2)

A fellow by the name of Ricardo Parente contacted me today about unzipping a zip file that contains files with special characters in their names (to be more technical, the file names were UTF-8 encoded). Some file names contained accents and/or other characters that would cause issues. He was trying to extract the zip file programmatically using the zip component from Alagad, but that was throwing an exception. After looking around for a bit, it turned out there is already a user-defined function on cflib called unzipFile. However, it seems to suffer from the same problem as the Alagad component when extracting files with UTF-8 encoded names. After some more searching, I found that the encoding problem is an issue in Java itself and has not been fixed for 6 years!! Bah! Anyway, it turns out Ant handles such archives with no problem, and fortunately ColdFusion MX 7 comes with the undocumented tag "cfant". So here is a custom function to unzip those pesky zip files containing UTF-encoded file names. Creatively, I've named it "unzipWithUTFSupport". You can call it as simply as:
<cfset unzipResults = unzipWithUTFSupport("myUTFEncodedZipFile.zip") />
or like so to specify the destination directory:
<cfset unzipResults = unzipWithUTFSupport(
	zipFile = "myUTFEncodedZipFile.zip",
	destination = "yourRelativeDestination"
) />
This will return a structure with the following keys:
  • antMessage = string with the message Ant returned
  • argumentErrors = array of argument validation errors
  • fileList = query of files unzipped
  • success = boolean indicating the success or the failure of the unzip process
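To make the return structure a bit more concrete, here is a minimal sketch of how a caller might inspect it. It assumes the function listed at the end of this post is already defined or included on the page and that the zip file exists; the variable names "i" and "extractedFiles" are just for illustration.

```cfml
<!--- Assumption: unzipWithUTFSupport (listed at the end of this post) is
      already defined or included on this page, and the zip file exists. --->
<cfset unzipResults = unzipWithUTFSupport("myUTFEncodedZipFile.zip") />

<cfif arraylen(unzipResults.argumentErrors)>
	<!--- Argument validation failed, so nothing was extracted --->
	<cfloop from="1" to="#arraylen(unzipResults.argumentErrors)#" index="i">
		<cfoutput>#htmleditformat(unzipResults.argumentErrors[i])#<br /></cfoutput>
	</cfloop>
<cfelseif unzipResults.success>
	<!--- fileList is a cfdirectory query; "name" and "type" are two of its columns --->
	<cfset extractedFiles = unzipResults.fileList />
	<cfoutput query="extractedFiles">
		#htmleditformat(name)# (#type#)<br />
	</cfoutput>
<cfelse>
	<!--- The Ant run did not report success, so show what Ant said --->
	<cfoutput><pre>#htmleditformat(unzipResults.antMessage)#</pre></cfoutput>
</cfif>
```

Note that "fileList" is only a query when "success" is true, so it is worth checking "success" before looping over it.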
Now, for the more advanced functionality. You can have the function unzip different file types to different subdirectories. For example, ".jpg" files can go in an "images" subdirectory, while ".xml" files can go in a "configuration" subdirectory. To take advantage of this, pass the function a structure in the argument named "extractLocationsByFileType". That structure defines which file extensions will be extracted where. Here is a sample configuration that will extract only ".jpg" files to an "images" subdirectory:
<cfset locationsByFileType = structnew() />
<cfset locationsByFileType.jpg = structnew() />
<cfset locationsByFileType.jpg.destination = "images" />
Once you configure the "locationsByFileType" structure, you just pass it to the function like so:
<cfset results = unzipWithUTFSupport(
	zipFile = "myUTFEncodedZipFile.zip",
	extractLocationsByFileType = locationsByFileType
) />
The important thing to keep in mind is that, if you use this functionality, you have to define the configuration for each file type you want extracted. This means that if you use the configuration above, only ".jpg" files will be extracted from the zip archive. In this mode the function returns a structure with the following keys:
  • antMessage = string with the message Ant returned
  • argumentErrors = array of argument validation errors
  • fileList = a structure of queries with the unzipped files under each file extension
  • success = boolean indicating the success or the failure of the unzip process
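As a rough sketch of the multi-type case, the example below configures both ".jpg" and ".xml" files and then walks the resulting structure of queries. It assumes the function is defined or included on the page and that the archive actually contains files of both types; "extension" and "filesOfType" are illustrative names only.

```cfml
<!--- Assumption: unzipWithUTFSupport is defined or included on this page,
      and the archive contains both .jpg and .xml files. --->
<cfset locationsByFileType = structnew() />
<cfset locationsByFileType.jpg = structnew() />
<cfset locationsByFileType.jpg.destination = "images" />
<cfset locationsByFileType.xml = structnew() />
<cfset locationsByFileType.xml.destination = "configuration" />

<cfset results = unzipWithUTFSupport(
	zipFile = "myUTFEncodedZipFile.zip",
	extractLocationsByFileType = locationsByFileType
) />

<cfif results.success>
	<!--- fileList has one key per configured extension, each holding a cfdirectory query --->
	<cfloop collection="#results.fileList#" item="extension">
		<cfset filesOfType = results.fileList[extension] />
		<cfoutput>#htmleditformat(extension)#: #filesOfType.recordcount# file(s)<br /></cfoutput>
	</cfloop>
</cfif>
```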
There are a couple of optional parameters you can pass to the function. If you are not using the advanced feature defined above, you can also pass two more values to the function:
  • specialCharsMatchRegEx = string with the regular expression that will match the special characters in the file names (by default "[^A-Za-z0-9\._\-]")
  • replaceSpecialChars = boolean indicating if special characters in the file names should be replaced (by default true)
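To see what the default pattern actually does to a file name, here is a small standalone sketch that applies the same "rereplace" call the function uses when renaming extracted files. The sample name is made up for illustration and is built with chr() so the snippet does not depend on the encoding of the template file.

```cfml
<!--- Hypothetical sample name: "r", e-acute, "sum", e-acute, " (1).jpg" --->
<cfset sampleName = "r" & chr(233) & "sum" & chr(233) & " (1).jpg" />

<!--- Same pattern as the default value of specialCharsMatchRegEx --->
<cfset specialCharsMatchRegEx = "[^A-Za-z0-9\._\-]" />

<!--- Every character outside letters, digits, ".", "_" and "-" becomes "_" --->
<cfset cleanedName = rereplace(sampleName, specialCharsMatchRegEx, "_", "all") />

<cfoutput>#htmleditformat(cleanedName)#</cfoutput>
```

With the default pattern, each accented character, the space and both parentheses are replaced one for one, so the sample name should come out as "r_sum___1_.jpg". If you would rather keep the original names, pass replaceSpecialChars = false.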
If you are using the structure that defines specific file types to be extracted, then you can specify the "replaceSpecialChars" and "specialCharsMatchRegEx" settings for each file type like so:
<cfset locationsByFileType = structnew() />
<cfset locationsByFileType.jpg = structnew() />
<cfset locationsByFileType.jpg.destination = "images" />
<cfset locationsByFileType.jpg.replaceSpecialChars = false />
<cfset locationsByFileType.jpg.specialCharsMatchRegEx = "[^A-Za-z0-9\._\-]" />
Finally, here is the function code:
<!---
Function:		unzipWithUTFSupport
Created on:		10.15.2007
Author:			Boyan Kostadinov
Version:		0.1
Arguments:		zipFile (string) required
				The name of the zip file to extract

				destination (string)
				The relative destination directory to extract the zip to (by default ".")

				overwriteDestination (boolean)
				Should the destination directory be overwritten (by default true)

				extractLocationsByFileType (struct)
				A structure containing the type of files to be extracted
				and their destinations (by default empty structure)

				Example:
				<cfset locationsByFileType.jpg = structnew() />

				Specify the relative destination where to extract this type of files
				<cfset locationsByFileType.jpg.destination = "jpgFilesDir" />

				Should special characters be replaced for this type of files
				<cfset locationsByFileType.jpg.replaceSpecialChars = true />

				The regular expression to match special characters against
				<cfset locationsByFileType.jpg.specialCharsMatchRegEx = "[^A-Za-z0-9\._\-]" />

				If "replaceSpecialChars" and "specialCharsMatchRegEx" do not exist, the default
				values in the functions are used (true and "[^A-Za-z0-9\._\-]")

				replaceSpecialChars (boolean)
				Should special characters in the file names be replaced (by default true)

				specialCharsMatchRegEx (string)
				The regular expression that will match the special characters in the file names
				that need to be replaced (by default "[^A-Za-z0-9\._\-]")

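Return Value:	unzipResults (struct)
				A structure containing the results of the unzip process with the keys:
				success (boolean)
					the success of the unzip
				antMessage (string)
					the message the unzip task returns
				argumentErrors (array)
					the argument validation errors, if any
				fileList (query or struct)
					the list of files/directories unzipped (a query, or a structure
					of queries keyed by file extension when "extractLocationsByFileType" is used)
Description:	I extract a zip file regardless of file name encoding with the help of Ant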
--->
<cffunction name="unzipWithUTFSupport" hint="I extract a zip file regardless of file name encoding with the help of Ant" returntype="struct">
	<cfargument name="zipFile" type="string" required="yes" />
	<cfargument name="destination" type="string" required="no" default="." />
	<cfargument name="overwriteDestination" type="boolean" required="no" default="true" />
	<cfargument name="extractLocationsByFileType" type="struct" required="no" default="#structnew()#" />
	<cfargument name="replaceSpecialChars" type="boolean" required="no" default="true" />
	<cfargument name="specialCharsMatchRegEx" type="string" required="no" default="[^A-Za-z0-9\._\-]" />

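	<!--- Create local variables for the zip file name and the destination directory --->
	<cfset var zipFileName = "" />
	<cfset var unzipDestination = "" />
	<cfset var uniqueUnzipDestinationDirectory = "" />
	<cfset var unzipResults = structnew() />
	<cfset var buildMessage = "" />
	<cfset var currentDir = "" />
	<cfset var fileRenamed = false />
	<!--- Also var-scope the working variables used further down so they do not leak into the caller's variables scope --->
	<cfset var unzipXml = "" />
	<cfset var key = "" />
	<cfset var currentFileList = "" />
	<cfset var newName = "" />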

	<!--- Set the name of the temporary ant build file --->
	<cfset var buildFile = expandpath("unzip.xml") />

	<cfset unzipResults.success = false />
	<cfset unzipResults.antMessage = "" />
	<cfset unzipResults.fileList = 0 />
	<cfset unzipResults.argumentErrors = arraynew(1) />

	<!--- Set the name of the zip file to extract --->
	<cfif arguments.zipFile neq "">
		<cfset zipFileName = expandPath(arguments.zipFile) />

		<cfif not fileExists(zipFileName)>
			<cfset arrayappend(unzipResults.argumentErrors, "The zip file #arguments.zipFile# was not found") />
		</cfif>
	<cfelse>
		<cfset arrayappend(unzipResults.argumentErrors, "The zip file was not specified") />
	</cfif>

	<!--- Set the extract destination --->
	<cfif arguments.destination neq "">
		<cfset unzipDestination = expandPath(arguments.destination) />
	<cfelse>
		<cfset arrayappend(unzipResults.argumentErrors, "Destination was empty") />
	</cfif>

	<!--- If there were no argument validation errors --->
	<cfif arrayisempty(unzipResults.argumentErrors)>
		<!--- Create a directory for the zip file based on the name of the zip archive --->
		<cfset uniqueUnzipDestinationDirectory =
			unzipDestination & "\" &
			rereplacenocase(arguments.zipFile, "\.zip$", "") />

		<!--- Create the xml string for the ant build file --->
		<cfoutput>
		<!--- If the "extractLocationsByFileType" structure is not empty, loop over it
		to extract the different files types to the specified sub directory --->
		<cfsavecontent variable="unzipXml">
		<project>
			<target name="unzip">
				<cfif not structisempty(extractLocationsByFileType)>
					<cfloop list="#structKeyList(extractLocationsByFileType)#" index="key">
						<cfif isstruct(extractLocationsByFileType[key])>
				<unzip src="#zipFileName#" dest="#uniqueUnzipDestinationDirectory#\#lcase(extractLocationsByFileType[key].destination)#">
					<patternset>
						<include name="**/*.#lcase(key)#"/>
					</patternset>
				</unzip>
						<cfelse>
				<unzip src="#zipFileName#" dest="#uniqueUnzipDestinationDirectory#\#lcase(extractLocationsByFileType[key])#">
					<patternset>
						<include name="**/*.#lcase(key)#"/>
					</patternset>
				</unzip>
						</cfif>
					</cfloop>
				<cfelse>
				<unzip src="#zipFileName#" dest="#uniqueUnzipDestinationDirectory#" />
				</cfif>
			</target>
		</project>
		</cfsavecontent>
		</cfoutput>

		<cftry>
			<!--- If the destination directory already exists, delete it --->
			<cfif directoryExists(uniqueUnzipDestinationDirectory) and arguments.overwriteDestination>
				<cfdirectory action="delete" directory="#uniqueUnzipDestinationDirectory#" recurse="yes" />
			</cfif>

			<!--- Write the temporary ant build file on the file system --->
			<cffile action="write" nameconflict="overwrite" file="#buildFile#" output="#unzipXml#" />
		
			<!--- Execute the ant task with the created ant build file --->
			<!--- "messages" holds the coldfusion variable to write the ant output to --->
			<!--- "target" is the name of the "defaultTarget" (ant speak) to execute when running the ant task --->
			<cfant buildFile="#buildFile#"
				defaultDirectory=""
				anthome=""
				messages="buildMessage"
				target="unzip"
			/>

			<cfif refindnocase(".*unable to expand to file.*", buildMessage)>
				<cfset unzipResults.success = false />
			<cfelseif refindnocase(".*build successful.*", buildMessage)>
				<cfset unzipResults.success = true />
			</cfif>

			<cfset unzipResults.antMessage = buildMessage />

			<!--- Delete the temporary ant build file --->
			<cffile action="delete" file="#buildFile#" />

			<cfif unzipResults.success>
				<cfif not structisempty(extractLocationsByFileType)>
					<cfset unzipResults.fileList = structnew() />
	
					<cfloop list="#structKeyList(extractLocationsByFileType)#" index="key">
						<cfif isstruct(extractLocationsByFileType[key])>
							<cfset currentDir =
							uniqueUnzipDestinationDirectory & "\" &
							extractLocationsByFileType[key].destination />
	
							<cfif structkeyexists(extractLocationsByFileType[key], "replaceSpecialChars")>
								<cfif isboolean(extractLocationsByFileType[key].replaceSpecialChars)>
									<cfset replaceSpecialChars = extractLocationsByFileType[key].replaceSpecialChars />
								</cfif>
							</cfif>
	
							<cfif replaceSpecialChars>
								<cfif structkeyexists(extractLocationsByFileType[key], "specialCharsMatchRegEx")>
									<cfset specialCharsMatchRegEx = extractLocationsByFileType[key].specialCharsMatchRegEx />
								</cfif>
							</cfif>
						<cfelse>
							<cfset currentDir = uniqueUnzipDestinationDirectory & "\" &
							extractLocationsByFileType[key] />
						</cfif>
	
						<!--- Get a list of the files in the directory --->
						<cfdirectory
							action="list"
							directory="#currentDir#" name="currentFileList" />
	
						<cfset fileRenamed = false />
	
						<cfif replaceSpecialChars>
							<!--- Loop over all the files --->
							<cfloop query="currentFileList">
								<!--- If the filename has special characters --->
								<cfif refind(specialCharsMatchRegEx, name)>
									<!--- Create a new name for the file by replacing all special characters with "_" --->
									<cfset newName = rereplace(name, specialCharsMatchRegEx, "_", "all") />
				
									<!--- Rename the file to the new name --->
									<cffile
										action="rename"
										source="#directory#\#name#"
										destination="#directory#\#newName#" />
	
									<cfset fileRenamed = true />
								</cfif>
							</cfloop>
						</cfif>
	
						<cfif fileRenamed>
							<!--- Get a list of the files in the directory
						(again since some files might have been renamed) --->
							<cfdirectory
								action="list"
								directory="#currentDir#" name="currentFileList" />
						</cfif>
	
						<cfset unzipResults.fileList[key] = currentFileList />
					</cfloop>
				<cfelse>
					<!--- Get a list of the files in the directory --->
					<cfdirectory
						action="list"
						directory="#uniqueUnzipDestinationDirectory#" name="currentFileList" />
	
					<cfset fileRenamed = false />
					<cfif arguments.replaceSpecialChars and specialCharsMatchRegEx neq "">
						<!--- Loop over all the files --->
						<cfloop query="currentFileList">
							<!--- If the filename has special characters --->
							<cfif refind(specialCharsMatchRegEx, name)>
								<!--- Create a new name for the file by replacing all special characters with "_" --->
								<cfset newName = rereplace(name, specialCharsMatchRegEx, "_", "all") />
			
								<!--- Rename the file to the new name --->
								<cffile
									action="rename"
									source="#directory#\#name#"
									destination="#directory#\#newName#" />
	
								<cfset fileRenamed = true />
							</cfif>
						</cfloop>
					</cfif>
	
					<cfif fileRenamed>
						<!--- Get a list of the files in the directory
						(again since some files might have been renamed) --->
						<cfdirectory
							action="list"
							directory="#uniqueUnzipDestinationDirectory#" name="currentFileList" />
					</cfif>
	
					<cfset unzipResults.fileList = currentFileList />
				</cfif>
			</cfif>
		<cfcatch type="any">
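			<!--- Report the failure in the result structure instead of dumping it to the page --->
			<cfset unzipResults.success = false />
			<cfset unzipResults.antMessage = trim(unzipResults.antMessage & " " & cfcatch.message) />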
		</cfcatch>
		</cftry>
	</cfif>

	<cfreturn unzipResults />
</cffunction>