Purpose

This method lets user custom code scan an incoming attachment file for potentially dangerous characters, and then either accept the file or reject it before it is inserted into the database as an attachment.  For example, you might want to reject files that could be used in cross-site scripting (XSS) exploits.  Many installations need files that contain these characters to upload normally, but some organizations have a policy of not accepting them.

 

Signature

public String scanUploadFile( File attachmentFile,
                              SesameSession session,
                              Connection dbconn,
                              String fileName,
                              String fileCharset,
                              String contentType) throws Exception 

Notes

To accept the attachment, the method should return either null or a blank string.  To reject the attachment, the method should return a message explaining why the file is not being uploaded; this message is displayed to the end user.
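As a minimal sketch of this return contract, the following uses a simplified, hypothetical signature that omits the ExtraView-specific SesameSession and Connection parameters. The class name, the whitelist contents, and the wording of the message are assumptions made for illustration only.

```java
import java.util.Locale;
import java.util.Set;

class ScanReturnSketch {
    // Hypothetical whitelist; a real installation would define its own.
    static final Set<String> ALLOWED = Set.of("txt", "pdf", "xlsx", "png");

    // Returns null to accept the file, or a message for the end user to reject it.
    static String scanUploadFile(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = (dot < 0)
                ? ""
                : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        if (!ALLOWED.contains(ext)) {
            return "The file " + fileName
                    + " was not uploaded: its file type is not permitted.";
        }
        return null; // null (or a blank string) accepts the attachment
    }
}
```

A file without an extension is rejected here because the empty string is not in the whitelist; whether that is the right policy depends on the installation.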

The algorithm implemented in user custom code for one specific customer is as follows:

  1. Check whether the file extension is in the whitelist of allowed file extensions. If not, reject the attachment.

  2. If the stated file type is a text, HTML, or JavaScript MIME type, try to convert the file to a character stream using the specified charset. If the conversion works, use the built-in Java class Scanner to find the script tag.

  3. If the file is not text, HTML, or JavaScript, extract the contained text and JavaScript according to the document type (for example, an Excel spreadsheet) using the existing full text search means (e.g., POI), then use Scanner to find the script tag in the extracted text.

  4. If there is no match so far, the extension may be incorrect, so treat the file as a binary file with an unrecognized MIME type. Search the bytes for the pattern using KMPMatch (a Java class that matches strings inside binary files), with extensions for upper and lower case. The search includes two patterns: the <script pattern without binary zeroes, and the <ZsZcZrZiZpZt pattern, where Z == binary zero. The latter pattern identifies script tags in text that is encoded with UCS-2 or UTF-16 (double-byte character encodings).

  5. Conversion of data through different transfer-encodings is unnecessary, because ExtraView does not use other transfer-encodings to send an attachment to the browser. For example, there is no use case where ExtraView would send a base64-transfer-encoded document to the browser as an attachment. (The transfer-encoding is not an attribute of the attachment file; it is specified in the header of the message sent to the browser or received from the browser.)
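Steps 2 and 4 above can be sketched roughly as follows. This is an illustration under stated assumptions, not the customer's actual code: the class and method names are hypothetical, and the byte search is a simple naive scan used for brevity in place of KMPMatch. The Scanner, Pattern, and CharsetDecoder calls are standard Java library classes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.regex.Pattern;

class AttachmentScanSketch {
    static final Pattern SCRIPT_TAG =
            Pattern.compile("<script", Pattern.CASE_INSENSITIVE);
    // "<script" as single-byte characters (lower case).
    static final byte[] PLAIN = "<script".getBytes(StandardCharsets.US_ASCII);
    // "<script" with a binary zero between each pair of characters.
    static final byte[] WIDE = interleaveZeros(PLAIN);

    // Step 2: strict conversion; returns null if the bytes do not decode.
    static String decodeStrict(byte[] data, String charsetName) {
        try {
            return Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException | IllegalArgumentException e) {
            return null; // bad bytes, or an unknown or illegal charset name
        }
    }

    // Step 2: use Scanner to look for the script tag in decoded text.
    static boolean textHasScriptTag(String text) {
        try (Scanner sc = new Scanner(text)) {
            return sc.findWithinHorizon(SCRIPT_TAG, 0) != null; // 0 = no limit
        }
    }

    static byte[] interleaveZeros(byte[] ascii) {
        byte[] out = new byte[ascii.length * 2 - 1];
        for (int i = 0; i < ascii.length; i++) {
            out[2 * i] = ascii[i]; // odd positions keep their default of 0
        }
        return out;
    }

    static byte lower(byte b) {
        return (b >= 'A' && b <= 'Z') ? (byte) (b + 32) : b;
    }

    // Naive case-insensitive search; the pattern must already be lower case.
    static int indexOfIgnoreCase(byte[] data, byte[] pattern) {
        for (int i = 0; i + pattern.length <= data.length; i++) {
            int j = 0;
            while (j < pattern.length && lower(data[i + j]) == pattern[j]) {
                j++;
            }
            if (j == pattern.length) {
                return i;
            }
        }
        return -1;
    }

    // Step 4: search raw bytes for both forms of the pattern.
    static boolean bytesHaveScriptTag(byte[] data) {
        return indexOfIgnoreCase(data, PLAIN) >= 0
                || indexOfIgnoreCase(data, WIDE) >= 0;
    }
}
```

Because the zero-interleaved pattern starts with the < byte and ends with the t byte, it matches both byte orders of UTF-16: in little-endian data the match starts on the first byte of the tag, and in big-endian data it starts one byte later.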

KMPMatch Method

/**
 * Knuth-Morris-Pratt Algorithm for Pattern Matching
 */
class KMPMatch {
    /**
     * Finds the first occurrence of the pattern in the data.
     * Returns the index of the match, or -1 if the pattern is not found.
     */
    public int indexOf(byte[] data, byte[] pattern) {
        int[] failure = computeFailure(pattern);

        int j = 0;
        if (pattern.length == 0) return 0; // an empty pattern matches at offset 0
        if (data.length == 0) return -1;

        for (int i = 0; i < data.length; i++) {
            while (j > 0 && pattern[j] != data[i]) {
                j = failure[j - 1];
            }
            if (pattern[j] == data[i]) { j++; }
            if (j == pattern.length) {
                return i - pattern.length + 1;
            }
        }
        return -1;
    }

    /**
     * Computes the failure function using a boot-strapping process,
     * where the pattern is matched against itself.
     */
    private int[] computeFailure(byte[] pattern) {
        int[] failure = new int[pattern.length];

        int j = 0;
        for (int i = 1; i < pattern.length; i++) {
            while (j > 0 && pattern[j] != pattern[i]) {
                j = failure[j - 1];
            }
            if (pattern[j] == pattern[i]) {
                j++;
            }
            failure[i] = j;
        }