Tuesday, January 30, 2007

Split and Merge files in C#

In this post I'll show you how to split a file into a user-specified number of chunks and later merge them back together. You'll find this very helpful if you have very large text files, greater than a GB, that cannot be viewed in your "lousy Notepad". Such files are often the crucial log files of your enterprise applications, which keep accumulating data over time if left unattended.

The code example below is generalized to split any file, regardless of its format.

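private void btnSplit_Click(object sender, EventArgs e)
{
    string inputFile = txtInputFile.Text; // Substitute this with your input file
    int numberOfFiles = Convert.ToInt32(txtChunks.Text);
    string directory = Path.GetDirectoryName(inputFile);
    string baseFileName = Path.GetFileNameWithoutExtension(inputFile);
    string extension = Path.GetExtension(inputFile);

    using (FileStream fs = new FileStream(inputFile, FileMode.Open, FileAccess.Read))
    {
        // Ceiling division: every chunk gets this many bytes, the last one gets the remainder.
        long sizeOfEachFile = (fs.Length + numberOfFiles - 1) / numberOfFiles;
        // Copy through a small buffer so a chunk never has to fit in memory as a whole.
        byte[] buffer = new byte[64 * 1024];

        for (int i = 1; i <= numberOfFiles; i++)
        {
            // Zero-pad the chunk number so the chunks sort correctly by name, e.g. "log.00001.txt.tmp".
            string chunkName = baseFileName + "." + i.ToString().PadLeft(5, '0') + extension + ".tmp";

            using (FileStream outputFile = new FileStream(Path.Combine(directory, chunkName), FileMode.Create, FileAccess.Write))
            {
                long remaining = sizeOfEachFile;
                int bytesRead;

                // Read may return fewer bytes than requested, so loop until the chunk is full or the input ends.
                while (remaining > 0 && (bytesRead = fs.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining))) > 0)
                {
                    outputFile.Write(buffer, 0, bytesRead);
                    remaining -= bytesRead;
                }
            }
        }
    }
}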
private void btnMerge_Click(object sender, EventArgs e)
{
    string outPath = txtInputFolder.Text; // Substitute this with your input folder
    string[] tmpFiles = Directory.GetFiles(outPath, "*.tmp");
    // GetFiles does not guarantee any order, so sort by name to line the chunks up.
    Array.Sort(tmpFiles, StringComparer.OrdinalIgnoreCase);
    FileStream outputFile = null;
    string prevFileName = "";
    byte[] buffer = new byte[1024];

    try
    {
        foreach (string tempFile in tmpFiles)
        {
            // "log.00001.txt.tmp" -> fileName "log.00001.txt", baseFileName "log", extension ".txt"
            string fileName = Path.GetFileNameWithoutExtension(tempFile);
            string baseFileName = fileName.Substring(0, fileName.IndexOf('.'));
            string extension = Path.GetExtension(fileName);

            if (!prevFileName.Equals(baseFileName))
            {
                // A new base name means a new parent file; finish the previous one first.
                if (outputFile != null)
                    outputFile.Close();

                // FileMode.Create truncates an existing file, so no stale bytes are left at its end.
                outputFile = new FileStream(Path.Combine(outPath, baseFileName + extension), FileMode.Create, FileAccess.Write);
            }

            using (FileStream inputTempFile = new FileStream(tempFile, FileMode.Open, FileAccess.Read))
            {
                int bytesRead;
                while ((bytesRead = inputTempFile.Read(buffer, 0, buffer.Length)) > 0)
                    outputFile.Write(buffer, 0, bytesRead);
            }

            File.Delete(tempFile);
            prevFileName = baseFileName;
        }
    }
    finally
    {
        // outputFile is still null when the folder contains no .tmp files.
        if (outputFile != null)
            outputFile.Close();
    }
}
 
The split method is straightforward: you set the number of files to split into, and the input is divided into chunks of equal size (the last chunk holds whatever is left over, so it may be smaller). Each chunk is named after its parent, numbered, and given a ".tmp" extension. If you're splitting a text file with no intention of merging the pieces later, you can replace the ".tmp" extension with ".txt".
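
To make the arithmetic and the naming concrete, here is a minimal sketch. The class and method names (ChunkMath, ChunkSize, BytesInChunk, ChunkName) are my own, hypothetical helpers that are not part of the handlers above; they only mirror the rounded-up division and the file name pattern the split method uses.

```csharp
using System;
using System.IO;

// Hypothetical helpers that mirror the split method's arithmetic and naming.
public static class ChunkMath
{
    // Size of each chunk when 'length' bytes are split into 'count' pieces, rounded up.
    public static long ChunkSize(long length, int count)
    {
        if (count <= 0) throw new ArgumentOutOfRangeException("count");
        return (length + count - 1) / count;
    }

    // Bytes that land in chunk number 'index' (1-based); trailing chunks may be short or empty.
    public static long BytesInChunk(long length, int count, int index)
    {
        long size = ChunkSize(length, count);
        long start = size * (index - 1);
        return Math.Max(0L, Math.Min(size, length - start));
    }

    // Name of chunk 'index' for 'inputFile': base name, zero-padded number, original extension, ".tmp".
    public static string ChunkName(string inputFile, int index)
    {
        string baseName = Path.GetFileNameWithoutExtension(inputFile);
        string extension = Path.GetExtension(inputFile);
        return baseName + "." + index.ToString().PadLeft(5, '0') + extension + ".tmp";
    }
}
```

For a 10-byte file split three ways, ChunkSize(10, 3) is 4 and the chunks hold 4, 4 and 2 bytes; ChunkName("app.log", 2) gives "app.00002.log.tmp". Note that "equal" is only approximate: ask for 4 chunks of a 5-byte file and the last chunk comes out empty.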

The Merge method above is in fact a "Merge All" method. It merges all the files with a ".tmp" extension in the specified directory and re-creates each parent file. That's why the chunk names retain the original file name and extension.
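
Here is a small sketch of how a chunk name maps back to its parent file name. ChunkNames and ParentFileName are hypothetical names of my own; the parsing follows the same steps as the merge handler (strip ".tmp", take everything before the first dot as the base name, take the last extension as the original one).

```csharp
using System;
using System.IO;

// Hypothetical helper that mirrors the merge method's name parsing.
public static class ChunkNames
{
    // "report.00003.txt.tmp" -> "report.txt"
    public static string ParentFileName(string chunkFileName)
    {
        // Drop the trailing ".tmp": "report.00003.txt"
        string fileName = Path.GetFileNameWithoutExtension(chunkFileName);

        // Everything before the first dot is treated as the parent's base name.
        int firstDot = fileName.IndexOf('.');
        if (firstDot < 0)
            throw new FormatException("Not a chunk produced by the split method: " + chunkFileName);

        string baseName = fileName.Substring(0, firstDot);
        string extension = Path.GetExtension(fileName); // ".txt"
        return baseName + extension;
    }
}
```

One consequence worth knowing: because the base name ends at the first dot, a parent file whose own name contains dots is not restored faithfully. ParentFileName("my.data.00001.csv.tmp") returns "my.csv", not "my.data.csv", so this scheme is only safe for files with a single dot in their name.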

The Directory.GetFiles() method returns an array of all the file paths in the directory, but it does not guarantee their order, so sort the array yourself (for example with Array.Sort) if you depend on it. Even sorted, the names are compared as strings. If you have file names like Testfile1.txt, Testfile2.txt, .... Testfile100.txt then their sorted order would be Testfile1.txt, Testfile10.txt, Testfile100.txt, Testfile2.txt, Testfile20.txt, Testfile3.txt,.... because the numbers are just text. The merge would not report an error; it would stitch the chunks together in the wrong order and produce a corrupted file. This can be addressed if you left-pad the number with zeros while splitting, preferably to a width of 5 characters. The file names would then look like Testfile00001.txt, Testfile00002.txt...Testfile00100.txt, and string order matches numeric order for up to 99,999 chunks.
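
You can see the effect of the padding without touching the disk. The sketch below is my own illustration (SortDemo and SortedNames are hypothetical names): it builds the Testfile names from the paragraph above, with or without padding, and sorts them with an ordinal string comparison.

```csharp
using System;

// Hypothetical demo of string sorting with and without zero-padded numbers.
public static class SortDemo
{
    // Builds "Testfile<N>.txt" for N = 1..count and returns the names in ordinal string order.
    public static string[] SortedNames(int count, bool padded)
    {
        string[] names = new string[count];
        for (int i = 1; i <= count; i++)
        {
            string number = padded ? i.ToString().PadLeft(5, '0') : i.ToString();
            names[i - 1] = "Testfile" + number + ".txt";
        }

        // Ordinal comparison looks at character codes only, so "10" sorts before "2".
        Array.Sort(names, StringComparer.Ordinal);
        return names;
    }
}
```

With 12 unpadded names the sorted array starts Testfile1.txt, Testfile10.txt, Testfile11.txt, Testfile12.txt, Testfile2.txt; with padding it runs Testfile00001.txt, Testfile00002.txt and so on up to Testfile00012.txt, which is the order the merge needs.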

All the "tmp" files are deleted from the folder after they are merged.