Sunday, July 13, 2014

Upload large files to Azure block blob storage in parallel and Async using C#

Alright guys, read carefully. This blog post explains the code by which you can UPLOAD MULTIPLE LARGE FILES TO AZURE BLOB STORAGE IN PARALLEL (block blobs, to be precise).
What am I doing and what am I not?
1.Ask the user to provide a list of large files and blob names to upload in one go, in parallel.
2.The code uses the TPL (a Parallel.ForEach loop, to be precise) to perform simultaneous uploads of Azure blobs.
3.The code chops each large file into multiple blocks.
4.Every block of a file is uploaded in SYNC. I am not performing any ASYNC operation while uploading individual blocks to an Azure blob. (To be precise, I will be using the PutBlock and PutBlockList methods.)
5.However, the UI project (in my case a console application; it could also be a worker role) calls the blob upload method in an ASYNC way with the help of BeginInvoke and EndInvoke.
Applicable technology -
I am using VS2013, Azure SDK 2.3 and the storage library “Windows Azure Storage 4.1.0”, released on 23rd June 2014.
Implementation
In real world scenarios we usually perform large file uploads to Azure blob storage from a worker role. However, I will depict the code using a console application. Don’t worry, it can easily be converted to worker role specific code. :)
Alright so, let’s start with it!!
Again – this might be a long post due to heavy code blocks, so be prepared.
Reference – 70% of this code is based on the solution provided on CodePlex in this post - http://azurelargeblobupload.codeplex.com/SourceControl/latest#AzurePutBllockExample/
It is a great solution for uploading large files to Azure blob storage. Just awesome!! I will change it a bit so that we can perform parallel uploads of large files.
Let’s understand few important components.

First I create a simple console application and named it as AzureBlobUploadParallelSample as shown below –

Then I added another class library in the same solution named as AzureBlobOperationsManager as shown below –

This class library will perform the upload of large files to Azure blob storage and hence needs a reference to the Azure storage libraries. NuGet is the best way to get the latest DLLs, therefore I opened Tools -> Library Package Manager -> Package Manager Console and typed the following command to install the storage libraries –
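The original command was shown in a screenshot that is missing here; based on the library and version mentioned above (Windows Azure Storage 4.1.0), the Package Manager Console command would be along these lines:

```powershell
Install-Package WindowsAzure.Storage -Version 4.1.0
```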

Also add a reference to the latest version of Microsoft.WindowsAzure.ServiceRuntime from the Add Reference dialog box.

I am defining a class here named FileBlobNameMapper. This class defines two properties, BlobName and FilePath, which are used to specify the name of the blob and the path of the file to be uploaded. This class lets users provide multiple large files to be uploaded to Azure blob storage.
    /// <summary>
    /// Class to be used for holding the file-blobname mapping.    
    /// </summary>
    public class FileBlobNameMapper
    {
        public FileBlobNameMapper(string blobName, string filePath)
        {
            BlobName = blobName;
            FilePath = filePath;
        }
 
        public string BlobName { get; set; }
 
        public string FilePath { get; set; }
    }
After invoking the async upload of multiple blobs to Azure storage, we need to know which uploads succeeded and which failed. To get this status information I have defined another class named BlobOperationStatus. It is as follows –
    public class BlobOperationStatus
    {
        public string Name { get; set; }
 
        public Uri BlobUri { get; set; }
 
        public OperationStatus OperationStatus { get; set; }
 
        public Exception ExceptionDetails { get; set; }
    }
 
    public enum OperationStatus
    {
        Failed, Succeded
    }
Now we need a class which will actually perform the upload of large files to blobs in parallel. Therefore I added a class named AsyncBlockBlobUpload.
In this class I copied the method GetFileBlocks and the internal class FileBlock from the CodePlex link specified above.
I defined a MaxBlockSize class variable of 2 MB as follows. This means every file block will be of size 2 MB.
private const int MaxBlockSize = 2097152; // 2 MB chunk size
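The FileBlock class and GetFileBlocks method from the CodePlex sample are not reproduced in this post. A sketch of what they do (my reconstruction; the original's details may differ) is:

```csharp
/// <summary>
/// Holds one chunk of the file plus its base64-encoded block id.
/// </summary>
internal class FileBlock
{
    public string Id { get; set; }
    public byte[] Content { get; set; }
}

/// <summary>
/// Splits the file content into MaxBlockSize (2 MB) chunks. Block ids within one
/// blob must be base64 strings of equal length, so a fixed-width counter is encoded.
/// </summary>
private static IEnumerable<FileBlock> GetFileBlocks(byte[] fileContent)
{
    for (int offset = 0, index = 0; offset < fileContent.Length; offset += MaxBlockSize, index++)
    {
        int chunkSize = Math.Min(MaxBlockSize, fileContent.Length - offset);
        byte[] chunk = new byte[chunkSize];
        Array.Copy(fileContent, offset, chunk, 0, chunkSize);

        yield return new FileBlock
        {
            // BitConverter.GetBytes(int) always yields 4 bytes, so all ids have equal length
            Id = Convert.ToBase64String(BitConverter.GetBytes(index)),
            Content = chunk
        };
    }
}
```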
Now I defined a method which uses Parallel.ForEach to start the upload of all blobs in parallel, meaning every blob uploads on a different thread and hence faster.
public List<BlobOperationStatus> UploadBlockBlobsInParallel(List<FileBlobNameMapper> fileBlobNameMapperList, string containerName)
So if you see, this is where I use the earlier defined class FileBlobNameMapper. The method has a containerName parameter, meaning all the files and blob names in the FileBlobNameMapper list will be uploaded to the specified container. So whether you wish to upload a single file or multiple files to blobs, this method serves the purpose. The full method code is as follows –
public List<BlobOperationStatus> UploadBlockBlobsInParallel(List<FileBlobNameMapper> fileBlobNameMapperList, string containerName)
        {
            //thread-safe collection for the blob operation statuses; Parallel.ForEach
            //adds from multiple threads and List<T>.Add is not thread-safe
            //(requires: using System.Collections.Concurrent;)
            ConcurrentBag<BlobOperationStatus> blobOperationStatusList = new ConcurrentBag<BlobOperationStatus>();

            //upload every file from the list to a blob in parallel (multitasking)
            Parallel.ForEach(fileBlobNameMapperList, fileBlobNameMapper =>
            {
                string blobName = fileBlobNameMapper.BlobName;

                //read the file contents into a byte array
                byte[] fileContent = File.ReadAllBytes(fileBlobNameMapper.FilePath);

                //call the private method to actually perform the upload to blob storage
                BlobOperationStatus blobStatus = UploadBlockBlobInternal(fileContent, containerName, blobName);

                //add the status of every blob upload operation to the collection
                blobOperationStatusList.Add(blobStatus);
            });

            return blobOperationStatusList.ToList();
        }
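One caveat with an unbounded Parallel.ForEach here: many large files can be read into memory at the same time. A minimal sketch of capping the concurrency with ParallelOptions (the limit of 4 is an illustrative value, not from the original post):

```csharp
// Cap the number of files uploaded simultaneously so we don't hold
// too many large byte arrays in memory at once.
ParallelOptions parallelOptions = new ParallelOptions
{
    MaxDegreeOfParallelism = 4 // tune to your bandwidth and memory
};

Parallel.ForEach(fileBlobNameMapperList, parallelOptions, fileBlobNameMapper =>
{
    // ...same upload body as above...
});
```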
Let’s have a look at the private method where I actually perform the blob upload using PutBlock and PutBlockList.
private BlobOperationStatus UploadBlockBlobInternal(byte[] fileContent, string containerName, string blobName)
This method will be called as many times as there are records in the FileBlobNameMapper list.
Let’s look at the complete code of this method.
private BlobOperationStatus UploadBlockBlobInternal(byte[] fileContent, string containerName, string blobName)
        {  
            BlobOperationStatus blobStatus = new BlobOperationStatus();
            try
            {
                // Create the blob client.
                CloudBlobClient blobClient = storageAccount.CreateCloudBlobClient();
 
                // Retrieve reference to container and create if not exists
                CloudBlobContainer container = blobClient.GetContainerReference(containerName);
                container.CreateIfNotExists();
 
                // Retrieve reference to a blob and set the stream read and write size to minimum
                CloudBlockBlob blockBlob = container.GetBlockBlobReference(blobName);
                blockBlob.StreamWriteSizeInBytes = 1048576;
                blockBlob.StreamMinimumReadSizeInBytes = 1048576;
 
                //set the blob upload timeout and retry strategy
                BlobRequestOptions options = new BlobRequestOptions();
                options.ServerTimeout = new TimeSpan(0, 180, 0);
                options.RetryPolicy = new ExponentialRetry(TimeSpan.Zero, 20);
 
                //get the file blocks of 2 MB each and upload each block.
                //PutBlockList commits blocks in list order, so use a List (not a HashSet)
                //to preserve the order of the blocks
                List<string> blocklist = new List<string>();
                foreach (FileBlock block in GetFileBlocks(fileContent))
                {
                    blockBlob.PutBlock(
                        block.Id,
                        new MemoryStream(block.Content, true), null,
                        null, options, null
                        );

                    blocklist.Add(block.Id);
                }
                //commit the blocks that are uploaded in above loop
                blockBlob.PutBlockList(blocklist, null, options, null);
 
                //set the status of the blob upload operation as succeeded as there is no exception
                blobStatus.BlobUri = blockBlob.Uri;
                blobStatus.Name = blockBlob.Name;
                blobStatus.OperationStatus = OperationStatus.Succeded;
 
                return blobStatus;
            }
            catch (Exception ex)
            {
                //set the status of blob upload as failed along with exception message
                blobStatus.Name = blobName;
                blobStatus.OperationStatus = OperationStatus.Failed;
                blobStatus.ExceptionDetails = ex;
                return blobStatus;
            }
        }
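Note that UploadBlockBlobInternal references a storageAccount field that is not shown above. A minimal sketch of how it might be initialized (the setting name DataConnectionString is my assumption; use whatever your configuration defines):

```csharp
// Assumed class-level field in AsyncBlockBlobUpload; not shown in the original post.
// The connection string is read from app.config / the cloud service configuration;
// the setting name "DataConnectionString" is illustrative.
private static readonly CloudStorageAccount storageAccount =
    CloudStorageAccount.Parse(
        CloudConfigurationManager.GetSetting("DataConnectionString"));
```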
The comments in the above method are self-explanatory and simple to understand. So here we complete our library classes for large file uploads to blob storage. Build the class library project and add a reference to it from the console application project.
Now, from the client project (my console app, or a worker role) I need to invoke these blob upload methods asynchronously, retrieve the result in a callback method after the upload operation completes, and take the necessary action.
So we now need to look at the console application code from which we call the upload operation in an async way. I highly recommend you go through this link - http://msdn.microsoft.com/en-us/library/2e08f6yc(v=vs.110).aspx to understand how to call any method asynchronously from C#. Based on this approach I have defined a delegate AsyncBlockBlobUploadCaller with the same signature as the actual blob upload method. I will use an instance of this delegate to call BeginInvoke and EndInvoke.
I declared the delegate in the Program class of the console application as a class member –
public delegate List<BlobOperationStatus> AsyncBlockBlobUploadCaller(List<FileBlobNameMapper> blobFileMapperList, string containerName);
So Main method code is as follows –
static void Main(string[] args)
        {
            //define file paths
            string file1 = @"C:\Kunal_Apps\Sample hours1.xlsx";//5MB
            string file2 = @"C:\Kunal_Apps\Sample hours2.xlsx";//1MB
            string file3 = @"C:\Kunal_Apps\Sample hours3.xlsx";//6MB
            string file4 = @"C:\Kunal_Apps\Boot Camp 14.zip";//100MB
 
            //map the file names to blob names
            List<FileBlobNameMapper> blobFileMapperList = new List<FileBlobNameMapper>();
            blobFileMapperList.Add(new FileBlobNameMapper("blob1", file1));
            blobFileMapperList.Add(new FileBlobNameMapper("blob2", file2));
            blobFileMapperList.Add(new FileBlobNameMapper("blob3", file3));
            blobFileMapperList.Add(new FileBlobNameMapper("blob4", file4));
 
            //specify the container name
            string containerName = "mycontainer";
 
            AsyncBlockBlobUpload blobUploadManager = new AsyncBlockBlobUpload();
            AsyncBlockBlobUploadCaller caller = new AsyncBlockBlobUploadCaller(blobUploadManager.UploadBlockBlobsInParallel);
            caller.BeginInvoke(blobFileMapperList, containerName, new AsyncCallback(OnUploadBlockBlobsInParallelCompleted), null);
 
            //to keep the main thread alive I am using while(true). The async operation runs on a ThreadPool (background) thread, and if the main thread ends, the process and the async operation end with it.
            //Note: If you are using a worker role, its Run method typically runs a while(true) loop anyway, keeping the main thread alive.
            while (true)
            {
                Console.WriteLine("continue the main thread work...");
                Thread.Sleep(90000);               
            }
        }
If you noticed, I have added a while(true) loop. It is of no real use here; it just simulates my console application's main thread doing some work while the async upload to Azure blob storage happens in the background. If you are using a worker role you will not need it. In the above code, change the file paths to your own file paths; the files can be of different sizes. You may also change the container name and blob names as per your choice.
Now it is time to define the callback method, which will be called automatically when the async blob upload operation fails or succeeds.
/// <summary>
        /// Callback method for upload to azure blob operation
        /// </summary>
        /// <param name="result">async result</param>
        public static void OnUploadBlockBlobsInParallelCompleted(IAsyncResult result)
        {
            // Retrieve the delegate.
            AsyncResult asyncResult = (AsyncResult)result;
            AsyncBlockBlobUploadCaller caller = (AsyncBlockBlobUploadCaller)asyncResult.AsyncDelegate;
 
            //retrieve the blob upload operation status list to take necessary action
            List<BlobOperationStatus> operationStatusList = caller.EndInvoke(asyncResult);

            //print the status of the upload operation for each blob
            foreach (BlobOperationStatus blobStatus in operationStatusList)
            {
                Console.WriteLine("Blob name:" + blobStatus.Name + Environment.NewLine);
                Console.WriteLine("Blob operation status:" + blobStatus.OperationStatus + Environment.NewLine);
                if (blobStatus.ExceptionDetails != null)
                {
                    Console.WriteLine("Blob operation exception if any:" + blobStatus.ExceptionDetails.Message + Environment.NewLine);
                }
 
                //Note: This is where you could write the failed blob operation entry to a table/queue and have the worker role traverse it to perform the upload again.
            }
 
        }
That’s it. If you run the application the output will be as follows –

If you observe, the main thread work started and continued; then, when the entire blob upload operation succeeded, the messages for those blob uploads appeared, and after that the main thread's "continue" message appeared again. :)
Hence my entire large file upload to Azure blob storage was async and in parallel.

Let’s check whether the sample works correctly and returns the right results when my async Azure blob upload fails. The easiest way to fail a blob upload is to specify a blob name longer than 1024 characters. So I wrote some random sentences in a Word file, made sure their length was greater than 1024 characters (mine was 2019 characters long), and then in debug mode changed the name of my blob to this overly long random name.

As expected it failed and I got the correct failure result as shown below –



Enhancements –
Right now the code uploads multiple files in parallel, but all blocks of a file are uploaded synchronously. An enhancement would be to upload the blocks of a single file ALSO IN PARALLEL.
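A minimal sketch of that enhancement, assuming the same GetFileBlocks helper, blockBlob reference and request options as above (untested; note that PutBlockList still needs the block IDs in their original order, so the ordered ID list is built up front):

```csharp
// Upload the 2 MB blocks of ONE file in parallel, then commit them in order.
List<FileBlock> blocks = GetFileBlocks(fileContent).ToList();

// Preserve the original block order for PutBlockList; the order in which the
// individual PutBlock calls complete does not matter.
List<string> orderedBlockIds = blocks.Select(b => b.Id).ToList();

Parallel.ForEach(blocks, block =>
{
    // Each PutBlock call is independent, so the calls can run on different threads.
    blockBlob.PutBlock(
        block.Id,
        new MemoryStream(block.Content, true), null,
        null, options, null);
});

// Commit the uploaded blocks in their original order.
blockBlob.PutBlockList(orderedBlockIds, null, options, null);
```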

 If you are looking for REST API based upload of large files to Azure blob storage using SAS and also renew of SAS then refer to the following link – http://sanganakauthority.blogspot.com/2014/08/using-sas-renew-sas-and-rest-api-to.html
Important – Please suggest your Feedback/ Changes / Comments to the article to improve it.
To download full source code refer to the link - http://code.msdn.microsoft.com/Upload-large-file-to-azure-fd1ac46d
Cheers…
Happy Uploading!!

3 comments:

  1. I tried the same solution with a 500MB file and I am getting an out of memory exception.

  2. Will it work for an Azure request timeout greater than 230 seconds?