I ran into a problem while reading binary data from websites for a web-spidering application I was developing a couple of months ago. I was able to read text strings from a few sites but failed on many others because the ResponseStream was not seekable. See this code snippet:
HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(url);
HttpWebResponse myResponse = (HttpWebResponse)myRequest.GetResponse();
long length = myResponse.ContentLength; // ContentLength is a long, not an int; it is -1 when the server omits the header

Retrieving the stream's length this way failed and threw "This stream does not support seek operations" exceptions, because the network stream behind the response is simply not seekable. I later realized that it was not actually a problem with the WebResponse object; the way I intended to retrieve binary data from the web was not right. OK, so how can I retrieve data out of this stream, say, straight HTML text for further parsing? The best way is to copy the stream into a MemoryStream and finally convert it into a byte array.
// Requires: using System.IO; using System.Net;
HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(url);
HttpWebResponse myResponse = (HttpWebResponse)myRequest.GetResponse();
Stream respStream = myResponse.GetResponseStream();
MemoryStream memStream = new MemoryStream();
byte[] buffer = new byte[2048];
int bytesRead;
do
{
    // Read() returns the number of bytes actually read; 0 means end of stream.
    bytesRead = respStream.Read(buffer, 0, buffer.Length);
    memStream.Write(buffer, 0, bytesRead);
} while (bytesRead != 0);

respStream.Close();
myResponse.Close(); // release the connection as well as the stream
buffer = memStream.ToArray();
string html = System.Text.Encoding.ASCII.GetString(buffer);
Here, I am instantiating a new MemoryStream object, reading a fixed-size chunk from the response stream, and copying it over to the MemoryStream. The Stream.Read() method reads at most buffer.Length bytes from the current stream, stores them in the buffer, and returns the number of bytes actually read (stored in "bytesRead" above). In the example it reads at most 2048 bytes each time, stores them in the buffer, and then writes that many bytes into the MemoryStream. The method returns 0 when there is no more data to be read. One important thing to note here is that Stream.Read() can return fewer bytes than requested (< 2048) even if the end of the stream has not been reached.
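Because of those short reads, any code that needs an exact number of bytes has to keep calling Read() in a loop. Here is a minimal sketch of such a helper; the name ReadFully is my own and not part of the framework:

// Requires: using System.IO;
// Hypothetical helper: loops until "count" bytes have arrived or the stream ends.
static int ReadFully(Stream stream, byte[] buffer, int offset, int count)
{
    int total = 0;
    while (total < count)
    {
        int read = stream.Read(buffer, offset + total, count - total);
        if (read == 0)
            break; // end of stream reached before "count" bytes
        total += read;
    }
    return total; // less than "count" only if the stream ended early
}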
The MemoryStream.ToArray() method finally converts the buffered data into a byte array. If the retrieved data is of a plain-text type, which can be known from the response's Content-Type header, it can be converted into a string using the System.Text.Encoding.ASCII.GetString(buffer) method. Otherwise, write the byte array to a file using a FileStream object.
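A minimal sketch of that branch, reusing the myResponse and buffer variables from above (the file name "output.bin" is just a placeholder):

// Requires: using System.IO;
// Content-Type values like "text/html; charset=utf-8" start with "text/".
if (myResponse.ContentType.StartsWith("text/"))
{
    // Plain text: decode the bytes into a string for parsing.
    string text = System.Text.Encoding.ASCII.GetString(buffer);
}
else
{
    // Binary data: write the raw bytes to disk instead.
    using (FileStream fs = new FileStream("output.bin", FileMode.Create, FileAccess.Write))
    {
        fs.Write(buffer, 0, buffer.Length);
    }
}

Note that Encoding.ASCII is only safe for pure-ASCII pages; Encoding.UTF8 is a more forgiving default for real-world HTML.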
You will find this implementation very useful if you're planning to download files and later resume broken downloads in your web-spidering applications!
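Resuming a broken download depends on the server honoring HTTP range requests, which HttpWebRequest exposes through its AddRange() method. Here is a minimal sketch of the idea; the file name "partial.bin" is just a placeholder, and it assumes the server answers with 206 Partial Content:

// Requires: using System.IO; using System.Net;
// Ask the server for only the bytes we do not have yet.
FileInfo partial = new FileInfo("partial.bin"); // placeholder for the half-finished file
HttpWebRequest resumeRequest = (HttpWebRequest)WebRequest.Create(url);
resumeRequest.AddRange((int)partial.Length); // resume from the current file length

using (HttpWebResponse resumeResponse = (HttpWebResponse)resumeRequest.GetResponse())
using (Stream resumeStream = resumeResponse.GetResponseStream())
using (FileStream fs = new FileStream(partial.FullName, FileMode.Append, FileAccess.Write))
{
    byte[] chunk = new byte[2048];
    int n;
    while ((n = resumeStream.Read(chunk, 0, chunk.Length)) > 0)
    {
        fs.Write(chunk, 0, n); // append straight to disk; no MemoryStream needed
    }
}

Writing straight to disk like this also avoids buffering the whole file in a MemoryStream, which matters for large downloads.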
4 comments:
This is not a good option when your website is getting several hits (like 20 or more). The MemoryStream object will eat up your memory and crash your server.
jesperated, you're absolutely right. This solution is not recommended for server-based applications. If your web application uses this to access large streams from third-party websites, with more than 30 users simultaneously hitting your site, it will definitely crash your server.
I've implemented this solution in a stand-alone Windows client application (for web-spidering) that used 8 threads to download media and no more than 30 threads to access and parse HTML streams.
As long as you clean up after you are done, and do it fast, it should be safe. Also, a production machine should be built with enough memory to achieve what you are trying to do.
Thanks a lot! Your article saved me from pulling my hair out.
Ephrem