Miscellaneous .NET tips, code, comments, and what-not.

Wednesday, March 23, 2005

Python.NET

I'm curious, so I've downloaded SPE IDE and Python 2.4. My exercises in the XML world have led me down the Python path. It seems a lot of good string handling and parsing utilities are built into Perl (yuck) and Python, so I'm going to have to sate my curiosity now. :)

some Python.NET del.icio.us links

Using HTML Tidy in .NET to screen scrape and create RSS/XML

A little code using HTML Tidy and XPath to screen scrape links and their descriptions from a web page. This technique probably will not work for every web site, because HTML Tidy chokes on especially poor, non-standards-compliant pages, but it should work for most "plainer" web pages with matching tags and well-formed div tags. I would like to give a shout-out to O'Reilly's XML.Com pages and the wonderful XPath Explorer.






// Assumes these namespaces are imported: System, System.Collections, System.IO,
// System.Net, System.Text, System.Xml, plus the namespace of your HTML Tidy wrapper.
public void getSpurlLinks(string urlString, string styleSheet, string cachedFileName, string tagList)
{
    // Spurl.com returns multi-page results, so I loop through 10 pages (an arbitrary
    // number) to collect a good set of result links from their search.
    int numberOfPages = 10;
    WebClient webClient = new WebClient();
    // A couple of file options for HTML Tidy.
    string optFile = @"c:\foo.tidy";
    string errFile = @"c:\err.tidy";
    byte[] reqHTML;
    ArrayList theseLinks = new ArrayList();

    for (int s = 1; s <= numberOfPages; s++)
    {
        // Build each page URL from the base URL. Appending to urlString itself
        // would stack up "&page=" parameters on every pass.
        string pageUrl = urlString + "&page=" + Convert.ToString(s);
        reqHTML = webClient.DownloadData(pageUrl);
        UTF8Encoding objUTF8 = new UTF8Encoding();
        string myString = objUTF8.GetString(reqHTML);

        // HTML Tidy uses the Document class for the following settings:
        Document thisHTML = new Document();
        thisHTML.LoadConfig(optFile);
        thisHTML.SetErrorFile(errFile);
        thisHTML.ParseString(myString);
        thisHTML.CleanAndRepair();
        thisHTML.RunDiagnostics();
        thisHTML.SetOptBool(TidyOptionId.TidyForceOutput, 1);
        string fixedDoc = thisHTML.SaveString();

        // I couldn't get HTML Tidy to put in the proper namespace designator, so here is
        // a little hack: turn the default namespace declaration into an unused prefix,
        // which lets the unprefixed XPath expressions below match.
        fixedDoc = fixedDoc.Replace("xmlns=", "xmlns:html=");
        StreamWriter sw = new StreamWriter(@"c:\xml\spurl.html", false);
        sw.Write(fixedDoc);
        sw.Close();

        // These are the nodes I want from my newly created XHTML file:
        string hrefXPath = "/html/body/div[@class='results']/div[@class='spurlResLink']/a";
        string descXPath = "/html/body/div[@class='results']/div[@class='spurlResLink']/div[1]";
        string titleXPath = "/html/body/div[@class='results']/div[@class='spurlResLink']/a";

        XmlDocument myDoc = new XmlDocument();
        XmlTextReader myRdr = new XmlTextReader(@"c:\xml\spurl.html");
        myRdr.WhitespaceHandling = WhitespaceHandling.None;
        try
        {
            myDoc.Load(myRdr);

            // Put the nodes into node collections for looping through.
            XmlNodeList thisNodes = myDoc.SelectNodes(hrefXPath);
            XmlNodeList titleNodes = myDoc.SelectNodes(titleXPath);
            XmlNodeList descNodes = myDoc.SelectNodes(descXPath);

            int i = 0;
            foreach (XmlNode xn in thisNodes)
            {
                links l = new links(); // my own links class
                l.LinkUrl = xn.Attributes.GetNamedItem("href").Value;
                l.LinkId = i + 990; // just some arbitrary link id #
                if (titleNodes[i] == null || descNodes[i] == null)
                {
                    l.LinkTitle = "";
                    l.LinkDescription = "";
                }
                else
                {
                    l.LinkTitle = titleNodes[i].InnerText;
                    l.LinkDescription = descNodes[i].InnerText;
                }
                l.LinkTags = tagList;
                i++;
                theseLinks.Add(l);
            }
        }
        catch (Exception)
        {
            // Tidy could not repair this page well enough to load as XML, so skip it.
        }
        finally
        {
            // Close the reader whether or not the page loaded.
            myRdr.Close();
        }
    }

    // Bind my link collection to my DataGrid for presentation, once all pages are in.
    spurlDG.DataSource = theseLinks;
    spurlDG.DataBind();
}
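
As an alternative to the xmlns string replacement above, here is a minimal sketch that leaves the tidied XHTML alone and registers the namespace with an XmlNamespaceManager instead. It assumes Tidy emits the standard XHTML namespace (http://www.w3.org/1999/xhtml); the class name, the "h" prefix, and the sample markup are made up for illustration, and I have not run this against real Spurl pages.

```csharp
using System;
using System.Xml;

public static class XhtmlLinkDemo
{
    // Returns the href of every result link in an XHTML string, matching elements
    // in the XHTML namespace rather than rewriting the xmlns attribute.
    public static string[] GetHrefs(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);

        // "h" is an arbitrary prefix chosen for this sketch; only the URI has to match.
        XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("h", "http://www.w3.org/1999/xhtml");

        XmlNodeList anchors = doc.SelectNodes(
            "/h:html/h:body/h:div[@class='results']/h:div[@class='spurlResLink']/h:a", ns);

        string[] hrefs = new string[anchors.Count];
        for (int i = 0; i < anchors.Count; i++)
        {
            hrefs[i] = anchors[i].Attributes["href"].Value;
        }
        return hrefs;
    }

    public static void Main()
    {
        // Hypothetical sample shaped like the result markup the XPath above expects.
        string sample =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>" +
            "<div class=\"results\"><div class=\"spurlResLink\">" +
            "<a href=\"http://example.com/\">Example</a><div>A sample description</div>" +
            "</div></div></body></html>";

        foreach (string href in GetHrefs(sample))
        {
            Console.WriteLine(href);
        }
    }
}
```

The trade-off is that every step of the XPath needs the prefix, but the document itself stays untouched, so nothing else that happens to contain "xmlns=" gets rewritten by accident.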

Monday, March 14, 2005

RDF to ASP.NET Control: Step 4

Now to bind the returned DataView to your DataList (goes in your aspx page codebehind):



public void getMyDelLinks(string urlString, string transformFile)
{
    DataView dv = RDFToDataView(urlString, transformFile);

    try
    {
        myDelLinksDG.DataSource = dv;
        myDelLinksDG.DataBind();
    }
    catch (Exception goner)
    {
        Response.Write(goner.ToString());
    }
}

RDF to ASP.NET Control: Step 3

Create the code (goes in your aspx file codebehind) that will grab your RDF feed, transform it, and stream the result into a DataSet, then convert it to a DataView. The urlString variable is your RDF link, such as "http://del.icio.us/rss/tag/.NET", and stylesheet is the file name of the XSLT file from Step 1:

// Assumes System and System.Data are imported; everything else is fully qualified.
public DataView RDFToDataView(string urlString, string stylesheet)
{
    System.IO.Stream str = new System.IO.MemoryStream();
    System.Xml.XPath.XPathDocument doc;
    // Create the XslTransform.
    System.Xml.Xsl.XslTransform xslt = new System.Xml.Xsl.XslTransform();

    // Load the stylesheet that creates XML output.
    xslt.Load(System.Web.HttpContext.Current.Server.MapPath(stylesheet));

    // Load the XML data file into an XPathDocument.
    // If this is in the cache, grab it from there...
    if (System.Web.HttpContext.Current.Cache["cached" + urlString] == null)
    {
        // No cache found.
        doc = new System.Xml.XPath.XPathDocument(urlString);
        // Add to cache for 60 minutes.
        System.Web.HttpContext.Current.Cache.Add("cached" + urlString, doc,
            null, DateTime.Now.AddMinutes(60), TimeSpan.Zero,
            System.Web.Caching.CacheItemPriority.High, null);
    }
    else
    {
        // Cache found.
        doc = (System.Xml.XPath.XPathDocument)System.Web.HttpContext.Current.Cache
            ["cached" + urlString];
    }

    // Create an XmlWriter which will output to our stream.
    System.Xml.XmlWriter xw = new System.Xml.XmlTextWriter(str,
        System.Text.Encoding.UTF8);

    // Transform the feed.
    xslt.Transform(doc, null, xw, null);

    // Flush the XmlWriter and set the position of the stream to 0.
    xw.Flush();
    str.Position = 0;

    // Create a dataset to bind to the control.
    DataSet ds = new DataSet();
    ds.ReadXml(str);

    // Close the writer and thereby free the memory stream.
    xw.Close();

    // Return the first table's view, or null if the transform produced no items.
    DataView dv = null;
    if (ds.Tables.Count > 0)
    {
        dv = ds.Tables[0].DefaultView;
    }
    return dv;
}
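
To see why ds.Tables[0] and the "link", "title", "description", and "subject" bindings line up, here is a small sketch that feeds hand-written XML, shaped like the feed/item output of the Step 1 stylesheet, into DataSet.ReadXml and lists what gets inferred. The class name and sample values are invented for illustration; as far as I understand the schema inference, each repeating item element becomes a row and each child element a column.

```csharp
using System;
using System.Data;
using System.IO;

public static class FeedDataSetDemo
{
    // Loads feed-shaped XML into a DataSet, letting ReadXml infer the schema.
    public static DataSet LoadFeed(string feedXml)
    {
        DataSet ds = new DataSet();
        ds.ReadXml(new StringReader(feedXml));
        return ds;
    }

    public static void Main()
    {
        // Hypothetical sample with the same element names the stylesheet emits.
        string sample =
            "<feed>" +
            "<item><title>First</title><link>http://example.com/1</link>" +
            "<description>One</description><subject>dotnet</subject>" +
            "<pubdate>2005-03-14</pubdate></item>" +
            "<item><title>Second</title><link>http://example.com/2</link>" +
            "<description>Two</description><subject>xml</subject>" +
            "<pubdate>2005-03-15</pubdate></item>" +
            "</feed>";

        DataSet ds = LoadFeed(sample);
        DataTable items = ds.Tables[0];

        // Print the inferred table name, row count, and column names.
        Console.WriteLine(items.TableName + ": " + items.Rows.Count + " row(s)");
        foreach (DataColumn col in items.Columns)
        {
            Console.WriteLine("  column: " + col.ColumnName);
        }
    }
}
```

The column names printed here are the same strings that DataBinder.Eval looks up in the DataList template, so a renamed element in the stylesheet needs a matching change in the .aspx page.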

RDF to ASP.NET Control: Step 2

Create the DataList control (that will be in your .aspx page):




<asp:DataList ID="myDelLinksDG" Runat="server">
  <ItemTemplate>
    <table width="100%" border="0">
      <tr>
        <td valign="top" align="left" width="60%">
          <a href='<%# DataBinder.Eval(Container.DataItem, "link") %>'><%# DataBinder.Eval(Container.DataItem, "title") %></a>
        </td>
      </tr>
      <tr>
        <td width="10px"></td>
        <td><%# DataBinder.Eval(Container.DataItem, "description") %></td>
      </tr>
      <tr>
        <td></td>
        <td><%# DataBinder.Eval(Container.DataItem, "subject") %></td>
      </tr>
    </table>
  </ItemTemplate>
</asp:DataList>




RDF to ASP.NET Control: Step 1

I wanted to get my delicious links and create my own DataGrid or DataList control to put on my personal site, but I didn't want to have to download and save my del.icio.us file every time: I wanted the "live" version. To do that, you have to transform the RDF (the RSS 1.0 format your delicious links are served up in) into simpler XML, stream the result into a DataSet, and bind that DataSet to your web control (in my case, a DataList control). I gleaned most of this from Build an RSS DataList Control in ASP.NET.


RDF to RSS XSLT file:



<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:rss="http://purl.org/rss/1.0/"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<xsl:output omit-xml-declaration="yes" />
<xsl:template match="rss|/rdf:RDF">
<feed>
<xsl:apply-templates select="channel/item|/rdf:RDF/rss:item"/>
</feed>
</xsl:template>

<xsl:template match="channel/item|/rdf:RDF/rss:item">

<item>

<title>
<xsl:value-of select="title|rss:title"/>
</title>

<link>
<xsl:value-of select="link|rss:link"/>
</link>

<description>
<xsl:value-of select="description|rss:description"/>
</description>

<subject>
<xsl:value-of select="dc:subject|rss:subject"/>
</subject>
<pubdate>
<xsl:value-of select="dc:date|rss:pubdate"/>
</pubdate>
</item>
</xsl:template>

</xsl:stylesheet>

Hacking .NET

I don't consider myself a true-blue hacker in the sense that Paul Graham writes about in Great Hackers. I don't think I qualify as a hacker, since I like working with C#. Don't get me wrong: languages such as Perl and Ruby are cool. I just don't have a lot of time on my hands to dink around with them. I know C#, so that's what I "hack" with.

I began working on a personal project to build a personal bookmark manager because I was a little ticked that the del.icio.us founder/programmer doesn't want to allow users to "privatize" links or link categories. So I built my own front end for delicious, which has now turned into a full-grown project that uses APIs to search for additional bookmarks. I figure, why stop with delicious? Why not use the "semantic web" to search for a multitude of links that I may need? So my personal bookmark manager finds related links through Google and Technorati, and right now I'm researching how to stream HTML and transform it to XML for those sites that have great search tools but no API or RSS feed.

I'm learning a lot. As it turns out, this research has been a great way to learn about applying XML/XSLT to a new module I'm building for a system at work. Instead of hitting the database for every update, I'm thinking of putting the changes in an XML file for that user and, when they are completely done, pushing the XML file to the database in one update. It seems like it will be a LOT faster for the user.
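
Here is a rough sketch of that per-user XML file idea, just to show the shape of it. Everything in it is hypothetical: the PendingChanges class, the changes/change element names, and the file location are made up, and the database push is left as a comment because it depends entirely on the system at work.

```csharp
using System;
using System.IO;
using System.Xml;

public static class PendingChanges
{
    // Appends one field change to the user's pending-changes file,
    // creating the file with a <changes> root on first use.
    public static void Record(string path, string field, string value)
    {
        XmlDocument doc = new XmlDocument();
        if (File.Exists(path))
        {
            doc.Load(path);
        }
        else
        {
            doc.AppendChild(doc.CreateElement("changes"));
        }

        XmlElement change = doc.CreateElement("change");
        change.SetAttribute("field", field);
        change.InnerText = value;
        doc.DocumentElement.AppendChild(change);
        doc.Save(path);
    }

    // Counts the pending changes and deletes the file. A real system would
    // send the changes to the database in one batch before deleting.
    public static int Flush(string path)
    {
        if (!File.Exists(path))
        {
            return 0;
        }
        XmlDocument doc = new XmlDocument();
        doc.Load(path);
        int count = doc.SelectNodes("/changes/change").Count;
        File.Delete(path);
        return count;
    }

    public static void Main()
    {
        // One file per user; a random temp file name stands in for a user id here.
        string path = Path.Combine(Path.GetTempPath(),
            "pending-" + Guid.NewGuid().ToString() + ".xml");
        Record(path, "title", "My bookmark");
        Record(path, "tags", "dotnet xml");
        Console.WriteLine(Flush(path) + " change(s) ready to push");
    }
}
```

Each edit only touches a small local file, and the database sees a single write when the user is done, which is where the speed-up would come from if the idea pans out.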