Jon Garfunkel
July 7, 2006
Daily Kos is today the most popular online community devoted to Democratic politics, in terms of the number of readers, contributors, and media mentions of late. For the last two years the YearlyKos convention has attracted thousands of community members and several aspiring candidates seeking their attention. This is all the more remarkable considering that just four years ago, it started as the effort of one man with no prior experience in politics, Markos Moulitsas Zuniga (“Kos”).
Kos began writing a weblog using MovableType software on May 26, 2002, at his domain fishyshark.com. By fall Kos started focussing on the mid-term elections. In October he split off fishyshark.com for his personal writings as an expectant father, and moved the political writings to the domain dailykos.com. At this time each post began attracting dozens of comments. In January, Kos started tapping a few community members to sub in for him at times, in order to keep the daily feed going. The coverage turned to the impending Iraq war. By summer Kos began turning his attention towards the Democratic primary race. By now the posts were drawing around a hundred comments each, and the MovableType platform was starting to reach its limits. On July 1st Kos asked his community for donations towards a new server. The new server would use the Scoop platform, allowing each community member to publish on their own. It was launched in mid-October 2003.
The original 2,500+ MovableType posts were archived at www.dailykos.net. In all the ensuing time since this first stage, there has been no public effort, that I can locate, to index that early content into a browseable catalog. The objective of this project was to meet that need. In doing so, I hope I can provide a window into the evolution of the site from a single person's soapbox into a veritable community.
As for the existing archive site, it is in fact indexed by month and by topic. The existing monthly index archive contains each post whole; for example, see the initial fourteen posts from May 2002. This is effectively unnavigable. Compare this with the headline-based display of May 2002 here. The topical archives are headline-based, but they are still reverse-chronological and omit the author and other potentially useful metadata. Compare the existing index for topic “media” with this new index.
Reconstructing a catalog is not as easy as it should be. Web content descriptor standards like RDF (Resource Description Framework) and DC (the Dublin Core Metadata Initiative) have been underway for several years now, but their popular usage (by publishing tools such as MovableType) has been towards syndicating the recent content, and not in cataloguing the whole of it.
To be able to properly catalog Daily Kos, I had to first download all of the content. There were no legal hazards in the way (there are often not, but this should be checked); in Kos's case his copyright notice clearly stated “Steal all you want. (For non-commercial use, that is.)” The downloading was trivial. By default, MovableType indexes each post sequentially, so I downloaded posts #000001 to #004569, 22.7 MB worth. This yielded 2,603 files. Of those, 33 were discarded once I determined that they were rough drafts and never published.
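The sequential download can be sketched roughly as follows. This is a minimal Python illustration, not the procedure actually used: the URL template and the function names are hypothetical, since the archive's real path layout is not given here. Only the six-digit zero-padded numbering and the range 1 to 4569 come from the text above.

```python
from urllib.error import URLError
from urllib.request import urlopen

# Hypothetical URL template: the real archive path layout is an assumption.
URL_TEMPLATE = "http://www.dailykos.net/archives/{post_id}.html"


def post_ids(first=1, last=4569):
    """Yield six-digit, zero-padded post IDs such as '000001'."""
    for n in range(first, last + 1):
        yield f"{n:06d}"


def fetch_all(first=1, last=4569, fetch=None):
    """Return {post_id: page_bytes} for every ID that resolves to a page."""
    # The fetch function is injectable so the loop can be tried offline.
    fetch = fetch or (lambda url: urlopen(url, timeout=30).read())
    pages = {}
    for post_id in post_ids(first, last):
        try:
            pages[post_id] = fetch(URL_TEMPLATE.format(post_id=post_id))
        except URLError:
            # Not every number in the range has a page
            # (4,569 IDs produced only 2,603 files), so gaps are skipped.
            continue
    return pages
```

Because the IDs are sequential, no crawling or link-following is needed; the loop simply tries each number and keeps whatever comes back.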
Of the 2,570 posts, only 26 contain RDF metadata; why the coverage is so spotty I may yet learn. For the moment, I had to write a Perl script (now over a thousand lines) to extract the data from wherever it was in the page. The script performs three steps: (1) it extracts the title, date, author, word count, number of comments, and number of hyperlinks; (2) it makes a “guess” as to the topic, based on 300 keywords; and (3) it prepares CSV and RDF (Atom) files, as well as report summary pages for each topic and each month.
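To give a flavor of the first two steps, here is a small sketch. It is written in Python for brevity and is not the Perl script itself; the keyword table, the regular expressions, and the function names are all illustrative assumptions. In particular, the three short keyword lists below merely stand in for the roughly 300 keywords mentioned above.

```python
import re
from html import unescape

# Illustrative keyword table only; the real script used about 300 keywords.
TOPIC_KEYWORDS = {
    "war": ["iraq", "baghdad", "troops"],
    "media": ["press", "newspaper", "pundit"],
    "elections": ["senate race", "poll", "primary"],
}

TAG_RE = re.compile(r"<[^>]+>")
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)
LINK_RE = re.compile(r"<a\s[^>]*href=", re.IGNORECASE)


def extract(html):
    """Pull a few fields out of a page that carries no usable metadata."""
    match = TITLE_RE.search(html)
    title = unescape(match.group(1)).strip() if match else ""
    text = unescape(TAG_RE.sub(" ", html))  # crude tag stripping
    return {
        "title": title,
        "word_count": len(text.split()),  # counts every word, title included
        "links": len(LINK_RE.findall(html)),
        "text": text,
    }


def guess_topic(text, default="misc"):
    """Pick the topic whose keywords occur most often; fall back to default."""
    lowered = text.lower()
    # Plain substring counting: "press" would also match inside "express".
    scores = {
        topic: sum(lowered.count(keyword) for keyword in keywords)
        for topic, keywords in TOPIC_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default
```

A guess of this kind is cheap to compute but only as good as its keyword list, which is why the results still need a manual review.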
(It would have been of superior design to generate the summary pages from the RDF/Atom file-- thus it could be reused for any catalog indexed in RDF. Such a program could also generate reports based on author, and support paging, etc. I have also left code in to extract hyperlinks in order to do a link analysis.)
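A sketch of that alternative design, reading a feed and grouping headlines by month, might look like the following. It assumes the feed is Atom 1.0 (namespace http://www.w3.org/2005/Atom) with ISO-style dates; the files produced for this project may use a different Atom or RDF vocabulary, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumption: Atom 1.0 namespace. Older Atom drafts use a different one.
ATOM = "{http://www.w3.org/2005/Atom}"


def headlines_by_month(atom_xml):
    """Group (date, title, author) tuples under 'YYYY-MM' keys, oldest first."""
    root = ET.fromstring(atom_xml)
    months = defaultdict(list)
    for entry in root.iter(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        # 'published' is optional in Atom 1.0, so fall back to 'updated'.
        date = (entry.findtext(ATOM + "published", default="")
                or entry.findtext(ATOM + "updated", default=""))
        author = entry.findtext(f"{ATOM}author/{ATOM}name", default="")
        months[date[:7]].append((date, title, author))
    # Chronological order, unlike the reverse-chronological topical archives.
    return {month: sorted(items) for month, items in sorted(months.items())}
```

The same grouping function would work on any catalog published in this format, and grouping by author instead of month would be a one-line change to the key.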
I made several alterations to the topics. One, I separated out site updates from misc. Two, I split elections into distinct categories for congressional, state/local, California, and the Democratic Presidential primary. In addition, it was necessary to discern a category for Democratic party. Lastly, Kos's category for the War on Terror was dropped in favor of one simply for War. While this no doubt would please the Vice President, it is more helpful to keep these categories separate.
All told, I have spent over 40 hours over several days preparing this data. It's not perfect. The word count merely counts the spaces between words, and thus would be inflated for the few pages which include HTML tables.
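One plausible way such inflation arises is shown below. This is an assumption about the mechanism, not a description of the actual script: if whitespace-separated tokens are counted while table markup is still present, each tag on its own line counts as a word. The function names are hypothetical.

```python
import re


def naive_word_count(html):
    """Count whitespace-separated tokens, markup included."""
    return len(html.split())


def stripped_word_count(html):
    """Count tokens after replacing every tag with a space."""
    return len(re.sub(r"<[^>]+>", " ", html).split())


# A tiny made-up table with two real words in it.
sample = "<table>\n <tr>\n  <td>Bush</td>\n  <td>48</td>\n </tr>\n</table>"
print(naive_word_count(sample), stripped_word_count(sample))
```

For ordinary paragraphs the two counts stay close, which is consistent with only the few table-heavy pages being affected.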
Much information is missing. I have not coded the genre of each post-- original essays vs. the newspaper-clipping most commonly seen in blogs. Moreover, I have not had the opportunity to review the comments. Furthermore, Kos did not post in a vacuum; the other political blogs (a smaller community then!) surely made reference to him. This data may well be buried in the bowels of Google and Technorati, but it is practically irretrievable at this time.
Any proper analysis of Daily Kos as it was should also use the Internet Archive to see how it was at the time-- particularly what the front page looked like.
I have reviewed all of the summary pages, and most of the topics, and I have reviewed perhaps a quarter of the posts. It is possible that as many as 2% of the posts are mis-categorized. It is sufficient for my purposes. As a definitive reference for research purposes, I would implore the researcher to review this in whole; I have no formal training as a cataloguer.
I am not a regular member of the Daily Kos community, nor do I have any affiliation with any research institution. I do invite any institution interested in this data to adopt it.
Lastly, as fun as this work has been, it should be unnecessary. Online publishing tools should by now have the capability to make the content natural to browse. I hope that my work here encourages more publishers to be proactive in cataloguing their content and using forms like I have designed for simple cross-navigation.