Here’s a question for the #hivemind: What do people use for server-side processing of HTML? I generally use BeautifulSoup with Python, but I’m about to start a new project and am curious to see what else is out there. Java, Python, and Go are potential language choices for me (most of this will run on Google App Engine). Any suggestions?
Are you trying to parse potentially complex HTML documents? Or is this just for simple markup?
I don’t expect huge complexity, or very many buggy pages in general.
Okay, so you’re parsing real HTML. In that case you want lxml (on Python) which is based on libxml2.
Also, Python 3 has HTMLParser as part of the standard library (in the html.parser module), and BeautifulSoup can use it as its underlying parser.
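For simple markup, the standard library parser may be enough on its own. Here is a minimal sketch, assuming all you need is the link targets from a page; `LinkCollector` and `extract_links` are made-up names for illustration, not part of any library:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs, where value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return the href values of all <a> tags in html_text, in order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

Calling `extract_links(page_source)` gives back a plain list of strings. For messier pages, lxml or BeautifulSoup will likely cope better, since this parser only reports tags as it sees them and does not build a tree.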
Looks interesting – Thanks!
Go? I thought that was dead…
Not really, but it doesn’t have much uptake that I know of outside of GAE. I run into it here and there and it’s interesting, so I’m not against using it for something small to see how it, you know, goes.
that’s kind of what I meant 🙂