Hexenbracken vs Kraal - A Quick Data Analysis

Monday, April 8, 2013

Fully realizing I missed the point of The Phantom Tollbooth, I've got to admit that I've always liked Dictionopolis much more than Digitopolis. Lists of words consistently get my mind running towards ideas, while lists of numbers get me running away from them. Because of this, I dabble a bit with data visualization to aggregate word lists out of large chunks of text, just to see what shows up. So I hit +Zak Smith's wonderfully orchestrated (and wonderfully fun) Hexenbracken and Kraal with some data viz sticks and turned all 1144 hex details into two pictures.

The Hexenbracken
The Kraal
So... what the fuck is going on in these pictures? In short, the more frequently a word occurs in the hex descriptions of the Hexenbracken and the Kraal, the larger the word appears in the word cloud. In the Hexenbracken, the word "WATER" appears in 51 unique hexes. In the Kraal, the word "ICE" appears in 79 unique hexes.

Because word clouds only show the number of times a word appears in the text, you don't ever want to use the data if it's too raw. In the Kraal's raw data, for example, the word ice appears ~107 times across 79 hexes. When trying to understand the themes of the map as a whole, the number of hexes containing the word ice is more "valuable" than the total number of times the word ice appears in the whole document. To account for this, the data you're seeing in the word clouds above has been cleaned a number of times (detailed below).
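To make the difference between the two counts concrete, here's a minimal Python sketch. The three hex descriptions are made up for illustration (they are not the real Kraal data), and `tokenize` is just a helper name I picked.

```python
import re
from collections import Counter

# Hypothetical hex descriptions, one string per hex (not the real Kraal data).
hexes = [
    "Ice bridge over a river of ice.",
    "A frozen tower, sealed in ice.",
    "Wolves hunt here.",
]

def tokenize(text):
    """Lowercase the text and pull out runs of letters, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())

total_counts = Counter()  # every occurrence of a word in the whole document
hex_counts = Counter()    # number of hexes that contain the word at least once

for description in hexes:
    words = tokenize(description)
    total_counts.update(words)
    hex_counts.update(set(words))  # set() makes each word count once per hex

print(total_counts["ice"])  # 3
print(hex_counts["ice"])    # 2
```

The second number is the one the word clouds are built on: a hex that says "ice" five times still only gets one vote.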

Additionally, a blacklist of common English words (he, she, it, they, large, small, etc.) was applied to the data. The blacklist contains 309 words and has to be handled carefully. For example, in Hexenbracken I blacklisted the word giant (since it tended to refer to size), but left it in the Kraal (since it was referring to creatures).
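One way to keep the per-map judgment calls straight is a shared list plus a small set of extras for each map. This is only a sketch: `COMMON_WORDS` is a tiny stand-in for the 309-word list, and the names `EXTRA_BLACKLIST` and `filter_words` are made up for the example.

```python
# Hypothetical stand-in for the 309-word list; only a few entries are shown.
COMMON_WORDS = {"he", "she", "it", "they", "large", "small", "the", "a", "of"}

# Per-map additions: "giant" is a size word in one map and a creature in the other.
EXTRA_BLACKLIST = {
    "hexenbracken": {"giant"},
    "kraal": set(),
}

def filter_words(words, map_name):
    """Return the words that survive the blacklist for the given map."""
    blacklist = COMMON_WORDS | EXTRA_BLACKLIST[map_name]
    return [w for w in words if w.lower() not in blacklist]

words = ["the", "giant", "sleeps", "beneath", "a", "large", "glacier"]
print(filter_words(words, "hexenbracken"))  # ['sleeps', 'beneath', 'glacier']
print(filter_words(words, "kraal"))         # ['giant', 'sleeps', 'beneath', 'glacier']
```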

On the one hand, I find these word clouds to be great for inspiration, and on the other I think they do an excellent job of quickly identifying the overarching themes, feeling, and flavor found in each hexmap. Why pay to have someone write up a blurb about your product when you can just aggregate out all the commonalities in the work and see if your themes are evocative enough to engage the audience on their own?

Tools Used:

Process Used (pretty sloppy I know, but it works and if anyone wants to laugh and show me a better way to do this, I'd <3 you a long time):
  1. Acquire RAW DATA (Hexenbracken) (The Kraal) ensuring each description is on its own line
  2. Toss it into Notepad++
  3. Find/Replace all the symbols in the text with spaces
  4. Import the data into Excel and remove the duplicate words from each hex description. This way if "giant" was used multiple times in a single hex it would only be counted a single time. Plural and other forms remained (e.g., "giant" and "giants") but likewise were only counted once.
  5. Throw it into the WordCloud Generator to remind myself I was making progress, and work on the blacklist (discovered I needed to blacklist the words "hex", "hexes" and "their", which weren't on the "common English words list" I was using).
  6. Use Notepad++ to find/replace all the common English words with spaces (I really need to figure out how to script this part). This was done because wordle.net uses its own blacklist for common words and, not knowing what it is, I needed to remove the words manually because wordle didn't blacklist as much as was needed.
  7. Throw it into wordle and tweak the font/layout/colors
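For anyone who'd rather script steps 3 through 6 than click through Notepad++ and Excel, here's a rough Python sketch. It is not what was actually used for the pictures above. The sample lines and blacklist are invented, the function names are mine, and it assumes digits (such as hex numbers) should be treated like symbols.

```python
import re
from collections import Counter

def clean_hex(line, blacklist):
    """Steps 3, 4 and 6 for one hex: strip symbols, dedupe, drop blacklisted words."""
    # Assumption: anything that is not a letter or whitespace becomes a space.
    text = re.sub(r"[^A-Za-z\s]", " ", line).lower()
    unique_words = set(text.split())  # each word counts once per hex
    return sorted(unique_words - blacklist)

def build_cloud_text(lines, blacklist):
    """Return (cleaned text to paste into a word cloud tool, per-word hex counts)."""
    counts = Counter()
    cleaned_lines = []
    for line in lines:
        words = clean_hex(line, blacklist)
        if words:
            counts.update(words)
            cleaned_lines.append(" ".join(words))
    return "\n".join(cleaned_lines), counts

# Hypothetical sample; real data would be read from a file, one hex per line.
sample_lines = [
    "0101: A giant's tower; the giant sleeps.",
    "0102: The tower bridge (ruined).",
]
# "s" is listed because replacing the apostrophe in "giant's" leaves a stray "s".
sample_blacklist = {"a", "the", "s"}

cloud_text, counts = build_cloud_text(sample_lines, sample_blacklist)
print(cloud_text)
print(counts.most_common(3))
```

Because every word shows up at most once per line in `cloud_text`, pasting it into a generator that sizes words by raw frequency gives you sizes based on the number of hexes, which is the whole point of the cleanup.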
Thanks to +Zak Smith for organizing this dog and pony show and +Random Wizard and +Ramanan Sivaranjan for packaging up all the data so beautifully!