August 08, 2016

Waffling around: square pie charts in R

August 08, 2016/ Harry Caufield

Sometimes, when visualizing data, the most obvious conclusion is "some parts of this set are larger than others." It's tempting to use a pie chart for such a purpose, and why not? People like pie charts and I suspect these are the top reasons why:

1. Pie charts are obviously charts. It's easy to see a pie chart and immediately realize it's attempting to summarize some kind of data.

2. Pie charts lend themselves to pastry-based humor.

3. Pie charts are colorful by necessity. Many charts can be monochromatic but a pie chart without chromatically-unique slices is just a circle.

Otherwise, pie charts aren't too useful. They're difficult to parse and may even be inherently misleading. I won't rant about them here; other folks have done that for me in articles with titles like "The Worst Chart in the World" and "Pie Charts Are Bad".

Let's try an alternative strategy with the 'waffle' package in R. It makes waffle charts. These are essentially square pie charts, but instead of wedges of a circle, groups are represented by sets of squares (or even other shapes, if you want to get fancy).

There's some very helpful documentation and examples on its GitHub page. I'll try to go a bit beyond those examples here but I'll also assume you're an R novice.

You'll need ggplot2 so install it if you haven't already, then install waffle:

install.packages("ggplot2")
install.packages("waffle")
library(ggplot2)
library(waffle)

Now let's set up some example data - in this case, counts of the nodes in the NCBI Taxonomy database.

tax_count <- c(`Archaea and Viruses (4,271)`= 4271, `Bacteria (21,345)`= 21345, `Eukaryota (470,122)`= 470122, `Fungi (41,952)`= 41952, `Metazoa (255,771)`= 255771, `Viridiplantae (156,967)`= 156967)

(I've combined Archaea and Viruses here as their individual counts are much smaller than the others.)
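If typing the counts into the labels by hand feels error-prone, the labels can be built from the numbers instead. This is only a sketch in base R: raw_counts, pretty_counts and labelled_counts are names made up for illustration, and the result should match the tax_count vector above.

```r
# Raw node counts, copied from the tax_count vector above.
raw_counts <- c(
  "Archaea and Viruses" = 4271,
  "Bacteria"            = 21345,
  "Eukaryota"           = 470122,
  "Fungi"               = 41952,
  "Metazoa"             = 255771,
  "Viridiplantae"       = 156967
)

# big.mark adds the thousands separators; trim = TRUE drops the padding
# that format() would otherwise add to line the numbers up.
pretty_counts <- format(raw_counts, big.mark = ",", trim = TRUE)

# Build "Group (count)" labels and attach them as the vector's names.
labelled_counts <- raw_counts
names(labelled_counts) <- paste0(names(raw_counts), " (", pretty_counts, ")")

print(names(labelled_counts))
```

The labels then stay in sync with the data if the counts are updated later.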

Now let's waffle:

waffle(tax_count/1000, rows=20, size=0.5, colors=c("#cc0000", "#ff9900", "#ff6699", "#6699ff", "#006666", "#33cc33"), title="All NCBI Taxonomy Nodes by Kingdom", xlab="1 square is 1,000 nodes.")

We have a colorful waffle now and can play around with the colors some more with RColorBrewer:

install.packages("RColorBrewer")
library(RColorBrewer)
waffle(tax_count/1000, rows=20, size=0.5, colors=brewer.pal(6,"Set1"), title="All NCBI Taxonomy Nodes by Kingdom", xlab="1 square is 1,000 nodes.")

We can also compress this waffle:

waffle(tax_count/4000, rows=9, size=0.5, colors=brewer.pal(6,"Set1"), title="All NCBI Taxonomy Nodes by Kingdom", xlab="1 square is 4,000 nodes.")
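To see what changing the divisor does, here's a rough sketch in base R of how many squares each group ends up with at the two scales used above. It assumes fractional counts are simply truncated, which is how base R's rep() treats a fractional repeat count; waffle's exact rounding may differ between versions. squares_at is a hypothetical helper, not part of waffle.

```r
# Same counts as tax_count, with shorter names for readability.
counts <- c(
  "Archaea and Viruses" = 4271,
  "Bacteria"            = 21345,
  "Eukaryota"           = 470122,
  "Fungi"               = 41952,
  "Metazoa"             = 255771,
  "Viridiplantae"       = 156967
)

# Hypothetical helper: whole squares per group at a given scale.
# ASSUMPTION: fractions are dropped (truncated), as rep() would do.
squares_at <- function(counts, per_square) {
  floor(counts / per_square)
}

squares_1000 <- squares_at(counts, 1000)
squares_4000 <- squares_at(counts, 4000)

# Side-by-side comparison of the two scales.
print(cbind(per_1000 = squares_1000, per_4000 = squares_4000))

# Total number of filled squares at each scale.
print(c(per_1000 = sum(squares_1000), per_4000 = sum(squares_4000)))
```

Under that assumption, the smallest group drops to a single square at 4,000 nodes per square, so going much coarser would start to hide it entirely.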

Let's resize some of that text. This is a ggplot2 object, so ggplot2 options apply.

waffle(tax_count/4000, rows=9, size=0.5, colors=brewer.pal(6,"Set1"), title="All NCBI Taxonomy Nodes by Kingdom", xlab="1 square is 4,000 nodes.") + theme(axis.title.x=element_text(size = 16), text = element_text(size = 16))

There's also the option to use glyphs instead of squares by using the FontAwesome set. The extrafont package enables non-standard fonts to be used but you'll likely have to make it aware of the FontAwesome file as well.

install.packages("extrafont")
library(extrafont)

fa <- tempfile(fileext = ".ttf")
download.file("http://maxcdn.bootstrapcdn.com/font-awesome/4.3.0/fonts/fontawesome-webfont.ttf?v=4.3.0",
destfile = fa, method = "curl")
font_import(paths = dirname(fa), prompt = FALSE)

Use fa_list() to see the glyph options. Here's one with stars - note that I changed the row number to avoid overlap:

waffle(tax_count/4000, rows=12, size=0.5, colors=brewer.pal(6,"Set1"), title="All NCBI Taxonomy Nodes by Kingdom", xlab="1 square is 4,000 nodes.", use_glyph="star") + theme(axis.title.x=element_text(size = 16), text = element_text(size = 16))

Here's another example with some different data for the sake of variety:

pet_count <- c(`Cats (512)`= 512, `Dogs (903)`= 903, `Conifers (3,023)`= 3023)

waffle(pet_count/100, rows=3, size=0.5, colors=brewer.pal(3,"Set1"), title="Pets") + theme(text = element_text(size = 16)) + geom_label(label="100", size = 3)

There's an extra label on the empty square, unfortunately. This may not be a problem for you.
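The stray label probably comes down to grid arithmetic. Here's a sketch in base R, assuming the counts are truncated to whole squares and that the grid is padded with empty cells to fill out the last column. The variable names are made up for illustration and none of this calls waffle itself.

```r
pet_count <- c("Cats (512)" = 512, "Dogs (903)" = 903, "Conifers (3,023)" = 3023)
n_rows <- 3

# ASSUMPTION: fractional squares are dropped, as rep() would do.
squares <- floor(pet_count / 100)

n_filled <- sum(squares)                  # squares that represent data
n_cols   <- ceiling(n_filled / n_rows)    # columns needed to hold them
n_cells  <- n_cols * n_rows               # every cell in the full grid
n_empty  <- n_cells - n_filled            # padding cells with no data

print(c(filled = n_filled, columns = n_cols, cells = n_cells, empty = n_empty))

# Row counts (from 1 to 10) that would leave no padding cells at all.
print(which(n_filled %% 1:10 == 0))
```

If the padding cell is what picks up the extra label, choosing a row count that divides the number of filled squares evenly may be one way around it.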

We can use geom_label to add Unicode characters as glyphs, too:

waffle(pet_count/100, rows=3, size=5.5, colors=brewer.pal(3,"Set1"), title="Pets") + theme(text = element_text(size = 16)) + geom_label(label=sprintf("\u0394"), size = 8)

The waffle package isn't the only option for waffle plots. It's one of the easiest to use, though, especially if you're already familiar with ggplot.

I've tried out Ruben Arslan's formr package for waffle plots as well. It produces plots like this:

# qplot_waffle_tile() comes from the formr package, so load it first
library(formr)
pets_list <- c(rep("Cats", 5), rep("Dogs", 9), rep("Conifers", 30))

qplot_waffle_tile(pets_list) + theme(text = element_text(size = 28))
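The input here is one element per square rather than a vector of counts. A vector like pets_list can also be built from counts in a single rep() call. This is just a base R sketch, and squares and pets_list_alt are names made up for illustration.

```r
# Squares per group, matching the rep() calls above.
squares <- c(Cats = 5, Dogs = 9, Conifers = 30)

# rep() with a vector of times repeats each name by its own count.
pets_list_alt <- rep(names(squares), times = squares)

# Tabulate to confirm the counts survived the round trip.
print(table(pets_list_alt))
```

This keeps the counts in one place if more groups are added later.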

For another example of waffle plots in R, see the following: GitHub-style waffle plots in R.

July 01, 2016

Three papers about sharing, even if it's just sharing data

July 01, 2016/ Harry Caufield

If you search Wikimedia Commons for 'sharing' the most interesting results will involve milkshakes. This one is here.

Here are a few recent papers I found interesting:

1. A Commensal Bacterium Promotes Virulence of an Opportunistic Pathogen via Cross-Respiration.

Streptococcus gordonii is a commensal bacterial species in the human mouth. Aggregatibacter actinomycetemcomitans is an opportunistic pathogen, but not on its own: S. gordonii does something in the oral environment to allow A. actinomycetemcomitans to shift from anaerobic to aerobic growth, rendering it a more potent pathogen.

Citation: Stacy A, Fleming D, Lamont RJ, Rumbaugh KP, Whiteley M. A Commensal Bacterium Promotes Virulence of an Opportunistic Pathogen via Cross-Respiration. mBio. American Society for Microbiology; 2016;7: e00782–16. doi:10.1128/mBio.00782-16.

2. A crowdsourcing approach for reusing and meta-analyzing gene expression data.

OMiCC is an interface for the NCBI Gene Expression Omnibus (GEO) designed for easier comparative analyses. It looks like it's limited to a set of curated human and mouse studies for now but could be useful for combing through voluminous gene expression data sets. The OMiCC site is here.

Citation: Shah N, Guo Y, Wendelsdorf KV, Lu Y, Sparks R, Tsang JS. A crowdsourcing approach for reusing and meta-analyzing gene expression data. Nat Biotechnol. Nature Publishing Group; 2016; doi:10.1038/nbt.3603.

3. Goldilocks: a tool for identifying genomic regions that are ‘just right’

A small genome analysis toolkit. It's on GitHub. It doesn't try to do more than is necessary and I really like that. I may also be the target market for bioinformatics tools without alphabet soup names like CHWRtn.

See also: the first author's blog. He seems like a cool guy.

Citation: Nicholls SM, Clare A, Randall JC. Goldilocks: a tool for identifying genomic regions that are “just right.” Bioinformatics. Oxford University Press; 2016;32: 2047–2049. doi:10.1093/bioinformatics/btw116.

June 23, 2016

Microbiology in Boston and visualizing data everywhere else

June 23, 2016/ Harry Caufield

I attended the ASM Microbe 2016 meeting in Boston this past weekend. It was my first big, national convention, and as a first-time attendee, the experience was overwhelming at times. It's a perpetual buffet of new material, new connections, new approaches, and free pens.

A sunny day outside the convention center. In a near-future dystopia, convention attendees could be tracked across the city by searching surveillance feeds for their distinctive swag bags.

Though I genuinely enjoyed seeing hundreds of research posters and talking with the researchers behind them, the event I found most relevant to long-term career choices was a session on "Unique Perspectives on Science Communication". Speakers included genomics researcher and Canadian TV host Jennifer Gardy, journalist Maryn McKenna, and data artist Jer Thorp. I don't usually feel inspired by short talks - at least a couple of these speakers have given TED talks of various types before - but this session helped me realize something I'm passionate about: extracting something visually meaningful from otherwise impenetrable data sets.

I believe any field of science can benefit from more effective communication. Too often, researchers - and I'm counting myself in this group - assume that our results must be relevant because our data sets are large, or barring that, we simply expect our results to be self-explanatory. I don't think these expectations are helpful: novel results and conclusions require novel approaches to communication. I'm not saying that every scientific paper needs to have its own Snapchat account but I am saying that incomprehensible results can ruin even the most carefully-designed project.

So, with that in mind, my posts here will likely take on more of a data visualization flavor. I'll provide some simple tutorials and examples of approaches I've found helpful. I hope someone else finds it helpful, too.

June 10, 2016

Hype in the looking glass

June 10, 2016/ Harry Caufield

I like reading reviews of bad movies. Or, more specifically, I like reading reviews in which the movie is clearly subpar, the reviewer is obviously incensed, and the reader can't help but learn from the entire experience (even if they aren't a filmmaker, at least if they're me, as I haven't tried to make a film since high school).

This is a prism rather than a looking glass. Someday, a generation of children will experience prisms for the first time without hearing a single reference to Pink Floyd. From Wikimedia Commons.

Matt Zoller Seitz's review of the recent Alice in Wonderland film is an ideal, educational example. Don't worry, I'm going to connect this whole thing with science shortly. His ending rant stayed with me:

“Every now and then, people ask if films ever offend me. Of course they do. They offend me because their world view is fashionably cynical. They offend me because their racial or sexual politics are glib and crude or because they flatter their target audience’s fantasies about themselves instead of challenging them. They offend me because they swagger about trafficking in “edgy” violence that’s not abstractly beautiful, mythologically rich, or psychologically complex, but merely opportunistic and cruel.

But the most offensive kind of film is one that spends an enormous amount of money yet seems to have nothing on its mind but money. You give it, they take it. And you get nothing in return but assurances that you’re seeing magic and wonder. The movie keeps repeating it in your ear, and flashing it onscreen in big block letters: MAGIC AND WONDER. MAGIC AND WONDER ...

How many small- or medium-sized films were never funded or released because the entire Hollywood studio apparatus has devoted itself to churning out listless fantasies that are machine-tooled for maximum repeatability and exploitability while claiming to be magical and wonderful?

This is not artistry. It’s con artistry.”

I occasionally worry about science working this way. While the budget of even the smallest indie film could fund some labs for years,* the same general idea applies: budget size does not correlate with quality, but with a large enough budget, the combination of loud marketing and sheer sensory overstimulation can look like something better than quality. It can appear new and exciting. It can transform half-baked ideas into seemingly revolutionary concepts.

It's how a new approach to drawing blood couldn't fail because billions of dollars can't be wrong. They were wrong. Very wrong.

It's how CRISPR is so transformative that there isn't a genetic issue it can't be used to address.

I'm overjoyed that such technologies exist and they need funding to thrive. I'm not questioning that. It's simply worrisome when the marketing becomes the product.

Or maybe I just have a problem with TED talks. That's probably it.

*The canonical example of a success in this context might be The Blair Witch Project, with hundreds of millions of dollars in profits born from its comparatively paltry $600,000 production cost. Primer is also a good example.

May 20, 2016

Randomly generated "science"

May 20, 2016/ Harry Caufield

I have a new Twitter bot - it's named Talk About Science. It randomly generates bits of technical language using keywords sourced from a variety of disciplines. I have it generate potential publication titles as well. It's fun, didn't take very long to make, and even comes up with a fun idea every so often, a bit like the Bored George Church account (though that one's not a bot, unless it's a very advanced AI).

Some recent examples:

Salmonella enterica requires heuristic pluripotency with vertical capacitance.
You fool! I found the multivariant lepidopteran.
OpenUFQRJ v2: a C library for cellular crRNA
The "galactic frenulum" (GF)
Hamlet's Reproduction.
biomimetic SaaS

At the very least, if you're writing some bad science fiction* and need a scientist to have some dialogue, follow this bot for some starter material.

J. Harry Caufield
