Mutating: The Results
Dec 2nd, 2008 | By Jonathan Golob | Category: Dear Science Column
I asked you to help me with an experiment as a follow-up to a recent column on mutation.
Here are the results:
Ultimately, I decided not to filter out the noise comments (including my own) that weren’t attempts to copy the original. Almost all of these clustered together in the green block.
The attempts that riffed off the original–like Fnarf’s and Urgutha Forka’s–clustered together as well in the blue blocks.
My original paragraph was slotted in as comment zero, located in the dendrogram as the left-most leaf in the red block. All of the legitimate attempts to copy the paragraph ended up clustered together in the red block.
A few cool mutations emerged. My original:
CCR is short for chemokine receptor. Chemokines and chemokine receptors allow the cells in your immune system to speak to one another; their epic fight against invaders is like a game of Marco Polo. CCR5 is the chemokine receptor found on macrophages—the gobbling-up cells at the front line of your immune system.
And one of the copies:
CCR is short for chemokine receptor. Chemokines and chemokine receptors allow the cells in your immune system to speak to one another; their epic fight against invaders is like a game of Marco Polo. CCR5 is a the chemokine receptor found on macrphages–the gobbling-up cells at the front line of your immune system.
As with most mutations that arise when DNA is copied, the differences in the copies didn’t really change the meaning, only a few small details of how the text was written or punctuated.
See any others?
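One way to hunt for more is to diff a copy against the original word by word. The following is a minimal sketch using Python's standard `difflib` module; the helper name `word_mutations` is hypothetical, and the two strings are just the final sentence of each paragraph above, trimmed for brevity.

```python
import difflib

# Last sentence of the original and of the mutated copy, shortened.
original = "CCR5 is the chemokine receptor found on macrophages"
copied = "CCR5 is a the chemokine receptor found on macrphages"


def word_mutations(a, b):
    """Return (operation, original_words, copied_words) for each span that differs."""
    a_words, b_words = a.split(), b.split()
    matcher = difflib.SequenceMatcher(None, a_words, b_words)
    return [
        (tag, a_words[i1:i2], b_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only insertions, deletions and replacements
    ]


for mutation in word_mutations(original, copied):
    print(mutation)
```

Each tuple names the kind of change (insert, delete or replace) along with the words involved, which loosely mirrors the insertions and substitutions seen when DNA is copied.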
Here’s the Python code it took to make the dendrogram:
from HTMLParser import HTMLParser
import string
import sys
import os.path
import copy
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from hcluster import pdist, linkage, dendrogram, totree
import numpy

class Spider(HTMLParser):
    def __init__(self, file, inlist):
        HTMLParser.__init__(self)
        self.inComment = False
        self.tempComment = []
        self.commentList = inlist
        self.feed(file.read())

    def handle_starttag(self, tag, attrs):
        if tag == 'div' and attrs:
            for att in attrs:
                if att[0] == "class" and att[1] == "commentText clearfix":
                    self.inComment = True

    def handle_data(self,data):
        if self.inComment:
            self.tempComment.append(data)

    def handle_endtag(self, tag):
        if tag == 'div' and self.inComment:
            self.commentList.append(copy.deepcopy(self.tempComment))
            self.tempComment = []
            self.inComment = False

    def getCommentList(self):
        return self.commentList

filename = sys.argv[1]
if os.path.isfile(filename):
    myfile = open(filename,'r')
    rawCommentList = []
    # Let's initialize the CommentList with my original seed text
    CommentList = [{'text': " CCR is short for chemokine receptor. Chemokines and chemokine receptors allow the cells in your immune system to speak to one another; their epic fight against invaders is like a game of Marco Polo. CCR5 is the chemokine receptor found on macrophages--the gobbling-up cells at the front line of your immune system.", 'num': '0'}]
    tempVector = []
    for c in CommentList[0]['text']:
        if not c == '\t':
            tempVector.append(ord(c))
    CommentList[0]['vector'] = copy.deepcopy(tempVector)
    Spider(myfile, rawCommentList)
    for rawComment in rawCommentList:
        tempCommentDict = {}
        tempCommentDict['num'] = rawComment[2]
        tempCommentDict['text'] = rawComment[6]
        tempVector = []
        for c in rawComment[6]:
            if not c == '\t':
                tempVector.append(ord(c))
        tempCommentDict['vector'] = copy.deepcopy(tempVector)
        CommentList.append(copy.deepcopy(tempCommentDict))
    vectorList = []
    maxVectorLen = 0
    for Comment in CommentList:
        vectorList.append(Comment['vector'])
        if len(Comment['vector']) > maxVectorLen:
            maxVectorLen = len(Comment['vector'])
    for index, v in enumerate(vectorList):
        if len(v) < maxVectorLen:
            paddingLen = maxVectorLen - len(v)
            vectorList[index] = v+ [0]*paddingLen
    dm = pdist(vectorList)
    lm = linkage(dm)
    dendrogram(lm)
    plt.savefig('plot.png', dpi=(200))
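The heart of the script is the step that turns each comment into a list of character codes, pads the shorter lists with zeros, and measures how far apart each pair is. Here is a minimal sketch of that step in plain Python 3, with no clustering library. It assumes a Euclidean distance, which I believe is what `pdist` uses by default; the helper names `to_vector`, `pad` and `euclidean` are hypothetical, and the sample strings are shortened stand-ins for real comments.

```python
import math
from itertools import combinations


def to_vector(text):
    """Map text to a list of character codes, skipping tabs as the script does."""
    return [ord(c) for c in text if c != "\t"]


def pad(vectors):
    """Right-pad every vector with zeros to the length of the longest one."""
    width = max(len(v) for v in vectors)
    return [v + [0] * (width - len(v)) for v in vectors]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Illustrative stand-ins: the seed, a copy with an inserted word,
# and a copy with one extra character at the end.
texts = [
    "CCR5 is the receptor",
    "CCR5 is a the receptor",
    "CCR5 is the receptor.",
]

vectors = pad([to_vector(t) for t in texts])

# Pairwise distances, keyed by the indexes of the two texts compared.
distances = {
    (i, j): euclidean(vectors[i], vectors[j])
    for i, j in combinations(range(len(vectors)), 2)
}

for pair, dist in sorted(distances.items()):
    print(pair, round(dist, 1))
```

Because the vectors are compared position by position, an inserted word shifts every character after it and produces a much larger distance than a single extra character at the end. That is worth keeping in mind when reading the dendrogram.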