Reading Reflections – “Big Data” and Data Mining

This week’s blog is a reflection on last week’s readings. The readings largely deal with “Big Data” and the technologies and techniques used in the field of “Data Mining” (in which a person “mines” for particular data among all the data that is out there, much like a real miner digging for gold or coal). Toni Weller’s book, History in the Digital Age, has a chapter by William Turkel, Kevin Kee and Spencer Roberts followed by a chapter by Jim Mussell. Turkel, Kee and Roberts argue that new technology has solved the lack of access to sources that was traditionally a problem for historians. They describe the traditional process historians use in their research and then list 7 rules for digital research. The first two rules are: “make everything digital” (they talk of scanning and digitizing sources, mentioning a handheld pen scanner that can scan text into a computer) and “keep stuff in the cloud”. For the most part I don’t do either of those things in my research. If I have a printed book from a library, I use information from the book without digitizing it, and I generally just keep my files on my computer or sometimes on a flash drive.

Jim Mussell writes about the difference between “history 1.0” and “history 2.0”, specifically saying that historians should view the web as more than just a tool and that “history 2.0” involves shifting from a focus on documents to a focus on data. To be honest, calling the emerging form of history “2.0” and its predecessor “1.0” seems a bit inaccurate, given how long history has been around (which in some form or other is as long as the human race has been remembering and retelling the past). I imagine that we are really in a much later stage than “history 2.0” if we are to view history in different “.0” forms (as if history were “AOL”).

Daniel Cohen, in his article “From Babel to Knowledge”, starts by mentioning a short story, “The Library of Babel” by Jorge Luis Borges. Cohen talks about how difficult it can be to find pertinent (and accurate) information while conducting online research. He then discusses some online tools that can help find information, such as the “Syllabus Finder” and the H-bot, which searches the web to answer historical questions. Tim Hitchcock discusses “Academic History and the Headache of Big Data” on the blog Historyonics. In this post, Hitchcock mentions how “Big Data” is hard to work with while trying to maintain a commitment to “history from below” according to the “British Marxist Historical Tradition”. He discusses his project on the “Old Bailey”, which focuses on lower-class people, while he says most other historical web projects focus on intellectuals such as philosophers and scientists (for some reason he puts the word “scientist” in quotation marks, as if to say that some of those scientists are not really scientists, which seems unnecessary). He says that other forms of history, such as economic and demographic history, gender history and “the radical tradition”, are not well represented. (“The radical tradition” is an odd term, given that “radical” is often used to describe people and groups that endeavor to break with or uproot traditions and traditional structures, but it is accurate in that such groups do come to have, and even endeavor to be faithful to, some sort of tradition.)

Perhaps the biggest issue with “big data”, or with the internet in general, is how to separate the wheat from the chaff. After all, not everything you find in a Google search is something you would want to include in your college paper, and not everything you see someone say on Facebook or Twitter is something you want to use to boast about how knowledgeable you are (or better yet, don’t boast about that in the first place). Tools can help with this task, but experience is also important. Over time, using the internet teaches you how to figure out which information is better and how to access and cite online information properly. Ten years ago I cited the source of my information for a school paper as simply “”, and after doing so I learned that Google is just a search engine and that I had to cite the site that actually had the information on it, not the search engine I used to find it (that was my first year having a computer at home, so as I said before, you learn over time). Sorting through and mining all this big data can be time-consuming and sometimes challenging (though probably less challenging, and certainly less dangerous, than other forms of mining, like coal mining), and it is a task that people hopefully become more proficient at with time and practice.



