‘A predictive model needs a response feature which is what would be predicted when the other attributes are known. All algorithms will output a response. While there are checks to ensure the quality of the prediction, such as checks for precision, recall and even checks overfitting, they all work with the assumption that the underlying data frame is meaningful’ (Deshpande 2015).
Data science, much like many creative fields, involves a process of research, experimentation, iteration, ingenuity, and implementation. Take making a special dinner for a small group of friends, for example. As with data mining, I would most likely start here:
1) Research some recipes for the occasion (business understanding).
2) Trial ingredients to uncover the perfect menu (data understanding).
3) Clean, peel, grate, dice, marinate and measure (data preparation).
4) Test and sample my offerings so they’re just right (modeling).
5) Evaluate the final product and plating of the feast (evaluation).
6) Serve dinner (deployment).
Starting with something simple, like focaccia bread, let’s take a look at the recipe:
1 envelope (2 1/4 teaspoons) active dry yeast
1 3/4 cups warm water
3 1/2 cups white flour
1/2 cup whole wheat flour
2 teaspoons salt
4 1/2 tablespoons extra virgin olive oil
5 sprigs of fresh rosemary
Several pinches of flaky sea salt
1. Dissolve and proof the yeast.
2. Make the dough.
3. Leave it to rise.
4. Put the dough into the pan.
5. Preheat the oven and prep the rosemary.
6. Place in the oven and bake.
7. Remove from oven and cool.
Execution Time: Bake for about 20 – 25 minutes at 375°F.
Output: Makes one large focaccia bread
As data scientists we should ‘question each channel or attribute of data: should it be in there? Why or why not? We may also need to critically examine if all the records (rows) need to be in the data frame’ (Deshpande 2015). It’s not enough to assume that the data given to us is clean, usable or meaningful.
Part A: Insights gleaned from Assignment 2
Learning everything you can about the problem allows, I feel, for experimentation within the iterative process of analysis. Assignment 2 gave minimal detail about the data set, but a little more about the problem. Baseline research can generate initial hypotheses for the project brief, which help define and shape the project goals. ‘Whether you analyse your data using statistics or choose some other method, there is a process you follow and interpretations to be made from that process’ (Lichtman 2012).
Our group members examined the data using histograms, box-plots, correlation matrices and similar tools; ‘using iterative and sequential methods reveal a greater depth of understanding than we would have found with only one method’ (Lichtman 2012). This helped the group understand the “mystery” attributes and reduce the number of attributes for more accurate and efficient modeling. Making sense of the data by sifting, identifying and sorting made the business problem ‘less like throwing darts in the dark’ (Puri 2015).
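As a minimal sketch of the kind of correlation check mentioned above (not our group's actual workflow), the following Python computes pairwise Pearson correlations and flags strongly correlated attribute pairs as candidates for removal. The attribute names, the values and the 0.9 threshold are all assumptions made up for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no linear relationship to report
    return cov / sqrt(var_x * var_y)


def redundant_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for attribute pairs with |r| >= threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Hypothetical "mystery" attributes: names and values are invented.
data = {
    "p1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "p2": [2.0, 4.0, 6.0, 8.0, 10.0],  # exactly 2 * p1, so fully redundant
    "e1": [5.0, 3.0, 4.0, 1.0, 2.0],
}

flagged = redundant_pairs(data)
print(flagged)
```

A flagged pair is only a prompt to ask whether both attributes are needed; deciding which one to keep still depends on knowing what the attributes mean.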
What about the VEGETARIANS! Back to dinner preparations – I didn’t account for ALL my guests! If dinner was going to be a roast, not once did I think to include yet another class of diner, the vegetarian, or in this case the pescatarian.
Data should be diverse, inclusive and whole. In this case, the meal plan was not inclusive of all dietary needs; a whole sub-class of eaters was neglected. A friendly dinner is a small-scale problem, but in big data, over- or under-representing a group could have significant ramifications. Such unintended outcomes may be accidental, so continuing to ask “why?” throughout the process can surface questions you neglected at the start.
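One simple way to look for under-representation is to count how much of the data each group makes up. The sketch below is only an illustration: the guest list, the group labels and the 15% cut-off are assumptions, not figures from the assignment.

```python
from collections import Counter


def representation(labels, min_share=0.15):
    """Return each group's share of the records and the groups below min_share."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {group: count / total for group, count in counts.items()}
    under = sorted(group for group, share in shares.items() if share < min_share)
    return shares, under


# Hypothetical guest list for the dinner analogy.
guests = ["omnivore"] * 8 + ["vegetarian"] + ["pescatarian"]

shares, under = representation(guests)
print(shares)
print(under)
```

A count like this can only flag groups that appear in the data at all; a group that is missing entirely, like the diner I forgot, still has to be caught by asking who should have been there.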
Ethics is simple: do the right thing. Measure yourself with the New York Times Rule … “never do anything you wouldn’t want to see reported on the front page of the New York Times.”
‘Data is a tool that human beings created for other human beings. As a result, it’s up to you to create your own constraints for analysis. You choose your terms. You choose the questions you want to answer. You choose the techniques that you want to deploy. You’re in control’ (Puri 2015). Therefore, we must always consider the possibility of error, censorship, discrimination, and so on. That in itself is a big responsibility.
Random Forest: side dishes (veggies)
Algorithms, simply put, are recipes – a set of instructions telling a computer what to do. Like chefs, computers can have biases towards certain ingredients (variables) and the potential to make mistakes. Algorithms are literal creatures wearing horse-blinders, which is why I support the need for human-computer interaction: it lets us leverage both successfully and ethically, together.
‘The missing “D” in Data Science is a word we do not often associate with data science armour. It’s DOMAIN’
– Roy 2015
Part B: Insights gleaned from Assignment 2
Curating the data and harvesting a story
Data analysis is about process and interpretation: ‘as a qualitative researcher, your task is to organise and make sense of the data. One way to do this is to see if you can identify key concepts that come out of the data. An alternate way to do this is to see if you can develop a story from the data’ (Lichtman 2012).
Dealing with the data and making sense of it, especially during the course of Assignment 2, proved difficult with little attribute knowledge, aside from the categories of physiological, environmental and vehicular. ‘More enduring will be the need for data scientists to communicate in language that all their stakeholders understand – and to demonstrate the special skills involved in storytelling with data’ (Davenport & Patil 2012).
By going through a rigorous process of data exploration, coupled with research into current industry trends, our group was able to hypothesise possibilities that complement the reasoning for further support and investigation into the attributes, which I believe is pivotal in the final ‘sell’.
Only the beginning …
‘Algorithms make predictions more accurate – but they also create risks of their own, especially if we do not understand them’ (Luca, Kleinberg & Mullainathan 2016). This is where I am currently: continuing to try to understand them, to use them correctly and to adjust to the business problem or situation when required, whether that is a change in project scope like Assignment 2, project evolution or the discovery of ethical implications.
Reflection on my contributions to the group
I was not our group’s in-house mathematician, but I still felt I contributed equally in delivering the project. I wrangled a little data, Tableau-ed some box-plots and tried my hand at a few histograms – but mostly I watched and learned from my very patient teammates. My algorithmic background is minimal, and their know-how paved the way to better understanding. I would say I was more a “data explorer”, or second in command to my group of “data scientists”, so I pushed forward in researching insights, suggesting hypotheses, (some) analytics, project management, compiling information, editing, and visualising the reports. Best group ever = 100% awesome.
Davenport, T. & Patil, D.J., 2012. ‘Data Scientist: The Sexiest Job of the 21st Century’, Harvard Business Review, October, viewed 23 June 2016, <https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century>.
Deshpande, B., 2015. ‘The value of domain knowledge in data science’, Simafore, 06 March, viewed 22 June 2016, <http://www.simafore.com/blog/the-value-of-domain-knowledge-in-data-science>.
Lichtman, M., 2012. Qualitative Research in Education: A User’s Guide (3rd Edition), SAGE Publications, viewed 20 June 2016, <http://www.sagepub.com/sites/default/files/upm-binaries/45660_12.pdf>.
Luca, M., Kleinberg, J. & Mullainathan, S., 2016. ‘Algorithms Need Managers, Too’, Harvard Business Review, January-February, viewed 20 June 2016, <https://hbr.org/2016/01/algorithms-need-managers-too>.
Puri, R., 2015. ‘3 ways to get more meaning from data’, GA Blog, September, viewed 19 June, <https://generalassemb.ly/blog/these-data-driven-perspectives-transform-approach-business>.
Roy, D., 2015. ‘The missing “D” in data science’, KDNuggets, July, viewed 22 June 2016, <http://www.kdnuggets.com/2015/07/data-science-domain-knowledge.html>.