Hey! Jeremy here. Recently, someone from the editorial team for Kite, an AI autocomplete for Python, reached out to see if I would share some of their content. Since I thought the tool looked awesome, I figured I'd help them out. After some chatting, we decided on this data science article by Kirit Thadaka. Enjoy!

What is Data Science?

How often do you think you're touched by data science in some form or another? Finding your way to this article likely involved a whole bunch of data science (whooaa). To simplify things a bit, I'll explain what data science means to me.

"Data science is the art of applying scientific methods of analysis to any kind of data so that we can unlock important information."

That's a mouthful. If we unpack it, all data science really means is answering questions by using math and science to go through data that's too much for our brains to process.

Data science covers...

- Machine learning
- Data visualization
- Predictive analysis
- Voice assistants

... and all the buzzwords we hear today, like artificial intelligence, deep learning, etc.

To finish my thought on data science being used to find this article, I'll ask you to think of the steps you used to get here. For the sake of this explanation, let's assume that most of you were online looking at pictures of kittens and puppies when you suddenly came across a fancy word related to data science and wanted to know what it was all about. You turned to Google hoping to find the meaning of it all, and you typed "What is *fill in your data science related buzzword*."

You would have noticed that Google was kind enough to offer suggestions to refine your search terms – that's predictive text generation. Once the search results came up, you would have noticed a box on the right that summarizes your search results – that's Google's knowledge graph.
Using insights from SEO (search engine optimization), I'm able to make sure my article reaches you easily, which is a good data science use case in and of itself. All of these are tiny ways that data science is involved in the things we do every day.

To be clear, going forward I'm going to use data science as an umbrella term that covers artificial intelligence, deep learning, and anything else you might hear that's relevant to data and science.

Positives: Astrophysics, Biology, and Sports

Data science has made a huge positive impact on the way technology influences our lives. Some of these impacts have been nice and some have been otherwise. *looks at Facebook* But technology can't inherently be good or bad: technology is... technology. It's the way we use it that has good or bad outcomes.

We recently had a breakthrough in astrophysics with the first ever picture of a black hole. This helps physicists confirm more than a century of purely theoretical work around black holes and the theory of relativity.

To capture this image, scientists used a telescope as big as the Earth (the Event Horizon Telescope, or EHT) by combining data from an array of eight ground-based radio telescopes and making sense of it all to construct an image. Analyzing data and then visualizing that data – sounds like some data science right here.

A cool side note on this point: a standard Python library of functions for EHT imaging was developed by Andrew Chael from Harvard to simulate and manipulate VLBI (very-long-baseline interferometry) data, helping the process of creating the black hole image.

Olivier Elemento at Cornell uses big data analytics to help identify mutations in genomes that result in tumor cells spreading, so that they can be killed earlier – this is a huge positive impact data science has on human life.
You can read more about his incredible research here.

Python is used by researchers in his lab while testing statistical and machine learning models. Keras, NumPy, SciPy, and scikit-learn are some top-notch Python libraries for this.

If you're a fan of the English Premier League, you'll appreciate the example of Leicester City winning the title in the 2015–2016 season.

At the start of the season, bookmakers put the likelihood of Leicester City winning the EPL at 10 times less than the odds of finding the Loch Ness monster. For a more detailed attempt at describing the significance of this story, read this.

Everyone wanted to know how Leicester was able to do this, and it turns out that data science played a big part! Thanks to their investment in analytics and technology, the club was able to measure players' fitness levels and body condition while they were training to help prevent injuries, all while assessing the best tactics to use in a game based on the players' energy levels.

All training sessions had plans backed by real data about the players, and as a result Leicester City suffered the fewest player injuries of all clubs that season.

Many top teams use data analytics to help with player performance, scouting talent, and understanding how to plan for certain opponents.

Here's an example of Python being used to help with some football analysis. I certainly wish Chelsea F.C. would use some of these techniques to improve their woeful form and make my life as a fan better.
You don't need analytics to see that Kante is in the wrong position, and Jorginho shouldn't be in that team, and... okay, I'm digressing – back to the topic now!

Now that we've covered some of the amazing things data science has uncovered, I'm going to touch on some of the negatives as well – it's important to think critically about technology and how it impacts us.

The amount that technology impacts our lives will undeniably increase with time, and we shouldn't limit our understanding of it without being aware of the positive and negative implications it can have.

Some of the concerns I have around this ecosystem are data privacy (I'm sure we all have many examples that come to mind), biases in predictions and classifications, and the impact of personalization and advertising on society.

Negatives: Gender Bias and More

This paper published in NIPS talks about how to counter gender biases in word embeddings, which are used frequently in data science.

For those who aren't familiar with the term, word embeddings are a clever way of representing words so that neural networks and other computer algorithms can process them.

The data used to create Word2Vec (a model for word embeddings created by Google) has resulted in gender biases that show close relations between "men" and words like "computer scientist", "architect", "captain", etc., while showing "women" to be closely related to "homemaker", "nanny", "nurse", etc.

Here's the Python code used by the researchers who published this paper. Python's ease of use makes it a good choice for quickly going from idea to implementation.

It isn't always easy to prevent biases like these from influencing our models.
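The idea of "closeness" between words can be illustrated with a toy sketch. The four-dimensional vectors below are invented purely for illustration (real Word2Vec embeddings have hundreds of dimensions learned from text); the point is only to show how cosine similarity surfaces associations like the biased ones described above.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 means same direction, 0 means unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical 4-dimensional embeddings, invented to illustrate the mechanics.
# In a biased embedding space, occupation words end up closer to one gender.
embeddings = {
    "man":       np.array([0.9, 0.1, 0.4, 0.2]),
    "woman":     np.array([0.1, 0.9, 0.4, 0.2]),
    "engineer":  np.array([0.8, 0.2, 0.5, 0.1]),
    "homemaker": np.array([0.2, 0.8, 0.5, 0.1]),
}

# A biased space places "engineer" closer to "man" than to "woman"
print(cosine_similarity(embeddings["man"], embeddings["engineer"]))
print(cosine_similarity(embeddings["woman"], embeddings["engineer"]))
```

De-biasing techniques like the one in the paper work by identifying a "gender direction" in this vector space and removing its component from gender-neutral words.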
We may not even be aware that such biases exist in the data we collect.

It is imperative that an equal focus is placed on curating, verifying, cleaning, and to some extent de-biasing data.

I will concede that it isn't always feasible to make all our datasets fair and unbiased. Luckily for us, there is some good published research that can help us understand our neural networks and other algorithms well enough to uncover these latent biases.

When it comes to data science, always remember:

"Garbage in, garbage out."

The data we train our algorithms with influences the results they produce. The results they produce are often seen by us and can have a lasting influence.

We must be aware of the impact social media and content suggestions have on us. Today, we're entering a loop where we consume content that reinforces our ideas and puts people in information silos.

Research projects that fight disinformation and help people break out of the cycle of reinforcement are critical to our future. If you were trying to come up with a solution to this fake news problem, what would you need to do?

You would first need to come up with an accurate estimate of what constitutes "fake" news. This means comparing an article with reputable news sources, tracing the origins of a story, and verifying that the article's publisher is a credible source.

You'd need to build models that tag information that hasn't been corroborated by other sources. To do this accurately, you'd need a ton of not-"fake" news to train the model on.
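A minimal sketch of such a tagging model is possible with scikit-learn. The headlines and labels below are invented placeholders; a real system would need a large corpus of independently verified articles, not six toy examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy headlines -- a real model needs thousands of verified examples
headlines = [
    "Central bank raises interest rates by a quarter point",
    "Scientists publish peer-reviewed study on vaccine safety",
    "Local council approves new public transport budget",
    "Miracle fruit cures all diseases overnight, doctors stunned",
    "Secret moon base confirmed by anonymous online post",
    "Celebrity endorses pill that melts fat while you sleep",
]
labels = [0, 0, 0, 1, 1, 1]  # 0 = corroborated, 1 = "fake"

# TF-IDF turns text into numeric features; logistic regression scores them
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(headlines, labels)

# predict_proba gives a confidence, so borderline stories can be flagged
# for human review instead of being auto-labeled
print(model.predict_proba(["Pill melts fat overnight, doctors stunned"]))
```

Working with probabilities rather than hard labels matters here: flagging a story as "unverified, needs review" is a much safer output than a binary true/false verdict.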
Once the model knows how to identify whether something is true (to a tolerable degree of confidence), it can begin to flag news that's "fake."

Crowdsourced truth is also a great way to tackle this problem, letting the wisdom of the crowd determine what the "truth" is.

Blockchain technology fits in well here by allowing data to flow from people all over the world and arrive at consensus on some shared truth.

Python is the fabric that allows all these technologies and concepts to come together and build creative solutions.

Python, a Data Science Toolset

I've talked about data science: what it means, how it helps us, and how it may have negative impacts on us.

You've seen through a few examples how Python is a versatile tool that can be used across different domains, in industry and academia, and even by people without a degree in computer science.

Python is a tool that makes solving difficult problems a little bit easier. Whether you're a social scientist, a financial analyst, a medical researcher, a teacher, or anyone else who needs to make sense of data, Python is one thing you need in your toolbox.

Since Python is open source, anyone can contribute to the community by adding cool functionality to the language in the form of Python libraries.

Data visualization libraries like Matplotlib and Seaborn are great for representing data in simple-to-understand ways. NumPy and Pandas are the best libraries around for manipulating data.
SciPy is full of scientific methods for data analysis.

Whether you want to help fight climate change, analyze your favorite sports team, or just learn more about data science, artificial intelligence, or your next favorite buzzword, you'll find the task at hand much easier if you know some basic Python.

Here are some great Python libraries to equip yourself with:

- NumPy
- Pandas
- Scikit-learn
- Keras
- Matplotlib

I'll illustrate an example of how easy it is to get started with data science using Python. Here's a simple example of how you can use scikit-learn for some meaningful data analysis.

Python Example with Scikit-learn

This code is available at the Kite Blog github repository.

I've used one of scikit-learn's datasets called Iris, a dataset of the petal and sepal measurements of 3 different types of irises (Setosa, Versicolour, and Virginica), stored in a 150×4 numpy.ndarray. The rows are the samples and the columns are: sepal length, sepal width, petal length, and petal width.

I'm going to run a simple linear regression to display the correlation between petal width and petal length.
The only libraries used here are scikit-learn (for the regression and dataset) and matplotlib for the plotting.

from sklearn import datasets, linear_model
import matplotlib.pyplot as plt

iris = datasets.load_iris()

# Data and features are both numpy arrays
data = iris.data
features = iris.feature_names

Now, we'll plot a linear regression between the length and width of the petals to see how they correlate.

# Create the regression model
regression = linear_model.LinearRegression()

# Reshape the NumPy arrays so that they are columnar
x_data = data[:, 2].reshape(-1, 1)  # petal length
y_data = data[:, 3].reshape(-1, 1)  # petal width

# Train the regression model to fit the Iris data
# (petal length vs. petal width)
regression.fit(x_data, y_data)

# Display chart
plt.plot(x_data, regression.predict(x_data), color='black', linewidth=3)
plt.scatter(x_data, y_data)
plt.show()

Here's a tutorial I created to learn NumPy, and here's a notebook that shows how Keras can be used to easily create a neural network. Just this much will allow you to build some pretty cool models.

Concluding Thoughts

Before I end, I'd like to share some of my own ideas about what the future of data science looks like.

I'm excited to see how concerns over personal data privacy shape the evolution of data science. As a society, it's imperative that we take these concerns seriously and have policies in place that prevent our data from accumulating in the hands of commercial actors.

When I go for walks around San Francisco, I'm amazed at the number of cars I see with 500 cameras and sensors on them, all trying to capture as much information as they possibly can so that they can become self-driving cars. All of this data is being collected, it's being stored, and it's being used.
We are a part of that data.

As we come closer to a future where self-driving cars become a bigger part of our lives, do we want all of that data to be up in the cloud? Do we want data about the things we do inside our cars available to Tesla, Cruise, or Alphabet (Waymo)?

It's definitely a good thing that these algorithms are being trained with as much data as possible. Why would we trust a car that hasn't been trained enough? But that shouldn't come at the cost of our privacy.

Instead of hoarding people's personal data on "secure" cloud servers, data analysis will be done at the edge itself. This means that instead of personal data leaving the user's device, it will remain on the device and the algorithm will run on each device.

Lots of development is happening in the field of zero-knowledge analytics, which allows data to be analyzed without needing to see what that data is. Federated learning allows people to contribute to the training of neural networks without their data leaving their device.

The convergence of blockchain technology and data science will lead to some other exciting developments. By networking people and devices across the globe, the blockchain can provide an excellent platform for distributed computation, data sharing, and data verification. Instead of information sitting in silos, it can be shared and opened up to everyone.
Golem is one example of this.

Hypernet is a project born out of Stanford to solve a big problem for scientists: how to get enough compute power to run computationally and data-intensive simulations.

Instead of waiting for the only computer in the university with the bandwidth to solve the task, and going through the process of getting permission to use it, Hypernet lets users leverage the blockchain and the large community of people with spare compute resources, pooling them together to provide the platform needed for intensive tasks.

Neural networks have long felt like magic. They do a good job, but we're not really sure why. They give us the right answer, but we can't really tell how. We need to understand the algorithms that our future will be built on.

According to DARPA, the "third wave" of AI will depend on artificial intelligence models being able to explain their decisions to us. I agree: we should not be at the mercy of decisions made by AI.

I'm excited about what the future holds for us. Privacy, truth, fairness, and cooperation will be the pillars that the future of data science is built on.

This article originally appeared on Kite.