My Very First Model(s)
Three weeks neck deep in Python and Data Science
Garrett Mayock posted 2018-12-20 20:28:00 UTC
For a TL;DR, scroll through the section headers.
Well. Python. There’s a lot going on.
I've learned there's a lot to learn
It's a lot more than "just" Python. It's the libraries, the documentation, GitHub and Stack Overflow, Kaggle, Anaconda, Jupyter Notebooks and Google Colab, and so on ... but it's a lot more than that, too.
The real trick is in the application.
How can you get the data in a usable format? What features give signal, not noise? What models do you use? What parameters do you choose for the model? And how do you present the resulting information to inform decision making?
And then there's math.
Math and me: a complete history (abridged)
I finished my high school's math offerings in tenth grade (through AP Stats and AP Calculus) and hadn't studied math (directly) since. Many of the concepts were covered while I studied finance and economics at university, of course, but I had literally never written code to solve math problems (outside of Excel, which only counts halfway).
Python makes math beautiful again. I'm remembering linear algebra concepts I haven't thought about in a decade just by being able to play around with graphs. I'm understanding z-scores like I had taken stats yesterday. That is incredible, and frankly makes me very excited. I mean, I had calculated linear regression equations before. By hand. I have absolutely no recollection of how I did that. I may have repressed the memory. But a few lines of code in Python and it's there? A couple more and it's graphed?
Not to belabor the point, but I had thought Excel made math easy. No. No way. Not compared to code.
It's so beautifully clean. I feel like, if I had been programming in high school, I wouldn't have stopped studying mathematics once I finished AP Stats and Calc. I would have continued on, so much further. But I digress.
I'm enjoying learning about programming
It's exciting. I've learned a lot not just about Python, but programming in general.
And for the most part, the culture around learning to code is pretty awesome. There's even a tweet from the creator of Ruby on Rails admitting he himself would google for usage examples even though he created the framework - beautiful. https://twitter.com/dhh/status/1074372160884371456
I've spent about 100 hours reading, learning, and coding
In the past three weeks, I’ve followed along with PythonProgramming.net's Intro to Python 3(.7) course (4 hours of videos), coded a command line tic-tac-toe (on GitHub here), attended Lambda School’s Intro to Data Science bootcamp (~9 hours of videos), and gone off on a lot of exploratory tangents. The first takeaway is that nothing goes quickly.
I spent ~100 hours in that time studying data science and practicing Python, even though the video lessons I've watched total only ~15-20 hours.
On top of that are ~8 hours coding the tic-tac-toe game and ~27 hours of homework for the Lambda School. The actual minimum-viable-product code was written much quicker, but I wanted to experiment with everything I could. I'd do things ten different ways to see how each one worked.
I spent the rest of the time researching the rest of the data science world. I've learned about how different people learn and practice data science. I've read about the different types of work that go into a data scientist's day job. I've learned the delineations between data engineering, business intelligence, and data science (as well as the overlap). I've read about and through the online communities surrounding data science and programming in general.
So here's what I spent that time doing.
My intro to Python: writing tic-tac-toe
I learn best by doing. Therefore, to kick-start my learning, I decided to code a simple tic-tac-toe game.
There are plenty of guides online. I went through two guides of lesser quality before coming across the PythonProgramming.net guide (Intro to Python 3(.7), same as linked above).
Here's a summary of what I studied for the tic-tac-toe game:
- Tuples, strings, and lists
- Built-in functions
- Indices and slices
- Defining functions, function parameters, and typing
- Defining variables
- Error Handling (google-fu, stack overflow, reading documentation)
- Iterators / iterables
- Commenting Code
- Basic operators
- Syntax, semantics, and syntactic sugar
My intro to data science: the Lambda School bootcamp
Lambda School put together a two-week evening bootcamp, Intro to Data Science (same as linked above). It was a nice introduction to the field, as well as a portion of their application process. I'm not applying because the value isn't there for me, but the bootcamp is free to the public, so I wanted to partake.
Here's a summary of what I studied at Lambda School:
- Python and modules like matplotlib, numpy, pandas (more details in Appendix B below)
- Data exploration and data wrangling (reading, analyzing, replacing null values, encoding categorical variables)
- Jupyter notebooks (Google Colab)
- Statistics and hypothesis testing
- Linear algebra
- Regression modeling (linear, polynomial, k nearest neighbors, decision tree regressors)
- Classification modeling (logistic regression, decision tree classifiers, random forest)
Over the two weeks, there were eight assignments and one sprint challenge at Lambda. Check out my work here:
- Week one
- Intro to Python (assignment one): https://colab.research.google.com/drive/1gtYzPRGN4D7fS_EXqhPbpyzuyNJ_xNBk
- Data Exploration (assignment two): https://colab.research.google.com/drive/18NJueJUZxZJMCwmMlYT712OwgzU_i7PS
- Statistics Fundamentals (assignment three): https://colab.research.google.com/drive/1vCzalhYeQsRGDA3t2o2VRdtPkZ__w373
- Hypothesis Testing (assignment four): https://colab.research.google.com/drive/1n_j4015Wy8e2LPwPD4CVDvQQdXdsV6VE
- Sprint Challenge: https://colab.research.google.com/drive/1M4Fwb0WvXfkYtmvtFBetE7QzgYNLln0u
- Week two
- Linear Algebra Foundations (assignment five): https://colab.research.google.com/drive/1p6SoZL9w2oPgfA7-CkHmfWimOredOV4R
- Linear Algebra Applications (assignment six): https://colab.research.google.com/drive/1sg1g-zD0XMWmnJJ59W-BS2tV8w_GZbEn
- Regression Modeling (assignment seven): https://colab.research.google.com/drive/1JgHgN0SiGuTFmulaAsRSESo2vCPOe733
- Classification Modeling (assignment eight): https://colab.research.google.com/drive/16xgj6GtEfweC9seUjsvzwXwwzPqRB-q2
Well, a lot! I haven't decided which of the big three parts of data science I want to pursue next. The big three:
- Data Engineering
- Business Intelligence
- Data Science (machine learning / artificial intelligence)
Next steps for data engineering could be increasing the complexity of my website and its database. This could be as simple as allowing blog post functionality, or adding a new meta tagging system in third normal form in the database.
For business intelligence, the first steps I'm going to take are the Intro to Tableau Public course on their website, and eventually the ~7.5 hours of live trainings at the same link, or the ~26.5 hours of pre-recorded official training on YouTube. I have some experience with Platfora BI, but it has since been bought by Workday, Inc., and doesn't seem to have the same level of public accessibility as Tableau. Maybe I'll pursue Power BI's free version after Tableau.
Finally, there's all I have to do to continue learning Python and applied data science. This is where it's really open-ended. I've got books to read and re-read (The Master Algorithm by Pedro Domingos, The Data Science Handbook by Field Cady, The Nature of Software Development by Ron Jeffries). There's a plethora of options to go through next on PythonProgramming.net, such as the 5-hour Intermediate Python Fundamentals, the 4-hour NLTK with Python 3 for Natural Language Processing, or the 19-hour Practical Machine Learning with Python. Or the Kaggle Titanic tutorial, or their Digit Recognizer tutorial, or their Advanced House Prices Regressor, or, or, or ... you get my point.
I'll be showing work
Throughout each step, I'll be looking to show tangible results on the website. For example, the Tableau training may only total 26.5 hours, but embedding the resulting vizzes into the website may take another couple days. I have absolutely no idea - it will literally be the first time I've ever done anything like that. But that's been the story for everything I've done so far, so I'm not intimidated.
Furthermore, I will continue updating the blog entries (hopefully a bit more frequently than I have since my first blog). That'll take some time too.
And I guess I'll be needing to keep my resume up-to-date and be active in the job search as well.
That's it for now - see you again soon!
Appendix A: Getting a feel for the tools
Part of using Python is writing it and getting it to run. Here are the programs I've used in the past three weeks.
When installing Python, I used Anaconda Navigator due to recommendations I read. I understand that it installs a bunch of common things and makes it a bit easier to use for a newbie than doing everything from the command line. I haven't had any issues getting set up or running Python code so far ... but that's probably not saying too much.
Visual Studio Code
I wrote the tic-tac-toe game (and all its iterations) in Visual Studio Code. This has become my preferred text editor. I installed the Python Extension and bound a couple hotkeys to run the file in the terminal and to kill the terminal. It made iteration quick and painless.
Google Colab and Jupyter Notebooks
When I moved on to the Lambda School bootcamp, they used Google Colab for their assignments. I think this is awesome - Colab is essentially a Jupyter Notebook hosted by Google. There's some upper limit on compute time per day, but it's way beyond the scope of what I'm doing for now. I've also used Jupyter Notebooks spun up locally via Anaconda Navigator.
Appendix B: Modules I've used
I've gotten exposure to the most iconic trio of Python modules (pandas, matplotlib, numpy) and more. Here's a brief summary of the modules I've used, and how:
Pandas (Python Data Analysis Library) is a library which helps structure and analyze data. Mostly I've done basic operations on dataframes with pandas. Sample formulas used:
matplotlib, matplotlib.pyplot, and mpl_toolkits.mplot3d
Matplotlib is a package and pyplot is a module which are designed to make plotting easier. Pyplot is described as "providing the state-machine interface to the underlying plotting library in matplotlib". I don't really know what that means yet. However, I do know that when I import matplotlib.pyplot as plt, and call functions like plt.plot, it works.
I've plotted histograms, scatter plots, lines, and legends, in 2D and 3D. I've set limits on the axes, labelled them, added titles, and set the figure size. I've plotted vectors (arrow()) and matrices (fill()).
The documentation also mentioned matplotlib.pylab, and mentioned IPython and making plots interactive. In the Google Colab notebooks for Lambda School, I was never able to get
to work (i.e., to make the plots interactive, specifically the 3D ones such as those from mpl_toolkits.mplot3d). It never showed the plots, not even with plt.show(). I just submitted the assignments on December 18th, so hopefully I'll get feedback from Lambda about it, but otherwise maybe I could try pylab.
I've used numpy mainly for linear algebra. Sample of functions used:
I've used scikit-learn for a variety of regression and classification models including:
graphviz and dtreeviz
I used these to visualize the decision trees from sklearn.
Seaborn is a data visualization library. Documentation says it is based on matplotlib. Experience says the graphs it produces are prettier than pyplot. I'm not sure there's any other significant difference. I've used distplots, jointplots, scatters, barplots, and the load_dataset feature (for the diamonds dataset).
Statistics is a great module which quickly calculates statistics of numeric data. I used this for quick mean(), median(), and mode() calculations, as well as stdev().
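As a minimal sketch of those four calls (the scores list is made-up data for illustration, not something from the assignments):

```python
import statistics

# Hypothetical sample data, invented for this illustration.
scores = [2, 4, 4, 5, 7, 8]

mean_score = statistics.mean(scores)      # arithmetic average
median_score = statistics.median(scores)  # average of the two middle values for an even-length list
mode_score = statistics.mode(scores)      # the most common value
spread = statistics.stdev(scores)         # sample standard deviation (n - 1 in the denominator)

print(mean_score, median_score, mode_score, round(spread, 3))
```

Note that stdev() is the sample standard deviation; the module also has pstdev() for the population version.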
Random, as you may be able to tell from the name, implements pseudo-random number generators. I've used choice(), choices(), and randint(). choices() samples with replacement by default, which is what I used to create bootstrap samples.
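Here is a rough sketch of how a bootstrap sample can be drawn with choices(). The bootstrap_means helper, its parameters, and the observations list are all hypothetical names and data made up for illustration, not the code from the assignments.

```python
import random
import statistics


def bootstrap_means(data, n_resamples=1000, seed=0):
    """Return the mean of each bootstrap resample of `data` (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so repeated runs give the same resamples
    means = []
    for _ in range(n_resamples):
        # choices() draws with replacement, so a value can appear more than once
        resample = rng.choices(data, k=len(data))
        means.append(statistics.mean(resample))
    return means


# Made-up observations for illustration.
observations = [3, 5, 7, 8, 12, 13, 14, 18, 21]
means = bootstrap_means(observations)
print(min(means), max(means))
```

The spread of the resampled means gives a feel for how much the sample mean would vary from sample to sample.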
I used this one time to create an OLS model (ordinary least squares) because a google search led me to believe it was the easiest way to find the coefficients for the plane of best fit on the stretch problem on the Lambda School assignment 6. Note - as of time of writing I'm seeking feedback from Lambda School on graphing the plane itself as I was not able to figure it out playing around with it.
Itertools is a module which implements a bunch of useful building blocks for iteration. So far I've only used takewhile() and cycle().
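A small sketch of both functions, using made-up inputs (the player symbols and the readings list are assumptions for illustration):

```python
from itertools import cycle, takewhile

# cycle() repeats an iterable forever; here it alternates between two players.
players = cycle(["X", "O"])
first_five_turns = [next(players) for _ in range(5)]

# takewhile() yields items until the predicate first fails, then stops for good,
# even if later items would pass the test again.
readings = [1, 3, 5, 8, 2, 9]
leading_small = list(takewhile(lambda value: value < 6, readings))

print(first_five_turns)
print(leading_small)
```

Because cycle() never ends on its own, it needs something else to stop it, such as a fixed number of next() calls or a break in a loop.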
I've only used functools to import reduce, which I combined with a lambda function to calculate a mean:

    from functools import reduce

    def findMean(numbers):
        # Fold the list into a running total, then divide by the count
        total = reduce(lambda x, y: x + y, numbers)
        mean = total / len(numbers)
        return mean
Collections is a module which is described in documentation as implementing specialized container datatypes as alternatives to the built-in containers of dict, list, set, and tuple. I used it for the Counter container, in a formula put together to return the mode for multimodal data (using Counter's built-in most_common() method).
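One way such a formula could look is sketched below. find_modes is a hypothetical helper written for illustration, not the original formula. A helper like this was handy at the time because statistics.mode() raised an error on multimodal data in Python versions before 3.8.

```python
from collections import Counter


def find_modes(values):
    """Return every value tied for the highest count, in first-seen order."""
    counts = Counter(values)
    if not counts:
        return []
    # most_common(1) returns a one-item list holding the (value, count) pair with the top count
    highest = counts.most_common(1)[0][1]
    return [value for value, count in counts.items() if count == highest]


print(find_modes([1, 2, 2, 3, 3, 4]))
```

With the sample list above, both 2 and 3 appear twice, so both are returned.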