Trying to Take a Good First Step
We have spent a week and one-half setting the table and clearing our throats. It is now time to begin putting software into action. The complement to our KBpedia environment is the one for Python. We now switch gears to finding and installing a basic Python ‘starter package’ for this Cooking with Python and KBpedia series. We will approach this question from the standpoints of our local Windows 10 operating environment, the needs for KBpedia and its tools, and our desire to move into data science and machine learning applications.
Though I am an absolute newbie with regard to Python, I have been monitoring it for some years as a possible language to adopt. For quite a few years there was apparently a lengthy and difficult transition from Python 2 to Python 3. (As of this writing, the current version is Python 3.8.5.) Many of the issues we will address in coming installments deal with questions like encodings, management of CSV files, file I/O, and other topics that have apparently been much easier to handle with Python 3 (specifically since Python 3.6 and 3.7). Because of this transition, one may still find tutorials and online guidance offered in both Python 2 and 3 flavors. Fortunately, since all that we are developing in this series is new, we do not need to account for a legacy code base, and thus have no need to provide duplicate Python 2 and 3 routines. Please note, however, that if you are using Python 2 locally, most of the routines offered during this series will likely not run without an upgrade.
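If you are unsure which Python you have locally, a quick check along the following lines may help. This is only a minimal sketch: the function name and the 3.6 threshold are my own assumptions, chosen to match the versions mentioned above, and are not a requirement stated anywhere official.

```python
import sys

# Assumption: 3.6 is used as the floor because the text above singles out 3.6 and 3.7
MIN_VERSION = (3, 6)


def python_is_recent_enough(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    # str.format is used so the message prints the same way under Python 2 and 3
    print("Running Python {}".format(sys.version.split()[0]))
    if not python_is_recent_enough():
        print("This interpreter is older than {}.{}; consider upgrading".format(*MIN_VERSION))
```

The comparison works on plain tuples, so it can also be called with a hand-written version such as (2, 7, 18) to see how an older interpreter would be classified.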
Going back to the days of the LAMP stack on a Windows machine, which I did years ago with the XAMPP package, I am leery about installing languages and complicated stacks on Windows. Unlike Linux packages that can be installed with a single command, and easier methods on Macs as well, my impression is that Windows has never been a particularly friendly environment for installing natively non-Windows systems and applications. Though the official Python site has some impressive documentation and installation kits, including some nice kits from third parties, I decided going in that I wanted to use a more automated approach to handling Python and related package installation and dependencies. The dependencies portion is especially tricky since many data science applications with Python build on other packages, and getting them installed in the correct order with appropriate settings can be a real time waster.
My initial research suggested I wanted a data science focus for my Python installation with machine learning, of course. But, I also thought it might be a good idea to install an integrated development environment (IDE) to aid code completion, language lookup, and debugging. I also had become quite enamored with the REPL (read-eval-print loop) capabilities in our Clojure work, which allow code snippets to be interpreted and run immediately, so I was quite intrigued with adding an electronic notebook environment as well. My initial research suggested either the PyCharm or Spyder IDEs might be suitable for data science. I had already been following development of the Jupyter notebook (initially IPython) for quite some time and wanted to include that in my platform as well.
So, while I was beginning to zero in on a suite of Python-based tools, I had not worked directly with any of them. I therefore also wanted a Python package installer that was flexible enough to enable me to choose among alternative tools and switch them out if need be with relative ease. Who knows what kind of speedbumps I might encounter as we continue on the journey in this CWPK series? Prior experience with Eclipse and other IDEs warned me that the IDE component especially might be one that requires some choice and flexibility.
There are a number of guides for installing Python locally that are quite good. One is geared around Jupyter and uses PyCharm as the IDE. I did not really like the lock-in on the IDE and also did not like the suggestion to immediately incorporate GitHub into the workflow. (GitHub is essential for anyone hoping to share code with others or develop an open-source package, but my target audience of the focused newbie may not require this step.) Other guides, however, for example the ones from KDnuggets or DataCamp, kept pointing me to the open-source Anaconda installer package.
Anaconda is a package manager, an environment manager, and a Python distribution that particularly emphasizes data science applications. The environment enables one to quickly download more than 7,500 Python/R data science packages (so, it also supports incorporation of R; see CWPK #13), including the essential machine learning packages of scikit-learn, TensorFlow, and Theano, plus the data analysis tools of Dask, NumPy, pandas, and Numba. Multiple data visualization packages, such as Matplotlib, Bokeh, Datashader, and Holoviews, can also be managed. Anaconda enables one to manage libraries and their dependencies on Windows, macOS, and Linux using the Conda package manager. Conda complements the standard pip Python package manager. More than 20 million users worldwide use the Individual Edition version of Anaconda.
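Once a distribution is in place, one simple way to see which of these packages are actually available to your interpreter is to ask Python whether it can locate them. The sketch below is illustrative only: the helper name and the short package list are my own choices, and which packages turn up depends entirely on what your installation includes. Note that scikit-learn is imported under the name sklearn.

```python
import importlib.util

# Import names for a few of the packages mentioned above (illustrative selection)
PACKAGES = ["numpy", "pandas", "sklearn", "matplotlib", "bokeh", "dask", "numba"]


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    for name in PACKAGES:
        status = "available" if is_installed(name) else "not found"
        print("{:<12} {}".format(name, status))
```

Because find_spec only locates a module rather than importing it, the check is fast and has no side effects from loading large libraries.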
The final factor that convinced me to use Anaconda was its Navigator graphical user interface that enables one to launch applications or manage packages. Navigator showed me a way to easily choose Jupyter, PyCharm or Spyder, three of the big options at the center of this work, among many other package choices down the road. Thus, while I may prefer working directly in an editor via an IDE when writing programs, having the option to manage apps and dependencies in a GUI is really attractive.
Installation of Anaconda is a breeze, though I do add one important wrinkle to the fully automated path.[1] Begin by going to the Anaconda Individual Edition download page and pick Download.
When presented with the alternative approach, choose it. While it is nice to have the Anaconda GUI available for some tasks, we also want to work with Python at the command line without needing to invoke Anaconda. The alternate path approach writes the appropriate path information to the Windows environment variables, meaning we can invoke Python and related applications from any location on our local machine.
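After the installation is complete, the effect of this choice can be inspected from Python itself by looking at the PATH variable. The following is a minimal sketch under my own assumptions: the helper name is hypothetical, and matching on the text 'conda' is only a rough heuristic, not how the installer itself tracks its entries.

```python
import os


def anaconda_path_entries(path_value=None, sep=os.pathsep):
    """Return the PATH entries whose text mentions 'conda' (case-insensitive)."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    return [entry for entry in path_value.split(sep) if "conda" in entry.lower()]


if __name__ == "__main__":
    entries = anaconda_path_entries()
    if entries:
        for entry in entries:
            print(entry)
    else:
        print("No Anaconda-looking entries found on PATH")
```

If the list comes back empty in a fresh command window after installation, the PATH option was likely not applied.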
Your first actual install screen will ask about Advanced Options. Make sure to pick the ‘Add Anaconda3 to the system PATH environment variable’ option. When you do, you will get a red font notice telling you this option is not recommended. Proceed with it anyway by picking Install.
You will then have the choice to install as a single user or for the entire machine. In my case, since it is a dedicated computer not on a business network, I use the entire machine option. Your local circumstance or IT department may mandate the single user option.
You should pause at this point and think through what you want your directory structure to be. You can accept placing all files in the standard install locations (assuming a C: drive): C:\Users\mike\anaconda3 if you install for a single user, or C:\ProgramData\Anaconda3 if you install for all users. However, I find it useful to set up my own directory structures and be able to modify and expand at will. A Python installation with Anaconda will take substantial space, and you may be setting configurations for many different apps as well as projects other than KBpedia or derivatives. For security purposes, we also want to keep our use of Python somewhat fenced and unable to reach the root. (We’ll discuss this again in CWPK #15.) Here is how I am setting up my directory structure:
|-- PythonProject                      # Set up a 'master' directory for all of your Python work; name as you wish
    |-- Python                         # Create a 'Python' (upper or lower) directory under this root
        |-- [Anaconda3 distribution]   # Direct your Anaconda install to this location
    |-- TBA
    |-- TBA                            # We'll add to this directory structure as we move on
    |-- TBA
    |-- TBA
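The first two levels of this structure can be created by hand in Windows Explorer, or with a few lines of Python if you already have an interpreter available. This is only a sketch: the directory names follow the outline above, while the function name and the base location are placeholders of my own that you should adapt.

```python
import tempfile
from pathlib import Path


def make_project_skeleton(base_dir, master="PythonProject", child="Python"):
    """Create <base_dir>/<master>/<child> if missing and return that path."""
    target = Path(base_dir) / master / child
    # parents=True builds the master directory too; exist_ok=True makes reruns harmless
    target.mkdir(parents=True, exist_ok=True)
    return target


if __name__ == "__main__":
    # Demonstrated in a throwaway directory; substitute your own drive or folder
    with tempfile.TemporaryDirectory() as scratch:
        print(make_project_skeleton(scratch))
```

The returned path is the location to which you would then direct the Anaconda installer.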
The entire install process is automated; simply follow the instructions on each screen. During the actual install you may click the Details button and watch progress as it proceeds file by file. The longest step is setting up the package cache. Overall the installation takes a few minutes. Please be patient. Once the installation has finished and you have proceeded through all steps as presented, conclude by clicking on Finish.
To check whether the install proceeded properly, call up a command prompt (PowerShell or cmd at the Windows Run option), and see if you get version information for these two commands:
conda --version
python --version
If you get version information echoed to the screen, you are fine. If these commands do not work, see the ‘Add Anaconda to Path (Optional)’ section in the DataCamp guide.
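The same check can also be scripted, which may be handy if you later want to confirm your setup from within Python. The sketch below uses only the standard library; the helper name is my own, and it simply runs each command with --version, as in the two commands above.

```python
import shutil
import subprocess


def tool_version(command):
    """Return the '--version' output of `command`, or None if it is not on PATH."""
    executable = shutil.which(command)
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some older tools write their version to stderr
        universal_newlines=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    for name in ("conda", "python"):
        print("{} -> {}".format(name, tool_version(name) or "not found on PATH"))
```

A result of "not found on PATH" for either command points back to the PATH option chosen during installation.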
If you have Windows issues, you can inspect the general Python Windows use guide or look into these possible issues after install.
For additional information about Anaconda, see the quick start user guide or the getting started tutorial (requires registration).
The ease of installation and package management with Anaconda does come at the cost of some bloat, as well as higher memory requirements for what gets loaded at startup. For the former problem, it is possible to replace Anaconda at a later date with Miniconda and conda, especially once your environment has stabilized and you have less frequent need for the Navigator package manager. As for memory management, if your machine is only marginally capable, you may want to look into these scripts or these other scripts.