DATA146

Project 1

Question 1:

A Python package is a collection of modules and functions under a common namespace, which is indicated by the . used when calling something from it. A library, on the other hand, is a collection of packages that may be imported into a file. However, the term may also be used for the collection of functions inside a package rather than for a collection of packages.

Using a package and its library of functions requires that the user first import the package (it must already be installed, for example with pip):

   import pandas as pd
   
   import numpy

Later, the user can call a function from the package’s function library as though the function were created by the user, as long as the user specifies which package the function comes from.

   example = pd.DataFrame(some_data)

The pd. tells the computer to look for DataFrame() in the pandas package, which was imported under the alias pd above. If no alias had been given, the user would call it by its full name.

   example = numpy.array(some_argument)

As you can see, importing the package under an alias allows the user to access the functions inside it more quickly and cleanly (as long as the alias is easy to remember!). Aliases are especially helpful when using the package repeatedly!
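As a small illustration of calling functions through an alias, here is a minimal sketch that uses np, the conventional alias for numpy. The variable names and the numbers are made up for the example.

```python
import numpy as np  # "np" is the conventional alias for numpy

# Build an array, then call two numpy functions through the alias.
values = np.array([1, 2, 3, 4])
total = np.sum(values)
average = np.mean(values)

print(total, average)
```

Without the alias, each call would be written as numpy.array(...), numpy.sum(...), and so on.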

Question 2:

A dataframe is a (typically two-dimensional) data structure that organizes and allows clear presentation of a data set. In Python, the pandas package offers tools to process and analyze data by creating dataframes from data sets that may be either imported or created.

pandas has a useful function read_csv() that allows a user to import the contents of an external file. It takes as an argument the path of the file to be imported, and may take an additional argument, sep, specifying the delimiter if the values are not separated by commas.

So, to import the file my_data.csv into the file in which I’m working, I would call the read_csv() function in pandas and pass the file path as an argument. Here, I will assume that my_data.csv is in the same directory as my current file and so I don’t need to supply the full path.

   import pandas as pd
   
   imported_data = pd.read_csv('my_data.csv')

If the file my_data.csv had instead been my_data.tsv, indicating values separated by tabs instead of commas, I would need to pass in the argument sep= to tell the function to look for tabs.

   import pandas as pd
   
   imported_data = pd.read_csv('my_data.tsv', sep='\t')
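To try the sep argument without a file on disk, the sketch below reads tab-separated text from an in-memory buffer. The column names and values are hypothetical stand-ins for the contents of my_data.tsv, not the real data.

```python
import io
import pandas as pd

# Hypothetical stand-in for my_data.tsv; the names and values are made up.
tsv_text = (
    "name\tfavorite_color\tshirts\n"
    "Ana\tblue\t4\n"
    "Ben\tgreen\t2\n"
    "Cy\tred\t6\n"
)

# read_csv accepts a file-like object as well as a path.
imported_data = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(imported_data)
```

Leaving out sep="\t" here would make pandas look for commas and read each line as a single column.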

Suppose my_data.tsv contains data for the favorite colors of people in my family and the number of shirts they own in that color. If I want summary statistics for the number of favorite-colored shirts per person, I can call the dataframe's describe() method, which returns a table containing the count, mean (average), standard deviation, minimum, maximum, and quartiles of each numeric column.

The .shape attribute (notice it has no parentheses!) returns the number of rows and columns in the dataframe. Rows are often called observations, and columns are often called variables or features.
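Here is a minimal sketch of both tools on a small dataframe. The family data is invented for illustration and the column names are assumptions.

```python
import pandas as pd

# Invented data: one row per family member.
family = pd.DataFrame({
    "favorite_color": ["blue", "green", "red", "blue"],
    "shirts": [4, 2, 6, 8],
})

# describe() is a method (parentheses); by default it summarizes numeric columns.
summary = family.describe()

# shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = family.shape

print(summary)
print(n_rows, n_cols)
```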

Question 3:

The year variable contains the years for which data (population, life expectancy, GDP, etc.) was recorded. A brief analysis of the years reveals that data was recorded every five years from 1952 until 2007. In order to update the data to 2021, we would add two more entries after 2007, one in 2012 and one in 2017.
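The five-year pattern can be checked with a short sketch that uses plain Python ranges rather than the dataset itself. The variable names are assumptions.

```python
# Years on record: every five years from 1952 through 2007.
recorded_years = list(range(1952, 2008, 5))

# Years needed to extend the same pattern up to 2021.
new_years = list(range(recorded_years[-1] + 5, 2022, 5))

print(len(recorded_years), recorded_years[0], recorded_years[-1])
print(new_years)
```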

Question 4:

The lowest recorded life expectancy is from Rwanda in 1992, when it was just over 23 years. This was possibly due to civil unrest and widespread violence: in 1992, Rwanda was in the middle of a civil war that ended in the Rwandan genocide of 1994.

Question 5:

Country    GDP (Nearest billion)
Germany    265B
France     186B
Italy      166B
Spain      117B

Question 6:

You have been introduced to four logical operators thus far: &, ==, | and ^. Describe each one including its purpose and function. Provide an example of how each might be used in the context of programming.
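One possible illustration of the four operators, applied element by element to two boolean pandas Series, is sketched below. The Series a and b are invented for the example.

```python
import pandas as pd

a = pd.Series([True, True, False, False])
b = pd.Series([True, False, True, False])

both = a & b         # and: True only where a and b are both True
same = a == b        # equality: True where a and b hold the same value
either = a | b       # or: True where at least one of a, b is True
exactly_one = a ^ b  # exclusive or: True where exactly one of a, b is True

print(pd.DataFrame({"a": a, "b": b, "&": both, "==": same, "|": either, "^": exactly_one}))
```

In a dataframe context, these operators are typically used to combine conditions when filtering rows.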

Question 7:

Consider our GDP table from above (with an additional index column) to be a dataframe called df:

Index    Country    GDP (Nearest billion)
0        Germany    265B
1        France     186B
2        Italy      166B
3        Spain      117B

Since the rows are already sorted from greatest to least GDP, we can subset the three lowest GDPs with .iloc by taking the last three rows, from index 1 (France) up to the first index which we do not want to include (in this case, index 4; alternatively, we could have left the end index blank to indicate stopping at the end).

         df.iloc[1:4]

The resulting table will then read:

Index    Country    GDP (Nearest billion)
1        France     186B
2        Italy      166B
3        Spain      117B
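The subsetting can be reproduced with a minimal sketch that rebuilds the table by hand. The GDP values are kept as the strings shown in the table.

```python
import pandas as pd

# Rebuild the GDP table; the default index is 0..3.
df = pd.DataFrame({
    "Country": ["Germany", "France", "Italy", "Spain"],
    "GDP (Nearest billion)": ["265B", "186B", "166B", "117B"],
})

# Positions 1, 2 and 3; the end position 4 is excluded.
lowest_three = df.iloc[1:4]

# Leaving the end blank means "through the last row".
same_result = df.iloc[1:]

print(lowest_three)
```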

Question 8:

An Application Programming Interface (API) acts as an intermediary between a data source (such as a web source) and the local file in which a user analyzes or processes the data. It allows the user to access, copy, and alter the data locally when the code is run later.

We will need to import requests, os, and pandas, in order to construct a request to a remote server, pull the data and write to a local file, and import the data into a local work session, respectively.

         import requests
         import os
         import pandas

Next, we need to specify a data source by its url and a folder name in which the data should go when it is pulled from the web. Here, I’ve called this folder “Data” and provided an if conditional that uses os to check if there is such a folder in the directory in which I’m working– if there is no such folder, it will create it for me. Then I specify a filename into which I will write the copied data.

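         url = 'https://onlinedata.com/download_data'
         
         folder_name = 'Data'
         # Create the folder only if it does not already exist
         if not os.path.exists(folder_name):
            os.makedirs(folder_name)
            
         # Build the path to the output file inside the folder
         file_name = os.path.join(folder_name, 'File')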

Finally, the last steps are to access the web_source using requests, and open (open()) and write() a copy of the data into the file I created.

Once this is done, I can use pandas to simply read the file and create a dataframe!

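         # Send a GET request to the data source
         web_source = requests.get(url)
         
         # Write the raw response bytes to the local file
         with open(file_name, 'wb') as f:
            f.write(web_source.content)
            
         # Load the saved file into a dataframe
         my_dataframe = pandas.read_csv(file_name)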

Question 9:

The pandas apply() function allows a particular function to be applied repeatedly throughout a dataframe. It allows users to specify the axis along which to apply the function using the axis argument (axis=0 applies the function to each column, and axis=1 applies it to each row). Instead of the user applying it to each individual “cell” within the dataframe by hand, it allows operations to be performed in bulk.

Alternatively, a user may execute a for loop to visit each cell in a series, though this is less preferable as it requires more lines of code. The apply() function essentially allows us to execute a for loop across the contents of a series, applying an operation to each item in the series as the function iterates through.
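The comparison can be sketched as follows. The helper functions double and spread and the sample data are hypothetical names chosen for illustration.

```python
import pandas as pd

shirts = pd.Series([4, 2, 6])

def double(x):
    """Return twice the input value."""
    return x * 2

# apply() calls double on every item in the series.
doubled_apply = shirts.apply(double)

# The equivalent explicit for loop takes more lines.
doubled_loop = []
for value in shirts:
    doubled_loop.append(double(value))

def spread(values):
    """Return the difference between the largest and smallest value."""
    return values.max() - values.min()

table = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
column_spread = table.apply(spread, axis=0)  # one result per column
row_spread = table.apply(spread, axis=1)     # one result per row

print(doubled_apply.tolist(), doubled_loop)
print(column_spread.tolist(), row_spread.tolist())
```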

Question 10:

Instead of using iloc to subset a number of variables to a new data frame, a user may prefer to use logical operators to group together and select for desired conditions, then use [] brackets to subset a dataframe by filtering out columns or rows that don’t meet the required criteria.

First, we would set a conditional (or series of conditionals, if needed), such as dataframe['Column']=='Contains_This'. Then, we could select each row for which this conditional is True by placing it inside the brackets of the original dataframe (or a copy!):

         filtered_dataframe = dataframe[dataframe['Column']=='Contains_This']

The resulting dataframe filtered_dataframe will contain only the rows whose value in column Column equals Contains_This.
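A minimal sketch of this filtering, including a second condition joined with &, might look like the following. The Count column and all of the values are hypothetical additions for the example.

```python
import pandas as pd

# Invented data; "Count" is a hypothetical second column.
dataframe = pd.DataFrame({
    "Column": ["Contains_This", "Other", "Contains_This", "Other"],
    "Count": [1, 5, 7, 3],
})

# The conditional produces a boolean Series, one value per row.
mask = dataframe["Column"] == "Contains_This"
filtered_dataframe = dataframe[mask]

# Each condition needs its own parentheses when combined with & or |.
combined = dataframe[(dataframe["Column"] == "Contains_This") & (dataframe["Count"] > 5)]

print(filtered_dataframe)
print(combined)
```

Note that the filtered rows keep their original index labels.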