Data Wrangling Best Seller Books
This is a project based on wrangling and cleaning data from various sources.
Explore the docs » · Report Bug · Request Feature
Table of Contents
About The Project
Built With
- PowerPoint
- Word
- Jupyter Notebook
Getting Started
To get a local copy up and running, follow these simple steps.
Prerequisites
- Microsoft Office Suite
- PDF Reader
Installation
- Clone the repo
git clone https://github.com/AMeyer89/DataWranglingBestSellerBooks.git
Description
During Milestone one I selected three data sources: Amazon reviews for book categories, bestsellers from the Amazon website, and Goodreads for an API call. After feedback and further research, I switched my website data source to the New York Times Best Sellers list. The hard part of this milestone was finding three data sources that related to one another.
Milestone two was cleaning and transforming the Amazon book reviews CSV file. This data set only had the user ID, ISBN, date/time, and rating. This step was not as hard as the others, since I have pulled data from files into a dataframe before. I did have to do some research to learn how to convert and manipulate date and time values. Aggregation and group by were another topic I learned about during Milestone two. This was interesting and felt very similar to how it is done in SQL.
Milestone three was working with the New York Times Best Sellers website. I learned a lot with this milestone, since I had not worked with Beautiful Soup before. While working with a for loop I had to brush up on lists and nested lists. Throughout this whole step I kept trying to use a list like a dataframe.
I continued to use Beautiful Soup for Milestone four. In this phase I used an API to search all books by an author on Goodreads. I liked this step the most, because I felt like I learned a lot about manipulating the author's name to work in the API call.
Finally, in Milestone five I used the search_author function created in Milestone four to search for additional books by the authors on the New York Times Best Sellers list. My favorite part of Milestone five was figuring out how to loop through all the authors and call the search function from Milestone four. I liked using SQL to pull the data from the database. It was easier for me than aggregating with the dataframes from the previous milestones.
My major disappointment was picking the CSV file that I did. Initially I thought I would be able to join those reviews with my other data sets, but the ISBN field was actually an ASIN. ASIN stands for Amazon Standard Identification Number. It is a 10-character alphanumeric unique identifier that is assigned by Amazon.com. I was unable to join it with the other two datasets. Lastly, I ended up only doing bar charts for my data visualization, since it was all categorical data.
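The join problem described above can be caught early by screening the identifier column before attempting a merge. The following is only a minimal sketch, not the code used in this project: the function names is_valid_isbn10 and split_by_isbn10 are hypothetical, the sample values are made up for illustration, and it assumes the other data sets are keyed on ISBN-10. It applies the standard ISBN-10 check-digit rule (weights 10 down to 1, sum divisible by 11).

```python
def is_valid_isbn10(code: str) -> bool:
    """Return True if `code` has the shape and check digit of an ISBN-10."""
    code = code.replace("-", "").strip().upper()
    if len(code) != 10:
        return False
    total = 0
    for position, char in enumerate(code):
        if char in "0123456789":
            value = int(char)
        elif char == "X" and position == 9:
            # 'X' stands for 10 and is only allowed as the final check digit.
            value = 10
        else:
            # Any other letter means this cannot be an ISBN-10.
            return False
        # Weights run from 10 (first character) down to 1 (last character).
        total += (10 - position) * value
    return total % 11 == 0


def split_by_isbn10(codes):
    """Split identifiers into (ISBN-10-like, everything else)."""
    isbn_like = [c for c in codes if is_valid_isbn10(c)]
    other = [c for c in codes if not is_valid_isbn10(c)]
    return isbn_like, other


# Hypothetical sample: one real-format ISBN-10 and one made-up ASIN-style value.
sample = ["0306406152", "B00ABC1234"]
isbn_like, other = split_by_isbn10(sample)
print("Joinable on ISBN-10:", isbn_like)
print("Not ISBN-10:", other)
```

Passing this check only means a value has the shape of an ISBN-10; it does not prove that the value matches a book in another data set. A large share of values landing in the second list would be an early sign that the column holds ASINs rather than ISBNs.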
Contributing
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
License
Distributed under the MIT License. See LICENSE for more information.
Contact
April Meyer - swim53185@gmail.com
Project Link: https://github.com/AMeyer89/DataWranglingBestSellerBooks