Data Wrangling Best Seller Books

This is a project based on wrangling and cleaning data from various sources.

About The Project
- Built With
Getting Started
- Prerequisites
- Installation
Description
Contributing
License
Contact

About The Project

Built With

PowerPoint
Word
Jupyter Notebook

Getting Started

To get a local copy up and running follow these simple steps.

Prerequisites

Microsoft Office Suite
PDF Reader

Installation

Clone the repo

git clone https://github.com/AMeyer89/DataWranglingBestSellerBooks.git

Description

During Milestone one I selected three data sources: Amazon reviews for book categories, bestsellers from the Amazon website, and Goodreads for an API call. After feedback and further research, I switched my website data source to the New York Times Best Sellers list. The hard part of this milestone was finding three data sources that related to one another.

Milestone two was cleaning and transforming the Amazon book reviews CSV file. This data set only had the user ID, ISBN, date/time, and rating. This step was not as hard as the others, since I had pulled data from files into a dataframe before. I did have to do some research to learn how to convert and manipulate date and time values. Aggregation and group by were other topics I learned about during Milestone two. They were interesting and felt very similar to how it is done in SQL.

Milestone three was working with the New York Times Best Sellers website. I learned a lot in this milestone, since I had not worked with Beautiful Soup before. While working with a for loop I had to brush up on lists and nested lists. Throughout this step I kept trying to use a list like a dataframe.

I continued to use Beautiful Soup for Milestone four. In this phase I used an API to search for all books by an author on Goodreads. I liked this step the most, because I learned a lot about manipulating the author's name to work in the API call.

Finally, in Milestone five I used the search_author function created in Milestone four to search for additional books by the authors on the New York Times Best Sellers list. My favorite part of Milestone five was figuring out how to loop through all the authors and call the search function from Milestone four. I liked using SQL to pull the data from the database. It was easier for me than aggregating with the dataframes in the previous milestones.

My major disappointment was picking the CSV file that I did. Initially I thought I would be able to join those reviews with my other data sets, but the ISBN field was actually an ASIN. ASIN stands for Amazon Standard Identification Number. It is a 10-character alphanumeric unique identifier assigned by Amazon.com. I was unable to join it with the other two data sets. Lastly, I ended up only doing bar charts for my data visualization, since it was all categorical data.
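The identifier mismatch described above can be spotted before attempting a join. The following is a minimal sketch, not the code used in this project: it applies the standard ISBN-10 check-digit rule to each 10-character identifier, so values that fail are probably not ISBN-10s. The function name is_isbn10 and the sample identifiers are hypothetical and chosen only for illustration.

```python
def is_isbn10(identifier: str) -> bool:
    """Return True if the string passes the ISBN-10 check-digit test."""
    if len(identifier) != 10:
        return False
    total = 0
    for position, char in enumerate(identifier):
        if char in "0123456789":
            value = int(char)
        elif char in "Xx" and position == 9:
            value = 10  # 'X' stands for 10 and is only legal as the check digit
        else:
            return False  # letters elsewhere mean this is not an ISBN-10
        # Weights run from 10 (first character) down to 1 (check digit).
        total += (10 - position) * value
    return total % 11 == 0


# Hypothetical identifiers for illustration; not taken from the project's data.
sample_ids = ["0306406152", "080442957X", "B00EXAMPLE", "0306406153"]
joinable = [i for i in sample_ids if is_isbn10(i)]
print(joinable)
```

Passing the check only shows that a value is shaped like an ISBN-10. It does not guarantee that the same book appears in the other data sets, so counting the matching keys is still a useful first step before relying on the join.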

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)
Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Contact

April Meyer - swim53185@gmail.com

Project Link: https://github.com/AMeyer89/DataWranglingBestSellerBooks
