Airbnb is currently considered the **largest hotel company in the world**. It is a platform that connects people who want to travel and find accommodation with hosts who want to rent out their properties in a practical way.

Data from the Airbnb website is publicly available for some of the world’s largest cities. Through the independent portal __Inside Airbnb__, it is possible to download a large amount of this data to develop Data Science projects and solutions.

**With that in mind, we are going to analyze the data for the city of Zurich in Switzerland and see what insights can be extracted from it.**

The 4 main questions of this project are:

What is the correlation between the variables?

What type of property is most rented on Airbnb in Zurich?

What is the most expensive location in Zurich?

What is the average minimum number of rental nights?

To answer these 4 main questions, we first need to answer 4 side questions:

How many variables and how many entries does our dataset have? What are the types of these variables?

What is the percentage of missing data in the dataset?

How are the variables distributed?

Are there any outliers in the dataset that could potentially lead us to distorted conclusions in our statistical analysis?

## 1. Data acquisition

For this initial exploratory analysis, we are going to use the file listings.csv (Summary information and metrics for listings in Zurich), available on the Inside Airbnb portal.

Let’s start by importing the necessary libraries and packages.

Now, it is time to import the csv file into a dataframe, then display the first 5 rows to help us visualize the data.
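A minimal sketch of these two steps, assuming pandas is installed. To keep the example runnable without the download, it reads a tiny made-up CSV string (`sample_csv`, a hypothetical stand-in that only borrows the column names of listings.csv). With the real file you would pass the path to your own copy instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for listings.csv so the sketch runs anywhere.
# The three rows below are invented for illustration, not real Inside Airbnb data.
sample_csv = io.StringIO(
    "id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,"
    "room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,"
    "calculated_host_listings_count,availability_365\n"
    "1,Sample flat A,10,Anna,,City,47.37,8.54,Entire home/apt,150,2,12,2020-01-05,0.40,1,120\n"
    "2,Sample flat B,20,Ben,,City,47.38,8.53,Entire home/apt,210,3,4,2019-12-01,0.15,2,300\n"
    "3,Sample room C,30,Carla,,Oerlikon,47.41,8.55,Private room,70,1,0,,,1,365\n"
)

# With the real download, replace sample_csv with the path to listings.csv.
df = pd.read_csv(sample_csv)

# Display the first 5 rows (the sample only has 3).
print(df.head())
```

In a notebook, ending the cell with `df.head()` instead of `print(df.head())` renders the rows as a formatted table.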

## 2. Data analysis

Since we visualized the first 5 rows of the dataframe, we know what the variables are. With that in mind, we need to create a dictionary of variables to help us understand how the data is structured.

**Dictionary of variables**

id — Number of id generated to identify the property

name — Name of the advertised property

host_id — Property owner (host) id number

host_name — Host name

neighbourhood_group — This column does not contain any valid values

neighbourhood — Neighborhood name

latitude — Property latitude coordinate

longitude — Property longitude coordinate

room_type — Type of room offered

price — Renting price of the property

minimum_nights — Minimum number of nights to book

number_of_reviews — Number of reviews the property has

last_review — Date of the last review

reviews_per_month — Number of reviews per month

calculated_host_listings_count — Number of properties from the same host

availability_365 — Number of availability days within 365 days

Before we start answering our main questions, we need to answer the side questions.

## 3. Answering Side Questions

**Side Question 1. How many attributes (variables) and how many entries does our dataset have? What are the types of the variables?**

Let’s go ahead and identify the number of entries that our dataset has and see the types of the variables in each column.

**Side Question 2. What is the percentage of missing data in the dataset?**

The quality of a dataset is directly related to the amount of missing data. It is important to understand early on whether these null values are significant compared to the total number of entries.

To help with visualization, let’s sort the variables in descending order by the amount of missing values.
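One way to check this, sketched on a small made-up DataFrame (the values are invented, and only a few of the real columns are included). On the real data, the same two attributes are read from the `df` loaded from listings.csv.

```python
import pandas as pd

# Made-up sample with a handful of the listing columns.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Sample flat A", "Sample flat B", "Sample room C"],
    "neighbourhood": ["City", "City", "Oerlikon"],
    "price": [150, 210, 70],
    "reviews_per_month": [0.4, 1.2, None],
})

# shape is a (rows, columns) tuple.
print(f"Entries: {df.shape[0]}")
print(f"Variables: {df.shape[1]}")

# dtypes lists the type pandas inferred for each column.
print(df.dtypes)
```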
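A short sketch of that calculation on a made-up sample, so the percentages printed here are not those of the Zurich dataset. It relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import pandas as pd

# Invented sample with some missing cells.
df = pd.DataFrame({
    "name": ["Sample flat A", None, "Sample room C", "Sample room D"],
    "price": [150, 210, 70, 95],
    "last_review": ["2020-01-05", None, None, "2019-11-20"],
    "reviews_per_month": [0.4, None, None, 1.1],
})

# isnull() marks missing cells; the column mean of those booleans
# is the fraction missing, which we turn into a percentage.
missing_pct = (df.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```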

By looking at this output, we can infer that:

The reviews_per_month and last_review variables have null values in almost 25% of the rows.

The name variable has very few missing values.

The other variables have no missing values.

**Side Question 3. How are the variables distributed?**

To answer that question, we need to plot histograms of the numerical variables.

**Side Question 4. Are there any outliers in the dataset that could potentially lead us to distorted conclusions in our statistical analysis?**
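A possible sketch using pandas' built-in `DataFrame.hist`, which draws one histogram per numerical column. The sample values are made up, and the `bins` and `figsize` settings are just reasonable guesses rather than the ones used for the original plots.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented numerical sample; one extreme value per column mimics an outlier.
df = pd.DataFrame({
    "price": [70, 95, 120, 150, 180, 210, 250, 12500],
    "minimum_nights": [1, 1, 2, 2, 3, 4, 5, 365],
    "number_of_reviews": [0, 3, 8, 12, 25, 40, 60, 5],
})

# One subplot per numerical column.
axes = df.hist(bins=15, figsize=(15, 10))
plt.tight_layout()

# In a notebook the figure renders inline; in a script, add
# plt.show() or plt.savefig("histograms.png") here.
```

Even on this toy sample, a single extreme value squeezes all the other bars into the first bin, which is the effect discussed in the next question.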

By analyzing the histograms generated in the previous question, we can see that there are some outliers in our dataset. The affected variables are price, minimum_nights and calculated_host_listings_count.

Their values do not follow a regular distribution and distort the entire graphical representation. To confirm that, there are two quick ways to help detect outliers:

A statistical summary using the describe() method

Boxplots for each variable
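The first of these two checks can be sketched as follows. The sample is made up: only the two maximum values (12500 and 365) are borrowed from the figures quoted in this article, so the quartiles printed here will not match the real summary.

```python
import pandas as pd

# Invented sample; the maxima mirror the extreme values mentioned in the text.
df = pd.DataFrame({
    "price": [70, 95, 120, 150, 210, 12500],
    "minimum_nights": [1, 1, 2, 3, 5, 365],
    "calculated_host_listings_count": [1, 1, 1, 2, 2, 30],
})

# describe() returns count, mean, std, min, quartiles and max per column.
summary = df[["price", "minimum_nights", "calculated_host_listings_count"]].describe()
print(summary)
```

A large gap between the `75%` row and the `max` row is the signal to look for.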

Looking at the statistical summary above, we can confirm some hypotheses such as:

The price variable has a value below 156 in 75% of the occurrences, but its maximum value is 12500.

The minimum_nights variable has a value below 5 in 75% of the occurrences, but its maximum value is 365.

**Boxplot for minimum_nights**

**Boxplot for price**
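Both boxplots can be sketched in one go. The data is made up, and the count of points above the upper whisker uses the common 1.5 × IQR convention, which is an assumption on our side and not necessarily the cut-off used elsewhere in this analysis.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample with one extreme value per column.
df = pd.DataFrame({
    "price": [70, 95, 120, 150, 210, 12500],
    "minimum_nights": [1, 1, 2, 3, 5, 365],
})

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
outlier_counts = {}

for ax, column in zip(axes, ["minimum_nights", "price"]):
    # Draw the boxplot for this column.
    df[column].plot(kind="box", ax=ax)
    ax.set_title(column)

    # Upper fence under the 1.5 * IQR convention (an assumption).
    q1, q3 = df[column].quantile([0.25, 0.75])
    upper_fence = q3 + 1.5 * (q3 - q1)
    outlier_counts[column] = int((df[column] > upper_fence).sum())
    print(f"{column}: {outlier_counts[column]} value(s) above {upper_fence:.2f}")

fig.tight_layout()
```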

## 4. Cleaning data

We clean the data because analyzing it without removing the outliers can distort reality.

Since we have identified outliers in the price and minimum_nights variables, we will now remove them from the DataFrame and plot the histogram again.
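A minimal sketch of the cleaning step on made-up data. The two cut-offs, `PRICE_CAP` and `MIN_NIGHTS_CAP`, are hypothetical values picked only for illustration; on the real data they should be chosen by looking at the boxplots and the statistical summary.

```python
import pandas as pd

# Invented sample with one outlier in each column.
df = pd.DataFrame({
    "price": [70, 95, 120, 150, 210, 12500],
    "minimum_nights": [1, 1, 2, 3, 365, 5],
})

# Hypothetical cut-offs, for illustration only.
PRICE_CAP = 1500
MIN_NIGHTS_CAP = 30

# Keep only the rows that are within both limits.
keep = (df["price"] <= PRICE_CAP) & (df["minimum_nights"] <= MIN_NIGHTS_CAP)
df_clean = df[keep].copy()

print(f"Rows before: {len(df)}, rows after: {len(df_clean)}")
```

Working on a copy (`df_clean`) leaves the original DataFrame untouched, so the two versions can still be compared. Calling `df_clean.hist()` afterwards redraws the histograms without the extreme values.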

**Histogram without outliers**

## 5. Answering Main Questions

**Main Question 1. What is the correlation between the variables?**

To identify the correlation between the variables of interest, let’s:

Create a correlation matrix

Generate a heatmap from this matrix, using the seaborn library
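These two steps could look like the sketch below, assuming seaborn and matplotlib are installed. The sample values and the choice of columns are made up, and the styling arguments passed to `sns.heatmap` are one option among many.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented sample standing in for the cleaned DataFrame.
df_clean = pd.DataFrame({
    "price": [70, 95, 120, 150, 210, 180],
    "minimum_nights": [1, 2, 2, 3, 5, 4],
    "number_of_reviews": [40, 12, 25, 3, 0, 8],
    "availability_365": [365, 120, 200, 90, 30, 60],
})

# Step 1: pairwise Pearson correlation between the numerical columns.
corr = df_clean.corr()
print(corr)

# Step 2: heatmap of the matrix, with the coefficient written in each cell.
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr, cmap="RdBu", annot=True, fmt=".2f",
            square=True, linecolor="white", ax=ax)
fig.tight_layout()
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate little or none.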

**Main Question 2. What type of property is most rented on Airbnb?**

The room_type column indicates the type of rental that is advertised on Airbnb. If you have already rented on the website, you know that the options include entire apartments or houses, a private room, or a room shared with other people.

Let’s count the number of occurrences of each type of rental, using the value_counts() method.

We can see that the most rented type of property is the entire home or apartment.
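A sketch of the count on a made-up sample. The category labels follow the usual Airbnb naming, but the counts are invented; `normalize=True` is added here to also show each type as a percentage of all listings.

```python
import pandas as pd

# Invented sample standing in for the cleaned DataFrame.
df_clean = pd.DataFrame({
    "room_type": [
        "Entire home/apt", "Entire home/apt", "Private room",
        "Entire home/apt", "Shared room", "Private room",
    ]
})

# Absolute number of listings per room type, most frequent first.
counts = df_clean["room_type"].value_counts()
print(counts)

# Same breakdown as a percentage of all listings.
share = df_clean["room_type"].value_counts(normalize=True) * 100
print(share.round(1))
```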

**Main Question 3. What is the most expensive location in Zurich?**

One way to check one variable against another is to use groupby(). In this case, we want to compare the neighborhoods based on the rental price.
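A sketch of that comparison on made-up data (the neighbourhood names are only labels and the prices are invented). Besides the mean price, it also counts the listings in each neighbourhood, since an average computed over very few properties is not very reliable.

```python
import pandas as pd

# Invented sample standing in for the cleaned DataFrame.
df_clean = pd.DataFrame({
    "neighbourhood": ["City", "City", "Oerlikon", "Oerlikon", "Altstetten"],
    "price": [200, 180, 90, 110, 80],
})

# Mean price and number of listings per neighbourhood, most expensive first.
by_area = (
    df_clean.groupby("neighbourhood")["price"]
    .agg(mean_price="mean", listings="count")
    .sort_values("mean_price", ascending=False)
)
print(by_area.head(10))
```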

The most expensive location in Zurich is the “City” (Downtown).

**Main Question 4. What is the average minimum number of rental nights (minimum_nights)?**

The average minimum number of rental nights is around 4 to 5.
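This kind of figure comes from a one-line calculation, sketched here on made-up values (so the result printed below is not the one from the Zurich data). The median is shown as well because it is less sensitive to any long stays that survive the cleaning step.

```python
import pandas as pd

# Invented sample standing in for the cleaned DataFrame.
df_clean = pd.DataFrame({"minimum_nights": [1, 2, 3, 5, 7, 9]})

mean_nights = df_clean["minimum_nights"].mean()
median_nights = df_clean["minimum_nights"].median()

print(f"Mean minimum nights: {mean_nights:.1f}")
print(f"Median minimum nights: {median_nights:.1f}")
```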

## 6. Conclusions

Even a superficial analysis of the Airbnb data was enough to identify outliers in some variables, which can distort reality during the analysis of the data.

It was also noted that in some locations there are few properties available, which can also cause distortions in the statistical analysis.

Finally, it is important to note that this dataset is a summarized version. For further analysis, it is recommended to use the complete dataset.
