Name some descriptive statistic functions used with DataFrames.

**Answer**

Some descriptive statistic functions used with DataFrames are min(), max(), idxmax(), idxmin(), mode(), mean(), median(), count(), sum(), quantile(), std(), var().

To calculate statistical values for each row, the axis argument should be.

**Answer**

The axis argument should be equal to 1 to calculate statistical values for each row.

What are quantile and quartiles ?

**Answer**

Quantile is a process of dividing the total distribution of given data into a given number of equal proportions.

Quartile refers to the division of the total distribution of given data into a four equal proportions with each containing one fourth of the total population.

What does quantile() do ?

**Answer**

The `quantile()`

function is used to calculate any type of quantiles in Pandas.

Name pivoting functions available for DataFrames.

**Answer**

The pivoting functions available for DataFrames are pivot(), pivot_table(), melt(), stack(), unstack(), crosstab().

What does hist() do ?

**Answer**

The `hist()`

function of PyPlot module is used to create and plot histogram from a given sequence(s) of numbers.

What are missing values in a dataset ?

**Answer**

Missing values are the values that cannot contribute to any computation or we can say that missing values are the values that carry no computational significance.

Write two methods generally used for handling missing values.

**Answer**

The `dropna()`

and `fillna()`

are the two methods used for handling missing values in a DataFrame.

What is the difference between dropna() and fillna() ?

**Answer**

The `dropna()`

method is used to drop all rows that contain NaN values from a DataFrame. On the other hand, the `fillna()`

method is used to replace all NaN values in a Pandas object with the specified value.

To iterate over horizontal subsets of DataFrame, ............... function may be used.

- iterate()
- iterrows()
- itercols()
- iteritems()

**Answer**

iterrows()

**Reason** — The `iterrows()`

method iterates over a DataFrame row-wise, where each horizontal subset is in the form of (row-index, Series) where Series contains all column values for that row-index.

To iterate over vertical subsets of a DataFrame, ............... function may be used.

- iterate()
- iterrows()
- itercols()
- iteritems()

**Answer**

iteritems()

**Reason** — The `iteritems()`

method in pandas iterates over a DataFrame column-wise, where each vertical subset is in the form of (column-index, Series) where Series contains all row values for that column-index.

To add two DataFrames' values, ............... function may be used.

- plus
- rplus
- add
- radd

**Answer**

add, radd

**Reason** — We can add two DataFrame objects using either (+) operator or using `add()`

function as per syntax : `<DF1>.add(<DF2>)`

which means `<DF1>+<DF2>`

or by using `radd()`

function i.e., reverse add as per syntax : `<DF1>.radd(<DF2>)`

which means `<DF2>+<DF1>`

.

To subtract the values of two DataFrames, ............... function may be used.

- sub
- difference
- minus
- rsub

**Answer**

sub, rsub

**Reason** — We can subtract two DataFrame objects using either (-) operator or using `sub()`

function as per syntax : `<DF1>.sub(<DF2>)`

which means `<DF1>-<DF2>`

or by using `rsub()`

function i.e., reverse subtract as per syntax : `<DF1>.rsub(<DF2>)`

which means `<DF2>-<DF1>`

.

To divide the values of two DataFrames, ............... function may be used.

- divide
- div
- rdiv
- division

**Answer**

div, rdiv

**Reason** — We can divide two DataFrame objects using either (/) operator or using `div()`

function as per syntax : `<DF1>.div(<DF2>)`

which means `<DF1>/<DF2>`

or by using `rdiv()`

function i.e., reverse division as per syntax : `<DF1>.rdiv(<DF2>)`

which means `<DF2>/<DF1>`

.

Which of the following would give the same output as DF/DF1 where DF and DF1 are DataFrames ?

- DF.div(DF1)
- DF1.div(DF)
- Divide(DF, DF1)
- Div(DF, DF1)

**Answer**

DF.div(DF1)

**Reason** — The `div()`

function with the syntax `<DF>.div(<DF1>)`

, performs element-wise division between `DF`

and `DF1`

, which is equivalent to `<DF>/<DF1>`

in terms of functionality and output.

To skip NaN values in a calculation, you can specify ............... attribute.

- NaN
- NA
- skipna
- all of these

**Answer**

skipna

**Reason** — The `skipna`

attribute in Pandas is used to skip (exclude) NaN, NA, or null values when performing calculations.

Which of the following is not a valid function that can be used with DataFrames ?

- count()
- sum()
- length()
- mad()

**Answer**

length()

**Reason** — In Pandas, the `length()`

function is not a valid function for DataFrames. Instead, we can use the `len()`

function to get the length of a DataFrame. The other functions listed (count(), sum(), mad()) are valid functions for performing operations on DataFrames.

The technique that divides total distribution of data into a given number of equal proportions is called a ............... .

- quartile
- tercile
- median
- quantile

**Answer**

quantile

**Reason** — Quantile is a process of dividing the total distribution of given data into a given number of equal proportions.

............... divides the total distribution in four equal parts.

- quartile
- tercile
- median
- quantile

**Answer**

quartile

**Reason** — Quartile refers to the division of the total distribution of given data into a four equal proportions with each containing one fourth of the total population.

To divide total distribution of given data in two equal parts, ............... function is used.

- median()
- quartile()
- quantile()
- all of these

**Answer**

quantile()

**Reason** — The `quantile()`

function is used to divide the total distribution of given data into two equal parts. For example, to divide the data of DataFrame `df`

into two equal parts, the statement would be `df.quantile(q = 0.5)`

.

To divide total distribution of given data in four equal parts, ............... function is used.

- median()
- quartile()
- quantile()
- all of these

**Answer**

quantile()

**Reason** — The `quantile()`

function is used to divide the total distribution of given data into four equal parts. For example, to divide the data of DataFrame `ndf`

into four equal parts, the statement would be `df.quantile(q = 0.25)`

.

To divide total distribution of given data in eight equal parts, ............... function is used.

- median()
- quartile()
- quantile()
- all of these

**Answer**

quantile()

**Reason** — To divide the total distribution of given data into eight equal parts, we can use the `quantile()`

function with a quantile value of 0.125 (1/8). For example, to divide the data of DataFrame `df1`

into eight equal parts, the statement would be `df.quantile(q = 0.125)`

.

Which of the following is not a descriptive statistics function ?

- count()
- add()
- sum()
- max()

**Answer**

add()

**Reason** — The `add()`

function is a binary operation function, while `count()`

, `sum()`

, `max()`

are descriptive statistics functions.

Which function calculates descriptive statistical details for a DataFrame ?

- info()
- describe()
- show()
- list()

**Answer**

describe()

**Reason** — The descriptive statistics details can be calculated through a single function, namely the `describe()`

function.

To calculative cumulative sum of a column of a DataFrame, you may use ............... function.

- sum()
- sum(cumulative = True)
- cumsum()
- none of these

**Answer**

cumsum()

**Reason** — The `cumsum()`

function in pandas is used to calculate the cumulative sum of a column in a DataFrame i.e., in the output of this function, the value of each row is replaced by the sum of all prior rows, including this row.

The function to get the index of maximum value in a column of DataFrame is ............... .

- max()
- index()
- idxmax()
- maxidx()

**Answer**

idxmax()

**Reason** — The `idxmax()`

function in pandas is used to get the index of the maximum value in a column of a DataFrame.

To get top 5 rows of a DataFrame, you may use ............... function.

- head()
- head(5)
- top()
- top(5)

**Answer**

head(), head(5)

**Reason** — The `head()`

function in pandas is used to get the top `n`

rows of a DataFrame, where `n`

is an optional argument. If `n`

is not provided, it defaults to 5. Hence, both `head()`

and `head(5)`

are correct and will return the top 5 rows of the DataFrame.

To get bottom 3 rows of a DataFrame, you may use ............... function.

- tail()
- tail(3)
- bottom()
- bottom(3)

**Answer**

tail(3)

**Reason** — The `tail()`

function in pandas is used to get the bottom `n`

rows of a DataFrame, where `n`

is an optional argument. Hence, `tail(3)`

is correct and will return the bottom 3 rows of the DataFrame.

Function ............... can be used to drop missing values.

- fillna()
- isnull()
- dropna()
- delna()

**Answer**

dropna()

**Reason** — The `dropna()`

function in pandas is used to drop rows with missing values.

Which of the following methods of combining two DataFrames is a patching method ?

- concat()
- merge()
- join()
- none of these

**Answer**

join()

**Reason** — The `join()`

function in pandas is a patching method, which means that it aligns the DataFrames based on a common index or set of keys, and then combines them.

* iterrows()* and

*functions help you to iterate over a DataFrame.*

**iteritems()**To add two DataFrames, you may use functions * add()* or

*.*

**radd( )**To concatenate two string columns of a DataFrame, * +* operator is used.

The * mode()* function returns the maximum repeating value.

The * mean()* function returns the average of given data.

The * median()* function returns the halfway point in a given data.

Using * quantile()* function you can calculate terciles.

Using * quantile()* function you can calculate quartiles.

Using * quantile()* function you can calculate octiles.

Using * isnull()* method, you can check if there are any missing values in DataFrame.

The * merge()* function combines two DataFrames such that two rows with some common value are merged together in the final result.

The iteritems() iterates over the rows of a DataFrame.

**Answer**

False

**Reason** — The `iteritems()`

method in pandas iterates over a DataFrame column-wise, where each vertical subset is in the form of (column-index, Series) where Series contains all row values for that column-index.

The iteritems() iterates over the columns of a DataFrame.

**Answer**

True

**Reason** — The `iteritems()`

method in pandas iterates over a DataFrame column-wise, where each vertical subset is in the form of (column-index, Series) where Series contains all row values for that column-index.

The result produced by the functions sub() and rsub() is the same.

**Answer**

False

**Reason** — The `sub()`

and `rsub()`

functions produce different results because they subtract the operands in a different order. The `sub()`

function performs element-wise subtraction between two DataFrames, subtracting the right operand from the left operand. The syntax is `<DF1>.sub(<DF2>)`

, which means `<DF1> - <DF2>`

. On the other hand, the `rsub()`

function performs element-wise subtraction with the right operand subtracted from the left operand. The syntax is `<DF1>.rsub(<DF2>)`

, which means `<DF2> - <DF1>`

.

The result produced by the functions add() and radd() is the same.

**Answer**

True

**Reason** — When adding two DataFrame objects using the `add()`

function with the syntax `<DF1>.add(<DF2>)`

, it means `<DF1> + <DF2>`

. Similarly, using the `radd()`

function with the syntax `<DF1>.radd(<DF2>)`

means `<DF2> + <DF1>`

. Both operations produce the same results because addition is commutative, meaning the order of operands does not affect the result.

The result produced by the functions div() and rdiv() is the same.

**Answer**

False

**Reason** — The `div()`

and `rdiv()`

functions do not produce the same result. The `div()`

function performs element-wise division between two DataFrames, where the left operand is the dividend and the right operand is the divisor. The syntax is `<DF1>.div(<DF2>)`

, which means `<DF1> / <DF2>`

. On the other hand, the `rdiv()`

function performs element-wise division with the right operand as the dividend and the left operand as the divisor. The syntax is `<DF1>.rdiv(<DF2>)`

, which means `<DF2> / <DF1>`

.

The info() and describe() list the same information about a DataFrame.

**Answer**

False

**Reason** — The `info()`

method in pandas provides basic information about a DataFrame, including the data types of each column, the number of rows, and the memory usage. It also displays the index name and type. On the other hand, the `describe()`

method provides detailed descriptive statistics about the numerical columns in a DataFrame, including count, mean, standard deviation, minimum value, 25th percentile, median (50th percentile), 75th percentile, and maximum value.

Function **add()** and operator + give the same result.

**Answer**

True

**Reason** — In pandas, both the '+' operator and the `add()`

function can be used to perform element-wise addition between two DataFrames, resulting in the same output. The syntax for the '+' operator is `<DF1> + <DF2>`

, while the syntax for the `add()`

function is `<DF1>.add(<DF2>)`

, which is equivalent to `<DF1> + <DF2>`

.

Function **rsub()** and operator - give the same result.

**Answer**

False

**Reason** — The `rsub()`

function and the '-' operator in pandas do not produce the same result because the `rsub()`

function performs reverse subtraction, which means it subtracts the left operand from the right, while the '-' subtracts the right operand from the left operand.

The minus - operator's result is same as sub() and rsub().

**Answer**

False

**Reason** — The '-' operator in pandas subtracts the right operand from the left operand, similar to the `sub()`

function. However, the `rsub()`

function subtracts the left operand from the right operand. Therefore, '-' and `sub()`

produce the same result, while `rsub()`

produces a different result due to the order of operands.

Python integer datatype can store NaN values.

**Answer**

False

**Reason** — Python integer datatype cannot store NaN values. To store a NaN value in a column, the datatype of a column is changed to non-integer suitable type.

Functions sum() and cumsum() produce the same result.

**Answer**

False

**Reason** — The function `sum()`

returns the sum of the values for the requested axis. On the other hand, `cumsum()`

calculates the cumulative sum. In the output of this function, the value of each row is replaced by the sum of all prior rows, including this row.

The fillna() can also fill individual missing values for different columns.

**Answer**

True

**Reason** — The `fillna()`

function in pandas can be used to fill individual missing values for different columns. This can be done by passing a dictionary with column names as keys and the corresponding filling values as values to the `fillna()`

function.

To drop missing values from a DataFrame, the function used is delna().

**Answer**

False

**Reason** — To drop missing values from a DataFrame, the function used is `dropna()`

.

**Assertion.** A quantile refers to equally distributed portion of a data set.

**Reason.** A median divides a distribution in 2 quantiles while a quartile divides a distribution in 4 quantiles.

- Both A and R are true and R is the correct explanation of A.
- Both A and R are true but R is not the correct explanation of A.
- A is true but R is false.
- A is false but R is true.

**Answer**

Both A and R are true and R is the correct explanation of A.

**Explanation**

Quantile is a statistical term that refers to dividing the total distribution of given data into a given number of equal proportions. Each portion is called a quantile, and it represents a specific percentage of the data set. The median divides a distribution into two equal parts, representing the middle value of a data set when the data is sorted in ascending order. A quartile divides a distribution into four equal parts, with each quartile representing 25% of the data set.

**Assertion.** Data aggregation produces a summary statistics of a dataset.

**Reason.** Data aggregation summarizes data using statistical aggregation functions.

- Both A and R are true and R is the correct explanation of A.
- Both A and R are true but R is not the correct explanation of A.
- A is true but R is false.
- A is false but R is true.

**Answer**

Both A and R are true and R is the correct explanation of A.

**Explanation**

Data aggregation produces a summary statistic of a dataset because it involves combining and summarizing data from multiple sources or data points, and then calculating a single summary statistic or a set of summary statistics. This process utilizes statistical aggregation functions, such as sum, average, count, minimum, and maximum.

**Assertion.** In a dataset, there can be missing values that cannot contribute to any computation.

**Reason.** In a dataset, NULL, NaN or None are considered the missing values.

- Both A and R are true and R is the correct explanation of A.
- Both A and R are true but R is not the correct explanation of A.
- A is true but R is false.
- A is false but R is true.

**Answer**

Both A and R are true and R is the correct explanation of A.

**Explanation**

Missing values are the values that cannot contribute to any computation or we can say that missing values are the values that carry no computational significance. In a dataset, NULL, NaN or None are considered the missing values.

**Assertion (A).** The output of addition of two series will be NaN, if one of the elements or both the elements have no value(s).

**Reason (R).** While performing mathematical operations on a series, by default all missing values are filled in with 0.

- Both A and R are true and R is the correct explanation of A.
- Both A and R are true but R is not the correct explanation of A.
- A is true but R is false.
- A is false but R is true.

**Answer**

A is true but R is false.

**Explanation**

When adding two pandas Series and encountering NaN values in either or both Series, the result of the addition operation will be NaN for those corresponding elements. Additionally, pandas does not automatically replace missing values with 0 during mathematical operations but instead with NaN.

Name the function to iterate over a DataFrame horizontally.

**Answer**

The `iterrows()`

function iterates over a DataFrame horizontally.

Name the function to iterate over a DataFrame vertically.

**Answer**

The `iteritems()`

function iterates over a DataFrame vertically.

Write equivalent expressions for the given functions :

(i) A.add(B)

(ii) B.add(A)

(iii) A.sub(B)

(iv) B.sub(A)

(v) A.rsub(B)

(vi) B.mul(A)

(vii) A.rdiv(B)

(viii) B.div(A)

(ix) B.rdiv(A)

(x) A.div(B)

**Answer**

(i) A.add(B) — A + B

(ii) B.add(A) — B + A

(iii) A.sub(B) — A - B

(iv) B.sub(A) — B - A

(v) A.rsub(B) — B - A

(vi) B.mul(A) — B * A

(vii) A.rdiv(B) — B / A

(viii) B.div(A) — B / A

(ix) B.rdiv(A) — A / B

(x) A.div(B) — A / B

Is the result of sub() and rsub() the same? Why/why not ?

**Answer**

The `sub()`

and `rsub()`

functions produce different results because they subtract the operands in a different order. The `sub()`

function performs element-wise subtraction between two DataFrames, subtracting the right operand from the left operand. The syntax is `<DF1>.sub(<DF2>)`

, which means `<DF1> - <DF2>`

. On the other hand, the `rsub()`

function performs element-wise subtraction with the right operand subtracted from the left operand. The syntax is `<DF1>.rsub(<DF2>)`

, which means `<DF2> - <DF1>`

.

Write appropriate functions to perform the following on a DataFrame ?

(i) Calculate the sum

(ii) Count the values

(iii) Calculate the average

(iv) Calculate the most repeated value

(v) Calculate the median

(vi) Calculate the standard deviation

(vii) Calculate the variance

(viii) Calculate the maximum value

(ix) Calculate the standard deviation

(x) Calculate the variance

**Answer**

(i) Calculate the sum — sum() function

(ii) Count the values — count() function

(iii) Calculate the average — mean() function

(iv) Calculate the most repeated value — mode() function

(v) Calculate the median — median() function

(vi) Calculate the standard deviation — std() function

(vii) Calculate the variance — var() function

(viii) Calculate the maximum value — max() function

(ix) Calculate the standard deviation — std() function

(x) Calculate the variance — var() function

What does info() and describe() do ?

**Answer**

The `info()`

method in pandas provides basic information about a DataFrame, including the data types of each column, the number of rows, and the memory usage. It also displays the index name and type. On the other hand, the `describe()`

method provides detailed descriptive statistics about the numerical columns in a DataFrame, including count, mean, standard deviation, minimum value, 25th percentile, median (50th percentile), 75th percentile, and maximum value.

Are sum() and add() functions the same ?

**Answer**

No, the `sum()`

and `add()`

functions are not the same in pandas. The `sum()`

function in pandas calculates the sum of values along a specified axis. On the other hand, the `add()`

function in pandas is used to add two objects element-wise.

Name some functions that perform descriptive statistics on a DataFrame.

**Answer**

Some functions that perform descriptive statistics on a DataFrame are min(), max(), idxmax(), idxmin(), mode(), mean(), median(), count(), sum(), quantile(), std(), var().

To consider only the numeric values for calculation, what argument do you pass to statistics functions of Pandas ?

**Answer**

The `numeric_only = True`

argument is passed to statistics functions in pandas to consider only the numeric values (int, float, boolean columns) for calculation.

Is there one function that calculates much of descriptive statistics values ? Name it.

**Answer**

Yes, there is one function that calculates many descriptive statistics values, namely the describe() function.

What happens if mode() returns multiple values for a column but other columns have a single mode ?

**Answer**

When the `mode()`

function in pandas returns multiple values for a column (meaning there are multiple modes), but other columns have a single mode, pandas fills the non-mode columns with NaN values.

What is quantile and quartile ?

**Answer**

Quantile is a process of dividing the total distribution of given data into a given number of equal proportions.

Quartile refers to the division of the total distribution of given data into a four equal proportions with each containing one fourth of the total population.

Name the function that lets you calculate different types of quantiles.

**Answer**

The function used to calculate different types of quantiles is `quantiles()`

with syntax : `<dataframe>.quantile(q = 0.5, axis = 0, numeric_only = True)`

.

Name the functions that give you maximum and minimum values in a DataFrame.

**Answer**

The `max()`

and `min()`

functions find out the maximum and minimum values respectively from a DataFrame.

Name the functions that give you the index of maximum and minimum values in a DataFrame.

**Answer**

The `idxmax()`

and `idxmin()`

functions find out the index of maximum and minimum values respectively from a DataFrame.

What is missing data ?

**Answer**

Missing data, also known as missing values are the values that cannot contribute to any computation or we can say that missing values are the values that carry no computational significance.

Why is missing data filled in DataFrame with some value ?

**Answer**

The `dropna()`

method in pandas can be used to remove rows or columns with missing values from a DataFrame. However, this can also lead to a loss of non-null data, which may not be desirable for data analysis. To avoid losing non-null data, we can use the `fillna()`

method to fill missing values with a specified value. This can help to ensure that the dataset is complete and can be used for data analysis.

Name the functions you can use for filling missing data.

**Answer**

In Python, the `fillna()`

function is used to fill missing data in a Pandas DataFrame.

How do you iterate over a DataFrame?

**Answer**

To iterate over a DataFrame in Python, we can use various methods depending on our specific needs. We can use the `iterrows()`

method to iterate over each row in the DataFrame. In this method, each horizontal subset is in the form of (row-index, series), where the series contains all column values for that row-index. Additionally, we can use the `iteritems()`

method to iterate over each column in the DataFrame. Here, each vertical subset is in the form of (column-index, series), where the series contains all row values for that column-index.

What are binary operations ? Name the functions that let you perform binary operations on a DataFrame.

**Answer**

Binary operations refer to operations that require two values to perform and these values are picked element-wise. In a binary operation involving DataFrames, the data from the two DataFrames are aligned based on their row and column indexes. For matching row and column indexes, the specified operation is performed, while for non-matching row and column indexes, NaN values are stored in the result. The functions that perform binary operations on a DataFrame include add(), radd(), sub(), rsub(), mul(), div(), and rdiv().

What is descriptive statistics ? Name Pandas descriptive statistics functions.

**Answer**

A descriptive statistic is a summary statistic that quantitatively describes or summarizes features of a collection of information.

The functions that perform descriptive statistics on a DataFrame are min(), max(), idxmax(), idxmin(), mode(), mean(), median(), count(), sum(), quantile(), std(), var().

The info() and describe() are said to be inspection functions. What do they do ?

**Answer**

The `info()`

and `describe()`

functions in pandas are considered inspection functions because they provide valuable insights into the structure, content, and statistical summary of a DataFrame. The `info()`

method in pandas provides basic information about a DataFrame, including the data types of each column, the number of rows, and the memory usage. It also displays the index name and type. On the other hand, the `describe()`

method provides detailed descriptive statistics about the numerical columns in a DataFrame, including count, mean, standard deviation, minimum value, 25th percentile, median (50th percentile), 75th percentile, and maximum value.

What is the difference between sum and cumulative sum? How do you perform the two on DataFrame ?

**Answer**

The sum provides the total of all numbers in a sequence, while the cumulative sum represents the running total of numbers encountered up to a specific point in the sequence.

The `sum()`

function gives the total sum of values along specified axes, whereas the `cumsum()`

function provides the cumulative sum of values along specified axes, either column-wise or row-wise.

Name and explain some cumulative functions provided by Pandas.

**Answer**

The cumulative functions provided by Pandas are cumsum(), cumprod(), cummax(), cummin().

**cumsum()**— It calculates cumulative sum i.e., in the output of this function, the value of each row is replaced by sum of all prior rows including this row. The syntax is`<DF>.cumsum([axis = None])`

.**cumprod()**— It calculates cumulative product of values in a DataFrame object.**cummax()**— It calculates cumulative maximum of value from a DataFrame object.**cummin()**— It calculates cumulative minimum of value from a DataFrame object.

The head() and tail() extract rows or columns from a DataFrame. Explain.

**Answer**

The `head()`

function in pandas retrieves the top n rows of a DataFrame, where n is an optional argument defaulting to 5 if not provided. It is used with the syntax `<DF>.head()`

. Similarly, the `tail()`

function in pandas fetches the bottom n rows of a DataFrame, where n is also optional and defaults to 5 if not specified. Its syntax is `<DF>.tail()`

.

Why does Python change the datatype of a column as soon as it stores an empty value (NaN) even though it has all other values stored as integer ?

**Answer**

When a column in a pandas DataFrame contains a mixture of integers and missing values (NaNs), the data type is automatically changed to a floating-point type to accommodate these missing values because in Python integer types cannot store NaN values.

What do quantile and var() functions do ?

**Answer**

The `quantile()`

function returns the values at the given quantiles over requested axis (axis 0 or 1). On the other hand, `var()`

function computes variance and returns the unbiased variance over the requested axis.

What is a quartile ? How is it different from quantile ?

**Answer**

Quartile refers to the division of the total distribution of given data into a four equal proportions with each containing one fourth of the total population.

Quartiles specifically divide the dataset into four equal parts, while quantiles can divide the dataset into any desired number of equal parts.

How do you create quantiles and quartiles in Python Pandas ?

**Answer**

In Python pandas, we can create quantiles and quartiles using the `quantile()`

method.

The statement to create quartiles is `df.quantile(q = 0.25)`

.

Assume that required libraries (Pandas and Numpy) are imported and DataFrame ndf has been created as shown in solved problem 16. Predict the output produced by following code fragment :

```
print(ndf[ndf["age"] > 30])
print(ndf.head(2))
print(ndf.tail(3))
```

The following DataFrame `ndf`

is from solved problem 16 :

```
Name Sex Position City age Projects Budget
0 Rabina F Manager Bangalore 30 13 48
1 Evan M Programer New Delhi 27 17 13
2 Jia F Manager Chennai 32 16 32
3 Lalit M Manager Mumbai 40 20 21
4 Jaspreet M Programmer Chennai 28 21 17
5 Suji F Programmer Bangalore 32 14 10
```

**Answer**

```
Name Sex Position City age Projects Budget
2 Jia F Manager Chennai 32 16 32
3 Lalit M Manager Mumbai 40 20 21
5 Suji F Programmer Bangalore 32 14 10
Name Sex Position City age Projects Budget
0 Rabina F Manager Bangalore 30 13 48
1 Evan M Programer New Delhi 27 17 13
Name Sex Position City age Projects Budget
3 Lalit M Manager Mumbai 40 20 21
4 Jaspreet M Programmer Chennai 28 21 17
5 Suji F Programmer Bangalore 32 14 10
```

`print(ndf[ndf["age"] > 30])`

— The`ndf["age"] > 30`

creates a boolean mask where True indicates rows where the "age" column has values greater than 30. Then`ndf[...]`

uses this boolean mask to filter the DataFrame`ndf`

, returning only rows where the condition is True.`print(ndf.head(2))`

— It prints the first 2 rows of the DataFrame`ndf`

.`print(ndf.tail(3))`

— It prints the last 3 rows of the DataFrame`ndf`

.

Given the two DataFrames as :

```
>>> dfc1
- -
0 1
0 2 a
1 3 b
2 4 c
```

```
>>> dfc2
- -
0 1 2
0 2 3 4
2 p q r
```

Why are following statements giving errors ?

(a) print(dfc1 + dfc2)

(b) print(dfc1.sub(dfc2))

(c) print(dfc1 * dfc2)

**Answer**

(a) `print(dfc1 + dfc2)`

— This statement tries to add two DataFrames `dfc1`

and `dfc2`

. However, they have different shapes (`dfc1`

has 3 rows and 2 columns, while `dfc2`

has 2 rows and 3 columns), and their indices do not match. DataFrame addition requires both DataFrames to have the same shape and compatible indices, which is not the case here.

(b) `print(dfc1.sub(dfc2))`

— This statement attempts to subtract `dfc1`

from `dfc2`

using the `sub()`

method. Similar to addition, subtraction between DataFrames requires them to have the same shape and compatible indices, which is not satisfied here due to the mismatch in shapes and indices.

(c) `print(dfc1 * dfc2)`

— This statement tries to perform element-wise multiplication between `dfc1`

and `dfc2`

. Again, this operation requires both DataFrames to have the same shape and compatible indices, which is not the case here due to the mismatched shapes and indices.

Consider the following code that creates two DataFrames :

```
ore1 = pd.DataFrame(np.array([[20, 35, 25, 20], [11, 28, 32, 29]]),
columns = ['iron', 'magnesium', 'copper', 'silver'])
ore2 = pd.DataFrame(np.array([[14, 34, 26, 26], [33, 19, 25, 23]]),
columns = ['iron', 'magnesium', 'gold', 'silver'])
```

What will be the output produced by the following code fragments ?

(a) print(ore1 + ore2)

ore3 = ore1.radd(ore2)

print(ore3)

(b) print(ore1 - ore2)

ore3 = ore1.rsub(ore2)

print(ore3)

(c) print(ore1 * ore2)

ore3 = ore1.mul(ore2)

print(ore3)

(d) print(ore1 / ore2)

ore3 = ore1.rdiv(ore2)

print(ore3)

**Answer**

(a)

```
copper gold iron magnesium silver
0 NaN NaN 34 69 46
1 NaN NaN 44 47 52
copper gold iron magnesium silver
0 NaN NaN 34 69 46
1 NaN NaN 44 47 52
```

`print(ore1 + ore2)`

: This line attempts to add the DataFrames`ore1`

and`ore2`

using the '+' operator. When adding DataFrames with different shapes and column names, pandas aligns the DataFrames based on indices and columns, resulting in NaN values where elements are missing in either DataFrame.`ore3 = ore1.radd(ore2)`

: This line uses the`radd()`

method, which reverses the addition operation between DataFrames`ore1`

and`ore2`

.- The
`radd()`

function and '+' operator produce the same result in pandas, as the order of operands does not affect addition due to its commutative property.

(b)

```
copper gold iron magnesium silver
0 NaN NaN 6 1 -6
1 NaN NaN -22 9 6
copper gold iron magnesium silver
0 NaN NaN -6 -1 6
1 NaN NaN 22 -9 -6
```

`print(ore1 - ore2)`

: This line performs a subtraction operation between corresponding elements in DataFrames`ore1`

and`ore2`

. When subtracting DataFrames with different shapes and column names, pandas aligns the DataFrames based on indices and columns, resulting in NaN values where elements are missing in either DataFrame.`ore3 = ore1.rsub(ore2)`

: This line uses the`rsub()`

method, which reverses the subtraction operation between DataFrames`ore1`

and`ore2`

.- The
`rsub()`

function and the '-' operator in pandas do not produce the same result because the`rsub()`

function performs reverse subtraction, which means it subtracts the left operand from the right, while the '-' subtracts the right operand from the left operand.

(c)

```
copper gold iron magnesium silver
0 NaN NaN 280 1190 520
1 NaN NaN 363 532 667
copper gold iron magnesium silver
0 NaN NaN 280 1190 520
1 NaN NaN 363 532 667
```

`print(ore1 * ore2)`

: This line attempts to perform element-wise multiplication between the DataFrames`ore1`

and`ore2`

using the '*' operator. When multiplying DataFrames with different shapes and column names, pandas aligns the DataFrames based on indices and columns, resulting in NaN values where elements are missing in either DataFrame.`ore3 = ore1.mul(ore2)`

: This line uses the`mul()`

method to perform element-wise multiplication between DataFrames`ore1`

and`ore2`

.

(d)

```
copper gold iron magnesium silver
0 NaN NaN 1.428571 1.029412 0.769231
1 NaN NaN 0.333333 1.473684 1.260870
copper gold iron magnesium silver
0 NaN NaN 0.7 0.971429 1.300000
1 NaN NaN 3.0 0.678571 0.793103
```

`print(ore1 / ore2)`

: This line attempts to perform element-wise division between the DataFrames`ore1`

and`ore2`

using the '/' operator. When dividing DataFrames with different shapes and column names, pandas aligns the DataFrames based on indices and columns, resulting in NaN values where elements are missing in either DataFrame.`ore3 = ore1.rdiv(ore2)`

: This line uses the`rdiv()`

method to perform reciprocal division between DataFrames`ore1`

and`ore2`

.

Consider the DataFrame wdf as shown below :

minTemp | maxTemp | Rainfall | Evaporation | |
---|---|---|---|---|

0 | 2.9 | 8.0 | 24.3 | 0.0 |

1 | 3.1 | 14.0 | 26.9 | 3.6 |

2 | 6.2 | 13.7 | 23.4 | 3.6 |

3 | 5.3 | 13.3 | 15.5 | 39.8 |

4 | 6.3 | 17.6 | 16.1 | 2.8 |

5 | 5.4 | 18.2 | 16.9 | 0.0 |

6 | 5.5 | 21.1 | 18.2 | 0.2 |

7 | 4.8 | 18.3 | 17.0 | 0.0 |

8 | 3.6 | 20.8 | 19.5 | 0.0 |

9 | 7.7 | 19.4 | 22.8 | 16.2 |

10 | 9.9 | 24.1 | 25.2 | 0.0 |

11 | 11.8 | 28.5 | 27.3 | 0.2 |

12 | 13.2 | 29.1 | 27.9 | 0.0 |

13 | 16.8 | 24.1 | 30.9 | 0.0 |

14 | 19.4 | 28.1 | 31.2 | 0.0 |

15 | 21.6 | 34.4 | 32.1 | 0.0 |

16 | 20.4 | 33.8 | 31.2 | 0.0 |

17 | 18.5 | 26.7 | 30.0 | 1.2 |

18 | 18.8 | 32.4 | 32.3 | 0.6 |

19 | 17.6 | 28.6 | 33.4 | 0.0 |

20 | 19.7 | 30.3 | 33.4 | 0.0 |

(a) Write statement(s) to calculate minimum value for each of the columns.

(b) Write statement(s) to calculate maximum value for each of the rows.

(c) Write statement(s) to calculate variance for column Rainfall.

(d) Write statement(s) to compute mean , mode median for last 10 rows.

**Answer**

(a)

`>>> wdf.min()`

```
Minimum values for each column:
minTemp 2.9
maxTemp 8.0
Rainfall 15.5
Evaporation 0.0
dtype: float64
```

(b)

`>>> wdf.max(axis=1)`

```
Maximum values for each row:
0 24.3
1 26.9
2 23.4
3 39.8
4 17.6
5 18.2
6 21.1
7 18.3
8 20.8
9 22.8
10 25.2
11 28.5
12 29.1
13 30.9
14 31.2
15 34.4
16 33.8
17 30.0
18 32.4
19 33.4
20 33.4
dtype: float64
```

(c)

`>>> wdf['Rainfall'].var()`

```
Variance for column Rainfall: 38.852999999999994
```

(d)

```
>>> last_10_rows = wdf.tail(10)
>>> last_10_rows.mean()
>>> last_10_rows.mode()
>>> last_10_rows.median()
```

```
Mean for last 10 rows:
minTemp 17.78
maxTemp 29.60
Rainfall 30.97
Evaporation 0.20
dtype: float64
Mode for last 10 rows:
minTemp maxTemp Rainfall Evaporation
0 11.8 24.1 31.2 0.0
1 13.2 26.7 33.4 NaN
2 16.8 28.1 NaN NaN
3 17.6 28.5 NaN NaN
4 18.5 28.6 NaN NaN
5 18.8 29.1 NaN NaN
6 19.4 30.3 NaN NaN
7 19.7 32.4 NaN NaN
8 20.4 33.8 NaN NaN
9 21.6 34.4 NaN NaN
Median for last 10 rows:
minTemp 18.65
maxTemp 28.85
Rainfall 31.20
Evaporation 0.00
dtype: float64
```

Write a program to print a DataFrame one column at a time and print only first three columns.

```
import pandas as pd
data = {
'Name': ['Aliya', 'Hemanth', 'Charlie'],
'Age': [25, 30, 35],
'City': ['Bangalore', 'Chennai', 'Mumbai'],
'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
first_three_columns = df.iloc[:, :3]
print("Each column:")
for column_name in first_three_columns:
column_data = first_three_columns[column_name]
print(column_name)
print(column_data)
```

```
Each column:
Name
0 Aliya
1 Hemanth
2 Charlie
Name: Name, dtype: object
Age
0 25
1 30
2 35
Name: Age, dtype: int64
City
0 Bangalore
1 Chennai
2 Mumbai
Name: City, dtype: object
```

Write a program to print a DataFrame one row at a time and print only first five rows.

```
import pandas as pd
data = {
'Name': ['Amruta', 'Harsh', 'Yogesh', 'Shreya', 'Zoya', 'Nyra'],
'Age': [25, 30, 35, 40, 45, 28],
'City': ['Chandigarh', 'Jaipur', 'Dehradun', 'Delhi', 'Vadodara', 'Guwahati']
}
df = pd.DataFrame(data)
first_five_rows = df.head(5)
print("Each row:")
for index, row in first_five_rows.iterrows():
print("Index:", index)
print(row)
```

```
Each row:
Index: 0
Name Amruta
Age 25
City Chandigarh
Name: 0, dtype: object
Index: 1
Name Harsh
Age 30
City Jaipur
Name: 1, dtype: object
Index: 2
Name Yogesh
Age 35
City Dehradun
Name: 2, dtype: object
Index: 3
Name Shreya
Age 40
City Delhi
Name: 3, dtype: object
Index: 4
Name Zoya
Age 45
City Vadodara
Name: 4, dtype: object
```

Write a program that performs count, sum, max, and min functions :

On rows

On columns

```
import pandas as pd
data = {
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9]
}
df = pd.DataFrame(data)
print("Operations on rows:")
print("Count:")
print(df.count(axis=1))
print("Sum:")
print(df.sum(axis=1))
print("Max:")
print(df.max(axis=1))
print("Min:")
print(df.min(axis=1))
print("\nOperations on columns:")
print("Count:")
print(df.count())
print("Sum:")
print(df.sum())
print("Max:")
print(df.max())
print("Min:")
print(df.min())
```

```
Operations on rows:
Count:
0 3
1 3
2 3
dtype: int64
Sum:
0 12
1 15
2 18
dtype: int64
Max:
0 7
1 8
2 9
dtype: int64
Min:
0 1
1 2
2 3
dtype: int64
Operations on columns:
Count:
A 3
B 3
C 3
dtype: int64
Sum:
A 6
B 15
C 24
dtype: int64
Max:
A 3
B 6
C 9
dtype: int64
Min:
A 1
B 4
C 7
dtype: int64
```

Take a DataFrame of your choice. Write a program to calculate count of values only in a selective column.

```
import pandas as pd
data = {
'Name': ['Deepika', 'Simran', 'Shivang', 'Anurag', 'Ankith'],
'Age': [25, 30, None, 40, 45],
'City': ['Goa', 'Kochi', 'Pondicherry', 'Rohtak', 'Kasol']
}
df = pd.DataFrame(data)
selected_column = 'Age'
count_values = df[selected_column].count()
print("Count of values in", selected_column, ":" , count_values)
```

```
Count of values in Age : 4
```

Give two identical DataFrames **Sales16** and **Sales17**. But **Sales17** has some values missing. Write code so that **Sales17** fills its missing values from the corresponding entries of **Sales16**.

```
import pandas as pd
data_sales16 = {
'Product': ['A', 'B', 'C', 'D'],
'Sales': [100, 150, 120, 180]
}
Sales16 = pd.DataFrame(data_sales16)
data_sales17 = {
'Product': ['A', 'B', 'C', 'D'],
'Sales': [100, None, 120, None]
}
Sales17 = pd.DataFrame(data_sales17)
Sales17 = Sales16.fillna({'B': 150, 'D': 180})
print("Sales16:")
print(Sales16)
print("\nSales17 (after filling missing values):")
print(Sales17)
```

```
Sales16:
Product Sales
0 A 100
1 B 150
2 C 120
3 D 180
Sales17 (after filling missing values):
Product Sales
0 A 100
1 B 150
2 C 120
3 D 180
```

Write code that just produces single True/False as a result for the presence of missing values in whole DataFrame namely **df**.

```
import pandas as pd
data = {'A': [1, 2, None, 4], 'B': [None, 5, 6, 7], 'C': [8, 9, 10, 11]}
df = pd.DataFrame(data)
has_missing_values = False
for column in df.columns:
for value in df[column]:
if pd.isnull(value):
has_missing_values = True
break
print(has_missing_values)
```

```
True
```

Four Series objects Temp1, Temp2, Temp3 and Temp4 store the temperatures of week1, week2, week3 and week4 respectively. Create a DataFrame from these four series objects where the indexes should be 'Sunday', 'Monday', ... , 'Saturday', and columns should be 'Week1', 'Week2', 'Week3' and 'week4'.

```
import pandas as pd
weekdays = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
week1 = [25.4, 26.7, 27.2, 28.9, 29.8, 30.2, 31.5]
week2 = [22.3, 23.2, 24.6, 25.4, 26.8, 27.9, 28.4]
week3 = [20.1, 21.2, 22.4, 23.9, 24.7, 25.57, 26.78]
week4 = [18.4, 19.1, 20.3, 21.7, 22.67, 23.4, 24.09]
Temp1 = pd.Series(week1, index=weekdays)
Temp2 = pd.Series(week2, index=weekdays)
Temp3 = pd.Series(week3, index=weekdays)
Temp4 = pd.Series(week4, index=weekdays)
df = pd.DataFrame({'Week1': Temp1, 'Week2': Temp2, 'Week3': Temp3, 'Week4': Temp4})
print(df)
```

```
Week1 Week2 Week3 Week4
Sunday 25.4 22.3 20.10 18.40
Monday 26.7 23.2 21.20 19.10
Tuesday 27.2 24.6 22.40 20.30
Wednesday 28.9 25.4 23.90 21.70
Thursday 29.8 26.8 24.70 22.67
Friday 30.2 27.9 25.57 23.40
Saturday 31.5 28.4 26.78 24.09
```

Given a DataFrame that stores the details of past 25 years' monthly sales. Some old data is however missing. Write a script to calculate average :

Monthly sales across years

Yearly sales

Make sure that missing values do not hamper the overall result.

```
import pandas as pd
data = {
'Year': [],
'Month': [],
'Sales': []
}
n = int(input("Enter the number of rows in the DataFrame: "))
for i in range(n):
year = int(input("Enter the year: "))
month = int(input("Enter the month (1-12): "))
sales = float(input("Enter the sales amount: "))
data['Year'].append(year)
data['Month'].append(month)
data['Sales'].append(sales)
df = pd.DataFrame(data)
df.fillna(0, inplace=True)
avg_monthly_sales = df.groupby('Month')['Sales'].mean()
yearly_sales = df.groupby('Year')['Sales'].mean()
print("Average Monthly Sales Across Years:")
print(avg_monthly_sales)
print("\nYearly Sales:")
print(yearly_sales)
```

```
Average Monthly Sales Across Years:
Month
1 543.25
2 644.25
3 486.50
4 583.50
5 376.50
6 644.75
7 847.00
8 497.75
9 564.50
10 701.00
11 524.75
12 522.25
Name: Sales, dtype: float64
Yearly Sales:
Year
2000 781.082333
2001 480.030233
2002 390.530000
2003 682.917767
2004 421.250050
2005 408.333333
2006 509.757500
2007 634.333667
2008 907.250000
2009 876.222222
2010 940.757578
2011 690.900067
2012 623.250043
2013 890.088888
2014 491.500000
2015 601.919997
2016 623.250055
2017 489.086633
2018 523.750000
2019 789.910067
2020 456.250000
2021 481.083333
2022 590.750000
2023 618.916667
2024 621.250000
Name: Sales, dtype: float64
```

Given two identical DataFrames **Sales16** and **Sales17**. But **Sales17** has some values missing. Write code so that **Sales17** fills its missing values from corresponding entries of **Sales16**.

```
import pandas as pd
data_sales16 = {
'Product': ['A', 'B', 'C', 'D'],
'Sales': [100, 150, 120, 180]
}
Sales16 = pd.DataFrame(data_sales16)
data_sales17 = {
'Product': ['A', 'B', 'C', 'D'],
'Sales': [100, None, 120, None]
}
Sales17 = pd.DataFrame(data_sales17)
Sales17 = Sales16.fillna({'B': 150, 'D': 180})
print("Sales16:")
print(Sales16)
print("\nSales17 (after filling missing values):")
print(Sales17)
```

```
Sales16:
Product Sales
0 A 100
1 B 150
2 C 120
3 D 180
Sales17 (after filling missing values):
Product Sales
0 A 100
1 B 150
2 C 120
3 D 180
```

Write code that just produces single True/False as a result for the presence of missing values in whole DataFrame namely **df**.

```
import pandas as pd
data = {'A': [1, 2, None, 4], 'B': [None, 5, 6, 7], 'C': [8, 9, 10, 11]}
df = pd.DataFrame(data)
has_missing_values = False
for column in df.columns:
for value in df[column]:
if pd.isnull(value):
has_missing_values = True
break
print(has_missing_values)
```

```
True
```