Pandas is a powerful data manipulation library in Python that offers various functions and methods to work with structured data. While it excels in handling numerical and tabular data, it also provides robust support for string data. Manipulating string data is crucial in many data analysis tasks, such as cleaning messy text, extracting information, and transforming strings into more meaningful representations.

This tutorial aims to introduce you to the fundamentals of working with string data using Pandas. We will cover key operations, methods, and techniques to manipulate, analyze, and transform string data effectively. By the end of this tutorial, you will have a solid understanding of how to leverage Pandas to handle string data in your data analysis projects.

In this section, we will explore how to create a Pandas DataFrame containing string data and understand the various ways to import string data from different sources.

```
import pandas as pd
data = {'Name': ['John Doe', 'Jane Smith', 'Mark Johnson'],
        'Email': ['john@example.com', 'jane@example.com', 'mark@example.com'],
        'Phone': ['123-456-7890', '987-654-3210', '555-123-4567']}
df = pd.DataFrame(data)
df
```

Here, we will learn how to access and modify specific string columns in a DataFrame, including slicing, indexing, and applying functions to transform the string values.

```
names = df['Name']
names
```

```
df['Name'] = df['Name'].str.upper()
df
```
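The `.str` accessor also supports positional slicing and indexing, much like plain Python strings. Below is a minimal, self-contained sketch; the **'Short'** and **'Initial'** column names are our own illustrative choices:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'Mark Johnson']})

# Slice the first four characters of each name
df['Short'] = df['Name'].str[:4]

# Index a single character (position 0) of each name
df['Initial'] = df['Name'].str[0]

print(df)
```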

This section covers a wide range of common string operations and methods in Pandas, such as concatenation, splitting, stripping, substitution, matching, filtering, case conversion, and computing string length and counts.

```
df['Full Name'] = df['Name'].str.cat(df['Email'], sep=', ')
df
```

```
df['Name Parts'] = df['Name'].str.split()
df
```

```
df['Clean Phone'] = df['Phone'].str.replace('-', '')
df
```

```
filtered_df = df[df['Email'].str.contains('mark')]
filtered_df
```

```
df['Name'] = df['Name'].str.lower()
df
```

```
df['Name Length'] = df['Name'].str.len()
df['Name Vowel Count'] = df['Name'].str.count('[aeiou]')
df
```
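Stripping is listed above but not shown; here is a short sketch on a hypothetical Series with stray whitespace:

```python
import pandas as pd

s = pd.Series(['  John Doe  ', '\tJane Smith\n', 'Mark Johnson'])

stripped = s.str.strip()   # remove leading and trailing whitespace
left = s.str.lstrip()      # remove leading whitespace only
print(stripped.tolist())
print(left.tolist())
```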

We will dive into performing conditional string operations based on certain criteria, including conditional replacement, filtering, and extracting specific patterns from string columns.

```
df.loc[df['Name'].str.contains('john', case=False), 'Name'] = 'John Smith'
df
```

```
filtered_df = df[df['Name'].str.startswith('J')]
filtered_df
```

```
df['First Initial'] = df['Name'].str.extract(r'^(\w)', expand=False)
df
```
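As an alternative to `.loc`-based conditional assignment, `numpy.where` builds the conditional result in a single expression. A self-contained sketch (the **'Group'** column is a hypothetical label of our own):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'Mark Johnson']})

# Label each row depending on whether the name starts with 'J'
df['Group'] = np.where(df['Name'].str.startswith('J'), 'J-group', 'Other')
print(df)
```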

Cleaning and preprocessing text data are essential steps in any data analysis task. In this section, we will explore techniques to handle missing values, remove special characters, convert strings to lowercase, and perform other common text cleaning operations.

First, we create another DataFrame with string data using Pandas.

```
import pandas as pd
# Creating a DataFrame with missing values in string columns
data = {'Name': ['John Doe$', 'Jane Smith', None, 'Mark Johnson%'],
        'Email': ['john@example.com', 'jane@example.com', 'mark@example.com', None]}
df = pd.DataFrame(data)
df
```

In this example, we will handle missing values by replacing them with specified values.

```
df['Name'] = df['Name'].fillna('Unknown')
df['Email'] = df['Email'].fillna('unknown@example.com')
df
```

Let’s remove special characters from string columns.

```
df['Name'] = df['Name'].str.replace(r'[#$&%]', '', regex=True)
df
```

In this example, the pattern removes any of the characters **#**, **$**, **&**, and **%** from the **‘Name’** column.
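The summary later also mentions lowercasing and stop-word removal; a minimal sketch using a hand-made stop-word set (in practice you might use a library list such as NLTK's):

```python
import pandas as pd

s = pd.Series(['The Quick Brown Fox', 'A Lazy Dog'])
stop_words = {'the', 'a', 'an'}  # assumed minimal stop-word set

# Lowercase, split into words, then drop the stop words
cleaned = (s.str.lower()
            .str.split()
            .apply(lambda words: ' '.join(w for w in words if w not in stop_words)))
print(cleaned.tolist())
```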

This section introduces advanced string operations using regular expressions with Pandas. We will cover regex-based string extraction and replacement techniques to tackle complex string patterns.

We create another string DataFrame to display the methods.

```
import pandas as pd
# Creating a DataFrame with string data
data = {'Text': ['apple', 'banana', 'orange', 'grapefruit', 'watermelon']}
df = pd.DataFrame(data)
df
```

```
# Using regular expressions to find strings containing 'app' or 'an'
filtered_df = df[df['Text'].str.contains(r'app|an')]
filtered_df
```

```
# Extracting the numeric portion from strings using regex
# (these strings contain no digits, so every row yields NaN)
df['Number'] = df['Text'].str.extract(r'(\d+)')
df
```

```
# Replacing all occurrences of 'a' with 'X' using regex
df['Modified Text'] = df['Text'].str.replace(r'a', 'X', regex=True)
df
```

These examples demonstrate how regular expressions can be applied in conjunction with Pandas to perform advanced string operations. Regular expressions provide a powerful and flexible approach to pattern matching, extraction, and substitution in string data, allowing you to handle complex string patterns effectively.

In this tutorial, we covered a wide range of topics related to working with string data in Pandas, a powerful data manipulation library in Python. We started by introducing the basics of creating a DataFrame with string data and accessing/modifying string columns. Then, we explored common string operations and methods provided by Pandas, such as concatenation, splitting, stripping, substitution, matching, filtering, case conversion, and computing string length and counts.

We also delved into conditional string operations, where we performed actions based on specific criteria. Additionally, we covered advanced string operations using regular expressions with Pandas. Regular expressions allowed us to handle complex string patterns, perform pattern matching, extraction, and substitution tasks.

Furthermore, we emphasized the importance of string data cleaning and preprocessing. We learned techniques to handle missing values, remove special characters, convert strings to lowercase, and remove stop words. These steps are crucial for ensuring data quality and consistency before further analysis.

Throughout the tutorial, we provided examples that showcased the practical implementation of various concepts. These examples demonstrated how to create a DataFrame with string data, manipulate string columns, perform string operations, implement advanced techniques using regular expressions, and preprocess string data.

By leveraging Pandas’ capabilities for string data manipulation, you now have the necessary tools and knowledge to effectively work with string data in your data analysis projects. Whether it’s extracting insights from textual data, cleaning messy strings, or performing advanced string operations, Pandas offers a rich set of functionalities to handle string data efficiently.

Remember to continue practicing and experimenting with different scenarios to deepen your understanding. With Pandas and its extensive string manipulation capabilities, you can unlock the full potential of your text-based datasets and gain valuable insights for your data analysis endeavors.

*Originally published at https://medium.com/ on July 1, 2023.*


In data analysis and preprocessing, it’s crucial to identify and handle duplicate rows in a dataset. Duplicate rows can lead to biased analysis, skewed statistics, and erroneous results. Pandas, a powerful Python library for data manipulation, provides several methods to identify and handle duplicate rows effectively. In this tutorial, we will explore some common methods in Pandas to find duplicate rows, accompanied by easy-to-understand examples.

Let’s consider an example where we have a DataFrame called **‘df’** with columns ‘A’, ‘B’, and ‘C’:

```
import pandas as pd
# Creating a sample DataFrame
data = {'A': [1, 2, 1, 2, 1],
        'B': ['apple', 'banana', 'apple', 'banana', 'apple'],
        'C': [10, 20, 10, 20, 10]}
df = pd.DataFrame(data)
df
```

The **‘duplicated()’** method in Pandas helps identify duplicate rows in a DataFrame. It returns a boolean Series indicating whether each row is a duplicate or not.

We can see the duplicate rows clearly here because the DataFrame has only 5 rows. In most cases, however, we work with large DataFrames, where spotting duplicate rows by eye is impractical.

```
# Finding duplicate rows
duplicate_rows = df.duplicated()
duplicate_rows
```

```
0    False
1    False
2     True
3     True
4     True
dtype: bool
```

In the output, **‘False’** indicates non-duplicate rows, while **‘True’** represents duplicate rows.
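On a large DataFrame, the boolean Series is hard to scan by eye; since `True` counts as 1, summing it gives the number of duplicate rows, and boolean indexing shows the rows themselves. A sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 1],
                   'B': ['apple', 'banana', 'apple', 'banana', 'apple'],
                   'C': [10, 20, 10, 20, 10]})

mask = df.duplicated()
print(mask.sum())   # number of duplicate rows
print(df[mask])     # the duplicate rows themselves
```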

The **‘drop_duplicates()’** method allows us to remove duplicate rows from a DataFrame. By default, it keeps the first occurrence of the duplicate rows and drops the subsequent ones. Here’s an example:

```
# Dropping duplicate rows
df_no_duplicates = df.drop_duplicates(inplace=False)
df_no_duplicates
```

In the output, we can observe that the duplicate rows with index 2, 3 and 4 have been dropped, and the resulting DataFrame contains unique rows. `inplace=False` (the default) tells Pandas to return a new DataFrame with duplicates dropped, while the original DataFrame `df` stays unchanged. However, you can also use `inplace=True` to drop the duplicates from the original DataFrame in place.

```
df.drop_duplicates(inplace=True)
df
```
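`drop_duplicates()` also accepts `keep='last'` (retain the final occurrence instead of the first) and `keep=False` (drop every row that has a duplicate); a sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 1],
                   'B': ['apple', 'banana', 'apple', 'banana', 'apple'],
                   'C': [10, 20, 10, 20, 10]})

last = df.drop_duplicates(keep='last')   # keep the last occurrence of each row
none = df.drop_duplicates(keep=False)    # drop all rows that appear more than once
print(last.index.tolist())
print(len(none))
```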

In many cases, only some columns contain duplicate values; we can identify these with the `duplicated(subset=['col1','col2'])` method. Here is an example:

```
import pandas as pd
# Creating a sample DataFrame
data = {'A': [1, 2, 3, 4, 5],
        'B': ['apple', 'banana', 'apple', 'banana', 'apple'],
        'C': [10, 20, 10, 20, 10]}
df_new = pd.DataFrame(data)
df_new
```

```
# Finding duplicate rows
dup_rows1 = df_new.duplicated()
dup_rows1
```

In the output, no row is flagged as a duplicate, because column ‘A’ holds unique values, so no complete row repeats. In this case, we specify the subset as follows:

```
# Finding duplicate rows
dup_rows2 = df_new.duplicated(subset=['B','C'])
dup_rows2
```

In this example, we have duplicate items in the columns ‘B’ and ‘C’. We can remove them using the `drop_duplicates(subset=['col1','col2'])` method. Here is an example:

```
df_no_duplicates1 = df_new.drop_duplicates(subset=['B'],keep="first", inplace=False)
df_no_duplicates1
```

Alternatively, we can specify `subset=['C']` or `subset=['B','C']`, because the duplicates in columns ‘B’ and ‘C’ fall on the same rows.

```
df_no_duplicates2 = df_new.drop_duplicates(subset=['C'],keep="first", inplace=False)
df_no_duplicates2
```

Or:

```
df_no_duplicates3 = df_new.drop_duplicates(subset=['B','C'], keep="first", inplace=False)
df_no_duplicates3
```

We get the same output as above.

By specifying `subset=['B']`, `subset=['C']` or `subset=['B','C']`, only the values in the related column or columns will be considered when identifying duplicates. Adjusting the `subset` and `keep` parameters can help you tailor the behavior of `drop_duplicates()` to your specific requirements. In our example, the duplicates in columns ‘B’ and ‘C’ coincide, so the results are the same.

Identifying and handling duplicate rows is a vital step in data analysis and preprocessing. In this tutorial, we explored two essential methods in Pandas: `duplicated()` and `drop_duplicates()`. The `duplicated()` method helped us identify duplicate rows by returning a boolean Series, while the `drop_duplicates()` method enabled us to remove duplicate rows from a DataFrame. These methods can be invaluable in ensuring data integrity and producing accurate analysis results. By utilizing these techniques, you can effectively handle duplicate rows in your datasets and enhance the quality of your data analysis projects.

*Originally published at https://medium.com/ on June 29, 2023.*


Stationary Wavelet Transform (SWT) is a widely used signal processing technique that offers various advantages in signal analysis. However, it is important to be aware of the main restrictions and requirements associated with SWT to ensure accurate and reliable results. In this tutorial, we will explore two crucial restrictions of SWT: the signal length must be even and the signal length should be divisible by 2^n, where n represents the maximum number of decomposition levels.

One of the primary restrictions of SWT is that the length of the signal being analyzed must be even. This requirement arises due to the filter bank structure used in the SWT algorithm. The filter banks operate by splitting the signal into even and odd samples, and the subsequent decomposition and reconstruction steps rely on maintaining this parity. If the signal length is odd, it must be padded with an additional sample to make it even before applying SWT.

Another important restriction is that the signal length should be divisible by 2^n, where n represents the number of desired decomposition levels. This requirement ensures the proper functioning of the SWT algorithm and allows for the consistent downsampling and filtering operations at each level. If the signal length does not meet this criterion, it may lead to incorrect decomposition results and potential loss of information.

To overcome these restrictions, the following steps can be taken:

If the signal length is odd, it should be padded with an extra sample to make it even before applying SWT. This can be achieved by duplicating the last sample or using interpolation techniques to estimate the additional sample value.

If the signal length is not divisible by 2^𝑛, it may be necessary to adjust the signal length by truncation or zero-padding. Truncation involves removing a few samples from the signal to make it divisible, while zero-padding involves adding zeros at the end of the signal to achieve the desired length.

It is important to note that modifying the signal length may introduce some artifacts or affect the analysis, particularly in the higher frequency bands. Therefore, careful consideration should be given to the choice and implementation of these adjustments.
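These adjustments can be sketched with NumPy; the helper name `adjust_length` is our own illustration, not part of any library:

```python
import numpy as np

def adjust_length(signal, n_levels, mode='pad'):
    """Make len(signal) a multiple of 2**n_levels by edge-padding or truncating."""
    block = 2 ** n_levels
    length = len(signal)
    if mode == 'pad':
        target = -(-length // block) * block  # round up to the next multiple
        # repeat the last sample to fill the gap
        return np.pad(signal, (0, target - length), mode='edge')
    # truncate down to the previous multiple
    return np.asarray(signal)[: (length // block) * block]

sig = list(range(1, 18))                             # 17 samples (odd length)
padded = adjust_length(sig, 4)                       # padded to 32 for a 4-level SWT
truncated = adjust_length(sig, 4, mode='truncate')   # truncated to 16
print(len(padded), len(truncated))
```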

Let’s consider a simple example to illustrate the restrictions of Stationary Wavelet Transform (SWT) regarding the signal length.

The first restriction of SWT is that the signal length must be even. In this case, the signal length is odd (17 samples).

```
import pywt
odd_S = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17]
coeffs = pywt.swt(odd_S, 'db2', level=4)
coeffs
```

When we try to decompose the odd-length signal, a ValueError is raised. To satisfy the even length requirement, we need to pad the signal with an additional sample.

```
pad_S = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,17]
coeffs = pywt.swt(pad_S, 'db2')
coeffs
```

```
[(array([ 9.106945 , 2.31078903, 3.7250026 , 5.13921616, 6.55342972,
7.96764328, 9.38185685, 10.79607041, 12.21028397, 13.62449753,
15.0387111 , 16.45292466, 17.86713822, 19.28135178, 20.69556535,
22.23918843, 25.62922001, 22.39647151]),
array([-2.19996188e+00, 1.66533454e-16, 1.11022302e-16, 3.33066907e-16,
4.44089210e-16, 2.22044605e-16, 0.00000000e+00, 2.22044605e-16,
1.33226763e-15, 4.44089210e-16, 8.88178420e-16, 8.88178420e-16,
4.44089210e-16, 4.44089210e-16, 8.88178420e-16, 4.82962913e-01,
7.85681613e+00, -6.13981716e+00]))]
```

Now, we have an even-length signal of 18 samples, satisfying the requirement.

The second restriction is that the signal length should be divisible by 2^n, where n represents the maximum number of desired decomposition levels; that is, the length must contain a factor of 2¹, 2², 2³, …, 2^n.

`pywt.swt_max_level(len(odd_S))`

0

The result shows that no level of stationary wavelet decomposition is possible: the signal to be transformed must have a size that is a multiple of 2^n for an n-level decomposition.

`pywt.swt_max_level(len(pad_S))`

1

The above padded signal is even, but its maximum level is only 1. As we saw in the previous article, a signal with 16 samples can be decomposed into 4 levels.

`pywt.swt_max_level(16)`

4
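The rule behind `pywt.swt_max_level` as described here can be sketched in plain Python: the maximum level is the number of times the signal length divides evenly by 2:

```python
def max_swt_level(length):
    """Count how many times `length` is evenly divisible by 2."""
    level = 0
    while length > 1 and length % 2 == 0:
        length //= 2
        level += 1
    return level

print(max_swt_level(17))  # 0: odd length, no decomposition possible
print(max_swt_level(18))  # 1: 18 = 2 * 9
print(max_swt_level(16))  # 4: 16 = 2**4
```

These values match the PyWavelets results reported above for lengths 17, 18, and 16.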

Let’s assume we want to decompose the signal into 4 or 5 levels; we then need to modify the signal length by truncation or zero-padding to meet the requirement. For example, let’s truncate the signal by removing one sample so that it can be decomposed into 4 levels.

```
trunc_S = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
coeffs = pywt.swt(trunc_S, 'db2')
coeffs
```

```
[(array([34., 34., 34., 34., 34., 34., 34., 34., 34., 34., 34., 34., 34.,
34., 34., 34.]),
array([ -5. , 0.56217783, 5.02627944, 8.75833025,
12.12435565, 13.75833025, 14.02627944, 11.56217783,
5. , -0.56217783, -5.02627944, -8.75833025,
-12.12435565, -13.75833025, -14.02627944, -11.56217783])),
(array([27.57716447, 23.64411081, 20.48751428, 17.84855585, 15.46841646,
14.31302194, 14.12355325, 15.86593621, 20.50609665, 24.43915031,
27.59574684, 30.23470527, 32.61484466, 33.77023918, 33.95970787,
32.21732491]),
array([ -8.5732141 , -6.76148078, -5.01909782, -3.79435295,
-3.53553391, -2.56960808, -1.34486321, 1.81173332,
8.5732141 , 12.69573645, 14.81705679, 12.55703574,
3.53553391, -3.36464759, -8.45309576, -10.57441611])),
(array([18.19615242, 12. , 7.53589838, 5.80384758, 7.80384758,
9.80384758, 11.80384758, 13.80384758, 15.80384758, 17.80384758,
19.53589838, 22. , 26.19615242, 28.39230485, 29.12435565,
26.39230485]),
array([-4.92820323e+00, -2.73205081e+00, -1.00000000e+00, 7.21644966e-16,
-5.55111512e-17, -2.22044605e-16, 9.99200722e-16, 1.11022302e-15,
-1.77635684e-15, -6.66133815e-16, -1.00000000e+00, 7.32050808e-01,
8.92820323e+00, 9.66025404e+00, -2.00000000e+00, -7.66025404e+00])),
(array([ 8.62398208, 2.31078903, 3.7250026 , 5.13921616, 6.55342972,
7.96764328, 9.38185685, 10.79607041, 12.21028397, 13.62449753,
15.0387111 , 16.45292466, 17.86713822, 19.28135178, 22.76611771,
20.59402938]),
array([-2.07055236e+00, 1.66533454e-16, 1.11022302e-16, 3.33066907e-16,
4.44089210e-16, 2.22044605e-16, 0.00000000e+00, 2.22044605e-16,
1.33226763e-15, 4.44089210e-16, 8.88178420e-16, 8.88178420e-16,
4.44089210e-16, 4.44089210e-16, 7.72740661e+00, -5.65685425e+00]))]
```

In this example, we demonstrated the restrictions of SWT regarding signal length. The signal length must be even, requiring padding with an additional sample if necessary. Additionally, the signal length should be divisible by 2^n to ensure proper decomposition, which may involve truncating or zero-padding the signal. By satisfying these restrictions, we can perform SWT accurately and obtain reliable results in signal analysis and processing. In this article, we only displayed the restrictions and the general solutions of modifying the signal length by truncation or zero-padding. In the next article, we will show different methods to pad the signal in detail.

*Originally published at https://medium.com/ on June 28, 2023.*


In the previous article, we talked about how to decompose a 1D signal using the Stationary Wavelet Transform (SWT). The maximum decomposition level in the SWT depends on the length of the input signal: it is determined by the number of times the signal length can be evenly divided by 2.

The information of maximum decomposition level is valuable as it helps in setting the appropriate number of decomposition levels for a signal analysis task. It allows us to determine the number of approximation and detail coefficients that will be generated at each level, providing insights into the signal’s frequency content and capturing various scales of detail.

Understanding the maximum decomposition level aids in selecting an optimal balance between capturing fine details and avoiding excessive decomposition, which could lead to noise amplification or computational overhead. By leveraging the capabilities of PyWavelets and the concept of the maximum decomposition level, researchers, and practitioners can effectively utilize SWT for applications such as signal denoising, feature extraction, and time-frequency analysis, among others.

We can easily calculate the maximum level of SWT for data of a given length using the following function in PyWavelets.

`pywt.swt_max_level(input_len)`

**Parameters:**

- input_len [int] Input data length

**Returns:**

- max_level [int] Maximum level of Stationary Wavelet Transform for data of given length

Let’s illustrate this with a simple example using the PyWavelets library in Python.

```
import pywt
S = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
# Get the maximum decomposition level
max_level = pywt.swt_max_level(len(S))
print("Maximum decomposition level:", max_level)
```

Maximum decomposition level: 4

In this example, we have an input signal with 16 samples. We use the `pywt.swt_max_level` function to determine the maximum decomposition level based on the length of the signal. In this case, the maximum decomposition level for the input signal of length 16 is 4. This means that the signal can be decomposed into 4 levels, resulting in 4 sets of approximation and detail coefficients.

Let’s decompose it into the maximum level with **‘db2’**, which refers to the Daubechies 2 wavelet.

```
coeffs = pywt.swt(S, 'db2', level=4)
coeffs
```

```
[(array([34., 34., 34., 34., 34., 34., 34., 34., 34., 34., 34., 34., 34.,
34., 34., 34.]),
array([ -5. , 0.56217783, 5.02627944, 8.75833025,
12.12435565, 13.75833025, 14.02627944, 11.56217783,
5. , -0.56217783, -5.02627944, -8.75833025,
-12.12435565, -13.75833025, -14.02627944, -11.56217783])),
(array([27.57716447, 23.64411081, 20.48751428, 17.84855585, 15.46841646,
14.31302194, 14.12355325, 15.86593621, 20.50609665, 24.43915031,
27.59574684, 30.23470527, 32.61484466, 33.77023918, 33.95970787,
32.21732491]),
array([ -8.5732141 , -6.76148078, -5.01909782, -3.79435295,
-3.53553391, -2.56960808, -1.34486321, 1.81173332,
8.5732141 , 12.69573645, 14.81705679, 12.55703574,
3.53553391, -3.36464759, -8.45309576, -10.57441611])),
(array([18.19615242, 12. , 7.53589838, 5.80384758, 7.80384758,
9.80384758, 11.80384758, 13.80384758, 15.80384758, 17.80384758,
19.53589838, 22. , 26.19615242, 28.39230485, 29.12435565,
26.39230485]),
array([-4.92820323e+00, -2.73205081e+00, -1.00000000e+00, 7.21644966e-16,
-5.55111512e-17, -2.22044605e-16, 9.99200722e-16, 1.11022302e-15,
-1.77635684e-15, -6.66133815e-16, -1.00000000e+00, 7.32050808e-01,
8.92820323e+00, 9.66025404e+00, -2.00000000e+00, -7.66025404e+00])),
(array([ 8.62398208, 2.31078903, 3.7250026 , 5.13921616, 6.55342972,
7.96764328, 9.38185685, 10.79607041, 12.21028397, 13.62449753,
15.0387111 , 16.45292466, 17.86713822, 19.28135178, 22.76611771,
20.59402938]),
array([-2.07055236e+00, 1.66533454e-16, 1.11022302e-16, 3.33066907e-16,
4.44089210e-16, 2.22044605e-16, 0.00000000e+00, 2.22044605e-16,
1.33226763e-15, 4.44089210e-16, 8.88178420e-16, 8.88178420e-16,
4.44089210e-16, 4.44089210e-16, 7.72740661e+00, -5.65685425e+00]))]
```

Next, let’s see what happens if we try to decompose the signal into more levels than its maximum, say 5 in this example.

`coeffs = pywt.swt(S, 'db2', level=5)`

There will be an error if we try to decompose the signal more than the maximum level.

The determination of the maximum decomposition level in the Stationary Wavelet Transform (SWT) using PyWavelets allows us to understand the depth to which a given signal can be decomposed. By analyzing the length of the input signal, we can determine the maximum level of decomposition.

In our example, we used the PyWavelets library in Python to calculate the maximum decomposition level for a simple input signal of length 16. By applying the `pywt.swt_max_level` function with the appropriate parameters, we obtained a maximum decomposition level of 4.

*Originally published at https://medium.com/ on June 28, 2023.*


In the previous article, we discussed some key differences, advantages, and disadvantages between the Decimated Wavelet Transform (DWT) and the Undecimated Wavelet Transform (UWT). The Stationary Wavelet Transform (SWT) is a powerful signal processing technique that allows the decomposition of a signal into different frequency bands. It has widespread applications in various fields such as image processing, data compression, and denoising.

In general, the process of SWT includes the following main steps:

- Signal Decomposition: The first step in SWT is to decompose the signal into different frequency bands. This is achieved by applying a series of high-pass and low-pass filters to the signal. The high-pass filter extracts the high-frequency components, while the low-pass filter captures the low-frequency components.
- Filter Banks: SWT utilizes filter banks to perform the signal decomposition. A filter bank consists of a pair of high-pass and low-pass filters that operate simultaneously. The high-pass filter removes the low-frequency components, leaving behind the high-frequency components, while the low-pass filter extracts the low-frequency components.
- Downsampling: After filtering, downsampling is performed to reduce the sampling rate of the decomposed signals. This step reduces redundancy and facilitates further analysis. Downsampling involves discarding every alternate sample in the signal.
- Recursive Filtering: SWT involves a recursive filtering process, where the filtering operation is applied repeatedly on the decomposed signals. This recursive filtering creates a multiresolution analysis, allowing the decomposition of the signal into multiple scales.
- Wavelet Selection: The choice of wavelet function is crucial in SWT. Different wavelets have different characteristics and are suitable for different types of signals. Commonly used wavelets include Daubechies, Haar, and Symlets. The wavelet function determines the properties of the resulting frequency bands.
- Reconstruction: Once the signal has been decomposed into different frequency bands, it can be reconstructed by performing an inverse SWT. This involves upsampling and applying the inverse filter banks to each scale to obtain the original signal.
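The reconstruction step above is available in PyWavelets as `pywt.iswt`; a minimal round-trip sketch, assuming PyWavelets is installed:

```python
import numpy as np
import pywt

# A simple even-length signal
S = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]

coeffs = pywt.swt(S, 'db2', level=2)      # decompose into 2 levels
reconstructed = pywt.iswt(coeffs, 'db2')  # inverse SWT

# The round trip recovers the original signal up to floating-point error
print(np.allclose(reconstructed, S))
```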

In this tutorial, we will explore the decomposition methods involved in performing SWT using a concrete and easy-to-understand example.

We’ll use the **‘PyWavelets’** library, which provides convenient implementations of various wavelet transforms, including SWT.

`pywt.swt(S, wavelet, level=None, start_level=0, axis=-1, trim_approx=False, norm=False)`

**Parameters**:

- **S**: Input signal
- **wavelet**: Wavelet to use (Wavelet object or name)
- **level**: [int, optional] The number of decomposition steps to perform
- **start_level**: [int, optional] The level at which the decomposition will begin; it allows one to skip a given number of transform steps and compute coefficients starting from start_level (default: 0)
- **axis**: [int, optional] Axis over which to compute the SWT. If not given, the last axis is used.
- **trim_approx**: [bool, optional] If True, approximation coefficients at the final level are retained.
- **norm**: [bool, optional] If True, the transform is normalized so that the energy of the coefficients will be equal to the energy of the data. In other words, `np.linalg.norm(data.ravel())` will equal the norm of the concatenated transform coefficients when `trim_approx` is True.

**Returns**: **coeffs** [list] List of approximation and detail coefficient pairs, in order similar to the `wavedec` function:

`[(cAn, cDn), …, (cA2, cD2), (cA1, cD1)]`

where n equals input parameter level.

If `start_level = m` is given, then the beginning m steps are skipped:

`[(cAm+n, cDm+n), …, (cAm+1, cDm+1), (cAm, cDm)]`

If `trim_approx` is True, then the output list is exactly as in `pywt.wavedec`, where the first coefficient in the list is the approximation coefficient at the final level and the rest are the detail coefficients:

`[cAn, cDn, …, cD2, cD1]`

To illustrate the process, let’s consider a concrete example of analyzing a simple signal using SWT. In this example, we will try different settings of the `trim_approx` and `norm` parameters.

First, let’s set `trim_approx` to False, i.e. the default.

```
import pywt
# create a simple signal
S = [1,2,3,4,5,6,7,7,9,10,11,12,13,14,15,16]
# decompose with SWT
coeffs = pywt.swt(S, 'db2', level=2)
coeffs
```

```
[(array([18.19615242, 11.98325318,  7.56490474,  5.9411071 ,  7.8161071 ,
          9.72460075, 11.5080944 , 13.16658805, 15.29158805, 17.39984123,
         19.30264521, 22.        , 26.19615242, 28.39230485, 29.12435565,
         26.39230485]),
  array([-4.92820323, -2.79455081, -0.89174682,  0.51225953,  0.04575318,
         -0.72876588, -0.35376588,  0.17075318,  0.13725953,  0.10825318,
         -0.9375    ,  0.73205081,  8.92820323,  9.66025404, -2.        ,
         -7.66025404])),
 (array([ 8.62398208,  2.31078903,  3.7250026 ,  5.13921616,  6.55342972,
          8.09705281,  9.15771298,  9.95955411, 11.72732106, 13.62449753,
         15.0387111 , 16.45292466, 17.86713822, 19.28135178, 22.76611771,
         20.59402938]),
  array([-2.07055236e+00,  1.66533454e-16,  1.11022302e-16,  3.33066907e-16,
          4.44089210e-16,  4.82962913e-01, -8.36516304e-01,  2.24143868e-01,
          1.29409523e-01,  4.44089210e-16,  8.88178420e-16,  8.88178420e-16,
          4.44089210e-16,  4.44089210e-16,  7.72740661e+00, -5.65685425e+00]))]
```

You can extract the approximation and detail coefficients as follows.

```
(cA2, cD2), (cA1, cD1) = coeffs
print('cA2:\n', cA2)
print('cD2:\n',cD2)
print('cA1:\n',cA1)
print('cD1:\n',cD1)
```

```
cA2:
[18.19615242 11.98325318 7.56490474 5.9411071 7.8161071 9.72460075
11.5080944 13.16658805 15.29158805 17.39984123 19.30264521 22.
26.19615242 28.39230485 29.12435565 26.39230485]
cD2:
[-4.92820323 -2.79455081 -0.89174682 0.51225953 0.04575318 -0.72876588
-0.35376588 0.17075318 0.13725953 0.10825318 -0.9375 0.73205081
8.92820323 9.66025404 -2. -7.66025404]
cA1:
[ 8.62398208 2.31078903 3.7250026 5.13921616 6.55342972 8.09705281
9.15771298 9.95955411 11.72732106 13.62449753 15.0387111 16.45292466
17.86713822 19.28135178 22.76611771 20.59402938]
cD1:
[-2.07055236e+00 1.66533454e-16 1.11022302e-16 3.33066907e-16
4.44089210e-16 4.82962913e-01 -8.36516304e-01 2.24143868e-01
1.29409523e-01 4.44089210e-16 8.88178420e-16 8.88178420e-16
4.44089210e-16 4.44089210e-16 7.72740661e+00 -5.65685425e+00]
```

In this example, we choose the ‘db2’ wavelet and decompose a simple signal into 2 levels using the SWT method with the default settings of `trim_approx=False` and `norm=False`. We obtain pairs of approximation and detail coefficients at each of the decomposed levels, i.e. 2 levels in this example.

We check the lengths of the original signal and coefficients.

```
print(len(S))
print(len(cA2))
print(len(cD2))
print(len(cA1))
print(len(cD1))
```

```
16
16
16
16
16
```

We see that they all have the same length as the original signal, which makes the wavelet transform shift-invariant. This point was discussed in the previous article.

Now we set `trim_approx=True` and see the results.

```
coeff2 = pywt.swt(S, 'db2', level=2,trim_approx=True)
coeff2
```

```
[array([18.19615242, 11.98325318, 7.56490474, 5.9411071 , 7.8161071 ,
9.72460075, 11.5080944 , 13.16658805, 15.29158805, 17.39984123,
19.30264521, 22. , 26.19615242, 28.39230485, 29.12435565,
26.39230485]),
array([-4.92820323, -2.79455081, -0.89174682, 0.51225953, 0.04575318,
-0.72876588, -0.35376588, 0.17075318, 0.13725953, 0.10825318,
-0.9375 , 0.73205081, 8.92820323, 9.66025404, -2. ,
-7.66025404]),
array([-2.07055236e+00, 1.66533454e-16, 1.11022302e-16, 3.33066907e-16,
4.44089210e-16, 4.82962913e-01, -8.36516304e-01, 2.24143868e-01,
1.29409523e-01, 4.44089210e-16, 8.88178420e-16, 8.88178420e-16,
4.44089210e-16, 4.44089210e-16, 7.72740661e+00, -5.65685425e+00])]
```

Similarly, we extract the approximation and detail coefficients, and we also check their lengths.

```
(trcA2, trcD2, trcD1) = coeff2
print(trcA2)
print(trcD2)
print(trcD1)
print(len(trcA2))
print(len(trcD2))
print(len(trcD1))
```

```
[18.19615242 11.98325318  7.56490474  5.9411071   7.8161071   9.72460075
 11.5080944  13.16658805 15.29158805 17.39984123 19.30264521 22.
 26.19615242 28.39230485 29.12435565 26.39230485]
[-4.92820323 -2.79455081 -0.89174682  0.51225953  0.04575318 -0.72876588
 -0.35376588  0.17075318  0.13725953  0.10825318 -0.9375      0.73205081
  8.92820323  9.66025404 -2.         -7.66025404]
[-2.07055236e+00  1.66533454e-16  1.11022302e-16  3.33066907e-16
  4.44089210e-16  4.82962913e-01 -8.36516304e-01  2.24143868e-01
  1.29409523e-01  4.44089210e-16  8.88178420e-16  8.88178420e-16
  4.44089210e-16  4.44089210e-16  7.72740661e+00 -5.65685425e+00]
16
16
16
```

From the above output, running SWT with `trim_approx=True` decomposes a signal into an output list exactly as in `pywt.wavedec`, where the first coefficient in the list is the approximation coefficient at the final level and the rest are the detail coefficients. The difference is that SWT yields coefficients with the same length as the original signal.

Next, we set both `norm=True` and `trim_approx=True` and see the results.

```
(nocA2, nocD2, nocD1) = pywt.swt(S, 'db2', level=2, trim_approx=True, norm=True)
print(nocA2)
print(nocD2)
print(nocD1)
print(len(nocA2))
print(len(nocD2))
print(len(nocD1))
```

```
[ 9.09807621 5.99162659 3.78245237 2.97055355 3.90805355 4.86230038
5.7540472 6.58329403 7.64579403 8.69992061 9.6513226 11.
13.09807621 14.19615242 14.56217783 13.19615242]
[-2.46410162 -1.3972754 -0.44587341 0.25612976 0.02287659 -0.36438294
-0.17688294 0.08537659 0.06862976 0.05412659 -0.46875 0.3660254
4.46410162 4.83012702 -1. -3.83012702]
[-1.46410162e+00 -1.11022302e-16 -1.11022302e-16 -4.99600361e-16
0.00000000e+00 3.41506351e-01 -5.91506351e-01 1.58493649e-01
9.15063509e-02 -2.22044605e-16 -3.33066907e-16 -4.44089210e-16
-2.22044605e-16 -8.88178420e-16 5.46410162e+00 -4.00000000e+00]
16
16
16
```

As we discussed in Section 1.1, with `norm=True` the transform is normalized so that the energy of the coefficients equals the energy of the data.
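That energy-preservation claim can be verified numerically. A small sketch, assuming PyWavelets and NumPy are installed and using an arbitrary test signal (`norm=True` requires `trim_approx=True` and an orthogonal wavelet such as `db2`):

```python
import numpy as np
import pywt

# Arbitrary test signal
S = np.arange(16, dtype=float)
coeffs = pywt.swt(S, 'db2', level=2, trim_approx=True, norm=True)

energy_signal = np.sum(S ** 2)
energy_coeffs = sum(np.sum(c ** 2) for c in coeffs)
print(np.isclose(energy_signal, energy_coeffs))  # True: energy is preserved
```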

Let’s plot and compare the approximation coefficients decomposed with above SWT methods.

```
import matplotlib.pyplot as plt
plt.plot(S)
plt.plot(cA2)
plt.plot(trcA2)
plt.plot(nocA2)
plt.legend(['S','cA2','trcA2','nocA2'])
```

From the visualized comparison, we confirm that the approximation coefficient from the normalized transform is closer to the original signal.

The Undecimated Wavelet Transform (UWT) or Stationary Wavelet Transform (SWT) is a powerful signal processing technique with several advantages over the DWT. This article demonstrated these decomposition methods using an easy-to-understand example with the Python PyWavelets library. First, we gave an overview of the SWT methods; then we decomposed a simple signal into 2 levels with SWT and the 'db2' wavelet under different parameter settings: the default setting, `trim_approx=True`, and `norm=True`. For the normalized transform, the energy of the coefficients equals the energy of the data.

*Originally published at https://medium.com/ on June 27, 2023.*


In this article, we will review generally the differences between the **Decimated Wavelet Transform (DWT)** and the **Undecimated Wavelet Transform (UWT)** or **the Stationary Wavelet Transform (SWT)**.

In general, the Undecimated Wavelet Transform (UWT), also known as the Stationary Wavelet Transform (SWT), is a variant of the traditional Decimated Wavelet Transform (DWT). While the DWT downsamples the signal at each level of decomposition, the SWT overcomes this limitation by maintaining the original signal length throughout the analysis. This shift-invariant wavelet transform offers several advantages over the DWT, including improved frequency localization and shift-invariance. In this way, the SWT provides a powerful tool for signal and image processing tasks, such as denoising, feature extraction, and time-frequency analysis.

The Decimated Wavelet Transform (DWT) is a widely used technique in signal and image processing for analyzing and decomposing signals into different frequency components. Unlike the Continuous Wavelet Transform (CWT), which provides a continuous-time representation of the signal, the DWT operates on discrete-time signals and performs a multi-resolution analysis.

The DWT involves a series of filtering and downsampling operations to decompose a signal into different frequency bands or scales. The downsampling discards a large portion of the samples at each level, and the DWT suffers from the drawback that it is not a time-invariant transform. Thus, the DWT of a translated version of a signal **S** is not, in general, the translated version of the DWT of **S**.

The DWT provides a multi-resolution representation of the signal, with different levels or scales capturing different frequency bands. The high-frequency components represent fine details, while the low-frequency components capture the coarse information. The DWT allows for efficient representation and analysis of signals in both time and frequency domains.

The choice of wavelet function plays a crucial role in the DWT. Different wavelets have different frequency characteristics and properties, making them suitable for specific types of signals or analysis goals. Commonly used wavelets include Daubechies, Symlets, and Haar wavelets.

The DWT has various applications, including signal denoising, compression, feature extraction, image processing, and time series analysis. It provides a powerful tool for analyzing and understanding the frequency content of signals in a multi-resolution framework.

The Undecimated (or Nondecimated) Wavelet Transform (also known as the Stationary Wavelet Transform or SWT) is a variation of the traditional Decimated Wavelet Transform (DWT) that overcomes some of its limitations. While the DWT involves downsampling the signal at each level of decomposition, the SWT avoids this downsampling, resulting in a **shift-invariant** wavelet transform that preserves **the original signal length** throughout the analysis.

The Undecimated Wavelet Transform provides several advantages over the DWT:

**Shift-Invariance**: The SWT is shift-invariant, meaning that small shifts in the input signal result in small shifts in the transform coefficients. This property is beneficial for applications such as signal denoising and time series analysis.
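The shift-invariance property can be illustrated directly. The following is a hedged sketch, assuming PyWavelets and NumPy are installed; the impulse signal and the 2-sample circular shift are illustrative choices, not from the article:

```python
import numpy as np
import pywt

# An impulse signal and a circularly shifted copy
S = np.zeros(16)
S[4] = 1.0
S_shifted = np.roll(S, 2)

# SWT: coefficients of the shifted signal are the shifted coefficients
swt_a = pywt.swt(S, 'db2', level=1, trim_approx=True)
swt_b = pywt.swt(S_shifted, 'db2', level=1, trim_approx=True)
print(all(np.allclose(np.roll(ca, 2), cb) for ca, cb in zip(swt_a, swt_b)))

# DWT: a 1-sample input shift does not simply shift the detail coefficients
cA1, cD1 = pywt.dwt(S, 'db2', mode='periodization')
cA2, cD2 = pywt.dwt(np.roll(S, 1), 'db2', mode='periodization')
print(np.allclose(cD2, np.roll(cD1, 1)))  # generally False
```

Because the SWT omits downsampling, shifting the input simply shifts the coefficients; the DWT's decimation breaks this correspondence.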

**Redundancy**: The SWT has a redundancy factor of 2, which means that the number of coefficients in the transformed signal is doubled compared to the original signal. This redundancy can provide a more accurate representation of signals with localized features.

**Improved Frequency Localization**: The SWT has improved frequency localization compared to the DWT, especially for high-frequency components. This is because the SWT avoids aliasing caused by downsampling.

The choice of wavelet function is still important in the Undecimated Wavelet Transform. Different wavelets have different characteristics and are suitable for different types of signals or applications.

The Undecimated Wavelet Transform is commonly used in various applications, including signal and image denoising, feature extraction, time-frequency analysis, and biomedical signal processing.

It’s worth noting that the SWT can have higher computational complexity compared to the DWT due to the absence of downsampling. However, this drawback is often mitigated by the benefits it offers, especially in applications that require shift-invariance or precise localization of high-frequency components.

The Undecimated Wavelet Transform (SWT/UWT) has emerged as a valuable technique in the field of signal and image processing. By avoiding downsampling and maintaining the original signal length, the SWT provides shift-invariant analysis and improved frequency localization. These advantages make it a suitable choice for applications that require precise representation and analysis of signals, especially in cases where small shifts in the input signal need to be accurately captured. From denoising to feature extraction and time-frequency analysis, the SWT has found widespread applications and continues to be an area of active research. By harnessing the power of wavelet analysis, the SWT enables researchers and practitioners to delve into the frequency domain and extract meaningful information from complex signals and images, contributing to advancements in various domains, including biomedical signal processing, audio and video processing, and more.

*Originally published at https://medium.com/ on June 26, 2023.*


Pandas was initially built on NumPy, which has made Pandas the popular library. NumPy is an essential library for anyone working with numerical data in Python, providing a foundation for scientific computing, data analysis, and machine learning applications. However, it does have some limitations when it comes to handling large data due to memory limitations, limited support for out-of-core computing, limited support for parallel processing, etc.

Pandas 2.0.0 was officially released on April 2, 2023, which is a major release from the pandas 1 series. There are many changes and updates in Pandas 2.0, and one of the major updates is the new Apache Arrow backend for pandas data.

Pandas 2.0 integrates Arrow to improve the performance and scalability of data processing. Arrow is an open-source software library for in-memory data processing that provides a standardized way to represent and manipulate data across different programming languages and platforms.

By integrating Arrow, Pandas 2.0 is able to take advantage of the efficient memory handling and data serialization capabilities of Arrow. This allows Pandas to process and manipulate large datasets more quickly and efficiently than before.

Some of the key benefits of integrating Arrow with Pandas include:

- Improved performance: Arrow provides a more efficient way to handle and process data, resulting in faster computation times and reduced memory usage.
- Scalability: Arrow’s standardized data format allows for easy integration with other data processing tools, making it easier to scale data processing pipelines.
- Interoperability: Arrow provides a common data format that can be used across different programming languages and platforms, making it easier to share data between different systems.

Overall, integrating Arrow with Pandas 2.0 provides a significant performance boost and improves the scalability and interoperability of data processing workflows.

You should install Pandas 2.0 or its updated version if you still use a previous version. The release can be installed from PyPI:

`pip install --upgrade "pandas>=2.0.0"`

Or from conda-forge

`conda install -c conda-forge pandas>=2.0.0`

for mamba user:

`mamba install -c conda-forge pandas>=2.0.0`

Install the latest version of PyArrow from conda-forge using Conda:

`conda install -c conda-forge pyarrow`

Install the latest version from PyPI:

`pip install pyarrow`

for mamba user:

`mamba install -c conda-forge pyarrow`

First, let’s import required libraries. Besides pandas, we also need `time`

to calculate programming running time.

```
import pandas as pd
import time
```

We use the Pandas default NumPy backend to read **‘yellow_tripdata_2019-01.csv’**, a large dataset that can be downloaded from Kaggle.

```
start = time.time()
pd_df = pd.read_csv("./data/yellow_tripdata_2019-01.csv")
end = time.time()
pd_duration = end - start
print("Time to read data with pandas: {} seconds".format(round(pd_duration, 3)))
```

Time to read data with pandas: 13.185 seconds

`pd_df.head()`

Part of the result is as follows:

`pd_df.shape`

The result: (7667792, 18)

Thus, the dataset has 7667792 rows and 18 columns.

`pd_df.dtypes`

```
VendorID int64
tpep_pickup_datetime object
tpep_dropoff_datetime object
passenger_count int64
trip_distance float64
RatecodeID int64
store_and_fwd_flag object
PULocationID int64
DOLocationID int64
payment_type int64
fare_amount float64
extra float64
mta_tax float64
tip_amount float64
tolls_amount float64
improvement_surcharge float64
total_amount float64
congestion_surcharge float64
dtype: object
```

Next, we check how it is when we read the dataset with PyArrow Engine.

```
start = time.time()
arrow_eng_df = pd.read_csv("./data/yellow_tripdata_2019-01.csv", engine='pyarrow')
end = time.time()
arrow_eng_duration = end -start
print("Time to read data with pandas pyarrow engine: {} seconds".format(round(arrow_eng_duration, 3)))
```

Time to read data with pandas pyarrow engine: 2.591 seconds

It only takes 2.591 seconds, which is much faster than that of Pandas default NumPy backend.

`arrow_eng_df.dtypes`

```
VendorID int64
tpep_pickup_datetime datetime64[ns]
tpep_dropoff_datetime datetime64[ns]
passenger_count int64
trip_distance float64
RatecodeID int64
store_and_fwd_flag object
PULocationID int64
DOLocationID int64
payment_type int64
fare_amount float64
extra float64
mta_tax float64
tip_amount float64
tolls_amount float64
improvement_surcharge float64
total_amount float64
congestion_surcharge float64
dtype: object
```

If you compare the data types, there are also some differences; for example, **‘tpep_pickup_datetime’** and **‘tpep_dropoff_datetime’** are now parsed as **‘datetime64[ns]’**.

Next, let’s see how the situation is going when we read the dataset using PyArrow backend.

```
start = time.time()
df_pyarrow_type = pd.read_csv("./data/yellow_tripdata_2019-01.csv", dtype_backend="pyarrow",low_memory=False)
end = time.time()
df_pyarrow_type_duration = end -start
print("Time to read data with pandas pyarrow datatype backend: {} seconds".format(round(df_pyarrow_type_duration, 3)))
```

Time to read data with pandas pyarrow datatype backend: 24.407 seconds

However, this method takes much longer than the previous two methods.

`df_pyarrow_type.dtypes`

```
VendorID int64[pyarrow]
tpep_pickup_datetime string[pyarrow]
tpep_dropoff_datetime string[pyarrow]
passenger_count int64[pyarrow]
trip_distance double[pyarrow]
RatecodeID int64[pyarrow]
store_and_fwd_flag string[pyarrow]
PULocationID int64[pyarrow]
DOLocationID int64[pyarrow]
payment_type int64[pyarrow]
fare_amount double[pyarrow]
extra double[pyarrow]
mta_tax double[pyarrow]
tip_amount double[pyarrow]
tolls_amount double[pyarrow]
improvement_surcharge double[pyarrow]
total_amount double[pyarrow]
congestion_surcharge double[pyarrow]
dtype: object
```

Now we test the speed using both `engine='pyarrow'` and `dtype_backend="pyarrow"`.

```
start = time.time()
df_pyarrow = pd.read_csv("./data/yellow_tripdata_2019-01.csv", engine='pyarrow', dtype_backend="pyarrow")
end = time.time()
df_pyarrow_duration = end -start
print("Time to read data with pandas pyarrow engine and datatype: {} seconds".format(round(df_pyarrow_duration, 3)))
```

Time to read data with pandas pyarrow engine and datatype: 2.409 seconds

It only takes 2.409 seconds, which is very fast.

Copy-on-Write (CoW) is a technique first introduced in Pandas 1.5.0 to optimize memory usage when copying data. With CoW, when data is copied, the copy only contains pointers to the original data, rather than a full copy of the data itself. This can save memory and improve performance, as the original data can be shared between multiple copies until one of the copies is modified.

In Pandas 2.0, there have been several improvements to CoW that further optimize memory usage and performance, especially when working with smaller datasets or individual columns of data.
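The copy semantics can be illustrated with a tiny sketch (assuming pandas 2.0+; in pandas 3.0, Copy-on-Write is the default and the option is no longer needed):

```python
import pandas as pd

# Enable Copy-on-Write (the default behavior from pandas 3.0)
pd.options.mode.copy_on_write = True

df = pd.DataFrame({'a': [1, 2, 3]})
child = df['a']          # no data is copied yet; child shares df's buffer

df.loc[0, 'a'] = 99      # writing triggers the actual copy

print(child.tolist())    # [1, 2, 3] - the child Series is unaffected
print(df['a'].tolist())  # [99, 2, 3]
```

Until the write happens, both objects point at the same memory, which is why CoW saves memory for read-heavy workloads.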

```
pd.options.mode.copy_on_write = True
start = time.time()
df_copy = pd.read_csv("./data/yellow_tripdata_2019-01.csv")
end = time.time()
df_copy_duration = end -start
print("Time to read data with pandas copy on write: {} seconds".format(round(df_copy_duration, 3)))
```

Time to read data with pandas copy on write: 17.187 seconds

Pandas 2.0 can greatly increase the speed of data processing. Comparing the Pandas default NumPy backend, the PyArrow engine, the PyArrow backend, and the PyArrow engine and backend together, the combination of `PyArrow engine` and `PyArrow` backend is the fastest, and using the `PyArrow engine` alone also significantly speeds up the data handling process.

*Originally published at https://medium.com/ on June 26, 2023.*


Reshaping a DataFrame with pandas is a common task in data analysis and manipulation. It involves transforming the structure of the DataFrame to better suit the analysis or visualization needs. Fortunately, pandas provides several methods that allow us to reshape the data effortlessly. In this tutorial, we will explore different techniques to reshape a pandas DataFrame using functions like `transpose()` (or `T`), `pivot()`, `melt()`, `stack()`, `unstack()`, and combining `groupby()` and `agg()`.

Methods for Reshaping a Pandas DataFrame:

**‘transpose()’** or **‘T’**: This function interchanges the rows and columns of a DataFrame, effectively reshaping it.

**‘pivot()’**: This function converts unique values from one column into multiple columns, resulting in a wider DataFrame. It allows us to create a new column for each unique value, making it useful for summarizing and comparing data across categories.

**‘melt()’**: The `melt()` function is used to transform a DataFrame from wide format to long format. It gathers columns into rows, creating a “variable” column and a “value” column. This method is handy when we need to stack or unpivot multiple columns.

**‘stack()’** and **‘unstack()’**: These functions are used when dealing with multi-level column indexes. stack() pivots a level of column labels to the innermost level of row labels, creating a more compact DataFrame. On the other hand, unstack() performs the reverse operation, converting the innermost row labels to column labels.

Combining `groupby()` and `agg()`: This approach is powerful for reshaping a DataFrame based on groups. By grouping the DataFrame using specific columns and applying aggregation functions through `agg()`, we can summarize the data and reshape it into a new structure.

The `transpose()` method in pandas is used to interchange the rows and columns of a DataFrame, effectively reshaping it. This method allows you to rotate the DataFrame, making the columns become rows and vice versa. Here’s an example:

```
import pandas as pd
# Create a sample DataFrame
data = {
    'City': ['New York', 'London', 'Tokyo'],
    'Temperature': [30, 25, 35],
    'Precipitation': [100, 80, 120]
}
df = pd.DataFrame(data)
df
```

```
# Transpose the DataFrame
transposed_df = df.transpose()
transposed_df
```

You can use the shorthand `df.T` to get the same result.

```
transposed_df = df.T
transposed_df
```

In the above example, the original DataFrame has three columns: ‘City’, ‘Temperature’, and ‘Precipitation’. After applying the `transpose()` method, the columns become rows, resulting in a new DataFrame where the original column names are now the row index. Each row represents a column from the original DataFrame.

Transposing a DataFrame can be useful when you want to switch the orientation of your data or when you need to perform specific operations on rows instead of columns. However, it’s important to note that transposing a large DataFrame can significantly impact memory usage, so use it judiciously.
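One reason transposing can be costly: when the columns have mixed dtypes, every transposed column must hold a mix of types and falls back to the `object` dtype. A small sketch (sample data invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'City': ['New York', 'London'], 'Temperature': [30, 25]})
print(df.dtypes.tolist())    # e.g. a string/object column and an int64 column

# After transposing, each column mixes strings and ints, so dtypes become object
print(df.T.dtypes.tolist())  # all object
```

Object columns store Python objects rather than packed arrays, which is where the extra memory goes.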

Remember that the `transpose()` method returns a new DataFrame; if you want to replace the original DataFrame, you can assign the transposed DataFrame back to the original variable: `df = df.transpose()`.

Keep in mind that when you transpose a DataFrame, the original index values become column headers, and the original column headers become the new index.

The `pivot()` function allows you to reshape a DataFrame by converting unique values from one column into multiple columns. Here’s an example:

```
import pandas as pd
# Create a sample DataFrame
data = {
    'City': ['New York', 'London', 'Tokyo', 'Paris'],
    'Year': [2019, 2019, 2020, 2020],
    'Temperature': [30, 25, 35, 28]
}
df = pd.DataFrame(data)
df
```

```
# Reshape the DataFrame using pivot()
reshaped_df = df.pivot(index='Year', columns='City', values='Temperature')
reshaped_df
```

The `melt()` function is used to unpivot a DataFrame from wide format to long format. It gathers columns into rows, creating a “variable” column and a “value” column. Here’s an example:

```
import pandas as pd
# Create a sample DataFrame
data = {
    'Year': [2019, 2020],
    'New York': [30, None],
    'London': [25, None],
    'Tokyo': [None, 35],
    'Paris': [None, 28]
}
df = pd.DataFrame(data)
df
```

```
# Reshape the DataFrame using melt()
reshaped_df = df.melt(id_vars='Year', var_name='City', value_name='Temperature')
reshaped_df
```

The `stack()` function is used to pivot a level of column labels to the innermost level of row labels, while `unstack()` does the opposite. These functions are helpful when dealing with multi-level column indexes. Here’s an example:

```
import pandas as pd
# Create a sample DataFrame with a multi-level column index
data = {
    ('City', 'A'): [10, 20],
    ('City', 'B'): [30, 40],
    ('Temperature', 'A'): [25, 35],
    ('Temperature', 'B'): [15, 25]
}
df = pd.DataFrame(data)
df
```

```
# Reshape the DataFrame using stack()
stacked_df = df.stack()
stacked_df
```

```
# Reshape the DataFrame using unstack()
unstacked_df = stacked_df.unstack()
unstacked_df
```

Combining `groupby()` and `agg()` can be a powerful way to reshape a pandas DataFrame. Here’s an example:

```
import pandas as pd
# Create a sample DataFrame
data = {
    'City': ['New York', 'London', 'Tokyo', 'Paris', 'New York', 'London', 'Tokyo', 'Paris'],
    'Year': [2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020],
    'Temperature': [30, 25, 35, 28, 32, 27, 38, 30],
    'Precipitation': [100, 80, 120, 90, 110, 70, 130, 100]
}
df = pd.DataFrame(data)
df
```

```
# Reshape the DataFrame using groupby() and agg()
reshaped_df = df.groupby(['Year', 'City']).agg({'Temperature': 'mean', 'Precipitation': 'sum'}).reset_index()
reshaped_df
```

```
# Reshape the DataFrame using groupby() and agg() by year
reshaped_df = df.groupby(['Year']).agg({'Temperature': 'mean', 'Precipitation': 'sum'}).reset_index()
reshaped_df
```

```
# Reshape the DataFrame using groupby() and agg() by City
reshaped_df = df.groupby(['City']).agg({'Temperature': 'mean', 'Precipitation': 'sum'}).reset_index()
reshaped_df
```

In this example, we grouped the DataFrame by the ‘Year’ and ‘City’ columns and applied aggregation functions to reshape the data. We calculated the mean of the ‘Temperature’ column and the sum of the ‘Precipitation’ column for each group. The resulting DataFrame has a row for each unique combination of ‘Year’ and ‘City’ with the aggregated values.

You can customize the aggregation functions and columns according to your specific requirements.
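For instance, pandas also supports named aggregation, which lets you rename the output columns in the same call. A small sketch on invented sample data (the output names `avg_temp` and `total_precip` are arbitrary):

```python
import pandas as pd

data = {
    'City': ['New York', 'London', 'New York', 'London'],
    'Year': [2019, 2019, 2020, 2020],
    'Temperature': [30, 25, 32, 27],
    'Precipitation': [100, 80, 110, 70]
}
df = pd.DataFrame(data)

# Named aggregation: each keyword becomes a descriptive output column
summary = df.groupby('City').agg(
    avg_temp=('Temperature', 'mean'),
    total_precip=('Precipitation', 'sum'),
).reset_index()
print(summary)
```

This avoids the separate renaming step that a plain `agg({...})` dictionary would require.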

Reshaping a pandas DataFrame is an essential skill for data analysts and scientists. Understanding different methods such as `transpose()` (or `T`), `pivot()`, `melt()`, `stack()`, `unstack()`, and combining `groupby()` and `agg()` allows you to manipulate and transform your data into the desired format. By leveraging these techniques, you can effectively reshape your DataFrame to meet specific analysis, visualization, or modeling requirements. Experiment with these methods, adapt them to your use cases, and unleash the full potential of pandas for data reshaping.

*Originally published at https://medium.com/ on June 25, 2023.*


In this tutorial, we have explored how to use LSTM (Long Short-Term Memory) to predict time series using a real-world dataset, `apple_share_price.csv`. We utilized the Keras library in Python, which provides a convenient API for building and training deep learning models. By following the step-by-step instructions, we have successfully built a simple univariate LSTM model that can make predictions on time series data.

To begin, we imported the necessary libraries such as pandas, NumPy, and Keras. We then loaded and preprocessed the dataset, ensuring that the time series data was in the appropriate format. We split the dataset into training and testing sets and scaled the data using `MinMaxScaler` to ensure the LSTM model could work effectively.

Next, we prepared the training data by creating input sequences and target values using a sliding window approach. This allowed the LSTM model to learn patterns and make predictions based on historical data. We then proceeded to build the LSTM model, which consisted of an LSTM layer and a dense output layer. The model was compiled using an appropriate optimizer and loss function.

After building the model, we trained it using the prepared training data and evaluated its performance on the testing set. We calculated the root mean squared error (RMSE) to assess the accuracy of the predictions compared to the actual values.

Finally, we demonstrated how to use the trained LSTM model to make one-step-ahead predictions on future unseen data. By scaling the input data and utilizing the trained model, we obtained predicted values for the future time steps.

In this tutorial, we will use the Keras library in Python, which provides a convenient API for building and training deep learning models.

First, let’s import the necessary libraries:

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense
```

Next, we need to load and preprocess the time series dataset. For this tutorial, let’s use a CSV file named “apple_share_price.csv” containing the historical time series data of Apple share prices. We read it directly from GitHub.

```
url = 'https://raw.githubusercontent.com/NourozR/Stock-Price-Prediction-LSTM/master/apple_share_price.csv'
df = pd.read_csv(url,usecols=[0,1,2,3,4])
df
```

We have discussed how to reverse the order of the rows with 5 different methods in this previous article. You can reverse the rows in chronological order.

```
df = df.reindex(index = df.index[::-1]).reset_index(drop=True)
df
```

The dataset contains 1664 rows and 5 columns. In this example, we only use the column of **‘Open’**.

```
df= df[['Open']]
df.head()
```

We use Matplotlib to visualize the dataset.

`plt.plot(df)`

Before training the LSTM model, we need to split the dataset into a training set and a testing set. Typically, we use a majority of the data for training and a smaller portion for testing. In this tutorial, let’s use 80% of the data for training and 20% for testing:

```
train_size = int(len(df) * 0.8)
train_data = df.iloc[:train_size]
test_data = df.iloc[train_size:]
```

LSTMs are sensitive to the scale of input data, so it’s important to scale the data before training the model. We can use the `MinMaxScaler` from scikit-learn to scale the values between 0 and 1:

```
scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train_data)
test_scaled = scaler.transform(test_data)
```

To train an LSTM model, we need to prepare the input sequences and target values. We will create a sliding window of input sequences and their corresponding target values.

```
def create_sequences(data, input_size):
    X, y = [], []
    for i in range(len(data) - input_size):
        X.append(data[i:i + input_size])
        y.append(data[i + input_size])
    return np.array(X), np.array(y)

input_size = 5
X_train, y_train = create_sequences(train_scaled, input_size)
X_test, y_test = create_sequences(test_scaled, input_size)
```

You can check the shape of training and testing dataset as follows:

```
print(X_train.shape)
print(X_test.shape)
```

Now, let’s build the LSTM model using Keras. We will create a sequential model with an LSTM layer and a dense output layer:

```
model = Sequential()
model.add(LSTM(100, activation='relu', input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
```

We can now train the LSTM model using the prepared training data:

`history = model.fit(X_train, y_train, epochs=50, batch_size=50, validation_split=0.2)`

```
# plot training history
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.legend()
plt.show()
```

We can easily save the trained model, and load it back for future use.

`model.save('./model/simple_lstm_stock_model')`

You can load it as follows when you use it.

```
# loading model
new_model = load_model('./model/simple_lstm_stock_model')
```

After training the model, we can evaluate its performance on the testing set:

```
train_predicted = model.predict(X_train)
test_predicted = model.predict(X_test)
train_predicted = scaler.inverse_transform(train_predicted)
test_predicted = scaler.inverse_transform(test_predicted)
train_actual = scaler.inverse_transform(y_train)
test_actual = scaler.inverse_transform(y_test)
train_rmse = np.sqrt(np.mean((train_actual - train_predicted) ** 2))
test_rmse = np.sqrt(np.mean((test_actual - test_predicted) ** 2))
print("Train RMSE:", train_rmse)
print("Test RMSE:", test_rmse)
```

We can visualize the training and testing result as follows. First, we create x-axis ranges for train data and test data.

```
x1 = np.arange(0, len(train_actual))
x2 = np.arange(len(train_actual), len(train_actual)+len(test_actual))
```

Then you can visualize the actual train data and train fitting results.

```
plt.plot(x1,train_actual)
plt.plot(x1,train_predicted)
plt.legend(['train_actual','train_predicted'])
```

Similarly, you can visualize the model testing result and actual testing data.

```
plt.plot(x2,test_actual)
plt.plot(x2,test_predicted)
plt.legend(['test_actual','test_predicted'])
```

Or you just visualize the training and testing results together. In this example, we use a dashed red line to separate training and testing results. In the previous article, it displays the different methods to plot vertical or horizontal lines.

```
plt.plot(x1,train_actual)
plt.plot(x1,train_predicted)
plt.plot(x2,test_actual)
plt.plot(x2,test_predicted)
plt.legend(['train_actual','train_predicted','test_actual','test_predicted'])
plt.vlines(x=int(len(train_actual)), color='r',linestyles='dashed', ymin = 0, ymax = max(test_actual))
plt.show()
```

Finally, we can use the trained LSTM model to make predictions on future unseen data:

```
future_data = df.iloc[-input_size:]
future_scaled = scaler.transform(future_data)
future_sequence = np.array([future_scaled])
future_predicted = model.predict(future_sequence)
future_predicted = scaler.inverse_transform(future_predicted)
print("Future predicted values:", future_predicted)
```

In this tutorial, we have learned how to apply LSTM for time series prediction using a real-world dataset. LSTMs are powerful tools for capturing and learning complex temporal patterns, making them suitable for forecasting future values based on historical data. By following the step-by-step instructions, you have gained hands-on experience in building, training, and evaluating an LSTM model using Keras.

Remember that the performance of the LSTM model can be influenced by various factors such as the size of the sliding window, the number of LSTM units, and the number of training epochs. It’s essential to experiment with different configurations and hyperparameters to optimize the model’s accuracy and generalization.

By mastering LSTM for time series prediction, you can unlock numerous applications in finance, weather forecasting, stock market analysis, and many other domains where understanding and predicting sequential data is crucial. With further exploration and practice, you can continue to refine your skills in building sophisticated deep learning models for time series forecasting.

*Originally published at https://medium.com/ on June 24, 2023.*


Matplotlib is a powerful Python library for creating visualizations, including line plots. Drawing horizontal and vertical lines on a plot is a common task when analyzing data or highlighting specific values or regions of interest. In this tutorial, we will explore three different methods to achieve this using Matplotlib: the **‘axhline’** and **‘axvline’** functions, the **‘plot’** function, and the **‘hlines’** and **‘vlines’** functions.

By the end of this tutorial, you will have a clear understanding of how to plot horizontal and vertical lines using Matplotlib, enabling you to enhance your data visualizations and convey important insights effectively.

The **‘axhline’** and **‘axvline’** functions are the simplest methods to draw horizontal and vertical lines, respectively. Here’s an example:

```
import matplotlib.pyplot as plt
# Create a figure and axis
fig, ax = plt.subplots()
# Plotting horizontal line
ax.axhline(y=0.5, color='r', linestyle='--', linewidth=2)
# Plotting vertical line
ax.axvline(x=0.7, color='b', linestyle=':', linewidth=2)
# Display the plot
plt.show()
```

In this example, we create a figure and an axis using **‘plt.subplots()’**. Then, we use the **‘axhline’** function to draw a horizontal line at **‘y=0.5’**, specifying the color, linestyle, and linewidth. Similarly, we use the **‘axvline’** function to draw a vertical line at **‘x=0.7’**. Finally, we call **‘plt.show()’** to display the plot.

The **‘plot’** function can also be used to draw horizontal and vertical lines. Here’s an example:

```
import matplotlib.pyplot as plt
import numpy as np
# Create x-axis values
x = np.linspace(0, 1, 100)
# Create y-axis values for horizontal line
y_hline = np.full_like(x, 0.5)
# Create a figure and axis
fig, ax = plt.subplots()
# Plotting horizontal line
ax.plot(x, y_hline, color='r', linestyle='--', linewidth=2)
# Plotting vertical line
ax.plot([0.7, 0.7], [0, 1], color='b', linestyle=':', linewidth=2)
# Display the plot
plt.show()
```

In this example, we create an array **‘x’** using **‘np.linspace’** to generate evenly spaced values between 0 and 1. Then, we create a corresponding array **‘y_hline’** filled with **‘0.5’** values for the horizontal line. For the vertical line, we manually specify the x-axis values as **‘[0.7, 0.7]’** and the y-axis values as **‘[0, 1]’**.

Next, we create a figure and an axis using **‘plt.subplots()’**. We use the **‘plot’** function to draw the horizontal line by plotting **‘x’** against **‘y_hline’** and specifying the color, linestyle, and linewidth. Similarly, we use the **‘plot’** function to draw the vertical line by plotting **‘[0.7, 0.7]’** against **‘[0, 1]’**. Finally, we call **‘plt.show()’** to display the plot.
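A common use of the **‘plot’**-based approach is drawing a reference line derived from your data, such as its mean. As a quick sketch (the sample values here are made up for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample data
y = np.array([0.2, 0.8, 0.4, 0.6, 0.5])
x = np.arange(len(y))

fig, ax = plt.subplots()
ax.plot(x, y, marker='o')
# Draw a horizontal reference line at the mean of the data
mean_y = y.mean()
ax.plot([x.min(), x.max()], [mean_y, mean_y], color='r', linestyle='--')
plt.show()
```

Because **‘plot’** takes arbitrary x and y arrays, the line's position can be computed from the data rather than hard-coded.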

The **‘hlines’** and **‘vlines’** functions can be used to draw horizontal and vertical lines across the entire plot, respectively. Here’s an example:

```
import matplotlib.pyplot as plt
# Create a figure and axis
fig, ax = plt.subplots()
# Plotting horizontal line
ax.hlines(0.5, 0, 1, colors='r', linestyles='--', linewidths=2)
# Plotting vertical line
ax.vlines(0.7, 0, 1, colors='b', linestyles=':', linewidths=2)
# Display the plot
plt.show()
```

In this example, we create a figure and an axis using **‘plt.subplots()’**. Then, we use the **‘hlines’** function to draw a horizontal line at **‘y=0.5’**, specifying the start and end x-coordinates, colors, linestyles, and linewidths. Similarly, we use the **‘vlines’** function to draw a vertical line at **‘x=0.7’**, specifying the start and end y-coordinates, colors, linestyles, and linewidths. Finally, we call **‘plt.show()’** to display the plot.
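One advantage of **‘hlines’** and **‘vlines’** is that they accept a sequence of positions, drawing several lines in a single call and returning one **‘LineCollection’**. A minimal sketch (the positions and colors are illustrative):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# Three horizontal lines in one call, each with its own color
hl = ax.hlines([0.25, 0.5, 0.75], 0, 1, colors=['r', 'g', 'b'], linestyles='--')
# Two vertical lines sharing one style
vl = ax.vlines([0.3, 0.7], 0, 1, colors='k', linestyles=':')
plt.show()
```

This is more efficient and less repetitive than calling **‘axhline’** or **‘axvline’** in a loop when you need many lines.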

These are three different methods you can use to draw horizontal and vertical lines using Matplotlib. Feel free to choose the one that suits your needs best.

In this tutorial, we have covered three different methods to plot horizontal and vertical lines using Matplotlib. The **‘axhline’** and **‘axvline’** functions provide a straightforward way to draw lines at specific positions, while the **‘plot’** function allows for more flexibility by plotting arrays of x and y values. The **‘hlines’** and **‘vlines’** functions are useful when you want to draw lines across the entire plot.

By mastering these techniques, you can create visually appealing and informative plots that effectively highlight important information or patterns in your data. Remember to experiment with different colors, linestyles, and linewidths to customize the appearance of your lines. Matplotlib’s versatility and ease of use make it a valuable tool for data visualization and exploration.
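Beyond colors and linestyles, reference lines can be given a **‘label’** so they appear in a legend, which makes their meaning explicit to the reader. A short sketch (the label names are illustrative):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# Labeled reference lines show up in the legend
ax.axhline(y=0.5, color='r', linestyle='--', label='threshold')
ax.axvline(x=0.7, color='b', linestyle=':', label='event')
ax.legend()
plt.show()
```

Labeling is especially useful when a plot mixes data series and several reference lines.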

*Originally published at https://medium.com/ on June 24, 2023.*
